UTF-8 Parsing

The Parseff.Utf8 module provides primitives that operate on Unicode code points (Uchar.t) instead of individual bytes (char). The input is still an OCaml string, but its contents are decoded as UTF-8 sequences.

Primitives operate at the code point level. A base character followed by a combining accent is two separate code points. Parseff.Utf8.satisfy returns them individually. For grapheme-level parsing, compose the code-point primitives with a grapheme segmentation library like uuseg.
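To make the code point versus grapheme distinction concrete, here is a minimal sketch that uses only the OCaml standard library decoder (String.get_utf_8_uchar, assumed to be available, i.e. OCaml 4.14 or later). The helper code_points is hypothetical and not part of Parseff; it only shows that a visually single "é" can be one or two code points depending on how it was written.

```ocaml
(* Decode a UTF-8 string into its list of code points.
   Hypothetical helper built on the standard library, not part of Parseff.
   Invalid bytes decode to U+FFFD here. *)
let code_points (s : string) : Uchar.t list =
  let rec go pos acc =
    if pos >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s pos in
      go (pos + Uchar.utf_decode_length d) (Uchar.utf_decode d :: acc)
  in
  go 0 []

(* "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT *)
let decomposed = "e\u{0301}"

(* "é" written as the single precomposed code point U+00E9 *)
let precomposed = "\u{00E9}"

let () =
  Printf.printf "decomposed: %d code points, precomposed: %d\n"
    (List.length (code_points decomposed))
    (List.length (code_points precomposed))
```

A code-point primitive sees the decomposed form as two separate items, which is why grapheme-aware parsing needs a segmentation step on top.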

Mixing with byte-level primitives

UTF-8 and byte-level primitives can be freely mixed in the same parser. Use byte-level Parseff.consume or Parseff.satisfy for ASCII structural tokens, and Parseff.Utf8 for multilingual text content:

let field () =
  let key = Parseff.take_while ~at_least:1
    (fun c -> c >= 'a' && c <= 'z')
    ~label:"key"
  in
  let _ = Parseff.consume ":" in
  Parseff.Utf8.skip_whitespace ();
  let value = Parseff.Utf8.take_while ~at_least:1
    ~label:"value"
    (fun u -> Uchar.to_int u <> 0x0A)
  in
  (key, value)
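One reason this kind of mixing is safe for ASCII tokens is a property of UTF-8 itself: every byte of a multi-byte sequence has its high bit set, so a byte such as ':' can never appear inside the encoding of a non-ASCII code point. The following sketch checks that property for a few sample code points using only the standard library; the helper names are hypothetical.

```ocaml
(* Encode one code point as UTF-8 with the standard library Buffer. *)
let encode (u : Uchar.t) : string =
  let b = Buffer.create 4 in
  Buffer.add_utf_8_uchar b u;
  Buffer.contents b

(* True when every byte of the encoding is >= 0x80, i.e. no byte of the
   sequence can be mistaken for an ASCII token such as ':'. *)
let no_ascii_bytes (u : Uchar.t) : bool =
  String.for_all (fun c -> Char.code c >= 0x80) (encode u)

let () =
  (* Sample code points: é, λ, →, 中, and an emoji outside the BMP *)
  List.iter
    (fun cp -> assert (no_ascii_bytes (Uchar.of_int cp)))
    [ 0x00E9; 0x03BB; 0x2192; 0x4E2D; 0x1F600 ]
```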

satisfy

Parseff.Utf8.satisfy decodes the next UTF-8 code point and tests it against a predicate. On success it advances by 1 to 4 bytes, depending on the encoded length of the code point.

val satisfy : (Uchar.t -> bool) -> label:string -> Uchar.t

(* Match any CJK Unified Ideograph *)
let cjk_char () =
  Parseff.Utf8.satisfy
    (fun u ->
      let i = Uchar.to_int u in
      i >= 0x4E00 && i <= 0x9FFF)
    ~label:"CJK character"

char

Parseff.Utf8.char matches an exact Unicode code point.

val char : Uchar.t -> Uchar.t

let lambda () = Parseff.Utf8.char (Uchar.of_int 0x03BB)  (* λ *)
let arrow () = Parseff.Utf8.char (Uchar.of_int 0x2192)   (* → *)

take_while

Parseff.Utf8.take_while consumes code points while the predicate holds and returns the matched UTF-8 bytes as a string. The optional ~at_least parameter counts code points, not bytes: users think in characters when using Unicode primitives, so the count matches that mental model.

val take_while : ?at_least:int -> ?label:string -> (Uchar.t -> bool) -> string

(* Parse a Unicode word *)
let word () =
  Parseff.Utf8.take_while
    Uucp.Alpha.is_alphabetic
    ~at_least:1
    ~label:"letter"

(* Parses "hello", "café", "東京", "Москва", etc. *)
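The difference between counting code points and counting bytes can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not Parseff's implementation: take_while_model is a hypothetical function over a plain string, it uses the standard library decoder (OCaml 4.14 or later), and it simply stops at invalid UTF-8 instead of reporting a parse error.

```ocaml
(* Stand-alone model of a code-point take_while; NOT Parseff's implementation.
   Returns the matched bytes, or None when fewer than [at_least] code points
   satisfy the predicate. *)
let take_while_model ?(at_least = 0) (pred : Uchar.t -> bool) (s : string)
    : string option =
  let rec go pos count =
    if pos >= String.length s then (pos, count)
    else
      let d = String.get_utf_8_uchar s pos in
      if Uchar.utf_decode_is_valid d && pred (Uchar.utf_decode d) then
        go (pos + Uchar.utf_decode_length d) (count + 1)
      else (pos, count)
  in
  let stop, count = go 0 0 in
  if count >= at_least then Some (String.sub s 0 stop) else None

(* Same range as the CJK example above *)
let is_cjk (u : Uchar.t) : bool =
  let i = Uchar.to_int u in
  i >= 0x4E00 && i <= 0x9FFF

let () =
  (* "東京" is 2 code points but 6 bytes, so at_least:2 succeeds
     and at_least:3 fails even though more than 3 bytes matched. *)
  assert (take_while_model ~at_least:2 is_cjk "東京tower" = Some "東京");
  assert (take_while_model ~at_least:3 is_cjk "東京tower" = None)
```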

skip_while

Parseff.Utf8.skip_while advances past code points without building a string. More efficient than Parseff.Utf8.take_while when you don't need the result.

val skip_while : (Uchar.t -> bool) -> unit

take_while_span

Parseff.Utf8.take_while_span returns a zero-copy Parseff.span instead of allocating a new string. Use Parseff.span_to_string to materialize when needed.

val take_while_span : (Uchar.t -> bool) -> span

skip_while_then_char

Parseff.Utf8.skip_while_then_char skips code points matching a predicate, then matches a specific terminating code point. More efficient than calling Parseff.Utf8.skip_while followed by Parseff.Utf8.char separately.

val skip_while_then_char : (Uchar.t -> bool) -> Uchar.t -> unit

Convenience combinators

These are built on top of the structural primitives using Unicode character properties from the uucp library.

letter

Parseff.Utf8.letter matches any Unicode alphabetic character using Uucp.Alpha.is_alphabetic. This covers Latin, Greek, Cyrillic, CJK, Arabic, Devanagari, and all other Unicode scripts.

let l = Parseff.Utf8.letter ()
(* Matches: 'a', 'é', 'λ', '中', 'д', 'ع', 'अ', ... *)

digit

Parseff.Utf8.digit matches only the ASCII digits 0 to 9 and returns an int. The Unicode decimal digit category (Nd) includes Arabic-Indic, Devanagari, and other numeral systems, where mapping to int is non-trivial. Keeping the combinator ASCII-only makes the return value unambiguous. For Unicode digit handling, use Parseff.Utf8.satisfy with a custom predicate.
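As an illustration of what such a custom predicate might look like, the sketch below maps digits from three scripts to their numeric value using only the standard library. The names digit_value and is_known_digit are hypothetical, and only three ranges are covered; the full Nd category contains many more.

```ocaml
(* Map a decimal digit from a few scripts to its numeric value.
   Only three ranges are handled; this is an illustration, not full Nd. *)
let digit_value (u : Uchar.t) : int option =
  let i = Uchar.to_int u in
  if i >= 0x0030 && i <= 0x0039 then Some (i - 0x0030)       (* ASCII *)
  else if i >= 0x0660 && i <= 0x0669 then Some (i - 0x0660)  (* Arabic-Indic *)
  else if i >= 0x0966 && i <= 0x096F then Some (i - 0x0966)  (* Devanagari *)
  else None

(* Predicate form, with the shape Uchar.t -> bool *)
let is_known_digit (u : Uchar.t) : bool = Option.is_some (digit_value u)

let () =
  assert (digit_value (Uchar.of_int 0x0037) = Some 7);  (* '7' *)
  assert (digit_value (Uchar.of_int 0x0663) = Some 3);  (* ARABIC-INDIC DIGIT THREE *)
  assert (digit_value (Uchar.of_int 0x096F) = Some 9);  (* DEVANAGARI DIGIT NINE *)
  assert (not (is_known_digit (Uchar.of_char 'x')))
```

Given the satisfy signature shown earlier, is_known_digit has the right shape to be passed as its predicate, and the returned Uchar.t could then be converted with digit_value.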

alphanum

Parseff.Utf8.alphanum matches a Unicode alphabetic character or an ASCII digit. Combines Uucp.Alpha.is_alphabetic with the ASCII digit range.

whitespace and skip_whitespace

Parseff.Utf8.whitespace and Parseff.Utf8.skip_whitespace use the full Unicode White_Space property (Uucp.White.is_white_space). This includes ASCII whitespace plus:

  • NO-BREAK SPACE (U+00A0)
  • EN SPACE (U+2002), EM SPACE (U+2003)
  • IDEOGRAPHIC SPACE (U+3000)
  • and others

The ~at_least parameter on Parseff.Utf8.whitespace counts code points, not bytes.

(* Skip any Unicode whitespace before a value *)
let word_after_whitespace () =
  Parseff.Utf8.skip_whitespace ();
  Parseff.Utf8.take_while ~at_least:1
    Uucp.Alpha.is_alphabetic
    ~label:"word"

is_whitespace

Parseff.Utf8.is_whitespace exposes the Unicode whitespace predicate for use with Parseff.Utf8.take_while or Parseff.Utf8.skip_while directly.

Invalid UTF-8

All UTF-8 primitives raise a parse error when they encounter an invalid byte sequence. This includes:

  • Bare continuation bytes (0x80 to 0xBF)
  • Invalid lead bytes (such as 0xFE, 0xFF)
  • Overlong encodings
  • Truncated multi-byte sequences

The error message is "invalid UTF-8" and the position points to the first invalid byte.
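The categories above can be explored with the standard library decoder, which also reports each of them as invalid. The sketch below is only an approximation of the behavior described here: first_invalid is a hypothetical helper, it relies on String.get_utf_8_uchar (OCaml 4.14 or later), and the standard library may differ from Parseff in corner cases.

```ocaml
(* Return the byte offset where the first invalid UTF-8 sequence starts,
   or None when the whole string is valid. Hypothetical helper. *)
let first_invalid (s : string) : int option =
  let rec go pos =
    if pos >= String.length s then None
    else
      let d = String.get_utf_8_uchar s pos in
      if Uchar.utf_decode_is_valid d then go (pos + Uchar.utf_decode_length d)
      else Some pos
  in
  go 0

let () =
  assert (first_invalid "caf\xC3\xA9" = None);   (* valid "café" *)
  assert (first_invalid "ab\x80" = Some 2);      (* bare continuation byte *)
  assert (first_invalid "a\xFF" = Some 1);       (* invalid lead byte *)
  assert (first_invalid "\xC0\x80" = Some 0);    (* overlong encoding of U+0000 *)
  assert (first_invalid "ok\xE6\x9D" = Some 2)   (* truncated 3-byte sequence *)
```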

Position tracking

Positions remain byte offsets, consistent with the rest of Parseff. A single Parseff.Utf8.satisfy call advances the position by 1 to 4 bytes, depending on the UTF-8 encoding of the matched code point. Parseff.position always returns a byte offset.
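To see how byte offsets relate to code points, the following sketch lists the starting byte offset of each code point in a short string. It uses only the standard library (OCaml 4.14 or later); offsets is a hypothetical helper and does not call Parseff.position.

```ocaml
(* Pair each code point with the byte offset at which it starts. *)
let offsets (s : string) : (int * int) list =
  let rec go pos acc =
    if pos >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s pos in
      go
        (pos + Uchar.utf_decode_length d)
        ((pos, Uchar.to_int (Uchar.utf_decode d)) :: acc)
  in
  go 0 []

let () =
  (* 'a' takes 1 byte, 'λ' takes 2 bytes, '→' takes 3 bytes *)
  List.iter
    (fun (off, cp) -> Printf.printf "byte %d: U+%04X\n" off cp)
    (offsets "aλ→")
```

The third code point starts at byte 3, not at index 2, which is the kind of offset a byte-based position reports.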

Streaming support

All UTF-8 primitives work with streaming input (Parseff.parse_source and Parseff.parse_source_until_end). Multi-byte UTF-8 sequences that span chunk boundaries are handled correctly: the streaming runtime ensures enough bytes are available before decoding each code point.
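The idea of checking for enough bytes before decoding can be sketched from the UTF-8 lead byte alone. This is an illustrative model, not the Parseff streaming runtime: expected_length and bytes_missing are hypothetical names, and the chunk is represented as a plain string.

```ocaml
(* Total length of a UTF-8 sequence as announced by its lead byte;
   0 for bytes that cannot start a well-formed sequence. *)
let expected_length (lead : char) : int =
  match lead with
  | '\x00' .. '\x7F' -> 1
  | '\xC2' .. '\xDF' -> 2
  | '\xE0' .. '\xEF' -> 3
  | '\xF0' .. '\xF4' -> 4
  | _ -> 0

(* Number of further bytes needed before the code point starting at [pos]
   can be decoded. Requires pos < String.length chunk. *)
let bytes_missing (chunk : string) (pos : int) : int =
  let available = String.length chunk - pos in
  max 0 (expected_length chunk.[pos] - available)

let () =
  let whole = "\u{6771}" in                   (* '東', 3 bytes *)
  let first_chunk = String.sub whole 0 2 in   (* chunk boundary after 2 bytes *)
  assert (bytes_missing first_chunk 0 = 1);   (* one more byte must arrive *)
  assert (bytes_missing whole 0 = 0)          (* complete, safe to decode *)
```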