UTF-8 Parsing
The Parseff.Utf8 module provides primitives that operate on Unicode code points (Uchar.t) instead of byte sequences (string). The input is still an OCaml string, but characters are decoded as UTF-8 sequences.
Primitives operate at the code point level. A base character followed by a combining accent is two separate code points. Parseff.Utf8.satisfy returns them individually. For grapheme-level parsing, compose the code-point primitives with a grapheme segmentation library like uuseg.
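To make the code point versus grapheme distinction concrete, here is a minimal sketch that uses only the OCaml standard library (String.get_utf_8_uchar, available from OCaml 4.14) rather than Parseff itself. The helper name code_points is hypothetical. It shows that a decomposed "é" decodes to two code points, which is what a code-point-level primitive would see one at a time.

```ocaml
(* Decode every code point of a UTF-8 string with the standard library.
   [code_points] is a hypothetical helper, not part of Parseff. *)
let code_points (s : string) : Uchar.t list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      (* utf_decode_length is always >= 1, so the loop terminates
         even on malformed input. *)
      go (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  go 0 []

(* 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points. *)
let decomposed = "e\u{0301}"

(* U+00E9 LATIN SMALL LETTER E WITH ACUTE: one code point. *)
let precomposed = "\u{00E9}"

let () =
  List.iter
    (fun u -> Printf.printf "U+%04X\n" (Uchar.to_int u))
    (code_points decomposed)
```

Both strings render as the same grapheme, so a parser that needs to treat them alike has to normalize or segment on top of the code-point primitives.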
Mixing with byte-level primitives
UTF-8 and byte-level primitives can be freely mixed in the same parser. Use byte-level Parseff.consume or Parseff.satisfy for ASCII structural tokens, and Parseff.Utf8 for multilingual text content:
let field () =
  let key = Parseff.take_while ~at_least:1
    (fun c -> c >= 'a' && c <= 'z')
    ~label:"key"
  in
  let _ = Parseff.consume ":" in
  Parseff.Utf8.skip_whitespace ();
  let value = Parseff.Utf8.take_while ~at_least:1
    ~label:"value"
    (fun u -> Uchar.to_int u <> 0x0A)
  in
  (key, value)
satisfy
Parseff.Utf8.satisfy decodes the next UTF-8 code point and tests it against a predicate. Advances by 1--4 bytes depending on the encoding.
val satisfy : (Uchar.t -> bool) -> label:string -> Uchar.t
(* Match any CJK Unified Ideograph *)
let cjk_char () =
  Parseff.Utf8.satisfy
    (fun u ->
      let i = Uchar.to_int u in
      i >= 0x4E00 && i <= 0x9FFF)
    ~label:"CJK character"
char
Parseff.Utf8.char matches an exact Unicode code point.
val char : Uchar.t -> Uchar.t
let lambda () = Parseff.Utf8.char (Uchar.of_int 0x03BB) (* λ *)
let arrow () = Parseff.Utf8.char (Uchar.of_int 0x2192) (* → *)
take_while
Parseff.Utf8.take_while consumes code points while the predicate holds and returns the matched UTF-8 bytes as a string. The optional ~at_least parameter counts code points, not bytes: users think in characters when using Unicode primitives, so the count matches that mental model.
val take_while : ?at_least:int -> ?label:string -> (Uchar.t -> bool) -> string
(* Parse a Unicode word *)
let word () =
  Parseff.Utf8.take_while
    Uucp.Alpha.is_alphabetic
    ~at_least:1
    ~label:"letter"
(* Parses "hello", "café", "東京", "Москва", etc. *)
skip_while
Parseff.Utf8.skip_while advances past code points without building a string. More efficient than Parseff.Utf8.take_while when you don't need the result.
val skip_while : (Uchar.t -> bool) -> unit
take_while_span
Parseff.Utf8.take_while_span returns a zero-copy Parseff.span instead of allocating a new string. Use Parseff.span_to_string to materialize when needed.
val take_while_span : (Uchar.t -> bool) -> span
skip_while_then_char
Parseff.Utf8.skip_while_then_char skips code points matching a predicate, then matches a specific terminating code point. More efficient than calling Parseff.Utf8.skip_while followed by Parseff.Utf8.char separately.
val skip_while_then_char : (Uchar.t -> bool) -> Uchar.t -> unit
Convenience combinators
These are built on top of the structural primitives using Unicode character properties from the uucp library.
letter
Parseff.Utf8.letter matches any Unicode alphabetic character using Uucp.Alpha.is_alphabetic. This covers Latin, Greek, Cyrillic, CJK, Arabic, Devanagari, and all other Unicode scripts.
let l = Parseff.Utf8.letter ()
(* Matches: 'a', 'é', 'λ', '中', 'д', 'ع', 'अ', ... *)
digit
Parseff.Utf8.digit matches ASCII digits 0--9 only and returns an int. Unicode digit categories (Nd) include Arabic-Indic, Devanagari, and other numeral systems where mapping to int is non-trivial. Keeping it ASCII-only makes the return value unambiguous. For Unicode digit handling, use Parseff.Utf8.satisfy with a custom predicate.
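As an illustration of the custom-predicate route, the sketch below maps a few decimal-digit blocks to integer values using only the standard library. The names digit_zeros, digit_value, and is_known_digit are hypothetical, and the block list is deliberately incomplete; it is a starting point under those assumptions, not Parseff's behaviour.

```ocaml
(* Code point of the digit ZERO in a few decimal-digit blocks. Each block
   holds ten consecutive code points for the digits 0 through 9.
   Illustrative only: many other Nd blocks exist. *)
let digit_zeros = [
  0x0030; (* ASCII *)
  0x0660; (* Arabic-Indic *)
  0x06F0; (* Extended Arabic-Indic *)
  0x0966; (* Devanagari *)
]

(* Numeric value of [u] if it falls in one of the blocks above. *)
let digit_value (u : Uchar.t) : int option =
  let i = Uchar.to_int u in
  List.find_map
    (fun zero -> if i >= zero && i <= zero + 9 then Some (i - zero) else None)
    digit_zeros

(* Predicate with the (Uchar.t -> bool) shape that satisfy expects. *)
let is_known_digit (u : Uchar.t) : bool = Option.is_some (digit_value u)
```

A parser could pass is_known_digit to Parseff.Utf8.satisfy and then apply digit_value to the returned code point to recover the integer.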
alphanum
Parseff.Utf8.alphanum matches a Unicode alphabetic character or an ASCII digit. Combines Uucp.Alpha.is_alphabetic with the ASCII digit range.
whitespace and skip_whitespace
Parseff.Utf8.whitespace and Parseff.Utf8.skip_whitespace use the full Unicode White_Space property (Uucp.White.is_white_space). This includes ASCII whitespace plus:
- NO-BREAK SPACE (U+00A0)
- EN SPACE (U+2002), EM SPACE (U+2003)
- IDEOGRAPHIC SPACE (U+3000)
- and others
The ~at_least parameter on Parseff.Utf8.whitespace counts code points, not bytes.
(* Skip any Unicode whitespace before a value *)
Parseff.Utf8.skip_whitespace ();
let value = Parseff.Utf8.take_while ~at_least:1
  Uucp.Alpha.is_alphabetic
  ~label:"word"
is_whitespace
Parseff.Utf8.is_whitespace exposes the Unicode whitespace predicate for use with Parseff.Utf8.take_while or Parseff.Utf8.skip_while directly.
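For readers without uucp at hand, the following hand-written predicate sketches which code points the Unicode White_Space property covers. The name is_white_space_sketch is hypothetical, and the list reflects the property as it stands in recent Unicode versions to the best of my knowledge; in real code, prefer the library predicate, which tracks the Unicode data files.

```ocaml
(* Hand-rolled approximation of the Unicode White_Space property.
   Hypothetical helper for illustration; not part of Parseff or uucp. *)
let is_white_space_sketch (u : Uchar.t) : bool =
  match Uchar.to_int u with
  | 0x0009 | 0x000A | 0x000B | 0x000C | 0x000D -> true (* TAB, LF, VT, FF, CR *)
  | 0x0020 -> true (* SPACE *)
  | 0x0085 -> true (* NEXT LINE *)
  | 0x00A0 -> true (* NO-BREAK SPACE *)
  | 0x1680 -> true (* OGHAM SPACE MARK *)
  | 0x2028 | 0x2029 -> true (* LINE and PARAGRAPH SEPARATOR *)
  | 0x202F -> true (* NARROW NO-BREAK SPACE *)
  | 0x205F -> true (* MEDIUM MATHEMATICAL SPACE *)
  | 0x3000 -> true (* IDEOGRAPHIC SPACE *)
  | i -> i >= 0x2000 && i <= 0x200A (* EN QUAD through HAIR SPACE *)
```

Note that U+200B ZERO WIDTH SPACE is not in the list: despite its name, it does not carry the White_Space property.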
Invalid UTF-8
All UTF-8 primitives raise a parse error when they encounter an invalid byte sequence. This includes:
- Bare continuation bytes (0x80--0xBF)
- Invalid lead bytes (0xFE, 0xFF)
- Overlong encodings
- Truncated multi-byte sequences
The error message is "invalid UTF-8" and the position points to the first invalid byte.
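The categories above can be explored with the standard library's decoder. This sketch assumes OCaml 4.14 or later and uses a hypothetical helper, first_invalid_byte, to report the byte offset where decoding first fails. It illustrates the kind of position such an error would carry; Parseff's own decoder may differ in detail.

```ocaml
(* Byte offset of the first malformed UTF-8 sequence in [s], if any.
   Hypothetical helper built on Stdlib; not Parseff's implementation. *)
let first_invalid_byte (s : string) : int option =
  let rec go i =
    if i >= String.length s then None
    else
      let d = String.get_utf_8_uchar s i in
      if Uchar.utf_decode_is_valid d then go (i + Uchar.utf_decode_length d)
      else Some i
  in
  go 0

let bare_continuation = "ok\x80"          (* continuation byte with no lead *)
let invalid_lead      = "\xFF"            (* 0xFF never appears in UTF-8 *)
let overlong          = "\xC0\xAF"        (* overlong encoding of '/' *)
let truncated         = "\xE6\x9D"        (* first two bytes of a 3-byte sequence *)
let well_formed       = "\u{6771}\u{4EAC}" (* two 3-byte code points *)
```

Each malformed sample is rejected at the offset of its first offending byte, while the well-formed sample decodes to the end.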
Position tracking
Positions remain byte offsets, consistent with the rest of parseff. A single Parseff.Utf8.satisfy call advances the position by 1--4 bytes depending on the UTF-8 encoding of the matched code point. Parseff.position always returns a byte offset.
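The arithmetic behind byte-offset positions can be sketched with Uchar.utf_8_byte_length from the standard library (OCaml 4.14 or later). The helper positions_after is hypothetical; it models the offset one would expect after each successive code point is consumed and does not call Parseff.

```ocaml
(* Byte offset reached after consuming each code point in turn, starting
   from offset 0. Hypothetical helper for illustration only. *)
let positions_after (us : Uchar.t list) : int list =
  let _, offsets =
    List.fold_left
      (fun (pos, acc) u ->
        let pos' = pos + Uchar.utf_8_byte_length u in
        (pos', pos' :: acc))
      (0, []) us
  in
  List.rev offsets

let sample = [
  Uchar.of_char 'a';     (* 1 byte *)
  Uchar.of_int 0x03BB;   (* λ, 2 bytes *)
  Uchar.of_int 0x2192;   (* →, 3 bytes *)
  Uchar.of_int 0x1D11E;  (* musical G clef, 4 bytes *)
]

(* positions_after sample = [1; 3; 6; 10] *)
```

Four code points therefore move the position by ten bytes, which matters when reporting error columns or slicing the original input.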
Streaming support
All UTF-8 primitives work with streaming input (Parseff.parse_source and Parseff.parse_source_until_end). Multi-byte UTF-8 sequences that span chunk boundaries are handled correctly: the streaming runtime ensures enough bytes are available before decoding each code point.
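One way a runtime can decide that it needs more input before decoding is to read the expected sequence length from the lead byte and compare it with the bytes still buffered. The sketch below shows that idea with hypothetical helpers expected_length and needs_more_input; it is an assumption about the general technique, not a description of Parseff's internals, and it does not validate continuation bytes.

```ocaml
(* Sequence length implied by a UTF-8 lead byte, or None when the byte
   cannot start a well-formed sequence. *)
let expected_length (lead : char) : int option =
  match Char.code lead with
  | c when c < 0x80 -> Some 1
  | c when c >= 0xC2 && c <= 0xDF -> Some 2
  | c when c >= 0xE0 && c <= 0xEF -> Some 3
  | c when c >= 0xF0 && c <= 0xF4 -> Some 4
  | _ -> None

(* True when the code point starting at [pos] is cut off by the end of
   [chunk], so another chunk must be read before decoding it. *)
let needs_more_input (chunk : string) (pos : int) : bool =
  if pos >= String.length chunk then true
  else
    match expected_length chunk.[pos] with
    | Some n -> pos + n > String.length chunk
    | None -> false (* invalid lead byte: more input will not help *)
```

For example, a chunk ending in the first two bytes of a three-byte sequence reports that more input is needed at that offset, and the same offset decodes normally once the next chunk has been appended.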