Non-blocking streaming Unicode codec.
Uutf is a non-blocking streaming codec to decode and
encode the UTF-8, UTF-16, UTF-16LE
and UTF-16BE encoding schemes. It can efficiently work character by
character without blocking on IO. Decoders perform
character position tracking and support newline normalization.
See examples of use.
Release 0.9.3 - Daniel Bünzli
Uutf uses the term character for a Unicode scalar
value which is an integer value in the ranges
0x0000...0xD7FF and 0xE000...0x10FFFF. This should not be
confused with a Unicode
code point, which is
a scalar value or a (textually meaningless) surrogate code point.
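The distinction between scalar values and surrogate code points can be made concrete with a small range check. This is a standalone sketch that does not use Uutf; the function names `is_scalar_value` and `is_surrogate` are illustrative and are not part of the library.

```ocaml
(* Sketch only: these helpers are hypothetical, not Uutf functions. *)

(* A Unicode scalar value is any code point outside the surrogate range,
   i.e. in 0x0000...0xD7FF or 0xE000...0x10FFFF. *)
let is_scalar_value (u : int) : bool =
  (0x0000 <= u && u <= 0xD7FF) || (0xE000 <= u && u <= 0x10FFFF)

(* Surrogate code points are code points but not scalar values. *)
let is_surrogate (u : int) : bool = 0xD800 <= u && u <= 0xDFFF
```

A check of this kind is useful before handing an arbitrary integer to an encoder that only accepts scalar values.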
The type for Unicode characters. Any value of this type returned
by Uutf is a Unicode scalar value.
The type for Unicode encoding schemes.
The type for newline normalizations. The variant argument is the normalization character.
`ASCII, normalizes CR (U+000D), LF (U+000A) and CRLF (<U+000D, U+000A>).
`NLF, normalizes the Unicode newline function (NLF). This is NEL (U+0085) and the normalizations of `ASCII.
`Readline, normalizes for a Unicode readline function. This is FF (U+000C), LS (U+2028), PS (U+2029), and the normalizations of `NLF.
Used with an appropriate normalization character the `NLF and
`Readline normalizations make it possible to implement all the different
recommendations of Unicode's newline guidelines (section 5.8 in
Unicode 6.1.0).
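To illustrate how the three normalizations nest, here is a minimal standalone sketch that normalizes a list of already decoded scalar values. It does not use Uutf and is not how the library implements normalization; the function name `normalize` and the list-based representation are assumptions made for the example.

```ocaml
(* Sketch only: [normalize] is a hypothetical helper, not a Uutf function.
   It works on a list of scalar values instead of a byte stream. *)
let normalize (nln : [ `ASCII of int | `NLF of int | `Readline of int ])
    (us : int list) : int list =
  (* [n] is the normalization character, [extra] the characters normalized
     in addition to CR, LF and CRLF. *)
  let n, extra = match nln with
    | `ASCII n -> n, []
    | `NLF n -> n, [0x0085]                              (* NEL *)
    | `Readline n -> n, [0x0085; 0x000C; 0x2028; 0x2029] (* NEL, FF, LS, PS *)
  in
  let rec loop acc = function
    | 0x000D :: 0x000A :: rest -> loop (n :: acc) rest   (* CRLF is one newline *)
    | (0x000D | 0x000A) :: rest -> loop (n :: acc) rest  (* lone CR or LF *)
    | u :: rest when List.mem u extra -> loop (n :: acc) rest
    | u :: rest -> loop (u :: acc) rest
    | [] -> List.rev acc
  in
  loop [] us
```

For instance, `normalize (`Readline 0x000A)` maps every recognized line terminator to a single LF, which is the setting used by the `lines` example of this documentation.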
The type for decoders.
decoder nln encoding src is a decoder that inputs from src.
Byte order mark.
Byte order mark
(BOM) constraints are application dependent and prone to
misunderstandings (see the FAQ).
Uutf decoders have a simple rule: an initial BOM is always
removed from the input and not counted in character position
tracking. The function decoder_removed_bom does however return
true if a BOM was removed so that all the information can be
recovered if needed.
For UTF-16BE and UTF-16LE the above rule is a violation of
conformance D96 and D97 of the standard.
Uutf favors the idea
that if there's a BOM, decoding with
`UTF_16 or with the `UTF_16BE or `UTF_16LE scheme
corresponding to the BOM should decode the same character sequence
(this is not the case if you stick to the standard). The client
can however regain conformance by consulting the result of
decoder_removed_bom and taking appropriate action.
Encoding. encoding specifies the decoded encoding scheme. If
`UTF_16 is used the endianness is determined
according to the standard: from a BOM
if there is one, `UTF_16BE otherwise. If
encoding is unspecified it is guessed. The result of a guess
can only be `UTF_8, `UTF_16BE or
`UTF_16LE. The heuristic
looks at the first three bytes of input (or less if impossible)
and takes the first matching byte pattern in the table below.
xx = any byte
.. = any byte or no byte (input too small)
pp = positive byte
uu = valid UTF-8 first byte

Bytes    | Guess     | Rationale
---------+-----------+-----------------------------------------------
EF BB BF | `UTF_8    | UTF-8 BOM
FE FF .. | `UTF_16BE | UTF-16BE BOM
FF FE .. | `UTF_16LE | UTF-16LE BOM
00 pp .. | `UTF_16BE | ASCII UTF-16BE and U+0000 is often forbidden
pp 00 .. | `UTF_16LE | ASCII UTF-16LE and U+0000 is often forbidden
uu .. .. | `UTF_8    | ASCII UTF-8 or valid UTF-8 first byte.
xx xx .. | `UTF_16BE | Not UTF-8 => UTF-16, no BOM => UTF-16BE
.. .. .. | `UTF_8    | Single malformed UTF-8 byte or no input.
This heuristic is compatible both with BOM based recognition and JSON-like encoding recognition that relies on ASCII being present at the beginning of the stream. Also, decoder_removed_bom will tell the client if the guess was BOM based.
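The table can be read as an ordered pattern match on at most three bytes. The following standalone sketch mirrors it row by row; it is not the library's code. The names `guess_encoding` and `is_utf_8_first_byte` are hypothetical, "positive byte" is interpreted here as any non-zero byte, and a valid UTF-8 first byte is taken to be 0x00...0x7F or 0xC2...0xF4; both interpretations are assumptions.

```ocaml
(* Sketch only: hypothetical re-implementation of the guess table.
   Returns the guessed scheme and whether the guess was BOM based. *)
type guess = UTF_8 | UTF_16BE | UTF_16LE

(* Assumption: lead bytes of well-formed UTF-8 sequences. *)
let is_utf_8_first_byte b = b <= 0x7F || (0xC2 <= b && b <= 0xF4)

let guess_encoding (s : string) : guess * bool =
  (* [byte i] is the byte at index [i], or [None] if the input is too small. *)
  let byte i = if i < String.length s then Some (Char.code s.[i]) else None in
  match byte 0, byte 1, byte 2 with
  | Some 0xEF, Some 0xBB, Some 0xBF -> UTF_8, true       (* UTF-8 BOM *)
  | Some 0xFE, Some 0xFF, _ -> UTF_16BE, true            (* UTF-16BE BOM *)
  | Some 0xFF, Some 0xFE, _ -> UTF_16LE, true            (* UTF-16LE BOM *)
  | Some 0x00, Some p, _ when p > 0 -> UTF_16BE, false   (* 00 pp *)
  | Some p, Some 0x00, _ when p > 0 -> UTF_16LE, false   (* pp 00 *)
  | Some u, _, _ when is_utf_8_first_byte u -> UTF_8, false
  | Some _, Some _, _ -> UTF_16BE, false                 (* not UTF-8, no BOM *)
  | _ -> UTF_8, false              (* single malformed byte or no input *)
```

Because the first matching row wins, the order of the cases matters: the BOM rows must come before the ASCII rows, and the ASCII rows before the generic UTF-8 row.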
Newline normalization. If
nln is specified, the given
newline normalization is performed, see nln. Otherwise
all newlines are returned as found in the input.
Character position. The line number, column number and
character count of the last decoded character (including
`Malformed ones) are respectively returned by decoder_line,
decoder_col and decoder_count. Before the first call to
decode the line number is
1 and the column is 0. Each decode returning `Uchar or
`Malformed increments the column
until a newline. On a newline, the line number is incremented and
the column set to zero. For example the line is
2 and column 0
after the first newline was decoded. This can be understood as if decode
was moving an insertion point to the right in the data. A newline is anything normalized by
`Readline, see nln.
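The position rules above amount to a small state machine. The sketch below replays them over a list of decode results, independently of Uutf. The record type `pos`, the functions `advance` and `position`, and the `newline` parameter (the character newlines were normalized to) are all hypothetical, and it assumes the character count also advances on newlines.

```ocaml
(* Sketch only: hypothetical model of line, column and count tracking. *)
type pos = { line : int; col : int; count : int }

(* State before the first decode: line 1, column 0. *)
let initial = { line = 1; col = 0; count = 0 }

(* Update the position for one decode result. [newline] is the scalar
   value that newlines are assumed to have been normalized to. *)
let advance ~newline p = function
  | `Uchar u when u = newline ->
      { line = p.line + 1; col = 0; count = p.count + 1 }
  | `Uchar _ | `Malformed _ ->
      { p with col = p.col + 1; count = p.count + 1 }

(* Position after decoding all of [items]. *)
let position ~newline items = List.fold_left (advance ~newline) initial items
```

With `~newline:0x000A`, decoding `a`, a malformed sequence, a newline and `b` ends on line 2, column 1.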
Uutf assumes that each Unicode scalar value has a column width
of 1. The same assumption may not be made by the display program
(e.g. for emacs' compilation mode you need to set
the relevant variable to nil). For implementing
more involved column width increments yourself, look into
grapheme cluster boundaries.
decode d is:
`Await if d has a `Manual input source and awaits more input. The client must use Manual.src to provide it.
`Uchar u if a Unicode scalar value u was decoded.
`End if the end of input was reached.
`Malformed bytes if the bytes sequence is malformed according to the decoded encoding scheme. If you are interested in best-effort decoding you can still continue to decode after an error until the decoder synchronizes again on valid bytes. It may however be a good idea to signal the malformed characters by adding a u_rep character to the parsed data, see the examples.
Note. Repeated invocation always eventually returns `End, even
in case of errors.
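The best-effort strategy described above can be sketched as a loop over decode results. To keep the example standalone, `next` is a hypothetical stand-in for `fun () -> Uutf.decode d`, `of_list` is a stub that replays a fixed list of results, and `u_rep` is assumed to be the replacement character U+FFFD.

```ocaml
(* Sketch only: [next] stands in for [fun () -> Uutf.decode d]. *)

(* Assumption: the replacement character U+FFFD. *)
let u_rep = 0xFFFD

(* Collect all scalar values, replacing malformed sequences by [u_rep]. *)
let collect next =
  let rec loop acc = match next () with
    | `Uchar u -> loop (u :: acc)
    | `Malformed _ -> loop (u_rep :: acc)   (* signal the error, keep going *)
    | `End -> List.rev acc
    | `Await -> failwith "more input needed (manual source)"
  in
  loop []

(* Hypothetical stub: replays [items], then returns [`End] forever. *)
let of_list items =
  let rest = ref items in
  fun () -> match !rest with
    | [] -> `End
    | x :: tl -> rest := tl; x
```

The loop terminates because the result stream always ends with `End, even after errors.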
The type for Unicode encoders.
encode e v is:
`Partial iff e has a `Manual destination and needs more output storage. The client must use Manual.dst to provide a new buffer and then call encode with `Await until `Ok is returned.
`Ok when the encoder is ready to encode a new `Uchar or `End.
For a `Manual destination, encoding
`End always returns
`Partial, the client should continue as usual with `Await until
`Ok is returned at which point Manual.dst_rem e is
guaranteed to be the size of the last provided buffer (i.e. nothing
was written).
Raises Invalid_argument if an `Uchar or
`End is encoded after a `Partial encode.
Manual sources and destinations.
Warning. Use only with
`Manual decoders and encoders.
Fold over the characters of UTF encoded OCaml string values.
encoding_guess s is the encoding guessed for
s coupled with
true iff there's an initial BOM.
Note. Initial BOMs are also folded over.
The type for character folders. The integer is the index in the
string where the `Uchar or `Malformed starts.
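To show what a folder receives, here is a standalone sketch of a fold over a UTF-8 encoded string that passes the starting byte index together with each result. It is not Uutf's implementation: the name `fold_utf_8_sketch` is hypothetical, and it relies on `String.get_utf_8_uchar` from the OCaml standard library, which is assumed to be available (OCaml 4.14 or later).

```ocaml
(* Sketch only: a hypothetical fold built on the standard library's
   UTF-8 decoder (assumes OCaml >= 4.14). *)
let fold_utf_8_sketch f acc (s : string) =
  let rec loop acc i =
    if i >= String.length s then acc else
    let d = String.get_utf_8_uchar s i in
    let n = Uchar.utf_decode_length d in        (* bytes consumed, >= 1 *)
    let item =
      if Uchar.utf_decode_is_valid d
      then `Uchar (Uchar.to_int (Uchar.utf_decode_uchar d))
      else `Malformed (String.sub s i n)
    in
    (* The folder gets the accumulator, the byte index and the result. *)
    loop (f acc i item) (i + n)
  in
  loop acc 0
```

Folding with `fun acc i _ -> i :: acc` collects the byte offsets at which characters start, which is handy for mapping character positions back to string indices.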
UTF encode characters in OCaml Buffer.t values.
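As an illustration of what adding a UTF-8 encoded character to a buffer involves, the sketch below encodes one scalar value by hand. It is a standalone example, not the library's code; the name `add_utf_8_sketch` is hypothetical.

```ocaml
(* Sketch only: hand-written UTF-8 encoding of one scalar value [u]
   appended to buffer [b]. Not a Uutf function. *)
let add_utf_8_sketch (b : Buffer.t) (u : int) : unit =
  let add n = Buffer.add_char b (Char.chr n) in
  if u < 0 || u > 0x10FFFF || (0xD800 <= u && u <= 0xDFFF) then
    invalid_arg "not a Unicode scalar value"
  else if u <= 0x7F then add u                   (* 1 byte: 0xxxxxxx *)
  else if u <= 0x7FF then begin                  (* 2 bytes *)
    add (0xC0 lor (u lsr 6));
    add (0x80 lor (u land 0x3F))
  end else if u <= 0xFFFF then begin             (* 3 bytes *)
    add (0xE0 lor (u lsr 12));
    add (0x80 lor ((u lsr 6) land 0x3F));
    add (0x80 lor (u land 0x3F))
  end else begin                                 (* 4 bytes *)
    add (0xF0 lor (u lsr 18));
    add (0x80 lor ((u lsr 12) land 0x3F));
    add (0x80 lor ((u lsr 6) land 0x3F));
    add (0x80 lor (u land 0x3F))
  end
```

The number of bytes grows with the magnitude of the scalar value, so ASCII text stays one byte per character while characters beyond U+FFFF take four.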
The value of
lines src is the list of lines in
src as UTF-8
encoded OCaml strings. Line breaks are determined according to the
recommendation R4 for a
readline function in section 5.8 of
Unicode 6.1.0. If a decoding error occurs we silently replace the
malformed sequence by the replacement character u_rep and continue.
let lines ?encoding (src : [`Channel of in_channel | `String of string]) =
  let rec loop d buf acc = match Uutf.decode d with
  | `Uchar 0x000A ->
      (* Normalized newline: the buffer holds a complete line. *)
      let line = Buffer.contents buf in
      Buffer.clear buf; loop d buf (line :: acc)
  | `Uchar u -> Uutf.Buffer.add_utf_8 buf u; loop d buf acc
  | `End -> List.rev (Buffer.contents buf :: acc)
  | `Malformed _ -> Uutf.Buffer.add_utf_8 buf Uutf.u_rep; loop d buf acc
  | `Await -> assert false (* cannot happen, the source is not `Manual *)
  in
  let nln = `Readline 0x000A in
  loop (Uutf.decoder ~nln ?encoding src) (Buffer.create 512) []
lines_fd does the same but on a Unix file descriptor, using the `Manual interface.
let lines_fd ?encoding (fd : Unix.file_descr) =
  let rec loop fd s d buf acc = match Uutf.decode d with
  | `Uchar 0x000A ->
      let line = Buffer.contents buf in
      Buffer.clear buf; loop fd s d buf (line :: acc)
  | `Uchar u -> Uutf.Buffer.add_utf_8 buf u; loop fd s d buf acc
  | `End -> List.rev (Buffer.contents buf :: acc)
  | `Malformed _ -> Uutf.Buffer.add_utf_8 buf Uutf.u_rep; loop fd s d buf acc
  | `Await ->
      (* The decoder needs more input: read from fd, retrying on EINTR. *)
      let rec unix_read fd s j l = try Unix.read fd s j l with
      | Unix.Unix_error (Unix.EINTR, _, _) -> unix_read fd s j l
      in
      let rc = unix_read fd s 0 (String.length s) in
      Uutf.Manual.src d s 0 rc; loop fd s d buf acc
  in
  let s = String.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
  let nln = `Readline 0x000A in
  loop fd s (Uutf.decoder ~nln ?encoding `Manual) (Buffer.create 512) []
The result of
recode src out_encoding dst has the characters of
src written on
dst with encoding
out_encoding. If a
decoding error occurs we silently replace the malformed sequence
by the replacement character u_rep and continue. Note that we
don't add an initial BOM to dst,
recoding will thus lose the initial BOM
src may have. Whether
this is a problem or not depends on the context.
let recode ?nln ?encoding out_encoding
    (src : [`Channel of in_channel | `String of string])
    (dst : [`Channel of out_channel | `Buffer of Buffer.t]) =
  let rec loop d e = match Uutf.decode d with
  | `Uchar _ as u -> ignore (Uutf.encode e u); loop d e
  | `End -> ignore (Uutf.encode e `End)
  | `Malformed _ -> ignore (Uutf.encode e (`Uchar Uutf.u_rep)); loop d e
  | `Await -> assert false
  in
  let d = Uutf.decoder ?nln ?encoding src in
  let e = Uutf.encoder out_encoding dst in
  loop d e
recode_fd does the same but between
Unix file descriptors.
let recode_fd ?nln ?encoding out_encoding (fdi : Unix.file_descr)
    (fdo : Unix.file_descr) =
  let rec encode fd s e v = match Uutf.encode e v with `Ok -> ()
  | `Partial ->
      let rec unix_write fd s j l =
        let rec write fd s j l = try Unix.single_write fd s j l with
        | Unix.Unix_error (Unix.EINTR, _, _) -> write fd s j l
        in
        let wc = write fd s j l in
        if wc < l then unix_write fd s (j + wc) (l - wc) else ()
      in
      unix_write fd s 0 (String.length s - Uutf.Manual.dst_rem e);
      Uutf.Manual.dst e s 0 (String.length s);
      encode fd s e `Await
  in
  let rec loop fdi fdo ds es d e = match Uutf.decode d with
  | `Uchar _ as u -> encode fdo es e u; loop fdi fdo ds es d e
  | `End -> encode fdo es e `End
  | `Malformed _ ->
      encode fdo es e (`Uchar Uutf.u_rep); loop fdi fdo ds es d e
  | `Await ->
      let rec unix_read fd s j l = try Unix.read fd s j l with
      | Unix.Unix_error (Unix.EINTR, _, _) -> unix_read fd s j l
      in
      let rc = unix_read fdi ds 0 (String.length ds) in
      Uutf.Manual.src d ds 0 rc; loop fdi fdo ds es d e
  in
  let ds = String.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
  let es = String.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
  let d = Uutf.decoder ?nln ?encoding `Manual in
  let e = Uutf.encoder out_encoding `Manual in
  Uutf.Manual.dst e es 0 (String.length es);
  loop fdi fdo ds es d e