Non-blocking streaming JSON codec.
Jsonm is a non-blocking streaming codec to
decode and encode the
JSON data format. It can
process JSON text without blocking on IO and without a complete
in-memory representation of the data.
The uncut codec also processes whitespace and (non-standard) JSON with JavaScript comments.
Consult the data model, limitations and examples of use.
Version 0.9.1 - Daniel Bünzli <daniel.buenzli at erratique.ch>
The type for JSON lexemes. `As and `Ae
start and end arrays and `Os and `Oe start
and end objects. `Name is for the member names of objects.
A well-formed sequence of lexemes belongs to the language of
the json grammar:
json = object / array
object = `Os *member `Oe
member = (`Name s) value
array = `As *value `Ae
value = `Null / `Bool b / `Float f / `String s / object / array
A decoder returns only well-formed sequences of
lexemes or `Errors are returned. The
UTF-8,
UTF-16, UTF-16LE and
UTF-16BE encoding schemes are supported. The strings of decoded
`Name and `String lexemes are however always UTF-8 encoded. In
these strings, characters originally escaped in the input are in
their unescaped representation.
An encoder accepts only well-formed sequences
of lexemes or Invalid_argument is raised. Only the UTF-8
encoding scheme is supported. The strings of encoded `Name and
`String lexemes are assumed to be immutable and must be UTF-8
encoded, this is not checked by the module. In these strings,
the delimiter characters U+0022 and U+005C ('"', '\')
aswell as the control characters U+0000-U+001F are automatically
escaped by the encoders, as mandated by the standard.
The type for decoding errors.
The type for Unicode encoding schemes.
The type for JSON decoders.
decode d is:
`Await if d has a `Manual source and awaits for more input.
The client must use Manual.src to provide it.`Lexeme l if a lexeme l was decoded.`End if the end of input was reached.`Error e if a decoding error occured. If the client is interested
in a best-effort decoding it can still continue to decode
after an error (see errorrecovery) although the resulting sequence
of `Lexemes is undefined and may not be well-formed.The Uncut.pp_decode function can be used to inspect decode results.
Note. Repeated invocation always eventually returns `End, even
in case of errors.
The type for JSON encoders.
encode e v is:
`Partial iff e has a `Manual destination and needs more
output storage. The client must use Manual.dst to provide
a new buffer and then call encode with `Await until `Ok
is returned.`Ok when the encoder is ready to encode a new `Lexeme
or `End.
For `Manual destinations, encoding `End always returns `Partial,
the client should as usual use Manual.dst and continue with `Await
until `Ok is returned at which point Manual.dst_rem e is guaranteed
to be the size of the last provided buffer (i.e. nothing was written).
Raises. Invalid_argument if a non well-formed
sequence of lexemes is encoded or if `Lexeme or `End is
encoded after a `Partial encode.
Codec with comments and whitespace.
The uncut codec also processes whitespace and JavaScript
comments. The latter is non-standard JSON, fail on `Comment
decoding if you want to process whitespace but stick to the standard.
The uncut codec preserves as much of the original input as
possible. Perfect round-trip with Jsonm is however impossible for
the following reasons:
':' and
value separators ",". If you just reencode the sequence of
decodes, whitespace and comments may (harmlessly, but significantly)
commute with these separators.U+000A ('\n') for newlines.`Float lexemes may be rewritten differently by the encoder.The uncut data model is the same as the regular data model, except that before or after any lexeme you may decode/encode one or more:
`White w, representing JSON whitespace w. On input
the sequence CR (U+000D) and CRLF (<U+000A, U+000A>)
are normalized to U+000A. The string w must be
a sequence of U+0020, U+0009, U+000A or U+000D
characters (' ', '\t', '\n', '\r').`Comment (`S, c), representing a JavaScript single line
comment c. c is the comment's content without the starting
// and the ending newline. The string c must not contain any newline.`Comment (`M, c), representing a JavaScript multi-line
comment c. c is the comment's content without the starting
/* and the ending */. The string c must not contain the
sequence */.Warning. Uncut.encode does not check the above constraints on
w and c.
Decoders parse valid JSON with the following limitations:
float values.
This means that it can only represent integers exactly
in the in the interval [-253;253]. This is equivalent
to the contraints JavaScript has.`String, `Name, `White and `Comment
are limited by Sys.max_string_length. There is no built-in
protection against the fact that the internal OCaml Buffer.t
value may raise Failure on Jsonm.decode. This should
however only be a problem on 32-bits platforms if your
strings are greater than 16Mo.Position tracking assumes that each decoded Unicode scalar value
has a column width of 1. The same assumption may not be made by
the display program (e.g. for emacs' compilation mode you need
to set compilation-error-screen-columns to nil).
The newlines LF (U+000A), CR (U+000D), and CRLF are all normalized
to LF internally. This may have an impact in some corner `Error
cases. For example the invalid escape sequence <U+005C,U+000D> in
a string will be reported as being `Illegal_escape (`Not_esc_uchar
0x000A).
Encoders produce valid JSON provided the client ensures that the following holds.
Jsonm.`Float lexemes must not be, Pervasives.nan,
Pervasives.infinity or Pervasives.neg_infinity. They
are encoded with the format string "%.16g", this allows
to roundtrip all the integers that can be precisely represented
in OCaml float values, i.e. the integers in the interval
[-253;253]. This is equivalent to the constraints
JavaScript has.`White must be made
of JSON whitespace and `Comment
must never be encoded.After a decoding error, if best-effort decoding is performed. The following happens before continuing:
`Illegal_BOM, the initial
BOM is skipped.`Illegal_bytes, `Illegal_escape, `Illegal_string_uchar, a
Unicode
replacement
character (U+FFFD) is substituted to the illegal sequence.`Illegal_literal, `Illegal_number the corresponding
`Lexeme is skipped.`Expected r, input is discarded until a synchronyzing lexeme
that depends on r is found.`Unclosed, the end of input is reached, further decodes will be
`EndThe result of trip src dst has the JSON from src written on dst.
let trip ?encoding ?minify
(src : [`Channel of in_channel | `String of string])
(dst : [`Channel of out_channel | `Buffer of Buffer.t])
=
let rec loop d e = match Jsonm.decode d with
| `Lexeme _ as v -> ignore (Jsonm.encode e v); loop d e
| `End -> ignore (Jsonm.encode e `End); `Ok
| `Error err -> `Error (Jsonm.decoded_range d, err)
| `Await -> assert false
in
let d = Jsonm.decoder ?encoding src in
let e = Jsonm.encoder ?minify dst in
loop d e
Using the `Manual interface, trip_fd does the same but between Unix
file descriptors.
let trip_fd ?encoding ?minify
(fdi : Unix.file_descr)
(fdo : Unix.file_descr)
=
let rec encode fd s e v = match Jsonm.encode e v with `Ok -> ()
| `Partial ->
let rec unix_write fd s j l =
let rec write fd s j l = try Unix.single_write fd s j l with
| Unix.Unix_error (Unix.EINTR, _, _) -> write fd s j l
in
let wc = write fd s j l in
if wc < l then unix_write fd s (j + wc) (l - wc) else ()
in
unix_write fd s 0 (String.length s - Jsonm.Manual.dst_rem e);
Jsonm.Manual.dst e s 0 (String.length s);
encode fd s e `Await
in
let rec loop fdi fdo ds es d e = match Jsonm.decode d with
| `Lexeme _ as v -> encode fdo es e v; loop fdi fdo ds es d e
| `End -> encode fdo es e `End; `Ok
| `Error err -> `Error (Jsonm.decoded_range d, err)
| `Await ->
let rec unix_read fd s j l = try Unix.read fd s j l with
| Unix.Unix_error (Unix.EINTR, _, _) -> unix_read fd s j l
in
let rc = unix_read fdi ds 0 (String.length ds) in
Jsonm.Manual.src d ds 0 rc; loop fdi fdo ds es d e
in
let ds = String.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
let es = String.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
let d = Jsonm.decoder ?encoding `Manual in
let e = Jsonm.encoder ?minify `Manual in
Jsonm.Manual.dst e es 0 (String.length es);
loop fdi fdo ds es d e
The result of memsel names src is the list of string values of
members of src that have their name in names. In this example,
decoding errors are silently ignored.
let memsel ?encoding names
(src : [`Channel of in_channel | `String of string])
=
let rec loop acc names d = match Jsonm.decode d with
| `Lexeme (`Name n) when List.mem n names ->
begin match Jsonm.decode d with
| `Lexeme (`String s) -> loop (s :: acc) names d
| _ -> loop acc names d
end
| `Lexeme _ | `Error _ -> loop acc names d
| `End -> List.rev acc
| `Await -> assert false
in
loop [] names (Jsonm.decoder ?encoding src)
A generic OCaml representation of JSON text is the following one.
type json =
[ `Null | `Bool of bool | `Float of float| `String of string
| `A of json list | `O of (string * json) list ]
The result of json_of_src src is the JSON text from src in this
representation. The function is tail recursive.
exception Escape of ((int * int) * (int * int)) * Jsonm.error
let json_of_src ?encoding
(src : [`Channel of in_channel | `String of string])
=
let dec d = match Jsonm.decode d with
| `Lexeme l -> l
| `Error e -> raise (Escape (Jsonm.decoded_range d, e))
| `End | `Await -> assert false
in
let rec value v k d = match v with
| `Os -> obj [] k d | `As -> arr [] k d
| `Null | `Bool _ | `String _ | `Float _ as v -> k v d
| _ -> assert false
and arr vs k d = match dec d with
| `Ae -> k (`A (List.rev vs)) d
| v -> value v (fun v -> arr (v :: vs) k) d
and obj ms k d = match dec d with
| `Oe -> k (`O (List.rev ms)) d
| `Name n -> value (dec d) (fun v -> obj ((n, v) :: ms) k) d
| _ -> assert false
in
let d = Jsonm.decoder ?encoding src in
try `JSON (value (dec d) (fun v _ -> v) d) with
| Escape (r, e) -> `Error (r, e)
The result of json_to_dst dst json has the JSON text json written
on dst. The function is tail recursive.
let json_to_dst ~minify
(dst : [`Channel of out_channel | `Buffer of Buffer.t ])
(json : json)
=
let enc e l = ignore (Jsonm.encode e (`Lexeme l)) in
let rec value v k e = match v with
| `A vs -> arr vs k e
| `O ms -> obj ms k e
| `Null | `Bool _ | `Float _ | `String _ as v -> enc e v; k e
and arr vs k e = enc e `As; arr_vs vs k e
and arr_vs vs k e = match vs with
| v :: vs' -> value v (arr_vs vs' k) e
| [] -> enc e `Ae; k e
and obj ms k e = enc e `Os; obj_ms ms k e
and obj_ms ms k e = match ms with
| (n, v) :: ms -> enc e (`Name n); value v (obj_ms ms k) e
| [] -> enc e `Oe; k e
in
let e = Jsonm.encoder ~minify dst in
let finish e = ignore (Jsonm.encode e `End) in
match json with `A _ | `O _ as json -> value json finish e
| _ -> invalid_arg "invalid json text"