Support module for Alain Frisch's ulex lexer generator.
The submodule ULB is a Unicode-based lexing buffer that
reads encoded strings and makes them available to the lexer
as both Unicode arrays and UTF-8 strings.
The submodule Ulexing is a replacement for the module of the same
name in ulex. It uses ULB to represent the main lexing buffer. It is
much faster than the original Ulexing implementation when the scanned
text is UTF-8 encoded and Ulexing.utf8_lexeme is called frequently to
get the lexeme strings. Furthermore, it can process input data in all
encodings available to Netconversion. It is, however, not a drop-in
replacement, as it has a different signature.
To enable this version of Ulexing, simply put an open Netulex
before using the ulex lexers.
Note that the tutorial has been moved to [root:Netulex_tut].
This module provides the unicode_lexbuf record with
access functions. In this record, the data is available
in two forms: as an array of Unicode code points, ulb_chars,
and as a string of encoded characters, ulb_rawbuf.
Both buffers are synchronised by ulb_chars_pos. This
array stores where every character of ulb_chars can be
found in ulb_rawbuf.
mutable ulb_encoding : Netconversion.encoding;
    (* The character encoding of ulb_rawbuf *)
mutable ulb_encoding_start : int;
    (* The first character position to which ulb_encoding
       applies (the encoding of earlier positions is lost) *)
mutable ulb_rawbuf : string;
    (* The encoded string to analyse *)
mutable ulb_rawbuf_len : int;
    (* The filled part of ulb_rawbuf *)
mutable ulb_rawbuf_end : int;
    (* The analysed part of ulb_rawbuf. It always holds that
       ulb_rawbuf_end <= ulb_rawbuf_len. The analysed part
       may be shorter than the filled part because there is
       not enough space in ulb_chars, or because the filled
       part ends with an incomplete multi-byte character *)
mutable ulb_rawbuf_const : bool;
    (* Whether ulb_rawbuf is considered as a constant. If
       true, it is never blitted. *)
mutable ulb_chars : int array;
    (* The analysed part of ulb_rawbuf as array of Unicode
       code points. Only the positions 0 to ulb_chars_len-1
       of the array are filled. *)
mutable ulb_chars_pos : int array;
    (* For every analysed character this array stores the
       byte position where the character begins in ulb_rawbuf.
       In addition, the array contains at ulb_chars_len the
       value of ulb_rawbuf_end. This array is one element
       longer than ulb_chars. *)
mutable ulb_chars_len : int;
    (* The filled part of ulb_chars *)
mutable ulb_eof : bool;
    (* Whether EOF has been seen *)
mutable ulb_refill : string -> int -> int -> int;
    (* The refill function *)
mutable ulb_enc_change_hook : unicode_lexbuf -> unit;
    (* This function is called when the encoding changes *)
mutable ulb_cursor : Netconversion.cursor;
    (* Internally used by the implementation *)
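The relationship between ulb_rawbuf, ulb_chars and ulb_chars_pos can be illustrated with a small standalone model. The sketch below does not use Netulex at all: the function analyse is a hypothetical helper that decodes a UTF-8 string with the OCaml standard library (String.get_utf_8_uchar, available from OCaml 4.14) and builds two arrays laid out as the field comments describe, including the extra trailing entry of the position array. Invalid bytes are decoded as U+FFFD here, which is an assumption of this sketch and says nothing about how the library treats them.

```ocaml
(* Standalone model, not the Netulex implementation.
   Requires OCaml >= 4.14 for String.get_utf_8_uchar. *)
let analyse (raw : string) : int array * int array =
  let n = String.length raw in
  let chars = ref [] and pos = ref [] in
  let i = ref 0 in
  while !i < n do
    let d = String.get_utf_8_uchar raw !i in
    (* code point of the character starting at byte !i *)
    chars := Uchar.to_int (Uchar.utf_decode_uchar d) :: !chars;
    (* byte position where that character begins *)
    pos := !i :: !pos;
    i := !i + Uchar.utf_decode_length d
  done;
  (* trailing entry: the byte position where the analysed part ends *)
  pos := !i :: !pos;
  (Array.of_list (List.rev !chars), Array.of_list (List.rev !pos))

(* "h", "a with diaeresis" (two bytes), "u", "s" *)
let chars, chars_pos = analyse "h\xc3\xa4us"
```

In this model chars_pos has one element more than chars, so the byte extent of character k is always chars_pos.(k) up to chars_pos.(k+1), without a special case for the last character.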
Creates a unicode_lexbuf to analyse strings of the passed encoding
coming from the refill function.
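The initial size of ulb_rawbuf. Defaults to 512.
The initial size of ulb_chars. Defaults to 256.
The encoding-change hook is called when the encoding changes, for example through set_encoding.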
The refill function is called with ulb_rawbuf, ulb_rawbuf_len, and l,
where l = String.length ulb_rawbuf - ulb_rawbuf_len is the free
space in the buffer. The function should write new bytes into
this substring, and return the number of bytes added. A
return value of 0 signals EOF.
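A refill function following this contract might look like the sketch below. It is standalone and only uses the standard library. make_refill and its chunk parameter are hypothetical names introduced for illustration. The buffer is typed as Bytes.t because strings are immutable in current OCaml, whereas the record above declares the refill function over string; which of the two applies depends on the library version in use.

```ocaml
(* Hypothetical helper: build a refill function that hands out the
   contents of [src] in pieces of at most [chunk] bytes.
   The returned function takes the buffer, the position of the first
   free byte, and the amount of free space, as described above. *)
let make_refill ?(chunk = 4) (src : string) : Bytes.t -> int -> int -> int =
  let consumed = ref 0 in
  fun buf pos len ->
    let remaining = String.length src - !consumed in
    (* never write more than the free space or the remaining input *)
    let n = min (min len chunk) remaining in
    Bytes.blit_string src !consumed buf pos n;
    consumed := !consumed + n;
    n  (* 0 once the source is exhausted, which signals EOF *)
```

Note that this sketch also returns 0 when it is called with no free space at all, so it relies on the caller passing a positive length.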
Creates a unicode_lexbuf to analyse strings of the passed encoding
coming from the object channel.
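The initial size of ulb_rawbuf. Defaults to 512.
The initial size of ulb_chars. Defaults to 256.
The encoding-change hook is called when the encoding changes, for example through set_encoding.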
Creates a unicode_lexbuf analysing the passed string encoded in
the passed encoding. This function copies the input string.
The encoding-change hook is called when the encoding changes, for example through set_encoding.
Creates a unicode_lexbuf analysing the passed string encoded in
the passed encoding. This function does not copy the input string,
but uses it directly as ulb_rawbuf. The string is not modified by ULB,
but the caller must ensure that other program parts do not
modify it either.
The encoding-change hook is called when the encoding changes, for example through set_encoding.
Deletes the given number n of characters from the unicode_lexbuf.
These characters are removed from the beginning of the buffer, i.e.
ulb_chars.(n) becomes the new first character of the
buffer. All three buffers ulb_rawbuf, ulb_chars, and
ulb_chars_pos are blitted as necessary.
When the buffer is already at EOF, the function fails.
For efficiency, delete should be called as seldom as
possible. Its running time is linear in the number of characters to move.
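The blitting described here can be sketched on a simplified standalone model. The record type buf and the function delete_sketch below are hypothetical and are not the library's types or code. The model assumes that the whole filled part of the raw buffer has been analysed, and it does not model the EOF failure mentioned above.

```ocaml
(* Simplified model of the three buffers; not the Netulex record. *)
type buf = {
  raw : Bytes.t;               (* encoded bytes *)
  mutable raw_len : int;       (* filled (and analysed) part of raw *)
  chars : int array;           (* code points *)
  chars_pos : int array;       (* byte positions, one extra entry *)
  mutable chars_len : int;     (* filled part of chars *)
}

let delete_sketch n b =
  if n < 0 || n > b.chars_len then invalid_arg "delete_sketch";
  let kept = b.chars_len - n in        (* characters that remain *)
  let byte_off = b.chars_pos.(n) in    (* first byte that remains *)
  (* move the remaining bytes and code points to the front *)
  Bytes.blit b.raw byte_off b.raw 0 (b.raw_len - byte_off);
  Array.blit b.chars n b.chars 0 kept;
  (* kept + 1 positions (including the end marker), rebased so that
     the new first character starts at byte 0 *)
  for i = 0 to kept do
    b.chars_pos.(i) <- b.chars_pos.(i + n) - byte_off
  done;
  b.raw_len <- b.raw_len - byte_off;
  b.chars_len <- kept
```

Every remaining byte, code point and position is moved once, which is why the cost grows with the amount of data left in the buffer.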
Tries to add characters to the unicode_lexbuf by calling the
ulb_refill function. When the buffer is already at EOF, the
exception End_of_file is raised, and the buffer is not modified.
Otherwise, the ulb_refill function is called to
add new characters. If necessary, ulb_rawbuf, ulb_chars, and
ulb_chars_pos are enlarged so that either
at least one new character is added, or EOF is found for
the first time.
In the latter case, ulb_eof is set to true
(and the next call of refill_unicode_lexbuf will raise End_of_file).
Sets the encoding to the passed value. This only affects future
refill calls. The hook enc_change_hook is invoked when defined.
Sets ulb_eof of the unicode_lexbuf. The rest of the buffer
is not modified.
The two int arguments are the position and length of a substring
of the lexbuf that is returned as a UTF-8 string. Position
and length are counted in characters, not bytes.
Returns String.length(utf8_sub_string args). Tries not to
allocate the UTF-8 string.
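When the raw buffer is already UTF-8 encoded, the position array makes both operations cheap. The standalone sketch below shows one plausible way to do this. It is an illustration under that assumption and not necessarily how the library implements utf8_sub_string. The function names carry a _sketch suffix to mark them as hypothetical, and the arrays are written out by hand for the example string.

```ocaml
(* raw: UTF-8 bytes; chars_pos: byte position of each character,
   plus one trailing entry holding the end position.
   k and n are counted in characters, not bytes. *)
let utf8_sub_string_sketch (raw : string) (chars_pos : int array) k n =
  String.sub raw chars_pos.(k) (chars_pos.(k + n) - chars_pos.(k))

(* The byte length follows from two array lookups, so no string
   needs to be allocated. *)
let utf8_sub_string_length_sketch (chars_pos : int array) k n =
  chars_pos.(k + n) - chars_pos.(k)

(* "h", "a with diaeresis" (two bytes), "u", "s", "e", "r" *)
let raw = "h\xc3\xa4user"
let chars_pos = [| 0; 1; 3; 4; 5; 6; 7 |]

(* two characters starting at character 1: three bytes *)
let sub = utf8_sub_string_sketch raw chars_pos 1 2
let len = utf8_sub_string_length_sketch chars_pos 1 2
```

For other encodings the substring would have to be recoded to UTF-8, so this shortcut would not apply.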
This is a lexing buffer for ulex.
Lexical error
Creates a new lexbuf from the unicode_lexbuf. After that,
the unicode_lexbuf must no longer be modified.
Returns a substring of the lexeme as an array of Unicode
code points. The first int is the character position
at which to start, the second int is the number of
characters.