Runtime support for lexers generated by ulex
.
This module is roughly equivalent to the module Lexing from
the OCaml standard library, except that its lexbuffers handles
Unicode code points (OCaml type: int
in the range
0..0x10ffff
) instead of bytes (OCaml type: char
).
It is possible to have ulex-generated lexers work on a custom
implementation for lex buffers. To do this, define a module L
which
implements the start
, next
, mark
and backtrack
functions
(See the Internal Interface section below for a specification),
and the Error
exception.
They need not work on a type named lexbuf
: you can use the type
name you want. Then, just do in your ulex-processed source, before
the first lexer specification:
module Ulexing = L
Of course, you'll probably want to define functions like lexeme
to be used in the lexers semantic actions.
The type of lexer buffers. A lexer buffer is the argument passed to the scanning functions defined by the generated lexers. The lexer buffer holds the internal information for the scanners, including the code points of the token currently scanned, its position from the beginning of the input stream, and the current position of the lexer.
Raised by a lexer when it cannot parse a token from the lexbuf.
The functions Ulexing.lexeme_start
(resp. Ulexing.lexeme_end
) can be
used to find to positions of the first code point of the current
matched substring (resp. the first code point that yield the error).
Raised by some functions to signal that some code point is not compatible with a specified encoding.
Create a generic lexer buffer. When the lexer needs more
characters, it will call the given function, giving it an array of
integers a
, a position pos
and a code point count n
. The
function should put n
code points or less in a
, starting at
position pos
, and return the number of characters provided. A
return value of 0 means end of input.
Create a lexbuf from a Latin1 encoded input channel. The client is responsible for closing the channel.
Create a lexbuf from a UTF-8 encoded input channel.
Create a lexbuf from a stream whose encoding is subject to change during lexing. The reference can be changed at any point. Note that bytes that have been consumed by the lexer buffer are not re-interpreted with the new encoding.
In Ascii
mode, non-ASCII bytes (ie >127
) in the stream
raise an InvalidCodepoint
exception.
Same as Ulexing.from_var_enc_stream
with a string as input.
Same as Ulexing.from_var_enc_stream
with a channel as input.
The following functions can be called from the semantic actions of
lexer definitions. They give access to the character string matched
by the regular expression associated with the semantic action. These
functions must be applied to the argument lexbuf
, which, in the
code generated by ulex
, is bound to the lexer buffer passed to the
parsing function.
These functions can also be called when capturing a Ulexing.Error
exception to retrieve the problematic string.
Ulexing.lexeme_start lexbuf
returns the offset in the
input stream of the first code point of the matched string.
The first code point of the stream has offset 0.
Ulexing.lexeme_end lexbuf
returns the offset in the input stream
of the character following the last code point of the matched
string. The first character of the stream has offset 0.
Ulexing.loc lexbuf
returns the pair
(Ulexing.lexeme_start lexbuf,Ulexing.lexeme_end lexbuf)
.
Ulexing.loc lexbuf
returns the difference
(Ulexing.lexeme_end lexbuf) - (Ulexing.lexeme_start lexbuf)
,
that is, the length (in code points) of the matched string.
Ulexing.lexeme lexbuf
returns the string matched by
the regular expression as an array of Unicode code point.
Direct access to the starting position of the lexeme in the internal buffer.
Direct access to the current position (end of lexeme) in the internal buffer.
Ulexing.lexeme_char lexbuf pos
returns code point number pos
in
the matched string.
Ulexing.lexeme lexbuf pos len
returns a substring of the string
matched by the regular expression as an array of Unicode code point.
As Ulexing.lexeme
with a result encoded in Latin1.
This function throws an exception InvalidCodepoint
if it is not possible
to encode the result in Latin1.
As Ulexing.sub_lexeme
with a result encoded in Latin1.
This function throws an exception InvalidCodepoint
if it is not possible
to encode the result in Latin1.
As Ulexing.lexeme_char
with a result encoded in Latin1.
This function throws an exception InvalidCodepoint
if it is not possible
to encode the result in Latin1.
As Ulexing.sub_lexeme
with a result encoded in UTF-8.
Ulexing.rollback lexbuf
puts lexbuf
back in its configuration before
the last lexeme was matched. It is then possible to use another
lexer to parse the same characters again. The other functions
above in this section should not be used in the semantic action
after a call to Ulexing.rollback
.
These functions are used internally by the lexers. They could be used
to write lexers by hand, or with a lexer generator different from
ulex
. The lexer buffers have a unique internal slot that can store
an integer. They also store a "backtrack" position.
Ulexing.start lexbuf
informs the lexer buffer that any
code points until the current position can be discarded.
The current position become the "start" position as returned
by Ulexing.lexeme_start
. Moreover, the internal slot is set to
-1
and the backtrack position is set to the current position.
Ulexing.next lexbuf next
extracts the next code point from the
lexer buffer and increments to current position. If the input stream
is exhausted, the function returns -1
.