Camomile's toplevel interface
Type of configuration parameter
Configuration values
Directory of compiled Unicode data
Directory of compiled character mapping tables a la ISO
Directory of camomile-style compiled character mapping table
Directory of compiled locale data
Individual modules
Object Oriented Channel
Generic input channel. Has the same interface as the polymorphic input channel of http://www.ocaml-programming.de/rec/IO-Classes.html. All Camomile channels with this interface must conform to the behaviour defined in the recommendation above.
If close_out cannot output all buffered objects, flush raises Failure.
If flush cannot output all buffered objects, flush raises Failure.
Generic output channel. Has the same interface as the polymorphic output channel of http://www.ocaml-programming.de/rec/IO-Classes.html. All Camomile channels with this interface must conform to the behaviour defined in the recommendation above.
Convert stream to obj_input_channel
Character (byte) input channel. Has the same interface as the octet input channel of http://www.ocaml-programming.de/rec/IO-Classes.html. All Camomile channels with this interface must conform to the behaviour defined in the recommendation above. In addition, all channels are assumed to be blocking. If you supply a non-blocking channel to the Camomile API, the outcome is undefined.
Character (byte) output channel. Has the same interface as the octet output channel of http://www.ocaml-programming.de/rec/IO-Classes.html. All Camomile channels with this interface must conform to the behaviour defined in the recommendation above. In addition, all channels are assumed to be blocking. If you supply a non-blocking channel to the Camomile API, the outcome is undefined.
Convert a polymorphic input channel to a character input channel
Convert a character input channel to a polymorphic input channel
Convert a polymorphic output channel to a character output channel
Convert a character output channel to a polymorphic output channel
Convert an OCaml input channel to an OO-based character input channel
Convert an OCaml output channel to an OO-based character output channel
Unicode (ISO-UCS) characters.
This module implements Unicode (actually ISO-UCS) characters. All 31-bit code points are allowed.
Unicode characters. All 31-bit code points are allowed.
char_of u returns the Latin-1 representation of u. If u cannot be represented by Latin-1, raises Out_of_range.
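The Latin-1 restriction can be illustrated with a small stand-alone sketch. It models code points as plain ints and declares its own Out_of_range exception; both are assumptions made for illustration, not Camomile's actual types.

```ocaml
(* Local stand-in for the exception named in the text above. *)
exception Out_of_range

(* Code points are modelled as plain ints in this sketch.
   Latin-1 covers exactly the code points 0x00 to 0xFF. *)
let char_of (u : int) : char =
  if u >= 0 && u <= 0xFF then Char.chr u else raise Out_of_range
```

A code point such as 0xE9 maps to the corresponding byte, while anything above 0xFF is rejected.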
code u returns the Unicode code number of u. If the value cannot be represented by a positive integer, raises Out_of_range.
chr n returns the Unicode character with the code number n. If n >= 2^32 or n < 0, raises invalid_arg.
uint_code u returns the Unicode code number of u. The returned int is unsigned; that is, on 32-bit platforms, the sign bit is used to store the 31st bit of the code number.
chr_of_uint n returns the Unicode character with the code number n. n is interpreted as unsigned; that is, on 32-bit platforms, the sign bit is treated as the 31st bit of the code number. If n exceeds 31 bits, raises invalid_arg.
Sets of Unicode characters, implemented as sets of intervals. The signature is mostly the same as Set.S in the stdlib.
fold_range f s x is equivalent to f u_i u_(i+1) (... (f u_3 u_4 (f u_1 u_2 x))) if s consists of the intervals u_1-u_2, u_3-u_4, ..., u_i-u_(i+1) in increasing order. The intervals passed to f are always separated by at least one character not in s.
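The folding order described above can be sketched without the library. The representation below (a sorted list of disjoint closed intervals of int code points) is an assumption chosen for illustration and is not Camomile's internal one.

```ocaml
(* Illustrative representation: sorted, disjoint, non-adjacent closed
   intervals of code points. Not Camomile's internal data structure. *)
type interval_set = (int * int) list

(* Fold f over the intervals in increasing order, passing both bounds,
   so the first interval is applied innermost. *)
let fold_range (f : int -> int -> 'a -> 'a) (s : interval_set) (x : 'a) : 'a =
  List.fold_left (fun acc (lo, hi) -> f lo hi acc) x s

(* Example: count the characters in the set by summing interval sizes. *)
let cardinal (s : interval_set) : int =
  fold_range (fun lo hi n -> n + (hi - lo + 1)) s 0

(* A-Z and a-z as two intervals. *)
let ascii_letters : interval_set = [ (0x41, 0x5A); (0x61, 0x7A) ]
```

Folding over intervals rather than single characters keeps the cost proportional to the number of intervals, which is the point of the interval representation.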
Maps over Unicode characters.
map ?eq f m and mapi ?eq f m: similar to map and mapi in the stdlib Map, but for the returned map m' it is only guaranteed that eq (find u m') (f (find u m)) is true for map and eq (find u m') (f u (find u m)) is true for mapi. If eq is not specified, structural equality is used.
fold_range f m x is equivalent to f u_(2n-1) u_(2n) v_n (... (f u_1 u_2 v_1 x)) where all characters in the range u_(2k-1)-u_(2k) are mapped to v_k and u_1 < u_3 < ... in code point order. Each range u_(2k-1)-u_(2k) is separated from the next by a character which is not mapped to v_k.
Signature for Unicode strings. UText, XString, UTF8, UTF16 and UCS4 have signatures matching UStorage and satisfy the semantics described below. Users who want to supply their own Unicode strings should design the module with the following signature and properties.
The type of string.
Locations in storages.
next x i, prev x i: the operation is valid if i points to a valid element; the returned value may point to the location one beyond the valid elements. If i does not point to a valid element, the results are unspecified.
An implementation of Unicode string.
An implementation of Unicode string. Internally, it uses an integer array. The semantics matches the description of UStorage.
Phantom type for distinguishing mutability
Line IO
Line I/O, conversion of line separators.
Line separators.
`CR specifies carriage return.
`LF specifies linefeed.
`CRLF specifies the sequence of carriage return and linefeed.
`NEL specifies next line (\u0085).
`LS specifies Unicode line separator (\u2028).
`PS specifies Unicode paragraph separator (\u2029).
new input separator input_obj creates a new input channel object OOChannel.obj_input_channel which reads from input_obj and converts line separators (all of CR, LF, CRLF, NEL, LS, PS) to separator.
new output separator output_obj creates a new output channel object OOChannel.obj_output_channel which receives Unicode characters and converts line separators (all of CR, LF, CRLF, NEL, LS, PS) to separator.
new input_line input_obj creates a new input channel object OOChannel.obj_input_channel which reads Unicode characters from input_obj and outputs lines. All of CR, LF, CRLF, NEL, LS, PS, as well as FF (formfeed), are recognised as line separators.
new output_line ~sp output_obj creates a new output channel object OOChannel.obj_output_channel which outputs each line to output_obj using sp as a line separator. If sp is omitted, linefeed (LF) is used.
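The separator conversion performed by these channels can be sketched as a pure function over a list of code points. This is a stand-alone illustration under the assumption that CRLF is treated as a single separator; the function name normalize is hypothetical and is not part of the library.

```ocaml
(* Code points of the separators listed above. *)
let cr = 0x0D
let lf = 0x0A
let nel = 0x85
let ls = 0x2028
let ps = 0x2029

(* Replace every CR, LF, CRLF, NEL, LS or PS in a list of code points
   by the code points in [sep]. CRLF counts as one separator, so the
   CR LF pair is checked before the single-character cases. *)
let rec normalize (sep : int list) (text : int list) : int list =
  match text with
  | [] -> []
  | c1 :: c2 :: rest when c1 = cr && c2 = lf -> sep @ normalize sep rest
  | c :: rest when c = cr || c = lf || c = nel || c = ls || c = ps ->
      sep @ normalize sep rest
  | c :: rest -> c :: normalize sep rest
```

For example, normalising with [lf] turns "a CR LF b NEL c" into "a LF b LF c", and normalising with [cr; lf] produces Windows-style line endings.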
Camomile has a locale system similar to Java's. A locale is a string of the form "<LANG>_<COUNTRY>_<MODIFIER>..." where <LANG> is a 2-letter ISO 639 language code and <COUNTRY> is a 2-letter ISO 3166 country code. Some fields may be absent.
Type of locales.
read root suffix reader locale reads locale information using reader. Locale data is supposed to reside in the root directory under the name locale.suffix. reader takes an in_channel as an argument and reads data from it. If the data is not found, reader should raise Not_found. If the file is not found or reader raises Not_found, more generic locales are tried. For example, if fr_CA.suffix is not found, read tries fr.suffix. If fr.suffix is also not found, the file root.suffix is tried. If the data is still not found, Not_found is raised.
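The fallback order can be sketched as follows. The names fallbacks and find_locale_file are hypothetical, and the sketch assumes that the final candidate is a file literally named root.suffix inside the root directory. The exists predicate is passed in so that the example does not touch the file system.

```ocaml
(* Candidate locale names, most specific first, ending with "root"
   (assumed to be the name of the most generic data file). *)
let rec fallbacks (locale : string) : string list =
  match String.rindex_opt locale '_' with
  | Some i -> locale :: fallbacks (String.sub locale 0 i)
  | None -> [ locale; "root" ]

(* Hypothetical lookup: return the first candidate path accepted by
   [exists], or raise Not_found when no candidate is available. *)
let find_locale_file ~exists ~root ~suffix locale =
  let path name = Filename.concat root (name ^ "." ^ suffix) in
  match List.find_opt exists (List.map path (fallbacks locale)) with
  | Some p -> p
  | None -> raise Not_found
```

With the locale "fr_CA", the candidates are tried in the order fr_CA, fr, root, which mirrors the search described above.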
contain loc1 loc2: true if loc1 is contained in loc2, false otherwise. For example, "fr" is contained in "fr_CA", while "en_CA" does not contain "fr".
UTF-8 encoded Unicode strings. The type is normal string.
validate s succeeds if s is valid UTF-8, otherwise raises Malformed_code. Other functions assume strings are valid UTF-8, so it is prudent to test their validity for strings from untrusted origins.
Positions in the string, represented by the number of bytes from the head. The location of the first character is 0.
next s i returns the position of the head of the Unicode character located immediately after i. If i is inside s, the function always succeeds. If i is inside s and there is no Unicode character after i, the position outside s is returned. If i is not inside s, the behaviour is unspecified.
prev s i returns the position of the head of the Unicode character located immediately before i. If i is inside s, the function always succeeds. If i is inside s and there is no Unicode character before i, the position outside s is returned. If i is not inside s, the behaviour is unspecified.
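Byte-position navigation of this kind can be sketched with the stdlib alone. The version below is an illustration, not the library's code: it assumes the string is valid UTF-8, that i is the head of a character, and it handles only sequences of up to four bytes.

```ocaml
(* True for continuation bytes of the form 0b10xxxxxx. *)
let is_continuation (c : char) : bool = Char.code c land 0xC0 = 0x80

(* Byte position of the character head that follows position [i].
   The length of the sequence is read from the lead byte. *)
let next (s : string) (i : int) : int =
  let n = Char.code s.[i] in
  if n < 0x80 then i + 1
  else if n < 0xE0 then i + 2
  else if n < 0xF0 then i + 3
  else i + 4

(* Byte position of the character head that precedes position [i].
   Returns -1, a position outside [s], when [i] is the first character. *)
let prev (s : string) (i : int) : int =
  let rec back j =
    if j >= 0 && is_continuation s.[j] then back (j - 1) else j
  in
  back (i - 1)
```

On the last character, next returns the length of the string, which is the position outside s mentioned above.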
UTF-16 encoded string. The type is the bigarray of 16-bit integers. The characters must be 21-bit code points, and not surrogate points, 0xfffe or 0xffff. Bigarray.cma or Bigarray.cmxa must be linked when this module is used.
validate s succeeds if s is valid UTF-16, otherwise raises Malformed_code. Other functions assume strings are valid UTF-16, so it is prudent to test their validity for strings from untrusted origins.
All functions below assume strings are valid UTF-16. If not, the result is unspecified.
Positions in the string, represented by the number of 16-bit units from the head. The location of the first character is 0.
next s i returns the position of the head of the Unicode character located immediately after i. If i is a valid position, the function always succeeds. If i is a valid position and there is no Unicode character after i, the position outside s is returned. If i is not a valid position, the behaviour is undefined.
prev s i returns the position of the head of the Unicode character located immediately before i. If i is a valid position, the function always succeeds. If i is a valid position and there is no Unicode character before i, the position outside s is returned. If i is not a valid position, the behaviour is undefined.
UCS4 encoded string. The type is the bigarray of 32-bit integers. Bigarray.cma or Bigarray.cmxa must be linked when this module is used.
validate s succeeds if s is valid UCS4, otherwise raises Malformed_code. Other functions assume strings are valid UCS4, so it is prudent to test their validity for strings from untrusted origins.
All functions below assume strings are valid UCS4. If not, the result is unspecified.
Positions in the string, represented by the number of characters from the head. The location of the first character is 0.
next s i returns the position of the head of the Unicode character located immediately after i. If i is a valid position, the function always succeeds. If i is a valid position and there is no Unicode character after i, the position outside s is returned. If i is not a valid position, the behaviour is undefined.
prev s i returns the position of the head of the Unicode character located immediately before i. If i is a valid position, the function always succeeds. If i is a valid position and there is no Unicode character before i, the position outside s is returned. If i is not a valid position, the behaviour is undefined.
Functions for toplevel
Aliases for UChar.uint_code, UChar.chr_of_uint
Regular expression engine.
Match semantics.
regexp_match ?sem r t i tries to match r against substrings of t beginning at i. If the match succeeds, Some g is returned, where g is an array whose n-th element contains the string matched by the n-th group. The string matched by the whole of r is stored in the 0-th element. If matching fails, None is returned.
string_match r t i tests whether r can match a substring of t beginning at i.
search_forward ?sem r t i searches for a substring of t matching r, starting from i. The returned value is similar to that of URe.Type.regexp_match.
Module for character encodings.
Failure of decoding
Failure of encoding
Type for encodings.
new_enc name enc registers the new encoding enc under the name name.
alias alias name: defines alias as an alias of the encoding with the name name.
Returns the encoding of the given name. Fails if the encoding is unknown. For the encodings defined by charmap, encoding names are the same as the codeset names in the charmap files. See the charmaps directory in the source directory for the available encodings. In addition to the encodings defined via charmap files, Camomile supports ISO-2022-CN, ISO-2022-JP, ISO-2022-JP-2, ISO-2022-KR, jauto (auto-detection of Japanese encodings), UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE and UCS-4 (big-endian order). An encoding can also be referred to as "IANA/<IANA name>", if the encoding is supported.
Shortcuts
new uchar_input_channel_of enc c_in creates a new input channel which converts characters to Unicode using encoding enc.
new uchar_output_channel_of enc c_out creates a new output channel which converts Unicode to its byte representation using encoding enc.
new convert_uchar_input enc c_in creates a new channel which converts Unicode input to its byte representation using encoding enc.
new convert_uchar_output enc c_in creates a new channel which converts character output to Unicode using encoding enc.
new convert_input in_enc out_enc c_in creates a new input channel using encoding out_enc from the input channel using encoding in_enc.
new convert_output in_enc out_enc c_in creates a new output channel using encoding in_enc from the output channel using encoding out_enc.
new out_channel enc outchan creates an output channel object OOChannel.obj_output_channel which receives Unicode characters and outputs them to outchan using the encoding enc.
new in_channel enc inchan creates an input channel object OOChannel.obj_input_channel which reads bytes from inchan and converts them to Unicode characters.
Unicode character information
Character Information
Type of Unicode general character categories. Each variant specifies
`Lu: Letter, Uppercase
`Ll: Letter, Lowercase
`Lt: Letter, Titlecase
`Mn: Mark, Non-Spacing
`Mc: Mark, Spacing Combining
`Me: Mark, Enclosing
`Nd: Number, Decimal Digit
`Nl: Number, Letter
`No: Number, Other
`Zs: Separator, Space
`Zl: Separator, Line
`Zp: Separator, Paragraph
`Cc: Other, Control
`Cf: Other, Format
`Cs: Other, Surrogate
`Co: Other, Private Use
`Cn: Other, Not Assigned
`Lm: Letter, Modifier
`Lo: Letter, Other
`Pc: Punctuation, Connector
`Pd: Punctuation, Dash
`Ps: Punctuation, Open
`Pe: Punctuation, Close
`Pi: Punctuation, Initial
`Pf: Punctuation, Final
`Po: Punctuation, Other
`Sm: Symbol, Math
`Sc: Symbol, Currency
`Sk: Symbol, Modifier
`So: Symbol, Other
Type of character properties
Load the table for the given character type.
Load the set of characters of the given character type.
Type for script type
age
age c: the Unicode version in which c was introduced.
older v1 v2 is true if v1 is older than (or the same version as) v2. Everything is older than `Nc.
casing
Decomposition
Types of decomposition.
Canonical Composition
Unicode normal form (NFD, NFKD, NFC, NFKC) as described in UTR #15
Unicode collation algorithm
String comparison by collation as described in UTR #10
How variables are handled
Strength of comparison. For European languages, each strength roughly means the following.
`Primary: ignore accents and case.
`Secondary: ignore case, but take accents into account.
`Tertiary: take accents and case into account.
For `Shifted and `Shift_Trimmed, there is a fourth strength.
`Quaternary: variables such as - (hyphen) are taken into account.
For locale, see Locale. If locale is omitted, the standard UCA order is used. If prec is omitted, the maximum possible strength is used. If variable is omitted, the default of the locale (usually `Shifted) is used.
The meaning of the returned value is similar to Pervasives.compare.
Binary comparison of sort_key gives the same result as compare, i.e. compare t1 t2 = Pervasives.compare (sort_key t1) (sort_key t2). If the same texts are compared repeatedly, pre-computing sort_key gives better performance.
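The benefit of pre-computing keys can be sketched with a stand-in key function. Here sort_key is a hypothetical placeholder (ASCII case folding) rather than a real collation key, and sort_with_keys is an illustrative helper, not a library function.

```ocaml
(* Hypothetical key function standing in for a collation sort key:
   here, plain ASCII case folding. *)
let sort_key (s : string) : string = String.lowercase_ascii s

(* Comparison defined through the key, mirroring the identity
   compare t1 t2 = compare (sort_key t1) (sort_key t2). *)
let compare_text (a : string) (b : string) : int =
  compare (sort_key a) (sort_key b)

(* Compute each key once, sort on the keys, then drop them. *)
let sort_with_keys (texts : string list) : string list =
  texts
  |> List.map (fun t -> (sort_key t, t))
  |> List.sort (fun (k1, _) (k2, _) -> compare k1 k2)
  |> List.map snd
```

Sorting n texts performs on the order of n log n comparisons, so computing each key once instead of inside every comparison avoids repeated work when key construction is expensive.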
Comparison with the sort key.
Module for a Str-like regular expression syntax. The difference can be summarised as follows.
These functions are similar to Str.
regexp_match ?sem r t i tries to match r against substrings of t beginning at i. If the match succeeds, Some g is returned, where g is an array whose n-th element contains the string matched by the n-th group. The string matched by the whole of r is stored in the 0-th element. If matching fails, None is returned.
string_match r t i tests whether r can match a substring of t beginning at i.
search_forward ?sem r t i searches for a substring of t matching r, starting from i. The returned value is similar to that of URe.Type.regexp_match.
All-in-one, configure once modules