Uniform Resource Locators (URLs)
The tutorial has been moved to Neturl_tut.
This module provides functions to parse URLs, to print URLs, to store URLs, to modify URLs, and to apply relative URLs.
URLs are strings formed according to pattern (1) or (2):
(1) scheme://user;userparams:password@host:port/path;params?query#fragment
(2) scheme:other;params?query#fragment
The word at the beginning of the URL identifies the URL scheme (such as "http" or "file"). Depending on the scheme, not all of the parts are allowed, or parts may be omitted. This module defines the type url_syntax whose values describe which parts are allowed/required/not allowed for a concrete URL scheme (see below).
Not all characters are allowed in a URL. Some characters are allowed,
but have the special task to separate the various parts of the URL
(reserved characters).
However, it is possible to include even invalid or reserved characters as normal content by applying the %-encoding on these characters: A '%' indicates that an encoded character follows, and the character is denoted by a two-digit hexadecimal number (e.g. %2f for '/').
In the following descriptions, the term "encoded string" means a string containing such %-encoded characters, and the "decoded string" means a string not containing such characters.
See the module Netencoding.Url for functions encoding or decoding strings.
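As a rough illustration of the %-encoding described above, here is a stand-alone sketch that uses only the standard library. The helpers pct_encode and pct_decode are hypothetical names for this example; they are not the Netencoding.Url functions, and the set of characters left unescaped is an assumption made for the sketch.

```ocaml
(* Hypothetical helpers for illustration; not part of Neturl or Netencoding. *)

(* Assumption: only letters, digits, '-', '_' and '.' are left unescaped. *)
let is_plain c =
  match c with
  | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '-' | '_' | '.' -> true
  | _ -> false

(* Replace every other byte by '%' followed by two hexadecimal digits. *)
let pct_encode s =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      if is_plain c then Buffer.add_char b c
      else Buffer.add_string b (Printf.sprintf "%%%02X" (Char.code c)))
    s;
  Buffer.contents b

(* Numeric value of one hexadecimal digit. *)
let hex_value c =
  match c with
  | '0' .. '9' -> Char.code c - Char.code '0'
  | 'a' .. 'f' -> Char.code c - Char.code 'a' + 10
  | 'A' .. 'F' -> Char.code c - Char.code 'A' + 10
  | _ -> invalid_arg "pct_decode: bad hexadecimal digit"

(* Turn each "%XY" back into the byte it denotes. *)
let pct_decode s =
  let n = String.length s in
  let b = Buffer.create n in
  let rec loop i =
    if i < n then
      if s.[i] = '%' then begin
        if i + 2 >= n then invalid_arg "pct_decode: truncated escape";
        let code = (16 * hex_value s.[i + 1]) + hex_value s.[i + 2] in
        Buffer.add_char b (Char.chr code);
        loop (i + 3)
      end else begin
        Buffer.add_char b s.[i];
        loop (i + 1)
      end
  in
  loop 0;
  Buffer.contents b
```

The sketch works on bytes, so a multi-byte character is encoded one byte at a time.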
The type url describes values storing the components of a URL, and the url_syntax for the URL. In general, the components are stored as encoded strings; however, the %-encoding is not applicable to all components.
For convenience, the functions creating, modifying, and accessing URLs can handle both encoded and decoded strings. To avoid errors, these functions take and return strings in their decoded form by default; pass encoded:true to work with encoded strings.
Note that there is currently no function to compare URLs. The canonical comparison ( = ) is not applicable because the same URL may be written in different ways.
Note that nothing is said about the character set/encoding of URLs. Some protocols and standards prefer UTF-8 as fundamental encoding and apply the %-encoding on top of it; i.e. the byte sequence representing a character in UTF-8 is %-encoded.
Standards Compliance
This module implements RFC 1738 and RFC 1808. There is also a newer RFC, 2396, updating the former RFCs, but this module is not fully compatible with RFC 2396. The following (minor) problems may occur:
imap URLs.
In one point, RFC 2396 is preferred: a question mark may directly follow the host, as in "http://host?query". This is illegal in RFC 1738. The consequence is, however, that question marks in user strings must be escaped.
RFC 3986 introduces IPv6 addresses. These are now supported (but see the comments below).
exception Malformed_URL: Raised by a number of functions when encountering a badly formed URL.
extract_url_scheme: Returns the URL scheme from the string representation of a URL. E.g. extract_url_scheme "http://host/path" = "http". The scheme name is always converted to lowercase characters. Raises Malformed_URL if the scheme name is not found.
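Since extract_url_scheme signals a missing scheme with an exception, callers often wrap it. The following is a minimal sketch, assuming Ocamlnet's netstring library (which provides Neturl) is linked; scheme_opt is a hypothetical helper name, not part of the module.

```ocaml
(* Assumes Ocamlnet's netstring library (module Neturl) is linked. *)

(* Hypothetical helper: return the scheme, or None when there is none. *)
let scheme_opt (s : string) : string option =
  try Some (Neturl.extract_url_scheme s)
  with Neturl.Malformed_URL -> None

let () =
  match scheme_opt "HTTP://host/path" with
  | Some sch -> print_endline sch (* the scheme is reported in lowercase *)
  | None -> print_endline "no scheme"
```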
type url_syntax = {
  url_enable_scheme : url_syntax_option;
  url_enable_user : url_syntax_option;
  url_enable_user_param : url_syntax_option;
  url_enable_password : url_syntax_option;
  url_enable_host : url_syntax_option;
  url_enable_port : url_syntax_option;
  url_enable_path : url_syntax_option;
  url_enable_param : url_syntax_option;
  url_enable_query : url_syntax_option;
  url_enable_fragment : url_syntax_option;
  url_enable_other : url_syntax_option;
  url_accepts_8bits : bool;
  url_is_valid : url -> bool;
  url_enable_relative : bool;
}
Values of type url_syntax describe which components of a URL are recognized, which are allowed (and optional), and which are required. Not all combinations are valid; the predicate expressed by the function url_syntax_is_valid must hold.
The function url_is_valid is applied when a fresh URL is created and must return true. This function makes it possible to add an arbitrary validity criterion to url_syntax. (Note that the URL passed to this function is not fully working; you can safely assume that the accessor functions url_scheme etc. can be applied to it.)
Switch url_accepts_8bits: If true, the bytes with code 128 to 255 are treated like alphanumeric characters; if false, these bytes are illegal (but it is still possible to include such bytes in their encoded form: %80 to %FF).
Switch url_enable_relative: If true, the syntax allows relative URLs in principle. Actually, parsing of relative URLs is possible when the optional parts are flagged as Url_part_allowed and not as Url_part_required. However, it is useful to specify URL syntaxes always as absolute URLs, and to weaken them on demand when a relative URL is found by the parser. This switch enables that. In particular, the function partial_url_syntax checks this flag.
Values of type url describe concrete URLs. Every URL must have a fundamental url_syntax, and it is only possible to create URLs conforming to the syntax. See make_url for further information.
url_syntax_is_valid: Checks whether the passed url_syntax is valid. This means:
partial_url_syntax: Transforms the syntax into another syntax where all required parts are changed into optional parts.
null_url_syntax: A URL syntax that recognizes nothing. Use this as base for your own definitions, e.g.
  let my_syntax = { null_url_syntax with
                      url_enable_host = Url_part_required; ... }
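For a complete definition without elided fields, a sketch along the following lines may help. It assumes Ocamlnet's netstring library is linked; the choice of parts is hypothetical and only meant to show the record-update pattern together with the url_syntax_is_valid check.

```ocaml
(* Assumes Ocamlnet's netstring library (module Neturl) is linked. *)
open Neturl

(* Hypothetical syntax for illustration: scheme and host are mandatory,
   port and path are optional, everything else stays unrecognized. *)
let my_syntax =
  { null_url_syntax with
    url_enable_scheme = Url_part_required;
    url_enable_host = Url_part_required;
    url_enable_port = Url_part_allowed;
    url_enable_path = Url_part_allowed;
  }

(* Check the combination before using it. *)
let my_syntax_ok = url_syntax_is_valid my_syntax
```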
ip_url_syntax: Syntax for IP based protocols. This syntax allows scheme, user, password, host, port, path, param, query, fragment, but not "other". It does not accept 8 bit bytes.
common_url_syntax: Syntax descriptions for common URL schemes. The key of the hashtable is the scheme name, and the value is the corresponding syntax.
- "file": scheme, host?, path
- "ftp": scheme, user?, password?, host, port?, path?, param? Note: param is not checked.
- "http", "https": scheme, user?, password?, host, port?, path?, query?
- "mailto": scheme, other, query? (RFC 2368)
- "pop", "pops": scheme, user?, user_param?, password?, host, port? Note: user_param is not checked. (RFC 2384)
- "imap", "imaps": scheme, user?, user_param?, password?, host, port?, path?, query? (RFC 2192) Note: "param" is intentionally not recognized to get the resolution of relative URLs as described in the RFC. When analysing this kind of URL, it is recommended to re-parse it with "param" enabled.
- "news": scheme, other (RFC 1738)
- "nntp", "nntps": scheme, host, port?, path (with two components) (RFC 1738)
- "data": scheme, other (RFC 2397). "other" is not further decomposed.
- "ipp", "ipps": scheme, host, port?, path?, query? (RFC 3510)
- "cid", "mid": Content/message identifiers: scheme, other
Notes:
- To weaken these syntaxes for relative URLs, see partial_url_syntax.
- To allow fragment identifiers, set url_enable_fragment to Url_part_allowed. E.g.
  { file_url_syntax with url_enable_fragment = Url_part_allowed }
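The record update above can be applied to a syntax taken from common_url_syntax. The sketch below assumes Ocamlnet's netstring library is linked; the example host and the names http_syntax and http_with_fragment are hypothetical.

```ocaml
(* Assumes Ocamlnet's netstring library (module Neturl) is linked. *)
open Neturl

(* Look up the predefined syntax for "http". *)
let http_syntax = Hashtbl.find common_url_syntax "http"

(* Derive a variant that also recognizes "#fragment". *)
let http_with_fragment =
  { http_syntax with url_enable_fragment = Url_part_allowed }

(* Parse with the derived syntax and read the fragment back. *)
let u = url_of_string http_with_fragment "http://www.example.org/doc#intro"
let frag = url_fragment u
```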
make_url: Creates a URL from components:
- scheme and host are simple strings to which the %-encoding is not applicable. host may be a (DNS) name, an IPv4 address as "dotted quad", or an IPv6 address enclosed in brackets.
- addr also sets host, but directly from an inet_addr.
- port is a simple number. The %-encoding is not applicable here either.
- socksymbol sets both host and port from the socksymbol of type `Inet or `Inet_byname.
- user, password, query, fragment, and other are strings which may contain %-encoded characters. By default, you can pass any string for these components, and problematic characters are automatically encoded. If you set encoded:true, the passed strings must already be encoded, but the function checks whether the encoding is syntactically correct. Note that for query even the characters '?' and '=' are encoded by default, so you need to set encoded:true to pass a reasonable query string.
- user_param, path and param are lists of strings which may contain %-encoded characters. Again, the default is to pass decoded strings to the function, and the function encodes them automatically, and by setting encoded:true the caller is responsible for encoding the strings. Passing empty lists for these components means that they are not part of the constructed URL. See below for the representation of these components.
socksymbol has precedence over addr, which has precedence over host. socksymbol also has precedence over port.
The strings representing the components do not contain the characters separating the components from each other.
The created URL must conform to the url_syntax, i.e.:
- It contains only components that the syntax recognizes.
- It contains all components that the syntax requires.
- It satisfies the url_is_valid function of the syntax.
The path of a URL is represented as a list of '/'-separated path components, i.e.
  [ s1; s2; ...; sN ]
represents the path
  s1 ^ "/" ^ s2 ^ "/" ^ ... ^ "/" ^ sN
As special cases:
- [] is the non-existing path
- [ "" ] is "/"
- [ "";"" ] is illegal
Except for s1 and sN, the path components must not be empty strings.
To avoid ambiguities, it is illegal to create URLs with both relative paths (s1 <> "") and host components.
Parameters of URLs (param and user_param) are components beginning with ';'. The list of parameters is represented as a list of strings where the strings contain the value following ';'.
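To make the component rules more concrete, here is a small sketch of building URLs from parts. It assumes Ocamlnet's netstring library is linked and that make_url takes the components as optional labelled arguments followed by the syntax; the host and path values are hypothetical.

```ocaml
(* Assumes Ocamlnet's netstring library (module Neturl) is linked. *)

let http_syntax = Hashtbl.find Neturl.common_url_syntax "http"

(* With encoded:true the strings are taken as already encoded, so the
   '=' in the query is kept as a separator. *)
let u =
  Neturl.make_url
    ~encoded:true
    ~scheme:"http"
    ~host:"www.example.org"
    ~port:8080
    ~path:[ ""; "docs"; "index.html" ] (* leading "" makes the path absolute *)
    ~query:"lang=en"
    http_syntax

let s = Neturl.string_of_url u

(* Without encoded:true, decoded strings are passed and encoded internally;
   the accessor returns the decoded form again by default. *)
let u2 =
  Neturl.make_url
    ~scheme:"http"
    ~host:"www.example.org"
    ~path:[ ""; "my file.html" ]
    http_syntax

let path2 = Neturl.url_path u2
```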
modify_url: Modifies the passed components and returns the modified URL. The modified URL shares unmodified components with the original URL.
remove_from_url: Removes the components passed with the value true from the URL, and returns the modified URL. The modified URL shares unmodified components with the original URL.
default_url: Adds missing components and returns the modified URL. The modified URL shares unmodified components with the original URL.
undefault_url: Removes components from the URL if they have the passed value, and returns the modified URL. Note: The values must always be passed in encoded form! The modified URL shares unmodified components with the original URL.
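The four operations above can be compared side by side in a short sketch. It assumes Ocamlnet's netstring library is linked and that the functions are named modify_url, remove_from_url, default_url and undefault_url as in the Neturl interface; the URL itself is a made-up example.

```ocaml
(* Assumes Ocamlnet's netstring library (module Neturl) is linked. *)

let u0 = Neturl.parse_url "http://alice@www.example.org/a"

(* Set a component, whether or not it was present before. *)
let with_port = Neturl.modify_url ~port:8080 u0

(* Drop the components flagged with true. *)
let no_user = Neturl.remove_from_url ~user:true with_port

(* Add a component only if it is missing. *)
let defaulted = Neturl.default_url ~port:80 u0
let kept = Neturl.default_url ~port:80 with_port

(* Remove a component only if it has exactly the passed value. *)
let undefaulted = Neturl.undefault_url ~port:80 defaulted
```

None of these calls changes u0; each returns a new URL value.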
url_of_string: Parses the passed string according to the passed url_syntax.
parse_url: Parses the string and returns the URL the string represents. If the URL is absolute (i.e. begins with a scheme like "http:..."), the syntax will be looked up in schemes. If the URL is relative, the base_syntax will be taken if passed. Without base_syntax, relative URLs cannot be parsed.
- schemes: The table of syntaxes, indexed by scheme name. The default is common_url_syntax.
- base_syntax: The syntax assumed for relative URLs. The default is to raise Malformed_URL on a relative URL.
- accept_8bits: If false, the default, it depends on the syntax descriptions in schemes whether 8 bit characters are accepted in the input or not. If true, 8 bit characters are always accepted.
- enable_fragment: If false, the default, it depends on the syntax descriptions in schemes whether fragment identifiers (e.g. "#fragment") are recognized or not. If true, fragments are always recognized.
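The following sketch shows the parsing options in use. It assumes Ocamlnet's netstring library is linked and that the optional arguments are labelled base_syntax and enable_fragment; the URLs and the helper name parses are hypothetical.

```ocaml
(* Assumes Ocamlnet's netstring library (module Neturl) is linked. *)

(* An absolute URL: the syntax is looked up by scheme. *)
let u = Neturl.parse_url "http://www.example.org:8080/a/b?x=1"

(* A relative URL needs a base syntax to be parsed at all. *)
let http_syntax = Hashtbl.find Neturl.common_url_syntax "http"
let rel = Neturl.parse_url ~base_syntax:http_syntax "c/d"

(* Hypothetical helper: does the string parse with default settings? *)
let parses (s : string) : bool =
  try ignore (Neturl.parse_url s); true
  with Neturl.Malformed_URL -> false

(* Force recognition of the fragment regardless of the scheme's syntax. *)
let with_frag =
  Neturl.parse_url ~enable_fragment:true "http://www.example.org/a#top"
```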
fixup_url_string: Escapes some unsafe or "unwise" characters that are commonly used in URL strings: space, < > { } ^ \ | and double quotes. Call this function before parsing the URL to support these characters.
If escape_hash is set, '#' is also escaped.
Change: Since Ocamlnet-3.4, square brackets are no longer fixed up, because they now have a legal use to denote IPv6 addresses.
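A typical use is to clean up a string taken from user input before parsing it. This is a minimal sketch, assuming Ocamlnet's netstring library is linked and that the function is named fixup_url_string; the input string is a made-up example.

```ocaml
(* Assumes Ocamlnet's netstring library (module Neturl) is linked. *)

(* A string with an unescaped space, as a user might type it. *)
let raw = "http://www.example.org/my file.html"

(* Escape the space first, then parse the result. *)
let fixed = Neturl.fixup_url_string raw
let u = Neturl.parse_url fixed

(* The accessor returns the decoded path component again. *)
let path = Neturl.url_path u
```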
url_provides: Returns true iff the URL has all of the components passed with true value.
Accessor functions (url_scheme etc.): Return components of the URL. The functions return decoded strings unless encoded:true is set. If the component does not exist, the exception Not_found is raised.
Note that IPv6 addresses, when returned by url_host, are enclosed in square brackets. Modules calling url_host may require porting to support this syntax variant.
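The difference between decoded and encoded access, and the Not_found convention, can be sketched as follows. The example assumes Ocamlnet's netstring library is linked; port_or is a hypothetical helper and the URLs are made up.

```ocaml
(* Assumes Ocamlnet's netstring library (module Neturl) is linked. *)

let u = Neturl.parse_url "http://www.example.org/a%20b"

(* Default: decoded strings. *)
let decoded = Neturl.url_path u

(* With encoded:true the stored, %-encoded form is returned. *)
let encoded = Neturl.url_path ~encoded:true u

(* Hypothetical helper: fall back to a default when the port is absent. *)
let port_or (default : int) (url : Neturl.url) : int =
  try Neturl.url_port url with Not_found -> default
```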
url_socksymbol url default_port: Returns the host and port parts of the URL as socksymbol. If the port is missing in the URL, default_port is substituted. If the host is missing in the URL, the exception Not_found is raised.
split_path: Splits a '/'-separated path into components (e.g. to set up the path argument of make_url). E.g.
  split_path "a/b/c" = [ "a"; "b"; "c" ],
  split_path "/a/b" = [ ""; "a"; "b" ],
  split_path "a/b/" = [ "a"; "b"; "" ]
Beware that split_path ".." returns [".."] while split_path "../" returns [".."; ""]. The two will behave differently, for example when used with Neturl.apply_relative_url.
join_path: Concatenates the path components (reverse function of split_path).
norm_path: Removes "." and ".." from the path if possible. Deletes double slashes.
Examples
norm_path ["."] = []
norm_path ["."; ""] = []
norm_path ["a"; "."] = ["a"; ""]
norm_path ["a"; "b"; "."] = ["a"; "b"; ""]
norm_path ["a"; "."; "b"; "."] = ["a"; "b"; ""]
norm_path [".."] = [".."; ""]
norm_path [".."; ""] = [".."; ""]
norm_path ["a"; "b"; ".."; "c" ] = ["a"; "c"]
norm_path ["a"; "b"; ".."; "c"; ""] = ["a"; "c"; ""]
norm_path ["";"";"a";"";"b"] = [""; "a"; "b"]
norm_path ["a"; "b"; ""; ".."; "c"; ""] = ["a"; "c"; ""]
norm_path ["a"; ".."] = []
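The three path functions compose naturally when a path string has to be normalized. The sketch below assumes Ocamlnet's netstring library is linked and that the concatenation function is named join_path; normalize is a hypothetical helper name.

```ocaml
(* Assumes Ocamlnet's netstring library (module Neturl) is linked. *)

(* Hypothetical helper: split, normalize, and join a path string. *)
let normalize (p : string) : string =
  Neturl.join_path (Neturl.norm_path (Neturl.split_path p))

let () = print_endline (normalize "a/b/../c/")
```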
apply_relative_url base rel: Interprets rel relative to base and returns the new URL. This function implements RFC 1808.
It is not necessary that rel has the same syntax as base. Note, however, that it is checked whether the resulting URL is syntactically correct with the syntax of base. If not, the exception Malformed_URL will be raised.
Examples (the URLs are represented as strings, see Neturl.split_path to split them for Neturl.make_url):
  base="x/y", url="a/b" => result="x/a/b"
  base="x/y/", url="a/b" => result="x/y/a/b"
  base="x/y/..", url="a/b" => result="x/y/a/b" (beware!)
  base="x/y/../", url="a/b" => result="x/a/b"
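With complete URLs, the same resolution can be sketched as follows. The example assumes Ocamlnet's netstring library is linked; resolve is a hypothetical helper, and the relative reference is parsed with the "http" syntax as its base syntax.

```ocaml
(* Assumes Ocamlnet's netstring library (module Neturl) is linked. *)

let http_syntax = Hashtbl.find Neturl.common_url_syntax "http"

(* Hypothetical helper: resolve a relative reference against an
   absolute base URL and print the result as a string. *)
let resolve (base : string) (rel : string) : string =
  let b = Neturl.parse_url base in
  let r = Neturl.parse_url ~base_syntax:http_syntax rel in
  Neturl.string_of_url (Neturl.apply_relative_url b r)
```

As in the path-only examples, whether the base path ends in a slash decides whether its last component is replaced or kept.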
ensure_absolute_url: If the anonymous URL (the unlabelled argument) is absolute, it is just returned as the result of this function. If the URL is relative, the function tries to make it absolute by resolving it relative to base. If there is no base or if the base URL does not allow the parts that would be added (e.g. if the anonymous URL possesses a fragment and base does not allow that), this will fail, and the function raises Malformed_URL.
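The three cases described above (already absolute, relative with a base, relative without a base) can be sketched like this. The example assumes Ocamlnet's netstring library is linked and that the function is named ensure_absolute_url with an optional base argument; the URLs are made up.

```ocaml
(* Assumes Ocamlnet's netstring library (module Neturl) is linked. *)

let base = Neturl.parse_url "http://host/x/y"
let http_syntax = Hashtbl.find Neturl.common_url_syntax "http"
let rel = Neturl.parse_url ~base_syntax:http_syntax "a/b"

(* Relative URL plus base: resolved against the base. *)
let abs1 = Neturl.ensure_absolute_url ~base rel

(* Already absolute: returned as it is. *)
let abs2 = Neturl.ensure_absolute_url base

(* Relative URL without a base: Malformed_URL is raised. *)
let fails_without_base =
  try ignore (Neturl.ensure_absolute_url rel); false
  with Neturl.Malformed_URL -> true
```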
file_url_of_local_path: Generates a URL with "file" scheme from the passed path name. The URL is always absolute, i.e. the current directory is prepended if the path is not absolute.
Note that no character set conversions are performed.
Win32: The input path name may use forward or backward slashes. Absolute paths with drive letters and UNC paths are recognised. Relative paths with drive letters, however, are not recognised (e.g. "c:file"), as it is not possible to access the drive-specific working directory from the OCaml runtime.
Cygwin: The input path name may use forward or backward slashes. Absolute paths with drive letters and UNC paths are recognised. The former are translated to "/cygdrive" names.
getcwd: The function used to obtain the current directory. Default: Sys.getcwd
local_path_of_file_url: Extracts the path from an absolute file URL, and returns a correct path name.
If the URL is not a file URL, or is not absolute, the function will fail.
Win32: The URL must either contain a drive letter, or must refer to another host.
Cygwin: Drive letters and remote URLs are recognised.
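On a Unix-like system, converting a path to a file URL and back should give an absolute path again. The sketch below assumes Ocamlnet's netstring library is linked, that the two functions are named file_url_of_local_path and local_path_of_file_url, and that the former accepts an optional getcwd argument; the directory "/home/user" is a made-up stand-in for the working directory.

```ocaml
(* Assumes Ocamlnet's netstring library (module Neturl) is linked.
   The behaviour sketched here is for Unix-like systems only. *)

(* Hypothetical helper: path -> file URL -> path, with a fixed
   working directory so the result does not depend on the process. *)
let roundtrip (path : string) : string =
  let u =
    Neturl.file_url_of_local_path ~getcwd:(fun () -> "/home/user") path
  in
  Neturl.local_path_of_file_url u

let () =
  if Sys.os_type = "Unix" then print_endline (roundtrip "notes.txt")
```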