Erlsom
This is the documentation for Erlsom v. 1.2.1. Erlsom is a set of functions to parse (and generate) XML documents.
Erlsom can be used in a couple of very different modes:
- As a SAX parser.
This is a more or less standardized model (see http://www.saxproject.org/apidoc/org/xml/sax/ContentHandler.html) for parsing XML. Every time the parser has processed a meaningful
part of the XML document (such as a start tag), it will tell your application
about this. The application can process this information – potentially in
parallel – while the parser continues to parse the rest of the document. The
SAX parser will allow you to efficiently parse XML documents of arbitrary size,
but it may take some time to get used to it. If you invest some effort, you may
find that it fits very well with the Erlang programming model – personally I
have always been very happy about my choice to use a SAX parser as the basis
for the rest of Erlsom. You will find a couple of examples of how the SAX parser
can be used below.
- As a simple sort of DOM parser.
Erlsom can translate your XML to the ‘simple form’ that is used by Xmerl. This
is a form that is easy to understand, but you have to search your way through
the output to get to the information that you need. Section … provides an
example. (Note: in the examples directory you will also find an example that
translates the XML to the more complex output format that is produced by Xmerl.
On this output you can use the Xpath functions that come with Xmerl – but I
haven’t tested this extensively).
- As a ‘data binder’.
Erlsom can translate the XML document to an Erlang data structure that
corresponds to an XML Schema. It has the advantage over the SAX parser that it
validates the XML document, and that you know exactly what the layout of the
output will be. This makes it easy to access the elements that you need in a
very direct way. Section … gives more information. (See http://www.rpbourret.com/xml/XMLDataBinding.htm
for a general description of XML data binding.)
For all modes the following applies:
- If the document is too big to fit into memory, or if the document arrives in some kind of data stream, it can be passed to the parser in blocks of arbitrary size.
- The parser can work directly on binaries. There is no need to transform binaries to lists before passing the data to Erlsom. Using binaries as input has a positive effect on the memory usage and on the speed (provided that you are using Erlang 12B or later – if you are using an older Erlang version the speed will be better if you transform binaries to lists). The binaries can be latin-1, utf-8 or utf-16 encoded.
- The parser has an option to produce output in binary form (only the character data: names of elements and attributes are always strings). This may be convenient if you want to minimize the memory usage, and/or if you need the result in binary format for further processing. Note that it will slow down the parser slightly. If you select this option the encoding of the result will be utf-8 (irrespective of the encoding of the input document).
Unless otherwise indicated, the examples in the next sections will use the following, very simple XML document as input:
<foo attr="baz"><bar>x</bar><bar>y</bar></foo>
This document is stored in a file called “minimal.xml”, and read into a variable called Xml by the following commands in the shell:
1> {ok, Xml} = file:read_file("minimal.xml").
{ok,<<"<foo attr=\"baz\"><bar>x</bar><bar>y</bar></foo>\r\n">>}
The following, corresponding XSD (“minimal.xsd”) is used in the first example for the data binder:
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="foo" type="foo_type"/>
<xsd:complexType name="foo_type">
<xsd:sequence>
<xsd:element name="bar"
type="xsd:string" maxOccurs="unbounded"/>
</xsd:sequence>
<xsd:attribute name="attr" type="xsd:string"/>
</xsd:complexType>
</xsd:schema>
The example below shows how the example XML (see above) can be processed using the SAX parser:
2> erlsom:parse_sax(Xml, [], fun(Event, Acc) -> io:format("~p~n", [Event]), Acc end).
startDocument
{startElement,[],"foo",[],[{attribute,"attr",[],[],"baz"}]}
{startElement,[],"bar",[],[]}
{characters,"x"}
{endElement,[],"bar",[]}
{startElement,[],"bar",[],[]}
{characters,"y"}
{endElement,[],"bar",[]}
{endElement,[],"foo",[]}
endDocument
{ok,[],"\r\n"}
The function erlsom:parse_sax takes as its arguments: the XML document, an accumulator value and an ‘event processing function’. This function will process the parts of the XML documents that have been parsed. In this example, this function simply prints these events.
The next example does something slightly more meaningful: it counts the number of times the “bar” element occurs in the XML document. Ok, maybe not very useful, but at least this example will produce a result, not only side effects.
3> CountBar = fun(Event, Acc) -> case Event of {startElement, _, "bar", _, _} -> Acc + 1; _ -> Acc end end.
#Fun<erl_eval.12.113037538>
4> erlsom:parse_sax(Xml, 0, CountBar).
{ok,2,"\r\n"}
To describe it in a rather formal way: parse_sax(Xml, Acc0, Fun) calls Fun(Event, AccIn) on successive ‘XML events’ that result from parsing Xml, starting with AccIn == Acc0. Fun/2 must return a new accumulator which is passed to the next call. The function returns {ok, AccOut, Tail}, where AccOut is the final value of the accumulator and Tail the list of characters that follow after the last tag of the XML document. In this example AccOut == 2, since the tag occurs twice.
(Notice how similar this is to lists:foldl(Fun, Acc0, Sax_events), assuming that Sax_events is the list of Sax events – I more or less copied this description from the documentation of the lists module.)
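Because the callback is an ordinary fold function, it can be written and tried out without the parser, by folding it over a hand-written list of events. The sketch below does this with the events printed in the first example, and collects the text of every “bar” element. The module name sax_fold_sketch and its function names are made up for this illustration; they are not part of Erlsom.

```erlang
-module(sax_fold_sketch).
-export([collect/2, bar_texts/1, demo/0]).

%% Event callback in the shape described above: fun(Event, AccIn) -> AccOut.
%% The accumulator is {InsideBar, TextsSoFar}, with the texts in reverse order.
collect({startElement, _Uri, "bar", _Prefix, _Attrs}, {_Inside, Texts}) ->
    {true, Texts};
collect({characters, Chars}, {true, Texts}) ->
    {true, [Chars | Texts]};
collect({endElement, _Uri, "bar", _Prefix}, {_Inside, Texts}) ->
    {false, Texts};
collect(_OtherEvent, Acc) ->
    Acc.

%% Fold the callback over a list of events, the way parse_sax folds it
%% over the events it produces while parsing.
bar_texts(Events) ->
    {_Inside, Texts} = lists:foldl(fun collect/2, {false, []}, Events),
    lists:reverse(Texts).

%% The events are copied from the output of the first SAX example.
demo() ->
    Events = [startDocument,
              {startElement, [], "foo", [], [{attribute, "attr", [], [], "baz"}]},
              {startElement, [], "bar", [], []},
              {characters, "x"},
              {endElement, [], "bar", []},
              {startElement, [], "bar", [], []},
              {characters, "y"},
              {endElement, [], "bar", []},
              {endElement, [], "foo", []},
              endDocument],
    bar_texts(Events).
```

With Erlsom loaded, the same callback could presumably be passed to the parser as erlsom:parse_sax(Xml, {false, []}, fun sax_fold_sketch:collect/2), following the argument order used in the examples above.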
It may still not be very clear to you how this SAX parser can be used to produce useful results. There are some additional examples in the examples directory of the Erlsom distribution. If you are still not convinced you can try to decipher the source code for the ‘data binder’ mode (erlsom_parse.erl) – this was also built on top of the SAX parser.
startDocument
endDocument
Will NOT be sent out in case of an error
{startPrefixMapping, Prefix, URI}
Begin the scope of a prefix - URI namespace mapping
Will be sent immediately before the corresponding startElement event.
{endPrefixMapping, Prefix}
End the scope of a prefix - URI namespace mapping
Will be sent immediately before the corresponding endElement event.
{startElement, Uri, LocalName, Prefix, [Attributes]}
The beginning of an element.
There will be a corresponding endElement (even when the element is
empty).
All three name components will be provided.
[Attributes] is a list of attribute records, see sax.hrl.
Namespace attributes (xmlns:*) will not be reported.
There will be NO attribute values for defaulted attributes!
Providing 'Prefix' instead of 'Qualified name' is probably not quite
in line with the SAX spec, but it appears to be more convenient.
{endElement, Uri, LocalName, Prefix}
The end of an element.
{characters, Characters}
Character data.
All character data will be in one chunk, except if there is a
CDATA section included inside a character section. In that case
there will be separate events for the characters before the CDATA, the
CDATA section and the characters following it (if any, of course).
{ignorableWhitespace, Characters}
If a character data section (as it would be reported by the 'characters'
event, see above) consists ONLY of whitespace, it will be
reported as ignorableWhitespace.
{processingInstruction, Target, Data}
{error, Description}
{internalError, Description}
This mode translates the XML document to a generic data structure. It doesn’t really follow the DOM standard; instead it provides a very simple format. In fact, it is very similar to the format that is defined as the ‘simple-form’ in the Xmerl documentation.
An example will probably be sufficient to explain it:
erlsom:simple_form(Xml).
{ok,{"foo",
[{"attr","baz"}],
[{"bar",[],["x"]},{"bar",[],["y"]}]},
"\r\n"}
Result = {ok, Element, Tail}, where Element = {Tag, Attributes, Content}, Tag is a string (there is an option that allows you to format Tag differently, see the reference section below), Attributes = [{AttributeName, Value}], and Content is a list of Elements and/or strings.
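‘Searching your way through the output’ usually comes down to a few small helper functions. The sketch below works on the Element tuple shown above; it looks up an attribute and collects the text of all children with a given tag. The module simple_form_sketch and its functions are hypothetical helpers, not part of Erlsom, and they assume that Tag is a plain string (the default).

```erlang
-module(simple_form_sketch).
-export([attribute/2, child_texts/2, demo/0]).

%% Value of the attribute Name, or 'undefined' if it is not present.
attribute(Name, {_Tag, Attributes, _Content}) ->
    proplists:get_value(Name, Attributes).

%% Text of every direct child element with tag Tag.
%% Strings in Content are skipped: they do not match the tuple pattern.
child_texts(Tag, {_ParentTag, _Attributes, Content}) ->
    [text(ChildContent) || {ChildTag, _, ChildContent} <- Content,
                           ChildTag =:= Tag].

%% Concatenate the strings in a content list, ignoring nested elements.
text(Content) ->
    lists:append([S || S <- Content, is_list(S)]).

%% Element is the one returned by erlsom:simple_form(Xml) in the example above.
demo() ->
    Element = {"foo",
               [{"attr", "baz"}],
               [{"bar", [], ["x"]}, {"bar", [], ["y"]}]},
    {attribute("attr", Element), child_texts("bar", Element)}.
```

Here demo() returns {"baz", ["x", "y"]}.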
In this mode, Erlsom parses XML documents that are associated with an XSD (or Schema). It checks whether the XML document conforms to the Schema, and it translates the document to an Erlang structure that is based on the types defined in the Schema. This section tries to explain the relation between the Schema and the Erlang data structure that is produced by Erlsom.
First a quick example using the same XML that was used for the other modes. Before we can parse the document we need to ‘compile’ the XML Schema (similar to how you might compile a regular expression).
10> {ok, Model} = erlsom:compile_xsd_file("minimal.xsd").
{ok,{model,[{typ…
Now you can use this compiled model:
11> {ok, Result, _} = erlsom:scan(Xml, Model).
{ok,{foo_type,[],"baz",["x","y"]},"\r\n"}
Assuming that you have defined a suitable record #foo_type{} (erlsom:write_xsd_hrl_file() can do it for you), you can use in your program (won’t work in the shell):
BarValues = Result#foo_type.bar,
AttrValue = Result#foo_type.attr,
Nice and compact, as you see, but it may need more explanation. I will use a more complex example from the XML Schema Primer (XML Schema Part 0: Primer Second Edition) [Primer]. It can be found here: http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/. Sections that have been copied from this document are contained in a blue box.
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation>
<xsd:documentation xml:lang="en">
Purchase order schema for Example.com.
Copyright 2000 Example.com. All rights reserved.
</xsd:documentation>
</xsd:annotation>
<xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
<xsd:element name="comment" type="xsd:string"/>
<xsd:complexType name="PurchaseOrderType">
<xsd:sequence>
<xsd:element name="shipTo" type="USAddress"/>
<xsd:element name="billTo" type="USAddress"/>
<xsd:element ref="comment" minOccurs="0"/>
<xsd:element name="items" type="Items"/>
</xsd:sequence>
<xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
<xsd:complexType name="USAddress">
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="street" type="xsd:string"/>
<xsd:element name="city" type="xsd:string"/>
<xsd:element name="state" type="xsd:string"/>
<xsd:element name="zip" type="xsd:decimal"/>
</xsd:sequence>
<xsd:attribute name="country" type="xsd:NMTOKEN"
fixed="US"/>
</xsd:complexType>
<xsd:complexType name="Items">
<xsd:sequence>
<xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="productName" type="xsd:string"/>
<xsd:element name="quantity">
<xsd:simpleType>
<xsd:restriction base="xsd:positiveInteger">
<xsd:maxExclusive value="100"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
<xsd:element name="USPrice" type="xsd:decimal"/>
<xsd:element ref="comment" minOccurs="0"/>
<xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="partNum" type="SKU" use="required"/>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
<!-- Stock Keeping Unit, a code for identifying products -->
<xsd:simpleType name="SKU">
<xsd:restriction base="xsd:string">
<xsd:pattern value="\d{3}-[A-Z]{2}"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:schema>
example 1: po.xsd
This XSD can be processed by Erlsom: the compiler accepts it, and the parser can parse instances (XML documents) that conform to this schema.
Like the Primer, I will use po.xml as an example XML document.
<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
<shipTo country="US">
<name>Alice Smith</name>
<street>123 Maple Street</street>
<city>Mill Valley</city>
<state>CA</state>
<zip>90952</zip>
</shipTo>
<billTo country="US">
<name>Robert Smith</name>
<street>8 Oak Avenue</street>
<city>Old Town</city>
<state>PA</state>
<zip>95819</zip>
</billTo>
<comment>Hurry, my lawn is going wild!</comment>
<items>
<item partNum="872-AA">
<productName>Lawnmower</productName>
<quantity>1</quantity>
<USPrice>148.95</USPrice>
<comment>Confirm this is electric</comment>
</item>
<item partNum="926-AA">
<productName>Baby Monitor</productName>
<quantity>1</quantity>
<USPrice>39.98</USPrice>
<shipDate>1999-05-21</shipDate>
</item>
</items>
</purchaseOrder>
example 2: po.xml
Translating po.xml using erlsom:scan/2 will result in:
{'PurchaseOrderType',[],
"1999-10-20",
{'USAddress',[],
"US",
"Alice Smith",
"123 Maple Street",
"Mill Valley",
"CA",
"90952"},
{'USAddress',[],
"US",
"Robert Smith",
"8 Oak Avenue",
"Old Town",
"PA",
"95819"},
"Hurry, my lawn is going wild!",
{'Items',[],
[{'Items/item',
[],
"872-AA",
"Lawnmower",
"1",
"148.95",
"Confirm this is electric",
undefined},
{'Items/item',
[],
"926-AA",
"Baby Monitor",
"1",
"39.98",
undefined,
"1999-05-21"}]}}
example 3: output for po.xml
The output can be interpreted as a structure built from Erlang records. The definition of these records can either be generated by erlsom:write_xsd_hrl_file/3, or you can define them yourself (or a combination: you can run write_xsd_hrl_file and change a few field names and add some defaults). For example, the record definitions might be:
-record('PurchaseOrderType', {anyAttribs, orderDate, shipTo, billTo, comment, items}).
-record('USAddress', {anyAttribs, country, name, street, city, state, zip}).
-record('Items', {anyAttribs, listOfItem}).
-record('Items/item', {anyAttribs, partNum, productName, quantity, 'USPrice', comment, shipDate}).
example 4: possible record definitions for po.xsd
As can be seen from the example:
- attributes are included in the records as the first elements (country, partNum)
- elements that are optional (minOccurs="0") for which no value is provided get the value undefined (comment, shipDate).
- elements that can occur more than once (maxOccurs > 1 or unbounded) are translated to a list (listOfItem).
- every record has ‘anyAttribs’ as its first element. If the Schema allows ‘anyAttributes’, and if these are present in the XML document, then the values will be found here (as a list of attribute-value pairs)
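To make the mapping concrete, here is a small sketch that defines records matching the tuples in example 3 and uses them to pick values out of the result. The field names are my own choice for this illustration (they only have to be in the same order as the tuple elements); the module po_records_sketch is hypothetical.

```erlang
-module(po_records_sketch).
-export([ship_to_city/1, part_numbers/1, demo/0]).

%% One field per tuple element, in the order of the output of example 3:
%% anyAttribs first, then the attributes, then the elements.
-record('PurchaseOrderType',
        {anyAttribs, orderDate, shipTo, billTo, comment, items}).
-record('USAddress',
        {anyAttribs, country, name, street, city, state, zip}).
-record('Items', {anyAttribs, listOfItem}).
-record('Items/item',
        {anyAttribs, partNum, productName, quantity, 'USPrice',
         comment, shipDate}).

ship_to_city(#'PurchaseOrderType'{shipTo = #'USAddress'{city = City}}) ->
    City.

%% The repeating 'item' element arrives as a list of 'Items/item' records.
part_numbers(#'PurchaseOrderType'{items = #'Items'{listOfItem = Items}}) ->
    [PartNum || #'Items/item'{partNum = PartNum} <- Items].

%% PO is the structure from example 3, typed in by hand.
demo() ->
    PO = {'PurchaseOrderType', [], "1999-10-20",
          {'USAddress', [], "US", "Alice Smith", "123 Maple Street",
           "Mill Valley", "CA", "90952"},
          {'USAddress', [], "US", "Robert Smith", "8 Oak Avenue",
           "Old Town", "PA", "95819"},
          "Hurry, my lawn is going wild!",
          {'Items', [],
           [{'Items/item', [], "872-AA", "Lawnmower", "1", "148.95",
             "Confirm this is electric", undefined},
            {'Items/item', [], "926-AA", "Baby Monitor", "1", "39.98",
             undefined, "1999-05-21"}]}},
    {ship_to_city(PO), part_numbers(PO)}.
```

Here demo() returns {"Mill Valley", ["872-AA", "926-AA"]}.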
It should be noted that there is quite a bit of information in po.xsd that is not used by erlsom:
- Only in a limited number of situations does erlsom do type checking and translation: only if an element is defined as integer, int, boolean or QName without any further restrictions or extensions. The ‘quantity’ element doesn’t meet these conditions, since (a) it is a positiveInteger, and (b) it is restricted. A value for the quantity element of Ten or -1 would not result in an error or warning, and the string value is not translated to an Erlang integer. This also applies for the user defined simpleTypes, like SKU in the example.
- The fixed attribute is ignored. If po.xml had contained a value other than US, this would have been accepted without warning or error.
- The annotation is ignored (obviously).
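If your application depends on such constraints, it can check them itself after parsing. The sketch below shows one possible way to do that for the ‘quantity’ and ‘SKU’ types of po.xsd, using only standard Erlang functions. The module po_checks_sketch and the error atoms are invented for this illustration; Erlsom does not provide these functions.

```erlang
-module(po_checks_sketch).
-export([quantity/1, valid_sku/1]).

%% quantity is a positiveInteger with maxExclusive 100 in po.xsd,
%% but it is delivered as a string. Convert and check the range here.
quantity(Str) ->
    try list_to_integer(Str) of
        N when N > 0, N < 100 -> {ok, N};
        _OutOfRange -> {error, out_of_range}
    catch
        error:badarg -> {error, not_an_integer}
    end.

%% SKU is restricted by the pattern \d{3}-[A-Z]{2}. XSD patterns match
%% the whole value, hence the explicit \A and \z anchors.
valid_sku(Str) ->
    re:run(Str, "\\A\\d{3}-[A-Z]{2}\\z", [{capture, none}]) =:= match.
```

For example, quantity("Ten") gives {error, not_an_integer} and quantity("-1") gives {error, out_of_range}, the two values mentioned above that the parser would let through.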
In example 5 a number of additional features are illustrated:
- elements that belong to a namespace are prefixed in the result. The prefix is determined by a parameter of the function that compiles the XSD.
- anonymous types (in the example: spouse) get a name that includes the ‘path’, in order to avoid name conflicts.
- types (‘records’) are created for choices – the type indicates which alternative was selected (the record b:personType-hobby shows that “Mowing the lawn” is a hobby, not a profession).
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.org"
xmlns="http://www.example.org"
elementFormDefault="qualified">
<xsd:element name="person" type="personType"/>
<xsd:complexType name="personType">
<xsd:sequence>
<!-- an element with an attribute -->
<xsd:element name = "id">
<xsd:complexType>
<xsd:simpleContent>
<xsd:extension base = "xsd:string">
<xsd:attribute name = "type" type = "xsd:string"/>
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>
</xsd:element>
<!-- choice -->
<xsd:choice>
<xsd:element name="profession" type="xsd:string"/>
<xsd:element name="hobby" type="xsd:string"/>
</xsd:choice>
<!-- group -->
<xsd:group ref="name"/>
<!-- local type -->
<xsd:element name="spouse">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="age" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
<xsd:group name="name">
<xsd:sequence>
<xsd:element name="firstName" type="xsd:string"/>
<xsd:element name="lastName" type="xsd:string"/>
</xsd:sequence>
</xsd:group>
</xsd:schema>
example 5: misc.xsd: namespace, choice, group
<?xml version="1.0"?>
<person xmlns="http://www.example.org">
<id type="passport">123</id>
<hobby>mowing the lawn</hobby>
<firstName>Jan</firstName>
<lastName>Pietersen</lastName>
<spouse>
<name>Jet Pietersen</name>
<age>33</age>
</spouse>
</person>
example 6: misc.xml
The XSD can be compiled by the command
> {ok, Model} = erlsom:compile_xsd_file("misc.xsd",
[{prefix, "b"}]).
After that the XML can be parsed using the command
> {ok, Out, Rest} = erlsom:scan_file("misc_example.xml", Model).
Out is the output shown below, and Rest is a string of the characters that may follow after the end tag of the XML.
{'b:personType',[],
{'b:personType/id',[], "passport","123"},
{'b:personType-hobby',[], "mowing the lawn"},
{'b:name',[], "Jan","Pietersen"},
{'b:personType/spouse',[], "Jet Pietersen","33"}}
example 7: output for misc.xml
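Since the record name tells you which alternative of a choice was selected, the application can simply pattern match on it. The sketch below does this for the output of example 7. Note that the name 'b:personType-profession' is an assumption: it is derived by analogy from 'b:personType-hobby' and does not appear in the output above. The module choice_sketch is hypothetical.

```erlang
-module(choice_sketch).
-export([activity/1, demo/0]).

%% The fourth element of the 'b:personType' tuple holds the choice.
%% 'b:personType-profession' is assumed by analogy with the hobby record.
activity({'b:personType', _AnyAttribs, _Id, Choice, _Name, _Spouse}) ->
    case Choice of
        {'b:personType-hobby', _, Text} -> {hobby, Text};
        {'b:personType-profession', _, Text} -> {profession, Text}
    end.

%% Person is the structure from example 7, typed in by hand.
demo() ->
    Person = {'b:personType', [],
              {'b:personType/id', [], "passport", "123"},
              {'b:personType-hobby', [], "mowing the lawn"},
              {'b:name', [], "Jan", "Pietersen"},
              {'b:personType/spouse', [], "Jet Pietersen", "33"}},
    activity(Person).
```

Here demo() returns {hobby, "mowing the lawn"}.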
The easiest way to install Erlsom is probably to get it from CEAN (http://cean.process-one.net). If you don’t want to do that, or if it doesn’t work for you, you can also install it from Sourceforge.
I have no experience with installation procedures, makefiles etc for Erlang. Fortunately, Klacke (Claes Wickstrom) has provided a makefile. This should enable Unix users to install Erlsom easily.
Anyway, even for Windows users installing erlsom should be straightforward. One way to do it is described below.
- Put all the files into the directory ‘ROOT/lib/erlsom-1.2.1/src’, where ROOT is the directory that contains Erlang (C:\Program Files\erl5.6.1 on my Windows system).
- Start the Erlang shell
- Change the working directory to ‘ROOT/lib/erlsom-1.2.1/src’:
1> cd('../lib/erlsom-1.2.1/src').
C:/Program Files/erl5.6.1/lib/erlsom-1.2.1/src
ok
- Compile the source files:
2> c("erlsom"),
c("erlsom_parse"),
c("erlsom_lib"),
c("erlsom_compile"),
c("erlsom_write"),
c("erlsom_parseXsd"),
c("erlsom_sax"),
c("erlsom_pass2"),
c("erlsom_writeHrl"),
c("erlsom_add"),
c("erlsom_ucs"),
c("erlsom_sax_utf8"),
c("erlsom_sax_latin1"),
c("erlsom_sax_utf16be"),
c("erlsom_sax_utf16le"),
c("erlsom_sax_list"),
c("erlsom_sax_lib"),
c("erlsom_simple_form").
- Move the .beam files to ‘ROOT/lib/erlsom-1.2.1/ebin’.
- Alternatively you can use emake for the last 2 steps:
2> make:all([{outdir, "../ebin"}]).
The distribution includes 7 examples:
- erlsom_example: this shows the use of the basic functions to compile an XSD, to parse an XML document and to write an XML document. To run the example from the Erlang shell: cd to the directory that contains the code (something like cd('lib/erlsom-1.2.1/examples/erlsom_example').), compile (c("erlsom_example").) and run (erlsom_example:run().).
- erlsom_sax_example: this shows the features of the SAX parser.
- example1: this example has 2 purposes:
o It shows how easy Erlsom makes it for you to use an XML configuration file. The configuration file describes a set of 10 test cases, which are run by this example. The configuration file is described by “example1.xsd”. Compiling this XSD and then parsing the configuration file (“example1.xml”) gives you access to an Erlang structure of records that corresponds with the XML schema.
o It shows how 11 different schemas (named “abb1.xsd” through “abb11.xsd”) can describe the same XML document (named “abb.xml”), and it shows the output that results from running Erlsom on this file using these schemas. To run the example for XSD abb1.xsd, use the command example1:test_erlsom("abb1").
- soap_example: this shows how to use the erlsom:add_xsd_file() function, and it gives an example how you might parse and generate SOAP messages.
- continuation: this shows how to use the sax parser with a ‘continuation-function’. This can be used for parsing of very big files or streams. The continuation function should return a block of data; this will be parsed (calling the sax callback function when appropriate) and after that the function is called again to get the next block of data. The example shows how a file of arbitrary size can be parsed. The comments in the code should help you to understand and use this function.
- complex_form: shows how you could create a back-end to the sax parser that produces the same output as Xmerl, and how you could then use the Xpath functions that Xmerl provides.
- book_store: actually three examples, demonstrating the three modes that erlsom supports. The third example shows how you might combine different modes within a function that scans a file.
The sax parser accepts binaries as input. It will recognize UTF-8 and UTF-16 encoding by looking at the byte order mark and the first character of the document. Additionally ISO-8859-1 encoding is recognized if this is indicated by the XML declaration. If the XML declaration specifies another character set, an error will be thrown. It should not be very difficult to add support for other character sets, however.
As specified by the XML standard, the default encoding is UTF-8. If the first byte of the document is a ‘<’ ASCII character and if the XML declaration does not specify anything else, it will be assumed that the encoding is UTF-8.
The result of erlsom:write is a list of Unicode code points. Normally this will have to be encoded before it can be used. The function xmerl_ucs:to_utf8/1 can be used to do this.
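As a small illustration of that last step, the sketch below turns a list of code points into a UTF-8 encoded binary that can be written to a file or a socket. The module name utf8_out_sketch is made up; xmerl_ucs:to_utf8/1 is the function mentioned above, and it returns a list of bytes.

```erlang
-module(utf8_out_sketch).
-export([to_utf8_binary/1]).

%% CodePoints is a list of Unicode code points, such as the XML that
%% erlsom:write returns. The result is a UTF-8 encoded binary.
to_utf8_binary(CodePoints) ->
    list_to_binary(xmerl_ucs:to_utf8(CodePoints)).
```

For example, the code point 235 (ë) is encoded as the two bytes 195,171. On Erlang releases that include the unicode module, unicode:characters_to_binary/1 should give the same result for a valid list of code points.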
Especially in the context of internet applications, it may be a problem if new atoms are created as a result of communication based on XML (SOAP, XML-RPC, AJAX). The number of atoms that can be created within the Erlang runtime environment is limited, and uncontrolled creation of atoms may cause the system to crash.
erlsom:scan/2 does not create new atoms. It uses list_to_existing_atom to create the atoms that are used in the records.
erlsom:compile_xsd does create atoms. However, usually this function won’t be called with arbitrary end user input as its argument, so normally this should not be a problem.
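The same technique can be used in your own code when names from incoming XML have to be turned into atoms. The sketch below uses the standard function list_to_existing_atom/1, which fails with badarg when the atom does not exist yet, so that unknown names are rejected instead of filling up the atom table. The module existing_atom_sketch and the error value are invented for this illustration.

```erlang
-module(existing_atom_sketch).
-export([to_known_atom/1]).

%% Convert a name to an atom only if that atom already exists.
%% No new atom is ever created by this function.
to_known_atom(Name) when is_list(Name) ->
    try list_to_existing_atom(Name) of
        Atom -> {ok, Atom}
    catch
        error:badarg -> {error, unknown_name}
    end.
```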
Some checks/validity constraints are accepted in the XSD, but not enforced during parsing:
- all simple types are interpreted as string. This applies to the built in types (float, positiveInteger, gYear etc), and also to types that are restricted (using 'facets') or extended (for example 'union' types). The only exceptions are Integer, Boolean and QName, these are translated.
- Key, Unique etc. are not supported – if these elements occur in the XSD, they are simply ignored.
The SAX parser has the following limitations:
- It doesn’t support external entities.
- It doesn’t do any validation: if the XML includes a DTD, this is simply ignored.
The data binder has the following additional limitation:
- Names of elements and attributes cannot contain characters outside the Erlang character set (because they are translated to atoms).
All | Supported. The parser puts the elements into the resulting record in a fixed place (independent of the order in which they are received).
Annotation | Ignored (anything enclosed in <documentation></documentation> is ignored).
Any | Supported. However, only elements that are included in the model will show up in the result. Elements are part of the model if they are included in the XSD that was compiled, or if they have been added using erlsom:add_file().
anyAttribute | Supported
Appinfo | Ignored (anything enclosed in <documentation></documentation> is ignored).
Attribute | Supported
attributeGroup | Supported
Choice | Supported
complexContent | Supported
complexType | Supported
Documentation | Accepted, but ignored. Anything enclosed in <documentation></documentation> is ignored (as long as it is valid XML).
Element | Supported
Enumeration | Ignored (all restrictions on simple types are ignored – those types are treated as ‘string’)
Extension | Supported
Field | Ignored (anything enclosed in <unique></unique> is ignored).
Group | Supported.
Import | Supported. However, the support for finding the imported files is limited. See (and modify, if necessary…) the function findFile in erlsom_lib.erl.
Include | Supported. However, the support for finding the included files is limited. See (and modify, if necessary…) the function findFile in erlsom_lib.erl.
Key | Ignored.
Keyref | Ignored
Length | Ignored (all restrictions on simple types are ignored – those types are treated as ‘string’)
List | Ignored (all restrictions on simple types are ignored – those types are treated as ‘string’)
maxInclusive | Ignored (all restrictions on simple types are ignored – those types are treated as ‘string’)
maxLength | (see maxInclusive)
minInclusive | (see maxInclusive)
minLength | (see maxInclusive)
Pattern | (see maxInclusive)
Redefine | Supported. However, the support for finding the imported files is limited. See (and modify, if necessary…) the function findFile in erlsom_lib.erl.
Restriction | Supported as a way to create a derived complex type (but it is not checked whether this is really a restriction of the base type). Ignored on simpleTypes (all restrictions on simple types are ignored – those types are treated as ‘string’)
Schema | Supported
Selector | Ignored (anything enclosed in <unique></unique> is ignored).
Sequence | Supported
simpleContent | Supported
simpleType | Supported
Union | Ignored (all restrictions on simple types are ignored – those types are treated as ‘string’)
Unique | Ignored
Abstract | Ignored. As a consequence, the parser may accept documents that contain instances of abstract types.
attributeFormDefault | Supported.
Block | Not supported
blockDefault | Ignored
Default | Ignored (note that this is not just a check that is not performed: the default value will not be provided)
Final | Ignored
finalDefault | Ignored
Fixed | Ignored
Form | Not supported
Mixed | Supported (text values are inserted into a list of values)
minOccurs, maxOccurs | Supported, except on group definitions
namespace (for ‘any’) | Supported, but lists of values are not supported (##any, ##local and ##other are supported). A list of values is treated as ‘##any’.
schemaLocation | Supported in a limited way, see ‘import’.
xsi:schemaLocation | Ignored
substitutionGroup | Supported
Type | Supported, but there is no check on the built-in types, except for integer, int, boolean and QName.
Use | Supported, but ‘prohibited’ is ignored (treated as ‘optional’).
compile_xsd(XSD) -> {ok, Model}
compile_xsd(XSD, Options) -> {ok, Model}
Types:
XSD = [int()]
Options = [Option]
Option = {prefix, Prefix} |
{type_prefix, TypePrefix} |
{group_prefix, GroupPrefix} |
{include_fun, Include_fun} |
{include_dirs, Dir_list} |
{include_files, Include_files}
Model = the internal representation of the XSD
Compiles an XSD into a structure to be used by erlsom:scan() and erlsom:write(). Returns {ok, Model} or {error, Error}.
XSD can be an encoded binary (see section on character encoding) or a decoded list of Unicode code points.
An explanation of the Options:
Prefix is prefixed to the record names in the XSD. It should be a string or 'undefined'. If it is 'undefined', no prefix will be applied. The default is 'undefined' (no prefix). The prefix specified with this option is applied to the records that correspond to types from the target namespace of the specified XSD. Different prefixes can be specified for XSDs that are imported, see the other options below.
Note that Erlsom:write() uses the prefixes to assign the namespaces. As a consequence, you should use prefixes in case your XML documents use elements from more than one namespace (or if they contain a mixture of elements that are namespace qualified and elements that are not).
TypePrefix is prefixed to the record names that correspond to type definitions
in the XSD. It should be a string.
Record definitions are created for elements, groups and types. In the XSD there
may be groups, elements and types with the same name; this would lead to more
than one record with the same name. In order to avoid the problems that this
would create, it is possible to specify a prefix that will be put in between
the namespace prefix (see above) and the name of the type.
GroupPrefix is prefixed to the record names that correspond to group definitions in the XSD. It should be a string. See the explanation provided above for the TypePrefix option for the background of this option.
Include_fun is a function that finds the files that are included or imported in the XSD. It should be a function that takes 4 arguments:
- Namespace (from the XSD). This is a string or 'undefined'
- SchemaLocation (from the XSD). This is a string or 'undefined'
- Include_files. This is the value of the ‘include_files’ option if this option was passed to compile_xsd(); [] otherwise.
- Dir_list. This is the value of the ‘include_dirs’ option if this option was passed to compile_xsd(); 'undefined' otherwise.
Include_fun should return {XSD, Prefix}, where XSD is the contents of the XSD (a string) and Prefix is a string or 'undefined'. If Prefix is 'undefined', ‘P’ will be used.
Include_fun defaults to a function that uses the Dir_list and Include_files options as specified below.
Include_files is a list of tuples {Namespace, Prefix, Location}. Default is [].
Dir_list is a list of directories (strings). It defaults to ["."].
Behavior for include and import:
If the 'include_fun' option was specified, this function will be called. It should
return both the contents of the file as a string and the prefix (a tuple {Xsd, Prefix}).
Otherwise, if the 'include_files' option is present, the list provided with this
option will be searched for a matching namespace. If this is found, the
specified prefix will be used. If a file is also specified, then this file will
be used. If no file is specified (value is undefined), then the 'location'
attribute and the 'include_dirs' option will be used to locate the file.
If the 'include_files' option is not present, or if the namespace is not found, then
the file will be searched for in the directories given by 'include_dirs' (based on
the 'location' attribute). No prefix will be used.
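To give an idea of what a custom Include_fun might look like, here is a minimal sketch that only uses the SchemaLocation and the list of directories. It follows the four-argument shape and the {XSD, Prefix} return value described above, but everything else is an assumption: the module include_fun_sketch, the error values, the fallback to ["."] and the choice to always return 'undefined' as the prefix are mine, and the file is assumed to be UTF-8 encoded.

```erlang
-module(include_fun_sketch).
-export([find_xsd/4]).

%% Arguments as described above: Namespace, SchemaLocation,
%% Include_files and Dir_list. Only the location and the directories
%% are used in this sketch.
find_xsd(_Namespace, undefined, _IncludeFiles, _Dirs) ->
    erlang:error(no_schema_location);
find_xsd(_Namespace, Location, _IncludeFiles, Dirs) ->
    Candidates = [filename:join(Dir, Location) || Dir <- dirs(Dirs)],
    case lists:filter(fun filelib:is_regular/1, Candidates) of
        [Path | _] ->
            {ok, Bin} = file:read_file(Path),
            %% Decode UTF-8 to a list of code points (assumed encoding).
            {unicode:characters_to_list(Bin), undefined};
        [] ->
            erlang:error({xsd_not_found, Location})
    end.

%% Fall back to the current directory if no directories were given.
dirs(undefined) -> ["."];
dirs(Dirs) -> Dirs.
```

Such a function would presumably be passed as {include_fun, fun include_fun_sketch:find_xsd/4} in the options of compile_xsd().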
compile_xsd_file(XSD) -> {ok, Model}
compile_xsd_file(XSD, Options) -> {ok, Model}
As compile_xsd(), but taking its input from a file.
add_xsd_file(FileName, Options, Model) -> {ok, Model}
Compiles an XSD file (FileName), and adds the elements defined by this XSD to Model. The purpose is to add elements (namespaces) to a model that uses the XML Schema ‘any’ construct. Only elements that are part of the model will be part of the output of ‘scan()’! See the soap example for an example where this is used.
See compile_xsd() for a description of the options.
scan(XML, Model) -> {ok, Struct, Rest}
scan(XML, Model, Options) -> {ok, Struct, Rest}
Types:
XML = [int()] or an encoded binary
Model = the internal representation of the XSD, result of erlsom:compile_xsd()
Options = [Option]
Option = {continuation_function, Continuation_function, Continuation_state} |
{output_encoding, utf8}
Struct = the translation of the XSD to an Erlang data structure
Rest = list of characters that follow after the end of the XML document
Translates an XML document that conforms to the XSD to a structure of records.
Returns {ok, Struct, Rest} or {error, Error}.
Error has the following structure:
[{exception, Exception}, {stack, Stack}, {received, Event}], where:
Exception is the exception that was thrown by the program
Stack is a representation of the 'stack' that is maintained by erlsom.
Event is the sax event that erlsom was processing when it ran into problems.
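The two possible results can be told apart with a small helper. The following is only a sketch: the module name, the function name and the shape of its return values are invented for illustration, and it relies on nothing beyond the {ok, Struct, Rest} and {error, Error} shapes described above.

```erlang
-module(scan_result_sketch).
-export([describe/1]).

%% Successful parse: keep the structure, ignore the trailing characters.
describe({ok, Struct, _Rest}) ->
    {parsed, Struct};
%% Error in the documented list form: pick out the exception and the
%% sax event that was being processed.
describe({error, Error}) when is_list(Error) ->
    Exception = proplists:get_value(exception, Error),
    Event = proplists:get_value(received, Error),
    {failed, Exception, Event};
%% Any other error term is passed through unchanged.
describe({error, Other}) ->
    {failed, Other, undefined}.
```

It could be used as scan_result_sketch:describe(erlsom:scan(Xml, Model)).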
If specified, the continuation function is called whenever the end of the input XML document is reached before the parsing of the XML has finished. The function should have 1 argument (Continuation_state). It should return a tuple {NewData, NewState}, where NewData is the next block of data (again a list of Unicode code points or binary data; the data type has to be the same for each invocation, and it has to match the data type of XML), and NewState is the information that is passed to the next invocation. Note: if the encoding of the document supports multi-byte characters (UTF8, UTF16) you don’t have to ensure that each block of data contains only complete characters, but in case of UTF16 encoding you do have to ensure that you return an even number of bytes.
If the ‘output_encoding’ option is used, the text values will be binary encoded – but the values that are specified as integer in the XSD will still be integers.
scan_file(XMLFile, Model) -> {ok, Struct, Rest}
As scan(), but taking its input from a file.
write(Struct, Model) -> {ok, XML}
Types:
Struct = a structure that represents an XML document
Model = the internal representation of the XSD
XML = [int()].
Translates a structure of records to an XML document. It is the inverse of erlsom:scan().
Note that the output is a list of Unicode code points. If you want to write it to a file, or send it over a wire, you should transform it to binary, and generally you should encode it. You can use xmerl_ucs:to_utf8() to do this.
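A minimal sketch of that last step, using xmerl_ucs:to_utf8() as suggested. The module name, function names and the target file name are hypothetical.

```erlang
-module(write_utf8_sketch).
-export([to_utf8_binary/1, save/2]).

%% Encode a list of Unicode code points as a UTF-8 binary.
%% xmerl_ucs:to_utf8/1 returns a list of bytes, which is then
%% turned into a binary.
to_utf8_binary(CodePoints) ->
    list_to_binary(xmerl_ucs:to_utf8(CodePoints)).

%% Encode the document and write it to FileName.
save(FileName, CodePoints) ->
    file:write_file(FileName, to_utf8_binary(CodePoints)).
```

After {ok, XML} = erlsom:write(Struct, Model), a call such as write_utf8_sketch:save("out.xml", XML) would store the document as UTF-8.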
write_xsd_hrl_file(XSD, Output, Options) -> ok
Types:
XSD = the name of the file that contains the XSD
Options = a list of Options, see compile_xsd().
Output = the name of the output file
Produces a set of record definitions for the types defined by the XSD. Note that the options have to be identical to those that are passed to compile_xsd().
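One way to keep the options identical is to define them once and pass the same list to both calls. This is only a sketch: the module and function names are invented, and the {prefix, "p"} option is assumed to be one of the options accepted by compile_xsd().

```erlang
-module(hrl_sketch).
-export([options/0, build/2]).

%% The single place where the options are defined (example value only).
options() ->
    [{prefix, "p"}].

%% Generate the record definitions and compile the model from the
%% same XSD file with the same options.
build(XsdFile, HrlFile) ->
    Options = options(),
    ok = erlsom:write_xsd_hrl_file(XsdFile, HrlFile, Options),
    erlsom:compile_xsd_file(XsdFile, Options).
```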
parse_sax(XML, Acc0, EventFun, Options) -> {ok, AccOut, Rest}
Types:
XML - [int()], a list of Unicode code points
Acc0 - a term() that is passed to the EventFun.
EventFun - a fun() that is called by the parser whenever it has parsed a bit of the XML input
EventFun should accept the following arguments:
- Event, a tuple that describes the event, see above.
- AccIn , a term() – Acc0 for the first invocation, and the result from the previous invocation for each of the following invocations.
EventFun should return AccOut, a term() that will be passed back to the next invocation of EventFun.
Options – [Option]
Option – {continuation_function, CFunction, CState} | {output_format, utf8}
The ‘output_format’ option determines the encoding of the 'character data': element values and attribute values. The only supported encoding at this moment is 'utf8'. The default is string().
CFunction – should be a function that takes 2 arguments: Tail and State.
- Tail is the (short) list of characters (or a short binary) that could not yet be parsed because it is (or might be) an incomplete token, or because an encoded character is not complete. Since this still has to be parsed, CFunction should include this in front of the next block of data.
- State is information that is passed by the parser to the callback function transparently. This can be used to keep track of the location in the file etc.
The function returns {NewData, NewState}, where
NewData is a list of characters/unicode code points/binary, and NewState the
new value for the State. NewData has to be in the same type of encoding as the
first part of the document.
Note: if the encoding of the document supports multi-byte characters (UTF8,
UTF16) you don’t have to ensure that each block of data contains only complete
characters, but in case of UTF16 you do
have to ensure that you return an even number of bytes.
AccOut - the result of the last invocation of EventFun.
Rest - list of characters that follow after the end of the XML document
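The pieces described above can be put together roughly as follows. This is a sketch under several assumptions: the module and function names are invented, the startElement event is assumed to be a 5-tuple {startElement, Uri, LocalName, Prefix, Attributes}, and the option tuple is written with the function before the state, as in the scan() description ({continuation_function, Continuation_function, Continuation_state}). The example counts start tags while reading the file in blocks.

```erlang
-module(sax_count_sketch).
-export([count_event/2, continue_file/2, count_elements/2]).

%% EventFun: add 1 to the accumulator for every start tag.
%% Assumed event shape: {startElement, Uri, LocalName, Prefix, Attributes}.
count_event({startElement, _Uri, _LocalName, _Prefix, _Attributes}, Acc) ->
    Acc + 1;
%% All other events leave the accumulator unchanged.
count_event(_OtherEvent, Acc) ->
    Acc.

%% CFunction: read the next block and put the unparsed Tail in front of it.
%% The state is {Handle, BlockSize}; at end of file only Tail is returned.
continue_file(Tail, {Handle, BlockSize} = State) ->
    case file:read(Handle, BlockSize) of
        {ok, Data} -> {<<Tail/binary, Data/binary>>, State};
        eof -> {Tail, State}
    end.

%% Parse FileName in blocks of BlockSize bytes, starting from an empty binary.
count_elements(FileName, BlockSize) ->
    {ok, Handle} = file:open(FileName, [read, raw, binary]),
    try
        erlsom:parse_sax(<<>>, 0, fun count_event/2,
                         [{continuation_function, fun continue_file/2,
                           {Handle, BlockSize}}])
    after
        file:close(Handle)
    end.
```

Because the file is opened in binary mode, Tail and every new block are binaries, which keeps the data type the same for each invocation.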
simple_form(XML) -> {ok, SimpleFormElement, Rest}
simple_form(XML, Options) -> {ok, SimpleFormElement, Rest}
Types:
XML = [int()] or an encoded binary
Options = [Option]
Option = {nameFun, NameFun} |
{output_encoding, utf8}
SimpleFormElement = {Tag, Attributes, Content},
Rest = list of characters that follow after the end of the XML document
Tag is a string (unless otherwise specified through the nameFun option, see below), Attributes = [{AttributeName, Value}], and Content is a list of SimpleFormElements and/or strings.
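Because the result is a plain nested tuple, the usual way to get at a value is to walk the tree. The helper below is a sketch, not part of Erlsom; the module and function names are invented. It collects every element with a given tag, in document order.

```erlang
-module(simple_form_sketch).
-export([find_elements/2]).

%% Return all elements (at any depth) whose tag equals Tag.
%% Text content is skipped: only tuples are SimpleFormElements.
find_elements(Tag, {ElementTag, _Attributes, Content} = Element) ->
    Nested = lists:append([find_elements(Tag, Child)
                           || Child <- Content, is_tuple(Child)]),
    case ElementTag of
        Tag -> [Element | Nested];   % Tag is already bound, so this compares values
        _ -> Nested
    end.
```

For example, after {ok, Doc, _} = erlsom:simple_form(XML), the call simple_form_sketch:find_elements("title", Doc) would return every "title" element in the document (assuming the default tag names are used).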
NameFun is a function with 3 arguments: Name, Namespace, Prefix. It should return a term. It is called for each tag and attribute name. The result will be used in the output. Default is Name if Namespace == undefined, and a string {Namespace}Name otherwise.
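A possible name function is sketched below. It produces names of the form Prefix:Name instead of {Namespace}Name. The module and function names are invented, and it is an assumption that Name and Prefix are passed as strings and that a missing prefix arrives as either undefined or an empty string.

```erlang
-module(name_fun_sketch).
-export([prefixed_name/3, parse/1]).

%% No prefix (assumed to be undefined or ""): use the bare name.
prefixed_name(Name, _Namespace, Prefix)
  when Prefix =:= undefined; Prefix =:= [] ->
    Name;
%% Otherwise build "Prefix:Name".
prefixed_name(Name, _Namespace, Prefix) ->
    Prefix ++ ":" ++ Name.

%% Parse XML to the simple form, using the name function above.
parse(XML) ->
    erlsom:simple_form(XML, [{nameFun, fun prefixed_name/3}]).
```

Note that names built from prefixes depend on how the document happens to declare its namespaces, which is presumably why the default uses the namespace itself.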
erlsom_lib:toUnicode(XML) -> DecodedXML
Types:
XML = the XML in binary form.
DecodedXML = the XML in the form of a list of Unicode code points.
Decodes the XML, see the section on character decoding above.
erlsom_lib:find_xsd(Namespace, Location, Dir_list, Include_list) -> {XSD, Prefix}
Types:
Namespace (from the XSD). This is a string or 'undefined'
Location (from the XSD). This is a string or 'undefined'
Dir_list. This is the value of the 'dir_list' option if this option was passed to compile_xsd(); 'undefined' otherwise.
Include_list. This is the value of the 'include_files' option if this option was passed to compile_xsd(); 'undefined' otherwise.
The function erlsom_lib:find_xsd can be passed to compile_xsd as the value for the 'include_fun' option. It will attempt to get imported XSDs from the internet (if the import, include or redefine statement includes a ‘location’ attribute in the form of a URL).
If find_xsd cannot find the file on the internet, it will attempt to find the file using the standard function, see the description provided above with the compile_xsd function.
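Passing the function as the 'include_fun' option could look like this. The module and function names are invented for the sketch; only erlsom_lib:find_xsd/4 and the 'include_fun' option come from the text above.

```erlang
-module(find_xsd_sketch).
-export([options/0, compile/1]).

%% Use erlsom_lib:find_xsd/4 to resolve import, include and redefine.
options() ->
    [{include_fun, fun erlsom_lib:find_xsd/4}].

%% Compile a (hypothetical) XSD file that imports other XSDs by URL.
compile(XsdFile) ->
    erlsom:compile_xsd_file(XsdFile, options()).
```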
erlsom_lib:detect_encoding(Document) -> {Encoding, Binary}
Types:
Document = the XML document, either in binary form or as a list
Encoding = the encoding, as an atom
Binary = the XML document in binary form.
Tries to detect the encoding. It looks at the first couple of bytes. If these bytes cannot give a definitive answer, it looks into the xml declaration.
Possible values for Encoding:
ucs4be
ucs4le
utf16be
utf16le
utf8
iso_8859_1
The second return value is identical to the input if the input was in binary form, and the translation to the binary form if the input was a list.
(The basis of this function was copied from xmerl_lib, but it was extended to look into the xml declaration.)
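The detected encoding can be used to choose a decoding function. The sketch below maps three of the possible values to the erlsom_ucs functions documented in this manual and reports the others as unsupported. The module name, function names and error term are invented, and the sketch makes no attempt to remove a byte order mark.

```erlang
-module(encoding_sketch).
-export([decoder_for/1, decode/1]).

%% Map an encoding atom to a decoding function.
decoder_for(utf8) -> {ok, fun erlsom_ucs:from_utf8/1};
decoder_for(utf16le) -> {ok, fun erlsom_ucs:from_utf16le/1};
decoder_for(utf16be) -> {ok, fun erlsom_ucs:from_utf16be/1};
%% The remaining encodings are not handled in this sketch.
decoder_for(Other) -> {error, {no_decoder_in_this_sketch, Other}}.

%% Detect the encoding of Document and decode it if a decoder is known.
%% Returns {List, Tail} on success, or the error tuple from decoder_for/1.
decode(Document) ->
    {Encoding, Binary} = erlsom_lib:detect_encoding(Document),
    case decoder_for(Encoding) of
        {ok, Decode} -> Decode(Binary);
        Error -> Error
    end.
```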
erlsom_ucs:from_utf8(Data) -> {List, Tail}
erlsom_ucs:from_utf16le(Data) -> {List, Tail}
erlsom_ucs:from_utf16be(Data) -> {List, Tail}
Types:
Data = a block of data, either as a list of bytes or as a binary
List = the input translated to a list of Unicode code points
Tail = remaining bytes at the end of the input (a list of bytes).
These functions are based on the corresponding functions in xmerl_ucs, but they have been modified so that they can be used to translate blocks of data. The end of a block can be in the middle of a series of bytes that together correspond to 1 Unicode code point. The remaining bytes are returned, so that they can be put in front of the next block of data.
Note on performance: the functions work on lists, not binaries! If the input is a binary, this is translated to a list in a first step, since the functions are faster that way. If you are reading the xml document from a file, it is probably fastest to use pread() in such a way that it returns a list, and not a binary.
See the ‘continuation’ example for an example of how this can be used to deal with very large documents (or streams of data).
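The idea of carrying the remaining bytes over to the next block can be sketched as a fold. The module and function names are invented. The decoding function is passed in as an argument, so that any function with the {List, Tail} return shape described above can be used, for example fun erlsom_ucs:from_utf8/1.

```erlang
-module(block_decode_sketch).
-export([decode_blocks/2]).

%% Decode a list of blocks (binaries or lists of bytes) with Decode.
%% The undecoded tail of each block is put in front of the next block.
%% Returns {CodePoints, Tail}, where Tail holds any bytes left over
%% after the last block.
decode_blocks(Decode, Blocks) ->
    {Decoded, Tail} =
        lists:foldl(
          fun(Block, {Acc, Tail0}) ->
                  %% Tail0 is a list of bytes; combine it with the block.
                  {List, Tail1} = Decode(list_to_binary([Tail0, Block])),
                  {[List | Acc], Tail1}
          end,
          {[], []},
          Blocks),
    {lists:append(lists:reverse(Decoded)), Tail}.
```

A call such as block_decode_sketch:decode_blocks(fun erlsom_ucs:from_utf8/1, Blocks) would then decode a document that was read in arbitrary blocks, even if a block boundary falls in the middle of a multi-byte character.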
parse(XML, Model) -> {ok, Struct}
Note: This function has been replaced by scan()! Please use scan().
parse_file(XMLFile, Model) -> {ok, Struct}
Note: This function has been replaced by scan_file()! Please use scan_file().
write_hrl(XSD, Namespaces, Output) -> ok
This function has been replaced by write_xsd_hrl_file()! Please use write_xsd_hrl_file().
erlsom_sax:parseDocument(Xml, State, EventFun) -> {State2, Rest}
erlsom_sax:parseDocument(Xml, State, EventFun, Options) -> {State2, Rest}
Obsolete, use parse_sax().
compile(XSD, Prefix, Namespaces) -> {ok, Model}
This function has been replaced by compile_xsd()! Please use compile_xsd().
compile_file(FileName, Prefix, Namespaces) -> {ok, Model}
This function has been replaced by compile_xsd_file()! Please use compile_xsd_file().
add_file(FileName, Prefix, Model) -> Model
This function has been replaced by add_xsd_file()! Please use add_xsd_file().
Willem de Jong
Copyright © 2006, 2007, 2008 Willem de Jong