
Erlsom

 

Introduction

Example XML document

SAX Mode

SAX Events

Simple DOM Mode

Data Binder Mode

Installation

Examples

Character encoding

Creation of atoms

Limitations

XML Schema elements

XML Schema Attributes

Reference

Author

 

Introduction

This is the documentation for Erlsom v. 1.2.1. Erlsom is a set of functions to parse (and generate) XML documents.

 

Erlsom can be used in a couple of very different modes:

-          As a SAX parser.
This is a more or less standardized model (see
http://www.saxproject.org/apidoc/org/xml/sax/ContentHandler.html) for parsing XML. Every time the parser has processed a meaningful part of the XML document (such as a start tag), it will tell your application about this. The application can process this information – potentially in parallel – while the parser continues to parse the rest of the document. The SAX parser will allow you to efficiently parse XML documents of arbitrary size, but it may take some time to get used to it. If you invest some effort, you may find that it fits very well with the Erlang programming model – personally I have always been very happy about my choice to use a SAX parser as the basis for the rest of Erlsom. You will find a couple of examples of how the SAX parser can be used below.

-          As a simple sort of DOM parser.
Erlsom can translate your XML to the ‘simple form’ that is used by Xmerl. This is a form that is easy to understand, but you have to search your way through the output to get to the information that you need. Section … provides an example.  (Note: in the examples directory you will also find an example that translates the XML to the more complex output format that is produced by Xmerl. On this output you can use the Xpath functions that come with Xmerl – but I haven’t tested this extensively).

-          As a ‘data binder’
Erlsom can translate the XML document to an Erlang data structure that corresponds to an XML Schema. It has the advantage over the SAX parser that it validates the XML document, and that you know exactly what the layout of the output will be. This makes it easy to access the elements that you need in a very direct way. Section  … gives more information. (See http://www.rpbourret.com/xml/XMLDataBinding.htm for a general description of XML data binding.)

 

For all modes the following applies:

-          If the document is too big to fit into memory, or if the document arrives in some kind of data stream, it can be passed to the parser in blocks of arbitrary size.

-          The parser can work directly on binaries. There is no need to transform binaries to lists before passing the data to Erlsom. Using binaries as input has a positive effect on the memory usage and on the speed (provided that you are using Erlang R12B or later – if you are using an older Erlang version the speed will be better if you transform binaries to lists). The binaries can be latin-1, utf-8 or utf-16 encoded.

-          The parser has an option to produce output in binary form (only the character data: names of elements and attributes are always strings). This may be convenient if you want to minimize the memory usage, and/or if you need the result in binary format for further processing. Note that it will slow down the parser slightly. If you select this option the encoding of the result will be utf-8 (irrespective of the encoding of the input document). 

 

Example XML document

Unless otherwise indicated, the examples in the next sections will use the following, very simple XML document as input:

 

<foo attr="baz"><bar>x</bar><bar>y</bar></foo>

 

This document is stored in a file called “minimal.xml”, and read into a variable called Xml by the following commands in the shell:

 

1> {ok, Xml} = file:read_file("minimal.xml").

{ok,<<"<foo attr=\"baz\"><bar>x</bar><bar>y</bar></foo>\r\n">>}

 

The following, corresponding XSD (“minimal.xsd”) is used in the first example for the data binder:

 

<?xml version="1.0" encoding="UTF-8"?>

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

    <xsd:element name="foo" type="foo_type"/>

    <xsd:complexType name="foo_type">

         <xsd:sequence>

             <xsd:element name="bar"

                          type="xsd:string" maxOccurs="unbounded"/>

         </xsd:sequence>

         <xsd:attribute name="attr" type="xsd:string"/>

     </xsd:complexType>

</xsd:schema>

 

SAX Mode

The example below shows how the example XML (see above) can be processed using the SAX parser:

 

2> erlsom:parse_sax(Xml, [], fun(Event, Acc) -> io:format("~p~n", [Event]), Acc end).

startDocument

{startElement,[],"foo",[],[{attribute,"attr",[],[],"baz"}]}

{startElement,[],"bar",[],[]}

{characters,"x"}

{endElement,[],"bar",[]}

{startElement,[],"bar",[],[]}

{characters,"y"}

{endElement,[],"bar",[]}

{endElement,[],"foo",[]}

endDocument

{ok,[],"\r\n"}

 

The function erlsom:parse_sax takes as its arguments: the XML document, an accumulator value and an ‘event processing function’. The event processing function will process the parts of the XML document that have been parsed. In this example, it simply prints these events.

 

The next example does something slightly more meaningful: it counts the number of times the “bar” element occurs in the XML document. Ok, maybe not very useful, but at least this example will produce a result, not only side effects.

 

3> CountBar = fun(Event, Acc) -> case Event of {startElement, _, "bar", _, _} -> Acc + 1; _ -> Acc end end.

#Fun<erl_eval.12.113037538>

 

4> erlsom:parse_sax(Xml, 0, CountBar).                                        

{ok,2,"\r\n"}

 

To describe it in a rather formal way: parse_sax(Xml, Acc0, Fun) calls Fun(Event, AccIn) on successive ‘XML events’ that result from parsing Xml, starting with AccIn == Acc0. Fun/2 must return a new accumulator which is passed to the next call. The function returns {ok, AccOut, Tail}, where AccOut is the final value of the accumulator and Tail the list of characters that follow after the last tag of the XML document. In this example AccOut == 2, since the tag occurs twice.

(Notice how similar this is to lists:foldl(Fun, Acc0, Sax_events), assuming that Sax_events is the list of Sax events – I more or less copied this description from the documentation of the lists module.)
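The comparison with lists:foldl/3 can be made concrete with a small sketch. The event list below is copied by hand from the output of the first example (it is not produced by Erlsom here), and the module name sax_fold_demo is made up for the illustration.

```erlang
-module(sax_fold_demo).
-export([count_bar/0]).

%% Same logic as the CountBar fun in the shell example above.
count(Event, Acc) ->
    case Event of
        {startElement, _, "bar", _, _} -> Acc + 1;
        _ -> Acc
    end.

%% The events reported for minimal.xml, written out by hand.
events() ->
    [startDocument,
     {startElement, [], "foo", [], [{attribute, "attr", [], [], "baz"}]},
     {startElement, [], "bar", [], []},
     {characters, "x"},
     {endElement, [], "bar", []},
     {startElement, [], "bar", [], []},
     {characters, "y"},
     {endElement, [], "bar", []},
     {endElement, [], "foo", []},
     endDocument].

%% Folding the callback over the list yields the same accumulator
%% value that parse_sax/3 returned in the shell session.
count_bar() ->
    lists:foldl(fun count/2, 0, events()).
```

The difference in practice is that parse_sax never builds the list: each event is handed to the callback as soon as it has been parsed.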

 

It may still not be very clear to you how this SAX parser can be used to produce useful results. There are some additional examples in the examples directory of the Erlsom distribution. If you are still not convinced you can try to decipher the source code for the ‘data binder’ mode (erlsom_parse.erl) – this was also built on top of the SAX parser.

 

SAX Events

startDocument

             

endDocument

Will NOT be sent out in case of an error

 

{startPrefixMapping, Prefix, URI}

Begin the scope of a prefix - URI namespace mapping

Will be sent immediately before the corresponding startElement event.

 

{endPrefixMapping, Prefix}

End the scope of a prefix - URI namespace mapping

Will be sent immediately before the corresponding endElement event.

 

{startElement, Uri, LocalName, Prefix, [Attributes]}

The beginning of an element.

There will be a corresponding endElement (even when the element is empty).

All three name components will be provided.

 

[Attributes] is a list of attribute records, see sax.hrl.

Namespace attributes (xmlns:*) will not be reported.

There will be NO attribute values for defaulted attributes!

 

Providing 'Prefix' instead of 'Qualified name' is probably not quite in line with the SAX spec, but it appears to be more convenient.

 

{endElement, Uri, LocalName, Prefix}

The end of an element.

 

{characters, Characters}

Character data.

All character data will be in one chunk, except if there is a CDATA section included inside a character section. In that case there will be separate events for the characters before the CDATA, the CDATA section and the characters following it (if any, of course).

 

{ignorableWhitespace, Characters}

If a character data section (as it would be reported by the 'characters' event, see above) consists ONLY of whitespace, it will be reported as ignorableWhitespace.

 

{processingInstruction, Target, Data}

 

{error, Description}

 

{internalError, Description}
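As an illustration of how a callback can react to several of these events, here is a minimal sketch that collects the character data of every "bar" element. The module name sax_collect_demo and the shape of the accumulator are assumptions made for this example; only erlsom:parse_sax/3 and the event tuples come from the description above.

```erlang
-module(sax_collect_demo).
-export([bar_texts/1, handle/2]).

%% Accumulator: {InsideBar, Texts}, with Texts collected in reverse order.
handle({startElement, _Uri, "bar", _Prefix, _Attributes}, {_, Texts}) ->
    {true, Texts};
handle({characters, Chars}, {true, Texts}) ->
    %% Note: whitespace-only content arrives as ignorableWhitespace,
    %% so it is not collected by this clause.
    {true, [Chars | Texts]};
handle({endElement, _Uri, "bar", _Prefix}, {_, Texts}) ->
    {false, Texts};
handle(_OtherEvent, Acc) ->
    Acc.

%% Parses Xml and returns the texts in document order.
bar_texts(Xml) ->
    {ok, {_, Texts}, _Tail} = erlsom:parse_sax(Xml, {false, []}, fun handle/2),
    lists:reverse(Texts).
```

If a "bar" element contains a CDATA section, its text arrives in more than one characters event and therefore ends up as several list entries; a real application would probably want to join those.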

 

Simple DOM Mode

This mode translates the XML document to a generic data structure. It doesn’t really follow the DOM standard, but instead it provides a very simple format. In fact, it is very similar to the format that is defined as the ‘simple-form’ in the Xmerl documentation.

 

An example will probably be sufficient to explain it:

 

erlsom:simple_form(Xml).       

{ok,{"foo",

     [{"attr","baz"}],

     [{"bar",[],["x"]},{"bar",[],["y"]}]},

    "\r\n"}

 

Result = {ok, Element, Tail}, where Element = {Tag, Attributes, Content}, Tag is a string (there is an option that allows you to format Tag differently, see the reference section below), Attributes = [{AttributeName, Value}], and Content is a list of Elements and/or strings.
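Because the output is a plain nested tuple, finding information in it means walking the structure yourself. The following is a minimal sketch of two helper functions for that; the module name simple_form_demo and both function names are made up for this example, and they only look at the direct children of the element that is passed in.

```erlang
-module(simple_form_demo).
-export([child_texts/2, attribute/2]).

%% Text of every direct child element named Tag, in document order.
%% Strings in Content do not match the 3-tuple pattern and are skipped.
child_texts(Tag, {_Parent, _Attributes, Content}) ->
    [lists:append([Text || Text <- Children, is_list(Text)])
     || {ChildTag, _ChildAttributes, Children} <- Content,
        ChildTag =:= Tag].

%% Value of the attribute Name, or 'undefined' if it is not present.
attribute(Name, {_Tag, Attributes, _Content}) ->
    proplists:get_value(Name, Attributes).
```

Applied to the Element from the example above, child_texts("bar", Element) gives ["x","y"] and attribute("attr", Element) gives "baz".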

Data Binder Mode

In this mode, Erlsom parses XML documents that are associated with an XSD (or Schema). It checks whether the XML document conforms to the Schema, and it translates the document to an Erlang structure that is based on the types defined in the Schema. This section tries to explain the relation between the Schema and the Erlang data structure that is produced by Erlsom.

 

First a quick example using the same XML that was used for the other modes. Before we can parse the document we need to ‘compile’ the XML Schema (similar to how you might compile a regular expression).

 

10> {ok, Model} = erlsom:compile_xsd_file("minimal.xsd").

{ok,{model,[{typ…

 

Now you can use this compiled model:

 

11> {ok, Result, _} = erlsom:scan(Xml, Model).

{ok,{foo_type,[],"baz",["x","y"]},"\r\n"}

 

Assuming that you have defined a suitable record #foo_type{} (erlsom:write_xsd_hrl_file() can do it for you), you can use the following in your program (it won’t work in the shell):

 

BarValues = Result#foo_type.bar,

AttrValue = Result#foo_type.attr,
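To show the record syntax in a complete module, here is a minimal sketch. The record definition is written by hand to match the tuple {foo_type,[],"baz",["x","y"]} shown above (the field names are an assumption, a generated .hrl file may differ), and the module and function names are made up.

```erlang
-module(foo_records_demo).
-export([summary/1]).

%% Hand-written record: one field per position in the result tuple.
-record(foo_type, {anyAttribs, attr, bar}).

%% Returns the attribute value and the number of bar elements.
summary(#foo_type{attr = Attr, bar = Bars}) ->
    {Attr, length(Bars)}.
```

Calling summary(Result) with the Result from the shell session would give {"baz",2}.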

 

Nice and compact, as you see, but it may need more explanation. I will use a more complex example from the XML Schema Primer (XML Schema Part 0: Primer Second Edition) [Primer]. It can be found here: http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/ [1]. Sections that have been copied from this document are contained in a blue box.

 

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

 

  <xsd:annotation>

    <xsd:documentation xml:lang="en">

     Purchase order schema for Example.com.

     Copyright 2000 Example.com. All rights reserved.

    </xsd:documentation>

  </xsd:annotation>

 

  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>

 

  <xsd:element name="comment" type="xsd:string"/>

 

  <xsd:complexType name="PurchaseOrderType">

    <xsd:sequence>

      <xsd:element name="shipTo" type="USAddress"/>

      <xsd:element name="billTo" type="USAddress"/>

      <xsd:element ref="comment" minOccurs="0"/>

      <xsd:element name="items"  type="Items"/>

    </xsd:sequence>

    <xsd:attribute name="orderDate" type="xsd:date"/>

  </xsd:complexType>

 

  <xsd:complexType name="USAddress">

    <xsd:sequence>

      <xsd:element name="name"   type="xsd:string"/>

      <xsd:element name="street" type="xsd:string"/>

      <xsd:element name="city"   type="xsd:string"/>

      <xsd:element name="state"  type="xsd:string"/>

      <xsd:element name="zip"    type="xsd:decimal"/>

    </xsd:sequence>

    <xsd:attribute name="country" type="xsd:NMTOKEN"

                   fixed="US"/>

  </xsd:complexType>

 

  <xsd:complexType name="Items">

    <xsd:sequence>

      <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">

        <xsd:complexType>

          <xsd:sequence>

            <xsd:element name="productName" type="xsd:string"/>

            <xsd:element name="quantity">

              <xsd:simpleType>

                <xsd:restriction base="xsd:positiveInteger">

                  <xsd:maxExclusive value="100"/>

                </xsd:restriction>

              </xsd:simpleType>

            </xsd:element>

            <xsd:element name="USPrice"  type="xsd:decimal"/>

            <xsd:element ref="comment"   minOccurs="0"/>

            <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>

          </xsd:sequence>

          <xsd:attribute name="partNum" type="SKU" use="required"/>

        </xsd:complexType>

      </xsd:element>

    </xsd:sequence>

  </xsd:complexType>

 

  <!-- Stock Keeping Unit, a code for identifying products -->

  <xsd:simpleType name="SKU">

    <xsd:restriction base="xsd:string">

      <xsd:pattern value="\d{3}-[A-Z]{2}"/>

    </xsd:restriction>

  </xsd:simpleType>

 

</xsd:schema>

example 1: po.xsd

 

This XSD can be processed by Erlsom: the compiler accepts it, and the parser can parse instances (XML documents) that conform to this schema.

 

Like the Primer, I will use po.xml as an example XML document.

 

<?xml version="1.0"?>

<purchaseOrder orderDate="1999-10-20">

   <shipTo country="US">

      <name>Alice Smith</name>

      <street>123 Maple Street</street>

      <city>Mill Valley</city>

      <state>CA</state>

      <zip>90952</zip>

   </shipTo>

   <billTo country="US">

      <name>Robert Smith</name>

      <street>8 Oak Avenue</street>

      <city>Old Town</city>

      <state>PA</state>

      <zip>95819</zip>

   </billTo>

   <comment>Hurry, my lawn is going wild!</comment>

   <items>

      <item partNum="872-AA">

         <productName>Lawnmower</productName>

         <quantity>1</quantity>

         <USPrice>148.95</USPrice>

         <comment>Confirm this is electric</comment>

      </item>

      <item partNum="926-AA">

         <productName>Baby Monitor</productName>

         <quantity>1</quantity>

         <USPrice>39.98</USPrice>

         <shipDate>1999-05-21</shipDate>

      </item>

   </items>

</purchaseOrder>

example 2: po.xml

 

Translating po.xml using erlsom:scan/2 will result in:

 

{'PurchaseOrderType',[],

                     "1999-10-20",

                     {'USAddress',[],

                                  "US",

                                  "Alice Smith",

                                  "123 Maple Street",

                                  "Mill Valley",

                                  "CA",

                                  "90952"},

                     {'USAddress',[],

                                  "US",

                                  "Robert Smith",

                                  "8 Oak Avenue",

                                  "Old Town",

                                  "PA",

                                  "95819"},

                     "Hurry, my lawn is going wild!",

                     {'Items',[],

                              [{'Items/item',

                                     [],

                                     "872-AA",

                                     "Lawnmower",

                                     "1",

                                     "148.95",

                                     "Confirm this is electric",

                                     undefined},

                               {'Items/item',

                                     [],

                                     "926-AA",

                                     "Baby Monitor",

                                     "1",

                                     "39.98",

                                     undefined,

                                     "1999-05-21"}]}}

example 3: output for po.xml

 

The output can be interpreted as a structure built from Erlang records. The definition of these records can either be generated by erlsom:write_xsd_hrl_file/3, or you can define them yourself (or a combination: you can run write_xsd_hrl_file and change a few fieldnames and add some defaults). For example, the record definitions might be:

 

-record('PurchaseOrderType', {anyAttribs, orderDate, shipTo, billTo, comment, items}).

-record('USAddress', {anyAttribs, country, name, street, city, state, zip}).

-record('Items', {anyAttribs, listOfItem}).

-record('Items/item', {anyAttribs, partNum, productName, quantity, 'USPrice', comment, shipDate}).

example 4: possible record definitions for po.xsd

 

As can be seen from the example:

-          attributes are included in the records before the child elements, directly after ‘anyAttribs’ (country, partNum)

-          elements that are optional (minOccurs="0") for which no value is provided get the value undefined (comment, shipDate).

-          elements that can occur more than once (maxOccurs > 1 or unbounded) are translated to a list (listOfItem).

-          every record has ‘anyAttribs’ as its first element. If the Schema allows ‘anyAttributes’, and if these are present in the XML document, then the values will be found here (as a list of attribute-value pairs)
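The points above can be sketched in code. The example below assumes hand-written record definitions for 'Items' and 'Items/item' with the field names used in this section; the module name, the function names and the atom not_scheduled are made up for the illustration.

```erlang
-module(po_records_demo).
-export([ship_dates/1]).

%% Record names with special characters must be quoted atoms.
-record('Items', {anyAttribs, listOfItem}).
-record('Items/item', {anyAttribs, partNum, productName, quantity,
                       'USPrice', comment, shipDate}).

%% Returns [{PartNum, ShipDate}] for every item in the list.
%% A missing optional shipDate ('undefined') is replaced by not_scheduled.
ship_dates(#'Items'{listOfItem = Items}) ->
    [{PartNum, date_or_default(ShipDate)}
     || #'Items/item'{partNum = PartNum, shipDate = ShipDate} <- Items].

date_or_default(undefined) -> not_scheduled;
date_or_default(Date) -> Date.
```

The function takes the 'Items' part of the output for po.xml, for example the value of the items field of the 'PurchaseOrderType' record.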

 

It should be noted that there is quite a bit of information in po.xsd that is not used by erlsom:

 

-          Only in a limited number of situations does erlsom do type checking and translation: only if an element is defined as integer, int, boolean or QName without any further restrictions or extensions. The ‘quantity’ element doesn’t meet these conditions, since (a) it is a positiveInteger, and (b) it is restricted. A value for the quantity element of Ten or -1 would not result in an error or warning, and the string value is not translated to an Erlang integer. This also applies to the user defined simpleTypes, like SKU in the example.

-          The fixed attribute is ignored. If po.xml had contained a value other than US, this would have been accepted without warning or error.

-          The annotation is ignored (obviously).
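Since such values arrive as unchecked strings, an application that cares about the restriction has to check it itself. A minimal sketch for the ‘quantity’ element follows; the module name, the function name and the error atoms are made up, and the bounds are taken from the positiveInteger base type and the maxExclusive facet in po.xsd.

```erlang
-module(quantity_check_demo).
-export([quantity/1]).

%% Converts the string that Erlsom delivers to an integer and applies
%% the restriction from po.xsd: positive and less than 100.
quantity(String) ->
    try list_to_integer(String) of
        N when N > 0, N < 100 -> {ok, N};
        _ -> {error, out_of_range}
    catch
        error:badarg -> {error, not_an_integer}
    end.
```

With this check the value Ten is reported as not_an_integer and -1 as out_of_range, instead of being passed on silently.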

 

In example 5 a number of additional features is illustrated:

 

-          elements that belong to a namespace are prefixed in the result. The prefix is determined by a parameter of the function that compiles the XSD.

-          anonymous types (in the example: spouse) get a name that includes the ‘path’, in order to avoid name conflicts.

-          types (‘records’) are created for choices – the type indicates which alternative was selected (the record b:personType-hobby shows that “mowing the lawn” is a hobby, not a profession).

 

<?xml version="1.0"?>

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"

            targetNamespace="http://www.example.org"

            xmlns="http://www.example.org"

            elementFormDefault="qualified">

  <xsd:element name="person" type="personType"/>

  <xsd:complexType name="personType">

     <xsd:sequence>

       <!-- an element with an attribute -->

       <xsd:element name = "id">

         <xsd:complexType>

        <xsd:simpleContent>

          <xsd:extension base = "xsd:string">

             <xsd:attribute name = "type" type = "xsd:string"/>

          </xsd:extension>

        </xsd:simpleContent>

      </xsd:complexType>

       </xsd:element>

 

       <!-- choice -->

       <xsd:choice>

         <xsd:element name="profession" type="xsd:string"/>

         <xsd:element name="hobby" type="xsd:string"/>

       </xsd:choice>

 

       <!-- group -->

       <xsd:group  ref="name"/>

      

       <!-- local type -->

       <xsd:element name="spouse">

         <xsd:complexType>

           <xsd:sequence>

             <xsd:element  name="name" type="xsd:string"/>

             <xsd:element  name="age" type="xsd:string"/>

           </xsd:sequence>

         </xsd:complexType>

       </xsd:element>

 

    </xsd:sequence>

  </xsd:complexType>

 

  <xsd:group name="name">

    <xsd:sequence>

      <xsd:element name="firstName" type="xsd:string"/>

      <xsd:element name="lastName" type="xsd:string"/>

    </xsd:sequence>

  </xsd:group>

 

</xsd:schema>

example 5: misc.xsd: namespace, choice, group

 

<?xml version="1.0"?>

<person xmlns="http://www.example.org">

  <id type="passport">123</id>

  <hobby>mowing the lawn</hobby>

  <firstName>Jan</firstName>

  <lastName>Pietersen</lastName>

  <spouse>

      <name>Jet Pietersen</name>

      <age>33</age>

  </spouse>

</person>

example 6: misc.xml

 

The XSD can be compiled by the command

> {ok, Model} = erlsom:compile_xsd_file("misc.xsd",

                                        [{prefix, "b"}]).

 

After that the XML can be parsed using the command

> {ok, Out, Rest} = erlsom:scan_file("misc_example.xml", Model).

 

Out is the output shown below, and Rest is a string of the characters that may follow after the end tag of the XML.

 

 

{'b:personType',[],

                {'b:personType/id',[], "passport","123"},

                {'b:personType-hobby',[], "mowing the lawn"},

                {'b:name',[], "Jan","Pietersen"},

                {'b:personType/spouse',[], "Jet Pietersen","33"}}

example 7: output for misc.xml

 

Installation

The easiest way to install Erlsom is probably to get it from CEAN (http://cean.process-one.net). If you don’t want to do that, or if it doesn’t work for you, you can also install it from Sourceforge.

 

I have no experience with installation procedures, makefiles etc for Erlang. Fortunately, Klacke (Claes Wickstrom) has provided a makefile. This should enable Unix users to install Erlsom easily.

 

Anyway, even for Windows users installing erlsom should be straightforward. One way to do it is described below.

 

-          Put all the files into the directory ‘ROOT/lib/erlsom-1.2.1/src’, where ROOT is the directory that contains Erlang (C:\Program Files\erl5.6.1 on my Windows system).

-          Start the Erlang shell

-          Change the working directory to ‘ROOT/lib/erlsom-1.2.1/src’:


1> cd('../lib/erlsom-1.2.1/src').
C:/Program Files/erl5.6.1/lib/erlsom-1.2.1/src
ok

 

-          Compile the source files:

 

2> c("erlsom"),

c("erlsom_parse"),

c("erlsom_lib"),

c("erlsom_compile"),

c("erlsom_write"),

c("erlsom_parseXsd"),

c("erlsom_sax"),

c("erlsom_pass2"),

c("erlsom_writeHrl"),

c("erlsom_add"),

c("erlsom_ucs"),

c("erlsom_sax_utf8"),

c("erlsom_sax_latin1"),

c("erlsom_sax_utf16be"),

c("erlsom_sax_utf16le"),

c("erlsom_sax_list"),

c("erlsom_sax_lib"),

c("erlsom_simple_form").

 

-          Move the .beam files to ‘ROOT/lib/erlsom-1.2.1/ebin’.

 

-     Alternatively you can use emake for the last 2 steps:

 

2> make:all([{outdir, "../ebin"}]).

Examples

The distribution includes 7 examples:

 

-          erlsom_example: this shows the use of the basic functions to compile an XSD, to parse an XML document and to write an XML document.

To run the example from the Erlang shell: cd to the directory that contains the code (something like
cd('lib/erlsom-1.2.1/examples/erlsom_example').), compile (c("erlsom_example").) and run (erlsom_example:run().).

 

-          erlsom_sax_example: this shows the features of the SAX parser.

 

-          example1: this example has 2 purposes:

o        It shows how easy Erlsom makes it for you to use an XML configuration file. The configuration file describes a set of 10 test cases, which are run by this example. The configuration file is described by “example1.xsd”. Compiling this XSD and then parsing the configuration file (“example1.xml”) gives you access to an Erlang structure of records that corresponds with the XML schema.

o        It shows how 11 different schemas (named “abb1.xsd” through “abb11.xsd”) can describe the same XML document (named “abb.xml”), and it shows the output that results from running Erlsom on this file using these schemas.
To run the example for XSD abb1.xsd, use the command
example1:test_erlsom("abb1").

 

-          soap_example: this shows how to use the erlsom:add_xsd_file() function, and it gives an example how you might parse and generate SOAP messages.

 

-          continuation: this shows how to use the sax parser with a ‘continuation-function’. This can be used for parsing of very big files or streams. The continuation function should return a block of data; this will be parsed (calling the sax callback function when appropriate) and after that the function is called again to get the next block of data. The example shows how a file of arbitrary size can be parsed. The comments in the code should help you to understand and use this function.

 

-          complex_form: shows how you could create a back-end to the sax parser that produces the same output as Xmerl, and how you could then use the Xpath functions that Xmerl provides.

 

-          book_store: actually three examples, demonstrating the three modes that erlsom supports. The third example shows how you might combine different modes within a function that scans a file.

Character encoding

The sax parser accepts binaries as input. It will recognize UTF-8 and UTF-16 encoding by looking at the byte order mark and the first character of the document. Additionally ISO-8859-1 encoding is recognized if this is indicated by the XML declaration. If the XML declaration specifies another character set, an error will be thrown. It should not be very difficult to add support for other character sets, however.

 

As specified by the XML standard, the default encoding is UTF-8. If the first byte of the document is a ‘<’ ASCII character and if the XML declaration does not specify anything else, it will be assumed that the encoding is UTF-8.

 

The result of erlsom:write is a list of Unicode code points. Normally this will have to be encoded before it can be used. The function xmerl_ucs:to_utf8/1 can be used to do this.
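A minimal sketch of that encoding step is shown below. It assumes that the xmerl application is available (it is part of a standard Erlang/OTP installation); the module and function names are made up, and the input is simply a list of code points such as erlsom:write would produce.

```erlang
-module(encode_demo).
-export([to_utf8/1]).

%% Encodes a list of Unicode code points as a UTF-8 binary.
%% xmerl_ucs:to_utf8/1 returns a list of bytes, which is then
%% turned into a binary so that it can be written to a file or socket.
to_utf8(CodePoints) ->
    list_to_binary(xmerl_ucs:to_utf8(CodePoints)).
```

Code points above 127 become more than one byte: the character with code point 233 is encoded as the two bytes 195 and 169.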

Creation of atoms

Especially in the context of internet applications, it may be a problem if new atoms are created as a result of communication based on XML (SOAP, XML-RPC, AJAX). The number of atoms that can be created within the Erlang runtime environment is limited, and uncontrolled creation of atoms may cause the system to crash.

 

erlsom:scan/2 does not create new atoms. It uses list_to_existing_atom to create the atoms that are used in the records.

 

erlsom:compile_xsd does create atoms. However, usually this function won’t be called with arbitrary end user input as its argument, so normally this should not be a problem. 
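The same technique can be used in your own code when names from an XML document have to be turned into atoms. The sketch below relies on the built-in function list_to_existing_atom/1, which raises badarg when the atom does not exist yet; the module name, the function name and the error value are made up for the illustration.

```erlang
-module(atom_demo).
-export([safe_atom/1]).

%% Converts Name to an atom only if that atom already exists,
%% so that untrusted input cannot fill up the atom table.
safe_atom(Name) ->
    try
        {ok, list_to_existing_atom(Name)}
    catch
        error:badarg -> {error, unknown_name}
    end.
```

A name that the system has never seen is rejected instead of being added to the atom table.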

Limitations

Some checks/validity constraints are accepted in the XSD, but not enforced during parsing:

 

-    all simple types are interpreted as string. This applies to the built in types (float, positiveInteger, gYear etc), and also to types that are restricted (using 'facets') or extended (for example 'union' types). The only exceptions are Integer, Boolean and QName; these are translated.

-    Key, Unique etc. are not supported – if these elements occur in the XSD, they are simply ignored.

 

The SAX parser has the following limitations:

 

-          It doesn’t support external entities.

-          It doesn’t do any validation: if the XML includes a DTD, this is simply ignored.

 

The data binder has the following additional limitation:

-          Names of elements and attributes cannot contain characters outside the Erlang character set (because they are translated to atoms).

 

 

XML Schema elements

All

Supported. The parser puts the elements into the resulting record in a fixed place (independent of the order in which they are received).

Annotation

Ignored (anything enclosed in <documentation></documentation> is ignored).

Any

Supported. However, only elements that are included in the model will show up in the result. Elements are part of the model if they are included in the XSD that was compiled, or if they have been added using erlsom:add_file().

anyAttribute

Supported

Appinfo

Ignored (anything enclosed in <documentation></documentation> is ignored).

Attribute

Supported

attributeGroup

Supported

Choice

Supported

complexContent

Supported

complexType

Supported

Documentation

Accepted, but ignored. Anything enclosed in <documentation></documentation> is ignored (as long as it is valid XML).

Element

Supported

Enumeration

Ignored (all restrictions on simple types are ignored – those types are treated as ‘string’)

Extension

Supported

Field

Ignored (anything enclosed in <unique></unique> is ignored).

Group

Supported.

Import

Supported. However, the support for finding the imported files is limited. See (and modify, if necessary…) the function findFile in erlsom_lib.erl.

Include

Supported. However, the support for finding the included files is limited. See (and modify, if necessary…) the function findFile in erlsom_lib.erl.

Key

Ignored.

Keyref

Ignored

Length

Ignored (all restrictions on simple types are ignored – those types are treated as ‘string’)

List

Ignored (all restrictions on simple types are ignored – those types are treated as ‘string’)

maxInclusive

Ignored (all restrictions on simple types are ignored – those types are treated as ‘string’)

maxLength

(see maxInclusive)

minInclusive

(see maxInclusive)

minLength

(see maxInclusive)

Pattern

(see maxInclusive)

Redefine

Supported. However, the support for finding the imported files is limited. See (and modify, if necessary…) the function findFile in erlsom_lib.erl.

Restriction

Supported as a way to create a derived complex type (but it is not checked whether this is really a restriction of the base type). Ignored on simpleTypes (all restrictions on simple types are ignored – those types are treated as ‘string’)

Schema

Supported

Selector

Ignored (anything enclosed in <unique></unique> is ignored).

Sequence

Supported

simpleContent

Supported

simpleType

Supported

Union

Ignored (all restrictions on simple types are ignored – those types are treated as ‘string’)

Unique

Ignored

 

XML Schema Attributes

Abstract

ignored. As a consequence, the parser may accept documents that contain instances of abstract types.

attributeFormDefault

Supported.

Block

not supported

blockDefault

ignored

Default

ignored (note that this is not just a check that is not performed: the default value will not be provided)

Final

Ignored

finalDefault

Ignored

Fixed

Ignored

Form

not supported

Mixed

Supported (text values are inserted into a list of values)

minOccurs, maxOccurs

supported, except on group definitions

namespace (for ‘any’)

supported, but lists of values are not supported (##any, ##local and ##other are supported). A list of values is treated as ‘##any’.

schemaLocation

supported in a limited way, see ‘import’.

xsi:schemaLocation

Ignored

substitutionGroup

Supported

Type

supported, but there is no check on the built-in types, except for integer, int, boolean and QName.

Use

supported, but ‘prohibited’ is ignored (treated as ‘optional’).

 

Reference

compile_xsd(XSD) -> {ok, Model}

compile_xsd(XSD, Options) -> {ok, Model}

 

Types:

            XSD =  [int()]

            Options = [Option]

            Option =   {prefix, Prefix} |

                             {type_prefix, TypePrefix} |

                             {group_prefix, GroupPrefix} |

                             {include_fun, Include_fun} |

                             {include_dirs, Dir_list} |

                             {include_files, Include_files}

 

            Model = the internal representation of the XSD

 

Compiles an XSD into a structure to be used by erlsom:scan() and erlsom:write(). Returns {ok, Model} or {error, Error}.

 

XSD can be an encoded binary (see section on character encoding) or a decoded list of Unicode code points. 

 

An explanation of the Options:

 

Prefix is prefixed to the record names in the XSD. It should be a string or 'undefined'. If it is 'undefined', no prefix will be applied. The default is 'undefined' (no prefix).  The prefix specified with this option is applied to the records that correspond to  types from the target namespace of the specified XSD. Different prefixes can be specified for XSDs that are imported, see the other options below.

 

          Note that erlsom:write() uses the prefixes to assign the namespaces. As a consequence, you should use prefixes if your XML documents use elements from more than one namespace (or if they contain a mixture of elements that are namespace qualified and elements that are not).

 

TypePrefix is prefixed to the record names that correspond to type definitions in the XSD. It should be a string.

Record definitions are created for elements, groups and types. In the XSD there may be groups, elements and types with the same name; this would lead to more than one record with the same name. In order to avoid the problems that this would create, it is possible to specify a prefix that will be put in between the namespace prefix (see above) and the name of the type.

 

GroupPrefix is prefixed to the record names that correspond to group definitions in the XSD. It should be a string. See the explanation provided above for the TypePrefix option for the background of this option.

Include_fun is a function that finds the files that are included or imported in the XSD. It should be a function that takes 4 arguments:

          - Namespace (from the XSD). This is a string or 'undefined'

          - SchemaLocation (from the XSD). This is a string or 'undefined'

          - Include_files. This is the value of the ‘include_files’ option if this option was passed to compile_xsd(); [] otherwise.

          - Dir_list. This is the value of the ‘dir_list’ option if this option was passed to compile_xsd(); 'undefined' otherwise.

 

Include_fun should return {XSD, Prefix}, where XSD = string() (the contents of the XSD) and Prefix = string() or 'undefined'. If Prefix is 'undefined', ‘P’ will be used.

 

Include_fun defaults to a function that uses the Dir_list and Include_files options as specified below.

 

Include_files  is a list of tuples {Namespace, Prefix, Location}. Default is [].

 

Dir_list is a list of directories (strings). It defaults to ["."].

 

Behavior for include and import:

 

If the 'include_fun' option was specified, this function will be called. This should return both the contents of the file as a string and the prefix (a tuple {Xsd, Prefix}).

 

Otherwise, if the 'include_files' option is present, the list provided with this option will be searched for a matching namespace. If this is found, the specified prefix will be used. If a file is also specified, then this file will be used. If no file is specified (value is undefined), then the 'location' attribute and the 'dir_list' option will be used to locate the file.

 

If the 'include_files' option is not present, or if the namespace is not found, then the file will be searched for in the dir_list (based on the 'location' attribute). No prefix will be used.
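
As an illustration of the include_fun contract described above, here is a minimal sketch of a function that serves imported XSDs from an in-memory table instead of the file system. The module name xsd_lookup, the namespace "urn:example:address", the prefix "addr" and the (empty) schema text are all hypothetical; only the four-argument shape and the {XSD, Prefix} return value are taken from the description above.

```erlang
-module(xsd_lookup).
-export([include_fun/4]).

%% Hypothetical in-memory table: namespace -> {XSD text, prefix}.
schemas() ->
    [{"urn:example:address",
      {"<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"/>", "addr"}}].

%% Takes the four arguments described for Include_fun. Only Namespace is
%% used in this sketch; the other three arguments are ignored.
include_fun(Namespace, _SchemaLocation, _IncludeFiles, _DirList) ->
    case lists:keyfind(Namespace, 1, schemas()) of
        {Namespace, {Xsd, Prefix}} ->
            {Xsd, Prefix};
        false ->
            %% Failing loudly is a choice of this sketch.
            erlang:error({unknown_namespace, Namespace})
    end.
```

It would presumably be passed as erlsom:compile_xsd(Xsd, [{include_fun, fun xsd_lookup:include_fun/4}]).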

 

compile_xsd_file(XSD) -> {ok, Model}

compile_xsd_file(XSD, Options) -> {ok, Model}

 

As compile_xsd(), but taking its input from a file.

 

add_xsd_file(FileName, Options, Model) -> {ok, Model}

 

Compiles an XSD file (FileName), and adds the elements defined by this XSD to Model. The purpose is to add elements (namespaces) to a model that uses the XML Schema ‘any’ construct. Only elements that are part of the model will be part of the output of ‘scan()’! See the soap example for an example where this is used.

 

See compile_xsd() for a description of the options.

 

 

scan(XML, Model) -> {ok, Struct, Rest}

scan(XML, Model, Options) -> {ok, Struct, Rest}

 

Types:

            XML = [int()] or an encoded binary

            Model = the internal representation of the XSD, result of erlsom:compile_xsd()

Options = [Option]

            Option =  {continuation_function, Continuation_function,  Continuation_state} |

                            {output_encoding, utf8}

            Struct = the translation of the XML document to an Erlang data structure

            Rest = list of characters that follow after the end of the XML document

 

Translates an XML document that conforms to the XSD to a structure of records.

 

Returns {ok, Struct, Rest} or {error, Error}.

 

Error has the following structure:

[{exception, Exception}, {stack, Stack}, {received, Event}], where:

 

Exception is the exception that was thrown by the program

Stack is a representation of the 'stack' that is maintained by erlsom.

Event is the sax event that erlsom was processing when it ran into problems.
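
Since Error is a list of tagged tuples, it can be picked apart with the standard proplists module. The following is a small sketch of how the result of scan() might be turned into a log message; the module name scan_errors and the message wording are made up for this example, and it assumes that Error has exactly the structure shown above.

```erlang
-module(scan_errors).
-export([describe/1]).

%% Accepts the result of a scan() call and returns a printable string.
describe({ok, _Struct, _Rest}) ->
    "ok";
describe({error, Error}) when is_list(Error) ->
    %% Error = [{exception, Exception}, {stack, Stack}, {received, Event}]
    Exception = proplists:get_value(exception, Error),
    Event = proplists:get_value(received, Error),
    lists:flatten(
      io_lib:format("scan failed: ~p (while handling ~p)",
                    [Exception, Event])).
```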

 

If specified, the continuation function is called whenever the end of the input XML document is reached before the parsing of the XML has finished. The function should have 1 argument (Continuation_state). It should return a tuple {NewData, NewState}, where NewData is the next block of data (again a list of Unicode code points or binary data; the data type has to be the same for each invocation, and it has to match the data type of XML), and NewState is the information that is passed to the next invocation. Note: if the encoding of the document supports multi-byte characters (UTF-8, UTF-16), you don’t have to ensure that each block of data contains only complete characters, but in the case of UTF-16 encoding you do have to ensure that you return an even number of bytes.

 

If the ‘output_encoding’ option is used, the text values will be binary encoded – but the values that are specified as integer in the XSD will still be integers.

 

 

scan_file(XMLFile, Model) -> {ok, Struct, Rest}

 

As scan(), but taking its input from a file.

 

 

write(Struct, Model) -> {ok, XML}

 

Types:

Struct = a structure that represents an XML document

Model = the internal representation of the XSD

XML = [int()].

 

Translates a structure of records to an XML document. It is the inverse of erlsom:scan().

 

Note that the output is a list of Unicode code points. If you want to write it to a file, or send it over a wire, you should transform it to binary, and generally you should encode it. You can use xmerl_ucs:to_utf8() to do this.
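
The encoding step mentioned above could be wrapped in a small helper like the one below. This is only a sketch: the module name xml_out is hypothetical, and the helper assumes that its argument is a flat list of Unicode code points, as returned by write().

```erlang
-module(xml_out).
-export([to_utf8_binary/1]).

%% Xml is a flat list of Unicode code points.
%% xmerl_ucs:to_utf8/1 turns it into a list of UTF-8 bytes;
%% list_to_binary/1 packs those bytes into a binary that can be
%% written to a file or sent over a socket.
to_utf8_binary(Xml) when is_list(Xml) ->
    list_to_binary(xmerl_ucs:to_utf8(Xml)).
```

Typical use would be along the lines of {ok, Xml} = erlsom:write(Struct, Model), followed by file:write_file(FileName, xml_out:to_utf8_binary(Xml)).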

 

 

write_xsd_hrl_file(XSD, Output, Options) -> ok

 

Types:

XSD = the name of the file that contains the XSD

            Options = a list of Options, see compile_xsd().

            Output = the name of the output file

           

Produces a set of record definitions for the types defined by the XSD. Note that the options have to be identical to those that are passed to compile_xsd().

  

 

parse_sax(XML, Acc0, EventFun, Options) -> {ok, AccOut, Rest} 

 

Types:

Xml  - [int()], a list of Unicode code points

 

Acc0 - a term() that is passed to the EventFun.

 

EventFun - a fun() that is called by the parser whenever it has parsed a bit of the Xml input

 

EventFun should accept the following arguments:

- Event, a tuple that describes the event, see above.

- AccIn , a term() – Acc0 for the first invocation, and the result from the previous invocation for each of the following invocations.

 

EventFun should return AccOut, a term() that will be passed back to the next invocation of EventFun.
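
To make the accumulator mechanism concrete, here is a sketch of an EventFun that counts start tags. It assumes that start tags are reported as {startElement, Uri, LocalName, Prefix, Attributes}, as listed in the section on SAX events. The module name sax_count and the helper replay/1 are hypothetical: replay/1 merely stands in for the parser by feeding a prepared list of events through the fun, so that the example can be tried without an XML document.

```erlang
-module(sax_count).
-export([count_elements/2, replay/1]).

%% EventFun: takes (Event, AccIn) and returns AccOut.
%% Here the accumulator is simply the number of start tags seen so far.
count_elements({startElement, _Uri, _LocalName, _Prefix, _Attributes}, Acc) ->
    Acc + 1;
count_elements(_OtherEvent, Acc) ->
    Acc.

%% Stand-in for the parser: threads the accumulator through a list of
%% events in the same way, starting with Acc0 = 0.
replay(Events) ->
    lists:foldl(fun count_elements/2, 0, Events).
```

With a real document, the call would presumably be erlsom:parse_sax(Xml, 0, fun sax_count:count_elements/2, []).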

 

Options – [Option]

 

Option – {continuation_function, CState, CFunction} | {output_format, utf8}

 

The ‘output_format’ option determines the encoding of the 'character data': element values and attribute values. The only supported encoding at this moment is 'utf8'. The default is string().

 

CFunction should be a function that takes 2 arguments: Tail and State.

 - Tail is the (short) list of characters (or a short binary) that could not yet be parsed because it is (or might be) an incomplete token, or because an encoded character is not complete. Since this still has to be parsed, CFunction should include this in front of the next block of data.

- State is information that is passed by the parser to the callback function transparently. This can be used to keep track of the location in the file etc.

The function returns {NewData, NewState}, where NewData is a list of characters/unicode code points/binary, and NewState the new value for the State. NewData has to be in the same type of encoding as the first part of the document.  

Note: if the encoding of the document supports multi-byte characters (UTF-8, UTF-16), you don’t have to ensure that each block of data contains only complete characters, but in the case of UTF-16 you do have to ensure that you return an even number of bytes.
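
A continuation function of the shape described above could look roughly like this. It is a sketch only: the module name chunk_feed and the use of an in-memory list of binary chunks as State are assumptions made for illustration (a real application would more likely keep a file handle and an offset in State). Returning Tail unchanged once the chunks have run out is a choice of this sketch, not something specified here.

```erlang
-module(chunk_feed).
-export([next/2]).

%% CFunction: takes (Tail, State) and returns {NewData, NewState}.
%% State is the list of binary chunks that have not been handed out yet.
next(Tail, [Chunk | More]) when is_binary(Tail), is_binary(Chunk) ->
    %% The unparsed Tail has to come first, followed by the new data.
    {<<Tail/binary, Chunk/binary>>, More};
next(Tail, []) when is_binary(Tail) ->
    %% Nothing left to add: hand back what we were given.
    {Tail, []}.
```

Because the first chunk is a binary here, every later chunk is a binary as well, which satisfies the requirement that NewData uses the same type of encoding as the first part of the document.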

 

 

AccOut - the result of the last invocation of EventFun.

 

Rest - list of characters that follow after the end of the XML document

 

simple_form(XML) -> {ok, SimpleFormElement, Rest}

simple_form(XML, Options) -> {ok, SimpleFormElement, Rest}

 

Types:

XML = [int()] or an encoded binary

Options = [Option]

            Option =  {nameFun, NameFun} |

                            {output_encoding, utf8}

            SimpleFormElement = {Tag, Attributes, Content},

            Rest = list of characters that follow after the end of the XML document

           

Tag is a string (unless otherwise specified through the nameFun option, see below), Attributes = [{AttributeName, Value}], and Content is a list of SimpleFormElements and/or strings.

 

NameFun is a function with 3 arguments: Name, Namespace, Prefix. It should return a term. It is called for each tag and attribute name. The result will be used in the output. The default is Name if Namespace == undefined, and a string {Namespace}Name otherwise.
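
Two possible NameFun implementations are sketched below. The module name name_funs and both function names are hypothetical; the only thing taken from the description above is the (Name, Namespace, Prefix) argument list and the fact that any term may be returned.

```erlang
-module(name_funs).
-export([local_only/3, ns_pair/3]).

%% Drops the namespace entirely: tags come out as plain local names.
local_only(Name, _Namespace, _Prefix) ->
    Name.

%% Keeps the namespace, but as a tuple rather than a "{Namespace}Name"
%% string, which can be easier to pattern match on.
ns_pair(Name, undefined, _Prefix) ->
    Name;
ns_pair(Name, Namespace, _Prefix) ->
    {Namespace, Name}.
```

Either one would be selected with the nameFun option, for example erlsom:simple_form(Xml, [{nameFun, fun name_funs:local_only/3}]).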

 

erlsom_lib:toUnicode(XML) -> DecodedXML

 

Types:

XML = the XML in binary form.

DecodedXML = the XML in the form of a list of Unicode code points.

 

Decodes the XML, see the section on character encoding above.

 

erlsom_lib:find_xsd(Namespace, Location, Dir_list, Include_list) -> {XSD, Prefix}

 

Types:

Namespace (from the XSD). This is a string or 'undefined'

Location (from the XSD). This is a string or 'undefined'

Dir_list. This is the value of the Dir_list option if this option was passed to compile_xsd(); 'undefined' otherwise.

Include_list. This is the value of the Include_list option if this option was passed to compile_xsd(); 'undefined' otherwise.

 

The function erlsom_lib:find_xsd can be passed to compile_xsd as the value for the 'include_fun' option. It will attempt to get imported XSDs from the internet (if the import, include or redefine statement includes a ‘location’ attribute in the form of a URL).

 

If find_xsd cannot find the file on the internet, it will attempt to find the file using the standard function, see the description provided above with the compile_xsd function.

 

erlsom_lib:detect_encoding(Document) -> {Encoding, Binary}

 

Types:

Document = the XML document, either in binary form or as a list

Encoding = the encoding, as an atom

Binary = the XML document in binary form.

 

Tries to detect the encoding. It looks at the first couple of bytes. If these bytes cannot give a definitive answer, it looks into the xml declaration.

 

Possible values for Encoding:

ucs4be

ucs4le

utf16be

utf16le

utf8

iso_8859_1

 

The second return value is identical to the input if the input was in binary form, and the translation to the binary form if the input was a list.

 

(The basis of this function was copied from xmerl_lib, but it was extended to look into the xml declaration.)

 

erlsom_ucs:from_utf8(Data) -> {List, Tail}

erlsom_ucs:from_utf16le(Data) -> {List, Tail}

erlsom_ucs:from_utf16be(Data) -> {List, Tail}

 

Types:

Data = a block of data, either as a list of bytes or as a binary

List = the input translated to a list of Unicode code points

Tail = remaining bytes at the end of the input (a list of bytes).

 

These functions are based on the corresponding functions in xmerl_ucs, but they have been modified so that they can be used to translate blocks of data. The end of a block can be in the middle of a series of bytes that together correspond to 1 Unicode code point. The remaining bytes are returned, so that they can be put in front of the next block of data.

 

Note on performance: the functions work on lists, not binaries! If the input is a binary, this is translated to a list in a first step, since the functions are faster that way. If you are reading the xml document from a file, it is probably fastest to use pread() in such a way that it returns a list, and not a binary.

 

See the ‘continuation’ example for an example of how this can be used to deal with very large documents (or streams of data).
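
The way the Tail is carried from one block to the next can be sketched as follows. The module name block_decode and both function names are hypothetical. decode_blocks/2 accepts any decoder fun of the shape Data -> {List, Tail}. So that the sketch can be run without erlsom, it includes stdlib_utf8/1, a stand-in built on the standard unicode module; it is not the erlsom_ucs implementation, it merely returns results of the same shape.

```erlang
-module(block_decode).
-export([decode_blocks/2, stdlib_utf8/1]).

%% Decodes a list of binary blocks with Decode, a fun of the shape
%% Data -> {List, Tail}. Returns {CodePoints, LeftoverBytes}.
decode_blocks(Decode, Blocks) ->
    decode_blocks(Decode, Blocks, <<>>, []).

decode_blocks(_Decode, [], Tail, Acc) ->
    {lists:append(lists:reverse(Acc)), Tail};
decode_blocks(Decode, [Block | More], Tail, Acc) ->
    %% Put the leftover bytes in front of the next block before decoding.
    {Chars, NewTail} = Decode(<<Tail/binary, Block/binary>>),
    %% NewTail may be a list of bytes; normalise it to a binary.
    decode_blocks(Decode, More, iolist_to_binary(NewTail), [Chars | Acc]).

%% Stand-in decoder (standard library, NOT erlsom_ucs): returns the
%% decoded code points and the bytes of an incomplete trailing character.
stdlib_utf8(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        {incomplete, Chars, Rest} -> {Chars, binary_to_list(Rest)};
        Chars when is_list(Chars) -> {Chars, []}
    end.
```

With erlsom available, fun erlsom_ucs:from_utf8/1 would presumably take the place of the stand-in decoder.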

 

parse(XML, Model) -> {ok, Struct} 

Note: This function has been replaced by scan()! Please use scan().

 

parse_file(XMLFile, Model) -> {ok, Struct}

Note: This function has been replaced by scan_file()! Please use scan_file().

 

write_hrl(XSD, Namespaces, Output) -> ok

This function has been replaced by write_xsd_hrl_file()! Please use write_xsd_hrl_file().

 

erlsom_sax:parseDocument(Xml, State, EventFun) -> {State2, Rest}

erlsom_sax:parseDocument(Xml, State, EventFun, Options) -> {State2, Rest}

Obsolete, use parse_sax().

 

compile(XSD, Prefix, Namespaces) -> {ok, Model}

This function has been replaced by compile_xsd()! Please use compile_xsd().

 

compile_file(FileName, Prefix, Namespaces) -> {ok, Model}

This function has been replaced by compile_xsd_file()! Please use compile_xsd_file().

 

add_file(FileName, Prefix, Model) -> Model 

This function has been replaced by add_xsd_file()! Please use add_xsd_file().

Author

Willem de Jong

 

Copyright © 2006, 2007, 2008 Willem de Jong