Erlsom is an Erlang library to parse (and generate) XML documents.
Erlsom can be used in three very different modes:
- As a SAX parser. This is a more or less standardized model (see
http://www.saxproject.org/apidoc/org/xml/sax/ContentHandler.html) for parsing
XML. Every time the parser has processed a meaningful part of the XML document
(such as a start tag), it reports this to your application as an event. The
application can process this information (potentially in parallel) while the
parser continues to parse the rest of the document. The SAX parser allows
you to efficiently parse XML documents of arbitrary size, but it may take some
time to get used to. If you invest some effort, you may find that it fits
very well with the Erlang programming model (personally I have always been
very happy about my choice to use a SAX parser as the basis for the rest of
Erlsom).
- As a simple sort of DOM parser. Erlsom can translate your XML to
the ‘simple form’ that is used by Xmerl. This is a form that is easy to
understand, but you have to search your way through the output to get to the
information that you need.
- As a ‘data binder’. Erlsom can translate the XML document to an
Erlang data structure that corresponds to an XML Schema. It has the advantage
over the SAX parser that it validates the XML document, and that you know
exactly what the layout of the output will be. This makes it easy to access the
elements that you need in a very direct way. (See
http://www.rpbourret.com/xml/XMLDataBinding.htm for a general description
of XML data binding.)
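
The three modes above can be sketched as follows. This is a minimal
illustration, not a complete program: the module name and function names are
made up for the example, and it assumes that erlsom is on the code path and
that erlsom:parse_sax/3, erlsom:simple_form/1, erlsom:compile_xsd_file/1 and
erlsom:scan/2 behave as described in the Erlsom reference documentation.

```erlang
%% Hypothetical module, for illustration only.
-module(erlsom_modes_sketch).
-export([count_elements/1, to_simple_form/1, bind/2]).

%% SAX mode: the callback receives each event together with an accumulator
%% and returns the new accumulator. Here it counts start tags and ignores
%% every other event.
count_elements(Xml) ->
    Count = fun({startElement, _Uri, _LocalName, _Prefix, _Attrs}, N) -> N + 1;
               (_OtherEvent, N) -> N
            end,
    {ok, Total, _Rest} = erlsom:parse_sax(Xml, 0, Count),
    Total.

%% Simple-form mode: the result is a nested {Tag, Attributes, Content} tuple
%% that you search through yourself.
to_simple_form(Xml) ->
    {ok, Element, _Rest} = erlsom:simple_form(Xml),
    Element.

%% Data-binder mode: compile the XML Schema into a model, then parse (and
%% validate) the document against that model.
bind(XsdFile, Xml) ->
    {ok, Model} = erlsom:compile_xsd_file(XsdFile),
    {ok, Result, _Rest} = erlsom:scan(Xml, Model),
    Result.
```

For example, erlsom_modes_sketch:count_elements(<<"<a><b/><b/></a>">>) runs
the SAX parser over a small binary. bind/2 additionally needs the path of an
XSD file that describes the document.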
For all modes the following applies:
- If the document is too big to fit into memory, or if the document
arrives in some kind of data stream, it can be passed to the parser in blocks
of arbitrary size.
- The parser can work directly on binaries. There is no need to
transform binaries to lists before passing the data to Erlsom. Using binaries
as input has a positive effect on the memory usage and on the speed (provided
that you are using Erlang R12B or later - if you are using an older Erlang
version the speed will be better if you transform binaries to lists). The
binaries can be latin-1, utf-8 or utf-16 encoded.
- The parser has an option to produce output in binary form (only the
character data; names of elements and attributes are always strings). This may
be convenient if you want to minimize the memory usage, and/or if you need the
result in binary format for further processing. Note that it will slow down the
parser slightly. If you select this option, the encoding of the result will be
utf-8 (irrespective of the encoding of the input document).
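
The points above (input in blocks, binary input, binary output) might look
like this in practice. This is a sketch under assumptions: the module and
function names are hypothetical, and the option names
{continuation_function, Fun, State} and {output_encoding, utf8} are taken
from my reading of the Erlsom reference documentation, so check them against
the version you use.

```erlang
%% Hypothetical module, for illustration only.
-module(erlsom_options_sketch).
-export([count_in_file/2, simple_form_binary/1]).

%% Feed a file to the SAX parser in blocks of ChunkSize bytes and count the
%% start tags, without ever holding the whole document in memory.
count_in_file(FileName, ChunkSize) ->
    {ok, Handle} = file:open(FileName, [read, raw, binary]),
    Count = fun({startElement, _Uri, _LocalName, _Prefix, _Attrs}, N) -> N + 1;
               (_OtherEvent, N) -> N
            end,
    Options = [{continuation_function, fun next_chunk/2,
                {Handle, 0, ChunkSize}}],
    %% Start with an empty binary; the parser asks next_chunk/2 for more data.
    {ok, Total, _Rest} = erlsom:parse_sax(<<>>, 0, Count, Options),
    ok = file:close(Handle),
    Total.

%% Called by the parser when it runs out of input. Tail is the part of the
%% previous block that has not been parsed yet; it is prepended to the new
%% data. The function returns {NewData, NewContinuationState}.
next_chunk(Tail, {Handle, Offset, ChunkSize}) ->
    case file:pread(Handle, Offset, ChunkSize) of
        {ok, Data} ->
            {<<Tail/binary, Data/binary>>,
             {Handle, Offset + ChunkSize, ChunkSize}};
        eof ->
            {Tail, {Handle, Offset, ChunkSize}}
    end.

%% Binary in, binary out: the input is passed as a binary, and the character
%% data in the result is returned as utf-8 binaries instead of strings.
simple_form_binary(Xml) when is_binary(Xml) ->
    {ok, Element, _Rest} = erlsom:simple_form(Xml, [{output_encoding, utf8}]),
    Element.
```

The continuation state is an ordinary Erlang term, so the same pattern should
work for a socket or any other source that delivers the document piece by
piece.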