<!-- neon XML interface -*- text -*- -->

<title>Parsing XML</title>

<para>The &neon; XML interface is exposed by the
<filename>ne_xml.h</filename> header file. This interface wraps the
standard <ulink url="http://www.saxproject.org/">SAX</ulink> API used
by XML parsers, adds an abstraction called <firstterm>stacked SAX
handlers</firstterm>, and provides consistent <ulink
url="http://www.w3.org/TR/REC-xml-names">XML Namespace</ulink> support.</para>

<title>Introduction to SAX</title>

<para>A SAX-based parser works by emitting a sequence of
<firstterm>events</firstterm> to reflect the tokens being parsed
from the XML document. For example, parsing the following document:

<programlisting><![CDATA[
<hello>world</hello>
]]></programlisting>

results in the following events:

<orderedlist>
<listitem>
<simpara>&startelm; "hello"</simpara>
</listitem>
<listitem>
<simpara>&cdata; "world"</simpara>
</listitem>
<listitem>
<simpara>&endelm; "hello"</simpara>
</listitem>
</orderedlist>

This example demonstrates the three event types used in the
subset of SAX exposed by the &neon; XML interface: &startelm;,
&cdata; and &endelm;. In a C API, an <quote>event</quote> is
implemented as a function callback; three callback types are used in
&neon;, one for each type of event.</para>
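
<para>The mapping from events to callbacks can be sketched with a toy
tokenizer. This is a standalone illustration, not the &neon;
implementation: the callback types, <function>toy_parse</function> and
the <literal>on_*</literal> functions are hypothetical names, and the
tokenizer understands only plain start tags, end tags and text (no
attributes, entities or comments).</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback types, one per event type (not the ne_xml.h ones). */
typedef void start_cb(void *userdata, const char *name);
typedef void cdata_cb(void *userdata, const char *text, size_t len);
typedef void end_cb(void *userdata, const char *name);

/* Toy tokenizer: understands only <name>, </name> and plain text.
 * Returns 0 on success, -1 on malformed input. */
static int toy_parse(const char *doc, start_cb *start, cdata_cb *cdata,
                     end_cb *end, void *userdata)
{
    char name[64];

    assert(doc != NULL);
    while (*doc != '\0') {
        if (*doc == '<') {
            int closing = (doc[1] == '/');
            const char *begin = doc + 1 + closing;
            const char *gt = strchr(begin, '>');
            size_t len;

            if (gt == NULL) return -1;
            len = (size_t)(gt - begin);
            if (len == 0 || len >= sizeof name) return -1;
            memcpy(name, begin, len);
            name[len] = '\0';
            if (closing) end(userdata, name);
            else start(userdata, name);
            doc = gt + 1;
        } else {
            const char *next = strchr(doc, '<');
            size_t len = next ? (size_t)(next - doc) : strlen(doc);
            cdata(userdata, doc, len);
            doc += len;
        }
    }
    return 0;
}

/* Example callbacks: append one line per event to a fixed-size buffer. */
struct trace { char buf[256]; };

static void trace_add(void *userdata, const char *kind,
                      const char *text, size_t len)
{
    struct trace *t = userdata;
    size_t used = strlen(t->buf);
    snprintf(t->buf + used, sizeof t->buf - used, "%s \"%.*s\"\n",
             kind, (int)len, text);
}

static void on_start(void *ud, const char *name)
{ trace_add(ud, "start-element", name, strlen(name)); }
static void on_cdata(void *ud, const char *text, size_t len)
{ trace_add(ud, "cdata", text, len); }
static void on_end(void *ud, const char *name)
{ trace_add(ud, "end-element", name, strlen(name)); }
]]></programlisting>

<para>Passing the <literal>hello</literal> document and the three
<literal>on_*</literal> callbacks to <function>toy_parse</function>
records one line per event, in the same order as the list
above.</para>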

<sect2 id="xml-stacked">
<title>Stacked SAX handlers</title>

<para>WebDAV property values are represented as fragments of XML,
transmitted as parts of larger XML documents over HTTP (notably in
the body of the response to a <literal>PROPFIND</literal> request).
When &neon; parses such documents, the SAX events generated for
these property value fragments may need to be handled by the
application, since &neon; has no knowledge of the structure of
properties used by the application.</para>

<para>To solve this problem<footnote><para>This
<quote>problem</quote> only needs solving because the SAX interface
is so inflexible when implemented as C function callbacks; a better
approach would be to use an XML parser interface which is not based
on callbacks.</para></footnote> the &neon; XML interface introduces
the concept of a <firstterm>SAX handler</firstterm>. A SAX handler
comprises a &startelm;, &cdata; and &endelm; callback; the
&startelm; callback is defined such that each handler may
<emphasis>accept</emphasis> or <emphasis>decline</emphasis> the
&startelm; event. Handlers are composed into a <firstterm>handler
stack</firstterm> before parsing a document. When a new &startelm;
event is generated by the XML parser, &neon; invokes each &startelm;
callback in the handler stack in turn until one accepts the event.
The handler which accepts the event will then be passed &cdata;
events if the element contains character data, followed by an
&endelm; event when the element is closed. If no handler in the
stack accepts a &startelm; event, that branch of the document tree
is ignored.</para>

<para>To illustrate, given a handler A, which accepts the
<literal>cat</literal> and <literal>age</literal> elements, and a
handler B, which accepts the <literal>name</literal> element, the
following document:

<example id="xml-example">
<title>An example XML document</title>
<programlisting><![CDATA[
<cat>
  <age>3</age>
  <name>Bob</name>
</cat>
]]></programlisting></example>

would be parsed as follows:

<orderedlist>
<listitem>
<simpara>A &startelm; "cat" → <emphasis>accept</emphasis></simpara>
</listitem>
<listitem>
<simpara>A &startelm; "age" → <emphasis>accept</emphasis></simpara>
</listitem>
<listitem>
<simpara>A &cdata; "3"</simpara>
</listitem>
<listitem>
<simpara>A &endelm; "age"</simpara>
</listitem>
<listitem>
<simpara>A &startelm; "name" → <emphasis>decline</emphasis></simpara>
</listitem>
<listitem>
<simpara>B &startelm; "name" → <emphasis>accept</emphasis></simpara>
</listitem>
<listitem>
<simpara>B &cdata; "Bob"</simpara>
</listitem>
<listitem>
<simpara>B &endelm; "name"</simpara>
</listitem>
<listitem>
<simpara>A &endelm; "cat"</simpara>
</listitem>
</orderedlist></para>
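
<para>The trace above can be reproduced with a small standalone
model of a handler stack. This is a sketch under stated assumptions
rather than the &neon; implementation: every type and function name
below is hypothetical, a handler is reduced to an accept/decline
decision, the search for each &startelm; begins at the handler which
owns the parent element (or at the base of the stack for the root),
and the whitespace-only character data between elements is left
out.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum kind { START, CDATA, END };
struct event { enum kind kind; const char *text; };

/* A handler is reduced to a label and an accept/decline decision. */
struct handler { const char *label; int (*accepts)(const char *name); };

static int a_accepts(const char *n)
{ return strcmp(n, "cat") == 0 || strcmp(n, "age") == 0; }
static int b_accepts(const char *n)
{ return strcmp(n, "name") == 0; }

#define MAX_DEPTH 16

/* Append one trace line to the NUL-terminated buffer `out`. */
static void note(char *out, size_t cap, const char *who, const char *what,
                 const char *text, const char *verdict)
{
    size_t used = strlen(out);
    snprintf(out + used, cap - used, "%s %s \"%s\"%s\n",
             who, what, text, verdict);
}

/* Replay `ev` against the handler stack; stack[0] is the base. */
static void dispatch(const struct handler *stack, int count,
                     const struct event *ev, int nev, char *out, size_t cap)
{
    int owner[MAX_DEPTH]; /* handler index per open element, -1 = ignored */
    int depth = 0, i, h;

    for (i = 0; i < nev; i++) {
        switch (ev[i].kind) {
        case START:
            assert(depth < MAX_DEPTH);
            owner[depth] = -1;
            /* Root: search from the base. Child: search from the parent's
             * handler. Children of an ignored element are ignored too. */
            h = depth == 0 ? 0 : owner[depth - 1];
            for (; h >= 0 && h < count; h++) {
                if (stack[h].accepts(ev[i].text)) {
                    note(out, cap, stack[h].label, "start", ev[i].text,
                         " -> accept");
                    owner[depth] = h;
                    break;
                }
                note(out, cap, stack[h].label, "start", ev[i].text,
                     " -> decline");
            }
            depth++;
            break;
        case CDATA:
            if (depth > 0 && owner[depth - 1] >= 0)
                note(out, cap, stack[owner[depth - 1]].label, "cdata",
                     ev[i].text, "");
            break;
        case END:
            assert(depth > 0);
            depth--;
            if (owner[depth] >= 0)
                note(out, cap, stack[owner[depth]].label, "end",
                     ev[i].text, "");
            break;
        }
    }
}
]]></programlisting>

<para>Replaying the events of the example document against a stack
with A at the base and B at the top yields the nine lines of the
trace in the same order. An element which neither handler accepts
produces only the two <emphasis>decline</emphasis> lines, and its
character data and &endelm; event are dropped.</para>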

<para>The search for a handler which will accept a &startelm; event
begins at the handler of the parent element and continues toward the
top of the stack. For the root element, it begins at the base of
the stack. In the above example, handler A is at the base, and
handler B at the top; if the <literal>name</literal> element had any
children, only B's &startelm; would be invoked to accept
them.</para>

</sect2>

<sect2 id="xml-state">
<title>Maintaining state</title>

<para>To facilitate communication between independent handlers, a
<firstterm>state integer</firstterm> is associated with each element
being parsed. This integer is returned by the &startelm; callback and
is passed to the subsequent &cdata; and &endelm; callbacks
associated with the element. The state integer of the parent
element is also passed to each &startelm; callback, with the value
zero used for the root element (which by definition has no
parent).</para>

<para>To further extend <xref linkend="xml-example"/>: if handler A
defines that the state of the root element <sgmltag>cat</sgmltag>
will be <literal>42</literal>, and handlers A and B likewise choose
<literal>50</literal> and <literal>99</literal> for the
<sgmltag>age</sgmltag> and <sgmltag>name</sgmltag> elements, the
event trace would be as follows:

<orderedlist>
<listitem>
<simpara>A &startelm; (parent = 0, "cat") →
<emphasis>accept</emphasis>, state = 42</simpara>
</listitem>
<listitem>
<simpara>A &startelm; (parent = 42, "age") →
<emphasis>accept</emphasis>, state = 50</simpara>
</listitem>
<listitem>
<simpara>A &cdata; (state = 50, "3")</simpara>
</listitem>
<listitem>
<simpara>A &endelm; (state = 50, "age")</simpara>
</listitem>
<listitem>
<simpara>A &startelm; (parent = 42, "name") →
<emphasis>decline</emphasis></simpara>
</listitem>
<listitem>
<simpara>B &startelm; (parent = 42, "name") →
<emphasis>accept</emphasis>, state = 99</simpara>
</listitem>
<listitem>
<simpara>B &cdata; (state = 99, "Bob")</simpara>
</listitem>
<listitem>
<simpara>B &endelm; (state = 99, "name")</simpara>
</listitem>
<listitem>
<simpara>A &endelm; (state = 42, "cat")</simpara>
</listitem>
</orderedlist></para>

<para>To avoid collisions between state integers used by different
handlers, the interface definition of any handler includes the range
of integers it will use.</para>
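
<para>As a sketch of how two handlers might cooperate through state
integers, the &startelm; decisions of handlers A and B from the
example can be written as follows. The signatures are deliberately
simplified and are not those declared in
<filename>ne_xml.h</filename>; all names are hypothetical, and the
use of zero to mean <emphasis>decline</emphasis> is an assumption of
this sketch.</para>

<programlisting><![CDATA[
#include <assert.h>
#include <string.h>

/* Hypothetical state values; each handler documents the range it uses. */
enum {
    DECLINE = 0,       /* assumption of this sketch: 0 means "declined" */
    A_STATE_CAT = 42,  /* handler A uses the range 42..50 */
    A_STATE_AGE = 50,
    B_STATE_NAME = 99  /* handler B uses 99 only */
};

/* Handler A: accepts <cat> at the root (parent state 0) and <age>
 * directly inside <cat>. Returns the new element's state or DECLINE. */
static int a_start(int parent, const char *name)
{
    assert(name != NULL);
    if (parent == 0 && strcmp(name, "cat") == 0)
        return A_STATE_CAT;
    if (parent == A_STATE_CAT && strcmp(name, "age") == 0)
        return A_STATE_AGE;
    return DECLINE;
}

/* Handler B: accepts <name> only as a child of A's <cat>, which it
 * recognises purely from the parent state integer. */
static int b_start(int parent, const char *name)
{
    assert(name != NULL);
    if (parent == A_STATE_CAT && strcmp(name, "name") == 0)
        return B_STATE_NAME;
    return DECLINE;
}
]]></programlisting>

<para>Because handler B tests the parent state rather than the
element name alone, a <sgmltag>name</sgmltag> element appearing
anywhere other than directly inside <sgmltag>cat</sgmltag> is
declined. This only works if B knows that 42 belongs to A, which is
why each handler publishes the range it uses.</para>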

</sect2>

<title>XML namespaces</title>

<para>To support XML namespaces, every element name is represented
as a <emphasis>(namespace, name)</emphasis> pair. The &startelm;
and &endelm; callbacks are passed namespace and name strings
accordingly. If an element in the XML document has no declared
namespace, the namespace given will be the empty string,
<literal>""</literal>.</para>
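
<para>A callback therefore has to compare both halves of the pair
when deciding whether it recognises an element. A minimal sketch,
assuming a hypothetical helper <function>elm_is</function> which is
not part of the &neon; API; <literal>DAV:</literal> is the namespace
used by WebDAV elements such as <sgmltag>prop</sgmltag>:</para>

<programlisting><![CDATA[
#include <assert.h>
#include <string.h>

/* Returns nonzero if (nspace, name) identifies the wanted element.
 * An element without a declared namespace has nspace == "". */
static int elm_is(const char *nspace, const char *name,
                  const char *want_nspace, const char *want_name)
{
    assert(nspace != NULL && name != NULL);
    return strcmp(nspace, want_nspace) == 0
        && strcmp(name, want_name) == 0;
}
]]></programlisting>

<para>With this helper, a <sgmltag>prop</sgmltag> element in the
<literal>DAV:</literal> namespace and a <sgmltag>prop</sgmltag>
element with no declared namespace are distinct elements, since only
the first matches the pair <literal>("DAV:", "prop")</literal>.</para>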