4
>How to parse a document from an application</TITLE
7
CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
9
TITLE="The PXP user's guide"
10
HREF="index.html"><LINK
13
HREF="c533.html"><LINK
16
HREF="c533.html"><LINK
18
TITLE="Class-based processing of the node tree"
19
HREF="x677.html"><LINK
22
HREF="markup.css"></HEAD
41
>The PXP user's guide</TH
56
>Chapter 2. Using <SPAN
79
>2.2. How to parse a document from an application</A
>Let me first give a rough overview of the object model of the parser. The
following items are represented by objects:
90
STYLE="list-style-type: disc"
> The document representation is more or less the
anchor for the application; all accesses to the parsed entities start here. It
is described by the class <TT
100
> contained in the module
>. You can get some global information, such
as the XML declaration the document begins with, the DTD of the document,
global processing instructions, and most importantly, the document tree. </P
109
STYLE="list-style-type: disc"
>The contents of documents:</I
> The contents have the structure
of a tree: Elements contain other elements and text<A
121
The common type to represent both kinds of content is <TT
which is a class type that unifies the properties of elements and character
data. Every node has a list of children (which is empty if the element is empty
or the node represents text); nodes may have attributes; nodes always have text
contents. There are two implementations of <TT
135
> for elements, and the class
> for text data. You will find these classes and class
types in the module <TT
145
>Note that attribute lists are represented by non-class values.</P
148
STYLE="list-style-type: disc"
>The node extension:</I
> For advanced usage, every node of the
document may have an associated <I
158
a second object. This object must have the three methods
> as a bare minimum, but you are free to add methods as
you want. This is the preferred way to add functionality to the document
175
>. The class type <TT
185
STYLE="list-style-type: disc"
> Sometimes it is necessary to access the DTD of a
document; the average application does not need this feature. The class
> describes DTDs, and makes it possible to get
representations of element, entity, and notation declarations as well as
processing instructions contained in the DTD. This class, and
207
>proc_instruction</TT
> can be found in the module
>. There are a couple of classes representing
different kinds of entities; these can be found in the module
222
Additionally, the following modules play a role:
229
STYLE="list-style-type: disc"
234
> Here the main parsing functions such as
237
>parse_document_entity</TT
> are located. Some additional types and
functions allow the parser to be configured in a non-standard way.</P
242
STYLE="list-style-type: disc"
247
> This is a collection of basic types and
253
There are some further modules that are needed internally but are not part of
256
>Let the document to be parsed be stored in a file called
260
>. The parsing process is started by calling the
264
CLASS="PROGRAMLISTING"
265
>val parse_document_entity : config -> source -> 'ext spec -> 'ext document</PRE
268
defined in the module <TT
>. The first argument
specifies some global properties of the parser; it is recommended to start with
>. The second argument determines where the
document to be parsed comes from; this may be a file, a channel, or an entity
281
>, it is sufficient to pass
284
>from_file "doc.xml"</TT
>The third argument passes the object specification to use. Roughly
speaking, it determines which classes implement the node objects of which
element types, and which extensions are to be used. The <TT
293
polymorphic variable is the type of the extension. For the moment, let us
297
> as this argument, and ignore it.</P
299
>So the following expression parses <TT
305
CLASS="PROGRAMLISTING"
307
let d = parse_document_entity default_config (from_file "doc.xml") default_spec</PRE
> implies that warnings are collected
but not printed. Errors raise one of the exceptions defined in
>; to get readable errors and warnings, catch the
exceptions as follows:
322
CLASS="PROGRAMLISTING"
326
print_endline ("WARNING: " ^ w)
  let config = { default_config with warner = new warner } in
  let d = parse_document_entity config (from_file "doc.xml") default_spec
337
print_endline (Pxp_types.string_of_exn e)</PRE
343
> is an object of the <TT
347
class. If you want the node tree, you can get the root element by
350
CLASS="PROGRAMLISTING"
351
>let root = d # root</PRE
354
and if you would rather access the DTD, obtain it by
357
CLASS="PROGRAMLISTING"
358
>let dtd = d # dtd</PRE
As it is more interesting, let us investigate the node tree now. Given the root
element, it is possible to recursively traverse the whole tree. The children of
366
> are returned by the method
370
>, and the type of a node is returned by
374
>. This function traverses the tree, and prints the
378
CLASS="PROGRAMLISTING"
>let rec print_structure n =
  let ntype = n # node_type in
    T_element name ->
      print_endline ("Element of type " ^ name);
      let children = n # sub_nodes in
      List.iter print_structure children
389
(* Other node types are not possible unless the parser is configured
395
You can call this function by
398
CLASS="PROGRAMLISTING"
399
>print_structure root</PRE
402
The type returned by <TT
element type is the string included in the angle brackets. Note that only
elements have children; data nodes are always leaves of the tree.</P
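The recursion described above can be tried out without PXP installed. The following is only a minimal sketch: it assumes a simplified stand-in record type `node` whose fields `node_type` and `sub_nodes` are named after the PXP methods, and a hypothetical helper `element_names`. Real PXP nodes are objects, so this models the traversal pattern rather than the library's interface.

```ocaml
(* Stand-in for the node types used in the text; an assumption for
   illustration, not PXP's own definition. *)
type node_type =
  | T_element of string   (* an element, with its element type name *)
  | T_data                (* character data *)

(* Hypothetical simplified node; real PXP nodes are objects. *)
type node = {
  node_type : node_type;
  sub_nodes : node list;  (* always [] for data nodes *)
}

(* Collect the element type names in document order. Only elements
   have children, so data nodes end the recursion. *)
let rec element_names n =
  match n.node_type with
  | T_element name ->
      name :: List.concat (List.map element_names n.sub_nodes)
  | T_data -> []

(* A small tree: element "a" containing a data node and an element "b". *)
let sample =
  { node_type = T_element "a";
    sub_nodes =
      [ { node_type = T_data; sub_nodes = [] };
        { node_type = T_element "b"; sub_nodes = [] } ] }

let () = List.iter print_endline (element_names sample)
```

Returning a list instead of printing inside the recursion keeps the traversal easy to test; the printing variant shown earlier follows the same shape.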
419
>There are some more methods for accessing a parsed node tree:
426
STYLE="list-style-type: disc"
431
>: Returns the parent node, or raises
435
> if the node is already the root</P
438
STYLE="list-style-type: disc"
443
>: Returns the root of the node tree. </P
446
STYLE="list-style-type: disc"
451
>: Returns the value of the attribute with
455
>. The method returns a value for every
459
> attribute, independently of whether the attribute
460
instance is defined or not. If the attribute is not declared,
464
> will be raised. (In well-formedness mode, every
465
existing attribute is considered as being implicitly declared with type
469
>, so you will get either <TT
478
>The following return values are possible: <TT
The first two value types indicate that the attribute value is available,
either because there is a definition
in the XML text, or because there is a default value (declared in the
DTD). Only if both the instance definition and the default declaration are
missing, the last value <TT
510
> will be returned.</P
>In the DTD, every attribute is typed. There are single-value types (CDATA, ID,
IDREF, ENTITY, NMTOKEN, enumerations), in which case the method passes
string value of the attribute. The other types (IDREFS, ENTITIES, NMTOKENS)
represent list values, and the parser splits the XML literal into several
tokens and returns these tokens as <TT
528
>Normalization means that entity references (the
548
by the text they represent, and that white space characters are converted into
552
STYLE="list-style-type: disc"
>: Returns the character data contained in the
node. For data nodes, the meaning is obvious as this is the main content of
data nodes. For element nodes, this method returns the concatenated contents of
all inner data nodes.</P
>Note that entity references included in the text are resolved while they are
being parsed; for example, the text "a &lt;&gt; b" will be returned
as "a <> b" by this method. Spaces in data nodes are always
preserved. Newlines are preserved, but always converted to \n characters even
if newlines are encoded as \r\n or \r. Normally you will never see two adjacent
data nodes because the parser collapses all data material at one location into
one node. (However, if you create your own tree or transform the parsed tree,
it is possible to have adjacent data nodes.)</P
571
>Note that elements that do <I
> allow #PCDATA as content
will not have data nodes as children. This means that spaces and newlines, the
only character material allowed for such elements, are silently dropped.</P
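The attribute value kinds and the concatenation rule for character data can be modelled in a few lines. This is a sketch under stated assumptions: the variant `att_value` is declared locally, and the constructor names `Valuelist` and `Implied_value` are assumed here (only `Value` appears in the listings of this section). The `node` type and the helpers `data`, `tokens`, and `describe` are hypothetical stand-ins, not part of PXP.

```ocaml
(* Local stand-in for the three kinds of attribute value; the names
   Valuelist and Implied_value are assumptions for this sketch. *)
type att_value =
  | Value of string            (* single-value attribute types *)
  | Valuelist of string list   (* list-valued types, already split *)
  | Implied_value              (* neither defined nor defaulted *)

(* Hypothetical simplified tree: elements with children, or data. *)
type node =
  | Element of string * node list
  | Data of string

(* Character data of a node: the text itself for data nodes, the
   concatenation of all inner data nodes for elements. *)
let rec data = function
  | Data s -> s
  | Element (_, children) -> String.concat "" (List.map data children)

(* Split a normalized list-valued literal into its tokens,
   ignoring empty pieces caused by repeated spaces. *)
let tokens literal =
  String.split_on_char ' ' literal |> List.filter (fun t -> t <> "")

(* Turn an attribute value into a readable description. *)
let describe = function
  | Value s -> "single value: " ^ s
  | Valuelist l -> "list value: " ^ String.concat "," l
  | Implied_value -> "no value"

let () =
  let p = Element ("p", [ Data "a "; Element ("b", [ Data "<>" ]); Data " b" ]) in
  print_endline (data p);
  print_endline (describe (Valuelist (tokens "x1 x2 x3")))
```

Matching on all three constructors, as `describe` does, lets the compiler warn when a case is forgotten, which is useful because an attribute lookup can legitimately yield the "no value" case.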
For example, if the task is to print all contents of elements with type
"valuable" whose attribute "priority" is "1", this function can help:
585
CLASS="PROGRAMLISTING"
>let rec print_valuable_prio1 n =
  let ntype = n # node_type in
    T_element "valuable" when n # attribute "priority" = Value "1" ->
      print_endline "Valuable node with priority 1 found:";
      print_endline (n # data)
  | (T_element _ | T_data) ->
      let children = n # sub_nodes in
      List.iter print_valuable_prio1 children
599
You can call this function by:
602
CLASS="PROGRAMLISTING"
603
>print_valuable_prio1 root</PRE
606
If you like a DSSSL-like style, you can make the function
609
>process_children</TT
613
CLASS="PROGRAMLISTING"
614
>let rec print_valuable_prio1 n =
  let process_children n =
    let children = n # sub_nodes in
    List.iter print_valuable_prio1 children
621
let ntype = n # node_type in
    T_element "valuable" when n # attribute "priority" = Value "1" ->
      print_endline "Valuable node with priority 1 found:";
      print_endline (n # data)
  | (T_element _ | T_data) ->
Used this way, O'Caml serves as a simple "style-sheet language": You can form a big
"match" expression to distinguish between all significant cases, and provide
different reactions to different conditions. But this technique has
limitations; the "match" expression tends to get larger and larger, and it is
difficult to store intermediate values as there is only one big
recursion. Alternatively, it is also possible to represent the various cases as
classes, and to use dynamic method lookup to find the appropriate class. The
next section explains this technique in detail. </P
655
HREF="x550.html#AEN562"
664
also contain processing instructions. Unlike other document models, <SPAN
separates processing instructions from the rest of the text and provides a
second interface to access them (method <TT
673
there is a parser option (<TT
675
>enable_pinstr_nodes</TT
the behaviour of the parser such that extra nodes for processing instructions
are included in the tree.</P
>Furthermore, the tree normally does not contain nodes for XML comments;
they are ignored by default. Again, there is an option
684
>enable_comment_nodes</TT
695
HREF="x550.html#AEN582"
>Due to the typing system it is more or less impossible to
derive recursive classes in O'Caml. To get around this, it is common practice
to put the modifiable or extensible part of recursive objects into parallel
766
>Class-based processing of the node tree</TD