******************************************************************************
The Preprocessor for PXP
******************************************************************************

==============================================================================
The Preprocessor for PXP
==============================================================================

Since PXP-1.1.95, there is a preprocessor as part of the PXP distribution. It
allows you to compose XML trees and event lists dynamically, which is very
handy for writing XML transformations.

To enable the preprocessor, compile your source files as in:

ocamlfind ocamlc -syntax camlp4o -package pxp-pp,... ...

The package pxp-pp contains the preprocessor. The -syntax option enables
camlp4, on which the preprocessor is based. It is also possible to use it
together with the revised syntax; use "-syntax camlp4r" in this case.

Important: Up to version 1.0.4, findlib (ocamlfind) has a problem with the
definition for pxp-pp. There is an easy workaround: Use "-syntax camlp4o,byte".

The preprocessor defines the following new syntax notations, explained below in
detail:

<:pxp_charset< CHARSET_DECL >>

The basic notation is "pxp_tree", which creates a tree of PXP document nodes as
described in EXPR. "pxp_vtree" is the variant where the tree is immediately
validated. "pxp_evlist" creates a list of PXP events instead of nodes, useful
together with the event-based parser. "pxp_evpull" is a variation of the
latter: Instead of an event list, an event generator is created that works like
a pull parser.

The "pxp_charset" notation only configures the character sets to assume.
Finally, "pxp_text" is a notation for string literals.

------------------------------------------------------------------------------
------------------------------------------------------------------------------

The following examples are all written for "pxp_tree". You can also use one of
the other XML composers instead, but see the notes below.

In order to use "pxp_tree", you must define two variables in the environment:

let spec = Pxp_tree_parser.default_spec;;
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;

These variables occur in the code generated by the preprocessor. The "dtd"
variable is the DTD object. Note that you need it even in well-formedness mode
(validation turned off). The "spec" variable controls which classes are
instantiated as node representation (see the PXP manual).

Now you can create XML trees as in:

[ <title>[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]

As you can see, the syntax is somewhat XML-related but not really XML. (Many
ideas are borrowed from CDUCE, by the way.) In particular, there are start tags
like <title> but no end tags. Instead, we are using square brackets to denote
the children of an XML element. Furthermore, character data must be put into
double quotes.

You may ask why the well-known XML syntax has been modified for this
preprocessor. There are many reasons, and they will become clearer in the
following explanations. For now, you can see the advantage that the syntax is
less verbose, as you need not repeat the element names in end tags.
Furthermore, you can control exactly which characters are part of the data
nodes without having to make compromises with indentation.

Attributes are written as in XML:

[ <title lang="en">[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]

An element without children can be written

You can also create processing instructions and comment nodes:

[ <!>"Now the list of books follows!"
<?>"formatter_directive" "one book per page"

The notation "<!>" creates a comment node with the following string as
contents. The notation "<?>" needs two strings, first the target, then the
value (here, this results in "<?formatter_directive one book per page?>").

Look again at the last example: The O'Caml variable "book" occurs, and it
inserts its tree into the list of books. Identifiers without "decoration" just
refer to O'Caml variables. We will see more examples below.

The preprocessor syntax knows a number of shortcuts and variations. First, you
can omit the square brackets when an element has exactly one child:

<element><child>"Data inside child"

is the same as

<element>[ <child>[ "Data inside child" ] ]

Second, you are already used to a common abbreviation: Strings are
automatically converted to data nodes. The "expanded" syntax is

where "<*>" denotes a data node, and the following string is used as contents.
Usually, you can omit "<*>". However, there are a few occasions where this
notation is still useful, see below.

In strings, the usual entity references can be used: "Double quotes: &quot;".
For a newline character, write &#10;.

The preprocessor knows two operators: "^" concatenates strings, and "@"
concatenates lists. Examples:

<element>[ "Word1" ^ "Word2" ]
<element>([ <a/> ] @ [ <b/> ])

Parentheses can be used to clarify precedence. For example:

Here, the concatenation operator "@" could also be parsed as

Parentheses may be used in every expression.

Rarely used, there is also a notation for the "super root" nodes (see the PXP
manual for their meaning):

------------------------------------------------------------------------------
------------------------------------------------------------------------------

Let us begin with an example. The task is to convert O'Caml values of type

<book id="BOOK_'isbn'">
<title>'title'</title>
<author>'author'</author>

(conventional syntax). When b is the book variable, the solution is

and author = b.author

<book id=("BOOK_" ^ isbn)>

First, we bind the simple O'Caml variables "title", "author", and "isbn". The
reason is that the preprocessor syntax does not allow expressions like
"b.title" directly in the XML tree (but see below for a better workaround).

The XML tree contains the O'Caml variables. The "id" attribute is a
concatenation of the fixed prefix "BOOK_" and the contents of "isbn". The
"title" and "author" elements contain a data node whose contents are the O'Caml
strings "title" and "author", respectively.

Why "<*>"? If we just wrote "<title>title", the generated code would assume
that the "title" variable is an XML node, and not a string. From this point of
view, "<*>" works like a type annotation, as it specialises the type of the
following expression.

Here is an alternate solution:

<book id=("BOOK_" ^ (: b.isbn :))>
[ <title><*>(: b.title :)
<author><*>(: b.author :)

The notation "(: ... :)" allows you to include arbitrary O'Caml expressions
in the tree. In this solution it is no longer necessary to create artificial
O'Caml variables for the sole purpose of injecting values into trees.

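To see what such an expression amounts to, here is a minimal sketch in plain
O'Caml that does not use the preprocessor at all. It rests on assumptions: the
"node" type is a toy stand-in and not the real PXP node class, and the record
type "book", the function "tree_of_book" and the ISBN value are hypothetical
and serve only as illustration.

```ocaml
(* Toy stand-in for PXP document nodes; NOT the real Pxp_document types. *)
type node =
  | Element of string * (string * string) list * node list
  | Data of string

(* Hypothetical record type with the three fields used in the text. *)
type book = { title : string; author : string; isbn : string }

(* Hand-written counterpart of the tree built with the preprocessor:
   the attribute value is a string concatenation, and the record
   fields become the contents of data nodes. *)
let tree_of_book (b : book) : node =
  Element ("book", [ ("id", "BOOK_" ^ b.isbn) ],
           [ Element ("title", [], [ Data b.title ]);
             Element ("author", [], [ Data b.author ]) ])

(* The ISBN below is a placeholder value, not a real one. *)
let example =
  tree_of_book
    { title = "The Lord of The Rings";
      author = "J.R.R. Tolkien";
      isbn = "0000000000" }
```

The data nodes are explicit here (the "Data" constructor), which is the role
that "<*>" plays in the preprocessor notation.
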
It is possible to create XML elements with dynamic names: Just put parentheses
around the expression. Example:

<:pxp_tree< <(name)> ... >>

With the same notation, one can also set attribute names dynamically:

let att_name = "id" in
<:pxp_tree< <book (att_name)=...> ... >>

Finally, it is also possible to include complete attribute lists dynamically:

let att_list = [ "id", ("BOOK_" ^ b.isbn) ] in
<:pxp_tree< <book (: att_list :) > ... >>

Typing: Depending on where a variable or O'Caml expression occurs, different
types are assumed. Compare the following examples:

<:pxp_tree< <element>x1 >>
<:pxp_tree< <element>[x2] >>
<:pxp_tree< <element><*>x3 >>

As a rule of thumb, the most general type is assumed that would make sense at a
certain location. As x1 could be replaced by a list of children, its type is
assumed to be a node list. As x2 could be replaced by a single node, its type
is assumed to be a node. And x3 is a string; we had this case already.

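The rule of thumb can be pictured with ordinary O'Caml functions. This is only
a sketch: "node" is a simplified stand-in and not the PXP node type, and the
three function names are made up for this illustration.

```ocaml
(* Simplified stand-in for document nodes (no attributes). *)
type node = Element of string * node list | Data of string

(* Position of x1: it stands where the whole child list could be written. *)
let with_children (x1 : node list) : node = Element ("element", x1)

(* Position of x2: it is one member of the child list. *)
let with_child (x2 : node) : node = Element ("element", [ x2 ])

(* Position of x3: it is the string contents of a data node. *)
let with_text (x3 : string) : node = Element ("element", [ Data x3 ])
```

With these definitions, with_text "a", with_child (Data "a") and
with_children [ Data "a" ] all denote the same tree; only the type of the
argument differs.
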
------------------------------------------------------------------------------
------------------------------------------------------------------------------

As the preprocessor generates code that builds XML trees, it must know two
encodings:

- Which encoding is used in the source code (in the .ml file)

- Which encoding is used in the XML representation, i.e. in the O'Caml values
  representing the XML trees

Both encodings can be set independently. The syntax is:

<:pxp_charset< source="ENC" representation="ENC" >>

The default is ISO-8859-1 for both encodings. For example, to set the
representation encoding to UTF-8, use:

<:pxp_charset< representation="UTF-8" >>

The "pxp_charset" notation is a constant expression that always evaluates to
"()". (A requirement by camlp4 that looks artificial.)

When you set the representation encoding, it is required that the encoding
stored in the DTD object is the same. Remember that we need a DTD object like

let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;

Of course, we must change this to the representation encoding, too, in our
example:

let dtd = Pxp_dtd.create_dtd `Enc_utf8;;

The preprocessor cannot check this at compile time, and for performance
reasons, a runtime check is not generated. So it is up to the programmer to
ensure that the character encodings are used in a consistent way.

------------------------------------------------------------------------------
------------------------------------------------------------------------------

In order to validate trees, you need a filled DTD object. In principle, you can
create this object by a number of methods. For example, you can parse an
external file:

let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_file "sample.dtd")

It is, however, often more convenient to include the DTD literally in the
program. This works by

let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string "...")

As double quotes are often used inside DTDs, O'Caml string literals are a
bit impractical, as they are also delimited by double quotes, and one needs to
add backslashes as escape characters. The "pxp_text" notation is often more
readable here: <:pxp_text<STRING>> is just another way of writing "STRING". In
our example:

let dtd_text = <:pxp_text<
<!ELEMENT book (title,author)>
<!ATTLIST book id CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ATTLIST title lang CDATA "en">
<!ELEMENT author (#PCDATA)>
>>;;
let config = default_config;;
let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string dtd_text);;

Note that "pxp_text" is not restricted to DTDs, as it can be used for any kind
of string.

After we have the DTD, we can validate the trees. One option is to call the
"validate" function:

[ <title>[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]

Pxp_document.validate book;;

(This example is invalid, as the "id" attribute is missing.)

Note that it is a misunderstanding that "pxp_tree" builds XML trees in
well-formed mode. You can create any tree with it, and the fact is that
"pxp_tree" just does not invoke the validator. So if the DTD enforces
validation, the tree is validated when the "validate" function is called. If
the DTD is in well-formedness mode, the tree is effectively not validated, even
when the "validate" function is invoked. By the way, the following statements
would create a DTD in well-formedness mode:

let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
dtd # allow_arbitrary;;

As an alternative to calling the "validate" function, one can also use
"pxp_vtree" instead. It immediately validates every XML element it creates.
However, "injected" subtrees are not validated, i.e. validation does not
proceed recursively to subnodes as the "validate" function does.

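The difference between checking one element and checking a whole tree can be
sketched with a toy model. Everything here is an assumption made for
illustration: the "required" table merely stands in for a DTD and only knows
about required attributes, and "element_ok" and "tree_ok" are hypothetical
names, not PXP functions.

```ocaml
type node =
  | Element of string * (string * string) list * node list
  | Data of string

(* Toy "DTD": element name -> names of its required attributes. *)
let required = [ ("books", []); ("book", [ "id" ]) ]

(* Check a single element only, without looking into its children. *)
let element_ok (n : node) : bool =
  match n with
  | Data _ -> true
  | Element (name, atts, _) ->
      (match List.assoc_opt name required with
       | None -> false
       | Some req -> List.for_all (fun a -> List.mem_assoc a atts) req)

(* Check a whole tree recursively, in the spirit of the "validate" function. *)
let rec tree_ok (n : node) : bool =
  element_ok n
  && (match n with
      | Data _ -> true
      | Element (_, _, children) -> List.for_all tree_ok children)

(* A subtree built elsewhere without any check: the "id" attribute is missing. *)
let injected = Element ("book", [], [ Data "no id" ])

(* The outer element itself is fine, but it contains the unchecked subtree. *)
let outer = Element ("books", [], [ injected ])
```

In this model, element_ok outer holds although tree_ok outer does not, which
is the situation described above for injected subtrees.
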
------------------------------------------------------------------------------
------------------------------------------------------------------------------

As PXP also has an event model to represent XML, the preprocessor can also
produce such events. In particular, there are two modes: The "pxp_evlist"
notation outputs lists of events (type "event list") representing the XML
expression. The "pxp_evpull" notation creates an automaton from which one can
"pull" events (like from a pull parser).

These two notations work very much like "pxp_tree". For example,

[ <title>[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]

[ E_start_tag ("book", [], None, <obj>);
E_start_tag ("title", [], None, <obj>);
E_char_data "The Lord of The Rings";
E_end_tag ("title", <obj>);
E_start_tag ("author", [], None, <obj>);
E_char_data "J.R.R. Tolkien";
E_end_tag ("author", <obj>);
E_end_tag ("book", <obj>)
]

Note that you need neither a "dtd" variable nor a "spec" variable. There is one
important difference, however: Both nodes and lists of nodes are represented by
the same type, "event list". That has the consequence that in the following
example x1 and x2 have the same type "event list":

<:pxp_evlist< <element>x1 >>
<:pxp_evlist< <element>[x2] >>
<:pxp_evlist< <element><*>x3 >>

In principle, it could be checked at runtime whether x1 and x2 have the right
structure. However, this is not done for performance reasons.

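Why nodes and node lists collapse into one type can be seen in a small model.
This is a sketch under assumptions: "node" and "event" are simplified types
(the real PXP event constructors carry further arguments), and the names
"events_of_node" and "events_of_nodes" are illustrative only.

```ocaml
type node = Element of string * node list | Data of string

(* Simplified events; the real constructors take more arguments. *)
type event =
  | E_start_tag of string
  | E_char_data of string
  | E_end_tag of string

(* A single node becomes a start tag, the events of its children,
   and an end tag; a data node becomes one character data event. *)
let rec events_of_node (n : node) : event list =
  match n with
  | Data s -> [ E_char_data s ]
  | Element (name, children) ->
      (E_start_tag name :: events_of_nodes children) @ [ E_end_tag name ]

(* A list of nodes is just the concatenation of the per-node events. *)
and events_of_nodes (ns : node list) : event list =
  List.concat (List.map events_of_node ns)

let title = Element ("title", [ Data "The Lord of The Rings" ])
let single = events_of_node title
let several = events_of_nodes [ title; Data "more text" ]
```

Both "single" and "several" have type "event list", so the type alone no
longer tells whether a value stands for one node or for several.
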
As mentioned, "pxp_evpull" works like a pull parser. After defining

[ <title>[ "The Lord of The Rings" ]
<author>[ "J.R.R. Tolkien" ]

"book" is a function 'a->event option. One can call it to get the events one
after the other:

let e1 = book();; (* = Some(E_start_tag ("book", [], None, <obj>)) *)
let e2 = book();; (* = Some(E_start_tag ("title", [], None, <obj>)) *)

After the last event, "book" returns None to indicate the end of the event
stream.

As for "pxp_evlist", it is not possible to distinguish between nodes and node
lists. In this example, both x1 and x2 are assumed to have type
'a->event option:

<:pxp_evpull< <element>x1 >>
<:pxp_evpull< <element>[x2] >>
<:pxp_evpull< <element><*>x3 >>

Note that "<element>x1" actually means to build a new pull automaton around the
existing pull automaton x1: The children of "element" are retrieved by pulling
events from x1 until "None" is returned.

A consequence of the pull semantics is that once an event is obtained from an
automaton, the state of the automaton is modified such that it is not possible
to get the same event again. If you need an automaton that can be reset to the
beginning, just wrap the "pxp_evpull" notation into a functional abstraction:

let book_maker() =
<:pxp_evpull< <book ...> ... >>;;
let book1 = book_maker();;
let book2 = book_maker();;

This way, "book1" and "book2" are independent event streams.

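The pull semantics can be imitated with a closure over a mutable reference.
The following is a minimal sketch and not the code the preprocessor generates:
the "event" type is simplified, and "pull_of_list" is a hypothetical helper
introduced only for this illustration.

```ocaml
(* Simplified events; the real constructors take more arguments. *)
type event =
  | E_start_tag of string
  | E_char_data of string
  | E_end_tag of string

(* Turn a list into a pull function: each call hands out the next event,
   and None is returned once the list is exhausted. *)
let pull_of_list (evs : event list) : unit -> event option =
  let rest = ref evs in
  fun () ->
    match !rest with
    | [] -> None
    | e :: tl ->
        rest := tl;
        Some e

(* Wrapping the construction in a function yields a fresh automaton
   (with its own private state) on every call. *)
let book_maker () =
  pull_of_list [ E_start_tag "book"; E_char_data "text"; E_end_tag "book" ]

let book1 = book_maker ()
let book2 = book_maker ()

let first_of_book1 = book1 ()   (* consumes the first event of book1 *)
let second_of_book1 = book1 ()  (* book1 has moved on *)
let first_of_book2 = book2 ()   (* book2 still starts at the beginning *)
```

Each call of "book_maker" allocates a new reference cell, which is why pulling
from "book1" does not advance "book2".
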
There is another implication of the nature of the automatons: Subexpressions
are lazily evaluated. For example, in

<:pxp_evpull< <element>[ <*> (: get_data_contents() :) ] >>

the call of get_data_contents is performed just before the event for the data
node is generated.

------------------------------------------------------------------------------
------------------------------------------------------------------------------

By default, the preprocessor does not generate nodes or events that support
namespaces. It can, however, be configured to create namespace-aware XML.

In any case, you need a namespace manager. This is an object that tracks the
usage of namespace prefixes in XML nodes. For example, we can create a
namespace manager that knows the "html" prefix:

let mng = new namespace_manager;;
mng # add_namespace "html" "http://www.w3.org/1999/xhtml";;

Here, we declare that we want to use the "html" prefix for the internal
representation of the XML nodes. This kind of prefix is called a normalized
prefix, or normprefix for short. It is possible to configure different prefixes
for the external representation, i.e. when the XML tree is printed to a file.
This other kind of prefix is called a display prefix. We will have a look at
them later.

Next, we must tell the DTD object that we have a namespace manager:

let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
dtd # set_namespace_manager mng;;

For "pxp_evlist" and "pxp_evpull" we are now prepared (note that we now need a
"dtd" variable, as the DTD object knows the namespace manager). For "pxp_tree"
and "pxp_vtree", it is required to use a namespace-aware specification:

let spec = Pxp_tree_parser.default_namespace_spec

(Normal specifications do not work; you would get "Namespace method not
applicable" errors if you tried to use them.)

The special notation "<:autoscope>" enables namespace mode in this example:

In particular, "<:autoscope>" defines a new O'Caml variable for its
subexpression: "scope". This variable contains the namespace scope object,
which contains the namespace declarations for the subexpression. "<:autoscope>"
initialises this variable from the namespace manager so that it now contains
a declaration for the "html" prefix.

In general, the namespace scope object contains the prefixes to use for the
external representation. For this simple example, we have chosen to use the
same prefixes as for the internal representation, and "<:autoscope>" performs
the right initialisations for this.

list # display (`Out_channel stdout) `Enc_iso88591

The point is to call the "display" method and not the "write" method. The
latter would not respect the display prefixes.

Alternatively, we can also create the "scope" variable manually:

let scope = Pxp_dtd.create_namespace_scope
~decl:[ "", "http://www.w3.org/1999/xhtml" ]
mng;;

Note that we now use "<:scope>". In this simple form, this construct just
enables namespace mode, and takes the "scope" variable from the environment.

Furthermore, the namespace scope now contains a different namespace
declaration: The display prefix "" is used for HTML. The empty prefix just
means to declare a default prefix (by xmlns="URI"). The effect can be seen when
the XML tree is printed by calling the "display" method.

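The relation between normalized prefixes and display prefixes can be pictured
with two association lists. This is a toy model, not the PXP implementation:
"manager", "autoscope_like", "default_scope", "split_name" and "display_name"
are hypothetical names, and real namespace managers and scopes are objects
rather than lists.

```ocaml
let xhtml = "http://www.w3.org/1999/xhtml"

(* Normalized prefix -> URI, as a namespace manager would record it. *)
let manager = [ ("html", xhtml) ]

(* Display prefix -> URI, as a namespace scope would record it. *)
let autoscope_like = [ ("html", xhtml) ]  (* same prefixes as the manager *)
let default_scope = [ ("", xhtml) ]       (* empty prefix: xmlns=URI *)

(* Split a name at the first colon into prefix and local part. *)
let split_name (name : string) : string * string =
  match String.index_opt name ':' with
  | Some i ->
      (String.sub name 0 i,
       String.sub name (i + 1) (String.length name - i - 1))
  | None -> ("", name)

(* Rewrite a normalized name into its display form for a given scope:
   look up the URI of the normalized prefix, then find the display
   prefix that the scope declares for the same URI. *)
let display_name scope (name : string) : string =
  let normprefix, local = split_name name in
  match List.assoc_opt normprefix manager with
  | None -> name
  | Some uri ->
      (match List.find_opt (fun (_, u) -> u = uri) scope with
       | Some ("", _) -> local
       | Some (p, _) -> p ^ ":" ^ local
       | None -> name)
```

In this model, display_name autoscope_like "html:body" keeps the name
"html:body", whereas display_name default_scope "html:body" gives "body",
which corresponds to output that relies on a default namespace declaration.
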
Here is a third variant of the same example:

let scope = Pxp_dtd.create_namespace_scope mng ;;

<:scope ("")="http://www.w3.org/1999/xhtml">

The "scope" is now initially empty. The "<:scope>" notation is used to extend
the scope for the time the subexpression is evaluated.

There is also a notation "<:emptyscope>" that creates an empty scope object, so

<:scope ("")="http://www.w3.org/1999/xhtml">

It is recommended to create the "scope" variable manually with a reasonable
initial declaration, and to use "<:scope>" to enable namespace processing, and
to extend the scope when necessary. The advantage of this approach is that the
same scope object can be shared by many XML nodes, so you need less memory.

One tip: To get a namespace scope that is initialised with all prefixes of the
namespace manager (as "<:autoscope>" does), define

let scope = create_namespace_scope ~decl: mng#as_declaration mng

For event-based processing of XML, the namespace mode works in the same way as
described here; there is no difference.