4
>Highlights of XML</TITLE
7
CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
9
TITLE="The PXP user's guide"
10
HREF="index.html"><LINK
18
TITLE="A complete example: The readme DTD"
19
HREF="x468.html"><LINK
22
HREF="markup.css"></HEAD
41
>The PXP user's guide</TH
56
>Chapter 1. What is XML?</TD
76
>1.2. Highlights of XML</A
79
>This section explains many, but not all, of the features of XML, and it covers
some of them only briefly. For a complete description, see the <A
81
HREF="http://www.w3.org/TR/1998/REC-xml-19980210.html"
92
>1.2.1. The DTD and the instance</A
95
>The DTD contains various declarations; in general you can only use a feature if
96
you have previously declared it. The document instance file may contain the
97
full DTD, but it is also possible to split the DTD into an internal and an
98
external subset. A document must begin as follows if the full DTD is included:
101
CLASS="PROGRAMLISTING"
102
><?xml version="1.0" encoding="<TT
123
These declarations are called the <I
127
that the usage of entities and conditional sections is restricted within the
130
>If the declarations are located in a different file, you can refer to this file
134
CLASS="PROGRAMLISTING"
135
><?xml version="1.0" encoding="<TT
154
The declarations in the file are called the <I
158
>. The file name is called the <I
163
It is also possible to refer to the file by a so-called
166
>public identifier</I
167
>, but most XML applications won't use
170
>You can also specify both internal and external subsets. In this case, the
declarations of both subsets are merged; if two declarations conflict, the
declaration in the internal subset overrides the declaration with the same name
in the external subset. This looks as follows:
176
CLASS="PROGRAMLISTING"
177
><?xml version="1.0" encoding="<TT
203
>The XML declaration (the string beginning with <TT
210
>) should specify the encoding of the
211
file. Common values are UTF-8 and the ISO-8859 series of character sets. Note
212
that every file parsed by the XML processor can begin with an XML declaration
213
and that every file may have its own encoding.</P
215
>The name of the root element must be mentioned directly after the
219
> string. This means that a full document instance
223
CLASS="PROGRAMLISTING"
224
><?xml version="1.0" encoding="<TT
275
>1.2.2. Reserved characters</A
278
>Some characters are reserved to indicate markup and therefore cannot be used
directly as character data. These characters are <, >, and
&. Furthermore, single and double quotes are sometimes reserved. If you
want to include such a character as a literal character, write it as follows:
288
STYLE="list-style-type: disc"
296
STYLE="list-style-type: disc"
304
STYLE="list-style-type: disc"
309
> instead of &</P
312
STYLE="list-style-type: disc"
320
STYLE="list-style-type: disc"
330
All other characters can be used freely in the document instance. It is also possible to
include a character by its Unicode code point:
334
CLASS="PROGRAMLISTING"
348
> is the decimal number of the
349
character. Alternatively, you can specify the character by its hexadecimal
353
CLASS="PROGRAMLISTING"
362
Within declarations, the character % is reserved as well. To include it
as a literal character, you must use the notations <TT
372
>Note that besides &lt;, &gt;, &amp;,
&apos;, and &quot; there are no predefined character entities. This is
different from HTML, which defines a list of characters that can be referenced
by name (e.g. &auml; for ä); however, if you prefer named characters, you
can declare such entities yourself (see below).</P
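The character notations described in this section can be tried out with any XML parser. The following is a minimal sketch that uses Python's standard `xml.etree.ElementTree` module rather than PXP; the element name `p` and the sample text are assumptions chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# U+00E4 written once as a decimal and once as a hexadecimal character
# reference, followed by three of the five predefined entities.
doc = "<p>&#228; and &#xE4; and &lt;tag&gt; &amp; more</p>"

# The parser resolves every reference to the character it stands for.
text = ET.fromstring(doc).text
print(text)
```

Both numeric notations yield the same character, and the reserved characters arrive in the application as ordinary character data.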
384
>1.2.3. Elements and ELEMENT declarations</A
387
>Elements structure the document instance in a hierarchical way. There is a
388
top-level element, the <I
392
sequence of inner elements and character sections. The inner elements are
393
structured in the same way. Every element has an <I
397
>. The beginning of the element is indicated by a <I
404
CLASS="PROGRAMLISTING"
413
and the element continues until the corresponding <I
420
CLASS="PROGRAMLISTING"
429
In XML, it is not allowed to omit start or end tags, even if the DTD would
permit this. Note that there are no special rules for interpreting spaces or
newlines near start or end tags; all spaces and newlines count.</P
433
>Every element type must be declared before it can be used. The declaration
434
consists of two parts: the ELEMENT declaration describes the content model,
435
i.e. which inner elements are allowed; the ATTLIST declaration describes the
436
attributes of the element.</P
438
>An element can simply allow everything as content. This is written:
441
CLASS="PROGRAMLISTING"
450
Conversely, an element can be forced to be empty; this is declared by:
453
CLASS="PROGRAMLISTING"
462
Note that there is an abbreviated notation for empty element instances:
473
>There are two more sophisticated forms of declarations: so-called
476
>mixed declarations</I
481
>. An element with mixed content contains character data
482
interspersed with inner elements, and the set of allowed inner elements can be
483
specified. In contrast to this, a regular expression declaration does not allow
484
character data, but the inner elements can be described by the more powerful
485
means of regular expressions.</P
487
>A declaration for mixed content looks as follows:
490
CLASS="PROGRAMLISTING"
513
or if you do not want to allow any inner element, simply
516
CLASS="PROGRAMLISTING"
537
CLASS="PROGRAMLISTING"
538
><!ELEMENT q (#PCDATA | r | s)*></PRE
541
this is a legal instance:
544
CLASS="PROGRAMLISTING"
545
><q>This is character data<r></r>with <s></s>inner elements</q></PRE
548
But this is illegal because <TT
551
> has not been enumerated in the
555
CLASS="PROGRAMLISTING"
556
><q>This is character data<r></r>with <t></t>inner elements</q></PRE
560
>The other form uses a regular expression to describe the possible contents:
563
CLASS="PROGRAMLISTING"
577
The following well-known regexp operators are allowed:
584
STYLE="list-style-type: disc"
597
STYLE="list-style-type: disc"
622
STYLE="list-style-type: disc"
647
STYLE="list-style-type: disc"
660
STYLE="list-style-type: disc"
673
STYLE="list-style-type: disc"
691
> operator indicates a sequence of sub-models, the
695
> operator describes alternative sub-models. The
699
> indicates zero or more repetitions, and
703
> one or more repetitions. Finally, <TT
707
be used for optional sub-models. As atoms the regexp can contain names of
708
elements; note that it is not allowed to include <TT
713
>The exact syntax of the regular expressions is rather strange. This can be
714
explained best by a list of constraints:
721
STYLE="list-style-type: disc"
723
>The outermost expression must not be
740
><!ELEMENT x y></TT
741
>; this must be written as
744
><!ELEMENT x (y)></TT
748
STYLE="list-style-type: disc"
750
>For the unary operators <TT
785
> must not be again an
794
><!ELEMENT x y**></TT
795
>; this must be written as
798
><!ELEMENT x (y*)*></TT
802
STYLE="list-style-type: disc"
807
> and one of the unary operators
818
not be whitespace.</P
826
><!ELEMENT x (y|z) *></TT
827
>; this must be written as
830
><!ELEMENT x (y|z)*></TT
834
STYLE="list-style-type: disc"
836
>There is the additional constraint that the
right parenthesis must be contained in the same entity as the left parenthesis;
see the section about parsed entities below.</P
843
>Note that there is another restriction: regular expressions must be
deterministic. This means that the parser must be able to tell, by looking only
at the next token, which alternative is actually used or whether the repetition
stops. The reason for this is simply compatibility with SGML (there is no
intrinsic reason for this rule; XML can live without this restriction).</P
855
>The elements are declared as follows:
858
CLASS="PROGRAMLISTING"
859
><!ELEMENT q (r?, (s | t)+)>
<!ELEMENT r (#PCDATA)>
<!ELEMENT s EMPTY>
<!ELEMENT t (q | r)></PRE
865
This is a legal instance:
868
CLASS="PROGRAMLISTING"
869
><q><r>Some characters</r><s/></q></PRE
875
> is an abbreviation for
878
><s></s></TT
881
It would be illegal to leave <TT
885
least one instance of <TT
892
present. It would be illegal, too, if characters existed outside the
896
> element; the only exception is white space. This is
900
CLASS="PROGRAMLISTING"
901
><q><s/><t><q><s/></q></t></q></PRE
911
>1.2.4. Attribute lists and ATTLIST declarations</A
914
>Elements may have attributes. These are put into the start tag of an element as
918
CLASS="PROGRAMLISTING"
967
it is also possible to use single quotes as in
979
Note that you cannot use double quotes literally within the value of the
attribute if double quotes are the delimiters; the same applies to single
quotes. In general, you cannot use < and & as characters in attribute
values. It is possible to include the substitutes &lt;, &gt;,
&amp;, &apos;, and &quot; (and any other reference to a general
entity as long as the entity is not defined by an external file) as well as
992
>Before you can use an attribute you must declare it. An ATTLIST declaration
996
CLASS="PROGRAMLISTING"
1016
>attribute-default</I
1033
>attribute-default</I
1039
There are many types, but the most important are:
1046
STYLE="list-style-type: disc"
1051
>: Every string is allowed as attribute value.</P
1054
STYLE="list-style-type: disc"
1059
>: Every nametoken is allowed as attribute
1060
value. Nametokens consist (mainly) of letters, digits, ., :, -, _ in arbitrary
1064
STYLE="list-style-type: disc"
1069
>: A space-separated list of nametokens is allowed as
1075
The most interesting default declarations are:
1082
STYLE="list-style-type: disc"
1087
>: The attribute must be specified.</P
1090
STYLE="list-style-type: disc"
1095
>: The attribute can be specified, but it can also be
left out. The application can find out whether the attribute was present or
1100
STYLE="list-style-type: disc"
1119
>: This particular value is
1120
used as default if the attribute is omitted in the element.</P
1131
>This is a valid attribute declaration for element type <TT
1137
CLASS="PROGRAMLISTING"
1141
z NMTOKENS "one two three"></PRE
1147
> is a required attribute that cannot be
1155
XML parser tells the application whether <TT
1162
> is missing, the default value
"one two three" is returned automatically.</P
1165
>This is a valid example of these attributes:
1168
CLASS="PROGRAMLISTING"
1169
><r x="He said: &quot;I don't like quotes!&quot;" y='1'></PRE
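The effect of the default declaration can be observed with a short experiment. The sketch below uses Python's standard `xml.etree.ElementTree` module (a non-validating, expat-based parser) rather than PXP. The `x` and `y` lines of the ATTLIST declaration are assumptions (`CDATA #REQUIRED` and `NMTOKEN #IMPLIED`), chosen to match the description above; only the `z` line is taken from the text.

```python
import xml.etree.ElementTree as ET

# Internal subset with an ATTLIST declaration for element type r.
# The x and y declarations are assumed for illustration.
doc = """<!DOCTYPE r [
<!ATTLIST r
          x CDATA    #REQUIRED
          y NMTOKEN  #IMPLIED
          z NMTOKENS "one two three">
]>
<r x="He said: &quot;I don't like quotes!&quot;" y='1'/>
"""

r = ET.fromstring(doc)
x = r.get("x")  # &quot; is resolved to a literal double quote
y = r.get("y")  # given explicitly in the start tag
z = r.get("z")  # omitted in the instance, so the declared default is supplied
print(x, y, z)
```

Because this parser does not validate, it would not complain about a missing `x`; the point of the sketch is only that the default for `z` is filled in by the parser, not by the application.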
1179
>1.2.5. Parsed entities</A
1182
>Elements describe the logical structure of the document, while
1186
> determine the physical structure. Entities are
1187
the pieces of text the parser operates on, mostly files and macros. Entities
1191
> in which case the parser reads the text and
1192
interprets it as XML markup, or <I
1196
means that the data of the entity has a foreign format (e.g. a GIF icon).</P
1198
>If the parsed entity is going to be used as part of the DTD, it
1201
>parameter entity</I
1202
>. You can declare a parameter
1203
entity with a fixed text as content by:
1206
CLASS="PROGRAMLISTING"
1220
Within the DTD, you can <I
1223
> this entity, i.e. read
1224
the text of the entity, by:
1227
CLASS="PROGRAMLISTING"
1236
Such entities behave like macros, i.e. when they are referred to, the
1237
macro text is inserted and read instead of the original text.
1246
>For example, you can declare two elements with the same content model by:
1249
CLASS="PROGRAMLISTING"
1250
><!ENTITY % model "a | b | c">
<!ELEMENT x (%model;)>
<!ELEMENT y (%model;)></PRE
1257
If the contents of the entity are given as a string constant, the entity is
1261
> entity. It is also possible to name a
1262
file to be used as content (an <I
1268
CLASS="PROGRAMLISTING"
1282
There are some restrictions for parameter entities:
1289
STYLE="list-style-type: disc"
1291
>If the internal parameter entity contains the first token of a declaration
1295
>), it must also contain the last token of the
1296
declaration, i.e. the <TT
1299
>. This means that the entity
1300
either contains a whole number of complete declarations, or some text from the
1301
middle of one declaration.</P
1308
CLASS="PROGRAMLISTING"
1309
><!ENTITY % e "(a | b | c)>">
<!ELEMENT x %e;</PRE
1314
> is contained in the main
1315
entity, and the corresponding <TT
1318
> is contained in the
1325
STYLE="list-style-type: disc"
1327
>If the internal parameter entity contains a left parenthesis, it must also
contain the corresponding right parenthesis.</P
1335
CLASS="PROGRAMLISTING"
1336
><!ENTITY % e "(a | b | c">
<!ELEMENT x %e;)></PRE
1341
> is contained in the entity
1345
>, and the corresponding <TT
1349
contained in the main entity.</P
1352
STYLE="list-style-type: disc"
1354
>When reading text from an entity, the parser automatically inserts one space
1355
character before the entity text and one space character after the entity
1356
text. However, this rule is not applied within the definition of another
1364
CLASS="PROGRAMLISTING"
1366
<!ENTITY % suffix "gif">
<!ENTITY iconfile 'icon.%suffix;'></PRE
1371
> is referenced within
1372
the definition text for <TT
1375
>, no additional spaces are
1383
CLASS="PROGRAMLISTING"
1384
><!ENTITY % suffix "test">
<!ELEMENT x.%suffix; ANY></PRE
1390
> is referenced outside the definition
1391
text of another entity, the parser replaces <TT
1415
CLASS="PROGRAMLISTING"
1416
><!ENTITY % e "(a | b | c)">
<!ELEMENT x %e;*></PRE
1418
> Because there is whitespace between <TT
1425
>, which is illegal.</P
1428
STYLE="list-style-type: disc"
1430
>An external parameter entity must always consist of a whole number of complete
1434
STYLE="list-style-type: disc"
1436
>In the internal subset of the DTD, a reference to a parameter entity (internal
1437
or external) is only allowed at positions where a new declaration can start.</P
1442
>If the parsed entity is going to be used in the document instance, it is called
1446
>. Such entities can be used as
1447
abbreviations for frequent phrases, or to include external files. Internal
1448
general entities are declared as follows:
1451
CLASS="PROGRAMLISTING"
1465
External general entities are declared this way:
1468
CLASS="PROGRAMLISTING"
1482
References to general entities are written as:
1485
CLASS="PROGRAMLISTING"
1494
The main difference between parameter and general entities is that the former
1495
are only recognized in the DTD and that the latter are only recognized in the
1496
document instance. As the DTD is parsed before the document, the parameter
1497
entities are expanded first; for example it is possible to use the content of a
1498
parameter entity as the name of a general entity:
1501
>&#38;%name;;</TT
1508
>General entities must respect the element hierarchy. This means that there must
1509
be an end tag for every start tag in the entity value, and that end tags
1510
without corresponding start tags are not allowed.</P
1518
>If the authors of a document change from time to time, it is worthwhile to set up a
general entity containing the names of the authors. If the authors change, you
only need to change the definition of the entity, and do not need to check all
occurrences of authors' names:
1524
CLASS="PROGRAMLISTING"
1525
><!ENTITY authors "Gerd Stolpmann"></PRE
1528
In the document text, you can now refer to the author names by writing
1538
The following two entities are illegal because the elements in the definition
1539
do not nest properly:
1542
CLASS="PROGRAMLISTING"
1543
><!ENTITY lengthy-tag "<section textcolor='white' background='graphic'>">
<!ENTITY nonsense "<a></b>"></PRE
1548
>Earlier in this introduction we explained that there are substitutes for
1549
reserved characters: &lt;, &gt;, &amp;, &apos;, and
1550
&quot;. These are simply predefined general entities; note that they are
1551
the only predefined entities. It is allowed to define these entities again
1552
as long as the meaning is unchanged.</P
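The macro-like behaviour of internal general entities can be checked with a few lines of code. This is a minimal sketch using Python's standard `xml.etree.ElementTree` module rather than PXP; the root element name `doc` and the surrounding sentence are assumptions made for illustration, while the `authors` entity is the one declared in the example above.

```python
import xml.etree.ElementTree as ET

# The internal subset declares the general entity "authors";
# the document instance refers to it as &authors;.
doc = """<!DOCTYPE doc [
<!ENTITY authors "Gerd Stolpmann">
]>
<doc>Written by &authors;.</doc>
"""

# The parser replaces the reference by the entity text before the
# application sees the character data.
text = ET.fromstring(doc).text
print(text)
```

Changing the entity definition changes every place in the instance where `&authors;` is used, without touching the document text itself.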
1560
>1.2.6. Notations and unparsed entities</A
1563
>Unparsed entities have a foreign format and therefore cannot be read by the XML
parser. Unparsed entities are always external. The format of an unparsed entity
must have been declared; such a format is called a
1569
>. The entity can then be declared by referring to
1570
this notation. As unparsed entities do not contain XML text, it is not possible
1571
to include them directly into the document; you can only declare attributes
1572
such that names of unparsed entities are acceptable values.</P
1574
>As you can see, unparsed entities are too complicated to be of much practical
use. It is almost always better to simply pass the name of the data file as
a normal attribute value, and let the application recognize and process the
1594
HREF="x107.html#AEN445"
1602
>This construct is only
1603
allowed within the definition of another entity; otherwise extra spaces would
1604
be added (as explained above). Such indirection is not recommended.</P
1608
CLASS="PROGRAMLISTING"
1609
><!ENTITY % variant "a"> <!-- or "b" -->
<!ENTITY text-a "This is text A.">
<!ENTITY text-b "This is text B.">
<!ENTITY text "&#38;text-%variant;;"></PRE
1614
You can now write <TT
1617
> in the document instance, and
1618
depending on the value of <TT
1685
>A complete example: The <I