CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
TITLE="The PXP user's guide"
HREF="index.html"><LINK
TITLE="Highlights of XML"
HREF="x107.html"><LINK
HREF="markup.css"></HEAD
>The PXP user's guide</TH
>Chapter 1. What is XML?</A
>A complete example: The <I
>1.1. Introduction</A
>Extensible Markup Language</I
generalizes the idea that text documents are typically structured in sections,
sub-sections, paragraphs, and so on. The format of the document is not fixed
(as it is, for example, in HTML), but can be declared by a so-called DTD (document
type definition). The DTD describes only the rules for how the document can be
structured, not how the document is to be processed. For example, if you want
to publish a book that uses XML markup, you will need a processor that converts
the XML file into a printable format such as PostScript. On the one hand, the
structure of XML documents is configurable; on the other hand, there is no
longer a canonical interpretation of the elements of the document. For example,
one XML DTD might require that paragraphs are delimited by
> tags, and another DTD expects <TT
for the same purpose. As a result, for every DTD a new processor is required.</P
>Although XML can be used to express structured text documents, it is not limited
to this kind of application. For example, XML can also be used to exchange
structured data over a network, or simply to store structured data in
files. Note that XML documents cannot contain arbitrary binary data because
some characters are forbidden; for some applications you need to encode binary
data as text (e.g. using the base 64 encoding).</P
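The base 64 remark can be illustrated with a small sketch. It uses Python's standard library rather than PXP, and the element name "blob" is made up for this example; it only shows the round trip from bytes to XML text and back.

```python
import base64
import xml.etree.ElementTree as ET

# Arbitrary binary data, including bytes that are not legal XML characters.
payload = bytes([0x00, 0x01, 0xFF, 0x10, 0x80])

# Encode the bytes as base 64 text and store them as element content.
# The element name "blob" is a hypothetical choice for this example.
blob = ET.Element("blob")
blob.text = base64.b64encode(payload).decode("ascii")
xml_text = ET.tostring(blob, encoding="unicode")

# The receiver parses the XML and decodes the text back into bytes.
decoded = base64.b64decode(ET.fromstring(xml_text).text)
```

The encoded text consists only of ASCII letters, digits, and a few punctuation characters, so it can appear as ordinary character data.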
>1.1.1. The "hello world" example</A
>The following example shows a very simple DTD and a corresponding document
instance. The document is structured such that it consists of sections,
sections consist of paragraphs, and paragraphs contain plain text:</P
CLASS="PROGRAMLISTING"
><!ELEMENT document (section)+>
<!ELEMENT section (paragraph)+>
<!ELEMENT paragraph (#PCDATA)></PRE
>The following document is an instance of this DTD:</P
CLASS="PROGRAMLISTING"
><?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE document SYSTEM "simple.dtd">
<document>
<section>
<paragraph>This is a paragraph of the first section.</paragraph>
<paragraph>This is another paragraph of the first section.</paragraph>
</section>
<section>
<paragraph>This is the only paragraph of the second section.</paragraph>
</section>
</document></PRE
>As in HTML (and, of course, in its ancestor SGML), the "pieces" of
the document are delimited by element braces, i.e. such a piece begins with
><name-of-the-type-of-the-piece></TT
></name-of-the-type-of-the-piece></TT
>, and the pieces are
>. Unlike in HTML and SGML, start tags and
end tags (i.e. the delimiters written in angle brackets) can never be left
out. For example, HTML calls the paragraphs simply <TT
because paragraphs never contain paragraphs, a sequence of several paragraphs
CLASS="PROGRAMLISTING"
><p>First paragraph
<p>Second paragraph</PRE
This is not possible in XML; continuing our example above, we must always write
CLASS="PROGRAMLISTING"
><paragraph>First paragraph</paragraph>
<paragraph>Second paragraph</paragraph></PRE
The rationale behind that is to (1) simplify the development of XML parsers
(you need not convert the DTD into a deterministic finite automaton, which is
required to detect omitted tags), and (2) make it possible to parse the
document regardless of whether the DTD is known.</P
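As a rough illustration of point (2), the following sketch parses both variants without consulting any DTD. It uses Python's standard-library parser, not PXP, and the helper name is_well_formed is made up for this example; the wrapping section element is added only because XML requires a single outermost element.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Every start tag has a matching end tag.
with_end_tags = (
    "<section>"
    "<paragraph>First paragraph</paragraph>"
    "<paragraph>Second paragraph</paragraph>"
    "</section>"
)

# HTML-style markup that leaves out the paragraph end tags.
html_style = (
    "<section>"
    "<paragraph>First paragraph"
    "<paragraph>Second paragraph"
    "</section>"
)

ok = is_well_formed(with_end_tags)
rejected = not is_well_formed(html_style)
```

The parser accepts the first text and rejects the second purely on syntactic grounds, because it never has to guess where an element ends.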
>The first line of our sample document,
CLASS="PROGRAMLISTING"
><?xml version="1.0" encoding="ISO-8859-1"?></PRE
>. It expresses that the
document follows the conventions of XML version 1.0, and that the document is
encoded using characters from the ISO-8859-1 character set (often known as
"Latin 1", mostly used in Western Europe). Although the XML declaration is not
mandatory, it is good style to include it; everyone can see at first glance
that the document uses XML markup and not the similar-looking HTML and SGML
markup languages. If you omit the XML declaration, the parser will assume
that the document is encoded as UTF-8 or UTF-16 (there is a rule that makes
it possible to distinguish between UTF-8 and UTF-16 automatically); these
are encodings of Unicode's universal character set. (Note that <SPAN
predecessor "Markup", fully supports Unicode.)</P
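The effect of the encoding declaration can be sketched as follows. The example uses Python's standard-library parser rather than PXP, and the sample text is made up; it only shows that the same characters are recovered from an ISO-8859-1 document with a declaration and from a UTF-8 document without one.

```python
import xml.etree.ElementTree as ET

# Sample text containing one non-ASCII character (e with acute accent).
text = "caf\u00e9"

# With an XML declaration naming ISO-8859-1, the accented character is
# stored as the single byte 0xE9.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<paragraph>" + text + "</paragraph>"
).encode("iso-8859-1")

# Without an XML declaration, the parser falls back to UTF-8, where the
# same character occupies two bytes.
utf8_doc = ("<paragraph>" + text + "</paragraph>").encode("utf-8")

from_latin1 = ET.fromstring(latin1_doc).text
from_utf8 = ET.fromstring(utf8_doc).text
```

Both documents yield the same character data even though their byte representations differ.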
CLASS="PROGRAMLISTING"
><!DOCTYPE document SYSTEM "simple.dtd"></PRE
names the DTD that is going to be used for the rest of the document. In
general, the DTD may consist of two parts, the so-called
external and internal subsets. "External" means that the DTD exists as a
second file; "internal" means that the DTD is included in the same file. In
this example, there is only an external subset, and the system identifier
"simple.dtd" specifies where the DTD file can be found. System identifiers are
interpreted as URLs; for instance, this would be legal:
CLASS="PROGRAMLISTING"
><!DOCTYPE document SYSTEM "http://host/location/simple.dtd"></PRE
Please note that <SPAN
> cannot interpret HTTP identifiers by default, but it is
possible to change the interpretation of system identifiers.</P
>The word immediately following <TT
> determines which of
the declared element types (here "document", "section", and "paragraph") is
used for the outermost element, the <I
> because the outermost element is
><document></TT
></document></TT
>The DTD consists of three declarations for element types:
>. Such a declaration has two parts:
CLASS="PROGRAMLISTING"
The content model is a regular expression that describes the possible inner
structure of the element. Here, <TT
more sections, and a <TT
> contains one or more
paragraphs. Note that these two element types are not allowed to contain
arbitrary text. Only the <TT
> element type is declared
such that parsed character data (indicated by the symbol
>See below for a detailed discussion of content models. </P
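To make the idea of a content model as a regular expression concrete, here is a minimal sketch that checks a document against the three declarations of simple.dtd. It is an illustration only: the content models are translated by hand into Python regular expressions over child element names, the names CONTENT_MODELS and is_valid are made up for this example, and a real validating parser such as PXP reads the DTD itself and works differently internally.

```python
import re
import xml.etree.ElementTree as ET

# Content models of simple.dtd, hand-translated into regular expressions
# over the sequence of child element names (each name followed by a space).
CONTENT_MODELS = {
    "document": re.compile(r"(section )+"),
    "section": re.compile(r"(paragraph )+"),
    "paragraph": re.compile(r""),  # (#PCDATA): text only, no child elements
}

def is_valid(element):
    """Check an element and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # undeclared element type
    children = "".join(child.tag + " " for child in element)
    if not model.fullmatch(children):
        return False
    # Only paragraphs may contain text; elsewhere only whitespace is allowed.
    if element.tag != "paragraph":
        texts = [element.text] + [child.tail for child in element]
        if any(t and t.strip() for t in texts):
            return False
    return all(is_valid(child) for child in element)

good = ET.fromstring(
    "<document><section>"
    "<paragraph>This is a paragraph of the first section.</paragraph>"
    "</section></document>"
)
# A paragraph directly below document violates the model (section)+.
bad = ET.fromstring("<document><paragraph>Misplaced</paragraph></document>")

good_result = is_valid(good)
bad_result = is_valid(bad)
```

Both inputs are syntactically correct XML; only the first also matches the declared structure.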
>1.1.2. XML parsers and processors</A
>XML documents are human-readable, but this is not the main purpose of this
language. XML has been designed such that documents can be read by a program
>. The parser checks that the document
is well-formatted, and it represents the document as objects of the programming
language. Checking the document has two aspects: First, the document
must follow some basic syntactic rules, such as that tags are written in angle
brackets, that for every start tag there must be a corresponding end tag, and so
on. A document respecting these rules is
>. Second, the document must match the DTD, in
which case the document is <I
>. Many parsers check only
for well-formedness and ignore the DTD; <SPAN
> is designed such that it can
also validate the document.</P
>A parser alone does not make a useful application; it only reads XML
documents. The whole application working with XML-formatted data is called an
>. Often XML processors convert documents into
another format, such as HTML or PostScript. Sometimes processors extract data
from the documents and output the processed data as XML again. The parser
can help the application process the document; for example, it can provide
means to access the document in a specific manner. <SPAN
object-oriented access layer specially.</P
>1.1.3. Discussion</A
>As we have seen, there are two levels of description: On the one hand, XML can
define rules about the format of a document (the DTD); on the other hand, XML
expresses structured documents. There are a number of possible applications:</P
STYLE="list-style-type: disc"
>XML can be used to express structured texts. Unlike HTML, there is no canonical
interpretation; one would have to write a backend for the DTD that translates
the structured texts into a format that existing browsers, printers,
etc. understand. The advantage of a self-defined document format is that it is
possible to design the format in a more problem-oriented way. For example, if
the task is to extract reports from a database, one can use a DTD that reflects
the structure of the report or the database. A possible approach would be to
have an element type for every database table and for every column. Once the
DTD has been designed, the report procedure can be split into a part that
selects the database rows and outputs them as an XML document according to the
DTD, and a part that translates the document into other formats. Of course,
the latter part can be solved in a generic way, e.g. there may be configurable
backends for all DTDs that follow the approach and have element types for
tables and columns.</P
>XML plays the role of a configurable intermediate format. The database
extraction function can be written without having to know the details of
typesetting; the backends can be written without having to know the details of
>Of course, there are traditional solutions. One can define an ad hoc
intermediate text file format. The disadvantage is that there are no names for
the pieces of the format, and that such formats usually lack documentation
because of this. Another solution would be to have a binary representation,
either as a language-dependent or a language-independent structure (examples of the
latter can be found in RPC implementations). The disadvantage is that it is
harder to view such representations; one has to write pretty printers for this
purpose. It is also more difficult to enter test data; XML is plain text that
can be written using an arbitrary editor (Emacs even has a good XML mode,
PSGML). All these alternatives suffer from a missing structure checker,
i.e. the programs processing these formats usually do not check the input file
or input object in detail; XML parsers check the syntax of the input (the
so-called well-formedness check), and advanced parsers like <SPAN
verify that the structure matches the DTD (the so-called validation).</P
STYLE="list-style-type: disc"
>XML can be used as a configurable communication language. A fundamental problem
of every communication is that sender and receiver must follow the same
conventions about the language. For data exchange, the question is usually
which data records and fields are available, how they are syntactically
composed, and which values are possible for the various fields. Similar
questions arise for text document exchange. XML does not solve these problems
completely, but it reduces the number of ambiguities in such conventions: The
outlines of the syntax are specified by the DTD (but not necessarily the
details), and XML introduces canonical names for the components of documents
such that it is simpler to describe the rest of the syntax and the semantics
STYLE="list-style-type: disc"
>XML is a data storage format. Currently, every software product tends to use
its own way to store data; commercial software often does not describe such
formats, and it is a pain to integrate such software into a bigger project.
XML can help to improve this situation when several applications share the same
syntax of data files. DTDs then act as neutral instances for checking the format of
data files independently of applications. </P
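The database report approach from the first item above can be sketched as follows. This is an illustration under stated assumptions, not part of PXP: the table name "customer", its columns, the wrapping "report" element, and both function names are hypothetical, and the rows are hard-coded instead of being selected from a real database.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows of a table named "customer" with the columns
# "id" and "name"; a real report would select these from a database.
rows = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]

def rows_to_xml(table, rows):
    """Extraction part: one element per row, one child element per column."""
    report = ET.Element("report")
    for row in rows:
        row_element = ET.SubElement(report, table)
        for column, value in row.items():
            ET.SubElement(row_element, column).text = str(value)
    return report

def to_plain_text(report):
    """Generic backend: one line per row, columns written as name=value."""
    return "\n".join(
        ", ".join(f"{col.tag}={col.text}" for col in row) for row in report
    )

# The XML text is the intermediate format between the two parts.
xml_text = ET.tostring(rows_to_xml("customer", rows), encoding="unicode")
listing = to_plain_text(ET.fromstring(xml_text))
```

The backend never refers to the names "customer", "id", or "name", so it works for any document that follows the same table-and-column convention.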
>Highlights of XML</TD