203
204
>>> print(end[0].tag)
206
>>> root[0] = root[-1] # this moves the element in lxml.etree!
207
>>> for child in root:
213

Prior to ElementTree 1.3 and lxml 2.0, you could also check the truth value of
an Element to see if it has children, i.e. if the list of children is empty.
This is no longer supported as people tend to find it surprising that a
non-None reference to an existing Element can evaluate to False. Instead, use
``len(element)``, which is both more explicit and less error prone.

Note that in the last example, the last element was *moved* to a different
position. This is a difference from the original ElementTree (and
from lists), where elements can sit in multiple positions of any number of
trees. In lxml.etree, elements can only sit in one position of one tree at a
time.

If you want to *copy* an element to a different position, consider creating an
independent *deep copy* using the ``copy`` module from Python's standard
library.

Prior to ElementTree 1.3 and lxml 2.0, you could also check the truth value of
an Element to see if it has children, i.e. if the list of children is empty:

.. sourcecode:: python

    if root:   # this no longer works!
        print("The root element has children")

This is no longer supported as people tend to expect that a "something"
evaluates to True and expect Elements to be "something", whether they have
children or not. So, many users find it surprising that any Element
would evaluate to False in an if-statement like the above. Instead,
use ``len(element)``, which is both more explicit and less error prone.

.. sourcecode:: pycon

    >>> print(etree.iselement(root))  # test if it's some kind of Element
    True
    >>> if len(root):                 # test if it has children
    ...     print("The root element has children")
    The root element has children

There is another important case where the behaviour of Elements in lxml
(in 2.0 and later) deviates from that of lists and from that of the
original ElementTree (prior to version 1.3 or Python 2.7/3.2):
233
.. sourcecode:: pycon
235
>>> for child in root:
241
>>> root[0] = root[-1] # this moves the element in lxml.etree!
242
>>> for child in root:

In this example, the last element is *moved* to a different position,
instead of being copied, i.e. it is automatically removed from its
previous position when it is put in a different place. In lists,
objects can appear in multiple positions at the same time, and the
above assignment would just copy the item reference into the first
position, so that both contain the exact same item:
255
.. sourcecode:: pycon

Note that in the original ElementTree, a single Element object can sit
in any number of places in any number of trees, which allows for the same
copy operation as with lists. The obvious drawback is that modifications
to such an Element will apply to all places where it appears in a tree,
which may or may not be intended.

The upside of this difference is that an Element in lxml.etree always
has exactly one parent, which can be queried through the ``getparent()``
method. This is not supported in the original ElementTree.

.. sourcecode:: pycon

    >>> root is root[0].getparent()  # lxml.etree only!
    True
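
As a self-contained sketch of the single-parent behaviour described above,
the following builds a small tree (the tag names are made up for
illustration), assigns the last child to the first position and then checks
where the element ended up. It assumes that lxml is installed.

.. sourcecode:: python

    from lxml import etree

    root = etree.Element("root")
    for name in ("child1", "child2", "child3"):   # hypothetical tag names
        etree.SubElement(root, name)

    last = root[-1]
    root[0] = last   # moves child3 into the first slot, replacing child1

    tags = [child.tag for child in root]

    # The moved element no longer sits at the end of the tree ...
    assert tags == ["child3", "child2"]
    # ... and its one and only parent is still the root element.
    assert last.getparent() is root

With a plain Python list, the same assignment would leave the item in both
positions, which is exactly the difference the text points out.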

If you want to *copy* an element to a different position in lxml.etree,
consider creating an independent *deep copy* using the ``copy`` module
from Python's standard library:
229
281
.. sourcecode:: pycon
268
313
>>> etree.tostring(root)
269
314
b'<root interesting="totally"/>'

Fast and direct access to these attributes is provided by the ``set()`` and
``get()`` methods of elements:

Attributes are just unordered name-value pairs, so a very convenient way
of dealing with them is through the dictionary-like interface of Elements:
274
319
.. sourcecode:: pycon
276
321
>>> print(root.get("interesting"))
279
>>> root.set("interesting", "somewhat")
280
>>> print(root.get("interesting"))
283
However, a very convenient way of dealing with them is through the dictionary
284
interface of the ``attrib`` property:
324
>>> print(root.get("hello"))
326
>>> root.set("hello", "Huhu")
327
>>> print(root.get("hello"))
330
>>> etree.tostring(root)
331
b'<root interesting="totally" hello="Huhu"/>'
333
>>> sorted(root.keys())
334
['hello', 'interesting']
336
>>> for name, value in sorted(root.items()):
337
... print('%s = %r' % (name, value))
339
interesting = 'totally'

For the cases where you want to do item lookup or have other reasons for
getting a 'real' dictionary-like object, e.g. for passing it around,
you can use the ``attrib`` property:
286
345
.. sourcecode:: pycon
288
347
>>> attributes = root.attrib
290
349
>>> print(attributes["interesting"])
293
>>> print(attributes.get("hello"))
351
>>> print(attributes.get("no-such-attribute"))
296
354
>>> attributes["hello"] = "Guten Tag"
297
>>> print(attributes.get("hello"))
355
>>> print(attributes["hello"])
299
357
>>> print(root.get("hello"))

Note that ``attrib`` is a dict-like object backed by the Element itself.
This means that any changes to the Element are reflected in ``attrib``
and vice versa. It also means that the XML tree stays alive in memory
as long as the ``attrib`` of one of its Elements is in use. To get an
independent snapshot of the attributes that does not depend on the XML
tree, copy it into a dict:

.. sourcecode:: pycon

    >>> d = dict(root.attrib)
    >>> sorted(d.items())
    [('hello', 'Guten Tag'), ('interesting', 'totally')]
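
To make the difference between the live ``attrib`` proxy and a copied dict
more tangible, here is a small sketch. The variable names ``live_view`` and
``snapshot`` are chosen for illustration only, and lxml is assumed to be
installed.

.. sourcecode:: python

    from lxml import etree

    root = etree.Element("root", interesting="totally")

    live_view = root.attrib        # dict-like proxy backed by the Element
    snapshot = dict(root.attrib)   # plain dict, independent of the tree

    root.set("hello", "Huhu")      # change the Element after both were taken

    assert live_view["hello"] == "Huhu"   # the proxy sees the new attribute
    assert "hello" not in snapshot        # the copy does not

    live_view["interesting"] = "somewhat"          # writing through the proxy
    assert root.get("interesting") == "somewhat"   # changes the Element, too
    assert snapshot["interesting"] == "totally"    # the copy stays untouched

The snapshot is the safer choice when the attributes need to outlive the
tree or must not change underneath the code that reads them.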

Elements contain text
---------------------

The W3C has a good `article about the Unicode character set and
character encodings
<http://www.w3.org/International/tutorials/tutorial-char-enc/>`_.

The ElementTree class
=====================

An ``ElementTree`` is mainly a document wrapper around a tree with a
root node. It provides a couple of methods for parsing, serialisation
and general document handling. One of the bigger differences is that
it serialises as a complete document, as opposed to a single
``Element``. This includes top-level processing instructions and
comments, as well as a DOCTYPE and other DTD content in the document:

An ``ElementTree`` is mainly a document wrapper around a tree with a
root node. It provides a couple of methods for serialisation and
general document handling.
641
708
.. sourcecode:: pycon
643
>>> tree = etree.parse(StringIO('''\
710
>>> root = etree.XML('''\
644
711
... <?xml version="1.0"?>
645
... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
712
... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
647
714
... <a>&tasty;</a>
718
>>> tree = etree.ElementTree(root)
719
>>> print(tree.docinfo.xml_version)
651
721
>>> print(tree.docinfo.doctype)
652
722
<!DOCTYPE root SYSTEM "test">
654
>>> # lxml 1.3.4 and later
655
>>> print(etree.tostring(tree))
656
<!DOCTYPE root SYSTEM "test" [
657
<!ENTITY tasty "eggs">
663
>>> # lxml 1.3.4 and later
664
>>> print(etree.tostring(etree.ElementTree(tree.getroot())))
665
<!DOCTYPE root SYSTEM "test" [
666
<!ENTITY tasty "eggs">
672
>>> # ElementTree and lxml <= 1.3.3

An ``ElementTree`` is also what you get back when you call the
``parse()`` function to parse files or file-like objects (see the
parsing section below).

One of the important differences is that the ``ElementTree`` class
serialises as a complete document, as opposed to a single ``Element``.
This includes top-level processing instructions and comments, as well
as a DOCTYPE and other DTD content in the document:
733
.. sourcecode:: pycon
735
>>> print(etree.tostring(tree)) # lxml 1.3.4 and later
736
<!DOCTYPE root SYSTEM "test" [
737
<!ENTITY tasty "parsnips">

In the original xml.etree.ElementTree implementation and in lxml
up to 1.3.3, the output looks the same as when serialising only
the root Element:
.. sourcecode:: pycon
673
749
>>> print(etree.tostring(tree.getroot()))

Note that this has changed in lxml 1.3.4 to match the behaviour of
lxml 2.0. Before, the examples were serialised without DTD content,
which made lxml lose DTD information in an input-output cycle.

This serialisation behaviour has changed in lxml 1.3.4. Before,
the tree was serialised without DTD content, which made lxml
lose DTD information in an input-output cycle.
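
The difference between serialising the document wrapper and serialising
only its root Element can be checked with a short sketch like the
following. It assumes a current lxml (1.3.4 or later, as stated above) and
reuses the DOCTYPE from the example; the variable names are illustrative.

.. sourcecode:: python

    from lxml import etree

    xml = (
        b'<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>'
        b'<root><a>&tasty;</a></root>'
    )
    root = etree.XML(xml)
    tree = etree.ElementTree(root)

    as_document = etree.tostring(tree)   # complete document, with DTD content
    as_element = etree.tostring(root)    # only the root Element

    assert as_document.startswith(b'<!DOCTYPE root SYSTEM "test"')
    assert b"<!DOCTYPE" not in as_element
    assert tree.docinfo.doctype == '<!DOCTYPE root SYSTEM "test">'

If the DTD content has to survive a parse and serialise round trip, the
``ElementTree`` object is the thing to pass to ``tostring()``.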

Parsing from strings and files
==============================
>>> etree.tostring(root)
b'<root>data</root>'

There is also a corresponding function ``HTML()`` for HTML literals.

The parse() function
--------------------

The ``parse()`` function is used to parse from files and file-like objects.

As an example of such a file-like object, the following code uses the
``StringIO`` class for reading from a string instead of an external file.
That class comes from the ``StringIO`` module in Python 2. In Python 2.6
and later, including Python 3.x, you would rather use the ``BytesIO`` class
from the ``io`` module. However, in real life, you would obviously avoid
doing this altogether and use the string parsing functions above.

.. sourcecode:: pycon

    >>> some_file_like_object = StringIO("<root>data</root>")

    >>> tree = etree.parse(some_file_like_object)

    >>> etree.tostring(tree)
    b'<root>data</root>'
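
For Python 3, the ``BytesIO`` variant mentioned above might look like the
following sketch. It assumes lxml is installed and is meant as an
illustration of parsing from a file-like object, not as a recommendation
over the string parsing functions.

.. sourcecode:: python

    from io import BytesIO

    from lxml import etree

    # BytesIO wraps a byte string in a file-like object with a read() method.
    some_file_like_object = BytesIO(b"<root>data</root>")

    tree = etree.parse(some_file_like_object)   # returns an ElementTree
    serialised = etree.tostring(tree)

    assert serialised == b"<root>data</root>"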

* an HTTP or FTP URL string

Note that passing a filename or URL is usually faster than passing an
open file or file-like object. However, the HTTP/FTP client in libxml2
is rather simple, so things like HTTP authentication require a dedicated
URL request library, e.g. ``urllib2`` or ``requests``. These libraries
usually provide a file-like object for the result that you can parse
from while the response is streaming in.
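
As a rough sketch of that last point, the response object returned by
``urllib.request.urlopen()`` (the Python 3 successor of ``urllib2``) is
file-like and can be handed to ``parse()`` directly. The helper name
``parse_from_url`` is hypothetical, and authentication, timeouts and error
handling are deliberately left out.

.. sourcecode:: python

    from urllib.request import urlopen

    from lxml import etree


    def parse_from_url(url):
        """Parse an XML document from the file-like response of urlopen()."""
        # The response object provides read(), so lxml can consume it
        # like any other file-like object.
        with urlopen(url) as response:
            return etree.parse(response)

A call such as ``parse_from_url(some_url)`` then returns an ``ElementTree``,
just like parsing from a local file would.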
1023
1113
<html:body>Hello World</html:body>

.. _`namespace prefixes`: http://www.w3.org/TR/xml-names/#ns-qualnames

The notation that ElementTree uses was originally brought up by `James
Clark`_. It has the major advantage of providing a universally
qualified name for a tag, regardless of any prefixes that may or may
not have been used or defined in a document. By moving the
indirection of prefixes out of the way, it makes namespace aware code
much clearer and safer.

.. _`James Clark`: http://www.jclark.com/xml/xmlns.htm

The notation that ElementTree uses was originally brought up by
`James Clark <http://www.jclark.com/xml/xmlns.htm>`_. It has the major
advantage of providing a universally qualified name for a tag, regardless
of any prefixes that may or may not have been used or defined in a document.
By moving the indirection of prefixes out of the way, it makes namespace
aware code much clearer and easier to get right.
1037
1123
As you can see from the example, prefixes only become important when
1038
1124
you serialise the result. However, the above code looks somewhat
1058
1144
<body>Hello World</body>

You can also use the ``QName`` helper class to build or split qualified
tag names:

.. sourcecode:: pycon

    >>> tag = etree.QName('http://www.w3.org/1999/xhtml', 'html')
    >>> print(tag.localname)
    html
    >>> print(tag.namespace)
    http://www.w3.org/1999/xhtml
    >>> print(tag.text)
    {http://www.w3.org/1999/xhtml}html

    >>> tag = etree.QName('{http://www.w3.org/1999/xhtml}html')
    >>> print(tag.localname)
    html
    >>> print(tag.namespace)
    http://www.w3.org/1999/xhtml

    >>> root = etree.Element('{http://www.w3.org/1999/xhtml}html')
    >>> tag = etree.QName(root)
    >>> print(tag.localname)
    html

    >>> tag = etree.QName(root, 'script')
    >>> print(tag.text)
    {http://www.w3.org/1999/xhtml}script
    >>> tag = etree.QName('{http://www.w3.org/1999/xhtml}html', 'script')
    >>> print(tag.text)
    {http://www.w3.org/1999/xhtml}script

lxml.etree allows you to look up the current namespaces defined for a
node through the ``.nsmap`` property:

Therefore, modifying the returned dict cannot have any meaningful
impact on the Element. Any changes to it are ignored.
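
The following sketch illustrates that changes to the returned dict do not
reach the Element. The ``ex`` prefix and its URI are hypothetical values
used only for this illustration, and lxml is assumed to be installed.

.. sourcecode:: python

    from lxml import etree

    XHTML = "http://www.w3.org/1999/xhtml"
    root = etree.Element("{%s}html" % XHTML, nsmap={None: XHTML})

    mapping = root.nsmap                      # a dict of prefix -> URI
    mapping["ex"] = "http://example.com/ns"   # hypothetical namespace URI

    assert "ex" in mapping                # the local dict was changed ...
    assert "ex" not in root.nsmap         # ... but the Element was not
    assert root.nsmap == {None: XHTML}

Namespace declarations therefore have to be provided when an Element is
created, as done with the ``nsmap`` argument here.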

Namespaces on attributes work alike, but as of version 2.3, lxml.etree
will make sure that the attribute uses a prefixed namespace
declaration. This is because unprefixed attribute names are not
considered to be in a namespace by the XML namespace specification