.. include:: global.rst

In this tutorial, you will be given a gentle introduction to
`XPath <http://en.wikipedia.org/wiki/XPath>`_, a query language that can be
used to select arbitrary parts of `HTML <http://en.wikipedia.org/wiki/HTML>`_
documents in |app|. XPath is a widely used standard, and searching for it
online will turn up plenty of information. This tutorial, however, focuses on
using XPath for ebook-related tasks, such as finding chapter headings in an
unstructured HTML document.

.. contents:: Contents

Selecting by tag name
----------------------------------------

The simplest form of selection is to select tags by name. For example,
suppose you want to select all the ``<h2>`` tags in a document. The XPath
query for this is simply::

    //h2        (Selects all <h2> tags)

The prefix ``//`` means *search at any level of the document*. Now suppose you
want to search for ``<span>`` tags that are inside ``<a>`` tags. That can be
done with::

    //a/span    (Selects <span> tags inside <a> tags)

If you want to search for tags at a particular level in the document, change
the prefix to a single ``/``::

    /body/div/p (Selects <p> tags that are children of <div> tags that are
                 children of the <body> tag)

This will match only ``<p>A very short ebook to demonstrate the use of XPath.</p>``
in the `Sample ebook`_ but not any of the other ``<p>`` tags.
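
The three queries above can be tried outside |app| with a short script. This is
only a sketch: it assumes Python's standard-library ``xml.etree.ElementTree``,
which supports a subset of XPath, and the ``SAMPLE`` markup is a hypothetical
stand-in, not the tutorial's sample ebook.

.. code-block:: python

    import xml.etree.ElementTree as ET

    # Hypothetical stand-in document, NOT the tutorial's sample ebook.
    SAMPLE = """
    <html>
      <body>
        <h1>A book</h1>
        <div><p>Direct child of a div under body.</p></div>
        <h2>Chapter One</h2>
        <p>Top-level <a href="#n1"><span>link text</span></a> paragraph.</p>
        <div><div><p>Nested more deeply.</p></div></div>
        <h2>Chapter Two</h2>
      </body>
    </html>
    """

    root = ET.fromstring(SAMPLE)  # root is the <html> element

    # ElementTree paths are relative to an element, so ".//h2" plays the
    # role of "//h2": search at any level below the root.
    h2_titles = [h2.text for h2 in root.findall(".//h2")]

    # ".//a/span" matches <span> tags that are children of <a> tags.
    span_texts = [span.text for span in root.findall(".//a/span")]

    # "./body/div/p" fixes the level: only <p> tags that are children of a
    # <div> that is itself a child of <body>.
    fixed_level = [p.text for p in root.findall("./body/div/p")]

    print(h2_titles)
    print(span_texts)
    print(fixed_level)

Because ElementTree evaluates paths relative to the ``<html>`` root element, the
paths start with ``.`` here. The more deeply nested ``<p>`` is found by a
``.//p`` search but not by the fixed-level path.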

Now suppose you want to select both ``<h1>`` and ``<h2>`` tags. To do that,
we need an XPath construct called a *predicate*. A :dfn:`predicate` is simply
a test that is used to select tags. Tests can be arbitrarily powerful, and as
this tutorial progresses, you will see more powerful examples. A predicate
is created by enclosing the test expression in square brackets::

    //*[name()='h1' or name()='h2']

There are several new features in this XPath expression. The first is the use
of the wildcard ``*``. It means *match any tag*. Now look at the test expression
``name()='h1' or name()='h2'``. :term:`name()` is an example of a *built-in function*.
It simply evaluates to the name of the tag. So by using it, we can select tags
whose names are either ``h1`` or ``h2``. XPath has several useful built-in functions.
A few more will be introduced in this tutorial.
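
To make the meaning of this predicate concrete, the sketch below spells out the
same test in plain Python. Python's standard-library ``xml.etree.ElementTree``
does not implement ``name()`` or ``or``, so the predicate is emulated rather
than evaluated as XPath. The helper ``select_by_names`` and the ``SAMPLE``
markup are hypothetical names used only for illustration.

.. code-block:: python

    import xml.etree.ElementTree as ET

    # Hypothetical stand-in document without XML namespaces.
    SAMPLE = (
        "<html><body>"
        "<h1>Title</h1><p>Intro</p>"
        "<h2>Chapter One</h2><h3>Aside</h3>"
        "<h2>Chapter Two</h2>"
        "</body></html>"
    )

    root = ET.fromstring(SAMPLE)


    def select_by_names(tree_root, names):
        """Emulate //*[name()='a' or name()='b'] for the given tag names.

        iter() visits every tag in document order (the '//*' part); the
        membership test stands in for the predicate in square brackets.
        """
        wanted = set(names)
        return [el for el in tree_root.iter() if el.tag in wanted]


    headings = select_by_names(root, ["h1", "h2"])
    heading_pairs = [(el.tag, el.text) for el in headings]
    print(heading_pairs)

The ``<h3>`` and ``<p>`` tags are visited by the wildcard but rejected by the
test, which is how a predicate narrows a selection.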

Selecting by attributes
-----------------------

To select tags based on their attributes, you need to use predicates::

    //*[@style]              (Select all tags that have a style attribute)
    //*[@class="chapter"]    (Select all tags that have class="chapter")
    //h1[@class="bookTitle"] (Select all h1 tags that have class="bookTitle")

Here, the ``@`` operator refers to the attributes of the tag. You can use some
of the `XPath built-in functions`_ to perform more sophisticated
matching on attribute values.
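
Attribute predicates of this simple form can be tried with Python's
standard-library ``xml.etree.ElementTree``, which supports ``[@attr]`` and
``[@attr='value']``. The sketch below assumes a hypothetical ``SAMPLE``
document; the class names mirror the queries above but the markup itself is
made up.

.. code-block:: python

    import xml.etree.ElementTree as ET

    # Hypothetical stand-in document, NOT the tutorial's sample ebook.
    SAMPLE = """
    <html>
      <body>
        <h1 class="bookTitle">A book</h1>
        <h2 class="chapter">Chapter One</h2>
        <p style="color: red">Styled paragraph.</p>
        <p class="chapter">Paragraph marked as a chapter.</p>
        <h1>Unclassified heading</h1>
      </body>
    </html>
    """

    root = ET.fromstring(SAMPLE)

    # Any tag that has a style attribute, whatever its value.
    with_style = [el.tag for el in root.findall(".//*[@style]")]

    # Any tag whose class attribute is exactly "chapter".
    chapters = [el.tag for el in root.findall(".//*[@class='chapter']")]

    # Only <h1> tags whose class attribute is exactly "bookTitle".
    book_titles = [el.text for el in root.findall(".//h1[@class='bookTitle']")]

    print(with_style)
    print(chapters)
    print(book_titles)

Note that ``[@class='chapter']`` is an exact comparison: a tag with
``class="chapter first"`` would not match.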

Selecting by tag content
------------------------

Using XPath, you can even select tags based on the text they contain. The best way to do this is
to use the power of *regular expressions* via the built-in function :term:`re:test()`::

    //h2[re:test(., 'chapter|section', 'i')] (Selects <h2> tags that contain the words chapter or
                                              section)

Here the ``.`` operator refers to the contents of the tag, just as the ``@`` operator referred
to its attributes.
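
The effect of this query can be approximated in plain Python. The sketch below
does not call ``re:test()`` itself, because Python's standard-library
``xml.etree.ElementTree`` does not provide it. Instead it emulates the test with
the ``re`` module, on the assumption that ``.`` stands for all the text inside
the tag. The helper ``matching_h2`` and the ``SAMPLE`` markup are hypothetical.

.. code-block:: python

    import re
    import xml.etree.ElementTree as ET

    # Hypothetical stand-in document, NOT the tutorial's sample ebook.
    SAMPLE = """<html><body>
    <h2>Chapter One</h2>
    <h2>Acknowledgements</h2>
    <h2>SECTION <i>two</i></h2>
    <h2>An Interlude</h2>
    </body></html>"""

    root = ET.fromstring(SAMPLE)

    # Same pattern as the query; IGNORECASE plays the role of the 'i' flag.
    pattern = re.compile(r"chapter|section", re.IGNORECASE)


    def matching_h2(tree_root):
        """Return the text of every <h2> whose contents match the pattern."""
        matches = []
        for h2 in tree_root.iter("h2"):
            # Join all nested text, approximating what "." refers to.
            text = "".join(h2.itertext())
            if pattern.search(text):
                matches.append(text)
        return matches


    matched = matching_h2(root)
    print(matched)

The third heading matches even though it is in upper case and part of its text
sits inside a nested ``<i>`` tag.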

Sample ebook
------------

.. literalinclude:: xpath.xhtml

XPath built-in functions
------------------------

.. glossary::

    name()
        The name of the current tag.

    contains()
        ``contains(s1, s2)`` returns ``true`` if ``s1`` contains ``s2``.

    re:test()
        ``re:test(src, pattern, flags)`` returns ``true`` if the string ``src`` matches the
        regular expression ``pattern``. A particularly useful flag is ``i``, which makes matching
        case insensitive. A good primer on the syntax for regular expressions can be found
        at `regexp syntax <http://docs.python.org/lib/re-syntax.html>`_.