.. -*- mode: rst; encoding: utf-8 -*-

==============
Markup Streams
==============

A stream is the common representation of markup as a *stream of events*.


.. contents:: Contents
   :depth: 2
.. sectnum::


Basics
======

A stream can be obtained in a number of ways. It can be:

* the result of parsing XML or HTML text, or
* programmatically generated, or
* the result of selecting a subset of another stream with an XPath
  expression.

For example, the functions ``XML()`` and ``HTML()`` can be used to convert
literal XML or HTML text to a markup stream::

  >>> from genshi import XML
  >>> stream = XML('<p class="intro">Some text and '
  ...              '<a href="http://example.org/">a link</a>.'
  ...              '<br/></p>')
  >>> stream
  <genshi.core.Stream object at 0x6bef0>

The stream is the result of parsing the text into events. Each event is a tuple
of the form ``(kind, data, pos)``, where:

* ``kind`` defines what kind of event it is (such as the start of an element,
  text, a comment, etc.).
* ``data`` is the actual data associated with the event. How this looks depends
  on the event kind.
* ``pos`` is a ``(filename, lineno, column)`` tuple that describes where the
  event “comes from”.

::

  >>> for kind, data, pos in stream:
  ...     print kind, `data`, pos
  ... 
  START (u'p', [(u'class', u'intro')]) ('<string>', 1, 0)
  TEXT u'Some text and ' ('<string>', 1, 31)
  START (u'a', [(u'href', u'http://example.org/')]) ('<string>', 1, 31)
  TEXT u'a link' ('<string>', 1, 67)
  END u'a' ('<string>', 1, 67)
  TEXT u'.' ('<string>', 1, 72)
  START (u'br', []) ('<string>', 1, 72)
  END u'br' ('<string>', 1, 77)
  END u'p' ('<string>', 1, 77)
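
Because each event is a plain tuple, ordinary Python is enough to inspect a
stream. The following is a minimal sketch, not Genshi code: it uses a
hand-written list of tuples as a stand-in for the events printed above, with
plain strings for the event kinds and a dummy position (both are assumptions
made to keep the example free of imports). The helper names
``element_names`` and ``text_content`` are hypothetical.

```python
# Stand-in events mirroring the output above. The kinds are plain strings
# and the position is a dummy value; both are assumptions of this sketch.
POS = ('<string>', 1, 0)
events = [
    ('START', ('p', [('class', 'intro')]), POS),
    ('TEXT', 'Some text and ', POS),
    ('START', ('a', [('href', 'http://example.org/')]), POS),
    ('TEXT', 'a link', POS),
    ('END', 'a', POS),
    ('TEXT', '.', POS),
    ('START', ('br', []), POS),
    ('END', 'br', POS),
    ('END', 'p', POS),
]


def element_names(events):
    """Return the tag names of all START events, in document order."""
    return [data[0] for kind, data, pos in events if kind == 'START']


def text_content(events):
    """Concatenate the data of all TEXT events."""
    return ''.join(data for kind, data, pos in events if kind == 'TEXT')


print(element_names(events))
print(text_content(events))
```

Note that the shape of ``data`` depends on ``kind``: a ``(tag, attributes)``
pair for ``START``, but a bare string for ``TEXT`` and ``END``.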


Filtering
=========

One important feature of markup streams is that you can apply *filters* to the
stream, either filters that come with Genshi, or your own custom filters.

A filter is simply a callable that accepts the stream as a parameter and
returns the filtered stream::

  def noop(stream):
      """A filter that doesn't actually do anything with the stream."""
      for kind, data, pos in stream:
          yield kind, data, pos
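
A filter that changes something follows the same pattern. The sketch below is
a hypothetical ``upper_text`` filter that upper-cases text events and passes
everything else through untouched. It assumes that event kinds compare equal
to plain strings such as ``'TEXT'``, and it is exercised here on a
hand-written list of tuples rather than on a real Genshi stream.

```python
def upper_text(stream):
    """A filter that upper-cases text and passes other events through."""
    for kind, data, pos in stream:
        if kind == 'TEXT':  # assumption: kinds compare equal to plain strings
            data = data.upper()
        yield kind, data, pos


# Stand-in events with a dummy position, used only to exercise the filter.
POS = ('<string>', 1, 0)
events = [
    ('START', ('p', []), POS),
    ('TEXT', 'Some text', POS),
    ('END', 'p', POS),
]

filtered = list(upper_text(events))
print(filtered[1][1])
```

Because the filter is a generator, no work happens until something iterates
over the result.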

Filters can be applied in a number of ways. The simplest is to just call the
filter directly::

  stream = noop(stream)

The ``Stream`` class also provides a ``filter()`` method, which takes an
arbitrary number of filter callables and applies them all::

  stream = stream.filter(noop)

Finally, filters can also be applied using the *bitwise or* operator (``|``),
which allows a syntax similar to pipes on Unix shells::

  stream = stream | noop

One example of a filter included with Genshi is the ``HTMLSanitizer`` in
``genshi.filters``. It processes a stream of HTML markup, and strips out any
potentially dangerous constructs, such as JavaScript event handlers.
``HTMLSanitizer`` is not a function, but rather a class that implements
``__call__``, which means instances of the class are callable.
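
Writing a filter as a class is useful when the filter needs configuration. The
following sketch shows the general shape using a hypothetical ``TagRemover``
class, which is far simpler than ``HTMLSanitizer`` and not part of Genshi. It
assumes string event kinds and is run against hand-written stand-in events.

```python
class TagRemover(object):
    """Hypothetical filter that drops the named elements and their content."""

    def __init__(self, *names):
        self.names = frozenset(names)

    def __call__(self, stream):
        depth = 0  # greater than zero while inside an element being removed
        for kind, data, pos in stream:
            if depth:
                # Track nesting so the matching END event is found.
                if kind == 'START':
                    depth += 1
                elif kind == 'END':
                    depth -= 1
                continue
            if kind == 'START' and data[0] in self.names:
                depth = 1
                continue
            yield kind, data, pos


# Stand-in events with a dummy position.
POS = ('<string>', 1, 0)
events = [
    ('START', ('p', []), POS),
    ('TEXT', 'Hello', POS),
    ('START', ('script', []), POS),
    ('TEXT', 'alert(1)', POS),
    ('END', 'script', POS),
    ('END', 'p', POS),
]

remove_scripts = TagRemover('script')  # configure once
cleaned = list(remove_scripts(events))  # then call like any other filter
```

The configured instance can be passed anywhere a filter function is accepted.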

Both the ``filter()`` method and the pipe operator allow easy chaining of
filters::

  from genshi.filters import HTMLSanitizer
  stream = stream.filter(noop, HTMLSanitizer())

That is equivalent to::

  stream = stream | noop | HTMLSanitizer()
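
To see why the two spellings are equivalent, it can help to look at how such
an interface might be built. The sketch below uses a hypothetical
``MiniStream`` class; it illustrates the idea only and is not Genshi's actual
implementation. The ``drop_text`` filter and the stand-in events are likewise
made up for the example.

```python
class MiniStream(object):
    """Hypothetical stand-in for a stream; not Genshi's implementation."""

    def __init__(self, events):
        self.events = events

    def __iter__(self):
        return iter(self.events)

    def __or__(self, func):
        """``stream | func`` wraps the result of ``func(stream)``."""
        return MiniStream(func(self))

    def filter(self, *filters):
        """Apply each filter in turn, left to right."""
        stream = self
        for func in filters:
            stream = stream | func
        return stream


def noop(stream):
    """Pass every event through unchanged."""
    for event in stream:
        yield event


def drop_text(stream):
    """Hypothetical filter that removes all TEXT events."""
    for kind, data, pos in stream:
        if kind != 'TEXT':
            yield kind, data, pos


# Stand-in events with a dummy position.
POS = ('<string>', 1, 0)
events = [
    ('START', ('p', []), POS),
    ('TEXT', 'Hello', POS),
    ('END', 'p', POS),
]

piped = list(MiniStream(events) | noop | drop_text)
chained = list(MiniStream(events).filter(noop, drop_text))
```

In this sketch ``filter()`` is defined in terms of ``|``, so both forms apply
the filters in the same left-to-right order.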


Serialization
=============

The ``Stream`` class provides two methods for serializing this list of events:
``serialize()`` and ``render()``. The former is a generator that yields the
output in chunks, each a ``Markup`` object (basically a unicode string that is
considered safe for output on the web). The latter returns a single string, by
default UTF-8 encoded.

Here's the output from ``serialize()``::

  >>> for output in stream.serialize():
  ...     print `output`
  ... 
  <Markup u'<p class="intro">'>
  <Markup u'Some text and '>
  <Markup u'<a href="http://example.org/">'>
  <Markup u'a link'>
  <Markup u'</a>'>
  <Markup u'.'>
  <Markup u'<br/>'>
  <Markup u'</p>'>

And here's the output from ``render()``::

  >>> print stream.render()
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>

Both methods can be passed a ``method`` parameter that determines how exactly
the events are serialized to text. This parameter can be either “xml” (the
default), “xhtml”, “html”, “text”, or a custom serializer class::

  >>> print stream.render('html')
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>

Note how the ``<br>`` element isn't closed, which is the right thing to do for
HTML.

In addition, the ``render()`` method takes an ``encoding`` parameter, which
defaults to “UTF-8”. If set to ``None``, the result will be a unicode string.
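
The relationship between the two methods and the ``encoding`` parameter can be
sketched with a toy serializer. The functions below are hypothetical
stand-ins, not Genshi's serializers: they do no escaping, never collapse empty
elements, and operate on hand-written events with string kinds.

```python
def serialize(events):
    """Yield one chunk of text per event (toy version, no escaping)."""
    for kind, data, pos in events:
        if kind == 'START':
            tag, attrs = data
            attr_text = ''.join(' %s="%s"' % (name, value)
                                for name, value in attrs)
            yield '<%s%s>' % (tag, attr_text)
        elif kind == 'END':
            yield '</%s>' % data
        elif kind == 'TEXT':
            yield data


def render(events, encoding='utf-8'):
    """Join the serialized chunks; encode unless ``encoding`` is None."""
    text = ''.join(serialize(events))
    if encoding is None:
        return text
    return text.encode(encoding)


# Stand-in events with a dummy position.
POS = ('<string>', 1, 0)
events = [
    ('START', ('p', [('class', 'intro')]), POS),
    ('TEXT', 'Hi', POS),
    ('END', 'p', POS),
]

chunks = list(serialize(events))
as_text = render(events, encoding=None)
as_bytes = render(events)
```

Here ``render()`` does nothing more than join the chunks and optionally encode
them, which is why the chunked form suits incremental output and the joined
form suits cases where the whole result is needed at once.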

The different serializer classes in ``genshi.output`` can also be used
directly::

  >>> from genshi.filters import HTMLSanitizer
  >>> from genshi.output import TextSerializer
  >>> print TextSerializer()(HTMLSanitizer()(stream))
  Some text and a link.

The pipe operator allows a nicer syntax::

  >>> print stream | HTMLSanitizer() | TextSerializer()
  Some text and a link.

Using XPath
===========

XPath can be used to extract a specific subset of the stream via the
``select()`` method::

  >>> substream = stream.select('a')
  >>> substream
  <genshi.core.Stream object at 0x7118b0>
  >>> print substream
  <a href="http://example.org/">a link</a>

Often, streams cannot be reused: in the above example, the sub-stream is based
on a generator. Once it has been serialized, it will have been fully consumed,
and cannot be rendered again. To work around this, you can wrap such a stream
in a ``list``::

  >>> from genshi import Stream
  >>> substream = Stream(list(stream.select('a')))
  >>> substream
  <genshi.core.Stream object at 0x7118b0>
  >>> print substream
  <a href="http://example.org/">a link</a>
  >>> print substream.select('@href')
  http://example.org/
  >>> print substream.select('text()')
  a link
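
The underlying behaviour is that of any Python generator, and can be
reproduced without Genshi. The sketch below uses a hypothetical
``select_links`` generator as a stand-in for an XPath selection (it does not
handle nested ``a`` elements) and shows that a second pass over the generator
yields nothing, while a list copy can be iterated repeatedly.

```python
def select_links(events):
    """Hypothetical generator-based selection: yield events inside <a>."""
    inside = False
    for kind, data, pos in events:
        if kind == 'START' and data[0] == 'a':
            inside = True
        if inside:
            yield kind, data, pos
        if kind == 'END' and data == 'a':
            inside = False


# Stand-in events with a dummy position.
POS = ('<string>', 1, 0)
events = [
    ('START', ('p', []), POS),
    ('START', ('a', [('href', 'http://example.org/')]), POS),
    ('TEXT', 'a link', POS),
    ('END', 'a', POS),
    ('END', 'p', POS),
]

lazy = select_links(events)
first_pass = list(lazy)   # consumes the generator
second_pass = list(lazy)  # nothing left to yield

buffered = list(select_links(events))  # a list can be iterated many times
```

Buffering trades memory for reusability, so it is best reserved for streams
that really are consumed more than once.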