Fixed a small but annoying bug that caused BS to crash when presented
with HTML that contained boolean attributes.

A hybrid version that supports Python 2.4 and can be automatically
converted to run under Python 3.0. There are three backwards-incompatible
changes you should be aware of, but no new features or deliberate

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a byte string and decode() to get a
Unicode string, and you'll be ready (well, readier) for Python 3.
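
The same byte string / Unicode string split can be seen with plain
Python 3 strings. This is only a sketch using the built-in str and bytes
types, not Beautiful Soup objects; it assumes Python 3 and shows which
direction each call goes.

```python
# Plain Python 3 illustration of the str/bytes split; no Beautiful Soup needed.
text = "sacr\xe9 bleu"              # a Unicode string (str in Python 3)

encoded = text.encode("utf-8")      # encode(): Unicode string -> byte string
decoded = encoded.decode("utf-8")   # decode(): byte string -> Unicode string

print(type(encoded).__name__, encoded)
print(type(decoded).__name__, decoded == text)
```

Calling str() on the byte string would give you its repr-style text, not
the decoded characters, which is why an explicit decode() is the safer
habit.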

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

<a href="foo</a>, </a><a href="bar">baz</a>
<a b="<a>">', '<a b="<a>"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

<a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
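
The attribute behavior in item 3 can be observed with the standard
library parser on its own. This is a sketch assuming Python 3's
html.parser module; AttrCollector is a hypothetical helper written for
this example, not part of Beautiful Soup. The input spells the accented
character as the entity &eacute; inside the href.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; the parser has
        # already converted entities in the values by this point.
        self.seen.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed('<a href="http://crummy.com?sacr&eacute;&bleu">')
collector.close()

tag, attrs = collector.seen[0]
href = attrs["href"]
print(repr(href))
```

By the time handle_starttag() runs, the entity is gone from the value, so
code built on top of the parser never gets a chance to keep it.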

Added an import that makes BS work in Python 2.3.

Fixed a UnicodeDecodeError when unpickling documents that contain

Fixed a TypeError that occurred in some circumstances when a tag

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.
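
Why breaking the links helps: parent and child references point at each
other, so a tree is full of reference cycles. The sketch below shows the
general idea with a hypothetical Node class; it is not Beautiful Soup's
implementation, and the names are made up for illustration.

```python
class Node:
    """Hypothetical tree node, standing in for a parsed tag."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def decompose(self):
        """Detach this node from its parent, then break every link below it."""
        if self.parent is not None:
            self.parent.children.remove(self)
        stack = [self]
        while stack:
            node = stack.pop()
            stack.extend(node.children)
            node.parent = None    # drop the upward reference
            node.children = []    # drop the downward references


root = Node("html")
body = root.append(Node("body"))
para = body.append(Node("p"))

body.decompose()
print(root.children, body.parent, para.parent)
```

After the call no node in the removed subtree refers to any other, so
each one can be freed as soon as the last outside reference goes away.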

Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.

Soup objects can now be pickled, and copied with copy.deepcopy.
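
What makes this harder than it sounds is that a parse tree points back
at itself: children know their parents. The sketch below uses nested
dicts as a hypothetical stand-in for a tree (these are not soup objects)
to show that pickle and copy.deepcopy keep such back-references pointing
at the copy, not at the original.

```python
import copy
import pickle

# Hypothetical stand-in for a parse tree: dicts with parent back-links.
body = {"name": "body", "children": [], "parent": None}
para = {"name": "p", "children": [], "parent": body}
body["children"].append(para)

clone = copy.deepcopy(body)                    # independent in-memory copy
restored = pickle.loads(pickle.dumps(body))    # round trip through bytes

# In both results the child's parent link refers to the new tree.
print(clone["children"][0]["parent"] is clone)
print(restored["children"][0]["parent"] is restored)
```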

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).
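
One way to guard against an unknown encoding name is to ask the codec
registry first. This is a sketch of the general technique, not the fix
that went into Beautiful Soup; decode_with_fallback and its fallback
parameter are hypothetical names.

```python
import codecs


def decode_with_fallback(data, encoding, fallback="utf-8"):
    """Hypothetical helper: decode bytes, ignoring an unknown codec name."""
    try:
        codecs.lookup(encoding)    # raises LookupError for unknown names
    except LookupError:
        encoding = fallback
    # errors="replace" keeps undecodable bytes from raising as well.
    return data.decode(encoding, errors="replace")


print(decode_with_fallback(b"caf\xc3\xa9", "no-such-encoding"))
```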

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.
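
For a feel of what "converted to Unicode characters" means, here is the
same conversion done with html.unescape from the Python 3 standard
library. It is a stand-in for illustration only and is not the code path
Beautiful Soup uses.

```python
import html

# Three spellings of the same character: named, decimal, hexadecimal.
named = html.unescape("sacr&eacute;")
decimal = html.unescape("sacr&#233;")
hexadecimal = html.unescape("sacr&#xe9;")

print(named == decimal == hexadecimal)
```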

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)
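
A bare ampersand is one that doesn't start an entity reference. The
pattern below is a sketch of one way to find them with a negative
lookahead; it is an assumption made for illustration, not the regular
expression Beautiful Soup ships, and escape_bare_ampersands is a
hypothetical name.

```python
import re

# Match "&" unless it begins a decimal, hexadecimal, or named reference.
BARE_AMPERSAND = re.compile(r"&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)")


def escape_bare_ampersands(text):
    """Escape only the ampersands that are not part of an entity."""
    return BARE_AMPERSAND.sub("&amp;", text)


print(escape_bare_ampersands("Q&A &amp; R&D &#38; & done"))
```

Anything of the form &name; is treated as an entity here whether or not
the name is a real one, which is one of the ways such a pattern can still
be too loose.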

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)
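
The folding problem is easy to reproduce with regular expressions: on
Python 3 strings, \s matches a non-breaking space, so collapsing \s+
quietly turns it into a plain space. The sketch below contrasts that
with an explicit ASCII-only class; fold_ascii_whitespace is a
hypothetical name, and this is not Beautiful Soup's code.

```python
import re

ASCII_WHITESPACE = re.compile(r"[ \t\r\n\f]+")


def fold_ascii_whitespace(text):
    """Collapse runs of ASCII whitespace, leaving other spaces alone."""
    return ASCII_WHITESPACE.sub(" ", text)


sample = "a\xa0 \n b"                        # \xa0 is a non-breaking space
kept = fold_ascii_whitespace(sample)         # non-breaking space survives
folded = re.sub(r"\s+", " ", sample)         # non-breaking space is lost

print(repr(kept), repr(folded))
```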

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.
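
A case-insensitive check of this kind usually comes down to lowercasing
the name before comparing it. The sketch below shows the idea; the set
of encoding names and the function name are assumptions made for this
example, not taken from UnicodeDammit.

```python
# Hypothetical list of encodings in which smart quotes may appear.
SMART_QUOTE_ENCODINGS = {"windows-1252", "iso-8859-1", "iso-8859-2"}


def may_contain_smart_quotes(original_encoding):
    """Compare the encoding name without regard to case."""
    if original_encoding is None:
        return False
    return original_encoding.lower() in SMART_QUOTE_ENCODINGS


print(may_contain_smart_quotes("Windows-1252"))
```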