Beautiful Soup's official support for Python 2 ended on December 31st,
2020. The final release to support Python 2 was Beautiful Soup
4.9.3. In the Launchpad Bazaar repository, the final revision to support
Python 2 was revision 605.

= 4.11.0 (Unreleased)

* Ported unit tests to use pytest.

* Added special string classes, RubyParenthesisString and RubyTextString,
  to make it possible to treat ruby text specially in get_text() calls.
  [bug=1941980]

* If the charset-normalizer Python module
  (https://pypi.org/project/charset-normalizer/) is installed, Beautiful
  Soup will use it to detect the character sets of incoming documents.
  This is also the module used by newer versions of the Requests library.
  For the sake of backwards compatibility, chardet and cchardet both take
  precedence if installed. [bug=1955346]

* Added a workaround for an lxml bug
  (https://bugs.launchpad.net/lxml/+bug/1948551) that causes
  problems when parsing a Unicode string beginning with BYTE ORDER MARK.
  [bug=1947768]

* Issue a warning when an HTML parser is used to parse a document that
  looks like XML but not XHTML. [bug=1939121]

* Do a better job of keeping track of namespaces as an XML document is
  parsed, so that CSS selectors that use namespaces will do the right
  thing more often. [bug=1946243]

* Some time ago, the misleadingly named "text" argument to find-type
  methods was renamed to the more accurate "string." But this supposed
  "renaming" didn't make it into important places like the method
  signatures or the docstrings. That's corrected in this
  version. "text" still works, but will give a DeprecationWarning.
  [bug=1947038]

* Fixed a crash when pickling a BeautifulSoup object that has no
  tree builder. [bug=1934003]

* Fixed a crash when overriding multi_valued_attributes and using the
  html5lib parser. [bug=1948488]

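As an illustration of the "string" rename in 4.11.0, here is a minimal
sketch, assuming Beautiful Soup 4.11.0 or later and the built-in
html.parser tree builder. The markup and the variable names are invented
for the example.

```python
import warnings

from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

# Preferred spelling: the "string" argument.
matches = soup.find_all("p", string="two")

# The old "text" spelling is still accepted, but is expected to emit a
# DeprecationWarning, which is collected here instead of being printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_matches = soup.find_all("p", text="two")

deprecations = [w for w in caught if issubclass(w.category, DeprecationWarning)]
```

Both calls should find the same <p> tag; only the second is expected to
leave an entry in 'deprecations'.
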
= 4.10.0 (20210907)

* This is the first release of Beautiful Soup to only support Python
  3. I dropped Python 2 support to maintain support for newer versions
  (58 and up) of setuptools. See:
  https://github.com/pypa/setuptools/issues/2769 [bug=1942919]

* The behavior of methods like .get_text() and .strings now differs
  depending on the type of tag. The change is visible with HTML tags
  like <script>, <style>, and <template>. Starting in 4.9.0, methods
  like get_text() returned no results on such tags, because the
  contents of those tags are not considered 'text' within the document
  as a whole.

  But a user who calls script.get_text() is working from a different
  definition of 'text' than a user who calls div.get_text()--otherwise
  there would be no need to call script.get_text() at all. In 4.10.0,
  the contents of (e.g.) a <script> tag are considered 'text' during a
  get_text() call on the tag itself, but not considered 'text' during
  a get_text() call on the tag's parent.

  Because of this change, calling get_text() on each child of a tag
  may now return a different result than calling get_text() on the tag
  itself. That's because different tags now have different
  understandings of what counts as 'text'. [bug=1906226] [bug=1868861]

* NavigableString and its subclasses now implement the get_text()
  method, as well as the properties .strings and
  .stripped_strings. These methods will either return the string
  itself, or nothing, so the only reason to use them is when iterating
  over a list of mixed Tag and NavigableString objects. [bug=1904309]

* The 'html5' formatter now treats attributes whose values are the
  empty string as HTML boolean attributes. Previously (and in other
  formatters), an attribute value had to be set to None to be treated as
  a boolean attribute. In a future release, I plan to also give this
  behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424]

* The 'replace_with()' method now takes a variable number of arguments,
  and can be used to replace a single element with a sequence of elements.
  Patch by Bill Chandos. [rev=605]

* Corrected output when the namespace prefix associated with a
  namespaced attribute is the empty string, as opposed to
  None. [bug=1915583]

* Performance improvement when processing tags that speeds up overall
  tree construction by 2%. Patch by Morotti. [bug=1899358]

* Corrected the use of special string container classes in cases when a
  single tag may contain strings with different containers, such as
  the <template> tag, which may contain both TemplateString objects
  and Comment objects. [bug=1913406]

* The html.parser tree builder can now handle named entities
  found in the HTML5 spec in much the same way that the html5lib
  tree builder does. Note that the lxml HTML tree builder doesn't handle
  named entities this way. [bug=1924908]

* Added a second way to specify encodings to UnicodeDammit and
  EncodingDetector, based on the order of precedence defined in the
  HTML5 spec, starting at:
  https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding

  Encodings in 'known_definite_encodings' are tried first, then
  byte-order-mark sniffing is run, then encodings in 'user_encodings'
  are tried. The old argument, 'override_encodings', is now a
  deprecated alias for 'known_definite_encodings'.

  This changes the default behavior of the html.parser and lxml tree
  builders, in a way that may slightly improve encoding
  detection but will probably have no effect. [bug=1889014]

* Improve the warning issued when a directory name (as opposed to
  the name of a regular file) is passed as markup into the BeautifulSoup
  constructor. [bug=1913628]

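The 4.10.0 changes to get_text(), replace_with() and encoding
precedence can be sketched as follows. This is a minimal illustration,
assuming Beautiful Soup 4.10.0 or later and the html.parser tree
builder; the markup, the byte string and the variable names are
invented for the example.

```python
from bs4 import BeautifulSoup
from bs4.dammit import UnicodeDammit

soup = BeautifulSoup(
    "<div><p>Hello <b>world</b></p><script>var x = 1;</script></div>",
    "html.parser",
)

# Called on the parent, get_text() does not count the script body as text...
div_text = soup.div.get_text()
# ...but called on the <script> tag itself, it does.
script_text = soup.script.get_text()

# replace_with() accepts several replacements for a single element.
em = soup.new_tag("em")
em.string = "there"
soup.b.replace_with("big ", em)
paragraph = str(soup.p)

# Encodings in known_definite_encodings are tried before any sniffing.
raw = "Sacr\u00e9 bleu!".encode("latin-1")
decoded = UnicodeDammit(raw, known_definite_encodings=["latin-1"]).unicode_markup
```

Here div_text should be "Hello world" while script_text holds the
JavaScript source, and the <b> tag ends up replaced by a string followed
by an <em> tag.
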
= 4.9.3 (20201003)

* Implemented a significant performance optimization to the process of
  searching the parse tree. Patch by Morotti. [bug=1898212]

= 4.9.2 (20200926)

* Fixed a bug that caused too many tags to be popped from the tag
  stack during tree building, when encountering a closing tag that had
  no matching opening tag. [bug=1880420]

* Fixed a bug that inconsistently moved elements over when passing
  a Tag, rather than a list, into Tag.extend(). [bug=1885710]

* Specify the soupsieve dependency in a way that complies with
  PEP 508. Patch by Mike Nerone. [bug=1893696]

* Change the signatures for BeautifulSoup.insert_before and insert_after
  (which are not implemented) to match PageElement.insert_before and
  insert_after, quieting warnings in some IDEs. [bug=1897120]

= 4.9.1 (20200517)

* Added a keyword argument 'on_duplicate_attribute' to the
  BeautifulSoupHTMLParser constructor (used by the html.parser tree
  builder) which lets you customize the handling of markup that
  contains the same attribute more than once, as in:
  <a href="url1" href="url2"> [bug=1878209]

* Added a distinct UserWarning subclass, GuessedAtParserWarning, for
  the warning issued when BeautifulSoup is instantiated without a
  parser being specified. [bug=1873787]

* Added a distinct UserWarning subclass, MarkupResemblesLocatorWarning,
  for the warning issued when BeautifulSoup is instantiated with
  'markup' that actually seems to be a URL or the path to a file on
  disk. [bug=1873787]

* The new NavigableString subclasses (Stylesheet, Script, and
  TemplateString) can now be imported directly from the bs4 package.

* If you encode a document with a Python-specific encoding like
  'unicode_escape', that encoding is no longer mentioned in the final
  XML or HTML document. Instead, encoding information is omitted or
  left blank. [bug=1874955]

* Fixed test failures when run against soupsieve 2.0. Patch by Tomáš
  Chvátal. [bug=1872279]

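A minimal sketch of two of the 4.9.1 additions, assuming Beautiful Soup
4.9.1 or later. It assumes that 'on_duplicate_attribute' can be passed
through the BeautifulSoup constructor when html.parser is the tree
builder, and that "ignore" is one of the accepted values; the markup
comes from the entry above and the variable names are invented.

```python
import warnings

from bs4 import BeautifulSoup, GuessedAtParserWarning

markup = '<a href="url1" href="url2">'

# Default behavior: the last value of a repeated attribute wins.
default_href = BeautifulSoup(markup, "html.parser").a["href"]

# "ignore" keeps the first value and discards later duplicates.
first_href = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute="ignore"
).a["href"]

# The dedicated warning class can be silenced on its own, without
# hiding every other UserWarning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=GuessedAtParserWarning)
    soup = BeautifulSoup("<p>no parser named</p>")
```

Filtering on the subclass rather than on UserWarning keeps unrelated
warnings visible.
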
= 4.9.0 (20200405)

* Added PageElement.decomposed, a new property which lets you
  check whether you've already called decompose() on a Tag or
  NavigableString.

* Embedded CSS and Javascript are now stored in distinct Stylesheet and
  Script objects, which are ignored by methods like get_text() since most
  people don't consider this sort of content to be 'text'. This
  feature is not supported by the html5lib treebuilder. [bug=1868861]

* Added a Russian translation by 'authoress' to the repository.

* Fixed an unhandled exception when formatting a Tag that had been
  decomposed. [bug=1857767]

* Fixed a bug that happened when passing a Unicode filename containing
  non-ASCII characters as markup into Beautiful Soup, on a system that
  allows Unicode filenames. [bug=1866717]

* Added a performance optimization to PageElement.extract(). Patch by
  Arthur Darcet.

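The PageElement.decomposed property added in 4.9.0 can be sketched like
this, assuming Beautiful Soup 4.9.0 or later; the markup and variable
names are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>keep</p><p>drop</p></div>", "html.parser")
doomed = soup.find_all("p")[1]

# Nothing has been destroyed yet.
before = doomed.decomposed

# decompose() removes the tag from the tree and destroys it.
doomed.decompose()
after = doomed.decomposed

remaining = str(soup.div)
```

The property is mainly useful when a reference to an element may
outlive the element itself.
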
= 4.8.2 (20191224)

* Added Python docstrings to all public methods of the most commonly
  used classes.

* Added a Chinese translation by Deron Wang and a Brazilian Portuguese
  translation by Cezar Peixeiro to the repository.

* Fixed two deprecation warnings. Patches by Colin
  Watson and Nicholas Neumann. [bug=1847592] [bug=1855301]

* The html.parser tree builder now correctly handles DOCTYPEs that are
  not uppercase. [bug=1848401]

* PageElement.select() now returns a ResultSet rather than a regular
  list, making it consistent with methods like find_all().

= 4.8.1 (20191006)

* When the html.parser or html5lib parsers are in use, Beautiful Soup
  will, by default, record the position in the original document where
  each tag was encountered. This includes line number (Tag.sourceline)
  and position within a line (Tag.sourcepos). Based on code by Chris
  Mayo. [bug=1742921]

* When instantiating a BeautifulSoup object, it's now possible to
  provide a dictionary ('element_classes') of the classes you'd like to be
  instantiated instead of Tag, NavigableString, etc.

* Fixed the definition of the default XML namespace when using
  lxml 4.4. Patch by Isaac Muse. [bug=1840141]

* Fixed a crash when pretty-printing tags that were not created
  during initial parsing. [bug=1838903]

* Copying a Tag preserves information that was originally obtained from
  the TreeBuilder used to build the original Tag. [bug=1838903]

* Raise an explanatory exception when the underlying parser
  completely rejects the incoming markup. [bug=1838877]

* Avoid a crash when trying to detect the declared encoding of a
  Unicode document. [bug=1838877]

* Avoid a crash when unpickling certain parse trees generated
  using html5lib on Python 3. [bug=1843545]

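A minimal sketch of the two 4.8.1 features, assuming Beautiful Soup
4.8.1 or later and the html.parser tree builder. LoudTag and its
shout() method are hypothetical names that exist only in this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


class LoudTag(Tag):
    """Hypothetical Tag subclass, defined only for this example."""

    def shout(self):
        return self.get_text().upper()


markup = "<p>first</p>\n<p>second</p>"

# element_classes maps a stock class to the class to instantiate instead.
soup = BeautifulSoup(markup, "html.parser", element_classes={Tag: LoudTag})

# Each tag remembers the line of the original document it came from.
lines = [tag.sourceline for tag in soup.find_all("p")]
shouted = soup.p.shout()
```

With this markup, the two <p> tags should report lines 1 and 2.
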
= 4.8.0 (20190720, "One Small Soup")

This release focuses on making it easier to customize Beautiful Soup's
input mechanism (the TreeBuilder) and output mechanism (the Formatter).

* You can customize the TreeBuilder object by passing keyword
  arguments into the BeautifulSoup constructor. Those keyword
  arguments will be passed along into the TreeBuilder constructor.

  The main reason to do this right now is to change which
  attributes are treated as multi-valued attributes (the way 'class'
  is treated by default). You can do this with the
  'multi_valued_attributes' argument. [bug=1832978]

* The role of Formatter objects has been greatly expanded. The Formatter
  class now controls the following:

  - The function to call to perform entity substitution. (This was
    previously Formatter's only job.)
  - Which tags should be treated as containing CDATA and have their
    contents exempt from entity substitution.
  - The order in which a tag's attributes are output. [bug=1812422]
  - Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'

  All preexisting code should work as before.

* Added a new method to the API, Tag.smooth(), which consolidates
  multiple adjacent NavigableString elements. [bug=1697296]

* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is always
  recognized as a named entity and converted to a single quote. [bug=1818721]

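The 4.8.0 customization points can be sketched as follows. This is a
minimal illustration, assuming Beautiful Soup 4.8.0 or later, the
html.parser tree builder, and the HTMLFormatter class from
bs4.formatter; the markup and variable names are invented.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

markup = '<p class="a b">one<br></p>'

# Default: 'class' is treated as multi-valued and split into a list.
default_class = BeautifulSoup(markup, "html.parser").p["class"]

# multi_valued_attributes is forwarded to the TreeBuilder; None turns
# the splitting off, so the attribute stays a single string.
soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
plain_class = soup.p["class"]

# A Formatter can decide whether void elements get a '/'.
no_slash = HTMLFormatter(void_element_close_prefix="")
rendered = soup.p.decode(formatter=no_slash)

# Tag.smooth() merges adjacent strings into one.
smooth_soup = BeautifulSoup("<p>A one</p>", "html.parser")
smooth_soup.p.append(", a two")
pieces_before = len(smooth_soup.p.contents)
smooth_soup.p.smooth()
pieces_after = len(smooth_soup.p.contents)
```

After smooth(), the paragraph should hold a single string, "A one, a two".
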
= 4.7.1 (20190106)

* Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]

* Fixed an incorrectly raised exception when inserting a tag before or
  after an identical tag. [bug=1810692]

* Beautiful Soup will no longer try to keep track of namespaces that
  are not defined with a prefix; this can confuse soupsieve. [bug=1810680]

* Tried even harder to avoid the deprecation warning originally fixed in
  4.6.1. [bug=1778909]

= 4.7.0 (20181231)

* Beautiful Soup's CSS Selector implementation has been replaced by a
  dependency on Isaac Muse's SoupSieve project (the soupsieve package
  on PyPI). The good news is that SoupSieve has a much more robust and
  complete implementation of CSS selectors, resolving a large number
  of longstanding issues. The bad news is that from this point onward,
  SoupSieve must be installed if you want to use the select() method.

  You don't have to change anything if you installed Beautiful Soup
  through pip (SoupSieve will be automatically installed when you
  upgrade Beautiful Soup) or if you don't use CSS selectors from
  within Beautiful Soup.

  SoupSieve documentation: https://facelessuser.github.io/soupsieve/

* Added the PageElement.extend() method, which works like list.extend().
  [bug=1514970]

* PageElement.insert_before() and insert_after() now take a variable
  number of arguments. [bug=1514970]

* Fixed a number of problems with the tree builder that produced
  trees that were superficially okay, but which fell apart when bits
  were extracted. Patch by Isaac Muse. [bug=1782928,1809910]

* Fixed a problem with the tree builder in which elements that
  contained no content (such as empty comments and all-whitespace
  elements) were not being treated as part of the tree. Patch by Isaac
  Muse. [bug=1798699]

* Fixed a problem with multi-valued attributes where the value
  contained whitespace. Thanks to Jens Svalgaard for the
  fix. [bug=1787453]

* Clarified ambiguous license statements in the source code. Beautiful
  Soup is released under the MIT license, and has been since 4.4.0.

* This file has been renamed from NEWS.txt to CHANGELOG.

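A minimal sketch of the tree-editing additions in 4.7.0, together with
a select() call that relies on SoupSieve. It assumes Beautiful Soup
4.7.0 or later with the soupsieve package installed; make_li() is a
hypothetical helper written for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>b</li></ul>", "html.parser")


def make_li(text):
    """Hypothetical helper: build a new <li> holding the given text."""
    li = soup.new_tag("li")
    li.string = text
    return li


middle = soup.li

# insert_before() and insert_after() accept any number of elements.
middle.insert_before(make_li("a"))
middle.insert_after(make_li("c"), make_li("d"))

# extend() appends every element of a list, in order.
soup.ul.extend([make_li("e"), make_li("f")])

# select() is backed by SoupSieve from 4.7.0 onward.
items = [li.get_text() for li in soup.select("ul > li")]
```

The list items should come out in alphabetical order, "a" through "f".
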
= 4.6.3 (20180812)

* Exactly the same as 4.6.2. Re-released to make the README file
  render properly on PyPI.

= 4.6.2 (20180812)

* Fix an exception when a custom formatter was asked to format a void
  element. [bug=1784408]

= 4.6.1 (20180728)

* Stop data loss when encountering an empty numeric entity, and
  possibly in other cases. Thanks to tos.kamiya for the fix. [bug=1698503]

* Preserve XML namespaces introduced inside an XML document, not just
  the ones introduced at the top level. [bug=1718787]

* Added a new formatter, "html5", which represents void elements
  as "<element>" rather than "<element/>". [bug=1716272]

* Fixed a problem where the html.parser tree builder interpreted
  a string like "&foo " as the character entity "&foo;" [bug=1728706]

* Correctly handle invalid HTML numeric character entities like &#147;
  which reference code points that are not Unicode code points. Note
  that this is only fixed when Beautiful Soup is used with the
  html.parser parser -- html5lib already worked and I couldn't fix it
  with lxml. [bug=1782933]

* Improved the warning given when no parser is specified. [bug=1780571]

* When markup contains duplicate elements, a select() call that
  includes multiple match clauses will match all relevant
  elements. [bug=1770596]

* Fixed code that was causing deprecation warnings in recent Python 3
  versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496]

* Fixed a Windows crash in diagnose() when checking whether a long
  markup string is a filename. [bug=1737121]

* Stopped HTMLParser from raising an exception in very rare cases of
  bad markup. [bug=1708831]

* Fixed a bug where find_all() was not working when asked to find a
  tag with a namespaced name in an XML document that was parsed as
  HTML. [bug=1723783]

* You can get finer control over formatting by subclassing
  bs4.element.Formatter and passing a Formatter instance into (e.g.)
  encode(). [bug=1716272]

* You can pass a dictionary of `attrs` into
  BeautifulSoup.new_tag. This makes it possible to create a tag with
  an attribute like 'name' that would otherwise be masked by another
  argument of new_tag. [bug=1779276]

* Clarified the deprecation warning when accessing tag.fooTag, to cover
  the possibility that you might really have been looking for a tag
  called 'fooTag'.

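A minimal sketch of the "html5" formatter and the `attrs` argument to
new_tag from 4.6.1, assuming Beautiful Soup 4.6.1 or later and the
html.parser tree builder; the markup and variable names are invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>line<br/>break</p>", "html.parser")

# The default "minimal" formatter closes void elements with a slash...
minimal = soup.p.decode(formatter="minimal")
# ...while "html5" leaves the slash out.
html5 = soup.p.decode(formatter="html5")

# 'name' is also the first positional argument of new_tag(), so an
# attribute called 'name' has to go through the attrs dictionary.
field = soup.new_tag("input", attrs={"name": "q"})
field_markup = str(field)
```

Passing name="q" as a keyword argument instead would collide with the
tag name argument, which is the masking problem the `attrs` dictionary
avoids.
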
= 4.6.0 (20170507) = |
444
by Leonard Richardson
Added the method, which acts like for |
390 |
|
447
by Leonard Richardson
Replace get_attribute_text with get_attribute_list. |
391 |
* Added the `Tag.get_attribute_list` method, which acts like `Tag.get` for |
392 |
getting the value of an attribute, but which always returns a list, |
|
393 |
whether or not the attribute is a multi-value attribute. [bug=1678589] |
|
442
by Leonard Richardson
It's now possible to use a tag's namespace prefix when searching, |
394 |
|
443
by Leonard Richardson
HTML parsers treat all HTML4 and HTML5 empty element tags (aka void element tags) correctly. [bug=1656909] |
395 |
* It's now possible to use a tag's namespace prefix when searching, |
396 |
e.g. soup.find('namespace:tag') [bug=1655332] |
|
397 |
||
446
by Leonard Richardson
Improved the handling of empty-element tags like <br> when using the |
398 |
* Improved the handling of empty-element tags like <br> when using the |
399 |
html.parser parser. [bug=1676935] |
|
400 |
||
443
by Leonard Richardson
HTML parsers treat all HTML4 and HTML5 empty element tags (aka void element tags) correctly. [bug=1656909] |
401 |
* HTML parsers treat all HTML4 and HTML5 empty element tags (aka void |
402 |
element tags) correctly. [bug=1656909] |
|
442
by Leonard Richardson
It's now possible to use a tag's namespace prefix when searching, |
403 |
|
449
by Leonard Richardson
Namespace prefix is preserved when an XML tag is copied. Thanks |
404 |
* Namespace prefix is preserved when an XML tag is copied. Thanks |
405 |
to Vikas for a patch and test. [bug=1685172] |
|
406 |
||
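A short sketch of how Tag.get_attribute_list differs from Tag.get, assuming the built-in html.parser tree builder (which treats 'class' as a multi-valued attribute); the markup is invented for the example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="intro" class="lead note">Hi</p>', "html.parser")
p = soup.p

# Tag.get returns a list for a multi-valued attribute and a string otherwise.
class_via_get = p.get("class")
id_via_get = p.get("id")

# Tag.get_attribute_list returns a list in both cases.
class_values = p.get_attribute_list("class")
id_values = p.get_attribute_list("id")

print(class_values, id_values)
```

This saves the caller from checking the type of the value before iterating over it.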
= 4.5.3 (20170102) =

* Fixed foster parenting when html5lib is the tree builder. Thanks to
  Geoffrey Sneddon for a patch and test.

* Fixed yet another problem that caused the html5lib tree builder to
  create a disconnected parse tree. [bug=1629825]

= 4.5.2 (20170102) =

* Apart from the version number, this release is identical to
  4.5.3. Due to user error, it could not be completely uploaded to
  PyPI. Use 4.5.3 instead.

= 4.5.1 (20160802) =

* Fixed a crash when passing Unicode markup that contained a
  processing instruction into the lxml HTML parser on Python
  3. [bug=1608048]

= 4.5.0 (20160719) =

* Beautiful Soup is no longer compatible with Python 2.6. This
  actually happened a few releases ago, but it's now official.

* Beautiful Soup will now work with versions of html5lib greater than
  0.99999999. [bug=1603299]

* If a search against each individual value of a multi-valued
  attribute fails, the search will be run one final time against the
  complete attribute value considered as a single string. That is, if
  a tag has class="foo bar" and neither "foo" nor "bar" matches, but
  "foo bar" does, the tag is now considered a match.

  This happened in previous versions, but only when the value being
  searched for was a string. Now it also works when that value is
  a regular expression, a list of strings, etc. [bug=1476868]

* Fixed a bug that deranged the tree when a whitespace element was
  reparented into a tag that contained an identical whitespace
  element. [bug=1505351]

* Added support for CSS selector values that contain quoted spaces,
  such as tag[style="display: foo"]. [bug=1540588]

* Corrected handling of XML processing instructions. [bug=1504393]

* Corrected an encoding error that happened when a BeautifulSoup
  object was copied. [bug=1554439]

* The contents of <textarea> tags will no longer be modified when the
  tree is prettified. [bug=1555829]

* When a BeautifulSoup object is pickled but its tree builder cannot
  be pickled, its .builder attribute is set to None instead of being
  destroyed. This avoids a performance problem once the object is
  unpickled. [bug=1523629]

* Specify the file and line number when warning about a
  BeautifulSoup object being instantiated without a parser being
  specified. [bug=1574647]

* The `limit` argument to `select()` now works correctly, though it's
  not implemented very efficiently. [bug=1520530]

* Fixed a Python 3 BytesWarning when a URL was passed in as though it
  were markup. Thanks to James Salter for a patch and
  test. [bug=1533762]

* We don't run the check for a filename passed in as markup if the
  'filename' contains a less-than character; the less-than character
  indicates it's most likely a very small document. [bug=1577864]

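The multi-valued attribute fallback described above can be sketched as follows. This assumes the built-in html.parser tree builder; the markup and the anchored regular expression are chosen for the example so that only the joined value can match:

```python
import re

from bs4 import BeautifulSoup

markup = '<p class="foo bar">one</p><p class="foo">two</p>'
soup = BeautifulSoup(markup, "html.parser")

# Neither "foo" nor "bar" alone satisfies this anchored pattern, so a match
# is only possible in the final pass against the joined string "foo bar".
whole_value = re.compile(r"^foo bar$")
matches = soup.find_all("p", class_=whole_value)

print([tag.get_text() for tag in matches])
```

The second paragraph is not returned, because its single class value "foo" fails the pattern and there is nothing further to join.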
= 4.4.1 (20150928) =

* Fixed a bug that deranged the tree when part of it was
  removed. Thanks to Eric Weiser for the patch and John Wiseman for a
  test. [bug=1481520]

* Fixed a parse bug with the html5lib tree-builder. Thanks to Roel
  Kramer for the patch. [bug=1483781]

* Improved the implementation of CSS selector grouping. Thanks to
  Orangain for the patch. [bug=1484543]

* Fixed the test_detect_utf8 test so that it works when chardet is
  installed. [bug=1471359]

* Corrected the output of Declaration objects. [bug=1477847]

= 4.4.0 (20150703) =

Especially important changes:

* Added a warning when you instantiate a BeautifulSoup object without
  explicitly naming a parser. [bug=1398866]

* __repr__ now returns an ASCII bytestring in Python 2, and a Unicode
  string in Python 3, instead of a UTF8-encoded bytestring in both
  versions. In Python 3, __str__ now returns a Unicode string instead
  of a bytestring. [bug=1420131]

* The `text` argument to the find_* methods is now called `string`,
  which is more accurate. `text` still works, but `string` is the
  argument described in the documentation. `text` may eventually
  change its meaning, but not for a very long time. [bug=1366856]

* Changed the way soup objects work under copy.copy(). Copying a
  NavigableString or a Tag will give you a new object that's
  equal to the old one but not connected to the parse tree. Patch by
  Martijn Pieters. [bug=1307490]

* Started using a standard MIT license. [bug=1294662]

* Added a Chinese translation of the documentation by Delong .w.

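As a rough illustration of the copy.copy() change above, the following sketch copies a Tag and checks that the copy is equal to the original but detached from the tree. The markup is invented, and the built-in html.parser tree builder is assumed:

```python
import copy

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
original = soup.b
duplicate = copy.copy(original)

print(duplicate == original)    # same name, attributes and contents
print(duplicate is original)    # but a distinct object
print(duplicate.parent)         # the copy has no parent in any tree
print(original.parent.name)     # the original is still inside <p>
```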
New features:

* Introduced the select_one() method, which uses a CSS selector but
  only returns the first match, instead of a list of
  matches. [bug=1349367]

* You can now create a Tag object without specifying a
  TreeBuilder. Patch by Martijn Pieters. [bug=1307471]

* You can now create a NavigableString or a subclass just by invoking
  the constructor. [bug=1294315]

* Added an `exclude_encodings` argument to UnicodeDammit and to the
  Beautiful Soup constructor, which lets you prohibit the detection of
  an encoding that you know is wrong. [bug=1469408]

* The select() method now supports selector grouping. Patch by
  Francisco Canas [bug=1191917]

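A small sketch contrasting select_one() with select(). The markup is invented and the built-in html.parser tree builder is assumed; note that recent Beautiful Soup releases delegate CSS selection to the separate soupsieve package, which is assumed to be installed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "html.parser")

first = soup.select_one("li")       # a single Tag
every = soup.select("li")           # a list-like collection of all matches
missing = soup.select_one("table")  # None when nothing matches

print(first.get_text(), len(every), missing)
```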
Bug fixes:

* Fixed yet another problem that caused the html5lib tree builder to
  create a disconnected parse tree. [bug=1237763]

* Force object_was_parsed() to keep the tree intact even when an element
  from later in the document is moved into place. [bug=1430633]

* Fixed yet another bug that caused a disconnected tree when html5lib
  copied an element from one part of the tree to another. [bug=1270611]

* Fixed a bug where Element.extract() could create an infinite loop in
  the remaining tree.

* The select() method can now find tags whose names contain
  dashes. Patch by Francisco Canas. [bug=1276211]

* The select() method can now find tags with attributes whose names
  contain dashes. Patch by Marek Kapolka. [bug=1304007]

* Improved the lxml tree builder's handling of processing
  instructions. [bug=1294645]

* Restored the helpful syntax error that happens when you try to
  import the Python 2 edition of Beautiful Soup under Python
  3. [bug=1213387]

* In Python 3.4 and above, set the new convert_charrefs argument to
  the html.parser constructor to avoid a warning and future
  failures. Patch by Stefano Revera. [bug=1375721]

* The warning when you pass in a filename or URL as markup will now be
  displayed correctly even if the filename or URL is a Unicode
  string. [bug=1268888]

* If the initial <html> tag contains a CDATA list attribute such as
  'class', the html5lib tree builder will now turn its value into a
  list, as it would with any other tag. [bug=1296481]

* Fixed an import error in Python 3.5 caused by the removal of the
  HTMLParseError class. [bug=1420063]

* Improved docstring for encode_contents() and
  decode_contents(). [bug=1441543]

* Fixed a crash in Unicode, Dammit's encoding detector when the name
  of the encoding itself contained invalid bytes. [bug=1360913]

* Improved the exception raised when you call .unwrap() or
  .replace_with() on an element that's not attached to a tree.

* Raise a NotImplementedError whenever an unsupported CSS pseudoclass
  is used in select(). Previously some cases did not result in a
  NotImplementedError.

* It's now possible to pickle a BeautifulSoup object no matter which
  tree builder was used to create it. However, the only tree builder
  that survives the pickling process is the HTMLParserTreeBuilder
  ('html.parser'). If you unpickle a BeautifulSoup object created with
  some other tree builder, soup.builder will be None. [bug=1231545]

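The pickling entry above can be tried with a round trip like the one below. This is only a sketch with invented markup; it uses 'html.parser', the tree builder the entry says survives pickling, and it does not inspect soup.builder because how that attribute is restored may differ between releases:

```python
import pickle

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# Serialize the whole parse tree and load it back as a new object.
restored = pickle.loads(pickle.dumps(soup))

print(restored.decode() == soup.decode())
print(restored.b.string)
```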
= 4.3.2 (20131002) =

* Fixed a bug in which short Unicode input was improperly encoded to
  ASCII when checking whether or not it was the name of a file on
  disk. [bug=1227016]

* Fixed a crash when a short input contains data not valid in
  filenames. [bug=1232604]

* Fixed a bug that caused Unicode data put into UnicodeDammit to
  return None instead of the original data. [bug=1214983]

* Combined two tests to stop a spurious test failure when tests are
  run by nosetests. [bug=1212445]

= 4.3.1 (20130815) =

* Fixed yet another problem with the html5lib tree builder, caused by
  html5lib's tendency to rearrange the tree during
  parsing. [bug=1189267]

* Fixed a bug that caused the optimized version of find_all() to
  return nothing. [bug=1212655]

= 4.3.0 (20130812) =

* Instead of converting incoming data to Unicode and feeding it to the
  lxml tree builder in chunks, Beautiful Soup now makes successive
  guesses at the encoding of the incoming data, and tells lxml to
  parse the data as that encoding. Giving lxml more control over the
  parsing process improves performance and avoids a number of bugs and
  issues with the lxml parser which had previously required elaborate
  workarounds:

  - An issue in which lxml refuses to parse Unicode strings on some
    systems. [bug=1180527]

  - A returning bug that truncated documents longer than a (very
    small) size. [bug=963880]

  - A returning bug in which extra spaces were added to a document if
    the document defined a charset other than UTF-8. [bug=972466]

  This required a major overhaul of the tree builder architecture. If
  you wrote your own tree builder and didn't tell me, you'll need to
  modify your prepare_markup() method.

* The UnicodeDammit code that makes guesses at encodings has been
  split into its own class, EncodingDetector. A lot of apparently
  redundant code has been removed from Unicode, Dammit, and some
  undocumented features have also been removed.

* Beautiful Soup will issue a warning if instead of markup you pass it
  a URL or the name of a file on disk (a common beginner's mistake).

* A number of optimizations improve the performance of the lxml tree
  builder by about 33%, the html.parser tree builder by about 20%, and
  the html5lib tree builder by about 15%.

* All find_all calls should now return a ResultSet object. Patch by
  Aaron DeVore. [bug=1194034]

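The "successive guesses" idea can be observed through the EncodingDetector class mentioned above. The sketch below assumes that the class is importable from bs4.dammit and exposes its guesses through an `encodings` property, as it does in recent releases; the markup is invented, and the full list of candidates depends on which detection libraries are installed:

```python
from bs4.dammit import EncodingDetector

markup = b'<html><head><meta charset="iso-8859-2"></head><body></body></html>'

# is_html=True lets the detector look for an HTML <meta> declaration.
detector = EncodingDetector(markup, is_html=True)

# Iterating over .encodings yields candidate encodings in the order in
# which they should be tried.
candidates = list(detector.encodings)
print(candidates)
```

A tree builder can then attempt each candidate in turn until one of them decodes the document cleanly.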
= 4.2.1 (20130531) =

* The default XML formatter will now replace ampersands even if they
  appear to be part of entities. That is, "&lt;" will become
  "&amp;lt;". The old code was left over from Beautiful Soup 3, which
  didn't always turn entities into Unicode characters.

  If you really want the old behavior (maybe because you add new
  strings to the tree, those strings include entities, and you want
  the formatter to leave them alone on output), it can be found in
  EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]

* Gave new_string() the ability to create subclasses of
  NavigableString. [bug=1181986]

* Fixed another bug by which the html5lib tree builder could create a
  disconnected tree. [bug=1182089]

* The .previous_element of a BeautifulSoup object is now always None,
  not the last element to be parsed. [bug=1182089]

* Fixed test failures when lxml is not installed. [bug=1181589]

* html5lib now supports Python 3. Fixed some Python 2-specific
  code in the html5lib test suite. [bug=1181624]

* The html.parser treebuilder can now handle numeric attributes in
  text when the hexadecimal name of the attribute starts with a
  capital X. Patch by Tim Shirley. [bug=1186242]

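The difference between the new default substitution and the old behavior can be sketched with the two EntitySubstitution class methods. The sample string is invented for the example:

```python
from bs4.dammit import EntitySubstitution

text = "AT&T &lt;3"

# Default behavior: every ampersand is escaped, including the one that
# begins what looks like an existing entity.
strict = EntitySubstitution.substitute_xml(text)

# Old behavior: ampersands that already start an entity are left alone.
lenient = EntitySubstitution.substitute_xml_containing_entities(text)

print(strict)
print(lenient)
```

The bare ampersand in "AT&T" is escaped either way; only the treatment of the existing entity differs.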
= 4.2.0 (20130514) =

* The Tag.select() method now supports a much wider variety of CSS
  selectors.

  - Added support for the adjacent sibling combinator (+) and the
    general sibling combinator (~). Tests by "liquider". [bug=1082144]

  - The combinators (>, +, and ~) can now combine with any supported
    selector, not just one that selects based on tag name.

  - Added limited support for the "nth-of-type" pseudo-class. Code
    by Sven Slootweg. [bug=1109952]

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
  it quicker to type the import statement in an interactive session:

  from bs4 import _s
   or
  from bs4 import _soup

  The alias may change in the future, so don't use this in code you're
  going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
  functions for reporting problems and doing tech support.

  - diagnose(data) tries the given markup on every installed parser,
    reporting exceptions and displaying successes. If a parser is not
    installed, diagnose() mentions this fact.

  - lxml_trace(data, html=True) runs the given markup through lxml's
    XML parser or HTML parser, and prints out the parser events as
    they happen. This helps you quickly determine whether a given
    problem occurs in lxml code or Beautiful Soup code.

  - htmlparser_trace(data) is the same thing, but for Python's
    built-in HTMLParser class.

* In an HTML document, the contents of a <script> or <style> tag will
  no longer undergo entity substitution by default. XML documents work
  the same way they did before. [bug=1085953]

* Methods like get_text() and properties like .strings now only give
  you strings that are visible in the document--no comments or
  processing instructions. [bug=1050164]

* The prettify() method now leaves the contents of <pre> tags
  alone. [bug=1095654]

* Fix a bug in the html5lib treebuilder which sometimes created
  disconnected trees. [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
  an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
  being passed on. [bug=1126734]

* Stop a crash when unwisely messing with a tag that's been
  decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
  corresponding problem on the Beautiful Soup end that was previously
  invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
  anything. Code by Stefaan Lippens. [bug=1168167]

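The sibling combinators and the nth-of-type pseudo-class listed above can be tried with a sketch like this one. The markup and ids are invented, the built-in html.parser tree builder is assumed, and in recent releases select() relies on the separate soupsieve package being installed:

```python
from bs4 import BeautifulSoup

markup = (
    "<div><h1>Title</h1>"
    "<p id='a'>x</p><p id='b'>y</p><p id='c'>z</p></div>"
)
soup = BeautifulSoup(markup, "html.parser")

# '+' selects only the sibling immediately after <h1>.
adjacent = [p["id"] for p in soup.select("h1 + p")]

# '~' selects every later sibling of <h1>.
later = [p["id"] for p in soup.select("h1 ~ p")]

# nth-of-type counts only siblings with the same tag name.
third = [p["id"] for p in soup.select("p:nth-of-type(3)")]

print(adjacent, later, third)
```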
= 4.1.3 (20120820) =

* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
  test failure caused by the lousy HTMLParser in those
  versions. [bug=1038503]

* Raise a more specific error (FeatureNotFound) when a requested
  parser or parser feature is not installed. Raise NotImplementedError
  instead of ValueError when the user calls insert_before() or
  insert_after() on the BeautifulSoup object itself. Patch by Aaron
  DeVore. [bug=1038301]

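A minimal sketch of catching the more specific FeatureNotFound error. The parser name "no-such-parser" is a made-up value used only to trigger the error:

```python
from bs4 import BeautifulSoup, FeatureNotFound

try:
    # "no-such-parser" is a hypothetical feature name that nothing provides.
    BeautifulSoup("<p>hi</p>", "no-such-parser")
    outcome = "parsed"
except FeatureNotFound:
    outcome = "parser unavailable"

print(outcome)
```

Catching FeatureNotFound lets a program fall back to another parser, such as 'html.parser', without masking unrelated errors.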
= 4.1.2 (20120817) =

* As per PEP-8, allow searching by CSS class using the 'class_'
  keyword argument. [bug=1037624]

* Display namespace prefixes for namespaced attribute names, instead of
  the fully-qualified names given by the lxml parser. [bug=1037597]

* Fixed a crash on encoding when an attribute name contained
  non-ASCII characters.

* When sniffing encodings, if the cchardet library is installed,
  Beautiful Soup uses it instead of chardet. cchardet is much
  faster. [bug=1020748]

* Use logging.warning() instead of warnings.warn() to notify the user
  that characters were replaced with REPLACEMENT
  CHARACTER. [bug=1013862]

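A brief sketch of the 'class_' keyword argument next to the dictionary form it complements. The markup is invented and the built-in html.parser tree builder is assumed:

```python
from bs4 import BeautifulSoup

markup = '<p class="note">a</p><p class="body">b</p>'
soup = BeautifulSoup(markup, "html.parser")

# 'class' is a reserved word in Python, hence the trailing underscore.
notes = soup.find_all("p", class_="note")

# The same search written with an attrs dictionary.
same = soup.find_all("p", attrs={"class": "note"})

print([tag.get_text() for tag in notes])
```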
= 4.1.1 (20120703) =

* Fixed an html5lib tree builder crash which happened when html5lib
  moved a tag with a multivalued attribute from one part of the tree
  to another. [bug=1019603]

* Correctly display closing tags with an XML namespace declared. Patch
  by Andreas Kostyrka. [bug=1019635]

* Fixed a typo that made parsing significantly slower than it should
  have been and also delayed the closing of tags with XML
  namespaces. [bug=1020268]

* get_text() now returns an empty Unicode string if there is no text,
  rather than an empty bytestring. [bug=1020387]

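As a quick illustration of the get_text() change, here is a small sketch
run against a current Beautiful Soup 4 on Python 3 (an assumption; the fix
itself was made for the Python 2 era). The empty paragraph is invented for
the example.

```python
from bs4 import BeautifulSoup

# A document with a tag but no text at all.
soup = BeautifulSoup("<p></p>", "html.parser")

# get_text() yields an empty Unicode string, never a bytestring.
text = soup.get_text()
print(repr(text), type(text).__name__)
```
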
= 4.1.0 (20120529) =

* Added experimental support for fixing Windows-1252 characters
  embedded in UTF-8 documents. (UnicodeDammit.detwingle())

* Fixed the handling of &quot; with the built-in parser. [bug=993871]

* Comments, processing instructions, document type declarations, and
  markup declarations are now treated as preformatted strings, the way
  CData blocks are. [bug=1001025]

* Fixed a bug with the lxml treebuilder that prevented the user from
  adding attributes to a tag that didn't originally have
  attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.

* Fixed some edge-case bugs having to do with inserting an element
  into a tag it's already inside, and replacing one of a tag's
  children with another. [bug=997529]

* Added the ability to search for attribute values specified in UTF-8. [bug=1003974]

  This caused a major refactoring of the search code. All the tests
  pass, but it's possible that some searches will behave differently.

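The detwingle() entry above can be sketched roughly like this. The sample
bytes are invented: a UTF-8 string with Windows-1252 smart quotes glued on
the end. It assumes a current Beautiful Soup 4 where UnicodeDammit lives in
bs4.dammit.

```python
from bs4.dammit import UnicodeDammit

# Three snowmen encoded as UTF-8, followed by a quote encoded as Windows-1252.
snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"
mixed = snowmen.encode("utf8") + quote.encode("windows-1252")

# detwingle() rewrites the Windows-1252 bytes as UTF-8, so the whole
# document can then be decoded as UTF-8 in one go.
fixed = UnicodeDammit.detwingle(mixed)
decoded = fixed.decode("utf8")
print(decoded)
```

Without the detwingle() call, decoding `mixed` as UTF-8 would fail on the
smart-quote bytes.
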
= 4.0.5 (20120427) =

* Added a new method, wrap(), which wraps an element in a tag.

* Renamed replace_with_children() to unwrap(), which is easier to
  understand and also the jQuery name of the function.

* Made encoding substitution in <meta> tags completely transparent (no
  more %SOUP-ENCODING%).

* Fixed a bug in decoding data that contained a byte-order mark, such
  as data encoded in UTF-16LE. [bug=988980]

* Fixed a bug that made the HTMLParser treebuilder generate XML
  definitions ending with two question marks instead of
  one. [bug=984258]

* Upon document generation, CData objects are no longer run through
  the formatter. [bug=988905]

* The test suite now passes when lxml is not installed, whether or not
  html5lib is installed. [bug=987004]

* Print a warning on HTMLParseErrors to let people know they should
  install a better parser library.

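A small sketch of wrap() and unwrap() as described above, assuming a current
Beautiful Soup 4 and the built-in "html.parser". The sentence is just sample
text.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# wrap() puts the string inside a freshly created <b> tag.
soup.p.string.wrap(soup.new_tag("b"))
wrapped = str(soup)

# unwrap() removes the <b> tag again but keeps its contents in place.
soup.b.unwrap()
unwrapped = str(soup)

print(wrapped)
print(unwrapped)
```
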
= 4.0.4 (20120416) =

* Fixed a bug that sometimes created disconnected trees.

* Fixed a bug with the string setter that moved a string around the
  tree instead of copying it. [bug=983050]

* Attribute values are now run through the provided output formatter.
  Previously they were always run through the 'minimal' formatter. In
  the future I may make it possible to specify different formatters
  for attribute values and strings, but for now, consistent behavior
  is better than inconsistent behavior. [bug=980237]

* Added the missing renderContents method from Beautiful Soup 3. Also
  added an encode_contents() method to go along with decode_contents().

* Give a more useful error when the user tries to run the Python 2
  version of BS under Python 3.

* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
  UnicodeDammit(markup, smart_quotes_to="ascii").

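Here is roughly what the smart_quotes_to="ascii" option looks like in use.
The byte string is an invented Windows-1252 sample, and passing
["windows-1252"] as a known encoding is an extra assumption added so the
example doesn't depend on encoding detection.

```python
from bs4.dammit import UnicodeDammit

# \x93, \x94 and \x92 are Windows-1252 smart quotes.
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
plain = dammit.unicode_markup
print(plain)
```
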
= 4.0.3 (20120403) =

* Fixed a typo that caused some versions of Python 3 to convert the
  Beautiful Soup codebase incorrectly.

* Got rid of the 4.0.2 workaround for HTML documents--it was
  unnecessary and the workaround was triggering a (possibly different,
  but related) bug in lxml. [bug=972466]

= 4.0.2 (20120326) =

* Worked around a possible bug in lxml that prevents non-tiny XML
  documents from being parsed. [bug=963880, bug=963936]

* Fixed a bug where specifying `text` while also searching for a tag
  only worked if `text` wanted an exact string match. [bug=955942]

= 4.0.1 (20120314) =

* This is the first official release of Beautiful Soup 4. There is no
  4.0.0 release, to eliminate any possibility that packaging software
  might treat "4.0.0" as being an earlier version than "4.0.0b10".

* Brought BS up to date with the latest release of soupselect, adding
  CSS selector support for direct descendant matches and multiple CSS
  class matches.

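The two selector features named above (direct descendants and multiple CSS
classes) can be tried with select(). This sketch assumes a current
Beautiful Soup 4, where select() is backed by the soupsieve package rather
than the original soupselect port. The markup is invented.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="main">'
    '<p class="note warning">direct child</p>'
    '<section><p class="note">nested</p></section>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# "#main > p" matches only <p> tags that are direct children of #main.
direct = [p.get_text() for p in soup.select("#main > p")]

# "p.note.warning" matches only <p> tags carrying both classes.
both_classes = [p.get_text() for p in soup.select("p.note.warning")]

print(direct, both_classes)
```
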
= 4.0.0b10 (20120302) =

* Added support for simple CSS selectors, taken from the soupselect project.

* Fixed a crash when using html5lib. [bug=943246]

* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
  attribute is now replaced with the appropriate encoding on
  output. [bug=942714]

* Fixed a bug that caused calling a tag to sometimes call find_all()
  with the wrong arguments. [bug=944426]

* For backwards compatibility, brought back the BeautifulStoneSoup
  class as a deprecated wrapper around BeautifulSoup.

= 4.0.0b9 (20120228) =

* Fixed the string representation of DOCTYPEs that have both a public
  ID and a system ID.

* Fixed the generated XML declaration.

* Renamed Tag.nsprefix to Tag.prefix, for consistency with
  NamespacedAttribute.

* Fixed a test failure that occurred on Python 3.x when chardet was
  installed.

* Made prettify() return Unicode by default, so it will look nice on
  Python 3 when passed into print().

= 4.0.0b8 (20120224) =

* All tree builders now preserve namespace information in the
  documents they parse. If you use the html5lib parser or lxml's XML
  parser, you can access the namespace URL for a tag as tag.namespace.

  However, there is no special support for namespace-oriented
  searching or tree manipulation. When you search the tree, you need
  to use namespace prefixes exactly as they're used in the original
  document.

* The string representation of a DOCTYPE always ends in a newline.

* Issue a warning if the user tries to use a SoupStrainer in
  conjunction with the html5lib tree builder, which doesn't support
  them.

= 4.0.0b7 (20120223) =

* Upon decoding to string, any characters that can't be represented in
  your chosen encoding will be converted into numeric XML entity
  references.

* Issue a warning if characters were replaced with REPLACEMENT
  CHARACTER during Unicode conversion.

* Restored compatibility with Python 2.6.

* The install process no longer installs docs or auxiliary text files.

* It's now possible to deepcopy a BeautifulSoup object created with
  Python's built-in HTML parser.

* About 100 unit tests that "test" the behavior of various parsers on
  invalid markup have been removed. Legitimate changes to those
  parsers caused these tests to fail, indicating that perhaps
  Beautiful Soup should not test the behavior of foreign
  libraries.

  The problematic unit tests have been reformulated as informational
  comparisons generated by the script
  scripts/demonstrate_parser_differences.py.

  This makes Beautiful Soup compatible with html5lib version 0.95 and
  future versions of HTMLParser.

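To see the numeric entity fallback from the first item above, one can ask
for an encoding that can't hold a character in the document. This is a
sketch against a current Beautiful Soup 4; the snowman is just a convenient
non-ASCII character.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>\N{SNOWMAN}</p>", "html.parser")

# ASCII can't represent U+2603, so it comes out as a numeric
# XML entity reference instead of raising an error.
ascii_bytes = soup.encode("ascii")
print(ascii_bytes)
```
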
= 4.0.0b6 (20120216) =

* Multi-valued attributes like "class" always have a list of values,
  even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
  empty-element tag. This may come back if I add a special XHTML mode
  (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
  useless.

* Passing text along with tag-specific arguments to a find* method:

   find("a", text="Click here")

  will find tags that contain the given text as their
  .string. Previously, the tag-specific arguments were ignored and
  only strings were searched.

* Fixed a bug that caused the html5lib tree builder to build a
  partially disconnected tree. Generally cleaned up the html5lib tree
  builder.

* If you restrict a multi-valued attribute like "class" to a string
  that contains spaces, Beautiful Soup will only consider it a match
  if the values correspond to that specific string.

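A short sketch of the "always a list" rule for multi-valued attributes,
assuming a current Beautiful Soup 4 with its default HTML settings. The tag
and attribute values are invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body" id="intro">text</p>', "html.parser")

# "class" is multi-valued, so even a single class comes back as a list.
classes = soup.p["class"]

# "id" is not multi-valued, so it stays a plain string.
identifier = soup.p["id"]

print(classes, identifier)
```
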
= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
  belonging to multiple CSS classes is treated as having a list of
  values for the 'class' attribute. Searching for a CSS class will
  match *any* of the CSS classes.

  This actually affects all attributes that the HTML standard defines
  as taking multiple values (class, rel, rev, archive, accept-charset,
  and headers), but 'class' is by far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
  to one of the find* methods, it'll assume you want to use that
  object to search against a tag's CSS classes. Previously this only
  worked if you passed in a string.

* Fixed a bug that caused a crash when you passed a dictionary as an
  attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
  like <meta charset="utf-8" />. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
  page, it will try each of its guesses again, with errors="replace"
  instead of errors="strict". This may mean that some data gets
  replaced with REPLACEMENT CHARACTER, but at least most of it will
  get turned into Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
  on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
  empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
  begin with.

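The CSS class matching described in the first two items can be sketched
like this, assuming a current Beautiful Soup 4. The class names are made
up. A non-dictionary second argument is treated as a CSS class, and a tag
matches if any one of its classes matches.

```python
from bs4 import BeautifulSoup

html = '<p class="body strikeout">old</p><p class="body">new</p>'
soup = BeautifulSoup(html, "html.parser")

# The second positional argument is a string, so it is used as a CSS class.
struck = [p.get_text() for p in soup.find_all("p", "strikeout")]

# The first <p> has two classes; matching either one is enough.
bodies = [p.get_text() for p in soup.find_all("p", "body")]

print(struck, bodies)
```
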
= 4.0.0b4 (20120208) =

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()

* BeautifulSoup.new_tag() will follow the rules of whatever
  tree-builder was used to create the original BeautifulSoup object. A
  new <p> tag will look like "<p />" if the soup object was created to
  parse XML, but it will look like "<p></p>" if the soup object was
  created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
  improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
  strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
  more general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
  entities unless the user prevents it.

* Added PageElement.insert_before() and PageElement.insert_after(),
  which let you put an element into the parse tree with respect to
  some other element.

* Raise an exception when the user tries to do something nonsensical
  like insert a tag into itself.

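A sketch combining new_tag(), new_string(), insert_before() and
insert_after() from the list above. It assumes a current Beautiful Soup 4
and the built-in "html.parser"; the text is sample content only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>stop</b>", "html.parser")

# Build a new <i> tag using the same rules as the soup's tree builder.
tag = soup.new_tag("i")
tag.string = "Don't"

# Place the new tag immediately before the existing string "stop"...
soup.b.string.insert_before(tag)

# ...and a new string immediately after the new tag.
tag.insert_after(soup.new_string(" ever "))

result = str(soup.b)
print(result)
```
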
= 4.0.0b3 (20120203) =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

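In current Beautiful Soup 4 releases the parser is chosen by naming it in
the second argument to the constructor. The sketch below uses only
"html.parser" so that it runs without extra packages. Swapping in "lxml" or
"html5lib" should work the same way once those libraries are installed, but
that part is an assumption about your environment. The markup is invented.

```python
from bs4 import BeautifulSoup

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup("<p>Unclosed paragraph", "html.parser")

# The tree is well-formed even though the input was not.
repaired = str(soup)
print(repaired)
```
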
For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

    >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

    <?xml version="1.0" encoding="utf-8"?>
    <markup>
     ...
    </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.

= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.

= 3.1.0 =

A hybrid version that supports Python 2.4 and can be automatically
converted to run under Python 3.0. There are three
backwards-incompatible changes you should be aware of, but no new
features or deliberate behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a byte string and decode() to get Unicode,
and you'll be ready (well, readier) for Python 3.

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

  <a href="foo</a>, </a><a href="bar">baz</a>
  <a b="<a>">', '<a b="<a>"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

 <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.


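The encode()/decode() advice in item 1 still applies. The sketch below
shows it with a current Beautiful Soup 4 on Python 3, which is an
assumption: the 3.1.0 API itself differed in other ways. The markup is
invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")

# encode() always gives bytes in the requested encoding.
as_bytes = soup.encode("utf-8")

# decode() always gives a Unicode string.
as_text = soup.decode()

print(as_bytes, as_text)
```
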
= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.


= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.


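The difference between extract() and decompose() can be sketched as
follows. This uses a current Beautiful Soup 4 rather than the 3.0.6 API,
and the markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><i>keep</i><b>drop</b><u>move</u></div>", "html.parser"
)

# extract() removes the tag from the tree and hands it back for reuse.
moved = soup.u.extract()

# decompose() removes the tag and destroys it; nothing useful is returned.
soup.b.decompose()

remaining = str(soup)
print(remaining, str(moved))
```
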
1241 |
= 3.0.5 =
|
|
1242 |
||
1243 |
Soup objects can now be pickled, and copied with copy.deepcopy.
|
|
1244 |
||
1245 |
Tag.append now works properly on existing BS objects. (It wasn't |
|
1246 |
originally intended for outside use, but it can be now.) (Giles |
|
1247 |
Radford) |
|
1248 |
||
1249 |
Passing in a nonexistent encoding will no longer crash the parser on |
|
1250 |
Python 2.4 (John Nagle). |
|
1251 |
||
1252 |
Fixed an underlying bug in SGMLParser that thinks ASCII has 255 |
|
1253 |
characters instead of 127 (John Nagle). |
|
1254 |
||
1255 |
Entities are converted more consistently to Unicode characters. |
|
1256 |
||
1257 |
Entity references in attribute values are now converted to Unicode |
|
1258 |
characters when appropriate. Numeric entities are always converted, |
|
1259 |
because SGMLParser always converts them outside of attribute values. |
|
1260 |
||
1261 |
ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to |
|
1262 |
XHTML_ENTITIES. |
|
1263 |
||
1264 |
The regular expression for bare ampersands was too loose. In some |
|
1265 |
cases ampersands were not being escaped. (Sam Ruby?) |
|
1266 |
||
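To illustrate what a tighter bare-ampersand pattern might look like, here is a sketch using Python's re module. The pattern and the function name are assumptions made for this example; the regular expression the library actually used is not shown in this changelog.

```python
import re

# Hypothetical pattern: an ampersand that does NOT begin a named,
# decimal, or hexadecimal entity reference.
BARE_AMPERSAND = re.compile(r"&(?!#\d+;|#[xX][0-9a-fA-F]+;|\w+;)")


def escape_bare_ampersands(text):
    """Escape only the ampersands that are not already part of an entity."""
    return BARE_AMPERSAND.sub("&amp;", text)


escaped = escape_bare_ampersands("AT&T &amp; &#38; &#x26; Q&A")
```

Existing entities such as &amp; and &#38; are left alone, while the ampersands in "AT&T" and "Q&A" are escaped.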
1267 |
Non-breaking spaces and other special Unicode space characters are no |
|
1268 |
longer folded to ASCII spaces. (Robert Leftwich) |
|
1269 |
||
1270 |
Information inside a TEXTAREA tag is now parsed literally, not as HTML |
|
1271 |
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang) |
|
1272 |
||
1273 |
= 3.0.4 = |
|
1274 |
||
1275 |
Fixed a bug that crashed Unicode conversion in some cases. |
|
1276 |
||
1277 |
Fixed a bug that prevented UnicodeDammit from being used as a |
|
1278 |
general-purpose data scrubber. |
|
1279 |
||
1280 |
Fixed some unit test failures when running against Python 2.5. |
|
1281 |
||
1282 |
When considering whether to convert smart quotes, UnicodeDammit now |
|
1283 |
looks at the original encoding in a case-insensitive way. |
|
134
by Leonard Richardson
Moved the historical changelog into NEWS. |
1284 |
|
1285 |
= 3.0.3 (20060606) = |
|
1286 |
||
1287 |
Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be |
|
1288 |
sure to pass in an appropriate value for convertEntities, or XML/HTML |
|
1289 |
entities might stick around that aren't valid in HTML/XML). The result |
|
1290 |
may not validate, but it should be good enough to not choke a
|
|
1291 |
real-world XML parser. Specifically, the output of a properly
|
|
1292 |
constructed soup object should always be valid as part of an XML
|
|
1293 |
document, but parts may be missing if they were missing in the
|
|
1294 |
original. As always, if the input is valid XML, the output will also
|
|
1295 |
be valid.
|
|
1296 |
||
1297 |
= 3.0.2 (20060602) =
|
|
1298 |
||
1299 |
Previously, Beautiful Soup correctly handled attribute values that
|
|
1300 |
contained embedded quotes (sometimes by escaping), but not other kinds
|
|
1301 |
of XML character. Now, it correctly handles or escapes all special XML
|
|
1302 |
characters in attribute values.
|
|
1303 |
||
1304 |
I aliased methods to the 2.x names (fetch, find, findText, etc.) for
|
|
1305 |
backwards compatibility purposes. Those names are deprecated and if I
|
|
1306 |
ever do a 4.0 I will remove them. I will, I tell you!
|
|
1307 |
||
1308 |
Fixed a bug where the findAll method wasn't passing along any keyword |
|
1309 |
arguments. |
|
1310 |
||
1311 |
When run from the command line, Beautiful Soup now acts as an HTML |
|
1312 |
pretty-printer, not an XML pretty-printer. |
|
1313 |
||
1314 |
= 3.0.1 (20060530) = |
|
1315 |
||
1316 |
Reintroduced the "fetch by CSS class" shortcut. I thought keyword |
|
1317 |
arguments would replace it, but they don't. You can't call soup('a', |
|
1318 |
class='foo') because class is a Python keyword. |
|
1319 |
||
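The keyword problem can be checked without Beautiful Soup at all. The sketch below only compiles two call expressions and never runs them, so the name soup does not need to exist; is_valid_call is a hypothetical helper written for this example.

```python
def is_valid_call(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# `class` is a reserved word, so it cannot be used as a keyword argument.
keyword_form = is_valid_call("soup('a', class='foo')")

# A plain positional string contains no reserved word, so it compiles.
shortcut_form = is_valid_call("soup('a', 'foo')")
```

Here keyword_form is False and shortcut_form is True, which is why a positional shortcut for the CSS class is still useful.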
1320 |
If Beautiful Soup encounters a meta tag that declares the encoding, |
|
1321 |
but a SoupStrainer tells it not to parse that tag, Beautiful Soup will |
|
1322 |
no longer try to rewrite the meta tag to mention the new |
|
1323 |
encoding. Basically, this makes SoupStrainers work in real-world |
|
1324 |
applications instead of crashing the parser. |
|
1325 |
||
1326 |
= 3.0.0 "Who would not give all else for two p" (20060528) = |
|
1327 |
||
1328 |
This release is not backward-compatible with previous releases. If |
|
1329 |
you've got code written with a previous version of the library, go |
|
1330 |
ahead and keep using it, unless one of the features mentioned here
|
|
1331 |
really makes your life easier. Since the library is self-contained,
|
|
1332 |
you can include an old copy of the library in your old applications,
|
|
1333 |
and use the new version for everything else.
|
|
1334 |
||
1335 |
The documentation has been rewritten and greatly expanded with many
|
|
1336 |
more examples.
|
|
1337 |
||
1338 |
Beautiful Soup autodetects the encoding of a document (or uses the one
|
|
1339 |
you specify), and converts it from its native encoding to
|
|
1340 |
Unicode. Internally, it only deals with Unicode strings. When you
|
|
1341 |
print out the document, it converts to UTF-8 (or another encoding you
|
|
1342 |
specify). [Doc reference]
|
|
1343 |
||
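The decode-then-encode flow described above can be sketched with nothing but the standard library. This is an illustration under simple assumptions (a caller-supplied list of candidate encodings, tried in order), not the detection logic Beautiful Soup uses; to_unicode is a hypothetical name.

```python
def to_unicode(data, candidate_encodings):
    """Decode bytes with the first candidate encoding that works.

    Returns (text, encoding_used). Unknown codec names are skipped
    instead of being allowed to raise.
    """
    for encoding in candidate_encodings:
        try:
            return data.decode(encoding), encoding
        except (LookupError, UnicodeDecodeError):
            continue
    # Last resort: decode as UTF-8, replacing undecodable bytes.
    return data.decode("utf-8", "replace"), "utf-8"


raw = b"caf\xe9"  # "café" encoded as Latin-1
text, used = to_unicode(raw, ["no-such-encoding", "utf-8", "latin-1"])

# Internally everything is a Unicode string; encode only on output.
output = text.encode("utf-8")
```

The nonexistent codec name and the failed UTF-8 attempt are both skipped, so the Latin-1 guess is the one that succeeds.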
1344 |
It's now easy to make large-scale changes to the parse tree without |
|
1345 |
screwing up the navigation members. The methods are extract, |
|
1346 |
replaceWith, and insert. [Doc reference. See also Improving Memory |
|
1347 |
Usage with extract] |
|
1348 |
||
1349 |
Passing True in as an attribute value gives you tags that have any |
|
1350 |
value for that attribute. You don't have to create a regular |
|
1351 |
expression. Passing None for an attribute value gives you tags that
|
|
1352 |
don't have that attribute at all. |
|
1353 |
||
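The True/None rule above can be sketched as a small predicate. This assumes a tag's attributes are available as a plain dictionary; attribute_matches is a hypothetical helper, not part of the library.

```python
def attribute_matches(attrs, name, wanted):
    """Check one attribute of a tag against a search value.

    attrs  -- dict of the tag's attributes
    wanted -- True (any value), None (attribute absent), or a string
    """
    actual = attrs.get(name)
    if wanted is True:
        return actual is not None
    if wanted is None:
        return actual is None
    return actual == wanted


link = {"href": "http://example.com/"}
anchor = {"name": "top"}
```

With these two sample dictionaries, passing True for href selects link only, and passing None for href selects anchor only.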
1354 |
Tag objects now know whether or not they're self-closing. This avoids |
|
1355 |
the problem where Beautiful Soup thought that tags like <BR /> were
|
|
1356 |
self-closing even in XML documents. You can customize the self-closing
|
|
1357 |
tags for a parser object by passing them in as a list of
|
|
1358 |
selfClosingTags: you don't have to subclass anymore. |
|
1359 |
||
1360 |
There's a new built-in parser, MinimalSoup, which has most of |
|
1361 |
BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc |
|
1362 |
reference] |
|
1363 |
||
1364 |
You can use a SoupStrainer to tell Beautiful Soup to parse only part |
|
1365 |
of a document. This saves time and memory, often making Beautiful Soup |
|
1366 |
about as fast as a custom-built SGMLParser subclass. [Doc reference, |
|
1367 |
SoupStrainer reference] |
|
1368 |
||
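As a rough analogy for parsing only part of a document, the sketch below uses the standard library's html.parser rather than SoupStrainer itself. LinkOnlyParser is a hypothetical class that ignores everything except <a> start tags, so nothing is built for the rest of the markup.

```python
from html.parser import HTMLParser


class LinkOnlyParser(HTMLParser):
    """Hypothetical parser that records only the attributes of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Everything that is not an <a> tag is skipped.
        if tag == "a":
            self.links.append(dict(attrs))


parser = LinkOnlyParser()
parser.feed('<p>See <a href="/one">one</a> and <b>bold</b> <a href="/two">two</a></p>')
parser.close()
```

Only the two links are kept; the <p> and <b> tags never produce any objects, which is where the time and memory savings come from.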
1369 |
You can (usually) use keyword arguments instead of passing a |
|
1370 |
dictionary of attributes to a search method. That is, you can replace |
|
1371 |
soup(args={"id" : "5"}) with soup(id="5"). You can still use args if |
|
1372 |
(for instance) you need to find an attribute whose name clashes with |
|
1373 |
the name of an argument to findAll. [Doc reference: **kwargs attrs] |
|
1374 |
||
1375 |
The method names have changed to the better method names used in |
|
1376 |
Rubyful Soup. Instead of find methods and fetch methods, there are |
|
1377 |
only find methods. Instead of a scheme where you can't remember which |
|
1378 |
method finds one element and which one finds them all, we have find
|
|
1379 |
and findAll. In general, if the method name mentions All or a plural
|
|
1380 |
noun (e.g. findNextSiblings), then the method finds many
|
|
1381 |
elements. Otherwise, it only finds one element. [Doc reference]
|
|
1382 |
||
1383 |
Some of the argument names have been renamed for clarity. For instance
|
|
1384 |
avoidParserProblems is now parserMassage.
|
|
1385 |
||
1386 |
Beautiful Soup no longer implements a feed method. You need to pass a
|
|
1387 |
string or a filehandle into the soup constructor, not to feed after
|
|
1388 |
the soup has been created. There is still a feed method, but it's the |
|
1389 |
feed method implemented by SGMLParser and calling it will bypass |
|
1390 |
Beautiful Soup and cause problems. |
|
1391 |
||
1392 |
The NavigableText class has been renamed to NavigableString. There is |
|
1393 |
no NavigableUnicodeString anymore, because every string inside a |
|
1394 |
Beautiful Soup parse tree is a Unicode string. |
|
1395 |
||
1396 |
findText and fetchText are gone. Just pass a text argument into find |
|
1397 |
or findAll. |
|
1398 |
||
1399 |
Null was more trouble than it was worth, so I got rid of it. Anything |
|
1400 |
that used to return Null now returns None. |
|
1401 |
||
1402 |
Special XML constructs like comments and CDATA now have their own |
|
1403 |
NavigableString subclasses, instead of being treated as oddly-formed |
|
1404 |
data. If you parse a document that contains CDATA and write it back |
|
1405 |
out, the CDATA will still be there. |
|
1406 |
||
1407 |
When you're parsing a document, you can get Beautiful Soup to convert |
|
1408 |
XML or HTML entities into the corresponding Unicode characters. [Doc
|
|
1409 |
reference]
|
|
1410 |
||
1411 |
= 2.1.1 (20050918) =
|
|
1412 |
||
1413 |
Fixed a serious performance bug in BeautifulStoneSoup which was
|
|
1414 |
causing parsing to be incredibly slow.
|
|
1415 |
||
1416 |
Corrected several entities that were previously being incorrectly
|
|
1417 |
translated from Microsoft smart-quote-like characters.
|
|
1418 |
||
1419 |
Fixed a bug that was breaking text fetch.
|
|
1420 |
||
1421 |
Fixed a bug that crashed the parser when text chunks that look like
|
|
1422 |
HTML tag names showed up within a SCRIPT tag.
|
|
1423 |
||
1424 |
THEAD, TBODY, and TFOOT tags are now nestable within TABLE
|
|
1425 |
tags. Nested tables should parse more sensibly now.
|
|
1426 |
||
1427 |
BASE is now considered a self-closing tag.
|
|
1428 |
||
1429 |
= 2.1.0 "Game, or any other dish?" (20050504) =
|
|
1430 |
||
1431 |
Added a wide variety of new search methods which, given a starting
|
|
1432 |
point inside the tree, follow a particular navigation member (like
|
|
1433 |
nextSibling) over and over again, looking for Tag and NavigableText
|
|
1434 |
objects that match certain criteria. The new methods are findNext,
|
|
1435 |
fetchNext, findPrevious, fetchPrevious, findNextSibling,
|
|
1436 |
fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
|
|
1437 |
findParent, and fetchParents. All of these use the same basic code
|
|
1438 |
used by first and fetch, so you can pass your weird ways of matching
|
|
1439 |
things into these methods.
|
|
1440 |
||
1441 |
The fetch method and its derivatives now accept a limit argument.
|
|
1442 |
||
1443 |
You can now pass keyword arguments when calling a Tag object as though
|
|
1444 |
it were a method.
|
|
1445 |
||
1446 |
Fixed a bug that caused all hand-created tags to share a single set of
|
|
1447 |
attributes.
|
|
1448 |
||
1449 |
= 2.0.3 (20050501) =
|
|
1450 |
||
1451 |
Fixed Python 2.2 support for iterators.
|
|
1452 |
||
1453 |
Fixed a bug that gave the wrong representation to tags within quote
|
|
1454 |
tags like <script>.
|
|
1455 |
||
1456 |
Took some code from Mark Pilgrim that treats CDATA declarations as
|
|
1457 |
data instead of ignoring them.
|
|
1458 |
||
1459 |
Beautiful Soup's setup.py will now do an install even if the unit |
|
1460 |
tests fail. It won't build a source distribution if the unit tests |
|
1461 |
fail, so I can't release a new version unless they pass. |
|
1462 |
||
1463 |
= 2.0.2 (20050416) = |
|
1464 |
||
1465 |
Added the unit tests in a separate module, and packaged it with |
|
1466 |
distutils. |
|
1467 |
||
1468 |
Fixed a bug that sometimes caused renderContents() to return a Unicode |
|
1469 |
string even if there was no Unicode in the original string. |
|
1470 |
||
1471 |
Added the done() method, which closes all of the parser's open |
|
1472 |
tags. It gets called automatically when you pass in some text to the
|
|
1473 |
constructor of a parser class; otherwise you must call it yourself.
|
|
1474 |
||
1475 |
Reinstated some backwards compatibility with 1.x versions: referencing
|
|
1476 |
the string member of a NavigableText object returns the NavigableText
|
|
1477 |
object instead of throwing an error.
|
|
1478 |
||
1479 |
= 2.0.1 (20050412) =
|
|
1480 |
||
1481 |
Fixed a bug that caused bad results when you tried to reference a tag
|
|
1482 |
name shorter than 3 characters as a member of a Tag, eg. tag.table.td.
|
|
1483 |
||
1484 |
Made sure all Tags have the 'hidden' attribute so that an attempt to |
|
1485 |
access tag.hidden doesn't spawn an attempt to find a tag named |
|
1486 |
'hidden'. |
|
1487 |
||
1488 |
Fixed a bug in the comparison operator. |
|
1489 |
||
1490 |
= 2.0.0 "Who cares for fish?" (20050410) = |
|
1491 |
||
1492 |
Beautiful Soup version 1 was very useful but also pretty stupid. I |
|
1493 |
originally wrote it without noticing any of the problems inherent in |
|
1494 |
trying to build a parse tree out of ambiguous HTML tags. This version |
|
1495 |
solves all of those problems to my satisfaction. It also adds many new |
|
1496 |
clever things to make up for the removal of the stupid things. |
|
1497 |
||
1498 |
== Parsing == |
|
1499 |
||
1500 |
The parser logic has been greatly improved, and the BeautifulSoup |
|
1501 |
class should much more reliably yield a parse tree that looks like |
|
1502 |
what the page author intended. For a particular class of odd edge |
|
1503 |
cases that now causes problems, there is a new class, |
|
1504 |
ICantBelieveItsBeautifulSoup. |
|
1505 |
||
1506 |
By default, Beautiful Soup now performs some cleanup operations on |
|
1507 |
text before parsing it. This is to avoid common problems with bad |
|
1508 |
definitions and self-closing tags that crash SGMLParser. You can |
|
1509 |
provide your own set of cleanup operations, or turn it off |
|
1510 |
altogether. The cleanup operations include fixing self-closing tags |
|
1511 |
that don't close, and replacing Microsoft smart quotes and similar |
|
1512 |
characters with their HTML entity equivalents.
|
|
1513 |
||
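A sketch of the smart-quote cleanup step, assuming the input really is Windows-1252 bytes. The table below is a hypothetical four-entry subset chosen for illustration; the library's own table is not reproduced in this changelog.

```python
# Hypothetical subset of Windows-1252 "smart" punctuation bytes and the
# HTML entities that name the same characters.
SMART_QUOTES = {
    b"\x91": b"&lsquo;",
    b"\x92": b"&rsquo;",
    b"\x93": b"&ldquo;",
    b"\x94": b"&rdquo;",
}


def replace_smart_quotes(data):
    """Replace Windows-1252 curly quotes in a byte string with entities."""
    for raw, entity in SMART_QUOTES.items():
        data = data.replace(raw, entity)
    return data


cleaned = replace_smart_quotes(b"\x93Hi\x94 it\x92s")
```

A byte-level replacement like this is only safe before decoding and only for single-byte encodings; in UTF-8 the same byte values can appear inside multi-byte characters.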
1514 |
You can now get a pretty-print version of parsed HTML to get a visual
|
|
1515 |
picture of how Beautiful Soup parses it, with the Tag.prettify()
|
|
1516 |
method.
|
|
1517 |
||
1518 |
== Strings and Unicode ==
|
|
1519 |
||
1520 |
There are separate NavigableText subclasses for ASCII and Unicode
|
|
1521 |
strings. These classes directly subclass the corresponding base data
|
|
1522 |
types. This means you can treat NavigableText objects as strings
|
|
1523 |
instead of having to call methods on them to get the strings.
|
|
1524 |
||
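Subclassing the string type directly is what makes this possible. Here is a minimal sketch in modern Python, where TextNode and its parent field are hypothetical names rather than the library's classes.

```python
class TextNode(str):
    """Hypothetical string subclass that also remembers where it lives."""

    def __new__(cls, value, parent=None):
        # str is immutable, so the value must be set in __new__.
        obj = super().__new__(cls, value)
        obj.parent = parent
        return obj


text = TextNode("Hello, world", parent="placeholder for a tag")
shouted = text.upper()
```

Because TextNode is a str, every string method and comparison works on it directly, while the extra attribute still records its place in the tree.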
1525 |
str() on a Tag always returns a string, and unicode() always returns
|
|
1526 |
Unicode. Previously it was inconsistent.
|
|
1527 |
||
1528 |
== Tree traversal ==
|
|
1529 |
||
1530 |
In a first() or fetch() call, the tag name or the desired value of an
|
|
1531 |
attribute can now be any of the following:
|
|
1532 |
||
1533 |
* A string (matches that specific tag or that specific attribute value)
|
|
1534 |
* A list of strings (matches any tag or attribute value in the list)
|
|
1535 |
* A compiled regular expression object (matches any tag or attribute
|
|
1536 |
value that matches the regular expression)
|
|
1537 |
* A callable object that takes the Tag object or attribute value as a
|
|
1538 |
string. It returns None/false/empty string if the given string
|
|
1539 |
doesn't match, and any other value if it does. |
|
1540 |
||
1541 |
This is much easier to use than SQL-style wildcards (see, regular |
|
1542 |
expressions are good for something). Because of this, I took out |
|
1543 |
SQL-style wildcards. I'll put them back if someone complains, but |
|
1544 |
their removal simplifies the code a lot.
|
|
1545 |
||
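The four kinds of criteria listed above can be sketched as one dispatch function. This is a simplified illustration that compares plain strings only (the real methods also hand Tag objects to callables); matches is a hypothetical name.

```python
import re


def matches(criterion, value):
    """Return True if the string `value` satisfies `criterion`."""
    if isinstance(criterion, str):
        return value == criterion
    if isinstance(criterion, (list, tuple)):
        return value in criterion
    if hasattr(criterion, "search"):  # compiled regular expression
        return criterion.search(value) is not None
    if callable(criterion):
        # Any false value (None, False, "") means "no match".
        return bool(criterion(value))
    raise TypeError("unsupported criterion: %r" % (criterion,))


heading = re.compile(r"^h[1-6]$")
```

A single function like this is what lets every search method accept the same range of arguments for tag names and attribute values.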
1546 |
You can use fetch() and first() to search for text in the parse tree,
|
|
1547 |
not just tags. There are new alias methods fetchText() and firstText()
|
|
1548 |
designed for this purpose. As with searching for tags, you can pass in
|
|
1549 |
a string, a regular expression object, or a method to match your text.
|
|
1550 |
||
1551 |
If you pass in something besides a map to the attrs argument of
|
|
1552 |
fetch() or first(), Beautiful Soup will assume you want to match that
|
|
1553 |
thing against the "class" attribute. When you're scraping |
|
1554 |
well-structured HTML, this makes your code a lot cleaner. |
|
1555 |
||
1556 |
1.x and 2.x both let you call a Tag object as a shorthand for |
|
1557 |
fetch(). For instance, foo("bar") is a shorthand for |
|
1558 |
foo.fetch("bar"). In 2.x, you can also access a specially-named member |
|
1559 |
of a Tag object as a shorthand for first(). For instance, foo.barTag |
|
1560 |
is a shorthand for foo.first("bar"). By chaining these shortcuts you |
|
1561 |
traverse a tree in very little code: for header in |
|
1562 |
soup.bodyTag.pTag.tableTag('th'): |
|
1563 |
||
1564 |
If an element relationship (like parent or next) doesn't apply to a |
|
1565 |
tag, it'll now show up as Null instead of None. first() will also return |
|
1566 |
Null if you ask it for a nonexistent tag. Null is an object that's |
|
1567 |
just like None, except you can do whatever you want to it and it'll |
|
1568 |
give you Null instead of throwing an error. |
|
1569 |
||
1570 |
This lets you do tree traversals like soup.htmlTag.headTag.titleTag |
|
1571 |
without having to worry if the intermediate stages are actually |
|
1572 |
there. Previously, if there was no 'head' tag in the document, headTag |
|
1573 |
in that instance would have been None, and accessing its 'titleTag' |
|
1574 |
member would have thrown an AttributeError. Now, you can get what you |
|
1575 |
want when it exists, and get Null when it doesn't, without having to |
|
1576 |
do a lot of conditionals checking to see if every stage is None.
|
|
1577 |
||
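The behavior described above is the null object pattern. A minimal sketch in modern Python, assuming only what the text says (Null is false like None, and anything you do to it gives you Null back); NullType is a hypothetical class, not the library's implementation.

```python
class NullType:
    """Hypothetical null object: falsy, and every attribute access or
    call on it yields the same null object again."""

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, which is all of them.
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __bool__(self):
        return False

    def __repr__(self):
        return "Null"


Null = NullType()

# Chained lookups never raise, even though nothing is there.
result = Null.htmlTag.headTag.titleTag
```

The convenience has a cost: a misspelled member name also quietly returns Null instead of raising an error.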
1578 |
There are two new relations between page elements: previousSibling and
|
|
1579 |
nextSibling. They reference the previous and next element at the same
|
|
1580 |
level of the parse tree. For instance, if you have HTML like this:
|
|
1581 |
||
1582 |
<p><ul><li>Foo<br /><li>Bar</ul>
|
|
1583 |
||
1584 |
The first 'li' tag has a previousSibling of Null and its nextSibling |
|
1585 |
is the second 'li' tag. The second 'li' tag has a nextSibling of Null |
|
1586 |
and its previousSibling is the first 'li' tag. The previousSibling of |
|
1587 |
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the |
|
1588 |
'br' tag. |
|
1589 |
||
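The sibling links for the two 'li' tags can be sketched with a tiny hypothetical Element class. This sketch uses None at the ends of the chain where 2.0 would have used Null, and it models only the 'ul' and its two children.

```python
class Element:
    """Hypothetical tree element that tracks its neighbors."""

    def __init__(self, name):
        self.name = name
        self.contents = []
        self.previousSibling = None
        self.nextSibling = None

    def append(self, child):
        # Link the new child to whatever was previously the last child.
        if self.contents:
            last = self.contents[-1]
            last.nextSibling = child
            child.previousSibling = last
        self.contents.append(child)


ul = Element("ul")
first_li = Element("li")
second_li = Element("li")
ul.append(first_li)
ul.append(second_li)
```

Each 'li' now points at the other, and the outer ends of the chain are empty.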
1590 |
I took out the ability to use fetch() to find tags that have a
|
|
1591 |
specific list of contents. See, I can't even explain it well. It was |
|
1592 |
really difficult to use, I never used it, and I don't think anyone |
|
1593 |
else ever used it. To the extent anyone did, they can probably use
|
|
1594 |
fetchText() instead. If it turns out someone needs it I'll think of |
|
1595 |
another solution. |
|
1596 |
||
1597 |
== Tree manipulation == |
|
1598 |
||
1599 |
You can add new attributes to a tag, and delete attributes from a |
|
1600 |
tag. In 1.x you could only change a tag's existing attributes. |
|
1601 |
||
1602 |
== Porting Considerations ==
|
|
1603 |
||
1604 |
There are three changes in 2.0 that break old code:
|
|
1605 |
||
1606 |
In the post-1.2 release you could pass in a function into fetch(). The
|
|
1607 |
function took a string, the tag name. In 2.0, the function takes the
|
|
1608 |
actual Tag object.
|
|
1609 |
||
1610 |
It's no longer possible to pass in SQL-style wildcards to fetch(). Use a |
|
1611 |
regular expression instead. |
|
1612 |
||
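For code being ported, an old wildcard string can be translated mechanically. The sketch below assumes that % was the only wildcard and that it matched any run of characters; the exact 1.x wildcard syntax is not spelled out here, and like_to_regex is a hypothetical helper.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE-style pattern, where % matches any run of
    characters, into an anchored compiled regular expression."""
    # Escape the literal pieces so characters like "." stay literal.
    parts = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$", re.DOTALL)


ends_with_foo = like_to_regex("%foo")
starts_with_a_dot_b = like_to_regex("a.b%")
```

The resulting compiled expression can be passed wherever a regular expression object is accepted as a search criterion.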
1613 |
The different parsing algorithm means the parse tree may not be shaped |
|
1614 |
like you expect. This will only actually affect you if your code uses |
|
1615 |
one of the affected parts. I haven't run into this problem yet while |
|
1616 |
porting my code.
|
|
1617 |
||
1618 |
= Between 1.2 and 2.0 =
|
|
1619 |
||
1620 |
This is the release to get if you want Python 1.5 compatibility.
|
|
1621 |
||
1622 |
The desired value of an attribute can now be any of the following:
|
|
1623 |
||
1624 |
* A string
|
|
1625 |
* A string with SQL-style wildcards
|
|
1626 |
* A compiled RE object
|
|
1627 |
* A callable that returns None/false/empty string if the given value
|
|
1628 |
doesn't match, and any other value otherwise. |
|
1629 |
||
1630 |
This is much easier to use than SQL-style wildcards (see, regular |
|
1631 |
expressions are good for something). Because of this, I no longer |
|
1632 |
recommend you use SQL-style wildcards. They may go away in a future |
|
1633 |
release to clean up the code. |
|
1634 |
||
1635 |
Made Beautiful Soup handle processing instructions as text instead of |
|
1636 |
ignoring them. |
|
1637 |
||
1638 |
Applied patch from Richie Hindle (richie at entrian dot com) that |
|
1639 |
makes tag.string a shorthand for tag.contents[0].string when the tag |
|
1640 |
has only one string-owning child. |
|
1641 |
||
1642 |
Added still more nestable tags. The nestable tags thing won't work in |
|
1643 |
a lot of cases and needs to be rethought.
|
|
1644 |
||
1645 |
Fixed an edge case where searching for "%foo" would match any string
|
|
1646 |
shorter than "foo".
|
|
1647 |
||
1648 |
= 1.2 "Who for such dainties would not stoop?" (20040708) =
|
|
1649 |
||
1650 |
Applied patch from Ben Last (ben at benlast dot com) that made
|
|
1651 |
Tag.renderContents() correctly handle Unicode.
|
|
1652 |
||
1653 |
Made BeautifulStoneSoup even dumber by making it not implicitly close
|
|
1654 |
a tag when another tag of the same type is encountered; only when an
|
|
1655 |
actual closing tag is encountered. This change courtesy of Fuzzy (mike
|
|
1656 |
at pcblokes dot com). BeautifulSoup still works as before.
|
|
1657 |
||
1658 |
= 1.1 "Swimming in a hot tureen" =
|
|
1659 |
||
1660 |
Added more 'nestable' tags. Changed popping semantics so that when a |
|
1661 |
nestable tag is encountered, tags are popped up to the previously
|
|
1662 |
encountered nestable tag (of whatever kind). I will revert this if
|
|
1663 |
enough people complain, but it should make more people's lives easier |
|
1664 |
than harder. This enhancement was suggested by Anthony Baxter (anthony |
|
1665 |
at interlink dot com dot au). |
|
1666 |
||
1667 |
= 1.0 "So rich and green" (20040420) = |
|
1668 |
||
1669 |
Initial release. |