by Leonard Richardson

Beautiful Soup's official support for Python 2 ended on December 31st,
2020. The final release to support Python 2 was Beautiful Soup
4.9.3. In the Launchpad Bazaar repository, the final revision to support
Python 2 was revision 605.

= 4.11.0 (Unreleased)

* Ported unit tests to use pytest.

* Added special string classes, RubyParenthesisString and RubyTextString,
  to make it possible to treat ruby text specially in get_text() calls.
  [bug=1941980]

* It's now possible to customize the way output is indented by
  providing a value for the 'indent' argument to the Formatter
  constructor. The 'indent' argument works very similarly to the
  argument of the same name in the Python standard library's
  json.dump() function. [bug=1955497]

* If the charset-normalizer Python module
  (https://pypi.org/project/charset-normalizer/) is installed, Beautiful
  Soup will use it to detect the character sets of incoming documents.
  This is also the module used by newer versions of the Requests library.
  For the sake of backwards compatibility, chardet and cchardet both take
  precedence if installed. [bug=1955346]

* Added a workaround for an lxml bug
  (https://bugs.launchpad.net/lxml/+bug/1948551) that causes
  problems when parsing a Unicode string beginning with BYTE ORDER MARK.
  [bug=1947768]

* Issue a warning when an HTML parser is used to parse a document that
  looks like XML but not XHTML. [bug=1939121]

* Do a better job of keeping track of namespaces as an XML document is
  parsed, so that CSS selectors that use namespaces will do the right
  thing more often. [bug=1946243]

* Some time ago, the misleadingly named "text" argument to find-type
  methods was renamed to the more accurate "string." But this supposed
  "renaming" didn't make it into important places like the method
  signatures or the docstrings. That's corrected in this
  version. "text" still works, but will give a DeprecationWarning.
  [bug=1947038]
|
* Fixed a crash when pickling a BeautifulSoup object that has no
  tree builder. [bug=1934003]

* Fixed a crash when overriding multi_valued_attributes and using the
  html5lib parser. [bug=1948488]

* Removed support for the iconv_codec library, which doesn't seem
  to exist anymore and was never put up on PyPI. (The closest
  replacement on PyPI, iconv_codecs, is GPL-licensed, so we can't use
  it--it's also quite old.)

= 4.10.0 (20210907)

* This is the first release of Beautiful Soup to only support Python
  3. I dropped Python 2 support to maintain support for newer versions
  (58 and up) of setuptools. See:
  https://github.com/pypa/setuptools/issues/2769 [bug=1942919]

* The behavior of methods like .get_text() and .strings now differs
  depending on the type of tag. The change is visible with HTML tags
  like <script>, <style>, and <template>. Starting in 4.9.0, methods
  like get_text() returned no results on such tags, because the
  contents of those tags are not considered 'text' within the document
  as a whole.

  But a user who calls script.get_text() is working from a different
  definition of 'text' than a user who calls div.get_text()--otherwise
  there would be no need to call script.get_text() at all. In 4.10.0,
  the contents of (e.g.) a <script> tag are considered 'text' during a
  get_text() call on the tag itself, but not considered 'text' during
  a get_text() call on the tag's parent.

  Because of this change, calling get_text() on each child of a tag
  may now return a different result than calling get_text() on the tag
  itself. That's because different tags now have different
  understandings of what counts as 'text'. [bug=1906226] [bug=1868861]
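
The difference described above can be sketched like this, assuming Beautiful Soup 4.10.0 or later with the html.parser tree builder. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

markup = "<div><p>Hello</p><script>var x = 1;</script></div>"
soup = BeautifulSoup(markup, "html.parser")

# Called on the parent, the script body is not counted as text.
div_text = soup.div.get_text()

# Called on the <script> tag itself, the script body is the text.
script_text = soup.script.get_text()

print(repr(div_text), repr(script_text))
```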
|
* NavigableString and its subclasses now implement the get_text()
  method, as well as the properties .strings and
  .stripped_strings. These methods will either return the string
  itself, or nothing, so the only reason to use this is when iterating
  over a list of mixed Tag and NavigableString objects. [bug=1904309]
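
A small sketch of the mixed-iteration case, assuming Beautiful Soup 4.10.0 or later. The children of the <p> tag below are a string, a Tag, and another string, and get_text() can be called on each without a type check.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one <b>two</b> three</p>", "html.parser")

# soup.p.children yields NavigableString, Tag, NavigableString.
pieces = [child.get_text() for child in soup.p.children]
print(pieces)
```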
|
* The 'html5' formatter now treats attributes whose values are the
  empty string as HTML boolean attributes. Previously (and in other
  formatters), an attribute value must be set as None to be treated as
  a boolean attribute. In a future release, I plan to also give this
  behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424]
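
A minimal sketch of the boolean-attribute behavior, assuming Beautiful Soup 4.10.0 or later. The <option> markup is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<option selected="">Tea</option>', "html.parser")

# With the 'html5' formatter, an empty-string value is written
# as a bare (boolean) attribute.
html5_output = soup.option.decode(formatter="html5")
print(html5_output)
```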
|
* The 'replace_with()' method now takes a variable number of arguments,
  and can be used to replace a single element with a sequence of elements.
  Patch by Bill Chandos. [rev=605]
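
A short sketch of replacing one element with a sequence, assuming Beautiful Soup 4.10.0 or later. The markup and the replacement tag are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one <b>two</b> three</p>", "html.parser")

replacement_tag = soup.new_tag("i")
replacement_tag.string = "2"

# Swap the <b> tag for a string followed by a new <i> tag.
soup.b.replace_with("number ", replacement_tag)
result = str(soup)
print(result)
```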
|
* Corrected output when the namespace prefix associated with a
  namespaced attribute is the empty string, as opposed to
  None. [bug=1915583]

* Performance improvement when processing tags that speeds up overall
  tree construction by 2%. Patch by Morotti. [bug=1899358]

* Corrected the use of special string container classes in cases when a
  single tag may contain strings with different containers; such as
  the <template> tag, which may contain both TemplateString objects
  and Comment objects. [bug=1913406]

* The html.parser tree builder can now handle named entities
  found in the HTML5 spec in much the same way that the html5lib
  tree builder does. Note that the lxml HTML tree builder doesn't handle
  named entities this way. [bug=1924908]

* Added a second way to specify encodings to UnicodeDammit and
  EncodingDetector, based on the order of precedence defined in the
  HTML5 spec, starting at:
  https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding

  Encodings in 'known_definite_encodings' are tried first, then
  byte-order-mark sniffing is run, then encodings in 'user_encodings'
  are tried. The old argument, 'override_encodings', is now a
  deprecated alias for 'known_definite_encodings'.

  This changes the default behavior of the html.parser and lxml tree
  builders, in a way that may slightly improve encoding
  detection but will probably have no effect. [bug=1889014]
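
A minimal sketch of 'known_definite_encodings', assuming Beautiful Soup 4.10.0 or later. The sample bytes are invented for the example.

```python
from bs4.dammit import UnicodeDammit

# "Sacré bleu!" encoded as Latin-1; the byte 0xE9 is not valid UTF-8.
data = "Sacr\xe9 bleu!".encode("latin-1")

# Encodings listed here are tried first, before byte-order-mark sniffing.
dammit = UnicodeDammit(data, known_definite_encodings=["latin-1"])
print(dammit.unicode_markup, dammit.original_encoding)
```

Encodings you are less sure of would go in 'user_encodings' instead, so that a byte order mark in the data can still win.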
|
* Improve the warning issued when a directory name (as opposed to
  the name of a regular file) is passed as markup into the BeautifulSoup
  constructor. [bug=1913628]

= 4.9.3 (20201003)

* Implemented a significant performance optimization to the process of
  searching the parse tree. Patch by Morotti. [bug=1898212]

= 4.9.2 (20200926)

* Fixed a bug that caused too many tags to be popped from the tag
  stack during tree building, when encountering a closing tag that had
  no matching opening tag. [bug=1880420]

* Fixed a bug that inconsistently moved elements over when passing
  a Tag, rather than a list, into Tag.extend(). [bug=1885710]

* Specify the soupsieve dependency in a way that complies with
  PEP 508. Patch by Mike Nerone. [bug=1893696]

* Change the signatures for BeautifulSoup.insert_before and insert_after
  (which are not implemented) to match PageElement.insert_before and
  insert_after, quieting warnings in some IDEs. [bug=1897120]

= 4.9.1 (20200517)

* Added a keyword argument 'on_duplicate_attribute' to the
  BeautifulSoupHTMLParser constructor (used by the html.parser tree
  builder) which lets you customize the handling of markup that
  contains the same attribute more than once, as in:
  <a href="url1" href="url2"> [bug=1878209]
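
A short sketch of 'on_duplicate_attribute', assuming Beautiful Soup 4.9.1 or later. The value "ignore" (keep the first occurrence) comes from the Beautiful Soup documentation rather than from this changelog, and the argument is passed through the BeautifulSoup constructor to the html.parser tree builder.

```python
from bs4 import BeautifulSoup

markup = '<a href="url1" href="url2">link</a>'

# Default behavior: the last occurrence of the attribute wins.
default_soup = BeautifulSoup(markup, "html.parser")

# "ignore" keeps the first occurrence and drops later duplicates.
first_wins = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute="ignore"
)

print(default_soup.a["href"], first_wins.a["href"])
```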

* Added a distinct subclass, GuessedAtParserWarning, for the warning
  issued when BeautifulSoup is instantiated without a parser being
  specified. [bug=1873787]

* Added a distinct subclass, MarkupResemblesLocatorWarning, for the
  warning issued when BeautifulSoup is instantiated with 'markup' that
  actually seems to be a URL or the path to a file on
  disk. [bug=1873787]
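
The point of the distinct subclasses is that a caller can filter one of these warnings without silencing everything else. A rough sketch, assuming Beautiful Soup 4.9.1 or later and the standard library's warnings module; the markup is invented for the example.

```python
import warnings

from bs4 import (
    BeautifulSoup,
    GuessedAtParserWarning,
    MarkupResemblesLocatorWarning,
)

# Without a filter, omitting the parser name produces a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    BeautifulSoup("<p>no parser named</p>")
guessed = [w for w in caught if issubclass(w.category, GuessedAtParserWarning)]

# With a category filter, only that one warning is suppressed.
with warnings.catch_warnings(record=True) as quiet:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=GuessedAtParserWarning)
    BeautifulSoup("<p>no parser named</p>")
still_guessed = [
    w for w in quiet if issubclass(w.category, GuessedAtParserWarning)
]

print(len(guessed), len(still_guessed))
```

MarkupResemblesLocatorWarning can be filtered the same way by passing it as the category.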
|
* The new NavigableString subclasses (Stylesheet, Script, and
  TemplateString) can now be imported directly from the bs4 package.

* If you encode a document with a Python-specific encoding like
  'unicode_escape', that encoding is no longer mentioned in the final
  XML or HTML document. Instead, encoding information is omitted or
  left blank. [bug=1874955]

* Fixed test failures when run against soupselect 2.0. Patch by Tomáš
  Chvátal. [bug=1872279]

= 4.9.0 (20200405)

* Added PageElement.decomposed, a new property which lets you
  check whether you've already called decompose() on a Tag or
  NavigableString.
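
A minimal sketch of the 'decomposed' property, assuming Beautiful Soup 4.9.0 or later. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>keep <b>drop</b></p>", "html.parser")
bold = soup.b

before = bold.decomposed   # nothing has been destroyed yet
bold.decompose()           # remove the tag from the tree and destroy it
after = bold.decomposed

print(before, after)
```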
|
* Embedded CSS and Javascript is now stored in distinct Stylesheet and
  Script tags, which are ignored by methods like get_text() since most
  people don't consider this sort of content to be 'text'. This
  feature is not supported by the html5lib treebuilder. [bug=1868861]

* Added a Russian translation by 'authoress' to the repository.

* Fixed an unhandled exception when formatting a Tag that had been
  decomposed. [bug=1857767]

* Fixed a bug that happened when passing a Unicode filename containing
  non-ASCII characters as markup into Beautiful Soup, on a system that
  allows Unicode filenames. [bug=1866717]

* Added a performance optimization to PageElement.extract(). Patch by
  Arthur Darcet.

= 4.8.2 (20191224)

* Added Python docstrings to all public methods of the most commonly
  used classes.

* Added a Chinese translation by Deron Wang and a Brazilian Portuguese
  translation by Cezar Peixeiro to the repository.

* Fixed two deprecation warnings. Patches by Colin
  Watson and Nicholas Neumann. [bug=1847592] [bug=1855301]

* The html.parser tree builder now correctly handles DOCTYPEs that are
  not uppercase. [bug=1848401]

* PageElement.select() now returns a ResultSet rather than a regular
  list, making it consistent with methods like find_all().

= 4.8.1 (20191006)

* When the html.parser or html5lib parsers are in use, Beautiful Soup
  will, by default, record the position in the original document where
  each tag was encountered. This includes line number (Tag.sourceline)
  and position within a line (Tag.sourcepos). Based on code by Chris
  Mayo. [bug=1742921]
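
A short sketch of the position tracking, assuming Beautiful Soup 4.8.1 or later with html.parser. With that parser, line numbers start at 1 and the position is the 0-based offset of the tag's opening '<' within its line. html5lib may report positions differently.

```python
from bs4 import BeautifulSoup

markup = "<p>first</p>\n<p>second</p>"
soup = BeautifulSoup(markup, "html.parser")

# Collect (line, offset) for every <p> tag in document order.
positions = [(p.sourceline, p.sourcepos) for p in soup.find_all("p")]
print(positions)
```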

* When instantiating a BeautifulSoup object, it's now possible to
  provide a dictionary ('element_classes') of the classes you'd like to be
  instantiated instead of Tag, NavigableString, etc.
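
A rough sketch of 'element_classes', assuming Beautiful Soup 4.8.1 or later. LoudString and its shout() method are hypothetical names made up for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


class LoudString(NavigableString):
    """Hypothetical NavigableString subclass with one extra method."""

    def shout(self):
        return self.upper()


# Map the stock class to the class that should be used in its place.
soup = BeautifulSoup(
    "<p>hello</p>",
    "html.parser",
    element_classes={NavigableString: LoudString},
)

text_node = soup.p.string
print(type(text_node).__name__, text_node.shout())
```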
|
* Fixed the definition of the default XML namespace when using
  lxml 4.4. Patch by Isaac Muse. [bug=1840141]

* Fixed a crash when pretty-printing tags that were not created
  during initial parsing. [bug=1838903]

* Copying a Tag preserves information that was originally obtained from
  the TreeBuilder used to build the original Tag. [bug=1838903]

* Raise an explanatory exception when the underlying parser
  completely rejects the incoming markup. [bug=1838877]

* Avoid a crash when trying to detect the declared encoding of a
  Unicode document. [bug=1838877]

* Avoid a crash when unpickling certain parse trees generated
  using html5lib on Python 3. [bug=1843545]

= 4.8.0 (20190720, "One Small Soup")

This release focuses on making it easier to customize Beautiful Soup's
input mechanism (the TreeBuilder) and output mechanism (the Formatter).

* You can customize the TreeBuilder object by passing keyword
  arguments into the BeautifulSoup constructor. Those keyword
  arguments will be passed along into the TreeBuilder constructor.

  The main reason to do this right now is to change which
  attributes are treated as multi-valued attributes (the way 'class'
  is treated by default). You can do this with the
  'multi_valued_attributes' argument. [bug=1832978]
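
A minimal sketch of 'multi_valued_attributes', assuming Beautiful Soup 4.8.0 or later. Passing None turns the multi-valued treatment off entirely; the markup is invented for the example.

```python
from bs4 import BeautifulSoup

markup = '<p class="intro lead">x</p>'

# Default: 'class' is split on whitespace into a list of values.
default_soup = BeautifulSoup(markup, "html.parser")

# With None, no attribute is treated as multi-valued.
single_valued = BeautifulSoup(
    markup, "html.parser", multi_valued_attributes=None
)

print(default_soup.p["class"], single_valued.p["class"])
```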
|
* The role of Formatter objects has been greatly expanded. The Formatter
  class now controls the following:

  - The function to call to perform entity substitution. (This was
    previously Formatter's only job.)
  - Which tags should be treated as containing CDATA and have their
    contents exempt from entity substitution.
  - The order in which a tag's attributes are output. [bug=1812422]
  - Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'

  All preexisting code should work as before.
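
A rough sketch of two of the controls listed above, assuming Beautiful Soup 4.8.0 or later. SourceOrderFormatter is a hypothetical name; it overrides attributes() to keep attributes in parse order, and 'void_element_close_prefix' is set to None so void elements are written without the '/'.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


class SourceOrderFormatter(HTMLFormatter):
    """Hypothetical formatter: emit attributes in the order they were parsed."""

    def attributes(self, tag):
        for key, value in tag.attrs.items():
            yield key, value


soup = BeautifulSoup('<p z="1" a="2">line<br/>break</p>', "html.parser")

formatter = SourceOrderFormatter(void_element_close_prefix=None)
output = soup.p.decode(formatter=formatter)
print(output)
```

No entity substitution function is passed here, so strings are written out as they are.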
|
* Added a new method to the API, Tag.smooth(), which consolidates
  multiple adjacent NavigableString elements. [bug=1697296]
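
A short sketch of Tag.smooth(), assuming Beautiful Soup 4.8.0 or later. Appending a string to a tag leaves two adjacent NavigableString objects, which smooth() merges into one.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>A one</p>", "html.parser")
soup.p.append(", a two")

before = len(soup.p.contents)   # two separate strings
soup.p.smooth()
after = len(soup.p.contents)    # merged into one string

print(before, after, soup.p.string)
```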

* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is always
  recognized as a named entity and converted to a single quote. [bug=1818721]

= 4.7.1 (20190106)

* Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]

* Fixed an incorrectly raised exception when inserting a tag before or
  after an identical tag. [bug=1810692]

* Beautiful Soup will no longer try to keep track of namespaces that
  are not defined with a prefix; this can confuse soupselect. [bug=1810680]

* Tried even harder to avoid the deprecation warning originally fixed in
  4.6.1. [bug=1778909]

= 4.7.0 (20181231)

* Beautiful Soup's CSS Selector implementation has been replaced by a
  dependency on Isaac Muse's SoupSieve project (the soupsieve package
  on PyPI). The good news is that SoupSieve has a much more robust and
  complete implementation of CSS selectors, resolving a large number
  of longstanding issues. The bad news is that from this point onward,
  SoupSieve must be installed if you want to use the select() method.

  You don't have to change anything if you installed Beautiful Soup
  through pip (SoupSieve will be automatically installed when you
  upgrade Beautiful Soup) or if you don't use CSS selectors from
  within Beautiful Soup.

  SoupSieve documentation: https://facelessuser.github.io/soupsieve/

* Added the PageElement.extend() method, which works like list.extend().
  [bug=1514970]

* PageElement.insert_before() and insert_after() now take a variable
  number of arguments. [bug=1514970]
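
A small sketch of extend() and the variable-argument insert methods, assuming Beautiful Soup 4.7.0 or later. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b></p>", "html.parser")

# extend() appends every item of the list to the tag, in order.
soup.b.extend([" and", " brave"])

# Several elements can be placed before or after a tag in one call.
soup.b.insert_before("one ", "two ")
soup.b.insert_after(" three", " four")

result = str(soup)
print(result)
```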
|
* Fix a number of problems with the tree builder that caused
  trees that were superficially okay, but which fell apart when bits
  were extracted. Patch by Isaac Muse. [bug=1782928,1809910]

* Fixed a problem with the tree builder in which elements that
  contained no content (such as empty comments and all-whitespace
  elements) were not being treated as part of the tree. Patch by Isaac
  Muse. [bug=1798699]

* Fixed a problem with multi-valued attributes where the value
  contained whitespace. Thanks to Jens Svalgaard for the
  fix. [bug=1787453]

* Clarified ambiguous license statements in the source code. Beautiful
  Soup is released under the MIT license, and has been since 4.4.0.

* This file has been renamed from NEWS.txt to CHANGELOG.

= 4.6.3 (20180812)

* Exactly the same as 4.6.2. Re-released to make the README file
  render properly on PyPI.

= 4.6.2 (20180812)

* Fix an exception when a custom formatter was asked to format a void
  element. [bug=1784408]

= 4.6.1 (20180728)

* Stop data loss when encountering an empty numeric entity, and
  possibly in other cases. Thanks to tos.kamiya for the fix. [bug=1698503]

* Preserve XML namespaces introduced inside an XML document, not just
  the ones introduced at the top level. [bug=1718787]

* Added a new formatter, "html5", which represents void elements
  as "<element>" rather than "<element/>". [bug=1716272]

* Fixed a problem where the html.parser tree builder interpreted
  a string like "&foo " as the character entity "&foo;" [bug=1728706]

* Correctly handle invalid HTML numeric character entities like &#147;
  which reference code points that are not Unicode code points. Note
  that this is only fixed when Beautiful Soup is used with the
  html.parser parser -- html5lib already worked and I couldn't fix it
  with lxml. [bug=1782933]

* Improved the warning given when no parser is specified. [bug=1780571]

* When markup contains duplicate elements, a select() call that
  includes multiple match clauses will match all relevant
  elements. [bug=1770596]

* Fixed code that was causing deprecation warnings in recent Python 3
  versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496]

* Fixed a Windows crash in diagnose() when checking whether a long
  markup string is a filename. [bug=1737121]

* Stopped HTMLParser from raising an exception in very rare cases of
  bad markup. [bug=1708831]

* Fixed a bug where find_all() was not working when asked to find a
  tag with a namespaced name in an XML document that was parsed as
  HTML. [bug=1723783]

* You can get finer control over formatting by subclassing
  bs4.element.Formatter and passing a Formatter instance into (e.g.)
  encode(). [bug=1716272]

* You can pass a dictionary of `attrs` into
  BeautifulSoup.new_tag. This makes it possible to create a tag with
  an attribute like 'name' that would otherwise be masked by another
  argument of new_tag. [bug=1779276]
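
A minimal sketch of the `attrs` dictionary, assuming Beautiful Soup 4.6.1 or later. The first positional argument of new_tag is the tag's own name, so an HTML attribute called 'name' has to go through `attrs`.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("", "html.parser")

# "input" is the tag name; the 'name' attribute travels in attrs.
field = soup.new_tag("input", attrs={"name": "email", "type": "text"})
print(field)
```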
|
* Clarified the deprecation warning when accessing tag.fooTag, to cover
  the possibility that you might really have been looking for a tag
  called 'fooTag'.

= 4.6.0 (20170507) =

* Added the `Tag.get_attribute_list` method, which acts like `Tag.get` for
  getting the value of an attribute, but which always returns a list,
  whether or not the attribute is a multi-value attribute. [bug=1678589]
|
442
by Leonard Richardson
It's now possible to use a tag's namespace prefix when searching, |
405 |
|
443
by Leonard Richardson
HTML parsers treat all HTML4 and HTML5 empty element tags (aka void element tags) correctly. [bug=1656909] |
406 |
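To illustrate the difference between `Tag.get` and `Tag.get_attribute_list` described above, here is a small sketch with invented markup, assuming the html.parser tree builder:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="intro" class="lead wide"></p>', "html.parser")
paragraph = soup.p

# 'id' is a single-valued attribute: get() returns a plain string,
# get_attribute_list() wraps it in a list.
id_value = paragraph.get("id")
id_values = paragraph.get_attribute_list("id")

# 'class' is multi-valued in HTML, so it is already stored as a list.
class_values = paragraph.get_attribute_list("class")

print(id_value, id_values, class_values)
```
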
* It's now possible to use a tag's namespace prefix when searching,
  e.g. soup.find('namespace:tag') [bug=1655332]

* Improved the handling of empty-element tags like <br> when using the
  html.parser parser. [bug=1676935]

* HTML parsers treat all HTML4 and HTML5 empty element tags (aka void
  element tags) correctly. [bug=1656909]

* Namespace prefix is preserved when an XML tag is copied. Thanks
  to Vikas for a patch and test. [bug=1685172]

= 4.5.3 (20170102) =

* Fixed foster parenting when html5lib is the tree builder. Thanks to
  Geoffrey Sneddon for a patch and test.

* Fixed yet another problem that caused the html5lib tree builder to
  create a disconnected parse tree. [bug=1629825]

= 4.5.2 (20170102) =

* Apart from the version number, this release is identical to
  4.5.3. Due to user error, it could not be completely uploaded to
  PyPI. Use 4.5.3 instead.

= 4.5.1 (20160802) =

* Fixed a crash when passing Unicode markup that contained a
  processing instruction into the lxml HTML parser on Python
  3. [bug=1608048]

= 4.5.0 (20160719) =

* Beautiful Soup is no longer compatible with Python 2.6. This
  actually happened a few releases ago, but it's now official.

* Beautiful Soup will now work with versions of html5lib greater than
  0.99999999. [bug=1603299]

* If a search against each individual value of a multi-valued
  attribute fails, the search will be run one final time against the
  complete attribute value considered as a single string. That is, if
  a tag has class="foo bar" and neither "foo" nor "bar" matches, but
  "foo bar" does, the tag is now considered a match.

  This happened in previous versions, but only when the value being
  searched for was a string. Now it also works when that value is
  a regular expression, a list of strings, etc. [bug=1476868]

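The fallback described above can be sketched as follows. The markup and the regular expression are invented for illustration; the pattern is chosen so that it cannot match either class value on its own, only the joined string.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="foo bar">hi</p>', "html.parser")
paragraph = soup.p

# Matches one of the individual class values.
by_value = soup.find_all("p", class_="foo")

# Matches only the complete attribute value, as a plain string.
by_whole_string = soup.find_all("p", class_="foo bar")

# Same fallback, but with a regular expression that needs both words.
by_regex = soup.find_all("p", class_=re.compile(r"foo\s+bar"))

print(len(by_value), len(by_whole_string), len(by_regex))
```
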
* Fixed a bug that deranged the tree when a whitespace element was
  reparented into a tag that contained an identical whitespace
  element. [bug=1505351]

* Added support for CSS selector values that contain quoted spaces,
  such as tag[style="display: foo"]. [bug=1540588]

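A short sketch of a selector whose quoted attribute value contains a space, using invented markup:

```python
from bs4 import BeautifulSoup

markup = (
    '<p style="display: none">hidden</p>'
    '<p style="display: block">shown</p>'
)
soup = BeautifulSoup(markup, "html.parser")

# The quoted value contains a space, which the selector must preserve.
hidden = soup.select('p[style="display: none"]')
print([tag.get_text() for tag in hidden])
```
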
* Corrected handling of XML processing instructions. [bug=1504393]

* Corrected an encoding error that happened when a BeautifulSoup
  object was copied. [bug=1554439]

* The contents of <textarea> tags will no longer be modified when the
  tree is prettified. [bug=1555829]

* When a BeautifulSoup object is pickled but its tree builder cannot
  be pickled, its .builder attribute is set to None instead of being
  destroyed. This avoids a performance problem once the object is
  unpickled. [bug=1523629]

* Specify the file and line number when warning about a
  BeautifulSoup object being instantiated without a parser being
  specified. [bug=1574647]

* The `limit` argument to `select()` now works correctly, though it's
  not implemented very efficiently. [bug=1520530]

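For illustration, a minimal use of the `limit` argument to `select()` on an invented list:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser"
)

# Stop after the first two matches instead of collecting all three.
first_two = soup.select("li", limit=2)
print([item.get_text() for item in first_two])
```
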
* Fixed a Python 3 BytesWarning when a URL was passed in as though it
  were markup. Thanks to James Salter for a patch and
  test. [bug=1533762]

* We don't run the check for a filename passed in as markup if the
  'filename' contains a less-than character; the less-than character
  indicates it's most likely a very small document. [bug=1577864]

= 4.4.1 (20150928) =

* Fixed a bug that deranged the tree when part of it was
  removed. Thanks to Eric Weiser for the patch and John Wiseman for a
  test. [bug=1481520]

* Fixed a parse bug with the html5lib tree-builder. Thanks to Roel
  Kramer for the patch. [bug=1483781]

* Improved the implementation of CSS selector grouping. Thanks to
  Orangain for the patch. [bug=1484543]

* Fixed the test_detect_utf8 test so that it works when chardet is
  installed. [bug=1471359]

* Corrected the output of Declaration objects. [bug=1477847]

= 4.4.0 (20150703) =

Especially important changes:

* Added a warning when you instantiate a BeautifulSoup object without
  explicitly naming a parser. [bug=1398866]

* __repr__ now returns an ASCII bytestring in Python 2, and a Unicode
  string in Python 3, instead of a UTF8-encoded bytestring in both
  versions. In Python 3, __str__ now returns a Unicode string instead
  of a bytestring. [bug=1420131]

* The `text` argument to the find_* methods is now called `string`,
  which is more accurate. `text` still works, but `string` is the
  argument described in the documentation. `text` may eventually
  change its meaning, but not for a very long time. [bug=1366856]

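A brief sketch of searching with the `string` argument, using invented markup:

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p><p>Goodbye</p>", "html.parser")

# An exact match against the text of a string in the tree.
exact = soup.find_all(string="Hello")

# A regular expression is searched for within each string.
pattern = soup.find_all(string=re.compile("bye"))

print(exact, pattern)
```
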
* Changed the way soup objects work under copy.copy(). Copying a
  NavigableString or a Tag will give you a new object that's
  equal to the old one but not connected to the parse tree. Patch by
  Martijn Pieters. [bug=1307490]

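The copy behaviour above can be checked with a small sketch (invented markup, html.parser assumed):

```python
import copy

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")
original = soup.p
duplicate = copy.copy(original)

# The copy compares equal to the original but is a separate object
# with no parent, i.e. it is detached from the parse tree.
print(duplicate == original, duplicate is original, duplicate.parent)
```
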
* Started using a standard MIT license. [bug=1294662]

* Added a Chinese translation of the documentation by Delong .w.

New features:

* Introduced the select_one() method, which uses a CSS selector but
  only returns the first match, instead of a list of
  matches. [bug=1349367]

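A minimal comparison of select_one() and select(), using an invented list:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")

first = soup.select_one("li")       # a single Tag
every = soup.select("li")           # a list of every match
missing = soup.select_one("table")  # None when nothing matches

print(first.get_text(), len(every), missing)
```
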
* You can now create a Tag object without specifying a
  TreeBuilder. Patch by Martijn Pieters. [bug=1307471]

* You can now create a NavigableString or a subclass just by invoking
  the constructor. [bug=1294315]

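As a sketch of the two constructor entries above, the following builds a small fragment without parsing any markup. The tag name and text are invented.

```python
from bs4.element import Comment, NavigableString, Tag

# No TreeBuilder and no BeautifulSoup object are involved here.
bold = Tag(name="b")
bold.append(NavigableString("bold"))

# Subclasses of NavigableString can be constructed the same way.
note = Comment("a note")

print(bold)
```
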
* Added an `exclude_encodings` argument to UnicodeDammit and to the
  Beautiful Soup constructor, which lets you prohibit the detection of
  an encoding that you know is wrong. [bug=1469408]

* The select() method now supports selector grouping. Patch by
  Francisco Canas. [bug=1191917]

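Selector grouping, as mentioned in the last entry above, means separating alternatives with a comma. A small sketch with invented markup:

```python
from bs4 import BeautifulSoup

markup = "<h1>Title</h1><p class='note'>Note</p><p>Body</p>"
soup = BeautifulSoup(markup, "html.parser")

# One call matches either an <h1> or a <p class="note">.
grouped = soup.select("h1, p.note")
print(sorted(tag.name for tag in grouped))
```
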
Bug fixes:

* Fixed yet another problem that caused the html5lib tree builder to
  create a disconnected parse tree. [bug=1237763]

* Force object_was_parsed() to keep the tree intact even when an element
  from later in the document is moved into place. [bug=1430633]

* Fixed yet another bug that caused a disconnected tree when html5lib
  copied an element from one part of the tree to another. [bug=1270611]

* Fixed a bug where Element.extract() could create an infinite loop in
  the remaining tree.

* The select() method can now find tags whose names contain
  dashes. Patch by Francisco Canas. [bug=1276211]

* The select() method can now find tags with attributes whose names
  contain dashes. Patch by Marek Kapolka. [bug=1304007]

* Improved the lxml tree builder's handling of processing
  instructions. [bug=1294645]

* Restored the helpful syntax error that happens when you try to
  import the Python 2 edition of Beautiful Soup under Python
  3. [bug=1213387]

* In Python 3.4 and above, set the new convert_charrefs argument to
  the html.parser constructor to avoid a warning and future
  failures. Patch by Stefano Revera. [bug=1375721]

* The warning when you pass in a filename or URL as markup will now be
  displayed correctly even if the filename or URL is a Unicode
  string. [bug=1268888]

* If the initial <html> tag contains a CDATA list attribute such as
  'class', the html5lib tree builder will now turn its value into a
  list, as it would with any other tag. [bug=1296481]

* Fixed an import error in Python 3.5 caused by the removal of the
  HTMLParseError class. [bug=1420063]

* Improved docstring for encode_contents() and
  decode_contents(). [bug=1441543]

* Fixed a crash in Unicode, Dammit's encoding detector when the name
  of the encoding itself contained invalid bytes. [bug=1360913]

* Improved the exception raised when you call .unwrap() or
  .replace_with() on an element that's not attached to a tree.

* Raise a NotImplementedError whenever an unsupported CSS pseudoclass
  is used in select(). Previously some cases did not result in a
  NotImplementedError.

* It's now possible to pickle a BeautifulSoup object no matter which
  tree builder was used to create it. However, the only tree builder
  that survives the pickling process is the HTMLParserTreeBuilder
  ('html.parser'). If you unpickle a BeautifulSoup object created with
  some other tree builder, soup.builder will be None. [bug=1231545]

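A minimal round trip through pickle, as described in the entry above. The sketch uses html.parser and invented markup; it checks only that the restored tree serializes to the same markup.

```python
import pickle

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# Serialize the whole tree and load it back into a new object.
restored = pickle.loads(pickle.dumps(soup))

print(restored.decode() == soup.decode())
```
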
= 4.3.2 (20131002) =

* Fixed a bug in which short Unicode input was improperly encoded to
  ASCII when checking whether or not it was the name of a file on
  disk. [bug=1227016]

* Fixed a crash when a short input contains data not valid in
  filenames. [bug=1232604]

* Fixed a bug that caused Unicode data put into UnicodeDammit to
  return None instead of the original data. [bug=1214983]

* Combined two tests to stop a spurious test failure when tests are
  run by nosetests. [bug=1212445]

= 4.3.1 (20130815) =

* Fixed yet another problem with the html5lib tree builder, caused by
  html5lib's tendency to rearrange the tree during
  parsing. [bug=1189267]

* Fixed a bug that caused the optimized version of find_all() to
  return nothing. [bug=1212655]

= 4.3.0 (20130812) =

* Instead of converting incoming data to Unicode and feeding it to the
  lxml tree builder in chunks, Beautiful Soup now makes successive
  guesses at the encoding of the incoming data, and tells lxml to
  parse the data as that encoding. Giving lxml more control over the
  parsing process improves performance and avoids a number of bugs and
  issues with the lxml parser which had previously required elaborate
  workarounds:

  - An issue in which lxml refuses to parse Unicode strings on some
    systems. [bug=1180527]

  - A returning bug that truncated documents longer than a (very
    small) size. [bug=963880]

  - A returning bug in which extra spaces were added to a document if
    the document defined a charset other than UTF-8. [bug=972466]

  This required a major overhaul of the tree builder architecture. If
  you wrote your own tree builder and didn't tell me, you'll need to
  modify your prepare_markup() method.

* The UnicodeDammit code that makes guesses at encodings has been
  split into its own class, EncodingDetector. A lot of apparently
  redundant code has been removed from Unicode, Dammit, and some
  undocumented features have also been removed.

* Beautiful Soup will issue a warning if instead of markup you pass it
  a URL or the name of a file on disk (a common beginner's mistake).

* A number of optimizations improve the performance of the lxml tree
  builder by about 33%, the html.parser tree builder by about 20%, and
  the html5lib tree builder by about 15%.

* All find_all calls should now return a ResultSet object. Patch by
  Aaron DeVore. [bug=1194034]

= 4.2.1 (20130531) =

* The default XML formatter will now replace ampersands even if they
  appear to be part of entities. That is, "&lt;" will become
  "&amp;lt;". The old code was left over from Beautiful Soup 3, which
  didn't always turn entities into Unicode characters.

  If you really want the old behavior (maybe because you add new
  strings to the tree, those strings include entities, and you want
  the formatter to leave them alone on output), it can be found in
  EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]

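The difference between the two substitution functions can be seen by calling them directly. This is a sketch on an invented string; it calls EntitySubstitution rather than going through a formatter.

```python
from bs4.dammit import EntitySubstitution

text = "AT&T &lt;rocks&gt;"

# Escapes every ampersand, including those that start an entity.
strict = EntitySubstitution.substitute_xml(text)

# Escapes only bare ampersands and leaves existing entities alone.
lenient = EntitySubstitution.substitute_xml_containing_entities(text)

print(strict)
print(lenient)
```
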
* Gave new_string() the ability to create subclasses of
  NavigableString. [bug=1181986]

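For example, new_string() can be asked for a Comment, which is a NavigableString subclass. The markup and comment text below are invented.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p></p>", "html.parser")

# The second argument names the NavigableString subclass to create.
note = soup.new_string("remember this", Comment)
soup.p.append(note)

print(soup.p)
```
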
* Fixed another bug by which the html5lib tree builder could create a
  disconnected tree. [bug=1182089]

* The .previous_element of a BeautifulSoup object is now always None,
  not the last element to be parsed. [bug=1182089]

* Fixed test failures when lxml is not installed. [bug=1181589]

* html5lib now supports Python 3. Fixed some Python 2-specific
  code in the html5lib test suite. [bug=1181624]

* The html.parser treebuilder can now handle numeric attributes in
  text when the hexadecimal name of the attribute starts with a
  capital X. Patch by Tim Shirley. [bug=1186242]

= 4.2.0 (20130514) =

* The Tag.select() method now supports a much wider variety of CSS
  selectors.

  - Added support for the adjacent sibling combinator (+) and the
    general sibling combinator (~). Tests by "liquider". [bug=1082144]

  - The combinators (>, +, and ~) can now combine with any supported
    selector, not just one that selects based on tag name.

  - Added limited support for the "nth-of-type" pseudo-class. Code
    by Sven Slootweg. [bug=1109952]

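The selectors listed above can be tried on a small invented document. The ids exist only to make the results easy to read.

```python
from bs4 import BeautifulSoup

markup = "<div><h1 id='t'>T</h1><p id='a'>A</p><p id='b'>B</p></div>"
soup = BeautifulSoup(markup, "html.parser")


def ids(selector):
    """Return the id of every tag matched by the selector."""
    return [tag["id"] for tag in soup.select(selector)]


adjacent = ids("h1 + p")                 # the <p> right after the <h1>
general = ids("h1 ~ p")                  # every later <p> sibling
second = ids("div > p:nth-of-type(2)")   # the second <p> child
by_id = ids("#t + #a")                   # combinator between id selectors

print(adjacent, general, second, by_id)
```
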
* The BeautifulSoup class is now aliased to "_s" and "_soup", making
  it quicker to type the import statement in an interactive session:

  from bs4 import _s
   or
  from bs4 import _soup

  The alias may change in the future, so don't use this in code you're
  going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
  functions for reporting problems and doing tech support.

  - diagnose(data) tries the given markup on every installed parser,
    reporting exceptions and displaying successes. If a parser is not
    installed, diagnose() mentions this fact.

  - lxml_trace(data, html=True) runs the given markup through lxml's
    XML parser or HTML parser, and prints out the parser events as
    they happen. This helps you quickly determine whether a given
    problem occurs in lxml code or Beautiful Soup code.

  - htmlparser_trace(data) is the same thing, but for Python's
    built-in HTMLParser class.

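A hedged sketch of calling diagnose() from the submodule described above. The function prints its report, so the example captures standard output; the exact wording of the report is not guaranteed and depends on which parsers are installed.

```python
import contextlib
import io

from bs4.diagnose import diagnose

# diagnose() prints its report, so capture stdout to inspect it.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    diagnose("<p>Unclosed paragraph")

report = buffer.getvalue()
print(report.splitlines()[0])
```
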
* In an HTML document, the contents of a <script> or <style> tag will
  no longer undergo entity substitution by default. XML documents work
  the same way they did before. [bug=1085953]

* Methods like get_text() and properties like .strings now only give
  you strings that are visible in the document--no comments or
  processing commands. [bug=1050164]

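Both entries above can be observed with a short sketch. The markup is invented and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Comments are not part of the visible text.
soup = BeautifulSoup("<p>Visible<!-- hidden --> text</p>", "html.parser")
visible = soup.p.get_text()
pieces = list(soup.p.strings)

# Script contents are written back out without entity substitution.
script_markup = "<script>if (a < b && c) { run(); }</script>"
script_soup = BeautifulSoup(script_markup, "html.parser")
rendered = str(script_soup)

print(visible)
print(rendered)
```
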
* The prettify() method now leaves the contents of <pre> tags
  alone. [bug=1095654]

* Fix a bug in the html5lib treebuilder which sometimes created
  disconnected trees. [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
  an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
  being passed on. [bug=1126734]

* Stop a crash when unwisely messing with a tag that's been
  decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
  corresponding problem on the Beautiful Soup end that was previously
  invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
  anything. Code by Stefaan Lippens. [bug=1168167]

= 4.1.3 (20120820) =

* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
  test failure caused by the lousy HTMLParser in those
  versions. [bug=1038503]

* Raise a more specific error (FeatureNotFound) when a requested
  parser or parser feature is not installed. Raise NotImplementedError
  instead of ValueError when the user calls insert_before() or
  insert_after() on the BeautifulSoup object itself. Patch by Aaron
  DeVore. [bug=1038301]

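A sketch of the two exceptions named in the last entry above. The parser name "no-such-parser" is deliberately made up so that no tree builder matches it.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Asking for a parser that is not installed raises FeatureNotFound.
try:
    BeautifulSoup("<p>Hello</p>", "no-such-parser")
    parser_error = None
except FeatureNotFound as error:
    parser_error = error

# The BeautifulSoup object itself has no siblings to insert next to.
soup = BeautifulSoup("<p>Hello</p>", "html.parser")
try:
    soup.insert_before(soup.new_tag("b"))
    insert_error = None
except NotImplementedError as error:
    insert_error = error

print(type(parser_error).__name__, type(insert_error).__name__)
```
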
= 4.1.2 (20120817) =

* As per PEP-8, allow searching by CSS class using the 'class_'
  keyword argument. [bug=1037624]

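Because `class` is a reserved word in Python, the keyword carries a trailing underscore. A minimal sketch with invented markup:

```python
from bs4 import BeautifulSoup

markup = '<p class="note">first</p><p>second</p><p class="note">third</p>'
soup = BeautifulSoup(markup, "html.parser")

# class_ searches the CSS class without clashing with the keyword.
notes = soup.find_all("p", class_="note")
print([tag.get_text() for tag in notes])
```
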
* Display namespace prefixes for namespaced attribute names, instead of |
250
by Leonard Richardson
Use namespace prefixes for namespaced attribute names, instead of |
794 |
the fully-qualified names given by the lxml parser. [bug=1037597] |
795 |
||
255
by Leonard Richardson
Fixed a crash on encoding when an attribute name contained |
796 |
* Fixed a crash on encoding when an attribute name contained |
797 |
non-ASCII characters. |
|
798 |
||
251
by Leonard Richardson
As per PEP-8, allow searching by CSS class using the 'class_' |
799 |
* When sniffing encodings, if the cchardet library is installed, |
258
by Leonard Richardson
Skipped a test under Python 2.6 to avoid a spurious test failure. [bug=1038503] |
800 |
Beautiful Soup uses it instead of chardet. cchardet is much |
251
by Leonard Richardson
As per PEP-8, allow searching by CSS class using the 'class_' |
801 |
faster. [bug=1020748] |
246
by Leonard Richardson
When sniffing encodings, if the cchardet library is installed, use it instead of chardet. It's much faster. [bug=1020748] |
802 |
|
245
by Leonard Richardson
Use logging.warning() instead of warning.warn() to notify the user that characters were replaced with REPLACEMENT CHARACTER. [bug=1013862] |
803 |
* Use logging.warning() instead of warning.warn() to notify the user |
804 |
that characters were replaced with REPLACEMENT |
|
805 |
CHARACTER. [bug=1013862] |
|
806 |
||
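As an illustration of the 'class_' keyword argument, here is a minimal sketch. The markup and variable names are invented for the example.

```python
from bs4 import BeautifulSoup

html = '<p class="intro">Hello</p><p class="body">World</p>'
soup = BeautifulSoup(html, "html.parser")

# 'class' is a reserved word in Python, so the keyword argument is 'class_'.
intro = soup.find_all("p", class_="intro")
intro_text = [p.get_text() for p in intro]
```
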
= 4.1.1 (20120703) =

* Fixed an html5lib tree builder crash which happened when html5lib
  moved a tag with a multivalued attribute from one part of the tree
  to another. [bug=1019603]

* Correctly display closing tags with an XML namespace declared. Patch
  by Andreas Kostyrka. [bug=1019635]

* Fixed a typo that made parsing significantly slower than it should
  have been, and also waited too long to close tags with XML
  namespaces. [bug=1020268]

* get_text() now returns an empty Unicode string if there is no text,
  rather than an empty bytestring. [bug=1020387]

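The get_text() change can be checked with a tiny sketch, assuming Python 3, where the Unicode string type is str.

```python
from bs4 import BeautifulSoup

# A tag with no text at all.
soup = BeautifulSoup("<p></p>", "html.parser")
text = soup.get_text()

# The result is an empty text string, never an empty bytes object.
is_text = isinstance(text, str)
```
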
= 4.1.0 (20120529) =

* Added experimental support for fixing Windows-1252 characters
  embedded in UTF-8 documents. (UnicodeDammit.detwingle())

* Fixed the handling of &quot; with the built-in parser. [bug=993871]

* Comments, processing instructions, document type declarations, and
  markup declarations are now treated as preformatted strings, the way
  CData blocks are. [bug=1001025]

* Fixed a bug with the lxml treebuilder that prevented the user from
  adding attributes to a tag that didn't originally have
  attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.

* Fixed some edge-case bugs having to do with inserting an element
  into a tag it's already inside, and replacing one of a tag's
  children with another. [bug=997529]

* Added the ability to search for attribute values specified in UTF-8. [bug=1003974]

  This caused a major refactoring of the search code. All the tests
  pass, but it's possible that some searches will behave differently.

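A minimal sketch of what UnicodeDammit.detwingle() is for: a byte string that is mostly UTF-8 but contains some Windows-1252 bytes. The sample text is invented for the example.

```python
from bs4 import UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = (
    "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!"
    "\N{RIGHT DOUBLE QUOTATION MARK}"
)

# A document that mixes two encodings: UTF-8 followed by Windows-1252.
mixed = snowmen.encode("utf8") + quote.encode("windows_1252")

# detwingle() rewrites the Windows-1252 bytes as UTF-8, so the whole
# document can then be decoded consistently.
fixed = UnicodeDammit.detwingle(mixed)
decoded = fixed.decode("utf8")
```

Decoding the mixed bytes directly as UTF-8 would fail on the Windows-1252 quotation marks, which is why detwingle() is run first.
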
= 4.0.5 (20120427) =

* Added a new method, wrap(), which wraps an element in a tag.

* Renamed replace_with_children() to unwrap(), which is easier to
  understand and also the jQuery name of the function.

* Made encoding substitution in <meta> tags completely transparent (no
  more %SOUP-ENCODING%).

* Fixed a bug in decoding data that contained a byte-order mark, such
  as data encoded in UTF-16LE. [bug=988980]

* Fixed a bug that made the HTMLParser treebuilder generate XML
  definitions ending with two question marks instead of
  one. [bug=984258]

* Upon document generation, CData objects are no longer run through
  the formatter. [bug=988905]

* The test suite now passes when lxml is not installed, whether or not
  html5lib is installed. [bug=987004]

* Print a warning on HTMLParseErrors to let people know they should
  install a better parser library.

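wrap() and unwrap() are inverses of each other, which a short sketch makes visible. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# wrap() puts the string inside a newly created <b> tag.
soup.p.string.wrap(soup.new_tag("b"))
wrapped = str(soup)

# unwrap() removes the <b> tag again but keeps its contents in place.
soup.b.unwrap()
unwrapped = str(soup)
```
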
= 4.0.4 (20120416) =

* Fixed a bug that sometimes created disconnected trees.

* Fixed a bug with the string setter that moved a string around the
  tree instead of copying it. [bug=983050]

* Attribute values are now run through the provided output formatter.
  Previously they were always run through the 'minimal' formatter. In
  the future I may make it possible to specify different formatters
  for attribute values and strings, but for now, consistent behavior
  is better than inconsistent behavior. [bug=980237]

* Added the missing renderContents method from Beautiful Soup 3. Also
  added an encode_contents() method to go along with decode_contents().

* Give a more useful error when the user tries to run the Python 2
  version of BS under Python 3.

* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
  UnicodeDammit(markup, smart_quotes_to="ascii").

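The smart_quotes_to option can be sketched as follows. The byte string is an invented Windows-1252 sample, and passing ["windows-1252"] as the suggested encoding is an assumption made so that the example does not depend on encoding detection.

```python
from bs4 import UnicodeDammit

# \x93, \x94 and \x92 are Windows-1252 "smart" quotation marks.
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
converted = dammit.unicode_markup
```
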
= 4.0.3 (20120403) =

* Fixed a typo that caused some versions of Python 3 to convert the
  Beautiful Soup codebase incorrectly.

* Got rid of the 4.0.2 workaround for HTML documents--it was
  unnecessary and the workaround was triggering a (possibly different,
  but related) bug in lxml. [bug=972466]

= 4.0.2 (20120326) =

* Worked around a possible bug in lxml that prevents non-tiny XML
  documents from being parsed. [bug=963880, bug=963936]

* Fixed a bug where specifying `text` while also searching for a tag
  only worked if `text` wanted an exact string match. [bug=955942]

= 4.0.1 (20120314) =

* This is the first official release of Beautiful Soup 4. There is no
  4.0.0 release, to eliminate any possibility that packaging software
  might treat "4.0.0" as being an earlier version than "4.0.0b10".

* Brought BS up to date with the latest release of soupselect, adding
  CSS selector support for direct descendant matches and multiple CSS
  class matches.

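The two selector features named above, direct descendant matches and multiple CSS class matches, might be exercised like this. The sketch assumes a current bs4 install, where select() is backed by the soupsieve package rather than the original soupselect port; the markup is invented.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="main">'
    '<p class="a b">direct</p>'
    '<div class="inner"><p class="a">nested</p></div>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# "#main > p" matches only <p> tags that are direct children of #main.
direct = [p.get_text() for p in soup.select("#main > p")]

# "p.a.b" matches only <p> tags that carry both CSS classes.
both = [p.get_text() for p in soup.select("p.a.b")]

# "p.a" matches either paragraph, since both have class "a".
either = [p.get_text() for p in soup.select("p.a")]
```
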
= 4.0.0b10 (20120302) =

* Added support for simple CSS selectors, taken from the soupselect project.

* Fixed a crash when using html5lib. [bug=943246]

* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
  attribute is now replaced with the appropriate encoding on
  output. [bug=942714]

* Fixed a bug that caused calling a tag to sometimes call find_all()
  with the wrong arguments. [bug=944426]

* For backwards compatibility, brought back the BeautifulStoneSoup
  class as a deprecated wrapper around BeautifulSoup.

= 4.0.0b9 (20120228) =

* Fixed the string representation of DOCTYPEs that have both a public
  ID and a system ID.

* Fixed the generated XML declaration.

* Renamed Tag.nsprefix to Tag.prefix, for consistency with
  NamespacedAttribute.

* Fixed a test failure that occurred on Python 3.x when chardet was
  installed.

* Made prettify() return Unicode by default, so it will look nice on
  Python 3 when passed into print().

= 4.0.0b8 (20120224) =

* All tree builders now preserve namespace information in the
  documents they parse. If you use the html5lib parser or lxml's XML
  parser, you can access the namespace URL for a tag as tag.namespace.

  However, there is no special support for namespace-oriented
  searching or tree manipulation. When you search the tree, you need
  to use namespace prefixes exactly as they're used in the original
  document.

* The string representation of a DOCTYPE always ends in a newline.

* Issue a warning if the user tries to use a SoupStrainer in
  conjunction with the html5lib tree builder, which doesn't support
  them.

= 4.0.0b7 (20120223) =

* Upon decoding to string, any characters that can't be represented in
  your chosen encoding will be converted into numeric XML entity
  references.

* Issue a warning if characters were replaced with REPLACEMENT
  CHARACTER during Unicode conversion.

* Restored compatibility with Python 2.6.

* The install process no longer installs docs or auxiliary text files.

* It's now possible to deepcopy a BeautifulSoup object created with
  Python's built-in HTML parser.

* About 100 unit tests that "test" the behavior of various parsers on
  invalid markup have been removed. Legitimate changes to those
  parsers caused these tests to fail, indicating that perhaps
  Beautiful Soup should not test the behavior of foreign
  libraries.

  The problematic unit tests have been reformulated as informational
  comparisons generated by the script
  scripts/demonstrate_parser_differences.py.

  This makes Beautiful Soup compatible with html5lib version 0.95 and
  future versions of HTMLParser.

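The numeric XML entity reference behavior described in the first item can be seen by encoding a tag to ASCII. This is a small sketch under the assumption of a current bs4 install; the snowman character is just a convenient non-ASCII example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>\N{SNOWMAN}</b>", "html.parser")

# ASCII cannot represent U+2603, so it is written as a numeric
# XML entity reference instead of raising an encoding error.
ascii_bytes = soup.b.encode("ascii")
```
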
= 4.0.0b6 (20120216) =

* Multi-valued attributes like "class" always have a list of values,
  even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
  empty-element tag. This may come back if I add a special XHTML mode
  (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
  useless.

* Passing text along with tag-specific arguments to a find* method:

   find("a", text="Click here")

  will find tags that contain the given text as their
  .string. Previously, the tag-specific arguments were ignored and
  only strings were searched.

* Fixed a bug that caused the html5lib tree builder to build a
  partially disconnected tree. Generally cleaned up the html5lib tree
  builder.

* If you restrict a multi-valued attribute like "class" to a string
  that contains spaces, Beautiful Soup will only consider it a match
  if the values correspond to that specific string.

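The first and last items in this list interact, so a short sketch may help. It assumes a current bs4 install and uses the later 'class_' keyword for the search; the markup is invented.

```python
from bs4 import BeautifulSoup

html = '<p class="body"></p><p class="body strikeout"></p>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

# "class" is always a list, even when it holds a single value.
single = first["class"]
multi = second["class"]

# A search string containing spaces must match the attribute's exact
# string value, so the order of the classes matters.
exact = soup.find_all("p", class_="body strikeout")
reordered = soup.find_all("p", class_="strikeout body")
```
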
= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
  belonging to multiple CSS classes is treated as having a list of
  values for the 'class' attribute. Searching for a CSS class will
  match *any* of the CSS classes.

  This actually affects all attributes that the HTML standard defines
  as taking multiple values (class, rel, rev, archive, accept-charset,
  and headers), but 'class' is by far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
  to one of the find* methods, it'll assume you want to use that
  object to search against a tag's CSS classes. Previously this only
  worked if you passed in a string.

* Fixed a bug that caused a crash when you passed a dictionary as an
  attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
  like <meta charset="utf-8" />. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
  page, it will try each of its guesses again, with errors="replace"
  instead of errors="strict". This may mean that some data gets
  replaced with REPLACEMENT CHARACTER, but at least most of it will
  get turned into Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
  on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
  empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
  begin with.

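The difference between errors="strict" and errors="replace" is a property of Python's own codecs, so it can be sketched with the standard library alone. This illustrates the underlying mechanism only; it is not Unicode, Dammit's actual retry logic.

```python
# Latin-1 bytes for "café"; the final byte 0xE9 is not valid UTF-8.
data = "caf\N{LATIN SMALL LETTER E WITH ACUTE}".encode("latin-1")

# errors="strict" refuses to decode the data at all.
try:
    data.decode("utf-8", errors="strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD REPLACEMENT CHARACTER for the
# undecodable byte and keeps the rest of the text.
lenient = data.decode("utf-8", errors="replace")
```
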
= 4.0.0b4 (20120208) =

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()

* BeautifulSoup.new_tag() will follow the rules of whatever
  tree-builder was used to create the original BeautifulSoup object. A
  new <p> tag will look like "<p />" if the soup object was created to
  parse XML, but it will look like "<p></p>" if the soup object was
  created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
  improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
  strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
  more general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
  entities unless the user prevents it.

* Added PageElement.insert_before() and PageElement.insert_after(),
  which let you put an element into the parse tree with respect to
  some other element.

* Raise an exception when the user tries to do something nonsensical
  like insert a tag into itself.

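new_tag(), new_string(), insert_before() and insert_after() are typically used together. A minimal sketch, with invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>leave</b>", "html.parser")

# Build a new <i> tag using the same rules as the rest of the soup.
tag = soup.new_tag("i")
tag.string = "Don't"

# Put the new tag in front of the existing string...
soup.b.string.insert_before(tag)

# ...and a new string directly after the new tag.
soup.b.i.insert_after(soup.new_string(" ever "))

result = str(soup.b)
```
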
= 4.0.0b3 (20120203) =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

    >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

    <?xml version="1.0" encoding="utf-8"?>
    <markup>
     ...
    </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.

= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.

= 3.1.0 =

A hybrid version that supports Python 2.4 and can be automatically
converted to run under Python 3.0. There are three
backwards-incompatible changes you should be aware of, but no new
features or deliberate behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a byte string and decode() to get
Unicode, and you'll be ready (well, readier) for Python 3.

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

  <a href="foo</a>, </a><a href="bar">baz</a>
  <a b="<a>">', '<a b="<a>"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

 <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.


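The encode()/decode() split from item 1 can be sketched as follows. The example uses the modern bs4 package on Python 3 as a stand-in, on the assumption that its encode() and decode() follow the same division of labor as the 3.1.0 methods described above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p>caf\N{LATIN SMALL LETTER E WITH ACUTE}</p>", "html.parser"
)

# encode() always returns bytes in the requested encoding.
as_bytes = soup.encode("utf-8")

# decode() always returns a Unicode string.
as_text = soup.decode()
```
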
= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.


= 3.0.6 = |
|
1235 |
||
1236 |
Got rid of a very old debug line that prevented chardet from working. |
|
1237 |
||
1238 |
Added a Tag.decompose() method that completely disconnects a tree or a |
|
1239 |
subset of a tree, breaking it up into bite-sized pieces that are |
|
1240 |
easy for the garbage collecter to collect. |
|
1241 |
||
Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.


= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)

= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.

= 3.0.3 (20060606) =

Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be
sure to pass in an appropriate value for convertEntities, or XML/HTML
entities might stick around that aren't valid in HTML/XML). The result
may not validate, but it should be good enough to not choke a
real-world XML parser. Specifically, the output of a properly
constructed soup object should always be valid as part of an XML
document, but parts may be missing if they were missing in the
original. As always, if the input is valid XML, the output will also
be valid.

= 3.0.2 (20060602) =

Previously, Beautiful Soup correctly handled attribute values that
contained embedded quotes (sometimes by escaping), but not other kinds
of XML character. Now, it correctly handles or escapes all special XML
characters in attribute values.

I aliased methods to the 2.x names (fetch, find, findText, etc.) for
backwards compatibility purposes. Those names are deprecated and if I
ever do a 4.0 I will remove them. I will, I tell you!

Fixed a bug where the findAll method wasn't passing along any keyword
arguments.

When run from the command line, Beautiful Soup now acts as an HTML
pretty-printer, not an XML pretty-printer.

= 3.0.1 (20060530) =

Reintroduced the "fetch by CSS class" shortcut. I thought keyword
arguments would replace it, but they don't. You can't call soup('a',
class='foo') because class is a Python keyword.

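The keyword problem is a property of Python itself and can be checked
without Beautiful Soup installed. In this small sketch,
is_valid_expression is a made-up helper, the name soup is never looked
up, and soup('a', 'foo') is assumed to be how the shortcut is spelled.

```python
def is_valid_expression(source):
    """Return True if source compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# Rejected by the compiler before any Beautiful Soup code could run.
keyword_form = is_valid_expression("soup('a', class='foo')")

# The shortcut passes the class name positionally, which is plain syntax.
shortcut_form = is_valid_expression("soup('a', 'foo')")

print(keyword_form, shortcut_form)
```
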
If Beautiful Soup encounters a meta tag that declares the encoding,
but a SoupStrainer tells it not to parse that tag, Beautiful Soup will
no longer try to rewrite the meta tag to mention the new
encoding. Basically, this makes SoupStrainers work in real-world
applications instead of crashing the parser.

= 3.0.0 "Who would not give all else for two p" (20060528) =

This release is not backward-compatible with previous releases. If
you've got code written with a previous version of the library, go
ahead and keep using it, unless one of the features mentioned here
really makes your life easier. Since the library is self-contained,
you can include an old copy of the library in your old applications,
and use the new version for everything else.

The documentation has been rewritten and greatly expanded with many
more examples.

Beautiful Soup autodetects the encoding of a document (or uses the one
you specify), and converts it from its native encoding to
Unicode. Internally, it only deals with Unicode strings. When you
print out the document, it converts to UTF-8 (or another encoding you
specify). [Doc reference]

It's now easy to make large-scale changes to the parse tree without
screwing up the navigation members. The methods are extract,
replaceWith, and insert. [Doc reference. See also Improving Memory
Usage with extract]

Passing True in as an attribute value gives you tags that have any
value for that attribute. You don't have to create a regular
expression. Passing None for an attribute value gives you tags that
don't have that attribute at all.

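The True/None rule can be sketched in a few lines. This is only an
illustration of the semantics described above: attribute_matches and
the dictionaries standing in for tags are hypothetical, not part of the
library.

```python
def attribute_matches(attrs, name, wanted):
    """Hypothetical check of one attribute criterion against an attrs dict."""
    if wanted is True:
        return name in attrs          # any value will do
    if wanted is None:
        return name not in attrs      # the attribute must be absent
    return attrs.get(name) == wanted  # otherwise compare the value


# Plain dictionaries stand in for parsed tags.
tags = [
    {"name": "a", "attrs": {"href": "/home", "id": "nav"}},
    {"name": "a", "attrs": {"href": "/about"}},
    {"name": "a", "attrs": {}},
]

with_id = [t for t in tags if attribute_matches(t["attrs"], "id", True)]
without_href = [t for t in tags if attribute_matches(t["attrs"], "href", None)]
```
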
Tag objects now know whether or not they're self-closing. This avoids
the problem where Beautiful Soup thought that tags like <BR /> were
self-closing even in XML documents. You can customize the self-closing
tags for a parser object by passing them in as a list of
selfClosingTags: you don't have to subclass anymore.

There's a new built-in parser, MinimalSoup, which has most of
BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc
reference]

You can use a SoupStrainer to tell Beautiful Soup to parse only part
of a document. This saves time and memory, often making Beautiful Soup
about as fast as a custom-built SGMLParser subclass. [Doc reference,
SoupStrainer reference]

You can (usually) use keyword arguments instead of passing a
dictionary of attributes to a search method. That is, you can replace
soup(args={"id" : "5"}) with soup(id="5"). You can still use args if
(for instance) you need to find an attribute whose name clashes with
the name of an argument to findAll. [Doc reference: **kwargs attrs]

The method names have changed to the better method names used in
Rubyful Soup. Instead of find methods and fetch methods, there are
only find methods. Instead of a scheme where you can't remember which
method finds one element and which one finds them all, we have find
and findAll. In general, if the method name mentions All or a plural
noun (e.g. findNextSiblings), then the method finds many
elements. Otherwise, it only finds one element. [Doc reference]

Some of the argument names have been renamed for clarity. For instance
avoidParserProblems is now parserMassage.

Beautiful Soup no longer implements a feed method. You need to pass a
string or a filehandle into the soup constructor, not with feed after
the soup has been created. There is still a feed method, but it's the
feed method implemented by SGMLParser and calling it will bypass
Beautiful Soup and cause problems.

The NavigableText class has been renamed to NavigableString. There is
no NavigableUnicodeString anymore, because every string inside a
Beautiful Soup parse tree is a Unicode string.

findText and fetchText are gone. Just pass a text argument into find
or findAll.

Null was more trouble than it was worth, so I got rid of it. Anything
that used to return Null now returns None.

Special XML constructs like comments and CDATA now have their own
NavigableString subclasses, instead of being treated as oddly-formed
data. If you parse a document that contains CDATA and write it back
out, the CDATA will still be there.

When you're parsing a document, you can get Beautiful Soup to convert
XML or HTML entities into the corresponding Unicode characters. [Doc
reference]

= 2.1.1 (20050918) =

Fixed a serious performance bug in BeautifulStoneSoup which was
causing parsing to be incredibly slow.

Corrected several entities that were previously being incorrectly
translated from Microsoft smart-quote-like characters.

Fixed a bug that was breaking text fetch.

Fixed a bug that crashed the parser when text chunks that look like
HTML tag names showed up within a SCRIPT tag.

THEAD, TBODY, and TFOOT tags are now nestable within TABLE
tags. Nested tables should parse more sensibly now.

BASE is now considered a self-closing tag.

= 2.1.0 "Game, or any other dish?" (20050504) =

Added a wide variety of new search methods which, given a starting
point inside the tree, follow a particular navigation member (like
nextSibling) over and over again, looking for Tag and NavigableText
objects that match certain criteria. The new methods are findNext,
fetchNext, findPrevious, fetchPrevious, findNextSibling,
fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
findParent, and fetchParents. All of these use the same basic code
used by first and fetch, so you can pass your weird ways of matching
things into these methods.

The fetch method and its derivatives now accept a limit argument.

You can now pass keyword arguments when calling a Tag object as though
it were a method.

Fixed a bug that caused all hand-created tags to share a single set of
attributes.

= 2.0.3 (20050501) =

Fixed Python 2.2 support for iterators.

Fixed a bug that gave the wrong representation to tags within quote
tags like <script>.

Took some code from Mark Pilgrim that treats CDATA declarations as
data instead of ignoring them.

Beautiful Soup's setup.py will now do an install even if the unit
tests fail. It won't build a source distribution if the unit tests
fail, so I can't release a new version unless they pass.

= 2.0.2 (20050416) =

Added the unit tests in a separate module, and packaged it with
distutils.

Fixed a bug that sometimes caused renderContents() to return a Unicode
string even if there was no Unicode in the original string.

Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.

Reinstated some backwards compatibility with 1.x versions: referencing
the string member of a NavigableText object returns the NavigableText
object instead of throwing an error.

= 2.0.1 (20050412) =

Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, eg. tag.table.td.

Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.

Fixed a bug in the comparison operator.

= 2.0.0 "Who cares for fish?" (20050410) =

Beautiful Soup version 1 was very useful but also pretty stupid. I
originally wrote it without noticing any of the problems inherent in
trying to build a parse tree out of ambiguous HTML tags. This version
solves all of those problems to my satisfaction. It also adds many new
clever things to make up for the removal of the stupid things.

== Parsing ==

The parser logic has been greatly improved, and the BeautifulSoup
class should much more reliably yield a parse tree that looks like
what the page author intended. For a particular class of odd edge
cases that now causes problems, there is a new class,
ICantBelieveItsBeautifulSoup.

By default, Beautiful Soup now performs some cleanup operations on
text before parsing it. This is to avoid common problems with bad
definitions and self-closing tags that crash SGMLParser. You can
provide your own set of cleanup operations, or turn it off
altogether. The cleanup operations include fixing self-closing tags
that don't close, and replacing Microsoft smart quotes and similar
characters with their HTML entity equivalents.

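The smart-quote part of that cleanup can be approximated with the
standard library. This is a sketch of the idea only, not the cleanup
code that shipped in 2.0: smart_quotes_to_entities is a hypothetical
function, and it assumes the input bytes are Windows-1252.

```python
from html.entities import codepoint2name


def smart_quotes_to_entities(raw, encoding="windows-1252"):
    """Decode raw bytes, then replace non-ASCII characters that have a
    named HTML entity with that entity."""
    text = raw.decode(encoding)
    pieces = []
    for char in text:
        codepoint = ord(char)
        if codepoint > 127 and codepoint in codepoint2name:
            pieces.append("&%s;" % codepoint2name[codepoint])
        else:
            pieces.append(char)
    return "".join(pieces)


# \x93 and \x94 are curly double quotes and \x92 is a curly apostrophe
# in Windows-1252.
cleaned = smart_quotes_to_entities(b"\x93Hello\x94 it\x92s")
print(cleaned)
```
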
You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.

== Strings and Unicode ==

There are separate NavigableText subclasses for ASCII and Unicode
strings. These classes directly subclass the corresponding base data
types. This means you can treat NavigableText objects as strings
instead of having to call methods on them to get the strings.

str() on a Tag always returns a string, and unicode() always returns
Unicode. Previously it was inconsistent.

== Tree traversal ==

In a first() or fetch() call, the tag name or the desired value of an
attribute can now be any of the following:

* A string (matches that specific tag or that specific attribute value)
* A list of strings (matches any tag or attribute value in the list)
* A compiled regular expression object (matches any tag or attribute
  value that matches the regular expression)
* A callable object that takes the Tag object or attribute value as a
  string. It returns None/false/empty string if the given string
  doesn't match, and any other value if it does.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I took out
SQL-style wildcards. I'll put them back if someone complains, but
their removal simplifies the code a lot.

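The four kinds of criteria listed above can be modelled with a single
dispatch function. This is a sketch of the rules as described, working
on plain strings. The name matches is hypothetical, and using search()
for regular expressions is an assumption, not a statement about what
the library does.

```python
import re


def matches(criterion, value):
    """Hypothetical matcher for the four kinds of criterion."""
    if callable(criterion):
        # None, False, and the empty string all mean "no match".
        return bool(criterion(value))
    if hasattr(criterion, "search"):
        # A compiled regular expression object.
        return criterion.search(value) is not None
    if isinstance(criterion, (list, tuple)):
        return value in criterion
    return criterion == value


heading = re.compile(r"^h[1-6]$")
results = [
    matches("table", "table"),
    matches(["td", "th"], "th"),
    matches(heading, "h2"),
    matches(lambda name: name.startswith("t"), "tr"),
]
```
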
You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.

If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.

1.x and 2.x both let you call a Tag object as a shorthand for
fetch(). For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
traverse a tree in very little code: for header in
soup.bodyTag.pTag.tableTag('th'):

If an element relationship (like parent or next) doesn't apply to a
tag, it'll now show up Null instead of None. first() will also return
Null if you ask it for a nonexistent tag. Null is an object that's
just like None, except you can do whatever you want to it and it'll
give you Null instead of throwing an error.

This lets you do tree traversals like soup.htmlTag.headTag.titleTag
without having to worry if the intermediate stages are actually
there. Previously, if there was no 'head' tag in the document, headTag
in that instance would have been None, and accessing its 'titleTag'
member would have thrown an AttributeError. Now, you can get what you
want when it exists, and get Null when it doesn't, without having to
do a lot of conditionals checking to see if every stage is None.

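The idea behind Null can be re-created in a few lines. This is a
hypothetical reconstruction for illustration, not the class that
shipped in 2.0: NullType is a made-up name, and only attribute access,
calls, and truth-testing are covered.

```python
class NullType:
    """Hypothetical Null: absorbs attribute access and calls."""

    def __getattr__(self, name):
        # Only reached for attributes that don't exist, which is all
        # of the ordinary ones.
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __bool__(self):
        return False

    def __repr__(self):
        return "Null"


Null = NullType()

# Chained traversal through missing stages keeps yielding Null.
missing_title = Null.headTag.titleTag

# The same chain starting from None fails at the first step.
try:
    None.titleTag
    none_raised = False
except AttributeError:
    none_raised = True
```
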
There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:

  <p><ul><li>Foo<br /><li>Bar</ul>

The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.

I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well. It was
really difficult to use, I never used it, and I don't think anyone
else ever used it. To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.

== Tree manipulation ==

You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.

== Porting Considerations ==

There are three changes in 2.0 that break old code:

In the post-1.2 release you could pass in a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.

It's no longer possible to pass in SQL-style wildcards to fetch(). Use
a regular expression instead.

The different parsing algorithm means the parse tree may not be shaped
like you expect. This will only actually affect you if your code uses
one of the affected parts. I haven't run into this problem yet while
porting my code.

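For the second change above, an old wildcard pattern can be turned
into a regular expression mechanically. This sketch assumes that '%'
means "any run of characters", as in SQL LIKE, and that no other
wildcard characters are in play. wildcard_to_regex is a hypothetical
helper, not something the library provides.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a '%'-wildcard pattern into a compiled regex.

    Literal text is escaped, so characters like '.' keep their literal
    meaning. Use fullmatch() on the result to match the whole string.
    """
    literal_parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(literal_parts), re.DOTALL)


ends_with_foo = wildcard_to_regex("%foo")
starts_with_nav = wildcard_to_regex("nav%")

print(bool(ends_with_foo.fullmatch("barfoo")))
print(bool(ends_with_foo.fullmatch("fo")))
```
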
= Between 1.2 and 2.0 =

This is the release to get if you want Python 1.5 compatibility.

The desired value of an attribute can now be any of the following:

* A string
* A string with SQL-style wildcards
* A compiled RE object
* A callable that returns None/false/empty string if the given value
  doesn't match, and any other value otherwise.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I no longer
recommend you use SQL-style wildcards. They may go away in a future
release to clean up the code.

Made Beautiful Soup handle processing instructions as text instead of
ignoring them.

Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.

Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.

Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".

= 1.2 "Who for such dainties would not stoop?" (20040708) =

Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.

Made BeautifulStoneSoup even dumber by making it not implicitly close
a tag when another tag of the same type is encountered; only when an
actual closing tag is encountered. This change courtesy of Fuzzy (mike
at pcblokes dot com). BeautifulSoup still works as before.

= 1.1 "Swimming in a hot tureen" =

Added more 'nestable' tags. Changed popping semantics so that when a
nestable tag is encountered, tags are popped up to the previously
encountered nestable tag (of whatever kind). I will revert this if
enough people complain, but it should make more people's lives easier
than harder. This enhancement was suggested by Anthony Baxter (anthony
at interlink dot com dot au).

= 1.0 "So rich and green" (20040420) =

Initial release.