Beautiful Soup's official support for Python 2 ended on January 1st,
2021. The final release to support Python 2 was Beautiful Soup
4.9.3. In the Launchpad Bazaar repository, the final revision to support
Python 2 was revision 605.

= Unreleased

* Fixed a test failure when cchardet is not installed but
  charset_normalizer is. [bug=1973072]

* Fixed another crash when overriding multi_valued_attributes and using the
  html5lib parser. [bug=1948488]

= 4.11.1 (20220408)

This release was done to ensure that the unit tests are packaged along
with the released source. There are no functionality changes in this
release, but there are a few other packaging changes:

* The Japanese and Korean translations of the documentation are included.
* The changelog is now packaged as CHANGELOG, and the license file is
  packaged as LICENSE. NEWS.txt and COPYING.txt are still present,
  but may be removed in the future.
* TODO.txt is no longer packaged, since a TODO is not relevant for released
  code.

= 4.11.0 (20220407)

* Ported unit tests to use pytest.

* Added special string classes, RubyParenthesisString and RubyTextString,
  to make it possible to treat ruby text specially in get_text() calls.
  [bug=1941980]

* It's now possible to customize the way output is indented by
  providing a value for the 'indent' argument to the Formatter
  constructor. The 'indent' argument works very similarly to the
  argument of the same name in the Python standard library's
  json.dump() function. [bug=1955497]

* If the charset-normalizer Python module
  (https://pypi.org/project/charset-normalizer/) is installed, Beautiful
  Soup will use it to detect the character sets of incoming documents.
  This is also the module used by newer versions of the Requests library.
  For the sake of backwards compatibility, chardet and cchardet both take
  precedence if installed. [bug=1955346]

* Added a workaround for an lxml bug
  (https://bugs.launchpad.net/lxml/+bug/1948551) that causes
  problems when parsing a Unicode string beginning with BYTE ORDER MARK.
  [bug=1947768]

* Issue a warning when an HTML parser is used to parse a document that
  looks like XML but not XHTML. [bug=1939121]

* Do a better job of keeping track of namespaces as an XML document is
  parsed, so that CSS selectors that use namespaces will do the right
  thing more often. [bug=1946243]

* Some time ago, the misleadingly named "text" argument to find-type
  methods was renamed to the more accurate "string." But this supposed
  "renaming" didn't make it into important places like the method
  signatures or the docstrings. That's corrected in this
  version. "text" still works, but will give a DeprecationWarning.
  [bug=1947038]

* Fixed a crash when pickling a BeautifulSoup object that has no
  tree builder. [bug=1934003]

* Fixed a crash when overriding multi_valued_attributes and using the
  html5lib parser. [bug=1948488]

* Standardized the wording of the MarkupResemblesLocatorWarning
  warnings to omit untrusted input and make the warnings less
  judgmental about what you ought to be doing. [bug=1955450]

* Removed support for the iconv_codec library, which doesn't seem
  to exist anymore and was never put up on PyPI. (The closest
  replacement on PyPI, iconv_codecs, is GPL-licensed, so we can't use
  it--it's also quite old.)

= 4.10.0 (20210907)

* This is the first release of Beautiful Soup to only support Python
  3. I dropped Python 2 support to maintain support for newer versions
  (58 and up) of setuptools. See:
  https://github.com/pypa/setuptools/issues/2769 [bug=1942919]

* The behavior of methods like .get_text() and .strings now differs
  depending on the type of tag. The change is visible with HTML tags
  like <script>, <style>, and <template>. Starting in 4.9.0, methods
  like get_text() returned no results on such tags, because the
  contents of those tags are not considered 'text' within the document
  as a whole.

  But a user who calls script.get_text() is working from a different
  definition of 'text' than a user who calls div.get_text()--otherwise
  there would be no need to call script.get_text() at all. In 4.10.0,
  the contents of (e.g.) a <script> tag are considered 'text' during a
  get_text() call on the tag itself, but not considered 'text' during
  a get_text() call on the tag's parent.

  Because of this change, calling get_text() on each child of a tag
  may now return a different result than calling get_text() on the tag
  itself. That's because different tags now have different
  understandings of what counts as 'text'. [bug=1906226] [bug=1868861]

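A small illustration of the difference, assuming Beautiful Soup 4.10.0
or later with the html.parser tree builder. The markup is invented for
the example.

```python
from bs4 import BeautifulSoup

markup = "<div><p>Hello</p><script>var x = 1;</script></div>"
soup = BeautifulSoup(markup, "html.parser")

# From the parent's point of view, the script body is not 'text'.
div_text = soup.div.get_text()

# Asked directly, the <script> tag reports its own contents.
script_text = soup.script.get_text()

print(repr(div_text), repr(script_text))
```
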
* NavigableString and its subclasses now implement the get_text()
  method, as well as the properties .strings and
  .stripped_strings. These methods will either return the string
  itself, or nothing, so the only reason to use this is when iterating
  over a list of mixed Tag and NavigableString objects. [bug=1904309]

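A sketch of the mixed-iteration case, assuming the html.parser tree
builder and made-up markup. The children of the <p> tag are a string, a
Tag, and another string, and get_text() can be called on each without a
type check.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>plain <b>bold</b> tail</p>", "html.parser")

# .children yields NavigableString and Tag objects interleaved.
pieces = [child.get_text() for child in soup.p.children]
print(pieces)
```
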
* The 'html5' formatter now treats attributes whose values are the
  empty string as HTML boolean attributes. Previously (and in other
  formatters), an attribute value must be set as None to be treated as
  a boolean attribute. In a future release, I plan to also give this
  behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424]

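A minimal sketch of the boolean-attribute behavior, assuming Beautiful
Soup 4.10.0 or later and the html.parser tree builder. The <option>
markup is just an example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<option selected=""></option>', "html.parser")

# The default formatter writes the empty value out explicitly.
default_output = soup.decode()

# The 'html5' formatter collapses it to a bare boolean attribute.
html5_output = soup.decode(formatter="html5")

print(default_output)
print(html5_output)
```
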
* The 'replace_with()' method now takes a variable number of arguments,
  and can be used to replace a single element with a sequence of elements.
  Patch by Bill Chandos. [rev=605]

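A sketch of replacing one element with two, assuming Beautiful Soup
4.10.0 or later and the html.parser tree builder. The markup and the
replacement elements are made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I like <b>bold</b> text</p>", "html.parser")

italic = soup.new_tag("i")
italic.string = "italic"

# The <b> tag is swapped for a new tag followed by a plain string,
# in the order the arguments are given.
soup.b.replace_with(italic, " and plain")

result = str(soup.p)
print(result)
```
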
* Corrected output when the namespace prefix associated with a
  namespaced attribute is the empty string, as opposed to
  None. [bug=1915583]

* Performance improvement when processing tags that speeds up overall
  tree construction by 2%. Patch by Morotti. [bug=1899358]

* Corrected the use of special string container classes in cases when a
  single tag may contain strings with different containers; such as
  the <template> tag, which may contain both TemplateString objects
  and Comment objects. [bug=1913406]

* The html.parser tree builder can now handle named entities
  found in the HTML5 spec in much the same way that the html5lib
  tree builder does. Note that the lxml HTML tree builder doesn't handle
  named entities this way. [bug=1924908]

* Added a second way to specify encodings to UnicodeDammit and
  EncodingDetector, based on the order of precedence defined in the
  HTML5 spec, starting at:
  https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding

  Encodings in 'known_definite_encodings' are tried first, then
  byte-order-mark sniffing is run, then encodings in 'user_encodings'
  are tried. The old argument, 'override_encodings', is now a
  deprecated alias for 'known_definite_encodings'.

  This changes the default behavior of the html.parser and lxml tree
  builders, in a way that may slightly improve encoding
  detection but will probably have no effect. [bug=1889014]

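A minimal sketch of 'known_definite_encodings', assuming Beautiful Soup
4.10.0 or later. The byte string is a made-up Latin-1 sample; because
the encoding is declared as definite, it is tried before any sniffing.

```python
from bs4.dammit import UnicodeDammit

# Bytes that are valid Latin-1 but not valid UTF-8.
data = "Sacré bleu!".encode("latin-1")

dammit = UnicodeDammit(data, known_definite_encodings=["latin-1"])
text = dammit.unicode_markup

print(text, dammit.original_encoding)
```
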
* Improve the warning issued when a directory name (as opposed to
  the name of a regular file) is passed as markup into the BeautifulSoup
  constructor. [bug=1913628]

= 4.9.3 (20201003)

* Implemented a significant performance optimization to the process of
  searching the parse tree. Patch by Morotti. [bug=1898212]

= 4.9.2 (20200926)

* Fixed a bug that caused too many tags to be popped from the tag
  stack during tree building, when encountering a closing tag that had
  no matching opening tag. [bug=1880420]

* Fixed a bug that inconsistently moved elements over when passing
  a Tag, rather than a list, into Tag.extend(). [bug=1885710]

* Specify the soupsieve dependency in a way that complies with
  PEP 508. Patch by Mike Nerone. [bug=1893696]

* Change the signatures for BeautifulSoup.insert_before and insert_after
  (which are not implemented) to match PageElement.insert_before and
  insert_after, quieting warnings in some IDEs. [bug=1897120]

= 4.9.1 (20200517)

* Added a keyword argument 'on_duplicate_attribute' to the
  BeautifulSoupHTMLParser constructor (used by the html.parser tree
  builder) which lets you customize the handling of markup that
  contains the same attribute more than once, as in:
  <a href="url1" href="url2"> [bug=1878209]

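A sketch of 'on_duplicate_attribute' using the markup from the entry
above. It assumes the keyword is passed through the BeautifulSoup
constructor to the html.parser tree builder, and that "ignore" keeps the
first value while the default keeps the last.

```python
from bs4 import BeautifulSoup

markup = '<a href="url1" href="url2">link</a>'

# Default behavior: the last occurrence of the attribute wins.
last_wins = BeautifulSoup(markup, "html.parser")

# "ignore" discards later duplicates, so the first occurrence wins.
first_wins = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute="ignore"
)

print(last_wins.a["href"], first_wins.a["href"])
```
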
* Added a distinct subclass, GuessedAtParserWarning, for the warning
  issued when BeautifulSoup is instantiated without a parser being
  specified. [bug=1873787]

* Added a distinct subclass, MarkupResemblesLocatorWarning, for the
  warning issued when BeautifulSoup is instantiated with 'markup' that
  actually seems to be a URL or the path to a file on
  disk. [bug=1873787]

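Having distinct subclasses means each warning can be filtered by
category. A minimal sketch, assuming both classes are importable from
the bs4 package; the markup strings and the example.com URL are
placeholders.

```python
import warnings

from bs4 import (
    BeautifulSoup,
    GuessedAtParserWarning,
    MarkupResemblesLocatorWarning,
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Silence only these two categories; other warnings are still recorded.
    warnings.filterwarnings("ignore", category=GuessedAtParserWarning)
    warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)

    BeautifulSoup("<p>no parser named</p>")               # parser is guessed
    BeautifulSoup("https://example.com/", "html.parser")  # looks like a URL

soup_warnings = [
    w for w in caught
    if issubclass(
        w.category, (GuessedAtParserWarning, MarkupResemblesLocatorWarning)
    )
]
print(soup_warnings)
```
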
* The new NavigableString subclasses (Stylesheet, Script, and
  TemplateString) can now be imported directly from the bs4 package.

* If you encode a document with a Python-specific encoding like
  'unicode_escape', that encoding is no longer mentioned in the final
  XML or HTML document. Instead, encoding information is omitted or
  left blank. [bug=1874955]

* Fixed test failures when run against soupsieve 2.0. Patch by Tomáš
  Chvátal. [bug=1872279]

= 4.9.0 (20200405)

* Added PageElement.decomposed, a new property which lets you
  check whether you've already called decompose() on a Tag or
  NavigableString.

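A short sketch of the property in use, assuming the html.parser tree
builder and made-up markup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>keep</p><p id='x'>drop</p>", "html.parser")
tag = soup.find(id="x")

before = tag.decomposed   # nothing has been destroyed yet
tag.decompose()           # remove the tag and destroy its contents
after = tag.decomposed    # the old reference now reports it

print(before, after)
```
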
* Embedded CSS and JavaScript are now stored in distinct Stylesheet and
  Script objects, which are ignored by methods like get_text() since most
  people don't consider this sort of content to be 'text'. This
  feature is not supported by the html5lib treebuilder. [bug=1868861]

* Added a Russian translation by 'authoress' to the repository.

* Fixed an unhandled exception when formatting a Tag that had been
  decomposed. [bug=1857767]

* Fixed a bug that happened when passing a Unicode filename containing
  non-ASCII characters as markup into Beautiful Soup, on a system that
  allows Unicode filenames. [bug=1866717]

* Added a performance optimization to PageElement.extract(). Patch by
  Arthur Darcet.

= 4.8.2 (20191224)

* Added Python docstrings to all public methods of the most commonly
  used classes.

* Added a Chinese translation by Deron Wang and a Brazilian Portuguese
  translation by Cezar Peixeiro to the repository.

* Fixed two deprecation warnings. Patches by Colin
  Watson and Nicholas Neumann. [bug=1847592] [bug=1855301]

* The html.parser tree builder now correctly handles DOCTYPEs that are
  not uppercase. [bug=1848401]

* PageElement.select() now returns a ResultSet rather than a regular
  list, making it consistent with methods like find_all().

= 4.8.1 (20191006)

* When the html.parser or html5lib parsers are in use, Beautiful Soup
  will, by default, record the position in the original document where
  each tag was encountered. This includes line number (Tag.sourceline)
  and position within a line (Tag.sourcepos). Based on code by Chris
  Mayo. [bug=1742921]

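A minimal sketch of reading the recorded positions, assuming the
html.parser tree builder (html5lib may report sourcepos differently).
The two-line markup is invented.

```python
from bs4 import BeautifulSoup

markup = "<p>first</p>\n<p>second</p>"
soup = BeautifulSoup(markup, "html.parser")

# Line numbers start at 1; positions within a line start at 0.
positions = [(tag.sourceline, tag.sourcepos) for tag in soup.find_all("p")]
print(positions)
```
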
* When instantiating a BeautifulSoup object, it's now possible to
  provide a dictionary ('element_classes') of the classes you'd like to be
  instantiated instead of Tag, NavigableString, etc.

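A sketch of 'element_classes', assuming the html.parser tree builder.
LoudTag and LoudString are hypothetical subclasses that exist only for
this illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


class LoudTag(Tag):
    """Hypothetical Tag subclass, used only in this example."""


class LoudString(NavigableString):
    """Hypothetical NavigableString subclass, used only in this example."""


soup = BeautifulSoup(
    "<p>hi</p>",
    "html.parser",
    # Map each default class to the class to instantiate in its place.
    element_classes={Tag: LoudTag, NavigableString: LoudString},
)

print(type(soup.p).__name__, type(soup.p.string).__name__)
```
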
* Fixed the definition of the default XML namespace when using
  lxml 4.4. Patch by Isaac Muse. [bug=1840141]

* Fixed a crash when pretty-printing tags that were not created
  during initial parsing. [bug=1838903]

* Copying a Tag preserves information that was originally obtained from
  the TreeBuilder used to build the original Tag. [bug=1838903]

* Raise an explanatory exception when the underlying parser
  completely rejects the incoming markup. [bug=1838877]

* Avoid a crash when trying to detect the declared encoding of a
  Unicode document. [bug=1838877]

* Avoid a crash when unpickling certain parse trees generated
  using html5lib on Python 3. [bug=1843545]

= 4.8.0 (20190720, "One Small Soup")

This release focuses on making it easier to customize Beautiful Soup's
input mechanism (the TreeBuilder) and output mechanism (the Formatter).

* You can customize the TreeBuilder object by passing keyword
  arguments into the BeautifulSoup constructor. Those keyword
  arguments will be passed along into the TreeBuilder constructor.

  The main reason to do this right now is to change which
  attributes are treated as multi-valued attributes (the way 'class'
  is treated by default). You can do this with the
  'multi_valued_attributes' argument. [bug=1832978]

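A small sketch of 'multi_valued_attributes', assuming the html.parser
tree builder and made-up markup. Passing None turns off multi-valued
handling entirely, so 'class' stays a single string.

```python
from bs4 import BeautifulSoup

markup = '<p class="a b"></p>'

# Default: 'class' is split on whitespace into a list.
default_soup = BeautifulSoup(markup, "html.parser")

# With no multi-valued attributes, the value is left as one string.
plain_soup = BeautifulSoup(
    markup, "html.parser", multi_valued_attributes=None
)

print(default_soup.p["class"], plain_soup.p["class"])
```
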
* The role of Formatter objects has been greatly expanded. The Formatter
  class now controls the following:

  - The function to call to perform entity substitution. (This was
    previously Formatter's only job.)
  - Which tags should be treated as containing CDATA and have their
    contents exempt from entity substitution.
  - The order in which a tag's attributes are output. [bug=1812422]
  - Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'

  All preexisting code should work as before.

* Added a new method to the API, Tag.smooth(), which consolidates
  multiple adjacent NavigableString elements. [bug=1697296]

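A minimal sketch of Tag.smooth(), assuming the html.parser tree builder.
Appending a string to a tag that already ends in a string leaves two
adjacent NavigableString objects, which smooth() merges.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>A one</p>", "html.parser")
soup.p.append(", a two")

before = len(soup.p.contents)   # two adjacent strings
soup.p.smooth()
after = len(soup.p.contents)    # merged into one

print(before, after, soup.p.string)
```
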
* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is always
  recognized as a named entity and converted to a single quote. [bug=1818721]

= 4.7.1 (20190106)

* Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]

* Fixed an incorrectly raised exception when inserting a tag before or
  after an identical tag. [bug=1810692]

* Beautiful Soup will no longer try to keep track of namespaces that
  are not defined with a prefix; this can confuse soupsieve. [bug=1810680]

* Tried even harder to avoid the deprecation warning originally fixed in
  4.6.1. [bug=1778909]

= 4.7.0 (20181231)

* Beautiful Soup's CSS Selector implementation has been replaced by a
  dependency on Isaac Muse's SoupSieve project (the soupsieve package
  on PyPI). The good news is that SoupSieve has a much more robust and
  complete implementation of CSS selectors, resolving a large number
  of longstanding issues. The bad news is that from this point onward,
  SoupSieve must be installed if you want to use the select() method.

  You don't have to change anything if you installed Beautiful Soup
  through pip (SoupSieve will be automatically installed when you
  upgrade Beautiful Soup) or if you don't use CSS selectors from
  within Beautiful Soup.

  SoupSieve documentation: https://facelessuser.github.io/soupsieve/

* Added the PageElement.extend() method, which works like list.extend().
  [bug=1514970]

* PageElement.insert_before() and insert_after() now take a variable
  number of arguments. [bug=1514970]

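A sketch of extend() and the variadic insert methods together, assuming
a current Beautiful Soup release and the html.parser tree builder. The
markup and strings are made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b></p>", "html.parser")

# Several siblings can be inserted in one call, in argument order.
soup.b.insert_after(" two", " three")
soup.b.insert_before("zero ")

# extend() appends each item of an iterable to the end of the tag.
soup.p.extend([" four", " five"])

text = soup.p.get_text()
print(text)
```
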
* Fix a number of problems with the tree builder that caused
  trees that were superficially okay, but which fell apart when bits
  were extracted. Patch by Isaac Muse. [bug=1782928,1809910]

* Fixed a problem with the tree builder in which elements that
  contained no content (such as empty comments and all-whitespace
  elements) were not being treated as part of the tree. Patch by Isaac
  Muse. [bug=1798699]

* Fixed a problem with multi-valued attributes where the value
  contained whitespace. Thanks to Jens Svalgaard for the
  fix. [bug=1787453]

* Clarified ambiguous license statements in the source code. Beautiful
  Soup is released under the MIT license, and has been since 4.4.0.

* This file has been renamed from NEWS.txt to CHANGELOG.

= 4.6.3 (20180812)

* Exactly the same as 4.6.2. Re-released to make the README file
  render properly on PyPI.

= 4.6.2 (20180812)

* Fix an exception when a custom formatter was asked to format a void
  element. [bug=1784408]

= 4.6.1 (20180728)

* Stop data loss when encountering an empty numeric entity, and
  possibly in other cases. Thanks to tos.kamiya for the fix. [bug=1698503]

* Preserve XML namespaces introduced inside an XML document, not just
  the ones introduced at the top level. [bug=1718787]

* Added a new formatter, "html5", which represents void elements
  as "<element>" rather than "<element/>". [bug=1716272]

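A short sketch of the "html5" formatter next to the default one,
assuming the html.parser tree builder and made-up markup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>line one<br/>line two</p>", "html.parser")

# Default formatter: void elements keep the trailing slash.
default_output = soup.decode()

# "html5" formatter: void elements are written without it.
html5_output = soup.decode(formatter="html5")

print(default_output)
print(html5_output)
```
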
384 |
* Fixed a problem where the html.parser tree builder interpreted |
|
385 |
a string like "&foo " as the character entity "&foo;" [bug=1728706] |
|
466
by Leonard Richardson
Fixed a bug where find_all() was not working when asked to find a |
386 |
|
471
by Leonard Richardson
Correctly handle invalid HTML numeric character entities like “ |
387 |
* Correctly handle invalid HTML numeric character entities like “ |
388 |
which reference code points that are not Unicode code points. Note |
|
389 |
that this is only fixed when Beautiful Soup is used with the |
|
390 |
html.parser parser -- html5lib already worked and I couldn't fix it |
|
391 |
with lxml. [bug=1782933] |
|
392 |
||
* Improved the warning given when no parser is specified. [bug=1780571]

* When markup contains duplicate elements, a select() call that
  includes multiple match clauses will match all relevant
  elements. [bug=1770596]

* Fixed code that was causing deprecation warnings in recent Python 3
  versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496]

* Fixed a Windows crash in diagnose() when checking whether a long
  markup string is a filename. [bug=1737121]

* Stopped HTMLParser from raising an exception in very rare cases of
  bad markup. [bug=1708831]

* Fixed a bug where find_all() was not working when asked to find a
  tag with a namespaced name in an XML document that was parsed as
  HTML. [bug=1723783]

* You can get finer control over formatting by subclassing
  bs4.element.Formatter and passing a Formatter instance into (e.g.)
  encode(). [bug=1716272]

* You can pass a dictionary of `attrs` into
  BeautifulSoup.new_tag. This makes it possible to create a tag with
  an attribute like 'name' that would otherwise be masked by another
  argument of new_tag. [bug=1779276]

* Clarified the deprecation warning when accessing tag.fooTag, to cover
  the possibility that you might really have been looking for a tag
  called 'fooTag'.

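A minimal sketch of two of the entries above, the "html5" formatter and the
`attrs` argument to new_tag(). It assumes a recent Beautiful Soup 4 release
and the built-in html.parser tree builder; the markup and variable names are
made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one<br/>two</p>", "html.parser")

# The default formatter writes a void element as "<br/>";
# the "html5" formatter writes it as "<br>".
default_markup = soup.p.decode()
html5_markup = soup.p.decode(formatter="html5")
print(default_markup)
print(html5_markup)

# "name" is already the first positional argument of new_tag(), so an
# attribute called "name" has to travel in the attrs dictionary.
field = soup.new_tag("input", attrs={"name": "email", "type": "text"})
print(field["name"], field.name)
```

Passing name="email" as a keyword would collide with the tag name argument,
which is the masking problem the new_tag entry describes.
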
= 4.6.0 (20170507) =

* Added the `Tag.get_attribute_list` method, which acts like `Tag.get` for
  getting the value of an attribute, but which always returns a list,
  whether or not the attribute is a multi-value attribute. [bug=1678589]

* It's now possible to use a tag's namespace prefix when searching,
  e.g. soup.find('namespace:tag') [bug=1655332]

* Improved the handling of empty-element tags like <br> when using the
  html.parser parser. [bug=1676935]

* HTML parsers treat all HTML4 and HTML5 empty element tags (aka void
  element tags) correctly. [bug=1656909]

* Namespace prefix is preserved when an XML tag is copied. Thanks
  to Vikas for a patch and test. [bug=1685172]

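A short sketch of how `Tag.get_attribute_list` differs from `Tag.get`,
assuming the built-in html.parser tree builder. The markup is invented for
the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="intro" class="lead big">hi</p>', "html.parser")
p = soup.p

# Tag.get() returns a list only for multi-valued attributes such as class.
class_value = p.get("class")
id_value = p.get("id")

# get_attribute_list() returns a list in both cases.
class_list = p.get_attribute_list("class")
id_list = p.get_attribute_list("id")

print(class_value, id_value)
print(class_list, id_list)
```

This saves an isinstance() check in code that loops over attribute values
without knowing in advance which attributes are multi-valued.
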
= 4.5.3 (20170102) =

* Fixed foster parenting when html5lib is the tree builder. Thanks to
  Geoffrey Sneddon for a patch and test.

* Fixed yet another problem that caused the html5lib tree builder to
  create a disconnected parse tree. [bug=1629825]

= 4.5.2 (20170102) =

* Apart from the version number, this release is identical to
  4.5.3. Due to user error, it could not be completely uploaded to
  PyPI. Use 4.5.3 instead.

= 4.5.1 (20160802) =

* Fixed a crash when passing Unicode markup that contained a
  processing instruction into the lxml HTML parser on Python
  3. [bug=1608048]

= 4.5.0 (20160719) =

* Beautiful Soup is no longer compatible with Python 2.6. This
  actually happened a few releases ago, but it's now official.

* Beautiful Soup will now work with versions of html5lib greater than
  0.99999999. [bug=1603299]

* If a search against each individual value of a multi-valued
  attribute fails, the search will be run one final time against the
  complete attribute value considered as a single string. That is, if
  a tag has class="foo bar" and neither "foo" nor "bar" matches, but
  "foo bar" does, the tag is now considered a match.

  This happened in previous versions, but only when the value being
  searched for was a string. Now it also works when that value is
  a regular expression, a list of strings, etc. [bug=1476868]

* Fixed a bug that deranged the tree when a whitespace element was
  reparented into a tag that contained an identical whitespace
  element. [bug=1505351]

* Added support for CSS selector values that contain quoted spaces,
  such as tag[style="display: foo"]. [bug=1540588]

* Corrected handling of XML processing instructions. [bug=1504393]

* Corrected an encoding error that happened when a BeautifulSoup
  object was copied. [bug=1554439]

* The contents of <textarea> tags will no longer be modified when the
  tree is prettified. [bug=1555829]

* When a BeautifulSoup object is pickled but its tree builder cannot
  be pickled, its .builder attribute is set to None instead of being
  destroyed. This avoids a performance problem once the object is
  unpickled. [bug=1523629]

* Specify the file and line number when warning about a
  BeautifulSoup object being instantiated without a parser being
  specified. [bug=1574647]

* The `limit` argument to `select()` now works correctly, though it's
  not implemented very efficiently. [bug=1520530]

* Fixed a Python 3 BytesWarning when a URL was passed in as though it
  were markup. Thanks to James Salter for a patch and
  test. [bug=1533762]

* We don't run the check for a filename passed in as markup if the
  'filename' contains a less-than character; the less-than character
  indicates it's most likely a very small document. [bug=1577864]

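A sketch of the multi-valued attribute fallback and the selector changes
listed for this release, assuming a recent Beautiful Soup 4 release with
the built-in html.parser tree builder. The markup and the pattern are
invented for the example.

```python
import re

from bs4 import BeautifulSoup

markup = (
    '<p class="foo bar" style="display: none">first</p>'
    '<p class="bar">second</p>'
)
soup = BeautifulSoup(markup, "html.parser")

# Neither "foo" nor "bar" matches this pattern on its own, so the search
# falls back to the complete attribute value "foo bar".
pattern = re.compile(r"^foo bar$")
whole_value_matches = soup.find_all("p", class_=pattern)

# A quoted attribute value may contain spaces, and limit caps the result.
hidden = soup.select('p[style="display: none"]', limit=1)

print([tag.get_text() for tag in whole_value_matches])
print([tag.get_text() for tag in hidden])
```
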
= 4.4.1 (20150928) =

* Fixed a bug that deranged the tree when part of it was
  removed. Thanks to Eric Weiser for the patch and John Wiseman for a
  test. [bug=1481520]

* Fixed a parse bug with the html5lib tree-builder. Thanks to Roel
  Kramer for the patch. [bug=1483781]

* Improved the implementation of CSS selector grouping. Thanks to
  Orangain for the patch. [bug=1484543]

* Fixed the test_detect_utf8 test so that it works when chardet is
  installed. [bug=1471359]

* Corrected the output of Declaration objects. [bug=1477847]

= 4.4.0 (20150703) =

Especially important changes:

* Added a warning when you instantiate a BeautifulSoup object without
  explicitly naming a parser. [bug=1398866]

* __repr__ now returns an ASCII bytestring in Python 2, and a Unicode
  string in Python 3, instead of a UTF8-encoded bytestring in both
  versions. In Python 3, __str__ now returns a Unicode string instead
  of a bytestring. [bug=1420131]

* The `text` argument to the find_* methods is now called `string`,
  which is more accurate. `text` still works, but `string` is the
  argument described in the documentation. `text` may eventually
  change its meaning, but not for a very long time. [bug=1366856]

* Changed the way soup objects work under copy.copy(). Copying a
  NavigableString or a Tag will give you a new object that's
  equal to the old one but not connected to the parse tree. Patch by
  Martijn Pieters. [bug=1307490]

* Started using a standard MIT license. [bug=1294662]

* Added a Chinese translation of the documentation by Delong .w.

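A sketch of the `string` argument and the copy.copy() behavior from the
list above, assuming the built-in html.parser tree builder. The markup is
invented for the example.

```python
import copy

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><b>bold</b> text</div>", "html.parser")

# string= is the documented spelling of the older text= argument.
strings = soup.find_all(string="bold")

# A copied tag compares equal to the original but is detached from the tree.
b_copy = copy.copy(soup.b)

print(strings)
print(b_copy == soup.b, b_copy.parent, soup.b.parent.name)
```
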
New features:

* Introduced the select_one() method, which uses a CSS selector but
  only returns the first match, instead of a list of
  matches. [bug=1349367]

* You can now create a Tag object without specifying a
  TreeBuilder. Patch by Martijn Pieters. [bug=1307471]

* You can now create a NavigableString or a subclass just by invoking
  the constructor. [bug=1294315]

* Added an `exclude_encodings` argument to UnicodeDammit and to the
  Beautiful Soup constructor, which lets you prohibit the detection of
  an encoding that you know is wrong. [bug=1469408]

* The select() method now supports selector grouping. Patch by
  Francisco Canas [bug=1191917]

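A sketch contrasting select_one() with select(), plus a grouped selector,
assuming a recent Beautiful Soup 4 release with the built-in html.parser
tree builder. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Title</h1><p>a</p><p>b</p>", "html.parser")

# select() returns every match; select_one() returns the first one or None.
all_p = soup.select("p")
first_p = soup.select_one("p")
missing = soup.select_one("table")

# A comma groups selectors: anything matching either side is returned.
grouped = soup.select("h1, p")

print(len(all_p), first_p.get_text(), missing)
print([tag.name for tag in grouped])
```
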
Bug fixes:

* Fixed yet another problem that caused the html5lib tree builder to
  create a disconnected parse tree. [bug=1237763]

* Force object_was_parsed() to keep the tree intact even when an element
  from later in the document is moved into place. [bug=1430633]

* Fixed yet another bug that caused a disconnected tree when html5lib
  copied an element from one part of the tree to another. [bug=1270611]

* Fixed a bug where Element.extract() could create an infinite loop in
  the remaining tree.

* The select() method can now find tags whose names contain
  dashes. Patch by Francisco Canas. [bug=1276211]

* The select() method can now find tags with attributes whose names
  contain dashes. Patch by Marek Kapolka. [bug=1304007]

* Improved the lxml tree builder's handling of processing
  instructions. [bug=1294645]

* Restored the helpful syntax error that happens when you try to
  import the Python 2 edition of Beautiful Soup under Python
  3. [bug=1213387]

* In Python 3.4 and above, set the new convert_charrefs argument to
  the html.parser constructor to avoid a warning and future
  failures. Patch by Stefano Revera. [bug=1375721]

* The warning when you pass in a filename or URL as markup will now be
  displayed correctly even if the filename or URL is a Unicode
  string. [bug=1268888]

* If the initial <html> tag contains a CDATA list attribute such as
  'class', the html5lib tree builder will now turn its value into a
  list, as it would with any other tag. [bug=1296481]

* Fixed an import error in Python 3.5 caused by the removal of the
  HTMLParseError class. [bug=1420063]

* Improved docstring for encode_contents() and
  decode_contents(). [bug=1441543]

* Fixed a crash in Unicode, Dammit's encoding detector when the name
  of the encoding itself contained invalid bytes. [bug=1360913]

* Improved the exception raised when you call .unwrap() or
  .replace_with() on an element that's not attached to a tree.

* Raise a NotImplementedError whenever an unsupported CSS pseudoclass
  is used in select(). Previously some cases did not result in a
  NotImplementedError.

* It's now possible to pickle a BeautifulSoup object no matter which
  tree builder was used to create it. However, the only tree builder
  that survives the pickling process is the HTMLParserTreeBuilder
  ('html.parser'). If you unpickle a BeautifulSoup object created with
  some other tree builder, soup.builder will be None. [bug=1231545]

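A sketch of the pickling round trip mentioned in the last entry, assuming
the built-in html.parser tree builder (the one the entry says survives
pickling). The markup is invented, and what soup.builder holds after
unpickling may differ between releases, so the sketch only compares the
document itself.

```python
import pickle

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# Serialize the whole parse tree and load it back.
restored = pickle.loads(pickle.dumps(soup))

print(restored.decode())
print(restored.b.string)
```
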
= 4.3.2 (20131002) =

* Fixed a bug in which short Unicode input was improperly encoded to
  ASCII when checking whether or not it was the name of a file on
  disk. [bug=1227016]

* Fixed a crash when a short input contains data not valid in
  filenames. [bug=1232604]

* Fixed a bug that caused Unicode data put into UnicodeDammit to
  return None instead of the original data. [bug=1214983]

* Combined two tests to stop a spurious test failure when tests are
  run by nosetests. [bug=1212445]

= 4.3.1 (20130815) =

* Fixed yet another problem with the html5lib tree builder, caused by
  html5lib's tendency to rearrange the tree during
  parsing. [bug=1189267]

* Fixed a bug that caused the optimized version of find_all() to
  return nothing. [bug=1212655]

= 4.3.0 (20130812) =

* Instead of converting incoming data to Unicode and feeding it to the
  lxml tree builder in chunks, Beautiful Soup now makes successive
  guesses at the encoding of the incoming data, and tells lxml to
  parse the data as that encoding. Giving lxml more control over the
  parsing process improves performance and avoids a number of bugs and
  issues with the lxml parser which had previously required elaborate
  workarounds:

  - An issue in which lxml refuses to parse Unicode strings on some
    systems. [bug=1180527]

  - A returning bug that truncated documents longer than a (very
    small) size. [bug=963880]

  - A returning bug in which extra spaces were added to a document if
    the document defined a charset other than UTF-8. [bug=972466]

  This required a major overhaul of the tree builder architecture. If
  you wrote your own tree builder and didn't tell me, you'll need to
  modify your prepare_markup() method.

* The UnicodeDammit code that makes guesses at encodings has been
  split into its own class, EncodingDetector. A lot of apparently
  redundant code has been removed from Unicode, Dammit, and some
  undocumented features have also been removed.

* Beautiful Soup will issue a warning if instead of markup you pass it
  a URL or the name of a file on disk (a common beginner's mistake).

* A number of optimizations improve the performance of the lxml tree
  builder by about 33%, the html.parser tree builder by about 20%, and
  the html5lib tree builder by about 15%.

* All find_all calls should now return a ResultSet object. Patch by
  Aaron DeVore. [bug=1194034]

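A sketch of two user-visible pieces of this release: find_all() returning
a ResultSet, and UnicodeDammit decoding bytes. It assumes a recent
Beautiful Soup 4 release. The encoding is passed in explicitly so the
result does not depend on which detection libraries are installed, and the
sample data is invented.

```python
from bs4 import BeautifulSoup
from bs4.dammit import UnicodeDammit
from bs4.element import ResultSet

soup = BeautifulSoup("<p>a</p><p>b</p>", "html.parser")

# find_all() hands back a ResultSet, which behaves like a list.
paragraphs = soup.find_all("p")
is_result_set = isinstance(paragraphs, ResultSet)

# UnicodeDammit tries the encodings it is given before guessing.
data = "Sacr\u00e9 bleu!".encode("utf-8")
dammit = UnicodeDammit(data, ["utf-8"])

print(is_result_set, len(paragraphs))
print(dammit.unicode_markup, dammit.original_encoding)
```
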
= 4.2.1 (20130531) =

* The default XML formatter will now replace ampersands even if they
  appear to be part of entities. That is, "&lt;" will become
  "&amp;lt;". The old code was left over from Beautiful Soup 3, which
  didn't always turn entities into Unicode characters.

  If you really want the old behavior (maybe because you add new
  strings to the tree, those strings include entities, and you want
  the formatter to leave them alone on output), it can be found in
  EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]

* Gave new_string() the ability to create subclasses of
  NavigableString. [bug=1181986]

* Fixed another bug by which the html5lib tree builder could create a
  disconnected tree. [bug=1182089]

* The .previous_element of a BeautifulSoup object is now always None,
  not the last element to be parsed. [bug=1182089]

* Fixed test failures when lxml is not installed. [bug=1181589]

* html5lib now supports Python 3. Fixed some Python 2-specific
  code in the html5lib test suite. [bug=1181624]

* The html.parser treebuilder can now handle numeric attributes in
  text when the hexadecimal name of the attribute starts with a
  capital X. Patch by Tim Shirley. [bug=1186242]

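A sketch of the two entity-substitution behaviors described for this
release, followed by new_string() creating a NavigableString subclass. It
assumes a recent Beautiful Soup 4 release; the sample text is invented.

```python
from bs4 import BeautifulSoup, Comment
from bs4.dammit import EntitySubstitution

text = "AT&T &lt; <b>"

# The default behavior escapes every ampersand, even inside "&lt;".
escaped_everything = EntitySubstitution.substitute_xml(text)

# The old behavior leaves things that look like entities alone.
kept_entities = EntitySubstitution.substitute_xml_containing_entities(text)

print(escaped_everything)
print(kept_entities)

# new_string() can build a NavigableString subclass such as Comment.
soup = BeautifulSoup("<p></p>", "html.parser")
note = soup.new_string("just a note", Comment)
soup.p.append(note)
print(soup.p.decode())
```
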
= 4.2.0 (20130514) =

* The Tag.select() method now supports a much wider variety of CSS
  selectors.

  - Added support for the adjacent sibling combinator (+) and the
    general sibling combinator (~). Tests by "liquider". [bug=1082144]

  - The combinators (>, +, and ~) can now combine with any supported
    selector, not just one that selects based on tag name.

  - Added limited support for the "nth-of-type" pseudo-class. Code
    by Sven Slootweg. [bug=1109952]

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
  it quicker to type the import statement in an interactive session:

  from bs4 import _s
   or
  from bs4 import _soup

  The alias may change in the future, so don't use this in code you're
  going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
  functions for reporting problems and doing tech support.

  - diagnose(data) tries the given markup on every installed parser,
    reporting exceptions and displaying successes. If a parser is not
    installed, diagnose() mentions this fact.

  - lxml_trace(data, html=True) runs the given markup through lxml's
    XML parser or HTML parser, and prints out the parser events as
    they happen. This helps you quickly determine whether a given
    problem occurs in lxml code or Beautiful Soup code.

  - htmlparser_trace(data) is the same thing, but for Python's
    built-in HTMLParser class.

* In an HTML document, the contents of a <script> or <style> tag will
  no longer undergo entity substitution by default. XML documents work
  the same way they did before. [bug=1085953]

* Methods like get_text() and properties like .strings now only give
  you strings that are visible in the document--no comments or
  processing commands. [bug=1050164]

* The prettify() method now leaves the contents of <pre> tags
  alone. [bug=1095654]

* Fix a bug in the html5lib treebuilder which sometimes created
  disconnected trees. [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
  an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
  being passed on. [bug=1126734]

* Stop a crash when unwisely messing with a tag that's been
  decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
  corresponding problem on the Beautiful Soup end that was previously
  invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
  anything. Code by Stefaan Lippens. [bug=1168167]

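A sketch of the selector features and the get_text() change from this
release, assuming a recent Beautiful Soup 4 release with the built-in
html.parser tree builder. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

markup = (
    '<div><h2 id="title">Title</h2>'
    '<p class="a">one</p><p class="b">two</p>'
    "<!-- hidden note --></div>"
)
soup = BeautifulSoup(markup, "html.parser")

# "+" matches the sibling immediately after; "~" matches any later sibling.
adjacent = soup.select("h2 + p")
siblings = soup.select("h2 ~ p")

# Combinators work with selectors other than bare tag names.
child_by_class = soup.select("div > p.b")
second_p = soup.select("p:nth-of-type(2)")

# get_text() skips the comment and returns only visible strings.
visible_text = soup.get_text()

print([tag.get_text() for tag in adjacent])
print([tag.get_text() for tag in siblings])
print(visible_text)
```
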
= 4.1.3 (20120820) =

* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
  test failure caused by the lousy HTMLParser in those
  versions. [bug=1038503]

* Raise a more specific error (FeatureNotFound) when a requested
  parser or parser feature is not installed. Raise NotImplementedError
  instead of ValueError when the user calls insert_before() or
  insert_after() on the BeautifulSoup object itself. Patch by Aaron
  DeVore. [bug=1038301]

252
by Leonard Richardson
Prep for release. |
813 |
= 4.1.2 (20120817) =
|
245
by Leonard Richardson
Use logging.warning() instead of warning.warn() to notify the user that characters were replaced with REPLACEMENT CHARACTER. [bug=1013862] |
814 |
|
251
by Leonard Richardson
As per PEP-8, allow searching by CSS class using the 'class_' |
815 |
* As per PEP-8, allow searching by CSS class using the 'class_' |
816 |
keyword argument. [bug=1037624]
|
|
817 |
||
255
by Leonard Richardson
Fixed a crash on encoding when an attribute name contained |
818 |
* Display namespace prefixes for namespaced attribute names, instead of
|
250
by Leonard Richardson
Use namespace prefixes for namespaced attribute names, instead of |
819 |
the fully-qualified names given by the lxml parser. [bug=1037597]
|
820 |
||
255
by Leonard Richardson
Fixed a crash on encoding when an attribute name contained |
821 |
* Fixed a crash on encoding when an attribute name contained
|
822 |
non-ASCII characters.
|
|
823 |
||
251
by Leonard Richardson
As per PEP-8, allow searching by CSS class using the 'class_' |
824 |
* When sniffing encodings, if the cchardet library is installed,
|
258
by Leonard Richardson
Skipped a test under Python 2.6 to avoid a spurious test failure. [bug=1038503] |
825 |
Beautiful Soup uses it instead of chardet. cchardet is much
|
251
by Leonard Richardson
As per PEP-8, allow searching by CSS class using the 'class_' |
826 |
faster. [bug=1020748]
|
246
by Leonard Richardson
When sniffing encodings, if the cchardet library is installed, use it instead of chardet. It's much faster. [bug=1020748] |
827 |
|
245
by Leonard Richardson
Use logging.warning() instead of warning.warn() to notify the user that characters were replaced with REPLACEMENT CHARACTER. [bug=1013862] |
828 |
* Use logging.warning() instead of warnings.warn() to notify the user
|
829 |
that characters were replaced with REPLACEMENT
|
|
830 |
CHARACTER. [bug=1013862]
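
The practical difference between the two notification channels named in the last item can be sketched with the standard library alone. The message text and the ListHandler class are made up for the example; this is not Beautiful Soup's own code.

```python
import logging
import warnings


class ListHandler(logging.Handler):
    """Hypothetical handler that just remembers the messages it receives."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
root = logging.getLogger()
root.addHandler(handler)

# A message sent through logging goes to whatever handlers are configured.
logging.warning("Some characters could not be decoded.")

# A message sent through the warnings module is subject to warning filters
# and has to be captured (or silenced) through that machinery instead.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn("Some characters could not be decoded.")

root.removeHandler(handler)
```

An application that wants to silence or redirect the notice therefore configures logging rather than a warnings filter.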
|
|
831 |
||
243
by Leonard Richardson
get_text() now returns an empty Unicode string if there is no text, rather than an empty bytestring. [bug=1020387] |
832 |
= 4.1.1 (20120703) =
|
239
by Leonard Richardson
Fixed an html5lib tree builder crash which happened when html5lib |
833 |
|
241
by Leonard Richardson
Fixed a typo that made parsing much slower than it should have been. [bug=1020268] |
834 |
* Fixed an html5lib tree builder crash which happened when html5lib
|
243
by Leonard Richardson
get_text() now returns an empty Unicode string if there is no text, rather than an empty bytestring. [bug=1020387] |
835 |
moved a tag with a multivalued attribute from one part of the tree
|
836 |
to another. [bug=1019603]
|
|
239
by Leonard Richardson
Fixed an html5lib tree builder crash which happened when html5lib |
837 |
|
243
by Leonard Richardson
get_text() now returns an empty Unicode string if there is no text, rather than an empty bytestring. [bug=1020387] |
838 |
* Correctly display closing tags with an XML namespace declared. Patch
|
241
by Leonard Richardson
Fixed a typo that made parsing much slower than it should have been. [bug=1020268] |
839 |
by Andreas Kostyrka. [bug=1019635]
|
840 |
||
841 |
* Fixed a typo that made parsing significantly slower than it should
|
|
243
by Leonard Richardson
get_text() now returns an empty Unicode string if there is no text, rather than an empty bytestring. [bug=1020387] |
842 |
have been, and also waited too long to close tags with XML
|
843 |
namespaces. [bug=1020268]
|
|
844 |
||
845 |
* get_text() now returns an empty Unicode string if there is no text,
|
|
846 |
rather than an empty bytestring. [bug=1020387]
|
|
241
by Leonard Richardson
Fixed a typo that made parsing much slower than it should have been. [bug=1020268] |
847 |
|
236
by Leonard Richardson
Prep for release. |
848 |
= 4.1.0 (20120529) =
|
228
by Leonard Richardson
Added experimental support for fixing Windows-1252 characters embedded in UTF-8 documents. |
849 |
|
850 |
* Added experimental support for fixing Windows-1252 characters
|
|
232
by Leonard Richardson
Fixed a bug with the lxml treebuilder that prevented the user from adding attributes to a tag that didn't originally have any. [bug=1002378] Thanks to Oliver Beattie for the patch. |
851 |
embedded in UTF-8 documents. (UnicodeDammit.detwingle())
|
228
by Leonard Richardson
Added experimental support for fixing Windows-1252 characters embedded in UTF-8 documents. |
852 |
|
230
by Leonard Richardson
Fixed the handling of &quot; with the built-in parser. [bug=993871] |
853 |
* Fixed the handling of &quot; with the built-in parser. [bug=993871]
|
854 |
||
231
by Leonard Richardson
Comments, processing instructions, document type declarations, and markup declarations are now treated as preformatted strings, the way CData blocks are. [bug=1001025] Also in this commit: renamed detwingle method to detwingle(). |
855 |
* Comments, processing instructions, document type declarations, and
|
856 |
markup declarations are now treated as preformatted strings, the way
|
|
857 |
CData blocks are. [bug=1001025]
|
|
858 |
||
232
by Leonard Richardson
Fixed a bug with the lxml treebuilder that prevented the user from adding attributes to a tag that didn't originally have any. [bug=1002378] Thanks to Oliver Beattie for the patch. |
859 |
* Fixed a bug with the lxml treebuilder that prevented the user from
|
860 |
adding attributes to a tag that didn't originally have |
|
236
by Leonard Richardson
Prep for release. |
861 |
attributes. [bug=1002378] Thanks to Oliver Beattie for the patch. |
232
by Leonard Richardson
Fixed a bug with the lxml treebuilder that prevented the user from adding attributes to a tag that didn't originally have any. [bug=1002378] Thanks to Oliver Beattie for the patch. |
862 |
|
233
by Leonard Richardson
Fixed some edge-case bugs having to do with inserting an element |
863 |
* Fixed some edge-case bugs having to do with inserting an element |
864 |
into a tag it's already inside, and replacing one of a tag's |
|
865 |
children with another. [bug=997529] |
|
866 |
||
236
by Leonard Richardson
Prep for release. |
867 |
* Added the ability to search for attribute values specified in UTF-8. [bug=1003974] |
235
by Leonard Richardson
Fixed the inability to search for non-ASCII attribute |
868 |
|
869 |
This caused a major refactoring of the search code. All the tests |
|
870 |
pass, but it's possible that some searches will behave differently. |
|
234
by Leonard Richardson
Fixed the basic failure in [bug=1003974], but not more advanced cases. |
871 |
|
225
by Leonard Richardson
Prep for release. |
872 |
= 4.0.5 (20120427) =
|
214
by Leonard Richardson
Fixed a bug that made the HTMLParser treebuilder generate XML definitions ending with two question marks instead of one. [bug=984258] |
873 |
|
229
by Leonard Richardson
Fixed NEWS. |
874 |
* Added a new method, wrap(), which wraps an element in a tag.
|
224
by Leonard Richardson
Added a new method, wrap(). |
875 |
|
223
by Leonard Richardson
Renamed replace_with_children() to the jQuery name, unwrap(). |
876 |
* Renamed replace_with_children() to unwrap(), which is easier to
|
877 |
understand and also the jQuery name of the function.
|
|
878 |
||
217
by Leonard Richardson
Made encoding substitution in <meta> tags completely transparent (no more %SOUP-ENCODING%). |
879 |
* Made encoding substitution in <meta> tags completely transparent (no
|
880 |
more %SOUP-ENCODING%).
|
|
881 |
||
222
by Leonard Richardson
Fixed a bug in decoding data that contained a byte-order mark, such as data encoded in UTF-16LE. [bug=988980] |
882 |
* Fixed a bug in decoding data that contained a byte-order mark, such
|
883 |
as data encoded in UTF-16LE. [bug=988980]
|
|
884 |
||
214
by Leonard Richardson
Fixed a bug that made the HTMLParser treebuilder generate XML definitions ending with two question marks instead of one. [bug=984258] |
885 |
* Fixed a bug that made the HTMLParser treebuilder generate XML
|
886 |
definitions ending with two question marks instead of
|
|
887 |
one. [bug=984258]
|
|
888 |
||
221
by Leonard Richardson
Upon document generation, CData objects are no longer run through the formatter. [bug=988905] |
889 |
* Upon document generation, CData objects are no longer run through
|
890 |
the formatter. [bug=988905]
|
|
891 |
||
220
by Leonard Richardson
The test suite now passes when lxml is not installed, whether or not html5lib is installed. [bug=987004] |
892 |
* The test suite now passes when lxml is not installed, whether or not
|
893 |
html5lib is installed. [bug=987004]
|
|
894 |
||
215
by Leonard Richardson
Print a warning on HTMLParseErrors to let people know they should install an external parser. |
895 |
* Print a warning on HTMLParseErrors to let people know they should
|
896 |
install a better parser library.
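
A short sketch of the wrap() and unwrap() methods mentioned in the list above, assuming a current Beautiful Soup 4 install and the built-in "html.parser" tree builder; the markup is illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# wrap() puts the string inside a newly created <b> tag.
soup.p.string.wrap(soup.new_tag("b"))
wrapped = str(soup.p)

# unwrap() (formerly replace_with_children()) removes the <b> tag again
# and leaves its contents in place.
soup.p.b.unwrap()
unwrapped = str(soup.p)
```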
|
|
897 |
||
213
by Leonard Richardson
Prep for release. |
898 |
= 4.0.4 (20120416) =
|
205
by Leonard Richardson
Have objects_was_parsed set the previous element's next_element if possible. [bug=975926] |
899 |
|
900 |
* Fixed a bug that sometimes created disconnected trees.
|
|
901 |
||
209
by Leonard Richardson
Fixed a bug with the string setter that moved a string around the |
902 |
* Fixed a bug with the string setter that moved a string around the
|
903 |
tree instead of copying it. [bug=983050]
|
|
904 |
||
210
by Leonard Richardson
Attribute values are now run through the provided output formatter. Previously they were always run through the 'minimal' formatter. [bug=980237] |
905 |
* Attribute values are now run through the provided output formatter.
|
906 |
Previously they were always run through the 'minimal' formatter. In |
|
907 |
the future I may make it possible to specify different formatters
|
|
908 |
for attribute values and strings, but for now, consistent behavior
|
|
909 |
is better than inconsistent behavior. [bug=980237]
|
|
910 |
||
206
by Leonard Richardson
Added renderContents back. |
911 |
* Added the missing renderContents method from Beautiful Soup 3. Also
|
912 |
added an encode_contents() method to go along with decode_contents().
|
|
913 |
||
208
by Leonard Richardson
Give a more useful error when the user tries to run the Python 2 version of BS under Python 3. |
914 |
* Give a more useful error when the user tries to run the Python 2
|
915 |
version of BS under Python 3.
|
|
916 |
||
211
by Leonard Richardson
Unicode, Dammit now has an option to turn MS smart quotes into ASCII characters. |
917 |
* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
|
918 |
UnicodeDammit(markup, smart_quotes_to="ascii").
|
|
919 |
||
204
by Leonard Richardson
Prep for release. |
920 |
= 4.0.3 (20120403) =
|
197
by Leonard Richardson
Fixed a typo that caused some versions of Python 3 to convert the Beautiful Soup codebase incorrectly. |
921 |
|
922 |
* Fixed a typo that caused some versions of Python 3 to convert the
|
|
923 |
Beautiful Soup codebase incorrectly.
|
|
924 |
||
203
by Leonard Richardson
Got rid of the 4.0.2 workaround for HTML documents--it was unnecessary and the workaround was triggering a (possibly different, but related) bug in lxml. [bug=972466] |
925 |
* Got rid of the 4.0.2 workaround for HTML documents--it was
|
926 |
unnecessary and the workaround was triggering a (possibly different,
|
|
927 |
but related) bug in lxml. [bug=972466]
|
|
928 |
||
196
by Leonard Richardson
Prep for release. |
929 |
= 4.0.2 (20120326) =
|
194
by Leonard Richardson
Fixed a bug where specifying 'text' while searching for a tag only worked if 'text' specified an exact string match. [bug=955942] |
930 |
|
195
by Leonard Richardson
Pass data into XMLParser.feed() in chunks. [bug=963880] |
931 |
* Worked around a possible bug in lxml that prevents non-tiny XML
|
932 |
documents from being parsed. [bug=963880, bug=963936]
|
|
933 |
||
196
by Leonard Richardson
Prep for release. |
934 |
* Fixed a bug where specifying `text` while also searching for a tag
|
935 |
only worked if `text` wanted an exact string match. [bug=955942]
|
|
194
by Leonard Richardson
Fixed a bug where specifying 'text' while searching for a tag only worked if 'text' specified an exact string match. [bug=955942] |
936 |
|
188
by Leonard Richardson
Bumped version number. |
937 |
= 4.0.1 (20120314) =
|
938 |
||
939 |
* This is the first official release of Beautiful Soup 4. There is no
|
|
940 |
4.0.0 release, to eliminate any possibility that packaging software
|
|
941 |
might treat "4.0.0" as being an earlier version than "4.0.0b10".
|
|
187
by Leonard Richardson
Brought the soupselect port up to date. |
942 |
|
943 |
* Brought BS up to date with the latest release of soupselect, adding
|
|
944 |
CSS selector support for direct descendant matches and multiple CSS
|
|
945 |
class matches.
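
The concern behind skipping 4.0.0 can be illustrated with plain string comparison. This is only a sketch of how a simplistic tool might order version strings; it is not a claim about how any particular packaging tool behaves.

```python
final = "4.0.0"
beta = "4.0.0b10"

# "4.0.0" is a prefix of "4.0.0b10", so as a plain string it sorts first
# and would look like the older of the two.
naive_order = sorted([beta, final])

# The version that was actually published sorts after the beta,
# even under this naive comparison.
published = "4.0.1"
safe_order = sorted([beta, published])
```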
|
|
946 |
||
185
by Leonard Richardson
Fixed a bug that caused calling a tag to sometimes call find_all() with the wrong arguments. [bug=944426] |
947 |
= 4.0.0b10 (20120302) =
|
179.1.3
by Leonard Richardson
Test that CSS selectors work within the tree as well as at the top level. |
948 |
|
179.1.4
by Leonard Richardson
Updated docs. |
949 |
* Added support for simple CSS selectors, taken from the soupselect project.
|
179.1.3
by Leonard Richardson
Test that CSS selectors work within the tree as well as at the top level. |
950 |
|
185
by Leonard Richardson
Fixed a bug that caused calling a tag to sometimes call find_all() with the wrong arguments. [bug=944426] |
951 |
* Fixed a crash when using html5lib. [bug=943246]
|
952 |
||
182
by Leonard Richardson
In HTML5-style <meta charset="foo"> tags, the value of the "charset" attribute is now replaced with the appropriate encoding on output. [bug=942714] |
953 |
* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
|
185
by Leonard Richardson
Fixed a bug that caused calling a tag to sometimes call find_all() with the wrong arguments. [bug=944426] |
954 |
attribute is now replaced with the appropriate encoding on
|
955 |
output. [bug=942714]
|
|
956 |
||
957 |
* Fixed a bug that caused calling a tag to sometimes call find_all()
|
|
958 |
with the wrong arguments. [bug=944426]
|
|
182
by Leonard Richardson
In HTML5-style <meta charset="foo"> tags, the value of the "charset" attribute is now replaced with the appropriate encoding on output. [bug=942714] |
959 |
|
184
by Leonard Richardson
For backwards compatibility, brought back the BeautifulStoneSoup class as a deprecated wrapper around BeautifulSoup. |
960 |
* For backwards compatibility, brought back the BeautifulStoneSoup
|
961 |
class as a deprecated wrapper around BeautifulSoup.
|
|
962 |
||
185
by Leonard Richardson
Fixed a bug that caused calling a tag to sometimes call find_all() with the wrong arguments. [bug=944426] |
963 |
= 4.0.0b9 (20120228) =
|
175
by Leonard Richardson
Renamed Tag.nsprefix to Tag.prefix, for consistency with NamespacedAttribute. |
964 |
|
177
by Leonard Richardson
Fixed DOCTYPE handling. |
965 |
* Fixed the string representation of DOCTYPEs that have both a public
|
966 |
ID and a system ID.
|
|
967 |
||
179
by Leonard Richardson
Fixed the generated XML declaration. |
968 |
* Fixed the generated XML declaration.
|
969 |
||
175
by Leonard Richardson
Renamed Tag.nsprefix to Tag.prefix, for consistency with NamespacedAttribute. |
970 |
* Renamed Tag.nsprefix to Tag.prefix, for consistency with
|
971 |
NamespacedAttribute.
|
|
972 |
||
421.1.1
by Ville Skyttä
Spelling fixes |
973 |
* Fixed a test failure that occurred on Python 3.x when chardet was
|
176
by Leonard Richardson
Fixed a test failure that occured on Python 3.x when chardet was installed. |
974 |
installed.
|
975 |
||
178
by Leonard Richardson
Make prettify() return Unicode by default, so it will look nice when passed into print() under Python 3. |
976 |
* Made prettify() return Unicode by default, so it will look nice on
|
977 |
Python 3 when passed into print().
|
|
978 |
||
185
by Leonard Richardson
Fixed a bug that caused calling a tag to sometimes call find_all() with the wrong arguments. [bug=944426] |
979 |
= 4.0.0b8 (20120224) =
|
158.1.10
by Leonard Richardson
Bumped version number. |
980 |
|
981 |
* All tree builders now preserve namespace information in the
|
|
174
by Leonard Richardson
I keep typing assertEquals. |
982 |
documents they parse. If you use the html5lib parser or lxml's XML |
983 |
parser, you can access the namespace URL for a tag as tag.namespace. |
|
158.1.10
by Leonard Richardson
Bumped version number. |
984 |
|
985 |
However, there is no special support for namespace-oriented |
|
986 |
searching or tree manipulation. When you search the tree, you need |
|
987 |
to use namespace prefixes exactly as they're used in the original |
|
988 |
document.
|
|
989 |
||
158.1.11
by Leonard Richardson
Fixed handling of the closing of namespaced tags. |
990 |
* The string representation of a DOCTYPE always ends in a newline.
|
991 |
||
173
by Leonard Richardson
Warn when SoupStrainer is used with the html5lib tree builder. |
992 |
* Issue a warning if the user tries to use a SoupStrainer in
|
993 |
conjunction with the html5lib tree builder, which doesn't support |
|
994 |
them. |
|
995 |
||
185
by Leonard Richardson
Fixed a bug that caused calling a tag to sometimes call find_all() with the wrong arguments. [bug=944426] |
996 |
= 4.0.0b7 (20120223) = |
157
by Leonard Richardson
Issue a warning if characters were replaced with REPLACEMENT CHARACTER during Unicode conversion. |
997 |
|
158
by Leonard Richardson
By default, turn unrecognized characters into numeric XML entity refs. |
998 |
* Upon decoding to string, any characters that can't be represented in |
999 |
your chosen encoding will be converted into numeric XML entity
|
|
1000 |
references.
|
|
1001 |
||
157
by Leonard Richardson
Issue a warning if characters were replaced with REPLACEMENT CHARACTER during Unicode conversion. |
1002 |
* Issue a warning if characters were replaced with REPLACEMENT
|
1003 |
CHARACTER during Unicode conversion.
|
|
1004 |
||
160
by Leonard Richardson
Added code from 2.7's standard library so that the tests will run on Python 2.6. |
1005 |
* Restored compatibility with Python 2.6.
|
1006 |
||
421.1.1
by Ville Skyttä
Spelling fixes |
1007 |
* The install process no longer installs docs or auxiliary text files.
|
169
by Leonard Richardson
It's now possible to copy a BeautifulSoup object created with the html.parser treebuilder. |
1008 |
|
1009 |
* It's now possible to deepcopy a BeautifulSoup object created with |
|
1010 |
Python's built-in HTML parser. |
|
1011 |
||
169.1.6
by Leonard Richardson
Updated NEWS. |
1012 |
* About 100 unit tests that "test" the behavior of various parsers on
|
1013 |
invalid markup have been removed. Legitimate changes to those
|
|
1014 |
parsers caused these tests to fail, indicating that perhaps
|
|
1015 |
Beautiful Soup should not test the behavior of foreign
|
|
1016 |
libraries.
|
|
1017 |
||
1018 |
The problematic unit tests have been reformulated as informational
|
|
1019 |
comparisons generated by the script
|
|
1020 |
scripts/demonstrate_parser_differences.py.
|
|
1021 |
||
1022 |
This makes Beautiful Soup compatible with html5lib version 0.95 and
|
|
1023 |
future versions of HTMLParser.
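
The first item in this release (unrepresentable characters becoming numeric XML entity references) resembles what Python's own "xmlcharrefreplace" error handler does. The sketch below uses only the standard library and is an analogy; whether Beautiful Soup relies on this handler internally is not stated here.

```python
snowman = "\N{SNOWMAN}"  # U+2603, which ASCII cannot represent

# Characters outside the target encoding become numeric entity references.
encoded = ("<p>" + snowman + "</p>").encode("ascii", errors="xmlcharrefreplace")
```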
|
|
1024 |
||
185
by Leonard Richardson
Fixed a bug that caused calling a tag to sometimes call find_all() with the wrong arguments. [bug=944426] |
1025 |
= 4.0.0b6 (20120216) =
|
150.1.8
by Leonard Richardson
Added to NEWS. |
1026 |
|
157
by Leonard Richardson
Issue a warning if characters were replaced with REPLACEMENT CHARACTER during Unicode conversion. |
1027 |
* Multi-valued attributes like "class" always have a list of values,
|
1028 |
even if there's only one value in the list. |
|
1029 |
||
1030 |
* Added a number of multi-valued attributes defined in HTML5. |
|
154
by Leonard Richardson
The value of multi-valued attributes like class are always turned into a list, even if there's only one value. |
1031 |
|
155
by Leonard Richardson
Added a kind of hacky way to interpret the restriction class='foo bar'. Stop generating a space before the slash that closes an empty-element tag. |
1032 |
* Stopped generating a space before the slash that closes an |
1033 |
empty-element tag. This may come back if I add a special XHTML mode |
|
1034 |
(http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty |
|
1035 |
useless.
|
|
1036 |
||
152
by Leonard Richardson
Better defined behavior when the user wants to search for a combination of text and tag-specific arguments. [bug=695312] |
1037 |
* Passing text along with tag-specific arguments to a find* method:
|
1038 |
||
1039 |
find("a", text="Click here")
|
|
1040 |
||
1041 |
will find tags that contain the given text as their
|
|
1042 |
.string. Previously, the tag-specific arguments were ignored and
|
|
1043 |
only strings were searched.
|
|
1044 |
||
150.1.8
by Leonard Richardson
Added to NEWS. |
1045 |
* Fixed a bug that caused the html5lib tree builder to build a
|
1046 |
partially disconnected tree. Generally cleaned up the html5lib tree
|
|
1047 |
builder.
|
|
1048 |
||
155
by Leonard Richardson
Added a kind of hacky way to interpret the restriction class='foo bar'. Stop generating a space before the slash that closes an empty-element tag. |
1049 |
* If you restrict a multi-valued attribute like "class" to a string
|
1050 |
that contains spaces, Beautiful Soup will only consider it a match
|
|
1051 |
if the values correspond to that specific string.
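
A minimal sketch of the multi-valued attribute behavior described in this release, assuming a current Beautiful Soup 4 install and the built-in "html.parser" tree builder; the markup is made up for the example.

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout">one</p><p class="body">two</p>'
soup = BeautifulSoup(markup, "html.parser")
first, second = soup.find_all("p")

# "class" is always a list, even when there is only one value.
first_classes = first["class"]
second_classes = second["class"]

# Searching for a single class matches any tag that has it.
any_body = soup.find_all("p", "body")

# A restriction containing spaces is compared against the whole string.
exact = soup.find_all("p", attrs={"class": "body strikeout"})
```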
|
|
1052 |
||
149
by Leonard Richardson
Bumped version number. |
1053 |
= 4.0.0b5 (20120209) =
|
138
by Leonard Richardson
Rationalized the treatment of multi-valued HTML attributes such as 'class' |
1054 |
|
1055 |
* Rationalized Beautiful Soup's treatment of CSS class. A tag |
|
1056 |
belonging to multiple CSS classes is treated as having a list of |
|
1057 |
values for the 'class' attribute. Searching for a CSS class will |
|
1058 |
match *any* of the CSS classes. |
|
1059 |
||
1060 |
This actually affects all attributes that the HTML standard defines |
|
1061 |
as taking multiple values (class, rel, rev, archive, accept-charset, |
|
148
by Leonard Richardson
Added bug reference. |
1062 |
and headers), but 'class' is by far the most common. [bug=41034] |
138
by Leonard Richardson
Rationalized the treatment of multi-valued HTML attributes such as 'class' |
1063 |
|
1064 |
* If you pass anything other than a dictionary as the second argument |
|
1065 |
to one of the find* methods, it'll assume you want to use that |
|
1066 |
object to search against a tag's CSS classes. Previously this only |
|
1067 |
worked if you passed in a string. |
|
1068 |
||
140
by Leonard Richardson
Fixed a bug that caused a crash when you passed a dictionary as an attribute value (possibly because you mistyped attrs). [bug=842419] |
1069 |
* Fixed a bug that caused a crash when you passed a dictionary as an |
1070 |
attribute value (possibly because you mistyped "attrs"). [bug=842419] |
|
1071 |
||
144
by Leonard Richardson
Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags like <meta charset="utf-8" />. [bug=837268] |
1072 |
* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags |
1073 |
like <meta charset="utf-8" />. [bug=837268] |
|
1074 |
||
146
by Leonard Richardson
As a last-ditch attempt to turn data into Unicode, use errors=replace instead of errors=strict. |
1075 |
* If Unicode, Dammit can't figure out a consistent encoding for a |
1076 |
page, it will try each of its guesses again, with errors="replace"
|
|
1077 |
instead of errors="strict". This may mean that some data gets
|
|
1078 |
replaced with REPLACEMENT CHARACTER, but at least most of it will
|
|
1079 |
get turned into Unicode. [bug=754903]
|
|
1080 |
||
145
by Leonard Richardson
Patched over a bug in html5lib (?) that was crashing Beautiful Soup on certain kinds of markup. [bug=838800] |
1081 |
* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
|
1082 |
on certain kinds of markup. [bug=838800]
|
|
1083 |
||
141
by Leonard Richardson
Fixed a bug that wrecked the tree if you replaced an element with an empty string. [bug=728697] |
1084 |
* Fixed a bug that wrecked the tree if you replaced an element with an
|
1085 |
empty string. [bug=728697]
|
|
1086 |
||
142
by Leonard Richardson
Improved Unicode, Dammit's behavior when you give it Unicode to begin with. |
1087 |
* Improved Unicode, Dammit's behavior when you give it Unicode to |
1088 |
begin with. |
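
The difference between errors="strict" and errors="replace" mentioned above can be seen with Python's built-in codecs. This sketch only shows the decoding behavior; the guessing logic in Unicode, Dammit is not reproduced, and the sample bytes are made up.

```python
# Windows-1252 bytes that are not valid UTF-8.
data = "caf\u00e9".encode("windows-1252")

# A strict decode with the wrong guess fails outright.
try:
    data.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# A lenient decode keeps most of the text and substitutes
# REPLACEMENT CHARACTER (U+FFFD) for the byte it could not interpret.
lenient = data.decode("utf-8", errors="replace")
```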
|
1089 |
||
134
by Leonard Richardson
Moved the historical changelog into NEWS. |
1090 |
= 4.0.0b4 (20120208) = |
131
by Leonard Richardson
Moved around a bunch of metadata. |
1091 |
|
1092 |
* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag() |
|
1093 |
||
1094 |
* BeautifulSoup.new_tag() will follow the rules of whatever |
|
1095 |
tree-builder was used to create the original BeautifulSoup object. A |
|
1096 |
new <p> tag will look like "<p />" if the soup object was created to |
|
1097 |
parse XML, but it will look like "<p></p>" if the soup object was |
|
1098 |
created to parse HTML. |
|
1099 |
||
1100 |
* We pass in strict=False to html.parser on Python 3, greatly |
|
1101 |
improving html.parser's ability to handle bad HTML. |
|
1102 |
||
1103 |
* We also monkeypatch a serious bug in html.parser that made
|
|
1104 |
strict=False disastrous on Python 3.2.2.
|
|
1105 |
||
1106 |
* Replaced the "substitute_html_entities" argument with the
|
|
133
by Leonard Richardson
Added more detail to the NEWS. |
1107 |
more general "formatter" argument.
|
131
by Leonard Richardson
Moved around a bunch of metadata. |
1108 |
|
1109 |
* Bare ampersands and angle brackets are always converted to XML
|
|
1110 |
entities unless the user prevents it.
|
|
1111 |
||
133
by Leonard Richardson
Added more detail to the NEWS. |
1112 |
* Added PageElement.insert_before() and PageElement.insert_after(),
|
1113 |
which let you put an element into the parse tree with respect to
|
|
1114 |
some other element.
|
|
131
by Leonard Richardson
Moved around a bunch of metadata. |
1115 |
|
1116 |
* Raise an exception when the user tries to do something nonsensical
|
|
1117 |
like insert a tag into itself.
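
A brief sketch of new_tag(), new_string(), insert_before() and insert_after() from the list above, assuming a current Beautiful Soup 4 install and the built-in "html.parser" tree builder.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>stop</b>", "html.parser")

# Build a new tag with the soup's factory method and place it before
# the existing string.
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
after_tag = str(soup.b)

# Build a new string and place it right after the new tag.
soup.b.i.insert_after(soup.new_string(" ever "))
after_string = str(soup.b)
```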
|
|
1118 |
||
122
by Leonard Richardson
Documented today's changes. |
1119 |
|
134
by Leonard Richardson
Moved the historical changelog into NEWS. |
1120 |
= 4.0.0b3 (20120203) =
|
126
by Leonard Richardson
Package the docs with the code. |
1121 |
|
1122 |
Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
|
|
1123 |
Soup's custom HTML parser in favor of a system that lets you write a |
|
1124 |
little glue code and plug in any HTML or XML parser you want. |
|
1125 |
||
1126 |
Beautiful Soup 4.0 comes with glue code for four parsers: |
|
1127 |
||
1128 |
* Python's standard HTMLParser (html.parser in Python 3) |
|
1129 |
* lxml's HTML and XML parsers |
|
1130 |
* html5lib's HTML parser |
|
1131 |
||
1132 |
HTMLParser is the default, but I recommend you install lxml if you
|
|
1133 |
can.
|
|
1134 |
||
1135 |
For complete documentation, see the Sphinx documentation in
|
|
1136 |
bs4/doc/source/. What follows is a summary of the changes from
|
|
1137 |
Beautiful Soup 3.
|
|
1138 |
||
1139 |
=== The module name has changed ===
|
|
1140 |
||
1141 |
Previously you imported the BeautifulSoup class from a module also
|
|
1142 |
called BeautifulSoup. To save keystrokes and make it clear which
|
|
1143 |
version of the API is in use, the module is now called 'bs4': |
|
1144 |
||
1145 |
>>> from bs4 import BeautifulSoup
|
|
1146 |
||
1147 |
=== It works with Python 3 ===
|
|
1148 |
||
1149 |
Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
|
|
1150 |
so bad that it barely worked at all. Beautiful Soup 4 works with
|
|
1151 |
Python 3, and since its parser is pluggable, you don't sacrifice |
|
1152 |
quality. |
|
1153 |
||
1154 |
Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3 |
|
1155 |
support to the finish line. Ezio Melotti is also to thank for greatly |
|
1156 |
improving the HTML parser that comes with Python 3.2. |
|
1157 |
||
1158 |
=== CDATA sections are normal text, if they're understood at all. === |
|
1159 |
||
1160 |
Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
|
|
1161 |
markup:
|
|
1162 |
||
1163 |
<p><![CDATA[foo]]></p> => <p></p>
|
|
1164 |
||
1165 |
A future version of html5lib will turn CDATA sections into text nodes,
|
|
1166 |
but only within tags like <svg> and <math>:
|
|
1167 |
||
1168 |
<svg><![CDATA[foo]]></svg> => <svg>foo</svg>
|
|
1169 |
||
1170 |
The default XML parser (which uses lxml behind the scenes) turns CDATA
|
|
1171 |
sections into ordinary text elements:
|
|
1172 |
||
1173 |
<p><![CDATA[foo]]></p> => <p>foo</p>
|
|
1174 |
||
1175 |
In theory it's possible to preserve the CDATA sections when using the |
|
1176 |
XML parser, but I don't see how to get it to work in practice. |
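
The "CDATA becomes ordinary text" behavior of an XML parser can be sketched with the standard library's ElementTree. It is used here only as a stand-in for lxml, so that the example needs no extra install; it is not the parser Beautiful Soup uses for XML.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring("<p><![CDATA[foo]]></p>")

# The CDATA wrapper is gone; only the text survives parsing.
text = element.text
serialized = ET.tostring(element, encoding="unicode")
```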
|
1177 |
||
1178 |
=== Miscellaneous other stuff ===
|
|
1179 |
||
1180 |
If the BeautifulSoup instance has .is_xml set to True, an appropriate
|
|
1181 |
XML declaration will be emitted when the tree is transformed into a
|
|
1182 |
string:
|
|
1183 |
||
1184 |
<?xml version="1.0" encoding="utf-8"?>
|
|
1185 |
<markup>
|
|
1186 |
...
|
|
1187 |
</markup>
|
|
1188 |
||
1189 |
The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree |
|
1190 |
builders set it to False. If you want to parse XHTML with an HTML
|
|
1191 |
parser, you can set it manually.
|
|
1192 |
||
75.1.4
by Leonard Richardson
Emit an XML declaration when appropriate. |
1193 |
|
92
by Leonard Richardson
Prep for beta release. |
1194 |
= 3.2.0 =
|
1195 |
||
1196 |
The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2 |
|
1197 |
to make it obvious which one you should use. |
|
1198 |
||
1
by Leonard Richardson
Initial (manual) import. |
1199 |
= 3.1.0 = |
1200 |
||
1201 |
A hybrid version that supports Python 2.4 and can be automatically converted |
|
1202 |
to run under Python 3.0. There are three backwards-incompatible |
|
1203 |
changes you should be aware of, but no new features or deliberate |
|
1204 |
behavior changes. |
|
1205 |
||
1206 |
1. str() may no longer do what you want. This is because the meaning |
|
1207 |
of str() inverts between Python 2 and 3; in Python 2 it gives you a |
|
1208 |
byte string, in Python 3 it gives you a Unicode string. |
|
1209 |
||
1210 |
The effect of this is that you can't pass an encoding to .__str__ |
|
1211 |
anymore. Use encode() to get a byte string and decode() to get Unicode, and
|
|
1212 |
you'll be ready (well, readier) for Python 3. |
|
1213 |
||
2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

 <a href="foo</a>, </a><a href="bar">baz</a>
 <a b="<a>">', '<a b="<a>"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

 <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.

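The Python 3 behavior can be observed with the standard library's
html.parser module directly. This is a small sketch; the
AttributeCollector class is a name invented for the example and is not
part of Beautiful Soup.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record the (name, value) attribute pairs of every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.extend(attrs)


collector = AttributeCollector()
collector.feed('<a href="http://crummy.com?sacr&eacute;&bleu">')
collector.close()

# By the time the handler runs, the entity has already been replaced.
href = dict(collector.seen)["href"]
```
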
= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.


= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.


= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)

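To show what escaping bare ampersands involves, here is one possible
regular expression. It is a sketch written for this note, not the
expression Beautiful Soup used; BARE_AMPERSAND and
escape_bare_ampersands are hypothetical names.

```python
import re

# An ampersand is treated as "bare" unless it starts a named (&amp;),
# decimal (&#233;) or hexadecimal (&#xe9;) reference ending in ';'.
BARE_AMPERSAND = re.compile(r"&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)")


def escape_bare_ampersands(text):
    """Replace every bare ampersand with &amp;, leaving entities alone."""
    return BARE_AMPERSAND.sub("&amp;", text)


escaped = escape_bare_ampersands("AT&T &amp; R&D &#233; &bleu")
```

Requiring the trailing semicolon is what keeps strings like "&bleu" from
being mistaken for an entity.
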
Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)

= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.

= 3.0.3 (20060606) =

Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be
sure to pass in an appropriate value for convertEntities, or XML/HTML
entities might stick around that aren't valid in HTML/XML). The result
may not validate, but it should be good enough to not choke a
real-world XML parser. Specifically, the output of a properly
constructed soup object should always be valid as part of an XML
document, but parts may be missing if they were missing in the
original. As always, if the input is valid XML, the output will also
be valid.

= 3.0.2 (20060602) =

Previously, Beautiful Soup correctly handled attribute values that
contained embedded quotes (sometimes by escaping), but not other kinds
of XML character. Now, it correctly handles or escapes all special XML
characters in attribute values.

I aliased methods to the 2.x names (fetch, find, findText, etc.) for
backwards compatibility purposes. Those names are deprecated and if I
ever do a 4.0 I will remove them. I will, I tell you!

Fixed a bug where the findAll method wasn't passing along any keyword
arguments.

When run from the command line, Beautiful Soup now acts as an HTML
pretty-printer, not an XML pretty-printer.

= 3.0.1 (20060530) =

Reintroduced the "fetch by CSS class" shortcut. I thought keyword
arguments would replace it, but they don't. You can't call soup('a',
class='foo') because class is a Python keyword.

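The keyword clash is a property of Python itself and can be shown
without Beautiful Soup. In this sketch, search is a hypothetical
stand-in for a search method that accepts criteria either as a
dictionary or as keyword arguments.

```python
def search(name, attrs=None, **kwargs):
    """Hypothetical search method: merge both styles of criteria."""
    criteria = dict(attrs or {})
    criteria.update(kwargs)
    return name, criteria


# Spelling the call with class=... is rejected before it ever runs.
try:
    compile("search('a', class='foo')", "<example>", "eval")
    keyword_allowed = True
except SyntaxError:
    keyword_allowed = False

# Passing the attribute in a dictionary sidesteps the reserved word.
result = search("a", {"class": "foo"})
```
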
If Beautiful Soup encounters a meta tag that declares the encoding,
but a SoupStrainer tells it not to parse that tag, Beautiful Soup will
no longer try to rewrite the meta tag to mention the new
encoding. Basically, this makes SoupStrainers work in real-world
applications instead of crashing the parser.

= 3.0.0 "Who would not give all else for two p" (20060528) =

This release is not backward-compatible with previous releases. If
you've got code written with a previous version of the library, go
ahead and keep using it, unless one of the features mentioned here
really makes your life easier. Since the library is self-contained,
you can include an old copy of the library in your old applications,
and use the new version for everything else.

The documentation has been rewritten and greatly expanded with many
more examples.

Beautiful Soup autodetects the encoding of a document (or uses the one
you specify), and converts it from its native encoding to
Unicode. Internally, it only deals with Unicode strings. When you
print out the document, it converts to UTF-8 (or another encoding you
specify). [Doc reference]

It's now easy to make large-scale changes to the parse tree without
screwing up the navigation members. The methods are extract,
replaceWith, and insert. [Doc reference. See also Improving Memory
Usage with extract]

Passing True in as an attribute value gives you tags that have any
value for that attribute. You don't have to create a regular
expression. Passing None for an attribute value gives you tags that
don't have that attribute at all.

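The True and None rules amount to a presence test and an absence test.
Here is a minimal sketch of that logic over a plain dictionary of
attributes; attribute_matches is a made-up name and this is not the
library's implementation.

```python
def attribute_matches(tag_attrs, name, wanted):
    """Hypothetical matcher for one attribute criterion."""
    if wanted is True:
        # Any value will do, as long as the attribute is present.
        return name in tag_attrs
    if wanted is None:
        # The attribute must be missing altogether.
        return name not in tag_attrs
    # Otherwise compare against the specific value asked for.
    return tag_attrs.get(name) == wanted


link = {"href": "http://example.com/", "id": "5"}
anchor = {"name": "top"}
```
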
Tag objects now know whether or not they're self-closing. This avoids
the problem where Beautiful Soup thought that tags like <BR /> were
self-closing even in XML documents. You can customize the self-closing
tags for a parser object by passing them in as a list of
selfClosingTags: you don't have to subclass anymore.

There's a new built-in parser, MinimalSoup, which has most of
BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc
reference]

You can use a SoupStrainer to tell Beautiful Soup to parse only part
of a document. This saves time and memory, often making Beautiful Soup
about as fast as a custom-built SGMLParser subclass. [Doc reference,
SoupStrainer reference]

You can (usually) use keyword arguments instead of passing a
dictionary of attributes to a search method. That is, you can replace
soup(args={"id" : "5"}) with soup(id="5"). You can still use args if
(for instance) you need to find an attribute whose name clashes with
the name of an argument to findAll. [Doc reference: **kwargs attrs]

The method names have changed to the better method names used in
Rubyful Soup. Instead of find methods and fetch methods, there are
only find methods. Instead of a scheme where you can't remember which
method finds one element and which one finds them all, we have find
and findAll. In general, if the method name mentions All or a plural
noun (e.g. findNextSiblings), then the method finds many
elements. Otherwise, it only finds one element. [Doc reference]

Some of the argument names have been renamed for clarity. For instance
avoidParserProblems is now parserMassage.

Beautiful Soup no longer implements a feed method. You need to pass a
string or a filehandle into the soup constructor, not with feed after
the soup has been created. There is still a feed method, but it's the
feed method implemented by SGMLParser and calling it will bypass
Beautiful Soup and cause problems.

The NavigableText class has been renamed to NavigableString. There is
no NavigableUnicodeString anymore, because every string inside a
Beautiful Soup parse tree is a Unicode string.

findText and fetchText are gone. Just pass a text argument into find
or findAll.

Null was more trouble than it was worth, so I got rid of it. Anything
that used to return Null now returns None.

Special XML constructs like comments and CDATA now have their own
NavigableString subclasses, instead of being treated as oddly-formed
data. If you parse a document that contains CDATA and write it back
out, the CDATA will still be there.

When you're parsing a document, you can get Beautiful Soup to convert
XML or HTML entities into the corresponding Unicode characters. [Doc
reference]

= 2.1.1 (20050918) =

Fixed a serious performance bug in BeautifulStoneSoup which was
causing parsing to be incredibly slow.

Corrected several entities that were previously being incorrectly
translated from Microsoft smart-quote-like characters.

Fixed a bug that was breaking text fetch.

Fixed a bug that crashed the parser when text chunks that look like
HTML tag names showed up within a SCRIPT tag.

THEAD, TBODY, and TFOOT tags are now nestable within TABLE
tags. Nested tables should parse more sensibly now.

BASE is now considered a self-closing tag.

= 2.1.0 "Game, or any other dish?" (20050504) =

Added a wide variety of new search methods which, given a starting
point inside the tree, follow a particular navigation member (like
nextSibling) over and over again, looking for Tag and NavigableText
objects that match certain criteria. The new methods are findNext,
fetchNext, findPrevious, fetchPrevious, findNextSibling,
fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
findParent, and fetchParents. All of these use the same basic code
used by first and fetch, so you can pass your weird ways of matching
things into these methods.

The fetch method and its derivatives now accept a limit argument.

You can now pass keyword arguments when calling a Tag object as though
it were a method.

Fixed a bug that caused all hand-created tags to share a single set of
attributes.

= 2.0.3 (20050501) =

Fixed Python 2.2 support for iterators.

Fixed a bug that gave the wrong representation to tags within quote
tags like <script>.

Took some code from Mark Pilgrim that treats CDATA declarations as
data instead of ignoring them.

Beautiful Soup's setup.py will now do an install even if the unit
tests fail. It won't build a source distribution if the unit tests
fail, so I can't release a new version unless they pass.

= 2.0.2 (20050416) =

Added the unit tests in a separate module, and packaged it with
distutils.

Fixed a bug that sometimes caused renderContents() to return a Unicode
string even if there was no Unicode in the original string.

Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.

Reinstated some backwards compatibility with 1.x versions: referencing
the string member of a NavigableText object returns the NavigableText
object instead of throwing an error.

= 2.0.1 (20050412) =

Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, e.g. tag.table.td.

Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.

Fixed a bug in the comparison operator.

= 2.0.0 "Who cares for fish?" (20050410) =

Beautiful Soup version 1 was very useful but also pretty stupid. I
originally wrote it without noticing any of the problems inherent in
trying to build a parse tree out of ambiguous HTML tags. This version
solves all of those problems to my satisfaction. It also adds many new
clever things to make up for the removal of the stupid things.

== Parsing ==

The parser logic has been greatly improved, and the BeautifulSoup
class should much more reliably yield a parse tree that looks like
what the page author intended. For a particular class of odd edge
cases that now causes problems, there is a new class,
ICantBelieveItsBeautifulSoup.

By default, Beautiful Soup now performs some cleanup operations on
text before parsing it. This is to avoid common problems with bad
definitions and self-closing tags that crash SGMLParser. You can
provide your own set of cleanup operations, or turn it off
altogether. The cleanup operations include fixing self-closing tags
that don't close, and replacing Microsoft smart quotes and similar
characters with their HTML entity equivalents.

You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.

== Strings and Unicode ==

There are separate NavigableText subclasses for ASCII and Unicode
strings. These classes directly subclass the corresponding base data
types. This means you can treat NavigableText objects as strings
instead of having to call methods on them to get the strings.

str() on a Tag always returns a string, and unicode() always returns
Unicode. Previously it was inconsistent.

== Tree traversal ==

In a first() or fetch() call, the tag name or the desired value of an
attribute can now be any of the following:

 * A string (matches that specific tag or that specific attribute value)
 * A list of strings (matches any tag or attribute value in the list)
 * A compiled regular expression object (matches any tag or attribute
   value that matches the regular expression)
 * A callable object that takes the Tag object or attribute value as a
   string. It returns None/false/empty string if the given string
   doesn't match, and any other value if it does.

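The four kinds of criterion listed above can be handled by a single
dispatch function. The following is only a sketch of that idea working
on plain strings; the name matches is hypothetical, and the real
fetch() code may be organized quite differently.

```python
import re


def matches(value, criterion):
    """Hypothetical dispatcher over the four kinds of criterion."""
    if isinstance(criterion, str):
        return value == criterion                 # one specific string
    if isinstance(criterion, (list, tuple)):
        return value in criterion                 # any string in the list
    if hasattr(criterion, "search"):
        return criterion.search(value) is not None  # compiled regex
    if callable(criterion):
        return bool(criterion(value))             # truthy means a match
    raise TypeError("unsupported criterion: %r" % (criterion,))


heading = re.compile(r"^h[1-6]$")
```

For example, matches("h2", heading) is true, while matches("hr", heading)
is not.
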
This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I took out
SQL-style wildcards. I'll put them back if someone complains, but
their removal simplifies the code a lot.

You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.

If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.

1.x and 2.x both let you call a Tag object as a shorthand for
fetch(). For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
traverse a tree in very little code: for header in
soup.bodyTag.pTag.tableTag('th'):

If an element relationship (like parent or next) doesn't apply to a
tag, it'll now show up as Null instead of None. first() will also
return Null if you ask it for a nonexistent tag. Null is an object
that's just like None, except you can do whatever you want to it and
it'll give you Null instead of throwing an error.

This lets you do tree traversals like soup.htmlTag.headTag.titleTag
without having to worry if the intermediate stages are actually
there. Previously, if there was no 'head' tag in the document, headTag
in that instance would have been None, and accessing its 'titleTag'
member would have thrown an AttributeError. Now, you can get what you
want when it exists, and get Null when it doesn't, without having to
do a lot of conditionals checking to see if every stage is None.

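The behavior described for Null is the classic "null object" pattern.
Here is a minimal sketch of it; NullType and the toy Document class are
names made up for the example and do not reproduce the 2.0 source.

```python
class NullType(object):
    """Hypothetical Null: absorbs attribute access and calls."""

    def __getattr__(self, name):
        return self          # Null.anything is still Null

    def __call__(self, *args, **kwargs):
        return self          # Null(...) is still Null

    def __bool__(self):
        return False         # behaves like None in a condition

    def __repr__(self):
        return "Null"


Null = NullType()


class Document(object):
    """Toy stand-in for a parsed document that contains no tags."""

    def __getattr__(self, name):
        return Null


soup = Document()
# No AttributeError, even though none of these tags exist.
title = soup.htmlTag.headTag.titleTag
```
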
There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:

 <p><ul><li>Foo<br /><li>Bar</ul>

The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.

I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well. It was
really difficult to use, I never used it, and I don't think anyone
else ever used it. To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.

== Tree manipulation ==

You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.

== Porting Considerations ==

There are three changes in 2.0 that break old code:

In the post-1.2 release you could pass in a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.

It's no longer possible to pass in SQL-style wildcards to fetch(). Use
a regular expression instead.

The different parsing algorithm means the parse tree may not be shaped
like you expect. This will only actually affect you if your code uses
one of the affected parts. I haven't run into this problem yet while
porting my code.

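When porting, an old wildcard pattern can be translated into a regular
expression mechanically. This sketch assumes that '%' was the only
wildcard character and that it matched any run of characters, which is
a guess based on the "%foo" example in this changelog;
wildcard_to_regex is a hypothetical helper.

```python
import re


def wildcard_to_regex(pattern):
    """Hypothetical converter from a '%' wildcard pattern to a regex."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")             # assumed: any run of characters
        else:
            parts.append(re.escape(char))  # everything else is literal
    # Anchor both ends so the whole string has to match.
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)


ends_with_foo = wildcard_to_regex("%foo")
starts_with_foo = wildcard_to_regex("foo%")
```

The compiled object can then be passed wherever a regular expression is
accepted as a criterion.
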
= Between 1.2 and 2.0 =

This is the release to get if you want Python 1.5 compatibility.

The desired value of an attribute can now be any of the following:

 * A string
 * A string with SQL-style wildcards
 * A compiled RE object
 * A callable that returns None/false/empty string if the given value
   doesn't match, and any other value otherwise.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I no longer
recommend you use SQL-style wildcards. They may go away in a future
release to clean up the code.

Made Beautiful Soup handle processing instructions as text instead of
ignoring them.

Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.

Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.

Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".

= 1.2 "Who for such dainties would not stoop?" (20040708) =

Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.

Made BeautifulStoneSoup even dumber by making it not implicitly close
a tag when another tag of the same type is encountered; only when an
actual closing tag is encountered. This change courtesy of Fuzzy (mike
at pcblokes dot com). BeautifulSoup still works as before.

= 1.1 "Swimming in a hot tureen" =

Added more 'nestable' tags. Changed popping semantics so that when a
nestable tag is encountered, tags are popped up to the previously
encountered nestable tag (of whatever kind). I will revert this if
enough people complain, but it should make more people's lives easier
than harder. This enhancement was suggested by Anthony Baxter (anthony
at interlink dot com dot au).

= 1.0 "So rich and green" (20040420) =

Initial release.