Beautiful Soup's official support for Python 2 ended on December 31st,
2020. The final release to support Python 2 was Beautiful Soup
4.9.3. In the Launchpad Bazaar repository, the final revision to support
Python 2 was revision 605.

= 4.11.0 (Unreleased)

* Ported unit tests to use pytest.

* Added special string classes, RubyParenthesisString and RubyTextString,
  to make it possible to treat ruby text specially in get_text() calls.
  [bug=1941980]

* It's now possible to customize the way output is indented by
  providing a value for the 'indent' argument to the Formatter
  constructor. The 'indent' argument works very similarly to the
  argument of the same name in the Python standard library's
  json.dump() function. [bug=1955497]

* If the charset-normalizer Python module
  (https://pypi.org/project/charset-normalizer/) is installed, Beautiful
  Soup will use it to detect the character sets of incoming documents.
  This is also the module used by newer versions of the Requests library.
  For the sake of backwards compatibility, chardet and cchardet both take
  precedence if installed. [bug=1955346]

* Added a workaround for an lxml bug
  (https://bugs.launchpad.net/lxml/+bug/1948551) that causes
  problems when parsing a Unicode string beginning with BYTE ORDER MARK.
  [bug=1947768]

* Issue a warning when an HTML parser is used to parse a document that
  looks like XML but not XHTML. [bug=1939121]

* Do a better job of keeping track of namespaces as an XML document is
  parsed, so that CSS selectors that use namespaces will do the right
  thing more often. [bug=1946243]

* Some time ago, the misleadingly named "text" argument to find-type
  methods was renamed to the more accurate "string." But this supposed
  "renaming" didn't make it into important places like the method
  signatures or the docstrings. That's corrected in this
  version. "text" still works, but will give a DeprecationWarning.
  [bug=1947038]
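
As a short illustration of the renamed argument (the markup here is made up for the example), a search by string content looks like this:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><b>two</b></p>", "html.parser")

# 'string' is the current name of the argument formerly called 'text'.
matches = soup.find_all("b", string="two")
print(matches)
```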

* Fixed a crash when pickling a BeautifulSoup object that has no
  tree builder. [bug=1934003]

* Fixed a crash when overriding multi_valued_attributes and using the
  html5lib parser. [bug=1948488]

* Removed support for the iconv_codec library, which doesn't seem
  to exist anymore and was never put up on PyPI. (The closest
  replacement on PyPI, iconv_codecs, is GPL-licensed, so we can't use
  it--it's also quite old.)

= 4.10.0 (20210907)

* This is the first release of Beautiful Soup to only support Python
  3. I dropped Python 2 support to maintain support for newer versions
  (58 and up) of setuptools. See:
  https://github.com/pypa/setuptools/issues/2769 [bug=1942919]

* The behavior of methods like .get_text() and .strings now differs
  depending on the type of tag. The change is visible with HTML tags
  like <script>, <style>, and <template>. Starting in 4.9.0, methods
  like get_text() returned no results on such tags, because the
  contents of those tags are not considered 'text' within the document
  as a whole.

  But a user who calls script.get_text() is working from a different
  definition of 'text' than a user who calls div.get_text()--otherwise
  there would be no need to call script.get_text() at all. In 4.10.0,
  the contents of (e.g.) a <script> tag are considered 'text' during a
  get_text() call on the tag itself, but not considered 'text' during
  a get_text() call on the tag's parent.

  Because of this change, calling get_text() on each child of a tag
  may now return a different result than calling get_text() on the tag
  itself. That's because different tags now have different
  understandings of what counts as 'text'. [bug=1906226] [bug=1868861]
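
The difference can be seen in a small sketch, assuming the html.parser tree builder and an illustrative <div> that wraps a <script>:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div>Hello<script>var x = 1;</script></div>", "html.parser"
)

# From the parent's point of view, the script body is not 'text'.
parent_text = soup.div.get_text()

# From the <script> tag's own point of view, it is.
script_text = soup.script.get_text()

# Asking each child separately therefore yields more than asking the parent.
children_text = "".join(child.get_text() for child in soup.div.children)

print(repr(parent_text), repr(script_text), repr(children_text))
```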

* NavigableString and its subclasses now implement the get_text()
  method, as well as the properties .strings and
  .stripped_strings. These methods will either return the string
  itself, or nothing, so the only reason to use this is when iterating
  over a list of mixed Tag and NavigableString objects. [bug=1904309]
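
A sketch of the mixed-list case, using made-up markup that contains a plain string, a comment and a tag. The comment is expected to contribute nothing, since comments are not treated as 'text' by default:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one<!--note--><b>two</b></p>", "html.parser")

# .contents mixes NavigableString, Comment and Tag objects; all of them
# can now be asked for their text without a type check.
pieces = [node.get_text() for node in soup.p.contents]

# .strings is available on a plain string as well.
first_strings = list(soup.p.contents[0].strings)

print(pieces, first_strings)
```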

* The 'html5' formatter now treats attributes whose values are the
  empty string as HTML boolean attributes. Previously (and in other
  formatters), an attribute value had to be set to None to be treated as
  a boolean attribute. In a future release, I plan to also give this
  behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424]
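
For comparison, here is a sketch that renders the same illustrative tag with the 'html5' and 'minimal' formatters:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<option selected="">Pick me</option>', "html.parser")

# 'html5' writes an empty-string attribute as a bare boolean attribute.
as_html5 = soup.option.decode(formatter="html5")

# Other formatters keep the explicit empty value.
as_minimal = soup.option.decode(formatter="minimal")

print(as_html5)
print(as_minimal)
```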

* The 'replace_with()' method now takes a variable number of arguments,
  and can be used to replace a single element with a sequence of elements.
  Patch by Bill Chandos. [rev=605]
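
A brief sketch, with illustrative markup, of replacing one tag with a sequence of strings and tags:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>old</b></p>", "html.parser")

emphasis = soup.new_tag("i")
emphasis.string = "new"

# One element goes out; three go in, in the order given.
soup.b.replace_with("some ", emphasis, " text")

result = str(soup.p)
print(result)
```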

* Corrected output when the namespace prefix associated with a
  namespaced attribute is the empty string, as opposed to
  None. [bug=1915583]

* Performance improvement when processing tags that speeds up overall
  tree construction by 2%. Patch by Morotti. [bug=1899358]

* Corrected the use of special string container classes in cases when a
  single tag may contain strings with different containers, such as
  the <template> tag, which may contain both TemplateString objects
  and Comment objects. [bug=1913406]

* The html.parser tree builder can now handle named entities
  found in the HTML5 spec in much the same way that the html5lib
  tree builder does. Note that the lxml HTML tree builder doesn't handle
  named entities this way. [bug=1924908]

* Added a second way to specify encodings to UnicodeDammit and
  EncodingDetector, based on the order of precedence defined in the
  HTML5 spec, starting at:
  https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding

  Encodings in 'known_definite_encodings' are tried first, then
  byte-order-mark sniffing is run, then encodings in 'user_encodings'
  are tried. The old argument, 'override_encodings', is now a
  deprecated alias for 'known_definite_encodings'.

  This changes the default behavior of the html.parser and lxml tree
  builders, in a way that may slightly improve encoding
  detection but will probably have no effect. [bug=1889014]
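
A minimal sketch of the new argument, assuming a byte string whose encoding is known to the caller in advance (the sample data is invented for the example):

```python
from bs4.dammit import UnicodeDammit

# Bytes that are valid Latin-1 but not valid UTF-8.
data = "caf\u00e9".encode("latin-1")

# Encodings listed here are tried before any sniffing takes place.
dammit = UnicodeDammit(data, known_definite_encodings=["latin-1"])

text = dammit.unicode_markup
print(text)
```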

* Improve the warning issued when a directory name (as opposed to
  the name of a regular file) is passed as markup into the BeautifulSoup
  constructor. [bug=1913628]

= 4.9.3 (20201003)

* Implemented a significant performance optimization to the process of
  searching the parse tree. Patch by Morotti. [bug=1898212]

= 4.9.2 (20200926)

* Fixed a bug that caused too many tags to be popped from the tag
  stack during tree building, when encountering a closing tag that had
  no matching opening tag. [bug=1880420]

* Fixed a bug that inconsistently moved elements over when passing
  a Tag, rather than a list, into Tag.extend(). [bug=1885710]

* Specify the soupsieve dependency in a way that complies with
  PEP 508. Patch by Mike Nerone. [bug=1893696]

* Change the signatures for BeautifulSoup.insert_before and insert_after
  (which are not implemented) to match PageElement.insert_before and
  insert_after, quieting warnings in some IDEs. [bug=1897120]

= 4.9.1 (20200517)

* Added a keyword argument 'on_duplicate_attribute' to the
  BeautifulSoupHTMLParser constructor (used by the html.parser tree
  builder) which lets you customize the handling of markup that
  contains the same attribute more than once, as in:
  <a href="url1" href="url2"> [bug=1878209]
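
A sketch of the three behaviors, assuming the keyword is passed through the BeautifulSoup constructor to the html.parser tree builder. The callback name `accumulate` is hypothetical:

```python
from bs4 import BeautifulSoup

markup = '<a href="url1" href="url2">link</a>'

# Default behavior: the last value seen replaces earlier ones.
last_wins = BeautifulSoup(markup, "html.parser")

# 'ignore' keeps the first value and drops later duplicates.
first_wins = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute="ignore"
)

def accumulate(attributes_so_far, key, value):
    # Hypothetical callback: collect every duplicate value into a list.
    if not isinstance(attributes_so_far[key], list):
        attributes_so_far[key] = [attributes_so_far[key]]
    attributes_so_far[key].append(value)

keep_all = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute=accumulate
)

print(last_wins.a["href"], first_wins.a["href"], keep_all.a["href"])
```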

* Added a distinct UserWarning subclass, GuessedAtParserWarning, for the
  warning issued when BeautifulSoup is instantiated without a parser
  being specified. [bug=1873787]

* Added a distinct UserWarning subclass, MarkupResemblesLocatorWarning,
  for the warning issued when BeautifulSoup is instantiated with 'markup'
  that actually seems to be a URL or the path to a file on
  disk. [bug=1873787]
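
Because each warning has its own class, a caller can silence one without hiding the rest. A sketch using the standard warnings module, with made-up sample markup:

```python
import warnings

from bs4 import (
    BeautifulSoup,
    GuessedAtParserWarning,
    MarkupResemblesLocatorWarning,
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Filter out only the two constructor warnings.
    warnings.filterwarnings("ignore", category=GuessedAtParserWarning)
    warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)

    BeautifulSoup("<p>No parser named</p>")               # parser is guessed
    BeautifulSoup("https://example.com/", "html.parser")  # looks like a URL

remaining = [
    w for w in caught
    if issubclass(
        w.category, (GuessedAtParserWarning, MarkupResemblesLocatorWarning)
    )
]
print(remaining)
```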

* The new NavigableString subclasses (Stylesheet, Script, and
  TemplateString) can now be imported directly from the bs4 package.

* If you encode a document with a Python-specific encoding like
  'unicode_escape', that encoding is no longer mentioned in the final
  XML or HTML document. Instead, encoding information is omitted or
  left blank. [bug=1874955]

* Fixed test failures when run against soupselect 2.0. Patch by Tomáš
  Chvátal. [bug=1872279]

= 4.9.0 (20200405)

* Added PageElement.decomposed, a new property which lets you
  check whether you've already called decompose() on a Tag or
  NavigableString.
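
A short sketch of the property in use, with illustrative markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Keep<b>drop</b></p>", "html.parser")
bold = soup.b

before = bold.decomposed   # not yet destroyed
bold.decompose()
after = bold.decomposed    # the reference still exists, but the tag is gone

print(before, after, soup.p)
```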

* Embedded CSS and JavaScript are now stored in distinct Stylesheet and
  Script strings, which are ignored by methods like get_text() since most
  people don't consider this sort of content to be 'text'. This
  feature is not supported by the html5lib treebuilder. [bug=1868861]
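
A sketch of what this looks like with the html.parser tree builder, using a made-up document:

```python
from bs4 import BeautifulSoup
from bs4.element import Script, Stylesheet

soup = BeautifulSoup(
    "<style>p {color: red}</style><script>var x = 1;</script><p>Visible</p>",
    "html.parser",
)

# The contents of <style> and <script> get their own string classes.
style_is_special = isinstance(soup.style.string, Stylesheet)
script_is_special = isinstance(soup.script.string, Script)

# Document-wide text extraction skips both.
page_text = soup.get_text()
print(style_is_special, script_is_special, repr(page_text))
```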

* Added a Russian translation by 'authoress' to the repository.

* Fixed an unhandled exception when formatting a Tag that had been
  decomposed. [bug=1857767]

* Fixed a bug that happened when passing a Unicode filename containing
  non-ASCII characters as markup into Beautiful Soup, on a system that
  allows Unicode filenames. [bug=1866717]

* Added a performance optimization to PageElement.extract(). Patch by
  Arthur Darcet.

= 4.8.2 (20191224)

* Added Python docstrings to all public methods of the most commonly
  used classes.

* Added a Chinese translation by Deron Wang and a Brazilian Portuguese
  translation by Cezar Peixeiro to the repository.

* Fixed two deprecation warnings. Patches by Colin
  Watson and Nicholas Neumann. [bug=1847592] [bug=1855301]

* The html.parser tree builder now correctly handles DOCTYPEs that are
  not uppercase. [bug=1848401]

* PageElement.select() now returns a ResultSet rather than a regular
  list, making it consistent with methods like find_all().
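
A quick check of the return types, assuming soupsieve is installed (select() depends on it) and using illustrative markup:

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet

soup = BeautifulSoup("<p class='a'>1</p><p>2</p>", "html.parser")

selected = soup.select("p.a")
found = soup.find_all("p", class_="a")

# Both search styles now hand back the same container type.
print(type(selected).__name__, type(found).__name__)
```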

= 4.8.1 (20191006)

* When the html.parser or html5lib parsers are in use, Beautiful Soup
  will, by default, record the position in the original document where
  each tag was encountered. This includes line number (Tag.sourceline)
  and position within a line (Tag.sourcepos).  Based on code by Chris
  Mayo. [bug=1742921]
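
A sketch with html.parser and a two-line illustrative document. With this parser, line numbers start at 1 and positions at 0; html5lib may report positions differently:

```python
from bs4 import BeautifulSoup

markup = "<p>first</p>\n  <p>second</p>"
soup = BeautifulSoup(markup, "html.parser")

# (line, column) of each opening <p> tag in the original markup.
positions = [(tag.sourceline, tag.sourcepos) for tag in soup.find_all("p")]
print(positions)
```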

* When instantiating a BeautifulSoup object, it's now possible to
  provide a dictionary ('element_classes') of the classes you'd like to be
  instantiated instead of Tag, NavigableString, etc.
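
A minimal sketch; the subclass names `CustomTag` and `LoudString` are hypothetical and exist only for this example:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

class CustomTag(Tag):
    """Hypothetical Tag subclass with no extra behavior."""

class LoudString(NavigableString):
    """Hypothetical string subclass with one helper method."""
    def shout(self):
        return self.upper()

soup = BeautifulSoup(
    "<div>some text</div>",
    "html.parser",
    element_classes={Tag: CustomTag, NavigableString: LoudString},
)

print(type(soup.div).__name__, soup.div.string.shout())
```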

* Fixed the definition of the default XML namespace when using
  lxml 4.4. Patch by Isaac Muse. [bug=1840141]

* Fixed a crash when pretty-printing tags that were not created
  during initial parsing. [bug=1838903]

* Copying a Tag preserves information that was originally obtained from
  the TreeBuilder used to build the original Tag. [bug=1838903]

* Raise an explanatory exception when the underlying parser
  completely rejects the incoming markup. [bug=1838877]

* Avoid a crash when trying to detect the declared encoding of a
  Unicode document. [bug=1838877]

* Avoid a crash when unpickling certain parse trees generated
  using html5lib on Python 3. [bug=1843545]

= 4.8.0 (20190720, "One Small Soup")

This release focuses on making it easier to customize Beautiful Soup's
input mechanism (the TreeBuilder) and output mechanism (the Formatter).

* You can customize the TreeBuilder object by passing keyword
  arguments into the BeautifulSoup constructor. Those keyword
  arguments will be passed along into the TreeBuilder constructor.

  The main reason to do this right now is to change which
  attributes are treated as multi-valued attributes (the way 'class'
  is treated by default). You can do this with the
  'multi_valued_attributes' argument. [bug=1832978]
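
A sketch that turns multi-valued handling off, with illustrative markup:

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout"></p>'

# By default 'class' is split into a list of values.
default = BeautifulSoup(markup, "html.parser")

# Passing None disables multi-valued handling for every attribute.
single = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)

print(default.p["class"], single.p["class"])
```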

* The role of Formatter objects has been greatly expanded. The Formatter
  class now controls the following:

  - The function to call to perform entity substitution. (This was
    previously Formatter's only job.)
  - Which tags should be treated as containing CDATA and have their
    contents exempt from entity substitution.
  - The order in which a tag's attributes are output. [bug=1812422]
  - Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'

  All preexisting code should work as before.
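
Two of these controls can be sketched together. The subclass name `SourceOrder` is hypothetical; it overrides attributes() to keep attributes in source order and passes an empty void_element_close_prefix to drop the '/':

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

class SourceOrder(HTMLFormatter):
    """Hypothetical formatter that keeps attributes in source order."""
    def attributes(self, tag):
        # Yield (name, value) pairs exactly as stored on the tag.
        yield from tag.attrs.items()

soup = BeautifulSoup('<p z="1" a="2">line<br/></p>', "html.parser")

# An empty prefix renders void elements as <br> rather than <br/>.
formatter = SourceOrder(void_element_close_prefix="")
output = soup.p.decode(formatter=formatter)
print(output)
```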

* Added a new method to the API, Tag.smooth(), which consolidates
  multiple adjacent NavigableString elements. [bug=1697296]
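
A short sketch of the effect, using illustrative markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>A one</p>", "html.parser")

# Appending a string leaves two adjacent NavigableString objects.
soup.p.append(", a two")
count_before = len(soup.p.contents)

# smooth() merges them back into a single string.
soup.p.smooth()
count_after = len(soup.p.contents)

print(count_before, count_after, soup.p.string)
```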

* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is always
  recognized as a named entity and converted to a single quote. [bug=1818721]

= 4.7.1 (20190106)

* Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]

* Fixed an incorrectly raised exception when inserting a tag before or
  after an identical tag. [bug=1810692]

* Beautiful Soup will no longer try to keep track of namespaces that
  are not defined with a prefix; this can confuse soupselect. [bug=1810680]

* Tried even harder to avoid the deprecation warning originally fixed in
  4.6.1. [bug=1778909]

= 4.7.0 (20181231)

* Beautiful Soup's CSS Selector implementation has been replaced by a
  dependency on Isaac Muse's SoupSieve project (the soupsieve package
  on PyPI). The good news is that SoupSieve has a much more robust and
  complete implementation of CSS selectors, resolving a large number
  of longstanding issues. The bad news is that from this point onward,
  SoupSieve must be installed if you want to use the select() method.

  You don't have to change anything if you installed Beautiful Soup
  through pip (SoupSieve will be automatically installed when you
  upgrade Beautiful Soup) or if you don't use CSS selectors from
  within Beautiful Soup.

  SoupSieve documentation: https://facelessuser.github.io/soupsieve/

* Added the PageElement.extend() method, which works like list.extend().
  [bug=1514970]

* PageElement.insert_before() and insert_after() now take a variable
  number of arguments. [bug=1514970]
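
A sketch combining the three methods on one illustrative paragraph; each call adds its arguments in the order given:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>middle</b></p>", "html.parser")

# Several siblings can be inserted in one call.
soup.b.insert_before("first ", "second ")
soup.b.insert_after(" fourth", " fifth")

# extend() appends every item of a sequence to the tag.
soup.p.extend([" sixth", " seventh"])

result = str(soup.p)
print(result)
```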

* Fixed a number of problems with the tree builder that produced
  trees which were superficially okay, but which fell apart when bits
  were extracted. Patch by Isaac Muse. [bug=1782928,1809910]

* Fixed a problem with the tree builder in which elements that
  contained no content (such as empty comments and all-whitespace
  elements) were not being treated as part of the tree. Patch by Isaac
  Muse. [bug=1798699]

* Fixed a problem with multi-valued attributes where the value
  contained whitespace. Thanks to Jens Svalgaard for the
  fix. [bug=1787453]

* Clarified ambiguous license statements in the source code. Beautiful
  Soup is released under the MIT license, and has been since 4.4.0.

* This file has been renamed from NEWS.txt to CHANGELOG.

= 4.6.3 (20180812)

* Exactly the same as 4.6.2. Re-released to make the README file
  render properly on PyPI.

= 4.6.2 (20180812)

* Fix an exception when a custom formatter was asked to format a void
  element. [bug=1784408]

= 4.6.1 (20180728)

* Stop data loss when encountering an empty numeric entity, and
  possibly in other cases.  Thanks to tos.kamiya for the fix. [bug=1698503]

* Preserve XML namespaces introduced inside an XML document, not just
  the ones introduced at the top level. [bug=1718787]

* Added a new formatter, "html5", which represents void elements
  as "<element>" rather than "<element/>".  [bug=1716272]

* Fixed a problem where the html.parser tree builder interpreted
  a string like "&foo " as the character entity "&foo;"  [bug=1728706]

* Correctly handle invalid HTML numeric character entities like &#147;
  which reference code points that are not Unicode code points. Note
  that this is only fixed when Beautiful Soup is used with the
  html.parser parser -- html5lib already worked and I couldn't fix it
  with lxml.  [bug=1782933]

* Improved the warning given when no parser is specified. [bug=1780571]

* When markup contains duplicate elements, a select() call that
  includes multiple match clauses will match all relevant
  elements. [bug=1770596]

* Fixed code that was causing deprecation warnings in recent Python 3
  versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496]

* Fixed a Windows crash in diagnose() when checking whether a long
  markup string is a filename. [bug=1737121]

* Stopped HTMLParser from raising an exception in very rare cases of
  bad markup. [bug=1708831]

* Fixed a bug where find_all() was not working when asked to find a
  tag with a namespaced name in an XML document that was parsed as
  HTML. [bug=1723783]

* You can get finer control over formatting by subclassing
  bs4.element.Formatter and passing a Formatter instance into (e.g.)
  encode(). [bug=1716272]

* You can pass a dictionary of `attrs` into
  BeautifulSoup.new_tag. This makes it possible to create a tag with
  an attribute like 'name' that would otherwise be masked by another
  argument of new_tag. [bug=1779276]
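
A sketch of the `attrs` dictionary in use. The attribute values are invented; 'name' is the case the entry describes, since new_tag's own first argument is also called 'name':

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("", "html.parser")

# 'name' cannot be given as a keyword attribute because it is already the
# tag-name parameter, so it goes in the attrs dictionary instead.
field = soup.new_tag("input", attrs={"name": "email", "type": "text"})

print(field.name, field["name"], field["type"])
```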

* Clarified the deprecation warning when accessing tag.fooTag, to cover
  the possibility that you might really have been looking for a tag
  called 'fooTag'.

= 4.6.0 (20170507) =

* Added the `Tag.get_attribute_list` method, which acts like `Tag.get` for
  getting the value of an attribute, but which always returns a list,
  whether or not the attribute is a multi-value attribute. [bug=1678589]
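
A sketch contrasting the two accessors on an illustrative tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="intro" class="lead wide">x</p>', "html.parser")

# get() returns a string or a list depending on the attribute.
plain_id = soup.p.get("id")
plain_class = soup.p.get("class")

# get_attribute_list() always returns a list.
ids = soup.p.get_attribute_list("id")
classes = soup.p.get_attribute_list("class")

print(plain_id, plain_class, ids, classes)
```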
442 by Leonard Richardson
It's now possible to use a tag's namespace prefix when searching,
405
443 by Leonard Richardson
HTML parsers treat all HTML4 and HTML5 empty element tags (aka void element tags) correctly. [bug=1656909]
406
* It's now possible to use a tag's namespace prefix when searching,
407
  e.g. soup.find('namespace:tag') [bug=1655332]
408
446 by Leonard Richardson
Improved the handling of empty-element tags like <br> when using the
409
* Improved the handling of empty-element tags like <br> when using the
410
  html.parser parser. [bug=1676935]
411
443 by Leonard Richardson
HTML parsers treat all HTML4 and HTML5 empty element tags (aka void element tags) correctly. [bug=1656909]
412
* HTML parsers treat all HTML4 and HTML5 empty element tags (aka void
413
  element tags) correctly. [bug=1656909]
442 by Leonard Richardson
It's now possible to use a tag's namespace prefix when searching,
414
449 by Leonard Richardson
Namespace prefix is preserved when an XML tag is copied. Thanks
415
* Namespace prefix is preserved when an XML tag is copied. Thanks
416
  to Vikas for a patch and test. [bug=1685172]
417
= 4.5.3 (20170102) =

* Fixed foster parenting when html5lib is the tree builder. Thanks to
  Geoffrey Sneddon for a patch and test.

* Fixed yet another problem that caused the html5lib tree builder to
  create a disconnected parse tree. [bug=1629825]

= 4.5.2 (20170102) =

* Apart from the version number, this release is identical to
  4.5.3. Due to user error, it could not be completely uploaded to
  PyPI. Use 4.5.3 instead.

= 4.5.1 (20160802) =

* Fixed a crash when passing Unicode markup that contained a
  processing instruction into the lxml HTML parser on Python
  3. [bug=1608048]

= 4.5.0 (20160719) =

* Beautiful Soup is no longer compatible with Python 2.6. This
  actually happened a few releases ago, but it's now official.

* Beautiful Soup will now work with versions of html5lib greater than
  0.99999999. [bug=1603299]

* If a search against each individual value of a multi-valued
  attribute fails, the search will be run one final time against the
  complete attribute value considered as a single string. That is, if
  a tag has class="foo bar" and neither "foo" nor "bar" matches, but
  "foo bar" does, the tag is now considered a match.

  This happened in previous versions, but only when the value being
  searched for was a string. Now it also works when that value is
  a regular expression, a list of strings, etc. [bug=1476868]

* Fixed a bug that deranged the tree when a whitespace element was
  reparented into a tag that contained an identical whitespace
  element. [bug=1505351]

* Added support for CSS selector values that contain quoted spaces,
  such as tag[style="display: foo"]. [bug=1540588]

* Corrected handling of XML processing instructions. [bug=1504393]

* Corrected an encoding error that happened when a BeautifulSoup
  object was copied. [bug=1554439]

* The contents of <textarea> tags will no longer be modified when the
  tree is prettified. [bug=1555829]

* When a BeautifulSoup object is pickled but its tree builder cannot
  be pickled, its .builder attribute is set to None instead of being
  destroyed. This avoids a performance problem once the object is
  unpickled. [bug=1523629]

* Specify the file and line number when warning about a
  BeautifulSoup object being instantiated without a parser being
  specified. [bug=1574647]

* The `limit` argument to `select()` now works correctly, though it's
  not implemented very efficiently. [bug=1520530]

* Fixed a Python 3 BytesWarning when a URL was passed in as though it
  were markup. Thanks to James Salter for a patch and
  test. [bug=1533762]

* We don't run the check for a filename passed in as markup if the
  'filename' contains a less-than character; the less-than character
  indicates it's most likely a very small document. [bug=1577864]

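The multi-valued attribute fallback and the two select() changes from the
4.5.0 notes can be sketched as below. This is an illustration only,
assuming a recent Beautiful Soup 4 release (where select() is backed by the
Soup Sieve package) and the html.parser backend; the markup and variable
names are invented for the example.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<p class="foo bar">one</p>'
    '<p class="foo">two</p>'
    '<p style="display: none">three</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Neither "foo" nor "bar" alone matches this pattern, so the search is
# retried against the complete attribute value "foo bar".
by_regex = soup.find_all("p", class_=re.compile(r"^foo bar$"))

# A list of candidate strings falls back to the full value the same way.
by_list = soup.find_all("p", class_=["foo bar"])

# A CSS attribute selector whose quoted value contains a space.
hidden = soup.select('p[style="display: none"]')

# limit caps the number of results returned by select().
first_only = soup.select("p", limit=1)
```
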
= 4.4.1 (20150928) =

* Fixed a bug that deranged the tree when part of it was
  removed. Thanks to Eric Weiser for the patch and John Wiseman for a
  test. [bug=1481520]

* Fixed a parse bug with the html5lib tree-builder. Thanks to Roel
  Kramer for the patch. [bug=1483781]

* Improved the implementation of CSS selector grouping. Thanks to
  Orangain for the patch. [bug=1484543]

* Fixed the test_detect_utf8 test so that it works when chardet is
  installed. [bug=1471359]

* Corrected the output of Declaration objects. [bug=1477847]

= 4.4.0 (20150703) =

Especially important changes:

* Added a warning when you instantiate a BeautifulSoup object without
  explicitly naming a parser. [bug=1398866]

* __repr__ now returns an ASCII bytestring in Python 2, and a Unicode
  string in Python 3, instead of a UTF8-encoded bytestring in both
  versions. In Python 3, __str__ now returns a Unicode string instead
  of a bytestring. [bug=1420131]

* The `text` argument to the find_* methods is now called `string`,
  which is more accurate. `text` still works, but `string` is the
  argument described in the documentation. `text` may eventually
  change its meaning, but not for a very long time. [bug=1366856]

* Changed the way soup objects work under copy.copy(). Copying a
  NavigableString or a Tag will give you a new NavigableString or Tag
  that's equal to the old one but not connected to the parse tree.
  Patch by Martijn Pieters. [bug=1307490]

* Started using a standard MIT license. [bug=1294662]

* Added a Chinese translation of the documentation by Delong .w.

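The `string` argument and the copy.copy() behavior listed above can be
sketched like this. It is a small illustration, assuming a recent
Beautiful Soup 4 release and the html.parser backend; the markup and
variable names are made up.

```python
import copy

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Read <b>this</b> first</p>", "html.parser")

# `string` is the documented name for what used to be the `text` argument.
matches = soup.find_all("b", string="this")

# copy.copy() produces an equal element that is detached from the tree.
original = soup.b
duplicate = copy.copy(original)
```

Because the copy has no parent, it can be inserted elsewhere without
moving the original element.
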
New features:

* Introduced the select_one() method, which uses a CSS selector but
  only returns the first match, instead of a list of
  matches. [bug=1349367]

* You can now create a Tag object without specifying a
  TreeBuilder. Patch by Martijn Pieters. [bug=1307471]

* You can now create a NavigableString or a subclass just by invoking
  the constructor. [bug=1294315]

* Added an `exclude_encodings` argument to UnicodeDammit and to the
  Beautiful Soup constructor, which lets you prohibit the detection of
  an encoding that you know is wrong. [bug=1469408]

* The select() method now supports selector grouping. Patch by
  Francisco Canas [bug=1191917]

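A brief sketch of select_one(), selector grouping, and direct
NavigableString construction from the list above. It assumes a recent
Beautiful Soup 4 release with the html.parser backend; the markup and
variable names are illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<h1>Title</h1><h2>Sub</h2><p>a</p><p>b</p>", "html.parser")

first_p = soup.select_one("p")      # a single Tag, or None if nothing matches
all_p = soup.select("p")            # every match, as a list-like result
missing = soup.select_one("table")  # no <table> in this document
headings = soup.select("h1, h2")    # selector grouping

# A NavigableString can be built directly and then placed in the tree.
note = NavigableString(" (draft)")
soup.h1.append(note)
```
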
Bug fixes:

* Fixed yet another problem that caused the html5lib tree builder to
  create a disconnected parse tree. [bug=1237763]

* Force object_was_parsed() to keep the tree intact even when an element
  from later in the document is moved into place. [bug=1430633]

* Fixed yet another bug that caused a disconnected tree when html5lib
  copied an element from one part of the tree to another. [bug=1270611]

* Fixed a bug where Element.extract() could create an infinite loop in
  the remaining tree.

* The select() method can now find tags whose names contain
  dashes. Patch by Francisco Canas. [bug=1276211]

* The select() method can now find tags with attributes whose names
  contain dashes. Patch by Marek Kapolka. [bug=1304007]

* Improved the lxml tree builder's handling of processing
  instructions. [bug=1294645]

* Restored the helpful syntax error that happens when you try to
  import the Python 2 edition of Beautiful Soup under Python
  3. [bug=1213387]

* In Python 3.4 and above, set the new convert_charrefs argument to
  the html.parser constructor to avoid a warning and future
  failures. Patch by Stefano Revera. [bug=1375721]

* The warning when you pass in a filename or URL as markup will now be
  displayed correctly even if the filename or URL is a Unicode
  string. [bug=1268888]

* If the initial <html> tag contains a CDATA list attribute such as
  'class', the html5lib tree builder will now turn its value into a
  list, as it would with any other tag. [bug=1296481]

* Fixed an import error in Python 3.5 caused by the removal of the
  HTMLParseError class. [bug=1420063]

* Improved docstring for encode_contents() and
  decode_contents(). [bug=1441543]

* Fixed a crash in Unicode, Dammit's encoding detector when the name
  of the encoding itself contained invalid bytes. [bug=1360913]

* Improved the exception raised when you call .unwrap() or
  .replace_with() on an element that's not attached to a tree.

* Raise a NotImplementedError whenever an unsupported CSS pseudoclass
  is used in select(). Previously some cases did not result in a
  NotImplementedError.

* It's now possible to pickle a BeautifulSoup object no matter which
  tree builder was used to create it. However, the only tree builder
  that survives the pickling process is the HTMLParserTreeBuilder
  ('html.parser'). If you unpickle a BeautifulSoup object created with
  some other tree builder, soup.builder will be None. [bug=1231545]

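The two select() fixes for dashed names in the list above can be
illustrated as follows. This is a sketch, assuming a recent Beautiful Soup
4 release with the html.parser backend; the `custom-tag` element and the
`data-item-id` attribute are hypothetical names chosen for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<custom-tag data-item-id="7">x</custom-tag>', "html.parser")

# A tag name containing a dash.
by_tag_name = soup.select("custom-tag")

# An attribute name containing dashes.
by_attribute = soup.select('[data-item-id="7"]')
```
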
= 4.3.2 (20131002) =

* Fixed a bug in which short Unicode input was improperly encoded to
  ASCII when checking whether or not it was the name of a file on
  disk. [bug=1227016]

* Fixed a crash when a short input contains data not valid in
  filenames. [bug=1232604]

* Fixed a bug that caused Unicode data put into UnicodeDammit to
  return None instead of the original data. [bug=1214983]

* Combined two tests to stop a spurious test failure when tests are
  run by nosetests. [bug=1212445]

= 4.3.1 (20130815) =

* Fixed yet another problem with the html5lib tree builder, caused by
  html5lib's tendency to rearrange the tree during
  parsing. [bug=1189267]

* Fixed a bug that caused the optimized version of find_all() to
  return nothing. [bug=1212655]

= 4.3.0 (20130812) =

* Instead of converting incoming data to Unicode and feeding it to the
  lxml tree builder in chunks, Beautiful Soup now makes successive
  guesses at the encoding of the incoming data, and tells lxml to
  parse the data as that encoding. Giving lxml more control over the
  parsing process improves performance and avoids a number of bugs and
  issues with the lxml parser which had previously required elaborate
  workarounds:

  - An issue in which lxml refuses to parse Unicode strings on some
    systems. [bug=1180527]

  - A returning bug that truncated documents longer than a (very
    small) size. [bug=963880]

  - A returning bug in which extra spaces were added to a document if
    the document defined a charset other than UTF-8. [bug=972466]

  This required a major overhaul of the tree builder architecture. If
  you wrote your own tree builder and didn't tell me, you'll need to
  modify your prepare_markup() method.

* The UnicodeDammit code that makes guesses at encodings has been
  split into its own class, EncodingDetector. A lot of apparently
  redundant code has been removed from Unicode, Dammit, and some
  undocumented features have also been removed.

* Beautiful Soup will issue a warning if instead of markup you pass it
  a URL or the name of a file on disk (a common beginner's mistake).

* A number of optimizations improve the performance of the lxml tree
  builder by about 33%, the html.parser tree builder by about 20%, and
  the html5lib tree builder by about 15%.

* All find_all calls should now return a ResultSet object. Patch by
  Aaron DeVore. [bug=1194034]

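The ResultSet guarantee in the last item can be checked with a short
sketch. It assumes a recent Beautiful Soup 4 release, where ResultSet is
importable from bs4.element, and the html.parser backend; the markup is
made up.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet

soup = BeautifulSoup("<a>1</a><a>2</a>", "html.parser")

found = soup.find_all("a")           # two matches
none_found = soup.find_all("table")  # no matches, still a ResultSet
```
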
= 4.2.1 (20130531) =

* The default XML formatter will now replace ampersands even if they
  appear to be part of entities. That is, "&lt;" will become
  "&amp;lt;". The old code was left over from Beautiful Soup 3, which
  didn't always turn entities into Unicode characters.

  If you really want the old behavior (maybe because you add new
  strings to the tree, those strings include entities, and you want
  the formatter to leave them alone on output), it can be found in
  EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]

* Gave new_string() the ability to create subclasses of
  NavigableString. [bug=1181986]

* Fixed another bug by which the html5lib tree builder could create a
  disconnected tree. [bug=1182089]

* The .previous_element of a BeautifulSoup object is now always None,
  not the last element to be parsed. [bug=1182089]

* Fixed test failures when lxml is not installed. [bug=1181589]

* html5lib now supports Python 3. Fixed some Python 2-specific
  code in the html5lib test suite. [bug=1181624]

* The html.parser treebuilder can now handle numeric attributes in
  text when the hexadecimal name of the attribute starts with a
  capital X. Patch by Tim Shirley. [bug=1186242]

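The two entity-substitution behaviors and the new_string() subclass
support described in the 4.2.1 notes can be sketched as follows. This
assumes a recent Beautiful Soup 4 release and the html.parser backend; the
sample text and variable names are illustrative.

```python
from bs4 import BeautifulSoup, Comment
from bs4.dammit import EntitySubstitution

text = "AT&T &lt; x"

# Default behavior: every ampersand is escaped, even inside "&lt;".
escaped = EntitySubstitution.substitute_xml(text)

# Old behavior: ampersands that already begin an entity are left alone.
preserved = EntitySubstitution.substitute_xml_containing_entities(text)

# new_string() can create a NavigableString subclass such as Comment.
soup = BeautifulSoup("<p></p>", "html.parser")
comment = soup.new_string("draft", Comment)
soup.p.append(comment)
```
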
= 4.2.0 (20130514) =

* The Tag.select() method now supports a much wider variety of CSS
  selectors.

 - Added support for the adjacent sibling combinator (+) and the
   general sibling combinator (~). Tests by "liquider". [bug=1082144]

 - The combinators (>, +, and ~) can now combine with any supported
   selector, not just one that selects based on tag name.

 - Added limited support for the "nth-of-type" pseudo-class. Code
   by Sven Slootweg. [bug=1109952]

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
  it quicker to type the import statement in an interactive session:

  from bs4 import _s
   or
  from bs4 import _soup

  The alias may change in the future, so don't use this in code you're
  going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
  functions for reporting problems and doing tech support.

  - diagnose(data) tries the given markup on every installed parser,
    reporting exceptions and displaying successes. If a parser is not
    installed, diagnose() mentions this fact.

  - lxml_trace(data, html=True) runs the given markup through lxml's
    XML parser or HTML parser, and prints out the parser events as
    they happen. This helps you quickly determine whether a given
    problem occurs in lxml code or Beautiful Soup code.

  - htmlparser_trace(data) is the same thing, but for Python's
    built-in HTMLParser class.

* In an HTML document, the contents of a <script> or <style> tag will
  no longer undergo entity substitution by default. XML documents work
  the same way they did before. [bug=1085953]

* Methods like get_text() and properties like .strings now only give
  you strings that are visible in the document--no comments or
  processing commands. [bug=1050164]

* The prettify() method now leaves the contents of <pre> tags
  alone. [bug=1095654]

* Fix a bug in the html5lib treebuilder which sometimes created
  disconnected trees. [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
  an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
  being passed on. [bug=1126734]

* Stop a crash when unwisely messing with a tag that's been
  decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
  corresponding problem on the Beautiful Soup end that was previously
  invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
  anything. Code by Stefaan Lippens. [bug=1168167]

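The sibling combinators, the nth-of-type pseudo-class, and the
visible-strings rule for get_text() from the 4.2.0 notes can be sketched as
below. This assumes a recent Beautiful Soup 4 release (where select() is
backed by Soup Sieve) and the html.parser backend; the markup is invented
for the example.

```python
from bs4 import BeautifulSoup

html = "<div><h1>T</h1><p>one</p><p>two</p><!-- hidden --><span>s</span></div>"
soup = BeautifulSoup(html, "html.parser")

adjacent = soup.select("h1 + p")            # the <p> immediately after <h1>
siblings = soup.select("h1 ~ p")            # every later <p> sibling of <h1>
second_p = soup.select("p:nth-of-type(2)")  # the second <p> among its siblings

# get_text() skips the comment and returns only visible strings.
visible_text = soup.div.get_text()
```
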
= 4.1.3 (20120820) =

* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
  test failure caused by the lousy HTMLParser in those
  versions. [bug=1038503]

* Raise a more specific error (FeatureNotFound) when a requested
  parser or parser feature is not installed. Raise NotImplementedError
  instead of ValueError when the user calls insert_before() or
  insert_after() on the BeautifulSoup object itself. Patch by Aaron
  DeVore. [bug=1038301]

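Catching the more specific error can be sketched like this. It assumes a
recent Beautiful Soup 4 release; "no-such-parser" is a deliberately
invalid, hypothetical feature name.

```python
from bs4 import BeautifulSoup, FeatureNotFound

try:
    BeautifulSoup("<p>hi</p>", "no-such-parser")
    error = None
except FeatureNotFound as exc:
    # FeatureNotFound subclasses ValueError, so older handlers that
    # catch ValueError keep working.
    error = exc
```
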
= 4.1.2 (20120817) =

* As per PEP-8, allow searching by CSS class using the 'class_'
  keyword argument. [bug=1037624]

* Display namespace prefixes for namespaced attribute names, instead of
  the fully-qualified names given by the lxml parser. [bug=1037597]

* Fixed a crash on encoding when an attribute name contained
  non-ASCII characters.

* When sniffing encodings, if the cchardet library is installed,
  Beautiful Soup uses it instead of chardet. cchardet is much
  faster. [bug=1020748]

* Use logging.warning() instead of warnings.warn() to notify the user
  that characters were replaced with REPLACEMENT
  CHARACTER. [bug=1013862]

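Searching by CSS class with the 'class_' keyword can be sketched as
follows, alongside the equivalent attrs dictionary form. This assumes a
recent Beautiful Soup 4 release and the html.parser backend; the markup is
made up.

```python
from bs4 import BeautifulSoup

html = '<p class="intro lead">a</p><p class="body">b</p>'
soup = BeautifulSoup(html, "html.parser")

# `class` is a reserved word in Python, hence the trailing underscore.
by_keyword = soup.find_all("p", class_="intro")

# The attrs dictionary form finds the same tags.
by_attrs = soup.find_all("p", attrs={"class": "intro"})
```
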
= 4.1.1 (20120703) =

* Fixed an html5lib tree builder crash which happened when html5lib
  moved a tag with a multivalued attribute from one part of the tree
  to another. [bug=1019603]

* Correctly display closing tags with an XML namespace declared. Patch
  by Andreas Kostyrka. [bug=1019635]

* Fixed a typo that made parsing significantly slower than it should
  have been, and also waited too long to close tags with XML
  namespaces. [bug=1020268]

* get_text() now returns an empty Unicode string if there is no text,
  rather than an empty bytestring. [bug=1020387]

236 by Leonard Richardson
Prep for release.
823
= 4.1.0 (20120529) =
228 by Leonard Richardson
Added experimental support for fixing Windows-1252 characters embedded in UTF-8 documents.
824
825
* Added experimental support for fixing Windows-1252 characters
232 by Leonard Richardson
Fixed a bug with the lxml treebuilder that prevented the user from adding attributes to a tag that didn't originally have any. [bug=1002378] Thanks to Oliver Beattie for the patch.
826
  embedded in UTF-8 documents. (UnicodeDammit.detwingle())
228 by Leonard Richardson
Added experimental support for fixing Windows-1252 characters embedded in UTF-8 documents.
827
230 by Leonard Richardson
Fixed the handling of &quot; with the built-in parser. [bug=993871]
828
* Fixed the handling of &quot; with the built-in parser. [bug=993871]
829
231 by Leonard Richardson
Comments, processing instructions, document type declarations, and markup declarations are now treated as preformatted strings, the way CData blocks are. [bug=1001025] Also in this commit: renamed detwingle method to detwingle().
830
* Comments, processing instructions, document type declarations, and
831
  markup declarations are now treated as preformatted strings, the way
832
  CData blocks are. [bug=1001025]
833
232 by Leonard Richardson
Fixed a bug with the lxml treebuilder that prevented the user from adding attributes to a tag that didn't originally have any. [bug=1002378] Thanks to Oliver Beattie for the patch.
834
* Fixed a bug with the lxml treebuilder that prevented the user from
835
  adding attributes to a tag that didn't originally have
236 by Leonard Richardson
Prep for release.
836
  attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.
232 by Leonard Richardson
Fixed a bug with the lxml treebuilder that prevented the user from adding attributes to a tag that didn't originally have any. [bug=1002378] Thanks to Oliver Beattie for the patch.
837
233 by Leonard Richardson
Fixed some edge-case bugs having to do with inserting an element
838
* Fixed some edge-case bugs having to do with inserting an element
839
  into a tag it's already inside, and replacing one of a tag's
840
  children with another. [bug=997529]
841
236 by Leonard Richardson
Prep for release.
842
* Added the ability to search for attribute values specified in UTF-8. [bug=1003974]
235 by Leonard Richardson
Fixed the inability to search for non-ASCII attribute
843
844
  This caused a major refactoring of the search code. All the tests
845
  pass, but it's possible that some searches will behave differently.
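
The UnicodeDammit.detwingle() entry in this release can be tried with a sketch like the following. The sample document is made up for illustration: a few UTF-8 snowmen followed by a quotation encoded as Windows-1252.

```python
from bs4 import UnicodeDammit

# Build a deliberately mixed document: UTF-8 snowmen followed by a
# Windows-1252 encoded quotation.
snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# detwingle() rewrites the Windows-1252 bytes as UTF-8, so the whole
# document can then be decoded as UTF-8.
fixed = UnicodeDammit.detwingle(doc)
decoded = fixed.decode("utf8")
print(decoded)
```

Note that detwingle() returns bytes, so it is meant to be called before the data is handed to BeautifulSoup or UnicodeDammit for decoding.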

= 4.0.5 (20120427) =

* Added a new method, wrap(), which wraps an element in a tag.

* Renamed replace_with_children() to unwrap(), which is easier to
  understand and also the jQuery name of the function.

* Made encoding substitution in <meta> tags completely transparent (no
  more %SOUP-ENCODING%).

* Fixed a bug in decoding data that contained a byte-order mark, such
  as data encoded in UTF-16LE. [bug=988980]

* Fixed a bug that made the HTMLParser treebuilder generate XML
  definitions ending with two question marks instead of
  one. [bug=984258]

* Upon document generation, CData objects are no longer run through
  the formatter. [bug=988905]

* The test suite now passes when lxml is not installed, whether or not
  html5lib is installed. [bug=987004]

* Print a warning on HTMLParseErrors to let people know they should
  install a better parser library.
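
The wrap() and unwrap() entries above are easiest to see side by side. This is a minimal sketch with made-up markup, assuming the built-in html.parser builder:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# wrap() puts an element inside a newly created tag.
soup.p.string.wrap(soup.new_tag("b"))
wrapped = str(soup)

# unwrap() (formerly replace_with_children()) removes a tag but keeps
# its contents in place.
soup.b.unwrap()
unwrapped = str(soup)

print(wrapped)
print(unwrapped)
```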

= 4.0.4 (20120416) =

* Fixed a bug that sometimes created disconnected trees.

* Fixed a bug with the string setter that moved a string around the
  tree instead of copying it. [bug=983050]

* Attribute values are now run through the provided output formatter.
  Previously they were always run through the 'minimal' formatter. In
  the future I may make it possible to specify different formatters
  for attribute values and strings, but for now, consistent behavior
  is better than inconsistent behavior. [bug=980237]

* Added the missing renderContents method from Beautiful Soup 3. Also
  added an encode_contents() method to go along with decode_contents().

* Give a more useful error when the user tries to run the Python 2
  version of BS under Python 3.

* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
  UnicodeDammit(markup, smart_quotes_to="ascii").
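
A short sketch of the smart_quotes_to="ascii" option mentioned above. The markup is a made-up Windows-1252 byte string, and the encoding is passed explicitly so the example does not depend on encoding detection.

```python
from bs4 import UnicodeDammit

# Windows-1252 bytes containing curly double quotes and a curly apostrophe.
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# Tell Unicode, Dammit which encoding to try, and ask for ASCII quotes.
dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
converted = dammit.unicode_markup
print(converted)
```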

= 4.0.3 (20120403) =

* Fixed a typo that caused some versions of Python 3 to convert the
  Beautiful Soup codebase incorrectly.

* Got rid of the 4.0.2 workaround for HTML documents--it was
  unnecessary and the workaround was triggering a (possibly different,
  but related) bug in lxml. [bug=972466]

= 4.0.2 (20120326) =

* Worked around a possible bug in lxml that prevents non-tiny XML
  documents from being parsed. [bug=963880, bug=963936]

* Fixed a bug where specifying `text` while also searching for a tag
  only worked if `text` wanted an exact string match. [bug=955942]
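
To illustrate the `text` fix above, here is a sketch that combines a tag name with a non-exact (regular expression) match. It assumes a current Beautiful Soup release, where this argument is spelled `string`; the entry above uses the older name `text`. The markup is made up.

```python
import re

from bs4 import BeautifulSoup

html = '<a href="/a">Click here</a><a href="/b">Read more</a><p>Click there</p>'
soup = BeautifulSoup(html, "html.parser")

# Restrict by tag name and by a regex match against the tag's .string.
links = soup.find_all("a", string=re.compile("Click"))
hrefs = [a["href"] for a in links]
print(hrefs)
```

The <p> tag also contains "Click", but it is filtered out by the tag name.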

= 4.0.1 (20120314) =

* This is the first official release of Beautiful Soup 4. There is no
  4.0.0 release, to eliminate any possibility that packaging software
  might treat "4.0.0" as being an earlier version than "4.0.0b10".

* Brought BS up to date with the latest release of soupselect, adding
  CSS selector support for direct descendant matches and multiple CSS
  class matches.
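
The two selector forms named above (direct descendant and multiple classes) look like this in practice. This is a sketch against a current Beautiful Soup install, where select() is backed by the soupsieve package rather than the original soupselect port; the markup and class names are invented.

```python
from bs4 import BeautifulSoup

html = (
    '<div>'
    '<p class="intro lead">A</p>'
    '<section><p class="intro">B</p></section>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Direct descendant: only <p> tags that are immediate children of <div>.
direct = [p.get_text() for p in soup.select("div > p")]

# Multiple classes: <p> tags that carry both "intro" and "lead".
both = [p.get_text() for p in soup.select("p.intro.lead")]

print(direct, both)
```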

= 4.0.0b10 (20120302) =

* Added support for simple CSS selectors, taken from the soupselect project.

* Fixed a crash when using html5lib. [bug=943246]

* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
  attribute is now replaced with the appropriate encoding on
  output. [bug=942714]

* Fixed a bug that caused calling a tag to sometimes call find_all()
  with the wrong arguments. [bug=944426]

* For backwards compatibility, brought back the BeautifulStoneSoup
  class as a deprecated wrapper around BeautifulSoup.
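
The <meta charset> entry above can be observed with a sketch like this one. It assumes a current Beautiful Soup release and the html.parser builder; the declared charset in the input is arbitrary.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<meta charset="iso-8859-1"/>', "html.parser")

# On output, the declared charset follows the encoding actually used.
as_utf8 = soup.encode("utf-8")
as_latin1 = soup.encode("latin-1")

print(as_utf8)
print(as_latin1)
```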

= 4.0.0b9 (20120228) =

* Fixed the string representation of DOCTYPEs that have both a public
  ID and a system ID.

* Fixed the generated XML declaration.

* Renamed Tag.nsprefix to Tag.prefix, for consistency with
  NamespacedAttribute.

* Fixed a test failure that occurred on Python 3.x when chardet was
  installed.

* Made prettify() return Unicode by default, so it will look nice on
  Python 3 when passed into print().

= 4.0.0b8 (20120224) =

* All tree builders now preserve namespace information in the
  documents they parse. If you use the html5lib parser or lxml's XML
  parser, you can access the namespace URL for a tag as tag.namespace.

  However, there is no special support for namespace-oriented
  searching or tree manipulation. When you search the tree, you need
  to use namespace prefixes exactly as they're used in the original
  document.

* The string representation of a DOCTYPE always ends in a newline.

* Issue a warning if the user tries to use a SoupStrainer in
  conjunction with the html5lib tree builder, which doesn't support
  them.

= 4.0.0b7 (20120223) =

* Upon decoding to string, any characters that can't be represented in
  your chosen encoding will be converted into numeric XML entity
  references.

* Issue a warning if characters were replaced with REPLACEMENT
  CHARACTER during Unicode conversion.

* Restored compatibility with Python 2.6.

* The install process no longer installs docs or auxiliary text files.

* It's now possible to deepcopy a BeautifulSoup object created with
  Python's built-in HTML parser.

* About 100 unit tests that "test" the behavior of various parsers on
  invalid markup have been removed. Legitimate changes to those
  parsers caused these tests to fail, indicating that perhaps
  Beautiful Soup should not test the behavior of foreign
  libraries.

  The problematic unit tests have been reformulated as informational
  comparisons generated by the script
  scripts/demonstrate_parser_differences.py.

  This makes Beautiful Soup compatible with html5lib version 0.95 and
  future versions of HTMLParser.
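
The first entry in this release (numeric XML entity references for unrepresentable characters) can be seen by encoding a snowman into encodings that have no snowman. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>\N{SNOWMAN}</b>", "html.parser")

# Neither Latin-1 nor ASCII can represent U+2603, so it is written as a
# numeric character reference instead of being dropped.
latin1 = soup.b.encode("latin-1")
ascii_bytes = soup.b.encode("ascii")

print(latin1)
print(ascii_bytes)
```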

= 4.0.0b6 (20120216) =

* Multi-valued attributes like "class" always have a list of values,
  even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
  empty-element tag. This may come back if I add a special XHTML mode
  (http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
  useless.

* Passing text along with tag-specific arguments to a find* method:

   find("a", text="Click here")

  will find tags that contain the given text as their
  .string. Previously, the tag-specific arguments were ignored and
  only strings were searched.

* Fixed a bug that caused the html5lib tree builder to build a
  partially disconnected tree. Generally cleaned up the html5lib tree
  builder.

* If you restrict a multi-valued attribute like "class" to a string
  that contains spaces, Beautiful Soup will only consider it a match
  if the values correspond to that specific string.
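
The multi-valued attribute entries in this release (the first and the last) can be sketched as follows. The example assumes a current Beautiful Soup release and uses the class_ keyword, which was added after this beta; the markup is invented.

```python
from bs4 import BeautifulSoup

html = '<p class="body strikeout">one</p><p class="body">two</p>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

# "class" is always a list, even when there is only one value.
single = second["class"]
multiple = first["class"]

# A search string containing spaces must match the attribute value exactly,
# so the order of the classes matters.
exact = soup.find_all("p", class_="body strikeout")
reordered = soup.find_all("p", class_="strikeout body")

print(single, multiple, len(exact), len(reordered))
```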

= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
  belonging to multiple CSS classes is treated as having a list of
  values for the 'class' attribute. Searching for a CSS class will
  match *any* of the CSS classes.

  This actually affects all attributes that the HTML standard defines
  as taking multiple values (class, rel, rev, archive, accept-charset,
  and headers), but 'class' is by far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
  to one of the find* methods, it'll assume you want to use that
  object to search against a tag's CSS classes. Previously this only
  worked if you passed in a string.

* Fixed a bug that caused a crash when you passed a dictionary as an
  attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
  like <meta charset="utf-8" />. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
  page, it will try each of its guesses again, with errors="replace"
  instead of errors="strict". This may mean that some data gets
  replaced with REPLACEMENT CHARACTER, but at least most of it will
  get turned into Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
  on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
  empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
  begin with.
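
A small sketch of the CSS class behavior described in the first two entries of this release: a class search matches any one of a tag's classes, and a non-dictionary second argument to a find* method is taken as a CSS class. The markup is invented and the example assumes a current Beautiful Soup release.

```python
from bs4 import BeautifulSoup

html = '<p class="body strikeout">one</p><p class="body">two</p><p>three</p>'
soup = BeautifulSoup(html, "html.parser")

# The second positional argument is a string, not a dictionary, so it is
# treated as a CSS class to search for.
struck = soup.find_all("p", "strikeout")
bodies = soup.find_all("p", "body")

struck_text = [p.get_text() for p in struck]
body_text = [p.get_text() for p in bodies]
print(struck_text, body_text)
```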

= 4.0.0b4 (20120208) =

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag()

* BeautifulSoup.new_tag() will follow the rules of whatever
  tree-builder was used to create the original BeautifulSoup object. A
  new <p> tag will look like "<p />" if the soup object was created to
  parse XML, but it will look like "<p></p>" if the soup object was
  created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
  improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
  strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
  more general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
  entities unless the user prevents it.

* Added PageElement.insert_before() and PageElement.insert_after(),
  which let you put an element into the parse tree with respect to
  some other element.

* Raise an exception when the user tries to do something nonsensical
  like insert a tag into itself.
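
The new_tag(), new_string(), insert_before() and insert_after() entries above fit together as in this sketch. The markup and inserted text are invented; the html.parser builder is assumed, so a new empty tag renders in the HTML style.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>leave</b></p>", "html.parser")

# Create a new tag and a new string that belong to this soup.
emphasis = soup.new_tag("i")
emphasis.string = "Don't"
suffix = soup.new_string(" now")

# Place them relative to the existing <b> tag.
soup.b.insert_before(emphasis)
soup.b.insert_after(suffix)
result = str(soup)

# With an HTML builder, an empty new tag is written as an open/close pair.
empty_p = str(soup.new_tag("p"))
print(result, empty_p)
```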


= 4.0.0b3 (20120203) =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser (html.parser in Python 3)
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

    >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

    <?xml version="1.0" encoding="utf-8"?>
    <markup>
     ...
    </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.
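
Setting .is_xml by hand, as described above, might look like the following sketch. It assumes a current Beautiful Soup release and the html.parser builder; the element name is taken from the sample output above and has no special meaning.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<markup>text</markup>", "html.parser")

# html.parser is an HTML builder, so no XML declaration is written.
default_flag = soup.is_xml
plain = soup.decode()

# Set the flag manually to get the declaration on output.
soup.is_xml = True
with_declaration = soup.decode()

print(plain)
print(with_declaration)
```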


= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.

= 3.1.0 =

A hybrid version that supports 2.4 and can be automatically converted
to run under Python 3.0. There are three backwards-incompatible
changes you should be aware of, but no new features or deliberate
behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a byte string and decode() to get Unicode,
and you'll be ready (well, readier) for Python 3.

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

  <a href="foo</a>, </a><a href="bar">baz</a>
  <a b="<a>">', '<a b="&lt;a&gt;"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

 <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.
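
The encode()/decode() advice in point 1 above carries over to Beautiful Soup 4. This sketch uses the bs4 API on Python 3 rather than the 3.1.0 release itself, and the markup is invented:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")

# encode() produces a byte string in the encoding you ask for.
as_bytes = soup.encode("utf-8")

# decode() produces a Unicode string.
as_text = soup.decode()

print(as_bytes)
print(as_text)
```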


= 3.0.7a =

Added an import that makes BS work in Python 2.3.


= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.


= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.
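
The difference between the extract() and decompose() entries above can be sketched as follows. The example uses the Beautiful Soup 4 API rather than the 3.0.6 release, with invented markup:

```python
from bs4 import BeautifulSoup

html = '<div><p id="a">one</p><p id="b">two</p></div>'
soup = BeautifulSoup(html, "html.parser")

# extract() removes the tag from the tree and hands it back to you.
removed = soup.find(id="a").extract()
removed_markup = str(removed)

# decompose() removes the tag and destroys it; nothing useful is returned.
soup.find(id="b").decompose()
remaining = str(soup)

print(removed_markup, remaining)
```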


= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)
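
Pickling and deep-copying soup objects, the first entry in this release, still works in Beautiful Soup 4. The sketch below uses the bs4 API with the html.parser builder and invented markup; it is not a test of the 3.0.5 release itself.

```python
import copy
import pickle

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# deepcopy gives an independent tree; changing the copy leaves the
# original alone.
clone = copy.deepcopy(soup)
clone.b.string = "there"

# A pickle round trip should reproduce the original markup.
restored = pickle.loads(pickle.dumps(soup))

print(soup, clone, restored)
```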

= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.

= 3.0.3 (20060606) =

Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be
sure to pass in an appropriate value for convertEntities, or XML/HTML
entities might stick around that aren't valid in HTML/XML). The result
may not validate, but it should be good enough to not choke a
real-world XML parser. Specifically, the output of a properly
constructed soup object should always be valid as part of an XML
document, but parts may be missing if they were missing in the
original. As always, if the input is valid XML, the output will also
be valid.

= 3.0.2 (20060602) =

Previously, Beautiful Soup correctly handled attribute values that
contained embedded quotes (sometimes by escaping), but not other kinds
of XML character. Now, it correctly handles or escapes all special XML
characters in attribute values.

I aliased methods to the 2.x names (fetch, find, findText, etc.) for
backwards compatibility purposes. Those names are deprecated and if I
ever do a 4.0 I will remove them. I will, I tell you!

Fixed a bug where the findAll method wasn't passing along any keyword
arguments.

When run from the command line, Beautiful Soup now acts as an HTML
pretty-printer, not an XML pretty-printer.

= 3.0.1 (20060530) =

Reintroduced the "fetch by CSS class" shortcut. I thought keyword
arguments would replace it, but they don't. You can't call soup('a',
class='foo') because class is a Python keyword.

If Beautiful Soup encounters a meta tag that declares the encoding,
but a SoupStrainer tells it not to parse that tag, Beautiful Soup will
no longer try to rewrite the meta tag to mention the new
encoding. Basically, this makes SoupStrainers work in real-world
applications instead of crashing the parser.

= 3.0.0 "Who would not give all else for two p" (20060528) =

This release is not backward-compatible with previous releases. If
you've got code written with a previous version of the library, go
ahead and keep using it, unless one of the features mentioned here
really makes your life easier. Since the library is self-contained,
you can include an old copy of the library in your old applications,
and use the new version for everything else.

The documentation has been rewritten and greatly expanded with many
more examples.

Beautiful Soup autodetects the encoding of a document (or uses the one
you specify), and converts it from its native encoding to
Unicode. Internally, it only deals with Unicode strings. When you
print out the document, it converts to UTF-8 (or another encoding you
specify). [Doc reference]

It's now easy to make large-scale changes to the parse tree without
screwing up the navigation members. The methods are extract,
replaceWith, and insert. [Doc reference. See also Improving Memory
Usage with extract]

Passing True in as an attribute value gives you tags that have any
value for that attribute. You don't have to create a regular
expression. Passing None for an attribute value gives you tags that
don't have that attribute at all.
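
Passing True as an attribute value, as described above, looks like this with the Beautiful Soup 4 spelling of the search method (find_all rather than the 3.0.0 name findAll). The markup is invented, and the sketch covers only the True case.

```python
from bs4 import BeautifulSoup

html = '<a id="first" href="/x">x</a><a href="/y">y</a><a id="third">z</a>'
soup = BeautifulSoup(html, "html.parser")

# True matches any tag that has the attribute, whatever its value.
with_id = soup.find_all("a", id=True)
ids = [a["id"] for a in with_id]

with_href = soup.find_all("a", href=True)
hrefs = [a["href"] for a in with_href]

print(ids, hrefs)
```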

Tag objects now know whether or not they're self-closing. This avoids
the problem where Beautiful Soup thought that tags like <BR /> were
self-closing even in XML documents. You can customize the self-closing
tags for a parser object by passing them in as a list of
selfClosingTags: you don't have to subclass anymore.

There's a new built-in parser, MinimalSoup, which has most of
BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc
reference]

You can use a SoupStrainer to tell Beautiful Soup to parse only part
of a document. This saves time and memory, often making Beautiful Soup
about as fast as a custom-built SGMLParser subclass. [Doc reference,
SoupStrainer reference]
1379
1380
You can (usually) use keyword arguments instead of passing a
1381
dictionary of attributes to a search method. That is, you can replace
1382
soup(args={"id" : "5"}) with soup(id="5"). You can still use args if
1383
(for instance) you need to find an attribute whose name clashes with
1384
the name of an argument to findAll. [Doc reference: **kwargs attrs]

The method names have changed to the better method names used in
Rubyful Soup. Instead of find methods and fetch methods, there are
only find methods. Instead of a scheme where you can't remember which
method finds one element and which one finds them all, we have find
and findAll. In general, if the method name mentions All or a plural
noun (e.g. findNextSiblings), then it finds many elements. Otherwise,
it finds only one element. [Doc reference]

Some of the arguments have been renamed for clarity. For instance,
avoidParserProblems is now parserMassage.

Beautiful Soup no longer implements a feed method. You need to pass a
string or a filehandle into the soup constructor, rather than calling
feed after the soup has been created. There is still a feed method,
but it's the one implemented by SGMLParser, and calling it will bypass
Beautiful Soup and cause problems.

The NavigableText class has been renamed to NavigableString. There is
no NavigableUnicodeString anymore, because every string inside a
Beautiful Soup parse tree is a Unicode string.

findText and fetchText are gone. Just pass a text argument into find
or findAll.

Null was more trouble than it was worth, so I got rid of it. Anything
that used to return Null now returns None.

Special XML constructs like comments and CDATA now have their own
NavigableString subclasses, instead of being treated as oddly-formed
data. If you parse a document that contains CDATA and write it back
out, the CDATA will still be there.

When you're parsing a document, you can get Beautiful Soup to convert
XML or HTML entities into the corresponding Unicode characters. [Doc
reference]

= 2.1.1 (20050918) =

Fixed a serious performance bug in BeautifulStoneSoup which was
causing parsing to be incredibly slow.

Corrected several entities that were previously being incorrectly
translated from Microsoft smart-quote-like characters.

Fixed a bug that was breaking text fetch.

Fixed a bug that crashed the parser when text chunks that look like
HTML tag names showed up within a SCRIPT tag.

THEAD, TBODY, and TFOOT tags are now nestable within TABLE
tags. Nested tables should parse more sensibly now.

BASE is now considered a self-closing tag.

= 2.1.0 "Game, or any other dish?" (20050504) =

Added a wide variety of new search methods which, given a starting
point inside the tree, follow a particular navigation member (like
nextSibling) over and over again, looking for Tag and NavigableText
objects that match certain criteria. The new methods are findNext,
fetchNext, findPrevious, fetchPrevious, findNextSibling,
fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
findParent, and fetchParents. All of these use the same basic code
used by first and fetch, so you can pass your weird ways of matching
things into these methods.

The fetch method and its derivatives now accept a limit argument.

You can now pass keyword arguments when calling a Tag object as though
it were a method.

Fixed a bug that caused all hand-created tags to share a single set of
attributes.

= 2.0.3 (20050501) =

Fixed Python 2.2 support for iterators.

Fixed a bug that gave the wrong representation to tags within quote
tags like <script>.

Took some code from Mark Pilgrim that treats CDATA declarations as
data instead of ignoring them.

Beautiful Soup's setup.py will now do an install even if the unit
tests fail. It won't build a source distribution if the unit tests
fail, so I can't release a new version unless they pass.

= 2.0.2 (20050416) =

Added the unit tests in a separate module, and packaged it with
distutils.

Fixed a bug that sometimes caused renderContents() to return a Unicode
string even if there was no Unicode in the original string.

Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.

Reinstated some backwards compatibility with 1.x versions: referencing
the string member of a NavigableText object returns the NavigableText
object instead of throwing an error.

= 2.0.1 (20050412) =

Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, eg. tag.table.td.

Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.

Fixed a bug in the comparison operator.

= 2.0.0 "Who cares for fish?" (20050410) =

Beautiful Soup version 1 was very useful but also pretty stupid. I
originally wrote it without noticing any of the problems inherent in
trying to build a parse tree out of ambiguous HTML tags. This version
solves all of those problems to my satisfaction. It also adds many new
clever things to make up for the removal of the stupid things.

== Parsing ==

The parser logic has been greatly improved, and the BeautifulSoup
class should much more reliably yield a parse tree that looks like
what the page author intended. For a particular class of odd edge
cases that now causes problems, there is a new class,
ICantBelieveItsBeautifulSoup.

By default, Beautiful Soup now performs some cleanup operations on
text before parsing it. This is to avoid common problems with bad
definitions and self-closing tags that crash SGMLParser. You can
provide your own set of cleanup operations, or turn it off
altogether. The cleanup operations include fixing self-closing tags
that don't close, and replacing Microsoft smart quotes and similar
characters with their HTML entity equivalents.

You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.

== Strings and Unicode ==

There are separate NavigableText subclasses for ASCII and Unicode
strings. These classes directly subclass the corresponding base data
types. This means you can treat NavigableText objects as strings
instead of having to call methods on them to get the strings.

str() on a Tag always returns a string, and unicode() always returns
Unicode. Previously it was inconsistent.

== Tree traversal ==

In a first() or fetch() call, the tag name or the desired value of an
attribute can now be any of the following:

 * A string (matches that specific tag or that specific attribute value)
 * A list of strings (matches any tag or attribute value in the list)
 * A compiled regular expression object (matches any tag or attribute
   value that matches the regular expression)
 * A callable object that takes the Tag object or attribute value as a
   string. It returns None/false/empty string if the given string
   doesn't match, and any other value if it does.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I took out
SQL-style wildcards. I'll put them back if someone complains, but
their removal simplifies the code a lot.
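
The four kinds of criteria listed above boil down to one dispatch
function. The sketch below is an illustration rather than the library's
implementation: matches is a hypothetical name, it only handles string
values (in 2.0 a callable used as a tag-name criterion receives the Tag
object itself), and the use of search() rather than match() for regular
expressions is an assumption.

```python
import re


def matches(criterion, value):
    """Check a tag name or attribute value against one criterion."""
    if isinstance(criterion, str):
        return value == criterion                  # exact string
    if isinstance(criterion, (list, tuple)):
        return value in criterion                  # any string in the list
    if hasattr(criterion, "search"):
        return criterion.search(value) is not None  # compiled regex
    if callable(criterion):
        # None/False/"" mean "no match"; anything else is a match.
        return bool(criterion(value))
    raise TypeError("unsupported criterion: %r" % (criterion,))


heading = re.compile(r"^h[1-6]$")
names = ["h1", "p", "td", "h3", "table"]
headings = [n for n in names if matches(heading, n)]
cells = [n for n in names if matches(["td", "th"], n)]
short = [n for n in names if matches(lambda v: len(v) == 1, n)]
```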

You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.

If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.
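
In other words, the attrs argument gets normalized before matching. A
minimal sketch of that normalization, assuming a hypothetical helper
called normalize_attrs (the library's internals may differ):

```python
def normalize_attrs(attrs):
    """Turn the attrs argument into a dict of attribute criteria."""
    if attrs is None:
        return {}
    if isinstance(attrs, dict):
        return dict(attrs)
    # Anything that is not a map is treated as a criterion for "class".
    return {"class": attrs}


by_class = normalize_attrs("sidebar")
by_id = normalize_attrs({"id": "5"})
```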

1.x and 2.x both let you call a Tag object as a shorthand for
fetch(). For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
traverse a tree in very little code:

  for header in soup.bodyTag.pTag.tableTag('th'):

If an element relationship (like parent or next) doesn't apply to a
tag, it'll now show up as Null instead of None. first() will also
return Null if you ask it for a nonexistent tag. Null is an object
that's just like None, except you can do whatever you want to it and
it'll give you Null instead of throwing an error.

This lets you do tree traversals like soup.htmlTag.headTag.titleTag
without having to worry if the intermediate stages are actually
there. Previously, if there was no 'head' tag in the document, headTag
in that instance would have been None, and accessing its 'titleTag'
member would have thrown an AttributeError. Now, you can get what you
want when it exists, and get Null when it doesn't, without having to
do a lot of conditionals checking to see if every stage is None.
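
The idea behind Null is the classic null-object pattern. Here is a
minimal sketch of it, together with a toy Tag class that supports the
fooTag shorthand. Both classes are simplified stand-ins written for this
example (NullType is a made-up name), not the 2.0 source code.

```python
class NullType:
    """Falsy placeholder that absorbs attribute access, calls and indexing."""

    def __getattr__(self, name):
        if name.startswith("__"):
            raise AttributeError(name)  # leave special methods alone
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __getitem__(self, key):
        return self

    def __bool__(self):
        return False

    def __repr__(self):
        return "Null"


Null = NullType()


class Tag:
    """Toy tag: a name plus a list of child tags."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def first(self, name):
        """Return the first descendant tag with this name, or Null."""
        for child in self.children:
            if child.name == name:
                return child
            found = child.first(name)
            if found is not Null:
                return found
        return Null

    def __getattr__(self, attr):
        # Only reached when normal lookup fails: foo.barTag -> foo.first("bar")
        if attr.endswith("Tag") and not attr.startswith("__"):
            return self.first(attr[:-3])
        raise AttributeError(attr)


soup = Tag("[document]", [Tag("html", [Tag("body", [Tag("p")])])])
title = soup.htmlTag.headTag.titleTag     # no 'head' tag: Null all the way
paragraph = soup.htmlTag.bodyTag.pTag     # every stage exists
```

Because Null is falsy, "if title:" is still enough to tell a real tag
from a missing one at the end of the chain.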

There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:

  <p><ul><li>Foo<br /><li>Bar</ul>

The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.
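
To make those relationships concrete, here is a small hand-built tree
with the same shape as the parse described above. Node is a hypothetical
class invented for this sketch, text is modeled as just another node,
and None stands in for Null to keep the example short.

```python
class Node:
    """Toy page element that tracks its parent and its siblings."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.contents = []
        self.previousSibling = None
        self.nextSibling = None

    def append(self, child):
        """Add a child and link it to the sibling before it."""
        child.parent = self
        if self.contents:
            last = self.contents[-1]
            last.nextSibling = child
            child.previousSibling = last
        self.contents.append(child)
        return child


# Same shape as the parse of <p><ul><li>Foo<br /><li>Bar</ul>
root = Node("[document]")
p = root.append(Node("p"))
ul = root.append(Node("ul"))
li1 = ul.append(Node("li"))
foo = li1.append(Node("Foo"))   # text node
br = li1.append(Node("br"))
li2 = ul.append(Node("li"))
bar = li2.append(Node("Bar"))   # text node
```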

I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well. It was
really difficult to use, I never used it, and I don't think anyone
else ever used it. To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.

== Tree manipulation ==

You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.

== Porting Considerations ==

There are three changes in 2.0 that break old code:

In the post-1.2 release you could pass a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.

It's no longer possible to pass SQL-style wildcards to fetch(). Use a
regular expression instead.
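
If you have a lot of old wildcard patterns, a small converter can do
the translation for you. The sketch below assumes that '%' (any run of
characters) is the only wildcard your patterns use; wildcard_to_regex is
a hypothetical helper, not part of Beautiful Soup.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a '%'-style wildcard pattern into an anchored regex."""
    # Escape each literal piece, then join the pieces with ".*".
    pieces = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile(r"\A" + ".*".join(pieces) + r"\Z", re.DOTALL)


ends_with_foo = wildcard_to_regex("%foo")
starts_with_nav = wildcard_to_regex("nav%")
literal_dot = wildcard_to_regex("a.b")
```

The compiled object can then be passed anywhere a regular expression is
accepted as a tag name or attribute value.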
1623
1624
The different parsing algorithm means the parse tree may not be shaped
1625
like you expect. This will only actually affect you if your code uses
1626
one of the affected parts. I haven't run into this problem yet while
1627
porting my code.

= Between 1.2 and 2.0 =

This is the release to get if you want Python 1.5 compatibility.

The desired value of an attribute can now be any of the following:

 * A string
 * A string with SQL-style wildcards
 * A compiled RE object
 * A callable that returns None/false/empty string if the given value
   doesn't match, and any other value otherwise.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I no longer
recommend you use SQL-style wildcards. They may go away in a future
release to clean up the code.

Made Beautiful Soup handle processing instructions as text instead of
ignoring them.

Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.
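
Roughly, the shorthand behaves like the sketch below. The Text and Tag
classes are toy stand-ins written for this example, and reading "only
one string-owning child" as "exactly one child, which itself has a
string" is an assumption about the patch.

```python
class Text(str):
    """Toy text node: its string is itself."""

    @property
    def string(self):
        return self


class Tag:
    """Toy tag holding a list of Text and Tag children."""

    def __init__(self, name, *contents):
        self.name = name
        self.contents = list(contents)

    @property
    def string(self):
        # Shorthand for contents[0].string, but only for a single child.
        if len(self.contents) == 1:
            return self.contents[0].string
        return None


bold = Tag("b", Text("hi"))
wrapped = Tag("p", Tag("b", Text("hi")))
mixed = Tag("p", Text("a"), Tag("br"))
```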

Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.

Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".

= 1.2 "Who for such dainties would not stoop?" (20040708) =

Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.

Made BeautifulStoneSoup even dumber by making it not implicitly close
a tag when another tag of the same type is encountered; only when an
actual closing tag is encountered. This change courtesy of Fuzzy (mike
at pcblokes dot com). BeautifulSoup still works as before.

= 1.1 "Swimming in a hot tureen" =

Added more 'nestable' tags. Changed popping semantics so that when a
nestable tag is encountered, tags are popped up to the previously
encountered nestable tag (of whatever kind). I will revert this if
enough people complain, but it should make more people's lives easier
than harder. This enhancement was suggested by Anthony Baxter (anthony
at interlink dot com dot au).

= 1.0 "So rich and green" (20040420) =

Initial release.