BeautifulSoup_ is a Python package that parses broken HTML, just as
lxml does using the parser of libxml2. BeautifulSoup takes a different
parsing approach: it is not a real HTML parser but uses regular
expressions to dive through tag soup. It is therefore more forgiving
in some cases and less capable in others. It is not uncommon for
lxml/libxml2 to parse and fix broken HTML better, but BeautifulSoup
has superior `support for encoding detection`_. Which parser works
better depends very much on the input.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _`support for encoding detection`: http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode%2C%20Dammit
.. _ElementSoup: http://effbot.org/zone/element-soup.htm

To spare users from having to choose their parser library in advance,
lxml can interface to the parsing capabilities of BeautifulSoup
through the ``lxml.html.soupparser`` module. It provides three main
functions: ``fromstring()`` and ``parse()`` parse a string or file
using BeautifulSoup into an ``lxml.html`` document, and
``convert_tree()`` converts an existing BeautifulSoup tree into a list
of top-level Elements.

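
As a rough orientation, the three functions might be called as in the
following sketch. It assumes a recent lxml with the ``beautifulsoup4``
package installed (an assumption, since the text above does not name a
BeautifulSoup version); the sample input and the variable names are
purely illustrative.

.. code-block:: python

    from io import StringIO

    from bs4 import BeautifulSoup   # assumption: beautifulsoup4 is installed
    from lxml.html import soupparser

    # Illustrative tag soup: unclosed tags and an unquoted attribute value.
    tag_soup = "<meta><head><title>Hello</head><body onload=crash()>Hi all<p>"

    # fromstring(): parse a string and get the root Element back.
    root = soupparser.fromstring(tag_soup)

    # parse(): parse a file name or file-like object and get an ElementTree.
    tree = soupparser.parse(StringIO(tag_soup))

    # convert_tree(): convert an already parsed BeautifulSoup tree
    # into a list of top-level Elements.
    soup = BeautifulSoup(tag_soup, "html.parser")
    elements = soupparser.convert_tree(soup)

    print(root.tag, tree.getroot().tag, len(elements))

Both ``fromstring()`` and ``parse()`` wrap the result in a single
``html`` root element even when the input has none, whereas
``convert_tree()`` hands back the top-level elements as a plain list.
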
Parsing with the soupparser