~facelessuser/beautifulsoup/lxml-fix : revision 361

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<head>

<title>Beautiful Soup 4.2.0 Documentation — Beautiful Soup 4.2.0 documentation</title>

var DOCUMENTATION_OPTIONS = {

16

URL_ROOT: './',

17

VERSION: '4.2.0',

18

COLLAPSE_INDEX: false,

19

FILE_SUFFIX: '.html',

20

HAS_SOURCE: true

21

};

22

</script>

</head>

28

<body>

29

30

<h3>Navigation</h3>

31

<ul>

32

33

<a href="genindex.html" title="General Index"

34

accesskey="I">index</a></li>

35

<li><a href="index.html">Beautiful Soup 4.2.0 documentation</a> »</li>

36

</ul>

37

</div>

<h1>Beautiful Soup 4.2.0 Documentation<a class="headerlink" href="#beautiful-soup-4-2-0" title="Permalink to this headline">¶</a></h1>

46

47

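<a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document. It commonly saves programmers hours or even days of work.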

48

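This document covers all the major features of Beautiful Soup 4, with short examples. It shows what the library is good for, how it works, how to use it, how to make it do what you want, and how to handle unexpected situations.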

49

The examples in this document produce the same results in Python 2.7 and Python 3.2.

50

You might be looking for the documentation for <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">Beautiful Soup 3</a>. Beautiful Soup 3 is no longer being developed, and Beautiful Soup 4 is recommended for current projects; see <a class="reference external" href="http://www.baidu.com">Porting code to BS4</a>.

51

52

53

If you have questions about Beautiful Soup, send mail to the <a class="reference external" href="https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup">discussion group</a>. If your problem involves a piece of HTML that needs to be parsed, be sure to include the <a class="reference internal" href="#id60">diagnostic output</a> for that document in your question. <a class="footnote-reference" href="#id82" id="id3">[1]</a>

54

</div>

55

</div>

56

57

58

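The HTML document below is used as an example throughout this documentation. It is an excerpt from Alice in Wonderland (referred to from here on as the "Alice" document):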

59

<div class="highlight-python"><div class="highlight"><pre>html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

72

</pre></div>

73

</div>

74

Running the "Alice" document through Beautiful Soup gives us a <tt class="docutils literal">BeautifulSoup</tt> object, which can print the document with standard indentation:

75

<div class="highlight-python"><div class="highlight"><pre>from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

111

</pre></div>

112

</div>

113

Here are some simple ways to navigate that data structure:

114

<div class="highlight-python"><div class="highlight"><pre>soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

142

</pre></div>

143

</div>

144

One common task is extracting all the URLs found within a page's <a> tags:

145

<div class="highlight-python"><div class="highlight"><pre>for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

150

</pre></div>

151

</div>
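In real pages some <a> tags carry no href at all. The sketch below is one way to collect only the links that exist; the sample markup and the variable names are made up for illustration, and the parser is named explicitly as "html.parser" so that no third-party parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the middle <a> tag has no href attribute.
html = """
<p>
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a id="no-href">an anchor without a link</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# .get() returns None for a missing attribute, so those tags can be skipped.
urls = [a.get("href") for a in soup.find_all("a") if a.get("href") is not None]
print(urls)
```

Using .get() rather than link['href'] avoids a KeyError on tags that lack the attribute.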

152

Another common task is extracting all the text from a page:

153

<div class="highlight-python"><div class="highlight"><pre>print(soup.get_text())

154

# The Dormouse's story

155

#

156

# The Dormouse's story

157

#

158

# Once upon a time there were three little sisters; and their names were

159

# Elsie,

160

# Lacie and

161

# Tillie;

162

# and they lived at the bottom of a well.

163

#

164

# ...

165

</pre></div>

166

</div>

167

Does this look like what you need? If so, read on.

168

</div>

169

170

<h1>Installing Beautiful Soup<a class="headerlink" href="#id5" title="Permalink to this headline">¶</a></h1>

171

If you're using a recent version of Debian or Ubuntu, you can install Beautiful Soup with the system package manager:

172

<tt class="docutils literal">$ apt-get install python-bs4</tt>

173

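Beautiful Soup 4 is published through PyPI, so if you can't install it with the system package manager, you can install it with <tt class="docutils literal">easy_install</tt> or <tt class="docutils literal">pip</tt>. The package name is <tt class="docutils literal">beautifulsoup4</tt>, and the same package works on Python 2 and Python 3.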

174

<tt class="docutils literal">$ easy_install beautifulsoup4</tt>

175

<tt class="docutils literal">$ pip install beautifulsoup4</tt>

176

(PyPI also hosts a package named <tt class="docutils literal">BeautifulSoup</tt>, but that is probably not what you want. It is the <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">Beautiful Soup 3</a> release. Many projects still use BS3, so the <tt class="docutils literal">BeautifulSoup</tt> package remains available. If you're writing new code, you should install <tt class="docutils literal">beautifulsoup4</tt>.)

177

If you don't have <tt class="docutils literal">easy_install</tt> or <tt class="docutils literal">pip</tt> installed, you can <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/download/4.x/">download the BS4 source</a> and install it with setup.py:

178

<tt class="docutils literal">$ python setup.py install</tt>

179

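If none of these installation methods work, the license for Beautiful Soup allows you to package the BS4 code with your project, so you can use it without installing it at all.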

180

The author develops Beautiful Soup under Python 2.7 and Python 3.2; in principle it should work with all current Python versions.

181

182

<h2>Problems after installation<a class="headerlink" href="#id8" title="Permalink to this headline">¶</a></h2>

183

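Beautiful Soup is packaged as Python 2 code. When you install it under Python 3, the code is automatically converted to Python 3. If you skip the installation step, the code is not converted.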

184

If your code raises an <tt class="docutils literal">ImportError</tt> with the message "No module named HTMLParser", you are running the Python 2 version of the code under Python 3.

185

If your code raises an <tt class="docutils literal">ImportError</tt> with the message "No module named html.parser", you are running the Python 3 version of the code under Python 2.

186

In both cases, the best fix is to reinstall Beautiful Soup 4.
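The two ImportError messages come from the standard-library HTML parser living under a different module name in each major Python version. A minimal sketch of that difference follows; the helper name stdlib_html_parser_module is hypothetical and is not part of Beautiful Soup.

```python
import importlib
import sys


def stdlib_html_parser_module(major_version=None):
    """Return the module name of the standard-library HTML parser.

    Python 2 ships it as "HTMLParser"; Python 3 ships it as "html.parser".
    """
    if major_version is None:
        major_version = sys.version_info[0]
    return "HTMLParser" if major_version == 2 else "html.parser"


# Importing the name that matches the running interpreter succeeds; importing
# the other name is what produces the "No module named ..." errors.
module = importlib.import_module(stdlib_html_parser_module())
print(module.__name__)
```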

187

If you get a <tt class="docutils literal">SyntaxError</tt> "Invalid syntax" on the line ROOT_TAG_NAME = u'[document]', you need to convert the BS4 Python code from Python 2 to Python 3. You can do this by reinstalling BS4:

188

<tt class="docutils literal">$ python3 setup.py install</tt>

189

or by running Python's code conversion script (2to3) on the bs4 directory.

190

191

</div>

192

193

<h2>Installing a parser<a class="headerlink" href="#id9" title="Permalink to this headline">¶</a></h2>

194

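Beautiful Soup supports the HTML parser included in Python's standard library, and it also supports a number of third-party parsers. One of them is <a class="reference external" href="http://lxml.de/">lxml</a>. Depending on your operating system, you can install lxml with one of these commands: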

195

<tt class="docutils literal">$ apt-get install python-lxml</tt>

196

<tt class="docutils literal">$ easy_install lxml</tt>

197

<tt class="docutils literal">$ pip install lxml</tt>

198

Another alternative is the pure-Python <a class="reference external" href="http://code.google.com/p/html5lib/">html5lib</a> parser, which parses HTML the way a web browser does. You can install html5lib with one of these commands:

199

<tt class="docutils literal">$ apt-get install python-html5lib</tt>

200

<tt class="docutils literal">$ easy_install html5lib</tt>

201

<tt class="docutils literal">$ pip install html5lib</tt>

202

This table summarizes the main parsers and their advantages and disadvantages:

<table border="1" class="docutils">
<thead>
<tr class="row-odd"><th class="head">Parser</th>
<th class="head">Typical usage</th>
<th class="head">Advantages</th>
<th class="head">Disadvantages</th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td>Python standard library</td>
<td><tt class="docutils literal">BeautifulSoup(markup,
"html.parser")</tt></td>
<td><ul>
<li>Built into Python</li>
<li>Decent speed</li>
<li>Lenient with malformed documents</li>
</ul>
</td>
<td><ul>
<li>Not very lenient in versions before Python 2.7.3 or 3.2.2</li>
</ul>
</td>
</tr>
<tr class="row-odd"><td>lxml HTML parser</td>
<td><tt class="docutils literal">BeautifulSoup(markup,
"lxml")</tt></td>
<td><ul>
<li>Very fast</li>
<li>Lenient with malformed documents</li>
</ul>
</td>
<td><ul>
<li>Requires an external C library</li>
</ul>
</td>
</tr>
<tr class="row-even"><td>lxml XML parser</td>
<td><tt class="docutils literal">BeautifulSoup(markup,
["lxml", "xml"])</tt>
<tt class="docutils literal">BeautifulSoup(markup,
"xml")</tt>
</td>
<td><ul>
<li>Very fast</li>
<li>The only supported XML parser</li>
</ul>
</td>
<td><ul>
<li>Requires an external C library</li>
</ul>
</td>
</tr>
<tr class="row-odd"><td>html5lib</td>
<td><tt class="docutils literal">BeautifulSoup(markup,
"html5lib")</tt></td>
<td><ul>
<li>The most lenient parser</li>
<li>Parses documents the way a web browser does</li>
<li>Produces valid HTML5</li>
</ul>
</td>
<td><ul>
<li>Very slow</li>
<li>Requires an external Python package</li>
</ul>
</td>
</tr>
</tbody>
</table>

279

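lxml is the recommended parser because it is faster. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, you must install lxml or html5lib, because the HTML parser built into those versions of the standard library is not reliable enough.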

280

Note: if an HTML or XML document is not well-formed, different parsers may produce different results for it. See <a class="reference internal" href="#id49">Differences between parsers</a> for details.
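Because the result can depend on which parser is installed, it can help to choose the parser name explicitly rather than rely on whatever happens to be available. The following is only a sketch under that assumption: pick_parser is a hypothetical helper, not a Beautiful Soup function, and it checks for lxml with the standard library's importlib.util.find_spec.

```python
import importlib.util


def pick_parser(is_installed=None):
    """Return a parser name to pass as the second argument to BeautifulSoup().

    Prefers lxml (recommended above for speed) and falls back to the
    standard library's "html.parser". The is_installed callable is
    injectable so the choice can be exercised without installing anything.
    """
    if is_installed is None:
        def is_installed(name):
            return importlib.util.find_spec(name) is not None
    return "lxml" if is_installed("lxml") else "html.parser"


print(pick_parser())
```

The returned string would then be used as in BeautifulSoup(markup, pick_parser()), which keeps the choice of parser visible in one place.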

281

</div>

282

</div>

283

284

285

To parse a document, pass it into the BeautifulSoup constructor to get a document object. You can pass in a string or an open filehandle:

286

<div class="highlight-python"><div class="highlight"><pre>from bs4 import BeautifulSoup

287

288

soup = BeautifulSoup(open("index.html"))

289

290

soup = BeautifulSoup("<html>data</html>")

291

</pre></div>

292

</div>

293

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

294

<div class="highlight-python"><pre>BeautifulSoup("Sacr&eacute; bleu!")

295

<html><head></head><body>Sacré bleu!</body></html></pre>

296

</div>
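To see what converting an entity to a Unicode character means in isolation, the standard library's html.unescape performs the same kind of substitution on a plain string. This is only an illustration of the idea, not the code path Beautiful Soup itself uses.

```python
import html

# &eacute; is the HTML entity for the character é (U+00E9).
text = html.unescape("Sacr&eacute; bleu!")
print(text)
```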

297

Beautiful Soup then parses the document with the most suitable parser available; if you specify a parser explicitly, Beautiful Soup uses that one instead. (See <a class="reference internal" href="#xml">Parsing XML</a>.)

298

</div>

299

300

<h1>Kinds of objects<a class="headerlink" href="#id11" title="Permalink to this headline">¶</a></h1>

301

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is one of four kinds of object: <tt class="docutils literal">Tag</tt>, <tt class="docutils literal">NavigableString</tt>, <tt class="docutils literal">BeautifulSoup</tt>, and <tt class="docutils literal">Comment</tt>.

302

303

304

A <tt class="docutils literal">Tag</tt> object corresponds to an XML or HTML tag in the original document:

305

<div class="highlight-python"><div class="highlight"><pre>soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')

306

tag = soup.b

307

type(tag)

308

# <class 'bs4.element.Tag'>

309

</pre></div>

310

</div>

311

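Tags have many methods and attributes, which are covered in detail in <a class="reference internal" href="#id15">Navigating the tree</a> and <a class="reference internal" href="#id24">Searching the tree</a>. For now, the most important features of a tag are its name and attributes.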

312

313

314

Every tag has a name, accessible as <tt class="docutils literal">.name</tt>:

315

<div class="highlight-python"><div class="highlight"><pre>tag.name

316

# u'b'

317

</pre></div>

318

</div>

319

If you change a tag's name, the change is reflected in any HTML markup generated by the current Beautiful Soup object:

320

<div class="highlight-python"><div class="highlight"><pre>tag.name = "blockquote"

321

tag

322

# <blockquote class="boldest">Extremely bold</blockquote>

323

</pre></div>

324

</div>

325

</div>

326

327

<h3>Attributes<a class="headerlink" href="#attributes" title="Permalink to this headline">¶</a></h3>

328

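A tag may have any number of attributes. The tag <tt class="docutils literal"><b class="boldest"></tt> has an attribute "class" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary: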

329

<div class="highlight-python"><div class="highlight"><pre>tag['class']

330

# u'boldest'

331

</pre></div>

332

</div>

333

You can access that dictionary directly as <tt class="docutils literal">.attrs</tt>:

334

<div class="highlight-python"><div class="highlight"><pre>tag.attrs

335

# {u'class': u'boldest'}

336

</pre></div>

337

</div>

338

You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:

339

<div class="highlight-python"><div class="highlight"><pre>tag['class'] = 'verybold'

340

tag['id'] = 1

341

tag

342

# <blockquote class="verybold" id="1">Extremely bold</blockquote>

343

344

del tag['class']

345

del tag['id']

346

tag

347

# <blockquote>Extremely bold</blockquote>

348

349

tag['class']

350

# KeyError: 'class'

351

print(tag.get('class'))

352

# None

353

</pre></div>

354

</div>
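As a small complement to the dictionary-style access shown above, the sketch below reads attributes that may or may not be present. The markup and variable names are illustrative only, and "html.parser" is named explicitly so the example does not depend on a third-party parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', "html.parser")
tag = soup.b

# Like dict.get(), tag.get() returns None (or a supplied default) instead of
# raising KeyError when the attribute is absent.
missing = tag.get("title")
fallback = tag.get("title", "n/a")

# has_attr() answers the membership question directly.
has_id = tag.has_attr("id")

# .attrs is an ordinary dictionary, so its keys can be listed and sorted.
names = sorted(tag.attrs)
print(missing, fallback, has_id, names)
```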

355

356

357

HTML 4 defines a few attributes that can have multiple values. HTML5 removes some of them but defines more. The most common multi-valued attribute is class (a tag can have more than one CSS class). Others include <tt class="docutils literal">rel</tt>, <tt class="docutils literal">rev</tt>, <tt class="docutils literal">accept-charset</tt>, <tt class="docutils literal">headers</tt>, and <tt class="docutils literal">accesskey</tt>. Beautiful Soup presents the value of a multi-valued attribute as a list:

358

<div class="highlight-python"><div class="highlight"><pre>css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]

css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

365

</pre></div>

366

</div>

367

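If an attribute looks like it has more than one value, but it is not defined as a multi-valued attribute by any version of the HTML standard, Beautiful Soup returns the attribute as a string: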

368

<div class="highlight-python"><div class="highlight"><pre>id_soup = BeautifulSoup('<p id="my id"></p>')

369

id_soup.p['id']

370

# 'my id'

371

</pre></div>

372

</div>

373

When you turn a tag back into a string, the values of a multi-valued attribute are merged into one value:

374

<div class="highlight-python"><div class="highlight"><pre>rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>

380

</pre></div>

381

</div>

382

If you parse a document as XML, there are no multi-valued attributes:

383

<div class="highlight-python"><div class="highlight"><pre>xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')

384

xml_soup.p['class']

385

# u'body strikeout'

386

</pre></div>

387

</div>
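The behaviours described above can be seen side by side on a single tag. This is a minimal sketch with made-up markup, parsed with the standard library's "html.parser"; the extra class name "wide" is hypothetical.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

classes = p["class"]   # class is multi-valued, so this is a list
ident = p["id"]        # id is not multi-valued, so this stays one string

# Assigning a list back works too; the values are joined with spaces
# when the tag is turned back into markup.
p["class"] = classes + ["wide"]
rendered = str(p)
print(classes, repr(ident), rendered)
```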

388

</div>

389

</div>

390

</div>

391

392

<h2>NavigableString<a class="headerlink" href="#id13" title="Permalink to this headline">¶</a></h2>

393

Strings are usually found inside tags. Beautiful Soup uses the <tt class="docutils literal">NavigableString</tt> class to wrap the strings inside a tag:

394

<div class="highlight-python"><div class="highlight"><pre>tag.string

395

# u'Extremely bold'

396

type(tag.string)

397

# <class 'bs4.element.NavigableString'>

398

</pre></div>

399

</div>

400

A <tt class="docutils literal">NavigableString</tt> is just like a Python Unicode string, except that it also supports some of the features described in <a class="reference internal" href="#id15">Navigating the tree</a> and <a class="reference internal" href="#id24">Searching the tree</a>. You can convert a <tt class="docutils literal">NavigableString</tt> to a Unicode string with <tt class="docutils literal">unicode()</tt>:

401

<div class="highlight-python"><div class="highlight"><pre>unicode_string = unicode(tag.string)

402

unicode_string

403

# u'Extremely bold'

404

type(unicode_string)

405

# <type 'unicode'>

406

</pre></div>

407

</div>

408

You can't edit a string in place, but you can replace one string with another, using <a class="reference internal" href="#replace-with">replace_with()</a>:

409

<div class="highlight-python"><div class="highlight"><pre>tag.string.replace_with("No longer bold")

410

tag

411

# <blockquote>No longer bold</blockquote>

412

</pre></div>

413

</div>

414

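<tt class="docutils literal">NavigableString</tt> supports most of the features described in <a class="reference internal" href="#id15">Navigating the tree</a> and <a class="reference internal" href="#id24">Searching the tree</a>, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the <tt class="docutils literal">.contents</tt> or <tt class="docutils literal">.string</tt> attributes, or the <tt class="docutils literal">find()</tt> method.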

415

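If you want to use a <tt class="docutils literal">NavigableString</tt> outside of Beautiful Soup, call <tt class="docutils literal">unicode()</tt> on it to turn it into a normal Python Unicode string. Otherwise the string keeps a reference to the Beautiful Soup parse tree even after you're done using Beautiful Soup, which wastes memory.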

416

</div>

417

418

<h2>BeautifulSoup<a class="headerlink" href="#beautifulsoup" title="Permalink to this headline">¶</a></h2>

419

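The <tt class="docutils literal">BeautifulSoup</tt> object represents the document as a whole. For most purposes, you can treat it as a <tt class="docutils literal">Tag</tt> object: it supports most of the methods described in <a class="reference internal" href="#id15">Navigating the tree</a> and <a class="reference internal" href="#id24">Searching the tree</a>.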

420

Since the <tt class="docutils literal">BeautifulSoup</tt> object doesn't correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it's useful to look at its <tt class="docutils literal">.name</tt>, so the <tt class="docutils literal">BeautifulSoup</tt> object has a special <tt class="docutils literal">.name</tt> whose value is "[document]":

421

<div class="highlight-python"><div class="highlight"><pre>soup.name

422

# u'[document]'

423

</pre></div>

424

</div>

425

</div>

426

427

<h2>Comments and other special strings<a class="headerlink" href="#id14" title="Permalink to this headline">¶</a></h2>

428

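<tt class="docutils literal">Tag</tt>, <tt class="docutils literal">NavigableString</tt>, and <tt class="docutils literal">BeautifulSoup</tt> cover almost everything you'll see in an HTML or XML file, but there are a few leftover special objects. The one you'll most likely need to worry about is the comment: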

429

<div class="highlight-python"><div class="highlight"><pre>markup = "<b><!--Hey, buddy. Want to buy a used parser--></b>"

430

soup = BeautifulSoup(markup)

431

comment = soup.b.string

432

type(comment)

433

# <class 'bs4.element.Comment'>

434

</pre></div>

435

</div>

436

The <tt class="docutils literal">Comment</tt> object is just a special type of <tt class="docutils literal">NavigableString</tt>:

437

<div class="highlight-python"><div class="highlight"><pre>comment

438

# u'Hey, buddy. Want to buy a used parser'

439

</pre></div>

440

</div>

441

But when it appears as part of an HTML document, a <tt class="docutils literal">Comment</tt> is displayed with special formatting:

442

<div class="highlight-python"><div class="highlight"><pre>print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser-->
# </b>

446

</pre></div>

447

</div>

448

Beautiful Soup defines classes for the other things that may show up in an XML document: <tt class="docutils literal">CData</tt>, <tt class="docutils literal">ProcessingInstruction</tt>, <tt class="docutils literal">Declaration</tt>, and <tt class="docutils literal">Doctype</tt>. Just like <tt class="docutils literal">Comment</tt>, these classes are subclasses of <tt class="docutils literal">NavigableString</tt>: string objects with a little extra behaviour added. Here's an example that replaces the comment with a CDATA block:

449

<div class="highlight-python"><div class="highlight"><pre>from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>

457

</pre></div>

458

</div>
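Because Comment is a subclass of NavigableString, an isinstance check is enough to separate comments from ordinary text. The sketch below walks a small made-up document; the variable names are illustrative, and "html.parser" is named explicitly.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<p>Visible text<!-- hidden note --></p>", "html.parser")

# Every Comment is also a NavigableString, so test for Comment first.
comments = [node for node in soup.descendants if isinstance(node, Comment)]
plain_text = [
    node for node in soup.descendants
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]
print(comments, plain_text)
```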

459

</div>

460

</div>

461

462

<h1>Navigating the tree<a class="headerlink" href="#id15" title="Permalink to this headline">¶</a></h1>

463

Here's the "Alice in Wonderland" document again:

464

<div class="highlight-python"><div class="highlight"><pre>html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

480

</pre></div>

481

</div>

482

This example is used to show how to move from one part of a document to another.

483

484

485

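A tag may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.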

486

Note: Beautiful Soup strings don't support any of these attributes, because a string can't have children.

487

488

<h3>Navigating using tag names<a class="headerlink" href="#id17" title="Permalink to this headline">¶</a></h3>

489

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <head> tag, just say <tt class="docutils literal">soup.head</tt>:

490

<div class="highlight-python"><div class="highlight"><pre>soup.head

491

# <head><title>The Dormouse's story</title></head>

492

493

soup.title

494

# <title>The Dormouse's story</title>

495

</pre></div>

496

</div>

497

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first <b> tag beneath the <body> tag:

498

<div class="highlight-python"><div class="highlight"><pre>soup.body.b

499

# <b>The Dormouse's story</b>

500

</pre></div>

501

</div>

502

Using a tag name as an attribute gives you only the first tag by that name:

503

<div class="highlight-python"><div class="highlight"><pre>soup.a

504

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

505

</pre></div>

506

</div>

507

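If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in <cite>Searching the tree</cite>, such as find_all():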

508

<div class="highlight-python"><div class="highlight"><pre>soup.find_all('a')

509

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

510

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

511

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

512

</pre></div>

513

</div>

514

</div>

515

516

<h3>.contents and .children<a class="headerlink" href="#contents-children" title="Permalink to this headline">¶</a></h3>

517

A tag's children are available as a list called <tt class="docutils literal">.contents</tt>:

518

<div class="highlight-python"><pre>head_tag = soup.head

519

head_tag

520

# <head><title>The Dormouse's story</title></head>

521

522

head_tag.contents

523

# [<title>The Dormouse's story</title>]

524

525

title_tag = head_tag.contents[0]

526

title_tag

527

# <title>The Dormouse's story</title>

528

title_tag.contents

529

# [u'The Dormouse's story']</pre>

530

</div>

531

The <tt class="docutils literal">BeautifulSoup</tt> object itself has children. In this case, the <html> tag is the child of the <tt class="docutils literal">BeautifulSoup</tt> object:

532

<div class="highlight-python"><div class="highlight"><pre>len(soup.contents)

533

# 1

534

soup.contents[0].name

535

# u'html'

536

</pre></div>

537

</div>

538

A string does not have <tt class="docutils literal">.contents</tt>, because it can't contain anything:

539

<div class="highlight-python"><div class="highlight"><pre>text = title_tag.contents[0]

540

text.contents

541

# AttributeError: 'NavigableString' object has no attribute 'contents'

542

</pre></div>

543

</div>

544

Instead of getting the children as a list, you can iterate over them using the <tt class="docutils literal">.children</tt> generator:

545

<div class="highlight-python"><div class="highlight"><pre>for child in title_tag.children:
    print(child)

547

# The Dormouse's story

548

</pre></div>

549

</div>

550

</div>

551

552

<h3>.descendants<a class="headerlink" href="#descendants" title="Permalink to this headline">¶</a></h3>

553

The <tt class="docutils literal">.contents</tt> and <tt class="docutils literal">.children</tt> attributes only consider a tag's direct children. For instance, the <head> tag has a single direct child, the <title> tag:

554

<div class="highlight-python"><div class="highlight"><pre>head_tag.contents

555

# [<title>The Dormouse's story</title>]

556

</pre></div>

557

</div>

558

But the <title> tag itself has a child: the string "The Dormouse's story". In that sense, the string is also a descendant of the <head> tag. The <tt class="docutils literal">.descendants</tt> attribute lets you iterate recursively over all of a tag's descendants <a class="footnote-reference" href="#id86" id="id18">[5]</a>:

559

<div class="highlight-python"><div class="highlight"><pre>for child in head_tag.descendants:
    print(child)

561

# <title>The Dormouse's story</title>

562

# The Dormouse's story

563

</pre></div>

564

</div>

565

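In the example above, the <head> tag has only one child but two descendants: the <title> tag and the <title> tag's child. The <tt class="docutils literal">BeautifulSoup</tt> object has only one direct child (the <html> tag), but it has many descendants: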

566

<div class="highlight-python"><div class="highlight"><pre>len(list(soup.children))

567

# 1

568

len(list(soup.descendants))

569

# 25

570

</pre></div>

571

</div>
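The difference between direct children and descendants is easy to check on a document small enough to count by hand. This sketch uses a fragment of the example's <head> markup and the standard library's "html.parser"; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head = soup.head

direct = list(head.children)      # only the <title> tag
nested = list(head.descendants)   # the <title> tag, then the string inside it

print(len(direct), len(nested))
```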

572

</div>

573

574

<h3>.string<a class="headerlink" href="#string" title="Permalink to this headline">¶</a></h3>

575

If a tag has only one child, and that child is a <tt class="docutils literal">NavigableString</tt>, the child is available as <tt class="docutils literal">.string</tt>:

576

<div class="highlight-python"><div class="highlight"><pre>title_tag.string

577

# u'The Dormouse's story'

578

</pre></div>

579

</div>

580

If a tag's only child is another tag, the parent tag can also use <tt class="docutils literal">.string</tt>, and the result is the same as the <tt class="docutils literal">.string</tt> of its only child:

581

<div class="highlight-python"><div class="highlight"><pre>head_tag.contents

582

# [<title>The Dormouse's story</title>]

583

584

head_tag.string

585

# u'The Dormouse's story'

586

</pre></div>

587

</div>

588

If a tag contains more than one child, it's not clear which child <tt class="docutils literal">.string</tt> should refer to, so <tt class="docutils literal">.string</tt> is <tt class="docutils literal">None</tt>:

589

<div class="highlight-python"><div class="highlight"><pre>print(soup.html.string)

590

# None

591

</pre></div>

592

</div>

593

</div>

594

595

<h3>.strings and stripped_strings<a class="headerlink" href="#strings-stripped-strings" title="Permalink to this headline">¶</a></h3>

596

If a tag contains more than one string <a class="footnote-reference" href="#id83" id="id19">[2]</a>, you can iterate over them with <tt class="docutils literal">.strings</tt>:

597

<div class="highlight-python"><div class="highlight"><pre>for string in soup.strings:
    print(repr(string))

599

# u"The Dormouse's story"

600

# u'\n\n'

601

# u"The Dormouse's story"

602

# u'\n\n'

603

# u'Once upon a time there were three little sisters; and their names were\n'

604

# u'Elsie'

605

# u',\n'

606

# u'Lacie'

607

# u' and\n'

608

# u'Tillie'

609

# u';\nand they lived at the bottom of a well.'

610

# u'\n\n'

611

# u'...'

612

# u'\n'

613

</pre></div>

614

</div>

615

These strings tend to contain a lot of extra spaces and blank lines, which you can remove by using <tt class="docutils literal">.stripped_strings</tt> instead:

616

<div class="highlight-python"><div class="highlight"><pre>for string in soup.stripped_strings:
    print(repr(string))

618

# u"The Dormouse's story"

619

# u"The Dormouse's story"

620

# u'Once upon a time there were three little sisters; and their names were'

621

# u'Elsie'

622

# u','

623

# u'Lacie'

624

# u'and'

625

# u'Tillie'

626

# u';\nand they lived at the bottom of a well.'

627

# u'...'

628

</pre></div>

629

</div>

630

Strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of each string is removed.
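A short comparison of the two generators on a made-up fragment, parsed with the standard library's "html.parser"; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = "<p>Once upon a time:\n<a>Elsie</a>,\n<a>Lacie</a>\n</p>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # includes the whitespace-only string
clean = list(soup.stripped_strings)  # whitespace-only strings are dropped

print(raw)
print(clean)
```

Joining the cleaned strings with a single space is a common way to get readable text out of a fragment like this.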

631

</div>

632

</div>

633

634

635

Continuing with the document tree, every tag and every string has a parent: the tag that contains it.

636

637

<h3>.parent<a class="headerlink" href="#parent" title="Permalink to this headline">¶</a></h3>

638

You can access an element's parent with the <tt class="docutils literal">.parent</tt> attribute. In the "Alice" document, the <head> tag is the parent of the <title> tag:

639

<div class="highlight-python"><div class="highlight"><pre>title_tag = soup.title

640

title_tag

641

# <title>The Dormouse's story</title>

642

title_tag.parent

643

# <head><title>The Dormouse's story</title></head>

644

</pre></div>

645

</div>

646

The title string itself has a parent: the <title> tag that contains it:

647

<div class="highlight-python"><div class="highlight"><pre>title_tag.string.parent

648

# <title>The Dormouse's story</title>

649

</pre></div>

650

</div>

651

The parent of a top-level tag like <html> is the <tt class="docutils literal">BeautifulSoup</tt> object itself:

652

<div class="highlight-python"><div class="highlight"><pre>html_tag = soup.html

653

type(html_tag.parent)

654

# <class 'bs4.BeautifulSoup'>

655

</pre></div>

656

</div>

657

The <tt class="docutils literal">.parent</tt> of a <tt class="docutils literal">BeautifulSoup</tt> object is None:

658

<div class="highlight-python"><div class="highlight"><pre>print(soup.parent)

659

# None

660

</pre></div>

661

</div>

662

</div>

663

664

<h3>.parents<a class="headerlink" href="#parents" title="Permalink to this headline">¶</a></h3>

665

You can iterate over all of an element's ancestors with <tt class="docutils literal">.parents</tt>. This example uses <tt class="docutils literal">.parents</tt> to travel from an <a> tag up to the root of the document:

666

<div class="highlight-python"><div class="highlight"><pre>link = soup.a

667

link

668

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

669

for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

674

# p

675

# body

676

# html

677

# [document]

678

# None

679

</pre></div>

680

</div>
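The same walk can be written as a list comprehension when only the ancestor names are needed. This is a sketch on a reduced version of the example document, using the standard library's "html.parser"; the variable name path is illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a id='link1'>Elsie</a></p></body></html>", "html.parser"
)

# Collect the name of every ancestor, from the nearest one outward.
path = [parent.name for parent in soup.a.parents]
print(path)
```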

681

</div>

682

</div>

683

684

685

Consider a simple document like this:

686

<div class="highlight-python"><div class="highlight"><pre>sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>

700

</pre></div>

701

</div>

702

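The <b> tag and the <c> tag are at the same level: they are both direct children of the same tag, so they are called siblings. When a document is pretty-printed, siblings appear at the same indentation level. You can also use this relationship in the code you write.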

703

704

<h3>.next_sibling and .previous_sibling<a class="headerlink" href="#next-sibling-previous-sibling" title="Permalink to this headline">¶</a></h3>

705

You can use <tt class="docutils literal">.next_sibling</tt> and <tt class="docutils literal">.previous_sibling</tt> to move between elements that are on the same level of the parse tree:

706

<div class="highlight-python"><div class="highlight"><pre>sibling_soup.b.next_sibling

707

# <c>text2</c>

708

709

sibling_soup.c.previous_sibling

710

# <b>text1</b>

711

</pre></div>

712

</div>

713

The <b> tag has a <tt class="docutils literal">.next_sibling</tt> but no <tt class="docutils literal">.previous_sibling</tt>, because the <b> tag is the first element on its level. For the same reason, the <c> tag has a <tt class="docutils literal">.previous_sibling</tt> but no <tt class="docutils literal">.next_sibling</tt>:

714

<div class="highlight-python"><div class="highlight"><pre>print(sibling_soup.b.previous_sibling)

715

# None

716

print(sibling_soup.c.next_sibling)

717

# None

718

</pre></div>

719

</div>

720

The strings "text1" and "text2" are not siblings, because they don't have the same parent:

721

<div class="highlight-python"><div class="highlight"><pre>sibling_soup.b.string

722

# u'text1'

723

724

print(sibling_soup.b.string.next_sibling)

725

# None

726

</pre></div>

727

</div>

728

In real documents, the <tt class="docutils literal">.next_sibling</tt> or <tt class="docutils literal">.previous_sibling</tt> of a tag is usually a string or whitespace. Going back to the "Alice" document:

729

<div class="highlight-python"><pre><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

730

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

731

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></pre>

732

</div>

733

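You might think that the <tt class="docutils literal">.next_sibling</tt> of the first <a> tag would be the second <a> tag. It is actually a string: the comma and newline that separate the first <a> tag from the second: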

734

<div class="highlight-python"><div class="highlight"><pre>link = soup.a

735

link

736

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

737

738

link.next_sibling

739

# u',\n'

740

</pre></div>

741

</div>

742

The second <a> tag is the <tt class="docutils literal">.next_sibling</tt> of the comma:

743

<div class="highlight-python"><div class="highlight"><pre>link.next_sibling.next_sibling

744

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

745

</pre></div>

746

</div>

747

</div>

748

749

<h3>.next_siblings and .previous_siblings<a class="headerlink" href="#next-siblings-previous-siblings" title="Permalink to this headline">¶</a></h3>

750

You can iterate over an element's siblings with <tt class="docutils literal">.next_siblings</tt> or <tt class="docutils literal">.previous_siblings</tt>:

751

<div class="highlight-python"><div class="highlight"><pre>for sibling in soup.a.next_siblings:
    print(repr(sibling))

753

# u',\n'

754

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

755

# u' and\n'

756

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

757

# u'; and they lived at the bottom of a well.'

758

# None

759

760

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

762

# ' and\n'

763

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

764

# u',\n'

765

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

766

# u'Once upon a time there were three little sisters; and their names were\n'

767

# None

768

</pre></div>

769

</div>
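Since the siblings of a tag are often strings, code that only cares about sibling tags has to skip them. Two ways of doing that are sketched below on a made-up fragment parsed with "html.parser": filtering .next_siblings by type, and calling find_next_sibling().

```python
from bs4 import BeautifulSoup, Tag

html = (
    '<p>names: <a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a>; the end.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Keep only the siblings that are tags, skipping the strings in between.
tag_sibling_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

# find_next_sibling() jumps straight to the next sibling matching the name.
nearest = first.find_next_sibling("a")
print(tag_sibling_ids, nearest["id"])
```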

770

</div>

771

</div>

772

773

<h2>Going back and forth<a class="headerlink" href="#id22" title="Permalink to this headline">¶</a></h2>

774

Take a look at the beginning of the "Alice" document:

775

<div class="highlight-python"><pre><html><head><title>The Dormouse's story</title></head>

776

<p class="title"><b>The Dormouse's story</b></p></pre>

777

</div>

778

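An HTML parser takes this string of characters and turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.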
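The event stream can be observed directly with the standard library's HTMLParser, independently of Beautiful Soup. The EventLogger class below is a hypothetical helper written for this illustration; it simply records each callback the parser makes.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record the parse events for a piece of markup, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if data.strip():
            self.events.append(("string", data))


logger = EventLogger()
logger.feed("<html><head><title>The Dormouse's story</title></head>")
logger.close()

for event in logger.events:
    print(event)
```

The order of these events is the order that .next_element and .previous_element follow.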

779

780

<h3>.next_element and .previous_element<a class="headerlink" href="#next-element-previous-element" title="Permalink to this headline">¶</a></h3>
The <tt class="docutils literal">.next_element</tt> attribute of a string or tag points to whatever was parsed immediately afterwards. It may be the same as <tt class="docutils literal">.next_sibling</tt>, but usually it is not.
Here is the final <a> tag in the "Alice" document. Its <tt class="docutils literal">.next_sibling</tt> is a string: the rest of the sentence that was interrupted <a class="footnote-reference" href="#id83" id="id23">[2]</a> by the start of the <a> tag:
<div class="highlight-python"><div class="highlight"><pre>last_a_tag = soup.find("a", id="link3")
last_a_tag
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_a_tag.next_sibling
# '; and they lived at the bottom of a well.'
</pre></div>
</div>

791

But the <tt class="docutils literal">.next_element</tt> of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence. It is the string "Tillie":
<div class="highlight-python"><div class="highlight"><pre>last_a_tag.next_element
# u'Tillie'
</pre></div>
</div>

796

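That is because in the original markup the string "Tillie" appears before the semicolon. The parser encountered the <a> tag, then the string "Tillie", then the closing </a> tag, then the semicolon and the rest of the sentence. The semicolon is on the same level as the <a> tag, but the string "Tillie" was parsed first.
The <tt class="docutils literal">.previous_element</tt> attribute is the exact opposite of <tt class="docutils literal">.next_element</tt>: it points to whatever was parsed immediately before the current object:
<div class="highlight-python"><div class="highlight"><pre>last_a_tag.previous_element
# u' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</pre></div>
</div>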
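The difference between sibling order and parse order can be checked side by side. The following is a minimal sketch on a shortened, hypothetical fragment of the "Alice" paragraph. The variable names are illustrative only, and html.parser is chosen so that the example runs without additional parsers.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment; the real document has more content around it.
markup = '<p><a id="link3">Tillie</a>; and they lived at the bottom of a well.</p>'
soup = BeautifulSoup(markup, "html.parser")

last_a_tag = soup.find("a", id="link3")

sibling = last_a_tag.next_sibling       # same level: the text after </a>
element = last_a_tag.next_element       # parse order: the text inside <a>
previous = last_a_tag.previous_element  # parsed just before <a>: the <p> tag

# Walking on from the inner string reaches the text that follows the tag.
after_inner_text = element.next_element
```

In this fragment the <a> tag is the first child of <p>, so .previous_element is the <p> tag itself, whereas .previous_sibling would be None.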

804

</div>

805

806

<h3>.next_elements and .previous_elements<a class="headerlink" href="#next-elements-previous-elements" title="Permalink to this headline">¶</a></h3>
The <tt class="docutils literal">.next_elements</tt> and <tt class="docutils literal">.previous_elements</tt> iterators let you move forward or backward through the document in the order it was parsed:
<div class="highlight-python"><div class="highlight"><pre>for element in last_a_tag.next_elements:
    print(repr(element))
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# <p class="story">...</p>
# u'...'
# u'\n'
</pre></div>
</div>

819

</div>

820

</div>

821

</div>

822

823

<h1>Searching the tree<a class="headerlink" href="#id24" title="Permalink to this headline">¶</a></h1>
Beautiful Soup defines many methods for searching the parse tree. This section concentrates on two of them: <tt class="docutils literal">find()</tt> and <tt class="docutils literal">find_all()</tt> . The other methods take nearly the same arguments and are used in the same way.
Once again, the "Alice" document serves as the example:
<div class="highlight-python"><div class="highlight"><pre>html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
</pre></div>
</div>

844

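By passing a filter to a method such as <tt class="docutils literal">find_all()</tt> you can pick out whichever parts of the document you are interested in.
Before looking at <tt class="docutils literal">find_all()</tt> in detail, here are the kinds of filters <a class="footnote-reference" href="#id84" id="id26">[3]</a> you can pass in. These filters appear throughout the search API. They can be applied to a tag's name, to its attributes, to the text of a string, or to any combination of these.
The simplest filter is a string. Pass a string to a search method and Beautiful Soup performs a match against that exact string. The following example finds all the <b> tags in the document: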

851

<div class="highlight-python"><div class="highlight"><pre>soup.find_all('b')
# [<b>The Dormouse's story</b>]
</pre></div>
</div>
If you pass in a byte string, Beautiful Soup assumes it is encoded as UTF-8. You can avoid encoding problems by passing in a Unicode string instead.

856

</div>

857

858

<h3>A regular expression<a class="headerlink" href="#id28" title="Permalink to this headline">¶</a></h3>
If you pass in a regular expression object, Beautiful Soup filters against it using its <tt class="docutils literal">search()</tt> method. The following example finds all the tags whose names start with the letter "b"; in this case, the <body> tag and the <b> tag:
<div class="highlight-python"><div class="highlight"><pre>import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
</pre></div>
</div>

867

This code finds all the tags whose names contain the letter "t":
<div class="highlight-python"><div class="highlight"><pre>for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title
</pre></div>
</div>
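The two regular-expression filters above can be reproduced on a small document. This is a sketch under stated assumptions: the markup is a shortened stand-in for the "Alice" document, and the list names are illustrative. An anchored pattern such as "^b" behaves like a prefix test, while an unanchored pattern matches anywhere in the tag name.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, shortened markup with the same tag names as the example document.
markup = ("<html><head><title>The Dormouse's story</title></head>"
          "<body><p><b>The Dormouse's story</b></p></body></html>")
soup = BeautifulSoup(markup, "html.parser")

# Tag names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Tag names that contain "t" at any position.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]
```

Results come back in document order, which is why "body" precedes "b" and "html" precedes "title".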

874

</div>

875

876

877

If you pass in a list, Beautiful Soup returns everything that matches any item in that list. This code finds all the <a> tags and all the <b> tags:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
</pre></div>
</div>

885

</div>

886

887

888

The value <tt class="docutils literal">True</tt> matches everything it can. This code finds all the tags in the document, but none of the text strings:
<div class="highlight-python"><div class="highlight"><pre>for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
</pre></div>
</div>

904

</div>

905

906

907

If none of the other filters fit, you can define a function that takes an element as its only argument <a class="footnote-reference" href="#id85" id="id31">[4]</a> . The function should return <tt class="docutils literal">True</tt> if the element matches and <tt class="docutils literal">False</tt> otherwise.
The following function returns <tt class="docutils literal">True</tt> if a tag defines the <tt class="docutils literal">class</tt> attribute but does not define the <tt class="docutils literal">id</tt> attribute:
<div class="highlight-python"><div class="highlight"><pre>def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
</pre></div>
</div>

913

Pass this function into <tt class="docutils literal">find_all()</tt> and you get all the <p> tags:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
</pre></div>
</div>
The result contains only the <p> tags. The <a> tags are left out because they also define "id", and tags such as <html> and <head> are left out because they do not define "class".

921

The following function picks out the tags that are surrounded by string objects:
<div class="highlight-python"><div class="highlight"><pre>from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# p
# a
# a
# a
# p
</pre></div>
</div>
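A function filter can be exercised in isolation. The sketch below reuses has_class_but_no_id on a shortened, hypothetical document. It assumes nothing beyond Tag.has_attr() and find_all() accepting a callable, and the list names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <p> tags with a class, one <a> with class and id.
markup = ('<p class="title"><b>The Dormouse\'s story</b></p>'
          '<p class="story"><a class="sister" id="link1">Elsie</a></p>')
soup = BeautifulSoup(markup, "html.parser")


def has_class_but_no_id(tag):
    """Match tags that define "class" but not "id"."""
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
matched_names = [tag.name for tag in matches]

# The <a> tag is not matched itself, but it still appears inside its parent <p>.
link_inside_match = matches[1].find("a") is not None
```

The filter decides which tags are returned, not what those tags contain, so a rejected tag can still show up nested inside an accepted one.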

936

Now we are ready to look at the search methods in detail.

937

</div>

938

</div>

939

940

941

find_all( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )

942

The <tt class="docutils literal">find_all()</tt> method looks through a tag's descendants and retrieves all the descendants that match your filters. Here are a few examples:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(text=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'
</pre></div>
</div>
Some of these calls look familiar and others are new. What does it mean to pass in a value for <tt class="docutils literal">text</tt> or <tt class="docutils literal">id</tt>? Why does <tt class="docutils literal">find_all("p", "title")</tt> return the <p> tag whose CSS class is "title"? Let's look at the arguments to <tt class="docutils literal">find_all()</tt> one by one.

963

964

965

Pass in a value for <tt class="docutils literal">name</tt> and Beautiful Soup only considers tags with that name. Text strings are ignored, as are tags whose names do not match.
The simplest usage is:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all("title")
# [<title>The Dormouse's story</title>]
</pre></div>
</div>
To repeat: the value of <tt class="docutils literal">name</tt> can be any kind of <a class="reference internal" href="#id25">filter</a> : a string, a regular expression, a list, a function, or the value <tt class="docutils literal">True</tt> .

972

</div>

973

974

<h3>The keyword arguments<a class="headerlink" href="#keyword" title="Permalink to this headline">¶</a></h3>
Any keyword argument that is not one of the built-in search arguments is turned into a filter on the tag attribute of the same name. If you pass in a value for an argument called <tt class="docutils literal">id</tt>, Beautiful Soup filters against each tag's "id" attribute:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
</pre></div>
</div>

980

If you pass in a value for <tt class="docutils literal">href</tt>, Beautiful Soup filters against each tag's "href" attribute:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
</pre></div>
</div>

985

When filtering on a named attribute, the value can be a <a class="reference internal" href="#id27">string</a> , a <a class="reference internal" href="#id28">regular expression</a> , a <a class="reference internal" href="#id29">list</a>, or <a class="reference internal" href="#true">True</a> .
The following example finds all the tags that have an <tt class="docutils literal">id</tt> attribute, whatever its value is:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
</pre></div>
</div>

993

You can filter on several attributes at once by passing in more than one keyword argument:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
</pre></div>
</div>

998

Some attributes, such as the data-* attributes in HTML5, have names that cannot be used as keyword arguments:
<div class="highlight-python"><div class="highlight"><pre>data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
</pre></div>
</div>
You can still search on such attributes by putting them into a dictionary and passing it to <tt class="docutils literal">find_all()</tt> as the <tt class="docutils literal">attrs</tt> argument:
<div class="highlight-python"><div class="highlight"><pre>data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
</pre></div>
</div>
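The attrs dictionary is the documented route for attribute names that are not valid Python identifiers. As a sketch, the example below also shows dictionary unpacking, which is a general Python feature rather than anything specific to Beautiful Soup; whether it reads better than attrs is a matter of taste. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")

# Documented form: attribute filters passed as a dictionary.
by_dict = data_soup.find_all(attrs={"data-foo": "value"})

# Plain Python: unpack a dict so that "data-foo" arrives as a keyword argument.
by_unpacking = data_soup.find_all(**{"data-foo": "value"})

# A value that does not match returns an empty result list.
no_match = data_soup.find_all(attrs={"data-foo": "other"})
```

Both forms end up as the same attribute filter, so they return the same tags.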

1009

</div>

1010

1011

<h3>Searching by CSS class<a class="headerlink" href="#css" title="Permalink to this headline">¶</a></h3>
Searching for tags that have a certain CSS class is very useful, but the name of the CSS attribute, <tt class="docutils literal">class</tt>, is a reserved word in Python, and using <tt class="docutils literal">class</tt> as a keyword argument causes a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class with the keyword argument <tt class="docutils literal">class_</tt>:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
</pre></div>
</div>

1019

Like any keyword argument, <tt class="docutils literal">class_</tt> accepts different kinds of <tt class="docutils literal">filter</tt> : a string, a regular expression, a function, or <tt class="docutils literal">True</tt> :
<div class="highlight-python"><div class="highlight"><pre>soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
</pre></div>
</div>

1032

The <tt class="docutils literal">class</tt> attribute of a tag is a <a class="reference internal" href="#id12">multi-valued attribute</a> . When you search by CSS class, the search is matched against each of the tag's CSS classes separately:
<div class="highlight-python"><div class="highlight"><pre>css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]
</pre></div>
</div>
You can also search for the exact string value of the <tt class="docutils literal">class</tt> attribute:
<div class="highlight-python"><div class="highlight"><pre>css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
</pre></div>
</div>

1046

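When the full value of <tt class="docutils literal">class</tt> is matched as a string, the class names must be in the same order as in the document; a search with the names in a different order finds nothing. A CSS class can also be searched through the <tt class="docutils literal">attrs</tt> dictionary, which gives the same result as <tt class="docutils literal">class_</tt>:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
</pre></div>
</div>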
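The effect of class order can be checked directly. This sketch assumes a reasonably recent Beautiful Soup release in which .select() handles chained class selectors (in current releases this is provided by the soupsieve dependency); older versions may not support that selector. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Matching one class at a time works for either class.
single = css_soup.find_all("p", class_="strikeout")

# Matching the whole attribute value works only in the original order.
exact = css_soup.find_all("p", class_="body strikeout")
reordered = css_soup.find_all("p", class_="strikeout body")

# A CSS selector requires both classes and ignores their order.
both = css_soup.select("p.strikeout.body")
```

When a tag has to carry two or more classes regardless of order, the selector form is usually the least surprising choice.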

1053

</div>

1054

1055

1056

With the <tt class="docutils literal">text</tt> argument you can search for strings in the document instead of tags. Like <tt class="docutils literal">name</tt>, the <tt class="docutils literal">text</tt> argument accepts a <a class="reference internal" href="#id27">string</a> , a <a class="reference internal" href="#id28">regular expression</a> , a <a class="reference internal" href="#id29">list</a>, or <a class="reference internal" href="#true">True</a> ; as the last example shows, a function works too. Here are some examples:
<div class="highlight-python"><pre>soup.find_all(text="Elsie")
# [u'Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(text=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']</pre>
</div>

1073

Although <tt class="docutils literal">text</tt> is for finding strings, you can combine it with arguments that find tags. Beautiful Soup then returns the tags whose <tt class="docutils literal">.string</tt> attribute matches the value of <tt class="docutils literal">text</tt>. This code finds the <a> tags whose <tt class="docutils literal">.string</tt> is "Elsie":
<div class="highlight-python"><div class="highlight"><pre>soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
</pre></div>
</div>

1078

</div>

1079

1080

<h3>The <tt class="docutils literal">limit</tt> argument<a class="headerlink" href="#limit" title="Permalink to this headline">¶</a></h3>
The <tt class="docutils literal">find_all()</tt> method returns every result that matches your filters, which can be slow when the document is large. If you do not need all the results, pass in a number for <tt class="docutils literal">limit</tt>. It works like the LIMIT keyword in SQL: once the number of results reaches <tt class="docutils literal">limit</tt>, Beautiful Soup stops searching and returns what it has found.
Three tags in the document match this search, but only two are returned because the number of results is limited:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
</pre></div>
</div>

1088

</div>

1089

1090

<h3>The <tt class="docutils literal">recursive</tt> argument<a class="headerlink" href="#recursive" title="Permalink to this headline">¶</a></h3>
When you call <tt class="docutils literal">find_all()</tt> on a tag, Beautiful Soup examines all of that tag's descendants: its children, its children's children, and so on. If you only want to search the tag's direct children, pass in <tt class="docutils literal">recursive=False</tt> .
Here is a simple document:
<div class="highlight-python"><pre><html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...</pre>
</div>
The <title> tag sits beneath the <html> tag but is not one of its direct children, so the two searches give different results:
<div class="highlight-python"><div class="highlight"><pre>soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []
</pre></div>
</div>
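The effect of recursive=False is easiest to see by running both searches on the same small tree. A minimal sketch, assuming only the markup shown here and html.parser as the parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(markup, "html.parser")

# Default: every descendant of <html> is examined.
all_titles = soup.html.find_all("title")

# Direct children only: <title> is a grandchild of <html>, so nothing matches.
direct_titles = soup.html.find_all("title", recursive=False)

# <head> is a direct child of <html>, so it is still found.
direct_heads = soup.html.find_all("head", recursive=False)
```

Restricting the search to direct children is useful when the same tag name can occur at several depths and only the top level is wanted.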

1109

</div>

1110

</div>

1111

1112

<h2>Calling a tag is like calling <tt class="docutils literal">find_all()</tt><a class="headerlink" href="#find-all-tag" title="Permalink to this headline">¶</a></h2>
Because <tt class="docutils literal">find_all()</tt> is the most widely used search method in Beautiful Soup, there is a shortcut for it. If you call a <tt class="docutils literal">BeautifulSoup</tt> object or a <tt class="docutils literal">tag</tt> object as though it were a function, the result is the same as calling <tt class="docutils literal">find_all()</tt> on that object. These two lines of code are equivalent:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all("a")
soup("a")
</pre></div>
</div>
These two lines are also equivalent:
<div class="highlight-python"><div class="highlight"><pre>soup.title.find_all(text=True)
soup.title(text=True)
</pre></div>
</div>

1123

</div>

1124

1125

1126

find( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )

1127

The <tt class="docutils literal">find_all()</tt> method returns every tag in the document that matches, but sometimes you only want one result. If you know a document has only one <body> tag, scanning the whole document with <tt class="docutils literal">find_all()</tt> is wasteful. Rather than passing <tt class="docutils literal">limit=1</tt> to <tt class="docutils literal">find_all</tt> every time, you can use the <tt class="docutils literal">find()</tt> method. These two lines of code are nearly equivalent:
<div class="highlight-python"><div class="highlight"><pre>soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>
</pre></div>
</div>
The only difference is that <tt class="docutils literal">find_all()</tt> returns a list containing the single result, while <tt class="docutils literal">find()</tt> returns the result itself.
If <tt class="docutils literal">find_all()</tt> cannot find anything, it returns an empty list. If <tt class="docutils literal">find()</tt> cannot find anything, it returns <tt class="docutils literal">None</tt> :
<div class="highlight-python"><div class="highlight"><pre>print(soup.find("nosuchtag"))
# None
</pre></div>
</div>
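Since find() returns None when nothing matches, code that chains further calls onto its result needs a guard. The helper below, text_or_default, is a hypothetical name introduced only for this sketch; it is not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

markup = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(markup, "html.parser")


def text_or_default(node, name, default=""):
    """Return the text of the first matching tag, or a default (hypothetical helper)."""
    tag = node.find(name)
    return tag.get_text() if tag is not None else default


missing = soup.find("nosuchtag")          # find() reports "no match" as None
missing_all = soup.find_all("nosuchtag")  # find_all() reports it as an empty list

title_text = text_or_default(soup, "title")
fallback = text_or_default(soup, "nosuchtag", default="(none)")
```

Without the None check, an expression such as soup.find("nosuchtag").get_text() raises AttributeError.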

1141

The expression <tt class="docutils literal">soup.head.title</tt> is the shorthand described under <a class="reference internal" href="#id17">navigating using tag names</a> . That shorthand works by repeatedly calling <tt class="docutils literal">find()</tt> on the current tag:
<div class="highlight-python"><div class="highlight"><pre>soup.head.title
# <title>The Dormouse's story</title>

soup.find("head").find("title")
# <title>The Dormouse's story</title>
</pre></div>
</div>

1149

</div>

1150

1151

<h2>find_parents() and find_parent()<a class="headerlink" href="#find-parents-find-parent" title="Permalink to this headline">¶</a></h2>

1152

find_parents( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )

1153

find_parent( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )

1154

A lot of space above was spent on <tt class="docutils literal">find_all()</tt> and <tt class="docutils literal">find()</tt> . The Beautiful Soup API defines ten other search methods. Five of them take the same search arguments as <tt class="docutils literal">find_all()</tt> , and the other five take the same arguments as <tt class="docutils literal">find()</tt> . The only difference is which part of the document they search.
Remember that <tt class="docutils literal">find_all()</tt> and <tt class="docutils literal">find()</tt> work their way down the tree, looking at a node's children, grandchildren, and so on. <tt class="docutils literal">find_parents()</tt> and <tt class="docutils literal">find_parent()</tt> do the opposite: they work their way up the tree, looking at a node's parents, and otherwise accept the same filters as an ordinary search. Let's start from a leaf node in the document:
<div class="highlight-python"><pre>a_string = soup.find(text="Lacie")
a_string
# u'Lacie'

a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#  and they lived at the bottom of a well.</p>

a_string.find_parents("p", class_="title")
# []</pre>
</div>
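The upward search can be compared with a manual walk over .parents. This is a sketch on shortened, hypothetical markup. It uses the string argument, which newer Beautiful Soup releases (4.4.0 and later) provide as the preferred name for the text argument used in this document; on older releases, replace string= with text=.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the "Alice" document.
markup = ('<p class="title"><b>The Dormouse\'s story</b></p>'
          '<p class="story">Once upon a time <a id="link2">Lacie</a></p>')
soup = BeautifulSoup(markup, "html.parser")

a_string = soup.find(string="Lacie")  # a string node, i.e. a leaf of the tree

link_parents = a_string.find_parents("a")                 # the direct parent
story_paragraph = a_string.find_parent("p")               # an indirect parent
title_parents = a_string.find_parents("p", class_="title")  # not an ancestor

# The same lookup written by hand as an iteration over .parents.
manual = next((node for node in a_string.parents if node.name == "p"), None)
```

The hand-written loop and find_parent("p") land on the same tag object, which illustrates that the search methods are a filtered walk over .parents.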

1173

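One of the <a> tags in the document is the direct parent of this leaf node, so it is found. One of the <p> tags is an indirect parent of the leaf node, so it is found as well. The <p> tag with the class "title" is not one of the leaf node's parents, so <tt class="docutils literal">find_parents()</tt> cannot find it.
The <tt class="docutils literal">find_parent()</tt> and <tt class="docutils literal">find_parents()</tt> methods may remind you of the <a class="reference internal" href="#parent">.parent</a> and <a class="reference internal" href="#parents">.parents</a> attributes. The connection is very strong: these search methods iterate over <tt class="docutils literal">.parents</tt> and check each parent against the filter.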

1175

</div>

1176

1177

<h2>find_next_siblings() and find_next_sibling()<a class="headerlink" href="#find-next-siblings-find-next-sibling" title="Permalink to this headline">¶</a></h2>

1178

find_next_siblings( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )

1179

find_next_sibling( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )

1180

These two methods use <a class="reference internal" href="#next-siblings-previous-siblings">.next_siblings</a> to iterate over the siblings that come after the current tag <a class="footnote-reference" href="#id86" id="id33">[5]</a> . The <tt class="docutils literal">find_next_siblings()</tt> method returns all the following siblings that match, and <tt class="docutils literal">find_next_sibling()</tt> returns only the first one:
<div class="highlight-python"><div class="highlight"><pre>first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_next_sibling("p")
# <p class="story">...</p>
</pre></div>
</div>

1194

</div>

1195

1196

<h2>find_previous_siblings() and find_previous_sibling()<a class="headerlink" href="#find-previous-siblings-find-previous-sibling" title="Permalink to this headline">¶</a></h2>

1197

find_previous_siblings( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )

1198

find_previous_sibling( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )

1199

These two methods use <a class="reference internal" href="#next-siblings-previous-siblings">.previous_siblings</a> to iterate over the siblings that come before the current tag <a class="footnote-reference" href="#id86" id="id34">[5]</a> . The <tt class="docutils literal">find_previous_siblings()</tt> method returns all the preceding siblings that match, and <tt class="docutils literal">find_previous_sibling()</tt> returns only the first one:
<div class="highlight-python"><div class="highlight"><pre>last_link = soup.find("a", id="link3")
last_link
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_link.find_previous_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")
# <p class="title"><b>The Dormouse's story</b></p>
</pre></div>
</div>

1213

</div>

1214

1215

1216

find_all_next( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )

1217

find_next( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )

1218

These two methods use <a class="reference internal" href="#next-elements-previous-elements">.next_elements</a> to iterate over the tags and strings that come after the current tag in the document <a class="footnote-reference" href="#id86" id="id35">[5]</a> . The <tt class="docutils literal">find_all_next()</tt> method returns all the matches, and <tt class="docutils literal">find_next()</tt> returns only the first one:
<div class="highlight-python"><div class="highlight"><pre>first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_all_next(text=True)
# [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
#  u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']

first_link.find_next("p")
# <p class="story">...</p>
</pre></div>
</div>
In the first example, the string "Elsie" shows up even though it is contained within the <a> tag we started from. In the second example, the last <p> tag in the document shows up even though it is not in the same part of the tree as the <a> tag we started from. For these methods, all that matters is that an element matches the filter and appears later in the document than the starting element.

1232

</div>

1233

1234

<h2>find_all_previous() and find_previous()<a class="headerlink" href="#find-all-previous-find-previous" title="Permalink to this headline">¶</a></h2>

1235

find_all_previous( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )

1236

find_previous( <a class="reference internal" href="#id32">name</a> , <a class="reference internal" href="#css">attrs</a> , <a class="reference internal" href="#recursive">recursive</a> , <a class="reference internal" href="#text">text</a> , <a class="reference internal" href="#keyword">**kwargs</a> )

1237

These two methods use <a class="reference internal" href="#next-elements-previous-elements">.previous_elements</a> to iterate over the tags and strings that come before the current node in the document <a class="footnote-reference" href="#id86" id="id36">[5]</a> . The <tt class="docutils literal">find_all_previous()</tt> method returns all the matches, and <tt class="docutils literal">find_previous()</tt> returns only the first one:
<div class="highlight-python"><div class="highlight"><pre>first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_all_previous("p")
# [<p class="story">Once upon a time there were three little sisters; ...</p>,
#  <p class="title"><b>The Dormouse's story</b></p>]

first_link.find_previous("title")
# <title>The Dormouse's story</title>
</pre></div>
</div>
The call to <tt class="docutils literal">find_all_previous("p")</tt> found the first paragraph in the document (the one with class="title"), but it also found the second paragraph, the <p> tag that contains the <a> tag we started from. This should not be surprising: the search looks at every tag that appears earlier in the document than the starting <a> tag, and a <p> tag that contains an <a> tag must have started before the <a> tag it contains.

1251

</div>

1252

1253

<h2>CSS selectors<a class="headerlink" href="#id37" title="Permalink to this headline">¶</a></h2>
Beautiful Soup supports most of the commonly used CSS selectors <a class="footnote-reference" href="#id87" id="id38">[6]</a> . Pass a selector string to the <tt class="docutils literal">.select()</tt> method of a <tt class="docutils literal">Tag</tt> or <tt class="docutils literal">BeautifulSoup</tt> object to find tags using CSS selector syntax:
<div class="highlight-python"><div class="highlight"><pre>soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
</pre></div>
</div>

1262

Find tags beneath other tags:
<div class="highlight-python"><div class="highlight"><pre>soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]
</pre></div>
</div>

1272

Find tags directly beneath other tags <a class="footnote-reference" href="#id87" id="id39">[6]</a> :
<div class="highlight-python"><div class="highlight"><pre>soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []
</pre></div>
</div>

1291

Find the siblings of tags:
<div class="highlight-python"><div class="highlight"><pre>soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
</pre></div>
</div>

1300

Find tags by CSS class:
<div class="highlight-python"><div class="highlight"><pre>soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
</pre></div>
</div>

1312

Find tags by ID:
<div class="highlight-python"><div class="highlight"><pre>soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
</pre></div>
</div>

1320

Test for the existence of an attribute:
<div class="highlight-python"><div class="highlight"><pre>soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
</pre></div>
</div>

1327

Find tags by attribute value:
<div class="highlight-python"><div class="highlight"><pre>soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
</pre></div>
</div>

Match language codes:
<div class="highlight-python"><div class="highlight"><pre>multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup)
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
# <p lang="en-us">Howdy, y'all</p>,
# <p lang="en-gb">Pip-pip, old fruit</p>]
</pre></div>
</div>

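This is a convenience for users who already know CSS selector syntax. Beautiful Soup also supports the CSS selector API. If CSS selectors are all you need, you can use <tt class="docutils literal">lxml</tt> directly: it is faster and supports more of the CSS selector syntax. What Beautiful Soup adds is the ability to combine CSS selectors with its own easy-to-use API.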
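As a rough, self-contained sketch of the selectors shown above, the snippet below runs a few of them against a shortened copy of the "Alice" document. The shortened markup, the helper name <tt class="docutils literal">ids</tt>, and the explicit "html.parser" argument are assumptions made for this example, so that it does not depend on which optional parsers are installed.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the "Alice" document (an assumption for this sketch).
html_doc = """
<html><body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def ids(selector):
    """Hypothetical helper: the id of every tag matched by a CSS selector."""
    return [tag["id"] for tag in soup.select(selector)]


second_link = ids("p > a:nth-of-type(2)")    # second <a> child of a <p>
later_siblings = ids("#link1 ~ .sister")     # every later sibling of #link1
next_sibling = ids("#link1 + .sister")       # only the sibling right after #link1
ends_with_tillie = ids('a[href$="tillie"]')  # href ends with "tillie"
direct_body_links = ids("body > a")          # no <a> is a direct child of <body>

print(second_link, later_siblings, next_sibling, ends_with_tillie, direct_body_links)
```

Collecting only the <tt class="docutils literal">id</tt> values keeps the comparison short; <tt class="docutils literal">select()</tt> itself returns a list of tags.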

</div>
</div>
<h1>Modifying the tree<a class="headerlink" href="#id40" title="Permalink to this headline">¶</a></h1>
Beautiful Soup's main strength is searching the parse tree, but you can also modify the tree conveniently.

<h2>Changing tag names and attributes<a class="headerlink" href="#id41" title="Permalink to this headline">¶</a></h2>
This was covered earlier, in the <a class="reference internal" href="#attributes">Attributes</a> section, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes:
<div class="highlight-python"><div class="highlight"><pre>soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
</pre></div>
</div>
</div>

<h2>Modifying .string<a class="headerlink" href="#id42" title="Permalink to this headline">¶</a></h2>
If you set a tag's <tt class="docutils literal">.string</tt> attribute, the tag's contents are replaced with the string you give:
<div class="highlight-python"><div class="highlight"><pre>markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)

tag = soup.a
tag.string = "New link text."
tag
# <a href="http://example.com/">New link text.</a>
</pre></div>
</div>
Be careful: if the tag contains other tags, assigning to its <tt class="docutils literal">.string</tt> attribute overwrites all of the existing contents, child tags included.
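The warning above can be checked with a short sketch. The variable names and the explicit "html.parser" argument are choices made for this example only.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

had_child_before = tag.find("i") is not None  # the <i> child is present
tag.string = "New link text."                 # replaces ALL contents of <a>
has_child_after = tag.find("i") is not None   # the <i> child is gone
result = str(tag)

print(had_child_before, has_child_after, result)
```

If the child tags need to survive, editing the individual text node (for example through <tt class="docutils literal">replace_with()</tt>) is likely the safer route.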

</div>
<h2>append()<a class="headerlink" href="#append" title="Permalink to this headline">¶</a></h2>
<tt class="docutils literal">Tag.append()</tt> adds content to a tag. It works just like calling <tt class="docutils literal">.append()</tt> on a Python list:
<div class="highlight-python"><div class="highlight"><pre>soup = BeautifulSoup("<a>Foo</a>")
soup.a.append("Bar")

soup
# <html><head></head><body><a>FooBar</a></body></html>
soup.a.contents
# [u'Foo', u'Bar']
</pre></div>
</div>
</div>

<h2>BeautifulSoup.new_string() and .new_tag()<a class="headerlink" href="#beautifulsoup-new-string-new-tag" title="Permalink to this headline">¶</a></h2>
If you need to add a string to a document, no problem: you can pass a Python string to <tt class="docutils literal">append()</tt>, or call the factory method <tt class="docutils literal">BeautifulSoup.new_string()</tt>:
<div class="highlight-python"><div class="highlight"><pre>soup = BeautifulSoup("<b></b>")
tag = soup.b
tag.append("Hello")
new_string = soup.new_string(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# [u'Hello', u' there']
</pre></div>
</div>
If you want to create a comment, or any other subclass of <tt class="docutils literal">NavigableString</tt>, pass that class as the second argument to <tt class="docutils literal">new_string()</tt>:
<div class="highlight-python"><div class="highlight"><pre>from bs4 import Comment
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
tag
# <b>Hello there<!--Nice to see you.--></b>
tag.contents
# [u'Hello', u' there', u'Nice to see you.']
</pre></div>
</div>
(This feature is new in Beautiful Soup 4.2.1.)
The best way to create a whole new tag is to call the factory method <tt class="docutils literal">BeautifulSoup.new_tag()</tt>:
<div class="highlight-python"><div class="highlight"><pre>soup = BeautifulSoup("<b></b>")
original_tag = soup.b

new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>

new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>
</pre></div>
</div>
Only the first argument, the tag name, is required; the other arguments are optional.
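The two factory methods can be combined as in the sketch below. The comment text "checked" and the explicit "html.parser" argument are illustrative assumptions, not part of the examples above.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<b></b>", "html.parser")

# Build a new <a> tag; only the tag name is required, href is optional.
link = soup.new_tag("a", href="http://www.example.com")
link.string = "Link text."
soup.b.append(link)

# Build a comment node by passing the NavigableString subclass to new_string().
soup.b.append(soup.new_string("checked", Comment))

result = str(soup.b)
print(result)
```

Both objects are created through the <tt class="docutils literal">soup</tt> object and only become part of the tree once they are appended.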

</div>
<h2>insert()<a class="headerlink" href="#insert" title="Permalink to this headline">¶</a></h2>
<tt class="docutils literal">Tag.insert()</tt> is just like <tt class="docutils literal">Tag.append()</tt>, except that the new element does not necessarily go at the end of its parent's <tt class="docutils literal">.contents</tt>. It is inserted at whatever numeric position you specify. It works just like <tt class="docutils literal">.insert()</tt> on a Python list:
<div class="highlight-python"><div class="highlight"><pre>markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a

tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
# [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
</pre></div>
</div>
</div>

<h2>insert_before() and insert_after()<a class="headerlink" href="#insert-before-insert-after" title="Permalink to this headline">¶</a></h2>
The <tt class="docutils literal">insert_before()</tt> method inserts a tag or string immediately before the current tag or text node:
<div class="highlight-python"><div class="highlight"><pre>soup = BeautifulSoup("<b>stop</b>")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b
# <b><i>Don't</i>stop</b>
</pre></div>
</div>
The <tt class="docutils literal">insert_after()</tt> method inserts a tag or string immediately after the current tag or text node:
<div class="highlight-python"><div class="highlight"><pre>soup.b.i.insert_after(soup.new_string(" ever "))
soup.b
# <b><i>Don't</i> ever stop</b>
soup.b.contents
# [<i>Don't</i>, u' ever ', u'stop']
</pre></div>
</div>
</div>

<h2>clear()<a class="headerlink" href="#clear" title="Permalink to this headline">¶</a></h2>
<tt class="docutils literal">Tag.clear()</tt> removes the contents of a tag:
<div class="highlight-python"><div class="highlight"><pre>markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a

tag.clear()
tag
# <a href="http://example.com/"></a>
</pre></div>
</div>
</div>

<h2>extract()<a class="headerlink" href="#extract" title="Permalink to this headline">¶</a></h2>
<tt class="docutils literal">PageElement.extract()</tt> removes a tag or string from the tree and returns the element that was removed:
<div class="highlight-python"><div class="highlight"><pre>markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

i_tag = soup.i.extract()

a_tag
# <a href="http://example.com/">I linked to </a>

i_tag
# <i>example.com</i>

print(i_tag.parent)
# None
</pre></div>
</div>
At this point you effectively have two parse trees: one rooted at the <tt class="docutils literal">BeautifulSoup</tt> object used to parse the original document, and one rooted at the tag that was extracted. You can go on to call <tt class="docutils literal">extract</tt> on a child of the extracted element:
<div class="highlight-python"><div class="highlight"><pre>my_string = i_tag.string.extract()
my_string
# u'example.com'

print(my_string.parent)
# None
i_tag
# <i></i>
</pre></div>
</div>
</div>

<h2>decompose()<a class="headerlink" href="#decompose" title="Permalink to this headline">¶</a></h2>
<tt class="docutils literal">Tag.decompose()</tt> removes a tag from the tree, then completely destroys it and its contents:
<div class="highlight-python"><div class="highlight"><pre>markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

soup.i.decompose()

a_tag
# <a href="http://example.com/">I linked to </a>
</pre></div>
</div>
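The practical difference between <tt class="docutils literal">extract()</tt> and <tt class="docutils literal">decompose()</tt> is what you get back. The sketch below contrasts the two on the same markup; the variable names and the explicit "html.parser" argument are assumptions of this example.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# extract(): the removed tag is returned and can still be used afterwards.
soup = BeautifulSoup(markup, "html.parser")
i_tag = soup.i.extract()
after_extract = str(soup.a)
extracted = str(i_tag)
extracted_parent = i_tag.parent  # detached, so it has no parent

# decompose(): the tag is removed and destroyed; the call returns nothing.
soup2 = BeautifulSoup(markup, "html.parser")
returned = soup2.i.decompose()
after_decompose = str(soup2.a)

print(after_extract, extracted, returned)
```

A reasonable rule of thumb is to use <tt class="docutils literal">extract()</tt> when the removed piece will be reused and <tt class="docutils literal">decompose()</tt> when it is simply unwanted.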

</div>
<h2>replace_with()<a class="headerlink" href="#replace-with" title="Permalink to this headline">¶</a></h2>
<tt class="docutils literal">PageElement.replace_with()</tt> removes a tag or string from the tree and replaces it with the tag or string of your choice:
<div class="highlight-python"><div class="highlight"><pre>markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)

a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>
</pre></div>
</div>
<tt class="docutils literal">replace_with()</tt> returns the tag or string that was replaced, so that you can examine it or add it back to another part of the tree.

</div>
<tt class="docutils literal">PageElement.wrap()</tt> wraps an element in the tag you specify <a class="footnote-reference" href="#id89" id="id43">[8]</a> and returns the new wrapper:
<div class="highlight-python"><div class="highlight"><pre>soup = BeautifulSoup("<p>I wish I was bold.</p>")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>
</pre></div>
</div>
This method is new in Beautiful Soup 4.0.5.

</div>
<h2>unwrap()<a class="headerlink" href="#unwrap" title="Permalink to this headline">¶</a></h2>
<tt class="docutils literal">Tag.unwrap()</tt> is the opposite of <tt class="docutils literal">wrap()</tt>. It replaces a tag with whatever is inside that tag, which makes it useful for stripping out markup:
<div class="highlight-python"><div class="highlight"><pre>markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>
</pre></div>
</div>
Like <tt class="docutils literal">replace_with()</tt>, <tt class="docutils literal">unwrap()</tt> returns the tag that was removed.
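Because <tt class="docutils literal">wrap()</tt> and <tt class="docutils literal">unwrap()</tt> are inverses, a round trip should leave the markup unchanged. The sketch below checks this; the variable names and the explicit "html.parser" argument are assumptions of this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# wrap() returns the new wrapper tag.
wrapper = soup.p.string.wrap(soup.new_tag("b"))
wrapped = str(soup)

# unwrap() puts the wrapper's contents back in its place
# and returns the (now empty) wrapper tag.
removed = soup.p.b.unwrap()
unwrapped = str(soup)

print(wrapped, unwrapped, str(removed))
```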

</div>
</div>
<h2>Pretty-printing<a class="headerlink" href="#id45" title="Permalink to this headline">¶</a></h2>
The <tt class="docutils literal">prettify()</tt> method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with each HTML/XML tag on its own line:
<div class="highlight-python"><div class="highlight"><pre>markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>
</pre></div>
</div>
You can call <tt class="docutils literal">prettify()</tt> on the top-level <tt class="docutils literal">BeautifulSoup</tt> object or on any of its tag nodes:
<div class="highlight-python"><div class="highlight"><pre>print(soup.a.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>
</pre></div>
</div>
</div>

If you just want a string and do not care about formatting, you can call Python's <tt class="docutils literal">unicode()</tt> or <tt class="docutils literal">str()</tt> on a <tt class="docutils literal">BeautifulSoup</tt> object or on a <tt class="docutils literal">Tag</tt> object:
<div class="highlight-python"><div class="highlight"><pre>str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

unicode(soup.a)
# u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
</pre></div>
</div>
Under Python 2, <tt class="docutils literal">str()</tt> returns a string encoded in UTF-8; see <a class="reference internal" href="#id51">Encodings</a> for other options. Under Python 3 there is no <tt class="docutils literal">unicode()</tt> function, and <tt class="docutils literal">str()</tt> returns a Unicode string.
You can also call <tt class="docutils literal">encode()</tt> to get a bytestring, or <tt class="docutils literal">decode()</tt> to get Unicode.
</div>

If you give Beautiful Soup a document that contains HTML entities such as “&ldquo;”, they are converted to Unicode characters:
<div class="highlight-python"><div class="highlight"><pre>soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
unicode(soup)
# u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'
</pre></div>
</div>
If you then convert the document to a string, the Unicode characters are encoded as UTF-8. You do not get the HTML entities back:
<div class="highlight-python"><div class="highlight"><pre>str(soup)
# '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'
</pre></div>
</div>
</div>

If you only want the text part of a document or tag, you can call the <tt class="docutils literal">get_text()</tt> method. It returns all the text in a document or beneath a tag, including the text inside descendant tags, as a single Unicode string:
<div class="highlight-python"><div class="highlight"><pre>markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)

soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'
</pre></div>
</div>
You can specify a string to be used to join the bits of text together:
<div class="highlight-python"><div class="highlight"><pre>soup.get_text("|")
# u'\nI linked to |example.com|\n'
</pre></div>
</div>
You can also tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:
<div class="highlight-python"><div class="highlight"><pre>soup.get_text("|", strip=True)
# u'I linked to|example.com'
</pre></div>
</div>
Or use the <a class="reference internal" href="#strings-stripped-strings">.stripped_strings</a> generator and process the list of text yourself:
<div class="highlight-python"><div class="highlight"><pre>[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
</pre></div>
</div>
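The three variants of <tt class="docutils literal">get_text()</tt> can be compared side by side with the sketch below. The variable names and the explicit "html.parser" argument are assumptions of this example.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()                      # text nodes simply concatenated
joined = soup.get_text("|")                  # "|" placed between text nodes
stripped = soup.get_text("|", strip=True)    # each node stripped, empty ones dropped
pieces = list(soup.stripped_strings)         # the same pieces as a list

print(repr(plain), repr(joined), repr(stripped), pieces)
```

Note that the separator is placed between text nodes, not between tags, so the trailing newline node accounts for the final "|" in the second result.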

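</div>
</div>
<h1>Specifying the parser to use<a class="headerlink" href="#id48" title="Permalink to this headline">¶</a></h1>
If you just need to parse some HTML, you can pass the markup to the <tt class="docutils literal">BeautifulSoup</tt> constructor and it will probably be fine: Beautiful Soup picks a parser for you and parses the data. You can also pass an additional argument to choose which parser is used.
The first argument to the <tt class="docutils literal">BeautifulSoup</tt> constructor is the string or open filehandle containing the markup to parse. The second argument says how you would like the markup parsed. If you omit it, Beautiful Soup picks the best parser installed on the system, in this order of preference: lxml, html5lib, Python's built-in parser. You can override this by specifying one of the following:
<li>The type of markup you want to parse. Currently supported are “html”, “xml”, and “html5”.</li>
<li>The name of the parser library you want to use. Currently supported are “lxml”, “html5lib”, and “html.parser”.</li>
</ul>
</div></blockquote>
The section <a class="reference internal" href="#id9">Installing a parser</a> describes the supported parsers and how to install them.
If the parser you asked for is not installed, Beautiful Soup ignores the request and picks a different one. Right now, lxml is the only parser that supports XML. If lxml is not installed, you will not get a parsed XML document from the <tt class="docutils literal">BeautifulSoup</tt> constructor, whether or not you asked for lxml explicitly.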

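<h2>Differences between parsers<a class="headerlink" href="#id49" title="Permalink to this headline">¶</a></h2>
Beautiful Soup presents the same interface to a number of different parsers, but the parsers themselves differ. Different parsers may build different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parser. Here is a short fragment parsed as HTML:
<div class="highlight-python"><div class="highlight"><pre>BeautifulSoup("<a><b /></a>")
# <html><head></head><body><a><b></b></a></body></html>
</pre></div>
</div>
Since an empty <b /> tag is not valid HTML, the parser turns it into a <b></b> tag pair.
Here is the same document parsed as XML (this requires lxml to be installed). Note that the empty <b /> tag is left alone, and that the document is given an XML declaration instead of being wrapped in an <html> tag:
<div class="highlight-python"><div class="highlight"><pre>BeautifulSoup("<a><b /></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
</pre></div>
</div>
There are also differences between the HTML parsers. If you give Beautiful Soup a perfectly formed HTML document, these differences do not matter: one parser may be faster than another, but they all produce a tree that matches the original document.
If the document is not perfectly formed, however, different parsers may give different results. In the following example, lxml parses an invalid document and simply ignores the dangling </p> tag:
<div class="highlight-python"><div class="highlight"><pre>BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>
</pre></div>
</div>
Parsing the same document with html5lib gives a different result:
<div class="highlight-python"><div class="highlight"><pre>BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>
</pre></div>
</div>
Instead of ignoring the dangling </p> tag, html5lib pairs it with an opening <p> tag. It also adds an empty <head> tag to the document.
Here is the same document parsed with Python's built-in HTML parser:
<div class="highlight-python"><div class="highlight"><pre>BeautifulSoup("<a></p>", "html.parser")
# <a></a>
</pre></div>
</div>
Like lxml <a class="footnote-reference" href="#id88" id="id50">[7]</a>, the built-in parser ignores the closing </p> tag. Unlike html5lib, it makes no attempt to create a well-formed HTML document or to wrap the fragment in a <body> tag. Unlike lxml, it does not even add an <html> tag.
Since the fragment “<a></p>” is invalid, none of these results is the one “correct” way to handle it. html5lib uses techniques that are part of the HTML5 standard, so it has the best claim to being “correct”, but all three results are legitimate.
Differences between parsers can affect the results of your code. If you plan to distribute code that uses <tt class="docutils literal">BeautifulSoup</tt> to other people, it is best to state which parser you used, to avoid unnecessary trouble.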
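One way to act on this advice is to name the parser in every constructor call, as in the sketch below. It uses "html.parser" only because that parser ships with Python; the choice of parser here is an assumption of this example, not a recommendation from the text above.

```python
from bs4 import BeautifulSoup

# Naming the parser makes the result independent of which optional
# parsers (lxml, html5lib) happen to be installed on the machine.
soup = BeautifulSoup("<a></p>", "html.parser")
result = str(soup)

print(result)
```

With the parser pinned, the same script should build the same tree on every machine that has that parser available.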

</div>
</div>
Any HTML or XML document is written in a specific encoding such as ASCII or UTF-8, but once it has been parsed by Beautiful Soup the document has been converted to Unicode:
<div class="highlight-python"><div class="highlight"><pre>markup = "<h1>Sacr\xc3\xa9 bleu!</h1>"
soup = BeautifulSoup(markup)
soup.h1
# <h1>Sacré bleu!</h1>
soup.h1.string
# u'Sacr\xe9 bleu!'
</pre></div>
</div>
It is not magic (although that would be nice). Beautiful Soup uses a sub-library called <a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> to detect a document's encoding and convert it to Unicode. The autodetected encoding is recorded in the <tt class="docutils literal">.original_encoding</tt> attribute of the <tt class="docutils literal">BeautifulSoup</tt> object:
<div class="highlight-python"><div class="highlight"><pre>soup.original_encoding
# 'utf-8'
</pre></div>
</div>
<a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly only after a slow byte-by-byte search of the whole document. If you know the document's encoding ahead of time, you can avoid both mistakes and delays by passing it to the <tt class="docutils literal">BeautifulSoup</tt> constructor as the <tt class="docutils literal">from_encoding</tt> argument.
Here is a document written in ISO-8859-8. It is so short that Beautiful Soup misidentifies it as ISO-8859-7:
<div class="highlight-python"><pre>markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup)
soup.h1
# <h1>νεμω</h1>
soup.original_encoding
# 'ISO-8859-7'</pre>
</div>
You can fix this by passing the correct encoding as <tt class="docutils literal">from_encoding</tt>:
<div class="highlight-python"><pre>soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
soup.h1
# <h1>םולש</h1>
soup.original_encoding
# 'iso8859-8'</pre>
</div>
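Why such a short document is ambiguous can be seen without Beautiful Soup at all. The sketch below, which uses only Python's built-in codecs, decodes the same four bytes under both encodings; both decodes succeed, so a detector has little to go on. The variable names are choices of this example.

```python
# The four bytes from the <h1> element in the example above.
raw = b"\xed\xe5\xec\xf9"

# Both encodings accept every one of these bytes, so neither decode fails.
as_greek = raw.decode("iso-8859-7")   # four Greek letters
as_hebrew = raw.decode("iso-8859-8")  # four Hebrew letters

print(as_greek, as_hebrew)
```

Only knowledge from outside the bytes themselves (for example an HTTP header, or the <tt class="docutils literal">from_encoding</tt> argument) can settle which reading is intended.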

In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode is to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �) <a class="footnote-reference" href="#id90" id="id52">[9]</a>. If Beautiful Soup had to do this while guessing the encoding, it sets the <tt class="docutils literal">.contains_replacement_characters</tt> attribute of the <tt class="docutils literal">UnicodeDammit</tt> or <tt class="docutils literal">BeautifulSoup</tt> object to <tt class="docutils literal">True</tt>. This tells you that the Unicode representation is not an exact copy of the original and that some data was lost. If a document contains � but <tt class="docutils literal">.contains_replacement_characters</tt> is <tt class="docutils literal">False</tt>, the � was present in the original document and does not stand in for a failed conversion.

When you write out a document from Beautiful Soup, you get a UTF-8 document, whatever the encoding of the input. Here is a document written in Latin-1:
<div class="highlight-python"><div class="highlight"><pre>markup = b'''
<html>
 <head>
  <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
 </head>
 <body>
  <p>Sacr\xe9 bleu!</p>
 </body>
</html>
'''

soup = BeautifulSoup(markup)
print(soup.prettify())
# <html>
#  <head>
#   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
#  </head>
#  <body>
#   <p>
#    Sacré bleu!
#   </p>
#  </body>
# </html>
</pre></div>
</div>
Note that the <meta> tag has been rewritten to match the UTF-8 output encoding.
If you do not want UTF-8 output, you can pass an encoding to <tt class="docutils literal">prettify()</tt>:
<div class="highlight-python"><div class="highlight"><pre>print(soup.prettify("latin-1"))
# <html>
#  <head>
#   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
# ...
</pre></div>
</div>
You can also call <tt class="docutils literal">encode()</tt> on the <tt class="docutils literal">BeautifulSoup</tt> object or on any node in it, just as you would call <tt class="docutils literal">encode()</tt> on a Python string:
<div class="highlight-python"><div class="highlight"><pre>soup.p.encode("latin-1")
# '<p>Sacr\xe9 bleu!</p>'

soup.p.encode("utf-8")
# '<p>Sacr\xc3\xa9 bleu!</p>'
</pre></div>
</div>
Any characters that cannot be represented in the chosen encoding are converted into numeric XML character references. Here is a document that contains the Unicode character SNOWMAN:
<div class="highlight-python"><div class="highlight"><pre>markup = u"<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup)
tag = snowman_soup.b
</pre></div>
</div>
The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but encodings such as ISO-Latin-1 and ASCII have no representation for it, so in those encodings it is converted to “&#9731;”:
<div class="highlight-python"><div class="highlight"><pre>print(tag.encode("utf-8"))
# <b>☃</b>

print(tag.encode("latin-1"))
# <b>&#9731;</b>

print(tag.encode("ascii"))
# <b>&#9731;</b>
</pre></div>
</div>
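The same substitution can be reproduced with Python's own codecs through the "xmlcharrefreplace" error handler. The sketch below is a standard-library analogue of the behavior described above, not a look inside Beautiful Soup; the variable names are choices of this example.

```python
snowman = "\N{SNOWMAN}"  # U+2603, decimal code point 9731

# UTF-8 can represent the character directly.
as_utf8 = snowman.encode("utf-8")

# ASCII and Latin-1 cannot; "xmlcharrefreplace" substitutes a numeric
# character reference instead of raising an error.
as_ascii = snowman.encode("ascii", "xmlcharrefreplace")
as_latin1 = snowman.encode("latin-1", "xmlcharrefreplace")

# Without an error handler, the default strict mode raises.
try:
    snowman.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(as_utf8, as_ascii, as_latin1, strict_failed)
```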

</div>
<h2>Unicode, Dammit<a class="headerlink" href="#unicode-dammit" title="Permalink to this headline">¶</a></h2>
You can use <a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> without using Beautiful Soup. It is useful whenever you have data in an unknown encoding and just want it converted to Unicode:
<div class="highlight-python"><div class="highlight"><pre>from bs4 import UnicodeDammit
dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'utf-8'
</pre></div>
</div>
The guesses become much more accurate if the <tt class="docutils literal">chardet</tt> or <tt class="docutils literal">cchardet</tt> Python library is installed. The more data you provide, the more accurate the guess. If you already suspect what the encoding might be, you can pass your candidates in as a list, and they will be tried first:
<div class="highlight-python"><div class="highlight"><pre>dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'latin-1'
</pre></div>
</div>
<a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> has two special features that Beautiful Soup itself does not use.

You can use Unicode, Dammit to convert Microsoft smart quotes <a class="footnote-reference" href="#id91" id="id55">[10]</a> to HTML or XML entities:
<div class="highlight-python"><div class="highlight"><pre>markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
# u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'

UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
# u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'
</pre></div>
</div>
You can also convert the smart quotes to ASCII quotes:
<div class="highlight-python"><div class="highlight"><pre>UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
# u'<p>I just "love" Microsoft Word\'s smart quotes</p>'
</pre></div>
</div>
This is a useful feature, but Beautiful Soup does not use it. By default, Beautiful Soup converts smart quotes to Unicode characters along with everything else:
<div class="highlight-python"><div class="highlight"><pre>UnicodeDammit(markup, ["windows-1252"]).unicode_markup
# u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'
</pre></div>
</div>
</div>

<h3>Inconsistent encodings<a class="headerlink" href="#id56" title="Permalink to this headline">¶</a></h3>
Sometimes a document is mostly in UTF-8 but also contains Windows-1252 characters such as (again) Microsoft smart quotes <a class="footnote-reference" href="#id91" id="id57">[10]</a>. This tends to happen when a website includes data from multiple sources. You can use <tt class="docutils literal">UnicodeDammit.detwingle()</tt> to turn such a document into pure UTF-8. Here is a simple example:
<div class="highlight-python"><div class="highlight"><pre>snowmen = (u"\N{SNOWMAN}" * 3)
quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")
</pre></div>
</div>
This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both, because they are encoded differently:
<div class="highlight-python"><div class="highlight"><pre>print(doc)
# ☃☃☃�I like snowmen!�

print(doc.decode("windows-1252"))
# â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”
</pre></div>
</div>
Decoding the document as UTF-8 raises a <tt class="docutils literal">UnicodeDecodeError</tt>, and decoding it as Windows-1252 gives you gibberish. Fortunately, <tt class="docutils literal">UnicodeDammit.detwingle()</tt> converts the string to pure UTF-8, so that the snowmen and the quotation marks can be displayed together:
<div class="highlight-python"><div class="highlight"><pre>new_doc = UnicodeDammit.detwingle(doc)
print(new_doc.decode("utf8"))
# ☃☃☃“I like snowmen!”
</pre></div>
</div>
<tt class="docutils literal">UnicodeDammit.detwingle()</tt> only knows how to handle Windows-1252 embedded in UTF-8, but that is the most common case.
You must call <tt class="docutils literal">UnicodeDammit.detwingle()</tt> on your data before passing it to the <tt class="docutils literal">BeautifulSoup</tt> or <tt class="docutils literal">UnicodeDammit</tt> constructor, so that the document has a single consistent encoding. If you try to parse a UTF-8 document that contains Windows-1252 bytes, you are likely to get gibberish such as â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.
<tt class="docutils literal">UnicodeDammit.detwingle()</tt> is new in Beautiful Soup 4.1.0.
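The problem that <tt class="docutils literal">detwingle()</tt> addresses can be reproduced with the standard library alone. The sketch below rebuilds the mixed document and shows that neither single-encoding decode works; it demonstrates the failure only, not the repair, and the variable names are choices of this example.

```python
snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"

# One byte string, two encodings: UTF-8 snowmen followed by Windows-1252 quotes.
doc = snowmen.encode("utf8") + quote.encode("windows-1252")

# Strict UTF-8 decoding fails on the Windows-1252 quote bytes.
try:
    doc.decode("utf8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Windows-1252 decoding "succeeds" but turns each snowman into three junk characters.
mojibake = doc.decode("windows-1252")

print(utf8_ok, mojibake)
```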

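</div>
</div>
</div>
<h1>Parsing only part of a document<a class="headerlink" href="#id58" title="Permalink to this headline">¶</a></h1>
Suppose you only want to look at a document's <a> tags. Parsing the entire document and then searching it for <a> tags wastes both memory and time. It is much faster to ignore everything that is not an <a> tag in the first place. The <tt class="docutils literal">SoupStrainer</tt> class lets you choose which parts of an incoming document are parsed, so that only the parts it describes end up in the tree. Create a <tt class="docutils literal">SoupStrainer</tt> and pass it to the <tt class="docutils literal">BeautifulSoup</tt> constructor as the <tt class="docutils literal">parse_only</tt> argument. Note that this feature does not work with the html5lib parser, which always parses the whole document.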

<h2>SoupStrainer<a class="headerlink" href="#soupstrainer" title="Permalink to this headline">¶</a></h2>
The <tt class="docutils literal">SoupStrainer</tt> class takes the same arguments as a typical search method: <a class="reference internal" href="#id32">name</a>, <a class="reference internal" href="#css">attrs</a>, <a class="reference internal" href="#recursive">recursive</a>, <a class="reference internal" href="#text">text</a>, and <a class="reference internal" href="#keyword">**kwargs</a>. Here are three <tt class="docutils literal">SoupStrainer</tt> objects:
<div class="highlight-python"><div class="highlight"><pre>from bs4 import SoupStrainer

only_a_tags = SoupStrainer("a")

only_tags_with_id_link2 = SoupStrainer(id="link2")

def is_short_string(string):
    return len(string) < 10

only_short_strings = SoupStrainer(text=is_short_string)
</pre></div>
</div>
Let's bring back the “Alice” document once more and see what it looks like when parsed with each of these three <tt class="docutils literal">SoupStrainer</tt> objects:
<div class="highlight-python"><div class="highlight"><pre>html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
# <a class="sister" href="http://example.com/elsie" id="link1">
#  Elsie
# </a>
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>
# <a class="sister" href="http://example.com/tillie" id="link3">
#  Tillie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
# <a class="sister" href="http://example.com/lacie" id="link2">
#  Lacie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
# Elsie
# ,
# Lacie
# and
# Tillie
# ...
#
</pre></div>
</div>
You can also pass a <tt class="docutils literal">SoupStrainer</tt> to any of the methods covered in <a class="reference internal" href="#id24">Searching the tree</a>. This is probably not a common use, but it is worth mentioning:
<div class="highlight-python"><div class="highlight"><pre>soup = BeautifulSoup(html_doc)
soup.find_all(only_short_strings)
# [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
#  u'\n\n', u'...', u'\n']
</pre></div>
</div>
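The effect of <tt class="docutils literal">parse_only</tt> can be verified by checking what is left in the tree afterwards. The sketch below uses a shortened copy of the “Alice” document; the shortened markup and the variable names are assumptions of this example.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened stand-in for the "Alice" document (an assumption for this sketch).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""

# Keep only <a> tags while parsing; everything else is discarded up front.
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

link_ids = [a["id"] for a in soup.find_all("a")]
paragraphs = soup.find_all("p")    # the <p> tags were never added to the tree
title = soup.find("title")         # neither was <title>

print(link_ids, paragraphs, title)
```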

</div>
</div>
If you are having trouble understanding what Beautiful Soup does to a document, pass the document to the <tt class="docutils literal">diagnose()</tt> function (new in Beautiful Soup 4.2.0). Beautiful Soup prints a report showing how the different parsers handle the document and telling you whether a parser it could be using is missing:
<div class="highlight-python"><div class="highlight"><pre>from bs4.diagnose import diagnose
data = open("bad.html").read()
diagnose(data)

# Diagnostic running on Beautiful Soup 4.2.0
# Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
# I noticed that html5lib is not installed. Installing it may help.
# Found lxml version 2.3.2.0
#
# Trying to parse your data with html.parser
# Here's what html.parser did with the document:
# ...
</pre></div>
</div>
The output of <tt class="docutils literal">diagnose()</tt> may be enough to show you what is wrong. If not, you can paste the output when asking someone else for help.
</div>

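<h2>Errors when parsing a document<a class="headerlink" href="#id61" title="Permalink to this headline">¶</a></h2>
There are two kinds of parse errors. The first is a crash: you give Beautiful Soup a document and it raises an exception, usually an <tt class="docutils literal">HTMLParser.HTMLParseError</tt>. The second is unexpected behavior: the parse tree that Beautiful Soup builds looks very different from the original document.
Almost none of these problems are caused by Beautiful Soup itself. That is not because Beautiful Soup is exceptionally well written, but because it contains no parsing code of its own. The errors come from the parser it relies on, and if that parser cannot handle the document, the best solution is to try a different parser. See <a class="reference internal" href="#id9">Installing a parser</a> for details.
The most common parse errors are <tt class="docutils literal">HTMLParser.HTMLParseError: malformed start tag</tt> and <tt class="docutils literal">HTMLParser.HTMLParseError: bad end tag</tt>. Both are raised by Python's built-in parser, and the solution is to <a class="reference internal" href="#id9">install lxml or html5lib</a>.
The most common unexpected behavior is that you cannot find a tag that you can plainly see in the document: <tt class="docutils literal">find_all()</tt> returns [] or <tt class="docutils literal">find()</tt> returns None. This is another problem with Python's built-in parser, which sometimes skips tags it does not understand. Again, the solution is to <a class="reference internal" href="#id9">install lxml or html5lib</a>.
</div>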

<li><tt class="docutils literal">SyntaxError: Invalid syntax</tt> (异常位置在代码行: <tt class="docutils literal">ROOT_TAG_NAME = u'[document]'</tt> ),因为Python2版本的代码没有经过迁移就在Python3中窒息感</li>

1995

<li><tt class="docutils literal">ImportError: No module named HTMLParser</tt> 因为在Python3中执行Python2版本的Beautiful Soup</li>

1996

<li><tt class="docutils literal">ImportError: No module named html.parser</tt> 因为在Python2中执行Python3版本的Beautiful Soup</li>

1997

<li><tt class="docutils literal">ImportError: No module named BeautifulSoup</tt> 因为在没有安装BeautifulSoup3库的Python环境下执行代码,或忘记了BeautifulSoup4的代码需要从 <tt class="docutils literal">bs4</tt> 包中引入</li>

1998

<li><tt class="docutils literal">ImportError: No module named bs4</tt> 因为当前Python环境下还没有安装BeautifulSoup4</li>

1999

</ul>

2000

</div>

<h2>Parsing XML<a class="headerlink" href="#xml" title="Permalink to this headline">¶</a></h2>

By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass “xml” as the second argument to the <tt class="docutils literal">BeautifulSoup</tt> constructor:

<div class="highlight-python"><div class="highlight"><pre>soup = BeautifulSoup(markup, "xml")
</pre></div>
</div>

You also need to have <a class="reference internal" href="#id9">lxml installed</a>.
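As a hedged illustration of the lxml requirement, the sketch below requests the XML parser and falls back gracefully when none is available. The markup and the variable names are invented for the example:

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<Catalog><Item/></Catalog>"  # hypothetical XML snippet

try:
    soup = BeautifulSoup(markup, "xml")
    # Collect the name of every tag in the parsed document.
    tag_names = [tag.name for tag in soup.find_all(True)]
except FeatureNotFound:
    # Raised when no parser providing the "xml" feature (lxml) is installed.
    tag_names = None

print(tag_names)
```

Catching <tt class="docutils literal">FeatureNotFound</tt> is one way to report a missing parser clearly instead of letting the script crash.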

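</div>

<h2>Other parser problems<a class="headerlink" href="#id63" title="Permalink to this headline">¶</a></h2>

<li>If the same code produces different results in different environments, the two environments probably have different parsers available. For example, one environment may have lxml installed while the other has only html5lib. <a class="reference internal" href="#id49">Differences between parsers</a> explains why this matters. The fix is to name a specific parser in the <tt class="docutils literal">BeautifulSoup</tt> constructor.</li>
<li>Because HTML tags and attributes are <a class="reference external" href="http://www.w3.org/TR/html5/syntax.html#syntax">case-insensitive</a>, all three HTML parsers convert tag and attribute names to lowercase when processing a document. For example, &lt;TAG&gt;&lt;/TAG&gt; in the markup is converted to &lt;tag&gt;&lt;/tag&gt;. If you want to preserve uppercase or mixed-case tag names, you need to <a class="reference internal" href="#xml">parse the document as XML</a>.</li>
</ul>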
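The two points above can be sketched as follows. The parser is named explicitly so that the result does not depend on which libraries happen to be installed, and the uppercase markup is a made-up example:

```python
from bs4 import BeautifulSoup

markup = "<TAG ATTR='value'>text</TAG>"  # hypothetical uppercase markup

# Naming the parser makes the behavior the same in every environment.
soup = BeautifulSoup(markup, "html.parser")

# The HTML parser lowercases tag and attribute names.
lowered = str(soup)
attr_value = soup.tag["attr"]
print(lowered)
```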

</div>

<li><tt class="docutils literal">UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar</tt> (or any other <tt class="docutils literal">UnicodeEncodeError</tt>): this is not a problem with Beautiful Soup, and it shows up in two main situations. The first is printing a Unicode character that your console cannot display; see the <a class="reference external" href="http://wiki.Python.org/moin/PrintFails">Python wiki</a> for help. The second is writing to a file whose default encoding does not support the character. In that case, the simplest solution is to encode the string explicitly as UTF-8 with <tt class="docutils literal">u.encode("utf8")</tt>.</li>
<li><tt class="docutils literal">KeyError: [attr]</tt>: caused by accessing <tt class="docutils literal">tag['attr']</tt> when the tag does not define that attribute. The most common cases are <tt class="docutils literal">KeyError: 'href'</tt> and <tt class="docutils literal">KeyError: 'class'</tt>. If you are not sure an attribute is defined, use <tt class="docutils literal">tag.get('attr')</tt>, just as you would with a Python dictionary.</li>
<li><tt class="docutils literal">AttributeError: 'ResultSet' object has no attribute 'foo'</tt>: usually caused by treating the result of <tt class="docutils literal">find_all()</tt> as a single tag or string. In fact <tt class="docutils literal">find_all()</tt> returns a list of tags and strings, a <tt class="docutils literal">ResultSet</tt> object. Iterate over the list to read the <tt class="docutils literal">.foo</tt> of each item, or use <tt class="docutils literal">find()</tt> if you only want one result.</li>
<li><tt class="docutils literal">AttributeError: 'NoneType' object has no attribute 'foo'</tt>: usually caused by calling <tt class="docutils literal">find()</tt> and then immediately accessing the .foo attribute of the result. <tt class="docutils literal">find()</tt> found nothing, so it returned <tt class="docutils literal">None</tt>. You need to work out why <tt class="docutils literal">find()</tt> returned <tt class="docutils literal">None</tt>.</li>
</ul>
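A short sketch of how each of the errors listed above can be avoided. The HTML snippet and the variable names are invented for illustration:

```python
from bs4 import BeautifulSoup

html = '<p><a class="plain">one</a><a href="http://example.com/">two</a></p>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (a list), so loop over it instead of
# reading an attribute from the result itself.
links = soup.find_all("a")

# tag.get() returns None for a missing attribute, where tag["href"]
# would raise KeyError.
hrefs = [link.get("href") for link in links]

# find() returns None when nothing matches, so check before using it.
table = soup.find("table")
table_text = table.get_text() if table is not None else ""

# Encode explicitly before writing text that the default encoding
# may not support.
encoded = soup.get_text().encode("utf8")
```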

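</div>

<h2>Improving performance<a class="headerlink" href="#id66" title="Permalink to this headline">¶</a></h2>

Beautiful Soup will never be faster than the parser it sits on top of. If response time is critical, or if computer time is worth more than programmer time, you should work directly with <a class="reference external" href="http://lxml.de/">lxml</a>.

That said, there are ways to speed up Beautiful Soup. The first is to use lxml as the underlying parser: Beautiful Soup parses documents much faster with lxml than with html5lib or Python's built-in parser.

Installing the <a class="reference external" href="http://pypi.Python.org/pypi/cchardet/">cchardet</a> library makes encoding detection considerably faster.

<a class="reference internal" href="#id58">Parsing only part of a document</a> will not save much parsing time, but it can save a lot of memory and makes searching the document faster.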
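As a rough sketch of parsing only part of a document, the example below keeps just the <tt class="docutils literal">a</tt> tags by passing a <tt class="docutils literal">SoupStrainer</tt> as <tt class="docutils literal">parse_only</tt>. The markup is a made-up example, and <tt class="docutils literal">html.parser</tt> is used only so the sketch runs without extra libraries:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = (
    "<html><body><p>Intro</p>"
    '<a href="/one">one</a>'
    '<p><a href="/two">two</a></p>'
    "</body></html>"
)

# Only <a> tags (and their contents) are turned into objects.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
paragraph = soup.find("p")  # the <p> tags were never added to the tree
```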

</div>

</div>

<h1>Beautiful Soup 3<a class="headerlink" href="#beautiful-soup-3" title="Permalink to this headline">¶</a></h1>

Beautiful Soup 3 is the previous release series and is no longer maintained. It is currently packaged with several major Linux distributions:

<tt class="docutils literal">$ apt-get install python-beautifulsoup</tt>

It is published on PyPI under the name <tt class="docutils literal">BeautifulSoup</tt>:

<tt class="docutils literal">$ easy_install BeautifulSoup</tt>

<tt class="docutils literal">$ pip install BeautifulSoup</tt>

You can also install it from the <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz">Beautiful Soup 3.2.0 source tarball</a>.

The Beautiful Soup 3 documentation is available online <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html">here</a>, and there is also a <a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html">Chinese translation</a>. After reading it, read this document to see what has changed in Beautiful Soup 4.

<h2>Porting code to BS4<a class="headerlink" href="#id70" title="Permalink to this headline">¶</a></h2>

Most code written for Beautiful Soup 3 will work with Beautiful Soup 4 after one small change: the way the <tt class="docutils literal">BeautifulSoup</tt> class is imported. Change this:

<div class="highlight-python"><div class="highlight"><pre>from BeautifulSoup import BeautifulSoup
</pre></div>
</div>

to this:

<div class="highlight-python"><div class="highlight"><pre>from bs4 import BeautifulSoup
</pre></div>
</div>

<li>If the code raises <tt class="docutils literal">ImportError</tt> with the message “No module named BeautifulSoup”, you are probably trying to run Beautiful Soup 3 code in an environment where only Beautiful Soup 4 is installed.</li>
<li>If the code raises <tt class="docutils literal">ImportError</tt> with the message “No module named bs4”, you are probably trying to run Beautiful Soup 4 code in an environment where only Beautiful Soup 3 is installed.</li>
</ul>

Although BS4 is compatible with most of BS3, most BS3 methods are deprecated and have been given new names that follow the <a class="reference external" href="http://www.Python.org/dev/peps/pep-0008/">PEP 8 standard</a>. Many methods were renamed, but only a few of the changes break backwards compatibility.

The following sections cover what you need to know when migrating from BS3 to BS4.

<h3>You need a parser<a class="headerlink" href="#id71" title="Permalink to this headline">¶</a></h3>

Beautiful Soup 3 used Python's <tt class="docutils literal">SGMLParser</tt>, a module that was removed in Python 3. Beautiful Soup 4 uses the built-in <tt class="docutils literal">html.parser</tt> by default, but you can use lxml or html5lib instead. See <a class="reference internal" href="#id9">Installing a parser</a>.

Because <tt class="docutils literal">html.parser</tt> is not the same parser as <tt class="docutils literal">SGMLParser</tt>, the two treat invalid markup differently. Usually the difference is that <tt class="docutils literal">html.parser</tt> raises an exception, which is why installing one of the other parsers is recommended. Sometimes <tt class="docutils literal">html.parser</tt> simply builds a different parse tree than <tt class="docutils literal">SGMLParser</tt> would. If that happens, you need to update your BS3 code to deal with the new tree.

</div>

<h3>Method names<a class="headerlink" href="#id72" title="Permalink to this headline">¶</a></h3>

<li><tt class="docutils literal">renderContents</tt> -> <tt class="docutils literal">encode_contents</tt></li>
<li><tt class="docutils literal">replaceWith</tt> -> <tt class="docutils literal">replace_with</tt></li>
<li><tt class="docutils literal">replaceWithChildren</tt> -> <tt class="docutils literal">unwrap</tt></li>
<li><tt class="docutils literal">findAll</tt> -> <tt class="docutils literal">find_all</tt></li>
<li><tt class="docutils literal">findAllNext</tt> -> <tt class="docutils literal">find_all_next</tt></li>
<li><tt class="docutils literal">findAllPrevious</tt> -> <tt class="docutils literal">find_all_previous</tt></li>
<li><tt class="docutils literal">findNext</tt> -> <tt class="docutils literal">find_next</tt></li>
<li><tt class="docutils literal">findNextSibling</tt> -> <tt class="docutils literal">find_next_sibling</tt></li>
<li><tt class="docutils literal">findNextSiblings</tt> -> <tt class="docutils literal">find_next_siblings</tt></li>
<li><tt class="docutils literal">findParent</tt> -> <tt class="docutils literal">find_parent</tt></li>
<li><tt class="docutils literal">findParents</tt> -> <tt class="docutils literal">find_parents</tt></li>
<li><tt class="docutils literal">findPrevious</tt> -> <tt class="docutils literal">find_previous</tt></li>
<li><tt class="docutils literal">findPreviousSibling</tt> -> <tt class="docutils literal">find_previous_sibling</tt></li>
<li><tt class="docutils literal">findPreviousSiblings</tt> -> <tt class="docutils literal">find_previous_siblings</tt></li>
<li><tt class="docutils literal">nextSibling</tt> -> <tt class="docutils literal">next_sibling</tt></li>
<li><tt class="docutils literal">previousSibling</tt> -> <tt class="docutils literal">previous_sibling</tt></li>
</ul>
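A brief sketch using a few of the new names from the list above, with the BS3 name noted beside each call. The markup is a made-up example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b><i>italic</i></p>", "html.parser")

following = soup.b.next_sibling   # BS3 name: nextSibling
italics = soup.find_all("i")      # BS3 name: findAll

soup.i.unwrap()                   # BS3 name: replaceWithChildren
soup.b.replace_with("plain")      # BS3 name: replaceWith

result = str(soup)
print(result)
```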

Some arguments to the Beautiful Soup constructor were renamed for the same reason:

<li><tt class="docutils literal">BeautifulSoup(parseOnlyThese=...)</tt> -> <tt class="docutils literal">BeautifulSoup(parse_only=...)</tt></li>
<li><tt class="docutils literal">BeautifulSoup(fromEncoding=...)</tt> -> <tt class="docutils literal">BeautifulSoup(from_encoding=...)</tt></li>
</ul>

One method was renamed for compatibility with Python 3:

</ul>

One attribute was renamed to use more accurate terminology:

<li><tt class="docutils literal">Tag.isSelfClosing</tt> -> <tt class="docutils literal">Tag.is_empty_element</tt></li>
</ul>

The three attributes below were renamed to avoid clashing with words that have special meaning in Python. Unlike the other renames, these changes are not backwards compatible: code that uses these attributes in BS3 will not run in BS4 until it is updated.

<li><tt class="docutils literal">UnicodeDammit.unicode</tt> -> <tt class="docutils literal">UnicodeDammit.unicode_markup</tt></li>
<li><tt class="docutils literal">Tag.next</tt> -> <tt class="docutils literal">Tag.next_element</tt></li>
<li><tt class="docutils literal">Tag.previous</tt> -> <tt class="docutils literal">Tag.previous_element</tt></li>
</ul>

</div>

The following generators were given PEP 8-compliant names and turned into properties:

<li><tt class="docutils literal">childGenerator()</tt> -> <tt class="docutils literal">children</tt></li>
<li><tt class="docutils literal">nextGenerator()</tt> -> <tt class="docutils literal">next_elements</tt></li>
<li><tt class="docutils literal">nextSiblingGenerator()</tt> -> <tt class="docutils literal">next_siblings</tt></li>
<li><tt class="docutils literal">previousGenerator()</tt> -> <tt class="docutils literal">previous_elements</tt></li>
<li><tt class="docutils literal">previousSiblingGenerator()</tt> -> <tt class="docutils literal">previous_siblings</tt></li>
<li><tt class="docutils literal">recursiveChildGenerator()</tt> -> <tt class="docutils literal">descendants</tt></li>
<li><tt class="docutils literal">parentGenerator()</tt> -> <tt class="docutils literal">parents</tt></li>
</ul>

So when migrating to BS4, instead of this:

<div class="highlight-python"><div class="highlight"><pre>for parent in tag.parentGenerator():
    ...
</pre></div>
</div>

you can write this:

<div class="highlight-python"><div class="highlight"><pre>for parent in tag.parents:
    ...
</pre></div>
</div>

(Both forms currently work.)
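For instance, a minimal sketch of walking up the tree with the <tt class="docutils literal">parents</tt> property. The markup is invented for the example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>text</b></p></div>", "html.parser")

# .parents is a property, so it is iterated without being called.
ancestor_names = [parent.name for parent in soup.b.parents]
print(ancestor_names)
```

The last name in the list belongs to the <tt class="docutils literal">BeautifulSoup</tt> object itself, which sits at the top of the tree.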

Some BS3 generators yielded <tt class="docutils literal">None</tt> after they were done and then stopped. That was a bug. The new generators simply stop without yielding <tt class="docutils literal">None</tt>.

BS4 adds two new generators, <a class="reference internal" href="#strings-stripped-strings">.strings and .stripped_strings</a>. <tt class="docutils literal">.strings</tt> yields NavigableString objects, and <tt class="docutils literal">.stripped_strings</tt> yields Python strings with leading and trailing whitespace removed.
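A small sketch of the difference between the two generators, using made-up markup with extra whitespace:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> one <b> two </b></p>", "html.parser")

# .strings keeps the whitespace exactly as it appears in the document.
raw = [str(s) for s in soup.strings]

# .stripped_strings removes leading and trailing whitespace.
stripped = list(soup.stripped_strings)
```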

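</div>

BS4 no longer has the <tt class="docutils literal">BeautifulStoneSoup</tt> class for parsing XML. To parse XML, use the <tt class="docutils literal">BeautifulSoup</tt> constructor and pass “xml” as the second argument. For the same reason, the <tt class="docutils literal">BeautifulSoup</tt> constructor no longer recognizes the <tt class="docutils literal">isHTML</tt> argument.

Beautiful Soup's handling of empty-element XML tags has been improved. Previously, you had to state explicitly which tags were empty-element tags when parsing XML. The constructor's <tt class="docutils literal">selfClosingTags</tt> argument is no longer used. Beautiful Soup now treats any empty tag as an empty-element tag. If you add a child to an empty-element tag, it stops being an empty-element tag.

</div>

Incoming HTML or XML entities are always converted to the corresponding Unicode characters. Beautiful Soup 3 had several ways of handling entities, all of which have been removed. The <tt class="docutils literal">BeautifulSoup</tt> constructor no longer accepts the <tt class="docutils literal">smartQuotesTo</tt> or <tt class="docutils literal">convertEntities</tt> arguments. <a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a> still has a <tt class="docutils literal">smart_quotes_to</tt> argument, but its default is now to convert smart quotes to Unicode. The constants <tt class="docutils literal">HTML_ENTITIES</tt>, <tt class="docutils literal">XML_ENTITIES</tt> and <tt class="docutils literal">XHTML_ENTITIES</tt> have been removed, because the feature they configured is no longer supported.

If you want to turn Unicode characters back into HTML entities on output, rather than writing them as UTF-8, you need to use an <a class="reference internal" href="#id47">output formatter</a>.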
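A hedged sketch of both directions: an incoming entity becomes a Unicode character, and the <tt class="docutils literal">"html"</tt> formatter turns Unicode characters back into entities on output. The markup is a made-up example:

```python
from bs4 import BeautifulSoup

# An incoming entity is converted to the Unicode character it names.
parsed = BeautifulSoup("<p>&eacute;</p>", "html.parser")
character = str(parsed.p.string)

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")

# Default output keeps the Unicode character.
as_unicode = soup.decode()

# The "html" formatter writes named HTML entities where they exist.
as_entities = soup.decode(formatter="html")
```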

</div>

<a class="reference internal" href="#string">Tag.string</a> now operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string.
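A minimal sketch of the recursive behavior, using invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text</b></a>", "html.parser")

# <a> contains only <b>, so both tags report the same string.
outer = soup.a.string
inner = soup.b.string

# With more than one child there is no single string to report.
mixed = BeautifulSoup("<a><b>x</b><i>y</i></a>", "html.parser")
ambiguous = mixed.a.string
```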

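<a class="reference internal" href="#id12">Multi-valued attributes</a> such as <tt class="docutils literal">class</tt> now have a list of strings as their value rather than a single string. This may affect the way you search for tags by CSS class.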
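A short sketch of what this looks like in practice. The markup is a made-up example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout">text</p>', "html.parser")

# The class attribute is a list of strings, not one string.
classes = soup.p["class"]

# Searching for one CSS class matches tags that carry several classes.
matches = soup.find_all("p", class_="strikeout")
```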

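If you pass both the <a class="reference internal" href="#text">text argument</a> and a tag-specific argument such as the <a class="reference internal" href="#id32">name argument</a> to one of the <tt class="docutils literal">find*</tt> methods, Beautiful Soup searches for tags that match the tag-specific criteria and whose <a class="reference internal" href="#string">Tag.string</a> matches the value of text. It does not return the strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and searched only for strings.

The <tt class="docutils literal">BeautifulSoup</tt> constructor no longer supports the markupMassage argument. It is now the parser's responsibility to handle markup correctly.

The rarely used alternate parser classes such as <tt class="docutils literal">ICantBelieveItsBeautifulSoup</tt> and <tt class="docutils literal">BeautifulSOAP</tt> have been removed. The parser now decides entirely how to interpret ambiguous markup.

The <tt class="docutils literal">prettify()</tt> method now returns a Unicode string rather than a bytestring.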
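A small sketch of the return types under Python 3. Passing an encoding to get bytes back is shown as an option, and the markup is a made-up example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>text</p>", "html.parser")

# With no arguments, prettify() returns a Unicode string.
pretty = soup.prettify()

# Passing an encoding asks for an encoded bytestring instead.
pretty_bytes = soup.prettify(encoding="utf-8")
```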

<a class="reference external" href="http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html">Beautiful Soup 3 documentation</a>

<tr><td class="label"><a class="fn-backref" href="#id3">[1]</a></td><td>The Beautiful Soup Google group is not very active, perhaps because the library is already fairly mature, but the author is still happy to help solve your problem where possible.</td></tr>
</tbody>
</table>

<tr><td class="label">[2]</td><td>(<a class="fn-backref" href="#id19">1</a>, <a class="fn-backref" href="#id23">2</a>) The document is parsed into a tree structure, so the next element to be parsed should be a child of the current node.</td></tr>
</tbody>
</table>

<tr><td class="label"><a class="fn-backref" href="#id26">[3]</a></td><td>A filter can only be used as an argument when searching the document, so “argument type” might be a more fitting name. The original text uses <tt class="docutils literal">filter</tt>, and the translation follows it.</td></tr>
</tbody>
</table>

<tr><td class="label"><a class="fn-backref" href="#id31">[4]</a></td><td>An element argument is a tag node in the HTML document. It cannot be a text node.</td></tr>
</tbody>
</table>

<tr><td class="label">[5]</td><td>(<a class="fn-backref" href="#id18">1</a>, <a class="fn-backref" href="#id33">2</a>, <a class="fn-backref" href="#id34">3</a>, <a class="fn-backref" href="#id35">4</a>, <a class="fn-backref" href="#id36">5</a>) Uses pre-order traversal.</td></tr>
</tbody>
</table>

<tr><td class="label">[6]</td><td>(<a class="fn-backref" href="#id38">1</a>, <a class="fn-backref" href="#id39">2</a>) CSS selectors are a separate syntax for searching a document. See <a class="reference external" href="http://www.w3school.com.cn/css/css_selector_type.asp">http://www.w3school.com.cn/css/css_selector_type.asp</a></td></tr>
</tbody>
</table>

<tr><td class="label"><a class="fn-backref" href="#id50">[7]</a></td><td>The original text says html5lib. The translator believes this is a typo in the original document.</td></tr>
</tbody>
</table>

<tr><td class="label"><a class="fn-backref" href="#id43">[8]</a></td><td>“Wrap” means to package or enclose. Here, however, the wrapping is not applied from outside: the inner content of the current tag is wrapped in a tag. The new tag that wraps the original content remains inside the tag on which <a class="reference internal" href="#wrap">wrap()</a> was called.</td></tr>
</tbody>
</table>

<tr><td class="label"><a class="fn-backref" href="#id52">[9]</a></td><td>Beautiful Soup automatically replaces characters it cannot decode with a replacement character (usually �). If a document that mixes several encodings must be converted with complete accuracy, the only option is to preprocess it by hand and unify the encoding first.</td></tr>
</tbody>
</table>

<tr><td class="label">[10]</td><td>(<a class="fn-backref" href="#id55">1</a>, <a class="fn-backref" href="#id57">2</a>) Smart quotes commonly appear in Microsoft Word: within a paragraph, each quotation mark is automatically converted to a left or right quote according to the order in which it appears.</td></tr>
</tbody>
</table>
</div>

</div>

</div>

</div>

</div>

</div>

<h3><a href="index.html">Table Of Contents</a></h3>
<ul>
<li><a class="reference internal" href="#">Beautiful Soup 4.2.0 Documentation</a><ul>
</ul>
</li>
<li><a class="reference internal" href="#id5">Installing Beautiful Soup</a><ul>
<li><a class="reference internal" href="#id8">Problems after installation</a></li>
<li><a class="reference internal" href="#id9">Installing a parser</a></li>
</ul>
</li>
<li><a class="reference internal" href="#id11">Kinds of objects</a><ul>
<li><a class="reference internal" href="#attributes">Attributes</a><ul>
</ul>
</li>
</ul>
</li>
<li><a class="reference internal" href="#id13">NavigableString</a></li>
<li><a class="reference internal" href="#beautifulsoup">BeautifulSoup</a></li>
<li><a class="reference internal" href="#id14">Comments and other special strings</a></li>
</ul>
</li>

<li><a class="reference internal" href="#id15">Navigating the tree</a><ul>
<li><a class="reference internal" href="#id17">Navigating using tag names</a></li>
<li><a class="reference internal" href="#contents-children">.contents and .children</a></li>
<li><a class="reference internal" href="#descendants">.descendants</a></li>
<li><a class="reference internal" href="#string">.string</a></li>
<li><a class="reference internal" href="#strings-stripped-strings">.strings and .stripped_strings</a></li>
</ul>
</li>
<li><a class="reference internal" href="#parent">.parent</a></li>
<li><a class="reference internal" href="#parents">.parents</a></li>
</ul>
</li>
<li><a class="reference internal" href="#next-sibling-previous-sibling">.next_sibling and .previous_sibling</a></li>
<li><a class="reference internal" href="#next-siblings-previous-siblings">.next_siblings and .previous_siblings</a></li>
</ul>
</li>
<li><a class="reference internal" href="#id22">Going back and forth</a><ul>
<li><a class="reference internal" href="#next-element-previous-element">.next_element and .previous_element</a></li>
<li><a class="reference internal" href="#next-elements-previous-elements">.next_elements and .previous_elements</a></li>
</ul>
</li>
</ul>
</li>
<li><a class="reference internal" href="#id24">Searching the tree</a><ul>
<li><a class="reference internal" href="#id28">A regular expression</a></li>
</ul>
</li>
<li><a class="reference internal" href="#keyword">The keyword arguments</a></li>
<li><a class="reference internal" href="#css">Searching by CSS class</a></li>
<li><a class="reference internal" href="#limit">The <tt class="docutils literal">limit</tt> argument</a></li>
<li><a class="reference internal" href="#recursive">The <tt class="docutils literal">recursive</tt> argument</a></li>
</ul>
</li>
<li><a class="reference internal" href="#find-all-tag">Calling a tag is like calling <tt class="docutils literal">find_all()</tt></a></li>
<li><a class="reference internal" href="#find-parents-find-parent">find_parents() and find_parent()</a></li>
<li><a class="reference internal" href="#find-next-siblings-find-next-sibling">find_next_siblings() and find_next_sibling()</a></li>
<li><a class="reference internal" href="#find-previous-siblings-find-previous-sibling">find_previous_siblings() and find_previous_sibling()</a></li>
<li><a class="reference internal" href="#find-all-previous-find-previous">find_all_previous() and find_previous()</a></li>
<li><a class="reference internal" href="#id37">CSS selectors</a></li>
</ul>
</li>

<li><a class="reference internal" href="#id40">Modifying the tree</a><ul>
<li><a class="reference internal" href="#id41">Changing tag names and attributes</a></li>
<li><a class="reference internal" href="#id42">Modifying .string</a></li>
<li><a class="reference internal" href="#append">append()</a></li>
<li><a class="reference internal" href="#beautifulsoup-new-string-new-tag">BeautifulSoup.new_string() and .new_tag()</a></li>
<li><a class="reference internal" href="#insert">insert()</a></li>
<li><a class="reference internal" href="#insert-before-insert-after">insert_before() and insert_after()</a></li>
<li><a class="reference internal" href="#clear">clear()</a></li>
<li><a class="reference internal" href="#extract">extract()</a></li>
<li><a class="reference internal" href="#decompose">decompose()</a></li>
<li><a class="reference internal" href="#replace-with">replace_with()</a></li>
<li><a class="reference internal" href="#unwrap">unwrap()</a></li>
</ul>
</li>
<li><a class="reference internal" href="#id45">Pretty-printing</a></li>
</ul>
</li>
<li><a class="reference internal" href="#id48">Specifying the parser to use</a><ul>
<li><a class="reference internal" href="#id49">Differences between parsers</a></li>
</ul>
</li>
<li><a class="reference internal" href="#unicode-dammit">Unicode, Dammit</a><ul>
<li><a class="reference internal" href="#id56">Inconsistent encodings</a></li>
</ul>
</li>
</ul>
</li>
<li><a class="reference internal" href="#id58">Parsing only part of a document</a><ul>
<li><a class="reference internal" href="#soupstrainer">SoupStrainer</a></li>
</ul>
</li>
<li><a class="reference internal" href="#id61">Errors when parsing a document</a></li>
<li><a class="reference internal" href="#xml">Parsing XML</a></li>
<li><a class="reference internal" href="#id63">Other parser problems</a></li>
<li><a class="reference internal" href="#id66">Improving performance</a></li>
</ul>
</li>
<li><a class="reference internal" href="#beautiful-soup-3">Beautiful Soup 3</a><ul>
<li><a class="reference internal" href="#id70">Porting code to BS4</a><ul>
<li><a class="reference internal" href="#id71">You need a parser</a></li>
<li><a class="reference internal" href="#id72">Method names</a></li>
</ul>
</li>
</ul>
</li>
</ul>

<li><a href="_sources/zh.txt"
rel="nofollow">Show Source</a></li>
</ul>

</form>

</div>

</div>

</div>

</div>

<h3>Navigation</h3>
<ul>
<a href="genindex.html" title="General Index"
>index</a></li>
<li><a href="index.html">Beautiful Soup 4.2.0 documentation</a> »</li>
</ul>
</div>

Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.2b1.
</div>
</body>
</html>