<html><head><title>Writing Web Clients</title></head><body>
<h1>Writing Web Clients</h1>
<h2>Web Clients -- The Tutorial</h2><ul>
<li>Gimmick -- Buffy quotes</li>
</ul>
<em>Anya (Family, season 5) -- Thank you for coming. We value your patronage.</em>
<h2>What Are Web Clients?</h2><ul>
<li>Clarification: non-interactive web clients</li>
<li>Special purpose</li>
<li>Often quick and dirty hacks</li>
<li>Turn a web page into an API</li>
</ul>
<em>Giles (Family, season 5) -- Could we please be a little less effusive, Anya?</em>
<h2>What Are Web Clients Useful For?</h2><ul>
<li>Mass download</li>
<li>Periodic checking</li>
<li>Automating tasks<ul><li>Make a web page more friendly</li>
</ul></li>
</ul>
<em>Harmony (Family, season 5) -- Aww. You're my little lamb.</em>
<h2>Review of Modules</h2><ul>
<li>htmllib</li>
<li>urllib/urllib2</li>
<li>httplib</li>
<li>urlparse</li>
</ul>
<em>Buffy (Family, season 5) -- Your definition of narrow is impressively wide.</em>
<h2>Modules -- htmllib</h2><ul>
<li>Most useful for easy filtering of images</li>
<li>Other things are often easier with sgmllib</li>
<li>Or with plain string manipulation</li>
</ul>
<em>Xander (Family, season 5) -- The answer is somewhere here.</em>
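<h2>Modules -- htmllib -- idiomatic usage</h2>
<pre>
import htmllib, formatter
h = htmllib.HTMLParser(formatter.NullFormatter())  # NullFormatter discards rendered output
h.feed(html)   # html: the page source as a string, assumed already fetched
h.close()      # h.anchorlist now holds every link target found
</pre>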
<em>Xander (Family, season 5) -- I'm helping, I'm reading, I'm quiet.</em>
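<h2>Modules -- htmllib -- idiomatic usage (cont'd)</h2>
<pre>
import htmllib, formatter

class IMGFinder(htmllib.HTMLParser):
    def __init__(self, *args, **kw):
        htmllib.HTMLParser.__init__(self, *args, **kw)
        self.ims = []   # image sources seen so far
    # the parser calls this for every IMG tag
    def handle_image(self, src, *args): self.ims.append(src)

h = IMGFinder(formatter.NullFormatter())
h.feed(html)   # html: the page source as a string, assumed already fetched
h.close()      # h.ims now lists the image URLs
</pre>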
<em>Donny (Family, season 5) -- Look what I found!</em>
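<p>A minimal sketch of the same image-collecting idea for Python 3, where htmllib was removed and html.parser takes its place. The class name ImageFinder and the SAMPLE markup are assumptions made for illustration; they do not come from the slides.</p>

```python
# Python 3 sketch: html.parser stands in for the removed htmllib module.
from html.parser import HTMLParser


class ImageFinder(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    # Self-closing tags go through handle_startendtag, which by default
    # calls handle_starttag, so <img ... /> is covered too.


# Hypothetical page source, used only for illustration.
SAMPLE = '<html><body><img src="/a.gif" alt="a"><p>text</p><IMG SRC="b.png"/></body></html>'

finder = ImageFinder()
finder.feed(SAMPLE)
finder.close()
print(finder.images)
```

<p>Unlike htmllib, html.parser has no formatter argument and no handle_image hook, so the filtering happens in handle_starttag.</p>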
<h2>Modules -- htmllib -- base</h2><ul>
<li>Some sites use a 'base' element to change how relative links resolve</li>
<li>For example, Zope does</li>
<li>In the examples above, 'h.base' holds the base</li>
</ul>
<em>Dawn (Family, season 5) -- This is the source of my gladness.</em>
<h2>Modules -- htmllib -- base (example)</h2><ul>
<li>If the page at http://example.com/foo/bar.html has a link to '../baz.html'<ul><li>It means http://example.com/baz.html</li>
</ul></li>
<li>If the original page has base='/foo/quux/' (note the trailing slash)<ul><li>It means http://example.com/foo/baz.html</li>
</ul></li>
</ul>
<em>Riley (Family, season 5) -- Every time I think I'm getting close to you...</em>
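<p>The resolution rules for 'base' can be checked with urljoin. This is a sketch assuming Python 3, where urlparse.urljoin lives in urllib.parse; the helper name resolve is hypothetical. It also shows that a trailing slash on the base changes the result.</p>

```python
from urllib.parse import urljoin  # Python 3 home of urlparse.urljoin

PAGE = "http://example.com/foo/bar.html"


def resolve(page_url, link, base=None):
    """Resolve a link against the page URL, honouring an optional base."""
    # The base may itself be relative, so resolve it against the page first.
    root = urljoin(page_url, base) if base else page_url
    return urljoin(root, link)


no_base = resolve(PAGE, "../baz.html")
with_dir_base = resolve(PAGE, "../baz.html", base="/foo/quux/")
with_file_base = resolve(PAGE, "../baz.html", base="/foo/quux")

print(no_base)
print(with_dir_base)
print(with_file_base)
```

<p>Without the trailing slash, 'quux' is treated as a file name and dropped before '..' is applied, so the link resolves one level higher.</p>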
<h2>Modules -- urllib/urllib2</h2><ul>
<li>High-level interface</li>
<li>Treat URLs as file-like objects</li>
<li>...but still allow low-level operations</li>
<li>The basic interfaces of the two modules are largely compatible</li>
</ul>
<em>Glory (Family, season 5) -- I am great and I am beautiful.</em>
<h2>Modules -- urllib/urllib2 (cont'd)</h2><ul>
<li>urllib2 can also work through an object interface</li>
<li>More flexible</li>
<li>At that level the interface is no longer compatible with urllib</li>
<li>urllib2 is usually the better choice</li>
</ul>
<em>Joyce (Ted, season 2) -- He redid my entire system.</em>
<h2>Modules -- urllib/urllib2 (examples)</h2><ul>
<li>urllib.urlopen("http://www.yahoo.com/").read() -> contents</li>
<li>urllib.urlopen("http://www.yahoo.com/").info() -> headers</li>
<li>The same calls work with urllib2</li>
<li>Both automatically use environment variables for proxies</li>
<li>urllib2 supports proxies with authentication</li>
</ul>
<em>Xander (Ted, season 2) -- Yum-my!</em>
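<p>The environment-variable lookup can be observed without touching the network. This sketch assumes Python 3, where urllib2 became urllib.request; the proxy address is a made-up value set only so that there is something to find.</p>

```python
import os
import urllib.request

# Hypothetical proxy address, set here only for the demonstration.
os.environ["http_proxy"] = "http://proxy.example.com:3128"

# getproxies() collects the *_proxy environment variables into a dict.
proxies = urllib.request.getproxies()
print(proxies.get("http"))

# ProxyHandler falls back to getproxies() when given no mapping; passing
# the dict explicitly makes the choice visible. Nothing is fetched here.
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
```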
<h2>Digression -- HTTP Overview</h2><ul>
<li>Request/Response</li>
<li>A request is a command followed by headers followed by a body</li>
<li>A response is a status code followed by headers followed by a body</li>
<li>No welcome message</li>
</ul>
<em>Tara (Family, season 5) -- ...in terms of the karmic cycle.</em>
<h2>Example HTTP Sessions</h2><ul>
<li>Client</li></ul>
<pre>
GET /foo/bar.html HTTP/1.0
Host: www.example.org

</pre>
<ul><li>Server</li></ul>
<pre>
HTTP/1.0 200 OK
Content-Type: text/html

&lt;html&gt;&lt;body&gt;lalalala&lt;/body&gt;&lt;/html&gt;
</pre>
<em>Giles (Family, season 5) -- And you are talking about what on earth?</em>
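<p>The message layout described above (command or status line, then headers, then a blank line, then the body) can be sketched as two small functions. The names build_request and parse_response are hypothetical, and the "HTTP/1.0 200 OK" status line in the sample is an assumed value; the code is Python 3 and handles only the simple case shown.</p>

```python
def build_request(method, path, headers, body=""):
    """Return request text: command line, headers, blank line, body."""
    lines = ["%s %s HTTP/1.0" % (method, path)]
    lines += ["%s: %s" % (name, value) for name, value in headers]
    return "\r\n".join(lines) + "\r\n\r\n" + body


def parse_response(raw):
    """Split a raw response into (status code, reason, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


request = build_request("GET", "/foo/bar.html", [("Host", "www.example.org")])

# Assumed server reply, mirroring the session on the slide.
RAW = ("HTTP/1.0 200 OK\r\n"
       "Content-Type: text/html\r\n"
       "\r\n"
       "<html><body>lalalala</body></html>")
code, reason, headers, body = parse_response(RAW)
print(code, reason, headers, body)
```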
<h2>Modules -- httplib</h2><ul>
<li>Low-level interface to innards of HTTP</li>
<li>Absolute control</li>
<li>No abstractions</li>
</ul>
<em>Mr. MacLay (Family, season 5) -- We know how to control her...problem.</em>
<h2>Modules -- httplib -- example</h2><ul>
<li>Note: usually, the Host header is important<ul><li>Virtual hosting</li>
</ul></li>
</ul>
<pre>
&gt;&gt;&gt; import httplib
&gt;&gt;&gt; h=httplib.HTTP("moshez.org")
&gt;&gt;&gt; h.putrequest('GET', '/')
&gt;&gt;&gt; h.putheader('Host', 'moshez.org')
&gt;&gt;&gt; h.endheaders()
&gt;&gt;&gt; h.getreply()
(200, 'OK', &lt;mimetools.Message instance at 0x81220dc&gt;)
&gt;&gt;&gt; h.getfile().read(10)
"&lt;HTML&gt;\n&lt;HE"
</pre>
<em>Anya (Family, season 5) -- ...and it was fun!</em>
<h2>Modules -- urlparse</h2><ul>
<li>urlparse.urljoin -- like os.path.join for URLs</li>
<li>For path manipulation<ul><li>urlparse.urlsplit</li>
<li>urlparse.urlunsplit</li>
</ul></li>
</ul>
<em>Buffy (Family, season 5) -- You know what, you guys, just leave it here.</em>
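<p>A short sketch of the three functions named above, assuming Python 3, where the urlparse module became urllib.parse. The URLs are illustrative only.</p>

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE = "http://example.com/foo/bar.html?x=1#top"

# urlsplit returns a named tuple: scheme, netloc, path, query, fragment.
parts = urlsplit(BASE)

# Swap the path, drop query and fragment, then reassemble the URL.
sibling = urlunsplit(parts._replace(path="/foo/baz.html", query="", fragment=""))

# urljoin behaves like os.path.join: a relative link is resolved against
# the base, while an absolute path replaces the base path entirely.
relative = urljoin(BASE, "baz.html")
absolute = urljoin(BASE, "/abs.html")

print(parts.path, sibling, relative, absolute)
```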
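<h2>Downloading Dilbert</h2>
<pre>
import re, urllib, urllib2, urlparse
URL = 'http://www.dilbert.com/'
f = urllib2.urlopen(URL)
value = f.read()
href = re.compile('&lt;a href="(/comics/.*?/dilbert.*?gif)"&gt;')
m = href.search(value)   # first link to a comic image
# urlretrieve lives in urllib, not urllib2; the filename is an assumption
f = urllib.urlretrieve(urlparse.urljoin(URL, m.group(1)),
                       'dilbert.gif')
</pre>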
<em>Tara (Family, season 5) -- That was funny if you [...] are a complete dork.</em>
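<p>The link-extraction step can be tried offline against a sample string. This is a sketch in Python 3; the SAMPLE markup and the function name find_strip are assumptions, since the slides do not show the real page. It also guards against the case where the pattern does not match.</p>

```python
import re
from urllib.parse import urljoin

URL = "http://www.dilbert.com/"
HREF = re.compile(r'<a href="(/comics/.*?/dilbert.*?gif)">')

# Hypothetical markup, standing in for the real front page.
SAMPLE = '<p><a href="/comics/dilbert/archive/images/dilbert2001.gif">Today</a></p>'


def find_strip(page):
    """Return the absolute URL of the first matching image link, or None."""
    m = HREF.search(page)
    return urljoin(URL, m.group(1)) if m else None


found = find_strip(SAMPLE)
missing = find_strip("<p>no comic here</p>")
print(found, missing)
```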
<h2>Downloading Dark Angel Transcripts</h2><ul>
<li>Common situation of mass download</li></ul>
<pre>
import re, urllib, urllib2, urlparse, htmllib, formatter, posixpath
URL = "http://www.darkangelfan.com/episode/"
LINK_RE = re.compile(r'/trans_[0-9]+\.shtml$')
s = urllib2.urlopen(URL).read()
h = htmllib.HTMLParser(formatter.NullFormatter())
h.feed(s)
h.close()
# keep only the transcript links, made absolute
links = [urlparse.urljoin(URL, link)
         for link in h.anchorlist if LINK_RE.search(link)]
### -- really download --
for link in links:
    urllib.urlretrieve(link, posixpath.basename(link))
</pre>
<em>Intern (Family, season 5) -- Yeah. That makes like five this month.</em>
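<p>The filtering step can be sketched offline in Python 3. html.parser has no anchorlist attribute, so the hypothetical AnchorCollector class below rebuilds one; the SAMPLE index page is likewise an assumption.</p>

```python
import posixpath
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

URL = "http://www.darkangelfan.com/episode/"
LINK_RE = re.compile(r"/trans_[0-9]+\.shtml$")


class AnchorCollector(HTMLParser):
    """Stand-in for htmllib's anchorlist: gather every <a href=...> value."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.anchorlist.append(href)


# Hypothetical index page with two transcripts and one unrelated link.
SAMPLE = ('<a href="/episode/trans_101.shtml">1</a>'
          '<a href="/episode/trans_102.shtml">2</a>'
          '<a href="/episode/index.shtml">index</a>')

collector = AnchorCollector()
collector.feed(SAMPLE)
collector.close()

# Keep transcript links only, make them absolute, and pick local file names.
links = [urljoin(URL, link) for link in collector.anchorlist if LINK_RE.search(link)]
targets = [(link, posixpath.basename(link)) for link in links]
print(targets)
```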
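<h2>Downloading Dark Angel Transcripts (select)</h2>
<pre>
class Downloader:
    def __init__(self, fin, fout):
        self.fin, self.fout, self.fileno = fin, fout, fin.fileno  # fileno lets select() wait on us
    def read(self):
        buf = self.fin.read(4096)
        if buf: return self.fout.write(buf)  # write() returns None: not finished yet
        for f in [self.fout, self.fin]: f.close()
        return 1  # end of stream: this download is done
</pre>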
<em>Joyce (Ted, season 2) -- I've been looking for the right moment.</em>
<h2>Downloading Dark Angel Transcripts (select, cont'd)</h2><ul>
<li>Same code up to 'really download'</li></ul>
<pre>
import select
downloaders = [Downloader(urllib2.urlopen(link),
                          open(posixpath.basename(link), 'wb'))
               for link in links]
while downloaders:
    # select() takes read, write and exception lists; only reads matter here
    toRead, _, _ = select.select(downloaders, [], [])
    for downloader in toRead:
        if downloader.read():
            downloaders.remove(downloader)
</pre>
<em>Buffy (Family, season 5) -- Tara's damn birthday is just one too many things for me to worry about.</em>
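<p>The select loop can be exercised without a network by letting local socket pairs play the role of HTTP responses. This is a Python 3 sketch; the Reader class and the drain function are hypothetical names, and the payloads are made up.</p>

```python
import select
import socket


class Reader:
    """Drain one socket into a list of chunks, mirroring the Downloader idea."""

    def __init__(self, sock):
        self.sock, self.chunks = sock, []
        self.fileno = sock.fileno  # lets select() accept this object directly

    def read(self):
        buf = self.sock.recv(4096)
        if not buf:  # empty read: the peer closed the connection
            self.sock.close()
            return True
        self.chunks.append(buf)
        return False


def drain(readers):
    """Service whichever reader has data until all of them are finished."""
    pending = list(readers)
    while pending:
        ready, _, _ = select.select(pending, [], [])
        for reader in ready:
            if reader.read():
                pending.remove(reader)


# Two local socket pairs stand in for two downloads.
readers = []
for payload in (b"first transcript", b"second transcript"):
    ours, theirs = socket.socketpair()
    theirs.sendall(payload)
    theirs.close()
    readers.append(Reader(ours))

drain(readers)
results = [b"".join(r.chunks) for r in readers]
print(results)
```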
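<h2>Downloading Dark Angel Transcripts (threads)</h2><ul>
<li>Bare bones example</li></ul>
<pre>
from threading import Thread
for link in links:
    # one thread per download; urlretrieve lives in urllib, not urllib2
    Thread(target=urllib.urlretrieve,
           args=(link, posixpath.basename(link))).start()
</pre>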
<em>Buffy (Ted, season 2) -- Sounds like fun.</em>
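<p>A slightly fuller sketch of the thread-per-download pattern in Python 3, adding a join so the program knows when everything has finished. The fetch function is a hypothetical stand-in that records what it would download instead of touching the network, and the links are made up.</p>

```python
import threading


def fetch(link, results):
    # Hypothetical stand-in for urlretrieve(): record the target file name.
    results[link] = link.rsplit("/", 1)[-1]


links = [
    "http://www.example.com/episode/trans_101.shtml",
    "http://www.example.com/episode/trans_102.shtml",
]

results = {}
threads = [threading.Thread(target=fetch, args=(link, results)) for link in links]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every download has completed

print(results)
```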
<h2>Digression - twisted.web.client</h2><ul>
<li>Part of the Twisted networking framework</li>
<li>High-level interface to an HTTP client</li>
<li>Completely asynchronous</li>
<li>Reports results via callbacks</li>
<li>client.getPage("http://www.yahoo.com").addCallbacks(gotResult, gotError)</li>
</ul>
<em>Buffy (Ted, season 2) -- You're supposed to use your powers for good!</em>
<h2>Downloading Dark Angel Transcripts (web.client)</h2>
<pre>
from twisted.web import client
from twisted.internet import reactor, defer

# start all downloads, then stop the reactor once every one has finished
defer.DeferredList(
    [client.downloadPage(link, posixpath.basename(link))
     for link in links]).addBoth(lambda _: reactor.stop())
reactor.run()
</pre>
<em>Ted (Ted, season 2) -- You don't have to worry about anything.</em>
<h2>HTTP Authentication</h2><ul>
<li>Client attempts to connect</li>
<li>Server sends back a 401 (please authenticate)</li>
<li>Client sends the same request again -- with auth tokens</li>
<li>Only HTTP Basic authentication is widely supported</li>
<li>Client can send the auth tokens automatically on later requests</li>
</ul>
<em>Buffy (Ted, season 2) -- Ummm... Who are these people?</em>
<h2>HTTP Authentication - manually</h2><ul>
<li>In HTTP, authentication is a header</li>
<li>Basic authentication sends the username and password, base64-encoded</li>
</ul>
<pre>
import httplib, base64
# user and password are assumed to be defined elsewhere
h = httplib.HTTP("localhost")
h.putrequest('GET', '/protected/stuff.html')
h.putheader('Authorization',
            'Basic ' + base64.encodestring(user + ":" + password).strip())
h.endheaders()
h.getreply()
print h.getfile().read()
</pre>
<em>Tara (Family, season 5) -- And, uh, these are my-my friends.</em>
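<p>The header value itself can be built and checked without a server. This is a Python 3 sketch with hypothetical function names; the sample credentials are the ones used as the example in the HTTP authentication RFC.</p>

```python
import base64


def basic_auth_header(user, password):
    """Return the value of an Authorization header for Basic authentication."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def parse_basic_auth(value):
    """Recover (user, password) from a Basic Authorization header value."""
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic authorization header")
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password


header = basic_auth_header("Aladdin", "open sesame")
print(header)
print(parse_basic_auth(header))
```

<p>Because base64 is an encoding and not encryption, anyone who sees the header can recover the password, as parse_basic_auth shows.</p>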
<h2>HTTP Authentication - urllib2</h2><ul>
<li>Can read username/password from URL</li>
<li>urllib2.urlopen("http://moshez:s3krit@example.com"
"/protected/stuff.html")</li>
</ul>
<em>Xander (Ted, season 2) -- I am really jinxing the hell out of us.</em>
<h2>Further Reading</h2><ul>
<li>htmllib docs <a href="http://www.python.org/doc/current/lib/module-htmllib.html">http://www.python.org/doc/current/lib/module-htmllib.html</a></li>
<li>sgmllib docs <a href="http://www.python.org/doc/current/lib/module-sgmllib.html">http://www.python.org/doc/current/lib/module-sgmllib.html</a></li>
<li>urllib docs <a href="http://www.python.org/doc/current/lib/module-urllib.html">http://www.python.org/doc/current/lib/module-urllib.html</a></li>
<li>urllib2 docs <a href="http://www.python.org/doc/current/lib/module-urllib2.html">http://www.python.org/doc/current/lib/module-urllib2.html</a></li>
<li>httplib docs <a href="http://www.python.org/doc/current/lib/module-httplib.html">http://www.python.org/doc/current/lib/module-httplib.html</a></li>
<li>re docs <a href="http://www.python.org/doc/current/lib/module-re.html">http://www.python.org/doc/current/lib/module-re.html</a></li>
<li>HTTP RFC <a href="http://www.w3.org/Protocols/rfc2616/rfc2616.html">http://www.w3.org/Protocols/rfc2616/rfc2616.html</a></li>
<li>W3C HTML Page <a href="http://www.w3.org/MarkUp/">http://www.w3.org/MarkUp/</a></li>
<li>Twisted <a href="http://twistedmatrix.com">http://twistedmatrix.com</a></li>
</ul>
<em>Willow (Ted, season 2) -- 'Book-cracker Buffy', it's kind of her nickname.</em>
<em>Buffy (Family, season 5) -- I let you come, now sit down and look studious.</em>
<h2>Bonus Slides</h2>
<em>Tara (Family, season 5) -- You always make me feel special.</em>
<h2>Cookies</h2><ul>
<li>Carry state from one page to another</li>
<li>Server sends a header: Set-Cookie</li>
<li>Client sends a header on later requests: Cookie</li>
</ul>
<em>Ted (Ted, season 2) -- Who's up for dessert? I made chocolate-chip cookies!</em>
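<p>The exchange of Set-Cookie and Cookie headers can be sketched with the standard cookie parser. This assumes Python 3 (http.cookies); the header value is a made-up example of what a server might send.</p>

```python
from http.cookies import SimpleCookie

# Hypothetical value of a Set-Cookie header received from a server.
SET_COOKIE = "session=abc123; Path=/"

jar = SimpleCookie()
jar.load(SET_COOKIE)

# Attributes such as Path stay with the client; only name=value pairs
# are sent back in the Cookie header of later requests.
cookie_header = "; ".join(
    "%s=%s" % (name, morsel.value) for name, morsel in jar.items()
)
print(cookie_header)
```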
<h2>urllib2 cookies</h2><ul>
<li>Unfortunately, no automatic cookie jar support</li>
<li>Can manually use .info() to read cookies...</li>
<li>...and the Request() API to send them to the server</li>
</ul>
<em>Joyce (Ted, season 2) -- Mm! Buffy, you've got to try one of these!</em>
<h2>Logging Into Advogato</h2>
<pre>
import urllib, urllib2
# log in; the server answers with a Set-Cookie header
u = urllib2.urlopen("http://advogato.org/acct/loginsub.html",
                    urllib.urlencode({'u': 'moshez',
                                      'pass': 'not my real pass'}))
cookie = u.info()['set-cookie']
cookie = cookie[:cookie.find(';')]   # keep only the name=value part
# post a diary entry, sending the cookie back
r = urllib2.Request('http://advogato.org/diary/post.html',
                    urllib.urlencode(
                        {'entry': open('entry').read(), 'post': 'Post'}),
                    {'Cookie': cookie})
urllib2.urlopen(r).read()
</pre>
<em>Anya (Family, season 5) -- I have a place in the world now.</em>
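<p>The request that carries the cookie back can be built and inspected without sending it. This is a Python 3 sketch (urllib.request and urllib.parse replace urllib2 and urllib); the Set-Cookie value and the diary text are assumptions.</p>

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical Set-Cookie value, as u.info()['set-cookie'] might return it.
set_cookie = "id=moshez:token; path=/"

# split() keeps the whole string when no ';' is present, whereas slicing
# with find() would silently drop the last character in that case.
cookie = set_cookie.split(";", 1)[0]

data = urlencode({"entry": "Hello, diary", "post": "Post"}).encode("ascii")
req = Request(
    "http://advogato.org/diary/post.html",
    data=data,
    headers={"Cookie": cookie},
)

# Supplying data turns the request into a POST; nothing is sent here.
print(req.get_method(), req.get_header("Cookie"), req.data)
```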
<h2>On Being Nice - Robots</h2><ul>
<li>Some sites don't want automatic crawlers</li>
<li>It is up to you whether to play nice</li>
<li>But you should know the rules before you break them</li>
<li>Robots file -- at /robots.txt</li>
</ul>
<em>Willow (Ted, season 2) -- There were design features in that robot that pre-date...</em>
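<h2>Using robotparser</h2>
<pre>
import robotparser
rp = robotparser.RobotFileParser()
rp.set_url('http://www.example.com/robots.txt')
rp.read()   # fetch and parse the robots file
if not rp.can_fetch('', 'http://www.example.com/'):
    raise SystemExit('disallowed by robots.txt')  # assumed action; the original body is missing
</pre>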
<em>Buffy (Ted, season 2) -- Tell me you didn't keep any parts.</em>
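<p>The parser can also be fed rules directly, which makes it easy to see how can_fetch decides. This is a sketch assuming Python 3, where robotparser became urllib.robotparser; the robots.txt content is invented for the example.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse() takes a list of lines, no fetch needed

allowed = rp.can_fetch("*", "http://www.example.com/")
blocked = rp.can_fetch("*", "http://www.example.com/private/notes.html")
print(allowed, blocked)
```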
<h2>webchecker</h2><ul>
<li>In the source distribution, in Tools/</li>
<li>Understands robots.txt</li>
<li>Can override which links get followed</li>
</ul>
<em>Willow (Ted, season 2) -- What do you mean, check him out?</em>
<h2>websucker</h2><ul>
<li>In the source distribution, in Tools/</li>
<li>Uses webchecker as a module</li>
<li>Saves the pages it downloads</li>
</ul>
<em>Buffy (Ted, season 2) -- Find out his secrets, hack into his life.</em>
</body></html>