
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>ClientCookie documentation</title>
</head>
<body>

@# This file is processed by EmPy to colorize Python source code
@# http://wwwsearch.sf.net/bits/colorize.py
@{from colorize import colorize}


<h1>ClientCookie</h1>

<p><em><strong>Note: this page describes the stable 0.4.x version. See <a
href="./src/doc-0_9.html">here</a> for the 0.9.x development version.
</strong></em>

<h2>Examples</h2>

@{colorize(r"""
import ClientCookie
response = ClientCookie.urlopen("http://foo.bar.com/")
""")}

<p>This function behaves identically to <code>urllib2.urlopen()</code>, except
that it deals with cookies automatically. That's probably all you need to
know.

<p>Here is a more complicated example, involving <code>Request</code> objects
(useful if you want to pass <code>Request</code>s around, add headers to them,
etc.):

@{colorize(r"""
import ClientCookie
import urllib2
request = urllib2.Request("http://www.acme.com/")
# note we're using the urlopen from ClientCookie, not urllib2
response = ClientCookie.urlopen(request)
# let's say this next request requires a cookie that was set in response
request2 = urllib2.Request("http://www.acme.com/flying_machines.html")
response2 = ClientCookie.urlopen(request2)

print response2.geturl()
print response2.info()  # headers
print response2.read()  # body (readline and readlines work too)
""")}

<p>In these examples, the workings are hidden inside the
<code>ClientCookie.urlopen()</code> function, which is an extension of
<code>urllib2.urlopen()</code>. Redirects, proxies and cookies are handled
automatically by this function. Cookie processing (etc.) is handled by
processor objects, which are an extension of <code>urllib2</code>'s handlers:
<code>HTTPCookieProcessor</code>, <code>HTTPRefererProcessor</code>,
<code>SeekableProcessor</code> etc. They are used like any other handler.
Processor-aware versions of <code>HTTPHandler</code> and
<code>HTTPSHandler</code> (if your Python installation has HTTPS support) are
also included, along with a bugfixed <code>HTTPRedirectHandler</code> (the
bug, related to redirection, is fixed in Python 2.3).

<p>An example at a slightly lower level shows more clearly how the module
processes cookies:

@{colorize(r"""
# Don't copy this blindly!  You probably want to follow the examples
# above, not this one.
import ClientCookie

# Build an opener that *doesn't* automatically call .add_cookie_header()
# and .extract_cookies(), so we can do it manually without interference.
class NullCookieProcessor(ClientCookie.HTTPCookieProcessor):
    def http_request(self, request): return request
    def http_response(self, request, response): return response
opener = ClientCookie.build_opener(NullCookieProcessor)

request = ClientCookie.Request("http://www.acme.com/")
response = opener.open(request)
cj = ClientCookie.CookieJar()
cj.extract_cookies(response, request)
# let's say this next request requires a cookie that was set in response
request2 = ClientCookie.Request("http://www.acme.com/flying_machines.html")
cj.add_cookie_header(request2)
response2 = opener.open(request2)
""")}

<p>The <code>CookieJar</code> class does all the work. There are essentially
two operations: <code>.extract_cookies()</code> extracts HTTP cookies from the
<code>Set-Cookie</code> (the original <a
href="http://www.netscape.com/newsref/std/cookie_spec.html">Netscape cookie
standard</a>) and <code>Set-Cookie2</code> (<a
href="http://www.ietf.org/rfc/rfc2965.txt">RFC 2965</a>) headers of a
response if and only if they should be set given the request, and
<code>.add_cookie_header()</code> adds <code>Cookie</code> headers if and only
if they are appropriate for a particular HTTP request. Incoming cookies are
checked for acceptability based on the host name, etc. Cookies are only set on
outgoing requests if they match the request's host name, path, etc.

<p><strong>Note that if you're using <code>ClientCookie.urlopen()</code> (or if
you're using <code>ClientCookie.HTTPCookieProcessor</code> by some other
means), you don't need to call <code>.extract_cookies()</code> or
<code>.add_cookie_header()</code> yourself</strong>. If, on the other hand,
you don't want to use <code>urllib2</code>, you will need to use this pair of
methods. You can make your own <code>request</code> and <code>response</code>
objects, which must support the interfaces described in the docstrings of
<code>.extract_cookies()</code> and <code>.add_cookie_header()</code>.
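
<p>As a rough illustration of calling this pair of methods by hand, without any
network access, here is a sketch. It is a stand-in rather than ClientCookie
code: it assumes Python 3's standard-library <code>http.cookiejar</code> and
<code>urllib.request</code>, whose <code>CookieJar</code> offers methods of the
same names, and <code>FakeResponse</code> is a hypothetical class invented here
that supplies only the <code>.info()</code> method the jar is assumed to need.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical stand-in for a response: only .info() is provided."""

    def __init__(self, headers):
        self._message = email.message.Message()
        for name, value in headers:
            self._message[name] = value  # repeated names are kept, not replaced

    def info(self):
        return self._message


jar = http.cookiejar.CookieJar()

# Pretend the server answered the first request with a Set-Cookie header.
request = urllib.request.Request("http://www.acme.com/")
response = FakeResponse([("Set-Cookie", "session=abc123; Path=/")])
jar.extract_cookies(response, request)

# A later request to the same host gets the cookie back...
request2 = urllib.request.Request("http://www.acme.com/flying_machines.html")
jar.add_cookie_header(request2)
same_host_cookie = request2.get_header("Cookie")

# ...but a request to an unrelated host does not.
request3 = urllib.request.Request("http://www.example.org/")
jar.add_cookie_header(request3)
other_host_cookie = request3.get_header("Cookie")

print(same_host_cookie, other_host_cookie)
```

<p>Any object with a compatible <code>.info()</code> method should do in place
of <code>FakeResponse</code>; the request side here simply reuses
<code>urllib.request.Request</code> for brevity.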

<p>Cookies may be saved to and loaded from a file. The subclass
<code>MozillaCookieJar</code> differs from <code>CookieJar</code> only in
storing cookies using a different, Mozilla/Netscape/lynx-compatible, file
format. This Mozilla-compatible (<code>'cookies.txt'</code>) format loses some
information when you save cookies to a file. The subclass
<code>MSIECookieJar</code> can load (but not yet save) cookies from Microsoft
Internet Explorer's cookie files (on Windows).

<h2>Important note</h2>

<p>Only use names you can import directly from the <code>ClientCookie</code>
package, and that don't start with a single underscore. Everything else is
subject to change or disappearance without notice.

<h2>Cooperating with Mozilla/Netscape, lynx and Internet Explorer</h2>

<p>The subclass <code>MozillaCookieJar</code> differs from
<code>CookieJar</code> only in storing cookies using a different,
Mozilla/Netscape-compatible, file format. The lynx browser also uses this
format. This file format can't store RFC 2965 cookies, so they are downgraded
to Netscape cookies on saving. <code>CookieJar</code> itself uses a
libwww-perl specific format (<code>Set-Cookie3</code>). Python and your browser
should be able to share a cookies file.

<p><strong>WARNING:</strong> you may want to back up your browser's cookies
file if you use <code>MozillaCookieJar</code> to save cookies. I
<em>think</em> it works, but there have been bugs in the past!

<p>For example (note that the file location here will differ on non-Unix
OSes):

@{colorize(r"""
import os, ClientCookie
cookies = ClientCookie.MozillaCookieJar()
# no leading slash on the second argument: os.path.join() would otherwise
# discard the home directory
cookies.load(os.path.join(os.environ["HOME"], ".netscape/cookies.txt"))
# see also the save and revert methods
""")}

<p>Note that cookies saved while Mozilla is running will get clobbered by
Mozilla - see <code>MozillaCookieJar.__doc__</code>.

<p><code>MSIECookieJar</code> does the same for Microsoft Internet Explorer
(MSIE) 5.x and 6.x on Windows, but does not allow saving cookies in this
format. In future, the Windows API calls might be used to load and save
(though the index has to be read directly, since there is no API for that,
AFAIK).

@{colorize(r"""
import ClientCookie
c = ClientCookie.MSIECookieJar(delayload=True)
c.load_from_registry()  # finds cookie index file from registry
""")}

<p>A true <code>delayload</code> argument speeds things up.

<p>On Windows 9x (Windows 95, 98 and ME), you need to supply a username to the
<code>.load_from_registry()</code> method:

@{colorize(r"""
c.load_from_registry(username="jbloggs")
""")}

<p>Konqueror/Safari and Opera use different file formats, which aren't yet
supported.

<h2>Using your own CookieJar instance</h2>

<p>You might want to do this to <a href="./doc.html#browsers">use your
browser's cookies</a>, to customize <code>CookieJar</code>'s behaviour by
passing constructor arguments, or to be able to get at the cookies it will hold
(for example, for saving cookies between sessions and for debugging).

<p>If you're using the higher-level <code>urllib2</code>-like interface
(<code>urlopen()</code>, etc.), you'll have to let it know what
<code>CookieJar</code> it should use:

@{colorize(r"""
import ClientCookie
cookies = ClientCookie.CookieJar()
# build_opener() adds standard handlers and processors (such as HTTPHandler
# and HTTPCookieProcessor) by default.  The cookie processor we supply
# will replace the default one.
opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cookies))

r = opener.open("http://acme.com/")  # GET
data = "spam=eggs"  # illustrative URL-encoded form data
r = opener.open("http://acme.com/", data)  # POST
""")}

<p>The <code>urlopen()</code> function uses a global <code>OpenerDirector</code>
instance to do its work, so if you want to use <code>urlopen()</code> with your
own <code>CookieJar</code>, install the <code>OpenerDirector</code> you built
with <code>build_opener()</code> using the
<code>ClientCookie.install_opener()</code> function, then proceed as usual:

@{colorize(r"""
ClientCookie.install_opener(opener)
r = ClientCookie.urlopen("http://www.acme.com/")
""")}

<p>Of course, everyone using <code>urlopen()</code> is using the same global
<code>CookieJar</code> instance!

<p>You can set a policy object (which must satisfy the interface defined by
<code>ClientCookie.CookiePolicy</code>) that determines which cookies
are allowed to be set and returned. Use the <code>policy</code> argument to the
<code>CookieJar</code> constructor, or just set the <code>policy</code>
attribute directly. The default implementation has some useful switches:

@{colorize(r"""
from ClientCookie import CookieJar, DefaultCookiePolicy as Policy
cookies = CookieJar()
# turn off RFC 2965 cookies, be more strict about domains when setting and
# returning Netscape cookies, and block some domains from setting cookies
# or having them returned (read the DefaultCookiePolicy docstring for the
# domain matching rules here)
policy = Policy(rfc2965=False, strict_ns_domain=Policy.DomainStrict,
                blocked_domains=["ads.net", ".ads.net"])
cookies.policy = policy
""")}
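
<p>To see what the blocked-domain switch does without touching the network,
here is a small sketch. It assumes Python 3's standard-library
<code>http.cookiejar.DefaultCookiePolicy</code> as a stand-in for the
ClientCookie class of the same name; the <code>rfc2965</code>,
<code>strict_ns_domain</code> and <code>blocked_domains</code> arguments come
from the example above, while the <code>is_blocked()</code> check and the host
names being tested are illustrative choices.

```python
from http.cookiejar import CookieJar, DefaultCookiePolicy

policy = DefaultCookiePolicy(
    rfc2965=False,
    strict_ns_domain=DefaultCookiePolicy.DomainStrict,
    blocked_domains=["ads.net", ".ads.net"],
)
cookies = CookieJar(policy=policy)

# "ads.net" blocks that exact host; ".ads.net" blocks its subdomains.
blocked = {host: policy.is_blocked(host)
           for host in ("ads.net", "www.ads.net", "www.acme.com")}
print(blocked)
```

<p>Listing both forms, as the original example does, covers the bare domain
and everything beneath it.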

<h2>Optional goodies: HTTP-EQUIV, Refresh, Referer and seekable responses</h2>

<p>These are implemented as processor classes. Processors are an extension of
<code>urllib2</code>'s handlers: you just pass them to
<code>build_opener()</code> (example code below).

<dl>

<dt><code>HTTPRobotRulesProcessor</code>

<dd><p>WWW Robots (also called wanderers or spiders) are programs that traverse
many pages in the World Wide Web by recursively retrieving linked pages. This
kind of program can place significant loads on web servers, so there is a <a
href="http://www.robotstxt.org/wc/norobots.html">standard</a> for a
<code>robots.txt</code> file by which web site operators can request robots to
keep out of their site, or out of particular areas of it. This processor uses
the standard Python library's <code>robotparser</code> module. It raises
<code>ClientCookie.RobotExclusionError</code> (subclass of
<code>urllib2.HTTPError</code>) if an attempt is made to open a URL prohibited
by <code>robots.txt</code>. XXX At the moment, this makes use of code in the
<code>robotparser</code> module that uses <code>urllib</code> - this will
likely change in future to use <code>urllib2</code>.

<dt><code>HTTPEquivProcessor</code>

<dd><p>The <code>&lt;META HTTP-EQUIV&gt;</code> tag is a way of including data
in HTML to be treated as if it were part of the HTTP headers. ClientCookie can
automatically read these tags and add the <code>HTTP-EQUIV</code> headers to
the response object's real HTTP headers. The HTML is left unchanged.

<dt><code>HTTPRefreshProcessor</code>

<dd><p>The <code>Refresh</code> HTTP header is a non-standard header which is
widely used. It requests that the user-agent follow a URL after a specified
time delay. ClientCookie can treat these headers (which may have been set in
<code>&lt;META HTTP-EQUIV&gt;</code> tags) as if they were 302 redirections.
Exactly when and how <code>Refresh</code> headers are handled is configurable
using the constructor arguments.

<dt><code>SeekableProcessor</code>

<dd><p>This makes ClientCookie's response objects <code>seek()</code>able.
Seeking is done lazily (i.e. the response object only reads from the socket as
necessary, rather than slurping in all the data before the response is returned
to you). XXX I think this only works for HTTP at the moment.

<dt><code>HTTPRefererProcessor</code>

<dd><p>The <code>Referer</code> HTTP header lets the server know which URL
you've just visited. Some servers use this header as state information, and
don't like it if it is not present. It's a chore to add this header by hand
every time you make a request. This processor adds it automatically.
<strong>NOTE</strong>: this only makes sense if you use each processor for a
single chain of HTTP requests (so, for example, if you use a single
<code>HTTPRefererProcessor</code> to fetch a series of URLs extracted from a
single page, <strong>this will break</strong>). The <a
href="../mechanize/">mechanize</a> package does this properly.

@{colorize(r"""
import ClientCookie
cookies = ClientCookie.CookieJar()

opener = ClientCookie.build_opener(ClientCookie.HTTPRefererProcessor,
                                   ClientCookie.HTTPEquivProcessor,
                                   ClientCookie.HTTPRefreshProcessor,
                                   ClientCookie.SeekableProcessor)
opener.open("http://www.rhubarb.com/")
""")}

</dl>
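
<p>To make the processor idea more concrete, here is a minimal sketch of a
Referer-adding processor. It is not ClientCookie's
<code>HTTPRefererProcessor</code>: it assumes Python 3's standard-library
<code>urllib.request</code>, whose handlers may define
<code>http_request()</code> and <code>http_response()</code> methods in the
same style, and the class name <code>SimpleRefererProcessor</code> is
hypothetical. Like the processor described above, it only makes sense for a
single chain of requests.

```python
import urllib.request


class SimpleRefererProcessor(urllib.request.BaseHandler):
    """Hypothetical processor: send the previous response's URL as Referer."""

    def __init__(self):
        self.referer = None  # nothing visited yet

    def http_request(self, request):
        # Called before the request is sent: add the header if we have one.
        if self.referer is not None and not request.has_header("Referer"):
            request.add_unredirected_header("Referer", self.referer)
        return request

    def http_response(self, request, response):
        # Called after the response arrives: remember where we have just been.
        self.referer = response.geturl()
        return response

    https_request = http_request
    https_response = http_response


# Processors are passed to build_opener() like any other handler.
opener = urllib.request.build_opener(SimpleRefererProcessor())
```

<p>Nothing is fetched until <code>opener.open()</code> is called, so building
the opener is safe to try offline.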

<h2>Confusing fact about headers and Requests</h2>

<p>ClientCookie automatically upgrades <code>urllib2.Request</code> objects to
<code>ClientCookie.Request</code>. This means that you won't see any headers
that are added to <code>Request</code> objects by handlers unless you use
<code>ClientCookie.Request</code> in the first place. Sorry about that.

<h2>Adding headers</h2>

<p>Adding headers is done like so:

@{colorize(r"""
import ClientCookie, urllib2
req = urllib2.Request("http://foobar.com/")
req.add_header("Referer", "http://wwwsearch.sourceforge.net/ClientCookie/")
r = ClientCookie.urlopen(req)
""")}

<p>You can also use the <code>headers</code> argument to the
<code>urllib2.Request</code> constructor.

<p><code>urllib2</code> (in fact, ClientCookie takes over this task from
<code>urllib2</code>) adds some headers to <code>Request</code> objects
automatically - see the next section for details.

<h2>Changing the automatically-added headers (User-Agent)</h2>

<p><code>OpenerDirector</code> automatically adds a <code>User-Agent</code>
header to every <code>Request</code>.

<p>To change this and/or add similar headers, use your own
<code>OpenerDirector</code>:

@{colorize(r"""
import ClientCookie
cookies = ClientCookie.CookieJar()
opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cookies))
opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
                     ("From", "responsible.person@example.com")]
""")}

<p>Again, to use <code>urlopen()</code>, install your
<code>OpenerDirector</code> globally:

@{colorize(r"""
ClientCookie.install_opener(opener)
r = ClientCookie.urlopen("http://acme.com/")
""")}

<p>Also, a few standard headers (<code>Content-Length</code>,
<code>Content-Type</code> and <code>Host</code>) are added when the
<code>Request</code> is passed to <code>urlopen()</code> (or
<code>OpenerDirector.open()</code>). ClientCookie explicitly adds these (and
<code>User-Agent</code>) to the <code>Request</code> object, unlike
<code>urllib2</code>. You shouldn't need to change these headers, but since
this is done by <code>AbstractHTTPHandler</code>, you can change the way it
works by passing a subclass of that handler to <code>build_opener()</code>.

<h2>Initiating unverifiable transactions</h2>

<p>ClientCookie knows that redirected transactions are unverifiable, so it'll
handle that on its own.

<p>If you want to initiate an unverifiable transaction yourself (which you
should if, for example, you're downloading the images from a page, and 'the
user' hasn't explicitly OKed those URLs), you need to set a true
<code>request.unverifiable</code> on your <code>Request</code> instance, and
also set <code>request.origin_req_host</code> to the request-host of the origin
transaction (e.g. the host name in the URL of the page containing the images).
If <code>unverifiable</code> is present and true, but
<code>origin_req_host</code> is not present, you'll get an
<code>AttributeError</code>. XXX None of this is very nice...
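
<p>As a hedged illustration of what the two attributes are for, the sketch
below fetches nothing. It assumes Python 3's standard-library
<code>urllib.request.Request</code> (which accepts
<code>origin_req_host</code> and <code>unverifiable</code> constructor
arguments) and <code>http.cookiejar</code> as stand-ins for ClientCookie, with
a hypothetical <code>FakeResponse</code> class and illustrative host names.
With the policy's <code>strict_ns_unverifiable</code> switch on, a cookie set
during an unverifiable third-party transaction is expected to be refused.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical response stand-in: only .info() is provided."""

    def __init__(self, set_cookie):
        self._message = email.message.Message()
        self._message["Set-Cookie"] = set_cookie

    def info(self):
        return self._message


def cookies_accepted(request):
    """Count the cookies a strict jar keeps from one fake response."""
    policy = http.cookiejar.DefaultCookiePolicy(strict_ns_unverifiable=True)
    jar = http.cookiejar.CookieJar(policy=policy)
    jar.extract_cookies(FakeResponse("track=1"), request)
    return len(jar)


# The user asked for this URL directly, so the transaction is verifiable.
direct = urllib.request.Request("http://www.lurid.com/banner.gif")

# The same image, fetched only because a page on www.bland.org embeds it.
embedded = urllib.request.Request(
    "http://www.lurid.com/banner.gif",
    origin_req_host="www.bland.org",
    unverifiable=True,
)

print(cookies_accepted(direct), cookies_accepted(embedded))
```

<p>Note that the default policy in <code>http.cookiejar</code> leaves
<code>strict_ns_unverifiable</code> off, so the switch is set explicitly here.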

<h2>Debugging</h2>

<p>First, a few common problems. The most frequent mistake people seem to make
is to use <code>ClientCookie.urlopen()</code>, <em>and</em> call the
<code>.extract_cookies()</code> and <code>.add_cookie_header()</code> methods
on a <code>CookieJar</code> object themselves. If you use
<code>ClientCookie.urlopen()</code> (or <code>OpenerDirector.open()</code>),
the module handles extraction and adding of cookies by itself, so you should
not call <code>.extract_cookies()</code> or
<code>.add_cookie_header()</code>.

<p>If things don't seem to be working as expected, the first thing to try is to
<a href="./doc.html#policy">switch off</a> RFC 2965 handling. This is because
few browsers implement it, so it is likely that some servers implement it
incorrectly.

<p>Are you sure the server is sending you any cookies in the first place?
Maybe the server is keeping track of state in some other way
(<code>HIDDEN</code> HTML form entries (possibly in a separate page referenced
by a frame), URL-encoded session keys, IP address, HTTP <code>Referer</code>
headers)? Perhaps some embedded script in the HTML is setting cookies (see
below)? Maybe you messed up your request, and the server is sending you some
standard failure page (even if the page doesn't appear to indicate any
failure).

<p>Sometimes, a server wants particular headers set to the values it
expects, or it won't play nicely. The most frequent offenders here are the
<code>Referer</code> [<em>sic</em>] and / or <code>User-Agent</code> HTTP
headers (<a href="./doc.html#headers">see above</a> for how to set these). The
<code>User-Agent</code> header may need to be set to a value like that of a
popular browser. The <code>Referer</code> header may need to be set to the URL
that the server expects you to have followed a link from. Occasionally, it may
even be that operators deliberately configure a server to insist on precisely
the headers that the popular browsers (MS Internet Explorer, Mozilla/Netscape,
Opera, Konqueror/Safari) generate, but remember that incompetence (possibly on
your part) is more probable than deliberate sabotage (and if a site owner is
that keen to stop robots, you probably shouldn't be scraping it anyway).

<p>When you <code>.save()</code> to or
<code>.load()</code>/<code>.revert()</code> from a file, single-session cookies
are discarded unless you explicitly request otherwise with the
<code>ignore_discard</code> argument. This may be your problem if you find
cookies are going away after saving and loading.

@{colorize(r"""
import ClientCookie
cookies = ClientCookie.CookieJar()
opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cookies))
ClientCookie.install_opener(opener)
r = ClientCookie.urlopen("http://foobar.com/")
cookies.save("/some/file", ignore_discard=True, ignore_expires=True)
""")}
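
<p>The effect of <code>ignore_discard</code> can be checked offline. The sketch
below is an assumption-based illustration, not ClientCookie code: it uses
Python 3's standard-library <code>http.cookiejar.MozillaCookieJar</code> (a
file-backed jar whose <code>.save()</code> and <code>.load()</code> methods
take an argument of the same name), a temporary file, and a hand-built session
cookie whose name and value are made up.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar

# A session cookie: no expiry time, and marked to be discarded.
session_cookie = Cookie(
    version=0, name="session", value="abc123",
    port=None, port_specified=False,
    domain="www.acme.com", domain_specified=False, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "cookies.txt")

    jar = MozillaCookieJar()
    jar.set_cookie(session_cookie)
    jar.save(filename, ignore_discard=True)

    # By default, session cookies are dropped again when loading.
    default_jar = MozillaCookieJar()
    default_jar.load(filename)

    # Asking explicitly keeps them.
    keeping_jar = MozillaCookieJar()
    keeping_jar.load(filename, ignore_discard=True)

    loaded_default = len(default_jar)
    loaded_keeping = len(keeping_jar)

print(loaded_default, loaded_keeping)
```

<p>The argument matters on both sides: a session cookie has to survive the
save <em>and</em> the later load.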

<p>If none of the advice above solves your problem quickly, try comparing the
headers and data that you are sending out with those that a browser emits.
Often this will give you the clue you need. Of course, you'll want to check
that the browser is able to do manually what you're trying to achieve
programmatically before minutely examining the headers. Make sure that what you
do manually is <em>exactly</em> the same as what you're trying to do from
Python - you may simply be hitting a server bug that only gets revealed if you
view pages in a particular order, for example. In order to see what your
browser is sending to the server (even if HTTPS is in use), see <a
href="../clientx.html">the General FAQ page</a>.

<p>If nothing is obviously wrong with the requests your program is sending and
you're out of ideas, you can try the last resort of good old brute force
binary-search debugging. Temporarily switch to sending HTTP headers by hand
(with <code>httplib</code>). Start by copying Netscape/Mozilla or IE slavishly
(apart from session IDs, etc., of course), then begin the tedious process of
mutating your headers and data until they match what your higher-level code was
sending. This will at least reliably find your problem.

<p>You can turn on display of HTTP headers:

@{colorize(r"""
import ClientCookie
hh = ClientCookie.HTTPHandler()  # you might want HTTPSHandler, too
hh.set_http_debuglevel(1)
opener = ClientCookie.build_opener(hh)
url = "http://www.acme.com/"  # illustrative; use the URL you are debugging
response = opener.open(url)
""")}

<p>Alternatively, you can examine your individual request and response objects
to see what's going on. ClientCookie's responses can be made
<code>.seek()</code>able using <code>SeekableProcessor</code>. It's often
useful to use the <code>.seek()</code> method like this during debugging:

@{colorize(r"""
...
response = ClientCookie.urlopen("http://spam.eggs.org/")
print response.read()
response.seek(0)
# rest of code continues as if you'd never .read() the response
...
""")}

<p>Also, note <code>HTTPRedirectDebugProcessor</code> (which prints information
about redirections) and <code>HTTPResponseDebugProcessor</code> (which prints
out all response bodies, including those that are read during redirections).
<strong>NOTE</strong>: as well as having these processors in your
<code>OpenerDirector</code> (for example, by passing them to
<code>build_opener()</code>) you have to turn on logging at the
<code>INFO</code> level or lower in order to see any output.

<p>If you would like to see what is going on in ClientCookie's tiny mind, do
this:

@{colorize(r"""
import ClientCookie
# ClientCookie.DEBUG covers masses of debugging information;
# ClientCookie.INFO just shows the output from HTTPRedirectDebugProcessor.
ClientCookie.getLogger("ClientCookie").setLevel(ClientCookie.DEBUG)
""")}

<p>(In Python 2.3, <code>logging.getLogger</code>, <code>logging.DEBUG</code>,
<code>logging.INFO</code> etc. work just as well.)

<p>The <code>DEBUG</code> level (as opposed to the <code>INFO</code> level) can
actually be quite useful, as it explains why particular cookies are accepted or
rejected and why they are or are not returned.

<p>One final thing to note is that there are some catch-all bare
<code>except:</code> statements in the module, which are there to handle
unexpected bad input without crashing your program. If one of them is
triggered, it's a bug in ClientCookie, so please mail me the warning text.

<h2>Embedded script that sets cookies</h2>

<p>It is possible to embed script in HTML pages (sandwiched between
<code>&lt;SCRIPT&gt;here&lt;/SCRIPT&gt;</code> tags, and in
<code>javascript:</code> URLs) - JavaScript / ECMAScript, VBScript, or even
Python - that causes cookies to be set in a browser. See the <a
href="../bits/clientx.html">General FAQs</a> page for what to do about this.

<h2>Parsing HTTP date strings</h2>

<p>A function named <code>str2time</code> is provided by the package,
which may be useful for parsing dates in HTTP headers.
<code>str2time</code> is intended to be liberal, since HTTP date/time
formats are poorly standardised in practice. There is no need to use this
function in normal operations: <code>CookieJar</code> instances keep track
of cookie lifetimes automatically. This function will stay around in some
form, though the supported date/time formats may change.
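
<p>For a feel of what liberal date parsing looks like, here is a short sketch.
It does not call ClientCookie's <code>str2time</code>; it assumes Python 3's
standard-library <code>http.cookiejar</code>, whose undocumented helper
<code>http2time()</code> appears to play a similar role there (an assumption
worth checking against your Python version). The date strings are examples
chosen for illustration.

```python
import calendar
from http.cookiejar import http2time

# The same instant written in three styles seen in HTTP headers.
samples = [
    "Wed, 09 Feb 1994 22:23:32 GMT",        # usual HTTP format
    "Wednesday, 09-Feb-1994 22:23:32 GMT",  # older style with hyphens
    "09 Feb 1994 22:23:32 GMT",             # no weekday
]
expected = calendar.timegm((1994, 2, 9, 22, 23, 32))

for text in samples:
    # http2time() returns seconds since the epoch (UTC), or None on failure.
    print(text, "->", http2time(text))

unparseable = http2time("not a date")
```

<p>A result of <code>None</code> rather than an exception is what lets callers
treat a bad date as "unknown" and carry on.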

<h2>Note about cookie standards</h2>

<p>The various cookie standards and their history form a case study of the
terrible things that can happen to a protocol. The long-suffering David
Kristol has written a <a
href="http://doi.acm.org/10.1145/502152.502153">paper</a> about it, if you
want to know the gory details.

<p>Here is a summary.

<p>The <a href="http://www.netscape.com/newsref/std/cookie_spec.html">Netscape
protocol</a> (cookie_spec.html) is still the only standard supported by most
browsers (including Internet Explorer and Netscape). Be aware that
cookie_spec.html is not, and never was, actually followed to the letter (or
anything close) by anyone (including Netscape, IE and ClientCookie): the
Netscape protocol standard is really defined by the behaviour of Netscape (and
now IE). Netscape cookies are also known as V0 cookies, to distinguish them
from RFC 2109 or RFC 2965 cookies, which have a version cookie-attribute with a
value of 1.

<p><a href="http://www.ietf.org/rfcs/rfc2109.txt">RFC 2109</a> was introduced
to fix some problems identified with the Netscape protocol, while still keeping
the same HTTP headers (<code>Cookie</code> and <code>Set-Cookie</code>). The
most prominent of these problems is the 'third-party' cookie issue, which was
an accidental feature of the Netscape protocol. When one visits www.bland.org,
one doesn't expect to get a cookie from www.lurid.com, a site one has never
visited. Depending on browser configuration, this can still happen, because
the unreconstructed Netscape protocol is happy to accept cookies from, say, an
image in a webpage (www.bland.org) that's included by linking to an
advertiser's server (www.lurid.com). This kind of event, where your browser
talks to a server that you haven't explicitly okayed by some means, is what the
RFCs call an 'unverifiable transaction'. In addition to the potential for
embarrassment caused by the presence of lurid.com's cookies on one's machine,
this may also be used to track your movements on the web, because advertising
agencies like doubleclick.net place ads on many sites.

<p>RFC 2109 tried to change this by requiring cookies to be turned off during
unverifiable transactions with third-party servers - unless the user explicitly
asks them to be turned on. This clashed with the business model of advertisers
like doubleclick.net, who had started to take advantage of the third-party
cookies 'bug'. Since the browser vendors were more interested in the
advertisers' concerns than those of the browser users, this arguably doomed
both RFC 2109 and its successor, RFC 2965, from the start.

<p>RFC 2109 also fixed problems other than the third-party cookie issue.
However, even ignoring the advertising issue, 2109 was stillborn, because
Internet Explorer and Netscape behaved differently in response to its extended
<code>Set-Cookie</code> headers. This was not really RFC 2109's fault: it
worked the way it did to keep compatibility with the Netscape protocol as
implemented by Netscape. Microsoft Internet Explorer (MSIE) was very new when
the standard was designed, but was starting to be very popular when the
standard was finalised. XXX P3P, and MSIE &amp; Mozilla options

<p>XXX Apparently MSIE implements bits of RFC 2109 - but not very compliantly
(surprise). Presumably other browsers do too, as a result. ClientCookie
already does allow Netscape cookies to have <code>max-age</code> and
<code>port</code> cookie-attributes, and as far as I know that's the extent of
the support present in MSIE. I haven't tested, though!

<p><a href="http://www.ietf.org/rfcs/rfc2965.txt">RFC 2965</a> attempted to fix
the compatibility problem by introducing two new headers,
<code>Set-Cookie2</code> and <code>Cookie2</code>. Unlike the
<code>Cookie</code> header, <code>Cookie2</code> does <em>not</em> carry
cookies to the server - rather, it simply advertises to the server that RFC
2965 is understood. <code>Set-Cookie2</code> <em>does</em> carry cookies, from
server to client: the new header means that both IE and Netscape completely
ignore these cookies. This prevents breakage, but introduces a chicken-and-egg
problem that means 2965 may never be widely adopted, especially since Microsoft
shows no interest in it. XXX Rumour has it that the European Union is unhappy
with P3P, and might introduce legislation that requires something better,
forming a gap that RFC 2965 might fill - any truth in this? Opera is the only
browser I know of that supports the standard. On the server side, Apache's
<code>mod_usertrack</code> supports it. One confusing point to note about RFC
2965 is that it uses the same value (1) of the Version attribute in HTTP
headers as does RFC 2109.

<p>Recently, it was discovered that RFC 2965 does not fully take account of
issues arising when 2965 and Netscape cookies coexist, and errata were
discussed on the W3C http-state mailing list, but the list traffic has died and
it seems RFC 2965 is dead as an internet protocol (but still a useful basis for
implementing the de-facto standards, and perhaps as an intranet protocol).

<p>Because Netscape cookies are so poorly specified, the general philosophy
of the module's Netscape cookie implementation is to start with RFC 2965
and open holes where required for Netscape protocol-compatibility. RFC
2965 cookies are <em>always</em> treated as RFC 2965 requires, of course!

<h2>FAQs - usage</h2>
<ul>
<li>Why don't I have any cookies?
<p>Read the <a href="./doc.html#debugging">debugging section</a> of this page.
<li>My response claims to be empty, but I know it's not!
<p>Did you call <code>response.read()</code> (e.g. in a debug statement),
then forget that all the data has already been read? In that case, you
may want to use <code>SeekableProcessor</code>.
<li>How do I download only part of a response body?
<p>Just call <code>.read()</code> or <code>.readline()</code> methods on your
response object as many times as you need. The <code>seek</code> method
(which will only be there if you're using <code>SeekableProcessor</code>)
still works, because <code>SeekableProcessor</code>'s response objects
cache read data.
<li>What's the difference between the <code>.load()</code> and
<code>.revert()</code> methods of <code>CookieJar</code>?
<p><code>.load()</code> <em>appends</em> cookies from a file.
<code>.revert()</code> discards all existing cookies held by the
<code>CookieJar</code> first (but it won't lose any existing cookies if
the loading fails).
<li>Is it threadsafe?
<p>I believe so, but it's not been tested yet.
<li>What's this "processor" business about? I knew
<code>urllib2</code> used "handlers", but not these
"processors".
<p>See this Python library <a href="http://www.python.org/sf/852995">patch</a>.
<li>How do I use it without urllib2.py?
<p>@{colorize(r"""
from ClientCookie import CookieJar
print CookieJar.extract_cookies.__doc__
print CookieJar.add_cookie_header.__doc__
""")}
<p>The module docstrings are worth reading if you want to do something
unusual.
</ul>

<p><a href="mailto:jjl@@pobox.com">John J. Lee</a>, January 2004.

<hr>

</div>


</div>

</body>

</html>