"""
Lamson takes the policy that email it receives is most likely complete garbage
using bizarre pre-Unicode formats that are irrelevant and unnecessary in today's
modern world. These emails must be cleansed of their unholy stench of
randomness and turned into something nice and clean that a regular Python
programmer can work with: unicode.

That's the receiving end, but on the sending end Lamson wants to make the world
better by not increasing the suffering. To that end, Lamson will canonicalize
all email it sends to be ascii or utf-8 (whichever is simpler and works to
encode the data). When you get an email from Lamson, it is a pristine, easily
parseable, clean unit of goodness you can count on.

To accomplish these tasks, Lamson goes back to basics and asserts a few simple
rules on each email it receives:

1) NO ENCODING IS TRUSTED, NO LANGUAGE IS SACRED, ALL ARE SUSPECT.
2) Python wants Unicode, it will get Unicode.
3) Any email that CANNOT become Unicode, CANNOT be processed by Lamson or
Python.
4) Email addresses are ESSENTIAL to Lamson's routing and security, and therefore
will be canonicalized and properly encoded.
5) Lamson will therefore try to "upgrade" all email it receives to Unicode
internally, and clean all email addresses.
6) It does this by decoding all codecs, and if the codec LIES, then it will
attempt to statistically detect the codec using chardet.
7) If it can't detect the codec, and the codec lies, then the email is bad.
8) All text bodies and attachments are then converted to Python unicode in the
same way as the headers.
9) All other attachments are converted to raw strings as-is.

Once Lamson has done this, your Python handler can assume that all
MailRequest objects are happily unicode enabled and ready to go. The rule is:

IF IT CANNOT BE UNICODE, THEN PYTHON CANNOT WORK WITH IT.

On the outgoing end (when you send a MailResponse), Lamson tries to create the
email it wants to receive by canonicalizing it:

1) All email will be encoded in the simplest, cleanest way possible without
losing information.
2) All headers are converted to 'ascii', and if that doesn't work, then 'utf-8'.
3) All text/* attachments and bodies are converted to ascii, and if that doesn't
work, then 'utf-8'.
4) All other attachments are left alone.
5) All email addresses are normalized and encoded if they have not been already.

The end result is an email that has the highest probability of not containing
any obfuscation techniques, hidden characters, bad characters, improper
formatting, invalid non-characterset headers, or any of the other billions of
things email clients do to the world. The output rule of Lamson is:

ALL EMAIL IS ASCII FIRST, THEN UTF-8, AND IF IT CANNOT BE EITHER OF THOSE IT
WILL NOT BE SENT.

Following these simple rules, this module does the work of converting email
to the canonical format and sending the canonical format. The code is
probably the most complex part of Lamson since the job it does is difficult.

Test results show that Lamson can safely canonicalize most email from any
culture (not just English) to the canonical form, and that if it can't then the
email is not formatted right and/or spam.

If you find an instance where this is not the case, then submit it to the
project as a test case.
"""

import string
import re
import email
import chardet
from email.charset import Charset
from email import encoders
from email.mime.base import MIMEBase
from email.utils import parseaddr

DEFAULT_ENCODING = "utf-8"
DEFAULT_ERROR_HANDLING = "strict"
CONTENT_ENCODING_KEYS = set(['Content-Type', 'Content-Transfer-Encoding',
                             'Content-Disposition', 'Mime-Version'])
CONTENT_ENCODING_REMOVED_PARAMS = ['boundary']

REGEX_OPTS = re.IGNORECASE | re.MULTILINE
ENCODING_REGEX = re.compile(r"\=\?([a-z0-9\-]+?)\?([bq])\?", REGEX_OPTS)
ENCODING_END_REGEX = re.compile(r"\?=", REGEX_OPTS)
INDENT_REGEX = re.compile(r"\n\s+")

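# The two ENCODING patterns above mark the start and end of a MIME
# encoded word such as =?utf-8?b?...?=. The sketch below is one way to
# see them in action. It is written for Python 3, unlike the Python 2
# module around it, and find_encoded_word is a hypothetical helper that
# is not part of Lamson.

```python
import re

# Same patterns as the module constants: charset, then B or Q encoding.
REGEX_OPTS = re.IGNORECASE | re.MULTILINE
ENCODING_REGEX = re.compile(r"\=\?([a-z0-9\-]+?)\?([bq])\?", REGEX_OPTS)
ENCODING_END_REGEX = re.compile(r"\?=", REGEX_OPTS)


def find_encoded_word(header):
    """Return (charset, encoding, payload) for the first encoded word,
    or None when the header has no complete encoded word."""
    start = ENCODING_REGEX.search(header)
    if not start:
        return None
    end = ENCODING_END_REGEX.search(header, start.end())
    if not end:
        return None
    charset, encoding = start.groups()
    return charset, encoding, header[start.end():end.start()]


example = find_encoded_word("Subject: =?UTF-8?B?SGVsbG8=?=")
```

# The charset and encoding are captured exactly as written in the
# header, so callers still have to treat them case-insensitively.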
class EncodingError(Exception):
    """Thrown when there is an encoding error."""

class MailBase(object):
    """MailBase is used as the basis of lamson.mail and contains the basics of
    encoding an email. You actually can do all your email processing with this
    class, but it's more raw.
    """
    def __init__(self, items=()):
        self.headers = dict(items)
        self.parts = []
        self.body = None
        self.content_encoding = {'Content-Type': (None, {}),
                                 'Content-Disposition': (None, {}),
                                 'Content-Transfer-Encoding': (None, {})}

    def __getitem__(self, key):
        return self.headers.get(normalize_header(key), None)

    def __len__(self):
        return len(self.headers)

    def __iter__(self):
        return iter(self.headers)

    def __contains__(self, key):
        return normalize_header(key) in self.headers

    def __setitem__(self, key, value):
        self.headers[normalize_header(key)] = value

    def __delitem__(self, key):
        del self.headers[normalize_header(key)]

    def __nonzero__(self):
        return self.body is not None or len(self.headers) > 0

    def keys(self):
        """Returns the sorted keys."""
        return sorted(self.headers.keys())

    def attach_file(self, filename, data, ctype, disposition):
        """
        A file attachment is a raw attachment with a disposition that
        indicates the file name.
        """
        assert filename, "You can't attach a file without a filename."

        part = MailBase()
        part.body = data
        part.content_encoding['Content-Type'] = (ctype, {'name': filename})
        part.content_encoding['Content-Disposition'] = (disposition,
                                                        {'filename': filename})
        self.parts.append(part)

    def attach_text(self, data, ctype):
        """
        This attaches a simpler text encoded part, which doesn't have a
        filename.
        """
        part = MailBase()
        part.body = data
        part.content_encoding['Content-Type'] = (ctype, {})
        self.parts.append(part)

class MIMEPart(MIMEBase):
    """
    A reimplementation of nearly everything in email.mime to be more useful
    for actually attaching things. Rather than one class for every type of
    thing you'd encode, there's just this one, and it figures out how to
    encode what you ask it.
    """
    def __init__(self, type, **params):
        self.maintype, self.subtype = type.split('/')
        MIMEBase.__init__(self, self.maintype, self.subtype, **params)

    def add_text(self, content):
        # this is text, so encode it in canonical form
        try:
            encoded = content.encode('ascii')
            charset = 'ascii'
        except UnicodeError:
            encoded = content.encode('utf-8')
            charset = 'utf-8'

        self.set_payload(encoded, charset=charset)

    def extract_payload(self, mail):
        if mail.body is None: return  # only None, '' is still ok

        ctype, ctype_params = mail.content_encoding['Content-Type']
        cdisp, cdisp_params = mail.content_encoding['Content-Disposition']

        assert ctype, ("Extract payload requires that mail.content_encoding "
                       "have a valid Content-Type.")

        if ctype.startswith("text/"):
            self.add_text(mail.body)
        else:
            if cdisp:
                # replicate the content-disposition settings
                self.add_header('Content-Disposition', cdisp, **cdisp_params)

            self.set_payload(mail.body)
            encoders.encode_base64(self)

    def __repr__(self):
        # maintype comes first so the output reads like "text/plain"
        return "<MIMEPart '%s/%s': %r, %r, multipart=%r>" % (
            self.maintype, self.subtype, self['Content-Type'],
            self['Content-Disposition'],
            self.is_multipart())

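# MIMEPart.add_text applies the output rule from the module docstring:
# try ascii first and fall back to utf-8. A minimal standalone sketch of
# that rule, written for Python 3 rather than the Python 2 used in this
# module, might look like the following. canonical_encode is a
# hypothetical name used only for illustration.

```python
def canonical_encode(text):
    """Encode text as ascii when possible, otherwise as utf-8.

    Returns the encoded bytes together with the charset that was used,
    so the caller can label the payload correctly.
    """
    try:
        return text.encode("ascii"), "ascii"
    except UnicodeEncodeError:
        return text.encode("utf-8"), "utf-8"


plain_bytes, plain_charset = canonical_encode("hello")
accented_bytes, accented_charset = canonical_encode("h\u00e9llo")
```

# Because utf-8 can encode any text that Python 3 str.encode accepts as
# well-formed, the fallback only fails on unusual input such as lone
# surrogates.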
def from_message(message):
    """
    Given a MIMEBase or similar Python email API message object, this
    will canonicalize it and give you back a pristine MailBase.
    If it can't then it raises an EncodingError.
    """
    mail = MailBase()

    # parse the content information out of message
    for k in CONTENT_ENCODING_KEYS:
        params = parse_parameter_header(message, k)
        mail.content_encoding[k] = params

    # copy over any keys that are not part of the content information
    for k in message.keys():
        if normalize_header(k) not in mail.content_encoding:
            mail[k] = header_from_mime_encoding(message[k])

    decode_message_body(mail, message)

    if message.is_multipart():
        # recursively go through each subpart and decode in the same way
        for msg in message.get_payload():
            if msg != message:  # skip the multipart message itself
                mail.parts.append(from_message(msg))

    return mail

def to_message(mail):
    """
    Given a MailBase message, this will construct a MIMEPart
    that is canonicalized for use with the Python email API.
    """
    ctype, params = mail.content_encoding['Content-Type']

    if not ctype:
        if mail.parts:
            ctype = 'multipart/mixed'
        else:
            ctype = 'text/plain'  # assumed default; this line is missing from the source
    else:
        if mail.parts:
            assert ctype.startswith("multipart") or ctype.startswith("message"), \
                "Content type should be multipart or message, not %r" % ctype

    # adjust the content type according to what it should be now
    mail.content_encoding['Content-Type'] = (ctype, params)

    try:
        out = MIMEPart(ctype, **params)
    except TypeError:  # assumed exception type; this line is missing from the source
        raise EncodingError("Content-Type malformed, not allowed: %r; %r" %
                            (ctype, params))

    for k in mail.keys():
        out[k.encode('ascii')] = header_to_mime_encoding(mail[k])

    out.extract_payload(mail)

    # go through the children
    for part in mail.parts:
        out.attach(to_message(part))

    return out

def to_string(mail, envelope_header=True):
    """Returns a canonicalized email string you can use to send or store."""
    return to_message(mail).as_string(envelope_header)


def from_string(data):
    """Takes a string, and tries to clean it up into a clean MailBase."""
    return from_message(email.message_from_string(data))


def to_file(mail, fileobj):
    """Writes a canonicalized message to the given file."""
    fileobj.write(to_string(mail))


def from_file(fileobj):
    """Reads an email and cleans it up to make a MailBase."""
    return from_message(email.message_from_file(fileobj))

def normalize_header(header):
    return string.capwords(header.lower(), '-')

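# normalize_header is what makes MailBase header lookups
# case-insensitive: every key is folded to the same capitalized form
# before it touches the headers dict. The Python 3 snippet below
# repeats the one-line function so it can run on its own and shows a
# few sample inputs chosen for illustration.

```python
import string


def normalize_header(header):
    # Lowercase first, then capitalize each dash-separated word.
    return string.capwords(header.lower(), "-")


samples = {
    "content-type": normalize_header("content-type"),
    "MIME-VERSION": normalize_header("MIME-VERSION"),
    "X-mAiLeR": normalize_header("X-mAiLeR"),
}
```

# Note that the result is "Mime-Version", not "MIME-Version", which is
# why CONTENT_ENCODING_KEYS spells that key the same way.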
def parse_parameter_header(message, header):
    params = message.get_params(header=header)
    if params:
        value = params.pop(0)[0]
        params_dict = dict(params)

        for key in CONTENT_ENCODING_REMOVED_PARAMS:
            if key in params_dict: del params_dict[key]

        return value, params_dict
    else:
        # get_params returns None when the header is missing
        return None, {}

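# parse_parameter_header relies on Message.get_params from the standard
# email package, which returns a list of (name, value) pairs whose first
# entry is the main header value, or None when the header is absent.
# The Python 3 sketch below shows that shape on a hand-written message.
# The sample message and REMOVED_PARAMS are illustrative assumptions.

```python
import email

REMOVED_PARAMS = ["boundary"]  # mirrors CONTENT_ENCODING_REMOVED_PARAMS


def parse_parameter_header(message, header):
    """Split a parameterized header into (value, params_dict)."""
    params = message.get_params(header=header)
    if not params:
        # The header is missing, so there is nothing to report.
        return None, {}
    value = params[0][0]
    params_dict = dict(params[1:])
    for key in REMOVED_PARAMS:
        params_dict.pop(key, None)
    return value, params_dict


raw = (
    'Content-Type: multipart/mixed; boundary="XYZ"; charset="utf-8"\n'
    "\n"
    "body\n"
)
msg = email.message_from_string(raw)
content_type = parse_parameter_header(msg, "Content-Type")
disposition = parse_parameter_header(msg, "Content-Disposition")
```

# Dropping the boundary makes sense here because a fresh one is
# generated when the message is written back out.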
def decode_message_body(mail, message):
    mail.body = message.get_payload(decode=True)
    if mail.body:
        # decode the payload according to the charset given if it's text
        ctype, params = mail.content_encoding['Content-Type']

        if not ctype:
            charset = 'ascii'
            mail.body = attempt_decoding(charset, mail.body)
        elif ctype.startswith("text/"):
            charset = params.get('charset', 'ascii')
            mail.body = attempt_decoding(charset, mail.body)
        else:
            # it's a binary codec of some kind, so just decode and leave it
            pass

def header_to_mime_encoding(value):
    if not value: return ""

    encoder = Charset(DEFAULT_ENCODING)

    try:
        return value.encode("ascii")
    except UnicodeEncodeError:
        if "@" in value:  # assumed guard; the original condition is missing from the source
            # this could have an email address, make sure we don't screw it up
            name, address = parseaddr(value)
            return '"%s" <%s>' % (encoder.header_encode(name.encode("utf-8")), address)

        return encoder.header_encode(value.encode("utf-8"))

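# The function above follows the header rule from the docstring: ascii
# if it fits, otherwise a utf-8 encoded word, with the address part of
# a "Name <addr>" value kept readable. The sketch below is a rough
# Python 3 rendering of the same idea, not Lamson's code. In Python 3,
# Charset.header_encode takes a str, and the "@" test is an assumption
# about how an address is recognized.

```python
from email.charset import Charset
from email.header import decode_header
from email.utils import parseaddr


def header_to_mime_encoding(value):
    """Return value unchanged if it is ascii, else MIME-encode it."""
    if not value:
        return ""
    try:
        value.encode("ascii")
        return value
    except UnicodeEncodeError:
        encoder = Charset("utf-8")
        if "@" in value:
            # Encode only the display name so the address stays routable.
            name, address = parseaddr(value)
            return '"%s" <%s>' % (encoder.header_encode(name), address)
        return encoder.header_encode(value)


plain = header_to_mime_encoding("Hello")
subject = header_to_mime_encoding("caf\u00e9")
sender = header_to_mime_encoding("Zo\u00eb <zoe@example.com>")
# Round-trip the subject through the standard library decoder.
raw_bytes, charset = decode_header(subject)[0]
```

# Whether the encoded word uses base64 or quoted-printable is chosen by
# the email package based on length, so the exact output is best not
# relied upon.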
def header_from_mime_encoding(header):
    if header is None:
        return header
    elif type(header) == list:
        return [properly_decode_header(h) for h in header]
    else:
        return properly_decode_header(header)

def guess_encoding_and_decode(original, data, errors=DEFAULT_ERROR_HANDLING):
    try:
        charset = chardet.detect(str(data))

        if not charset['encoding']:
            raise EncodingError("Header claimed %r charset, but detection found none. Decoding failed." % original)

        return data.decode(charset["encoding"], errors)
    except UnicodeError, exc:
        raise EncodingError("Header lied and claimed %r charset, guessing said "
                            "%r charset, neither worked so this is a bad email: "
                            "%s." % (original, charset, exc))

def attempt_decoding(charset, dec):
    try:
        if isinstance(dec, unicode):
            # it's already unicode so just return it
            return dec
        else:
            return dec.decode(charset)
    except UnicodeError:
        # looks like the charset lies, try to detect it
        return guess_encoding_and_decode(charset, dec)
    except LookupError:
        # they gave a crap encoding
        return guess_encoding_and_decode(charset, dec)

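# attempt_decoding and guess_encoding_and_decode together implement
# rules 6 and 7 from the docstring: trust the declared charset first,
# fall back to detection if it lies, and reject the email if nothing
# works. The Python 3 sketch below keeps that control flow but swaps
# chardet for a fixed list of candidate codecs so it runs with the
# standard library alone. FALLBACK_CODECS and guess_and_decode are
# hypothetical stand-ins, not what Lamson does.

```python
FALLBACK_CODECS = ("utf-8", "cp1252")  # stand-in for chardet detection


class EncodingError(Exception):
    """Raised when no candidate codec can decode the data."""


def guess_and_decode(original, data):
    for codec in FALLBACK_CODECS:
        try:
            return data.decode(codec)
        except UnicodeError:
            continue
    raise EncodingError("Claimed %r charset, and no fallback worked."
                        % original)


def attempt_decoding(charset, data):
    if isinstance(data, str):
        # Already text, nothing to decode.
        return data
    try:
        return data.decode(charset)
    except UnicodeError:
        # The declared charset lies about the bytes.
        return guess_and_decode(charset, data)
    except LookupError:
        # The declared charset is not a codec Python knows.
        return guess_and_decode(charset, data)


lied = attempt_decoding("ascii", "caf\u00e9".encode("utf-8"))
unknown = attempt_decoding("no-such-codec", b"hi")
legacy = attempt_decoding("ascii", "caf\u00e9".encode("latin-1"))
```

# A fixed candidate list is far cruder than statistical detection and
# will mislabel many legacy encodings, so treat it only as a way to
# read the control flow.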
def apply_charset_to_header(charset, encoding, data):
    if encoding == 'b' or encoding == 'B':
        dec = email.base64mime.decode(data.encode('ascii'))
    elif encoding == 'q' or encoding == 'Q':
        dec = email.quoprimime.header_decode(data.encode('ascii'))
    else:
        raise EncodingError("Invalid header encoding %r should be 'Q' or 'B'." % encoding)

    return attempt_decoding(charset, dec)

def _match(data, pattern, pos):
    found = pattern.search(data, pos)
    if found:
        # contract: returns data before the match, and the match groups
        left = data[pos:found.start()]
        return left, found.groups(), found.end()
    else:
        left = data[pos:]
        return left, None, -1


def _tokenize(data, next):
    enc_data = None

    left, enc_header, next = _match(data, ENCODING_REGEX, next)

    if next != -1:
        enc_data, _, next = _match(data, ENCODING_END_REGEX, next)

    return left, enc_header, enc_data, next


def _scan(data):
    next = 0
    while next != -1:
        left, enc_header, enc_data, next = _tokenize(data, next)

        if next != -1 and INDENT_REGEX.match(data, next):
            continued = True
        else:
            continued = False

        yield left, enc_header, enc_data, continued

def _parse_charset_header(data):
    scanner = _scan(data)
    oddness = None

    try:
        while True:
            if not oddness:
                left, enc_header, enc_data, continued = scanner.next()
            else:
                left, enc_header, enc_data, continued = oddness
                oddness = None

            while continued:
                l, eh, ed, continued = scanner.next()

                if not eh:
                    assert not ed, "Parsing error, give Zed this: %r" % data
                    oddness = (" " + l.lstrip(), eh, ed, continued)
                elif eh[0] == enc_header[0] and eh[1] == enc_header[1]:
                    enc_data += ed
                else:
                    # odd case, it's continued but not from the same base64
                    # need to stack this for the next loop, and drop the \n\s+
                    oddness = ('', eh, ed, continued)
                    break

            if left:
                yield attempt_decoding('ascii', left)

            if enc_header:
                yield apply_charset_to_header(enc_header[0], enc_header[1], enc_data)

    except StopIteration:
        pass

def properly_decode_header(header):
    return u"".join(_parse_charset_header(header))
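
# For comparison, the standard library can decode well-formed encoded
# words on its own through email.header.decode_header. The Python 3
# sketch below is not a replacement for properly_decode_header: it
# trusts the declared charset and has no detection fallback, which is
# exactly the gap the scanner above is meant to cover.
# decode_mime_header is a hypothetical name.

```python
from email.header import decode_header


def decode_mime_header(header):
    """Decode a header with the stdlib, trusting the declared charset."""
    pieces = []
    for data, charset in decode_header(header):
        if isinstance(data, bytes):
            # Encoded words come back as bytes plus their charset.
            pieces.append(data.decode(charset or "ascii"))
        else:
            # Headers without encoded words come back as str.
            pieces.append(data)
    return "".join(pieces)


decoded = decode_mime_header("=?utf-8?b?Y2Fmw6k=?=")
untouched = decode_mime_header("Hello")
```

# A header whose charset label is wrong raises UnicodeDecodeError or
# LookupError here instead of being repaired.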