'''"Executable documentation" for the pickle module.

Extensive comments about the pickle protocols and pickle-machine opcodes
can be found here. Some functions meant for external use:

   Generate all the opcodes in a pickle, as (opcode, arg, position) triples.

dis(pickle, out=None, memo=None, indentlevel=4)
   Print a symbolic disassembly of a pickle.
'''

__all__ = ['dis', 'genops', 'optimize']

# - A pickle verifier: read a pickle and check it exhaustively for
#   well-formedness. dis() does a lot of this already.
#
# - A protocol identifier: examine a pickle and return its protocol number
#   (== the highest .proto attr value among all the opcodes in the pickle).
#   dis() already prints this info at the end.
#
# - A pickle optimizer: for example, tuple-building code is sometimes more
#   elaborate than necessary, catering for the possibility that the tuple
#   is recursive. Or lots of times a PUT is generated that's never accessed

"A pickle" is a program for a virtual pickle machine (PM, but more accurately
called an unpickling machine). It's a sequence of opcodes, interpreted by the
PM, building an arbitrarily complex Python object.

For the most part, the PM is very simple: there are no looping, testing, or
conditional instructions, no arithmetic and no function calls. Opcodes are
executed once each, from first to last, until a STOP opcode is reached.

The PM has two data areas, "the stack" and "the memo".

Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
integer object on the stack, whose value is gotten from a decimal string
literal immediately following the INT opcode in the pickle bytestream. Other
opcodes take Python objects off the stack. The result of unpickling is
whatever object is left on the stack when the final STOP opcode is executed.
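As an illustration of the stack model, here is a minimal sketch. It assumes
Python 3 (this file itself targets Python 2), and the byte string is made-up
example data: INT followed by the decimal literal 42, then STOP.

```python
import pickle
import pickletools

# Hand-written protocol 0 pickle (hypothetical example data):
# 'I' is INT, followed by a newline-terminated decimal literal; '.' is STOP.
data = b"I42\n."

# genops() yields (opcode, arg, position) triples without executing the pickle.
ops = [(op.name, arg, pos) for op, arg, pos in pickletools.genops(data)]
for name, arg, pos in ops:
    print(pos, name, arg)

# INT pushes 42; STOP ends the program and the top of the stack is the result.
result = pickle.loads(data)
print(result)
```

Listing the opcodes with genops() first makes it easy to see that the
unpickled value is simply what INT left on the stack.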

The memo is simply an array of objects, or it can be implemented as a dict
mapping little integers to objects. The memo serves as the PM's "long term
memory", and the little integers indexing the memo are akin to variable
names. Some opcodes store the top stack object in the memo at a given index
(without popping it), and others push a memo object at a given index onto
the stack again.
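The interplay of stack and memo can be sketched with a hand-written
protocol 0 pickle. This assumes Python 3 (this file itself targets Python 2),
and the byte string is made-up example data. PUT records the object on top of
the stack in a memo slot, and GET pushes that same object again, so the result
contains one list twice rather than two equal lists.

```python
import pickle
import pickletools

# Hand-written protocol 0 pickle (hypothetical example data).
data = (
    b"(l"      # MARK, LIST: push an empty outer list
    b"(l"      # MARK, LIST: push an empty inner list
    b"p0\n"    # PUT 0: remember the inner list in memo slot 0
    b"a"       # APPEND: outer.append(inner)
    b"g0\n"    # GET 0: push the remembered inner list again
    b"a"       # APPEND: outer.append(inner) a second time
    b"."       # STOP
)

names = [op.name for op, arg, pos in pickletools.genops(data)]
outer = pickle.loads(data)
print(names)
print(outer, outer[0] is outer[1])
```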

At heart, that's all the PM has. Subtleties arise for these reasons:

+ Object identity. Objects can be arbitrarily complex, and subobjects
  may be shared (for example, the list [a, a] refers to the same object a
  twice). It can be vital that unpickling recreate an isomorphic object
  graph, faithfully reproducing sharing.

+ Recursive objects. For example, after "L = []; L.append(L)", L is a
  list, and L[0] is the same list. This is related to the object identity
  point, and some sequences of pickle opcodes are subtle in order to
  get the right result in all cases.

+ Things pickle doesn't know everything about. Examples of things pickle
  does know everything about are Python's builtin scalar and container
  types, like ints and tuples. They generally have opcodes dedicated to
  them. For things like module references and instances of user-defined
  classes, pickle's knowledge is limited. Historically, many enhancements
  have been made to the pickle protocol in order to do a better (faster,
  and/or more compact) job on those.

+ Backward compatibility and micro-optimization. As explained below,
  pickle opcodes never go away, not even when better ways to do a thing
  get invented. The repertoire of the PM just keeps growing over time.
  For example, protocol 0 had two opcodes for building Python integers (INT
  and LONG), protocol 1 added three more for more-efficient pickling of short
  integers, and protocol 2 added two more for more-efficient pickling of
  long integers (before protocol 2, the only ways to pickle a Python long
  took time quadratic in the number of digits, for both pickling and
  unpickling). "Opcode bloat" isn't so much a subtlety as a source of
  wearying complication.
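The first two points in the list above (sharing and recursion) can be checked
with a short round trip. This is a sketch assuming Python 3 and the default
protocol; the variable names are arbitrary.

```python
import pickle

# Shared subobject: both slots refer to the same list.
a = []
shared = [a, a]
shared_copy = pickle.loads(pickle.dumps(shared))

# Recursive object: the list contains itself.
L = []
L.append(L)
L_copy = pickle.loads(pickle.dumps(L))

# Sharing and recursion are reproduced in the copies,
# and the copies are new objects, not the originals.
print(shared_copy[0] is shared_copy[1])
print(L_copy[0] is L_copy)
print(shared_copy[0] is a)
```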

For compatibility, the meaning of a pickle opcode never changes. Instead new
pickle opcodes get added, and each version's unpickler can handle all the
pickle opcodes in all protocol versions to date. So old pickles continue to
be readable forever. The pickler can generally be told to restrict itself to
the subset of opcodes available under previous protocol versions too, so that
users can create pickles under the current version readable by older
versions. However, a pickle does not contain its version number embedded
within it. If an older unpickler tries to read a pickle using a later
protocol, the result is most likely an exception due to seeing an unknown (in
the older unpickler) opcode.
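A minimal sketch of restricting the pickler to an older opcode subset,
assuming Python 3. The helper name highest_proto is hypothetical (it is not
part of this module); it takes the highest .proto attribute among the opcodes
that genops() reports for a pickle.

```python
import pickle
import pickletools

def highest_proto(data):
    """Return the highest .proto value among the opcodes in a pickle.

    Hypothetical helper for illustration only.
    """
    return max(op.proto for op, arg, pos in pickletools.genops(data))

obj = [1, 2]
old_style = pickle.dumps(obj, protocol=0)   # restricted to protocol 0 opcodes
newer = pickle.dumps(obj, protocol=2)       # may use opcodes up to protocol 2

print(highest_proto(old_style), highest_proto(newer))
```

Both pickles load to the same list; only the opcodes used to describe it
differ.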

The original pickle used what's now called "protocol 0", and what was called
"text mode" before Python 2.3. The entire pickle bytestream is made up of
printable 7-bit ASCII characters, plus the newline character, in protocol 0.
That's why it was called text mode. Protocol 0 is small and elegant, but
sometimes painfully inefficient.

The second major set of additions is now called "protocol 1", and was called
"binary mode" before Python 2.3. This added many opcodes with arguments
consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
bytes. Binary mode pickles can be substantially smaller than equivalent
text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
int as 4 bytes following the opcode, which is cheaper to unpickle than the
(perhaps) 11-character decimal string attached to INT. Protocol 1 also added
a number of opcodes that operate on many stack elements at once (like APPENDS
and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
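The size difference between text mode and binary mode is easy to observe.
This is a rough sketch assuming Python 3; the sample data is arbitrary and the
exact byte counts depend on the data and the Python version.

```python
import pickle

values = list(range(1000, 1100))   # arbitrary sample data for illustration

text_mode = pickle.dumps(values, protocol=0)     # decimal literals, one APPEND each
binary_mode = pickle.dumps(values, protocol=1)   # binary ints, batched appends

print(len(text_mode), len(binary_mode))
```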

The third major set of additions came in Python 2.3, and is called "protocol
2". This added:

- A better way to pickle instances of new-style classes (NEWOBJ).

- A way for a pickle to identify its protocol (PROTO).

- Time- and space-efficient pickling of long ints (LONG{1,4}).

- Shortcuts for small tuples (TUPLE{1,2,3}).

- Dedicated opcodes for bools (NEWTRUE, NEWFALSE).

- The "extension registry", a vector of popular objects that can be pushed
  efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but
  the registry contents are predefined (there's nothing akin to the memo's
  PUT).

Another independent change with Python 2.3 is the abandonment of any
pretense that it might be safe to load pickles received from untrusted
parties -- no sufficient security analysis has been done to guarantee
this and there isn't a use case that warrants the expense of such an
analysis.

To this end, all tests for __safe_for_unpickling__ or for
copy_reg.safe_constructors are removed from the unpickling code.
References to these variables in the descriptions below are to be seen
as describing unpickling in Python 2.2 and before.

# Meta-rule: Descriptions are stored in instances of descriptor objects,
# with plain constructors. No meta-language is defined from which
# descriptors could be constructed. If you want, e.g., XML, write a little
# program to generate XML from the objects.

##############################################################################
# Some pickle opcodes have an argument, following the opcode in the
# bytestream. An argument is of a specific type, described by an instance
# of ArgumentDescriptor. These are not to be confused with arguments taken
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
# the opcode stream, immediately following an opcode.

# Represents the number of bytes consumed by an argument delimited by the
# next newline character.

# Represents the number of bytes consumed by a two-argument opcode where
# the first argument gives the number of bytes in the second argument.
TAKEN_FROM_ARGUMENT1 = -2   # num bytes is 1-byte unsigned int
TAKEN_FROM_ARGUMENT4 = -3   # num bytes is 4-byte signed little-endian int

class ArgumentDescriptor(object):
        # name of descriptor record, also a module global name; a string

        # length of argument, in bytes; an int; UP_TO_NEWLINE and
        # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length

        # a function taking a file-like object, reading this kind of argument
        # from the object at the current position, advancing the current
        # position by n bytes, and returning the value of the argument

        # human-readable docs for this arg descriptor; a string

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)

        assert isinstance(n, int) and (n >= 0 or
                                             TAKEN_FROM_ARGUMENT1,
                                             TAKEN_FROM_ARGUMENT4))

        assert isinstance(doc, str)

from struct import unpack as _unpack

    >>> read_uint1(StringIO.StringIO('\xff'))
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
    doc="One-byte unsigned integer.")

    >>> read_uint2(StringIO.StringIO('\xff\x00'))
    >>> read_uint2(StringIO.StringIO('\xff\xff'))
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
    doc="Two-byte unsigned integer, little-endian.")

    >>> read_int4(StringIO.StringIO('\xff\x00\x00\x00'))
    >>> read_int4(StringIO.StringIO('\x00\x00\x00\x80')) == -(2**31)
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
    doc="Four-byte signed integer, little-endian, 2's complement.")

def read_stringnl(f, decode=True, stripquotes=True):
    >>> read_stringnl(StringIO.StringIO("'abcd'\nefg\n"))

    >>> read_stringnl(StringIO.StringIO("\n"))
    Traceback (most recent call last):
    ValueError: no string quotes around ''

    >>> read_stringnl(StringIO.StringIO("\n"), stripquotes=False)

    >>> read_stringnl(StringIO.StringIO("''\n"))

    >>> read_stringnl(StringIO.StringIO('"abcd"'))
    Traceback (most recent call last):
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(StringIO.StringIO(r"'a\n\\b\x00c\td'" + "\n'e'"))

    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
            raise ValueError("no string quotes around %r" % data)

    # I'm not sure when 'string_escape' was added to the std codecs; it's
    # crazy not to use it if it's there.
        data = data.decode('string_escape')

stringnl = ArgumentDescriptor(
    reader=read_stringnl,
    doc="""A newline-terminated string.

    This is a repr-style string, with embedded escapes, and

def read_stringnl_noescape(f):
    return read_stringnl(f, decode=False, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
    name='stringnl_noescape',
    reader=read_stringnl_noescape,
    doc="""A newline-terminated string.

    This is a str-style string, without embedded escapes,
    or bracketing quotes. It should consist solely of
    printable ASCII characters.

def read_stringnl_noescape_pair(f):
    >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\nEmpty\njunk"))
    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
    name='stringnl_noescape_pair',
    reader=read_stringnl_noescape_pair,
    doc="""A pair of newline-terminated strings.

    These are str-style strings, without embedded
    escapes, or bracketing quotes. They should
    consist solely of printable ASCII characters.
    The pair is returned as a single string, with
    a single blank separating the two strings.

    >>> read_string4(StringIO.StringIO("\x00\x00\x00\x00abc"))
    >>> read_string4(StringIO.StringIO("\x03\x00\x00\x00abcdef"))
    >>> read_string4(StringIO.StringIO("\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ValueError: expected 50331648 bytes in a string4, but only 6 remain
        raise ValueError("string4 byte count < 0: %d" % n)
    raise ValueError("expected %d bytes in a string4, but only %d remain" %

string4 = ArgumentDescriptor(
    n=TAKEN_FROM_ARGUMENT4,
    doc="""A counted string.

    The first argument is a 4-byte little-endian signed int giving
    the number of bytes in the string, and the second argument is

    >>> read_string1(StringIO.StringIO("\x00"))
    >>> read_string1(StringIO.StringIO("\x03abcdef"))
    raise ValueError("expected %d bytes in a string1, but only %d remain" %

string1 = ArgumentDescriptor(
    n=TAKEN_FROM_ARGUMENT1,
    doc="""A counted string.

    The first argument is a 1-byte unsigned int giving the number
    of bytes in the string, and the second argument is that many

def read_unicodestringnl(f):
    >>> read_unicodestringnl(StringIO.StringIO("abc\uabcd\njunk"))
    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read "
    data = data[:-1]    # lose the newline
    return unicode(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
    name='unicodestringnl',
    reader=read_unicodestringnl,
    doc="""A newline-terminated Unicode string.

    This is raw-unicode-escape encoded, so consists of
    printable ASCII characters, and may contain embedded

def read_unicodestring4(f):
    >>> s = u'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> n = chr(len(enc)) + chr(0) * 3  # little-endian 4-byte length
    >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk'))
    >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1]))
    Traceback (most recent call last):
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
        raise ValueError("unicodestring4 byte count < 0: %d" % n)
        return unicode(data, 'utf-8')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
    name="unicodestring4",
    n=TAKEN_FROM_ARGUMENT4,
    reader=read_unicodestring4,
    doc="""A counted Unicode string.

    The first argument is a 4-byte little-endian signed int
    giving the number of bytes in the string, and the second
    argument -- the UTF-8 encoding of the Unicode string --
    contains that many bytes.

def read_decimalnl_short(f):
    >>> read_decimalnl_short(StringIO.StringIO("1234\n56"))
    >>> read_decimalnl_short(StringIO.StringIO("1234L\n56"))
    Traceback (most recent call last):
    ValueError: trailing 'L' not allowed in '1234L'
    s = read_stringnl(f, decode=False, stripquotes=False)
        raise ValueError("trailing 'L' not allowed in %r" % s)

    # It's not necessarily true that the result fits in a Python short int:
    # the pickle may have been written on a 64-bit box. There's also a hack
    # for True and False here.
    except OverflowError:

def read_decimalnl_long(f):
    >>> read_decimalnl_long(StringIO.StringIO("1234\n56"))
    Traceback (most recent call last):
    ValueError: trailing 'L' required in '1234'

    Someday the trailing 'L' will probably go away from this output.

    >>> read_decimalnl_long(StringIO.StringIO("1234L\n56"))
    >>> read_decimalnl_long(StringIO.StringIO("123456789012345678901234L\n6"))
    123456789012345678901234L
    s = read_stringnl(f, decode=False, stripquotes=False)
    if not s.endswith("L"):
        raise ValueError("trailing 'L' required in %r" % s)

decimalnl_short = ArgumentDescriptor(
    name='decimalnl_short',
    reader=read_decimalnl_short,
    doc="""A newline-terminated decimal integer literal.

    This never has a trailing 'L', and the integer fit
    in a short Python int on the box where the pickle
    was written -- but there's no guarantee it will fit
    in a short Python int on the box where the pickle

decimalnl_long = ArgumentDescriptor(
    name='decimalnl_long',
    reader=read_decimalnl_long,
    doc="""A newline-terminated decimal integer literal.

    This has a trailing 'L', and can represent integers

    >>> read_floatnl(StringIO.StringIO("-1.25\n6"))
    s = read_stringnl(f, decode=False, stripquotes=False)

floatnl = ArgumentDescriptor(
    doc="""A newline-terminated decimal floating literal.

    In general this requires 17 significant digits for roundtrip
    identity, and pickling then unpickling infinities, NaNs, and
    minus zero doesn't work across boxes, or on some boxes even
    on itself (e.g., Windows can't read the strings it produces
    for infinities or NaNs).

    >>> import StringIO, struct
    >>> raw = struct.pack(">d", -1.25)
    '\xbf\xf4\x00\x00\x00\x00\x00\x00'
    >>> read_float8(StringIO.StringIO(raw + "\n"))
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")

float8 = ArgumentDescriptor(
    doc="""An 8-byte binary representation of a float, big-endian.

    The format is unique to Python, and shared with the struct
    module (format string '>d') "in theory" (the struct and cPickle
    implementations don't share the code -- they should). It's
    strongly related to the IEEE-754 double format, and, in normal
    cases, is in fact identical to the big-endian 754 double format.
    On other boxes the dynamic range is limited to that of a 754
    double, and "add a half and chop" rounding is used to reduce
    the precision to 53 bits. However, even on a 754 box,
    infinities, NaNs, and minus zero may not be handled correctly
    (may not survive roundtrip pickling intact).

from pickle import decode_long

    >>> read_long1(StringIO.StringIO("\x00"))
    >>> read_long1(StringIO.StringIO("\x02\xff\x00"))
    >>> read_long1(StringIO.StringIO("\x02\xff\x7f"))
    >>> read_long1(StringIO.StringIO("\x02\x00\xff"))
    >>> read_long1(StringIO.StringIO("\x02\x00\x80"))
        raise ValueError("not enough data in stream to read long1")
    return decode_long(data)

long1 = ArgumentDescriptor(
    n=TAKEN_FROM_ARGUMENT1,
    doc="""A binary long, little-endian, using 1-byte size.

    This first reads one byte as an unsigned size, then reads that
    many bytes and interprets them as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the long 0L.

    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x00"))
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x7f"))
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\xff"))
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\x80"))
    >>> read_long4(StringIO.StringIO("\x00\x00\x00\x00"))
        raise ValueError("long4 byte count < 0: %d" % n)
        raise ValueError("not enough data in stream to read long4")
    return decode_long(data)

long4 = ArgumentDescriptor(
    n=TAKEN_FROM_ARGUMENT4,
    doc="""A binary representation of a long, little-endian.

    This first reads four bytes as a signed size (but requires the
    size to be >= 0), then reads that many bytes and interprets them
    as a little-endian 2's-complement long. If the size is 0, that's taken
    as a shortcut for the long 0L, although LONG1 should really be used
    then instead (and in any case where # of bytes < 256).

##############################################################################
# Object descriptors. The stack used by the pickle machine holds objects,
# and in the stack_before and stack_after attributes of OpcodeInfo
# descriptors we need names to describe the various types of objects that can
# appear on the stack.

class StackObject(object):
        # name of descriptor record, for info only

        # type of object, or tuple of type objects (meaning the object can
        # be of any type in the tuple)

        # human-readable docs for this kind of stack object; a string

    def __init__(self, name, obtype, doc):
        assert isinstance(name, str)

        assert isinstance(obtype, type) or isinstance(obtype, tuple)
        if isinstance(obtype, tuple):
            for contained in obtype:
                assert isinstance(contained, type)

        assert isinstance(doc, str)

    doc="A short (as opposed to long) Python integer object.")

pylong = StackObject(
    doc="A long (as opposed to short) Python integer object.")

pyinteger_or_bool = StackObject(
    obtype=(int, long, bool),
    doc="A Python integer object (short or long), or "

pybool = StackObject(
    doc="A Python bool object.")

pyfloat = StackObject(
    doc="A Python float object.")

pystring = StackObject(
    doc="A Python string object.")

pyunicode = StackObject(
    doc="A Python Unicode string object.")

pynone = StackObject(
    doc="The Python None object.")

pytuple = StackObject(
    doc="A Python tuple object.")

pylist = StackObject(
    doc="A Python list object.")

pydict = StackObject(
    doc="A Python dict object.")

anyobject = StackObject(
    doc="Any kind of object whatsoever.")

markobject = StackObject(
    doc="""'The mark' is a unique object.

    Opcodes that operate on a variable number of objects
    generally don't embed the count of objects in the opcode,
    or pull it off the stack. Instead the MARK opcode is used
    to push a special marker object on the stack, and then
    some other opcodes grab all the objects from the top of
    the stack down to (but not including) the topmost marker

stackslice = StackObject(
    doc="""An object representing a contiguous slice of the stack.

    This is used in conjunction with markobject, to represent all
    of the stack following the topmost markobject. For example,
    the POP_MARK opcode changes the stack from

        [..., markobject, stackslice]

    No matter how many objects are on the stack after the topmost
    markobject, POP_MARK gets rid of all of them (including the
    topmost markobject too).

##############################################################################
# Descriptors for pickle opcodes.

class OpcodeInfo(object):
        # symbolic name of opcode; a string

        # the code used in a bytestream to represent the opcode; a
        # one-character string

        # If the opcode has an argument embedded in the byte string, an
        # instance of ArgumentDescriptor specifying its type. Note that
        # arg.reader(s) can be used to read and decode the argument from
        # the bytestream s, and arg.doc documents the format of the raw
        # argument bytes. If the opcode doesn't have an argument embedded
        # in the bytestream, arg should be None.

        # what the stack looks like before this opcode runs; a list

        # what the stack looks like after this opcode runs; a list

        # the protocol number in which this opcode was introduced; an int

        # human-readable docs for this opcode; a string

    def __init__(self, name, code, arg,
                 stack_before, stack_after, proto, doc):
        assert isinstance(name, str)

        assert isinstance(code, str)
        assert len(code) == 1

        assert arg is None or isinstance(arg, ArgumentDescriptor)

        assert isinstance(stack_before, list)
        for x in stack_before:
            assert isinstance(x, StackObject)
        self.stack_before = stack_before

        assert isinstance(stack_after, list)
        for x in stack_after:
            assert isinstance(x, StackObject)
        self.stack_after = stack_after

        assert isinstance(proto, int) and 0 <= proto <= 2

        assert isinstance(doc, str)

    # Ways to spell integers.

      stack_after=[pyinteger_or_bool],
      doc="""Push an integer or bool.

      The argument is a newline-terminated decimal literal string.

      The intent may have been that this always fit in a short Python int,
      but INT can be generated in pickles written on a 64-bit box that
      require a Python long on a 32-bit box. The difference between this
      and LONG then is that INT skips a trailing 'L', and produces a short
      int whenever possible.

      Another difference arises because, when bool was introduced as a
      distinct type in 2.3, builtin names True and False were also added to
      2.2.2, mapping to ints 1 and 0. For compatibility in both directions,
      True gets pickled as INT + "I01\\n", and False as INT + "I00\\n".
      Leading zeroes are never produced for a genuine integer. The 2.3
      (and later) unpicklers special-case these and return bool instead;
      earlier unpicklers ignore the leading "0" and return the int.

      doc="""Push a four-byte signed integer.

      This handles the full range of Python (short) integers on a 32-bit
      box, directly as binary bytes (1 for the opcode and 4 for the integer).
      If the integer is non-negative and fits in 1 or 2 bytes, pickling via
      BININT1 or BININT2 saves space.

      doc="""Push a one-byte unsigned integer.

      This is a space optimization for pickling very small non-negative ints,

      doc="""Push a two-byte unsigned integer.

      This is a space optimization for pickling small positive ints, in
      range(256, 2**16). Integers in range(256) can also be pickled via
      BININT2, but BININT1 instead saves a byte.

      stack_after=[pylong],
      doc="""Push a long integer.

      The same as INT, except that the literal ends with 'L', and always
      unpickles to a Python long. There doesn't seem to be a real purpose to the

      Note that LONG takes time quadratic in the number of digits when
      unpickling (this is simply due to the nature of decimal->binary
      conversion). Proto 2 added linear-time (in C; still quadratic-time
      in Python) LONG1 and LONG4 opcodes.

      stack_after=[pylong],
      doc="""Long integer using one-byte length.

      A more efficient encoding of a Python long; the long1 encoding

      stack_after=[pylong],
      doc="""Long integer using four-byte length.

      A more efficient encoding of a Python long; the long4 encoding

    # Ways to spell strings (8-bit, not Unicode).

      stack_after=[pystring],
      doc="""Push a Python string object.

      The argument is a repr-style string, with bracketing quote characters,
      and perhaps embedded escapes. The argument extends until the next

      stack_after=[pystring],
      doc="""Push a Python string object.

      There are two arguments: the first is a 4-byte little-endian signed int
      giving the number of bytes in the string, and the second is that many
      bytes, which are taken literally as the string content.

    I(name='SHORT_BINSTRING',
      stack_after=[pystring],
      doc="""Push a Python string object.

      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes in the string, and the second is that many bytes,
      which are taken literally as the string content.

    # Ways to spell None.

      stack_after=[pynone],
      doc="Push None on the stack."),

    # Ways to spell bools, starting with proto 2. See INT for how this was
    # done before proto 2.

      stack_after=[pybool],
      Push True onto the stack."""),

      stack_after=[pybool],
      Push False onto the stack."""),

    # Ways to spell Unicode strings.

      arg=unicodestringnl,
      stack_after=[pyunicode],
      proto=0,  # this may be pure-text, but it's a later addition
      doc="""Push a Python Unicode string object.

      The argument is a raw-unicode-escape encoding of a Unicode string,
      and so may contain embedded escape sequences. The argument extends
      until the next newline character.

    I(name='BINUNICODE',
      stack_after=[pyunicode],
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is a 4-byte little-endian signed int
      giving the number of bytes in the string. The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.

    # Ways to spell floats.

      stack_after=[pyfloat],
      doc="""Newline-terminated decimal float literal.

      The argument is repr(a_float), and in general requires 17 significant
      digits for roundtrip conversion to be an identity (this is so for
      IEEE-754 double precision values, which is what Python float maps to

      In general, FLOAT cannot be used to transport infinities, NaNs, or
      minus zero across boxes (or even on a single box, if the platform C
      library can't read the strings it produces for such things -- Windows
      is like that), but may do less damage than BINFLOAT on boxes with
      greater precision or dynamic range than IEEE-754 double.

      stack_after=[pyfloat],
      doc="""Float stored in binary form, with 8 bytes of data.

      This generally requires less than half the space of FLOAT encoding.
      In general, BINFLOAT cannot be used to transport infinities, NaNs, or
      minus zero, raises an exception if the exponent exceeds the range of
      an IEEE-754 double, and retains no more than 53 bits of precision (if
      there are more than that, "add a half and chop" rounding is used to
      cut it back to 53 significant bits).

    # Ways to build lists.

    I(name='EMPTY_LIST',
      stack_after=[pylist],
      doc="Push an empty list."),

      stack_before=[pylist, anyobject],
      stack_after=[pylist],
      doc="""Append an object to a list.

      Stack before: ... pylist anyobject
      Stack after:  ... pylist+[anyobject]

      although pylist is really extended in-place.

      stack_before=[pylist, markobject, stackslice],
      stack_after=[pylist],
      doc="""Extend a list by a slice of stack objects.

      Stack before: ... pylist markobject stackslice
      Stack after:  ... pylist+stackslice

      although pylist is really extended in-place.

      stack_before=[markobject, stackslice],
      stack_after=[pylist],
      doc="""Build a list out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python list, which single list object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... [1, 2, 3, 'abc']

    # Ways to build tuples.

    I(name='EMPTY_TUPLE',
      stack_after=[pytuple],
      doc="Push an empty tuple."),

      stack_before=[markobject, stackslice],
      stack_after=[pytuple],
      doc="""Build a tuple out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python tuple, which single tuple object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... (1, 2, 3, 'abc')

      stack_before=[anyobject],
      stack_after=[pytuple],
      doc="""Build a one-tuple out of the topmost item on the stack.

      This code pops one value off the stack and pushes a tuple of
      length 1 whose one item is that value back onto it. In other

          stack[-1] = tuple(stack[-1:])

      stack_before=[anyobject, anyobject],
      stack_after=[pytuple],
      doc="""Build a two-tuple out of the top two items on the stack.

      This code pops two values off the stack and pushes a tuple of
      length 2 whose items are those values back onto it. In other

          stack[-2:] = [tuple(stack[-2:])]

      stack_before=[anyobject, anyobject, anyobject],
      stack_after=[pytuple],
      doc="""Build a three-tuple out of the top three items on the stack.

      This code pops three values off the stack and pushes a tuple of
      length 3 whose items are those values back onto it. In other

          stack[-3:] = [tuple(stack[-3:])]

    # Ways to build dicts.

    I(name='EMPTY_DICT',
      stack_after=[pydict],
      doc="Push an empty dict."),

      stack_before=[markobject, stackslice],
      stack_after=[pydict],
      doc="""Build a dict out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python dict, which single dict object replaces all of the
      stack from the topmost markobject onward. The stack slice alternates
      key, value, key, value, .... For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... {1: 2, 3: 'abc'}

      stack_before=[pydict, anyobject, anyobject],
      stack_after=[pydict],
      doc="""Add a key+value pair to an existing dict.

      Stack before: ... pydict key value
      Stack after:  ... pydict

      where pydict has been modified via pydict[key] = value.

      stack_before=[pydict, markobject, stackslice],
      stack_after=[pydict],
      doc="""Add an arbitrary number of key+value pairs to an existing dict.

      The slice of the stack following the topmost markobject is taken as
      an alternating sequence of keys and values, added to the dict
      immediately under the topmost markobject. Everything at and after the
      topmost markobject is popped, leaving the mutated dict at the top

      Stack before: ... pydict markobject key_1 value_1 ... key_n value_n
      Stack after:  ... pydict

      where pydict has been modified via pydict[key_i] = value_i for i in
      1, 2, ..., n, and in that order.
1318
    # Stack manipulation.

    I(name='POP',
      code='0',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="Discard the top stack item, shrinking the stack by one item."),

    I(name='DUP',
      code='2',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject, anyobject],
      proto=0,
      doc="Push the top stack item onto the stack again, duplicating it."),

    I(name='MARK',
      code='(',
      arg=None,
      stack_before=[],
      stack_after=[markobject],
      proto=0,
      doc="""Push markobject onto the stack.

      markobject is a unique object, used by other opcodes to identify a
      region of the stack containing a variable number of objects for them
      to work on. See markobject.doc for more detail.
      """),

    I(name='POP_MARK',
      code='1',
      arg=None,
      stack_before=[markobject, stackslice],
      stack_after=[],
      proto=1,
      doc="""Pop all the stack objects at and above the topmost markobject.

      When an opcode using a variable number of stack objects is done,
      POP_MARK is used to remove those objects, and to remove the markobject
      that delimited their starting position on the stack.
      """),

    # Memo manipulation. There are really only two operations (get and put),
    # each in all-text, "short binary", and "long binary" flavors.

    I(name='GET',
      code='g',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the newline-terminated
      decimal string following. BINGET and LONG_BINGET are space-optimized
      versions.
      """),

    I(name='BINGET',
      code='h',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 1-byte unsigned
      integer following.
      """),

    I(name='LONG_BINGET',
      code='j',
      arg=int4,
      stack_before=[],
      stack_after=[anyobject],
      proto=1,
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 4-byte signed
      little-endian integer following.
      """),

    I(name='PUT',
      code='p',
      arg=decimalnl_short,
      stack_before=[],
      stack_after=[],
      proto=0,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the newline-
      terminated decimal string following. BINPUT and LONG_BINPUT are
      space-optimized versions.
      """),

    I(name='BINPUT',
      code='q',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 1-byte
      unsigned integer following.
      """),

    I(name='LONG_BINPUT',
      code='r',
      arg=int4,
      stack_before=[],
      stack_after=[],
      proto=1,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 4-byte
      signed little-endian integer following.
      """),

    # Access the extension registry (predefined objects). Akin to the GET
    # family.

    I(name='EXT1',
      code='\x82',
      arg=uint1,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      This code and the similar EXT2 and EXT4 allow using a registry
      of popular objects that are pickled by name, typically classes.
      It is envisioned that through a global negotiation and
      registration process, third parties can set up a mapping between
      ints and object names.

      In order to guarantee pickle interchangeability, the extension
      code registry ought to be global, although a range of codes may
      be reserved for private use.

      EXT1 has a 1-byte integer argument. This is used to index into the
      extension registry, and the object at that index is pushed on the stack.
      """),

    I(name='EXT2',
      code='\x83',
      arg=uint2,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      See EXT1. EXT2 has a two-byte integer argument.
      """),

    I(name='EXT4',
      code='\x84',
      arg=int4,
      stack_before=[],
      stack_after=[anyobject],
      proto=2,
      doc="""Extension code.

      See EXT1. EXT4 has a four-byte integer argument.
      """),

    # Push a class object, or module function, on the stack, via its module
    # and name.

    I(name='GLOBAL',
      code='c',
      arg=stringnl_noescape_pair,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push a global object (module.attr) on the stack.

      Two newline-terminated strings follow the GLOBAL opcode. The first is
      taken as a module name, and the second as a class name. The class
      object module.class is pushed on the stack. More accurately, the
      object returned by self.find_class(module, class) is pushed on the
      stack, so unpickling subclasses can override this form of lookup.
      """),

    # Ways to build objects of classes pickle doesn't know about directly
    # (user-defined classes). I despair of documenting this accurately
    # and comprehensibly -- you really have to read the pickle code to
    # find all the special cases.

    I(name='REDUCE',
      code='R',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object built from a callable and an argument tuple.

      The opcode is named as a reminder of the __reduce__() method.

      Stack before: ... callable pytuple
      Stack after:  ... callable(*pytuple)

      The callable and the argument tuple are the first two items returned
      by a __reduce__ method. Applying the callable to the argtuple is
      supposed to reproduce the original object, or at least get it started.
      If the __reduce__ method returns a 3-tuple, the last component is an
      argument to be passed to the object's __setstate__, and then the REDUCE
      opcode is followed by code to create setstate's argument, and then a
      BUILD opcode to apply __setstate__ to that argument.

      If type(callable) is not ClassType, REDUCE complains unless the
      callable has been registered with the copy_reg module's
      safe_constructors dict, or the callable has a magic
      '__safe_for_unpickling__' attribute with a true value. I'm not sure
      why it does this, but I've sure seen this complaint often enough when
      I didn't want to <wink>.
      """),

    I(name='BUILD',
      code='b',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=0,
      doc="""Finish building an object, via __setstate__ or dict update.

      Stack before: ... anyobject argument
      Stack after:  ... anyobject

      where anyobject may have been mutated, as follows:

      If the object has a __setstate__ method,

          anyobject.__setstate__(argument)

      is called.

      Else the argument must be a dict, the object must have a __dict__, and
      the object is updated via

          anyobject.__dict__.update(argument)

      This may raise RuntimeError in restricted execution mode (which
      disallows access to __dict__ directly); in that case, the object
      is updated instead via

          for k, v in argument.items():
              setattr(anyobject, k, v)
      """),

    I(name='INST',
      code='i',
      arg=stringnl_noescape_pair,
      stack_before=[markobject, stackslice],
      stack_after=[anyobject],
      proto=0,
      doc="""Build a class instance.

      This is the protocol 0 version of protocol 1's OBJ opcode.
      INST is followed by two newline-terminated strings, giving a
      module and class name, just as for the GLOBAL opcode (and see
      GLOBAL for more details about that). self.find_class(module, name)
      is used to get a class object.

      In addition, all the objects on the stack following the topmost
      markobject are gathered into a tuple and popped (along with the
      topmost markobject), just as for the TUPLE opcode.

      Now it gets complicated. If all of these are true:

        + The argtuple is empty (markobject was at the top of the stack
          at the start).

        + It's an old-style class object (the type of the class object is
          ClassType).

        + The class object does not have a __getinitargs__ attribute.

      then we want to create an old-style class instance without invoking
      its __init__() method (pickle has waffled on this over the years; not
      calling __init__() is current wisdom). In this case, an instance of
      an old-style dummy class is created, and then we try to rebind its
      __class__ attribute to the desired class object. If this succeeds,
      the new instance object is pushed on the stack, and we're done. In
      restricted execution mode it can fail (assignment to __class__ is
      disallowed), and I'm not really sure what happens then -- it looks
      like the code ends up calling the class object's __init__ anyway,
      via falling into the next case.

      Else (the argtuple is not empty, it's not an old-style class object,
      or the class object does have a __getinitargs__ attribute), the code
      first insists that the class object have a __safe_for_unpickling__
      attribute. Unlike the __safe_for_unpickling__ check in REDUCE,
      it doesn't matter whether this attribute has a true or false value, it
      only matters whether it exists (XXX this is a bug; cPickle
      requires the attribute to be true). If __safe_for_unpickling__
      doesn't exist, UnpicklingError is raised.

      Else (the class object does have a __safe_for_unpickling__ attr),
      the class object obtained from INST's arguments is applied to the
      argtuple obtained from the stack, and the resulting instance object
      is pushed on the stack.

      NOTE: checks for __safe_for_unpickling__ went away in Python 2.3.
      """),

    I(name='OBJ',
      code='o',
      arg=None,
      stack_before=[markobject, anyobject, stackslice],
      stack_after=[anyobject],
      proto=1,
      doc="""Build a class instance.

      This is the protocol 1 version of protocol 0's INST opcode, and is
      very much like it. The major difference is that the class object
      is taken off the stack, allowing it to be retrieved from the memo
      repeatedly if several instances of the same class are created. This
      can be much more efficient (in both time and space) than repeatedly
      embedding the module and class names in INST opcodes.

      Unlike INST, OBJ takes no arguments from the opcode stream. Instead
      the class object is taken off the stack, immediately above the
      topmost markobject:

      Stack before: ... markobject classobject stackslice
      Stack after:  ... new_instance_object

      As for INST, the remainder of the stack above the markobject is
      gathered into an argument tuple, and then the logic seems identical,
      except that no __safe_for_unpickling__ check is done (XXX this is
      a bug; cPickle does test __safe_for_unpickling__). See INST for
      more detail.

      NOTE: In Python 2.3, INST and OBJ are identical except for how they
      get the class object. That was always the intent; the implementations
      had diverged for accidental reasons.
      """),

    I(name='NEWOBJ',
      code='\x81',
      arg=None,
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      proto=2,
      doc="""Build an object instance.

      The stack before should be thought of as containing a class
      object followed by an argument tuple (the tuple being the stack
      top). Call these cls and args. They are popped off the stack,
      and the value returned by cls.__new__(cls, *args) is pushed back
      onto the stack.
      """),

    # Machine control.

    I(name='PROTO',
      code='\x80',
      arg=uint1,
      stack_before=[],
      stack_after=[],
      proto=2,
      doc="""Protocol version indicator.

      For protocol 2 and above, a pickle must start with this opcode.
      The argument is the protocol version, an int in range(2, 256).
      """),

    I(name='STOP',
      code='.',
      arg=None,
      stack_before=[anyobject],
      stack_after=[],
      proto=0,
      doc="""Stop the unpickling machine.

      Every pickle ends with this opcode. The object at the top of the stack
      is popped, and that's the result of unpickling. The stack should be
      empty then.
      """),

    # Ways to deal with persistent IDs.

    I(name='PERSID',
      code='P',
      arg=stringnl_noescape,
      stack_before=[],
      stack_after=[anyobject],
      proto=0,
      doc="""Push an object identified by a persistent ID.

      The pickle module doesn't define what a persistent ID means. PERSID's
      argument is a newline-terminated str-style (no embedded escapes, no
      bracketing quote characters) string, which *is* "the persistent ID".
      The unpickler passes this string to self.persistent_load(). Whatever
      object that returns is pushed on the stack. There is no implementation
      of persistent_load() in Python's unpickler: it must be supplied by an
      unpickler subclass.
      """),

    I(name='BINPERSID',
      code='Q',
      arg=None,
      stack_before=[anyobject],
      stack_after=[anyobject],
      proto=1,
      doc="""Push an object identified by a persistent ID.

      Like PERSID, except the persistent ID is popped off the stack (instead
      of being a string embedded in the opcode bytestream). The persistent
      ID is passed to self.persistent_load(), and whatever object that
      returns is pushed on the stack. See PERSID for more detail.
      """),
]
del I

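# Illustration only (not part of this module): a minimal sketch of how the
# PERSID/BINPERSID docs above play out in practice. It is written for
# Python 3's pickle module, unlike the Python 2 code in this file. STORE,
# StorePickler, StoreUnpickler and roundtrip are hypothetical names made up
# for the example; persistent_id() and persistent_load() are the hooks that
# pickle itself defines.

```python
import io
import pickle

# Hypothetical external store mapping persistent IDs to live objects.
STORE = {"cfg:main": {"debug": True}}


class StorePickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Return an ID for objects that live in STORE, None for all others.
        for pid, stored in STORE.items():
            if obj is stored:
                return pid
        return None


class StoreUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # Called once per persistent-ID opcode found in the stream.
        try:
            return STORE[pid]
        except KeyError:
            raise pickle.UnpicklingError("unknown persistent ID %r" % (pid,))


def roundtrip(obj, protocol=2):
    """Pickle obj with StorePickler and load it back with StoreUnpickler."""
    buf = io.BytesIO()
    StorePickler(buf, protocol).dump(obj)
    buf.seek(0)
    return StoreUnpickler(buf).load()


restored = roundtrip(["payload", STORE["cfg:main"]])
```

# The stored dict is never serialized: only its ID travels in the pickle, and
# the unpickler hands back the very same object from STORE.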
# Verify uniqueness of .name and .code members.
name2i = {}
code2i = {}

for i, d in enumerate(opcodes):
    if d.name in name2i:
        raise ValueError("repeated name %r at indices %d and %d" %
                         (d.name, name2i[d.name], i))
    if d.code in code2i:
        raise ValueError("repeated code %r at indices %d and %d" %
                         (d.code, code2i[d.code], i))

    name2i[d.name] = i
    code2i[d.code] = i

del name2i, code2i, i, d
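
# Illustration only (not part of this module): the stack and memo effects
# documented in the opcode table can be sketched with a toy machine. This is
# a simplified model under stated assumptions, not how pickle.py is written:
# run(), pop_to_mark() and the "PUSH" pseudo-opcode (standing in for INT,
# STRING and friends) are invented for the example.

```python
MARK = object()  # stand-in for markobject


def pop_to_mark(stack):
    """Pop everything above the topmost MARK, and the MARK; return the items."""
    items = []
    while stack[-1] is not MARK:
        items.append(stack.pop())
    stack.pop()  # discard the MARK itself
    items.reverse()
    return items


def run(program):
    """Run (opname, arg) pairs on a tiny stack-plus-memo machine."""
    stack, memo = [], {}
    for name, arg in program:
        if name == "PUSH":            # stand-in for INT, STRING, ...
            stack.append(arg)
        elif name == "MARK":
            stack.append(MARK)
        elif name == "TUPLE":
            stack.append(tuple(pop_to_mark(stack)))
        elif name == "TUPLE2":
            stack[-2:] = [tuple(stack[-2:])]
        elif name == "EMPTY_DICT":
            stack.append({})
        elif name == "SETITEMS":
            items = pop_to_mark(stack)
            target = stack[-1]        # the dict just under the MARK
            for key, value in zip(items[0::2], items[1::2]):
                target[key] = value
        elif name == "PUT":
            memo[arg] = stack[-1]     # the stack is not popped
        elif name == "GET":
            stack.append(memo[arg])
        elif name == "POP":
            stack.pop()
        elif name == "STOP":
            return stack.pop()
        else:
            raise ValueError("unsupported opcode %r" % (name,))
    raise ValueError("program ended before STOP")


result = run([
    ("EMPTY_DICT", None), ("PUT", 0),
    ("MARK", None),
    ("PUSH", 1), ("PUSH", 2), ("PUSH", 3), ("PUSH", "abc"),
    ("SETITEMS", None),
    ("GET", 0),
    ("TUPLE2", None),
    ("STOP", None),
])
```

# The program builds one dict, stores it in memo slot 0, and fetches it again,
# so both members of the resulting 2-tuple are the same object.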

##############################################################################
# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
# Also ensure we've got the same stuff as pickle.py, although the
# introspection here is dicey.

code2op = {}
for d in opcodes:
    code2op[d.code] = d
del d

def assure_pickle_consistency(verbose=False):
    import pickle, re

    copy = code2op.copy()
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            if verbose:
                print "skipping %r: it doesn't look like an opcode name" % name
            continue
        picklecode = getattr(pickle, name)
        if not isinstance(picklecode, str) or len(picklecode) != 1:
            if verbose:
                print ("skipping %r: value %r doesn't look like a pickle "
                       "code" % (name, picklecode))
            continue
        if picklecode in copy:
            if verbose:
                print "checking name %r w/ code %r for consistency" % (
                      name, picklecode)
            d = copy[picklecode]
            if d.name != name:
                raise ValueError("for pickle code %r, pickle.py uses name %r "
                                 "but we're using name %r" % (picklecode,
                                                              name,
                                                              d.name))
            # Forget this one. Any left over in copy at the end are a problem
            # of a different kind.
            del copy[picklecode]
        else:
            raise ValueError("pickle.py appears to have a pickle opcode with "
                             "name %r and code %r, but we don't" %
                             (name, picklecode))
    if copy:
        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
        for code, d in copy.items():
            msg.append("    name %r with code %r" % (d.name, code))
        raise ValueError("\n".join(msg))

assure_pickle_consistency()
del assure_pickle_consistency
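
# Illustration only (not part of this module): because the check above deletes
# itself after running, the same invariant can be re-verified from outside.
# The sketch below targets Python 3, where pickle's opcode constants are
# one-byte bytes objects and OpcodeInfo.code is a one-character str.
# mismatched_opcodes() is a hypothetical helper written for this example.

```python
import pickle
import pickletools


def mismatched_opcodes():
    """Return opcode names whose byte in pickle differs from pickletools."""
    bad = []
    for op in pickletools.opcodes:
        expected = op.code.encode("latin-1")  # 1-char str -> 1-byte bytes
        if getattr(pickle, op.name, None) != expected:
            bad.append(op.name)
    return bad


codes = [op.code for op in pickletools.opcodes]
names = [op.name for op in pickletools.opcodes]
problems = mismatched_opcodes()
```

# An empty `problems` list, together with duplicate-free `codes` and `names`,
# corresponds to the two module-level checks above passing.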

##############################################################################
# A pickle opcode generator.

def genops(pickle):
    """Generate all the opcodes in a pickle.

    'pickle' is a file-like object, or string, containing the pickle.

    Each opcode in the pickle is generated, from the current pickle position,
    stopping after a STOP opcode is delivered. A triple is generated for
    each opcode:

        opcode, arg, pos

    opcode is an OpcodeInfo record, describing the current opcode.

    If the opcode has an argument embedded in the pickle, arg is its decoded
    value, as a Python object. If the opcode doesn't have an argument, arg
    is None.

    If the pickle has a tell() method, pos was the value of pickle.tell()
    before reading the current opcode. If the pickle is a string object,
    it's wrapped in a StringIO object, and the latter's tell() result is
    used. Else (the pickle doesn't have a tell(), and it's not obvious how
    to query its current position) pos is None.
    """

    import cStringIO as StringIO

    if isinstance(pickle, str):
        pickle = StringIO.StringIO(pickle)

    if hasattr(pickle, "tell"):
        getpos = pickle.tell
    else:
        getpos = lambda: None

    while True:
        pos = getpos()
        code = pickle.read(1)
        opcode = code2op.get(code)
        if opcode is None:
            if code == "":
                raise ValueError("pickle exhausted before seeing STOP")
            else:
                raise ValueError("at position %s, opcode %r unknown" % (
                                 pos is None and "<unknown>" or pos,
                                 code))
        if opcode.arg is None:
            arg = None
        else:
            arg = opcode.arg.reader(pickle)
        yield opcode, arg, pos
        if code == '.':
            assert opcode.name == 'STOP'
            break
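
# Illustration only (not part of this module): a short usage sketch for
# genops(), written against Python 3's pickletools, where the pickle is a
# bytes object rather than a str. The variable names are arbitrary. It also
# shows the "protocol identifier" idea from the top of this file: the protocol
# is the highest .proto value among the opcodes seen.

```python
import pickle
import pickletools

data = pickle.dumps({"a": (1, 2)}, protocol=2)

# Each item is (OpcodeInfo, decoded argument or None, byte offset).
trace = [(op.name, arg, pos) for op, arg, pos in pickletools.genops(data)]
names = [name for name, _, _ in trace]
positions = [pos for _, _, pos in trace]

# Highest protocol number required by any opcode in the stream.
highest = max(op.proto for op, _, _ in pickletools.genops(data))
```

# The generator stops right after delivering STOP, so the last position is the
# offset of the final byte of the pickle.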
##############################################################################
# A pickle optimizer.

def optimize(p):
    'Optimize a pickle string by removing unused PUT opcodes'
    gets = set()            # set of args used by a GET opcode
    puts = []               # (arg, startpos, stoppos) for the PUT opcodes
    prevpos = None          # set to pos if previous opcode was a PUT
    for opcode, arg, pos in genops(p):
        if prevpos is not None:
            puts.append((prevarg, prevpos, pos))
            prevpos = None
        if 'PUT' in opcode.name:
            prevarg, prevpos = arg, pos
        elif 'GET' in opcode.name:
            gets.add(arg)

    # Copy the pickle string except for PUTS without a corresponding GET
    s = []
    i = 0
    for arg, start, stop in puts:
        j = stop if (arg in gets) else start
        s.append(p[i:j])
        i = stop
    s.append(p[i:])
    return ''.join(s)

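# Illustration only (not part of this module): a usage sketch for optimize(),
# written against Python 3's pickletools. put_count() is a hypothetical helper
# for the example. Only the list that is referenced twice needs its memo
# entry; the PUTs for the outer list and the string are never read back.

```python
import pickle
import pickletools

shared = [1, 2, 3]
raw = pickle.dumps([shared, shared, "unshared"], protocol=2)
slim = pickletools.optimize(raw)


def put_count(data):
    """Count PUT-family opcodes (PUT, BINPUT, LONG_BINPUT) in a pickle."""
    return sum(1 for op, _, _ in pickletools.genops(data) if "PUT" in op.name)


before, after = put_count(raw), put_count(slim)
restored = pickle.loads(slim)
```

# The optimized pickle is shorter, still loads to an equal object, and keeps
# the sharing between the first two list items.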
##############################################################################
# A symbolic pickle disassembler.

def dis(pickle, out=None, memo=None, indentlevel=4):
    """Produce a symbolic disassembly of a pickle.

    'pickle' is a file-like object, or string, containing a (at least one)
    pickle. The pickle is disassembled from the current position, through
    the first STOP opcode encountered.

    Optional arg 'out' is a file-like object to which the disassembly is
    printed. It defaults to sys.stdout.

    Optional arg 'memo' is a Python dict, used as the pickle's memo. It
    may be mutated by dis(), if the pickle contains PUT, BINPUT, or
    LONG_BINPUT opcodes.
    Passing the same memo object to another dis() call then allows disassembly
    to proceed across multiple pickles that were all created by the same
    pickler with the same memo. Ordinarily you don't need to worry about this.

    Optional arg indentlevel is the number of blanks by which to indent
    a new MARK level. It defaults to 4.

    In addition to printing the disassembly, some sanity checks are made:

    + All embedded opcode arguments "make sense".

    + Explicit and implicit pop operations have enough items on the stack.

    + When an opcode implicitly refers to a markobject, a markobject is
      actually on the stack.

    + A memo entry isn't referenced before it's defined.

    + The markobject isn't stored in the memo.

    + A memo entry isn't redefined.
    """

    # Most of the hair here is for sanity checks, but most of it is needed
    # anyway to detect when a protocol 0 POP takes a MARK off the stack
    # (which in turn is needed to indent MARK blocks correctly).

    stack = []          # crude emulation of unpickler stack
    if memo is None:
        memo = {}       # crude emulation of unpickler memo
    maxproto = -1       # max protocol number seen
    markstack = []      # bytecode positions of MARK opcodes
    indentchunk = ' ' * indentlevel
    errormsg = None
    for opcode, arg, pos in genops(pickle):
        if pos is not None:
            print >> out, "%5d:" % pos,

        line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
                              indentchunk * len(markstack),
                              opcode.name)

        maxproto = max(maxproto, opcode.proto)
        before = opcode.stack_before    # don't mutate
        after = opcode.stack_after      # don't mutate
        numtopop = len(before)

        # See whether a MARK should be popped.
        markmsg = None
        if markobject in before or (opcode.name == "POP" and
                                    stack and
                                    stack[-1] is markobject):
            assert markobject not in after
            if __debug__:
                if markobject in before:
                    assert before[-1] is stackslice
            if markstack:
                markpos = markstack.pop()
                if markpos is None:
                    markmsg = "(MARK at unknown opcode offset)"
                else:
                    markmsg = "(MARK at %d)" % markpos
                # Pop everything at and after the topmost markobject.
                while stack[-1] is not markobject:
                    stack.pop()
                stack.pop()
                # Stop later code from popping too much.
                try:
                    numtopop = before.index(markobject)
                except ValueError:
                    assert opcode.name == "POP"
                    numtopop = 0
            else:
                errormsg = markmsg = "no MARK exists on stack"

        # Check for correct memo usage.
        if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"):
            assert arg is not None
            if arg in memo:
                errormsg = "memo key %r already defined" % arg
            elif not stack:
                errormsg = "stack is empty -- can't store into memo"
            elif stack[-1] is markobject:
                errormsg = "can't store markobject in the memo"
            else:
                memo[arg] = stack[-1]

        elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
            if arg in memo:
                assert len(after) == 1
                after = [memo[arg]]     # for better stack emulation
            else:
                errormsg = "memo key %r has never been stored into" % arg

        if arg is not None or markmsg:
            # make a mild effort to align arguments
            line += ' ' * (10 - len(opcode.name))
            if arg is not None:
                line += ' ' + repr(arg)
            if markmsg:
                line += ' ' + markmsg
        print >> out, line

        if errormsg:
            # Note that we delayed complaining until the offending opcode
            # was printed.
            raise ValueError(errormsg)

        # Emulate the stack effects.
        if len(stack) < numtopop:
            raise ValueError("tries to pop %d items from stack with "
                             "only %d items" % (numtopop, len(stack)))
        if numtopop:
            del stack[-numtopop:]
        if markobject in after:
            assert markobject not in before
            markstack.append(pos)

        stack.extend(after)

    print >> out, "highest protocol among opcodes =", maxproto
    if stack:
        raise ValueError("stack not empty after STOP: %r" % stack)

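# Illustration only (not part of this module): a usage sketch for dis(),
# written against Python 3's pickletools, showing both the listing and one of
# the sanity checks described in the docstring. disassemble() is a
# hypothetical wrapper that captures the output in a string.

```python
import io
import pickle
import pickletools


def disassemble(data):
    """Return the text that pickletools.dis() prints for data."""
    out = io.StringIO()
    pickletools.dis(data, out=out)
    return out.getvalue()


listing = disassemble(pickle.dumps((1, 2), protocol=2))

# A hand-written pickle whose GET refers to a memo slot that was never
# stored into: dis() prints the offending opcode, then raises ValueError.
try:
    disassemble(b"g0\n.")
except ValueError as exc:
    error = str(exc)
else:
    error = None
```

# The listing ends with the summary line naming the highest protocol seen.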
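# For use in the doctest, simply as an example of a class to pickle.
class _Example:
    def __init__(self, value):
        self.value = value

_dis_test = r"""
>>> import pickle
>>> x = [1, 2, (3, 4), {'abc': u"def"}]
>>> pkl = pickle.dumps(x, 0)
>>> dis(pkl)
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: I    INT        1
    8: a    APPEND
    9: I    INT        2
   12: a    APPEND
   13: (    MARK
   14: I        INT        3
   17: I        INT        4
   20: t        TUPLE      (MARK at 13)
   21: p    PUT        1
   24: a    APPEND
   25: (    MARK
   26: d        DICT       (MARK at 25)
   27: p    PUT        2
   30: S    STRING     'abc'
   37: p    PUT        3
   40: V    UNICODE    u'def'
   45: p    PUT        4
   48: s    SETITEM
   49: a    APPEND
   50: .    STOP
highest protocol among opcodes = 0

Try again with a "binary" pickle.

>>> pkl = pickle.dumps(x, 1)
>>> dis(pkl)
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: K        BININT1    1
    6: K        BININT1    2
    8: (        MARK
    9: K            BININT1    3
   11: K            BININT1    4
   13: t            TUPLE      (MARK at 8)
   14: q        BINPUT     1
   16: }        EMPTY_DICT
   17: q        BINPUT     2
   19: U        SHORT_BINSTRING 'abc'
   24: q        BINPUT     3
   26: X        BINUNICODE u'def'
   34: q        BINPUT     4
   36: s        SETITEM
   37: e        APPENDS    (MARK at 3)
   38: .    STOP
highest protocol among opcodes = 1

Exercise the INST/OBJ/BUILD family.

>>> import pickletools
>>> dis(pickle.dumps(pickletools.dis, 0))
    0: c    GLOBAL     'pickletools dis'
   17: p    PUT        0
   20: .    STOP
highest protocol among opcodes = 0

>>> from pickletools import _Example
>>> x = [_Example(42)] * 2
>>> dis(pickle.dumps(x, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: (    MARK
    6: i        INST       'pickletools _Example' (MARK at 5)
   28: p    PUT        1
   31: (    MARK
   32: d        DICT       (MARK at 31)
   33: p    PUT        2
   36: S    STRING     'value'
   45: p    PUT        3
   48: I    INT        42
   52: s    SETITEM
   53: b    BUILD
   54: a    APPEND
   55: g    GET        1
   58: a    APPEND
   59: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(x, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: (        MARK
    5: c            GLOBAL     'pickletools _Example'
   27: q            BINPUT     1
   29: o            OBJ        (MARK at 4)
   30: q        BINPUT     2
   32: }        EMPTY_DICT
   33: q        BINPUT     3
   35: U        SHORT_BINSTRING 'value'
   42: q        BINPUT     4
   44: K        BININT1    42
   46: s        SETITEM
   47: b        BUILD
   48: h        BINGET     2
   50: e        APPENDS    (MARK at 3)
   51: .    STOP
highest protocol among opcodes = 1

Try "the canonical" recursive-object test.

>>> L = []
>>> T = L,
>>> L.append(T)
>>> L[0] is T
True
>>> T[0] is L
True
>>> L[0][0] is L
True
>>> T[0][0] is T
True
>>> dis(pickle.dumps(L, 0))
    0: (    MARK
    1: l        LIST       (MARK at 0)
    2: p    PUT        0
    5: (    MARK
    6: g        GET        0
    9: t        TUPLE      (MARK at 5)
   10: p    PUT        1
   13: a    APPEND
   14: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(L, 1))
    0: ]    EMPTY_LIST
    1: q    BINPUT     0
    3: (    MARK
    4: h        BINGET     0
    6: t        TUPLE      (MARK at 3)
    7: q    BINPUT     1
    9: a    APPEND
   10: .    STOP
highest protocol among opcodes = 1

Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
has to emulate the stack in order to realize that the POP opcode at 16 gets
rid of the MARK at 0.

>>> dis(pickle.dumps(T, 0))
    0: (    MARK
    1: (        MARK
    2: l            LIST       (MARK at 1)
    3: p        PUT        0
    6: (        MARK
    7: g            GET        0
   10: t            TUPLE      (MARK at 6)
   11: p        PUT        1
   14: a        APPEND
   15: 0        POP
   16: 0        POP        (MARK at 0)
   17: g    GET        1
   20: .    STOP
highest protocol among opcodes = 0

>>> dis(pickle.dumps(T, 1))
    0: (    MARK
    1: ]        EMPTY_LIST
    2: q        BINPUT     0
    4: (        MARK
    5: h            BINGET     0
    7: t            TUPLE      (MARK at 4)
    8: q        BINPUT     1
   10: a        APPEND
   11: 1        POP_MARK   (MARK at 0)
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 1

Try protocol 2.

>>> dis(pickle.dumps(L, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: .    STOP
highest protocol among opcodes = 2

>>> dis(pickle.dumps(T, 2))
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: h    BINGET     0
    7: \x85 TUPLE1
    8: q    BINPUT     1
   10: a    APPEND
   11: 0    POP
   12: h    BINGET     1
   14: .    STOP
highest protocol among opcodes = 2
"""

_memo_test = r"""
>>> import pickle
>>> from StringIO import StringIO
>>> f = StringIO()
>>> p = pickle.Pickler(f, 2)
>>> x = [1, 2, 3]
>>> p.dump(x)
>>> p.dump(x)
>>> f.seek(0)
>>> memo = {}
>>> dis(f, memo=memo)
    0: \x80 PROTO      2
    2: ]    EMPTY_LIST
    3: q    BINPUT     0
    5: (    MARK
    6: K        BININT1    1
    8: K        BININT1    2
   10: K        BININT1    3
   12: e        APPENDS    (MARK at 5)
   13: .    STOP
highest protocol among opcodes = 2
>>> dis(f, memo=memo)
   14: \x80 PROTO      2
   16: h    BINGET     0
   18: .    STOP
highest protocol among opcodes = 2
"""

__test__ = {'disassembler_test': _dis_test,
            'disassembler_memo_test': _memo_test,
           }

def _test():
    import doctest
    return doctest.testmod()

if __name__ == "__main__":
    _test()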
1
'''"Executable documentation" for the pickle module.
3
Extensive comments about the pickle protocols and pickle-machine opcodes
4
can be found here. Some functions meant for external use:
7
Generate all the opcodes in a pickle, as (opcode, arg, position) triples.
9
dis(pickle, out=None, memo=None, indentlevel=4)
10
Print a symbolic disassembly of a pickle.
13
__all__ = ['dis', 'genops', 'optimize']
17
# - A pickle verifier: read a pickle and check it exhaustively for
18
# well-formedness. dis() does a lot of this already.
20
# - A protocol identifier: examine a pickle and return its protocol number
21
# (== the highest .proto attr value among all the opcodes in the pickle).
22
# dis() already prints this info at the end.
24
# - A pickle optimizer: for example, tuple-building code is sometimes more
25
# elaborate than necessary, catering for the possibility that the tuple
26
# is recursive. Or lots of times a PUT is generated that's never accessed
31
"A pickle" is a program for a virtual pickle machine (PM, but more accurately
32
called an unpickling machine). It's a sequence of opcodes, interpreted by the
33
PM, building an arbitrarily complex Python object.
35
For the most part, the PM is very simple: there are no looping, testing, or
36
conditional instructions, no arithmetic and no function calls. Opcodes are
37
executed once each, from first to last, until a STOP opcode is reached.
39
The PM has two data areas, "the stack" and "the memo".
41
Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
42
integer object on the stack, whose value is gotten from a decimal string
43
literal immediately following the INT opcode in the pickle bytestream. Other
44
opcodes take Python objects off the stack. The result of unpickling is
45
whatever object is left on the stack when the final STOP opcode is executed.
47
The memo is simply an array of objects, or it can be implemented as a dict
48
mapping little integers to objects. The memo serves as the PM's "long term
49
memory", and the little integers indexing the memo are akin to variable
50
names. Some opcodes pop a stack object into the memo at a given index,
51
and others push a memo object at a given index onto the stack again.
53
At heart, that's all the PM has. Subtleties arise for these reasons:
55
+ Object identity. Objects can be arbitrarily complex, and subobjects
56
may be shared (for example, the list [a, a] refers to the same object a
57
twice). It can be vital that unpickling recreate an isomorphic object
58
graph, faithfully reproducing sharing.
60
+ Recursive objects. For example, after "L = []; L.append(L)", L is a
61
list, and L[0] is the same list. This is related to the object identity
62
point, and some sequences of pickle opcodes are subtle in order to
63
get the right result in all cases.
65
+ Things pickle doesn't know everything about. Examples of things pickle
66
does know everything about are Python's builtin scalar and container
67
types, like ints and tuples. They generally have opcodes dedicated to
68
them. For things like module references and instances of user-defined
69
classes, pickle's knowledge is limited. Historically, many enhancements
70
have been made to the pickle protocol in order to do a better (faster,
71
and/or more compact) job on those.
73
+ Backward compatibility and micro-optimization. As explained below,
74
pickle opcodes never go away, not even when better ways to do a thing
75
get invented. The repertoire of the PM just keeps growing over time.
76
For example, protocol 0 had two opcodes for building Python integers (INT
77
and LONG), protocol 1 added three more for more-efficient pickling of short
78
integers, and protocol 2 added two more for more-efficient pickling of
79
long integers (before protocol 2, the only ways to pickle a Python long
80
took time quadratic in the number of digits, for both pickling and
81
unpickling). "Opcode bloat" isn't so much a subtlety as a source of
82
wearying complication.
87
For compatibility, the meaning of a pickle opcode never changes. Instead new
88
pickle opcodes get added, and each version's unpickler can handle all the
89
pickle opcodes in all protocol versions to date. So old pickles continue to
90
be readable forever. The pickler can generally be told to restrict itself to
91
the subset of opcodes available under previous protocol versions too, so that
92
users can create pickles under the current version readable by older
93
versions. However, a pickle does not contain its version number embedded
94
within it. If an older unpickler tries to read a pickle using a later
95
protocol, the result is most likely an exception due to seeing an unknown (in
96
the older unpickler) opcode.
98
The original pickle used what's now called "protocol 0", and what was called
99
"text mode" before Python 2.3. The entire pickle bytestream is made up of
100
printable 7-bit ASCII characters, plus the newline character, in protocol 0.
101
That's why it was called text mode. Protocol 0 is small and elegant, but
102
sometimes painfully inefficient.
104
The second major set of additions is now called "protocol 1", and was called
105
"binary mode" before Python 2.3. This added many opcodes with arguments
106
consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
107
bytes. Binary mode pickles can be substantially smaller than equivalent
108
text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
109
int as 4 bytes following the opcode, which is cheaper to unpickle than the
110
(perhaps) 11-character decimal string attached to INT. Protocol 1 also added
111
a number of opcodes that operate on many stack elements at once (like APPENDS
112
and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
114
The third major set of additions came in Python 2.3, and is called "protocol
117
- A better way to pickle instances of new-style classes (NEWOBJ).
119
- A way for a pickle to identify its protocol (PROTO).
121
- Time- and space- efficient pickling of long ints (LONG{1,4}).
123
- Shortcuts for small tuples (TUPLE{1,2,3}}.
125
- Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
127
- The "extension registry", a vector of popular objects that can be pushed
128
efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but
129
the registry contents are predefined (there's nothing akin to the memo's
132
Another independent change with Python 2.3 is the abandonment of any
133
pretense that it might be safe to load pickles received from untrusted
134
parties -- no sufficient security analysis has been done to guarantee
135
this and there isn't a use case that warrants the expense of such an
138
To this end, all tests for __safe_for_unpickling__ or for
139
copy_reg.safe_constructors are removed from the unpickling code.
140
References to these variables in the descriptions below are to be seen
141
as describing unpickling in Python 2.2 and before.
144
# Meta-rule: Descriptions are stored in instances of descriptor objects,
145
# with plain constructors. No meta-language is defined from which
146
# descriptors could be constructed. If you want, e.g., XML, write a little
147
# program to generate XML from the objects.
149
##############################################################################
150
# Some pickle opcodes have an argument, following the opcode in the
151
# bytestream. An argument is of a specific type, described by an instance
152
# of ArgumentDescriptor. These are not to be confused with arguments taken
153
# off the stack -- ArgumentDescriptor applies only to arguments embedded in
154
# the opcode stream, immediately following an opcode.
156
# Represents the number of bytes consumed by an argument delimited by the
157
# next newline character.
160
# Represents the number of bytes consumed by a two-argument opcode where
161
# the first argument gives the number of bytes in the second argument.
162
TAKEN_FROM_ARGUMENT1 = -2 # num bytes is 1-byte unsigned int
163
TAKEN_FROM_ARGUMENT4 = -3 # num bytes is 4-byte signed little-endian int
class ArgumentDescriptor(object):
    # name of descriptor record, also a module global name; a string

    # length of argument, in bytes; an int; UP_TO_NEWLINE and
    # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length

    # a function taking a file-like object, reading this kind of argument
    # from the object at the current position, advancing the current
    # position by n bytes, and returning the value of the argument

    # human-readable docs for this arg descriptor; a string

    def __init__(self, name, n, reader, doc):
        assert isinstance(name, str)

        assert isinstance(n, int) and (n >= 0 or
                                             TAKEN_FROM_ARGUMENT1,
                                             TAKEN_FROM_ARGUMENT4))

        assert isinstance(doc, str)
from struct import unpack as _unpack

    >>> read_uint1(StringIO.StringIO('\xff'))
    raise ValueError("not enough data in stream to read uint1")

uint1 = ArgumentDescriptor(
            doc="One-byte unsigned integer.")

    >>> read_uint2(StringIO.StringIO('\xff\x00'))
    >>> read_uint2(StringIO.StringIO('\xff\xff'))
        return _unpack("<H", data)[0]
    raise ValueError("not enough data in stream to read uint2")

uint2 = ArgumentDescriptor(
            doc="Two-byte unsigned integer, little-endian.")

    >>> read_int4(StringIO.StringIO('\xff\x00\x00\x00'))
    >>> read_int4(StringIO.StringIO('\x00\x00\x00\x80')) == -(2**31)
        return _unpack("<i", data)[0]
    raise ValueError("not enough data in stream to read int4")

int4 = ArgumentDescriptor(
           doc="Four-byte signed integer, little-endian, 2's complement.")
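
# Illustration only, not part of the original module: a minimal sketch of
# how the fixed-width readers above can be expressed with the struct module.
# It assumes Python 3 and io.BytesIO (the doctests in this file use the
# Python 2 StringIO module); read_fixed is a hypothetical helper name.

```python
import io
import struct


def read_fixed(stream, fmt):
    """Read one fixed-width value described by a struct format string."""
    size = struct.calcsize(fmt)
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("not enough data in stream to read %r" % fmt)
    return struct.unpack(fmt, data)[0]


# "<" selects little-endian with standard sizes: B is 1 byte (unsigned),
# H is 2 bytes (unsigned), i is 4 bytes (signed, 2's complement).
stream = io.BytesIO(b"\xff" + b"\xff\x00" + b"\x00\x00\x00\x80")
uint1_value = read_fixed(stream, "<B")
uint2_value = read_fixed(stream, "<H")
int4_value = read_fixed(stream, "<i")
```

# Each call advances the stream by exactly the width of the value it read,
# which is the contract the ArgumentDescriptor reader comment describes.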
def read_stringnl(f, decode=True, stripquotes=True):
    >>> read_stringnl(StringIO.StringIO("'abcd'\nefg\n"))

    >>> read_stringnl(StringIO.StringIO("\n"))
    Traceback (most recent call last):
    ValueError: no string quotes around ''

    >>> read_stringnl(StringIO.StringIO("\n"), stripquotes=False)

    >>> read_stringnl(StringIO.StringIO("''\n"))

    >>> read_stringnl(StringIO.StringIO('"abcd"'))
    Traceback (most recent call last):
    ValueError: no newline found when trying to read stringnl

    Embedded escapes are undone in the result.
    >>> read_stringnl(StringIO.StringIO(r"'a\n\\b\x00c\td'" + "\n'e'"))

    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read stringnl")
    data = data[:-1]    # lose the newline

            if data.startswith(q):
                if not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
            raise ValueError("no string quotes around %r" % data)

    # I'm not sure when 'string_escape' was added to the std codecs; it's
    # crazy not to use it if it's there.
        data = data.decode('string_escape')

stringnl = ArgumentDescriptor(
               reader=read_stringnl,
               doc="""A newline-terminated string.

                   This is a repr-style string, with embedded escapes, and
def read_stringnl_noescape(f):
    return read_stringnl(f, decode=False, stripquotes=False)

stringnl_noescape = ArgumentDescriptor(
                        name='stringnl_noescape',
                        reader=read_stringnl_noescape,
                        doc="""A newline-terminated string.

                        This is a str-style string, without embedded escapes,
                        or bracketing quotes. It should consist solely of
                        printable ASCII characters.

def read_stringnl_noescape_pair(f):
    >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\nEmpty\njunk"))
    return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))

stringnl_noescape_pair = ArgumentDescriptor(
                             name='stringnl_noescape_pair',
                             reader=read_stringnl_noescape_pair,
                             doc="""A pair of newline-terminated strings.

                             These are str-style strings, without embedded
                             escapes, or bracketing quotes. They should
                             consist solely of printable ASCII characters.
                             The pair is returned as a single string, with
                             a single blank separating the two strings.
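
# Illustration only, not part of the original module: a minimal sketch of
# the newline-terminated argument readers, assuming Python 3 byte streams.
# read_line_argument and read_pair are hypothetical names, and unlike
# read_stringnl this sketch does not undo embedded escapes.

```python
import io


def read_line_argument(stream, stripquotes=False):
    """Read one newline-terminated argument and return it as text."""
    data = stream.readline()
    if not data.endswith(b"\n"):
        raise ValueError("no newline found when trying to read argument")
    data = data[:-1]  # lose the newline

    if stripquotes:
        for q in (b"'", b'"'):
            if data.startswith(q):
                if len(data) < 2 or not data.endswith(q):
                    raise ValueError("string quote %r not found at both "
                                     "ends of %r" % (q, data))
                data = data[1:-1]
                break
        else:
            # Neither quote character opened the argument.
            raise ValueError("no string quotes around %r" % data)

    return data.decode("ascii")


def read_pair(stream):
    """Read two newline-terminated arguments, joined by a single blank."""
    first = read_line_argument(stream)
    second = read_line_argument(stream)
    return "%s %s" % (first, second)


pair = read_pair(io.BytesIO(b"Queue\nEmpty\njunk"))
quoted = read_line_argument(io.BytesIO(b"'abcd'\nefg\n"), stripquotes=True)
```

# The trailing "junk" and "efg" are left unread in the stream; each reader
# consumes only up to and including its own terminating newline.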
    >>> read_string4(StringIO.StringIO("\x00\x00\x00\x00abc"))
    >>> read_string4(StringIO.StringIO("\x03\x00\x00\x00abcdef"))
    >>> read_string4(StringIO.StringIO("\x00\x00\x00\x03abcdef"))
    Traceback (most recent call last):
    ValueError: expected 50331648 bytes in a string4, but only 6 remain

        raise ValueError("string4 byte count < 0: %d" % n)
    raise ValueError("expected %d bytes in a string4, but only %d remain" %

string4 = ArgumentDescriptor(
              n=TAKEN_FROM_ARGUMENT4,
              doc="""A counted string.

              The first argument is a 4-byte little-endian signed int giving
              the number of bytes in the string, and the second argument is

    >>> read_string1(StringIO.StringIO("\x00"))
    >>> read_string1(StringIO.StringIO("\x03abcdef"))
    raise ValueError("expected %d bytes in a string1, but only %d remain" %

string1 = ArgumentDescriptor(
              n=TAKEN_FROM_ARGUMENT1,
              doc="""A counted string.

              The first argument is a 1-byte unsigned int giving the number
              of bytes in the string, and the second argument is that many
def read_unicodestringnl(f):
    >>> read_unicodestringnl(StringIO.StringIO("abc\uabcd\njunk"))

    if not data.endswith('\n'):
        raise ValueError("no newline found when trying to read "
    data = data[:-1]    # lose the newline
    return unicode(data, 'raw-unicode-escape')

unicodestringnl = ArgumentDescriptor(
                      name='unicodestringnl',
                      reader=read_unicodestringnl,
                      doc="""A newline-terminated Unicode string.

                      This is raw-unicode-escape encoded, so consists of
                      printable ASCII characters, and may contain embedded

def read_unicodestring4(f):
    >>> s = u'abcd\uabcd'
    >>> enc = s.encode('utf-8')
    >>> n = chr(len(enc)) + chr(0) * 3  # little-endian 4-byte length
    >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk'))

    >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1]))
    Traceback (most recent call last):
    ValueError: expected 7 bytes in a unicodestring4, but only 6 remain

        raise ValueError("unicodestring4 byte count < 0: %d" % n)
        return unicode(data, 'utf-8')
    raise ValueError("expected %d bytes in a unicodestring4, but only %d "
                     "remain" % (n, len(data)))

unicodestring4 = ArgumentDescriptor(
                    name="unicodestring4",
                    n=TAKEN_FROM_ARGUMENT4,
                    reader=read_unicodestring4,
                    doc="""A counted Unicode string.

                    The first argument is a 4-byte little-endian signed int
                    giving the number of bytes in the string, and the second
                    argument -- the UTF-8 encoding of the Unicode string --
                    contains that many bytes.
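
# Illustration only, not part of the original module: a minimal sketch of
# the counted-argument pattern shared by string1, string4 and
# unicodestring4, assuming Python 3. read_counted is a hypothetical helper;
# the size prefix is described by a struct format ("<B" for the 1-byte
# unsigned count, "<i" for the 4-byte signed little-endian count).

```python
import io
import struct


def read_counted(stream, size_format):
    """Read a size prefix, then exactly that many payload bytes."""
    size_width = struct.calcsize(size_format)
    header = stream.read(size_width)
    if len(header) != size_width:
        raise ValueError("not enough data in stream to read the byte count")
    n = struct.unpack(size_format, header)[0]
    if n < 0:
        raise ValueError("byte count < 0: %d" % n)
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("expected %d bytes, but only %d remain"
                         % (n, len(data)))
    return data


text = "abcd\uabcd"
enc = text.encode("utf-8")  # 4 ASCII bytes plus a 3-byte sequence
payload = struct.pack("<i", len(enc)) + enc + b"junk"

decoded = read_counted(io.BytesIO(payload), "<i").decode("utf-8")
short = read_counted(io.BytesIO(b"\x03abcdef"), "<B")
```

# A count that is read with the wrong byte order shows up as an absurdly
# large length, which is why the short-read check reports both numbers.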
def read_decimalnl_short(f):
    >>> read_decimalnl_short(StringIO.StringIO("1234\n56"))

    >>> read_decimalnl_short(StringIO.StringIO("1234L\n56"))
    Traceback (most recent call last):
    ValueError: trailing 'L' not allowed in '1234L'

    s = read_stringnl(f, decode=False, stripquotes=False)
        raise ValueError("trailing 'L' not allowed in %r" % s)

    # It's not necessarily true that the result fits in a Python short int:
    # the pickle may have been written on a 64-bit box. There's also a hack
    # for True and False here.
    except OverflowError:

def read_decimalnl_long(f):
    >>> read_decimalnl_long(StringIO.StringIO("1234\n56"))
    Traceback (most recent call last):
    ValueError: trailing 'L' required in '1234'

    Someday the trailing 'L' will probably go away from this output.

    >>> read_decimalnl_long(StringIO.StringIO("1234L\n56"))

    >>> read_decimalnl_long(StringIO.StringIO("123456789012345678901234L\n6"))
    123456789012345678901234L

    s = read_stringnl(f, decode=False, stripquotes=False)
    if not s.endswith("L"):
        raise ValueError("trailing 'L' required in %r" % s)

decimalnl_short = ArgumentDescriptor(
                      name='decimalnl_short',
                      reader=read_decimalnl_short,
                      doc="""A newline-terminated decimal integer literal.

                          This never has a trailing 'L', and the integer fit
                          in a short Python int on the box where the pickle
                          was written -- but there's no guarantee it will fit
                          in a short Python int on the box where the pickle

decimalnl_long = ArgumentDescriptor(
                     name='decimalnl_long',
                     reader=read_decimalnl_long,
                     doc="""A newline-terminated decimal integer literal.

                         This has a trailing 'L', and can represent integers
    >>> read_floatnl(StringIO.StringIO("-1.25\n6"))
    s = read_stringnl(f, decode=False, stripquotes=False)

floatnl = ArgumentDescriptor(
              doc="""A newline-terminated decimal floating literal.

              In general this requires 17 significant digits for roundtrip
              identity, and pickling then unpickling infinities, NaNs, and
              minus zero doesn't work across boxes, or on some boxes even
              on itself (e.g., Windows can't read the strings it produces
              for infinities or NaNs).

    >>> import StringIO, struct
    >>> raw = struct.pack(">d", -1.25)
    '\xbf\xf4\x00\x00\x00\x00\x00\x00'
    >>> read_float8(StringIO.StringIO(raw + "\n"))
        return _unpack(">d", data)[0]
    raise ValueError("not enough data in stream to read float8")

float8 = ArgumentDescriptor(
             doc="""An 8-byte binary representation of a float, big-endian.

             The format is unique to Python, and shared with the struct
             module (format string '>d') "in theory" (the struct and cPickle
             implementations don't share the code -- they should). It's
             strongly related to the IEEE-754 double format, and, in normal
             cases, is in fact identical to the big-endian 754 double format.
             On other boxes the dynamic range is limited to that of a 754
             double, and "add a half and chop" rounding is used to reduce
             the precision to 53 bits. However, even on a 754 box,
             infinities, NaNs, and minus zero may not be handled correctly
             (may not survive roundtrip pickling intact).
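
# Illustration only, not part of the original module: a small sketch
# contrasting the two float spellings described above, assuming Python 3
# on an IEEE-754 platform. The textual form is illustrated with "%.17g",
# matching the 17-significant-digit remark; the binary form uses the
# struct format '>d' that the float8 docs name.

```python
import struct

value = -1.25

# Binary spelling: always 8 bytes, big-endian.
raw = struct.pack(">d", value)
from_binary = struct.unpack(">d", raw)[0]

# Textual spelling: 17 significant digits are enough to round-trip any
# finite IEEE-754 double through decimal text.
awkward = 0.1
as_text = "%.17g" % awkward
from_text = float(as_text)
```

# For -1.25 the packed bytes are '\xbf\xf4' followed by six zero bytes,
# as in the read_float8 doctest above.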
from pickle import decode_long

    >>> read_long1(StringIO.StringIO("\x00"))
    >>> read_long1(StringIO.StringIO("\x02\xff\x00"))
    >>> read_long1(StringIO.StringIO("\x02\xff\x7f"))
    >>> read_long1(StringIO.StringIO("\x02\x00\xff"))
    >>> read_long1(StringIO.StringIO("\x02\x00\x80"))
        raise ValueError("not enough data in stream to read long1")
    return decode_long(data)

long1 = ArgumentDescriptor(
    n=TAKEN_FROM_ARGUMENT1,
    doc="""A binary long, little-endian, using 1-byte size.

    This first reads one byte as an unsigned size, then reads that
    many bytes and interprets them as a little-endian 2's-complement long.
    If the size is 0, that's taken as a shortcut for the long 0L.

    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x00"))
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x7f"))
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\xff"))
    >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\x80"))
    >>> read_long4(StringIO.StringIO("\x00\x00\x00\x00"))
        raise ValueError("long4 byte count < 0: %d" % n)
        raise ValueError("not enough data in stream to read long4")
    return decode_long(data)

long4 = ArgumentDescriptor(
    n=TAKEN_FROM_ARGUMENT4,
    doc="""A binary representation of a long, little-endian.

    This first reads four bytes as a signed size (but requires the
    size to be >= 0), then reads that many bytes and interprets them
    as a little-endian 2's-complement long. If the size is 0, that's taken
    as a shortcut for the long 0L, although LONG1 should really be used
    then instead (and in any case where # of bytes < 256).
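
# Illustration only, not part of the original module: a minimal sketch of
# the long1 encoding described above, assuming Python 3, where
# int.from_bytes performs the little-endian 2's-complement decoding that
# pickle.decode_long does here. The *_sketch names are hypothetical.

```python
import io


def decode_long_sketch(data):
    """Decode little-endian 2's-complement bytes; empty input means 0."""
    if not data:
        return 0
    return int.from_bytes(data, "little", signed=True)


def read_long1_sketch(stream):
    """Read a 1-byte unsigned size, then that many bytes of integer data."""
    size = stream.read(1)
    if len(size) != 1:
        raise ValueError("not enough data in stream to read long1 size")
    n = size[0]
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("not enough data in stream to read long1")
    return decode_long_sketch(data)


zero = read_long1_sketch(io.BytesIO(b"\x00"))
small = read_long1_sketch(io.BytesIO(b"\x02\xff\x00"))
largest = read_long1_sketch(io.BytesIO(b"\x02\xff\x7f"))
negative = read_long1_sketch(io.BytesIO(b"\x02\x00\xff"))
smallest = read_long1_sketch(io.BytesIO(b"\x02\x00\x80"))
```

# The high bit of the last byte carries the sign, so 255 needs a second,
# zero byte ('\xff\x00') to stay positive.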
##############################################################################
# Object descriptors. The stack used by the pickle machine holds objects,
# and in the stack_before and stack_after attributes of OpcodeInfo
# descriptors we need names to describe the various types of objects that can
# appear on the stack.

class StackObject(object):
    # name of descriptor record, for info only

    # type of object, or tuple of type objects (meaning the object can
    # be of any type in the tuple)

    # human-readable docs for this kind of stack object; a string

    def __init__(self, name, obtype, doc):
        assert isinstance(name, str)

        assert isinstance(obtype, type) or isinstance(obtype, tuple)
        if isinstance(obtype, tuple):
            for contained in obtype:
                assert isinstance(contained, type)

        assert isinstance(doc, str)
            doc="A short (as opposed to long) Python integer object.")

pylong = StackObject(
             doc="A long (as opposed to short) Python integer object.")

pyinteger_or_bool = StackObject(
                        obtype=(int, long, bool),
                        doc="A Python integer object (short or long), or "

pybool = StackObject(
             doc="A Python bool object.")

pyfloat = StackObject(
              doc="A Python float object.")

pystring = StackObject(
               doc="A Python string object.")

pyunicode = StackObject(
                doc="A Python Unicode string object.")

pynone = StackObject(
             doc="The Python None object.")

pytuple = StackObject(
              doc="A Python tuple object.")

pylist = StackObject(
             doc="A Python list object.")

pydict = StackObject(
             doc="A Python dict object.")

anyobject = StackObject(
                doc="Any kind of object whatsoever.")
markobject = StackObject(
                 doc="""'The mark' is a unique object.

                 Opcodes that operate on a variable number of objects
                 generally don't embed the count of objects in the opcode,
                 or pull it off the stack. Instead the MARK opcode is used
                 to push a special marker object on the stack, and then
                 some other opcodes grab all the objects from the top of
                 the stack down to (but not including) the topmost marker

stackslice = StackObject(
                 doc="""An object representing a contiguous slice of the stack.

                 This is used in conjunction with markobject, to represent all
                 of the stack following the topmost markobject. For example,
                 the POP_MARK opcode changes the stack from

                     [..., markobject, stackslice]

                 No matter how many objects are on the stack after the topmost
                 markobject, POP_MARK gets rid of all of them (including the
                 topmost markobject too).
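
# Illustration only, not part of the original module: a minimal sketch of
# how a mark delimits a stack slice, assuming Python 3 and modelling the
# PM stack as a plain list. MARK and pop_mark are hypothetical stand-ins,
# not the objects the real unpickler uses.

```python
MARK = object()  # unique sentinel; compared by identity, never by value


def pop_mark(stack):
    """Remove the topmost MARK and everything above it; return the slice."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] is MARK:
            items = stack[i + 1:]
            del stack[i:]
            return items
    raise ValueError("no mark on the stack")


# [..., markobject, stackslice] collapses to [...] plus whatever the
# consuming opcode pushes; here a list is built from the slice.
stack = ["below", MARK, 1, 2, 3, "abc"]
items = pop_mark(stack)
stack.append(list(items))

# Only the topmost mark is consumed when marks are nested.
nested = [MARK, 1, MARK, 2]
inner = pop_mark(nested)
```

# Searching from the top of the stack is what makes nested structures
# work: each consumer sees only the objects pushed since its own mark.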
##############################################################################
# Descriptors for pickle opcodes.

class OpcodeInfo(object):
    # symbolic name of opcode; a string

    # the code used in a bytestream to represent the opcode; a
    # one-character string

    # If the opcode has an argument embedded in the byte string, an
    # instance of ArgumentDescriptor specifying its type. Note that
    # arg.reader(s) can be used to read and decode the argument from
    # the bytestream s, and arg.doc documents the format of the raw
    # argument bytes. If the opcode doesn't have an argument embedded
    # in the bytestream, arg should be None.

    # what the stack looks like before this opcode runs; a list

    # what the stack looks like after this opcode runs; a list

    # the protocol number in which this opcode was introduced; an int

    # human-readable docs for this opcode; a string

    def __init__(self, name, code, arg,
                 stack_before, stack_after, proto, doc):
        assert isinstance(name, str)

        assert isinstance(code, str)
        assert len(code) == 1

        assert arg is None or isinstance(arg, ArgumentDescriptor)

        assert isinstance(stack_before, list)
        for x in stack_before:
            assert isinstance(x, StackObject)
        self.stack_before = stack_before

        assert isinstance(stack_after, list)
        for x in stack_after:
            assert isinstance(x, StackObject)
        self.stack_after = stack_after

        assert isinstance(proto, int) and 0 <= proto <= 2

        assert isinstance(doc, str)
    # Ways to spell integers.

      stack_after=[pyinteger_or_bool],
      doc="""Push an integer or bool.

      The argument is a newline-terminated decimal literal string.

      The intent may have been that this always fit in a short Python int,
      but INT can be generated in pickles written on a 64-bit box that
      require a Python long on a 32-bit box. The difference between this
      and LONG then is that INT skips a trailing 'L', and produces a short
      int whenever possible.

      Another difference arises because, when bool was introduced as a
      distinct type in 2.3, builtin names True and False were also added to
      2.2.2, mapping to ints 1 and 0. For compatibility in both directions,
      True gets pickled as INT + "I01\\n", and False as INT + "I00\\n".
      Leading zeroes are never produced for a genuine integer. The 2.3
      (and later) unpicklers special-case these and return bool instead;
      earlier unpicklers ignore the leading "0" and return the int.
      doc="""Push a four-byte signed integer.

      This handles the full range of Python (short) integers on a 32-bit
      box, directly as binary bytes (1 for the opcode and 4 for the integer).
      If the integer is non-negative and fits in 1 or 2 bytes, pickling via
      BININT1 or BININT2 saves space.

      doc="""Push a one-byte unsigned integer.

      This is a space optimization for pickling very small non-negative ints,

      doc="""Push a two-byte unsigned integer.

      This is a space optimization for pickling small positive ints, in
      range(256, 2**16). Integers in range(256) can also be pickled via
      BININT2, but BININT1 instead saves a byte.
      stack_after=[pylong],
      doc="""Push a long integer.

      The same as INT, except that the literal ends with 'L', and always
      unpickles to a Python long. There doesn't seem to be a real purpose to the

      Note that LONG takes time quadratic in the number of digits when
      unpickling (this is simply due to the nature of decimal->binary
      conversion). Proto 2 added linear-time (in C; still quadratic-time
      in Python) LONG1 and LONG4 opcodes.

      stack_after=[pylong],
      doc="""Long integer using one-byte length.

      A more efficient encoding of a Python long; the long1 encoding

      stack_after=[pylong],
      doc="""Long integer using four-byte length.

      A more efficient encoding of a Python long; the long4 encoding
    # Ways to spell strings (8-bit, not Unicode).

      stack_after=[pystring],
      doc="""Push a Python string object.

      The argument is a repr-style string, with bracketing quote characters,
      and perhaps embedded escapes. The argument extends until the next

      stack_after=[pystring],
      doc="""Push a Python string object.

      There are two arguments: the first is a 4-byte little-endian signed int
      giving the number of bytes in the string, and the second is that many
      bytes, which are taken literally as the string content.

    I(name='SHORT_BINSTRING',
      stack_after=[pystring],
      doc="""Push a Python string object.

      There are two arguments: the first is a 1-byte unsigned int giving
      the number of bytes in the string, and the second is that many bytes,
      which are taken literally as the string content.
    # Ways to spell None.

      stack_after=[pynone],
      doc="Push None on the stack."),

    # Ways to spell bools, starting with proto 2. See INT for how this was
    # done before proto 2.

      stack_after=[pybool],
      Push True onto the stack."""),

      stack_after=[pybool],
      Push False onto the stack."""),
    # Ways to spell Unicode strings.

      arg=unicodestringnl,
      stack_after=[pyunicode],
      proto=0,  # this may be pure-text, but it's a later addition
      doc="""Push a Python Unicode string object.

      The argument is a raw-unicode-escape encoding of a Unicode string,
      and so may contain embedded escape sequences. The argument extends
      until the next newline character.

    I(name='BINUNICODE',
      stack_after=[pyunicode],
      doc="""Push a Python Unicode string object.

      There are two arguments: the first is a 4-byte little-endian signed int
      giving the number of bytes in the string. The second is that many
      bytes, and is the UTF-8 encoding of the Unicode string.
    # Ways to spell floats.

      stack_after=[pyfloat],
      doc="""Newline-terminated decimal float literal.

      The argument is repr(a_float), and in general requires 17 significant
      digits for roundtrip conversion to be an identity (this is so for
      IEEE-754 double precision values, which is what Python float maps to

      In general, FLOAT cannot be used to transport infinities, NaNs, or
      minus zero across boxes (or even on a single box, if the platform C
      library can't read the strings it produces for such things -- Windows
      is like that), but may do less damage than BINFLOAT on boxes with
      greater precision or dynamic range than IEEE-754 double.

      stack_after=[pyfloat],
      doc="""Float stored in binary form, with 8 bytes of data.

      This generally requires less than half the space of FLOAT encoding.
      In general, BINFLOAT cannot be used to transport infinities, NaNs, or
      minus zero, raises an exception if the exponent exceeds the range of
      an IEEE-754 double, and retains no more than 53 bits of precision (if
      there are more than that, "add a half and chop" rounding is used to
      cut it back to 53 significant bits).
    # Ways to build lists.

    I(name='EMPTY_LIST',
      stack_after=[pylist],
      doc="Push an empty list."),

      stack_before=[pylist, anyobject],
      stack_after=[pylist],
      doc="""Append an object to a list.

      Stack before: ... pylist anyobject
      Stack after:  ... pylist+[anyobject]

      although pylist is really extended in-place.

      stack_before=[pylist, markobject, stackslice],
      stack_after=[pylist],
      doc="""Extend a list by a slice of stack objects.

      Stack before: ... pylist markobject stackslice
      Stack after:  ... pylist+stackslice

      although pylist is really extended in-place.

      stack_before=[markobject, stackslice],
      stack_after=[pylist],
      doc="""Build a list out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python list, which single list object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... [1, 2, 3, 'abc']
    # Ways to build tuples.

    I(name='EMPTY_TUPLE',
      stack_after=[pytuple],
      doc="Push an empty tuple."),

      stack_before=[markobject, stackslice],
      stack_after=[pytuple],
      doc="""Build a tuple out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python tuple, which single tuple object replaces all of the
      stack from the topmost markobject onward. For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... (1, 2, 3, 'abc')

      stack_before=[anyobject],
      stack_after=[pytuple],
      doc="""Build a one-tuple out of the topmost item on the stack.

      This code pops one value off the stack and pushes a tuple of
      length 1 whose one item is that value back onto it. In other

          stack[-1] = tuple(stack[-1:])

      stack_before=[anyobject, anyobject],
      stack_after=[pytuple],
      doc="""Build a two-tuple out of the top two items on the stack.

      This code pops two values off the stack and pushes a tuple of
      length 2 whose items are those values back onto it. In other

          stack[-2:] = [tuple(stack[-2:])]

      stack_before=[anyobject, anyobject, anyobject],
      stack_after=[pytuple],
      doc="""Build a three-tuple out of the top three items on the stack.

      This code pops three values off the stack and pushes a tuple of
      length 3 whose items are those values back onto it. In other

          stack[-3:] = [tuple(stack[-3:])]
    # Ways to build dicts.

    I(name='EMPTY_DICT',
      stack_after=[pydict],
      doc="Push an empty dict."),

      stack_before=[markobject, stackslice],
      stack_after=[pydict],
      doc="""Build a dict out of the topmost stack slice, after markobject.

      All the stack entries following the topmost markobject are placed into
      a single Python dict, which single dict object replaces all of the
      stack from the topmost markobject onward. The stack slice alternates
      key, value, key, value, .... For example,

      Stack before: ... markobject 1 2 3 'abc'
      Stack after:  ... {1: 2, 3: 'abc'}

      stack_before=[pydict, anyobject, anyobject],
      stack_after=[pydict],
      doc="""Add a key+value pair to an existing dict.

      Stack before: ... pydict key value
      Stack after:  ... pydict

      where pydict has been modified via pydict[key] = value.

      stack_before=[pydict, markobject, stackslice],
      stack_after=[pydict],
      doc="""Add an arbitrary number of key+value pairs to an existing dict.

      The slice of the stack following the topmost markobject is taken as
      an alternating sequence of keys and values, added to the dict
      immediately under the topmost markobject. Everything at and after the
      topmost markobject is popped, leaving the mutated dict at the top

      Stack before: ... pydict markobject key_1 value_1 ... key_n value_n
      Stack after:  ... pydict

      where pydict has been modified via pydict[key_i] = value_i for i in
      1, 2, ..., n, and in that order.
    # Stack manipulation.

      stack_before=[anyobject],
      doc="Discard the top stack item, shrinking the stack by one item."),

      stack_before=[anyobject],
      stack_after=[anyobject, anyobject],
      doc="Push the top stack item onto the stack again, duplicating it."),

      stack_after=[markobject],
      doc="""Push markobject onto the stack.

      markobject is a unique object, used by other opcodes to identify a
      region of the stack containing a variable number of objects for them
      to work on. See markobject.doc for more detail.

      stack_before=[markobject, stackslice],
      doc="""Pop all the stack objects at and above the topmost markobject.

      When an opcode using a variable number of stack objects is done,
      POP_MARK is used to remove those objects, and to remove the markobject
      that delimited their starting position on the stack.
    # Memo manipulation. There are really only two operations (get and put),
    # each in all-text, "short binary", and "long binary" flavors.

      arg=decimalnl_short,
      stack_after=[anyobject],
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the newline-terminated
      decimal string following. BINGET and LONG_BINGET are space-optimized

      stack_after=[anyobject],
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 1-byte unsigned

    I(name='LONG_BINGET',
      stack_after=[anyobject],
      doc="""Read an object from the memo and push it on the stack.

      The index of the memo object to push is given by the 4-byte signed
      little-endian integer following.

      arg=decimalnl_short,
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the newline-
      terminated decimal string following. BINPUT and LONG_BINPUT are
      space-optimized versions.

      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 1-byte
      unsigned integer following.

    I(name='LONG_BINPUT',
      doc="""Store the stack top into the memo. The stack is not popped.

      The index of the memo location to write into is given by the 4-byte
      signed little-endian integer following.
    # Access the extension registry (predefined objects). Akin to the GET

      stack_after=[anyobject],
      doc="""Extension code.

      This code and the similar EXT2 and EXT4 allow using a registry
      of popular objects that are pickled by name, typically classes.
      It is envisioned that through a global negotiation and
      registration process, third parties can set up a mapping between
      ints and object names.

      In order to guarantee pickle interchangeability, the extension
      code registry ought to be global, although a range of codes may
      be reserved for private use.

      EXT1 has a 1-byte integer argument. This is used to index into the
      extension registry, and the object at that index is pushed on the stack.

      stack_after=[anyobject],
      doc="""Extension code.

      See EXT1. EXT2 has a two-byte integer argument.

      stack_after=[anyobject],
      doc="""Extension code.

      See EXT1. EXT4 has a four-byte integer argument.
    # Push a class object, or module function, on the stack, via its module

      arg=stringnl_noescape_pair,
      stack_after=[anyobject],
      doc="""Push a global object (module.attr) on the stack.

      Two newline-terminated strings follow the GLOBAL opcode. The first is
      taken as a module name, and the second as a class name. The class
      object module.class is pushed on the stack. More accurately, the
      object returned by self.find_class(module, class) is pushed on the
      stack, so unpickling subclasses can override this form of lookup.
    # Ways to build objects of classes pickle doesn't know about directly
    # (user-defined classes). I despair of documenting this accurately
    # and comprehensibly -- you really have to read the pickle code to
    # find all the special cases.

      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      doc="""Push an object built from a callable and an argument tuple.

      The opcode is named as a reminder of the __reduce__() method.

      Stack before: ... callable pytuple
      Stack after:  ... callable(*pytuple)

      The callable and the argument tuple are the first two items returned
      by a __reduce__ method. Applying the callable to the argtuple is
      supposed to reproduce the original object, or at least get it started.
      If the __reduce__ method returns a 3-tuple, the last component is an
      argument to be passed to the object's __setstate__, and then the REDUCE
      opcode is followed by code to create setstate's argument, and then a
      BUILD opcode to apply __setstate__ to that argument.

      If type(callable) is not ClassType, REDUCE complains unless the
      callable has been registered with the copy_reg module's
      safe_constructors dict, or the callable has a magic
      '__safe_for_unpickling__' attribute with a true value. I'm not sure
      why it does this, but I've sure seen this complaint often enough when
      I didn't want to <wink>.
      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      doc="""Finish building an object, via __setstate__ or dict update.

      Stack before: ... anyobject argument
      Stack after:  ... anyobject

      where anyobject may have been mutated, as follows:

      If the object has a __setstate__ method,

          anyobject.__setstate__(argument)

      Else the argument must be a dict, the object must have a __dict__, and
      the object is updated via

          anyobject.__dict__.update(argument)

      This may raise RuntimeError in restricted execution mode (which
      disallows access to __dict__ directly); in that case, the object
      is updated instead via

          for k, v in argument.items():
      arg=stringnl_noescape_pair,
      stack_before=[markobject, stackslice],
      stack_after=[anyobject],
      doc="""Build a class instance.

      This is the protocol 0 version of protocol 1's OBJ opcode.
      INST is followed by two newline-terminated strings, giving a
      module and class name, just as for the GLOBAL opcode (and see
      GLOBAL for more details about that). self.find_class(module, name)
      is used to get a class object.

      In addition, all the objects on the stack following the topmost
      markobject are gathered into a tuple and popped (along with the
      topmost markobject), just as for the TUPLE opcode.

      Now it gets complicated. If all of these are true:

        + The argtuple is empty (markobject was at the top of the stack

        + It's an old-style class object (the type of the class object is

        + The class object does not have a __getinitargs__ attribute.

      then we want to create an old-style class instance without invoking
      its __init__() method (pickle has waffled on this over the years; not
      calling __init__() is current wisdom). In this case, an instance of
      an old-style dummy class is created, and then we try to rebind its
      __class__ attribute to the desired class object. If this succeeds,
      the new instance object is pushed on the stack, and we're done. In
      restricted execution mode it can fail (assignment to __class__ is
      disallowed), and I'm not really sure what happens then -- it looks
      like the code ends up calling the class object's __init__ anyway,
      via falling into the next case.

      Else (the argtuple is not empty, it's not an old-style class object,
      or the class object does have a __getinitargs__ attribute), the code
      first insists that the class object have a __safe_for_unpickling__
      attribute. Unlike the __safe_for_unpickling__ check in REDUCE,
      it doesn't matter whether this attribute has a true or false value, it
      only matters whether it exists (XXX this is a bug; cPickle
      requires the attribute to be true). If __safe_for_unpickling__
      doesn't exist, UnpicklingError is raised.

      Else (the class object does have a __safe_for_unpickling__ attr),
      the class object obtained from INST's arguments is applied to the
      argtuple obtained from the stack, and the resulting instance object
      is pushed on the stack.

      NOTE: checks for __safe_for_unpickling__ went away in Python 2.3.

      stack_before=[markobject, anyobject, stackslice],
      stack_after=[anyobject],
      doc="""Build a class instance.

      This is the protocol 1 version of protocol 0's INST opcode, and is
      very much like it. The major difference is that the class object
      is taken off the stack, allowing it to be retrieved from the memo
      repeatedly if several instances of the same class are created. This
      can be much more efficient (in both time and space) than repeatedly
      embedding the module and class names in INST opcodes.

      Unlike INST, OBJ takes no arguments from the opcode stream. Instead
      the class object is taken off the stack, immediately above the
      topmost markobject:

      Stack before: ... markobject classobject stackslice
      Stack after:  ... new_instance_object

      As for INST, the remainder of the stack above the markobject is
      gathered into an argument tuple, and then the logic seems identical,
      except that no __safe_for_unpickling__ check is done (XXX this is
      a bug; cPickle does test __safe_for_unpickling__). See INST.

      NOTE: In Python 2.3, INST and OBJ are identical except for how they
      get the class object. That was always the intent; the implementations
      had diverged for accidental reasons.

      stack_before=[anyobject, anyobject],
      stack_after=[anyobject],
      doc="""Build an object instance.

      The stack before should be thought of as containing a class
      object followed by an argument tuple (the tuple being the stack
      top). Call these cls and args. They are popped off the stack,
      and the value returned by cls.__new__(cls, *args) is pushed back
      onto the stack.

      doc="""Protocol version indicator.

      For protocol 2 and above, a pickle must start with this opcode.
      The argument is the protocol version, an int in range(2, 256).

      stack_before=[anyobject],
      doc="""Stop the unpickling machine.

      Every pickle ends with this opcode. The object at the top of the stack
      is popped, and that's the result of unpickling. The stack should be
      empty then.

    # Ways to deal with persistent IDs.

      arg=stringnl_noescape,
      stack_after=[anyobject],
      doc="""Push an object identified by a persistent ID.

      The pickle module doesn't define what a persistent ID means. PERSID's
      argument is a newline-terminated str-style (no embedded escapes, no
      bracketing quote characters) string, which *is* "the persistent ID".
      The unpickler passes this string to self.persistent_load(). Whatever
      object that returns is pushed on the stack. There is no implementation
      of persistent_load() in Python's unpickler: it must be supplied by an
      unpickler subclass.

      stack_before=[anyobject],
      stack_after=[anyobject],
      doc="""Push an object identified by a persistent ID.

      Like PERSID, except the persistent ID is popped off the stack (instead
      of being a string embedded in the opcode bytestream). The persistent
      ID is passed to self.persistent_load(), and whatever object that
      returns is pushed on the stack. See PERSID for more detail.

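# The PERSID and BINPERSID descriptions leave both the meaning of a
# persistent ID and its lookup to the application. The block below is a
# minimal sketch of that hook, assuming a Python 3 interpreter. Record,
# REGISTRY, RegistryPickler, RegistryUnpickler and the helper functions are
# hypothetical names used only for illustration; they are not part of this
# module.

```python
import io
import pickle
import pickletools


class Record(object):
    """Hypothetical object that lives in an external store."""

    def __init__(self, key):
        self.key = key


# Hypothetical external store, keyed by persistent ID.
REGISTRY = {"rec-1": Record("rec-1")}


class RegistryPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Returning None means "pickle this object in the usual way".
        if isinstance(obj, Record):
            return obj.key
        return None


class RegistryUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # Called with the persistent ID carried by PERSID or BINPERSID.
        return REGISTRY[pid]


def dump_with_registry(obj, protocol):
    buf = io.BytesIO()
    RegistryPickler(buf, protocol).dump(obj)
    return buf.getvalue()


def load_with_registry(data):
    return RegistryUnpickler(io.BytesIO(data)).load()


def opcode_names(data):
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]
```

# With this sketch, a Record is never serialized itself: only its key is
# written, and unpickling hands back the very object held in REGISTRY.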
# Verify uniqueness of .name and .code members.
name2i = {}
code2i = {}

for i, d in enumerate(opcodes):
    if d.name in name2i:
        raise ValueError("repeated name %r at indices %d and %d" %
                         (d.name, name2i[d.name], i))
    if d.code in code2i:
        raise ValueError("repeated code %r at indices %d and %d" %
                         (d.code, code2i[d.code], i))

    name2i[d.name] = i
    code2i[d.code] = i

del name2i, code2i, i, d

##############################################################################
# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
# Also ensure we've got the same stuff as pickle.py, although the
# introspection here is dicey.

code2op = {}
for d in opcodes:
    code2op[d.code] = d
del d

def assure_pickle_consistency(verbose=False):
    import pickle, re

    copy = code2op.copy()
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            if verbose:
                print "skipping %r: it doesn't look like an opcode name" % name
            continue
        picklecode = getattr(pickle, name)
        if not isinstance(picklecode, str) or len(picklecode) != 1:
            if verbose:
                print ("skipping %r: value %r doesn't look like a pickle "
                       "code" % (name, picklecode))
            continue
        if picklecode in copy:
            if verbose:
                print "checking name %r w/ code %r for consistency" % (
                      name, picklecode)
            d = copy[picklecode]
            if d.name != name:
                raise ValueError("for pickle code %r, pickle.py uses name %r "
                                 "but we're using name %r" % (picklecode,
                                                              name,
                                                              d.name))
            # Forget this one. Any left over in copy at the end are a problem
            # of a different kind.
            del copy[picklecode]
        else:
            raise ValueError("pickle.py appears to have a pickle opcode with "
                             "name %r and code %r, but we don't" %
                             (name, picklecode))
    if copy:
        msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
        for code, d in copy.items():
            msg.append("    name %r with code %r" % (d.name, code))
        raise ValueError("\n".join(msg))

assure_pickle_consistency()
del assure_pickle_consistency

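# The same cross-check can be repeated from outside the module using only
# public attributes. This is a rough sketch, assuming a Python 3 interpreter
# (where pickle stores its opcode constants as bytes); the function name
# opcode_name_mismatches is hypothetical and does not exist in this module.

```python
import pickle
import pickletools
import re


def opcode_name_mismatches():
    """Return (name, code) pairs on which pickle and pickletools disagree."""
    name_by_code = dict((op.code, op.name) for op in pickletools.opcodes)
    mismatches = []
    for name in pickle.__all__:
        if not re.match("[A-Z][A-Z0-9_]+$", name):
            continue                        # not an opcode-style name
        value = getattr(pickle, name)
        if isinstance(value, bytes) and not isinstance(value, str):
            value = value.decode("latin-1")  # bytes constant on Python 3
        if not isinstance(value, str) or len(value) != 1:
            continue                        # e.g. HIGHEST_PROTOCOL
        if name_by_code.get(value) != name:
            mismatches.append((name, value))
    return mismatches
```

# An empty result means every one-character opcode constant exported by
# pickle has an identically named record in pickletools.opcodes.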
##############################################################################
# A pickle opcode generator.

def genops(pickle):
    """Generate all the opcodes in a pickle.

    'pickle' is a file-like object, or string, containing the pickle.

    Each opcode in the pickle is generated, from the current pickle position,
    stopping after a STOP opcode is delivered. A triple is generated for
    each opcode:

        opcode, arg, pos

    opcode is an OpcodeInfo record, describing the current opcode.

    If the opcode has an argument embedded in the pickle, arg is its decoded
    value, as a Python object. If the opcode doesn't have an argument, arg
    is None.

    If the pickle has a tell() method, pos was the value of pickle.tell()
    before reading the current opcode. If the pickle is a string object,
    it's wrapped in a StringIO object, and the latter's tell() result is
    used. Else (the pickle doesn't have a tell(), and it's not obvious how
    to query its current position) pos is None.
    """

    import cStringIO as StringIO

    if isinstance(pickle, str):
        pickle = StringIO.StringIO(pickle)

    if hasattr(pickle, "tell"):
        getpos = pickle.tell
    else:
        getpos = lambda: None

    while True:
        pos = getpos()
        code = pickle.read(1)
        opcode = code2op.get(code)
        if opcode is None:
            if code == "":
                raise ValueError("pickle exhausted before seeing STOP")
            else:
                raise ValueError("at position %s, opcode %r unknown" % (
                                 pos is None and "<unknown>" or pos,
                                 code))
        if opcode.arg is None:
            arg = None
        else:
            arg = opcode.arg.reader(pickle)
        yield opcode, arg, pos
        if code == '.':
            assert opcode.name == 'STOP'
            break

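# A short usage sketch for the generator described above, assuming a
# Python 3 interpreter and the public pickletools.genops entry point. The
# helper name summarize is hypothetical.

```python
import pickle
import pickletools


def summarize(data):
    """Return (position, opcode name, argument) for each opcode in data."""
    return [(pos, opcode.name, arg)
            for opcode, arg, pos in pickletools.genops(data)]


rows = summarize(pickle.dumps((1, 2), 0))
for pos, name, arg in rows:
    print("%5d: %-8s %r" % (pos, name, arg))
```

# Iteration stops right after STOP is delivered. If the pickle runs out
# before a STOP is seen, the generator raises ValueError instead.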
##############################################################################
# A pickle optimizer.

def optimize(p):
    'Optimize a pickle string by removing unused PUT opcodes'
    gets = set()            # set of args used by a GET opcode
    puts = []               # (arg, startpos, stoppos) for the PUT opcodes
    prevpos = None          # set to pos if previous opcode was a PUT
    for opcode, arg, pos in genops(p):
        if prevpos is not None:
            puts.append((prevarg, prevpos, pos))
            prevpos = None
        if 'PUT' in opcode.name:
            prevarg, prevpos = arg, pos
        elif 'GET' in opcode.name:
            gets.add(arg)

    # Copy the pickle string except for PUTS without a corresponding GET
    s = []
    i = 0
    for arg, start, stop in puts:
        j = stop if (arg in gets) else start
        s.append(p[i:j])
        i = stop
    s.append(p[i:])
    return ''.join(s)

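# A small sketch of what the optimizer buys, assuming a Python 3 interpreter
# and the public pickletools.optimize function. The helper name put_count is
# hypothetical.

```python
import pickle
import pickletools


def put_count(data):
    """Count PUT-family opcodes (PUT, BINPUT, LONG_BINPUT) in a pickle."""
    return sum(1 for opcode, arg, pos in pickletools.genops(data)
               if "PUT" in opcode.name)


# Nothing in this list is shared, so no GET ever refers back to a PUT.
original = pickle.dumps([1, 2, (3, 4)], 0)
slimmed = pickletools.optimize(original)

print(put_count(original), put_count(slimmed))
print(pickle.loads(slimmed) == [1, 2, (3, 4)])
```

# A PUT whose memo key is later read by a GET must be kept, so pickles of
# objects that share or recursively contain one another shrink less.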
##############################################################################
# A symbolic pickle disassembler.

def dis(pickle, out=None, memo=None, indentlevel=4):
    """Produce a symbolic disassembly of a pickle.

    'pickle' is a file-like object, or string, containing at least one
    pickle. The pickle is disassembled from the current position, through
    the first STOP opcode encountered.

    Optional arg 'out' is a file-like object to which the disassembly is
    printed. It defaults to sys.stdout.

    Optional arg 'memo' is a Python dict, used as the pickle's memo. It
    may be mutated by dis(), if the pickle contains PUT, BINPUT, or
    LONG_BINPUT opcodes.
    Passing the same memo object to another dis() call then allows disassembly
    to proceed across multiple pickles that were all created by the same
    pickler with the same memo. Ordinarily you don't need to worry about this.

    Optional arg indentlevel is the number of blanks by which to indent
    a new MARK level. It defaults to 4.

    In addition to printing the disassembly, some sanity checks are made:

    + All embedded opcode arguments "make sense".

    + Explicit and implicit pop operations have enough items on the stack.

    + When an opcode implicitly refers to a markobject, a markobject is
      actually on the stack.

    + A memo entry isn't referenced before it's defined.

    + The markobject isn't stored in the memo.

    + A memo entry isn't redefined.
    """

    # Most of the hair here is for sanity checks, but most of it is needed
    # anyway to detect when a protocol 0 POP takes a MARK off the stack
    # (which in turn is needed to indent MARK blocks correctly).

    stack = []          # crude emulation of unpickler stack
    if memo is None:
        memo = {}       # crude emulation of unpickler memo
    maxproto = -1       # max protocol number seen
    markstack = []      # bytecode positions of MARK opcodes
    indentchunk = ' ' * indentlevel
    errormsg = None
    for opcode, arg, pos in genops(pickle):
        if pos is not None:
            print >> out, "%5d:" % pos,

        line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
                              indentchunk * len(markstack),
                              opcode.name)

        maxproto = max(maxproto, opcode.proto)
        before = opcode.stack_before    # don't mutate
        after = opcode.stack_after      # don't mutate
        numtopop = len(before)

        # See whether a MARK should be popped.
        markmsg = None
        if markobject in before or (opcode.name == "POP" and
                                    stack and
                                    stack[-1] is markobject):
            assert markobject not in after
            if __debug__:
                if markobject in before:
                    assert before[-1] is stackslice
            if markstack:
                markpos = markstack.pop()
                if markpos is None:
                    markmsg = "(MARK at unknown opcode offset)"
                else:
                    markmsg = "(MARK at %d)" % markpos
                # Pop everything at and after the topmost markobject.
                while stack[-1] is not markobject:
                    stack.pop()
                stack.pop()
                # Stop later code from popping too much.
                try:
                    numtopop = before.index(markobject)
                except ValueError:
                    assert opcode.name == "POP"
                    numtopop = 0
            else:
                errormsg = markmsg = "no MARK exists on stack"

        # Check for correct memo usage.
        if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"):
            assert arg is not None
            if arg in memo:
                errormsg = "memo key %r already defined" % arg
            elif not stack:
                errormsg = "stack is empty -- can't store into memo"
            elif stack[-1] is markobject:
                errormsg = "can't store markobject in the memo"
            else:
                memo[arg] = stack[-1]

        elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
            if arg in memo:
                assert len(after) == 1
                after = [memo[arg]]     # for better stack emulation
            else:
                errormsg = "memo key %r has never been stored into" % arg

        if arg is not None or markmsg:
            # make a mild effort to align arguments
            line += ' ' * (10 - len(opcode.name))
            if arg is not None:
                line += ' ' + repr(arg)
            if markmsg:
                line += ' ' + markmsg
        print >> out, line

        if errormsg:
            # Note that we delayed complaining until the offending opcode
            # was printed.
            raise ValueError(errormsg)

        # Emulate the stack effects.
        if len(stack) < numtopop:
            raise ValueError("tries to pop %d items from stack with "
                             "only %d items" % (numtopop, len(stack)))
        if numtopop:
            del stack[-numtopop:]
        if markobject in after:
            assert markobject not in before
            markstack.append(pos)

        stack.extend(after)

    print >> out, "highest protocol among opcodes =", maxproto
    if stack:
        raise ValueError("stack not empty after STOP: %r" % stack)

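# A brief sketch of dis() and one of its sanity checks, assuming a Python 3
# interpreter and the public pickletools.dis function. The helper name
# disassemble and the hand-written pickle are hypothetical illustrations.

```python
import io
import pickle
import pickletools


def disassemble(data):
    """Return the text dis() would print, instead of writing to stdout."""
    out = io.StringIO()
    pickletools.dis(data, out=out)
    return out.getvalue()


listing = disassemble(pickle.dumps([1, 2], 0))
print(listing)

# A hand-written pickle that GETs memo key 0 before any PUT defines it.
# dis() is expected to reject it with ValueError.
try:
    disassemble(b"g0\n.")
    rejected = False
except ValueError:
    rejected = True
```

# Capturing the output in a buffer keeps the listing available for
# inspection even when a later sanity check raises.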
# For use in the doctest, simply as an example of a class to pickle.
class _Example:
    def __init__(self, value):
        self.value = value

_dis_test = r"""
>>> x = [1, 2, (3, 4), {'abc': u"def"}]
>>> pkl = pickle.dumps(x, 0)
    1: l        LIST       (MARK at 0)
   20: t        TUPLE      (MARK at 13)
   26: d        DICT       (MARK at 25)
   40: V    UNICODE    u'def'
highest protocol among opcodes = 0

Try again with a "binary" pickle.

>>> pkl = pickle.dumps(x, 1)
   13: t            TUPLE      (MARK at 8)
   19: U        SHORT_BINSTRING 'abc'
   26: X        BINUNICODE u'def'
   37: e        APPENDS    (MARK at 3)
highest protocol among opcodes = 1

Exercise the INST/OBJ/BUILD family.

>>> import pickletools
>>> dis(pickle.dumps(pickletools.dis, 0))
    0: c    GLOBAL     'pickletools dis'
highest protocol among opcodes = 0

>>> from pickletools import _Example
>>> x = [_Example(42)] * 2
>>> dis(pickle.dumps(x, 0))
    1: l        LIST       (MARK at 0)
    6: i        INST       'pickletools _Example' (MARK at 5)
   32: d        DICT       (MARK at 31)
   36: S    STRING     'value'
highest protocol among opcodes = 0

>>> dis(pickle.dumps(x, 1))
    5: c            GLOBAL     'pickletools _Example'
   29: o            OBJ        (MARK at 4)
   35: U        SHORT_BINSTRING 'value'
   50: e        APPENDS    (MARK at 3)
highest protocol among opcodes = 1

Try "the canonical" recursive-object test.

>>> dis(pickle.dumps(L, 0))
    1: l        LIST       (MARK at 0)
    9: t        TUPLE      (MARK at 5)
highest protocol among opcodes = 0

>>> dis(pickle.dumps(L, 1))
    6: t        TUPLE      (MARK at 3)
highest protocol among opcodes = 1

Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
has to emulate the stack in order to realize that the POP opcode at 16 gets
rid of the MARK at 0.

>>> dis(pickle.dumps(T, 0))
    2: l            LIST       (MARK at 1)
   10: t            TUPLE      (MARK at 6)
   16: 0        POP        (MARK at 0)
highest protocol among opcodes = 0

>>> dis(pickle.dumps(T, 1))
    7: t            TUPLE      (MARK at 4)
   11: 1        POP_MARK   (MARK at 0)
highest protocol among opcodes = 1

>>> dis(pickle.dumps(L, 2))
highest protocol among opcodes = 2

>>> dis(pickle.dumps(T, 2))
highest protocol among opcodes = 2
"""

_memo_test = r"""
>>> from StringIO import StringIO
>>> p = pickle.Pickler(f, 2)
>>> dis(f, memo=memo)
   12: e        APPENDS    (MARK at 5)
highest protocol among opcodes = 2
>>> dis(f, memo=memo)
highest protocol among opcodes = 2
"""

__test__ = {'disassembler_test': _dis_test,
            'disassembler_memo_test': _memo_test,
           }

def _test():
    import doctest
    return doctest.testmod()

if __name__ == "__main__":