1
==========================================================================================
2
Memory management and threading models as translation aspects -- solutions and challenges
3
==========================================================================================
11
One of the goals of the PyPy project is to have the memory and concurrency
12
models flexible and changeable without having to reimplement the
13
interpreter manually. In fact, PyPy, by the time of the 0.8 release contains code for memory
14
management and concurrency models which allows experimentation without
15
requiring early design decisions. This document describes many of the more
16
technical details of the current state of the implementation of the memory
17
object model, automatic memory management and concurrency models and describes
18
possible future developments.
21
The low level object model
22
===========================
24
One important part of the translation process is *rtyping* [DLT]_, [TR]_.
25
Before that step all objects in our flow graphs are annotated with types at the
26
level of the RPython type system which is still quite high-level and
27
target-independent. During rtyping they are transformed into objects that
28
match the model of the specific target platform. For C or C-like targets this
29
model consists of a set of C-like types like structures, arrays and functions
30
in addition to primitive types (integers, characters, floating point numbers).
31
This multi-stage approach gives a lot of flexibility in how a given object is
32
represented at the target's level. The RPython process can decide what
33
representation to use based on the type annotation and on the way the object is
36
In the following the structures used to represent RPython classes are described.
37
There is one "vtable" per RPython class, with the following structure: The root
38
class "object" has a vtable of the following type (expressed in a C-like
41
struct object_vtable {
42
struct object_vtable* parenttypeptr;
43
RuntimeTypeInfo * rtti;
44
Signed subclassrange_min;
45
Signed subclassrange_max;
46
array { char } * name;
47
struct object * instantiate();
50
The structure members ``subclassrange_min`` and ``subclassrange_max`` are used
51
for subclass checking (see below). Every other class X, with parent Y, has the
55
struct vtable_Y super; // inlined
56
... // extra class attributes
59
The extra class attributes usually contain function pointers to the methods
60
of that class, although the data class attributes (which are supported by the
61
RPython object model) are stored there.
63
The type of the instances is::
65
struct object { // for instances of the root class
66
struct object_vtable* typeptr;
69
struct X { // for instances of every other class
70
struct Y super; // inlined
71
... // extra instance attributes
74
The extra instance attributes are all the attributes of an instance.
76
These structure layouts are quite similar to how classes are usually
82
The way we do subclass checking is a good example of the flexibility provided
83
by our approach: in the beginning we were using a naive linear lookup
84
algorithm. Since subclass checking is quite a common operation (it is also used
85
to check whether an object is an instance of a certain class), we wanted to
86
replace it with the more efficient relative numbering algorithm (see [PVE]_ for
87
an overview of techniques). This was a matter of changing just the appropriate
88
code of the rtyping process to calculate the class-ids during rtyping and
89
insert the necessary fields into the class structure. It would be similarly
90
easy to switch to another implementation.
95
In the RPython type system, class instances can be used as dictionary keys using
96
a default hash implementation based on identity, which in practice is
97
implemented using the memory address. This is similar to CPython's behavior
98
when no user-defined hash function is present. The annotator keeps track of the
99
classes for which this hashing is ever used.
101
One of the peculiarities of PyPy's approach is that live objects are analyzed
102
by our translation toolchain. This leads to the presence of instances of RPython
103
classes that were built before the translation started. These are called
104
"pre-built constants" (PBCs for short). During rtyping, these instances must be
105
converted to the low level model. One of the problems with doing this is that
106
the standard hash implementation of Python is to take the id of an object, which
108
is just the memory address. If the RPython program explicitely captures the
109
hash of a PBC by storing it (for example in the implementation of a data
110
structure) then the stored hash value will not match the value of the object's
111
address after translation.
113
To prevent this the following strategy is used: for every class whose instances
114
are hashed somewhere in the program (either when storing them in a
115
dictionary or by calling the hash function) an extra field is introduced in the
116
structure used for the instances of that class. For PBCs of such a class this
117
field is used to store the memory address of the original object and new objects
118
have this field initialized to zero. The hash function for instances of such a
119
class stores the object's memory address in this field if it is zero. The
120
return value of the hash function is the content of the field. This means that
121
instances of such a class that are converted PBCs retain the hash values they
122
had before the conversion whereas new objects of the class have their memory
123
address as hash values. A strategy along these lines would in any case have been
124
required if we ever switch to using a copying garbage collector.
126
Cached functions with PBC arguments
127
------------------------------------
129
As explained in [DLT]_ the annotated code can contain
130
functions from a finite set of PBCs to something else. The set itself has to be
131
finite but its content does not need to be provided explictly but is discovered
132
as the annotation of the input argument by the annotator itself. This kind of
133
function is translated by recording the input-result relationship by calling
134
the function concretely at annotation time, and adding a field to the PBCs in
135
the set and emitting code reading that field instead of the function call.
137
Changing the representation of an object
138
----------------------------------------
140
One example of the flexibility the RTyper provides is how we deal with lists.
141
Based on information gathered by the annotator the RTyper chooses between two
142
different list implementations. If a list never changes its size after creation,
143
a low-level array is used directly. For lists which might be resized, a
144
representation consisting of a structure with a pointer to an array is used,
145
together with over-allocation.
147
We plan to use similar techniques to use tagged pointers instead of using boxing
148
to represent builtin types of the PyPy interpreter such as integers. This would
149
require attaching explicit hints to the involved classes. Field access would
150
then be translated to the corresponding masking operations.
153
Automatic Memory Management Implementations
154
============================================
156
The whole implementation of the PyPy interpreter assumes automatic memory
157
management, e.g. automatic reclamation of memory that is no longer used. The
158
whole analysis toolchain also assumes that memory management is being taken
159
care of -- only the backends have to concern themselves with that issue. For
160
backends that target environments that have their own garbage collector, like
161
Smalltalk or Javascript, this is not an issue. For other targets like C and
162
LLVM the backend has to produce code that uses some sort of garbage collection.
164
This approach has several advantages. It makes it possible to target different
165
platforms, with and without integrated garbage collection. Furthermore, the
166
interpreter implementation is not complicated by the need to do explicit memory
167
management everywhere. Even more important the backend can optimize the memory
168
handling to fit a certain situation (like a machine with very restricted
169
memory) or completely replace the memory management technique or memory model
170
with a different one without the need to change source code. Additionally,
171
the backend can use information that was inferred by the rest of the toolchain
172
to improve the quality of memory management.
174
Using the Boehm garbage collector
175
-----------------------------------
177
Currently there are two different garbage collectors implemented in the C
178
backend (which is the most complete backend right now). One of them uses the
179
existing Boehm-Demers-Weiser garbage collector [BOEHM]_. For every memory
180
allocating operation in a low level flow graph the C backend introduces a call
181
to a function of the boehm collector which returns a suitable amount of memory.
182
Since the C backend has a lot of information avaiable about the data structure
183
being allocated it can choose the memory allocation function out of the Boehm
184
API that fits best. For example, for objects that do not contain references to
185
other objects (e.g. strings) there is a special allocation function which
186
signals to the collector that it does not need to consider this memory when
189
Using the Boehm collector has disadvantages as well. The problems stem from the
190
fact that the Boehm collector is conservative which means that it has to
191
consider every word in memory as a potential pointer. Since PyPy's toolchain
192
has complete knowledge of the placement of data in memory we can generate an
193
exact garbage collector that considers only genuine pointers.
195
Using a simple reference counting garbage collector
196
-----------------------------------------------------
198
The other implemented garbage collector is a simple reference counting scheme.
199
The C backend inserts a reference count field into every structure that has to be
200
handled by the garbage collector and puts increment and decrement operations
201
for this reference count into suitable places in the resulting C code. After
202
every reference decrement operations a check is performed whether the reference
203
count has dropped to zero. If this is the case the memory of the object will be
204
reclaimed after the references counts of the objects the original object
205
refers to are decremented as well.
207
The current placement of reference counter updates is far from optimal: The
208
reference counts are updated much more often than theoretically necessary (e.g.
209
sometimes a counter is increased and then immediately decreased again).
210
Objects passed into a function as arguments can almost always use a "trusted reference",
211
because the call-site is responsible to create a valid reference.
212
Furthermore some more analysis could show that some objects don't need a
213
reference counter at all because they either have a very short, foreseeable
214
life-time or because they live exactly as long as another object.
216
Another drawback of the current reference counting implementation is that it
217
cannot deal with circular references, which is a fundamental flaw of reference
218
counting memory management schemes in general. CPython solves this problem by
219
having special code that handles circular garbage which PyPy lacks at the
220
moment. This problem has to be addressed in the future to make the reference
221
counting scheme a viable garbage collector. Since reference counting is quite
222
succesfully used by CPython it will be interesting to see how far it can be
225
Simple escape analysis to remove memory allocation
226
---------------------------------------------------
228
We also implemented a technique to reduce the amount of memory allocation.
229
Sometimes it is possible to deduce from the flow graphs that an object lives
230
exactly as long as the stack frame of the function it is allocated in.
231
This happens if no pointer to the object is stored into another object and if
232
no pointer to the object is returned from the function. If this is the case and
233
if the size of the object is known in advance the object can be allocated on
234
the stack. To achieve this, the object is "exploded", that means that for every
235
element of the structure a new variable is generated that is handed around in
236
the graph. Reads from elements of the structure are removed and just replaced
237
by one of the variables, writes by assignments to same.
239
Since quite a lot of objects are allocated in small helper functions, this
240
simple approach which does not track objects accross function boundaries only
241
works well in the presence of function inlining.
243
A general garbage collection framework
244
--------------------------------------
246
In addition to the garbage collectors implemented in the C backend we have also
247
started writing a more general toolkit for implementing exact garbage
248
collectors in Python. The general idea is to express the garbage collection
249
algorithms in Python as well and translate them as part of the translation
250
process to C code (or whatever the intended platform is).
252
To be able to access memory in a low level manner there are special ``Address``
253
objects that behave like pointers to memory and can be manipulated accordingly:
254
it is possible to read/write to the location they point to a variety of data
255
types and to do pointer arithmetic. These objects are translated to real
256
pointers and the appropriate operations. When run on top of CPython there is a
257
*memory simulator* that makes the address objects behave like they were
258
accessing real memory. In addition the memory simulator contains a number of
259
consistency checks that expose common memory handling errors like dangling
260
pointers, uninitialized memory, etc.
262
At the moment we have three simple garbage collectors implemented for this
263
framework: a simple copying collector, a mark-and-sweep collector and a
264
deferred reference counting collector. These garbage collectors are work when run on
265
top of the memory simulator, but at the moment it is not yet possible to translate
266
PyPy to C with them. This is because it is not easy to
267
find the root pointers that reside on the C stack -- both because the C stack layout is
268
heavily platform dependent, and also due to the possibility of roots that are not
269
only on the stack but also hiding in registers (which would give a problem for *moving
270
garbage collectors*).
272
There are several possible solutions for this problem: One
273
of them is to not use C compilers to generate machine code, so that the stack
274
frame layout gets into our control. This is one of the tasks that need to be
275
tackled in phase 2, as directly generating assembly is needed anyway for a
276
just-in-time compiler. The other possibility (which would be much easier to
277
implement) is to move all the data away from the stack to the heap
278
before collecting garbage, as described in section "Stackless C code" below.
280
Concurrency Model Implementations
281
============================================
283
At the moment we have implemented two different concurrency models, and the
284
option to not support concurrency at all
285
(another proof of the modularity of our approach):
286
threading with a global interpreter lock and a "stackless" model.
291
By default, multi-threading is not supported at all, which gives some small
292
benefits for single-threaded applications since even in the single-threaded
293
case there is some overhead if threading capabilities are built into
296
Threading with a Global Interpreter Lock
297
------------------------------------------
299
Right now, there is one non-trivial threading model implemented. It follows
300
the threading implementation of CPython and thus uses a global interpreter
301
lock. This lock prevents any two threads from interpreting python code at
302
the same time. The global interpreter lock is released around calls to blocking I/O
303
functions. This approach has a number of advantages: it gives very little
304
runtime penalty for single-threaded applications, makes many of the common uses
305
for threading possible, and it is relatively easy to implement and maintain. It has
306
the disadvantage that multiple threads cannot be distributed accross multiple
309
To make this threading-model usable for I/O-bound applications, the global
310
intepreter lock should be released around blocking external function calls
311
(which is also what CPython does). This has been partially implemented.
317
"Stackless" C code is C code that only uses a bounded amount of
318
space in the C stack, and that can generally obtain explicit
319
control of its own stack. This is commonly known as "continuations",
320
or "continuation-passing style" code, although in our case we will limit
321
ourselves to single-shot continuations, i.e. continuations that are
322
captured and subsequently will be resumed exactly once.
324
The technique we have implemented is based on the recurring idea
325
of emulating this style via exceptions: a specific program point can
326
generate a pseudo-exception whose purpose is to unwind the whole C stack
327
in a restartable way. More precisely, the "unwind" exception causes
328
the C stack to be saved into the heap in a compact and explicit
329
format, as described below. It is then possible to resume only the
330
innermost (most recent) frame of the saved stack -- allowing unlimited
331
recursion on OSes that limit the size of the C stack -- or to resume a
332
different previously-saved C stack altogether, thus implementing
333
coroutines or light-weight threads.
335
In our case, exception handling is always explicit in the generated code:
336
the C backend puts a cheap check
337
after each call site to detect if the callee exited
338
normally or generated an exception. So when compiling functions in
339
stackless mode, the generated exception handling code special-cases the
340
new "unwind" exception. This exception causes the current function to
341
respond by saving its local variables to a heap structure (a linked list
342
of records, one per stack frame) and then propagating the exception
343
outwards. Eventually, at the end of the frame chain, the outermost
344
function is a manually-written dispatcher that catches the "unwind"
347
At this point, the whole C stack is stored away in the heap. This is a
348
very interesting state in itself, because precisely there is no C stack
350
left. It is this which will allow us to write all the algorithms
351
in a portable way, that
352
normally require machine-specific code to inspect the stack,
353
in particular garbage collectors.
355
To continue execution, the dispatcher can resume either the freshly saved or a
356
completely different stack. Moreover, it can resume directly the innermost
357
(most recent) saved frame in the heap chain, without having to resume all
358
intermediate frames first. This not only makes stack switches fast, but it
359
also allows the frame to continue to run on top of a clean C stack. When that
360
frame eventually exits normally, it returns to the dispatcher, which then
361
invokes the previous (parent) saved frame, and so on. We insert stack checks
362
before calls that can lead to recursion by detecting cycles in the call graph.
363
These stack checks copy the stack to the heap (by raising the special
364
exception) if it is about to grow deeper than a certain level.
365
As a different point of view, the C stack can also be considered as a cache
366
for the heap-based saved frames in this model. When we run out
367
of C stack space, we flush the cache. When the cache is empty, we fill it with
368
the next item from the heap.
370
To give the translated program some amount of control over the
371
heap-based stack structures and over the top-level dispatcher that jumps
372
between them, there are a few "external" functions directly implemented
373
in C. These functions provide an elementary interface, on top of which
374
useful abstractions can be implemented, like:
376
* coroutines: explicitly switching code, similar to Greenlets [GREENLET]_.
378
* "tasklets": cooperatively-scheduled microthreads, as introduced in
379
Stackless Python [STK]_.
381
* implicitly-scheduled (preemptive) microthreads, also known as green threads.
383
An important property of the changes in all the generated C functions is
384
that they are written in a way that does only minimally degrade their performance in
385
the non-exceptional case. Most optimisations performed by C compilers,
386
like register allocation, continue to work...
388
The following picture shows a graph function together with the modifications
389
necessary for the stackless style: the check whether the stack is too big and
390
should be unwound, the check whether we are in the process of currently storing
391
away the stack and the check whether the call to the function is not a regular
392
call but a reentry call.
394
.. graphviz:: image/stackless_informal.dot
401
open challenges for phase 2:
406
One of the biggest missing features of our current garbage collectors is
407
finalization. At present finalizers are simply not invoked if an object is
408
freed by the garbage collector. Along the same lines weak references are not
409
supported yet. It should be possible to implement these with a reasonable
410
amount of effort for reference counting as well as the Boehm collector (which
411
provides the necessary hooks).
413
Integrating the now simulated-only GC framework into the rtyping process and
414
the code generation will require considerable effort. It requires being able to
415
keep track of the GC roots which is hard to do with portable C code. One
416
solution would be to use the "stackless" code since it can move the stack
417
completely to the heap. We expect that we can implement GC read and write
418
barriers as function calls and rely on inlining to make them more efficient.
420
We may also spend some time on improving the existing reference counting
421
implementation by removing unnecessary incref-decref pairs and identifying
422
trustworthy references. A bigger task would
423
be to add support for detecing circular references.
429
One of the interesting possibities that stackless offers is to implement *green
430
threading*. This would involve writing a scheduler and some preemption logic.
432
We should also investigate other threading models based on operating system
433
threads with various granularities of locking for access of shared objects.
438
We also might want to experiment with more sophisticated structure inlining.
439
Sometimes it is possible to find out that one structure object
440
allocated on the heap lives exactly as long as another structure object on the
441
heap pointing to it. If this is the case it is possible to inline the first
442
object into the second. This saves the space of one pointer and avoids
449
As concretely shown with various detailed examples, our approach gives us
450
flexibility and lets us choose various aspects at translation time instead
451
of encoding them into the implementation itself.
456
.. [BOEHM] `Boehm-Demers-Weiser garbage collector`_, a garbage collector
457
for C and C++, Hans Boehm, 1988-2004
458
.. _`Boehm-Demers-Weiser garbage collector`: http://www.hpl.hp.com/personal/Hans_Boehm/gc/
460
.. [GREENLET] `Lightweight concurrent programming`_, py-lib Documentation 2003-2005
461
.. _`Lightweight concurrent programming`: http://codespeak.net/py/current/doc/greenlet.html
463
.. [STK] `Stackless Python`_, a Python implementation that does not use
464
the C stack, Christian Tismer, 1999-2004
465
.. _`Stackless Python`: http://www.stackless.com
467
.. [TR] `Translation`_, PyPy documentation, 2003-2005
468
.. _`Translation`: translation.html
470
.. [LE] `Encapsulating low-level implementation aspects`_,
471
PyPy documentation (and EU deliverable D05.4), 2005
472
.. _`Encapsulating low-level implementation aspects`: low-level-encapsulation.html
474
.. [DLT] `Compiling dynamic language implementations`_,
475
PyPy documentation (and EU deliverable D05.1), 2005
476
.. _`Compiling dynamic language implementations`: dynamic-language-translation.html
478
.. [PVE] `Simple and Efficient Subclass Tests`_, Jonathan Bachrach, Draft submission to ECOOP-02, 2001
479
.. _`Simple and Efficient Subclass Tests`: http://people.csail.mit.edu/jrb/pve/pve.pdf