~ubuntu-branches/ubuntu/hardy/libgc/hardy-updates

« back to all changes in this revision

Viewing changes to doc/scale.html

Committer: Bazaar Package Importer
Author(s): Ryan Murray
Date: 2005-02-03 00:50:53 UTC
mto: (3.1.1 etch) (1.2.4 upstream)
mto: This revision was merged to the branch mainline in revision 3.
Revision ID: james.westby@ubuntu.com-20050203005053-9c0v9r2qcm2g1cfp

Tags: upstream-6.4

Import upstream version 6.4

files added:
NT_STATIC_THREADS_MAKEFILE

aix_irix_threads.c

darwin_stop_world.c

depcomp

doc/Makefile.am

doc/Makefile.in

doc/README.darwin

doc/gcinterface.html

doc/leak.html

doc/scale.html

doc/simple_example.html

include/Makefile.am

include/Makefile.in

include/gc_config_macros.h

include/private/darwin_semaphore.h

include/private/darwin_stop_world.h

include/private/pthread_stop_world.h

include/private/pthread_support.h

mips_sgi_mach_dep.s

missing

powerpc_darwin_mach_dep.s

pthread_stop_world.c

pthread_support.c

tests/middle.c

files removed:
irix_threads.c

linux_threads.c

ltconfig

mips_sgi_mach_dep.S

powerpc_macosx_mach_dep.s

files modified:
AmigaOS.c *

BCC_MAKEFILE

Mac_files/MacOS_Test_config.h

Mac_files/MacOS_config.h

Makefile

Makefile.am

Makefile.direct

Makefile.dj

Makefile.in

NT_MAKEFILE

acinclude.m4

aclocal.m4

allchblk.c

alloc.c

backgraph.c

checksums.c

config.guess

config.sub

configure

configure.host

configure.in

cord/cordbscs.c

cord/cordprnt.c

cord/cordtest.c

cord/cordxtra.c

cord/de.c

cord/de_win.c

dbg_mlc.c

digimars.mak *

doc/README

doc/README.MacOSX

doc/README.arm.cross *

doc/README.changes

doc/README.environment

doc/README.linux

doc/README.macros

doc/README.solaris2

doc/README.win32

doc/debugging.html

doc/gc.man

doc/gcdescr.html

doc/tree.html

dyn_load.c

finalize.c

gc_cpp.cc

gc_dlopen.c

gcj_mlc.c

headers.c

if_mach.c

if_not_there.c

include/gc.h

include/gc_allocator.h

include/gc_cpp.h

include/gc_gcj.h

include/gc_local_alloc.h

include/gc_mark.h

include/gc_pthread_redirects.h

include/gc_typed.h

include/javaxfc.h

include/new_gc_alloc.h

include/private/dbg_mlc.h

include/private/gc_hdrs.h

include/private/gc_locks.h

include/private/gc_pmark.h

include/private/gc_priv.h

include/private/gcconfig.h

include/private/solaris_threads.h

include/private/specific.h

install-sh

mach_dep.c

malloc.c

mallocx.c

mark.c

mark_rts.c

misc.c

mkinstalldirs

new_hblk.c

os_dep.c

ptr_chck.c

reclaim.c

solaris_pthreads.c

solaris_threads.c

specific.c

tests/test.c

tests/test_cpp.cc

tests/trace_test.c

threadlibs.c

typd_mlc.c

version.h

win32_threads.c

Show diffs side-by-side

added added

removed removed

doc/scale.html

<HTML>

<HEAD>

<TITLE>Garbage collector scalability</TITLE>

</HEAD>

<BODY>

<H1>Garbage collector scalability</h1>

In its default configuration, the Boehm-Demers-Weiser garbage collector

is not thread-safe. It can be made thread-safe for a number of environments

by building the collector with the appropriate

<TT>-D</tt><I>XXX</i><TT>-THREADS</tt> compilation

flag. This has primarily two effects:

<OL>

<LI> It causes the garbage collector to stop all other threads when

it needs to see a consistent memory state.

<LI> It causes the collector to acquire a lock around essentially all

allocation and garbage collection activity.

</ol>

Since a single lock is used for all allocation-related activity, only one

thread can be allocating or collecting at one point. This inherently

limits performance of multi-threaded applications on multiprocessors.

<P>

On most platforms, the allocator/collector lock is implemented as a

spin lock with exponential back-off. Longer wait times are implemented

by yielding and/or sleeping. If a collection is in progress, the pure

spinning stage is skipped. This has the advantage that uncontested and

thus most uniprocessor lock acquisitions are very cheap. It has the

disadvantage that the application may sleep for small periods of time

even when there is work to be done. And threads may be unnecessarily

woken up for short periods. Nonetheless, this scheme empirically

outperforms native queue-based mutual exclusion implementations in most

cases, sometimes drastically so.

<H2>Options for enhanced scalability</h2>

Version 6.0 of the collector adds two facilities to enhance collector

scalability on multiprocessors. As of 6.0alpha1, these are supported

only under Linux on X86 and IA64 processors, though ports to other

otherwise supported Pthreads platforms should be straightforward.

They are intended to be used together.

<UL>

<LI>

Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to

run the mark phase in parallel in multiple threads, and thus on multiple

processors. The mark phase typically consumes the large majority of the

collection time. Thus this largely parallelizes the garbage collector

itself, though not the allocation process. Currently the marking is

performed by the thread that triggered the collection, together with

<I>N</i>-1 dedicated

threads, where <I>N</i> is the number of processors detected by the collector.

The dedicated threads are created once at initialization time.

<P>

A second effect of this flag is to switch to a more concurrent

implementation of <TT>GC_malloc_many</tt>, so that free lists can be

built, and memory can be cleared, by more than one thread concurrently.

<LI>

Building the collector with -DTHREAD_LOCAL_ALLOC adds support for thread

local allocation. It does not, by itself, cause thread local allocation

to be used. It simply allows the use of the interface in

<TT>gc_local_alloc.h</tt>.

<P>

Memory returned from thread-local allocators is completely interchangeable

with that returned by the standard allocators. It may be used by other

threads. The only difference is that, if the thread allocates enough

memory of a certain kind, it will build a thread-local free list for

objects of that kind, and allocate from that. This greatly reduces

locking. The thread-local free lists are refilled using

<TT>GC_malloc_many</tt>.

<P>

An important side effect of this flag is to replace the default

spin-then-sleep lock to be replace by a spin-then-queue based implementation.

This <I>reduces performance</i> for the standard allocation functions,

though it usually improves performance when thread-local allocation is

used heavily, and thus the number of short-duration lock acquisitions

is greatly reduced.

</ul>

<P>

The easiest way to switch an application to thread-local allocation is to

<OL>

<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,

and then include the <TT>gc.h</tt>

header in each client source file.

<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.

<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,

and/or <TT>GC_GCJ_MALLOC</tt>.

</ol>

<H2>The Parallel Marking Algorithm</h2>

We use an algorithm similar to

<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by

Endo, Taura, and Yonezawa</a> at the University of Tokyo.

However, the data structures and implementation are different,

and represent a smaller change to the original collector source,

probably at the expense of extreme scalability. Some of

the refinements they suggest, <I>e.g.</i> splitting large

objects, were also incorporated into out approach.

<P>

The global mark stack is transformed into a global work queue.

Unlike the usual case, it never shrinks during a mark phase.

The mark threads remove objects from the queue by copying them to a

local mark stack and changing the global descriptor to zero, indicating

that there is no more work to be done for this entry.

This removal

100

is done with no synchronization. Thus it is possible for more than

101

one worker to remove the same entry, resulting in some work duplication.

102

<P>

103

The global work queue grows only if a marker thread decides to

104

return some of its local mark stack to the global one. This

105

is done if the global queue appears to be running low, or if

106

the local stack is in danger of overflowing. It does require

107

synchronization, but should be relatively rare.

108

<P>

109

The sequential marking code is reused to process local mark stacks.

110

Hence the amount of additional code required for parallel marking

111

is minimal.

112

<P>

113

It should be possible to use generational collection in the presence of the

114

parallel collector, by calling <TT>GC_enable_incremental()</tt>.

115

This does not result in fully incremental collection, since parallel mark

116

phases cannot currently be interrupted, and doing so may be too

117

expensive.

118

<P>

119

Gcj-style mark descriptors do not currently mix with the combination

120

of local allocation and incremental collection. They should work correctly

121

with one or the other, but not both.

122

<P>

123

The number of marker threads is set on startup to the number of

124

available processors (or to the value of the <TT>GC_NPROCS</tt>

125

environment variable). If only a single processor is detected,

126

parallel marking is disabled.

127

<P>

128

Note that setting GC_NPROCS to 1 also causes some lock acquisitions inside

129

the collector to immediately yield the processor instead of busy waiting

130

first. In the case of a multiprocessor and a client with multiple

131

simultaneously runnable threads, this may have disastrous performance

132

consequences (e.g. a factor of 10 slowdown).

133

<H2>Performance</h2>

134

We conducted some simple experiments with a version of

135

<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to

136

run multiple concurrent client threads in the same address space.

137

Each client thread does the same work as the original benchmark, but they share

138

a heap.

139

This benchmark involves very little work outside of memory allocation.

140

This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine

141

under Linux 2.2.12.

142

<P>

143

Running with a thread-unsafe collector, the benchmark ran in 9

144

seconds. With the simple thread-safe collector,

145

built with <TT>-DLINUX_THREADS</tt>, the execution time

146

increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.

147

(The times for the <TT>malloc</tt>/i<TT>free</tt> version

148

with glibc <TT>malloc</tt>

149

are 10.51 (standard library, pthreads not linked),

150

20.90 (one thread, pthreads linked),

151

and 24.55 seconds respectively. The benchmark favors a

152

garbage collector, since most objects are small.)

153

<P>

154

The following table gives execution times for the collector built

155

with parallel marking and thread-local allocation support

156

(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested

157

the client using either one or two marker threads, and running

158

one or two client threads. Note that the client uses thread local

159

allocation exclusively. With -DTHREAD_LOCAL_ALLOC the collector

160

switches to a locking strategy that is better tuned to less frequent

161

lock acquisition. The standard allocation primitives thus peform

162

slightly worse than without -DTHREAD_LOCAL_ALLOC, and should be

163

avoided in time-critical code.

164

<P>

165

(The results using <TT>pthread_mutex_lock</tt>

166

directly for allocation locking would have been worse still, at

167

least for older versions of linuxthreads.

168

With THREAD_LOCAL_ALLOC, we first repeatedly try to acquire the

169

lock with pthread_mutex_try_lock(), busy_waiting between attempts.

170

After a fixed number of attempts, we use pthread_mutex_lock().)

171

<P>

172

These measurements do not use incremental collection, nor was prefetching

173

enabled in the marker. We used the C version of the benchmark.

174

All measurements are in elapsed seconds on an unloaded machine.

175

<P>

176

177

<TR><TH>Number of threads</th><TH>1 marker thread (secs.)</th>

178

<TH>2 marker threads (secs.)</th></tr>

179

<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td>

180

<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td>

181

</table>

182

<PP>

183

The execution time for the single threaded case is slightly worse than with

184

simple locking. However, even the single-threaded benchmark runs faster than

185

even the thread-unsafe version if a second processor is available.

186

The execution time for two clients with thread local allocation time is

187

only 1.4 times the sequential execution time for a single thread in a

188

thread-unsafe environment, even though it involves twice the client work.

189

That represents close to a

190

factor of 2 improvement over the 2 client case with the old collector.

191

The old collector clearly

192

still suffered from some contention overhead, in spite of the fact that the

193

locking scheme had been fairly well tuned.

194

<P>

195

Full linear speedup (i.e. the same execution time for 1 client on one

196

processor as 2 clients on 2 processors)

197

is probably not achievable on this kind of

198

hardware even with such a small number of processors,

199

since the memory system is

200

a major constraint for the garbage collector,

201

the processors usually share a single memory bus, and thus

202

the aggregate memory bandwidth does not increase in

203

proportion to the number of processors.

204

<P>

205

These results are likely to be very sensitive to both hardware and OS

206

issues. Preliminary experiments with an older Pentium Pro machine running

207

an older kernel were far less encouraging.

208

209

</body>

210

</html>

Older »