\input texinfo @c -*-texinfo-*-
@setfilename starpu.info
@settitle StarPU Handbook
@setchapternewpage odd

@title StarPU Handbook
@subtitle for StarPU @value{VERSION}
@comment For the @value{version-GCC} Version*

This manual documents the usage of StarPU version @value{VERSION}. It
was last updated on @value{UPDATED}.

@comment When you add a new menu item, please keep the right hand
@comment aligned to the same column. Do not use tabs. This provides
@comment better formatting.
@menu
* Introduction::                A basic introduction to using StarPU
* Installing StarPU::           How to configure, build and install StarPU
* Using StarPU::                How to run a StarPU application
* Basic Examples::              Basic examples of the use of StarPU
* Performance optimization::    How to optimize performance with StarPU
* Performance feedback::        Performance debugging tools
* StarPU MPI support::          How to combine StarPU with MPI
* Configuring StarPU::          How to configure StarPU
* StarPU API::                  The API to use StarPU
* Advanced Topics::             Advanced use of StarPU
* Full source code for the 'Scaling a Vector' example::
* Function Index::              Index of C functions.
@end menu
@c ---------------------------------------------------------------------
@c Introduction to StarPU
@c ---------------------------------------------------------------------

@node Introduction
@chapter Introduction to StarPU

@menu
* Motivation::              Why StarPU?
* StarPU in a Nutshell::    The Fundamentals of StarPU
@end menu
@node Motivation
@section Motivation

@c complex machines with heterogeneous cores/devices
The use of specialized hardware such as accelerators or coprocessors offers an
interesting approach to overcoming the physical limits encountered by processor
architects. As a result, many machines are now equipped with one or several
accelerators (e.g. a GPU), in addition to the usual processor(s). While much
effort has been devoted to offloading computation onto such accelerators, very
little attention has been paid to portability concerns on the one hand, and to the
possibility of letting heterogeneous accelerators and processors interact on the other hand.

StarPU is a runtime system that offers support for heterogeneous multicore
architectures. It not only offers a unified view of the computational resources
(i.e. CPUs and accelerators at the same time), but it also takes care of
efficiently mapping and executing tasks onto a heterogeneous machine while
transparently handling low-level issues such as data transfers in a portable
fashion.

@c this leads to a complicated distributed memory design
@c which is not (easily) manageable by hand

@c added value/benefits of StarPU
@c   - scheduling, perf. portability

@node StarPU in a Nutshell
@section StarPU in a Nutshell
@menu
* Codelet and Tasks::
* StarPU Data Management Library::
* Research Papers::
@end menu

From a programming point of view, StarPU is not a new language but a library
that executes tasks explicitly submitted by the application. The data that a
task manipulates are automatically transferred onto the accelerator so that the
programmer does not have to take care of complex data movements. StarPU also
takes particular care of scheduling those tasks efficiently and allows
scheduling experts to implement custom scheduling policies in a portable
way.
@c explain the notion of codelet and task (i.e. g(A, B)
@node Codelet and Tasks
@subsection Codelet and Tasks

One of StarPU's primary data structures is the @b{codelet}. A codelet describes a
computational kernel that can possibly be implemented on multiple architectures
such as a CPU, a CUDA device or a Cell's SPU.
@c TODO insert illustration f : f_spu, f_cpu, ...

Another important data structure is the @b{task}. Executing a StarPU task
consists in applying a codelet on a data set, on one of the architectures on
which the codelet is implemented. A task thus describes the codelet that it
uses, but also which data are accessed, and how they are
accessed during the computation (read and/or write).
StarPU tasks are asynchronous: submitting a task to StarPU is a non-blocking
operation. The task structure can also specify a @b{callback} function that is
called once StarPU has properly executed the task. It also contains optional
fields that the application may use to give hints to the scheduler (such as
priority levels).
By default, task dependencies are inferred from data dependencies (sequential
coherence) by StarPU. The application can however disable sequential coherency
for some data, and express dependencies by hand.
A task may be identified by a unique 64-bit number chosen by the application,
which we refer to as a @b{tag}.
Task dependencies can be enforced by hand either by the means of callback functions, by
submitting other tasks, or by expressing dependencies
between tags (which can thus correspond to tasks that have not been submitted
yet).

@c TODO insert illustration f(Ar, Brw, Cr) + ..

@node StarPU Data Management Library
@subsection StarPU Data Management Library
Because StarPU schedules tasks at runtime, data transfers have to be
done automatically and ``just-in-time'' between processing units,
relieving the application programmer from explicit data transfers.
Moreover, to avoid unnecessary transfers, StarPU keeps data
where it was last needed, even if it was modified there, and it
allows multiple copies of the same data to reside at the same time on
several processing units as long as it is not modified.
A @b{codelet} records pointers to various implementations of the same
theoretical function.

A @b{memory node} can be either the main RAM or GPU-embedded memory.

A @b{bus} is a link between memory nodes.

A @b{data handle} keeps track of replicates of the same data (@b{registered} by the
application) over various memory nodes. The data management library keeps them
coherent.

The @b{home} memory node of a data handle is the memory node from which the data
was registered (usually the main memory node).

A @b{task} represents a scheduled execution of a codelet on some data handles.

A @b{tag} is a rendez-vous point. Tasks typically have their own tag, and can
depend on other tags. The value is chosen by the application.

A @b{worker} executes tasks. There is typically one per CPU computation core and
one per accelerator (for which a whole CPU core is dedicated).

A @b{driver} drives a given kind of worker. There are currently CPU, CUDA,
OpenCL and Gordon drivers. They usually start several workers to actually drive
them.

A @b{performance model} is a (dynamic or static) model of the performance of a
given codelet. Codelets can have an execution time performance model as well as
a power consumption performance model.

A data @b{interface} describes the layout of the data: for a vector, a pointer
for the start, the number of elements and the size of elements; for a matrix, a
pointer for the start, the number of elements per row, the offset between rows,
and the size of each element; etc. To access their data, codelet functions are
given interfaces for the local memory node replicates of the data handles of the
task.
@b{Partitioning} data means dividing the data of a given data handle (called
@b{father}) into a series of @b{children} data handles which designate various
portions of the former.

A @b{filter} is the function which computes children data handles from a father
data handle, and thus describes how the partitioning should be done (horizontal,
vertical, etc.).
@b{Acquiring} a data handle can be done from the main application, to safely
access the data of a data handle from its home node, without having to
unregister it.

@node Research Papers
@subsection Research Papers

Research papers about StarPU can be found at
@indicateurl{http://runtime.bordeaux.inria.fr/Publis/Keyword/STARPU.html}

A good overview is available in the research report at
@indicateurl{http://hal.archives-ouvertes.fr/inria-00467677}
@c ---------------------------------------------------------------------
@c Installing StarPU
@c ---------------------------------------------------------------------

@node Installing StarPU
@chapter Installing StarPU

@menu
* Downloading StarPU::
* Configuration of StarPU::
* Building and Installing StarPU::
@end menu

StarPU can be built and installed by the standard means of the GNU
autotools. The following chapter briefly recalls how these tools
can be used to install StarPU.
@node Downloading StarPU
@section Downloading StarPU

@menu
* Getting Sources::
* Optional dependencies::
@end menu

@node Getting Sources
@subsection Getting Sources

The simplest way to get the StarPU sources is to download the latest official
release tarball from @indicateurl{https://gforge.inria.fr/frs/?group_id=1570},
or the latest nightly snapshot from
@indicateurl{http://starpu.gforge.inria.fr/testing/}. The following documents
how to get the very latest version from the Subversion repository itself; this
should be needed only if you need the very latest changes (i.e. less than a
day old).
The source code is managed by a Subversion server hosted by the
InriaGforge. To get the source code, you first need to install the
Subversion client software if it is not already available on your
system. The software can be obtained from
@indicateurl{http://subversion.tigris.org}. If you are running
on Windows, you will probably prefer to use TortoiseSVN from
@indicateurl{http://tortoisesvn.tigris.org/}.
You can check out the project's SVN repository through anonymous
access. This will provide you with read access to the
repository.

If you need to have write access on the StarPU project, you can also choose to
become a member of the project @code{starpu}. For this, you first need to get
an account on the gForge server. You can then send a request to join the project
(@indicateurl{https://gforge.inria.fr/project/request.php?group_id=1570}).

More information on how to get a gForge account, to become a member of
a project, or on any other related task can be obtained from the
InriaGforge at @indicateurl{https://gforge.inria.fr/}. The most important
thing is to upload your public SSH key on the gForge server (see the
FAQ at @indicateurl{http://siteadmin.gforge.inria.fr/FAQ.html#Q6} for
instructions).
You can now check out the latest version from the Subversion server:

@itemize
@item
using the anonymous access via svn:
@example
% svn checkout svn://scm.gforge.inria.fr/svn/starpu/trunk
@end example
@item
using the anonymous access via https:
@example
% svn checkout --username anonsvn https://scm.gforge.inria.fr/svn/starpu/trunk
@end example
The password is @code{anonsvn}.
@item
using your gForge account:
@example
% svn checkout svn+ssh://<login>@@scm.gforge.inria.fr/svn/starpu/trunk
@end example
@end itemize
The following step requires the availability of @code{autoconf} and
@code{automake} to generate the @code{./configure} script. This is
done by calling @code{./autogen.sh}. The required version for
@code{autoconf} is 2.60 or higher. You will also need @code{makeinfo}.

If the autotools are not available on your machine or not recent
enough, you can choose to download the latest nightly tarball, which
is provided with a @code{configure} script.

@example
% wget http://starpu.gforge.inria.fr/testing/starpu-nightly-latest.tar.gz
@end example
@node Optional dependencies
@subsection Optional dependencies

The topology discovery library @code{hwloc} is not mandatory to use StarPU
but is strongly recommended. It allows for increased performance, and for some
topology-aware scheduling.

@code{hwloc} is available in major distributions and for most OSes and can be
downloaded from @indicateurl{http://www.open-mpi.org/software/hwloc}.
@node Configuration of StarPU
@section Configuration of StarPU

@menu
* Generating Makefiles and configuration scripts::
* Running the configuration::
@end menu

@node Generating Makefiles and configuration scripts
@subsection Generating Makefiles and configuration scripts

This step is not necessary when using the tarball releases of StarPU. If you
are using the source code from the svn repository, you first need to generate
the configure scripts and the Makefiles:

@example
% ./autogen.sh
@end example
@node Running the configuration
@subsection Running the configuration

@example
% ./configure
@end example

Details about options that are useful to give to @code{./configure} are given in
@ref{Compilation configuration}.

@node Building and Installing StarPU
@section Building and Installing StarPU

@example
% make
@end example

@subsection Sanity Checks

In order to make sure that StarPU is working properly on the system, it is also
possible to run a test suite:

@example
% make check
@end example

@subsection Installing

In order to install StarPU at the location that was specified during
configuration:

@example
% make install
@end example
@c ---------------------------------------------------------------------
@c Using StarPU
@c ---------------------------------------------------------------------

@node Using StarPU
@chapter Using StarPU

@menu
* Setting flags for compiling and linking applications::
* Running a basic StarPU application::
* Kernel threads started by StarPU::
* Using accelerators::
@end menu
@node Setting flags for compiling and linking applications
@section Setting flags for compiling and linking applications

Compiling and linking an application against StarPU may require the use of
specific flags or libraries (for instance @code{CUDA} or @code{libspe2}).
To this end, it is possible to use the @code{pkg-config} tool.

If StarPU was not installed at some standard location, the path of StarPU's
library must be specified in the @code{PKG_CONFIG_PATH} environment variable so
that @code{pkg-config} can find it. For example, if StarPU was installed in
@code{$prefix_dir}:

@example
% PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$prefix_dir/lib/pkgconfig
@end example

The flags required to compile or link against StarPU are then
accessible with the following commands:

@example
% pkg-config --cflags libstarpu  # options for the compiler
% pkg-config --libs libstarpu    # options for the linker
@end example
@node Running a basic StarPU application
@section Running a basic StarPU application

Basic examples using StarPU have been built in the directory
@code{$prefix_dir/lib/starpu/examples/}. You can for example run the
example @code{vector_scal}.

@example
% $prefix_dir/lib/starpu/examples/vector_scal
BEFORE : First element was 1.000000
AFTER First element is 3.140000
@end example

When StarPU is used for the first time, the directory
@code{$HOME/.starpu/} is created, and performance models will be stored
there.

Please note that buses are benchmarked when StarPU is launched for the
first time. This may take a few minutes, or less if @code{hwloc} is
installed. This step is done only once per user and per machine.
@node Kernel threads started by StarPU
@section Kernel threads started by StarPU

TODO: StarPU starts one thread per CPU core and binds them there, and uses one of
them per GPU. The application is not supposed to do computations in its own
threads. TODO: add a StarPU function to bind an application thread (e.g. the
main thread) to a dedicated core (and thus disable the corresponding StarPU CPU
worker).
@node Using accelerators
@section Using accelerators

When both CUDA and OpenCL drivers are enabled, StarPU will launch an
OpenCL worker for NVIDIA GPUs only if CUDA is not already running on them.
This design choice was necessary as OpenCL and CUDA can not run at the
same time on the same NVIDIA GPU, as there is currently no interoperability
between them.

Details on how to specify devices running OpenCL and the ones running
CUDA are given in @ref{Enabling OpenCL}.
@c ---------------------------------------------------------------------
@c Basic Examples
@c ---------------------------------------------------------------------

@node Basic Examples
@chapter Basic Examples

@menu
* Compiling and linking options::
* Hello World::                 Submitting Tasks
* Scaling a Vector::            Manipulating Data
* Vector Scaling on an Hybrid CPU/GPU Machine::  Handling Heterogeneous Architectures
* Task and Worker Profiling::
* Partitioning Data::           Partitioning Data
* Performance model example::
* Theoretical lower bound on execution time::
* Insert Task Utility::
* More examples::               More examples shipped with StarPU
* Debugging::                   When things go wrong.
@end menu
@node Compiling and linking options
@section Compiling and linking options

Let's suppose StarPU has been installed in the directory
@code{$STARPU_DIR}. As explained in @ref{Setting flags for compiling and linking applications},
the variable @code{PKG_CONFIG_PATH} needs to be set. It is also
necessary to set the variable @code{LD_LIBRARY_PATH} to locate dynamic
libraries at runtime.

@example
% PKG_CONFIG_PATH=$STARPU_DIR/lib/pkgconfig:$PKG_CONFIG_PATH
% LD_LIBRARY_PATH=$STARPU_DIR/lib:$LD_LIBRARY_PATH
@end example

The Makefile could for instance contain the following lines to define which
options must be given to the compiler and to the linker:

@example
CFLAGS  += $$(pkg-config --cflags libstarpu)
LDFLAGS += $$(pkg-config --libs libstarpu)
@end example
@node Hello World
@section Hello World

@menu
* Required Headers::
* Defining a Codelet::
* Submitting a Task::
* Execution of Hello World::
@end menu

In this section, we show how to implement a simple program that submits a task to StarPU.

@node Required Headers
@subsection Required Headers

The @code{starpu.h} header should be included in any code using StarPU.

@example
#include <starpu.h>
@end example
@node Defining a Codelet
@subsection Defining a Codelet

@example
struct params @{
    int i;
    float f;
@};

void cpu_func(void *buffers[], void *cl_arg)
@{
    struct params *params = cl_arg;

    printf("Hello world (params = @{%i, %f@} )\n", params->i, params->f);
@}

starpu_codelet cl =
@{
    .where = STARPU_CPU,
    .cpu_func = cpu_func,
    .nbuffers = 0
@};
@end example
A codelet is a structure that represents a computational kernel. Such a codelet
may contain an implementation of the same kernel on different architectures
(e.g. CUDA, Cell's SPU, x86, ...).

The @code{nbuffers} field specifies the number of data buffers that are
manipulated by the codelet: here the codelet does not access or modify any data
that is controlled by our data management library. Note that the argument
passed to the codelet (the @code{cl_arg} field of the @code{starpu_task}
structure) does not count as a buffer since it is not managed by our data
management library, but just contains trivial parameters.
@c TODO need a crossref to the proper description of "where" see bla for more ...
We create a codelet which may only be executed on the CPUs. The @code{where}
field is a bitmask that defines where the codelet may be executed. Here, the
@code{STARPU_CPU} value means that only CPUs can execute this codelet
(@pxref{Codelets and Tasks} for more details on this field).
When a CPU core executes a codelet, it calls the @code{cpu_func} function,
which @emph{must} have the following prototype:

@code{void (*cpu_func)(void *buffers[], void *cl_arg);}

In this example, we can ignore the first argument of this function, which gives a
description of the input and output buffers (e.g. the size and the location of
the matrices), since there is none.
The second argument is a pointer to a buffer passed as an
argument to the codelet by means of the @code{cl_arg} field of the
@code{starpu_task} structure.
@c TODO rewrite so that it is a little clearer ?
Be aware that this may be a pointer to a
@emph{copy} of the actual buffer, and not the pointer given by the programmer:
if the codelet modifies this buffer, there is no guarantee that the initial
buffer will be modified as well: this for instance implies that the buffer
cannot be used as a synchronization medium. If synchronization is needed, data
has to be registered to StarPU, see @ref{Scaling a Vector}.
@node Submitting a Task
@subsection Submitting a Task

@example
void callback_func(void *callback_arg)
@{
    printf("Callback function (arg %x)\n", callback_arg);
@}

int main(int argc, char **argv)
@{
    /* @b{initialize StarPU} */
    starpu_init(NULL);

    struct starpu_task *task = starpu_task_create();

    task->cl = &cl; /* @b{Pointer to the codelet defined above} */

    struct params params = @{ 1, 2.0f @};
    task->cl_arg = &params;
    task->cl_arg_size = sizeof(params);

    task->callback_func = callback_func;
    task->callback_arg = 0x42;

    /* @b{starpu_task_submit will be a blocking call} */
    task->synchronous = 1;

    /* @b{submit the task to StarPU} */
    starpu_task_submit(task);

    /* @b{terminate StarPU} */
    starpu_shutdown();

    return 0;
@}
@end example
Before submitting any tasks to StarPU, @code{starpu_init} must be called. The
@code{NULL} argument specifies that we use the default configuration. Tasks cannot
be submitted after the termination of StarPU by a call to
@code{starpu_shutdown}.

In the example above, a task structure is allocated by a call to
@code{starpu_task_create}. This function only allocates and fills the
corresponding structure with the default settings (@pxref{Codelets and
Tasks, starpu_task_create}), but it does not submit the task to StarPU.

@c not really clear ;)
The @code{cl} field is a pointer to the codelet which the task will
execute: in other words, the codelet structure describes which computational
kernel should be offloaded on the different architectures, and the task
structure is a wrapper containing a codelet and the piece of data on which the
codelet should operate.
The optional @code{cl_arg} field is a pointer to a buffer (of size
@code{cl_arg_size}) with some parameters for the kernel
described by the codelet. For instance, if a codelet implements a computational
kernel that multiplies its input vector by a constant, the constant could be
specified by means of this buffer, instead of registering it as StarPU
data. It must however be noted that StarPU avoids making copies whenever possible
and rather passes the pointer as such, so the buffer which is pointed to must be
kept allocated until the task terminates, and if several tasks are submitted
with various parameters, each of them must be given a pointer to their own
buffer.
Once a task has been executed, an optional callback function is called.
While the computational kernel could be offloaded on various architectures, the
callback function is always executed on a CPU. The @code{callback_arg}
pointer is passed as an argument to the callback. The prototype of a callback
function is:

@code{void (*callback_function)(void *);}

If the @code{synchronous} field is non-zero, task submission will be
synchronous: the @code{starpu_task_submit} function will not return until the
task has been executed. Note that the @code{starpu_shutdown} method does not
guarantee that asynchronous tasks have been executed before it returns;
@code{starpu_task_wait_for_all} can be used to that effect, or data can be
unregistered (@code{starpu_data_unregister(vector_handle);}), which will
implicitly wait for all the tasks scheduled to work on it, unless explicitly
disabled thanks to @code{starpu_data_set_default_sequential_consistency_flag} or
@code{starpu_data_set_sequential_consistency_flag}.
@node Execution of Hello World
@subsection Execution of Hello World

@example
% cc $(pkg-config --cflags libstarpu) $(pkg-config --libs libstarpu) hello_world.c -o hello_world
% ./hello_world
Hello world (params = @{1, 2.000000@} )
Callback function (arg 42)
@end example
@node Scaling a Vector
@section Manipulating Data: Scaling a Vector

The previous example has shown how to submit tasks. In this section,
we show how StarPU tasks can manipulate data. The full source code for
this example is given in @ref{Full source code for the 'Scaling a Vector' example}.

@menu
* Source code of Vector Scaling::
* Execution of Vector Scaling::
@end menu
@node Source code of Vector Scaling
@subsection Source code of Vector Scaling

Programmers can describe the data layout of their application so that StarPU is
responsible for enforcing data coherency and availability across the machine.
Instead of handling complex (and non-portable) mechanisms to perform data
movements, programmers only declare which piece of data is accessed and/or
modified by a task, and StarPU makes sure that when a computational kernel
starts somewhere (e.g. on a GPU), its data are available locally.

Before submitting those tasks, the programmer first needs to declare the
different pieces of data to StarPU using the @code{starpu_*_data_register}
functions. To ease the development of applications for StarPU, it is possible
to describe multiple types of data layout. A type of data layout is called an
@b{interface}. There are different predefined interfaces available in StarPU:
here we will consider the @b{vector interface}.

The following lines show how to declare an array of @code{NX} elements of type
@code{float} using the vector interface:
@example
float vector[NX];

starpu_data_handle vector_handle;
starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX,
                            sizeof(vector[0]));
@end example

The first argument, called the @b{data handle}, is an opaque pointer which
designates the array in StarPU. This is also the structure which is used to
describe which data is used by a task. The second argument is the node number
where the data originally resides. Here it is 0 since the @code{vector} array is in
the main memory. Then comes the pointer @code{vector} where the data can be found in main memory,
the number of elements in the vector and the size of each element.
The following shows how to construct a StarPU task that will manipulate the
vector and a constant factor.
@example
float factor = 3.14;
struct starpu_task *task = starpu_task_create();

task->cl = &cl;                          /* @b{Pointer to the codelet defined below} */
task->buffers[0].handle = vector_handle; /* @b{First parameter of the codelet} */
task->buffers[0].mode = STARPU_RW;
task->cl_arg = &factor;
task->cl_arg_size = sizeof(factor);
task->synchronous = 1;

starpu_task_submit(task);
@end example
Since the factor is a mere constant float value parameter,
it does not need a preliminary registration, and
can just be passed through the @code{cl_arg} pointer like in the previous
example. The vector parameter is described by its handle.
There are two fields in each element of the @code{buffers} array.
@code{handle} is the handle of the data, and @code{mode} specifies how the
kernel will access the data (@code{STARPU_R} for read-only, @code{STARPU_W} for
write-only and @code{STARPU_RW} for read and write access).
The definition of the codelet can be written as follows:

@example
void scal_cpu_func(void *buffers[], void *cl_arg)
@{
    unsigned i;
    float *factor = cl_arg;

    /* length of the vector */
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    /* CPU copy of the vector pointer */
    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);

    for (i = 0; i < n; i++)
        val[i] *= *factor;
@}

starpu_codelet cl = @{
    .where = STARPU_CPU,
    .cpu_func = scal_cpu_func,
    .nbuffers = 1
@};
@end example
The first argument is an array that gives
a description of all the buffers passed in the @code{task->buffers} array. The
size of this array is given by the @code{nbuffers} field of the codelet
structure. For the sake of genericity, this array contains pointers to the
different interfaces describing each buffer. In the case of the @b{vector
interface}, the location of the vector (resp. its length) is accessible in the
@code{ptr} (resp. @code{nx}) field of this array. Since the vector is accessed in a
read-write fashion, any modification will automatically affect future accesses
to this vector made by other tasks.

The second argument of the @code{scal_cpu_func} function contains a pointer to the
parameters of the codelet (given in @code{task->cl_arg}), so that we read the
constant factor from this pointer.
@node Execution of Vector Scaling
@subsection Execution of Vector Scaling

@example
% cc $(pkg-config --cflags libstarpu) $(pkg-config --libs libstarpu) vector_scal.c -o vector_scal
% ./vector_scal
0.000000 3.000000 6.000000 9.000000 12.000000
@end example
@node Vector Scaling on an Hybrid CPU/GPU Machine
@section Vector Scaling on an Hybrid CPU/GPU Machine

Contrary to the previous examples, the task submitted in this example may not
only be executed by the CPUs, but also by a CUDA device.

@menu
* Definition of the CUDA Kernel::
* Definition of the OpenCL Kernel::
* Definition of the Main Code::
* Execution of Hybrid Vector Scaling::
@end menu
@node Definition of the CUDA Kernel
@subsection Definition of the CUDA Kernel

The CUDA implementation can be written as follows. It needs to be compiled with
a CUDA compiler such as nvcc, the NVIDIA CUDA compiler driver. It must be noted
that the vector pointer returned by @code{STARPU_VECTOR_GET_PTR} is here a pointer in GPU
memory, so that it can be passed as such to the @code{vector_mult_cuda} kernel
call.
@example
static __global__ void vector_mult_cuda(float *val, unsigned n,
                                        float factor)
@{
    unsigned i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        val[i] *= factor;
@}

extern "C" void scal_cuda_func(void *buffers[], void *_args)
@{
    float *factor = (float *)_args;

    /* length of the vector */
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    /* CUDA copy of the vector pointer */
    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned threads_per_block = 64;
    unsigned nblocks = (n + threads_per_block-1) / threads_per_block;

@i{    vector_mult_cuda<<<nblocks,threads_per_block, 0, starpu_cuda_get_local_stream()>>>(val, n, *factor);}

@i{    cudaStreamSynchronize(starpu_cuda_get_local_stream());}
@}
@end example
@node Definition of the OpenCL Kernel
@subsection Definition of the OpenCL Kernel

The OpenCL implementation can be written as follows. StarPU provides
tools to compile an OpenCL kernel stored in a file.

@example
__kernel void vector_mult_opencl(__global float* val, int nx, float factor)
@{
    const int i = get_global_id(0);
    if (i < nx) @{
        val[i] *= factor;
    @}
@}
@end example
907
@i{#include <starpu_opencl.h>}
909
@i{extern struct starpu_opencl_program programs;}
911
void scal_opencl_func(void *buffers[], void *_args)
913
float *factor = _args;
914
@i{ int id, devid, err;}
915
@i{ cl_kernel kernel;}
916
@i{ cl_command_queue queue;}
919
/* length of the vector */
920
unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
921
/* OpenCL copy of the vector pointer */
922
cl_mem val = (cl_mem) STARPU_VECTOR_GET_PTR(buffers[0]);
924
@i{ id = starpu_worker_get_id();}
925
@i{ devid = starpu_worker_get_devid(id);}
927
@i{ err = starpu_opencl_load_kernel(&kernel, &queue, &programs,}
928
@i{ "vector_mult_opencl", devid); /* @b{Name of the codelet defined above} */}
929
@i{ if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);}
931
@i{ err = clSetKernelArg(kernel, 0, sizeof(val), &val);}
932
@i{ err |= clSetKernelArg(kernel, 1, sizeof(n), &n);}
933
@i{ err |= clSetKernelArg(kernel, 2, sizeof(*factor), factor);}
934
@i{ if (err) STARPU_OPENCL_REPORT_ERROR(err);}
937
@i{ size_t global=1;}
939
@i{ err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &event);}
940
@i{ if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);}
943
@i{ clFinish(queue);}
944
@i{ starpu_opencl_collect_stats(event);}
945
@i{ clReleaseEvent(event);}
947
@i{ starpu_opencl_release_kernel(kernel);}
@node Definition of the Main Code
@subsection Definition of the Main Code

The CPU implementation is the same as in the previous section.

Here is the source of the main application. You can notice the value of the
field @code{where} for the codelet. We specify
@code{STARPU_CPU|STARPU_CUDA|STARPU_OPENCL} to indicate to StarPU that the codelet
can be executed either on a CPU or on a CUDA or an OpenCL device.
@example
extern void scal_cuda_func(void *buffers[], void *_args);
extern void scal_cpu_func(void *buffers[], void *_args);
extern void scal_opencl_func(void *buffers[], void *_args);

/* @b{Definition of the codelet} */
static starpu_codelet cl = @{
    .where = STARPU_CPU|STARPU_CUDA|STARPU_OPENCL, /* @b{It can be executed on a CPU,} */
                                                   /* @b{on a CUDA device, or on an OpenCL device} */
    .cuda_func = scal_cuda_func,
    .cpu_func = scal_cpu_func,
    .opencl_func = scal_opencl_func,
    .nbuffers = 1
@};
@end example
983
#ifdef STARPU_USE_OPENCL
984
/* @b{The compiled version of the OpenCL program} */
985
struct starpu_opencl_program programs;
988
int main(int argc, char **argv)
@{
    float *vector;
    int i, ret;
    float factor = 3.0;
    struct starpu_task *task;
    starpu_data_handle vector_handle;

    starpu_init(NULL);                            /* @b{Initialising StarPU} */

#ifdef STARPU_USE_OPENCL
    starpu_opencl_load_opencl_from_file(
            "examples/basic_examples/vector_scal_opencl_codelet.cl",
            &programs);
#endif

    vector = malloc(NX*sizeof(vector[0]));
    for(i=0 ; i<NX ; i++) vector[i] = i;

    /* @b{Registering data within StarPU} */
    starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector,
                                NX, sizeof(vector[0]));

    /* @b{Definition of the task} */
    task = starpu_task_create();
    task->cl = &cl;
    task->buffers[0].handle = vector_handle;
    task->buffers[0].mode = STARPU_RW;
    task->cl_arg = &factor;
    task->cl_arg_size = sizeof(factor);

    /* @b{Submitting the task} */
    ret = starpu_task_submit(task);
    if (ret == -ENODEV) @{
        fprintf(stderr, "No worker may execute this task\n");
        return 1;
    @}

@c TODO: Mmm, should rather be an unregistration with an implicit dependency, no?
    /* @b{Waiting for its termination} */
    starpu_task_wait_for_all();

    /* @b{Update the vector in RAM} */
    starpu_data_acquire(vector_handle, STARPU_R);

    /* @b{Access the data} */
    for(i=0 ; i<NX; i++) @{
        fprintf(stderr, "%f ", vector[i]);
    @}
    fprintf(stderr, "\n");

    /* @b{Release the RAM view of the data before unregistering it and shutting down StarPU} */
    starpu_data_release(vector_handle);
    starpu_data_unregister(vector_handle);

    starpu_shutdown();

    return 0;
@}
@node Execution of Hybrid Vector Scaling
@subsection Execution of Hybrid Vector Scaling

The Makefile given at the beginning of the section must be extended to
give the rules to compile the CUDA source code. Note that the source
file of the OpenCL kernel does not need to be compiled now, it will
be compiled at run-time when calling the function
@code{starpu_opencl_load_opencl_from_file()} (@pxref{starpu_opencl_load_opencl_from_file}).

CFLAGS  += $(shell pkg-config --cflags libstarpu)
LDFLAGS += $(shell pkg-config --libs libstarpu)

vector_scal: vector_scal.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o

%.o: %.cu
	nvcc $(CFLAGS) $< -c -o $@

clean:
	rm -f vector_scal *.o

and to execute it, with the default configuration:

% ./vector_scal
0.000000 3.000000 6.000000 9.000000 12.000000

or for example, by disabling CPU devices:

% STARPU_NCPUS=0 ./vector_scal
0.000000 3.000000 6.000000 9.000000 12.000000

or by disabling CUDA devices (which may allow OpenCL devices to be used instead,
see @ref{Using accelerators}):

% STARPU_NCUDA=0 ./vector_scal
0.000000 3.000000 6.000000 9.000000 12.000000
@node Task and Worker Profiling
@section Task and Worker Profiling

A full example showing how to use the profiling API is available in
the StarPU sources in the directory @code{examples/profiling/}.

struct starpu_task *task = starpu_task_create();
task->synchronous = 1;
/* We will destroy the task structure by hand so that we can
 * query the profiling info before the task is destroyed. */
task->destroy = 0;

/* Submit and wait for completion (since synchronous was set to 1) */
starpu_task_submit(task);

/* The task is finished, get profiling information */
struct starpu_task_profiling_info *info = task->profiling_info;

/* How much time did it take before the task started ? */
double delay = starpu_timing_timespec_delay_us(&info->submit_time, &info->start_time);

/* How long was the task execution ? */
double length = starpu_timing_timespec_delay_us(&info->start_time, &info->end_time);

/* We don't need the task structure anymore */
starpu_task_destroy(task);

/* Display the occupancy of all workers during the test */
int worker;
for (worker = 0; worker < starpu_worker_get_count(); worker++)
@{
    struct starpu_worker_profiling_info worker_info;
    int ret = starpu_worker_get_profiling_info(worker, &worker_info);
    STARPU_ASSERT(!ret);

    double total_time = starpu_timing_timespec_to_us(&worker_info.total_time);
    double executing_time = starpu_timing_timespec_to_us(&worker_info.executing_time);
    double sleeping_time = starpu_timing_timespec_to_us(&worker_info.sleeping_time);

    float executing_ratio = 100.0*executing_time/total_time;
    float sleeping_ratio = 100.0*sleeping_time/total_time;

    char workername[128];
    starpu_worker_get_name(worker, workername, 128);
    fprintf(stderr, "Worker %s:\n", workername);
    fprintf(stderr, "\ttotal time : %.2lf ms\n", total_time*1e-3);
    fprintf(stderr, "\texec time  : %.2lf ms (%.2f %%)\n", executing_time*1e-3,
            executing_ratio);
    fprintf(stderr, "\tblocked time : %.2lf ms (%.2f %%)\n", sleeping_time*1e-3,
            sleeping_ratio);
@}
@node Partitioning Data
@section Partitioning Data

An existing piece of data can be partitioned in sub parts to be used by different tasks, for instance:

starpu_data_handle handle;

/* Declare data to StarPU */
starpu_vector_data_register(&handle, 0, (uintptr_t)vector, NX, sizeof(vector[0]));

/* Partition the vector in PARTS sub-vectors */
struct starpu_data_filter f = @{
    .filter_func = starpu_block_filter_func_vector,
    .nchildren = PARTS,
    .get_nchildren = NULL,
    .get_child_ops = NULL
@};
starpu_data_partition(handle, &f);

/* Submit a task on each sub-vector */
for (i=0; i<starpu_data_get_nb_children(handle); i++) @{
    /* Get subdata number i (there is only 1 dimension) */
    starpu_data_handle sub_handle = starpu_data_get_sub_data(handle, 1, i);
    struct starpu_task *task = starpu_task_create();

    task->buffers[0].handle = sub_handle;
    task->buffers[0].mode = STARPU_RW;
    task->synchronous = 1;
    task->cl_arg = &factor;
    task->cl_arg_size = sizeof(factor);

    starpu_task_submit(task);
@}

Partitioning can be applied several times, see
@code{examples/basic_examples/mult.c} and @code{examples/filters/}.
@node Performance model example
@section Performance model example

To achieve good scheduling, StarPU scheduling policies need to be able to
estimate in advance the duration of a task. This is done by giving to codelets a
performance model. There are several kinds of performance models.

Providing an estimation from the application itself (@code{STARPU_COMMON} model
type and @code{cost_model} field), see for instance
@code{examples/common/blas_model.h} and @code{examples/common/blas_model.c}. It
can also be provided for each architecture (@code{STARPU_PER_ARCH} model type
and @code{per_arch} field).

Measured at runtime (@code{STARPU_HISTORY_BASED} model type). This assumes that for a
given set of data input/output sizes, the performance will always be about the
same. This is very true for regular kernels on GPUs for instance (<0.1% error),
and just a bit less true on CPUs (~=1% error). This also assumes that there are
few different sets of data input/output sizes. StarPU will then keep record of
the average time of previous executions on the various processing units, and use
it as an estimation. History is done per task size, by using a hash of the input
and output sizes as an index.
It will also save it in @code{~/.starpu/sampling/codelets}
for further executions, and it can be observed by using the
@code{starpu_perfmodel_display} command. The models are indexed by machine name. To
share the models between machines (e.g. for a homogeneous cluster), use
@code{export STARPU_HOSTNAME=some_global_name}. The following is a small code example.
static struct starpu_perfmodel_t mult_perf_model = @{
    .type = STARPU_HISTORY_BASED,
    .symbol = "mult_perf_model"
@};

starpu_codelet cl = @{
    .where = STARPU_CPU,
    .cpu_func = cpu_mult,
    /* for the scheduling policy to be able to use performance models */
    .model = &mult_perf_model
@};
Measured at runtime and refined by regression (@code{STARPU_REGRESSION_*_BASED}
model type). This still assumes performance regularity, but can work
with various data input sizes, by applying regression over observed
execution times. @code{STARPU_REGRESSION_BASED} uses an a*n^b regression
form, while @code{STARPU_NL_REGRESSION_BASED} uses an a*n^b+c form (more precise
than @code{STARPU_REGRESSION_BASED}, but a lot more costly to compute).

Provided explicitly by the application (@code{STARPU_PER_ARCH} model type): the
@code{.per_arch[i].cost_model} fields have to be filled with pointers to
functions which return the expected duration of the task in micro-seconds, one
per architecture.

How to use schedulers which can benefit from such performance models is explained
in @ref{Task scheduling policy}.

The same can be done for task power consumption estimation, by setting the
@code{power_model} field the same way as the @code{model} field. Note: for
now, the application has to give the power consumption performance model
a name which is different from that of the execution time performance model.
@node Theoretical lower bound on execution time
@section Theoretical lower bound on execution time

For kernels with history-based performance models, StarPU can very easily provide a theoretical lower
bound for the execution time of a whole set of tasks. See for
instance @code{examples/lu/lu_example.c}: before submitting tasks,
call @code{starpu_bound_start}, and after complete execution, call
@code{starpu_bound_stop}. @code{starpu_bound_print_lp} or
@code{starpu_bound_print_mps} can then be used to output a Linear Programming
problem corresponding to the schedule of your tasks. Run it through
@code{lp_solve} or any other linear programming solver, and that will give you a
lower bound for the total execution time of your tasks. If StarPU was compiled
with the glpk library installed, @code{starpu_bound_compute} can be used to
solve it immediately and get the optimized minimum. Its @code{integer}
parameter decides whether integer resolution should be computed.

The @code{deps} parameter tells StarPU whether to take tasks and implicit data
dependencies into account. Be aware that the linear programming
problem size is quadratic in the number of tasks, so solving it
can take very long: even a few dozen tasks can require minutes. You should
probably use @code{lp_solve -timeout 1 test.lp -wmps test.mps} to convert the
problem to MPS format and then use a better solver; @code{glpsol} might be
better than @code{lp_solve} for instance (the @code{--pcost} option may be
useful), but sometimes does not manage to converge. @code{cbc} might look
slower, but it is parallel. Be sure to try at least all the @code{-B} options
of @code{lp_solve}. For instance, we often just use
@code{lp_solve -cc -B1 -Bb -Bg -Bp -Bf -Br -BG -Bd -Bs -BB -Bo -Bc -Bi}, and
the @code{-gr} option can also be quite useful.

Setting @code{deps} to 0 will only take into account the actual computations
on processing units. It however still properly takes into account the varying
performances of kernels and processing units, which is much more accurate than
just comparing StarPU performances with the fastest of the kernels being used.

The @code{prio} parameter tells StarPU whether to simulate taking into account
the priorities as the StarPU scheduler would, i.e. schedule prioritized
tasks before less prioritized tasks, to check to what extent this results
in a less optimal solution. This increases the computation time even more.

Note that for simplicity, all this does not take into account data
transfers, which are assumed to be completely overlapped.
@node Insert Task Utility
@section Insert Task Utility

StarPU provides the wrapper function @code{starpu_insert_task} to ease
the creation and submission of tasks.

@deftypefun int starpu_insert_task (starpu_codelet *@var{cl}, ...)
Create and submit a task corresponding to @var{cl} with the following
arguments. The argument list must be zero-terminated.

The arguments following the codelet can be of the following types:

@itemize @bullet
@item
@code{STARPU_R}, @code{STARPU_W}, @code{STARPU_RW}, @code{STARPU_SCRATCH} or @code{STARPU_REDUX}: an access mode followed by a data handle;
@item
@code{STARPU_VALUE} followed by a pointer to a constant value and
the size of the constant;
@item
@code{STARPU_CALLBACK} followed by a pointer to a callback function;
@item
@code{STARPU_CALLBACK_ARG} followed by a pointer to be given as an
argument to the callback function;
@item
@code{STARPU_PRIORITY} followed by an integer defining a priority level.
@end itemize

Parameters to be passed to the codelet implementation are defined
through the type @code{STARPU_VALUE}. The function
@code{starpu_unpack_cl_args} must be called within the codelet
implementation to retrieve them.
Here is the implementation of the codelet:

void func_cpu(void *descr[], void *_args)
@{
    int *x0 = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);
    float *x1 = (float *)STARPU_VARIABLE_GET_PTR(descr[1]);
    int ifactor;
    float ffactor;

    starpu_unpack_cl_args(_args, &ifactor, &ffactor);
    *x0 = *x0 * ifactor;
    *x1 = *x1 * ffactor;
@}

starpu_codelet mycodelet = @{
    .where = STARPU_CPU,
    .cpu_func = func_cpu,
    .nbuffers = 2
@};

And the call to the @code{starpu_insert_task} wrapper:

starpu_insert_task(&mycodelet,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   STARPU_RW, data_handles[0], STARPU_RW, data_handles[1],
                   0);
The call to @code{starpu_insert_task} is equivalent to the following
code:

struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->buffers[0].handle = data_handles[0];
task->buffers[0].mode = STARPU_RW;
task->buffers[1].handle = data_handles[1];
task->buffers[1].mode = STARPU_RW;

char *arg_buffer;
size_t arg_buffer_size;
starpu_pack_cl_args(&arg_buffer, &arg_buffer_size,
                    STARPU_VALUE, &ifactor, sizeof(ifactor),
                    STARPU_VALUE, &ffactor, sizeof(ffactor),
                    0);
task->cl_arg = arg_buffer;
task->cl_arg_size = arg_buffer_size;
int ret = starpu_task_submit(task);
@node Debugging
@section Debugging

StarPU provides several tools to help debugging applications. Execution traces
can be generated and displayed graphically, see @ref{Generating traces}. Some
gdb helpers are also provided to show the whole StarPU state:

(gdb) source tools/gdbinit
@node More examples
@section More examples

More examples are available in the StarPU sources in the @code{examples/}
directory. Simple examples include:

@table @asis
@item @code{incrementer/}:
Trivial incrementation test.
@item @code{basic_examples/}:
Simple documented Hello world (as shown in @ref{Hello World}), vector/scalar product (as shown
in @ref{Vector Scaling on an Hybrid CPU/GPU Machine}), matrix
product examples (as shown in @ref{Performance model example}), an example using the blocked matrix data
interface, and an example using the variable data interface.
@item @code{matvecmult/}:
OpenCL example from NVidia, adapted to StarPU.
@item @code{axpy/}:
AXPY CUBLAS operation adapted to StarPU.
@item @code{fortran/}:
Example of Fortran bindings.
@end table

More advanced examples include:

@table @asis
@item @code{filters/}:
Examples using filters, as shown in @ref{Partitioning Data}.
@item @code{lu/}:
LU matrix factorization, see for instance @code{xlu_implicit.c}.
@item @code{cholesky/}:
Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.
@end table
@c ---------------------------------------------------------------------
@c Performance options
@c ---------------------------------------------------------------------

@node Performance optimization
@chapter How to optimize performance with StarPU

@menu
* Data management::
* Task submission::
* Task priorities::
* Task scheduling policy::
* Performance model calibration::
* Task distribution vs Data transfer::
* Data prefetch::
* Power-based scheduling::
* Profiling::
* CUDA-specific optimizations::
@end menu

Simply encapsulating application kernels into tasks already makes it possible to
seamlessly support CPUs and GPUs at the same time. To achieve good performance, a
few additional changes are needed.
@node Data management
@section Data management

When the application allocates data, whenever possible it should use the
@code{starpu_malloc} function, which will ask CUDA or
OpenCL to make the allocation itself and pin the corresponding allocated
memory. This is needed to permit asynchronous data transfers, i.e. to let data
transfers overlap with computations.

By default, StarPU leaves replicates of data wherever they were used, in case they
will be re-used by other tasks, thus saving the data transfer time. When some
task modifies some data, all the other replicates are invalidated, and only the
processing unit which ran that task will have a valid replicate of the data. If the application knows
that this data will not be re-used by further tasks, it should advise StarPU to
immediately replicate it to a desired list of memory nodes (given through a
bitmask). This is similar to the write-through mode of CPU caches.

starpu_data_set_wt_mask(img_handle, 1<<0);

will for instance request to always transfer a replicate into the main memory (node
0), as bit 0 of the write-through bitmask is set.
@node Task submission
@section Task submission

To let StarPU make online optimizations, tasks should be submitted
asynchronously as much as possible. Ideally, all the tasks should be
submitted, with only calls to @code{starpu_task_wait_for_all} or
@code{starpu_data_unregister} made to wait for
termination. StarPU will then be able to rework the whole schedule, overlap
computation with communication, manage accelerator local memory usage, etc.

@node Task priorities
@section Task priorities

By default, StarPU will consider the tasks in the order they are submitted by
the application. If the application programmer knows that some tasks should
be performed in priority (for instance because their output is needed by many
other tasks and may thus be a bottleneck if not executed early enough), the
@code{priority} field of the task structure should be set to transmit the
priority information to StarPU.
@node Task scheduling policy
@section Task scheduling policy

By default, StarPU uses the @code{eager} simple greedy scheduler. This is
because it provides correct load balance even if the application codelets do not
have performance models. If your application codelets have performance models
(@pxref{Performance model example} for examples showing how to do it),
you should change the scheduler thanks to the @code{STARPU_SCHED} environment
variable. For instance @code{export STARPU_SCHED=dmda}. Use @code{help} to get
the list of available schedulers.

The @b{eager} scheduler uses a central task queue, from which workers draw tasks
to work on. This however does not allow prefetching data, since the scheduling
decision is taken late. If a task has a non-0 priority, it is put at the front of the queue.

The @b{prio} scheduler also uses a central task queue, but sorts tasks by
priority (between -5 and 5).

The @b{random} scheduler distributes tasks randomly according to assumed worker
overall performance.

The @b{ws} (work stealing) scheduler schedules tasks on the local worker by
default. When a worker becomes idle, it steals a task from the most loaded
worker.

The @b{dm} (deque model) scheduler takes task execution performance models into account to
perform an HEFT-like scheduling strategy: it schedules tasks where their
termination time will be minimal.

The @b{dmda} (deque model data aware) scheduler is similar to dm, but it also takes
data transfer time into account.

The @b{dmdar} (deque model data aware ready) scheduler is similar to dmda, but
it also sorts tasks on per-worker queues by the number of already-available data
buffers.

The @b{dmdas} (deque model data aware sorted) scheduler is similar to dmda, but it
also supports arbitrary priority values.

The @b{heft} (HEFT) scheduler is similar to dmda, but it also supports task bundles.

The @b{pheft} (parallel HEFT) scheduler is similar to heft, but it also supports
parallel tasks (still experimental).

The @b{pgreedy} (parallel greedy) scheduler is similar to greedy, but it also
supports parallel tasks (still experimental).
@node Performance model calibration
@section Performance model calibration

Most schedulers are based on an estimation of codelet duration on each kind
of processing unit. For this to be possible, the application programmer needs
to configure a performance model for the codelets of the application (see
@ref{Performance model example} for instance). History-based performance models
use on-line calibration. StarPU will automatically calibrate codelets
which have never been calibrated yet, and save the result in
@code{~/.starpu/sampling/codelets}.
The models are indexed by machine name. To share the models between machines
(e.g. for a homogeneous cluster), use @code{export STARPU_HOSTNAME=some_global_name}.
To force continuing calibration, use
@code{export STARPU_CALIBRATE=1}. This may be necessary if your application
has not-so-stable performance. Details on the current performance model status
can be obtained from the @code{starpu_perfmodel_display} command: the @code{-l}
option lists the available performance models, and the @code{-s} option permits
choosing the performance model to be displayed. The result looks like:

$ starpu_perfmodel_display -s starpu_dlu_lu_model_22
performance model for cpu
# hash      size       mean          dev           n
5c6c3401    1572864    1.216300e+04  2.277778e+03  1240

This shows that for the LU 22 kernel with a 1.5MiB matrix, the average
execution time on CPUs was about 12ms, with a 2ms standard deviation, over
1240 samples. It is a good idea to check this before doing actual performance
measurements.

If a kernel source code was modified (e.g. for a performance improvement), the
calibration information is stale and should be dropped, to re-calibrate from
scratch. This can be done by using @code{export STARPU_CALIBRATE=2}.

Note: due to CUDA limitations, to be able to measure kernel duration,
calibration mode needs to disable asynchronous data transfers. Calibration thus
disables data transfer / computation overlapping, and should thus not be used
for final benchmarks. Note 2: history-based performance models get calibrated
only if a performance-model-based scheduler is chosen.
@node Task distribution vs Data transfer
@section Task distribution vs Data transfer

Distributing tasks to balance the load induces data transfer penalties. StarPU
thus needs to find a balance between both. The target function that the
@code{dmda} scheduler of StarPU
tries to minimize is @code{alpha * T_execution + beta * T_data_transfer}, where
@code{T_execution} is the estimated execution time of the codelet (usually
accurate), and @code{T_data_transfer} is the estimated data transfer time. The
latter is estimated based on bus calibration before execution start,
i.e. with an idle machine, thus without contention. You can force bus re-calibration by running
@code{starpu_calibrate_bus}. The beta parameter defaults to 1, but it can be
worth trying to tweak it by using @code{export STARPU_BETA=2} for instance,
since during real application execution, contention makes transfer times bigger.
This is of course imprecise, but in practice, a rough estimation already gives
as good results as a precise estimation would.
@node Data prefetch
@section Data prefetch

The @code{heft}, @code{dmda} and @code{pheft} scheduling policies perform data prefetch (see @ref{STARPU_PREFETCH}):
as soon as a scheduling decision is taken for a task, requests are issued to
transfer its required data to the target processing unit, if needed, so that
when the processing unit actually starts the task, its data will hopefully be
already available and it will not have to wait for the transfer to finish.

The application may want to perform some manual prefetching, for several reasons
such as excluding initial data transfers from performance measurements, or
setting up an initial statically-computed data distribution on the machine
before submitting tasks, which will thus guide StarPU toward an initial task
distribution (since StarPU will try to avoid further transfers).

This can be achieved by giving the @code{starpu_data_prefetch_on_node} function
the handle and the desired target memory node.
@node Power-based scheduling
@section Power-based scheduling

If the application can provide some power performance model (through
the @code{power_model} field of the codelet structure), StarPU will
take it into account when distributing tasks. The target function that
the @code{dmda} scheduler minimizes becomes @code{alpha * T_execution +
beta * T_data_transfer + gamma * Consumption}, where @code{Consumption}
is the estimated task consumption in Joules. To tune this parameter, use
@code{export STARPU_GAMMA=3000} for instance, to express that each Joule
(i.e. 1 kW during 1000 us) is worth a 3000 us execution time penalty. Setting
@code{alpha} and @code{beta} to zero permits taking only power consumption into account.

This is however not sufficient to correctly optimize power: the scheduler would
simply tend to run all computations on the most energy-conservative processing
unit. To account for the consumption of the whole machine (including idle
processing units), the idle power of the machine should be given by setting
@code{export STARPU_IDLE_POWER=200} for 200W, for instance. This value can often
be obtained from the machine power supplier.
The power actually consumed by the total execution can be displayed by setting
@code{export STARPU_PROFILING=1 STARPU_WORKER_STATS=1}.

@node Profiling
@section Profiling

A quick view of how many tasks each worker has executed can be obtained by setting
@code{export STARPU_WORKER_STATS=1}. This is a convenient way to check that
execution did happen on accelerators, without penalizing performance with
the profiling overhead.

A quick view of how much data transfer has been issued can be obtained by setting
@code{export STARPU_BUS_STATS=1}.

More detailed profiling information can be enabled by using @code{export STARPU_PROFILING=1} or by
calling @code{starpu_profiling_status_set} from the source code.
Statistics on the execution can then be obtained by using @code{export
STARPU_BUS_STATS=1} and @code{export STARPU_WORKER_STATS=1}.
More details on performance feedback are provided in the next chapter.
@node CUDA-specific optimizations
@section CUDA-specific optimizations

Due to CUDA limitations, StarPU will have a hard time overlapping its own
communications and the codelet computations if the application does not use a
dedicated CUDA stream for its computations. StarPU provides one by the use of
@code{starpu_cuda_get_local_stream()} which should be used by all CUDA codelet
operations. For instance:

func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
cudaStreamSynchronize(starpu_cuda_get_local_stream());

Unfortunately, some CUDA libraries do not have stream variants of
kernels. That will lower the potential for overlapping.
@c ---------------------------------------------------------------------
@c Performance feedback
@c ---------------------------------------------------------------------

@node Performance feedback
@chapter Performance feedback

@menu
* On-line:: On-line performance feedback
* Off-line:: Off-line performance feedback
* Codelet performance:: Performance of codelets
@end menu

@node On-line
@section On-line performance feedback

@menu
* Enabling monitoring:: Enabling on-line performance monitoring
* Task feedback:: Per-task feedback
* Codelet feedback:: Per-codelet feedback
* Worker feedback:: Per-worker feedback
* Bus feedback:: Bus-related feedback
@end menu

@node Enabling monitoring
@subsection Enabling on-line performance monitoring

In order to enable online performance monitoring, the application can call
@code{starpu_profiling_status_set(STARPU_PROFILING_ENABLE)}. It is possible to
detect whether monitoring is already enabled or not by calling
@code{starpu_profiling_status_get()}. Enabling monitoring also reinitializes all
previously collected feedback. The @code{STARPU_PROFILING} environment variable
can also be set to 1 to achieve the same effect.

Likewise, performance monitoring is stopped by calling
@code{starpu_profiling_status_set(STARPU_PROFILING_DISABLE)}. Note that this
does not reset the performance counters, so that the application may consult
them later on.

More details about the performance monitoring API are available in section
@ref{Profiling API}.
@node Task feedback
@subsection Per-task feedback

If profiling is enabled, a pointer to a @code{starpu_task_profiling_info}
structure is put in the @code{.profiling_info} field of the @code{starpu_task}
structure when a task terminates.
This structure is automatically destroyed when the task structure is destroyed,
either automatically or by calling @code{starpu_task_destroy}.

The @code{starpu_task_profiling_info} structure indicates the date when the
task was submitted (@code{submit_time}), started (@code{start_time}), and
terminated (@code{end_time}), relative to the initialization of
StarPU with @code{starpu_init}. It also specifies the identifier of the worker
that has executed the task (@code{workerid}).
These dates are stored as @code{timespec} structures which the user may convert
into micro-seconds using the @code{starpu_timing_timespec_to_us} helper
function.

It is worth noting that the application may directly access this structure from
the callback executed at the end of the task. The @code{starpu_task} structure
associated to the callback currently being executed is indeed accessible with
the @code{starpu_get_current_task()} function.
@node Codelet feedback
@subsection Per-codelet feedback

The @code{per_worker_stats} field of the @code{starpu_codelet_t} structure is
an array of counters. The i-th entry of the array is incremented every time a
task implementing the codelet is executed on the i-th worker.
This array is not reinitialized when profiling is enabled or disabled.
@node Worker feedback
1784
@subsection Per-worker feedback
1786
The second argument returned by the @code{starpu_worker_get_profiling_info}
1787
function is a @code{starpu_worker_profiling_info} structure that gives
1788
statistics about the specified worker. This structure specifies when StarPU
1789
started collecting profiling information for that worker (@code{start_time}),
1790
the duration of the profiling measurement interval (@code{total_time}), the
1791
time spent executing kernels (@code{executing_time}), the time spent sleeping
1792
because there is no task to execute at all (@code{sleeping_time}), and the
1793
number of tasks that were executed while profiling was enabled.
1794
These values give an estimation of the proportion of time spent do real work,
1795
and the time spent either sleeping because there are not enough executable
1796
tasks or simply wasted in pure StarPU overhead.
1798
Calling @code{starpu_worker_get_profiling_info} resets the profiling
1799
information associated to a worker.
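These proportions can be derived as sketched below, assuming the structure's
time fields have already been converted to microseconds. The helper and its
names are illustrative, not part of the StarPU API:

```c
#include <assert.h>
#include <math.h>

/* Illustrative breakdown of a worker's time, derived from the
 * total_time, executing_time and sleeping_time fields once
 * converted to microseconds. */
struct worker_ratios
{
    double executing; /* share of time spent executing kernels */
    double sleeping;  /* share of time spent waiting for tasks */
    double overhead;  /* remaining share: pure runtime overhead */
};

static struct worker_ratios compute_worker_ratios(double total_us,
                                                  double executing_us,
                                                  double sleeping_us)
{
    struct worker_ratios r;
    r.executing = executing_us / total_us;
    r.sleeping = sleeping_us / total_us;
    r.overhead = (total_us - executing_us - sleeping_us) / total_us;
    return r;
}
```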
When an FxT trace is generated (see @ref{Generating traces}), it is also
possible to use the @code{starpu_top} script (described in @ref{starpu-top}) to
generate a graphic showing the evolution of these values over time, for
the different workers.
@node Bus-related feedback
@subsection Bus-related feedback

@c how to enable/disable performance monitoring
@c what kind of information do we get ?
@node Off-line performance feedback
@section Off-line performance feedback

@menu
* Generating traces::   Generating traces with FxT
* Gantt diagram::       Creating a Gantt Diagram
* DAG::                 Creating a DAG with graphviz
* starpu-top::          Monitoring activity
@end menu
@node Generating traces
@subsection Generating traces with FxT

StarPU can use the FxT library (see
@indicateurl{https://savannah.nongnu.org/projects/fkt/}) to generate traces
with a limited runtime overhead.

You can either get a tarball:
@example
% wget http://download.savannah.gnu.org/releases/fkt/fxt-0.2.2.tar.gz
@end example

or use the FxT library from CVS (autotools are required):
@example
% cvs -d :pserver:anonymous@@cvs.sv.gnu.org:/sources/fkt co FxT
@end example

Compiling and installing the FxT library in the @code{$FXTDIR} path is
done following the standard procedure:
@example
% ./configure --prefix=$FXTDIR
% make
% make install
@end example

In order to have StarPU generate traces, StarPU should be configured with
the @code{--with-fxt} option:
@example
$ ./configure --with-fxt=$FXTDIR
@end example

Alternatively, you can simply point @code{PKG_CONFIG_PATH} to
@code{$FXTDIR/lib/pkgconfig} and pass @code{--with-fxt} to @code{./configure}.

When FxT is enabled, a trace is generated when StarPU is terminated by calling
@code{starpu_shutdown()}. The trace is a binary file whose name has the form
@code{prof_file_XXX_YYY} where @code{XXX} is the user name, and
@code{YYY} is the pid of the process that used StarPU. This file is saved in the
@code{/tmp/} directory by default, or in the directory specified by
the @code{STARPU_FXT_PREFIX} environment variable.
@node Gantt diagram
@subsection Creating a Gantt Diagram

When an FxT trace file has been generated, it is possible to
generate a trace in the Paje format by calling:
@example
% starpu_fxt_tool -i filename
@end example

Alternatively, setting the @code{STARPU_GENERATE_TRACE} environment variable
to 1 before application execution will make StarPU do it automatically at
application shutdown.

This will create a @code{paje.trace} file in the current directory that can be
inspected with the ViTE trace visualizing open-source tool. More information
about ViTE is available at @indicateurl{http://vite.gforge.inria.fr/}. It is
possible to open the @code{paje.trace} file with ViTE by using the following
command:
@example
% vite paje.trace
@end example
@node DAG
@subsection Creating a DAG with graphviz

When an FxT trace file has been generated, it is possible to
generate a task graph in the DOT format by calling:
@example
$ starpu_fxt_tool -i filename
@end example

This will create a @code{dag.dot} file in the current directory. This file is a
task graph described using the DOT language. It is possible to get a
graphical output of the graph by using the graphviz library:
@example
$ dot -Tpdf dag.dot -o output.pdf
@end example
@node starpu-top
@subsection Monitoring activity

When an FxT trace file has been generated, it is possible to
generate an activity trace by calling:
@example
$ starpu_fxt_tool -i filename
@end example

This will create an @code{activity.data} file in the current
directory. A profile of the application showing the activity of StarPU
during the execution of the program can be generated:
@example
$ starpu_top.sh activity.data
@end example

This will create a file named @code{activity.eps} in the current directory.
This picture is composed of two parts.
The first part shows the activity of the different workers. The green sections
indicate which proportion of the time was spent executing kernels on the
processing unit. The red sections indicate the proportion of time spent in
StarPU: an important overhead may indicate that the granularity is too
small, and that bigger tasks may be appropriate to use the processing unit more
efficiently. The black sections indicate that the processing unit was blocked
because there was no task to process: this may indicate a lack of parallelism
which may be alleviated by creating more tasks when it is possible.

The second part of the @code{activity.eps} picture is a graph showing the
evolution of the number of tasks available in the system during the execution.
Ready tasks are shown in black, and tasks that are submitted but not
schedulable yet are shown in grey.
@node Codelet performance
@section Performance of codelets

The performance model of codelets can be examined by using the
@code{starpu_perfmodel_display} tool:

@example
$ starpu_perfmodel_display -l
file: <malloc_pinned.hannibal>
file: <starpu_slu_lu_model_21.hannibal>
file: <starpu_slu_lu_model_11.hannibal>
file: <starpu_slu_lu_model_22.hannibal>
file: <starpu_slu_lu_model_12.hannibal>
@end example

Here, the codelets of the lu example are available. We can examine the
performance of the 22 kernel:

@example
$ starpu_perfmodel_display -s starpu_slu_lu_model_22
performance model for cpu
# hash      size       mean          dev           n
57618ab0    19660800   2.851069e+05  1.829369e+04  109
performance model for cuda_0
# hash      size       mean          dev           n
57618ab0    19660800   1.164144e+04  1.556094e+01  315
performance model for cuda_1
# hash      size       mean          dev           n
57618ab0    19660800   1.164271e+04  1.330628e+01  360
performance model for cuda_2
# hash      size       mean          dev           n
57618ab0    19660800   1.166730e+04  3.390395e+02  456
@end example

We can see that for the given size, over a sample of a few hundred
executions, the GPUs are about 20 times faster than the CPUs (numbers are in
microseconds). The standard deviation is extremely low for the GPUs, and less
than 10% for the CPUs.
@c ---------------------------------------------------------------------
@c StarPU MPI support
@c ---------------------------------------------------------------------

@node StarPU MPI support
@chapter StarPU MPI support

The integration of MPI transfers within task parallelism is done in a
very natural way by the means of asynchronous interactions between the
application and StarPU. This is implemented in a separate libstarpumpi library
which basically provides "StarPU" equivalents of @code{MPI_*} functions, where
@code{void *} buffers are replaced with @code{starpu_data_handle}s, and all
GPU-RAM-NIC transfers are handled efficiently by StarPU-MPI.
@menu
* Simple Example::
* MPI Insert Task Utility::
@end menu

@subsection Initialisation

@deftypefun int starpu_mpi_initialize (void)
Initializes the starpumpi library. This must be called between calling
@code{starpu_init} and other @code{starpu_mpi} functions. This
function does not call @code{MPI_Init}; it should be called beforehand.
@end deftypefun

@deftypefun int starpu_mpi_initialize_extended (int *@var{rank}, int *@var{world_size})
Initializes the starpumpi library. This must be called between calling
@code{starpu_init} and other @code{starpu_mpi} functions.
This function calls @code{MPI_Init}, and therefore should be preferred
to the previous one for MPI implementations which are not thread-safe.
Returns the current MPI node rank and world size.
@end deftypefun

@deftypefun int starpu_mpi_shutdown (void)
Cleans up the starpumpi library. This must be called between calling
@code{starpu_mpi} functions and @code{starpu_shutdown}.
@code{MPI_Finalize} will be called if StarPU-MPI has been initialized
by calling @code{starpu_mpi_initialize_extended}.
@end deftypefun
@subsection Communication

@deftypefun int starpu_mpi_send (starpu_data_handle @var{data_handle}, int @var{dest}, int @var{mpi_tag}, MPI_Comm @var{comm})
@end deftypefun

@deftypefun int starpu_mpi_recv (starpu_data_handle @var{data_handle}, int @var{source}, int @var{mpi_tag}, MPI_Comm @var{comm}, MPI_Status *@var{status})
@end deftypefun

@deftypefun int starpu_mpi_isend (starpu_data_handle @var{data_handle}, starpu_mpi_req *@var{req}, int @var{dest}, int @var{mpi_tag}, MPI_Comm @var{comm})
@end deftypefun

@deftypefun int starpu_mpi_irecv (starpu_data_handle @var{data_handle}, starpu_mpi_req *@var{req}, int @var{source}, int @var{mpi_tag}, MPI_Comm @var{comm})
@end deftypefun

@deftypefun int starpu_mpi_isend_detached (starpu_data_handle @var{data_handle}, int @var{dest}, int @var{mpi_tag}, MPI_Comm @var{comm}, void (*@var{callback})(void *), void *@var{arg})
@end deftypefun

@deftypefun int starpu_mpi_irecv_detached (starpu_data_handle @var{data_handle}, int @var{source}, int @var{mpi_tag}, MPI_Comm @var{comm}, void (*@var{callback})(void *), void *@var{arg})
@end deftypefun

@deftypefun int starpu_mpi_wait (starpu_mpi_req *@var{req}, MPI_Status *@var{status})
@end deftypefun

@deftypefun int starpu_mpi_test (starpu_mpi_req *@var{req}, int *@var{flag}, MPI_Status *@var{status})
@end deftypefun

@deftypefun int starpu_mpi_barrier (MPI_Comm @var{comm})
@end deftypefun

@deftypefun int starpu_mpi_isend_detached_unlock_tag (starpu_data_handle @var{data_handle}, int @var{dest}, int @var{mpi_tag}, MPI_Comm @var{comm}, starpu_tag_t @var{tag})
When the transfer is completed, the tag is unlocked.
@end deftypefun

@deftypefun int starpu_mpi_irecv_detached_unlock_tag (starpu_data_handle @var{data_handle}, int @var{source}, int @var{mpi_tag}, MPI_Comm @var{comm}, starpu_tag_t @var{tag})
@end deftypefun

@deftypefun int starpu_mpi_isend_array_detached_unlock_tag (unsigned @var{array_size}, starpu_data_handle *@var{data_handle}, int *@var{dest}, int *@var{mpi_tag}, MPI_Comm *@var{comm}, starpu_tag_t @var{tag})
Asynchronously send an array of buffers, and unlock the tag once all
of them are transmitted.
@end deftypefun

@deftypefun int starpu_mpi_irecv_array_detached_unlock_tag (unsigned @var{array_size}, starpu_data_handle *@var{data_handle}, int *@var{source}, int *@var{mpi_tag}, MPI_Comm *@var{comm}, starpu_tag_t @var{tag})
@end deftypefun
@node Simple Example
@section Simple Example

@cartouche
@smallexample
static unsigned token = 0;
static starpu_data_handle token_handle;

void increment_token(void)
@{
    struct starpu_task *task = starpu_task_create();

    task->cl = &increment_cl;
    task->buffers[0].handle = token_handle;
    task->buffers[0].mode = STARPU_RW;

    starpu_task_submit(task);
@}

int main(int argc, char **argv)
@{
    int rank, size;

    starpu_init(NULL);
    starpu_mpi_initialize_extended(&rank, &size);

    starpu_vector_data_register(&token_handle, 0, (uintptr_t)&token, 1, sizeof(unsigned));

    unsigned nloops = NITER;
    unsigned loop;

    unsigned last_loop = nloops - 1;
    unsigned last_rank = size - 1;

    for (loop = 0; loop < nloops; loop++) @{
        int tag = loop*size + rank;

        if (loop == 0 && rank == 0)
        @{
            token = 0;
            fprintf(stdout, "Start with token value %d\n", token);
        @}
        else
        @{
            starpu_mpi_irecv_detached(token_handle, (rank+size-1)%size, tag,
                    MPI_COMM_WORLD, NULL, NULL);
        @}

        increment_token();

        if (loop == last_loop && rank == last_rank)
        @{
            starpu_data_acquire(token_handle, STARPU_R);
            fprintf(stdout, "Finished : token value %d\n", token);
            starpu_data_release(token_handle);
        @}
        else
        @{
            starpu_mpi_isend_detached(token_handle, (rank+1)%size, tag+1,
                    MPI_COMM_WORLD, NULL, NULL);
        @}
    @}

    starpu_task_wait_for_all();

    starpu_mpi_shutdown();
    starpu_shutdown();

    if (rank == last_rank)
    @{
        fprintf(stderr, "[%d] token = %d == %d * %d ?\n", rank, token, nloops, size);
        STARPU_ASSERT(token == nloops*size);
    @}

    return 0;
@}
@end smallexample
@end cartouche
@node MPI Insert Task Utility
@section MPI Insert Task Utility

@deftypefun void starpu_mpi_insert_task (MPI_Comm @var{comm}, starpu_codelet *@var{cl}, ...)
Create and submit a task corresponding to @var{cl} with the following
arguments. The argument list must be zero-terminated.

The arguments following the codelet are of the same types as for the
@code{starpu_insert_task} function defined in @ref{Insert Task
Utility}. The extra argument @code{STARPU_EXECUTE} followed by an
integer makes it possible to specify the node on which to execute the codelet.

The algorithm is as follows:
@enumerate
@item Find out whether we are to execute the codelet because we own the
data to be written to. If different nodes own data to be written to,
the argument @code{STARPU_EXECUTE} should be used to specify the
executing node @code{ET}.
@item Send and receive data as requested. Nodes owning data which need
to be read by the executing node @code{ET} send them to @code{ET}.
@item Execute the codelet. This is done on the node selected in the
first step of the algorithm.
@item In the case when different nodes own data to be written to, send
the written data back to their owners.
@end enumerate

The algorithm also includes a cache mechanism that avoids sending
data twice to the same node, unless the data has been modified.
@end deftypefun
@deftypefun void starpu_mpi_get_data_on_node (MPI_Comm @var{comm}, starpu_data_handle @var{data_handle}, int @var{node})
@end deftypefun

Here is an example showing how to use @code{starpu_mpi_insert_task}. One
first needs to define a distribution function which specifies the
locality of the data. Note that the distribution information needs to
be given to StarPU by calling @code{starpu_data_set_rank}.
@cartouche
@smallexample
/* Returns the MPI node number where data is */
int my_distrib(int x, int y, int nb_nodes) @{
  /* Cyclic distrib */
  return ((int)(x / sqrt(nb_nodes) + (y / sqrt(nb_nodes)) * sqrt(nb_nodes))) % nb_nodes;
//  /* Linear distrib */
//  return x / sqrt(nb_nodes) + (y / sqrt(nb_nodes)) * X;
@}
@end smallexample
@end cartouche
Now the data can be registered within StarPU. Data which are not
owned but will be needed for computations can be registered through
the lazy allocation mechanism, i.e. with a @code{home_node} set to -1.
StarPU will automatically allocate the memory when it is used for the
first time.
@cartouche
@smallexample
unsigned matrix[X][Y];
starpu_data_handle data_handles[X][Y];

for(x = 0; x < X; x++) @{
    for (y = 0; y < Y; y++) @{
        int mpi_rank = my_distrib(x, y, size);
        if (mpi_rank == rank)
            starpu_variable_data_register(&data_handles[x][y], 0,
                                          (uintptr_t)&(matrix[x][y]), sizeof(unsigned));
        else if (rank == mpi_rank+1 || rank == mpi_rank-1)
            /* I don't own that index, but will need it for my computations */
            starpu_variable_data_register(&data_handles[x][y], -1,
                                          (uintptr_t)NULL, sizeof(unsigned));
        else
            /* I know it's useless to allocate anything for this */
            data_handles[x][y] = NULL;
        if (data_handles[x][y])
            starpu_data_set_rank(data_handles[x][y], mpi_rank);
    @}
@}
@end smallexample
@end cartouche
Now @code{starpu_mpi_insert_task()} can be called for the different
steps of the application.

@cartouche
@smallexample
for(loop=0 ; loop<niter; loop++)
    for (x = 1; x < X-1; x++)
        for (y = 1; y < Y-1; y++)
            starpu_mpi_insert_task(MPI_COMM_WORLD, &stencil5_cl,
                                   STARPU_RW, data_handles[x][y],
                                   STARPU_R, data_handles[x-1][y],
                                   STARPU_R, data_handles[x+1][y],
                                   STARPU_R, data_handles[x][y-1],
                                   STARPU_R, data_handles[x][y+1],
                                   0);

starpu_task_wait_for_all();
@end smallexample
@end cartouche
@c ---------------------------------------------------------------------
@c Configuration options
@c ---------------------------------------------------------------------

@node Configuring StarPU
@chapter Configuring StarPU

@menu
* Compilation configuration::
* Execution configuration through environment variables::
@end menu
@node Compilation configuration
@section Compilation configuration

The following arguments can be given to the @code{configure} script.

@menu
* Common configuration::
* Configuring workers::
* Advanced configuration::
@end menu

@node Common configuration
@subsection Common configuration

@menu
* --enable-debug::
* --enable-fast::
* --enable-verbose::
* --enable-coverage::
@end menu
@node --enable-debug
@subsubsection @code{--enable-debug}

Enable debugging messages.

@node --enable-fast
@subsubsection @code{--enable-fast}

Do not enforce assertions; this saves a lot of time otherwise spent
computing them.

@node --enable-verbose
@subsubsection @code{--enable-verbose}

Increase the verbosity of the debugging messages. This can be disabled
at runtime by setting the environment variable @code{STARPU_SILENT} to
any value:

@example
% STARPU_SILENT=1 ./vector_scal
@end example

@node --enable-coverage
@subsubsection @code{--enable-coverage}

Enable flags for the @code{gcov} coverage tool.
@node Configuring workers
@subsection Configuring workers

@menu
* --enable-nmaxcpus::
* --disable-cpu::
* --enable-maxcudadev::
* --disable-cuda::
* --with-cuda-dir::
* --with-cuda-include-dir::
* --with-cuda-lib-dir::
* --enable-maxopencldev::
* --disable-opencl::
* --with-opencl-dir::
* --with-opencl-include-dir::
* --with-opencl-lib-dir::
* --enable-gordon::
* --with-gordon-dir::
@end menu
@node --enable-nmaxcpus
@subsubsection @code{--enable-nmaxcpus=<number>}

Defines the maximum number of CPU cores that StarPU will support, then
available as the @code{STARPU_NMAXCPUS} macro.

@node --disable-cpu
@subsubsection @code{--disable-cpu}

Disable the use of the CPUs of the machine. Only GPUs etc. will be used.

@node --enable-maxcudadev
@subsubsection @code{--enable-maxcudadev=<number>}

Defines the maximum number of CUDA devices that StarPU will support, then
available as the @code{STARPU_MAXCUDADEVS} macro.

@node --disable-cuda
@subsubsection @code{--disable-cuda}

Disable the use of CUDA, even if a valid CUDA installation was detected.

@node --with-cuda-dir
@subsubsection @code{--with-cuda-dir=<path>}

Specify the directory where CUDA is installed. This directory should notably
contain @code{include/cuda.h}.

@node --with-cuda-include-dir
@subsubsection @code{--with-cuda-include-dir=<path>}

Specify the directory where CUDA headers are installed. This directory should
notably contain @code{cuda.h}. This defaults to @code{/include} appended to the
value given to @code{--with-cuda-dir}.

@node --with-cuda-lib-dir
@subsubsection @code{--with-cuda-lib-dir=<path>}

Specify the directory where the CUDA library is installed. This directory should
notably contain the CUDA shared libraries (e.g. @code{libcuda.so}). This defaults
to @code{/lib} appended to the value given to @code{--with-cuda-dir}.

@node --enable-maxopencldev
@subsubsection @code{--enable-maxopencldev=<number>}

Defines the maximum number of OpenCL devices that StarPU will support, then
available as the @code{STARPU_MAXOPENCLDEVS} macro.

@node --disable-opencl
@subsubsection @code{--disable-opencl}

Disable the use of OpenCL, even if the SDK is detected.

@node --with-opencl-dir
@subsubsection @code{--with-opencl-dir=<path>}

Specify the location of the OpenCL SDK. This directory should notably contain
@code{include/CL/cl.h} (or @code{include/OpenCL/cl.h} on Mac OS).

@node --with-opencl-include-dir
@subsubsection @code{--with-opencl-include-dir=<path>}

Specify the location of OpenCL headers. This directory should notably contain
@code{CL/cl.h} (or @code{OpenCL/cl.h} on Mac OS). This defaults to
@code{/include} appended to the value given to @code{--with-opencl-dir}.

@node --with-opencl-lib-dir
@subsubsection @code{--with-opencl-lib-dir=<path>}

Specify the location of the OpenCL library. This directory should notably
contain the OpenCL shared libraries (e.g. @code{libOpenCL.so}). This defaults to
@code{/lib} appended to the value given to @code{--with-opencl-dir}.

@node --enable-gordon
@subsubsection @code{--enable-gordon}

Enable the use of the Gordon runtime for Cell SPUs.
@c TODO: rather default to enabled when detected

@node --with-gordon-dir
@subsubsection @code{--with-gordon-dir=<path>}

Specify the location of the Gordon SDK.
@node Advanced configuration
@subsection Advanced configuration

@menu
* --enable-perf-debug::
* --enable-model-debug::
* --enable-stats::
* --enable-maxbuffers::
* --enable-allocation-cache::
* --enable-opengl-render::
* --enable-blas-lib::
* --with-magma::
* --with-fxt::
* --with-perf-model-dir::
* --with-mpicc::
* --with-goto-dir::
* --with-atlas-dir::
* --with-mkl-cflags::
* --with-mkl-ldflags::
@end menu
@node --enable-perf-debug
@subsubsection @code{--enable-perf-debug}

Enable performance debugging.

@node --enable-model-debug
@subsubsection @code{--enable-model-debug}

Enable performance model debugging.

@node --enable-stats
@subsubsection @code{--enable-stats}
@node --enable-maxbuffers
@subsubsection @code{--enable-maxbuffers=<nbuffers>}

Define the maximum number of buffers that tasks will be able to take
as parameters, then available as the @code{STARPU_NMAXBUFS} macro.

@node --enable-allocation-cache
@subsubsection @code{--enable-allocation-cache}

Enable the use of a data allocation cache to avoid the cost of repeated
allocations with CUDA. Still experimental.

@node --enable-opengl-render
@subsubsection @code{--enable-opengl-render}

Enable the use of OpenGL for the rendering of some examples.
@c TODO: rather default to enabled when detected

@node --enable-blas-lib
@subsubsection @code{--enable-blas-lib=<name>}

Specify the BLAS library to be used by some of the examples. The
library has to be 'atlas' or 'goto'.
@node --with-magma
@subsubsection @code{--with-magma=<path>}

Specify where MAGMA is installed. This directory should notably contain
@code{include/magmablas.h}.

@node --with-fxt
@subsubsection @code{--with-fxt=<path>}

Specify the location of FxT (for generating traces and rendering them
using ViTE). This directory should notably contain
@code{include/fxt/fxt.h}.
@c TODO add ref to other section

@node --with-perf-model-dir
@subsubsection @code{--with-perf-model-dir=<dir>}

Specify where performance models should be stored (instead of defaulting to the
current user's home).

@node --with-mpicc
@subsubsection @code{--with-mpicc=<path to mpicc>}

Specify the location of the @code{mpicc} compiler to be used for starpumpi.
@node --with-goto-dir
@subsubsection @code{--with-goto-dir=<dir>}

Specify the location of GotoBLAS.

@node --with-atlas-dir
@subsubsection @code{--with-atlas-dir=<dir>}

Specify the location of ATLAS. This directory should notably contain
@code{include/cblas.h}.

@node --with-mkl-cflags
@subsubsection @code{--with-mkl-cflags=<cflags>}

Specify the compilation flags for the MKL library.

@node --with-mkl-ldflags
@subsubsection @code{--with-mkl-ldflags=<ldflags>}

Specify the linking flags for the MKL library. Note that the
@url{http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/}
website provides a script to determine the linking flags.
@c ---------------------------------------------------------------------
@c Environment variables
@c ---------------------------------------------------------------------

@node Execution configuration through environment variables
@section Execution configuration through environment variables

@menu
* Workers::     Configuring workers
* Scheduling::  Configuring the Scheduling engine
* Misc::        Miscellaneous and debug
@end menu

Note: the values given in the @code{starpu_conf} structure passed when
calling @code{starpu_init} will override the values of the environment
variables.
@node Workers
@subsection Configuring workers

@menu
* STARPU_NCPUS::                Number of CPU workers
* STARPU_NCUDA::                Number of CUDA workers
* STARPU_NOPENCL::              Number of OpenCL workers
* STARPU_NGORDON::              Number of SPU workers (Cell)
* STARPU_WORKERS_CPUID::        Bind workers to specific CPUs
* STARPU_WORKERS_CUDAID::       Select specific CUDA devices
* STARPU_WORKERS_OPENCLID::     Select specific OpenCL devices
@end menu
@node STARPU_NCPUS
@subsubsection @code{STARPU_NCPUS} -- Number of CPU workers

Specify the number of CPU workers (thus not including workers dedicated to
controlling accelerators). Note that by default, StarPU will not allocate
more CPU workers than there are physical CPUs, and that some CPUs are used
to control the accelerators.
@node STARPU_NCUDA
@subsubsection @code{STARPU_NCUDA} -- Number of CUDA workers

Specify the number of CUDA devices that StarPU can use. If
@code{STARPU_NCUDA} is lower than the number of physical devices, it is
possible to select which CUDA devices should be used by the means of the
@code{STARPU_WORKERS_CUDAID} environment variable. By default, StarPU will
create as many CUDA workers as there are CUDA devices.
@node STARPU_NOPENCL
@subsubsection @code{STARPU_NOPENCL} -- Number of OpenCL workers

OpenCL equivalent of the @code{STARPU_NCUDA} environment variable.

@node STARPU_NGORDON
@subsubsection @code{STARPU_NGORDON} -- Number of SPU workers (Cell)

Specify the number of SPUs that StarPU can use.
@node STARPU_WORKERS_CPUID
@subsubsection @code{STARPU_WORKERS_CPUID} -- Bind workers to specific CPUs

Passing an array of integers (starting from 0) in @code{STARPU_WORKERS_CPUID}
specifies on which logical CPU the different workers should be
bound. For instance, if @code{STARPU_WORKERS_CPUID = "0 1 4 5"}, the first
worker will be bound to logical CPU #0, the second CPU worker will be bound to
logical CPU #1 and so on. Note that the logical ordering of the CPUs is either
determined by the OS, or provided by the @code{hwloc} library in case it is
available.

Note that the first workers correspond to the CUDA workers, then come the
OpenCL and the SPU, and finally the CPU workers. For example, if
we have @code{STARPU_NCUDA=1}, @code{STARPU_NOPENCL=1}, @code{STARPU_NCPUS=2}
and @code{STARPU_WORKERS_CPUID = "0 2 1 3"}, the CUDA device will be controlled
by logical CPU #0, the OpenCL device will be controlled by logical CPU #2, and
the logical CPUs #1 and #3 will be used by the CPU workers.

If the number of workers is larger than the array given in
@code{STARPU_WORKERS_CPUID}, the workers are bound to the logical CPUs in a
round-robin fashion: if @code{STARPU_WORKERS_CPUID = "0 1"}, the first and the
third (resp. second and fourth) workers will be put on CPU #0 (resp. CPU #1).

This variable is ignored if the @code{use_explicit_workers_bindid} flag of the
@code{starpu_conf} structure passed to @code{starpu_init} is set.
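The round-robin rule described above amounts to the following. This is an
illustrative helper, not part of the StarPU API:

```c
#include <assert.h>

/* Illustrative sketch of the round-robin binding rule: worker i is
 * bound to the (i mod n)-th entry of the STARPU_WORKERS_CPUID array,
 * here passed as `bindid` with `n` entries. */
static int worker_cpu(const int *bindid, unsigned n, unsigned worker)
{
    return bindid[worker % n];
}
```

With @code{STARPU_WORKERS_CPUID = "0 1"}, workers 0 and 2 land on CPU #0 and workers 1 and 3 on CPU #1, as stated above.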
@node STARPU_WORKERS_CUDAID
@subsubsection @code{STARPU_WORKERS_CUDAID} -- Select specific CUDA devices

Similarly to the @code{STARPU_WORKERS_CPUID} environment variable, it is
possible to select which CUDA devices should be used by StarPU. On a machine
equipped with 4 GPUs, setting @code{STARPU_WORKERS_CUDAID = "1 3"} and
@code{STARPU_NCUDA=2} specifies that 2 CUDA workers should be created, and that
they should use CUDA devices #1 and #3 (the logical ordering of the devices is
the one reported by CUDA).

This variable is ignored if the @code{use_explicit_workers_cuda_gpuid} flag of
the @code{starpu_conf} structure passed to @code{starpu_init} is set.

@node STARPU_WORKERS_OPENCLID
@subsubsection @code{STARPU_WORKERS_OPENCLID} -- Select specific OpenCL devices

OpenCL equivalent of the @code{STARPU_WORKERS_CUDAID} environment variable.

This variable is ignored if the @code{use_explicit_workers_opencl_gpuid} flag of
the @code{starpu_conf} structure passed to @code{starpu_init} is set.
@node Scheduling
@subsection Configuring the Scheduling engine

@menu
* STARPU_SCHED::        Scheduling policy
* STARPU_CALIBRATE::    Calibrate performance models
* STARPU_PREFETCH::     Use data prefetch
* STARPU_SCHED_ALPHA::  Computation factor
* STARPU_SCHED_BETA::   Communication factor
@end menu
@node STARPU_SCHED
@subsubsection @code{STARPU_SCHED} -- Scheduling policy

This chooses between the different scheduling policies proposed by StarPU:
random, work stealing, greedy, with performance models, etc.

Use @code{STARPU_SCHED=help} to get the list of available schedulers.

@node STARPU_CALIBRATE
@subsubsection @code{STARPU_CALIBRATE} -- Calibrate performance models

If this variable is set to 1, the performance models are calibrated during
the execution. If it is set to 2, the previous values are dropped to restart
calibration from scratch. Setting this variable to 0 disables calibration;
this is the default behaviour.

Note: this currently only applies to the @code{dm}, @code{dmda} and
@code{heft} scheduling policies.
@node STARPU_PREFETCH
@subsubsection @code{STARPU_PREFETCH} -- Use data prefetch

This variable indicates whether data prefetching should be enabled (0 means
that it is disabled). If prefetching is enabled, when a task is scheduled to be
executed e.g. on a GPU, StarPU will request an asynchronous transfer in
advance, so that data is already present on the GPU when the task starts. As a
result, computation and data transfers are overlapped.
Note that prefetching is enabled by default in StarPU.
@node STARPU_SCHED_ALPHA
@subsubsection @code{STARPU_SCHED_ALPHA} -- Computation factor

To estimate the cost of a task, StarPU takes into account the estimated
computation time (obtained thanks to performance models). The alpha factor is
the coefficient to be applied to it before adding it to the communication part.

@node STARPU_SCHED_BETA
@subsubsection @code{STARPU_SCHED_BETA} -- Communication factor

To estimate the cost of a task, StarPU takes into account the estimated
data transfer time (obtained thanks to performance models). The beta factor is
the coefficient to be applied to it before adding it to the computation part.
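Taken together, the two factors combine as sketched below. This is an
illustrative formula for the weighting just described, not StarPU's exact
internal implementation:

```c
#include <assert.h>

/* Illustrative cost estimate combining the two factors above:
 * predicted cost = alpha * computation time + beta * transfer time,
 * all times in microseconds. */
static double predicted_cost_us(double alpha, double computation_us,
                                double beta, double transfer_us)
{
    return alpha * computation_us + beta * transfer_us;
}
```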
@node Misc
@subsection Miscellaneous and debug

@menu
* STARPU_SILENT::           Disable verbose mode
* STARPU_LOGFILENAME::      Select debug file name
* STARPU_FXT_PREFIX::       FxT trace location
* STARPU_LIMIT_GPU_MEM::    Restrict memory size on the GPUs
* STARPU_GENERATE_TRACE::   Generate a Paje trace when StarPU is shut down
@end menu
2814
@node STARPU_SILENT
@subsubsection @code{STARPU_SILENT} -- Disable verbose mode

@item @emph{Description}:
This variable allows verbose mode to be disabled at runtime when StarPU
has been configured with the option @code{--enable-verbose}.

@node STARPU_LOGFILENAME
@subsubsection @code{STARPU_LOGFILENAME} -- Select debug file name

@item @emph{Description}:
This variable specifies the file in which the debugging output should be saved.

@node STARPU_FXT_PREFIX
@subsubsection @code{STARPU_FXT_PREFIX} -- FxT trace location

@item @emph{Description}
This variable specifies the directory in which to save the trace generated when FxT is enabled. The value needs a trailing '/' character.

@node STARPU_LIMIT_GPU_MEM
@subsubsection @code{STARPU_LIMIT_GPU_MEM} -- Restrict memory size on the GPUs

@item @emph{Description}
This variable specifies the maximum number of megabytes that should be
available to the application on each GPU. In case this value is smaller than
the size of the memory of a GPU, StarPU pre-allocates a buffer to waste the
remaining memory on the device. This variable is intended to be used for
experimental purposes as it emulates devices that have a limited amount of
memory.

@node STARPU_GENERATE_TRACE
@subsubsection @code{STARPU_GENERATE_TRACE} -- Generate a Paje trace when StarPU is shut down

@item @emph{Description}
When set to 1, this variable indicates that StarPU should automatically
generate a Paje trace when @code{starpu_shutdown} is called.

@c ---------------------------------------------------------------------
@c StarPU API
@c ---------------------------------------------------------------------

@node StarPU API
@chapter StarPU API

* Initialization and Termination:: Initialization and Termination methods
* Workers' Properties:: Methods to enumerate workers' properties
* Data Library:: Methods to manipulate data
* Codelets and Tasks:: Methods to construct tasks
* Explicit Dependencies:: Explicit Dependencies
* Implicit Data Dependencies:: Implicit Data Dependencies
* Performance Model API::
* Profiling API:: Profiling API
* CUDA extensions:: CUDA extensions
* OpenCL extensions:: OpenCL extensions
* Cell extensions:: Cell extensions
* Miscellaneous helpers::

@node Initialization and Termination
@section Initialization and Termination

* starpu_init:: Initialize StarPU
* struct starpu_conf:: StarPU runtime configuration
* starpu_conf_init:: Initialize starpu_conf structure
* starpu_shutdown:: Terminate StarPU

@node starpu_init
@subsection @code{starpu_init} -- Initialize StarPU

@item @emph{Description}:
This is the StarPU initialization method, which must be called prior to any
other StarPU call. It is possible to specify StarPU's configuration (e.g.
scheduling policy, number of cores, ...) by passing a non-null argument. The
default configuration is used if the passed argument is @code{NULL}.
@item @emph{Return value}:
Upon successful completion, this function returns 0. Otherwise, @code{-ENODEV}
indicates that no worker was available (so that StarPU was not initialized).

@item @emph{Prototype}:
@code{int starpu_init(struct starpu_conf *conf);}

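A minimal call sequence might look as follows (a sketch; the data registration
and task submission code is elided):

@example
int ret = starpu_init(NULL);
if (ret == -ENODEV)
@{
     /* no CPU or accelerator worker could be initialized */
     return 1;
@}

/* ... register data and submit tasks ... */

starpu_shutdown();
@end example
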
@node struct starpu_conf
@subsection @code{struct starpu_conf} -- StarPU runtime configuration

@item @emph{Description}:
This structure is passed to the @code{starpu_init} function in order
to configure StarPU.
When the default value is used, StarPU automatically selects the number
of processing units and takes the default scheduling policy. This parameter
overwrites the equivalent environment variables.

@item @emph{Fields}:

@item @code{sched_policy_name} (default = NULL):
This is the name of the scheduling policy. This can also be specified with the
@code{STARPU_SCHED} environment variable.
@item @code{sched_policy} (default = NULL):
This is the definition of the scheduling policy. This field is ignored
if @code{sched_policy_name} is set.
@item @code{ncpus} (default = -1):
This is the number of CPU cores that StarPU can use. This can also be
specified with the @code{STARPU_NCPUS} environment variable.
@item @code{ncuda} (default = -1):
This is the number of CUDA devices that StarPU can use. This can also be
specified with the @code{STARPU_NCUDA} environment variable.
@item @code{nopencl} (default = -1):
This is the number of OpenCL devices that StarPU can use. This can also be
specified with the @code{STARPU_NOPENCL} environment variable.
@item @code{nspus} (default = -1):
This is the number of Cell SPUs that StarPU can use. This can also be
specified with the @code{STARPU_NGORDON} environment variable.

@item @code{use_explicit_workers_bindid} (default = 0)
If this flag is set, the @code{workers_bindid} array indicates where the
different workers are bound, otherwise StarPU automatically selects where to
bind the different workers unless the @code{STARPU_WORKERS_CPUID} environment
variable is set. The @code{STARPU_WORKERS_CPUID} environment variable is
ignored if the @code{use_explicit_workers_bindid} flag is set.
@item @code{workers_bindid[STARPU_NMAXWORKERS]}
If the @code{use_explicit_workers_bindid} flag is set, this array indicates
where to bind the different workers. The i-th entry of the
@code{workers_bindid} array indicates the logical identifier of the processor
which should execute the i-th worker. Note that the logical ordering of the
CPUs is either determined by the OS, or provided by the @code{hwloc} library
in case it is available.
When this flag is set, the @ref{STARPU_WORKERS_CPUID} environment variable is
ignored.

@item @code{use_explicit_workers_cuda_gpuid} (default = 0)
If this flag is set, the CUDA workers will be attached to the CUDA devices
specified in the @code{workers_cuda_gpuid} array. Otherwise, StarPU assigns
the CUDA devices in a round-robin fashion.
When this flag is set, the @ref{STARPU_WORKERS_CUDAID} environment variable is
ignored.
@item @code{workers_cuda_gpuid[STARPU_NMAXWORKERS]}
If the @code{use_explicit_workers_cuda_gpuid} flag is set, this array contains
the logical identifiers of the CUDA devices (as used by @code{cudaGetDevice}).
@item @code{use_explicit_workers_opencl_gpuid} (default = 0)
If this flag is set, the OpenCL workers will be attached to the OpenCL devices
specified in the @code{workers_opencl_gpuid} array. Otherwise, StarPU assigns
the OpenCL devices in a round-robin fashion.
@item @code{workers_opencl_gpuid[STARPU_NMAXWORKERS]}:
If the @code{use_explicit_workers_opencl_gpuid} flag is set, this array
contains the logical identifiers of the OpenCL devices.

@item @code{calibrate} (default = 0):
If this flag is set, StarPU will calibrate the performance models when
executing tasks. If this value is equal to -1, the default value is used. The
default value is overwritten by the @code{STARPU_CALIBRATE} environment
variable when it is set.

@node starpu_conf_init
@subsection @code{starpu_conf_init} -- Initialize starpu_conf structure

This function initializes the @code{starpu_conf} structure passed as argument
with the default values. In case some configuration parameters are already
specified through environment variables, @code{starpu_conf_init} initializes
the fields of the structure according to the environment variables. For
instance if @code{STARPU_CALIBRATE} is set, its value is put in the
@code{.calibrate} field of the structure passed as argument.

@item @emph{Return value}:
Upon successful completion, this function returns 0. Otherwise, @code{-EINVAL}
indicates that the argument was NULL.

@item @emph{Prototype}:
@code{int starpu_conf_init(struct starpu_conf *conf);}

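For instance, the following sketch selects the scheduling policy and the
number of CPU workers explicitly while keeping all other parameters at their
(environment-aware) default values:

@example
struct starpu_conf conf;
starpu_conf_init(&conf);
conf.sched_policy_name = "heft"; /* overrides STARPU_SCHED */
conf.ncpus = 2;                  /* overrides STARPU_NCPUS */
int ret = starpu_init(&conf);
@end example
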
@node starpu_shutdown
@subsection @code{starpu_shutdown} -- Terminate StarPU
@deftypefun void starpu_shutdown (void)
This is the StarPU termination method. It must be called at the end of the
application: statistics and other post-mortem debugging information are not
guaranteed to be available until this method has been called.
@end deftypefun

@node Workers' Properties
@section Workers' Properties

* starpu_worker_get_count:: Get the number of processing units
* starpu_worker_get_count_by_type:: Get the number of processing units of a given type
* starpu_cpu_worker_get_count:: Get the number of CPUs controlled by StarPU
* starpu_cuda_worker_get_count:: Get the number of CUDA devices controlled by StarPU
* starpu_opencl_worker_get_count:: Get the number of OpenCL devices controlled by StarPU
* starpu_spu_worker_get_count:: Get the number of Cell SPUs controlled by StarPU
* starpu_worker_get_id:: Get the identifier of the current worker
* starpu_worker_get_ids_by_type:: Get the list of identifiers of workers with a given type
* starpu_worker_get_devid:: Get the device identifier of a worker
* starpu_worker_get_type:: Get the type of processing unit associated to a worker
* starpu_worker_get_name:: Get the name of a worker
* starpu_worker_get_memory_node:: Get the memory node of a worker

@node starpu_worker_get_count
@subsection @code{starpu_worker_get_count} -- Get the number of processing units
@deftypefun unsigned starpu_worker_get_count (void)
This function returns the number of workers (i.e. processing units executing
StarPU tasks). The returned value should be at most @code{STARPU_NMAXWORKERS}.
@end deftypefun

@node starpu_worker_get_count_by_type
@subsection @code{starpu_worker_get_count_by_type} -- Get the number of processing units of a given type
@deftypefun int starpu_worker_get_count_by_type ({enum starpu_archtype} @var{type})
Returns the number of workers of the type indicated by the argument. A
positive (or zero) value is returned on success; otherwise @code{-EINVAL}
indicates that the type is not valid.
@end deftypefun

@node starpu_cpu_worker_get_count
@subsection @code{starpu_cpu_worker_get_count} -- Get the number of CPUs controlled by StarPU
@deftypefun unsigned starpu_cpu_worker_get_count (void)
This function returns the number of CPUs controlled by StarPU. The returned
value should be at most @code{STARPU_NMAXCPUS}.
@end deftypefun

@node starpu_cuda_worker_get_count
@subsection @code{starpu_cuda_worker_get_count} -- Get the number of CUDA devices controlled by StarPU
@deftypefun unsigned starpu_cuda_worker_get_count (void)
This function returns the number of CUDA devices controlled by StarPU. The returned
value should be at most @code{STARPU_MAXCUDADEVS}.
@end deftypefun

@node starpu_opencl_worker_get_count
@subsection @code{starpu_opencl_worker_get_count} -- Get the number of OpenCL devices controlled by StarPU
@deftypefun unsigned starpu_opencl_worker_get_count (void)
This function returns the number of OpenCL devices controlled by StarPU. The returned
value should be at most @code{STARPU_MAXOPENCLDEVS}.
@end deftypefun

@node starpu_spu_worker_get_count
@subsection @code{starpu_spu_worker_get_count} -- Get the number of Cell SPUs controlled by StarPU
@deftypefun unsigned starpu_spu_worker_get_count (void)
This function returns the number of Cell SPUs controlled by StarPU.
@end deftypefun

@node starpu_worker_get_id
@subsection @code{starpu_worker_get_id} -- Get the identifier of the current worker
@deftypefun int starpu_worker_get_id (void)
This function returns the identifier of the worker associated to the calling
thread. The returned value is either -1 if the current context is not a StarPU
worker (i.e. when called from the application outside a task or a callback), or
an integer between 0 and @code{starpu_worker_get_count() - 1}.
@end deftypefun

@node starpu_worker_get_ids_by_type
@subsection @code{starpu_worker_get_ids_by_type} -- Get the list of identifiers of workers with a given type
@deftypefun int starpu_worker_get_ids_by_type ({enum starpu_archtype} @var{type}, int *@var{workerids}, int @var{maxsize})
Fill the workerids array with the identifiers of the workers that have the type
indicated in the first argument. The maxsize argument indicates the size of the
workerids array. The returned value gives the number of identifiers that were
put in the array. @code{-ERANGE} is returned if maxsize is lower than the
number of workers with the appropriate type: in that case, the array is filled
with the first maxsize elements. To avoid such overflows, the value of maxsize
can be chosen by means of the @code{starpu_worker_get_count_by_type} function,
or by passing a value greater or equal to @code{STARPU_NMAXWORKERS}.
@end deftypefun

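For instance, the identifiers of all CUDA workers can be collected as follows
(a sketch):

@example
int workerids[STARPU_NMAXWORKERS];
int nfound = starpu_worker_get_ids_by_type(STARPU_CUDA_WORKER,
                                           workerids, STARPU_NMAXWORKERS);
/* workerids[0] to workerids[nfound-1] now hold the identifiers */
@end example
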
@node starpu_worker_get_devid
@subsection @code{starpu_worker_get_devid} -- Get the device identifier of a worker
@deftypefun int starpu_worker_get_devid (int @var{id})
This function returns the device id of the worker associated to an identifier
(as returned by the @code{starpu_worker_get_id} function). In the case of a
CUDA worker, this device identifier is the logical device identifier exposed by
CUDA (used by the @code{cudaGetDevice} function for instance). The device
identifier of a CPU worker is the logical identifier of the core on which the
worker was bound; this identifier is either provided by the OS or by the
@code{hwloc} library in case it is available.
@end deftypefun

@node starpu_worker_get_type
@subsection @code{starpu_worker_get_type} -- Get the type of processing unit associated to a worker
@deftypefun {enum starpu_archtype} starpu_worker_get_type (int @var{id})
This function returns the type of worker associated to an identifier (as
returned by the @code{starpu_worker_get_id} function). The returned value
indicates the architecture of the worker: @code{STARPU_CPU_WORKER} for a CPU
core, @code{STARPU_CUDA_WORKER} for a CUDA device,
@code{STARPU_OPENCL_WORKER} for an OpenCL device, and
@code{STARPU_GORDON_WORKER} for a Cell SPU. The value returned for an invalid
identifier is unspecified.
@end deftypefun

@node starpu_worker_get_name
@subsection @code{starpu_worker_get_name} -- Get the name of a worker

@deftypefun void starpu_worker_get_name (int @var{id}, char *@var{dst}, size_t @var{maxlen})
StarPU associates a unique human readable string to each processing unit. This
function copies at most the first @var{maxlen} bytes of the unique string
associated to the worker identified by @var{id} into the
@var{dst} buffer. The caller is responsible for ensuring that @var{dst}
is a valid pointer to a buffer of at least @var{maxlen} bytes. Calling this
function on an invalid identifier results in unspecified behaviour.
@end deftypefun

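The following sketch prints the name of every worker by combining
@code{starpu_worker_get_count} and @code{starpu_worker_get_name}:

@example
unsigned nworkers = starpu_worker_get_count();
for (unsigned id = 0; id < nworkers; id++)
@{
     char name[64];
     starpu_worker_get_name(id, name, sizeof(name));
     printf("worker %u: %s\n", id, name);
@}
@end example
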
@node starpu_worker_get_memory_node
@subsection @code{starpu_worker_get_memory_node} -- Get the memory node of a worker
@deftypefun unsigned starpu_worker_get_memory_node (unsigned @var{workerid})
This function returns the identifier of the memory node associated to the
worker identified by @var{workerid}.
@end deftypefun

@node Data Library
@section Data Library

This section describes the data management facilities provided by StarPU.

We show how to use existing data interfaces in @ref{Data Interfaces}, but developers can
design their own data interfaces if required.

* starpu_malloc:: Allocate data and pin it
* starpu_access_mode:: Data access mode
* unsigned memory_node:: Memory node
* starpu_data_handle:: StarPU opaque data handle
* void *interface:: StarPU data interface
* starpu_data_register:: Register a piece of data to StarPU
* starpu_data_unregister:: Unregister a piece of data from StarPU
* starpu_data_invalidate:: Invalidate all data replicates
* starpu_data_acquire:: Access registered data from the application
* starpu_data_acquire_cb:: Access registered data from the application asynchronously
* starpu_data_release:: Release registered data from the application
* starpu_data_set_wt_mask:: Set the Write-Through mask
* starpu_data_prefetch_on_node:: Prefetch data to a given node

@node starpu_malloc
@subsection @code{starpu_malloc} -- Allocate data and pin it
@deftypefun int starpu_malloc (void **@var{A}, size_t @var{dim})
This function allocates data of the given size in main memory. It will also try to pin it in
CUDA or OpenCL, so that data transfers from this buffer can be asynchronous, and
thus permit data transfer and computation overlapping. The allocated buffer must
be freed thanks to the @code{starpu_free} function.
@end deftypefun

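For instance, a buffer suitable for asynchronous transfers can be allocated as
follows (a sketch, assuming @code{NX} is defined by the application):

@example
float *vector;
starpu_malloc((void **)&vector, NX * sizeof(vector[0]));
/* ... register the buffer, submit tasks ... */
starpu_free(vector);
@end example
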
@node starpu_access_mode
@subsection @code{starpu_access_mode} -- Data access mode
This datatype describes a data access mode. The different available modes are:

@item @code{STARPU_R} read-only mode.
@item @code{STARPU_W} write-only mode.
@item @code{STARPU_RW} read-write mode. This is equivalent to @code{STARPU_R|STARPU_W}.
@item @code{STARPU_SCRATCH} scratch memory. A temporary buffer is allocated for the task, but StarPU does not enforce data consistency, i.e. each device has its own buffer, independently of each other (even for CPUs). This is useful for temporary variables. For now, no behaviour is defined concerning the relation with the STARPU_R/W modes and the value provided at registration, i.e. the value of the scratch buffer is undefined at entry of the codelet function, but this is being considered for future extensions.
@item @code{STARPU_REDUX} reduction mode. TODO: document, as well as @code{starpu_data_set_reduction_methods}

@node unsigned memory_node
@subsection @code{unsigned memory_node} -- Memory node

@item @emph{Description}:
Every worker is associated to a memory node which is a logical abstraction of
the address space from which the processing unit gets its data. For instance,
the memory node associated to the different CPU workers represents main memory
(RAM), while the memory node associated to a GPU is the DRAM embedded on the
device. Every memory node is identified by a logical index which is accessible
from the @code{starpu_worker_get_memory_node} function. When registering a
piece of data to StarPU, the specified memory node indicates where the piece
of data initially resides (we also call this memory node the home node of a
piece of data).

@node starpu_data_handle
@subsection @code{starpu_data_handle} -- StarPU opaque data handle

@item @emph{Description}:
StarPU uses @code{starpu_data_handle} as an opaque handle to manage a piece of
data. Once a piece of data has been registered to StarPU, it is associated to a
@code{starpu_data_handle} which keeps track of the state of the piece of data
over the entire machine, so that we can maintain data consistency and locate
data replicates for instance.

@node void *interface
@subsection @code{void *interface} -- StarPU data interface

@item @emph{Description}:
Data management is done at a high-level in StarPU: rather than accessing a mere
list of contiguous buffers, the tasks may manipulate data that are described by
a high-level construct which we call data interface.

An example of data interface is the "vector" interface which describes a
contiguous data array on a specific memory node. This interface is a simple
structure containing the number of elements in the array, the size of the
elements, and the address of the array in the appropriate address space (this
address may be invalid if there is no valid copy of the array in the memory
node). More information on the data interfaces provided by StarPU is
given in @ref{Data Interfaces}.

When a piece of data managed by StarPU is used by a task, the task
implementation is given a pointer to an interface describing a valid copy of
the data that is accessible from the current processing unit.

@node starpu_data_register
@subsection @code{starpu_data_register} -- Register a piece of data to StarPU
@deftypefun void starpu_data_register (starpu_data_handle *@var{handleptr}, uint32_t @var{home_node}, void *@var{interface}, {struct starpu_data_interface_ops_t} *@var{ops})
Register a piece of data into the handle located at the @var{handleptr}
address. The @var{interface} buffer contains the initial description of the
data in the home node. The @var{ops} argument is a pointer to a structure
describing the different methods used to manipulate this type of interface. See
@ref{struct starpu_data_interface_ops_t} for more details on this structure.

If @code{home_node} is -1, StarPU will automatically
allocate the memory when it is used for the
first time in write-only mode. Once such a data handle has been automatically
allocated, it is possible to access it using any access mode.

Note that StarPU supplies a set of predefined types of interface (e.g. vector or
matrix) which can be registered by means of helper functions (e.g.
@code{starpu_vector_data_register} or @code{starpu_matrix_data_register}).
@end deftypefun

@node starpu_data_unregister
@subsection @code{starpu_data_unregister} -- Unregister a piece of data from StarPU
@deftypefun void starpu_data_unregister (starpu_data_handle @var{handle})
This function unregisters a data handle from StarPU. If the data was
automatically allocated by StarPU because the home node was -1, all
automatically allocated buffers are freed. Otherwise, a valid copy of the data
is put back into the home node in the buffer that was initially registered.
Using a data handle that has been unregistered from StarPU results in
undefined behaviour.
@end deftypefun

@node starpu_data_invalidate
@subsection @code{starpu_data_invalidate} -- Invalidate all data replicates
@deftypefun void starpu_data_invalidate (starpu_data_handle @var{handle})
Destroy all replicates of the data handle. After data invalidation, the first
access to the handle must be performed in write-only mode. Accessing an
invalidated data in read mode results in undefined behaviour.
@end deftypefun

@c TODO create a specific section about user interaction with the DSM?

@node starpu_data_acquire
@subsection @code{starpu_data_acquire} -- Access registered data from the application
@deftypefun int starpu_data_acquire (starpu_data_handle @var{handle}, starpu_access_mode @var{mode})
The application must call this function prior to accessing registered data from
main memory outside tasks. StarPU ensures that the application will get an
up-to-date copy of the data in main memory located where the data was
originally registered, and that all concurrent accesses (e.g. from tasks) will
be consistent with the access mode specified in the @var{mode} argument.
@code{starpu_data_release} must be called once the application does not need to
access the piece of data anymore. Note that implicit data
dependencies are also enforced by @code{starpu_data_acquire}, i.e.
@code{starpu_data_acquire} will wait for all tasks scheduled to work on
the data, unless they have been disabled explicitly by calling
@code{starpu_data_set_default_sequential_consistency_flag} or
@code{starpu_data_set_sequential_consistency_flag}.
@code{starpu_data_acquire} is a blocking call, so it cannot be called from
tasks or from their callbacks (in that case, @code{starpu_data_acquire} returns
@code{-EDEADLK}). Upon successful completion, this function returns 0.
@end deftypefun

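A typical use is to read a result back from the application once the tasks
working on it have completed (a sketch):

@example
starpu_data_acquire(vector_handle, STARPU_R);
/* the vector can safely be read from main memory here */
starpu_data_release(vector_handle);
@end example
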
@node starpu_data_acquire_cb
@subsection @code{starpu_data_acquire_cb} -- Access registered data from the application asynchronously
@deftypefun int starpu_data_acquire_cb (starpu_data_handle @var{handle}, starpu_access_mode @var{mode}, void (*@var{callback})(void *), void *@var{arg})
@code{starpu_data_acquire_cb} is the asynchronous equivalent of
@code{starpu_data_acquire}. When the data specified in the first argument is
available in the appropriate access mode, the callback function is executed.
The application may access the requested data during the execution of this
callback. The callback function must call @code{starpu_data_release} once the
application does not need to access the piece of data anymore.
Note that implicit data dependencies are also enforced by
@code{starpu_data_acquire_cb} in case they are enabled.
Contrary to @code{starpu_data_acquire}, this function is non-blocking and may
be called from task callbacks. Upon successful completion, this function
returns 0.
@end deftypefun

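A sketch of the asynchronous variant; here @code{my_callback} is a hypothetical
application function (not part of the StarPU API) which releases the data
itself:

@example
static void my_callback(void *arg)
@{
     starpu_data_handle handle = (starpu_data_handle)arg;
     /* the data may be accessed from main memory here */
     starpu_data_release(handle);
@}

starpu_data_acquire_cb(vector_handle, STARPU_R,
                       my_callback, vector_handle);
@end example
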
@node starpu_data_release
@subsection @code{starpu_data_release} -- Release registered data from the application
@deftypefun void starpu_data_release (starpu_data_handle @var{handle})
This function releases the piece of data acquired by the application either by
@code{starpu_data_acquire} or by @code{starpu_data_acquire_cb}.
@end deftypefun

@node starpu_data_set_wt_mask
@subsection @code{starpu_data_set_wt_mask} -- Set the Write-Through mask
@deftypefun void starpu_data_set_wt_mask (starpu_data_handle @var{handle}, uint32_t @var{wt_mask})
This function sets the write-through mask of a given data, i.e. a bitmask of
nodes where the data should always be replicated after modification.
@end deftypefun

@node starpu_data_prefetch_on_node
@subsection @code{starpu_data_prefetch_on_node} -- Prefetch data to a given node

@deftypefun int starpu_data_prefetch_on_node (starpu_data_handle @var{handle}, unsigned @var{node}, unsigned @var{async})
Issue a prefetch request for a given data to a given node, i.e.
requests that the data be replicated to the given node, so that it is available
there for tasks. If the @var{async} parameter is 0, the call will block until
the transfer is completed, else the call will return as soon as the request is
scheduled (which may however have to wait for a task completion).
@end deftypefun

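For instance, an asynchronous prefetch of a handle to the memory node of a
given worker can be sketched as:

@example
unsigned node = starpu_worker_get_memory_node(workerid);
starpu_data_prefetch_on_node(vector_handle, node, 1 /* async */);
@end example
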
@node Data Interfaces
@section Data Interfaces

* Variable Interface::
* Vector Interface::
* Matrix Interface::
* 3D Matrix Interface::
* BCSR Interface for Sparse Matrices (Blocked Compressed Sparse Row Representation)::
* CSR Interface for Sparse Matrices (Compressed Sparse Row Representation)::

@node Variable Interface
@subsection Variable Interface

@item @emph{Description}:
This variant of @code{starpu_data_register} uses the variable interface,
i.e. for a mere single variable. @code{ptr} is the address of the variable,
and @code{elemsize} is the size of the variable.
@item @emph{Prototype}:
@code{void starpu_variable_data_register(starpu_data_handle *handle, uint32_t home_node,
uintptr_t ptr, size_t elemsize);}
@item @emph{Example}:
starpu_data_handle var_handle;
starpu_variable_data_register(&var_handle, 0, (uintptr_t)&var, sizeof(var));

@node Vector Interface
@subsection Vector Interface

@item @emph{Description}:
This variant of @code{starpu_data_register} uses the vector interface,
i.e. for mere arrays of elements. @code{ptr} is the address of the first
element in the home node. @code{nx} is the number of elements in the vector.
@code{elemsize} is the size of each element.
@item @emph{Prototype}:
@code{void starpu_vector_data_register(starpu_data_handle *handle, uint32_t home_node,
uintptr_t ptr, uint32_t nx, size_t elemsize);}
@item @emph{Example}:
starpu_data_handle vector_handle;
starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX,
                            sizeof(vector[0]));

@node Matrix Interface
@subsection Matrix Interface

@item @emph{Description}:
This variant of @code{starpu_data_register} uses the matrix interface, i.e. for
matrices of elements. @code{ptr} is the address of the first element in the home
node. @code{ld} is the number of elements between rows. @code{nx} is the number
of elements in a row (this can be different from @code{ld} if there are extra
elements for alignment for instance). @code{ny} is the number of rows.
@code{elemsize} is the size of each element.
@item @emph{Prototype}:
@code{void starpu_matrix_data_register(starpu_data_handle *handle, uint32_t home_node,
uintptr_t ptr, uint32_t ld, uint32_t nx,
uint32_t ny, size_t elemsize);}
@item @emph{Example}:
starpu_data_handle matrix_handle;
matrix = (float*)malloc(width * height * sizeof(float));
starpu_matrix_data_register(&matrix_handle, 0, (uintptr_t)matrix,
                            width, width, height, sizeof(float));

@node 3D Matrix Interface
@subsection 3D Matrix Interface

@item @emph{Description}:
This variant of @code{starpu_data_register} uses the 3D matrix interface.
@code{ptr} is the address of the first element of the array in the home node.
@code{ldy} is the number of elements between rows. @code{ldz} is the number
of rows between z planes. @code{nx} is the number of elements in a row (this
can be different from @code{ldy} if there are extra elements for alignment
for instance). @code{ny} is the number of rows in a z plane (likewise with
@code{ldz}). @code{nz} is the number of z planes. @code{elemsize} is the size of
each element.
@item @emph{Prototype}:
@code{void starpu_block_data_register(starpu_data_handle *handle, uint32_t home_node,
uintptr_t ptr, uint32_t ldy, uint32_t ldz, uint32_t nx,
uint32_t ny, uint32_t nz, size_t elemsize);}
@item @emph{Example}:
starpu_data_handle block_handle;
block = (float*)malloc(nx*ny*nz*sizeof(float));
starpu_block_data_register(&block_handle, 0, (uintptr_t)block,
                           nx, nx*ny, nx, ny, nz, sizeof(float));

@node BCSR Interface for Sparse Matrices (Blocked Compressed Sparse Row Representation)
@subsection BCSR Interface for Sparse Matrices (Blocked Compressed Sparse Row Representation)

@deftypefun void starpu_bcsr_data_register (starpu_data_handle *@var{handle}, uint32_t @var{home_node}, uint32_t @var{nnz}, uint32_t @var{nrow}, uintptr_t @var{nzval}, uint32_t *@var{colind}, uint32_t *@var{rowptr}, uint32_t @var{firstentry}, uint32_t @var{r}, uint32_t @var{c}, size_t @var{elemsize})
This variant of @code{starpu_data_register} uses the BCSR sparse matrix interface.
@end deftypefun

@node CSR Interface for Sparse Matrices (Compressed Sparse Row Representation)
@subsection CSR Interface for Sparse Matrices (Compressed Sparse Row Representation)

@deftypefun void starpu_csr_data_register (starpu_data_handle *@var{handle}, uint32_t @var{home_node}, uint32_t @var{nnz}, uint32_t @var{nrow}, uintptr_t @var{nzval}, uint32_t *@var{colind}, uint32_t *@var{rowptr}, uint32_t @var{firstentry}, size_t @var{elemsize})
This variant of @code{starpu_data_register} uses the CSR sparse matrix interface.
@end deftypefun

@node Data Partition
@section Data Partition

* struct starpu_data_filter:: StarPU filter structure
* starpu_data_partition:: Partition Data
* starpu_data_unpartition:: Unpartition Data
* starpu_data_get_nb_children::
* starpu_data_get_sub_data::
* Predefined filter functions::

@node struct starpu_data_filter
@subsection @code{struct starpu_data_filter} -- StarPU filter structure

@item @emph{Description}:
The filter structure describes a data partitioning operation, to be given to the
@code{starpu_data_partition} function, see @ref{starpu_data_partition} for an example.
@item @emph{Fields}:

@item @code{filter_func}:
This function fills the @code{child_interface} structure with interface
information for the @code{id}-th child of the parent @code{father_interface} (among @code{nparts}).
@code{void (*filter_func)(void *father_interface, void* child_interface, struct starpu_data_filter *, unsigned id, unsigned nparts);}
@item @code{nchildren}:
This is the number of parts to partition the data into.
@item @code{get_nchildren}:
This returns the number of children. This can be used instead of @code{nchildren} when the number of
children depends on the actual data (e.g. the number of blocks in a sparse
matrix).
@code{unsigned (*get_nchildren)(struct starpu_data_filter *, starpu_data_handle initial_handle);}
@item @code{get_child_ops}:
In case the resulting children use a different data interface, this function
returns which interface is used by child number @code{id}.
@code{struct starpu_data_interface_ops_t *(*get_child_ops)(struct starpu_data_filter *, unsigned id);}
@item @code{filter_arg}:
Some filters take an additional parameter, but this is usually unused.
@item @code{filter_arg_ptr}:
Some filters take an additional array parameter like the sizes of the parts, but
this is usually unused.

@node starpu_data_partition
@subsection starpu_data_partition -- Partition Data
@item @emph{Description}:
This requests partitioning of one StarPU data @code{initial_handle} into several
subdata according to the filter @code{f}.
@item @emph{Prototype}:
@code{void starpu_data_partition(starpu_data_handle initial_handle, struct starpu_data_filter *f);}
@item @emph{Example}:
@smallexample
struct starpu_data_filter f = @{
    .filter_func = starpu_vertical_block_filter_func,
    .nchildren = nslicesx,
    .get_nchildren = NULL,
    .get_child_ops = NULL
@};
starpu_data_partition(A_handle, &f);
@end smallexample
@node starpu_data_unpartition
@subsection starpu_data_unpartition -- Unpartition data
@item @emph{Description}:
This unapplies one filter, thus unpartitioning the data. The pieces of data are
collected back into one big piece on the @code{gathering_node} (usually 0).
@item @emph{Prototype}:
@code{void starpu_data_unpartition(starpu_data_handle root_data, uint32_t gathering_node);}
@item @emph{Example}:
@smallexample
starpu_data_unpartition(A_handle, 0);
@end smallexample
@node starpu_data_get_nb_children
@subsection starpu_data_get_nb_children
@item @emph{Description}:
This function returns the number of children.
@item @emph{Return value}:
The number of children.
@item @emph{Prototype}:
@code{int starpu_data_get_nb_children(starpu_data_handle handle);}

@c starpu_data_handle starpu_data_get_child(starpu_data_handle handle, unsigned i);
@node starpu_data_get_sub_data
@subsection starpu_data_get_sub_data
@item @emph{Description}:
After partitioning a StarPU data handle by applying a filter,
@code{starpu_data_get_sub_data} can be used to get handles for each of the data
portions. @code{root_data} is the parent data that was partitioned. @code{depth}
is the number of filters to traverse (in case several filters have been applied,
to e.g. partition in row blocks, and then in column blocks), and the subsequent
parameters are the indexes.
@item @emph{Return value}:
A handle to the subdata.
@item @emph{Prototype}:
@code{starpu_data_handle starpu_data_get_sub_data(starpu_data_handle root_data, unsigned depth, ... );}
@item @emph{Example}:
@smallexample
h = starpu_data_get_sub_data(A_handle, 1, taskx);
@end smallexample
@node Predefined filter functions
@subsection Predefined filter functions

@menu
* Partitioning BCSR Data::
* Partitioning BLAS interface::
* Partitioning Vector Data::
* Partitioning Block Data::
@end menu

This section gives a partial list of the predefined partitioning functions.
Examples on how to use them are shown in @ref{Partitioning Data}. The complete
list can be found in @code{starpu_data_filters.h}.
@node Partitioning BCSR Data
@subsubsection Partitioning BCSR Data

@deftypefun void starpu_canonical_block_filter_bcsr (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})
@end deftypefun

@deftypefun void starpu_vertical_block_filter_func_csr (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})
@end deftypefun

@node Partitioning BLAS interface
@subsubsection Partitioning BLAS interface

@deftypefun void starpu_block_filter_func (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})
This partitions a dense matrix into horizontal blocks.
@end deftypefun

@deftypefun void starpu_vertical_block_filter_func (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})
This partitions a dense matrix into vertical blocks.
@end deftypefun

@node Partitioning Vector Data
@subsubsection Partitioning Vector Data

@deftypefun void starpu_block_filter_func_vector (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})
This partitions a vector into blocks of the same size.
@end deftypefun

@deftypefun void starpu_vector_list_filter_func (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})
This partitions a vector into blocks of sizes given in @var{filter_arg_ptr}.
@end deftypefun

@deftypefun void starpu_vector_divide_in_2_filter_func (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})
This partitions a vector into two blocks, the size of the first block being given in @var{filter_arg}.
@end deftypefun
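For instance, a vector can be split into equal chunks and each chunk processed separately. The following sketch is illustrative: @code{vector_handle} and @code{nparts} are assumed to be defined by the application.

```c
/* Hedged sketch: partition a registered vector into 'nparts' equal
 * blocks, then retrieve a handle for each block. */
struct starpu_data_filter f = {
    .filter_func = starpu_block_filter_func_vector,
    .nchildren = nparts,
    .get_nchildren = NULL,
    .get_child_ops = NULL
};
starpu_data_partition(vector_handle, &f);

for (unsigned i = 0; i < nparts; i++) {
    /* depth 1: a single filter was applied */
    starpu_data_handle sub = starpu_data_get_sub_data(vector_handle, 1, i);
    /* ... submit tasks working on 'sub' ... */
}
```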
@node Partitioning Block Data
@subsubsection Partitioning Block Data

@deftypefun void starpu_block_filter_func_block (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})
This partitions a 3D matrix along the X axis.
@end deftypefun
@node Codelets and Tasks
@section Codelets and Tasks

This section describes the interface to manipulate codelets and tasks.

@deftp {Data Type} {struct starpu_codelet}
The codelet structure describes a kernel that is possibly implemented on various
targets. For compatibility, make sure to initialize the whole structure to zero.

@item @code{where}
Indicates which types of processing units are able to execute the codelet.
@code{STARPU_CPU|STARPU_CUDA} for instance indicates that the codelet is
implemented for both CPU cores and CUDA devices while @code{STARPU_GORDON}
indicates that it is only available on Cell SPUs.

@item @code{cpu_func} (optional)
Is a function pointer to the CPU implementation of the codelet. Its prototype
must be: @code{void cpu_func(void *buffers[], void *cl_arg)}. The first
argument is the array of data managed by the data management library, and
the second argument is a pointer to the argument passed from the @code{cl_arg}
field of the @code{starpu_task} structure.
The @code{cpu_func} field is ignored if @code{STARPU_CPU} does not appear in
the @code{where} field; it must be non-null otherwise.

@item @code{cuda_func} (optional)
Is a function pointer to the CUDA implementation of the codelet. @emph{This
must be a host function written in the CUDA runtime API}. Its prototype must
be: @code{void cuda_func(void *buffers[], void *cl_arg);}. The @code{cuda_func}
field is ignored if @code{STARPU_CUDA} does not appear in the @code{where}
field; it must be non-null otherwise.

@item @code{opencl_func} (optional)
Is a function pointer to the OpenCL implementation of the codelet. Its
prototype must be:
@code{void opencl_func(starpu_data_interface_t *descr, void *arg);}.
This pointer is ignored if @code{STARPU_OPENCL} does not appear in the
@code{where} field; it must be non-null otherwise.

@item @code{gordon_func} (optional)
This is the index of the Cell SPU implementation within the Gordon library.
See the Gordon documentation for more details on how to register a kernel and
retrieve its index.

@item @code{nbuffers}
Specifies the number of arguments taken by the codelet. These arguments are
managed by the DSM and are accessed from the @code{void *buffers[]}
array. The constant argument passed with the @code{cl_arg} field of the
@code{starpu_task} structure is not counted in this number. This value should
not be above @code{STARPU_NMAXBUFS}.

@item @code{model} (optional)
This is a pointer to the task duration performance model associated to this
codelet. This optional field is ignored when set to @code{NULL}.

@item @code{power_model} (optional)
This is a pointer to the task power consumption performance model associated
to this codelet. This optional field is ignored when set to @code{NULL}.
In the case of parallel codelets, this has to account for all processing units
involved in the parallel execution.
@end deftp
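Putting the fields above together, a codelet with both a CPU and a CUDA implementation could be declared as in the sketch below; @code{scal_cpu_func} and @code{scal_cuda_func} are hypothetical kernel functions, not part of StarPU.

```c
/* Hedged sketch of a codelet declaration. The two kernel functions are
 * assumptions; only the field usage follows the description above. */
static struct starpu_codelet scal_cl = {
    .where = STARPU_CPU|STARPU_CUDA, /* runnable on CPU cores and CUDA devices */
    .cpu_func = scal_cpu_func,   /* void scal_cpu_func(void *buffers[], void *cl_arg) */
    .cuda_func = scal_cuda_func, /* host function using the CUDA runtime API */
    .nbuffers = 1,               /* one buffer managed by the DSM */
    .model = NULL                /* no performance model attached */
};
```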
@deftp {Data Type} {struct starpu_task}
The @code{starpu_task} structure describes a task that can be offloaded on the various
processing units managed by StarPU. It instantiates a codelet. It can either be
allocated dynamically with the @code{starpu_task_create} method, or declared
statically. In the latter case, the programmer has to zero the
@code{starpu_task} structure and to fill the different fields properly. The
indicated default values correspond to the configuration of a task allocated
with @code{starpu_task_create}.

@item @code{cl}
Is a pointer to the corresponding @code{starpu_codelet} data structure. This
describes where the kernel should be executed, and supplies the appropriate
implementations. When set to @code{NULL}, no code is executed during the task;
such empty tasks can be useful for synchronization purposes.

@item @code{buffers}
Is an array of @code{starpu_buffer_descr_t} structures. It describes the
different pieces of data accessed by the task, and how they should be accessed.
The @code{starpu_buffer_descr_t} structure is composed of two fields: the
@code{handle} field specifies the handle of the piece of data, and the
@code{mode} field is the required access mode (e.g. @code{STARPU_RW}). The number
of entries in this array must be specified in the @code{nbuffers} field of the
@code{starpu_codelet} structure, and should not exceed @code{STARPU_NMAXBUFS}.
If insufficient, this value can be set with the @code{--enable-maxbuffers}
option when configuring StarPU.

@item @code{cl_arg} (optional; default: @code{NULL})
This pointer is passed to the codelet through the second argument
of the codelet implementation (e.g. @code{cpu_func} or @code{cuda_func}).
In the specific case of the Cell processor, see the @code{cl_arg_size}
field below.

@item @code{cl_arg_size} (optional, Cell-specific)
In the case of the Cell processor, the @code{cl_arg} pointer is not directly
given to the SPU function. A buffer of size @code{cl_arg_size} is allocated on
the SPU. This buffer is then filled with the @code{cl_arg_size} bytes starting
at address @code{cl_arg}. In this case, the argument given to the SPU codelet
is therefore not the @code{cl_arg} pointer, but the address of the buffer in
local store (LS) instead. This field is ignored for CPU, CUDA and OpenCL
codelets, where the @code{cl_arg} pointer is given as such.

@item @code{callback_func} (optional) (default: @code{NULL})
This is a function pointer of prototype @code{void (*f)(void *)} which
specifies a possible callback. If this pointer is non-null, the callback
function is executed @emph{on the host} after the execution of the task. The
callback is passed the value contained in the @code{callback_arg} field. No
callback is executed if the field is set to @code{NULL}.

@item @code{callback_arg} (optional) (default: @code{NULL})
This is the pointer passed to the callback function. This field is ignored if
the @code{callback_func} field is set to @code{NULL}.

@item @code{use_tag} (optional) (default: @code{0})
If set, this flag indicates that the task should be associated with the tag
contained in the @code{tag_id} field. Tags allow the application to synchronize
with the task and to express task dependencies easily.

@item @code{tag_id}
This field contains the tag associated to the task if the @code{use_tag} field
was set; it is ignored otherwise.

@item @code{synchronous}
If this flag is set, the @code{starpu_task_submit} function is blocking and
returns only when the task has been executed (or if no worker is able to
process the task). Otherwise, @code{starpu_task_submit} returns immediately.

@item @code{priority} (optional) (default: @code{STARPU_DEFAULT_PRIO})
This field indicates a level of priority for the task. This is an integer value
that must be set between the return values of the
@code{starpu_sched_get_min_priority} function for the least important tasks,
and that of the @code{starpu_sched_get_max_priority} function for the most important
tasks (included). The @code{STARPU_MIN_PRIO} and @code{STARPU_MAX_PRIO} macros
are provided for convenience and respectively return the values of
@code{starpu_sched_get_min_priority} and @code{starpu_sched_get_max_priority}.
Default priority is @code{STARPU_DEFAULT_PRIO}, which is always defined as 0 in
order to allow static task initialization. Scheduling strategies that take
priorities into account can use this parameter to take better scheduling
decisions, but the scheduling policy may also ignore it.

@item @code{execute_on_a_specific_worker} (default: @code{0})
If this flag is set, StarPU will bypass the scheduler and directly assign this
task to the worker specified by the @code{workerid} field.

@item @code{workerid} (optional)
If the @code{execute_on_a_specific_worker} field is set, this field indicates
the identifier of the worker that should process this task (as
returned by @code{starpu_worker_get_id}). This field is ignored if the
@code{execute_on_a_specific_worker} field is set to 0.

@item @code{detach} (optional) (default: @code{1})
If this flag is set, it is not possible to synchronize with the task
by the means of @code{starpu_task_wait} later on. Internal data structures
are only guaranteed to be freed once @code{starpu_task_wait} is called if the
flag is not set.

@item @code{destroy} (optional) (default: @code{1})
If this flag is set, the task structure will automatically be freed, either
after the execution of the callback if the task is detached, or during
@code{starpu_task_wait} otherwise. If this flag is not set, dynamically
allocated data structures will not be freed until @code{starpu_task_destroy} is
called explicitly. Setting this flag for a statically allocated task structure
will result in undefined behaviour.

@item @code{predicted} (output field)
Predicted duration of the task. This field is only set if the scheduling
strategy uses performance models.
@end deftp
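As a minimal sketch of the fields described above, the following allocates a task, attaches a codelet and one piece of data, and submits it synchronously. @code{scal_cl} and @code{vector_handle} are assumptions standing in for an application's codelet and registered data.

```c
/* Hedged sketch: create, configure and synchronously submit a task. */
struct starpu_task *task = starpu_task_create();
task->cl = &scal_cl;                     /* codelet to run (assumed defined) */
task->buffers[0].handle = vector_handle; /* data accessed by the task */
task->buffers[0].mode = STARPU_RW;       /* read-write access */
float factor = 3.14f;
task->cl_arg = &factor;                  /* constant argument for the kernel */
task->synchronous = 1;                   /* starpu_task_submit will block */

int ret = starpu_task_submit(task);
if (ret == -ENODEV)
    fprintf(stderr, "no worker can execute this task\n");
```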
@deftypefun void starpu_task_init ({struct starpu_task} *@var{task})
Initialize @var{task} with default values. This function is implicitly
called by @code{starpu_task_create}. By default, tasks initialized with
@code{starpu_task_init} must be deinitialized explicitly with
@code{starpu_task_deinit}. Tasks can also be initialized statically, using the
constant @code{STARPU_TASK_INITIALIZER}.
@end deftypefun

@deftypefun {struct starpu_task *} starpu_task_create (void)
Allocate a task structure and initialize it with default values. Tasks
allocated dynamically with @code{starpu_task_create} are automatically freed when the
task is terminated. If the destroy flag is explicitly unset, the resources used
by the task must be freed by calling
@code{starpu_task_destroy}.
@end deftypefun

@deftypefun void starpu_task_deinit ({struct starpu_task} *@var{task})
Release all the structures automatically allocated to execute @var{task}. This is
called automatically by @code{starpu_task_destroy}, but the task structure itself is not
freed. This should be used for statically allocated tasks for instance.
@end deftypefun

@deftypefun void starpu_task_destroy ({struct starpu_task} *@var{task})
Free the resources allocated during @code{starpu_task_create} and
associated with @var{task}. This function can be called automatically
after the execution of a task by setting the @code{destroy} flag of the
@code{starpu_task} structure (default behaviour). Calling this function
on a statically allocated task results in undefined behaviour.
@end deftypefun

@deftypefun int starpu_task_wait ({struct starpu_task} *@var{task})
This function blocks until @var{task} has been executed. It is not possible to
synchronize with a task more than once. It is not possible to wait for
synchronous or detached tasks.
Upon successful completion, this function returns 0. Otherwise, @code{-EINVAL}
indicates that the specified task was either synchronous or detached.
@end deftypefun

@deftypefun int starpu_task_submit ({struct starpu_task} *@var{task})
This function submits @var{task} to StarPU. Calling this function does
not mean that the task will be executed immediately, as there can be data or task
(tag) dependencies that are not fulfilled yet: StarPU will take care of
scheduling this task with respect to such dependencies.
This function returns immediately if the @code{synchronous} field of the
@code{starpu_task} structure is set to 0, and blocks until the termination of
the task otherwise. It is also possible to synchronize the application with
asynchronous tasks by the means of tags, using the @code{starpu_tag_wait}
function for instance.
In case of success, this function returns 0; a return value of @code{-ENODEV}
means that there is no worker able to process this task (e.g. there is no GPU
available and this task is only implemented for CUDA devices).
@end deftypefun

@deftypefun int starpu_task_wait_for_all (void)
This function blocks until all the tasks that were submitted are terminated.
@end deftypefun

@deftypefun {struct starpu_task *} starpu_get_current_task (void)
This function returns the task currently executed by the worker, or
@code{NULL} if it is called either from a thread that is not a task or simply
because there is no task being executed at the moment.
@end deftypefun

@deftypefun void starpu_display_codelet_stats ({struct starpu_codelet_t} *@var{cl})
Output on @code{stderr} some statistics on the codelet @var{cl}.
@end deftypefun
@c Callbacks : what can we put in callbacks ?

@node Explicit Dependencies
@section Explicit Dependencies

@menu
* starpu_task_declare_deps_array:: starpu_task_declare_deps_array
* starpu_tag_t:: Task logical identifier
* starpu_tag_declare_deps:: Declare the Dependencies of a Tag
* starpu_tag_declare_deps_array:: Declare the Dependencies of a Tag
* starpu_tag_wait:: Block until a Tag is terminated
* starpu_tag_wait_array:: Block until a set of Tags is terminated
* starpu_tag_remove:: Destroy a Tag
* starpu_tag_notify_from_apps:: Feed a tag explicitly
@end menu
@node starpu_task_declare_deps_array
@subsection @code{starpu_task_declare_deps_array} -- Declare task dependencies
@deftypefun void starpu_task_declare_deps_array ({struct starpu_task} *@var{task}, unsigned @var{ndeps}, {struct starpu_task} *@var{task_array}[])
Declare task dependencies between a @var{task} and an array of tasks of length
@var{ndeps}. This function must be called prior to the submission of the task,
but it may be called after the submission or the execution of the tasks in the
array, provided the tasks are still valid (i.e. they were not automatically
destroyed). Calling this function on a task that was already submitted or with
an entry of @var{task_array} that is not a valid task anymore results in
undefined behaviour. If @var{ndeps} is zero, no dependency is added. It is
possible to call @code{starpu_task_declare_deps_array} multiple times on the
same task; in this case, the dependencies are added. It is possible to have
redundancy in the task dependencies.
@end deftypefun
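For instance, a task can be made to wait for two others as in the sketch below; @code{task_a}, @code{task_b} and @code{task_c} are assumed to have been created with @code{starpu_task_create}.

```c
/* Hedged sketch: task_c will only run once task_a and task_b are done. */
struct starpu_task *deps[2] = {task_a, task_b};
starpu_task_declare_deps_array(task_c, 2, deps); /* before submitting task_c */

starpu_task_submit(task_a);
starpu_task_submit(task_b);
starpu_task_submit(task_c); /* held back until both dependencies complete */
```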
@node starpu_tag_t
@subsection @code{starpu_tag_t} -- Task logical identifier
@item @emph{Description}:
It is possible to associate a task with a unique ``tag'' chosen by the application, and to express
dependencies between tasks by the means of those tags. To do so, fill the
@code{tag_id} field of the @code{starpu_task} structure with a tag number (which can
be arbitrary) and set the @code{use_tag} field to 1.
If @code{starpu_tag_declare_deps} is called with this tag number, the task will
not be started until the tasks which hold the declared dependency tags have
completed.
@node starpu_tag_declare_deps
@subsection @code{starpu_tag_declare_deps} -- Declare the Dependencies of a Tag
@item @emph{Description}:
Specify the dependencies of the task identified by tag @code{id}. The first
argument specifies the tag which is configured, the second argument gives the
number of tag(s) on which @code{id} depends. The following arguments are the
tags which have to be terminated to unlock the task.
This function must be called before the associated task is submitted to StarPU
with @code{starpu_task_submit}.
Because of the variable arity of @code{starpu_tag_declare_deps}, note that the
last arguments @emph{must} be of type @code{starpu_tag_t}: constant values
typically need to be explicitly cast. Using the
@code{starpu_tag_declare_deps_array} function avoids this hazard.
@item @emph{Prototype}:
@code{void starpu_tag_declare_deps(starpu_tag_t id, unsigned ndeps, ...);}
@item @emph{Example}:
@smallexample
/* Tag 0x1 depends on tags 0x32 and 0x52 */
starpu_tag_declare_deps((starpu_tag_t)0x1,
        2, (starpu_tag_t)0x32, (starpu_tag_t)0x52);
@end smallexample
@node starpu_tag_declare_deps_array
@subsection @code{starpu_tag_declare_deps_array} -- Declare the Dependencies of a Tag
@item @emph{Description}:
This function is similar to @code{starpu_tag_declare_deps}, except that it
does not take a variable number of arguments but an array of tags of size
@code{ndeps}.
@item @emph{Prototype}:
@code{void starpu_tag_declare_deps_array(starpu_tag_t id, unsigned ndeps, starpu_tag_t *array);}
@item @emph{Example}:
@smallexample
/* Tag 0x1 depends on tags 0x32 and 0x52 */
starpu_tag_t tag_array[2] = @{0x32, 0x52@};
starpu_tag_declare_deps_array((starpu_tag_t)0x1, 2, tag_array);
@end smallexample
@node starpu_tag_wait
@subsection @code{starpu_tag_wait} -- Block until a Tag is terminated
@deftypefun void starpu_tag_wait (starpu_tag_t @var{id})
This function blocks until the task associated to tag @var{id} has been
executed. This is a blocking call which must therefore not be called within
tasks or callbacks, but only from the application directly. It is possible to
synchronize with the same tag multiple times, as long as the
@code{starpu_tag_remove} function is not called. Note that it is still
possible to synchronize with a tag associated to a task whose @code{starpu_task}
data structure was freed (e.g. if the @code{destroy} flag of the
@code{starpu_task} was enabled).
@end deftypefun
@node starpu_tag_wait_array
@subsection @code{starpu_tag_wait_array} -- Block until a set of Tags is terminated
@deftypefun void starpu_tag_wait_array (unsigned @var{ntags}, starpu_tag_t *@var{id})
This function is similar to @code{starpu_tag_wait} except that it blocks until
@emph{all} the @var{ntags} tags contained in the @var{id} array are
terminated.
@end deftypefun
@node starpu_tag_remove
@subsection @code{starpu_tag_remove} -- Destroy a Tag
@deftypefun void starpu_tag_remove (starpu_tag_t @var{id})
This function releases the resources associated to tag @var{id}. It can be
called once the corresponding task has been executed and when there is
no other tag that depends on this tag anymore.
@end deftypefun
@node starpu_tag_notify_from_apps
@subsection @code{starpu_tag_notify_from_apps} -- Feed a Tag explicitly
@deftypefun void starpu_tag_notify_from_apps (starpu_tag_t @var{id})
This function explicitly unlocks tag @var{id}. It may be useful in the
case of applications which execute part of their computation outside StarPU
tasks (e.g. third-party libraries). It is also provided as a
convenient tool for the programmer, for instance to entirely construct the task
DAG before actually giving StarPU the opportunity to execute the tasks.
@end deftypefun
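The following sketch illustrates the external-computation case: a task waits on a tag that the application unlocks by hand once some work done outside StarPU has finished. @code{cl} and @code{run_third_party_computation} are hypothetical names used for illustration.

```c
/* Hedged sketch: tag 0x1 (the task's own tag) depends on tag 0x42,
 * which is fed explicitly by the application. */
starpu_tag_declare_deps((starpu_tag_t)0x1, 1, (starpu_tag_t)0x42);

struct starpu_task *task = starpu_task_create();
task->cl = &cl;       /* assumed codelet */
task->use_tag = 1;
task->tag_id = 0x1;
starpu_task_submit(task); /* the task cannot start yet */

run_third_party_computation();  /* hypothetical work outside StarPU */
starpu_tag_notify_from_apps((starpu_tag_t)0x42); /* now the task may run */
```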
@node Implicit Data Dependencies
@section Implicit Data Dependencies

@menu
* starpu_data_set_default_sequential_consistency_flag:: starpu_data_set_default_sequential_consistency_flag
* starpu_data_get_default_sequential_consistency_flag:: starpu_data_get_default_sequential_consistency_flag
* starpu_data_set_sequential_consistency_flag:: starpu_data_set_sequential_consistency_flag
@end menu

In this section, we describe how StarPU makes it possible to insert implicit
task dependencies in order to enforce sequential data consistency. When this
data consistency is enabled on a specific data handle, any data access will
appear as sequentially consistent from the application's point of view. For instance, if the
application submits two tasks that access the same piece of data in read-only
mode, and then a third task that accesses it in write mode, dependencies will be
added between the first two tasks and the third one. Implicit data dependencies
are also inserted in the case of data accesses from the application.
@node starpu_data_set_default_sequential_consistency_flag
@subsection @code{starpu_data_set_default_sequential_consistency_flag} -- Set default sequential consistency flag
@deftypefun void starpu_data_set_default_sequential_consistency_flag (unsigned @var{flag})
Set the default sequential consistency flag. If a non-zero value is passed,
sequential data consistency will be enforced for all handles registered after
this function call, otherwise it is disabled. By default, StarPU enables
sequential data consistency. It is also possible to select the data consistency
mode of a specific data handle with the
@code{starpu_data_set_sequential_consistency_flag} function.
@end deftypefun
@node starpu_data_get_default_sequential_consistency_flag
@subsection @code{starpu_data_get_default_sequential_consistency_flag} -- Get current default sequential consistency flag
@deftypefun unsigned starpu_data_get_default_sequential_consistency_flag (void)
This function returns the current default sequential consistency flag.
@end deftypefun
@node starpu_data_set_sequential_consistency_flag
@subsection @code{starpu_data_set_sequential_consistency_flag} -- Set data sequential consistency mode
@deftypefun void starpu_data_set_sequential_consistency_flag (starpu_data_handle @var{handle}, unsigned @var{flag})
Select the data consistency mode associated to a data handle. The consistency
mode set using this function has priority over the default mode, which can
be set with @code{starpu_data_set_default_sequential_consistency_flag}.
@end deftypefun
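A typical use is to keep implicit dependencies enabled globally while turning them off for one handle whose dependencies the application manages itself. A minimal sketch, assuming an already-registered @code{scratch_handle}:

```c
/* Hedged sketch: sequential consistency stays on by default, but is
 * disabled for one scratch buffer with no meaningful ordering. */
starpu_data_set_default_sequential_consistency_flag(1); /* the default */
starpu_data_set_sequential_consistency_flag(scratch_handle, 0); /* no implicit deps */
```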
@node Performance Model API
@section Performance Model API

@menu
* starpu_load_history_debug::
* starpu_perfmodel_debugfilepath::
* starpu_perfmodel_get_arch_name::
* starpu_force_bus_sampling::
@end menu

@node starpu_load_history_debug
@subsection @code{starpu_load_history_debug}
@deftypefun int starpu_load_history_debug ({const char} *@var{symbol}, {struct starpu_perfmodel_t} *@var{model})
@end deftypefun

@node starpu_perfmodel_debugfilepath
@subsection @code{starpu_perfmodel_debugfilepath}
@deftypefun void starpu_perfmodel_debugfilepath ({struct starpu_perfmodel_t} *@var{model}, {enum starpu_perf_archtype} @var{arch}, char *@var{path}, size_t @var{maxlen})
@end deftypefun

@node starpu_perfmodel_get_arch_name
@subsection @code{starpu_perfmodel_get_arch_name}
@deftypefun void starpu_perfmodel_get_arch_name ({enum starpu_perf_archtype} @var{arch}, char *@var{archname}, size_t @var{maxlen})
@end deftypefun

@node starpu_force_bus_sampling
@subsection @code{starpu_force_bus_sampling}
@deftypefun void starpu_force_bus_sampling (void)
This forces sampling the bus performance model again.
@end deftypefun
@node Profiling API
@section Profiling API

@menu
* starpu_profiling_status_set:: starpu_profiling_status_set
* starpu_profiling_status_get:: starpu_profiling_status_get
* struct starpu_task_profiling_info:: task profiling information
* struct starpu_worker_profiling_info:: worker profiling information
* starpu_worker_get_profiling_info:: starpu_worker_get_profiling_info
* struct starpu_bus_profiling_info:: bus profiling information
* starpu_bus_get_count::
* starpu_bus_get_id::
* starpu_bus_get_src::
* starpu_bus_get_dst::
* starpu_timing_timespec_delay_us::
* starpu_timing_timespec_to_us::
* starpu_bus_profiling_helper_display_summary::
* starpu_worker_profiling_helper_display_summary::
@end menu
@node starpu_profiling_status_set
@subsection @code{starpu_profiling_status_set} -- Set current profiling status
@item @emph{Description}:
This function sets the profiling status. Profiling is activated by passing
@code{STARPU_PROFILING_ENABLE} in @code{status}. Passing
@code{STARPU_PROFILING_DISABLE} disables profiling. Calling this function
resets all profiling measurements. When profiling is enabled, the
@code{profiling_info} field of the @code{struct starpu_task} structure points
to a valid @code{struct starpu_task_profiling_info} structure containing
information about the execution of the task.
@item @emph{Return value}:
Negative return values indicate an error; otherwise the previous status is
returned.
@item @emph{Prototype}:
@code{int starpu_profiling_status_set(int status);}
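Combining this with the task structure described earlier, one can enable profiling and read back per-task timings as in the sketch below; @code{cl} is an assumed codelet.

```c
/* Hedged sketch: enable profiling, run one task synchronously, then
 * inspect its profiling_info field. */
starpu_profiling_status_set(STARPU_PROFILING_ENABLE);

struct starpu_task *task = starpu_task_create();
task->cl = &cl;       /* assumed codelet */
task->synchronous = 1;
task->destroy = 0;    /* keep the structure so we can inspect it */
starpu_task_submit(task);

struct starpu_task_profiling_info *info = task->profiling_info;
double us = starpu_timing_timespec_delay_us(&info->start_time, &info->end_time);
fprintf(stderr, "task ran for %.2f us on worker %d\n", us, info->workerid);
starpu_task_destroy(task);
```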
@node starpu_profiling_status_get
@subsection @code{starpu_profiling_status_get} -- Get current profiling status
@deftypefun int starpu_profiling_status_get (void)
Return the current profiling status, or a negative value in case there was an error.
@end deftypefun
@node struct starpu_task_profiling_info
@subsection @code{struct starpu_task_profiling_info} -- Task profiling information
@item @emph{Description}:
This structure contains information about the execution of a task. It is
accessible from the @code{.profiling_info} field of the @code{starpu_task}
structure if profiling was enabled.
@item @emph{Fields}:
@item @code{submit_time}:
Date of task submission (relative to the initialization of StarPU).
@item @code{start_time}:
Date of the beginning of task execution (relative to the initialization of StarPU).
@item @code{end_time}:
Date of task execution termination (relative to the initialization of StarPU).
@item @code{workerid}:
Identifier of the worker which has executed the task.
@node struct starpu_worker_profiling_info
4179
@subsection @code{struct starpu_worker_profiling_info} -- Worker profiling information
4181
@item @emph{Description}:
4182
This structure contains the profiling information associated to a worker.
4183
@item @emph{Fields}:
4185
@item @code{start_time}:
4186
Starting date for the reported profiling measurements.
4187
@item @code{total_time}:
4188
Duration of the profiling measurement interval.
4189
@item @code{executing_time}:
4190
Time spent by the worker to execute tasks during the profiling measurement interval.
4191
@item @code{sleeping_time}:
4192
Time spent idling by the worker during the profiling measurement interval.
4193
@item @code{executed_tasks}:
4194
Number of tasks executed by the worker during the profiling measurement interval.

@node starpu_worker_get_profiling_info
@subsection @code{starpu_worker_get_profiling_info} -- Get worker profiling info
@item @emph{Description}:
Get the profiling info associated to the worker identified by @code{workerid},
and reset the profiling measurements. If the @code{worker_info} argument is
@code{NULL}, only reset the counters associated to worker @code{workerid}.
@item @emph{Return value}:
Upon successful completion, this function returns 0. Otherwise, a negative
value is returned.
@item @emph{Prototype}:
@code{int starpu_worker_get_profiling_info(int workerid, struct starpu_worker_profiling_info *worker_info);}

@node struct starpu_bus_profiling_info
@subsection @code{struct starpu_bus_profiling_info} -- Bus profiling information
@item @emph{Description}:
This structure contains the profiling information associated to a bus.
@item @emph{Fields}:
@item @code{start_time}:
Starting date for the reported profiling measurements.
@item @code{total_time}:
Duration of the profiling measurement interval.
@item @code{transferred_bytes}:
Number of bytes transferred during the profiling measurement interval.
@item @code{transfer_count}:
Number of transfers during the profiling measurement interval.

@node starpu_bus_get_count
@subsection @code{starpu_bus_get_count}
@deftypefun int starpu_bus_get_count (void)
Return the number of buses profiled by StarPU.

@node starpu_bus_get_id
@subsection @code{starpu_bus_get_id}
@deftypefun int starpu_bus_get_id (int @var{src}, int @var{dst})
Return the identifier of the bus between @var{src} and @var{dst}.

@node starpu_bus_get_src
@subsection @code{starpu_bus_get_src}
@deftypefun int starpu_bus_get_src (int @var{busid})
Return the source memory node of bus @var{busid}.

@node starpu_bus_get_dst
@subsection @code{starpu_bus_get_dst}
@deftypefun int starpu_bus_get_dst (int @var{busid})
Return the destination memory node of bus @var{busid}.

@node starpu_timing_timespec_delay_us
@subsection @code{starpu_timing_timespec_delay_us}
@deftypefun double starpu_timing_timespec_delay_us ({struct timespec} *@var{start}, {struct timespec} *@var{end})
Return the time elapsed between @var{start} and @var{end}, in microseconds.

@node starpu_timing_timespec_to_us
@subsection @code{starpu_timing_timespec_to_us}
@deftypefun double starpu_timing_timespec_to_us ({struct timespec} *@var{ts})
Convert the timespec @var{ts} into a number of microseconds.

@node starpu_bus_profiling_helper_display_summary
@subsection @code{starpu_bus_profiling_helper_display_summary}
@deftypefun void starpu_bus_profiling_helper_display_summary (void)
Display a summary of the bus profiling information.

@node starpu_worker_profiling_helper_display_summary
@subsection @code{starpu_worker_profiling_helper_display_summary}
@deftypefun void starpu_worker_profiling_helper_display_summary (void)
Display a summary of the worker profiling information.

@node CUDA extensions
@section CUDA extensions

@c void starpu_malloc(float **A, size_t dim);

* starpu_cuda_get_local_stream:: Get current worker's CUDA stream
* starpu_helper_cublas_init:: Initialize CUBLAS on every CUDA device
* starpu_helper_cublas_shutdown:: Deinitialize CUBLAS on every CUDA device

@node starpu_cuda_get_local_stream
@subsection @code{starpu_cuda_get_local_stream} -- Get current worker's CUDA stream
@deftypefun {cudaStream_t *} starpu_cuda_get_local_stream (void)
StarPU provides a stream for every CUDA device controlled by StarPU. This
function is only provided for convenience so that programmers can easily use
asynchronous operations within codelets without having to create a stream by
hand. Note that the application is not forced to use the stream provided by
@code{starpu_cuda_get_local_stream} and may also create its own streams.
Synchronizing with @code{cudaThreadSynchronize()} is allowed, but will reduce
the likelihood of having all transfers overlapped.
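For instance, a CUDA codelet implementation can launch its work asynchronously on the per-worker stream and then wait for that stream only. This is a sketch, not compilable as-is: @code{my_kernel}, @code{nblocks}, and @code{nthreads} are made up:

```c
static void cuda_func(void *buffers[], void *cl_arg)
{
        /* ... extract data pointers and sizes from buffers[] ... */

        /* Launch on the stream provided by StarPU so that data transfers
         * performed by StarPU can overlap with the kernel execution. */
        my_kernel<<<nblocks, nthreads, 0, *starpu_cuda_get_local_stream()>>>(/* args */);

        /* Synchronize with this stream only, rather than the whole device. */
        cudaStreamSynchronize(*starpu_cuda_get_local_stream());
}
```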

@node starpu_helper_cublas_init
@subsection @code{starpu_helper_cublas_init} -- Initialize CUBLAS on every CUDA device
@deftypefun void starpu_helper_cublas_init (void)
The CUBLAS library must be initialized prior to any CUBLAS call. Calling
@code{starpu_helper_cublas_init} will initialize CUBLAS on every CUDA device
controlled by StarPU. This call blocks until CUBLAS has been properly
initialized on every device.

@node starpu_helper_cublas_shutdown
@subsection @code{starpu_helper_cublas_shutdown} -- Deinitialize CUBLAS on every CUDA device
@deftypefun void starpu_helper_cublas_shutdown (void)
This function synchronously deinitializes the CUBLAS library on every CUDA device.

@node OpenCL extensions
@section OpenCL extensions

* Enabling OpenCL:: Enabling OpenCL
* Compiling OpenCL kernels:: Compiling OpenCL kernels
* Loading OpenCL kernels:: Loading OpenCL kernels
* OpenCL statistics:: Collecting statistics from OpenCL

@node Enabling OpenCL
@subsection Enabling OpenCL

On GPU devices which can run both CUDA and OpenCL, CUDA will be
enabled by default. To enable OpenCL, you need to disable CUDA either
when configuring StarPU:

% ./configure --disable-cuda

or when running applications:

% STARPU_NCUDA=0 ./application

OpenCL will automatically be started on any device not yet used by
CUDA. On a machine with four GPUs, it is therefore possible to
enable CUDA on two devices, and OpenCL on the other two devices, by doing:

% STARPU_NCUDA=2 ./application

@node Compiling OpenCL kernels
@subsection Compiling OpenCL kernels

Source code for OpenCL kernels can be stored in a file or in a
string. StarPU provides functions to build the program executable for
each available OpenCL device as a @code{cl_program} object. This
program executable can then be loaded within a specific queue as
explained in the next section. These are only helpers; applications
can also fill a @code{starpu_opencl_program} array by hand for more advanced
uses (e.g. different programs on different OpenCL devices, for
relocation purposes).

* starpu_opencl_load_opencl_from_file:: Compiling OpenCL source code
* starpu_opencl_load_opencl_from_string:: Compiling OpenCL source code
* starpu_opencl_unload_opencl:: Releasing OpenCL code

@node starpu_opencl_load_opencl_from_file
@subsubsection @code{starpu_opencl_load_opencl_from_file} -- Compiling OpenCL source code
@deftypefun int starpu_opencl_load_opencl_from_file (char *@var{source_file_name}, {struct starpu_opencl_program} *@var{opencl_programs}, {const char}* @var{build_options})
Compile the OpenCL source code stored in the file @var{source_file_name}
for every OpenCL device, and store the resulting program executables in
@var{opencl_programs}. @var{build_options} is passed to the OpenCL compiler.

@node starpu_opencl_load_opencl_from_string
@subsubsection @code{starpu_opencl_load_opencl_from_string} -- Compiling OpenCL source code
@deftypefun int starpu_opencl_load_opencl_from_string (char *@var{opencl_program_source}, {struct starpu_opencl_program} *@var{opencl_programs}, {const char}* @var{build_options})
Compile the OpenCL source code stored in the string @var{opencl_program_source}
for every OpenCL device, and store the resulting program executables in
@var{opencl_programs}. @var{build_options} is passed to the OpenCL compiler.

@node starpu_opencl_unload_opencl
@subsubsection @code{starpu_opencl_unload_opencl} -- Releasing OpenCL code
@deftypefun int starpu_opencl_unload_opencl ({struct starpu_opencl_program} *@var{opencl_programs})
Release the program executables stored in @var{opencl_programs}.

@node Loading OpenCL kernels
@subsection Loading OpenCL kernels

* starpu_opencl_load_kernel:: Loading a kernel
* starpu_opencl_release_kernel:: Releasing a kernel

@node starpu_opencl_load_kernel
@subsubsection @code{starpu_opencl_load_kernel} -- Loading a kernel
@deftypefun int starpu_opencl_load_kernel (cl_kernel *@var{kernel}, cl_command_queue *@var{queue}, {struct starpu_opencl_program} *@var{opencl_programs}, char *@var{kernel_name}, int @var{devid})

@node starpu_opencl_release_kernel
@subsubsection @code{starpu_opencl_release_kernel} -- Releasing a kernel
@deftypefun int starpu_opencl_release_kernel (cl_kernel @var{kernel})
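Putting these helpers together, an application might compile, use, and release an OpenCL kernel as follows. This is a sketch, not compilable standalone: the file name, kernel name, and @code{opencl_func} are made up, and error checking is omitted:

```c
struct starpu_opencl_program programs;

/* At initialization time: build "my_kernel.cl" for every OpenCL device. */
starpu_opencl_load_opencl_from_file("my_kernel.cl", &programs, NULL);

/* Inside the OpenCL codelet implementation: */
void opencl_func(void *buffers[], void *cl_arg)
{
        cl_kernel kernel;
        cl_command_queue queue;
        int devid = starpu_worker_get_devid(starpu_worker_get_id());

        starpu_opencl_load_kernel(&kernel, &queue, &programs, "my_kernel", devid);
        /* ... clSetKernelArg() and clEnqueueNDRangeKernel() on `queue` ... */
        starpu_opencl_release_kernel(kernel);
}

/* At termination time: */
starpu_opencl_unload_opencl(&programs);
```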

@node OpenCL statistics
@subsection OpenCL statistics

* starpu_opencl_collect_stats:: Collect statistics on a kernel execution

@node starpu_opencl_collect_stats
@subsubsection @code{starpu_opencl_collect_stats} -- Collect statistics on a kernel execution
@deftypefun int starpu_opencl_collect_stats (cl_event @var{event})
After termination of the kernels, the OpenCL codelet should call this function
and pass it the event returned by @code{clEnqueueNDRangeKernel}, to let StarPU
collect statistics about the kernel execution (cycles used, power consumed).

@node Cell extensions
@section Cell extensions

@node Miscellaneous helpers
@section Miscellaneous helpers

* starpu_data_cpy:: Copy a data handle into another data handle
* starpu_execute_on_each_worker:: Execute a function on a subset of workers

@node starpu_data_cpy
@subsection @code{starpu_data_cpy} -- Copy a data handle into another data handle
@deftypefun int starpu_data_cpy (starpu_data_handle @var{dst_handle}, starpu_data_handle @var{src_handle}, int @var{asynchronous}, void (*@var{callback_func})(void*), void *@var{callback_arg})
Copy the content of @var{src_handle} into @var{dst_handle}.
The @var{asynchronous} parameter indicates whether the function should
block or not. In the case of an asynchronous call, it is possible to
synchronize with the termination of this operation either by means of
implicit dependencies (if enabled) or by calling
@code{starpu_task_wait_for_all()}. If @var{callback_func} is not @code{NULL},
this callback function is executed after the handle has been copied, and it is
given the @var{callback_arg} pointer as argument.
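For example, an asynchronous copy with a completion callback might look like this (a sketch; @code{copy_done} and the two handles are assumed to exist):

```c
static void copy_done(void *arg)
{
        fprintf(stderr, "copy finished (%s)\n", (const char *)arg);
}

/* Start the copy without blocking; copy_done runs once it completes. */
starpu_data_cpy(dst_handle, src_handle, 1 /* asynchronous */, copy_done, "vector A");

/* ... other work ... */
starpu_task_wait_for_all(); /* or rely on implicit dependencies */
```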

@node starpu_execute_on_each_worker
@subsection @code{starpu_execute_on_each_worker} -- Execute a function on a subset of workers
@deftypefun void starpu_execute_on_each_worker (void (*@var{func})(void *), void *@var{arg}, uint32_t @var{where})
When calling this method, the offloaded function specified by the first argument is
executed by every StarPU worker that may execute the function.
The second argument is passed to the offloaded function.
The last argument specifies on which types of processing units the function
should be executed. Similarly to the @var{where} field of the
@code{starpu_codelet} structure, it is possible to specify that the function
should be executed on every CUDA device and every CPU by passing
@code{STARPU_CPU|STARPU_CUDA}.
This function blocks until the function has been executed on every appropriate
processing unit, so it may not be called from a callback function, for
instance.
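This is typically useful for per-worker initialization, e.g. (a sketch; @code{init_worker} is a made-up function):

```c
static void init_worker(void *arg)
{
        /* Executed once by every CPU worker and every CUDA worker. */
        fprintf(stderr, "initializing worker %d\n", starpu_worker_get_id());
}

/* Blocks until init_worker has run on every selected worker; do not
 * call this from a task callback. */
starpu_execute_on_each_worker(init_worker, NULL, STARPU_CPU|STARPU_CUDA);
```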

@c ---------------------------------------------------------------------
@c Advanced Topics
@c ---------------------------------------------------------------------

@node Advanced Topics
@chapter Advanced Topics

* Defining a new data interface::
* Defining a new scheduling policy::

@node Defining a new data interface
@section Defining a new data interface

* struct starpu_data_interface_ops_t:: Per-interface methods
* struct starpu_data_copy_methods:: Per-interface data transfer methods
* An example of data interface:: An example of data interface

@c void *starpu_data_get_interface_on_node(starpu_data_handle handle, unsigned memory_node); TODO

@node struct starpu_data_interface_ops_t
@subsection @code{struct starpu_data_interface_ops_t} -- Per-interface methods
@item @emph{Description}:
TODO describe all the different fields

@node struct starpu_data_copy_methods
@subsection @code{struct starpu_data_copy_methods} -- Per-interface data transfer methods
@item @emph{Description}:
TODO describe all the different fields

@node An example of data interface
@subsection An example of data interface

See @code{src/datawizard/interfaces/vector_interface.c} for now.

@node Defining a new scheduling policy
@section Defining a new scheduling policy

A full example showing how to define a new scheduling policy is available in
the StarPU sources in the directory @code{examples/scheduler/}.

* struct starpu_sched_policy_s::
* starpu_worker_set_sched_condition::
* starpu_sched_set_min_priority:: Set the minimum priority level
* starpu_sched_set_max_priority:: Set the maximum priority level
* starpu_push_local_task:: Assign a task to a worker

@node struct starpu_sched_policy_s
@subsection @code{struct starpu_sched_policy_s} -- Scheduler methods
@item @emph{Description}:
This structure contains all the methods that implement a scheduling policy. An
application may specify which scheduling strategy to use in the @code{sched_policy}
field of the @code{starpu_conf} structure passed to the @code{starpu_init}
function.
@item @emph{Fields}:
@item @code{init_sched}:
Initialize the scheduling policy.
@item @code{deinit_sched}:
Cleanup the scheduling policy.
@item @code{push_task}:
Insert a task into the scheduler.
@item @code{push_prio_task}:
Insert a priority task into the scheduler.
@item @code{push_prio_notify}:
Notify the scheduler that a task was pushed on the worker. This method is
called when a task that was explicitly assigned to a worker is scheduled. It
therefore makes it possible to keep the state of the scheduler coherent even
when StarPU bypasses the scheduling strategy.
@item @code{pop_task}:
Get a task from the scheduler. The mutex associated to the worker is already
taken when this method is called. If this method is defined as @code{NULL}, the
worker will only execute tasks from its local queue. In this case, the
@code{push_task} method should use the @code{starpu_push_local_task} method to
assign tasks to the different workers.
@item @code{pop_every_task}:
Remove all available tasks from the scheduler (tasks are chained by means of
the @code{prev} and @code{next} fields of the @code{starpu_task} structure).
The mutex associated to the worker is already taken when this method is called.
@item @code{post_exec_hook} (optional):
This method is called every time a task has been executed.
@item @code{policy_name}:
Name of the policy (optional).
@item @code{policy_description}:
Description of the policy (optional).

@node starpu_worker_set_sched_condition
@subsection @code{starpu_worker_set_sched_condition} -- Specify the condition variable associated to a worker
@deftypefun void starpu_worker_set_sched_condition (int @var{workerid}, pthread_cond_t *@var{sched_cond}, pthread_mutex_t *@var{sched_mutex})
When there is no available task for a worker, StarPU blocks this worker on a
condition variable. This function specifies which condition variable (and the
associated mutex) should be used to block (and to wake up) a worker. Note that
multiple workers may use the same condition variable. For instance, in the case
of a scheduling strategy with a single task queue, the same condition variable
would be used to block and wake up all workers.
The initialization method of a scheduling strategy (@code{init_sched}) must
call this function once per worker.
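A single condition variable shared by all workers, as described above, could be registered like this (a sketch; it assumes the StarPU development headers and is not compilable standalone):

```c
static pthread_cond_t sched_cond = PTHREAD_COND_INITIALIZER;
static pthread_mutex_t sched_mutex = PTHREAD_MUTEX_INITIALIZER;

static void init_sched_example(void)
{
        unsigned workerid;
        /* Every worker blocks on, and is woken up through, the same pair. */
        for (workerid = 0; workerid < starpu_worker_get_count(); workerid++)
                starpu_worker_set_sched_condition(workerid, &sched_cond, &sched_mutex);
}
```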

@node starpu_sched_set_min_priority
@subsection @code{starpu_sched_set_min_priority}
@deftypefun void starpu_sched_set_min_priority (int @var{min_prio})
Defines the minimum priority level supported by the scheduling policy. The
default minimum priority level is the same as the default priority level, which
is 0 by convention. The application may access that value by calling the
@code{starpu_sched_get_min_priority} function. This function should only be
called from the initialization method of the scheduling policy, and should not
be used directly from the application.

@node starpu_sched_set_max_priority
@subsection @code{starpu_sched_set_max_priority}
@deftypefun void starpu_sched_set_max_priority (int @var{max_prio})
Defines the maximum priority level supported by the scheduling policy. The
default maximum priority level is 1. The application may access that value by
calling the @code{starpu_sched_get_max_priority} function. This function should
only be called from the initialization method of the scheduling policy, and
should not be used directly from the application.

@node starpu_push_local_task
@subsection @code{starpu_push_local_task}
@deftypefun int starpu_push_local_task (int @var{workerid}, {struct starpu_task} *@var{task}, int @var{back})
The scheduling policy may put tasks directly into a worker's local queue so
that it is not always necessary to create its own queue when the local queue
is sufficient. If @var{back} is not null, the task is put at the back of the
queue, where the worker will pop tasks first. Setting @var{back} to 0 therefore
ensures a FIFO ordering.

@subsection Source code

static struct starpu_sched_policy_s dummy_sched_policy = @{
        .init_sched = init_dummy_sched,
        .deinit_sched = deinit_dummy_sched,
        .push_task = push_task_dummy,
        .push_prio_task = NULL,
        .pop_task = pop_task_dummy,
        .post_exec_hook = NULL,
        .pop_every_task = NULL,
        .policy_name = "dummy",
        .policy_description = "dummy scheduling strategy"
@};

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@c ---------------------------------------------------------------------
@c Full source code for the 'Scaling a Vector' example
@c ---------------------------------------------------------------------

@node Full source code for the 'Scaling a Vector' example
@appendix Full source code for the 'Scaling a Vector' example

* Main application::

@node Main application
@section Main application

@include vector_scal_c.texi

@include vector_scal_cpu.texi

@section CUDA Kernel

@include vector_scal_cuda.texi

@section OpenCL Kernel

* Invoking the kernel::
* Source of the kernel::

@node Invoking the kernel
@subsection Invoking the kernel

@include vector_scal_opencl.texi

@node Source of the kernel
@subsection Source of the kernel

@include vector_scal_opencl_codelet.texi

@node Function Index
@unnumbered Function Index