~ubuntu-branches/ubuntu/quantal/starpu-contrib/quantal : revision 2

1

\input texinfo @c -*-texinfo-*-

2

3

@c %**start of header

4

@setfilename starpu.info

5

@settitle StarPU Handbook

6

@c %**end of header

7

8

@include version.texi

9

10

@setchapternewpage odd

11

12

@titlepage

13

@title StarPU Handbook

14

@subtitle for StarPU @value{VERSION}

15

@page

16

@vskip 0pt plus 1fill

17

@comment For the @value{version-GCC} Version*

18

@end titlepage

19

20

@c @summarycontents

21

@contents

22

@page

23

24

@node Top

25

@top Preface

26

@cindex Preface

27

28

This manual documents the usage of StarPU version @value{VERSION}. It

29

was last updated on @value{UPDATED}.

30

31

32

@comment

33

@comment When you add a new menu item, please keep the right hand

34

@comment aligned to the same column. Do not use tabs. This provides

35

@comment better formatting.

36

@comment

37

@menu

38

* Introduction:: A basic introduction to using StarPU

39

* Installing StarPU:: How to configure, build and install StarPU

40

* Using StarPU:: How to run StarPU application

41

* Basic Examples:: Basic examples of the use of StarPU

42

* Performance optimization:: How to optimize performance with StarPU

43

* Performance feedback:: Performance debugging tools

44

* StarPU MPI support:: How to combine StarPU with MPI

45

* Configuring StarPU:: How to configure StarPU

46

* StarPU API:: The API to use StarPU

47

* Advanced Topics:: Advanced use of StarPU

48

* Full source code for the 'Scaling a Vector' example::

49

50

* Function Index:: Index of C functions.

51

@end menu

52

53

@c ---------------------------------------------------------------------

54

@c Introduction to StarPU

55

@c ---------------------------------------------------------------------

56

57

@node Introduction

58

@chapter Introduction to StarPU

59

60

@menu

61

* Motivation:: Why StarPU ?

62

* StarPU in a Nutshell:: The Fundamentals of StarPU

63

@end menu

64

65

@node Motivation

66

@section Motivation

67

68

@c complex machines with heterogeneous cores/devices

69

The use of specialized hardware such as accelerators or coprocessors offers an

70

interesting approach to overcome the physical limits encountered by processor

71

architects. As a result, many machines are now equipped with one or several

72

accelerators (e.g. a GPU), in addition to the usual processor(s). While a lot of

73

efforts have been devoted to offload computation onto such accelerators, very

74

little attention as been paid to portability concerns on the one hand, and to the

75

possibility of having heterogeneous accelerators and processors to interact on the other hand.

76

77

StarPU is a runtime system that offers support for heterogeneous multicore

78

architectures, it not only offers a unified view of the computational resources

79

(i.e. CPUs and accelerators at the same time), but it also takes care of

80

efficiently mapping and executing tasks onto an heterogeneous machine while

81

transparently handling low-level issues such as data transfers in a portable

82

fashion.

83

84

@c this leads to a complicated distributed memory design

85

@c which is not (easily) manageable by hand

86

87

@c added value/benefits of StarPU

88

@c - portability

89

@c - scheduling, perf. portability

90

91

@node StarPU in a Nutshell

92

@section StarPU in a Nutshell

93

94

@menu

95

* Codelet and Tasks::

96

* StarPU Data Management Library::

97

* Glossary::

98

* Research Papers::

99

@end menu

100

101

From a programming point of view, StarPU is not a new language but a library

102

that executes tasks explicitly submitted by the application. The data that a

103

task manipulates are automatically transferred onto the accelerator so that the

104

programmer does not have to take care of complex data movements. StarPU also

105

takes particular care of scheduling those tasks efficiently and allows

106

scheduling experts to implement custom scheduling policies in a portable

107

fashion.

108

109

@c explain the notion of codelet and task (i.e. g(A, B)

110

@node Codelet and Tasks

111

@subsection Codelet and Tasks

112

113

One of the StarPU primary data structures is the @b{codelet}. A codelet describes a

114

computational kernel that can possibly be implemented on multiple architectures

115

such as a CPU, a CUDA device or a Cell's SPU.

116

117

@c TODO insert illustration f : f_spu, f_cpu, ...

118

119

Another important data structure is the @b{task}. Executing a StarPU task

120

consists in applying a codelet on a data set, on one of the architectures on

121

which the codelet is implemented. A task thus describes the codelet that it

122

uses, but also which data are accessed, and how they are

123

accessed during the computation (read and/or write).

124

StarPU tasks are asynchronous: submitting a task to StarPU is a non-blocking

125

operation. The task structure can also specify a @b{callback} function that is

126

called once StarPU has properly executed the task. It also contains optional

127

fields that the application may use to give hints to the scheduler (such as

128

priority levels).

129

130

By default, task dependencies are inferred from data dependency (sequential

131

coherence) by StarPU. The application can however disable sequential coherency

132

for some data, and dependencies be expressed by hand.

133

A task may be identified by a unique 64-bit number chosen by the application

134

which we refer as a @b{tag}.

135

Task dependencies can be enforced by hand either by the means of callback functions, by

136

submitting other tasks, or by expressing dependencies

137

between tags (which can thus correspond to tasks that have not been submitted

138

yet).

139

140

@c TODO insert illustration f(Ar, Brw, Cr) + ..

141

142

@c DSM

143

@node StarPU Data Management Library

144

@subsection StarPU Data Management Library

145

146

Because StarPU schedules tasks at runtime, data transfers have to be

147

done automatically and ``just-in-time'' between processing units,

148

relieving the application programmer from explicit data transfers.

149

Moreover, to avoid unnecessary transfers, StarPU keeps data

150

where it was last needed, even if was modified there, and it

151

allows multiple copies of the same data to reside at the same time on

152

several processing units as long as it is not modified.

153

154

@node Glossary

155

@subsection Glossary

156

157

A @b{codelet} records pointers to various implementations of the same

158

theoretical function.

159

160

A @b{memory node} can be either the main RAM or GPU-embedded memory.

161

162

A @b{bus} is a link between memory nodes.

163

164

A @b{data handle} keeps track of replicates of the same data (@b{registered} by the

165

application) over various memory nodes. The data management library manages

166

keeping them coherent.

167

168

The @b{home} memory node of a data handle is the memory node from which the data

169

was registered (usually the main memory node).

170

171

A @b{task} represents a scheduled execution of a codelet on some data handles.

172

173

A @b{tag} is a rendez-vous point. Tasks typically have their own tag, and can

174

depend on other tags. The value is chosen by the application.

175

176

A @b{worker} execute tasks. There is typically one per CPU computation core and

177

one per accelerator (for which a whole CPU core is dedicated).

178

179

A @b{driver} drives a given kind of workers. There are currently CPU, CUDA,

180

OpenCL and Gordon drivers. They usually start several workers to actually drive

181

them.

182

183

A @b{performance model} is a (dynamic or static) model of the performance of a

184

given codelet. Codelets can have execution time performance model as well as

185

power consumption performance models.

186

187

A data @b{interface} describes the layout of the data: for a vector, a pointer

188

for the start, the number of elements and the size of elements ; for a matrix, a

189

pointer for the start, the number of elements per row, the offset between rows,

190

and the size of each element ; etc. To access their data, codelet functions are

191

given interfaces for the local memory node replicates of the data handles of the

192

scheduled task.

193

194

@b{Partitioning} data means dividing the data of a given data handle (called

195

@b{father}) into a series of @b{children} data handles which designate various

196

portions of the former.

197

198

A @b{filter} is the function which computes children data handles from a father

199

data handle, and thus describes how the partitioning should be done (horizontal,

200

vertical, etc.)

201

202

@b{Acquiring} a data handle can be done from the main application, to safely

203

access the data of a data handle from its home node, without having to

204

unregister it.

205

206

207

@node Research Papers

208

@subsection Research Papers

209

210

Research papers about StarPU can be found at

211

212

@indicateurl{http://runtime.bordeaux.inria.fr/Publis/Keyword/STARPU.html}

213

214

Notably a good overview in the research report

215

216

@indicateurl{http://hal.archives-ouvertes.fr/inria-00467677}

217

218

@c ---------------------------------------------------------------------

219

@c Installing StarPU

220

@c ---------------------------------------------------------------------

221

222

@node Installing StarPU

223

@chapter Installing StarPU

224

225

@menu

226

* Downloading StarPU::

227

* Configuration of StarPU::

228

* Building and Installing StarPU::

229

@end menu

230

231

StarPU can be built and installed by the standard means of the GNU

232

autotools. The following chapter is intended to briefly remind how these tools

233

can be used to install StarPU.

234

235

@node Downloading StarPU

236

@section Downloading StarPU

237

238

@menu

239

* Getting Sources::

240

* Optional dependencies::

241

@end menu

242

243

@node Getting Sources

244

@subsection Getting Sources

245

246

The simplest way to get StarPU sources is to download the latest official

247

release tarball from @indicateurl{https://gforge.inria.fr/frs/?group_id=1570} ,

248

or the latest nightly snapshot from

249

@indicateurl{http://starpu.gforge.inria.fr/testing/} . The following documents

250

how to get the very latest version from the subversion repository itself, it

251

should be needed only if you need the very latest changes (i.e. less than a

252

day!)

253

254

The source code is managed by a Subversion server hosted by the

255

InriaGforge. To get the source code, you need:

256

257

@itemize

258

@item

259

To install the client side of the software Subversion if it is

260

not already available on your system. The software can be obtained from

261

@indicateurl{http://subversion.tigris.org} . If you are running

262

on Windows, you will probably prefer to use TortoiseSVN from

263

@indicateurl{http://tortoisesvn.tigris.org/} .

264

265

@item

266

You can check out the project's SVN repository through anonymous

267

access. This will provide you with a read access to the

268

repository.

269

270

If you need to have write access on the StarPU project, you can also choose to

271

become a member of the project @code{starpu}. For this, you first need to get

272

an account to the gForge server. You can then send a request to join the project

273

(@indicateurl{https://gforge.inria.fr/project/request.php?group_id=1570}).

274

275

@item

276

More information on how to get a gForge account, to become a member of

277

a project, or on any other related task can be obtained from the

278

InriaGforge at @indicateurl{https://gforge.inria.fr/}. The most important

279

thing is to upload your public SSH key on the gForge server (see the

280

FAQ at @indicateurl{http://siteadmin.gforge.inria.fr/FAQ.html#Q6} for

281

instructions).

282

@end itemize

283

284

You can now check out the latest version from the Subversion server:

285

@itemize

286

@item

287

using the anonymous access via svn:

288

@example

289

% svn checkout svn://scm.gforge.inria.fr/svn/starpu/trunk

290

@end example

291

@item

292

using the anonymous access via https:

293

@example

294

% svn checkout --username anonsvn https://scm.gforge.inria.fr/svn/starpu/trunk

295

@end example

296

The password is @code{anonsvn}.

297

@item

298

using your gForge account

299

@example

300

% svn checkout svn+ssh://<login>@@scm.gforge.inria.fr/svn/starpu/trunk

301

@end example

302

@end itemize

303

304

The following step requires the availability of @code{autoconf} and

305

@code{automake} to generate the @code{./configure} script. This is

306

done by calling @code{./autogen.sh}. The required version for

307

@code{autoconf} is 2.60 or higher. You will also need @code{makeinfo}.

308

309

@example

310

% ./autogen.sh

311

@end example

312

313

If the autotools are not available on your machine or not recent

314

enough, you can choose to download the latest nightly tarball, which

315

is provided with a @code{configure} script.

316

317

@example

318

% wget http://starpu.gforge.inria.fr/testing/starpu-nightly-latest.tar.gz

319

@end example

320

321

@node Optional dependencies

322

@subsection Optional dependencies

323

324

The topology discovery library, @code{hwloc}, is not mandatory to use StarPU

325

but strongly recommended. It allows to increase performance, and to

326

perform some topology aware scheduling.

327

328

@code{hwloc} is available in major distributions and for most OSes and can be

329

downloaded from @indicateurl{http://www.open-mpi.org/software/hwloc}.

330

331

@node Configuration of StarPU

332

@section Configuration of StarPU

333

334

@menu

335

* Generating Makefiles and configuration scripts::

336

* Running the configuration::

337

@end menu

338

339

@node Generating Makefiles and configuration scripts

340

@subsection Generating Makefiles and configuration scripts

341

342

This step is not necessary when using the tarball releases of StarPU. If you

343

are using the source code from the svn repository, you first need to generate

344

the configure scripts and the Makefiles.

345

346

@example

347

% ./autogen.sh

348

@end example

349

350

@node Running the configuration

351

@subsection Running the configuration

352

353

@example

354

% ./configure

355

@end example

356

357

Details about options that are useful to give to @code{./configure} are given in

358

@ref{Compilation configuration}.

359

360

@node Building and Installing StarPU

361

@section Building and Installing StarPU

362

363

@menu

364

* Building::

365

* Sanity Checks::

366

* Installing::

367

@end menu

368

369

@node Building

370

@subsection Building

371

372

@example

373

% make

374

@end example

375

376

@node Sanity Checks

377

@subsection Sanity Checks

378

379

In order to make sure that StarPU is working properly on the system, it is also

380

possible to run a test suite.

381

382

@example

383

% make check

384

@end example

385

386

@node Installing

387

@subsection Installing

388

389

In order to install StarPU at the location that was specified during

390

configuration:

391

392

@example

393

% make install

394

@end example

395

396

@c ---------------------------------------------------------------------

397

@c Using StarPU

398

@c ---------------------------------------------------------------------

399

400

@node Using StarPU

401

@chapter Using StarPU

402

403

@menu

404

* Setting flags for compiling and linking applications::

405

* Running a basic StarPU application::

406

* Kernel threads started by StarPU::

407

* Using accelerators::

408

@end menu

409

410

@node Setting flags for compiling and linking applications

411

@section Setting flags for compiling and linking applications

412

413

Compiling and linking an application against StarPU may require to use

414

specific flags or libraries (for instance @code{CUDA} or @code{libspe2}).

415

To this end, it is possible to use the @code{pkg-config} tool.

416

417

If StarPU was not installed at some standard location, the path of StarPU's

418

library must be specified in the @code{PKG_CONFIG_PATH} environment variable so

419

that @code{pkg-config} can find it. For example if StarPU was installed in

420

@code{$prefix_dir}:

421

422

@example

423

% PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$prefix_dir/lib/pkgconfig

424

@end example

425

426

The flags required to compile or link against StarPU are then

427

accessible with the following commands:

428

429

@example

430

% pkg-config --cflags libstarpu # options for the compiler

431

% pkg-config --libs libstarpu # options for the linker

432

@end example

433

434

@node Running a basic StarPU application

435

@section Running a basic StarPU application

436

437

Basic examples using StarPU have been built in the directory

438

@code{$prefix_dir/lib/starpu/examples/}. You can for example run the

439

example @code{vector_scal}.

440

441

@example

442

% $prefix_dir/lib/starpu/examples/vector_scal

443

BEFORE : First element was 1.000000

444

AFTER First element is 3.140000

445

%

446

@end example

447

448

When StarPU is used for the first time, the directory

449

@code{$HOME/.starpu/} is created, performance models will be stored in

450

that directory.

451

452

Please note that buses are benchmarked when StarPU is launched for the

453

first time. This may take a few minutes, or less if @code{hwloc} is

454

installed. This step is done only once per user and per machine.

455

456

@node Kernel threads started by StarPU

457

@section Kernel threads started by StarPU

458

459

TODO: StarPU starts one thread per CPU core and binds them there, uses one of

460

them per GPU. The application is not supposed to do computations in its own

461

threads. TODO: add a StarPU function to bind an application thread (e.g. the

462

main thread) to a dedicated core (and thus disable the corresponding StarPU CPU

463

worker).

464

465

@node Using accelerators

466

@section Using accelerators

467

468

When both CUDA and OpenCL drivers are enabled, StarPU will launch an

469

OpenCL worker for NVIDIA GPUs only if CUDA is not already running on them.

470

This design choice was necessary as OpenCL and CUDA can not run at the

471

same time on the same NVIDIA GPU, as there is currently no interoperability

472

between them.

473

474

Details on how to specify devices running OpenCL and the ones running

475

CUDA are given in @ref{Enabling OpenCL}.

476

477

478

@c ---------------------------------------------------------------------

479

@c Basic Examples

480

@c ---------------------------------------------------------------------

481

482

@node Basic Examples

483

@chapter Basic Examples

484

485

@menu

486

* Compiling and linking options::

487

* Hello World:: Submitting Tasks

488

* Scaling a Vector:: Manipulating Data

489

* Vector Scaling on an Hybrid CPU/GPU Machine:: Handling Heterogeneous Architectures

490

* Task and Worker Profiling::

491

* Partitioning Data:: Partitioning Data

492

* Performance model example::

493

* Theoretical lower bound on execution time::

494

* Insert Task Utility::

495

* More examples:: More examples shipped with StarPU

496

* Debugging:: When things go wrong.

497

@end menu

498

499

@node Compiling and linking options

500

@section Compiling and linking options

501

502

Let's suppose StarPU has been installed in the directory

503

@code{$STARPU_DIR}. As explained in @ref{Setting flags for compiling and linking applications},

504

the variable @code{PKG_CONFIG_PATH} needs to be set. It is also

505

necessary to set the variable @code{LD_LIBRARY_PATH} to locate dynamic

506

libraries at runtime.

507

508

@example

509

% PKG_CONFIG_PATH=$STARPU_DIR/lib/pkgconfig:$PKG_CONFIG_PATH

510

% LD_LIBRARY_PATH=$STARPU_DIR/lib:$LD_LIBRARY_PATH

511

@end example

512

513

The Makefile could for instance contain the following lines to define which

514

options must be given to the compiler and to the linker:

515

516

@cartouche

517

@example

518

CFLAGS += $$(pkg-config --cflags libstarpu)

519

LDFLAGS += $$(pkg-config --libs libstarpu)

520

@end example

521

@end cartouche

522

523

@node Hello World

524

@section Hello World

525

526

@menu

527

* Required Headers::

528

* Defining a Codelet::

529

* Submitting a Task::

530

* Execution of Hello World::

531

@end menu

532

533

In this section, we show how to implement a simple program that submits a task to StarPU.

534

535

@node Required Headers

536

@subsection Required Headers

537

538

The @code{starpu.h} header should be included in any code using StarPU.

539

540

@cartouche

541

@smallexample

542

#include <starpu.h>

543

@end smallexample

544

@end cartouche

545

546

547

@node Defining a Codelet

548

@subsection Defining a Codelet

549

550

@cartouche

551

@smallexample

552

struct params @{

553

int i;

554

float f;

555

@};

556

void cpu_func(void *buffers[], void *cl_arg)

557

@{

558

struct params *params = cl_arg;

559

560

printf("Hello world (params = @{%i, %f@} )\n", params->i, params->f);

561

@}

562

563

starpu_codelet cl =

564

@{

565

.where = STARPU_CPU,

566

.cpu_func = cpu_func,

567

.nbuffers = 0

568

@};

569

@end smallexample

570

@end cartouche

571

572

A codelet is a structure that represents a computational kernel. Such a codelet

573

may contain an implementation of the same kernel on different architectures

574

(e.g. CUDA, Cell's SPU, x86, ...).

575

576

The @code{nbuffers} field specifies the number of data buffers that are

577

manipulated by the codelet: here the codelet does not access or modify any data

578

that is controlled by our data management library. Note that the argument

579

passed to the codelet (the @code{cl_arg} field of the @code{starpu_task}

580

structure) does not count as a buffer since it is not managed by our data

581

management library, but just contain trivial parameters.

582

583

@c TODO need a crossref to the proper description of "where" see bla for more ...

584

We create a codelet which may only be executed on the CPUs. The @code{where}

585

field is a bitmask that defines where the codelet may be executed. Here, the

586

@code{STARPU_CPU} value means that only CPUs can execute this codelet

587

(@pxref{Codelets and Tasks} for more details on this field).

588

When a CPU core executes a codelet, it calls the @code{cpu_func} function,

589

which @emph{must} have the following prototype:

590

591

@code{void (*cpu_func)(void *buffers[], void *cl_arg);}

592

593

In this example, we can ignore the first argument of this function which gives a

594

description of the input and output buffers (e.g. the size and the location of

595

the matrices) since there is none.

596

The second argument is a pointer to a buffer passed as an

597

argument to the codelet by the means of the @code{cl_arg} field of the

598

@code{starpu_task} structure.

599

600

@c TODO rewrite so that it is a little clearer ?

601

Be aware that this may be a pointer to a

602

@emph{copy} of the actual buffer, and not the pointer given by the programmer:

603

if the codelet modifies this buffer, there is no guarantee that the initial

604

buffer will be modified as well: this for instance implies that the buffer

605

cannot be used as a synchronization medium. If synchronization is needed, data

606

has to be registered to StarPU, see @ref{Scaling a Vector}.

607

608

@node Submitting a Task

609

@subsection Submitting a Task

610

611

@cartouche

612

@smallexample

613

void callback_func(void *callback_arg)

614

@{

615

printf("Callback function (arg %x)\n", callback_arg);

616

@}

617

618

int main(int argc, char **argv)

619

@{

620

/* @b{initialize StarPU} */

621

starpu_init(NULL);

622

623

struct starpu_task *task = starpu_task_create();

624

625

task->cl = &cl; /* @b{Pointer to the codelet defined above} */

626

627

struct params params = @{ 1, 2.0f @};

628

task->cl_arg = &params;

629

task->cl_arg_size = sizeof(params);

630

631

task->callback_func = callback_func;

632

task->callback_arg = 0x42;

633

634

/* @b{starpu_task_submit will be a blocking call} */

635

task->synchronous = 1;

636

637

/* @b{submit the task to StarPU} */

638

starpu_task_submit(task);

639

640

/* @b{terminate StarPU} */

641

starpu_shutdown();

642

643

return 0;

644

@}

645

@end smallexample

646

@end cartouche

647

648

Before submitting any tasks to StarPU, @code{starpu_init} must be called. The

649

@code{NULL} argument specifies that we use default configuration. Tasks cannot

650

be submitted after the termination of StarPU by a call to

651

@code{starpu_shutdown}.

652

653

In the example above, a task structure is allocated by a call to

654

@code{starpu_task_create}. This function only allocates and fills the

655

corresponding structure with the default settings (@pxref{Codelets and

656

Tasks, starpu_task_create}), but it does not submit the task to StarPU.

657

658

@c not really clear ;)

659

The @code{cl} field is a pointer to the codelet which the task will

660

execute: in other words, the codelet structure describes which computational

661

kernel should be offloaded on the different architectures, and the task

662

structure is a wrapper containing a codelet and the piece of data on which the

663

codelet should operate.

664

665

The optional @code{cl_arg} field is a pointer to a buffer (of size

666

@code{cl_arg_size}) with some parameters for the kernel

667

described by the codelet. For instance, if a codelet implements a computational

668

kernel that multiplies its input vector by a constant, the constant could be

669

specified by the means of this buffer, instead of registering it as a StarPU

670

data. It must however be noted that StarPU avoids making copy whenever possible

671

and rather passes the pointer as such, so the buffer which is pointed at must

672

kept allocated until the task terminates, and if several tasks are submitted

673

with various parameters, each of them must be given a pointer to their own

674

buffer.

675

676

Once a task has been executed, an optional callback function is be called.

677

While the computational kernel could be offloaded on various architectures, the

678

callback function is always executed on a CPU. The @code{callback_arg}

679

pointer is passed as an argument of the callback. The prototype of a callback

680

function must be:

681

682

@code{void (*callback_function)(void *);}

683

684

If the @code{synchronous} field is non-zero, task submission will be

685

synchronous: the @code{starpu_task_submit} function will not return until the

686

task was executed. Note that the @code{starpu_shutdown} method does not

687

guarantee that asynchronous tasks have been executed before it returns,

688

@code{starpu_task_wait_for_all} can be used to that effect, or data can be

689

unregistered (@code{starpu_data_unregister(vector_handle);}), which will

690

implicitly wait for all the tasks scheduled to work on it, unless explicitly

691

disabled thanks to @code{starpu_data_set_default_sequential_consistency_flag} or

692

@code{starpu_data_set_sequential_consistency_flag}.

693

694

@node Execution of Hello World

695

@subsection Execution of Hello World

696

697

@smallexample

698

% make hello_world

699

cc $(pkg-config --cflags libstarpu) $(pkg-config --libs libstarpu) hello_world.c -o hello_world

700

% ./hello_world

701

Hello world (params = @{1, 2.000000@} )

702

Callback function (arg 42)

703

@end smallexample

704

705

@node Scaling a Vector

706

@section Manipulating Data: Scaling a Vector

707

708

The previous example has shown how to submit tasks. In this section,

709

we show how StarPU tasks can manipulate data. The full source code for

710

this example is given in @ref{Full source code for the 'Scaling a Vector' example}.

711

712

@menu

713

* Source code of Vector Scaling::

714

* Execution of Vector Scaling::

715

@end menu

716

717

@node Source code of Vector Scaling

718

@subsection Source code of Vector Scaling

719

720

Programmers can describe the data layout of their application so that StarPU is

721

responsible for enforcing data coherency and availability across the machine.

722

Instead of handling complex (and non-portable) mechanisms to perform data

723

movements, programmers only declare which piece of data is accessed and/or

724

modified by a task, and StarPU makes sure that when a computational kernel

725

starts somewhere (e.g. on a GPU), its data are available locally.

726

727

Before submitting those tasks, the programmer first needs to declare the

728

different pieces of data to StarPU using the @code{starpu_*_data_register}

729

functions. To ease the development of applications for StarPU, it is possible

730

to describe multiple types of data layout. A type of data layout is called an

731

@b{interface}. There are different predefined interfaces available in StarPU:

732

here we will consider the @b{vector interface}.

733

734

The following lines show how to declare an array of @code{NX} elements of type

735

@code{float} using the vector interface:

736

737

@cartouche

738

@smallexample

739

float vector[NX];

740

741

starpu_data_handle vector_handle;

742

starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX,

743

sizeof(vector[0]));

744

@end smallexample

745

@end cartouche

746

747

The first argument, called the @b{data handle}, is an opaque pointer which

748

designates the array in StarPU. This is also the structure which is used to

749

describe which data is used by a task. The second argument is the node number

750

where the data originally resides. Here it is 0 since the @code{vector} array is in

751

the main memory. Then comes the pointer @code{vector} where the data can be found in main memory,

752

the number of elements in the vector and the size of each element.

753

The following shows how to construct a StarPU task that will manipulate the

754

vector and a constant factor.

755

756

@cartouche

757

@smallexample

758

float factor = 3.14;

759

struct starpu_task *task = starpu_task_create();

760

761

task->cl = &cl; /* @b{Pointer to the codelet defined below} */

762

task->buffers[0].handle = vector_handle; /* @b{First parameter of the codelet} */

763

task->buffers[0].mode = STARPU_RW;

764

task->cl_arg = &factor;

765

task->cl_arg_size = sizeof(factor);

766

task->synchronous = 1;

767

768

starpu_task_submit(task);

769

@end smallexample

770

@end cartouche

771

772

Since the factor is a mere constant float value parameter,

773

it does not need a preliminary registration, and

774

can just be passed through the @code{cl_arg} pointer like in the previous

775

example. The vector parameter is described by its handle.

776

There are two fields in each element of the @code{buffers} array.

777

@code{handle} is the handle of the data, and @code{mode} specifies how the

778

kernel will access the data (@code{STARPU_R} for read-only, @code{STARPU_W} for

779

write-only and @code{STARPU_RW} for read and write access).

780

781

The definition of the codelet can be written as follows:

782

783

@cartouche

784

@smallexample

785

void scal_cpu_func(void *buffers[], void *cl_arg)

786

@{

787

unsigned i;

788

float *factor = cl_arg;

789

790

/* length of the vector */

791

unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);

792

/* CPU copy of the vector pointer */

793

float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);

794

795

for (i = 0; i < n; i++)

796

val[i] *= *factor;

797

@}

798

799

starpu_codelet cl = @{

800

.where = STARPU_CPU,

801

.cpu_func = scal_cpu_func,

802

.nbuffers = 1

803

@};

804

@end smallexample

805

@end cartouche

806

807

The first argument is an array that gives

808

a description of all the buffers passed in the @code{task->buffers}@ array. The

809

size of this array is given by the @code{nbuffers} field of the codelet

810

structure. For the sake of genericity, this array contains pointers to the

811

different interfaces describing each buffer. In the case of the @b{vector

812

interface}, the location of the vector (resp. its length) is accessible in the

813

@code{ptr} (resp. @code{nx}) of this array. Since the vector is accessed in a

814

read-write fashion, any modification will automatically affect future accesses

815

to this vector made by other tasks.

816

817

The second argument of the @code{scal_cpu_func} function contains a pointer to the

818

parameters of the codelet (given in @code{task->cl_arg}), so that we read the

819

constant factor from this pointer.

820

821

@node Execution of Vector Scaling

822

@subsection Execution of Vector Scaling

823

824

@smallexample

825

% make vector_scal

826

cc $(pkg-config --cflags libstarpu) $(pkg-config --libs libstarpu) vector_scal.c -o vector_scal

827

% ./vector_scal

828

0.000000 3.000000 6.000000 9.000000 12.000000

829

@end smallexample

830

831

@node Vector Scaling on an Hybrid CPU/GPU Machine

832

@section Vector Scaling on an Hybrid CPU/GPU Machine

833

834

Contrary to the previous examples, the task submitted in this example may not

835

only be executed by the CPUs, but also by a CUDA device.

836

837

@menu

838

* Definition of the CUDA Kernel::

839

* Definition of the OpenCL Kernel::

840

* Definition of the Main Code::

841

* Execution of Hybrid Vector Scaling::

842

@end menu

843

844

@node Definition of the CUDA Kernel

845

@subsection Definition of the CUDA Kernel

846

847

The CUDA implementation can be written as follows. It needs to be compiled with

848

a CUDA compiler such as nvcc, the NVIDIA CUDA compiler driver. It must be noted

849

that the vector pointer returned by STARPU_VECTOR_GET_PTR is here a pointer in GPU

850

memory, so that it can be passed as such to the @code{vector_mult_cuda} kernel

851

call.

852

853

@cartouche

854

@smallexample

855

#include <starpu.h>

856

857

static __global__ void vector_mult_cuda(float *val, unsigned n,

858

float factor)

859

@{

860

unsigned i = blockIdx.x*blockDim.x + threadIdx.x;

861

if (i < n)

862

val[i] *= factor;

863

@}

864

865

extern "C" void scal_cuda_func(void *buffers[], void *_args)

866

@{

867

float *factor = (float *)_args;

868

869

/* length of the vector */

870

unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);

871

/* CUDA copy of the vector pointer */

872

float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);

873

unsigned threads_per_block = 64;

874

unsigned nblocks = (n + threads_per_block-1) / threads_per_block;

875

876

@i{ vector_mult_cuda<<<nblocks,threads_per_block, 0, starpu_cuda_get_local_stream()>>>(val, n, *factor);}

877

878

@i{ cudaStreamSynchronize(starpu_cuda_get_local_stream());}

879

@}

880

@end smallexample

881

@end cartouche

882

883

@node Definition of the OpenCL Kernel

884

@subsection Definition of the OpenCL Kernel

885

886

The OpenCL implementation can be written as follows. StarPU provides

887

tools to compile a OpenCL kernel stored in a file.

888

889

@cartouche

890

@smallexample

891

__kernel void vector_mult_opencl(__global float* val, int nx, float factor)

892

@{

893

const int i = get_global_id(0);

894

if (i < nx) @{

895

val[i] *= factor;

896

@}

897

@}

898

@end smallexample

899

@end cartouche

900

901

Similarly to CUDA, the pointer returned by @code{STARPU_VECTOR_GET_PTR} is here

902

a device pointer, so that it is passed as such to the OpenCL kernel.

903

904

@cartouche

905

@smallexample

906

#include <starpu.h>

907

@i{#include <starpu_opencl.h>}

908

909

@i{extern struct starpu_opencl_program programs;}

910

911

void scal_opencl_func(void *buffers[], void *_args)

912

@{

913

float *factor = _args;

914

@i{ int id, devid, err;}

915

@i{ cl_kernel kernel;}

916

@i{ cl_command_queue queue;}

917

@i{ cl_event event;}

918

919

/* length of the vector */

920

unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);

921

/* OpenCL copy of the vector pointer */

922

cl_mem val = (cl_mem) STARPU_VECTOR_GET_PTR(buffers[0]);

923

924

@i{ id = starpu_worker_get_id();}

925

@i{ devid = starpu_worker_get_devid(id);}

926

927

@i{ err = starpu_opencl_load_kernel(&kernel, &queue, &programs,}

928

@i{ "vector_mult_opencl", devid); /* @b{Name of the codelet defined above} */}

929

@i{ if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);}

930

931

@i{ err = clSetKernelArg(kernel, 0, sizeof(val), &val);}

932

@i{ err |= clSetKernelArg(kernel, 1, sizeof(n), &n);}

933

@i{ err |= clSetKernelArg(kernel, 2, sizeof(*factor), factor);}

934

@i{ if (err) STARPU_OPENCL_REPORT_ERROR(err);}

935

936

@i{ @{}

937

@i{ size_t global=1;}

938

@i{ size_t local=1;}

939

@i{ err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &event);}

940

@i{ if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);}

941

@i{ @}}

942

943

@i{ clFinish(queue);}

944

@i{ starpu_opencl_collect_stats(event);}

945

@i{ clReleaseEvent(event);}

946

947

@i{ starpu_opencl_release_kernel(kernel);}

948

@}

949

@end smallexample

950

@end cartouche

951

952

953

@node Definition of the Main Code

954

@subsection Definition of the Main Code

955

956

The CPU implementation is the same as in the previous section.

957

958

Here is the source of the main application. You can notice the value of the

959

field @code{where} for the codelet. We specify

960

@code{STARPU_CPU|STARPU_CUDA|STARPU_OPENCL} to indicate to StarPU that the codelet

961

can be executed either on a CPU or on a CUDA or an OpenCL device.

962

963

@cartouche

964

@smallexample

965

#include <starpu.h>

966

967

#define NX 2048

968

969

extern void scal_cuda_func(void *buffers[], void *_args);

970

extern void scal_cpu_func(void *buffers[], void *_args);

971

extern void scal_opencl_func(void *buffers[], void *_args);

972

973

/* @b{Definition of the codelet} */

974

static starpu_codelet cl = @{

975

.where = STARPU_CPU|STARPU_CUDA|STARPU_OPENCL; /* @b{It can be executed on a CPU,} */

976

/* @b{on a CUDA device, or on an OpenCL device} */

977

.cuda_func = scal_cuda_func;

978

.cpu_func = scal_cpu_func;

979

.opencl_func = scal_opencl_func;

980

.nbuffers = 1;

981

@}

982

983

#ifdef STARPU_USE_OPENCL

984

/* @b{The compiled version of the OpenCL program} */

985

struct starpu_opencl_program programs;

986

#endif

987

988

int main(int argc, char **argv)

989

@{

990

float *vector;

991

int i, ret;

992

float factor=3.0;

993

struct starpu_task *task;

994

starpu_data_handle vector_handle;

995

996

starpu_init(NULL); /* @b{Initialising StarPU} */

997

998

#ifdef STARPU_USE_OPENCL

999

starpu_opencl_load_opencl_from_file(

1000

"examples/basic_examples/vector_scal_opencl_codelet.cl",

1001

&programs, NULL);

1002

#endif

1003

1004

vector = malloc(NX*sizeof(vector[0]));

1005

assert(vector);

1006

for(i=0 ; i<NX ; i++) vector[i] = i;

1007

@end smallexample

1008

@end cartouche

1009

1010

@cartouche

1011

@smallexample

1012

/* @b{Registering data within StarPU} */

1013

starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector,

1014

NX, sizeof(vector[0]));

1015

1016

/* @b{Definition of the task} */

1017

task = starpu_task_create();

1018

task->cl = &cl;

1019

task->buffers[0].handle = vector_handle;

1020

task->buffers[0].mode = STARPU_RW;

1021

task->cl_arg = &factor;

1022

task->cl_arg_size = sizeof(factor);

1023

@end smallexample

1024

@end cartouche

1025

1026

@cartouche

1027

@smallexample

1028

/* @b{Submitting the task} */

1029

ret = starpu_task_submit(task);

1030

if (ret == -ENODEV) @{

1031

fprintf(stderr, "No worker may execute this task\n");

1032

return 1;

1033

@}

1034

1035

@c TODO: Mmm, should rather be an unregistration with an implicit dependency, no?

1036

/* @b{Waiting for its termination} */

1037

starpu_task_wait_for_all();

1038

1039

/* @b{Update the vector in RAM} */

1040

starpu_data_acquire(vector_handle, STARPU_R);

1041

@end smallexample

1042

@end cartouche

1043

1044

@cartouche

1045

@smallexample

1046

/* @b{Access the data} */

1047

for(i=0 ; i<NX; i++) @{

1048

fprintf(stderr, "%f ", vector[i]);

1049

@}

1050

fprintf(stderr, "\n");

1051

1052

/* @b{Release the RAM view of the data before unregistering it and shutting down StarPU} */

1053

starpu_data_release(vector_handle);

1054

starpu_data_unregister(vector_handle);

1055

starpu_shutdown();

1056

1057

return 0;

1058

@}

1059

@end smallexample

1060

@end cartouche

1061

1062

@node Execution of Hybrid Vector Scaling

1063

@subsection Execution of Hybrid Vector Scaling

1064

1065

The Makefile given at the beginning of the section must be extended to

1066

give the rules to compile the CUDA source code. Note that the source

1067

file of the OpenCL kernel does not need to be compiled now, it will

1068

be compiled at run-time when calling the function

1069

@code{starpu_opencl_load_opencl_from_file()} (@pxref{starpu_opencl_load_opencl_from_file}).

1070

1071

@cartouche

1072

@smallexample

1073

CFLAGS += $(shell pkg-config --cflags libstarpu)

1074

LDFLAGS += $(shell pkg-config --libs libstarpu)

1075

CC = gcc

1076

1077

vector_scal: vector_scal.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o

1078

1079

%.o: %.cu

1080

nvcc $(CFLAGS) $< -c $@

1081

1082

clean:

1083

rm -f vector_scal *.o

1084

@end smallexample

1085

@end cartouche

1086

1087

@smallexample

1088

% make

1089

@end smallexample

1090

1091

and to execute it, with the default configuration:

1092

1093

@smallexample

1094

% ./vector_scal

1095

0.000000 3.000000 6.000000 9.000000 12.000000

1096

@end smallexample

1097

1098

or for example, by disabling CPU devices:

1099

1100

@smallexample

1101

% STARPU_NCPUS=0 ./vector_scal

1102

0.000000 3.000000 6.000000 9.000000 12.000000

1103

@end smallexample

1104

1105

or by disabling CUDA devices (which may permit to enable the use of OpenCL,

1106

see @ref{Using accelerators}):

1107

1108

@smallexample

1109

% STARPU_NCUDA=0 ./vector_scal

1110

0.000000 3.000000 6.000000 9.000000 12.000000

1111

@end smallexample

1112

1113

@node Task and Worker Profiling

1114

@section Task and Worker Profiling

1115

1116

A full example showing how to use the profiling API is available in

1117

the StarPU sources in the directory @code{examples/profiling/}.

1118

1119

@cartouche

1120

@smallexample

1121

struct starpu_task *task = starpu_task_create();

1122

task->cl = &cl;

1123

task->synchronous = 1;

1124

/* We will destroy the task structure by hand so that we can

1125

* query the profiling info before the task is destroyed. */

1126

task->destroy = 0;

1127

1128

/* Submit and wait for completion (since synchronous was set to 1) */

1129

starpu_task_submit(task);

1130

1131

/* The task is finished, get profiling information */

1132

struct starpu_task_profiling_info *info = task->profiling_info;

1133

1134

/* How much time did it take before the task started ? */

1135

double delay += starpu_timing_timespec_delay_us(&info->submit_time, &info->start_time);

1136

1137

/* How long was the task execution ? */

1138

double length += starpu_timing_timespec_delay_us(&info->start_time, &info->end_time);

1139

1140

/* We don't need the task structure anymore */

1141

starpu_task_destroy(task);

1142

@end smallexample

1143

@end cartouche

1144

1145

@cartouche

1146

@smallexample

1147

/* Display the occupancy of all workers during the test */

1148

int worker;

1149

for (worker = 0; worker < starpu_worker_get_count(); worker++)

1150

@{

1151

struct starpu_worker_profiling_info worker_info;

1152

int ret = starpu_worker_get_profiling_info(worker, &worker_info);

1153

STARPU_ASSERT(!ret);

1154

1155

double total_time = starpu_timing_timespec_to_us(&worker_info.total_time);

1156

double executing_time = starpu_timing_timespec_to_us(&worker_info.executing_time);

1157

double sleeping_time = starpu_timing_timespec_to_us(&worker_info.sleeping_time);

1158

1159

float executing_ratio = 100.0*executing_time/total_time;

1160

float sleeping_ratio = 100.0*sleeping_time/total_time;

1161

1162

char workername[128];

1163

starpu_worker_get_name(worker, workername, 128);

1164

fprintf(stderr, "Worker %s:\n", workername);

1165

fprintf(stderr, "\ttotal time : %.2lf ms\n", total_time*1e-3);

1166

fprintf(stderr, "\texec time : %.2lf ms (%.2f %%)\n", executing_time*1e-3,

1167

executing_ratio);

1168

fprintf(stderr, "\tblocked time : %.2lf ms (%.2f %%)\n", sleeping_time*1e-3,

1169

sleeping_ratio);

1170

@}

1171

@end smallexample

1172

@end cartouche

1173

1174

@node Partitioning Data

1175

@section Partitioning Data

1176

1177

An existing piece of data can be partitioned in sub parts to be used by different tasks, for instance:

1178

1179

@cartouche

1180

@smallexample

1181

int vector[NX];

1182

starpu_data_handle handle;

1183

1184

/* Declare data to StarPU */

1185

starpu_vector_data_register(&handle, 0, (uintptr_t)vector, NX, sizeof(vector[0]));

1186

1187

/* Partition the vector in PARTS sub-vectors */

1188

starpu_filter f =

1189

@{

1190

.filter_func = starpu_block_filter_func_vector,

1191

.nchildren = PARTS,

1192

.get_nchildren = NULL,

1193

.get_child_ops = NULL

1194

@};

1195

starpu_data_partition(handle, &f);

1196

@end smallexample

1197

@end cartouche

1198

1199

@cartouche

1200

@smallexample

1201

/* Submit a task on each sub-vector */

1202

for (i=0; i<starpu_data_get_nb_children(handle); i++) @{

1203

/* Get subdata number i (there is only 1 dimension) */

1204

starpu_data_handle sub_handle = starpu_data_get_sub_data(handle, 1, i);

1205

struct starpu_task *task = starpu_task_create();

1206

1207

task->buffers[0].handle = sub_handle;

1208

task->buffers[0].mode = STARPU_RW;

1209

task->cl = &cl;

1210

task->synchronous = 1;

1211

task->cl_arg = &factor;

1212

task->cl_arg_size = sizeof(factor);

1213

1214

starpu_task_submit(task);

1215

@}

1216

@end smallexample

1217

@end cartouche

1218

1219

Partitioning can be applied several times, see

1220

@code{examples/basic_examples/mult.c} and @code{examples/filters/}.

1221

1222

@node Performance model example

1223

@section Performance model example

1224

1225

To achieve good scheduling, StarPU scheduling policies need to be able to

1226

estimate in advance the duration of a task. This is done by giving to codelets a

1227

performance model. There are several kinds of performance models.

1228

1229

@itemize

1230

@item

1231

Providing an estimation from the application itself (@code{STARPU_COMMON} model type and @code{cost_model} field),

1232

see for instance

1233

@code{examples/common/blas_model.h} and @code{examples/common/blas_model.c}. It can also be provided for each architecture (@code{STARPU_PER_ARCH} model type and @code{per_arch} field)

1234

@item

1235

Measured at runtime (STARPU_HISTORY_BASED model type). This assumes that for a

1236

given set of data input/output sizes, the performance will always be about the

1237

same. This is very true for regular kernels on GPUs for instance (<0.1% error),

1238

and just a bit less true on CPUs (~=1% error). This also assumes that there are

1239

few different sets of data input/output sizes. StarPU will then keep record of

1240

the average time of previous executions on the various processing units, and use

1241

it as an estimation. History is done per task size, by using a hash of the input

1242

and ouput sizes as an index.

1243

It will also save it in @code{~/.starpu/sampling/codelets}

1244

for further executions, and can be observed by using the

1245

@code{starpu_perfmodel_display} command. The models are indexed by machine name. To share the models between machines (e.g. for a homogeneous cluster), use @code{export STARPU_HOSTNAME=some_global_name}. The following is a small code example.

1246

1247

@cartouche

1248

@smallexample

1249

static struct starpu_perfmodel_t mult_perf_model = @{

1250

.type = STARPU_HISTORY_BASED,

1251

.symbol = "mult_perf_model"

1252

@};

1253

1254

starpu_codelet cl = @{

1255

.where = STARPU_CPU,

1256

.cpu_func = cpu_mult,

1257

.nbuffers = 3,

1258

/* for the scheduling policy to be able to use performance models */

1259

.model = &mult_perf_model

1260

@};

1261

@end smallexample

1262

@end cartouche

1263

1264

@item

1265

Measured at runtime and refined by regression (STARPU_REGRESSION_*_BASED

1266

model type). This still assumes performance regularity, but can work

1267

with various data input sizes, by applying regression over observed

1268

execution times. STARPU_REGRESSION_BASED uses an a*n^b regression

1269

form, STARPU_NL_REGRESSION_BASED uses an a*n^b+c (more precise than

1270

STARPU_REGRESSION_BASED, but costs a lot more to compute)

1271

1272

@item

1273

Provided explicitly by the application (STARPU_PER_ARCH model type): the

1274

@code{.per_arch[i].cost_model} fields have to be filled with pointers to

1275

functions which return the expected duration of the task in micro-seconds, one

1276

per architecture.

1277

1278

@end itemize

1279

1280

How to use schedulers which can benefit from such performance model is explained

1281

in @ref{Task scheduling policy}.

1282

1283

The same can be done for task power consumption estimation, by setting the

1284

@code{power_model} field the same way as the @code{model} field. Note: for

1285

now, the application has to give to the power consumption performance model

1286

a name which is different from the execution time performance model.

1287

1288

@node Theoretical lower bound on execution time

1289

@section Theoretical lower bound on execution time

1290

1291

For kernels with history-based performance models, StarPU can very easily provide a theoretical lower

1292

bound for the execution time of a whole set of tasks. See for

1293

instance @code{examples/lu/lu_example.c}: before submitting tasks,

1294

call @code{starpu_bound_start}, and after complete execution, call

1295

@code{starpu_bound_stop}. @code{starpu_bound_print_lp} or

1296

@code{starpu_bound_print_mps} can then be used to output a Linear Programming

1297

problem corresponding to the schedule of your tasks. Run it through

1298

@code{lp_solve} or any other linear programming solver, and that will give you a

1299

lower bound for the total execution time of your tasks. If StarPU was compiled

1300

with the glpk library installed, @code{starpu_bound_compute} can be used to

1301

solve it immediately and get the optimized minimum. Its @code{integer}

1302

parameter allows to decide whether integer resolution should be computed

1303

and returned.

1304

1305

The @code{deps} parameter tells StarPU whether to take tasks and implicit data

1306

dependencies into account. It must be understood that the linear programming

1307

problem size is quadratic with the number of tasks and thus the time to solve it

1308

will be very long, it could be minutes for just a few dozen tasks. You should

1309

probably use @code{lp_solve -timeout 1 test.pl -wmps test.mps} to convert the

1310

problem to MPS format and then use a better solver, @code{glpsol} might be

1311

better than @code{lp_solve} for instance (the @code{--pcost} option may be

1312

useful), but sometimes doesn't manage to converge. @code{cbc} might look

1313

slower, but it is parallel. Be sure to try at least all the @code{-B} options

1314

of @code{lp_solve}. For instance, we often just use

1315

@code{lp_solve -cc -B1 -Bb -Bg -Bp -Bf -Br -BG -Bd -Bs -BB -Bo -Bc -Bi} , and

1316

the @code{-gr} option can also be quite useful.

1317

1318

Setting @code{deps} to 0 will only take into account the actual computations

1319

on processing units. It however still properly takes into account the varying

1320

performances of kernels and processing units, which is quite more accurate than

1321

just comparing StarPU performances with the fastest of the kernels being used.

1322

1323

The @code{prio} parameter tells StarPU whether to simulate taking into account

1324

the priorities as the StarPU scheduler would, i.e. schedule prioritized

1325

tasks before less prioritized tasks, to check to which extend this results

1326

to a less optimal solution. This increases even more computation time.

1327

1328

Note that for simplicity, all this however doesn't take into account data

1329

transfers, which are assumed to be completely overlapped.

1330

1331

@node Insert Task Utility

1332

@section Insert Task Utility

1333

1334

StarPU provides the wrapper function @code{starpu_insert_task} to ease

1335

the creation and submission of tasks.

1336

1337

@deftypefun int starpu_insert_task (starpu_codelet *@var{cl}, ...)

1338

Create and submit a task corresponding to @var{cl} with the following

1339

arguments. The argument list must be zero-terminated.

1340

1341

The arguments following the codelets can be of the following types:

1342

1343

@itemize

1344

@item

1345

@code{STARPU_R}, @code{STARPU_W}, @code{STARPU_RW}, @code{STARPU_SCRATCH}, @code{STARPU_REDUX} an access mode followed by a data handle;

1346

@item

1347

@code{STARPU_VALUE} followed by a pointer to a constant value and

1348

the size of the constant;

1349

@item

1350

@code{STARPU_CALLBACK} followed by a pointer to a callback function;

1351

@item

1352

@code{STARPU_CALLBACK_ARG} followed by a pointer to be given as an

1353

argument to the callback function;

1354

@item

1355

@code{STARPU_PRIORITY} followed by a integer defining a priority level.

1356

@end itemize

1357

1358

Parameters to be passed to the codelet implementation are defined

1359

through the type @code{STARPU_VALUE}. The function

1360

@code{starpu_unpack_cl_args} must be called within the codelet

1361

implementation to retrieve them.

1362

@end deftypefun

1363

1364

Here the implementation of the codelet:

1365

1366

@smallexample

1367

void func_cpu(void *descr[], void *_args)

1368

@{

1369

int *x0 = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);

1370

float *x1 = (float *)STARPU_VARIABLE_GET_PTR(descr[1]);

1371

int ifactor;

1372

float ffactor;

1373

1374

starpu_unpack_cl_args(_args, &ifactor, &ffactor);

1375

*x0 = *x0 * ifactor;

1376

*x1 = *x1 * ffactor;

1377

@}

1378

1379

starpu_codelet mycodelet = @{

1380

.where = STARPU_CPU,

1381

.cpu_func = func_cpu,

1382

.nbuffers = 2

1383

@};

1384

@end smallexample

1385

1386

And the call to the @code{starpu_insert_task} wrapper:

1387

1388

@smallexample

1389

starpu_insert_task(&mycodelet,

1390

STARPU_VALUE, &ifactor, sizeof(ifactor),

1391

STARPU_VALUE, &ffactor, sizeof(ffactor),

1392

STARPU_RW, data_handles[0], STARPU_RW, data_handles[1],

1393

0);

1394

@end smallexample

1395

1396

The call to @code{starpu_insert_task} is equivalent to the following

1397

code:

1398

1399

@smallexample

1400

struct starpu_task *task = starpu_task_create();

1401

task->cl = &mycodelet;

1402

task->buffers[0].handle = data_handles[0];

1403

task->buffers[0].mode = STARPU_RW;

1404

task->buffers[1].handle = data_handles[1];

1405

task->buffers[1].mode = STARPU_RW;

1406

char *arg_buffer;

1407

size_t arg_buffer_size;

1408

starpu_pack_cl_args(&arg_buffer, &arg_buffer_size,

1409

STARPU_VALUE, &ifactor, sizeof(ifactor),

1410

STARPU_VALUE, &ffactor, sizeof(ffactor),

1411

0);

1412

task->cl_arg = arg_buffer;

1413

task->cl_arg_size = arg_buffer_size;

1414

int ret = starpu_task_submit(task);

1415

@end smallexample

1416

1417

@node Debugging

1418

@section Debugging

1419

1420

StarPU provides several tools to help debugging aplications. Execution traces

1421

can be generated and displayed graphically, see @ref{Generating traces}. Some

1422

gdb helpers are also provided to show the whole StarPU state:

1423

1424

@smallexample

1425

(gdb) source tools/gdbinit

1426

(gdb) help starpu

1427

@end smallexample

1428

1429

@node More examples

1430

@section More examples

1431

1432

More examples are available in the StarPU sources in the @code{examples/}

1433

directory. Simple examples include:

1434

1435

@table @asis

1436

@item @code{incrementer/}:

1437

Trivial incrementation test.

1438

@item @code{basic_examples/}:

1439

Simple documented Hello world (as shown in @ref{Hello World}), vector/scalar product (as shown

1440

in @ref{Vector Scaling on an Hybrid CPU/GPU Machine}), matrix

1441

product examples (as shown in @ref{Performance model example}), an example using the blocked matrix data

1442

interface, and an example using the variable data interface.

1443

@item @code{matvecmult/}:

1444

OpenCL example from NVidia, adapted to StarPU.

1445

@item @code{axpy/}:

1446

AXPY CUBLAS operation adapted to StarPU.

1447

@item @code{fortran/}:

1448

Example of Fortran bindings.

1449

@end table

1450

1451

More advanced examples include:

1452

1453

@table @asis

1454

@item @code{filters/}:

1455

Examples using filters, as shown in @ref{Partitioning Data}.

1456

@item @code{lu/}:

1457

LU matrix factorization, see for instance @code{xlu_implicit.c}

1458

@item @code{cholesky/}:

1459

Cholesky matrix factorization, see for instance @code{cholesky_implicit.c}.

1460

@end table

1461

1462

@c ---------------------------------------------------------------------

1463

@c Performance options

1464

@c ---------------------------------------------------------------------

1465

1466

@node Performance optimization

1467

@chapter How to optimize performance with StarPU

1468

1469

TODO: improve!

1470

1471

@menu

1472

* Data management::

1473

* Task submission::

1474

* Task priorities::

1475

* Task scheduling policy::

1476

* Performance model calibration::

1477

* Task distribution vs Data transfer::

1478

* Data prefetch::

1479

* Power-based scheduling::

1480

* Profiling::

1481

* CUDA-specific optimizations::

1482

@end menu

1483

1484

Simply encapsulating application kernels into tasks already permits to

1485

seamlessly support CPU and GPUs at the same time. To achieve good performance, a

1486

few additional changes are needed.

1487

1488

@node Data management

1489

@section Data management

1490

1491

When the application allocates data, whenever possible it should use the

1492

@code{starpu_malloc} function, which will ask CUDA or

1493

OpenCL to make the allocation itself and pin the corresponding allocated

1494

memory. This is needed to permit asynchronous data transfer, i.e. permit data

1495

transfer to overlap with computations.

1496

1497

By default, StarPU leaves replicates of data wherever they were used, in case they

1498

will be re-used by other tasks, thus saving the data transfer time. When some

1499

task modifies some data, all the other replicates are invalidated, and only the

1500

processing unit which ran that task will have a valid replicate of the data. If the application knows

1501

that this data will not be re-used by further tasks, it should advise StarPU to

1502

immediately replicate it to a desired list of memory nodes (given through a

1503

bitmask). This can be understood like the write-through mode of CPU caches.

1504

1505

@example

1506

starpu_data_set_wt_mask(img_handle, 1<<0);

1507

@end example

1508

1509

will for instance request to always transfer a replicate into the main memory (node

1510

0), as bit 0 of the write-through bitmask is being set.

1511

1512

@node Task submission

1513

@section Task submission

1514

1515

To let StarPU make online optimizations, tasks should be submitted

1516

asynchronously as much as possible. Ideally, all the tasks should be

1517

submitted, and mere calls to @code{starpu_task_wait_for_all} or

1518

@code{starpu_data_unregister} be done to wait for

1519

termination. StarPU will then be able to rework the whole schedule, overlap

1520

computation with communication, manage accelerator local memory usage, etc.

1521

1522

@node Task priorities

1523

@section Task priorities

1524

1525

By default, StarPU will consider the tasks in the order they are submitted by

1526

the application. If the application programmer knows that some tasks should

1527

be performed in priority (for instance because their output is needed by many

1528

other tasks and may thus be a bottleneck if not executed early enough), the

1529

@code{priority} field of the task structure should be set to transmit the

1530

priority information to StarPU.

1531

1532

@node Task scheduling policy

1533

@section Task scheduling policy

1534

1535

By default, StarPU uses the @code{eager} simple greedy scheduler. This is

1536

because it provides correct load balance even if the application codelets do not

1537

have performance models. If your application codelets have performance models

1538

(@pxref{Performance model example} for examples showing how to do it),

1539

you should change the scheduler thanks to the @code{STARPU_SCHED} environment

1540

variable. For instance @code{export STARPU_SCHED=dmda} . Use @code{help} to get

1541

the list of available schedulers.

1542

1543

The @b{eager} scheduler uses a central task queue, from which workers draw tasks

1544

to work on. This however does not permit to prefetch data since the scheduling

1545

decision is taken late. If a task has a non-0 priority, it is put at the front of the queue.

1546

1547

The @b{prio} scheduler also uses a central task queue, but sorts tasks by

1548

priority (between -5 and 5).

1549

1550

The @b{random} scheduler distributes tasks randomly according to assumed worker

1551

overall performance.

1552

1553

The @b{ws} (work stealing) scheduler schedules tasks on the local worker by

1554

default. When a worker becomes idle, it steals a task from the most loaded

1555

worker.

1556

1557

The @b{dm} (deque model) scheduler uses task execution performance models into account to

1558

perform an HEFT-similar scheduling strategy: it schedules tasks where their

1559

termination time will be minimal.

1560

1561

The @b{dmda} (deque model data aware) scheduler is similar to dm, it also takes

1562

into account data transfer time.

1563

1564

The @b{dmdar} (deque model data aware ready) scheduler is similar to dmda,

1565

it also sorts tasks on per-worker queues by number of already-available data

1566

buffers.

1567

1568

The @b{dmdas} (deque model data aware sorted) scheduler is similar to dmda, it

1569

also supports arbitrary priority values.

1570

1571

The @b{heft} (HEFT) scheduler is similar to dmda, it also supports task bundles.

1572

1573

The @b{pheft} (parallel HEFT) scheduler is similar to heft, it also supports

1574

parallel tasks (still experimental).

1575

1576

The @b{pgreedy} (parallel greedy) scheduler is similar to greedy, it also

1577

supports parallel tasks (still experimental).

1578

1579

@node Performance model calibration

1580

@section Performance model calibration

1581

1582

Most schedulers are based on an estimation of codelet duration on each kind

1583

of processing unit. For this to be possible, the application programmer needs

1584

to configure a performance model for the codelets of the application (see

1585

@ref{Performance model example} for instance). History-based performance models

1586

use on-line calibration. StarPU will automatically calibrate codelets

1587

which have never been calibrated yet, and save the result in

1588

@code{~/.starpu/sampling/codelets}.

1589

The models are indexed by machine name. To share the models between machines (e.g. for a homogeneous cluster), use @code{export STARPU_HOSTNAME=some_global_name}. To force continuing calibration, use

1590

@code{export STARPU_CALIBRATE=1} . This may be necessary if your application

1591

has not-so-stable performance. Details on the current performance model status

1592

can be obtained from the @code{starpu_perfmodel_display} command: the @code{-l}

1593

option lists the available performance models, and the @code{-s} option permits

1594

to choose the performance model to be displayed. The result looks like:

1595

1596

@example

1597

$ starpu_perfmodel_display -s starpu_dlu_lu_model_22

1598

performance model for cpu

1599

# hash size mean dev n

1600

5c6c3401 1572864 1.216300e+04 2.277778e+03 1240

1601

@end example

1602

1603

Which shows that for the LU 22 kernel with a 1.5MiB matrix, the average

1604

execution time on CPUs was about 12ms, with a 2ms standard deviation, over

1605

1240 samples. It is a good idea to check this before doing actual performance

1606

measurements.

1607

1608

If a kernel source code was modified (e.g. performance improvement), the

1609

calibration information is stale and should be dropped, to re-calibrate from

1610

start. This can be done by using @code{export STARPU_CALIBRATE=2}.

1611

1612

Note: due to CUDA limitations, to be able to measure kernel duration,

1613

calibration mode needs to disable asynchronous data transfers. Calibration thus

1614

disables data transfer / computation overlapping, and should thus not be used

1615

for eventual benchmarks. Note 2: history-based performance models get calibrated

1616

only if a performance-model-based scheduler is chosen.

1617

1618

@node Task distribution vs Data transfer

1619

@section Task distribution vs Data transfer

1620

1621

Distributing tasks to balance the load induces data transfer penalty. StarPU

1622

thus needs to find a balance between both. The target function that the

1623

@code{dmda} scheduler of StarPU

1624

tries to minimize is @code{alpha * T_execution + beta * T_data_transfer}, where

1625

@code{T_execution} is the estimated execution time of the codelet (usually

1626

accurate), and @code{T_data_transfer} is the estimated data transfer time. The

1627

latter is estimated based on bus calibration before execution start,

1628

i.e. with an idle machine, thus without contention. You can force bus re-calibration by running

1629

@code{starpu_calibrate_bus}. The beta parameter defaults to 1, but it can be

1630

worth trying to tweak it by using @code{export STARPU_BETA=2} for instance,

1631

since during real application execution, contention makes transfer times bigger.

1632

This is of course imprecise, but in practice, a rough estimation already gives

1633

the good results that a precise estimation would give.

1634

1635

@node Data prefetch

1636

@section Data prefetch

1637

1638

The @code{heft}, @code{dmda} and @code{pheft} scheduling policies perform data prefetch (see @ref{STARPU_PREFETCH}):

1639

as soon as a scheduling decision is taken for a task, requests are issued to

1640

transfer its required data to the target processing unit, if needeed, so that

1641

when the processing unit actually starts the task, its data will hopefully be

1642

already available and it will not have to wait for the transfer to finish.

1643

1644

The application may want to perform some manual prefetching, for several reasons

1645

such as excluding initial data transfers from performance measurements, or

1646

setting up an initial statically-computed data distribution on the machine

1647

before submitting tasks, which will thus guide StarPU toward an initial task

1648

distribution (since StarPU will try to avoid further transfers).

1649

1650

This can be achieved by giving the @code{starpu_data_prefetch_on_node} function

1651

the handle and the desired target memory node.

1652

1653

@node Power-based scheduling

1654

@section Power-based scheduling

1655

1656

If the application can provide some power performance model (through

1657

the @code{power_model} field of the codelet structure), StarPU will

1658

take it into account when distributing tasks. The target function that

1659

the @code{dmda} scheduler minimizes becomes @code{alpha * T_execution +

1660

beta * T_data_transfer + gamma * Consumption} , where @code{Consumption}

1661

is the estimated task consumption in Joules. To tune this parameter, use

1662

@code{export STARPU_GAMMA=3000} for instance, to express that each Joule

1663

(i.e kW during 1000us) is worth 3000us execution time penalty. Setting

1664

@code{alpha} and @code{beta} to zero permits to only take into account power consumption.

1665

1666

This is however not sufficient to correctly optimize power: the scheduler would

1667

simply tend to run all computations on the most energy-conservative processing

1668

unit. To account for the consumption of the whole machine (including idle

1669

processing units), the idle power of the machine should be given by setting

1670

@code{export STARPU_IDLE_POWER=200} for 200W, for instance. This value can often

1671

be obtained from the machine power supplier.

1672

1673

The power actually consumed by the total execution can be displayed by setting

1674

@code{export STARPU_PROFILING=1 STARPU_WORKER_STATS=1} .

1675

1676

@node Profiling

1677

@section Profiling

1678

1679

A quick view of how many tasks each worker has executed can be obtained by setting

1680

@code{export STARPU_WORKER_STATS=1} This is a convenient way to check that

1681

execution did happen on accelerators without penalizing performance with

1682

the profiling overhead.

1683

1684

A quick view of how much data transfers have been issued can be obtained by setting

1685

@code{export STARPU_BUS_STATS=1} .

1686

1687

More detailed profiling information can be enabled by using @code{export STARPU_PROFILING=1} or by

1688

calling @code{starpu_profiling_status_set} from the source code.

1689

Statistics on the execution can then be obtained by using @code{export

1690

STARPU_BUS_STATS=1} and @code{export STARPU_WORKER_STATS=1} .

1691

More details on performance feedback are provided by the next chapter.

1692

1693

@node CUDA-specific optimizations

1694

@section CUDA-specific optimizations

1695

1696

Due to CUDA limitations, StarPU will have a hard time overlapping its own

1697

communications and the codelet computations if the application does not use a

1698

dedicated CUDA stream for its computations. StarPU provides one by the use of

1699

@code{starpu_cuda_get_local_stream()} which should be used by all CUDA codelet

1700

operations. For instance:

1701

1702

@example

1703

func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);

1704

cudaStreamSynchronize(starpu_cuda_get_local_stream());

1705

@end example

1706

1707

Unfortunately, some CUDA libraries do not have stream variants of

1708

kernels. That will lower the potential for overlapping.

1709

1710

@c ---------------------------------------------------------------------

1711

@c Performance feedback

1712

@c ---------------------------------------------------------------------

1713

1714

@node Performance feedback

1715

@chapter Performance feedback

1716

1717

@menu

1718

* On-line:: On-line performance feedback

1719

* Off-line:: Off-line performance feedback

1720

* Codelet performance:: Performance of codelets

1721

@end menu

1722

1723

@node On-line

1724

@section On-line performance feedback

1725

1726

@menu

1727

* Enabling monitoring:: Enabling on-line performance monitoring

1728

* Task feedback:: Per-task feedback

1729

* Codelet feedback:: Per-codelet feedback

1730

* Worker feedback:: Per-worker feedback

1731

* Bus feedback:: Bus-related feedback

1732

@end menu

1733

1734

@node Enabling monitoring

1735

@subsection Enabling on-line performance monitoring

1736

1737

In order to enable online performance monitoring, the application can call

1738

@code{starpu_profiling_status_set(STARPU_PROFILING_ENABLE)}. It is possible to

1739

detect whether monitoring is already enabled or not by calling

1740

@code{starpu_profiling_status_get()}. Enabling monitoring also reinitialize all

1741

previously collected feedback. The @code{STARPU_PROFILING} environment variable

1742

can also be set to 1 to achieve the same effect.

1743

1744

Likewise, performance monitoring is stopped by calling

1745

@code{starpu_profiling_status_set(STARPU_PROFILING_DISABLE)}. Note that this

1746

does not reset the performance counters so that the application may consult

1747

them later on.

1748

1749

More details about the performance monitoring API are available in section

1750

@ref{Profiling API}.

1751

1752

@node Task feedback

1753

@subsection Per-task feedback

1754

1755

If profiling is enabled, a pointer to a @code{starpu_task_profiling_info}

1756

structure is put in the @code{.profiling_info} field of the @code{starpu_task}

1757

structure when a task terminates.

1758

This structure is automatically destroyed when the task structure is destroyed,

1759

either automatically or by calling @code{starpu_task_destroy}.

1760

1761

The @code{starpu_task_profiling_info} structure indicates the date when the

1762

task was submitted (@code{submit_time}), started (@code{start_time}), and

1763

terminated (@code{end_time}), relative to the initialization of

1764

StarPU with @code{starpu_init}. It also specifies the identifier of the worker

1765

that has executed the task (@code{workerid}).

1766

These date are stored as @code{timespec} structures which the user may convert

1767

into micro-seconds using the @code{starpu_timing_timespec_to_us} helper

1768

function.

1769

1770

It it worth noting that the application may directly access this structure from

1771

the callback executed at the end of the task. The @code{starpu_task} structure

1772

associated to the callback currently being executed is indeed accessible with

1773

the @code{starpu_get_current_task()} function.

1774

1775

@node Codelet feedback

1776

@subsection Per-codelet feedback

1777

1778

The @code{per_worker_stats} field of the @code{starpu_codelet_t} structure is

1779

an array of counters. The i-th entry of the array is incremented every time a

1780

task implementing the codelet is executed on the i-th worker.

1781

This array is not reinitialized when profiling is enabled or disabled.

1782

1783

@node Worker feedback

1784

@subsection Per-worker feedback

1785

1786

The second argument returned by the @code{starpu_worker_get_profiling_info}

1787

function is a @code{starpu_worker_profiling_info} structure that gives

1788

statistics about the specified worker. This structure specifies when StarPU

1789

started collecting profiling information for that worker (@code{start_time}),

1790

the duration of the profiling measurement interval (@code{total_time}), the

1791

time spent executing kernels (@code{executing_time}), the time spent sleeping

1792

because there is no task to execute at all (@code{sleeping_time}), and the

1793

number of tasks that were executed while profiling was enabled.

1794

These values give an estimation of the proportion of time spent do real work,

1795

and the time spent either sleeping because there are not enough executable

1796

tasks or simply wasted in pure StarPU overhead.

1797

1798

Calling @code{starpu_worker_get_profiling_info} resets the profiling

1799

information associated to a worker.

1800

1801

When an FxT trace is generated (see @ref{Generating traces}), it is also

1802

possible to use the @code{starpu_top} script (described in @ref{starpu-top}) to

1803

generate a graphic showing the evolution of these values during the time, for

1804

the different workers.

1805

1806

@node Bus feedback

1807

@subsection Bus-related feedback

1808

1809

TODO

1810

1811

@c how to enable/disable performance monitoring

1812

1813

@c what kind of information do we get ?

1814

1815

@node Off-line

1816

@section Off-line performance feedback

1817

1818

@menu

1819

* Generating traces:: Generating traces with FxT

1820

* Gantt diagram:: Creating a Gantt Diagram

1821

* DAG:: Creating a DAG with graphviz

1822

* starpu-top:: Monitoring activity

1823

@end menu

1824

1825

@node Generating traces

1826

@subsection Generating traces with FxT

1827

1828

StarPU can use the FxT library (see

1829

@indicateurl{https://savannah.nongnu.org/projects/fkt/}) to generate traces

1830

with a limited runtime overhead.

1831

1832

You can either get a tarball:

1833

@example

1834

% wget http://download.savannah.gnu.org/releases/fkt/fxt-0.2.2.tar.gz

1835

@end example

1836

1837

or use the FxT library from CVS (autotools are required):

1838

@example

1839

% cvs -d :pserver:anonymous@@cvs.sv.gnu.org:/sources/fkt co FxT

1840

% ./bootstrap

1841

@end example

1842

1843

Compiling and installing the FxT library in the @code{$FXTDIR} path is

1844

done following the standard procedure:

1845

@example

1846

% ./configure --prefix=$FXTDIR

1847

% make

1848

% make install

1849

@end example

1850

1851

In order to have StarPU to generate traces, StarPU should be configured with

1852

the @code{--with-fxt} option:

1853

@example

1854

$ ./configure --with-fxt=$FXTDIR

1855

@end example

1856

1857

Or you can simply point the @code{PKG_CONFIG_PATH} to

1858

@code{$FXTDIR/lib/pkgconfig} and pass @code{--with-fxt} to @code{./configure}

1859

1860

When FxT is enabled, a trace is generated when StarPU is terminated by calling

1861

@code{starpu_shutdown()}). The trace is a binary file whose name has the form

1862

@code{prof_file_XXX_YYY} where @code{XXX} is the user name, and

1863

@code{YYY} is the pid of the process that used StarPU. This file is saved in the

1864

@code{/tmp/} directory by default, or by the directory specified by

1865

the @code{STARPU_FXT_PREFIX} environment variable.

1866

1867

@node Gantt diagram

1868

@subsection Creating a Gantt Diagram

1869

1870

When the FxT trace file @code{filename} has been generated, it is possible to

1871

generate a trace in the Paje format by calling:

1872

@example

1873

% starpu_fxt_tool -i filename

1874

@end example

1875

1876

Or alternatively, setting the @code{STARPU_GENERATE_TRACE} environment variable

1877

to 1 before application execution will make StarPU do it automatically at

1878

application shutdown.

1879

1880

This will create a @code{paje.trace} file in the current directory that can be

1881

inspected with the ViTE trace visualizing open-source tool. More information

1882

about ViTE is available at @indicateurl{http://vite.gforge.inria.fr/}. It is

1883

possible to open the @code{paje.trace} file with ViTE by using the following

1884

command:

1885

@example

1886

% vite paje.trace

1887

@end example

1888

1889

@node DAG

1890

@subsection Creating a DAG with graphviz

1891

1892

When the FxT trace file @code{filename} has been generated, it is possible to

1893

generate a task graph in the DOT format by calling:

1894

@example

1895

$ starpu_fxt_tool -i filename

1896

@end example

1897

1898

This will create a @code{dag.dot} file in the current directory. This file is a

1899

task graph described using the DOT language. It is possible to get a

1900

graphical output of the graph by using the graphviz library:

1901

@example

1902

$ dot -Tpdf dag.dot -o output.pdf

1903

@end example

1904

1905

@node starpu-top

1906

@subsection Monitoring activity

1907

1908

When the FxT trace file @code{filename} has been generated, it is possible to

1909

generate a activity trace by calling:

1910

@example

1911

$ starpu_fxt_tool -i filename

1912

@end example

1913

1914

This will create an @code{activity.data} file in the current

1915

directory. A profile of the application showing the activity of StarPU

1916

during the execution of the program can be generated:

1917

@example

1918

$ starpu_top.sh activity.data

1919

@end example

1920

1921

This will create a file named @code{activity.eps} in the current directory.

1922

This picture is composed of two parts.

1923

The first part shows the activity of the different workers. The green sections

1924

indicate which proportion of the time was spent executed kernels on the

1925

processing unit. The red sections indicate the proportion of time spent in

1926

StartPU: an important overhead may indicate that the granularity may be too

1927

low, and that bigger tasks may be appropriate to use the processing unit more

1928

efficiently. The black sections indicate that the processing unit was blocked

1929

because there was no task to process: this may indicate a lack of parallelism

1930

which may be alleviated by creating more tasks when it is possible.

1931

1932

The second part of the @code{activity.eps} picture is a graph showing the

1933

evolution of the number of tasks available in the system during the execution.

1934

Ready tasks are shown in black, and tasks that are submitted but not

1935

schedulable yet are shown in grey.

1936

1937

@node Codelet performance

1938

@section Performance of codelets

1939

1940

The performance model of codelets can be examined by using the

1941

@code{starpu_perfmodel_display} tool:

1942

1943

@example

1944

$ starpu_perfmodel_display -l

1945

file: <malloc_pinned.hannibal>

1946

file: <starpu_slu_lu_model_21.hannibal>

1947

file: <starpu_slu_lu_model_11.hannibal>

1948

file: <starpu_slu_lu_model_22.hannibal>

1949

file: <starpu_slu_lu_model_12.hannibal>

1950

@end example

1951

1952

Here, the codelets of the lu example are available. We can examine the

1953

performance of the 22 kernel:

1954

1955

@example

1956

$ starpu_perfmodel_display -s starpu_slu_lu_model_22

1957

performance model for cpu

1958

# hash size mean dev n

1959

57618ab0 19660800 2.851069e+05 1.829369e+04 109

1960

performance model for cuda_0

1961

# hash size mean dev n

1962

57618ab0 19660800 1.164144e+04 1.556094e+01 315

1963

performance model for cuda_1

1964

# hash size mean dev n

1965

57618ab0 19660800 1.164271e+04 1.330628e+01 360

1966

performance model for cuda_2

1967

# hash size mean dev n

1968

57618ab0 19660800 1.166730e+04 3.390395e+02 456

1969

@end example

1970

1971

We can see that for the given size, over a sample of a few hundreds of

1972

execution, the GPUs are about 20 times faster than the CPUs (numbers are in

1973

us). The standard deviation is extremely low for the GPUs, and less than 10% for

1974

CPUs.

1975

1976

@c ---------------------------------------------------------------------

1977

@c MPI support

1978

@c ---------------------------------------------------------------------

1979

1980

@node StarPU MPI support

1981

@chapter StarPU MPI support

1982

1983

The integration of MPI transfers within task parallelism is done in a

1984

very natural way by the means of asynchronous interactions between the

1985

application and StarPU. This is implemented in a separate libstarpumpi library

1986

which basically provides "StarPU" equivalents of @code{MPI_*} functions, where

1987

@code{void *} buffers are replaced with @code{starpu_data_handle}s, and all

1988

GPU-RAM-NIC transfers are handled efficiently by StarPU-MPI.

1989

1990

@menu

1991

* The API::

1992

* Simple Example::

1993

* MPI Insert Task Utility::

1994

@end menu

1995

1996

@node The API

1997

@section The API

1998

1999

@subsection Initialisation

2000

2001

@deftypefun int starpu_mpi_initialize (void)

2002

Initializes the starpumpi library. This must be called between calling

2003

@code{starpu_init} and other @code{starpu_mpi} functions. This

2004

function does not call @code{MPI_Init}, it should be called beforehand.

2005

@end deftypefun

2006

2007

@deftypefun int starpu_mpi_initialize_extended (int *@var{rank}, int *@var{world_size})

2008

Initializes the starpumpi library. This must be called between calling

2009

@code{starpu_init} and other @code{starpu_mpi} functions.

2010

This function calls @code{MPI_Init}, and therefore should be prefered

2011

to the previous one for MPI implementations which are not thread-safe.

2012

Returns the current MPI node rank and world size.

2013

@end deftypefun

2014

2015

@deftypefun int starpu_mpi_shutdown (void)

2016

Cleans the starpumpi library. This must be called between calling

2017

@code{starpu_mpi} functions and @code{starpu_shutdown}.

2018

@code{MPI_Finalize} will be called if StarPU-MPI has been initialized

2019

by calling @code{starpu_mpi_initialize_extended}.

2020

@end deftypefun

2021

2022

@subsection Communication

2023

2024

@deftypefun int starpu_mpi_send (starpu_data_handle @var{data_handle}, int @var{dest}, int @var{mpi_tag}, MPI_Comm @var{comm})

2025

@end deftypefun

2026

2027

@deftypefun int starpu_mpi_recv (starpu_data_handle @var{data_handle}, int @var{source}, int @var{mpi_tag}, MPI_Comm @var{comm}, MPI_Status *@var{status})

2028

@end deftypefun

2029

2030

@deftypefun int starpu_mpi_isend (starpu_data_handle @var{data_handle}, starpu_mpi_req *@var{req}, int @var{dest}, int @var{mpi_tag}, MPI_Comm @var{comm})

2031

2032

@end deftypefun

2033

2034

@deftypefun int starpu_mpi_irecv (starpu_data_handle @var{data_handle}, starpu_mpi_req *@var{req}, int @var{source}, int @var{mpi_tag}, MPI_Comm @var{comm})

2035

@end deftypefun

2036

2037

@deftypefun int starpu_mpi_isend_detached (starpu_data_handle @var{data_handle}, int @var{dest}, int @var{mpi_tag}, MPI_Comm @var{comm}, void (*@var{callback})(void *), void *@var{arg})

2038

@end deftypefun

2039

2040

@deftypefun int starpu_mpi_irecv_detached (starpu_data_handle @var{data_handle}, int @var{source}, int @var{mpi_tag}, MPI_Comm @var{comm}, void (*@var{callback})(void *), void *@var{arg})

2041

@end deftypefun

2042

2043

@deftypefun int starpu_mpi_wait (starpu_mpi_req *@var{req}, MPI_Status *@var{status})

2044

@end deftypefun

2045

2046

@deftypefun int starpu_mpi_test (starpu_mpi_req *@var{req}, int *@var{flag}, MPI_Status *@var{status})

2047

@end deftypefun

2048

2049

@deftypefun int starpu_mpi_barrier (MPI_Comm @var{comm})

2050

@end deftypefun

2051

2052

@deftypefun int starpu_mpi_isend_detached_unlock_tag (starpu_data_handle @var{data_handle}, int @var{dest}, int @var{mpi_tag}, MPI_Comm @var{comm}, starpu_tag_t @var{tag})

2053

When the transfer is completed, the tag is unlocked

2054

@end deftypefun

2055

2056

@deftypefun int starpu_mpi_irecv_detached_unlock_tag (starpu_data_handle @var{data_handle}, int @var{source}, int @var{mpi_tag}, MPI_Comm @var{comm}, starpu_tag_t @var{tag})

2057

@end deftypefun

2058

2059

@deftypefun int starpu_mpi_isend_array_detached_unlock_tag (unsigned @var{array_size}, starpu_data_handle *@var{data_handle}, int *@var{dest}, int *@var{mpi_tag}, MPI_Comm *@var{comm}, starpu_tag_t @var{tag})

2060

Asynchronously send an array of buffers, and unlocks the tag once all

2061

of them are transmitted.

2062

@end deftypefun

2063

2064

@deftypefun int starpu_mpi_irecv_array_detached_unlock_tag (unsigned @var{array_size}, starpu_data_handle *@var{data_handle}, int *@var{source}, int *@var{mpi_tag}, MPI_Comm *@var{comm}, starpu_tag_t @var{tag})

2065

@end deftypefun

2066

2067

@page

2068

@node Simple Example

2069

@section Simple Example

2070

2071

@cartouche

2072

@smallexample

2073

void increment_token(void)

2074

@{

2075

struct starpu_task *task = starpu_task_create();

2076

2077

task->cl = &increment_cl;

2078

task->buffers[0].handle = token_handle;

2079

task->buffers[0].mode = STARPU_RW;

2080

2081

starpu_task_submit(task);

2082

@}

2083

@end smallexample

2084

@end cartouche

2085

2086

@cartouche

2087

@smallexample

2088

int main(int argc, char **argv)

2089

@{

2090

int rank, size;

2091

2092

starpu_init(NULL);

2093

starpu_mpi_initialize_extended(&rank, &size);

2094

2095

starpu_vector_data_register(&token_handle, 0, (uintptr_t)&token, 1, sizeof(unsigned));

2096

2097

unsigned nloops = NITER;

2098

unsigned loop;

2099

2100

unsigned last_loop = nloops - 1;

2101

unsigned last_rank = size - 1;

2102

@end smallexample

2103

@end cartouche

2104

2105

@cartouche

2106

@smallexample

2107

for (loop = 0; loop < nloops; loop++) @{

2108

int tag = loop*size + rank;

2109

2110

if (loop == 0 && rank == 0)

2111

@{

2112

token = 0;

2113

fprintf(stdout, "Start with token value %d\n", token);

2114

@}

2115

else

2116

@{

2117

starpu_mpi_irecv_detached(token_handle, (rank+size-1)%size, tag,

2118

MPI_COMM_WORLD, NULL, NULL);

2119

@}

2120

2121

increment_token();

2122

2123

if (loop == last_loop && rank == last_rank)

2124

@{

2125

starpu_data_acquire(token_handle, STARPU_R);

2126

fprintf(stdout, "Finished : token value %d\n", token);

2127

starpu_data_release(token_handle);

2128

@}

2129

else

2130

@{

2131

starpu_mpi_isend_detached(token_handle, (rank+1)%size, tag+1,

2132

MPI_COMM_WORLD, NULL, NULL);

2133

@}

2134

@}

2135

2136

starpu_task_wait_for_all();

2137

@end smallexample

2138

@end cartouche

2139

2140

@cartouche

2141

@smallexample

2142

starpu_mpi_shutdown();

2143

starpu_shutdown();

2144

2145

if (rank == last_rank)

2146

@{

2147

fprintf(stderr, "[%d] token = %d == %d * %d ?\n", rank, token, nloops, size);

2148

STARPU_ASSERT(token == nloops*size);

2149

@}

2150

@end smallexample

2151

@end cartouche

2152

2153

@page

2154

@node MPI Insert Task Utility

2155

@section MPI Insert Task Utility

2156

2157

@deftypefun void starpu_mpi_insert_task (MPI_Comm @var{comm}, starpu_codelet *@var{cl}, ...)

2158

Create and submit a task corresponding to @var{cl} with the following

2159

arguments. The argument list must be zero-terminated.

2160

2161

The arguments following the codelets are the same types as for the

2162

function @code{starpu_insert_task} defined in @ref{Insert Task

2163

Utility}. The extra argument @code{STARPU_EXECUTE} followed by an

2164

integer allows to specify the node to execute the codelet.

2165

2166

The algorithm is as follows:

2167

@enumerate

2168

@item Find out whether we are to execute the codelet because we own the

2169

data to be written to. If different tasks own data to be written to,

2170

the argument @code{STARPU_EXECUTE} should be used to specify the

2171

executing task @code{ET}.

2172

@item Send and receive data as requested. Tasks owning data which need

2173

to be read by the executing task @code{ET} are sending them to @code{ET}.

2174

@item Execute the codelet. This is done by the task selected in the

2175

1st step of the algorithm.

2176

@item In the case when different tasks own data to be written to, send

2177

W data back to their owners.

2178

@end enumerate

2179

2180

The algorithm also includes a cache mechanism that allows not to send

2181

data twice to the same task, unless the data has been modified.

2182

2183

@end deftypefun

2184

2185

@deftypefun void starpu_mpi_get_data_on_node (MPI_Comm @var{comm}, starpu_data_handle @var{data_handle}, int @var{node})

2186

@end deftypefun

2187

2188

@page

2189

2190

Here an example showing how to use @code{starpu_mpi_insert_task}. One

2191

first needs to define a distribution function which specifies the

2192

locality of the data. Note that that distribution information needs to

2193

be given to StarPU by calling @code{starpu_data_set_rank}.

2194

2195

@cartouche

2196

@smallexample

2197

/* Returns the MPI node number where data is */

2198

int my_distrib(int x, int y, int nb_nodes) @{

2199

/* Cyclic distrib */

2200

return ((int)(x / sqrt(nb_nodes) + (y / sqrt(nb_nodes)) * sqrt(nb_nodes))) % nb_nodes;

2201

// /* Linear distrib */

2202

// return x / sqrt(nb_nodes) + (y / sqrt(nb_nodes)) * X;

2203

@}

2204

@end smallexample

2205

@end cartouche

2206

2207

Now the data can be registered within StarPU. Data which are not

2208

owned but will be needed for computations can be registered through

2209

the lazy allocation mechanism, i.e. with a @code{home_node} set to -1.

2210

StarPU will automatically allocate the memory when it is used for the

2211

first time.

2212

2213

@cartouche

2214

@smallexample

2215

unsigned matrix[X][Y];

2216

starpu_data_handle data_handles[X][Y];

2217

2218

for(x = 0; x < X; x++) @{

2219

for (y = 0; y < Y; y++) @{

2220

int mpi_rank = my_distrib(x, y, size);

2221

if (mpi_rank == rank)

2222

/* Owning data */

2223

starpu_variable_data_register(&data_handles[x][y], 0,

2224

(uintptr_t)&(matrix[x][y]), sizeof(unsigned));

2225

else if (rank == mpi_rank+1 || rank == mpi_rank-1)

2226

/* I don't own that index, but will need it for my computations */

2227

starpu_variable_data_register(&data_handles[x][y], -1,

2228

(uintptr_t)NULL, sizeof(unsigned));

2229

else

2230

/* I know it's useless to allocate anything for this */

2231

data_handles[x][y] = NULL;

2232

if (data_handles[x][y])

2233

starpu_data_set_rank(data_handles[x][y], mpi_rank);

2234

@}

2235

@}

2236

@end smallexample

2237

@end cartouche

2238

2239

Now @code{starpu_mpi_insert_task()} can be called for the different

2240

steps of the application.

2241

2242

@cartouche

2243

@smallexample

2244

for(loop=0 ; loop<niter; loop++)

2245

for (x = 1; x < X-1; x++)

2246

for (y = 1; y < Y-1; y++)

2247

starpu_mpi_insert_task(MPI_COMM_WORLD, &stencil5_cl,

2248

STARPU_RW, data_handles[x][y],

2249

STARPU_R, data_handles[x-1][y],

2250

STARPU_R, data_handles[x+1][y],

2251

STARPU_R, data_handles[x][y-1],

2252

STARPU_R, data_handles[x][y+1],

2253

0);

2254

starpu_task_wait_for_all();

2255

@end smallexample

2256

@end cartouche

2257

2258

@c ---------------------------------------------------------------------

2259

@c Configuration options

2260

@c ---------------------------------------------------------------------

2261

2262

@node Configuring StarPU

2263

@chapter Configuring StarPU

2264

2265

2266

@menu

2267

* Compilation configuration::

2268

* Execution configuration through environment variables::

2269

@end menu

2270

2271

@node Compilation configuration

2272

@section Compilation configuration

2273

2274

The following arguments can be given to the @code{configure} script.

2275

2276

@menu

2277

* Common configuration::

2278

* Configuring workers::

2279

* Advanced configuration::

2280

@end menu

2281

2282

@node Common configuration

2283

@subsection Common configuration

2284

2285

2286

@menu

2287

* --enable-debug::

2288

* --enable-fast::

2289

* --enable-verbose::

2290

* --enable-coverage::

2291

@end menu

2292

2293

@node --enable-debug

2294

@subsubsection @code{--enable-debug}

2295

@table @asis

2296

@item @emph{Description}:

2297

Enable debugging messages.

2298

@end table

2299

2300

@node --enable-fast

2301

@subsubsection @code{--enable-fast}

2302

@table @asis

2303

@item @emph{Description}:

2304

Do not enforce assertions, saves a lot of time spent to compute them otherwise.

2305

@end table

2306

2307

@node --enable-verbose

2308

@subsubsection @code{--enable-verbose}

2309

@table @asis

2310

@item @emph{Description}:

2311

Augment the verbosity of the debugging messages. This can be disabled

2312

at runtime by setting the environment variable @code{STARPU_SILENT} to

2313

any value.

2314

2315

@smallexample

2316

% STARPU_SILENT=1 ./vector_scal

2317

@end smallexample

2318

@end table

2319

2320

@node --enable-coverage

2321

@subsubsection @code{--enable-coverage}

2322

@table @asis

2323

@item @emph{Description}:

2324

Enable flags for the @code{gcov} coverage tool.

2325

@end table

2326

2327

@node Configuring workers

2328

@subsection Configuring workers

2329

2330

@menu

2331

* --enable-nmaxcpus::

2332

* --disable-cpu::

2333

* --enable-maxcudadev::

2334

* --disable-cuda::

2335

* --with-cuda-dir::

2336

* --with-cuda-include-dir::

2337

* --with-cuda-lib-dir::

2338

* --enable-maxopencldev::

2339

* --disable-opencl::

2340

* --with-opencl-dir::

2341

* --with-opencl-include-dir::

2342

* --with-opencl-lib-dir::

2343

* --enable-gordon::

2344

* --with-gordon-dir::

2345

@end menu

2346

2347

@node --enable-nmaxcpus

2348

@subsubsection @code{--enable-nmaxcpus=<number>}

2349

@table @asis

2350

@item @emph{Description}:

2351

Defines the maximum number of CPU cores that StarPU will support, then

2352

available as the @code{STARPU_NMAXCPUS} macro.

2353

@end table

2354

2355

@node --disable-cpu

2356

@subsubsection @code{--disable-cpu}

2357

@table @asis

2358

@item @emph{Description}:

2359

Disable the use of CPUs of the machine. Only GPUs etc. will be used.

2360

@end table

2361

2362

@node --enable-maxcudadev

2363

@subsubsection @code{--enable-maxcudadev=<number>}

2364

@table @asis

2365

@item @emph{Description}:

2366

Defines the maximum number of CUDA devices that StarPU will support, then

2367

available as the @code{STARPU_MAXCUDADEVS} macro.

2368

@end table

2369

2370

@node --disable-cuda

2371

@subsubsection @code{--disable-cuda}

2372

@table @asis

2373

@item @emph{Description}:

2374

Disable the use of CUDA, even if a valid CUDA installation was detected.

2375

@end table

2376

2377

@node --with-cuda-dir

2378

@subsubsection @code{--with-cuda-dir=<path>}

2379

@table @asis

2380

@item @emph{Description}:

2381

Specify the directory where CUDA is installed. This directory should notably contain

2382

@code{include/cuda.h}.

2383

@end table

2384

2385

@node --with-cuda-include-dir

2386

@subsubsection @code{--with-cuda-include-dir=<path>}

2387

@table @asis

2388

@item @emph{Description}:

2389

Specify the directory where CUDA headers are installed. This directory should

2390

notably contain @code{cuda.h}. This defaults to @code{/include} appended to the

2391

value given to @code{--with-cuda-dir}.

2392

@end table

2393

2394

@node --with-cuda-lib-dir

2395

@subsubsection @code{--with-cuda-lib-dir=<path>}

2396

@table @asis

2397

@item @emph{Description}:

2398

Specify the directory where the CUDA library is installed. This directory should

2399

notably contain the CUDA shared libraries (e.g. libcuda.so). This defaults to

2400

@code{/lib} appended to the value given to @code{--with-cuda-dir}.

2401

2402

@end table

2403

2404

@node --enable-maxopencldev

2405

@subsubsection @code{--enable-maxopencldev=<number>}

2406

@table @asis

2407

@item @emph{Description}:

2408

Defines the maximum number of OpenCL devices that StarPU will support, then

2409

available as the @code{STARPU_MAXOPENCLDEVS} macro.

2410

@end table

2411

2412

@node --disable-opencl

2413

@subsubsection @code{--disable-opencl}

2414

@table @asis

2415

@item @emph{Description}:

2416

Disable the use of OpenCL, even if the SDK is detected.

2417

@end table

2418

2419

@node --with-opencl-dir

2420

@subsubsection @code{--with-opencl-dir=<path>}

2421

@table @asis

2422

@item @emph{Description}:

2423

Specify the location of the OpenCL SDK. This directory should notably contain

2424

@code{include/CL/cl.h} (or @code{include/OpenCL/cl.h} on Mac OS).

2425

@end table

2426

2427

@node --with-opencl-include-dir

2428

@subsubsection @code{--with-opencl-include-dir=<path>}

2429

@table @asis

2430

@item @emph{Description}:

2431

Specify the location of OpenCL headers. This directory should notably contain

2432

@code{CL/cl.h} (or @code{OpenCL/cl.h} on Mac OS). This defaults to

2433

@code{/include} appended to the value given to @code{--with-opencl-dir}.

2434

2435

@end table

2436

2437

@node --with-opencl-lib-dir

2438

@subsubsection @code{--with-opencl-lib-dir=<path>}

2439

@table @asis

2440

@item @emph{Description}:

2441

Specify the location of the OpenCL library. This directory should notably

2442

contain the OpenCL shared libraries (e.g. libOpenCL.so). This defaults to

2443

@code{/lib} appended to the value given to @code{--with-opencl-dir}.

2444

@end table

2445

2446

@node --enable-gordon

2447

@subsubsection @code{--enable-gordon}

2448

@table @asis

2449

@item @emph{Description}:

2450

Enable the use of the Gordon runtime for Cell SPUs.

2451

@c TODO: rather default to enabled when detected

2452

@end table

2453

2454

@node --with-gordon-dir

2455

@subsubsection @code{--with-gordon-dir=<path>}

2456

@table @asis

2457

@item @emph{Description}:

2458

Specify the location of the Gordon SDK.

2459

@end table

2460

2461

@node Advanced configuration

2462

@subsection Advanced configuration

2463

2464

@menu

2465

* --enable-perf-debug::

2466

* --enable-model-debug::

2467

* --enable-stats::

2468

* --enable-maxbuffers::

2469

* --enable-allocation-cache::

2470

* --enable-opengl-render::

2471

* --enable-blas-lib::

2472

* --with-magma::

2473

* --with-fxt::

2474

* --with-perf-model-dir::

2475

* --with-mpicc::

2476

* --with-goto-dir::

2477

* --with-atlas-dir::

2478

* --with-mkl-cflags::

2479

* --with-mkl-ldflags::

2480

@end menu

2481

2482

@node --enable-perf-debug

2483

@subsubsection @code{--enable-perf-debug}

2484

@table @asis

2485

@item @emph{Description}:

2486

Enable performance debugging.

2487

@end table

2488

2489

@node --enable-model-debug

2490

@subsubsection @code{--enable-model-debug}

2491

@table @asis

2492

@item @emph{Description}:

2493

Enable performance model debugging.

2494

@end table

2495

2496

@node --enable-stats

2497

@subsubsection @code{--enable-stats}

2498

@table @asis

2499

@item @emph{Description}:

2500

Enable statistics.

2501

@end table

2502

2503

@node --enable-maxbuffers

2504

@subsubsection @code{--enable-maxbuffers=<nbuffers>}

2505

@table @asis

2506

@item @emph{Description}:

2507

Define the maximum number of buffers that tasks will be able to take

2508

as parameters, then available as the @code{STARPU_NMAXBUFS} macro.

2509

@end table

2510

2511

@node --enable-allocation-cache

2512

@subsubsection @code{--enable-allocation-cache}

2513

@table @asis

2514

@item @emph{Description}:

2515

Enable the use of a data allocation cache to avoid the cost of it with

2516

CUDA. Still experimental.

2517

@end table

2518

2519

@node --enable-opengl-render

2520

@subsubsection @code{--enable-opengl-render}

2521

@table @asis

2522

@item @emph{Description}:

2523

Enable the use of OpenGL for the rendering of some examples.

2524

@c TODO: rather default to enabled when detected

2525

@end table

2526

2527

@node --enable-blas-lib

2528

@subsubsection @code{--enable-blas-lib=<name>}

2529

@table @asis

2530

@item @emph{Description}:

2531

Specify the blas library to be used by some of the examples. The

2532

library has to be 'atlas' or 'goto'.

2533

@end table

2534

2535

@node --with-magma

2536

@subsubsection @code{--with-magma=<path>}

2537

@table @asis

2538

@item @emph{Description}:

2539

Specify where magma is installed. This directory should notably contain

2540

@code{include/magmablas.h}.

2541

@end table

2542

2543

@node --with-fxt

2544

@subsubsection @code{--with-fxt=<path>}

2545

@table @asis

2546

@item @emph{Description}:

2547

Specify the location of FxT (for generating traces and rendering them

2548

using ViTE). This directory should notably contain

2549

@code{include/fxt/fxt.h}.

2550

@c TODO add ref to other section

2551

@end table

2552

2553

@node --with-perf-model-dir

2554

@subsubsection @code{--with-perf-model-dir=<dir>}

2555

@table @asis

2556

@item @emph{Description}:

2557

Specify where performance models should be stored (instead of defaulting to the

2558

current user's home).

2559

@end table

2560

2561

@node --with-mpicc

2562

@subsubsection @code{--with-mpicc=<path to mpicc>}

2563

@table @asis

2564

@item @emph{Description}:

2565

Specify the location of the @code{mpicc} compiler to be used for starpumpi.

2566

@end table

2567

2568

@node --with-goto-dir

2569

@subsubsection @code{--with-goto-dir=<dir>}

2570

@table @asis

2571

@item @emph{Description}:

2572

Specify the location of GotoBLAS.

2573

@end table

2574

2575

@node --with-atlas-dir

2576

@subsubsection @code{--with-atlas-dir=<dir>}

2577

@table @asis

2578

@item @emph{Description}:

2579

Specify the location of ATLAS. This directory should notably contain

2580

@code{include/cblas.h}.

2581

@end table

2582

2583

@node --with-mkl-cflags

2584

@subsubsection @code{--with-mkl-cflags=<cflags>}

2585

@table @asis

2586

@item @emph{Description}:

2587

Specify the compilation flags for the MKL Library.

2588

@end table

2589

2590

@node --with-mkl-ldflags

2591

@subsubsection @code{--with-mkl-ldflags=<ldflags>}

2592

@table @asis

2593

@item @emph{Description}:

2594

Specify the linking flags for the MKL Library. Note that the

2595

@url{http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/}

2596

website provides a script to determine the linking flags.

2597

@end table

2598

2599

2600

@c ---------------------------------------------------------------------

2601

@c Environment variables

2602

@c ---------------------------------------------------------------------

2603

2604

@node Execution configuration through environment variables

2605

@section Execution configuration through environment variables

2606

2607

@menu

2608

* Workers:: Configuring workers

2609

* Scheduling:: Configuring the Scheduling engine

2610

* Misc:: Miscellaneous and debug

2611

@end menu

2612

2613

Note: the values given in @code{starpu_conf} structure passed when

2614

calling @code{starpu_init} will override the values of the environment

2615

variables.

2616

2617

@node Workers

2618

@subsection Configuring workers

2619

2620

@menu

2621

* STARPU_NCPUS:: Number of CPU workers

2622

* STARPU_NCUDA:: Number of CUDA workers

2623

* STARPU_NOPENCL:: Number of OpenCL workers

2624

* STARPU_NGORDON:: Number of SPU workers (Cell)

2625

* STARPU_WORKERS_CPUID:: Bind workers to specific CPUs

2626

* STARPU_WORKERS_CUDAID:: Select specific CUDA devices

2627

* STARPU_WORKERS_OPENCLID:: Select specific OpenCL devices

2628

@end menu

2629

2630

@node STARPU_NCPUS

2631

@subsubsection @code{STARPU_NCPUS} -- Number of CPU workers

2632

@table @asis

2633

2634

@item @emph{Description}:

2635

Specify the number of CPU workers (thus not including workers dedicated to control acceleratores). Note that by default, StarPU will not allocate

2636

more CPU workers than there are physical CPUs, and that some CPUs are used to control

2637

the accelerators.

2638

2639

@end table

2640

2641

@node STARPU_NCUDA

2642

@subsubsection @code{STARPU_NCUDA} -- Number of CUDA workers

2643

@table @asis

2644

2645

@item @emph{Description}:

2646

Specify the number of CUDA devices that StarPU can use. If

2647

@code{STARPU_NCUDA} is lower than the number of physical devices, it is

2648

possible to select which CUDA devices should be used by the means of the

2649

@code{STARPU_WORKERS_CUDAID} environment variable. By default, StarPU will

2650

create as many CUDA workers as there are CUDA devices.

2651

2652

@end table

2653

2654

@node STARPU_NOPENCL

2655

@subsubsection @code{STARPU_NOPENCL} -- Number of OpenCL workers

2656

@table @asis

2657

2658

@item @emph{Description}:

2659

OpenCL equivalent of the @code{STARPU_NCUDA} environment variable.

2660

@end table

2661

2662

@node STARPU_NGORDON

2663

@subsubsection @code{STARPU_NGORDON} -- Number of SPU workers (Cell)

2664

@table @asis

2665

2666

@item @emph{Description}:

2667

Specify the number of SPUs that StarPU can use.

2668

@end table

2669

2670

2671

@node STARPU_WORKERS_CPUID

2672

@subsubsection @code{STARPU_WORKERS_CPUID} -- Bind workers to specific CPUs

2673

@table @asis

2674

2675

@item @emph{Description}:

2676

Passing an array of integers (starting from 0) in @code{STARPU_WORKERS_CPUID}

2677

specifies on which logical CPU the different workers should be

2678

bound. For instance, if @code{STARPU_WORKERS_CPUID = "0 1 4 5"}, the first

2679

worker will be bound to logical CPU #0, the second CPU worker will be bound to

2680

logical CPU #1 and so on. Note that the logical ordering of the CPUs is either

2681

determined by the OS, or provided by the @code{hwloc} library in case it is

2682

available.

2683

2684

Note that the first workers correspond to the CUDA workers, then come the

2685

OpenCL and the SPU, and finally the CPU workers. For example if

2686

we have @code{STARPU_NCUDA=1}, @code{STARPU_NOPENCL=1}, @code{STARPU_NCPUS=2}

2687

and @code{STARPU_WORKERS_CPUID = "0 2 1 3"}, the CUDA device will be controlled

2688

by logical CPU #0, the OpenCL device will be controlled by logical CPU #2, and

2689

the logical CPUs #1 and #3 will be used by the CPU workers.

2690

2691

If the number of workers is larger than the array given in

2692

@code{STARPU_WORKERS_CPUID}, the workers are bound to the logical CPUs in a

2693

round-robin fashion: if @code{STARPU_WORKERS_CPUID = "0 1"}, the first and the

2694

third (resp. second and fourth) workers will be put on CPU #0 (resp. CPU #1).

2695

2696

This variable is ignored if the @code{use_explicit_workers_bindid} flag of the

2697

@code{starpu_conf} structure passed to @code{starpu_init} is set.

2698

2699

@end table

2700

2701

@node STARPU_WORKERS_CUDAID

2702

@subsubsection @code{STARPU_WORKERS_CUDAID} -- Select specific CUDA devices

2703

@table @asis

2704

2705

@item @emph{Description}:

2706

Similarly to the @code{STARPU_WORKERS_CPUID} environment variable, it is

2707

possible to select which CUDA devices should be used by StarPU. On a machine

2708

equipped with 4 GPUs, setting @code{STARPU_WORKERS_CUDAID = "1 3"} and

2709

@code{STARPU_NCUDA=2} specifies that 2 CUDA workers should be created, and that

2710

they should use CUDA devices #1 and #3 (the logical ordering of the devices is

2711

the one reported by CUDA).

2712

2713

This variable is ignored if the @code{use_explicit_workers_cuda_gpuid} flag of

2714

the @code{starpu_conf} structure passed to @code{starpu_init} is set.

2715

@end table

2716

2717

@node STARPU_WORKERS_OPENCLID

2718

@subsubsection @code{STARPU_WORKERS_OPENCLID} -- Select specific OpenCL devices

2719

@table @asis

2720

2721

@item @emph{Description}:

2722

OpenCL equivalent of the @code{STARPU_WORKERS_CUDAID} environment variable.

2723

2724

This variable is ignored if the @code{use_explicit_workers_opencl_gpuid} flag of

2725

the @code{starpu_conf} structure passed to @code{starpu_init} is set.

2726

@end table

2727

2728

@node Scheduling

2729

@subsection Configuring the Scheduling engine

2730

2731

@menu

2732

* STARPU_SCHED:: Scheduling policy

2733

* STARPU_CALIBRATE:: Calibrate performance models

2734

* STARPU_PREFETCH:: Use data prefetch

2735

* STARPU_SCHED_ALPHA:: Computation factor

2736

* STARPU_SCHED_BETA:: Communication factor

2737

@end menu

2738

2739

@node STARPU_SCHED

2740

@subsubsection @code{STARPU_SCHED} -- Scheduling policy

2741

@table @asis

2742

2743

@item @emph{Description}:

2744

2745

This chooses between the different scheduling policies proposed by StarPU: work

2746

random, stealing, greedy, with performance models, etc.

2747

2748

Use @code{STARPU_SCHED=help} to get the list of available schedulers.

2749

2750

@end table

2751

2752

@node STARPU_CALIBRATE

2753

@subsubsection @code{STARPU_CALIBRATE} -- Calibrate performance models

2754

@table @asis

2755

2756

@item @emph{Description}:

2757

If this variable is set to 1, the performance models are calibrated during

2758

the execution. If it is set to 2, the previous values are dropped to restart

2759

calibration from scratch. Setting this variable to 0 disable calibration, this

2760

is the default behaviour.

2761

2762

Note: this currently only applies to @code{dm}, @code{dmda} and @code{heft} scheduling policies.

2763

2764

@end table

2765

2766

@node STARPU_PREFETCH

2767

@subsubsection @code{STARPU_PREFETCH} -- Use data prefetch

2768

@table @asis

2769

2770

@item @emph{Description}:

2771

This variable indicates whether data prefetching should be enabled (0 means

2772

that it is disabled). If prefetching is enabled, when a task is scheduled to be

2773

executed e.g. on a GPU, StarPU will request an asynchronous transfer in

2774

advance, so that data is already present on the GPU when the task starts. As a

2775

result, computation and data transfers are overlapped.

2776

Note that prefetching is enabled by default in StarPU.

2777

2778

@end table

2779

2780

@node STARPU_SCHED_ALPHA

2781

@subsubsection @code{STARPU_SCHED_ALPHA} -- Computation factor

2782

@table @asis

2783

2784

@item @emph{Description}:

2785

To estimate the cost of a task StarPU takes into account the estimated

2786

computation time (obtained thanks to performance models). The alpha factor is

2787

the coefficient to be applied to it before adding it to the communication part.

2788

2789

@end table

2790

2791

@node STARPU_SCHED_BETA

2792

@subsubsection @code{STARPU_SCHED_BETA} -- Communication factor

2793

@table @asis

2794

2795

@item @emph{Description}:

2796

To estimate the cost of a task StarPU takes into account the estimated

2797

data transfer time (obtained thanks to performance models). The beta factor is

2798

the coefficient to be applied to it before adding it to the computation part.

2799

2800

@end table

2801

2802

@node Misc

2803

@subsection Miscellaneous and debug

2804

2805

@menu

2806

* STARPU_SILENT:: Disable verbose mode

2807

* STARPU_LOGFILENAME:: Select debug file name

2808

* STARPU_FXT_PREFIX:: FxT trace location

2809

* STARPU_LIMIT_GPU_MEM:: Restrict memory size on the GPUs

2810

* STARPU_GENERATE_TRACE:: Generate a Paje trace when StarPU is shut down

2811

@end menu

2812

2813

@node STARPU_SILENT

2814

@subsubsection @code{STARPU_SILENT} -- Disable verbose mode

2815

@table @asis

2816

2817

@item @emph{Description}:

2818

This variable allows to disable verbose mode at runtime when StarPU

2819

has been configured with the option @code{--enable-verbose}.

2820

@end table

2821

2822

@node STARPU_LOGFILENAME

2823

@subsubsection @code{STARPU_LOGFILENAME} -- Select debug file name

2824

@table @asis

2825

2826

@item @emph{Description}:

2827

This variable specifies in which file the debugging output should be saved to.

2828

@end table

2829

2830

@node STARPU_FXT_PREFIX

2831

@subsubsection @code{STARPU_FXT_PREFIX} -- FxT trace location

2832

@table @asis

2833

2834

@item @emph{Description}

2835

This variable specifies in which directory to save the trace generated if FxT is enabled. It needs to have a trailing '/' character.

2836

@end table

2837

2838

@node STARPU_LIMIT_GPU_MEM

2839

@subsubsection @code{STARPU_LIMIT_GPU_MEM} -- Restrict memory size on the GPUs

2840

@table @asis

2841

2842

@item @emph{Description}

2843

This variable specifies the maximum number of megabytes that should be

2844

available to the application on each GPUs. In case this value is smaller than

2845

the size of the memory of a GPU, StarPU pre-allocates a buffer to waste memory

2846

on the device. This variable is intended to be used for experimental purposes

2847

as it emulates devices that have a limited amount of memory.

2848

@end table

2849

2850

@node STARPU_GENERATE_TRACE

2851

@subsubsection @code{STARPU_GENERATE_TRACE} -- Generate a Paje trace when StarPU is shut down

2852

@table @asis

2853

2854

@item @emph{Description}

2855

When set to 1, this variable indicates that StarPU should automatically

2856

generate a Paje trace when starpu_shutdown is called.

2857

@end table

2858

2859

2860

@c ---------------------------------------------------------------------

2861

@c StarPU API

2862

@c ---------------------------------------------------------------------

2863

2864

@node StarPU API

2865

@chapter StarPU API

2866

2867

@menu

2868

* Initialization and Termination:: Initialization and Termination methods

2869

* Workers' Properties:: Methods to enumerate workers' properties

2870

* Data Library:: Methods to manipulate data

2871

* Data Interfaces::

2872

* Data Partition::

2873

* Codelets and Tasks:: Methods to construct tasks

2874

* Explicit Dependencies:: Explicit Dependencies

2875

* Implicit Data Dependencies:: Implicit Data Dependencies

2876

* Performance Model API::

2877

* Profiling API:: Profiling API

2878

* CUDA extensions:: CUDA extensions

2879

* OpenCL extensions:: OpenCL extensions

2880

* Cell extensions:: Cell extensions

2881

* Miscellaneous helpers::

2882

@end menu

2883

2884

@node Initialization and Termination

2885

@section Initialization and Termination

2886

2887

@menu

2888

* starpu_init:: Initialize StarPU

2889

* struct starpu_conf:: StarPU runtime configuration

2890

* starpu_conf_init:: Initialize starpu_conf structure

2891

* starpu_shutdown:: Terminate StarPU

2892

@end menu

2893

2894

@node starpu_init

2895

@subsection @code{starpu_init} -- Initialize StarPU

2896

@table @asis

2897

2898

@item @emph{Description}:

2899

This is StarPU initialization method, which must be called prior to any other

2900

StarPU call. It is possible to specify StarPU's configuration (e.g. scheduling

2901

policy, number of cores, ...) by passing a non-null argument. Default

2902

configuration is used if the passed argument is @code{NULL}.

2903

@item @emph{Return value}:

2904

Upon successful completion, this function returns 0. Otherwise, @code{-ENODEV}

2905

indicates that no worker was available (so that StarPU was not initialized).

2906

2907

@item @emph{Prototype}:

2908

@code{int starpu_init(struct starpu_conf *conf);}

2909

2910

@end table

2911

2912

@node struct starpu_conf

2913

@subsection @code{struct starpu_conf} -- StarPU runtime configuration

2914

2915

@table @asis

2916

@item @emph{Description}:

2917

This structure is passed to the @code{starpu_init} function in order

2918

to configure StarPU.

2919

When the default value is used, StarPU automatically selects the number

2920

of processing units and takes the default scheduling policy. This parameter

2921

overwrites the equivalent environment variables.

2922

2923

@item @emph{Fields}:

2924

@table @asis

2925

@item @code{sched_policy_name} (default = NULL):

2926

This is the name of the scheduling policy. This can also be specified with the

2927

@code{STARPU_SCHED} environment variable.

2928

@item @code{sched_policy} (default = NULL):

2929

This is the definition of the scheduling policy. This field is ignored

2930

if @code{sched_policy_name} is set.

2931

2932

@item @code{ncpus} (default = -1):

2933

This is the number of CPU cores that StarPU can use. This can also be

2934

specified with the @code{STARPU_NCPUS} environment variable.

2935

@item @code{ncuda} (default = -1):

2936

This is the number of CUDA devices that StarPU can use. This can also be

2937

specified with the @code{STARPU_NCUDA} environment variable.

2938

@item @code{nopencl} (default = -1):

2939

This is the number of OpenCL devices that StarPU can use. This can also be

2940

specified with the @code{STARPU_NOPENCL} environment variable.

2941

@item @code{nspus} (default = -1):

2942

This is the number of Cell SPUs that StarPU can use. This can also be

2943

specified with the @code{STARPU_NGORDON} environment variable.

2944

2945

@item @code{use_explicit_workers_bindid} (default = 0)

2946

If this flag is set, the @code{workers_bindid} array indicates where the

2947

different workers are bound, otherwise StarPU automatically selects where to

2948

bind the different workers unless the @code{STARPU_WORKERS_CPUID} environment

2949

variable is set. The @code{STARPU_WORKERS_CPUID} environment variable is

2950

ignored if the @code{use_explicit_workers_bindid} flag is set.

2951

@item @code{workers_bindid[STARPU_NMAXWORKERS]}

2952

If the @code{use_explicit_workers_bindid} flag is set, this array indicates

2953

where to bind the different workers. The i-th entry of the

2954

@code{workers_bindid} indicates the logical identifier of the processor which

2955

should execute the i-th worker. Note that the logical ordering of the CPUs is

2956

either determined by the OS, or provided by the @code{hwloc} library in case it

2957

is available.

2958

When this flag is set, the @ref{STARPU_WORKERS_CPUID} environment variable is

2959

ignored.

2960

2961

@item @code{use_explicit_workers_cuda_gpuid} (default = 0)

2962

If this flag is set, the CUDA workers will be attached to the CUDA devices

2963

specified in the @code{workers_cuda_gpuid} array. Otherwise, StarPU affects the

2964

CUDA devices in a round-robin fashion.

2965

When this flag is set, the @ref{STARPU_WORKERS_CUDAID} environment variable is

2966

ignored.

2967

@item @code{workers_cuda_gpuid[STARPU_NMAXWORKERS]}

2968

If the @code{use_explicit_workers_cuda_gpuid} flag is set, this array contains

2969

the logical identifiers of the CUDA devices (as used by @code{cudaGetDevice}).

2970

@item @code{use_explicit_workers_opencl_gpuid} (default = 0)

2971

If this flag is set, the OpenCL workers will be attached to the OpenCL devices

2972

specified in the @code{workers_opencl_gpuid} array. Otherwise, StarPU affects the

2973

OpenCL devices in a round-robin fashion.

2974

@item @code{workers_opencl_gpuid[STARPU_NMAXWORKERS]}:

2975

2976

@item @code{calibrate} (default = 0):

2977

If this flag is set, StarPU will calibrate the performance models when

2978

executing tasks. If this value is equal to -1, the default value is used. The

2979

default value is overwritten by the @code{STARPU_CALIBRATE} environment

2980

variable when it is set.

2981

@end table

2982

2983

@end table

2984

2985

2986

@node starpu_conf_init

2987

@subsection @code{starpu_conf_init} -- Initialize starpu_conf structure

2988

@table @asis

2989

2990

This function initializes the @code{starpu_conf} structure passed as argument

2991

with the default values. In case some configuration parameters are already

2992

specified through environment variables, @code{starpu_conf_init} initializes

2993

the fields of the structure according to the environment variables. For

2994

instance if @code{STARPU_CALIBRATE} is set, its value is put in the

2995

@code{.ncuda} field of the structure passed as argument.

2996

2997

@item @emph{Return value}:

2998

Upon successful completion, this function returns 0. Otherwise, @code{-EINVAL}

2999

indicates that the argument was NULL.

3000

3001

@item @emph{Prototype}:

3002

@code{int starpu_conf_init(struct starpu_conf *conf);}

3003

3004

@end table

3005

3006

3007

3008

@node starpu_shutdown

3009

@subsection @code{starpu_shutdown} -- Terminate StarPU

3010

@deftypefun void starpu_shutdown (void)

3011

This is StarPU termination method. It must be called at the end of the

3012

application: statistics and other post-mortem debugging information are not

3013

guaranteed to be available until this method has been called.

3014

@end deftypefun

3015

3016

@node Workers' Properties

3017

@section Workers' Properties

3018

3019

@menu

3020

* starpu_worker_get_count:: Get the number of processing units

3021

* starpu_worker_get_count_by_type:: Get the number of processing units of a given type

3022

* starpu_cpu_worker_get_count:: Get the number of CPU controlled by StarPU

3023

* starpu_cuda_worker_get_count:: Get the number of CUDA devices controlled by StarPU

3024

* starpu_opencl_worker_get_count:: Get the number of OpenCL devices controlled by StarPU

3025

* starpu_spu_worker_get_count:: Get the number of Cell SPUs controlled by StarPU

3026

* starpu_worker_get_id:: Get the identifier of the current worker

3027

* starpu_worker_get_ids_by_type:: Get the list of identifiers of workers with a given type

3028

* starpu_worker_get_devid:: Get the device identifier of a worker

3029

* starpu_worker_get_type:: Get the type of processing unit associated to a worker

3030

* starpu_worker_get_name:: Get the name of a worker

3031

* starpu_worker_get_memory_node:: Get the memory node of a worker

3032

@end menu

3033

3034

@node starpu_worker_get_count

3035

@subsection @code{starpu_worker_get_count} -- Get the number of processing units

3036

@deftypefun unsigned starpu_worker_get_count (void)

3037

This function returns the number of workers (i.e. processing units executing

3038

StarPU tasks). The returned value should be at most @code{STARPU_NMAXWORKERS}.

3039

@end deftypefun

3040

3041

@node starpu_worker_get_count_by_type

3042

@subsection @code{starpu_worker_get_count_by_type} -- Get the number of processing units of a given type

3043

@deftypefun int starpu_worker_get_count_by_type ({enum starpu_archtype} @var{type})

3044

Returns the number of workers of the type indicated by the argument. A positive

3045

(or null) value is returned in case of success, @code{-EINVAL} indicates that

3046

the type is not valid otherwise.

3047

@end deftypefun

3048

3049

@node starpu_cpu_worker_get_count

3050

@subsection @code{starpu_cpu_worker_get_count} -- Get the number of CPU controlled by StarPU

3051

@deftypefun unsigned starpu_cpu_worker_get_count (void)

3052

This function returns the number of CPUs controlled by StarPU. The returned

3053

value should be at most @code{STARPU_NMAXCPUS}.

3054

@end deftypefun

3055

3056

@node starpu_cuda_worker_get_count

3057

@subsection @code{starpu_cuda_worker_get_count} -- Get the number of CUDA devices controlled by StarPU

3058

@deftypefun unsigned starpu_cuda_worker_get_count (void)

3059

This function returns the number of CUDA devices controlled by StarPU. The returned

3060

value should be at most @code{STARPU_MAXCUDADEVS}.

3061

@end deftypefun

3062

3063

@node starpu_opencl_worker_get_count

3064

@subsection @code{starpu_opencl_worker_get_count} -- Get the number of OpenCL devices controlled by StarPU

3065

@deftypefun unsigned starpu_opencl_worker_get_count (void)

3066

This function returns the number of OpenCL devices controlled by StarPU. The returned

3067

value should be at most @code{STARPU_MAXOPENCLDEVS}.

3068

@end deftypefun

3069

3070

@node starpu_spu_worker_get_count

3071

@subsection @code{starpu_spu_worker_get_count} -- Get the number of Cell SPUs controlled by StarPU

3072

@deftypefun unsigned starpu_opencl_worker_get_count (void)

3073

This function returns the number of Cell SPUs controlled by StarPU.

3074

@end deftypefun

3075

3076

3077

@node starpu_worker_get_id

3078

@subsection @code{starpu_worker_get_id} -- Get the identifier of the current worker

3079

@deftypefun int starpu_worker_get_id (void)

3080

This function returns the identifier of the worker associated to the calling

3081

thread. The returned value is either -1 if the current context is not a StarPU

3082

worker (i.e. when called from the application outside a task or a callback), or

3083

an integer between 0 and @code{starpu_worker_get_count() - 1}.

3084

@end deftypefun

3085

3086

@node starpu_worker_get_ids_by_type

3087

@subsection @code{starpu_worker_get_ids_by_type} -- Get the list of identifiers of workers with a given type

3088

@deftypefun int starpu_worker_get_ids_by_type ({enum starpu_archtype} @var{type}, int *@var{workerids}, int @var{maxsize})

3089

Fill the workerids array with the identifiers of the workers that have the type

3090

indicated in the first argument. The maxsize argument indicates the size of the

3091

workids array. The returned value gives the number of identifiers that were put

3092

in the array. @code{-ERANGE} is returned is maxsize is lower than the number of

3093

workers with the appropriate type: in that case, the array is filled with the

3094

maxsize first elements. To avoid such overflows, the value of maxsize can be

3095

chosen by the means of the @code{starpu_worker_get_count_by_type} function, or

3096

by passing a value greater or equal to @code{STARPU_NMAXWORKERS}.

3097

@end deftypefun

3098

3099

@node starpu_worker_get_devid

3100

@subsection @code{starpu_worker_get_devid} -- Get the device identifier of a worker

3101

@deftypefun int starpu_worker_get_devid (int @var{id})

3102

This functions returns the device id of the worker associated to an identifier

3103

(as returned by the @code{starpu_worker_get_id} function). In the case of a

3104

CUDA worker, this device identifier is the logical device identifier exposed by

3105

CUDA (used by the @code{cudaGetDevice} function for instance). The device

3106

identifier of a CPU worker is the logical identifier of the core on which the

3107

worker was bound; this identifier is either provided by the OS or by the

3108

@code{hwloc} library in case it is available.

3109

@end deftypefun

3110

3111

@node starpu_worker_get_type

3112

@subsection @code{starpu_worker_get_type} -- Get the type of processing unit associated to a worker

3113

@deftypefun {enum starpu_archtype} starpu_worker_get_type (int @var{id})

3114

This function returns the type of worker associated to an identifier (as

3115

returned by the @code{starpu_worker_get_id} function). The returned value

3116

indicates the architecture of the worker: @code{STARPU_CPU_WORKER} for a CPU

3117

core, @code{STARPU_CUDA_WORKER} for a CUDA device,

3118

@code{STARPU_OPENCL_WORKER} for a OpenCL device, and

3119

@code{STARPU_GORDON_WORKER} for a Cell SPU. The value returned for an invalid

3120

identifier is unspecified.

3121

@end deftypefun

3122

3123

@node starpu_worker_get_name

3124

@subsection @code{starpu_worker_get_name} -- Get the name of a worker

3125

3126

@deftypefun void starpu_worker_get_name (int @var{id}, char *@var{dst}, size_t @var{maxlen})

3127

StarPU associates a unique human readable string to each processing unit. This

3128

function copies at most the @var{maxlen} first bytes of the unique string

3129

associated to a worker identified by its identifier @var{id} into the

3130

@var{dst} buffer. The caller is responsible for ensuring that the @var{dst}

3131

is a valid pointer to a buffer of @var{maxlen} bytes at least. Calling this

3132

function on an invalid identifier results in an unspecified behaviour.

3133

@end deftypefun

3134

3135

@node starpu_worker_get_memory_node

3136

@subsection @code{starpu_worker_get_memory_node} -- Get the memory node of a worker

3137

@deftypefun unsigned starpu_worker_get_memory_node (unsigned @var{workerid})

3138

This function returns the identifier of the memory node associated to the

3139

worker identified by @var{workerid}.

3140

@end deftypefun

3141

3142

3143

@node Data Library

3144

@section Data Library

3145

3146

This section describes the data management facilities provided by StarPU.

3147

3148

We show how to use existing data interfaces in @ref{Data Interfaces}, but developers can

3149

design their own data interfaces if required.

3150

3151

@menu

3152

* starpu_malloc:: Allocate data and pin it

3153

* starpu_access_mode:: Data access mode

3154

* unsigned memory_node:: Memory node

3155

* starpu_data_handle:: StarPU opaque data handle

3156

* void *interface:: StarPU data interface

3157

* starpu_data_register:: Register a piece of data to StarPU

3158

* starpu_data_unregister:: Unregister a piece of data from StarPU

3159

* starpu_data_invalidate:: Invalidate all data replicates

3160

* starpu_data_acquire:: Access registered data from the application

3161

* starpu_data_acquire_cb:: Access registered data from the application asynchronously

3162

* starpu_data_release:: Release registered data from the application

3163

* starpu_data_set_wt_mask:: Set the Write-Through mask

3164

* starpu_data_prefetch_on_node:: Prefetch data to a given node

3165

@end menu

3166

3167

@node starpu_malloc

3168

@subsection @code{starpu_malloc} -- Allocate data and pin it

3169

@deftypefun int starpu_malloc (void **@var{A}, size_t @var{dim})

3170

This function allocates data of the given size in main memory. It will also try to pin it in

3171

CUDA or OpenCL, so that data transfers from this buffer can be asynchronous, and

3172

thus permit data transfer and computation overlapping. The allocated buffer must

3173

be freed thanks to the @code{starpu_free} function.

3174

@end deftypefun

3175

3176

@node starpu_access_mode

3177

@subsection @code{starpu_access_mode} -- Data access mode

3178

This datatype describes a data access mode. The different available modes are:

3179

@table @asis

3180

@table @asis

3181

@item @code{STARPU_R} read-only mode.

3182

@item @code{STARPU_W} write-only mode.

3183

@item @code{STARPU_RW} read-write mode. This is equivalent to @code{STARPU_R|STARPU_W}.

3184

@item @code{STARPU_SCRATCH} scratch memory. A temporary buffer is allocated for the task, but StarPU does not enforce data consistency, i.e. each device has its own buffer, independently from each other (even for CPUs). This is useful for temporary variables. For now, no behaviour is defined concerning the relation with STARPU_R/W modes and the value provided at registration, i.e. the value of the scratch buffer is undefined at entry of the codelet function, but this is being considered for future extensions.

3185

@item @code{STARPU_REDUX} reduction mode. TODO: document, as well as @code{starpu_data_set_reduction_methods}

3186

@end table

3187

@end table

3188

3189

@node unsigned memory_node

3190

@subsection @code{unsigned memory_node} -- Memory node

3191

@table @asis

3192

@item @emph{Description}:

3193

Every worker is associated to a memory node which is a logical abstraction of

3194

the address space from which the processing unit gets its data. For instance,

3195

the memory node associated to the different CPU workers represents main memory

3196

(RAM), the memory node associated to a GPU is DRAM embedded on the device.

3197

Every memory node is identified by a logical index which is accessible from the

3198

@code{starpu_worker_get_memory_node} function. When registering a piece of data

3199

to StarPU, the specified memory node indicates where the piece of data

3200

initially resides (we also call this memory node the home node of a piece of

3201

data).

3202

@end table

3203

3204

3205

@node starpu_data_handle

3206

@subsection @code{starpu_data_handle} -- StarPU opaque data handle

3207

@table @asis

3208

@item @emph{Description}:

3209

StarPU uses @code{starpu_data_handle} as an opaque handle to manage a piece of

3210

data. Once a piece of data has been registered to StarPU, it is associated to a

3211

@code{starpu_data_handle} which keeps track of the state of the piece of data

3212

over the entire machine, so that we can maintain data consistency and locate

3213

data replicates for instance.

3214

@end table

3215

3216

@node void *interface

3217

@subsection @code{void *interface} -- StarPU data interface

3218

@table @asis

3219

@item @emph{Description}:

3220

Data management is done at a high-level in StarPU: rather than accessing a mere

3221

list of contiguous buffers, the tasks may manipulate data that are described by

3222

a high-level construct which we call data interface.

3223

3224

An example of data interface is the "vector" interface which describes a

3225

contiguous data array on a spefic memory node. This interface is a simple

3226

structure containing the number of elements in the array, the size of the

3227

elements, and the address of the array in the appropriate address space (this

3228

address may be invalid if there is no valid copy of the array in the memory

3229

node). More informations on the data interfaces provided by StarPU are

3230

given in @ref{Data Interfaces}.

3231

3232

When a piece of data managed by StarPU is used by a task, the task

3233

implementation is given a pointer to an interface describing a valid copy of

3234

the data that is accessible from the current processing unit.

3235

@end table

3236

3237

@node starpu_data_register

3238

@subsection @code{starpu_data_register} -- Register a piece of data to StarPU

3239

@deftypefun void starpu_data_register (starpu_data_handle *@var{handleptr}, uint32_t @var{home_node}, void *@var{interface}, {struct starpu_data_interface_ops_t} *@var{ops})

3240

Register a piece of data into the handle located at the @var{handleptr}

3241

address. The @var{interface} buffer contains the initial description of the

3242

data in the home node. The @var{ops} argument is a pointer to a structure

3243

describing the different methods used to manipulate this type of interface. See

3244

@ref{struct starpu_data_interface_ops_t} for more details on this structure.

3245

3246

If @code{home_node} is -1, StarPU will automatically

3247

allocate the memory when it is used for the

3248

first time in write-only mode. Once such data handle has been automatically

3249

allocated, it is possible to access it using any access mode.

3250

3251

Note that StarPU supplies a set of predefined types of interface (e.g. vector or

3252

matrix) which can be registered by the means of helper functions (e.g.

3253

@code{starpu_vector_data_register} or @code{starpu_matrix_data_register}).

3254

@end deftypefun

3255

3256

@node starpu_data_unregister

3257

@subsection @code{starpu_data_unregister} -- Unregister a piece of data from StarPU

3258

@deftypefun void starpu_data_unregister (starpu_data_handle @var{handle})

3259

This function unregisters a data handle from StarPU. If the data was

3260

automatically allocated by StarPU because the home node was -1, all

3261

automatically allocated buffers are freed. Otherwise, a valid copy of the data

3262

is put back into the home node in the buffer that was initially registered.

3263

Using a data handle that has been unregistered from StarPU results in an

3264

undefined behaviour.

3265

@end deftypefun

3266

3267

@node starpu_data_invalidate

3268

@subsection @code{starpu_data_invalidate} -- Invalidate all data replicates

3269

@deftypefun void starpu_data_invalidate (starpu_data_handle @var{handle})

3270

Destroy all replicates of the data handle. After data invalidation, the first

3271

access to the handle must be performed in write-only mode. Accessing an

3272

invalidated data in read-mode results in undefined behaviour.

3273

@end deftypefun

3274

3275

@c TODO create a specific sections about user interaction with the DSM ?

3276

3277

@node starpu_data_acquire

3278

@subsection @code{starpu_data_acquire} -- Access registered data from the application

3279

@deftypefun int starpu_data_acquire (starpu_data_handle @var{handle}, starpu_access_mode @var{mode})

3280

The application must call this function prior to accessing registered data from

3281

main memory outside tasks. StarPU ensures that the application will get an

3282

up-to-date copy of the data in main memory located where the data was

3283

originally registered, and that all concurrent accesses (e.g. from tasks) will

3284

be consistent with the access mode specified in the @var{mode} argument.

3285

@code{starpu_data_release} must be called once the application does not need to

3286

access the piece of data anymore. Note that implicit data

3287

dependencies are also enforced by @code{starpu_data_acquire}, i.e.

3288

@code{starpu_data_acquire} will wait for all tasks scheduled to work on

3289

the data, unless that they have not been disabled explictly by calling

3290

@code{starpu_data_set_default_sequential_consistency_flag} or

3291

@code{starpu_data_set_sequential_consistency_flag}.

3292

@code{starpu_data_acquire} is a blocking call, so that it cannot be called from

3293

tasks or from their callbacks (in that case, @code{starpu_data_acquire} returns

3294

@code{-EDEADLK}). Upon successful completion, this function returns 0.

3295

@end deftypefun

3296

3297

@node starpu_data_acquire_cb

3298

@subsection @code{starpu_data_acquire_cb} -- Access registered data from the application asynchronously

3299

@deftypefun int starpu_data_acquire_cb (starpu_data_handle @var{handle}, starpu_access_mode @var{mode}, void (*@var{callback})(void *), void *@var{arg})

3300

@code{starpu_data_acquire_cb} is the asynchronous equivalent of

3301

@code{starpu_data_release}. When the data specified in the first argument is

3302

available in the appropriate access mode, the callback function is executed.

3303

The application may access the requested data during the execution of this

3304

callback. The callback function must call @code{starpu_data_release} once the

3305

application does not need to access the piece of data anymore.

3306

Note that implicit data dependencies are also enforced by

3307

@code{starpu_data_acquire_cb} in case they are enabled.

3308

Contrary to @code{starpu_data_acquire}, this function is non-blocking and may

3309

be called from task callbacks. Upon successful completion, this function

3310

returns 0.

3311

@end deftypefun

3312

3313

@node starpu_data_release

3314

@subsection @code{starpu_data_release} -- Release registered data from the application

3315

@deftypefun void starpu_data_release (starpu_data_handle @var{handle})

3316

This function releases the piece of data acquired by the application either by

3317

@code{starpu_data_acquire} or by @code{starpu_data_acquire_cb}.

3318

@end deftypefun

3319

3320

@node starpu_data_set_wt_mask

3321

@subsection @code{starpu_data_set_wt_mask} -- Set the Write-Through mask

3322

@deftypefun void starpu_data_set_wt_mask (starpu_data_handle @var{handle}, uint32_t @var{wt_mask})

3323

This function sets the write-through mask of a given data, i.e. a bitmask of

3324

nodes where the data should be always replicated after modification.

3325

@end deftypefun

3326

3327

@node starpu_data_prefetch_on_node

3328

@subsection @code{starpu_data_prefetch_on_node} -- Prefetch data to a given node

3329

3330

@deftypefun int starpu_data_prefetch_on_node (starpu_data_handle @var{handle}, unsigned @var{node}, unsigned @var{async})

3331

Issue a prefetch request for a given data to a given node, i.e.

3332

requests that the data be replicated to the given node, so that it is available

3333

there for tasks. If the @var{async} parameter is 0, the call will block until

3334

the transfer is achieved, else the call will return as soon as the request is

3335

scheduled (which may however have to wait for a task completion).

3336

@end deftypefun

3337

3338

@node Data Interfaces

3339

@section Data Interfaces

3340

3341

@menu

3342

* Variable Interface::

3343

* Vector Interface::

3344

* Matrix Interface::

3345

* 3D Matrix Interface::

3346

* BCSR Interface for Sparse Matrices (Blocked Compressed Sparse Row Representation)::

3347

* CSR Interface for Sparse Matrices (Compressed Sparse Row Representation)::

3348

@end menu

3349

3350

@node Variable Interface

3351

@subsection Variable Interface

3352

3353

@table @asis

3354

@item @emph{Description}:

3355

This variant of @code{starpu_data_register} uses the variable interface,

3356

i.e. for a mere single variable. @code{ptr} is the address of the variable,

3357

and @code{elemsize} is the size of the variable.

3358

@item @emph{Prototype}:

3359

@code{void starpu_variable_data_register(starpu_data_handle *handle,

3360

uint32_t home_node,

3361

uintptr_t ptr, size_t elemsize);}

3362

@item @emph{Example}:

3363

@cartouche

3364

@smallexample

3365

float var;

3366

starpu_data_handle var_handle;

3367

starpu_variable_data_register(&var_handle, 0, (uintptr_t)&var, sizeof(var));

3368

@end smallexample

3369

@end cartouche

3370

@end table

3371

3372

@node Vector Interface

3373

@subsection Vector Interface

3374

3375

@table @asis

3376

@item @emph{Description}:

3377

This variant of @code{starpu_data_register} uses the vector interface,

3378

i.e. for mere arrays of elements. @code{ptr} is the address of the first

3379

element in the home node. @code{nx} is the number of elements in the vector.

3380

@code{elemsize} is the size of each element.

3381

@item @emph{Prototype}:

3382

@code{void starpu_vector_data_register(starpu_data_handle *handle, uint32_t home_node,

3383

uintptr_t ptr, uint32_t nx, size_t elemsize);}

3384

@item @emph{Example}:

3385

@cartouche

3386

@smallexample

3387

float vector[NX];

3388

starpu_data_handle vector_handle;

3389

starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX,

3390

sizeof(vector[0]));

3391

@end smallexample

3392

@end cartouche

3393

@end table

3394

3395

@node Matrix Interface

3396

@subsection Matrix Interface

3397

3398

@table @asis

3399

@item @emph{Description}:

3400

This variant of @code{starpu_data_register} uses the matrix interface, i.e. for

3401

matrices of elements. @code{ptr} is the address of the first element in the home

3402

node. @code{ld} is the number of elements between rows. @code{nx} is the number

3403

of elements in a row (this can be different from @code{ld} if there are extra

3404

elements for alignment for instance). @code{ny} is the number of rows.

3405

@code{elemsize} is the size of each element.

3406

@item @emph{Prototype}:

3407

@code{void starpu_matrix_data_register(starpu_data_handle *handle, uint32_t home_node,

3408

uintptr_t ptr, uint32_t ld, uint32_t nx,

3409

uint32_t ny, size_t elemsize);}

3410

@item @emph{Example}:

3411

@cartouche

3412

@smallexample

3413

float *matrix;

3414

starpu_data_handle matrix_handle;

3415

matrix = (float*)malloc(width * height * sizeof(float));

3416

starpu_matrix_data_register(&matrix_handle, 0, (uintptr_t)matrix,

3417

width, width, height, sizeof(float));

3418

@end smallexample

3419

@end cartouche

3420

@end table

3421

3422

@node 3D Matrix Interface

3423

@subsection 3D Matrix Interface

3424

3425

@table @asis

3426

@item @emph{Description}:

3427

This variant of @code{starpu_data_register} uses the 3D matrix interface.

3428

@code{ptr} is the address of the array of first element in the home node.

3429

@code{ldy} is the number of elements between rows. @code{ldz} is the number

3430

of rows between z planes. @code{nx} is the number of elements in a row (this

3431

can be different from @code{ldy} if there are extra elements for alignment

3432

for instance). @code{ny} is the number of rows in a z plane (likewise with

3433

@code{ldz}). @code{nz} is the number of z planes. @code{elemsize} is the size of

3434

each element.

3435

@item @emph{Prototype}:

3436

@code{void starpu_block_data_register(starpu_data_handle *handle, uint32_t home_node,

3437

uintptr_t ptr, uint32_t ldy, uint32_t ldz, uint32_t nx,

3438

uint32_t ny, uint32_t nz, size_t elemsize);}

3439

@item @emph{Example}:

3440

@cartouche

3441

@smallexample

3442

float *block;

3443

starpu_data_handle block_handle;

3444

block = (float*)malloc(nx*ny*nz*sizeof(float));

3445

starpu_block_data_register(&block_handle, 0, (uintptr_t)block,

3446

nx, nx*ny, nx, ny, nz, sizeof(float));

3447

@end smallexample

3448

@end cartouche

3449

@end table

3450

3451

@node BCSR Interface for Sparse Matrices (Blocked Compressed Sparse Row Representation)

3452

@subsection BCSR Interface for Sparse Matrices (Blocked Compressed Sparse Row Representation)

3453

3454

@deftypefun void starpu_bcsr_data_register (starpu_data_handle *@var{handle}, uint32_t @var{home_node}, uint32_t @var{nnz}, uint32_t @var{nrow}, uintptr_t @var{nzval}, uint32_t *@var{colind}, uint32_t *@var{rowptr}, uint32_t @var{firstentry}, uint32_t @var{r}, uint32_t @var{c}, size_t @var{elemsize})

3455

This variant of @code{starpu_data_register} uses the BCSR sparse matrix interface.

3456

TODO

3457

@end deftypefun

3458

3459

@node CSR Interface for Sparse Matrices (Compressed Sparse Row Representation)

3460

@subsection CSR Interface for Sparse Matrices (Compressed Sparse Row Representation)

3461

3462

@deftypefun void starpu_csr_data_register (starpu_data_handle *@var{handle}, uint32_t @var{home_node}, uint32_t @var{nnz}, uint32_t @var{nrow}, uintptr_t @var{nzval}, uint32_t *@var{colind}, uint32_t *@var{rowptr}, uint32_t @var{firstentry}, size_t @var{elemsize})

3463

This variant of @code{starpu_data_register} uses the CSR sparse matrix interface.

3464

TODO

3465

@end deftypefun

3466

3467

@node Data Partition

3468

@section Data Partition

3469

3470

@menu

3471

* struct starpu_data_filter:: StarPU filter structure

3472

* starpu_data_partition:: Partition Data

3473

* starpu_data_unpartition:: Unpartition Data

3474

* starpu_data_get_nb_children::

3475

* starpu_data_get_sub_data::

3476

* Predefined filter functions::

3477

@end menu

3478

3479

@node struct starpu_data_filter

3480

@subsection @code{struct starpu_data_filter} -- StarPU filter structure

3481

@table @asis

3482

@item @emph{Description}:

3483

The filter structure describes a data partitioning operation, to be given to the

3484

@code{starpu_data_partition} function, see @ref{starpu_data_partition} for an example.

3485

@item @emph{Fields}:

3486

@table @asis

3487

@item @code{filter_func}:

3488

This function fills the @code{child_interface} structure with interface

3489

information for the @code{id}-th child of the parent @code{father_interface} (among @code{nparts}).

3490

@code{void (*filter_func)(void *father_interface, void* child_interface, struct starpu_data_filter *, unsigned id, unsigned nparts);}

3491

@item @code{nchildren}:

3492

This is the number of parts to partition the data into.

3493

@item @code{get_nchildren}:

3494

This returns the number of children. This can be used instead of @code{nchildren} when the number of

3495

children depends on the actual data (e.g. the number of blocks in a sparse

3496

matrix).

3497

@code{unsigned (*get_nchildren)(struct starpu_data_filter *, starpu_data_handle initial_handle);}

3498

@item @code{get_child_ops}:

3499

In case the resulting children use a different data interface, this function

3500

returns which interface is used by child number @code{id}.

3501

@code{struct starpu_data_interface_ops_t *(*get_child_ops)(struct starpu_data_filter *, unsigned id);}

3502

@item @code{filter_arg}:

3503

Some filters take an addition parameter, but this is usually unused.

3504

@item @code{filter_arg_ptr}:

3505

Some filters take an additional array parameter like the sizes of the parts, but

3506

this is usually unused.

3507

@end table

3508

@end table

3509

3510

@node starpu_data_partition

3511

@subsection starpu_data_partition -- Partition Data

3512

3513

@table @asis

3514

@item @emph{Description}:

3515

This requests partitioning one StarPU data @code{initial_handle} into several

3516

subdata according to the filter @code{f}

3517

@item @emph{Prototype}:

3518

@code{void starpu_data_partition(starpu_data_handle initial_handle, struct starpu_data_filter *f);}

3519

@item @emph{Example}:

3520

@cartouche

3521

@smallexample

3522

struct starpu_data_filter f = @{

3523

.filter_func = starpu_vertical_block_filter_func,

3524

.nchildren = nslicesx,

3525

.get_nchildren = NULL,

3526

.get_child_ops = NULL

3527

@};

3528

starpu_data_partition(A_handle, &f);

3529

@end smallexample

3530

@end cartouche

3531

@end table

3532

3533

@node starpu_data_unpartition

3534

@subsection starpu_data_unpartition -- Unpartition data

3535

3536

@table @asis

3537

@item @emph{Description}:

3538

This unapplies one filter, thus unpartitioning the data. The pieces of data are

3539

collected back into one big piece in the @code{gathering_node} (usually 0).

3540

@item @emph{Prototype}:

3541

@code{void starpu_data_unpartition(starpu_data_handle root_data, uint32_t gathering_node);}

3542

@item @emph{Example}:

3543

@cartouche

3544

@smallexample

3545

starpu_data_unpartition(A_handle, 0);

3546

@end smallexample

3547

@end cartouche

3548

@end table

3549

3550

@node starpu_data_get_nb_children

3551

@subsection starpu_data_get_nb_children

3552

3553

@table @asis

3554

@item @emph{Description}:

3555

This function returns the number of children.

3556

@item @emph{Return value}:

3557

The number of children.

3558

@item @emph{Prototype}:

3559

@code{int starpu_data_get_nb_children(starpu_data_handle handle);}

3560

@end table

3561

3562

@c starpu_data_handle starpu_data_get_child(starpu_data_handle handle, unsigned i);

3563

3564

@node starpu_data_get_sub_data

3565

@subsection starpu_data_get_sub_data

3566

3567

@table @asis

3568

@item @emph{Description}:

3569

After partitioning a StarPU data by applying a filter,

3570

@code{starpu_data_get_sub_data} can be used to get handles for each of the data

3571

portions. @code{root_data} is the parent data that was partitioned. @code{depth}

3572

is the number of filters to traverse (in case several filters have been applied,

3573

to e.g. partition in row blocks, and then in column blocks), and the subsequent

3574

parameters are the indexes.

3575

@item @emph{Return value}:

3576

A handle to the subdata.

3577

@item @emph{Prototype}:

3578

@code{starpu_data_handle starpu_data_get_sub_data(starpu_data_handle root_data, unsigned depth, ... );}

3579

@item @emph{Example}:

3580

@cartouche

3581

@smallexample

3582

h = starpu_data_get_sub_data(A_handle, 1, taskx);

3583

@end smallexample

3584

@end cartouche

3585

@end table

3586

3587

@node Predefined filter functions

3588

@subsection Predefined filter functions

3589

3590

@menu

3591

* Partitioning BCSR Data::

3592

* Partitioning BLAS interface::

3593

* Partitioning Vector Data::

3594

* Partitioning Block Data::

3595

@end menu

3596

3597

This section gives a partial list of the predefined partitioning functions.

3598

Examples on how to use them are shown in @ref{Partitioning Data}. The complete

3599

list can be found in @code{starpu_data_filters.h} .

3600

3601

@node Partitioning BCSR Data

3602

@subsubsection Partitioning BCSR Data

3603

3604

@deftypefun void starpu_canonical_block_filter_bcsr (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})

3605

TODO

3606

@end deftypefun

3607

3608

@deftypefun void starpu_vertical_block_filter_func_csr (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})

3609

TODO

3610

@end deftypefun

3611

3612

@node Partitioning BLAS interface

3613

@subsubsection Partitioning BLAS interface

3614

3615

@deftypefun void starpu_block_filter_func (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})

3616

This partitions a dense Matrix into horizontal blocks.

3617

@end deftypefun

3618

3619

@deftypefun void starpu_vertical_block_filter_func (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})

3620

This partitions a dense Matrix into vertical blocks.

3621

@end deftypefun

3622

3623

@node Partitioning Vector Data

3624

@subsubsection Partitioning Vector Data

3625

3626

@deftypefun void starpu_block_filter_func_vector (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})

3627

This partitions a vector into blocks of the same size.

3628

@end deftypefun

3629

3630

3631

@deftypefun void starpu_vector_list_filter_func (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})

3632

This partitions a vector into blocks of sizes given in @var{filter_arg_ptr}.

3633

@end deftypefun

3634

3635

@deftypefun void starpu_vector_divide_in_2_filter_func (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})

3636

This partitions a vector into two blocks, the first block size being given in @var{filter_arg}.

3637

@end deftypefun

3638

3639

3640

@node Partitioning Block Data

3641

@subsubsection Partitioning Block Data

3642

3643

@deftypefun void starpu_block_filter_func_block (void *@var{father_interface}, void *@var{child_interface}, {struct starpu_data_filter} *@var{f}, unsigned @var{id}, unsigned @var{nparts})

3644

This partitions a 3D matrix along the X axis.

3645

@end deftypefun

3646

3647

@node Codelets and Tasks

3648

@section Codelets and Tasks

3649

3650

This section describes the interface to manipulate codelets and tasks.

3651

3652

@deftp {Data Type} {struct starpu_codelet}

3653

The codelet structure describes a kernel that is possibly implemented on various

3654

targets. For compatibility, make sure to initialize the whole structure to zero.

3655

3656

@table @asis

3657

@item @code{where}

3658

Indicates which types of processing units are able to execute the codelet.

3659

@code{STARPU_CPU|STARPU_CUDA} for instance indicates that the codelet is

3660

implemented for both CPU cores and CUDA devices while @code{STARPU_GORDON}

3661

indicates that it is only available on Cell SPUs.

3662

3663

@item @code{cpu_func} (optional)

3664

Is a function pointer to the CPU implementation of the codelet. Its prototype

3665

must be: @code{void cpu_func(void *buffers[], void *cl_arg)}. The first

3666

argument being the array of data managed by the data management library, and

3667

the second argument is a pointer to the argument passed from the @code{cl_arg}

3668

field of the @code{starpu_task} structure.

3669

The @code{cpu_func} field is ignored if @code{STARPU_CPU} does not appear in

3670

the @code{where} field, it must be non-null otherwise.

3671

3672

@item @code{cuda_func} (optional)

3673

Is a function pointer to the CUDA implementation of the codelet. @emph{This

3674

must be a host-function written in the CUDA runtime API}. Its prototype must

3675

be: @code{void cuda_func(void *buffers[], void *cl_arg);}. The @code{cuda_func}

3676

field is ignored if @code{STARPU_CUDA} does not appear in the @code{where}

3677

field, it must be non-null otherwise.

3678

3679

@item @code{opencl_func} (optional)

3680

Is a function pointer to the OpenCL implementation of the codelet. Its

3681

prototype must be:

3682

@code{void opencl_func(starpu_data_interface_t *descr, void *arg);}.

3683

This pointer is ignored if @code{STARPU_OPENCL} does not appear in the

3684

@code{where} field, it must be non-null otherwise.

3685

3686

@item @code{gordon_func} (optional)

3687

This is the index of the Cell SPU implementation within the Gordon library.

3688

See Gordon documentation for more details on how to register a kernel and

3689

retrieve its index.

3690

3691

@item @code{nbuffers}

3692

Specifies the number of arguments taken by the codelet. These arguments are

3693

managed by the DSM and are accessed from the @code{void *buffers[]}

3694

array. The constant argument passed with the @code{cl_arg} field of the

3695

@code{starpu_task} structure is not counted in this number. This value should

3696

not be above @code{STARPU_NMAXBUFS}.

3697

3698

@item @code{model} (optional)

3699

This is a pointer to the task duration performance model associated to this

3700

codelet. This optional field is ignored when set to @code{NULL}.

3701

3702

TODO

3703

3704

@item @code{power_model} (optional)

3705

This is a pointer to the task power consumption performance model associated

3706

to this codelet. This optional field is ignored when set to @code{NULL}.

3707

In the case of parallel codelets, this has to account for all processing units

3708

involved in the parallel execution.

3709

3710

TODO

3711

3712

@end table

3713

@end deftp

3714

3715

@deftp {Data Type} {struct starpu_task}

3716

The @code{starpu_task} structure describes a task that can be offloaded on the various

3717

processing units managed by StarPU. It instantiates a codelet. It can either be

3718

allocated dynamically with the @code{starpu_task_create} method, or declared

3719

statically. In the latter case, the programmer has to zero the

3720

@code{starpu_task} structure and to fill the different fields properly. The

3721

indicated default values correspond to the configuration of a task allocated

3722

with @code{starpu_task_create}.

3723

3724

@table @asis

3725

@item @code{cl}

3726

Is a pointer to the corresponding @code{starpu_codelet} data structure. This

3727

describes where the kernel should be executed, and supplies the appropriate

3728

implementations. When set to @code{NULL}, no code is executed during the tasks,

3729

such empty tasks can be useful for synchronization purposes.

3730

3731

@item @code{buffers}

3732

Is an array of @code{starpu_buffer_descr_t} structures. It describes the

3733

different pieces of data accessed by the task, and how they should be accessed.

3734

The @code{starpu_buffer_descr_t} structure is composed of two fields, the

3735

@code{handle} field specifies the handle of the piece of data, and the

3736

@code{mode} field is the required access mode (eg @code{STARPU_RW}). The number

3737

of entries in this array must be specified in the @code{nbuffers} field of the

3738

@code{starpu_codelet} structure, and should not excede @code{STARPU_NMAXBUFS}.

3739

If unsufficient, this value can be set with the @code{--enable-maxbuffers}

3740

option when configuring StarPU.

3741

3742

@item @code{cl_arg} (optional; default: @code{NULL})

3743

This pointer is passed to the codelet through the second argument

3744

of the codelet implementation (e.g. @code{cpu_func} or @code{cuda_func}).

3745

In the specific case of the Cell processor, see the @code{cl_arg_size}

3746

argument.

3747

3748

@item @code{cl_arg_size} (optional, Cell-specific)

3749

In the case of the Cell processor, the @code{cl_arg} pointer is not directly

3750

given to the SPU function. A buffer of size @code{cl_arg_size} is allocated on

3751

the SPU. This buffer is then filled with the @code{cl_arg_size} bytes starting

3752

at address @code{cl_arg}. In this case, the argument given to the SPU codelet

3753

is therefore not the @code{cl_arg} pointer, but the address of the buffer in

3754

local store (LS) instead. This field is ignored for CPU, CUDA and OpenCL

3755

codelets, where the @code{cl_arg} pointer is given as such.

3756

3757

@item @code{callback_func} (optional) (default: @code{NULL})

3758

This is a function pointer of prototype @code{void (*f)(void *)} which

3759

specifies a possible callback. If this pointer is non-null, the callback

3760

function is executed @emph{on the host} after the execution of the task. The

3761

callback is passed the value contained in the @code{callback_arg} field. No

3762

callback is executed if the field is set to @code{NULL}.

3763

3764

@item @code{callback_arg} (optional) (default: @code{NULL})

3765

This is the pointer passed to the callback function. This field is ignored if

3766

the @code{callback_func} is set to @code{NULL}.

3767

3768

@item @code{use_tag} (optional) (default: @code{0})

3769

If set, this flag indicates that the task should be associated with the tag

3770

contained in the @code{tag_id} field. Tag allow the application to synchronize

3771

with the task and to express task dependencies easily.

3772

3773

@item @code{tag_id}

3774

This fields contains the tag associated to the task if the @code{use_tag} field

3775

was set, it is ignored otherwise.

3776

3777

@item @code{synchronous}

3778

If this flag is set, the @code{starpu_task_submit} function is blocking and

3779

returns only when the task has been executed (or if no worker is able to

3780

process the task). Otherwise, @code{starpu_task_submit} returns immediately.

3781

3782

@item @code{priority} (optional) (default: @code{STARPU_DEFAULT_PRIO})

3783

This field indicates a level of priority for the task. This is an integer value

3784

that must be set between the return values of the

3785

@code{starpu_sched_get_min_priority} function for the least important tasks,

3786

and that of the @code{starpu_sched_get_max_priority} for the most important

3787

tasks (included). The @code{STARPU_MIN_PRIO} and @code{STARPU_MAX_PRIO} macros

3788

are provided for convenience and respectively returns value of

3789

@code{starpu_sched_get_min_priority} and @code{starpu_sched_get_max_priority}.

3790

Default priority is @code{STARPU_DEFAULT_PRIO}, which is always defined as 0 in

3791

order to allow static task initialization. Scheduling strategies that take

3792

priorities into account can use this parameter to take better scheduling

3793

decisions, but the scheduling policy may also ignore it.

3794

3795

@item @code{execute_on_a_specific_worker} (default: @code{0})

3796

If this flag is set, StarPU will bypass the scheduler and directly affect this

3797

task to the worker specified by the @code{workerid} field.

3798

3799

@item @code{workerid} (optional)

3800

If the @code{execute_on_a_specific_worker} field is set, this field indicates

3801

which is the identifier of the worker that should process this task (as

3802

returned by @code{starpu_worker_get_id}). This field is ignored if

3803

@code{execute_on_a_specific_worker} field is set to 0.

3804

3805

@item @code{detach} (optional) (default: @code{1})

3806

If this flag is set, it is not possible to synchronize with the task

3807

by the means of @code{starpu_task_wait} later on. Internal data structures

3808

are only guaranteed to be freed once @code{starpu_task_wait} is called if the

3809

flag is not set.

3810

3811

@item @code{destroy} (optional) (default: @code{1})

3812

If this flag is set, the task structure will automatically be freed, either

3813

after the execution of the callback if the task is detached, or during

3814

@code{starpu_task_wait} otherwise. If this flag is not set, dynamically

3815

allocated data structures will not be freed until @code{starpu_task_destroy} is

3816

called explicitly. Setting this flag for a statically allocated task structure

3817

will result in undefined behaviour.

3818

3819

@item @code{predicted} (output field)

3820

Predicted duration of the task. This field is only set if the scheduling

3821

strategy used performance models.

3822

3823

@end table

3824

@end deftp

3825

3826

@deftypefun void starpu_task_init ({struct starpu_task} *@var{task})

3827

Initialize @var{task} with default values. This function is implicitly

3828

called by @code{starpu_task_create}. By default, tasks initialized with

3829

@code{starpu_task_init} must be deinitialized explicitly with

3830

@code{starpu_task_deinit}. Tasks can also be initialized statically, using the

3831

constant @code{STARPU_TASK_INITIALIZER}.

3832

@end deftypefun

3833

3834

@deftypefun {struct starpu_task *} starpu_task_create (void)

3835

Allocate a task structure and initialize it with default values. Tasks

3836

allocated dynamically with @code{starpu_task_create} are automatically freed when the

3837

task is terminated. If the destroy flag is explicitly unset, the resources used

3838

by the task are freed by calling

3839

@code{starpu_task_destroy}.

3840

@end deftypefun

3841

3842

@deftypefun void starpu_task_deinit ({struct starpu_task} *@var{task})

3843

Release all the structures automatically allocated to execute @var{task}. This is

3844

called automatically by @code{starpu_task_destroy}, but the task structure itself is not

3845

freed. This should be used for statically allocated tasks for instance.

3846

@end deftypefun

3847

3848

@deftypefun void starpu_task_destroy ({struct starpu_task} *@var{task})

3849

Free the resource allocated during @code{starpu_task_create} and

3850

associated with @var{task}. This function can be called automatically

3851

after the execution of a task by setting the @code{destroy} flag of the

3852

@code{starpu_task} structure (default behaviour). Calling this function

3853

on a statically allocated task results in an undefined behaviour.

3854

@end deftypefun

3855

3856

@deftypefun int starpu_task_wait ({struct starpu_task} *@var{task})

3857

This function blocks until @var{task} has been executed. It is not possible to

3858

synchronize with a task more than once. It is not possible to wait for

3859

synchronous or detached tasks.

3860

3861

Upon successful completion, this function returns 0. Otherwise, @code{-EINVAL}

3862

indicates that the specified task was either synchronous or detached.

3863

@end deftypefun

3864

3865

@deftypefun int starpu_task_submit ({struct starpu_task} *@var{task})

3866

This function submits @var{task} to StarPU. Calling this function does

3867

not mean that the task will be executed immediately as there can be data or task

3868

(tag) dependencies that are not fulfilled yet: StarPU will take care of

3869

scheduling this task with respect to such dependencies.

3870

This function returns immediately if the @code{synchronous} field of the

3871

@code{starpu_task} structure was set to 0, and block until the termination of

3872

the task otherwise. It is also possible to synchronize the application with

3873

asynchronous tasks by the means of tags, using the @code{starpu_tag_wait}

3874

function for instance.

3875

3876

In case of success, this function returns 0, a return value of @code{-ENODEV}

3877

means that there is no worker able to process this task (e.g. there is no GPU

3878

available and this task is only implemented for CUDA devices).

3879

@end deftypefun

3880

3881

@deftypefun int starpu_task_wait_for_all (void)

3882

This function blocks until all the tasks that were submitted are terminated.

3883

@end deftypefun

3884

3885

@deftypefun {struct starpu_task *} starpu_get_current_task (void)

3886

This function returns the task currently executed by the worker, or

3887

NULL if it is called either from a thread that is not a task or simply

3888

because there is no task being executed at the moment.

3889

@end deftypefun

3890

3891

@deftypefun void starpu_display_codelet_stats ({struct starpu_codelet_t} *@var{cl})

3892

Output on @code{stderr} some statistics on the codelet @var{cl}.

3893

@end deftypefun

3894

3895

3896

@c Callbacks : what can we put in callbacks ?

3897

3898

@node Explicit Dependencies

3899

@section Explicit Dependencies

3900

3901

@menu

3902

* starpu_task_declare_deps_array:: starpu_task_declare_deps_array

3903

* starpu_tag_t:: Task logical identifier

3904

* starpu_tag_declare_deps:: Declare the Dependencies of a Tag

3905

* starpu_tag_declare_deps_array:: Declare the Dependencies of a Tag

3906

* starpu_tag_wait:: Block until a Tag is terminated

3907

* starpu_tag_wait_array:: Block until a set of Tags is terminated

3908

* starpu_tag_remove:: Destroy a Tag

3909

* starpu_tag_notify_from_apps:: Feed a tag explicitly

3910

@end menu

3911

3912

@node starpu_task_declare_deps_array

3913

@subsection @code{starpu_task_declare_deps_array} -- Declare task dependencies

3914

@deftypefun void starpu_task_declare_deps_array ({struct starpu_task} *@var{task}, unsigned @var{ndeps}, {struct starpu_task} *@var{task_array}[])

3915

Declare task dependencies between a @var{task} and an array of tasks of length

3916

@var{ndeps}. This function must be called prior to the submission of the task,

3917

but it may called after the submission or the execution of the tasks in the

3918

array provided the tasks are still valid (ie. they were not automatically

3919

destroyed). Calling this function on a task that was already submitted or with

3920

an entry of @var{task_array} that is not a valid task anymore results in an

3921

undefined behaviour. If @var{ndeps} is null, no dependency is added. It is

3922

possible to call @code{starpu_task_declare_deps_array} multiple times on the

3923

same task, in this case, the dependencies are added. It is possible to have

3924

redundancy in the task dependencies.

3925

@end deftypefun

3926

3927

3928

3929

@node starpu_tag_t

3930

@subsection @code{starpu_tag_t} -- Task logical identifier

3931

@table @asis

3932

@item @emph{Description}:

3933

It is possible to associate a task with a unique ``tag'' chosen by the application, and to express

3934

dependencies between tasks by the means of those tags. To do so, fill the

3935

@code{tag_id} field of the @code{starpu_task} structure with a tag number (can

3936

be arbitrary) and set the @code{use_tag} field to 1.

3937

3938

If @code{starpu_tag_declare_deps} is called with this tag number, the task will

3939

not be started until the tasks which holds the declared dependency tags are

3940

completed.

3941

@end table

3942

3943

@node starpu_tag_declare_deps

3944

@subsection @code{starpu_tag_declare_deps} -- Declare the Dependencies of a Tag

3945

@table @asis

3946

@item @emph{Description}:

3947

Specify the dependencies of the task identified by tag @code{id}. The first

3948

argument specifies the tag which is configured, the second argument gives the

3949

number of tag(s) on which @code{id} depends. The following arguments are the

3950

tags which have to be terminated to unlock the task.

3951

3952

This function must be called before the associated task is submitted to StarPU

3953

with @code{starpu_task_submit}.

3954

3955

@item @emph{Remark}

3956

Because of the variable arity of @code{starpu_tag_declare_deps}, note that the

3957

last arguments @emph{must} be of type @code{starpu_tag_t}: constant values

3958

typically need to be explicitly casted. Using the

3959

@code{starpu_tag_declare_deps_array} function avoids this hazard.

3960

3961

@item @emph{Prototype}:

3962

@code{void starpu_tag_declare_deps(starpu_tag_t id, unsigned ndeps, ...);}

3963

3964

@item @emph{Example}:

3965

@cartouche

3966

@example

3967

/* Tag 0x1 depends on tags 0x32 and 0x52 */

3968

starpu_tag_declare_deps((starpu_tag_t)0x1,

3969

2, (starpu_tag_t)0x32, (starpu_tag_t)0x52);

3970

@end example

3971

@end cartouche

3972

3973

@end table

3974

3975

@node starpu_tag_declare_deps_array

3976

@subsection @code{starpu_tag_declare_deps_array} -- Declare the Dependencies of a Tag

3977

@table @asis

3978

@item @emph{Description}:

3979

This function is similar to @code{starpu_tag_declare_deps}, except that its

3980

does not take a variable number of arguments but an array of tags of size

3981

@code{ndeps}.

3982

@item @emph{Prototype}:

3983

@code{void starpu_tag_declare_deps_array(starpu_tag_t id, unsigned ndeps, starpu_tag_t *array);}

3984

@item @emph{Example}:

3985

@cartouche

3986

@example

3987

/* Tag 0x1 depends on tags 0x32 and 0x52 */

3988

starpu_tag_t tag_array[2] = @{0x32, 0x52@};

3989

starpu_tag_declare_deps_array((starpu_tag_t)0x1, 2, tag_array);

3990

@end example

3991

@end cartouche

3992

3993

3994

@end table

3995

3996

3997

@node starpu_tag_wait

3998

@subsection @code{starpu_tag_wait} -- Block until a Tag is terminated

3999

@deftypefun void starpu_tag_wait (starpu_tag_t @var{id})

4000

This function blocks until the task associated to tag @var{id} has been

4001

executed. This is a blocking call which must therefore not be called within

4002

tasks or callbacks, but only from the application directly. It is possible to

4003

synchronize with the same tag multiple times, as long as the

4004

@code{starpu_tag_remove} function is not called. Note that it is still

4005

possible to synchronize with a tag associated to a task which @code{starpu_task}

4006

data structure was freed (e.g. if the @code{destroy} flag of the

4007

@code{starpu_task} was enabled).

4008

@end deftypefun

4009

4010

@node starpu_tag_wait_array

4011

@subsection @code{starpu_tag_wait_array} -- Block until a set of Tags is terminated

4012

@deftypefun void starpu_tag_wait_array (unsigned @var{ntags}, starpu_tag_t *@var{id})

4013

This function is similar to @code{starpu_tag_wait} except that it blocks until

4014

@emph{all} the @var{ntags} tags contained in the @var{id} array are

4015

terminated.

4016

@end deftypefun

4017

4018

@node starpu_tag_remove

4019

@subsection @code{starpu_tag_remove} -- Destroy a Tag

4020

@deftypefun void starpu_tag_remove (starpu_tag_t @var{id})

4021

This function releases the resources associated to tag @var{id}. It can be

4022

called once the corresponding task has been executed and when there is

4023

no other tag that depend on this tag anymore.

4024

@end deftypefun

4025

4026

@node starpu_tag_notify_from_apps

4027

@subsection @code{starpu_tag_notify_from_apps} -- Feed a Tag explicitly

4028

@deftypefun void starpu_tag_notify_from_apps (starpu_tag_t @var{id})

4029

This function explicitly unlocks tag @var{id}. It may be useful in the

4030

case of applications which execute part of their computation outside StarPU

4031

tasks (e.g. third-party libraries). It is also provided as a

4032

convenient tool for the programmer, for instance to entirely construct the task

4033

DAG before actually giving StarPU the opportunity to execute the tasks.

4034

@end deftypefun

4035

4036

@node Implicit Data Dependencies

4037

@section Implicit Data Dependencies

4038

4039

@menu

4040

* starpu_data_set_default_sequential_consistency_flag:: starpu_data_set_default_sequential_consistency_flag

4041

* starpu_data_get_default_sequential_consistency_flag:: starpu_data_get_default_sequential_consistency_flag

4042

* starpu_data_set_sequential_consistency_flag:: starpu_data_set_sequential_consistency_flag

4043

@end menu

4044

4045

In this section, we describe how StarPU makes it possible to insert implicit

4046

task dependencies in order to enforce sequential data consistency. When this

4047

data consistency is enabled on a specific data handle, any data access will

4048

appear as sequentially consistent from the application. For instance, if the

4049

application submits two tasks that access the same piece of data in read-only

4050

mode, and then a third task that access it in write mode, dependencies will be

4051

added between the two first tasks and the third one. Implicit data dependencies

4052

are also inserted in the case of data accesses from the application.

4053

4054

@node starpu_data_set_default_sequential_consistency_flag

4055

@subsection @code{starpu_data_set_default_sequential_consistency_flag} -- Set default sequential consistency flag

4056

@deftypefun void starpu_data_set_default_sequential_consistency_flag (unsigned @var{flag})

4057

Set the default sequential consistency flag. If a non-zero value is passed, a

4058

sequential data consistency will be enforced for all handles registered after

4059

this function call, otherwise it is disabled. By default, StarPU enables

4060

sequential data consistency. It is also possible to select the data consistency

4061

mode of a specific data handle with the

4062

@code{starpu_data_set_sequential_consistency_flag} function.

4063

@end deftypefun

4064

4065

@node starpu_data_get_default_sequential_consistency_flag

4066

@subsection @code{starpu_data_get_default_sequential_consistency_flag} -- Get current default sequential consistency flag

4067

@deftypefun unsigned starpu_data_set_default_sequential_consistency_flag (void)

4068

This function returns the current default sequential consistency flag.

4069

@end deftypefun

4070

4071

@node starpu_data_set_sequential_consistency_flag

4072

@subsection @code{starpu_data_set_sequential_consistency_flag} -- Set data sequential consistency mode

4073

@deftypefun void starpu_data_set_sequential_consistency_flag (starpu_data_handle @var{handle}, unsigned @var{flag})

4074

Select the data consistency mode associated to a data handle. The consistency

4075

mode set using this function has the priority over the default mode which can

4076

be set with @code{starpu_data_set_sequential_consistency_flag}.

4077

@end deftypefun

4078

4079

@node Performance Model API

4080

@section Performance Model API

4081

4082

@menu

4083

* starpu_load_history_debug::

4084

* starpu_perfmodel_debugfilepath::

4085

* starpu_perfmodel_get_arch_name::

4086

* starpu_force_bus_sampling::

4087

@end menu

4088

4089

@node starpu_load_history_debug

4090

@subsection @code{starpu_load_history_debug}

4091

@deftypefun int starpu_load_history_debug ({const char} *@var{symbol}, {struct starpu_perfmodel_t} *@var{model})

4092

TODO

4093

@end deftypefun

4094

4095

@node starpu_perfmodel_debugfilepath

4096

@subsection @code{starpu_perfmodel_debugfilepath}

4097

@deftypefun void starpu_perfmodel_debugfilepath ({struct starpu_perfmodel_t} *@var{model}, {enum starpu_perf_archtype} @var{arch}, char *@var{path}, size_t @var{maxlen})

4098

TODO

4099

@end deftypefun

4100

4101

@node starpu_perfmodel_get_arch_name

4102

@subsection @code{starpu_perfmodel_get_arch_name}

4103

@deftypefun void starpu_perfmodel_get_arch_name ({enum starpu_perf_archtype} @var{arch}, char *@var{archname}, size_t @var{maxlen})

4104

TODO

4105

@end deftypefun

4106

4107

@node starpu_force_bus_sampling

4108

@subsection @code{starpu_force_bus_sampling}

4109

@deftypefun void starpu_force_bus_sampling (void)

4110

This forces sampling the bus performance model again.

4111

@end deftypefun

4112

4113

4114

@node Profiling API

4115

@section Profiling API

4116

4117

@menu

4118

* starpu_profiling_status_set:: starpu_profiling_status_set

4119

* starpu_profiling_status_get:: starpu_profiling_status_get

4120

* struct starpu_task_profiling_info:: task profiling information

4121

* struct starpu_worker_profiling_info:: worker profiling information

4122

* starpu_worker_get_profiling_info:: starpu_worker_get_profiling_info

4123

* struct starpu_bus_profiling_info:: bus profiling information

4124

* starpu_bus_get_count::

4125

* starpu_bus_get_id::

4126

* starpu_bus_get_src::

4127

* starpu_bus_get_dst::

4128

* starpu_timing_timespec_delay_us::

4129

* starpu_timing_timespec_to_us::

4130

* starpu_bus_profiling_helper_display_summary::

4131

* starpu_worker_profiling_helper_display_summary::

4132

@end menu

4133

4134

@node starpu_profiling_status_set

4135

@subsection @code{starpu_profiling_status_set} -- Set current profiling status

4136

@table @asis

4137

@item @emph{Description}:

4138

Thie function sets the profiling status. Profiling is activated by passing

4139

@code{STARPU_PROFILING_ENABLE} in @code{status}. Passing

4140

@code{STARPU_PROFILING_DISABLE} disables profiling. Calling this function

4141

resets all profiling measurements. When profiling is enabled, the

4142

@code{profiling_info} field of the @code{struct starpu_task} structure points

4143

to a valid @code{struct starpu_task_profiling_info} structure containing

4144

information about the execution of the task.

4145

@item @emph{Return value}:

4146

Negative return values indicate an error, otherwise the previous status is

4147

returned.

4148

@item @emph{Prototype}:

4149

@code{int starpu_profiling_status_set(int status);}

4150

@end table

4151

4152

@node starpu_profiling_status_get

4153

@subsection @code{starpu_profiling_status_get} -- Get current profiling status

4154

@deftypefun int starpu_profiling_status_get (void)

4155

Return the current profiling status or a negative value in case there was an error.

4156

@end deftypefun

4157

4158

@node struct starpu_task_profiling_info

4159

@subsection @code{struct starpu_task_profiling_info} -- Task profiling information

4160

@table @asis

4161

@item @emph{Description}:

4162

This structure contains information about the execution of a task. It is

4163

accessible from the @code{.profiling_info} field of the @code{starpu_task}

4164

structure if profiling was enabled.

4165

@item @emph{Fields}:

4166

@table @asis

4167

@item @code{submit_time}:

4168

Date of task submission (relative to the initialization of StarPU).

4169

@item @code{start_time}:

4170

Date of task execution beginning (relative to the initialization of StarPU).

4171

@item @code{end_time}:

4172

Date of task execution termination (relative to the initialization of StarPU).

4173

@item @code{workerid}:

4174

Identifier of the worker which has executed the task.

4175

@end table

4176

@end table

4177

4178

@node struct starpu_worker_profiling_info

4179

@subsection @code{struct starpu_worker_profiling_info} -- Worker profiling information

4180

@table @asis

4181

@item @emph{Description}:

4182

This structure contains the profiling information associated to a worker.

4183

@item @emph{Fields}:

4184

@table @asis

4185

@item @code{start_time}:

4186

Starting date for the reported profiling measurements.

4187

@item @code{total_time}:

4188

Duration of the profiling measurement interval.

4189

@item @code{executing_time}:

4190

Time spent by the worker to execute tasks during the profiling measurement interval.

4191

@item @code{sleeping_time}:

4192

Time spent idling by the worker during the profiling measurement interval.

4193

@item @code{executed_tasks}:

4194

Number of tasks executed by the worker during the profiling measurement interval.

4195

@end table

4196

@end table

4197

4198

@node starpu_worker_get_profiling_info

4199

@subsection @code{starpu_worker_get_profiling_info} -- Get worker profiling info

4200

@table @asis

4201

4202

@item @emph{Description}:

4203

Get the profiling info associated to the worker identified by @code{workerid},

4204

and reset the profiling measurements. If the @code{worker_info} argument is

4205

NULL, only reset the counters associated to worker @code{workerid}.

4206

@item @emph{Return value}:

4207

Upon successful completion, this function returns 0. Otherwise, a negative

4208

value is returned.

4209

4210

@item @emph{Prototype}:

4211

@code{int starpu_worker_get_profiling_info(int workerid, struct starpu_worker_profiling_info *worker_info);}

4212

@end table

4213

4214

@node struct starpu_bus_profiling_info

4215

@subsection @code{struct starpu_bus_profiling_info} -- Bus profiling information

4216

@table @asis

4217

@item @emph{Description}:

4218

TODO

4219

@item @emph{Fields}:

4220

@table @asis

4221

@item @code{start_time}:

4222

TODO

4223

@item @code{total_time}:

4224

TODO

4225

@item @code{transferred_bytes}:

4226

TODO

4227

@item @code{transfer_count}:

4228

TODO

4229

@end table

4230

@end table

4231

4232

@node starpu_bus_get_count

4233

@subsection @code{starpu_bus_get_count}

4234

@deftypefun int starpu_bus_get_count (void)

4235

TODO

4236

@end deftypefun

4237

4238

@node starpu_bus_get_id

4239

@subsection @code{starpu_bus_get_id}

4240

@deftypefun int starpu_bus_get_id (int @var{src}, int @var{dst})

4241

TODO

4242

@end deftypefun

4243

4244

@node starpu_bus_get_src

4245

@subsection @code{starpu_bus_get_src}

4246

@deftypefun int starpu_bus_get_src (int @var{busid})

4247

TODO

4248

@end deftypefun

4249

4250

@node starpu_bus_get_dst

4251

@subsection @code{starpu_bus_get_dst}

4252

@deftypefun int starpu_bus_get_dst (int @var{busid})

4253

TODO

4254

@end deftypefun

4255

4256

@node starpu_timing_timespec_delay_us

4257

@subsection @code{starpu_timing_timespec_delay_us}

4258

@deftypefun double starpu_timing_timespec_delay_us ({struct timespec} *@var{start}, {struct timespec} *@var{end})

4259

TODO

4260

@end deftypefun

4261

4262

@node starpu_timing_timespec_to_us

4263

@subsection @code{starpu_timing_timespec_to_us}

4264

@deftypefun double starpu_timing_timespec_to_us ({struct timespec} *@var{ts})

4265

TODO

4266

@end deftypefun

4267

4268

@node starpu_bus_profiling_helper_display_summary

4269

@subsection @code{starpu_bus_profiling_helper_display_summary}

4270

@deftypefun void starpu_bus_profiling_helper_display_summary (void)

4271

TODO

4272

@end deftypefun

4273

4274

@node starpu_worker_profiling_helper_display_summary

4275

@subsection @code{starpu_worker_profiling_helper_display_summary}

4276

@deftypefun void starpu_worker_profiling_helper_display_summary (void)

4277

TODO

4278

@end deftypefun

4279

4280

4281

4282

@node CUDA extensions

4283

@section CUDA extensions

4284

4285

@c void starpu_malloc(float **A, size_t dim);

4286

4287

@menu

4288

* starpu_cuda_get_local_stream:: Get current worker's CUDA stream

4289

* starpu_helper_cublas_init:: Initialize CUBLAS on every CUDA device

4290

* starpu_helper_cublas_shutdown:: Deinitialize CUBLAS on every CUDA device

4291

@end menu

4292

4293

@node starpu_cuda_get_local_stream

4294

@subsection @code{starpu_cuda_get_local_stream} -- Get current worker's CUDA stream

4295

@deftypefun {cudaStream_t *} starpu_cuda_get_local_stream (void)

4296

StarPU provides a stream for every CUDA device controlled by StarPU. This

4297

function is only provided for convenience so that programmers can easily use

4298

asynchronous operations within codelets without having to create a stream by

4299

hand. Note that the application is not forced to use the stream provided by

4300

@code{starpu_cuda_get_local_stream} and may also create its own streams.

4301

Synchronizing with @code{cudaThreadSynchronize()} is allowed, but will reduce

4302

the likelihood of having all transfers overlapped.

4303

@end deftypefun

4304

4305

@node starpu_helper_cublas_init

4306

@subsection @code{starpu_helper_cublas_init} -- Initialize CUBLAS on every CUDA device

4307

@deftypefun void starpu_helper_cublas_init (void)

4308

The CUBLAS library must be initialized prior to any CUBLAS call. Calling

4309

@code{starpu_helper_cublas_init} will initialize CUBLAS on every CUDA device

4310

controlled by StarPU. This call blocks until CUBLAS has been properly

4311

initialized on every device.

4312

@end deftypefun

4313

4314

@node starpu_helper_cublas_shutdown

4315

@subsection @code{starpu_helper_cublas_shutdown} -- Deinitialize CUBLAS on every CUDA device

4316

@deftypefun void starpu_helper_cublas_shutdown (void)

4317

This function synchronously deinitializes the CUBLAS library on every CUDA device.

4318

@end deftypefun

4319

4320

@node OpenCL extensions

4321

@section OpenCL extensions

4322

4323

@menu

4324

* Enabling OpenCL:: Enabling OpenCL

4325

* Compiling OpenCL kernels:: Compiling OpenCL kernels

4326

* Loading OpenCL kernels:: Loading OpenCL kernels

4327

* OpenCL statistics:: Collecting statistics from OpenCL

4328

@end menu

4329

4330

@node Enabling OpenCL

4331

@subsection Enabling OpenCL

4332

4333

On GPU devices which can run both CUDA and OpenCL, CUDA will be

4334

enabled by default. To enable OpenCL, you need either to disable CUDA

4335

when configuring StarPU:

4336

4337

@example

4338

% ./configure --disable-cuda

4339

@end example

4340

4341

or when running applications:

4342

4343

@example

4344

% STARPU_NCUDA=0 ./application

4345

@end example

4346

4347

OpenCL will automatically be started on any device not yet used by

4348

CUDA. So on a machine running 4 GPUS, it is therefore possible to

4349

enable CUDA on 2 devices, and OpenCL on the 2 other devices by doing

4350

so:

4351

4352

@example

4353

% STARPU_NCUDA=2 ./application

4354

@end example

4355

4356

@node Compiling OpenCL kernels

4357

@subsection Compiling OpenCL kernels

4358

4359

Source codes for OpenCL kernels can be stored in a file or in a

4360

string. StarPU provides functions to build the program executable for

4361

each available OpenCL device as a @code{cl_program} object. This

4362

program executable can then be loaded within a specific queue as

4363

explained in the next section. These are only helpers, Applications

4364

can also fill a @code{starpu_opencl_program} array by hand for more advanced

4365

use (e.g. different programs on the different OpenCL devices, for

4366

relocation purpose for instance).

4367

4368

@menu

4369

* starpu_opencl_load_opencl_from_file:: Compiling OpenCL source code

4370

* starpu_opencl_load_opencl_from_string:: Compiling OpenCL source code

4371

* starpu_opencl_unload_opencl:: Releasing OpenCL code

4372

@end menu

4373

4374

@node starpu_opencl_load_opencl_from_file

4375

@subsubsection @code{starpu_opencl_load_opencl_from_file} -- Compiling OpenCL source code

4376

@deftypefun int starpu_opencl_load_opencl_from_file (char *@var{source_file_name}, {struct starpu_opencl_program} *@var{opencl_programs}, {const char}* @var{build_options})

4377

TODO

4378

@end deftypefun

4379

4380

@node starpu_opencl_load_opencl_from_string

4381

@subsubsection @code{starpu_opencl_load_opencl_from_string} -- Compiling OpenCL source code

4382

@deftypefun int starpu_opencl_load_opencl_from_string (char *@var{opencl_program_source}, {struct starpu_opencl_program} *@var{opencl_programs}, {const char}* @var{build_options})

4383

TODO

4384

@end deftypefun

4385

4386

@node starpu_opencl_unload_opencl

4387

@subsubsection @code{starpu_opencl_unload_opencl} -- Releasing OpenCL code

4388

@deftypefun int starpu_opencl_unload_opencl ({struct starpu_opencl_program} *@var{opencl_programs})

4389

TODO

4390

@end deftypefun

4391

4392

@node Loading OpenCL kernels

4393

@subsection Loading OpenCL kernels

4394

4395

@menu

4396

* starpu_opencl_load_kernel:: Loading a kernel

4397

* starpu_opencl_relase_kernel:: Releasing a kernel

4398

@end menu

4399

4400

@node starpu_opencl_load_kernel

4401

@subsubsection @code{starpu_opencl_load_kernel} -- Loading a kernel

4402

@deftypefun int starpu_opencl_load_kernel (cl_kernel *@var{kernel}, cl_command_queue *@var{queue}, {struct starpu_opencl_program} *@var{opencl_programs}, char *@var{kernel_name}, int @var{devid})

4403

TODO

4404

@end deftypefun

4405

4406

@node starpu_opencl_relase_kernel

4407

@subsubsection @code{starpu_opencl_release_kernel} -- Releasing a kernel

4408

@deftypefun int starpu_opencl_release_kernel (cl_kernel @var{kernel})

4409

TODO

4410

@end deftypefun

4411

4412

@node OpenCL statistics

4413

@subsection OpenCL statistics

4414

4415

@menu

4416

* starpu_opencl_collect_stats:: Collect statistics on a kernel execution

4417

@end menu

4418

4419

@node starpu_opencl_collect_stats

4420

@subsubsection @code{starpu_opencl_collect_stats} -- Collect statistics on a kernel execution

4421

@deftypefun int starpu_opencl_collect_stats (cl_event @var{event})

4422

After termination of the kernels, the OpenCL codelet should call this function

4423

to pass it the even returned by @code{clEnqueueNDRangeKernel}, to let StarPU

4424

collect statistics about the kernel execution (used cycles, consumed power).

4425

@end deftypefun

4426

4427

4428

@node Cell extensions

4429

@section Cell extensions

4430

4431

nothing yet.

4432

4433

@node Miscellaneous helpers

4434

@section Miscellaneous helpers

4435

4436

@menu

4437

* starpu_data_cpy:: Copy a data handle into another data handle

4438

* starpu_execute_on_each_worker:: Execute a function on a subset of workers

4439

@end menu

4440

4441

@node starpu_data_cpy

4442

@subsection @code{starpu_data_cpy} -- Copy a data handle into another data handle

4443

@deftypefun int starpu_data_cpy (starpu_data_handle @var{dst_handle}, starpu_data_handle @var{src_handle}, int @var{asynchronous}, void (*@var{callback_func})(void*), void *@var{callback_arg})

4444

Copy the content of the @var{src_handle} into the @var{dst_handle} handle.

4445

The @var{asynchronous} parameter indicates whether the function should

4446

block or not. In the case of an asynchronous call, it is possible to

4447

synchronize with the termination of this operation either by the means of

4448

implicit dependencies (if enabled) or by calling

4449

@code{starpu_task_wait_for_all()}. If @var{callback_func} is not @code{NULL},

4450

this callback function is executed after the handle has been copied, and it is

4451

given the @var{callback_arg} pointer as argument.

4452

@end deftypefun

4453

4454

4455

4456

@node starpu_execute_on_each_worker

4457

@subsection @code{starpu_execute_on_each_worker} -- Execute a function on a subset of workers

4458

@deftypefun void starpu_execute_on_each_worker (void (*@var{func})(void *), void *@var{arg}, uint32_t @var{where})

4459

When calling this method, the offloaded function specified by the first argument is

4460

executed by every StarPU worker that may execute the function.

4461

The second argument is passed to the offloaded function.

4462

The last argument specifies on which types of processing units the function

4463

should be executed. Similarly to the @var{where} field of the

4464

@code{starpu_codelet} structure, it is possible to specify that the function

4465

should be executed on every CUDA device and every CPU by passing

4466

@code{STARPU_CPU|STARPU_CUDA}.

4467

This function blocks until the function has been executed on every appropriate

4468

processing units, so that it may not be called from a callback function for

4469

instance.

4470

@end deftypefun

4471

4472

4473

@c ---------------------------------------------------------------------

4474

@c Advanced Topics

4475

@c ---------------------------------------------------------------------

4476

4477

@node Advanced Topics

4478

@chapter Advanced Topics

4479

4480

@menu

4481

* Defining a new data interface::

4482

* Defining a new scheduling policy::

4483

@end menu

4484

4485

@node Defining a new data interface

4486

@section Defining a new data interface

4487

4488

@menu

4489

* struct starpu_data_interface_ops_t:: Per-interface methods

4490

* struct starpu_data_copy_methods:: Per-interface data transfer methods

4491

* An example of data interface:: An example of data interface

4492

@end menu

4493

4494

@c void *starpu_data_get_interface_on_node(starpu_data_handle handle, unsigned memory_node); TODO

4495

4496

@node struct starpu_data_interface_ops_t

4497

@subsection @code{struct starpu_data_interface_ops_t} -- Per-interface methods

4498

@table @asis

4499

@item @emph{Description}:

4500

TODO describe all the different fields

4501

@end table

4502

4503

@node struct starpu_data_copy_methods

4504

@subsection @code{struct starpu_data_copy_methods} -- Per-interface data transfer methods

4505

@table @asis

4506

@item @emph{Description}:

4507

TODO describe all the different fields

4508

@end table

4509

4510

@node An example of data interface

4511

@subsection An example of data interface

4512

@table @asis

4513

TODO

4514

See @code{src/datawizard/interfaces/vector_interface.c} for now.

4515

@end table

4516

4517

@node Defining a new scheduling policy

4518

@section Defining a new scheduling policy

4519

4520

TODO

4521

4522

A full example showing how to define a new scheduling policy is available in

4523

the StarPU sources in the directory @code{examples/scheduler/}.

4524

4525

@menu

4526

* struct starpu_sched_policy_s::

4527

* starpu_worker_set_sched_condition::

4528

* starpu_sched_set_min_priority:: Set the minimum priority level

4529

* starpu_sched_set_max_priority:: Set the maximum priority level

4530

* starpu_push_local_task:: Assign a task to a worker

4531

* Source code::

4532

@end menu

4533

4534

@node struct starpu_sched_policy_s

4535

@subsection @code{struct starpu_sched_policy_s} -- Scheduler methods

4536

@table @asis

4537

@item @emph{Description}:

4538

This structure contains all the methods that implement a scheduling policy. An

4539

application may specify which scheduling strategy in the @code{sched_policy}

4540

field of the @code{starpu_conf} structure passed to the @code{starpu_init}

4541

function.

4542

4543

@item @emph{Fields}:

4544

@table @asis

4545

@item @code{init_sched}:

4546

Initialize the scheduling policy.

4547

@item @code{deinit_sched}:

4548

Cleanup the scheduling policy.

4549

@item @code{push_task}:

4550

Insert a task into the scheduler.

4551

@item @code{push_prio_task}:

4552

Insert a priority task into the scheduler.

4553

@item @code{push_prio_notify}:

4554

Notify the scheduler that a task was pushed on the worker. This method is

4555

called when a task that was explicitely assigned to a worker is scheduled. This

4556

method therefore permits to keep the state of of the scheduler coherent even

4557

when StarPU bypasses the scheduling strategy.

4558

@item @code{pop_task}:

4559

Get a task from the scheduler. The mutex associated to the worker is already

4560

taken when this method is called. If this method is defined as @code{NULL}, the

4561

worker will only execute tasks from its local queue. In this case, the

4562

@code{push_task} method should use the @code{starpu_push_local_task} method to

4563

assign tasks to the different workers.

4564

@item @code{pop_every_task}:

4565

Remove all available tasks from the scheduler (tasks are chained by the means

4566

of the prev and next fields of the starpu_task structure). The mutex associated

4567

to the worker is already taken when this method is called.

4568

@item @code{post_exec_hook} (optionnal):

4569

This method is called every time a task has been executed.

4570

@item @code{policy_name}:

4571

Name of the policy (optionnal).

4572

@item @code{policy_description}:

4573

Description of the policy (optionnal).

4574

@end table

4575

@end table

4576

4577

4578

@node starpu_worker_set_sched_condition

4579

@subsection @code{starpu_worker_set_sched_condition} -- Specify the condition variable associated to a worker

4580

@deftypefun void starpu_worker_set_sched_condition (int @var{workerid}, pthread_cond_t *@var{sched_cond}, pthread_mutex_t *@var{sched_mutex})

4581

When there is no available task for a worker, StarPU blocks this worker on a

4582

condition variable. This function specifies which condition variable (and the

4583

associated mutex) should be used to block (and to wake up) a worker. Note that

4584

multiple workers may use the same condition variable. For instance, in the case

4585

of a scheduling strategy with a single task queue, the same condition variable

4586

would be used to block and wake up all workers.

4587

The initialization method of a scheduling strategy (@code{init_sched}) must

4588

call this function once per worker.

4589

@end deftypefun

4590

4591

@node starpu_sched_set_min_priority

4592

@subsection @code{starpu_sched_set_min_priority}

4593

@deftypefun void starpu_sched_set_min_priority (int @var{min_prio})

4594

Defines the minimum priority level supported by the scheduling policy. The

4595

default minimum priority level is the same as the default priority level which

4596

is 0 by convention. The application may access that value by calling the

4597

@code{starpu_sched_get_min_priority} function. This function should only be

4598

called from the initialization method of the scheduling policy, and should not

4599

be used directly from the application.

4600

@end deftypefun

4601

4602

@node starpu_sched_set_max_priority

4603

@subsection @code{starpu_sched_set_max_priority}

4604

@deftypefun void starpu_sched_set_min_priority (int @var{max_prio})

4605

Defines the maximum priority level supported by the scheduling policy. The

4606

default maximum priority level is 1. The application may access that value by

4607

calling the @code{starpu_sched_get_max_priority} function. This function should

4608

only be called from the initialization method of the scheduling policy, and

4609

should not be used directly from the application.

4610

@end deftypefun

4611

4612

@node starpu_push_local_task

4613

@subsection @code{starpu_push_local_task}

4614

@deftypefun int starpu_push_local_task (int @var{workerid}, {struct starpu_task} *@var{task}, int @var{back})

4615

The scheduling policy may put tasks directly into a worker's local queue so

4616

that it is not always necessary to create its own queue when the local queue

4617

is sufficient. If "back" not null, the task is put at the back of the queue

4618

where the worker will pop tasks first. Setting "back" to 0 therefore ensures

4619

a FIFO ordering.

4620

@end deftypefun

4621

4622

4623

4624

4625

@node Source code

4626

@subsection Source code

4627

4628

@cartouche

4629

@smallexample

4630

static struct starpu_sched_policy_s dummy_sched_policy = @{

4631

.init_sched = init_dummy_sched,

4632

.deinit_sched = deinit_dummy_sched,

4633

.push_task = push_task_dummy,

4634

.push_prio_task = NULL,

4635

.pop_task = pop_task_dummy,

4636

.post_exec_hook = NULL,

4637

.pop_every_task = NULL,

4638

.policy_name = "dummy",

4639

.policy_description = "dummy scheduling strategy"

4640

@};

4641

@end smallexample

4642

@end cartouche

4643

4644

4645

@c ---------------------------------------------------------------------

4646

@c Appendices

4647

@c ---------------------------------------------------------------------

4648

4649

@c ---------------------------------------------------------------------

4650

@c Full source code for the 'Scaling a Vector' example

4651

@c ---------------------------------------------------------------------

4652

4653

@node Full source code for the 'Scaling a Vector' example

4654

@appendix Full source code for the 'Scaling a Vector' example

4655

4656

@menu

4657

* Main application::

4658

* CPU Kernel::

4659

* CUDA Kernel::

4660

* OpenCL Kernel::

4661

@end menu

4662

4663

@node Main application

4664

@section Main application

4665

4666

@smallexample

4667

@include vector_scal_c.texi

4668

@end smallexample

4669

4670

@node CPU Kernel

4671

@section CPU Kernel

4672

4673

@smallexample

4674

@include vector_scal_cpu.texi

4675

@end smallexample

4676

4677

@node CUDA Kernel

4678

@section CUDA Kernel

4679

4680

@smallexample

4681

@include vector_scal_cuda.texi

4682

@end smallexample

4683

4684

@node OpenCL Kernel

4685

@section OpenCL Kernel

4686

4687

@menu

4688

* Invoking the kernel::

4689

* Source of the kernel::

4690

@end menu

4691

4692

@node Invoking the kernel

4693

@subsection Invoking the kernel

4694

4695

@smallexample

4696

@include vector_scal_opencl.texi

4697

@end smallexample

4698

4699

@node Source of the kernel

4700

@subsection Source of the kernel

4701

4702

@smallexample

4703

@include vector_scal_opencl_codelet.texi

4704

@end smallexample

4705

4706

@c

4707

@c Indices.

4708

@c

4709

4710

@node Function Index

4711

@unnumbered Function Index

4712

@printindex fn

4713

4714

@bye