~yeconomou/pocl/pocl

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
Notes on compiling pocl in Debian Sid/Cell/Playstation 3.

The "default" debian for Powerpc has a 32bit userspace runtime, 
even if the processor itself is a 64bit PowerPC. You can run 
multilibs on debian to be able to build 64bit binaries in the
32bit environment, but the 'dlopen' of 64 bit libraries in 32bit
binaries will fail. Hence, both libpocl and the opencl kerenls 
must be of the same bitwidth.

status
------
32bit binaries are assumed, and built by default. If you know you
have a 64bit userspace, you can override by giving
	--enable-ppc64
option to configure.

The only thing that currently will fail are the tests that use 
double precision floating points (see pocl bug #911911).

powerpc64
---------
Compiling in powerpc64 mode compiles the pocl but dlopening
the kernels fail. The problem is likely the combination of
32bit pocl libraries and the libtool dlopen library and
the kernels which are compiled in 64bit mode by Clang.

Forcing the build to use the 64bit mode by passing 
CXXFLAGS=-m64 CFLAGS=-m64 fails at configure time because
the libltdl lib is 32-bit only in this env and does not
get detected.

powerpc32
---------
Configure with:
./configure 

This makes pocl to use the 32bit mode for the
kernel compilation.


spu device driver
-----------------

Initial implememtation is in place. It uses libspe2 
to communucate with the spus.

A context for one (current status - to be improved upon)
spu is created at driver initialization stage. In this image, 
we allocate memory for the opencl global/local buffers. This
memory chunk is managed by buffalloc in the driver. See 
lib/CL/drivrs/cellspu/cellspu.h for a memory map.

clRead/WriteBuffer can operate directly on this context, 
as it can be memory mapped in the host, and loading an
ELF into this context doesn't seem to overwrite unused/
uninitialized areas.

At run time, a __kernel_exec_cmd data structure is 
filled, similar to as with the TCE driver, this control
structure is copied into the context, and the SPU is
started. The SPU has a similar wrapper function as the TTA
to parse this structure.

This is still unfinished. Major parts missing are:
-get a non-trivial 'hello world' running (i.e. one with
arguments).
-multithreaded driver. spe_context_run *blocks*, so 
we need at least 1 thread/spu. Possibly more - depends
on how the kernel schedules the SPU contexts.


to do
-----
* Fix the 64bit build. 

It requires forcing pocl to be built in the 64 bit mode
(-m64) and all the required libraries (at least the 
libltdl) to be found as 64 bit versions by configure.

* A device layer implementation for the SPEs. 

The SPEs have local memories and no random access to 
a shared memory (AFAIK) so implementing the global buffers 
across SPEs is problematic. Therefore, at least the
first version could present one device per SPE thus require 
one command queue per SPE to utilize fully.

* The load balancing of multiple work groups from a single
kernel execution across all the SPEs. 

Might require changing the global accesses in the LLVM 
bitcode to DMA transfer calls which possibly ruins
the performance, but could be provided as an option.