FFmpeg & evaluating performance on the PowerPC Architecture HOWTO
(c) 2003-2004 Romain Dolbeau <romain@dolbeau.org>

The PowerPC architecture and its SIMD extension AltiVec offer some
interesting tools to evaluate performance and improve the code.
This document tries to explain how to use those tools with FFmpeg.

The architecture itself offers two ways to evaluate the performance of
a given piece of code:

1) The Time Base Registers (TBL)
2) The Performance Monitor Counter Registers (PMC)

The first ones are always available and always active, but they're not
very accurate: the registers increment by one every four *bus* cycles.
On my 667 MHz TiBook (ppc7450), this means once every twenty
*processor* cycles. So we won't use them.
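
To make that granularity concrete, here is a small worked calculation in C.
It is only a sketch: the helper names are made up for this example, and the
core-to-bus multiplier of 5 is inferred from the "four bus cycles = twenty
processor cycles" figure above, not taken from a datasheet.

```c
#include <assert.h>

/* From the text above: the time base ticks once every four bus cycles. */
#define BUS_CYCLES_PER_TB_TICK 4u

/* Hypothetical helper: processor cycles that elapse per time base tick,
 * given the core-to-bus clock multiplier (assumed to be 5 on the 667 MHz
 * machine discussed above). */
unsigned cpu_cycles_per_tb_tick(unsigned core_to_bus_multiplier)
{
    assert(core_to_bus_multiplier > 0);
    return BUS_CYCLES_PER_TB_TICK * core_to_bus_multiplier;
}

/* Hypothetical helper: number of complete time base ticks that fit in a
 * run of the given length, in processor cycles. */
unsigned whole_tb_ticks(unsigned cpu_cycles, unsigned core_to_bus_multiplier)
{
    return cpu_cycles / cpu_cycles_per_tb_tick(core_to_bus_multiplier);
}
```

With a multiplier of 5 this gives 20 processor cycles per tick, so a routine
lasting 231 cycles covers only 11 complete ticks, and anything shorter than
20 cycles may not register at all.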

The PMC are much more useful: not only can they report cycle-accurate
timing, but they can also be used to monitor many other parameters,
such as the number of AltiVec stalls for every kind of instruction,
or instruction cache misses. The downside is that not all processors
support the PMC (all G3, all G4 and the 970 do support them), and
they're inactive by default - you need to activate them with a
dedicated tool. Also, the number of available PMC depends on the
processor: the various 604 have 2, the various 75x (aka G3) have 4,
and the various 74xx (aka G4) have 6.

*WARNING*: The PowerPC 970 is not very well documented, and its PMC
registers are 64 bits wide. To properly notify the code, you *must*
tune for the 970 (using --tune=970), or the code will assume 32-bit
registers.
II - Enabling FFmpeg PowerPC performance support

This needs to be done by hand. First, you need to configure FFmpeg as
usual, but add the "--powerpc-perf-enable" option. For instance:

./configure --prefix=/usr/local/ffmpeg-svn --cc=gcc-3.3 --tune=7450 --powerpc-perf-enable

This will configure FFmpeg to install inside /usr/local/ffmpeg-svn,
compiling with gcc-3.3 (you should try to use this one or a newer
gcc), and tuning for the PowerPC 7450 (i.e. the newer G4; as a rule of
thumb, those at 550 MHz and more). It will also enable the PMC.

You may also edit the file "config.h" to enable (i.e. uncomment) the
following line:

// #define ALTIVEC_USE_REFERENCE_C_CODE 1

If you enable this line, then the code will not make use of AltiVec,
but will use the reference C code instead. This is useful to compare
performance between the two versions of the code.
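
The effect of that line can be pictured with a tiny C sketch. This is not
FFmpeg's actual code: both function names below are invented for
illustration, and only the macro name comes from "config.h".

```c
/* Uncomment the next line to force the reference C path, as described above. */
/* #define ALTIVEC_USE_REFERENCE_C_CODE 1 */

/* Hypothetical helper (not part of FFmpeg): returns 1 when the build was
 * configured to use the reference C code, 0 otherwise. */
int uses_reference_c_code(void)
{
#ifdef ALTIVEC_USE_REFERENCE_C_CODE
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: a printable name for the selected code path. */
const char *selected_implementation(void)
{
    return uses_reference_c_code() ? "reference C" : "AltiVec";
}
```

Because the choice is made by the preprocessor, switching between the two
paths requires rebuilding; it cannot be changed at run time.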

Also, the number of enabled PMC is defined in "libavcodec/ppc/dsputil_ppc.h":

#define POWERPC_NUM_PMC_ENABLED 4

If you have a G4 CPU, you can enable all 6 PMC. DO NOT enable more
PMC than available on your CPU!
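
As a rough sanity check of that rule, the limits quoted earlier (604: 2,
75x/G3: 4, 74xx/G4: 6) can be written down in C. This is only a sketch: the
enum and both functions are hypothetical names made up for this example and
do not exist in FFmpeg.

```c
#include <assert.h>

#define POWERPC_NUM_PMC_ENABLED 4

/* Hypothetical enum covering the CPU families listed in this document. */
enum ppc_family { PPC_604, PPC_75X_G3, PPC_74XX_G4 };

/* Number of PMC per family, using the figures quoted earlier in the text. */
int pmc_available(enum ppc_family family)
{
    switch (family) {
    case PPC_604:     return 2;
    case PPC_75X_G3:  return 4;
    case PPC_74XX_G4: return 6;
    }
    assert(!"unknown CPU family");
    return 0;
}

/* Returns 1 if enabling num_enabled PMC is within what the family offers. */
int pmc_setting_is_safe(int num_enabled, enum ppc_family family)
{
    return num_enabled >= 0 && num_enabled <= pmc_available(family);
}
```

For instance, the default of 4 passes for a G3 or a G4 but not for a 604,
and 6 passes only for a G4.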

Then, simply compile FFmpeg as usual (make && make install).
III - Using FFmpeg PowerPC performance support

This FFmpeg build can be used exactly as usual. But before exiting,
FFmpeg will dump a per-function report that looks like this:

PowerPC performance report
Values are from the PMC registers, and represent whatever the
registers are set to record.
Function "gmc1_altivec" (pmc1):

In this example, PMC1 was set to record CPU cycles, PMC2 was set to
record AltiVec Permute Stall Cycles, and PMC3 was set to record AltiVec
Issue Stalls.

The function "gmc1_altivec" was monitored 255302 times, and the
minimum execution time was 231 processor cycles. The max and average
aren't much use, as it's very likely the OS interrupted execution for
reasons of its own :-(

With the exact same settings and source file, but using the reference C
code we get:

PowerPC performance report
Values are from the PMC registers, and represent whatever the
registers are set to record.
Function "gmc1_altivec" (pmc1):

592 cycles, so the fastest AltiVec execution is about 2.5x faster than
the fastest C execution in this example. It's not perfect but it's not
bad (well I wrote this function so I can't say otherwise :-).

Once you have that kind of report, you can try to improve things by
finding what goes wrong and fixing it; in the example above, one
should try to diminish the number of AltiVec stalls, as this *may*
improve performance.
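
The reasoning behind comparing minimums can be sketched in a few lines of C.
This is an illustration under assumptions, not the code FFmpeg uses to build
its report: the struct and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accumulator holding what a per-function report needs:
 * minimum, maximum, sum and number of samples. */
struct perf_stats {
    uint64_t min, max, sum, count;
};

void perf_stats_init(struct perf_stats *s)
{
    s->min = UINT64_MAX;
    s->max = 0;
    s->sum = 0;
    s->count = 0;
}

/* Record one measurement, in processor cycles. */
void perf_stats_add(struct perf_stats *s, uint64_t cycles)
{
    if (cycles < s->min) s->min = cycles;
    if (cycles > s->max) s->max = cycles;
    s->sum += cycles;
    s->count++;
}

/* Integer average of all recorded measurements. */
uint64_t perf_stats_avg(const struct perf_stats *s)
{
    assert(s->count > 0);
    return s->sum / s->count;
}

/* Speedup of the fast version over the slow one, from minimum cycle counts. */
double speedup(uint64_t slow_min_cycles, uint64_t fast_min_cycles)
{
    assert(fast_min_cycles > 0);
    return (double)slow_min_cycles / (double)fast_min_cycles;
}
```

speedup(592, 231) comes out at roughly 2.56, which is the "about 2.5x" quoted
above. Feeding the accumulator two ordinary samples and one made-up sample
inflated by an interruption (say 231, 240 and 50000 cycles) leaves the minimum
at 231 while the average jumps to 16823, which is why only the minimum is
worth comparing.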

IV - Enabling the PMC in Mac OS X

This is easy. Use "MONster" and "monster". Those tools come from
Apple's CHUD package, and can be found hidden in the developer web
site & FTP site. "MONster" is the graphical application; use it to
generate a config file specifying what each register should
monitor. Then use the command-line application "monster" to use that
config file, and enjoy the results.

Note that "MONster" can be used for many other things, but it's
documented by Apple and it's not my subject here.

If you are using CHUD 4.4.2 or later, you'll notice that MONster is
no longer available. It's been superseded by Shark, where
configuration of PMC is available as a plugin.

V - Enabling the PMC on Linux

On Linux you may use oprofile from http://oprofile.sf.net. Depending on
the version and the CPU, you may need to apply a patch[1] to access a
set of the possible counters from the userspace application. You can
always define them using the kernel interface /dev/oprofile/* .

[1] http://dev.gentoo.org/~lu_zero/development/oprofile-g4-20060423.patch

Romain Dolbeau <romain@dolbeau.org>
Luca Barbato <lu_zero@gentoo.org>