//===---------------------------------------------------------------------===//

Vararg function prologues can be further optimized. Currently all XMM registers
are stored into the register save area. Most of these stores can be eliminated,
since an upper bound on the number of XMM registers actually used is passed in
%al. gcc produces something like the following:

	leaq	0(,%rdx,4), %rax
	leaq	4+L2(%rip), %rdx

	movaps	%xmm7, -15(%rax)
	movaps	%xmm6, -31(%rax)
	movaps	%xmm5, -47(%rax)
	movaps	%xmm4, -63(%rax)
	movaps	%xmm3, -79(%rax)
	movaps	%xmm2, -95(%rax)
	movaps	%xmm1, -111(%rax)
	movaps	%xmm0, -127(%rax)

It jumps over the movaps that do not need to be stored. It is hard to see this
being significant, as it adds 5 instructions (including an indirect branch) to
avoid executing 0 to 8 stores in the function prologue.

Perhaps we can optimize for the common case where no XMM registers are used for
parameter passing, i.e. if %al == 0, jump over all of the stores. Or, in the
case of a leaf function where we can determine that no XMM input parameter is
needed, avoid emitting the stores at all.
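For context, here is the caller-side %al convention this relies on, as a
minimal C sketch (the function name is illustrative):

  #include <stdio.h>

  /* Before a call to a varargs function, the x86-64 SysV ABI requires the
     caller to place an upper bound on the number of vector registers used
     into %al.  For this call the compiler emits something like
     "movl $1, %eax" (one XMM argument); with no floating-point arguments
     it emits "xorl %eax, %eax", which is exactly the %al == 0 case the
     proposed fast path would key on. */
  void print_it(double d) {
    printf("%f\n", d);    /* caller sets %al = 1 for this call */
  }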
//===---------------------------------------------------------------------===//
AMD64 has a complex calling convention for aggregate passing by value:

1. If the size of an object is larger than two eightbytes, or in C++, is a non-
   POD structure or union type, or contains unaligned fields, it has class
   MEMORY.
2. Both eightbytes get initialized to class NO_CLASS.
3. Each field of an object is classified recursively so that always two fields
   are considered. The resulting class is calculated according to the classes
   of the fields in the eightbyte (a sketch of this merge step in C follows the
   list):
   (a) If both classes are equal, this is the resulting class.
   (b) If one of the classes is NO_CLASS, the resulting class is the other
       class.
   (c) If one of the classes is MEMORY, the result is the MEMORY class.
   (d) If one of the classes is INTEGER, the result is the INTEGER.
   (e) If one of the classes is X87, X87UP, or COMPLEX_X87, MEMORY is used as
       class.
   (f) Otherwise class SSE is used.
4. Then a post-merger cleanup is done:
   (a) If one of the classes is MEMORY, the whole argument is passed in memory.
   (b) If SSEUP is not preceded by SSE, it is converted to SSE.
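Here is a minimal C sketch of the rule-3 merge step, assuming exactly the
classes named above (enum names follow the ABI document; this is illustrative,
not LLVM's implementation):

  enum Class { NO_CLASS, INTEGER, SSE, SSEUP, X87, X87UP, COMPLEX_X87, MEMORY };

  /* Merge the classes of two fields sharing an eightbyte, per rules (a)-(f). */
  static enum Class merge(enum Class a, enum Class b) {
    if (a == b) return a;                               /* (a) */
    if (a == NO_CLASS) return b;                        /* (b) */
    if (b == NO_CLASS) return a;
    if (a == MEMORY || b == MEMORY) return MEMORY;      /* (c) */
    if (a == INTEGER || b == INTEGER) return INTEGER;   /* (d) */
    if (a == X87 || a == X87UP || a == COMPLEX_X87 ||
        b == X87 || b == X87UP || b == COMPLEX_X87)
      return MEMORY;                                    /* (e) */
    return SSE;                                         /* (f) */
  }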
Currently the llvm frontend does not handle this correctly.

Problem 1:
    typedef struct { int i; double d; } QuadWordS;
It is currently passed in two i64 integer registers. However, a gcc-compiled
callee expects the second element 'd' to be passed in XMM0.
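A hypothetical repro for Problem 1 (build get_d with gcc and the caller with
llvm; the call prints garbage rather than 2.5 when the two sides disagree):

  #include <stdio.h>

  typedef struct { int i; double d; } QuadWordS;

  /* Per the ABI, eightbyte 0 (int i plus padding) has class INTEGER and goes
     in %rdi; eightbyte 1 (double d) has class SSE and goes in %xmm0, so a
     gcc-compiled callee reads 'd' from %xmm0. */
  double get_d(QuadWordS q) { return q.d; }

  int main(void) {
    QuadWordS q = { 1, 2.5 };
    printf("%f\n", get_d(q));   /* 2.500000 only if both sides agree */
    return 0;
  }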
Problem 2:
    typedef struct { int32_t i; float j; double d; } QuadWordS;
The size of the first two fields == i64, so they will be combined and passed in
an integer register, RDI. The third field is still passed in XMM0.

Problem 3:
    typedef struct { int64_t i; int8_t j; int64_t d; } S;
    void test(S s)
The size of this aggregate is greater than two i64, so it should be passed in
memory. Currently llvm breaks it down and passes it in three integer registers.
Problem 4:
Taking problem 3 one step further, a function expects an aggregate value in
memory followed by more parameter(s) passed in register(s):
    void test(S s, int b)

LLVM IR does not allow parameter passing by aggregates, therefore the frontend
must break the aggregate value (in problems 3 and 4) into a number of scalar
values:
    void %test(long %s.i, byte %s.j, long %s.d);

However, if the backend were to lower this code literally, it would pass the 3
values in integer registers. To force them to be passed in memory, the frontend
should change the function signature to:
    void %test(long %undef1, long %undef2, long %undef3, long %undef4,
               long %undef5, long %undef6,
               long %s.i, byte %s.j, long %s.d);
And the caller would look something like this:
    call void %test( undef, undef, undef, undef, undef, undef,
                     %tmp.s.i, %tmp.s.j, %tmp.s.d );
The first 6 undef parameters would exhaust the 6 integer registers used for
parameter passing. The following three values would then be forced into memory.

For problem 4, the parameter 'b' would be moved to the front of the parameter
list so it will be passed in a register:
    void %test(int %b,
               long %undef1, long %undef2, long %undef3, long %undef4,
               long %undef5, long %undef6,
               long %s.i, byte %s.j, long %s.d);
//===---------------------------------------------------------------------===//
Right now the asm printer assumes GlobalAddresses are accessed via RIP-relative
addressing. Therefore, it is not possible to generate this:
	movabsq	$__ZTV10polynomialIdE+16, %rax

That is ok for now since we currently only support the small code model, so the
above is instead generated as:
	leaq	__ZTV10polynomialIdE+16(%rip), %rax

This is probably slightly slower but is much shorter than movabsq. However, if
we were to support medium or larger code models, we would need to use the
movabs instruction. We should probably introduce something like AbsoluteAddress
to distinguish it from GlobalAddress so the asm printer and JIT code emitter
can handle them.
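For reference, the difference is visible from C via the code-model flags; a
small illustration (the global's name here is arbitrary):

  /* With -mcmodel=small, taking this address can use a RIP-relative leaq;
     with -mcmodel=large, the 64-bit absolute movabsq form is required. */
  extern long table_for_codemodel[];   /* arbitrary illustrative global */

  long *addr_of_entry(void) {
    return &table_for_codemodel[2];    /* global + 16 bytes, as above */
  }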
//===---------------------------------------------------------------------===//
And the codegen is even worse for the following
(from http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33103):

  void fill1(char *s, int a)
  {
    __builtin_memset(s, a, 15);
  }

For this version, we duplicate the computation of the constant to store.
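For reference, the "constant to store" is the byte value 'a' replicated across
a word; a common way to form it is a multiply by 0x0101010101010101 (a sketch,
not the exact codegen):

  #include <stdint.h>

  /* Replicate the low byte of 'a' into all 8 bytes of a 64-bit word. */
  static uint64_t splat8(uint8_t a) {
    return (uint64_t)a * 0x0101010101010101ULL;
  }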
//===---------------------------------------------------------------------===//
if we have whole-function selectiondags.
//===---------------------------------------------------------------------===//
Take the following C code
(from http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43640):

struct u1
{
	float x;
	float y;
};

float foo(struct u1 u)
{
	return u.x + u.y;
}

Optimizes to the following IR:
define float @foo(double %u.0) nounwind readnone {
entry:
  %tmp8 = bitcast double %u.0 to i64      ; <i64> [#uses=2]
  %tmp6 = trunc i64 %tmp8 to i32          ; <i32> [#uses=1]
  %tmp7 = bitcast i32 %tmp6 to float      ; <float> [#uses=1]
  %tmp2 = lshr i64 %tmp8, 32              ; <i64> [#uses=1]
  %tmp3 = trunc i64 %tmp2 to i32          ; <i32> [#uses=1]
  %tmp4 = bitcast i32 %tmp3 to float      ; <float> [#uses=1]
  %0 = fadd float %tmp7, %tmp4            ; <float> [#uses=1]
  ret float %0
}
And current llvm-gcc/clang output:

	movd	%xmm0, %rax
	movd	%eax, %xmm1
	shrq	$32, %rax
	movd	%eax, %xmm0
	addss	%xmm1, %xmm0
	ret
We really shouldn't move the floats to RAX, only to immediately move them
straight back to the XMM registers.

There really isn't any good way to handle this purely in IR optimizers; it
could possibly be handled by changing the output of the frontend, though. It
would also be feasible to add an x86-specific DAGCombine to optimize the
bitcast+trunc+(lshr+)bitcast combination.
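For clarity, the pattern such a DAGCombine would target, written out in C
(helper name is illustrative):

  #include <stdint.h>
  #include <string.h>

  /* Split a double's bits and reinterpret the low 32 as a float:
     bitcast double->i64, trunc i64->i32, bitcast i32->float.  The high
     half adds the lshr step. */
  static float low_half(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);    /* bitcast double -> i64 */
    uint32_t lo = (uint32_t)bits;      /* trunc i64 -> i32 */
    float f;
    memcpy(&f, &lo, sizeof f);         /* bitcast i32 -> float */
    return f;
  }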
//===---------------------------------------------------------------------===//
Take the following code
(from http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34653):

extern unsigned long table[];
unsigned long foo(unsigned char *p) {
  unsigned long tag = *p;
  return table[tag >> 4] + table[tag & 0xf];
}

Current code generated:
	movzbl	(%rdi), %eax
	movq	%rax, %rcx
	andq	$240, %rcx
	shrq	%rcx
	andq	$15, %rax
	movq	table(,%rax,8), %rax
	addq	table(%rcx), %rax
	ret
Issues:
1. First movq should be movl; saves a byte.
2. Both andq's should be andl; saves another two bytes. I think this was
   implemented at one point, but subsequently regressed.
3. shrq should be shrl; saves another byte.
4. The first andq can be completely eliminated by using a slightly more
   expensive addressing mode: since movzbl already zero-extends the byte,
   "shrl $4, %ecx" alone computes tag >> 4, and the scaled form
   table(,%rcx,8) can then be used for the add.
//===---------------------------------------------------------------------===//
Consider the following (contrived testcase, but contains common factors):

#include <stdarg.h>
int test(int x, ...) {
  int sum = 0, i;
  va_list l;
  va_start(l, x);
  for (i = 0; i < x; i++)
    sum += va_arg(l, int);
  va_end(l);
  return sum;
}
Testcase given in C because fixing it will likely involve changing the IR
generated for it. The primary issue with the result is that it doesn't do any
of the optimizations which are possible if we know the address of a va_list
in the current function is never taken:
1. We shouldn't spill the XMM registers because we only call va_arg with "int".
2. It would be nice if we could scalarrepl the va_list.
3. Probably overkill, but it'd be cool if we could peel off the first five
   iterations of the loop.

Other optimizations involving functions which use va_arg on floats and don't
have the address of a va_list taken (a sketch of such a function follows this
list):
1. Conversely to the above, we shouldn't spill general registers if we only
   call va_arg on "double".
2. If we know nothing more than 64 bits wide is read from the XMM registers,
   we can change the spilling code to reduce the amount of stack used by half.
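A minimal sketch of the kind of function the second list is about (only
"double" is ever read from the va_list, so the general-purpose register spills
are the dead ones here):

  #include <stdarg.h>

  double testf(int x, ...) {
    double sum = 0;
    va_list l;
    va_start(l, x);
    for (int i = 0; i < x; i++)
      sum += va_arg(l, double);
    va_end(l);
    return sum;
  }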
//===---------------------------------------------------------------------===//