1 \section{IA32 (x86, i386) code generator}
2 \label{sectionia32codegenerator}
5 \subsection{Introduction}
7 The IA32 architecture is the most important architecture on the
8 desktop market. Since the current IA32 processors are getting faster
9 and more powerful, the IA32 architecture also becomes more important
10 in the low-end and mid-end server market. Major Java Virtual Machine
11 vendors, like Sun or IBM, have highly optimized IA32 ports of their
12 Virtual Machines, so it's fairly important for an Open Source Java
13 Virtual Machine to have a good IA32 performance.
15 Porting CACAO to the IA32 platform was more effort than
16 expected. CACAO was designed to run on RISC machines from ground up,
17 so the whole code generation part has to be adapted. The first
18 approach was to replace the simple RISC macros with IA32 code, but
19 this turned out to be not successful. So new IA32 code generation
20 macros were written, with no respect to the old RISC macros.
22 Some smaller problems occured since the IA32 port was the first 32 bit
23 target platform, like segmentation faults due to heap corruption,
24 which turned out to be a simple \texttt{for} loop bug only hit on 32
25 bit systems. Most of the CACAO system already was
26 \textit{32-bit-ready}, namely an architecture dependent
27 \texttt{types.h} with definitions of the used datatypes and some
28 feature flags, which features the processor itself natively
29 supports. Most noticeable change was the \texttt{s8} and \texttt{u8}
30 datatype, changed from \texttt{long} to \texttt{long long} to support
34 \subsection{Code generation}
36 One big difference in writing the new code generation macros was, that
37 the IA32 architecture is not a \textit{load-store architecture} like
38 the RISC machines, but the \textit{machine instructions} can handle
39 both \textit{memory operands} and \textit{register operands}. This led
40 to a much more complicated handling of the various ICMDs. The typical
41 handling of an ICMD on RISC machines looks like this (on the example
42 of the integer add ICMD):
46 var_to_reg_int(s1, src->prev, REG_ITMP1);
47 var_to_reg_int(s2, src, REG_ITMP2);
48 d = reg_of_var(iptr->dst, REG_ITMP3);
50 store_reg_to_var_int(iptr->dst, d);
54 This means loading the two \textit{source operands} from memory to
55 temporary registers, if necessary, getting a \textit{destination
56 register}, do the calculation and store the result to memory, if the
57 destination variable resides in memory. If all operands are assigned
58 to registers, only the calculation is done. This design also works on
59 IA32 machines but leads to much bigger code size, reduces decoding
60 bandwith and increases register pressure in the processor itself,
61 which results in lower performance~\cite{IA32opt}. Thus CACAO uses all
62 kinds of instruction types that are available and decide which one is
63 used in some \texttt{if} statements:
66 if (IS_INMEMORY(iptr->dst)) {
67 if (IS_INMEMORY(src) && IS_INMEMORY(src->prev)) {
69 } else if (IS_INMEMORY(src) && !IS_INMEMORY(src->prev)) {
71 } else if (!IS_INMEMORY(src) && IS_INMEMORY(src->prev)) {
77 if (IS_INMEMORY(src) && IS_INMEMORY(src->prev)) {
79 } else if (IS_INMEMORY(src) && !IS_INMEMORY(src->prev)) {
81 } else if (!IS_INMEMORY(src) && IS_INMEMORY(src->prev)) {
89 For most ICMDs the generated code can be further optimized when one
90 source variable and the destination variable share the same local
93 To be backward compatible, mostly in respect of embedded systems, all
94 generated code can be run on i386 compatible systems.
96 Another problem was the access to the functions' data segment. Since
97 RISC platforms like Alpha and MIPS have a procedure vector register,
98 for the IA32 platform there had to be implemented a special handling
99 for accesses to the data segment, like \texttt{ICMD\_PUTSTATIC} and
100 \texttt{ICMD\_GETSTATIC} instructions. The solution is like the
101 handling of \textit{jump references} or \textit{check cast
102 references}, which also have to be code patched, when the code and
103 data segment are relocated. This means, there has to be an extra
104 \textit{immediate-to-register} move (\texttt{i386\_mov\_imm\_reg()})
105 before every \texttt{ICMD\_PUTSTATIC}/\texttt{ICMD\_GETSTATIC}
106 instruction, which moves the start address of the procedure, and thus
107 the start address of the data segment, in one of the temporary
108 registers (code snippet from \texttt{ICMD\_PUTSTATIC}):
111 i386_mov_imm_reg(0, REG_ITMP2);
112 dseg_adddata(mcodeptr);
115 The \texttt{dseg\_adddata()} call inserts the current postion of the
116 code generation pointer into a datastructure which is later processed
117 by \texttt{codegen\_finish()}, where the final address of the data
121 \subsection{Constant handling}
123 Unlike RISC machines the IA32 architecture has \textit{immediate move}
124 instructions which can handle the maximum bitsize of the
125 registers. Thus the IA32 port of CACAO does not have to load big
126 constants indirect from the data segment, which means a \textit{memory
127 load} instruction, but can move 32 bit constants \textit{inline} into
128 their destination registers.
131 i386_mov_imm_reg(0xcafebabe, REG_ITMP1);
134 For constants bigger than 32 bits up to 64 bits, we split the move
135 up into two immediate move instructions.
138 \subsection{Calling conventions}
140 The normal calling conventions of the IA32 processor is passing all
141 function arguments on the stack~\cite{IA32vol1}. The store size on the
142 stack depends on the data type (see table
143 \ref{ia32callingconventionstackstoresizes}).
147 \begin{tabular}[b]{|l|c|}
149 JAVA Data Type & Bytes \\ \hline
150 \texttt{boolean} & \\
153 \texttt{short} & 4 \\
156 \texttt{float} & \\ \hline
158 \texttt{double} & 8 \\ \hline
160 \caption{IA32 calling convention stack store sizes}
161 \label{ia32callingconventionstackstoresizes}
165 This convention has been changed for CACAO in a way, that each
166 datatype uses always 8 bytes on the stack. due to this fact after
170 void sub(int i, long l, float f, double d);
173 the stack layout looks like in figure \ref{stacklayout}.
177 \setlength{\unitlength}{1mm}
178 \begin{picture}(50,54)
180 \put(0,0){\framebox(24,54){}}
181 \put(0,42){\line(1,0){24}}
182 \put(0,36){\line(1,0){24}}
183 \put(0,30){\line(1,0){24}}
184 \put(0,18){\line(1,0){24}}
185 \put(0,12){\line(1,0){24}}
186 \put(0,6){\line(1,0){24}}
187 \put(30,0){\vector(-1,0){6}}
188 \put(30,3){\makebox(24,6){\textit{+4 bytes}}}
189 \put(30,-3){\makebox(24,6){\textit{stack pointer}}}
191 \put(0,45){\makebox(24,6){\texttt{d}}}
192 \put(0,36){\makebox(24,6){\textit{unused}}}
193 \put(0,30){\makebox(24,6){\texttt{f}}}
194 \put(0,21){\makebox(24,6){\texttt{l}}}
195 \put(0,12){\makebox(24,6){\textit{unused}}}
196 \put(0,6){\makebox(24,6){\texttt{i}}}
197 \put(0,0){\makebox(24,6){return address}}
199 \caption{CACAO IA32 stack layout after function call}
204 If the function passes a 32-bit variable, CACAO just push 4 bytes onto
205 the stack and leave the remaining 4 bytes untouched. This does not
206 make any problems since CACAO does not read a 64-bit value from a
207 32-bit location. Passing a 64-bit value is straightforward.
209 With this adaptation, it is possible to use the \textit{stack space
210 allocation algorithm} without any changes. The drawback of this
211 decision is, that all arguments of a native function calls have to be
212 copied into a new stack frame and the memory footprint is slightly
215 But calling a native function always means a stack manipulation,
216 because the \textit{JNI environment}, and additionally for
217 \texttt{static} functions the \textit{class pointer}, have to be
218 stored in front of the function parameters. So this negligible.
220 For some \texttt{BUILTIN} functions there are assembler function
221 counterparts, which copy the 8 byte parameters in their correct size
222 in a new stack frame. But this only affects \texttt{BUILTIN} functions
223 with more than one function parameter. To be precise, two functions,
224 namely \texttt{arrayinstanceof} and \texttt{newarray}. So this is not
227 Return parameters are stored in different places, this depends on the
228 return type of the function:
231 \item \texttt{boolean}, \texttt{byte}, \texttt{char}, \texttt{short},
232 \texttt{int}, \texttt{void}: return value resides in \texttt{\%eax}
233 (\texttt{REG\_RESULT})
235 \item \texttt{long}: return value is split up onto the register pair
237 (\texttt{REG\_RESULT2:REG\_RESULT}, high 32-bit in
238 \texttt{\%edx}, low 32-bit in \texttt{\%eax})
240 \item \texttt{float}, \texttt{double}: return value resides in the
241 \textit{top of stack} element of the \textit{floating point unit}
242 stack (\texttt{st(0)}, described in more detail in section
243 \ref{ia32floatingpointarithmetic})
247 \subsection{Register allocation}
248 \label{sectionia32registerallocation}
250 Register usage was another problem in porting the CACAO to IA32. An
251 IA32 processor has 8 integer general-purpose registers (GPR), of which
252 one is the \textit{stack pointer} (SP) and thus can not be used for
253 arithmetic instructions. From the remaining 7 registers, in
254 \textit{worst-case instructions} like \texttt{CHECKCAST} or
255 \texttt{INSTANCEOF}, 3 temporary registers need to be reserved for
256 storing temporary values. Due to this fact there are 4 integer
257 registers available for arithmetic operations.
259 CACAO uses \texttt{\%ebp}, \texttt{\%esi}, \texttt{\%edi} as callee
260 saved registers, which are callee saved registers in the IA32 ABI and
261 \texttt{\%ebx} as scratch register, which is also a callee saved
262 register in the IA32 ABI. The remaining \texttt{\%eax}, \texttt{\%ecx}
263 and \texttt{\%edx} registers are used as the previously mentioned
266 The register allocator itself is very straightforward, this means, it
267 does neither \textit{linear scan} nor any other analyse of the methods
268 variables, but allocates registers for the local variables in order as
269 they are defined---\textit{first-come-first-serve}. This may result in
270 a fairly good register allocation on RISC machines, as there are
271 almost always enough registers available for the functions local
272 variables, but can result in a really bad allocation on IA32
275 So the first step to make the IA32 port more competitive with Sun's or
276 IBM's JVM would be to rewrite the register allocator.
278 Only small register allocator changes were necessary for the IA32
279 port. Since CACAO---on the IA32 architecture---stores all
280 \texttt{long} variables, because of lack of integer general-purpose
281 registers, in memory locations (described in more detail in section
282 \ref{sectionia32longarithmetic}) the register allocator has to be
283 adapted to support this feature. This means all \texttt{long}
284 variables are assigned to stack locations and tagged with the
285 \texttt{INMEMORY} flag.
288 \subsection{Long arithmetic}
289 \label{sectionia32longarithmetic}
291 Unlike the PowerPC port, the IA32 port cannot easily store
292 \texttt{long}'s in two 32-bit integer registers, since there are too
293 little of them. Maybe this could bring a speedup, if the register
294 allocator would be more intelligent or in leaf functions which have
295 only \texttt{long} variables. But this is not implemented yet. So, the
296 current approach is to store all \texttt{long}'s in memory, this means
297 they are always spilled.
299 Nearly all \texttt{long} instructions are inline, except two of them:
300 \texttt{ICMD\_LDIV} and \texttt{ICMD\_LREM}. These two are computed
301 via \texttt{BUILTIN} calls. It would also be possible to inline them,
302 but the code size would be too big and the latency of the
303 \texttt{idiv} machine instruction is so high, that the function calls
306 The IA32 processor has some machine instructions which are
307 specifically designed for 64-bit operations. With them several 64-bit
308 integer arithemtic operations can be implemented very
309 efficiently~\cite{AMDopt}. Some of them are
312 \item \texttt{cltd} --- Convert Signed Long to Signed Double Long
313 \item \texttt{adc} --- Integer Add With Carry
314 \item \texttt{sbb} --- Integer Subtraction With Borrow
317 Thus some of the 64-bit calculations like \texttt{ICMD\_LADD} or
318 \texttt{ICMD\_LSUB} could be executed in two instructions, if both
319 operand would reside in registers. But this does not apply to CACAO,
322 The generated machine code of intermediate commands which operate on
323 \texttt{long} variables instructions always operate on 64-bit, even if
324 it is not necessary. The dependency information that would be required
325 to just operate on the lower or upper half of the \texttt{long}
326 variable, is not generated by CACAO.
329 \subsection{Floating point arithmetic}
330 \label{ia32floatingpointarithmetic}
332 Since the i386, with it's i387 extension or the i486, the IA32
333 processor has a \textit{floating point unit} (FPU). This FPU is
334 implemented as a stack with 8 elements (see table \ref{FPUStack}).
338 \begin{tabular}[b]{|l|l|}
340 st(x) & FPU Data Register Stack \\ \hline
341 0 & TOS (Top Of Stack) \\ \hline
350 \caption{x87 FPU Data Register Stack}
355 This stack is designed to wrap around if values are loaded to the
356 \textit{top of stack} (TOS). For this reason, it has a special register which
357 points to the TOS. This pointer is increased if a load instruction to
358 the TOS is executed and decreased for a store from the TOS.
360 At first sight, the stack design of the FPU is perfect for the stack
361 based design of a Java Virtual Machine. But there is one problem. The
362 JVM stack has no fixed size, so it can grow up to much more than 8
363 elements and this simply results in an stack wrap around and thus an
364 stack overflow. For this reason it it necessary to implement an
365 \textit{stack-element-to-register-mapping}.
367 A very basic design idea, is to define a pointer to the current TOS
368 offset (\texttt{fpu\_st\_offset}). With this offset the current
369 register position in the FPU stack of an arbitrary register can
370 determined. From the 8 stack elements the last two ones
371 (\texttt{st(6), st(7)}) are reserved, so two memory operands can be
372 loaded onto the stack and to preform an arithmetic operation. Most
373 IA32 floating point arithmetic operations have an \textit{do
374 arithmetic and pop one element} version of the instruction, that means
375 the float arithmetic is done and the TOS element is poped off. The
376 remaining stack element has the result of the calculation. On the
377 example of the \texttt{ICMD\_FADD} intermediate command with two
378 memory operands, it looks like this:
381 var_to_reg_flt(s1, src->prev, REG_FTMP1); /* load 1st operand in st(0), increase
383 var_to_reg_flt(s2, src, REG_FTMP2); /* load 2nd operand in st(0), increase
385 i386_faddp(); /* add 2 upper most elements st(1) = st(1) + st(0) -- pop st(0) */
386 fpu_st_offset--; /* decrease fpu_st_offset from previous pop */
387 store_reg_to_var_flt(iptr->dst, d); /* store result -- decrease fpu_st_offset */
390 This mapping works very good with \textit{scratch registers}
391 only. Defining \textit{callee saved float registers} makes some
392 problemes since the IA32 ABI has no callee saved float registers. This
393 would need a special handling in the \textit{native stub} of a native
394 function, namely saving the registers and cleaning the whole FPU
395 stack, because a C function expects a clear FPU stack.
397 Basically the IA32 FPU can handle 3 float data types:
400 \item single-precision (32 bit)
401 \item double-precision (64 bit)
402 \item double extended-precision (80 bit)
405 The FPU has a 16 bit \textit{control register} which has a
406 \textit{precision control field} (PC) and a \textit{rounding control
407 field} (RC), each of 2 bit length (see table \ref{PCField} and
412 \begin{tabular}[b]{|l|c|}
414 Precision & PC Field \\ \hline
415 single-precision (32 bit) & 00B \\ \hline
416 reserved & 01B \\ \hline
417 double-precision (64 bit) & 10B \\ \hline
418 double extended-precision (80 bit) & 11B \\ \hline
420 \caption{Precision Control Field (PC)}
427 \begin{tabular}[b]{|l|c|}
429 Rounding Mode & RC Field \\ \hline
430 round to nearest (even) & 00B \\ \hline
431 round down (toward -$\infty$) & 01B \\ \hline
432 round up (toward +$\infty$) & 10B \\ \hline
433 round toward zero (truncate) & 11B \\ \hline
435 \caption{Rounding Control Field (RC)}
440 The internal data type used by the FPU is always the \textit{double
441 extended-precision} (80 bit) format. Therefore implementing a IEEE 754
442 compliant floating point code on IA32 processors is not trivial.
444 Correct rounding to \textit{single-precision} or
445 \textit{double-precision} is only done if the value is stored into
446 memory. This means for certain instructions, like \texttt{ICMD\_FMUL}
447 or \texttt{ICMD\_FDIV}, a special technique called
448 \textit{store-load}, has to be implemented~\cite{OgKoNa02}. This
449 technique is in fact very simple but takes two memory accesses more
450 for this instruction.
452 For single-precision floats the \textit{store-load} code could looks
456 i386_fstps_membase(REG_SP, 0); /* store single-precision float to stack */
457 i386_flds_membase(REG_SP, 0); /* load single-precision float from stack */
460 Another technique which has to be implemented to be IEEE 754
461 compliant, is \textit{precision mode switching}. Mode switching on the
462 IA32 processor is made with the \texttt{fldcw}---load control
463 word---instruction. A \texttt{fldcw} instruction has a quite large
464 overhead, so frequently mode switches should be avoided. For this
465 technique there are two different approaches:
468 \item \textbf{Mode switch on float arithmetic} --- switch the FPU on
469 initialization in one precision mode, mostly \textit{double-precision
470 mode} because \texttt{double} arithmetic is more common. With this
471 setting \texttt{double}s are calculated correctly. To handle
472 \texttt{float}s in this approach, the FPU has to be switched into
473 \textit{single-precision mode} before each \texttt{float} instruction
474 and switched back afterwards. Needless to say, that this is only
475 useful, if \texttt{float} arithmetic is sparse. For a simple
481 float e = f * f + g * g;
484 the generated ICMD's look like this (with comments where precision
485 mode switches take place):
489 <fpu in double-precision mode>
492 <switch fpu to single-precision mode>
494 <switch fpu to double-precision mode>
497 <switch fpu to single-precision mode>
499 <switch fpu to double-precision mode>
500 <switch fpu to single-precision mode>
502 <switch fpu to double-precision mode>
507 For this corner case situation the mode switches are extrem, but the
508 example should demonstrate how this technique works.
510 \item \textbf{Mode switch on float data type change} --- switch the
511 FPU on initialization in on precision mode, like before, in
512 \textit{double-precision mode}. But the difference on this approach
513 is, that the precision mode is only switched if the float data type
514 is changed. That means if your calculation switches from
515 \texttt{double} arithmetic to \texttt{float} arithmetic or
516 backwards. This technique makes much sense due to the fact that there
517 are always a bunch of instructions of the same data type in one row
518 in a normal program. Now the same example as before with this
523 <fpu in double-precision mode>
526 <switch fpu to single-precision mode>
536 After this code sequence the FPU is in \textit{single-precision mode}
537 and if a function return would occur, the caller function would not
538 know which FPU mode is currently set. One solution would be to reset
539 the FPU to \textit{double-precision mode} on a function return, if
540 the actual mode is \textit{single-precision}.
543 Each of these FPU techniques described have been implemented in
544 CACAO's IA32 port, but the results were not completly IEEE 754
545 compliant. So the CACAO developer team decided to be on the safe side
546 and to store all float variables in memory, until we have found a
547 solution which is fast and 100\% compliant.
550 \subsection{Exception handling}
552 The exception handling for the IA32 architecture is implemented as
553 intended to be for CACAO. To handle the common and unexpected, but
554 often checked, \texttt{java.lang.NullPointerException} very fast,
555 CACAO uses \textit{hardware null-pointer checking}. That means a
556 signal handler for the \texttt{SIGSEGV} operating system signal is
557 installed. If the signal is hit, the CACAO signal handler forwards the
558 exception to CACAO's internal exception handling system. So if an
559 instruction tries to access the memory at address \texttt{0x0}, a
560 \texttt{SIGSEGV} signal is raised because the memory page is not read
561 or writeable. After the signal is hit, the handler has to be
562 reinstalled, so that further exceptions can be catched. This is done
563 in the handler itself.
565 The \texttt{SIGSEGV} handler is used on any architecture CACAO has
566 been ported to. Additionally on the IA32 architecture a handler for
567 the \texttt{SIGFPE} signal is installed. With this handler a
568 \texttt{java.lang.ArithmeticException}'s for integer \textit{/ by
569 zero} can be catched in hardware and there is no need to write helper
570 functions, like \texttt{asm\_builtin\_idiv}, which check the division
571 operands as this is done for the Alpha or MIPS port.