doc/handbook/x86.tex

   1 \section{IA32 (x86, i386) code generator}
   2
   3 Porting to the famous x86 platform was more effort than
   4 expected. CACAO was designed to run on RISC machines from ground up,
   5 so the whole code generation part has to be adapted. The first
   6 approach was to replace the simple RISC macros with x86 code, but this
   7 turned out to be not successful. So new x86 code generation macros
   8 were written, with no respect to the old RISC macros.
   9
  10 Some smaller problems occured since the x86 port was the first 32 bit
  11 target platform, like segmentation faults due to heap corruption,
  12 which turned out to be a simple \texttt{for} loop bug only hit on 32
  13 bit systems. Most of the CACAO system already was
  14 \textit{32-bit-ready}, namely an architecture dependent
  15 \texttt{types.h} with definitions of the used datatypes and some
  16 feature flags, which features the processor itself natively
  17 supports. Most noticeable change was the \texttt{s8} and \texttt{u8}
  18 datatype, changed from \texttt{long} to \texttt{long long} to support
  19 64 bit calculations.
  20
  21
  22 \subsection{Code generation}
  23
  24 One big difference in writing the new code generation macros was, that
  25 the x86 architecture is not a \textit{load-store architecture} like
  26 the RISC machines, but the \textit{machine instructions} can handle
  27 both \textit{memory operands} and \textit{register operands}. This led
  28 to a much more complicated handling of the various ICMDs. The typical
  29 handling of an ICMD on RISC machines looks like this (on the example
  30 of the integer add ICMD):
  31
  32 \begin{verbatim}
  33         case ICMD_IADD:
  34             var_to_reg_int(s1, src->prev, REG_ITMP1);
  35             var_to_reg_int(s2, src, REG_ITMP2);
  36             d = reg_of_var(iptr->dst, REG_ITMP3);
  37             M_IADD(s1, s2, d);
  38             store_reg_to_var_int(iptr->dst, d);
  39             break;
  40 \end{verbatim}
  41
  42 This means loading the two \textit{source operands} from memory to
  43 temporary registers, if necessary, getting a \textit{destination
  44 register}, do the calculation and store the result to memory, if the
  45 destination variable resides in memory. If all operands are assigned
  46 to registers, only the calculation is done. This design also works on
  47 x86 machines but leads to much bigger code size, reduces decoding
  48 bandwith and increases register pressure in the processor itself,
  49 which results in lower performance \ref{}. Thus we use all kinds of
  50 instruction types that are available and decide which one we have to
  51 use in some \texttt{if} statements:
  52
  53 \begin{verbatim}
  54         if (IS_INMEMORY(iptr->dst)) {
  55             if (IS_INMEMORY(src) && IS_INMEMORY(src->prev)) {
  56                 ...
  57             } else if (IS_INMEMORY(src) && !IS_INMEMORY(src->prev)) {
  58                 ...
  59             } else if (!IS_INMEMORY(src) && IS_INMEMORY(src->prev)) {
  60                 ...
  61             } else {
  62                 ...
  63             }
  64         } else {
  65             if (IS_INMEMORY(src) && IS_INMEMORY(src->prev)) {
  66                 ...
  67             } else if (IS_INMEMORY(src) && !IS_INMEMORY(src->prev)) {
  68                 ...
  69             } else if (!IS_INMEMORY(src) && IS_INMEMORY(src->prev)) {
  70                 ...
  71             } else {
  72                 ...
  73             }
  74         }
  75 \end{verbatim}
  76
  77 For most ICMDs we can further optimize the generated code when one
  78 source variable and the destination variable share the same local
  79 variable.
  80
  81 To be backward compatible, mostly in respect of embedded systems, all
  82 generated code can be run on i386 systems.
  83
  84 Another problem was the access to the functions data segment. Since
  85 RISC platforms like ALPHA and MIPS have a procedure pointer register,
  86 for the x86 platform there had to be implemented a special handling
  87 for accesses to the data segment, like \texttt{PUTSTATIC} and
  88 \texttt{GETSTATIC} instructions. The solution is like the handling of
  89 \textit{jump references} or \textit{check cast references}, which also
  90 have to be code patched, when the code and data segment are
  91 relocated. This means, there has to be an extra
  92 \textit{immediate-to-register} move (\texttt{i386\_mov\_imm\_reg()})
  93 before every \texttt{PUT}/\texttt{GETSTATIC} instruction, which moves
  94 the start address of the procedure, and thus the start address of the
  95 data segment, in one of the temporary registers (code snippet from
  96 \texttt{PUTSTATIC}):
  97
  98 \begin{verbatim}
  99         i386_mov_imm_reg(0, REG_ITMP2);
 100         dseg_adddata(mcodeptr);
 101 \end{verbatim}
 102
 103 The \texttt{dseg\_adddata()} call inserts the current postion of the
 104 code generation pointer into a datastructure which is later processed
 105 by \texttt{codegen\_finish()}, where the final address of the data
 106 segment is patched.
 107
 108
 109 \subsection{Constant handling}
 110
 111 Unlike RISC machines the x86 architecture has \textit{immediate move}
 112 instructions which can handle the maximum bitsize of the
 113 registers. Thus we don't have to load big constants indirect from the
 114 data segment, which means a \textit{memory load} instruction, but we
 115 can move 32 bit constants \textit{inline} into their destination
 116 registers.
 117
 118 \begin{verbatim}
 119         i386_mov_imm_reg(0xcafebabe, REG_ITMP1);
 120 \end{verbatim}
 121
 122 For constants bigger than 32 bits up to 64 bits, we split the move
 123 up into two immediate move instructions.
 124
 125
 126 \subsection{Calling conventions}
 127
 128 The normal calling convention of the x86 processor is passing all
 129 function arguments on the stack. The store size depends on the data
 130 type (the following types reflect the JAVA data types):
 131
 132 \begin{itemize}
 133  \item \texttt{boolean}, \texttt{byte}, \texttt{char}, \texttt{short}, \texttt{int},
 134        \texttt{float}, \texttt{void} --- 4 bytes
 135  \item \texttt{long}, \texttt{double} --- 8 bytes
 136 \end{itemize}
 137
 138 We changed this convention for CACAO in a way, that we are using
 139 always 8 bytes on the stack for each datatype. After calling the function
 140
 141 \begin{verbatim}
 142         void sub(int i, long l, float f, double d);
 143 \end{verbatim}
 144
 145 we have a stack layout like in figure \ref{stacklayout}.
 146
 147 \begin{figure}[htb]
 148 \begin{center}
 149 \setlength{\unitlength}{1mm}
 150 \begin{picture}(50,54)
 151 \thicklines
 152 \put(0,0){\framebox(24,54){}}
 153 \put(0,42){\line(1,0){24}}
 154 \put(0,36){\line(1,0){24}}
 155 \put(0,30){\line(1,0){24}}
 156 \put(0,18){\line(1,0){24}}
 157 \put(0,12){\line(1,0){24}}
 158 \put(0,6){\line(1,0){24}}
 159 \put(30,0){\vector(-1,0){6}}
 160 \put(30,3){\makebox(24,6){\textit{+4 bytes}}}
 161 \put(30,-3){\makebox(24,6){\textit{stack pointer}}}
 162
 163 \put(0,45){\makebox(24,6){\textit{double value}}}
 164 \put(0,36){\makebox(24,6){\textit{unused}}}
 165 \put(0,30){\makebox(24,6){\textit{float value}}}
 166 \put(0,21){\makebox(24,6){\textit{long value}}}
 167 \put(0,12){\makebox(24,6){\textit{unused}}}
 168 \put(0,6){\makebox(24,6){\textit{int value}}}
 169 \put(0,0){\makebox(24,6){\textit{return address}}}
 170 \end{picture}
 171 \caption{CACAO x86 stack layout after function call}
 172 \label{stacklayout}
 173 \end{center}
 174 \end{figure}
 175
 176 If we pass a 32 bit variable, we just push 4 bytes onto the stack and
 177 leave the remaining 4 bytes untouched. This makes no problems since we
 178 do not read a 64 bit value from a 32 bit location. Passing a 64 bit
 179 value is straightforward.
 180
 181 With this adaptation, it was possible to use the \textit{stack space
 182 allocation algorithm} without any changes. The drawback of this
 183 decision was, that we have to copy all arguments of a native function
 184 call into a new stack frame and we have a slightly bigger memory
 185 footprint.
 186
 187 But calling a native function always means a stack manipulation,
 188 because you have to put the \textit{JNI environment}, and additionally
 189 for \texttt{static} functions the \textit{class pointer}, in front of
 190 the function parameters. So this negligible.
 191
 192 For some \texttt{BUILTIN} functions there had to be written
 193 \texttt{asm\_} counterparts, which copy the 8 byte parameters in their
 194 correct size in a new stack frame. But this only affected
 195 \texttt{BUILTIN} functions with more than 1 function parameter. To be
 196 exactly, 2 functions, namely \texttt{arrayinstanceof} and
 197 \texttt{newarray}. So this is not a big speed impact.
 198
 199 Return parameters are stored in different places, this depends on the
 200 return type of the function:
 201
 202 \begin{itemize}
 203  \item \texttt{boolean}, \texttt{byte}, \texttt{char}, \texttt{short},
 204  \texttt{int}, \texttt{void}: return value resides in \texttt{\%eax}
 205  (\texttt{REG\_RESULT})
 206
 207  \item \texttt{long}: return value is split up onto the register pair
 208  \texttt{\%edx:\%eax}
 209  (\texttt{REG\_RESULT2:REG\_RESULT}, high 32 bit in
 210  \texttt{\%edx}, low 32 bit in \texttt{\%eax})
 211
 212  \item \texttt{float}, \texttt{double}: return value resides in the
 213  \textit{top of stack} element of the \textit{floating point unit}
 214  stack (\texttt{st(0)}, described in detail later)
 215 \end{itemize}
 216
 217
 218 \subsection{Register allocator}
 219
 220 Register usage was another problem in porting the CACAO to x86. An x86
 221 processor has 8 genernal purpose registers (GPR), of which one is the
 222 \textit{stack pointer} (SP) and thus it can not be used for arithmetic
 223 instructions. From the remaining 7 registers, in \textit{worst-case
 224 instructions} like \texttt{CHECKCAST} or \texttt{INSTANCEOF}, we need
 225 to reserve 3 temporary registers. So we have 4 registers available.
 226
 227 We use \texttt{\%ebp}, \texttt{\%esi}, \texttt{\%edi} as callee saved
 228 registers (which are callee saved registers in the x86 ABI) and
 229 \texttt{\%ebx} as scratch register (which is also a callee saved
 230 register in the x86 ABI, but we need some scratch registers). So we
 231 have a lack of scratch registers. But for most ICMD instructions, we
 232 do not need all, or sometimes none, of the temporary registers.
 233
 234 This fact we use in the \texttt{analyse\_stack()} pass. We try to use
 235 \texttt{\%edx} (which is \texttt{REG\_ITMP3}) and \texttt{\%ecx} (which
 236 is \texttt{REG\_ITMP2}) as scratch registers for the register
 237 allocator if certain ICMD instructions are not used in the compiled
 238 method. So for \textit{best-case situations} CACAO has 3
 239 \textit{callee saved} and 3 \textit{scratch} registers.
 240
 241 This analysis should be changed from \textit{method level} to
 242 \textit{basic-block level}, so CACAO could emit better code on x86.
 243
 244 The register allocator itself is very straightforward, this means, it
 245 does neither \textit{linear scan} nor any other analyse of the methods
 246 variables, but allocates registers for the local variables in order as
 247 they are defined. This may result in good code on RISC machines, as
 248 there are almost always enough registers available, with 32 registers,
 249 but can produce really bad code on x86 processors.
 250
 251 So the first step to make the x86 port more competitive with SUN's or
 252 IBM's JVM would be to rewrite the register allocator.
 253
 254 Basically the allocation order of the register allocator is as
 255 follows:
 256
 257 \begin{itemize}
 258  \item interface register allocation
 259  \item scratch register allocation
 260  \item local register allocation
 261 \end{itemize}
 262
 263 The only change which had to be made to all allocator passes, was the
 264 handling of \texttt{long} variables, because they are all spilled to
 265 memory (described in more detail in \ref{LongArithmetic}).
 266
 267
 268 \subsection{Long arithmetic}\label{LongArithmetic}
 269
 270 Unlike the PowerPC port, we cannot put \texttt{long}'s in two 32 bit
 271 integer registers, since we have to little of them. Maybe this could
 272 bring a speedup, if the register allocator would be more intelligent
 273 or in leaf functions which have only \texttt{long} variables. But this
 274 is not implemented yet. So, the current approach is to store all
 275 \texttt{long}'s in memory, this means they are always spilled.
 276
 277 Nearly all \texttt{long} instructions are inline, except two of them:
 278 \texttt{LDIV} and \texttt{LREM}. These two are computed via
 279 \texttt{BUILTIN} calls. It would be definitely possible to also
 280 inline them, but the code size is too big and the latency is so high,
 281 that the function calls are negligible.
 282
 283 The x86 processor has some machine instructions which are specifically
 284 designed for 64 bit operations. Some of them are
 285
 286 \begin{itemize}
 287  \item \texttt{cltd} --- Convert Signed Long to Signed Double Long
 288  \item \texttt{adc} --- Integer Add With Carry
 289  \item \texttt{sbb} --- Integer Subtraction With Borrow
 290 \end{itemize}
 291
 292 Thus some of the 64 bit calculations like \texttt{LADD} or
 293 \texttt{LSUB} could be executed in two instructions, if both
 294 operand would reside in registers. But this does not apply to CACAO,
 295 yet.
 296
 297 All of the \texttt{long} instructions operate on 64 bit, even if it is
 298 not necessary. The dependency information that would be needed to just
 299 operate on the lower or upper half of the \texttt{long} variable, is
 300 not generated by CACAO.
 301
 302
 303 \subsection{Floating point arithmetic}
 304
 305 Since the i386, with it's i387 counterpart or the i486, the x86
 306 processor has a \textit{floating point unit} (FPU). This FPU is
 307 implemented as a stack with 8 elements (see table \ref{FPUStack}).
 308
 309 \begin{table*}
 310 \begin{center}
 311 \begin{tabular}[b]{|l|l|}
 312 \hline
 313 st(x) & FPU Data Register Stack \\ \hline
 314 0     & TOS (Top Of Stack) \\ \hline
 315 1     & \\ \hline
 316 2     & \\ \hline
 317 3     & \\ \hline
 318 4     & \\ \hline
 319 5     & \\ \hline
 320 6     & \\ \hline
 321 7     & \\ \hline
 322 \end{tabular}
 323 \caption{x87 FPU Data Register Stack}
 324 \label{FPUStack}
 325 \end{center}
 326 \end{table*}
 327
 328 This stack is designed to wrap around if values are loaded to the
 329 \textit{top of stack} (TOS). For this reason, it has a special register which
 330 points to the TOS. This pointer is increased if a load instruction to
 331 the TOS is executed and decreased for a store from the TOS.
 332
 333 At first sight, the stack design of the FPU is perfect for the stack
 334 based design of a \textit{java virtual machine} (JVM). But there is a
 335 big problem. The JVM stack has no fixed size, so it can grow up to
 336 much more than 8 elements and we get an stack wrap around and thus an
 337 stack overflow. For this reason we need to implement an
 338 \textit{stack-element-to-register-mapping}.
 339
 340 A very basic design idea, is to define a pointer to the current TOS
 341 offset (\texttt{fpu\_st\_offset}). With this offset we can determine
 342 the current register position in the FPU stack of an arbitrary
 343 register.  From the 8 stack elements we need to reserve the last two
 344 ones (\texttt{st(6), st(7)}), so we can load two memory operands onto
 345 the stack and do the arithmetic on them. Most x86 floating point
 346 arithmetic operations have an \textit{do arithmetic and pop one
 347 element} version of the instruction, that means the float arithmetic
 348 is done and the TOS element is poped off. The remaining stack element
 349 has the result of the calculation. On the example of the \texttt{FADD}
 350 ICMD with two memory operands, it looks like this:
 351
 352 \begin{verbatim}
 353 var_to_reg_flt(s1, src->prev, REG_FTMP1); /* load 1st operand in st(0), increase
 354                                              fpu_st_offset                          */
 355 var_to_reg_flt(s2, src, REG_FTMP2);       /* load 2nd operand in st(0), increase
 356                                              fpu_st_offset                          */
 357 i386_faddp();       /* add 2 upper most elements st(1) = st(1) + st(0) -- pop st(0) */
 358 fpu_st_offset--;                        /* decrease fpu_st_offset from previous pop */
 359 store_reg_to_var_flt(iptr->dst, d); /* store result -- decrease fpu_st_offset       */
 360 \end{verbatim}
 361
 362 This mapping works very good with \textit{scratch registers}
 363 only. Defining \textit{callee saved float registers} makes some
 364 problemes since the x86 ABI has no callee saved float registers. This
 365 would need a special handling in the \textit{native stub} of a native
 366 function, namely saving the registers and cleaning the whole FPU
 367 stack, because a C function expects a clear FPU stack.
 368
 369 Basically the x86 FPU can handle 3 float data types:
 370
 371 \begin{itemize}
 372  \item single-precision (32 bit)
 373  \item double-precision (64 bit)
 374  \item double extended-precision (80 bit)
 375 \end{itemize}
 376
 377 The FPU has a 16 bit \textit{control register} which has a
 378 \textit{precision control field} (PC) and a \textit{rounding control
 379 field} (RC), each of 2 bit length (see table \ref{PCField} and
 380 \ref{RCField}).
 381
 382 \begin{table*}
 383 \begin{center}
 384 \begin{tabular}[b]{|l|c|}
 385 \hline
 386 Precision                          & PC Field \\ \hline
 387 single-precision (32 bit)          & 00B      \\ \hline
 388 reserved                           & 01B      \\ \hline
 389 double-precision (64 bit)          & 10B      \\ \hline
 390 double extended-precision (80 bit) & 11B      \\ \hline
 391 \end{tabular}
 392 \caption{Precision Control Field (PC)}
 393 \label{PCField}
 394 \end{center}
 395 \end{table*}
 396
 397 \begin{table*}
 398 \begin{center}
 399 \begin{tabular}[b]{|l|c|}
 400 \hline
 401 Rounding Mode                 & RC Field \\ \hline
 402 round to nearest (even)       & 00B      \\ \hline
 403 round down (toward -$\infty$) & 01B      \\ \hline
 404 round up (toward +$\infty$)   & 10B      \\ \hline
 405 round toward zero (truncate)  & 11B      \\ \hline
 406 \end{tabular}
 407 \caption{Rounding Control Field (RC)}
 408 \label{RCField}
 409 \end{center}
 410 \end{table*}
 411
 412 The internal data type used by the FPU is always the \textit{double
 413 extended-precision} (80 bit) format. Therefore implementing a IEEE 754
 414 compliant floating point code on x86 processors is not trivial.
 415
 416 Correct rounding to \textit{single-precision} or
 417 \textit{double-precision} is only done if the value is stored into
 418 memory. This means for certain instructions, like \texttt{FMUL} or
 419 \texttt{FDIV}, a special technique called \textit{store-reload}, has
 420 to be implemented. This technique is in fact very simple but takes two
 421 memory accesses more for this instruction.
 422
 423 For single-precision floats the \textit{store-reload} code could looks
 424 like this:
 425
 426 \begin{verbatim}
 427 i386_fstps_membase(REG_SP, 0);    /* store single-precision float to stack  */
 428 i386_flds_membase(REG_SP, 0);     /* load single-precision float from stack */
 429 \end{verbatim}
 430
 431 Another technique which has to be implemented to be IEEE 754
 432 compliant, is \textit{precision mode switching}. Mode switching on the
 433 x86 processor is made with the \texttt{fldcw} (load control word)
 434 instruction. A \texttt{fldcw} instruction has a quite large overhead,
 435 so frequently mode switches should be avoided. For this technique
 436 there are two different approaches:
 437
 438 \begin{itemize}
 439  \item \textbf{Mode switch on float arithmetic} --- switch the FPU on
 440  initialization in one precision mode, mostly \textit{double-precision
 441  mode} because \texttt{double} arithmetic is more common. With this
 442  setting \texttt{doubles} are calculated correctly. To handle
 443  \texttt{floats} in this approach, the FPU has to be switched into
 444  \textit{single-precision mode} before each \texttt{float} instruction
 445  and switched back afterwards. Needless to say, that this is only
 446  useful, if \texttt{float} arithmetic is sparse. For a simple
 447  calculation like
 448
 449  \begin{verbatim}
 450         float f = 1.234F;
 451         float g = 2.345F;
 452         float e = f * f + g * g;
 453  \end{verbatim}
 454
 455  the generated ICMD's look like this (with comments where precision
 456  mode switches take place):
 457
 458  \begin{verbatim}
 459         ...
 460         <fpu in double-precision mode>
 461         FLOAD         1
 462         FLOAD         1
 463         <switch fpu to single-precision mode>
 464         FMUL
 465         <switch fpu to double-precision mode>
 466         FLOAD         2
 467         FLOAD         2
 468         <switch fpu to single-precision mode>
 469         FMUL
 470         <switch fpu to double-precision mode>
 471         <switch fpu to single-precision mode>
 472         FADD
 473         <switch fpu to double-precision mode>
 474         FSTORE        3
 475         ...
 476  \end{verbatim}
 477
 478  For this corner case situation the mode switches are extrem, but the
 479  example should demonstrate how this technique works.
 480
 481  \item \textbf{Mode switch on float data type change} --- switch the
 482  FPU on initialization in on precision mode, like before, in
 483  \textit{double-precision mode}. But the difference on this approach
 484  is, that the precision mode is only switched if the float data type
 485  is changed. That means if your calculation switches from
 486  \texttt{double} arithmetic to \texttt{float} or backwards. This
 487  technique makes much sense due to the fact that there are always a
 488  bunch of instructions of the same data type in one row in a normal
 489  program. Now the same example as before with this approach:
 490
 491  \begin{verbatim}
 492         ...
 493         <fpu in double-precision mode>
 494         FLOAD         1
 495         FLOAD         1
 496         <switch fpu to single-precision mode>
 497         FMUL
 498         FLOAD         2
 499         FLOAD         2
 500         FMUL
 501         FADD
 502         FSTORE        3
 503         ...
 504  \end{verbatim}
 505
 506  After this code sequence the FPU is in \textit{single-precision mode}
 507  and if a function return would occur, the caller function would not
 508  know in which mode the FPU is currently. One solution would be to
 509  reset the FPU to \textit{double-precision} on a function return, if
 510  the actual mode is \textit{single-precision}.
 511 \end{itemize}
 512
 513 These techniques and further researches into optimizations which could
 514 be done, are described in \cite{OgKoNa02}.
 515
 516 All of these described FPU techniques have been implemented in CACAO's
 517 x86 port, but the results were not completly IEEE 754 compliant. So
 518 the CACAO developer team decided to be on the safe side and to store
 519 all float variables in memory, until we have found a solution which is
 520 fast and 100\% compliant.
 521
 522
 523 \subsection{Exception handling}
 524
 525 The exception handling for the x86 architecture is implemented as
 526 intended to be for CACAO. To handle the common and unexpected, but
 527 often checked, \texttt{NullPointerException} very fast, we use
 528 \textit{hardware null-pointer checking}. That means we install a
 529 signal handler for the \texttt{SIGSEGV} operating system signal and in
 530 the handler we forward the exception to CACAO's internal exception
 531 handling system. So if an instruction tries to access the memory at
 532 address \texttt{0x0}, a \texttt{SIGSEGV} signal is raised because the
 533 memory page is not read or writeable. After the signal is hit, we have
 534 to reinstall the handler, so we can catch further exceptions and this
 535 is done in the handler itself.
 536
 537 The \texttt{SIGSEGV} handler is used on any architecture to which
 538 CACAO has been ported. Additionally we install a handler for the
 539 \texttt{SIGFPE} on the x86 architecture. With this handler we can
 540 catch \texttt{ArithmeticException}'s for integer \textit{/ by zero} in
 541 hardware and there is no need to write a helper function which checks
 542 the operands, as it has to be done for the ALPHA or MIPS port.