doc/handbook/x86.tex

   1 \section{x86 code generator}
   2
   3 Porting to the famous x86 platform was more effort than
   4 expected. CACAO was designed to run on RISC machines from ground up,
   5 so the code generation part hat to be adapted. The first approach was
   6 to replace the simple RISC macros with x86 code, but this turned out
   7 to be not successful. So new x86 code generation macros where written,
   8 with no respect to the old RISC macros. One big difference in writing
   9 the new code generation macros was, that the x86 architecture is not a
  10 \textit{load-store architecture} like the RISC machines, but the
  11 \textit{machine instructions} can handle both \textit{memory operands}
  12 and \textit{register operands}. This led to a much more complicated
  13 handling of the various ICMDs. The typical handling of an ICMD on RISC
  14 machines looks like this (on the example of the integer add ICMD):
  15
  16 \begin{verbatim}
  17         case ICMD_IADD:
  18             var_to_reg_int(s1, src->prev, REG_ITMP1);
  19             var_to_reg_int(s2, src, REG_ITMP2);
  20             d = reg_of_var(iptr->dst, REG_ITMP3);
  21             M_IADD(s1, s2, d);
  22             store_reg_to_var_int(iptr->dst, d);
  23             break;
  24 \end{verbatim}
  25
  26 This means loading the two \textit{source operands} from memory to
  27 temporary registers, if necessary, getting a \textit{destination
  28 register}, do the calculation and store the result to memory, if the
  29 destination variable resides in memory. If all operands are assigned
  30 to registers, only the calculation is done. This design also works on
  31 x86 machines but leads to much bigger code size, reduces decoding
  32 bandwith and increases register pressure in the processor itself,
  33 which results in lower performance \ref{}. Thus we use all kinds of
  34 instruction types that are available and decide which one we have to
  35 use in some \texttt{if} statements:
  36
  37 \begin{verbatim}
  38         if (IS_INMEMORY(iptr->dst)) {
  39             if (IS_INMEMORY(src) && IS_INMEMORY(src->prev)) {
  40                 ...
  41             } else if (IS_INMEMORY(src) && !IS_INMEMORY(src->prev)) {
  42                 ...
  43             } else if (!IS_INMEMORY(src) && IS_INMEMORY(src->prev)) {
  44                 ...
  45             } else {
  46                 ...
  47             }
  48         } else {
  49             if (IS_INMEMORY(src) && IS_INMEMORY(src->prev)) {
  50                 ...
  51             } else if (IS_INMEMORY(src) && !IS_INMEMORY(src->prev)) {
  52                 ...
  53             } else if (!IS_INMEMORY(src) && IS_INMEMORY(src->prev)) {
  54                 ...
  55             } else {
  56                 ...
  57             }
  58         }
  59 \end{verbatim}
  60
  61 For most ICMDs we can further optimize the generated code when one
  62 source variable and the destination variable share the same local
  63 variable.
  64
  65 To be backward compatible, mostly in respect of embedded systems, all
  66 generated code can be run on i386 systems.
  67
  68 Some smaller problems occured since the x86 port was the first 32 bit
  69 target platform, like segmentation faults due to heap corruption,
  70 which turned out to be a simple \texttt{for} loop bug only hit on 32
  71 bit systems. Most of the CACAO system already was
  72 \textit{32-bit-ready}, namely an architecture dependent
  73 \texttt{types.h} with definitions of the used datatypes and some
  74 feature flags, which features the processor itself natively
  75 supports. Most noticeable change was the \texttt{s8} and \texttt{u8}
  76 datatype from \texttt{long} to \texttt{long long} to support 64 bit
  77 calculations.
  78
  79 Another problem was the access to the functions data segment. Since
  80 RISC platforms like ALPHA and MIPS have a procedure pointer register,
  81 for the x86 platform there had to be implemented a special handling
  82 for accesses to the data segment, like \texttt{PUTSTATIC} and
  83 \texttt{GETSTATIC} instructions. The solution is like the handling of
  84 \textit{jump references} or \textit{check cast references}, which also
  85 have to be code patched, when the code and data segment are
  86 relocated. This means, there has to be an extra
  87 \textit{immediate-to-register} move (\texttt{i386\_mov\_imm\_reg()})
  88 before every \texttt{PUT}/\texttt{GETSTATIC} instruction, which moves
  89 the start address of the procedure, and thus the start address of the
  90 data segment, in one of the temporary registers (code snippet from
  91 \texttt{PUTSTATIC}):
  92
  93 \begin{verbatim}
  94         i386_mov_imm_reg(0, REG_ITMP2);
  95         dseg_adddata(mcodeptr);
  96 \end{verbatim}
  97
  98 The \texttt{dseg\_adddata()} call inserts the current postion of the
  99 code generation pointer into a datastructure which is later processed
 100 by \texttt{codegen\_finish()}, where the final address of the data
 101 segment is patched.
 102
 103
 104 \subsection{Calling conventions}
 105
 106 The normal calling convention of the x86 processor is passing all
 107 function arguments on the stack. The store size depends on the data
 108 type (the following types reflect the JAVA data types):
 109
 110 \begin{itemize}
 111  \item \texttt{boolean}, \texttt{byte}, \texttt{char}, \texttt{short}, \texttt{int},
 112        \texttt{float}, \texttt{void} --- 4 bytes
 113  \item \texttt{long}, \texttt{double} --- 8 bytes
 114 \end{itemize}
 115
 116 We changed this convention for CACAO in a way, that we are using
 117 always 8 bytes on the stack for each datatype. After calling the function
 118
 119 \begin{verbatim}
 120         void sub(int i, long l, float f, double d);
 121 \end{verbatim}
 122
 123 we have a stack layout like in figure \ref{stacklayout}.
 124
 125 \begin{figure}[htb]
 126 \begin{center}
 127 \setlength{\unitlength}{1mm}
 128 \begin{picture}(50,54)
 129 \thicklines
 130 \put(0,0){\framebox(24,54){}}
 131 \put(0,42){\line(1,0){24}}
 132 \put(0,36){\line(1,0){24}}
 133 \put(0,30){\line(1,0){24}}
 134 \put(0,18){\line(1,0){24}}
 135 \put(0,12){\line(1,0){24}}
 136 \put(0,6){\line(1,0){24}}
 137 \put(30,0){\vector(-1,0){6}}
 138 \put(30,3){\makebox(24,6){\textit{+4 bytes}}}
 139 \put(30,-3){\makebox(24,6){\textit{stack pointer}}}
 140
 141 \put(0,45){\makebox(24,6){\textit{double value}}}
 142 \put(0,36){\makebox(24,6){\textit{unused}}}
 143 \put(0,30){\makebox(24,6){\textit{float value}}}
 144 \put(0,21){\makebox(24,6){\textit{long value}}}
 145 \put(0,12){\makebox(24,6){\textit{unused}}}
 146 \put(0,6){\makebox(24,6){\textit{int value}}}
 147 \put(0,0){\makebox(24,6){\textit{return address}}}
 148 \end{picture}
 149 \caption{CACAO x86 stack layout after function call}
 150 \label{stacklayout}
 151 \end{center}
 152 \end{figure}
 153
 154 If we pass a 32 bit variable, we just push 4 bytes onto the stack and
 155 leave the remaining 4 bytes untouched. This makes no problems since we
 156 do not read a 64 bit value from a 32 bit location. Passing a 64 bit
 157 value is straightforward.
 158
 159 With this adaptation, it was possible to use the \textit{stack space
 160 allocation algorithm} without any changes. The drawback of this
 161 decision was, that we have to copy all arguments of a native function
 162 call into a new stack frame and we have a slightly bigger memory
 163 footprint.
 164
 165 But calling a native function always means a stack manipulation,
 166 because you have to put the \textit{JNI environment}, and for
 167 \texttt{static} functions the \textit{class pointer}, in front of the
 168 function parameters. So this negligible.
 169
 170 For some \texttt{BUILTIN} functions there had to be written
 171 \texttt{asm\_} counterparts, which copy the 8 byte parameters in their
 172 correct size in a new stack frame. But this only affected
 173 \texttt{BUILTIN} functions with more than 1 function parameter. To be
 174 exactly, 2 functions, namely \texttt{arrayinstanceof} and
 175 \texttt{newarray}. So this is not a big speed impact.
 176
 177 Return parameters are stored in different places, this depends on the
 178 return type of the function:
 179
 180 \begin{itemize}
 181  \item \texttt{boolean}, \texttt{byte}, \texttt{char}, \texttt{short},
 182  \texttt{int}, \texttt{void}: return value resides in \texttt{\%eax}
 183  (\texttt{REG\_RESULT})
 184
 185  \item \texttt{long}: return value is split up onto the register pair
 186  \texttt{\%edx:\%eax}
 187  (\texttt{REG\_RESULT2:REG\_RESULT}, high 32 bit in
 188  \texttt{\%edx}, low 32 bit in \texttt{\%eax})
 189
 190  \item \texttt{float}, \texttt{double}: return value resides in the
 191  \textit{top of stack} element of the \textit{floating point unit}
 192  stack (\texttt{st(0)}, described in detail later)
 193 \end{itemize}
 194
 195
 196 \subsection{Register allocator}
 197
 198 Register usage was another problem in porting the CACAO to x86. An x86
 199 processor has 8 genernal purpose registers (GPR), of which one is the
 200 \textit{stack pointer} (SP) and thus it can not be used for arithmetic
 201 instructions. From the remaining 7 registers, in \textit{worst-case
 202 instructions} like \texttt{CHECKCAST} or \texttt{INSTANCEOF}, we need
 203 to reserve 3 temporary registers. So we have 4 registers available.
 204
 205 We use \texttt{\%ebp}, \texttt{\%esi}, \texttt{\%edi} as callee saved
 206 registers (which are callee saved registers in the x86 ABI) and
 207 \texttt{\%ebx} as scratch register (which is also a callee saved
 208 register in the x86 ABI, but we need some scratch registers). So we
 209 have a lack of scratch registers. But for most ICMD instructions, we
 210 do not need all, or sometimes none, of the temporary registers.
 211
 212 This fact we use in the \texttt{analyse\_stack()} pass. We try to use
 213 \texttt{\%edx} (which is \texttt{REG\_ITMP3}) and \texttt{\%ecx} (which
 214 is \texttt{REG\_ITMP2}) as scratch registers for the register
 215 allocator if certain ICMD instructions are not used in the compiled
 216 method. So for \textit{best-case situations} CACAO has 3
 217 \textit{callee saved} and 3 \textit{scratch} registers.
 218
 219 This analysis should be changed from \textit{method level} to
 220 \textit{basic-block level}, so CACAO could emit better code on x86.
 221
 222 The register allocator itself is very straightforward, this means, it
 223 does neither \textit{linear scan} nor any other analyse of the methods
 224 variables, but allocates registers for the local variables in order as
 225 they are defined. This may result in good code on RISC machines, as
 226 there are almost always enough registers available, with 32 registers,
 227 but can produce really bad code on x86 processors.
 228
 229 So the first step to make the x86 port more competitive with SUN's or
 230 IBM's JVM would be to rewrite the register allocator.
 231
 232 Basically the allocation order of the register allocator is as
 233 follows:
 234
 235 \begin{itemize}
 236  \item interface register allocation
 237  \item scratch register allocation
 238  \item local register allocation
 239 \end{itemize}
 240
 241 The only change which had to be made to all allocator passes, was the
 242 handling of \texttt{long} variables, because they are all spilled to
 243 memory (described in more detail in \ref{LongArithmetic}).
 244
 245
 246 \subsection{Long arithmetic}\label{LongArithmetic}
 247
 248 Unlike the PowerPC port, we cannot put \texttt{long}'s in two 32 bit
 249 integer registers, since we have to little of them. Maybe this could
 250 bring a speedup, if the register allocator would be more intelligent
 251 or in leaf functions which have only \texttt{long} variables. But this
 252 is not implemented yet. So, the current approach is to store all
 253 \texttt{long}'s in memory, this means they are always spilled.
 254
 255 Nearly all \texttt{long} instructions are inline, except two of them:
 256 \texttt{LDIV} and \texttt{LREM}. These two are computed via
 257 \texttt{BUILTIN} calls. It would be definitely possible to also
 258 inline them, but the code size is too big and the latency is so high,
 259 that the function calls are negligible.
 260
 261 The x86 processor has some machine instructions which are specifically
 262 designed for 64 bit operations like \texttt{cltd} --- Convert Signed
 263 Long to Signed Double Long, \texttt{adc} --- Integer Add With Carry or
 264 \texttt{sbb} --- Integer Subtraction With Borrow. So some of the 64
 265 bit calculations like \texttt{LADD} or \texttt{LSUB} can be exceuted
 266 in two instructions.
 267
 268 All of the \texttt{long} instructions operate on 64 bit, even if it is
 269 not necessary. The dependency information that would be needed to just
 270 operate on the lower or upper half of the \texttt{long} variable, is
 271 not generated by CACAO.
 272
 273
 274 \subsection{Floating point arithmetic}
 275
 276 Since the i386, with it's i387 counterpart or the i486, the x86
 277 processor has an \textit{floating point unit} (FPU). This FPU is
 278 implemented as a stack with 8 elements (see table \ref{FPUStack}).
 279
 280 \begin{table*}
 281 \begin{center}
 282 \begin{tabular}[b]{|l|l|}
 283 \hline
 284 st(x) & FPU Data Register Stack \\ \hline
 285 0     & TOS (Top Of Stack) \\ \hline
 286 1     & \\ \hline
 287 2     & \\ \hline
 288 3     & \\ \hline
 289 4     & \\ \hline
 290 5     & \\ \hline
 291 6     & \\ \hline
 292 7     & \\ \hline
 293 \end{tabular}
 294 \caption{x87 FPU Data Register Stack}
 295 \label{FPUStack}
 296 \end{center}
 297 \end{table*}
 298
 299 This stack is designed to wrap around if values are loaded to the
 300 \textit{top of stack} (TOS). For this reason, it has a special register which
 301 points to the TOS. This pointer is increased if a load instruction to
 302 the TOS is executed and decreased for a store from the TOS.
 303
 304 At first sight, the stack design of the FPU is perfect for the stack
 305 based design of a \textit{java virtual machine} (JVM). But there is a
 306 big problem. The JVM stack has no fixed size, so it can grow up to
 307 much more than 8 elements and we get an stack wrap around and thus an
 308 stack overflow. For this reason we need to implement an
 309 \textit{stack-element-to-register-mapping}.
 310
 311 A very basic design idea, is to define a pointer to the current TOS
 312 offset (\texttt{fpu\_st\_offset}). With this offset we can determine
 313 the current register position in the FPU stack of an arbitrary
 314 register.  From the 8 stack elements we need to reserve the last two
 315 ones (\texttt{st(6), st(7)}), so we can load two memory operands onto
 316 the stack and do the arithmetic on them. Most x86 floating point
 317 arithmetic operations have an \textit{do arithmetic and pop one
 318 element} version of the instruction, that means the float arithmetic
 319 is done and the TOS element is poped off. The remaining stack element
 320 has the result of the calculation. On the example of the \texttt{FADD}
 321 ICMD with two memory operands, it looks like this:
 322
 323 \begin{verbatim}
 324 var_to_reg_flt(s1, src->prev, REG_FTMP1); /* load 1st operand in st(0), increase
 325                                              fpu_st_offset                          */
 326 var_to_reg_flt(s2, src, REG_FTMP2);       /* load 2nd operand in st(0), increase
 327                                              fpu_st_offset                          */
 328 i386_faddp();       /* add 2 upper most elements st(1) = st(1) + st(0) -- pop st(0) */
 329 fpu_st_offset--;                        /* decrease fpu_st_offset from previous pop */
 330 store_reg_to_var_flt(iptr->dst, d); /* store result -- decrease fpu_st_offset       */
 331 \end{verbatim}
 332
 333 This mapping works very good with \textit{scratch registers}
 334 only. Defining \textit{callee saved float registers} makes some
 335 problemes since the x86 ABI has no callee saved float registers. This
 336 would need a special handling in the \textit{native stub} of a native
 337 function, namely saving the registers and cleaning the whole FPU
 338 stack, because a C function expects a clear FPU stack.
 339
 340 Basically the x86 FPU can handle 3 float data types:
 341
 342 \begin{itemize}
 343  \item single-precision (32 bit)
 344  \item double-precision (64 bit)
 345  \item double extended-precision (80 bit)
 346 \end{itemize}
 347
 348 The FPU has a 16 bit \textit{control register} which has a
 349 \textit{precision control field} (PC) and a \textit{rounding control
 350 field} (RC), each of 2 bit length (see table \ref{PCField} and
 351 \ref{RCField}).
 352
 353 \begin{table*}
 354 \begin{center}
 355 \begin{tabular}[b]{|l|c|}
 356 \hline
 357 Precision                          & PC Field \\ \hline
 358 single-precision (32 bit)          & 00B      \\ \hline
 359 reserved                           & 01B      \\ \hline
 360 double-precision (64 bit)          & 10B      \\ \hline
 361 double extended-precision (80 bit) & 11B      \\ \hline
 362 \end{tabular}
 363 \caption{Precision Control Field (PC)}
 364 \label{PCField}
 365 \end{center}
 366 \end{table*}
 367
 368 \begin{table*}
 369 \begin{center}
 370 \begin{tabular}[b]{|l|c|}
 371 \hline
 372 Rounding Mode                 & RC Field \\ \hline
 373 round to nearest (even)       & 00B      \\ \hline
 374 round down (toward -$\infty$) & 01B      \\ \hline
 375 round up (toward +$\infty$)   & 10B      \\ \hline
 376 round toward zero (truncate)  & 11B      \\ \hline
 377 \end{tabular}
 378 \caption{Rounding Control Field (RC)}
 379 \label{RCField}
 380 \end{center}
 381 \end{table*}
 382
 383 The internal data type used by the FPU is always the \textit{double
 384 extended-precision} (80 bit) format. Therefore implementing a IEEE 754
 385 compliant floating point code on x86 processors is not trivial.
 386
 387 Correct rounding to \textit{single-precision} or
 388 \textit{double-precision} is only done if the value is stored into
 389 memory. This means for certain instructions, like \texttt{FMUL} or
 390 \texttt{FDIV}, a special technique called \textit{store-reload}, has
 391 to be implemented. This technique is in fact very simple but takes two
 392 memory accesses more for this instruction.
 393
 394 For single-precision floats the \textit{store-reload} code could looks
 395 like this:
 396
 397 \begin{verbatim}
 398 i386_fstps_membase(REG_SP, 0);    /* store single-precision float to stack  */
 399 i386_flds_membase(REG_SP, 0);     /* load single-precision float from stack */
 400 \end{verbatim}
 401
 402 Another technique which has to be implemented to be IEEE 754
 403 compliant, is \textit{precision mode switching}. Mode switching on the
 404 x86 processor is made with the \texttt{fldcw} (load control word)
 405 instruction. A \texttt{fldcw} instruction has a quite large overhead,
 406 so frequently mode switches should be avoided. For this technique
 407 there are two different approaches:
 408
 409 \begin{itemize}
 410  \item \textbf{Mode switch on float arithmetic} --- switch the FPU on
 411  initialization in one precision mode, mostly \textit{double-precision
 412  mode} because \texttt{double} arithmetic is more common. With this
 413  setting \texttt{doubles} are calculated correctly. To handle
 414  \texttt{floats} in this approach, the FPU has to be switched into
 415  \textit{single-precision mode} before each \texttt{float} instruction
 416  and switched back afterwards. Needless to say, that this is only
 417  useful, if \texttt{float} arithmetic is sparse. For a simple
 418  calculation like
 419
 420  \begin{verbatim}
 421         float f = 1.234F;
 422         float g = 2.345F;
 423         float e = f * f + g * g;
 424  \end{verbatim}
 425
 426  the generated ICMD's look like this (with comments where precision
 427  mode switches take place):
 428
 429  \begin{verbatim}
 430         ...
 431         <fpu in double-precision mode>
 432         FLOAD         1
 433         FLOAD         1
 434         <switch fpu to single-precision mode>
 435         FMUL
 436         <switch fpu to double-precision mode>
 437         FLOAD         2
 438         FLOAD         2
 439         <switch fpu to single-precision mode>
 440         FMUL
 441         <switch fpu to double-precision mode>
 442         <switch fpu to single-precision mode>
 443         FADD
 444         <switch fpu to double-precision mode>
 445         FSTORE        3
 446         ...
 447  \end{verbatim}
 448
 449  For this corner case situation the mode switches are extrem, but the
 450  example should demonstrate how this technique works.
 451
 452  \item \textbf{Mode switch on float data type change} --- switch the
 453  FPU on initialization in on precision mode, like before, in
 454  \textit{double-precision mode}. But the difference on this approach
 455  is, that the precision mode is only switched if the float data type
 456  is changed. That means if your calculation switches from
 457  \texttt{double} arithmetic to \texttt{float} or backwards. This
 458  technique makes much sense due to the fact that there are always a
 459  bunch of instructions of the same data type in one row in a normal
 460  program. Now the same example as before with this approach:
 461
 462  \begin{verbatim}
 463         ...
 464         <fpu in double-precision mode>
 465         FLOAD         1
 466         FLOAD         1
 467         <switch fpu to single-precision mode>
 468         FMUL
 469         FLOAD         2
 470         FLOAD         2
 471         FMUL
 472         FADD
 473         FSTORE        3
 474         ...
 475  \end{verbatim}
 476
 477  After this code sequence the FPU is in \textit{single-precision mode}
 478  and if a function return would occur, the caller function would not
 479  know in which mode the FPU is currently. One solution would be to
 480  reset the FPU to \textit{double-precision} on a function return, if
 481  the actual mode is \textit{single-precision}.
 482 \end{itemize}
 483
 484 These techniques and further researches into optimizations which could
 485 be done, are described in \cite{OgKoNa02}.
 486
 487 All of these described FPU techniques have been implemented in CACAO's
 488 x86 port, but the results were not completly IEEE 754 compliant. So
 489 the CACAO developer team decided to be on the safe side and to store
 490 all float variables in memory, until we have found a solution which is
 491 fast and 100\% compliant.