doc/handbook/x86.tex

   1 \section{X86 code generator}
   2
   3 Porting to the famous x86 platform was more effort than
   4 expected. CACAO was designed to run on RISC machines from ground up,
   5 so the code generation part hat to be adapted. The first approach was
   6 to replace the simple RISC macros with x86 code, but this turned out
   7 to be not successful. So new x86 code generation macros where
   8 written. To be backward compatible, mostly in respect of embedded
   9 systems, all generated code can be run on i386 systems.
  10
  11 Some smaller problems occured since the x86 port was the first 32 bit
  12 target platform, like segmentation faults due to heap
  13 corruption. Another problem was the access to the functions data
  14 segment. Since RISC platforms like ALPHA and MIPS have a procedure
  15 pointer register, for the x86 platform there had to be implemented a
  16 special handling for accesses to the data segment, like
  17 \texttt{PUTSTATIC} and \texttt{GETSTATIC} instructions. The solution
  18 is like the handling of \textit{jump references} or \textit{check cast
  19 references}, which also have to be code patched, when the code and
  20 data segment are relocated. This means, there has to be an extra
  21 \textit{immediate-to-register} move (\texttt{i386\_mov\_imm\_reg()})
  22 before every \texttt{PUT}/\texttt{GETSTATIC} instruction, which moves
  23 the start address of the procedure, and thus the start address of the
  24 data segment, in one of the temporary registers.
  25
  26 Register usage was another problem in porting the CACAO to x86. An x86
  27 processor has 8 genernal purpose registers (GPR), of which one is the
  28 \textit{stack pointer} (SP) and thus it can not be used for arithmetic
  29 instructions. From the remaining 7 registers, in \textit{worst-case
  30 instructions} like \texttt{CHECKCAST} or \texttt{INSTANCEOF}, we need
  31 to reserve 3 temporary registers. So we have 4 registers available.
  32
  33
  34 \subsection{Calling conventions}
  35
  36 Normal calling convention of the x86 processor is passing all function
  37 arguments on the stack. The store size depends on the data type (the
  38 following types reflect the JAVA data types):
  39
  40 \begin{itemize}
  41  \item \texttt{byte}, \texttt{char}, \texttt{short}, \texttt{int},
  42        \texttt{float}, \texttt{void} --- 4 bytes
  43  \item \texttt{long}, \texttt{double} --- 8 bytes
  44 \end{itemize}
  45
  46 We changed this convention in a way, that we are using always 8 bytes
  47 on the stack for each datatype. With this adaptation, it was possible
  48 to use the \textit{stack space allocation algorithm} without any
  49 changes. The drawback of this decision was, that we have to copy all
  50 arguments of a native function call into a new stack frame and we have
  51 a slightly bigger memory footprint.
  52
  53 But calling a native function always means a stack manipulation,
  54 because you have to put the \textit{JNI environment}, and for
  55 \texttt{static} functions the \textit{class pointer}, in front of the
  56 function parameters. So this negligible.
  57
  58 For some \texttt{BUILTIN} functions there had to be written
  59 \texttt{asm\_} counterparts, which copy the 8 byte parameters in their
  60 correct size in a new stack frame. But this only affected
  61 \texttt{BUILTIN} functions with more than 1 function parameter. To be
  62 exactly, 2 functions, namely \texttt{arrayinstanceof} and
  63 \texttt{newarray}. So this is not a big speed impact.
  64
  65
  66 \subsection{Register allocator}
  67
  68 As mentioned before, in  \textit{worst-case situations} there are only
  69 4 integer registers available. We use \texttt{\%ebp}, \texttt{\%esi},
  70 \texttt{\%edi} as callee saved registers (which are callee saved
  71 registers in the x86 ABI) and \texttt{\%ebx} as scratch register
  72 (which is also a callee saved register in the x86 ABI, but we need
  73 some scratch registers). So we have a lack of scratch registers. But
  74 for most ICMD instructions, we do not need all, or sometimes none, of
  75 the temporary registers.
  76
  77 This fact we use in the \texttt{analyse\_stack()} pass. We try to use
  78 \texttt{\%edx} (which is \texttt{REG\_ITMP3}) and \texttt{\%ecx} (which
  79 is \texttt{REG\_ITMP2}) as scratch registers for the register allocator
  80 if certain ICMD instructions are not used in the compiled method.
  81
  82 So, for \textit{best-case situations} CACAO has 3 \textit{callee
  83 saved} and 3 \textit{scratch} registers.
  84
  85 This analysis should be changed from \textit{method level} to
  86 \textit{basic-block level}, so CACAO could emit better code on x86.
  87
  88
  89 \subsection{Long arithmetic}
  90
  91 Unlike the PowerPC port, we cannot put \texttt{long}'s in 2 32 bit
  92 integer registers, since we have to little of them. Maybe this could
  93 bring a speedup, if the register allocator would be more intelligent
  94 or in leaf functions which have only \texttt{long} variables. But this
  95 is not implemented yet. So, the current approach is to store all
  96 \texttt{long}'s in memory, this means they are always spilled.
  97
  98 Nearly all \texttt{long} instructions are inline, except 2 of them:
  99 \texttt{LDIV} and \texttt{LREM}. These 2 are computed via
 100 \texttt{BUILTIN} calls.
 101
 102 All of the \texttt{long} instructions operate on 64 bit, even if it is
 103 not necessary. The dependency information that would be needed to just
 104 operate on the lower or upper half of the \texttt{long} variable, is
 105 not generated by CACAO.
 106
 107
 108 \subsection{Floating point arithmetic}
 109
 110 Since the i386, with it's i387 counterpart or the i486, the x86
 111 processor has an \textit{floating point unit} (FPU). This FPU is
 112 implemented as a stack with 8 elements (see table \ref{FPUStack}).
 113
 114 \begin{table*}
 115 \begin{center}
 116 \begin{tabular}[b]{|l|l|}
 117 \hline
 118   & FPU Data Register Stack \\ \hline
 119 0 & TOS (Top Of Stack) \\ \hline
 120 1 & \\ \hline
 121 2 & \\ \hline
 122 3 & \\ \hline
 123 4 & \\ \hline
 124 5 & \\ \hline
 125 6 & \\ \hline
 126 7 & \\ \hline
 127 \end{tabular}
 128 \caption{x87 FPU Data Register Stack}
 129 \label{FPUStack}
 130 \end{center}
 131 \end{table*}
 132
 133 This stack is designed to wrap around if values are loaded to the
 134 \textit{top of stack} (TOS). For this reason, it has a special register which
 135 points to the TOS. This pointer is increased if a load instruction to
 136 the TOS is executed and decreased for a store from the TOS.
 137
 138 The x86 FPU can handle 3 float data types:
 139
 140 \begin{itemize}
 141  \item single-precision (32 bit)
 142  \item double-precision (64 bit)
 143  \item double extended-precision (80 bit)
 144 \end{itemize}
 145
 146 The FPU has a 16 bit \textit{control register} which has a
 147 \textit{precision control field} (PC) and a \textit{rounding control
 148 field} (RC), each of 2 bit length (see table \ref{PCField} and
 149 \ref{RCField}).
 150
 151 \begin{table*}
 152 \begin{center}
 153 \begin{tabular}[b]{|l|c|}
 154 \hline
 155 Precision                          & PC Field \\ \hline
 156 single-precision (32 bit)          & 00B      \\ \hline
 157 reserved                           & 01B      \\ \hline
 158 double-precision (64 bit)          & 10B      \\ \hline
 159 double extended-precision (80 bit) & 11B      \\ \hline
 160 \end{tabular}
 161 \caption{Precision Control Field (PC)}
 162 \label{PCField}
 163 \end{center}
 164 \end{table*}
 165
 166 \begin{table*}
 167 \begin{center}
 168 \begin{tabular}[b]{|l|c|}
 169 \hline
 170 Rounding Mode                 & RC Field \\ \hline
 171 round to nearest (even)       & 00B      \\ \hline
 172 round down (toward -$\infty$) & 01B      \\ \hline
 173 round up (toward +$\infty$)   & 10B      \\ \hline
 174 round toward zero (truncate)  & 11B      \\ \hline
 175 \end{tabular}
 176 \caption{Rounding Control Field (RC)}
 177 \label{RCField}
 178 \end{center}
 179 \end{table*}
 180
 181 The internal data type used by the FPU is always the \textit{double
 182 extended-precision} (80 bit) format. Therefore implementing a IEEE 754
 183 compliant floating point code on x86 processors is not trivial.
 184
 185 Correct rounding to \textit{single-precision} or
 186 \textit{double-precision} is only done if the value is stored into
 187 memory. This means for certain instructions, like \texttt{FMUL} or
 188 \texttt{FDIV}, a special technique called \textit{store-reload}, has
 189 to be implemented. This technique is in fact very simple but takes two
 190 memory accesses more for this instruction.
 191
 192 For single-precision floats the \textit{store-reload} code could looks
 193 like this:
 194
 195 \begin{verbatim}
 196 i386_fstps_membase(REG_SP, 0);    /* store single-precision float to stack  */
 197 i386_flds_membase(REG_SP, 0);     /* load single-precision float from stack */
 198 \end{verbatim}
 199
 200 Another technique which has to be implemented to be IEEE 754
 201 compliant, is \textit{precision mode switching}. Mode switching on the
 202 x86 processor is made with the \texttt{fldcw} (load control word)
 203 instruction. A \texttt{fldcw} instruction has a quite large overhead,
 204 so frequently mode switches should be avoided. For this technique
 205 there are two different approaches:
 206
 207 \begin{itemize}
 208  \item \textbf{Mode switch on float arithmetic} --- switch the FPU on
 209  initialization in one precision mode, mostly \textit{double-precision
 210  mode} because \texttt{double} arithmetic is more common. With this
 211  setting \texttt{doubles} are calculated correctly. To handle
 212  \texttt{floats} in this approach, the FPU has to be switched into
 213  \textit{single-precision mode} before each \texttt{float} instruction
 214  and switched back afterwards. Needless to say, that this is only
 215  useful, if \texttt{float} arithmetic is sparse. For a simple
 216  calculation like
 217
 218  \begin{verbatim}
 219         float f = 1.234F;
 220         float g = 2.345F;
 221         float e = f * f + g * g;
 222  \end{verbatim}
 223
 224  the generated ICMD's look like this (with comments where precision
 225  mode switches take place):
 226
 227  \begin{verbatim}
 228         ...
 229         <fpu in double-precision mode>
 230         FLOAD         1
 231         FLOAD         1
 232         <switch fpu to single-precision mode>
 233         FMUL
 234         <switch fpu to double-precision mode>
 235         FLOAD         2
 236         FLOAD         2
 237         <switch fpu to single-precision mode>
 238         FMUL
 239         <switch fpu to double-precision mode>
 240         <switch fpu to single-precision mode>
 241         FADD
 242         <switch fpu to double-precision mode>
 243         FSTORE        3
 244         ...
 245  \end{verbatim}
 246
 247  For this corner case situation the mode switches are extrem, but the
 248  example should demonstrate how this technique works.
 249
 250  \item \textbf{Mode switch on float data type change} --- switch the
 251  FPU on initialization in on precision mode, like before, in
 252  \textit{double-precision mode}. But the difference on this approach
 253  is, that the precision mode is only switched if the float data type
 254  is changed. That means if your calculation switches from
 255  \texttt{double} arithmetic to \texttt{float} or backwards. This
 256  technique makes much sense due to the fact that there are always a
 257  bunch of instructions of the same data type in one row in a normal
 258  program. Now the same example as before with this approach:
 259
 260  \begin{verbatim}
 261         ...
 262         <fpu in double-precision mode>
 263         FLOAD         1
 264         FLOAD         1
 265         <switch fpu to single-precision mode>
 266         FMUL
 267         FLOAD         2
 268         FLOAD         2
 269         FMUL
 270         FADD
 271         FSTORE        3
 272         ...
 273  \end{verbatim}
 274
 275  After this code sequence the FPU is in \textit{single-precision mode}
 276  and if a function return would occur, the caller function would not
 277  know in which mode the FPU is currently. One solution would be to
 278  reset the FPU to \textit{double-precision} on a function return, if
 279  the actual mode is \textit{single-precision}.
 280 \end{itemize}
 281
 282 TODO: description of stack-to-register mapping
 283
 284 None of these techniques worked prefectly for CACAO, so the decision
 285 was to store all \texttt{float}'s and \texttt{double}'s into memory,
 286 so the rounding was correct. This behaviour will be changed, when we
 287 have found a solution that works.