
\section{AMD64 (x86\_64) code generator}

\subsection{Introduction}

The AMD64 architecture, formerly known as x86\_64, is an extension of
the Intel IA32 architecture by AMD (Advanced Micro Devices). The
extraordinary success of the IA32 architecture and the looming memory
address space problem on IA32 high-end servers led to a special
design decision. Unlike Intel, with its completely newly designed IA64
architecture, AMD decided to extend the IA32 instruction set with
new 64-bit instructions.

Because IA32 instructions, unlike those of RISC machines, have no
fixed length, it was easy for AMD to introduce a new \textit{prefix
byte} called \texttt{REX}. The \textit{REX prefix} enables 64-bit
operation for the instruction that follows it when the processor runs
in the new \textit{64-bit mode}.

A processor of the AMD64 architecture has two main operating modes:

\begin{itemize}
\item Long Mode
\item Legacy Mode
\end{itemize}

In \textit{Legacy Mode} the processor acts like an IA32
processor. Any 32-bit operating system or software can run on this
type of processor without changes, so companies running IA32 servers
and software can switch their hardware to AMD64 and their systems
remain operational. This was AMD's main intention in developing this
architecture. \textit{Long Mode} is further split into two
coexisting operating modes:

\begin{itemize}
\item 64-bit Mode
\item Compatibility Mode
\end{itemize}

The \textit{64-bit Mode} exposes the full power of this architecture.
Memory operations now use 64-bit addresses and ALU instructions can
operate on 64-bit operands. Within \textit{Compatibility Mode} any
IA32 software can run under the control of a 64-bit operating
system. This, as mentioned before, is yet another reason for companies
to switch their hardware to AMD64: their software can be migrated
slowly to the new 64-bit system. Note, however, that not every type of
software is faster as 64-bit code.

Another crucial point in making the AMD64 architecture faster than
IA32 concerns the limited number of registers. Any IA32 processor, from
the early \textit{i386} to the newest generation of \textit{Intel
Pentium 4} or \textit{AMD Athlon}, has only 8 general-purpose
registers. With the \textit{REX prefix}, AMD extends the register
fields of an instruction by 1 bit each. This means that in
\textit{64-bit Mode} 16 general-purpose registers are available. The
value of a \textit{REX prefix} is in the range \texttt{40h} through
\texttt{4Fh}, depending on the particular bits used (see table
\ref{REX}).

\begin{table}
\begin{center}
\begin{tabular}[b]{|c|c|l|}
\hline
Mnemonic & Bit Position & Definition \\ \hline
-        & 7-4          & 0100 \\ \hline
REX.W    & 3            & 0 = Default operand size \\
         &              & 1 = 64-bit operand size \\ \hline
REX.R    & 2            & 1-bit (high) extension of the ModRM \textit{reg} field, \\
         &              & thus permitting access to 16 registers. \\ \hline
REX.X    & 1            & 1-bit (high) extension of the SIB \textit{index} field, \\
         &              & thus permitting access to 16 registers. \\ \hline
REX.B    & 0            & 1-bit (high) extension of the ModRM \textit{r/m} field, \\
         &              & SIB \textit{base} field, or opcode \textit{reg} field, thus \\
         &              & permitting access to 16 registers. \\ \hline
\end{tabular}
\caption{REX Prefix Byte Fields}
\label{REX}
\end{center}
\end{table}
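
To illustrate the layout in table \ref{REX}, the following C sketch
splits a byte into the four REX flag bits. It is only an illustration:
the names \texttt{rex\_fields} and \texttt{rex\_decode} are
hypothetical and do not appear in CACAO.

\begin{verbatim}
#include <stdint.h>

/* Hypothetical helper type, not taken from CACAO. */
typedef struct {
    int w;   /* REX.W: 1 = 64-bit operand size                   */
    int r;   /* REX.R: high bit of the ModRM reg field           */
    int x;   /* REX.X: high bit of the SIB index field           */
    int b;   /* REX.B: high bit of ModRM r/m, SIB base or opcode */
             /*        reg field                                 */
} rex_fields;

/* Returns 1 and fills *f if byte is a REX prefix (40h..4Fh),
   otherwise returns 0 and leaves *f untouched. */
static int rex_decode(uint8_t byte, rex_fields *f)
{
    if ((byte & 0xF0) != 0x40)
        return 0;

    f->w = (byte >> 3) & 1;
    f->r = (byte >> 2) & 1;
    f->x = (byte >> 1) & 1;
    f->b = byte & 1;
    return 1;
}
\end{verbatim}

For example, the byte \texttt{48h} decodes to REX.W set and all
extension bits clear, which is the prefix of a 64-bit operation on the
first 8 registers.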


\subsection{Code generation}

AMD64 code generation is mostly the same as on IA32. All new 64-bit
instructions can handle both \textit{memory operands} and
\textit{register operands}, so there is no need to change the
implementation of the IA32 ICMDs.

Much better code generation can be achieved in the area of
\textit{long arithmetic}. Since all 16 general-purpose registers can
hold 64-bit integer values, there is no need for special long
handling as on IA32, where we stored all long variables in memory. A
simple \texttt{ICMD\_LADD} on IA32, shown here for the best case
\texttt{src->regoff == iptr->dst->regoff}, looks like this:

\begin{verbatim}
        /* low word: add src->prev to the destination */
        i386_mov_membase_reg(REG_SP, src->prev->regoff * 8, REG_ITMP1);
        i386_alu_reg_membase(I386_ADD, REG_ITMP1, REG_SP, iptr->dst->regoff * 8);
        /* high word: add with the carry of the low word */
        i386_mov_membase_reg(REG_SP, src->prev->regoff * 8 + 4, REG_ITMP1);
        i386_alu_reg_membase(I386_ADC, REG_ITMP1, REG_SP, iptr->dst->regoff * 8 + 4);
\end{verbatim}

The first memory operand is added to the second memory operand, which
is at the same stack location as the destination operand. This means
that four instructions are executed for one long addition. If we used
registers for long variables, we could get a \textit{best case} of two
instructions, namely an \textit{add} followed by an \textit{adc}. On
AMD64 we can generate a single instruction for this addition:

\begin{verbatim}
        x86_64_alu_reg_reg(X86_64_ADD, src->prev->regoff, iptr->dst->regoff);
\end{verbatim}

This means that, counted in instructions, the AMD64 port is
\textit{four times} faster than the IA32 port for this operation
(probably even more, because no memory accesses are needed). Even if
we implemented the use of registers for long variables on IA32, the
AMD64 port would still need only half as many instructions.
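
The difference between the two approaches can be reproduced in plain
C. The following sketch is only an illustration with hypothetical
function names: it performs the addition once as an
\textit{add}/\textit{adc} pair on 32-bit halves, as IA32 has to, and
once as a single 64-bit addition.

\begin{verbatim}
#include <stdint.h>

/* IA32 style: add the low words, then add the high words
   together with the carry of the low-word addition. */
static uint64_t add64_ia32(uint32_t a_lo, uint32_t a_hi,
                           uint32_t b_lo, uint32_t b_hi)
{
    uint32_t lo    = a_lo + b_lo;          /* add            */
    uint32_t carry = lo < a_lo;            /* carry flag     */
    uint32_t hi    = a_hi + b_hi + carry;  /* adc            */

    return ((uint64_t) hi << 32) | lo;
}

/* AMD64 style: one 64-bit add. */
static uint64_t add64_amd64(uint64_t a, uint64_t b)
{
    return a + b;
}
\end{verbatim}

Both functions wrap around on overflow in the same way, so they return
identical results for all inputs.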

To use the new 64-bit instructions, we need to prefix nearly all
instructions with the \textit{REX prefix} byte mentioned above (some
instructions can be used in 64-bit mode without it). In CACAO we use a
macro called

\begin{verbatim}
        x86_64_emit_rex(size,reg,index,rm)
\end{verbatim}

The argument names correspond to the fields of the \textit{REX prefix}
they control (see table \ref{REX}).
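
How such a prefix can be assembled from register numbers is sketched
below. This is not the CACAO macro itself but a hypothetical function
written for illustration. It assumes that \texttt{size} selects REX.W
and that bit 3 of the register numbers \texttt{reg}, \texttt{index}
and \texttt{rm} supplies REX.R, REX.X and REX.B. Special cases, such
as byte registers that need a prefix, are ignored.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch. Writes a REX prefix to buf only when one
   is needed and returns the number of bytes written (0 or 1).
   size: 1 requests a 64-bit operand size.
   reg, index, rm: register numbers in the range 0..15. */
static size_t emit_rex(uint8_t *buf, int size, int reg,
                       int index, int rm)
{
    if (size != 1 && reg <= 7 && index <= 7 && rm <= 7)
        return 0;                          /* no prefix required */

    buf[0] = (uint8_t) (0x40
           | ((size & 1) << 3)             /* REX.W */
           | (((reg >> 3) & 1) << 2)       /* REX.R */
           | (((index >> 3) & 1) << 1)     /* REX.X */
           | ((rm >> 3) & 1));             /* REX.B */
    return 1;
}
\end{verbatim}

With this scheme a 32-bit operation on the first 8 registers carries
no prefix at all, so its encoding is the same as on IA32.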

The AMD64 architecture also introduces a new addressing method called
\textit{RIP-relative addressing} (or \textit{PC-relative
addressing}). In 64-bit mode, certain instructions can address memory
relative to the contents of the 64-bit instruction pointer (program
counter). In these instructions, the effective address is formed by
adding the displacement to the 64-bit \texttt{RIP} of the next
instruction. With this addressing mode, we can replace the IA32 method
of addressing data in the method's data segment. For example, in the
\texttt{ICMD\_PUTSTATIC} instruction, the IA32 code

\begin{verbatim}
        a = dseg_addaddress(&(((fieldinfo *) iptr->val.a)->value));
        i386_mov_imm_reg(0, REG_ITMP2);
        dseg_adddata(mcodeptr);
        i386_mov_membase_reg(REG_ITMP2, a, REG_ITMP2);
\end{verbatim}

can be replaced with the new \textit{RIP-relative addressing} code

\begin{verbatim}
        a = dseg_addaddress(&(((fieldinfo *) iptr->val.a)->value));
        x86_64_mov_membase_reg(RIP, -(((s8) mcodeptr + 7) - (s8) mcodebase) + a, REG_ITMP2);
\end{verbatim}

This saves one instruction on each read or write of a static
variable. The additional offset of \texttt{+ 7} is the code size of
the instruction itself, since the displacement is relative to the
address of the next instruction. The fictitious register \texttt{RIP}
is defined with

\begin{verbatim}
        #define RIP    -1
\end{verbatim}

Thus we can detect the special \textit{RIP-relative addressing}
mode in the code generating macro
\texttt{x86\_64\_emit\_membase(basereg,disp,dreg)} with

\begin{verbatim}
        if ((basereg) == RIP) {
            /* mod = 0 and r/m = RBP select RIP + 32-bit displacement */
            x86_64_address_byte(0,(dreg),RBP);
            x86_64_emit_imm32((disp));
            break;
        }
\end{verbatim}

and generate the \textit{RIP-relative addressing} code. As shown in
the code sample, it is a special encoding of the \textit{address byte}
with the \texttt{mod} field set to zero and \texttt{RBP}
(\texttt{\%rbp}) as base register.
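
The two ingredients of this encoding, the address byte and the
displacement, can be sketched in C as follows. The function names are
hypothetical. The sketch assumes that all offsets are measured from
the start of the method's code and that data segment entries therefore
have negative offsets, as the expression with \texttt{mcodebase} above
suggests.

\begin{verbatim}
#include <stdint.h>

#define RBP 5   /* hardware register number of %rbp */

/* Address (ModRM) byte: mod in bits 7-6, reg in bits 5-3,
   r/m in bits 2-0. */
static uint8_t address_byte(int mod, int reg, int rm)
{
    return (uint8_t) (((mod & 3) << 6) | ((reg & 7) << 3) | (rm & 7));
}

/* Displacement of a RIP-relative access: offset of the target
   minus the offset of the next instruction. */
static int32_t rip_disp(int64_t insn_offset, int insn_len,
                        int64_t target_offset)
{
    return (int32_t) (target_offset - (insn_offset + insn_len));
}
\end{verbatim}

For a 7 byte instruction at code offset 32 that accesses the data
segment entry at offset $-8$, the displacement is $-8 - (32 + 7) =
-47$.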


\subsection{Constant handling}

As on IA32, the AMD64 code generator can use \textit{immediate move}
instructions to load integer constants into their destination
registers. The 64-bit extensions of the AMD64 architecture can also
load 64-bit immediates inline, so loading a \texttt{long} constant
takes just one instruction instead of the two needed on the IA32
architecture. To minimize code size, the AMD64 code generator uses the
\textit{move long} (\texttt{movl}) instruction to load 32-bit
\texttt{int} constants. This instruction clears the upper 32 bits of
the destination register.

\begin{verbatim}
        case ICMD_ICONST:
                ...
                x86_64_movl_imm_reg(cd, iptr->val.i, d);
                ...

        case ICMD_LCONST:
                ...
                x86_64_mov_imm_reg(cd, iptr->val.l, d);
                ...
\end{verbatim}
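
The difference in code size between the two variants can be seen in
the following encoder sketch. It is a hypothetical illustration, not
CACAO code, and uses the \texttt{B8+r} opcode form of the immediate
move with an optional REX prefix.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch. Encodes "mov $imm, reg" for register
   numbers 0..15 into buf (at least 10 bytes) and returns the
   number of bytes written. is64 selects the 64-bit form. */
static size_t emit_mov_imm(uint8_t *buf, uint64_t imm,
                           int reg, int is64)
{
    size_t n     = 0;
    int    bytes = is64 ? 8 : 4;
    int    i;

    if (is64 || reg > 7)              /* REX.W and/or REX.B needed */
        buf[n++] = (uint8_t) (0x40 | (is64 ? 0x08 : 0x00)
                                   | ((reg >> 3) & 1));

    buf[n++] = (uint8_t) (0xB8 + (reg & 7));    /* opcode B8+r */

    for (i = 0; i < bytes; i++)       /* little-endian immediate */
        buf[n++] = (uint8_t) (imm >> (8 * i));

    return n;
}
\end{verbatim}

In this sketch a 32-bit constant takes 5 bytes (6 bytes for
\texttt{\%r8} through \texttt{\%r15}), while a 64-bit constant always
takes 10 bytes.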

\texttt{float} and \texttt{double} values are loaded from the data
segment via the \textit{move doubleword or quadword} (\texttt{movd})
instruction with \textit{RIP-relative addressing}.


\subsection{Calling conventions}

The AMD64 calling conventions are described here \ref{}. CACAO uses
only a subset of these calling conventions, since it just needs to
pass the JAVA data types and no other special features. The
sizes of the JAVA data types on the AMD64 port are shown in table
\ref{javadatatypesizes}.

\begin{table}
\begin{center}
\begin{tabular}[b]{|l|c|}
\hline
JAVA Data Type   & Bytes \\ \hline
\texttt{boolean} & 1     \\
\texttt{byte}    &       \\
\texttt{char}    &       \\ \hline
\texttt{short}   & 2     \\ \hline
\texttt{int}     & 4     \\
\texttt{float}   &       \\ \hline
\texttt{long}    & 8     \\
\texttt{double}  &       \\
\texttt{void}    &       \\ \hline
\end{tabular}
\caption{JAVA Data Type sizes on AMD64}
\label{javadatatypesizes}
\end{center}
\end{table}

\subsubsection{Integer arguments}

The AMD64 architecture has 6 integer argument registers. The order of
the argument registers is shown in table
\ref{amd64integerargumentregisters}.

\begin{table}
\begin{center}
\begin{tabular}[b]{|l|l|}
\hline
Register       & Argument Register \\ \hline
\texttt{\%rdi} & 1$^{\rm st}$      \\ \hline
\texttt{\%rsi} & 2$^{\rm nd}$      \\ \hline
\texttt{\%rdx} & 3$^{\rm rd}$      \\ \hline
\texttt{\%rcx} & 4$^{\rm th}$      \\ \hline
\texttt{\%r8}  & 5$^{\rm th}$      \\ \hline
\texttt{\%r9}  & 6$^{\rm th}$      \\ \hline
\end{tabular}
\caption{AMD64 Integer Argument Registers}
\label{amd64integerargumentregisters}
\end{center}
\end{table}

As on RISC machines, the remaining integer arguments are passed on the
stack. Each integer argument, regardless of its size, uses 8 bytes
on the stack.

Integer return values of any size are stored in \texttt{REG\_RESULT},
which is \texttt{\%rax}.

\subsubsection{Floating point arguments}

The AMD64 architecture has 8 floating point argument registers, namely
\texttt{\%xmm0} through \texttt{\%xmm7}. \texttt{\%xmm} registers are
128-bit wide floating point registers on which SSE and SSE2
instructions can operate. Remaining floating point arguments are
passed, as with integer arguments, on the stack using 8 bytes per
argument.

Floating point return values are stored in \texttt{\%xmm0}.

As shown, the calling conventions for the AMD64 architecture are
nearly the same as for RISC machines, which allows us to use CACAO's
\textit{register allocator algorithm} and \textit{stack space
allocation algorithm} without any changes.

Calling native functions involves register moves and stack copying, as
on RISC machines. How much is needed depends on the number of
arguments of the called native function. For non-static native
functions the first integer argument has to be the JNI environment
variable, so all arguments passed need to be shifted by one register,
which can include creating a new stackframe and storing some arguments
on the stack. For static native functions, the class pointer of
the current object's class is additionally passed in the 2$^{\rm nd}$
integer argument register. This means that the integer argument
registers need to be shifted by two registers.

One difference between this calling convention and that of RISC type
machines, like Alpha or MIPS, is the usage of integer and floating
point argument registers with mixed integer and floating point
arguments. Assume a function like this:

\begin{verbatim}
        void sub(int a, float b, long c, double d);
\end{verbatim}

On a RISC machine, like Alpha, we would have an argument register
usage as shown in figure \ref{alphaargumentregisterusage}. \texttt{a?}
represents integer argument registers and \texttt{fa?} floating point
argument registers.

\begin{figure}[htb]
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(60,35)
\thicklines
\put(0,15){\framebox(15,10){a0 = a}}
\put(30,15){\framebox(15,10){a2 = c}}
\put(15,5){\framebox(15,10){fa1 = b}}
\put(45,5){\framebox(15,10){fa3 = d}}
\put(0,0){\line(0,1){30}}
\end{picture}
\caption{Alpha argument register usage for \texttt{void sub(int a, float b, long c, double d);}}
\label{alphaargumentregisterusage}
\end{center}
\end{figure}

On AMD64 the same function call would look as shown in figure
\ref{amd64argumentregisterusage}.

\begin{figure}[htb]
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(60,35)
\thicklines
\put(0,15){\framebox(15,10){a0 = a}}
\put(15,15){\framebox(15,10){a1 = c}}
\put(0,5){\framebox(15,10){fa0 = b}}
\put(15,5){\framebox(15,10){fa1 = d}}
\put(0,0){\line(0,1){30}}
\end{picture}
\caption{AMD64 argument register usage for \texttt{void sub(int a, float b, long c, double d);}}
\label{amd64argumentregisterusage}
\end{center}
\end{figure}

The register assignment would be \texttt{a0 = \%rdi}, \texttt{a1 =
\%rsi}, \texttt{fa0 = \%xmm0} and \texttt{fa1 = \%xmm1}. This special
usage of the argument registers required some changes in the argument
register allocation algorithm for function calls during stack
analysis and some changes in the code generator itself.
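
The AMD64 scheme amounts to keeping one counter per register class.
The following C sketch illustrates this under simplifying assumptions.
It is not CACAO's allocator, and all names in it are hypothetical.

\begin{verbatim}
#include <stddef.h>

enum arg_type { ARG_INT, ARG_LONG, ARG_FLOAT, ARG_DOUBLE };

#define INT_ARG_CNT 6   /* %rdi, %rsi, %rdx, %rcx, %r8, %r9 */
#define FLT_ARG_CNT 8   /* %xmm0 through %xmm7              */

/* Hypothetical sketch. For each argument, regs[i] receives the
   index within its own register class (a0.., fa0..), or -1 if
   the argument is passed on the stack. Returns the stack space
   in bytes, using 8 bytes per stack argument. */
static size_t assign_args(const enum arg_type *types, size_t n,
                          int *regs)
{
    int    iarg  = 0;   /* next free integer argument register */
    int    farg  = 0;   /* next free float argument register   */
    size_t stack = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        int is_float = (types[i] == ARG_FLOAT ||
                        types[i] == ARG_DOUBLE);

        if (is_float && farg < FLT_ARG_CNT) {
            regs[i] = farg++;
        } else if (!is_float && iarg < INT_ARG_CNT) {
            regs[i] = iarg++;
        } else {
            regs[i] = -1;
            stack += 8;
        }
    }
    return stack;
}
\end{verbatim}

For \texttt{sub(int a, float b, long c, double d)} this yields the
indices 0, 0, 1, 1, which matches figure
\ref{amd64argumentregisterusage}. On Alpha a single shared counter
would be used instead, giving 0, 1, 2, 3.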


\subsection{Register allocator}

As mentioned in the introduction, the AMD64 architecture has 16
general-purpose registers and 16 floating-point registers. One
general-purpose register, namely \texttt{\%rsp}, is reserved for the
\textit{stack pointer} and thus cannot be used for arithmetic
instructions. The register usage in CACAO is shown in table
\ref{amd64registerusage}.

\begin{table}
\begin{center}
\begin{tabular}{l|l|l}
Register       & Usage                                        & Callee-saved \\ \hline
\texttt{\%rax} & return register, reserved for code generator & no           \\
\texttt{\%rcx} & 4$^{\rm th}$ argument register               & no           \\
\texttt{\%rdx} & 3$^{\rm rd}$ argument register               & no           \\
\texttt{\%rbx} & temporary register                           & no           \\
\texttt{\%rsp} & stack pointer                                & yes          \\
\texttt{\%rbp} & callee-saved register                        & yes          \\
\texttt{\%rsi} & 2$^{\rm nd}$ argument register               & no           \\
\texttt{\%rdi} & 1$^{\rm st}$ argument register               & no           \\
\texttt{\%r8}  & 5$^{\rm th}$ argument register               & no           \\
\texttt{\%r9}  & 6$^{\rm th}$ argument register               & no           \\
\texttt{\%r10}-\texttt{\%r11} & reserved for code generator   & no           \\
\texttt{\%r12}-\texttt{\%r15} & callee-saved register         & yes          \\
\texttt{\%xmm0}-\texttt{\%xmm7} & argument registers          & no           \\
\texttt{\%xmm8}-\texttt{\%xmm15} & temporary registers        & no           \\
\end{tabular}
\caption{AMD64 Register usage in CACAO}
\label{amd64registerusage}
\end{center}
\end{table}

There is only one deviation from the original AMD64 \textit{application
binary interface} (ABI). CACAO uses \texttt{\%rbx} as a temporary
register, while the AMD64 ABI uses the \texttt{\%rbx} register as a
callee-saved register.

In adapting the register allocator there was a problem concerning the
order of the integer argument registers. The order of the first four
argument registers is inverted. This can be seen in table
\ref{amd64registerusage}, which is sorted in ascending order of the
processor's internal register numbers. That means the ascending search
algorithm for argument registers in the register allocator would
allocate the first four argument registers in the wrong order. So
there is a small hack implemented in CACAO's register allocator to
handle this. After searching the register definition array for the
argument registers, the first four argument registers are swapped
within their array. This is done by a simple code sequence (taken from
\texttt{jit/reg.inc}):

\begin{verbatim}
        /*
         * on x86_64 the argument registers are not in ascending order
         * a00 (%rdi) <-> a03 (%rcx) and a01 (%rsi) <-> a02 (%rdx)
         */
        n = r->argintregs[3];
        r->argintregs[3] = r->argintregs[0];
        r->argintregs[0] = n;

        n = r->argintregs[2];
        r->argintregs[2] = r->argintregs[1];
        r->argintregs[1] = n;
\end{verbatim}


\subsection{Floating point arithmetic}

The AMD64 architecture implements two sets of floating point instructions:

\begin{itemize}
\item old i387 (x87)
\item SSE and SSE2
\end{itemize}

The x87 \textit{floating point unit} (FPU) implementation is
completely compatible with the IA32 implementation, with all its
advantages and drawbacks, like the FPU stack.