% doc/handbook/x86_64.tex

\section{AMD64 (x86\_64) code generator}
\label{sectionamd64codegenerator}


\subsection{Introduction}

The AMD64~\cite{AMD64} architecture, formerly known as x86\_64, is an
extension of the Intel IA32 architecture developed by AMD (Advanced
Micro Devices~\cite{AMD}). The extraordinary success of the IA32
architecture and the looming memory address space limit on IA32
high-end servers led AMD to a special design decision. Unlike Intel,
which designed a completely new 64-bit architecture (IA64), AMD decided
to extend the IA32 instruction set with a new 64-bit instruction mode.

Because IA32 instructions have no fixed length, unlike instructions on
RISC machines, it was easy for AMD to introduce a new \textit{prefix
byte} called the \textit{REX prefix}. The \textit{REX prefix} enables
64-bit operation for the instruction that follows it in the new
\textit{64-bit mode} of the processor.

A processor that implements the AMD64 architecture has two main
operating modes:

\begin{itemize}
\item Long Mode
\item Legacy Mode
\end{itemize}

In \textit{Legacy Mode} the processor acts like an IA32 processor. Any
32-bit operating system or software can run on this type of processor
without changes, so companies running IA32 servers and software can
switch their hardware to AMD64 and their systems remain operational.
This was AMD's main intention in developing this architecture.
\textit{Long Mode}, in turn, is split into two coexisting operating
modes:

\begin{itemize}
\item 64-bit Mode
\item Compatibility Mode
\end{itemize}

The \textit{64-bit Mode} exposes the full power of this architecture.
Memory operations use 64-bit addresses and ALU instructions can operate
on 64-bit operands. Within \textit{Compatibility Mode} any IA32
software can run under the control of a 64-bit operating system. This,
as mentioned before, is another reason for companies to switch their
hardware to AMD64, since their software can be migrated gradually to
the new 64-bit systems. Not every type of software is faster as 64-bit
code, however. Every memory address loaded from or stored to memory is
now 64 bits wide instead of 32 bits, so address transfers move twice as
much data as on IA32 machines.

Another crucial factor in making the AMD64 architecture faster than
IA32 is the limited number of registers on IA32. Every IA32 processor,
from the early \textit{i386} to the newest generation of \textit{Intel
Pentium 4} or \textit{AMD Athlon}, has only 8 general-purpose
registers. With the \textit{REX prefix}, AMD extends each register
number field by 1 bit, which doubles the number of addressable
registers. In \textit{64-bit Mode}, 16 general-purpose registers are
therefore available. The value of a \textit{REX prefix} is in the range
\texttt{40h} through \texttt{4Fh}, depending on the particular bits
used (see table \ref{tablerexprefixbytefields}).

\begin{table}
\begin{center}
\begin{tabular}[b]{|c|c|l|}
\hline
Mnemonic & Bit Position & Definition \\ \hline
-        & 7--4         & 0100 \\ \hline
REX.W    & 3            & 0 = Default operand size \\
         &              & 1 = 64-bit operand size \\ \hline
REX.R    & 2            & 1-bit (high) extension of the ModRM \textit{reg} field, \\
         &              & thus permitting access to 16 registers. \\ \hline
REX.X    & 1            & 1-bit (high) extension of the SIB \textit{index} field, \\
         &              & thus permitting access to 16 registers. \\ \hline
REX.B    & 0            & 1-bit (high) extension of the ModRM \textit{r/m} field, \\
         &              & SIB \textit{base} field, or opcode \textit{reg} field, thus \\
         &              & permitting access to 16 registers. \\ \hline
\end{tabular}
\caption{REX Prefix Byte Fields}
\label{tablerexprefixbytefields}
\end{center}
\end{table}
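
The bit layout in table \ref{tablerexprefixbytefields} can be made
concrete with a short C sketch. The helper \texttt{rex\_byte} and its
parameter names are hypothetical and not taken from CACAO; the sketch
assumes nothing beyond the field positions listed in the table.

\begin{verbatim}
#include <stdint.h>

/* Hypothetical helper, not taken from CACAO: composes a REX prefix
 * byte from its four one-bit fields. */
static uint8_t rex_byte(int w, int r, int x, int b)
{
    return (uint8_t) (0x40              /* bits 7-4: fixed pattern 0100 */
                      | ((w & 1) << 3)  /* REX.W: 64-bit operand size   */
                      | ((r & 1) << 2)  /* REX.R: extends ModRM reg     */
                      | ((x & 1) << 1)  /* REX.X: extends SIB index     */
                      | (b & 1));       /* REX.B: extends r/m or base   */
}
\end{verbatim}

For example, \texttt{rex\_byte(1, 0, 0, 0)} yields \texttt{48h}, and
\texttt{rex\_byte(1, 1, 1, 1)} yields \texttt{4Fh}, the upper end of the
range given above.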


\subsection{Code generation}

AMD64 code generation is mostly the same as on IA32. All new 64-bit
instructions can handle both \textit{memory operands} and
\textit{register operands}, so there is no need to change the
implementation of the IA32 ICMDs.

Much better code can be generated for \textit{long arithmetic}. Since
all 16 general-purpose registers can hold 64-bit integer values, there
is no need for special \texttt{long} handling as on IA32, where we
stored all \texttt{long} variables in memory. As an example, consider a
simple \texttt{ICMD\_LADD} on IA32, shown here for the case that is the
best case on AMD64, \texttt{src->regoff == iptr->dst->regoff}:

\begin{verbatim}
        /* low 32 bits: load the first operand, add it to the destination */
        i386_mov_membase_reg(REG_SP, src->prev->regoff * 8, REG_ITMP1);
        i386_alu_reg_membase(I386_ADD, REG_ITMP1, REG_SP, iptr->dst->regoff * 8);
        /* high 32 bits: same again, adding the carry of the low halves */
        i386_mov_membase_reg(REG_SP, src->prev->regoff * 8 + 4, REG_ITMP1);
        i386_alu_reg_membase(I386_ADC, REG_ITMP1, REG_SP, iptr->dst->regoff * 8 + 4);
\end{verbatim}

The first memory operand is added to the second memory operand, which
is at the same stack location as the destination operand. This means
that four instructions are executed for one \texttt{long} addition. If
we used registers for \texttt{long} variables, the \textit{best case}
would be two instructions, namely an \textit{add} followed by an
\textit{adc}. On AMD64 we can generate a single instruction for this
addition:

\begin{verbatim}
        x86_64_alu_reg_reg(X86_64_ADD, src->prev->regoff, iptr->dst->regoff);
\end{verbatim}

This means that the AMD64 port needs only a quarter of the instructions
of the IA32 port for a \texttt{long} addition, and none of them
accesses memory, so it can be expected to be up to four times faster
for this operation, possibly even more. Even if we implemented the use
of registers for \texttt{long} variables on IA32, the AMD64 port would
still need only half as many instructions.
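
Why IA32 needs an \textit{add}/\textit{adc} pair for a single
\texttt{long} addition can be illustrated in plain C. The sketch below
is not CACAO code; \texttt{add64\_via\_32} is a hypothetical name, and
the function only models the two 32-bit halves and the carry between
them.

\begin{verbatim}
#include <stdint.h>

/* Hypothetical illustration: a 64-bit addition carried out on two
 * 32-bit halves, the way an IA32 add/adc pair does it. */
static uint64_t add64_via_32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t) a, a_hi = (uint32_t) (a >> 32);
    uint32_t b_lo = (uint32_t) b, b_hi = (uint32_t) (b >> 32);

    uint32_t lo    = a_lo + b_lo;          /* add: low halves          */
    uint32_t carry = lo < a_lo;            /* carry flag of the add    */
    uint32_t hi    = a_hi + b_hi + carry;  /* adc: high halves + carry */

    return ((uint64_t) hi << 32) | lo;
}
\end{verbatim}

On AMD64 the same result comes from one 64-bit \textit{add}, which is
what the single \texttt{x86\_64\_alu\_reg\_reg} call above emits.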

To be able to use the new 64-bit instructions, we need to prefix nearly
all instructions with the \textit{REX prefix} byte mentioned above;
only some instructions can be used in their 64-bit form without it. In
CACAO we use a macro called

\begin{verbatim}
        x86_64_emit_rex(size,reg,index,rm)
\end{verbatim}

to emit this byte. The argument names correspond to the fields of the
\textit{REX prefix} itself (see table
\ref{tablerexprefixbytefields}).

The AMD64 architecture also introduces a new addressing method called
\textit{RIP-relative addressing}. In 64-bit mode, addressing relative
to the contents of the 64-bit instruction pointer (program counter),
also called \textit{PC-relative addressing}, is implemented for certain
instructions. In these instructions, the effective address is formed by
adding the displacement to the 64-bit \texttt{RIP} of the next
instruction. With this addressing mode, we can replace the IA32 method
of addressing data in the method's data segment. For example, in the
\texttt{ICMD\_PUTSTATIC} instruction, the IA32 code

\begin{verbatim}
        a = dseg_addaddress(&(((fieldinfo *) iptr->val.a)->value));
        i386_mov_imm_reg(0, REG_ITMP2);
        dseg_adddata(mcodeptr);
        i386_mov_membase_reg(REG_ITMP2, a, REG_ITMP2);
\end{verbatim}

can be replaced with the new \textit{RIP-relative addressing} code

\begin{verbatim}
        a = dseg_addaddress(&(((fieldinfo *) iptr->val.a)->value));
        x86_64_mov_membase_reg(RIP, -(((s8) mcodeptr + 7) - (s8) mcodebase) + a, REG_ITMP2);
\end{verbatim}

So we save one instruction on each read or write of a static variable.
The additional offset of \texttt{+ 7} is the code size of the
instruction itself. The fictitious register \texttt{RIP} is defined
with

\begin{verbatim}
        #define RIP    -1
\end{verbatim}

Thus we can detect the special \textit{RIP-relative addressing} mode in
the code generating macro
\texttt{x86\_64\_emit\_membase(basereg,disp,dreg)} with

\begin{verbatim}
        if ((basereg) == RIP) {
            x86_64_address_byte(0,(dreg),RBP);
            x86_64_emit_imm32((disp));
            break;
        }
\end{verbatim}

and generate the \textit{RIP-relative addressing} code. As shown in
the code sample, this is a special encoding of the \textit{address
byte} with the \texttt{mod} field set to zero and \texttt{RBP}
(\texttt{\%rbp}) as base register.
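
The displacement computation and the address byte can be sketched in C
as follows. Both helpers are hypothetical and not part of CACAO. The
first mirrors the expression used above under the assumption that
\texttt{a} is the offset of the data relative to \texttt{mcodebase}; the
second uses the standard ModRM layout, in which \texttt{RBP} has
register number 5.

\begin{verbatim}
#include <stdint.h>

/* Hypothetical helpers, not taken from CACAO. */

/* Displacement for RIP-relative addressing: the target address minus
 * the address of the NEXT instruction (instruction start plus its
 * length). */
static int32_t rip_disp(int64_t target, int64_t insn_start, int insn_len)
{
    return (int32_t) (target - (insn_start + insn_len));
}

/* ModRM byte: mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0. */
static uint8_t modrm_byte(int mod, int reg, int rm)
{
    return (uint8_t) (((mod & 3) << 6) | ((reg & 7) << 3) | (rm & 7));
}
\end{verbatim}

For a 7-byte instruction at address \texttt{1000h} that reads data at
\texttt{0FF0h}, \texttt{rip\_disp} returns $-23$. With \texttt{mod} set
to 0, destination register 2 and \texttt{r/m} set to 5,
\texttt{modrm\_byte} returns \texttt{15h}.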


\subsection{Constant handling}

As on IA32, the AMD64 code generator can use \textit{immediate move}
instructions to load integer constants into their destination
registers. The 64-bit extensions of the AMD64 architecture can also
load 64-bit immediates inline, so loading a \texttt{long} constant
takes just one instruction instead of two on the IA32 architecture. To
minimize code size, the AMD64 code generator still uses the
\textit{move long} (\texttt{movl}) instruction to load 32-bit
\texttt{int} constants. The \texttt{movl} instruction clears the upper
32 bits of the destination register.

\begin{verbatim}
        case ICMD_ICONST:
                ...
                x86_64_movl_imm_reg(cd, iptr->val.i, d);
                ...

        case ICMD_LCONST:
                ...
                x86_64_mov_imm_reg(cd, iptr->val.l, d);
                ...
\end{verbatim}
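
The effect and the cost of the two immediate moves can be sketched in
C. Both helpers are hypothetical and not part of CACAO. The byte counts
assume the short \texttt{B8+r} opcode forms of the immediate move, which
may differ from the encodings CACAO actually emits.

\begin{verbatim}
#include <stdint.h>

/* Hypothetical model, not CACAO code: the 64-bit register value after
 * a 32-bit immediate move. The upper 32 bits are cleared. */
static uint64_t reg_after_movl(int32_t imm)
{
    return (uint64_t) (uint32_t) imm;
}

/* Encoded length in bytes of an immediate move into register number
 * reg (0-15), assuming the B8+r opcode forms. */
static int mov_imm_length(int reg, int is_64bit)
{
    if (is_64bit)
        return 1 + 1 + 8;               /* REX.W, opcode, imm64          */
    return (reg >= 8 ? 1 : 0) + 1 + 4;  /* optional REX.B, opcode, imm32 */
}
\end{verbatim}

Under these assumptions an \texttt{int} constant costs 5 or 6 bytes and
a \texttt{long} constant 10 bytes, which is why the 32-bit form is
preferred wherever it suffices.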

\texttt{float} and \texttt{double} values are loaded from the data
segment via the \textit{move doubleword or quadword} (\texttt{movd})
instruction with \textit{RIP-relative addressing}.


\subsection{Calling conventions}

The AMD64 calling conventions are described in \cite{AMD64ABI}. CACAO
uses a subset of these calling conventions that covers its
requirements. CACAO just needs to pass the JAVA data types to called
functions; no other special features are required. The byte sizes of
the JAVA data types on the AMD64 port are shown in table
\ref{javadatatypesizes}.

\begin{table}
\begin{center}
\begin{tabular}[b]{|l|c|}
\hline
JAVA Data Type   & Bytes \\ \hline
\texttt{boolean} & 1     \\
\texttt{byte}    &       \\ \hline
\texttt{char}    & 2     \\
\texttt{short}   &       \\ \hline
\texttt{int}     & 4     \\
\texttt{float}   &       \\ \hline
\texttt{long}    & 8     \\
\texttt{double}  &       \\
\texttt{void}    &       \\ \hline
\end{tabular}
\caption{JAVA Data Type sizes on AMD64}
\label{javadatatypesizes}
\end{center}
\end{table}

\subsubsection{Integer arguments}

The AMD64 ABI defines 6 integer argument registers. The order of the
argument registers is shown in table
\ref{amd64integerargumentregisters}.

\begin{table}
\begin{center}
\begin{tabular}[b]{|l|l|}
\hline
Register       & Argument Register \\ \hline
\texttt{\%rdi} & 1$^{\rm st}$      \\ \hline
\texttt{\%rsi} & 2$^{\rm nd}$      \\ \hline
\texttt{\%rdx} & 3$^{\rm rd}$      \\ \hline
\texttt{\%rcx} & 4$^{\rm th}$      \\ \hline
\texttt{\%r8}  & 5$^{\rm th}$      \\ \hline
\texttt{\%r9}  & 6$^{\rm th}$      \\ \hline
\end{tabular}
\caption{AMD64 Integer Argument Registers}
\label{amd64integerargumentregisters}
\end{center}
\end{table}

As on RISC machines, the remaining integer arguments are passed on the
stack. Each integer argument, regardless of its integer JAVA data
type, uses 8 bytes on the stack.

Integer return values of any integer JAVA data type are stored in
\texttt{REG\_RESULT}, which is \texttt{\%rax}.

\subsubsection{Floating-point arguments}

The AMD64 ABI defines 8 floating-point argument registers, namely
\texttt{\%xmm0} through \texttt{\%xmm7}. The \texttt{\%xmm} registers
are 128-bit wide floating-point registers on which SSE and SSE2
instructions can operate. Remaining floating-point arguments are
passed, like integer arguments, on the stack using 8 bytes per
argument, regardless of the floating-point JAVA data type.

Floating-point return values of any floating-point JAVA data type are
stored in \texttt{\%xmm0}.

As shown, the calling conventions for the AMD64 architecture are
similar to the calling conventions of RISC machines, which allows
CACAO's \textit{register allocator algorithm} and \textit{stack space
allocation algorithm} to be used without any changes.

Calling native functions involves register moves and stack copying, as
on RISC machines. How much is needed depends on the number of
arguments of the called native function. The first integer argument of
a native function has to be the JNI environment pointer, so for
non-static native functions all arguments need to be shifted by one
register, which can include creating a new stack frame and storing
some arguments on the stack. For static native functions the class
pointer of the current class is additionally passed in the
2$^{\rm nd}$ integer argument register, so the integer argument
registers need to be shifted by two registers.

One difference between the AMD64 calling conventions and those of
RISC-type machines, like Alpha or MIPS, is the allocation of integer
and floating-point argument registers when integer and floating-point
arguments are mixed. Assume a function like this:

\begin{verbatim}
        void sub(int a, float b, long c, double d);
\end{verbatim}

On a RISC machine, like Alpha, the argument registers would be
allocated as shown in figure \ref{alphaargumentregisterusage}, where
\texttt{a?} denotes integer argument registers and \texttt{fa?}
floating-point argument registers.

\begin{figure}[htb]
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(60,35)
\thicklines
\put(0,15){\framebox(15,10){a0 = a}}
\put(30,15){\framebox(15,10){a2 = c}}
\put(15,5){\framebox(15,10){fa1 = b}}
\put(45,5){\framebox(15,10){fa3 = d}}
\put(0,0){\line(0,1){30}}
\end{picture}
\caption{Alpha argument register usage for \texttt{void sub(int a, float b, long c, double d);}}
\label{alphaargumentregisterusage}
\end{center}
\end{figure}

On AMD64 the argument registers for the same function call are
allocated as shown in figure \ref{amd64argumentregisterusage}.

\begin{figure}[htb]
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(60,35)
\thicklines
\put(0,15){\framebox(15,10){a0 = a}}
\put(15,15){\framebox(15,10){a1 = c}}
\put(0,5){\framebox(15,10){fa0 = b}}
\put(15,5){\framebox(15,10){fa1 = d}}
\put(0,0){\line(0,1){30}}
\end{picture}
\caption{AMD64 argument register usage for \texttt{void sub(int a, float b, long c, double d);}}
\label{amd64argumentregisterusage}
\end{center}
\end{figure}

The register assignment would be \texttt{a0 = \%rdi}, \texttt{a1 =
\%rsi}, \texttt{fa0 = \%xmm0} and \texttt{fa1 = \%xmm1}. This special
usage of the argument registers required some changes in the argument
register allocation algorithm for function calls during stack
analysis and some changes in the code generator itself.
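
The independent counting of integer and floating-point argument
registers can be sketched as follows. This is a simplified
illustration, not CACAO's allocator; the names \texttt{arg\_kind} and
\texttt{assign\_arg\_regs} are hypothetical, and the limits of 6 and 8
registers are the ones given above.

\begin{verbatim}
#include <stddef.h>

enum arg_kind { ARG_INT, ARG_FLT };

/* Hypothetical sketch, not CACAO's allocator: each argument gets the
 * next free register of its own class, so integer and floating-point
 * arguments are counted independently. out[i] receives the register
 * index within its class, or -1 if the argument goes on the stack. */
static void assign_arg_regs(const enum arg_kind *kinds, size_t n, int *out)
{
    int next_int = 0, next_flt = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (kinds[i] == ARG_INT)
            out[i] = next_int < 6 ? next_int++ : -1;  /* 6 integer regs */
        else
            out[i] = next_flt < 8 ? next_flt++ : -1;  /* 8 xmm regs     */
    }
}
\end{verbatim}

For \texttt{void sub(int a, float b, long c, double d)} the sketch
yields the indices 0, 0, 1, 1, that is \texttt{a0}, \texttt{fa0},
\texttt{a1} and \texttt{fa1}, in agreement with figure
\ref{amd64argumentregisterusage}.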


\subsection{Register allocation}
\label{sectionamd64registerallocation}

As mentioned in the introduction, the AMD64 architecture has 16
integer general-purpose registers and 16 floating-point registers. One
integer general-purpose register, namely \texttt{\%rsp}, is reserved
for the \textit{stack pointer} and thus cannot be used for arithmetic
instructions. The register usage in CACAO is shown in table
\ref{amd64registerusage}.

\begin{table}
\begin{center}
\begin{tabular}{l|l|l}
Register       & Usage                                         & Callee-saved \\ \hline
\texttt{\%rax} & return register, reserved for code generator  & no           \\
\texttt{\%rcx} & 4$^{\rm th}$ argument register                & no           \\
\texttt{\%rdx} & 3$^{\rm rd}$ argument register                & no           \\
\texttt{\%rbx} & temporary register                            & no           \\
\texttt{\%rsp} & stack pointer                                 & yes          \\
\texttt{\%rbp} & callee-saved register                         & yes          \\
\texttt{\%rsi} & 2$^{\rm nd}$ argument register                & no           \\
\texttt{\%rdi} & 1$^{\rm st}$ argument register                & no           \\
\texttt{\%r8}  & 5$^{\rm th}$ argument register                & no           \\
\texttt{\%r9}  & 6$^{\rm th}$ argument register                & no           \\
\texttt{\%r10} - \texttt{\%r11} & reserved for code generator  & no           \\
\texttt{\%r12} - \texttt{\%r15} & callee-saved register        & yes          \\
\texttt{\%xmm0} & 1$^{\rm st}$ argument register, return register & no        \\
\texttt{\%xmm1} - \texttt{\%xmm7} & argument registers         & no           \\
\texttt{\%xmm8} - \texttt{\%xmm10} & reserved for code generator & no         \\
\texttt{\%xmm11} - \texttt{\%xmm15} & temporary registers      & no           \\
\end{tabular}
\caption{AMD64 Register usage in CACAO}
\label{amd64registerusage}
\end{center}
\end{table}

There is only one deviation from the original AMD64 \textit{application
binary interface} (ABI). CACAO uses \texttt{\%rbx} as a temporary
register, while the AMD64 ABI defines \texttt{\%rbx} as a callee-saved
register. CACAO therefore needs to save the \texttt{\%rbx} register
when a JAVA method is called from a native function, such as a JNI
function. This is done in \texttt{asm\_calljavafunction} located in
\texttt{jit/x86\_64/asmpart.S}.

In adapting the register allocator there was a problem concerning the
order of the integer argument registers. The order of the first four
argument registers is inverted with respect to the processor's
internal register numbers, as can be seen in table
\ref{amd64registerusage}, which is sorted in ascending order of those
numbers. The ascending search for argument registers in the register
allocator would therefore allocate the first four argument registers
in the wrong order. A small workaround in CACAO's register allocator
handles this: after the register definition array has been searched
for the argument registers, the first four argument registers are
swapped within their array. This is done by a simple code sequence
(taken from \texttt{jit/reg.inc}):

\begin{verbatim}
        /*
         * on x86_64 the argument registers are not in ascending order
         * a00 (%rdi) <-> a03 (%rcx) and a01 (%rsi) <-> a02 (%rdx)
         */
        n = r->argintregs[3];
        r->argintregs[3] = r->argintregs[0];
        r->argintregs[0] = n;

        n = r->argintregs[2];
        r->argintregs[2] = r->argintregs[1];
        r->argintregs[1] = n;
\end{verbatim}


\subsection{Floating-point arithmetic}

The AMD64 architecture implements two sets of floating-point
instructions:

\begin{itemize}
\item x87 (i387)
\item SSE/SSE2
\end{itemize}

The x87 \textit{floating-point unit} (FPU) implementation is
completely compatible with the IA32 implementation as it has existed
since the i386 with its i387 coprocessor, with all its advantages and
drawbacks, like the 8-slot FPU stack.

The SSE/SSE2 technique is taken from recent generations of Intel
processors (SSE was introduced with the Pentium III, SSE2 with the
Pentium 4) and can process scalar 32-bit \texttt{float} values and
scalar 64-bit \texttt{double} values in the 128-bit wide \texttt{xmm}
floating-point registers. While SSE instructions operate on 32-bit
\texttt{float} values, SSE2 is responsible for 64-bit \texttt{double}
values. In CACAO we implemented the JAVA floating-point instructions
using SSE/SSE2, because SSE/SSE2 is much easier to use and should be
the technique of the future. In some areas SSE/SSE2 is slower than the
old x87 implementation, even on the newly designed AMD64 architecture,
but SSE/SSE2 offers 16 floating-point registers, which should speed up
everyday JAVA floating-point calculations. Another big advantage of
SSE/SSE2 over x87 is the absence of the \textit{single-double
precision-rounding} problem, which is described in detail in the
``IA32 code generator'' section. With SSE/SSE2 the 32-bit
\texttt{float} and 64-bit \texttt{double} arithmetic is calculated and
rounded in full compliance with IEEE 754, so no further adjustments are
needed to fulfill JAVA's floating-point requirements.

In floating-point to integer conversions a JVM has to check for corner
cases as described in the JVM specification. This is done with a
simple inline integer compare of the integer result value and, if
necessary, a call to a special assembler wrapper function for builtin
calls, such as \texttt{asm\_builtin\_f2i} for \texttt{ICMD\_F2I}
(\texttt{float} to \texttt{int} conversion). The corner cases are then
computed in a builtin C function that respects all special cases like
\textit{Infinity} or \textit{NaN} values.
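
The conversion rules that such a builtin has to implement can be
sketched in C. The functions below are hypothetical and not CACAO's
builtins. The second one assumes that the inline compare tests for the
value \texttt{80000000h}, which the truncating SSE conversion
instructions return for \textit{NaN} and out-of-range inputs.

\begin{verbatim}
#include <stdint.h>

/* Hypothetical sketch of a float-to-int builtin that follows the rules
 * of the JVM specification; not CACAO's implementation. */
static int32_t f2i_slow(float f)
{
    if (f != f)
        return 0;                    /* NaN converts to 0          */
    if (f >= 2147483648.0f)
        return INT32_MAX;            /* too large: saturate        */
    if (f <= -2147483648.0f)
        return INT32_MIN;            /* too small: saturate        */
    return (int32_t) f;              /* in range: truncate to zero */
}

/* Assumed fast path: only the special hardware result needs the
 * builtin, every other result is already correct. */
static int32_t f2i(float f, int32_t hw_result)
{
    return hw_result == INT32_MIN ? f2i_slow(f) : hw_result;
}
\end{verbatim}

With this split the common case costs one compare and a branch that is
not taken, and the builtin call is paid only for the rare corner cases.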


\subsection{Exception handling}

Since the AMD64 architecture is just an extension of the IA32
architecture, an AMD64 processor raises the same hardware exceptions
as an IA32 processor, and the operating system delivers them as the
same signals, so we can catch them in our own signal handlers. This
includes the signals \texttt{SIGSEGV} and \texttt{SIGFPE}.

When a signal of this type is raised and reaches our signal handler, we
reinstall the handler, create a new exception object and jump to
exception handling code written in assembler. The difference from the
exception handling code of RISC machines is that RISC machines have a
\textit{procedure vector} (PV) register, which makes it easy to find
the method's data segment: it starts at the PV and grows down to
smaller addresses like a stack. For the IA32 and AMD64 architectures we
had to implement a \textit{method tree} which contains the start
\textit{program counter} (PC) and the end PC of every JAVA method
compiled by CACAO, so that for any exception PC we can find the
corresponding method and thus its PV. We need the data segment for the
method's exception table (for a detailed description see section
``Exception handling'').
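
The lookup from an exception PC to a method can be sketched with a
sorted array and a binary search. This is only an illustration of the
idea. CACAO's actual method tree is not shown in this section, the type
\texttt{method\_range} and the function \texttt{find\_method} are
hypothetical, and the end PC is assumed to be exclusive.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry type; CACAO's real method tree is not shown. */
struct method_range {
    uintptr_t start_pc;   /* first byte of the method's code    */
    uintptr_t end_pc;     /* one past the last byte of the code */
    void     *pv;         /* start of the method's data segment */
};

/* Finds the entry whose [start_pc, end_pc) range contains pc. The
 * array must be sorted by start_pc and the ranges must not overlap.
 * Returns NULL if pc does not belong to any compiled method. */
static const struct method_range *
find_method(const struct method_range *tab, size_t n, uintptr_t pc)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (pc < tab[mid].start_pc)
            hi = mid;
        else if (pc >= tab[mid].end_pc)
            lo = mid + 1;
        else
            return &tab[mid];
    }
    return NULL;
}
\end{verbatim}

A balanced tree offers the same logarithmic lookup and additionally
allows cheap insertion whenever a newly compiled method is registered.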

We use \texttt{SIGSEGV} for \textit{hardware null-pointer checking},
so we can handle this common exception as fast as possible in
CACAO. The signal handler creates a
\texttt{java.lang.NullPointerException}.

\texttt{SIGFPE} is used to catch integer division by zero exceptions
in hardware. The signal handler generates a
\texttt{java.lang.ArithmeticException} with \texttt{/ by zero} as detail
message.

Both exceptions are detected in hardware by default, but they can also
be checked in software using CACAO's command-line switch
\texttt{-softnull}. On the RISC ports only the \textit{null-pointer
exception} is checked in software when this switch is used, but on IA32
and AMD64 both conditions, the ones otherwise signalled by
\texttt{SIGSEGV} and \texttt{SIGFPE}, are checked in software.