doc/handbook/x86_64.tex

   1 \section{AMD64 (x86\_64) code generator}
   2
   3 \subsection{Introduction}
   4
   5 The AMD64~\cite{AMD64} architecture, formerly known as x86\_64, is an
   6 improvement of the Intel IA32 architecture by AMD (Advanced Micro
   7 Devices~\cite{AMD}). The extraordinary success of the IA32 architecture
   8 and the upcoming memory address space problem on IA32 high-end
   9 servers led to a special design decision by AMD. Unlike Intel, which
  10 designed a completely new 64-bit architecture (IA64), AMD decided
  11 to extend the IA32 instruction set with a new 64-bit instruction mode.
  12
  13 Because IA32 instructions, unlike the instructions of RISC machines,
  14 have no fixed length, it was easy for AMD to introduce a
  15 new \textit{prefix byte} called \texttt{REX}. The \textit{REX prefix}
  16 enables the 64-bit operation mode of the following instruction in the
  17 new \textit{64-bit mode} of the processor.
  18
  19 A processor of the AMD64 architecture has two main operating modes:
  20
  21 \begin{itemize}
  22 \item Long Mode
  23 \item Legacy Mode
  24 \end{itemize}
  25
  26 In \textit{Legacy Mode} the processor acts like an IA32
  27 processor. Any 32-bit operating system or software can be run on this
  28 type of processor without changes, so companies running IA32 servers
  29 and software can change their hardware to AMD64 and their systems
  30 remain operational. This was AMD's main intention in developing this
  31 architecture. \textit{Long Mode} is further split into two
  32 coexistent operating modes:
  33
  34 \begin{itemize}
  35 \item 64-bit Mode
  36 \item Compatibility Mode
  37 \end{itemize}
  38
  39 The \textit{64-bit Mode} exposes the power of this architecture. Any
  40 memory operation now uses 64-bit addresses and ALU instructions can
  41 operate on 64-bit operands. Within \textit{Compatibility Mode} any
  42 IA32 software can be run under the control of a 64-bit operating
  43 system. This, as mentioned before, is yet another reason for companies
  44 to change their hardware to AMD64, because their software can be
  45 migrated gradually to the new 64-bit systems. However, not every type
  46 of software is faster as 64-bit code. Every memory address loaded from
  47 or stored to memory now transfers 64 bits instead of 32 bits, which
  48 means twice as much memory traffic for addresses as on IA32 machines.
  49
  50 Another crucial point that makes the AMD64 architecture faster than
  51 IA32 is the limited number of registers on IA32. Any IA32 processor,
  52 from the early \textit{i386} to the newest generation of \textit{Intel
  53 Pentium 4} or \textit{AMD Athlon}, has only 8 general-purpose registers.
  54 With the \textit{REX prefix}, AMD extends every register number in an
  55 instruction by 1 bit. This means in \textit{64-bit Mode} 16
  56 general-purpose registers are available. The value of a \textit{REX
  57 prefix} is in the range \texttt{40h} through \texttt{4Fh}, depending
  58 on the particular bits used (see table \ref{REX}).
  59
  60 \begin{table}
  61 \begin{center}
  62 \begin{tabular}[b]{|c|c|l|}
  63 \hline
  64 Mnemonic & Bit Position & Definition \\ \hline
  65 -        & 7-4          & 0100 \\ \hline
  66 REX.W    & 3            & 0 = Default operand size \\
  67          &              & 1 = 64-bit operand size \\ \hline
  68 REX.R    & 2            & 1-bit (high) extension of the ModRM \textit{reg} field, \\
  69          &              & thus permitting access to 16 registers. \\ \hline
  70 REX.X    & 1            & 1-bit (high) extension of the SIB \textit{index} field, \\
  71          &              & thus permitting access to 16 registers. \\ \hline
  72 REX.B    & 0            & 1-bit (high) extension of the ModRM \textit{r/m} field, \\
  73          &              & SIB \textit{base} field, or opcode \textit{reg} field, thus \\
  74          &              & permitting access to 16 registers. \\ \hline
  75 \end{tabular}
  76 \caption{REX Prefix Byte Fields}
  77 \label{REX}
  78 \end{center}
  79 \end{table}
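
The bit layout of table \ref{REX} can be sketched in a few lines of C.
This is only an illustration: the function \texttt{rex\_byte} and its
parameter names are assumptions made for this example and are not
taken from the CACAO sources. Register numbers are assumed to lie in
the range 0 through 15.

\begin{verbatim}
#include <stdint.h>

/* Hypothetical helper (not part of CACAO): build a REX prefix byte.
 *   w   - 1 selects the 64-bit operand size (REX.W)
 *   reg - register number used in the ModRM reg field
 *   idx - register number used in the SIB index field
 *   rm  - register number used in ModRM r/m, SIB base or opcode reg
 */
static uint8_t rex_byte(int w, int reg, int idx, int rm)
{
    return (uint8_t) (0x40                  /* bits 7-4: 0100        */
        | ((w & 1) << 3)                    /* REX.W                 */
        | (((reg >> 3) & 1) << 2)           /* REX.R: bit 3 of reg   */
        | (((idx >> 3) & 1) << 1)           /* REX.X: bit 3 of index */
        | ((rm >> 3) & 1));                 /* REX.B: bit 3 of r/m   */
}
\end{verbatim}

For example, \texttt{rex\_byte(1, 0, 0, 0)} yields \texttt{48h}, the
prefix of a 64-bit instruction that only uses the first 8 registers.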
  80
  81
  82 \subsection{Code generation}
  83
  84 AMD64 code generation is mostly the same as on IA32. All new 64-bit
  85 instructions can handle both \textit{memory operands} and
  86 \textit{register operands}, so there is no need to change the
  87 implementation of the IA32 ICMDs.
  88
  89 Much better code generation can be achieved in the area of
  90 \textit{long arithmetic}. Since all 16 general-purpose registers can
  91 hold 64-bit integer values, there is no need for special long
  92 handling as on IA32, where we stored all long variables in memory. As
  93 an example, take a simple \texttt{ICMD\_LADD} on IA32, shown for the
  94 best case, \texttt{src->regoff == iptr->dst->regoff}:
  95
  96 \begin{verbatim}
  97         i386_mov_membase_reg(REG_SP, src->prev->regoff * 8, REG_ITMP1);
  98         i386_alu_reg_membase(I386_ADD, REG_ITMP1, REG_SP, iptr->dst->regoff * 8);
  99         i386_mov_membase_reg(REG_SP, src->prev->regoff * 8 + 4, REG_ITMP1);
 100         i386_alu_reg_membase(I386_ADC, REG_ITMP1, REG_SP, iptr->dst->regoff * 8 + 4);
 101 \end{verbatim}
 102
 103 The first memory operand is added to the second memory operand, which
 104 is at the same stack location as the destination operand. This means
 105 four instructions are executed for one \texttt{long} addition. If we
 106 used registers for \texttt{long} variables we could get a
 107 \textit{best-case} of two instructions, namely \textit{add} followed
 108 by an \textit{adc}. On AMD64 we can generate one instruction for this
 109 addition:
 110
 111 \begin{verbatim}
 112         x86_64_alu_reg_reg(X86_64_ADD, src->prev->regoff, iptr->dst->regoff);
 113 \end{verbatim}
 114
 115 This means the AMD64 port needs a quarter of the instructions of the
 116 IA32 port for this addition (and avoids the memory accesses). Even if
 117 we implemented the usage of registers for \texttt{long} variables
 118 on IA32, the AMD64 port would still need only half as many instructions.
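
Why IA32 needs an \textit{add} followed by an \textit{adc} can be
illustrated in plain C. The following is a sketch, not CACAO code; the
function names are made up for this example. Unsigned types are used
so that overflow wraps around in C, which gives the same bit pattern as
the two's complement \texttt{long} addition of JAVA.

\begin{verbatim}
#include <stdint.h>

/* 64-bit addition built from 32-bit halves, the way the IA32
 * add/adc pair computes it (illustration only). */
static uint64_t ladd_ia32_style(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t) a, a_hi = (uint32_t) (a >> 32);
    uint32_t b_lo = (uint32_t) b, b_hi = (uint32_t) (b >> 32);

    uint32_t lo    = a_lo + b_lo;          /* add: low halves        */
    uint32_t carry = lo < a_lo;            /* carry out of the add   */
    uint32_t hi    = a_hi + b_hi + carry;  /* adc: high halves + CF  */

    return ((uint64_t) hi << 32) | lo;
}

/* On AMD64 the same operation is a single 64-bit add. */
static uint64_t ladd_amd64_style(uint64_t a, uint64_t b)
{
    return a + b;
}
\end{verbatim}

Both functions return the same value for all inputs; only the number
of machine operations behind them differs.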
 119
 120 To be able to use the new 64-bit instructions, we need to prefix
 121 nearly all instructions---some instructions can be used in their
 122 64-bit mode without escaping---with the mentioned \textit{REX prefix}
 123 byte. In CACAO we use a macro called
 124
 125 \begin{verbatim}
 126         x86_64_emit_rex(size,reg,index,rm)
 127 \end{verbatim}
 128
 129 to emit this byte. The argument names correspond to the fields
 130 of the \textit{REX prefix} itself (see table \ref{REX}).
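
To show where the prefix ends up in the instruction stream, here is a
minimal sketch that encodes a complete 64-bit register-to-register
addition. The function \texttt{emit\_add\_reg\_reg64} is hypothetical
and only stands in for the combination of prefix, opcode and address
byte; it is not the macro used in CACAO. Register numbers are the
hardware numbers 0 through 15.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Hypothetical emitter: encode "add src, dst" (AT&T operand order)
 * with 64-bit operand size, i.e. REX.W + opcode 01 /r.
 * Writes 3 bytes to buf and returns the number of bytes written. */
static size_t emit_add_reg_reg64(uint8_t *buf, int src, int dst)
{
    buf[0] = (uint8_t) (0x48              /* 0100 and REX.W = 1     */
        | (((src >> 3) & 1) << 2)         /* REX.R: bit 3 of src    */
        | ((dst >> 3) & 1));              /* REX.B: bit 3 of dst    */
    buf[1] = 0x01;                        /* ADD r/m64, r64         */
    buf[2] = (uint8_t) (0xC0              /* mod = 11: reg to reg   */
        | ((src & 7) << 3)                /* ModRM reg field = src  */
        | (dst & 7));                     /* ModRM r/m field = dst  */
    return 3;
}
\end{verbatim}

For \texttt{src = 6} (\texttt{\%rsi}) and \texttt{dst = 7}
(\texttt{\%rdi}) the sketch produces the bytes \texttt{48 01 F7}, which
is \texttt{add \%rsi,\%rdi}.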
 131
 132 The AMD64 architecture also introduces a new addressing method called
 133 \textit{RIP-relative addressing} or \textit{PC-relative addressing}. In
 134 64-bit mode, certain instructions can address memory relative to the
 135 contents of the 64-bit instruction pointer (program counter). In these
 136 instructions, the effective address is formed by adding the
 137 displacement to the 64-bit \texttt{RIP} of the next instruction. With
 138 this addressing mode, we can replace the IA32 method of addressing
 139 data in the method's data segment. For example, in the
 140 implementation of the \texttt{ICMD\_PUTSTATIC} instruction,
 141 the IA32 code
 142
 143 \begin{verbatim}
 144         a = dseg_addaddress(&(((fieldinfo *) iptr->val.a)->value));
 145         i386_mov_imm_reg(0, REG_ITMP2);
 146         dseg_adddata(mcodeptr);
 147         i386_mov_membase_reg(REG_ITMP2, a, REG_ITMP2);
 148 \end{verbatim}
 149
 150 can be replaced with the new \textit{RIP-relative addressing} code
 151
 152 \begin{verbatim}
 153         a = dseg_addaddress(&(((fieldinfo *) iptr->val.a)->value));
 154         x86_64_mov_membase_reg(RIP, -(((s8) mcodeptr + 7) - (s8) mcodebase) + a, REG_ITMP2);
 155 \end{verbatim}
 156
 157 So we can save one instruction on the read or write of a static
 158 variable. The additional offset of \texttt{+ 7} is the code size of
 159 the instruction itself. The pseudo register \texttt{RIP} is defined
 160 with
 161
 162 \begin{verbatim}
 163         #define RIP    -1
 164 \end{verbatim}
 165
 166 Thus we can determine the special \textit{RIP-relative addressing}
 167 mode in the code generating macro
 168 \texttt{x86\_64\_emit\_membase(basereg,disp,dreg)} with
 169
 170 \begin{verbatim}
 171         if ((basereg) == RIP) {
 172             x86_64_address_byte(0,(dreg),RBP);
 173             x86_64_emit_imm32((disp));
 174             break;
 175         }
 176 \end{verbatim}
 177
 178 and generate the \textit{RIP-relative addressing} code. As shown in
 179 the code sample, it is a special encoding of the \textit{address byte}
 180 with the \texttt{mod} field set to zero and \texttt{RBP} (\texttt{\%rbp})
 181 as base register.
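
The displacement arithmetic and the special address byte can be
sketched as follows. The helper names and parameters are assumptions
made for this illustration and do not appear in CACAO. The sketch
assumes that the distance between code and data fits into 32 bits. An
instruction length of 7 corresponds to a move that consists of a
\textit{REX prefix}, one opcode byte, the address byte and a 32-bit
displacement.

\begin{verbatim}
#include <stdint.h>

#define RBP_NUM 5   /* hardware register number of %rbp */

/* Hypothetical helper: displacement of a RIP-relative access.
 *   insn_addr - address of the first byte of the instruction
 *   insn_len  - length of the instruction in bytes
 *   target    - address of the data to be accessed
 * The displacement is relative to the next instruction. */
static int32_t rip_disp32(int64_t insn_addr, int insn_len,
                          int64_t target)
{
    return (int32_t) (target - (insn_addr + insn_len));
}

/* Address byte selecting RIP-relative addressing in 64-bit mode:
 * mod = 00 and r/m = 101, the slot that otherwise names %rbp. */
static uint8_t modrm_rip(int dreg)
{
    return (uint8_t) (((dreg & 7) << 3) | RBP_NUM);
}
\end{verbatim}

With \texttt{insn\_addr} playing the role of \texttt{mcodeptr} and
\texttt{target} being \texttt{mcodebase + a}, the first function
computes the same value as the displacement expression in the
\texttt{ICMD\_PUTSTATIC} code above.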
 182
 183
 184 \subsection{Constant handling}
 185
 186 As on IA32, the AMD64 code generator can use \textit{immediate move}
 187 instructions to load integer constants into their destination
 188 registers. The 64-bit extensions of the AMD64 architecture can also
 189 load 64-bit immediates inline. So loading a \texttt{long} constant
 190 takes just one instruction, instead of two instructions on the IA32
 191 architecture. Of course the AMD64 code generator uses the \textit{move
 192 long} (\texttt{movl}) instruction to load 32-bit \texttt{int}
 193 constants to minimize code size. The \texttt{movl} instruction clears
 194 the upper 32 bits of the destination register.
 195
 196 \begin{verbatim}
 197         case ICMD_ICONST:
 198                 ...
 199                 x86_64_movl_imm_reg(cd, iptr->val.i, d);
 200                 ...
 201
 202         case ICMD_LCONST:
 203                 ...
 204                 x86_64_mov_imm_reg(cd, iptr->val.l, d);
 205                 ...
 206 \end{verbatim}
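
The difference in code size can be seen from the instruction
encodings. The following sketch is not the CACAO implementation: the
two emitter functions are hypothetical and, to keep the example short,
only handle the registers 0 through 7.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Hypothetical emitter: movl $imm32, %r32 (opcode B8+r, 4 bytes of
 * immediate). Returns the instruction length of 5 bytes. */
static size_t emit_movl_imm_reg(uint8_t *buf, int32_t imm, int reg)
{
    int i;

    buf[0] = (uint8_t) (0xB8 | (reg & 7));
    for (i = 0; i < 4; i++)               /* little-endian immediate */
        buf[1 + i] = (uint8_t) ((uint32_t) imm >> (8 * i));
    return 5;
}

/* Hypothetical emitter: mov $imm64, %r64 (REX.W, opcode B8+r,
 * 8 bytes of immediate). Returns the instruction length of 10. */
static size_t emit_mov_imm64_reg(uint8_t *buf, int64_t imm, int reg)
{
    int i;

    buf[0] = 0x48;                        /* REX.W                   */
    buf[1] = (uint8_t) (0xB8 | (reg & 7));
    for (i = 0; i < 8; i++)               /* little-endian immediate */
        buf[2 + i] = (uint8_t) ((uint64_t) imm >> (8 * i));
    return 10;
}
\end{verbatim}

A 32-bit constant load takes 5 bytes in this encoding, while a full
64-bit immediate load takes 10 bytes.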
 207
 208 \texttt{float} and \texttt{double} values are loaded from the data
 209 segment via the \textit{move doubleword or quadword} (\texttt{movd})
 210 instruction with \textit{RIP-relative addressing}.
 211
 212
 213 \subsection{Calling conventions}
 214
 215 The AMD64 calling conventions are described in
 216 \cite{AMD64ABI}. CACAO uses a subset of these calling conventions to
 217 cover its requirements. CACAO just needs to pass the JAVA data types
 218 to called functions; no other special features are required. The byte
 219 sizes of the JAVA data types on the AMD64 port are shown in table
 220 \ref{javadatatypesizes}.
 221
 222 \begin{table}
 223 \begin{center}
 224 \begin{tabular}[b]{|l|c|}
 225 \hline
 226 JAVA Data Type   & Bytes \\ \hline
 227 \texttt{boolean} & 1     \\
 228 \texttt{byte}    &       \\ \hline
 229 \texttt{char}    & 2     \\
 230 \texttt{short}   &       \\ \hline
 231 \texttt{int}     & 4     \\
 232 \texttt{float}   &       \\ \hline
 233 \texttt{long}    & 8     \\
 234 \texttt{double}  &       \\
 235 \texttt{void}    &       \\ \hline
 236 \end{tabular}
 237 \caption{JAVA Data Type sizes on AMD64}
 238 \label{javadatatypesizes}
 239 \end{center}
 240 \end{table}
 241
 242 \subsubsection{Integer arguments}
 243
 244 The AMD64 architecture has 6 integer argument registers. The order of
 245 the argument registers is shown in table
 246 \ref{amd64integerargumentregisters}.
 247
 248 \begin{table}
 249 \begin{center}
 250 \begin{tabular}[b]{|l|l|}
 251 \hline
 252 Register       & Argument Register \\ \hline
 253 \texttt{\%rdi} & 1$^{\rm st}$      \\ \hline
 254 \texttt{\%rsi} & 2$^{\rm nd}$      \\ \hline
 255 \texttt{\%rdx} & 3$^{\rm rd}$      \\ \hline
 256 \texttt{\%rcx} & 4$^{\rm th}$      \\ \hline
 257 \texttt{\%r8}  & 5$^{\rm th}$      \\ \hline
 258 \texttt{\%r9}  & 6$^{\rm th}$      \\ \hline
 259 \end{tabular}
 260 \caption{AMD64 Integer Argument Registers}
 261 \label{amd64integerargumentregisters}
 262 \end{center}
 263 \end{table}
 264
 265 As on RISC machines, the remaining integer arguments are passed on the
 266 stack. Each integer argument, regardless of which integer JAVA data
 267 type, uses 8 bytes on the stack.
 268
 269 Integer return values of any integer JAVA data type are stored in
 270 \texttt{REG\_RESULT}, which is \texttt{\%rax}.
 271
 272 \subsubsection{Floating-point arguments}
 273
 274 The AMD64 architecture has 8 floating-point argument registers, namely
 275 \texttt{\%xmm0} through \texttt{\%xmm7}. \texttt{\%xmm} registers are
 276 128-bit wide floating-point registers on which SSE and SSE2
 277 instructions can operate. Remaining floating-point arguments are
 278 passed, like integer arguments, on the stack using 8 bytes per
 279 argument, regardless of the floating-point JAVA data type.
 280
 281 Floating point return values of any floating-point JAVA data type are
 282 stored in \texttt{\%xmm0}.
 283
 284 As shown, the calling conventions for the AMD64 architecture are
 285 similar to the calling conventions of RISC machines, which allows
 286 CACAO's \textit{register allocator algorithm} and \textit{stack
 287 space allocation algorithm} to be used largely unchanged.
 288
 289 Calling native functions means register moves and stack copying as
 290 on RISC machines. The amount depends on the number of arguments of
 291 the called native function. For non-static native functions the first
 292 integer argument has to be the JNI environment variable, so any
 293 arguments passed need to be shifted by one register, which can include
 294 creating a new stackframe and storing some arguments on the
 295 stack. Additionally, for static native functions the class pointer of
 296 the current object's class is passed in the 2$^{\rm nd}$ integer
 297 argument register. This means that the integer argument registers need
 298 to be shifted by two registers.
 299
 300 One difference between the AMD64 calling conventions and those of
 301 RISC-type machines, like Alpha or MIPS, is the allocation of integer and
 302 floating-point argument registers for calls with mixed integer and
 303 floating-point arguments. Assume a function like this:
 304
 305 \begin{verbatim}
 306         void sub(int a, float b, long c, double d);
 307 \end{verbatim}
 308
 309 On a RISC machine, like Alpha, we would have an argument register
 310 allocation as in figure \ref{alphaargumentregisterusage}.
 311 \texttt{a?} represent integer argument registers and \texttt{fa?}
 312 floating point argument registers.
 313
 314 \begin{figure}[htb]
 315 \begin{center}
 316 \setlength{\unitlength}{1mm}
 317 \begin{picture}(60,35)
 318 \thicklines
 319 \put(0,15){\framebox(15,10){a0 = a}}
 320 \put(30,15){\framebox(15,10){a2 = c}}
 321 \put(15,5){\framebox(15,10){fa1 = b}}
 322 \put(45,5){\framebox(15,10){fa3 = d}}
 323 \put(0,0){\line(0,1){30}}
 324 \end{picture}
 325 \caption{Alpha argument register usage for \texttt{void sub(int a, float b, long c, double d);}}
 326 \label{alphaargumentregisterusage}
 327 \end{center}
 328 \end{figure}
 329
 330 On AMD64 the same function call would look as shown in figure
 331 \ref{amd64argumentregisterusage}.
 332
 333 \begin{figure}[htb]
 334 \begin{center}
 335 \setlength{\unitlength}{1mm}
 336 \begin{picture}(60,35)
 337 \thicklines
 338 \put(0,15){\framebox(15,10){a0 = a}}
 339 \put(15,15){\framebox(15,10){a1 = c}}
 340 \put(0,5){\framebox(15,10){fa0 = b}}
 341 \put(15,5){\framebox(15,10){fa1 = d}}
 342 \put(0,0){\line(0,1){30}}
 343 \end{picture}
 344 \caption{AMD64 argument register usage for \texttt{void sub(int a, float b, long c, double d);}}
 345 \label{amd64argumentregisterusage}
 346 \end{center}
 347 \end{figure}
 348
 349 The register assignment would be \texttt{a0 = \%rdi}, \texttt{a1 =
 350 \%rsi}, \texttt{fa0 = \%xmm0} and \texttt{fa1 = \%xmm1}. This special
 351 usage of the argument registers required some changes in the argument
 352 register allocation algorithm for function calls during stack
 353 analysis and some changes in the code generator itself.
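
The AMD64 scheme amounts to two independent counters, one for integer
and one for floating-point arguments. The following is a minimal
sketch under that assumption; the function \texttt{assign\_arg\_regs},
the signature string and the register name tables are hypothetical and
only serve this illustration. Arguments are described by the JAVA
descriptor letters \texttt{I}, \texttt{J}, \texttt{F} and \texttt{D}.

\begin{verbatim}
#include <stddef.h>

#define INT_ARG_CNT 6
#define FLT_ARG_CNT 8

static const char *const int_arg_regs[INT_ARG_CNT] = {
    "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"
};
static const char *const flt_arg_regs[FLT_ARG_CNT] = {
    "%xmm0", "%xmm1", "%xmm2", "%xmm3",
    "%xmm4", "%xmm5", "%xmm6", "%xmm7"
};

/* Hypothetical helper: for every argument in sig store the name of
 * its argument register in out, or NULL if it goes on the stack.
 * 'F' and 'D' are floating-point arguments, anything else is
 * treated as an integer argument. */
static void assign_arg_regs(const char *sig, const char **out)
{
    size_t i;
    int iarg = 0, farg = 0;       /* two independent counters */

    for (i = 0; sig[i] != '\0'; i++) {
        if (sig[i] == 'F' || sig[i] == 'D')
            out[i] = (farg < FLT_ARG_CNT) ? flt_arg_regs[farg++]
                                          : NULL;
        else
            out[i] = (iarg < INT_ARG_CNT) ? int_arg_regs[iarg++]
                                          : NULL;
    }
}
\end{verbatim}

For the signature \texttt{"IFJD"} of the function \texttt{sub} the
sketch assigns \texttt{\%rdi}, \texttt{\%xmm0}, \texttt{\%rsi} and
\texttt{\%xmm1}. A scheme like the one shown for Alpha would instead
use a single counter shared by both register files.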
 354
 355
 356 \subsection{Register allocator}
 357
 358 As mentioned in the introduction, the AMD64 architecture has 16
 359 general-purpose registers and 16 floating-point registers. One
 360 general-purpose register is reserved for the \textit{stack
 361 pointer}---namely \texttt{\%rsp}---and thus cannot be used for
 362 arithmetic instructions. The register usage as used in CACAO is shown
 363 in table \ref{amd64registerusage}.
 364
 365 \begin{table}
 366 \begin{center}
 367 \begin{tabular}{l|l|l}
 368 Register       & Usage                                         & Callee-saved \\ \hline
 369 \texttt{\%rax} & return register, reserved for code generator  & no           \\
 370 \texttt{\%rcx} & 4$^{\rm th}$ argument register                & no           \\
 371 \texttt{\%rdx} & 3$^{\rm rd}$ argument register                & no           \\
 372 \texttt{\%rbx} & temporary register                            & no           \\
 373 \texttt{\%rsp} & stack pointer                                 & yes          \\
 374 \texttt{\%rbp} & callee-saved register                         & yes          \\
 375 \texttt{\%rsi} & 2$^{\rm nd}$ argument register                & no           \\
 376 \texttt{\%rdi} & 1$^{\rm st}$ argument register                & no           \\
 377 \texttt{\%r8}  & 5$^{\rm th}$ argument register                & no           \\
 378 \texttt{\%r9}  & 6$^{\rm th}$ argument register                & no           \\
 379 \texttt{\%r10}-\texttt{\%r11} & reserved for code generator    & no           \\
 380 \texttt{\%r12}-\texttt{\%r15} & callee-saved register          & yes          \\
 381 \texttt{\%xmm0} & 1$^{\rm st}$ argument register, return register & no        \\
 382 \texttt{\%xmm1}-\texttt{\%xmm7} & argument registers           & no           \\
 383 \texttt{\%xmm8}-\texttt{\%xmm10} & reserved for code generator & no           \\
 384 \texttt{\%xmm11}-\texttt{\%xmm15} & temporary registers        & no           \\
 385 \end{tabular}
 386 \caption{AMD64 Register usage in CACAO}
 387 \label{amd64registerusage}
 388 \end{center}
 389 \end{table}
 390
 391 There is only one change to the original AMD64 \textit{application
 392 binary interface} (ABI). CACAO uses \texttt{\%rbx} as a temporary
 393 register, while the AMD64 ABI uses the \texttt{\%rbx} register as a
 394 callee-saved register. So CACAO needs to save the \texttt{\%rbx}
 395 register when a JAVA method is called from a native function, like a
 396 JNI function. This is done in \texttt{asm\_calljavafunction} located in
 397 \texttt{jit/x86\_64/asmpart.S}.
 398
 399 In adapting the register allocator there was a problem concerning the
 400 order of the integer argument registers. The order of the first four
 401 argument registers is inverted. This can be seen in table
 402 \ref{amd64registerusage}, which is ordered ascending by the processor's
 403 internal register numbers. That means the ascending search algorithm
 404 for argument registers in the register allocator would allocate the
 405 first four argument registers in the wrong order. So there is a
 406 little hack implemented in CACAO's register allocator to handle this
 407 fact. After searching the register definition array for the argument
 408 registers, the first four argument registers are interchanged in their
 409 array. This is done by a simple code sequence (taken from
 410 \texttt{jit/reg.inc}):
 411
 412 \begin{verbatim}
 413         /*
 414          * on x86_64 the argument registers are not in ascending order
 415          * a00 (%rdi) <-> a03 (%rcx) and a01 (%rsi) <-> a02 (%rdx)
 416          */
 417         n = r->argintregs[3];
 418         r->argintregs[3] = r->argintregs[0];
 419         r->argintregs[0] = n;
 420
 421         n = r->argintregs[2];
 422         r->argintregs[2] = r->argintregs[1];
 423         r->argintregs[1] = n;
 424 \end{verbatim}
 425
 426
 427 \subsection{Floating-point arithmetic}
 428
 429 The AMD64 architecture has implemented two sets of floating-point
 430 instructions:
 431
 432 \begin{itemize}
 433 \item x87 (i387)
 434 \item SSE/SSE2
 435 \end{itemize}
 436
 437 The x87 \textit{floating-point unit} (FPU) implementation is
 438 completely compatible with the IA32 implementation, as it has been since
 439 the i386 with its i387 coprocessor, with all its advantages and
 440 drawbacks, like the 8-slot FPU stack.
 441
 442 The SSE/SSE2 technique is taken from recent generations of Intel
 443 processors (SSE was introduced with the Pentium III, SSE2 with the
 444 Pentium 4) and can process scalar 32-bit \texttt{float} values and
 445 scalar 64-bit \texttt{double} values in the 128-bit wide \texttt{xmm}
 446 floating-point registers. While SSE instructions operate on 32-bit
 447 \texttt{float} values, SSE2 is responsible for 64-bit \texttt{double}
 448 values. In CACAO we implemented the JAVA floating-point instructions
 449 using SSE/SSE2, because SSE/SSE2 is much easier to use and should be
 450 the technique of the future. In some areas SSE/SSE2 is slower than the
 451 old x87 implementation, even on the newly designed AMD64 architecture,
 452 but SSE/SSE2 offers 16 floating-point registers, which should speed up
 453 everyday JAVA floating-point calculations. Another big advantage of
 454 SSE/SSE2 over x87 is the absence of the \textit{single-double
 455 precision-rounding} problem, as described in detail in the ``IA32 code
 456 generator'' section. With SSE/SSE2 the 32-bit \texttt{float} and 64-bit
 457 \texttt{double} arithmetic is calculated and rounded in full IEEE 754
 458 compliance, so no further adjustments are needed to fulfill JAVA's
 459 floating-point requirements.
 460
 461 In floating-point to integer conversions a JVM has to
 462 check for corner cases as described in the JVM specification. This is
 463 done via a simple inline integer compare of the integer result value
 464 and a call to special assembler wrapper functions for builtin calls,
 465 like \texttt{asm\_builtin\_f2i} for \texttt{ICMD\_F2I} (the
 466 \texttt{float} to \texttt{int} conversion). These corner cases are then
 467 computed in a builtin C function with respect to all special cases
 468 like \textit{Infinity} or \textit{NaN} values.
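
The corner cases required by the JVM specification for the
\texttt{float} to \texttt{int} conversion can be written down in a few
lines of C. This is a sketch of the required semantics, not the
builtin function used in CACAO; the name \texttt{jvm\_f2i} is an
assumption made for this example. The range checks come before the
cast, because converting an out-of-range value is undefined in C.

\begin{verbatim}
#include <stdint.h>

/* Hypothetical helper: float to int conversion with the results
 * the JVM specification requires for the corner cases. */
static int32_t jvm_f2i(float f)
{
    if (f != f)                   /* NaN converts to 0               */
        return 0;
    if (f >= 2147483648.0f)       /* too large, includes +Infinity   */
        return INT32_MAX;
    if (f <= -2147483648.0f)      /* too small, includes -Infinity   */
        return INT32_MIN;
    return (int32_t) f;           /* in range: round toward zero     */
}
\end{verbatim}

Only the first three branches are corner cases. All other values take
the last branch, which corresponds to the plain hardware conversion.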
 469
 470
 471 \subsection{Exception handling}
 472
 473 Since the AMD64 architecture is just an extension to the IA32
 474 architecture, an AMD64 processor itself raises the same signals as an
 475 IA32 processor, so we can catch the same signals in our own signal
 476 handlers. This includes the signals \texttt{SIGSEGV} and
 477 \texttt{SIGFPE}.
 478
 479 When a signal of this type is raised and the signal hits our signal
 480 handler, we reinstall the handler, create a new exception object and
 481 jump to exception handling code written in assembler. The
 482 difference from the exception handling code of RISC machines is the
 483 fact that RISC machines have a \textit{procedure vector} (PV)
 484 register. So it is easy to find the method's data segment, which starts
 485 at the PV, growing down to smaller addresses like a stack. For the IA32
 486 and AMD64 architecture we had to implement a \textit{method tree}
 487 which contains the start \textit{program counter} (PC) and the end PC
 488 for every single JAVA method compiled in CACAO, to find for any
 489 exception PC the corresponding method and thus the PV. We need the
 490 data segment for the method's exception table (for a detailed
 491 description see section ``Exception handling'').
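
The lookup from an exception PC to its method can be illustrated with
a small sketch. It does not show CACAO's method tree: as a stand-in it
uses a sorted array with binary search, and the type
\texttt{method\_range} and the function \texttt{find\_method} are
hypothetical. The sketch assumes non-overlapping ranges and treats the
end PC as exclusive.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: code range [startpc, endpc) of one method. */
typedef struct {
    uintptr_t startpc;
    uintptr_t endpc;
    void     *pv;       /* procedure vector of the method */
} method_range;

/* Binary search over n entries sorted by startpc. Returns the
 * entry whose range contains pc, or NULL if there is none. */
static const method_range *find_method(const method_range *tab,
                                       size_t n, uintptr_t pc)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (pc < tab[mid].startpc)
            hi = mid;                 /* continue in the lower half */
        else if (pc >= tab[mid].endpc)
            lo = mid + 1;             /* continue in the upper half */
        else
            return &tab[mid];         /* startpc <= pc < endpc      */
    }
    return NULL;
}
\end{verbatim}

A \texttt{NULL} result means that the PC does not belong to any
compiled JAVA method.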
 492
 493 We use \texttt{SIGSEGV} for \textit{hardware null-pointer checking},
 494 so we can handle this common exception as fast as possible in
 495 CACAO. The signal handler creates a
 496 \texttt{java.lang.NullPointerException}.
 497
 498 \texttt{SIGFPE} is used to catch integer division by zero exceptions
 499 in hardware. The signal handler generates a
 500 \texttt{java.lang.ArithmeticException} with \texttt{/ by zero} as detail
 501 message.
 502
 503 Both exceptions are handled in hardware by default, but they can also
 504 be caught in software when using CACAO's command-line switch
 505 \texttt{-softnull}. On the RISC ports only the \textit{null-pointer
 506 exception} is checked in software when using this switch, but on IA32
 507 and AMD64 both are checked, \texttt{SIGSEGV} and \texttt{SIGFPE}.
 508
 509
 510 \subsection{Related work}
 511
 512 The AMD64 architecture is a relatively young architecture, released in
 513 April 2003. At the time of writing, the only available 64-bit
 514 operating systems for AMD64 are GNU/Linux (from different
 515 distributors), FreeBSD, NetBSD and OpenBSD. Microsoft Windows is not
 516 available yet, although it was announced to be released in the first
 517 half of 2004.
 518
 519 The first available 64-bit JVM for the AMD64 architecture was GCC's
 520 GCJ---The GNU Compiler for the Java Programming
 521 Language~\cite{GCJ}. \texttt{gcj} itself is a portable, optimizing,
 522 ahead-of-time compiler for the JAVA Programming Language, which can
 523 compile:
 524
 525 \begin{itemize}
 526 \item JAVA source code directly to native machine code
 527 \item JAVA source code to JAVA bytecode (class files)
 528 \item JAVA bytecode to native machine code
 529 \end{itemize}
 530
 531 One part of the GCJ is \texttt{gij}, which is the JVM
 532 interpreter. Much of the porting effort for the \textit{GNU Compiler
 533 Collection} to the AMD64 architecture was done by people working at
 534 SUSE~\cite{SUSE}.
 535
 536 For a long time no AMD64 JIT was available, until Sun~\cite{Sun} released
 537 their AMD64 version of J2SE 1.4.2-rc1 for GNU/Linux by
 538 Blackdown~\cite{Blackdown} in December 2003. At that time our AMD64
 539 JIT had already been working for months, but we were not able to release
 540 CACAO, because CACAO as a whole was not yet a compliant
 541 JVM. The Sun JVM uses the HotSpot Server VM by default; the HotSpot
 542 Client VM is not available for AMD64 at this time.
 543
 544 The Kaffe~\cite{Wilkinson:97} developers have ported their interpreter
 545 to the AMD64 architecture for GNU/Linux, but they still have no plans
 546 to port their JIT.