From 48e3a41c9a90e17830c94124b619d0e5192a9fda Mon Sep 17 00:00:00 2001 From: twisti Date: Mon, 2 Aug 2004 13:53:02 +0000 Subject: [PATCH] Save. --- doc/handbook/x86_64.tex | 299 ++++++++++++++++++++++++++++------------ 1 file changed, 208 insertions(+), 91 deletions(-) diff --git a/doc/handbook/x86_64.tex b/doc/handbook/x86_64.tex index 0eb684fd5..341f20379 100644 --- a/doc/handbook/x86_64.tex +++ b/doc/handbook/x86_64.tex @@ -2,16 +2,16 @@ \subsection{Introduction} -The AMD64 architecture, formerly know as x86\_64, is an improvement of -the Intel IA32 architecture by AMD -- Advanced Micro Devices. The -extraordinary success of the IA32 architecture and the upcoming memory -address space problem on IA32 high-end servers, led to a special -design decision. Unlike Intel, with it's completely new designed IA64 -architecture, AMD decided to extend the IA32 instruction set with -new 64-bit instructions. - -Due to the fact that the IA32 instructions have no fixed length, as -this is the fact on RISC machines, it was easy for them to introduce a +The AMD64~\cite{AMD64} architecture, formerly known as x86\_64, is an +improvement of the Intel IA32 architecture by AMD---Advanced Micro +Devices~\cite{AMD}. The extraordinary success of the IA32 architecture +and the upcoming memory address space problem on IA32 high-end +servers led to a special design decision by AMD. Unlike Intel, with +its newly designed 64-bit architecture---IA64---AMD decided +to extend the IA32 instruction set with a new 64-bit instruction mode. + +Because IA32 instructions have no fixed length, unlike +instructions on RISC machines, it was easy for AMD to introduce a new \textit{prefix byte} called \texttt{REX}. The \textit{REX prefix} enables the 64-bit operation mode of the following instruction in the new \textit{64-bit mode} of the processor. @@ -39,22 +39,23 @@ coexistent operating modes: The \textit{64-bit Mode} exposes the power of this architecture. 
Any memory operation now uses 64-bit addresses and ALU instructions can operate on 64-bit operands. Within \textit{Compatibility Mode} any -IA32 software can be run under the control of 64-bit operation +IA32 software can be run under the control of 64-bit operating system. This, as mentioned before, is yet another point for companies to change their hardware to AMD64. So their software can be slowly -migrated to the new 64-bit system, but not every type of software is -faster in 64-bit code. - -Another crucial pointer to make the AMD64 architecture faster than -IA32, is the limited number of registers. Any IA32 architecture, from -the early \textit{i386} to the newest generation of \textit{Intel -Pentium 4} or \textit{AMD Athlon}, has only 8 general-purpose -registers. With the \textit{REX prefix}, AMD has the ability to -increase the amount of accessible registers by 1 bit. This means in -\textit{64-bit Mode} 16 general-purpose registers are available. The -value of a \textit{REX prefix} is in the range \texttt{40h} through -\texttt{4Fh}, depending on the particular bits used (see table -\ref{REX}). +migrated to the new 64-bit systems, but not every type of software is +faster in 64-bit code. Any memory address fetched or stored into +memory needs to transfer now 64-bits instead of 32-bits. This means +twice as much memory transfer as on IA32 machines. + +Another crucial point to make the AMD64 architecture faster than IA32, +is the limited number of registers. Any IA32 architecture, from the +early \textit{i386} to the newest generation of \textit{Intel Pentium +4} or \textit{AMD Athlon}, has only 8 general-purpose registers. With +the \textit{REX prefix}, AMD has the ability to increase the amount of +accessible registers by 1 bit. This means in \textit{64-bit Mode} 16 +general-purpose registers are available. The value of a \textit{REX +prefix} is in the range \texttt{40h} through \texttt{4Fh}, depending +on the particular bits used (see table \ref{REX}). 
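The range \texttt{40h} through \texttt{4Fh} follows from the layout of the prefix: the upper four bits are fixed to \texttt{0100} and the lower four bits are the flags W, R, X and B. A minimal sketch of that encoding in C is given here. The function names \texttt{rex\_byte} and \texttt{needs\_rex\_bit} are hypothetical and are not CACAO's \texttt{x86\_64\_emit\_rex} macro.

```c
#include <stdint.h>

/* Hypothetical helper, not CACAO's x86_64_emit_rex macro.
 * Builds a REX prefix byte with the layout 0100 W R X B:
 *   w: 1 selects a 64-bit operand size
 *   r: extends the ModRM reg field
 *   x: extends the SIB index field
 *   b: extends the ModRM rm field (or SIB base) */
uint8_t rex_byte(int w, int r, int x, int b)
{
    return (uint8_t)(0x40 | ((w & 1) << 3) | ((r & 1) << 2)
                          | ((x & 1) << 1) | (b & 1));
}

/* Registers 8..15 need the extra bit carried by REX; the low three
 * bits of the register number still go into the ModRM or SIB byte. */
int needs_rex_bit(int regnum)
{
    return (regnum >> 3) & 1;
}
```

For example, a 64-bit register-to-register operation on the low eight registers would use \texttt{rex\_byte(1, 0, 0, 0)}, which is \texttt{48h}.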
\begin{table} \begin{center} @@ -88,9 +89,9 @@ implementation of the IA32 ICMDs. Much better code generation can be achieved in the area of \textit{long arithmetic}. Since all 16 general-purpose registers can hold 64-bit integer values, there is no need for special long -handling, like on IA32 were we stored all long varibales in memory. A -simple \texttt{ICMD\_LADD} on IA32, best case shown for AMD64 --- -\texttt{src->regoff == iptr->dst->regoff}: +handling, like on IA32 were we stored all long varibales in memory. As +example a simple \texttt{ICMD\_LADD} on IA32, best case shown for +AMD64 --- \texttt{src->regoff == iptr->dst->regoff}: \begin{verbatim} i386_mov_membase_reg(REG_SP, src->prev->regoff * 8, REG_ITMP1); @@ -101,10 +102,11 @@ simple \texttt{ICMD\_LADD} on IA32, best case shown for AMD64 --- First memory operand is added to second memory operand which is at the same stack location as the destination operand. This means, there are -four instructions executed for one long addition. If we would use -registers for long variables we could get a \textit{best-case} of two -instructions, namely \textit{add} followed by an \textit{adc}. On -AMD64 we can generate one instruction for this addition: +four instructions executed for one \texttt{long} addition. If we would +use registers for \texttt{long} variables we could get a +\textit{best-case} of two instructions, namely \textit{add} followed +by an \textit{adc}. On AMD64 we can generate one instruction for this +addition: \begin{verbatim} x86_64_alu_reg_reg(X86_64_ADD, src->prev->regoff, iptr->dst->regoff); @@ -112,20 +114,20 @@ AMD64 we can generate one instruction for this addition: This means, the AMD64 port is \textit{four-times} faster than the IA32 port (maybe even more, because we do not use memory accesses). Even if -we would implement the usage of registers for long variables on IA32, -the AMD64 port would be at least twice as fast. 
+we would implement the usage of registers for \texttt{long} variables +on IA32, the AMD64 port would be at least twice as fast. To be able to use the new 64-bit instructions, we need to prefix -nearly all instructions --- some instructions can be used in 64-bit -mode without escaping --- with the mentioned \textit{REX prefix} +nearly all instructions---some instructions can be used in their +64-bit mode without escaping---with the mentioned \textit{REX prefix} byte. In CACAO we use a macro called \begin{verbatim} x86_64_emit_rex(size,reg,index,rm) \end{verbatim} -The names of the arguments are respective to their use in the -\textit{REX prefix} (see table \ref{REX}). +to emit this byte. The names of the arguments are respective to their +usage in the \textit{REX prefix} itself (see table \ref{REX}). The AMD64 architecture introduces also a new addressing method called \textit{RIP-relative addressing}. In 64-bit mode, addressing relative @@ -175,8 +177,8 @@ mode in the code generating macro and generate the \textit{RIP-relative addressing} code. As shown in the code sample, it's an special encoding of the \textit{address byte} -mit the \texttt{mod} field set to zero and \texttt{RBP} -(\texttt{\%rbp}) as baseregister. +with \texttt{mod} field set to zero and \texttt{RBP} (\texttt{\%rbp}) +as baseregister. \subsection{Constant handling} @@ -187,9 +189,9 @@ registers. The 64-bit extensions of the AMD64 architecture can also load 64-bit immediates inline. So loading a \texttt{long} constant just uses one instruction, despite of two instructions on the IA32 architecture. Of course the AMD64 code generator uses the \textit{move -long} (\texttt{movl}) instruction to load 32-bit \texttt{int} constants -to minimize code size. This instruction clears the upper 32-bit of the -destination register. +long} (\texttt{movl}) instruction to load 32-bit \texttt{int} +constants to minimize code size. The \texttt{movl} instruction clears +the upper 32-bit of the destination register. 
\begin{verbatim} case ICMD_ICONST: @@ -210,9 +212,10 @@ instruction with \textit{RIP-relative addressing}. \subsection{Calling conventions} -The AMD64 calling conventions are described here \ref{}. CACAO uses a -subset of this calling convention, to cover its requirements. CACAO -just needs to pass the JAVA data types, no other special features. The +The AMD64 calling conventions are described here +\cite{AMD64ABI}. CACAO uses a subset of this calling convention, to +cover its requirements. CACAO just needs to pass the JAVA data types +to called functions, no other special features are required. The byte sizes of the JAVA data types on the AMD64 port are shown in table \ref{javadatatypesizes}. @@ -260,27 +263,28 @@ Register & Argument Register \\ \hline \end{table} As on RISC machines, the remaining integer arguments are passed on the -stack. Each integer argument, regardless of which size, uses 8 bytes -on the stack. +stack. Each integer argument, regardless of which integer JAVA data +type, uses 8 bytes on the stack. -Integer return values of any size are stored in \texttt{REG\_RESULT}, -which is \texttt{\%rax}. +Integer return values of any integer JAVA data type are stored in +\texttt{REG\_RESULT}, which is \texttt{\%rax}. -\subsubsection{Floating point arguments} +\subsubsection{Floating-point arguments} The AMD64 architecture has 8 floating point argument registers, namely \texttt{\%xmm0} through \texttt{\%xmm7}. \texttt{\%xmm} registers are 128-bit wide floating point registers on which SSE and SSE2 instructions can operate. Remaining floating point arguments are -passed, like with integer arguments, on the stack using 8 bytes per -argument. +passed, like integer arguments, on the stack using 8 bytes per +argument, regardless to the floating-point JAVA data type. -Floating point return values are stored in \texttt{\%xmm0}. +Floating point return values of any floating-point JAVA data type are +stored in \texttt{\%xmm0}. 
As shown, the calling conventions for the AMD64 architecture are -nearly the same as for RISC machines, which allows to use CACAOs -\textit{register allocator algorithm} and \textit{stack space -allocation algorithm} without any changes. +similar to the calling conventions of RISC machines, which allows to +use CACAOs \textit{register allocator algorithm} and \textit{stack +space allocation algorithm} without any changes. Calling native functions means register moves and stack copying like on RISC machines. This depends on the count of the arguments used for @@ -293,19 +297,19 @@ the current objects' class is passed in the 2$^{\rm nd}$ integer argument register. This means that the integer argument registers need to be shifted by two registers. -One difference of the calling convention to RISC type machines, like -Alpha or MIPS, is the usage of integer and floating point argument -registers with mixed integer and floating point arguments. Assume a -function like this: +One difference of the AMD64 calling conventions to RISC type machines, +like Alpha or MIPS, is the allocation of integer and floating point +argument registers with mixed integer and floating point +arguments. Assume a function like this: \begin{verbatim} void sub(int a, float b, long c, double d); \end{verbatim} On a RISC machine, like Alpha, we would have an argument register -usage like in figure \ref{alphaargumentregisterusage}. \texttt{a?} -represent integer argument registers and \texttt{fa?} floating point -argument registers. +allocation like in figure \ref{alphaargumentregisterusage}. +\texttt{a?} represent integer argument registers and \texttt{fa?} +floating point argument registers. \begin{figure}[htb] \begin{center} @@ -353,29 +357,31 @@ analysis and some changes in the code generator itself. As mentioned in the introduction, the AMD64 architecture has 16 general-purpose registers and 16 floating-point registers. 
One -general-purpose register is reserved for the \textit{stack pointer} ---- namely \texttt{\%rsp} --- and thus cannot be used for arithmetic -instructions. The register usage as used in CACAO is shown in table -\ref{amd64registerusage}. +general-purpose register is reserved for the \textit{stack +pointer}---namely \texttt{\%rsp}---and thus cannot be used for +arithmetic instructions. The register usage as used in CACAO is shown +in table \ref{amd64registerusage}. \begin{table} \begin{center} \begin{tabular}{l|l|l} -Register & Usage & Callee-saved \\ \hline -\texttt{\%rax} & return register, reserved for code generator & no \\ -\texttt{\%rcx} & 4$^{\rm th}$ argument register & no \\ -\texttt{\%rdx} & 3$^{\rm rd}$ argument register & no \\ -\texttt{\%rbx} & temporary register & no \\ -\texttt{\%rsp} & stack pointer & yes \\ -\texttt{\%rbp} & callee-saved register & yes \\ -\texttt{\%rsi} & 2$^{\rm nd}$ argument register & no \\ -\texttt{\%rdi} & 1$^{\rm st}$ argument register & no \\ -\texttt{\%r8} & 5$^{\rm th}$ argument register & no \\ -\texttt{\%r9} & 6$^{\rm th}$ argument register & no \\ -\texttt{\%r10}-\texttt{\%r11} & reserved for code generator & no \\ -\texttt{\%r12}-\texttt{\%r15} & callee-saved register & yes \\ -\texttt{\%xmm0}-\texttt{\%xmm7} & argument registers & no \\ -\texttt{\%xmm8}-\texttt{\%xmm15} & temporary registers & no \\ +Register & Usage & Callee-saved \\ \hline +\texttt{\%rax} & return register, reserved for code generator & no \\ +\texttt{\%rcx} & 4$^{\rm th}$ argument register & no \\ +\texttt{\%rdx} & 3$^{\rm rd}$ argument register & no \\ +\texttt{\%rbx} & temporary register & no \\ +\texttt{\%rsp} & stack pointer & yes \\ +\texttt{\%rbp} & callee-saved register & yes \\ +\texttt{\%rsi} & 2$^{\rm nd}$ argument register & no \\ +\texttt{\%rdi} & 1$^{\rm st}$ argument register & no \\ +\texttt{\%r8} & 5$^{\rm th}$ argument register & no \\ +\texttt{\%r9} & 6$^{\rm th}$ argument register & no \\ +\texttt{\%r10}-\texttt{\%r11} & 
reserved for code generator & no \\ +\texttt{\%r12}-\texttt{\%r15} & callee-saved register & yes \\ +\texttt{\%xmm0} & 1$^{\rm st}$ argument register, return register & no \\ +\texttt{\%xmm1}-\texttt{\%xmm7} & argument registers & no \\ +\texttt{\%xmm8}-\texttt{\%xmm10} & reserved for code generator & no \\ +\texttt{\%xmm11}-\texttt{\%xmm15} & temporary registers & no \\ \end{tabular} \caption{AMD64 Register usage in CACAO} \label{amd64registerusage} @@ -383,9 +389,12 @@ Register & Usage & Callee-saved \\ \end{table} There is only one change to the original AMD64 \textit{application -binary interface} --- ABI. CACAO uses \texttt{\%rbx} as temporary +binary interface} (ABI). CACAO uses \texttt{\%rbx} as temporary register, while the AMD64 ABI uses the \texttt{\%rbx} register as -callee-saved register. +callee-saved register. So CACAO needs to save the \texttt{\%rbx} +register when a JAVA method is called from a native function, like a +JNI function. This is done in \texttt{asm\_calljavafunction} located in +\texttt{jit/x86\_64/asmpart.S}. In adapting the register allocator there was a problem concerning the order of the integer argument registers. The order of the first four @@ -415,15 +424,123 @@ array. This is done by a simple code sequence (taken from \end{verbatim} -\subsection{Floating point arithmetic} +\subsection{Floating-point arithmetic} -The AMD64 architecture has implemented two sets of floating point instructions: +The AMD64 architecture has implemented two sets of floating-point +instructions: \begin{itemize} -\item old i387 (x87) -\item SSE and SSE2 +\item x87 (i387) +\item SSE/SSE2 \end{itemize} -The x87 \textit{floating point unit} (FPU) implementation is -completely compatible to the IA32 implementation with all its -advantages and drawbacks, like the FPU stack. 
+The x87 \textit{floating-point unit} (FPU) implementation is +completely compatible with the IA32 implementation, dating back to the i386 with +its i387 coprocessor, with all its advantages and drawbacks, like the +8-slot FPU stack. + +The SSE/SSE2 technique is taken from recent generations of Intel +processors (SSE came with the Pentium III, SSE2 with the Pentium 4) and can process scalar +32-bit \texttt{float} values and scalar 64-bit \texttt{double} values +in the 128-bit wide \texttt{xmm} floating-point registers. While SSE +instructions operate on 32-bit \texttt{float} values, SSE2 is +responsible for 64-bit \texttt{double} values. In CACAO we implemented +the JAVA floating-point instructions using SSE/SSE2, because SSE/SSE2 +is much easier to use and should be the technique of the future. In +some areas SSE/SSE2 is slower than the old x87 implementation, even on +the newly designed AMD64 architecture, but SSE/SSE2 offers 16 +floating-point registers, which should speed up everyday JAVA +floating-point calculations. Another big advantage of SSE/SSE2 over x87 +is the absence of the \textit{single-double precision-rounding} problem, as +described in detail in the ``IA32 code generator'' section. With +SSE/SSE2 the 32-bit \texttt{float} and 64-bit \texttt{double} +arithmetic is calculated and rounded in full compliance with IEEE 754, so +no further adjustments are needed to fulfill JAVA's +floating-point requirements. + +When converting floating-point values to integer values, a JVM has to +check for corner cases as described in the JVM specification. This is +done via a simple inline integer compare of the integer result value +and a call to special assembler wrapper functions for builtin calls, +like \texttt{asm\_builtin\_f2i} for \texttt{ICMD\_F2I} (the +\texttt{float} to \texttt{int} conversion). These corner cases are then +computed in a builtin C function that handles all special cases +like \textit{Infinity} or \textit{NaN} values. 
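The corner cases for the \texttt{float} to \texttt{int} conversion can be sketched in C as follows. This is an illustration of the rules in the JVM specification (NaN becomes 0, out-of-range values saturate), not CACAO's actual builtin. The names \texttt{sketch\_f2i} and \texttt{result\_needs\_check} are hypothetical, and the assumption that the inline compare tests for \texttt{80000000h} rests on the fact that the SSE truncating conversion instructions return that value for NaN and out-of-range inputs.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical slow path: float to int with JVM semantics.
 * NaN converts to 0, values beyond the int range saturate, and all
 * other values are truncated toward zero. */
int32_t sketch_f2i(float f)
{
    if (isnan(f))
        return 0;
    if (f >= 2147483648.0f)      /* 2^31 and above, including +Infinity */
        return INT32_MAX;
    if (f <= -2147483648.0f)     /* -2^31 and below, including -Infinity */
        return INT32_MIN;
    return (int32_t)f;           /* in range, so the cast is well defined */
}

/* Assumed form of the inline check: the hardware conversion yields
 * 0x80000000 for NaN and out-of-range inputs, so only that result
 * needs the slow path. */
int result_needs_check(int32_t hw_result)
{
    return hw_result == INT32_MIN;
}
```

The check also fires for an input of exactly $-2^{31}$, which is harmless because the slow path returns the same value.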
+ + +\subsection{Exception handling} + +Since the AMD64 architecture is just an extension to the IA32 +architecture, an AMD64 processor raises the same hardware exceptions as an +IA32 processor, so we can catch the same signals in our own signal +handlers. This includes the signals \texttt{SIGSEGV} and +\texttt{SIGFPE}. + +When a signal of this type is raised and the signal hits our signal +handler, we reinstall the handler, create a new exception object and +jump to exception handling code written in assembler. The +difference from the exception handling code of RISC machines is the +fact that RISC machines have a \textit{procedure vector} (PV) +register. On those machines it is easy to find the method's data segment, which starts +at the PV growing down to smaller addresses like a stack. For the IA32 +and AMD64 architecture we had to implement a \textit{method tree} +which contains the start \textit{program counter} (PC) and the end PC +for every single JAVA method compiled in CACAO, to find for any +exception PC the corresponding method and thus the PV. We need the +data segment for the method's exception table (for a detailed +description see section ``Exception handling''). + +We use \texttt{SIGSEGV} for \textit{hardware null-pointer checking}, +so we can handle this common exception as fast as possible in +CACAO. The signal handler creates a +\texttt{java.lang.NullPointerException}. + +\texttt{SIGFPE} is used to catch integer division by zero exceptions +in hardware. The signal handler generates a +\texttt{java.lang.ArithmeticException} with \texttt{/ by zero} as detail +message. + +Both exceptions are handled in hardware by default, but they can also +be caught in software when using CACAO's command-line switch +\texttt{-softnull}. On the RISC ports only the \textit{null-pointer +exception} is checked in software when using this switch, but on IA32 +and AMD64 both conditions are checked, the \texttt{SIGSEGV} and the \texttt{SIGFPE} case. 
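The lookup from an exception PC to the owning method can be sketched in C as follows. This sketch uses a sorted array with binary search as a stand-in for the \textit{method tree}; it is not CACAO's data structure. The type \texttt{method\_range}, the function \texttt{find\_method} and the choice of an exclusive end PC are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one compiled method. */
typedef struct {
    uintptr_t start_pc;  /* first code address of the method */
    uintptr_t end_pc;    /* one past the last code address (assumed exclusive) */
    uintptr_t pv;        /* procedure vector, i.e. start of the data segment */
} method_range;

/* Finds the method whose code range contains pc.
 * tab must be sorted by start_pc and the ranges must not overlap.
 * Returns NULL if pc does not belong to any known method. */
const method_range *find_method(const method_range *tab, size_t n, uintptr_t pc)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (pc < tab[mid].start_pc)
            hi = mid;            /* pc lies before this method */
        else if (pc >= tab[mid].end_pc)
            lo = mid + 1;        /* pc lies after this method */
        else
            return &tab[mid];    /* start_pc <= pc < end_pc */
    }
    return NULL;
}
```

A signal handler would pass the faulting PC to such a lookup and take the \texttt{pv} field of the result to reach the exception table. A balanced tree gives the same logarithmic lookup and is easier to update while methods are still being compiled.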
+ + +\subsection{Related work} + +The AMD64 architecture is a relatively young architecture, released in +April 2003. At the time of writing, the only available 64-bit +operating systems for AMD64 are GNU/Linux (from different +distributors), FreeBSD, NetBSD and OpenBSD. Microsoft Windows is not +available yet, although it was announced to be released in the first +half of 2004. + +The first available 64-bit JVM for the AMD64 architecture was GCC's +GCJ, the GNU Compiler for the Java Programming +Language~\cite{GCJ}. \texttt{gcj} itself is a portable, optimizing, +ahead-of-time compiler for the JAVA Programming Language, which can +compile: + +\begin{itemize} +\item JAVA source code directly to native machine code +\item JAVA source code to JAVA bytecode (class files) +\item JAVA bytecode to native machine code +\end{itemize} + +One part of GCJ is \texttt{gij}, which is the JVM +interpreter. Much of the porting effort for the \textit{GNU Compiler +Collection} to the AMD64 architecture was done by people working at +SUSE~\cite{SUSE}. + +For a long time no AMD64 JIT was available, until Sun~\cite{Sun} released +their AMD64 version of J2SE 1.4.2-rc1 for GNU/Linux by +Blackdown~\cite{Blackdown} in December 2003. At that time our AMD64 +JIT had already been working for months, but we were not able to release +CACAO, because of the overall status of CACAO as a compliant +JVM. The Sun JVM uses the HotSpot Server VM by default; the HotSpot +Client VM is not available for AMD64 at this time. + +The Kaffe~\cite{Wilkinson:97} project has ported its interpreter to the +AMD64 architecture for GNU/Linux, but still has no plans to port +its JIT. -- 2.25.1