From: twisti
Date: Fri, 30 Jul 2004 21:56:24 +0000 (+0000)
Subject: C-x-s
X-Git-Url: http://wien.tomnetworks.com/gitweb/?p=cacao.git;a=commitdiff_plain;h=e772170b31aaff1353f472ffafdb9ffdee4b2f86

C-x-s
---

diff --git a/doc/handbook/x86_64.tex b/doc/handbook/x86_64.tex
index c96cec673..0eb684fd5 100644
--- a/doc/handbook/x86_64.tex
+++ b/doc/handbook/x86_64.tex
@@ -5,7 +5,7 @@
 The AMD64 architecture, formerly know as x86\_64, is an improvement of
 the Intel IA32 architecture by AMD -- Advanced Micro Devices. The
 extraordinary success of the IA32 architecture and the upcoming memory
-address space problem on IA32 high end servers, led to a special
+address space problem on IA32 high-end servers, led to a special
 design decision. Unlike Intel, with it's completely new designed IA64
 architecture, AMD decided to extend the IA32 instruction set with new
 64-bit instructions.
@@ -48,10 +48,10 @@ faster in 64-bit code.
 Another crucial pointer to make the AMD64 architecture faster than
 IA32, is the limited number of registers. Any IA32 architecture, from
 the early \textit{i386} to the newest generation of \textit{Intel
-Pentium 4} or \textit{AMD Athlon}, has only 8 general purpose
+Pentium 4} or \textit{AMD Athlon}, has only 8 general-purpose
 registers. With the \textit{REX prefix}, AMD has the ability to
 increase the amount of accessible registers by 1 bit. This means in
-\textit{64-bit Mode} 16 general purpose registers are available. The
+\textit{64-bit Mode} 16 general-purpose registers are available. The
 value of a \textit{REX prefix} is in the range \texttt{40h} through
 \texttt{4Fh}, depending on the particular bits used (see table
 \ref{REX}).
@@ -86,28 +86,28 @@ instructions can handle both \textit{memory operands} and
 implementation of the IA32 ICMDs.
 
 Much better code generation can be achieved in the area of
-\textit{long arithmetic}. Since all 16 general purpose registers can
+\textit{long arithmetic}.
Since all 16 general-purpose registers can
 hold 64-bit integer values, there is no need for special long
 handling, like on IA32 were we stored all long varibales in memory. A
-simple \texttt{ICMD\_LADD} was on IA32 (best case shown for AMD64 ---
-\texttt{src->regoff == iptr->dst->regoff}):
+simple \texttt{ICMD\_LADD} on IA32 looks like this (best case for AMD64,
+\texttt{src->regoff == iptr->dst->regoff}):
 
 \begin{verbatim}
-i386_mov_membase_reg(REG_SP, src->prev->regoff * 8, REG_ITMP1);
-i386_alu_reg_membase(I386_ADD, REG_ITMP1, REG_SP, iptr->dst->regoff * 8);
-i386_mov_membase_reg(REG_SP, src->prev->regoff * 8 + 4, REG_ITMP1);
-i386_alu_reg_membase(I386_ADC, REG_ITMP1, REG_SP, iptr->dst->regoff * 8 + 4);
+    i386_mov_membase_reg(REG_SP, src->prev->regoff * 8, REG_ITMP1);
+    i386_alu_reg_membase(I386_ADD, REG_ITMP1, REG_SP, iptr->dst->regoff * 8);
+    i386_mov_membase_reg(REG_SP, src->prev->regoff * 8 + 4, REG_ITMP1);
+    i386_alu_reg_membase(I386_ADC, REG_ITMP1, REG_SP, iptr->dst->regoff * 8 + 4);
 \end{verbatim}
 
 First memory operand is added to second memory operand which is at the
-same stack location as the destination operand. This are four
-instructions executed for one addition. If we would use registers for
-long variables we could get a \textit{best-case} of two instructions,
-namely \textit{add} followed by a \textit{adc}. On AMD64 we can
-generate one instruction for this addition:
+same stack location as the destination operand. This means that four
+instructions are executed for one long addition. If we used
+registers for long variables, we could get a \textit{best case} of two
+instructions, namely \textit{add} followed by \textit{adc}.
On
+AMD64 we can generate one instruction for this addition:
 
 \begin{verbatim}
-x86_64_alu_reg_reg(X86_64_ADD, src->prev->regoff, iptr->dst->regoff);
+    x86_64_alu_reg_reg(X86_64_ADD, src->prev->regoff, iptr->dst->regoff);
 \end{verbatim}
 
 This means, the AMD64 port is \textit{four-times} faster than the IA32
@@ -121,8 +121,309 @@ mode without escaping --- with the mentioned \textit{REX prefix}
 byte. In CACAO we use a macro called
 
 \begin{verbatim}
-x86_64_emit_rex(size,reg,index,rm)
+    x86_64_emit_rex(size,reg,index,rm)
 \end{verbatim}
 
 The names of the arguments are respective to their use in the
 \textit{REX prefix} (see table \ref{REX}).
+
+The AMD64 architecture also introduces a new addressing method called
+\textit{RIP-relative addressing}. In 64-bit mode, addressing relative
+to the contents of the 64-bit instruction pointer (program counter)
+--- called \textit{RIP-relative addressing} or \textit{PC-relative
+addressing} --- is implemented for certain instructions. In these
+instructions, the effective address is formed by adding the
+displacement to the 64-bit \texttt{RIP} of the next instruction. With
+this addressing mode, we can replace the IA32 method of addressing
+data in the method's data segment. For example, in the
+\texttt{ICMD\_PUTSTATIC} instruction, the IA32 code
+
+\begin{verbatim}
+    a = dseg_addaddress(&(((fieldinfo *) iptr->val.a)->value));
+    i386_mov_imm_reg(0, REG_ITMP2);
+    dseg_adddata(mcodeptr);
+    i386_mov_membase_reg(REG_ITMP2, a, REG_ITMP2);
+\end{verbatim}
+
+can be replaced with the new \textit{RIP-relative addressing} code
+
+\begin{verbatim}
+    a = dseg_addaddress(&(((fieldinfo *) iptr->val.a)->value));
+    x86_64_mov_membase_reg(RIP, -(((s8) mcodeptr + 7) - (s8) mcodebase) + a, REG_ITMP2);
+\end{verbatim}
+
+So we can save one instruction on the read or write of a static
+variable. The additional offset of \texttt{+ 7} is the code size of
+the instruction itself.
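The displacement arithmetic above can be sketched in standalone C. This is only an illustration, not CACAO code: the helper name \texttt{rip\_relative\_disp} and its parameters are hypothetical, and the data segment offset is assumed to be relative to the start of the method's code, as the use of \texttt{mcodebase} suggests. Only the rule itself (target address minus the address of the next instruction) is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only (not a CACAO function).
 *
 * code_base   : address of the first code byte of the method
 * insn_addr   : address where the RIP-relative instruction starts
 * insn_len    : encoded length of that instruction in bytes
 * dseg_offset : offset of the datum relative to code_base (assumed to be
 *               negative, i.e. the data segment precedes the code)
 */
static int32_t rip_relative_disp(int64_t code_base, int64_t insn_addr,
                                 int insn_len, int32_t dseg_offset)
{
    int64_t next_rip = insn_addr + insn_len;    /* RIP of the next instruction */
    int64_t target   = code_base + dseg_offset; /* address of the datum */
    int64_t disp     = target - next_rip;

    /* the displacement field is a signed 32-bit value */
    assert(disp >= INT32_MIN && disp <= INT32_MAX);

    return (int32_t) disp;
}
```

With a code base of \texttt{0x1000}, an instruction starting at \texttt{0x1020} that is 7 bytes long, and a data segment offset of \texttt{-16}, the helper returns \texttt{-55}. This is the same value as the expression \texttt{-(((s8) mcodeptr + 7) - (s8) mcodebase) + a} yields for these inputs.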
The fictitious register \texttt{RIP} is defined
+with
+
+\begin{verbatim}
+    #define RIP    -1
+\end{verbatim}
+
+Thus we can determine the special \textit{RIP-relative addressing}
+mode in the code generating macro
+\texttt{x86\_64\_emit\_membase(basereg,disp,dreg)} with
+
+\begin{verbatim}
+    if ((basereg) == RIP) {
+        x86_64_address_byte(0,(dreg),RBP);
+        x86_64_emit_imm32((disp));
+        break;
+    }
+\end{verbatim}
+
+and generate the \textit{RIP-relative addressing} code. As shown in
+the code sample, it is a special encoding of the \textit{address byte}
+with the \texttt{mod} field set to zero and \texttt{RBP}
+(\texttt{\%rbp}) as base register.
+
+
+\subsection{Constant handling}
+
+As on IA32, the AMD64 code generator can use \textit{immediate move}
+instructions to load integer constants into their destination
+registers. The 64-bit extensions of the AMD64 architecture can also
+load 64-bit immediates inline. So loading a \texttt{long} constant
+takes just one instruction, instead of two instructions on the IA32
+architecture. Of course the AMD64 code generator uses the \textit{move
+long} (\texttt{movl}) instruction to load 32-bit \texttt{int} constants
+to minimize code size. This instruction clears the upper 32 bits of the
+destination register.
+
+\begin{verbatim}
+    case ICMD_ICONST:
+        ...
+        x86_64_movl_imm_reg(cd, iptr->val.i, d);
+        ...
+
+    case ICMD_LCONST:
+        ...
+        x86_64_mov_imm_reg(cd, iptr->val.l, d);
+        ...
+\end{verbatim}
+
+\texttt{float} and \texttt{double} values are loaded from the data
+segment via the \textit{move doubleword or quadword} (\texttt{movd})
+instruction with \textit{RIP-relative addressing}.
+
+
+\subsection{Calling conventions}
+
+The AMD64 calling conventions are described here \ref{}. CACAO uses a
+subset of this calling convention to cover its requirements. CACAO
+just needs to pass the JAVA data types, no other special features. The
+sizes of the JAVA data types on the AMD64 port are shown in table
+\ref{javadatatypesizes}.
+
+\begin{table}
+\begin{center}
+\begin{tabular}[b]{|l|c|}
+\hline
+JAVA Data Type & Bytes \\ \hline
+\texttt{boolean} & 1 \\
+\texttt{byte} & \\
+\texttt{char} & \\ \hline
+\texttt{short} & 2 \\ \hline
+\texttt{int} & 4 \\
+\texttt{float} & \\ \hline
+\texttt{long} & 8 \\
+\texttt{double} & \\
+\texttt{void} & \\ \hline
+\end{tabular}
+\caption{JAVA Data Type sizes on AMD64}
+\label{javadatatypesizes}
+\end{center}
+\end{table}
+
+\subsubsection{Integer arguments}
+
+The AMD64 architecture has 6 integer argument registers. The order of
+the argument registers is shown in table
+\ref{amd64integerargumentregisters}.
+
+\begin{table}
+\begin{center}
+\begin{tabular}[b]{|l|l|}
+\hline
+Register & Argument Register \\ \hline
+\texttt{\%rdi} & 1$^{\rm st}$ \\ \hline
+\texttt{\%rsi} & 2$^{\rm nd}$ \\ \hline
+\texttt{\%rdx} & 3$^{\rm rd}$ \\ \hline
+\texttt{\%rcx} & 4$^{\rm th}$ \\ \hline
+\texttt{\%r8} & 5$^{\rm th}$ \\ \hline
+\texttt{\%r9} & 6$^{\rm th}$ \\ \hline
+\end{tabular}
+\caption{AMD64 Integer Argument Registers}
+\label{amd64integerargumentregisters}
+\end{center}
+\end{table}
+
+As on RISC machines, the remaining integer arguments are passed on the
+stack. Each integer argument, regardless of its size, uses 8 bytes
+on the stack.
+
+Integer return values of any size are stored in \texttt{REG\_RESULT},
+which is \texttt{\%rax}.
+
+\subsubsection{Floating point arguments}
+
+The AMD64 architecture has 8 floating point argument registers, namely
+\texttt{\%xmm0} through \texttt{\%xmm7}. \texttt{\%xmm} registers are
+128-bit wide floating point registers on which SSE and SSE2
+instructions can operate. Remaining floating point arguments are
+passed, like integer arguments, on the stack using 8 bytes per
+argument.
+
+Floating point return values are stored in \texttt{\%xmm0}.
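The integer and floating point argument rules above can be sketched as a small standalone C routine. All names in it (\texttt{arg\_kind}, \texttt{arg\_slot}, \texttt{assign\_args}) are hypothetical and are not CACAO identifiers. The sketch assumes that integer and floating point registers are counted independently of each other and that stack slots are handed out in argument order; the register counts and the 8 bytes per stack argument are taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for this sketch; they are not CACAO identifiers. */
enum arg_kind { ARG_INT, ARG_FLT };   /* integer-like vs. float/double */
enum arg_loc  { LOC_INT_REG, LOC_FLT_REG, LOC_STACK };

struct arg_slot {
    enum arg_loc loc;
    int          index;   /* 0-based register number or stack slot number */
};

#define INT_ARG_REGS    6   /* %rdi, %rsi, %rdx, %rcx, %r8, %r9 */
#define FLT_ARG_REGS    8   /* %xmm0 through %xmm7 */
#define STACK_SLOT_SIZE 8   /* bytes per stack argument */

/*
 * Assigns each of the n arguments to a register or a stack slot and
 * returns the number of stack bytes needed for the remaining arguments.
 * Assumption: both register classes use their own counter.
 */
static size_t assign_args(const enum arg_kind *kinds, size_t n,
                          struct arg_slot *out)
{
    int next_int = 0, next_flt = 0, next_stack = 0;
    size_t i;

    assert(kinds != NULL && out != NULL);

    for (i = 0; i < n; i++) {
        if (kinds[i] == ARG_INT && next_int < INT_ARG_REGS) {
            out[i].loc   = LOC_INT_REG;
            out[i].index = next_int++;
        } else if (kinds[i] == ARG_FLT && next_flt < FLT_ARG_REGS) {
            out[i].loc   = LOC_FLT_REG;
            out[i].index = next_flt++;
        } else {
            /* no register of the matching class is left */
            out[i].loc   = LOC_STACK;
            out[i].index = next_stack++;
        }
    }

    return (size_t) next_stack * STACK_SLOT_SIZE;
}
```

For seven integer arguments followed by one floating point argument, the first six integers land in the integer argument registers, the seventh goes to stack slot 0 (8 bytes of stack in total), and the floating point argument still gets \texttt{\%xmm0}.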
+
+As shown, the calling conventions for the AMD64 architecture are
+nearly the same as for RISC machines, which allows CACAO's
+\textit{register allocator algorithm} and \textit{stack space
+allocation algorithm} to be used without any changes.
+
+Calling native functions means register moves and stack copying like
+on RISC machines. This depends on the count of the arguments used for
+the called native function. For non-static native functions the first
+integer argument has to be the JNI environment variable, so any
+arguments passed need to be shifted by one register, which can include
+creating a new stack frame and storing some arguments on the
+stack. Additionally for static native functions the class pointer of
+the current object's class is passed in the 2$^{\rm nd}$ integer
+argument register. This means that the integer argument registers need
+to be shifted by two registers.
+
+One difference of the calling convention to RISC type machines, like
+Alpha or MIPS, is the usage of integer and floating point argument
+registers with mixed integer and floating point arguments. Assume a
+function like this:
+
+\begin{verbatim}
+    void sub(int a, float b, long c, double d);
+\end{verbatim}
+
+On a RISC machine, like Alpha, we would have an argument register
+usage like in figure \ref{alphaargumentregisterusage}. \texttt{a?}
+represent integer argument registers and \texttt{fa?} floating point
+argument registers.
+
+\begin{figure}[htb]
+\begin{center}
+\setlength{\unitlength}{1mm}
+\begin{picture}(60,35)
+\thicklines
+\put(0,15){\framebox(15,10){a0 = a}}
+\put(30,15){\framebox(15,10){a2 = c}}
+\put(15,5){\framebox(15,10){fa1 = b}}
+\put(45,5){\framebox(15,10){fa3 = d}}
+\put(0,0){\line(0,1){30}}
+\end{picture}
+\caption{Alpha argument register usage for \texttt{void sub(int a, float b, long c, double d);}}
+\label{alphaargumentregisterusage}
+\end{center}
+\end{figure}
+
+On AMD64 the same function call would look as in figure
+\ref{amd64argumentregisterusage}.
+
+\begin{figure}[htb]
+\begin{center}
+\setlength{\unitlength}{1mm}
+\begin{picture}(60,35)
+\thicklines
+\put(0,15){\framebox(15,10){a0 = a}}
+\put(15,15){\framebox(15,10){a1 = c}}
+\put(0,5){\framebox(15,10){fa0 = b}}
+\put(15,5){\framebox(15,10){fa1 = d}}
+\put(0,0){\line(0,1){30}}
+\end{picture}
+\caption{AMD64 argument register usage for \texttt{void sub(int a, float b, long c, double d);}}
+\label{amd64argumentregisterusage}
+\end{center}
+\end{figure}
+
+The register assignment would be \texttt{a0 = \%rdi}, \texttt{a1 =
+\%rsi}, \texttt{fa0 = \%xmm0} and \texttt{fa1 = \%xmm1}. This special
+usage of the argument registers required some changes in the argument
+register allocation algorithm for function calls during stack
+analysis and some changes in the code generator itself.
+
+
+\subsection{Register allocator}
+
+As mentioned in the introduction, the AMD64 architecture has 16
+general-purpose registers and 16 floating-point registers. One
+general-purpose register is reserved for the \textit{stack pointer}
+--- namely \texttt{\%rsp} --- and thus cannot be used for arithmetic
+instructions. The register usage in CACAO is shown in table
+\ref{amd64registerusage}.
+
+\begin{table}
+\begin{center}
+\begin{tabular}{l|l|l}
+Register & Usage & Callee-saved \\ \hline
+\texttt{\%rax} & return register, reserved for code generator & no \\
+\texttt{\%rcx} & 4$^{\rm th}$ argument register & no \\
+\texttt{\%rdx} & 3$^{\rm rd}$ argument register & no \\
+\texttt{\%rbx} & temporary register & no \\
+\texttt{\%rsp} & stack pointer & yes \\
+\texttt{\%rbp} & callee-saved register & yes \\
+\texttt{\%rsi} & 2$^{\rm nd}$ argument register & no \\
+\texttt{\%rdi} & 1$^{\rm st}$ argument register & no \\
+\texttt{\%r8} & 5$^{\rm th}$ argument register & no \\
+\texttt{\%r9} & 6$^{\rm th}$ argument register & no \\
+\texttt{\%r10}-\texttt{\%r11} & reserved for code generator & no \\
+\texttt{\%r12}-\texttt{\%r15} & callee-saved register & yes \\
+\texttt{\%xmm0}-\texttt{\%xmm7} & argument registers & no \\
+\texttt{\%xmm8}-\texttt{\%xmm15} & temporary registers & no \\
+\end{tabular}
+\caption{AMD64 Register usage in CACAO}
+\label{amd64registerusage}
+\end{center}
+\end{table}
+
+There is only one change to the original AMD64 \textit{application
+binary interface} --- ABI. CACAO uses \texttt{\%rbx} as a temporary
+register, while the AMD64 ABI uses the \texttt{\%rbx} register as a
+callee-saved register.
+
+In adapting the register allocator there was a problem concerning the
+order of the integer argument registers. The order of the first four
+argument registers is inverted. This fact can be seen in table
+\ref{amd64registerusage} which is ordered ascending by the processor's
+internal register numbers. That means the ascending search algorithm
+for argument registers in the register allocator would allocate the
+first four argument registers in the wrong direction. So there is a
+little hack implemented in CACAO's register allocator to handle this
+fact. After searching the register definition array for the argument
+registers, the first four argument registers are interchanged in their
+array.
This is done by a simple code sequence (taken from
+\texttt{jit/reg.inc}):
+
+\begin{verbatim}
+    /*
+     * on x86_64 the argument registers are not in ascending order
+     * a00 (%rdi) <-> a03 (%rcx) and a01 (%rsi) <-> a02 (%rdx)
+     */
+    n = r->argintregs[3];
+    r->argintregs[3] = r->argintregs[0];
+    r->argintregs[0] = n;
+
+    n = r->argintregs[2];
+    r->argintregs[2] = r->argintregs[1];
+    r->argintregs[1] = n;
+\end{verbatim}
+
+
+\subsection{Floating point arithmetic}
+
+The AMD64 architecture has implemented two sets of floating point instructions:
+
+\begin{itemize}
+\item old i387 (x87)
+\item SSE and SSE2
+\end{itemize}
+
+The x87 \textit{floating point unit} (FPU) implementation is
+completely compatible with the IA32 implementation with all its
+advantages and drawbacks, like the FPU stack.