\section{X86 code generator}

Porting to the famous x86 platform was more effort than
expected. CACAO was designed to run on RISC machines from ground up,
so the code generation part hat to be adapted. The first approach was
to replace the simple RISC macros with x86 code, but this turned out
to be not successful. So new x86 code generation macros where
written. To be backward compatible, mostly in respect of embedded
systems, all generated code can be run on i386 systems.

Some smaller problems occured since the x86 port was the first 32 bit
target platform, like segmentation faults due to heap
corruption. Another problem was the access to the functions data
segment. Since RISC platforms like ALPHA and MIPS have a procedure
pointer register, for the x86 platform there had to be implemented a
special handling for accesses to the data segment, like
\texttt{PUTSTATIC} and \texttt{GETSTATIC} instructions. The solution
is like the handling of \textit{jump references} or \textit{check cast
references}, which also have to be code patched, when the code and
data segment are relocated. This means, there has to be an extra
\textit{immediate-to-register} move (\texttt{i386\_mov\_imm\_reg()})
before every \texttt{PUT}/\texttt{GETSTATIC} instruction, which moves
the start address of the procedure, and thus the start address of the
data segment, in one of the temporary registers.

Register usage was another problem in porting the CACAO to x86. An x86
processor has 8 genernal purpose registers (GPR), of which one is the
\textit{stack pointer} (SP) and thus it can not be used for arithmetic
instructions. From the remaining 7 registers, in \textit{worst-case
instructions} like \texttt{CHECKCAST} or \texttt{INSTANCEOF}, we need
to reserve 3 temporary registers. So we have 4 registers available.


\subsection{Calling conventions}

Normal calling convention of the x86 processor is passing all function
arguments on the stack. The store size depends on the data type (the
following types reflect the JAVA data types):

\begin{itemize}
 \item \texttt{byte}, \texttt{char}, \texttt{short}, \texttt{int},
       \texttt{float}, \texttt{void} --- 4 bytes
 \item \texttt{long}, \texttt{double} --- 8 bytes
\end{itemize}

We changed this convention in a way, that we are using always 8 bytes
on the stack for each datatype. With this adaptation, it was possible
to use the \textit{stack space allocation algorithm} without any
changes. The drawback of this decision was, that we have to copy all
arguments of a native function call into a new stack frame and we have
a slightly bigger memory footprint.

But calling a native function always means a stack manipulation,
because you have to put the \textit{JNI environment}, and for
\texttt{static} functions the \textit{class pointer}, in front of the
function parameters. So this negligible.

For some \texttt{BUILTIN} functions there had to be written
\texttt{asm\_} counterparts, which copy the 8 byte parameters in their
correct size in a new stack frame. But this only affected
\texttt{BUILTIN} functions with more than 1 function parameter. To be
exactly, 2 functions, namely \texttt{arrayinstanceof} and
\texttt{newarray}. So this is not a big speed impact.


\subsection{Register allocator}

As mentioned before, in  \textit{worst-case situations} there are only
4 integer registers available. We use \texttt{\%ebp}, \texttt{\%esi},
\texttt{\%edi} as callee saved registers (which are callee saved
registers in the x86 ABI) and \texttt{\%ebx} as scratch register
(which is also a callee saved register in the x86 ABI, but we need
some scratch registers). So we have a lack of scratch registers. But
for most ICMD instructions, we do not need all, or sometimes none, of
the temporary registers.

This fact we use in the \texttt{analyse\_stack()} pass. We try to use
\texttt{\%edx} (which is \texttt{REG\_ITMP3}) and \texttt{\%ecx} (which
is \texttt{REG\_ITMP2}) as scratch registers for the register allocator
if certain ICMD instructions are not used in the compiled method.

So, for \textit{best-case situations} CACAO has 3 \textit{callee
saved} and 3 \textit{scratch} registers.

This analysis should be changed from \textit{method level} to
\textit{basic-block level}, so CACAO could emit better code on x86.


\subsection{Long arithmetic}

Unlike the PowerPC port, we cannot put \texttt{long}'s in 2 32 bit
integer registers, since we have to little of them. Maybe this could
bring a speedup, if the register allocator would be more intelligent
or in leaf functions which have only \texttt{long} variables. But this
is not implemented yet. So, the current approach is to store all
\texttt{long}'s in memory, this means they are always spilled.

Nearly all \texttt{long} instructions are inline, except 2 of them:
\texttt{LDIV} and \texttt{LREM}. These 2 are computed via
\texttt{BUILTIN} calls.

All of the \texttt{long} instructions operate on 64 bit, even if it is
not necessary. The dependency information that would be needed to just
operate on the lower or upper half of the \texttt{long} variable, is
not generated by CACAO.


\subsection{Floating point arithmetic}

Since the i386, with it's i387 counterpart or the i486, the x86
processor has an \textit{floating point unit} (FPU). This FPU is
implemented as a stack with 8 elements (see table \ref{FPUStack}).

\begin{table*}
\begin{center}
\begin{tabular}[b]{|l|l|}
\hline 
  & FPU Data Register Stack \\ \hline
0 & TOS (Top Of Stack) \\ \hline
1 & \\ \hline
2 & \\ \hline
3 & \\ \hline
4 & \\ \hline
5 & \\ \hline
6 & \\ \hline
7 & \\ \hline
\end{tabular}
\caption{x87 FPU Data Register Stack}
\label{FPUStack}
\end{center}
\end{table*}

This stack is designed to wrap around if values are loaded to the
\textit{top of stack} (TOS). For this reason, it has a special register which
points to the TOS. This pointer is increased if a load instruction to
the TOS is executed and decreased for a store from the TOS.

The x86 FPU can handle 3 float data types:

\begin{itemize}
 \item single-precision (32 bit)
 \item double-precision (64 bit)
 \item double extended-precision (80 bit)
\end{itemize}

The FPU has a 16 bit \textit{control register} which has a
\textit{precision control field} (PC) and a \textit{rounding control
field} (RC), each of 2 bit length (see table \ref{PCField} and
\ref{RCField}).

\begin{table*}
\begin{center}
\begin{tabular}[b]{|l|c|}
\hline 
Precision                          & PC Field \\ \hline
single-precision (32 bit)          & 00B      \\ \hline
reserved                           & 01B      \\ \hline
double-precision (64 bit)          & 10B      \\ \hline
double extended-precision (80 bit) & 11B      \\ \hline
\end{tabular}
\caption{Precision Control Field (PC)}
\label{PCField}
\end{center}
\end{table*}

\begin{table*}
\begin{center}
\begin{tabular}[b]{|l|c|}
\hline 
Rounding Mode                 & RC Field \\ \hline
round to nearest (even)       & 00B      \\ \hline
round down (toward -$\infty$) & 01B      \\ \hline
round up (toward +$\infty$)   & 10B      \\ \hline
round toward zero (truncate)  & 11B      \\ \hline
\end{tabular}
\caption{Rounding Control Field (RC)}
\label{RCField}
\end{center}
\end{table*}

The internal data type used by the FPU is always the \textit{double
extended-precision} (80 bit) format. Therefore implementing a IEEE 754
compliant floating point code on x86 processors is not trivial.

Correct rounding to \textit{single-precision} or
\textit{double-precision} is only done if the value is stored into
memory. This means for certain instructions, like \texttt{FMUL} or
\texttt{FDIV}, a special technique called \textit{store-reload}, has
to be implemented. This technique is in fact very simple but takes two
memory accesses more for this instruction.

For single-precision floats the \textit{store-reload} code could looks
like this:

\begin{verbatim}
i386_fstps_membase(REG_SP, 0);    /* store single-precision float to stack  */
i386_flds_membase(REG_SP, 0);     /* load single-precision float from stack */
\end{verbatim}

Another technique which has to be implemented to be IEEE 754
compliant, is \textit{precision mode switching}. Mode switching on the
x86 processor is made with the \texttt{fldcw} (load control word)
instruction. A \texttt{fldcw} instruction has a quite large overhead,
so frequently mode switches should be avoided. For this technique
there are two different approaches:

\begin{itemize}
 \item \textbf{Mode switch on float arithmetic} --- switch the FPU on
 initialization in one precision mode, mostly \textit{double-precision
 mode} because \texttt{double} arithmetic is more common. With this
 setting \texttt{doubles} are calculated correctly. To handle
 \texttt{floats} in this approach, the FPU has to be switched into
 \textit{single-precision mode} before each \texttt{float} instruction
 and switched back afterwards. Needless to say, that this is only
 useful, if \texttt{float} arithmetic is sparse. For a simple
 calculation like

 \begin{verbatim}
        float f = 1.234F;
        float g = 2.345F;
        float e = f * f + g * g;
 \end{verbatim}        

 the generated ICMD's look like this (with comments where precision
 mode switches take place):

 \begin{verbatim}
        ...
        <fpu in double-precision mode>
        FLOAD         1
        FLOAD         1
        <switch fpu to single-precision mode>
        FMUL         
        <switch fpu to double-precision mode>
        FLOAD         2
        FLOAD         2
        <switch fpu to single-precision mode>
        FMUL         
        <switch fpu to double-precision mode>
        <switch fpu to single-precision mode>
        FADD         
        <switch fpu to double-precision mode>
        FSTORE        3
        ...
 \end{verbatim}

 For this corner case situation the mode switches are extrem, but the
 example should demonstrate how this technique works.

 \item \textbf{Mode switch on float data type change} --- switch the
 FPU on initialization in on precision mode, like before, in
 \textit{double-precision mode}. But the difference on this approach
 is, that the precision mode is only switched if the float data type
 is changed. That means if your calculation switches from
 \texttt{double} arithmetic to \texttt{float} or backwards. This
 technique makes much sense due to the fact that there are always a
 bunch of instructions of the same data type in one row in a normal
 program. Now the same example as before with this approach:

 \begin{verbatim}
        ...
        <fpu in double-precision mode>
        FLOAD         1
        FLOAD         1
        <switch fpu to single-precision mode>
        FMUL         
        FLOAD         2
        FLOAD         2
        FMUL         
        FADD         
        FSTORE        3
        ...
 \end{verbatim}

 After this code sequence the FPU is in \textit{single-precision mode}
 and if a function return would occur, the caller function would not
 know in which mode the FPU is currently. One solution would be to
 reset the FPU to \textit{double-precision} on a function return, if
 the actual mode is \textit{single-precision}.
\end{itemize}

TODO: description of stack-to-register mapping

None of these techniques worked prefectly for CACAO, so the decision
was to store all \texttt{float}'s and \texttt{double}'s into memory,
so the rounding was correct. This behaviour will be changed, when we
have found a solution that works.