\chapter{The Just-In-Time Compiler}

\section{Introduction}

A Java Virtual Machine can implement four different approaches to executing Java byte code:

\begin{itemize}
\item Interpreter
\item Ahead-Of-Time Compiler
\item Just-In-Time Compiler
\item Mixed Mode
\end{itemize}

An \textit{Interpreter} interprets every single virtual machine instruction in the language the Java Virtual Machine is written in, usually C. This makes an interpreter based Java Virtual Machine easily portable, but execution is very slow: up to ten times slower than current Just-In-Time Compilers or comparable code written in C.

An \textit{Ahead-Of-Time Compiler} compiles every Java method of a class when the class is loaded. This reduces the compiler overhead, since all methods of a class are compiled in one step and a called method can be executed immediately at runtime. The drawback of this approach is that every single method is compiled, even if it is never called. This can cost a serious amount of memory and time, since the Java libraries contain a lot of methods.

A \textit{Just-In-Time Compiler} is the solution to this memory and compilation time problem. It compiles a method only when the method is called for the first time. The drawback is that the first execution of a method is deferred until its compilation has finished, but a Just-In-Time Compiler avoids spending compilation time on methods that are never called.

The \textit{Mixed Mode} is a combination of an Interpreter and a Just-In-Time Compiler. Normally the code is interpreted, but code parts which are executed frequently are compiled to native machine code with a Just-In-Time Compiler. This technique is used by Sun's and IBM's JVMs.

CACAO only implements a \textit{Just-In-Time Compiler} and has no Interpreter. The main goal of CACAO was to build a fast executing Java Virtual Machine with short compilation times, so the CACAO development team decided to implement a fast compiling Just-In-Time Compiler only. Thus every Java method that is executed is compiled to native machine code.

The following sections describe some basics of byte code to machine code compilation and how the CACAO Just-In-Time Compiler works in detail.

\section{The Java Virtual Machine}

The JavaVM is a typed stack architecture \cite{javavm96}. There are different instructions for integer, long integer, floating point and address types. Byte and character types have only special memory access instructions and are treated as integers for arithmetic operations. The main instruction set consists of arithmetic/logical and load/store/constant instructions. There are special instructions for array access and for accessing the fields of objects (memory access), for method invocation, and for type checking. A JavaVM has to check the program for type correctness and executes only correct programs. The following examples show some important JavaVM instructions and how a Java program is represented by these instructions.
\begin{verbatim}
iload n    ; load contents of local variable n
istore n   ; store stack top in local variable n
imul       ; product of 2 topmost stack elements
isub       ; difference of 2 topmost stack elements
\end{verbatim}

The Java assignment statement {\tt a = b - c * d} is translated into

\begin{verbatim}
iload b    ; load contents of variable b
iload c    ; load contents of variable c
iload d    ; load contents of variable d
imul       ; compute c * d
isub       ; compute b - (c * d)
istore a   ; store stack top in variable a
\end{verbatim}

Accessing the fields of objects is handled by the instructions {\tt getfield} and {\tt putfield}. The instruction {\tt getfield} expects an object reference on the stack and has an index into the constant pool as an operand. The index into the constant pool must be a reference to a pair containing the class name and a field name. The types of the classes referenced by the constant pool index and by the object reference must be compatible, a fact which is usually checked statically at load time. The object reference has to be different from the {\tt null} pointer, a fact which usually must be checked at run time.

Array access is handled by the {\tt aload} and {\tt astore} instructions. Separate versions of these instructions exist for each of the basic types ({\tt byte}, {\tt int}, {\tt float}, {\tt ref}, etc.). The {\tt aload} instruction expects a reference to an array and an index (of type {\tt int}) on the stack. The array reference must not be the {\tt null} pointer. The index must be greater than or equal to zero and less than the array length.

There are special instructions for method invocation. Each method has its own virtual stack and an area for local variables. After the method invocation, the stack of the caller is empty and the arguments are copied into the first local variables of the called method. After execution of a {\tt return} instruction, the called method returns to its caller. If the called method is a function, it pops the return value from its own stack and pushes it onto the stack of the caller.

The {\tt instanceof} and {\tt checkcast} instructions are used for subtype testing. Both expect a reference to an object on the stack and have an index into the constant pool as operand. The index must reference a class, array or interface type. The two instructions differ in their result and in their behavior if the object reference is {\tt null}.
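To make the run-time checks for array access concrete, the following C fragment sketches what an {\tt iaload} conceptually has to do before the actual memory access. This is only an illustration: the array layout and the exception helper functions are assumptions, not CACAO code.

\begin{verbatim}
#include <stdint.h>
#include <stdlib.h>

/* Assumed layout of an int array object; real layouts differ. */
typedef struct {
    int32_t length;    /* number of elements            */
    int32_t data[];    /* the array elements themselves */
} java_intarray;

/* Hypothetical exception helpers, for illustration only. */
extern void throw_nullpointerexception(void);
extern void throw_arrayindexoutofboundsexception(void);

/* Conceptual behavior of iaload: null check, bounds check, load. */
int32_t iaload(java_intarray *a, int32_t index) {
    if (a == NULL)
        throw_nullpointerexception();
    if (index < 0 || index >= a->length)
        throw_arrayindexoutofboundsexception();
    return a->data[index];
}
\end{verbatim}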
\section{Translation of stack code to register code}

The architecture of a RISC (\textit{Reduced Instruction Set Computer}) or CISC (\textit{Complex Instruction Set Computer}) processor is completely different from the stack architecture of the Java Virtual Machine.

RISC processors have large sets of registers. The Alpha architecture has 32 integer registers and 32 floating point registers, each 64 bits wide. They execute arithmetic and logic operations only on values which are held in registers. RISC machines are a load-store architecture: dedicated load and store instructions move data between memory and registers. Local variables of methods usually reside in registers and are saved to memory only during a method call or if there are too few registers. The use of registers for parameter passing is defined by calling conventions.

CISC processors have a relatively small set of registers, like the IA32 architecture with 8 integer general purpose registers or the AMD64 architecture with 16 integer general purpose registers. But, as the name implies, CISC processors have a large and complex machine instruction set. Unlike the load-store architecture of RISC machines, CISC instructions can operate on operands residing in registers and in memory locations. Local variables of methods should reside in registers, but due to the limited number of registers this is rarely possible, and most local variables are stored on the stack. Detailed information on the IA32 and AMD64 architectures can be found in sections \ref{sectionia32codegenerator} and \ref{sectionamd64codegenerator} respectively.

\subsection{Machine code translation examples}

The previous example expression {\tt a = b - c * d} would be translated by an optimizing C compiler into the following two Alpha instructions (the variables {\tt a}, {\tt b}, {\tt c} and {\tt d} reside in registers):

\begin{verbatim}
MULL c,d,tmp0   ; tmp0 = c * d
SUBL b,tmp0,a   ; a = b - tmp0
\end{verbatim}

If JavaVM code is translated to machine code, the stack is eliminated and the stack slots are represented by temporary variables, usually residing in registers. A naive translation of the previous example would result in the following Alpha instructions:

\begin{verbatim}
MOVE b,t0       ; iload b
MOVE c,t1       ; iload c
MOVE d,t2       ; iload d
MULL t1,t2,t1   ; imul
SUBL t0,t1,t0   ; isub
MOVE t0,a       ; istore a
\end{verbatim}

The problems in translating JavaVM code to machine code are primarily the elimination of the unnecessary copy instructions and finding an efficient register allocation algorithm. A common but expensive technique is to do the naive translation and use an additional pass for copy elimination and coalescing.

\section{The translation algorithm}

The new translation algorithm gets by with three passes. The first pass determines basic blocks and builds a representation of the JavaVM instructions which is faster to decode. The second pass analyses the stack and generates a static stack structure. During stack analysis variable dependencies are tracked and register requirements are computed. In the final pass, register allocation of temporary registers is combined with machine code generation.

The new compiler computes the exact number of objects needed, or an upper bound, and allocates the memory for the necessary temporary data structures in three big blocks: the basic block array, the instruction array and the stack array. Eliminating all the doubly linked lists also reduced the memory requirements by a factor of five.

\subsection{Basic block determination}

The first pass scans the JavaVM instructions, determines the basic blocks and generates an array of instructions which has fixed size and is easier to decode in the following passes. Each instruction contains the opcode, two operands and a pointer to the static stack structure after the instruction (see the next sections). The different opcodes of JavaVM instructions which fold operands into the opcode are represented by just one opcode in the instruction array.

\subsection{Basic block interfacing convention}

The handling of control flow joins was quite complicated in the old compiler. We therefore introduced a fixed interface at basic block boundaries. Every stack slot at a basic block boundary is assigned a fixed interface register. The stack analysis pass determines the type of the register and whether it has to be saved across method invocations. To enlarge the size of basic blocks, method invocations do not end basic blocks.
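As an illustration of the fixed-size instruction representation built by the first pass, the following C sketch shows what such an instruction array element could look like. The field names are assumptions for illustration and do not match CACAO's actual definitions; the static stack element it points to is described in the next subsection.

\begin{verbatim}
/* Sketch of a fixed-size instruction array element: the opcode
   (with folded operands canonicalized), two operands, and a
   pointer to the static stack state after the instruction. */
typedef struct instruction {
    int                  opc;  /* intermediate command opcode      */
    int                  op1;  /* first operand                    */
    int                  op2;  /* second operand / immediate value */
    struct stackelement *dst;  /* stack after this instruction     */
} instruction;
\end{verbatim}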
To guide our compiler design we did some static analysis on a large application written in Java: the javac compiler and the libraries it uses. Table \ref{tab-1} shows that in more than 93\% of the cases the stack is empty at basic block boundaries and that the maximal stack depth is 6. Given this data it becomes clear that the old join handling did not improve the quality of the machine code.

\begin{table*}
\begin{center}
\begin{tabular}[b]{|l|c|c|c|c|c|c|c|c|}
\hline
stack depth & 0 & 1 & 2 & 3 & 4 & 5 & 6 & $>$6 \\
\hline
occurrences & 7930 & 258 & 136 & 112 & 36 & 8 & 3 & 0 \\
\hline
\end{tabular}
\caption{distribution of stack depth at block boundary}
\label{tab-1}
\end{center}
\end{table*}

\subsection{Copy elimination}

To eliminate unnecessary copies, the loading of values is delayed until the instruction which consumes the value is reached. To compute this information, the run time stack is simulated at compile time. Instead of values, the compile time stack contains the type of the value, whether a local variable was loaded to a stack location, and similar information.

Adl-Tabatabai \cite{Taba+98} used a dynamic stack which is changed at every instruction. A dynamic stack only allows information to flow from earlier instructions to later instructions. We use a static stack structure which enables information flow in both directions.

Figure~\ref{Trans1} shows our instruction and stack representation. An instruction has a reference to the stack before the instruction and to the stack after the instruction. The stack is represented as a linked list. The two stacks can be seen as the source and destination operands of an instruction. In the implementation only the destination stack is stored; the source stack is the destination stack of the previous instruction.

\begin{figure}[htb]
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(25,32)
\put( 0, 0 ){\makebox(10,5){\tt b}}
\put( 5, 2.5){\oval(10,5)}
\put( 5, 9 ){\vector(0,-1){4}}
\put( 0, 9 ){\makebox(10,5){\tt c}}
\put( 5,11.5){\oval(10,5)}
\put( 5,18 ){\vector(0,-1){4}}
\put( 0,18 ){\makebox(10,5){\tt d}}
\put( 5,20.5){\oval(10,5)}
%\put( 5,27 ){\vector(0,-1){4}}
\put(15, 9 ){\makebox(10,5){\tt c*d}}
\put(20,11.5){\oval(10,5)}
\put(20, 9 ){\line(0,-1){2}}
\put( 5, 7 ){\line(1,0){15}}
\put(10,27 ){\framebox(15,5){\tt imul}}
\put(20,27 ){\vector(0,-1){13}}
\put(15,27 ){\line(0,-1){6.5}}
\put(15,20.5){\vector(-1,0){5}}
\end{picture}
\caption{instruction and stack representation}
\label{Trans1}
\end{center}
\end{figure}

This representation can easily be used for copy elimination. Each stack element not only contains the type of the stack slot, but also the number of the local variable it is a copy of, the argument number if it is an argument, and the interface register number if it is an interface slot. Load instructions (pushing the content of a variable onto the stack) and store instructions do not generate a copy machine instruction if the stack slot contains the same local variable. Machine instructions generated for arithmetic operations directly use the local variables as their operands.
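A C sketch of such a static stack element is given below. Again, the field names are illustrative assumptions and not CACAO's actual definitions.

\begin{verbatim}
/* Sketch of a static stack element in the linked stack list. */
typedef struct stackelement {
    int                  type;    /* int, long, float, address   */
    int                  varkind; /* plain temporary, or copy of
                                     a local/argument/interface  */
    int                  varnum;  /* which local, argument or
                                     interface register it is    */
    int                  flags;   /* e.g. must survive a call    */
    struct stackelement *prev;    /* stack slot below this one   */
} stackelement;
\end{verbatim}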
There are some pitfalls with this scheme. Take the example of figure~\ref{Trans2}. The stack bottom contains the local variable {\tt a}. The instruction {\tt istore a} will write a new value for {\tt a} and would make a later use of this stack copy invalid. To avoid this, we have to copy the local variable to a stack variable. An important decision is at which position the copy instruction should be inserted. Since {\tt dup} instructions are frequent in Java programs (around 4\%) and it is possible that a local variable resides in memory, the copy should be done together with the {\tt load} instruction. Since the stack is represented as a linked list, only the destination stack has to be checked for occurrences of the offending variable, and these occurrences are replaced by a stack variable.

\begin{figure}[htb]
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(79,32)
\put( 0,27 ){\framebox(15,5){\tt iload a}}
\put(16,27 ){\framebox(13,5){\tt dup}}
\put(30,27 ){\framebox(17,5){\tt iconst 1}}
\put(48,27 ){\framebox(13,5){\tt iadd}}
\put(62,27 ){\framebox(17,5){\tt istore a}}
\put(10,27 ){\vector(0,-1){22}}
\put( 5,27 ){\vector(0,-1){4}}
\put( 5, 0 ){\makebox(10,5){\tt a}}
\put(10, 2.5){\oval(10,5)}
\put(20,27 ){\line(0,-1){2}}
\put(20,25 ){\line(-1,0){10}}
\put(25,27 ){\vector(0,-1){13}}
\put(20, 9 ){\makebox(10,5){\tt a}}
\put(25,11.5){\oval(10,5)}
\put(25, 9 ){\line(0,-1){2}}
\put(10, 7 ){\line(1,0){63}}
\put(36,27 ){\line(0,-1){2}}
\put(36,25 ){\line(-1,0){11}}
\put(41,27 ){\vector(0,-1){4}}
\put(36,18 ){\makebox(10,5){\tt 1}}
\put(41,20.5){\oval(10,5)}
\put(41,18 ){\line(0,-1){2}}
\put(41,16 ){\line(-1,0){16}}
\put(52,27 ){\line(0,-1){2}}
\put(52,25 ){\line(-1,0){11}}
\put(73,27 ){\line(0,-1){20}}
\put(57,27 ){\vector(0,-1){13}}
\put(52, 9 ){\makebox(10,5){\tt +}}
\put(57,11.5){\oval(10,5)}
\put(57,9 ){\line(0,-1){2}}
\put(68,27 ){\line(0,-1){2}}
\put(68,25 ){\line(-1,0){11}}
\end{picture}
\caption{anti dependence}
\label{Trans2}
\end{center}
\end{figure}

To answer the question of how often this can happen and how expensive the stack search is, we again analysed the {\tt javac} compiler. In more than 98\% of the cases the stack is empty (see table \ref{tab-2}). In only 0.2\% of the cases is the stack depth higher than 1, and the biggest stack depth is 3.

\begin{table}[htb]
\begin{center}
\begin{tabular}[b]{|l|c|c|c|c|c|}
\hline
stack depth & 0 & 1 & 2 & 3 & $>$3 \\
\hline
occurrences & 2167 & 31 & 1 & 3 & 0 \\
\hline
\end{tabular}
\caption{distribution of {\tt store} stack depth}
\label{tab-2}
\end{center}
\end{table}

\begin{table*}
\begin{center}
\begin{tabular}[b]{|l|c|c|c|c|c|c|c|c|c|c|}
\hline
chain length & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & $>$9 \\
\hline
occurrences & 1892 & 62 & 23 & 62 & 30 & 11 & 41 & 9 & 7 & 65 \\
\hline
\end{tabular}
\caption{distribution of creator-store distances}
\label{tab-3}
\end{center}
\end{table*}

To avoid copy instructions when executing a {\tt store}, it is necessary to connect the creation of a value with the {\tt store} which consumes it. In that case a {\tt store} can conflict not only with copies of a local variable which result from {\tt load} instructions before the creator of the value, but also with {\tt load} and {\tt store} instructions which exist between the creation of the value and the {\tt store}. In figure~\ref{Trans3} the {\tt iload a} instruction conflicts with the {\tt istore a} instruction.
\begin{figure}[htb]
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(67,27)
\put( 0,22 ){\framebox(15,5){\tt iadd}}
\put(16,22 ){\framebox(15,5){\tt iload a}}
\put(32,22 ){\framebox(17,5){\tt istore b}}
\put(50,22 ){\framebox(17,5){\tt istore a}}
\put(10,22 ){\vector(0,-1){13}}
\put( 5,22 ){\line(0,-1){2}}
\put( 5,20 ){\line(-1,0){3}}
\put( 2,20 ){\vector(0,-1){20}}
\put(10, 4 ){\line(0,-1){2}}
\put( 5, 4 ){\makebox(10,5){\tt +}}
\put(10, 6.5){\oval(10,5)}
\put(21,22 ){\line(0,-1){2}}
\put(21,20 ){\line(-1,0){11}}
\put(26,22 ){\vector(0,-1){4}}
\put(21,13 ){\makebox(10,5){\tt a}}
\put(26,15.5){\oval(10,5)}
\put(26,13 ){\line(0,-1){2}}
\put(10,11 ){\line(1,0){46}}
\put(38,22 ){\line(0,-1){2}}
\put(38,20 ){\line(-1,0){12}}
\put(43,22 ){\line(0,-1){11}}
\put(56,22 ){\line(0,-1){11}}
\put(61,22 ){\line(0,-1){20}}
\put( 2, 2 ){\line(1,0){59}}
\end{picture}
\caption{anti dependence}
\label{Trans3}
\end{center}
\end{figure}

The anti dependences are detected by checking the stack locations of the previous instructions for conflicts. Since the stack locations are allocated as one big array, just the stack elements which have a higher index than the current stack element have to be checked. Table \ref{tab-3} gives the distribution of the distance between the creation of a value and the corresponding {\tt store}. In 86\% of the cases the distance is one.

The output dependences are checked by storing, for each local variable, the instruction number of the last {\tt store} into it. If a {\tt store} conflicts due to dependences, the creator places the value in a stack register.

Additional dependences arise because of exceptions. The exception mechanism in Java is precise: {\tt store} instructions are not allowed to be executed before an exception raising instruction. This is checked easily by remembering the last instruction which could raise an exception. In methods which contain no exception handler this conflict can be safely ignored, because no exception handler can access these variables.

\subsection{Register allocator}

The current register allocator of CACAO is a very simple, straightforward allocator. It assigns free registers with a \textit{first-come-first-serve} algorithm. This is mostly suitable for RISC architectures with large register sets, like Alpha or MIPS with 32 integer registers and 32 floating-point registers; these are the architectures the current register allocator was designed for.

Basically, the allocation passes of the register allocator are:

\begin{itemize}
\item interface register allocation
\item scratch register allocation
\item local register allocation
\end{itemize}

The \texttt{javac} compiler itself supports this simple \textit{first-come-first-serve} approach CACAO uses: it performs a coloring of the local variables and assigns the same number to variables which are not active at the same time. The stack variables have their live ranges implicitly encoded. When a value is pushed, the live range starts. When a value is popped, the live range ends. Complications arise only with stack manipulation instructions like {\tt dup} and {\tt swap}. We therefore flag the first creation of a stack variable and mark a duplicated one as a copy. The register used for this variable can be reused only after the last copy is popped. During stack analysis, stack variables which have to survive a method invocation are marked. These stack variables and local variables are assigned to callee saved registers. If there are not enough registers available, these variables are allocated in memory.
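A minimal sketch of the idea behind such \textit{first-come-first-serve} allocation is shown below. The data structures and the memory fallback are assumptions for illustration; the real allocator additionally distinguishes register types and callee saved registers.

\begin{verbatim}
/* Sketch: hand out the next free register; fall back to a
   memory slot when the register file is exhausted. */
typedef struct registerpool {
    int freeregs[32];   /* registers still available */
    int nfree;          /* how many of them are left */
    int nmemslots;      /* memory slots used so far  */
} registerpool;

int allocate_slot(registerpool *pool, int *inmemory) {
    if (pool->nfree > 0) {
        *inmemory = 0;
        return pool->freeregs[--pool->nfree];
    }
    *inmemory = 1;      /* spill: allocate the variable in memory */
    return pool->nmemslots++;
}
\end{verbatim}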
Efficient implementation of method invocation is crucial to the performance of Java. Therefore, we preallocate the argument registers and the return value in a similar way as we handle {\tt store} instructions. Input arguments (in Java the input arguments are the first local variables) of leaf methods are preassigned, too, as are input arguments on processors with register windows.

Since CACAO has now also been ported to CISC architectures like IA32 and AMD64, the \textit{first-come-first-serve} register allocator has hit its limits. The results produced for an architecture with 8 integer general purpose registers like IA32, or 16 integer general purpose registers like AMD64, are far from perfect. Further details on the register allocation for these architectures can be found in sections \ref{sectionia32registerallocation} and \ref{sectionamd64registerallocation} respectively. The CACAO development team is currently working on a new register allocator based on a \textit{linear scan} algorithm. This allocator should produce much better results on CISC machines than the current register allocator.

\subsection{Instruction combining}

Together with stack analysis we combine constant loading instructions with selected instructions which follow immediately. The class of combinable instructions contains add, subtract, multiply and divide instructions, logical and shift instructions, compare/branch and array store instructions. The combined immediate instructions are:

\begin{itemize}
\item \texttt{ICMD\_IADDCONST}, \texttt{ICMD\_ISUBCONST}, \texttt{ICMD\_IMULCONST}, \texttt{ICMD\_IDIVPOW2}, \texttt{ICMD\_IREMPOW2}
\item \texttt{ICMD\_LADDCONST}, \texttt{ICMD\_LSUBCONST}, \texttt{ICMD\_LMULCONST}, \texttt{ICMD\_LDIVPOW2}, \texttt{ICMD\_LREMPOW2}
\item \texttt{ICMD\_IANDCONST}, \texttt{ICMD\_IORCONST}, \texttt{ICMD\_IXORCONST}
\item \texttt{ICMD\_LANDCONST}, \texttt{ICMD\_LORCONST}, \texttt{ICMD\_LXORCONST}
\item \texttt{ICMD\_ISHLCONST}, \texttt{ICMD\_ISHRCONST}, \texttt{ICMD\_IUSHRCONST}
\item \texttt{ICMD\_LSHLCONST}, \texttt{ICMD\_LSHRCONST}, \texttt{ICMD\_LUSHRCONST}
\item \texttt{ICMD\_IFxx}
\item \texttt{ICMD\_IF\_Lxx}
\item \texttt{ICMD\_xASTORECONST}
\end{itemize}

During code generation it is checked whether the constant lies in the range for immediate operands of the target architecture, and appropriate code is generated.

Arithmetic and logical instructions are processed in a straightforward way: the intermediate command opcode of the current instruction is changed and the immediate value from the previous instruction is stored in the current instruction. This optimization always reduces the register pressure by one register.

\texttt{ICMD\_IDIV} and \texttt{ICMD\_IREM} divisions by a constant which is a power of two are handled in a special way. They are converted into \texttt{ICMD\_IDIVPOW2} and \texttt{ICMD\_IREMPOW2} respectively. For \texttt{ICMD\_IDIVPOW2} an immediate value is assigned which represents the amount by which \texttt{0x1} has to be shifted left to get the divisor value. In the code generation pass, very fast shift-based machine code can be generated for this instruction. For \texttt{ICMD\_IREMPOW2} the immediate value has one subtracted, turning it into a bit mask. The generated machine code consists of logical \texttt{and}s, \texttt{neg}s and a conditional jump. For both instructions the generated machine code is much faster than an integer division. \texttt{ICMD\_LDIV} and \texttt{ICMD\_LREM} intermediate commands are handled analogously.
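The following C fragment sketches the arithmetic behind these two conversions for the 32-bit case. It mirrors what the generated machine code computes; the function names are illustrative, not CACAO's, and the exact instruction selection differs per architecture.

\begin{verbatim}
#include <stdint.h>

/* ICMD_IDIVPOW2: signed division by 2^shift, rounding toward
   zero as Java requires. Negative dividends are biased by
   (2^shift - 1) before the arithmetic right shift. */
int32_t idiv_pow2(int32_t x, int shift) {
    if (x < 0)
        x += (1 << shift) - 1;
    return x >> shift;
}

/* ICMD_IREMPOW2: the immediate is the divisor minus one, used
   as a bit mask; negative dividends are negated, masked and
   negated back (the conditional jump in the generated code). */
int32_t irem_pow2(int32_t x, int32_t mask) {
    if (x >= 0)
        return x & mask;
    return -((-x) & mask);
}
\end{verbatim}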
\texttt{ICMD\_IxSHx} instructions with a constant shift amount are converted to \texttt{ICMD\_IxSHxCONST} instructions. Nearly every architecture has machine instructions for shifts by a constant value. This optimization always reduces the register pressure by one register. \texttt{ICMD\_LxSHx} intermediate commands are converted to \texttt{ICMD\_LxSHxCONST} commands analogously.

\texttt{ICMD\_IF\_ICMPxx} intermediate commands are converted to \texttt{ICMD\_IFxx} commands. These commands compare the source operand directly with an immediate value if possible. The generated machine code depends on the architecture. On the IA32 or AMD64 architecture the immediate value can always be inlined. On RISC architectures the immediate value range is limited, like on the Alpha architecture, where the immediate value may be between 0 and 255. On architectures which support conditional branches on a source register, like Alpha or MIPS, the compare with 0 is optimized to a single instruction. This optimization can reduce the register pressure by one register. \texttt{ICMD\_IF\_Lxx} intermediate commands are handled analogously.

The \texttt{ICMD\_xASTORE} optimization was implemented mainly for the IA32 and AMD64 architectures. These architectures can handle inlined immediate values up to their address pointer size, that is 32-bit for IA32 and 64-bit for AMD64. For RISC architectures which have a \texttt{REG\_ZERO} (a register which always contains the value zero), this array store optimization can be used only for zero values. Address array stores (\texttt{ICMD\_AASTORE}) can only be optimized in the \texttt{null} pointer case because of the dynamic type check. In this case the optimization not only reduces the register pressure by one register, but also eliminates the call to the dynamic type check subroutine.

\subsection{Complexity of the algorithm}

The complexity of the algorithm is mostly linear with respect to the number of instructions plus the number of local variables plus the number of stack slots. There are only a few spots where it is not linear:

\begin{itemize}
\item At the beginning of a basic block the stack has to be copied to separate the stacks of different basic blocks. Table \ref{tab-1} shows that the stack depth at the boundary of a basic block is in most cases zero. Therefore, this copying does not influence the linear performance of the algorithm.
\item A {\tt store} has to check for a later use of the same variable. Table \ref{tab-2} shows that this is not a problem either.
\item A {\tt store} additionally has to check for a previous use of the same variable between the creation of the value and the {\tt store}. The distances between the creation and the use are small (in most cases only 1), as shown by table \ref{tab-3}.
\end{itemize}

When compiling {\tt javac}, 29\% of the compile time is spent in parsing and basic block determination, 18\% in stack analysis, 16\% in register allocation and 37\% in machine code generation.

\subsection{Example}

Figure \ref{IntermediateStack} shows the intermediate representation and stack information as produced by the compiler for debugging purposes. The {\tt Local Table} gives the types and register assignment for the local variables. The Java compiler reuses the same local variable slot for different local variables if their live ranges do not overlap. In this example, variable slot 3 is even used for local variables of different types (integer and address). The JIT compiler assigned the saved register 12 to this variable.
One interface register is used in this example, entering the basic block with label {\tt L004}. At the entry of the basic block the interface register has to be copied to the argument register {\tt A00}. This is one of the rare cases where a more sophisticated coalescing algorithm could have allocated an argument register for the interface. At instructions 2 and 3 you can see the combining of a constant with an arithmetic instruction. Since the instructions are allocated in an array, the empty slot has to be filled with a {\tt NOP} instruction. The {\tt ADDCONSTANT} instruction already has the local variable {\tt L02} as destination, information which comes from the later {\tt ISTORE} at number 4. Similarly, the {\tt INVOKESTATIC} at number 31 has marked all its operands as arguments. In this example all copies (besides the one to the interface register) have been eliminated.

\begin{figure*}[htb]
\begin{center}
\begin{verbatim}
java.io.ByteArrayOutputStream.write (int)void

Local Table:
   0:   (adr) S15
   1:   (int) S14
   2:   (int) S13
   3:   (int) S12  (adr) S12

Interface Table:
   0:   (int) T24

[                 L00]     0  ALOAD         0
[                 T23]     1  GETFIELD      16
[                 L02]     2  IADDCONST     1
[                 L02]     3  NOP
[                    ]     4  ISTORE        2
[                 L02]     5  ILOAD         2
[             L00 L02]     6  ALOAD         0
[             T23 L02]     7  GETFIELD      8
[             T23 L02]     8  ARRAYLENGTH
[                    ]     9  IF_ICMPLE     L005
              ...............
[                    ]    18  IF_ICMPLT     L003
[                    ] L002:
[                 I00]    19  ILOAD         3
[                 I00]    20  GOTO          L004
[                    ] L003:
[                 I00]    21  ILOAD         2
[                 A00] L004:
[                 L03]    22  BUILTIN1      newarray_byte
[                    ]    23  ASTORE        3
[                 L00]    24  ALOAD         0
[                 A00]    25  GETFIELD      8
[             A01 A00]    26  ICONST        0
[         A02 A01 A00]    27  ALOAD         3
[     A03 A02 A01 A00]    28  ICONST        0
[ L00 A03 A02 A01 A00]    29  ALOAD         0
[ A04 A03 A02 A01 A00]    30  GETFIELD      16
[                    ]    31  INVOKESTATIC  java/lang/System.arraycopy
[                 L00]    32  ALOAD         0
[             L03 L00]    33  ALOAD         3
[                    ]    34  PUTFIELD      8
[                    ] L005:
              ...............
[                    ]    45  RETURN
\end{verbatim}
\caption{Example: intermediate instructions and stack contents}
\label{IntermediateStack}
\end{center}
\end{figure*}

\section{Compiling a Java method}

The CACAO JIT compiler is invoked via the

\begin{verbatim}
methodptr jit_compile(methodinfo *m);
\end{verbatim}

function call. This function is just a wrapper around the internal compiler function

\begin{verbatim}
static methodptr jit_compile_intern(methodinfo *m);
\end{verbatim}

The argument of the compiler function is a pointer to a \texttt{methodinfo} structure (see figure \ref{methodinfostructure}) allocated by the system class loader. The internal function should not be called directly, and is thus declared \texttt{static}, because the wrapper function has to ensure some conditions:

\begin{itemize}
\item enter a monitor on the \texttt{methodinfo} structure, so that only one thread can compile the same Java method at the same time
\item check whether the method already has an \texttt{entrypoint}; if so, the monitor is left and the entrypoint is returned
\item measure the compilation time if requested
\item call the internal compiler function
\item leave the monitor and return the function's \texttt{entrypoint}
\end{itemize}
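A minimal sketch of this wrapper logic, following the list above, is shown below. The monitor helper names and structure fields are assumptions for illustration, not CACAO's actual code, and the optional time measurement is omitted.

\begin{verbatim}
/* Sketch of the jit_compile wrapper (helper names assumed). */
methodptr jit_compile(methodinfo *m) {
    methodptr entry;

    monitor_enter(m);             /* one compilation per method    */
    if (m->entrypoint == NULL)    /* not compiled in the meantime? */
        m->entrypoint = jit_compile_intern(m);
    entry = m->entrypoint;
    monitor_exit(m);

    return entry;
}
\end{verbatim}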
The internal compiler function \texttt{jit\_compile\_intern} does the actual compilation of the Java method. It calls the different passes of the JIT compiler. If the passed Java method does not have a \textit{Code Attribute} (see \ref{sectionmethodloading}), a \texttt{methodptr} to a \texttt{do\_nothing\_function} is returned. If the method has the \texttt{ACC\_STATIC} flag bit set and the method's class is not yet initialized, \texttt{class\_init} is called with the method's class as argument. Then the compiler passes are called (a sketch of the whole sequence is given at the end of this section):

\begin{enumerate}
\item \texttt{reg\_init}: initializes the register allocator
  \begin{itemize}
  \item allocates the \texttt{registerdata} structure
  \item calculates the number of callee saved, temporary and argument registers
  \end{itemize}
\item \texttt{reg\_setup}: sets up the register allocator data which is changed in every compiler run
\item \texttt{codegen\_setup}: initializes the code generator
  \begin{itemize}
  \item allocates the \texttt{codegendata} structure
  \item allocates code and data memory
  \end{itemize}
\item \texttt{parse}: parse pass
  \begin{itemize}
  \item parses the Java Virtual Machine instructions and converts them into CACAO intermediate commands
  \item determines basic blocks
  \end{itemize}
\item \texttt{analyse\_stack}: stack analysis pass
\item \texttt{regalloc}: register allocation pass
\item \texttt{codegen}: code generation pass
\item \texttt{reg\_close}: releases all allocated register allocator memory
\item \texttt{codegen\_close}: releases all allocated code generator memory
\end{enumerate}

After all compiler passes have run and no exception or error occurred, the \texttt{entrypoint} of the compiled method is returned.

The CACAO JIT compiler is designed to be reentrant. This design decision was taken to easily support exception throwing during one of the compiler passes and to support concurrent compilation in different threads. Concurrent compilation can speed up startup and run time, especially on multiprocessor machines.
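The following C sketch summarizes the pass sequence listed above. The signatures are assumptions for illustration; the real functions take further arguments and perform error and exception handling between the passes.

\begin{verbatim}
/* Sketch of jit_compile_intern's pass sequence (signatures
   assumed; error handling omitted). */
static methodptr jit_compile_intern(methodinfo *m) {
    reg_init(m);         /* allocator structures, register counts */
    reg_setup(m);        /* per-run register allocator data       */
    codegen_setup(m);    /* code and data memory                  */

    parse(m);            /* bytecode -> intermediate commands,    */
                         /* basic block determination             */
    analyse_stack(m);    /* static stack structure, dependencies  */
    regalloc(m);         /* register allocation                   */
    codegen(m);          /* machine code generation               */

    reg_close(m);        /* release allocator memory              */
    codegen_close(m);    /* release code generator memory         */

    return m->entrypoint;
}
\end{verbatim}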