\chapter{The Just-In-Time Compiler}

\section{Introduction}

\section{The Java Virtual Machine}
\label{chapjvm}

The JavaVM is a typed stack architecture \cite{javavm96}.
There are different instructions for integer, long integer, floating point and address types.
Byte and character types have only special memory access instructions and are treated as integers for arithmetic operations.
The main instruction set consists of arithmetic/logical and load/store/constant instructions.
There are special instructions for array access and for accessing the fields of objects (memory access), for method invocation, and for type checking.
The JavaVM has to check programs for type correctness and executes only correct programs.
The following examples show some important JavaVM instructions and how a Java program is represented by these instructions.

\begin{tabular}{ll}
{\tt iload n}& load contents of local variable n \\
{\tt istore n}& store stack top in local variable n \\
{\tt imul }& product of 2 topmost stack elements \\
{\tt isub }& difference of 2 topmost stack elements \\
\end{tabular}

The Java assignment statement {\tt a = b - c * d} is translated into

\begin{tabular}{ll}
{\tt iload b }& load contents of variable b \\
{\tt iload c }& load contents of variable c \\
{\tt iload d }& load contents of variable d \\
{\tt imul }& compute c * d \\
{\tt isub }& compute b - (c * d) \\
{\tt istore a}& store stack top in variable a \\
\end{tabular}

Accessing the fields of objects is handled by the instructions {\tt getfield} and {\tt putfield}.
The instruction {\tt getfield} expects an object reference on the stack and has an index into the constant pool as an operand.
The index into the constant pool must be a reference to a pair containing the class name and a field name.
The types of the classes referenced by the constant pool index and by the object reference must be compatible, a fact which is usually checked statically at load time.
The object reference has to be different from the {\tt null} pointer, a fact which must usually be checked at run time.

Array access is handled by the {\tt aload} and {\tt astore} instructions.
Separate versions of these instructions exist for each of the basic types ({\tt byte}, {\tt int}, {\tt float}, {\tt ref}, etc.).
The {\tt aload} instruction expects a reference to an array and an index (of type {\tt int}) on the stack.
The array reference must not be the {\tt null} pointer.
The index must be greater than or equal to zero and less than the array length.

There are special instructions for method invocation.
Each method has its own virtual stack and an area for local variables.
After the method invocation, the stack of the caller is empty and the arguments are copied into the first local variables of the called method.
After execution of a {\tt return} instruction, the called method returns to its caller.
If the called method is a function, it pops the return value from its own stack and pushes it onto the stack of the caller.

The {\tt instanceof} and {\tt checkcast} instructions are used for subtype testing.
Both expect a reference to an object on the stack and have an index into the constant pool as operand.
The index must reference a class, array or interface type.
The two instructions differ in their result and in their behavior if the object reference is {\tt null}.

\section{Translation of stack code to register code}
\label{chaptranslation}

The architecture of a RISC processor is completely different from the stack architecture of the JavaVM.
RISC processors have large sets of registers.
(The Alpha has 32 integer registers and 32 floating point registers, each 64 bits wide.)
They execute arithmetic and logic operations only on values which are held in registers.
Load and store instructions are provided to move data between memory and registers.
Local variables of methods usually reside in registers and are saved in memory only during a method call or if there are too few registers.
The use of registers for parameter passing is defined by calling conventions.

\subsection{Machine code translation examples}

The previous example expression {\tt a = b - c * d} would be translated by an optimizing C compiler to the following two Alpha instructions (the variables {\tt a}, {\tt b}, {\tt c} and {\tt d} reside in registers):

\begin{quote}
\begin{tabular}{ll}
{\tt MULL c,d,tmp0 }&{\tt ; tmp0 = c * d }\\
{\tt SUBL b,tmp0,a }&{\tt ; a = b - tmp0 }\\
\end{tabular}
\end{quote}

If JavaVM code is translated to machine code, the stack is eliminated and the stack slots are represented by temporary variables usually residing in registers.
A naive translation of the previous example would result in the following Alpha instructions:

\begin{quote}
\begin{tabular}{ll}
{\tt MOVE b,t0 }&{\tt ; iload b } \\
{\tt MOVE c,t1 }&{\tt ; iload c } \\
{\tt MOVE d,t2 }&{\tt ; iload d } \\
{\tt MULL t1,t2,t1}&{\tt ; imul } \\
{\tt SUBL t0,t1,t0}&{\tt ; isub } \\
{\tt MOVE t0,a }&{\tt ; istore a} \\
\end{tabular}
\end{quote}

The main problems in translating JavaVM code to machine code are eliminating the unnecessary copy instructions and finding an efficient register allocation algorithm.
A common but expensive technique is to do the naive translation and to use an additional pass for copy elimination and coalescing.

\section{The translation algorithm}

The new translation algorithm gets by with three passes.
The first pass determines basic blocks and builds a representation of the JavaVM instructions which is faster to decode.
The second pass analyses the stack and generates a static stack structure.
During stack analysis variable dependencies are tracked and register requirements are computed.
In the final pass, register allocation of temporary registers is combined with machine code generation.

The new compiler computes the exact number of objects needed, or an upper bound, and allocates the memory for the necessary temporary data structures in three big blocks (the basic block array, the instruction array and the stack array).
Eliminating all the doubly linked lists also reduced the memory requirements by a factor of five.

\subsection{Basic block determination}

The first pass scans the JavaVM instructions, determines the basic blocks and generates an array of fixed-size instructions which is easier to decode in the following passes.
Each instruction contains the opcode, two operands and a pointer to the static stack structure after the instruction (see the next sections).
The different opcodes of JavaVM instructions which fold operands into the opcode are represented by just one opcode in the instruction array.

\subsection{Basic block interfacing convention}

The handling of control flow joins was quite complicated in the old compiler.
We therefore introduced a fixed interface at basic block boundaries.
Every stack slot at a basic block boundary is assigned a fixed interface register.
The stack analysis pass determines the type of the register and whether it has to be saved across method invocations.
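A minimal sketch of this interface convention is given below in C; the names and the particular register mapping are illustrative assumptions, not necessarily those of the actual implementation:

\begin{verbatim}
/* Sketch of the fixed basic block interface (assumed names and
 * register mapping).  Stack slot i at a block boundary always lives
 * in interface register FIRST_INTERFACE_REG + i, so no matching of
 * stack states is needed at control flow joins. */

enum { TYPE_INT, TYPE_LNG, TYPE_FLT, TYPE_DBL, TYPE_ADR };

#define MAX_INTERFACE_DEPTH  8   /* observed maximum depth is 6      */
#define FIRST_INTERFACE_REG 10   /* assumed first reserved register  */

typedef struct {
    int type;      /* type of the value in this slot                 */
    int regnum;    /* fixed machine register assigned to the slot    */
    int saved;     /* nonzero if it must survive a method invocation */
} interfaceslot;

static interfaceslot interface_map[MAX_INTERFACE_DEPTH];

/* called by the stack analysis pass for every slot that is still on
 * the stack when a basic block ends */
static void note_interface_slot(int depth, int type, int crosses_call)
{
    interfaceslot *s = &interface_map[depth];

    s->type   = type;
    s->regnum = FIRST_INTERFACE_REG + depth;  /* fixed mapping        */
    s->saved |= crosses_call;                 /* needs callee-saved   */
}
\end{verbatim}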
To enlarge basic blocks, method invocations do not end basic blocks.
To guide our compiler design we did some static analysis on a large application written in Java: the {\tt javac} compiler and the libraries it uses.
Table \ref{tab-1} shows that in more than 93\% of the cases the stack is empty at basic block boundaries and that the maximum stack depth is 6.
This data makes it clear that the old join handling did not improve the quality of the machine code.

\begin{table*}
\begin{center}
\begin{tabular}[b]{|l|c|c|c|c|c|c|c|c|}
\hline
stack depth & 0 & 1 & 2 & 3 & 4 & 5 & 6 & $>$6 \\
\hline
occurrences & 7930 & 258 & 136 & 112 & 36 & 8 & 3 & 0 \\
\hline
\end{tabular}
\caption{distribution of stack depth at block boundary}
\label{tab-1}
\end{center}
\end{table*}

\subsection{Copy elimination}

To eliminate unnecessary copies, the loading of values is delayed until the instruction which consumes the value is reached.
To compute this information, the run-time stack is simulated at compile time.
Instead of values, the compile-time stack contains the type of each value, whether a local variable was loaded to a stack location, and similar information.
Adl-Tabatabai \cite{Taba+98} used a dynamic stack which is changed at every instruction.
A dynamic stack only allows information to be propagated from earlier instructions to later instructions.
We use a static stack structure which enables information flow in both directions.

Fig.~\ref{Trans1} shows our instruction and stack representation.
An instruction has a reference to the stack before the instruction and to the stack after the instruction.
The stack is represented as a linked list.
The two stacks can be seen as the source and destination operands of an instruction.
In the implementation only the destination stack is stored; the source stack is the destination stack of the previous instruction.

\begin{figure}[htb]
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(25,32)
\put( 0, 0 ){\makebox(10,5){\tt b}}
\put( 5, 2.5){\oval(10,5)}
\put( 5, 9 ){\vector(0,-1){4}}
\put( 0, 9 ){\makebox(10,5){\tt c}}
\put( 5,11.5){\oval(10,5)}
\put( 5,18 ){\vector(0,-1){4}}
\put( 0,18 ){\makebox(10,5){\tt d}}
\put( 5,20.5){\oval(10,5)}
%\put( 5,27 ){\vector(0,-1){4}}
\put(15, 9 ){\makebox(10,5){\tt c*d}}
\put(20,11.5){\oval(10,5)}
\put(20, 9 ){\line(0,-1){2}}
\put( 5, 7 ){\line(1,0){15}}
\put(10,27 ){\framebox(15,5){\tt imul}}
\put(20,27 ){\vector(0,-1){13}}
\put(15,27 ){\line(0,-1){6.5}}
\put(15,20.5){\vector(-1,0){5}}
\end{picture}
\caption{instruction and stack representation}
\label{Trans1}
\end{center}
\end{figure}

This representation can easily be used for copy elimination.
Each stack element not only contains the type of the stack slot but also the local variable number of which it is a copy, the argument number if it is an argument, or the interface register number if it is an interface slot.
Load instructions (which push the content of a variable onto the stack) and store instructions do not generate a copy machine instruction if the stack slot contains the same local variable.
Generated machine instructions for arithmetic operations directly use the local variables as their operands.

There are some pitfalls with this scheme.
Take the example of fig.~\ref{Trans2}.
The stack bottom contains the local variable {\tt a}.
The instruction {\tt istore a} will write a new value for {\tt a} and would make a later use of this variable invalid.
To avoid this, we have to copy the local variable to a stack variable.
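The static stack representation of fig.~\ref{Trans1} and the delayed loading of values can be sketched roughly as follows; all type and field names are illustrative assumptions:

\begin{verbatim}
/* Rough sketch of the static stack structure (illustrative names).
 * Only the destination stack of each instruction is stored; the
 * source stack is the destination stack of the previous instruction. */

enum varkind { TEMPVAR, LOCALVAR, ARGVAR, INTERFACEVAR };

typedef struct stackelement {
    struct stackelement *prev;  /* next deeper stack slot             */
    int          type;          /* int, long, float, double, address  */
    enum varkind kind;          /* what this slot stands for          */
    int          varnum;        /* local/argument/interface number    */
    int          regoff;        /* register or spill slot, set later  */
} stackelement;

typedef struct instruction {
    int           opc;          /* canonical opcode                   */
    int           op1;          /* operand, e.g. local variable index */
    long          val;          /* constant operand, if any           */
    stackelement *dst;          /* stack after this instruction       */
} instruction;

static stackelement stackbuf[10000];  /* one big preallocated array   */
static int          stackcnt;

static stackelement *new_stackelement(void)
{
    return &stackbuf[stackcnt++];     /* no overflow check: sketch    */
}

/* iload: instead of generating a copy, push a slot that merely
 * records which local variable the value comes from */
static stackelement *push_local(stackelement *top, int type, int varnum)
{
    stackelement *s = new_stackelement();

    s->prev   = top;
    s->type   = type;
    s->kind   = LOCALVAR;
    s->varnum = varnum;
    s->regoff = -1;                   /* register assigned later      */
    return s;
}
\end{verbatim}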
An important decision is the position at which the copy instruction should be inserted.
Since {\tt dup} instructions are frequent in Java programs (around 4\%) and a local variable may reside in memory, the copy should be done by the {\tt load} instruction.
Since the stack is represented as a linked list, only the destination stack has to be checked for occurrences of the offending variable, and these occurrences are replaced by a stack variable.

\begin{figure}[htb]
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(79,32)
\put( 0,27 ){\framebox(15,5){\tt iload a}}
\put(16,27 ){\framebox(13,5){\tt dup}}
\put(30,27 ){\framebox(17,5){\tt iconst 1}}
\put(48,27 ){\framebox(13,5){\tt iadd}}
\put(62,27 ){\framebox(17,5){\tt istore a}}
\put(10,27 ){\vector(0,-1){22}}
\put( 5,27 ){\vector(0,-1){4}}
\put( 5, 0 ){\makebox(10,5){\tt a}}
\put(10, 2.5){\oval(10,5)}
\put(20,27 ){\line(0,-1){2}}
\put(20,25 ){\line(-1,0){10}}
\put(25,27 ){\vector(0,-1){13}}
\put(20, 9 ){\makebox(10,5){\tt a}}
\put(25,11.5){\oval(10,5)}
\put(25, 9 ){\line(0,-1){2}}
\put(10, 7 ){\line(1,0){63}}
\put(36,27 ){\line(0,-1){2}}
\put(36,25 ){\line(-1,0){11}}
\put(41,27 ){\vector(0,-1){4}}
\put(36,18 ){\makebox(10,5){\tt 1}}
\put(41,20.5){\oval(10,5)}
\put(41,18 ){\line(0,-1){2}}
\put(41,16 ){\line(-1,0){16}}
\put(52,27 ){\line(0,-1){2}}
\put(52,25 ){\line(-1,0){11}}
\put(73,27 ){\line(0,-1){20}}
\put(57,27 ){\vector(0,-1){13}}
\put(52, 9 ){\makebox(10,5){\tt +}}
\put(57,11.5){\oval(10,5)}
\put(57,9 ){\line(0,-1){2}}
\put(68,27 ){\line(0,-1){2}}
\put(68,25 ){\line(-1,0){11}}
\end{picture}
\caption{anti dependence}
\label{Trans2}
\end{center}
\end{figure}

To answer the question of how often this could happen and how expensive the stack search is, we again analysed the {\tt javac} compiler.
In more than 98\% of the cases the stack is empty (see table \ref{tab-2}).
In only 0.2\% of the cases is the stack depth greater than 1, and the maximum stack depth is 3.

\begin{table}[htb]
\begin{center}
\begin{tabular}[b]{|l|c|c|c|c|c|}
\hline
stack depth & 0 & 1 & 2 & 3 & $>$3 \\
\hline
occurrences & 2167 & 31 & 1 & 3 & 0 \\
\hline
\end{tabular}
\caption{distribution of {\tt store} stack depth}
\label{tab-2}
\end{center}
\end{table}

\begin{table*}
\begin{center}
\begin{tabular}[b]{|l|c|c|c|c|c|c|c|c|c|c|}
\hline
chain length & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & $>$9 \\
\hline
occurrences & 1892& 62 & 23 & 62 & 30 & 11 & 41 & 9 & 7 & 65 \\
\hline
\end{tabular}
\caption{distribution of creator-store distances}
\label{tab-3}
\end{center}
\end{table*}

To avoid copy instructions when executing a {\tt store}, it is necessary to connect the creation of a value with the {\tt store} which consumes it.
In that case a {\tt store} can conflict not only with copies of a local variable which result from {\tt load} instructions before the creator of the value, but also with {\tt load} and {\tt store} instructions which occur between the creation of the value and the {\tt store}.
In fig.~\ref{Trans3} the {\tt iload a} instruction conflicts with the {\tt istore a} instruction.
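Whenever such a conflict is detected, the offending stack slots are rewritten so that the copy is produced by the {\tt load}; a minimal sketch, reusing the illustrative structures from the previous listing:

\begin{verbatim}
/* Conflict resolution for istore (reuses the illustrative types from
 * the previous sketch).  Every slot on the destination stack that is
 * still a copy of the stored variable is turned into a plain stack
 * temporary; because the stack structure is static, the load which
 * created the slot will then emit a real copy during code generation. */

static void resolve_store_conflict(stackelement *dststack, int varnum)
{
    stackelement *s;

    for (s = dststack; s != NULL; s = s->prev)
        if (s->kind == LOCALVAR && s->varnum == varnum)
            s->kind = TEMPVAR;        /* force a copy at the load */
}
\end{verbatim}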
\begin{figure}[htb]
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(67,27)
\put( 0,22 ){\framebox(15,5){\tt iadd}}
\put(16,22 ){\framebox(15,5){\tt iload a}}
\put(32,22 ){\framebox(17,5){\tt istore b}}
\put(50,22 ){\framebox(17,5){\tt istore a}}
\put(10,22 ){\vector(0,-1){13}}
\put( 5,22 ){\line(0,-1){2}}
\put( 5,20 ){\line(-1,0){3}}
\put( 2,20 ){\vector(0,-1){20}}
\put(10, 4 ){\line(0,-1){2}}
\put( 5, 4 ){\makebox(10,5){\tt +}}
\put(10, 6.5){\oval(10,5)}
\put(21,22 ){\line(0,-1){2}}
\put(21,20 ){\line(-1,0){11}}
\put(26,22 ){\vector(0,-1){4}}
\put(21,13 ){\makebox(10,5){\tt a}}
\put(26,15.5){\oval(10,5)}
\put(26,13 ){\line(0,-1){2}}
\put(10,11 ){\line(1,0){46}}
\put(38,22 ){\line(0,-1){2}}
\put(38,20 ){\line(-1,0){12}}
\put(43,22 ){\line(0,-1){11}}
\put(56,22 ){\line(0,-1){11}}
\put(61,22 ){\line(0,-1){20}}
\put( 2, 2 ){\line(1,0){59}}
\end{picture}
\caption{anti dependence}
\label{Trans3}
\end{center}
\end{figure}

The anti dependences are detected by checking the stack locations of the previous instructions for conflicts.
Since the stack locations are allocated as one big array, only the stack elements which have a higher index than the current stack element have to be checked.
Table \ref{tab-3} gives the distribution of the distance between the creation of the value and the corresponding store.
In 86\% of the cases the distance is one.

\begin{figure*}[htb]
\begin{center}
%\small
\begin{verbatim}
java.io.ByteArrayOutputStream.write (int)void

Local Table:
   0:  (adr) S15
   1:  (int) S14
   2:  (int) S13
   3:  (int) S12  (adr) S12

Interface Table:
   0:  (int) T24

[                  L00]     0  ALOAD         0
[                  T23]     1  GETFIELD      16
[                  L02]     2  IADDCONST     1
[                  L02]     3  NOP
[                     ]     4  ISTORE        2
[                  L02]     5  ILOAD         2
[              L00 L02]     6  ALOAD         0
[              T23 L02]     7  GETFIELD      8
[              T23 L02]     8  ARRAYLENGTH
[                     ]     9  IF_ICMPLE     L005
                   ...............
[                     ]    18  IF_ICMPLT     L003
[                     ] L002:
[                  I00]    19  ILOAD         3
[                  I00]    20  GOTO          L004
[                     ] L003:
[                  I00]    21  ILOAD         2
[                  A00] L004:
[                  L03]    22  BUILTIN1      newarray_byte
[                     ]    23  ASTORE        3
[                  L00]    24  ALOAD         0
[                  A00]    25  GETFIELD      8
[              A01 A00]    26  ICONST        0
[          A02 A01 A00]    27  ALOAD         3
[      A03 A02 A01 A00]    28  ICONST        0
[  L00 A03 A02 A01 A00]    29  ALOAD         0
[  A04 A03 A02 A01 A00]    30  GETFIELD      16
[                     ]    31  INVOKESTATIC  java/lang/System.arraycopy
[                  L00]    32  ALOAD         0
[              L03 L00]    33  ALOAD         3
[                     ]    34  PUTFIELD      8
[                     ] L005:
                   ...............
[                     ]    45  RETURN
\end{verbatim}
\caption{Example: intermediate instructions and stack contents}
\label{IntermediateStack}
\end{center}
\end{figure*}

The output dependences are checked by remembering, for each local variable, the instruction number of the last store to that variable.
If a store conflicts due to dependences, the creator places the value in a stack register.

Additional dependences arise because of exceptions.
The exception mechanism in Java is precise.
Therefore {\tt store} instructions must not be executed before a potentially exception-raising instruction.
This is checked easily by remembering the last instruction which could raise an exception.
In methods which contain no exception handler, this conflict can be safely ignored because no exception handler can access these variables.

\subsection{Register allocation}

Expensive register allocation algorithms are neither suitable nor necessary.
The {\tt javac} compiler does a coloring of the local variables and assigns the same number to variables which are not active at the same time.
The stack variables have their live ranges implicitly encoded.
When a value is pushed, the live range starts.
When a value is popped, the live range ends.
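The allocation of stack temporaries can therefore be sketched as a simple free list; the register numbers below are assumptions for illustration only:

\begin{verbatim}
/* Sketch of register allocation for stack temporaries (register
 * numbers are assumptions).  A live range starts at the push and
 * ends at the pop, so a simple free list of registers suffices. */

#define NUM_TMP_REGS 8

static int freeregs[NUM_TMP_REGS] = { 1, 2, 3, 4, 5, 6, 7, 8 };
static int nfree = NUM_TMP_REGS;

static int alloc_tmp_reg(void)     /* called when a value is pushed */
{
    if (nfree > 0)
        return freeregs[--nfree];
    return -1;                     /* no register left: spill to memory */
}

static void free_tmp_reg(int reg)  /* called when a value is popped */
{
    /* for a slot duplicated by dup this must be delayed until the
     * last copy is popped (see below) */
    freeregs[nfree++] = reg;
}
\end{verbatim}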
Complications arise only with stack manipulation instructions like {\tt dup} and {\tt swap}.
We therefore flag the first creation of a stack variable and mark a duplicated one as a copy.
The register used for this variable can be reused only after the last copy is popped.

During stack analysis, stack variables which have to survive a method invocation are marked.
These stack variables and local variables are assigned callee-saved registers.
If there are not enough registers available, these variables are allocated in memory.

Efficient implementation of method invocation is crucial to the performance of Java.
Therefore, we preallocate the argument registers and the return value in a similar way to the handling of store instructions.
Input arguments (in Java the input arguments are the first local variables) for leaf procedures (and input arguments for processors with register windows) are preassigned, too.

\subsection{Instruction combining}

Together with stack analysis we combine constant loading instructions with selected instructions which immediately follow.
The class of combinable instructions comprises add, subtract, multiply and divide instructions, logical and shift instructions, and compare/branch instructions.
During code generation the constant is checked to see whether it lies in the immediate operand range of the target architecture, and appropriate code is generated.

The old translator expanded some complex instructions into multiple simpler instructions to avoid handling complex instructions in the later passes.
One such case was the expansion of the {\tt lookupswitch} instruction into a series of load constant and compare and branch instructions.
Since the constants are usually quite small, this unnecessarily increased the size of the intermediate representation and of the final code.
The new compiler delays the expansion into multiple instructions until the code generation pass, which reduces the size of all representations and speeds up the compilation.

\subsection{Example}

Fig. \ref{IntermediateStack} shows the intermediate representation and stack information as produced by the compiler for debugging purposes.
The {\tt Local Table} gives the types and register assignment for the local variables.
The Java compiler reuses the same local variable slot for different local variables if their life ranges do not overlap.
In this example variable slot 3 is even used for local variables of different types (integer and address).
The JIT compiler assigned the saved register 12 to this variable.
One interface register is used in this example when entering the basic block with label {\tt L004}.
At the entry of the basic block the interface register has to be copied to the argument register {\tt A00}.
This is one of the rare cases where a more sophisticated coalescing algorithm could have allocated an argument register for the interface slot.
At instructions 2 and 3 you can see the combining of a constant with an arithmetic instruction.
Since the instructions are allocated in an array, the empty slot has to be filled with a {\tt NOP} instruction.
The {\tt IADDCONST} instruction already has the local variable {\tt L02} as its destination, information which comes from the later {\tt ISTORE} at number 4.
Similarly, the {\tt INVOKESTATIC} at number 31 has marked all its operands as arguments.
In this example all copies (besides the one to the interface register) have been eliminated.
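The combining visible at instructions 2 and 3 could be emitted during code generation roughly along the following lines; the emitter helpers are assumed stubs which only print Alpha assembly, and the 8 bit unsigned literal field of the Alpha integer operate instructions is used as the immediate range:

\begin{verbatim}
/* Sketch of code generation for a combined IADDCONST.  The emitter
 * helpers are stand-ins that print Alpha assembly; a real code
 * generator would emit binary instruction words instead. */

#include <stdio.h>

#define IMM_MIN       0   /* Alpha integer operate instructions     */
#define IMM_MAX     255   /* take an unsigned 8 bit literal         */
#define SCRATCH_REG  28   /* assumed scratch register ($at)         */

static void emit_addl(int d, int s, int t)
    { printf("ADDL $%d,$%d,$%d\n", s, t, d); }
static void emit_addl_imm(int d, int s, int i)
    { printf("ADDL $%d,#%d,$%d\n", s, i, d); }
static void emit_load_const(int r, long v)      /* simplified: a big */
    { printf("LDA  $%d,%ld\n", r, v); }         /* value needs LDAH  */

static void gen_iaddconst(int dreg, int sreg, long value)
{
    if (value >= IMM_MIN && value <= IMM_MAX) {
        /* constant fits the immediate field: one instruction */
        emit_addl_imm(dreg, sreg, (int) value);
    } else {
        /* otherwise materialize the constant in a scratch register */
        emit_load_const(SCRATCH_REG, value);
        emit_addl(dreg, sreg, SCRATCH_REG);
    }
}

int main(void)
{
    gen_iaddconst(3, 1, 1);       /* prints: ADDL $1,#1,$3          */
    gen_iaddconst(3, 1, 100000);  /* constant too big: two insns    */
    return 0;
}
\end{verbatim}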
\subsection{Complexity of the algorithm}

The complexity of the algorithm is mostly linear with respect to the number of instructions and the number of local variables plus the number of stack slots.
There are only a few places where it is not linear.

\begin{itemize}
\item At the beginning of a basic block the stack has to be copied to separate the stacks of different basic blocks.
Table \ref{tab-1} shows that the stack depth at basic block boundaries is zero in most cases.
Therefore, this copying does not affect the linear performance of the algorithm.
\item A store has to check for a later use of the same variable.
Table \ref{tab-2} shows that this is not a problem either.
\item A store additionally has to check for a previous use of the same variable between the creation of the value and the store.
The distances between creation and use are small (in most cases only 1), as shown by table \ref{tab-3}.
\end{itemize}

When compiling {\tt javac}, 29\% of the compile time is spent on parsing and basic block determination, 18\% on stack analysis, 16\% on register allocation and 37\% on machine code generation.