1 \chapter{The Just-In-Time Compiler}
A Java Virtual Machine can implement four different approaches to
executing Java byte code:

\begin{itemize}
\item Interpreter
\item Ahead-Of-Time Compiler
\item Just-In-Time Compiler
\item Mixed Mode
\end{itemize}

An \textit{Interpreter} interprets every single virtual machine
instruction in the language the Java Virtual Machine is written in,
mainly C. Due to this fact an interpreter-based Java Virtual Machine
is easily portable, but the execution speed is very slow, up to ten
times slower than a current Just-In-Time Compiler or similar code
generation techniques.

An \textit{Ahead-Of-Time Compiler} compiles every Java method of a
class when the class is loaded. This reduces the compiler overhead
since the compilation of the class methods is done in one step, and at
runtime a called method can be executed immediately. The drawback of
this approach is the fact that every single method is compiled, even
if it is never needed. This can cost a serious amount of memory and
time, since the Java libraries contain a lot of methods.

A \textit{Just-In-Time Compiler} is the solution to the memory and
compilation time problem. This compiler compiles a method only when it
is called for the first time. The only drawback of this approach is the
deferred execution of the called method, since it must be compiled
first, but a Just-In-Time Compiler can save much compilation time by
never compiling methods that are never called.

The \textit{Mixed Mode} is a mixture of an Interpreter and a
Just-In-Time Compiler. Normally the code is interpreted, but code
parts which are executed frequently are compiled to native machine
code with a Just-In-Time Compiler. This technique is used by Sun's
HotSpot Java Virtual Machine.

CACAO implements only a \textit{Just-In-Time Compiler} and has no
Interpreter. The main goal of CACAO was to build a fast-executing
Java Virtual Machine with short compilation times. Thus the CACAO
development team decided to implement only a fast-compiling
Just-In-Time Compiler, so every single Java method that is executed is
compiled to native machine code.

The following sections describe some basics of byte code to machine
code compilation and how the CACAO Just-In-Time Compiler works in
detail.

55 \section{The Java Virtual Machine}
57 The JavaVM is a typed stack architecture \cite{javavm96}. There are
58 different instructions for integer, long integer, floating point and
59 address types. Byte and character types have only special memory access
60 instructions and are treated as integers for arithmetic operations. The
61 main instruction set consists of arithmetic/logical and load/store/constant
62 instructions. There are special instructions for array access and for
63 accessing the fields of objects (memory access), for method invocation,
64 and for type checking. A JavaVM has to check the program for type
65 correctness and executes only correct programs. The following examples show
some important JavaVM instructions and how a Java program is represented by
JavaVM code:
70 iload n ; load contents of local variable n
71 istore n ; store stack top in local variable n
72 imul ; product of 2 topmost stack elements
73 isub ; difference of 2 topmost stack elements
The Java assignment statement {\tt a = b - c * d} is translated into
the following JavaVM code:
79 iload b ; load contents of variable b
80 iload c ; load contents of variable c
81 iload d ; load contents of variable d
imul     ; compute c * d
isub     ; compute b - (c * d)
84 istore a ; store stack top in variable a
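An interpreter-based JVM would execute such a sequence by manipulating an explicit operand stack at run time. The following minimal sketch (hypothetical illustration, not CACAO code; the opcode names and structures are made up for this example) shows the stack discipline of the four instructions used above:

```c
/* Minimal interpreter sketch for the four instructions above.
   Hypothetical illustration, not CACAO code. */
enum opcode { ILOAD, ISTORE, IMUL, ISUB };

struct insn {
    enum opcode op;
    int n;                /* local variable index (unused for imul/isub) */
};

/* Executes the code and returns the remaining stack depth. */
static int run(const struct insn *code, int len, int *locals)
{
    int stack[16];
    int sp = 0;           /* number of occupied stack slots */

    for (int pc = 0; pc < len; pc++) {
        switch (code[pc].op) {
        case ILOAD:  stack[sp++] = locals[code[pc].n]; break;
        case ISTORE: locals[code[pc].n] = stack[--sp]; break;
        case IMUL:   sp--; stack[sp-1] = stack[sp-1] * stack[sp]; break;
        case ISUB:   sp--; stack[sp-1] = stack[sp-1] - stack[sp]; break;
        }
    }
    return sp;
}
```

With {\tt a}, {\tt b}, {\tt c}, {\tt d} mapped to local indices 0--3 and initial values $b=10$, $c=2$, $d=3$, running the sequence for {\tt a = b - c * d} leaves $a = 10 - 2 \cdot 3 = 4$ in local 0 and an empty stack.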
87 Accessing the fields of objects is handled by the instructions {\tt
88 getfield} and {\tt putfield}. The instruction {\tt getfield} expects an
89 object reference on the stack and has an index into the constant pool as an
90 operand. The index into the constant pool must be a reference to a pair
91 containing the class name and a field name. The types of the classes
92 referenced by the constant pool index and by the object reference must be
93 compatible, a fact which is usually checked statically at load time. The
94 object reference has to be different from the {\tt null} pointer, a fact
95 which must usually be checked at run time.
97 Array access is handled by the {\tt aload} and {\tt astore} instructions.
98 Separate versions of these instructions exist for each of the basic types
99 ({\tt byte}, {\tt int}, {\tt float}, {\tt ref}, etc.). The {\tt aload}
100 instruction expects a reference to an array and an index (of type {\tt
101 int}) on the stack. The array reference must not be the {\tt null} pointer.
The index must be greater than or equal to zero and less than the array
length.

105 There are special commands for method invocation. Each method has its own
106 virtual stack and an area for local variables. After the method invocation,
107 the stack of the caller is empty and the arguments are copied into the
108 first local variables of the called method. After execution of a {\tt
109 return} instruction, the called method returns to its caller. If the called
110 method is a function, it pops the return value from its own stack and
111 pushes it onto the stack of the caller.
113 The {\tt instanceof} and {\tt checkcast} instructions are used for subtype
114 testing. Both expect a reference to an object on the stack and have an
115 index into the constant pool as operand. The index must reference a class,
116 array or interface type. The two instructions differ in their result and in
117 their behavior if the object reference is {\tt null}.
120 \section{Translation of stack code to register code}
The architecture of a RISC (\textit{Reduced Instruction Set Computer})
or CISC (\textit{Complex Instruction Set Computer}) processor is
completely different from the stack architecture of the Java Virtual
Machine.

127 RISC processors have large sets of registers. The Alpha architecture
128 has 32 integer registers and 32 floating point registers which are
both 64 bits wide. They execute arithmetic and logic operations only
on values which are held in registers. RISC machines have a load-store
architecture: load and store instructions are provided to
132 move data between memory and registers. Local variables of methods
133 usually reside in registers and are saved in memory only during a
134 method call or if there are too few registers. The use of registers
135 for parameter passing is defined by calling conventions.
137 CISC processors have a relatively small set of registers, like the
138 IA32 architecture with 8 integer general purpose registers or the
139 AMD64 architecture with 16 integer general purpose registers. But, as
140 the name implies, CISC processors have a large and complex machine
141 instruction set. Unlike the load-store architecture of RISC machines,
142 CISC instructions can operate on operands residing in registers and in
memory locations. Local variables of methods should reside in
registers, but due to the limited number of registers this is very rare
and most local variables are stored on the stack. Detailed information
about the IA32 and AMD64 architectures can be found in section
147 \ref{sectionia32codegenerator} and section
148 \ref{sectionamd64codegenerator} respectively.
151 \subsection{Machine code translation examples}
153 The previous example expression {\tt a = b - c * d} would be translated
154 by an optimizing C compiler to the following two Alpha instructions (the
155 variables {\tt a}, {\tt b}, {\tt c} and {\tt d} reside in registers):
158 MULL c,d,tmp0 ; tmp0 = c * d
159 SUBL b,tmp0,a ; a = b - tmp0
162 If JavaVM code is translated to machine code, the stack is eliminated and
163 the stack slots are represented by temporary variables usually residing in
164 registers. A naive translation of the previous example would result in the
165 following Alpha instructions:
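One possible sketch of such a naive translation (the temporary registers \texttt{t0}--\texttt{t2} standing for the stack slots and the \texttt{MOVE} register-copy notation are assumptions for illustration):

```
MOVE b,t0       ; iload b
MOVE c,t1       ; iload c
MOVE d,t2       ; iload d
MULL t1,t2,t1   ; imul
SUBL t0,t1,t0   ; isub
MOVE t0,a       ; istore a
```

Four of the six instructions are plain register copies, which is exactly the overhead that copy elimination tries to avoid.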
176 The problems of translating JavaVM code to machine code are primarily the
177 elimination of the unnecessary copy instructions and finding an efficient
178 register allocation algorithm. A common but expensive technique is to do
the naive translation and use an additional pass for copy elimination and
coalescing.

183 \section{The translation algorithm}
185 The new translation algorithm can get by with three passes. The first pass
186 determines basic blocks and builds a representation of the JavaVM
187 instructions which is faster to decode. The second pass analyses the stack
188 and generates a static stack structure. During stack analysis variable
189 dependencies are tracked and register requirements are computed. In the
190 final pass register allocation of temporary registers is combined with
191 machine code generation.
193 The new compiler computes the exact number of objects needed or computes an
194 upper bound and allocates the memory for the necessary temporary data
195 structures in three big blocks: the basic block array, the instruction
array and the stack array. Eliminating all the doubly linked lists also
reduced the memory requirements by a factor of five.

200 \subsection{Basic block determination}
202 The first pass scans the JavaVM instructions, determines the basic blocks
203 and generates an array of instructions which has fixed size and is easier
204 to decode in the following passes. Each instruction contains the opcode,
205 two operands and a pointer to the static stack structure after the
206 instruction (see next sections). The different opcodes of JavaVM
207 instructions which fold operands into the opcode are represented by just
208 one opcode in the instruction array.
211 \subsection{Basic block interfacing convention}
213 The handling of control flow joins was quite complicated in the old
214 compiler. We therefore introduced a fixed interface at basic block
215 boundaries. Every stack slot at a basic block boundary is assigned a fixed
216 interface register. The stack analysis pass determines the type of the
register and whether it has to be saved across method invocations. To enlarge
the size of basic blocks, method invocations do not end basic blocks. To
219 guide our compiler design we did some static analysis on a large
220 application written in Java: the javac compiler and the libraries it uses.
221 Table \ref{tab-1} shows that in more than 93\% of the cases the stack is
222 empty at basic block boundaries and that the maximal stack depth is 6.
223 Using this data it becomes clear that the old join handling did not improve
224 the quality of the machine code.
\begin{table}
\begin{center}
\begin{tabular}{|l|c|c|c|c|c|c|c|c|} \hline
stack depth & 0 & 1 & 2 & 3 & 4 & 5 & 6 & $>$6 \\ \hline
occurrences & 7930 & 258 & 136 & 112 & 36 & 8 & 3 & 0 \\ \hline
\end{tabular}
\caption{distribution of stack depth at block boundary}
\label{tab-1}
\end{center}
\end{table}

239 \subsection{Copy elimination}
241 To eliminate unnecessary copies, the loading of values is delayed until the
instruction is reached which consumes the value. To compute this information,
the run-time stack is simulated at compile time. Instead of values, the
compile-time stack contains the type of each value, whether a local variable
was loaded to a stack location, and similar information. Adl-Tabatabai
246 \cite{Taba+98} used a dynamic stack which is changed at every instruction.
A dynamic stack only allows information to flow from earlier
instructions to later instructions. We use a static stack structure which
249 enables information flow in both directions.
251 Figure~\ref{Trans1} shows our instruction and stack representation. An
252 instruction has a reference to the stack before the instruction and the
253 stack after the instruction. The stack is represented as a linked list. The
254 two stacks can be seen as the source and destination operands of an
instruction. In the implementation only the destination stack is stored;
the source stack is the destination stack of the previous instruction.

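This linked representation can be sketched with the following C structures (a simplified illustration with made-up field names; CACAO's actual definitions contain more fields):

```c
#include <stddef.h>

/* Simplified sketch of the static stack structure; the field and
   type names are illustrative, not CACAO's actual definitions. */
typedef struct stackslot {
    int type;               /* type of the value, e.g. int or address */
    int varnum;             /* local variable number, or -1 if none */
    struct stackslot *prev; /* next deeper slot on the stack */
} stackslot;

typedef struct insn {
    int opcode;
    stackslot *dst;         /* stack after the instruction; the stack
                               before it is the dst of the previous one */
} insn;

/* Depth of a stack, given its topmost slot. */
static int stack_depth(const stackslot *top)
{
    int depth = 0;
    for (; top != NULL; top = top->prev)
        depth++;
    return depth;
}
```

For the {\tt imul} example, the source stack of the instruction has depth 3 ({\tt b}, {\tt c}, {\tt d}) while its destination stack has depth 2 ({\tt b}, {\tt c*d}); both lists share the slot holding {\tt b}.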
\begin{figure}
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(25,32)
\put( 0, 0 ){\makebox(10,5){\tt b}}
\put( 5, 2.5){\oval(10,5)}
\put( 5, 9 ){\vector(0,-1){4}}
\put( 0, 9 ){\makebox(10,5){\tt c}}
\put( 5,11.5){\oval(10,5)}
\put( 5,18 ){\vector(0,-1){4}}
\put( 0,18 ){\makebox(10,5){\tt d}}
\put( 5,20.5){\oval(10,5)}
%\put( 5,27 ){\vector(0,-1){4}}
\put(15, 9 ){\makebox(10,5){\tt c*d}}
\put(20,11.5){\oval(10,5)}
\put(20, 9 ){\line(0,-1){2}}
\put( 5, 7 ){\line(1,0){15}}
\put(10,27 ){\framebox(15,5){\tt imul}}
\put(20,27 ){\vector(0,-1){13}}
\put(15,27 ){\line(0,-1){6.5}}
\put(15,20.5){\vector(-1,0){5}}
\end{picture}
\caption{instruction and stack representation}
\label{Trans1}
\end{center}
\end{figure}

285 This representation can easily be used for copy elimination. Each stack
286 element not only contains the type of the stack slot but also the local
variable number of which it is a copy, the argument number if it is an
argument, and the interface register number if it is an interface
register. Load (push the content of a variable onto the stack) and store
instructions do not generate a copy machine instruction if the stack slot
contains the same
291 local variable. Generated machine instructions for arithmetic operations
292 directly use the local variables as their operands.
294 There are some pitfalls with this scheme. Take the example of
295 figure~\ref{Trans2}. The stack bottom contains the local variable {\tt a}.
296 The instruction {\tt istore a} will write a new value for {\tt a} and will
297 make a later use of this variable invalid. To avoid this we have to copy
298 the local variable to a stack variable. An important decision is at which
299 position the copy instruction should be inserted. Since there is a high
300 number of {\tt dup} instructions in Java programs (around 4\%) and it is
301 possible that a local variable resides in memory, the copy should be done
302 with the {\tt load} instruction. Since the stack is represented as a linked
303 list only the destination stack has to be checked for occurrences of the
304 offending variable and these occurrences are replaced by a stack variable.
309 \setlength{\unitlength}{1mm}
310 \begin{picture}(79,32)
311 \put( 0,27 ){\framebox(15,5){\tt iload a}}
312 \put(16,27 ){\framebox(13,5){\tt dup}}
313 \put(30,27 ){\framebox(17,5){\tt iconst 1}}
314 \put(48,27 ){\framebox(13,5){\tt iadd}}
315 \put(62,27 ){\framebox(17,5){\tt istore a}}
316 \put(10,27 ){\vector(0,-1){22}}
317 \put( 5,27 ){\vector(0,-1){4}}
318 \put( 5, 0 ){\makebox(10,5){\tt a}}
319 \put(10, 2.5){\oval(10,5)}
320 \put(20,27 ){\line(0,-1){2}}
321 \put(20,25 ){\line(-1,0){10}}
322 \put(25,27 ){\vector(0,-1){13}}
323 \put(20, 9 ){\makebox(10,5){\tt a}}
324 \put(25,11.5){\oval(10,5)}
325 \put(25, 9 ){\line(0,-1){2}}
326 \put(10, 7 ){\line(1,0){63}}
327 \put(36,27 ){\line(0,-1){2}}
328 \put(36,25 ){\line(-1,0){11}}
329 \put(41,27 ){\vector(0,-1){4}}
330 \put(36,18 ){\makebox(10,5){\tt 1}}
331 \put(41,20.5){\oval(10,5)}
332 \put(41,18 ){\line(0,-1){2}}
333 \put(41,16 ){\line(-1,0){16}}
334 \put(52,27 ){\line(0,-1){2}}
335 \put(52,25 ){\line(-1,0){11}}
336 \put(73,27 ){\line(0,-1){20}}
337 \put(57,27 ){\vector(0,-1){13}}
338 \put(52, 9 ){\makebox(10,5){\tt +}}
339 \put(57,11.5){\oval(10,5)}
340 \put(57,9 ){\line(0,-1){2}}
341 \put(68,27 ){\line(0,-1){2}}
342 \put(68,25 ){\line(-1,0){11}}
344 \caption{anti dependence}
To answer the question of how often this could happen and how expensive
the stack search is, we again analysed the {\tt javac} compiler. In more
than 98\% of the cases the stack is empty (see table \ref{tab-2}). In only
0.2\% of the cases is the stack depth higher than 1, and the biggest stack
depth is 3.

\begin{table}
\begin{center}
\begin{tabular}{|l|c|c|c|c|c|} \hline
stack depth & 0 & 1 & 2 & 3 & $>$3 \\ \hline
occurrences & 2167 & 31 & 1 & 3 & 0 \\ \hline
\end{tabular}
\caption{distribution of {\tt store} stack depth}
\label{tab-2}
\end{center}
\end{table}

\begin{table}
\begin{center}
\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|} \hline
chain length & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & $>$9 \\ \hline
occurrences & 1892& 62 & 23 & 62 & 30 & 11 & 41 & 9 & 7 & 65 \\ \hline
\end{tabular}
\caption{distribution of creator-store distances}
\label{tab-3}
\end{center}
\end{table}

379 To avoid copy instructions when executing a {\tt store} it is necessary to
380 connect the creation of a value with the store which consumes it. In that
case a {\tt store} can conflict not only with copies of a local variable
which result from {\tt load} instructions before the creator of the value,
but also with {\tt load} and {\tt store} instructions which exist between
the creation of the value and the {\tt store}. In figure~\ref{Trans3} the {\tt
385 iload a} instruction conflicts with the {\tt istore a} instruction.
\begin{figure}
\begin{center}
\setlength{\unitlength}{1mm}
\begin{picture}(67,27)
\put( 0,22 ){\framebox(15,5){\tt iadd}}
\put(16,22 ){\framebox(15,5){\tt iload a}}
\put(32,22 ){\framebox(17,5){\tt istore b}}
\put(50,22 ){\framebox(17,5){\tt istore a}}
\put(10,22 ){\vector(0,-1){13}}
\put( 5,22 ){\line(0,-1){2}}
\put( 5,20 ){\line(-1,0){3}}
\put( 2,20 ){\vector(0,-1){20}}
\put(10, 4 ){\line(0,-1){2}}
\put( 5, 4 ){\makebox(10,5){\tt +}}
\put(10, 6.5){\oval(10,5)}
\put(21,22 ){\line(0,-1){2}}
\put(21,20 ){\line(-1,0){11}}
\put(26,22 ){\vector(0,-1){4}}
\put(21,13 ){\makebox(10,5){\tt a}}
\put(26,15.5){\oval(10,5)}
\put(26,13 ){\line(0,-1){2}}
\put(10,11 ){\line(1,0){46}}
\put(38,22 ){\line(0,-1){2}}
\put(38,20 ){\line(-1,0){12}}
\put(43,22 ){\line(0,-1){11}}
\put(56,22 ){\line(0,-1){11}}
\put(61,22 ){\line(0,-1){20}}
\put( 2, 2 ){\line(1,0){59}}
\end{picture}
\caption{anti dependence}
\label{Trans3}
\end{center}
\end{figure}

421 The anti dependences are detected by checking the stack locations of the
previous instructions for conflicts. Since the stack locations are
allocated as one big array, just the stack elements which have a higher
index than the current stack element have to be checked. Table \ref{tab-3}
425 gives the distribution of the distance between the creation of the value
426 and the corresponding store. In 86\% of the cases the distance is one.
428 The output dependences are checked by storing the instruction number of the
429 last store in each local variable. If a store conflicts due to dependences
430 the creator places the value in a stack register. Additional dependences
431 arise because of exceptions. The exception mechanism in Java is precise.
Therefore {\tt store} instructions are not allowed to be executed before
an exception-raising instruction. This is checked easily by remembering
the last instruction which could raise an exception. In methods which contain
no exception handler this conflict can be safely ignored because no
exception handler can have access to these variables.

439 \subsection{Register allocator}
The current register allocator of CACAO is a very simple,
straightforward allocator. It simply assigns free registers with a
\textit{first-come-first-serve} based algorithm. This is mostly
suitable for RISC architectures with large register sets like Alpha or
MIPS with 32 integer registers and 32 floating-point registers; these
are the architectures the current register allocator was designed for.

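The first-come-first-serve idea can be illustrated by the following sketch (hypothetical code, not CACAO's allocator; the register count and function names are assumptions):

```c
/* First-come-first-serve sketch: hand out the lowest free register,
   report a spill when none is left. Hypothetical, not CACAO code. */
#define NREGS 8              /* e.g. the IA32 integer register count */

static int reg_used[NREGS];

/* Returns a register number, or -1 if the value must live in memory. */
static int alloc_reg(void)
{
    for (int r = 0; r < NREGS; r++) {
        if (!reg_used[r]) {
            reg_used[r] = 1;
            return r;
        }
    }
    return -1;               /* no register free: spill */
}

static void free_reg(int r)
{
    if (r >= 0)
        reg_used[r] = 0;
}
```

A freed register immediately becomes available again, so whichever value next requests a register gets it first, regardless of live ranges; this simplicity is what limits the quality of the allocation on register-starved CISC machines.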
448 Basically the allocation passes of the register allocator are:
\begin{itemize}
\item interface register allocation
\item scratch register allocation
\item local register allocation
\end{itemize}

The \texttt{javac} compiler supports the simple
\textit{first-come-first-serve} approach CACAO uses: it does a
coloring of the local variables and assigns the same number to
variables which are not active at the same time. The stack variables
have their live ranges implicitly encoded. When a value is pushed, the
live range starts. When a value is popped, the live range ends.

Complications arise only with stack manipulation instructions like {\tt dup}
and {\tt swap}. We therefore flag the first creation of a stack variable and
mark a duplicated one as a copy. The register used for this variable can
be reused only after the last copy is popped.

468 During stack analysis stack variables are marked which have to survive a
469 method invocation. These stack variables and local variables are assigned
470 to callee saved registers. If there are not enough registers available,
471 these variables are allocated in memory.
473 Efficient implementation of method invocation is crucial to the performance
474 of Java. Therefore, we preallocate the argument registers and the return
475 value in a similar way as we handle store instructions. Input arguments (in
476 Java input arguments are the first variables) for leaf procedures (and
477 input arguments for processors with register windows) are preassigned, too.
Since CACAO has now also been ported to CISC architectures like IA32
and AMD64, the \textit{first-come-first-serve} register allocator has
hit its limits. The results produced for an architecture with 8
integer general purpose registers like IA32 or 16 integer general
purpose registers like AMD64 are far from perfect. Further details on
register allocation for these architectures can be found in section
\ref{sectionia32registerallocation} and section
\ref{sectionamd64registerallocation} respectively.

The CACAO development team is currently working on a new register
allocator based on a \textit{linear scan} algorithm. This allocator
should produce much better results on CISC machines than the current
one.

494 \subsection{Instruction combining}
Together with stack analysis we combine constant loading instructions
with selected instructions which follow immediately. The class of
combinable instructions includes add, subtract, multiply and
divide instructions, logical and shift instructions, compare/branch
instructions and array store instructions.

502 These combined immediate instructions are:
\begin{itemize}
\item \texttt{ICMD\_IADDCONST}, \texttt{ICMD\_ISUBCONST},
      \texttt{ICMD\_IMULCONST}, \texttt{ICMD\_IDIVPOW2},
      \texttt{ICMD\_IREMPOW2}
\item \texttt{ICMD\_LADDCONST}, \texttt{ICMD\_LSUBCONST},
      \texttt{ICMD\_LMULCONST}, \texttt{ICMD\_LDIVPOW2},
      \texttt{ICMD\_LREMPOW2}
\item \texttt{ICMD\_IANDCONST}, \texttt{ICMD\_IORCONST},
      \texttt{ICMD\_IXORCONST}
\item \texttt{ICMD\_LANDCONST}, \texttt{ICMD\_LORCONST},
      \texttt{ICMD\_LXORCONST}
\item \texttt{ICMD\_ISHLCONST}, \texttt{ICMD\_ISHRCONST},
      \texttt{ICMD\_IUSHRCONST}
\item \texttt{ICMD\_LSHLCONST}, \texttt{ICMD\_LSHRCONST},
      \texttt{ICMD\_LUSHRCONST}
\item \texttt{ICMD\_IFxx}
\item \texttt{ICMD\_IF\_Lxx}
\item \texttt{ICMD\_xASTORECONST}
\end{itemize}

During code generation the constant is checked to see whether it lies in
the range for immediate operands of the target architecture, and
appropriate code is generated.

Arithmetic and logical instructions are processed in a straightforward
way. The intermediate command opcode of the current instruction is
changed and the immediate value from the previous instruction is stored
in the current instruction. This optimization always reduces the
register pressure by one register.

542 \texttt{ICMD\_IDIV} and \texttt{ICMD\_IREM} divisions by a constant
543 which is a power of two are handled in a special way. They are
544 converted into \texttt{ICMD\_IDIVPOW2} and \texttt{ICMD\_IREMPOW2}
545 respectively. For \texttt{ICMD\_IDIVPOW2} an immediate value is
546 assigned which represents the left shift amount of \texttt{0x1} to get
547 the divisor value. In the code generation pass a very fast shift-based
machine code can be generated for this instruction. For
\texttt{ICMD\_IREMPOW2} the stored immediate value is the divisor minus
one, which serves as a bit mask. The generated machine code consists of
logical \texttt{and}'s, \texttt{neg}'s and a conditional jump. For both
instructions the generated machine code is much faster than an integer
division. \texttt{ICMD\_LDIV} and \texttt{ICMD\_LREM} intermediate
commands are handled analogously.

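The effect of the generated code can be sketched in C (a sketch of the semantics only; the function names are made up, and an arithmetic right shift on signed values is assumed, as on the architectures discussed):

```c
/* Division by 2^k rounding toward zero, as the JavaVM requires.
   Sketch only; names are made up and x >> k is assumed to be an
   arithmetic shift. */
static int idiv_pow2(int x, int k)
{
    if (x < 0)
        x += (1 << k) - 1;   /* bias negative values so the shift
                                rounds toward zero, not -infinity */
    return x >> k;
}

/* Remainder by 2^k via mask, negate and a conditional branch. */
static int irem_pow2(int x, int k)
{
    int mask = (1 << k) - 1; /* the immediate: divisor minus one */
    if (x < 0)
        return -(-x & mask);
    return x & mask;
}
```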
\texttt{ICMD\_IxSHx} instructions with a constant shift amount are
converted to \texttt{ICMD\_IxSHxCONST} instructions. Nearly every
architecture has machine shift instructions with a constant shift
amount. This optimization always reduces the register pressure by one
register. \texttt{ICMD\_LxSHx} intermediate commands are converted to
\texttt{ICMD\_LxSHxCONST} commands analogously.

563 \texttt{ICMD\_IF\_ICMPxx} intermediate commands are converted to
\texttt{ICMD\_IFxx} commands. These commands compare the source
operand directly with an immediate value if possible. The generated
566 machine code depends on the architecture. On the IA32 or AMD64
567 architecture the immediate value can always be inlined. On RISC
568 architectures the immediate value range is limited, like the Alpha
569 architecture where the immediate value may be between 0 and 255. On
570 architectures which support conditional branches on a source register,
571 like Alpha or MIPS, the compare with 0 is optimized to a single
572 instruction. This optimization can reduce the register pressure by one
register. \texttt{ICMD\_IF\_Lxx} intermediate commands are handled
analogously.

The \texttt{ICMD\_xASTORE} optimization was mainly implemented for
the IA32 and AMD64 architectures. These architectures can handle inline
immediate values up to their address pointer size, that is 32 bits
for IA32 and 64 bits for AMD64. For RISC architectures
which have a \texttt{REG\_ZERO}---a register which always contains the
value zero---this array store optimization can be used only for zero
582 values. Address array stores---\texttt{ICMD\_AASTORE}---can only be
583 optimized in the \texttt{null} pointer case because of the dynamic
584 type check. In this case the optimization not only reduces the
585 register pressure by one register, but the dynamic type check
586 subroutine call can be eliminated.
589 \subsection{Complexity of the algorithm}
591 The complexity of the algorithm is mostly linear with respect to the number
592 of instructions and the number of local variables plus the number of stack
593 slots. There are only a small number of spots where it is not linear.
\begin{itemize}
\item At the beginning of a basic block the stack has to be copied to
      separate the stacks of different basic blocks. Table \ref{tab-1}
      shows that the stack at the boundary of a basic block is empty in
      most cases. Therefore, this copying does not influence the linear
      performance of the algorithm.
\item A store has to check for a later use of the same variable. Table
      \ref{tab-2} shows that this is not a problem either.
\item A store additionally has to check for a previous use of the same
      variable between the creation of the value and the store. The
      distances between the creation and the use are small (in most
      cases only 1), as shown by table \ref{tab-3}.
\end{itemize}

When compiling {\tt javac}, 29\% of the compile time is spent in parsing
and basic block determination, 18\% in stack analysis, 16\% in register
allocation and 37\% in machine code generation.

616 Figure \ref{IntermediateStack} shows the intermediate representation and
617 stack information as produced by the compiler for debugging purposes. The
618 {\tt Local Table} gives the types and register assignment for the local
619 variables. The Java compiler reuses the same local variable slot for
different local variables if their live ranges do not overlap. In this
example the variable slot 3 is even used for local variables of different
types (integer and address). The JIT compiler assigned the saved register
{\tt S12} to this slot for both types.

625 One interface register is used in this example entering the basic block
626 with label {\tt L004}. At the entry of the basic block the interface
627 register has to be copied to the argument register {\tt A00}. This is one
628 of the rare cases where a more sophisticated coalescing algorithm could
629 have allocated an argument register for the interface.
631 At instruction 2 and 3 you can see the combining of a constant with an
632 arithmetic instruction. Since the instructions are allocated in an array
633 the empty slot has to be filled with a {\tt NOP} instruction. The {\tt
634 ADDCONSTANT} instruction already has the local variable {\tt L02} as
destination, information which comes from the later {\tt ISTORE} at
number 4. Similarly the {\tt INVOKESTATIC} at number 31 has marked all its
operands as arguments. In this example all copies (besides the one to the
interface register) have been eliminated.

643 java.io.ByteArrayOutputStream.write (int)void
649 3: (int) S12 (adr) S12
661 [ T23 L02] 7 GETFIELD 8
662 [ T23 L02] 8 ARRAYLENGTH
666 [ ] 18 IF_ICMPLT L003
673 [ L03] 22 BUILTIN1 newarray_byte
677 [ A01 A00] 26 ICONST 0
678 [ A02 A01 A00] 27 ALOAD 3
679 [ A03 A02 A01 A00] 28 ICONST 0
680 [ L00 A03 A02 A01 A00] 29 ALOAD 0
681 [ A04 A03 A02 A01 A00] 30 GETFIELD 16
682 [ ] 31 INVOKESTATIC java/lang/System.arraycopy
684 [ L03 L00] 33 ALOAD 3
691 \caption{Example: intermediate instructions and stack contents}
692 \label{IntermediateStack}
697 \section{Compiling a Java method}
699 The CACAO JIT compiler is invoked via the
702 methodptr jit_compile(methodinfo *m);
function call. This function is just a wrapper around the
internal compiler function
709 static methodptr jit_compile_intern(methodinfo *m);
712 The argument of the compiler function is a pointer to a
713 \texttt{methodinfo} structure (see figure \ref{methodinfostructure})
714 allocated by the system class loader. This function should not be
715 called directly and thus is declared \texttt{static} because the
716 wrapper function has to ensure some conditions:
\begin{itemize}
\item enter a monitor on the \texttt{methodinfo} structure to make
      sure that only one thread compiles the same Java method at the
      same time
\item check whether the method already has an \texttt{entrypoint}; if
      so, the monitor is left and the entrypoint is returned
\item measure the compiling time if requested
\item call the internal compiler function
\item leave the monitor and return the function's \texttt{entrypoint}
\end{itemize}

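The wrapper logic above can be sketched as follows (a simplified sketch: the structure fields and the \texttt{compile\_intern} stand-in are made up, and compile-time measurement is omitted):

```c
#include <pthread.h>
#include <stddef.h>

/* Simplified sketch of the jit_compile wrapper; the field names and
   compile_intern are made-up stand-ins, not CACAO's definitions. */
typedef void *methodptr;

typedef struct methodinfo {
    pthread_mutex_t lock;    /* the per-method monitor */
    methodptr entrypoint;    /* NULL until the method is compiled */
} methodinfo;

static int compile_count;    /* for illustration only */

static methodptr compile_intern(methodinfo *m)
{
    (void) m;
    compile_count++;
    return (methodptr) &compile_count;   /* dummy code address */
}

static methodptr jit_compile_sketch(methodinfo *m)
{
    pthread_mutex_lock(&m->lock);        /* enter the monitor */
    if (m->entrypoint == NULL)           /* not compiled yet? */
        m->entrypoint = compile_intern(m);
    pthread_mutex_unlock(&m->lock);      /* leave the monitor */
    return m->entrypoint;
}
```

Calling the sketch twice for the same method compiles only once and returns the same entry point, which is the behavior the monitor and the entrypoint check are meant to guarantee.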
The internal compiler function \texttt{jit\_compile\_intern} does the
actual compilation of the Java method. It calls the different compiler
passes in turn.

737 If the passed Java method does not have a \textit{Code Attribute} (see
738 \ref{sectionmethodloading}) a \texttt{methodptr} to a
739 \texttt{do\_nothing\_function} is returned.
If the method has the \texttt{ACC\_STATIC} flag bit set and the
method's class is not yet initialized, \texttt{class\_init} is called
with the method's class as argument.

745 Then the compiler passes are called:
\begin{itemize}
\item \texttt{reg\_init}: initializes the register allocator
  \begin{itemize}
  \item allocates the \texttt{registerdata} structure
  \item calculates the number of callee saved, temporary and argument
        registers
  \end{itemize}
\item \texttt{reg\_setup}: sets up the register allocator data which
      is changed in every compiler run
\item \texttt{codegen\_setup}: initializes the code generator
  \begin{itemize}
  \item allocates the \texttt{codegendata} structure
  \item allocates code and data memory
  \end{itemize}
\item \texttt{parse}: parse pass
  \begin{itemize}
  \item parses the Java Virtual Machine instructions and converts them
        into CACAO intermediate commands
  \item determines basic blocks
  \end{itemize}
\item \texttt{analyse\_stack}: stack analysis pass
\item \texttt{regalloc}: register allocation pass
\item \texttt{codegen}: code generation pass
\item \texttt{reg\_close}: releases all allocated register allocator
      resources
\item \texttt{codegen\_close}: releases all allocated code generator
      resources
\end{itemize}

After all compiler passes have run and no exception or error occurred,
the \texttt{entrypoint} of the compiled method is returned.

The CACAO JIT compiler is designed to be reentrant. This design
decision was taken to easily support exception throwing during one of
the compiler passes and to support concurrent compilation in different
running threads. Concurrent compilation can speed up startup and run
time, especially on multi-processor machines.