doc/handbook/jit.tex

   1 \chapter{The Just-In-Time Compiler}
   2
   3
   4 \section{Introduction}
   5
   6 A Java Virtual Machine can implement four different approaches of
   7 executing Java byte code:
   8
   9 \begin{itemize}
  10 \item Interpreter
  11 \item Ahead-Of-Time Compiler
  12 \item Just-In-Time Compiler
  13 \item Mixed Mode
  14 \end{itemize}
  15
  16 An \textit{Interpreter} interprets every single virtual machine
  17 instruction in the language the Java Virtual Machine is written in,
  18 mainly C. Due to this fact an interpreter based Java Virtual Machine
  19 is easily portable, but the execution speed is very slow, up to ten
  20 times slower than a current Just-In-Time Compilers or similar code
  21 written in C.
  22
  23 An \textit{Ahead-Of-Time Compiler} compiles every Java method of a
  24 class when the class is loaded. This reduces the compiler overhead
  25 since the compilation of the class methods is done in one step and at
  26 runtime the method called can be executed immediately. The drawback of
  27 this approach is the fact that every single method is compiled, even
  28 if it's not needed. This can use a serious amount of memory and time
  29 since the java libraries contain a lot of methods.
  30
  31 A \textit{Just-In-Time Compiler} is the solution to the memory and
  32 compilation time problem. This compiler only compiles a method when it
  33 is called the first time. The only drawback of this approach is the
  34 deferred execution of the called method since it must be compiled
  35 before, but a Just-In-Time Compiler can save much of compilation time.
  36
  37 The \textit{Mixed Mode} is mostly a mixture of an Interpreter and a
  38 Just-In-Time Compiler. Normally the code is interpreted, but code
  39 parts which are frequently executed are compiled to native machine
  40 code with an Just-In-Time Compiler. This technique is used by Sun's
  41 and IBM's JVM.
  42
  43 CACAO only implements a \textit{Just-In-Time Compiler} and has no
  44 Interpreter. The main target of CACAO was to build a fast executing
  45 Java Virtual Machine with short compilation times. Thus the CACAO
  46 development team decided to only implement a fast compiling
  47 Just-In-Time Compiler. So every single Java method executed is
  48 compiled to native machine code.
  49
  50 The following sections decribe some basics of byte code to machine
  51 code compilation and how the CACAO Just-In-Time Compiler works in
  52 detail.
  53
  54
  55 \section{The Java Virtual Machine}
  56
  57 The JavaVM is a typed stack architecture \cite{javavm96}. There are
  58 different instructions for integer, long integer, floating point and
  59 address types. Byte and character types have only special memory access
  60 instructions and are treated as integers for arithmetic operations. The
  61 main instruction set consists of arithmetic/logical and load/store/constant
  62 instructions. There are special instructions for array access and for
  63 accessing the fields of objects (memory access), for method invocation,
  64 and  for type checking. A JavaVM has to check the program for type
  65 correctness and executes only correct programs. The following examples show
  66 some important JavaVM instructions and how a Java program is represented by
  67 these instructions.
  68
  69 \begin{verbatim}
  70         iload  n    ; load contents of local variable n
  71         istore n    ; store stack top in local variable n
  72         imul        ; product of 2 topmost stack elements
  73         isub        ; difference of 2 topmost stack elements
  74 \end{verbatim}
  75
  76 The Java assignment statement {\tt a = b - c * d} is translated into
  77
  78 \begin{verbatim}
  79         iload b     ; load contents of variable b
  80         iload c     ; load contents of variable c
  81         iload d     ; load contents of variable d
  82         imul        ; compute c * d
  83         isub        ; compute b - (c * d)
  84         istore a    ; store stack top in variable a
  85 \end{verbatim}
  86
  87 Accessing the fields of objects is handled by the instructions {\tt
  88 getfield} and {\tt putfield}. The instruction {\tt getfield} expects an
  89 object reference on the stack and has an index into the constant pool as an
  90 operand. The index into the constant pool must be a reference to a pair
  91 containing the class name and a field name. The types of the classes
  92 referenced by the constant pool index and by the object reference must be
  93 compatible, a fact which is usually checked statically at load time. The
  94 object reference has to be different from the {\tt null} pointer, a fact
  95 which must usually be checked at run time.
  96
  97 Array access is handled by the {\tt aload} and {\tt astore} instructions.
  98 Separate versions of these instructions exist for each of the basic types
  99 ({\tt byte}, {\tt int}, {\tt float}, {\tt ref}, etc.). The {\tt aload}
 100 instruction expects a reference to an array and an index (of type {\tt
 101 int}) on the stack. The array reference must not be the {\tt null} pointer.
 102 The index must be greater than or equal to zero and less than the array
 103 length.
 104
 105 There are special commands for method invocation. Each method has its own
 106 virtual stack and an area for local variables. After the method invocation,
 107 the stack of the caller is empty and the arguments are copied into the
 108 first local variables of the called method. After execution of a {\tt
 109 return} instruction, the called method returns to its caller. If the called
 110 method is a function, it pops the return value from its own stack and
 111 pushes it onto the stack of the caller.
 112
 113 The {\tt instanceof} and {\tt checkcast} instructions are used for subtype
 114 testing. Both expect a reference to an object on the stack and have an
 115 index into the constant pool as operand. The index must reference a class,
 116 array or interface type. The two instructions differ in their result and in
 117 their behavior if the object reference is {\tt null}.
 118
 119
 120 \section{Translation of stack code to register code}
 121
 122 The architecture of a RISC---\textit{Reduced Instruction Set Computer}
 123 or CISC---\textit{Complex Instruction Set Computer}---processor is
 124 completely different from the stack architecture of the Java Virtual
 125 Machine.
 126
 127 RISC processors have large sets of registers. The Alpha architecture
 128 has 32 integer registers and 32 floating point registers which are
 129 both 64-bits wide. They execute arithmetic and logic operations only
 130 on values which are held in registers. RISC machines are a load-store
 131 architecture, this means load and store instructions are provided to
 132 move data between memory and registers. Local variables of methods
 133 usually reside in registers and are saved in memory only during a
 134 method call or if there are too few registers. The use of registers
 135 for parameter passing is defined by calling conventions.
 136
 137 CISC processors have a relatively small set of registers, like the
 138 IA32 architecture with 8 integer general purpose registers or the
 139 AMD64 architecture with 16 integer general purpose registers. But, as
 140 the name implies, CISC processors have a large and complex machine
 141 instruction set. Unlike the load-store architecture of RISC machines,
 142 CISC instructions can operate on operands residing in registers and in
 143 memory locations. Local variables of methods should reside in
 144 registers, but due to the limited number of registers this very rare
 145 and most local variables are stored on the stack. Detailed information
 146 of the IA32 and AMD64 architecture can be found in section
 147 \ref{sectionia32codegenerator} and section
 148 \ref{sectionamd64codegenerator} respectively.
 149
 150
 151 \subsection{Machine code translation examples}
 152
 153 The previous example expression {\tt a = b - c * d} would be translated
 154 by an optimizing C compiler to the following two Alpha instructions (the
 155 variables {\tt a}, {\tt b}, {\tt c} and {\tt d} reside in registers):
 156
 157 \begin{verbatim}
 158         MULL c,d,tmp0    ; tmp0 = c * d
 159         SUBL b,tmp0,a    ; a = b - tmp0
 160 \end{verbatim}
 161
 162 If JavaVM code is translated to machine code, the stack is eliminated and
 163 the stack slots are represented by temporary variables usually residing in
 164 registers. A naive translation of the previous example would result in the
 165 following Alpha instructions:
 166
 167 \begin{verbatim}
 168         MOVE b,t0        ; iload b
 169         MOVE c,t1        ; iload c
 170         MOVE d,t2        ; iload d
 171         MULL t1,t2,t1    ; imul
 172         SUBL t0,t1,t0    ; isub
 173         MOVE t0,a        ; istore a
 174 \end{verbatim}
 175
 176 The problems of translating JavaVM code to machine code are primarily the
 177 elimination of the unnecessary copy instructions and finding an efficient
 178 register allocation algorithm. A common but expensive technique is to do
 179 the naive translation and use an additional pass for copy elimination and
 180 coalescing.
 181
 182
 183 \section{The translation algorithm}
 184
 185 The new translation algorithm can get by with three passes. The first pass
 186 determines basic blocks and builds a representation of the JavaVM
 187 instructions which is faster to decode. The second pass analyses the stack
 188 and generates a static stack structure. During stack analysis variable
 189 dependencies are tracked and register requirements are computed. In the
 190 final pass register allocation of temporary registers is combined with
 191 machine code generation.
 192
 193 The new compiler computes the exact number of objects needed or computes an
 194 upper bound and allocates the memory for the necessary temporary data
 195 structures in three big blocks: the basic block array, the instruction
 196 array and the stack array. Eliminating all the double linked lists also
 197 reduced the memory requirements by a factor of five.
 198
 199
 200 \subsection{Basic block determination}
 201
 202 The first pass scans the JavaVM instructions, determines the basic blocks
 203 and generates an array of instructions which has fixed size and is easier
 204 to decode in the following passes. Each instruction contains the opcode,
 205 two operands and a pointer to the static stack structure after the
 206 instruction (see next sections). The different opcodes of JavaVM
 207 instructions which fold operands into the opcode are represented by just
 208 one opcode in the instruction array.
 209
 210
 211 \subsection{Basic block interfacing convention}
 212
 213 The handling of control flow joins was quite complicated in the old
 214 compiler. We therefore introduced a fixed interface at basic block
 215 boundaries. Every stack slot at a basic block boundary is assigned a fixed
 216 interface register. The stack analysis pass determines the type of the
 217 register and if it has to be saved across method invocations. To enlarge
 218 the size of basic blocks method invocations do not end basic blocks. To
 219 guide our compiler design we did some static analysis on a large
 220 application written in Java: the javac compiler and the libraries it uses.
 221 Table \ref{tab-1} shows that in more than 93\% of the cases the stack is
 222 empty at basic block boundaries and that the maximal stack depth is 6.
 223 Using this data it becomes clear that the old join handling did not improve
 224 the quality of the machine code.
 225
 226 \begin{table*}
 227 \begin{center}
 228 \begin{tabular}[b]{|l|c|c|c|c|c|c|c|c|}
 229 \hline
 230 stack depth &   0  &  1  &  2  &  3  &  4  &  5  &  6 & $>$6 \\ \hline
 231 occurrences & 7930 & 258 & 136 & 112 &  36 &  8  &  3 &  0   \\ \hline
 232 \end{tabular}
 233 \caption{distribution of stack depth at block boundary}
 234 \label{tab-1}
 235 \end{center}
 236 \end{table*}
 237
 238
 239 \subsection{Copy elimination}
 240
 241 To eliminate unnecessary copies, the loading of values is delayed until the
 242 instruction is reached which consumes the value. To compute the information
 243 the run time stack is simulated at compile time. Instead of values the
 244 compile time stack contains the type of the value, if a local variable was
 245 loaded to a stack location and similar information. Adl-Tabatabai
 246 \cite{Taba+98} used a dynamic stack which is changed at every instruction.
 247 A dynamic stack only gives the possibility to move information from earlier
 248 instructions to later instructions. We use a static stack structure which
 249 enables information flow in both directions.
 250
 251 Figure~\ref{Trans1} shows our instruction and stack representation. An
 252 instruction has a reference to the stack before the instruction and the
 253 stack after the instruction. The stack is represented as a linked list. The
 254 two stacks can be seen as the source and destination operands of an
 255 instruction. In the implementation only the destination stack is stored,
 256 the source stack is the destination of stack of the previous instruction.
 257
 258 \begin{figure}[htb]
 259 \begin{center}
 260 \setlength{\unitlength}{1mm}
 261 \begin{picture}(25,32)
 262 \put( 0, 0  ){\makebox(10,5){\tt b}}
 263 \put( 5, 2.5){\oval(10,5)}
 264 \put( 5, 9  ){\vector(0,-1){4}}
 265 \put( 0, 9  ){\makebox(10,5){\tt c}}
 266 \put( 5,11.5){\oval(10,5)}
 267 \put( 5,18  ){\vector(0,-1){4}}
 268 \put( 0,18  ){\makebox(10,5){\tt d}}
 269 \put( 5,20.5){\oval(10,5)}
 270 %\put( 5,27  ){\vector(0,-1){4}}
 271 \put(15, 9  ){\makebox(10,5){\tt c*d}}
 272 \put(20,11.5){\oval(10,5)}
 273 \put(20, 9  ){\line(0,-1){2}}
 274 \put( 5, 7  ){\line(1,0){15}}
 275 \put(10,27  ){\framebox(15,5){\tt imul}}
 276 \put(20,27  ){\vector(0,-1){13}}
 277 \put(15,27  ){\line(0,-1){6.5}}
 278 \put(15,20.5){\vector(-1,0){5}}
 279 \end{picture}
 280 \caption{instruction and stack representation}
 281 \label{Trans1}
 282 \end{center}
 283 \end{figure}
 284
 285 This representation can easily be used for copy elimination. Each stack
 286 element not only contains the type of the stack slot but also the local
 287 variable number of which it is a copy, the argument number if it is an
 288 argument, the interface register number if it is an interface. Load (push
 289 the content of a variable onto the stack) and store instructions do
 290 no generate a copy machine instruction if the stack slot contains the same
 291 local variable. Generated machine instructions for arithmetic operations
 292 directly use the local variables as their operands.
 293
 294 There are some pitfalls with this scheme. Take the example of
 295 figure~\ref{Trans2}. The stack bottom contains the local variable {\tt a}.
 296 The instruction {\tt istore a} will write a new value for {\tt a} and will
 297 make a later use of this variable invalid. To avoid this we have to copy
 298 the local variable to a stack variable. An important decision is at which
 299 position the copy instruction should be inserted. Since there is a high
 300 number of {\tt dup} instructions in Java programs (around 4\%) and it is
 301 possible that a local variable resides in memory, the copy should be done
 302 with the {\tt load} instruction. Since the stack is represented as a linked
 303 list only the destination stack has to be checked for occurrences of the
 304 offending variable and these occurrences are replaced by a stack variable.
 305
 306
 307 \begin{figure}[htb]
 308 \begin{center}
 309 \setlength{\unitlength}{1mm}
 310 \begin{picture}(79,32)
 311 \put( 0,27  ){\framebox(15,5){\tt iload a}}
 312 \put(16,27  ){\framebox(13,5){\tt dup}}
 313 \put(30,27  ){\framebox(17,5){\tt iconst 1}}
 314 \put(48,27  ){\framebox(13,5){\tt iadd}}
 315 \put(62,27  ){\framebox(17,5){\tt istore a}}
 316 \put(10,27  ){\vector(0,-1){22}}
 317 \put( 5,27  ){\vector(0,-1){4}}
 318 \put( 5, 0  ){\makebox(10,5){\tt a}}
 319 \put(10, 2.5){\oval(10,5)}
 320 \put(20,27  ){\line(0,-1){2}}
 321 \put(20,25  ){\line(-1,0){10}}
 322 \put(25,27  ){\vector(0,-1){13}}
 323 \put(20, 9  ){\makebox(10,5){\tt a}}
 324 \put(25,11.5){\oval(10,5)}
 325 \put(25, 9  ){\line(0,-1){2}}
 326 \put(10, 7  ){\line(1,0){63}}
 327 \put(36,27  ){\line(0,-1){2}}
 328 \put(36,25  ){\line(-1,0){11}}
 329 \put(41,27  ){\vector(0,-1){4}}
 330 \put(36,18  ){\makebox(10,5){\tt 1}}
 331 \put(41,20.5){\oval(10,5)}
 332 \put(41,18  ){\line(0,-1){2}}
 333 \put(41,16  ){\line(-1,0){16}}
 334 \put(52,27  ){\line(0,-1){2}}
 335 \put(52,25  ){\line(-1,0){11}}
 336 \put(73,27  ){\line(0,-1){20}}
 337 \put(57,27  ){\vector(0,-1){13}}
 338 \put(52, 9  ){\makebox(10,5){\tt +}}
 339 \put(57,11.5){\oval(10,5)}
 340 \put(57,9   ){\line(0,-1){2}}
 341 \put(68,27  ){\line(0,-1){2}}
 342 \put(68,25  ){\line(-1,0){11}}
 343 \end{picture}
 344 \caption{anti dependence}
 345 \label{Trans2}
 346 \end{center}
 347 \end{figure}
 348
 349 To answer the question of how often this could happen and how expensive
 350 the stack search is, we analysed again the {\tt javac} compiler. In more
 351 than 98\% of the cases the stack is empty (see table \ref{tab-2}). In only
 352 0.2\% of the cases the stack depth is higher than 1 and the biggest stack
 353 depth is 3.
 354
 355 \begin{table}[htb]
 356 \begin{center}
 357 \begin{tabular}[b]{|l|c|c|c|c|c|}
 358 \hline
 359 stack depth &   0  &  1  &  2  &  3  & $>$3 \\ \hline
 360 occurrences & 2167 & 31  &  1  &  3  &  0   \\ \hline
 361 \end{tabular}
 362 \caption{distribution of {\tt store} stack depth}
 363 \label{tab-2}
 364 \end{center}
 365 \end{table}
 366
 367 \begin{table*}
 368 \begin{center}
 369 \begin{tabular}[b]{|l|c|c|c|c|c|c|c|c|c|c|}
 370 \hline
 371 chain length &  1  &  2 &  3 &  4 &  5 &  6 &  7 & 8 & 9 & $>$9 \\ \hline
 372 occurrences  & 1892& 62 & 23 & 62 & 30 & 11 & 41 & 9 & 7 &  65  \\ \hline
 373 \end{tabular}
 374 \caption{distribution of creator-store distances}
 375 \label{tab-3}
 376 \end{center}
 377 \end{table*}
 378
 379 To avoid copy instructions when executing a {\tt store} it is necessary to
 380 connect the  creation of a value with the store which consumes it. In that
 381 case a {\tt store} not only can conflict with copies of a local variable
 382 which result from {\tt load} instructions before the creator of the value,
 383 but also with {\tt load} and {\tt store} instructions which exist between
 384 the creation of value and the {\tt store}. In figure~\ref{Trans3} the {\tt
 385 iload a} instruction conflicts with the {\tt istore a} instruction.
 386
 387 \begin{figure}[htb]
 388 \begin{center}
 389 \setlength{\unitlength}{1mm}
 390 \begin{picture}(67,27)
 391 \put( 0,22  ){\framebox(15,5){\tt iadd}}
 392 \put(16,22  ){\framebox(15,5){\tt iload a}}
 393 \put(32,22  ){\framebox(17,5){\tt istore b}}
 394 \put(50,22  ){\framebox(17,5){\tt istore a}}
 395 \put(10,22  ){\vector(0,-1){13}}
 396 \put( 5,22  ){\line(0,-1){2}}
 397 \put( 5,20 ){\line(-1,0){3}}
 398 \put( 2,20  ){\vector(0,-1){20}}
 399 \put(10, 4  ){\line(0,-1){2}}
 400 \put( 5, 4  ){\makebox(10,5){\tt +}}
 401 \put(10, 6.5){\oval(10,5)}
 402 \put(21,22  ){\line(0,-1){2}}
 403 \put(21,20  ){\line(-1,0){11}}
 404 \put(26,22  ){\vector(0,-1){4}}
 405 \put(21,13  ){\makebox(10,5){\tt a}}
 406 \put(26,15.5){\oval(10,5)}
 407 \put(26,13  ){\line(0,-1){2}}
 408 \put(10,11  ){\line(1,0){46}}
 409 \put(38,22  ){\line(0,-1){2}}
 410 \put(38,20  ){\line(-1,0){12}}
 411 \put(43,22  ){\line(0,-1){11}}
 412 \put(56,22  ){\line(0,-1){11}}
 413 \put(61,22  ){\line(0,-1){20}}
 414 \put( 2, 2  ){\line(1,0){59}}
 415 \end{picture}
 416 \caption{anti dependence}
 417 \label{Trans3}
 418 \end{center}
 419 \end{figure}
 420
 421 The anti dependences are detected by checking the stack locations of the
 422 previous instructions for conflicts. Since the stack locations are
 423 allocated as one big array just the stack elements which have a higher
 424 index than the current stack element have to be checked. Table \ref{tab-3}
 425 gives the distribution of the distance between the creation of the value
 426 and the corresponding store. In 86\% of the cases the distance is one.
 427
 428 The output dependences are checked by storing the instruction number of the
 429 last store in each local variable. If a store conflicts due to dependences
 430 the creator places the value in a stack register. Additional dependences
 431 arise because of exceptions. The exception mechanism in Java is precise.
 432 Therefore {\tt store} instructions are not allowed to be executed before
 433 an exception raising instruction. This is checked easily be remembering
 434 the last instruction which could raise an exception. In methods which contain
 435 no exception handler this conflict can be safely ignored because no
 436 exception handler can have access to these variables.
 437
 438
 439 \subsection{Register allocator}
 440
 441 he current register allocator of CACAO is a very simple,
 442 straightforward allocator. It simply assigns free registers with a
 443 \textit{first-come-first-serve} based algorithm. This is mostly
 444 suitable for RISC architectures with large register sets like Alpha or
 445 MIPS with 32 integer registers and 32 floating-point registers. For
 446 these architectures the current register allocator was designed for.
 447
 448 Basically the allocation passes of the register allocator are:
 449
 450 \begin{itemize}
 451  \item interface register allocation
 452  \item scratch register allocation
 453  \item local register allocation
 454 \end{itemize}
 455
 456 The \texttt{javac} compiler also supports this simple
 457 \textit{first-come-first-serve} approach CACAO uses and does a
 458 coloring of the local variables and assigns the same number to
 459 variables which are not active at the same time. The stack variables
 460 have implicitly encoded their live ranges. When a value is pushed, the
 461 live range start. When a value is popped, the live range ends.
 462
 463 Complications arise only with stack manipulation instructions like {\tt dup}
 464 and {\tt swap}. We flag therefore the first creation of a stack variable and
 465 mark a duplicated one as a copy. The register used for this variable can
 466 be reused only after the last copy is popped.
 467
 468 During stack analysis stack variables are marked which have to survive a
 469 method invocation. These stack variables and local variables are assigned
 470 to callee saved registers. If there are not enough registers available,
 471 these variables are allocated in memory.
 472
 473 Efficient implementation of method invocation is crucial to the performance
 474 of Java. Therefore, we preallocate the argument registers and the return
 475 value in a similar way as we handle store instructions. Input arguments (in
 476 Java input arguments are the first variables) for leaf procedures (and
 477 input arguments for processors with register windows) are preassigned, too.
 478
 479 Since CACAO has now also been ported to CISC architectures like IA32
 480 and AMD64, the \textit{first-come-first-serve} register allocator has
 481 hit it's limits. The results produced for an architecture with 8
 482 integer general purpose registers like IA32 or 16 integer general
 483 purpose registers like AMD64, is far from perfect. Further details to
 484 register allocation of these architectures can be found in section
 485 \ref{sectionia32registerallocation} and section
 486 \ref{sectionamd64registerallocation} respectively.
 487
 488 The CACAO development team is currently working on a new register
 489 allocator based on a \textit{linear scan} algorithm. This allocator
 490 should produce much better results on CISC machines than the current
 491 register allocator.
 492
 493
 494 \subsection{Instruction combining}
 495
 496 Together with stack analysis we combine constant loading instructions
 497 with selected instructions which are following immediately. In the
 498 class of combinable instructions are add, subtract, multiply and
 499 divide instructions, logical and shift instructions, compare/branch
 500 and array store instructions.
 501
 502 These combined immediate instructions are:
 503
 504 \begin{itemize}
 505  \item \texttt{ICMD\_IADDCONST}, \texttt{ICMD\_ISUBCONST},
 506  \texttt{ICMD\_IMULCONST}, \texttt{ICMD\_IDIVPOW2},
 507  \texttt{ICMD\_IREMPOW2}
 508
 509  \item \texttt{ICMD\_LADDCONST}, \texttt{ICMD\_LSUBCONST},
 510  \texttt{ICMD\_LMULCONST}, \texttt{ICMD\_LDIVPOW2},
 511  \texttt{ICMD\_LREMPOW2}
 512
 513  \item \texttt{ICMD\_IANDCONST}, \texttt{ICMD\_IORCONST},
 514  \texttt{ICMD\_IXORCONST}
 515
 516  \item \texttt{ICMD\_LANDCONST}, \texttt{ICMD\_LORCONST},
 517  \texttt{ICMD\_LXORCONST}
 518
 519  \item \texttt{ICMD\_ISHLCONST}, \texttt{ICMD\_ISHRCONST},
 520  \texttt{ICMD\_IUSHRCONST}
 521
 522  \item \texttt{ICMD\_LSHLCONST}, \texttt{ICMD\_LSHRCONST},
 523  \texttt{ICMD\_LUSHRCONST}
 524
 525  \item \texttt{ICMD\_IFxx}
 526
 527  \item \texttt{ICMD\_IF\_Lxx}
 528
 529  \item \texttt{ICMD\_xASTORECONST}
 530 \end{itemize}
 531
 532 During code generation the constant is checked if it lies in the range
 533 for immediate operands of the target architecture and appropriate code
 534 is generated.
 535
 536 Arithmetic and logical instructions are processed straightforward. The
 537 intermediate command opcode of the current instruction is changed and
 538 the immediate value from the previous instruction is stored in the
 539 current instruction. The register pressure is always reduced by one
 540 register by this optimization.
 541
 542 \texttt{ICMD\_IDIV} and \texttt{ICMD\_IREM} divisions by a constant
 543 which is a power of two are handled in a special way. They are
 544 converted into \texttt{ICMD\_IDIVPOW2} and \texttt{ICMD\_IREMPOW2}
 545 respectively. For \texttt{ICMD\_IDIVPOW2} an immediate value is
 546 assigned which represents the left shift amount of \texttt{0x1} to get
 547 the divisor value. In the code generation pass a very fast shift-based
 548 machine code can be generated for this instruction. For
 549 \texttt{ICMD\_IREMPOW2} the intermediate value gets one
 550 subtracted. The generated machine code consists of logical
 551 \texttt{and}'s, \texttt{neg}'s and a conditional jump. For both
 552 instructions the generated machine code is much fast than an integer
 553 division. \texttt{ICMD\_LDIV} and \texttt{ICMD\_LREM} intermediate
 554 commands are handled respectively.
 555
 556 \texttt{ICMD\_IxSHx} instructions by a constant value are converted
 557 to \texttt{ICMD\_IxSHxCONST} instructions. Nearly every architecture
 558 has machine shift instructions by a constant value. This optimization
 559 always reduces the register pressure by one
 560 register. \texttt{ICMD\_LxSHx} intermediate commands are converted to
 561 \texttt{ICMD\_LxSHxCONST} commands respectively.
 562
 563 \texttt{ICMD\_IF\_ICMPxx} intermediate commands are converted to
 564 \texttt{ICMD\_IFxx} commands. This commands compare the source
 565 operand directly with an immediate value if possible. The generated
 566 machine code depends on the architecture. On the IA32 or AMD64
 567 architecture the immediate value can always be inlined. On RISC
 568 architectures the immediate value range is limited, like the Alpha
 569 architecture where the immediate value may be between 0 and 255. On
 570 architectures which support conditional branches on a source register,
 571 like Alpha or MIPS, the compare with 0 is optimized to a single
 572 instruction. This optimization can reduce the register pressure by one
 573 register. \texttt{ICMD\_IF\_Lxx} intermediate commands are handled
 574 respectively.
 575
 576 The \texttt{ICMD\_xASTORE} optimization was actually implemented for
 577 the IA32 and AMD64 architecture. These architectures can handle inline
 578 immediate values up to their address pointer size, this means 32-bit
 579 for IA32 and 64-bit for AMD64 respectively. For RISC architectures
 580 which have a \texttt{REG\_ZERO}---a register which always contains the
 581 values zero---this array store optimization can be used only for zero
 582 values. Address array stores---\texttt{ICMD\_AASTORE}---can only be
 583 optimized in the \texttt{null} pointer case because of the dynamic
 584 type check. In this case the optimization not only reduces the
 585 register pressure by one register, but the dynamic type check
 586 subroutine call can be eliminated.
 587
 588
 589 \subsection{Complexity of the algorithm}
 590
 591 The complexity of the algorithm is mostly linear with respect to the number
 592 of instructions and the number of local variables plus the number of stack
 593 slots. There are only a small number of spots where it is not linear.
 594
 595 \begin{itemize}
 596 \item At the begin of a basic block the stack has to be copied to separate
 597       the stacks of different basic blocks. Table \ref{tab-1} shows that
 598       the stack at the boundary of a basic block is in most cases zero.
 599       Therefore, this copying does not influence the linear performance of
 600       the algorithm.
 601 \item A store has to check for a later use of the same variable. Table
 602       \ref{tab-2} shows that this is not a problem, too.
 603 \item A store additionally has to check for the previous use of the same
 604       variable between creation of the value and the store. The distances
 605       between the creation and the use are small (in most case only 1) as
 606       shown by table \ref{tab-3}.
 607 \end{itemize}
 608
 609 Compiling {\tt javac} 29\% of the compile time are spent in parsing and
 610 basic block determination, 18\% are spent in stack analysis, 16\% are spent
 611 in register allocation and 37\% are spent in machine code generation.
 612
 613
 614 \subsection{Example}
 615
 616 Figure \ref{IntermediateStack} shows the intermediate representation and
 617 stack information as produced by the compiler for debugging purposes. The
 618 {\tt Local Table} gives the types and register assignment for the local
 619 variables. The Java compiler reuses the same local variable slot for
 620 different local variables if there life ranges do not overlap. In this
 621 example the variable slot 3 is even used for local variables of different
 622 types (integer and address). The JIT-compiler assigned the saved register
 623 12 to this variable.
 624
 625 One interface register is used in this example entering the basic block
 626 with label {\tt L004}. At the entry of the basic block the interface
 627 register has to be copied to the argument register {\tt A00}. This is one
 628 of the rare cases where a more sophisticated coalescing algorithm could
 629 have allocated an argument register for the interface.
 630
 631 At instruction 2 and 3 you can see the combining of a constant with an
 632 arithmetic instruction. Since the instructions are allocated in an array
 633 the empty slot has to be filled with a {\tt NOP} instruction. The {\tt
 634 ADDCONSTANT} instruction already has the local variable {\tt L02} as
 635 destination, an information which comes from the later {\tt ISTORE} at
 636 number 4. Similarly the {\tt INVOKESTATIC} at number 31 has marked all its
 637 operands as arguments. In this example all copy (beside the one to the
 638 interface register) have been eliminated.
 639
 640 \begin{figure*}[htb]
 641 \begin{center}
 642 \begin{verbatim}
 643   java.io.ByteArrayOutputStream.write (int)void
 644
 645   Local Table:
 646        0:                (adr) S15
 647        1:    (int) S14
 648        2:    (int) S13
 649        3:    (int) S12   (adr) S12
 650
 651   Interface Table:
 652        0:    (int) T24
 653
 654   [                 L00]        0  ALOAD         0
 655   [                 T23]        1  GETFIELD      16
 656   [                 L02]        2  IADDCONST     1
 657   [                 L02]        3  NOP
 658   [                    ]        4  ISTORE        2
 659   [                 L02]        5  ILOAD         2
 660   [             L00 L02]        6  ALOAD         0
 661   [             T23 L02]        7  GETFIELD      8
 662   [             T23 L02]        8  ARRAYLENGTH
 663   [                    ]        9  IF_ICMPLE     L005
 664                                 ...............
 665
 666   [                    ]       18  IF_ICMPLT     L003
 667   [                    ] L002:
 668   [                 I00]       19  ILOAD         3
 669   [                 I00]       20  GOTO          L004
 670   [                    ] L003:
 671   [                 I00]       21  ILOAD         2
 672   [                 A00] L004:
 673   [                 L03]       22  BUILTIN1      newarray_byte
 674   [                    ]       23  ASTORE        3
 675   [                 L00]       24  ALOAD         0
 676   [                 A00]       25  GETFIELD      8
 677   [             A01 A00]       26  ICONST        0
 678   [         A02 A01 A00]       27  ALOAD         3
 679   [     A03 A02 A01 A00]       28  ICONST        0
 680   [ L00 A03 A02 A01 A00]       29  ALOAD         0
 681   [ A04 A03 A02 A01 A00]       30  GETFIELD      16
 682   [                    ]       31  INVOKESTATIC  java/lang/System.arraycopy
 683   [                 L00]       32  ALOAD         0
 684   [             L03 L00]       33  ALOAD         3
 685   [                    ]       34  PUTFIELD      8
 686   [                    ] L005:
 687                                ...............
 688
 689   [                    ]       45  RETURN
 690 \end{verbatim}
 691 \caption{Example: intermediate instructions and stack contents}
 692 \label{IntermediateStack}
 693 \end{center}
 694 \end{figure*}
 695
 696
 697 \section{Compiling a Java method}
 698
 699 The CACAO JIT compiler is invoked via the
 700
 701 \begin{verbatim}
 702         methodptr jit_compile(methodinfo *m);
 703 \end{verbatim}
 704
 705 function call. This function is just a wrapper function to the
 706 internal compiler function
 707
 708 \begin{verbatim}
 709         static methodptr jit_compile_intern(methodinfo *m);
 710 \end{verbatim}
 711
 712 The argument of the compiler function is a pointer to a
 713 \texttt{methodinfo} structure (see figure \ref{methodinfostructure})
 714 allocated by the system class loader. This function should not be
 715 called directly and thus is declared \texttt{static} because the
 716 wrapper function has to ensure some conditions:
 717
 718 \begin{itemize}
 719  \item enter a monitor on the \texttt{methodinfo} structure to make
 720  sure that only one thread can compile the same Java method at the
 721  same time
 722
 723  \item check if the method already has a \texttt{entrypoint}, if so
 724  the monitor is left and the entrypoint is returned
 725
 726  \item measure the compiling time if requested
 727
 728  \item call the internal compiler function
 729
 730  \item leave the monitor and return the functions' \texttt{entrypoint}
 731 \end{itemize}
 732
 733 The internal compiler function \texttt{jit\_compile\_intern} does the
 734 actual compilation of the Java method. It calls the different passes
 735 of the JIT compiler.
 736
 737 If the passed Java method does not have a \textit{Code Attribute} (see
 738 \ref{sectionmethodloading}) a \texttt{methodptr} to a
 739 \texttt{do\_nothing\_function} is returned.
 740
 741 If the method has the \texttt{ACC\_STATIC} flag bit set and the
 742 methods' class is not yet initialized, \texttt{class\_init} is called
 743 with the methods' class as argument
 744
 745 Then the compiler passes are called:
 746
 747 \begin{enumerate}
 748
 749  \item \texttt{reg\_init}: initializes the register allocator
 750
 751   \begin{itemize}
 752
 753    \item allocates the \texttt{registerdata} structure
 754
 755    \item calculate the number of callee saved, temporary and argument
 756    registers
 757
 758   \end{itemize}
 759
 760  \item \texttt{reg\_setup}: sets up the register allocator data which
 761  is changed in every compiler run
 762
 763  \item \texttt{codegen\_setup}: initializes the code generator
 764
 765   \begin{itemize}
 766
 767    \item allocates the \texttt{codegendata} structure
 768
 769    \item allocate code and data memory
 770
 771   \end{itemize}
 772
 773  \item \texttt{parse}: parse pass
 774
 775   \begin{itemize}
 776
 777    \item parse the Java Virtual Machine instructions and convert them
 778    into CACAO intermediate commands
 779
 780    \item determine basic blocks
 781
 782   \end{itemize}
 783
 784  \item \texttt{analyse\_stack}: analyse stack pass
 785
 786  \item \texttt{regalloc}: register allocation pass
 787
 788  \item \texttt{codegen}: code generation pass
 789
 790  \item \texttt{reg\_close}: release all allocated register allocator
 791  memory
 792
 793  \item \texttt{codegen\_close}: release all allocated code generator
 794  memory
 795
 796 \end{enumerate}
 797
 798 After all compiler passes were run and no exception or error occured,
 799 the \texttt{entrypoint} of the compiled method is returned.
 800
 801 The CACAO JIT compiler is designed to be reentrant. This design
 802 decision was taken to easily support exception throwing during one of
 803 the compiler passes and to support concurrent compilation in different
 804 threads running. Concurrent compilation can speed up startup and run
 805 time especially on multi processor machines.