+The output dependences are checked by storing the instruction number of the
+last store in each local variable. If a store conflicts due to dependences
+the creator places the value in a stack register. Additional dependences
+arise because of exceptions. The exception mechanism in Java is precise.
+Therefore {\tt store} instructions are not allowed to be executed before
+an exception raising instruction. This is checked easily be remembering
+the last instruction which could raise an exception. In methods which contain
+no exception handler this conflict can be safely ignored because no
+exception handler can have access to these variables.
+
+
+\subsection{Register allocator}
+
+he current register allocator of CACAO is a very simple,
+straightforward allocator. It simply assigns free registers with a
+\textit{first-come-first-serve} based algorithm. This is mostly
+suitable for RISC architectures with large register sets like Alpha or
+MIPS with 32 integer registers and 32 floating-point registers. For
+these architectures the current register allocator was designed for.
+
+Basically the allocation passes of the register allocator are:
+
+\begin{itemize}
+ \item interface register allocation
+ \item scratch register allocation
+ \item local register allocation
+\end{itemize}
+
+The \texttt{javac} compiler also supports this simple
+\textit{first-come-first-serve} approach CACAO uses and does a
+coloring of the local variables and assigns the same number to
+variables which are not active at the same time. The stack variables
+have implicitly encoded their live ranges. When a value is pushed, the
+live range start. When a value is popped, the live range ends.
+
+Complications arise only with stack manipulation instructions like {\tt dup}
+and {\tt swap}. We flag therefore the first creation of a stack variable and
+mark a duplicated one as a copy. The register used for this variable can
+be reused only after the last copy is popped.
+
+During stack analysis stack variables are marked which have to survive a
+method invocation. These stack variables and local variables are assigned
+to callee saved registers. If there are not enough registers available,
+these variables are allocated in memory.
+
+Efficient implementation of method invocation is crucial to the performance
+of Java. Therefore, we preallocate the argument registers and the return
+value in a similar way as we handle store instructions. Input arguments (in
+Java input arguments are the first variables) for leaf procedures (and
+input arguments for processors with register windows) are preassigned, too.
+
+Since CACAO has now also been ported to CISC architectures like IA32
+and AMD64, the \textit{first-come-first-serve} register allocator has
+hit it's limits. The results produced for an architecture with 8
+integer general purpose registers like IA32 or 16 integer general
+purpose registers like AMD64, is far from perfect. Further details to
+register allocation of these architectures can be found in section
+\ref{sectionia32registerallocation} and section
+\ref{sectionamd64registerallocation} respectively.
+
+The CACAO development team is currently working on a new register
+allocator based on a \textit{linear scan} algorithm. This allocator
+should produce much better results on CISC machines than the current
+register allocator.
+
+
+\subsection{Instruction combining}
+
+Together with stack analysis we combine constant loading instructions
+with selected instructions which are following immediately. In the
+class of combinable instructions are add, subtract, multiply and
+divide instructions, logical and shift instructions, compare/branch
+and array store instructions.
+
+These combined immediate instructions are:
+
+\begin{itemize}
+ \item \texttt{ICMD\_IADDCONST}, \texttt{ICMD\_ISUBCONST},
+ \texttt{ICMD\_IMULCONST}, \texttt{ICMD\_IDIVPOW2},
+ \texttt{ICMD\_IREMPOW2}
+
+ \item \texttt{ICMD\_LADDCONST}, \texttt{ICMD\_LSUBCONST},
+ \texttt{ICMD\_LMULCONST}, \texttt{ICMD\_LDIVPOW2},
+ \texttt{ICMD\_LREMPOW2}
+
+ \item \texttt{ICMD\_IANDCONST}, \texttt{ICMD\_IORCONST},
+ \texttt{ICMD\_IXORCONST}
+
+ \item \texttt{ICMD\_LANDCONST}, \texttt{ICMD\_LORCONST},
+ \texttt{ICMD\_LXORCONST}
+
+ \item \texttt{ICMD\_ISHLCONST}, \texttt{ICMD\_ISHRCONST},
+ \texttt{ICMD\_IUSHRCONST}
+
+ \item \texttt{ICMD\_LSHLCONST}, \texttt{ICMD\_LSHRCONST},
+ \texttt{ICMD\_LUSHRCONST}
+
+ \item \texttt{ICMD\_IFxx}
+
+ \item \texttt{ICMD\_IF\_Lxx}
+
+ \item \texttt{ICMD\_xASTORECONST}
+\end{itemize}
+
+During code generation the constant is checked if it lies in the range
+for immediate operands of the target architecture and appropriate code
+is generated.
+
+Arithmetic and logical instructions are processed straightforward. The
+intermediate command opcode of the current instruction is changed and
+the immediate value from the previous instruction is stored in the
+current instruction. The register pressure is always reduced by one
+register by this optimization.
+
+\texttt{ICMD\_IDIV} and \texttt{ICMD\_IREM} divisions by a constant
+which is a power of two are handled in a special way. They are
+converted into \texttt{ICMD\_IDIVPOW2} and \texttt{ICMD\_IREMPOW2}
+respectively. For \texttt{ICMD\_IDIVPOW2} an immediate value is
+assigned which represents the left shift amount of \texttt{0x1} to get
+the divisor value. In the code generation pass a very fast shift-based
+machine code can be generated for this instruction. For
+\texttt{ICMD\_IREMPOW2} the intermediate value gets one
+subtracted. The generated machine code consists of logical
+\texttt{and}'s, \texttt{neg}'s and a conditional jump. For both
+instructions the generated machine code is much fast than an integer
+division. \texttt{ICMD\_LDIV} and \texttt{ICMD\_LREM} intermediate
+commands are handled respectively.
+
+\texttt{ICMD\_IxSHx} instructions by a constant value are converted
+to \texttt{ICMD\_IxSHxCONST} instructions. Nearly every architecture
+has machine shift instructions by a constant value. This optimization
+always reduces the register pressure by one
+register. \texttt{ICMD\_LxSHx} intermediate commands are converted to
+\texttt{ICMD\_LxSHxCONST} commands respectively.
+
+\texttt{ICMD\_IF\_ICMPxx} intermediate commands are converted to
+\texttt{ICMD\_IFxx} commands. This commands compare the source
+operand directly with an immediate value if possible. The generated
+machine code depends on the architecture. On the IA32 or AMD64
+architecture the immediate value can always be inlined. On RISC
+architectures the immediate value range is limited, like the Alpha
+architecture where the immediate value may be between 0 and 255. On
+architectures which support conditional branches on a source register,
+like Alpha or MIPS, the compare with 0 is optimized to a single
+instruction. This optimization can reduce the register pressure by one
+register. \texttt{ICMD\_IF\_Lxx} intermediate commands are handled
+respectively.
+
+The \texttt{ICMD\_xASTORE} optimization was actually implemented for
+the IA32 and AMD64 architecture. These architectures can handle inline
+immediate values up to their address pointer size, this means 32-bit
+for IA32 and 64-bit for AMD64 respectively. For RISC architectures
+which have a \texttt{REG\_ZERO}---a register which always contains the
+values zero---this array store optimization can be used only for zero
+values. Address array stores---\texttt{ICMD\_AASTORE}---can only be
+optimized in the \texttt{null} pointer case because of the dynamic
+type check. In this case the optimization not only reduces the
+register pressure by one register, but the dynamic type check
+subroutine call can be eliminated.
+
+
+\subsection{Complexity of the algorithm}
+
+The complexity of the algorithm is mostly linear with respect to the number
+of instructions and the number of local variables plus the number of stack
+slots. There are only a small number of spots where it is not linear.
+
+\begin{itemize}
+\item At the begin of a basic block the stack has to be copied to separate
+ the stacks of different basic blocks. Table \ref{tab-1} shows that
+ the stack at the boundary of a basic block is in most cases zero.
+ Therefore, this copying does not influence the linear performance of
+ the algorithm.
+\item A store has to check for a later use of the same variable. Table
+ \ref{tab-2} shows that this is not a problem, too.
+\item A store additionally has to check for the previous use of the same
+ variable between creation of the value and the store. The distances
+ between the creation and the use are small (in most case only 1) as
+ shown by table \ref{tab-3}.
+\end{itemize}
+
+Compiling {\tt javac} 29\% of the compile time are spent in parsing and
+basic block determination, 18\% are spent in stack analysis, 16\% are spent
+in register allocation and 37\% are spent in machine code generation.
+
+
+\subsection{Example}
+
+Figure \ref{IntermediateStack} shows the intermediate representation and
+stack information as produced by the compiler for debugging purposes. The
+{\tt Local Table} gives the types and register assignment for the local
+variables. The Java compiler reuses the same local variable slot for
+different local variables if there life ranges do not overlap. In this
+example the variable slot 3 is even used for local variables of different
+types (integer and address). The JIT-compiler assigned the saved register
+12 to this variable.
+
+One interface register is used in this example entering the basic block
+with label {\tt L004}. At the entry of the basic block the interface
+register has to be copied to the argument register {\tt A00}. This is one
+of the rare cases where a more sophisticated coalescing algorithm could
+have allocated an argument register for the interface.
+
+At instruction 2 and 3 you can see the combining of a constant with an
+arithmetic instruction. Since the instructions are allocated in an array
+the empty slot has to be filled with a {\tt NOP} instruction. The {\tt
+ADDCONSTANT} instruction already has the local variable {\tt L02} as
+destination, an information which comes from the later {\tt ISTORE} at
+number 4. Similarly the {\tt INVOKESTATIC} at number 31 has marked all its
+operands as arguments. In this example all copy (beside the one to the
+interface register) have been eliminated.
+