doc/handbook/s390.tex

   1 %\newcommand{\code}{\texttt}
   2
   3 %\chapter{The Just-In-Time Compiler}
   4
   5 \section{ESA/390 code generator}
   6
   7 \subsection{About s390 Architecture}
   8
   9 IBM's mainframe architecture s390 is the living architecture with the longest heritage, defined in a time where assembler programming was predominant and compilers where in their childhood. This fact made already porting of the GNU Compiler Collection a difficult task, as reported by the developers in \cite{s390:bib:gcc}. However the existence of a paper describing the experiences and difficulties encountered while porting an existing C compiler to the s390 architecture made the task of porting CACAO a little easier.
  10
  11 s390 being a CISC architecture provides an extensive instruction set defined in \cite{s390:bib:principles}. There is a wide range of I/O related instructions and control instructions used by the operating system. For the CACAO port, only a small subset of instructions is from interest. \cite{s390:bib:principles} categorises them into into three sections: general instructions, floating point instructions and binary floating point instructions.
  12
  13 The architecture defines 16 general purpose registers and 16 floating point registers. Instructions have variable length, which always is a multiple of 2 bytes. There are 2, 4 and 6 bytes long instructions and more than 30 \cite{s390:bib:gcc} instruction formats defined. Most instructions take 2 operands: the first operand being also the destination one of the operation. Most arithmetic instructions exist in two flavours: in the RR format they take two register operands, while in the RX format they take one register and one memory operand. There is also the RI format taking one register and one signed 16 bit immediate operand.
  14
  15 For conditional execution, there exists a 2 bit condition code. A lot of instructions set this condition code depending on their result to a value defined for each instruction individually. The \emph{branch on condition} and \emph{branch relative on condition} instructions take a 4 bit bitmask, containing one bit for each possible condition code value. The branch is taken if the bit corresponding the the current condition code value is set. To branch unconditionnaly, one sets all bits in the bitmask. A \emph{branch on condition} instruction with all bits cleared in the bitmask is a nop instruction.
  16
  17 \subsection{The ELF ABI}
  18
  19 The \emph{application binary interface} (ABI) for GNU/Linux, the target operating system of the s390 CACAO port is defined in \cite{s390:bib:abi} and referred to as \emph{ELF ABI}.
  20
  21 The \emph{ELF ABI} defines a system interface for compiled application programs with the purpose of establishing a standard binary interface for application programs on Linux for s390~\cite{s390:bib:abi}. The part of interest for this work is the function calling sequence, that defines register usage, stack layout and the way function parameters are passed. These are summarised in the following tables: table~\ref{s390:tbl:elf:intreg} for the general purpose register usage, table~\ref{s390:tbl:elf:fpreg} for the floating point register usage and table~\ref{s390:tbl:elf:stack} for the layout of a stack frame. To illustrate the typical function prologue, epilogue and calling conventions, figure~\ref{s390:fig:elf:hello} lists the assembly code of a simple function printing the string ``Hello world''.
  22
  23 \begin{table}
  24 \centering
  25 \begin{tabular}{|l|l|l|}
  26 \hline
  27 Register      & Usage                                                  & Call effect \\
  28 \hline
  29 \%r0, \%r1    & General purpose                                        & Volatile \\
  30 \%r2, \%r3    & Parameter passing and return values                    & Volatile \\
  31 \%r4, \%r5    & Parameter passing                                      & Volatile \\
  32 \%r6          & Parameter passing                                      & Saved \\
  33 \%r7 - \%r11  & Local variables                                        & Saved \\
  34 \%r12         & Local variable, commonly used as GOT pointer           & Saved \\
  35 \%r13         & Local variable, commonly used as Literal Pool pointer  & Saved \\
  36 \%r14         & Return address                                         & Volatile \\
  37 \%r15         & Stack pointer                                          & Saved \\
  38 \hline
  39 \end{tabular}
  40 \caption{General purpose register usage in the \emph{ELF ABI}}
  41 \label{s390:tbl:elf:intreg}
  42 \end{table}
  43
  44 \begin{table}
  45 \centering
  46 \begin{tabular}{|l|l|l|}
  47 \hline
  48 Register                       & Usage                                  & Call effect \\
  49 \hline
  50 \%f0, \%f2                     & Parameter passing and return values    & Volatile \\
  51 \%f4, \%f6                     & General purpose                        & Saved \\
  52 \%f1, \%f3, \%f5, \%f7 – \%f15 & General purpose                        & Volatile \\
  53 \hline
  54 \end{tabular}
  55 \caption{Floating point register usage in the \emph{ELF ABI}}
  56 \label{s390:tbl:elf:fpreg}
  57 \end{table}
  58
  59 \begin{table}
  60 \centering
  61 \begin{tabular}{|l|l|l|}
  62 \hline
  63 Use & Size & Offset \\
  64 \hline
  65 Local and spill variable area of calling function & Varies & Varies \\
  66 On stack parameters passed to called function & Varies & 96 \\
  67 Register save area for called function use & 88 & 8 \\
  68 Reserved for compiler use & 4 & 4 \\
  69 Back chain (optional pointer to previous frame) & 4 & 0, 8 byte aligned \\
  70 \hline
  71 \end{tabular}
  72 \caption{Stackframe layout in the \emph{ELF ABI}}
  73 \label{s390:tbl:elf:stack}
  74 \end{table}
  75
  76 \begin{figure}
  77 \begin{verbatim}
  78 .data
  79 L_data_hello:
  80     .asciz "Hello world\n"
  81 .text
  82
  83 .globl hello
  84
  85 hello:
  86     ; Store callee saved registers in register save area.
  87     ; The ELF ABI requires the registers to be stored at particular offsets.
  88     ; %r14 is not callee saved, but contains the return address.
  89     stm     %r11, %r15, 44(%r15)
  90     ; Setup literal pool pointer in %r13 and jump over literal pool.
  91     bras    %r13, L_literal_pool_end
  92
  93         ; Literal pool.
  94 L_literal_pool:
  95 L_hello:
  96     .long  L_data_hello
  97 L_printf:
  98     .long  printf
  99 L_literal_pool_end:
 100
 101     ; Allocate mandatory register save area for callee.
 102     ahi    %r15, -96
 103     ; Load pointer to string as argument from literal pool.
 104     l      %r2, L_hello - L_literal_pool(%r13)
 105     ; Load address of printf function from literal pool.
 106     l      %r1, L_printf - L_literal_pool(%r13)
 107     ; Save program counter of next instruction in %r14 and jump to address in %r1
 108     basr   %r14, %r1
 109
 110     ; Restore callee saved registers from stack.
 111     ; Restore return address as well
 112     lm     %r11, %r15, 96 + 44(%r15)
 113     ; Branch to return address
 114     br     %r14
 115 \end{verbatim}
 116 \caption{Function that prints the string ``Hello world'' using the \emph{ELF ABI}}
 117 \label{s390:fig:elf:hello}
 118 \end{figure}
 119
 120 \subsection{The JIT ABI}
 121 \label{s390:sec:jitabi}
 122
 123 In just-in-time compiled Java code we try to stay as close as possible to the requirements of the \emph{ELF ABI}. However, some conventions of the \emph{ELF ABI} are violated for various reasons and an alternate ABI, the \emph{JIT ABI} was defined.
 124
 125 Tables~\ref{s390:tbl:jit:intreg} and~\ref{s390:tbl:jit:fpreg} show the general purpose register and floating point register usage in the \emph{JIT ABI} respectively. The stack frame layout in the \emph{JIT ABI} is shown in table~\ref{s390:tbl:jit:stack}.
 126
 127 \begin{table}[H]
 128         \centering
 129         \begin{tabular}{|c|c|c|}
 130         \hline
 131         Register & Use & CACAO mnemonic \\
 132         \hline
 133         \%r0 & Scratch register & ITMP3, ITMP3\_XPTR \\
 134         \%r1 & Scratch register & ITMP1, ITMP1\_XPC, METHODPTR \\
 135         \%r2 & Parameter passing, return value & RESULT \\
 136         \%r3 & Parameter passing, return value & RESULT2 \\
 137         \%r4 - \%r6 & Parameter passing & - \\
 138         \%r7 - \%r12 & Callee saved & - \\
 139         \%r13 & Procedure vector & PV \\
 140         \%r14 & Return address, scratch register & RA, ITMP2 \\
 141         \%r15 & Stack pointer & SP \\
 142         \hline
 143         \end{tabular}
 144         \caption{General purpose register usage in the \emph{JIT ABI}}
 145         \label{s390:tbl:jit:intreg}
 146 \end{table}
 147
 148 \begin{table}[H]
 149         \centering
 150         \begin{tabular}{|c|c|c|}
 151         \hline
 152         Register & Use & CACAO mnemonic \\
 153         \hline
 154         \%f0 & Parameter passing, return value & FRESULT \\
 155         \%f2 & Parameter passing & - \\
 156         \%f4 & Scratch register & FTMP1 \\
 157         \%f6 & Scratch register & FTMP2 \\
 158         \%f3, \%f5, \%f7 - \%f15 & Caller saved & - \\
 159         \hline
 160         \end{tabular}
 161         \caption{Floating point register usage in the \emph{JIT ABI}}
 162         \label{s390:tbl:jit:fpreg}
 163 \end{table}
 164
 165 \begin{table}[H]
 166         \centering
 167         \begin{tabular}{|c|c|c|}
 168         \hline
 169         Use & Size & Offset \\
 170         \hline
 171         Return address & 1 slot & Stackframesize - 8 \\
 172         Callee saved integer registers & 0 - 5 slots & \\
 173         Callee saved floating point registers & Currently there are none & \\
 174         Temporary slot & 1 slot & \\
 175         \code{monitorenter} argument & 1 slot & \\
 176         Local variables & 0 to n slots & 0 \\
 177         \hline
 178         \end{tabular}
 179         \caption{Stackframe layout in the \emph{JIT ABI}}
 180         \label{s390:tbl:jit:stack}
 181 \end{table}
 182
 183 Register \code{\%r0} is treated by some instructions in a special way. The \emph{branch and save operation} for example takes two registers as operands. The second operand is used as branch address. If however register \code{\%r0} is passed as second operand, the operation is performed without branching. But the most notable special treatment is that \code{\%r0} can't be used as base or index register for storage operands.
 184
 185 To make the \emph{JIT} and \emph{ELF ABI}s compatible, there were two options for the use of the limited \code{\%r0} register: use it as scratch register or as caller saved temporary register. It was first used as caller saved register, what turned out to be a mistake. The final version of the \emph{JIT ABI} uses it as scratch register for the following reasons:
 186
 187 \begin{itemize}
 188         \item For some instructions it is handy to have a consecutive \emph{even-odd pair} of scratch registers.
 189         \item Using \code{\%r0} as general purpose register leads to a lot of corner cases in the code generation algorithm.
 190         \item Among all scratch registers, \code{REG\_ITMP3} is by convention the one used the least likely. So the best allocation for a register with limited usability is \code{REG\_ITMP3}.
 191 \end{itemize}
 192
 193 To illustrate the typical prologue, epilogue and calling conventions in just-in-time code, consider the typical \emph{Hello world} program as shown in figure~\ref{s390:fig:jit:helloj}. The assembly code it is translated into is listed in figure~\ref{s390:fig:jit:hellojdis}.
 194
 195 \begin{figure}
 196 \begin{verbatim}
 197 class Hello {
 198     public static void main(String[] args) {
 199         System.out.println("Hello world");
 200     }
 201 }
 202 \end{verbatim}
 203
 204 \caption{Source code of Java program printing the string ``Hello world''}
 205 \label{s390:fig:jit:helloj}
 206 \end{figure}
 207
 208 \begin{figure}
 209 \begin{verbatim}
 210 ahi    %r13,-4092       ; Offset procedure vector
 211 ahi    %r15,-16         ; Allocate stack frame
 212 st     %r14,12(%r15)    ; Store return address
 213 l      %r1,4052(%r13)   ; Get java.lang.System.out into argument register #1
 214 l      %r2,0(%r1)
 215 l      %r3,4048(%r13)   ; Load "Hello world" from data segment into argument register #2
 216 l      %r12,0(%r2)      ; Load virtual function table pointer of java.lang.System.out
 217 l      %r13,192(%r12)   ; Load method address from virtual function table
 218 basr   %r14,%r13        ; Call method
 219 basr   %r13,%r0         ; Restore procedure vector
 220 ahi    %r13,-4128
 221 l      %r14,12(%r15)    ; Load return address
 222 ahi    %r15,16          ; Remove stack frame
 223 br     %r14             ; Return
 224 bc     0,0              ; NOP
 225 \end{verbatim}
 226 \caption{Disassembly of translated java program printing the string ``Hello world''}
 227 \label{s390:fig:jit:hellojdis}
 228 \end{figure}
 229
 230 \subsection{Register allocation}
 231
 232 For s390 the \emph{simplereg} register allocator was not changed at all and worked out of the box. But as described in section~\ref{s390:sec:long}, there is still potential for optimising long arithmetics by tuning the register allocator for s390.
 233
 234 \subsection{Address constants}
 235
 236 On s390, while registers are 32 bit wide, the width of addresses is 31 bit. If the CPU uses addresses given as 32 bit values, it always ignores the most significant bit of the value. This showed to be a problem in the machine independent part of the code base that uses addresses as lookup keys, like the AVL tree of all methods, because each memory location can be address using two different 32 bit pointers. This code had to be extended to always clear the most significant bit of the passed pointers, if compiled for s390.
 237
 238 A statistical analysis of the static instruction frequency shows that 8\% of all generated intermediate instructions are \code{ACONST}. An \code{ACONST} loads an address literal into the destination operated. As 8\% is fairly a lot, two approaches were considered to implement this instruction:
 239
 240 \begin{itemize}
 241         \item Put the address constant on the data segment and implement \code{ACONST} as a load from the data segment.
 242         \item Load the address using two immediate 16 bit loads and one 16 bit left shift.
 243 \end{itemize}
 244
 245 When running the benchmarks, no considerable difference could be observed. Because of the jitter in the benchmark run times, none of the two approaches took the lead. So we decided to always load address constants from the data segment, with the exception of the \code{null} pointer, that can be satisfied with a single immediate load. The GNU C Compiler loads address constants from the literal pool too.
 246
 247 \subsection{Integer arithmetics}
 248
 249 s390 provides native instructions for 32 bit arithmetics. These include addition, substraction, multiplication, division, reminder, bitwise logical operations, arithmetic and logical shift. Most of the instructions can handle either two register operands, or one register and one memory operand. So all 32 bit arithmetics could be implemented inline.
 250
 251 Java's 8 bit signed integer type \code{byte} and 16 bit unsigned integer type \code{char} are both handled through 32 bit arithmetics. Internally they are always represented as 32 bit values in both storage and registers. The only exception are arrays of those types. Here the array loads must sign extend the loaded values, and the array stores must store only the respective number of bytes into memory.
 252
 253 \subsection{Long arithmetics}
 254 \label{s390:sec:long}
 255
 256 Java's 64 bit integer type \code{long} is stored either as 64 bit value in storage or in two 32 bit registers. Almost all long instructions are inline except \code{LMUL}, \code{LDIV} and \code{LREM} that are delegated to builtins. Addition is performed as one 32 bit addition followed by a second 32 bit addition with carry, substraction works in a similar way. s390 provides 64 bit shift instructions that are used to implement the \code{I2L} intermediate instruction and long shift operations. Those shift instructions always operate on an \emph{even-odd pair of registers}. If not already in such a register pair, the value has to be copied into \code{REG\_ITMP31\_PACKED}. This is actually always the case with the current register allocator, because it allocates the least significant half of the 64 bit value to the lower register of the pair. This could however be simply improved by making the register allocator always prefer an \emph{even-odd pair of registers} for the allocation of 64 bit variables.
 257
 258 \subsection{Floating point arithmetics}
 259
 260 s390 defines 16 64 bit wide floating point registers, that can hold single or double precision IEEE floating point values. Two of those registers can be combined to hold an extended precision IEEE floating value, but this feature is not needed for CACAO. Note that not all 16 registers must be available: registers \code{\%f0}, \code{\%f2}, \code{\%f4}, \code{\%f6} are available on all ESA/390 models. The remaining registers are referred to as additional \emph{floating-point registers} (AFR) and they are present only if the \emph{basic-floating-point-extensions facility} is installed. The current code base blindly assumes that all 16 registers are available. We took care to allocate the AFRs only as temporary registers, thus in future releases we can easily support models without AFRs by simply marking the respective registers as reserved in the register descriptor arrays (\code{nregdescfloat}) and by not storing and restoring them at \emph{ELF ABI} - \emph{CACAO ABI} boundaries.
 261
 262 Java floating point semantics require a IEEE 754 \emph{round to nearest mode}. The rounding mode is controlled via the floating point control register \code{fpc}. On Linux/s390 the \code{fpc} register of newly started processes is always initialised to 0, meaning \emph{round to nearest} \cite{s390:bib:abi}, so no extra effort needs to be made here.
 263
 264 \subsection{Machine code generation}
 265
 266 The usual way to generate machine code in CACAO is to use code generation macros defined in the architecture dependent header file \code{codegen.h}. For example on alpha, to generate machine code equivalent to the following assembly code:
 267
 268 \begin{verbatim}
 269 ild  16(%r9), %r10
 270 \end{verbatim}
 271
 272 one would place the following macro into the code generation algorithm:
 273
 274 \begin{verbatim}
 275 M_ILD(R9, 16, R10)
 276 \end{verbatim}
 277
 278 On s390, the equivalent instruction in assembly language would be:
 279
 280 \begin{verbatim}
 281 l    %r10, 16(%r9)
 282 \end{verbatim}
 283
 284 and would lead to definition of the following macro:
 285
 286 \begin{verbatim}
 287 M_L(R10, 16, 0, R9)
 288 \end{verbatim}
 289
 290 For ease of maintenance there is an effort to keep all code generators as similar as possible. The proffered signature of the code generation macros is the one used in the alpha code generator. To honour this requirement but keep the possibility of using special s390 instructions not available in the alpha instruction set, the code generation macros were defined in two layers:
 291
 292 The instructions prefixed with the prefix \code{N\_} follow exactly to the s390 naming and conventions as defined in~\cite{s390:bib:principles}.
 293
 294 \begin{verbatim}
 295 N_L(dest, disp, index, base)
 296 \end{verbatim}
 297
 298 The widely used code generation macros prefixed with an \code{M\_} are an alpha compatibility layer. They define alpha-like code generation macros in terms of \code{N\_} prefixed code generation macros.
 299
 300 \begin{verbatim}
 301 #define M_ALD(base, disp, dest) N_L(dest, disp, 0, base)
 302 \end{verbatim}
 303
 304 To prevent subtle errors, if CACAO is compiled in debug mode, all code generation macros validate the passed parameters. In release mode, no run-time checks are performed.
 305
 306 As already mentioned in section~\ref{s390:sec:jitabi}, some instructions don't accept register \code{\%r0} as operand, but rather treat it as \emph{optional operand not given}. Again, to prevent subtle errors, if CACAO is compiled in debug mode, the corresponding code generation macros don't accept \code{\%r0} as operand. Instead a special value, \code{RN}, meaning \emph{register none} must be given. In release mode, \code{RN} just expands to \code{\%r0} and no run-time checks are performed.
 307
 308 \subsection{Exception handling}
 309
 310 In contrast to other ports, in the s390 one we try to perform most of exception handling in C code. A port is required to implement an \code{asm\_handle\_exception} function in assembly language that is called only from JIT code with special conventions: register \code{xptr} contains the exception object and register \code{xpc} contains the program counter of the failing instruction, which can be mapped to a java byte code index. All other registers are required to be left unchanged. The s390 version is a thin wrapper around a C function \code{md\_handle\_exception} and does the following:
 311
 312 \begin{itemize}
 313         \item Allocates 3 arrays on the stack: \code{regs[16]}, \code{fregs[16]}, \code{out[3]}.
 314         \item Stores special, temporary and argument integer registers into \code{regs}.
 315         \item Stores temporary and argument floating point registers into \code{fregs}.
 316         \item Calls \code{md\_handle\_exception} with the 3 arrays as arguments.
 317         \item \code{md\_handle\_exception} while handling the exception computes new values for registers and puts them into the respective arrays. This eventually includes a new stack pointer, a new procedure vector, new values for callee saved registers and a new program counter.
 318         \item Restores registers from the respective arrays.
 319         \item Jumps to the address contained in register \code{xpc}.
 320 \end{itemize}
 321
 322 \subsection{Exception raising}
 323
 324 Currently there are two approaches to raise an exception in JIT code:
 325
 326 \begin{itemize}
 327         \item A call to \code{asm\_handle\_exception}.
 328         \item Make the JIT code produce a hardware exception, and call \code{asm\_handle\_exception} from a signal handler.
 329 \end{itemize}
 330
 331 The first approach requires special stub code to be generated for the call while the second requires generation of very little code: a single instruction or even none at all. Handling of hardware exceptions on the other side involves the kernel and a context switch and can therefore be a costy operation. But because of the fundamental property of exceptions being rare, we prefer to reduce the code size at the price of occasional performance penalty.
 332
 333 Consider for example the assembly code corresponding to the load an object's field in figure~\ref{s390:fig:exceptionrise}. If the value in \code{\%r10} points to a valid java object, the field value is loaded. If however the value in \code{\%r10} is a \code{null} pointer, the process receives a \code{SIGSEGV} signal and the operating system passes control to a function \code{md\_handle\_sigsegv}, the registered handler for this signal. This signal handler has access to all registers and the faulting program counter as well. It examines the opcode of the faulting instruction to see whether it corresponds to the use of a \code{null} pointer. If so, an exception object is allocated and put into the \code{xpc} register, \code{xptr} is set to the faulting address and program flow is resumed at \code{asm\_handle\_exception}.
 334
 335 \begin{figure}[H]
 336 \begin{verbatim}
 337 l %r9, 36(%r10)
 338 \end{verbatim}
 339 \caption{Load of an object's field with the potential of rising a \code{NullPointerException}.}
 340 \label{s390:fig:exceptionrise}
 341 \end{figure}
 342
 343 For explicit exception raising, we have defined the \code{ill} pseudo instruction in RR format. Its mnemonic is defined if figure~\ref{s390:fig:ill} and its machine code format in table~\ref{s390:tbl:ill}.
 344
 345 \begin{figure}
 346 \centering
 347 \begin{verbatim}
 348 ill register, exception_number
 349 \end{verbatim}
 350 \caption{Mnemonic of the \code{ill} pseudo instruction}
 351 \label{s390:fig:ill}
 352 \end{figure}
 353
 354 \begin{table}
 355 \centering
 356 \begin{tabular}{|c|c|c|c|}
 357         \hline
 358         Bits & 0 - 7 & 8 - 11 & 12 - 15 \\
 359         \hline
 360         Contents & 0x02 & register & exception number \\
 361         \hline
 362 \end{tabular}
 363 \caption{Machine code for the \code{ill} pseudo instruction}
 364 \label{s390:tbl:ill}
 365 \end{table}
 366
 367 As the opcode \code{0x02} is no valid opcode according to~\cite{s390:bib:principles} , once that instruction is reached, it will lead to an illegal instruction exception. The operating system delivers a \code{SIGILL} signal to the process and control is passed to the registered signal handler. This signal handler then examines the opcode of the instruction at the faulting address. In case it corresponds to the \code{ill} pseudo-instruction, the second register field is interpreted as exception number. The corresponding exception object is then instantiated and control passed to \code{asm\_handle\_exception}.
 368
 369 Signal handling can be used to support corner cases of arithmetics that can't be handled by the architecture natively. Consider for example the special case of the division of \code{0x80000000} (\code{Integer.MIN}) by \code{0xFFFFFFFF} (-1) that raises a fixed-point divide exception which is signalled to the process by a \code{SIGFPE} signal. If the signal handler determines that the faulting operation was an integer division with those special values as operands, the destination operand is set to the expected result (-1) and normal program flow is resumed.
 370
 371 Table~\ref{s390:tbl:signals} contains an overview of signals used in CACAO JIT code.
 372
 373 \begin{table}
 374 \centering
 375 \begin{tabular}{|c|c|c|}
 376 \hline
 377 Signal & Faulting instruction & Thrown runtime exception \\
 378 \hline
 379 \code{SIGSEGV} & \code{L} & \code{EXCEPTION\_HARDWARE\_NULLPOINTER} \\
 380 \code{SIGSEGV} & \code{ST} & \code{EXCEPTION\_HARDWARE\_NULLPOINTER} \\
 381 \code{SIGSEGV} & \code{CL} & \code{EXCEPTION\_HARDWARE\_NULLPOINTER} \\
 382 \code{SIGILL} & \code{ILL} & \code{EXCEPTION\_HARDWARE\_ARITHMETIC} \\
 383 \code{SIGILL} & \code{ILL} & \code{EXCEPTION\_HARDWARE\_ARRAYINDEXOUTOFBOUNDS} \\
 384 \code{SIGILL} & \code{ILL} & \code{EXCEPTION\_HARDWARE\_CLASSCAST} \\
 385 \code{SIGILL} & \code{ILL} & \code{EXCEPTION\_HARDWARE\_PATCHER} \\
 386 \code{SIGFPE} & \code{DR} & \code{EXCEPTION\_HARDWARE\_ARITHMETIC} \\
 387 \code{SIGFPE} & \code{DR} & \code{EXCEPTION\_HARDWARE\_ARITHMETIC} \\
 388 \hline
 389 \end{tabular}
 390 \caption{Overview of signals used by JIT code in Cacao}
 391 \label{s390:tbl:signals}
 392 \end{table}
 393
 394 \subsection{Code patching}
 395
 396 Not all the information needed for program execution is available at compile time. Therefore some parts of the JIT-code need to be \emph{patched} at runtime. A trap instruction is placed at the beginning of the code in question. Further, a \emph{patcher reference} is created. This is a tiny data structure associated with the patched position encapsulating metadata about the information to be patched. The trap instruction used is the \code{ill} pseudo-instruction with a exception number of \code{EXCEPTION\_HADRWARE\_PATCHER}.
 397
 398 Once the patched position is reached and the trap instruction executed the operating system passes control the the \code{md\_signal\_handler\_sigill} function. This function inspects the faulting instruction and then passes control to the machine independent \code{signal\_handle} function, which in turn invokes the machine independent entry into the patching subsystem: \code{patcher\_handler}. Here the faulting program counter is used to look up the associated \emph{patcher reference}. The address of the machine dependent patcher function is extracted from the \emph{patcher reference} and the patcher is invoked with the \emph{patcher reference} as argument. After the patcher function has finished, and the signal handler exits, program flow is resumed at the now reconstructed patched position. The anatomy of a patcher function is the following:
 399
 400 \begin{itemize}
 401         \item Examine the patcher reference to determine what information needs to be retrieved.
 402         \item Retrieve required information.
 403         \item Overwrite the trap instruction at the patched position with the original machine code.
 404         \item Patch required information. This is usually done in one of the following ways:
 405         \begin{itemize}
 406                 \item Modify a value on the data segment. The offset of the value is extracted from the \emph{patcher reference}.
 407                 \item Modify machine code near the patched position.
 408         \end{itemize}
 409 \end{itemize}
 410
 411 Consider the example of calling an object's virtual method in figure~\ref{s390:fig:patch1}. At the time the code for the call is assembled, the called method's class may not have been loaded and thus the offset in the virtual function table may not yet be known. The code generator simply generates the second load with a displacement of 0. It then replaces the machine code at the current location with a trap instruction and creates a \emph{patcher reference} (figure~\ref{s390:fig:patch2}). In the case of a virtual function call it contains an \code{unresolved\_method *}, a data structure encapsulating information needed to resolve the method. Once the trap instruction is reached, the responsible patcher, \code{patcher\_invokevirtual} will be invoked by the patcher subsystem. It resolves the method, then removes the trap instruction and finally, modifies the displacement of the second load, to reflect the offset in the virtual function table (figure~\ref{s390:fig:patch3}).
 412
 413 \begin{figure}
 414 \begin{verbatim}
 415 l mptr, offset_vftbl_ptr(a0) ; Load the object's VFTBL pointer
 416 l pv, offset_method(mptr)    ; Load pointer to method from VFTBL
 417 \end{verbatim}
 418 \caption{Machine code for normal virtual function call}
 419 \label{s390:fig:patch1}
 420 \end{figure}
 421
 422 \begin{figure}
 423 \begin{verbatim}
 424 ; Machine code replaced with trap instruction
 425 ; Patcher reference created for this program counter
 426 ill EXCEPTION_HARDWARE_PATCHER
 427 l pv, 0(mptr)
 428 \end{verbatim}
 429 \caption{Machine code for virtual function call before being patched}
 430 \label{s390:fig:patch2}
 431 \end{figure}
 432
 433 \begin{figure}
 434 \begin{verbatim}
 435 ; Trap instruction removed and original machine code restored
 436 l mptr, offset_vftbl_pointer(a0)
 437 ; Load instruction patched
 438 l pv, 16(mptr)
 439 \end{verbatim}
 440 \caption{Machine code for virtual function call after being patched}
 441 \label{s390:fig:patch3}
 442 \end{figure}
 443
 444 The above example shows, that patchers that modify machine code are tightly coupled with the code used for its generation. Here, given the program counter of the patched position, the patcher is expected to exactly determine the load instruction to patch. On one hand, its easy, because it's simply the next instruction after the trap. On the other hand, care must be taken when modifying the code generation portion to not forget keep the patchers up to date.
 445
 446 \subsection{Data segment access}
 447
 448 One speciality of s390 compared with other architectures is the format of memory addresses in instructions. This can be given as \code{displacement(base)} or \code{displacement(index, base)} where \code{displacement} is an unsigned 12 bit value while \code{base} and \code{index} designate registers. The resulting memory address is calculated as the sum of the displacement and the values in the registers. The conventional way to address a value on the data segment is to use a negative displacement with the procedure vector register. As this obviously can't be done on s390 a workaround was needed. \\
 449
 450 We decided to add a negative constant \code{N\_PV\_OFFSET} to the procedure vector in each java methods prologue. This way, a load of the form as in figure~\ref{s390:fig:dsegload1} turns into a load of the form as in figure~\ref{s390:fig:dsegload2} with a non-negative displacement for the first 4 kB of the data segment. For data segment access beyond that barrier, special code has to be generated. Most of those accesses can be fullfitted by first loading the negative displacement as 16 bit signed immediate value into the destination register and then using that register as an index register for the actual load as in figure~\ref{s390:fig:dsegload3}. Note however, that the code does not work if the destination register is \code{\%r0}, because \code{\%r0} can't be used as index register.
 451
 452 \begin{figure}[H]
 453 \begin{verbatim}
 454     l      reg, negative_displacement(pv)
 455 \end{verbatim}
 456 \caption{Usual way to load a value from the data segment}
 457 \label{s390:fig:dsegload1}
 458 \end{figure}
 459
 460 \begin{figure}[H]
 461 \begin{verbatim}
 462     l      reg, negative_displacement - N_PV_OFFSET(pv)
 463 \end{verbatim}
 464 \caption{Load from data segment with non-negative displacement}
 465 \label{s390:fig:dsegload2}
 466 \end{figure}
 467
 468 \begin{figure}[H]
 469 \begin{verbatim}
 470     lhi reg, negative_displacement
 471     l   reg, 0(reg, pv)
 472 \end{verbatim}
 473 \caption{Load from data segment for large displacements}
 474 \label{s390:fig:dsegload3}
 475 \end{figure}
 476
 477 In very rare cases, where the data can't be addressed using a signed 16 bit offset, we place the negative displacement into the instruction flow and load it in a PC relative fashion, as seen in figure~\ref{s390:fig:dsegload4}.
 478
 479 \begin{figure}[H]
 480 \begin{verbatim}
 481     bras reg, L_bras     ; Branch over constant dseg offset in instruction stream
 482                          ; and store current PC of next instruction in reg
 483     .long negative_displacement
 484 L_bras:
 485     l    reg, 0(reg)     ; Load constant dseg offset into reg
 486     l    reg, 0(reg, pv) ; Address dseg using two registers
 487 \end{verbatim}
 488 \caption{Load from data segment for very large displacements}
 489 \label{s390:fig:dsegload4}
 490 \end{figure}
 491
 492 The fact, that in JIT code the procedure vector is always set off must be kept in mind when passing its value to architecture independent code, for example the patcher subsystem, that expects it to point to the methods entry point. \code{N\_PV\_OFFSET} must first be substracted from the procedure vector value.
 493
 494 \subsection{Recalculation of the procedure vector}
 495
 496 Because the procedure vector is a caller saved register in the \emph{JIT ABI}, but a callee saved register in the \emph{ELF ABI}, the procedure vector has to be recalculated after each Java-to-Java call. This is done by the snippet of assembly code seen in figure~\ref{s390:fig:pvrestore1}
 497
 498 \begin{figure}[H]
 499 \begin{verbatim}
 500     ; get program counter into pv
 501     basr     pv, 0
 502 L_basr:
 503     ; substract offset to start of method
 504     ahi      pv, -(L_basr - L_mcodebase) + N_PV_OFFSET
 505 \end{verbatim}
 506 \caption{Assembly code for the recalculation of the procedure vector}
 507 \label{s390:fig:pvrestore1}
 508 \end{figure}
 509
 510 Although rare, it happens for large methods that the that the immediate operand to \code{ahi} does not fit into the range of valid immediates. Care has to be taken to still generate correct code as seen in the assembly code in figure~\ref{s390:fig:pvrestore2}.
 511
 512 \begin{figure}[H]
 513 \begin{verbatim}
 514     ; Jump over (big) constant in instruction flow and
 515     ; get program counter of next instruction into PV.
 516     bras    pv, L_bras
 517     ; offset to start of method is placed in instruction flow
 518     .long   -(L_bras - L_mcodebase) + N_PV_OFFSET
 519 L_bras:
 520     ; Add offset to program counter retrieved above.
 521     a       pv, 0(pv)
 522 \end{verbatim}
 523 \caption{Assembly code for the recalculation of the procedure vector for large methods}
 524 \label{s390:fig:pvrestore2}
 525 \end{figure}
 526
 527 \subsection{Long branches}
 528
 529 For program counter relative branches, that are used for inter-method branching, the offset is specified as signed 17 bit integer. The size of most methods does not exceed 64kB. But occasionally, there are large methods that require a branch with a offset not fitting into this range. We call this kind of branches \emph{long branches}. If a backward \emph{long branch} is generated, this does not cause further troubles. The problem are forward branches: branches to not yet generated code. In such cases, instead of generating a branch, we reserve exactly the size of a branch instruction. This space will be filled with a (short) branch as soon as the referenced code gets generated. This is obviously a problem if the branch generated later shows to be a \emph{long branch}, because the code generated for a \emph{long branch} does not fit into the reserved space.
 530
 531 This condition is detected in \code{emit\_branch} by checking the passed displacement to fit into the 17 bit signed range. If this condition does not hold and the flag \code{CODEGENDATA\_FLAG\_LONGBRANCHES} is not set, then both flags \code{CODEGENDATA\_FLAG\_ERROR} and \code{CODEGENDATA\_FLAG\_LONGBRANCHES} are set and no branch is generated. After \code{codegen\_emit} returns, \code{codegen\_generate} checks for those two flags. If they are both set, the method is recompiled, with only the \code{CODEGENDATA\_FLAG\_LONGBRANCHES} flags set.
 532
 533 This causes that more space is reserved for forward branches: enough to patch either a (short) branch or a \emph{long branch}.
 534
 535 A long branch is implemented by storing the long displacement in the data segment. The generated code loads the displacement from the data segment, combines it with the procedure vector and then jumps to the resulting address.
 536
 537 \subsection{The development environment}
 538
 539 Most of the development has been done on a Debian GNU/Linux system running on the Hercules emulator. Installing the system on the emulator can be tricky and needs several hours to complete. It is not recommended to setup such a system from scratch, but rather use the disk image floating around at the complang group.
 540
 541 As the emulator showed to be slow for native compiling, a cross compiler was used instead. To eliminate the need of cross compiling all dependencies of CACAO on the development system, the following approach was chosen instead:
 542
 543 \begin{itemize}
 544         \item The root of the target system was mounted via NFS on the development system.
 545         \item The cross compiler was set up to use the mounted NFS root as sysroot.
 546 \end{itemize}
 547
 548 The procedure of building the cross compiler in practice is shown in figure~\ref{s390:fig:crosgcc}.
 549
 550 \begin{figure}
 551 \begin{verbatim}
 552 $ mount s390-system:/ /foo
 553 $ cd binutils-2.17
 554 $ ./configure \
 555     --target=s390-ibm-linux-gnu \
 556     --with-build-sysroot=/foo \
 557     --with-sysroot=/foo
 558 $ make install
 559 $ cd gcc-4.1.1
 560 $ ./configure \
 561     --target=s390-ibm-linux-gnu \
 562     --prefix=/usr/local \
 563     --with-sysroot=/foo \
 564     --enable-languages=c
 565 $ make install
 566 \end{verbatim}
 567 \caption{Building an s390 cross compiler with the sysroot feature}
 568 \label{s390:fig:crosgcc}
 569 \end{figure}
 570
 571 In the final phase of development, we got a free account at the \emph{IBM community development system} for three months. This account provided us with root access to a ESA/390 virtual machine with 256MB of RAM preinstalled with a GNU/Linux system. Development on that system worked perfectly and it had a more reasonable performance than the emulated system.
 572
 573
 574
 575