                        Mono JIT porting guide.
                Paolo Molaro (lupus@ximian.com)

* Introduction

This document describes the process of porting the mono JIT
to a new CPU architecture. The new mono JIT has been designed
to make porting easier while still allowing a port
to take full advantage of the features and instructions of the
new architecture. Knowledge of the mini architecture (described in the
mini-doc.txt file) is a requirement for understanding this guide,
as is an earlier document about porting the mono interpreter
(available on the web site).

There are six main areas that a port needs to implement to
have a fully-functional JIT for a given architecture:

        1) instruction selection
        2) native code emission
        3) call conventions and register allocation
        4) method trampolines
        5) exception handling
        6) minor helper methods

To take advantage of some not-so-common processor features (for example
conditional execution of instructions as may be found on ARM or ia64), it may
be necessary to develop a high-level optimization, but doing so is not a
requirement for getting the JIT to work.

We'll look at each of the required steps in more detail. Note, though,
that a new port may just as well start from a cut&paste of an existing
port to a similar architecture (for example from x86 to amd64, or from
powerpc to sparc).
The architecture-specific code is split from the rest of the JIT.
For example, the x86-specific code and data are all contained in the
following files in the distribution:

        mini-x86.h mini-x86.c
        inssel-x86.brg
        cpu-pentium.md
        tramp-x86.c
        exceptions-x86.c

I suggest a similar split for other architectures as well.

Note that this document is still incomplete: some sections are only
sketched and some are missing, but the important info to get a port
going is already described.


* Architecture-specific instructions and instruction selection.

The JIT already provides a set of instructions that can be easily
mapped to a great variety of different processor instructions.
Sometimes it may be necessary or advisable to add a new instruction
that represents more closely an instruction in the architecture.
A mini instruction can also be used to represent a short
sequence of low-level CPU instructions, but each mini
instruction is the smallest unit of code the instruction
scheduler will handle (i.e., the scheduler won't schedule the instructions
that compose the low-level sequence as individual instructions, but just
the whole sequence, as an indivisible block).
New instructions are created by adding a line in the mini-ops.h file,
assigning an opcode and a name. The input and output of
the instruction are specified in one of two places, depending on the context
in which the instruction gets used.
If the instruction is used in the tree representation, the input and output
types are defined by the BURG rules in the *.brg files (the usual
non-terminals are 'reg' to represent a normal register, 'lreg' to
represent a register or two that hold a 64 bit value, and 'freg' for a
floating point register).
If an instruction is used as a low-level CPU instruction, the info
is specified in a machine description file. The description file is
processed by the genmdesc program to provide a data structure that
can be easily used from C code to query the needed info about the
instruction.
As an example, let's consider the add instruction for both x86 and ppc:

x86 version:
        add: dest:i src1:i src2:i len:2 clob:1
ppc version:
        add: dest:i src1:i src2:i len:4

Note that the instruction takes two input integer registers on both CPUs,
but on x86 the first source register is clobbered (clob:1) and the length
in bytes of the instruction differs.
Note that integer adds and floating point adds use different opcodes, unlike
the IL language (a 64 bit add is done with two instructions on 32 bit
architectures, using an add that sets the carry and an add with carry).
A specific CPU port may assign any meaning to the clob field for an instruction
since the value will be processed in an arch-specific file anyway.
See the top of the existing cpu-pentium.md file for more info on other fields:
the info may or may not be applicable to a different CPU; if it is not, it
can be ignored.
The code in mini.c together with the BURG rules in inssel.brg, inssel-float.brg
and inssel-long32.brg provides general purpose mappings from the tree representation
to a set of instructions that should be easy to implement on any architecture.
To allow for additional arch-specific functionality, an arch-specific BURG file
can be used: in this file arch-specific instructions can be selected that provide
better performance than the general instructions or that provide functionality
that is needed by the JIT but that cannot be expressed in a general enough way.
As an example, x86 has the special instruction "push" to make it easier to
implement the default call convention (passing arguments on the stack): almost
all the other architectures don't have such an instruction (and don't need it anyway),
so we added a special rule in the inssel-x86.brg file for it.

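To make the machine description entries above more concrete, here is a minimal
sketch of a table that C code could query for the two "add" lines. The
SkInsDesc layout and the sk_* names are hypothetical and only illustrate the
idea: they are not the data structure that genmdesc actually generates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: one record per line of the machine description. */
typedef struct {
    const char *name;   /* instruction name as written in the .md file */
    char dest;          /* register class of the result, 'i' = integer */
    char src1;          /* register class of the first input */
    char src2;          /* register class of the second input */
    unsigned char len;  /* len: field, length in bytes of the emitted code */
    char clob;          /* clob: field, 0 when absent; meaning is arch-specific */
} SkInsDesc;

/* add: dest:i src1:i src2:i len:2 clob:1 */
static const SkInsDesc sk_x86_desc[] = {
    { "add", 'i', 'i', 'i', 2, '1' },
};

/* add: dest:i src1:i src2:i len:4 */
static const SkInsDesc sk_ppc_desc[] = {
    { "add", 'i', 'i', 'i', 4, 0 },
};

/* Linear lookup by name; returns NULL for an unknown instruction. */
static const SkInsDesc *
sk_find_desc (const SkInsDesc *table, size_t count, const char *name)
{
    assert (table != NULL && name != NULL);
    for (size_t i = 0; i < count; i++)
        if (strcmp (table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/* In this sketch clob:1 means "the first source register is overwritten". */
static int
sk_clobbers_src1 (const SkInsDesc *desc)
{
    return desc->clob == '1';
}
```

With such a table the arch-specific code can ask, for example, whether "add"
overwrites its first source on x86 (it does) or on ppc (it does not), and how
many bytes to reserve for it in the code buffer.
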
So, one of the first things needed in a port is to write a cpu-$(arch).md machine
description file and fill it with the needed info. To start with, only a few
instructions need to be specified, like the ones required to do simple integer
operations. The default rules of the instruction selector will emit the common
instructions, and so we're ready to go on to the next step in porting the JIT.


* Native code emission

Since the first step in porting mono to a new CPU is to port the interpreter,
there should already be a file that allows the emission of binary native code
into a buffer for the architecture. This file should be placed in the
        mono/arch/$(arch)/
directory.

The bulk of the code emission happens in the mini-$(arch).c file, in a function
called mono_arch_output_basic_block (). This function takes a basic block, walks the
list of instructions in the block and emits the binary code for each.
Optionally a peephole optimization pass is done on the basic block, but this can be
left for later, when the port actually works.
This function is very simple: there is just a big switch on the instruction opcode,
and each case uses the functions or macros that emit the binary native code.
Note that in this function the lengths of the instructions are used to
determine if the buffer for the code needs enlarging.

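The shape of such an emission loop can be sketched as follows. This is only an
illustration under stated assumptions: the SkIns and SkBuffer types, the two
opcodes and the sk_ins_len table are hypothetical stand-ins for the real mini
structures, and the x86 bytes are written by hand instead of through the
macros in mono/arch/$(arch)/.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { SK_OP_ICONST, SK_OP_RET, SK_OP_COUNT };   /* hypothetical opcodes */

typedef struct SkIns {
    int opcode;
    int32_t imm;            /* constant for SK_OP_ICONST */
    struct SkIns *next;
} SkIns;

typedef struct {
    uint8_t *code;
    size_t len;             /* bytes emitted so far */
    size_t size;            /* bytes allocated */
} SkBuffer;

/* Per-opcode length in bytes, the role played by the len: field. */
static const uint8_t sk_ins_len[SK_OP_COUNT] = { 5, 1 };

/* Returns 0 on success, -1 on an unknown opcode or allocation failure. */
static int
sk_output_basic_block (SkBuffer *buf, const SkIns *first)
{
    for (const SkIns *ins = first; ins != NULL; ins = ins->next) {
        if (ins->opcode < 0 || ins->opcode >= SK_OP_COUNT)
            return -1;

        /* Enlarge the buffer before emitting if the instruction may not fit. */
        size_t needed = buf->len + sk_ins_len[ins->opcode];
        if (needed > buf->size) {
            size_t new_size = buf->size ? buf->size : 16;
            while (new_size < needed)
                new_size *= 2;
            uint8_t *grown = realloc (buf->code, new_size);
            if (grown == NULL)
                return -1;
            buf->code = grown;
            buf->size = new_size;
        }

        switch (ins->opcode) {
        case SK_OP_ICONST:      /* x86: mov eax, imm32 */
            buf->code[buf->len++] = 0xB8;
            for (int i = 0; i < 4; i++)
                buf->code[buf->len++] = (uint8_t) ((uint32_t) ins->imm >> (8 * i));
            break;
        case SK_OP_RET:         /* x86: ret */
            buf->code[buf->len++] = 0xC3;
            break;
        }
        assert (buf->len <= buf->size);
    }
    return 0;
}
```

Checking the length before the switch keeps every case free of bounds checks,
which is why the lengths in the machine description must never be smaller than
what the case actually emits.
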
To complete the code emission for a method, a few other functions need
implementing as well:

        mono_arch_emit_prolog ()
        mono_arch_emit_epilog ()
        mono_arch_patch_code ()

mono_arch_emit_prolog () will emit the code to set up the stack frame for a method,
optionally call the callbacks used in profiling and tracing, and move the
arguments to their home location (in a callee-saved register if the variable was
allocated to one, or in a stack location if the argument was passed in a volatile
register and wasn't allocated a non-volatile one). Callee-saved registers used by the
function are saved in the prolog as well.

mono_arch_emit_epilog () will emit the code needed to return from the function,
optionally calling the profiling or tracing callbacks. At this point the basic blocks
or the code that was moved out of the normal flow for the function can be emitted
as well (this is usually done to provide better info for the static branch predictor).
In the epilog, callee-saved registers are restored if they were used.
Note that, to help exception handling and stack unwinding, when there is a transition
from managed to unmanaged code, some special processing needs to be done (basically,
saving all the registers and setting up the links in the Last Managed Frame
structure).

When the epilog has been emitted, the upper level code arranges for the buffer of
memory that contains the native code to be copied to an area of executable memory.
At this point, instructions that use relative addressing need to be patched
to have the right offsets: this work is done by mono_arch_patch_code ().


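The patching step can be illustrated with x86-style 32-bit relative
displacements, which are measured from the end of the displacement field (the
address of the next instruction). The SkPatch record and sk_patch_rel32 () are
hypothetical names for this sketch; the real mono_arch_patch_code () handles
many more kinds of patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical patch record: where the displacement lives and what it targets. */
typedef struct {
    size_t offset;          /* offset of the 4-byte displacement in the code */
    const uint8_t *target;  /* absolute address the instruction must reach */
} SkPatch;

/* Rewrites each displacement for the final location of the code. */
static void
sk_patch_rel32 (uint8_t *code, const SkPatch *patches, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint8_t *field = code + patches[i].offset;
        /* Relative to the first byte after the displacement field. */
        intptr_t disp = (intptr_t) patches[i].target - (intptr_t) (field + 4);
        assert (disp >= INT32_MIN && disp <= INT32_MAX);
        int32_t disp32 = (int32_t) disp;
        memcpy (field, &disp32, sizeof disp32);   /* native byte order */
    }
}
```

The displacement can only be computed once the code sits at its final address,
which is why this runs after the copy to executable memory and not during
emission.
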
* Call conventions and register allocation

To account for the differences in the call conventions, a few functions need to
be implemented.

mono_arch_allocate_vars () assigns to both arguments and local variables
the offset relative to the frame register where they are stored; dead
variables are simply discarded. The total amount of stack needed is also calculated.

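A minimal sketch of this kind of offset assignment follows. The SkVar record,
the upward-growing offsets and the frame_align parameter are assumptions made
for the example; a real port follows the frame layout of its ABI and the
actual mini variable structures.

```c
#include <assert.h>

/* Hypothetical variable record for this sketch. */
typedef struct {
    int size;      /* size in bytes */
    int align;     /* required alignment, a power of two */
    int is_dead;   /* nonzero if the variable is never used */
    int offset;    /* result: offset from the frame register, -1 if none */
} SkVar;

static int
sk_align_up (int value, int align)
{
    assert (align > 0 && (align & (align - 1)) == 0);
    return (value + align - 1) & ~(align - 1);
}

/* Assigns offsets to live variables and returns the total stack needed. */
static int
sk_allocate_vars (SkVar *vars, int count, int frame_align)
{
    int offset = 0;
    for (int i = 0; i < count; i++) {
        if (vars[i].is_dead) {
            vars[i].offset = -1;    /* dead variables get no stack slot */
            continue;
        }
        offset = sk_align_up (offset, vars[i].align);
        vars[i].offset = offset;
        offset += vars[i].size;
    }
    return sk_align_up (offset, frame_align);
}
```
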
mono_arch_call_opcode () is the function that deals most closely with the call
convention on a given system. For each argument to a function call, an instruction
is created that actually puts the argument where needed, be it the stack or a
specific register. This function can also rearrange the order of evaluation
when multiple arguments are involved, if needed (for example, on x86 arguments are
pushed on the stack in reverse order). The function needs to carefully take into
account platform-specific issues, like how structures are returned as well as the
differences in size and/or alignment of managed and corresponding unmanaged
structures.

The other chunk of code that needs to deal with the call convention and other
specifics of a CPU is the local register allocator, implemented in a function
named mono_arch_local_regalloc (). The local allocator deals with one basic block
at a time and basically just allocates registers for temporary
values during expression evaluation, spilling and unspilling as necessary.
The local allocator needs to take into account clobbering information, both
during simple instructions and during function calls, and it needs to deal
with other architecture-specific weirdnesses, like instructions that take
inputs only in specific registers or produce output only in some.
Some effort will be put later into moving most of the local register allocator to
a common file so that more of the code can be shared among similar, risc-like CPUs.
The register allocator does a first pass on the instructions in a block, collecting
liveness information, and in a backward pass on the same list performs the
actual register allocation, inserting the instructions needed to spill values,
if necessary.

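The backward allocation pass can be sketched in a much reduced form. The
example below is an assumption-laden illustration, not the mono allocator: it
leaves out the forward liveness pass, clobber handling, fixed-register
constraints and spilling (it simply fails when it runs out of registers), and
the SkRaIns type and sk_* names are hypothetical. Walking backward, the first
time a virtual register is seen as a source is its last use, so it gets a free
hardware register there; when its defining instruction is reached, the
register is released again.

```c
#include <assert.h>

#define SK_NUM_HREGS 3
#define SK_MAX_VREGS 16
#define SK_NONE (-1)

typedef struct {
    int dreg, sreg1, sreg2;       /* virtual registers, SK_NONE when unused */
    int hdreg, hsreg1, hsreg2;    /* results: assigned hardware registers */
} SkRaIns;

/* Takes the lowest free hardware register, or SK_NONE if all are busy. */
static int
sk_take_free (int *hreg_busy)
{
    for (int h = 0; h < SK_NUM_HREGS; h++)
        if (!hreg_busy[h]) {
            hreg_busy[h] = 1;
            return h;
        }
    return SK_NONE;   /* a real allocator would spill here */
}

static int
sk_assign_source (int vreg, int *vreg_to_hreg, int *hreg_busy)
{
    if (vreg == SK_NONE)
        return SK_NONE;
    assert (vreg >= 0 && vreg < SK_MAX_VREGS);
    if (vreg_to_hreg[vreg] == SK_NONE)   /* last use, seen first going backward */
        vreg_to_hreg[vreg] = sk_take_free (hreg_busy);
    return vreg_to_hreg[vreg];
}

/* Returns 0 on success, -1 when more than SK_NUM_HREGS values are live. */
static int
sk_local_regalloc (SkRaIns *ins, int count)
{
    int vreg_to_hreg[SK_MAX_VREGS];
    int hreg_busy[SK_NUM_HREGS] = { 0 };

    for (int v = 0; v < SK_MAX_VREGS; v++)
        vreg_to_hreg[v] = SK_NONE;

    for (int i = count - 1; i >= 0; i--) {
        SkRaIns *cur = &ins[i];

        cur->hdreg = SK_NONE;
        if (cur->dreg != SK_NONE) {
            int h = vreg_to_hreg[cur->dreg];
            if (h == SK_NONE)            /* result is never used later */
                h = sk_take_free (hreg_busy);
            if (h == SK_NONE)
                return -1;
            cur->hdreg = h;
            hreg_busy[h] = 0;            /* free before the definition */
            vreg_to_hreg[cur->dreg] = SK_NONE;
        }

        /* Sources are assigned after the destination so they may reuse it. */
        cur->hsreg1 = sk_assign_source (cur->sreg1, vreg_to_hreg, hreg_busy);
        cur->hsreg2 = sk_assign_source (cur->sreg2, vreg_to_hreg, hreg_busy);
        if ((cur->sreg1 != SK_NONE && cur->hsreg1 == SK_NONE) ||
            (cur->sreg2 != SK_NONE && cur->hsreg2 == SK_NONE))
            return -1;
    }
    return 0;
}
```
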
When this part of the code is implemented, some testing can be done with the generated
code for the new architecture. Most helpful is the use of the --regression
command line switch to run the regression tests (basic.cs, for example).
Note that the JIT will try to initialize the runtime, but it may not yet be able to
compile and execute complex code: commenting out most of the code in the mini_init()
function in mini.c is needed to let the JIT just compile the regression tests.
Also, using multiple -v switches on the command line makes the JIT dump an
increasing amount of information during compilation.


* Method trampolines

To get better startup performance, the JIT actually compiles a method only when
needed. To achieve this, when a call to a method is compiled, we actually emit a
call to a magic trampoline. The magic trampoline is a function written in assembly
that invokes the compiler to compile the given method and jumps to the newly compiled
code, ensuring the arguments it received are passed correctly to the actual method.
Before jumping to the new code, though, the magic trampoline takes care of patching
the call site so that next time the call will go directly to the method instead of the
trampoline. How does this all work?
mono_arch_create_jit_trampoline () creates a small function that just
preserves the arguments passed to it and adds an additional argument (the method
to compile) before calling the generic trampoline. This small function is called
the specific trampoline, because it is method-specific (the method to compile
is hard-coded in the instruction stream).
The generic trampoline saves all the arguments that could get clobbered
and calls a C function that will do two things:

*) actually call the JIT to compile the method
*) identify the calling code so that it can be patched to call the actual
   method directly

If the 'this' argument to a method is a boxed valuetype that is passed to
a method that expects just a pointer to the data, an additional unboxing
trampoline will need to be inserted as well.


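The control flow of lazy compilation with call-site patching can be modelled
in portable C, with function pointers standing in for machine code. Everything
here is a hypothetical model: a real trampoline is written in assembly and
patches the call instruction itself, while this sketch patches a pointer
stored in an SkCallSite record, and sk_compile () merely pretends to be the
JIT.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*SkCode) (int arg);

typedef struct {
    const char *name;
    SkCode native;        /* NULL until the method has been compiled */
    int compile_count;    /* how many times the "JIT" really ran */
} SkMethod;

typedef struct {
    SkMethod *method;     /* what the specific trampoline hard-codes */
    SkCode target;        /* NULL while the site still goes to the trampoline */
} SkCallSite;

/* Stands in for the native code of the compiled method. */
static int
sk_double (int x)
{
    return 2 * x;
}

/* Stand-in for the JIT: compiles once, then returns the cached code. */
static SkCode
sk_compile (SkMethod *method)
{
    if (method->native == NULL) {
        method->native = sk_double;
        method->compile_count++;
    }
    return method->native;
}

/* Compiles the method, patches the call site, then runs the new code. */
static int
sk_generic_trampoline (SkCallSite *site, int arg)
{
    SkCode code = sk_compile (site->method);
    site->target = code;      /* next call goes straight to the method */
    return code (arg);        /* original argument is passed on unchanged */
}

/* What executing the call instruction at a call site amounts to. */
static int
sk_call (SkCallSite *site, int arg)
{
    assert (site != NULL && site->method != NULL);
    if (site->target != NULL)
        return site->target (arg);
    return sk_generic_trampoline (site, arg);
}
```

Only the first call through each site pays for the trampoline; a second site
calling the same method still goes through the trampoline once, but finds the
method already compiled.
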
* Exception handling

Exception handling is likely the most difficult part of the port, as it needs
to deal with unwinding (both managed and unmanaged code) and calling
catch and filter blocks. It also needs to deal with signals, because mono
takes advantage of the MMU in the CPU and of the operating system to
handle dereferences of the NULL pointer. Some of the functions needed
to implement the mechanisms are:

mono_arch_get_throw_exception () returns a function that takes an exception object
and invokes an arch-specific function that will enter the exception processing.
To do so, all the relevant registers need to be saved and passed on.

mono_arch_handle_exception () takes the exception thrown and
a context that describes the state of the CPU at the time the exception was
thrown. The function needs to implement the exception handling mechanism,
so it searches for a handler for the exception and, if none is found,
it follows the unhandled exception path (which can print a trace and exit or
just abort the current thread). The difficulty here is to unwind the stack
correctly, by restoring the register state at each call site in the call chain,
calling finally, filter and handler blocks while doing so.

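The search-then-unwind logic can be sketched on a simplified model of the call
chain. The SkFrame record, the integer exception type ids and the two-pass
structure are assumptions of this example; it ignores filters, register state
and nested clauses, which are exactly the hard parts of a real port.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one managed frame. */
typedef struct {
    const char *method;
    int catch_type;     /* exception type id caught here, 0 = no catch clause */
    int has_finally;    /* nonzero if the frame has a finally handler */
    int finally_ran;    /* set by the sketch when the handler is "called" */
} SkFrame;

/*
 * frames[0] is the innermost frame. Returns the index of the frame whose
 * catch handler receives control, or -1 for the unhandled exception path.
 */
static int
sk_handle_exception (SkFrame *frames, int count, int exc_type)
{
    int handler = -1;

    assert (frames != NULL && exc_type != 0);

    /* First pass: search the call chain for a matching catch clause. */
    for (int i = 0; i < count; i++)
        if (frames[i].catch_type == exc_type) {
            handler = i;
            break;
        }
    if (handler < 0)
        return -1;

    /* Second pass: unwind, calling finally handlers of the frames we leave. */
    for (int i = 0; i < handler; i++)
        if (frames[i].has_finally)
            frames[i].finally_ran = 1;

    return handler;
}
```
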
As part of exception handling a couple of internal calls need to be implemented
as well.
ves_icall_get_frame_info () returns info about a specific frame.
mono_jit_walk_stack () walks the stack and calls a callback with info for
each frame found.
ves_icall_get_trace () returns an array of StackFrame objects.

** Code generation for filter/finally handlers

Filter and finally handlers are called from 2 different locations:

       1.) from within the method containing the exception clauses
       2.) from the stack unwinding code

To make this possible we implement them like subroutines, ending with a
"return" statement. The subroutine does not save the base pointer, because we
need access to the local variables of the enclosing method. It is possible
that instructions inside those handlers modify the stack pointer, thus we save
the stack pointer at the start of the handler, and restore it at the end. We
have to use a "call" instruction to execute such finally handlers.

The MIR code for filter and finally handlers looks like:

    OP_START_HANDLER
    ...
    OP_END_FINALLY | OP_ENDFILTER(reg)

OP_START_HANDLER: should save the stack pointer somewhere
OP_END_FINALLY: restores the stack pointer and returns.
OP_ENDFILTER (reg): restores the stack pointer and returns the value in "reg".

** Calling finally/filter handlers

There is a special opcode to call those handlers, called OP_CALL_HANDLER. It
simply emits a call instruction.

It's a bit more complex to call a handler from outside (in the stack unwinding
code), because we have to restore the whole context of the method first. After that
we simply emit a call instruction to invoke the handler. It's usually
possible to use the same code to call filter and finally handlers (see
arch_get_call_filter).

** Calling catch handlers

Catch handlers are always called from the stack unwinding code. Unlike finally clauses
or filters, catch handlers never return. Instead we simply restore the whole
context, and restart execution at the catch handler.

** Passing Exception objects to catch handlers and filters.

We use a local variable to store exception objects. The stack unwinding code
must store the exception object into this variable before calling a catch handler
or filter.

* Minor helper methods

A few minor helper methods are referenced from the arch-independent code.
Some of them are:

*) mono_arch_cpu_optimizations ()
        This function returns a mask of optimizations that should be enabled for the
        current CPU and a mask of optimizations that should be excluded instead.

*) mono_arch_regname ()
        Returns the name for a numeric register.

*) mono_arch_get_allocatable_int_vars ()
        Returns a list of variables that can be allocated to the integer registers
        in the current architecture.

*) mono_arch_get_global_int_regs ()
        Returns a list of callee-saved registers that can be used to allocate variables
        in the current method.

*) mono_arch_instrument_mem_needs ()
*) mono_arch_instrument_prolog ()
*) mono_arch_instrument_epilog ()
        Functions needed to implement the profiling interface.


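Two of these helpers are small enough to sketch. The sk_* functions below are
hypothetical stand-ins written for x86: the register numbers follow the x86
instruction encoding, while the SK_OPT_* flags and the cpu_has_cmov parameter
are invented for the example (a real port would detect CPU features itself and
use the optimization flags defined by the JIT).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* x86 integer registers in instruction-encoding order. */
enum { SK_EAX, SK_ECX, SK_EDX, SK_EBX, SK_ESP, SK_EBP, SK_ESI, SK_EDI };

/* Returns a printable name for a numeric register. */
static const char *
sk_regname (int reg)
{
    static const char *const names[] = {
        "%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi"
    };
    if (reg < 0 || reg > SK_EDI)
        return "unknown";
    return names[reg];
}

/* Hypothetical optimization flags. */
#define SK_OPT_PEEPHOLE (1u << 0)
#define SK_OPT_CMOV     (1u << 1)

/*
 * Returns the optimizations to enable for this CPU and stores in
 * *exclude_mask the ones that must not be used.
 */
static uint32_t
sk_cpu_optimizations (int cpu_has_cmov, uint32_t *exclude_mask)
{
    uint32_t opts = SK_OPT_PEEPHOLE;

    assert (exclude_mask != NULL);
    *exclude_mask = 0;
    if (cpu_has_cmov)
        opts |= SK_OPT_CMOV;
    else
        *exclude_mask |= SK_OPT_CMOV;
    return opts;
}
```
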
* Writing regression tests

Regression tests for the JIT should be written for any bug found in the JIT
and added to one of the *.cs files in the mini directory. Eventually all the
operations of the JIT should be tested (including the ones that get selected
only when some specific optimization is enabled).


* Platform specific optimizations

An example of a platform-specific optimization is the peephole optimization:
we look at a small window of code at a time and we replace one or more
instructions with others that perform better for the given architecture or CPU.
