docs/jit-trampolines

Author: Dietmar Maurer (dietmar@ximian.com)
(C) 2001 Ximian, Inc.
(C) 2007 Novell, Inc.

[ 2007 extensions based on posts from Paolo Molaro ]

How to trigger JIT compilation
==============================

The JIT translates CIL code to native code on a per-method basis. For example,
if you have this simple program:

public class Test {
        public static void Main () {
                System.Console.WriteLine ("Hello");
        }
}

the JIT first compiles the Main function. Unfortunately, Main() contains a
reference to System.Console.WriteLine(), so the JIT also needs the address of
WriteLine() to generate a call instruction.

The simplest solution would be to JIT compile System.Console.WriteLine()
to generate that address. But that would mean JIT compiling half of our
class library at once, since WriteLine() uses many other classes and functions,
and we would have to call the JIT for each of them. Even worse, cyclic
references are possible, and they would send us into an endless loop.

Thus we need some kind of trampoline function for JIT compilation. Such a
trampoline first calls the JIT compiler to create native code, and then jumps
directly into that code. Whenever the JIT needs the address of a function (to
emit a call instruction) it uses the address of such a trampoline function.

One drawback of this approach is that it requires an additional indirection. We
always call the trampoline. Inside the trampoline we need to check whether the
method is already compiled, and if it is not, we start JIT compilation. After
that we call the code. Doing this on every call is quite time consuming and
performs very badly.

The solution is to add some logic to the trampoline function to detect from
where it is called. It is then possible for the JIT to patch the call
instruction in the caller, so that it calls the JIT compiled code directly
next time.
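
To make the idea concrete, here is a minimal C sketch of the scheme
described above. It is only a model, not the Mono implementation: the
call site is represented as a patchable function pointer rather than a
real call instruction, and all names (method, call_site, jit_compile,
trampoline, call_through) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*code_ptr)(int);

/* Hypothetical model of a method: "body" stands in for the native code
 * the JIT would produce, "compiled" stays NULL until the JIT has run. */
typedef struct method {
    code_ptr compiled;
    code_ptr body;
    int compile_count;      /* how often the "JIT" was invoked */
} method;

/* A call site is modelled as a patchable slot holding the call target.
 * NULL means "still points at the trampoline". */
typedef struct call_site {
    method *target_method;
    code_ptr target;
} call_site;

static int trampoline_hits; /* how often the trampoline was entered */

/* Stand-in for the JIT: "compiles" the method at most once. */
static code_ptr jit_compile(method *m) {
    if (m->compiled == NULL) {
        m->compile_count++;
        m->compiled = m->body;
    }
    return m->compiled;
}

/* The trampoline: compile on demand, patch the caller, run the code. */
static int trampoline(call_site *site, int arg) {
    code_ptr code = jit_compile(site->target_method);
    site->target = code;    /* patch: the next call skips the trampoline */
    return code(arg);
}

/* Executes the call instruction at the given call site. */
static int call_through(call_site *site, int arg) {
    assert(site != NULL && site->target_method != NULL);
    if (site->target != NULL)
        return site->target(arg);   /* already patched: direct call */
    trampoline_hits++;
    return trampoline(site, arg);
}

/* Example method body. */
static int add_one(int x) { return x + 1; }
```

With a method whose body is add_one, the first call_through () goes via
the trampoline and "compiles" the method; the second call goes straight
to the code, so the JIT runs only once.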

Implementation Details
======================

Mono 1.2.6 has quite a few improvements in this area compared to Mono
1.2.5, which was released just a few weeks ago. I'll try to detail the
major changes below.

The first change is related to how the memory for the specific
trampolines is allocated: this is executable memory, so it is not
allocated with malloc, but with a custom allocator called the Mono Code
Manager. Since the code manager is used primarily for methods, it
allocates chunks of memory that are aligned to multiples of 8 or 16
bytes depending on the architecture: this allows the CPU to fetch the
instructions faster. But the specific trampolines are not performance
critical (we'll spend lots of time JITting the method anyway), so they
can tolerate a smaller alignment. Considering that most trampolines
are allocated one after the other and that on most architectures they
are 10 or 12 bytes long, this change alone saved about 25% of the
memory used (they used to be aligned up to 16 bytes).

To give a rough idea of how many trampolines are generated, here are a
few examples:

    * MonoDevelop startup creates about 21 thousand trampolines
    * IronPython 2.0 running a benchmark creates about 17 thousand trampolines
    * a "hello, world" style program creates about 800

In the first case this change saved more than 80 KB of memory (plus
about the same again, because reviewing the code also allowed me to fix
a related overallocation issue).
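
The figures above can be checked with a little arithmetic. The sketch
below assumes 12-byte trampolines whose alignment is relaxed from 16 to
4 bytes; the 4 is an assumption (the text only says "smaller
alignment"), and align_up and bytes_saved are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align.
 * align must be a power of two. */
static size_t align_up(size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Bytes saved over count trampolines of tramp_size bytes each when the
 * allocation alignment is relaxed from old_align to new_align. */
static size_t bytes_saved(size_t count, size_t tramp_size,
                          size_t old_align, size_t new_align) {
    size_t before = align_up(tramp_size, old_align);
    size_t after = align_up(tramp_size, new_align);
    return count * (before - after);
}
```

Under these assumptions each trampoline shrinks from 16 to 12 bytes
(25% less), and bytes_saved (21000, 12, 16, 4) gives 84000 bytes, that
is about 82 KB, which agrees with the "more than 80 KB" figure for
MonoDevelop.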

So reducing the size of the trampolines is great, but it's really not
possible to reduce them much further in size, if at all. The next step
is to try not to create them in the first place.

There are two primary ways a trampoline is generated: a direct call to
the method is made, or a virtual table slot is filled with a trampoline
for the case when the method is invoked using a virtual call. I'll
note here that in both cases, after compiling the method, the magic
trampoline makes the needed changes so that the trampoline is not
executed again and execution goes directly to the newly compiled
code. In the direct call case the callsite is changed so that the branch
or call instruction transfers control to the new address. In the virtual
call case the magic trampoline changes the virtual table slot
directly.

The sequence of instructions used by the JIT to implement a virtual
call is well known, and the magic trampoline (inspecting the registers
and the code sequence) can easily get the virtual table slot that was
used for the invocation. The idea here then is: if we know the virtual
table slot, we also know the method that is supposed to be compiled and
executed, since each vtable slot is assigned a unique method by the
class loader. This simple fact allows us to use a completely generic
trampoline in the virtual table slots, avoiding the creation of many
method-specific trampolines.
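
The following C sketch models this idea. It is an illustration under
assumptions, not Mono's code: the real magic trampoline recovers the
slot by decoding the call instruction and the registers, while here the
slot number is simply passed as an argument, and the names (vtable,
vmethod, generic_vtable_trampoline, vtable_init, virtual_call) are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define VTABLE_SLOTS 4   /* hypothetical, small for the example */

typedef struct vtable vtable;
typedef int (*vcode)(vtable *vt, int slot, int arg);

typedef struct vmethod {
    vcode body;          /* stand-in for the code the JIT would emit */
    int compiled;
} vmethod;

struct vtable {
    vmethod *methods[VTABLE_SLOTS]; /* slot -> method (class loader) */
    vcode slots[VTABLE_SLOTS];      /* slot -> code address */
};

/* One trampoline shared by every slot: the slot identifies the method,
 * so no per-method trampoline is needed. */
static int generic_vtable_trampoline(vtable *vt, int slot, int arg) {
    assert(slot >= 0 && slot < VTABLE_SLOTS && vt->methods[slot] != NULL);
    vmethod *m = vt->methods[slot];
    m->compiled = 1;                /* "JIT compile" the method */
    vt->slots[slot] = m->body;      /* patch the vtable slot itself */
    return m->body(vt, slot, arg);
}

/* Fill the first count slots; each one starts at the same trampoline. */
static void vtable_init(vtable *vt, vmethod **methods, int count) {
    assert(count >= 0 && count <= VTABLE_SLOTS);
    for (int i = 0; i < VTABLE_SLOTS; i++) {
        vt->methods[i] = i < count ? methods[i] : NULL;
        vt->slots[i] = i < count ? generic_vtable_trampoline : NULL;
    }
}

/* A virtual call: an indirect call through the vtable slot. */
static int virtual_call(vtable *vt, int slot, int arg) {
    return vt->slots[slot](vt, slot, arg);
}

/* Example method bodies. */
static int twice(vtable *vt, int slot, int arg) {
    (void)vt; (void)slot;
    return 2 * arg;
}

static int negate(vtable *vt, int slot, int arg) {
    (void)vt; (void)slot;
    return -arg;
}
```

Both slots start out holding the same generic trampoline; after the
first call through slot 1 that slot holds the method's own code, while
slot 0 is still untouched.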

In the cases above, the number of generated trampolines goes from
21000 to 7700 for MonoDevelop (saving 160 KB of memory), from 17000 to
5400 for the IronPython case and from 800 to 150 for the hello world
case.

Kinds of Trampolines and Thunks
===============================

This is a list of the trampolines and thunks used in Mono:

- create_fnptr
- load_aot_method
- imt thunk

        IMT stands for Interface Method Table; this thunk is used to
        dispatch calls to interface methods.

- jump table
- debugger code
- exception call filter
- trampoline (various types)
- throw corlib exception
- restore context
- throw exception by name
- handle stack overflow
- throw exception
- delegate invoke implementation
- cpuid code


Implementation for x86/x86-64
=============================

Usually code looks like this:

             mov <some address>, %r11
             call *0xfffffffc(%rax)

First, the call instruction can go directly to the compiled
address or to a trampoline.

If it goes to a trampoline, on amd64 it looks like the one above (on x86
it is different). Currently the trampoline is not modified, but it will
be in the future. On x86 the trampoline looks like:

        push constant
        jmp generic_trampoline

Note that constant can be a MonoMethod*, but it's not necessarily so
(these are the recent changes: this constant can be -1 or -2, the first
for the case of interface calls, the second for virtual calls).
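
As a sketch of what such a specific trampoline looks like in memory, the
C code below emits the two x86 instructions into a buffer, using the
standard encodings push imm32 (0x68) and jmp rel32 (0xE9), which add up
to 10 bytes. Whether Mono's emitter picks exactly these encodings is an
assumption, and emit_specific_trampoline, put_u32le and the parameter
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* push imm32 (5 bytes) + jmp rel32 (5 bytes) */
enum { SPECIFIC_TRAMPOLINE_SIZE = 10 };

/* Store a 32-bit value in little-endian byte order. */
static void put_u32le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* buf:          where the bytes are written
 * code_addr:    32-bit address at which the bytes will execute
 * constant:     a MonoMethod* value, or -1 / -2 as described above
 * generic_addr: 32-bit address of the generic trampoline */
static void emit_specific_trampoline(uint8_t *buf, uint32_t code_addr,
                                     uint32_t constant,
                                     uint32_t generic_addr) {
    assert(buf != NULL);
    buf[0] = 0x68;                      /* push imm32 */
    put_u32le(buf + 1, constant);
    buf[5] = 0xE9;                      /* jmp rel32 */
    /* rel32 is relative to the end of the jmp instruction */
    put_u32le(buf + 6,
              generic_addr - (code_addr + SPECIFIC_TRAMPOLINE_SIZE));
}
```

Emitting at address 0x1000 with the constant -1 and a generic trampoline
at 0x2000 produces the bytes 68 ff ff ff ff e9 f6 0f 00 00.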

Other architectures are similar in the semantics, but different in the
details.

The above is what happens for virtual calls.

For interface calls, three things can happen:

        1) the call goes directly to the method address.

        2) it goes to a trampoline as described above.

        3) it goes into an IMT collision resolution stub: this is a
           chunk of code that, based on the constant put inside the
           imt_register above, will perform a jump to the correct
           vtable slot for the interface method.

Note that the vtable slot itself could then contain a trampoline.
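
The third case can be pictured with the following C sketch. It is an
illustration under assumptions, not Mono's actual stub: the real stub is
generated machine code that jumps through the vtable, whereas here the
lookup is a plain loop, and IMT_SIZE, imt_entry, imt_slot, imt_resolve
and interface_call are hypothetical names. The key stands for the
constant loaded into the imt register at the call site.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IMT_SIZE 4   /* hypothetical; kept small to force collisions */

typedef int (*icode)(int);

/* One interface method that hashes to a given IMT slot. */
typedef struct imt_entry {
    uintptr_t key;       /* constant placed in the imt register */
    int vtable_slot;     /* vtable slot implementing that method */
} imt_entry;

/* All interface methods sharing one IMT slot. */
typedef struct imt_slot {
    const imt_entry *entries;
    size_t count;
} imt_slot;

static size_t imt_index(uintptr_t key) { return key % IMT_SIZE; }

/* Collision resolution: compare the key against each candidate in the
 * IMT slot and return the matching vtable slot, or -1 if none. */
static int imt_resolve(const imt_slot *imt, uintptr_t key) {
    const imt_slot *s = &imt[imt_index(key)];
    for (size_t i = 0; i < s->count; i++) {
        if (s->entries[i].key == key)
            return s->entries[i].vtable_slot;
    }
    return -1;
}

/* An interface call: resolve the key, then go through the vtable slot
 * (which, in the real runtime, may itself still hold a trampoline). */
static int interface_call(const imt_slot *imt, icode const *vtable,
                          uintptr_t key, int arg) {
    int slot = imt_resolve(imt, key);
    assert(slot >= 0);
    return vtable[slot](arg);
}

/* Example method bodies. */
static int inc(int x) { return x + 1; }
static int dbl(int x) { return 2 * x; }
```

With IMT_SIZE set to 4, keys 1 and 5 land in the same IMT slot, so the
stub has to compare keys to pick the right vtable slot; a key that is
alone in its IMT slot resolves after a single comparison.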

Some functions that are used here:

emit-x86.c (arch_create_jit_trampoline): return the JIT trampoline function

emit-x86.c (x86_magic_trampoline): contains the code to detect the caller and
patch the call instruction.

emit-x86.c (arch_compile_method): JIT compile a method

Call Sites
==========

There are three basic kinds of call sites:

        1) normal calls:
                call relative_displacement

        2) virtual calls:
                call *positive_offset(%register)

        3) interface calls:
                mov constant, %imt_register
                call *negative_offset(%register)

The above is what happens on x86 and amd64, with a different
%imt_register on each arch (this register is fixed, but it could
change with different Mono builds; it should probably be one of the
constants the runtime communicates to the debugger). %register can
change from callsite to callsite, depending on the register allocator's
choices. Note that the constant for the interface calls won't
necessarily be a MonoMethod address: this could change in the future
to a simple number.

In all three cases the JIT trampolines will need to inspect the call
site, but only in the first case will the call site be changed.
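
Inspecting a call site amounts to looking at the bytes just before the
return address. The C sketch below recognizes the standard x86
encodings call rel32 (E8), call *disp8(%reg) (FF /2, mod 01) and
call *disp32(%reg) (FF /2, mod 10), and patches only the first kind. It
is a simplified illustration, not the logic of x86_magic_trampoline:
decoding backwards is ambiguous in general (a byte of a preceding
instruction can look like an opcode), and classify_call,
patch_direct_call and direct_call_target are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CALL_UNKNOWN, CALL_DIRECT, CALL_INDIRECT } call_kind;

/* ret points just past the call instruction, i.e. where the return
 * address points. At least 6 bytes before ret must be readable. */
static call_kind classify_call(const uint8_t *ret) {
    assert(ret != NULL);
    /* call rel32: E8 followed by a 4-byte displacement */
    if (ret[-5] == 0xE8)
        return CALL_DIRECT;
    /* call *disp32(%reg): FF, ModRM = 10 010 rrr, 4-byte displacement
     * (rrr = 100 would need a SIB byte and is not handled here) */
    if (ret[-6] == 0xFF && (ret[-5] & 0xF8) == 0x90 && (ret[-5] & 7) != 4)
        return CALL_INDIRECT;
    /* call *disp8(%reg): FF, ModRM = 01 010 rrr, 1-byte displacement */
    if (ret[-3] == 0xFF && (ret[-2] & 0xF8) == 0x50 && (ret[-2] & 7) != 4)
        return CALL_INDIRECT;
    return CALL_UNKNOWN;
}

/* Rewrite the rel32 of a direct call so that it reaches target.
 * ret_addr is the 32-bit address that ret corresponds to. */
static void patch_direct_call(uint8_t *ret, uint32_t ret_addr,
                              uint32_t target) {
    assert(classify_call(ret) == CALL_DIRECT);
    uint32_t rel = target - ret_addr;   /* relative to the next insn */
    ret[-4] = (uint8_t)(rel & 0xff);
    ret[-3] = (uint8_t)((rel >> 8) & 0xff);
    ret[-2] = (uint8_t)((rel >> 16) & 0xff);
    ret[-1] = (uint8_t)((rel >> 24) & 0xff);
}

/* Read back the absolute target of a direct call. */
static uint32_t direct_call_target(const uint8_t *ret, uint32_t ret_addr) {
    uint32_t rel = (uint32_t)ret[-4] | ((uint32_t)ret[-3] << 8) |
                   ((uint32_t)ret[-2] << 16) | ((uint32_t)ret[-1] << 24);
    return ret_addr + rel;
}
```

For a direct call whose return address is 0x1006, patching the target to
0x3000 stores the displacement 0x1ffa, so the displacement bytes become
fa 1f 00 00. Virtual and interface call sites are only classified; what
gets patched for a virtual call is the vtable slot, not the call site.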