docs/jit-thoughts

   1 Just some thoughts for the JITer:
   2
   3 General issues:
   4 ===============
   5
   6 We are designing a JIT compiler, so we have to consider two things:
   7
   8 - the quality of the generated code
   9 - the time needed to generate that code
  10
  11 The current approach is to keep the JITer as simple as possible, and thus as
  12 fast as possible. The generated code quality will suffer from that.
  13
  14 We do not map local variables to registers at the moment, and this makes the
  15 whole JIT much easier, for example we do not need to identify basic block
  16 boundaries or the lifetime of local variables, or select the variables which
  17 are worth to put into a register.
  18
  19 Register allocation is thus done only inside the trees of the forest, and each
  20 tree can use the full set of registers. We simply split a tree if we get out of
  21 registers, for example the following tree:
  22
  23
  24               add(R0)
  25              /   \
  26             /     \
  27            a(R0)  add(R1)
  28                  /   \
  29                 /     \
  30                b(R1)  add(R2)
  31                      /   \
  32                     /     \
  33                    c(R2)   b(R3)
  34
  35 can be transformed to:
  36
  37
  38        stloc(t1)         add(R0)
  39          |              /   \
  40          |             /     \
  41         add(R0)       a(R0)  add(R1)
  42        /   \                /   \
  43       /     \              /     \
  44      c(R0)   b(R1)        b(R1)  t1(R2)
  45
  46
  47 Please notice that the split trees use less registers than the original
  48 tree.
  49
  50 Triggering JIT compilation:
  51 ===========================
  52
  53 The current approach is to call functions indirectly. The address to call is
  54 stored in the MonoMethod structure. For each method we create a trampoline
  55 function. When called, this function does the JIT compilation and replaces the
  56 trampoline with the compiled method address.
  57
  58 Register Allocation:
  59 ====================
  60
  61 With lcc you can assign a fixed register to a tree before register
  62 allocation. For example this is needed by call, which return the value always
  63 in EAX on x86. The current implementation works without such system, due to
  64 special forest generation.
  65
  66
  67 X86 Register Allocation:
  68 ========================
  69
  70 We can use 8bit or 16bit registers on the x86. If we use that feature we have
  71 more registers to allocate, which maybe prevents some register spills. We
  72 currently ignore that ability and always allocate 32 bit registers, because I
  73 think we would gain very little from that optimisation and it would complicate
  74 the code.
  75
  76 Different Register Sets:
  77 ========================
  78
  79 Most processors have more that one register set, at least one for floating
  80 point values, and one for integers. Should we support architectures with more
  81 that two sets? Does someone knows such an architecture?
  82
  83 64bit Integer Values:
  84 =====================
  85
  86 I can imagine two different implementation. On possibility would be to treat
  87 long (64bit) values simply like any other value type. This implies that we
  88 call class methods for ALU operations like add or sub. Sure, this method will
  89 be be a bit inefficient.
  90
  91 The more performant solution is to allocate two 32bit registers for each 64bit
  92 value. We add a new non terminal to the monoburg grammar called long_reg. The
  93 register allocation routines takes care of this non terminal and allocates two
  94 registers for them.
  95
  96
  97 Forest generation:
  98 ==================
  99
 100 It seems that trees generated from the CIL language have some special
 101 properties, i.e. the trees already represents basic blocks, so there can be no
 102 branches to the inside of such a tree. All results of those trees are stored to
 103 memory.
 104
 105 One idea was to drive the code generation directly from the CIL code, without
 106 generating an intermediate forest of trees. I think this is not possible,
 107 because you always have to gather some attributes and attach it to the
 108 instruction (for example the register allocation info). So I thing generating a
 109 tree is the right thing and that also works perfectly with monoburg. IMO we
 110 would not get any benefit from trying to feed monoburg directly with CIL
 111 instructions.
 112
 113 DAG handling:
 114 =============
 115
 116 Monoburg can't handle DAGs, instead we need real trees as input for
 117 the code generator. So we have two problems:
 118
 119 1.) DUP instruction: This one is obvious - we need to store the value
 120 into a temporary variable to solve the problem.
 121
 122 2.) function calls: Chapter 12.8, page 343 of "A retargetable C compiler"
 123 explains that: "because listing a call node will give it a hidden reference
 124 from the code list". I don't understand that (can someone explain that?), but
 125 there is another reason to save return values to temporaries: Consider the
 126 following code:
 127
 128 x = f(y) + g(z); // all functions return integers
 129
 130 We could generate such a tree for this expression: STLOC(ADD(CALL,CALL))
 131
 132 The problem is that both calls returns the value in the same register,
 133 so it is non trivial to generate code for that tree. We must copy one
 134 register into another one, which make register allocation more complex.
 135 The easier solution is store the result of function calls to
 136 temporaries. This leads to the following forest:
 137
 138 STLOC(CALL)
 139 STLOC(CALL)
 140 STLOC(ADD (LDLOC, LDLOC))
 141
 142 This is what lcc is doing, if I understood 12.8, page 342, 343?
 143
 144 Value Types:
 145 ============
 146
 147 The only CLI instructions which can handle value types are loads and stores,
 148 either to local variable, to the stack or to array elements. Value types with a
 149 size smaller than sizeof(int) are handled like any other basic type. For other
 150 value types we load the base address and emit block copies to store them.
 151
 152 Possible Optimisations:
 153 =======================
 154
 155 Miguel said ORP does some optimisation on IL level, for example moving array
 156 bounds checking out of loops:
 157
 158 for (i = 0; i < N; i++) { check_range (a, i); a [i] = X; }
 159
 160 id transformed to:
 161
 162 if (in_range (a, 0, N)) { for (i = 0; i < N; i++) a[i] = X; }
 163 else for (i = 0; i < N; i++) { check_range (a, i); a [i] = X; }
 164
 165 The else is only to keep original semantics (exception handling).
 166