A new JIT compiler for the Mono Project

Miguel de Icaza (miguel@{ximian.com,gnome.org}),
Mini is a new compilation engine for the Mono runtime. The
new engine is designed to bring new code generation
optimizations, portability and pre-compilation.

In this document we describe the design decisions and the
architecture of the new compilation engine.
Mono is an Open Source implementation of the .NET Framework: it
is made up of a runtime engine that implements the ECMA Common
Language Infrastructure (CLI), a set of compilers that target
the CLI, and a large collection of class libraries.

This article discusses the new code generation facilities that
have been added to the Mono runtime.
First we discuss the overall architecture of the Mono runtime,
and how code generation fits into it; then we discuss the
development and basic architecture of our first JIT compiler
for the ECMA CIL framework. The next section covers the
objectives for the work on the new JIT compiler; then we
discuss the new features available in the new JIT compiler,
and finally we give a technical description of the new code
generation engine.

* Architecture of the Mono Runtime
The Mono runtime is an implementation of the ECMA Common
Language Infrastructure (CLI), whose aim is to be a common
platform for executing code in multiple languages.

Languages that target the CLI generate images that contain
code in a high-level intermediate representation called the
"Common Intermediate Language". This intermediate language is
rich enough to allow for programs and pre-compiled libraries
to be reflected. The execution model is object oriented, with
single inheritance and multiple interface implementations.
This runtime provides a number of services for programs that
are targeted to it: Just-in-Time compilation of CIL code into
native code, garbage collection, thread management, I/O
routines, single, double and decimal floating point,
asynchronous method invocation, application domains, a
framework for building arbitrary RPC systems (remoting), and
integration with system libraries through the Platform Invoke
functionality.

The focus of this document is on the services provided by the
Mono runtime to transform CIL bytecodes into code that is
native to the underlying architecture.
The code generation interface is a set of macros that allow a
C programmer to generate code on the fly; these macros are
found in the mono/jit/arch/ directory. They are used by the
JIT compiler to generate native code.
The platform invocation code is interesting, as it generates
CIL code on the fly to marshal parameters; this code is then
in turn processed by the JIT engine.
* Previous Experiences

Mono has included a JIT engine, which has been used to bootstrap
Mono since January 2002. This JIT engine has reasonable
performance, and uses a tree pattern matching instruction
selector based on the BURS technology. This JIT compiler was
designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.
The existing JIT compiler has three phases:

  * Re-creation of the semantic tree from CIL.

  * Instruction selection, with a cost-driven engine.

  * Code generation and register allocation.
It is also hooked into the rest of the runtime to provide
services like marshaling, just-in-time compilation and
invocation of "internal calls".

This engine constructed a collection of trees, which we
referred to as the "forest of trees"; this forest is created by
"hydrating" the CIL instruction stream.
The first step was to identify the basic blocks in the method,
and to compute the control flow graph (cfg) for it. Once this
information was computed, a stack analysis on each basic block
was performed to create a forest of trees for each one of
them.
So for example, the following statement:

    int a, b;
    ...
    b = a + 1;

Which would be represented in CIL as:

    ldloc.0
    ldc.i4.1
    add
    stloc.1

After the stack analysis would create the following tree:

    (STIND_I4 ADDR_L[EBX|2] (
        ADD (LDIND_I4 ADDR_L[ESI|1])
            CONST_I4[1]))
This tree contains information from the stack analysis: for
instance, notice that the operations explicitly encode the
data types they are operating on. There is no longer any
ambiguity about the types, because this information has been
inferred from the CIL instruction stream.

At this point the JIT would pass the constructed forest of
trees to the architecture-dependent JIT compiler.
The architecture-dependent code then performed register
allocation (optionally using linear scan allocation for
variables, based on liveness analysis).

Once variables had been assigned, tree pattern matching with
dynamic programming was used (the tree pattern matcher is
custom built for each architecture, using a code
generator: monoburg). The instruction selector used cost
functions to select the best instruction patterns.
The instruction selector is able to produce instructions that
take advantage of the x86 indexed addressing modes, for
example.
One problem though is that the code emitter and the register
allocator did not have any visibility outside the current
tree, which meant that some redundant instructions were
generated. A peephole optimizer with this architecture was
hard to write, given the tree-based representation that is
used.
This JIT was functional, but it did not provide a good
architecture on which to base future optimizations. Also, the
line between architecture-neutral and architecture-specific
code and optimizations was hard to draw.

The JIT engine supported two code generation modes to support
the two optimization modes for applications that host multiple
application domains: generate code that will be shared across
application domains, or generate code that will not be shared
across application domains.
* Objectives of the new JIT engine.

We wanted to support a number of features that were missing:

  * Ahead-of-time compilation.

    The idea is to allow developers to pre-compile their code
    to native code to reduce startup time, and to reduce the
    working set used at runtime by the just-in-time compiler.
    Although in Mono this has not been a visible problem, we
    wanted to proactively address it.

    When an assembly (a Mono/.NET executable) is installed in
    the system, it would then be possible to pre-compile the
    code, and have the JIT compiler tune the generated code
    to the particular CPU on which the software is
    installed.

    This is done in the Microsoft .NET world with a tool
    called ngen.exe.
  * Have a good platform for doing code optimizations.

    The design called for a good architecture that would
    enable various levels of optimizations: some
    optimizations are better performed on high-level
    intermediate representations, some on medium-level, and
    some on low-level representations.

    Also, it should be possible to conditionally turn these on
    or off. Some optimizations are too expensive to be used
    in just-in-time compilation scenarios, but these
    expensive optimizations can be turned on for
    ahead-of-time compilations or when using profile-guided
    optimizations on a subset of the executed methods.
  * Reduce the effort required to port the Mono code
    generator to new architectures.

    For Mono to gain wide adoption in the Unix world, it is
    necessary that the JIT engine works on most of today's
    commercial hardware platforms.
* Features of the new JIT engine.

The new JIT engine was architected by Dietmar Maurer and Paolo
Molaro, based on the new objectives.

Mono provides a number of services to applications running
with the new JIT compiler:

  * Just-in-Time compilation of CLI code into native code.

  * Ahead-of-Time compilation of CLI code, to reduce
    startup time of applications.
A number of software development features are also available:

  * Execution time profiling (--profile)

    Generates a report of the time consumed by each routine,
    the number of invocations, and the callers.

  * Memory usage profiling (--profile)

    Generates a report of the memory usage of a program
    that is run under the Mono JIT.
  * Code coverage (--coverage)

People who are interested in developing and improving the Mini
JIT compiler will also find a few useful facilities:

  * JIT compilation timing.

    This is used to measure the time spent in the JIT
    engine when compiling a routine.
  * Control Flow Graph and Dominator Tree drawing.

    These are visual aids for the JIT developer: they
    render representations of the control flow graph, and,
    for the more advanced optimizations, they draw the
    dominator tree graph.

    This requires Dot (from the graphviz package) and
    Ghostview.
  * Code generator regression tests.

    The engine contains support for running regression
    tests on the virtual machine, which is very helpful to
    developers interested in improving the engine.

  * Optimization benchmark framework.

    The JIT engine will generate graphs that compare
    various benchmarks embedded in an assembly, and run the
    various tests with different optimization flags.

    This requires Perl and GD::Graph.
* Flexibility

This is probably the most important component of the new code
generation engine. The internals are relatively easy to
replace and update; even large passes can be replaced and
implemented differently.

Compiling a method begins with the `mini_method_to_ir' routine,
which converts the CIL representation into a medium-level
intermediate representation.
The mini_method_to_ir routine performs a number of operations:

  * Flow analysis and control flow graph computation.

    Unlike the previous version, stack analysis and control
    flow graphs are computed in a single pass in the
    mini_method_to_ir function. This is done for performance
    reasons: although the complexity increases, the benefit
    for a JIT compiler is that there is more time available
    for performing other optimizations.
  * Basic block computation.

    mini_method_to_ir populates the MonoCompile structure
    with an array of basic blocks, each of which contains a
    forest of trees made up of MonoInst structures.

  * Inlining.

    Inlining is no longer restricted to methods containing
    a single basic block; instead it is possible to inline
    arbitrarily complex methods.

    The heuristics to choose what to inline are likely going
    to be tuned in the future.
308 * Method to opcode conversion.
310 Some method call invocations like `call Math.Sin' are
311 transformed into an opcode: this transforms the call
312 into a semantically rich node, which is later inline
313 into an FPU instruction.
315 Various Array methods invocations are turned into
316 opcodes as well (The Get, Set and Address methods)
  * Tail recursion elimination.
The MonoInst structure holds the actual decoded instruction,
with the semantic information from the stack analysis.
MonoInst is interesting because initially it is part of a tree
structure. Here is a sample of the same tree with the new JIT
engine:

    (stind.i4 regoffset[0xffffffd4(%ebp)]
        (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
             iconst[1]))
This is a medium-level intermediate representation (MIR).

Some complex opcodes are decomposed at this stage into a
collection of simpler opcodes. Not every complex opcode is
decomposed at this stage, as we need to preserve the semantic
information during the various optimization phases.

For example, a NEWARR opcode carries the length and the type of
the array, which could be used later to avoid type checks or
array bounds checks.

There are a number of operations supported on this
representation:
  * Branch optimizations.

  * Loop optimizations: the dominator trees are
    computed, loops are detected, and their nesting
    level is computed.

  * Conversion of the method into static single assignment
    form (SSA).

  * Dead code elimination.

  * Constant propagation.
Once the above optimizations are optionally performed, a
decomposition phase is used to turn some complex opcodes into
internal method calls. In the initial version of the JIT
engine, various operations on longs are emulated instead of
being inlined. Also, the newarr invocation is turned into a
call into the runtime.
At this point, after computing variable liveness, it is
possible to use the linear scan algorithm for allocating
variables to registers. The linear scan pass uses the
information that was previously gathered by the loop nesting
and loop structure computation to favor variables in inner
loops.

Stack space is then reserved for the local variables and any
temporary variables generated during the various
optimization phases.
** Instruction selection

At this point, the BURS instruction selector is invoked to
transform the tree-based representation into a list of
instructions. This is done using a tree pattern matcher that
is generated for the architecture using the `monoburg' tool.

Monoburg takes as input a file that describes tree patterns,
which are matched against the trees that were produced by the
engine in the previous stages.
The pattern matching might have more than one match for a
particular tree. In this case, the match selected is the one
whose cost is the smallest. A cost can be attached to each
rule; if no cost is provided, the implicit cost is one.
Smaller costs are selected over higher costs.

The cost function can be used to select particular blocks of
code for a given architecture, or, by using a prohibitively
high number, to avoid having the rule match.
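A rule in such a BURG-style grammar pairs a tree pattern with a cost and an emit action. The sketch below is illustrative only: the nonterminal names, opcodes and costs are invented, not copied from the real inssel grammars.

```
reg: OP_ADD (reg, reg) 1 {
	/* emit code for: dest = left + right */
}

reg: OP_ADD (reg, OP_ICONST) 1 {
	/* cheaper pattern: fold the constant operand
	   directly into the add instruction */
}
```

When both rules match a tree, the matcher picks the one whose total cost over the whole tree is smallest.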
The various rules that our JIT engine uses transform a tree of
MonoInsts into a list of MonoInsts:

    +-----------------------------------------------------------+
    |  Tree                                           List      |
    |  of       ===>  Instruction selection  ===>     of        |
    |  MonoInst                                       MonoInst  |
    +-----------------------------------------------------------+
During this process, various "types" of MonoInst kinds
disappear and are turned into lower-level representations. The
JIT compiler just happens to reuse the same structure (this is
done to reduce memory usage and improve memory locality).
The instruction selection rules are split into a number of
files, each one with a particular purpose:

  * inssel.brg

    Contains the generic instruction selection
    patterns.

  * inssel-x86.brg

    Contains x86 specific rules.

  * inssel-ppc.brg

    Contains PowerPC specific rules.

  * inssel-long32.brg

    burg file for 64-bit instructions on 32-bit
    architectures.

  * inssel-long.brg

    burg file for 64-bit architectures.

  * inssel-float.brg

    burg file for floating point instructions.

For a given build, a set of those files is included.
For example, for the build of Mono on the x86, the following
files are used:

    inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg
** Native method generation

The native method generation has a number of steps:

  * Architecture specific register allocation.

    The information about loop nesting that was
    previously gathered is used here to hint the
    register allocator.

  * Generating the method prolog/epilog.

  * Optionally generating code to introduce tracing facilities.

  * Hooking into the debugger.

  * Performing any pending fixups.
The actual code generation is contained in the architecture
specific portion of the compiler. The input to the code
generator is each one of the basic blocks with its list of
instructions that were produced in the instruction selection
phase.

During the instruction selection phase, virtual registers are
assigned. Just before the peephole optimization is performed,
physical registers are assigned.
A simple peephole and algebraic optimizer is run at this
point.

The peephole optimizer removes some redundant operations at
this stage. This is possible because the code generation at
this point has visibility into the basic block that spans the
original trees.

The algebraic optimizer performs some simple algebraic
optimizations that replace expensive operations with cheaper
operations when possible.
The rest of the code generation is fairly simple: a switch
statement is used to generate code for each of the MonoInsts.

We always try to allocate code in sequence, instead of just using
malloc. This way we increase spatial locality, which gives a massive
speedup on most architectures.
*** Ahead of Time compilation

Ahead-of-Time (AOT) compilation is a new feature of our new
compilation engine. The compilation engine is shared by the
Just-in-Time (JIT) compiler and the Ahead-of-Time compiler.

The difference is in the set of optimizations that are turned
on for each mode: Just-in-Time compilation should be as fast
as possible, while Ahead-of-Time compilation can take as long
as required, because it is not done at a time-critical
point.
With AOT compilation, we can afford to turn on all of the
computationally expensive optimizations.

After the code generation phase is done, the code and any
required fixup information are saved into a file that is
readable by "as" (the native assembler available on all
systems). This assembly file is then passed to the native
assembler, which generates a loadable module.
At execution time, when an assembly is loaded from the disk,
the runtime engine will probe for the existence of a
pre-compiled image. If the pre-compiled image exists, then it
is loaded, and the method invocations are resolved to the code
contained in the loaded module.
The code generated under the AOT scenario is slightly
different from the JIT scenario: it is
application-domain relative and can be shared among
multiple application domains.

This is the same code generation that is used when the runtime
is instructed to maximize code sharing in a multi-application
domain scenario.
* SSA-based optimizations

SSA form simplifies many optimizations because each variable
has exactly one definition site. This means that each
variable is only initialized once.
For example, code like this:

    a = 1
    ...
    a = 2
    call (a)

Is internally turned into:

    a1 = 1
    ...
    a2 = 2
    call (a2)

In the presence of branches, like:

    if (x)
        a = 1
    else
        a = 2
    call (a)

The code is turned into:

    if (x)
        a1 = 1
    else
        a2 = 2
    a3 = phi (a1, a2)
    call (a3)

All uses of a variable are "dominated" by its definition.

This representation is useful as it simplifies the
implementation of a number of optimizations like conditional
constant propagation, array bounds check removal and dead code
elimination.
* Register allocation.

Global register allocation is performed on the medium-level
intermediate representation just before instruction selection
is performed on the method. Local register allocation is
later performed at the basic-block level on the
lower-level representation.

Global register allocation uses the following input:

1) The set of register-sized variables that can be allocated to a
register (this is an architecture-specific setting; for x86
these are the callee-saved registers ESI, EDI and
EBX).

2) Liveness information for the variables.

3) (Optionally) loop info to favor variables that are used in
inner loops.
During the instruction selection phase, symbolic registers are
assigned to temporary values in expressions.

Local register allocation assigns hard registers to these
symbolic registers; it is performed at the basic block level,
just before the code is actually emitted.
A CPU description file describes the input registers, output
registers, fixed registers and the registers clobbered by each
operation.
* BURG Code Generator Generator

monoburg was written by Dietmar Maurer. It is based on the
papers by Christopher W. Fraser, Robert R. Henry and Todd
A. Proebsting: "BURG - Fast Optimal Instruction Selection and
Tree Parsing" and "Engineering a Simple, Efficient Code
Generator Generator".
The original BURG implementation is unable to work on DAGs; only
trees are allowed. Our monoburg implementation is able to generate a
tree matcher which works on DAGs, and we use this feature in the new
JIT. This simplifies the code because we can directly pass DAGs and
don't need to convert them to trees.

* Future work
Profile-based optimization is something that we are very
interested in supporting. There are two possible usage
scenarios:

  * Based on the profile information gathered during
    the execution of a program, hot methods can be compiled
    with the highest level of optimizations, while bootstrap
    code and cold methods can be compiled with the least set
    of optimizations and placed in a discardable list.
  * Code reordering: this profile-based optimization would
    only make sense for pre-compiled code. The profile
    information is used to re-order the assembly code on disk
    so that the code is placed on the disk in a way that
    improves locality.

    This is the same principle on which SGI's cord program
    operates.
The nature of the CIL makes the above optimizations
easy to implement and deploy. Since we define the whole
universe in which these programs run, no interaction with
system tools is required, nor are any upgrades of the
underlying infrastructure required.
Instruction scheduling is important for certain kinds of
processors, and some of the framework to cope with this exists
today in our register allocator and instruction selector,
but it has not been finished. The instruction scheduling
would happen at the same time as local register allocation.