           A new JIT compiler for the Mono Project

       Miguel de Icaza (miguel@{ximian.com,gnome.org}),
       Paolo Molaro (lupus@{ximian.com,debian.org})

This document describes the overall design of the Mono JIT up to
version 2.0.  After Mono 2.0 the JIT engine was upgraded from a
tree-based intermediate representation to a linear intermediate
representation.

The Linear IL is documented here:

    http://www.mono-project.com/Linear_IL

* Abstract

Mini is a new compilation engine for the Mono runtime.  The new
engine is designed to bring new code generation optimizations,
portability and pre-compilation.

In this document we describe the design decisions and the
architecture of the new compilation engine.

* Introduction

Mono is an Open Source implementation of the .NET Framework: it is
made up of a runtime engine that implements the ECMA Common
Language Infrastructure (CLI), a set of compilers that target the
CLI and a large collection of class libraries.

This article discusses the new code generation facilities that
have been added to the Mono runtime.

First we discuss the overall architecture of the Mono runtime and
how code generation fits into it; then we discuss the development
and basic architecture of our first JIT compiler for the ECMA CIL
framework.  The next section covers the objectives for the work on
the new JIT compiler, then we discuss the new features available
in the new JIT compiler, and finally we give a technical
description of the new code generation engine.

* Architecture of the Mono Runtime

The Mono runtime is an implementation of the ECMA Common Language
Infrastructure (CLI), whose aim is to be a common platform for
executing code in multiple languages.

Languages that target the CLI generate images that contain code in
a high-level intermediate representation called the "Common
Intermediate Language".  This intermediate language is rich enough
to allow for programs and pre-compiled libraries to be reflected.
The execution environment is object oriented, with single
inheritance and multiple interface implementations.

This runtime provides a number of services for programs that are
targeted to it: Just-in-Time compilation of CIL code into native
code, garbage collection, thread management, I/O routines, single,
double and decimal floating point, asynchronous method invocation,
application domains, a framework for building arbitrary RPC
systems (remoting) and integration with system libraries through
the Platform Invoke functionality.

The focus of this document is on the services provided by the Mono
runtime to transform CIL bytecodes into code that is native to the
underlying architecture.

The code generation interface is a set of macros that allow a C
programmer to generate code on the fly; these macros are found in
the mono/jit/arch/ directory and are used by the JIT compiler to
generate native code.
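As a minimal sketch of how this macro interface is used, the
fragment below emits a tiny function at runtime with the x86
macros.  The macro names (x86_mov_reg_imm, x86_ret, X86_EAX) are
the x86 backend's own; the include path and the mmap-based buffer
handling are simplifying assumptions, not the actual Mono build
setup:

    /* Minimal sketch: emit "mov eax, 42; ret" at runtime and call
     * it.  Assumes the x86 emission macros; buffer handling is
     * simplified and error checking is omitted. */
    #include <stdio.h>
    #include <sys/mman.h>

    #include "x86-codegen.h"    /* the macro-based emission interface */

    int
    main (void)
    {
        /* A small buffer that is both writable and executable. */
        unsigned char *buf = mmap (NULL, 4096,
            PROT_READ | PROT_WRITE | PROT_EXEC,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned char *code = buf;

        /* Each macro writes the bytes of one instruction and
         * advances the `code' pointer. */
        x86_mov_reg_imm (code, X86_EAX, 42);
        x86_ret (code);

        /* Call the freshly generated function. */
        int (*fun) (void) = (int (*) (void)) buf;
        printf ("%d\n", fun ());    /* prints 42 */
        return 0;
    }

Each supported CPU has its own set of macros in this style in its
$arch-codegen.h header, which is what the architecture-specific
parts of the JIT are written against.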
The platform invocation code is interesting, as it generates CIL
code on the fly to marshal parameters; this code is in turn
processed by the JIT engine.

Mono has now gone through three different JIT engines:

  * Original JIT engine: 2002, hard to port, hard to implement
    optimizations.

  * Second JIT engine, used up until Mono 2.0: very portable,
    many new optimizations.

  * Third JIT engine: replaced the code generation layer from
    being based on a tree representation to being based on a
    linear representation.

For more information on the code generation changes, see our web
site for the details on the Linear IL:

    http://www.mono-project.com/Linear_IL

* Previous Experiences

Mono has built a JIT engine, which has been used to bootstrap Mono
since January 2002.  This JIT engine has reasonable performance,
and uses a tree-pattern-matching instruction selector based on the
BURS technology.  This JIT compiler was designed by Dietmar
Maurer, Paolo Molaro and Miguel de Icaza.

The existing JIT compiler has three phases:

  * Re-creation of the semantic tree from CIL byte-codes.

  * Instruction selection, with a cost-driven engine.

  * Code generation and register allocation.

It is also hooked into the rest of the runtime to provide services
like marshaling, just-in-time compilation and invocation of
"internal calls".

This engine constructed a collection of trees, which we referred
to as the "forest of trees"; this forest is created by "hydrating"
the CIL instruction stream.

The first step was to identify the basic blocks in the method and
compute the control flow graph (cfg) for it.  Once this
information was computed, a stack analysis on each basic block was
performed to create a forest of trees for each one of them.

So for example, the following statement:

    int a, b;
    ...
    b = a + 1;

would be represented in CIL as:

    ldloc.0
    ldc.i4.1
    add
    stloc.1

The stack analysis would create the following tree:

    (STIND_I4
        ADDR_L[EBX|2]
        (ADD
            (LDIND_I4 ADDR_L[ESI|1])
            CONST_I4[1]))

This tree contains information from the stack analysis: for
instance, notice that the operations explicitly encode the data
types they are operating on.  There is no longer any ambiguity
about the types, because this information has been inferred.
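To make the "hydration" step concrete, the following
self-contained sketch shows the mechanism: the evaluation stack
holds pointers to tree nodes instead of values, so each CIL opcode
either pushes a leaf or combines the subtrees it pops.  The node
layout and opcode names are invented for the example; they are not
the actual Mono structures.

    /* Illustrative sketch (not the actual Mono code): building
     * the tree for "ldloc.0; ldc.i4.1; add; stloc.1" with an
     * evaluation stack of tree-node pointers. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { OP_LDLOC, OP_CONST_I4, OP_ADD, OP_STLOC } NodeOp;

    typedef struct Node {
        NodeOp op;
        struct Node *left, *right;
        int value;              /* constant value or local slot */
    } Node;

    static Node *stack[64];
    static int sp;

    static Node *
    new_node (NodeOp op, Node *left, Node *right, int value)
    {
        Node *n = calloc (1, sizeof (Node));
        n->op = op;
        n->left = left;
        n->right = right;
        n->value = value;
        return n;
    }

    int
    main (void)
    {
        /* ldloc.0 and ldc.i4.1 push leaf nodes. */
        stack[sp++] = new_node (OP_LDLOC, NULL, NULL, 0);
        stack[sp++] = new_node (OP_CONST_I4, NULL, NULL, 1);

        /* add pops its two operands and pushes the combined tree;
         * this is also where the result type is inferred. */
        Node *right = stack[--sp];
        Node *left = stack[--sp];
        stack[sp++] = new_node (OP_ADD, left, right, 0);

        /* stloc.1 pops the expression and roots a statement tree,
         * which would be appended to the forest of the current
         * basic block. */
        Node *root = new_node (OP_STLOC, stack[--sp], NULL, 1);

        printf ("rooted a statement tree at op %d\n", root->op);
        return 0;
    }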
At this point the JIT would pass the constructed forest of trees
to the architecture-dependent JIT compiler.

The architecture-dependent code then performed register allocation
(optionally using linear scan allocation for variables, based on
liveness analysis).  Once variables had been assigned, tree
pattern matching with dynamic programming was used (the tree
pattern matcher is custom built for each architecture, using a
code generator: monoburg).  The instruction selector used cost
functions to select the best instruction patterns.

The instruction selector is able to produce instructions that take
advantage of the x86 indexed addressing modes, for example.

One problem, though, is that the code emitter and the register
allocator did not have any visibility outside the current tree,
which meant that some redundant instructions were generated.  A
peephole optimizer with this architecture was hard to write, given
the tree-based representation that was used.

This JIT was functional, but it did not provide a good
architecture on which to base future optimizations.  Also, the
line between architecture-neutral and architecture-specific code
and optimizations was hard to draw.

The JIT engine supported two code generation modes to support the
two optimization modes for applications that host multiple
application domains: generate code that will be shared across
application domains, or generate code that will not be shared
across application domains.

* Second Generation JIT engine

We wanted to support a number of features that were missing:

  * Ahead-of-time compilation.

    The idea is to allow developers to pre-compile their code to
    native code to reduce startup time, and to reduce the working
    set that is used at runtime in the just-in-time compiler.

    Although in Mono this has not been a visible problem, we
    wanted to proactively address it.

    When an assembly (a Mono/.NET executable) is installed in the
    system, it would then be possible to pre-compile the code, and
    have the JIT compiler tune the generated code to the
    particular CPU on which the software is installed.

    This is done in the Microsoft .NET world with a tool called
    ngen.exe.

  * Have a good platform for doing code optimizations.

    The design called for a good architecture that would enable
    various levels of optimizations: some optimizations are better
    performed on high-level intermediate representations, some on
    medium-level and some on low-level representations.

    Also it should be possible to conditionally turn these on or
    off.  Some optimizations are too expensive to be used in
    just-in-time compilation scenarios, but these expensive
    optimizations can be turned on for ahead-of-time compilations
    or when using profile-guided optimizations on a subset of the
    executed methods.

  * Reduce the effort required to port the Mono code generator to
    new architectures.

    For Mono to gain wide adoption in the Unix world, it is
    necessary that the JIT engine works on most of today's
    commercial hardware platforms.

* Features of the Second JIT engine

The new JIT engine was architected by Dietmar Maurer and Paolo
Molaro, based on the new objectives.

Mono provides a number of services to applications running with
the new JIT compiler:

  * Just-in-Time compilation of CLI code into native code.

  * Ahead-of-Time compilation of CLI code, to reduce startup time
    of applications.

A number of software development features are also available:

  * Execution time profiling (--profile)

    Generates a report of the time consumed by routines, their
    invocation counts, and their callers.

  * Memory usage profiling (--profile)

    Generates a report of the memory usage by a program that is
    run under the Mono JIT.

  * Code coverage (--coverage)

  * Execution tracing.

People who are interested in developing and improving the Mini JIT
compiler will also find a few useful routines:

  * Compilation times

    This is used to measure the time the JIT takes to compile a
    routine.

  * Control Flow Graph and Dominator Tree drawing.

    These are visual aids for the JIT developer: they render
    representations of the Control Flow graph, and for the more
    advanced optimizations, they draw the dominator tree graph.

    This requires Dot (from the graphviz package) and Ghostview.

  * Code generator regression tests.

    The engine contains support for running regression tests on
    the virtual machine, which is very helpful to developers
    interested in improving the engine.

  * Optimization benchmark framework.

    The JIT engine will generate graphs that compare various
    benchmarks embedded in an assembly, and run the various tests
    with different optimization flags.

    This requires Perl and GD::Graph.

  * Flexibility

    This is probably the most important component of the new code
    generation engine.  The internals are relatively easy to
    replace and update; even large passes can be replaced and
    implemented differently.

* New code generator

Compiling a method begins with the `mini_method_to_ir' routine
that converts the CIL representation into a medium intermediate
representation.

The mini_method_to_ir routine performs a number of operations:

  * Flow analysis and control flow graph computation.
    Unlike the previous version, stack analysis and control flow
    graphs are computed in a single pass in the mini_method_to_ir
    function; this is done for performance reasons: although the
    complexity increases, the benefit for a JIT compiler is that
    there is more time available for performing other
    optimizations.

  * Basic block computation.

    mini_method_to_ir populates the MonoCompile structure with an
    array of basic blocks, each of which contains a forest of
    trees made up of MonoInst structures.

  * Inlining

    Inlining is no longer restricted to methods containing a
    single basic block; instead it is possible to inline
    arbitrarily complex methods.

    The heuristics to choose what to inline are likely going to be
    tuned in the future.

  * Method to opcode conversion.

    Some method call invocations like `call Math.Sin' are
    transformed into an opcode: this transforms the call into a
    semantically rich node, which is later inlined into an FPU
    instruction.

    Various Array method invocations are turned into opcodes as
    well (the Get, Set and Address methods).

  * Tail recursion elimination

** Basic blocks

The MonoInst structure holds the actual decoded instruction, with
the semantic information from the stack analysis.  MonoInst is
interesting because initially it is part of a tree structure.
Here is a sample of the same tree with the new JIT engine:

    (stind.i4
        regoffset[0xffffffd4(%ebp)]
        (add
            (ldind.i4 regoffset[0xffffffd8(%ebp)])
            iconst[1]))

This is a medium-level intermediate representation (MIR).

Some complex opcodes are decomposed at this stage into a
collection of simpler opcodes.  Not every complex opcode is
decomposed at this stage, as we need to preserve the semantic
information during various optimization phases.  For example a
NEWARR opcode carries the length and the type of the array, which
could be used later to avoid type checks or array bounds checks.

There are a number of operations supported on this representation:

  * Branch optimizations.

  * Variable liveness.

  * Loop optimizations: the dominator trees are computed, loops
    are detected, and their nesting level computed.

  * Conversion of the method into static single assignment form
    (SSA form).

  * Dead code elimination.

  * Constant propagation.

  * Copy propagation.

  * Constant folding.

Once the above optimizations are optionally performed, a
decomposition phase is used to turn some complex opcodes into
internal method calls.  In the initial version of the JIT engine,
various operations on longs are emulated instead of being inlined.
Also the newarr invocation is turned into a call to the runtime.

At this point, after computing variable liveness, it is possible
to use the linear scan algorithm for allocating variables to
registers.  The linear scan pass uses the information that was
previously gathered by the loop nesting and loop structure
computation to favor variables in inner loops.  This process
updates the basic block `nesting' field, which is later used
during liveness analysis.

Stack space is then reserved for the local variables and any
temporary variables generated during the various optimizations.

** Instruction selection: Only used up until Mono 2.0

At this point, the BURS instruction selector is invoked to
transform the tree-based representation into a list of
instructions.  This is done using a tree pattern matcher that is
generated for the architecture using the `monoburg' tool.

Monoburg takes as input a file that describes tree patterns, which
are matched against the trees that were produced by the engine in
the previous stages.

The pattern matching might have more than one match for a
particular tree.  In this case, the match selected is the one
whose cost is the smallest.  A cost can be attached to each rule,
and if no cost is provided, the implicit cost is one.  Smaller
costs are selected over higher costs.

The cost function can be used to select particular blocks of code
for a given architecture, or to prevent a rule from matching by
giving it a prohibitively high cost.
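The following compact sketch illustrates how such cost-driven
selection works: a bottom-up pass labels every node with the
cheapest rule that can produce each nonterminal, adding the rule's
own cost to the costs of the matched subtrees.  The nonterminals,
rules and costs here are invented for the example; the real
matcher is generated by monoburg from the rule files described
below.

    /* Illustrative sketch of BURS-style bottom-up labeling (the
     * real matcher is generated by monoburg; rules and costs are
     * invented for the example).  Every node records, for each
     * nonterminal, the cheapest rule that can produce it. */
    #include <limits.h>
    #include <stdio.h>

    enum { NT_REG, NT_COUNT };      /* nonterminals */
    enum { T_CONST, T_ADD };        /* tree operators */

    typedef struct Tree {
        int op;
        struct Tree *left, *right;
        int cost[NT_COUNT];         /* best cost per nonterminal */
        int rule[NT_COUNT];         /* winning rule per nonterminal */
    } Tree;

    static void
    label (Tree *t)
    {
        if (!t)
            return;
        label (t->left);            /* label children first */
        label (t->right);
        for (int n = 0; n < NT_COUNT; n++)
            t->cost[n] = INT_MAX;

        switch (t->op) {
        case T_CONST:
            /* rule 1: reg <- CONST, cost 1 (e.g. mov reg, imm) */
            t->cost[NT_REG] = 1;
            t->rule[NT_REG] = 1;
            break;
        case T_ADD: {
            /* rule 2: reg <- ADD (reg, reg); the rule cost is
             * added to the costs of the matched subtrees and the
             * cheapest combination wins.  A second rule matching
             * the same tree with a lower cost would replace it
             * here. */
            int c = 1 + t->left->cost[NT_REG] + t->right->cost[NT_REG];
            if (c < t->cost[NT_REG]) {
                t->cost[NT_REG] = c;
                t->rule[NT_REG] = 2;
            }
            break;
        }
        }
    }

    int
    main (void)
    {
        Tree a = { T_CONST }, b = { T_CONST };
        Tree add = { T_ADD, &a, &b };

        label (&add);
        printf ("best rule %d, total cost %d\n",
            add.rule[NT_REG], add.cost[NT_REG]);  /* rule 2, cost 3 */
        return 0;
    }

Once the labeling is complete, a second walk over the tree
executes the C code attached to each winning rule, and that code
is what emits the lower-level instructions.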
The various rules that our JIT engine uses transform a tree of
MonoInsts into a list of MonoInsts:

    +------------------------------------------------------+
    | Tree                                            List |
    |  of    ===>   Instruction selection   ===>      of   |
    | MonoInst                                    MonoInst |
    +------------------------------------------------------+

During this process various kinds of MonoInst disappear and are
turned into lower-level representations.  The JIT compiler just
happens to reuse the same structure (this is done to reduce memory
usage and improve memory locality).

The instruction selection rules are split into a number of files,
each one with a particular purpose:

    inssel.brg          Contains the generic instruction selection
                        patterns.

    inssel-x86.brg      Contains x86-specific rules.

    inssel-ppc.brg      Contains PowerPC-specific rules.

    inssel-long32.brg   Burg file for 64-bit instructions on
                        32-bit architectures.

    inssel-long.brg     Burg file for 64-bit architectures.

    inssel-float.brg    Burg file for floating point instructions.

For a given build, a set of those files is included.  For example,
for the build of Mono on the x86, the following set is used:

    inssel.brg
    inssel-x86.brg
    inssel-long32.brg
    inssel-float.brg

** Native method generation

The native method generation has a number of steps:

  * Architecture-specific register allocation.

    The information about loop nesting that was previously
    gathered is used here to hint the register allocator.

  * Generating the method prolog/epilog.

  * Optionally generating code to introduce tracing facilities.

  * Hooking into the debugger.

  * Performing any pending fixups.

  * Code generation.

*** Code Generation

The actual code generation is contained in the architecture-
specific portion of the compiler.  The input to the code generator
is each one of the basic blocks with its list of instructions that
were produced in the instruction selection phase.

During the instruction selection phase, virtual registers are
assigned.  Just before the peephole optimization is performed,
physical registers are assigned.

A simple peephole and algebraic optimizer is run at this stage.
The peephole optimizer removes some redundant operations at this
point.  This is possible because the code generator now has
visibility into the whole basic block that spans the original
trees.

The algebraic optimizer performs some simple algebraic
optimizations that replace expensive operations with cheaper ones
where possible.

The rest of the code generation is fairly simple: a switch
statement is used to generate code for each of the MonoInsts; this
lives in the "mono_arch_output_basic_block" method in the
mono/mini/mini-ARCH.c files.

We always try to allocate code in sequence, instead of just using
malloc.  This way we increase spatial locality, which gives a
massive speedup on most architectures.
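As an illustration of the kind of rewrite the peephole pass
performs, here is a self-contained sketch that walks the
instruction list of a basic block and deletes moves whose source
and destination registers are identical.  The instruction encoding
is invented for the example; it is not Mono's MonoInst.

    /* Illustrative peephole pass (instruction encoding invented
     * for the example): walk the instruction list of a basic
     * block and unlink register-to-register moves that do
     * nothing. */
    #include <stdio.h>

    typedef enum { INS_MOVE, INS_ADD, INS_RET } InsOp;

    typedef struct Ins {
        InsOp op;
        int dreg, sreg;
        struct Ins *next;
    } Ins;

    static void
    peephole (Ins **head)
    {
        /* `link' always points at the pointer to the current
         * instruction, so unlinking is a single store. */
        for (Ins **link = head; *link; ) {
            Ins *ins = *link;
            if (ins->op == INS_MOVE && ins->dreg == ins->sreg)
                *link = ins->next;      /* redundant: drop it */
            else
                link = &ins->next;
        }
    }

    int
    main (void)
    {
        /* mov r1, r1 ; add r2, r1 ; ret -- the first is dead. */
        Ins i2 = { INS_RET, 0, 0, NULL };
        Ins i1 = { INS_ADD, 2, 1, &i2 };
        Ins i0 = { INS_MOVE, 1, 1, &i1 };
        Ins *bb = &i0;

        peephole (&bb);
        for (Ins *p = bb; p; p = p->next)
            printf ("op %d (dreg %d, sreg %d)\n",
                p->op, p->dreg, p->sreg);
        return 0;
    }

The point of the example is the list walk itself: because the pass
sees the whole basic block, it can remove redundancies that the
tree-local code generator of the first JIT could not.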
*** Ahead of Time compilation

Ahead-of-Time compilation is a new feature of our new compilation
engine.  The compilation engine is shared by the Just-in-Time
(JIT) compiler and the Ahead-of-Time (AOT) compiler.

The difference is in the set of optimizations that are turned on
for each mode: Just-in-Time compilation should be as fast as
possible, while Ahead-of-Time compilation can take as long as
required, because it is not performed at a time-critical point.
With AOT compilation, we can afford to turn on all of the
computationally expensive optimizations.

After the code generation phase is done, the code and any required
fixup information is saved into a file that is readable by "as"
(the native assembler available on all systems).  This assembly
file is then passed to the native assembler, which generates a
loadable module.

At execution time, when an assembly is loaded from the disk, the
runtime engine will probe for the existence of a pre-compiled
image.  If the pre-compiled image exists, it is loaded, and method
invocations are resolved to the code contained in the loaded
module.

The code generated under the AOT scenario is slightly different
from the JIT scenario: it is application-domain relative and can
be shared among multiple processes.  This is the same code
generation that is used when the runtime is instructed to maximize
code sharing on a multi-application-domain scenario.

* SSA-based optimizations

SSA form simplifies many optimizations because each variable has
exactly one definition site, which means that each variable is
only initialized once.  For example, code like this:

    a = 1
    ...
    a = 2
    call (a)

is internally turned into:

    a1 = 1
    ...
    a2 = 2
    call (a2)

In the presence of branches, like:

    if (x)
        a = 1
    else
        a = 2
    call (a)

the code is turned into:

    if (x)
        a1 = 1
    else
        a2 = 2
    a3 = phi (a1, a2)
    call (a3)

All uses of a variable are "dominated" by its definition.

This representation is useful as it simplifies the implementation
of a number of optimizations like conditional constant
propagation, array bounds check removal and dead code elimination.

* Register allocation

Global register allocation is performed on the medium intermediate
representation just before instruction selection is performed on
the method.  Local register allocation is later performed at the
basic-block level on the lower-level representation produced by
instruction selection.

Global register allocation uses the following input:

1) The set of register-sized variables that can be allocated to a
   register (this is an architecture-specific setting; for x86
   these are the callee-saved registers ESI, EDI and EBX).

2) Liveness information for the variables.

3) (Optionally) loop information, to favor variables that are used
   in inner loops.

During the instruction selection phase, symbolic registers are
assigned to temporary values in expressions.

Local register allocation assigns hard registers to the symbolic
registers; it is performed just before the code is actually
emitted, at the basic-block level.  A CPU description file
describes the input registers, output registers, fixed registers
and the registers clobbered by each operation.
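Since global allocation is based on linear scan, a compact sketch
of that algorithm may help.  This is the textbook formulation
working on live intervals, simplified in one respect (on register
pressure it spills the current interval, whereas the full
algorithm spills the interval that ends last); it is not Mono's
actual implementation.

    /* Textbook linear scan sketch (simplified; not Mono's code):
     * intervals are walked in order of increasing start, expired
     * intervals return their register to the free pool, and when
     * no register is free the interval is spilled to the stack. */
    #include <stdio.h>

    #define NREGS 3   /* e.g. the three x86 callee-saved registers */

    typedef struct {
        int start, end;   /* live interval of the variable */
        int reg;          /* assigned register, or -1 if spilled */
    } Interval;

    static void
    linear_scan (Interval *v, int n)
    {
        int free_regs[NREGS], nfree = NREGS;
        Interval *active[NREGS];
        int nactive = 0;

        for (int r = 0; r < NREGS; r++)
            free_regs[r] = r;

        for (int i = 0; i < n; i++) {   /* sorted by start */
            /* Expire intervals that end before this one starts. */
            for (int j = 0; j < nactive; ) {
                if (active[j]->end < v[i].start) {
                    free_regs[nfree++] = active[j]->reg;
                    active[j] = active[--nactive];
                } else
                    j++;
            }
            if (nfree > 0) {
                v[i].reg = free_regs[--nfree];
                active[nactive++] = &v[i];
            } else
                v[i].reg = -1;   /* no register left: spill */
        }
    }

    int
    main (void)
    {
        /* Four overlapping variables, three registers. */
        Interval v[] = { {0, 8, 0}, {1, 3, 0}, {2, 9, 0}, {4, 6, 0} };
        linear_scan (v, 4);
        for (int i = 0; i < 4; i++)
            printf ("v%d -> reg %d\n", i, v[i].reg);
        return 0;
    }

In Mini the candidates handed to this pass are the callee-saved
registers listed above, and the loop information is what biases
the choice toward variables used in inner loops.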
* BURG Code Generator Generator: Only used up to Mono 2.0

monoburg was written by Dietmar Maurer.  It is based on the papers
by Christopher W. Fraser, Robert R. Henry and Todd A. Proebsting:
"BURG - Fast Optimal Instruction Selection and Tree Parsing" and
"Engineering a Simple, Efficient Code Generator Generator".

The original BURG implementation is unable to work on DAGs; only
trees are allowed.  Our monoburg implementation is able to
generate a tree matcher that works on DAGs, and we use this
feature in the new JIT.  This simplifies the code because we can
directly pass DAGs and don't need to convert them to trees.

* Adding IL opcodes: an exercise (from a post by Paolo Molaro)

mini.c is the file that reads the IL code stream and decides how
each single IL instruction is implemented (the mono_method_to_ir
() function), so you always have to add an entry to the big switch
inside the function: there are plenty of examples in that file.

An IL opcode can be implemented in a number of ways, depending on
what it does and how it needs to do it.

Some opcodes are implemented using a helper function: one of the
simpler examples is the CEE_STELEM_REF implementation.  In this
case the opcode implementation is written in a C function.  You
will need to register the function with the JIT before you can use
it (mono_register_jit_icall) and you need to emit the call to the
helper using the mono_emit_jit_icall() function.  This is the
simplest way to add a new opcode and it doesn't require any
arch-specific change (though it's limited to what you can do in C
code, and the performance may be limited by the function call).

Other opcodes can be implemented with one or more of the already
implemented low-level instructions.  An example is the OP_STRLEN
opcode, which implements String.Length using a simple load from
memory.  In this case you need to add a rule to the appropriate
burg file, describing what the arguments of the opcode are and
what its 'return' value is, if any.  The OP_STRLEN case is:

    reg: OP_STRLEN (reg) {
        MONO_EMIT_LOAD_MEMBASE_OP (s, tree, OP_LOADI4_MEMBASE,
            state->reg1, state->left->reg1,
            G_STRUCT_OFFSET (MonoString, length));
    }

The above means: OP_STRLEN takes a register as an argument and
returns its value in a register, and the implementation is
included in the braces.  The opcode returns a value in an integer
register (state->reg1) by performing an int32 load of the length
field of the MonoString represented by the input register
(state->left->reg1): before the burg rules are applied, the
internal representation is based on trees, so you get the
left/right pointers (state->left and state->right respectively;
the result is stored in state->reg1).

This instruction implementation doesn't require arch-specific
changes (it uses the MONO_EMIT_LOAD_MEMBASE_OP macro, which is
available on all platforms), and usually the produced code is
fast.

Next we have opcodes that must be implemented with new low-level
architecture-specific instructions (either because of performance
considerations or because the functionality can't be implemented
in other ways).  You need a burg rule in this case, too.  For
example, consider the OP_CHECK_THIS opcode (used to raise an
exception if the this pointer is null).  The burg rule simply
reads:

    stmt: OP_CHECK_THIS (reg) {
        mono_bblock_add_inst (s->cbb, tree);
    }

Note that this opcode does not return a value (hence the "stmt")
and it takes a register as input.  mono_bblock_add_inst (s->cbb,
tree) just adds the instruction (the tree variable) to the current
basic block (s->cbb).  In mini this is the place where the
internal representation switches from the tree format to the
low-level format (the list of simple instructions).  In this case
the actual opcode implementation is delegated to the arch-specific
code.

A low-level opcode needs an entry in the machine description (the
*.md files in mini/).  This entry describes what kind of registers
are used, if any, by the instruction, as well as other details
such as constraints or other hints to the low-level engine, which
are architecture specific.
cpu-pentium.md, for example, has the following entry:

    checkthis: src1:b len:3

This means the instruction uses an integer register as a base
pointer (basically a load or store is done on it) and it takes 3
bytes of native code to implement.

Now you just need to provide the low-level implementation for the
opcode in one of the mini-$arch.c files, in the
mono_arch_output_basic_block() function.  There is a big switch
here too.  The x86 implementation is:

    case OP_CHECK_THIS:
        /* ensure ins->sreg1 is not NULL */
        x86_alu_membase_imm (code, X86_CMP, ins->sreg1, 0, 0);
        break;

If the $arch-codegen.h header file doesn't have the code to emit
the low-level native code, you'll need to write that as well.

Complex opcodes with register constraints may require other
changes to the local register allocator, but usually those are not
needed.

* Future

Profile-based optimization is something that we are very
interested in supporting.  There are two possible usage scenarios:

  * Based on the profile information gathered during the execution
    of a program, hot methods can be compiled with the highest
    level of optimizations, while bootstrap code and cold methods
    can be compiled with the smallest set of optimizations and
    placed in a discardable list.

  * Code reordering: this profile-based optimization would only
    make sense for pre-compiled code.  The profile information is
    used to re-order the assembly code on disk so that the code is
    placed on the disk in a way that increases locality.

    This is the same principle under which SGI's cord program
    works.

The nature of the CIL makes the above optimizations easy to
implement and deploy.  Since we define and control the whole
environment, no interaction with system tools is required, nor are
upgrades to the underlying infrastructure.

Instruction scheduling is important for certain kinds of
processors, and some of the framework to cope with it exists today
in our register allocator and instruction selector, but it has not
been finished.  The instruction selection would happen at the same
time as local register allocation.