docs/mini-doc.txt

   1
   2                A new JIT compiler for the Mono Project
   3
   4            Miguel de Icaza (miguel@{ximian.com,gnome.org}),
   5            Paolo Molaro (lupus@{ximian.com,debian.org})
   6
   7
   8 * Abstract
   9
  10         Mini is a new compilation engine for the Mono runtime.  The
  11         new engine is designed to bring new code generation
  12         optimizations, portability and pre-compilation.
  13
  14         In this document we describe the design decisions and the
  15         architecture of the new compilation engine.
  16
  17 * Introduction
  18
  19         Mono is a Open Source implementation of the .NET Framework: it
  20         is made up of a runtime engine that implements the ECMA Common
  21         Language Infrastructure (CLI), a set of compilers that target
  22         the CLI and a large collection of class libraries.
  23
  24         This article discusses the new code generation facilities that
  25         have been added to the Mono runtime.
  26
  27         First we discuss the overall architecture of the Mono runtime,
  28         and how code generation fits into it; Then we discuss the
  29         development and basic architecture of our first JIT compiler
  30         for the ECMA CIL framework.  The next section covers the
  31         objectives for the work on the new JIT compiler, then we
  32         discuss the new features available in the new JIT compiler,
  33         and finally a technical description of the new code generation
  34         engine.
  35
  36 * Architecture of the Mono Runtime
  37
  38         The Mono runtime is an implementation of the ECMA Common
  39         Language Infrastructure (CLI), whose aim is to be a common
  40         platform for executing code in multiple languages.
  41
  42         Languages that target the CLI generate images that contain
  43         code in high-level intermediate representation called the
  44         "Common Intermediate Language".  This intermediate language is
  45         rich enough to allow for programs and pre-compiled libraries
  46         to be reflected.  The execution environment allows for an
  47         object oriented execution environment with single inheritance
  48         and multiple interface implementations.
  49
  50         This runtime provides a number of services for programs that
  51         are targeted to it: Just-in-Time compilation of CIL code into
  52         native code, garbage collection, thread management, I/O
  53         routines, single, double and decimal floating point,
  54         asynchronous method invocation, application domains, and a
  55         framework for building arbitrary RPC systems (remoting) and
  56         integration with system libraries through the Platform Invoke
  57         functionality.
  58
  59         The focus of this document is on the services provided by the
  60         Mono runtime to transform CIL bytecodes into code that is
  61         native to the underlying architecture.
  62
  63         The code generation interface is a set of macros that allow a
  64         C programmer to generate code on the fly, this is done
  65         through a set of macros found in the mono/jit/arch/ directory.
  66         These macros are used by the JIT compiler to generate native
  67         code.
  68
  69         The platform invocation code is interesting, as it generates
  70         CIL code on the fly to marshal parameters, and then this
  71         code is in turned processed by the JIT engine.
  72
  73 * Previous Experiences
  74
  75         Mono has built a JIT engine, which has been used to bootstrap
  76         Mono since January, 2002.  This JIT engine has reasonable
  77         performance, and uses an tree pattern matching instruction
  78         selector based on the BURS technology.  This JIT compiler was
  79         designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.
  80
  81         The existing JIT compiler has three phases:
  82
  83                 * Re-creation of the semantic tree from CIL
  84                   byte-codes.
  85
  86                 * Instruction selection, with a cost-driven
  87                   engine.
  88
  89                 * Code generation and register allocation.
  90
  91         It is also hooked into the rest of the runtime to provide
  92         services like marshaling, just-in-time compilation and
  93         invocation of "internal calls".
  94
  95         This engine constructed a collection of trees, which we
  96         referred to as the "forest of trees", this forest is created by
  97         "hydrating" the CIL instruction stream.
  98
  99         The first step was to identify the basic blocks on the method,
 100         and computing the control flow graph (cfg) for it.  Once this
 101         information was computed, a stack analysis on each basic block
 102         was performed to create a forest of trees for each one of
 103         them.
 104
 105         So for example, the following statement:
 106
 107                int a, b;
 108                ...
 109                b = a + 1;
 110
 111         Which would be represented in CIL as:
 112
 113                         ldloc.0
 114                         ldc.i4.1
 115                         add
 116                         stloc.1
 117
 118         After the stack analysis would create the following tree:
 119
 120                (STIND_I4 ADDR_L[EBX|2] (
 121                          ADD (LDIND_I4 ADDR_L[ESI|1])
 122                          CONST_I4[1]))
 123
 124         This tree contains information from the stack analysis: for
 125         instance, notice that the operations explicitly encode the
 126         data types they are operating on, there is no longer an
 127         ambiguity on the types, because this information has been
 128         inferred.
 129
 130         At this point the JIT would pass the constructed forest of
 131         trees to the architecture-dependent JIT compiler.
 132
 133         The architecture dependent code then performed register
 134         allocation (optionally using linear scan allocation for
 135         variables, based on life analysis).
 136
 137         Once variables had been assigned, a tree pattern matching with
 138         dynamic programming is used (the tree pattern matcher is
 139         custom build for each architecture, using a code
 140         generator: monoburg). The instruction selector used cost
 141         functions to select the best instruction patterns.
 142
 143         The instruction selector is able to produce instructions that
 144         take advantage of the x86 instruction indexing instructions
 145         for example.
 146
 147         One problem though is that the code emitter and the register
 148         allocator did not have any visibility outside the current
 149         tree, which meant that some redundant instructions were
 150         generated.  A peephole optimizer with this architecture was
 151         hard to write, given the tree-based representation that is
 152         used.
 153
 154         This JIT was functional, but it did not provide a good
 155         architecture to base future optimizations on.  Also the
 156         line between architecture neutral and architecture
 157         specific code and optimizations was hard to draw.
 158
 159         The JIT engine supported two code generation modes to support
 160         the two optimization modes for applications that host multiple
 161         application domains: generate code that will be shared across
 162         application domains, or generate code that will not be shared
 163         across application domains.
 164
 165 * Objectives of the new JIT engine.
 166
 167         We wanted to support a number of features that were missing:
 168
 169            * Ahead-of-time compilation.
 170
 171              The idea is to allow developers to pre-compile their code
 172              to native code to reduce startup time, and the working
 173              set that is used at runtime in the just-in-time compiler.
 174
 175              Although in Mono this has not been a visible problem, we
 176              wanted to pro-actively address this problem.
 177
 178              When an assembly (a Mono/.NET executable) is installed in
 179              the system, it would then be possible to pre-compile the
 180              code, and have the JIT compiler tune the generated code
 181              to the particular CPU on which the software is
 182              installed.
 183
 184              This is done in the Microsoft.NET world with a tool
 185              called ngen.exe
 186
 187            * Have a good platform for doing code optimizations.
 188
 189              The design called for a good architecture that would
 190              enable various levels of optimizations: some
 191              optimizations are better performed on high-level
 192              intermediate representations, some on medium-level and
 193              some at low-level representations.
 194
 195              Also it should be possible to conditionally turn these on
 196              or off.  Some optimizations are too expensive to be used
 197              in just-in-time compilation scenarios, but these
 198              expensive optimizations can be turned on for
 199              ahead-of-time compilations or when using profile-guided
 200              optimizations on a subset of the executed methods.
 201
 202            * Reduce the effort required to port the Mono code
 203              generator to new architectures.
 204
 205              For Mono to gain wide adoption in the Unix world, it is
 206              necessary that the JIT engine works in most of today's
 207              commercial hardware platforms.
 208
 209 * Features of the new JIT engine.
 210
 211         The new JIT engine was architected by Dietmar Maurer and Paolo
 212         Molaro, based on the new objectives.
 213
 214         Mono provides a number of services to applications running
 215         with the new JIT compiler:
 216
 217              * Just-in-Time compilation of CLI code into native code.
 218
 219              * Ahead-of-Time compilation of CLI code, to reduce
 220                startup time of applications.
 221
 222         A number of software development features are also available:
 223
 224              * Execution time profiling (--profile)
 225
 226                Generates a report of the times consumed by routines,
 227                as well as the invocation times, as well as the
 228                callers.
 229
 230              * Memory usage profiling (--profile)
 231
 232                Generates a report of the memory usage by a program
 233                that is ran under the Mono JIT.
 234
 235              * Code coverage (--coverage)
 236
 237              * Execution tracing.
 238
 239         People who are interested in developing and improving the Mini
 240         JIT compiler will also find a few useful routines:
 241
 242              * Compilation times
 243
 244                This is used to time the execution time for the JIT
 245                when compiling a routine.
 246
 247              * Control Flow Graph and Dominator Tree drawing.
 248
 249                These are visual aids for the JIT developer: they
 250                render representations of the Control Flow graph, and
 251                for the more advanced optimizations, they draw the
 252                dominator tree graph.
 253
 254                This requires Dot (from the graphwiz package) and Ghostview.
 255
 256              * Code generator regression tests.
 257
 258                The engine contains support for running regression
 259                tests on the virtual machine, which is very helpful to
 260                developers interested in improving the engine.
 261
 262              * Optimization benchmark framework.
 263
 264                The JIT engine will generate graphs that compare
 265                various benchmarks embedded in an assembly, and run the
 266                various tests with different optimization flags.
 267
 268                This requires Perl, GD::Graph.
 269
 270 * Flexibility
 271
 272         This is probably the most important component of the new code
 273         generation engine.  The internals are relatively easy to
 274         replace and update, even large passes can be replaced and
 275         implemented differently.
 276
 277 * New code generator
 278
 279         Compiling a method begins with the `mini_method_to_ir' routine
 280         that converts the CIL representation into a medium
 281         intermediate representation.
 282
 283         The mini_method_to_ir routine performs a number of operations:
 284
 285             * Flow analysis and control flow graph computation.
 286
 287               Unlike the previous version, stack analysis and control
 288               flow graphs are computed in a single pass in the
 289               mini_method_to_ir function, this is done for performance
 290               reasons: although the complexity increases, the benefit
 291               for a JIT compiler is that there is more time available
 292               for performing other optimizations.
 293
 294             * Basic block computation.
 295
 296               mini_method_to_ir populates the MonoCompile structure
 297               with an array of basic blocks each of which contains
 298               forest of trees made up of MonoInst structures.
 299
 300             * Inlining
 301
 302               Inlining is no longer restricted to methods containing
 303               one single basic block, instead it is possible to inline
 304               arbitrary complex methods.
 305
 306               The heuristics to choose what to inline are likely going
 307               to be tuned in the future.
 308
 309             * Method to opcode conversion.
 310
 311               Some method call invocations like `call Math.Sin' are
 312               transformed into an opcode: this transforms the call
 313               into a semantically rich node, which is later inline
 314               into an FPU instruction.
 315
 316               Various Array methods invocations are turned into
 317               opcodes as well (The Get, Set and Address methods)
 318
 319             * Tail recursion elimination
 320
 321         Basic blocks ****
 322
 323         The MonoInst structure holds the actual decoded instruction,
 324         with the semantic information from the stack analysis.
 325         MonoInst is interesting because initially it is part of a tree
 326         structure, here is a sample of the same tree with the new JIT
 327         engine:
 328
 329                  (stind.i4 regoffset[0xffffffd4(%ebp)]
 330                            (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
 331                                  iconst[1]))
 332
 333         This is a medium-level intermediate representation (MIR).
 334
 335         Some complex opcodes are decomposed at this stage into a
 336         collection of simpler opcodes.  Not every complex opcode is
 337         decomposed at this stage, as we need to preserve the semantic
 338         information during various optimization phases.
 339
 340         For example a NEWARR opcode carries the length and the type of
 341         the array that could be used later to avoid type checking or
 342         array bounds check.
 343
 344         There are a number of operations supported on this
 345         representation:
 346
 347                 * Branch optimizations.
 348
 349                 * Variable liveness.
 350
 351                 * Loop optimizations: the dominator trees are
 352                   computed, loops are detected, and their nesting
 353                   level computed.
 354
 355                 * Conversion of the method into static single assignment
 356                   form (SSA form).
 357
 358                 * Dead code elimination.
 359
 360                 * Constant propagation.
 361
 362                 * Copy propagation.
 363
 364                 * Constant folding.
 365
 366         Once the above optimizations are optionally performed, a
 367         decomposition phase is used to turn some complex opcodes into
 368         internal method calls.  In the initial version of the JIT
 369         engine, various operations on longs are emulated instead of
 370         being inlined.  Also the newarr invocation is turned into a
 371         call to the runtime.
 372
 373         At this point, after computing variable liveness, it is
 374         possible to use the linear scan algorithm for allocating
 375         variables to registers.  The linear scan pass uses the
 376         information that was previously gathered by the loop nesting
 377         and loop structure computation to favor variables in inner
 378         loops.
 379
 380         Stack space is then reserved for the local variables and any
 381         temporary variables generated during the various
 382         optimizations.
 383
 384 ** Instruction selection
 385
 386         At this point, the BURS instruction selector is invoked to
 387         transform the tree-based representation into a list of
 388         instructions.  This is done using a tree pattern matcher that
 389         is generated for the architecture using the `monoburg' tool.
 390
 391         Monoburg takes as input a file that describes tree patterns,
 392         which are matched against the trees that were produced by the
 393         engine in the previous stages.
 394
 395         The pattern matching might have more than one match for a
 396         particular tree.  In this case, the match selected is the one
 397         whose cost is the smallest.  A cost can be attached to each
 398         rule, and if no cost is provided, the implicit cost is one.
 399         Smaller costs are selected over higher costs.
 400
 401         The cost function can be used to select particular blocks of
 402         code for a given architecture, or by using a prohibitive high
 403         number to avoid having the rule match.
 404
 405         The various rules that our JIT engine uses transform a tree of
 406         MonoInsts into a list of monoinsts:
 407
 408         +-----------------------------------------------------------+
 409         | Tree                                           List       |
 410         | of           ===> Instruction selection ===>   of         |
 411         | MonoInst                                       MonoInst.  |
 412         +-----------------------------------------------------------+
 413
 414         During this process various "types" of MonoInst kinds
 415         disappear and turned into lower-level representations.  The
 416         JIT compiler just happens to reuse the same structure (this is
 417         done to reduce memory usage and improve memory locality).
 418
 419         The instruction selection rules are split in a number of
 420         files, each one with a particular purpose:
 421
 422                 inssel.brg
 423                         Contains the generic instruction selection
 424                         patterns.
 425
 426                 inssel-x86.brg
 427                         Contains x86 specific rules.
 428
 429                 inssel-ppc.brg
 430                         Contains PowerPC specific rules.
 431
 432                 inssel-long32.brg
 433                         burg file for 64bit instructions on 32bit architectures.
 434
 435                 inssel-long.brg
 436                         burg file for 64bit architectures.
 437
 438                 inssel-float.brg
 439                         burg file for floating point instructions
 440
 441         For a given build, a set of those files would be included.
 442         For example, for the build of Mono on the x86, the following
 443         set is used:
 444
 445             inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg
 446
 447 ** Native method generation
 448
 449         The native method generation has a number of steps:
 450
 451                 * Architecture specific register allocation.
 452
 453                   The information about loop nesting that was
 454                   previously gathered is used here to hint the
 455                   register allocator.
 456
 457                 * Generating the method prolog/epilog.
 458
 459                 * Optionally generate code to introduce tracing facilities.
 460
 461                 * Hooking into the debugger.
 462
 463                 * Performing any pending fixups.
 464
 465                 * Code generation.
 466
 467 *** Code Generation
 468
 469         The actual code generation is contained in the architecture
 470         specific portion of the compiler.  The input to the code
 471         generator is each one of the basic blocks with its list of
 472         instructions that were produced in the instruction selection
 473         phase.
 474
 475         During the instruction selection phase, virtual registers are
 476         assigned.  Just before the peephole optimization is performed,
 477         physical registers are assigned.
 478
 479         A simple peephole and algebraic optimizer is ran at this
 480         stage.
 481
 482         The peephole optimizer removes some redundant operations at
 483         this point.  This is possible because the code generation at
 484         this point has visibility into the basic block that spans the
 485         original trees.
 486
 487         The algebraic optimizer performs some simple algebraic
 488         optimizations that replace expensive operations with cheaper
 489         operations if possible.
 490
 491         The rest of the code generation is fairly simple: a switch
 492         statement is used to generate code for each of the MonoInsts,
 493         in the mono/mini/mini-ARCH.c files, the method is called
 494         "mono_arch_output_basic_block".
 495
 496         We always try to allocate code in sequence, instead of just using
 497         malloc. This way we increase spatial locality which gives a massive
 498         speedup on most architectures.
 499
 500 *** Ahead of Time compilation
 501
 502         Ahead-of-Time compilation is a new feature of our new
 503         compilation engine.  The compilation engine is shared by the
 504         Just-in-Time (JIT) compiler and the Ahead-of-Time compiler
 505         (AOT).
 506
 507         The difference is on the set of optimizations that are turned
 508         on for each mode: Just-in-Time compilation should be as fast
 509         as possible, while Ahead-of-Time compilation can take as long
 510         as required, because this is not done at a time critical
 511         time.
 512
 513         With AOT compilation, we can afford to turn all of the
 514         computationally expensive optimizations on.
 515
 516         After the code generation phase is done, the code and any
 517         required fixup information is saved into a file that is
 518         readable by "as" (the native assembler available on all
 519         systems). This assembly file is then passed to the native
 520         assembler, which generates a loadable module.
 521
 522         At execution time, when an assembly is loaded from the disk,
 523         the runtime engine will probe for the existence of a
 524         pre-compiled image.  If the pre-compiled image exists, then it
 525         is loaded, and the method invocations are resolved to the code
 526         contained in the loaded module.
 527
 528         The code generated under the AOT scenario is slightly
 529         different than the JIT scenario.  It generates code that is
 530         application-domain relative and that can be shared among
 531         multiple thread.
 532
 533         This is the same code generation that is used when the runtime
 534         is instructed to maximize code sharing on a multi-application
 535         domain scenario.
 536
 537 * SSA-based optimizations
 538
 539         SSA form simplifies many optimization because each variable
 540         has exactly one definition site.  This means that each
 541         variable is only initialized once.
 542
 543         For example, code like this:
 544
 545             a = 1
 546             ..
 547             a = 2
 548             call (a)
 549
 550         Is internally turned into:
 551
 552             a1 = 1
 553             ..
 554             a2 = 2
 555             call (a2)
 556
 557         In the presence of branches, like:
 558
 559             if (x)
 560                  a = 1
 561             else
 562                  a = 2
 563
 564             call (a)
 565
 566         The code is turned into:
 567
 568             if (x)
 569                  a1 = 1;
 570             else
 571                  a2 = 2;
 572             a3 = phi (a1, a2)
 573             call (a3)
 574
 575         All uses of a variable are "dominated" by its definition
 576
 577         This representation is useful as it simplifies the
 578         implementation of a number of optimizations like conditional
 579         constant propagation, array bounds check removal and dead code
 580         elimination.
 581
 582 * Register allocation.
 583
 584         Global register allocation is performed on the medium
 585         intermediate representation just before instruction selection
 586         is performed on the method.  Local register allocation is
 587         later performed at the basic-block level on the
 588
 589         Global register allocation uses the following input:
 590
 591         1) set of register-sized variables that can be allocated to a
 592         register (this is an architecture specific setting, for x86
 593         these registers are the callee saved register ESI, EDI and
 594         EBX).
 595
 596         2) liveness information for the variables
 597
 598         3) (optionally) loop info to favor variables that are used in
 599         inner loops.
 600
 601         During instruction selection phase, symbolic registers are
 602         assigned to temporary values in expressions.
 603
 604         Local register allocation assigns hard registers to the
 605         symbolic registers, and it is performed just before the code
 606         is actually emitted and is performed at the basic block level.
 607         A CPU description file describes the input registers, output
 608         registers, fixed registers and clobbered registers by each
 609         operation.
 610
 611 * BURG Code Generator Generator
 612
 613        monoburg was written by Dietmar Maurer. It is based on the
 614        papers from Christopher W. Fraser, Robert R. Henry and Todd
 615        A. Proebsting: "BURG - Fast Optimal Instruction Selection and
 616        Tree Parsing" and "Engineering a Simple, Efficient Code
 617        Generator Generator".
 618
 619        The original BURG implementation is unable to work on DAGs, instead only
 620        trees are allowed. Our monoburg implementations is able to generate tree
 621        matcher which works on DAGs, and we use this feature in the new
 622        JIT. This simplifies the code because we can directly pass DAGs and
 623        don't need to convert them to trees.
 624
 625 * Adding IL opcodes: an excercise (from a post by Paolo Molaro)
 626
 627         mini.c is the file that read the IL code stream and decides
 628         how any single IL instruction is implemented
 629         (mono_method_to_ir () func), so you always have to add an
 630         entry to the big switch inside the function: there are plenty
 631         of examples in that file.
 632
 633         An IL opcode can be implemented in a number of ways, depending
 634         on what it does and how it needs to do it.
 635
 636         Some opcodes are implemented using a helper function: one of
 637         the simpler examples is the CEE_STELEM_REF implementation.
 638
 639         In this case the opcode implementation is written in a C
 640         function.  You will need to register the function with the jit
 641         before you can use it (mono_register_jit_call) and you need to
 642         emit the call to the helper using the mono_emit_jit_icall()
 643         function.
 644
 645         This is the simpler way to add a new opcode and it doesn't
 646         require any arch-specific change (though it's limited to what
 647         you can do in C code and the performance may be limited by the
 648         function call).
 649
 650         Other opcodes can be implemented with one or more of the already
 651         implemented low-level instructions.
 652
 653         An example is the OP_STRLEN opcode which implements
 654         String.Length using a simple load from memory.  In this case
 655         you need to add a rule to the appropriate burg file,
 656         describing what are the arguments of the opcode and what is,
 657         if any, it's 'return' value.
 658
 659         The OP_STRLEN case is:
 660
 661         reg: OP_STRLEN (reg) {
 662                 MONO_EMIT_LOAD_MEMBASE_OP (s, tree, OP_LOADI4_MEMBASE, state->reg1,
 663                         state->left->reg1, G_STRUCT_OFFSET (MonoString, length));
 664         }
 665
 666         The above means: the OP_STRLEN takes a register as an argument
 667         and returns its value in a register.  And the implementation
 668         of this is included in the braces.
 669
 670         The opcode returns a value in an integer register
 671         (state->reg1) by performing a int32 load of the length field
 672         of the MonoString represented by the input register
 673         (state->left->reg1): before the burg rules are applied, the
 674         internal representation is based on trees, so you get the
 675         left/right pointers (state->left and state->right
 676         respectively, the result is stored in state->reg1).
 677
 678         This instruction implementation doesn't require arch-specific
 679         changes (it is using the MONO_EMIT_LOAD_MEMBASE_OP which is
 680         available on all platforms), and usually the produced code is
 681         fast.
 682
 683         Next we have opcodes that must be implemented with new low-level
 684         architecture specific instructions (either because of performance
 685         considerations or because the functionality can't get implemented in
 686         other ways).
 687
 688         You also need a burg rule in this case, too. For example,
 689         consider the OP_CHECK_THIS opcode (used to raise an exception
 690         if the this pointer is null). The burg rule simply reads:
 691
 692         stmt: OP_CHECK_THIS (reg) {
 693                 mono_bblock_add_inst (s->cbb, tree);
 694         }
 695
 696         Note that this opcode does not return a value (hence the
 697         "stmt") and it takes a register as input.
 698
 699         mono_bblock_add_inst (s->cbb, tree) just adds the instruction
 700         (the tree variable) to the current basic block (s->cbb). In
 701         mini this is the place where the internal representation
 702         switches from the tree format to the low-level format (the
 703         list of simple instructions).
 704
 705         In this case the actual opcode implementation is delegated to
 706         the arch-specific code.  A low-level opcode needs an entry in
 707         the machine description (the *.md files in mini/). This entry
 708         describes what kind of registers are used if any by the
 709         instruction, as well as other details such as constraints or
 710         other hints to the low-level engine which are architecture
 711         specific.
 712
 713         cpu-pentium.md, for example has the following entry:
 714
 715         checkthis: src1:b len:3
 716
 717         This means the instruction uses an integer register as a base
 718         pointer (basically a load or store is done on it) and it takes
 719         3 bytes of native code to implement it.
 720
 721         Now you just need to provide the low-level implementation for
 722         the opcode in one of the mini-$arch.c files, in the
 723         mono_arch_output_basic_block() function. There is a big switch
 724         here too. The x86 implementation is:
 725
 726                 case OP_CHECK_THIS:
 727                         /* ensure ins->sreg1 is not NULL */
 728                         x86_alu_membase_imm (code, X86_CMP, ins->sreg1, 0, 0);
 729                         break;
 730
 731         If the $arch-codegen.h header file doesn't have the code to
 732         emit the low-level native code, you'll need to write that as
 733         well.
 734
 735         Complex opcodes with register constraints may require other
 736         changes to the local register allocator, but usually they are
 737         not needed.
 738
 739 * Future
 740
 741         Profile-based optimization is something that we are very
 742         interested in supporting.  There are two possible usage
 743         scenarios:
 744
 745            * Based on the profile information gathered during
 746              the execution of a program, hot methods can be compiled
 747              with the highest level of optimizations, while bootstrap
 748              code and cold methods can be compiled with the least set
 749              of optimizations and placed in a discardable list.
 750
 751            * Code reordering: this profile-based optimization would
 752              only make sense for pre-compiled code.  The profile
 753              information is used to re-order the assembly code on disk
 754              so that the code is placed on the disk in a way that
 755              increments locality.
 756
 757              This is the same principle under which SGI's cord program
 758              works.
 759
 760         The nature of the CIL allows the above optimizations to be
 761         easy to implement and deploy.  Since we live and define our
 762         universe for these things, there are no interactions with
 763         system tools required, nor upgrades on the underlying
 764         infrastructure required.
 765
 766         Instruction scheduling is important for certain kinds of
 767         processors, and some of the framework exists today in our
 768         register allocator and the instruction selector to cope with
 769         this, but has not been finished.  The instruction selection
 770         would happen at the same time as local register allocation. <