mono/mini/mini-doc.txt

   1
   2                A new JIT compiler for the Mono Project
   3
   4            Miguel de Icaza (miguel@{ximian.com,gnome.org}),
   5
   6
   7 * Abstract
   8
   9         Mini is a new compilation engine for the Mono runtime.  The
  10         new engine is designed to bring new code generation
  11         optimizations, portability and pre-compilation.
  12
  13         In this document we describe the design decisions and the
  14         architecture of the new compilation engine.
  15
  16 * Introduction
  17
  18         Mono is a Open Source implementation of the .NET Framework: it
  19         is made up of a runtime engine that implements the ECMA Common
  20         Language Infrastructure (CLI), a set of compilers that target
  21         the CLI and a large collection of class libraries.
  22
  23         This article discusses the new code generation facilities that
  24         have been added to the Mono runtime.
  25
  26         First we discuss the overall architecture of the Mono runtime,
  27         and how code generation fits into it; Then we discuss the
  28         development and basic architecture of our first JIT compiler
  29         for the ECMA CIL framework.  The next section covers the
  30         objectives for the work on the new JIT compiler, then we
  31         discuss the new features available in the new JIT compiler,
  32         and finally a technical description of the new code generation
  33         engine.
  34
  35 * Architecture of the Mono Runtime
  36
  37         The Mono runtime is an implementation of the ECMA Common
  38         Language Infrastructure (CLI), whose aim is to be a common
  39         platform for executing code in multiple languages.
  40
  41         Languages that target the CLI generate images that contain
  42         code in high-level intermediate representation called the
  43         "Common Intermediate Language".  This intermediate language is
  44         rich enough to allow for programs and pre-compiled libraries
  45         to be reflected.  The execution environment allows for an
  46         object oriented execution environment with single inheritance
  47         and multiple interface implementations.
  48
  49         This runtime provides a number of services for programs that
  50         are targeted to it: Just-in-Time compilation of CIL code into
  51         native code, garbage collection, thread management, I/O
  52         routines, single, double and decimal floating point,
  53         asynchronous method invocation, application domains, and a
  54         framework for building arbitrary RPC systems (remoting) and
  55         integration with system libraries through the Platform Invoke
  56         functionality.
  57
  58         The focus of this document is on the services provided by the
  59         Mono runtime to transform CIL bytecodes into code that is
  60         native to the underlying architecture.
  61
  62         The code generation interface is a set of macros that allow a
  63         C programmer to generate code on the fly, this is done
  64         through a set of macros found in the mono/jit/arch/ directory.
  65         These macros are used by the JIT compiler to generate native
  66         code.
  67
  68         The platform invocation code is interesting, as it generates
  69         CIL code on the fly to marshal parameters, and then this
  70         code is in turned processed by the JIT engine.
  71
  72 * Previous Experiences
  73
  74         Mono has built a JIT engine, which has been used to bootstrap
  75         Mono since January, 2002.  This JIT engine has reasonable
  76         performance, and uses an tree pattern matching instruction
  77         selector based on the BURS technology.  This JIT compiler was
  78         designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.
  79
  80         The existing JIT compiler has three phases:
  81
  82                 * Re-creation of the semantic tree from CIL
  83                   byte-codes.
  84
  85                 * Instruction selection, with a cost-driven
  86                   engine.
  87
  88                 * Code generation and register allocation.
  89
  90         It is also hooked into the rest of the runtime to provide
  91         services like marshaling, just-in-time compilation and
  92         invocation of "internal calls".
  93
  94         This engine constructed a collection of trees, which we
  95         referred to as the "forest of trees", this forest is created by
  96         "hydrating" the CIL instruction stream.
  97
  98         The first step was to identify the basic blocks on the method,
  99         and computing the control flow graph (cfg) for it.  Once this
 100         information was computed, a stack analysis on each basic block
 101         was performed to create a forest of trees for each one of
 102         them.
 103
 104         So for example, the following statement:
 105
 106                int a, b;
 107                ...
 108                b = a + 1;
 109
 110         Which would be represented in CIL as:
 111
 112                         ldloc.0
 113                         ldc.i4.1
 114                         add
 115                         stloc.1
 116
 117         After the stack analysis would create the following tree:
 118
 119                (STIND_I4 ADDR_L[EBX|2] (
 120                          ADD (LDIND_I4 ADDR_L[ESI|1])
 121                          CONST_I4[1]))
 122
 123         This tree contains information from the stack analysis: for
 124         instance, notice that the operations explicitly encode the
 125         data types they are operating on, there is no longer an
 126         ambiguity on the types, because this information has been
 127         inferred.
 128
 129         At this point the JIT would pass the constructed forest of
 130         trees to the architecture-dependent JIT compiler.
 131
 132         The architecture dependent code then performed register
 133         allocation (optionally using linear scan allocation for
 134         variables, based on life analysis).
 135
 136         Once variables had been assigned, a tree pattern matching with
 137         dynamic programming is used (the tree pattern matcher is
 138         custom build for each architecture, using a code
 139         generator: monoburg). The instruction selector used cost
 140         functions to select the best instruction patterns.
 141
 142         The instruction selector is able to produce instructions that
 143         take advantage of the x86 instruction indexing instructions
 144         for example.
 145
 146         One problem though is that the code emitter and the register
 147         allocator did not have any visibility outside the current
 148         tree, which meant that some redundant instructions were
 149         generated.  A peephole optimizer with this architecture was
 150         hard to write, given the tree-based representation that is
 151         used.
 152
 153         This JIT was functional, but it did not provide a good
 154         architecture to base future optimizations on.  Also the
 155         line between architecture neutral and architecture
 156         specific code and optimizations was hard to draw.
 157
 158         The JIT engine supported two code generation modes to support
 159         the two optimization modes for applications that host multiple
 160         application domains: generate code that will be shared across
 161         application domains, or generate code that will not be shared
 162         across application domains.
 163
 164 * Objectives of the new JIT engine.
 165
 166         We wanted to support a number of features that were missing:
 167
 168            * Ahead-of-time compilation.
 169
 170              The idea is to allow developers to pre-compile their code
 171              to native code to reduce startup time, and the working
 172              set that is used at runtime in the just-in-time compiler.
 173
 174              Although in Mono this has not been a visible problem, we
 175              wanted to pro-actively address this problem.
 176
 177              When an assembly (a Mono/.NET executable) is installed in
 178              the system, it would then be possible to pre-compile the
 179              code, and have the JIT compiler tune the generated code
 180              to the particular CPU on which the software is
 181              installed.
 182
 183              This is done in the Microsoft.NET world with a tool
 184              called ngen.exe
 185
 186            * Have a good platform for doing code optimizations.
 187
 188              The design called for a good architecture that would
 189              enable various levels of optimizations: some
 190              optimizations are better performed on high-level
 191              intermediate representations, some on medium-level and
 192              some at low-level representations.
 193
 194              Also it should be possible to conditionally turn these on
 195              or off.  Some optimizations are too expensive to be used
 196              in just-in-time compilation scenarios, but these
 197              expensive optimizations can be turned on for
 198              ahead-of-time compilations or when using profile-guided
 199              optimizations on a subset of the executed methods.
 200
 201            * Reduce the effort required to port the Mono code
 202              generator to new architectures.
 203
 204              For Mono to gain wide adoption in the Unix world, it is
 205              necessary that the JIT engine works in most of today's
 206              commercial hardware platforms.
 207
 208 * Features of the new JIT engine.
 209
 210         The new JIT engine was architected by Dietmar Maurer and Paolo
 211         Molaro, based on the new objectives.
 212
 213         Mono provides a number of services to applications running
 214         with the new JIT compiler:
 215
 216              * Just-in-Time compilation of CLI code into native code.
 217
 218              * Ahead-of-Time compilation of CLI code, to reduce
 219                startup time of applications.
 220
 221         A number of software development features are also available:
 222
 223              * Execution time profiling (--profile)
 224
 225                Generates a report of the times consumed by routines,
 226                as well as the invocation times, as well as the
 227                callers.
 228
 229              * Memory usage profiling (--profile)
 230
 231                Generates a report of the memory usage by a program
 232                that is ran under the Mono JIT.
 233
 234              * Code coverage (--coverage)
 235
 236              * Execution tracing.
 237
 238         People who are interested in developing and improving the Mini
 239         JIT compiler will also find a few useful routines:
 240
 241              * Compilation times
 242
 243                This is used to time the execution time for the JIT
 244                when compiling a routine.
 245
 246              * Control Flow Graph and Dominator Tree drawing.
 247
 248                These are visual aids for the JIT developer: they
 249                render representations of the Control Flow graph, and
 250                for the more advanced optimizations, they draw the
 251                dominator tree graph.
 252
 253                This requires Dot (from the graphwiz package) and Ghostview.
 254
 255              * Code generator regression tests.
 256
 257                The engine contains support for running regression
 258                tests on the virtual machine, which is very helpful to
 259                developers interested in improving the engine.
 260
 261              * Optimization benchmark framework.
 262
 263                The JIT engine will generate graphs that compare
 264                various benchmarks embedded in an assembly, and run the
 265                various tests with different optimization flags.
 266
 267                This requires Perl, GD::Graph.
 268
 269 * Flexibility
 270
 271         This is probably the most important component of the new code
 272         generation engine.  The internals are relatively easy to
 273         replace and update, even large passes can be replaced and
 274         implemented differently.
 275
 276 * New code generator
 277
 278         Compiling a method begins with the `mini_method_to_ir' routine
 279         that converts the CIL representation into a medium
 280         intermediate representation.
 281
 282         The mini_method_to_ir routine performs a number of operations:
 283
 284             * Flow analysis and control flow graph computation.
 285
 286               Unlike the previous version, stack analysis and control
 287               flow graphs are computed in a single pass in the
 288               mini_method_to_ir function, this is done for performance
 289               reasons: although the complexity increases, the benefit
 290               for a JIT compiler is that there is more time available
 291               for performing other optimizations.
 292
 293             * Basic block computation.
 294
 295               mini_method_to_ir populates the MonoCompile structure
 296               with an array of basic blocks each of which contains
 297               forest of trees made up of MonoInst structures.
 298
 299             * Inlining
 300
 301               Inlining is no longer restricted to methods containing
 302               one single basic block, instead it is possible to inline
 303               arbitrary complex methods.
 304
 305               The heuristics to choose what to inline are likely going
 306               to be tuned in the future.
 307
 308             * Method to opcode conversion.
 309
 310               Some method call invocations like `call Math.Sin' are
 311               transformed into an opcode: this transforms the call
 312               into a semantically rich node, which is later inline
 313               into an FPU instruction.
 314
 315               Various Array methods invocations are turned into
 316               opcodes as well (The Get, Set and Address methods)
 317
 318             * Tail recursion elimination
 319
 320         Basic blocks ****
 321
 322         The MonoInst structure holds the actual decoded instruction,
 323         with the semantic information from the stack analysis.
 324         MonoInst is interesting because initially it is part of a tree
 325         structure, here is a sample of the same tree with the new JIT
 326         engine:
 327
 328                  (stind.i4 regoffset[0xffffffd4(%ebp)]
 329                            (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
 330                                  iconst[1]))
 331
 332         This is a medium-level intermediate representation (MIR).
 333
 334         Some complex opcodes are decomposed at this stage into a
 335         collection of simpler opcodes.  Not every complex opcode is
 336         decomposed at this stage, as we need to preserve the semantic
 337         information during various optimization phases.
 338
 339         For example a NEWARR opcode carries the length and the type of
 340         the array that could be used later to avoid type checking or
 341         array bounds check.
 342
 343         There are a number of operations supported on this
 344         representation:
 345
 346                 * Branch optimizations.
 347
 348                 * Variable liveness.
 349
 350                 * Loop optimizations: the dominator trees are
 351                   computed, loops are detected, and their nesting
 352                   level computed.
 353
 354                 * Conversion of the method into static single assignment
 355                   form (SSA form).
 356
 357                 * Dead code elimination.
 358
 359                 * Constant propagation.
 360
 361                 * Copy propagation.
 362
 363                 * Constant folding.
 364
 365         Once the above optimizations are optionally performed, a
 366         decomposition phase is used to turn some complex opcodes into
 367         internal method calls.  In the initial version of the JIT
 368         engine, various operations on longs are emulated instead of
 369         being inlined.  Also the newarr invocation is turned into a
 370         call to the runtime.
 371
 372         At this point, after computing variable liveness, it is
 373         possible to use the linear scan algorithm for allocating
 374         variables to registers.  The linear scan pass uses the
 375         information that was previously gathered by the loop nesting
 376         and loop structure computation to favor variables in inner
 377         loops.
 378
 379         Stack space is then reserved for the local variables and any
 380         temporary variables generated during the various
 381         optimizations.
 382
 383 ** Instruction selection
 384
 385         At this point, the BURS instruction selector is invoked to
 386         transform the tree-based representation into a list of
 387         instructions.  This is done using a tree pattern matcher that
 388         is generated for the architecture using the `monoburg' tool.
 389
 390         Monoburg takes as input a file that describes tree patterns,
 391         which are matched against the trees that were produced by the
 392         engine in the previous stages.
 393
 394         The pattern matching might have more than one match for a
 395         particular tree.  In this case, the match selected is the one
 396         whose cost is the smallest.  A cost can be attached to each
 397         rule, and if no cost is provided, the implicit cost is one.
 398         Smaller costs are selected over higher costs.
 399
 400         The cost function can be used to select particular blocks of
 401         code for a given architecture, or by using a prohibitive high
 402         number to avoid having the rule match.
 403
 404         The various rules that our JIT engine uses transform a tree of
 405         MonoInsts into a list of monoinsts:
 406
 407         +-----------------------------------------------------------+
 408         | Tree                                           List       |
 409         | of           ===> Instruction selection ===>   of         |
 410         | MonoInst                                       MonoInst.  |
 411         +-----------------------------------------------------------+
 412
 413         During this process various "types" of MonoInst kinds
 414         disappear and turned into lower-level representations.  The
 415         JIT compiler just happens to reuse the same structure (this is
 416         done to reduce memory usage and improve memory locality).
 417
 418         The instruction selection rules are split in a number of
 419         files, each one with a particular purpose:
 420
 421                 inssel.brg
 422                         Contains the generic instruction selection
 423                         patterns.
 424
 425                 inssel-x86.brg
 426                         Contains x86 specific rules.
 427
 428                 inssel-ppc.brg
 429                         Contains PowerPC specific rules.
 430
 431                 inssel-long32.brg
 432                         burg file for 64bit instructions on 32bit architectures.
 433
 434                 inssel-long.brg
 435                         burg file for 64bit architectures.
 436
 437                 inssel-float.brg
 438                         burg file for floating point instructions
 439
 440         For a given build, a set of those files would be included.
 441         For example, for the build of Mono on the x86, the following
 442         set is used:
 443
 444             inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg
 445
 446 ** Native method generation
 447
 448         The native method generation has a number of steps:
 449
 450                 * Architecture specific register allocation.
 451
 452                   The information about loop nesting that was
 453                   previously gathered is used here to hint the
 454                   register allocator.
 455
 456                 * Generating the method prolog/epilog.
 457
 458                 * Optionally generate code to introduce tracing facilities.
 459
 460                 * Hooking into the debugger.
 461
 462                 * Performing any pending fixups.
 463
 464                 * Code generation.
 465
 466 *** Code Generation
 467
 468         The actual code generation is contained in the architecture
 469         specific portion of the compiler.  The input to the code
 470         generator is each one of the basic blocks with its list of
 471         instructions that were produced in the instruction selection
 472         phase.
 473
 474         During the instruction selection phase, virtual registers are
 475         assigned.  Just before the peephole optimization is performed,
 476         physical registers are assigned.
 477
 478         A simple peephole and algebraic optimizer is ran at this
 479         stage.
 480
 481         The peephole optimizer removes some redundant operations at
 482         this point.  This is possible because the code generation at
 483         this point has visibility into the basic block that spans the
 484         original trees.
 485
 486         The algebraic optimizer performs some simple algebraic
 487         optimizations that replace expensive operations with cheaper
 488         operations if possible.
 489
 490         The rest of the code generation is fairly simple: a switch
 491         statement is used to generate code for each of the MonoInsts
 492
 493         We always try to allocate code in sequence, instead of just using
 494         malloc. This way we increase spatial locality which gives a massive
 495         speedup on most architectures.
 496
 497 *** Ahead of Time compilation
 498
 499         Ahead-of-Time compilation is a new feature of our new
 500         compilation engine.  The compilation engine is shared by the
 501         Just-in-Time (JIT) compiler and the Ahead-of-Time compiler
 502         (AOT).
 503
 504         The difference is on the set of optimizations that are turned
 505         on for each mode: Just-in-Time compilation should be as fast
 506         as possible, while Ahead-of-Time compilation can take as long
 507         as required, because this is not done at a time critical
 508         time.
 509
 510         With AOT compilation, we can afford to turn all of the
 511         computationally expensive optimizations on.
 512
 513         After the code generation phase is done, the code and any
 514         required fixup information is saved into a file that is
 515         readable by "as" (the native assembler available on all
 516         systems). This assembly file is then passed to the native
 517         assembler, which generates a loadable module.
 518
 519         At execution time, when an assembly is loaded from the disk,
 520         the runtime engine will probe for the existence of a
 521         pre-compiled image.  If the pre-compiled image exists, then it
 522         is loaded, and the method invocations are resolved to the code
 523         contained in the loaded module.
 524
 525         The code generated under the AOT scenario is slightly
 526         different than the JIT scenario.  It generates code that is
 527         application-domain relative and that can be shared among
 528         multiple thread.
 529
 530         This is the same code generation that is used when the runtime
 531         is instructed to maximize code sharing on a multi-application
 532         domain scenario.
 533
 534 * SSA-based optimizations
 535
 536         SSA form simplifies many optimization because each variable
 537         has exactly one definition site.  This means that each
 538         variable is only initialized once.
 539
 540         For example, code like this:
 541
 542             a = 1
 543             ..
 544             a = 2
 545             call (a)
 546
 547         Is internally turned into:
 548
 549             a1 = 1
 550             ..
 551             a2 = 2
 552             call (a2)
 553
 554         In the presence of branches, like:
 555
 556             if (x)
 557                  a = 1
 558             else
 559                  a = 2
 560
 561             call (a)
 562
 563         The code is turned into:
 564
 565             if (x)
 566                  a1 = 1;
 567             else
 568                  a2 = 2;
 569             a3 = phi (a1, a2)
 570             call (a3)
 571
 572         All uses of a variable are "dominated" by its definition
 573
 574         This representation is useful as it simplifies the
 575         implementation of a number of optimizations like conditional
 576         constant propagation, array bounds check removal and dead code
 577         elimination.
 578
 579 * Register allocation.
 580
 581         Global register allocation is performed on the medium
 582         intermediate representation just before instruction selection
 583         is performed on the method.  Local register allocation is
 584         later performed at the basic-block level on the
 585
 586         Global register allocation uses the following input:
 587
 588         1) set of register-sized variables that can be allocated to a
 589         register (this is an architecture specific setting, for x86
 590         these registers are the callee saved register ESI, EDI and
 591         EBX).
 592
 593         2) liveness information for the variables
 594
 595         3) (optionally) loop info to favor variables that are used in
 596         inner loops.
 597
 598         During instruction selection phase, symbolic registers are
 599         assigned to temporary values in expressions.
 600
 601         Local register allocation assigns hard registers to the
 602         symbolic registers, and it is performed just before the code
 603         is actually emitted and is performed at the basic block level.
 604         A CPU description file describes the input registers, output
 605         registers, fixed registers and clobbered registers by each
 606         operation.
 607
 608 * BURG Code Generator Generator
 609
 610        monoburg was written by Dietmar Maurer. It is based on the
 611        papers from Christopher W. Fraser, Robert R. Henry and Todd
 612        A. Proebsting: "BURG - Fast Optimal Instruction Selection and
 613        Tree Parsing" and "Engineering a Simple, Efficient Code
 614        Generator Generator".
 615
 616        The original BURG implementation is unable to work on DAGs, instead only
 617        trees are allowed. Our monoburg implementations is able to generate tree
 618        matcher which works on DAGs, and we use this feature in the new
 619        JIT. This simplifies the code because we can directly pass DAGs and
 620        don't need to convert them to trees.
 621
 622 * Future
 623
 624         Profile-based optimization is something that we are very
 625         interested in supporting.  There are two possible usage
 626         scenarios:
 627
 628            * Based on the profile information gathered during
 629              the execution of a program, hot methods can be compiled
 630              with the highest level of optimizations, while bootstrap
 631              code and cold methods can be compiled with the least set
 632              of optimizations and placed in a discardable list.
 633
 634            * Code reordering: this profile-based optimization would
 635              only make sense for pre-compiled code.  The profile
 636              information is used to re-order the assembly code on disk
 637              so that the code is placed on the disk in a way that
 638              increments locality.
 639
 640              This is the same principle under which SGI's cord program
 641              works.
 642
 643         The nature of the CIL allows the above optimizations to be
 644         easy to implement and deploy.  Since we live and define our
 645         universe for these things, there are no interactions with
 646         system tools required, nor upgrades on the underlying
 647         infrastructure required.
 648
 649         Instruction scheduling is important for certain kinds of
 650         processors, and some of the framework exists today in our
 651         register allocator and the instruction selector to cope with
 652         this, but has not been finished.  The instruction selection
 653         would happen at the same time as local register allocation.