docs/mini-doc.txt

   1
   2                A new JIT compiler for the Mono Project
   3
   4            Miguel de Icaza (miguel@{ximian.com,gnome.org}),
   5
   6
   7 * Abstract
   8
   9         Mini is a new compilation engine for the Mono runtime.  The
  10         new engine is designed to bring new code generation
  11         optimizations, portability and precompilation.
  12
  13         In this document we describe the design decisions and the
  14         architecture of the new compilation engine.
  15
  16 * Introduction
  17
  18         First we discuss the overall architecture of the Mono runtime,
  19         and how code generation fits into it; Then we discuss the
  20         development and basic architecture of our first JIT compiler
  21         for the ECMA CIL framework.  The next section covers the
  22         objectives for the work on the new JIT compiler, then we
  23         discuss the new features available in the new JIT compiler,
  24         and finally a technical description of the new code generation
  25         engine.
  26
  27 * Architecture of the Mono Runtime
  28
  29         The Mono runtime is an implementation of the ECMA Common
  30         Language Infrastructure (CLI), whose aim is to be a common
  31         platform for executing code in multiple languages.
  32
  33         Languages that target the CLI generate images that contain
  34         code in high-level intermediate representation called the
  35         "Common Intermediate Language".  This intermediate language is
  36         rich enough to allow for programs and pre-compiled libraries
  37         to be reflected.  The execution environment allows for an
  38         object oriented execution environment with single inheritance
  39         and multiple interface implementations.
  40
  41         This runtime provides a number of services for programs that
  42         are targeted to it: Just-in-Time compilation of CIL code into
  43         native code, garbage collection, thread management, I/O
  44         routines, single, double and decimal floating point,
  45         asynchronous method invocation, application domains, and a
  46         framework for building arbitrary RPC systems (remoting) and
  47         integration with system libraries through the Platform Invoke
  48         functionality.
  49
  50         The focus of this document is on the services provided by the
  51         Mono runtime to transform CIL bytecodes into code that is
  52         native to the underlying architecture.
  53
  54         The code generation interface is a set of macros that allow a
  55         C programmer to generate code on the fly, this is done
  56         through a set of macros found in the mono/jit/arch/ directory.
  57         These macros are used by the JIT compiler to generate native
  58         code.
  59
  60         The platform invocation code is interesting, as it generates
  61         CIL code on the fly to marshal parameters, and then this
  62         code is in turned processed by the JIT engine.
  63
  64 * Previous Experiences
  65
  66         Mono has built a JIT engine, which has been used to bootstrap
  67         Mono since January, 2002.  This JIT engine has reasonable
  68         performance, and uses an tree pattern matching instruction
  69         selector based on the BURS technology.  This JIT compiler was
  70         designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.
  71
  72         The existing JIT compiler has three phases:
  73
  74                 * Re-creation of the semantic tree from CIL
  75                   byte-codes.
  76
  77                 * Instruction selection, with a cost-driven
  78                   engine.
  79
  80                 * Code generation and register allocation.
  81
  82         It is also hooked into the rest of the runtime to provide
  83         services like marshaling, just-in-time compilation and
  84         invocation of "internal calls".
  85
  86         This engine constructed a collection of trees, which we
  87         referred to as the "forest of trees", this forest is created by
  88         "hydrating" the CIL instruction stream.
  89
  90         The first step was to identify the basic blocks on the method,
  91         and computing the control flow graph (cfg) for it.  Once this
  92         information was computed, a stack analysis on each basic block
  93         was performed to create a forest of trees for each one of
  94         them.
  95
  96         So for example, the following statement:
  97
  98                int a, b;
  99                ...
 100                b = a + 1;
 101
 102         Which would be represented in CIL as:
 103
 104                         ldloc.0
 105                         ldc.i4.1
 106                         add
 107                         stloc.1
 108
 109         After the stack analysis would create the following tree:
 110
 111                (STIND_I4 ADDR_L[EBX|2] (
 112                          ADD (LDIND_I4 ADDR_L[ESI|1])
 113                          CONST_I4[1]))
 114
 115         This tree contains information from the stack analysis: for
 116         instance, notice that the operations explicitly encode the
 117         data types they are operating on, there is no longer an
 118         ambiguity on the types, because this information has been
 119         inferred.
 120
 121         At this point the JIT would pass the constructed forest of
 122         trees to the architecture-dependant JIT compiler.
 123
 124         The architecture dependent code then performed register
 125         allocation (optionally using linear scan allocation for
 126         variables, based on life analysis).
 127
 128         Once variables had been assigned, a tree pattern matching with
 129         dynamic programming is used (the tree pattern matcher is
 130         custom build for each architecture, using a code
 131         generator: monoburg). The instruction selector used cost
 132         functions to select the best instruction patterns.
 133
 134         The instruction selector is able to produce instructions that
 135         take advantage of the x86 instruction indexing instructions
 136         for example.
 137
 138         One problem though is that the code emitter and the register
 139         allocator did not have any visibility outside the current
 140         tree, which meant that some redundant instructions were
 141         generated.  A peephole optimizer with this architecture was
 142         hard to write, given the tree-based representation that is
 143         used.
 144
 145         This JIT was functional, but it did not provide a good
 146         architecture to base future optimizations on.  Also the
 147         line between architecture neutral and architecture
 148         specific code and optimizations was hard to draw.
 149
 150         The JIT engine supported two code generation modes to support
 151         the two optimization modes for applications that host multiple
 152         application domains: generate code that will be shared across
 153         application domains, or generate code that will not be shared
 154         across application domains.
 155
 156 * Objectives of the new JIT engine.
 157
 158         We wanted to support a number of features that were missing:
 159
 160            * Ahead-of-time compilation.
 161
 162              The idea is to allow developers to pre-compile their code
 163              to native code to reduce startup time, and the working
 164              set that is used at runtime in the just-in-time compiler.
 165
 166              Although in Mono this has not been a visible problem, we
 167              wanted to pro-actively address this problem.
 168
 169              When an assembly (a Mono/.NET executable) is installed in
 170              the system, it would then be possible to pre-compile the
 171              code, and have the JIT compiler tune the generated code
 172              to the particular CPU on which the software is
 173              installed.
 174
 175              This is done in the Microsoft.NET world with a tool
 176              called ngen.exe
 177
 178            * Have a good platform for doing code optimizations.
 179
 180              The design called for a good architecture that would
 181              enable various levels of optimizations: some
 182              optimizations are better performed on high-level
 183              intermediate representations, some on medium-level and
 184              some at low-level representations.
 185
 186              Also it should be possible to conditionally turn these on
 187              or off.  Some optimizations are too expensive to be used
 188              in just-in-time compilation scenarios, but these
 189              expensive optimizations can be turned on for
 190              ahead-of-time compilations or when using profile-guided
 191              optimizations on a subset of the executed methods.
 192
 193            * Reduce the effort required to port the Mono code
 194              generator to new architectures.
 195
 196              For Mono to gain wide adoption in the Unix world, it is
 197              necessary that the JIT engine works in most of today's
 198              commercial hardware platforms.
 199
 200 * Features of the new JIT engine.
 201
 202         The new JIT engine was architected by Dietmar Maurer and Paolo
 203         Molaro, based on the new objectives.
 204
 205         Mono provides a number of services to applications running
 206         with the new JIT compiler:
 207
 208              * Just-in-Time compilation of CLI code into native code.
 209
 210              * Ahead-of-Time compilation of CLI code, to reduce
 211                startup time of applications.
 212
 213         A number of software development features are also available:
 214
 215              * Execution time profiling (--profile)
 216
 217                Generates a report of the times consumed by routines,
 218                as well as the invocation times, as well as the
 219                callers.
 220
 221              * Memory usage profiling (--profile)
 222
 223                Generates a report of the memory usage by a program
 224                that is ran under the Mono JIT.
 225
 226              * Code coverage (--coverage)
 227
 228              * Execution tracing.
 229
 230         People who are interested in developing and improving the Mini
 231         JIT compiler will also find a few useful routines:
 232
 233              * Compilation times
 234
 235                This is used to time the execution time for the JIT
 236                when compiling a routine.
 237
 238              * Control Flow Graph and Dominator Tree drawing.
 239
 240                These are visual aids for the JIT developer: they
 241                render representations of the Control Flow graph, and
 242                for the more advanced optimizations, they draw the
 243                dominator tree graph.
 244
 245                This requires Dot (from the graphwiz package) and Ghostview.
 246
 247              * Code generator regression tests.
 248
 249                The engine contains support for running regression
 250                tests on the virtual machine, which is very helpful to
 251                developers interested in improving the engine.
 252
 253              * Optimization benchmark framework.
 254
 255                The JIT engine will generate graphs that compare
 256                various benchmarks embedded in an assembly, and run the
 257                various tests with different optimization flags.
 258
 259                This requires Perl, GD::Graph.
 260
 261 * Flexibility
 262
 263         This is probably the most important component of the new code
 264         generation engine.  The internals are relatively easy to
 265         replace and update, even large passes can be replaced and
 266         implemented differently.
 267
 268 * New code generator
 269
 270         Compiling a method begins with the `mini_method_to_ir' routine
 271         that converts the CIL representation into a medium
 272         intermediate representation.
 273
 274         The mini_method_to_ir routine performs a number of operations:
 275
 276             * Flow analysis and control flow graph computation.
 277
 278               Unlike the previous version, stack analysis and control
 279               flow graphs are computed in a single pass in the
 280               mini_method_to_ir function, this is done for performance
 281               reasons: although the complexity increases, the benefit
 282               for a JIT compiler is that there is more time available
 283               for performing other optimizations.
 284
 285             * Basic block computation.
 286
 287               mini_method_to_ir populates the MonoCompile structure
 288               with an array of basic blocks each of which contains
 289               forest of trees made up of MonoInst structures.
 290
 291             * Inlining
 292
 293               Inlining is no longer restricted to methods containing
 294               one single basic block, instead it is possible to inline
 295               arbitrary complex methods.
 296
 297               The heuristics to choose what to inline are likely going
 298               to be tuned in the future.
 299
 300             * Method to opcode conversion.
 301
 302               Some method call invocations like `call Math.Sin' are
 303               transformed into an opcode: this transforms the call
 304               into a semantically rich node, which is later inline
 305               into an FPU instruction.
 306
 307               Various Array methods invocations are turned into
 308               opcodes as well (The Get, Set and Address methods)
 309
 310             * Tail recursion elimination
 311
 312         Basic blocks ****
 313
 314         The MonoInst structure holds the actual decoded instruction,
 315         with the semantic information from the stack analysis.
 316         MonoInst is interesting because initially it is part of a tree
 317         structure, here is a sample of the same tree with the new JIT
 318         engine:
 319
 320                  (stind.i4 regoffset[0xffffffd4(%ebp)]
 321                            (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
 322                                  iconst[1]))
 323
 324         This is a medium-level intermediate representation (MIR).
 325
 326         Some complex opcodes are decomposed at this stage into a
 327         collection of simpler opcodes.  Not every complex opcode is
 328         decomposed at this stage, as we need to preserve the semantic
 329         information during various optimization phases.
 330
 331         For example a NEWARR opcode carries the length and the type of
 332         the array that could be used later to avoid type checking or
 333         array bounds check.
 334
 335         There are a number of operations supported on this
 336         representation:
 337
 338                 * Branch optimizations.
 339
 340                 * Variable liveness.
 341
 342                 * Loop optimizations: the dominator trees are
 343                   computed, loops are detected, and their nesting
 344                   level computed.
 345
 346                 * Conversion of the method into static single assignment
 347                   form (SSA form).
 348
 349                 * Dead code elimination.
 350
 351                 * Constant propagation.
 352
 353                 * Copy propagation.
 354
 355                 * Constant folding.
 356
 357         Once the above optimizations are optionally performed, a
 358         decomposition phase is used to turn some complex opcodes into
 359         internal method calls.  In the initial version of the JIT
 360         engine, various operations on longs are emulated instead of
 361         being inlined.  Also the newarr invocation is turned into a
 362         call to the runtime.
 363
 364         At this point, after computing variable liveness, it is
 365         possible to use the linear scan algorithm for allocating
 366         variables to registers.  The linear scan pass uses the
 367         information that was previously gathered by the loop nesting
 368         and loop structure computation to favor variables in inner
 369         loops.
 370
 371         Stack space is then reserved for the local variables and any
 372         temporary variables generated during the various
 373         optimizations.
 374
 375 ** Instruction selection
 376
 377         At this point, the BURS instruction selector is invoked to
 378         transform the tree-based representation into a list of
 379         instructions.  This is done using a tree pattern matcher that
 380         is generated for the architecture using the `monoburg' tool.
 381
 382         Monoburg takes as input a file that describes tree patterns,
 383         which are matched against the trees that were produced by the
 384         engine in the previous stages.
 385
 386         The pattern matching might have more than one match for a
 387         particular tree.  In this case, the match selected is the one
 388         whose cost is the smallest.  A cost can be attached to each
 389         rule, and if no cost is provided, the implicit cost is one.
 390         Smaller costs are selected over higher costs.
 391
 392         The cost function can be used to select particular blocks of
 393         code for a given architecture, or by using a prohibitive high
 394         number to avoid having the rule match.
 395
 396         The various rules that our JIT engine uses transform a tree of
 397         MonoInsts into a list of monoinsts:
 398
 399         +-----------------------------------------------------------+
 400         | Tree                                           List       |
 401         | of           ===> Instruction selection ===>   of         |
 402         | MonoInst                                       MonoInst.  |
 403         +-----------------------------------------------------------+
 404
 405         During this process various "types" of MonoInst kinds
 406         disappear and turned into lower-level representations.  The
 407         JIT compiler just happens to reuse the same structure (this is
 408         done to reduce memory usage and improve memory locality).
 409
 410         The instruction selection rules are split in a number of
 411         files, each one with a particular purpose:
 412
 413                 inssel.brg
 414                         Contains the generic instruction selection
 415                         patterns.
 416
 417                 inssel-x86.brg
 418                         Contains x86 specific rules.
 419
 420                 inssel-ppc.brg
 421                         Contains PowerPC specific rules.
 422
 423                 inssel-long32.brg
 424                         burg file for 64bit instructions on 32bit architectures.
 425
 426                 inssel-long.brg
 427                         burg file for 64bit architectures.
 428
 429                 inssel-float.brg
 430                         burg file for floating point instructions
 431
 432         For a given build, a set of those files would be included.
 433         For example, for the build of Mono on the x86, the following
 434         set is used:
 435
 436             inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg
 437
 438 ** Native method generation
 439
 440         The native method generation has a number of steps:
 441
 442                 * Architecture specific register allocation.
 443
 444                   The information about loop nesting that was
 445                   previously gathered is used here to hint the
 446                   register allocator.
 447
 448                 * Generating the method prolog/epilog.
 449
 450                 * Optionally generate code to introduce tracing facilities.
 451
 452                 * Hooking into the debugger.
 453
 454                 * Performing any pending fixups.
 455
 456                 * Code generation.
 457
 458 *** Code Generation
 459
 460         The actual code generation is contained in the architecture
 461         specific portion of the compiler.  The input to the code
 462         generator is each one of the basic blocks with its list of
 463         instructions that were produced in the instruction selection
 464         phase.
 465
 466         During the instruction selection phase, virtual registers are
 467         assigned.  Just before the peephole optimization is performed,
 468         physical registers are assigned.
 469
 470         A simple peephole and algebraic optimizer is ran at this
 471         stage.
 472
 473         The peephole optimizer removes some redundant operations at
 474         this point.  This is possible because the code generation at
 475         this point has visibility into the basic block that spans the
 476         original trees.
 477
 478         The algebraic optimizer performs some simple algebraic
 479         optimizations that replace expensive operations with cheaper
 480         operations if possible.
 481
 482         The rest of the code generation is fairly simple: a switch
 483         statement is used to generate code for each of the MonoInsts
 484
 485         We always try to allocate code in sequence, instead of just using
 486         malloc. This way we increase spatial locality which gives a massive
 487         speedup on most architectures.
 488
 489 *** Ahead of Time compilation
 490
 491         Ahead-of-Time compilation is a new feature of our new
 492         compilation engine.  The compilation engine is shared by the
 493         Just-in-Time (JIT) compiler and the Ahead-of-Time compiler
 494         (AOT).
 495
 496         The difference is on the set of optimizations that are turned
 497         on for each mode: Just-in-Time compilation should be as fast
 498         as possible, while Ahead-of-Time compilation can take as long
 499         as required, because this is not done at a time criticial
 500         time.
 501
 502         With AOT compilation, we can afford to turn all of the
 503         computationally expensive optimizations on.
 504
 505         After the code generation phase is done, the code and any
 506         required fixup information is saved into a file that is
 507         readable by "as" (the native assembler available on all
 508         systems). This assembly file is then passed to the native
 509         assembler, which generates a loadable module.
 510
 511         At execution time, when an assembly is loaded from the disk,
 512         the runtime engine will probe for the existance of a
 513         pre-compiled image.  If the pre-compiled image exists, then it
 514         is loaded, and the method invocations are resolved to the code
 515         contained in the loaded module.
 516
 517         The code generated under the AOT scenario is slightly
 518         different than the JIT scenario.  It generates code that is
 519         application-domain relative and that can be shared among
 520         multiple thread.
 521
 522         This is the same code generation that is used when the runtime
 523         is instructed to maximize code sharing on a multi-application
 524         domain scenario.
 525
 526 * SSA-based optimizations
 527
 528         SSA form simplifies many optimization because each variable
 529         has exactly one definition site.  This means that each
 530         variable is only initialized once.
 531
 532         For example, code like this:
 533
 534             a = 1
 535             ..
 536             a = 2
 537             call (a)
 538
 539         Is internally turned into:
 540
 541             a1 = 1
 542             ..
 543             a2 = 2
 544             call (a2)
 545
 546         In the presence of branches, like:
 547
 548             if (x)
 549                  a = 1
 550             else
 551                  a = 2
 552
 553             call (a)
 554
 555         The code is turned into:
 556
 557             if (x)
 558                  a1 = 1;
 559             else
 560                  a2 = 2;
 561             a3 = phi (a1, a2)
 562             call (a3)
 563
 564         All uses of a variable are "dominated" by its definition
 565
 566         This representation is useful as it simplifies the
 567         implementation of a number of optimizations like conditional
 568         constant propagation, array bounds check removal and dead code
 569         elimination.
 570
 571 * Register allocation.
 572
 573         Global register allocation is performed on the medium
 574         intermediate representation just before instruction selection
 575         is performed on the method.  Local register allocation is
 576         later performed at the basic-block level on the
 577
 578         Global register allocation uses the following input:
 579
 580         1) set of register-sized variables that can be allocated to a
 581         register (this is an architecture specific setting, for x86
 582         these registers are the callee saved register ESI, EDI and
 583         EBX).
 584
 585         2) liveness information for the variables
 586
 587         3) (optionally) loop info to favour variables that are used in
 588         inner loops.
 589
 590         During instruction selection phase, symbolic registers are
 591         assigned to temporary values in expressions.
 592
 593         Local register allocation assigns hard registers to the
 594         symbolic registers, and it is performed just before the code
 595         is actually emitted and is performed at the basic block level.
 596         A CPU description file describes the input registers, output
 597         registers, fixed registers and clobbered registers by each
 598         operation.
 599
 600
 601 ----------
 602 * Bootstrap
 603
 604         The Mini bootstrap parses the arguments passed on the command
 605         line, and initializes the JIT runtime. Each time the
 606         mini_init() routine is invoked, a new Application Domain will
 607         be returned.
 608
 609 * Signal handlers
 610
 611         mono_runtime_install_handlers
 612
 613 * BURG Code Generator Generator
 614
 615        monoburg was written by Dietmar Maurer. It is based on the
 616        papers from Christopher W. Fraser, Robert R. Henry and Todd
 617        A. Proebsting: "BURG - Fast Optimal Instruction Selection and
 618        Tree Parsing" and "Engineering a Simple, Efficient Code
 619        Generator Generator".
 620
 621        The original BURG implementation is unable to work on DAGs, instead only
 622        trees are allowed. Our monoburg implementations is able to generate tree
 623        matcher which works on DAGs, and we use this feature in the new
 624        JIT. This simplifies the code because we can directly pass DAGs and
 625        don't need to convert them to trees.
 626
 627 * Future
 628
 629         Profile-based optimization is something that we are very
 630         interested in supporting.  There are two possible usage
 631         scenarios:
 632
 633            * Based on the profile information gathered during
 634              the execution of a program, hot methods can be compiled
 635              with the highest level of optimizations, while bootstrap
 636              code and cold methods can be compiled with the least set
 637              of optimizations and placed in a discardable list.
 638
 639            * Code reordering: this profile-based optimization would
 640              only make sense for pre-compiled code.  The profile
 641              information is used to re-order the assembly code on disk
 642              so that the code is placed on the disk in a way that
 643              increments locality.
 644
 645              This is the same principle under which SGI's cord program
 646              works.
 647
 648         The nature of the CIL allows the above optimizations to be
 649         easy to implement and deploy.  Since we live and define our
 650         universe for these things, there are no interactions with
 651         system tools required, nor upgrades on the underlying
 652         infrastructure required.
 653
 654         Instruction scheduling is important for certain kinds of
 655         processors, and some of the framework exists today in our
 656         register allocator and the instruction selector to cope with
 657         this, but has not been finished.  The instruction selection
 658         would happen at the same time as local register allocation.