2008-11-11 Jonathan Pobst <monkey@jpobst.com>

diff --git a/docs/aot-compiler.txt b/docs/aot-compiler.txt

index ab1af90d96584d1eafae6b96adb883570a185724..3d77c0a11cadb0f71c4569e7b4159b5935f81c88 100644
--- a/docs/aot-compiler.txt
+++ b/docs/aot-compiler.txt
@@ -1,44 +1,393 @@
  Mono Ahead Of Time Compiler
  ===========================
  
-The new mono JIT has sophisticated optimization features. It uses SSA and has a
-pluggable architecture for further optimizations. This makes it possible and
-efficient to use the JIT also for AOT compilation.
+       The Ahead of Time compilation feature in Mono allows Mono to
+       precompile assemblies to minimize JIT time, reduce memory
+       usage at runtime and increase code sharing across multiple
+       running Mono applications.
  
+       To precompile an assembly, use the following command:
+       
+          mono --aot -O=all assembly.exe
  
-* file format: We use the native object format of the platform. That way it is
-  possible to reuse existing tools like objdump and the dynamic loader. All we
-  need is a working assembler, i.e. we write out a text file which is then
-  passed to gas (the gnu assembler) to generate the object file.
+       The `--aot' flag instructs Mono to ahead-of-time compile your
+       assembly, while the `-O=all' flag instructs Mono to use all the
+       available optimizations.
  
-* file names: we simply add ".so" to the generated file. For example:
-  basic.exe -> basic.exe.so
-  corlib.dll -> corlib.dll.so
+* Caching metadata
+------------------
  
-* starting the AOT compiler: mini --aot assembly_name
+       Besides code, the AOT file also contains cached metadata which allows
+       the runtime to avoid certain computations, like the computation of
+       generic vtables. This reduces both startup time and memory usage. It is possible
+       to create an AOT image which contains only this cached information and no code by
+       using the 'metadata-only' option during compilation:
  
-The following things are saved in the object file:
+          mono --aot=metadata-only assembly.exe
  
-* version infos: 
+       This works even on platforms where AOT is not normally supported.
  
-* native code: this is labeled with method_XXXXXXXX: where XXXXXXXX is the
-  hexadecimal token number of the method.
+* Position Independent Code
+---------------------------
  
-* additional information needed by the runtime: For example we need to store
-  the code length and the exception tables. We also need a way to patch
-  constants only available at runtime (for example vtable and class
-  addresses). This is stored in a binary blob labeled method_info_XXXXXXXX:
+       On x86 and x86-64 the code generated by Ahead-of-Time compiled
+       images is position-independent code.  This allows the same
+       precompiled image to be reused across multiple applications
+       without having different copies. This is the same way in which
+       ELF shared libraries work: the code produced can be relocated
+       to any address.
  
-PROBLEMS:
+       The implementation of Position Independent Code had a
+       performance impact on Ahead-of-Time compiled images, but
+       compiler bootstraps are still faster with them than with
+       JIT-compiled code, especially with all the new optimizations
+       provided by the Mono engine.
  
-- all precompiled methods must be domain independent, or we add patch infos to
-  patch the target domain.
+* How to support Position Independent Code in new Mono Ports
+------------------------------------------------------------
+
+       Generated native code needs to reference various runtime
+       structures/functions whose addresses are only known at run
+       time. JITted code can simply embed the address into the native
+       code, but AOT code needs to do an indirection. This
+       indirection is done through a table called the Global Offset
+       Table (GOT), which is similar to the GOT in the ELF
+       spec.  When the runtime saves the AOT image, it saves some
+       information for each method describing the GOT entries
+       used by that method. When loading a method from an AOT image,
+       the runtime will fill out the GOT entries needed by the
+       method.
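
The two strategies can be contrasted in a small C sketch. It illustrates the idea under stated assumptions and is not Mono code. The GOT is modelled as an array of pointers. `resolve_runtime_object`, `load_method_got` and the `patch_kind` values are hypothetical stand-ins for the runtime's lookup and for the per-method GOT information saved in the image.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical runtime object whose address is only known at run time. */
static int runtime_counter;

/* Hypothetical patch kinds; each referenced object gets one GOT slot. */
enum patch_kind { PATCH_COUNTER, PATCH_COUNT };

/* The GOT: an array of pointer-sized slots, filled in by the loader. */
static void *got[PATCH_COUNT];

/* Stand-in for the runtime lookup that resolves a patch to an address. */
static void *resolve_runtime_object(enum patch_kind kind)
{
    return kind == PATCH_COUNTER ? (void *)&runtime_counter : NULL;
}

/* JIT-style access: as if the object's address were embedded in the code. */
static void jit_increment(void)
{
    runtime_counter++;
}

/* AOT-style access: one extra load through a GOT slot first. */
static void aot_increment(void)
{
    int *counter = got[PATCH_COUNTER];
    (*counter)++;
}

/* Loader side: fill the GOT slots a method uses before it first runs. */
static void load_method_got(const enum patch_kind *slots, size_t n_slots)
{
    for (size_t i = 0; i < n_slots; i++) {
        got[slots[i]] = resolve_runtime_object(slots[i]);
        assert(got[slots[i]] != NULL);
    }
}
```

The extra load in `aot_increment` is the indirection cost of AOT code.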
+
+   * Computing the address of the GOT
+
+       Methods which need to access the GOT first need to compute its
+       address. On x86 this is done by code like this:
+
+               call <IP + 5>
+               pop ebx
+               add <OFFSET TO GOT>, ebx
+               <save got addr to a register>
+
+       The variable representing the GOT address is stored in
+       cfg->got_var. It is always allocated to a global register to
+       prevent some problems with branches and basic blocks.
+
+   * Referencing GOT entries
+
+       Any time the native code needs to access some other runtime
+       structure/function (i.e. any time the backend calls
+       mono_add_patch_info ()), the code pointed to by the patch needs
+       to load the value from the GOT. For example, instead of:
+
+               call <ABSOLUTE ADDR>
+
+       it needs to do:
+
+               call *<OFFSET>(<GOT REG>)
+
+       Here, the <OFFSET> can be 0; it will be fixed up by the AOT compiler.
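
The AOT compiler performs this fix-up in a platform-specific way while emitting the code. The sketch below only shows what filling in the displacement of an x86 `call *<OFFSET>(<GOT REG>)` amounts to at the byte level. It assumes the GOT register is ebx and GOT slots are 4 bytes. `fixup_got_displacement` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x86 "call *disp32(%ebx)": opcode FF /2 with ModRM 0x93 (mod=10 selects a
 * 32-bit displacement, reg=010 is the /2 extension, rm=011 selects ebx).
 * The displacement is emitted as 0 and fixed up afterwards. */
static uint8_t call_via_got[6] = { 0xFF, 0x93, 0x00, 0x00, 0x00, 0x00 };

/* Hypothetical fix-up: write the byte offset of a GOT slot into the
 * little-endian 32-bit displacement starting at disp_offset. */
static void fixup_got_displacement(uint8_t *code, size_t disp_offset,
                                   uint32_t got_slot, uint32_t slot_size)
{
    uint32_t disp = got_slot * slot_size;

    assert(code != NULL);
    for (int i = 0; i < 4; i++)
        code[disp_offset + i] = (uint8_t)(disp >> (8 * i));
}
```

For example, `fixup_got_displacement(call_via_got, 2, 3, 4)` makes the call go through GOT slot 3, at byte offset 12.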
+       
+       For more examples of the changes required, see:
+
+               svn diff -r 37739:38213 mini-x86.c
+
+   * The Procedure Linkage Table
+
+       As in ELF, calls made from AOT code do not go through the GOT. Instead, a direct call is
+       made to an entry in the Procedure Linkage Table (PLT). This is based on the fact that on
+       most architectures, call instructions use a displacement instead of an absolute address, so
+       they are already position independent. A PLT entry is usually a jump instruction, which
+       initially points to some trampoline code which transfers control to the AOT loader, which
+       will find (or JIT-compile) the called method and patch the PLT entry so that further calls
+       are made directly to the called method.
+
+       If the called method is in the same assembly and does not need initialization (i.e. it
+       doesn't have GOT slots, etc.), then the call is made directly, bypassing the PLT.
+
+* Implementation
+----------------
+
+** The Precompiled File Format
+-----------------------------
+       
+       We use the native object format of the platform. That way it
+       is possible to reuse existing tools like objdump and the
+       dynamic loader. All we need is a working assembler, i.e. we
+       write out a text file which is then passed to gas (the GNU
+       assembler) to generate the object file.
+
+       The precompiled image is stored in a file next to the original
+       assembly, named after it with the platform's shared library
+       extension appended (on Linux, ".so").
+
+       For example: basic.exe -> basic.exe.so; corlib.dll -> corlib.dll.so
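
The following is a minimal sketch of deriving the image name. It assumes the extension is passed in by the caller (".so" on Linux). `aot_image_path` is a hypothetical helper, not a Mono function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: derive the image name by appending the shared
 * library extension to the assembly path ("basic.exe" -> "basic.exe.so").
 * Returns 0 on success, -1 if buf is too small. */
static int aot_image_path(char *buf, size_t size,
                          const char *assembly_path, const char *ext)
{
    size_t base_len = strlen(assembly_path);
    size_t ext_len = strlen(ext);

    assert(buf != NULL);
    if (base_len + ext_len + 1 > size)
        return -1;                              /* no room for result + NUL */
    memcpy(buf, assembly_path, base_len);
    memcpy(buf + base_len, ext, ext_len + 1);   /* copies the NUL as well */
    return 0;
}
```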
+
+       To avoid symbol lookup overhead and to save space, some things like the
+       compiled code of the individual methods are not identified by specific symbols
+       like method_code_1234. Instead, they are stored in one big array and the
+       offsets inside this array are stored in another array, requiring just two
+       symbols. The offsets array is usually named 'FOO_offsets', where FOO is the
+       array the offsets refer to; for example, 'method_offsets' holds the offsets
+       into 'methods'.
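
The two-symbol layout can be illustrated with a small sketch. The arrays reuse the documented symbol names only for readability. Their contents are placeholder bytes (x86 nop/ret), and `method_code` is a hypothetical accessor, not part of Mono's runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the 'methods' and 'method_offsets' symbols:
 * the code of all methods is concatenated into one array, and a second
 * array records where each method starts. */
static const uint8_t methods[] = {
    0x90, 0xC3,             /* method 0 */
    0x90, 0x90, 0xC3,       /* method 1 */
    0xC3                    /* method 2 */
};
static const uint32_t method_offsets[] = { 0, 2, 5 };

#define N_METHODS (sizeof method_offsets / sizeof method_offsets[0])

/* Returns the start of a method's code. Only two symbols are needed,
 * however many methods the image contains. */
static const uint8_t *method_code(uint32_t method_index)
{
    assert(method_index < N_METHODS);
    return methods + method_offsets[method_index];
}
```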
+
+       Generating code using an assembler and linker has some disadvantages:
+       - it requires GNU binutils or an equivalent package to be installed on the
+         machine running the AOT compilation.
+       - it is slow.
+
+       There is some support in the AOT compiler for directly emitting ELF files, but
+       it is not complete (yet).
+       
+       The following things are saved in the object file and can be
+       looked up using the equivalent of dlsym:
+       
+               mono_assembly_guid
+       
+                       A copy of the assembly GUID.
+       
+               mono_aot_version
+
+                       The version of the AOT file format.
+       
+               mono_aot_opt_flags
+
+                       The optimization flags used to build this
+                       precompiled image.
+       
+               method_infos
+
+                       Contains additional information needed by the runtime for using the
+                       precompiled method, like the GOT entries it uses.
+
+               method_info_offsets
+
+                       Maps method indexes to offsets in the method_infos array.
+                       
+               mono_icall_table
+
+                       A table that lists all the internal calls
+                       referenced by the precompiled image.
+       
+               mono_image_table
+       
+                       A list of assemblies referenced by this AOT
+                       module.
+
+               methods
+                       
+                       The precompiled code itself.
+                       
+               method_offsets
+       
+                       Maps method indexes to offsets in the methods array.
+
+               ex_info
+
+                       Contains information about methods which is rarely used during normal execution, 
+                       like exception and debug info.
+
+               ex_info_offsets
+
+                       Maps method indexes to offsets in the ex_info array.
+
+               class_info
+
+                       Contains precomputed metadata used to speed up various runtime functions.
+
+               class_info_offsets
+
+                       Maps class indexes to offsets in the class_info array.
+
+               class_name_table
+
+                       A hash table mapping class names to class indexes. Used to speed up 
+                       mono_class_from_name ().
+
+               plt
+
+                       The Procedure Linkage Table.
+
+               plt_info
+
+                       Contains information needed to find the method belonging to a given PLT entry.
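
To illustrate why a precomputed class_name_table can speed up mono_class_from_name (), here is a sketch of a chained hash table. It is built at compile time and probed at run time. The bucket count, the djb2 hash and all names below are assumptions made for this example. The on-disk layout Mono actually uses is not described in this document.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N_BUCKETS   4   /* fixed when the image is built */
#define MAX_CLASSES 16

/* One entry per class: its name and the next entry in the same bucket
 * (-1 ends the chain). The entry index doubles as the class index. */
struct class_name_entry {
    const char *name;
    int next;
};

static struct class_name_entry entries[MAX_CLASSES];
static int buckets[N_BUCKETS];
static int n_entries;

/* djb2 string hash (an arbitrary choice for this sketch). */
static uint32_t name_hash(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

static void table_init(void)
{
    for (int b = 0; b < N_BUCKETS; b++)
        buckets[b] = -1;
    n_entries = 0;
}

/* Build side (AOT compile time): add a class and return its index. */
static int table_add(const char *name)
{
    uint32_t b = name_hash(name) % N_BUCKETS;

    assert(n_entries < MAX_CLASSES);
    entries[n_entries].name = name;
    entries[n_entries].next = buckets[b];
    buckets[b] = n_entries;
    return n_entries++;
}

/* Lookup side (run time): walk one chain instead of scanning all classes. */
static int table_lookup(const char *name)
{
    for (int i = buckets[name_hash(name) % N_BUCKETS]; i != -1; i = entries[i].next)
        if (strcmp(entries[i].name, name) == 0)
            return i;
    return -1;
}
```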
+
+** Source file structure
+-----------------------------
+
+       The AOT infrastructure is split into two files, aot-compiler.c and 
+       aot-runtime.c. aot-compiler.c contains the AOT compiler which is invoked by
+       --aot, while aot-runtime.c contains the runtime support needed for loading
+       code and other things from the aot files.
+
+** Compilation process
+----------------------------
+
+       AOT compilation consists of the following stages:
+       - collecting the methods to be compiled.
+       - compiling them using the JIT.
+       - emitting the JITted code and other information into an assembly file (.s).
+       - assembling the file using the system assembler.
+       - linking the resulting object file into a shared library using the system
+         linker.
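
The last two stages amount to running the system toolchain on the emitted file. The sketch below only formats the commands. It assumes GNU as and ld and a hypothetical naming scheme (<base>.s, <base>.o, <base>.so). The exact file names and any extra flags Mono passes may differ. `build_toolchain_commands` is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format the assembler and linker invocations for an
 * output base name. Returns 0 on success, -1 if a buffer is too small. */
static int build_toolchain_commands(const char *base,
                                    char *as_cmd, size_t as_size,
                                    char *ld_cmd, size_t ld_size)
{
    int n1, n2;

    assert(base != NULL && strlen(base) > 0);
    /* assemble the emitted .s file into an object file */
    n1 = snprintf(as_cmd, as_size, "as -o %s.o %s.s", base, base);
    /* link the object file into a shared library */
    n2 = snprintf(ld_cmd, ld_size, "ld -shared -o %s.so %s.o", base, base);
    if (n1 < 0 || (size_t)n1 >= as_size || n2 < 0 || (size_t)n2 >= ld_size)
        return -1;                      /* output was truncated */
    return 0;
}
```

A caller would then run each string, for example with system (), and stop if either command fails.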
+
+** Handling compiled code
+----------------------------
+
+       Each method is identified by a method index. For normal methods, this is
+       equivalent to its index in the METHOD metadata table. For runtime generated
+       methods (wrappers), it is an arbitrary number.
+
+       Compiled code is created by invoking the JIT, requesting it to create AOT
+       code instead of normal code. This is done by the compile_method () function.
+       The output of the JIT is compiled code and a set of patches (relocations). Each
+       relocation specifies an offset inside the compiled code, and a runtime object
+       whose address is accessed at that offset.
+
+       Patches are described by a MonoJumpInfo structure. From the perspective
+       of the AOT compiler, there are two kinds of patches:
+       - calls, which require an entry in the PLT.
+       - everything else, which requires an entry in the GOT.
+       How patches are handled is described in the next section.
+
+       After all the methods are compiled, they are emitted into the output file as
+       a byte array called 'methods'. The emission is done by the emit_method_code ()
+       and emit_and_reloc_code () functions. Each piece of compiled code is identified
+       by the local symbol .Lm_<method index>. While compiled code is emitted, all the
+       locations which have an associated patch are rewritten using a platform-specific
+       process so the final generated code will refer to the PLT and GOT entries
+       belonging to the patches. The compiled code array can be accessed using the
+       'methods' global symbol.
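
The following is a minimal sketch of how compiled code might be written out as assembler text under its .Lm_<method index> label. `emit_method_bytes` is hypothetical and much simpler than emit_method_code () and emit_and_reloc_code (). It emits the bytes verbatim as .byte directives and performs none of the patch rewriting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical emitter: append one method's bytes to the generated .s file
 * under the local label .Lm_<method index>. Patched locations are written
 * out unchanged here; the real emitter rewrites them to use PLT/GOT entries. */
static void emit_method_bytes(FILE *out, int method_index,
                              const uint8_t *code, size_t len)
{
    assert(out != NULL && (code != NULL || len == 0));
    fprintf(out, ".Lm_%d:\n", method_index);
    for (size_t i = 0; i < len; i++) {
        /* at most 8 bytes per .byte directive */
        fputs(i % 8 == 0 ? "\t.byte " : ",", out);
        fprintf(out, "0x%02x", (unsigned)code[i]);
        if (i % 8 == 7 || i + 1 == len)
            fputc('\n', out);
    }
}
```

For method 7 with the bytes 0x90, 0xc3, this writes `.Lm_7:` followed by `.byte 0x90,0xc3`. In this model, a single global 'methods' label would precede the first method, so every .Lm_ label falls inside the 'methods' array.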
+
+** Handling patches
+----------------------------
+
+       Before a piece of AOTed code can be used, the GOT entries used by it must be
+       filled out with the addresses of runtime objects. Those objects are identified
+       by MonoJumpInfo structures. These structures are saved in a serialized form in
+       the AOT file, so the AOT loader can reconstruct them. The serialization is done
+       by the encode_patch () function, while the deserialization is done by the
+       decode_patch_info () function.
+
+       Every method has an associated method info blob inside the 'method_infos' byte
+       array in the AOT file. This contains all the information required to load the
+       method at runtime:
+       - the first GOT entry used by the method.
+       - the number of GOT entries used by the method.
+       - the serialized patch info for the GOT entries.
+       Some patches, like vtables and icalls, are very common, so instead of emitting
+       their info every time they are used by a method, we emit the info only once into
+       a byte array named 'got_info', and only emit an index into this array for every
+       access.
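
The layout of a method info blob can be sketched as follows. The sketch uses unsigned LEB128 as the variable-length integer encoding; this stand-in was chosen for illustration, since Mono's own encoder is not described here. `encode_method_info` and the field order are hypothetical. For simplicity, every GOT slot is described by an index into 'got_info' rather than by inline serialized patch data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned LEB128: 7 payload bits per byte, high bit set on every byte
 * except the last. Returns the number of bytes written. */
static size_t encode_uleb128(uint32_t value, uint8_t *out)
{
    size_t n = 0;

    do {
        uint8_t byte = value & 0x7F;
        value >>= 7;
        if (value != 0)
            byte |= 0x80;           /* more bytes follow */
        out[n++] = byte;
    } while (value != 0);
    return n;
}

/* Reads one value and advances *p past it. */
static uint32_t decode_uleb128(const uint8_t **p)
{
    uint32_t value = 0;
    int shift = 0;
    uint8_t byte;

    do {
        byte = *(*p)++;
        value |= (uint32_t)(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);
    return value;
}

/* Hypothetical method info blob: first GOT slot, number of slots, then one
 * index into the shared 'got_info' array per slot. */
static size_t encode_method_info(uint8_t *out, uint32_t first_got,
                                 uint32_t n_got, const uint32_t *got_info_index)
{
    size_t len;

    assert(out != NULL && (n_got == 0 || got_info_index != NULL));
    len = encode_uleb128(first_got, out);
    len += encode_uleb128(n_got, out + len);
    for (uint32_t i = 0; i < n_got; i++)
        len += encode_uleb128(got_info_index[i], out + len);
    return len;
}
```

The loader reads the fields back in the same order with `decode_uleb128`. Small values take a single byte, which keeps per-method blobs compact.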
+
+** The Procedure Linkage Table (PLT)
+------------------------------------
+
+       Our PLT is similar to the ELF PLT; it is used to handle calls between methods.
+       If method A needs to call method B, then an entry is allocated in the PLT for
+       method B, and A calls that entry instead of B directly. This is useful because
+       in some cases the runtime needs to do some processing the first time B is
+       called. There are two cases:
+       - if B is in another assembly, then it needs to be looked up, and then either
+         JITted, or its corresponding AOT code needs to be found.
+       - if B is in the same assembly but has GOT slots, then the GOT slots need to be
+         initialized.
+       If neither case applies, then the PLT is not used, and the call is made
+       directly to the native code of the target method.
+
+       A PLT entry is usually implemented by a jump through a jump table, where the
+       jump table entries are initially filled with the address of a trampoline so
+       the runtime can get control; after the native code of the called method is
+       created or found, the jump table entry is changed to point to the native code.
+       All PLT entries also embed an integer offset after the jump which indexes into
+       the 'plt_info' table, which stores the information required to find the called
+       method. The PLT is emitted by the emit_plt () function.
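
The lazy binding described above can be modelled in portable C with function pointers. In this sketch (all names hypothetical), the jump table is an array of function pointers. Each entry starts out pointing at a per-entry trampoline that knows its PLT index, and the resolver patches the entry on the first call. A real PLT consists of machine-code jumps emitted by emit_plt () and resolves callees through 'plt_info' rather than through stored pointers.

```c
#include <assert.h>

typedef int (*method_fn)(int);

#define PLT_SIZE 2

/* Native code of two callee methods, found lazily on first call. */
static int method_double(int x) { return 2 * x; }
static int method_negate(int x) { return -x; }

/* Stand-in for 'plt_info': how to find the callee of each PLT entry.
 * A real image stores a method reference to look up, not a pointer. */
static const method_fn plt_info[PLT_SIZE] = { method_double, method_negate };

static int resolve_count;   /* how many times the trampoline ran */

static int plt_tramp_0(int arg);
static int plt_tramp_1(int arg);

/* Jump table: every entry initially points at a trampoline. */
static method_fn jump_table[PLT_SIZE] = { plt_tramp_0, plt_tramp_1 };

/* Runtime side: find the callee, patch the jump table entry so later
 * calls go straight to the native code, then complete this call. */
static int plt_resolve(unsigned plt_index, int arg)
{
    assert(plt_index < PLT_SIZE);
    resolve_count++;
    jump_table[plt_index] = plt_info[plt_index];
    return jump_table[plt_index](arg);
}

/* Each trampoline knows its PLT index, playing the role of the integer
 * offset embedded after the jump in a real PLT entry. */
static int plt_tramp_0(int arg) { return plt_resolve(0, arg); }
static int plt_tramp_1(int arg) { return plt_resolve(1, arg); }

/* What a call site does: jump through the entry's jump table slot. */
static int plt_call(unsigned plt_index, int arg)
{
    return jump_table[plt_index](arg);
}
```

Only the first call through each entry pays for resolution; after that, `plt_call` reaches the target with a single indirect call.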
+
+** Exception/Debug info
+----------------------------
+
+       Each compiled method has some additional info generated by the JIT, usable 
+       for debugging (IL offset-native offset maps) and exception handling 
+       (saved registers, native offsets of try/catch clauses). Since this info is
+       rarely needed, it is saved into a separate byte array called 'ex_info'.
+
+** Cached metadata
+---------------------------
+
+       When the runtime loads a class, it needs to compute a variety of information
+       which is not readily available in the metadata, like the instance size, the
+       vtable, whether the class has a finalizer/type initializer, etc. Computing this
+       information requires a lot of time, causes the loading of lots of metadata,
+       and usually involves the creation of many runtime data structures
+       (MonoMethod/MonoMethodSignature etc.), which are long-lived and usually persist
+       for the lifetime of the app. To avoid this, we compute the required information
+       at AOT compilation time and save it into the AOT image, in an array called
+       'class_info'. The runtime can query this information using the
+       mono_aot_get_cached_class_info () function, and if the information is available,
+       it can avoid computing it.
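
The query-then-fallback pattern can be sketched like this. The struct fields, `get_cached_class_info` and `compute_class_info` are hypothetical stand-ins. The real entry point is mono_aot_get_cached_class_info (), and its signature is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the information cached per class. */
struct class_info_entry {
    bool valid;               /* false: nothing cached for this class */
    uint32_t instance_size;
    bool has_finalizer;
    bool has_cctor;           /* type initializer */
};

/* Stand-in for the 'class_info' array in the AOT image. */
static const struct class_info_entry class_info[] = {
    { true, 16, false, true },
    { false, 0, false, false },
};
#define N_CLASSES (sizeof class_info / sizeof class_info[0])

static int slow_path_count;

/* Analogue of querying the image: succeeds only if an entry is cached. */
static bool get_cached_class_info(uint32_t class_index,
                                  struct class_info_entry *out)
{
    if (class_index >= N_CLASSES || !class_info[class_index].valid)
        return false;
    *out = class_info[class_index];
    return true;
}

/* Placeholder for the expensive path (loading metadata, building runtime
 * structures); here it just returns fixed dummy values. */
static void compute_class_info(uint32_t class_index,
                               struct class_info_entry *out)
{
    (void)class_index;
    slow_path_count++;
    out->valid = true;
    out->instance_size = 12;
    out->has_finalizer = false;
    out->has_cctor = false;
}

static struct class_info_entry init_class(uint32_t class_index)
{
    struct class_info_entry info;

    if (!get_cached_class_info(class_index, &info))
        compute_class_info(class_index, &info);
    assert(info.valid);
    return info;
}
```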
+
+** Full AOT mode
+-------------------------
+
+       Some platforms, like the iPhone, prohibit JITted code, using technical and/or
+       legal means. This is a significant problem for the Mono runtime, since it
+       generates a lot of code dynamically, using either the JIT or more low-level
+       code generation macros. To solve this, the AOT compiler is able to function in
+       full-aot or aot-only mode, where it generates and saves all the necessary code
+       in the AOT image, so at runtime, no code needs to be generated.
+       There are two kinds of code which need to be considered:
+       - wrapper methods, that is, methods whose IL is generated dynamically by the
+         runtime. They are handled by generating them in the add_wrappers () function,
+         then emitting them the same way as the 'normal' methods. The only problem is
+         that these methods do not have a MethodDef token, so we need a separate table
+         in the AOT image ('wrapper_info') to find their method index.
+       - trampolines and other small hand-generated pieces of code. They are handled
+         in an ad-hoc way in the emit_trampolines () function.
+
+* Performance considerations
+----------------------------
+
+       Using AOT code is a trade-off which might lead to higher or
+       lower performance, depending on a lot of circumstances. Some
+       of these are:
+
+       - AOT code needs to be loaded from disk before being used, so
+         cold startup of an application using AOT code MIGHT be
+         slower than using JITted code. Warm startup (when the code is
+         already in the machine's cache) should be faster.  Also,
+         JITting code takes time, and the JIT compiler also needs to
+         load additional metadata for the method from the disk, so
+         startup can be faster even in the cold startup case.
+
+       - AOT code is usually compiled with all optimizations turned
+         on, while JITted code is usually compiled with default
+         optimizations, so the generated code in the AOT case should
+         be faster.
+
+       - JITted code can directly access runtime data structures and
+         helper functions, while AOT code needs to go through an
+         indirection (the GOT) to access them, so it will be slower
+         and somewhat bigger as well.
+
+       - When JITting code, the JIT compiler needs to load a lot of
+         metadata about methods and types into memory, which
+         increases memory usage.
+
+       - JITted code has better locality, meaning that if method A
+         calls method B, then the native code for A and B is usually
+         quite close in memory, leading to better cache behaviour and
+         thus improved performance. In contrast, the native code of
+         methods inside the AOT file is in a somewhat random order.
+       
+* Future Work
+-------------
+
+       - Currently, when an AOT module is loaded, all of its
+         dependent assemblies are also loaded eagerly, and these
+         assemblies need to be exactly the same as the ones loaded
+         when the AOT module was created ('hard binding'). Non-hard
+         binding should be allowed.
+
+       - On x86, the generated code uses call 0, pop REG, add
+         GOTOFFSET, REG to materialize the GOT address. Newer
+         versions of gcc use a separate function to do this; maybe we
+         need to do the same.
+
+       - Currently, we get vtable addresses from the GOT. Another
+         solution would be to store the data from the vtables in the
+         .bss section, so accessing them would involve less
+         indirection.
+       
  
-- the main problem is how to patch runtime related addresses, for example:
  
-  - current application domain
-  - string objects loaded with LDSTR
-  - address of MonoClass data
-  - static field offsets 
-  - method addresses
-  - virtual function and interface slots