   1                        The Internals of the Mono C# Compiler
   2
   3                                 Miguel de Icaza
   4                               (miguel@ximian.com)
   5                                       2002
   6
   7 * Abstract
   8
   9         The Mono C# compiler is a C# compiler written in C# itself.
  10         Its goal is to provide a free, alternative implementation
  11         of the C# language.  The Mono C# compiler generates ECMA CIL
  12         images through the System.Reflection.Emit API, which
  13         enables the compiler to be platform independent.
  14
  15 * Overview: How the compiler fits together
  16
  17         The compilation process is managed by the compiler driver (it
  18         lives in driver.cs).
  19
  20         The compiler reads a set of C# source code files, and parses
  21         them.  Any assemblies or modules that the user might want to
  22         use with the project are loaded after parsing is done.
  23
  24         Once all the files have been parsed, the type hierarchy is
  25         resolved.  First interfaces are resolved, then types and
  26         enumerations.
  27
  28         Once the type hierarchy is resolved, every type is populated:
  29         fields, methods, indexers, properties, events and delegates
  30         are entered into the type system.
  31
  32         At this point the program skeleton has been completed.  The
  33         next process is to actually emit the code for each of the
  34         executable methods.  The compiler drives this from
  35         RootContext.EmitCode.
  36
  37         Each type then has to populate its methods: populating a
  38         method requires creating a structure that is used as the state
  39         of the block being emitted (this is the EmitContext class) and
  40         then generating code for the topmost statement (the Block).
  41
  42         Code generation has two steps: the first step is the semantic
  43         analysis (Resolve method) that resolves any pending tasks, and
  44         guarantees that the code is correct.  The second step is the
  45         actual code emission.  All errors are flagged during the
  46         "Resolution" process.
  47
  48         After all code has been emitted, the compiler closes all
  49         the types (this basically tells the Reflection.Emit library to
  50         finish up the types).  Resources and the definition of the
  51         entry point are also handled at this point, and the output is
  52         saved to disk.
  53
  54         The following list will give you an idea of where the
  55         different pieces of the compiler live:
  56
  57         Infrastructure:
  58
  59             driver.cs:
  60                 This drives the compilation process: loading of
  61                 command line options; parsing the input files;
  62                 loading the referenced assemblies; resolving the type
  63                 hierarchy and emitting the code.
  64
  65             codegen.cs:
  66
  67                 The state tracking for code generation.
  68
  69             attribute.cs:
  70
  71                 Code to do semantic analysis and emit the attributes
  72                 is here.
  73
  74             rootcontext.cs:
  75
  76                 Keeps track of the types defined in the source code,
  77                 as well as the assemblies loaded.
  78
  79             typemanager.cs:
  80
  81                 This contains the MCS type system.
  82
  83             report.cs:
  84
  85                 Error and warning reporting methods.
  86
  87             support.cs:
  88
  89                 Assorted utility functions used by the compiler.
  90
  91         Parsing
  92
  93             cs-tokenizer.cs:
  94
  95                 The tokenizer for the C# language; it also includes
  96                 the C# pre-processor.
  97
  98             cs-parser.jay, cs-parser.cs:
  99
 100                 The parser is implemented using a C# port of the Yacc
 101                 parser.  The parser lives in the cs-parser.jay file,
 102                 and cs-parser.cs is the generated parser.
 103
 104             location.cs:
 105
 106                 The `location' structure is a compact representation
 107                 of a file, line, column where a token, or a high-level
 108                 construct appears.  This is used to report errors.
 109
 110         Expressions:
 111
 112             ecore.cs:
 113
 114                 Basic expression classes and interfaces; most shared
 115                 code and static methods are here.
 116
 117             expression.cs:
 118
 119                 Most of the different kinds of expression classes
 120                 live in this file.
 121
 122             assign.cs:
 123
 124                 The assignment expression got its own file.
 125
 126             constant.cs:
 127
 128                 The classes that represent the constant expressions.
 129
 130             literal.cs:
 131
 132                 Literals are constants that have been entered manually
 133                 in the source code, like `1' or `true'.  The compiler
 134                 needs to tell constants and literals apart during the
 135                 compilation process, as literals sometimes have some
 136                 implicit extra conversions defined for them.
 137
 138             cfold.cs:
 139
 140                 The constant folder for binary expressions.
 141
 142         Statements
 143
 144             statement.cs:
 145
 146                 All of the abstract syntax tree elements for
 147                 statements live in this file.  This also drives the
 148                 semantic analysis process.
 149
 150         Declarations, Classes, Structs, Enumerations
 151
 152             decl.cs:
 153
 154                 This contains the base class for Members and
 155                 Declaration Spaces.   A declaration space introduces
 156                 new names in types, so classes, structs, delegates and
 157                 enumerations derive from it.
 158
 159             class.cs:
 160
 161                 Methods for holding and defining class and struct
 162                 information, and every member that can be in these
 163                 (methods, fields, delegates, events, etc).
 164
 165                 The most interesting type here is the `TypeContainer',
 166                 which derives from `DeclSpace'.
 167
 168             delegate.cs:
 169
 170                 Handles delegate definition and use.
 171
 172             enum.cs:
 173
 174                 Handles enumerations.
 175
 176             interface.cs:
 177
 178                 Holds and defines interfaces.  All the code related to
 179                 interface declaration lives here.
 180
 181             parameter.cs:
 182
 183                 During the parsing process, the compiler encapsulates
 184                 parameters in the Parameter and Parameters classes.
 185                 These classes provide definition and resolution tools
 186                 for them.
 187
 188             pending.cs:
 189
 190                 Routines to track pending implementations of abstract
 191                 methods and interfaces.  These are used by the
 192                 TypeContainer-derived classes to track whether every
 193                 method required is implemented.
 194
 195
 196 * The parsing process
 197
 198         All the input files that make up a program need to be read in
 199         advance, because C# allows declarations to happen after an
 200         entity is used, for example, the following is a valid program:
 201
 202         class X : Y {
 203                 static void Main ()
 204                 {
 205                         a = "hello"; b = "world";
 206                 }
 207                 static string a;          // declared after its use in Main
 208         }
 209
 210         class Y {
 211                 public static string b;   // inherited by X, declared later
 212         }
 213
 214         At the time the assignment expression `a = "hello"' is parsed,
 215         it is not known whether a is a class field from this class, or
 216         its parents, or whether it is a property access or a variable
 217         reference.  The actual meaning of `a' will not be discovered
 218         until the semantic analysis phase.
 219
 220 ** The Tokenizer and the pre-processor
 221
 222         The tokenizer is contained in the file `cs-tokenizer.cs', and
 223         the main entry point is the `token ()' method.  The tokenizer
 224         implements the `yyParser.yyInput' interface, which is what the
 225         Yacc/Jay parser will use when fetching tokens.
 226
 227         Token definitions are generated by jay during the compilation
 228         process, and those can be referenced from the tokenizer class
 229         with the `Token.' prefix.
 230
 231         Each time a token is returned, the location for the token is
 232         recorded into the `Location' property, which can be accessed by
 233         the parser.  The parser retrieves the Location properties as
 234         it builds its internal representation to allow the semantic
 235         analysis phase to produce error messages that can pinpoint
 236         the location of the problem.
 237
 238         Some tokens have values associated with them, for example when
 239         the tokenizer encounters a string, it will return a
 240         LITERAL_STRING token, and the actual string parsed will be
 241         available in the `Value' property of the tokenizer.   The same
 242         mechanism is used to return integers and floating point
 243         numbers.
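
             The contract described above (a `token ()' method that
             returns an integer code, with the payload and position left
             in the `Value' and `Location' properties) can be sketched as
             follows.  This is only an illustration, not the code in
             cs-tokenizer.cs: the MiniTokenizer class, the numeric token
             codes other than the LITERAL_STRING name, and the use of a
             plain line number as the location are all assumptions.

```csharp
using System;

// Hypothetical token codes; in mcs the real ones are generated by jay.
public static class Token
{
    public const int EOF = 0;
    public const int IDENTIFIER = 257;
    public const int LITERAL_INTEGER = 258;
    public const int LITERAL_STRING = 259;
}

public class MiniTokenizer
{
    readonly string input;
    int pos;
    int line = 1;

    // Payload of the last token (string or int), or null if it has none.
    public object Value { get; private set; }

    // Line on which the last token started (a simplified location).
    public int Location { get; private set; }

    public MiniTokenizer(string input) { this.input = input; }

    public int token()
    {
        // Skip whitespace, counting lines as we go.
        while (pos < input.Length && char.IsWhiteSpace(input[pos])) {
            if (input[pos] == '\n') line++;
            pos++;
        }
        Location = line;
        Value = null;
        if (pos >= input.Length) return Token.EOF;

        char c = input[pos];
        int start = pos;
        if (char.IsDigit(c)) {
            while (pos < input.Length && char.IsDigit(input[pos])) pos++;
            Value = int.Parse(input.Substring(start, pos - start));
            return Token.LITERAL_INTEGER;
        }
        if (c == '"') {
            start = ++pos;
            while (pos < input.Length && input[pos] != '"') pos++;
            Value = input.Substring(start, pos - start);
            if (pos < input.Length) pos++;   // consume the closing quote
            return Token.LITERAL_STRING;
        }
        if (char.IsLetter(c) || c == '_') {
            while (pos < input.Length &&
                   (char.IsLetterOrDigit(input[pos]) || input[pos] == '_')) pos++;
            Value = input.Substring(start, pos - start);
            return Token.IDENTIFIER;
        }
        pos++;
        return c;   // punctuation: the character code is the token code
    }
}
```

             A parser would call `token ()' in a loop and read `Value'
             and `Location' right after each call, before asking for the
             next token.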
 244
 245         C# has a limited pre-processor that allows conditional
 246         compilation, but it is not as fully featured as the C
 247         pre-processor, and most notably, macros are missing.  This
 248         makes it simple to implement in very few lines and mesh it
 249         with the tokenizer.
 250
 251         The `handle_preprocessing_directive' method in the tokenizer
 252         handles all the pre-processing, and it is invoked when the '#'
 253         symbol is found as the first token in a line.
 254
 255         The state of the pre-processor is contained in a Stack called
 256         `ifstack'.  This state is used to track the if/elif/else/endif
 257         nesting and the current state.  The state is encoded in the
 258         top of the stack using the values `TAKING',
 259         `TAKEN_BEFORE', `ELSE_SEEN' and `PARENT_TAKING'.
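
             One way such a stack can work is sketched below.  It is a
             guess at the general technique, not a copy of the tokenizer
             code: the four names come from the text above, but treating
             them as bit flags, the numeric values, and the
             IfStackSketch class with its If/Elif/Else/Endif methods are
             assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public class IfStackSketch
{
    // Assumed encoding: one bit per state named in the text.
    const int TAKING = 1;         // lines in the current branch are compiled
    const int TAKEN_BEFORE = 2;   // some earlier branch of this #if was taken
    const int ELSE_SEEN = 4;      // #else already seen for this #if
    const int PARENT_TAKING = 8;  // the enclosing region is being compiled

    readonly Stack<int> ifstack = new Stack<int>();

    // True when the tokenizer should return tokens from the current line.
    public bool Taking
    {
        get { return ifstack.Count == 0 || (ifstack.Peek() & TAKING) != 0; }
    }

    public void If(bool condition)
    {
        int state = Taking ? PARENT_TAKING : 0;
        if (condition && state != 0) state |= TAKING | TAKEN_BEFORE;
        ifstack.Push(state);
    }

    public void Elif(bool condition)
    {
        int state = Pop("#elif");
        if ((state & ELSE_SEEN) != 0)
            throw new InvalidOperationException("#elif after #else");
        state &= ~TAKING;
        if (condition && CanTake(state)) state |= TAKING | TAKEN_BEFORE;
        ifstack.Push(state);
    }

    public void Else()
    {
        int state = Pop("#else");
        if ((state & ELSE_SEEN) != 0)
            throw new InvalidOperationException("duplicate #else");
        state = (state & ~TAKING) | ELSE_SEEN;
        if (CanTake(state)) state |= TAKING | TAKEN_BEFORE;
        ifstack.Push(state);
    }

    public void Endif() { Pop("#endif"); }

    // A branch may be taken only if the parent is live and no sibling was.
    static bool CanTake(int state)
    {
        return (state & PARENT_TAKING) != 0 && (state & TAKEN_BEFORE) == 0;
    }

    int Pop(string directive)
    {
        if (ifstack.Count == 0)
            throw new InvalidOperationException(directive + " without #if");
        return ifstack.Pop();
    }
}
```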
 260
 261 ** Locations
 262
 263         Locations are encoded as a 32-bit number (the Location
 264         struct) that maps each input source line to a linear number.
 265         As new files are parsed, the Location manager is informed of
 266         the new file, to allow it to map back from an int constant to
 267         a file + line number.
 268
 269         The tokenizer also tracks the column number for a token, but
 270         this is currently not being used or encoded.  It could
 271         probably be encoded in the low 9 bits, allowing for columns
 272         from 1 to 512 to be encoded.
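
             The mapping from (file, line) to a single integer and back
             can be sketched as below.  The LocationTable class and its
             Push/Encode/Decode methods are hypothetical names chosen
             for this example; the real Location struct may organize
             its tables differently.

```csharp
using System;
using System.Collections.Generic;

public class LocationTable
{
    readonly List<string> files = new List<string>();
    readonly List<int> bases = new List<int>();  // linear number of line 1
    int highest = 0;                             // 0 means "no location"

    // Called when the driver starts parsing a new source file.
    public void Push(string file)
    {
        files.Add(file);
        bases.Add(highest + 1);
    }

    // Encode a line of the file currently being parsed.
    public int Encode(int line)
    {
        if (files.Count == 0)
            throw new InvalidOperationException("no file pushed");
        int value = bases[bases.Count - 1] + line - 1;
        if (value > highest) highest = value;
        return value;
    }

    // Map a linear number back to a file name and line number.
    public bool Decode(int value, out string file, out int line)
    {
        for (int i = bases.Count - 1; i >= 0; i--) {
            if (value >= bases[i]) {
                file = files[i];
                line = value - bases[i] + 1;
                return true;
            }
        }
        file = null;
        line = 0;
        return false;
    }
}
```

             Because each file starts where the previous one ended, an
             AST node only needs to store one int to report an error
             position later.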
 273
 274 * The Parser
 275
 276         The parser is written using Jay, a port of Berkeley Yacc to
 277         Java that I later ported to C#.
 278
 279         Many people ask why the grammar of the parser does not match
 280         exactly the definition in the C# specification.  The reason is
 281         simple: the grammar in the C# specification is designed to be
 282         consumed by humans, and not by a computer program.  Before
 283         you can feed this grammar to a tool, it needs to be simplified
 284         to allow the tool to generate a correct parser for it.
 285
 286         In the Mono C# compiler, we use a class for each of the
 287         statements and expressions in the C# language.  For example,
 288         there is a `While' class for the `while' statement, a
 289         `Cast' class to represent a cast expression and so on.
 290
 291         There is a Statement class, and an Expression class which are
 292         the base classes for statements and expressions.
 293
 294 ** Namespaces
 295
 296         Using list.
 297
 298 * Internal Representation
 299
 300 ** Expressions
 301
 302         Expressions in the Mono C# compiler are represented by the
 303         `Expression' class.  This is an abstract class that particular
 304         kinds of expressions have to inherit from and override a few
 305         methods.
 306
 307         The base Expression class contains two fields: `eclass' which
 308         represents the "expression classification" (from the C#
 309         specs) and the type of the expression.
 310
 311         Expressions have to be resolved before they can be used.
 312         The resolution process is implemented by overriding the
 313         `DoResolve' method.  The DoResolve method has to set the
 314         `eclass' field and the `type', perform all error checking and
 315         computations that will be required for code generation at this
 316         stage.
 317
 318         The return value from DoResolve is an expression.  Most of the
 319         time an Expression derived class will return itself (return
 320         this) when it will handle the emission of the code itself, or
 321         it can return a new Expression.
 322
 323         For example, the parser will create an "ElementAccess" class
 324         for:
 325
 326                 a [0] = 1;
 327
 328         During the resolution process, the compiler will know whether
 329         this is an array access or an indexer access, and will
 330         return either an ArrayAccess expression or an IndexerAccess
 331         expression from DoResolve.
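
             The "resolve and possibly replace yourself" pattern can be
             sketched as follows.  The class names ElementAccess,
             ArrayAccess and IndexerAccess and the `eclass' and `type'
             fields come from the text; everything else (the
             TypedOperand helper, the reduced ExprClass enum, using
             System.Type and reflection to find the indexer, and a
             parameterless DoResolve) is a simplification assumed for
             this example.

```csharp
using System;
using System.Reflection;

// Reduced, hypothetical subset of the expression classifications.
public enum ExprClass { Invalid, Value, Variable, IndexerAccess }

public abstract class Expression
{
    public ExprClass eclass = ExprClass.Invalid;
    public Type type;

    // Returns the resolved expression (often `this'), or null on error.
    public abstract Expression DoResolve();
}

// Hypothetical stand-in for an operand whose type is already known.
public class TypedOperand : Expression
{
    public TypedOperand(Type t) { type = t; }
    public override Expression DoResolve()
    {
        eclass = ExprClass.Variable;
        return this;
    }
}

public class ArrayAccess : Expression
{
    readonly Expression target;
    public ArrayAccess(Expression target) { this.target = target; }
    public override Expression DoResolve()
    {
        eclass = ExprClass.Variable;
        type = target.type.GetElementType();
        return this;
    }
}

public class IndexerAccess : Expression
{
    readonly Expression target;
    public IndexerAccess(Expression target) { this.target = target; }
    public override Expression DoResolve()
    {
        // Pick the first default-member property as "the" indexer.
        foreach (MemberInfo m in target.type.GetDefaultMembers()) {
            PropertyInfo p = m as PropertyInfo;
            if (p != null) {
                eclass = ExprClass.IndexerAccess;
                type = p.PropertyType;
                return this;
            }
        }
        return null;   // a real compiler would report an error here
    }
}

// What the parser creates for `a [0]' before it knows what `a' is.
public class ElementAccess : Expression
{
    readonly Expression target;
    public ElementAccess(Expression target) { this.target = target; }
    public override Expression DoResolve()
    {
        Expression t = target.DoResolve();
        if (t == null) return null;
        Expression replacement;
        if (t.type.IsArray) replacement = new ArrayAccess(t);
        else replacement = new IndexerAccess(t);
        return replacement.DoResolve();   // hand back the specialized node
    }
}
```

             The caller must always continue with the value returned
             from DoResolve rather than with the node it called it on.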
 332
 333
 334
 335 *** The Expression Class
 336
 337         The utility functions that can be called by all children of
 338         Expression.
 339
 340 ** Constants
 341
 342         Constants in the Mono C# compiler are represented by the
 343         abstract class `Constant'.  Constant is in turn derived from
 344         Expression.  The base constructor for `Constant' just sets the
 345         expression class to be an `ExprClass.Value'.  Constants are
 346         born in a fully resolved state, so the `DoResolve' method
 347         only returns a reference to itself.
 348
 349         Each Constant should implement the `GetValue' method, which
 350         returns an object with the actual contents of this constant.  A
 351         utility virtual method called `AsString' is used to render a
 352         diagnostic message.  The output of AsString is shown to the
 353         developer when an error or a warning is triggered.
 354
 355         Constant classes also participate in the constant folding
 356         process.  Expressions that can be constant folded do so
 357         by invoking the functionality provided by the ConstantFold
 358         class (cfold.cs).
 359
 360         Each Constant has to implement a number of methods to convert
 361         itself into a Constant of a different type.  These methods are
 362         called `ConvertToXXXX' and they are invoked by the wrapper
 363         functions `ToXXXX'.  These methods only perform implicit
 364         numeric conversions.  Explicit conversions are handled by the
 365         `Cast' expression class.
 366
 367         The `ToXXXX' methods are the entry point, and provide error
 368         reporting in case a conversion can not be performed.
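
             The split between the `ConvertToXXXX' workers and the
             `ToXXXX' wrappers might look roughly like this.  Only the
             names Constant, GetValue, AsString and the
             ConvertToXXXX/ToXXXX naming pattern come from the text;
             the IntConstant and DoubleConstant classes, the choice of
             int and double, and the `report' callback used instead of
             the compiler's own error reporting are assumptions.

```csharp
using System;

public abstract class Constant
{
    public abstract object GetValue();

    public virtual string AsString() { return GetValue().ToString(); }

    // Workers: return null when no implicit numeric conversion exists.
    public virtual IntConstant ConvertToInt() { return null; }
    public virtual DoubleConstant ConvertToDouble() { return null; }

    // Entry points: call the worker and report a failure.
    public IntConstant ToInt(Action<string> report)
    {
        IntConstant c = ConvertToInt();
        if (c == null)
            report("cannot implicitly convert `" + AsString() + "' to int");
        return c;
    }

    public DoubleConstant ToDouble(Action<string> report)
    {
        DoubleConstant c = ConvertToDouble();
        if (c == null)
            report("cannot implicitly convert `" + AsString() + "' to double");
        return c;
    }
}

public class IntConstant : Constant
{
    public readonly int Value;
    public IntConstant(int value) { Value = value; }
    public override object GetValue() { return Value; }
    public override IntConstant ConvertToInt() { return this; }

    // int to double is an implicit numeric conversion in C#.
    public override DoubleConstant ConvertToDouble()
    {
        return new DoubleConstant(Value);
    }
}

public class DoubleConstant : Constant
{
    public readonly double Value;
    public DoubleConstant(double value) { Value = value; }
    public override object GetValue() { return Value; }
    public override DoubleConstant ConvertToDouble() { return this; }

    // ConvertToInt is not overridden: double to int needs an explicit cast.
}
```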
 369
 370 ** Constant Folding
 371
 372         The C# language requires constant folding to be implemented.
 373         Constant folding is hooked up in the Binary.Resolve method.
 374         If both sides of a binary expression are constants, then the
 375         ConstantFold.BinaryFold routine is invoked.
 376
 377         This routine implements all the binary operator rules.  It
 378         mirrors the code that generates code for binary operators,
 379         whose output has to be evaluated at runtime.
 380
 381         If the constants can be folded, then a new constant expression
 382         is returned.  If not, then the null value is returned (for
 383         example, the concatenation of a string constant and a numeric
 384         constant is deferred to the runtime).
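
             The "fold or return null" convention can be illustrated
             with the toy folder below.  It is not the code in
             cfold.cs: the ConstantFoldSketch class, the use of boxed
             values instead of Constant objects, the `char' operator
             codes and the small set of operators handled are all
             assumptions made to keep the example short.

```csharp
using System;

public static class ConstantFoldSketch
{
    // Returns the folded value, or null when the operation must be
    // left for the code generator to emit and the runtime to evaluate.
    public static object BinaryFold(char op, object left, object right)
    {
        if (left is int && right is int) {
            int l = (int) left, r = (int) right;
            // checked: an overflow surfaces as an OverflowException here;
            // a real compiler would turn it into a compile-time error.
            switch (op) {
            case '+': return checked (l + r);
            case '-': return checked (l - r);
            case '*': return checked (l * r);
            }
            return null;
        }

        if (op == '+' && left is string && right is string)
            return (string) left + (string) right;

        // Mixed operands, such as string + int, are not folded.
        return null;
    }
}
```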
 385
 386 ** Side effects
 387
 388         a [i++]++
 389         a [i++] += 5;
 390
 391 ** Statements
 392
 393 * The semantic analysis
 394
 395         The compiler driver first has to parse all the input files.
 396         Once all the input files have been parsed, and an internal
 397         representation of the input program exists, the following
 398         steps are taken:
 399
 400                 * The interface hierarchy is resolved first.
 401                   As the interface hierarchy is constructed,
 402                   TypeBuilder objects are created for each one of
 403                   them.
 404
 405                 * The class and struct hierarchy is resolved next;
 406                   TypeBuilder objects are created for them.
 407
 408                 * Constants and enumerations are resolved.
 409
 410                 * Method, indexer, property, delegate and event
 411                   definitions are now entered into the TypeBuilders.
 412
 413                 * Elements that contain code are now invoked to
 414                   perform semantic analysis and code generation.
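
             In terms of the System.Reflection.Emit API, the ordering
             above corresponds roughly to the sketch below: TypeBuilders
             for interfaces first, then classes, then members, then
             method bodies, and finally closing the types.  The type
             and method names (IGreeter, Greeter, Answer) and the
             SkeletonSketch class are invented for the example, and the
             assembly is only kept in memory here rather than saved to
             disk as the compiler does.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public static class SkeletonSketch
{
    public static Type Build()
    {
        AssemblyBuilder asm = AssemblyBuilder.DefineDynamicAssembly(
            new AssemblyName("Sketch"), AssemblyBuilderAccess.Run);
        ModuleBuilder module = asm.DefineDynamicModule("Sketch");

        // 1. Interface hierarchy: one TypeBuilder per interface.
        TypeBuilder iface = module.DefineType("IGreeter",
            TypeAttributes.Public | TypeAttributes.Interface |
            TypeAttributes.Abstract);

        // 2. Class hierarchy: TypeBuilders that refer to the interfaces.
        TypeBuilder cls = module.DefineType("Greeter",
            TypeAttributes.Public | TypeAttributes.Class,
            typeof(object), new Type[] { iface });

        // 3. Members are entered into the TypeBuilders (no code yet).
        MethodBuilder decl = iface.DefineMethod("Answer",
            MethodAttributes.Public | MethodAttributes.Abstract |
            MethodAttributes.Virtual | MethodAttributes.HideBySig |
            MethodAttributes.NewSlot,
            typeof(int), Type.EmptyTypes);
        MethodBuilder impl = cls.DefineMethod("Answer",
            MethodAttributes.Public | MethodAttributes.Virtual |
            MethodAttributes.HideBySig | MethodAttributes.NewSlot |
            MethodAttributes.Final,
            typeof(int), Type.EmptyTypes);
        cls.DefineMethodOverride(impl, decl);

        // 4. Code is generated for the members that have bodies.
        ILGenerator il = impl.GetILGenerator();
        il.Emit(OpCodes.Ldc_I4, 42);
        il.Emit(OpCodes.Ret);

        // 5. The types are closed, interfaces before their implementors.
        iface.CreateType();
        return cls.CreateType();
    }
}
```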
 415
 416 * Output Generation
 417
 418 ** Code Generation
 419
 420         The EmitContext class is created any time that IL code is to
 421         be generated (methods, properties, indexers and attributes all
 422         create EmitContexts).
 423
 424         The EmitContext keeps track of the current namespace and type
 425         container.  This is used during name resolution.
 426
 427         An EmitContext is used by the underlying code generation
 428         facilities to track the state of code generation:
 429
 430                 * The ILGenerator used to generate code for this
 431                   method.
 432
 433                 * The TypeContainer where the code lives; this is used
 434                   to access the TypeBuilder.
 435
 436                 * The DeclSpace; this is used to resolve names through
 437                   RootContext.LookupType in the various statements and
 438                   expressions.
 439
 440         Code generation state is also tracked here:
 441
 442                 * CheckState:
 443
 444                   This variable tracks the `checked' state of the
 445                   compilation.  It controls whether we should generate
 446                   code that does overflow checking, or if we generate
 447                   code that ignores overflows.
 448
 449                   The default setting comes from the command line
 450                   option to generate checked or unchecked code plus
 451                   any source code changes using the checked/unchecked
 452                   statements or expressions.  Contrast this with the
 453                   ConstantCheckState flag.
 454
 455                 * ConstantCheckState
 456
 457                   The constant check state is always set to `true' and
 458                   cannot be changed from the command line.  The source
 459                   code can change this setting with the `checked' and
 460                   `unchecked' statements and expressions.
 461
 462                 * IsStatic
 463
 464                   Whether we are emitting code inside a static or
 465                   instance method.
 466
 467                 * ReturnType
 468
 469                   The type that is allowed to be returned, or null if
 470                   there is no return type.
 471
 472
 473                 * ContainerType
 474
 475                   Points to the Type (extracted from the
 476                   TypeContainer) that declares this body of
 477                   code.
 478
 479
 480                 * IsConstructor
 481
 482                   Whether this is generating code for a constructor
 483
 484                 * CurrentBlock
 485
 486                   Tracks the current block being generated.
 487
 488                 * ReturnLabel
 489
 490                   The location where return has to jump to return the
 491                   value.
 492
 493         A few variables are used to track the state for checking in
 494         for loops, or in try/catch statements:
 495
 496                 * InFinally
 497
 498                   Whether we are in a Finally block
 499
 500                 * InTry
 501
 502                   Whether we are in a Try block
 503
 504                 * InCatch
 505
 506                   Whether we are in a Catch block
 507
 508                 * InUnsafe
 509                   Whether we are inside an unsafe block
 510
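
             As an illustration of how a flag such as CheckState can
             steer code emission, the sketch below emits either `add'
             or `add.ovf' for an integer addition.  EmitContextSketch is
             a cut-down, hypothetical stand-in for EmitContext holding
             only an ILGenerator and the CheckState flag, and the
             AddEmitter class and the use of DynamicMethod are
             conveniences for the example, not something the compiler
             is described as doing.

```csharp
using System;
using System.Reflection.Emit;

// Hypothetical, reduced version of the state described above.
public class EmitContextSketch
{
    public readonly ILGenerator ig;
    public bool CheckState;   // true: emit overflow-checking opcodes

    public EmitContextSketch(ILGenerator ig, bool checkState)
    {
        this.ig = ig;
        CheckState = checkState;
    }
}

public static class AddEmitter
{
    // Builds `int Add (int a, int b)' under the given check state.
    public static Func<int, int, int> Compile(bool checkState)
    {
        DynamicMethod method = new DynamicMethod("Add", typeof(int),
            new Type[] { typeof(int), typeof(int) });
        EmitContextSketch ec =
            new EmitContextSketch(method.GetILGenerator(), checkState);

        ec.ig.Emit(OpCodes.Ldarg_0);
        ec.ig.Emit(OpCodes.Ldarg_1);
        // The context, not the expression, decides which opcode is used.
        ec.ig.Emit(ec.CheckState ? OpCodes.Add_Ovf : OpCodes.Add);
        ec.ig.Emit(OpCodes.Ret);

        return (Func<int, int, int>)
            method.CreateDelegate(typeof(Func<int, int, int>));
    }
}
```

             With the flag off, adding 1 to int.MaxValue wraps around;
             with it on, the same addition throws an OverflowException
             at runtime.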
 511 * Miscellaneous
 512
 513 ** Error Processing
 514
 515         Errors are reported during the various stages of the
 516         compilation process.  The compiler stops its processing if
 517         there are errors between the various phases.  This simplifies
 518         the code, because it is always safe to assume that the data
 519         structures that the compiler is operating on are
 520         consistent.
 521
 522         The error codes in the Mono C# compiler are the same as those
 523         found in the Microsoft C# compiler, with a few exceptions
 524         (where we report a few more errors; those are documented in
 525         mcs/errors/errors.txt).  The goal is to reduce confusion for
 526         users, and also to help us track the progress of the
 527         compiler in terms of the errors we report.
 528
 529         The Report class provides error and warning display functions,
 530         and also keeps an error count which is used to stop the
 531         compiler between the phases.
 532
 533         A couple of debugging tools are available here, and are useful
 534         when extending or fixing bugs in the compiler.  If the
 535         `--fatal' flag is passed to the compiler, the Report.Error
 536         routine will throw an exception.  This can be used to pinpoint
 537         the location of the bug and examine the variables around the
 538         error location.
 539
 540         Warnings can be turned into errors by using the `--werror'
 541         flag to the compiler.
 542
 543         The Report class also ignores warnings that have been
 544         specified on the command line with the `--nowarn' flag.
 545
 546         Finally, code in the compiler uses the global variable
 547         RootContext.WarningLevel in a few places to decide whether a
 548         warning is worth reporting to the user or not.
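
             The behaviour described in this section (an error count,
             `--fatal', `--werror' and `--nowarn') can be summarized in
             a small sketch.  ReportSketch is not the compiler's Report
             class: the member names, the message format and the
             in-memory Output list are assumptions chosen so the
             example is self-contained.

```csharp
using System;
using System.Collections.Generic;

public class ReportSketch
{
    public int Errors { get; private set; }
    public int Warnings { get; private set; }

    public bool Fatal;               // models --fatal
    public bool WarningsAreErrors;   // models --werror

    readonly HashSet<int> ignored = new HashSet<int>();   // models --nowarn
    public readonly List<string> Output = new List<string>();

    public void SetIgnoreWarning(int code) { ignored.Add(code); }

    public void Error(int code, string text)
    {
        string msg = string.Format("error CS{0:0000}: {1}", code, text);
        Output.Add(msg);
        Errors++;
        // Throwing gives a stack trace at the point the error was found.
        if (Fatal) throw new Exception(msg);
    }

    public void Warning(int code, string text)
    {
        if (ignored.Contains(code)) return;
        if (WarningsAreErrors) {
            Error(code, text);
            return;
        }
        Output.Add(string.Format("warning CS{0:0000}: {1}", code, text));
        Warnings++;
    }
}
```

             A driver built on this would check `Errors' after each
             phase and stop before starting the next one if it is not
             zero.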
 549