* use a pool of MBState structures to speed up monoburg instead of using a
  mempool.
* the decode tables in the burg-generated code could use short instead of int
  (this should save about 1 KB)
* track the use of ESP, so that we can avoid the x86_lea in the epilog
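
A minimal sketch of the MBState pool idea from the first item, assuming a
simple free list. The MBState layout here is a hypothetical stand-in (the real
burg-generated structure has more fields), and mbstate_get/mbstate_put are
illustrative names, not existing functions:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the burg-generated MBState. */
typedef struct MBState MBState;
struct MBState {
    MBState *left, *right; /* child states; `left` doubles as the free-list link */
    int op;
};

/* Head of the list of recycled states (not thread-safe in this sketch). */
static MBState *free_list = NULL;

/* Take a zeroed state from the pool, falling back to malloc when it is empty. */
static MBState *mbstate_get(void)
{
    MBState *s = free_list;

    if (s) {
        free_list = s->left;
    } else {
        s = malloc(sizeof *s);
        if (!s)
            return NULL;
    }
    memset(s, 0, sizeof *s);
    return s;
}

/* Return a state to the pool so the next mbstate_get can reuse it. */
static void mbstate_put(MBState *s)
{
    s->left = free_list;
    free_list = s;
}
```

The possible win is that states are recycled across methods instead of being
carved out of a fresh mempool each time; whether that is measurable would need
profiling.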


Other Ideas:

* the ORP people avoid optimizations inside catch handlers, just to save
  memory. For example, they do not preallocate strings; instead they allocate
  them when the code is executed (like the --shared option). But only a few
  functions use catch handlers, so I consider this a minor issue.

* some performance critical functions should be inlined. These include:
	- mono_mempool_alloc and mono_mempool_alloc0
	- EnterCriticalSection and LeaveCriticalSection
	- TlsSetValue
	- mono_metadata_row_col
	- mono_g_hash_table_lookup
	- mono_domain_get

* if a function which involves locking is called from another function which
  acquires the same lock, it might be useful to create a separate _inner
  version of the function which does not re-acquire the lock. This is a perf
  win only if the function is called very often, like mono_get_method.
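
  A minimal sketch of the _inner pattern, assuming a pthread mutex in place
  of the runtime's own lock. The names loader_lock, lookup, lookup_inner and
  lookup_pair are hypothetical and only illustrate the shape:

```c
#include <pthread.h>

static pthread_mutex_t loader_lock = PTHREAD_MUTEX_INITIALIZER;
static int cache[16];

/* Does the real work. The caller must already hold loader_lock. */
static int lookup_inner(int token)
{
    return cache[token & 15];
}

/* Public entry point: takes the lock, then delegates to the _inner version. */
static int lookup(int token)
{
    int r;

    pthread_mutex_lock(&loader_lock);
    r = lookup_inner(token);
    pthread_mutex_unlock(&loader_lock);
    return r;
}

/* A caller that already holds the lock uses lookup_inner directly,
 * so the lock is taken once instead of three times. */
static int lookup_pair(int a, int b)
{
    int sum;

    pthread_mutex_lock(&loader_lock);
    sum = lookup_inner(a) + lookup_inner(b);
    pthread_mutex_unlock(&loader_lock);
    return sum;
}
```

  With a non-recursive mutex as used here, calling lookup from lookup_pair
  would deadlock; with a recursive lock it would work but pay for the extra
  acquire and release on every call.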

* we can avoid calls to class init trampolines if there are multiple calls to the
  same trampoline in the same basic block. See:

  http://bugzilla.ximian.com/show_bug.cgi?id=51096
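
  A rough sketch of the class init trampoline idea, assuming a simplified
  instruction record. Ins, klass_id and mark_redundant_class_inits are
  hypothetical and do not correspond to the JIT's real IR. Within one basic
  block the first call always executes before the later ones, so repeats for
  the same class can be marked for removal:

```c
#include <stddef.h>

/* Hypothetical, simplified instruction; not the real IR. */
typedef struct {
    int is_class_init; /* nonzero if this is a class init trampoline call */
    int klass_id;      /* identifies the class being initialized */
    int redundant;     /* set by the pass below */
} Ins;

#define MAX_SEEN 32

/* Scan one basic block and mark repeated class init calls.
 * Returns the number of calls marked as redundant. */
static int mark_redundant_class_inits(Ins *bb, size_t n)
{
    int seen[MAX_SEEN];
    size_t nseen = 0;
    int marked = 0;
    size_t i, j;

    for (i = 0; i < n; i++) {
        int found = 0;

        if (!bb[i].is_class_init)
            continue;
        for (j = 0; j < nseen; j++) {
            if (seen[j] == bb[i].klass_id) {
                found = 1;
                break;
            }
        }
        if (found) {
            bb[i].redundant = 1;
            marked++;
        } else if (nseen < MAX_SEEN) {
            seen[nseen++] = bb[i].klass_id;
        }
        /* If the table is full, the call is conservatively kept. */
    }
    return marked;
}
```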

Usability
---------

* Remove the descriptions of the various optimization flags, have an
  extra --help-optimizations flag for that list.

* Remove the various graph options, have a separate --help-graph for 
  that list.