We need to switch to a new register allocator.
The current one is split into a global and a local register allocator.
The global one can assign only callee-saves registers and runs
on the tree-based internal representation: it assigns local variables
to hardware registers.
The local one runs on the linear representation on a per-basic-block
basis and assigns hard registers to virtual registers (which hold
temporary values during expression evaluation). It also deals
with the platform-specific issues (fixed registers, calling conventions).

Moving to a different register allocator will help solve some of the
performance issues introduced by the above split, make the register
allocator more easily portable and solve some of the issues generated by
dealing with trees.

The general design ideas are below.

The new allocator should have a global view of the whole method, so that it
can also assign variables to some of the volatile registers where possible,
even across basic blocks (this would improve performance).

The allocator would be driven by per-arch declarative data, so porting
should be easier: an architecture needs to specify register classes,
calling convention and instruction requirements (similar to the gcc code).
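
As an illustration of what such declarative data could look like, here is a
minimal sketch in C. Everything in it is an assumption made for the example:
the type names (RegClassDesc, InsnDesc, ArchDesc), the bitmask encoding and
the toy 8-register architecture are hypothetical and are not taken from the
existing code or from gcc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register classes; a real port would define its own set. */
typedef enum { RC_INT, RC_FLOAT, RC_COUNT } RegClass;

/* One entry per class: bit i set means hard register i is in the set. */
typedef struct {
    const char *name;
    uint32_t    allocatable;   /* registers the allocator may hand out */
    uint32_t    callee_saved;  /* subset preserved across calls */
} RegClassDesc;

/* Per-instruction requirements the allocator has to honor. */
typedef struct {
    const char *opcode;
    RegClass    dest_class;
    int         fixed_dest;    /* required hard register, or -1 for any */
    uint32_t    clobbers;      /* integer registers destroyed as a side effect */
} InsnDesc;

/* Everything an architecture would have to declare in this sketch. */
typedef struct {
    RegClassDesc    classes[RC_COUNT];
    int             int_return_reg;
    const InsnDesc *insns;
    size_t          n_insns;
} ArchDesc;

/* Toy architecture, invented for the example: integer registers 0..7,
 * with 4 and 5 reserved (stack/frame pointer) and 3, 6, 7 callee-saved. */
static const InsnDesc toy_insns[] = {
    { "add", RC_INT,    -1, 0x00 },
    { "div", RC_INT,     0, 0x04 },  /* result fixed in r0, clobbers r2 */
    { "fadd", RC_FLOAT, -1, 0x00 },
};

static const ArchDesc toy_arch = {
    .classes = {
        [RC_INT]   = { "int",   0xCF, 0xC8 },
        [RC_FLOAT] = { "float", 0x3F, 0x00 },
    },
    .int_return_reg = 0,
    .insns   = toy_insns,
    .n_insns = sizeof toy_insns / sizeof toy_insns[0],
};

/* Volatile (caller-saved) registers of a class: allocatable but not
 * callee-saved. These are the ones the current global allocator cannot use. */
static uint32_t volatile_regs(const ArchDesc *arch, RegClass cls)
{
    assert(cls >= 0 && cls < RC_COUNT);
    return arch->classes[cls].allocatable & ~arch->classes[cls].callee_saved;
}
```

With a table like this, the generic allocator never needs to name a specific
register: it only queries the masks, so a new port would mostly consist of
filling in one ArchDesc.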

The allocator should operate on the linear representation: this way it's
easier and faster to track usages accurately. We need to assign virtual
registers on a per-method basis instead of per basic block. We can assign 
virtual registers to variables, too. Note that since we fix the stack offset
of local vars only after this step (which happens after the burg rules are run),
some of the burg rules that try to optimize the code won't apply anymore:
the peephole code may need to be enhanced to do the optimizations instead.

We need to handle floating point registers in the global allocator, too.

The new allocator also needs to keep track precisely of which registers
contain references or managed pointers to allow us to move to a precise GC.

It may be worth using a single increasing set of integers for the virtual
registers, with the class of the register stored separately (unlike the
current local allocator, which keeps integer and fp registers separate).
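
A minimal sketch of this numbering scheme in C follows. The names (VRegTable,
vreg_new, vreg_class) are hypothetical and chosen only for the example; the
point is that integer and fp virtual registers share one index space and the
class is looked up in a side table.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { VREG_INT, VREG_FLOAT } VRegClass;

/* Side table: class_of[v] holds the class of virtual register v. */
typedef struct {
    VRegClass *class_of;
    int        count;
    int        capacity;
} VRegTable;

static void vreg_table_init(VRegTable *t)
{
    t->class_of = NULL;
    t->count = 0;
    t->capacity = 0;
}

/* Hand out the next virtual register number, whatever its class. */
static int vreg_new(VRegTable *t, VRegClass cls)
{
    if (t->count == t->capacity) {
        int new_cap = t->capacity ? t->capacity * 2 : 16;
        VRegClass *p = realloc(t->class_of, (size_t)new_cap * sizeof *p);
        if (p == NULL)
            abort();
        t->class_of = p;
        t->capacity = new_cap;
    }
    t->class_of[t->count] = cls;
    return t->count++;
}

static VRegClass vreg_class(const VRegTable *t, int v)
{
    assert(v >= 0 && v < t->count);
    return t->class_of[v];
}

static void vreg_table_free(VRegTable *t)
{
    free(t->class_of);
    vreg_table_init(t);
}
```

Since every virtual register has a unique number, per-register data such as
liveness ranges or spill slots can live in plain arrays indexed by that
number, without a separate array per class.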

Since this is a large task, we need to do it in steps as much as possible.
The first step is to run the register allocator _after_ the burg rules: this
requires a rewrite of the liveness code, too, to use linear indexes instead
of basic-block/tree number combinations. This can be done by:
*) allocating virtual regs to all the locals that can be register allocated
*) running the burg rules (some may require adjustments): the local virtual 
registers are assigned starting from global-virt-regs+1, instead of the current
hardware-regs+1, so we can tell apart global and local virt regs.
*) running the liveness code (or whatever else is needed) to allocate the
global registers
*) allocating the rest of the local variables to stack slots
*) continuing with the current local allocator
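
The numbering used in the steps above can be sketched as follows. This is
only an illustration: the RegNumbering type and its helpers are hypothetical,
and the sketch assumes 0-based half-open ranges, with the global virtual regs
placed directly after the hard registers and the local virtual regs after
those.

```c
#include <assert.h>

typedef enum { REG_HARD, REG_GLOBAL_VIRT, REG_LOCAL_VIRT } RegKind;

/* Register numbers are laid out as:
 *   [0, num_hard)                      hard registers
 *   [num_hard, num_hard + num_global)  global virt regs (one per local var)
 *   [num_hard + num_global, ...)       local virt regs (temporaries)      */
typedef struct {
    int num_hard;
    int num_global;
    int next_local;
} RegNumbering;

static void reg_numbering_init(RegNumbering *n, int num_hard, int num_locals)
{
    assert(num_hard >= 0 && num_locals >= 0);
    n->num_hard = num_hard;
    n->num_global = num_locals;
    n->next_local = num_hard + num_locals;
}

/* Global virt reg for the i-th register-allocatable local variable. */
static int global_vreg_for_local(const RegNumbering *n, int local_index)
{
    assert(local_index >= 0 && local_index < n->num_global);
    return n->num_hard + local_index;
}

/* Called while the burg rules run: temporaries start after the globals. */
static int new_local_vreg(RegNumbering *n)
{
    return n->next_local++;
}

/* Telling the three kinds apart is a pair of comparisons. */
static RegKind reg_kind(const RegNumbering *n, int reg)
{
    assert(reg >= 0 && reg < n->next_local);
    if (reg < n->num_hard)
        return REG_HARD;
    if (reg < n->num_hard + n->num_global)
        return REG_GLOBAL_VIRT;
    return REG_LOCAL_VIRT;
}
```

Global virt regs that do not get a hard register would then be given stack
slots, and the existing local allocator only ever sees REG_HARD and
REG_LOCAL_VIRT operands.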

This work could take 2-3 weeks.

The next step is to define the kind of declarative data an architecture needs,
assign virtual regs to all the registers and make the allocator
assign from the volatile registers, too.
Note that some of the code that is currently emitted in the arch-specific
code will need to be emitted as instructions that the reg allocator
can inspect: think of a method that returns the first argument, which is
received in a register: the current code copies it to either a local slot or
to a global reg in the prolog and copies it back to the return register
in the basic block, but since neither the regallocator nor the peephole code
knows about the prolog code, the first store cannot be optimized away.
The gcc code has some examples of how to specify register classes in a
declarative way.
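
To make the return-the-first-argument case concrete, here is a small sketch
in C of what becomes possible once the prolog copy is an ordinary move in the
linear representation. It assumes a hypothetical target where the first
argument and the return value share hard register 0; MoveInsn,
remove_self_moves and the register numbers are invented for the example.

```c
#include <assert.h>

enum { NUM_HARD_REGS = 8 };  /* numbers >= NUM_HARD_REGS are virtual regs */

typedef struct {
    int dest;
    int src;
} MoveInsn;

/* Map a register number to its hard register: hard registers map to
 * themselves, virtual registers go through the allocator's assignment. */
static int resolve(const int *assignment, int reg)
{
    assert(reg >= 0);
    return reg < NUM_HARD_REGS ? reg : assignment[reg - NUM_HARD_REGS];
}

/* Rewrite the moves in place with hard registers and drop those that end
 * up copying a register onto itself. Returns the number of moves kept. */
static int remove_self_moves(MoveInsn *code, int n, const int *assignment)
{
    int kept = 0;
    for (int i = 0; i < n; i++) {
        int d = resolve(assignment, code[i].dest);
        int s = resolve(assignment, code[i].src);
        if (d == s)
            continue;  /* redundant once registers are assigned */
        code[kept].dest = d;
        code[kept].src = s;
        kept++;
    }
    return kept;
}
```

For the method above, the code would be "v8 <- r0" (the former prolog copy)
followed by "r0 <- v8" (the return). If the allocator assigns v8 to r0, both
moves disappear; this is only possible because the first move is visible to
the allocator instead of being emitted blindly by the prolog code.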