* Proposal for the local register allocator

	The local register allocator deals with allocating registers
	for temporaries inside a single basic block, while the global 
	register allocator is concerned with method-wide allocation of 
	variables.
	The global register allocator uses callee-saved register for it's 
	purpouse so that there is no need to save and restore these registers
	at call sites.

	There are a number of issues the local allocator needs to deal with:
	*) some instructions expect operands in specific registers (for example
		the shl instruction on x86, or the call instruction with thiscall
		convention, or the equivalent call instructions on other architectures, 
		such as the need to put output registers in %oX on sparc)
	*) some instructions deliver results only in specific registers (for example
		the div instruction on x86, or the call instructionson on almost all
		the architectures).
	*) it needs to know what registers may be clobbered by an instruction
		(such as in a method call)
	*) it should avoid excessive reloads or stores to improve performance
	
	While which specific instructions have limitations is architecture-dependent,
	the problem shold be solved in an arch-independent way to reduce code duplication.
	The register allocator will be 'driven' by the arch-dependent code, but it's 
	implementation should be arch-independent.

	To improve the current local register allocator, we need to
	keep more state in it than the current setup that only keeps busy/free info.

	Possible state information is:

	free: the resgister is free to use and it doesn't contain useful info
	freeable: the register contains data loaded from a local (there is 
		also info about _which_ local it contains) as a result from previous
		instructions (like, there was a store from the register to the local)
	moveable: it contains live data that is needed in a following instruction, but
		the contents may be moved to a different register
	busy: the register contains live data and it is placed there because
		the following instructions need it exactly in that register
	allocated: the register is used by the global allocator

	The local register allocator will have the following interfaces:

	int get_register ();
		Searches for a register in the free state. If it doesn't find it,
		searches for a freeable register. Sets the status to moveable.
		Looking for a 'free' register before a freeable one should allow for
		removing a few redundant loads (though I'm still unsure if such
		things should be delegated entirely to the peephole pass).
	
	int get_register_force (int reg);
		Returns 'reg' if it is free or freeable. If it is moveable, it moves it 
		to another free or freeable register.
		Sets the status of 'reg' to busy.
	
	void set_register_freeable (int reg);
		Sets the status of 'reg' to freeable.
	
	void set_register_free (int reg);
		Sets the status of 'reg' to free.

	void will_clobber (int reg);
		Spills the register to the stack. Sets the status to freeable.
		After the clobbering has occurred, set the status to free.

	void register_unspill (int reg);
		Un-spills register reg and sets the status to moveable.

	FIXME: how is the 'local' information represented? Maybe a MonoInst* pointer.

	Note: the register allocator will insert instructions in the basic block
	during it's operation.

* Examples

	Given the tree (on x86 the right argument to shl needs to be in ecx):

	store (local1, shl (local1, call (some_arg)))

	At the start of the basic block, the registers are set to the free state.
	The sequence of instructions may be:
		instruction		register status -> [%eax %ecx %edx]
		start                                       free free free
		eax = load local1                           mov  free free
		/* call clobbers eax, ecx, edx */
		spill eax                                   free free free
		call                                        mov  free free
		/* now eax contains the right operand of the shl */
		mov %eax -> %ecx                            free busy free
		un-spill                                    mov  busy free
		shl %cl, %eax                               mov  free free
	
	The resulting x86 code is:
		mov $fffc(%ebp), %eax
		mov %eax, $fff0(%ebp)
		push some_arg
		call func
		mov %eax, %ecx
		mov $fff0(%ebp), %eax
		shl %cl, %eax
		
	Note that since shl could operate directly on memory, we could have:
	
		push some_arg
		call func
		mov %eax, %ecx
		shl %cl, $fffc(%ebp)

	The above example with loading the operand in a register is just to complicate
	the example and show that the algorithm should be able to handle it.

	Let's take another example with the this-call call convention (the first argument 
	is passed in %ecx).
	In this case, will_clobber() will be called only on %eax and %edx, while %ecx
	will be allocated with get_register_force ().
	Note: when a register is allocated with get_register_force(), it should be set
	to a different state as soon as possible.

	store (local1, shl (local1, this-call (local1)))

		instruction		register status -> [%eax %ecx %edx]
		start                                       free free free
		eax = load local1                           mov  free free
		/* force load in %ecx */
		ecx = load local1                           mov  busy free
		spill eax                                   free busy free
		call                                        mov  free free
		/* now eax contains the right operand of the shl */
		mov %eax -> %ecx                            free busy free
		un-spill                                    mov  busy free
		shl %cl, %eax                               mov  free free

	What happens when a register that we need to allocate with get_register_force ()
	contains an operand for the next instruction?

		instruction		register status -> [%eax %ecx %edx]
		eax = load local0                           mov  free free
		ecx = load local1                           mov  mov  free
		get_register_force (ecx) here.
		We have two options: 
			mov %ecx, %edx
		or:
			spill %ecx
		The first option is way better (and allows the peephole pass to
		just load the value in %edx directly, instead of loading first to %ecx).
		This doesn't work, though, if the instruction clobbers the %edx register
		(like in a this-call). So, we first need to clobber the registers
		(so the state of %ecx changes to freebale and there is no issue
		with get_register_force ()).
		What if an instruction both clobbers a register and requires it as 
		an operand? Lets' take the x86 idiv instruction as an example: it
		requires the dividend in edx:eax and returns the result in eax,
		with the modulus in edx.
	
	store (local1, div (local1, local2))
		
		instruction		register status -> [%eax %ecx %edx]
		eax = load local0                           mov  free free
		will_clobber eax, edx                       free mov  free
		force mov %ecx, %eax                        busy free free
		set %edx                                    busy free busy
		idiv                                        mov  free free
	
	Note: edx is set to free after idiv, because the modulus is not needed
	(if it was a rem, eax would have been freed).
	If we load the divisor before will_clobber(), we'll have to spill
	eax and reload it later. If we load it just after the idiv, there is no issue.
	In any case, the algorithm should give the correct results and allow the operation.
		
	Working recursively on the isntructions there shouldn't be huge issues
	with this algorithm (though, of course, it's not optimal and it may
	introduce excessive spills or register moves). The advantage over the current
	local reg allocator is that:
	1) the number of spills/moves would be smaller anyway
	2) a separate peephole pass could be able to eliminate reg moves
	3) we'll be able to remove the 'forced' spills we currently do with
		the return value of method calls

* Issues

	How to best integrate such a reg allocator with the burg stuff.

	Think about a call os sparc with two arguments: they got into %o0 and %o1
	and each of them sets the register as busy. But what if the values to put there
	are themselves the result of a call? %o0 is no problem, but for all the 
	next argument n the above algorithm would spill all the 0...n-1 registers...

* Papers

	More complex solutions to the local register allocator problem:
	http://dimacs.rutgers.edu/TechnicalReports/abstracts/1997/97-33.html

	Combining register allocation and instruction scheduling:
	http://citeseer.nj.nec.com/motwani95combining.html

	More on LRA euristics:
	http://citeseer.nj.nec.com/liberatore97hardness.html

	Linear-time optimal code scheduling for delayedload architectures
	http://www.cs.wisc.edu/~fischer/cs701.f01/inst.sched.ps.gz

	Precise Register Allocation for Irregular Architectures
	http://citeseer.nj.nec.com/kong98precise.html

	Allocate registers first to subtrees that need more of them.
	http://www.upb.de/cs/ag-kastens/compii/folien/comment401-409.2.pdf