[JIT/GC] Optimize mono_gc_wbarrier_value_copy_bitmap.
mono_gc_wbarrier_value_copy_bitmap showed up when profiling Roslyn. This PR applies the following set of optimizations to it:
- Drop the bitmap: the extra branch in the loop costs more than the extra cardtable store itself
- Rename it to mono_gc_wbarrier_range_copy to reflect its new meaning
- Remove the JIT wrapper, since this function doesn't need it
- Use mono_gc_get_range_copy_func as a way to punch through all layers and get the actual implementation
- Move the implementation to sgen-cardtable.c where everything can be inlined
- Exploit the fact that we only need to mark the first card of a given range, so the card address calculation can be hoisted out of the loop
All of this speeds up a microbenchmark by 2x and reduces Roslyn wall-clock time by around 2%.
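
For readers unfamiliar with the shape of the change, here is a minimal, self-contained sketch of the loop described in the bullets above. It is not the code from sgen-cardtable.c: the `toy_` names, the toy heap and nursery arrays, and the 512-byte card size are all assumptions made for illustration. It shows the two ideas that matter: no bitmap test per slot, and a card address computed once before the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: 512-byte cards. */
#define TOY_CARD_BITS 9
#define TOY_HEAP_WORDS 8192
#define TOY_CARD_COUNT ((TOY_HEAP_WORDS * sizeof (void *)) >> TOY_CARD_BITS)

/* Hypothetical stand-ins for the major heap, the card table and the nursery. */
static void *toy_heap [TOY_HEAP_WORDS];
static uint8_t toy_cardtable [TOY_CARD_COUNT];
static unsigned char toy_nursery [1024];

/* Map an address inside toy_heap to its card. */
static uint8_t *
toy_card_for (const void *addr)
{
	return &toy_cardtable [((uintptr_t)addr - (uintptr_t)toy_heap) >> TOY_CARD_BITS];
}

static int
toy_ptr_in_nursery (const void *p)
{
	uintptr_t a = (uintptr_t)p;
	uintptr_t start = (uintptr_t)toy_nursery;
	return a >= start && a < start + sizeof (toy_nursery);
}

/*
 * Copy `size` bytes of pointer-sized slots from src to dest, which must
 * point into toy_heap. Every slot is treated as a reference (no bitmap),
 * and only the card of the first destination slot is marked, following
 * the PR description.
 */
static void
toy_wbarrier_range_copy (void *dest_, const void *src_, size_t size)
{
	void **dest = (void **)dest_;
	void *const *src = (void *const *)src_;
	size_t count = size / sizeof (void *);
	size_t i;

	assert (size % sizeof (void *) == 0);

	/* Hoisted: the card address is computed once, not per slot. */
	uint8_t *card = toy_card_for (dest);

	for (i = 0; i < count; ++i) {
		void *value = src [i];
		dest [i] = value;
		if (toy_ptr_in_nursery (value))
			*card = 1;
	}
}
```

With the bitmap gone, a non-reference slot that happens to look like a nursery pointer can cause a spurious card mark. That is the "extra cardtable store" the first bullet accepts in exchange for removing a branch from the loop.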