My PCM mixer runs fine on its own but slows execution down to a crawl when paired with music and rendering, so I’m looking to optimize it wherever I can, including dropping down and rewriting parts of it in assembly where practical. When examining the output of building with both -Os and -O3 in gccVB 4, though, I noticed the following peculiar pattern when accessing variables kept in WRAM:
movhi hi(_masterMusVolume),r0,r10
ld.b lo(_masterMusVolume)[r10],r14
movhi hi(_noiseVolume),r0,r27
movhi hi(_musDataStart),r0,r10
movhi hi(_freeVSUChannelCur),r0,r25
movhi hi(_noiseVelocity),r0,r26
movhi hi(_noiseLeft),r0,r29
movhi hi(_noiseRight),r0,r31
ld.b lo(_noiseVolume)[r27],r11
ld.w lo(_musDataStart)[r10],r18
movhi hi(_vbTranspose),r0,r10
ld.w lo(_vbTranspose)[r10],r10
ld.b lo(_freeVSUChannelCur)[r25],r17
ld.b lo(_noiseVelocity)[r26],r23
ld.b lo(_noiseLeft)[r29],r22
ld.b lo(_noiseRight)[r31],r12
Since WRAM on the VB is located at 0x05000000 and therefore aligned on a 64KB boundary, wouldn’t it be more economical to, say, movhi hi(_WRAMStart),r0,r10 just once and then ld lo(_variable)[r10] subsequently for each WRAM access? Why doesn’t the code do this or something similar?
I could rewrite this particular routine in assembly (it runs over 8000 times a second via the timer interrupt, so it needs to be as fast as possible) but this kind of code is generated all over the place whenever WRAM is read or written, so that to me just seems like putting a band-aid over a larger problem. Is this a bug in gccVB or is there a way to coax the compiler into generating more efficient code here?
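One thing that might be worth trying from the C side is grouping the mixer globals into a single struct, so the compiler has only one symbol to take the address of and can, in principle, reach every field by displacement from one base register. This is a minimal sketch, not a confirmed fix. The field types are guesses based on the ld.b/ld.w widths in the listing above. MixerState, mixer and mixer_noise_sum are hypothetical names, and the arithmetic is just a placeholder. Whether gccVB actually emits a single movhi for this would need to be checked in the generated assembly.

```c
#include <stdint.h>

/* Hypothetical grouping of the globals seen in the listing.
   Types are assumptions: ld.b -> int8_t, ld.w -> 32-bit. */
typedef struct {
    uint32_t musDataStart;      /* loaded with ld.w */
    int32_t  vbTranspose;       /* loaded with ld.w */
    int8_t   masterMusVolume;   /* the rest are loaded with ld.b */
    int8_t   noiseVolume;
    int8_t   noiseVelocity;
    int8_t   noiseLeft;
    int8_t   noiseRight;
    int8_t   freeVSUChannelCur;
} MixerState;

MixerState mixer;

/* Placeholder computation: the point is only that every access
   goes through one base pointer instead of separate symbols. */
int32_t mixer_noise_sum(void)
{
    const MixerState *m = &mixer;   /* one address, computed once */
    int32_t left  = m->noiseLeft  * m->noiseVolume;
    int32_t right = m->noiseRight * m->noiseVolume;
    return (left + right) * m->masterMusVolume;
}
```

If this works as hoped, each field access becomes a single ld.b or ld.w with a small displacement off the register holding &mixer.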
Take a look at this:
http://www.planetvb.com/modules/newbb/viewtopic.php?post_id=17121
Thanks M.K., that should work for WRAM accesses.
Upon closer inspection I see that this pattern is also applied to other areas of memory. I found a simple example using hardware registers:
movhi 0x200, r0, r10
movea 0x20, r10, r10
ld.b [r10], r11
mov 5, r12
andi 0xFF, r11, r11
ori 0x10, r11, r11
st.b r11, [r10]
movhi 0x200, r0, r11
movea 0x18, r11, r11
st.b r12, [r11]
movhi 0x200, r0, r11
movea 0x1C, r11, r11
st.b r0, [r11]
This is the assembly generated with -Os for the following C:
HW_REGS[TCR] |= TIMER_20US;
HW_REGS[TLR] = 0x05;
HW_REGS[THR] = 0x00;
The instruction ‘movhi 0x200, r0, r11’ is emitted twice (and the same upper half is loaded a third time into r10), even though all three registers share the upper half 0x0200 and nothing in between requires it to be reloaded, so the repeats are unnecessary. This is with -Os, which optimizes for code size. Is this something that can be worked around (without writing it by hand in asm), or is it a bug in GCC/v810?
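One workaround to experiment with is passing the register base around as an ordinary pointer, so the compiler sees a single base value plus three small constant offsets rather than three unrelated absolute addresses. A rough sketch follows. The offsets 0x18, 0x1C and 0x20 and the value 0x10 are read off the listing above, while timer_setup is a hypothetical helper. On hardware it would be called as timer_setup(HW_REGS), assuming HW_REGS is a byte pointer. I haven't confirmed that gccVB turns this into displacement-addressed stores, so the output needs checking.

```c
#include <stdint.h>

/* Offsets and bit value taken from the disassembly above. */
#define TLR        0x18
#define THR        0x1C
#define TCR        0x20
#define TIMER_20US 0x10

/* Hypothetical helper: all three accesses share one base pointer,
   which gives the compiler the chance to keep it in one register. */
void timer_setup(volatile uint8_t *regs)
{
    regs[TCR] |= TIMER_20US;  /* read-modify-write of the control register */
    regs[TLR]  = 0x05;        /* low byte of the reload value */
    regs[THR]  = 0x00;        /* high byte of the reload value */
}
```

The ideal result would be one movhi for the base, followed by loads and stores of the form st.b r12, 0x18[r10].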
Took a look tonight at the gcc 4.4.2 patch that’s floating out there, and I think I might have an idea of what’s causing this: in output_move_single…
return "movhi hi(%1),%.,%0\n\tmovea lo(%1),%0,%0";
That line appears several times, once for each case where a 32-bit quantity needs to be loaded, and it always emits the two instructions as a fixed pair. Because the pair is output as a single string, the compiler never sees the movhi as a separate instruction and so never gets a chance to optimize away a redundant one. To me it looks like either a bug or a case the port simply doesn’t bother optimizing, by design. I’m leaning toward the former, as it’s clearly suboptimal code. Does anybody with knowledge of GCC have any ideas on how to fix it?