Using our friends the bitstring instructions (and some inspiration from DanB’s Hunter code), I reimplemented copymem and setmem to make them faster / more efficient:
// Copy a block of data from one area in memory to another.
void copymem (u8* dest, const u8* src, u16 num)
{
    asm(
        "mov r29,r1\n"      /* save r29 in r1 */
        "mov %0,r26\n"      /* dest bit offset */
        "mov %1,r27\n"      /* src bit offset */
        "mov %2,r28\n"      /* length in bits */
        "mov %3,r29\n"      /* word-aligned dest address */
        "mov %4,r30\n"      /* word-aligned src address */
        ".hword 0x7C0B\n"   /* movbsu */
        "mov r1,r29\n"      /* restore r29 */
        : /* output */
        : "r" (((u32)dest & 0x3) << 3),  /* byte offset -> bit offset (x8) */
          "r" (((u32)src & 0x3) << 3),
          "r" (num << 3),
          "r" ((u32)dest & ~0x3),
          "r" ((u32)src & ~0x3) /* input */
        : "r1", "r26", "r27", "r28", "r29", "r30", "memory" /* trashed */
    );
}

// Set each byte in a block of data to a given value.
void setmem (u8* dest, u8 src, u16 num)
{
    u16 head = (num < 4) ? num : 4;
    u16 i;

    /* there needs to be at least a full src word for this to work */
    for (i = 0; i < head; i++) {
        dest[i] = src;
    }

    if (num > 4) {
        /* bitstring copy where dest overlaps src */
        copymem(dest + 4, dest, num - 4);
    }
}
I haven’t exhaustively tested these, but they work well for me so far. YMMV.
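For anyone following along, here is a small host-side sketch of the operand split that copymem performs before issuing movbsu. It assumes, going by the register moves above, that r26/r27 take bit offsets, r28 takes the length in bits, and r29/r30 take word-aligned addresses. The `bitstring_args` type and `split_for_movbsu` name are made up for illustration and are not part of any library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the five values copymem loads into r26..r30. */
typedef struct {
    uint32_t dst_bit_off;  /* r26: bit offset inside the first dest word (0..31) */
    uint32_t src_bit_off;  /* r27: bit offset inside the first src word (0..31)  */
    uint32_t bit_len;      /* r28: length of the string in bits                   */
    uint32_t dst_word;     /* r29: word-aligned dest address                      */
    uint32_t src_word;     /* r30: word-aligned src address                       */
} bitstring_args;

/* Split two byte addresses and a byte count into bit-string operands. */
bitstring_args split_for_movbsu(uint32_t dest, uint32_t src, uint16_t num_bytes)
{
    bitstring_args a;
    a.dst_bit_off = (dest & 0x3u) << 3;       /* byte offset in word, times 8 */
    a.src_bit_off = (src & 0x3u) << 3;
    a.bit_len     = (uint32_t)num_bytes << 3; /* bytes to bits */
    a.dst_word    = dest & ~0x3u;             /* round down to a word boundary */
    a.src_word    = src & ~0x3u;
    return a;
}
```

Since every byte is 8 bits, a pointer that sits 3 bytes into a word starts at bit 24 of that word, and a 10-byte copy is an 80-bit string.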
Cool! Did you do any benchmarks of how much faster it is over the “standard” copymem?
I had a quick look at this a while ago and found that copying word aligned data from WRAM to VRM was approximately 2.5 times faster with [font=Courier]movbsu[/font] compared to a word copier (8.2 vs 20.5 cycles/byte). However, the word copier was quicker when run from the instruction cache (7.1 cycles/byte).
[font=Courier]copymem[/font] is a byte copier with a 16-bit loop variable so blitter’s bit string copier must be at least ten times faster.
Have you tested them versus my ASM versions? The C versions are really slow… just rewriting them in ASM (and of course copying words vs. bytes where possible) made them much faster.
IIRC, the bitstring functions aren’t very efficient for just plain copying memory… where it becomes nice is for doing operations (XOR, AND, etc) on strings of bits that aren’t necessarily byte/hword/word aligned. Even those I don’t think are much more efficient than doing them manually in ASM, but much nicer for sure.
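To illustrate the kind of job where the bit-string opcodes are convenient, here is a rough portable C model of XORing one unaligned bit string into another. It is only a bit-by-bit reference of the semantics, assuming bits are numbered LSB-first within 32-bit words; it says nothing about how the hardware schedules the work, and `get_bit` / `xor_bits` are illustrative names, not real library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read bit number `bit` of a string stored LSB-first in 32-bit words. */
uint32_t get_bit(const uint32_t *words, size_t bit)
{
    return (words[bit >> 5] >> (bit & 31)) & 1u;
}

/* XOR `len` bits of src (starting at bit src_off) into dst (starting at bit dst_off). */
void xor_bits(uint32_t *dst, size_t dst_off,
              const uint32_t *src, size_t src_off, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        size_t d = dst_off + i;
        dst[d >> 5] ^= get_bit(src, src_off + i) << (d & 31);
    }
}
```

Doing the same thing a word at a time by hand means shifting and masking at both ends of the string, which is the bookkeeping a single bit-string instruction hides.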
DogP
Heh, blitter’s blitters… I love it 😀
But seriously, nice work! I also would like to see some benchmarks comparing this and the C and asm versions under various conditions (with and without cache enabled, wait-state settings, etc.), but whether they’re faster, slower, or the same speed, it’s still great to have a nice, straightforward example that demonstrates the use of movbsu and, by extension, the other bit-string opcodes for those times DogP mentioned when it’s useful to work with large chunks of (especially non-aligned) data.
No, I haven’t done any quantifiable benchmarks with these new routines yet. My subjective estimation of “faster” is based entirely on the perceived performance of a direct-draw renderer I’m working on after switching from bitmasking/byte copying to the bitstring instructions. These actually sort of fell out of my work on that; honestly, I got tired of looking at the C versions when referencing the gccvb library. They made me cringe (no offense to whoever originally wrote them 😛 )
I did skim over the timing of the bitstring instructions in the NEC Architecture Manual, and at first glance it looks like my versions would be faster for large block transfers but not so much for smaller blocks. Again, I’m just going off a hunch; I haven’t yet run any tests to quantify performance. I’d be interested in testing against DogP’s ASM implementations. Where could I find these? In either case, agreed that both would perform better than the unoptimized C versions commonly used.
I may write a test app later this week testing under various conditions, such as instruction cache, wait-state, ROM vs. WRAM vs. VRM vs. DRAM (aside: are the VRM and DRAM shared? The official Nintendo VB dev manual details the VRM R/W bandwidth but curiously not the DRAM)
I think I posted them here a long time ago… and they should be in some of the libraries with various demos I’ve posted, but here they are as well. I grabbed these from one of my libraries, though I don’t remember if this was the latest one (I’ve probably got 100 different libraries :P).
I think it works, but IIRC I improved these functions a few years ago… I don’t know if these are them or not. They’re simple enough though… you could probably rewrite them with little effort.
Heh, from a quick glance, a better way than decrementing num to 0 would be to precompute the final pointer and compare current to final… that’d remove a cycle per word 🙂 . If you know the src/dest address ahead of time (basically inlining this code rather than calling the function), you could use the same incrementer to save a cycle as well. And of course if you knew you had a long copy, you could unroll it a little bit and require quad-word alignment or something to save some cycles.
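As a rough C sketch of the "compare against a precomputed final pointer" idea: the loop below has no separate counter to decrement, so each iteration only bumps the two pointers and does one compare. Whether a given compiler actually emits the tighter loop is not guaranteed, and in practice you would apply the same change to the asm loop. `copyword_endptr` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Copy `num` words, terminating on a pointer compare instead of a counter. */
void copyword_endptr(uint32_t *dest, const uint32_t *src, unsigned int num)
{
    const uint32_t *end = src + num;  /* computed once, before the loop */

    while (src != end) {
        *dest++ = *src++;
    }
}
```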
But as usual… there’s usually a “good enough” point 😉 .
void copymem (BYTE* dest, BYTE* src, unsigned int num)
{
    asm(
        "jr end%=\n"
        "loop%=:\n"
        "ld.b 0[%1],r10\n"  /* load source byte */
        "st.b r10,0[%0]\n"  /* store dest byte */
        "add 1,%0\n"        /* increment dest pointer */
        "add 1,%1\n"        /* increment src pointer */
        "add -1,%2\n"       /* decrement bytes remaining to 0 */
        "end%=:\n"
        "cmp 0,%2\n"        /* compare num with 0 to tell when done */
        "bgt loop%=\n"      /* loop if bytes remaining is greater than 0 */
        : /* No Output */
        : "r" (dest), "r" (src), "r" (num) /* Input */
        : "r10" /* regs used */
    );
}

void copyhword (HWORD* dest, HWORD* src, unsigned int num)
{
    asm(
        "jr end%=\n"
        "loop%=:\n"
        "ld.h 0[%1],r10\n"  /* load source hword */
        "st.h r10,0[%0]\n"  /* store dest hword */
        "add 2,%0\n"        /* increment dest pointer */
        "add 2,%1\n"        /* increment src pointer */
        "add -1,%2\n"       /* decrement hwords remaining to 0 */
        "end%=:\n"
        "cmp 0,%2\n"        /* compare num with 0 to tell when done */
        "bgt loop%=\n"      /* loop if hwords remaining is greater than 0 */
        : /* No Output */
        : "r" (dest), "r" (src), "r" (num) /* Input */
        : "r10" /* regs used */
    );
}

void copyword (WORD* dest, WORD* src, unsigned int num)
{
    asm(
        "jr end%=\n"
        "loop%=:\n"
        "ld.w 0[%1],r10\n"  /* load source word */
        "st.w r10,0[%0]\n"  /* store dest word */
        "add 4,%0\n"        /* increment dest pointer */
        "add 4,%1\n"        /* increment src pointer */
        "add -1,%2\n"       /* decrement words remaining to 0 */
        "end%=:\n"
        "cmp 0,%2\n"        /* compare num with 0 to tell when done */
        "bgt loop%=\n"      /* loop if words remaining is greater than 0 */
        : /* No Output */
        : "r" (dest), "r" (src), "r" (num) /* Input */
        : "r10" /* regs used */
    );
}

void setmem (BYTE* dest, BYTE src, unsigned int num)
{
    asm(
        "jr end%=\n"
        "loop%=:\n"
        "st.b %1,0[%0]\n"   /* store dest byte */
        "add 1,%0\n"        /* increment dest pointer */
        "add -1,%2\n"       /* decrement bytes remaining to 0 */
        "end%=:\n"
        "cmp 0,%2\n"        /* compare num with 0 to tell when done */
        "bgt loop%=\n"      /* loop if bytes remaining is greater than 0 */
        : /* No Output */
        : "r" (dest), "r" (src), "r" (num) /* Input */
        /* no regs used */
    );
}

void sethword (HWORD* dest, HWORD src, unsigned int num)
{
    asm(
        "jr end%=\n"
        "loop%=:\n"
        "st.h %1,0[%0]\n"   /* store dest hword */
        "add 2,%0\n"        /* increment dest pointer */
        "add -1,%2\n"       /* decrement hwords remaining to 0 */
        "end%=:\n"
        "cmp 0,%2\n"        /* compare num with 0 to tell when done */
        "bgt loop%=\n"      /* loop if hwords remaining is greater than 0 */
        : /* No Output */
        : "r" (dest), "r" (src), "r" (num) /* Input */
        /* no regs used */
    );
}

void setword (WORD* dest, WORD src, unsigned int num)
{
    asm(
        "jr end%=\n"
        "loop%=:\n"
        "st.w %1,0[%0]\n"   /* store dest word */
        "add 4,%0\n"        /* increment dest pointer */
        "add -1,%2\n"       /* decrement words remaining to 0 */
        "end%=:\n"
        "cmp 0,%2\n"        /* compare num with 0 to tell when done */
        "bgt loop%=\n"      /* loop if words remaining is greater than 0 */
        : /* No Output */
        : "r" (dest), "r" (src), "r" (num) /* Input */
        /* no regs used */
    );
}
DogP
When copying where the length is sufficiently large (for cache miss/loading and register spilling reasons) and (src_addr & 0x3) == (dest_addr & 0x3), movbsu should always be slower than a well-optimized loop running from cache, because movbsu has a dummy read slot (IIRC, at least on a stock V810) for every 32-bit quantity transferred.
Code as in: 8 ld.w followed by 8 st.w.
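In C terms, the shape of such a loop is roughly the sketch below: eight loads into locals, then eight stores, per iteration. This only illustrates the structure. A C compiler is free to reorder or spill these, so a real version would be written in asm with eight free registers, and it assumes word-aligned pointers and a length that is a multiple of eight words. `copy8words` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Copy `blocks` groups of eight words: all loads first, then all stores. */
void copy8words(uint32_t *dest, const uint32_t *src, unsigned int blocks)
{
    for (; blocks > 0; blocks--) {
        /* eight loads (ld.w in the asm version) */
        uint32_t w0 = src[0], w1 = src[1], w2 = src[2], w3 = src[3];
        uint32_t w4 = src[4], w5 = src[5], w6 = src[6], w7 = src[7];

        /* eight stores (st.w in the asm version) */
        dest[0] = w0; dest[1] = w1; dest[2] = w2; dest[3] = w3;
        dest[4] = w4; dest[5] = w5; dest[6] = w6; dest[7] = w7;

        src  += 8;
        dest += 8;
    }
}
```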
If you want to be (too) clever, you could perform calculations that have no, or minimal, memory accesses, and interleave those instructions between the store instructions to take advantage of the write buffer and “absorb” some of the wait states (in addition to somewhat avoiding use-immediately-after-set penalties in your calculations).