So I’m over here working on my Virtual Fall project, and I came across a bottleneck of sorts. The way I’ve got my level data stored requires several reads from ROM, even though the data is pretty well compressed. What I found was that the more characters I was writing into background segment memory, the slower the program would get. The bad part was that I could only write 256 or so before frames started tearing, which really isn’t a whole lot…
Always one to fear the worst, my first hypothesis was that the VIP transfer rate was what was slowing me down. So I rewrote my scrolling code to use 32-bit writes instead of 16-bit ones, meaning I was transferring 2 characters per access instead of 1. Strangely, that didn’t help, but the exercise did reveal an interesting side effect. Outside the bounds of my level, I just had the code render empty characters rather than read them from ROM, and *those* were cruising along at full speed.
When I modified my CHR loader to just output numbers as a function of the input parameters, I found something very relieving: the VIP bus transfers have power to spare. When my CHR indexes were generated by the program, loading them into graphics memory was REALLY fast. It was only when loading them from ROM that the slow-down occurred.
There are a few things you can do to accelerate the processing of data when loading from ROM:
* Cache things in RAM when possible
* Configure the wait controller (address 0x02000024) to use 1 wait for ROM instead of the default 2. This is what I wound up doing, due to the bulk of data that has to be transferred. (See the sketch after this list.)
* Read 32 or 16 bits at a time from ROM instead of 8, and use adjacent data intelligently to avoid further accesses.
And the big kahuna:
* Copy a loading routine into RAM and execute it from there. Remember, every program instruction comes out of ROM, so if you can offload that data into faster memory, you’ll get a lot more out of it.
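For the wait controller, the change is a single register write. Here’s a minimal sketch; “WCR” is my own name for the register, and per my reading of the docs, bit 0 covers the cartridge ROM area while bit 1 covers the expansion area:

```c
#include <stdint.h>

/* "WCR" is my own label for the wait control register. Hardware
 * control registers in this range take 8-bit accesses. */
#define WCR (*(volatile uint8_t *)0x02000024)

void rom_one_wait(void)
{
    /* Bit 0 set = 1 wait for cartridge ROM accesses (clear = the
     * default 2 waits); bit 1 does the same for the expansion area. */
    WCR = 0x01;
}
```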
Guy Perfect wrote:
Copy a loading routine into RAM and execute it from there.
What if you use the instruction cache?
You’re absolutely right about that one, HorvatM. I thought about the instruction cache after I opened this thread, but I had to go to work, so I didn’t get to experiment with it. I’m home now, though, and this gives me an opportunity to add some cache control functions to libvue. (-:
And my results? Holy. Freaking. Crap.
The V810 has 1KB of instruction cache, designed specifically for executing small chunks of code very quickly, such as those used by loading routines. The “big kahuna” I mentioned, where the loading routine is itself loaded into RAM first? Well, the CPU instruction cache is a sort of “bigger kahuna”, because it bypasses the RAM step entirely and stores the loading routine right there on the processor chip.
After enabling the instruction cache around my level-loading code, I was able to achieve some remarkable scrolling speeds, in the neighborhood of 10-12 KB/s transferred from ROM to VIP memory before it started to slow down. This, of course, is in conjunction with the one-wait wait controller. Leaving the wait controller at its default of two waits, it was, as expected, a little less than that, maybe in the 6-8 KB/s range. Don’t take these numbers as cold hard facts, though. We’d need a proper profiling test before drawing any conclusions about transfer limits.
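For anyone who wants to try it, the knob is the CHCW system register (sysreg 24), written with LDSR. What I added to libvue boils down to something like this; the macro names are illustrative rather than the actual libvue API, and I’m assuming a gccvb-style toolchain:

```c
/* Illustrative macros, not the real libvue names. CHCW is system
 * register 24; bit 1 (ICE) turns the instruction cache on. r0 is
 * hard-wired to zero, so writing it disables the cache. */
#define CACHE_ENABLE()  asm volatile("mov 2, r1 \n ldsr r1, sr24" ::: "r1")
#define CACHE_DISABLE() asm volatile("ldsr r0, sr24")

void load_level_column(void)
{
    CACHE_ENABLE();
    /* ...tight ROM-to-VIP copy loop executes out of the cache... */
    CACHE_DISABLE();
}
```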
So yeah, HorvatM hit the nail on the head. The instruction cache is, hands down, the best way to accelerate your code when you need to load a lot of stuff from ROM.
Guy Perfect wrote:
The instruction cache is, hands down, the best way to accelerate your code when you need to load a lot of stuff from ROM.
Yes, the cache is definitely a good way to boost your speed (not necessarily just for loading from ROM, but any high performance loop)… though it is only 1KB (128 × 8-byte blocks), so you want to enable/disable it wisely. Just enabling it at boot and leaving it on isn’t likely to be very beneficial (cache contents are preserved across enable/disable, so enabling it just for loops where performance matters is ideal).
And yes, even though it’s a 32-bit CPU, (I’m almost positive) it runs in 16-bit bus mode to all RAM/ROM/peripherals… so 16-bit transfers are very beneficial over 8-bit transfers, but 32-bit transfers are less beneficial, since a 32-bit access still causes two bus transfers (though it’s handled automatically: two bus transfers, but only one 32-bit transfer instruction rather than two 16-bit transfer instructions).
DogP
DogP wrote:
Just enabling [the instruction cache] at boot and leaving it on isn’t likely to be very beneficial (cache contents are saved across enable/disable, so just enabling it for loops where performance matters is ideal).
You can also manually clear the cache. I recommend doing that whenever switching to a different block of code.
* Enable cache
* Loop 1
* Disable cache
(other code)
* Clear cache
* Enable cache
* Loop 2
* Disable cache
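In CHCW terms, the clear is a write with the ICC bit set and an entry count covering all 128 entries. Here’s a sketch of that pattern, with the caveat that the bit layout is from my reading of the V810 manual (ICC = bit 0, entry count in bits 8-19), so double-check it before relying on it:

```c
/* Same illustrative gccvb-style macros as before, plus a clear. */
#define CACHE_ENABLE()  asm volatile("mov 2, r1 \n ldsr r1, sr24" ::: "r1")
#define CACHE_DISABLE() asm volatile("ldsr r0, sr24")
#define CACHE_CLEAR()   do { \
        unsigned chcw = (128u << 8) | 1u; /* clear all 128 entries */ \
        asm volatile("ldsr %0, sr24" :: "r"(chcw)); \
    } while (0)

extern void loop_1(void), loop_2(void), other_code(void); /* hypothetical */

void run_loops(void)
{
    CACHE_ENABLE();
    loop_1();
    CACHE_DISABLE();

    other_code();

    CACHE_CLEAR();
    CACHE_ENABLE();
    loop_2();
    CACHE_DISABLE();
}
```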
DogP wrote:
[…] 32-bit transfers are less beneficial, since it still causes two bus transfers (though it’s handled automatically, so it’s two bus transfers, but only one 32-bit transfer instruction, rather than two 16-bit transfer instructions).
I’m not clear on the details of how it works, frankly. I know the VIP registers *must* be accessed with 16-bit read/write instructions, but VIP memory can be accessed with 8- or 32-bit instructions. Similarly, the VSU and hardware control registers *must* be accessed with 8-bit instructions.
I’m not sure why the distinction matters, but to me it seems like a 32-bit write is semantically different from two 16-bit writes. And even in the event they’re the same, it still saves execution on a per-instruction basis, so I really don’t see any down-sides to using them.
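To make the distinction concrete, here’s a tiny sketch (the addresses are from the VIP memory map; the names are just for the example):

```c
#include <stdint.h>

/* XPCTRL is a VIP register; BG segment 0 is VIP memory. */
#define VIP_XPCTRL   (*(volatile uint16_t *)0x0005F842)
#define BG_SEGMENT_0 ((volatile uint32_t *)0x00020000)

void access_width_demo(uint16_t ctrl, uint32_t two_cells)
{
    VIP_XPCTRL = ctrl;           /* registers demand 16-bit accesses   */
    BG_SEGMENT_0[0] = two_cells; /* memory allows 32-bit accesses: two
                                  * 16-bit BG map cells per write      */
}
```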
Guy Perfect wrote:
And even in the event they’re the same, it still saves execution on a per-instruction basis, so I really don’t see any down-sides to using them.
Exactly… I’m not saying there’s a down-side (unless it’s more difficult for you to work with because of strange byte alignment or something)… my point was just that a 16-bit transfer will (always) happen in the same amount of time as an 8-bit transfer (double the transfer rate). A 32-bit transfer is still going to require two bus accesses, though only one instruction to do it (so you save a bit, but not double like the 8 to 16). Of course this is all only partially true, since it depends on how the compiler handles it.
There are also other things that can affect speed, but without actually looking at the ASM (or writing the ASM from scratch), it’s gonna be hard to really optimize. Things like branches (3 cycles for a taken branch, 1 for not taken… so you want to program the typical case NOT to take a branch). Also, for pipelining, you want to avoid hazards… multiple consecutive loads are good, but multiple consecutive stores are bad. You can actually have your 32-bit load take only 1 cycle, but only if it’s done after an instruction that takes several cycles; you always save a cycle on consecutive loads, though. And a 32-bit store only takes one cycle for the first, but then 4 cycles for each following consecutive store. But if you’re writing the code in C, you’re kinda at the mercy of the compiler.
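To put the branch thing in concrete terms, something like this is what I mean (totally made-up example, and of course the compiler may flip it around anyway):

```c
#include <stdint.h>

/* Arrange the test so the common case does NOT take the branch:
 * per the numbers above, taken = 3 cycles, not taken = 1. The
 * function and its sentinel value are invented for illustration. */
int sum_until_sentinel(const uint8_t *p, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (p[i] == 0xFF)  /* rare: branch taken only at the sentinel */
            break;
        sum += p[i];       /* common case falls straight through */
    }
    return sum;
}
```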
IIRC, DanB (maybe?) had tried copying memory with bitstring functions, but I don’t think it was any faster. Obviously there’s no such thing as a free lunch, and while there should be less instruction overhead, IIRC the instructions take quite a few more cycles to execute than standard loads/stores. You might want to look into what he did, though.
So basically… I’d recommend at least writing an inline ASM copy function (IIRC, I posted a general purpose memcpy type one years ago with benchmarks)… but if you really want to get every last bit out of it, you’ll want to optimize that even further for your needs (unrolling the loop, consecutively loading up multiple registers, etc.). Of course the optimization parameters depend on data size, alignment, desired code size, etc.
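For the shape of the thing, here’s a plain C sketch (a real version would be inline ASM so the load/store order is actually guaranteed… here you’re still trusting the compiler):

```c
#include <stdint.h>

/* Unrolled word copy. Assumes len is a multiple of 16 bytes and
 * both pointers are 32-bit aligned. */
void copy_words_x4(const uint32_t *src, volatile uint32_t *dst,
                   uint32_t len)
{
    uint32_t words = len / 4;
    for (uint32_t i = 0; i < words; i += 4) {
        uint32_t a = src[i];     /* group the loads: consecutive      */
        uint32_t b = src[i + 1]; /* loads pipeline nicely on the      */
        uint32_t c = src[i + 2]; /* V810                              */
        uint32_t d = src[i + 3];
        dst[i]     = a;          /* stores after the first may stall, */
        dst[i + 1] = b;          /* but unrolling still cuts the loop */
        dst[i + 2] = c;          /* and branch overhead               */
        dst[i + 3] = d;
    }
}
```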
DogP
DogP wrote:
IIRC, DanB (maybe?) had tried copying memory with bitstring functions, but I don’t think it was any faster.
dasi tried them as well with a memmove() implementation and found them to actually be slower than the optimized LD.W/ST.W loop (presumably with the instruction cache enabled). It’s a novel idea, but the bit string instructions are both hard to use and not efficient, so I don’t see a point in using them.
IIRC (and I might not be :p), there is a small penalty for branching to an address not aligned to a 32-bit boundary.