Hey,
For those of us trying to squeeze out the last bit of CPU power to get a few more FPS, setting the ROM wait state to 1 does help performance a little bit. On my Mario Kart game, setting the wait state from 2 waits to 1 speeds it up by about 1/20th (measured unscientifically, but with consistent results).
Of course YMMV, but it should definitely help where there are lots of ROM accesses (like with many const LUTs, or copying lots of gfx out of ROM). ROM accesses should take 2/3 the amount of time that they do by default, but of course there are other limiting performance factors besides just ROM access. There shouldn’t be any way that this can hurt performance, and it should help avoid stalls in the pipeline. I do have cache enabled in the most processing intensive loop, which limits ROM access as well… so without cache, you may see a larger improvement than I’m seeing.
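To put rough numbers on that, here is a minimal sketch. It assumes the ROM-access part of the run time simply shrinks to 2/3 (as described above) and that nothing else changes. The function name and the `rom_fraction` parameter are made up for illustration.

```c
/* Overall speedup factor when a fraction `rom_fraction` (0.0 to 1.0) of the
   run time is spent on ROM accesses and that part shrinks to 2/3 of its
   former duration. Everything else is assumed to be unaffected. */
double speedup_from_rom_fraction(double rom_fraction)
{
    double new_time = (1.0 - rom_fraction) + rom_fraction * (2.0 / 3.0);
    return 1.0 / new_time;
}
```

Under those assumptions, code that spends about 15% of its time on ROM accesses comes out roughly 5% faster (1 / 0.95), which is in the same ballpark as the 1/20th figure above. Code that does nothing but ROM accesses tops out at 1.5x.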
One note… a single ROM wait state would mean that you must be using 100ns or faster ROMs… I believe the Flashboy ROM is at least that fast, but you should verify before making the assumption.
DogP
That would help me a lot since I don’t preload anything, and read chars and bgmaps on the fly… how do you change the ROM wait state?
jorgeche
It’s address 0x02000024, bit 0 (the lowest bit)… assuming you have it in your libgccvb, it should be HW_REGS[WCR]=1 (the default of 0 is 2 wait states).
Just to be complete, all bits are unused except the lowest 2 bits… the lowest bit is the ROM wait state, the second lowest is the expansion area wait state, so if anyone uses that in the future, you can set that with HW_REGS[WCR]=2 (and then of course be careful to | and &~ to not affect the other bit if you just want to change one or the other).
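A minimal sketch of the | and &~ approach, in case it helps. The mask names and helper functions are made up for illustration, and it assumes the register reads back its current value. The helpers take a pointer so they can be tried on a plain variable off-hardware.

```c
#include <stdint.h>

/* Bit masks for the wait control register, as described in this thread.
   The names are hypothetical, not taken from libgccvb. */
#define WCR_ROM_BIT 0x01u  /* lowest bit: ROM wait state (set = 1 wait) */
#define WCR_EXP_BIT 0x02u  /* second lowest bit: expansion area wait state */

/* Set the bits in `mask` without touching the other bits. */
void wcr_set_bits(volatile uint8_t *wcr, uint8_t mask)
{
    *wcr = (uint8_t)(*wcr | mask);
}

/* Clear the bits in `mask` without touching the other bits. */
void wcr_clear_bits(volatile uint8_t *wcr, uint8_t mask)
{
    *wcr = (uint8_t)(*wcr & (uint8_t)~mask);
}
```

On hardware, the call would presumably look like wcr_set_bits(&HW_REGS[WCR], WCR_ROM_BIT), assuming HW_REGS is a byte array in your libgccvb.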
DogP
Hey man it works great!… I’ve been doing a lot of optimizations to my code and algorithms, and although I’m sure all of them help, this tip makes a great difference; now my game is running really smoothly on hardware.
The main bottleneck was loading graphics on the fly and this makes the process a lot faster.
BTW, it works with FlashBoy.
Thanks for sharing!
jorgeche
Great! I’m glad it helped! And yeah… I talked to Richard today about the Flashboy and he said he used 70ns chips, so 1 wait state should be safe to use w/ the Flashboy. Do you have any guess on how much speed increase you’re seeing? Is it close to 1/20th like I’m seeing, or more, or less? I don’t have many games/apps that aren’t speed throttled, so it’s hard for me to get a good estimate on what can typically be expected. And do you enable cache around your heavy computational loops?
DogP
Yup, I enable cache, but didn’t notice major improvements with that, at least none visible in the FPS, since I didn’t do any proper measurement. But before setting the ROM wait to 1, I saw drops as low as 20 FPS in the most loaded part of my game (between the bridge and the first piranha plant) while the engine was loading that part; now it has never gone below 35.
I did some profiling on the heavy method which determines which objects must be loaded, and takes care of the whole process of creating the game entities and loading their graphics:
Stage_loadObjects(Stage);
Using a clock resolution of 1, I got the following peak durations for that method:
Default ROM wait: highest duration = 35 time units, lowest FPS = 28
ROM wait set to 1: highest duration = 25 time units, lowest FPS = 38
BTW I mainly enable cache while writing the param table, which is the most time-consuming procedure I can think of. I’m not sure the cache is working, but I suppose it doesn’t make any difference to activate it around a loop which has calls to other functions (frame functions for that matter).
jorgeche
Hmm, I’m not sure why, but if I set the clock resolution to 10, I get a more stable frame rate: it never goes over 50 (I’m setting the target FPS to 60, though)… but this way it never goes below 45!… and it runs really smoothly.
I’m capping frame rate using this method:
#define __TARGET_FPS 60

while (true) {
    currentTime = Clock_getTime(_clock);
    if (currentTime - lastTime > 1000 / __TARGET_FPS) {
        // save current time
        lastTime = currentTime;
        // process user's input
        Game_handleInput(this);
        ASSERT(this->stateMachine, "Game: no state machine");
        // update the game's logic
        StateMachine_update(this->stateMachine);
        // simulate collisions
        CollisionManager_update(this->collisionManager);
        // render the stage and its entities
        Stage_render(this->stage);
        // increase the frame rate
        FrameRate_increaseFPS(this->frameRate);
    }
}
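One possible explanation for the 50 FPS ceiling, as a sketch. It assumes Clock_getTime returns milliseconds and only advances in steps of the clock resolution (I haven't checked how the clock actually works, so treat that as a guess). Both helper names are made up.

```c
/* Smallest elapsed time (in ms) that passes the check
   `elapsed > 1000 / target_fps` when the clock only advances in
   steps of `resolution_ms` (assumed behaviour, see above). */
unsigned min_frame_interval_ms(unsigned target_fps, unsigned resolution_ms)
{
    unsigned threshold = 1000u / target_fps;   /* integer division: 16 for 60 */
    /* smallest multiple of resolution_ms strictly greater than threshold */
    return (threshold / resolution_ms + 1u) * resolution_ms;
}

/* Highest frame rate the cap can let through under the same assumption. */
unsigned max_capped_fps(unsigned target_fps, unsigned resolution_ms)
{
    return 1000u / min_frame_interval_ms(target_fps, resolution_ms);
}
```

With a target of 60, this gives a minimum interval of 17 ms at resolution 1 and 20 ms at resolution 10. That works out to a ceiling of 50 FPS in the second case, which would match the cap seen above if the clock really behaves this way.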
jorgeche
Oh… wow, so it really made a big improvement for you (no need to do any real measurements, I was just looking for an estimate).
For cache, it sounds like you’re not using it to its full potential… you want to make sure you enable it around things that have a high loop count (same code over and over). Using it around loops that call other functions is fine too (see note below), since the compiler doesn’t know about the cache… enabling it just tells the hardware to store any instructions that go through the CPU in cache in case they come up again, in which case they’ll be pulled from cache (no wait states) instead of from RAM or ROM (which both now have 1 wait state). Note that the VB only has an instruction cache (instructions executed) and not a data cache (data read/written).
I enable cache around my affine loop in Mario Kart, and I saw a huge improvement… but I have a loop that runs through 174 times, recalculating basically the same thing every time, except using the loop count to “tilt” the view.
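The general shape is something like the sketch below. It is not the actual Mario Kart code: cache_enable() and cache_disable() are host-side stand-ins for whatever cache macros your library provides, and compute_row() is a made-up placeholder for the per-row work.

```c
/* Hypothetical stand-ins so the sketch builds off-hardware; on the VB,
   swap these for your library's cache enable/disable macros. */
int cache_on = 0;
void cache_enable(void)  { cache_on = 1; }
void cache_disable(void) { cache_on = 0; }

#define AFFINE_ROWS 174

/* Placeholder for the real per-row calculation. */
int compute_row(int row)
{
    return row * 3 + 1;
}

/* Enable cache only around the hot loop, then turn it off again so
   unrelated code doesn't overwrite the cached loop instructions. */
int render_rows(int *out)
{
    int row, sum = 0;

    cache_enable();
    for (row = 0; row < AFFINE_ROWS; row++) {
        out[row] = compute_row(row);
        sum += out[row];
    }
    cache_disable();

    return sum;
}
```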
The VB cache is “direct mapped”, which basically means it acts as a hash. For this reason, you’ll likely get the best performance if all the code is close together (since it’s likely to be sequential), and definitely the best if it’s all within 1KB of memory, since any time bits 9 to 3 in the address are the same, you’ll be hitting the same block in cache… and if it’s not the same actual address as the last time those bits were hit, it’ll need to replace that block with the new data (i.e. 0x07000000 collides with 0x05000000, as well as 0x07000400, etc). So, if you have 0x300 bytes of code, plus a 0x100 byte function called from that code… if they’re consecutive in memory, it’ll work the best since they’re unlikely to collide, but if they’re separated, there’s a good chance that they’ll map onto some of the same cache blocks, which will cause cache misses inside the loop. Of course it’s worse if you’ve got more segmentation than that, like a 0x100 byte loop calling 16 0x30 byte functions… there’s definitely a good chance that there’ll be some collisions there.
You also can hurt performance by enabling cache when you shouldn’t be using it, since there is a penalty for cache misses. As you can guess, random jumping around would be a very bad use of the cache. If you have several places that cache would be beneficial, you can also save/restore the cache… although it’s hard to think of a practical reason to do this (since it’ll recreate the cache on its own)… maybe restore the cache during a wait to prevent the initial cache misses later? The cache does remain valid while it’s disabled, so it’s important not to keep it enabled longer than it needs to be, or the “good” instructions will get overwritten (and you’ll have cache misses on useless instructions), and the next time in the loop it’ll need to rebuild the entire cache.
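A quick way to check two addresses for a collision, as a sketch. It takes the description above at face value (bits 9 to 3 select the block, so 8-byte blocks and 128 of them); the function names are made up.

```c
#include <stdint.h>

/* Cache block selected by an address: bits 9 to 3. */
unsigned cache_index(uint32_t addr)
{
    return (unsigned)((addr >> 3) & 0x7Fu);
}

/* Two addresses collide when they select the same cache block but
   don't belong to the same 8-byte block of memory. */
int cache_collides(uint32_t a, uint32_t b)
{
    return cache_index(a) == cache_index(b) && (a >> 3) != (b >> 3);
}
```

With this, cache_collides(0x07000000, 0x05000000) and cache_collides(0x07000000, 0x07000400) both come out true, while two addresses within the same 1KB-aligned 1KB window never collide.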
I don’t think Reality Boy emulates the cache, but I’ve been thinking about adding it, just so I could profile my use of the cache.
DogP
Thanks for the explanation, I thought that the cache was for data, not instructions.
I’ve been activating it in some heavy functions and I can see some performance gains now. The problem is that I have no way to say for sure where it will work and where it won’t, because most of my intensive loops use late binding to call the proper methods based on the object being processed, and since the methods are in different translation units, the only way to be sure is to test it on the VB.
jorgeche
I’ve done a few timings and it looks like copying data from ROM to VRAM is roughly 20% quicker with waits set to 1.
dasi
Yeah, some caches are for data, others are for instructions (they’re typically separate caches)… the VB only has an instruction cache.
dasi: Copying to VRAM isn’t a very good measure of performance because the VIP has more wait states (IIRC some depend on what it’s doing). I don’t have my documentation with me right now, but I believe it’s variable between 2 and 5 waits. A better measure would be copying to WRAM, which has a fixed wait of 1. A 20% performance boost does sound reasonable though (theoretically, the ROM reads themselves should take 33% less time).
DogP
Maybe you guys will find this interesting: http://virtual.boy.home.comcast.net/VB/kartFPS.zip .
The number in the bottom right corner displays the number of frames per second. You can enable/disable cache (around the Affine loop), and increase/decrease the ROM wait state in the pause menu. Press start to bring up the pause menu, then:
Press on the right d-pad:
up to set 1 wait state
down to set 2 wait states
right to enable cache
left to disable cache
At startup it’s set to the best performance (1 wait state + cache enabled).
You can see that at the best performance, it does about 218 FPS. It drops to about 203 with 2 wait states + cache, and to about 136 with no cache and 1 wait state. It then drops to about 114 with no cache and 2 wait states.
So, you can see that proper use of cache makes a HUGE difference (~60%, and I don’t know for sure that I’m using it optimally either). 1 wait state does help, but of course it helps more in places where cache isn’t being used since executing from cache (which has no wait states) cuts down on the number of ROM accesses (~7% improvement with cache, ~19% improvement without).
But, I guess the lesson here is that making two simple changes increased the performance of my app by about 91%… so if you’re running out of computing power (or want more room to grow), maybe this will help.
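For anyone who wants to check the percentages, here is the arithmetic as a small sketch using the FPS figures above (the function name is made up):

```c
/* Percentage gain of `fast_fps` over `slow_fps`, rounded to the nearest
   whole percent. Assumes fast_fps >= slow_fps and slow_fps > 0. */
unsigned percent_gain(unsigned fast_fps, unsigned slow_fps)
{
    return (100u * (fast_fps - slow_fps) + slow_fps / 2u) / slow_fps;
}
```

Plugging in the numbers: percent_gain(218, 136) gives 60 (cache on vs. off at 1 wait state), percent_gain(218, 203) gives 7 (1 vs. 2 wait states with cache), percent_gain(136, 114) gives 19 (1 vs. 2 wait states without cache), and percent_gain(218, 114) gives 91 (both changes together).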
DogP