Hi, I’m new to the site … until yesterday I didn’t even know that anyone had updated the GCC V810 patches past the original GCC 2.95 patches that were (AFAIK) done by a bunch of Japanese guys in 2000 (for the PC-FX, I believe).
Anyway … I’m trying to “open-up” the PC-FX for development and have done my own update of the old 2.95 patches to binutils 2.23.2 and GCC 4.7.4, in order to get a “modern” C compiler with C99 capability, and with nearly-all of C11.
It occurs to me that you guys over here with a love for the VirtualBoy may be interested in the work that I’ve done, and that you might be a larger group to provide a test-bed, rather than the PC-FX community, where I’m pretty-much the only assembler-capable developer.
I’ve had a quick “chat” with KR155E, and with his help, I’ve found the following threads …
“experimental gcc4 patches”
http://www.planetvb.com/modules/newbb/viewtopic.php?topic_id=3883
“gccVB optimization options and assembly code”
http://www.planetvb.com/modules/newbb/viewtopic.php?topic_id=5055
“Compiling gccvb 4.4.2 under Cygwin”
http://www.planetvb.com/modules/newbb/viewtopic.php?topic_id=5328
From what-I-can-see, I don’t think that my patches are experiencing any of the problems that have been reported in those threads, except for the “movhi optimization” issue … which isn’t really an “issue” as such, it’s because the compiler doesn’t know where the labels are going to resolve to, so it has to generate full 32-bit loads.
That may be something that the linker can resolve with “whole-program-optimization”, but I’ve not been brave-enough to even try to compile the toolchain with that feature enabled.
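To make the “full 32-bit load” concrete, here is a minimal host-side sketch in C of how an address is split for a MOVHI/MOVEA pair. It assumes the usual V810 behaviour that MOVHI adds its immediate shifted left by 16 and MOVEA adds a sign-extended 16-bit immediate; the helper names (hi16, lo16, movhi_movea) are made up for illustration and are not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a 16-bit value held in the low half of a 32-bit word. */
static uint32_t sext16(uint32_t x)
{
    assert(x <= 0xFFFFu);
    return (x ^ 0x8000u) - 0x8000u;   /* wraps modulo 2^32, as the CPU does */
}

/* Low half: the immediate that MOVEA will sign-extend and add. */
static uint16_t lo16(uint32_t addr)
{
    return (uint16_t)(addr & 0xFFFFu);
}

/* High half for MOVHI, bumped by one when the low half reads as negative. */
static uint16_t hi16(uint32_t addr)
{
    return (uint16_t)((addr + 0x8000u) >> 16);
}

/* Model of "movhi hi,r0,rX" followed by "movea lo,rX,rX". */
static uint32_t movhi_movea(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) + sext16(lo);
}
```

Because the assembler cannot know the final address of a label before linking, it has to emit both instructions and leave the two halves for the linker to fill in.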
The new patches are built with mingw64/msys2, and not cygwin, so they’re Windows-native programs.
In trying to clean-up the code so that I could understand (and debug) what was going on, I removed a bunch of pointless options that don’t make any sense to the PC-FX (or VirtualBoy), such as the long-call, long-jump, GHS, and app-regs. Hopefully nobody here cares about those.
I have a version that uses the old GCC 2.95 ABI (with the 16-bytes of stack reserved for r6-r9), and I just completed the transition to the new GCC ABI from 2010 that removes that redundant stack space.
My next task is to change the ABI even more so that I can get useful stack-frames and actually implement a working backtrace function for debugging.
So … I have a couple of technical questions for the assembly-capable developers here.
I’ve not seen the VirtualBoy SDK (and don’t particularly want to wade through it) … but are the V810’s registers R2 and R5 actually used in whatever VirtualBoy libraries you guys use?
Does the VirtualBoy have single-cycle RAM, or does it have wait-states that slow down RAM access?
Are you using any Nintendo binary-only libraries, or can you re-assemble/re-compile whatever libraries/engines that you’re using?
The Good:
The new stack-frame layout is implemented (slightly changed from my original proposal), and R2 is now the permanent-frame-pointer instead of the compiler just using R29 whenever a frame-pointer is needed.
*****************************
GCC 1999-ABI V850 STACK FRAME

CALLER
        incoming-arg0
ap->    16-bytes-reserved
CALLEE
        saved-lp
        saved-??
fp->    saved-fp
        local-variables
        outgoing-arg?
        outgoing-arg0
sp->    16-bytes-reserved
*****************************
GCC 2016-ABI V810 STACK FRAME

CALLER
fp-> ap-> incoming-arg0
CALLEE
        saved-fp
        saved-lp
        saved-??
        local-variables
        outgoing-arg?
        outgoing-arg0
sp->    4-bytes-reserved
*****************************
“-mprolog-function” is working, but I’ve stopped it from being automatically-enabled whenever any optimization is requested.
The new stack frame layout reduces the code-size of the prolog functions so that there’s a good chance that they’ll stay in the V810’s instruction cache more often. Note: the new prolog functions always save the FP and the LP when they’re used.
A stack backtrace is now possible when either “-fno-omit-frame-pointer” or “-mprolog-function” is used.
Any C “leaf” functions (i.e. functions that don’t call other functions) will omit the prolog function if they don’t destroy any callee-saved register, and so small-fast-utility-code will still run as-fast-as-possible.
The NEC-standard register conventions are still the same, except for R2 now being the FP.
Any assembly language code that reads arguments off the stack will need to subtract 16 from its offsets.
The Bad:
Any C “interrupt-handler” functions are probably broken at the moment, until I get around to fixing them.
Does anyone actually write interrupt-handlers in C???
The compiler generates some pretty slow register-saving code for them, so I sort-of assume that folks just write them in assembly. Am I wrong?
The Future (long term):
I’d like to add a few compiler intrinsics for some of the V810 opcodes, particularly the string opcodes and the in/out opcodes. That would allow the compiler to easily in-line some stuff that people have to drop into assembly to do.
It might also be worth considering changing the standard register usage so that R26-R29 are not callee-saved registers, and so spare the compiler from having to save them on the stack whenever someone wants to use a string opcode. But doing so would break all current assembly-language code, and I suspect that people wouldn’t want that. “Yes”, the change in stack-offset in the new ABI also breaks things … but that’s an easy thing to find/fix. Changing ALL the registers would be a much more complicated thing to fix.
Hi Elmer
Welcome, and great work.
Does anyone actually write interrupt-handlers in C???
Yes, using the interrupt_handler function attribute.
The compiler generates some pretty slow register-saving code for them, so I sort-of assume that folks just write them in assembly. Am I wrong?
From what I remember, when a function is declared with the interrupt_handler attribute the compiler generates save_interrupt and restore_interrupt prolog/epilogs, which save/restore only four or five registers. What have you observed?
dasi
dasi wrote:
Welcome, and great work.
Thanks!
From what I remember, when a function is declared with the interrupt_handler attribute the compiler generates save_interrupt and restore_interrupt prolog/epilogs, which save/restore only four or five registers. What have you observed?
Yep, there are those calls, and if your function actually calls anything else that isn’t inlined, then the compiler has to save the LP … and that triggers the generation of calls to save_all_interrupt/restore_all_interrupt which save all the other registers.
Now, please remember that I have no idea about VirtualBoy programming, and that I’m more used to consoles that produce traditional TV-output … but in that world, you’ve got the hblank interrupt … which needs to be blindingly fast, and you’ve got the vblank interrupt … which usually does a lot of stuff and calls a lot of different things.
I can see that having the save_all_interrupt and restore_all_interrupt functions doesn’t really hurt when the compiler needs to use them, because you’re going to take a pretty big hit anyway with all of those registers.
But I don’t really understand the use of the basic save_interrupt and restore_interrupt functions … they just seem to slow down the interrupt-handling, and don’t save very much code space (just how many different interrupt-handler functions are used in a single program that cause you to be worried about a few bytes???).
Anyway … whatever … I guess that I should fix the compiler’s handling of the prolog/epilog expansion for the interrupt_handler functions.
FYI, I use a custom-modified version of Mednafen for debugging … basically it just uses a larger font in the debugger so that it’s more readable for folks with tired eyes.
It only supports a few platforms (PC Engine, PC-FX, and now VirtualBoy) … but if anyone is interested, I can add a link to it.
Here is the main VirtualBoy debugger screen …
Here is the VirtualBoy memory editor screen …
dasi wrote:
Yes, using the interrupt_handler function attribute.
I fixed the “interrupt_handler” to where it’s working again, although I’m not using the helper-functions anymore, because I really can’t see the point.
I could make the code a tiny bit smarter … but IMHO it’s already a little bit better than GCC’s V850 code, so any further work on it can wait.
************************************

volatile int __attribute__ ((zda)) zda_frame_count = 0;

__attribute__ ((interrupt_handler)) void my_irq1 (void)
{
  for (int i = 0; i < 100; i++)
    zda_frame_count++;
}

_my_irq1:
        add     -4,sp
        st.w    r1,0[sp]
        add     -8,sp
        st.w    r10,0[sp]
        movea   100,r0,r10
        st.w    r11,4[sp]
.L7:
        ld.w    zdaoff(_zda_frame_count)[r0],r11
        add     -1,r10
        add     1,r11
        st.w    r11,zdaoff(_zda_frame_count)[r0]
        cmp     0,r10
        bne     .L7
        ld.w    0[sp],r10
        ld.w    4[sp],r11
        add     8,sp
        ld.w    0[sp],r1
        add     4,sp
        reti

************************************

volatile int sda_frame_count = 0;

__attribute__ ((noinline)) void increment_sda_frame_count (void)
{
  sda_frame_count++;
}

__attribute__ ((interrupt_handler)) void my_irq2 (void)
{
  for (int i = 0; i < 100; i++)
    increment_sda_frame_count();
}

_increment_sda_frame_count:
        ld.w    sdaoff(_sda_frame_count)[gp],r10
        add     1,r10
        st.w    r10,sdaoff(_sda_frame_count)[gp]
        jmp     [r31]

_my_irq2:
        add     -4,sp
        st.w    r1,0[sp]
        mov     sp,r1
        addi    -72,sp,sp
        st.w    r29,-12[r1]
        st.w    fp,-4[r1]
        movea   100,r0,r29
        mov     r1,fp
        st.w    r6,-72[r1]
        st.w    r7,-68[r1]
        st.w    r8,-64[r1]
        st.w    r9,-60[r1]
        st.w    r10,-56[r1]
        st.w    r11,-52[r1]
        st.w    r12,-48[r1]
        st.w    r13,-44[r1]
        st.w    r14,-40[r1]
        st.w    r15,-36[r1]
        st.w    r16,-32[r1]
        st.w    r17,-28[r1]
        st.w    r18,-24[r1]
        st.w    r19,-20[r1]
        st.w    r30,-16[r1]
        st.w    lp,-8[r1]
.L3:
        add     -1,r29
        jal     _increment_sda_frame_count
        cmp     0,r29
        bne     .L3
        ld.w    -4[fp],r1
        ld.w    -72[fp],r6
        ld.w    -68[fp],r7
        ld.w    -64[fp],r8
        ld.w    -60[fp],r9
        ld.w    -56[fp],r10
        ld.w    -52[fp],r11
        ld.w    -48[fp],r12
        ld.w    -44[fp],r13
        ld.w    -40[fp],r14
        ld.w    -36[fp],r15
        ld.w    -32[fp],r16
        ld.w    -28[fp],r17
        ld.w    -24[fp],r18
        ld.w    -20[fp],r19
        ld.w    -16[fp],r30
        ld.w    -12[fp],r29
        ld.w    -8[fp],lp
        mov     fp,sp
        mov     r1,fp
        ld.w    0[sp],r1
        add     4,sp
        reti

************************************
I’m going to be perfectly honest: I’ve lost track of the number of GCC versions for VB there are lol. Perhaps we should document them somewhere in a sticky thread?
I can think of 4 offhand:
* 2.95 that’s existed for ages
* blitter’s 4.4
* Dasi’s 4.7
* Elmer’s 4.7
IIRC, the startup code is more or less the same between all but the last one (in fact, I believe blitter’s even reuses the 2.95 file for this and relocations).
Last year, I started my own port, that didn’t get far b/c of real life. I would be interested in trying an LLVM port tho at some point, even if it has already been done. V810 is one of the only CPUs where I could reasonably succeed in such a port.
cr1901 wrote:
I’m going to be perfectly honest: I’ve lost track of the number of GCC versions for VB there are lol.
From my POV, it’s all about the “dialect” of C that you want to program in.
GCC 2.95 is C89/ANSI-C.
GCC 4.7 is C99 with most of C11.
You can also see from the examples of the generated-code that I’ve shown, that GCC 4.7 is a little smarter than GCC 2.95 about moving some loop-invariant calculations outside the loop itself for speed.
I don’t know if the GCC 2.95 version has any problems, but it has been around for a long time, and it’s a “classic” good version of GCC that was used for a lot of game development in the early 2000s (with various patches).
OTOH … It’s getting really, really hard to compile a working GCC 2.95 anymore because modern linux toolchains barf on some of the early-GCC-specific code that’s in there. You pretty much need to find an old GCC 3.x compiler from somewhere.
The GCC 4.4 port seems to have a few problems with it, and so (by his own admission) does Dasi’s GCC 4.7 port.
The GCC 4.4 port is also producing some pretty inefficient code for some reason.
My GCC 4.7 port hasn’t received enough widespread testing yet to see what bugs I’ve introduced … but Alex Marshall’s “liberis” examples all work properly, and another user has ported a simple shoot-em-up to the PC-FX with no apparent problems.
IIRC, the startup code is more or less the same between all but the last one (in fact, I believe blitter’s even reuses the 2.95 file for this and relocations).
As I mentioned before, my interest is in the PC-FX, and not the VirtualBoy, and so the linker scripts and the startup code have been tailored to that platform.
It shouldn’t be hard to create a “VirtualBoy” patch that changes them into something that works better for the VB.
BTW … I did add the VirtualBoy’s custom Nintendo instructions to binutils.
If someone is interested in being the “goto guy” for a VirtualBoy version of my patches, then I’d love to hear it.
AFAIK they’re stable and working, and I’m not planning on doing anything more to them for a while because I’ve got other stuff that needs to be done.
My GCC 4.7 port hasn’t received enough widespread testing yet to see what bugs I’ve introduced … but Alex Marshall’s “liberis” examples all work properly, and another user has ported a simple shoot-em-up to the PC-FX with no apparent problems.
. . .
If someone is interested in being the “goto guy” for a VirtualBoy version of my patches, then I’d love to hear it.
I’d be happy to help with that and put a build together for testing. There are a few reasonably large Virtual Boy projects around which should give your patches a good workout. 🙂
dasi
ElmerPCFX wrote:
OTOH … It’s getting really, really hard to compile a working GCC 2.95 anymore because modern linux toolchains barf on some of the early-GCC-specific code that’s in there. You pretty much need to find an old GCC 3.x compiler from somewhere.
This is a bit ironic, considering 2.95 can in theory be built with a K&R compiler. I recall there being a “make bootstrap” target in 2.95 that provides an alternate “Stage 1” for compilers that choke on the code?
Also, re: the interrupt handlers, the 4.4 startup code (presumably the 2.95 code as well) provides a few extern vars which map directly to where the CPU jumps on an interrupt. You create your interrupt handler in C, and then convert the handler’s addr to a void pointer and assign them to the desired external vector addresses.
[pedantic]Of course, this can’t be done in ANSI C, but because POSIX compilers require being able to convert function pointers to void, and since [jest]nobody cares about compatibility with VUCC[/jest], no harm done.[/pedantic]
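As a rough illustration of the approach described above, the following host-side C sketch stores a handler’s address in an integer-typed vector slot and later calls through it. The names timer_vector, install_handler and dispatch are hypothetical stand-ins, not the symbols exported by any gccVB crt0, and the slot is a uintptr_t here (rather than a u32) only so that the sketch also works on a 64-bit host. The function-pointer-to-integer cast is the non-ANSI step mentioned in the pedantic note.

```c
#include <assert.h>
#include <stdint.h>

typedef void (*irq_handler_t)(void);

/* Hypothetical stand-in for one of the vector variables exported by crt0. */
static uintptr_t timer_vector;

/* Something observable for the handler to do. */
static volatile int tick_count;

static void timer_handler(void)
{
    tick_count++;
}

/* Store the handler's address in the integer-typed vector slot. */
static void install_handler(uintptr_t *vector, irq_handler_t handler)
{
    assert(handler != 0);
    *vector = (uintptr_t)handler;   /* not strictly ISO C, works in practice */
}

/* What the startup code's vector stub effectively does: jump through the slot. */
static void dispatch(uintptr_t vector)
{
    ((irq_handler_t)vector)();
}
```

On the real target the handler itself would additionally need the interrupt_handler attribute so that it ends in reti.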
cr1901 wrote:
This is a bit ironic, considering 2.95 can in theory be built with a K&R compiler. I recall there being a “make bootstrap” target in 2.95 that provides an alternate “Stage 1” for compilers that choke on the code?
There may well be some magic incantation and combination of barely-documented “configure” commands and environment variables to make it work … but I could only find recommendations to go back to a previous linux distribution that had a GCC 3 compiler.
I hate the GNU build process!
IIRC, my problems were less to do with the actual source code, and more to do with the build failing on things that weren’t really errors.
At least, until you get to the part of the build that wants to process the GCC documentation … which is just horribly broken and dependent upon very specific versions of various tools.
You create your interrupt handler in C, and then convert the handler’s addr to a void pointer and assign them to the desired external vector addresses.
void pointers??? They’re not declared as function pointers???
IMO the startup code that comes with 2.95 (and thereafter crept in to version 4 and later) already does way too much. All the crt0.s file *must* do, from my own experiments, is initialize the registers, set up the data and bss sections, provide the vector table, and call main(). That’s it. The 2.95 crt0.s does a whole bunch of other possibly unnecessary stuff like clearing VRAM, clearing audio RAM, setting the column table, etc. With *maybe* the exception of the column table, these things should be done as necessary either at the beginning of main() or where appropriate. I don’t use the crt0.s that comes with gccVB but instead provide my own as tailored to the specific project. This includes the interrupt handler as well– in many cases I don’t even use interrupts at all, so I stub out those vectors with reti instructions. I figure since each project needs its own crt0.s anyway because of the ROM info table, I may as well customize it to the project’s needs.
Maybe the ROM info table doesn’t belong in crt0.s. It should be possible to create a separate .s file that when assembled just contains the ROM info table, and then place that in its proper location within the ROM at link time. In any case, I don’t know where this base crt0.s came from, but I think it could stand to be pared down quite a bit.
For example, here is a crt0.s that I use for a small project I wrote to test the timer interrupt. It assumes an interrupt handler exists at $07000000 (in my case, provided in a separate .S file) and enables the instruction cache immediately before jumping to it. However if the timer interrupt is not used, that vector can easily be stubbed out as the others are.
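Separately from the attached file, the “set up the data and bss sections, then call main()” part of that list can be sketched in plain C. This is only a host-side model under stated assumptions: the arrays fake_rom, fake_data and fake_bss stand in for the linker-defined section boundaries, and minimal_startup and demo_main are invented names. A real crt0.s does this in assembly before any C code can run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*entry_fn)(void);

/* Stand-ins for the ROM image of .data, its RAM home, and the .bss area. */
static const uint8_t fake_rom[4] = { 1, 2, 3, 4 };
static uint8_t fake_data[4];
static uint8_t fake_bss[4] = { 9, 9, 9, 9 };   /* pretend power-on garbage */

/* Copy the initialised .data image from ROM into RAM. */
static void copy_data(uint8_t *ram, const uint8_t *rom_image, size_t len)
{
    for (size_t i = 0; i < len; i++)
        ram[i] = rom_image[i];
}

/* Zero the .bss area so that uninitialised globals start at 0. */
static void clear_bss(uint8_t *bss, size_t len)
{
    for (size_t i = 0; i < len; i++)
        bss[i] = 0;
}

/* Stand-in for main(): reads state that only exists after the set-up. */
static int demo_main(void)
{
    return fake_data[3] + fake_bss[0];
}

/* Hypothetical startup sequence: memory set-up first, then the entry point. */
static int minimal_startup(uint8_t *data, const uint8_t *rom_image, size_t data_len,
                           uint8_t *bss, size_t bss_len, entry_fn entry)
{
    assert(entry != 0);
    copy_data(data, rom_image, data_len);
    clear_bss(bss, bss_len);
    return entry();
}
```

Everything else (VRAM clearing, column table, and so on) would then be ordinary calls made from main() when a project needs them.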
ElmerPCFX wrote:
void pointers??? They’re not declared as function pointers???
They’re u32s :P. I just called them void pointers b/c that’s really what they represent. https://github.com/cr1901/vbdemo/blob/master/src/drivers/timedriv.c#L110-L113
blitter wrote:
IMO the startup code that comes with 2.95 (and thereafter crept in to version 4 and later) already does way too much. All the crt0.s file *must* do, from my own experiments, is initialize the registers, set up the data and bss sections, provide the vector table, and call main(). That’s it. The 2.95 crt0.s does a whole bunch of other possibly unnecessary stuff like clearing VRAM, clearing audio RAM, setting the column table, etc. With *maybe* the exception of the column table, these things should be done as necessary either at the beginning of main() or where appropriate.
…
Maybe the ROM info table doesn’t belong in crt0.s. It should be possible to create a separate .s file that when assembled just contains the ROM info table, and then place that in its proper location within the ROM at link time. In any case, I don’t know where this base crt0.s came from, but I think it could stand to be pared down quite a bit.
Thanks for the example of the cut-down crt0.S, it was very interesting to compare it to the crt.S in VBJaEngine/GCC4.4.
I personally favor doing as-little-as-possible in crt0.S, and leaving non-critical startup functions to calls inside main().
Since you guys need a ROM header, then I really suspect that there should be a separate “rom.S” project file so that folks can just change the bits that they need.
Having said which, I did notice some things that I found puzzling in the GCC4.4 (and your) crt0.S.
The VirtualBoy docs are pretty insistent that you wait 200us before accessing WRAM, and I don’t see that crt0.s is actually complying with that warning.
I can’t see that the linker script is actually mapping the .sdata/.sbss sections into the VirtualBoy’s WRAM, and I certainly don’t see the GP register being set to a “reasonable” value for using GP-relative addressing.
Unless I’m missing something, that means that all your C variable accesses are going to go through slow 32-bit loads, which seems like a terrible waste when you’ve got the capability for fast-access to 64KB of GP-relative variables.
Perhaps someone modified GCC so that ALL variable-access is GP-relative and I just missed it???
I can’t understand why the .data and .bss segments aren’t 4-byte-aligned so that crt0.S can clear stuff a word at a time rather than doing those slow byte copies.
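A small C sketch of the point about alignment: once a section’s start and end are both multiples of 4, it can be cleared with one 32-bit store per word instead of four byte stores. The helper names are invented for illustration, and on the real target the equivalent loop would live in crt0.S with the alignment guaranteed by the linker script.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round a size or address up to the next multiple of 4. */
static size_t align_up4(size_t n)
{
    return (n + 3u) & ~(size_t)3u;
}

/* Clear [start, end) one 32-bit word at a time; both ends must be aligned. */
static void clear_words(uint32_t *start, uint32_t *end)
{
    assert(((uintptr_t)start & 3u) == 0);
    assert(((uintptr_t)end & 3u) == 0);
    while (start < end)
        *start++ = 0;
}
```

For a section of N bytes this takes align_up4(N) / 4 stores rather than N.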
The interrupt vectors in the GCC4.4 crt0.S seem a little overly-complex … all that loading and indirect jumping could just be replaced with a “jr” directly into a 4-byte executable vector at the end of WRAM, which could then “jr” into your program ROM, or just “reti”.
cr1901 wrote:
They’re u32s :P. I just called them void pointers b/c that’s really what they represent. https://github.com/cr1901/vbdemo/blob/master/src/drivers/timedriv.c#L110-L113
Thanks! I guess that I was expecting executable “jr” vectors since that’s the fastest.
ElmerPCFX wrote:
I personally favor doing as-little-as-possible in crt0.S, and leaving non-critical startup functions to calls inside main().
🙂
The VirtualBoy docs are pretty insistent that you wait 200us before accessing WRAM, and I don’t see that crt0.s is actually complying with that warning.
I’ve tested this particular project on real hardware many times, and never ran into problems. 200us translates to about 4000 clock cycles on the VB, which would pass long before the initialization of data and bss is finished. It is very possible that my luck is due to this project not being terribly dependent on the initial state of RAM, though, so if prematurely accessing WRAM produces garbage, my guess is it simply isn’t affecting my code.
One thing is pretty apparent to me though: it doesn’t seem to have any other effects on the hardware.
I can’t see that the linker script is actually mapping the .sdata/.sbss sections into the VirtualBoy’s WRAM, and I certainly don’t see the GP register being set to a “reasonable” value for using GP-relative addressing.
Unless I’m missing something, that means that all your C variable accesses are going to go through slow 32-bit loads, which seems like a terrible waste when you’ve got the capability for fast-access to 64KB of GP-relative variables.
For the VBJaEngine/GCC4.4 versions, you are correct. However in my version, notice that I set sp and gp to the same value– the top of WRAM. Thanks to WRAM being mirrored every 64KB, gp-relative accesses work just fine. 😉
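The mirroring trick can be modelled in a few lines of C. This sketch assumes that WRAM is 64KB at $05000000 and repeats throughout the $05xxxxxx region, and that “top of WRAM” means $05010000; both values are assumptions made for illustration, and the function names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define WRAM_BASE 0x05000000u   /* assumed start of work RAM */
#define WRAM_SIZE 0x00010000u   /* 64KB, mirrored through the region */

/* Effective address of a "disp[gp]" access with a signed 16-bit displacement. */
static uint32_t gp_relative(uint32_t gp, int16_t disp)
{
    return gp + (uint32_t)(int32_t)disp;
}

/* Fold any address in the WRAM region onto the 64KB that physically exists. */
static uint32_t wram_physical(uint32_t addr)
{
    assert((addr >> 24) == 0x05u);
    return WRAM_BASE | (addr & (WRAM_SIZE - 1u));
}
```

With gp at $05010000, negative displacements land in the upper 32KB of WRAM directly, and positive displacements run past the end and wrap onto the lower 32KB through the mirror, so the whole 64KB is reachable.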
blitter wrote:
One thing is pretty apparent to me though: it doesn’t seem to have any other effects on the hardware.
It’s purely a restriction on using WRAM within the first 200us.
Page 4-5-1 “Chapter 5 – Cautions when Using work RAM”.
I’ve tested this particular project on real hardware many times, and never ran into problems. 200us translates to about 4000 clock cycles on the VB, which would pass long before the initialization of data and bss is finished. It is very possible that my luck is due to this project not being terribly dependent on the initial state of RAM, though, so if prematurely accessing WRAM produces garbage, my guess is it simply isn’t affecting my code.
The initialization of those data and bss sections in WRAM is precisely what developers are supposed to not do until 200us after power-on.
I just took a look at …
Mario Clash : Delay of 65535*4 cycles, followed by 8 dummy read cycles.
Red Alarm : Delay of 65536*4 cycles, followed by 8 dummy read cycles.
Vertical Force : Delay of 400*36 cycle divisions, followed by 8 dummy read cycles.
Now, I don’t know how folks are testing their code on a real VirtualBoy, but it could just be that whatever the flash-card is, it has a capacitor on the RESET line to stop the V810 from starting until 200us after the power is applied.
Or maybe everyone is just getting very lucky.
Either way … the current startup code would fail Nintendo’s Lot-Check.
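For a sense of scale, here is the cycle arithmetic as a C sketch. It assumes a 20 MHz CPU clock (which matches the “about 4000 clock cycles” figure quoted above) and takes the delay figures exactly as reported in this post; the function names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define CPU_HZ 20000000u        /* assumed 20 MHz V810 clock */
#define WRAM_WAIT_US 200u       /* wait required by the docs after power-on */

/* Convert microseconds to CPU cycles at the assumed clock. */
static uint32_t us_to_cycles(uint32_t us)
{
    return (uint32_t)(((uint64_t)us * CPU_HZ) / 1000000u);
}

/* Convert CPU cycles to whole microseconds at the assumed clock. */
static uint32_t cycles_to_us(uint32_t cycles)
{
    return (uint32_t)(((uint64_t)cycles * 1000000u) / CPU_HZ);
}

/* 1 when a start-up delay of this many cycles covers the required wait. */
static int delay_is_sufficient(uint32_t cycles)
{
    return cycles_to_us(cycles) >= WRAM_WAIT_US;
}
```

On those assumptions the 65535*4 cycles reported for Mario Clash come to roughly 13 ms, far more than the 200us minimum, while a delay has to be at least 4000 cycles to qualify at all.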
For the VBJaEngine/GCC4.4 versions, you are correct. However in my version, notice that I set sp and gp to the same value– the top of WRAM. Thanks to WRAM being mirrored every 64KB, gp-relative accesses work just fine. 😉
Hahaha … good point, you fooled little-old me! 😉
But seriously … how are you telling the compiler that the variables in the .data section can be accessed GP-relative?
That’s one of the main reasons to put stuff in the .sdata/.sbss sections with the “-msda=??” compiler switch … so that the compiler knows to generate GP-relative code.
Are you only accessing variables from assembly language that way?
Are you using your own linker script?
***************
As a side-note, while taking my quick-look at Mario Clash and the other games, it was interesting to see that Nintendo programmed their game as if the ROM were mapped into the top of the 4GB address range (i.e. the ROM is $F0000000-$FFFFFFFF), whereas Hudson and T&ESoft programmed their game as if the ROM were mapped into the bottom of the 4GB address range (i.e. the ROM is $07000000-$07FFFFFF).
There are a couple of potential benefits to programming the game to believe that it’s running at $F0000000-$FFFFFFFF, although I can’t see that Nintendo actually took any advantage of it.
ElmerPCFX wrote:
But seriously … how are you telling the compiler that the variables in the .data section can be accessed GP-relative? That’s one of the main reasons to put stuff in the .sdata/.sbss sections with the “-msda=??” compiler switch … so that the compiler knows to generate GP-relative code.
I’m not. 😉 I write everything in assembly. And you’re right– gccVB doesn’t know a thing about SDA variables.
Are you only accessing variables from assembly language that way?
Yep.
Are you using your own linker script?
Yep. I’ve attached it, though I can’t say I’m all too proud of it. I just tweak it here and there.
There are a couple of potential benefits to programming the game to believe that it’s running at $F0000000-$FFFFFFFF, although I can’t see that Nintendo actually took any advantage of it.
Do tell! Shifting my ROM up to that area costs me nothing, so if there are extra benefits I’d love to know what they are.
blitter wrote:
Do tell! Shifting my ROM up to that area costs me nothing, so if there are extra benefits I’d love to know what they are.
I can think of three:
1. Your interrupt vectors can consist of a single JR instruction.
2. You can load certain ROM addresses into registers with a single MOVEA instruction. Of course, this requires using a compiler/assembler that does not always mindlessly generate MOVHI/MOVEA pairs. MV810ASM doesn’t. 😉
3. If you arrange your data really carefully, you can load some of it (e.g. lookup tables) with a negative displacement from register 0.
HorvatM wrote:
I can think of three:
1. Your interrupt vectors can consist of a single JR instruction.
2. You can load certain ROM addresses into registers with a single MOVEA instruction. Of course, this requires using a compiler/assembler that does not always mindlessly generate MOVHI/MOVEA pairs. MV810ASM doesn’t. 😉
3. If you arrange your data really carefully, you can load some of it (e.g. lookup tables) with a negative displacement from register 0.
Yep, those are the exact-same advantages that came to my mind, too! 😉
And if you want rewritable RAM-vectors for your interrupts, you can have them too just by putting the vectors right at the end of WRAM, and making them into “jr” instructions back into the ROM. There’s just-enough room to do that with the 26-bit offset in the “jr” instruction.
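The “just-enough room” remark can be checked with a small range test in C. This is a sketch under several assumptions that are not stated in the thread: that JR takes a signed 26-bit, halfword-aligned displacement relative to its own address, that the timer vector sits at $FFFFFE10, and that the hardware ignores the upper address bits so that the end of WRAM is also visible just below $FE000000. The function name is invented.

```c
#include <assert.h>
#include <stdint.h>

/* 1 when a JR located at 'from' can reach 'to' (signed 26-bit displacement,
   i.e. -0x02000000 .. +0x01FFFFFE, even values only). */
static int jr_reachable(uint32_t from, uint32_t to)
{
    uint32_t diff = to - from;              /* wraps modulo 2^32 like the PC */
    if (diff & 1u)
        return 0;                           /* targets are halfword aligned */
    return diff <= 0x01FFFFFEu || diff >= 0xFE000000u;
}

/* Example addresses (all assumptions, chosen for illustration only). */
static const uint32_t TIMER_VECTOR  = 0xFFFFFE10u;  /* vector in ROM */
static const uint32_t ROM_TOP_CODE  = 0xFF000000u;  /* code linked at the top */
static const uint32_t ROM_LOW_CODE  = 0x07000000u;  /* same ROM, low mapping */
static const uint32_t WRAM_END_HIGH = 0xFDFFFFF0u;  /* last 16 bytes of WRAM, high mirror */
static const uint32_t WRAM_END_LOW  = 0x0500FFF0u;  /* same bytes, usual address */

/* Convenience check for the two-hop "ROM vector -> RAM vector -> ROM" path. */
static int ram_vector_path_ok(uint32_t handler)
{
    assert(handler >= ROM_TOP_CODE);
    return jr_reachable(TIMER_VECTOR, WRAM_END_HIGH) &&
           jr_reachable(WRAM_END_HIGH, handler);
}
```

On these assumptions a vector can JR straight into code linked at the top of the address space or into the high mirror of WRAM, but not to the same locations addressed at $07000000 or $0500FFF0.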
The GCC compiler should generate register-relative offsets if you tell it that your fast-access-data is in the ZDA section with “__attribute__ ((zda))”.
Then you’d just have to fix the linker script to make sure that those sections really are put within the top 32KB of the cartridge.
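Why the top 32KB matters can be shown with a short C sketch: a load of the form disp[r0] has only a signed 16-bit displacement, so with r0 fixed at zero it reaches the bottom 32KB and the top 32KB of the address space and nothing else. The function name r0_displacement is invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 and stores the displacement when addr can be accessed as disp[r0];
   returns 0 when the address is out of reach of a signed 16-bit displacement. */
static int r0_displacement(uint32_t addr, int16_t *disp)
{
    assert(disp != 0);
    if (addr <= 0x00007FFFu) {
        *disp = (int16_t)addr;                                  /* 0 .. 32767 */
        return 1;
    }
    if (addr >= 0xFFFF8000u) {
        *disp = (int16_t)((int32_t)(addr - 0xFFFF8000u) - 0x8000); /* -32768 .. -1 */
        return 1;
    }
    return 0;
}
```

A lookup table placed at $FFFFFFF0 would therefore be addressed as -16[r0], whereas anything below $FFFF8000 in the ROM still needs a full address in a register.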
These aren’t huge performance gains, but every-little-bit helps!