Hi, I’m new to the site … until yesterday I didn’t even know that anyone had updated the GCC V810 patches past the original GCC 2.95 patches that were (AFAIK) done by a bunch of Japanese guys in 2000 (for the PC-FX, I believe).
Anyway … I’m trying to “open-up” the PC-FX for development and have done my own update of the old 2.95 patches to binutils 2.23.2 and GCC 4.7.4, in order to get a “modern” C compiler with C99 capability, and with nearly-all of C11.
It occurs to me that you guys over here with a love for the VirtualBoy may be interested in the work that I’ve done, and that you might be a larger group to provide a test-bed, rather than the PC-FX community, where I’m pretty-much the only assembler-capable developer.
I’ve had a quick “chat” with KR155E, and with his help, I’ve found the following threads …
“experimental gcc4 patches”
http://www.planetvb.com/modules/newbb/viewtopic.php?topic_id=3883
“gccVB optimization options and assembly code”
http://www.planetvb.com/modules/newbb/viewtopic.php?topic_id=5055
“Compiling gccvb 4.4.2 under Cygwin”
http://www.planetvb.com/modules/newbb/viewtopic.php?topic_id=5328
From what-I-can-see, I don’t think that my patches are experiencing any of the problems that have been reported in those threads, except for the “movhi optimization” issue … which isn’t really an “issue” as such, it’s because the compiler doesn’t know where the labels are going to resolve to, so it has to generate full 32-bit loads.
That may be something that the linker can resolve with “whole-program-optimization”, but I’ve not been brave-enough to even try to compile the toolchain with that feature enabled.
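To illustrate why an unresolved label costs two instructions: a full 32-bit address has to be carried as an upper 16-bit half plus a sign-extended lower 16-bit half, which is the kind of pair a movhi/movea sequence holds. This is only a minimal sketch in C, assuming the usual rounding adjustment on the upper half; the names split_address and join_address are made up for illustration and are not taken from the patches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration type: the two 16-bit immediates of a
   two-instruction 32-bit address load. */
typedef struct {
    uint16_t hi; /* upper half, applied shifted left by 16 */
    int16_t  lo; /* lower half, sign-extended when it is added */
} split32_t;

/* Rebuild the address the way the instruction pair would. */
static uint32_t join_address(split32_t s)
{
    return ((uint32_t)s.hi << 16) + (uint32_t)(int32_t)s.lo;
}

/* Split an address so that join_address() gives it back. Because the low
   half is sign-extended, the high half is rounded up when bit 15 is set. */
static split32_t split_address(uint32_t addr)
{
    split32_t s;
    uint32_t low = addr & 0xffffu;

    s.lo = (low & 0x8000u) ? (int16_t)((int32_t)low - 0x10000) : (int16_t)low;
    s.hi = (uint16_t)((addr + 0x8000u) >> 16);
    assert(join_address(s) == addr);
    return s;
}
```

If the final address were known to fit in a signed 16-bit value, the upper half would be redundant, but the compiler cannot know that before the linker has placed the label.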
The new patches are built with mingw64/msys2, and not cygwin, so they’re Windows-native programs.
In trying to clean-up the code so that I could understand (and debug) what was going on, I removed a bunch of pointless options that don’t make any sense to the PC-FX (or VirtualBoy), such as the long-call, long-jump, GHS, and app-regs. Hopefully nobody here cares about those.
I have a version that uses the old GCC 2.95 ABI (with the 16-bytes of stack reserved for r6-r9), and I just completed the transition to the new GCC ABI from 2010 that removes that redundant stack space.
My next task is to change the ABI even more so that I can get useful stack-frames and actually implement a working backtrace function for debugging.
So … I have a couple of technical questions for the assembly-capable developers here.
I’ve not seen the VirtualBoy SDK (and don’t particularly want to wade through it) … but are the V810’s registers R2 and R5 actually used in whatever VirtualBoy libraries you guys use?
Does the VirtualBoy have single-cycle RAM, or does it have wait-states that slow down RAM access?
Are you using any Nintendo binary-only libraries, or can you re-assemble/re-compile whatever libraries/engines that you’re using?
ElmerPCFX wrote:
So … I have a couple of technical questions for the assembly-capable developers here.
I’m only barely “assembly-capable” on the v810, but I’ll take a shot at answering these.
I’ve not seen the VirtualBoy SDK (and don’t particularly want to wade through it) … but are the V810’s registers R2 and R5 actually used in whatever VirtualBoy libraries you guys use?
There really is no “the VirtualBoy SDK” due to a lot of fragmenting, but, TMK, most of the existing code out there avoids direct access to registers except in the necessary setting of hardware ports for control of the peripheral hardware. If you want to make use of these registers for a specific purpose (especially if it means improving memory usage of generated code), I’m sure existing projects could be made compatible quite easily.
Does the VirtualBoy have single-cycle RAM, or does it have wait-states that slow down RAM access?
The cartridge ROM has either 1 or 2 (the default) wait-states, selectable in software. The RAM used by the video hardware (the “VIP”) has 2-5 waits, depending on what part of the display rendering cycle it’s currently in. All other areas have a fixed wait-state of 1.
Are you using any Nintendo binary-only libraries, or can you re-assemble/re-compile whatever libraries/engines that you’re using?
None of the existing, publicly-available, homebrew VB software uses any Nintendo code, binary or otherwise. I can’t speak for what anyone has on their personal PCs, though.
RunnerPack wrote:
I’m only barely “assembly-capable” on the v810, but I’ll take a shot at answering these.
Thanks!
There really is no “the VirtualBoy SDK” due to a lot of fragmenting, but, TMK, most of the existing code out there avoids direct access to registers except in the necessary setting of hardware ports for control of the peripheral hardware. If you want to make use of these registers for a specific purpose (especially if it means improving memory usage of generated code), I’m sure existing projects could be made compatible quite easily.
Ah … I’m going by the Nintendo Seminar docs, and the PC-FX SDK docs, and the GCC docs … all of which follow NEC’s V810 Architecture Manual, where R2 is reserved as the “Handler Stack Pointer”, and R5 is reserved as the “Text Pointer” (which means the address of the start of the program code).
Now, the PC-FX BIOS and the official SDK libraries (which I’m going to ignore), never actually use either of these registers, they’re just wasted.
Newer versions of GCC (well after 2.95, I think) added an option “-mapp-regs” that lets the compiler use these 2 registers for the code that it generates.
I’d be quite surprised if anyone here is relying on that option.
I have my own ideas of how I’d like to use those registers.
I’d like to move the Frame Pointer to R2 (right next to the Stack Pointer in R3), and I’d like to use R5 to replace the V850’s EP register … and basically gain another 32KB of fast-access variable space, particularly for use as thread-local variables.
This isn’t going to cause any problems on the PC-FX … but I’m curious if it will cause any problems on the VirtualBoy.
If you’re programming bare-metal with no BIOS or Nintendo libraries … then it shouldn’t really cause you guys any trouble, either.
As for “memory usage” … how “cramped” are you guys? Are you using the “optimize-for-space” option and/or the “prolog-function” option?
The cartridge ROM has either 1 or 2 (the default) wait-states, selectable in software. The RAM used by the video hardware (the “VIP”) has 2-5 waits, depending on what part of the display rendering cycle it’s currently in. All other areas have a fixed wait-state of 1.
OK, thanks! I guess that Nintendo went a little cheap on the memory (again).
The PC-FX runs everything from RAM, so I’m more worried about pipeline-stalls than I am about memory access times.
I guess that you guys have different issues, and that the VirtualBoy’s memory timing dwarfs the occasional modify-then-read pipeline-stall.
That means that I should definitely keep the frame-pointer “optional” rather than “required” (which is a pity, because it’s so darned useful when implemented properly).
None of the existing, publicly-available, homebrew VB software uses any Nintendo code, binary or otherwise. I can’t speak for what anyone has on their personal PCs, though.
Excellent, you’ve got a completely clean-and-legal toolkit, and that means that you’ve got the source-code to make any changes if you use the new 2010 ABI, or whatever I come up with (if it’s an improvement).
Hi ElmerPCFX, welcome to PVB! Glad to meet another fellow programmer with an interest in improving our little homebrew toolchain. 🙂
Please check out my thread here http://www.planetvb.com/modules/newbb/viewtopic.php?topic_id=5252 about GCC 4 and generating PC-relative jumps. I threw together a tool that hacks around the problem by poking at the output ELF file, but if you have a solid grasp of what GCC actually does behind the scenes to exhibit this unwanted behavior, then hopefully you perhaps have a better idea of what a proper solution might be. This bug (among others) is kind of what has put the brakes on my VB development since I’d much rather be fighting my *own* code versus GCC’s code.
I’ve since switched to writing purely in assembly, and FWIW I completely ignore NEC’s register allocations, save for those used by the mul/mulu/bitstring/etc. instructions. The VB itself doesn’t care either. One of my patches to GCC was to rename ‘ep’ so that the assembler would recognize ‘r30’ as a valid alias! 🙂
Hi blitter,
It’s always good to see someone else that’s comfortable in assembly-language.
blitter wrote:
Please check out my thread here http://www.planetvb.com/modules/newbb/viewtopic.php?topic_id=5252 about GCC 4 and generating PC-relative jumps.
I looked at the thread, and it’s pretty obvious that the problem is in the symbol-relocation code in binutils.
A quick comparison of the binutils 2.20.1 patch, my binutils 2.23.2 patch, and the current V850 code shows that there’s a bug in the binutils 2.20.1 patch in the R_V810_26_PCREL relocation.
insn |= (((addend & 0xfffe) << 16) | ((addend & 0x3f0000) >> 16));
should be
insn |= (((addend & 0xfffe) << 16) | ((addend & 0x3ff0000) >> 16));
The patch that you’re using loses the top 4 bits of the 26-bit relative address.
With the bug, the maximum relocation is 0x003fffff … which corresponds nicely to your observed bad-offset of 0x00400000.
I don’t know-for-sure that fixing that will make your problem go away, but I think that it’s pretty likely.
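For anyone who wants to see the effect of the mask without rebuilding binutils, here is a small host-side sketch in C. It only reproduces the bit-packing expression quoted above with both masks; the function names are made up, and the alignment assert is my own assumption that branch displacements are always even.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 26-bit PC-relative displacement using the expression quoted above.
   high_mask selects which bits of the upper part survive. */
static uint32_t pack_pcrel26(uint32_t addend, uint32_t high_mask)
{
    assert((addend & 1u) == 0); /* assumption: displacements are halfword aligned */
    return ((addend & 0xfffeu) << 16) | ((addend & high_mask) >> 16);
}

/* Mask from the binutils 2.20.1 patch: keeps only bits 16..21. */
static uint32_t pack_buggy(uint32_t addend)
{
    return pack_pcrel26(addend, 0x3f0000u);
}

/* Corrected mask: keeps bits 16..25. */
static uint32_t pack_fixed(uint32_t addend)
{
    return pack_pcrel26(addend, 0x3ff0000u);
}
```

Both versions agree for every displacement up to 0x003ffffe. At 0x00400000 the buggy mask packs 0, while the fixed mask packs 0x40.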
I don’t know how-easy it is for you to recompile binutils and test that … I’ve had a lot of trouble compiling old versions of binutils and GCC with newer versions of the GNU build tools.
I’m using msys2, which keeps very current on all the latest versions of the GNU tools, and I need a bunch of extra patches to compile binutils 2.23.2, and any GCC that’s older than 4.7.
- This reply was modified 8 years, 8 months ago by ElmerPCFX.
blitter: Can I ask what your thinking was behind your 2011-11-23 patch to change the HARD_FRAME_POINTER_REGNUM from 29 to 25?
Do you have an example of whatever the problem was that this was designed to fix?
ElmerPCFX wrote:
I’d like to move the Frame Pointer to R2 (right next to the Stack Pointer in R3), and I’d like to use R5 to replace the V850’s EP register … and basically gain another 32KB of fast-access variable space, particularly for use as thread-local variables.
I don’t see a need for a frame pointer, and neither did NEC apparently.
And you can already access a 64K range with a single register by using negative displacements. Commercial VB games set register 4 to 0x05008000 and use it to access global variables anywhere in the WRAM (which is 64K long).
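The arithmetic behind that can be checked on a PC. This is a minimal sketch in C, assuming a signed 16-bit displacement and the base address and 64K WRAM size mentioned above; the macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define WRAM_START 0x05000000u /* assumed start of the 64K WRAM window */
#define WRAM_SIZE  0x00010000u /* 64K, as stated above */
#define GP_BASE    0x05008000u /* base value quoted above for register 4 */

/* True if addr can be reached from base with a signed 16-bit displacement. */
static int disp16_reaches(uint32_t base, uint32_t addr)
{
    int64_t disp = (int64_t)addr - (int64_t)base;
    return disp >= -32768 && disp <= 32767;
}

/* The displacement that would be encoded; the caller must stay in range. */
static int32_t displacement(uint32_t base, uint32_t addr)
{
    assert(disp16_reaches(base, addr));
    return (int32_t)((int64_t)addr - (int64_t)base);
}

/* Check the first and last byte of WRAM against the mid-point base. */
static int whole_wram_reachable(void)
{
    return disp16_reaches(GP_BASE, WRAM_START) &&
           disp16_reaches(GP_BASE, WRAM_START + WRAM_SIZE - 1u);
}
```

With the base in the middle of the window, the displacements run from -32768 to 32767. With the base at the start of the window, the upper 32K would be out of reach.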
blitter: Can I ask what your thinking was behind your 2011-11-23 patch to change the HARD_FRAME_POINTER_REGNUM from 29 to 25?
Do you have an example of whatever the problem was that this was designed to fix?
Bitstring instructions, probably.
HorvatM wrote:
I don’t see a need for a frame pointer, and neither did NEC apparently.
NEC didn’t mandate a specific register for the frame-pointer … there’s a huge difference between that and saying that they didn’t see the need for a frame-pointer.
Just because you don’t see the need for frame pointers and backtraces doesn’t change the fact that I do, and so do a huge proportion of experienced C/C++ programmers. In-system backtraces are useful for a whole bunch of things.
I don’t know what compiler Nintendo shipped with the VirtualBoy, but it was probably the Green Hills suite.
Which supports frame pointers, as does GCC … and every C compiler that I know of. Sometimes the compiler absolutely needs to use a frame-pointer … which GCC does automatically, even when you use the “omit-frame-pointers” option.
Just because the guys that added V850 support to GCC back in the 1990’s goofed on the stack order of the saved registers and made the frame-pointer unusable for doing a backtrace, doesn’t mean that we need to keep following that mistake in 2016.
And you can already access a 64K range with a single register by using negative displacements. Commercial VB games set register 4 to 0x05008000 and use it to access global variables anywhere in the WRAM (which is 64K long).
I didn’t know that the VirtualBoy only had 64KB RAM, thanks!
So you guys don’t need anything more than the existing SDA segment (gp-register-relative) support, that’s good to know.
But the PC-FX has 2MB RAM, so I could use something a bit more sophisticated.
And you’re ignoring the whole point of a thread-local-variable area … which is another reason to move the TDA segment to R5 on the V810 instead of R30 on the V850.
Bitstring instructions, probably.
I’m sorry, but that’s a completely unhelpful answer.
Sure … he’s trying to move the hard-frame-pointer away from the registers that are used by the bitstring instructions.
Why? Are you guys doing bitstring instructions in inline-assembly within the C code? If so … are you telling the compiler what registers you clobber?
Are you doing bitstring instructions from assembly? … If so, it makes little difference whether the compiler puts its frame-pointer in R29 … especially since you’re probably compiling with “omit-frame-pointers” anyway.
Do you realize the effect that moving that definition has on the compiled-code when the compiler does need a frame pointer … especially if you’re using function-prologues?
HorvatM wrote:
Bitstring instructions, probably.
Precisely.
ElmerPCFX wrote:
I’m sorry, but that’s a completely unhelpful answer.
Sure … he’s trying to move the hard-frame-pointer away from the registers that are used by the bitstring instructions.
Why? Are you guys doing bitstring instructions in inline-assembly within the C code? If so … are you telling the compiler what registers you clobber?
Yes, and yes. It has been quite a while but as I recall either r29 was ignored when I specified it in the clobber list or I got some kind of error.
Are you doing bitstring instructions from assembly? … If so, it makes little difference whether the compiler puts its frame-pointer in R29 … especially since you’re probably compiling with “omit-frame-pointers” anyway.
Anything I’m doing from non-inline assembly the compiler should not touch, period, other than to assemble it. But for what it’s worth I use -fomit-frame-pointer in my Makefiles. Again, it’s been a while so I don’t remember the exact problem moving the frame register solved, but it was definitely related to the bitstring instructions.
Do you realize the effect that moving that definition has on the compiled-code when the compiler does need a frame pointer … especially if you’re using function-prologues?
Frame pointers and backtraces in my experience are pretty useless in VB homebrew since source-level debugging is pretty nonexistent above the assembly code level. Maybe it was possible with Nintendo’s official tools and development hardware, but I for one have never used their tools much less *seen* official VB dev hardware, and I can’t name any forum regulars who have either. Thus, and I’m sorry, but I care little about what happens to the frame pointer. 🙂 As for function prologues, I ran some simple tests before publicizing those patches and didn’t run into problems, but caveat emptor, YMMV, etc.
ElmerPCFX wrote:
I looked at the thread, and it’s pretty obvious that the problem is in the symbol-relocation code in binutils.
A quick comparison of the binutils 2.20.1 patch, my binutils 2.23.2 patch, and the current V850 code shows that there’s a bug in the binutils 2.20.1 patch in the R_V810_26_PCREL relocation.
insn |= (((addend & 0xfffe) << 16) | ((addend & 0x3f0000) >> 16));
should be
insn |= (((addend & 0xfffe) << 16) | ((addend & 0x3ff0000) >> 16));
The patch that you’re using loses the top 4 bits of the 26-bit relative address.
With the bug, the maximum relocation is 0x003fffff … which corresponds nicely to your observed bad-offset of 0x00400000.
I don’t know-for-sure that fixing that will make your problem go away, but I think that it’s pretty likely.
Thank you! I might be able to bring up a gccVB build chain this weekend and test that fix for myself. I agree; it looks likely.
I don’t know how-easy it is for you to recompile binutils and test that … I’ve had a lot of trouble compiling old versions of binutils and GCC with newer versions of the GNU build tools.
I’m using msys2, which keeps very current on all the latest versions of the GNU tools, and I need a bunch of extra patches to compile binutils 2.23.2, and any GCC that’s older than 4.7.
I do all my VB dev in Mac OS X. Specifically, I build the toolchain in 10.6 with an older version of GCC installed via macports. The build products continue to work in the latest version of OS X El Capitan, plus as a bonus I can build PPC versions too.
I also don’t know if I’ve mentioned this anywhere else here, but I do *not* know GCC’s internals. At all. So, my patches are more hacks or bandaids to work around problems I encounter than anything else. I share them just in case they might help other devs, but please don’t accept them as attempts to properly fix any problems (though if I happen to fix anything then AFAIC that’s purely a coincidence. 🙂 )
blitter wrote:
Yes, and yes. It has been quite a while but as I recall either r29 was ignored when I specified it in the clobber list or I got some kind of error.
Thank you, that’s the kind of information that I can use!
So, if I’m understanding you correctly, you are using GCC’s “inline-assembly” to do the string instructions, rather than a separate assembly function. Is that correct?
Anything I’m doing from non-inline assembly the compiler should not touch, period, other than to assemble it. But for what it’s worth I use -fomit-frame-pointer in my Makefiles. Again, it’s been a while so I don’t remember the exact problem moving the frame register solved, but it was definitely related to the bitstring instructions.
Thanks, again. If you’re using “-fomit-frame-pointer”, then the compiler should be using R29 as a general-purpose callee-saved register.
If it doesn’t let you “clobber” it in inline assembly, just because it *might* be used as a frame-pointer … then that’s really helpful information.
Frame pointers and backtraces in my experience are pretty useless in VB homebrew since source-level debugging is pretty nonexistent above the assembly code level.
Ah … on the contrary … IMHO that’s exactly when a good backtrace is the most-useful.
If you’ve got a good source-level debugger with full DWARF information about the process, then it doesn’t need a frame-pointer … it already has all the information from the compiler-emitted debugging-info.
A good “backtrace”, complete with actual function names, can be done on the target hardware, without a debugger, if the frame-pointer exists, and if the stack-frame-layout is sensible.
This lets you get the “context” of any error message, and lets you implement sophisticated in-engine memory debugging.
It really helps to have extra RAM available when these things are enabled … which is why Nintendo (and everyone else) shipped their “development-kits” with more RAM than the “retail” kits (up until the last generation, when things got more complex).
You can simulate an environment like this in Mednafen just by modifying the amount of memory that the virtual VirtualBoy sees (it’s a source-level hack to Mednafen).
It’s not useful for “final-testing”, but it’s a godsend for 90% of development.
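As a rough illustration of the on-target backtrace idea above, here is a sketch in C that walks a chain of saved frame pointers over a simulated stack. It assumes a frame layout where the frame pointer points at the caller’s saved frame pointer and the saved return address sits in the word just below it; that layout, the word-indexed “memory”, and the name walk_frames are all assumptions for illustration, not what any existing GCC V810 build produces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk a chain of frames in a simulated stack. Addresses are word indexes
   into mem[]. Assumed layout: mem[fp] holds the caller's frame pointer
   (0 ends the chain) and mem[fp - 1] holds the saved return address.
   Returns the number of return addresses written to out[]. */
static size_t walk_frames(const uint32_t *mem, size_t mem_words,
                          uint32_t fp, uint32_t *out, size_t max_out)
{
    size_t n = 0;

    assert(mem != NULL && out != NULL);
    while (fp != 0 && fp < mem_words && n < max_out) {
        uint32_t next = mem[fp];

        out[n++] = mem[fp - 1];
        /* The stack grows downwards, so an older frame must sit at a higher
           address. Anything else means the chain is corrupt, so stop. */
        if (next != 0 && next <= fp)
            break;
        fp = next;
    }
    return n;
}
```

On real hardware the return addresses would then be looked up in a symbol table to print function names, which is where the extra RAM comes in handy.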
I do all my VB dev in Mac OS X. Specifically, I build the toolchain in 10.6 with an older version of GCC installed via macports. The build products continue to work in the latest version of OS X El Capitan, plus as a bonus I can build PPC versions too.
That’s cool to know. I mainly run Windows on my MacPro, but I think that I may still have a 10.6.8 partition somewhere.
I also don’t know if I’ve mentioned this anywhere else here, but I do *not* know GCC’s internals. At all. So, my patches are more hacks or bandaids to work around problems I encounter than anything else. I share them just in case they might help other devs, but please don’t accept them as attempts to properly fix any problems (though if I happen to fix anything then AFAIC that’s purely a coincidence. 🙂 )
No problem … the point is that you’ve tried to improve things, and so did M.K. when the GCC 4.4.2 patches were created. That’s wonderful!
It took me about 6 months of agony to get the GCC 2.95 patches updated to GCC 4.7.4, and that included lots of flailing-around inside complex source code that I barely understood … and still mostly-don’t.
Here’s my proposal for a new stack-frame layout, together with the one that everyone is using now, and the “new” GCC ABI from 2010.
Basically … the “old” ABI reserved 16-bytes of stack space for storing the first-4 function arguments just-in-case you call a function with variable-arguments.
In the years since that time, “stdarg.h” has replaced “varargs.h”, and that space is no longer needed.
So the V850 guys got rid of it in 2010.
I’m proposing adding back 4-bytes to use for storing the Frame Pointer, so that backtraces are possible.
Reordering the output of the “saved” registers should also radically reduce the amount of space used by function-prologues … which should help speed them up by keeping them in the instruction cache.
Any comments?
*****************************
GCC 1999-ABI V850 STACK FRAME

CALLER        incoming-arg0
        ap->  16-bytes-reserved
CALLEE        saved-lp
              saved-??
        fp->  saved-fp
              local-variables
              outgoing-arg?
              outgoing-arg0
        sp->  16-bytes-reserved
*****************************
GCC 2010-ABI V850 STACK FRAME

CALLER  ap->  incoming-arg0
CALLEE        saved-lp
              saved-??
        fp->  saved-fp
              local-variables
              outgoing-arg?
        sp->  outgoing-arg0
*****************************
GCC 2016-ABI V810 STACK FRAME

CALLER        incoming-arg0
   ap-> fp->  saved-fp
CALLEE        saved-lp
              saved-??
              local-variables
              outgoing-arg?
              outgoing-arg0
        sp->  4-bytes-reserved
*****************************
- This reply was modified 8 years, 8 months ago by ElmerPCFX.
I took a quick look at the libgccvb source code, and was surprised to see so many uses of “u8” and “u16” in the code.
The V810 CPU was designed to handle 32-bit variables … and it doesn’t do any arithmetic operations on 16-bit or 8-bit values.
That means that the compiler needs to do a lot of masking/sign-extending when it’s asked to deal with 16-bit or 8-bit variables, just so that it keeps the results correct within the limits of 16-bit or 8-bit rounding.
You really should be using “int” and “unsigned” as much as possible, and avoid “short” and “char” variables.
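To make the reason for the masking concrete, here is a small host-side sketch in C. A “u16” counter has to wrap at 65536 to stay correct, so on a CPU with only 32-bit arithmetic the compiler has to add the equivalent of the hand-written mask below. The typedef mirrors the libgccvb name; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;

/* 16-bit increment: the result must wrap from 65535 back to 0. */
static u16 next_u16(u16 i)
{
    return (u16)(i + 1);
}

/* Native-width increment: no wrap at 16 bits, so no masking is needed. */
static unsigned next_unsigned(unsigned i)
{
    return i + 1u;
}

/* What a 32-bit-only ALU has to do to emulate the 16-bit version. */
static unsigned next_u16_by_hand(unsigned i)
{
    return (i + 1u) & 0xffffu;
}

/* Confirm that the masked 32-bit version matches the u16 version
   for every possible 16-bit input. */
static void check_all_u16(void)
{
    unsigned i;

    for (i = 0; i <= 0xffffu; i++)
        assert(next_u16((u16)i) == next_u16_by_hand(i));
}
```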
I thought that it would be interesting to see how the different GCC compiler versions compile a couple of simple C functions.
In each case, the original libgccvb version is first, and then 1 or 2 versions replacing the “u16” and “u8” variables with “unsigned” instead.
It seems strange to me that GCC 4.4.2 is doing such a relatively-poor job compared to GCC 2.95 or GCC 4.7.4. I wonder what went wrong?
All examples are compiled with “-O2 -fomit-frame-pointer”.
****************************************************************************************
****************************************************************************************

void copymem (u8* dest, const u8* src, u16 num)
{
  u16 i;
  for (i = 0; i < num; i++) {
    *dest++ = *src++;
  }
}

********* GCC 4.7.4 ******************* GCC 2.95 ******************** GCC 4.4.2 ********

_copymem:  andi 65535,r8,r8   _copymem:  andi 65535,r8,r8   _copymem:  andi 65535,r8,r8
           be .L1                        mov 0,r10                     be .L4
           addi -1,r8,r11                cmp r8,r10                    mov 0,r10
           andi 65535,r11,r11            bnl .L4            .L3:       mov r7,r11
           add 1,r11          .L6:       add 1,r10                     add r10,r11
           add r6,r11                    ld.b 0[r7],r11                ld.b 0[r11],r12
.L3:       ld.b 0[r7],r10                andi 65535,r10,r10            mov r6,r11
           add 1,r7                      add 1,r7                      add r10,r11
           st.b r10,0[r6]                st.b r11,0[r6]                add 1,r10
           add 1,r6                      add 1,r6                      st.b r12,0[r11]
           cmp r11,r6                    cmp r8,r10                    andi 65535,r10,r11
           bne .L3                       bl .L6                        cmp r11,r8
.L1:       jmp [r31]          .L4:       jmp [r31]                     bh .L3
                                                            .L4:       jmp [r31]

********* GCC 4.7.4 ******************* GCC 2.95 ******************** GCC 4.4.2 ********
****************************************************************************************
****************************************************************************************

void copymem2 (u8* dest, const u8* src, unsigned num)
{
  unsigned i;
  for (i = 0; i < num; i++) {
    *dest++ = *src++;
  }
}

********* GCC 4.7.4 ******************* GCC 2.95 ******************** GCC 4.4.2 ********

_copymem2: mov r6,r11         _copymem2: mov 0,r11          _copymem2: cmp r0,r8
           add r8,r11                    cmp r8,r11                    be .L10
           cmp 0,r8                      bnl .L10                      mov 0,r10
           be .L7             .L12:      ld.b 0[r7],r10     .L9:       mov r7,r11
.L11:      ld.b 0[r7],r10                add 1,r11                     add r10,r11
           add 1,r7                      add 1,r7                      ld.b 0[r11],r12
           st.b r10,0[r6]                st.b r10,0[r6]                mov r6,r11
           add 1,r6                      add 1,r6                      add r10,r11
           cmp r11,r6                    cmp r8,r11                    st.b r12,0[r11]
           bne .L11                      bl .L12                       add 1,r10
.L7:       jmp [r31]          .L10:      jmp [r31]                     cmp r10,r8
                                                                       bh .L9
                                                            .L10:      jmp [r31]

********* GCC 4.7.4 ******************* GCC 2.95 ******************** GCC 4.4.2 ********
****************************************************************************************
****************************************************************************************

void addmem (u8* dest, const u8* src, u16 num, u8 offset)
{
  u16 i;
  for (i = 0; i < num; i++) {
    *dest++ = (*src++ + offset);
  }
}

********* GCC 4.7.4 ******************* GCC 2.95 ******************** GCC 4.4.2 ********

_addmem:   andi 65535,r8,r8   _addmem:   andi 65535,r8,r8   _addmem:   andi 65535,r8,r8
           andi 255,r9,r9                mov 0,r11                     andi 255,r9,r9
           cmp 0,r8                      andi 255,r9,r9                cmp r0,r8
           be .L13                       cmp r8,r11                    be .L20
           addi -1,r8,r11                bnl .L22                      mov 0,r10
           andi 65535,r11,r11 .L24:      mov r9,r10         .L19:      mov r7,r11
           add 1,r11                     add 1,r11                     add r10,r11
           add r6,r11                    ld.b 0[r7],r12                ld.b 0[r11],r12
.L15:      ld.b 0[r7],r10                andi 65535,r11,r11            mov r6,r11
           add 1,r7                      add r12,r10                   add r10,r11
           add r9,r10                    add 1,r7                      add r9,r12
           st.b r10,0[r6]                st.b r10,0[r6]                add 1,r10
           add 1,r6                      add 1,r6                      st.b r12,0[r11]
           cmp r11,r6                    cmp r8,r11                    andi 65535,r10,r11
           bne .L15                      bl .L24                       cmp r11,r8
.L13:      jmp [r31]          .L22:      jmp [r31]                     bh .L19
                                                            .L20:      jmp [r31]

********* GCC 4.7.4 ******************* GCC 2.95 ******************** GCC 4.4.2 ********
****************************************************************************************
****************************************************************************************

void addmem2 (u8* dest, const u8* src, unsigned num, u8 offset)
{
  unsigned i;
  for (i = 0; i < num; i++) {
    *dest++ = (*src++ + offset);
  }
}

********* GCC 4.7.4 ******************* GCC 2.95 ******************** GCC 4.4.2 ********

_addmem2:  mov r6,r11         _addmem2:  mov 0,r12          _addmem2:  andi 255,r9,r9
           andi 255,r9,r9                andi 255,r9,r9                cmp r0,r8
           add r8,r11                    cmp r8,r12                    be .L20
           cmp 0,r8                      bnl .L22                      mov 0,r10
           be .L18            .L24:      mov r9,r10         .L19:      mov r7,r11
.L22:      ld.b 0[r7],r10                ld.b 0[r7],r11                add r10,r11
           add 1,r7                      add 1,r12                     ld.b 0[r11],r12
           add r9,r10                    add r11,r10                   mov r6,r11
           st.b r10,0[r6]                add 1,r7                      add r10,r11
           add 1,r6                      st.b r10,0[r6]                add r9,r12
           cmp r11,r6                    add 1,r6                      st.b r12,0[r11]
           bne .L22                      cmp r8,r12                    add 1,r10
.L18:      jmp [r31]                     bl .L24                       cmp r10,r8
                              .L22:      jmp [r31]                     bh .L19
                                                            .L20:      jmp [r31]

********* GCC 4.7.4 ******************* GCC 2.95 ******************** GCC 4.4.2 ********
****************************************************************************************
****************************************************************************************

void addmem3 (u8* dest, const u8* src, unsigned num, unsigned offset)
{
  unsigned i;
  for (i = 0; i < num; i++) {
    *dest++ = (*src++ + offset);
  }
}

********* GCC 4.7.4 ******************* GCC 2.95 ******************** GCC 4.4.2 ********

_addmem3:  cmp 0,r8           _addmem3:  mov 0,r12          _addmem3:  cmp r0,r8
           be .L24                       cmp r8,r12                    be .L25
           andi 255,r9,r9                bnl .L28                      andi 255,r9,r9
           add r6,r8          .L30:      mov r9,r10                    mov 0,r10
.L26:      ld.b 0[r7],r10                ld.b 0[r7],r11     .L24:      mov r7,r11
           add 1,r7                      add 1,r12                     add r10,r11
           add r9,r10                    add r11,r10                   ld.b 0[r11],r12
           st.b r10,0[r6]                add 1,r7                      mov r6,r11
           add 1,r6                      st.b r10,0[r6]                add r10,r11
           cmp r8,r6                     add 1,r6                      add r9,r12
           bne .L26                      cmp r8,r12                    st.b r12,0[r11]
.L24:      jmp [r31]                     bl .L30                       add 1,r10
                              .L28:      jmp [r31]                     cmp r10,r8
                                                                       bh .L24
                                                            .L25:      jmp [r31]

********* GCC 4.7.4 ******************* GCC 2.95 ******************** GCC 4.4.2 ********
****************************************************************************************
****************************************************************************************
ElmerPCFX wrote:
I took a quick look at the libgccvb source code, and was surprised to see so many uses of “u8” and “u16” in the code.
The V810 CPU was designed to handle 32-bit variables … and it doesn’t do any arithmetic operations on 16-bit or 8-bit values.
That means that the compiler needs to do a lot of masking/sign-extending when it’s asked to deal with 16-bit or 8-bit variables, just so that it keeps the results correct within the limits of 16-bit or 8-bit rounding.
You really should be using “int” and “unsigned” as much as possible, and avoid “short” and “char” variables.
According to David Tucker’s unofficial Virtual Boy specification:
The external data buss [sic] supports both a 32-bit data mode and a 16-bit mode, but the VB only utilizes the 16-bit mode.
Now, I don’t know where he got that info, since I can’t find mention in the official Nintendo docs of the width of the data bus at all, but in the symposium PDFs there is sample code that copies data in memory using short* and char* pointers, so that suggests to me that the VB uses the V810’s 16-bit bus mode. 32-bit pointers are not used in Nintendo’s sample code. So, while arithmetic operations probably should operate on 32-bit values for best performance, is it efficient to load 32-bit values from RAM/ROM on a 16-bit wide data bus (assuming this is how the VB is configured)?
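For reference, a copy loop in the same spirit as that sample code (this is my own sketch in C, not Nintendo’s code) moves one halfword per iteration, which matches one access on a 16-bit bus. The function name and the alignment check are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy count halfwords from src to dest, one 16-bit access per element.
   Both pointers are assumed to be halfword aligned. */
static void copy_halfwords(uint16_t *dest, const uint16_t *src, size_t count)
{
    assert(((uintptr_t)dest & 1u) == 0 && ((uintptr_t)src & 1u) == 0);
    while (count--)
        *dest++ = *src++;
}
```

A 32-bit version of the same loop would halve the instruction count per byte copied, but each load or store would presumably need two trips over a 16-bit bus, so the gain depends on the wait-states of the memory involved.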
blitter wrote:
So, while arithmetic operations probably should operate on 32-bit values for best performance, is it efficient to load 32-bit values from RAM/ROM on a 16-bit wide data bus (assuming this is how the VB is configured)?
OK, I found a copy of the SDK (which is just the docs) online and confirmed that the VB is using a 16-bit data bus.
Ouch! Nintendo really wanted to make things difficult for their developers, didn’t they?
That has a huge effect on everything … most particularly the importance of running code from the instruction-cache as much as possible.
The compiler doesn’t really seem to understand that “ld.*” is automatically sign-extending a 16-bit/8-bit read from memory.
The compiler doesn’t know that it can use “in.*” on the VirtualBoy to zero-extend reads from memory (that trick won’t work on the PC-FX).
That means that any code that does arithmetic on 16-bit/8-bit values is usually going to generate one or more extra instructions to sign-extend/mask the values when it reads them.
That is 4-bytes of code that are going to take 1 or 2 cycles to execute, and require 2 memory reads, usually from ROM, and potentially with 2 wait-states per read.
That is going to be no-better than the extra 2-cycle memory-read to get the high 16-bits of a 32-bit variable, and quite-possibly worse.
So I think that I’d still recommend that folks stick with 32-bit variables in C as much as possible, but it’s definitely a less clear situation than it is on the PC-FX, and I’d suggest that folks actually look at the assembly code that the compiler generates in order to see what it’s doing.
If you’re programming in assembly, then you can just use ld.h/in.h, and you can write efficient code because you have a better understanding of the CPU architecture and the VirtualBoy than the compiler does.
BTW … the “advice” may change in the future if I can get the compiler to understand that “ld.*” is automatically sign-extending the value, and that it doesn’t need to generate its own code to do it.
But that won’t apply to unsigned variables, which are still going to be masked.
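The difference between the two kinds of load, and the extra mask that unsigned values need, can be modelled on a PC. This is only a sketch in C of the behaviour described above (a sign-extending byte load, a zero-extending byte load, and a sign-extending load followed by an “andi 255”); the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Models a sign-extending byte load such as "ld.b". */
static int32_t load_byte_signed(const uint8_t *p)
{
    int32_t v = *p;
    return (v & 0x80) ? v - 0x100 : v;
}

/* Models a zero-extending byte load, as described above for "in.b". */
static uint32_t load_byte_unsigned(const uint8_t *p)
{
    return *p;
}

/* Models "ld.b" followed by "andi 255": one extra instruction to get
   the same result as the zero-extending load. */
static uint32_t mask_after_signed_load(const uint8_t *p)
{
    uint32_t r = (uint32_t)load_byte_signed(p) & 0xffu;

    assert(r == load_byte_unsigned(p));
    return r;
}
```

For the byte 0x80, the signed load gives -128 and the other two give 128.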
Whatever happens … it still goes to show that the VirtualBoy is another one of the old machines where an assembly-language programmer can generate better code than a compiler.
ElmerPCFX wrote:
The compiler doesn’t really seem to understand that “ld.*” is automatically sign-extending a 16-bit/8-bit read from memory.
The compiler doesn’t know that it can use “in.*” on the VirtualBoy to zero-extend reads from memory (that trick won’t work on the PC-FX).
That is a cool trick! I hadn’t thought to investigate the in.* instructions to see what they actually do. I’ll have to use that in my projects now, thanks. 🙂
blitter wrote:
That is a cool trick! I hadn’t thought to investigate the in.* instructions to see what they actually do. I’ll have to use that in my projects now, thanks. 🙂
It’s a nice trick since Nintendo made the I/O address space just a copy of the normal address space … but note that you don’t save the extra cycle on multiple loads that you do with the “ld” instruction.
Because the V810 sign-extends any constants for math and comparison, I suspect that it’s still probably best to just use signed variables, rather than unsigned variables wherever possible.
I think that I have figured-out how to let GCC know that the “ld” instruction sign-extends variables into an int.
Here are a couple of examples of how it affects the code with newlib’s “strlen” function, and then some variations on it.
The variations show how the generated code changes as things get a little bit more complex: each one modifies “strlen” to change the comparison so that the compiler can’t just short-cut the check for zero.
The thing to pay particular attention to is the number of instructions in the inner loop.
It shows, again, that if you choose to use C on a processor like the V810, then there are definitely tricks to know that will improve the code-generation.
****************************************************************************************
****************************************************************************************

ORIGINAL FUNCTION FROM NEWLIB 2.2.0

size_t strlen (const char *str)
{
  const char *start = str;
  while (*str)
    str++;
  return str - start;
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_strlen:  ld.b 0[r6],r10      _strlen:  ld.b 0[r6],r10      _strlen:  ld.b 0[r6],r10
          cmp 0,r10                     mov r6,r11                    shl 24,r10
          be .L42                       cmp r0,r10                    sar 24,r10
          mov r6,r10                    be .L46                       be .L39
.L41:     add 1,r10           .L47:     add 1,r6                      mov r6,r10
          ld.b 0[r10],r11               ld.b 0[r6],r10      .L40:     add 1,r10
          cmp 0,r11                     cmp r0,r10                    ld.b 0[r10],r11
          bne .L41                      bne .L47                      shl 24,r11
          sub r6,r10          .L46:     mov r6,r10                    bne .L40
          jmp [r31]                     sub r11,r10                   sub r6,r10
.L42:     mov 0,r10                     jmp [r31]           .L39:     jmp [r31]
          jmp [r31]

****************************************************************************************
****************************************************************************************

MARK THE END-OF-STRING WITH A NON-ZERO CONSTANT

size_t strlen2 (const char *str)
{
  const char *start = str;
  while (*str != 1)
    str++;
  return str - start;
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_strlen2: ld.b 0[r6],r10      _strlen2: ld.b 0[r6],r10      _strlen2: ld.b 0[r6],r11
          cmp 1,r10                     mov r6,r11                    shl 24,r11
          be .L47                       cmp 1,r10                     sar 24,r11
          mov r6,r10                    be .L51                       cmp 1,r11
.L46:     add 1,r10           .L52:     add 1,r6                      be .L49
          ld.b 0[r10],r11               ld.b 0[r6],r10                mov r6,r10
          cmp 1,r11                     cmp 1,r10           .L46:     add 1,r10
          bne .L46                      bne .L52                      ld.b 0[r10],r11
          sub r6,r10          .L51:     mov r6,r10                    shl 24,r11
          jmp [r31]                     sub r11,r10                   sar 24,r11
.L47:     mov 0,r10                     jmp [r31]                     cmp 1,r11
          jmp [r31]                                                   bne .L46
                                                                      sub r6,r10
                                                                      jmp [r31]
                                                            .L49:     mov 0,r10
                                                                      jmp [r31]

****************************************************************************************
****************************************************************************************

PASS THE END-OF-STRING MARKER IN AS A "char" PARAMETER

int strlen3 (const char *str, char eos)
{
  const char *start = str;
  while (*str != eos)
    str++;
  return str - start;
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_strlen3: shl 24,r7           _strlen3: shl 24,r7           _strlen3: ld.b 0[r6],r10
          sar 24,r7                     sar 24,r7                     shl 24,r7
          ld.b 0[r6],r10                ld.b 0[r6],r10                mov r7,r12
          cmp r7,r10                    mov r6,r11                    shl 24,r10
          be .L52                       cmp r7,r10                    sar 24,r12
          mov r6,r10                    be .L56                       cmp r7,r10
.L51:     add 1,r10           .L57:     add 1,r6                      be .L56
          ld.b 0[r10],r11               ld.b 0[r6],r10                mov r6,r10
          cmp r7,r11                    cmp r7,r10          .L53:     add 1,r10
          bne .L51                      bne .L57                      ld.b 0[r10],r11
          sub r6,r10          .L56:     mov r6,r10                    shl 24,r11
          jmp [r31]                     sub r11,r10                   sar 24,r11
.L52:     mov 0,r10                     jmp [r31]                     cmp r12,r11
          jmp [r31]                                                   bne .L53
                                                                      sub r6,r10
                                                                      jmp [r31]
                                                            .L56:     mov 0,r10
                                                                      jmp [r31]

****************************************************************************************
****************************************************************************************

PASS THE END-OF-STRING MARKER IN AS AN "int" PARAMETER

int strlen4 (const char *str, int eos)
{
  const char *start = str;
  while (*str != eos)
    str++;
  return str - start;
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_strlen4: ld.b 0[r6],r10      _strlen4: ld.b 0[r6],r10      _strlen4: ld.b 0[r6],r10
          cmp r7,r10                    mov r6,r12                    shl 24,r10
          be .L57                       cmp r7,r10                    sar 24,r10
          mov r6,r10                    be .L61                       cmp r7,r10
.L56:     add 1,r10           .L62:     add 1,r6                      be .L63
          ld.b 0[r10],r11               ld.b 0[r6],r10                mov r6,r10
          cmp r7,r11                    mov r10,r11         .L60:     add 1,r10
          bne .L56                      cmp r7,r11                    ld.b 0[r10],r11
          sub r6,r10                    bne .L62                      shl 24,r11
          jmp [r31]           .L61:     mov r6,r10                    sar 24,r11
.L57:     mov 0,r10                     sub r12,r10                   cmp r7,r11
          jmp [r31]                     jmp [r31]                     bne .L60
                                                                      sub r6,r10
                                                                      jmp [r31]
                                                            .L63:     mov 0,r10
                                                                      jmp [r31]

****************************************************************************************
****************************************************************************************
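For anyone who wants to repeat the comparison with their own compiler build, here are the four variants gathered into one self-contained C file. The only change from the functions discussed in this post is that the first one is renamed to “strlen1” (a made-up name) so that it doesn’t clash with the C library’s strlen. Treat it as a starting point for experiments rather than a benchmark.

```c
#include <stddef.h>

/* newlib's loop, renamed to avoid clashing with the C library. */
size_t strlen1(const char *str)
{
    const char *start = str;
    while (*str)
        str++;
    return str - start;
}

/* End-of-string marked with a non-zero constant. */
size_t strlen2(const char *str)
{
    const char *start = str;
    while (*str != 1)
        str++;
    return str - start;
}

/* End-of-string marker passed in as a "char" parameter. */
int strlen3(const char *str, char eos)
{
    const char *start = str;
    while (*str != eos)
        str++;
    return str - start;
}

/* End-of-string marker passed in as an "int" parameter. */
int strlen4(const char *str, int eos)
{
    const char *start = str;
    while (*str != eos)
        str++;
    return str - start;
}
```

Compiling this file to assembly (GCC’s -S option) at your usual optimization level and counting the instructions between each loop label and its “bne” gives the same kind of comparison as the listings above.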
ElmerPCFX wrote:
blitter wrote:
That is a cool trick! I hadn’t thought to investigate the in.* instructions to see what they actually do. I’ll have to use that in my projects now, thanks. 🙂
It’s a nice trick since Nintendo made the I/O address space just a copy of the normal address space … but note that you don’t save the extra cycle on multiple loads that you do with the “ld” instruction.
Do you mean grouping “ld” instructions together to speed up the data fetch pipeline? “in” doesn’t follow those rules?
blitter wrote:
Do you mean grouping “ld” instructions together to speed up the data fetch pipeline? “in” doesn’t follow those rules?
Yes.
And no, it doesn’t follow the same rules according to the instruction cycle timings in the V810 Architecture manual.