@elmerpcfx · Registered March 13, 2016 · Active 2 years, 8 months ago
40 Replies made

blitter wrote:
IMO the startup code that comes with 2.95 (and thereafter crept into version 4 and later) already does way too much. All the crt0.s file *must* do, from my own experiments, is initialize the registers, set up the data and bss sections, provide the vector table, and call main(). That’s it. The 2.95 crt0.s does a whole bunch of other possibly unnecessary stuff like clearing VRAM, clearing audio RAM, setting the column table, etc. With *maybe* the exception of the column table, these things should be done as necessary either at the beginning of main() or where appropriate.

Maybe the ROM info table doesn’t belong in crt0.s. It should be possible to create a separate .s file that when assembled just contains the ROM info table, and then place that in its proper location within the ROM at link time. In any case, I don’t know where this base crt0.s came from, but I think it could stand to be pared down quite a bit.

Thanks for the example of the cut-down crt0.S, it was very interesting to compare it to the crt0.S in VBJaEngine/GCC4.4.

I personally favor doing as-little-as-possible in crt0.S, and leaving non-critical startup functions to calls inside main().

Since you guys need a ROM header, I really suspect that there should be a separate “rom.S” project file so that folks can just change the bits that they need.

Having said which, I did notice some things that I found puzzling in the GCC4.4 (and your) crt0.S.

The VirtualBoy docs are pretty insistent that you wait 200us before accessing WRAM, and I don’t see that crt0.s is actually complying with that warning.

I can’t see that the linker script is actually mapping the .sdata/.sbss sections into the VirtualBoy’s WRAM, and I certainly don’t see the GP register being set to a “reasonable” value for using GP-relative addressing.

Unless I’m missing something, that means that all your C variable accesses are going to go through slow 32-bit loads, which seems like a terrible waste when you’ve got the capability for fast-access to 64KB of GP-relative variables.

Perhaps someone modified GCC so that ALL variable-access is GP-relative and I just missed it???

I can’t understand why the .data and .bss segments aren’t 4-byte-aligned so that crt0.S can clear stuff a word at a time rather than doing those slow byte copies.
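To illustrate the point about alignment, here is a minimal C sketch of word-at-a-time startup clearing and copying. It assumes the linker script pads the .data and .bss boundaries to 4-byte alignment; the function names `clear_words` and `copy_words` are hypothetical and are not taken from any existing crt0.S, where this work would normally be done in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: zero the range [start, end) one 32-bit word at a time.
 * Only valid if both bounds are 4-byte aligned, which is the assumption here. */
void clear_words(uint32_t *start, uint32_t *end)
{
    assert(((uintptr_t)start & 3u) == 0 && ((uintptr_t)end & 3u) == 0);
    while (start < end)
        *start++ = 0;
}

/* Hypothetical helper: copy an initialised-data image (e.g. ROM to RAM)
 * one 32-bit word at a time instead of byte by byte. */
void copy_words(uint32_t *dst, const uint32_t *src, size_t words)
{
    while (words--)
        *dst++ = *src++;
}
```

With aligned sections, each loop iteration moves four bytes with one load and one store, instead of four byte-sized loads and stores.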

The interrupt vectors in the GCC4.4 crt0.S seem a little overly-complex … all that loading and indirect jumping could just be replaced with a “jr” directly into a 4-byte executable vector at the end of WRAM, which could then “jr” into your program ROM, or just “reti”.

cr1901 wrote:
They’re u32s :P. I just called them void pointers b/c that’s really what they represent. https://github.com/cr1901/vbdemo/blob/master/src/drivers/timedriv.c#L110-L113

Thanks! I guess that I was expecting executable “jr” vectors since that’s the fastest.

cr1901 wrote:
This is a bit ironic, considering 2.95 can in theory be built with a K&R compiler. I recall there being a “make bootstrap” target in 2.95 that provides an alternate “Stage 1” for compilers that choke on the code?

There may well be some magic incantation and combination of barely-documented “configure” commands and environment variables to make it work … but I could only find recommendations to go back to a previous linux distribution that had a GCC 3 compiler.

I hate the GNU build process!

IIRC, my problems were less to do with the actual source code, and more to do with the build failing on things that weren’t really errors.

At least, until you get to the part of the build that wants to process the GCC documentation … which is just horribly broken and dependent upon very specific versions of various tools.

You create your interrupt handler in C, and then convert the handler’s addr to a void pointer and assign them to the desired external vector addresses.

void pointers??? They’re not declared as function pointers???
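For comparison, here is a minimal sketch of what a table of properly typed function pointers could look like. Everything in it is hypothetical: `irq_handler_t`, `irq_vectors`, `install_irq_handler`, `dispatch_irq`, and the slot count are illustrative names and values, not the actual vbdemo or libgccvb definitions, and on real hardware the table would live at the vector addresses that the startup code jumps through.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type: a function taking and returning nothing. */
typedef void (*irq_handler_t)(void);

/* Placeholder slot count, chosen only for this sketch. */
enum { IRQ_VECTOR_COUNT = 5 };

/* Hypothetical vector table; unset slots stay NULL. */
irq_handler_t irq_vectors[IRQ_VECTOR_COUNT];

volatile int frame_count;

/* Example handler written in C. */
void my_vblank_handler(void)
{
    frame_count++;
}

/* Store a handler in a slot; returns -1 if the slot is out of range. */
int install_irq_handler(unsigned slot, irq_handler_t handler)
{
    if (slot >= IRQ_VECTOR_COUNT)
        return -1;
    irq_vectors[slot] = handler;
    return 0;
}

/* Stand-in for what the startup code would do when an interrupt fires. */
void dispatch_irq(unsigned slot)
{
    assert(slot < IRQ_VECTOR_COUNT);
    if (irq_vectors[slot] != NULL)
        irq_vectors[slot]();
}
```

Declaring the slots as function pointers lets the compiler check that only functions with the right signature are installed, which a `void *` or `u32` slot cannot do.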

I’ll send you a PM.

cr1901 wrote:

I’m going to be perfectly honest: I’ve lost track of the number of GCC versions for VB there are lol.

From my POV, it’s all about the “dialect” of C that you want to program in.

GCC 2.9.5 is C89/ANSI-C.
GCC 4.7 is C99 with most of C11.

You can also see from the examples of the generated-code that I’ve shown, that GCC 4.7 is a little smarter than GCC 2.95 about moving some loop-invariant calculations outside the loop itself for speed.

I don’t know if the GCC 2.95 version has any problems, but it has been around for a long time, and it’s a “classic” good version of GCC that was used for a lot of game development in the early 2000s (with various patches).

OTOH … It’s getting really, really hard to compile a working GCC 2.95 anymore because modern linux toolchains barf on some of the early-GCC-specific code that’s in there. You pretty much need to find an old GCC 3.x compiler from somewhere.

The GCC 4.4 port seems to have a few problems with it, and so (by his own admission) does Dasi’s GCC 4.7 port.

The GCC 4.4 port is also producing some pretty inefficient code for some reason.

My GCC 4.7 port hasn’t received enough widespread testing yet to see what bugs I’ve introduced … but Alex Marshall’s “liberis” examples all work properly, and another user has ported a simple shoot-em-up to the PC-FX with no apparent problems.

IIRC, the startup code is more or less the same between all but the last one (in fact, I believe blitter’s even reuses the 2.9.5 file for this and relocations).

As I mentioned before, my interest is in the PC-FX, and not the VirtualBoy, and so the linker scripts and the startup code have been tailored to that platform.

It shouldn’t be hard to create a “VirtualBoy” patch that changes them into something that works better for the VB.

BTW … I did add the VirtualBoy’s custom Nintendo instructions to binutils.

If someone is interested in being the “go-to guy” for a VirtualBoy version of my patches, then I’d love to hear it.

AFAIK they’re stable and working, and I’m not planning on doing anything more to them for a while because I’ve got other stuff that needs to be done.

  • This reply was modified 8 years, 6 months ago by ElmerPCFX.

dasi wrote:

Yes, using the interrupt_handler function attribute.

I fixed the “interrupt_handler” to where it’s working again, although I’m not using the helper-functions anymore, because I really can’t see the point.

I could make the code a tiny bit smarter … but IMHO it’s already a little bit better than GCC’s V850 code, so any further work on it can wait.

************************************

volatile int __attribute__ ((zda)) zda_frame_count = 0;

__attribute__ ((interrupt_handler)) void my_irq1 (void)
{
  for (int i = 0; i < 100; i++)
    zda_frame_count++;
}

_my_irq1: add -4,sp
          st.w r1,0[sp]
          add -8,sp
          st.w r10,0[sp]
          movea 100,r0,r10
          st.w r11,4[sp]
.L7:      ld.w zdaoff(_zda_frame_count)[r0],r11
          add -1,r10
          add 1,r11
          st.w r11,zdaoff(_zda_frame_count)[r0]
          cmp 0,r10
          bne .L7
          ld.w 0[sp],r10
          ld.w 4[sp],r11
          add 8,sp
          ld.w 0[sp],r1
          add 4,sp
          reti

************************************

volatile int sda_frame_count = 0;

__attribute__3 void increment_sda_frame_count (void)
{
  sda_frame_count++;
}

__attribute__ ((interrupt_handler)) void my_irq2 (void)
{
  for (int i = 0; i < 100; i++)
    increment_sda_frame_count();
}

_increment_sda_frame_count:
          ld.w sdaoff(_sda_frame_count)[gp],r10
          add 1,r10
          st.w r10,sdaoff(_sda_frame_count)[gp]
          jmp [r31]

_my_irq2: add -4,sp
          st.w r1,0[sp]
          mov sp,r1
          addi -72,sp,sp
          st.w r29,-12[r1]
          st.w fp,-4[r1]
          movea 100,r0,r29
          mov r1,fp
          st.w r6,-72[r1]
          st.w r7,-68[r1]
          st.w r8,-64[r1]
          st.w r9,-60[r1]
          st.w r10,-56[r1]
          st.w r11,-52[r1]
          st.w r12,-48[r1]
          st.w r13,-44[r1]
          st.w r14,-40[r1]
          st.w r15,-36[r1]
          st.w r16,-32[r1]
          st.w r17,-28[r1]
          st.w r18,-24[r1]
          st.w r19,-20[r1]
          st.w r30,-16[r1]
          st.w lp,-8[r1]
.L3:      add -1,r29
          jal _increment_sda_frame_count
          cmp 0,r29
          bne .L3
          ld.w -4[fp],r1
          ld.w -72[fp],r6
          ld.w -68[fp],r7
          ld.w -64[fp],r8
          ld.w -60[fp],r9
          ld.w -56[fp],r10
          ld.w -52[fp],r11
          ld.w -48[fp],r12
          ld.w -44[fp],r13
          ld.w -40[fp],r14
          ld.w -36[fp],r15
          ld.w -32[fp],r16
          ld.w -28[fp],r17
          ld.w -24[fp],r18
          ld.w -20[fp],r19
          ld.w -16[fp],r30
          ld.w -12[fp],r29
          ld.w -8[fp],lp
          mov fp,sp
          mov r1,fp
          ld.w 0[sp],r1
          add 4,sp
          reti

************************************

FYI, I use a custom-modified version of Mednafen for debugging … basically it just uses a larger font in the debugger so that it’s more readable for folks with tired eyes.

It only supports a few platforms (PC Engine, PC-FX, and now VirtualBoy) … but if anyone is interested, I can add a link to it.

Here is the main VirtualBoy debugger screen …

Here is the VirtualBoy memory editor screen …

dasi wrote:

Welcome, and great work.

Thanks!

From what I remember, when a function is declared with the interrupt_handler attribute the compiler generates save_interrupt and restore_interrupt prolog/epilogs, which save/restore only four or five registers. What have you observed?

Yep, there are those calls, and if your function actually calls anything else that isn’t inlined, then the compiler has to save the LP … and that triggers the generation of calls to save_all_interrupt/restore_all_interrupt which save all the other registers.

Now, please remember that I have no idea about VirtualBoy programming, and that I’m more used to consoles that produce traditional TV-output … but in that world, you’ve got the hblank interrupt … which needs to be blindingly fast, and you’ve got the vblank interrupt … which usually does a lot of stuff and calls a lot of different things.

I can see that having the save_all_interrupt and restore_all_interrupt functions doesn’t really hurt when the compiler needs to use them, because you’re going to take a pretty big hit anyway with all of those registers.

But I don’t really understand the use of the basic save_interrupt and restore_interrupt functions … they just seem to slow down the interrupt-handling, and don’t save very much code space (just how many different interrupt-handler functions are used in a single program that cause you to be worried about a few bytes???).

Anyway … whatever … I guess that I should fix the compiler’s handling of the prolog/epilog expansion for the interrupt_handler functions.

The Good:

The new stack-frame layout is implemented (slightly changed from my original proposal), and R2 is now the permanent-frame-pointer instead of the compiler just using R29 whenever a frame-pointer is needed.

*****************************

GCC 1999-ABI V850 STACK FRAME

CALLER
          incoming-arg0
ap->      16-bytes-reserved

CALLEE
          saved-lp
          saved-??
fp->      saved-fp
          local-variables
          outgoing-arg?
          outgoing-arg0
sp->      16-bytes-reserved

*****************************

GCC 2016-ABI V810 STACK FRAME

CALLER
fp-> ap-> incoming-arg0

CALLEE
          saved-fp
          saved-lp
          saved-??
          local-variables
          outgoing-arg?
          outgoing-arg0
sp->      4-bytes-reserved

*****************************

“-mprolog-function” is working, but I’ve stopped it from being automatically-enabled whenever any optimization is requested.

The new stack frame layout reduces the code-size of the prolog functions so that there’s a good chance that they’ll stay in the V810’s instruction cache more often. Note: the new prolog functions always save the FP and the LP when they’re used.

A stack backtrace is now possible when either “-fno-omit-frame-pointer” or “-mprolog-function” is used.

Any C “leaf” functions (i.e. functions that don’t call other functions) will omit the prolog function if they don’t destroy any callee-saved register, and so small-fast-utility-code will still run as-fast-as-possible.

The NEC-standard register conventions are still the same, except for R2 now being the FP.

Any assembly language code that reads arguments off the stack will need to subtract 16 from their offset.

The Bad:

Any C “interrupt-handler” functions are probably broken at the moment, until I get around to fixing them.

Does anyone actually write interrupt-handlers in C???

The compiler generates some pretty slow register-saving code for them, so I sort-of assume that folks just write them in assembly. Am I wrong?

The Future (long term):

I’d like to add a few compiler intrinsics for some of the V810 opcodes, particularly the string opcodes and the in/out opcodes. That would allow the compiler to easily in-line some stuff that people have to drop into assembly to do.

It would also be worth contemplating a change to the standard register usage so that R26-R29 are not callee-saved registers, which would save the compiler from having to preserve them on the stack whenever someone wants to use a string opcode. But doing so would break all current assembly-language code, and I suspect that people wouldn’t want that. “Yes”, the change in stack-offset in the new ABI also breaks things … but that’s an easy thing to find/fix. Changing ALL the registers would be a much more complicated thing to fix.

You don’t seem to understand just quite how primitive these old hardware systems are … modern systems are approximately 1000 times more powerful.

You’ve chosen to learn to program with the crutch of a modern high-level language doing a lot of work behind-the-scenes, and with somebody else having done 60% of the work for you by having a huge pre-built library (SpriteKit, Unity, etc).

That’s just the reality of modern game development, and I can understand why it would seem to be a good place to start.

But these old machines have absolutely none of this … and the VirtualBoy is never going to run Swift unless you personally decide to try to port the Swift compiler over to the V810, at which point you’d find that the Swift compiler’s runtime startup code would take up the complete memory of the VB and leave you with no space left for your actual game.

If you want to write something on the VB, then you’ll have to learn the low-level details of how a computer really works, and to do a good job, you’ll have to be reasonably-comfortable with V810 assembly language (because there’s no source-level debugging).

Part of the pleasure of these old machines is precisely that it’s a challenge to work-around the limited hardware to produce something interesting and fun.

If you really do have such a desire and passion for the VB-specifically that you want to continue, then stop what you’re doing with Swift and SpriteKit; they’ll teach you nothing that’s relevant to an old machine like this.

Instead just download VBDE and VBJaEngine, and find the Nintendo VB SDK docs online, the V810 docs online, and a copy of Kernighan and Ritchie (https://en.wikipedia.org/wiki/The_C_Programming_Language).

Be warned … it’s a challenging path to follow, but at the end-of-the-day, you’ll learn some basic skills that will translate into your future game development.

If that sounds more “grueling” than “fun”, and you’re already considering the “benefit : reward ratio”, then I really, really, really suggest that you take your own advice that “you can create a game that looks similar enough with Sprite Kit or Unity” and just buy an Oculus Rift and make a pseudo-3D game in 4-shades of red.

That’ll avoid a lot of frustration on your part, and will be just as “retro” as most of the current fad of 2D games that pretend to be NES-like.

BTW, most commercial SNES and Genesis games that I know of were done by small teams of 2 or 3 guys, with maybe an extra 1 to add the menus at the end of development, and a musician for a small part of the time.

The credits list always looks longer because everyone-and-the-dog-that-walked-into-the-office-one-day gets listed.

blitter wrote:

Do you mean grouping “ld” instructions together to speed up the data fetch pipeline? “in” doesn’t follow those rules?

Yes.

And no, it doesn’t follow the same rules according to the instruction cycle timings in the V810 Architecture manual.

I think that I have figured-out how to let GCC know that the “ld” instruction sign-extends variables into an int.

Here are a couple of examples of how it affects the code with newlib’s “strlen” function, and then some variations on it.

The variations show how the generated code changes when things get a little bit more complex when modifying “strlen” to change the comparison so that the compiler can’t just short-cut the check for zero.

The thing to pay particular attention to is the number of instructions in the inner loop.

It shows, again, that if you choose to use C on a processor like the V810, then there are definitely tricks to know that will improve the code-generation.

****************************************************************************************
****************************************************************************************

ORIGINAL FUNCTION FROM NEWLIB 2.2.0

size_t strlen (const char *str)
{
  const char *start = str;
  while (*str)
    str++;
  return str - start;
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_strlen:  ld.b 0[r6],r10      _strlen:  ld.b 0[r6],r10      _strlen:  ld.b 0[r6],r10
          cmp 0,r10                     mov r6,r11                    shl 24,r10
          be .L42                       cmp r0,r10                    sar 24,r10
          mov r6,r10                    be .L46                       be .L39
.L41:     add 1,r10           .L47:     add 1,r6                      mov r6,r10
          ld.b 0[r10],r11               ld.b 0[r6],r10      .L40:     add 1,r10
          cmp 0,r11                     cmp r0,r10                    ld.b 0[r10],r11
          bne .L41                      bne .L47                      shl 24,r11
          sub r6,r10          .L46:     mov r6,r10                    bne .L40
          jmp [r31]                     sub r11,r10                   sub r6,r10
.L42:     mov 0,r10                     jmp [r31]           .L39:     jmp [r31]
          jmp [r31]


****************************************************************************************
****************************************************************************************

MARK THE END-OF-STRING WITH A NON-ZERO CONSTANT

size_t strlen2 (const char *str)
{
  const char *start = str;
  while (*str != 1)
    str++;
  return str - start;
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_strlen2: ld.b 0[r6],r10      _strlen2: ld.b 0[r6],r10      _strlen2: ld.b 0[r6],r11
          cmp 1,r10                     mov r6,r11                    shl 24,r11
          be .L47                       cmp 1,r10                     sar 24,r11
          mov r6,r10                    be .L51                       cmp 1,r11
.L46:     add 1,r10           .L52:     add 1,r6                      be .L49
          ld.b 0[r10],r11               ld.b 0[r6],r10                mov r6,r10
          cmp 1,r11                     cmp 1,r10           .L46:     add 1,r10
          bne .L46                      bne .L52                      ld.b 0[r10],r11
          sub r6,r10          .L51:     mov r6,r10                    shl 24,r11
          jmp [r31]                     sub r11,r10                   sar 24,r11
.L47:     mov 0,r10                     jmp [r31]                     cmp 1,r11
          jmp [r31]                                                   bne .L46
                                                                      sub r6,r10
                                                                      jmp [r31]
                                                            .L49:     mov 0,r10
                                                                      jmp [r31]


****************************************************************************************
****************************************************************************************

PASS THE END-OF-STRING MARKER IN AS A "char" PARAMETER

int strlen3 (const char *str, char eos)
{
  const char *start = str;
  while (*str != eos)
    str++;
  return str - start;
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_strlen3: shl 24,r7           _strlen3: shl 24,r7           _strlen3: ld.b 0[r6],r10
          sar 24,r7                     sar 24,r7                     shl 24,r7
          ld.b 0[r6],r10                ld.b 0[r6],r10                mov r7,r12
          cmp r7,r10                    mov r6,r11                    shl 24,r10
          be .L52                       cmp r7,r10                    sar 24,r12
          mov r6,r10                    be .L56                       cmp r7,r10
.L51:     add 1,r10           .L57:     add 1,r6                      be .L56
          ld.b 0[r10],r11               ld.b 0[r6],r10                mov r6,r10
          cmp r7,r11                    cmp r7,r10          .L53:     add 1,r10
          bne .L51                      bne .L57                      ld.b 0[r10],r11
          sub r6,r10          .L56:     mov r6,r10                    shl 24,r11
          jmp [r31]                     sub r11,r10                   sar 24,r11
.L52:     mov 0,r10                     jmp [r31]                     cmp r12,r11
          jmp [r31]                                                   bne .L53
                                                                      sub r6,r10
                                                                      jmp [r31]
                                                            .L56:     mov 0,r10
                                                                      jmp [r31]


****************************************************************************************
****************************************************************************************

PASS THE END-OF-STRING MARKER IN AS AN "int" PARAMETER

int strlen4 (const char *str, int eos)
{
  const char *start = str;
  while (*str != eos)
    str++;
  return str - start;
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_strlen4: ld.b 0[r6],r10      _strlen4: ld.b 0[r6],r10      _strlen4: ld.b 0[r6],r10
          cmp r7,r10                    mov r6,r12                    shl 24,r10
          be .L57                       cmp r7,r10                    sar 24,r10
          mov r6,r10                    be .L61                       cmp r7,r10
.L56:     add 1,r10           .L62:     add 1,r6                      be .L63
          ld.b 0[r10],r11               ld.b 0[r6],r10                mov r6,r10
          cmp r7,r11                    mov r10,r11         .L60:     add 1,r10
          bne .L56                      cmp r7,r11                    ld.b 0[r10],r11
          sub r6,r10                    bne .L62                      shl 24,r11
          jmp [r31]           .L61:     mov r6,r10                    sar 24,r11
.L57:     mov 0,r10                     sub r12,r10                   cmp r7,r11
          jmp [r31]                     jmp [r31]                     bne .L60
                                                                      sub r6,r10
                                                                      jmp [r31]
                                                            .L63:     mov 0,r10
                                                                      jmp [r31]


****************************************************************************************
****************************************************************************************

blitter wrote:

That is a cool trick! I hadn’t thought to investigate the in.* instructions to see what they actually do. I’ll have to use that in my projects now, thanks. 🙂

It’s a nice trick since Nintendo made the I/O address space just a copy of the normal address space … but note that you don’t save the extra cycle on multiple loads that you do with the “ld” instruction.

Because the V810 sign-extends any constants for math and comparison, I suspect that it’s still probably best to just use signed variables, rather than unsigned variables wherever possible.

blitter wrote:

So, while arithmetic operations probably should operate on 32-bit values for best performance, is it efficient to load 32-bit values from RAM/ROM on a 16-bit wide data bus (assuming this is how the VB is configured)?

OK, I found a copy of the SDK (which is just the docs) online and confirmed that the VB is using a 16-bit data bus.

Ouch! Nintendo really wanted to make things difficult for their developers, didn’t they?

That has a huge effect on everything … most particularly the importance of running code from the instruction-cache as much as possible.

The compiler doesn’t really seem to understand that “ld.*” is automatically sign-extending a 16-bit/8-bit read from memory.

The compiler doesn’t know that it can use “in.*” on the VirtualBoy to zero-extend reads from memory (that trick won’t work on the PC-FX).

That means that any code that does arithmetic on 16-bit/8-bit values is usually going to generate one or more extra instructions to sign-extend/mask the values when it reads them.

That is 4-bytes of code that are going to take 1 or 2 cycles to execute, and require 2 memory reads, usually from ROM, and potentially with 2 wait-states per read.

That is going to be no-better than the extra 2-cycle memory-read to get the high 16-bits of a 32-bit variable, and quite-possibly worse.

So I think that I’d still recommend that folks stick with 32-bit variables in C as much as possible, but it’s definitely a less clear situation than it is on the PC-FX, and I’d suggest that folks actually look at the assembly code that the compiler generates in order to see what it’s doing.

If you’re programming in assembly, then you can just use ld.h/in.h, and you can write efficient code because you have a better understanding of the CPU architecture and the VirtualBoy than the compiler does.

BTW … the “advice” may change in the future if I can get the compiler to understand that “ld.*” is automatically sign-extending the value, and that it doesn’t need to generate its own code to do it.

But that won’t apply to unsigned variables, which are still going to be masked.
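To make the sign-extend/mask steps concrete, here is a minimal C sketch of what those extra instructions compute. The function names are hypothetical. `sign_extend8` gives the same result as the “shl 24 / sar 24” pair in the listings (written with XOR and subtract so that it does not depend on how a C compiler shifts negative values), and `zero_extend8` corresponds to an “andi 255” mask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: treat the low 8 bits of a register as a signed char
 * and widen it to 32 bits. Same result as "shl 24" followed by "sar 24". */
int32_t sign_extend8(uint32_t reg)
{
    uint32_t low = reg & 0xFFu;
    return (int32_t)(low ^ 0x80u) - 0x80;
}

/* Hypothetical helper: treat the low 8 bits as unsigned.
 * Same result as "andi 255". */
uint32_t zero_extend8(uint32_t reg)
{
    return reg & 0xFFu;
}

/* Small self-check showing how the two interpretations differ. */
void extension_demo(void)
{
    assert(sign_extend8(0x000000FFu) == -1);    /* 0xFF as signed char   */
    assert(zero_extend8(0x000000FFu) == 255u);  /* 0xFF as unsigned char */
    assert(sign_extend8(0x1234567Fu) == 127);   /* upper bits are ignored */
}
```

If the value arrived in the register already sign-extended, the first helper would be redundant work, which is the redundancy described above for signed 8-bit and 16-bit reads.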

Whatever happens … it still goes to show that the VirtualBoy is another one of the old machines where an assembly-language programmer can generate better code than a compiler.

I took a quick look at the libgccvb source code, and was surprised to see so many uses of “u8” and “u16” in the code.

The V810 CPU was designed to handle 32-bit variables … and it doesn’t do any arithmetic operations on 16-bit or 8-bit values.

That means that the compiler needs to do a lot of masking/sign-extending when it’s asked to deal with 16-bit or 8-bit variables, just so that it keeps the results correct within the limits of 16-bit or 8-bit wraparound.

You really should be using “int” and “unsigned” as much as possible, and avoid “short” and “char” variables.
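As a small illustration of why the narrow types cost extra instructions, the sketch below models a 32-bit register holding a “u16” counter. The helper names are hypothetical. Because C requires a 16-bit unsigned counter to wrap from 65535 back to 0, a CPU with only 32-bit arithmetic has to mask after the add, which is the “andi 65535” that shows up in the u16 listings; a plain “unsigned” counter needs only the add.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;

/* Hypothetical model of incrementing a u16 kept in a 32-bit register:
 * "add 1" followed by "andi 65535" to keep the 16-bit wraparound. */
uint32_t increment_as_u16(uint32_t reg)
{
    return (reg + 1u) & 0xFFFFu;
}

/* Hypothetical model of incrementing a 32-bit unsigned: just "add 1". */
uint32_t increment_as_unsigned(uint32_t reg)
{
    return reg + 1u;
}

/* Self-check: the masked version matches real u16 behaviour. */
void narrow_counter_demo(void)
{
    u16 i = 65535;
    i++;                                   /* wraps to 0, as C requires */
    assert(i == 0);
    assert(increment_as_u16(65535u) == 0u);
    assert(increment_as_unsigned(65535u) == 65536u);
}
```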

I thought that it would be interesting to see how the different GCC compiler versions compile a couple of simple C functions.

In each case, the original libgccvb version is first, and then 1 or 2 versions replacing the “u16” and “u8” variables with “unsigned” instead.

It seems strange to me that GCC 4.4.2 is doing such a relatively-poor job compared to GCC 2.9.5 or GCC 4.7.4, I wonder what went wrong?

All examples are compiled with “-O2 -fomit-frame-pointer”.


****************************************************************************************
****************************************************************************************

void copymem (u8* dest, const u8* src, u16 num)
{
  u16 i;
  for (i = 0; i < num; i++) {
    *dest++ = *src++;
  }
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_copymem: andi 65535,r8,r8    _copymem: andi 65535,r8,r8    _copymem: andi 65535,r8,r8
          be .L1                        mov 0,r10                     be .L4
          addi -1,r8,r11                cmp r8,r10                    mov 0,r10
          andi 65535,r11,r11            bnl .L4             .L3:      mov r7,r11
          add 1,r11           .L6:      add 1,r10                     add r10,r11
          add r6,r11                    ld.b 0[r7],r11                ld.b 0[r11],r12
.L3:      ld.b 0[r7],r10                andi 65535,r10,r10            mov r6,r11
          add 1,r7                      add 1,r7                      add r10,r11
          st.b r10,0[r6]                st.b r11,0[r6]                add 1,r10
          add 1,r6                      add 1,r6                      st.b r12,0[r11]
          cmp r11,r6                    cmp r8,r10                    andi 65535,r10,r11
          bne .L3                       bl .L6                        cmp r11,r8
.L1:      jmp [r31]           .L4:      jmp [r31]                     bh .L3
                                                            .L4:      jmp [r31]

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********


****************************************************************************************
****************************************************************************************

void copymem2 (u8* dest, const u8* src, unsigned num)
{
  unsigned i;
  for (i = 0; i < num; i++) {
    *dest++ = *src++;
  }
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_copymem2:mov r6,r11          _copymem2:mov 0,r11           _copymem2:cmp r0,r8
          add r8,r11                    cmp r8,r11                    be .L10
          cmp 0,r8                      bnl .L10                      mov 0,r10
          be .L7              .L12:     ld.b 0[r7],r10      .L9:      mov r7,r11
.L11:     ld.b 0[r7],r10                add 1,r11                     add r10,r11
          add 1,r7                      add 1,r7                      ld.b 0[r11],r12
          st.b r10,0[r6]                st.b r10,0[r6]                mov r6,r11
          add 1,r6                      add 1,r6                      add r10,r11
          cmp r11,r6                    cmp r8,r11                    st.b r12,0[r11]
          bne .L11                      bl .L12                       add 1,r10
.L7:      jmp [r31]           .L10:     jmp [r31]                     cmp r10,r8
                                                                      bh .L9
                                                            .L10:     jmp [r31]

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********


****************************************************************************************
****************************************************************************************

void addmem (u8* dest, const u8* src, u16 num, u8 offset)
{
  u16 i;
  for (i = 0; i < num; i++) {
    *dest++ = (*src++ + offset);
  }
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_addmem:  andi 65535,r8,r8    _addmem:  andi 65535,r8,r8    _addmem:  andi 65535,r8,r8
          andi 255,r9,r9                mov 0,r11                     andi 255,r9,r9
          cmp 0,r8                      andi 255,r9,r9                cmp r0,r8
          be .L13                       cmp r8,r11                    be .L20
          addi -1,r8,r11                bnl .L22                      mov 0,r10
          andi 65535,r11,r11  .L24:     mov r9,r10          .L19:     mov r7,r11
          add 1,r11                     add 1,r11                     add r10,r11
          add r6,r11                    ld.b 0[r7],r12                ld.b 0[r11],r12
.L15:     ld.b 0[r7],r10                andi 65535,r11,r11            mov r6,r11
          add 1,r7                      add r12,r10                   add r10,r11
          add r9,r10                    add 1,r7                      add r9,r12
          st.b r10,0[r6]                st.b r10,0[r6]                add 1,r10
          add 1,r6                      add 1,r6                      st.b r12,0[r11]
          cmp r11,r6                    cmp r8,r11                    andi 65535,r10,r11
          bne .L15                      bl .L24                       cmp r11,r8
.L13:     jmp [r31]           .L22:     jmp [r31]                     bh .L19
                                                            .L20:     jmp [r31]

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********


****************************************************************************************
****************************************************************************************

void addmem2 (u8* dest, const u8* src, unsigned num, u8 offset)
{
  unsigned i;
  for (i = 0; i < num; i++) {
    *dest++ = (*src++ + offset);
  }
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_addmem2: mov r6,r11          _addmem2: mov 0,r12           _addmem2: andi 255,r9,r9
          andi 255,r9,r9                andi 255,r9,r9                cmp r0,r8
          add r8,r11                    cmp r8,r12                    be .L20
          cmp 0,r8                      bnl .L22                      mov 0,r10
          be .L18             .L24:     mov r9,r10          .L19:     mov r7,r11
.L22:     ld.b 0[r7],r10                ld.b 0[r7],r11                add r10,r11
          add 1,r7                      add 1,r12                     ld.b 0[r11],r12
          add r9,r10                    add r11,r10                   mov r6,r11
          st.b r10,0[r6]                add 1,r7                      add r10,r11
          add 1,r6                      st.b r10,0[r6]                add r9,r12
          cmp r11,r6                    add 1,r6                      st.b r12,0[r11]
          bne .L22                      cmp r8,r12                    add 1,r10
.L18:     jmp [r31]                     bl .L24                       cmp r10,r8
                              .L22:     jmp [r31]                     bh .L19
                                                            .L20:     jmp [r31]

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********


****************************************************************************************
****************************************************************************************

void addmem3 (u8* dest, const u8* src, unsigned num, unsigned offset)
{
  unsigned i;
  for (i = 0; i < num; i++) {
    *dest++ = (*src++ + offset);
  }
}

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********

_addmem3: cmp 0,r8            _addmem3: mov 0,r12           _addmem3: cmp r0,r8
          be .L24                       cmp r8,r12                    be .L25
          andi 255,r9,r9                bnl .L28                      andi 255,r9,r9
          add r6,r8           .L30:     mov r9,r10                    mov 0,r10
.L26:     ld.b 0[r7],r10                ld.b 0[r7],r11      .L24:     mov r7,r11
          add 1,r7                      add 1,r12                     add r10,r11
          add r9,r10                    add r11,r10                   ld.b 0[r11],r12
          st.b r10,0[r6]                add 1,r7                      mov r6,r11
          add 1,r6                      st.b r10,0[r6]                add r10,r11
          cmp r8,r6                     add 1,r6                      add r9,r12
          bne .L26                      cmp r8,r12                    st.b r12,0[r11]
.L24:     jmp [r31]                     bl .L30                       add 1,r10
                              .L28:     jmp [r31]                     cmp r10,r8
                                                                      bh .L24
                                                            .L25:     jmp [r31]

********* GCC 4.7.4 ******************* GCC 2.9.5 ******************* GCC 4.4.2 ********
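
One detail worth pointing out in the addmem3 listings: offset is declared as unsigned, yet the GCC 4.7.4 and GCC 4.4.2 columns still emit andi 255,r9,r9. That is safe because the result goes out through a byte store, so only the low 8 bits of the sum survive. The sketch below checks that identity on the host; the function names are invented for the example.

```c
#include <assert.h>

typedef unsigned char u8;      /* assumed to match the u8 used in the listings */

/* What the C source asks for: add the full-width offset, truncate on store. */
u8 store_full (u8 src, unsigned offset)
{
  return (u8)(src + offset);
}

/* What the andi 255,r9,r9 versions do: mask the offset once, then add. */
u8 store_masked (u8 src, unsigned offset)
{
  return (u8)(src + (offset & 255u));
}

/* Compare the two for every source byte and a spread of offsets. */
void check_equivalence (void)
{
  unsigned s, o;

  for (s = 0; s < 256; s++)
    for (o = 0; o < 0x20000u; o += 0x101u)
      assert(store_full((u8)s, o) == store_masked((u8)s, o));
}
```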


****************************************************************************************
****************************************************************************************

Here’s my proposal for a new stack-frame layout, together with the one that everyone is using now, and the “new” GCC ABI from 2010.

Basically … the “old” ABI reserved 16-bytes of stack space for storing the first-4 function arguments just-in-case you call a function with variable-arguments.

In the years since that time, “stdarg.h” has replaced “varargs.h”, and that space is no longer needed.

So the V850 guys got rid of it in 2010.
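
As a reminder of what the “stdarg.h” style looks like, here is a minimal sketch (sum_ints is a made-up example function). The function is declared with a prototype, so the callee itself knows that it is variadic and can spill its own register arguments if it needs them in memory. That is the usual explanation for why the caller-side reserved area became unnecessary, but treat it as an assumption rather than a statement about the GCC internals.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical example: add up 'count' int arguments. */
int sum_ints (int count, ...)
{
  va_list ap;
  int total = 0;
  int i;

  assert(count >= 0);
  va_start(ap, count);         /* start after the last named argument */
  for (i = 0; i < count; i++)
    total += va_arg(ap, int);  /* fetch the next int argument */
  va_end(ap);
  return total;
}
```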

I’m proposing adding back 4-bytes to use for storing the Frame Pointer, so that backtraces are possible.

Reordering the output of the “saved” registers should also radically reduce the amount of space used by function-prologues … which should help speed them up by keeping them in the instruction cache.

Any comments?


*****************************

GCC 1999-ABI V850 STACK FRAME

CALLER
          incoming-arg0
ap->      16-bytes-reserved

CALLEE
          saved-lp
          saved-??
fp->      saved-fp
          local-variables
          outgoing-arg?
          outgoing-arg0
sp->      16-bytes-reserved

*****************************

GCC 2010-ABI V850 STACK FRAME

CALLER
ap->      incoming-arg0

CALLEE
          saved-lp
          saved-??
fp->      saved-fp
          local-variables
          outgoing-arg?
sp->      outgoing-arg0

*****************************

GCC 2016-ABI V810 STACK FRAME

CALLER
          incoming-arg0
ap-> fp-> saved-fp

CALLEE
          saved-lp
          saved-??
          local-variables
          outgoing-arg?
          outgoing-arg0
sp->      4-bytes-reserved

*****************************
  • This reply was modified 8 years, 7 months ago by ElmerPCFX.

blitter wrote:
Yes, and yes. It has been quite a while but as I recall either r29 was ignored when I specified it in the clobber list or I got some kind of error.

Thank you, that’s the kind of information that I can use!

So, if I’m understanding you correctly, you are using GCC’s “inline-assembly” to do the string instructions, rather than a separate assembly function. Is that correct?

Anything I’m doing from non-inline assembly the compiler should not touch, period, other than to assemble it. But for what it’s worth I use -fomit-frame-pointer in my Makefiles. Again, it’s been a while so I don’t remember the exact problem moving the frame register solved, but it was definitely related to the bitstring instructions.

Thanks, again. If you’re using “-fomit-frame-pointer” then the compiler should be using R29 as a general-purpose callee-saved register.

If it doesn’t let you “clobber” it in inline assembly, just because it *might* be used as a frame-pointer … then that’s really helpful information.

Frame pointers and backtraces in my experience are pretty useless in VB homebrew since source-level debugging is pretty nonexistent above the assembly code level.

Ah … on the contrary … IMHO that’s exactly when a good backtrace is the most-useful.

If you’ve got a good source-level debugger with full DWARF information about the process, then it doesn’t need a frame-pointer … it already has all the information from the compiler-emitted debugging-info.

A good “backtrace”, complete with actual function names, can be done on the target hardware, without a debugger, if the frame-pointer exists, and if the stack-frame-layout is sensible.

This lets you get the “context” of any error message, and lets you implement sophisticated in-engine memory debugging.
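
To make that concrete, here is a minimal host-side sketch of the walk itself. It assumes a frame layout like the one I proposed, where each frame-pointer points at the saved-fp word and the saved-lp word sits directly below it. The stack is modelled as an array of words and the frame-pointers as indexes into it, purely so that the example can run on a PC; backtrace_walk and its parameters are hypothetical names. Turning the collected return addresses into function names would need a symbol table, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk a chain of saved frame-pointers and collect the return addresses.
   stack[fp]     holds the caller's fp (0 marks the outermost frame),
   stack[fp - 1] holds the saved lp (return address).
   Returns the number of return addresses written to ret_addrs. */
size_t backtrace_walk (const uint32_t *stack, size_t words, uint32_t fp,
                       uint32_t *ret_addrs, size_t max)
{
  size_t n = 0;

  assert(stack != NULL && ret_addrs != NULL);
  while (fp >= 1 && fp < words && n < max) {
    uint32_t caller_fp = stack[fp];    /* saved-fp */

    ret_addrs[n++] = stack[fp - 1];    /* saved-lp, one word below */
    if (caller_fp <= fp)               /* 0 ends the chain; a caller frame
                                          must be higher up, else corrupt */
      break;
    fp = caller_fp;
  }
  return n;
}
```

The “caller must be higher up” test is a cheap sanity check that stops the walk if the chain has been trashed, which is exactly the situation where you want a backtrace in the first place.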

It really helps to have extra RAM available when these things are enabled … which is why Nintendo (and everyone else) shipped their “development-kits” with more RAM than the “retail” kits (up until the last generation, when things got more complex).

You can simulate an environment like this in Mednafen just by modifying the amount of memory that the virtual VirtualBoy sees (it’s a source-level hack to Mednafen).

It’s not useful for “final-testing”, but it’s a godsend for 90% of development.

I do all my VB dev in Mac OS X. Specifically, I build the toolchain in 10.6 with an older version of GCC installed via macports. The build products continue to work in the latest version of OS X El Capitan, plus as a bonus I can build PPC versions too.

That’s cool to know. I mainly run Windows on my MacPro, but I think that I may still have a 10.6.8 partition somewhere.

I also don’t know if I’ve mentioned this anywhere else here, but I do *not* know GCC’s internals. At all. So, my patches are more hacks or bandaids to work around problems I encounter than anything else. I share them just in case they might help other devs, but please don’t accept them as attempts to properly fix any problems (though if I happen to fix anything then AFAIC that’s purely a coincidence. 🙂 )

No problem … the point is that you’ve tried to improve things, and so did M.K. when the GCC 4.4.2 patches were created. That’s wonderful!

It took me about 6 months of agony to get the GCC 2.9.5 patches updated to GCC 4.7.4, and that included lots of flailing-around inside complex source code that I barely understood … and still mostly-don’t.

HorvatM wrote:
I don’t see a need for a frame pointer, and neither did NEC apparently.

NEC didn’t mandate a specific register for the frame-pointer … there’s a huge difference between that and saying that they didn’t see the need for a frame-pointer.

Just because you don’t see the need for frame pointers and backtraces doesn’t change the fact that I do, and so do a huge proportion of experienced C/C++ programmers. In-system backtraces are useful for a whole bunch of things.

I don’t know what compiler Nintendo shipped with the VirtualBoy, but it was probably the Green Hills suite.

Which supports frame pointers, as does GCC … and every C compiler that I know of. Sometimes the compiler absolutely needs to use a frame-pointer … which GCC does automatically, even when you use the “omit-frame-pointer” option.

Just because the guys that added V850 support to GCC back in the 1990’s goofed on the stack order of the saved registers and made the frame-pointer unusable for doing a backtrace, doesn’t mean that we need to keep following that mistake in 2016.

And you can already access a 64K range with a single register by using negative displacements. Commercial VB games set register 4 to 0x05008000 and use it to access global variables anywhere in the WRAM (which is 64K long).

I didn’t know that the VirtualBoy only had 64KB RAM, thanks!

So you guys don’t need anything more than the existing SDA segment (gp-register-relative) support, that’s good to know.
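
The arithmetic behind that is easy to check. With the register sitting in the middle of the 64KB region, a signed 16-bit displacement reaches 32KB in each direction. The sketch below uses the 0x05008000 value quoted above; the WRAM range 0x05000000 to 0x0500FFFF is inferred from that value and the 64K size, and gp_displacement is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GP_VALUE 0x05008000u   /* value quoted above for register 4 */

/* Returns 1 and stores the signed 16-bit displacement if addr can be
   reached from GP_VALUE with a single load/store, otherwise returns 0. */
int gp_displacement (uint32_t addr, int16_t *disp)
{
  int64_t delta = (int64_t)addr - (int64_t)GP_VALUE;

  assert(disp != NULL);
  if (delta < INT16_MIN || delta > INT16_MAX)
    return 0;
  *disp = (int16_t)delta;
  return 1;
}
```

The first byte of the region comes out as displacement -32768 and the last byte as +32767, so the whole 64KB is covered and nothing beyond it.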

But the PC-FX has 2MB RAM, so I could use something a bit more sophisticated.

And you’re ignoring the whole point of a thread-local-variable area … which is another reason to move the TDA segment to R5 on the V810 instead of R30 on the V850.

Bitstring instructions, probably.

I’m sorry, but that’s a completely unhelpful answer.

Sure … he’s trying to move the hard-frame-pointer away from the registers that are used by the bitstring instructions.

Why? Are you guys doing bitstring instructions in inline-assembly within the C code? If so … are you telling the compiler what registers you clobber?

Are you doing bitstring instructions from assembly? … If so, it makes little difference whether the compiler puts its frame-pointer in R29 … especially since you’re probably compiling with “omit-frame-pointer” anyway.

Do you realize the effect that moving that definition has on the compiled-code when the compiler does need a frame pointer … especially if you’re using function-prologues?

blitter: Can I ask what your thinking was behind your 2011-11-23 patch to change the HARD_FRAME_POINTER_REGNUM from 29 to 25?

Do you have an example of whatever the problem was that this was designed to fix?

Hi blitter,

It’s always good to see someone else that’s comfortable in assembly-language.

blitter wrote:
Please check out my thread here http://www.planetvb.com/modules/newbb/viewtopic.php?topic_id=5252 about GCC 4 and generating PC-relative jumps.

I looked at the thread, and it’s pretty obvious that the problem is in the symbol-relocation code in binutils.

A quick comparison of the binutils 2.20.1 patch, my binutils 2.23.2 patch, and the current V850 code shows that there’s a bug in the binutils 2.20.1 patch in the R_V810_26_PCREL relocation.

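insn |= (((addend & 0xfffe) << 16) | ((addend & 0x3f0000) >> 16));

should be

insn |= (((addend & 0xfffe) << 16) | ((addend & 0x3ff0000) >> 16));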

The patch that you’re using loses the top 4 bits of the 26-bit relative address.

With the bug, the maximum relocation is 0x003fffff … which corresponds nicely to your observed bad-offset of 0x00400000.
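
Here is a small host-side sketch of that arithmetic, comparing the two masks (0x3f0000 and 0x3ff0000). The pack expression is the one from the relocation code; unpack_disp is my own inverse of it, written on the assumption that the upper displacement bits land in the low 10 bits of the instruction word and the lower bits in the top halfword. Only positive displacements are exercised, and both function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a (positive, halfword-aligned) displacement into the instruction
   word, using hi_mask to select the upper displacement bits. */
uint32_t pack_disp (uint32_t addend, uint32_t hi_mask)
{
  assert((addend & 1u) == 0);
  return ((addend & 0xfffeu) << 16) | ((addend & hi_mask) >> 16);
}

/* Assumed inverse: rebuild the 26-bit displacement from the word. */
uint32_t unpack_disp (uint32_t insn)
{
  return ((insn >> 16) & 0xfffeu) | ((insn & 0x3ffu) << 16);
}
```

With the 0x3f0000 mask, a displacement of 0x003ffffe survives the round trip, but 0x00400000 comes back as 0, so the jump lands 0x00400000 away from where it should. With the 0x3ff0000 mask, 0x00400000 comes back intact.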

I don’t know-for-sure that fixing that will make your problem go away, but I think that it’s pretty likely.

I don’t know how-easy it is for you to recompile binutils and test that … I’ve had a lot of trouble compiling old versions of binutils and GCC with newer versions of the GNU build tools.

I’m using msys2, which keeps very current on all the latest versions of the GNU tools, and I need a bunch of extra patches to compile binutils 2.23.2, and any GCC that’s older than 4.7.

  • This reply was modified 8 years, 7 months ago by ElmerPCFX.

RunnerPack wrote:
I’m only barely “assembly-capable” on the v810, but I’ll take a shot at answering these.

Thanks!

There really is no “the VirtualBoy SDK” due to a lot of fragmenting, but, TMK, most of the existing code out there avoids direct access to registers except in the necessary setting of hardware ports for control of the peripheral hardware. If you want to make use of these registers for a specific purpose (especially if it means improving memory usage of generated code), I’m sure existing projects could be made compatible quite easily.

Ah … I’m going by the Nintendo Seminar docs, and the PC-FX SDK docs, and the GCC docs … all of which follow NEC’s V810 Architecture Manual, where R2 is reserved as the “Handler Stack Pointer”, and R5 is reserved as the “Text Pointer” (which means the address of the start of the program code).

Now, the PC-FX BIOS and the official SDK libraries (which I’m going to ignore), never actually use either of these registers, they’re just wasted.

Newer versions of GCC (well after 2.95, I think) added an option “-mapp-regs” that lets the compiler use these 2 registers for the code that it generates.

I’d be quite surprised if anyone here is relying on that option.

I have my own ideas of how I’d like to use those registers.

I’d like to move the Frame Pointer to R2 (right next to the Stack Pointer in R3), and I’d like to use R5 to replace the V850’s EP register … and basically gain another 32KB of fast-access variable space, particularly for use as thread-local variables.

This isn’t going to cause any problems on the PC-FX … but I’m curious if it will cause any problems on the VirtualBoy.

If you’re programming bare-metal with no BIOS or Nintendo libraries … then it shouldn’t really cause you guys any trouble, either.

As for “memory usage” … how “cramped” are you guys? Are you using the “optimize-for-space” option and/or the “prolog-function” option?

The cartridge ROM has either 1 or 2 (the default) wait-states, selectable in software. The RAM used by the video hardware (the “VIP”) has 2-5 waits, depending on what part of the display rendering cycle it’s currently in. All other areas have a fixed wait-state of 1.

OK, thanks! I guess that Nintendo went a little cheap on the memory (again).

The PC-FX runs everything from RAM, so I’m more worried about pipeline-stalls than I am about memory access times.

I guess that you guys have different issues, and that the VirtualBoy’s memory timing dwarfs the occasional modify-then-read pipeline-stall.

That means that I should definitely keep the frame-pointer “optional” rather than “required” (which is a pity, because it’s so darned useful when implemented properly).

None of the existing, publicly-available, homebrew VB software uses any Nintendo code, binary or otherwise. I can’t speak for what anyone has on their personal PCs, though.

Excellent, you’ve got a completely clean-and-legal toolkit, and that means that you’ve got the source-code to make any changes if you use the new 2010 ABI, or whatever I come up with (if it’s an improvement).

  1. zda
  2. interrupt_handler
  3. noinline
  4. (((addend & 0xfffe) << 16) | ((addend & 0x3f0000) >> 16))
  5. (((addend & 0xfffe) << 16) | ((addend & 0x3ff0000) >> 16))