Big thanks to Benjamin Stevens for helping me test this, and to dasi for letting me bounce ideas off them.
Attached to this post is the fifth revision of a bit string instruction testing ROM I put together. It has proven a valuable exercise, as not only are some of the details surrounding the exact operations of bit string instructions difficult to comprehend, but both Reality Boy and Mednafen fail to get all of the details right.
Here’s the output in RealityBoy. The bottom three lines are incorrect:
And here’s the output in Mednafen. The bottom three lines are correct, but the sixth line is incorrect. Also, the brightness is whacked:
Okay, so what is this exactly? It’s a set of test cases for the MOVBSU instruction. The first six lines show the instruction itself operating on memory, and the bottom three lines show the registers after the sixth bit string was processed.
All values are displayed as bytes in the order they appear in memory. Each group of 4 bytes is one word, with the low byte on the left.
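As a quick illustration of that display convention, here’s a minimal C sketch that formats a 32-bit value the way the ROM lays it out, low byte first. The function name format_word is hypothetical; it isn’t part of the test ROM.

```c
#include <stdint.h>
#include <stdio.h>

/* Format a 32-bit word as its bytes in little-endian memory order,
   low byte first, e.g. 0x0000ABCD -> "CD AB 00 00". */
void format_word(uint32_t value, char out[12])
{
    snprintf(out, 12, "%02X %02X %02X %02X",
             (unsigned)(value & 0xFFu),
             (unsigned)((value >> 8) & 0xFFu),
             (unsigned)((value >> 16) & 0xFFu),
             (unsigned)((value >> 24) & 0xFFu));
}
```

For example, formatting 0x05000008 produces "08 00 00 05", which is how r30 shows up in line 7.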
Line 1 (control test):
Word 0 is initialized to CD AB 00 00
Word 1 is initialized to 00 00 00 00
Word 2 is initialized to 00 00 00 00
Bit string source is bit 0 of word 0
Bit string destination is bit 16 of word 0
Bit string length is 16 bits
Correct output should be: CD AB CD AB, 00 00 00 00, 00 00 00 00
Line 2:
Word 0 is initialized to 78 56 34 12
Word 1 is initialized to 00 00 00 00
Word 2 is initialized to 00 00 00 00
Bit string source is bit 0 of word 0
Bit string destination is bit 16 of word 0
Bit string length is 48 bits
Correct output should be: 78 56 78 56, 34 12 00 00, 00 00 00 00
Line 3:
Word 0 is initialized to 00 00 00 00
Word 1 is initialized to 78 56 34 12
Word 2 is initialized to 00 00 00 00
Bit string source is bit 0 of word 1
Bit string destination is bit 16 of word 0
Bit string length is 32 bits
Correct output should be: 00 00 78 56, 34 12 34 12, 00 00 00 00
Line 4:
Word 0 is initialized to 00 00 12 34
Word 1 is initialized to F0 0D BE EF
Word 2 is initialized to AB CD EF 12
Bit string source is bit 16 of word 0
Bit string destination is bit 16 of word 1
Bit string length is 32 bits
Correct output should be: 00 00 12 34, F0 0D 12 34, F0 0D EF 12
Line 5:
Word 0 is initialized to 00 00 12 34
Word 1 is initialized to 56 78 00 00
Word 2 is initialized to 00 00 00 00
Bit string source is bit 0 of word 1
Bit string destination is bit 24 of word 0
Bit string length is 16 bits
Correct output should be: 00 00 12 56, 78 78 00 00, 00 00 00 00
Line 6:
Word 0 is initialized to 00 00 00 FF
Word 1 is initialized to 00 00 00 00
Word 2 is initialized to 00 00 00 00
Bit string source is bit 24 of word 0
Bit string destination is bit 24 of word 1
Bit string length is 48 bits
Correct output should be: 00 00 00 FF, 00 00 00 FF, 00 00 00 00
What’s going on with line 6, you ask? I’m glad you asked! See, the algorithm for processing bit string instructions is as follows:
* Load the word at the source word address (register 30)
* Load the word at the destination word address (register 29)
* Process 1 bit at a time, keeping the current bit positions in r27 (source) and r26 (destination) and decrementing the string length register (r28) for each bit.
* If an interrupt occurs, let it, and don’t update PC. The next time the CPU executes, the registers will already contain the correct values to resume.
* If bit 31 of the source word has been processed, add 4 to r30, set r27 to 0, and load the new source word. This happens BEFORE the next step.
* If bit 31 of the destination word has been processed, output the new value, add 4 to r29, set r26 to 0, and load the new destination word. This happens AFTER the previous step.
* Repeat these steps for as long as the bit string length register (r28) is greater than 0.
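To make the step list concrete, here is a rough C model of it. Everything in it is an assumption for illustration: memory is a plain array of 32-bit words starting at a hypothetical MEM_BASE of 0x05000000 (inferred from the r29/r30 values in lines 7 and 8), the names movbsu_buffered and BitRegs are made up, and the final write-back of a partially filled destination word is implied rather than spelled out by the steps. The load_first flag switches between loading the next source word before or after storing a finished destination word. It models only the step list as written here, not necessarily what the silicon does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical base address of the test buffer (inferred from lines 7-8). */
#define MEM_BASE  0x05000000u
#define MEM_WORDS 8

/* Bit string registers, named after the V810 registers they model. */
typedef struct {
    uint32_t r26; /* destination bit offset (0-31) */
    uint32_t r27; /* source bit offset (0-31) */
    uint32_t r28; /* bits remaining */
    uint32_t r29; /* destination word address */
    uint32_t r30; /* source word address */
} BitRegs;

static uint32_t load_word(const uint32_t *mem, uint32_t addr)
{
    return mem[(addr - MEM_BASE) / 4];
}

static void store_word(uint32_t *mem, uint32_t addr, uint32_t value)
{
    mem[(addr - MEM_BASE) / 4] = value;
}

/* Model of MOVBSU with one buffered source word and one buffered
   destination word. load_first = true loads the next source word BEFORE
   storing a finished destination word; false stores first. */
void movbsu_buffered(uint32_t *mem, BitRegs *r, bool load_first)
{
    uint32_t src = load_word(mem, r->r30);
    uint32_t dst = load_word(mem, r->r29);

    while (r->r28 > 0) {
        /* Copy one bit from the buffered source into the buffered destination. */
        uint32_t bit = (src >> r->r27) & 1u;
        dst = (dst & ~(1u << r->r26)) | (bit << r->r26);
        r->r28--;

        bool src_done = (r->r27 == 31);
        bool dst_done = (r->r26 == 31);
        r->r27 = (r->r27 + 1) & 31u;
        r->r26 = (r->r26 + 1) & 31u;

        if (src_done && load_first) {
            r->r30 += 4;
            if (r->r28 > 0)           /* don't read past the end of the string */
                src = load_word(mem, r->r30);
        }
        if (dst_done) {
            store_word(mem, r->r29, dst);
            r->r29 += 4;
            if (r->r28 > 0)
                dst = load_word(mem, r->r29);
        }
        if (src_done && !load_first) {
            r->r30 += 4;
            if (r->r28 > 0)
                src = load_word(mem, r->r30);
        }
    }

    /* Assumed: a partially filled destination word is written back at the end. */
    if (r->r26 != 0)
        store_word(mem, r->r29, dst);
}
```

Running line 6 through this model with a length of 40 bits (the length that reproduces the register values in lines 7 to 9) and load_first set to true leaves word 1 as 00 00 00 FF and word 2 as 00 00 00 00, with r30 = 0x05000008, r29 = 0x0500000C, and r26, r27, and r28 all 0. With load_first set to false, word 2 ends up as 00 00 00 FF instead.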
Those two steps near the end, the ones about loading a new source word and storing the new destination word? Mednafen does them in the wrong order. During that particular bit string, bit 31 is reached in both the source word and the destination word at the same time. The new source word is word 1, which is still 00 00 00 00 before the old destination word value is written to it, and becomes 00 00 00 FF after the destination word is stored. If you store and THEN load, like Mednafen does, word 2 ends up as 00 00 00 FF, which doesn’t happen on a real Virtual Boy.
Okay, so what are those last three lines? Those are the values of r26-r30 after the bit string operation in line 6 completes.
Line 7:
Left: The value of r30 (source word address)
Right: The value of r27 (source bit offset)
Correct output should be: 08 00 00 05, 00 00 00 00
Line 8:
Left: The value of r29 (destination word address)
Right: The value of r26 (destination bit offset)
Correct output should be: 0C 00 00 05, 00 00 00 00
Line 9:
Value: The value of r28 (bit string length)
Correct output should be: 00 00 00 00
Remember, the bit string instructions keep running tabs in the registers in case the system raises an interrupt. Bit string instructions can potentially take a long time, so they’re designed to be aborted and resumed automatically when interrupts are raised. RealityBoy, for whatever bean-brained reason, is not updating these after or during execution of the bit string operation. They’re still set to the initial values.
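The resumability point can be illustrated with a simple sketch. It is deliberately simplified: it reads and writes memory one bit at a time instead of buffering whole words, the max_bits budget stands in for "an interrupt arrived", and the names movbsu_step, BitRegs, and MEM_BASE are hypothetical. What it shows is that when all progress lives in r26-r30, stopping early and calling again gives the same result as running straight through.

```c
#include <stdbool.h>
#include <stdint.h>

#define MEM_BASE 0x05000000u /* hypothetical base address of the buffer */

typedef struct {
    uint32_t r26; /* destination bit offset */
    uint32_t r27; /* source bit offset */
    uint32_t r28; /* bits remaining */
    uint32_t r29; /* destination word address */
    uint32_t r30; /* source word address */
} BitRegs;

/* Copy at most max_bits bits, reading and writing memory directly.
   Returns true once r28 reaches 0, or false if it stopped early (the
   simulated interrupt). PC isn't advanced on false, so the caller just
   runs it again and it resumes from r26-r30. */
bool movbsu_step(uint32_t *mem, BitRegs *r, uint32_t max_bits)
{
    while (r->r28 > 0) {
        if (max_bits == 0)
            return false;
        max_bits--;

        uint32_t *src = &mem[(r->r30 - MEM_BASE) / 4];
        uint32_t *dst = &mem[(r->r29 - MEM_BASE) / 4];
        uint32_t bit = (*src >> r->r27) & 1u;
        *dst = (*dst & ~(1u << r->r26)) | (bit << r->r26);
        r->r28--;

        /* Advance both bit positions, stepping to the next word at bit 32. */
        if (++r->r27 == 32) { r->r27 = 0; r->r30 += 4; }
        if (++r->r26 == 32) { r->r26 = 0; r->r29 += 4; }
    }
    return true;
}
```

Chopping line 4’s 32-bit copy into 5-bit slices, for example, stops six times and then finishes with the same memory and register values as an uninterrupted run.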
The plot thickens!
There were some errors in my previous post. First, where I said the bit string lengths were 48 bits, I actually meant 40; those strings processed 5 bytes.
Second, the algorithm I described for how the CPU processes bit string instructions is incorrect. Turns out the CPU is smarter than that.
__________
I’ve prepared revision #7, attached to this post.
RealityBoy, astonishingly, gets it 100% correct:
And this is Mednafen, which gets two of the three columns right (even if the brightness is bonkers):
The column on the right is the initial state of memory prior to any bit string instructions.
For the other two columns, one bit string starts at bit 24 of word 0, and the other bit string starts at bit 0 of word 2. The length for both is 168 bits, which is 5 whole words plus one byte.
The left column has the source string begin before the destination string.
The right column has the destination string begin before the source string.
In both cases, on the real Virtual Boy system (as well as in RealityBoy), the source is copied to the destination intact; the source string is never overwritten by the output before it can be processed. In other words, it behaves like the standard C function memmove(), but on individual bits instead of whole bytes.
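For reference, here’s a minimal C sketch of the memmove()-style behavior this post describes (the thread revisits it later). Bits are addressed absolutely, with bit n being bit n % 8 of byte n / 8, so bit 0 of a little-endian word is the low bit of its first byte. bit_memmove, get_bit, and put_bit are hypothetical helpers. Like memmove(), it picks a copy direction so every source bit is read before it can be overwritten.

```c
#include <stddef.h>
#include <stdint.h>

/* Absolute bit addressing: bit n is bit (n % 8) of byte n / 8. */
static unsigned get_bit(const uint8_t *buf, size_t n)
{
    return (buf[n / 8] >> (n % 8)) & 1u;
}

static void put_bit(uint8_t *buf, size_t n, unsigned bit)
{
    unsigned mask = 1u << (n % 8);
    buf[n / 8] = (uint8_t)((buf[n / 8] & ~mask) | (bit << (n % 8)));
}

/* memmove() for bits: copy len bits from bit address src to bit address
   dst so the source is read intact even when the strings overlap. */
void bit_memmove(uint8_t *buf, size_t dst, size_t src, size_t len)
{
    if (dst > src && dst < src + len) {
        /* Destination overlaps the tail of the source: copy high to low. */
        for (size_t i = len; i-- > 0; )
            put_bit(buf, dst + i, get_bit(buf, src + i));
    } else {
        /* No harmful overlap: copy low to high. */
        for (size_t i = 0; i < len; i++)
            put_bit(buf, dst + i, get_bit(buf, src + i));
    }
}
```

With the revision #7 layout, bit 24 of word 0 is absolute bit 24 and bit 0 of word 2 is absolute bit 64, so the source-before-destination case corresponds to bit_memmove(buf, 64, 24, 168) and the destination-before-source case to bit_memmove(buf, 24, 64, 168).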
This tells us something very important: the bit string instructions were absolutely designed to work on overlapping bit strings. At this point, I can only assume that MAGIC happens when an interrupt occurs, and that when the instruction resumes it finishes processing correctly. That is to say, I believe NEC covered all their bases when designing the V810.
In the meantime, let’s all take a moment to point at Mednafen and laugh.
Wow. I’d only be half able to follow this normally, but I’m in holiday mode and everything is even harder to focus on. But still all very fascinating, the things you’ve been going through here. Great work!
I must issue an apology to Mednafen. With regards to bit strings, even though its results differed from those on the Virtual Boy hardware, it didn’t implement them wrong.
I also withdraw my praise for RealityBoy. With regards to bit strings, even though its results matched those on the Virtual Boy hardware, it didn’t implement them right.
If anyone’s right or wrong, though, it’s me. I was wrong. Let me explain…
In my quest to know everything, I’ve just spent some time conducting extensive testing of the elusive bit string instruction. There’s a pretty solid behavior going on here, and it’s misleading, as this thread can testify.
Here’s the secret:
As it turns out, if the destination bit string begins after the source string, but by less than 64 bits, the result is undefined. That is, while the output is mostly consistent, some bytes change from one execution to the next on the hardware.
And the truth of the overlapping behavior? When the destination string starts 64 or more bits after the source string, there will be feedback where a modified result is read back in as input.
The discrepancy has to do with the way the CPU processes bit strings one word at a time rather than one bit at a time. Throw in further complications from pipeline parallelism and the two-stage store buffer, and it’s not surprising that problems arise when accessing data so close to itself. To play it safe, an emulator should, just as Mednafen does, process bit strings one bit at a time, feeding modified output back as input if the destination overlaps the source.
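A minimal sketch of that recommendation, assuming absolute bit addressing on a byte buffer (bit n is bit n % 8 of byte n / 8). The function names are hypothetical and this is not Mednafen’s actual code. Because it always walks forward and reads each source bit only when it’s needed, bits it has already written are read back in as input whenever the destination overlaps the end of the source. It makes no attempt to reproduce the run-to-run variation seen on hardware when the destination starts 1-63 bits after the source.

```c
#include <stddef.h>
#include <stdint.h>

/* Absolute bit addressing: bit n is bit (n % 8) of byte n / 8. */
static unsigned get_bit(const uint8_t *buf, size_t n)
{
    return (buf[n / 8] >> (n % 8)) & 1u;
}

static void put_bit(uint8_t *buf, size_t n, unsigned bit)
{
    unsigned mask = 1u << (n % 8);
    buf[n / 8] = (uint8_t)((buf[n / 8] & ~mask) | (bit << (n % 8)));
}

/* Always copy low to high, one bit at a time. If the destination overlaps
   the tail of the source, earlier output is read back as input (feedback). */
void bit_copy_forward(uint8_t *buf, size_t dst, size_t src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        put_bit(buf, dst + i, get_bit(buf, src + i));
}
```

For example, with 8 bytes of data at bit 0 and a 128-bit copy to bit 64, those 8 bytes show up twice: once at bytes 8-15, and again, through feedback, at bytes 16-23.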
Overlapping bit strings where the destination is located *before* the source don’t cause any anomalies on the hardware, even with a distance as small as 1 bit. Bit strings that coincide don’t cause any problems either.
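The findings above can be summarized as a classification on absolute bit addresses. This is a hypothetical helper, not anything from the V810 documentation, and it assumes the 64-bit rules only matter when the two strings actually overlap, which isn’t stated outright here.

```c
#include <stdint.h>

typedef enum {
    BITSTR_SAFE,      /* no overlap, coinciding, or destination before source */
    BITSTR_UNDEFINED, /* destination starts 1-63 bits after the source */
    BITSTR_FEEDBACK   /* destination starts 64+ bits after, inside the source */
} BitStrOverlap;

/* src and dst are absolute bit addresses; len is the length in bits. */
BitStrOverlap classify_overlap(uint64_t src, uint64_t dst, uint64_t len)
{
    /* Destination at or before the source: no anomalies were observed. */
    if (dst <= src)
        return BITSTR_SAFE;
    /* Destination entirely past the source: assumed safe (no overlap). */
    if (dst >= src + len)
        return BITSTR_SAFE;
    /* Destination starts inside the source string. */
    return (dst - src < 64) ? BITSTR_UNDEFINED : BITSTR_FEEDBACK;
}
```

By this arithmetic, the source-before-destination case from revision #7 (source at absolute bit 24, destination at bit 64, 168 bits) has the destination starting only 40 bits after the source, which falls in the undefined range.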
Guy Perfect wrote:
To play it safe, an emulator should–just as Mednafen does–process bit strings one bit at a time, feeding modified output back as input if the destination overlaps the source.
I disagree. The emulator is supposed to be emulating the CPU, not what makes sense. If an emulator differs from hardware, even if it’s because of an “undocumented feature”, then it’s not properly emulating the hardware. If there’s a bug where 1+1=3, you shouldn’t be like “well… they meant that to be 2, so I’ll fix it for them”. Look at the errata sheet for a CPU… there are usually a LOT of things that aren’t quite right, but there’s no going back and changing it… so it becomes a “feature” of the CPU.
DogP
It’s not an issue of being undocumented, it’s an issue of being undefined. When the destination string starts less than 64 bits after the source string, the output won’t be the same from one execution to the next.
Maybe you could simulate it by tossing a random number generator in there, but no matter how you slice it, you can’t match what the CPU is doing.
Is the output consistent across hardware resets? The V810 doesn’t have an RNG. There has to be *something* deterministically causing that undefined behavior. Not that I’m volunteering to find it… 😉
For the third time, the output isn’t the same between executions, let alone resets or whatever else. I was sitting there pressing the A button and watching the output be different every time.
Sorry, I must have missed the part where you clarified that it wasn’t the same across resets (fresh CPU and hardware state), in addition to being different across subsequent executions in the same program loop.
Guy Perfect wrote:
It’s not an issue of being undocumented, it’s an issue of being undefined. When the destination string starts less than 64 bits after the source string, the output won’t be the same from one execution to the next.
Maybe you could simulate it by tossing a random number generator in there, but no matter how you slice it, you can’t match what the CPU is doing.
Sorry… I’ll admit that I didn’t re-read the entire thread… I just saw:
Guy Perfect wrote:
I must issue an apology to Mednafen. With regards to bit strings, even though its results differed from those on the Virtual Boy hardware, it didn’t implement them wrong.
I also withdraw my praise for RealityBoy. With regards to bit strings, even though its results matched those on the Virtual Boy hardware, it didn’t implement them right.
which made me assume you were saying Mednafen was different from the hardware, yet somehow correct… while RealityBoy matched the hardware, but was somehow incorrect.
But yes… while you say it’s different between executions, resets, etc… I find it hard to believe that it’s truly “random”, unless it’s causing a timing violation or collision on the bus or something. Other than that, the hardware has to be doing fixed operations… just not necessarily the right thing (possibly dependent on some internal state, or previous register value).
It’s been a long time since I’ve messed with bit string stuff (and I was never really THAT interested in it)… but is there somewhere that says this 64-bit limitation exists, or are you claiming that it’s a hardware bug? From looking at Table 5-13 in document # U10082EJ1V0UM00, it looks like any source and any destination is supposed to be possible.
DogP
I wasn’t able to find anything in any documentation about the destination being 1-63 bits after the source. All I can say is that when I put it through a physical test, it breaks, and it doesn’t break in quite the same way twice.