In my quest to know everything, I set my eyes on the Nintendo instructions, and their undocumented cycle counts. We know through the development manual that the MPYHW instruction takes 9 cycles, but what about the others?
What I did was write a routine in assembly that enables the instruction cache and the hardware timer with a 20-microsecond tick interval, then runs a short loop with three consecutive instances of a given instruction. At the end, I stop the timer and see how many timer ticks it took. I then subtract the tick count of another run that didn’t have any instructions in the loop, which leaves only the ticks spent on the instructions themselves.
I used the ADD instruction as a baseline, since it takes 1 cycle. Using MUL and DIV to verify it was working, I got the correct cycle counts:
ADD     1966   1.0       1 cycle
MUL    26214  13.33367  13 cycles
DIV    75366  38.33469  38 cycles
Logically, I did the same thing on all the Nintendo instructions to determine their cycle counts. The counts may surprise you.
Please: If you’re a coder, run your own tests to get some more sample data on these. I don’t doubt my methods, but I’d like some validation from other sources.
My findings are as follows:
MPYHW  18350   9.33367   9 cycles
CLI    23593  12.00051  12 cycles
SEI    23593  12.00051  12 cycles
XB     12452   6.33367   6 cycles
XH      2622   1.33367   1 cycle
REV    43909  22.33418  22 cycles
MPYHW takes 9 cycles as expected.
CLI and SEI both take 12 cycles for some unimaginable reason. Curious is the fact that my program didn’t log an extra third of a cycle like it did for all the other instructions I tested. Someone else’s testing would be appreciated to get a clearer view on this count.
XB looks like it takes 6 cycles. Fair enough. But what caught my eye is that XH only takes 1 cycle. I’d have expected them to be pretty close to each other.
The REV instruction expectedly takes a while to complete. In my tests, it clocked in (pun intended) at 22 cycles.
As long as I’m at it…
The development manual indicates that using any register other than r0 for reg1 in the XB and XH instructions may cause problems, but regardless of which registers I specified or the values in those registers, the instructions performed correctly.
MPYHW, on the other hand, is giving me some mysterious results when bits 16-31 don’t sign-extend bit 15 (just like the developer’s manual says). I’m gonna have to put together a test program just for that instruction to figure out exactly what it’s doing.
MPYHW definitely performs a faithful multiplication, but exactly how it behaves when extending off the left side of the register is something I’m gonna have to investigate further in the morning.
In the meantime, have a ROM! All controls are done with the left D-Pad. Up and down change the value of the current digit, and left and right change the digit.
Hmm… some of those results are a bit surprising. What’s the point of having custom CPU instructions if they’re not much faster than what you could do with just a few instructions in software? Sure, they’re a little more convenient and more compact, but that hardly seems worth a CPU customization.
Guy Perfect wrote:
Curious is the fact that my program didn’t log an extra third of a cycle like it did for all the other instructions I tested. Someone else’s testing would be appreciated to get a clearer view on this count.
Guy Perfect wrote:
ADD 1966 1.0 1 cycle
What about ADD? How about running it on a larger chunk of regular V810 instructions to see if it really is an anomaly, or just that some do and some don’t (you might make a connection between them)?
And just out of curiosity… why 3 instructions in a row? Do you get the same results with just 1? How about 10?
DogP
Regarding CLI and SEI, do we even know how long the stock V810 methods take? Specifically:
; CLI
movea 0xEFFF, $0, $10
stsr $PSW, $11
and $10, $11
ldsr $11, $PSW
; SEI
stsr $PSW, $10
ori 0x1000, $10, $10
ldsr $10, $PSW
I can’t find the duration of LDSR and STSR in the V810 manual.
HorvatM wrote:
Regarding CLI and SEI, do we even know how long the stock V810 methods take?
That’s a good question… I don’t see that listed in the manual. One thing I did notice is that CLI and SEI have the same opcode as EI and DI on the V830 (which is based on the V810 architecture, though not necessarily implemented the same). In the V830 case, they claim to take 4 cycles each. For comparison, LDSR and STSR take 5 cycles on the V830.
DogP
I put together a program to answer this once and for all. It tests all register-based instructions with a simple assembly loop:
# s32 CycleTest(s32 arg1, s32 arg2, s32 num);
vueFunction(_CycleTest)

    # r2 = 0x02000000, base address for hardware control ports
    # r6 = arg1
    # r7 = arg2
    # r8 = num, also used as the loop iterator

    # Configure the hardware timer
    MOVHI 0x0200, r0, r2
    MOV   -1, r1
    ST.B  r1, 0x0018[r2] # Count/reload low  = 0xFF
    ST.B  r1, 0x001C[r2] # Count/reload high = 0xFF

    # Enable and clear the instruction cache
    MOVEA 0x0803, r0, r1
    LDSR  r1, CHCW

    # Enable the timer with 20-microsecond ticks
    MOVEA 0x0011, r0, r1
    ST.B  r1, 0x0020[r2]

    # Execute the instruction 4 times for the given number of iterations
.Lcycle_loop:
    MOV r6, r9
    MOV r7, r10

    # This comment is located 32 bytes into the function.
    # When the function is not modified, nothing happens in this loop
    # The following bytes are meant to be overwritten in RAM
    BR .Lcycle_end; NOP; NOP; NOP; # Written by 16- and 32-bit instructions
    BR .Lcycle_end; NOP; NOP; NOP; # Written by 32-bit instructions
    BR .Lcycle_end                 # Always present for consistency

    # End-of-loop code for 32-bit instructions (3 16-bit instructions)
.Lcycle_end:
    ADD -1, r8
    BNZ .Lcycle_loop
    # End-of-loop label

    # Disable the timer and instruction cache
    ST.B r0, 0x0020[r2]
    LDSR r0, CHCW

    # Retrieve and return the number of timer ticks taken
    IN.B 0x0018[r2], r6   # Timer count low
    IN.B 0x001C[r2], r7   # Timer count high
    SHL  8, r7            # r7 = r7 << 8 | r6;
    OR   r6, r7
    MOV  -1, r10          # r10 = -1 - r7 & 0xFFFF;
    SUB  r7, r10
    ANDI 0xFFFF, r10, r10
    JMP  [r31]

vueEnd(_CycleTest)
This function gets copied into RAM at run-time. Those NOPs are dummy bytes that are replaced with meaningful instructions by the program. The reason there are two sets of NOPs is to accommodate both 16- and 32-bit instructions. The final BR instruction is always present so that the loop overhead is the same in every run; only the inserted instructions change the timing.
The C code that drives this looks like this:
// Gets the number of timer ticks for a loop of 4 instances of an instruction
s32 GetCount(const INST *inst, s32 num) {
    s32 len = (SIZE_CYCLETEST + 3) / 4; // Function size in 32-bit words
    u32 arg1 = 0, arg2 = 0;
    s32 x, y, offset = 32;              // Byte offset of the patchable slots
    u32 func[len];                      // Sized in words so the whole routine fits and stays aligned
    u16 bits[2];

    // Copy the function into memory (assumes memcpy32 counts 32-bit words)
    memcpy32(func, &CycleTest, len);

    // If we're not overwriting with an instruction, ignore this all
    if (inst != NULL) {

        // Encode the instruction into data bits and get its size
        len = FORMATS[inst->format](inst, bits);

        // Copy the instruction into the function buffer 4 times
        for (x = 0; x < 4; x++)
        for (y = 0; y < len; y++) {
            *(u16 *)((u8 *)func + offset) = bits[y];
            offset += 2;
        }

        // Grab the instruction's pre-defined operands
        arg1 = inst->val1;
        arg2 = inst->val2;
    }

    // Call the function from the buffer
    return ((s32 (*)(u32, u32, s32)) func)(arg1, arg2, num);
}
My main function calls this function 5 times for each instruction (predefined in a const table at the top of the program) and averages the counts. It then subtracts the averaged count of a null call (no instruction overwritten) and divides the result by the net count for ADD, which is known to be 1 cycle.
The output on the hardware looks like this:
ADD (Immediate) 051E = 1 cycle
ADD (Register) 051E = 1 cycle
ADDF.S 6F5C = 22 cycles
ADDI 051F = 1 cycle
AND 051E = 1 cycle
ANDI 051E = 1 cycle
CLI 3D71 = 12 cycles
CMP (Immediate) 06ED = 1 cycle
CMP (Register) 051E = 1 cycle
CMPF.S 228F = 7 cycles
CVT.SW 4666 = 14 cycles
CVT.WS 27AE = 8 cycles
DIV C148 = 38 cycles
DIVF.S DFFF = 44 cycles
DIVU B70A = 36 cycles
LDSR 28F6 = 8 cycles
MOV (Immediate) 051F = 1 cycle
MOV (Register) 051E = 1 cycle
MOVEA 051F = 1 cycle
MOVHI 051E = 1 cycle
MPYHW 2CCC = 9 cycles
MUL 4148 = 13 cycles
MULF.S 83D7 = 26 cycles
MULU 4147 = 13 cycles
NOT 051E = 1 cycle
OR 051E = 1 cycle
ORI 051F = 1 cycle
REV 6F5C = 22 cycles
SAR (Immediate) 051E = 1 cycle
SAR (Register) 051E = 1 cycle
SEI 3D70 = 12 cycles
SETF 051E = 1 cycle
SHL (Immediate) 051E = 1 cycle
SHL (Register) 051E = 1 cycle
SHR (Immediate) 051F = 1 cycle
SHR (Register) 051E = 1 cycle
STSR 28F5 = 8 cycles
SUB 051E = 1 cycle
SUBF.S 83D7 = 26 cycles
TRNC.SW 4147 = 13 cycles
XB 1D70 = 6 cycles
XH 051F = 1 cycle
XOR 051E = 1 cycle
XORI 051E = 1 cycle
All instructions with documented cycle counts have the correct count, so that's a relief. The floating-point instructions fall within their given range. The undocumented cycle counts? Well, that's why I made this program.
LDSR and STSR are 8 cycles each. I was expecting 1 cycle. This is how we learn things, though. Suddenly the CLI and SEI instructions being 12 cycles don't sound so bad.
MPYHW is 9 cycles as seen before. Likewise for XB at 6 cycles, XH at 1 cycle and REV at 22 cycles.
A ROM of this program is attached to this post. After the test is finished, up and down on the left D-Pad scroll the list of instructions.
I figured out the operation of MPYHW.
* reg1 is treated as a 17-bit integer, sign-extended to 32 bits in size.
* reg2 is treated as a 32-bit, signed integer.
* Multiplication happens normally, storing the result in reg2.
* r30 is not affected as it is in MUL and MULU.
Algorithm:
// On an unsigned variable
reg2 *= (reg1 & 0x0001FFFF) | ((reg1 & 0x00010000) ? 0xFFFE0000 : 0);

// On a signed variable
reg2 *= reg1 << 15 >> 15;