https://youtu.be/JkOGICAd2Ew
Notes:
The DSP is being used. I have attached the DSP program.
Performance Hints:
The best path to improved performance on the Saturn is what is frequently called "Data Oriented Design", or "DOD".
In general, your philosophy is to ensure the least amount of data is processed, moved, and accessed.
This is absolutely at odds with the prevalent modern philosophy of programming called "Object Oriented Design", or "OOD".
SGL Anomalies
SGL's documents state that SCU DMA Channel 0 and CPU DMA Channel 0 are "free" in SGL.
However, my observations indicate otherwise.
Let me try and walk you through what was happening.
First, we have a model. It's 625 vertices and 576 polygons.
Let's say we recalculate these polygons' normals and textures every frame before we send that model to be drawn with slPutPolygon.
Normally? This is actually OK. Using fast inverse square root, calculating a normal is a relatively inexpensive task (less than 100 instructions).
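To make "relatively inexpensive" concrete, here is a minimal sketch of a face normal built with the well-known Quake-style fast inverse square root. It is not the routine used in this project, which is not shown in these notes. It uses float purely for readability; the SH2 has no FPU, so a real Saturn version would do the same steps in fixed-point. The names Vec3, fast_rsqrt and face_normal are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical vector type for this sketch (float, not Saturn fixed-point). */
typedef struct { float x, y, z; } Vec3;

/* Quake-style fast inverse square root: a bit-level initial guess
 * refined by one Newton-Raphson step. Accurate to a fraction of a percent. */
static float fast_rsqrt(float v)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &v, sizeof bits);      /* reinterpret float bits safely */
    bits = 0x5f3759dfu - (bits >> 1);    /* initial guess */
    memcpy(&y, &bits, sizeof y);
    return y * (1.5f - 0.5f * v * y * y); /* one Newton-Raphson refinement */
}

/* Approximate unit normal of the face spanned by points a, b, c. */
static Vec3 face_normal(Vec3 a, Vec3 b, Vec3 c)
{
    Vec3 u = { b.x - a.x, b.y - a.y, b.z - a.z };
    Vec3 v = { c.x - a.x, c.y - a.y, c.z - a.z };
    /* Cross product u x v gives a vector perpendicular to the face. */
    Vec3 n = { u.y * v.z - u.z * v.y,
               u.z * v.x - u.x * v.z,
               u.x * v.y - u.y * v.x };
    float inv_len = fast_rsqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    n.x *= inv_len;
    n.y *= inv_len;
    n.z *= inv_len;
    return n;
}
```

That is one cross product, one dot product, and a handful of multiplies per polygon, which is consistent with the "less than 100 instructions" estimate.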
If we just wrote that out straight as 100, and did that 576 times, we get 57600 instructions.
Considering an SH2 has 28,000 instructions to offer per millisecond, it only takes about 2ms to perform that task.
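The arithmetic above, written out as a small sketch. All three figures come straight from the text; the function names are made up for illustration.

```c
#include <stdint.h>

/* Figures quoted in the notes above. */
enum {
    POLYGONS         = 576,
    INSTR_PER_NORMAL = 100,   /* upper bound for one normal */
    INSTR_PER_MS     = 28000  /* approximate SH2 throughput */
};

/* Total instructions to recompute every normal once. */
static uint32_t normal_pass_instructions(void)
{
    return (uint32_t)POLYGONS * INSTR_PER_NORMAL;
}

/* The same cost in microseconds, using integer math only. */
static uint32_t normal_pass_microseconds(void)
{
    return normal_pass_instructions() * 1000u / INSTR_PER_MS;
}
```

That works out to 57600 instructions, or roughly 2057 microseconds, assuming every instruction costs one cycle and memory is free.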
But there is a problem with this theory.
The first problem is that we have to read the vertex data from memory. This is not instantaneous.
Another problem is we have to write normals back to memory.
The final problem is the Slave SH2 needs to access this data to draw the polygon, as SGL's default behavior is to use the Slave SH2 for all polygon / matrix processing.
However, by itself, this process does not cause any major performance or synchronization problem.
Remember your program runs on the MSH2. If you have the MSH2 perform this updating normals task before slPutPolygon is reached, it will all be fine.
You might get some small bus contention if the SSH2 is currently crunching numbers on some previous model you sent, but that is not a major concern.
But let me try and read into this further.
An important part of SGL is that it is wisely set up to be sending draw commands to VDP1 in a way that wastes the least time possible.
Keep in mind that accessing VDP1's memory will halt VDP1's operation until some cycles after memory access is complete.
Because of this, you should not frequently access VDP1's memory to inform it about what to draw. Rather, SGL prepares all of your draw commands and sends them to VDP1 in one big batch.
SGL does not do this in a linear fashion, however. It is buffered so the SH2s don't waste time waiting for VDP1 to finish drawing so they can send the next frame.
As far as I understand it, at any given moment your code is running two frames behind the frame that is currently being sent to VDP1.
One thing this means is that your frame-time is inevitably limited by the transfer time of the frame's data. So there is some time wasted, maybe 4ms?
So the SH2s and VDP1 do not have the full frame-time to render a frame.
On its own, you can reasonably ignore this fact, except for knowing that you never actually have the full 16/33/50/66ms to do anything. It is always less.
More importantly, the Slave SH2 pretty much _always_ wants access to the memory to transfer the frame to VDP1. Maybe not always, but it is safer to assume so!
So not only does the SSH2 need memory access, it also needs a DMA channel.
Keep in mind that no two processors can access high/low memory at the same time. There is a priority order.
The order is (in SGL): SCU > Master > Slave.
So what happens if SGL ABSOLUTELY MUST send the next frame on time (that is its goal: to be fast at 3D), but your code is accessing high memory, and the processor which manages this is the Slave SH2?
... You can call me out on this if you know differently, but my assumption is that the Slave SH2 will switch to using SCU DMA Channel 0 (the normal channel for slDMACopy) to gain priority over MSH2.
This causes a number of problems.
1. The SCU is slower at accessing memory than the SH2s. [Assumption, unverified, but I think it is]
2. The SCU-DSP may only DMA using SCU DMA Channel 0, so this may directly contend with a DSP program and potentially cause it to malfunction.
3. SBL's file system uses SCU DMA Channel 0 if you set GFS_TMODE_SCU.
4. Imagine any other thing that might be using SCU DMA Channel 0, an assumed "free" channel, and guess what would happen if SGL suddenly wants to use it.
5. SCU can't access low memory. So contention in this range is guaranteed wait cycles. Frankly, that's probably better!
Again, this typically is not a problem, but because SGL is a black box it is unknown when it may enter this condition. It does not fire interrupts when it is or isn't transferring the next frame's data.
How could I come to this conclusion, and where _might_ it be a problem?
In my case, I had two DSP programs and file system transfers active using SCU DMA (to of course leave a DMA channel open on MSH2).
And, lo and behold, if you calculate 576 poly normals and have file system access via SCU (to the SCSP area), every frame in which both are happening will spike to 50ms.
This is an interesting cascade of contentions.
First, MSH2 and SSH2 want to access memory at the same time.
For SSH2 to gain priority over MSH2, it commands SCU to access the memory.
Then, an unexpected DMA channel is used, which interferes with SH-1, DSP, and MSH2.
In turn, this interferes once again with the SSH2's desired actions.
All piling up to a long delay before the data ends up in VDP1's memory and we move along to the next frame.
Because we can't just send the draw commands as we go.
Another anomaly is that this contention is theoretically worse than simply making MSH2 or SSH2 wait for memory access.
The file system is independent. It's not waiting for the MSH2 to tell it to start or stop transfers on a sub-frame basis, the SH-1 manages that.
Further, the DMA method used is the SCU. The whole CPU Bus has nothing to do with that data after the read commands are sent.
The DSP is also independent and internal to the SCU. If SCU-DMA Channel 0 is used up, the DSP's default behavior is to wait for completion before continuing.
I verified the DSP being uninvolved in this contention cascade by disabling the DSP programs and instead performing the calculations on MSH2. Still happens.
Finally, I got to testing what would happen if I made the normal calculations at sprite draw end via interrupt. Instead of frame-spikes, now EVERY frame was 66ms.
Then, I tested these calculations using slSlaveFunc, which puts your function after all draw commands on the Slave SH2's stack. Now frame-time depended on render load (could be 33ms, could be 50).
In both cases, the frame-times were decoupled from the file system.
The bottom line? Memory access is bad.
If you're XL2, the bottom line is asynchronous file systems are bad.

All this complication, but it's that simple.
The solution to this problem is to keep the Master SH2 working more inside of its own cache, rather than stretching computations out so much that they involve more memory access.
Again, computing the normals themselves, no problem for the SH2. But each normal computed is unique data that starts filling cache with junk data that won't be re-used.
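A rough sketch of why the cache fills up. The SH2's on-chip cache is 4 KB. The vertex and polygon counts are from the model described earlier; the 12 bytes per vector (three 32-bit fixed-point values) is an assumption about the data layout, and the function names are made up for illustration.

```c
#include <stdint.h>

enum {
    SH2_CACHE_BYTES = 4096, /* SH2 on-chip cache size */
    VERTICES        = 625,  /* from the model described above */
    POLYGONS        = 576,
    BYTES_PER_VEC   = 12    /* ASSUMED: three 32-bit fixed-point values */
};

/* Bytes read: every vertex of the model. */
static uint32_t vertex_bytes(void) { return VERTICES * BYTES_PER_VEC; }

/* Bytes written: one normal per polygon. */
static uint32_t normal_bytes(void) { return POLYGONS * BYTES_PER_VEC; }

/* Total data touched by one full normal pass. */
static uint32_t working_set_bytes(void)
{
    return vertex_bytes() + normal_bytes();
}

/* Nonzero if the whole pass could stay resident in cache. */
static int fits_in_cache(void)
{
    return working_set_bytes() <= SH2_CACHE_BYTES;
}
```

Under those assumptions one full pass touches about 14 KB, more than three times the cache, so nearly every access in the pass ends up going out to memory.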
Instead of calculating every normal, I solved the problem by first seeking through the polygon data and concatenating the data that changes into a single 32-bit number.
That leaves less data to sort through to find out whether a polygon changed. If it DID change, I can look backwards from the direction it moved to find it again.
With that known, at most 24 polygons ever need new normals (1 row in 24x24 = 576).
This is an application-specific solution, but it is an example where computational simplicity was actually very much counter-productive. Instead, the program was made more complex, but faster.
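A minimal sketch of the general idea, not the actual code from this project: pack whatever changes about a polygon into one 32-bit key, compare it with last frame's key, and only recompute normals for polygons whose key differs. What goes into the key is an assumption here (four 8-bit values), and pack_key and find_dirty are hypothetical names. The "look backwards from the direction it moved" step is application-specific and is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMED layout: the data that changes per polygon is four 8-bit values. */
static uint32_t pack_key(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8)  |  (uint32_t)d;
}

/* Compare this frame's keys with last frame's.
 * Writes the index of every changed polygon into dirty[], updates
 * prev_keys[] so the next frame compares against fresh data, and
 * returns how many polygons need new normals.
 * dirty[] must have room for count entries. */
static size_t find_dirty(const uint32_t *keys, uint32_t *prev_keys,
                         size_t count, uint16_t *dirty)
{
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        if (keys[i] != prev_keys[i]) {
            prev_keys[i] = keys[i];
            dirty[n++] = (uint16_t)i;
        }
    }
    return n;
}
```

The scan reads one 32-bit word per polygon instead of several vertices, and the expensive normal calculation only runs for the indices in dirty[].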
Welcome to Saturn.