Recent Posts

1
Project announcement / Re: what p64 does
« Last post by sbf2009 on April 15, 2019, 04:34:25 am »
The lesson I'm getting from this is that if you use SGL (and that includes Jo Engine and Z-treme Tools), the DSP will not play nice without a lot of extra considerations.
2
Project announcement / Re: what p64 does
« Last post by ponut64 on April 15, 2019, 04:05:55 am »
https://youtu.be/JkOGICAd2Ew

Notes:
The DSP is being used. I have attached the DSP program.

Performance Hints:
The best path to improved performance on the Saturn is what is frequently called "Data Oriented Design", or "DOD".
In general, your philosophy is to ensure the least amount of data is processed, moved, and accessed.
This is absolutely at odds with the prevalent modern philosophy of programming called "Object Oriented Design", or "OOD".

SGL Anomalies

SGL's documents state that SCU DMA Channel 0 and CPU DMA Channel 0 are "free" in SGL.
However, my observations indicate otherwise.
Let me try and walk you through what was happening.

First, we have a model. It's 625 vertices and 576 polygons.
Let's say we recalculate these polygons' normals and textures every frame before we send that model to be drawn with slPutPolygon.
Normally? This is actually OK. Using fast inverse square root, calculating a normal is a relatively inexpensive task (less than 100 instructions).
If we just wrote that out straight as 100, and did that 576 times, we get 57600 instructions.
Considering an SH2 has 28,000 instructions to offer per millisecond, it only takes about 2ms to perform that task.
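
The estimate above can be written out as a small calculation. This is only a sketch of that arithmetic: the function names are made up for illustration, and the 100-instruction and 28,000-per-millisecond figures are the rough numbers quoted in this post, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Total instructions for one pass over the model,
   assuming the same flat cost for every polygon. */
uint32_t total_instructions(uint32_t polygons, uint32_t instr_per_polygon)
{
    return polygons * instr_per_polygon;
}

/* Estimated time in microseconds for an assumed instruction rate
   (instructions per millisecond). Memory access time is ignored. */
uint32_t estimate_microseconds(uint32_t instructions, uint32_t instr_per_ms)
{
    assert(instr_per_ms != 0);
    return (uint32_t)(((uint64_t)instructions * 1000u) / instr_per_ms);
}
```

With the numbers from this post, total_instructions(576, 100) gives 57600 and estimate_microseconds(57600, 28000) gives 2057, which is the "about 2ms" figure.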

But there is a problem with this theory.
The first problem is that we have to read the vertex data from memory. This is not instantaneous.
Another problem is we have to write normals back to memory.
The final problem is the Slave SH2 needs to access this data to draw the polygon, as SGL's default behavior is to use the Slave SH2 for all polygon / matrix processing.

However, by itself, this process does not cause any major performance or synchronization problem.
Remember your program runs on the MSH2. If you have the MSH2 perform this updating normals task before slPutPolygon is reached, it will all be fine.
You might get some small bus contention if the SSH2 is currently crunching numbers on some previous model you sent, but that is not a major concern.

But let me try and read into this further.
An important part of SGL is that it is wisely set up to send draw commands to VDP1 in a way that wastes as little time as possible.
Keep in mind that accessing VDP1's memory will halt VDP1's operation until some cycles after memory access is complete.
Because of this, you should not frequently access VDP1's memory to inform it about what to draw. Rather, SGL prepares all of your draw commands and sends them to VDP1 in one big batch.

SGL does not do this in a linear fashion, however. It is buffered so the SH2s don't waste time waiting for VDP1 to finish drawing so they can send the next frame.
As far as I understand it, the code you are running at any given moment is two frames behind the frame that is currently being sent to VDP1.
One thing this means is your frame-time is inevitably limited to the transfer time of the frame's data. So there is some time wasted, maybe 4ms?
So the SH2s and VDP1 do not have the full frame-time to render a frame.

On its own, you can reasonably ignore this fact, except to remember that you never actually have the full 16/33/50/66ms to do anything. It is always less.
More importantly, the Slave SH2 pretty much _always_ wants access to the memory to transfer the frame to VDP1. Maybe not always, but it is safer to assume so!
So not only does the SSH2 need memory access, it also needs a DMA channel.

Keep in mind that no two processors can access high/low memory at the same time. There is a priority order.
The order is (in SGL): SCU > Master > Slave.

So what happens if SGL ABSOLUTELY MUST send the next frame on time (that is its goal: to be fast at 3D), but your code is accessing high memory, and the processor which manages this is the Slave SH2?
... You can call me out on this if you know differently, but my assumption is that the Slave SH2 will switch to using SCU DMA Channel 0 (the normal channel for slDMACopy) to gain priority over MSH2.

This causes a number of problems.
1. The SCU is slower at accessing memory than the SH2s. [Assumption, unverified, but I think it is]
2. The SCU-DSP may only DMA using SCU DMA Channel 0, so this may directly contend with a DSP program and potentially cause it to malfunction.
3. SBL's file system uses SCU DMA Channel 0 if you set GFS_TMODE_SCU.
4. Imagine any other thing that might be using SCU DMA Channel 0, an assumed "free" channel, and guess what would happen if SGL suddenly wants to use it.
5. SCU can't access low memory. So contention in this range is guaranteed wait cycles. Frankly, that's probably better!

Again, this typically is not a problem, but because SGL is a black box it is unknown when it may enter this condition. It does not fire interrupts when it is or isn't transferring the next frame's data.

How could I come to this conclusion, and where _might_ it be a problem?

In my case, I had two DSP programs and file system transfers active using SCU DMA (to of course leave a DMA channel open on MSH2).
And, lo and behold, if you calculate 576 poly normals and have file system access via SCU (to the SCSP area), every frame in which both are happening spikes to 50ms.

This is an interesting cascade of contentions.
First, MSH2 and SSH2 want to access memory at the same time.
For SSH2 to gain priority over MSH2, it commands SCU to access the memory.
Then, an unexpected DMA channel is used, which interferes with SH-1, DSP, and MSH2.
In turn, this interferes once again with the SSH2's desired actions.
All piling up to a long delay before the data ends up in VDP1's memory and we move along to the next frame.
Because we can't just send the draw commands as we go.

Another anomaly is that this contention is theoretically worse than simply making MSH2 or SSH2 wait for memory access.
The file system is independent. It's not waiting for the MSH2 to tell it to start or stop transfers on a sub-frame basis, the SH-1 manages that.
Further, the DMA method used is the SCU. The whole CPU Bus has nothing to do with that data after the read commands are sent.
The DSP is also independent and internal to the SCU. If SCU-DMA Channel 0 is used up, the DSP's default behavior is to wait for completion before continuing.
I verified the DSP being uninvolved in this contention cascade by disabling the DSP programs and instead performing the calculations on MSH2. Still happens.

Finally, I got to testing what would happen if I made the normal calculations at sprite draw end via interrupt. Instead of frame-spikes, now EVERY frame was 66ms.
Then, I tested these calculations using slSlaveFunc, which puts your function after all draw commands on the Slave SH2's stack. Now frame-time depended on render load (could be 33ms, could be 50).
In both cases, the frame-times were decoupled from the file system.

The bottom line? Memory access is bad.
If you're XL2, the bottom line is asynchronous file systems are bad :)
All this complication, but it's that simple.

The solution to this problem is to keep the Master SH2 working more inside of its own cache, rather than stretching computations out so much that they involve more memory access.
Again, computing the normals themselves, no problem for the SH2. But each normal computed is unique data that starts filling cache with junk data that won't be re-used.
Instead of calculating every normal, I solved the problem by first scanning the polygon data, concatenating the data that changes into a single 32-bit number.
That leaves less data to sort through to find out whether a polygon changed. If it DID change, I can look backwards from the direction it moved to find it again.
With that known, at most 24 polygons ever need new normals (1 row in 24x24 = 576).
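
The change check described above can be sketched roughly as follows. This is an illustration, not the code from the project: pack_key, find_changed, and the choice of four 8-bit fields are all hypothetical, and the "look backwards from the direction it moved" step is not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key: four 8-bit values that can change for a polygon,
   concatenated into one 32-bit number. */
uint32_t pack_key(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8)  |  (uint32_t)d;
}

/* Compare this frame's keys against the cached ones.
   Indices of polygons whose key differs are written to changed_out
   (up to changed_cap entries), the cache is updated in place, and the
   total number of changed polygons is returned.
   Indices are stored as 16 bits, so count is assumed to be <= 65536. */
size_t find_changed(const uint32_t *keys, uint32_t *cached, size_t count,
                    uint16_t *changed_out, size_t changed_cap)
{
    size_t changed = 0;
    size_t i;

    assert(keys != NULL && cached != NULL && changed_out != NULL);
    for (i = 0; i < count; i++) {
        if (keys[i] != cached[i]) {
            if (changed < changed_cap)
                changed_out[changed] = (uint16_t)i;
            changed++;
            cached[i] = keys[i];
        }
    }
    return changed;
}
```

Only the polygons listed in changed_out would then get new normals. The scan itself touches one 32-bit word per polygon instead of the full vertex data.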

This is an application-specific solution, but it is an example where computational simplicity was actually very much counter-productive. Instead, the program was made more complex, but faster.
Welcome to Saturn.

3
Project announcement / Re: what p64 does
« Last post by ponut64 on April 14, 2019, 05:18:17 am »
4
Project announcement / Re: what p64 does
« Last post by ponut64 on April 06, 2019, 12:20:01 pm »
Hello again,

Here is a DSP sample program for finding the normal of a polygon.

Because assembly is so hard to follow when others write it, I make as many comments as possible. It makes it difficult to crawl over the whole document but easier to follow.

/e: Hm, hesitate. I don't think it's working right.
/e2: Fixed errors.
If you are curious:
1. Line 201 was moving low order bits ("mov all,mc3") when it should have been using the high order bits ("mov alh,mc3").
2. The instruction at line 277 was modified from (mvi 16,PL) to (mvi 17,PL); our shifting output is 1 less than it should be, in comparison to the typical C logic.
3. Line 289 had an extra instruction added after it. This shifts the initial guess back right once. Because the DSP caches instructions, a "loop next instruction" command will execute its designated times to loop while the next instruction was already pre-fetched, so it will also execute, therefore an LPS command will execute the next command 1 more time than indicated in the LOP counter.
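
For readers who don't want to read DSP assembly, the calculation a polygon-normal program performs can be sketched in C. This is not a translation of the attached DSP program. The names (Vec3, fx_mul, face_normal_raw, and so on) are made up for illustration, fx_mul is a stand-in for a 16.16 multiply, and the sign of the result depends on the winding order you use.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t FIXED;                  /* 16.16 fixed point */
typedef struct { FIXED x, y, z; } Vec3;

/* Stand-in 16.16 multiply using a 64-bit intermediate.
   Assumes >> on a negative value is an arithmetic shift. */
FIXED fx_mul(FIXED a, FIXED b)
{
    return (FIXED)(((int64_t)a * (int64_t)b) >> 16);
}

Vec3 vec_sub(Vec3 a, Vec3 b)
{
    Vec3 r;
    r.x = a.x - b.x;  r.y = a.y - b.y;  r.z = a.z - b.z;
    return r;
}

Vec3 vec_cross(Vec3 a, Vec3 b)
{
    Vec3 r;
    r.x = fx_mul(a.y, b.z) - fx_mul(a.z, b.y);
    r.y = fx_mul(a.z, b.x) - fx_mul(a.x, b.z);
    r.z = fx_mul(a.x, b.y) - fx_mul(a.y, b.x);
    return r;
}

/* Unnormalized face normal from three corners of the polygon. */
Vec3 face_normal_raw(Vec3 p0, Vec3 p1, Vec3 p2)
{
    return vec_cross(vec_sub(p1, p0), vec_sub(p2, p0));
}

/* Squared length, the value an inverse square root routine would take. */
FIXED vec_len_sq(Vec3 v)
{
    return fx_mul(v.x, v.x) + fx_mul(v.y, v.y) + fx_mul(v.z, v.z);
}

/* Scale to unit length, given inv_len = 1/sqrt(vec_len_sq(v)). */
Vec3 vec_scale(Vec3 v, FIXED inv_len)
{
    Vec3 r;
    assert(inv_len > 0);
    r.x = fx_mul(v.x, inv_len);
    r.y = fx_mul(v.y, inv_len);
    r.z = fx_mul(v.z, inv_len);
    return r;
}
```

The squared length can overflow 16.16 for large polygons, so a real implementation would need to pre-scale the edges. That is left out here.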
5
Project announcement / Re: what p64 does
« Last post by ponut64 on April 04, 2019, 04:24:46 pm »
Hi again,

Here's an SGL-compatible fast inverse square root function.

Code:
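FIXED fxisqrt(FIXED input)
{
    /* Plain locals (the first version used statics) keep the function reentrant. */
    FIXED xSR;        /* half of the input, used in the Newton step */
    FIXED pushRight;  /* working copy, shifted right to locate the top bit */
    FIXED yIsqr;      /* initial guess for 1/sqrt(input) */
    int msb = 0;      /* shifts needed to bring the input below 1.0 */
    int shoffset;

    /* Inputs of 1.0 (65536) or less are not approximated;
       the smallest positive FIXED value is returned instead. */
    if (input <= 65536) {
        return 1;
    }

    xSR = input >> 1;
    pushRight = input;

    while (pushRight >= 65536) {
        pushRight >>= 1;
        msb++;
    }

    /* Initial guess: 1.0 shifted right by half the shift count. */
    shoffset = 16 - (msb >> 1);
    yIsqr = 1 << shoffset;

    /* One Newton-Raphson step: y * (1.5 - (x/2) * y * y). 98304 is 1.5 in 16.16. */
    return slMulFX(yIsqr, (98304 - slMulFX(xSR, slMulFX(yIsqr, yIsqr))));
}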
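
To try the routine off the console, here is a host-side sketch. The FIXED typedef and the slMulFX below are stand-ins written for this example (an assumed 16.16 multiply through a 64-bit intermediate), not SGL's own definitions, and fx_from_int is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t FIXED;   /* stand-in for SGL's 16.16 FIXED type */

/* Stand-in for slMulFX: assumed to be a 16.16 multiply. */
FIXED slMulFX(FIXED a, FIXED b)
{
    return (FIXED)(((int64_t)a * (int64_t)b) >> 16);
}

/* Hypothetical helper: whole number to 16.16. */
FIXED fx_from_int(int whole)
{
    assert(whole >= 0 && whole < 32768);
    return (FIXED)(whole * 65536);
}

/* Same arithmetic as the routine above. */
FIXED fxisqrt(FIXED input)
{
    FIXED xSR, pushRight, yIsqr;
    int msb = 0;

    if (input <= 65536) {
        return 1;
    }
    xSR = input >> 1;
    pushRight = input;
    while (pushRight >= 65536) {
        pushRight >>= 1;
        msb++;
    }
    yIsqr = 1 << (16 - (msb >> 1));
    return slMulFX(yIsqr, 98304 - slMulFX(xSR, slMulFX(yIsqr, yIsqr)));
}
```

For inputs of 4.0 and 16.0 this returns exactly 0.5 (32768) and 0.25 (16384). For 2.0 it returns 0.625 (40960), while the true value is about 0.707, so a single Newton step is fairly rough between powers of four.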
6
Project announcement / MAZINGER Z Homebrew for SEGA Saturn
« Last post by mindslight on April 02, 2019, 12:58:21 pm »
Hi!

There is a new project in progress: MAZINGER Z for the Sega Saturn made by the Reverant Rolando Fernández Benavidez.

Website: https://vectrex8.wixsite.com/mazinger
7
Project announcement / Re: what p64 does
« Last post by ponut64 on March 30, 2019, 06:53:49 pm »
The DSP as a Logical Processor

I've started experimenting with the DSP, and have found that, contrary to my preconceptions, it is a logical processor. Now, of course, no one told me it wasn't, I just didn't know it was.
So to familiarize myself with its inner workings, I wrote two bitwise division programs: one for signed values, one for unsigned values.

The DSP has no division instruction, so you need to write your own program for it.
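
For anyone unfamiliar with the idea, bitwise division is usually done by shifting and subtracting. Here it is sketched in C rather than DSP assembly. This is a generic textbook version, not the DSP programs mentioned above, and the names udiv32 and sdiv32 are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned shift-and-subtract division: one quotient bit per iteration. */
uint32_t udiv32(uint32_t dividend, uint32_t divisor, uint32_t *remainder)
{
    uint32_t quotient = 0;
    uint32_t rem = 0;
    int bit;

    assert(divisor != 0);
    for (bit = 31; bit >= 0; bit--) {
        /* Bring down the next dividend bit, most significant first. */
        rem = (rem << 1) | ((dividend >> bit) & 1u);
        if (rem >= divisor) {
            rem -= divisor;
            quotient |= (uint32_t)1 << bit;
        }
    }
    if (remainder != NULL)
        *remainder = rem;
    return quotient;
}

/* Signed version: divide the magnitudes, then restore the sign.
   Truncates toward zero, like C's / operator.
   INT32_MIN / -1 overflows, just as it does in C. */
int32_t sdiv32(int32_t dividend, int32_t divisor)
{
    uint32_t a = dividend < 0 ? 0u - (uint32_t)dividend : (uint32_t)dividend;
    uint32_t b = divisor  < 0 ? 0u - (uint32_t)divisor  : (uint32_t)divisor;
    uint32_t q = udiv32(a, b, NULL);

    if ((dividend < 0) != (divisor < 0))
        q = 0u - q;
    return (int32_t)q;
}
```

The loop uses only shifts, compares, and subtracts, which is why the same approach suits a processor with no divide instruction.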
8
Project announcement / Re: what p64 does
« Last post by ponut64 on March 17, 2019, 03:57:40 am »
9
Project announcement / Re: Sonic Z-Treme
« Last post by XL2 on March 04, 2019, 03:09:59 am »
Well, I got really fed up with the Sonic fans nitpicking and complaining nonstop, either because it's too much like Sonic X-Treme or not enough (the kind of comments I received: "hey moron, Sonic X-Treme used sprites, why are you using an ugly 3d model? Lol").
Or: "why aren't you porting Sonic Mania instead? Do I have to do it myself?".
It's just a really toxic community full of entitled kids, so it's better that I cancel this project to focus on something I find more interesting and challenging.

Not being stuck with Sonic X-Treme levels and design also allows me to push the hardware much more.
1300 drawn polygons at 20 fps ftw.
And I'm still far from done optimizing everything.
10
Project announcement / Re: Sonic Z-Treme
« Last post by 20EnderDude20 on March 04, 2019, 12:39:22 am »
It really is a shame that you cancelled the project.