Jo Engine Forum

Sega Saturn Development => Project announcement => Topic started by: ponut64 on August 20, 2018, 09:22:52 pm

Title: what p64 does
Post by: ponut64 on August 20, 2018, 09:22:52 pm
So I'll start a thread, I guess, instead of spamming the help forum :)

Anyone got clues about delayed audio? Might the B-Bus be saturated, causing a command delay?

If anyone wants to build/look at the code, here it is:

And yes, most of my posts are going to be about problems, as is typical...
Title: Re: what p64 does
Post by: ponut64 on August 21, 2018, 04:23:46 pm
There was never any delay... it was just the capture device being the garbage it is. ::)
Title: Re: what p64 does
Post by: ponut64 on September 02, 2018, 06:05:41 am
early physics?
Title: Re: what p64 does
Post by: Cerv3ro on September 06, 2018, 12:01:21 am
Great job! Did you know anything about development before? Do you have an idea of what to create, or are you only testing?
Title: Re: what p64 does
Post by: ponut64 on September 06, 2018, 02:20:16 am
1. I've never developed software before.
2. I've never programmed in the C language before. I did take some training courses on C# and C++ in Microsoft Visual Studio, but those did not produce an executable.
3. I've never made development tools before. As I see it, this is the true mark of a game development programmer. XL2 and Jo made the development tools.
4. I have had experience working in Unreal Engine, Torque, CryEngine, and TRIBES, either making simple mods or making maps. I never finished anything but relatively simple maps.
5. My 3D modelling experience comes from making 3D printed objects, such as detailed figurines or mechanical objects. Like, for instance, an entire computer case.
That is a vastly different field, because there you are required to make a manifold object. Video games carry no such requirement, and non-manifold objects often consume fewer polygons.

6. As far as what game to make, my philosophy is that what I can program should define what I can and will make.
Following that, I am going to do testing as far as I can until I feel as if I have programmed enough mechanics to make a satisfactory video game.

7. I work slowly.
Title: Re: what p64 does
Post by: ponut64 on September 07, 2018, 08:43:36 am
While experimenting with true box-to-box collision detection (rather than only points), I stumbled on an alternate way to test a point against a plane normal.
It's worth sharing because it's basically 16-bit.

Code: [Select]
#include "sgl.h"	//Sint16/Sint32 types and the X, Y, Z, XYZ indices

Sint32	pt_col_plane(Sint16 planept[XYZ], Sint16 ptoffset[XYZ], Sint16 normal[XYZ], Sint16 offset[XYZ])
{
	//Using a NORMAL OF A PLANE which is also a POINT ON THE PLANE, check IF A POINT IS ON THAT PLANE.

	//The normal, which is also a POINT ON THE PLANE, needs an actual position. WE FIND IT HERE.
	Sint16 realNormal[XYZ] = {normal[X] - offset[X], normal[Y] - offset[Y], normal[Z] - offset[Z]};
	//The point being tested, moved to its actual position.
	Sint16 realpt[XYZ] = {planept[X] + ptoffset[X], planept[Y] + ptoffset[Y], planept[Z] + ptoffset[Z]};

	//Vector from the tested point to the point on the plane.
	Sint16 pNn[XYZ] = {realNormal[X] - realpt[X], realNormal[Y] - realpt[Y], realNormal[Z] - realpt[Z]};

	//The NORMAL of the plane as a direction has NO REAL POSITION; it is FROM ORIGIN. We use the raw normal here.
	//If the dot product is zero, the point lies on the plane.
	//Whether the dot product is negative or positive can be used to determine whether the point has passed the plane.
	//vectori_dot() is the project's own integer dot product helper (not shown in this post).
	Sint32 dot = vectori_dot(pNn, normal);

	return dot;
}

If you're wondering what else I would use, it's the separating axis theorem, which, while simple to explain, is much more verbose in code, so I won't go into it much more.
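For reference, here is a minimal sketch of the separating axis test in C. It is not the code from this project: the vec3i type and the helper functions are hypothetical, each box is assumed to be given as its 8 corner points, and the caller supplies the candidate axes (for two oriented boxes in 3D, that would be the 3 face normals of each box plus the 9 cross products of their edge directions; for axis-aligned boxes, just X, Y and Z). Coordinates are assumed small enough that the 64-bit projections don't overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical integer 3D vector for this sketch. */
typedef struct { int32_t x, y, z; } vec3i;

static int64_t dot3(vec3i a, vec3i b)
{
    return (int64_t)a.x * b.x + (int64_t)a.y * b.y + (int64_t)a.z * b.z;
}

/* Project all 8 corners onto the axis and keep the min and max. */
static void project(const vec3i corners[8], vec3i axis, int64_t *mn, int64_t *mx)
{
    *mn = *mx = dot3(corners[0], axis);
    for (int i = 1; i < 8; i++) {
        int64_t p = dot3(corners[i], axis);
        if (p < *mn) *mn = p;
        if (p > *mx) *mx = p;
    }
}

/* Separating axis test: the boxes overlap only if their projections
   overlap on every candidate axis. */
static bool boxes_overlap(const vec3i a[8], const vec3i b[8], const vec3i *axes, int naxes)
{
    for (int i = 0; i < naxes; i++) {
        vec3i ax = axes[i];
        if (ax.x == 0 && ax.y == 0 && ax.z == 0)
            continue;               /* degenerate axis, e.g. cross product of parallel edges */
        int64_t amin, amax, bmin, bmax;
        project(a, ax, &amin, &amax);
        project(b, ax, &bmin, &bmax);
        if (amax < bmin || bmax < amin)
            return false;           /* found a separating axis */
    }
    return true;
}

/* Build the 8 corners of an axis-aligned box from its min and max corners. */
static void make_box(vec3i lo, vec3i hi, vec3i out[8])
{
    for (int i = 0; i < 8; i++) {
        out[i].x = (i & 1) ? hi.x : lo.x;
        out[i].y = (i & 2) ? hi.y : lo.y;
        out[i].z = (i & 4) ? hi.z : lo.z;
    }
}
```

Touching intervals count as overlapping here; change the comparisons to `<=` if you want touching boxes treated as separate.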
Title: Re: what p64 does
Post by: ponut64 on February 12, 2019, 12:07:32 am
I guess I should update this, since I'm not dead.

Many methods have changed, including moving back to FIXED-point math for precision and speed.
If it looks like I've been re-treading the same math for months, I have.

I also have one pending / one visible contribution here:
He's working on other stuff too.
Title: Re: what p64 does
Post by: Cerv3ro on February 12, 2019, 12:51:30 am
Good to know you are making progress. In this new video those fixes are noticeable. Even though I don't understand anything about what is happening (in development terms), you have created a good tech demo with animations and physics. You have my attention, dude.
I have also watched those Emeraldnova tutorials... but it's too much for me. I hope someone with development skills like yours finds them useful.
Title: Re: what p64 does
Post by: ponut64 on February 27, 2019, 03:58:55 am

it gets worse
Title: Re: what p64 does
Post by: ponut64 on March 17, 2019, 03:57:40 am
Title: Re: what p64 does
Post by: ponut64 on March 30, 2019, 06:53:49 pm
The DSP as a Logical Processor

I've started experimenting with the DSP and have found, contrary to my preconceptions, that it is a logical processor. Now, of course, no one told me it wasn't; I just didn't know it was.
So to familiarize myself with its inner workings, I wrote bitwise division programs: one for signed values, one for unsigned values.

The DSP has no division instruction, so you need to write your own program for it.
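The DSP programs themselves aren't reproduced here, but the underlying technique, restoring shift-and-subtract division, can be sketched in portable C. This is an illustration under my own assumptions (32-bit operands, a nonzero divisor, signed results truncated toward zero like C's / operator), not a transcription of the DSP code, and the function names are hypothetical.

```c
#include <stdint.h>

/* Unsigned restoring division: one quotient bit per step, using only
   shifts, compares and subtracts (no divide instruction needed).
   Precondition: d != 0. */
static uint32_t udiv_bitwise(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint64_t r = 0;                       /* 64-bit so r << 1 can't overflow when d > 2^31 */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next dividend bit */
        if (r >= d) {                     /* divisor fits: subtract and set this quotient bit */
            r -= d;
            q |= (uint32_t)1 << i;
        }
    }
    if (rem)
        *rem = (uint32_t)r;
    return q;
}

/* Signed division: divide the magnitudes, then apply the sign.
   Truncates toward zero, like C's / operator. Precondition: d != 0. */
static int32_t sdiv_bitwise(int32_t n, int32_t d)
{
    uint32_t un = (n < 0) ? 0u - (uint32_t)n : (uint32_t)n;
    uint32_t ud = (d < 0) ? 0u - (uint32_t)d : (uint32_t)d;
    uint32_t uq = udiv_bitwise(un, ud, 0);
    return ((n < 0) != (d < 0)) ? (int32_t)(0u - uq) : (int32_t)uq;
}
```

The loop always runs 32 times regardless of the operands, which is the same kind of fixed-count loop a DSP program would drive with its loop counter.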
Title: Re: what p64 does
Post by: ponut64 on April 04, 2019, 04:24:46 pm
Hi again,

Here's an SGL-compatible fast inverse square root function.

Code: [Select]
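#include "sgl.h"	//FIXED type and slMulFX()

FIXED		fxisqrt(FIXED input){

	FIXED xSR;
	FIXED pushRight;
	FIXED msb;
	FIXED shoffset;
	FIXED yIsqr;

	//This version does not handle inputs at or below 1.0 (65536); it returns 1, the smallest positive FIXED value.
	if(input <= 65536){
		return 1;
	}

	xSR = input>>1;		//x / 2, used by the Newton step
	pushRight = input;
	msb = 0;

	//Count how many right shifts it takes to bring the input below 1.0.
	while(pushRight >= 65536){
		pushRight >>= 1;
		msb++;
	}

	//Initial guess: a power of two near 1/sqrt(input), based on the shift count.
	shoffset = (16 - ((msb)>>1));
	yIsqr = 1<<shoffset;

	//One Newton-Raphson step: y * (1.5 - (x/2) * y * y). 98304 is 1.5 in FIXED.
	return (slMulFX(yIsqr, (98304 - slMulFX(xSR, slMulFX(yIsqr, yIsqr)))));
}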
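For anyone wondering where 98304 comes from: the return line is one Newton-Raphson step for 1/sqrt(x). Here y is the current guess (yIsqr) and x is the input; in 16.16 fixed point, 3/2 is 1.5 × 65536 = 98304 and x/2 is input>>1 (xSR).

```latex
f(y) = \frac{1}{y^{2}} - x, \qquad f'(y) = -\frac{2}{y^{3}}

y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)}
        = y_n + \frac{y_n - x\,y_n^{3}}{2}
        = y_n\left(\frac{3}{2} - \frac{x}{2}\,y_n^{2}\right)
```

Newton's method converges quadratically near the root, so the closer the power-of-two starting guess is, the better a single step does.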
Title: Re: what p64 does
Post by: ponut64 on April 06, 2019, 12:20:01 pm
Hello again,

Here is a DSP sample program for finding the normal of a polygon.

Because assembly is so hard to follow when others write it, I comment as much as possible. That makes the whole document harder to skim, but each step easier to follow.

/e: Hm, hold off. I don't think it's working right.
/e2: Fixed errors.
If you are curious:
1. Line 201 was moving the low-order bits ("mov all,mc3") when it should have been moving the high-order bits ("mov alh,mc3").
2. The instruction at line 277 was changed from (mvi 16,PL) to (mvi 17,PL), because our shifting output was 1 less than it should be compared to the typical C logic.
3. Line 289 had an extra instruction added after it, which shifts the initial guess back right once. Because the DSP pre-fetches instructions, by the time a "loop next instruction" command has run its designated number of loops, the next instruction has already been fetched, so it executes once more. In other words, an LPS command executes the next instruction 1 more time than indicated in the LOP counter.
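When checking a DSP routine like this, it helps to have a known-good result to compare against. Below is a minimal hosted-C sketch, not the DSP program's algorithm: it takes the cross product of two edges from vertex 0 and normalizes it with an integer square root, all in 16.16 fixed point. The fixed16 type and the function names are hypothetical, the winding/sign convention is an assumption (swap v1 and v2 if your models use the other winding), and coordinates are assumed small enough that the 64-bit intermediates don't overflow.

```c
#include <stdint.h>

typedef int32_t fixed16;   /* 16.16 fixed point, same layout as SGL's FIXED */

/* Bitwise integer square root: floor(sqrt(x)). */
static uint64_t isqrt64(uint64_t x)
{
    uint64_t res = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* highest power of four in 64 bits */
    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}

/* Unit normal of the polygon through v0, v1, v2: cross(v1 - v0, v2 - v0), normalized. */
static void poly_normal_ref(const fixed16 v0[3], const fixed16 v1[3],
                            const fixed16 v2[3], fixed16 out[3])
{
    int64_t a[3], b[3], n[3];
    for (int i = 0; i < 3; i++) {
        a[i] = (int64_t)v1[i] - v0[i];
        b[i] = (int64_t)v2[i] - v0[i];
    }
    /* Products of two 16.16 values are 32.32; dividing by 65536 brings them back to 16.16. */
    n[0] = (a[1] * b[2] - a[2] * b[1]) / 65536;
    n[1] = (a[2] * b[0] - a[0] * b[2]) / 65536;
    n[2] = (a[0] * b[1] - a[1] * b[0]) / 65536;

    /* The sum of squares is 32.32, so its square root is the length in 16.16. */
    uint64_t len = isqrt64((uint64_t)(n[0] * n[0]) + (uint64_t)(n[1] * n[1]) +
                           (uint64_t)(n[2] * n[2]));
    for (int i = 0; i < 3; i++)
        out[i] = (len != 0) ? (fixed16)(n[i] * 65536 / (int64_t)len) : 0;
}
```

For a triangle at (0,0,0), (1,0,0), (0,1,0) this gives (0, 0, 65536), i.e. a unit normal along +Z, which a DSP result can be compared against.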
Title: Re: what p64 does
Post by: ponut64 on April 14, 2019, 05:18:17 am
Title: Re: what p64 does
Post by: ponut64 on April 15, 2019, 04:05:55 am

The DSP is being used. I have attached the DSP program.

Performance Hints:
The best path to improved performance on the Saturn is what is frequently called "Data Oriented Design", or "DOD".
In general, your philosophy is to ensure the least amount of data is processed, moved, and accessed.
This is absolutely at odds with the prevalent modern philosophy of programming called "Object Oriented Design", or "OOD".
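As a rough illustration of the difference (a hypothetical example, not code from this project): the same gravity update written against an object-style struct and against a data-oriented layout where each field the hot loop needs sits in its own packed array. Both produce the same numbers; what changes is how much unrelated data passes through the cache. The struct, field and function names are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTITIES 64

/* Object-style layout: everything about an entity in one 64-byte struct. */
typedef struct {
    int32_t pos[3];
    int32_t vel[3];
    char    name[32];
    int32_t health;
    int32_t anim_frame;
} entity_obj;

/* Data-oriented layout: only the fields this update needs, each in its own packed array. */
typedef struct {
    int32_t pos_y[MAX_ENTITIES];
    int32_t vel_y[MAX_ENTITIES];
    size_t  count;
} entities_dod;

/* Uses 8 bytes per entity but strides over whole 64-byte structs,
   so each cache line fetched is mostly bytes this loop never touches. */
static void gravity_obj(entity_obj *e, size_t count, int32_t g)
{
    for (size_t i = 0; i < count; i++) {
        e[i].vel[1] += g;
        e[i].pos[1] += e[i].vel[1];
    }
}

/* Walks two dense arrays: every cache line fetched is data the loop uses. */
static void gravity_dod(entities_dod *e, int32_t g)
{
    for (size_t i = 0; i < e->count; i++) {
        e->vel_y[i] += g;
        e->pos_y[i] += e->vel_y[i];
    }
}
```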

SGL Anomalies

SGL's documents state that SCU DMA Channel 0 and CPU DMA Channel 0 are "free" in SGL.
However, my observations indicate otherwise.
Let me try and walk you through what was happening.

First, we have a model. It's 625 vertices and 576 polygons.
Let's say we recalculate these polygons' normals and textures every frame before we send the model to be drawn with slPutPolygon.
Normally? This is actually OK. Using fast inverse square root, calculating a normal is a relatively inexpensive task (fewer than 100 instructions).
If we round that up to 100 and do it 576 times, we get 57,600 instructions.
Considering an SH2 has 28,000 instructions to offer per millisecond, it only takes about 2ms to perform that task.
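Worked through with the post's own round numbers (about 100 instructions per normal and about 28,000 instructions per millisecond), the estimate is:

```latex
576 \times 100 = 57{,}600\ \text{instructions}

\frac{57{,}600\ \text{instructions}}{28{,}000\ \text{instructions/ms}} \approx 2.06\ \text{ms}
\approx 12\%\ \text{of a } 16.7\ \text{ms (60 Hz) frame}
```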

But there is a problem with this theory.
The first problem is that we have to read the vertices data from memory. This is not instantaneous.
Another problem is we have to write normals back to memory.
The final problem is the Slave SH2 needs to access this data to draw the polygon, as SGL's default behavior is to use the Slave SH2 for all polygon / matrix processing.

However, by itself, this process does not cause any major performance or synchronization problem.
Remember your program runs on the MSH2. If you have the MSH2 perform this updating normals task before slPutPolygon is reached, it will all be fine.
You might get some small bus contention if the SSH2 is currently crunching numbers on some previous model you sent, but that is not a major concern.

But let me dig into this further.
An important part of SGL is that it is wisely set up to send draw commands to VDP1 in a way that wastes as little time as possible.
Keep in mind that accessing VDP1's memory will halt VDP1's operation until some cycles after memory access is complete.
Because of this, you should not frequently access VDP1's memory to inform it about what to draw. Rather, SGL prepares all of your draw commands and sends them to VDP1 in one big batch.

SGL does not do this in a linear fashion, however. It is buffered so the SH2s don't waste time waiting for VDP1 to finish drawing before they can send the next frame.
As far as I understand it, the code you are running right now is two frames behind the frame that is currently being sent to VDP1.
One consequence is that your frame-time is inevitably limited by the transfer time of the frame's data, so some time is wasted, maybe 4ms?
So the SH2s and VDP1 do not have the full frame-time to render a frame.

You can reasonably ignore this fact on its own, except for knowing that you never actually have the full 16/33/50/66ms to do anything; you always have less.
More importantly, the Slave SH2 pretty much _always_ wants access to the memory to transfer the frame to VDP1. Maybe not always, but it is safer to assume so!
So not only does the SSH2 need memory access, it also needs a DMA channel.

Keep in mind that no two processors can access high/low memory at the same time. There is a priority order.
The order is (in SGL): SCU > Master > Slave.

So what happens if SGL ABSOLUTELY MUST send the next frame on time (that is its goal: to be fast at 3D), but your code is accessing high memory, and the processor which manages this is the Slave SH2?
... You can call me out on this if you know differently, but my assumption is that the Slave SH2 will switch to using SCU DMA Channel 0 (the normal channel for slDMACopy) to gain priority over MSH2.

This causes a number of problems.
1. The SCU is slower at accessing memory than the SH2s. [Assumption, unverified, but I think it is]
2. The SCU-DSP may only DMA using SCU DMA Channel 0, so this may directly contend with a DSP program and potentially cause it to malfunction.
3. SBL's file system uses SCU DMA Channel 0 if you set GFS_TMODE_SCU.
4. Imagine any other thing that might be using SCU DMA Channel 0, an assumed "free" channel, and guess what would happen if SGL suddenly wants to use it.
5. SCU can't access low memory, so contention in this range means guaranteed wait cycles. Frankly, that's probably better!

Again, this typically is not a problem, but because SGL is a black box it is unknown when it may enter this condition. It does not fire interrupts when it is or isn't transferring the next frame's data.

How could I come to this conclusion, and where _might_ it be a problem?

In my case, I had two DSP programs and file system transfers active using SCU DMA (to of course leave a DMA channel open on MSH2).
And, lo and behold, if you calculate 576 poly normals while the file system is being accessed via SCU (to the SCSP area), every frame in which both are happening spikes to 50ms.

This is an interesting cascade of contentions.
First, MSH2 and SSH2 want to access memory at the same time.
For SSH2 to gain priority over MSH2, it commands SCU to access the memory.
Then, an unexpected DMA channel is used, which interferes with SH-1, DSP, and MSH2.
In turn, this interferes once again with the SSH2's desired actions.
All of this piles up into a long delay before the data ends up in VDP1's memory and we move along to the next frame, because we can't just send the draw commands as we go.

Another anomaly is that this contention is theoretically worse than simply making MSH2 or SSH2 wait for memory access.
The file system is independent. It's not waiting for the MSH2 to tell it to start or stop transfers on a sub-frame basis; the SH-1 manages that.
Further, the DMA method used is the SCU. The whole CPU Bus has nothing to do with that data after the read commands are sent.
The DSP is also independent and internal to the SCU. If SCU-DMA Channel 0 is used up, the DSP's default behavior is to wait for completion before continuing.
I verified the DSP being uninvolved in this contention cascade by disabling the DSP programs and instead performing the calculations on MSH2. Still happens.

Finally, I got to testing what would happen if I made the normal calculations at sprite draw end via interrupt. Instead of frame-spikes, now EVERY frame was 66ms.
Then, I tested these calculations using slSlaveFunc, which puts your function after all draw commands on the Slave SH2's stack. Now frame-time depended on render load (could be 33ms, could be 50).
In both cases, the frame-times were decoupled from the file system.

The bottom line? Memory access is bad.
If you're XL2, the bottom line is asynchronous file systems are bad :)
All this complication, but it's that simple.

The solution to this problem is to keep the Master SH2 working more inside of its own cache, rather than stretching computations out so much that they involve more memory access.
Again, computing the normals themselves, no problem for the SH2. But each normal computed is unique data that starts filling cache with junk data that won't be re-used.
Instead of calculating every normal, I solved the problem by first seeking through the polygon data, with the data that changes concatenated into a single 32-bit number per polygon.
That leaves less data to sort through to find out whether a polygon changed. If it DID change, I can look backwards from the direction it moved to find it again.
With that known, at most 24 polygons ever need new normals (1 row in 24x24 = 576).
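A minimal sketch of the change-detection idea in C, with loud assumptions: the grid is stored as 25x25 signed 8-bit heights, and each quad's key is its four corner heights packed into one 32-bit number. That packing, the array layout, and the function names are all hypothetical, since the post doesn't say exactly which data goes into the key, and the step of looking backwards along the direction of movement is not shown.

```c
#include <stdint.h>

#define GRID_VERTS 25                 /* 25x25 = 625 vertices, as in the post */
#define GRID_POLYS (GRID_VERTS - 1)   /* 24x24 = 576 quads */

/* Hypothetical packing: a quad's four corner heights, 8 bits each, in one 32-bit key. */
static uint32_t poly_key(int8_t h[GRID_VERTS][GRID_VERTS], int row, int col)
{
    return ((uint32_t)(uint8_t)h[row][col]         << 24) |
           ((uint32_t)(uint8_t)h[row][col + 1]     << 16) |
           ((uint32_t)(uint8_t)h[row + 1][col + 1] <<  8) |
            (uint32_t)(uint8_t)h[row + 1][col];
}

/* Compare each quad's key with the previous frame's and flag only the changed ones.
   Returns how many quads need their normal recomputed. */
static int find_changed_polys(int8_t h[GRID_VERTS][GRID_VERTS],
                              uint32_t prev_keys[GRID_POLYS * GRID_POLYS],
                              uint8_t dirty[GRID_POLYS * GRID_POLYS])
{
    int changed = 0;
    for (int r = 0; r < GRID_POLYS; r++) {
        for (int c = 0; c < GRID_POLYS; c++) {
            int i = r * GRID_POLYS + c;
            uint32_t k = poly_key(h, r, c);
            dirty[i] = (k != prev_keys[i]);
            if (dirty[i]) {
                prev_keys[i] = k;     /* remember the new state for next frame */
                changed++;
            }
        }
    }
    return changed;
}
```

Only the quads flagged in dirty need a new normal, so the expensive part scales with how much of the mesh actually moved rather than with all 576 polygons, and the scan itself reads one 32-bit word per quad from the previous frame.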

This is an application-specific solution, but it is an example where computational simplicity was actually very much counter-productive. Instead, the program was made more complex, but faster.
Welcome to Saturn.

Title: Re: what p64 does
Post by: Emerald Nova on April 15, 2019, 04:34:25 am
The lesson I'm getting from this is that if you use SGL (meaning Jo Engine or Z-treme Tools as well), the DSP will not play nice without a lot of extra considerations.
Title: Re: what p64 does
Post by: ponut64 on May 26, 2019, 12:59:39 pm
Title: Re: what p64 does
Post by: ponut64 on November 23, 2019, 02:34:41 pm
ZIP file contains source code for new render path, including animated entities from XL2's binary file converter.
Also compiles in high-res mode.