Finagling that inverse bilinear shader into my own code wasn't so hard. For testing, I had it tell the GPU to draw two triangles over the entire framebuffer and let the shader discard the fragments that fall outside the quad. Unfortunately, that also makes the Z buffer useless, so I finagled a couple more algorithms to fit a convex shape over arbitrary quads. Now that I had several ways to draw a quad, I wondered how they would perform. So I had each method draw a million random quads, timed that 11 times in milliseconds, and threw away the first run in case it included shader initialization or other one-time costs.
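Performance is incredibly low because I'm sending quads to the GPU one at a time. The methods below are ordered from fastest to slowest. Each block lists the 10 kept runs in milliseconds, their average (avg), quads per second (qps), and how many quads fit in a single frame at 60 and 30 fps.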
/* DrawQuadQuick Colored quad. Tessellated as a triangle fan around the midpoint of the 4 vertices. That's 4 triangles. Sort of accurate for convex quads.
* 6867.3928
* 6995.4001
* 6851.3919
* 6944.3972
* 6838.3911
* 6815.3898
* 6864.3927
* 6919.3958
* 6901.3948
* 6857.3922
* avg 6885.49384
* qps 145232
* 60fps 2420
* 30fps 4841
*/
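The DrawQuadQuick split can be sketched like this, assuming the "midpoint" is the average of the four corners and the corners are in winding order. `Vec2`, `Tri`, and `fan_around_midpoint` are hypothetical names. For a concave quad, some fan triangles can fold over and overlap, which fits the "sort of accurate for convex quads" note.

```c
typedef struct { float x, y; } Vec2;
typedef struct { Vec2 v[3]; } Tri;

/* Split a quad (corners q[0..3] in winding order) into a fan of 4 triangles
   around the average of its corners. */
void fan_around_midpoint(const Vec2 q[4], Tri out[4]) {
    Vec2 m = {(q[0].x + q[1].x + q[2].x + q[3].x) * 0.25f,
              (q[0].y + q[1].y + q[2].y + q[3].y) * 0.25f};
    for (int i = 0; i < 4; i++) {
        out[i].v[0] = m;                 /* shared center vertex */
        out[i].v[1] = q[i];
        out[i].v[2] = q[(i + 1) % 4];    /* last triangle wraps back to q[0] */
    }
}
```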
/* DrawSpriteQuick Gouraud sprite. 4 tris. Accurate for squares only.
* 8050.4605
* 8045.4601
* 8059.4609
* 7989.457
* 8070.4616
* 8042.46
* 8043.46
* 8056.4608
* 7890.4514
* 8038.4598
* avg 8028.65921
* qps 124553
* 60fps 2075
* 30fps 4151
*/
/* DrawSpriteBilinearToScreen Gouraud sprite using the inverse bilinear shader. Draws 2 tris over the entire screen. Z buffer is useless.
* 9706.5551
* 9698.5547
* 9731.5566
* 9703.555
* 9732.5567
* 9648.5518
* 9703.555
* 9668.553
* 9691.5543
* 9756.5581
* avg 9704.15503
* qps 103048
* 60fps 1717
* 30fps 3434
*/
/* DrawSpriteBilinear Gouraud sprite using inverse bilinear shader with fitted shape. 1 tri for concave, else 2.
* 10868.6216
* 10942.6258
* 10955.6266
* 11025.6306
* 10959.6269
* 11031.631
* 10975.6278
* 11082.6339
* 10812.6185
* 10827.6193
* avg 10948.2262
* qps 91338
* 60fps 1522
* 30fps 3044
*/
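One way to fit a convex shape over an arbitrary quad is to take the convex hull of its four corners: a concave quad's hull is a triangle (the reflex corner lies inside the other three), while convex and self-intersecting ("bowtie") quads keep all four corners. That matches the "1 tri for concave, else 2" split. The sketch below uses Andrew's monotone chain and is an assumption about how the fitting could work, not the exact algorithm used here; `Vec2` and `quad_hull` are hypothetical names.

```c
#include <stdlib.h>

typedef struct { float x, y; } Vec2;

/* > 0 when o -> a -> b turns counter-clockwise */
static float orient(Vec2 o, Vec2 a, Vec2 b) {
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

static int cmp_xy(const void *pa, const void *pb) {
    const Vec2 *a = pa, *b = pb;
    if (a->x != b->x) return a->x < b->x ? -1 : 1;
    if (a->y != b->y) return a->y < b->y ? -1 : 1;
    return 0;
}

/* Convex hull of the 4 quad corners (Andrew's monotone chain), counter-clockwise.
   Returns the hull size: 3 for a concave quad, 4 for convex or bowtie quads. */
int quad_hull(const Vec2 q[4], Vec2 hull[4]) {
    Vec2 p[4], h[8];
    int k = 0;
    for (int i = 0; i < 4; i++) p[i] = q[i];
    qsort(p, 4, sizeof p[0], cmp_xy);
    for (int i = 0; i < 4; i++) {                  /* lower hull */
        while (k >= 2 && orient(h[k - 2], h[k - 1], p[i]) <= 0) k--;
        h[k++] = p[i];
    }
    for (int i = 2, t = k + 1; i >= 0; i--) {      /* upper hull */
        while (k >= t && orient(h[k - 2], h[k - 1], p[i]) <= 0) k--;
        h[k++] = p[i];
    }
    k--;                                           /* last point repeats the first */
    for (int i = 0; i < k; i++) hull[i] = h[i];
    return k;
}
```

The hull can then be drawn as a fan (hull[0], hull[i], hull[i + 1]), giving hull size minus 2 triangles. In this kind of setup the shader would still solve (u, v) against the original four corners, so the hull only limits which pixels run the shader.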
/* DrawQuad Colored quad tessellated into horizontal strips. 2 tris per strip. Reasonable accuracy with enough strips.
* 32 strips
* 11216.6415
* 11114.6357
* 10986.6284
* 11144.6374
* 11135.6369
* 11160.6384
* 11101.635
* 11102.635
* 11217.6416
* 11071.6332
* avg 11125.23631
* qps 89885
* 60fps 1498
* 30fps 2996
*/
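Horizontal strip tessellation can be sketched as follows, assuming corners ordered top-left, top-right, bottom-right, bottom-left and strips of constant v in texture space; `Vert`, `lerp_vert`, and `quad_to_strips` are hypothetical names. Every strip boundary lies exactly on the bilinear mapping, because at a fixed v the mapping is linear between the left and right edges, so only the interior of each strip is approximated by two affine triangles.

```c
typedef struct { float x, y, u, v; } Vert;   /* position + texture coords */

static Vert lerp_vert(Vert a, Vert b, float t) {
    Vert r = {a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t,
              a.u + (b.u - a.u) * t, a.v + (b.v - a.v) * t};
    return r;
}

/* Split a quad into n horizontal strips of 2 triangles each.
   Assumed corner order: q[0]=top-left, q[1]=top-right, q[2]=bottom-right, q[3]=bottom-left.
   out must hold 6*n vertices (a plain triangle list); returns the number written. */
int quad_to_strips(const Vert q[4], int n, Vert *out) {
    int k = 0;
    for (int i = 0; i < n; i++) {
        float t0 = (float)i / (float)n, t1 = (float)(i + 1) / (float)n;
        Vert l0 = lerp_vert(q[0], q[3], t0), r0 = lerp_vert(q[1], q[2], t0); /* strip top */
        Vert l1 = lerp_vert(q[0], q[3], t1), r1 = lerp_vert(q[1], q[2], t1); /* strip bottom */
        out[k++] = l0; out[k++] = r0; out[k++] = r1;   /* upper-right triangle */
        out[k++] = l0; out[k++] = r1; out[k++] = l1;   /* lower-left triangle */
    }
    return k;
}
```

Passing the texture height as n gives one strip per texel row (the "strips == texture height" case), at the cost of 2 triangles per row.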
/* DrawSprite Gouraud sprite tessellated into horizontal strips. 2 tris per strip. Very accurate when strips == texture height. Reasonable accuracy with enough strips.
* 32 strips
* 13811.79
* 14014.8016
* 13926.7966
* 13998.8007
* 14025.8023
* 13840.7916
* 13935.7971
* 13998.8007
* 13912.7958
* 13902.7952
* avg 13936.89716
* qps 71751
* 60fps 1195
* 30fps 2391
*
* 96 strips
* 25656.4675
* 25565.4623
* 25539.4608
* 25433.4547
* 25344.4496
* 25326.4486
* 25285.4462
* 25299.447
* 25298.4469
* 25495.4583
* avg 25424.45419
* qps 39332
* 60fps 655
* 30fps 1311
*/
145 kqps DrawQuadQuick
125 kqps DrawSpriteQuick
103 kqps DrawSpriteBilinearToScreen
091 kqps DrawSpriteBilinear
090 kqps DrawQuad 32 strips
072 kqps DrawSprite 32 strips
039 kqps DrawSprite 96 strips (texture height)
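The derived numbers in each block can be recomputed from the raw timings. The sketch below assumes qps is 1,000,000 quads divided by the average time in seconds, that the 60fps and 30fps lines are qps / 60 and qps / 30 (quads per frame), and that all three are truncated rather than rounded; under those assumptions it reproduces every block above. `bench_stats` and `BenchStats` are hypothetical names, and the input is the 10 kept runs, since the warm-up run isn't listed.

```c
#include <stddef.h>

typedef struct {
    double avg_ms;      /* mean of the kept runs, milliseconds */
    long qps;           /* quads per second, truncated */
    long per_frame_60;  /* quads per frame at 60 fps, truncated */
    long per_frame_30;  /* quads per frame at 30 fps, truncated */
} BenchStats;

/* kept_ms: timings with the warm-up run already dropped; quads: quads drawn per run. */
BenchStats bench_stats(const double *kept_ms, size_t n, double quads) {
    BenchStats s;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += kept_ms[i];
    s.avg_ms = sum / (double)n;
    double qps = quads / (s.avg_ms / 1000.0);
    s.qps = (long)qps;
    s.per_frame_60 = (long)(qps / 60.0);
    s.per_frame_30 = (long)(qps / 30.0);
    return s;
}
```

For DrawQuadQuick, that gives avg 6885.49384 ms, 145232 qps, 2420 quads per frame at 60 fps, and 4841 at 30 fps.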
For comparison, here are the Sega Saturn's VDP1 figures from:
https://segaretro.org/Sega_Saturn/Technical_specifications#VDP1
Polygon rendering performance:
800,000 polygons/s: Flat shading, 32-pixel polygons
500,000 polygons/s: Flat shading, 50-pixel polygons
200,000 polygons/s: Gouraud shading, 32-pixel polygons
Texture mapping performance:
300,000 polygons/s: 32-texel textures
200,000 polygons/s: 70-texel textures
140,000 polygons/s: Gouraud shading, 32-texel textures
All of these numbers are from my desktop computer, which runs recent games at medium-ish settings, depending on the game. On more modest machines, like my laptop, performance might fall well below Saturn level.
Anyway, I think performance will be more than satisfactory once I figure out how to send quads to the GPU in bulk.
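One common way to send quads in bulk is to collect their vertices in a CPU-side array and hand the whole array to the GPU in a single upload and draw call (in OpenGL, for example, `glBufferSubData` followed by `glDrawArrays`). The sketch below shows only the CPU side and is an assumption, not this project's code; `BatchVert`, `QuadBatch`, `batch_push_quad`, `batch_flush`, and the flush callback are hypothetical names, and the callback is where the real upload and draw would go. If the bilinear shader reads the four corners from uniforms, batching would also mean moving them into per-vertex attributes, since uniforms can't change within a single draw call.

```c
#include <stddef.h>

/* Hypothetical vertex layout: position, texture coords, Gouraud color. */
typedef struct { float x, y, u, v, r, g, b, a; } BatchVert;

#define BATCH_MAX_QUADS 1024
#define VERTS_PER_QUAD 6                       /* 2 triangles, no index buffer */
#define BATCH_CAPACITY (BATCH_MAX_QUADS * VERTS_PER_QUAD)

/* Receives a filled buffer; the real GPU upload + draw call would go here. */
typedef void (*FlushFn)(const BatchVert *verts, size_t count, void *user);

typedef struct {
    BatchVert verts[BATCH_CAPACITY];
    size_t count;
    FlushFn flush;
    void *user;
} QuadBatch;

/* Hand all queued vertices to the callback in one call, then start over. */
void batch_flush(QuadBatch *b) {
    if (b->count > 0) {
        b->flush(b->verts, b->count, b->user);
        b->count = 0;
    }
}

/* Queue one quad (corners in winding order) as triangles 0-1-2 and 0-2-3,
   flushing first if it would not fit. */
void batch_push_quad(QuadBatch *b, const BatchVert q[4]) {
    static const int order[VERTS_PER_QUAD] = {0, 1, 2, 0, 2, 3};
    if (b->count + VERTS_PER_QUAD > BATCH_CAPACITY) batch_flush(b);
    for (int i = 0; i < VERTS_PER_QUAD; i++) b->verts[b->count++] = q[order[i]];
}
```

With this layout, 2500 quads cost three flushes (two full buffers plus the remainder) instead of 2500 separate draws; call `batch_flush` once more at the end of each frame.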
Just for fun, here are a million quads with and without the Z buffer.

