As far as I gather, the transparent pixels consume cycles because they are present in the instruction for as long as any pixel would be until the next instruction.
The key to the end codes is to truncate the length of the instruction, not so much to reduce the computational intensity of that instruction.
It being based on filling the frame-buffer, it is timed with Vblank, so the computational intensity of what can be done by Vblank is something that you can more or less assume will never change, as it's always the same amount of time. Thus reducing the time spent on regions of the instruction is how you save performance in this case.
[I realize this is a very obvious series of statements, perhaps I shouldn't conjecture about things I don't understand]
Also, that's some good work on the pre-distortion, enderdude.
[If it wasn't clear, you grab the top-right vertice and drag it one box-size up]