Everything at once
This is where the chapter comes together. Linear interpolation, the trick of landing a value some fraction of the way between two endpoints, runs through game code everywhere (SFA's lighting lerps are built on it). Write a pair of them, add the results, and three of the passes you've met all go off at once:
- fp_contract collapses the multiply-then-add in each lerp into a single
fmadds, once the fsubs for the difference is out of the way.
- scheduling drags all four
lfs loads up front and interleaves the two lerps so one's latency hides behind the other's work.
-O4 keeps the result branch-free and packs it into a tight set of registers.
To see the same shape at a different arity, here's a three-component take, mix2, where fp_contract and scheduling do their thing on three lerps:
lfs f5, 0(r3) # p[0]
lfs f2, 0(r4) # q[0]
lfs f4, 4(r3) # p[1] — all six loads hoisted
lfs f0, 4(r4) # q[1]
fsubs f2, f2, f5 # lerp0: q[0]-p[0]
lfs f6, 8(r3) # p[2]
fsubs f0, f0, f4 # lerp1: q[1]-p[1] interleaved
lfs f3, 8(r4) # q[2]
fmadds f2, f1, f2, f5 # lerp0 fused
fsubs f3, f3, f6 # lerp2: q[2]-p[2]
fmadds f0, f1, f0, f4 # lerp1 fused
fmadds f1, f1, f3, f6 # lerp2 fused
fadds f0, f2, f0
fadds f1, f1, f0
blr
See how each lerp's fsubs and fmadds are braided into the next rather than sitting in their own block? That's the scheduler. Compile the same unit with #pragma scheduling off and every lerp would stand alone, finished before the following one starts.
For your blend, tally the lfs instructions in the target asm to learn how many arrays show up and how many elements each one reaches; every fsubs/fmadds pair marks one lerp.
Your task
Write blend(f32 *a, f32 *b, f32 t) to reproduce the assembly above. Write the lerps in the natural form and let the optimizer fuse and interleave.