Capstone: Scheduling Meets fp_contract

Optimization & Scheduling

scheduling · fp_contract · capstone

Everything at once

Time to combine everything from the chapter. A linear interpolation, which computes a value that sits proportionally between two endpoints, is the beating heart of game code (it's literally what SFA's lighting lerps do). Write two of them and sum the results, and three optimizations fire together:

fp_contract fuses the multiply-then-add in each lerp into one fmadds per lerp (after the fsubs for the difference).
scheduling hoists all four lfs loads to the front and interleaves the two lerps so their latencies overlap.
the whole thing stays branch-free and tightly register-allocated thanks to -O4.
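As a reference point for the first item, here is a minimal sketch of a single lerp in the natural form. The function name lerp and the f32 typedef are assumptions made for illustration; the comments map each step onto the instruction it would typically become when fp_contract is on, and are not verified compiler output.

```c
typedef float f32;  /* assumption: f32 is a 32-bit float, as in the task signature */

/* Hypothetical helper: value that sits a fraction t of the way from a to b. */
f32 lerp(f32 a, f32 b, f32 t)
{
    f32 d = b - a;     /* the difference: one fsubs */
    return a + t * d;  /* multiply-then-add: the candidate for a single fmadds */
}
```

With fp_contract off, the last line would instead be expected to split into a separate multiply and add.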

For comparison, here is a three-component version (mix2) to show what fp_contract + scheduling looks like at a different arity:

lfs    f5, 0(r3)       # p[0]
lfs    f2, 0(r4)       # q[0]
lfs    f4, 4(r3)       # p[1]   — all six loads hoisted
lfs    f0, 4(r4)       # q[1]
fsubs  f2, f2, f5      # lerp0: q[0]-p[0]
lfs    f6, 8(r3)       # p[2]
fsubs  f0, f0, f4      # lerp1: q[1]-p[1]   interleaved
lfs    f3, 8(r4)       # q[2]
fmadds f2, f1, f2, f5  # lerp0 fused
fsubs  f3, f3, f6      # lerp2: q[2]-p[2]
fmadds f0, f1, f0, f4  # lerp1 fused
fmadds f1, f1, f3, f6  # lerp2 fused
fadds  f0, f2, f0
fadds  f1, f1, f0
blr

In the listing, each lerp's fsubs and fmadds are separated by the remaining loads and by instructions from the other lerps, rather than emitted back to back. That is the scheduler at work. Had this unit used #pragma scheduling off, each lerp would appear as a self-contained block before the next begins.
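To connect the mix2 listing back to source, the sketch below shows a C function with the same data flow: three lerps over p[0..2] and q[0..2], summed left to right. The parameter names p, q and t are taken from the asm comments; the body is an assumption about what such a source could look like, not the original, and it has not been checked against the compiler.

```c
typedef float f32;  /* assumption: f32 is a 32-bit float */

/* Hypothetical source shape for mix2: three lerps written in the natural form. */
f32 mix2(f32 *p, f32 *q, f32 t)
{
    f32 x = p[0] + t * (q[0] - p[0]);  /* lerp0: fsubs, then fmadds */
    f32 y = p[1] + t * (q[1] - p[1]);  /* lerp1 */
    f32 z = p[2] + t * (q[2] - p[2]);  /* lerp2 */
    return x + y + z;                  /* two fadds: (x + y), then + z */
}
```

The source is written one lerp at a time; any interleaving in the output comes from the scheduler, not from how the statements are ordered here.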

To work out the shape of blend, count the lfs instructions in its target asm: the base registers show how many arrays are involved, and the offsets show how many elements of each are read. Each fsubs/fmadds pair is one lerp.

Your task

Write blend(f32 *a, f32 *b, f32 t) so that it reproduces the target assembly for blend. Write the lerps in the natural form and let the optimizer fuse and interleave.


match blend · mwcceppc.exe -O4,p


Hit “Compile & Check” to diff your code against the target.