Scheduling Floating-Point Work

Optimization & Scheduling

scheduling, floating-point, latency

FP latencies are long — so the scheduler tries hardest here

Floating-point loads and fmuls have multi-cycle latencies, so the scheduler that is active at -O4,p shows up most visibly on FP code. Written in source order, you would expect: load pair, multiply, load pair, multiply-add. Instead, the scheduler hoists the later loads and starts one product early, weaving the two computations together:

lfs    f1, 4(r3)     # second element of a loaded first
lfs    f0, 4(r4)     # matching element of b
lfs    f2, 0(r3)     # first element of a slotted in behind
fmuls  f0, f1, f0    # one product started early
lfs    f1, 0(r4)     # remaining element of b arrives while fmuls runs
fmadds f1, f2, f1, f0
blr

The loads and the two FP ops are interleaved, not grouped per term. The source was written as two independent products summed together; the scheduler decided when to issue each load. This is the same scheduler from lesson 2, but the payoff is larger because FP stalls are longer.
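To make the interleaving concrete, the listing can be modelled in C with one local per FP register. This is only a sketch under stated assumptions: the `f32` typedef, the function names, and the choice of a[0]*b[0] + a[1]*b[1] as the source-order expression are illustrative readings of the listing, and the fmadds is written as a plain multiply followed by an add, whereas the real instruction rounds once.

```c
#include <assert.h>

typedef float f32;  /* assumed alias, matching the f32 in the task signature */

/* Source order: each term is loaded and multiplied in turn. */
f32 dot2_source_order(const f32 *a, const f32 *b)
{
    return a[0] * b[0] + a[1] * b[1];
}

/* Scheduled order: locals are named after the registers in the listing.
   r3 holds a and r4 holds b (first and second pointer arguments). */
f32 dot2_scheduled_order(const f32 *a, const f32 *b)
{
    f32 f0, f1, f2;
    f1 = a[1];          /* lfs    f1, 4(r3) */
    f0 = b[1];          /* lfs    f0, 4(r4) */
    f2 = a[0];          /* lfs    f2, 0(r3) */
    f0 = f1 * f0;       /* fmuls  f0, f1, f0 */
    f1 = b[0];          /* lfs    f1, 0(r4) */
    f1 = f2 * f1 + f0;  /* fmadds f1, f2, f1, f0 (modelled unfused here) */
    return f1;          /* blr: the float result is returned in f1 */
}

/* Both orders should agree on inputs whose products and sum are exact. */
void check_dot2_models(void)
{
    const f32 a[2] = { 1.0f, 2.0f };
    const f32 b[2] = { 3.0f, 4.0f };
    assert(dot2_source_order(a, b) == dot2_scheduled_order(a, b));
}
```

The two functions compute the same sum; only the order in which the loads are issued differs, which is the part the scheduler controls.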

Note that fmadds comes from fp_contract fusion, not from the scheduler — they're two independent mechanisms both active at -O4,p. The next lesson covers fp_contract in detail; for now just notice it exists so you don't attribute the fused multiply-add to scheduling.

When an FP target's loads look shuffled across the multiplies, suspect the scheduler before you suspect an exotic source expression.

Your task

Write dot2(f32 *a, f32 *b) to reproduce the assembly above. Read the load offsets to determine which elements from each array are paired together, then write the natural C and let the scheduler interleave.
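Reading the offsets comes down to dividing the byte displacement of each lfs by the element size. A minimal sketch, assuming `f32` is a 4-byte single-precision float; the helper name `lfs_offset_to_index` is hypothetical and exists only for illustration:

```c
#include <assert.h>
#include <stddef.h>

typedef float f32;  /* assumed alias for a 4-byte single-precision float */

/* Convert the byte displacement of an lfs into an array element index. */
size_t lfs_offset_to_index(size_t byte_offset)
{
    /* A displacement that is not a multiple of the element size would
       not address a whole element. */
    assert(byte_offset % sizeof(f32) == 0);
    return byte_offset / sizeof(f32);
}
```

With this mapping, 0(r3) addresses index 0 of the array passed in r3 and 4(r3) addresses index 1; the same holds for r4.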

Hints

match dot2 with mwcceppc.exe -O4,p

Hit “Compile & Check” to diff your code against the target.