When the scheduler needs you to interleave
The optimization chapter taught you that -O4,p scheduling will happily batch two independent computations — write (a+b)*(c+d) and the four loads issue together. That works when there's one short dependency chain. This lesson is the case where the scheduler alone is not enough, lifted from Star Fox Adventures' voxmapsFn_80010ff4 (the voxel route-picker).
That function compares two neighbouring voxel columns by summing four occupancy bytes each — sumCur over one column, sumNext over the next — and picks the denser direction. The two sums are completely data-independent, so in principle the scheduler can run them in lockstep. The trap: how you spell the accumulation decides the emission order, and emission order seeds the schedule.
Written as two flat expressions —
sumCur = occ[0][0] + occ[0][1] + occ[0][2] + occ[0][3];
sumNext = occ[1][0] + occ[1][1] + occ[1][2] + occ[1][3];
— the front end materializes all of sumCur's chain, then all of sumNext's. The ,p scheduler reorders from that seed and produces a load layout that reads 2(r3), 1(r3), 6(r3), 5(r3), 3(r3) … with a spare third register — and it does not byte-match the target.
The real commit fixed it by weaving the two running sums together, one term at a time. To see the effect, consider a simpler three-element analogue — twoCol_compare, which compares two 3-byte columns data[0] and data[1]:
typedef struct { u8 data[2][3]; } TwoCol;
int twoCol_compare(TwoCol* t) {
int sumA, sumB;
sumA = t->data[0][0];
sumB = t->data[1][0];
sumA += t->data[0][1];
sumB += t->data[1][1];
sumA += t->data[0][2];
sumB += t->data[1][2];
if (sumA >= sumB) { return 0; }
return 1;
}
The two accumulators advance in lockstep and the schedule comes out clean — loads in offset order, the two sum registers built up alternately:
lbz r6, 0(r3) # sumA <- data[0][0]
lbz r0, 1(r3) # (data[0][1])
lbz r7, 3(r3) # sumB <- data[1][0]
lbz r5, 4(r3) # (data[1][1])
add r6, r6, r0 # sumA += data[0][1]
lbz r4, 2(r3)
lbz r0, 5(r3)
add r7, r7, r5 # sumB += data[1][1]
add r6, r6, r4 # sumA += data[0][2]
add r7, r7, r0 # sumB += data[1][2]
xor r0, r7, r6
srawi r3, r0, 1
and r0, r0, r7
subf r0, r0, r3
srwi r3, r0, 31
blr
r6 (sumA) and r7 (sumB) climb in strict alternation, loads come in column-interleaved order. That lockstep is the signature of a hand-interleaved two-accumulator sum. There is no pragma — scheduling is on, as always at -O4,p; the lever is the C shape that feeds it. The real voxmap_chooseDir has four terms per column instead of three — the same technique, one more pair of steps.
Why doesn't the scheduler produce this from the flat form? Its priority order is built on the dependency DAG seeded by emission order, and the flat form emits one whole chain before the other. The scheduler interleaves within the seed it's given; it won't re-found the chains. You hand it the weave by writing the weave.
Your task
Sum the four bytes of scan->occ[0] into sumCur and the four bytes of scan->occ[1] into sumNext, then return 0 if sumCur >= sumNext, else 1. To match the lockstep schedule above you must interleave the two accumulations term by term — not write two flat a+b+c+d sums.