Two Running Sums, Woven by Hand: Steering the ,p Scheduler

Real-World Mastery

schedulinginstruction-orderaccumulation

When the scheduler needs you to interleave

The optimization chapter taught you that -O4,p scheduling will happily batch two independent computations — write (a+b)*(c+d) and the four loads issue together. That works when there's one short dependency chain. This lesson is the case where the scheduler alone is not enough, lifted from Star Fox Adventures' voxmapsFn_80010ff4 (the voxel route-picker).

That function compares two neighbouring voxel columns by summing four occupancy bytes each — sumCur over one column, sumNext over the next — and picks the denser direction. The two sums are completely data-independent, so in principle the scheduler can run them in lockstep. The trap: how you spell the accumulation decides the emission order, and emission order seeds the schedule.

Written as two flat expressions —

sumCur  = occ[0][0] + occ[0][1] + occ[0][2] + occ[0][3];
sumNext = occ[1][0] + occ[1][1] + occ[1][2] + occ[1][3];

— the front end materializes all of sumCur's chain, then all of sumNext's. The ,p scheduler reorders from that seed and produces a load layout that reads 2(r3), 1(r3), 6(r3), 5(r3), 3(r3) … with a spare third register — and it does not byte-match the target.

The real commit fixed it by weaving the two running sums together, one term at a time. To see the effect, consider a simpler three-element analogue — twoCol_compare, which compares two 3-byte columns data[0] and data[1]:

typedef struct { u8 data[2][3]; } TwoCol;

int twoCol_compare(TwoCol* t) {
    int sumA, sumB;
    sumA = t->data[0][0];
    sumB = t->data[1][0];
    sumA += t->data[0][1];
    sumB += t->data[1][1];
    sumA += t->data[0][2];
    sumB += t->data[1][2];
    if (sumA >= sumB) { return 0; }
    return 1;
}

The two accumulators advance in lockstep and the schedule comes out clean — loads in offset order, the two sum registers built up alternately:

lbz   r6, 0(r3)          # sumA <- data[0][0]
lbz   r0, 1(r3)          # (data[0][1])
lbz   r7, 3(r3)          # sumB <- data[1][0]
lbz   r5, 4(r3)          # (data[1][1])
add   r6, r6, r0         # sumA += data[0][1]
lbz   r4, 2(r3)
lbz   r0, 5(r3)
add   r7, r7, r5         # sumB += data[1][1]
add   r6, r6, r4         # sumA += data[0][2]
add   r7, r7, r0         # sumB += data[1][2]
xor   r0, r7, r6
srawi r3, r0, 1
and   r0, r0, r7
subf  r0, r0, r3
srwi  r3, r0, 31
blr

r6 (sumA) and r7 (sumB) climb in strict alternation, loads come in column-interleaved order. That lockstep is the signature of a hand-interleaved two-accumulator sum. There is no pragma — scheduling is on, as always at -O4,p; the lever is the C shape that feeds it. The real voxmap_chooseDir has four terms per column instead of three — the same technique, one more pair of steps.

Why doesn't the scheduler produce this from the flat form? Its priority order is built on the dependency DAG seeded by emission order, and the flat form emits one whole chain before the other. The scheduler interleaves within the seed it's given; it won't re-found the chains. You hand it the weave by writing the weave.

Your task

Sum the four bytes of scan->occ[0] into sumCur and the four bytes of scan->occ[1] into sumNext, then return 0 if sumCur >= sumNext, else 1. To match the lockstep schedule above you must interleave the two accumulations term by term — not write two flat a+b+c+d sums.

Hints

match voxmap_chooseDirmwcceppc.exe -O4,p

Loading editor…

Hit “Compile & Check” to diff your code against the target.