Even after looking at the runtime's source in C2, I'm having trouble coming up with consistent rules. C2 and C3 run similarly, but C3 is faster.
Anyway, by default the picking system duplicates object lists a lot: at the top of every event block, on every iteration of a loop, and probably in some other special events. That gets slow with a large number of instances.
But there are two major optimizations being used:
1. When all the instances are picked, copying just sets a flag, so nothing actually needs to be copied, which is faster. That is why having the "for each" as the first condition is faster in 3 and 4.
2. When exporting, it keeps track of whether any of the following sub-events modify the SOL (selected object list). If they don't, there's no reason to copy the SOL after that point. That's why 2 is fast. So effectively the loop needs to be the last condition in a block, and the sub-events must do no picking, to take advantage of that.
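To make the first optimization a bit more concrete, here is a rough sketch in Python. It is not the runtime's actual code (that is JavaScript, and the names `Sol`, `select_all` and `copy` here are made up for illustration). It only shows the idea that copying an "everything is picked" list can be a flag set, while copying a filtered list has to duplicate every picked reference.

```python
from dataclasses import dataclass, field


@dataclass
class Sol:
    """Toy selected-object list for one object type (hypothetical, not the real runtime class)."""
    select_all: bool = True                        # True means every instance is picked
    instances: list = field(default_factory=list)  # only meaningful when select_all is False

    def copy(self):
        """Return (new_sol, number_of_references_copied)."""
        if self.select_all:
            # Fast path: nothing to duplicate, just carry the flag over.
            return Sol(select_all=True), 0
        # Slow path: duplicate the picked list element by element.
        return Sol(select_all=False, instances=list(self.instances)), len(self.instances)


all_picked = Sol()                                            # nothing filtered yet
some_picked = Sol(select_all=False, instances=list(range(1000)))

_, cost_all = all_picked.copy()
_, cost_some = some_picked.copy()
print(cost_all, cost_some)  # 0 1000
```

If the loop is the first condition, nothing has filtered the list yet, so in this toy model every copy takes the cheap branch.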
The short version that's easier to remember: you want the loop to be the first or last condition in an event block. You can have non-picking events after the loop, but they need to be in a sub-event.
The high CPU usage in 1 comes from the worst case, with no optimized code paths. And oddly enough, number 2 performs just as badly if the "system compare" is directly below the loop instead of in a sub-event.
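The "last condition" half of the rule can be written down as a small check. This is only my reading of the behaviour described above, sketched in Python. The function name `loop_can_skip_copy` and the way a block is represented (a list of condition names plus one flag per sub-event) are assumptions for illustration, not anything from the exporter. It does not model the "first condition" case, which comes from the select-all flag instead.

```python
def loop_can_skip_copy(conditions, sub_event_picks):
    """Guess whether a "for each" loop avoids the per-iteration SOL copy.

    conditions: the block's condition names in order; "for each" marks the loop.
    sub_event_picks: one bool per sub-event, True if that sub-event does any picking.
    Both parameters are a hypothetical representation of an event block.
    """
    # The loop has to be the last condition in its own block...
    loop_is_last = bool(conditions) and conditions[-1] == "for each"
    # ...and nothing underneath it may modify the SOL.
    return loop_is_last and not any(sub_event_picks)


# Loop last, non-picking "system compare" moved into a sub-event: fast path.
print(loop_can_skip_copy(["for each"], [False]))                 # True
# "system compare" directly below the loop in the same block: no fast path.
print(loop_can_skip_copy(["for each", "system compare"], []))    # False
# Loop last, but a sub-event does picking: no fast path.
print(loop_can_skip_copy(["for each"], [True]))                  # False
```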