Checking for visibility before collision = bad for CPU, don't try to optimize it?

  • Thanks to this thread, I will be hunting down those "for each" blocks, seeing if I do indeed have some sub-events with a quick "Sprite Inst Var = 5", changing those to "Evaluate Expression", measuring, and seeing how much improvement is gained.

    Is evaluate expression any better than compare two values? I never use the evaluate, always the compare.

  • We already know a system compare or evaluate is faster than picking events, but here is a performance test of variations of looping with two picking events. Half of the instances have their "a" variable set to 0 and the other half to 1. I also made them all invisible so we are just measuring how long the events take.

    It measures and averages the timings of each event with 10,000 instances. Made in C2 but tested in C3 r449.

    dropbox.com/scl/fi/xd5mz5kofj7c67ftbrhvo/event_timing.capx

    We are mostly measuring the performance of the loop and how many instances are being copied in the SOL. "First" and "Last" are taking advantage of some optimizations whereas "middle" does not. I tried variations of using sub-events but that doesn't appear to be significant.

    The for each appears to roughly do this under the hood:

    save SOL copy
    for each sprite
    -- if(following conditions or sub-events do picking)
    -- -- load SOL copy
    -- select nth sprite
    -- do stuff

    The "if" may only apply when the loop is the last condition and the sub-events don't do any picking, but I didn't dig too deep.

    The "load SOL copy" is the heavier part.

    With the "first" tests it's not heavy since the copy is just all the instances so construct sets a flag so nothing needs to be copied.

    The "last" tests are able to skip the loading since the SOL isn't being modified past that point.

    And finally, "middle" is the slowest since it has to load a copy of half the instances every iteration.

    I can almost see a way to avoid loading the list of picked instances every iteration, but it's not my code. I'd say it may still be worth filing a report regarding how slow the middle case is.

  • R0J0hound Great stuff, thank you for sharing!

    It would have been great to see "Compare two values" as well, if you get the chance:

    Sprite.a = 1

    Sprite.b = 0

  • Thank you to everyone who is contributing to the thread… really useful information and quite significant differences in the results.

    Imagine having a 6,000-line project and needing to refactor a bunch of things that could have been done right from the start, just by realigning items.

  • Thank you R0J0, I will study these results and your explanations deeply, it's very useful to myself and clearly many others.

    It's interesting that the best-performing result is a nested sub-event. Whilst I haven't measured it exactly, I've long had it in mind to "try to avoid nesting sub-events where possible" (although I barely follow this, since the gains of sub-events almost always outweigh trying to reduce nesting).

    Perhaps there's truth to avoiding too much nesting with sub-events, but maybe in the case of a loop it matters less, since the loop only deals with its specific nest and the ones below it? I presume sub-events function similarly to JSON nesting, where the deeper you nest, the more CPU it takes to reach the deeper level (i.e. getting a JSON key at A.B.C.D.E.F is more expensive than getting a key from A.B, even if there are many keys at A.B and just 2 keys at A.B.C.D.E.F). Event blocks have the added complexity of storing and reloading SOLs at various levels on top of that.
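
    To put the JSON side of the analogy in code form, here's a hedged sketch with hypothetical data, plain TypeScript objects standing in for the JSON plugin:

    const data = { A: { B: { C: { D: { E: { F: { key: 1 } } } } } } };

    // Shallow: 2 property lookups, even if A.B held hundreds of sibling keys.
    const shallow = data.A.B;

    // Deep: 6 lookups before the value is reached. Each extra level is one
    // more lookup per access - invisible done once, visible in a tight loop.
    const deep = data.A.B.C.D.E.F.key;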

    ...More things to measure!

    Imagine having a 6,000-line project and needing to refactor a bunch of things that could have been done right from the start, just by realigning items.

    ... heh yeah, imagine! My project's at 16,000 events, no big deal!...

    ...

    ...send help... possibly send help to Je Fawk and a number of others too...

  • What are the chances that the way the system handles these comparisons under the hood, which currently leads to these results, could change—so that another approach becomes more efficient, or that what is efficient now simply turns into the least efficient? What are the patterns? Do they exist?

  • What are the chances that the way the system handles these comparisons under the hood, which currently leads to these results, could change—so that another approach becomes more efficient, or that what is efficient now simply turns into the least efficient? What are the patterns? Do they exist?

    I feel it's unlikely to change, unless Scirra find ways that give the exact same results with boosted performance (but imagine the risk behind that: in a rare case where someone's events give a different result, it could break many projects).

    Or, as I noted somewhere before: say there are ways to boost these comparisons under the hood, but they require a tiny change to the base of the conditions (e.g. "Sprite - Compare Instance Variable" needs a small check for whether a "For Each" is running, or whether only 1 object is picked, so that it can skip picking checks, assuming that's what is currently happening). That could bring down performance generally, now that every "Sprite - Compare Instance Variable" condition is doing this quick check. It could add up fast for such a common condition, used probably thousands of times in a large project where the check is irrelevant because those events do need to fire in the current way, rendering the "quick check" useless in most cases, yet it still has to run and use a tiny bit of CPU.

    With the above scenario, personally, I'd love to see more actions/conditions added to cater to both scenarios. I wouldn't see it as bloat; I'd see it as more pathways to cooperate with how C3's runtime works. Whether that means a new "For Each (No SOL)" (as a random nonsensical example) or various new actions/conditions to keep things readable (events would look a lot less readable with many "evaluate expressions" and no Sprite icons and such), I'd be all for it.

    Of course, if Scirra discover any performance boosts they could implement, I'd imagine they'd race to do that, especially if it's not a fundamental change; if the boost is significant enough to risk breaking older projects, they may trial it in a beta.

    But I wouldn't fear this information suddenly changing overall, and if it did, it could only be for the better. But I see your point: should you sift through your project (or the particular systems within it that you want maximum efficiency from) and start replacing things with "evaluate expression" where possible, or wait it out in case Scirra hatch an idea?

    I suppose waiting for some insight would be ideal; there's plenty to work on elsewhere in a project! But if you're desperate for those CPU gains, you can implement the things in this thread. Even if Scirra delivered a boost in an unexpected way (e.g. a new condition/action/event block type), we would migrate anyway if we seek those gains!

  • Fawk

    Here's an updated test that additionally tests using picking after the loops in the "first" and "middle" cases. Move the mouse up and down to see pics of the events to compare with the readings.

    dropbox.com/scl/fi/0qazj62jfevm0vuk2fvxp/event_timing_2.capx

    Jase00 The nesting of the events doesn't seem to affect the speed significantly.

    So far the fastest way to do things is to have the loop as the last thing after picking stuff, or you can use compares after the loop as long as they are in sub-events.

    KryptoPixel

    Changing the internals would require filing an issue, Ashley investigating it, and him possibly finding a fix to improve the performance.

    Either way, the value of this topic is seeing that some ways of ordering things are more optimized than others. Going from the slowest way to the fastest is around a 20x speedup, which isn't bad.

  • Jase00 The nesting of the events doesn't seem to affect the speed significantly.

    So far the fastest way to do things is to have the loop as the last thing after picking stuff, or you can use compares after the loop as long as they are in sub-events.

    Thank you for measuring. I was so certain I'd read in the past about nested sub-events being a pitfall, but maybe it was a very long time ago... or I imagined it... Cramming in so many tidbits, maybe I'm misremembering a few!

    It's great to know, glad I don't need to investigate and refactor things related to this!

  • I just want to caution everyone about the same old issues when people talk about performance. It's a way more subtle and complex topic than most people really understand.

    Issue 1 is that Construct's CPU/GPU measurements are only timer-based, and all modern processors have advanced power management. This means it's possible to make a benchmark and slowly increase the amount of work, and then at some point the CPU/GPU measurement suddenly drops a lot. You might think "OMG! Adding more work made it faster, WTF?" but all that happened is you created enough work for the processor to step up from slow, low-power mode to fast, high-power mode. We warn about this:

    CPU measurements can be unreliable, especially when the system is largely idle. Most modern devices deliberately slow down the CPU if not fully loaded in order to save power. This means work takes longer to get done, and these measurements will misleadingly return a higher measurement, since it's based on timing how long the work takes. It will generally only be reliable in the device's maximum performance mode, i.e. under full load.

    But most people still seem to ignore this and design benchmarks that run at less than full load, which means the results might be nonsense. So most of the results in this thread may well be nonsense on that basis alone.
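
    For what it's worth, if you must benchmark, one shape that avoids this trap is to run the workload flat out for a fixed window and count iterations, instead of timing a light workload on a possibly down-clocked processor. A minimal TypeScript sketch, where workload() is a placeholder for whatever you want to measure:

    function benchmark(workload: () => void, windowMs = 2000): number {
      // Warm up first so the CPU has time to step up to its high-power state.
      const warmupEnd = performance.now() + 500;
      while (performance.now() < warmupEnd) workload();

      // Then run flat out for a fixed window and count iterations.
      let iterations = 0;
      const end = performance.now() + windowMs;
      while (performance.now() < end) {
        workload();
        iterations++;
      }
      return iterations / (windowMs / 1000); // iterations/second at full load
    }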

    Issue 2 is that both Construct and JavaScript are so exceptionally well-optimised that, counter-intuitively, it can make changes to the workload seem much worse than they really are. Even Construct's event blocks, with the overhead of the event system, are so fast that one benchmark we did a few years ago showed Construct events being 5x faster than GameMaker Language in VM mode, and still nearly as fast as GML when compiled to C++!

    To illustrate how a faster engine produces counter-intuitive seeming results, consider workload A which takes 1 CPU cycle, and workload B which takes 2 CPU cycles. If you make a benchmark, workload B is half the speed! Some people then start giving advice like "never do B, it is slow". However imagine working in a slower engine where workload A takes 10 CPU cycles, and workload B takes 11 CPU cycles. It's the same difference, but now it's only 10% slower. People might give advice like "both A and B are fine, there's not much difference". But the absolute difference is the same.

    In other words, if an engine is already insanely fast, very small changes to the workload show up as disproportionately large changes in performance. However in most cases it just doesn't matter. In the example I gave earlier, the slowest workload B is still 5x faster than the fastest workload A in the slower engine. Something taking 2 CPU cycles instead of 1 is unlikely to ever affect real-world performance of an actual project, even though you can make a benchmark showing a large percentage difference. Making that benchmark and then saying "never do B" may well actually be giving bad advice, causing people to do contrived, inconvenient things to their projects that are entirely unnecessary. This is really just another way to re-state that optimisation is usually a waste of time.

    Lastly the event system has been fine-tuned for maximum performance for over 10 years, and it's got a lot of sophisticated optimizations. Whether or not these matter and how they apply depends on what your project does. For example the 'Is overlapping' condition can use the collision cells optimization, which means if you have something like 10,000 sprites spread across a large layout and test overlap with a single sprite, it only checks nearby instances - let's say just 100 instances (which might sound a lot but is just 1% of the total). However if you put a different condition first, it has to disable that optimization. So putting some other condition first that only picks a small number of instances may in fact, perhaps counter-intuitively, be much slower, as that condition will check all 10,000 sprites. It may also have the opposite result, and be faster that way, depending on how much work the condition is versus the very small amount of work of identifying just the nearby instances, as well as how many instances are involved, and where those instances are. So you may in fact be able to measure that collision cells are slower in some contrived benchmark. That does not mean you should change the best advice of "put collision checks first" because usually collision cells is an important and highly effective optimization.
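
    To illustrate the collision cells idea with a sketch (a generic grid broadphase, not Construct's actual implementation; the cell size and names are assumptions):

    type Inst = { x: number; y: number };
    const CELL = 256; // assumed cell size

    function buildCells(instances: Inst[]): Map<string, Inst[]> {
      const cells = new Map<string, Inst[]>();
      for (const inst of instances) {
        const key = `${Math.floor(inst.x / CELL)},${Math.floor(inst.y / CELL)}`;
        let bucket = cells.get(key);
        if (!bucket) cells.set(key, (bucket = []));
        bucket.push(inst); // each instance registered in its grid cell
      }
      return cells;
    }

    // An overlap test near (x, y) only visits the 3x3 neighbourhood of
    // cells - e.g. ~100 candidates instead of all 10,000 instances.
    function nearby(cells: Map<string, Inst[]>, x: number, y: number): Inst[] {
      const cx = Math.floor(x / CELL), cy = Math.floor(y / CELL);
      const out: Inst[] = [];
      for (let dx = -1; dx <= 1; dx++)
        for (let dy = -1; dy <= 1; dy++)
          out.push(...(cells.get(`${cx + dx},${cy + dy}`) ?? []));
      return out;
    }

    Note the lookup starts from a position rather than from an already-narrowed list, which is why putting a different condition first forces the overlap test to fall back to checking instances the ordinary way.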

    I'd also note running events once with N instances picked is almost always more efficient than running events N times with one instance picked, because the latter repeats the overhead of the event engine. So for maximum performance, avoid "for each" unless you really need it. Again, a contrived benchmark may be able to measure the opposite. That should not change the general advice.
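
    A toy model of that overhead difference (illustrative only; eventOverhead just stands in for the fixed cost of entering an event):

    type Obj = { x: number };
    let eventOverhead = 0;

    function runEventOnce(picked: Obj[]) {
      eventOverhead++; // fixed cost, paid once for the whole picked list
      for (const o of picked) o.x += 1; // then a plain loop over N instances
    }

    function runEventPerInstance(all: Obj[]) {
      // "for each" shape: the fixed cost is paid N times, plus N array allocations
      for (const o of all) runEventOnce([o]);
    }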

    So really as usual my performance advice is:

    • Ignore performance results unless you have a real-world performance problem in your project
    • If you have a performance issue, rely on performance measurements in your actual real-world project, and avoid making contrived benchmarks
    • Follow our official performance advice
    • I'd say 95% of the benchmarks users make do not correctly take into account processor power management, so my general advice would be to ignore user-made benchmarks. If you really want to use benchmarks, only pay attention to properly designed ones that max out the processor, and ignore all others as probably misleading.
    • Remember that even a properly-made benchmark that appears to show a significant result may still be entirely irrelevant to any real-world projects.
  • Ashley thank you for the reply.

    TLDR

    It's acceptable to have different ways of doing one thing, with some being worse than others, but a lot of devs were surprised that very basic things, if done differently, result in big CPU gains, for example Sprite.instancevar vs compare two values, or using an instance variable rather than Tags if you have only 1 tag per sprite and not multiple: construct.net/en/forum/construct-3/general-discussion-7/stop-using-tags-tag-187031

    I would like to see more info in the official docs about these quirks. Also there are some useful blog posts from you that are very hard to find, if they are still relevant today.

    Real world projects

    We've been struggling with performance issues for a long time now, started with our multiplayer game Nightwalkers.io which could not have more than ~ 100 zombies on the screen. You could argue the architecture was wrong to begin with, which might be true, but also optimizations were always needed and took a lot of dev time to achieve, which did improve the game until a point.

    We're not making simple games, we're making very complex ones.

    This forum post started from the following issue:

    This is CPU intensive

    This is not

    And these are my measurements, both the CPU profiler as well as my laptop's CPU usage. Can't get any more real than this I believe.

    CPU intensive: C3 CPU profile 19.3%, system CPU usage user 14%, Chrome 48.9%

    Not so CPU intensive: C3 CPU profile 5.8%, system CPU usage user 10%, Chrome 44%

    Bear in mind that I was not doing anything on the screen apart from having the panels opened, so most mechanics were on pause.

    Regarding the CPU power

    I don't think it's related to that in this case, because people have tested with the same groups active at all times, and it still shows differences.

    If the CPU were throttled because it's not fully loaded, then all the groups would suffer equally, so the relative differences would still show up reliably.

    BeatsByZann I'm sure he could provide the c3p without any addons if it would be helpful.

    He posted these screenshots:

    Also here alastair did the same: construct.net/en/forum/construct-3/general-discussion-7/stop-using-tags-tag-187031

    Construct 3 is very fast

    Also, regarding how fast C3 is, I have to agree with you: this would have to be tested using real-life projects, because at the end of the day that's what we make and that's what people play.

    Why make forum posts about small tests that might not appear in real projects?

    It would be ideal to publicly show what's wrong in our project but it's not realistic. So we make small projects in an attempt to replicate an already existing theory. And we're very happy that other devs pitched in on this, it was a very useful community-driven test. As far as I understood, a lot of devs gained some valuable knowledge from these.

  • Well, in the case of "has tags", in one case you just compare a string which is a very simple and quick operation for a CPU. On the other hand "has tags" has to split the given string by spaces to extract individual tags, and then verify that all the provided individual tags are in the set of tags for the given instance. It's probably at least 10x as much work as just comparing a string. It's not that it's slow - JavaScript is still exceptionally fast with such tasks. It's just that you've used a feature which necessarily includes more complex steps. So if you make a benchmark that absolutely hammers that specific feature, you will probably see something like a 10x difference.
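
    As a sketch of that difference (illustrative names and types, not Construct's actual code):

    function compareTag(instanceTag: string, wanted: string): boolean {
      return instanceTag === wanted; // one string comparison
    }

    function hasTags(instanceTags: Set<string>, query: string): boolean {
      // Split the query by spaces, then check that every requested tag is
      // in the instance's tag set - several steps instead of one comparison,
      // though each step is still very fast in absolute terms.
      return query.trim().split(/\s+/).every(tag => instanceTags.has(tag));
    }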

    But does it matter? I find it hard to imagine it matters at all in the vast majority of projects. Even 'Has tags' should be so extremely fast that you'd have to do it something like 10000+ times per frame for there to be any significant performance difference. And if your only evidence involves low CPU usage numbers like 5.8% then I'd say you're probably just looking at results in low CPU power mode and your numbers are misleading. Perhaps it's the case that if the CPU was running in full power mode, your numbers would look more like 1% vs. 1.1%, which would be a better reflection of the performance difference.

    This kind of thing comes up so much and seems to mislead people so consistently, even after explaining the pitfalls, that I do wonder if it would be better just to get rid of Construct's timer-based CPU/GPU readings. However they are still useful for general "is it low or high" type readings, for example knowing whether you've maxed it out or not, so I don't think we should get rid of them. It's just something you need to know basically can't be trusted at all. Hence the trap: the CPU/GPU usage numbers can be misleading for real-world projects, and a benchmark running flat-out is probably also misleading as it doesn't represent real-world projects. So when do you know if something needs optimizing? The framerate is the ultimate measurement. If your real-world project can't hit the display refresh rate (typically 60 FPS), and changing your events allows it to hit the display refresh rate, then it was significant to performance. Basically in all other cases, there is not really any way to say for sure it actually mattered.

  • Thank you Ashley.

    I spent a lot of time measuring, I kept in mind the very quote in official performance advice: "Measure measure measure"...

    I thought this thread was great. No guessing, folks sharing screenshots and project files. I've remade some tests and observed similar results.

    FWIW even if some things learned here aren't ideal in most cases, like the use of "evaluate expression", it's now another direction to consider when looking to optimise. I knew of some of this, but didn't expect the results in the specific cases shown.

    It's unfortunate we couldn't learn more about the mysteries behind some of the results; never an obligation, just strong curiosity from us all. The strong pushback against measurements surprised me, considering project files were shared for people to check themselves, but presumably you're thinking of general users who see this topic and take the thread as too authoritative on how to design events, which could trip them up.

    ... and all modern processors have advanced power management. This means it's possible to make a benchmark and slowly increase the amount of work, and then at some point the CPU/GPU measurement suddenly drops a lot. You might think "OMG! Adding more work made it faster, WTF?" but all that happened is you created enough work for the processor to step up from slow, low-power mode to fast, high-power mode.

    I'm not as familiar with CPU power management (is it just the GHz reported by your CPU, or is there more to it?), but I observed my CPU's GHz during tests and saw it rise and fall relative to C3's CPU usage, although it flickers ±0.2 whether at 10% or 100% CPU, with FPS dropping to 10%.

    Regardless, looking at two tests running, the difference in CPU between the two maintains a fairly consistent gap, even when inflating both tests fairly with a For loop. It feels logical to explore the winning test, even at a 5% or 10% difference (depending on the initial test, of course). Turning off VSync and running one test at a time would be further confirmation, surely?

    Admittedly I often go with just CPU comparisons that don't max out, and it tends to work for me when implemented into a real-world project, e.g. the "dummy return function" I implemented for a system I'd been trying to solve for months, going from 10% down to 2% CPU (setting the opacity of about 200 objects with a wacky equation pulling from various places related to each specific instance; For Each was my go-to method most of the time).

    It's a presumption, but I figured the types of optimisations being looked at in this thread are more for event systems that can "grow" (i.e. wanting more enemies active, the more the better). Sure, it's working great with 50 or 100 enemies at maybe 30% CPU (knowing you need to keep things light for other gameplay CPU tasks), but then to find that a different order of events or a different condition could grant an extra 100 enemies or more without affecting anything is curious to learn about.

    Granted, some benchmarks are overkill, and it's easy to jump to "HEY, this 10,000-iteration loop is slow!". But I think there have been some realistic examples in this thread.

    I'd also note running events once with N instances picked is almost always more efficient than running events N times with one instance picked, because the latter repeats the overhead of the event engine.

    Can't deny I'm curious why the "dummy return function" is more optimal than a single event with a "For Each" loop (I posted earlier in this thread), though I'm certain it relates to what you noted here. I presumed calling a return function when many instances of a Sprite are picked would mean calling the function many times, plus running the return function's "pick by UID" and such (although picking by UID is fast, I thought it was still "more work" dealing with a function and an extra condition). It's not like functions behave like custom actions and pass the SOL, so... no clue!

    Even Construct's event blocks, with the overhead of the event system, are so fast that one benchmark we did a few years ago showed Construct events being 5x faster than GameMaker Language in VM mode, and still nearly as fast as GML when compiled to C++!

    This was a cool benchmark and never left my mind!

    Causing people to do contrived, inconvenient things to their projects that are entirely unnecessary.

    It'd be a shame not to see this type of thread pop up because folks worry about leading people down the wrong path; it's really engaging, but I understand the concern. Hopefully anyone who shares something writes a warning, and I'd like to think newcomers wouldn't race straight to optimisation, and would ignore the overwhelming confusion of this thread until they've learned more.

    This kind of thing comes up so much and seems to mislead people so consistently, even after explaining the pitfalls, that I do wonder if it would be better just to get rid of Construct's timer-based CPU/GPU readings. However they are still useful for general "is it low or high" type readings, for example knowing whether you've maxed it out or not, so I don't think we should get rid of them.

    I know you said you won't remove it, but I LIVE in that profiler!! It has guided me to mistakes so often. I've optimised a 16k-event project from 40% CPU down to 17%. It's helped tremendously with measurements. Even if I'm doing it wrong, whatever I measure and feel is optimal, I implement, and the CPU goes down. This helps reassure me my game will run well on low-end devices, even if only a little better. (Gotta get the old integrated-graphics laptop out; it worked moderately well before, must be great now!)

  • So I got unhealthily curious about CPU power management.

    [Note: I oversimplify CPU talk here. It's not all about CPU Speeds/GHz, it's also about type of CPU, generation, age, etc.]

    Turns out you can control your processor speed in Windows. Not explaining how. Your own risk if you explore this.

    Doing this stabilised my CPU's clock speed: at worst it changed by ±0.01 GHz, but it frequently stayed completely static.

    This gave far more stable C3 readings; still a bit jumpy with an unlocked framerate, but with a much smaller range on the jumpy numbers.

    This guarantees CPU readings aren't affected by CPU power shenanigans, but it is NOT a real-world scenario, just an experiment to rule out CPU power management.

    Maxing out the CPU caused unpredictable clock speeds, and tests at 10%, 30%, etc. all affected the clock speed. Indeed, power management is affecting our results.

    The way we measure now, if you let your CPU do its own thing and fluctuate as the workload gets higher, then:

    1. Running one test at a time, your CPU would reach differing speeds during each test. Less accurate results. (I did this method for many months).

    2. Doing tests with two groups enabled and comparing in the CPU profiler, the CPU speed climbs even higher than with singular tests (the difference in CPU will also be less clear, as explained below).

    Notably, the higher your CPU speed, the harder it is to observe a CPU difference between two tests. There's no use fluffing up both tests with a "For" loop, as the CPU will just boost its speed and further shrink the difference. If you had an imaginary 100GHz CPU, none of our measurements would even show a 0.1% difference, and we wouldn't know there was a difference that could affect the average player with a 2.5GHz CPU. (Always test on lower-powered devices if you aim to target them.)
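
    A toy calculation of that effect, with made-up cycle counts (only the shape of the numbers matters):

    const cyclesA = 2_000_000; // hypothetical per-frame cost of test A
    const cyclesB = 4_000_000; // hypothetical per-frame cost of test B
    const frameMs = 1000 / 60; // ~16.7 ms budget at 60 FPS

    for (const ghz of [1.0, 2.5, 100]) {
      const pctA = ((cyclesA / (ghz * 1e9)) * 1000 / frameMs) * 100;
      const pctB = ((cyclesB / (ghz * 1e9)) * 1000 / frameMs) * 100;
      console.log(`${ghz} GHz: A ${pctA.toFixed(2)}%, B ${pctB.toFixed(2)}%, gap ${(pctB - pctA).toFixed(2)}%`);
    }
    // At 1 GHz the gap is ~12% of the frame; at 100 GHz it's ~0.12%.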

    Measurements with fixed CPU speed

    I took the same example I posted in this topic: "Dummy Return function" vs "For Each loop", 1000 objects every tick, one test at a time.

    I still had to approximate measurements, since it's never a perfectly stable number (especially at higher GHz).

    I also tested various clock speeds, in all 3 Framerate modes (VSync, Ticks Only, Full Frames) and noted down the CPU % for VSync, TicksPerMin for Ticks Only, and FPS for Full Frames.

    Why did I do this? I don't know. Maybe someone finds it interesting or can extract some information from it, knowing it's a fixed stable CPU speed on every measurement.

    (Columns: CPU % in VSync mode, TPM in Ticks Only mode, FPS in Full Frames mode.)

    --- No CPU Restriction ---
    (Idle with preview open using 2% CPU; GHz floats between 4.15 and 4.21. Very approximate, as CPU speed changes often.)

    For Each:  CPU 13%, TPM 1000 [~4.24GHz], FPS 940 [~4.22GHz]
    Return Fn: CPU 8%,  TPM 2250 [~4.24GHz], FPS 1840 [~4.22GHz]

    --- 2.28GHz ---

    For Each:  CPU 20%, TPM 560,  FPS 545
    Return Fn: CPU 12%, TPM 1220, FPS 1000

    --- 1.48GHz ---

    For Each:  CPU 30%, TPM 360, FPS 345
    Return Fn: CPU 17%, TPM 760, FPS 670

    --- 0.98GHz ---

    For Each:  CPU 43%, TPM 225, FPS 218
    Return Fn: CPU 25%, TPM 475, FPS 420

    The kicker? This is all absolutely pointless lol. Regardless of the CPU doing its thing and adapting, or a fixed CPU speed, I still have the same lesson I learned from when I whipped up the original c3p file in 5 minutes.

    With my ignorance these past months, I measured mainly by comparing CPU, running one test at a time, and re-running a few times to see if any sudden differences occurred.

    Any test I have done has benefited me in my C3 journey:

    - Placing conditions in ideal ways

    - Knowing when/when not to use conditions/actions

    - Knowing the expensive actions

    - Finding that "comparing longer strings can increase CPU insigificantly, but can add up qucikly if reading many long strings in large loop, so opt for a number or shorter string".

    - Learning that getting a value from deeply-nested JSON gets expensive quickly; not noticeable for one "Get", but observable in a loop. Keeping nesting minimal is fine even with hundreds of keys per level.

    - JSON performs the worst of Array/Dictionary/JSON in many read/write/iterate tests, but it is still powerful and an ideal way to store complex relational data.

    - Understanding "Pick by UID" or "Pick Child/Parent" is the categorical optimal way to pick an object, again, especially if in a long loop - Instance Variables could eat away at CPU... (But now there's exploring "Evaluate expression" which may render Instance Variables more ideal to use in a busy loop

    These all feel accurate, and if anything is inaccurate or misremembered... I can go and measure.

  • The community wants to see more features added, not the removal of features! 🙏
