When something goes wrong in technology, knowing who to blame is a surprisingly difficult problem. Despite having dealt with thousands of bug reports for over 10 years, I'm still regularly humbled by totally unexpected results. These can be things like bugs that look like someone else's fault, but are ours — or reports that look like our fault, but are someone else's — or even on the odd occasion, I'll find out a problem is caused by a freak chain of events that somehow causes a totally unexpected result to something unrelated, like some aspect of text rendering affecting audio playback.
This also involves aardvarks.
This is not unique to us: modern software technology is incredibly complex. For example, take a look at this attempt to answer the question: "What happens when you type google.com into your browser and press enter?" The answer is an epic journey through keyboard hardware and drivers, management by the operating system, interpreting input, several layers of networking including sockets and DNS, security concerns, various protocols, and browser architecture, including the many layers of rendering HTML documents. It doesn't even touch on how Google's servers manage to search billions of documents in a fraction of a second.
These many layers of complexity are typical of most modern software applications. For example, rendering a WebGL shader effect in a Construct game involves a deep graphics technology stack, which includes our own effect compositor, but also shader language compilers, managing and converting command buffers, browser security mechanisms, multi-threaded processing, passing data between the application, operating system and graphics driver, decades-old graphics APIs with legacy concerns, scheduling commands to be sent to the hardware, and the design and management of the graphics hardware itself. You might notice that the part we made is actually a pretty thin layer that sits on top of a deep stack. The icing on the cake, if you will!
Another interesting point to reflect on is just how many different companies are involved in these technology stacks. One thin slice of the technology cake could involve layers made by a dozen different companies or more. The fact anything works at all comes down to industry agreements, specifications and standards, which can themselves be a source of problems too.
Observing a problem
Suppose a user reports that Google searches don't work. The problem could be anywhere from the physical keyboard hardware, through keyboard drivers and OS management, to problematic algorithms in Google's data centers, or anywhere in between.
When something goes wrong, it can involve pretty much any layer of these complex technology stacks. Some are more reliable than others. It takes experience to know which — but even then, you can never rule out a rare case of a problem with a supposedly reliable technology — or a bizarre edge-case.
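As a rough illustration of checking one layer at a time instead of blaming the top one, here is a minimal Python sketch. The probe functions, the `first_failing_layer` helper and the choice of just two layers are all hypothetical, invented for this example; a real investigation spans far more layers than this. Even when a probe fails, it only suggests where to look next. It does not prove whose fault the problem is.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical probes for two of the many layers between a user and a
# website. Each returns True if that layer appears to work.

def dns_resolves(host: str) -> bool:
    """Can the host name be resolved at all?"""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except OSError:  # includes socket.gaierror
        return False

def tcp_connects(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Can a TCP connection be opened to the host?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # includes timeouts and refused connections
        return False

def first_failing_layer(
    probes: Sequence[Tuple[str, Callable[[], bool]]]
) -> Optional[str]:
    """Run probes in order, lowest layer first; return the first that fails.

    Returns None if every probe passes, which only means the cause lies
    somewhere these probes do not look.
    """
    for name, probe in probes:
        if not probe():
            return name
    return None
```

Calling `first_failing_layer([("DNS", lambda: dns_resolves("google.com")), ("TCP", lambda: tcp_connects("google.com"))])` would return `"DNS"`, `"TCP"` or `None` depending on the state of the network at that moment.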
Non-technical users are usually unaware of these many layers. Understandably, the instinct is to blame the thing you can see, which is usually the top layer. For example, a user who cannot reach Google in the Firefox browser might blame Firefox. It could be the problem, but it could be caused by a great many other things, many of which are more likely than it being Firefox's fault. It could even have actually been an aardvark. Experienced engineers have usually come across these bizarre and unexpected "aardvark cases" a few times. (One of my favourites is a bug report that literally said "when I lean on my keyboard, I later cannot open my project". In the end we fixed it. Another is possibly an urban legend but is a good story about a car allergic to vanilla ice cream.)
Suppose someone learns about the way routers manage their TCP/IP stacks, and a few high-profile cases where mistakes caused outages. It might be tempting to then see all failures through this lens. For example if you can't reach Google, you might think "Aha! I bet it's one of those TCP/IP stack management issues!" Experienced engineers know this is unlikely since these parts of the stack are generally extremely reliable — but at the same time it can't be completely ruled out until it's further investigated. After all, it could yet be an aardvark case.
There is research behind this: the Dunning-Kruger effect describes how people with a little ability tend to overestimate their capability. It's a case of "a little knowledge is a dangerous thing". Part of the process of becoming an experienced engineer involves declaring the cause of a problem, investigating for a while, and then finding out you were embarrassingly wrong. I know I've been through that a few times — and still occasionally do! I practically expect it.
It's really tough to try to tell someone that there might be limits to what they know. You can easily imagine them taking it badly. I imagine even just this section of the blog post will be interpreted badly by some. (To be clear: I absolutely do not mean to offend anyone. It's just a tricky subject to discuss.) However I think the research behind the Dunning-Kruger effect is beyond dispute: it must be accepted that there is an effect that works like that.
Of course, despite marketing mainly to non-technical users, we do end up with some users who are experienced engineers. There is a good rule of thumb for identifying them, which I'll come to later. Experienced engineers also recognise their limitations with regard to domain-specific knowledge. For example, I've personally closed around 6000 bug reports directly relating to Construct, and an experienced engineer discussing a Construct issue would be aware of this and take it into account. Similarly, if I report an issue to Google about Chrome, I may have picked up a few bits and pieces of browser knowledge over the years, but I'm well aware they are the ones who know browsers inside out.
The language of experience
As a result of this rite of passage of repeatedly being proven embarrassingly wrong by unexpected twists, experienced engineers are cautious and non-committal until there is a definitive understanding of the issue. Upon receiving a report, good engineers know to avoid immediately pointing the finger at any specific thing, even if it looks obvious, and even if pressured to do so. This is also why most bug report systems require a lot of information to help diagnose the problem, although non-technical users regularly skip over a lot of it, sometimes declaring that it's not relevant. (They are frequently wrong.)
This non-committal approach emerges in the kind of language experienced engineers use. Usually an initial response will include probabilistic language like "it's probably...", "it might be..." or "it could involve..." (normally without the emphasis). This is generally falling back on experience to provide a rule of thumb that helps guide the investigation in approximately the right direction. Failing that, an experienced engineer is unlikely to commit to anything at all, perhaps with comments like "we don't know yet" or even "I have no idea!" This is to avoid pushing the investigation in the wrong direction or proceeding down a path without any evidence to back it up, which tends to become a waste of time. It also helps minimise the embarrassment in the aardvark case — you never claimed it wasn't an aardvark, after all, since you knew from the start there was a chance it could've been, even if the chance was so slim it wasn't worth discussing.
Things we've been blamed for
These subtleties can be lost on users, particularly non-technical ones. When things go wrong, some users may be tempted into promptly declaring what the problem is. This is understandable: many kinds of problem look obvious, even if they involve mind-bogglingly complex chains of events. However, as mentioned, they usually blame the top layer. If you make consumer software, that's usually your app.
Here's a list of some of the things we've been blamed for when a user finds something goes wrong:
- Graphics driver bugs that cause glitches or poor performance
- Fundamental hardware limits, e.g. GPU memory bandwidth
- Crashes in the browser
- Bugs in third-party libraries we use
- The network performance of third-party services we use
- Misconfigured web servers that have wrong or missing MIME types set
- An ISP's poorly configured network address translation (NAT) resulting in blocked peer-to-peer connections
- Changes in the OS caused by OS updates
- Proprietary media codecs encumbered with fees being unavailable on free platforms
- Problems caused by, or the decisions of, independent third-party developers
- Restrictions browsers or OSs impose to avoid annoying users, or ensure their security
- Features that are delayed or missing because the companies that provide them either respond slowly or specifically don't want to allow it
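To make one item on that list concrete: a server that sends a wrong or missing `Content-Type` header can break an exported game even though the game's own files are fine. Below is a minimal Python sketch of spotting that kind of mismatch. The `EXPECTED_TYPES` table, the `find_mime_mismatches` helper and the sample file names are assumptions made up for illustration; they are not part of Construct or of any particular web server.

```python
from pathlib import PurePosixPath

# Hypothetical table of the types a server is expected to send
# for a handful of file extensions.
EXPECTED_TYPES = {
    ".html": "text/html",
    ".json": "application/json",
    ".png": "image/png",
    ".wasm": "application/wasm",
}

def find_mime_mismatches(served):
    """Compare served Content-Type values with the expected ones.

    `served` maps a file path to the Content-Type header the server sent,
    or None if the header was missing. Returns {path: (expected, actual)}
    for every file whose type is wrong or missing.
    """
    mismatches = {}
    for path, actual in served.items():
        expected = EXPECTED_TYPES.get(PurePosixPath(path).suffix.lower())
        if expected is None:
            continue  # Extension not in our table; nothing to compare.
        # Ignore parameters such as "; charset=utf-8" when comparing.
        bare = actual.split(";")[0].strip().lower() if actual else None
        if bare != expected:
            mismatches[path] = (expected, actual)
    return mismatches
```

A report like this only shows that the server's configuration differs from what was expected. Whether that is actually the cause of a given failure still has to be confirmed.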
As I said, this is entirely understandable. It can be very frustrating for us to receive blame for things that are nothing to do with us, but I don't mean to criticise these users. It's natural that if you run into a problem, you want to have it fixed, and a fair assumption is that the people who make the app you're using are the ones who ought to fix it. We usually do our best to assist in these cases. We're aware of our responsibility to ensure our software interoperates as well as it can with these other parts of the ecosystem. However, sometimes the reality is that these are wider issues affecting some other part of the vast and deep technology stacks made by dozens of different companies. We can't take responsibility for it all, even if it does end up affecting our app. For example, we can't fix a bug in NVIDIA's graphics driver for them, nor is it reasonable to expect us to do so.
This also isn't unique to us: any software on the market will face a similar raft of issues, since all modern software fundamentally depends on existing technology stacks, few of which are perfect. If someone decides to leave over such issues, chances are they'll run into similar issues elsewhere, or even the same problem, perhaps manifesting differently.
Sometimes a user will say something like "Construct sucks at making peer-to-peer connections". If your ISP has a highly restrictive network address translation configuration (possibly motivated by IPv4 address space exhaustion on the wider Internet) and you use Construct, and it fails to establish a peer-to-peer connection, then it's an understandable statement: Construct failed to do something. However, it is also unfair to blame Construct, even if unintentionally so, because the user is unaware of the wider technical issues. The issue could ultimately derive from a fateful decision long ago to use only a 32-bit address field in IPv4. You could argue we should have chosen a different technology, but that's unlikely to be perfect either, so it will probably just end up trading one set of issues for another. For example, we could switch to centralised servers, and then users would face hosting fees and scalability issues. Experienced engineers also recognise this point: building technology is about making tradeoffs towards a goal. You can't always have it all.
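The arithmetic behind that 32-bit limit is short enough to check directly. The sketch below uses Python's standard `ipaddress` module. The comparison with IPv6's 128-bit addresses is added here purely for scale and is not something the scenario above depends on.

```python
import ipaddress

# A 32-bit address field allows 2**32 distinct addresses in total.
ipv4_total = 2 ** 32
print(f"{ipv4_total:,}")  # 4,294,967,296

# The standard library agrees: the whole IPv4 space is the network 0.0.0.0/0.
assert ipaddress.ip_network("0.0.0.0/0").num_addresses == ipv4_total

# IPv6 uses 128-bit addresses, so its whole space (::/0) holds 2**128.
ipv6_total = ipaddress.ip_network("::/0").num_addresses
assert ipv6_total == 2 ** 128
```

Roughly 4.3 billion addresses is fewer than one per person on the planet, which is part of why ISPs end up sharing addresses between customers using NAT.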
Blame in the community
The case of a Google search failing is usually straightforward. Google's services are generally so reliable that you know if you can't reach them, you're probably offline (or the problem is otherwise at your end, regardless of whether it involves aardvarks). In a software application though, there are often problems that are far less obvious, and that involve far less commonly understood technologies.
Sometimes when things go wrong, people understandably get frustrated about it. However, in these cases, asserting the cause of the issue, or demanding an investigation into a certain area, rarely helps. Until the problem is sufficiently understood, which often is not until the point at which it's fixed, it's difficult to say for sure what is going on. Piling on extra pressure only makes it harder to steer an investigation in the direction most likely to result in a solution. If an issue's cause genuinely lies somewhere else in the technology stack as the responsibility of some other company, and there is no feasible workaround, then piling on extra pressure will only force us to say (sometimes even before we know absolutely for sure) that we think there is nothing that can be done. Of course this is hardly satisfying for anyone affected. However, continuing to apply pressure beyond that point is futile and may only cause further aggravation. Our engineers are only human and often have spent hours trying to help. In these cases we will firmly stand our ground. We've been doing this for years. We may not be perfect and we may not always be right, but we do know what we're doing.
This can even result in some totally bonkers situations that are the equivalent of screaming at a Firefox engineer over the fact your Google searches don't work in Firefox, and then it turns out an aardvark chewed through your line. Nobody wants to be that person. It's best for all involved to remain calm and co-operate until the problem is solved. Then you can shake your fist at the aardvark and go "Grrrr!" (Still, I would advocate peaceful measures to avoid it occurring again!) It would also be particularly bonkers to find out it was aardvarks, and then still blame the by now likely totally baffled Firefox engineer, or demand that the Firefox engineer somehow assumes control of all aardvarks globally to prevent them ever affecting the use of Firefox. You'd understand if the Firefox engineer was wondering why they were still trying to help at this point. Unfortunately, on the odd occasion, that's the kind of situation we find ourselves in. And if the Firefox engineer protested that it's unreasonable to expect them to assume control of aardvarks globally, and you argued "don't tell me what's what about aardvarks" on the basis you once read a book about them, I think we could all agree we're firmly in Dunning-Kruger territory — and that mentioning that would be diplomatically challenging, to say the least. The best thing the Firefox engineer can do is just walk away, even with claims behind them that they don't care about whether Firefox works.
Social co-operation on technical issues
A secondary problem occurs when non-technical users see other users asserting the cause of a problem. Especially when multiple people are asserting the same cause, they may side with them and join in calls for action. Unfortunately non-technical users don't have any basis on which to verify the merit of any claims, even if evidence is provided. They may also be unaware of the reasons behind nuanced, non-committal comments by engineers, and interpret this as refusing to co-operate, or shifting blame. Indeed, those asserting a specific cause may appear more confident than an engineer who won't be pinned down. This is where technical issues become socialised, and can mean more pressure on engineers. This is rarely constructive.
In order to avoid this, I think this might be a useful rule of thumb:
- Non-technical users, or inexperienced engineers, sometimes quickly assert the cause of a problem after finding it. For example, "X doesn't work. It must be Y!" They sometimes firmly insist on their interpretation, and demand a specific course of action. "Why won't they fix Y?" They sometimes also perceive one solution as perfect, and do not consider any drawbacks. "Y will be great and fix everything."
- Experienced engineers equivocate. They use non-committal, probabilistic language. For example "X doesn't work. We're not sure why yet", or "In the past this has often been caused by Y". They will suggest a course of action based on experience, yet resist firmly committing to a specific interpretation, keeping an open mind and being prepared to entirely shift the direction of investigation. After all — it could be aardvarks. They are also well aware of the tradeoffs involved, and know that even major changes won't always solve everything. "Y may help with X, but has other downsides for Z that may even outweigh the benefit to X."
If you're not sure how to weigh up technical issues yourself, firstly I'd point out there's no need to join anyone's side at all. However, if you do see discussion on technical issues, I think this is a good way to separate those who have dealt with many technical issues in the past from those who perhaps have not. Staff should work hard to resolve customers' issues, non-technical users should try to avoid wading into technical discussions, users with little or no technical experience should try to be aware of the limitations of their knowledge (especially when they cannot see the Construct source code), and experienced engineers can be spotted by their abundance of caution. Recognising this should make for better quality technical discussions and a less confrontational approach in the community. And of course all participants should be respectful and co-operative at all times, even if some issues are frustrating to deal with.
Software development is hard, and dealing with issues that come up can be hard too. Modern technology involves an astonishing level of complexity, with layers upon layers of technology maintained by dozens of companies, any of which can go wrong. Few people understand it all. I certainly don't. In some cases, no matter how hard you push, there's just nothing quick or easy that can be done about a problem. It is simply the nature of the beast: you can end up caught between the gears of a forbiddingly complicated machine.
Some of the problems that emerge can look obvious but are far from it. Knowing who to blame amounts to knowing how to fix the problem, which can be incredibly difficult. People who understand this are cautious and equivocal. People who don't understand this can jump to conclusions and end up unnecessarily pressuring engineers, even if they want the software to be as good as possible and have its best interests at heart. I hope this post helps others understand where we are coming from, especially when dealing with possibly frustrating and difficult problems. Just remember — it could be aardvarks.