Service Workers are a pain in the ass

Official Construct Team Post
Ashley
  • 5 Apr, 2016

For years now, Construct 2 has used AppCache to allow its HTML5 games to run offline. Recently, however, Chrome and Firefox have been adding support for Service Workers, which allow similar features to be implemented in a special worker written in JavaScript. For a while now browser vendors have been warning that the older AppCache is going to be removed. We've always been keen to stay up-to-date with the latest web technologies - for example we added WebGL and Web Audio support to our engine as early as 2011, long before many of today's popular HTML5 engines/frameworks even existed. So back in October I tried reimplementing offline support with Service Worker. I failed - the feature seemed too immature, with essential features missing. It was also incredibly complicated, and even with years of experience working at the bleeding edge of web technology, I found it a challenge.

Many of the engineers and evangelists I spoke to at the time pointed me to relatively simple examples which didn't cut it for the needs of our engine, and seemed to take the perspective that Service Worker is done and a perfectly sufficient substitute for AppCache. I found this frustrating, and suggested to one of the evangelists (sorry, forgotten who!) that next time I tried, I should take notes to illustrate how difficult it is, and I remember them being keen on this. So with Firefox appearing to get increasingly close to dropping AppCache, I tried again, and as promised, took loads of notes.

Behold: six months later, I still don't think Service Worker is a viable substitute for AppCache, and it is still a pretty miserable experience to work with. I am worried browser vendors are going to aggressively drop AppCache before Service Worker is ready - and no, it's not ready yet - and we'll end up losing offline support for Construct 2 games for a while until SW catches up.

So this is my case to the browser vendors that it is still too early to drop AppCache.

AppCache works fine

AppCache has a few quirks that can make it a pain to work with too, but it's been working for years for us across all major browsers with no significant problems. Whenever someone brings up offline support with Service Workers, there's usually a link to "Application Cache is a Douchebag". Unfortunately I guess "AppCache works OK for us" isn't such a catchy blog title. Somehow that blog post seemed to single-handedly destroy the reputation of AppCache, and now everyone seems to regard it as something we can't get rid of fast enough. Service Worker is definitely more powerful and the future of web development, and I'm not trying to troll it as hard as some have AppCache, but I guess the lesson is we should be vocal and write blog posts to get our opinions out there, so there is not such a monoculture and reliance on single points of view. This is also why I'm going into so much detail in this blog post - I really want to get my point across.

AppCache had some interesting features that turn out to be hard to replicate in Service Worker:

  • a separate data file with a list of resources (not embedded in a script)
  • implicit caching of the index page (useful to cover both example.com/ and example.com/index.html cases, and as framework developers, we can't assume it will be anything in particular, e.g. could be main.aspx)
  • check for update in the background while loading from cache
  • swap for new cache on next reload
  • atomic swaps, so resources from different versions are never mixed up in the same pageload, e.g. if the background update completes while the cached page is still loading or making requests

The code of my best-effort attempt towards this is on GitHub. Again, I've failed to live up to that list - six months later I still can't get it to work, but I think I'm closer. As promised, I kept a log every step of the way documenting every problem and frustration I had - and I wrote a lot. So here's what the development experience was like. If you haven't realised yet, this is going to get very technical. Also, wall-of-text alert!

Working with Service Worker

So I'm testing on a dev server running on localhost, and testing both Firefox Nightly and Chrome Canary to make sure I've got the latest Service Worker features. For example Firefox only recently announced some important extra features, and I want to be able to use them.

I need to test the fresh-install code a lot, so I open a new Incognito window in Chrome every time I want to start from scratch. This works nicely. I try the same in Firefox using Private Browsing, but it turns out navigator.serviceWorker is undefined in Private Browsing mode! WTF? So I can't use that for testing Firefox. Now every time I test I have to do some combination of Ctrl+Shift+Del and clear everything, or visit about:serviceworkers and manually unregister it. I don't know what really works; this seems flaky and doesn't seem to reliably cause SW reinstallation every time. It's possible it really is working but there's a problem with console logging (sometimes log messages would suddenly appear when switching tabs or something, indicating a SW installation), but this means working with Firefox is going to be a headache right off the bat.
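Since navigator.serviceWorker can be missing entirely, registration code has to guard for it. A minimal sketch, assuming a hypothetical helper name registerOfflineWorker and taking the navigator object as a parameter so the check can be exercised outside a browser:

```javascript
// Hypothetical helper (not from the polyfill repo): register a worker only
// when the API is actually exposed on the navigator object.
function registerOfflineWorker(nav, scriptUrl) {
  if (!nav || !nav.serviceWorker) {
    // No Service Worker support in this context, e.g. the Firefox
    // Private Browsing case described above. Resolve with null so
    // callers can fall back to something else.
    return Promise.resolve(null);
  }
  return nav.serviceWorker.register(scriptUrl);
}

// In a page this would be called as:
//   registerOfflineWorker(navigator, 'sw.js');
```

Returning a promise in both branches keeps the calling code the same whether or not the feature exists.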

I wrote some code to try to delete old caches and open a new cache based on the latest version, but Firefox started throwing SecurityError simply accessing caches.keys() or caches.open(). This turned out to be fixed the next day so I guess it was a Nightly quirk, but not before I refactored the code to only do caching stuff after an activate event, set up HTTPS on my local dev server, tried Firefox Stable which simply reported "TypeError: ServiceWorker script at localhost/sw.js for scope localhost encountered an error during installation. <unknown>" and confused me even more, and then gave up and spent the rest of the day only working in Chrome. So yeah, these development browsers can literally work one day but not the next. I guess I opted in to that, but the error messages were confusing and I ended up wasting time.

After setting up HTTPS in a vain attempt to get things working in Firefox, I switched back to Chrome and started getting the error Uncaught (in promise) DOMException: Failed to register a ServiceWorker: An SSL certificate error occurred when fetching the script. undefined:1. I think what is going on here is Chrome normally doesn't trust development SSL certificates, so as a developer you normally just click past the SSL warning, but the SW script request runs into the same warning and throws an error. So I guess you can't test SW on a local HTTPS server in Chrome? Firefox seemed fine with it. I switched back to HTTP and it started working again.

At one point I accidentally wrote event.waitUntil(...).then(...). (The 'then' should go inside the brackets, not outside.) Firefox reported the error event.waitUntil(...) is undefined which is totally misleading and had me looking up the spec for waitUntil. Chrome had a better error: "cannot read property 'then' of undefined".
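For reference, the mistake and its fix look roughly like this. It is a sketch only: handleInstall and populateCache are hypothetical names, and the event is passed in so the shape of the call is easy to see.

```javascript
// Wrong: waitUntil() returns undefined, so chaining on its result throws.
//   event.waitUntil(populateCache()).then(() => console.log('done'));

// Right: chain on the promise first, then hand the whole chain to waitUntil().
function handleInstall(event, populateCache) {
  event.waitUntil(
    populateCache().then(() => console.log('install work finished'))
  );
}

// In a real worker this would be wired up as:
//   self.addEventListener('install', (e) => handleInstall(e, populateCache));
```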

I needed to get the main page URL after the SW install so it could be implicitly cached. I could only get this from the first entry in clients.matchAll(). Then I tried switching between the / and /index.html URLs to test main page detection. At this point I ran into a super nasty Chrome bug: as you navigate between the two URLs, Chrome clears the console log, then after a moment prints the previous page's console messages again. So I kept thinking it was detecting the main page as index.html after navigating to /, and got super confused trying to figure out what the bug was in my code, and doubly confused wondering why a SW would reinstall again on the second pageload and wondering if I really had understood the lifecycle after all. Actually, my code was correct; it was just the old console log coming back. I had already run into this in October and mentioned it in a bug report, but it's such a subtly nasty and misleading bug that it got me all over again. This is where I start to think "this is horrible". That bug has been there for six months now. It became such a problem for me that I started logging a random number in the install log message, so that I could identify if it was the same log coming back (same number) or a genuine reinstall (new number). Yep.
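The random-number trick is tiny but worth showing. A sketch, assuming a hypothetical makeInstallLogger helper that would be called once per install event; the random source is a parameter only so the output can be checked deterministically:

```javascript
// Hypothetical helper: build a logger whose messages all carry the same
// random tag. A replayed console log shows the old tag; a genuine reinstall
// creates a new logger and therefore a new tag.
function makeInstallLogger(random = Math.random) {
  const tag = Math.floor(random() * 1e6);
  return (message) => {
    const line = `[install ${tag}] ${message}`;
    console.log(line);
    return line;
  };
}

// In a worker's install handler:
//   const log = makeInstallLogger();
//   log('caching main page');
```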

As part of the same bug report, Chrome often double-logs messages to the console. This is also super confusing. For example it will look like two separate requests were made for image.png, but it really only made one and the log message got duplicated. That's exactly what I saw back in October and it's still there. It's not even consistent, and some messages are genuinely repeated because the code is running more than once, so while working you have to mentally identify and remove the spurious log messages. Chrome also sometimes logs an error GET localhost/demos/sw/sw.js net::ERR_FILE_EXISTS, and I have no idea why. By now working in Chrome is as much of a pain as Firefox.

As part of the main page caching, I find that you cannot cache "/" as a URL - it won't match with the main page request when there's nothing after the last slash, e.g. example.com/ as opposed to example.com/index.html. This seems minor, it can be worked around by replacing "/" with the service worker scope.
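A sketch of that workaround, assuming a hypothetical resolveFileList helper and a file list that uses a bare "/" entry for the main page. In a worker the scope URL would come from self.registration.scope.

```javascript
// Hypothetical helper: turn each file-list entry into an absolute URL,
// mapping the bare "/" entry onto the worker's scope URL instead.
function resolveFileList(entries, scopeUrl) {
  return entries.map((entry) =>
    entry === '/' ? scopeUrl : new URL(entry, scopeUrl).href
  );
}

// Example: resolveFileList(['/', 'index.html'], 'https://example.com/game/')
// gives ['https://example.com/game/', 'https://example.com/game/index.html'].
```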

The update caching needs to be atomic, so there is never (or failing that for as short a time as possible) a partially-filled cache. So, is cache.addAll() atomic? I can't tell - the MDN docs don't directly state it, but it sounds like it is not, since it does say "the browser can resolve the promise as soon as the entry is recorded in the database even if the response body is still streaming in". Besides, you need a cache already created to use it, and as part of making this atomic I want to avoid there being an empty cache that exists while it waits for a bunch of requests to complete. Okay, so I won't rely on cache.addAll(). So now I have to write my own version of that function to wait for the requests to complete, then write them all in one go. For that writing, I can use put() but curiously there is no putAll() method. So the writing stage is definitely not atomic either: if the browser is closed half-way through these put() calls, I guess there will be a partial cache left behind. Ideally there would be a way to say "open a cache and add this content atomically", so either the cache does not exist at all, or it exists with all the content successfully written, and nothing in between. I can't see a way to implement this with the SW APIs. I leave it with the put() calls which reduce but do not eliminate partial-cache time.
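The fetch-everything-then-write approach can be sketched as below. The function name and parameters are hypothetical, and the cache storage and fetch function are passed in rather than taken from the worker globals. Note two limits: the put() calls are still separate writes, so this narrows the partial-cache window without closing it, and a fetch promise resolves when the response headers arrive, so bodies may still be streaming when the writes start.

```javascript
// Sketch: request every URL first, and only create the cache once every
// fetch has resolved with an OK response. NOT atomic: the put() calls
// below are independent writes.
async function fetchAllThenStore(cacheStorage, cacheName, urls, fetchFn) {
  const responses = await Promise.all(
    urls.map(async (url) => {
      const response = await fetchFn(url);
      if (!response.ok) {
        throw new Error(`Failed to fetch ${url}: ${response.status}`);
      }
      return response;
    })
  );
  // Reaching this point means no request failed, so no empty cache is
  // ever created for an update that cannot complete.
  const cache = await cacheStorage.open(cacheName);
  await Promise.all(urls.map((url, i) => cache.put(url, responses[i])));
  return cache;
}

// In a worker: fetchAllThenStore(caches, 'offline-v2', fileList, fetch)
```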

Bear in mind I'm considering complicated aspects like the atomicity of cache writes while the console logging is being misleading.

One brighter spot is that there is now a request.mode property that tells you if the request is a navigation (the value is "navigate"), meaning the main page load as opposed to a subresource request. This makes it much easier to identify this case, which is when we want to kick off things like a background update check. This however presents a problem of exactly how to kick off the update check in the main page request. We want to respond to the main page request ASAP to ensure best performance, but I vaguely remember something about SW code not being allowed to run outside of event handlers. I try to check this on MDN, but can't find anything about restrictions on when code can run. Besides, the only workaround I can think of is to do a little dance where you post a message to one of the clients which then posts back a message, which gives you an event handler you can run more code in. It would be kind of silly if I really had to do that, and I also run into a curious problem where clientId is null in the main page request, so I can't actually find a client to post this message to anyway. So I decide to ignore it for now and fix it later if it proves to be a problem. I'll just kick off the background check in the main page request, respond immediately to the request and allow the background check to continue outside of the event handler - it seems to work OK in Chrome.
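A sketch of this fire-and-forget shape, with hypothetical helper names (serveFromCache and checkForUpdate stand in for whatever the worker really does). Nothing here keeps the worker alive while the check runs, which is exactly the risk described above.

```javascript
// True for main page loads, false for subresource requests.
function isNavigationRequest(request) {
  return request.mode === 'navigate';
}

function handleFetch(event, { serveFromCache, checkForUpdate }) {
  if (isNavigationRequest(event.request)) {
    // Fire and forget: the response below does not wait for this, and
    // nothing tells the browser the check is still in progress.
    checkForUpdate().catch((err) => console.warn('update check failed:', err));
  }
  event.respondWith(serveFromCache(event.request));
}

// In a worker:
//   self.addEventListener('fetch', (e) => handleFetch(e, helpers));
```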

At one point I wanted a cache.name property for diagnostics, but curiously it's not there. I had to omit the name of the cache I was using from some logging.

After an update has downloaded, the cache switchover should be atomic. We don't want the main page to load the first 50 of 100 resources from v1, then a background update finishes and the second 50 load from v2. This basically results in a corrupt state for a non-trivial web app. To solve this, I decide to take an approach where each client will be associated with a specific cache (if any), and never change it for the lifetime of the page. This rules out the client ever fetching mixed version resources: even if an update finishes after 50 of 100 resources are loaded, it will continue loading the second 50 from the same cache. This also rules out using caches.match() since it uses any available caches. It also raises the question of when exactly a cache can be deleted, but let's not worry about that yet. So in the main page request we want to figure out which cache to use and remember it. Now we open a whole can of worms to do with clients.

I decide to try a WeakMap of Client to Cache. This means the Client keys aren't kept in memory if they only exist in the WeakMap, which I thought was a nice way to avoid leaking Clients. Requests give you a clientId so we can get the client from clients.get(clientId) and look in the map. I wonder whether the "code isn't allowed to run outside event handlers" restriction means I can't have a global map in my SW, but decide to also ignore this. Remember how I mentioned before how for some reason clientId is null in the main page request? This throws a spanner in the works: we don't know which client we are yet, so we can't associate a cache with it! Erk. I guess a client doesn't actually exist until the main page request has completed?! This is really a pain. It turns out the first sub-resource request has the correct clientId, so I decide all I can do is postpone the client-cache association until the next request. This immediately causes two more problems. Firstly, the first few requests all race to do the association since everything is asynchronous, so that has to be mitigated. Secondly, how do we know which cache to load the main page from? We may be offline, and we don't have a client yet. I decide I have to separately look up the newest cache available in both the main page request and the first fetch after that. I think technically this creates another atomicity loophole: it seems possible the main page and first request could choose different caches. I decide this seems unlikely and I don't try to solve it.

Once I set that up, according to my logging, every fetch request is associating a new client with the cache. At first I suspect this is double-logging again, but then by using a normal Map and logging its size, it turns out it really is adding a new Client key for every request. Every client is found by clients.get() with the same clientId. What is going on?? I decide that what must be happening is every call to clients.get() returns a newly-constructed Client object, even when passed the same clientId twice. This causes each Client I get to be a separate instance, so they are not equal and count as separate keys. This is super unhelpful and another nasty gotcha: it makes Client useless as a map key. This aspect of clients.get() does not appear to be in the MDN docs either. I decide to use a normal Map instead of a WeakMap and key off the clientId, which is a string. This solves the problem, but now keys are never cleaned up. Technically it's a memory leak, but I decide it doesn't matter since it will be cleaned up at the end of the browsing session, and solving it would be really hard - there doesn't seem to be a client close event that I could use to remove the key, for example. Finally I have a way to map a Client to a Cache, but I am particularly dubious about how I can't do this association in the main page request.
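A sketch of the string-keyed map, with the race between the first few requests handled by storing the pending promise rather than its result. The names are hypothetical. One caveat to keep in mind: this map only lives as long as the worker instance, so it is emptied whenever the browser stops and restarts the worker.

```javascript
// clientId (string) -> Promise resolving to the cache name chosen for
// that client. Entries are never removed, matching the leak described above.
const cacheNameByClientId = new Map();

function cacheNameForClient(clientId, chooseNewestCacheName) {
  let pending = cacheNameByClientId.get(clientId);
  if (!pending) {
    // Store the promise synchronously, before any await, so concurrent
    // fetch events for the same client share one lookup instead of racing.
    pending = chooseNewestCacheName();
    cacheNameByClientId.set(clientId, pending);
  }
  return pending;
}
```

Because the promise is stored before anything is awaited, ten simultaneous requests from one page all get the same answer, even if an update finishes in between.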

I'm getting close now. I decide old caches need to be deleted to avoid wasting space, but I'm not sure when to do that: there isn't a client close event, and the old cache needs to be kept around as long as its associated Clients are alive even after an update finishes downloading. I decide to make the code that finds the newest cache in the main page request also delete any older caches than the one it selects. I think this could technically delete a cache that is in use if multiple windows are open, but I decide to pretend that's not a problem. (Everything is hard enough that I guess my thoroughness is starting to slip here!)
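A sketch of "pick the newest cache and delete the rest". It assumes a naming scheme of a fixed prefix followed by an integer version, such as "offline-v12"; that scheme and the function name are assumptions for illustration. It has the same flaw as described above: a cache still in use by another window can be deleted.

```javascript
// Assumes cache names look like "<prefix><integer>", e.g. "offline-v12".
// Caches that do not match the prefix are left alone.
async function pickNewestAndPrune(cacheStorage, prefix) {
  const version = (name) => parseInt(name.slice(prefix.length), 10);
  const names = (await cacheStorage.keys()).filter(
    (name) => name.startsWith(prefix) && Number.isInteger(version(name))
  );
  if (names.length === 0) return null;

  // Highest version first.
  names.sort((a, b) => version(b) - version(a));
  const [newest, ...older] = names;

  // Unsafe with multiple windows: an older cache may still be in use.
  await Promise.all(older.map((name) => cacheStorage.delete(name)));
  return newest;
}

// In a worker: pickNewestAndPrune(caches, 'offline-v')
```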

Finally I've gotten far enough to test the upgrade process. Even specifying "reload" cache mode on the update requests, Chrome returns stale responses in my test: after updating to v2, two files return from v2, but still one from v1. It turns out Chrome doesn't support the cache control option yet (crbug.com/453190), so my attempt at using the "reload" cache mode has no effect. This means Chrome is still allowed to return stale responses from the HTTP cache. Why only one file is returned stale is a mystery to me, particularly because my local development server is configured to send Cache-Control: no-cache. The official way to work around this is to add random numbers in the query string. This technically requires parsing query strings in the file list to do it properly. By now I'm pretty exhausted and frustrated; I decide to give up and wait for Chrome to support cache modes.
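For what it's worth, the query-string workaround does not need hand-written parsing if the URL class is available. A sketch, where the helper name and the "_sw_bust" parameter name are made up for illustration:

```javascript
// Hypothetical helper: append a throwaway query parameter so the HTTP cache
// sees a URL it has not stored. Existing query parameters are preserved.
function withCacheBuster(url, baseUrl, token = Date.now().toString(36)) {
  const parsed = new URL(url, baseUrl);
  parsed.searchParams.set('_sw_bust', token);
  return parsed.href;
}

// Example:
//   withCacheBuster('img/a.png?size=2', 'https://example.com/game/', 'abc')
// gives 'https://example.com/game/img/a.png?size=2&_sw_bust=abc'.
```

The response would still need to be stored in the cache under the original URL rather than the busted one, otherwise later requests will not match it.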

By now it's the next day, so I decide to try Firefox again. It seems the "SecurityError" problem when accessing caches is fixed now... but now the background update check sporadically fails with "AbortError: The operation was aborted.", or (worse) appears to just silently stop half way through, because no more console logs on its code path end up being logged. Oh... I guess I was right, and you can't run code outside an event handler, and Firefox is stricter about this. So Chrome and Firefox are inconsistent here. But Firefox doesn't appear to tell you that it terminated code mid-execution. It should probably log a warning or something. As it is I'm just looking at a truncated set of console logs and wondering why.
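One possible way around this, which the code described here did not try: in the Service Worker spec a fetch event is also an extendable event, so it has a waitUntil() method that can be given the background work. A sketch under the assumption that the browser honours waitUntil() on fetch events, with hypothetical helper names:

```javascript
// Sketch: respond straight away, and separately ask the browser to keep the
// worker alive until the update check settles.
function handleNavigation(event, serveFromCache, checkForUpdate) {
  event.respondWith(serveFromCache(event.request));
  event.waitUntil(
    // Swallow failures so a failed check is logged, not left unhandled.
    checkForUpdate().catch((err) => console.warn('update check failed:', err))
  );
}
```

Whether this avoids the AbortError and the silent truncation seen in Firefox would need testing in the browsers themselves.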

I decide I've had enough. I've got enough code that I'm pretty close, so I've created the currently-broken appcache-sw-polyfill GitHub repository with my code so far, and documented its somewhat serious shortcomings. I might keep tweaking it, but I'm hopeful this has a broad enough appeal that some community cooperation can fix it. I've complained in the past that there is no convincing AppCache polyfill, and it turns out to be a lot of difficult work, so hopefully posting this maybe-90% done version will help that happen.

My verdict is this: six months later, for a second time I have failed to successfully implement AppCache in Service Worker. Along the way, my experience has been that Service Worker is an absolute minefield of nasty gotchas, bugs, missing features, insufficient documentation, thorny problems that I ended up just glossing over, and some things which are just a mystery, that make it overall a nightmare to work with. Perhaps there are better SW experts out there, or other people have different development styles that could work better - that's part of the reason I put the code up on GitHub. But I am personally exasperated at the state of things, and further exasperated by the constant recommendation to hurry up and move from AppCache to SW.

Recommendations to improve Service Worker

Based on my experience, I created a list of changes that I think would have made this a far better development experience:

  • Allow Service Worker in Firefox Private Browsing mode, so the fresh-install case can be tested easily. Why isn't navigator.serviceWorker there?!
  • Allow SW to work in Chrome with an untrusted certificate on localhost.
  • Fix Chrome issues 543104 and 453190.
  • Update MDN documentation to clearly state if functions are atomic and specifically when code is allowed to run in a SW.
  • Improve error messages to always correctly identify the problem, as well as logging a warning if a SW was terminated while still executing code. (For Chrome, perhaps a warning if code is executed outside of an event handler, since that works but is non-portable.)
  • If the postMessage dance is really necessary in order to run code outside of an event handler, there really should be a self.run() convenience method to basically do the same from the SW itself.
  • Provide an atomic cache.putAll() or "create cache with contents" method.
  • Provide the correct clientId in a "navigation" fetch event (i.e. for main page)
  • Make clients.get() always return the same Client object for the same id, so they are strictly equal and can be mapped properly
  • Add a client "close" event for a SW to remove any data associated with a client
  • cache.name property for convenience, just returning the name that it was opened with
  • Maybe allow "/" in a cache to match the main page request?

Looking at this, I think it could easily be another six months before all of this can be addressed. Is Service Worker really ready? I don't think it is.

Conclusion

Service Worker is still the future. It's actually got a pretty exciting feature set, allowing general network interception (including more advanced offline cases than AppCache), background processing/sync, notification handling and more. However I hate working with it. The development experience is terrible. It still feels like an alpha-quality feature that needs a lot of work - and sadly, this feeling has not changed much in six months. AppCache had its problems, but is this really easier to work with? As it stands Service Worker is a total minefield and despite trying pretty hard, there is no way I can ship a SW replacement for AppCache for Construct 2 games.

The best-case scenario is we can figure out how to fix all the problems with the "polyfill" (a separate implementation really), and then that can just be a nearly-drop-in replacement for anyone using AppCache, so they won't be subjected to the considerable pain of having to write SW code. I really hope it turns out I'm just too stupid to figure out Service Worker, someone else fixes this, and then we can ship it with Construct 2. Even if that happens though, I am frustrated with the approach browser vendors have taken: AppCache is already on a deprecation path without anyone else having provided a convincing polyfill, which I am inclined to believe is because it's not possible yet! Did nobody else try? I hope browser vendors aren't prepared to remove features without there being any plausible alternative yet.

So, browser vendors: please keep AppCache until the situation improves. Service Workers are a pain in the ass. And if you know what you're doing, please try to fix my broken code!
