About Blog Pricing Sign Up

Bug Report: Broken Code, Passing Tests

Mike November 22, 2021

This month, I chased a bug throughout the entire Homechart codebase that ended up being a very trivial mistake leading to some strange behavior. The bug is noteworthy due to how nightmarish it was: occurring only during browser testing, and only in the CI environment (GitHub Actions).

Architecture Overview

Since this bug touched most of the codebase (at least, while diagnosing it), it’s best to start with the current Homechart architecture. The frontend is web-based, written in TypeScript compiled to JavaScript using the Mithril framework. The backend is REST-based, written in Go, and stores data in PostgreSQL. The frontend uses a pool of Web Workers when performing API GET responses, and caches the results to IndexedDB using a combination of last modified dates and hashes so it can work offline and limit the load of the API server.

The Bug

After adding the API routes and frontend state management for the Bookmarks component, the GitHub Actions job that runs the Puppeteer browser tests started to fail. Specifically, the test that counted the number of PlanTasks displayed in the frontend and confirmed they match the initial amount for new accounts showed 0 instead of 20. To add to the frustration, the test takes about 10 minutes to provision the testing environment and fail.

Not a lot to go off of...

Isolation

Running the same Puppeteer tests locally always passed, making reproduction exceptionally difficult. Starting with isolating the problematic code, I discovered that breaking the new Bookmark API routes and having them return 404s made the tests pass. Extending this further, changing the frontend states to target an invalid URL for the Bookmark API routes also made the tests pass.

The code is broken, but the tests pass

Bizarrely, making any, single, frontend state manager (not just Bookmarks–CalendarEvents, CookRecipes, PlanTasks, etc) fail would cause the tests to pass. The issue was in the frontend, but without a way to reproduce the bug, it was extremely difficult to pinpoint where it was occurring.

Reproduction

After multiple nights of fruitless diagnosis, the key to reproducing the bug eventually showed itself in the number of CPUs the GitHub Actions worker uses: two. As mentioned above, Homechart offloads API read operations to a pool of WebWorkers, and this pool size is based on the number of CPUs available. Since my local environments always had at least four CPUs, I hardcoded the pool count to two. Finally, I could reproduce the problem.

Web worker pool sizing

Resolution

After reproducing the bug, it became easier to trace how the bug was happening. The Puppeteer tests sign up for Homechart, open the menu, sign out, sign back in, and check various pages (including the Tasks page, which ultimately fails). Digging into the API responses, after Puppeteer signs back in, all the API responses were returning 204s because the frontend was sending a hash that matched the latest data on the API.

Turns out, the Web Workers were holding the hashes in a global variable that wasn’t being cleared properly during sign out. The bug manifested due to the number of API requests before sign out being even (since adding the new Bookmark state management) and perfectly distributing themselves across the two Web Workers. This caused the API requests after the next sign in to use the same Web Worker with the cached hash value. After fixing the web worker reset code, everything worked as expected.

Learnings

Nothing is more annoying than a bug you can’t reproduce, and CI job logs (especially Puppeteer) leave a lot to be desired. Homechart is already using Jaeger on the backend for tracing, so I started playing around with using it on the frontend. I believe I can build a a full stack tracing pipeline that I can leverage in the Puppeteer tests to give more visibility when tests fail. Hopefully there will be another blog in the future about this.