UI Testing at Squarespace: Part I

Intro

There are many different ways to test a web application. At Squarespace, we use unit and integration tests for both frontend and backend code, as well as API tests for backend services. These tests catch a wide variety of bugs, but are unable to directly answer simple questions like “Can I log into my site?” or “Can I complete an order on a Commerce site?” These types of questions are better addressed by end-to-end UI testing.

With continuous deployment enabled for critical backend services and Squarespace’s entire application frontend, it is essential to have meaningful, reliable UI tests. Anything merged to master will ship to production automatically with no human checks, so these tests are the last guarantee that basic features of our product work. To effectively test Squarespace’s complex, interaction-heavy application, we built a uniquely powerful browser testing tool called Charcoal.

Why have UI tests?

In spring of 2014, Squarespace Commerce had a nasty outage. We pushed a set of JavaScript changes for our checkout page to production during normal working hours for the continental US. No issues were reported. At about 3:00AM ET, however, European shoppers couldn’t complete checkout because we neglected to test the shipping section UI with a non-US address entry. Support tickets flooded in, our team lead was woken up, and in his sleep-deprived state he managed to push out a hotfix and go back to sleep.

The next day, we discussed the previous night’s outage and brainstormed how to prevent similar outages moving forward. In 2014, we only had server-side testing, so it was clear that the UI was a major hole in our pre-deployment suite. We decided that any UI testing suite had to meet a few key requirements:

  1. It must cover both JavaScript and CSS regressions. If an element is accidentally set to display:none, that’s just as bad as its renderer crashing.
  2. It must be highly reliable. When tests are flaky, engineers stop paying attention and bugs slip by.
  3. It must run quickly. These tests need to be run before every deployment, and slowing down the deploy pipeline is unacceptable.
  4. It must be easy for product engineers to write tests. The language and tooling shouldn’t be anything exotic.

Charcoal

After researching tools including Karma, Mocha, PhantomJS, and NightwatchJS, we determined that no pre-existing tool met all of our needs, but NightwatchJS came the closest. NightwatchJS is a JavaScript library that provides a fluent API for browser automation via Selenium. Since NightwatchJS is a Selenium-based tool, it’s vulnerable to all of the common pitfalls in Selenium-based tests:

  • Tests easily devolve into highly procedural blobs of assertions. Click x, assert x contains y, click z, etc. This type of code is fragile and doesn’t evolve easily.
  • Tests rely on assertions of DOM content, usually locating an element via a CSS selector then asserting some property of its content. This is inherently fragile because it makes the DOM into an external API. This breaks down very quickly on a rapidly evolving UI.
  • Non-deterministic timing behavior caused by things like async network requests is often papered over with simplistic “wait for 10 seconds, then assert” style patterns. These inevitably fail that one time the request takes a little longer than expected.
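
To make these pitfalls concrete, a typical test in this style might look something like the hypothetical Nightwatch snippet below (illustrative only, not Squarespace code). Every selector and text assertion couples the test to the current DOM, and the fixed ten-second pause only hides the timing problem:

// A hypothetical example of the procedural, DOM-assertion style described above.
module.exports = {
  'save a blog post': function (browser) {
    browser
      .url('https://example.com/config')                    // open the editing UI
      .click('.sqs-dialog .save')                           // click x...
      .pause(10000)                                         // hope the save request finishes in time
      .assert.containsText('.notification', 'Post saved')   // ...assert y contains z
      .end();
  }
};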

Due to all of the above, tests tend to be flaky, giving browser automation a cultural stigma. Developers have bad experiences and tend to be skeptical that UI tests are actually worth writing.

To meet our initial goals and avoid the pitfalls of traditional Selenium testing, we decided that building a more opinionated framework on top of NightwatchJS was the most productive route to take. After a few weeks of prototyping with some promising first results, we named this framework Charcoal. Charcoal aims to solve many of the problems in Selenium testing by adhering to these four principles:

  1. All tests revolve around taking UI screenshots, then comparing those screenshots to a known “stable set” of images recorded against a clean version of the app under test. These screenshot comparisons are the assertion of record for tests.
  2. A test may assert that a DOM element is present before interacting with it, but any assertions on the content or state of the element are forbidden.
  3. Non-deterministic timing events must always be handled with some type of polling watcher. This includes waiting for DOM elements, waiting for an async network request, and waiting for UI animations to complete.
  4. All tests must use the PageObject pattern. Charcoal is designed such that a test script never has direct access to Nightwatch’s main Selenium entrypoint. Test scripts are initialized with an appropriate PageObject implementation which owns this entrypoint.
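
To make the last principle concrete, a test written this way might look roughly like the sketch below; the class and method names are illustrative, not Charcoal’s actual API. The point is that the test script only ever talks to a PageObject, which owns the browser entrypoint and the screenshot recorder.

// Hypothetical PageObject sketch (illustrative, not Charcoal's real API).
class CheckoutPage {
  constructor(browser, screenshotRecorder) {
    this.browser = browser;                       // Nightwatch/Selenium entrypoint, owned by the PageObject
    this.screenshotRecorder = screenshotRecorder; // pre-initialized screenshot recorder
  }

  open(url) {
    this.browser.url(url);
    // Presence checks before interacting are allowed; content assertions are not.
    this.browser.waitForElementVisible('.checkout-form', 10000);
    return this;
  }

  fillShippingAddress(address) {
    this.browser.setValue('.shipping-address .line1', address.line1);
    this.browser.setValue('.shipping-address .country', address.country);
    return this;
  }

  recordShippingSection() {
    // The comparison of this screenshot against the stable set is the assertion of record.
    this.screenshotRecorder.record('shipping-section');
    return this;
  }
}

// A test script never touches the browser directly:
// new CheckoutPage(browser, recorder)
//   .open('/checkout')
//   .fillShippingAddress(europeanAddress)
//   .recordShippingSection();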

Some of these principles are enforced with coding standards, while others such as the PageObject mandate are pushed by the framework itself. To make the mandates around screenshots and timing non-determinism easy for developers to implement, Charcoal provides a few advanced utilities on top of Nightwatch. Two of the more interesting ones are Charcoal’s screenshot recorder and its mechanism for handling async network requests.

ScreenshotRecorder

Screenshots are the bread and butter of Charcoal. They allow us to avoid messy DOM assertions, catch CSS-induced bugs, and capture large swaths of UI behavior implicitly. To enable tests to capture high-quality screenshots, Charcoal provides a ScreenshotRecorder object that builds on top of Selenium’s built-in screenshotting capability.

A pre-initialized ScreenshotRecorder object is passed into every PageObject’s constructor. Instance methods in PageObject subclasses record an image with the single-line command screenshotRecorder.record(<name-of-screenshot>). To ensure that screenshots are not marred by in-progress UI animations, record() takes a rolling series of screenshots at 100ms intervals, and only records an image to disk once five identical, consecutive screenshots are observed. These “scratch screenshots” are compared to one another using ImageMagick.
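
A rough sketch of that stability check might look like the following (illustrative only, not Charcoal’s actual implementation; it assumes the ImageMagick CLI is installed and that takeScreenshot(path, callback) wraps Selenium’s built-in screenshot command):

// Hypothetical sketch of the "scratch screenshot" stability check (not Charcoal's real code).
const { execFileSync } = require('child_process');

const POLL_INTERVAL_MS = 100;   // delay between scratch screenshots
const REQUIRED_IDENTICAL = 5;   // consecutive identical captures needed before recording

function imagesIdentical(pathA, pathB) {
  try {
    // `compare -metric AE` reports the number of differing pixels and
    // exits non-zero when the images differ at all.
    execFileSync('compare', ['-metric', 'AE', pathA, pathB, 'null:']);
    return true;
  } catch (err) {
    return false;
  }
}

function recordStableScreenshot(takeScreenshot, name, onRecorded) {
  let identicalCount = 0;
  let previousPath = null;
  let shotIndex = 0;

  (function poll() {
    const currentPath = `/tmp/scratch-${name}-${shotIndex++}.png`;
    takeScreenshot(currentPath, function () {
      identicalCount = previousPath && imagesIdentical(previousPath, currentPath)
        ? identicalCount + 1
        : 1;
      previousPath = currentPath;

      if (identicalCount >= REQUIRED_IDENTICAL) {
        onRecorded(currentPath);            // UI has settled; persist this capture
      } else {
        setTimeout(poll, POLL_INTERVAL_MS); // keep polling until the UI stops changing
      }
    });
  })();
}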

To facilitate screenshot comparison with the stable set once the test completes, image files are written in a predetermined directory structure and are given a chronological ordering number, e.g. smoke-tests/commerce-checkout/digital-goods/06-checkout-page.png. Stable set images follow this exact scheme, so comparison can be done in a straightforward, one-to-one manner. To prevent false failures on small pixel differences due to things like a slight change in a drop shadow, the test writer may specify a diff threshold in number of pixels for each named screenshot in a test.
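
Once a test run finishes, comparing against the stable set is a matter of diffing each recorded image with its one-to-one counterpart and checking the pixel count against that screenshot’s threshold. A sketch of that comparison, again assuming the ImageMagick compare command and illustrative file paths:

// Hypothetical stable-set comparison with a per-screenshot pixel threshold
// (illustrative only, not Charcoal's actual code).
const { execFile } = require('child_process');

function compareToStable(recordedPath, stablePath, maxDiffPixels, callback) {
  // `compare -metric AE` writes the count of differing pixels to stderr.
  execFile('compare', ['-metric', 'AE', recordedPath, stablePath, 'null:'],
    function (err, stdout, stderr) {
      const diffPixels = parseFloat(stderr);
      if (Number.isNaN(diffPixels)) {
        return callback(new Error('Could not compare ' + recordedPath));
      }
      // Small differences (e.g. anti-aliasing on a drop shadow) stay under the
      // threshold; anything larger fails the test.
      callback(null, diffPixels <= maxDiffPixels);
    });
}

// e.g. allow up to 50 differing pixels on the checkout page screenshot
compareToStable(
  'results/smoke-tests/commerce-checkout/digital-goods/06-checkout-page.png',
  'stable-set/smoke-tests/commerce-checkout/digital-goods/06-checkout-page.png',
  50,
  function (err, passed) { /* report pass or fail */ }
);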

With ScreenshotRecorder’s unique “scratch screenshots” and well-defined diffing behavior, we’ve been able to build a test suite that is consistently centered around screenshot comparison. Recording a reliable screenshot is no more difficult in Charcoal than asserting on an element, so developers have naturally chosen to write tests that revolve around visual comparison.

doAndWaitForRequest(...)

Many UI interactions follow a workflow of “interact with element, make async network request, do something with response”. A typical example of this would be saving a blog post in Squarespace’s /config UI. The user clicks the Save button, an XHR is sent to /api/content/blogs/{blogId}/posts/{postId}, then the post’s editing dialog closes upon observing a 200 response from the async request. The amount of time an XHR takes to complete, and hence the amount of time before the UI responds, is inherently non-deterministic. To handle this non-determinism in an explicit, consistent way, Charcoal provides a doAndWaitForRequest command that performs a polling wait on XHRs. Internally, this command is implemented by monkey-patching window.XMLHttpRequest so that new and in-progress requests can be aggregated and polled on a page-wide level. Within a PageObject, a call looks like this:

this.browser.doAndWaitForRequest({
  action: this.browser.click.bind(this.browser, '.sqs-dialog .save'),
  optionalRequestUrl: '/api/content/blogs/*',
});

At runtime, the resulting behavior is as follows:

  1. Click the “Save Post” button located at .sqs-dialog .save
  2. Check if a pending request matching /api/content/blogs/* has been created.
    • If a pending request has been created, pause test execution by enqueuing pause() commands in Nightwatch and poll the request’s state every 200ms.
    • If a pending request has not been created, do nothing and allow the test to continue executing. The app under test “chose” not to make an async request.
  3. When the request is observed in a complete state, cease enqueuing pause commands and allow test execution to continue.
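
A minimal sketch of the kind of XMLHttpRequest patch that could support this behavior is shown below (illustrative only, not Charcoal’s actual implementation). The patch would be injected into the page under test, and the test runner would then poll the page-wide list of tracked requests, for example via Nightwatch’s execute command.

// Hypothetical in-page XMLHttpRequest patch for tracking pending requests
// (illustrative only, not Charcoal's actual implementation).
(function trackPendingRequests() {
  window.__pendingRequests = [];

  var originalOpen = XMLHttpRequest.prototype.open;
  var originalSend = XMLHttpRequest.prototype.send;

  XMLHttpRequest.prototype.open = function (method, url) {
    this.__trackedUrl = url;   // remember the URL for matching patterns like /api/content/blogs/*
    return originalOpen.apply(this, arguments);
  };

  XMLHttpRequest.prototype.send = function () {
    var entry = { url: this.__trackedUrl, complete: false };
    window.__pendingRequests.push(entry);
    this.addEventListener('loadend', function () {
      entry.complete = true;   // fires whether the request succeeded, failed, or was aborted
    });
    return originalSend.apply(this, arguments);
  };
})();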

In practice, doAndWaitForRequest has been a key tool for handling the myriad async requests made by Squarespace’s single-page CMS app. With this command available, we’ve been able to successfully test workflows that would be awkward or impossible to cover without explicitly handling async requests.

Deployment

Charcoal’s original deployment was to a spec’d out, headless Mac Mini. We decided that it was important to run the tests on a fast, dedicated machine in order to avoid extra non-determinism from resource contention. At first, we let the browsers run on the virtual “screen” in OS X’s VNC implementation. We found, however, that this was degrading browser rendering performance because OS X wasn’t using the same graphics path it does with a real display. This slowed the test suite down and occasionally led to seemingly random failures. Adding an HDMI dongle fixed this by “tricking” OS X into thinking a real display was connected, allowing browsers to use hardware-accelerated rendering as they normally do.

This deployment approach worked well for our initial suite of 12 smoke tests, but it was not horizontally scalable and, because it lived on a one-off Mac Mini, not easy for our SRE team to manage. To address both of these issues, we ended up migrating to a Docker + Linux-based approach that could be easily scaled and managed using our standard infrastructure management tooling. For more details on this, check out Mike Wrighton’s blog post Turbocharging Our UI Tests.

Outcomes

Charcoal became a valuable part of our release process and successfully caught both catastrophic bugs and smaller visual regressions. We were able to test multiple flows through our Commerce checkout page, our login page, and a few different CMS editing flows.

As Charcoal grew more popular, though, a few problems did reveal themselves:

  • When failures did occur, diagnosing them was difficult. Nightwatch’s command log was helpful, but more context was typically needed. Debugging could be frustrating.
  • While tools like doAndWaitForRequest() helped manage non-determinism, seemingly random behaviors and failures did sometimes occur.
  • For UI flows that require fixture state, e.g. a flow expecting a blog page to be present, managing fixture setup across multiple websites became manual and cumbersome.

To address these issues and enable UI testing on an even broader scale, we built a rich set of extensions and tooling around Charcoal, along with a full UI for managing and running tests. This new, Charcoal-based tool suite is called Firepit. Firepit now tests 100s of UI flows at Squarespace and powers the final set of smoke tests in our continuous deployment pipeline. To read more about Firepit, stay tuned for our upcoming post, “UI Testing at Squarespace: Part II”.

If you’re interested in solving problems like this, our Engineering team is hiring!
