UI Testing at Squarespace: Part II

Firepit: Initial Purpose

Over the last four years, Squarespace's product offerings have expanded significantly. To keep up with the growing platform, we needed to scale our functional test coverage.

We created Firepit, a user interface (UI) for configuring and triggering tests. Firepit empowers engineers to easily run custom test suites in Squarespace's deployment environments.

The foundation of our testing philosophy is straightforward: write tests that are simple, readable, and stable.

Automated screenshot testing is notoriously unstable. To mitigate that instability, our tests use only a handful of carefully placed screenshots, which in turn lets us minimize our use of DOM assertions.

Keeping DOM assertions lean is essential to keeping our tests simple and scalable, since it leaves few points of failure. For example, one of our tests covers the signup flow. It starts with a trial signup and ends with a subscription purchase. The test only includes DOM-based assertions for the input fields and buttons that a user would touch during the signup flow. In addition, every DOM assertion depends on the assertions preceding it. If an element that the test expects to see doesn't appear, every subsequent query fails.
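
Firepit's test code isn't shown in this post, but the dependent-assertion pattern looks roughly like the sketch below, written against Selenium's Python bindings. The URL and selectors are hypothetical placeholders, not Squarespace's actual markup.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 15)

# Hypothetical entry point for the trial signup flow.
driver.get("https://example.squarespace.com/signup")

# Each step waits only for the element a real user would touch next.
# If an expected element never appears, the wait times out and every
# subsequent step is skipped: a single point of failure, not dozens.
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#email"))).send_keys("test@example.com")
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#password"))).send_keys("hunter2hunter2")
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type=submit]"))).click()

# ...intermediate steps elided; the flow ends with a subscription purchase.
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#subscribe"))).click()
```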

Minimal DOM assertions can miss items that are vital to user interactions. For example, our DOM assertions wouldn't notice if a CSS change accidentally shrank the “Subscribe” button to 2x2 pixels. This is why we use strategically located screenshots throughout the test flow. The screenshots provide valuable health checks on the visual state of the product. One screenshot can eliminate the need for dozens or even hundreds of DOM assertions. The expected "stable set" of screenshots used for comparison is easily configurable within our UI, requiring no test code changes for visual product updates.
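
The post doesn't detail Firepit's image comparison, but the idea of checking a captured screenshot against a configurable stable set can be sketched with Pillow. The file paths and the per-pixel tolerance parameter are assumptions for illustration.

```python
from PIL import Image, ImageChops

def screenshot_matches_stable_set(actual_path, stable_paths, tolerance=0):
    """Return True if the captured screenshot matches any image in the
    configured stable set, within a per-pixel tolerance."""
    actual = Image.open(actual_path).convert("RGB")
    for stable_path in stable_paths:
        stable = Image.open(stable_path).convert("RGB")
        if stable.size != actual.size:
            continue
        diff = ImageChops.difference(actual, stable)
        # getextrema() returns a (min, max) pair per channel; the max values
        # give the largest per-pixel deviation anywhere in the image.
        max_deviation = max(channel_max for _, channel_max in diff.getextrema())
        if max_deviation <= tolerance:
            return True
    return False

# A stable set with more than one accepted baseline absorbs intentional
# visual updates without any test code change.
assert screenshot_matches_stable_set(
    "runs/1234/subscribe-page.png",
    ["stable/subscribe-page-v1.png", "stable/subscribe-page-v2.png"],
)
```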

Individual tests are designed to be straightforward smoke tests for specific functions of our platform. Firepit tests are not designed to cover:

  • Race conditions
  • Performance regressions
  • Security bugs
  • Data validation
  • Minor product regressions
  • Edge cases
  • Browser-specific bugs

Our testing philosophy also focuses on reducing false failures by using robust retry logic. This builds trust in the validity of a failure alert.

In July 2017, an individual test in our smoke suite had an average per-run pass rate of 99.49% (excluding true failures). In a 50-test session, the compound probability of at least one false failure is therefore 1 − 0.9949^50, or roughly 22.6%. To reduce this rate of spurious failure, we rerun a failed test up to three additional times. A result is reported as a true failure only if the test fails on all four runs. This configuration reflects our focus on detecting severe breaking changes and helps foster trust in the validity of failure alerts.
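
Assuming independent runs, the arithmetic behind these figures is easy to verify:

```python
# Per-run probability that a single test passes (false failures only).
p_pass = 0.9949
n_tests = 50

# Without retries: a session falsely fails if any one of the 50 tests fails.
p_session_false_failure = 1 - p_pass ** n_tests
print(f"{p_session_false_failure:.1%}")  # ~22.6%

# With retries: a false failure now requires four independent spurious
# failures of the same test, which is vanishingly unlikely.
p_test_false_failure = (1 - p_pass) ** 4
p_session_with_retries = 1 - (1 - p_test_false_failure) ** n_tests
print(f"{p_session_with_retries:.8%}")  # effectively zero
```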

Operational Impact

Firepit has had a significant operational impact at Squarespace. With automated functional testing, we can implement continuous deployment without the need for manual QA on day-to-day releases.

For a release to build successfully in our staging environment, it must first pass a series of unit, integration, and API tests. Finally, it has to pass a smoke suite of around 50 Firepit tests. If the build fails any of those tests, the release is marked as a failed build within Bamboo and is not promoted to production. A Slackbot notification is also sent to our #engineering Slack channel with the titles of the failed tests, links to the test run details in Firepit, and alerts for everyone who committed code to that release.
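
The Bamboo and Slackbot integrations are internal to Squarespace, but posting such a notification through a standard Slack incoming webhook might look like the following sketch. The webhook URL, Firepit URL scheme, and payload fields are hypothetical.

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # hypothetical

def notify_failed_build(release_id, failed_tests, committers):
    """Post failed test titles, Firepit links, and committer alerts to #engineering."""
    lines = [f"Release {release_id} failed the Firepit smoke suite:"]
    for test in failed_tests:
        # Hypothetical URL scheme for a test run's detail page in Firepit.
        lines.append(f"- {test['title']}: https://firepit.internal/runs/{test['run_id']}")
    lines.append("cc " + " ".join(f"<@{user}>" for user in committers))
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": "\n".join(lines)})
    response.raise_for_status()

notify_failed_build(
    "v6-2017.07.105",
    [{"title": "signup-flow", "run_id": "8841"}],
    ["U024BE7LH"],
)
```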

We execute tests on seven virtual machines (VMs), each running eight Docker containers, for 56 containers in total: 55 run tests in parallel, and one hosts the Firepit UI. Having enough hardware to execute the smoke suite in parallel minimizes build time; our average smoke suite runtime is about seven minutes.
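
To illustrate why this parallelism keeps build time low: with at least as many workers as tests, a session lasts only as long as its slowest test. Here is a rough sketch using Python's standard library, where run_test is a placeholder for dispatching a test to its own container:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_test(test_name):
    """Placeholder for launching one test in its own Docker container."""
    return f"{test_name}: passed"

smoke_suite = [f"smoke-test-{i:02d}" for i in range(50)]

# 55 workers mirror the 55 test containers, so all 50 tests start at once
# and the suite finishes when the slowest one does.
with ThreadPoolExecutor(max_workers=55) as pool:
    futures = {pool.submit(run_test, name): name for name in smoke_suite}
    for future in as_completed(futures):
        print(future.result())
```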

In July 2017, Firepit ran on 117 releases, averaging about six runs per working day. One smoke suite session is equivalent to about five hours of manual testing, so matching that month's frequency and quality of deploys by hand would have taken roughly 117 × 5 = 585 hours of manual testing.

Finally, during the first half of 2017, Firepit caught 53 regressions that our unit, integration, and API tests would have missed. Thirty-two of those regressions were severe enough to have caused outages. Our functional test coverage gives continuous integration a high level of quality assurance without the need for manual testing. It also gives engineers confidence and peace of mind: a reminder that a one-line code change won't bring down millions of sites.

Firepit is automatically triggered only on deploys to our v6 repository, which houses the main platform functions of Squarespace. As we decompose our largely monolithic service architecture, deployments to non-v6 service repositories are becoming more frequent, and they currently have no automated UI test coverage. Our testing methodology must adapt to mitigate these newly exposed vulnerabilities. We're working to expand our Bamboo integration with Firepit so that, before each release, it triggers suites of tests focused on the core functions of the service being updated. With this expanded coverage, we'll also need smarter allocation of VM resources, designating individual workers on our VMs to run service deploys in parallel with separate test sessions.

Our testing philosophy and practices will continue to evolve with Squarespace. As our platform grows, we will continue to improve and adapt our testing framework to meet the needs of the company as a whole and maintain the quality our customers expect and deserve.

Want to merge with us? Check out our job openings here.
