
SaaS: Screenshots as a Service

As part of our ongoing efforts to improve user experience on the Squarespace platform, we added a feature to display thumbnail previews of all of your account’s sites in the account picker.

Since we didn’t have any existing functionality around screenshot generation, this was something we had to build from scratch. In this post we’ll talk about the screenshots microservice solution we came up with.

 

Why a Microservice?

We decided early on to implement this feature as a microservice, rather than as an additional component inside the core CMS application. As we move toward a more service-oriented architecture, new features in the product are often considered as candidates for standalone services. With screenshot generation, there were two obvious driving factors:

  1. The ability to scale this part of the system independently was important, as requirements and load were likely to change over time.

  2. This feature was fairly well decoupled from the rest of the system and wouldn’t depend on any internal models.

Design

Selenium

Another early decision was to use Selenium with Firefox to capture the screenshot images. After some prior work in this area, we concluded that the only way to reliably generate accurate screenshot images, as seen in a real browser, was to use a real browser to render them. We also concluded that the Selenium framework was a solid solution for browser automation and could be deployed easily using the docker-selenium environment.

Messaging

We’d so far established that the service would use Selenium to generate screenshot images as instructed by the Squarespace CMS application. The next design decision was about the interface with the rest of the system.

All of the existing services in our ecosystem were RESTful, processing HTTP requests synchronously and with low latency. Processing a screenshot request, however, requires an entire browser session. This means far greater CPU and memory load and higher latency by several orders of magnitude.

With this in mind, the best approach to keep load under control was to allocate a fixed number of processing nodes (Selenium Node instances), and have each node manage its own workload by pulling new requests from a queue.

This would therefore become our first asynchronous, message-based microservice.
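To make the pull model concrete, here is a minimal Python sketch of a fixed pool of workers draining a shared queue. It uses an in-process `queue.Queue` as a stand-in for the real message queue, and `NUM_WORKERS`, `process`, and `run` are hypothetical names chosen for illustration, not part of the actual service.

```python
import queue
import threading

NUM_WORKERS = 4  # assumed fixed pool size, for illustration only


def process(request):
    """Stand-in for the slow browser work; here it just echoes the URL."""
    return f"screenshot of {request['url']}"


def worker(requests, results):
    # Each worker pulls its next request only when it is free, so the
    # number of concurrent jobs never exceeds the number of workers.
    while True:
        request = requests.get()
        if request is None:  # sentinel: no more work for this worker
            requests.task_done()
            break
        results.append(process(request))
        requests.task_done()


def run(urls):
    requests = queue.Queue()
    results = []
    threads = [
        threading.Thread(target=worker, args=(requests, results))
        for _ in range(NUM_WORKERS)
    ]
    for thread in threads:
        thread.start()
    for url in urls:
        requests.put({"url": url})
    for _ in threads:
        requests.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

The point of the design is that a burst of requests lengthens the queue instead of increasing the number of browser sessions running at once.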

Update Logic

Another unanswered question was how we should update the screenshots—what should trigger a request and how frequently it should happen. We wanted screenshots to be as recent as possible, but an update frequency that was too high would have caused the request queue to grow out of control.

The ideal system would process requests in a reasonable amount of time, making heavier use of the queue as a buffer during peak periods.

This ended up being something of a balancing act. We had to choose an update policy that kept screenshots fresh, and provision enough hardware in the service to handle the average rate of requests.

After some analysis of user event data, the policy we decided on was to schedule screenshot updates on demand, as requested by the account-picker front end. This was coupled with some throttling logic to limit the number of updates in a given period of time.
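The exact throttling rule is not spelled out here, so the following Python sketch assumes one simple variant: at most one scheduled update per website within a minimum interval. The class name `UpdateThrottle` and the one-hour interval in the usage note are assumptions for illustration.

```python
class UpdateThrottle:
    """Allow at most one screenshot update per website per interval."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self.last_scheduled = {}  # websiteId -> time of last scheduled update

    def should_schedule(self, website_id, now):
        """Return True and record the time if an update may go out now."""
        last = self.last_scheduled.get(website_id)
        if last is not None and now - last < self.min_interval:
            return False  # requested too soon after the previous update
        self.last_scheduled[website_id] = now
        return True
```

With `UpdateThrottle(3600)`, a second request for the same website 100 seconds after the first is dropped, while a request for a different website is still scheduled.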

Implementation

The screenshot flow is as follows:

User requests are processed by the Site Server (Squarespace web server) application. If Site Server decides that a particular website needs its screenshot refreshed, it updates the corresponding ScreenshotData document in the MongoDB application database.

Aux server—our system for running batch jobs—periodically fetches all websites marked for an update from the database, and feeds them through some scheduler logic which controls the throttling. This results in screenshot requests going out on the Kafka request queue.

On the other end of the request queue are the service nodes. Each node runs an instance of Selenium Hub, together with multiple instances of Selenium Node (each one runs a single browser session).

Screenshot service architecture

 

In the service nodes, screenshot requests are processed in the following way:

  1. Validate request parameters: the website URL, the image dimensions, and the browser dimensions.

  2. Send browser resize, load page, and save screenshot commands to Selenium.

  3. Resize the screenshot image according to the requested image dimensions.

  4. Write the image data to NFS using the storageId of the request.
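Steps 2 to 4 above can be sketched in Python as follows. The three driver methods (`set_window_size`, `get`, `get_screenshot_as_png`) exist on Selenium's Python WebDriver, but everything else is an assumption: `FakeDriver` is a stand-in so the sketch runs without a browser, `WINDOW_HEIGHT` is an invented constant, image resizing is delegated to a caller-supplied function, and a local directory stands in for NFS.

```python
from pathlib import Path

WINDOW_HEIGHT = 1024  # assumed; the request only carries a viewport width


class FakeDriver:
    """Stand-in exposing the three WebDriver methods used below."""

    def __init__(self):
        self.commands = []

    def set_window_size(self, width, height):
        self.commands.append(("resize", width, height))

    def get(self, url):
        self.commands.append(("load", url))

    def get_screenshot_as_png(self):
        self.commands.append(("screenshot",))
        return b"fake-png-bytes"


def process_request(driver, request, resize_image, storage_root):
    """Run steps 2 to 4 for an already validated request dict."""
    dims = request["dimensions"]
    image_dims = dims["imageDimensions"]

    # Step 2: resize the browser, load the page, save the screenshot.
    # Note: set_window_size sets the outer window, not the exact viewport.
    driver.set_window_size(dims["viewportWidth"], WINDOW_HEIGHT)
    driver.get(request["url"])
    png = driver.get_screenshot_as_png()

    # Step 3: scale the image to the requested dimensions.
    image = resize_image(png, image_dims["width"], image_dims["height"])

    # Step 4: write the image under the request's storageId.
    path = Path(storage_root) / request["storageId"]
    path.write_bytes(image)
    return path
```

Passing a real Selenium driver and an image library's resize function in place of the stand-ins would keep the same call order.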

When screenshot images are requested by the front end, Site Server fetches the ScreenshotData document by websiteId, and uses the storageId to read the file from NFS.

{
  "id": "string",
  "storageId": "string",
  "websiteId": "string",
  "url": "string",
  "dimensions": {
    "viewportWidth": "integer",
    "imageDimensions": {
      "width": "integer",
      "height": "integer"
    }
  }
}

Screenshot request JSON schema
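As an illustration of the validation step, here is a minimal Python sketch that parses a request with the fields above and rejects malformed input. The field names come from the schema; the `ScreenshotRequest` class, the `parse_request` function, and the specific checks (non-empty strings, positive integers, an http or https URL) are assumptions, not the service's actual rules.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenshotRequest:
    id: str
    storage_id: str
    website_id: str
    url: str
    viewport_width: int
    image_width: int
    image_height: int


def parse_request(payload):
    """Parse a JSON request string; raise ValueError if it is malformed."""
    try:
        data = json.loads(payload)  # invalid JSON raises a ValueError subclass
        dims = data["dimensions"]
        image = dims["imageDimensions"]
        request = ScreenshotRequest(
            id=data["id"],
            storage_id=data["storageId"],
            website_id=data["websiteId"],
            url=data["url"],
            viewport_width=dims["viewportWidth"],
            image_width=image["width"],
            image_height=image["height"],
        )
    except (KeyError, TypeError) as exc:
        raise ValueError(f"malformed screenshot request: {exc!r}") from exc

    for name in ("id", "storage_id", "website_id", "url"):
        value = getattr(request, name)
        if not isinstance(value, str) or not value:
            raise ValueError(f"{name} must be a non-empty string")
    for name in ("viewport_width", "image_width", "image_height"):
        value = getattr(request, name)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer")
    if not request.url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    return request
```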

 

Monitoring

Various metrics are recorded and sent to our Graphite cluster for analysis of performance and system health. These include screenshot processing rate, failure/retry rate, as well as system-level metrics to track CPU load, memory, and disk space.
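For readers unfamiliar with Graphite, its plaintext protocol accepts one metric per line in the form `path value timestamp`, by default on TCP port 2003. The Python sketch below formats and sends such lines; the metric name in the usage note and the function names are made up for illustration and are not the metrics the service actually reports.

```python
import socket


def format_metric(path, value, timestamp):
    """Build one line of Graphite's plaintext protocol."""
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host, port=2003):
    """Send already formatted metric lines to a Graphite (Carbon) listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))
```

A caller might build a line with `format_metric("screenshots.processed", 130, time.time())` and pass a batch of such lines to `send_metrics` once per reporting interval.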

 

Scale

We currently run four screenshot service nodes in our production environment, with each node running six Selenium instances, giving a total of 24 available browser sessions. At peak times, we process around 130 screenshots per minute.
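As a rough back-of-the-envelope check on those numbers, the short Python calculation below derives the implied time budget per screenshot. It assumes the load is spread evenly and that every session is busy at peak, so the result is an upper bound on the average time per screenshot, not a measured latency.

```python
NODES = 4
SESSIONS_PER_NODE = 6
PEAK_SCREENSHOTS_PER_MINUTE = 130

sessions = NODES * SESSIONS_PER_NODE
# Screenshots each browser session must complete per minute at peak.
per_session_per_minute = PEAK_SCREENSHOTS_PER_MINUTE / sessions
# Average wall-clock budget per screenshot if every session is fully busy.
seconds_per_screenshot = 60 / per_session_per_minute

print(f"{sessions} sessions, {seconds_per_screenshot:.1f} s per screenshot")
```

Under these assumptions each session has roughly 11 seconds per screenshot, which covers resizing the browser, loading the page, waiting for it to settle, and saving the image.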

Lessons Learned: Quality over Quantity

The task of automating screenshot capture for web pages is not as simple as it first appears. There are unknown variables such as the content of the page, and the point at which you can be sure that content is fully loaded (this is difficult even with our template-based sites). Add to this the fact that, as we discovered, the Selenium stack is complex and somewhat unpredictable under load, and you have a system that is potentially fragile.

Exercising caution and prioritizing stability over performance was key. For us, this meant allocating sufficient resources so that the browsers weren’t starved of CPU cycles, and allowing pages time to settle before grabbing the screenshot. Careful configuration of Selenium timeouts, along with retry mechanisms for failed requests, was also necessary.
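One way to combine a settle delay with retries is sketched below in Python. The function name, the default delays, and the attempt count are assumptions chosen for illustration; the values used in production are not stated here.

```python
import time


def capture_with_retry(load_page, take_screenshot, *, settle_seconds=2.0,
                       max_attempts=3, retry_delay=1.0, sleep=time.sleep):
    """Load a page, let it settle, and capture it, retrying on failure."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            load_page()
            sleep(settle_seconds)  # give late-loading content time to render
            return take_screenshot()
        except Exception as exc:  # e.g. a page-load timeout from the browser
            last_error = exc
            if attempt < max_attempts:
                sleep(retry_delay)
    raise RuntimeError(
        f"screenshot failed after {max_attempts} attempts"
    ) from last_error
```

The `sleep` parameter is injectable so the logic can be exercised in tests without real delays.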

One More Thing

If you’re interested in the world of microservices and would like to join our team, we’re hiring!
