Under the Hood: Ensuring Site Reliability

Squarespace hosts millions of websites on our cloud-based website-building platform. The reliability of these sites is a top priority for us. We’re consistently implementing the best technologies and safeguards to enable quick load times and prevent outages. The measures we take allow our customers to create Squarespace websites with confidence.

Read on to learn more (and nerd out) about all that goes into making sure every Squarespace site is reliable and running smoothly.

Powerful data centers

Our platform’s foundation is an enterprise-grade, redundant infrastructure maintained across multiple geographically distributed data centers. Each data center connects to the Internet through multiple service providers. Within each data center, we employ sophisticated, redundant network and server designs, with enough network bandwidth and server capacity to withstand high traffic volumes for any website on our platform. We embrace the concept of infrastructure as code: the provisioning and configuration of all of our data center components are fully automated using Ansible, and we carefully review infrastructure changes before applying them.

Additionally, we have disaster-recovery mechanisms in place that allow us to shift traffic to different data centers when necessary. Because a disaster can strike at any time, we replicate data such as website content and images in near real time to our other data centers. We regularly exercise the disaster-recovery steps to ensure that they function as expected, and we continuously iterate on the process to make it as reliable and automated as possible.

Failing over traffic between two data centers for one service

Large-scale software engine

In addition to the data centers, our server engineering teams build and maintain the large-scale software engine that serves millions of users worldwide. Some teams are responsible for scaling our database, caching, search, and queueing layers. Other teams are responsible for breaking up our main application into smaller and more reliable service boundaries by building out a microservices framework.

The server engineering teams also implement optimizations that apply to all Squarespace websites. All JavaScript, CSS, and images are delivered by multiple globally distributed content delivery networks (CDNs) to improve page-load times around the world. Internal caching layers and advanced web technologies, like HTTP/2, speed up page-load times as well.

Squarespace also offers top-of-the-line security to ensure the reliability of our websites. We automatically generate, configure, and renew SSL/TLS certificates for all of the domains on our platform. This keeps all connections to Squarespace websites secure. Squarespace websites are further protected by our custom-built web-application firewall, which blocks abusive traffic in real time.
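To illustrate one piece of the renewal workflow described above (a minimal sketch, not Squarespace's actual tooling), a certificate's `notAfter` timestamp can be checked against a renewal window to decide whether it is due for reissue. The function name and the 30-day window are assumptions for the example:

```python
import ssl
from datetime import datetime, timedelta, timezone

def needs_renewal(not_after: str, now: datetime, window_days: int = 30) -> bool:
    """Return True if the certificate expires within the renewal window.

    `not_after` uses the OpenSSL text format found in parsed certificates,
    e.g. "Jun 26 21:41:46 2030 GMT".
    """
    expiry_epoch = ssl.cert_time_to_seconds(not_after)  # interprets input as GMT
    expiry = datetime.fromtimestamp(expiry_epoch, tz=timezone.utc)
    return expiry - now < timedelta(days=window_days)

# A cert expiring in about a week falls inside the default 30-day window:
now = datetime(2030, 6, 20, tzinfo=timezone.utc)
print(needs_renewal("Jun 26 21:41:46 2030 GMT", now))                 # True
print(needs_renewal("Jun 26 21:41:46 2030 GMT", now, window_days=5))  # False
```

In practice a renewal system runs a check like this on a schedule and triggers reissuance well before expiry, so a transient failure leaves time to retry.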

Web application firewall blocking requests in real time

We safeguard websites from larger, more sophisticated distributed denial-of-service (DDoS) attacks using a combination of internal and external tools. Our engineers also proactively address application-security vulnerabilities, such as cross-site scripting (XSS) and flaws in third-party software libraries (for example, Heartbleed and ImageTragick).

Continuous deployment of software

We use Git repositories to host all of our code, and our build system automatically builds all developer branches. Our build system also runs tests on each branch to check for regressions on behavior covered by unit tests. Upon a successful build of a branch, a developer may deploy it to one of the many QA environments if extra testing or verification is required. Automated testing frameworks may also be used in QA environments to verify the production-readiness of a branch.

Once ready, a developer may merge their branch to the master branch. At this point, our build system kicks off a build of the master branch and deploys it to the staging environment. The deployment system automatically runs smoke-testing suites in staging and, if they're successful, deploys the code to production.
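The gating logic above can be sketched as a sequence of stages where any failure halts the pipeline before production. This is a simplified illustration; the stage names and callables are hypothetical stand-ins, not Squarespace's real deployment tooling:

```python
def run_pipeline(build, deploy_staging, smoke_tests, deploy_production):
    """Run each stage in order, stopping at the first failure.

    Each stage is a callable returning True on success.
    Returns a string describing where the pipeline stopped.
    """
    stages = [
        ("build", build),
        ("deploy_staging", deploy_staging),
        ("smoke_tests", smoke_tests),
        ("deploy_production", deploy_production),
    ]
    for name, stage in stages:
        if not stage():
            return f"failed at {name}"
    return "succeeded through deploy_production"

# A failing smoke test blocks the production deploy:
print(run_pipeline(lambda: True, lambda: True, lambda: False, lambda: True))
# failed at smoke_tests
```

The key property is that production is only ever reached after staging smoke tests pass, which is what makes fully automated deploys safe to run many times a day.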

On any given weekday, we deploy our main application between 5 and 12 times. These deployments contain smaller change sets, so it is easier to identify which code changes may have caused an issue if one occurs. With more frequent releases, engineers also experience a tighter feedback loop on the code they push out. Finally, depending on an automated deployment pipeline encourages the development of more robust automated testing and validation frameworks.

The majority of product outages occur almost immediately after new code is released. We balance product stability with rate of change by emphasizing testing. Our Site Reliability Engineering team provides multiple tools to our developers for monitoring purposes. We maintain large Graphite clusters for time-series data, ELK clusters for log aggregation, and Cassandra clusters for Zipkin distributed tracing spans. Every minute, over 14 million metrics are emitted to Graphite and 1 million log lines are ingested into ELK.
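Graphite's Carbon daemon accepts metrics over a simple plaintext protocol: one `path value timestamp` line per metric, conventionally on TCP port 2003. A minimal sketch of emitting a metric this way follows; the metric name is made up for illustration:

```python
import socket
import time

def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one metric in Graphite's plaintext protocol."""
    return f"{path} {value} {timestamp}\n"

def send_metric(host: str, path: str, value: float, port: int = 2003) -> None:
    """Send a single metric to a Carbon plaintext listener."""
    line = graphite_line(path, value, int(time.time()))
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("ascii"))

print(graphite_line("web.requests.p95_latency_ms", 123.4, 1700000000))
# web.requests.p95_latency_ms 123.4 1700000000
```

At 14 million metrics per minute, real senders batch many lines per connection rather than opening a socket per metric, but the wire format stays this simple.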

Monitoring 95th-percentile latencies using a Grafana dashboard

Monitoring and response

We monitor our data centers and our general system health around the clock. Our internal Graphite, ELK, and Zipkin monitoring tools are complemented by external monitoring systems that check product feature availability, infrastructure components, and network connectivity worldwide.

We've standardized on Sensu as our alerting framework; it integrates with all of our monitoring systems to notify the responsible team of engineers of any issues that come up. If an issue is potentially customer-impacting, we route alerts through PagerDuty to ensure swift response times and minimize any effect on our customers and their website visitors.
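Sensu checks follow the Nagios exit-status convention: a check script exits 0 for OK, 1 for WARNING, and 2 for CRITICAL, and Sensu routes the result accordingly. A minimal threshold check in that style might look like this (the thresholds and measurement are hypothetical):

```python
# Nagios-style statuses that Sensu interprets from a check's exit code.
OK, WARNING, CRITICAL = 0, 1, 2

def check_threshold(value: float, warn: float, crit: float) -> int:
    """Classify a measurement against warning/critical thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK

# e.g. disk usage at 93% with warn=80, crit=90 is CRITICAL:
print(check_threshold(93.0, warn=80.0, crit=90.0))  # 2
```

A real check script would gather the measurement itself and call `sys.exit()` with this status so Sensu can pick it up.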

A PagerDuty alert generated by Sensu posted to a Slack channel

In the rare case that an outage occurs, our Site Reliability Engineers are prepared to respond in different time zones across our New York, Portland, and Dublin offices. We equip them with runbooks, automation, and training, all of which reduce our mean time to repair (MTTR). We also post updates to our Status page and tweet from the @SquarespaceHelp handle. After any outage, the teams involved carry out postmortems to find root causes and make necessary corrections. Every Friday, we hold an organization-wide postmortem to review the team postmortems and share knowledge across teams.

Operational excellence

Squarespace teams, like any other engineering teams, have to balance time spent developing new software with time spent performing operational work. Our goal is to reduce the operational burden on our teams to allow them to spend more time developing features for our customers. Every month, our Site Reliability Engineering team reports on the operational excellence of our organization. The report includes data for platform uptime, page load latencies, PagerDuty incidents, postmortems, DDoS attacks, time spent on on-call duty, and the frequency and health of builds and deployments, among other metrics.
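Uptime, the headline metric in such a report, is simply the fraction of the reporting period the platform was available. As a quick worked example (an illustration, not the report's actual methodology), "four nines" over a 30-day month leaves very little room for downtime:

```python
def uptime_percent(downtime_minutes: float, total_minutes: float) -> float:
    """Uptime as a percentage of the reporting period."""
    return 100.0 * (1.0 - downtime_minutes / total_minutes)

# A 30-day month has 30 * 24 * 60 = 43,200 minutes, so 99.99% uptime
# ("four nines") allows only about 4.3 minutes of downtime:
month = 30 * 24 * 60
print(round(uptime_percent(4.32, month), 2))  # 99.99
```

Framing downtime as a fixed minutes-per-month budget makes it easy to see how much a single incident consumes.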

By reviewing our operational excellence on a monthly basis and making necessary adjustments, we improve the reliability of the product for our customers and empower our engineers to spend more time on software development and less on operations.

Monthly uptime metrics included in the Operational Excellence report

We take pride in our work to ensure that Squarespace is a reliable product for our customers around the world. If you're interested in these types of challenges, our team is hiring!
