Every day at Squarespace, we work relentlessly to empower millions of customers to share their stories with an elegant, beautiful, and easy-to-create online presence. We have done a pretty good job with this mission so far, and unsurprisingly, our platform, engineer count, and customer base have experienced tremendous growth. Our growth has forced us to think hard about the systems and architecture that we have in place and their sustainability as we continue to grow.
Over the last three years, we have begun to make investments in architectural changes, including the breakup of our monolith into smaller, well-defined service boundaries. In this blog post, I will touch on a challenge that the Core Services Team faced: defining the pillars and core functionality of a service.
An Organizational Play
Ok, first let’s get the obligatory Conway’s law mention out of the way. According to the law, a software system produced by an organization will tend to resemble the communication structures and social boundaries of the organization. I personally am a bit tired of hearing this pitch, but the reasoning is indeed legitimate, and I’ve experienced it firsthand over the past three years. As engineering headcount grows, so too do required coordination, blurriness of ownership, and awkwardness of collaboration timing when pushing major features to production. As a result of these difficulties, we’ve recognized an organizational need to align teams around business areas as opposed to skill sets. So, instead of front-end and back-end teams, we now have account teams, commerce teams, a domains team, etc.
These teams are structurally empowered to move quicker, decide locally, and see their product to production with little need for coordination with the rest of the engineering organization. That has become the perfect recipe for service growth, and we are starting to receive dividends for the investments we made in building out a core service framework three years ago.
Freedom versus Standardization
With this newfound autonomy comes a question that is the real focus of this post: which aspects must be standardized across services? And conversely, what level of technical freedom should individual teams have?
One of the many advantages of a microservice-based architecture is autonomy and the possibility of a polyglot system. Service owners are free to choose from different database systems or programming languages in the implementation of their service. While this advantage might seem utopian, what does it look like in production at scale? When things go wrong that require participation from many different teams, can we still be efficient if we do not have basic standardizations in place and are not speaking the same language? If one team produces a different set of system metrics from another, are we really in a better position to quickly triage and respond to production events? How do we analyze logs if the commerce team prefers an ELK stack, while our content team prefers Splunk? This unpredictability can be especially harmful given our current support structure, which consists of a centralized Site Reliability Engineering (SRE) team. In this environment, it becomes crucial that proper standardization is in place to combat production issues. As service owners move to fully own their own support, we can naturally allow more freedom.
Furthermore, how do we structure our services to allow for tech experimentation and avoid enforcing technology choices that may not make sense for certain teams? Also, how do we limit the amount of exposure a core library has that is dispersed throughout the service ecosystem? We wanted to maintain the city planning analogy eloquently presented by Sam Newman in his book Building Microservices:
So our architects as town planners need to set direction in broad strokes, and only get involved in being highly specific about implementation detail in limited cases. They need to ensure that the system is fit for purpose now, but also a platform for the future. And they need to ensure that it is a system that makes users and developers equally happy . . . . As architects, we need to worry much less about what happens inside the zone than what happens between the zones.1
We wanted service owners to be free to decorate the interiors of their house as they pleased, but at the same time not be put in a position to disrupt the overall service ecosystem at Squarespace. These fundamental questions led us down the road we are on now and equipped us to define the core pillars of a good service at Squarespace.
With the decomposition of applications and processes into precise, smaller services comes an explosion of running services. With all these running services, it’s important to have an efficient way to determine where they are running. In its simplest form, service discovery is the process by which services register their own availability. Consumers can then tap into the directory of services to figure out how many services are running and which ones are running where.
Service discovery is also elastic. Registration of running services is dynamic, and if a service is brought down, it should remove its registration from the directory so that consumers know it’s out of commission. Our current service discovery abstractions are built on top of Hashicorp Consul, which abstracts the work needed to discover or register a service away from the service owner.
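The register, deregister, and lookup cycle described above can be sketched in a few lines. This is a toy in-memory directory written for illustration only; it is not Consul's API and not our service core. The `ServiceRegistry` class, its method names, the addresses, and the heartbeat expiry (`ttl_seconds`) are all assumptions made for the example.

```python
import time


class ServiceRegistry:
    """Toy in-memory directory (hypothetical); a real setup would delegate to Consul."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._instances = {}  # (service name, address) -> time of last heartbeat

    def register(self, name, address):
        """Announce that an instance is available; calling again acts as a heartbeat."""
        self._instances[(name, address)] = self._clock()

    def deregister(self, name, address):
        """Remove an instance that is being brought down."""
        self._instances.pop((name, address), None)

    def lookup(self, name):
        """Return the addresses of instances whose heartbeat has not expired."""
        now = self._clock()
        return sorted(
            address
            for (service, address), seen in self._instances.items()
            if service == name and now - seen <= self._ttl
        )


registry = ServiceRegistry()
registry.register("commerce", "10.0.0.1:8080")
registry.register("commerce", "10.0.0.2:8080")
registry.deregister("commerce", "10.0.0.1:8080")  # instance shut down cleanly
print(registry.lookup("commerce"))  # ['10.0.0.2:8080']
```

The expiry is there only so that an instance that crashes without deregistering eventually drops out of the directory; the assumed ten-second window is arbitrary.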
As more services are added, it becomes essential that we have both high-level and service-level views of our system’s health. Consistency in how service-level metrics are emitted and displayed is equally important. In our service core, we abstract the work required to discover key common metrics to report on, such as discovering user-defined HTTP endpoints and reporting on their latency, error rates, and more. For our service client, we do the same for client-side metrics. And to wrap it all together, our service core ensures that dashboards are automatically created and kept in sync.
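As a rough illustration of what "consistent service-level metrics" can mean in practice, the sketch below wraps a handler so that every call reports latency and error counts under its endpoint name. The `instrumented` decorator, the `METRICS` store, and the `GET /orders` endpoint are hypothetical names invented for this example; our service core discovers endpoints automatically rather than requiring a decorator.

```python
import time
from collections import defaultdict
from functools import wraps

# endpoint name -> {"calls": int, "errors": int, "latencies": [seconds]}
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "latencies": []})


def instrumented(endpoint):
    """Wrap a handler so that every call records its latency and outcome."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            stats = METRICS[endpoint]
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            except Exception:
                stats["errors"] += 1
                raise
            finally:
                # Runs on both success and failure.
                stats["calls"] += 1
                stats["latencies"].append(time.perf_counter() - start)
        return wrapper
    return decorator


def error_rate(endpoint):
    """Fraction of calls to the endpoint that raised an exception."""
    stats = METRICS[endpoint]
    return stats["errors"] / stats["calls"] if stats["calls"] else 0.0


@instrumented("GET /orders")  # hypothetical endpoint
def get_orders(fail=False):
    if fail:
        raise RuntimeError("downstream unavailable")
    return ["order-1"]
```

Because every service would emit the same three figures under the same names, a dashboard can be generated from the endpoint list alone.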
Understanding the capabilities of a service and its exported API is crucial for cross-organizational knowledge sharing. Having a consistent place to visit to retrieve detailed information about a service’s endpoints is a luxury that comes for free in our service core. All services use Swagger to detail endpoint information. This has proved to be an effective tool for sharing the purpose of each service and quickly testing endpoints locally.
With these defined and centralized pillars based on the software the Core Services team maintains, we are able to provide a common dashboard that serves as the service home base. The dashboard contains a service inventory, individual service health with relevant graphs, distributed traces, service dependencies, and more.
Addressing Fan Out
An inevitable added complexity introduced by microservice architecture is “fan out,” where it becomes difficult to predict or control the number of downstream service calls spawned from one user request or schedule-based activity. The following diagram depicts a simple user request in a monolithic architecture versus potential fan out when users’ requests span multiple service boundaries.
The following pillars all work to alleviate, if not eliminate, the issues that surface due to fan out.
Asynchronous Request Execution
Fan out can have a significant impact on overall latency if not addressed. By default, most common HTTP clients execute requests synchronously. The implication of this is an elevated response time due to the serial execution of calls to downstream services. In our above example, serial execution in the microservice-based system would result in a latency of A + B + C + D + Z.
Asynchronous request execution can solve the latency issue by allowing for concurrent execution. Concurrency can be achieved anywhere there is not a direct data dependency between two calls. Using our example above and assuming there are no data dependencies other than the lines depicted in the drawing, our overall latency with concurrent execution would be max(A, Z), where A would be max(B, C, D) plus A’s internal processing time.
To achieve asynchronous request execution we use Reactive Extensions/RxJava with RxNetty. RxJava also provides powerful mechanisms to combine and manipulate asynchronous computation, helping developers to avoid callback-laden spaghetti code.
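The latency arithmetic above can be checked with a small simulation. The sketch uses Python's asyncio rather than RxJava, purely to keep the example self-contained. It assumes the topology the formulas imply: the request calls A and Z, and A in turn calls B, C, and D. The per-service times in `LATENCY` are invented numbers, and here the value for A stands for A's internal processing time only.

```python
import asyncio
import time

# Hypothetical processing times in seconds; "A" is A's internal time only.
LATENCY = {"A": 0.02, "B": 0.05, "C": 0.03, "D": 0.04, "Z": 0.06}


async def call(service):
    """Stand-in for a non-blocking HTTP call to a leaf service."""
    await asyncio.sleep(LATENCY[service])
    return service


async def call_a():
    # B, C, and D have no data dependencies on each other, so issue them together.
    await asyncio.gather(call("B"), call("C"), call("D"))
    await asyncio.sleep(LATENCY["A"])  # A's own processing
    return "A"


async def handle_request():
    # A and Z are independent as well.
    return await asyncio.gather(call_a(), call("Z"))


def expected_serial():
    """A + B + C + D + Z."""
    return sum(LATENCY.values())


def expected_concurrent():
    """max(A, Z), where A is max(B, C, D) plus A's internal processing time."""
    a_total = max(LATENCY["B"], LATENCY["C"], LATENCY["D"]) + LATENCY["A"]
    return max(a_total, LATENCY["Z"])


if __name__ == "__main__":
    start = time.perf_counter()
    asyncio.run(handle_request())
    elapsed = time.perf_counter() - start
    print(f"serial estimate:     {expected_serial():.2f}s")
    print(f"concurrent estimate: {expected_concurrent():.2f}s")
    print(f"measured concurrent: {elapsed:.2f}s")
```

With these assumed numbers the serial estimate is 0.20s and the concurrent estimate is 0.07s; the measured figure should land near the latter, plus a little event-loop overhead.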
Structured / Contextual Logging
As user features and requests are spread across multiple services, it’s important to have a standardized logging mechanism and even more important to track the user request through each downstream service. This can typically be done by creating a GUID-based request-context ID that is used as a breadcrumb to search and analyze logs. Once this breadcrumb is passed along with each service call, we can use it to find all log entries that participated in one user’s request across multiple services.
All service logging is abstracted at our core, and this consistency allows us to participate seamlessly in a log aggregation and reporting framework.
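A minimal sketch of the breadcrumb idea, using only the Python standard library: a context variable holds the request-context ID, a logging filter stamps it onto every record, and the ID is handed to the next service in a header. The header name `X-Request-Context-Id`, the logger names, and the helper functions are assumptions for illustration and do not reflect our actual logging abstraction.

```python
import contextvars
import io
import logging
import uuid

# Holds the request-context ID of the request currently being handled.
request_id = contextvars.ContextVar("request_id", default="-")


class RequestContextFilter(logging.Filter):
    """Stamp every log record with the current request-context ID."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


def build_logger(name, stream):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(name)s request_id=%(request_id)s %(message)s")
    )
    handler.addFilter(RequestContextFilter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def handle(headers, logger):
    """Reuse the caller's ID if present, else mint one; return headers for downstream calls."""
    rid = headers.get("X-Request-Context-Id") or str(uuid.uuid4())
    token = request_id.set(rid)
    try:
        logger.info("handling request")
        return {"X-Request-Context-Id": rid}
    finally:
        request_id.reset(token)


out = io.StringIO()
edge = build_logger("edge", out)
commerce = build_logger("commerce", out)
downstream_headers = handle({}, edge)   # first hop mints the ID
handle(downstream_headers, commerce)    # second hop reuses it
print(out.getvalue())
```

Both log lines carry the same `request_id=` value, so a single search in the log aggregator returns every entry for that user request.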
Another way to take advantage of this request-scoped context ID is distributed tracing. By propagating this ID across the various service calls that participate in serving a request, we can construct a detailed trace showing the time and duration of these events. This insight can be valuable in spotting both inefficiencies (e.g. suboptimal database queries) and operational problems (e.g. network issues). It also sets us up well to perform automated tasks such as automated canary rollbacks on bad production releases.
We use Zipkin to implement the storage back end, querying components, and UI of our tracing framework, while the instrumentation side of things is done in-house.
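To make the propagation concrete, here is a hedged sketch of what an instrumentation layer records: each unit of work becomes a span that carries the shared trace ID, its own span ID, its parent's span ID, and a duration. The `span` helper, the `SPANS` list, and the checkout steps are hypothetical; this is neither our in-house instrumentation nor Zipkin's client API, and a real system would ship finished spans to a collector instead of a list.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # finished spans, in completion order


@contextmanager
def span(name, trace_id=None, parent_id=None):
    """Time a unit of work and record it under the request's trace ID."""
    record = {
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": name,
    }
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration"] = time.perf_counter() - start
        SPANS.append(record)


def handle_checkout():
    """Hypothetical request that makes two downstream calls."""
    with span("checkout") as root:
        for step in ("inventory", "payment"):
            # Propagate the trace ID and the parent span ID to each downstream call.
            with span(step, trace_id=root["trace_id"], parent_id=root["span_id"]):
                time.sleep(0.01)  # stand-in for the remote call
    return root["trace_id"]
```

Grouping the recorded spans by `trace_id` and nesting them by `parent_id` yields the timeline view that a tracing UI displays.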
Circuit breaking is a key concept in microservice-based architectures. With microservices, it is critical that misbehaving or high-latency services do not bring down or affect the larger system. A user whose request spans a large number of service dependencies should not have their experience degraded by the failure of one service.
The general idea with circuit breakers is that services should fail fast and that latency should not cascade to all the connected dependencies. A service shouldn’t die a slow death, which usually results in wasted resources, backed-up requests, and potentially the collapse of the entire system. Instead, services should be graceful and resilient through their use of isolation and fallback mechanisms. A slow response is much worse than no response, as application threads can block, back up, and exhaust the entire system.
Our common service client relies heavily on Netflix’s wonderful Hystrix project.
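The fail-fast and fallback behavior can be illustrated with a deliberately small breaker. This is a sketch of the general pattern and not Hystrix's API or implementation; the `CircuitBreaker` class, its parameters, and the default thresholds are assumptions chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback=None):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on an unhealthy dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip, or re-trip after a failed trial
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_inventory, fallback=lambda: cached_inventory)`, where both names are placeholders. Once the breaker is open, the fallback is returned immediately and the struggling service receives no further traffic until the timeout elapses.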
Today, three years after our first service was plucked out of our monolith, we continue to make great strides in redefining an architecture that can scale with our growth. Undoubtedly, we will learn about our pillars, add new ones, adjust existing ones, and continue to grow. We also plan to continue developing our core while carefully balancing the forces of high risk and pervasive functionality spreading through our service ecosystem.
The Core Services team is always on the lookout for talent that can help us grow
and improve our service framework. If you’re interested in joining our team,
1. Sam Newman. Building Microservices. O’Reilly, 2015. p. 16.