
Data Traceability and Lineage

In the data engineering team at Squarespace, we are concerned first and foremost with exposing the company's internal data to the rest of the organization. We embrace the challenges related to ingesting, storing, and restructuring data to make it accessible and primed for analysis to help drive the business forward.

One of the primary challenges of this role is answering for the correctness of the data. Data scientists, business operations, finance, and marketing teams are all working off our curated data stores to make critical decisions about the direction of the business, and there will inevitably come a time when somebody looks at a report and asks: "How do we know these numbers are correct?" In that moment it can be difficult to assert with confidence that everything is rock solid. Even if the data is all in the right place, how can we support our case that the results we've produced are consistent with reality?

We've confronted this challenge by introspecting about what tends to improve confidence in our output, and we've found that it comes down to a set of practices that aim to make traceability and data quality assurance key parts of our systems. In this post I'll give an overview of three techniques that we've found especially useful.

Make unwritten rules explicit by turning them into data validation guarantees. Whenever we get a "Funny-Looking Data" bug report, we use it as an opportunity to clarify an explicit expectation of the data set in question. Say a data scientist expects to see at least 50 rows per day in a table but she only sees 6 rows for last week. We would then build a monitor to verify this expectation, for example by scheduling a script that routinely counts rows and raises alerts. Often, these assertions can be enforced consistently as database constraints (e.g., non-nullable columns, foreign key constraints).
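To make the idea concrete, here is a minimal sketch of such a monitor, assuming a hypothetical daily_signups table with a created_date column in a SQLite-style warehouse; the table name, threshold, and alerting behavior are illustrative rather than our actual setup.

```python
# Minimal sketch of a scheduled row-count check. The table, database file,
# and threshold are hypothetical; a real monitor would alert rather than raise.
import sqlite3
from datetime import date, timedelta

MIN_ROWS_PER_DAY = 50  # the explicit expectation we want to enforce


def check_daily_row_count(conn: sqlite3.Connection, day: date) -> None:
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM daily_signups WHERE created_date = ?",
        (day.isoformat(),),
    ).fetchone()
    if count < MIN_ROWS_PER_DAY:
        # In practice this would page someone or post to a dashboard;
        # raising keeps the example self-contained.
        raise AssertionError(
            f"Expected at least {MIN_ROWS_PER_DAY} rows for {day}, found {count}"
        )


if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # hypothetical reporting database
    check_daily_row_count(conn, date.today() - timedelta(days=1))
```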

These validations effectively create something of a unit test suite for your database, where test successes reinforce confidence and test failures allow us to stay in front of downstream errors. Maybe we even delay the generation of a report if we know the results are going to be off.

Prioritize clarity and traceability in the ingestion and processing architecture. Data warehousing is all about collecting and organizing historical data, but it is a thorny issue when old reporting data is discovered to be incorrect and there isn't any way to trace back how it got that way. Maybe the code that transformed the data has since changed several times, the source data was updated at some point, or the person who worked on the project has since left.

To combat this we try to limit the statefulness of our processing pipelines. For example, one extreme technique involves reloading all of history frequently, rather than processing incremental batches, to ensure that all current results reflect the latest code and source data. If this is unrealistic then we keep raw source data on-hand and/or retain intermediate stages of processing, and we include pointers in each final reporting row that refer back to the source data element, for example with a foreign key to a source row identifier. (See diagram.) Additionally, we log all transformation script executions with metadata (e.g., start time, execution parameters) and a unique id, and each row loaded into a reporting table contains a pointer to the corresponding script execution log entry.

Diagram of a reporting table entry which includes a reference to the source data element.
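Here is a minimal sketch of that breadcrumb trail, assuming hypothetical etl_runs and reporting_orders tables; the schema, column names, and script name are illustrative, not our production layout. Each run of a transformation script gets a unique id and a log entry, and every reporting row carries both a pointer to its source row and a pointer to the run that produced it.

```python
# Sketch of run logging plus per-row lineage pointers. All table and column
# names are hypothetical examples of the pattern described above.
import json
import sqlite3
import uuid
from datetime import datetime, timezone


def log_run(conn: sqlite3.Connection, script_name: str, params: dict) -> str:
    """Record one transformation-script execution and return its unique id."""
    run_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO etl_runs (run_id, script_name, started_at, params) "
        "VALUES (?, ?, ?, ?)",
        (
            run_id,
            script_name,
            datetime.now(timezone.utc).isoformat(),
            json.dumps(params),
        ),
    )
    return run_id


def load_reporting_row(
    conn: sqlite3.Connection, run_id: str, source_row_id: int, amount: float
) -> None:
    """Each reporting row points back to its source row and to the run that produced it."""
    conn.execute(
        "INSERT INTO reporting_orders (source_row_id, etl_run_id, amount) "
        "VALUES (?, ?, ?)",
        (source_row_id, run_id, amount),
    )


if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # hypothetical reporting database
    run_id = log_run(conn, "orders_rollup.py", {"reload_window_days": 7})
    load_reporting_row(conn, run_id, source_row_id=42, amount=19.99)
    conn.commit()
```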


This trail of breadcrumbs becomes invaluable when investigating data anomalies or bugs in the ingestion and transformation systems. Lastly, we log every manual query used to backfill or fix data. The ability to easily trace how each data point got from point A to point B is critical for taking some of the mystery out of how reports are generated.

Know who is reading from and writing to your data stores. If there are many people with access to the database and hundreds of data transformation scripts in your code base, it is easy to lose track of who is touching a table. Multiple interdependent scripts may be modifying the same table, which makes reasoning about its state difficult. Worse still, an individual user can modify data unannounced.

All of this speaks to the point above about prioritizing clarity and traceability. It is critical to understand the flow of data to find any potential logjams or lynchpins in the system.

We give each program and user that touches the database a distinct username so we can track their queries. Additionally, we monitor the logs, and we have extended the database to log (user, operation, table, column) tuples so we can see exactly who performs an INSERT, UPDATE, DELETE, or SELECT on each column. From this output we can generate a network graph of data producers and consumers to see how tables are getting populated and consumed, which helps expose any surprises or complexities.
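As a rough illustration, those audit tuples can be folded into a producer/consumer view with a few lines of code; the tuple format matches the description above, while the sample users and table names are made up.

```python
# Sketch of building a producer/consumer map from (user, operation, table, column)
# audit-log tuples. Sample data below is invented for illustration.
from collections import defaultdict

WRITE_OPS = {"INSERT", "UPDATE", "DELETE"}


def build_lineage_graph(audit_rows):
    """Return {table: {"producers": set of users, "consumers": set of users}}."""
    graph = defaultdict(lambda: {"producers": set(), "consumers": set()})
    for user, operation, table, _column in audit_rows:
        role = "producers" if operation in WRITE_OPS else "consumers"
        graph[table][role].add(user)
    return graph


if __name__ == "__main__":
    sample_log = [
        ("orders_rollup_job", "INSERT", "reporting_orders", "amount"),
        ("finance_dashboard", "SELECT", "reporting_orders", "amount"),
        ("jane_analyst", "SELECT", "reporting_orders", "source_row_id"),
    ]
    for table, edges in build_lineage_graph(sample_log).items():
        print(
            table,
            "written by", sorted(edges["producers"]),
            "read by", sorted(edges["consumers"]),
        )
```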

Enabling our organization to access company data and quickly deliver insights is our mission. As new reporting requirements come in and code gets tweaked, things can spiral quickly into confusion. Data traceability is a key part of our strategy to clear things up, both for ourselves and for our colleagues. We're always looking for new ways to inject traceability and transparency into our data delivery, and we could always use more help, so if you are looking for a place to explore these challenges and many more, we're hiring!
