Why You Should Consider Data Contracts for Data Ingestions

In an ideal world, data engineers are presented with a source for ingesting data (the Extract and Load steps in ELT) from source systems that's up to the task. Sometimes that's a Kafka Topic they can subscribe to; other times it's an API built specifically for bulk extraction of data. However, we don't always live in an ideal world.

When data must be extracted from a system that was not intended for such an operation, it’s best to establish a data contract between the data engineering team building the ingestion and the owner of the system from which data is being extracted.

I write about data contracts in my latest book, Data Pipelines Pocket Reference (O’Reilly, 2021). Here’s an excerpt defining a data contract:

A data contract is a written agreement between the owner of a source system and the team ingesting data from that system for use in a data pipeline. The contract should state what data is being extracted, via what method (full, incremental), how often, as well as who (person, team) are the contacts for both the source system and the ingestion.

— Data Pipelines Pocket Reference, by James Densmore (O’Reilly 2021)

An Example – The Order Database

In reality, data is often ingested directly from a source that was not intended for use by the analytics team. Take for example the need to extract data pertaining to orders placed on an e-commerce site. The application developers built the backend of the website and store information about each order placed in a MySQL database. They also created an API for other internal services to talk to the “order” system, but it was not built to extract large amounts of order data in bulk.

The organization needs a dashboard to report on order data (joined up with data from other systems), so it makes sense to ingest the data into a data warehouse and transform it for analysis. With time and developer resources running short, a data engineer finds themselves facing a non-ideal choice. The API doesn’t make bulk loads easy, and has no way to incrementally extract order data based on a LastUpdated timestamp. It’s built for getting information about individual orders. There’s also the possibility of querying the MySQL application database directly. At least the LastUpdated timestamp is available for incremental loading, but reading directly from an application database doesn’t feel right.
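The LastUpdated-based incremental load the data engineer is weighing can be sketched as follows. This is a minimal illustration using an in-memory SQLite database as a stand-in for the MySQL orders database; the table layout and column names are hypothetical.

```python
import sqlite3

# Stand-in for the MySQL orders database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, total REAL, last_updated TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 19.99, "2021-01-01 10:00:00"),
    (2, 5.00,  "2021-01-02 09:30:00"),
    (3, 42.50, "2021-01-03 14:15:00"),
])

def extract_incremental(conn, last_watermark):
    """Pull only rows modified since the previous run's watermark."""
    rows = conn.execute(
        "SELECT order_id, total, last_updated FROM orders "
        "WHERE last_updated > ? ORDER BY last_updated",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark to the newest row we saw, if any.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Rows 2 and 3 are newer than the stored watermark, so only they are pulled.
rows, watermark = extract_incremental(conn, "2021-01-01 23:59:59")
```

Note how the whole approach hinges on last_updated being reliably maintained by the source system, which is exactly the kind of assumption a contract should surface.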

By ingesting from either the API or MySQL database, the data engineer has created a dependency on a source system that the system owner never intended or accounted for in their design. This can create a number of problems down the road. For example:

  • The order system gets an upgrade, and the table structure changes. An ingestion directly from the MySQL database may break due to the schema changes.
  • Because the API is not built for bulk extraction of data, an ingestion from it may put strain on the source system when it runs. This may cause disruption and even customer-facing performance issues.
  • The data engineer’s assumption that the LastUpdated timestamp is always modified when minor changes are made to records in one of the MySQL tables may be a poor one. Perhaps it’s not a column that matters all that much to the application, and thus no one thought to ensure its accuracy. However, it sure does matter to the quality of data ingested incrementally!
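The first risk above, a silent schema change, is one that a contract-driven check can catch early. Here is a minimal sketch (the column lists are hypothetical) comparing the columns an ingestion depends on against the live table's columns:

```python
def detect_schema_drift(contracted_columns, live_columns):
    """Return columns the ingestion depends on that no longer exist in the source."""
    return sorted(set(contracted_columns) - set(live_columns))

# Hypothetical: columns recorded alongside the contract vs. the table today,
# after the source team renamed "total" to "order_total" in an upgrade.
contracted = ["order_id", "customer_id", "total", "LastUpdated"]
live = ["order_id", "customer_id", "order_total", "LastUpdated"]

drift = detect_schema_drift(contracted, live)  # columns the ingestion would miss
```

In practice the live column list would come from the database's information schema; the point is that a recorded agreement gives you something concrete to diff against.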

Establishing a Data Contract

Instead of simply extracting the data from the source system on their own, a data engineer should work with the source system team to establish a data contract. The contract can be implemented in many forms, but I prefer something that is well structured and stored in a system that is both easily discoverable within the organization and able to be integrated into the development process programmatically.

For example, a data contract can be written in JSON form and stored in a GitHub repo. Here’s what one might look like for the ingestion of data from the orders table of the MySQL database discussed earlier.

 {
   "ingestion_jobid": "orders_mysql",
   "source_host": "ordersdb.internal.company.com",
   "source_db": "ordersystem",
   "source_table": "orders",
   "ingestion_type": "incremental",
   "ingestion_frequency_minutes": "120",
   "source_owner": "order-team@company.com",
   "ingestion_owner": "data-eng@company.com"
 }

The properties of your data contract may vary. For example, instead of an ingestion frequency you may want to store a cron schedule or another form of schedule notation. The point here is to document the fact that the ingestion exists, what system it impacts, how it impacts that system (frequency, volume, etc.), and clear ownership of both the source system as well as the ingestion job.
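Because the contract is structured, it is straightforward to validate programmatically. Here is a minimal sketch using the field names from the example above; the required-field set and allowed ingestion types are assumptions your team would define for itself.

```python
import json

# Required fields, following the example contract above. The exact schema
# is an assumption and will vary by team.
REQUIRED_FIELDS = {
    "ingestion_jobid", "source_host", "source_db", "source_table",
    "ingestion_type", "ingestion_frequency_minutes",
    "source_owner", "ingestion_owner",
}

def validate_contract(raw_json):
    """Parse a contract document and fail loudly if it is incomplete."""
    contract = json.loads(raw_json)
    missing = REQUIRED_FIELDS - contract.keys()
    if missing:
        raise ValueError(f"contract missing fields: {sorted(missing)}")
    if contract["ingestion_type"] not in {"full", "incremental"}:
        raise ValueError("ingestion_type must be 'full' or 'incremental'")
    return contract

contract = validate_contract("""
{
  "ingestion_jobid": "orders_mysql",
  "source_host": "ordersdb.internal.company.com",
  "source_db": "ordersystem",
  "source_table": "orders",
  "ingestion_type": "incremental",
  "ingestion_frequency_minutes": "120",
  "source_owner": "order-team@company.com",
  "ingestion_owner": "data-eng@company.com"
}
""")
```

A check like this can run in CI against every contract in the repo, so a malformed or incomplete contract never merges.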

As simple as the contract looks, it does take some communication between the data engineer and source system owner to agree upon it. The benefit of a data contract is not only transparency but also cross-team collaboration. Don’t assume the source system owner will agree to your request to pull data in the form or frequency you wish. However, approaching the conversation with a contract to fill out is a great way to keep the conversation focused on the “why” and to collaborate on the best solution. If that solution is a more ideal method of ingesting data (say perhaps a Kafka Topic you can subscribe to), then even better!

Putting the Contract to Work

Once a contract is established there are a number of things you can do with it, such as:

  • Build a Git hook to monitor changes to the source table/API/etc. associated with the ingestion and alert both the system owner and the data engineering team of a potential issue at the point of a new pull request.
  • Create a web app to make contracts searchable and human readable. You can go one step further and add properties to each contract to store more robust documentation of the ingestions.
  • Automatically send reminders to source system and ingestion owners to ensure each ingestion is kept up to date. You can do this on a fixed schedule, or by looking for contracts that haven’t been updated in a while. Just like documentation, don’t assume people will remember to keep it up to date!
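The third idea, staleness reminders, can be driven by a review date stored in each contract. This sketch assumes a hypothetical last_reviewed field added to the contract schema; the 90-day threshold is likewise just an example.

```python
import datetime

def stale_contracts(contracts, today, max_age_days=90):
    """Return job IDs for contracts not reviewed within max_age_days."""
    stale = []
    for contract in contracts:
        reviewed = datetime.date.fromisoformat(contract["last_reviewed"])
        if (today - reviewed).days > max_age_days:
            stale.append(contract["ingestion_jobid"])
    return stale

# "last_reviewed" is a hypothetical field added to each contract document.
contracts = [
    {"ingestion_jobid": "orders_mysql", "last_reviewed": "2021-01-15"},
    {"ingestion_jobid": "users_api", "last_reviewed": "2021-06-01"},
]
stale = stale_contracts(contracts, today=datetime.date(2021, 6, 10))
```

A scheduled job could run this daily and email the source_owner and ingestion_owner addresses recorded in each stale contract.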

Should All Ingestions Have Contracts?

While creating contracts for ingestions that extract directly from a source system not purpose-built for such an operation should be the priority, there’s also value in creating contracts for ALL data ingestions. Taking the earlier example of ingesting order data, perhaps the source system team built an API specifically for use by the data engineering team. The API was built to handle large volumes of data, extracted either in full or incrementally. Because the source system team built it specifically for extractions, they are also aware of what the data engineering team is doing and will thus be more likely to stay in touch as issues arise or changes to the system are planned. In that case, what’s the point of having a data contract?

Consider the following scenarios:

  • Team member turnover on the source system team means that the current team may not remember the specifics of the ingestion you built months or years ago. Even if an ingestion-specific endpoint was built for your data needs, it’s important to ensure it’s kept up to date.
  • Having a full inventory, and documentation, of all ingestions comes in handy. It’s nice to have a single view of all ingestions and their properties in one place.
  • Consistency! If you choose to implement dependency checks as part of your development process, you’ll thank yourself for having all ingestions contracted in a machine-readable format.
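As a sketch of the inventory idea, contracts kept as JSON files in a repo fold into a single view with a few lines. The file contents here are hypothetical stand-ins for contract documents read from disk.

```python
import json

# Hypothetical contract documents, as they might be read from files in a repo.
contract_files = [
    '{"ingestion_jobid": "orders_mysql", "source_table": "orders", "ingestion_type": "incremental"}',
    '{"ingestion_jobid": "users_api", "source_table": "users", "ingestion_type": "full"}',
]

# One view of every ingestion, keyed by job ID.
inventory = {c["ingestion_jobid"]: c for c in map(json.loads, contract_files)}

# With everything in one place, simple questions become one-liners.
incremental_jobs = sorted(
    job for job, c in inventory.items() if c["ingestion_type"] == "incremental"
)
```

The same inventory could feed the searchable web app mentioned earlier, or a dashboard of ingestion ownership.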

So in short, yes! Though it’s most important to create contracts for ingestions extracting directly from source systems, there is value in creating contracts for ingestions leveraging bulk extract APIs, Kafka Topics, and other purpose-built sources.