Over the past 9 months, I’ve been working on a book to be published by O’Reilly Media. This past week, Data Pipelines Pocket Reference was officially published in print, e-book, and on O’Reilly.com for subscribers of the O’Reilly platform. It’s exciting, and a little nerve-racking, to get it out there and in the hands of folks learning more about building data pipelines for analytics.
Given that “data pipelines” is a pretty broad topic and the O’Reilly Pocket Reference format is somewhat condensed, I wanted to write a short post with a little more context on what’s in the book and who I think might find it valuable.
Before I do, I want to say a special thank you to the entire O’Reilly team for giving me the opportunity to write the book and providing a ton of value along the way. It’s a topic for another post, but despite the temptation to explore self-publishing, working with an established publisher was without a doubt the right move for me.
What’s in the Book
O’Reilly came to me with a general idea for the book, but it was on me to come up with a specific proposal. I decided to pitch the book as an end-to-end journey through a data pipeline built for analytics. By that I mean the world of data engineering, data warehousing, and analytics engineering. There’s a lot of great content out there on machine learning and data science, but not nearly enough focused on the mechanics of how data teams gather data and turn it into insights.
The book is structured around the ELT pattern. ELT has overtaken ETL as the dominant design of modern data pipelines thanks to cloud data warehouses and a flood of supporting tools for data ingestion, data orchestration, and data modeling.
You can browse the entire table of contents here, but the 10 chapters cover the following major aspects of building data pipelines for analytics from left to right (E->L->T) in the pipeline:
- An overview of the ELT pattern (and EtLT sub-pattern) and why it’s best suited to modern pipeline building.
- Data ingestion (the “EL” in ELT). Lots of code samples here for extracting data from common sources such as Postgres, MySQL, REST APIs, MongoDB, and more, as well as loading data into cloud data warehouses such as Snowflake and Redshift.
- Data transformation and modeling (the “T” in ELT). Lots of SQL in this chapter for transforming all the data that’s been ingested into a warehouse into data models designed to answer business questions.
- Data orchestration: DAGs, Apache Airflow, and more! Orchestration ties together all the interconnected processes in a pipeline in a reliable and scalable way.
- Validating data in pipelines. I include a sample data validation framework (written in Python) along with a handful of validation test examples you may want to consider.
- More advanced topics such as maintaining and scaling pipelines as well as monitoring and measuring the performance of pipelines.
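To make the E→L→T flow above concrete, here’s a miniature sketch of the pattern. This is my own illustrative example, not code from the book, and it uses Python’s built-in sqlite3 as a stand-in for both the source database (which would be Postgres, MySQL, etc. in practice) and the cloud data warehouse:

```python
import csv
import io
import sqlite3

# Stand-in "source" database (sqlite3 in place of Postgres/MySQL)
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 100.0), (2, "acme", 50.0), (3, "globex", 75.0)],
)

# Extract: pull rows out of the source into a flat file (CSV here)
buf = io.StringIO()
writer = csv.writer(buf)
for row in source.execute("SELECT id, customer, amount FROM orders"):
    writer.writerow(row)

# Load: copy the raw rows into the "warehouse" untransformed
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders_raw (id INTEGER, customer TEXT, amount REAL)"
)
buf.seek(0)
warehouse.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", csv.reader(buf))

# Transform: model the raw data inside the warehouse with SQL,
# producing a table designed to answer a business question
warehouse.execute(
    """CREATE TABLE customer_revenue AS
       SELECT customer, SUM(amount) AS revenue
       FROM orders_raw
       GROUP BY customer"""
)
for row in warehouse.execute("SELECT * FROM customer_revenue ORDER BY customer"):
    print(row)
```

The key point of ELT is visible even at this scale: the load step dumps raw data into the warehouse as-is, and all the modeling happens afterward, in SQL, where the warehouse’s compute can do the heavy lifting.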
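On the orchestration bullet: Airflow itself is too big to sketch here, but the core idea it is built on, running each task only after its upstream dependencies finish, can be shown in plain Python with the standard library’s graphlib. The task names below are hypothetical, and this is my simplified illustration of the DAG concept, not Airflow’s API:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# A hypothetical pipeline: extraction must finish before loading,
# loading before transformation, and transformation before validation --
# the same dependency shape an Airflow DAG declares.
dag = {
    "extract_orders": set(),                     # no upstream dependencies
    "load_orders": {"extract_orders"},
    "transform_orders": {"load_orders"},
    "validate_orders": {"transform_orders"},
}

# Resolve the dependencies into a valid execution order
run_order = list(TopologicalSorter(dag).static_order())
print(run_order)
```

An orchestrator like Airflow adds scheduling, retries, parallel execution of independent tasks, and monitoring on top of this basic dependency-resolution idea.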
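And on validation: one of the simplest and most useful checks after an ingestion runs is comparing row counts between source and destination. Here’s a sketch of that kind of test, with sqlite3 standing in for the real databases; the function name and structure are my own illustration, not the book’s exact framework:

```python
import sqlite3

def validate_row_counts(source_conn, dest_conn, source_table, dest_table):
    """Return True if the destination received as many rows as the source holds.

    A basic sanity check of the kind you'd run right after an ingestion job.
    """
    src = source_conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    dst = dest_conn.execute(f"SELECT COUNT(*) FROM {dest_table}").fetchone()[0]
    return src == dst

# Quick check: a source with 3 rows, a warehouse that only received 2
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER)")
src.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE orders_raw (id INTEGER)")
wh.executemany("INSERT INTO orders_raw VALUES (?)", [(1,), (2,)])  # one row missing

print(validate_row_counts(src, wh, "orders", "orders_raw"))  # prints False
```

In a real pipeline, a failed check like this would typically halt downstream transformations or fire an alert rather than just print a boolean.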
Why a “Pocket Reference”?
Those familiar with the O’Reilly “Pocket Reference” series might expect more of a reference sheet than a book with narrative, but Data Pipelines Pocket Reference flows more like a traditional O’Reilly book in my opinion (and the opinion of some early readers). The difference is that it’s condensed and covers a lot of ground for a short book. That means that by design some sections go deep, but many stay higher level and aim to give the reader a jumping-off point for learning more. In fact, I link out to other books and online resources throughout.
However, sticking to the Pocket Reference roots, I intend for the book to be “reference-able” in the sense that you could read individual chapters or sections to solve a particular problem rather than having to read cover-to-cover. Personally, I like to think people will at least skim the book in its entirety and spend more time on the chapters they need depth from. I guess I’ll find out soon enough!
Who’s it For?
In pitching the book, I had to come up with what I thought the target market would be. I thought a lot about that, and in reading the final version I think I did a reasonable job of catering to that audience.
First, I wanted a book that data engineers working in the data warehousing space could relate to. As noted earlier, there’s no shortage of content on building pipelines and systems for machine learning. I wanted something for data engineers who are building data ingestions and orchestrating pipelines for Snowflake, Redshift, BigQuery, and other cloud data warehouses. Though some of the sections regarding data ingestion might be a bit basic for seasoned data engineers, my hope is that other sections will provide new ideas and insights and even expose data engineers to the work that analytics engineers are doing in data transformation.
Second, I thought there was a need for a book targeted at analytics engineers and data analysts who are looking to get exposed to the end-to-end aspects of data pipelines. For this audience, the data transformation sections might seem basic (though I tried to add some more advanced tidbits!), but learning about data ingestion, orchestration, and validation is an opportunity to expand an analyst’s skill set.
Third, with the number of data engineering roles exploding in recent years, I’ve seen many software engineers looking to specialize in the space. This book is a great way for someone with an existing software engineering background to learn what this data engineering thing is all about. As someone who started as a software engineer myself, I remember knowing how to code and write some basic SQL, but it took me years to understand the world of analytics. This book would have been a valuable crash course for me back then, and I hope it is for someone like me breaking in now!
Finally, leaders in the data space who want to get up to speed on what a modern data infrastructure looks like will find value in the book. I know a number of Director and VP level leaders who expressed interest in the book early on. Many are technical, but either more generally in software engineering or a number of years removed from data warehousing best practices.