The Year in Data and Analytics – 2021

Is it that time already? Another year gone by? Well, what a year it has been. Every year at this time, I like to look back and reflect on what I’ve learned personally and what’s changed in our industry. I’ll spare you most of my reflections on life in general, but there’s plenty to review in the world of data and analytics.

Trends Leading Into 2021

In my review of 2020, I touched upon three trends that I thought would carry into 2021. Here’s what I wrote at this time last year:

The global acceptance of ELT over ETL: I’ve written about this over the last few years, but the benefits of working in an ELT paradigm rather than traditional ETL cannot be overstated. MPP, columnar databases like Redshift, Snowflake, and BigQuery made it possible. dbt is a great example of the power of separating out the Transform step to empower analytics engineers. Data teams that haven’t moved to ELT yet are going to feel the pressure build.

An explosion of “EL” frameworks: Just like dbt has planted a flag on the Transform step, I expect to see the emergence of a new breed of tools that make data ingestion (the Extract and Load steps in ELT) even easier for data engineers. Though commercial products like Fivetran and Stitch have seen success over the last few years, I’ve noticed new activity in the space from players including Airbyte. There are a number of smaller open source projects popping up as well, so expect one or two to stand out.

Real Competition for Snowflake: You don’t have an IPO like Snowflake did without generating more competition. Snowflake has pretty much stomped on Redshift over the past few years, and BigQuery seems to be stuck in the background at the moment. I expect both Amazon and Google to double down on their respective warehouse technologies in 2021 and try to take some ground back. Amazon gave it a shot with Redshift RA3 nodes in 2019, but I wouldn’t count out a more substantial architecture shift in the near future.

How did I do? It’s hard to score precisely, but the first two panned out while the third was a mixed bag. I clearly missed a few big developments as well, and I touch on those later in the post.

I know there’s still some pushback out there on the definition of “ELT”, and plenty of people claim it’s just a crafty rebranding of ETL. I couldn’t disagree more based on the real changes I’ve seen in how data organizations deliver and hire, but if the old way works for you, then by all means keep it going. Still, the continued success of dbt and the related rise of the analytics engineer have really cemented the separation of “EL” from “T” in both technical design and role definition at leading organizations around the world.
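
To make that separation concrete, here’s a minimal, illustrative sketch of the ELT pattern in plain Python: the raw extract is loaded into the warehouse untouched, and the transformation is nothing more than SQL run inside the warehouse, which is the step dbt formalizes. SQLite stands in for a cloud warehouse here, and the file and table names are made up.

```python
# Illustrative ELT sketch only: SQLite stands in for a cloud warehouse,
# and the file/table names are hypothetical.
import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")
cur = conn.cursor()

# "EL": land the raw extract as-is, with no reshaping on the way in
cur.execute(
    "CREATE TABLE IF NOT EXISTS raw_orders (id TEXT, amount TEXT, created_at TEXT)"
)
with open("orders_extract.csv") as f:
    rows = [(r["id"], r["amount"], r["created_at"]) for r in csv.DictReader(f)]
cur.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

# "T": the transform is just SQL run in the warehouse, the step dbt formalizes
cur.execute(
    """
    CREATE TABLE IF NOT EXISTS daily_revenue AS
    SELECT DATE(created_at) AS order_date,
           SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_orders
    GROUP BY DATE(created_at)
    """
)
conn.commit()
```

The load step knows nothing about the business logic; all of that lives in SQL that an analytics engineer can own and version.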

Related to ELT taking ground, “EL” frameworks (AKA “data ingestion” tools) made real progress in 2021. Fivetran, the old guard in the space at this point, raised a ton more money and was valued at over $5 billion back in September of this year. Airbyte, which I pointed out in last year’s review, had a breakout year as well, raising $26 million back in May. Not to be outdone, in June Meltano raised a seed round and spun out of GitLab to focus on its open source framework, which aligns closely with the Singer standard for data ingestion and integrates tightly with dbt. I’m sure I’ve missed some, and that speaks to the pace of change in the space. Even organizations that continue to build data ingestion in house have benefited from the shift to ELT, and I bet we’ll see more open source frameworks out in the public realm down the line as a byproduct.
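
For a sense of why these frameworks compose so well, here’s a rough sketch of the Singer pattern that Meltano builds on: a “tap” writes JSON messages (a SCHEMA followed by RECORDs) to stdout, and a Singer target reads them from stdin and loads them. The stream and fields below are invented for illustration.

```python
# A toy Singer-style "tap": emits SCHEMA and RECORD messages as JSON lines
# on stdout. The "users" stream and its fields are made up for illustration.
import json
import sys


def emit(message):
    sys.stdout.write(json.dumps(message) + "\n")


emit({
    "type": "SCHEMA",
    "stream": "users",
    "schema": {
        "type": "object",
        "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
    },
    "key_properties": ["id"],
})

for user in [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]:
    emit({"type": "RECORD", "stream": "users", "record": user})
```

In practice you’d pipe the tap’s stdout straight into whichever Singer target you use (and let Meltano manage the plumbing), but that message contract is most of what the standard asks for.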

As for Snowflake getting some more competition, there was less change on that front in 2021 than I expected. The major competitors released some incremental improvements, but nothing groundbreaking. ClickHouse, however, raised a massive $250 million and has started to pick up some big logos; they are one to keep an eye on for sure. So Snowflake kept its momentum, but I don’t think that kind of dominance lasts forever. It’s only a matter of time before competitors pick up noticeable market share, though I’ll stay away from predicting when that will happen!

2021 Breakouts – Data Observability and Data Org Design

In addition to some major trends I was focused on a year ago, there were a number of other developments in the industry that really picked up steam this past year.

“Data observability” is a term you may have heard in the past year or two, but if not, this post by Lior Gavish, co-founder of Monte Carlo, is worth a read. In particular, this definition:

At its simplest, data observability means maintaining a constant pulse of the health of your data systems by monitoring, tracking, and troubleshooting incidents to reduce — and eventually prevent — downtime.

Data observability shares the same core objective as application observability: minimal disruption to service. And both practices involve testing as well as monitoring and alerting when downtime occurs. That said, there are a few key differences between data and application observability.

Lior Gavish

Like the evolution of ETL to ELT, I’ve seen some skepticism that data observability is a buzzword rebrand of things data teams already do. However, I see it as the maturation of what data teams SHOULD be doing when it comes to monitoring their systems and data. Just as analytics engineers have taken on some of the properties of software engineers, I think it’s long past time that data teams treat their systems as “production” just like application developers do. In some organizations that’s been the case for a while, but I find that to be the exception. Even in organizations where panic sets in when a key dashboard is down, the discipline in data development, testing, and monitoring is nowhere near what that organization has in place for customer-facing services.

The uptick in data observability platforms gives me hope that data teams finally have the breathing room (and funding!) to invest in stability and trust in their systems. In addition to the aforementioned Monte Carlo, which raised its Series C in 2021, Great Expectations continues to build a following in the open source arena, and new players like Metaplane are popping up faster than I can keep track. Still, this isn’t just a future trend: 2021 was the year when the term “data observability” no longer needed explaining at every turn. Organizations are investing in these tools and practices right now.
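
To ground the term a bit, here’s an illustrative (and deliberately simplified) example of the kind of freshness and volume check these platforms automate; the table, thresholds, and alerting below are all hypothetical.

```python
# Simplified data observability check: is the data fresh, and did we get
# roughly the volume we expect? Table name and thresholds are hypothetical.
import datetime
import sqlite3

conn = sqlite3.connect("warehouse.db")
max_loaded_at, row_count = conn.execute(
    "SELECT MAX(loaded_at), COUNT(*) FROM raw_orders"
).fetchone()

alerts = []

# Volume check: a hypothetical minimum row count for a daily load
if row_count < 1000:
    alerts.append(f"raw_orders row count is unexpectedly low: {row_count}")

# Freshness check: assumes loaded_at is stored as an ISO 8601 UTC timestamp
latest = datetime.datetime.fromisoformat(max_loaded_at) if max_loaded_at else None
if latest is None or datetime.datetime.utcnow() - latest > datetime.timedelta(hours=24):
    alerts.append("raw_orders has not received new data in the last 24 hours")

for alert in alerts:
    print(f"ALERT: {alert}")  # in practice this would page someone or post to Slack
```

A real platform layers anomaly detection, lineage, and incident workflows on top, but the underlying questions are this simple: did the data arrive, and does it look the way it should?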

Continued Momentum – Data Orchestration and Data Communities

Last year I was pretty excited for Airflow 2.0. Not only did it come out and make a splash, but there was also continued progress across the industry in improving the way we orchestrate data pipelines. Dagster (also open source) is being adopted by some notable organizations and recently announced Dagster Cloud, a paid, managed offering that removes the need to host the tool on your own infrastructure. That gives Dagster some parity with Airflow, which is available fully managed from Astronomer and others.

The challenges of orchestrating data processes, especially in higher-scale organizations, continue to haunt data teams, so I’m delighted to see tools in this space mature and interconnect with each other (think Airflow or Dagster alongside dbt).
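
As a concrete example of that interconnection, here’s a minimal sketch of an Airflow 2.x DAG that runs an ingestion job and then hands the Transform step to dbt. The script path and dbt project location are hypothetical, and in practice you might reach for a provider package or a dedicated integration instead of plain BashOperator tasks.

```python
# Minimal Airflow 2.x DAG sketch: ingest raw data, then run dbt.
# Paths and the schedule are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="elt_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Extract + Load: whatever ingestion job you use (Airbyte, Meltano,
    # a Fivetran sync trigger, or an in-house script)
    extract_load = BashOperator(
        task_id="extract_load",
        bash_command="python /opt/pipelines/ingest_orders.py",
    )

    # Transform: hand the T step to dbt once the raw data has landed
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt_project && dbt run",
    )

    extract_load >> dbt_run
```

Swapping Dagster in for Airflow changes the API but not the idea: the orchestrator owns the ordering and scheduling, and dbt owns the transformations.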

Though not a technical trend, the growth of data communities continued this past year. I wrote about what they are and how they can help us all grow in the industry earlier in the year. I’ve personally learned so much from Slack communities, smaller conferences, and Twitter that it’s hard to give the movement towards communities enough credit. I know it’s in vendors’ best interests to build communities, and “community” is one of the hottest things in marketing for a reason. Still, I find that a lot of the teams building data tools and their associated communities are genuinely open to collaboration and networking beyond commercial interests.

Leading Indicators for 2022

Like always, I cringe at making any bold predictions about the future. Still, in addition to continued momentum in the areas discussed so far, there are three areas where I expect to see a lot of progress in 2022.

Operational Analytics (AKA “Reverse ETL”)

I’ll be honest: I’m not on board with the term “Reverse ETL”, even though I get where it’s coming from. I prefer “operational analytics”, but whatever you want to call it, there is growing momentum to put big kid pants on how data flows OUT of a data warehouse and into operational systems like CRMs, email workflows, and more.

Personally, I think it’s one of those things that’s going to happen in your organization at some point, and likely already is. The question is how you want to implement and govern it. Products such as Census, Hightouch, and Grouparoo made themselves well known in 2021. Given the valuation of Fivetran and the growth in the data ingestion tooling space overall, it’s no surprise that there’s value in building a similar platform with connectors going in the other direction. The thing I’m most interested to see is whether vendors doing other things in the data space (ingestion, orchestration, data observability, etc.) try to expand into this area, or whether these end up being truly separate products we add to our stacks.
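
To illustrate the shape of the problem (and not any particular vendor’s approach), here’s a simplified sketch of a reverse ETL sync: query a modeled table in the warehouse and push the results to an operational tool’s API. The warehouse connection, CRM endpoint, and field names are all hypothetical.

```python
# Simplified reverse ETL sketch: warehouse -> operational tool.
# The table, CRM endpoint, and token are hypothetical placeholders.
import sqlite3

import requests

conn = sqlite3.connect("warehouse.db")
rows = conn.execute(
    "SELECT customer_id, email, lifetime_value FROM customer_metrics"
).fetchall()

for customer_id, email, lifetime_value in rows:
    # A real sync would batch requests, respect rate limits, and only send
    # records that changed since the last run.
    requests.post(
        "https://crm.example.com/api/contacts",  # hypothetical CRM endpoint
        headers={"Authorization": "Bearer YOUR_API_TOKEN"},
        json={"id": customer_id, "email": email, "ltv": lifetime_value},
        timeout=10,
    )
```

The value the vendors add is in the unglamorous parts: batching, rate limiting, incremental syncs, field mapping, and retries, which is exactly where a pile of one-off scripts like this starts to become a governance headache.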

Data Organizational Design

Less technically challenging, but just as important as building a solid data infrastructure, is building a data organization that works for your business. Data teams come in all shapes and sizes, and they report up to different areas of a business: perhaps Finance, Engineering, IT, Marketing, or a dedicated data leader. They can also be centralized in a single org, or decentralized so that each part of the business has components of, or even full, data teams within it. What I’ve found interesting over the past few years is the move to find a balance with what I think of as “hybrid decentralized data orgs”. In such orgs, core aspects of running the data infrastructure, including data engineering, data governance and observability, and at times some core datasets in the data warehouse, sit with a central team. Each department can then staff up with analytics engineers and data analysts to truly own its data and collaborate on the same infrastructure with the other departments it shares data and knowledge with.

I lead a central data team in an organization (HubSpot) that does something of this nature, but there are countless variations on the same theme out there. The “Data Mesh” concept is one I’m still not sure I fully grasp, but there are aspects of it that fit the same movement of empowering those with domain knowledge of the data to be more self-sufficient. That I can get behind, and I bet I’m not the only one moving down the path of a more decentralized data org in 2022.

A New Generation of BI Platforms

Despite all of the improvements to data warehouses and other aspects of data infrastructure in the past few years, the layer where all that work is seen by business users has been moving at a slower pace. Products like Tableau, Looker, Power BI, Mode, and others still do the job, but there have been fewer new entrants or big splashes in recent years. I think that’s going to change, both in terms of new entrants and improvements from some entrenched players.

In August of 2021, Preset raised a $35.9 million Series B. That one really caught my eye for two reasons. First, it’s coming out of the open source community (Preset is built around Apache Superset) and was founded by Maxime Beauchemin, who also created Airflow. Second was the announcement that accompanied the fundraise: Preset Cloud. If you’ve read this far, you know I find managed (commercial) versions of open source tools, like dbt Cloud and Dagster Cloud, intriguing. Preset is doing the same, with an already slick product and a well-regarded founder.

In addition, we are just plain overdue for some innovation in this space. I expect some of our favorite existing tools to put out some exciting roadmaps in 2022. If not, there’s plenty of energy building to displace them.

A Personal Note

2021 included a first for me with the publication of my first book, Data Pipelines Pocket Reference. I received a lot of support from the data community, and found an even broader community in the process. I learned a lot from writing it and even more so from the feedback I got after the fact. Maybe all authors experience this, but the list of things I now wish I could have included in the book (or would approach a bit differently) is a mile long because of that feedback. A big thank you to those who supported my book and the work of other data authors in 2021!

I wish you all a happy and healthy holiday season, and big things in 2022!


Cover Image Credit: GDJ