If you don’t yet have a data science team, you may have moments of panic, feeling that you’re falling behind. It’s true: data science and machine learning aren’t fads. I’ve seen them create real value in organizations. However, I’ve also seen data scientists sit idle or work on “cool” but ultimately low-value projects because the organization wasn’t ready to support them. Why does that happen, and how can you avoid the same mistakes?
Three reasons stand out:
- Lack of a data infrastructure to support data science activities
- No one in leadership with experience leading data science projects
- No clear vision as to what value data science will deliver to the organization
I’ll address the second and third reasons in future posts, but for now I’ll focus on the first.
What is Data Engineering?
Organizations have more data than ever before: more volume, in more places, and of more types. For example, most organizations have data stored in SQL databases, log files, third-party CRM systems like Salesforce, and so on. Data engineers are tasked with moving all that data around and ensuring it can provide the organization with the necessary value.
You are probably familiar with one key role of a data engineer – building and maintaining a data warehouse. In fact, that’s nothing new and in the past it was done by people with the job title “Business Intelligence Engineer”, “Data Warehouse Engineer” and so on. What’s this new Data Engineer title all about?
As the volume and complexity of data increased, it became more technically challenging to centralize data in a warehouse or data lake. There is also the matter of moving data around for other purposes such as feeding real-time reporting in applications and powering data science and machine learning models. As those needs arose, so did a more technical role in an organization – the data engineer.
Though it’s a broad role, most data engineers spend their days focused solely on these activities and not things like building a web application or even analyzing data. They are the plumbers, road builders, and iron workers of the data world. They ensure that the infrastructure is in place to make use of data and not just collect it.
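To make the plumbing analogy a little more concrete, here is a minimal sketch of moving data from two places into one. Everything in it is an assumption for illustration: the log format, the CRM field names, the `events` table, and the helper names `parse_log_line` and `load_events` are all hypothetical, and an in-memory SQLite database stands in for a real warehouse or data lake.

```python
import sqlite3

# Hypothetical source data standing in for an application log and a CRM export.
LOG_LINES = [
    "2024-01-05 user=17 action=login",
    "2024-01-05 user=42 action=purchase",
]
CRM_ROWS = [
    {"date": "2024-01-06", "contact_id": 17, "event": "support_call"},
]


def parse_log_line(line):
    """Turn 'DATE user=ID action=NAME' into a (date, user_id, event, source) row."""
    date, user_part, action_part = line.split()
    user_id = int(user_part.split("=")[1])
    event = action_part.split("=")[1]
    return (date, user_id, event, "app_log")


def load_events(conn, log_lines, crm_rows):
    """Copy events from both sources into one table, keeping every raw row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(event_date TEXT, user_id INTEGER, event TEXT, source TEXT)"
    )
    rows = [parse_log_line(line) for line in log_lines]
    rows += [(r["date"], r["contact_id"], r["event"], "crm") for r in crm_rows]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# An in-memory database is an assumption; a real setup would target a warehouse.
conn = sqlite3.connect(":memory:")
loaded = load_events(conn, LOG_LINES, CRM_ROWS)

# Once the data lives in one place, one query can span both sources.
user_17 = conn.execute(
    "SELECT event, source FROM events WHERE user_id = 17 ORDER BY event_date"
).fetchall()
```

A real pipeline also has to handle scheduling, failures, schema changes, and much larger volumes, which is a large part of why the role exists.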
How do Data Engineers Support Data Scientists?
Not surprisingly, data scientists can’t do much without data. When organizations make their first data science hire, they often fool themselves into believing that because they have lots of data, their new data scientist will be off and running. This mistake comes in two levels.
Level 1 is an easy mistake to make, and thus fairly common. At this level, an organization not only collects lots of data but has also built a data warehouse, reports, and dashboards. It does not, however, have data engineers to build anything more complex.
This is a great place to start, but data scientists need something else – granular, and often very raw, data. A data warehouse that’s built to support traditional data analysis contains aggregated (not granular) and processed (cleaned and structured) data. In a matter of days or weeks you’ll be getting questions from the data scientist about very specific data points that you didn’t think to collect, or did collect but haven’t made readily available.
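The gap between aggregated and granular data is easy to see in a small sketch. The events, field names, and the two helper functions below are hypothetical, made up purely for illustration. The point is that a per-user feature can be computed from raw events but not from a daily total.

```python
from collections import defaultdict

# Hypothetical raw purchase events: one row per purchase.
RAW_EVENTS = [
    {"date": "2024-01-05", "user_id": 17, "amount": 20.0},
    {"date": "2024-01-05", "user_id": 42, "amount": 5.0},
    {"date": "2024-01-05", "user_id": 17, "amount": 15.0},
    {"date": "2024-01-06", "user_id": 42, "amount": 10.0},
]


def daily_revenue(events):
    """What a reporting table often stores: one total per day."""
    totals = defaultdict(float)
    for event in events:
        totals[event["date"]] += event["amount"]
    return dict(totals)


def purchases_per_user(events):
    """A feature a model might need: how many purchases each user made."""
    counts = defaultdict(int)
    for event in events:
        counts[event["user_id"]] += 1
    return dict(counts)


warehouse_view = daily_revenue(RAW_EVENTS)
user_features = purchases_per_user(RAW_EVENTS)
```

Here `warehouse_view` has no `user_id` left in it, so `user_features` cannot be rebuilt from it. If only the daily totals were kept, the data scientist is stuck until someone makes the raw events available.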
Level 2 is less common but bigger trouble. This is when there is no data warehouse or other analysis-specific data infrastructure in place. For example, you might have a production application database or a CRM, but no infrastructure dedicated to data analysis at all.
In either case, your new data scientist will quickly hit a wall. How do they get the raw, granular data they need to experiment and build models? Where will they run and deploy their models? How will the results of their work mesh with your production application?
As you can imagine, that’s where a data engineer comes in. In every successful data science project I’ve been a part of, a strong partnership existed between data engineers and data scientists. Furthermore, it’s easier for the two roles to move projects forward when at least the foundations of a data infrastructure are in place before a data scientist walks in the door.
Do I Really Need Dedicated Data Engineers?
Some organizations see a dedicated data engineer as an unnecessary expense and fill the role by borrowing time from existing software engineers. In an early startup, this might be done out of necessity, but in most cases it’s a recipe for failure. Why? In addition to the engineer not necessarily having expertise in data engineering, you’ll often face a choice between the engineer working on a production system (their primary job) and supporting your internal data needs. Given that your customers only see what’s already in production (which isn’t anything from your data science team yet), guess which one wins out when it’s crunch time?
Maybe you’ve seen the statistic that 85% of data science projects fail. I personally think that’s a bit overstated, but nonetheless the failure rate of any experimental activity is going to be higher than conservative organizations are comfortable with. Risk can be managed, though, and hiring a data engineer who is dedicated to your data infrastructure is a great start.