Data Engineers are a hot commodity in 2020, but it’s surprising how misunderstood they are. Are they software engineers with a hyped-up job title? Are they more technical data scientists? Maybe people who can take requests for custom data pulls and reports?
The truth is, they are none of the above!
I have a hard time nailing down the definition of anything in a single sentence, but I challenged myself to describe a data engineer’s purpose in just that. Here goes!
“Data Engineers build data pipelines to deliver timely and accurate data to data lakes and data warehouses.”
Think about all the data that an organization owns and has access to. Web logs, customer data, sales transactions, and more. Think about where that’s stored – SQL databases, CSV files, document databases, Salesforce, Hubspot, Mailchimp, etc. When it comes time to make sense of what’s going on in your business, you need to bring all of that data together and make it consumable by data analysts, data scientists, business intelligence tools and those glossy dashboards that your CEO likes.
It takes an entire data team to derive value from raw data, but it’s up to the data engineer to get it from each source and deliver it to a destination in a form that’s accessible by the rest of the team. Such a task is called data pipelining, and it’s more complex than the “old days” of simply extracting and loading a few tables from a SQL database.
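To make "get it from each source and deliver it to a destination" a bit more concrete, here is a minimal sketch of a pipeline in Python. Everything in it is an assumption for illustration: the CSV text stands in for a source system, an in-memory SQLite database stands in for a warehouse, and the names `extract`, `transform`, `load` and `run_pipeline` are hypothetical. Real pipelines deal with far more sources, volume and failure modes than this.

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV export of sales transactions (made-up data).
RAW_CSV = """order_id,customer,amount
1001,acme,250.00
1002,globex,99.50
1003,acme,
"""


def extract(raw_text):
    """Read rows from the CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Cast types and drop rows that are missing an amount."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records rather than load bad data
        cleaned.append(
            (int(row["order_id"]), row["customer"], float(row["amount"]))
        )
    return cleaned


def load(conn, records):
    """Write the cleaned records to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    # INSERT OR REPLACE keeps the load safe to re-run for the same orders.
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(raw_text, conn):
    """Run extract, transform and load; return the number of rows loaded."""
    records = transform(extract(raw_text))
    load(conn, records)
    return len(records)


# An in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(":memory:")
loaded = run_pipeline(RAW_CSV, conn)
```

Even in a toy like this, the interesting decisions are the ones a data engineer makes every day: what to do with incomplete rows, and how to make a load safe to run twice.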
Organizations are facing data volumes and complexity that seemed impossible just 5-10 years ago. Though the term “big data” doesn’t get tossed around as much anymore, we’re living in a world where such volume, variety and velocity of data are the new normal. During that same time period, the data engineer emerged as a defined role on a data team to wrangle all the data in an efficient way.
A data engineer’s purpose isn’t simply to get the data to a lake or warehouse and walk away, however. The “timely and accurate data” portion of my definition is key to their success. In a highly functioning data team, the data engineer works closely with data scientists and analysts to understand what will be done with the data. The ultimate purpose drives decisions such as how often the data needs to be delivered, how long it must be retained, how granular it needs to be and more.
Finally, a data engineer takes pride in ensuring the validity of the data they deliver. That means testing, alerting and contingency plans for when something goes wrong. And yes, something will eventually go wrong!
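As a rough illustration of what "testing and alerting" can look like, the sketch below checks a loaded batch for row count, freshness and missing values. The function names `check_batch` and `alert`, the thresholds and the `amount` field are all hypothetical choices for this example, and the alert simply prints; a real team would likely route it to whatever paging or chat tool they already use.

```python
from datetime import datetime, timedelta, timezone


def check_batch(rows, loaded_at, now, min_rows=1, max_age=timedelta(hours=24)):
    """Return a list of problems found; an empty list means the batch passed."""
    problems = []
    # Volume check: an empty or tiny batch often means an upstream failure.
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(rows)}")
    # Freshness check: this is the "timely" half of timely and accurate.
    if now - loaded_at > max_age:
        problems.append(f"data is stale: last load at {loaded_at.isoformat()}")
    # Completeness check on a field the analysts depend on.
    null_amounts = sum(1 for row in rows if row.get("amount") is None)
    if null_amounts:
        problems.append(f"{null_amounts} row(s) have a null amount")
    return problems


def alert(problems):
    """Placeholder alert: print each problem (assumption: no real alerting tool)."""
    for problem in problems:
        print(f"ALERT: {problem}")


now = datetime(2020, 1, 2, tzinfo=timezone.utc)

# A fresh, complete batch passes every check.
good = check_batch([{"amount": 10.0}], now - timedelta(hours=1), now)

# A two-day-old batch with a missing amount fails two checks.
bad = check_batch([{"amount": None}], now - timedelta(days=2), now)
alert(bad)
```

The point is less the specific checks than the habit: every delivery gets verified, and a failure reaches a person before it reaches a dashboard.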
The skills of a data engineer will depend somewhat on the tech stack their organization uses, as well as the goals of the data team. However, there are some common skills I find across all data engineers.
- SQL and Data Warehouses – Not just basic querying skills, but how to write performant SQL and understand the fundamentals of data warehousing. Even if a data team has data warehousing specialists, a data engineer with warehousing fundamentals is a better partner and can fill more complex technical gaps that arise.
- Python and/or Java – The language will depend on the tech stack, but either way a data engineer isn’t going to get the job done with “no code” tools even if they have some good ones in their arsenal.
- Distributed Systems – It’s a broad term, and again the specifics vary by tech stack, but the reality is that with high volumes of data comes the need to work with distributed systems. This is true for data ingestion, storage and processing.
- Basic System Administration – In short, expect a data engineer to be proficient on the Linux command line and be able to do things like understand application logs, schedule cron jobs and troubleshoot firewall and other security settings. If you’re building on AWS, Azure or another cloud provider, they’ll end up being an expert in getting cloud services working together.
- A Business Goal Mentality – A good data engineer isn’t just a code monkey. They may not interface with business users on a regular basis, but the analysts and data scientists on the team certainly will. The data engineer will make better architectural decisions if they’re plugged into the end-goals.
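For a small taste of what "performant SQL" means in the first skill above, the sketch below asks SQLite how it plans to run the same query before and after an index is added. SQLite is only a stand-in here (an assumption for the sake of a runnable example), the `sales` table and `idx_sales_customer` index are made up, and a real warehouse will have its own tuning tools such as partitioning or clustering. The habit of reading the query plan carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(i, f"customer_{i % 100}", float(i)) for i in range(1000)],
)

QUERY = "SELECT * FROM sales WHERE customer = ?"


def query_plan(conn, sql, params):
    """Return the plan descriptions SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each plan row.
    return [row[-1] for row in rows]


# Without an index, SQLite has to scan the whole table.
plan_before = query_plan(conn, QUERY, ("customer_7",))

# With an index on the filtered column, it can search the index instead.
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer)")
plan_after = query_plan(conn, QUERY, ("customer_7",))

print(plan_before)
print(plan_after)
```

The exact wording of the plan varies by SQLite version, but the first plan should describe a scan of `sales` and the second should mention `idx_sales_customer`.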
What They Are Not & Why It Matters
There’s a saying that just because you can do something, it doesn’t mean you should. Oh, how true that is for data engineers.
For example, data engineers can write SQL. It’s very easy to fall into the trap of asking a data engineer to help with “just one quick data pull” to answer a question from a stakeholder. The next thing you know, they’re spending time building a pivot table or calculating metrics that end up being shared across the organization. Not only is there risk that the output will not match existing reports, but it’s highly likely that “just one quick data pull” will turn into “hey, remember that data you pulled last month? We need it again, with a few changes”.
Data engineers end up knowing the code base of the company’s production applications just enough to get the data they need, but not enough to jump in and help the production software engineers. Quite often a data scientist or data analyst on the team will want to know something like how the usage of a feature on the website is captured, or how the billing system works. Because they know the data engineer well, that’s the first person they ask. Once again, the data engineer can probably dive in and figure it out, but doing so will take time and focus. In reality, the production software engineers are probably the right source for an answer.
I could go on, but I think you get the point. Data engineers have valuable skills and tend to be a bridge between the data team and the Engineering organization. They have access to a lot of data and code. That’s a blessing and a curse, and there are a few points of risk in having them try to do it all.
- Burnout – If you slam a data engineer with data pull requests and questions that they don’t feel they have the time or knowledge to answer, they’ll burn out – fast.
- Miscommunication – There’s a reason why data teams have a process for incoming questions like “how much money did the company make last month?”. Believe it or not, even the simplest question is never simple to answer. If you don’t have a process you’ll end up with conflicting answers and lots of stress.
- Unseen Staffing Gaps – Having your data engineers do too much is an easy way to hide gaps on your team that you really should be filling with new hires. If you need more dashboards, hire more data analysts! Sure, there are times when everyone needs to chip in, but don’t make it a regular occurrence.
Be Explicit About Role Definitions
I’m not advocating for working within strict job descriptions, but it’s helpful to ensure you’re explicit about the role of each member of the data team. Data team leaders benefit by communicating the specifics of what a data engineer does internally to the team, externally to relevant business partners, and to the recruiting team. In fact, they might need to remind themselves when things get hectic!
There are reasons why the need for data engineers has come about. By understanding what a data engineer really is, you can recruit, retain and empower the best of them.
Cover Image Courtesy of https://commons.wikimedia.org/wiki/User:Ludwigs2