Don't Underestimate Your Data Engineer

The data scientist may get the plaudits, but they can't do everything alone

16 Jan

The flood of data-related roles unleashed in recent years has the potential to cause untold confusion, particularly for senior executives inexperienced in the field. It is, however, necessary that organizations looking to join the upper echelons of data maturity stop trying to hire some 'renaissance data scientist' to collect, clean, and analyze the data all by themselves. It is tremendously rare to find any one person who embodies all the qualities companies are looking for, with demand for such so-called unicorns currently far outstripping supply. Organizations should be looking to employ a full complement of capabilities to manage the entire process, with specialists to collect data and business users empowered to translate it into insights that can improve the bottom line.

One data role growing in importance is the data engineer, which again made LinkedIn's list of most promising jobs in 2018. With a median base salary of $107,000 and a 35% year-on-year increase in the number of job openings to over 1,400, it's not hard to see why it has become such an inviting career path. More than this, though, it is no longer perceived as a water-carrier role, the Watson to the data scientist's Sherlock, doing the grunt work while the data scientist sweeps up the accolades. Data engineers are critical to a successful data journey, and companies are increasingly realizing they need one in place as early as possible.

The definition of exactly what a data engineer is may vary from company to company, but the core remains essentially the same. They are there to optimize the performance of their organization's big data ecosystem and put data scientists in the best possible position to do modeling. They keep the pipeline running by designing, building, and integrating the tools, infrastructure, frameworks, and services needed to properly collect and ingest batch and stream-oriented data. They clean the raw data, which will invariably contain errors, and check it for unformatted values and system-specific codes to make it usable. They may also run ETL (Extract, Transform, and Load) on top of big datasets, although they are not typically expected to know any machine learning or analytics for big data. They also act as advocates for new and better data, and develop a holistic approach for the rest of the organization that ensures data assets are protected and accessible.
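To make the cleaning and ETL steps above a little more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the sample CSV, the `STATUS_CODES` mapping, the `payments` table, and the function names are hypothetical, and a real pipeline would be built on whatever tools the organization already uses.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: stray whitespace, a system-specific status
# code ("A"/"I"), and one row whose amount is not a number.
RAW_CSV = """customer_id,status,amount
 101 ,A,19.99
102,I, 5.00
103,A,n/a
"""

# Assumed mapping from a source system's codes to readable labels.
STATUS_CODES = {"A": "active", "I": "inactive"}


def extract(text):
    """Extract: read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, decode status codes, drop invalid rows."""
    clean = []
    for row in rows:
        try:
            clean.append(
                (
                    int(row["customer_id"].strip()),
                    STATUS_CODES[row["status"].strip()],
                    float(row["amount"].strip()),
                )
            )
        except (KeyError, ValueError):
            # Unknown code or unparseable number: skip the row here.
            # A real pipeline would log or quarantine it instead.
            continue
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(customer_id INTEGER, status TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(text=RAW_CSV):
    """Run extract, transform, and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn
```

Calling `run_pipeline()` returns a connection whose `payments` table holds only the two rows that passed validation. The analyst querying that table never has to know the raw export used single-letter codes or contained a broken amount, which is the point of having the cleaning done upstream.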

Without the work done by data engineers, simply put, businesses cannot keep up with the influx of data or be sure they are getting reliable results from their analyses. In a blog post on DataStax last year, Chief Evangelist Patrick McFadin went so far as to argue that demand for data engineers is set to outstrip that for data scientists in the future. And while LinkedIn's promising jobs list suggests this is yet to happen (data scientist postings were still up 45% to 2,100+ last year), his logic is sound. As companies handle and analyze increasingly large amounts of data, structure is becoming a far more pressing issue because the pipeline becomes more liable to fall apart. The nature of discovering insights from the data is also changing. Data scientists are used to analyzing data in a reactive manner, exploring large information sets in the hope of finding a nugget. Today, that model no longer holds, and data analytics no longer requires the same specialist skills. Organizations are instead realizing that they need a data-driven culture in which data is embedded into applications, rather than being pushed into a central data lake before it can be analyzed. As such, it is more important than ever that data is clean, properly formatted, and easily accessible. Data engineers, by contrast, are far more proactive, using data to improve existing technologies and services. As McFadin notes, "It's a much more goal-oriented, 'fix the problem with the right tools' approach compared to delving into data to see what is there."

The data engineer's status is further enhanced in the age of self-service analytics by the greater emphasis organizations are having to put on compliance. A raft of incoming regulatory requirements surrounding the collection and use of data, most notably the GDPR, is set to inflict huge penalties on those not up to scratch. Data engineers are a vital part of maintaining compliance, curating and preparing trusted data sources so that business users are free to explore data sets without worrying about whether they are doing so legally.

Ultimately, there is now too much potentially useful data out there for us to continue relying on data scientists to do everything. We need business users to conduct more analysis and we need data engineers to ensure they have all the data they need and that it is clean, consistent, and accessible. The data scientist is by no means set to fall by the wayside and should remain relevant for many years to come, but it may be that they can share the workload more in future and focus on what they do best.
