This article reflects on the remarkable durability of the basic ETL paradigm, while at the same time recognizing some profound changes that must be addressed. These changes are due to new data demands, new classes of users, and new technology opportunities.
Most organizations are realizing that they have dozens, if not hundreds of potential data sources, especially those that are customer facing. The established best practices of manual data conforming are becoming unscalable. A new class of software startups are offering statistical data conforming that can provide usable matching of entities, such as customers, that approach the accuracy of manual data conforming, but at drastically higher speeds. The standard notion of a human dimension manager may give way to a 'robot dimension manager!'
The big data revolution has trumpeted the four V’s: volume, velocity, variety, and value. In my opinion, the most interesting and challenging V is variety. Standard relational databases and standard ETL pipelines are ill-equipped to handle unstructured text, hyper-structured machine data, graph relationships (think Facebook and LinkedIn), or images. As the value of these data types grows, ETL must change with new pipelines and new logic.
The lid has been off of data volumes for some time but it has reached a ridiculous point where even normal Main Street organizations want to access petabytes of data. Even when this data consists of conventional text and numbers (think log data), the data is a whale trapped in a swimming pool. You don’t dare move or copy the data to a new place for processing.
Reports and ad hoc queries remain important but the new phrase is operational analytics, which combine high performance data ingestion, real-time data quality checking and data conforming, and finally sophisticated analytics. All of this requires forgoing conventional batch processing and slow periodic updates of the available data.
Rise of the Analyst and Monetization of Data Insights
Data scientist is the new name for analysts who mine data for insights and propose experiments in marketing strategies (often in real time) that can affect revenue, profitability, and customer satisfaction. These data scientists, frequently working in business departments, are often skilled at communicating directly with senior management, effectively bypassing IT. The challenge is whether IT can become a participant in this process, understanding the ETL pipelines, and creating a stable data infrastructure beyond the prototypes built by the data scientists.
New Analytic Tools
Data scientists and others are employing advanced analytic tools that can take several forms. Some of these tools are statistical algorithms found in advanced packages such as MadLib that can be loaded into some DBMS systems. Others are custom user defined functions (UDFs) typically programmed in C. Finally, others are separate BI tools that consume the data warehouse data.
Columnar Data Stores and In-Memory Databases
The sweet spot for high performance dimensional databases is a combination of columnar data stores and in-memory processing. Columnar data stores are typically excellent at processing many simultaneous joins against a fact table and tolerating very wide dimension tables. As the cost of memory continues to decline, it becomes feasible to stand up terabytes of RAM in a distributed shared-nothing environment. Taking advantage of this physical RAM is still a work in progress, and this balancing is a key aspect of the ETL pipeline architecture. The MapReduce processing framework typically found in Hadoop clusters addresses this problem for certain kinds of 'relation-scan' analyses, but MapReduce relies on a shuffle step that moves the data from node to node to achieve balancing. Expect to see much more progress that will affect ETL in order to take advantage of the huge performance advantages offered by columnar in-store architectures.
Data Virtualization on Steroids
Finally, the venerable approach of defining simple SQL views to present tables in more usable formats, has given way to data virtualization, which now has become a major tool that in some cases replaces conventional ETL. In its basic form, data virtualization replaces physical data transformations with equivalent computations every time the data is accessed. Data virtualization becomes a classic trade-off between query performance and speed of deployment. Data virtualization is great for prototyping and data exploration, and when the prototyping phase is over, data virtualization can be replaced with true conventional ETL where the data is permanently moved and transformed.
If anything, ETL has become a bigger part of the data warehouse mission. We used to say that ETL consumed 50 to 70% of the data warehouse costs, risks, and implementation time. That number may now be more like 70 to 80%, especially considering the challenges described in this article. We can expect many new ETL innovations in the years ahead.