Why Batch Analytics Is Not Going Anywhere

Despite the growth in streaming, don't expect batch processing to disappear any time soon


The idea that taking the wrong action is better than taking no action seems to have been universally accepted in business, although it is easily countered by the equally true saying ‘fools rush in’. But if you run your life on aphorisms, you’re inevitably going to run into contradictions every now and then.

The idea that speed is the new currency has seen organizations increasingly look at ways they can gain insights from the vast stream of data being collected in real time. Consequently, they are moving away from the batch processing traditionally favored in big data and towards streaming analytics. But is it always the best option?

In a recent OpsClarity Inc. survey of 4,000 big data professionals, 92% of respondents said they plan to leverage stream-processing applications this year, while 79% will reduce or eliminate investments in batch-only processing. Batch analytics is high-latency analytics in which a large volume of data is processed all at once, so there is a delay between the collection and storage of data sets, their processing for analysis, and reporting. Streaming analytics, on the other hand, is low-latency analytics, defined by Forrester as ‘software that can filter, aggregate, enrich, and analyze a high throughput of data from multiple disparate live data sources and in any data format to identify simple and complex patterns to visualize business in real-time, detect urgent situations, and automate immediate actions.’ Speaking to us recently, Slim Baltagi, Director of Big Data Engineering at Capital One, compared batch analytics to drinking bottled water and streaming to drinking straight from the hose.

With batch processing, events are collected together based on the number of records or a certain period of time and stored somewhere to be processed as a finite data set. This can introduce unnecessary latency between the moment data is generated and the moment it is analyzed and acted upon, by which point the data may have lost its value. That can result in additional costs and erode a company’s competitive edge. There is also an implicit presumption in batch analytics that the data is complete, which may not be the case.
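
To make the batching rule concrete, here is a minimal sketch in Python of collecting events into finite batches by record count or by time window. The event shape, the function names, and the thresholds are all assumptions made for illustration and do not describe any particular product.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape: (timestamp_in_seconds, value)
Event = Tuple[float, float]


def batches(events: Iterable[Event], max_records: int,
            max_seconds: float) -> Iterator[List[Event]]:
    """Group events into finite batches.

    A batch is closed when it already holds max_records events, or when the
    next event arrives max_seconds or more after the batch's first event.
    """
    batch: List[Event] = []
    for event in events:
        if batch and (len(batch) >= max_records
                      or event[0] - batch[0][0] >= max_seconds):
            yield batch
            batch = []
        batch.append(event)
    if batch:
        # Emit whatever is left once the input ends.
        yield batch


def process(batch: List[Event]) -> float:
    """Analyze a closed batch as a finite data set (here: the mean value)."""
    return sum(value for _, value in batch) / len(batch)


events = [(0.0, 10.0), (1.0, 20.0), (2.0, 30.0), (7.0, 40.0), (8.0, 50.0)]
results = [process(b) for b in batches(events, max_records=3, max_seconds=5.0)]
```

Note that no result for the first three events exists until the fourth event arrives and closes the batch. That gap is the latency described above.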

In streaming analytics, the system remembers the query and every time the data changes, the answer adapts accordingly. This allows high volumes of data to be processed in very little time. Most data is available as a series of events, for example, click streams, mobile app data, and so forth, that is continuously produced by a variety of applications and systems in the enterprise. Streaming analytics does not rely only on typical enterprise sources; it also draws on social media data, sensor data, and the like. By connecting to external data sources, applications can integrate certain data from both internal and external sources into the same central hub.
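
The idea of a query that is remembered and re-answered on every change can be sketched as a small piece of state that is updated per event. The class below is a hypothetical illustration in Python, not the API of any streaming engine.

```python
class RunningMean:
    """A standing query: keeps just enough state to update its answer per event."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def update(self, value: float) -> float:
        """Fold one new event into the state and return the current answer."""
        self.count += 1
        self.total += value
        return self.total / self.count


query = RunningMean()
# Each incoming event immediately produces an updated answer.
answers = [query.update(v) for v in [10.0, 20.0, 30.0, 40.0]]
```

The answer is available after every event rather than only once a batch has closed, and the query never needs to see the full data set at once.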

The benefits of streaming analytics are many. For one, the speed at which data is processed, analyzed, and fed back into local systems greatly accelerates decision-making. The data has the potential to identify costs before they spiral, errors before they swell into a problem, and risks before they become existential threats. With no waiting time, the data is also more accurate: nothing gets lost, overlooked, or outdated, because the velocity and volume of data are not an issue. Essentially, it offers everything that batch processing does, only a lot faster, so that action can be taken in time to exploit the information - whether this be visibility into customer behaviors, potential new products, or fraud detection.

There are, however, disadvantages with streaming analytics. Systems based on streaming analytics require more resources at the beginning, although they do become more cost-effective as time goes on, and there are many open source stream processing tools available, including Apache Storm, Spark Streaming, and Apache Flink. Since streaming analytics occurs immediately, companies also have only a small window to act on the results before the data loses its value, which is not something all companies are capable of doing. There is also a lack of experts in streaming analytics, which is still a fairly nascent technology. Forrester analysts Mike Gualtieri and Rowan Curran, in a Q3 2014 Forrester report on Big Data and Streaming Analytics, noted that ‘The streaming application programming model is unfamiliar to most application developers.’ The dearth of Data Scientists is a much publicized problem, and since streaming analytics is still a recent technology, most developers have been slow to adopt it because they lack the expertise.

There are now more mobile devices than people in the US, and as the IoT grows there is going to be a huge growth in the number of sensors. Streaming analytics is only going to become more important for ensuring that this data has value. Organizations will need to adapt their existing data management and analytics processes for the IoT age, but they need to be careful. They need to select the best data processing system for the job at hand rather than going for what’s trendy, as whether batch or stream is required depends on the types and sources of data and processing time needed to get the job done. Not every job requires low latency analytics. They also need the infrastructure in place, or risk causing more harm than good. While streaming analytics is undeniably growing, and with good reason, batch analytics is not going away any time soon.
