A recent Businessweek article by Ashlee Vance claimed that Google, by moving past Hadoop and offering its customers Cloud Dataflow, with pipelines that combine batch and stream processing, was essentially making Hadoop redundant.
Cloud Dataflow incorporates software used internally by Google, arguably the largest data consumer in the world. Offering this kind of analytics power to customers is a major breakthrough, but Google is certainly not the first company to rely on internal processes and systems that were initially built on open source platforms.
To understand this in more depth, I spoke to Anirudh Todi, a software engineer at Twitter, who was instrumental in creating the much-discussed TSAR system recently implemented at the social network.
Anirudh described to me how this new software works: “TSAR is used at Twitter to count billions of events per day and has been built from the ground up, almost entirely on open-source technologies (Storm, Summingbird, Kafka, Aurora, and others).”
The importance of open source software to this project cannot be overstated; as Anirudh points out, the platform was built almost exclusively on it. Given that he worked on a project with open source software at its core, I wanted to know Anirudh’s opinion on its influence on the rise of Big Data and the popularity of data-led company initiatives.
“In my opinion, Open Source Software has been crucial to the rise of Big Data. In the last two decades and beyond, its adoption has been one of the most significant cultural developments in the software industry, and has shown that individuals, working together over the Internet, can create products that rival and sometimes beat proprietary ones.”
Open source has clearly been one of the most important factors in both how widely and how quickly Big Data has been adopted. Its move from near obscurity to a vital business function within five years has been made possible by the collaboration that open source technologies foster. Anirudh describes the impact that this open source approach has had on the ways companies tackle these kinds of projects: “It has shown how companies can become more innovative, more nimble and more cost-effective by building on the efforts of community work”.
Much of his work on open source platforms has taken place at Twitter and Facebook, both companies where Anirudh has worked. As these are widely acknowledged as two of the most data-focussed companies, I wanted to know what sets them apart from others and why they have been so successful in their data projects. “Facebook and Twitter are two companies that are squarely focussed on providing the best user experience that they possibly can. There is an intense focus on collecting as much data as possible and then building an entire ecosystem around it to make it as easy as possible to analyze the petabytes of data and transform it into useful information to provide insights into the data to themselves and to their partners”. It is about more than simply monetising data; it is about using that data to make themselves better.
With Twitter specifically, I was curious about how scaling occurred, given the steep upward curve in the amount of data the company produces and collects. Anirudh said that there are two main aspects to the scalability challenge: “Building infrastructure to be able to seamlessly collect and store so much data” and “Building tools to make it easy to analyze this data and produce useful information that can then be used to drive Twitter products”. Anirudh has tackled these through the creation of ‘Manhattan’, a next-generation distributed database, and Summingbird, which allows for real-time distribution and for writing streaming MapReduce programs.
These systems have allowed Anirudh and the team at Twitter to collect, manage and analyse data effectively. The systems they have already created, and their future iterations, point to a bright future for data management at Twitter.
Before we finished the interview, I asked Anirudh for his thoughts on the future of data as a whole:
“Data in aggregate is growing so fast that everything about how we think about data now is going to change radically in the next ten years. If you're a software engineer or work in technology in any way, this should sound like opportunity. Everything from hardware to networking to database technology to presentation layer is already changing rapidly to allow us more efficient access to data that will let us live and work better”.
“For the last five years or so, there has been an intense focus on Big Data analytics and I only see it increasing in the years to come”.
Anirudh will be speaking at Big Data Innovation, Boston, in September.