Uber has become a household name; few people in the developed world are unaware of what it does or how quickly it has grown. Launched in 2009, the company is already valued at $18 billion, and that figure continues to climb.
It seems like a very simple model: some people want to make extra money by driving others around in their free time, and others want to be able to get a taxi wherever they are, at whatever time. However, because the process is based on an app and location data, and the company focuses on the speed at which customers are picked up, the amount of data created and required is considerable.
The amount of data within a single transaction is large, but when you consider that they now command a fleet of over 100,000 drivers in around 340 cities across 63 countries, the scale of the data they need to deal with becomes clear.
As a foundation for this, they have turned to Apache Spark, which lets them both process the data quickly and scale their operations quickly. To help with this scaling, and to deal with the huge amounts of data, they have also adopted a Kafka-based system that pushes the data to local data centers and then on to a central Hadoop cluster. This replaces the previous system of multiple distributed data centers built on a relational model, making the process considerably simpler and faster.
In fact, Vinoth Chandar, who is in charge of scaling and building Uber’s data systems, told Datanami in an interview that Spark has been ‘instrumental in where we’ve gotten to’. Because the company needs both high speed and quick scaling, Spark feeding into a central Hadoop cluster has been the ideal setup, given the strong scalability this combination has historically offered.
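To make the data flow described above more concrete, here is a minimal sketch in plain Python of events being buffered in local data centers and then moved to a central store. It illustrates the shape of the pipeline only; it does not use Kafka, Hadoop, or Spark, and the class names, region names, and event fields are all hypothetical rather than anything Uber has published.

```python
from collections import defaultdict, deque


class LocalDataCenter:
    """Hypothetical stand-in for a local data center fed by a message queue."""

    def __init__(self, region):
        self.region = region
        self.buffer = deque()

    def publish(self, event):
        # Events from the app land in the nearest local buffer first.
        self.buffer.append(event)

    def drain(self):
        # Hand over buffered events in arrival order, emptying the buffer.
        while self.buffer:
            yield self.buffer.popleft()


class CentralStore:
    """Hypothetical stand-in for the central cluster that holds all events."""

    def __init__(self):
        self.events_by_region = defaultdict(list)

    def ingest(self, local_dc):
        # Pull everything the local data center has buffered; return the count.
        moved = 0
        for event in local_dc.drain():
            self.events_by_region[local_dc.region].append(event)
            moved += 1
        return moved


def run_demo():
    # Two made-up regions, each collecting its own events.
    sf = LocalDataCenter("san_francisco")
    london = LocalDataCenter("london")
    sf.publish({"type": "ride_request", "rider": "r1"})
    sf.publish({"type": "pickup", "rider": "r1"})
    london.publish({"type": "ride_request", "rider": "r2"})

    central = CentralStore()
    moved = central.ingest(sf) + central.ingest(london)
    counts = {region: len(events)
              for region, events in central.events_by_region.items()}
    return moved, counts
```

The point of the two tiers is that each region can accept events quickly on its own, while analysis runs against a single central copy of the data.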
So what does this all mean for the consumer?
People often overlook the importance of a backend data framework because nobody notices it when it is working, but in Uber’s case it is vital to the company’s ongoing success. Surge prices, for instance, are set through an analysis of the data rather than by arbitrary time slots, making them a more accurate representation of supply and demand.
It also makes it much easier to react to large events without necessarily affecting the overall service throughout an area. If a large music event is happening at one end of a city, for instance, the other end will not be devoid of Uber drivers. The system also strategically places Uber drivers in real time, meaning that you can order a car and have it with you as soon as possible.
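As a rough illustration of pricing driven by supply and demand rather than by fixed time slots, the sketch below computes a price multiplier per zone from the ratio of open ride requests to available drivers. The formula, the cap of 3.0, and the zone names are assumptions made for the example; Uber’s actual pricing logic is not described in this article.

```python
def surge_multiplier(open_requests, available_drivers, cap=3.0):
    """Hypothetical multiplier: demand/supply ratio, clamped to [1.0, cap]."""
    if open_requests < 0 or available_drivers < 0:
        raise ValueError("counts must be non-negative")
    if open_requests == 0:
        return 1.0  # no demand, so no surge
    if available_drivers == 0:
        return cap  # demand with no supply hits the cap
    ratio = open_requests / available_drivers
    return round(min(cap, max(1.0, ratio)), 2)


def surge_by_zone(zones):
    """Compute a multiplier for each zone independently of the others."""
    return {
        name: surge_multiplier(requests, drivers)
        for name, (requests, drivers) in zones.items()
    }


# Made-up snapshot: (open requests, available drivers) per zone.
snapshot = {
    "stadium": (90, 30),
    "suburbs": (10, 20),
    "downtown": (30, 20),
}
multipliers = surge_by_zone(snapshot)
```

Because each zone is evaluated on its own, a spike in one part of the city raises the multiplier there without changing the price in zones where supply already covers demand.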
So despite most Uber users not even knowing what Spark is, the truth is that their enjoyment of the service is very much dependent on it.