Picking Melbourne Cup Winners With Machine Learning

Can you predict a horse race with data?


For a short period in November, as it does every year, the Melbourne Cup horse race captured the attention of millions of people around the world, and Australia stopped almost completely. It is a unique event in which the best horses and the best jockeys compete over a 3,200m track. This year, Michelle Payne became the first-ever female jockey to win the race, riding Prince of Penzance home to take the $3.6 million prize.

Horse racing is big business for bookmakers, with gamblers queuing up year-round to throw money at them, and the major races can see hundreds of millions laid down. In 2014, Australians collectively wagered around $800 million on the Melbourne Cup, almost $40 per head of population. It is likely that this year the total was even higher. Despite its popularity among gamblers, the many variables involved in a horse race make it extremely unpredictable. Punters go to extreme lengths to boost their chances of getting one over on the bookies: they pore over all the available information, note down everything every ‘expert’ says, and employ all manner of systems that they believe will give them an edge.

This year, PwC’s Insight Analytics team and the PwC Experience Centre attempted to lend a helping hand by poring over all the data for them and applying a variety of analytics models to it in an attempt to predict the top five finishers.

PwC explained that: ‘Algorithms, including decision trees and their ensemble counter-parts use historical race results to ‘learn’ the complex relationship between a horse’s characteristics (weight, age, trainer etc.) and its most-likely finishing place or probability of winning.’
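To make the idea of trees and their ensembles a little more concrete, here is a minimal sketch in plain Python. It is not PwC's model: it uses one-split decision trees ("stumps") combined by bagging, and the toy history, the two features (weight and age), and every function name are assumptions made up for illustration.

```python
import random

# Invented toy history, NOT real race data: (weight_kg, age_years, won)
HISTORY = [
    (52.0, 5, 1), (53.0, 6, 1), (53.5, 5, 1), (54.0, 6, 0),
    (55.0, 7, 0), (56.0, 7, 0), (57.0, 8, 0), (58.0, 6, 0),
]

def mean(values):
    return sum(values) / len(values)

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p = mean(labels)
    return 2 * p * (1 - p)

def fit_stump(rows):
    """Fit a one-split tree: (feature, threshold, left_rate, right_rate)."""
    best = None
    for feature in (0, 1):  # 0 = weight, 1 = age
        for threshold in sorted({row[feature] for row in rows}):
            left = [row[2] for row in rows if row[feature] <= threshold]
            right = [row[2] for row in rows if row[feature] > threshold]
            if not right:
                continue  # every row fell on the left, so this is not a split
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, feature, threshold, mean(left), mean(right))
    if best is None:  # no usable split: fall back to the overall win rate
        rate = mean([row[2] for row in rows])
        return (0, float("inf"), rate, rate)
    return best[1:]

def predict_stump(stump, horse):
    feature, threshold, left_rate, right_rate = stump
    return left_rate if horse[feature] <= threshold else right_rate

def fit_ensemble(rows, n_trees=25, seed=0):
    """Bagging: fit each stump on a bootstrap resample of the history."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def win_probability(ensemble, horse):
    """Average the stumps' win rates to get an ensemble estimate."""
    return mean([predict_stump(stump, horse) for stump in ensemble])

ensemble = fit_ensemble(HISTORY)
light_young = win_probability(ensemble, (52.0, 5))
heavy_old = win_probability(ensemble, (58.0, 8))
print(light_young, heavy_old)
```

In this invented history the lighter horses won, so the ensemble gives the light, young runner the higher estimate. A real model would use far more features and far more races, and would be checked against results it had not seen.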

They used a number of approaches to pick a winner this year, including ‘the tried-and-tested linear regression model, a new machine learning model, and another more unique approach.’

To get the full benefit of the modeling, they first gathered as much data as possible and drew on input from horse racing experts to focus on what matters when it comes to picking winners. They collected the data using web scraping scripts, covering over 50 thousand horses and over 1 million race results (a race with 24 horses would have 24 race results). Such a large dataset allowed the model’s validity to be tested thoroughly before all the variables considered relevant to the race’s result (those that were available, at least) were fed in. These included ‘race odds, horse speed & form, weight carried, track condition preference of the horse, and barrier position in each race.’
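The "one result per horse per race" layout described above can be sketched as a simple record type. The field names, units, and placeholder values below are assumptions for illustration only; the article does not say how PwC actually stored its data.

```python
from dataclasses import dataclass

@dataclass
class RaceResult:
    """One horse's outcome in one race (hypothetical field names)."""
    race_id: str
    horse: str
    odds: float             # assumed to be decimal odds
    weight_kg: float        # weight carried
    track_condition: str
    barrier: int            # barrier position
    finishing_place: int

def make_race(race_id, n_horses):
    """Build placeholder rows for one race: one row per runner."""
    return [
        RaceResult(race_id, f"horse_{i}", 10.0, 55.0, "good", i, i)
        for i in range(1, n_horses + 1)
    ]

rows = make_race("race_001", 24)
print(len(rows))  # 24
```

Counted this way, a dataset of over 1 million results represents far fewer than 1 million races.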

The real question is, how did their models do? Not well, apparently. They predicted, in descending order: Fame Game, Preferment, Trip to Paris, Almoonqith, and Who Shot Thebarman. The actual results were Prince of Penzance, then Max Dynamite, Criterion, Trip to Paris, and Big Orange. Machine learning and predictive analytics may have helped in the search for a cure for cancer, controlled driverless cars, and translated some of the finest literary works the world has ever known with little to no human assistance, but the Melbourne Cup remains as elusive a mystery as ever.
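For a rough sense of how far off the prediction was, the two lists above can be compared directly. This is a small sketch; the scoring approach (counting horses that appear in both top fives) and the function name are our own choices, not PwC's evaluation method.

```python
PREDICTED = ["Fame Game", "Preferment", "Trip to Paris",
             "Almoonqith", "Who Shot Thebarman"]
ACTUAL = ["Prince of Penzance", "Max Dynamite", "Criterion",
          "Trip to Paris", "Big Orange"]

def top_k_hits(predicted, actual):
    """Map each horse in both lists to (predicted place, actual place)."""
    actual_place = {horse: i for i, horse in enumerate(actual, start=1)}
    return {
        horse: (i, actual_place[horse])
        for i, horse in enumerate(predicted, start=1)
        if horse in actual_place
    }

hits = top_k_hits(PREDICTED, ACTUAL)
hit_rate = len(hits) / len(ACTUAL)
winner_predicted = PREDICTED[0] == ACTUAL[0]

print(hits)              # {'Trip to Paris': (3, 4)}
print(hit_rate)          # 0.2
print(winner_predicted)  # False
```

Only one of the five predicted horses finished in the top five, and it was tipped for third but came fourth.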
