Sportmonks Prediction API
KICKING OFF OUR FOOTBALL PREDICTION API
Woohoo! We’ve just released our brand-new Football Prediction API! We’re so proud of our Dev Team and Data Scientists, who’ve helped us fulfil this dream.
Two years ago, we started researching how to get the most accurate and reliable probability estimates in sports events, based on our own SportMonks data. It resulted in the Prediction API we introduce to you today!
In this two-part blog, we will first highlight some prominent features, and then dig a little deeper into the technical aspects of our algorithm.
For starters, let me give you some of our algorithm’s key principles:
- Timely and Substantive:
Every day, the API updates its models with the latest data from our SportMonks Football Database.
- Data Controlled:
It doesn’t need human intervention. It runs on statistical analysis results of the entire historical SportMonks Football Database.
- Precise Probabilities:
The API offers the most precise probabilities possible, by using mathematical Probability Distribution models.
- Predictability Performance:
Our Prediction API’s success rate and quality are monitored, so you can track our predictions’ performance. Because even smart algorithms can fail, it is important to understand what is predictable and what is not.
Understanding the model
Our main model follows Bayesian principles. This means we use probability distributions to describe our model parameters.
First, it is important to ask ourselves what we want to predict. Obviously, football match results rely on the final score. The core task, therefore, is to model the goal distribution of each team. At this point, it is important to make the distinction between the goal distribution on the one hand, and the expected goals metric (xG) on the other, which gives the probability that a given goal attempt will result in a score. Our Bayesian model, by contrast, targets the full goal distribution itself.
To learn those distributions, we can use all the data features available to us: events, players, commentaries, statistics… The hard part is to select the features that best describe the teams’ goal distribution.
Once we have learnt the goal distribution, we can use it to predict many matches.
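To make this concrete, here is a minimal sketch of how a learnt goal distribution turns into match predictions. It assumes each team's goals follow an independent Poisson distribution with a known rate; the rates used are illustrative, not actual model output.

```python
import math

def poisson_pmf(lam, k):
    """Probability that a Poisson(lam) variable equals k."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def match_probabilities(home_rate, away_rate, max_goals=10):
    """Given each team's expected-goals rate, build the joint score
    distribution (assuming independent Poisson goal counts) and sum it
    into home-win / draw / away-win probabilities."""
    p_home = p_draw = p_away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(home_rate, h) * poisson_pmf(away_rate, a)
            if h > a:
                p_home += p
            elif h == a:
                p_draw += p
            else:
                p_away += p
    return p_home, p_draw, p_away

# Illustrative rates, not real model output
print(match_probabilities(1.6, 1.1))
```

Truncating the score grid at ten goals loses only a negligible amount of probability mass for realistic rates.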
The technical part
In this section, I’ll be giving you more mathematical details. Please be advised that this may get a bit technical. Of course, we are not going to give away all our secrets, but enough to help you understand what it is about.
As said before, we are performing a Bayesian analysis. To model our main variable, the goals, we need to choose a probability distribution. As a starting point, we will be using the Poisson distribution [2], which is a positive distribution for count data. This means that if $X$ represents the number of goals scored by a team, we assume the following 'prior' (prior probability distribution):

$$X \sim \mathrm{Poisson}(\lambda)$$
where $\lambda$ is the unique distribution parameter. It can be interpreted as the team's expected number of goals. Often it is also represented as the team's strength, combining attack and defence effects, but this is not the route we will be taking. Instead, we will treat $\lambda$ as a random variable and choose a distribution for it. We know it is a positive continuous variable. Therefore a natural prior distribution is the Gamma distribution:

$$\lambda \sim \mathrm{Gamma}(\alpha, \beta)$$
The Gamma distribution has two parameters, $\alpha$ and $\beta$, which help us cover a large family of shapes. The parameter $\alpha$ can be interpreted as the number of *goals* scored by the team over $\beta$ matches. Since $\lambda$ is now a random variable, it has an expected value, which we call $\mu$.
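The "$\alpha$ goals over $\beta$ matches" reading has a convenient consequence: the Gamma distribution is the conjugate prior of the Poisson, so updating the prior with new results is just bookkeeping. The following sketch illustrates that interpretation with made-up numbers; it is not the actual Sportmonks update, which also involves features.

```python
def update_gamma(alpha, beta, goals_per_match):
    """Conjugate update of a Gamma(alpha, beta) prior on a team's
    Poisson scoring rate: alpha accumulates goals, beta matches."""
    return alpha + sum(goals_per_match), beta + len(goals_per_match)

# Prior: 15 goals over 10 matches -> expected rate 1.5
alpha, beta = 15.0, 10.0

# Observe three new matches with 2, 0 and 1 goals scored
alpha, beta = update_gamma(alpha, beta, [2, 0, 1])

print(alpha / beta)  # posterior expected goals per match
```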
Now we need to find the parameters of the Gamma distribution. We are interested in $\mu$. In this case it is the expectation of the Gamma distribution, $\mu = \mathbb{E}[\lambda] = \alpha / \beta$. Please note that this is also our expected *goal* measure. At this point, we have almost everything we need. The last step is to incorporate the set of features that will help us fit the distribution parameter. To do so, we will assume that

$$\log \mu = w^\top x, \qquad w \sim \mathcal{N}(0, \sigma^2 I)$$

In this equation, $x$ is the vector of features of interest, while $\mathcal{N}(0, \sigma^2 I)$ is the Gaussian prior for the parameters $w$.
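The log link and Gaussian prior above can be explored with a simple Monte Carlo sketch: sampling weight vectors from the prior and pushing a feature vector through $\mu = \exp(w^\top x)$ shows how the prior over weights induces a distribution over expected goals. The feature vector and prior scale below are purely hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical feature vector x (e.g. recent form, home advantage, ...)
x = [1.0, 0.3, -0.2]

# Gaussian prior on the weight vector w: N(0, sigma^2) per component
sigma = 0.5
samples = []
for _ in range(10000):
    w = [random.gauss(0.0, sigma) for _ in x]
    log_mu = sum(wi * xi for wi, xi in zip(w, x))
    samples.append(math.exp(log_mu))  # mu = exp(w . x) is always positive

print(sum(samples) / len(samples))  # prior predictive mean of mu
```

Note how the exponential guarantees a positive expected-goals value whatever the weights, which is why the log link is a natural choice here.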
In the training phase, the model determines the posterior parameter distribution, given the data and our prior distributions. The prediction process involves many steps, from data collection through feature engineering to model training.
To help you to understand what it takes, we have summarised our workflow in the following chart.
There are several ways to measure a prediction's quality. For example, we can count the number of times we have the correct result, which we call the accuracy. We could also use the ranked probability score [3] or the Brier score [4]. Instead, we prefer to use an entropy-related measure: the log loss.
For one event, it is represented by the following equation:

$$\mathrm{LogLoss} = -\sum_{i \in \Omega} y_i \log(p_i)$$

Here $\Omega$ is the set of possible outcomes, $p_i$ represents the probability of outcome $i$, and $y_i$ is the event label, in which the value 1 stands for a success and 0 otherwise. The label can also be interpreted as the a posteriori probability once the event result has been observed. For instance, suppose the event is 'team A plays team B' and we want to predict the winner. We have $\Omega = \{\text{A wins}, \text{Draw}, \text{B wins}\}$ and assign probabilities $p_A$, $p_D$ and $p_B$. The following table shows the log loss for each outcome:

| event | A wins | Draw | B wins |
|---|---|---|---|
| log loss | $-\log p_A$ | $-\log p_D$ | $-\log p_B$ |
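In code, the single-event log loss reduces to minus the log of the probability assigned to the outcome that actually happened. The probabilities below are illustrative, not model output.

```python
import math

def log_loss(probs, outcome):
    """Log loss for a single event: -log of the probability
    assigned to the outcome that actually happened."""
    return -math.log(probs[outcome])

# Hypothetical probabilities for 'team A plays team B'
probs = {"A wins": 0.5, "Draw": 0.3, "B wins": 0.2}

for outcome in probs:
    print(outcome, round(log_loss(probs, outcome), 4))
```

Notice the asymmetry: a confident prediction that comes true costs little, while a low-probability outcome that occurs is punished heavily.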
Finally, a league's predictability is calculated by taking the average log loss across all the events. We have chosen the match winner as the main event to compute the league predictability. There are only three possible outcomes: 1. home team wins, 2. draw, 3. away team wins.
The closer the average log loss figure is to zero, the better the predictability.
A purely random model would assign a probability of 33% to each outcome. In this case the random model predictability is $-\log(1/3) \approx 1.0986$.
Remark: Any league with a predictability close to or above 1.0986 should be considered as unpredictable.
Another interesting model focuses on historical frequencies. On average, the home team wins 45% of the time and the away team 30%, while 25% of matches end in a draw. As a result, the historical model predictability is $-(0.45\log 0.45 + 0.25\log 0.25 + 0.30\log 0.30) \approx 1.0671$.
Remark: Any model with a league predictability close to or above 1.0671 should be considered incapable of learning how to beat the historical model.
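Both baseline figures follow directly from the log loss formula, as this short check shows:

```python
import math

# Purely random model: 1/3 probability for each of the three outcomes
random_baseline = -math.log(1 / 3)  # natural log of 3

# Historical model: home win 45%, draw 25%, away win 30%
hist = [0.45, 0.25, 0.30]
historical_baseline = -sum(p * math.log(p) for p in hist)

print(round(random_baseline, 4))      # ~1.0986
print(round(historical_baseline, 4))  # ~1.0671
```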
To make the measure easier to understand, League Predictability Classification is divided into four categories: poor, medium, good, high.
The Prediction API
What you get
Every day, our models compute predictions for upcoming matches up to two weeks ahead. The models are updated on the basis of new information coming in every day. The set of predictions delivered by our algorithm is described below:
- Winner: probability of a home win, draw and away win.
- Correct score: non-zero probabilities for each possible score.
- Over/under: probability that the goals scored by the home team, the away team, or both combined fall over or under a given threshold.
- Both teams to score: probability that both teams score.
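All four markets can, in principle, be derived from a single joint score distribution. The sketch below does this under the same independent-Poisson assumption as before; the rates and the 2.5-goal threshold are illustrative, not the actual API computation.

```python
import math

def poisson_pmf(lam, k):
    """Probability that a Poisson(lam) variable equals k."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def score_matrix(home_rate, away_rate, max_goals=10):
    """Joint probability of every (home, away) score, assuming
    independent Poisson goal counts per team."""
    return [[poisson_pmf(home_rate, h) * poisson_pmf(away_rate, a)
             for a in range(max_goals + 1)]
            for h in range(max_goals + 1)]

m = score_matrix(1.6, 1.1)  # illustrative rates

# Correct score: probability of, e.g., a 2-1 home win
p_2_1 = m[2][1]

# Over/under 2.5 total goals
p_over_2_5 = sum(p for h, row in enumerate(m)
                 for a, p in enumerate(row) if h + a > 2)

# Both teams to score
p_btts = sum(p for h, row in enumerate(m)
             for a, p in enumerate(row) if h > 0 and a > 0)

print(round(p_2_1, 4), round(p_over_2_5, 4), round(p_btts, 4))
```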
In addition to these probabilities, we generate the League Predictability through the same model. The League Predictability reveals the quality of the probability set.
For each league you get:
1. The league predictability score, given by the average log loss.
2. The league predictability classification, given by one of the four categories: poor, medium, good, high.
Last but not least, the generated probabilities are analysed together with the odds markets available in our database. This is called the Value Bet Model.
The upcoming release on 30 July includes the Prediction Model and Value Bet Model. The Player Contribution Model will follow at a later stage. As more data and more features enter the database going forward, the model's learning curve will grow further. The model will also be able to cover extra prediction features such as corner probabilities, half-time results, or the final league table. In other words, our new API will help you stay on top of your game! Stay tuned!
2. See for instance Dixon M. and Coles S. (1997), "Modelling association football scores and inefficiencies in the football betting market", Applied Statistics, 46, pp. 265-280.
3. Epstein E. (1969), "A Scoring System for Probability Forecasts of Ranked Categories", Journal of Applied Meteorology, 8(6), pp. 985-987.
4. Brier G. (1950), "Verification of Forecasts Expressed in Terms of Probability", Monthly Weather Review, 78(1), p. 1.