Kicking off our football Prediction API

Woohoo! We’ve just released our brand-new Football Prediction API! We’re so proud of our Dev Team and Data Scientists, who’ve helped us fulfill this dream.

Two years ago, we started researching how to get the most accurate and reliable probability estimates in sports events, based on our own Sportmonks data. It resulted in the Prediction API we introduce to you today!

In this two-course blog, we will first give you some prominent features and then dig a little deeper into our algorithm’s technical aspects.

For starters, let me give you some of our algorithm’s fundamental principles:

Timely and Substantive:
Every day, the API updates its models with the latest data from our Sportmonks Football Database.

Data Controlled:
It doesn’t need human intervention. It runs on statistical analysis results of the entire historical Sportmonks Football Database.

Precise Probabilities:
The API offers the most precise probabilities possible by using mathematical Probability Distribution models.

Predictability Performance:
We monitor our Prediction API’s success rate and quality so that you can track our predictions’ performance. Because even smart algorithms can fail, it is essential to know what is predictable and what isn’t.

Understanding the model

The model

Our primary model follows Bayesian principles, which means we use probability distributions to describe our model parameters.

First, it is essential to ask ourselves what we want to predict. A football match result is determined by the final score, so the quantity to model is the number of goals each team scores.

It is vital to distinguish between the goal distribution on the one hand and the expected goals (xG) metric[^1] on the other. The xG metric gives the probability that a goal attempt results in a goal.

The Bayesian model learns the goal distribution from our historical data. It tells us the expected scoring rates of the two teams for their next match.

To learn those distributions, we can use all the data features available to us: events, players, commentaries, statistics, and much more. The hard part is to select the features that best describe the teams’ goal distribution.

Once we have learned the goal distribution, we can use it to predict many matches. Predictions are available three weeks ahead.

In this section, we give more mathematical details for the interested reader. We are not going to reveal all of our secret sauce, but enough for you to understand what it is about.

As we said, we are doing Bayesian analysis. For that, we need to choose a probability distribution for our main variable: the goals. As a starting point, we use the Poisson distribution[^2]. It is a distribution over non-negative integers, suited to count data, which is goals in our case. This means that if $y$ is the number of goals scored by a team, we assume the following sampling distribution:

$$ y\sim \text{Poisson}(\lambda) $$

where $\lambda$ is the only parameter of the distribution. It can be interpreted as the expected number of goals of the team. It is often also represented as the team strength, combining attack and defense effects, but this is not the route we are taking. Instead, we treat $\lambda$ as a random variable and choose a distribution for it. We know it is a positive continuous variable. Therefore, a natural prior distribution is the Gamma distribution:

$$\lambda\sim \text{Gamma}(\alpha,\beta)$$

The Gamma distribution has two parameters, which lets it fit a large family of shapes. The parameter $\alpha$ can be interpreted as the number of goals scored by the team over $\beta$ matches. Since $\lambda$ is now a random variable, it has an expected value that we call $\mu = \mathbb{E}\left[\lambda\right]$.
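The reading of $\alpha$ and $\beta$ as "goals over matches" comes from the fact that the Gamma prior is conjugate to the Poisson distribution. The sketch below illustrates this for the simplest possible case, with no features at all, so it is a simplification for intuition and not our production model. The function names and all numbers are hypothetical.

```python
import math

def update_gamma(alpha, beta, goals):
    """Conjugate update: Gamma(alpha, beta) prior plus observed Poisson counts.

    Each match adds its goals to alpha and 1 to beta.
    """
    return alpha + sum(goals), beta + len(goals)

def predictive_pmf(k, alpha, beta):
    """P(y = k) once lambda is integrated out (a negative binomial distribution)."""
    log_p = (math.lgamma(alpha + k) - math.lgamma(alpha) - math.lgamma(k + 1)
             + alpha * math.log(beta / (beta + 1.0))
             - k * math.log(beta + 1.0))
    return math.exp(log_p)

# Hypothetical prior: 3 goals over 2 matches, i.e. mu = 1.5 goals per match.
alpha0, beta0 = 3.0, 2.0

# Hypothetical goals scored by the team in its last five matches.
alpha1, beta1 = update_gamma(alpha0, beta0, [2, 0, 1, 3, 1])

prior_mean = alpha0 / beta0        # E[lambda] before seeing the matches
posterior_mean = alpha1 / beta1    # E[lambda] after seeing the matches

print(f"prior mean: {prior_mean:.3f}, posterior mean: {posterior_mean:.3f}")
print(f"P(team scores exactly 2 goals): {predictive_pmf(2, alpha1, beta1):.3f}")
```

With features in the model, the posterior no longer has this closed form, which is why the training phase described further on is needed.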

Now we need to find the parameters of the Gamma distribution. In fact, we are mainly interested in $\mu$, which in this case is the expectation of the Gamma distribution: $\mathbb{E}\left[\lambda\right] = \frac{\alpha}{\beta}$. Note that $\mu$ is also the expected number of goals, since $\mathbb{E}\left[y\right] = \mathbb{E}\left[\lambda\right]$. At this point we have almost everything. The last piece is incorporating the set of features $x$ that will help us fit the parameters of the distribution. To do so, we assume that:

$$\mu = \theta^\top x$$

where $x$ is a vector of features of interest and $\theta \sim \mathcal{N}\left(0, \sigma\right)$ is the Gaussian prior on the parameters.

The training phase of the model consists of determining the posterior distribution of the model parameters given the data and our prior distribution. Many steps are involved in the prediction process, from data collection to feature engineering and model training. To help you understand what it takes, we have summarized our workflow in the following chart.

Quantifying predictability

There are several ways to measure the quality of a prediction. For example, we can count how often the predicted result is correct, which is the model's accuracy. We could also use the ranked probability score[^3] or the Brier score[^4]. Instead, we prefer to use an entropy-related measure, the log loss.

For one event it is represented in the following equation:

$$\ell= -\sum_{i\in\Omega} y_{i}\ln p_{i} $$

$\Omega$ is the set of possible outcomes, $p_i$ is the probability of outcome $i$, and $y_i \in \{0,1\}$ is the event label, where 1 means the outcome occurred and 0 otherwise. You can interpret the label as the a posteriori probability once the event result has been observed.

For instance, suppose the event is ‘team A plays team B’ and we want to predict the winner. We have $\Omega = \{\text{A wins}, \text{Draw}, \text{B wins}\}$ and we assume the following probabilities: $p_A = 0.4$, $p_D = 0.1$ and $p_B = 0.5$. The following table shows the log loss for each possible outcome.

Outcome        A wins              Draw                B wins
Log loss ℓ     −ln 0.4 ≈ 0.92      −ln 0.1 ≈ 2.30      −ln 0.5 ≈ 0.69
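Because the label is 1 for the observed outcome and 0 for all others, the sum in the equation reduces to a single term. A minimal sketch that reproduces the values in the table (the function name is ours, chosen for illustration):

```python
import math

def log_loss(probabilities, observed):
    """Log loss of one event: minus the log of the probability given to the observed outcome."""
    return -math.log(probabilities[observed])

forecast = {"A wins": 0.4, "Draw": 0.1, "B wins": 0.5}

for outcome in forecast:
    print(f"{outcome}: {log_loss(forecast, outcome):.4f}")
# A wins: 0.9163
# Draw: 2.3026
# B wins: 0.6931
```

The less probability the forecast gave to what actually happened, the higher the penalty: the draw was given only 10%, so it costs the most.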

Finally, the model calculates a league’s predictability by looking at the average log loss across all the events. We have chosen the match-winner as the main event to compute the league predictability. There are only three possible outcomes: 1. Home team wins, 2. Draw and 3. Away team wins.

The closer the average log loss is to zero, the better the predictability.
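As an illustration, here is how such an average could be computed over a handful of matches. The forecasts and results are made up for the example, and the function name is hypothetical.

```python
import math

def league_log_loss(forecasts, results):
    """Average log loss over matches; each forecast maps outcome -> probability."""
    losses = [-math.log(forecast[result])
              for forecast, result in zip(forecasts, results)]
    return sum(losses) / len(losses)

# Hypothetical match-winner forecasts for three matches of one league.
forecasts = [
    {"home": 0.55, "draw": 0.25, "away": 0.20},
    {"home": 0.30, "draw": 0.30, "away": 0.40},
    {"home": 0.45, "draw": 0.28, "away": 0.27},
]
# Hypothetical observed results for the same three matches.
results = ["home", "away", "draw"]

score = league_log_loss(forecasts, results)
print(f"league predictability (average log loss): {score:.3f}")
```

In practice the average is taken over many more matches than three; a small sample like this one is only meant to show the mechanics.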

A purely random model would assign a probability of 1/3 to each outcome. In this case, the predictability of the random model is $\ell_{\text{rand}} = \ln 3 \approx 1.0986$.

Remark: Any league with predictability close to or above 1.0986 should be considered as unpredictable.

Another useful baseline is the historical model. On average, the home team wins 45% of the time, the away team wins 30% of the time, and 25% of matches end in a draw. A model that always predicts these probabilities has a predictability of $\ell_{\text{hist}} \approx 1.0671$.

Remark: Any model with league predictability close to or above 1.0671 should be considered incapable of learning how to beat the historical model.
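Both reference values can be checked with a few lines of code. The sketch below computes the average log loss of a fixed forecast when outcomes occur with the historical frequencies quoted above; the function name is ours.

```python
import math

def expected_log_loss(forecast, frequencies):
    """Average log loss of a fixed forecast when outcomes occur with the given frequencies."""
    return -sum(f * math.log(p) for p, f in zip(forecast, frequencies))

historical = [0.45, 0.25, 0.30]   # home win, draw, away win
uniform = [1 / 3, 1 / 3, 1 / 3]   # purely random model

l_rand = expected_log_loss(uniform, historical)      # ln 3, whatever the frequencies are
l_hist = expected_log_loss(historical, historical)   # entropy of the historical frequencies

print(round(l_rand, 4), round(l_hist, 4))  # 1.0986 1.0671
```

The gap between the two baselines is small, which shows how little there is to gain from home advantage alone.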

To make the measure easier to understand, we divided the League Predictability Classification into four categories: poor, medium, good, high.

The Prediction API

What you get

Every day, our models compute predictions for upcoming matches two weeks ahead. We update the models based on new information, which comes in daily. See below for the set of actual predictions delivered by our algorithm:

  • Winner: the probability of home win, draw, and away win.
  • Correct score: the non-zero probability of a given score.
  • Over/under: the probability that the number of goals scored by the home team, by the away team, or by both teams together is over or under a given line.
  • Both teams to score: the probability that both teams score.
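To show how these four markets relate to each other, the sketch below derives them all from one matrix of score probabilities. It assumes that the two teams' goal counts are independent Poisson variables with fixed rates, which is a textbook simplification and not a description of our production model. The rates, the 2.5 goal line, and the function names are hypothetical.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k goals for a Poisson variable with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def score_matrix(home_rate, away_rate, max_goals=10):
    """matrix[i][j] = P(home scores i, away scores j), truncated at max_goals each."""
    return [[poisson_pmf(i, home_rate) * poisson_pmf(j, away_rate)
             for j in range(max_goals + 1)]
            for i in range(max_goals + 1)]

def markets(matrix, line=2.5):
    """Aggregate the score matrix into winner, over/under and both-teams-to-score."""
    cells = [(i, j, p) for i, row in enumerate(matrix) for j, p in enumerate(row)]
    return {
        "home": sum(p for i, j, p in cells if i > j),
        "draw": sum(p for i, j, p in cells if i == j),
        "away": sum(p for i, j, p in cells if i < j),
        "over": sum(p for i, j, p in cells if i + j > line),
        "btts": sum(p for i, j, p in cells if i > 0 and j > 0),
    }

# Hypothetical expected scoring rates for the home and the away team.
matrix = score_matrix(1.6, 1.1)
probs = markets(matrix)
correct_score_2_1 = matrix[2][1]   # probability of a 2-1 home win

for name, value in probs.items():
    print(f"{name}: {value:.3f}")
print(f"correct score 2-1: {correct_score_2_1:.3f}")
```

Only the over/under on total goals is shown here; the per-team versions follow the same pattern, summing over `i` or `j` alone.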

In addition to these probabilities, we generate the League Predictability through the same model. The League Predictability will reveal the probability set’s quality.

For each league, you get:

  1. The league predictability score indicated by the average log loss
  2. The league predictability classification indicated by one of the four categories: poor, medium, good, high.

Finally, the Value Bet Model analyzes the generated probabilities together with the odds markets available in our database.
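We do not detail the Value Bet Model here. As a rough illustration of the general idea only, a bet is commonly said to have value when the predicted probability multiplied by the decimal odds exceeds 1. The sketch below applies that common definition; it is an assumption for the sake of the example, and the function names, probabilities, and odds are all hypothetical.

```python
def expected_value(probability, decimal_odds):
    """Expected profit per unit staked: win (odds - 1) with probability p, lose 1 otherwise."""
    return probability * decimal_odds - 1.0

def value_bets(probabilities, odds):
    """Keep only the outcomes whose expected value is positive."""
    return {outcome: expected_value(p, odds[outcome])
            for outcome, p in probabilities.items()
            if expected_value(p, odds[outcome]) > 0}

# Hypothetical model probabilities and bookmaker decimal odds for one match.
probabilities = {"home": 0.50, "draw": 0.26, "away": 0.24}
odds = {"home": 2.20, "draw": 3.40, "away": 3.60}

for outcome, ev in value_bets(probabilities, odds).items():
    print(f"{outcome}: expected value {ev:+.2f} per unit staked")
```

With these numbers only the home win qualifies, since 0.50 × 2.20 is the only product above 1.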

Coming soon

The upcoming release on 30 July includes the Prediction Model and the Value Bet Model. The Player Contribution Model will follow at a later stage. As more data and more features enter the database, the model will continue to learn and improve.

The model will also cover extra prediction features like corner probabilities, half time results, or the league’s final table. In other words, our new API will help you stay on top of your game! Stay tuned!

  1. The expected goals metric is a shot-based measure.
  2. See for instance Dixon M. and Coles S., (1997), Modelling association football scores and inefficiencies in the football betting market, Applied Statistics, 46, pp. 265-280.
  3. Epstein E., (1969), A Scoring System for Probability Forecasts of Ranked Categories, Journal of Applied Meteorology, 8(6), pp. 985-987.
  4. Brier G., (1950), Verification of Forecasts Expressed in Terms of Probability, Monthly Weather Review, 78(1), pp. 1-3.