Premier League results [football]

We’re going to download English Premier League results from this very useful website: https://www.football-data.co.uk/englandm.php

We’ll be using the httr package to download the data and the tidyverse collection for data manipulation:

library(httr)
library(tidyverse)

First, a few helper functions to make the code more readable.

Transform the first year of a Premier League season (e.g. 1993) to the format used in football-data.co.uk URLs (“9394”):

# Add 1 before taking the remainder so that 1999 gives "9900", not "99100"
int_to_season <- function(x) sprintf(
    "%02d%02d",
    x %% 100,
    (x + 1) %% 100
)
int_to_season(1993)
## [1] "9394"
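
The turn of the century is worth checking separately: the season starting in 1999 should map to “9900”, so the second year has to wrap around to “00”. Below is a minimal sketch of that check. The helper name `season_code()` is hypothetical (it is not part of the original code) and follows the same idea as `int_to_season()`:

```r
# Hypothetical helper following the same idea as int_to_season();
# the name season_code is an assumption made for this example.
# The year is incremented before the remainder is taken, so 1999 wraps to "00".
season_code <- function(x) sprintf(
    "%02d%02d",
    as.integer(x %% 100),
    as.integer((x + 1) %% 100)
)

season_code(c(1993, 1999, 2000, 2020))
## [1] "9394" "9900" "0001" "2021"
```

Because `sprintf()` is vectorised, the same helper works on a whole range of years at once.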

Building on the above function, generate the URL for a specific season:

int_to_url <- function(x) x %>% 
    int_to_season() %>% 
    sprintf(
        "https://www.football-data.co.uk/mmz4281/%s/E0.csv",
        .
    )
int_to_url(2020)
## [1] "https://www.football-data.co.uk/mmz4281/2021/E0.csv"

This is how we get a single season’s data:

int_to_url(1993) %>% 
    GET() %>% 
    content() %>% 
    head()
## No encoding supplied: defaulting to UTF-8.
## Warning: Missing column names filled in: 'X8' [8], 'X9' [9], 'X10' [10],
## 'X11' [11], 'X12' [12], 'X13' [13], 'X14' [14], 'X15' [15], 'X16' [16],
## 'X17' [17], 'X18' [18], 'X19' [19], 'X20' [20], 'X21' [21], 'X22' [22],
## 'X23' [23], 'X24' [24], 'X25' [25], 'X26' [26], 'X27' [27], 'X28' [28]
## Parsed with column specification:
## cols(
##   .default = col_logical(),
##   Div = col_character(),
##   Date = col_character(),
##   HomeTeam = col_character(),
##   AwayTeam = col_character(),
##   FTHG = col_double(),
##   FTAG = col_double(),
##   FTR = col_character()
## )
## See spec(...) for full column specifications.
## Warning: 169 parsing failures.
## row col   expected    actual         file
## 384  -- 28 columns 7 columns <raw vector>
## 385  -- 28 columns 7 columns <raw vector>
## 386  -- 28 columns 7 columns <raw vector>
## 387  -- 28 columns 7 columns <raw vector>
## 388  -- 28 columns 7 columns <raw vector>
## ... ... .......... ......... ............
## See problems(...) for more details.
## # A tibble: 6 x 28
##   Div   Date  HomeTeam AwayTeam  FTHG  FTAG FTR   X8    X9    X10   X11   X12  
##   <chr> <chr> <chr>    <chr>    <dbl> <dbl> <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 E0    14/0~ Arsenal  Coventry     0     3 A     NA    NA    NA    NA    NA   
## 2 E0    14/0~ Aston V~ QPR          4     1 H     NA    NA    NA    NA    NA   
## 3 E0    14/0~ Chelsea  Blackbu~     1     2 A     NA    NA    NA    NA    NA   
## 4 E0    14/0~ Liverpo~ Sheffie~     2     0 H     NA    NA    NA    NA    NA   
## 5 E0    14/0~ Man City Leeds        1     1 D     NA    NA    NA    NA    NA   
## 6 E0    14/0~ Newcast~ Tottenh~     0     1 A     NA    NA    NA    NA    NA   
## # ... with 16 more variables: X13 <lgl>, X14 <lgl>, X15 <lgl>, X16 <lgl>,
## #   X17 <lgl>, X18 <lgl>, X19 <lgl>, X20 <lgl>, X21 <lgl>, X22 <lgl>,
## #   X23 <lgl>, X24 <lgl>, X25 <lgl>, X26 <lgl>, X27 <lgl>, X28 <lgl>

There are empty columns (and, sometimes, empty rows) in older seasons’ data, so we’d better clean it up.
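
To see what the clean-up amounts to without touching the network, here is a rough sketch in base R. The sample text is made up: it only imitates the shape of an older file (unnamed trailing columns and a trailing empty row), and names such as `csv_text`, `raw` and `clean` are assumptions for this example:

```r
# Made-up sample imitating an older season's file: two matches,
# two unnamed empty columns and one completely empty row.
csv_text <- "Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,,
E0,01/01/00,Home A,Away B,2,1,H,,
E0,02/01/00,Home C,Away D,0,0,D,,
,,,,,,,,
"

# Treat empty fields as NA so that empty columns and rows are easy to detect
raw <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)

# Keep only columns and rows that contain at least one non-missing value
keep_cols <- colSums(!is.na(raw)) > 0
keep_rows <- rowSums(!is.na(raw)) > 0
clean <- raw[keep_rows, keep_cols]

names(clean)
nrow(clean)
```

This should leave only the two match rows and the named columns; the approach used in the rest of this post reaches a similar result by selecting the needed columns and dropping incomplete rows.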

As we’re only interested in the result of each game, we only need the date, the home and away teams, and the full-time home and away goals:

columns <- c(
    "Date",
    "HomeTeam",
    "AwayTeam",
    "FTHG",
    "FTAG"
)

This is a function that takes a year and fetches the data for the corresponding season. It also adds a Season column (holding the season’s first year) to make it easier to use a combined data set for multiple seasons. For extra points, it suppresses warnings and drops incomplete rows:

get_season_data <- function(x) x %>% 
    int_to_url() %>% 
    GET() %>% 
    {
        suppressWarnings(
            content(
                .,
                col_types = cols(),
                encoding = "UTF-8"
            )
        )
    } %>% 
    mutate(
        Season = x
    ) %>% 
    select(
        Season,
        all_of(
            columns
        )
    ) %>% 
    filter(
        complete.cases(.)
    )
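
One thing the function above does not do is check whether the download succeeded. If a season’s file is missing, the response would presumably not be a CSV, and the pipeline would likely fail later with a less helpful message. A minimal sketch of a stricter fetch step follows, assuming the server labels the file as CSV so that `content()` parses it with readr; the wrapper name `fetch_csv()` is hypothetical:

```r
library(httr)

# Hypothetical wrapper; the name fetch_csv is an assumption for this example.
fetch_csv <- function(url) {
    response <- GET(url)
    # stop_for_status() converts HTTP errors (4xx/5xx) into R errors,
    # so a missing file fails here rather than during parsing
    stop_for_status(response)
    suppressWarnings(
        content(
            response,
            col_types = readr::cols(),
            encoding = "UTF-8"
        )
    )
}
```

It could replace the `GET()` and `content()` steps in `get_season_data()`, e.g. `x %>% int_to_url() %>% fetch_csv()`.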

With the hard part done, getting the data for a range of seasons and saving it to a CSV file takes just a few lines:

df <- 1993:2020 %>% 
    lapply(
        get_season_data
    ) %>% 
    bind_rows() %>% 
    write_csv(
        "EPL.csv"
    )
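
Since the FTR column is not among the selected columns, the outcome of a match has to be recomputed from the goals whenever it is needed. Here is a small sketch of one way to do that in base R. The rows are made up (they are not real results) and only mimic the shape of the combined data set; the `Result` column name is an assumption:

```r
# Made-up rows in the shape produced by get_season_data(); not real results.
matches <- data.frame(
    Season = c(1993, 1993, 1993),
    HomeTeam = c("Home A", "Home B", "Home C"),
    AwayTeam = c("Away A", "Away B", "Away C"),
    FTHG = c(0, 4, 1),
    FTAG = c(3, 1, 1)
)

# sign() of the goal difference is -1, 0 or 1; adding 2 turns it into
# an index into c("A", "D", "H") (away win, draw, home win)
matches$Result <- c("A", "D", "H")[sign(matches$FTHG - matches$FTAG) + 2]
matches$Result
## [1] "A" "H" "D"
```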