We’re going to download English Premier League results from this very useful website: https://www.football-data.co.uk/englandm.php
We’ll be using the httr package to download the data and the tidyverse collection for all kinds of manipulations:
library(httr)
library(tidyverse)
A few functions to make the code more readable.
Transform the first year of a Premier League season (e.g. 1993) to the format used in football-data.co.uk URLs (“9394”):
int_to_season <- function(x) sprintf(
  "%02d%02d",
  x %% 100,
  (x + 1) %% 100 # wrap around so that 1999 gives "9900", not "99100"
)
int_to_season(1993)
## [1] "9394"
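Seasons that straddle a century change are a useful sanity check for this kind of formatting, because the second year has to wrap around to "00". Below is a minimal sketch of such a check in base R; `season_code` is a hypothetical stand-in name used only for this illustration, and it takes the second year modulo 100:

```r
# Hypothetical helper for illustration: first year of a season -> "YYYY" code,
# with the second year wrapped modulo 100
season_code <- function(year) {
  sprintf("%02d%02d", year %% 100, (year + 1) %% 100)
}

# sprintf() is vectorised, so several seasons can be checked at once
season_code(c(1993, 1999, 2000, 2009, 2020))
## [1] "9394" "9900" "0001" "0910" "2021"
```

The 1999 and 2009 cases are the ones worth looking at: both halves must stay two digits wide.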
Building on the above function, generate the URL for a specific season:
int_to_url <- function(x) x %>%
  int_to_season() %>%
  sprintf(
    "https://www.football-data.co.uk/mmz4281/%s/E0.csv",
    .
  )
int_to_url(2020)
## [1] "https://www.football-data.co.uk/mmz4281/2021/E0.csv"
This is how we get the season’s data:
int_to_url(1993) %>%
GET() %>%
content() %>%
head()
## No encoding supplied: defaulting to UTF-8.
## Warning: Missing column names filled in: 'X8' [8], 'X9' [9], 'X10' [10],
## 'X11' [11], 'X12' [12], 'X13' [13], 'X14' [14], 'X15' [15], 'X16' [16],
## 'X17' [17], 'X18' [18], 'X19' [19], 'X20' [20], 'X21' [21], 'X22' [22],
## 'X23' [23], 'X24' [24], 'X25' [25], 'X26' [26], 'X27' [27], 'X28' [28]
## Parsed with column specification:
## cols(
## .default = col_logical(),
## Div = col_character(),
## Date = col_character(),
## HomeTeam = col_character(),
## AwayTeam = col_character(),
## FTHG = col_double(),
## FTAG = col_double(),
## FTR = col_character()
## )
## See spec(...) for full column specifications.
## Warning: 169 parsing failures.
## row col expected actual file
## 384 -- 28 columns 7 columns <raw vector>
## 385 -- 28 columns 7 columns <raw vector>
## 386 -- 28 columns 7 columns <raw vector>
## 387 -- 28 columns 7 columns <raw vector>
## 388 -- 28 columns 7 columns <raw vector>
## ... ... .......... ......... ............
## See problems(...) for more details.
## # A tibble: 6 x 28
## Div Date HomeTeam AwayTeam FTHG FTAG FTR X8 X9 X10 X11 X12
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 E0 14/0~ Arsenal Coventry 0 3 A NA NA NA NA NA
## 2 E0 14/0~ Aston V~ QPR 4 1 H NA NA NA NA NA
## 3 E0 14/0~ Chelsea Blackbu~ 1 2 A NA NA NA NA NA
## 4 E0 14/0~ Liverpo~ Sheffie~ 2 0 H NA NA NA NA NA
## 5 E0 14/0~ Man City Leeds 1 1 D NA NA NA NA NA
## 6 E0 14/0~ Newcast~ Tottenh~ 0 1 A NA NA NA NA NA
## # ... with 16 more variables: X13 <lgl>, X14 <lgl>, X15 <lgl>, X16 <lgl>,
## # X17 <lgl>, X18 <lgl>, X19 <lgl>, X20 <lgl>, X21 <lgl>, X22 <lgl>,
## # X23 <lgl>, X24 <lgl>, X25 <lgl>, X26 <lgl>, X27 <lgl>, X28 <lgl>
There are empty columns (and, sometimes, rows) in older seasons’ data, so we’d better clean it up.
As we’re only interested in the result of each game, these are the columns we need: date, home/away team, full-time home/away goals:
columns <- c(
"Date",
"HomeTeam",
"AwayTeam",
"FTHG",
"FTAG"
)
This is a function that takes a year and fetches data for the corresponding season. It also adds a Season column, which makes a combined data set spanning multiple seasons easier to use. For extra points, it suppresses warnings and drops empty rows:
get_season_data <- function(x) x %>%
  int_to_url() %>%
  GET() %>%
  {
    suppressWarnings(
      content(
        .,
        col_types = cols(),
        encoding = "UTF-8"
      )
    )
  } %>%
  mutate(
    Season = x
  ) %>%
  select(
    Season,
    all_of(
      columns
    )
  ) %>%
  filter(
    complete.cases(.)
  )
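To see what the clean-up part of this pipeline does without touching the network, here is a small sketch on a toy table. All values in `raw` are made up for illustration (they are not taken from the site); the table only mimics the shape of an older season's file, with one empty row and one empty unnamed column:

```r
library(dplyr)

columns <- c("Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG")

# Made-up stand-in for a parsed season file: two results,
# one fully empty row and one empty column named X8
raw <- tibble(
  Date     = c("01/01/00", NA, "02/01/00"),
  HomeTeam = c("Team A", NA, "Team C"),
  AwayTeam = c("Team B", NA, "Team D"),
  FTHG     = c(0, NA, 4),
  FTAG     = c(3, NA, 1),
  X8       = c(NA, NA, NA)
)

clean <- raw %>%
  mutate(Season = 1993) %>%           # tag every row with its season
  select(Season, all_of(columns)) %>% # keep only the columns we need
  filter(complete.cases(.))           # drop rows with any missing value

clean
```

Only the two complete rows survive, and the empty `X8` column is gone because it is not listed in `columns`.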
Having done the hard part, getting data for a bunch of seasons is now just a few lines:
df <- 1993:2020 %>%
  lapply(
    get_season_data
  ) %>%
  bind_rows() %>%
  write_csv(
    "EPL.csv"
  )
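The loop above parses whatever the server returns, so a failed request for one season can stop the whole run or produce a confusing parsing error. One possible way to make the download more defensive is sketched below. `fetch_text` and `safe_fetch` are hypothetical helper names introduced here for illustration and are not part of the code above; the httr calls they use (`GET`, `stop_for_status`, `content`) are standard.

```r
library(httr)

# Hypothetical helper: download a URL and return the body as text,
# raising an R error on 4xx/5xx responses
fetch_text <- function(url) {
  response <- GET(url)
  stop_for_status(response) # turn HTTP error codes into R errors
  content(response, as = "text", encoding = "UTF-8")
}

# Hypothetical wrapper: return NULL (with a message) instead of failing,
# so that one bad season does not abort a loop over many seasons
safe_fetch <- function(url) {
  tryCatch(
    fetch_text(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}
```

The returned text could then be handed to `read_csv()`, and seasons that came back as `NULL` dropped before `bind_rows()`.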