library(tidyverse)
To identify the biggest upset (underdog coming out on top), we’ll need two datasets: betting odds and actual results. Both are available as part of my sumo dataset.
Odds
odds <- read_csv(
"odds.csv",
col_types = "cdcdT"
)
I’ve been scraping sumo betting odds since 2017 May tournament:
odds %>%
count(
basho = format(
ts,
"%Y.%m"
)
) %>%
print.data.frame()
## basho n
## 1 2017.05 893
## 2 2017.07 1302
## 3 2017.09 810
## 4 2017.11 1060
## 5 2018.01 1170
## 6 2018.03 1428
## 7 2018.05 767
## 8 2018.09 894
## 9 2018.11 682
## 10 2019.01 1053
## 11 2019.03 1218
## 12 2019.05 730
## 13 2019.07 1099
## 14 2019.09 730
## 15 2019.11 1113
## 16 2020.01 1097
## 17 2020.03 645
## 18 2020.07 687
## 19 2020.08 45
## 20 2020.09 716
## 21 2020.11 913
## 22 2021.01 972
You may notice a few irregularities:
- 2018.07 is missing — I was moving from Sweden to UK and dropped the ball on this one;
- 2020.05 is missing — tournament cancelled because of COVID-19;
- 2020.07 spilled over into 2018.08 — the tournament started later than normal (still COVID-19), we’ll want to fix that.
Dataset with closing odds, with 2020.08 merged into 2020.07:
closing_odds <- odds %>%
mutate(
basho = format(
ts,
"%Y.%m"
)
) %>%
mutate_at(
vars(basho),
recode,
`2020.08` = "2020.07"
) %>%
group_by(
basho,
rikishi1,
rikishi2
) %>%
slice_max(
order_by = ts,
n = 1
) %>%
ungroup()
closing_odds %>%
count(
basho
) %>%
print.data.frame()
## basho n
## 1 2017.05 285
## 2 2017.07 256
## 3 2017.09 252
## 4 2017.11 249
## 5 2018.01 273
## 6 2018.03 267
## 7 2018.05 267
## 8 2018.09 258
## 9 2018.11 265
## 10 2019.01 264
## 11 2019.03 277
## 12 2019.05 270
## 13 2019.07 269
## 14 2019.09 264
## 15 2019.11 246
## 16 2020.01 260
## 17 2020.03 166
## 18 2020.07 245
## 19 2020.09 210
## 20 2020.11 222
## 21 2021.01 208
Make sure match-ups are unique within a tournament:
closing_odds %>%
count(
basho,
rikishi1,
rikishi2
) %>%
count(
n,
name = "nn"
)
## # A tibble: 1 x 2
## n nn
## <int> <int>
## 1 1 5273
Results
All results since 2017 May:
results <- "results.csv" %>%
read_csv(
col_types = "ciiccciciccci"
) %>%
filter(
basho >= "2017.05"
)
A few non-unique match-ups:
results %>%
count(
basho,
rikishi1_id,
rikishi2_id
) %>%
count(
n,
name = "nn"
)
## # A tibble: 2 x 2
## n nn
## <int> <int>
## 1 1 19026
## 2 2 20
This is due to a rare occurrence of two wrestlers, who’ve already faced each other, having to go again in play-offs.
Play-offs are recorded as Day 16 and can easily be filtered out:
results %>%
filter(
day <=15
) %>%
count(
basho,
rikishi1_id,
rikishi2_id
) %>%
count(
n,
name = "nn"
)
## # A tibble: 1 x 2
## n nn
## <int> <int>
## 1 1 19044
Odds + results
All odds can be matched with a result. This also does a sanity check of the win
columns (one must be 0
, the other — 1
):
closing_odds %>%
left_join(
filter(
results,
day <=15
),
by = c(
"basho",
"rikishi1" = "rikishi1_shikona",
"rikishi2" = "rikishi2_shikona"
)
) %>%
count(
rikishi1_win,
rikishi2_win
)
## # A tibble: 2 x 3
## rikishi1_win rikishi2_win n
## <int> <int> <int>
## 1 0 1 2587
## 2 1 0 2686
Finally, let’s find the winner who had longest closing odds:
closing_odds %>%
left_join(
filter(
results,
day <=15
),
by = c(
"basho",
"rikishi1" = "rikishi1_shikona",
"rikishi2" = "rikishi2_shikona"
)
) %>%
arrange(
# odds on the ultimate winner
odds1 * rikishi1_win + odds2 * rikishi2_win
) %>%
tail(1) %>%
# transpose for readability
t()
## [,1]
## rikishi1 "Hakuho"
## odds1 "1.045455"
## rikishi2 "Yoshikaze"
## odds2 "10.5"
## ts "2017-11-22 06:55:03"
## basho "2017.11"
## day "11"
## rikishi1_id "1123"
## rikishi1_rank "Y1w"
## rikishi1_result "10-1 (14-1)"
## rikishi1_win "0"
## kimarite "yorikiri"
## rikishi2_id "5967"
## rikishi2_rank "S1w"
## rikishi2_result "6-5 (6-9)"
## rikishi2_win "1"
I’ve always loved Yoshikaze and am pleased to see him as the most impressive underdog since 2017 May tournament.
Decimal odds of 10.5 is 19/2 or 950 American. Implied probability of about 9%.