4 min read

Biggest upset

library(tidyverse)

To identify the biggest upset (underdog coming out on top), we’ll need two datasets: betting odds and actual results. Both are available as part of my sumo dataset.

Odds

odds <- read_csv(
    "odds.csv",
    col_types = "cdcdT"
)

I’ve been scraping sumo betting odds since 2017 May tournament:

odds %>% 
    count(
        basho = format(
            ts,
            "%Y.%m"
        )
    ) %>% 
    print.data.frame()
##      basho    n
## 1  2017.05  893
## 2  2017.07 1302
## 3  2017.09  810
## 4  2017.11 1060
## 5  2018.01 1170
## 6  2018.03 1428
## 7  2018.05  767
## 8  2018.09  894
## 9  2018.11  682
## 10 2019.01 1053
## 11 2019.03 1218
## 12 2019.05  730
## 13 2019.07 1099
## 14 2019.09  730
## 15 2019.11 1113
## 16 2020.01 1097
## 17 2020.03  645
## 18 2020.07  687
## 19 2020.08   45
## 20 2020.09  716
## 21 2020.11  913
## 22 2021.01  972

You may notice a few irregularities:

  • 2018.07 is missing — I was moving from Sweden to UK and dropped the ball on this one;
  • 2020.05 is missing — tournament cancelled because of COVID-19;
  • 2020.07 spilled over into 2018.08 — the tournament started later than normal (still COVID-19), we’ll want to fix that.

Dataset with closing odds, with 2020.08 merged into 2020.07:

closing_odds <- odds %>% 
    mutate(
        basho = format(
            ts,
            "%Y.%m"
        )
    ) %>% 
    mutate_at(
        vars(basho),
        recode,
        `2020.08` = "2020.07"
    ) %>% 
    group_by(
        basho,
        rikishi1,
        rikishi2
    ) %>% 
    slice_max(
        order_by = ts,
        n = 1
    ) %>% 
    ungroup()
closing_odds %>% 
    count(
        basho
    ) %>% 
    print.data.frame()
##      basho   n
## 1  2017.05 285
## 2  2017.07 256
## 3  2017.09 252
## 4  2017.11 249
## 5  2018.01 273
## 6  2018.03 267
## 7  2018.05 267
## 8  2018.09 258
## 9  2018.11 265
## 10 2019.01 264
## 11 2019.03 277
## 12 2019.05 270
## 13 2019.07 269
## 14 2019.09 264
## 15 2019.11 246
## 16 2020.01 260
## 17 2020.03 166
## 18 2020.07 245
## 19 2020.09 210
## 20 2020.11 222
## 21 2021.01 208

Make sure match-ups are unique within a tournament:

closing_odds %>% 
    count(
        basho,
        rikishi1,
        rikishi2
    ) %>% 
    count(
        n,
        name = "nn"
    )
## # A tibble: 1 x 2
##       n    nn
##   <int> <int>
## 1     1  5273

Results

All results since 2017 May:

results <- "results.csv" %>% 
    read_csv(
        col_types = "ciiccciciccci"
    ) %>% 
    filter(
        basho >= "2017.05"
    )

A few non-unique match-ups:

results %>% 
    count(
        basho,
        rikishi1_id,
        rikishi2_id
    ) %>% 
    count(
        n,
        name = "nn"
    )
## # A tibble: 2 x 2
##       n    nn
##   <int> <int>
## 1     1 19026
## 2     2    20

This is due to a rare occurrence of two wrestlers, who’ve already faced each other, having to go again in play-offs.

Play-offs are recorded as Day 16 and can easily be filtered out:

results %>% 
    filter(
        day <=15
    ) %>% 
    count(
        basho,
        rikishi1_id,
        rikishi2_id
    ) %>% 
    count(
        n,
        name = "nn"
    )
## # A tibble: 1 x 2
##       n    nn
##   <int> <int>
## 1     1 19044

Odds + results

All odds can be matched with a result. This also does a sanity check of the win columns (one must be 0, the other — 1):

closing_odds %>% 
    left_join(
        filter(
            results,
            day <=15
        ),
        by = c(
            "basho",
            "rikishi1" = "rikishi1_shikona",
            "rikishi2" = "rikishi2_shikona"
        )
    ) %>% 
    count(
        rikishi1_win,
        rikishi2_win
    )
## # A tibble: 2 x 3
##   rikishi1_win rikishi2_win     n
##          <int>        <int> <int>
## 1            0            1  2587
## 2            1            0  2686

Finally, let’s find the winner who had longest closing odds:

closing_odds %>% 
    left_join(
        filter(
            results,
            day <=15
        ),
        by = c(
            "basho",
            "rikishi1" = "rikishi1_shikona",
            "rikishi2" = "rikishi2_shikona"
        )
    ) %>% 
    arrange(
        # odds on the ultimate winner
        odds1 * rikishi1_win + odds2 * rikishi2_win
    ) %>% 
    tail(1) %>% 
    # transpose for readability
    t()
##                 [,1]                 
## rikishi1        "Hakuho"             
## odds1           "1.045455"           
## rikishi2        "Yoshikaze"          
## odds2           "10.5"               
## ts              "2017-11-22 06:55:03"
## basho           "2017.11"            
## day             "11"                 
## rikishi1_id     "1123"               
## rikishi1_rank   "Y1w"                
## rikishi1_result "10-1 (14-1)"        
## rikishi1_win    "0"                  
## kimarite        "yorikiri"           
## rikishi2_id     "5967"               
## rikishi2_rank   "S1w"                
## rikishi2_result "6-5 (6-9)"          
## rikishi2_win    "1"

I’ve always loved Yoshikaze and am pleased to see him as the most impressive underdog since 2017 May tournament.

Decimal odds of 10.5 is 19/2 or 950 American. Implied probability of about 9%.