Simulation study of 5-card hands
We had a discussion in Mathematical Statistics about simulating the probability of getting a flush in a 5-card hand. We considered two ways of doing it: we could build a vector that stores the outcome of each simulated hand (Flush = TRUE or Flush = FALSE) and then take the mean of the whole vector (taking advantage of the fact that TRUE counts as 1), or we could run a loop that increments a "counter" variable whenever Flush = TRUE and divide the final count by the number of hands. I wanted to know which way was faster and by how much. For the record, we were pretty sure that building up a vector would be noticeably slower than incrementing a counter, but we also know that for loops in R are slow in general compared to languages like Fortran or C. I didn't investigate C or Fortran because I know nothing about those languages except that they're hard to use. We were also interested in seeing whether run time grows linearly as the number of repetitions increases.
flush <- function() {
  # Build the 52-card deck: Var1 is the pip value (1-13), Var2 is the suit
  suits <- c('hearts', 'diamonds', 'clubs', 'spades')
  numbers <- 1:13
  deck <- expand.grid(numbers, suits)
  # Deal 5 cards without replacement
  deal_id <- sample(1:52, 5, replace = FALSE)
  hand <- deck[deal_id, ]
  print(hand)
  pip <- sort(hand$Var1)
  print(pip)
  # Count how many cards of each suit are in the hand
  count_suits <- sort(table(hand$Var2), decreasing = TRUE)
  print(count_suits)
  # A straight spans exactly 4 pips from low card to high card
  Straight <- pip[5] - pip[1] == 4
  print(Straight)
  # A royal flush is exactly A, 10, J, Q, K (Ace is coded as 1)
  Royal <- all(pip == c(1, 10, 11, 12, 13))
  print(Royal)
  # A flush: all 5 cards share a suit, and it's not a straight or royal flush
  is_success <- count_suits[1] == 5 & !Straight & !Royal
  print(is_success)
  return(is_success)
}
flush()
## Var1 Var2
## 2 2 hearts
## 41 2 spades
## 52 13 spades
## 43 4 spades
## 22 9 diamonds
## [1] 2 2 4 9 13
##
## spades hearts diamonds clubs
## 3 1 1 0
## [1] FALSE
## [1] FALSE
## spades
## FALSE
## spades
## FALSE
This is pretty straightforward, I think. There are 52 cards in a deck, with pip values A, 2-10, J, Q, and K, in 4 suits: hearts, diamonds, spades, and clubs. I made a 52 x 2 data frame to represent the deck: Var1 is the number on the card (the pip value, with Ace coded as 1) and Var2 is the suit of the card. I took a random sample of 5 cards without replacement from the deck to form my hand, then tabulated the pip values and the suit counts to decide whether my hand is a flush. For the hand to be a flush, the most common suit has to account for all 5 cards.

To rule out straight flushes (5 consecutive cards of the same suit that isn't Ace-high) and the royal flush (A, K, Q, J, 10 of the same suit), I realized that the highest and lowest cards of a straight differ by exactly 4. For example, a hand with 2, 3, 4, 5, 6 has a high card of 6 and a low card of 2, and 6 - 2 = 4, which indicates a straight. But a hand with 2, 3, 4, 5, 10 is not a straight, and it passes the test because 10 - 2 = 8 != 4. (This span test is reliable here only because a flush can't contain duplicate pip values.) This leaves out one very important case, though: the Ace-high straight flush, aka the royal flush. A, K, Q, J, 10 has a high card of Ace, but Ace is coded as 1 in the deck, so its span is 13 - 1 = 12 != 4 and the hand passes the straight test when we don't want it to. I couldn't think of an elegant way to eliminate this case, so I basically hard-coded it as a "Royal", and if it's a Royal then it can't count as a plain flush (because it's a better hand). So we can make one random hand, and it's pretty straightforward to make many random hands using this method.
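As a quick sanity check (the helper names below are mine, not part of the simulation), the span and royal tests behave as described on a few hand-picked, sorted pip vectors:
# Illustrative check of the straight-span and royal tests
span_is_straight <- function(pip) pip[5] - pip[1] == 4  # pip assumed sorted
is_royal <- function(pip) all(pip == c(1, 10, 11, 12, 13))

span_is_straight(c(2, 3, 4, 5, 6))      # TRUE: 6 - 2 = 4, a straight
span_is_straight(c(2, 3, 4, 5, 10))     # FALSE: 10 - 2 = 8
span_is_straight(c(1, 10, 11, 12, 13))  # FALSE: 13 - 1 = 12, royal slips past
is_royal(c(1, 10, 11, 12, 13))          # TRUE: caught by the hard-coded check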
# Same simulation as above, with the print() calls removed
flush <- function() {
  suits <- c('hearts', 'diamonds', 'clubs', 'spades')
  numbers <- 1:13
  deck <- expand.grid(numbers, suits)
  deal_id <- sample(1:52, 5, replace = FALSE)
  hand <- deck[deal_id, ]
  pip <- sort(hand$Var1)
  # Count how many cards of each suit are in the hand
  count_suits <- sort(table(hand$Var2), decreasing = TRUE)
  # A straight spans exactly 4 pips from low card to high card
  Straight <- pip[5] - pip[1] == 4
  # A royal flush is exactly A, 10, J, Q, K (Ace is coded as 1)
  Royal <- all(pip == c(1, 10, 11, 12, 13))
  is_success <- count_suits[1] == 5 & !Straight & !Royal
  return(is_success)
}
many_flushes <- function(k) {
  # Simulate k hands, collect the TRUE/FALSE outcomes, and take the mean
  return(mean(replicate(k, flush())))
}
many_flushes(1000) * 100
## [1] 0.2
Around 0.2% of the hands in that simulation were flushes. This method can get really slow as the number of replications increases; a faster alternative is to keep a running "counter" variable inside a loop.
efficient_flush <- function(k) {
  # Build the deck once, outside the loop
  suits <- c('hearts', 'diamonds', 'clubs', 'spades')
  numbers <- 1:13
  deck <- expand.grid(numbers, suits)
  success <- 0
  for (i in 1:k) {
    deal_id <- sample(1:52, 5, replace = FALSE)
    hand <- deck[deal_id, ]
    pip <- sort(hand$Var1)
    # Count how many cards of each suit are in the hand
    count_suits <- sort(table(hand$Var2), decreasing = TRUE)
    Straight <- pip[5] - pip[1] == 4
    Royal <- all(pip == c(1, 10, 11, 12, 13))
    # Increment the counter instead of storing each outcome
    if (count_suits[1] == 5 && !Straight && !Royal) {
      success <- success + 1
    }
  }
  return(success / k)
}
efficient_flush(1000) * 100
## [1] 0.1
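(A quick aside: the vector method gave 0.2% and the counter method 0.1% here, but both are single noisy estimates from only 1,000 hands each, so a gap like that is expected. A minimal sketch of the run-to-run spread, with an arbitrary seed:)
# Illustrative check of run-to-run variability at k = 1000 (my addition)
set.seed(42)  # arbitrary seed for reproducibility
runs <- replicate(20, efficient_flush(1000))
summary(runs * 100)  # the percentage bounces around noticeably between runs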
The counter version uses mostly the same framework as the other method, but it counts successes with the "success" variable inside the for loop: instead of appending to a vector, it just updates a single value. (It also builds the deck once instead of rebuilding it with expand.grid() on every call to flush(), which accounts for part of the speed difference.) The theoretical probability of getting a flush is [(4 choose 1)(13 choose 5) - (4 choose 1)(10 choose 1)] / (52 choose 5) = 0.001965: choose a suit and 5 of its 13 pip values, then subtract the 10 straights within each suit (from the Ace-low A-2-3-4-5 up through the Ace-high royal 10-J-Q-K-A).
# Probability of a flush
(choose(4, 1)*choose(13,5)-choose(4,1)*choose(10,1))/choose(52,5)
## [1] 0.001965402
And the empirical probability I found with the function is:
# Simulate 1000000 hands
efficient_flush(1000000)
## [1] 0.002019
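That is close to the theoretical value. As a rough check (my addition, not part of the class discussion), the Monte Carlo standard error of the estimate is about sqrt(p(1 - p)/n):
# Rough Monte Carlo standard error check, assuming p = 0.001965
p <- 0.001965402
n <- 1e6
se <- sqrt(p * (1 - p) / n)
se                   # ~0.000044
(0.002019 - p) / se  # the estimate sits about 1.2 standard errors above p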
To find out how much more efficient the "efficient" method actually is, we can benchmark how long each one takes to run. Then we can plot the elapsed times to see whether the relationship is linear and how much the slopes differ between the methods.
library(rbenchmark)
bench <- benchmark(
  "Adding to vector" = {many_flushes(1000)},
  "Using counter" = {efficient_flush(1000)},
  replications = c(1, 10, 20, 50, 100, 150, 200, 300, 400, 500, 750, 1000),
  columns = c("test", "replications", "elapsed"))
bench2 <- benchmark(
  "Adding to vector" = {many_flushes(1000)},
  "Using counter" = {efficient_flush(1000)},
  replications = c(1, 5, 10, 20, 30, 40, 50, 60, 75, 90, 100),
  columns = c("test", "replications", "elapsed"))
We can plot the time it takes to run each method of simulation to show the trend. At any given number of replications, the ratio of the two lines shows the relative cost of the methods. At large n the ratio is about 1.9-2.0, indicating roughly a 90-100% increase in run time for the slow method over the more efficient one.
library(ggplot2)
ggplot(data = bench) +
  geom_line(aes(x = replications, y = elapsed, col = test)) +
  xlab("Number of replications *10^3") + ylab("Time elapsed (sec)") +
  ggtitle("Large scale (max n = 1000000) differences in time based on function method")
ggplot(data = bench2) +
  geom_line(aes(x = replications, y = elapsed, col = test)) +
  xlab("Number of replications *10^3") + ylab("Time elapsed (sec)") +
  ggtitle("Small scale (max n = 100000) differences in time based on function method")