19 Random sampling in R

International Standard on Auditing (ISA) 530 defines audit sampling as the application of audit procedures to less than 100% of items within a population of audit relevance such that all sampling units have a chance of selection in order to provide the auditor with a reasonable basis on which to draw conclusions about the entire population. Statistical sampling is further defined as an approach to sampling having two characteristics: random selection of the sample items, and the use of probability theory to evaluate sample results, including measurement of sampling risk.

Appendix 4 of ISA 530 further prescribes different statistical methods of sample selection. We will discuss here each type of sampling methodology used to sample records for audit.

Prerequisites

Load the tidyverse, which includes the dplyr and purrr functions used in this chapter.

19.1 Simple Random Sampling (With and without replacement)

In this method, records are selected completely at random by generating random numbers, e.g. using random number tables. Refer to Figure 19.1 for an illustration. We can replicate this method of random number generation in R, and make it reproducible by fixing the random number seed. Two functions are mainly used here, sample() and set.seed(), both already discussed in section 4.9. Since sample() takes a vector as input and returns a vector, we will instead use dplyr::slice_sample(), also discussed in section 4.9, which operates on data frames.

Figure 19.1: Illustration of Simple Random Sampling
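Before moving to data frames, the role of the seed can be seen with base R alone. The following is a minimal sketch; the vector ids is a hypothetical population of 100 record numbers, not data used elsewhere in this chapter.

```r
# Hypothetical population: record IDs 1 to 100
ids <- 1:100

# Fix the seed, then draw 5 IDs without replacement
set.seed(123)
first_draw <- sample(ids, size = 5, replace = FALSE)

# Resetting the same seed reproduces exactly the same draw
set.seed(123)
second_draw <- sample(ids, size = 5, replace = FALSE)

identical(first_draw, second_draw)  # TRUE
```

Without the second call to set.seed(), the two draws would generally differ, which is why the seed is fixed immediately before each sampling step in this chapter.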

Let’s try this on the iris data. Suppose we have to select a sample of n = 12 records, without replacement:

dat <- iris # input data
# set the seed
set.seed(123)
# sample n records
dat %>% 
  slice_sample(n = 12, replace = FALSE)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4.3 3.0 1.1 0.1 setosa
5.0 3.3 1.4 0.2 setosa
7.7 3.8 6.7 2.2 virginica
4.4 3.2 1.3 0.2 setosa
5.9 3.0 5.1 1.8 virginica
6.5 3.0 5.2 2.0 virginica
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
5.8 2.7 5.1 1.9 virginica
6.1 3.0 4.6 1.4 versicolor
6.3 3.4 5.6 2.4 virginica
5.1 2.5 3.0 1.1 versicolor

The syntax is simple. In the first step we have fixed the random number seed for reproducibility. Using slice_sample() we have selected n=12 records without replacement (replace = FALSE).

If the sample size is a proportion of the population, we use the prop argument (say prop = 0.10 for 10%) instead of n. Moreover, if sampling is with replacement, we set replace = TRUE.
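As a short sketch of both variations on the iris data (the object names samp_prop and samp_wr are chosen here only for illustration):

```r
library(dplyr)

# 10% of the rows, without replacement: 15 of the 150 rows in iris
set.seed(123)
samp_prop <- iris %>%
  slice_sample(prop = 0.10)

# 12 rows with replacement: the same record may be drawn more than once
set.seed(123)
samp_wr <- iris %>%
  slice_sample(n = 12, replace = TRUE)

nrow(samp_prop)  # 15
nrow(samp_wr)    # 12
```

Note that n and prop are alternatives; only one of them should be supplied in a single call.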

19.2 Systematic random sampling

ISA 530 defines this sampling approach as ‘Systematic selection, in which the number of sampling units in the population is divided by the sample size to give a sampling interval, for example 50, and having determined a starting point within the first 50, each 50th sampling unit thereafter is selected. Although the starting point may be determined haphazardly, the sample is more likely to be truly random if it is determined by use of a computerized random number generator or random number tables. When using systematic selection, the auditor would need to determine that sampling units within the population are not structured in such a way that the sampling interval corresponds with a particular pattern in the population.’ Refer to Figure 19.2 for an illustration.

Figure 19.2: Illustration of Systematic Random Sampling

We can replicate this approach in two steps:

Step-1: Select n as the sample size. Then compute s, which is both the sampling interval and the maximum starting point, by dividing the number of rows in the data by n and rounding down. Next, choose a random starting point, say s1, from 1:s using sample(). Finally, generate an arithmetic sequence, say rand_seq, that starts at s1, increases by s, and has n terms in total.

Step-2: Shuffle the data using slice_sample(prop = 1) and then select the rows whose positions are in rand_seq using filter().

The methodology is replicated as:

set.seed(123)
n <- 15 # sample size
s <- floor(nrow(dat)/n)
s1 <- sample(1:s, 1, replace = FALSE)
rand_seq <- seq(s1, by = s, length.out = n)
dat %>% 
  slice_sample(prop = 1) %>% 
  filter(row_number() %in% rand_seq)
  
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
7.7 3.8 6.7 2.2 virginica
6.1 2.8 4.0 1.3 versicolor
7.6 3.0 6.6 2.1 virginica
7.2 3.2 6.0 1.8 virginica
7.2 3.0 5.8 1.6 virginica
6.3 2.7 4.9 1.8 virginica
4.8 3.1 1.6 0.2 setosa
6.4 2.8 5.6 2.1 virginica
5.6 3.0 4.5 1.5 versicolor
4.9 2.5 4.5 1.7 virginica
5.9 3.2 4.8 1.8 versicolor
6.3 3.3 6.0 2.5 virginica
5.4 3.7 1.5 0.2 setosa
6.7 3.1 4.4 1.4 versicolor
5.2 3.5 1.5 0.2 setosa
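The two steps can also be wrapped in a reusable function. The sketch below is a variation, not the code above: the function name systematic_sample() is hypothetical, and it deliberately skips the shuffle so that records are taken in their existing order, which is the situation in which the ISA 530 caution about patterns in the population applies.

```r
library(dplyr)

# Hypothetical helper: systematic sample of n rows, keeping the original order
systematic_sample <- function(data, n) {
  s   <- floor(nrow(data) / n)            # sampling interval
  s1  <- sample(seq_len(s), 1)            # random start within the first interval
  idx <- seq(s1, by = s, length.out = n)  # every s-th position from s1
  data %>% slice(idx)
}

set.seed(123)
sys_samp <- systematic_sample(iris, n = 15)
nrow(sys_samp)  # 15
```

Because the largest position generated is at most n * s, which never exceeds the number of rows, the function always returns exactly n records when n is not larger than the population.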

19.3 Probability Proportionate to size (with or without replacement) a.k.a monetary unit sampling

This sampling approach is defined in ISA-530 as “a type of value-weighted selection in which sample size, selection and evaluation results in a conclusion in monetary amounts.”

Our methodology is not much different from the methodology adopted in section 19.2, except that we will now make use of the weight_by argument.

Let’s use the state.x77 data that comes with base R. Since the data is in matrix format, let’s first convert it to a data frame using as.data.frame().

dat <- as.data.frame(state.x77)

Other steps are simple.

set.seed(123)
dat %>% 
  slice_sample(n=12, weight_by = Population)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Pennsylvania 11860 4449 1.0 70.43 6.1 50.2 126 44966
Kentucky 3387 3712 1.6 70.10 10.6 38.5 95 39650
Michigan 9111 4751 0.9 70.63 11.1 52.8 125 56817
Oregon 2284 4660 0.6 72.13 4.2 60.0 44 96184
Utah 1203 4022 0.6 72.90 4.5 67.3 137 82096
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Virginia 4981 4701 1.4 70.08 9.5 47.8 85 39780
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Georgia 4931 4091 2.0 68.54 13.9 40.6 60 58073
Massachusetts 5814 4755 1.1 71.83 3.3 58.5 103 7826
Hawaii 868 4963 1.9 73.60 6.2 61.9 0 6425
New Jersey 7333 5237 1.1 70.93 5.2 52.5 115 7521
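To see what weighting by Population implies, the sketch below computes each state's share of the total population, which is its chance of being picked on the first draw, and then repeats the sampling with replacement. The names sel_prob, prob and pps_wr are introduced here only for illustration.

```r
library(dplyr)

dat <- as.data.frame(state.x77)

# Each state's share of total population = its first-draw selection weight
sel_prob <- dat %>%
  tibble::rownames_to_column("State") %>%
  mutate(prob = Population / sum(Population)) %>%
  arrange(desc(prob)) %>%
  select(State, Population, prob)

head(sel_prob, 3)

# With replacement: a large state may be selected more than once
set.seed(123)
pps_wr <- dat %>%
  slice_sample(n = 12, weight_by = Population, replace = TRUE)

nrow(pps_wr)  # 12
```

The larger the value of the weighting column, the more likely the record is to enter the sample, which is why high-value items dominate samples drawn this way.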

19.4 Stratified random sampling

Stratification is defined in ISA-530 as the process of dividing a population into sub-populations, each of which is a group of sampling units which have similar characteristics (often monetary value). Thus, stratified random sampling may imply any of the aforementioned sampling techniques applied to individual strata instead of the whole population. Refer to Figure 19.3 for an illustration.

Figure 19.3: Illustration of Stratified Random Sampling

The function dplyr::group_by() will be used here for stratification. Thereafter we can proceed with sampling as described above.

Example data: Let’s include the region in the state.x77 data using dplyr::bind_cols().

dat <- bind_cols(
  as.data.frame(state.x77),
  as.data.frame(state.region)
)

Let’s see the first 6 rows of this data:

Population Income Illiteracy Life Exp Murder HS Grad Frost Area state.region
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708 South
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432 West
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417 West
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945 South
California 21198 5114 1.1 71.71 10.3 62.6 20 156361 West
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766 West

We can check a summary of number of States per region

dat %>% 
  tibble::rownames_to_column('State') %>% # this step will not be 
                                  # used in databases without row names
  group_by(state.region) %>% 
  summarise(states = n())
## # A tibble: 4 × 2
##   state.region  states
##   <fct>          <int>
## 1 Northeast          9
## 2 South             16
## 3 North Central     12
## 4 West              13

Case-1: When the sample size is constant for all strata. Say 2 records per region.

set.seed(123)
n <- 2
dat %>% 
  tibble::rownames_to_column('State') %>% # this step will not be used in databases without row names
  group_by(state.region) %>% 
  slice_sample(n=n) %>% 
  ungroup()
State Population Income Illiteracy Life Exp Murder HS Grad Frost Area state.region
Massachusetts 5814 4755 1.1 71.83 3.3 58.5 103 7826 Northeast
New York 18076 4903 1.4 70.55 10.9 52.7 82 47831 Northeast
Delaware 579 4809 0.9 70.06 6.2 54.6 103 1982 South
North Carolina 5441 3875 1.8 69.21 11.1 38.5 80 48798 South
Indiana 5313 4458 0.7 70.88 7.1 52.9 122 36097 North Central
Minnesota 3921 4675 0.6 72.96 2.3 57.6 160 79289 North Central
Utah 1203 4022 0.6 72.90 4.5 67.3 137 82096 West
Hawaii 868 4963 1.9 73.60 6.2 61.9 0 6425 West
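A common variation is to keep the proportion, rather than the count, constant across strata. The sketch below assumes a 25% proportion per region (a figure chosen here only for illustration); slice_sample() applies prop within each group and rounds the resulting size down.

```r
library(dplyr)

dat <- bind_cols(
  as.data.frame(state.x77),
  as.data.frame(state.region)
)

set.seed(123)
strat_prop <- dat %>%
  tibble::rownames_to_column("State") %>%
  group_by(state.region) %>%
  slice_sample(prop = 0.25) %>%   # 25% of each region, rounded down
  ungroup()

# Number of states selected in each region
count(strat_prop, state.region)
```

With 9, 16, 12 and 13 states in the four regions, this selects 2, 4, 3 and 3 states respectively.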

Case-2: When the sample size or proportion differs among strata. This time let us assume that a column for the stratum is not directly available in the data. Say we have to sample:
- 20% of states having population up to 1000;
- 30% of states having population greater than 1000 but up to 5000; and
- 50% of states having population more than 5000.

In this scenario, our strategy would be to use the purrr::map2_dfr() function after splitting the data with the group_split() function.

The syntax would be:

# define proportions
props <- c(0.2, 0.3, 0.5)

# set seed
set.seed(123)

# take data
dat %>% 
  # redundant step where data has no row names
  tibble::rownames_to_column('State') %>%
  # create a column identifying the stratum
  mutate(stratum = cut(Population, c(0, 1000, 5000, max(Population)),
                      labels = c("Low", "Mid", "High"))) %>% 
  # split data into groups
  group_split(stratum) %>% 
  # sample in each group
  map2_dfr(props,
           .f = function(d, w) slice_sample(d, prop = w))

We may check the sample selected across each stratum

stratum Total Selected
Low 12 2
Mid 26 7
High 12 6
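The code behind this check is not shown above. One possible way to build such a summary, as a sketch, is to count the rows per stratum in the population and in the sample and join the two counts. The object names strata_dat, strat_samp and check are introduced here for illustration.

```r
library(dplyr)
library(purrr)

props <- c(0.2, 0.3, 0.5)

# Population with the stratum column attached
strata_dat <- as.data.frame(state.x77) %>%
  tibble::rownames_to_column("State") %>%
  mutate(stratum = cut(Population, c(0, 1000, 5000, max(Population)),
                       labels = c("Low", "Mid", "High")))

# Sample each stratum with its own proportion
set.seed(123)
strat_samp <- strata_dat %>%
  group_split(stratum) %>%
  map2_dfr(props, function(d, w) slice_sample(d, prop = w))

# Rows per stratum in the population and in the sample, side by side
check <- strata_dat %>%
  count(stratum, name = "Total") %>%
  left_join(count(strat_samp, stratum, name = "Selected"), by = "stratum")

check
```

The selected counts are the stratum sizes multiplied by the respective proportions and rounded down, so they do not depend on the seed.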

19.5 Cluster sampling

ISA 530 does not explicitly define cluster sampling. In effect, this approach samples whole strata (clusters) rather than individual records, and the techniques described above can easily be applied to select the clusters. E.g. in the sample data above, we can sample, say, 2 clusters (or regions).

Thus, our strategy would be to first sample groups from the unique available values and thereafter filter all the records belonging to those groups.

# set the seed
set.seed(123)
# sample clusters
clusters <- sample(
  unique(dat$state.region),
  size = 2
)
# filter all records in above clusters
clust_samp <- dat %>% 
  filter(state.region %in% clusters)
# check number of records
clust_samp$state.region %>% table()
## .
##     Northeast         South North Central          West 
##             9             0            12             0
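The zero counts appear because state.region is a factor and keeps the levels of the regions that were not sampled. If only the sampled clusters should be listed, one option, sketched below, is to drop the unused levels after filtering.

```r
library(dplyr)

dat <- bind_cols(
  as.data.frame(state.x77),
  as.data.frame(state.region)
)

set.seed(123)
clusters <- sample(unique(dat$state.region), size = 2)

clust_samp <- dat %>%
  filter(state.region %in% clusters) %>%
  mutate(state.region = droplevels(state.region))  # drop unsampled regions

# Only the two sampled regions remain in the table
table(clust_samp$state.region)
```

Every record of a selected cluster is retained, so the number of rows in clust_samp equals the combined size of the two sampled regions.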