19 Random sampling in R
International Standard on Auditing (ISA) 530 defines audit sampling as the application of audit procedures to less than 100% of items within a population of audit relevance such that all sampling units have a chance of selection in order to provide the auditor with a reasonable basis on which to draw conclusions about the entire population. Statistical sampling is further defined as an approach to sampling having two characteristics: random selection of samples, and the use of probability theory to evaluate sample results, including measurement of sampling risk.
Appendix 4 of ISA 530 further prescribes different statistical methods of sample selection. We discuss here each sampling methodology used to select records for audit.
Prerequisites
Load tidyverse
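A minimal sketch of this prerequisite step, assuming the tidyverse meta-package is already installed (for example via install.packages("tidyverse")). It attaches dplyr, purrr and tibble, whose functions are used in the rest of this chapter.

```r
# attach the tidyverse (dplyr, purrr, tibble, ...) used in this chapter
library(tidyverse)
```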
19.1 Simple Random Sampling (With and without replacement)
In this method, records are selected completely at random by generating random numbers, e.g. using random number tables. Refer to Figure 19.1 for an illustration. We can replicate this method of random number generation in R, and make it reproducible by fixing the random number seed. Two functions are mainly used here, sample() and set.seed(), already discussed in section 4.9. Since sample() takes a vector as input and returns a vector, we can instead use dplyr::slice_sample(), also discussed in section 4.9, which operates on data frames.
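To illustrate the point about reproducibility, here is a small sketch using only base R. The vector ids and the seed value 42 are arbitrary choices made for this illustration; any seed works as long as the same one is set before each draw.

```r
# a hypothetical population of 100 record identifiers
ids <- 1:100

# first draw of 5 records, without replacement
set.seed(42)
first_draw <- sample(ids, size = 5)

# resetting the same seed reproduces exactly the same draw
set.seed(42)
second_draw <- sample(ids, size = 5)

identical(first_draw, second_draw)
```

Because both draws start from the same seed, the comparison returns TRUE, which is what lets a reviewer re-create the auditor's sample later.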

Figure 19.1: Illustration of Simple Random Sampling
Let’s see this sampling on the iris data. Suppose we have to select a sample of n = 12 records, without replacement:
dat <- iris # input data
# set the seed
set.seed(123)
# sample n records
dat %>%
  slice_sample(n = 12, replace = FALSE)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
4.3 | 3.0 | 1.1 | 0.1 | setosa |
5.0 | 3.3 | 1.4 | 0.2 | setosa |
7.7 | 3.8 | 6.7 | 2.2 | virginica |
4.4 | 3.2 | 1.3 | 0.2 | setosa |
5.9 | 3.0 | 5.1 | 1.8 | virginica |
6.5 | 3.0 | 5.2 | 2.0 | virginica |
5.5 | 2.5 | 4.0 | 1.3 | versicolor |
5.5 | 2.6 | 4.4 | 1.2 | versicolor |
5.8 | 2.7 | 5.1 | 1.9 | virginica |
6.1 | 3.0 | 4.6 | 1.4 | versicolor |
6.3 | 3.4 | 5.6 | 2.4 | virginica |
5.1 | 2.5 | 3.0 | 1.1 | versicolor |
The syntax is simple. In the first step we fix the random number seed for reproducibility. Using slice_sample() we then select n = 12 records without replacement (replace = FALSE).
If the sample size is based on a proportion, we use prop = .10 (say 10%) instead of the n argument. Moreover, if sampling is with replacement, we use replace = TRUE.
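The two variations just described can be sketched as follows. The object names samp_prop and samp_repl are introduced only for this illustration.

```r
library(dplyr)

set.seed(123)

# sample 10% of the records, without replacement
samp_prop <- iris %>%
  slice_sample(prop = 0.10)

# sample 12 records with replacement (the same record may be drawn twice)
samp_repl <- iris %>%
  slice_sample(n = 12, replace = TRUE)

nrow(samp_prop)
nrow(samp_repl)
```

With 150 rows in iris, a proportion of 0.10 gives 15 records; when the product is not a whole number, slice_sample() rounds it down.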
19.2 Systematic random sampling
ISA 530 defines this sampling approach as ‘Systematic selection, in which the number of sampling units in the population is divided by the sample size to give a sampling interval, for example 50, and having determined a starting point within the first 50, each 50th sampling unit thereafter is selected. Although the starting point may be determined haphazardly, the sample is more likely to be truly random if it is determined by use of a computerized random number generator or random number tables. When using systematic selection, the auditor would need to determine that sampling units within the population are not structured in such a way that the sampling interval corresponds with a particular pattern in the population.’ Refer to Figure 19.2 for an illustration.

Figure 19.2: Illustration of Systematic Random Sampling
We can replicate this approach in two steps:
Step-1: Choose n as the sample size. Then compute the sampling interval, say s, by dividing the number of rows in the data by n and rounding down; s is also the maximum starting point. Next, choose a starting point from 1:s using the sample() function; call it s1. Finally, generate an arithmetic sequence, say rand_seq, that starts at s1 and increases by s at each step, with n terms in total.
Step-2: Shuffle the data using slice_sample() and select the sample using filter().
The methodology is replicated as follows:
set.seed(123)
n <- 15 # sample size
s <- floor(nrow(dat)/n)
s1 <- sample(1:s, 1, replace = FALSE)
rand_seq <- seq(s1, by = s, length.out = n)
dat %>%
  slice_sample(prop = 1) %>%
  filter(row_number() %in% rand_seq)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
7.7 | 3.8 | 6.7 | 2.2 | virginica |
6.1 | 2.8 | 4.0 | 1.3 | versicolor |
7.6 | 3.0 | 6.6 | 2.1 | virginica |
7.2 | 3.2 | 6.0 | 1.8 | virginica |
7.2 | 3.0 | 5.8 | 1.6 | virginica |
6.3 | 2.7 | 4.9 | 1.8 | virginica |
4.8 | 3.1 | 1.6 | 0.2 | setosa |
6.4 | 2.8 | 5.6 | 2.1 | virginica |
5.6 | 3.0 | 4.5 | 1.5 | versicolor |
4.9 | 2.5 | 4.5 | 1.7 | virginica |
5.9 | 3.2 | 4.8 | 1.8 | versicolor |
6.3 | 3.3 | 6.0 | 2.5 | virginica |
5.4 | 3.7 | 1.5 | 0.2 | setosa |
6.7 | 3.1 | 4.4 | 1.4 | versicolor |
5.2 | 3.5 | 1.5 | 0.2 | setosa |
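As a hedged variant, not the method shown above, the same interval logic can be applied to the records in their existing order, which is closer to the wording of the ISA 530 definition. The helper name systematic_sample() is hypothetical and introduced only for this sketch; it assumes n is not larger than the number of rows.

```r
library(dplyr)

# hypothetical helper: systematic selection on the data in its current order
systematic_sample <- function(data, n) {
  interval <- floor(nrow(data) / n)            # sampling interval
  start <- sample(seq_len(interval), 1)        # random start within first interval
  idx <- seq(start, by = interval, length.out = n)
  slice(data, idx)                             # keep every interval-th record
}

set.seed(123)
sys_samp <- systematic_sample(iris, n = 15)
nrow(sys_samp)
```

Skipping the shuffle keeps the selection tied to the original record order, so the caution in ISA 530 applies: the auditor should check that the interval does not coincide with a pattern in the population. Shuffling first, as in the code above, removes that dependence on the original order.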
19.3 Probability Proportionate to size (with or without replacement) a.k.a monetary unit sampling
This sampling approach is defined in ISA-530 as “a type of value-weighted selection in which sample size, selection and evaluation results in a conclusion in monetary amounts.”
Our methodology is not much different from that adopted in section 19.1, except that we now make use of the weight_by = argument.
Let’s use the state.x77 data that comes with base R. Since the data is in matrix format, let’s first convert it to a data frame using as.data.frame().
dat <- as.data.frame(state.x77)
Other steps are simple.
set.seed(123)
dat %>%
  slice_sample(n = 12, weight_by = Population)
(row names) | Population | Income | Illiteracy | Life Exp | Murder | HS Grad | Frost | Area |
---|---|---|---|---|---|---|---|---|
Pennsylvania | 11860 | 4449 | 1.0 | 70.43 | 6.1 | 50.2 | 126 | 44966 |
Kentucky | 3387 | 3712 | 1.6 | 70.10 | 10.6 | 38.5 | 95 | 39650 |
Michigan | 9111 | 4751 | 0.9 | 70.63 | 11.1 | 52.8 | 125 | 56817 |
Oregon | 2284 | 4660 | 0.6 | 72.13 | 4.2 | 60.0 | 44 | 96184 |
Utah | 1203 | 4022 | 0.6 | 72.90 | 4.5 | 67.3 | 137 | 82096 |
California | 21198 | 5114 | 1.1 | 71.71 | 10.3 | 62.6 | 20 | 156361 |
Virginia | 4981 | 4701 | 1.4 | 70.08 | 9.5 | 47.8 | 85 | 39780 |
Arizona | 2212 | 4530 | 1.8 | 70.55 | 7.8 | 58.1 | 15 | 113417 |
Georgia | 4931 | 4091 | 2.0 | 68.54 | 13.9 | 40.6 | 60 | 58073 |
Massachusetts | 5814 | 4755 | 1.1 | 71.83 | 3.3 | 58.5 | 103 | 7826 |
Hawaii | 868 | 4963 | 1.9 | 73.60 | 6.2 | 61.9 | 0 | 6425 |
New Jersey | 7333 | 5237 | 1.1 | 70.93 | 5.2 | 52.5 | 115 | 7521 |
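The heading mentions selection with or without replacement, while the example above samples without replacement (the default). A minimal sketch of the with-replacement variant follows; the object name pps_repl and the State column created from the row names are assumptions made for this illustration.

```r
library(dplyr)

pps_dat <- as.data.frame(state.x77) %>%
  tibble::rownames_to_column("State")   # keep State names as a regular column

set.seed(123)
pps_repl <- pps_dat %>%
  slice_sample(n = 12, weight_by = Population, replace = TRUE)

# States drawn more than once (if any) appear with n > 1
pps_repl %>%
  count(State, sort = TRUE)
```

With replacement, a very large unit can be selected several times, which is why the State names are moved into a column before sampling: repeated selections then remain easy to count.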
19.4 Stratified random sampling
Stratification is defined in ISA-530 as the process of dividing a population into sub-populations, each of which is a group of sampling units which have similar characteristics (often monetary value). Thus, stratified random sampling may imply any of the aforementioned sampling techniques applied to individual strata instead of the whole population. Refer to Figure 19.3 for an illustration.

Figure 19.3: Illustration of Stratified Random Sampling
The function dplyr::group_by() will be used here for stratification. Thereafter we can proceed with sampling as described above.
Example data: Let’s add the region to the state.x77 data using dplyr::bind_cols().
dat <- bind_cols(
  as.data.frame(state.x77),
  as.data.frame(state.region)
)
Let’s see the first 6 rows of this data:
(row names) | Population | Income | Illiteracy | Life Exp | Murder | HS Grad | Frost | Area | state.region |
---|---|---|---|---|---|---|---|---|---|
Alabama | 3615 | 3624 | 2.1 | 69.05 | 15.1 | 41.3 | 20 | 50708 | South |
Alaska | 365 | 6315 | 1.5 | 69.31 | 11.3 | 66.7 | 152 | 566432 | West |
Arizona | 2212 | 4530 | 1.8 | 70.55 | 7.8 | 58.1 | 15 | 113417 | West |
Arkansas | 2110 | 3378 | 1.9 | 70.66 | 10.1 | 39.9 | 65 | 51945 | South |
California | 21198 | 5114 | 1.1 | 71.71 | 10.3 | 62.6 | 20 | 156361 | West |
Colorado | 2541 | 4884 | 0.7 | 72.06 | 6.8 | 63.9 | 166 | 103766 | West |
We can check the number of States per region:
dat %>%
  tibble::rownames_to_column('State') %>% # this step will not be
  # used in databases without row names
  group_by(state.region) %>%
  summarise(states = n())
## # A tibble: 4 × 2
## state.region states
## <fct> <int>
## 1 Northeast 9
## 2 South 16
## 3 North Central 12
## 4 West 13
Case-1: When the sample size is constant for all strata, say 2 records per region.
set.seed(123)
n <- 2
dat %>%
  tibble::rownames_to_column('State') %>% # this step will not be used in databases without row names
  group_by(state.region) %>%
  slice_sample(n = n) %>%
  ungroup()
State | Population | Income | Illiteracy | Life Exp | Murder | HS Grad | Frost | Area | state.region |
---|---|---|---|---|---|---|---|---|---|
Massachusetts | 5814 | 4755 | 1.1 | 71.83 | 3.3 | 58.5 | 103 | 7826 | Northeast |
New York | 18076 | 4903 | 1.4 | 70.55 | 10.9 | 52.7 | 82 | 47831 | Northeast |
Delaware | 579 | 4809 | 0.9 | 70.06 | 6.2 | 54.6 | 103 | 1982 | South |
North Carolina | 5441 | 3875 | 1.8 | 69.21 | 11.1 | 38.5 | 80 | 48798 | South |
Indiana | 5313 | 4458 | 0.7 | 70.88 | 7.1 | 52.9 | 122 | 36097 | North Central |
Minnesota | 3921 | 4675 | 0.6 | 72.96 | 2.3 | 57.6 | 160 | 79289 | North Central |
Utah | 1203 | 4022 | 0.6 | 72.90 | 4.5 | 67.3 | 137 | 82096 | West |
Hawaii | 868 | 4963 | 1.9 | 73.60 | 6.2 | 61.9 | 0 | 6425 | West |
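A closely related variant, sketched here under the assumption that the same proportion applies to every stratum, replaces n with prop so that larger regions contribute more records. The 25% rate and the object names strat_dat and prop_samp are arbitrary choices for this illustration.

```r
library(dplyr)

strat_dat <- bind_cols(
  as.data.frame(state.x77),
  as.data.frame(state.region)
)

set.seed(123)
prop_samp <- strat_dat %>%
  tibble::rownames_to_column("State") %>%
  group_by(state.region) %>%
  slice_sample(prop = 0.25) %>%   # 25% of each region, rounded down
  ungroup()

# number of States selected in each region
prop_samp %>%
  count(state.region)
```

Because the proportion is applied within each group and rounded down, the regions with 9, 16, 12 and 13 States yield 2, 4, 3 and 3 records respectively.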
Case-2: When the sample size or proportion differs among strata.
This time let us assume that a column for the stratum is not directly available in the data. Suppose the following have to be sampled:
- 20% of States having Population up to 1000;
- 30% of States having Population greater than 1000 but up to 5000; and
- 50% of States having Population more than 5000.
In this scenario, our strategy is to use the purrr::map2_dfr() function after splitting the data with the group_split() function.
The syntax would be:
# define proportions
props <- c(0.2, 0.3, 0.5)
# set seed
set.seed(123)
# take data
dat %>%
  # redundant step where data has no row names
  tibble::rownames_to_column('State') %>%
  # create a column identifying the stratum
  mutate(stratum = cut(Population, c(0, 1000, 5000, max(Population)),
                       labels = c("Low", "Mid", "High"))) %>%
  # split data into groups
  group_split(stratum) %>%
  # sample in each group
  map2_dfr(props,
           .f = function(d, w) slice_sample(d, prop = w))
We may check the sample selected across each stratum
stratum | Total | Selected |
---|---|---|
Low | 12 | 2 |
Mid | 26 | 7 |
High | 12 | 6 |
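The code behind this check is not shown in the text. One possible way to build such a table is sketched below. The object names strata_dat, strat_samp and check are hypothetical, and map2() followed by bind_rows() is used here as an assumed equivalent of map2_dfr().

```r
library(dplyr)
library(purrr)

props <- c(0.2, 0.3, 0.5)

# population with the stratum column added
strata_dat <- as.data.frame(state.x77) %>%
  tibble::rownames_to_column("State") %>%
  mutate(stratum = cut(Population, c(0, 1000, 5000, max(Population)),
                       labels = c("Low", "Mid", "High")))

# stratified sample with a different proportion per stratum
set.seed(123)
strat_samp <- strata_dat %>%
  group_split(stratum) %>%
  map2(props, function(d, w) slice_sample(d, prop = w)) %>%
  bind_rows()

# compare population and sample counts per stratum
check <- strata_dat %>%
  count(stratum, name = "Total") %>%
  left_join(count(strat_samp, stratum, name = "Selected"), by = "stratum")

check
```

The counts depend only on the stratum sizes and proportions, not on the seed, so they should agree with the table above.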
19.5 Cluster sampling
ISA 530 does not explicitly define cluster sampling. In effect, this method samples whole groups (clusters) rather than individual records, and we can easily apply the techniques above to select the clusters. E.g. in the sample data above, we can sample, say, 2 clusters (or regions).
Thus, our strategy is first to sample groups from the unique available values and thereafter to keep all the records belonging to those groups.
# set the seed
set.seed(123)
# sample clusters
clusters <- sample(
  unique(dat$state.region),
  size = 2
)
# filter all records in above clusters
clust_samp <- dat %>%
  filter(state.region %in% clusters)
# check number of records
clust_samp$state.region %>% table()
## .
## Northeast South North Central West
## 9 0 12 0
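Since state.region is a factor, table() also lists the regions that were not selected, with a count of zero. A small sketch of an alternative check using dplyr::count(), which by default reports only the groups present in the sample, is given below; clust_dat and clust_count are names introduced for this illustration.

```r
library(dplyr)

clust_dat <- bind_cols(
  as.data.frame(state.x77),
  as.data.frame(state.region)
)

# sample 2 clusters (regions)
set.seed(123)
clusters <- sample(unique(clust_dat$state.region), size = 2)

# keep all records in the sampled clusters
clust_samp <- clust_dat %>%
  filter(state.region %in% clusters)

# one row per selected cluster, with its number of records
clust_count <- clust_samp %>%
  count(state.region)

clust_count
```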