19 Random sampling in R

International Standard on Auditing (ISA) 530 defines³⁰ audit sampling as the application of audit procedures to less than 100% of items within a population of audit relevance such that all sampling units have a chance of selection in order to provide the auditor with a reasonable basis on which to draw conclusions about the entire population. Statistical sampling is further defined as an approach to sampling having two characteristics: random selection of the sample items, and the use of probability theory to evaluate sample results, including measurement of sampling risk.

Appendix 4 of ISA 530 further prescribes different methods of sample selection. We will discuss here each type of sampling methodology used to sample records for audit.

Prerequisites

Load tidyverse

library(tidyverse)

19.1 Simple Random Sampling (With and without replacement)

In this method, records are selected completely at random, by generating random numbers, e.g. from random number tables. Refer to Figure 19.1 for an illustration. We can replicate this method of random number generation in R, and we can even make it reproducible by fixing the random number seed. Two functions are mainly used here, sample() and set.seed(), both already discussed in section 4.9. Since sample() takes a vector as input and returns a vector as output, we can instead use dplyr::slice_sample(), also discussed in section 4.9, which operates on data frames.
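To illustrate the vector-based approach, here is a minimal sketch using only base R. The object names first_draw and second_draw are illustrative, and the population of 150 row numbers is an assumption chosen to match the iris data used later.

```r
# Draw 12 row numbers out of 150, without replacement, after fixing the seed
set.seed(123)
first_draw <- sample(1:150, size = 12, replace = FALSE)

# Resetting the same seed reproduces exactly the same draw
set.seed(123)
second_draw <- sample(1:150, size = 12, replace = FALSE)

identical(first_draw, second_draw)  # TRUE: same seed, same sample

# The drawn row numbers can then be used to index a data frame
samp_rows <- iris[first_draw, ]
nrow(samp_rows)
```

Because sample() only returns positions, the indexing step has to be done separately, which is what slice_sample() saves us from.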

Figure 19.1: Illustration of Simple Random Sampling

Let’s see this sampling on the iris data. Suppose we have to select a sample of n = 12 records, without replacement:

dat <- iris # input data
# set the seed
set.seed(123)
# sample n records
dat %>% 
  slice_sample(n = 12, replace = FALSE)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
4.3	3.0	1.1	0.1	setosa
5.0	3.3	1.4	0.2	setosa
7.7	3.8	6.7	2.2	virginica
4.4	3.2	1.3	0.2	setosa
5.9	3.0	5.1	1.8	virginica
6.5	3.0	5.2	2.0	virginica
5.5	2.5	4.0	1.3	versicolor
5.5	2.6	4.4	1.2	versicolor
5.8	2.7	5.1	1.9	virginica
6.1	3.0	4.6	1.4	versicolor
6.3	3.4	5.6	2.4	virginica
5.1	2.5	3.0	1.1	versicolor

The syntax is simple. In the first step we fixed the random number seed for reproducibility. Then, using slice_sample(), we selected n = 12 records without replacement (replace = FALSE).

If the sample size is based on a proportion, we use prop = 0.10 (say, 10%) instead of the n argument. Moreover, if sampling is with replacement, we use replace = TRUE.
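A minimal sketch of these two variants on the iris data follows. The object names samp_prop and samp_repl are illustrative only.

```r
library(dplyr)

dat <- iris  # input data, as in the example above

set.seed(123)

# 10% of the 150 rows, without replacement (15 rows)
samp_prop <- dat %>%
  slice_sample(prop = 0.10, replace = FALSE)

# 12 rows with replacement: the same record may be drawn more than once
samp_repl <- dat %>%
  slice_sample(n = 12, replace = TRUE)

nrow(samp_prop)
nrow(samp_repl)
```

Note that n and prop are alternatives; only one of them should be supplied in a single call.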

19.2 Systematic random sampling

ISA 530 defines this sampling approach as ‘Systematic selection, in which the number of sampling units in the population is divided by the sample size to give a sampling interval, for example 50, and having determined a starting point within the first 50, each 50th sampling unit thereafter is selected. Although the starting point may be determined haphazardly, the sample is more likely to be truly random if it is determined by use of a computerized random number generator or random number tables. When using systematic selection, the auditor would need to determine that sampling units within the population are not structured in such a way that the sampling interval corresponds with a particular pattern in the population.’ Refer to Figure 19.2 for an illustration.

Figure 19.2: Illustration of Systematic Random Sampling

We can replicate this approach in two steps:

Step-1: Select n as the sample size. Then compute the sampling interval, say s, by dividing the number of rows in the data by n and rounding down. Next, choose a starting point from 1:s using the sample() function; let’s call this starting number s1. Finally, generate an arithmetic sequence, say rand_seq, starting from s1 and increasing by s, with n terms in total.

Step-2: Shuffle the data using slice_sample() and select the rows whose positions appear in rand_seq using filter().

The methodology is replicated as follows:

set.seed(123)
n <- 15 # sample size
s <- floor(nrow(dat)/n)
s1 <- sample(1:s, 1, replace = FALSE)
rand_seq <- seq(s1, by = s, length.out = n)
dat %>% 
  slice_sample(prop = 1) %>% 
  filter(row_number() %in% rand_seq)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
7.7	3.8	6.7	2.2	virginica
6.1	2.8	4.0	1.3	versicolor
7.6	3.0	6.6	2.1	virginica
7.2	3.2	6.0	1.8	virginica
7.2	3.0	5.8	1.6	virginica
6.3	2.7	4.9	1.8	virginica
4.8	3.1	1.6	0.2	setosa
6.4	2.8	5.6	2.1	virginica
5.6	3.0	4.5	1.5	versicolor
4.9	2.5	4.5	1.7	virginica
5.9	3.2	4.8	1.8	versicolor
6.3	3.3	6.0	2.5	virginica
5.4	3.7	1.5	0.2	setosa
6.7	3.1	4.4	1.4	versicolor
5.2	3.5	1.5	0.2	setosa
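The two steps can also be wrapped in a small reusable function. The sketch below is one possible way to do it: systematic_sample() is a hypothetical helper, not a dplyr function, and the row_id column is added only to make the sampling interval visible. Unlike the code above, it does not shuffle the data first, so it assumes the existing row order has no pattern that coincides with the interval.

```r
library(dplyr)

# Hypothetical helper: systematic sample of n rows from a data frame
systematic_sample <- function(data, n) {
  stopifnot(n >= 1, n <= nrow(data))
  s   <- floor(nrow(data) / n)             # sampling interval
  s1  <- sample(seq_len(s), 1)             # random start within the first interval
  idx <- seq(s1, by = s, length.out = n)   # every s-th row from the start
  slice(data, idx)
}

# Add a row identifier so the selected positions can be inspected
pop <- iris %>% mutate(row_id = row_number())

set.seed(123)
sys_samp <- systematic_sample(pop, n = 15)

nrow(sys_samp)
diff(sys_samp$row_id)  # gaps between selected rows all equal the interval
```

For 150 rows and n = 15 the interval is 10, so every gap returned by diff() is 10.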

19.3 Probability Proportionate to size (with or without replacement) a.k.a monetary unit sampling

This sampling approach is defined in ISA-530 as “a type of value-weighted selection in which sample size, selection and evaluation results in a conclusion in monetary amounts.”

Our methodology is not much different from the one adopted in section 19.1, except that we now make use of the weight_by = argument of slice_sample().

Let’s use the state.x77 data that comes with base R. Since the data is in matrix format, let’s first convert it to a data frame using as.data.frame().

dat <- as.data.frame(state.x77)

Other steps are simple.

set.seed(123)
dat %>% 
  slice_sample(n=12, weight_by = Population)

	Population	Income	Illiteracy	Life Exp	Murder	HS Grad	Frost	Area
Pennsylvania	11860	4449	1.0	70.43	6.1	50.2	126	44966
Kentucky	3387	3712	1.6	70.10	10.6	38.5	95	39650
Michigan	9111	4751	0.9	70.63	11.1	52.8	125	56817
Oregon	2284	4660	0.6	72.13	4.2	60.0	44	96184
Utah	1203	4022	0.6	72.90	4.5	67.3	137	82096
California	21198	5114	1.1	71.71	10.3	62.6	20	156361
Virginia	4981	4701	1.4	70.08	9.5	47.8	85	39780
Arizona	2212	4530	1.8	70.55	7.8	58.1	15	113417
Georgia	4931	4091	2.0	68.54	13.9	40.6	60	58073
Massachusetts	5814	4755	1.1	71.83	3.3	58.5	103	7826
Hawaii	868	4963	1.9	73.60	6.2	61.9	0	6425
New Jersey	7333	5237	1.1	70.93	5.2	52.5	115	7521
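The example above samples without replacement, which is the default. As a minimal sketch of the with-replacement variant, the code below adds replace = TRUE and counts how often each State was drawn. The names dat_pps and pps_repl are illustrative, and the row names are moved into a State column first so that repeated draws remain identifiable.

```r
library(dplyr)

# Keep State names as a regular column rather than as row names
dat_pps <- as.data.frame(state.x77) %>%
  tibble::rownames_to_column("State")

set.seed(123)
pps_repl <- dat_pps %>%
  slice_sample(n = 12, weight_by = Population, replace = TRUE)

# How often each State was drawn; larger States may appear more than once
pps_repl %>%
  count(State, sort = TRUE)
```

With replacement, a State with a large Population can be selected several times, so the number of distinct States may be smaller than 12.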

19.4 Stratified random sampling

Stratification is defined in ISA-530 as the process of dividing a population into sub-populations, each of which is a group of sampling units which have similar characteristics (often monetary value). Thus, stratified random sampling means applying any of the aforementioned sampling techniques to individual strata instead of the whole population. Refer to Figure 19.3 for an illustration.

Figure 19.3: Illustration of Stratified Random Sampling

The function dplyr::group_by() will be used here for stratification. Thereafter we can proceed with sampling as described above.

Example data: Let’s include the region in the state.x77 data using dplyr::bind_cols().

dat <- bind_cols(
  as.data.frame(state.x77),
  as.data.frame(state.region)
)

Let’s see the first 6 rows of this data:

	Population	Income	Illiteracy	Life Exp	Murder	HS Grad	Frost	Area	state.region
Alabama	3615	3624	2.1	69.05	15.1	41.3	20	50708	South
Alaska	365	6315	1.5	69.31	11.3	66.7	152	566432	West
Arizona	2212	4530	1.8	70.55	7.8	58.1	15	113417	West
Arkansas	2110	3378	1.9	70.66	10.1	39.9	65	51945	South
California	21198	5114	1.1	71.71	10.3	62.6	20	156361	West
Colorado	2541	4884	0.7	72.06	6.8	63.9	166	103766	West

We can check a summary of the number of States per region:

dat %>% 
  tibble::rownames_to_column('State') %>% # this step will not be 
                                  # used in databases without row names
  group_by(state.region) %>% 
  summarise(states = n())

## # A tibble: 4 × 2
##   state.region  states
##   <fct>          <int>
## 1 Northeast          9
## 2 South             16
## 3 North Central     12
## 4 West              13

Case-1: When the sample size is constant for all strata. Say 2 records per region.

set.seed(123)
n <- 2
dat %>% 
  tibble::rownames_to_column('State') %>% # this step will not be used in databases without row names
  group_by(state.region) %>% 
  slice_sample(n=n) %>% 
  ungroup()

State	Population	Income	Illiteracy	Life Exp	Murder	HS Grad	Frost	Area	state.region
Massachusetts	5814	4755	1.1	71.83	3.3	58.5	103	7826	Northeast
New York	18076	4903	1.4	70.55	10.9	52.7	82	47831	Northeast
Delaware	579	4809	0.9	70.06	6.2	54.6	103	1982	South
North Carolina	5441	3875	1.8	69.21	11.1	38.5	80	48798	South
Indiana	5313	4458	0.7	70.88	7.1	52.9	122	36097	North Central
Minnesota	3921	4675	0.6	72.96	2.3	57.6	160	79289	North Central
Utah	1203	4022	0.6	72.90	4.5	67.3	137	82096	West
Hawaii	868	4963	1.9	73.60	6.2	61.9	0	6425	West
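If the same proportion, rather than the same count, is required from every stratum, the only change is to pass prop instead of n. A minimal sketch, assuming a 25% sample per region (the proportion and the names dat_reg and prop_samp are illustrative):

```r
library(dplyr)

# Rebuild the example data with State and region as regular columns
dat_reg <- as.data.frame(state.x77) %>%
  tibble::rownames_to_column("State") %>%
  mutate(state.region = datasets::state.region)

set.seed(123)
prop_samp <- dat_reg %>%
  group_by(state.region) %>%
  slice_sample(prop = 0.25) %>%   # 25% of each region, rounded down
  ungroup()

# Number of States selected per region
count(prop_samp, state.region)
```

Because the proportion is rounded down within each group, regions of 9, 16, 12 and 13 States contribute 2, 4, 3 and 3 States respectively.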

Case-2: When the sample size or proportion differs among strata. This time let us assume that a column for the stratum is not directly available in the data. Say we have to sample:
- 20% of States having Population up to 1000;
- 30% of States having Population greater than 1000 but up to 5000; and
- 50% of States having Population more than 5000.

In this scenario, our strategy would be to use the purrr::map2_dfr() function after splitting the data with the group_split() function.

The syntax would be

# define proportions
props <- c(0.2, 0.3, 0.5)

# set seed
set.seed(123)

# take data
dat %>% 
  # redundant step where data has no row names
  tibble::rownames_to_column('State') %>%
  # create a column identifying the stratum
  mutate(stratum = cut(Population, c(0, 1000, 5000, max(Population)),
                      labels = c("Low", "Mid", "High"))) %>% 
  # split data into groups
  group_split(stratum) %>% 
  # sample in each group
  map2_dfr(props,
           .f = function(d, w) slice_sample(d, prop = w))

We may check the sample selected across each stratum

stratum	Total	Selected
Low	12	2
Mid	26	7
High	12	6
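One possible way to produce such a check is sketched below. It rebuilds the stratified sample and then joins the population and sample counts per stratum. The names strat_dat, strat_samp and strat_check are illustrative and do not appear in the code above.

```r
library(dplyr)
library(purrr)

props <- c(0.2, 0.3, 0.5)  # proportions for Low, Mid, High

# Population with the stratum column added
strat_dat <- as.data.frame(state.x77) %>%
  tibble::rownames_to_column("State") %>%
  mutate(stratum = cut(Population, c(0, 1000, 5000, max(Population)),
                       labels = c("Low", "Mid", "High")))

# Stratified sample with a different proportion per stratum
set.seed(123)
strat_samp <- strat_dat %>%
  group_split(stratum) %>%
  map2_dfr(props, function(d, w) slice_sample(d, prop = w))

# Population and sample counts per stratum, side by side
strat_check <- strat_dat %>%
  count(stratum, name = "Total") %>%
  left_join(count(strat_samp, stratum, name = "Selected"), by = "stratum")

strat_check
```

The Selected column equals the stratum total multiplied by its proportion, rounded down, which is how slice_sample() treats prop.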

19.5 Cluster sampling

ISA 530 does not explicitly define cluster sampling. In effect, this method samples whole groups (clusters) rather than individual records, and we can easily apply the above-mentioned techniques to sample clusters. E.g., in the sample data above, we can sample, say, 2 clusters (or regions).

Thus, our strategy is first to sample groups from the unique values of the clustering variable and then to keep all the records belonging to the sampled groups.

# set the seed
set.seed(123)
# sample clusters
clusters <- sample(
  unique(dat$state.region),
  size = 2
)
# filter all records in above clusters
clust_samp <- dat %>% 
  filter(state.region %in% clusters)
# check number of records
clust_samp$state.region %>% table()

## .
##     Northeast         South North Central          West 
##             9             0            12             0
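The zero counts appear because state.region is a factor and the unsampled regions remain as unused levels. If only the sampled clusters should be listed, one option is to drop the unused levels, as in this minimal sketch (dat_cl is an illustrative name for the same example data):

```r
library(dplyr)

# Example data: state.x77 with the region attached
dat_cl <- bind_cols(
  as.data.frame(state.x77),
  data.frame(state.region = state.region)
)

# Sample 2 clusters (regions)
set.seed(123)
clusters <- sample(unique(dat_cl$state.region), size = 2)

# Keep all records of the sampled clusters and drop the unused factor levels
clust_samp <- dat_cl %>%
  filter(state.region %in% clusters) %>%
  mutate(state.region = droplevels(state.region))

table(clust_samp$state.region)
```

The table then lists only the two sampled regions, each with all of its States.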
