27 Applying network analysis in audit/fraud detection

27.1 Use case-1: Finding network of related entities/persons - Identity Theft

The Following Content is Under Development.

Imagine a scenario when users may have multiple IDs such as mobile numbers, email ids, and say some other ID issued by a Government Department say Income Tax Department (e.g. PAN number in Indian Scenario). Using techniques mentioned in section 42, we may easily find out duplicate users, i.e. duplicates on the basis of one ID. Sometimes need arise where we have to find out network of all the duplicate users where they have changed one or two IDs but retained another. E.g. There may be a social sector scheme where any beneficiary is expected to be registered only once for getting that scheme benefits. Scheme audit(s) may require auditors to check duplicate beneficiaries using multiple IDs.

Understand this with the table 27.1.

Table 27.1: Dummy Data
ID Mobile Email PAN
1 9111111111 PANNO0000A
2 9222222222 PANNO0000A
3 9333333333 PANNO1111B
4 9444444444 PANNO2222C
5 9111111111 PANNO3333D
6 9111111111 PANNO4444E
7 9555555555 PANNO5555F
8 9666666666 PANNO5555F
9 9333333333 PANNO6666G
10 9222222222 PANNO7777H

It may be seen that out of ten persons, two with IDs 6 and 10 respectively share none of IDs out of Email, PAN and Telephone number. But if we see closely, ID-6 shares mobile number with ID-1 who in turn share PAN number with ID-2. ID-2 further shares both Email and Mobile number with ID-6 thus establishing a relation and a network between ID-6 and ID-10. This is clear in figure at 27.1. Note that we are not considering names while finding out duplicates.

knitr::include_graphics("images/canvas.png")
Network diagram of connected entities

Figure 27.1: Network diagram of connected entities

We may find these duplicates using a branch of mathematics called Graph Theory.33 We won’t be discussing any core concepts of graph theory here. There are a few packages to work with graph theory concepts in R, and we will be using igraph34 for our analysis here. Let’s load the library.

dat <- data.frame(
  MainID = 1:9,
  Name = c("A", "B", "C", "B", "E", "A", "F", "G", "H"),
  ID1 = c(11,12,13,13,14,15,16,17,17),
  ID2 = c("1a", "1b","1b", "2a", "2b", "2c", "2c", "2e", "3a"),
  ID3 = c("AB", "AB", "BC", "CD", "EF", "GH", "HI", "HI", "JK")
)
# A preview of our sample data
dat
##   MainID Name ID1 ID2 ID3
## 1      1    A  11  1a  AB
## 2      2    B  12  1b  AB
## 3      3    C  13  1b  BC
## 4      4    B  13  2a  CD
## 5      5    E  14  2b  EF
## 6      6    A  15  2c  GH
## 7      7    F  16  2c  HI
## 8      8    G  17  2e  HI
## 9      9    H  17  3a  JK

Now the complete algorithm is as under-

id_cols <- c("ID1", "ID2", "ID3")
dat %>% 
  mutate(across(.cols = all_of(id_cols), as.character)) %>% 
  pivot_longer(cols = all_of(id_cols), 
               values_drop_na = TRUE) %>% 
  select(MainID, value) %>% 
  graph_from_data_frame() %>%
  components() %>%
  pluck(membership) %>%
  stack() %>%
  set_names(c('UNIQUE_ID', 'MainID')) %>%
  right_join(dat %>% 
               mutate(MainID = as.factor(MainID)), 
             by = c('MainID'))
##   UNIQUE_ID MainID Name ID1 ID2 ID3
## 1         1      1    A  11  1a  AB
## 2         1      2    B  12  1b  AB
## 3         1      3    C  13  1b  BC
## 4         1      4    B  13  2a  CD
## 5         2      5    E  14  2b  EF
## 6         3      6    A  15  2c  GH
## 7         3      7    F  16  2c  HI
## 8         3      8    G  17  2e  HI
## 9         3      9    H  17  3a  JK

We may see that we have got unique ID of users based on all three IDs. Let us understand the algorithm used step by step.

Step-1: First we have to ensure that all the ID columns (Store names of these columns in one vector say id_cols) must be of same type. Since we had a mix of character (Alphanumeric) and numeric IDs, using dplyr::across with dplyr::mutate we can convert all the three ID columns to character type. Readers may refer to section 1.4.1 for type change, and section 14.14 for changing data type of multiple columns simultaneously using dplyr::across.

Thus, first two lines of code above correspond to this step only.

id_cols <- c("ID1", "ID2", "ID3")
dat %>%
  mutate(across(.cols = id_cols, as.character))

Step-2: Pivot all id columns to longer format so that all Ids are linked with one main ID. Now two things should be kept in mind. One that there should be a main_Id column in the data frame. If not create one using dplyr::row_number() before pivoting. Secondly, if there are NAs in any of the IDs these have to be removed while pivoting. Use argument values_drop_na = TRUE inside the tidyr::pivot_longer. Thus, this step will correspond to this line-

pivot_longer(cols = all_of(id_cols), values_drop_na = TRUE)

where - first argument data is invisibly passed through dplyr pipe i.e. %>%. Upto this step, our data frame will look like -

## # A tibble: 27 × 4
##    MainID Name  name  value
##     <int> <chr> <chr> <chr>
##  1      1 A     ID1   11   
##  2      1 A     ID2   1a   
##  3      1 A     ID3   AB   
##  4      2 B     ID1   12   
##  5      2 B     ID2   1b   
##  6      2 B     ID3   AB   
##  7      3 C     ID1   13   
##  8      3 C     ID2   1b   
##  9      3 C     ID3   BC   
## 10      4 B     ID1   13   
## # ℹ 17 more rows

Step-3: Now we need only two columns, one is mainID and another is value which is created by pivoting all ID columns. We will use select(MainID, value) for that.

Step-4: Thereafter we will create a graph object from this data (output after step-3), using igraph package. Interested readers may see how the graph object will look like, using plot() function. The output is shown in figure 27.2. However, this step is entirely optional and it may also be kept in mind that graph output of large data will be highly cluttered and may not be comprehensible at all.

dat %>% 
  mutate(across(.cols = all_of(id_cols), as.character)) %>% 
  pivot_longer(cols = all_of(id_cols), 
               values_drop_na = TRUE) %>% 
  select(MainID, value) %>% 
  graph_from_data_frame() %>%
  plot()
Plot of graph object

Figure 27.2: Plot of graph object

Step-5: This step will be a combination of three lines of codes which will number each ID based on connectivity of all components in the graph objects. Actually components will give us an object where $membership will give us unique_ids for each component in the graph.

## $membership
##  1  2  3  4  5  6  7  8  9 11 1a AB 12 1b 13 BC 2a CD 14 2b EF 15 2c GH 16 HI 
##  1  1  1  1  2  3  3  3  3  1  1  1  1  1  1  1  1  1  2  2  2  3  3  3  3  3 
## 17 2e 3a JK 
##  3  3  3  3 
## 
## $csize
## [1] 13  4 13
## 
## $no
## [1] 3

Next we have to purrr::pluck, $membership only from this object, which will return a named vector.

##  1  2  3  4  5  6  7  8  9 11 1a AB 12 1b 13 BC 2a CD 14 2b EF 15 2c GH 16 HI 
##  1  1  1  1  2  3  3  3  3  1  1  1  1  1  1  1  1  1  2  2  2  3  3  3  3  3 
## 17 2e 3a JK 
##  3  3  3  3

We can then stack this named vector into a data frame using stack and set_names

##    UNIQUE_ID MainID
## 1          1      1
## 2          1      2
## 3          1      3
## 4          1      4
## 5          2      5
## 6          3      6
## 7          3      7
## 8          3      8
## 9          3      9
## 10         1     11
## 11         1     1a
## 12         1     AB
## 13         1     12
## 14         1     1b
## 15         1     13
## 16         1     BC
## 17         1     2a
## 18         1     CD
## 19         2     14
## 20         2     2b
## 21         2     EF
## 22         3     15
## 23         3     2c
## 24         3     GH
## 25         3     16
## 26         3     HI
## 27         3     17
## 28         3     2e
## 29         3     3a
## 30         3     JK

I suggest to purposefully name second column in the output data as MainID so that it can be joined with original data frame in the last step. UNIQUE_ID in this data will give us the new column which will allocate same ID to all possible duplicates in network of three IDs.

Step-6: In the last step we have to join the data frame back to original data frame. Since the type of MainID is now factor type, we can convert type of this column in original data frame before right_join the same. Hence the final step, right_join(dat %>% mutate(MainID = as.factor(MainID)), by = c('MainID')).

27.2 Use case-2: Classification using graph theory

The Content is Under Development.

27.3 Use case-3: Finding circular transactions

The Content is Under Development.