27 Applying network analysis in audit/fraud detection
27.1 Use case-1: Finding network of related entities/persons - Identity Theft
The Following Content is Under Development.
Imagine a scenario when users may have multiple IDs such as mobile numbers, email ids, and say some other ID issued by a Government Department say Income Tax Department (e.g. PAN number in Indian Scenario). Using techniques mentioned in section 42, we may easily find out duplicate users, i.e. duplicates on the basis of one ID. Sometimes need arise where we have to find out network of all the duplicate users where they have changed one or two IDs but retained another. E.g. There may be a social sector scheme where any beneficiary is expected to be registered only once for getting that scheme benefits. Scheme audit(s) may require auditors to check duplicate beneficiaries using multiple IDs.
Understand this with the table 27.1.
ID | Mobile | PAN | |
---|---|---|---|
1 | 9111111111 | aaaa@gmail.com | PANNO0000A |
2 | 9222222222 | bbbb@gmail.com | PANNO0000A |
3 | 9333333333 | cccc@gmail.com | PANNO1111B |
4 | 9444444444 | dddd@gmail.com | PANNO2222C |
5 | 9111111111 | eeee@gmail.com | PANNO3333D |
6 | 9111111111 | ffff@gmail.com | PANNO4444E |
7 | 9555555555 | gggg@gmail.com | PANNO5555F |
8 | 9666666666 | hhhh@gmail.com | PANNO5555F |
9 | 9333333333 | iiii@gmail.com | PANNO6666G |
10 | 9222222222 | bbbb@gmail.com | PANNO7777H |
It may be seen that out of ten persons, two with IDs 6 and 10 respectively share none of IDs out of Email, PAN and Telephone number. But if we see closely, ID-6 shares mobile number with ID-1 who in turn share PAN number with ID-2. ID-2 further shares both Email and Mobile number with ID-6 thus establishing a relation and a network between ID-6 and ID-10. This is clear in figure at 27.1. Note that we are not considering names while finding out duplicates.
knitr::include_graphics("images/canvas.png")

Figure 27.1: Network diagram of connected entities
We may find these duplicates using a branch of mathematics called Graph Theory.33 We won’t be discussing any core concepts of graph theory here. There are a few packages to work with graph theory concepts in R, and we will be using igraph
34 for our analysis here. Let’s load the library.
dat <- data.frame(
MainID = 1:9,
Name = c("A", "B", "C", "B", "E", "A", "F", "G", "H"),
ID1 = c(11,12,13,13,14,15,16,17,17),
ID2 = c("1a", "1b","1b", "2a", "2b", "2c", "2c", "2e", "3a"),
ID3 = c("AB", "AB", "BC", "CD", "EF", "GH", "HI", "HI", "JK")
)
# A preview of our sample data
dat
## MainID Name ID1 ID2 ID3
## 1 1 A 11 1a AB
## 2 2 B 12 1b AB
## 3 3 C 13 1b BC
## 4 4 B 13 2a CD
## 5 5 E 14 2b EF
## 6 6 A 15 2c GH
## 7 7 F 16 2c HI
## 8 8 G 17 2e HI
## 9 9 H 17 3a JK
Now the complete algorithm is as under-
id_cols <- c("ID1", "ID2", "ID3")
dat %>%
mutate(across(.cols = all_of(id_cols), as.character)) %>%
pivot_longer(cols = all_of(id_cols),
values_drop_na = TRUE) %>%
select(MainID, value) %>%
graph_from_data_frame() %>%
components() %>%
pluck(membership) %>%
stack() %>%
set_names(c('UNIQUE_ID', 'MainID')) %>%
right_join(dat %>%
mutate(MainID = as.factor(MainID)),
by = c('MainID'))
## UNIQUE_ID MainID Name ID1 ID2 ID3
## 1 1 1 A 11 1a AB
## 2 1 2 B 12 1b AB
## 3 1 3 C 13 1b BC
## 4 1 4 B 13 2a CD
## 5 2 5 E 14 2b EF
## 6 3 6 A 15 2c GH
## 7 3 7 F 16 2c HI
## 8 3 8 G 17 2e HI
## 9 3 9 H 17 3a JK
We may see that we have got unique ID of users based on all three IDs. Let us understand the algorithm used step by step.
Step-1: First we have to ensure that all the ID columns (Store names of these columns in one vector say id_cols
) must be of same type. Since we had a mix of character (Alphanumeric) and numeric IDs, using dplyr::across
with dplyr::mutate
we can convert all the three ID columns to character type. Readers may refer to section 1.4.1 for type change, and section 14.14 for changing data type of multiple columns simultaneously using dplyr::across
.
Thus, first two lines of code above correspond to this step only.
id_cols <- c("ID1", "ID2", "ID3")
dat %>%
mutate(across(.cols = id_cols, as.character))
Step-2: Pivot all id columns to longer format so that all Ids are linked with one main ID. Now two things should be kept in mind. One that there should be a main_Id column in the data frame. If not create one using dplyr::row_number()
before pivoting. Secondly, if there are NA
s in any of the IDs these have to be removed while pivoting. Use argument values_drop_na = TRUE
inside the tidyr::pivot_longer
. Thus, this step will correspond to this line-
pivot_longer(cols = all_of(id_cols), values_drop_na = TRUE)
where - first argument data is invisibly passed through dplyr pipe i.e. %>%
. Upto this step, our data frame will look like -
## # A tibble: 27 × 4
## MainID Name name value
## <int> <chr> <chr> <chr>
## 1 1 A ID1 11
## 2 1 A ID2 1a
## 3 1 A ID3 AB
## 4 2 B ID1 12
## 5 2 B ID2 1b
## 6 2 B ID3 AB
## 7 3 C ID1 13
## 8 3 C ID2 1b
## 9 3 C ID3 BC
## 10 4 B ID1 13
## # ℹ 17 more rows
Step-3: Now we need only two columns, one is mainID
and another is value
which is created by pivoting all ID columns. We will use select(MainID, value)
for that.
Step-4: Thereafter we will create a graph object from this data (output after step-3), using igraph
package. Interested readers may see how the graph object will look like, using plot()
function. The output is shown in figure 27.2. However, this step is entirely optional and it may also be kept in mind that graph output of large data will be highly cluttered and may not be comprehensible at all.
dat %>%
mutate(across(.cols = all_of(id_cols), as.character)) %>%
pivot_longer(cols = all_of(id_cols),
values_drop_na = TRUE) %>%
select(MainID, value) %>%
graph_from_data_frame() %>%
plot()

Figure 27.2: Plot of graph object
Step-5: This step will be a combination of three lines of codes which will number each ID based on connectivity of all components in the graph objects. Actually components
will give us an object where $membership
will give us unique_ids
for each component in the graph.
## $membership
## 1 2 3 4 5 6 7 8 9 11 1a AB 12 1b 13 BC 2a CD 14 2b EF 15 2c GH 16 HI
## 1 1 1 1 2 3 3 3 3 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 3 3
## 17 2e 3a JK
## 3 3 3 3
##
## $csize
## [1] 13 4 13
##
## $no
## [1] 3
Next we have to purrr::pluck
, $membership
only from this object, which will return a named vector.
## 1 2 3 4 5 6 7 8 9 11 1a AB 12 1b 13 BC 2a CD 14 2b EF 15 2c GH 16 HI
## 1 1 1 1 2 3 3 3 3 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 3 3
## 17 2e 3a JK
## 3 3 3 3
We can then stack
this named vector into a data frame using stack
and set_names
## UNIQUE_ID MainID
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 2 5
## 6 3 6
## 7 3 7
## 8 3 8
## 9 3 9
## 10 1 11
## 11 1 1a
## 12 1 AB
## 13 1 12
## 14 1 1b
## 15 1 13
## 16 1 BC
## 17 1 2a
## 18 1 CD
## 19 2 14
## 20 2 2b
## 21 2 EF
## 22 3 15
## 23 3 2c
## 24 3 GH
## 25 3 16
## 26 3 HI
## 27 3 17
## 28 3 2e
## 29 3 3a
## 30 3 JK
I suggest to purposefully name second column in the output data as MainID
so that it can be joined with original data frame in the last step. UNIQUE_ID
in this data will give us the new column which will allocate same ID to all possible duplicates in network of three IDs.
Step-6: In the last step we have to join the data frame back to original data frame. Since the type of MainID
is now factor type, we can convert type of this column in original data frame before right_join
the same. Hence the final step, right_join(dat %>% mutate(MainID = as.factor(MainID)), by = c('MainID'))
.