39 Benford Tests/Analysis

39.1 Introduction and Historical context

Benford’s Law stands out as an analysis method for both visualizing and evaluating numerical data, especially when the focus is on detecting fraud. The law describes the frequency distribution of the first (leftmost) digit in many real-life data sets, which, counter-intuitively, is not uniform, as shown in Figure 39.2. Significant differences from the anticipated occurrence rates may signal that the data is questionable and might have been altered. For instance, eligibility for government assistance often hinges on meeting specific criteria, such as having an income below a certain level. As a result, data might be manipulated to meet these criteria. This kind of manipulation is precisely what Benford’s Law can detect, since fabricated numbers tend not to align with the expected frequency pattern outlined by the law.

The law is named after physicist Frank Benford, who worked on the theory in 1938 and published a paper titled The Law of Anomalous Numbers.37 However, the discovery came more than five decades earlier, when astronomer Simon Newcomb observed that the initial pages of logarithm table booklets were more worn than the later pages, and in 1881 published a two-page article titled Note on the Frequency of Use of the Different Digits in Natural Numbers.38

Researchers continued to work on Benford’s Law and its extensions. However, it took several decades to find a truly practical application. In the last decade of the twentieth century, Dr. Mark J. Nigrini, an accounting professor, used the law for fraud detection and analytics and developed a practical fraud application. He reviewed multiple data sources, such as sales figures, insurance claim costs, and expense reimbursement claims, and studied the detection of overstatements and understatements of financial figures. His research confirmed the law’s usefulness to fraud examiners and auditors in accounting engagements.

His reasoning is that if somebody tries to falsify, say, their tax return, they will invariably have to invent some data. In doing so, people tend to use too many numbers starting with mid-range digits (5, 6, 7) and thus not enough numbers starting with 1.


Figure 39.1: (L to R) Frank Benford, Simon Newcomb, and Mark Nigrini (Source: Wiki)

39.2 Benford’s Law, properties and extensions

39.2.1 Law of first digit

When considering the likelihood of any digit being in the first position (from the left), our initial assumption might be a simple one-in-nine scenario, following a uniform distribution. However, this notion was challenged by Canadian-American astronomer Simon Newcomb in 1881, who noticed unusual wear patterns in logarithmic tables: while casually flipping through a logarithmic tables booklet, he discerned that the initial pages exhibited more wear and tear than their later counterparts.

Subsequently, Frank Benford conducted a comprehensive analysis of 20 diverse datasets encompassing river sizes, chemical compound weights, population data, and more. His findings revealed a successive diminishment in probability from digit 1 to 9. In essence, the probability of digit 1 occurring in the initial position is the highest, while that of digit 9 is the lowest.

Mathematically, Benford’s Law or Law of first digits states that the probability of any digit in first place should follow the equation (39.1).

\[\begin{equation} P(d_i) = \log_{10}\left(1 + \frac{1}{d_i}\right) \tag{39.1} \end{equation}\]
  • Where \(d_i\) ranges from \(1\) to \(9\).

The probabilities, when plotted, generate the chart depicted in Figure 39.2.
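
Equation (39.1) is simple to evaluate directly; a minimal base-R sketch (variable names are illustrative):

```r
# First-digit probabilities under Benford's Law, equation (39.1)
benford_p <- log10(1 + 1 / (1:9))
round(benford_p, 4)
# probabilities diminish monotonically from digit 1 (~0.3010) to digit 9 (~0.0458)
```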


Figure 39.2: Diminishing Probabilities of First Digits - Benford Law

To test the proposed law, Benford analysed 20 different datasets and observed that nearly all follow the distribution in equation (39.1).

Let us also try to see whether the law holds by analysing six different datasets included in the R package benford.analysis; we will discuss the package in detail in section 39.5. The six datasets are listed in Table 39.1.

Table 39.1: List of six datasets for testing Benford Analysis
Item Title Column
census.2000_2010 Population data - US - 2000 and 2010 pop.2000
census.2009 Population data of Towns and Cities of the US - 2009 pop.2009
corporate.payment Corporate payments of a West Coast utility company - 2010 Amount
lakes.perimeter Perimeter of lakes around the world perimeter.km
sino.forest Financial Statements of Sino Forest Corporation’s 2010 Report value
taxable.incomes.1978 Taxable Income 1978 taxIncomes

The results of Benford’s law of first digits on these six datasets have been calculated and are shown in Table 39.3. It can be seen that the actual frequencies of first digits in these six datasets follow Benford’s Law. We can also plot the actual frequencies to inspect the results visually. The actual frequencies are plotted in Figure 39.3, and it may be seen that they largely follow Benford’s Law.

Table 39.3: Results of first order tests on six datasets
digits Benford Census 2000_2010 Census 2009 Corporate Payment Lakes Perimeter Sino Forest Taxable Incomes 1978
1 0.3010300 0.3092126 0.2941207 0.3175548 0.1508888 0.2992228 0.3278721
2 0.1760913 0.1797896 0.1814547 0.1611007 0.0687752 0.1606218 0.2140886
3 0.1249387 0.1271916 0.1200472 0.1101452 0.2170936 0.1256477 0.1235673
4 0.0969100 0.0975454 0.0946743 0.0828655 0.1818372 0.0906736 0.0895397
5 0.0791812 0.0656678 0.0799118 0.1016301 0.1309577 0.0829016 0.0722473
6 0.0669468 0.0656678 0.0702240 0.0602811 0.0930143 0.0699482 0.0521491
7 0.0579919 0.0541919 0.0597673 0.0498209 0.0682885 0.0518135 0.0411117
8 0.0511525 0.0548295 0.0534625 0.0503666 0.0502118 0.0699482 0.0393606
9 0.0457575 0.0459037 0.0463376 0.0662351 0.0389329 0.0492228 0.0400637

Figure 39.3: Distribution of first digit frequencies in six datasets

39.2.2 Scale Invariance

Later, in 1961, Roger Pinkham showed that the law is invariant to scaling.39 By scale invariance, he showed that the law is invariant to the units of measurement. In other words, the law still holds if we convert from one unit to another. For example, whether price or amount figures are measured in USD or in INR, or lengths in kilometres or miles, the digit frequencies still follow Benford’s Law.

Let us check this on one of the six datasets mentioned above, namely census.2009. This dataset contains population figures for towns and cities of the United States as of July 2009. We can see that the first digit frequencies follow Benford’s Law/Pinkham’s corollary in Figure 39.4. The left plot shows frequencies on the original data, whereas the right plot shows them on randomly scaled data.


Figure 39.4: First Digit Analysis on US Census 2009 data (Left) and Scaled Data (Right)

Figure 39.4 (Left) shows that the law holds for the data. Let us also test Pinkham’s corollary on this data. For this, let us multiply all the population figures by a random positive number. Figure 39.4 (Right) makes it clear that the law still holds after scaling.
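
Scale invariance can also be illustrated with a quick simulation. This sketch uses synthetic log-uniform data (which conforms to Benford’s Law) rather than the census data, and the 2.54 multiplier is an arbitrary example:

```r
set.seed(7)
x <- 10^runif(50000, 0, 5)       # synthetic data spanning five decades; Benford-conforming
first_digit <- function(v) floor(v / 10^floor(log10(v)))
benford_p <- log10(1 + 1 / (1:9))
p_raw    <- prop.table(table(factor(first_digit(x),        levels = 1:9)))
p_scaled <- prop.table(table(factor(first_digit(x * 2.54), levels = 1:9)))  # e.g. inches to cm
# both columns stay close to the Benford probabilities
round(cbind(benford = benford_p, raw = as.numeric(p_raw), scaled = as.numeric(p_scaled)), 3)
```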

39.2.3 First two digits

Nigrini’s contributions gained widespread recognition among scholars and practitioners, highlighting the applicability of Benford’s Law as a valuable forensic accounting and auditing tool across various datasets, particularly in the financial domain. Theodore P. Hill40 further extended the scope of the law, demonstrating its validity beyond just the first digit to encompass other digits as well. Hill’s work expanded the utility of Benford’s Law, affirming its effectiveness in detecting irregularities and patterns not only in leading digits but throughout numerical sequences.

The formula for second significant digit can be written down in equation (39.2).

\[\begin{equation} P(d_i) = \sum_{k = 1}^{9}\log_{10}\left(1 + \frac{1}{10k + d_i}\right)\;;\; d_i = 0,1,\ldots,9 \tag{39.2} \end{equation}\]
  • where \(k\) represents the first digit,
  • \(d_i\) represents the second digit.

The probabilities have been calculated, as depicted in Table 39.5. Each cell gives the probability of occurrence of a two-digit combination, with the first digit in rows and the second digit in columns. We may also verify that the row totals, which give the probability of occurrence of the first digit, correspond to Benford’s Law of the first digit. For example, the probability of the first two digits being 10 is the highest, at 4.14%.

Table 39.5: First and Second Digit distributions
Second Significant Digit
First Digit 0 1 2 3 4 5 6 7 8 9 First Digit Freq
1 4.14% 3.78% 3.48% 3.22% 3.00% 2.80% 2.63% 2.48% 2.35% 2.23% 30.10%
2 2.12% 2.02% 1.93% 1.85% 1.77% 1.70% 1.64% 1.58% 1.52% 1.47% 17.61%
3 1.42% 1.38% 1.34% 1.30% 1.26% 1.22% 1.19% 1.16% 1.13% 1.10% 12.49%
4 1.07% 1.05% 1.02% 1.00% 0.98% 0.95% 0.93% 0.91% 0.90% 0.88% 9.69%
5 0.86% 0.84% 0.83% 0.81% 0.80% 0.78% 0.77% 0.76% 0.74% 0.73% 7.92%
6 0.72% 0.71% 0.69% 0.68% 0.67% 0.66% 0.65% 0.64% 0.63% 0.62% 6.69%
7 0.62% 0.61% 0.60% 0.59% 0.58% 0.58% 0.57% 0.56% 0.55% 0.55% 5.80%
8 0.54% 0.53% 0.53% 0.52% 0.51% 0.51% 0.50% 0.50% 0.49% 0.49% 5.12%
9 0.48% 0.47% 0.47% 0.46% 0.46% 0.45% 0.45% 0.45% 0.44% 0.44% 4.58%
Second Digit Freq 11.97% 11.39% 10.88% 10.43% 10.03% 9.67% 9.34% 9.04% 8.76% 8.50% 100.00%
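
The entries of Table 39.5 can be reproduced from the terms of equation (39.2) with an outer product over first and second digits; a base-R sketch:

```r
# P(first two digits = 10k + d) = log10(1 + 1/(10k + d))
two_digit_p <- outer(1:9, 0:9, function(k, d) log10(1 + 1 / (10 * k + d)))
dimnames(two_digit_p) <- list(first = 1:9, second = 0:9)
round(rowSums(two_digit_p), 4)   # margins reproduce the first digit law
round(colSums(two_digit_p), 4)   # second digit frequencies of equation (39.2)
```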

The law of the second digit combined with the original Benford’s Law of the first digit thus gives us the law of first two digits. We can verify it on the census.2009 data. The resultant plot in Figure 39.5 shows that the law of first two digits also holds.


Figure 39.5: Law holds for first two digits as well

39.2.4 Second order test

Nigrini and Miller, in 2009,41 introduced another advanced test based on Benford’s Law. The test states that:

Let \(x_1\), …, \(x_N\) be a data set comprising \(N\) observations, and let \(y_1\), …, \(y_N\) be the observations \(x_i\) arranged in ascending order. Then, for many natural data sets, and for large \(N\), the digits of the differences between adjacent observations \(y_{i+1} - y_i\) are close to Benford’s Law. Large deviations from Benford’s Law indicate an anomaly that should be investigated.

So, the steps may be listed as

  • Sort the data from smallest to largest.
  • Calculate the \(N-1\) differences of consecutive observations.
  • Apply Benford’s Law to these differences.
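
The three steps above can be sketched in base R; synthetic log-normal data stands in for a real dataset here:

```r
# Second order test: apply Benford's Law to differences of sorted data
set.seed(42)
x <- rlnorm(5000, meanlog = 5, sdlog = 2)        # synthetic "natural" data
y <- sort(x)                                      # step 1: sort ascending
d <- diff(y)                                      # step 2: N-1 consecutive differences
d <- d[d > 0]                                     # guard against ties
first_digit <- function(v) floor(v / 10^floor(log10(v)))
obs <- prop.table(table(factor(first_digit(d), levels = 1:9)))  # step 3
round(cbind(observed = as.numeric(obs), benford = log10(1 + 1 / (1:9))), 3)
```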

Nigrini showed that these digits are expected to closely follow the frequencies of Benford’s Law. Using four different datasets, he showed that this test can detect (i) anomalies occurring in the data, (ii) whether the data has been rounded, and (iii) the use of fake or ‘statistically generated’ data in place of actual (transactional) data.

39.2.5 Summation Test

The summation test, another second order test, looks for excessively large numbers in a dataset: numbers that are large compared to the norm for that data. The test was also proposed by Nigrini42 and is based on the fact that the sums of all numbers in a Benford distribution within each first-two digits group (10, 11, 12, … 99) should be the same. Therefore, for each of the 90 first-two digits groups, the sum proportion should be equal, i.e. 1/90 or about 0.011. Spikes, if any, indicate the presence of some large single numbers or sets of numbers.
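
A base-R sketch of the summation test idea, again on synthetic Benford-conforming data (the package implementation is covered later):

```r
# Summation test: group sums by first-two digits; each proportion should be ~1/90
set.seed(1)
x <- 10^runif(20000, 2, 6)                        # synthetic Benford-conforming amounts
first_two <- function(v) floor(v / 10^(floor(log10(v)) - 1))
grp_sum <- tapply(x, factor(first_two(x), levels = 10:99), sum)
grp_sum[is.na(grp_sum)] <- 0                      # empty groups contribute zero
prop <- grp_sum / sum(x)
range(prop)       # all close to 1/90 ~ 0.0111; spikes would flag large values
```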

In the next section, we will see how to implement all these tests through R.

39.2.6 Limitations of Benford Tests

Benford’s Law may not hold in the following circumstances:

  1. When the dataset comprises assigned numbers, such as cheque numbers, invoice numbers, telephone numbers, PIN codes, etc.
  2. Numbers that may be influenced by human choice, e.g. ATM withdrawals.
  3. Where amounts have a lower bound, an upper bound, or both, e.g. passengers on board an airplane, hourly wage rates, etc.
  4. When the count of transactions is less than 500.

Before carrying out analytics, let us also see the evaluation metrics that will help us evaluate the goodness of fit of data to Benford’s Law. Three statistics are commonly used.

39.3 Goodness of fit metrics

In Table 39.3 we saw that digit frequencies largely followed Benford’s Law in six different datasets. However, to evaluate how close the actual distribution is to the theoretical one, we need to assess the fit on some metrics. Here we will use the following three metrics.

39.3.1 Chi-square statistic

The first of these tests uses the chi-square statistic. This statistic tests the statistical significance of the whole distribution of observed frequencies of the first digit (or first two digits) against their expected frequencies under Benford’s Law (BL). The null hypothesis states that the digits follow Benford’s Law. The mathematical formula is,

\[\begin{equation} \chi^2 = \sum_{i=1}^{9} \frac{(O_i - E_i)^2}{E_i} \tag{39.3} \end{equation}\]

where -

  • \(O_i\) is the observed frequency of the i-th digit.
  • \(E_i\) is the expected frequency of the i-th digit predicted by Benford’s Law.

This calculated chi-square statistic is compared to a critical value. The critical value for the chi-square test comes from a chi-square distribution, available in any statistics textbook43. For the first digit test and the first two digits test, the critical values are reproduced in Table 39.7.

Table 39.7: Critical values for Chi-Square Test
First Digit Test Two Digit Test
Degrees of Freedom 8 89
10% 13.362 106.469
5% 15.507 112.022
2.5% 17.535 116.989
1% 20.090 122.942
0.1% 26.125 135.978

To check goodness of fit, we compare the calculated \(\chi^2\) statistic with these critical values. If the observed value is above the critical value, we conclude that our initial hypothesis that the data follows BL should be rejected; simply put, the data does not conform to Benford’s Law/distribution.

For example, in the census.2009 data the chi-square statistic works out to 17.524, which is less than the 2.5% critical value of 17.535. Thus, at the 2.5% significance level we cannot reject the hypothesis that the census.2009 data follows BL (first digit law), although at the 5% level (critical value 15.507) we would.
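
This figure can be reproduced directly from equation (39.3), using the observed first-digit counts of census.2009 (the counts appear in the getBfd() output in section 39.5):

```r
# Chi-square statistic for census.2009 first digits, equation (39.3)
observed <- c(5738, 3540, 2342, 1847, 1559, 1370, 1166, 1043, 904)
n        <- sum(observed)                  # 19509 records
expected <- n * log10(1 + 1 / (1:9))       # expected counts under Benford's Law
chi_sq   <- sum((observed - expected)^2 / expected)
p_value  <- pchisq(chi_sq, df = 8, lower.tail = FALSE)
round(c(chi_sq = chi_sq, p_value = p_value), 4)   # ~17.524 and ~0.0251
```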

39.3.2 Z-score

The Z-statistic checks whether an individual digit’s proportion significantly differs from the Benford’s Law distribution. Mathematically, the Z-statistic considers the absolute magnitude of the difference between the actual and expected proportions, the size of the dataset, and the expected proportion.

\[\begin{equation} Z = \frac{(\lvert p - p_0\rvert) - (\frac{1}{2n})}{\sqrt{\frac{p_0(1-p_0)}{n}}} \tag{39.4} \end{equation} \]

where -

  • \(p\) is the observed frequency of the leading digits in the dataset.
  • \(p_0\) is the expected frequency under Benford’s Law.
  • \(n\) is the number of records

In equation (39.4), the last term in the numerator, \(\frac{1}{2n}\), is a continuity correction term and is used only when it is smaller than the first term in the numerator. Mark Nigrini proposed that if the value of the Z-statistic exceeds the critical value of 1.96, the null hypothesis is rejected at the 5% significance level. Note that the null hypothesis is the same as before: the digits follow Benford’s Law.

If the significance levels are 1% or 10%, the corresponding critical values are 2.57 and 1.64 respectively.
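
Equation (39.4) translates into a small helper function; here it is applied to the digit-1 proportion of census.2009 from Table 39.3 (the function name is illustrative):

```r
# Z-statistic of equation (39.4); the continuity correction 1/(2n)
# is applied only when it is smaller than |p - p0|
z_benford <- function(p, p0, n) {
  num <- abs(p - p0)
  cc  <- 1 / (2 * n)
  if (cc < num) num <- num - cc
  num / sqrt(p0 * (1 - p0) / n)
}

# digit 1 of census.2009: observed 0.2941207 vs expected log10(2), n = 19509
z1 <- z_benford(p = 0.2941207, p0 = log10(2), n = 19509)
round(z1, 3)
```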

39.3.3 Mean absolute deviation

Another statistic, Mean Absolute Deviation, sometimes referred to as M.A.D., measures the absolute deviations of observed frequencies from theoretical ones. The mathematical formula is given in equation (39.5).

\[\begin{equation} MAD = \frac{1}{9} \sum_{i=1}^{9} |O_i - E_i| \tag{39.5} \end{equation}\]
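
Equation (39.5) is a one-liner in R; applying it to the census.2009 first-digit proportions from Table 39.3 reproduces the MAD value reported by the package in section 39.5 (the function name is illustrative):

```r
# Mean Absolute Deviation from Benford's first-digit frequencies, equation (39.5)
mad_benford <- function(obs_prop) {
  mean(abs(obs_prop - log10(1 + 1 / (1:9))))
}

census_2009_prop <- c(0.2941207, 0.1814547, 0.1200472, 0.0946743, 0.0799118,
                      0.0702240, 0.0597673, 0.0534625, 0.0463376)
mad_benford(census_2009_prop)   # ~0.00312: "close conformity" per Table 39.9
```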

As there are no objective critical scores for the absolute deviations, the critical values prescribed by Mark J. Nigrini are given in Table 39.9 below.

Table 39.9: Critical Scores for MAD test
First Digits First-Two Digits
0.000 to 0.006 Close conformity 0.000 to 0.012 Close conformity
0.006 to 0.012 Acceptable conformity 0.012 to 0.018 Acceptable conformity
0.012 to 0.015 Marginally acceptable conformity 0.018 to 0.022 Marginally acceptable conformity
above 0.015 Nonconformity above 0.022 Nonconformity

39.3.4 Other descriptive Statistics

If the data follows Benford’s Law, the statistics of the mantissae (the fractional parts of \(\log_{10}\) of the numbers) should be close to those shown in Table 39.10, as suggested by Mark Nigrini.

Table 39.10: Ideal Statistics for data that follows Benford’s Law
Statistic Value
Mean 0.5
Variance 1/12 (0.08333…)
Ex. Kurtosis -1.2
Skewness 0

39.4 Important

Benford’s Law analysis serves as a powerful tool in uncovering potential irregularities in datasets, but it’s crucial to note that deviations from this statistical phenomenon don’t always signify fraudulent activities. While it highlights notable discrepancies between expected and observed frequencies of digits in naturally occurring datasets, these variations might stem from various legitimate factors such as data entry errors, fluctuations in processes, or different sources of data. Understanding that Benford’s Law offers a signal rather than a definitive confirmation of fraud allows for a more nuanced interpretation, encouraging further investigation to discern the true nature behind these deviations.

Conversely, just because a dataset adheres to Benford’s Law, it doesn’t guarantee the absence of fraud. While conformity to this statistical principle generally suggests consistency within the data, sophisticated fraudsters might deliberately manipulate information to mimic expected distributions, masking their illicit activities. Therefore, while adherence to Benford’s Law might lessen suspicion, it doesn’t serve as an absolute assurance against fraudulent behavior.

Benford’s Law acting as a warning signal indicates potential irregularities in the numbers. It’s vital to dive deeper and investigate why these figures seem odd. Further scrutiny helps differentiate between a minor data hiccup and a potentially significant issue. This additional examination might mean cross-checking other data, validating records, or engaging with those connected to the information. This thorough approach is crucial for unraveling the story behind these uncommon figures.

39.5 Practical approach in R

As already stated, we will use the package benford.analysis for carrying out Benford’s Law analytics in R. Let us load it.
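
Loading the package is a standard library() call (assuming it has already been installed):

```r
# install.packages("benford.analysis")   # one-time installation, if needed
library(benford.analysis)
```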

This package, developed by Carlos Cinelli, provides tools that make it easier to validate data using Benford’s Law. As the package author himself states, the main purpose of the package is to identify suspicious data that need further verification; it should always be kept in mind that these analytics only provide red-flagged transactions that should be validated further.

Apart from useful functions, the package also loads some default datasets, especially those used by Frank Benford while proposing his law. Let us load the census 2009 data containing the population of towns and cities of the United States, as of July 2009.

data("census.2009")

Let us view the top 6 rows of the data.

head(census.2009)
##     state             town pop.2009
## 1 Alabama   Abbeville city     2930
## 2 Alabama  Adamsville city     4782
## 3 Alabama     Addison town      709
## 4 Alabama       Akron town      433
## 5 Alabama   Alabaster city    29861
## 6 Alabama Albertville city    20115

In fact, this contains 19509 records.

Problem Statement: Let us test Benford’s Law on the 2009 population data and see whether the data conforms to Benford’s Law.

The main function benford() takes a vector of values to be tested as input and creates an output of the special class benford. The syntax is

benford(data, number.of.digits=2)

where-

  • data is the numeric vector on which the analysis has to be performed.
  • number.of.digits is the number of digits on which the analysis has to be performed. The default value is 2.
census_first_digit <- benford(census.2009$pop.2009, number.of.digits = 1)

The above syntax creates the census_first_digit object, which stores various useful pieces of information for Benford analytics. We may view its summary -

summary(census_first_digit)
##                   Length Class      Mode     
## info               4     -none-     list     
## data               4     data.table list     
## s.o.data           2     data.table list     
## bfd               13     data.table list     
## mantissa           2     data.table list     
## MAD                1     -none-     numeric  
## MAD.conformity     1     -none-     character
## distortion.factor  1     -none-     numeric  
## stats              2     -none-     list

Let us also print the object to see what all is stored therein.

print(census_first_digit)
## 
## Benford object:
##  
## Data: census.2009$pop.2009 
## Number of observations used = 19509 
## Number of obs. for second order = 7950 
## First digits analysed = 1
## 
## Mantissa: 
## 
##    Statistic  Value
##         Mean  0.503
##          Var  0.084
##  Ex.Kurtosis -1.207
##     Skewness -0.013
## 
## 
## The 5 largest deviations: 
## 
##   digits absolute.diff
## 1      1        134.79
## 2      2        104.64
## 3      3         95.43
## 4      6         63.94
## 5      8         45.07
## 
## Stats:
## 
##  Pearson's Chi-squared test
## 
## data:  census.2009$pop.2009
## X-squared = 17.524, df = 8, p-value = 0.0251
## 
## 
##  Mantissa Arc Test
## 
## data:  census.2009$pop.2009
## L2 = 4.198e-05, df = 2, p-value = 0.4409
## 
## Mean Absolute Deviation (MAD): 0.003119261
## MAD Conformity - Nigrini (2012): Close conformity
## Distortion Factor: 0.7404623
## 
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!

Results of the chi-square test, MAD, etc. are printed along with the top deviations. The MAD value of 0.003 shows close conformity with Benford’s Law. The chi-square statistic of 17.524 is slightly greater than the 5% critical value of 15.507. In the second example we will see that the output of the print command on a benford object can be further customised using its other arguments.

Let us also visualise the results, using the plot command to generate the plots.

plot(census_first_digit)

Figure 39.6: Benford Analysis Results of Census 2009 Data

We can see that by default five charts are printed.

  1. Digits distribution
  2. Second Order Test digit distribution
  3. Summation test - digit distribution
  4. Chi-Square differences
  5. Summation differences

Similarly, in the second example we will see how to customise the plot outputs.

We can see that the first digits in the census 2009 data follow Benford’s Law closely.

39.5.1 Other Useful functions in package

You may be wondering whether we have to depend upon the print function every time to get analytical insights out of the created object. In fact, there are several other functions in this package which are very useful while carrying out risk analysis through Benford’s Law.

  • chisq: Gets the chi-squared test of a benford object. Takes a benford object as input.
  • duplicatesTable: Shows the duplicates of the data. Similarly, takes a benford object as input.
  • extract.digits: Extracts the leading digits from the data. Takes data as input. This is useful while carrying out analysis manually.
  • getBfd: Gets the statistics of the first digits of a benford object. E.g.
getBfd(census_first_digit)
##    digits  data.dist data.second.order.dist benford.dist
##     <int>      <num>                  <num>        <num>
## 1:      1 0.29412066             0.55811321   0.30103000
## 2:      2 0.18145471             0.15471698   0.17609126
## 3:      3 0.12004716             0.08968553   0.12493874
## 4:      4 0.09467425             0.05761006   0.09691001
## 5:      5 0.07991184             0.04364780   0.07918125
## 6:      6 0.07022400             0.03308176   0.06694679
## 7:      7 0.05976729             0.02553459   0.05799195
## 8:      8 0.05346250             0.01987421   0.05115252
## 9:      9 0.04633759             0.01773585   0.04575749
##    data.second.order.dist.freq data.dist.freq benford.dist.freq
##                          <num>          <num>             <num>
## 1:                        4437           5738         5872.7942
## 2:                        1230           3540         3435.3644
## 3:                         713           2342         2437.4298
## 4:                         458           1847         1890.6174
## 5:                         347           1559         1544.7469
## 6:                         263           1370         1306.0649
## 7:                         203           1166         1131.3649
## 8:                         158           1043          997.9346
## 9:                         141            904          892.6829
##    benford.so.dist.freq data.summation abs.excess.summation difference
##                   <num>          <num>                <num>      <num>
## 1:            2393.1885       51237849             29880783 -134.79419
## 2:            1399.9255       33272136             11915070  104.63563
## 3:             993.2630       22810354              1453288  -95.42981
## 4:             770.4346       15763499              5593567  -43.61744
## 5:             629.4909       15799838              5557228   14.25307
## 6:             532.2270       14527377              6829689   63.93508
## 7:             461.0360       11371006              9986060   34.63511
## 8:             406.6626       18814056              2543010   45.06544
## 9:             363.7720        8617475             12739591   11.31712
##    squared.diff absolute.diff
##           <num>         <num>
## 1:    3.0938378     134.79419
## 2:    3.1870315     104.63563
## 3:    3.7362508      95.42981
## 4:    1.0062752      43.61744
## 5:    0.1315102      14.25307
## 6:    3.1297790      63.93508
## 7:    1.0603039      34.63511
## 8:    2.0350972      45.06544
## 9:    0.1434744      11.31712
  • getSuspects: Gets the ‘suspicious’ observations according to Benford’s Law. Takes both data and a benford object as inputs. Example in the second case study.
  • MAD: Gets the MAD of a benford object.
  • suspectsTable: Shows the first digits ordered by the main discrepancies from Benford’s Law. Notice the difference from getSuspects.

39.5.2 Example-2: Corporate payments data

Problem Statement-2: Let us analyse red flags in the 2010 payments data (189470 records) of a division of a West Coast utility company. This data, corporate.payment, is also available with the package. This time we will use the first two digits in our analysis.

Step-1: Load the dataset and view its top rows. Let’s also see its summary.

data("corporate.payment")
head(corporate.payment)
##   VendorNum       Date  InvNum Amount
## 1      2001 2010-01-02 0496J10  36.08
## 2      2001 2010-01-02 1726J10  77.80
## 3      2001 2010-01-02 2104J10  34.97
## 4      2001 2010-01-02 2445J10  59.00
## 5      2001 2010-01-02 3281J10  59.56
## 6      2001 2010-01-02 3822J10  50.38
summary(corporate.payment)
##   VendorNum              Date               InvNum              Amount        
##  Length:189470      Min.   :2010-01-02   Length:189470      Min.   :  -71388  
##  Class :character   1st Qu.:2010-02-28   Class :character   1st Qu.:      50  
##  Mode  :character   Median :2010-06-04   Mode  :character   Median :     200  
##                     Mean   :2010-06-16                      Mean   :    2588  
##                     3rd Qu.:2010-09-30                      3rd Qu.:     835  
##                     Max.   :2010-12-31                      Max.   :26763476

We can see it has 189470 records having

  • Vendor Numbers
  • Date of Transaction
  • Invoice Number
  • Amount of invoice/transaction

Step-2: Create benford object

corp_bfd <- benford(corporate.payment$Amount, number.of.digits = 2)

Step-3: Let us first visually inspect the results. This time we will use another argument of the plot function in the benford.analysis library, namely except. The function can create seven different plots, and by default it creates five, as stated earlier. Thus, by writing except = "none" we can include all seven plots if we want. Otherwise, we have to mention the exclusions from c("digits", "second order", "summation", "mantissa", "chi squared", "abs diff", "ex summation"). There is one more argument, multiple, which is TRUE by default and plots multiple charts in the same window.

So let us build (i) Digit distribution and (ii) Second order digit distribution plots.

plot(
  corp_bfd,
  except = c(
    "summation",
    "mantissa",
    "chi squared",
    "abs diff",
    "ex summation",
    "chisq diff",
    "legend"
  ),
  multiple = TRUE
)

Figure 39.7: Benford Analysis results on Corporate payments Data

We can see that largely the data follows Benford’s Law except an abnormal peak at 50.

Step-4: Let us now see what is inside this object. The function print in the benford.analysis package has another argument, how.many, which tells it how many of the largest absolute differences to print.

print(corp_bfd, how.many = 7)
## 
## Benford object:
##  
## Data: corporate.payment$Amount 
## Number of observations used = 185083 
## Number of obs. for second order = 65504 
## First digits analysed = 2
## 
## Mantissa: 
## 
##    Statistic  Value
##         Mean  0.496
##          Var  0.092
##  Ex.Kurtosis -1.257
##     Skewness -0.002
## 
## 
## The 7 largest deviations: 
## 
##   digits absolute.diff
## 1     50       5938.25
## 2     11       3331.98
## 3     10       2811.92
## 4     14       1043.68
## 5     98        889.95
## 6     90        736.81
## 7     92        709.01
## 
## Stats:
## 
##  Pearson's Chi-squared test
## 
## data:  corporate.payment$Amount
## X-squared = 32094, df = 89, p-value < 2.2e-16
## 
## 
##  Mantissa Arc Test
## 
## data:  corporate.payment$Amount
## L2 = 0.0039958, df = 2, p-value < 2.2e-16
## 
## Mean Absolute Deviation (MAD): 0.002336614
## MAD Conformity - Nigrini (2012): Nonconformity
## Distortion Factor: -1.065467
## 
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!

We can see that the digit group 50 indeed has the largest absolute difference. One of the reasons for the concentration of invoices in this digit group may be some tax cap or other threshold, which an auditor may need to investigate further.

Using suspectsTable() we can also get similar information.

suspectsTable(corp_bfd) |> 
  head(7)
##    digits absolute.diff
##     <int>         <num>
## 1:     50     5938.2544
## 2:     11     3331.9798
## 3:     10     2811.9177
## 4:     14     1043.6833
## 5:     98      889.9470
## 6:     90      736.8084
## 7:     92      709.0129

Step-5: Let us also get the chi-square and other metrics.

chisq(corp_bfd)
## 
##  Pearson's Chi-squared test
## 
## data:  corporate.payment$Amount
## X-squared = 32094, df = 89, p-value < 2.2e-16

Going strictly by the numbers and the p-value, which we should not rely upon in Benford analytics, we see that the null hypothesis (ref: section 39.3.1) is rejected. In other words, the chi-square statistic tells us that the data does not follow Benford’s Law.

To get Mean Absolute Deviation

MAD(corp_bfd)
## [1] 0.002336614

To check whether the value conforms to the thresholds suggested by Mark Nigrini, we can do

corp_bfd$MAD.conformity
## [1] "Nonconformity"
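The MAD statistic itself is just the mean absolute difference between the observed and expected first-two-digit proportions. A self-contained sketch on simulated data (the log-normal sample is a hypothetical stand-in for invoice amounts):

```r
set.seed(1)
x <- rlnorm(1e5, meanlog = 5, sdlog = 2)   # simulated positive "amounts"

# First-two-digit group of each value: floor of 10 times the mantissa
d2 <- floor(10 * 10^(log10(x) %% 1))
d2 <- pmin(d2, 99)                         # guard against floating-point round-up

observed <- tabulate(as.integer(d2) - 9L, nbins = 90) / length(d2)  # groups 10..99
expected <- log10(1 + 1 / (10:99))

mad_stat <- mean(abs(observed - expected))
mad_stat   # small for this wide log-normal, which is close to Benford
```

Because the simulated data spans several orders of magnitude, its MAD lands well below Nigrini's nonconformity cut-off, unlike the corporate payments data above.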

Step-6: Let us list the duplicated values available, if any, in the data. For the sake of brevity, we will print only the top-5 results.

duplicatesTable(corp_bfd) |> 
  head(5)
##     number duplicates
##      <num>      <int>
## 1:   50.00       6022
## 2: 1153.35       2264
## 3: 1083.45       1185
## 4:  150.00       1056
## 5:  988.35       1018

Examining the output above, we can see that there are 6,022 invoices of exactly USD 50 each. This concentration is probably why the null hypothesis was rejected for this data.
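Conceptually, the duplicates table is just a frequency count of the raw amounts sorted in decreasing order; an equivalent base-R sketch on a toy vector (the amounts are made up for illustration):

```r
amounts <- c(50, 50, 150, 988.35, 50, 150, 50)

# Count each distinct amount, then sort by frequency
dup <- sort(table(amounts), decreasing = TRUE)
head(dup, 3)   # 50 appears 4 times, 150 twice, 988.35 once
```

Round numbers repeated far more often than business volume would explain are a classic red flag in payments data.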

Step-7: We can extract the complete distribution data using the getBfd function.

getBfd(corp_bfd) |> 
  head(10)
##     digits  data.dist data.second.order.dist benford.dist
##      <int>      <num>                  <num>        <num>
##  1:     10 0.05658542            0.374786273   0.04139269
##  2:     11 0.05579119            0.015922692   0.03778856
##  3:     12 0.03236926            0.014609795   0.03476211
##  4:     13 0.03116440            0.013266365   0.03218468
##  5:     14 0.02432422            0.011113825   0.02996322
##  6:     15 0.03038637            0.011510747   0.02802872
##  7:     16 0.02385416            0.010365779   0.02632894
##  8:     17 0.02179563            0.009129213   0.02482358
##  9:     18 0.02085011            0.009358207   0.02348110
## 10:     19 0.02043408            0.008106375   0.02227639
##     data.second.order.dist.freq data.dist.freq benford.dist.freq
##                           <num>          <num>             <num>
##  1:                       24550          10473          7661.082
##  2:                        1043          10326          6994.020
##  3:                         957           5991          6433.875
##  4:                         869           5768          5956.838
##  5:                         728           4502          5545.683
##  6:                         754           5624          5187.640
##  7:                         679           4415          4873.039
##  8:                         598           4034          4594.423
##  9:                         613           3859          4345.952
## 10:                         531           3782          4122.982
##     benford.so.dist.freq data.summation abs.excess.summation difference
##                    <num>          <num>                <num>      <num>
##  1:             2711.386       28701407             23224143  2811.9177
##  2:             2475.302       22324748             16847484  3331.9798
##  3:             2277.057       16258127             10780863  -442.8749
##  4:             2108.225       15520165             10042901  -188.8378
##  5:             1962.711       27393259             21915996 -1043.6833
##  6:             1835.994       49191988             43714724   436.3597
##  7:             1724.651       12523174              7045911  -458.0390
##  8:             1626.044       11994778              6517515  -560.4233
##  9:             1538.106        7545939              2068675  -486.9517
## 10:             1459.193        6987397              1510133  -340.9820
##     squared.diff absolute.diff
##            <num>         <num>
##  1:  1032.084049     2811.9177
##  2:  1587.368773     3331.9798
##  3:    30.485235      442.8749
##  4:     5.986347      188.8378
##  5:   196.418497     1043.6833
##  6:    36.704517      436.3597
##  7:    43.053153      458.0390
##  8:    68.359901      560.4233
##  9:    54.561565      486.9517
## 10:    28.200147      340.9820

Step-8: To extract suspected/high-risk records, we can use the getSuspects function. As already stated, it requires both the benford object and the original data as inputs.

# We are printing 10 records only
getSuspects(corp_bfd, corporate.payment) |> 
  head(10)
##     VendorNum       Date      InvNum  Amount
##        <char>     <Date>      <char>   <num>
##  1:      2001 2010-01-02     3822J10   50.38
##  2:      2001 2010-01-07    100107-2 1166.29
##  3:      2001 2010-01-08 11210084007 1171.45
##  4:      2001 2010-01-08     1585J10   50.42
##  5:      2001 2010-01-08     4733J10  113.34
##  6:      2001 2010-01-08     6263J10  117.22
##  7:      2001 2010-01-08     6673J10   50.80
##  8:      2001 2010-01-08     9181J10  114.78
##  9:      2001 2010-01-09     1510J10   50.49
## 10:      2001 2010-01-09     1532J10   50.45

Moreover, using the slice_max function from dplyr, we can also extract the n highest-valued ‘suspects’.

getSuspects(corp_bfd, corporate.payment) |>
  slice_max(order_by = Amount, n = 10, with_ties = FALSE)
##     VendorNum       Date                InvNum    Amount
##        <char>     <Date>                <char>     <num>
##  1:      2817 2010-10-27                10-10A 1156428.2
##  2:     17141 2010-04-05                040510 1135003.6
##  3:      2817 2010-11-30            1033500002 1112304.3
##  4:     16721 2010-09-16 SEE ATTACHED BALSHEET 1100000.0
##  5:      6118 2010-12-17             103511001  509093.7
##  6:      2817 2010-05-28                 40821  506971.5
##  7:     17284 2010-03-24                032400  504580.6
##  8:      6118 2010-08-26             102381001  504334.6
##  9:     17284 2010-03-10                 31000  502132.2
## 10:      2088 2010-03-24            1008300003  500000.0

Conclusion

Though by the goodness-of-fit statistics the data did not conform to Benford's Law, we observed that the deviation was driven mainly by an abnormally large number of records starting with the digits 50; the reasons can be investigated further. From the charts we also observed that, this group apart, the data conforms reasonably well to the law. We also extracted suspected records for further investigation through other parameters, tests, and verification. To sum up, Benford analysis can be a good starting point for fraud/forensic analytics while auditing. Before closing, let us work through one more example.

39.5.3 Example-3: Lakes Perimeter

Let us apply the same analysis to the lakes.perimeter44 data, which is available with the package.

# load sample data
data(lakes.perimeter) 
# Number of rows
nrow(lakes.perimeter)
## [1] 248607
# View top rows
head(lakes.perimeter)
##   perimeter.km
## 1          1.0
## 2          1.0
## 3          1.1
## 4          1.1
## 5          1.1
## 6          1.1
# Generate Benford Object
lake_ben <- benford(lakes.perimeter$perimeter.km, number.of.digits = 2)

Let us examine the plots, metrics, and top outliers.

plot(lake_ben)

Figure 39.8: Benford Analysis - Lake Perimeter Data

# Chisq test
chisq(lake_ben)
## 
##  Pearson's Chi-squared test
## 
## data:  lakes.perimeter$perimeter.km
## X-squared = 88111, df = 89, p-value < 2.2e-16
# MAD
MAD(lake_ben)
## [1] 0.006012766
# Whether it conforms?
lake_ben$MAD.conformity
## [1] "Nonconformity"
# Get top-10 suspects
getSuspects(lake_ben, lakes.perimeter) |>
  head(10)
##     perimeter.km
##            <num>
##  1:          1.5
##  2:          1.5
##  3:          1.5
##  4:          1.5
##  5:          1.5
##  6:          1.5
##  7:          1.5
##  8:          1.5
##  9:          1.5
## 10:          1.5
# Get top-10 suspects on Squared Differences
getSuspects(lake_ben, lakes.perimeter, 
            by = "squared.diff") |>
  head(10)
##     perimeter.km
##            <num>
##  1:          3.6
##  2:          3.6
##  3:          3.6
##  4:          3.6
##  5:          3.6
##  6:          3.6
##  7:          3.6
##  8:          3.6
##  9:          3.6
## 10:          3.6
# Get top-10 suspects on Absolute Excess Summation
getSuspects(lake_ben, lakes.perimeter, 
            by = "abs.excess.summation") |>
  head(10)
##     perimeter.km
##            <num>
##  1:          1.0
##  2:          1.0
##  3:          1.3
##  4:          1.3
##  5:          1.3
##  6:          1.3
##  7:          1.3
##  8:          1.3
##  9:          1.3
## 10:          1.3

Conclusion

We observed that the data does not conform to Benford's Law, which is evident from the plot as well as the MAD value. The chi-squared value of 88111 also exceeds the critical value by a wide margin. Nigrini and Miller offered some plausible explanations for this non-conformity in their research paper45. One possible reason, they propose, is that perimeter is not an appropriate measure of the size of a lake.

39.6 Conclusion

As we conclude this chapter on Benford analytics, it is clear that this statistical phenomenon holds remarkable potential across diverse fields. The inherent simplicity of Benford's Law belies its analytical depth and breadth of applicability. Its ability to unveil anomalies, check data integrity, and aid forensic investigations underscores its significance in modern data analysis. As we delve deeper into its intricacies and practical applications, we gain a tool that not only scrutinizes numbers but also opens new avenues for precision, authenticity, and trust in our data-driven world.


Further Reading

  1. ISACA Journal Archives. “Understanding and Applying Benford’s Law.” 1 May 2011.

  2. Newcomb, Simon. “Note on the Frequency of Use of the Different Digits in Natural Numbers.” American Journal of Mathematics, vol. 4, no. 1, 1881, pp. 39–40. JSTOR, https://doi.org/10.2307/2369148. Accessed 15 Jun. 2022.

  3. Durtschi, Cindy, William Hillison, and Carl Pacini. “The Effective Use of Benford’s Law to Assist in Detecting Fraud in Accounting Data.” Journal of Forensic Accounting, 2004.