39 Benford Tests/Analysis
39.1 Introduction and Historical context
Benford’s Law stands out as a method for both visualizing and evaluating numerical data, especially when the focus is on detecting fraud. The law describes the frequency distribution of the first (left-most) digit in many real-life datasets, which, counter-intuitively, is not uniform, as shown in Figure 39.2. Significant differences from the anticipated occurrence rates may signal that the data is questionable and might have been altered. For instance, eligibility for government assistance often hinges on meeting specific criteria, such as having an income below a certain level. As a result, data might be manipulated to meet these criteria. This kind of manipulation is precisely what Benford’s Law can detect, since fabricated numbers tend not to align with the expected frequency pattern outlined by the law.
The law is named after physicist Frank Benford, who worked on the theory in 1938 and published a paper titled The Law of Anomalous Numbers.37 However, the discovery dates back more than five decades earlier, when astronomer Simon Newcomb observed that the initial pages of logarithm-table booklets were more worn than the later pages, and published a two-page article titled Note on the Frequency of Use of the Different Digits in Natural Numbers in 1881.38
Researchers continued to work on Benford’s Law and its extensions; however, it took several decades to find a truly practical application. It was in the last decade of the twentieth century that Dr. Mark J. Nigrini, an accounting professor, applied the law to fraud detection and analytics and developed a practical fraud application. He reviewed multiple data sources, such as sales figures, insurance claim costs, and expense reimbursement claims, and studied the detection of overstatement and understatement of financial figures. His research confirmed the law’s usefulness to fraud examiners and auditors in accounting engagements.
His theory is that if somebody tries to falsify, say, their tax return, then invariably they will have to invent some data. When doing so, people tend to use too many numbers starting with mid-range digits (5, 6 and 7) and not enough numbers starting with 1.



Figure 39.1: (L to R) Frank Benford, Simon Newcomb, and Mark Nigrini (Source: Wiki)
39.2 Benford’s Law, properties and extensions
39.2.1 Law of first digit
When considering the likelihood of any digit being in the first position (from the left), our initial assumption might be a simple one-in-nine chance for each digit, following a uniform distribution. However, this notion was challenged by Canadian-American astronomer Simon Newcomb in 1881, who noticed unusual wear patterns in logarithmic tables. While casually flipping through a logarithmic tables booklet, he discerned a curious pattern: the initial pages exhibited more wear and tear than the later ones.
Subsequently, Frank Benford conducted a comprehensive analysis of 20 diverse datasets encompassing river sizes, chemical compound weights, population data, and more. His findings revealed a successive diminishment in probability from digit 1 to 9. In essence, the probability of digit 1 occurring in the initial position is the highest, while that of digit 9 is the lowest.
Mathematically, Benford’s Law, or the Law of First Digits, states that the probability of any digit \(d\) appearing in the first place follows equation (39.1).
\[\begin{equation} P(d) = \log_{10}\left(1 + \frac{1}{d}\right) \tag{39.1} \end{equation}\]
- where \(d\) ranges from \(1\) to \(9\).
The probabilities, when plotted, generate the plot depicted in Figure 39.2.

Figure 39.2: Diminishing Probabilities of First Digits - Benford Law
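Equation (39.1) can be evaluated directly in base R. A minimal sketch (the helper name `benford_first` is ours, not part of any package):

```r
# Theoretical first-digit probabilities under Benford's Law, eq. (39.1)
benford_first <- function(d) log10(1 + 1 / d)

digits <- 1:9
probs  <- benford_first(digits)
round(probs, 4)
# P(1) = 0.3010 down to P(9) = 0.0458; barplot(probs, names.arg = digits)
# would reproduce the shape of Figure 39.2

sum(probs)  # the nine probabilities sum to 1, as any distribution must
```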
To test the proposed law, Benford analysed 20 different datasets and observed that nearly all follow the distribution mentioned in equation (39.1).
Let us also try to see whether the law holds by analysing six different datasets, which are included in the R package benford.analysis. We will discuss the package in detail later in section 39.5. The six datasets are mentioned in Table 39.1.
Item | Title | Column |
---|---|---|
census.2000_2010 | Population data - US - 2000 and 2010 | pop.2000 |
census.2009 | Population data of Towns and Cities of the US - 2009 | pop.2009 |
corporate.payment | Corporate payments of a West Coast utility company - 2010 | Amount |
lakes.perimeter | Perimeter of lakes around the world | perimeter.km |
sino.forest | Financial Statements of Sino Forest Corporation’s 2010 Report | value |
taxable.incomes.1978 | Taxable Income 1978 | taxIncomes |
The results of Benford’s Law of the first digit on these six datasets have been calculated and are shown in Table 39.3. It can be seen that the actual frequencies of first digits in most of these datasets follow Benford’s Law, the notable exception being lakes.perimeter. We can also plot the actual frequencies to inspect the results visually. The actual frequencies in these six datasets are plotted in Figure 39.3, and it may be seen that they largely follow Benford’s Law.
digits | Benford | Census 2000_2010 | Census 2009 | Corporate Payment | Lakes Perimeter | Sino Forest | Taxable Incomes 1978 |
---|---|---|---|---|---|---|---|
1 | 0.3010300 | 0.3092126 | 0.2941207 | 0.3175548 | 0.1508888 | 0.2992228 | 0.3278721 |
2 | 0.1760913 | 0.1797896 | 0.1814547 | 0.1611007 | 0.0687752 | 0.1606218 | 0.2140886 |
3 | 0.1249387 | 0.1271916 | 0.1200472 | 0.1101452 | 0.2170936 | 0.1256477 | 0.1235673 |
4 | 0.0969100 | 0.0975454 | 0.0946743 | 0.0828655 | 0.1818372 | 0.0906736 | 0.0895397 |
5 | 0.0791812 | 0.0656678 | 0.0799118 | 0.1016301 | 0.1309577 | 0.0829016 | 0.0722473 |
6 | 0.0669468 | 0.0656678 | 0.0702240 | 0.0602811 | 0.0930143 | 0.0699482 | 0.0521491 |
7 | 0.0579919 | 0.0541919 | 0.0597673 | 0.0498209 | 0.0682885 | 0.0518135 | 0.0411117 |
8 | 0.0511525 | 0.0548295 | 0.0534625 | 0.0503666 | 0.0502118 | 0.0699482 | 0.0393606 |
9 | 0.0457575 | 0.0459037 | 0.0463376 | 0.0662351 | 0.0389329 | 0.0492228 | 0.0400637 |

Figure 39.3: Distribution of first digit frequencies in six datasets
39.2.2 Scale Invariance
Later, in 1961, Roger Pinkham showed that the law is invariant to scaling.39 By scale invariance, he showed that the law does not depend on the units of measurement. In other words, the law still holds if we convert from one unit to another. For example, whether prices are measured in USD or INR, or lengths in kilometres or miles, the digit frequencies still follow Benford’s Law.
Let us check this on one of the six datasets mentioned above, namely census.2009. This dataset contains the population figures of towns and cities of the United States as of July 2009. We can see that the first digit frequencies follow Benford’s Law/Pinkham’s corollary in Figure 39.4. The left plot shows the frequencies on the original data, whereas the right plot shows them on randomly scaled data.


Figure 39.4: First Digit Analysis on US Census 2009 data (Left) and Scaled Data (Right)
Figure 39.4 (left) shows that the law holds for the data. Let us also test Pinkham’s corollary on this data: we multiply all the population figures by a random positive number. Figure 39.4 (right) makes it clear that the law still holds after scaling.
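Scale invariance can also be demonstrated in base R. A hedged sketch on simulated data (log-uniform data is exactly Benford-distributed, so it stands in for census.2009 here; the helper `first_digit` and the factor 1.609 are our own choices):

```r
set.seed(1)
# Log-uniform data follows Benford's Law, a stand-in for real data
x <- 10^runif(10000, min = 0, max = 5)

# Leading digit via scientific notation, e.g. "2.930e+03" -> 2
first_digit <- function(v) as.integer(substr(formatC(v, format = "e"), 1, 1))

orig   <- table(first_digit(x)) / length(x)          # original units
scaled <- table(first_digit(x * 1.609)) / length(x)  # e.g. miles -> kilometres
round(rbind(orig, scaled), 3)  # both rows stay close to log10(1 + 1/d)
```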
39.2.3 First two digits
Nigrini’s contributions gained widespread recognition among scholars and practitioners, highlighting the applicability of Benford’s Law as a valuable forensic accounting and auditing tool across various datasets, particularly in the financial domain. Theodore P. Hill40 further extended the scope of the law, demonstrating its validity beyond the first digit to other digits as well. Hill’s work expanded the utility of Benford’s Law, affirming its effectiveness in detecting irregularities and patterns not only in leading digits but throughout numerical sequences.
The formula for the second significant digit is given in equation (39.2).
\[\begin{equation} P(d_i) = \sum_{k = 1}^{9}\log_{10}\left(1 + \frac{1}{10k + d_i}\right)\;;\; d_i = 0,1,\ldots,9 \tag{39.2} \end{equation}\]
- where \(k\) represents the first digit,
- \(d_i\) represents the second digit.
The probabilities have been calculated and are depicted in Table 39.5. Each cell shows the probability of occurrence of a two-digit combination, with the first digit in rows and the second digit in columns. We may also verify that the row totals, which give the probability of occurrence of the first digit, correspond to Benford’s Law of First Digits. For example, the probability of the first two digits being 10 is the highest, at 4.14%.
First Digit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | First Digit Freq |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 4.14% | 3.78% | 3.48% | 3.22% | 3.00% | 2.80% | 2.63% | 2.48% | 2.35% | 2.23% | 30.10% |
2 | 2.12% | 2.02% | 1.93% | 1.85% | 1.77% | 1.70% | 1.64% | 1.58% | 1.52% | 1.47% | 17.61% |
3 | 1.42% | 1.38% | 1.34% | 1.30% | 1.26% | 1.22% | 1.19% | 1.16% | 1.13% | 1.10% | 12.49% |
4 | 1.07% | 1.05% | 1.02% | 1.00% | 0.98% | 0.95% | 0.93% | 0.91% | 0.90% | 0.88% | 9.69% |
5 | 0.86% | 0.84% | 0.83% | 0.81% | 0.80% | 0.78% | 0.77% | 0.76% | 0.74% | 0.73% | 7.92% |
6 | 0.72% | 0.71% | 0.69% | 0.68% | 0.67% | 0.66% | 0.65% | 0.64% | 0.63% | 0.62% | 6.69% |
7 | 0.62% | 0.61% | 0.60% | 0.59% | 0.58% | 0.58% | 0.57% | 0.56% | 0.55% | 0.55% | 5.80% |
8 | 0.54% | 0.53% | 0.53% | 0.52% | 0.51% | 0.51% | 0.50% | 0.50% | 0.49% | 0.49% | 5.12% |
9 | 0.48% | 0.47% | 0.47% | 0.46% | 0.46% | 0.45% | 0.45% | 0.45% | 0.44% | 0.44% | 4.58% |
Second Digit Freq | 11.97% | 11.39% | 10.88% | 10.43% | 10.03% | 9.67% | 9.34% | 9.04% | 8.76% | 8.50% | 100.00% |
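Table 39.5 can be reproduced in base R from equation (39.2); the matrix name `p2` is ours:

```r
# First-two-digit probabilities: P(d1 d2) = log10(1 + 1 / (10 * d1 + d2))
p2 <- outer(1:9, 0:9, function(d1, d2) log10(1 + 1 / (10 * d1 + d2)))
dimnames(p2) <- list(first = 1:9, second = 0:9)

round(100 * p2["1", "0"], 2)   # 4.14, the largest cell in Table 39.5
round(100 * rowSums(p2), 2)    # row totals: the first-digit law (30.10, 17.61, ...)
round(100 * colSums(p2), 2)    # column totals: the second-digit law of eq. (39.2)
```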
The law of the second digit, combined with the original Benford’s Law of the first digit, thus gives us the Law of First Two Digits. We can verify it on the census.2009 data. The resultant plot, depicted in figure 39.5, shows that the law of first two digits also holds.

Figure 39.5: Law holds for first two digits as well
39.2.4 Second order test
Nigrini and Miller, in 2009,41 introduced another advanced test based on Benford’s Law. The test states that:
Let \(x_1\), …, \(x_N\) be a data set comprising \(N\) observations, and let \(y_1\), …, \(y_N\) be the observations \(x_i\) arranged in ascending order. Then, for many natural data sets, and for large \(N\), the distribution of the digits of the differences between adjacent observations \(y_{i+1} - y_i\) is close to Benford’s Law. Large deviations from Benford’s Law indicate an anomaly that should be investigated.
So, the steps may be listed as:
- Sort the data from smallest to largest
- Calculate the \(N-1\) differences of consecutive observations
- Apply Benford’s Law to these differences.
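The steps above can be sketched in base R; the function name `second_order_digits` and the simulated lognormal data are our own illustration, not part of any package:

```r
# Second-order test: first digits of gaps between sorted observations
second_order_digits <- function(x) {
  y <- sort(x)                  # step 1: sort ascending
  d <- diff(y)                  # step 2: N - 1 consecutive differences
  d <- d[d > 0]                 # ties yield zero gaps, which have no first digit
  as.integer(substr(formatC(d, format = "e"), 1, 1))  # step 3: leading digits
}

set.seed(42)
x  <- rlnorm(5000, meanlog = 10, sdlog = 3)  # stand-in for a natural data set
fd <- second_order_digits(x)
round(table(fd) / length(fd), 3)  # to be compared with the Benford frequencies
```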
Nigrini showed that these digits are expected to closely follow the Benford frequencies. Using four different datasets, he showed that this test can detect (i) anomalies occurring in the data, (ii) whether the data has been rounded, and (iii) the use of fake or ‘statistically generated’ data in place of actual (transactional) data.
39.2.5 Summation Test
The summation test, another second-order test, looks for excessively large numbers in a dataset: it identifies numbers that are large compared to the norm for that data. The test was also proposed by Nigrini42 and is based on the fact that the sums of all numbers in a Benford distribution sharing the same first two digits (10, 11, 12, …, 99) should be equal. Therefore, the sum proportion for each of the 90 first-two-digit groups should be 1/90, or about 0.011. Spikes, if any, indicate the presence of some abnormally large single numbers or sets of numbers.
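The summation test can be sketched in base R on simulated Benford-like amounts (the helper `first_two` is ours):

```r
# Summation test sketch: sum the values within each first-two-digits group
first_two <- function(v) {
  e <- formatC(v, format = "e", digits = 6)   # e.g. "5.038000e+01"
  as.integer(paste0(substr(e, 1, 1), substr(e, 3, 3)))  # -> 50
}

set.seed(7)
x <- 10^runif(20000, min = 0, max = 4)        # Benford-like positive amounts
group_sums <- tapply(x, first_two(x), sum)
props <- group_sums / sum(group_sums)
summary(as.numeric(props))  # each of the 90 groups should sit near 1/90 = 0.0111
```

A single injected large amount would show up as a spike in `props` for its digit group.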
Later, in section 39.5, we will see how to implement all these tests in R.
39.2.6 Limitations of Benford Tests
Benford’s Law may not hold in the following circumstances-
- When the dataset is comprised of assigned numbers, such as cheque numbers, invoice numbers, telephone numbers, pincodes, etc.
- Numbers that may be influenced by human choice, e.g. ATM withdrawals.
- Where amounts have a lower bound, an upper bound, or both, e.g. passengers on board an airplane, hourly wage rates, etc.
- When the count of records is less than about 500.
Before carrying out the analytics, let us also look at the evaluation metrics that will help us evaluate the goodness of fit of data to Benford’s Law. Three statistics are commonly used.
39.3 Goodness of fit metrics
In table 39.3 we saw that digit frequencies largely followed Benford’s Law in six different datasets. However, to evaluate how close the actual distribution is to the theoretical distribution, we need to assess the fit using some metrics. Here we will use the three metrics described below.
39.3.1 Chi-square statistic
The first of these tests uses the Chi-square statistic. This statistic tests the statistical significance, over the whole distribution, of the observed frequencies of the first digit (or first two digits) against their expected frequencies under Benford’s Law (BL). The null hypothesis states that the digits follow Benford’s Law. The mathematical formula is,
\[\begin{equation} \chi^2 = \sum_{i=1}^{9} \frac{(O_i - E_i)^2}{E_i} \tag{39.3} \end{equation}\]where -
- \(O_i\) is the observed frequency of the i-th digit.
- \(E_i\) is the expected frequency of the i-th digit predicted by Benford’s Law.
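Equation (39.3) is easy to compute by hand in base R. The helper `chisq_benford` and the illustrative counts are our own (hypothetical) example:

```r
# Chi-square statistic of eq. (39.3) for a first-digit test
chisq_benford <- function(counts) {            # counts: observed counts of digits 1..9
  n        <- sum(counts)
  expected <- n * log10(1 + 1 / (1:9))         # E_i under Benford's Law
  sum((counts - expected)^2 / expected)
}

# Hypothetical counts for 1000 observations, roughly Benford-shaped
obs <- c(295, 180, 127, 96, 80, 68, 58, 50, 46)
chisq_benford(obs)   # compare against the df = 8 critical values in Table 39.7
```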
This calculated chi-square statistic is compared to a critical value. The critical values for the Chi-square test come from the chi-square distribution, available in any statistics textbook43. For the first digit test and the first two digits test, the relevant critical values are reproduced in Table 39.7.
Degrees of Freedom | 8 | 89 |
---|---|---|
10% | 13.362 | 106.469 |
5% | 15.507 | 112.022 |
2.5% | 17.535 | 116.989 |
1% | 20.090 | 122.942 |
0.1% | 26.125 | 135.978 |
To check goodness of fit, we compare the calculated \(\chi^2\) statistic with these critical values. If the observed value exceeds the critical value, we conclude that our initial hypothesis, that the data follows BL, should be rejected; in other words, the data does not conform to the Benford law/distribution.
For example, in the census.2009 data the chi-square statistic works out to 17.524, which exceeds the 5% critical value of 15.507 but is below the 2.5% critical value of 17.535. Thus, at the 2.5% significance level we cannot reject the hypothesis that the census.2009 data follows BL (first digit law).
39.3.2 Z-score
The Z-statistic checks whether an individual digit’s frequency significantly differs from the Benford distribution. Mathematically, the Z-statistic considers the absolute magnitude of the difference between the actual and expected proportions, the size of the data, and the expected proportion.
\[\begin{equation} Z = \frac{(\lvert p - p_0\rvert) - (\frac{1}{2n})}{\sqrt{\frac{p_0(1-p_0)}{n}}} \tag{39.4} \end{equation} \]where -
- \(p\) is the observed frequency of the leading digits in the dataset.
- \(p_0\) is the expected frequency under Benford’s Law.
- \(n\) is the number of records
In equation (39.4), the last term in the numerator, \(\frac{1}{2n}\), is a continuity correction term and is used only when it is smaller than the first term in the numerator. Mark Nigrini has proposed that if the value of the Z-statistic exceeds the critical value of 1.96, the null hypothesis is rejected at the 5% significance level. Note that the null hypothesis is the same as before: the digits follow Benford’s Law.
If the significance levels are 1% or 10%, the corresponding critical values are 2.57 and 1.64, respectively.
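Equation (39.4) as a base-R helper; the function name `z_benford` and the worked numbers are our own hypothetical example:

```r
# Z-statistic of eq. (39.4) for a single digit's proportion
z_benford <- function(p, p0, n) {
  num <- abs(p - p0)
  cc  <- 1 / (2 * n)                 # continuity correction term
  if (cc < num) num <- num - cc      # applied only when smaller than |p - p0|
  num / sqrt(p0 * (1 - p0) / n)
}

# Hypothetical: digit 1 observed in 28% of 5000 records vs. expected 30.103%
z_benford(p = 0.28, p0 = log10(2), n = 5000)  # about 3.23, rejected at the 5% level
```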
39.3.3 Mean absolute deviation
Another statistic, the Mean Absolute Deviation, sometimes referred to as M.A.D., measures the absolute deviations of the observed frequencies from the theoretical ones. The mathematical formula is given in equation (39.5).
\[\begin{equation} MAD = \frac{1}{9} \sum_{i=1}^{9} |O_i - E_i| \tag{39.5} \end{equation}\]
As there are no objective critical scores for the absolute deviations, the critical values prescribed by Mark J Nigrini are given in table 39.9 below.
First Digits | Conformity | First-Two Digits | Conformity |
---|---|---|---|
0.000 to 0.006 | Close conformity | 0.000 to 0.012 | Close conformity |
0.006 to 0.012 | Acceptable conformity | 0.012 to 0.018 | Acceptable conformity |
0.012 to 0.015 | Marginally acceptable conformity | 0.018 to 0.022 | Marginally acceptable conformity |
above 0.015 | Nonconformity | above 0.022 | Nonconformity |
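Equation (39.5) and the first-digit cutoffs of table 39.9 can be sketched in base R; the helpers `mad_benford` and `conformity` are our own names, and the proportions are taken from Table 39.3:

```r
# MAD of eq. (39.5), classified with Nigrini's first-digit cutoffs (table 39.9)
mad_benford <- function(obs_prop) {            # obs_prop: observed digit proportions
  exp_prop <- log10(1 + 1 / (1:9))
  mean(abs(obs_prop - exp_prop))
}

conformity <- function(mad) {
  cut(mad, breaks = c(0, 0.006, 0.012, 0.015, Inf),
      labels = c("Close conformity", "Acceptable conformity",
                 "Marginally acceptable conformity", "Nonconformity"))
}

# First-digit proportions of census.2009 from Table 39.3
p <- c(0.2941, 0.1815, 0.1200, 0.0947, 0.0799, 0.0702, 0.0598, 0.0535, 0.0463)
mad_benford(p)                 # about 0.0031
conformity(mad_benford(p))     # "Close conformity"
```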
39.3.4 Other descriptive Statistics
If the data follows Benford’s Law, the mantissas (the fractional parts of the base-10 logarithms of the numbers) are uniformly distributed, so their summary statistics should be close to those shown in table 39.10, as suggested by Mark Nigrini.
Statistic | Value |
---|---|
Mean | 0.5 |
Variance | 1/12 (0.08333…) |
Ex. Kurtosis | -1.2 |
Skewness | 0 |
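The values in table 39.10 are the moments of a uniform distribution on [0, 1), which is what the mantissa of Benford-distributed data follows. A base-R check on simulated data (the helper `mantissa` is ours):

```r
mantissa <- function(x) log10(abs(x)) %% 1   # fractional part of log10(x)

set.seed(3)
x <- 10^runif(50000, min = 0, max = 4)       # exactly Benford-distributed sample
m <- mantissa(x)
c(mean = mean(m), variance = var(m))         # close to 0.5 and 1/12
```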
39.4 Important
Benford’s Law analysis serves as a powerful tool in uncovering potential irregularities in datasets, but it’s crucial to note that deviations from this statistical phenomenon don’t always signify fraudulent activities. While it highlights notable discrepancies between expected and observed frequencies of digits in naturally occurring datasets, these variations might stem from various legitimate factors such as data entry errors, fluctuations in processes, or different sources of data. Understanding that Benford’s Law offers a signal rather than a definitive confirmation of fraud allows for a more nuanced interpretation, encouraging further investigation to discern the true nature behind these deviations.
Conversely, just because a dataset adheres to Benford’s Law, it doesn’t guarantee the absence of fraud. While conformity to this statistical principle generally suggests consistency within the data, sophisticated fraudsters might deliberately manipulate information to mimic expected distributions, masking their illicit activities. Therefore, while adherence to Benford’s Law might lessen suspicion, it doesn’t serve as an absolute assurance against fraudulent behavior.
When Benford’s Law acts as a warning signal, it indicates potential irregularities in the numbers. It is vital to dig deeper and investigate why these figures seem odd. Further scrutiny helps differentiate between a minor data hiccup and a potentially significant issue. This additional examination might mean cross-checking other data, validating records, or engaging with those connected to the information. Such a thorough approach is crucial for understanding the story behind these unusual figures.
39.5 Practical approach in R
As already stated, we will use the package benford.analysis for carrying out Benford’s Law analytics in R. Let us load it.
library(benford.analysis)
This package, developed by Carlos Cinelli, provides tools that make it easier to validate data using Benford’s Law. As the package author himself states, the main purpose of the package is to identify suspicious data that need further verification, so it should always be kept in mind that these analytics only provide us with red-flagged transactions that must be validated further.
Apart from useful functions, the package also provides some default datasets, especially those used by Frank Benford while proposing his law. Let us load the census 2009 data containing the population of towns and cities of the United States as of July 2009.
data("census.2009")
Let us view the top 6 rows of the data.
head(census.2009)
## state town pop.2009
## 1 Alabama Abbeville city 2930
## 2 Alabama Adamsville city 4782
## 3 Alabama Addison town 709
## 4 Alabama Akron town 433
## 5 Alabama Alabaster city 29861
## 6 Alabama Albertville city 20115
In fact, this contains 19509 records.
Problem Statement: Let us test Benford’s Law on the 2009 population data and see whether the data conforms to the law.
The main function benford() takes a vector of values to be tested as input, and creates an output of the special class benford.
The syntax is
benford(data, number.of.digits=2)
where
- data is the numeric vector on which the analysis is to be performed.
- number.of.digits is the number of digits on which the analysis is to be performed. The default value is 2.
census_first_digit <- benford(census.2009$pop.2009, number.of.digits = 1)
The above syntax creates the census_first_digit object, which stores various pieces of information useful for Benford analytics. We may view its summary:
summary(census_first_digit)
## Length Class Mode
## info 4 -none- list
## data 4 data.table list
## s.o.data 2 data.table list
## bfd 13 data.table list
## mantissa 2 data.table list
## MAD 1 -none- numeric
## MAD.conformity 1 -none- character
## distortion.factor 1 -none- numeric
## stats 2 -none- list
Let us also print the object to see what all is stored therein.
print(census_first_digit)
##
## Benford object:
##
## Data: census.2009$pop.2009
## Number of observations used = 19509
## Number of obs. for second order = 7950
## First digits analysed = 1
##
## Mantissa:
##
## Statistic Value
## Mean 0.503
## Var 0.084
## Ex.Kurtosis -1.207
## Skewness -0.013
##
##
## The 5 largest deviations:
##
## digits absolute.diff
## 1 1 134.79
## 2 2 104.64
## 3 3 95.43
## 4 6 63.94
## 5 8 45.07
##
## Stats:
##
## Pearson's Chi-squared test
##
## data: census.2009$pop.2009
## X-squared = 17.524, df = 8, p-value = 0.0251
##
##
## Mantissa Arc Test
##
## data: census.2009$pop.2009
## L2 = 4.198e-05, df = 2, p-value = 0.4409
##
## Mean Absolute Deviation (MAD): 0.003119261
## MAD Conformity - Nigrini (2012): Close conformity
## Distortion Factor: 0.7404623
##
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!
Results of the Chi-square test, MAD, etc. are printed, along with the top deviations. The MAD value of 0.003 shows close conformity with Benford’s Law. The Chi-square statistic of 17.524 is slightly greater than the 5% critical value of 15.507. In the second example we will see that the output of the print command on a benford object can be further customised using its other arguments.
Let us also visualise the results. We will use the plot command to generate the plots.
plot(census_first_digit)

Figure 39.6: Benford Analysis Results of Census 2009 Data
We can see that by default five charts are printed.
- Digits distribution
- Second Order Test digit distribution
- Summation test - digit distribution
- Chi-Square differences
- Summation differences
Similarly, in second example we will see how to customise plot outputs.
We can see that the first digits in the census 2009 data follow Benford’s Law closely.
39.5.1 Other Useful functions in package
You may be wondering whether we have to depend on the print function every time to get analytical insights out of the created object. In fact, there are several other functions in this package which are very useful while carrying out risk analysis through Benford’s Law.
- chisq: Gets the Chi-squared test of a benford object. Takes a benford object as input.
- duplicatesTable: Shows the duplicates in the data. Similarly, takes a benford object as input.
- extract.digits: Extracts the leading digits from the data. Takes the data as input. This is useful while carrying out the analysis manually.
- getBfd: Gets the statistics of the first digits of a benford object. E.g.
getBfd(census_first_digit)
## digits data.dist data.second.order.dist benford.dist
## <int> <num> <num> <num>
## 1: 1 0.29412066 0.55811321 0.30103000
## 2: 2 0.18145471 0.15471698 0.17609126
## 3: 3 0.12004716 0.08968553 0.12493874
## 4: 4 0.09467425 0.05761006 0.09691001
## 5: 5 0.07991184 0.04364780 0.07918125
## 6: 6 0.07022400 0.03308176 0.06694679
## 7: 7 0.05976729 0.02553459 0.05799195
## 8: 8 0.05346250 0.01987421 0.05115252
## 9: 9 0.04633759 0.01773585 0.04575749
## data.second.order.dist.freq data.dist.freq benford.dist.freq
## <num> <num> <num>
## 1: 4437 5738 5872.7942
## 2: 1230 3540 3435.3644
## 3: 713 2342 2437.4298
## 4: 458 1847 1890.6174
## 5: 347 1559 1544.7469
## 6: 263 1370 1306.0649
## 7: 203 1166 1131.3649
## 8: 158 1043 997.9346
## 9: 141 904 892.6829
## benford.so.dist.freq data.summation abs.excess.summation difference
## <num> <num> <num> <num>
## 1: 2393.1885 51237849 29880783 -134.79419
## 2: 1399.9255 33272136 11915070 104.63563
## 3: 993.2630 22810354 1453288 -95.42981
## 4: 770.4346 15763499 5593567 -43.61744
## 5: 629.4909 15799838 5557228 14.25307
## 6: 532.2270 14527377 6829689 63.93508
## 7: 461.0360 11371006 9986060 34.63511
## 8: 406.6626 18814056 2543010 45.06544
## 9: 363.7720 8617475 12739591 11.31712
## squared.diff absolute.diff
## <num> <num>
## 1: 3.0938378 134.79419
## 2: 3.1870315 104.63563
## 3: 3.7362508 95.42981
## 4: 1.0062752 43.61744
## 5: 0.1315102 14.25307
## 6: 3.1297790 63.93508
## 7: 1.0603039 34.63511
## 8: 2.0350972 45.06544
## 9: 0.1434744 11.31712
- getSuspects: Gets the ‘suspicious’ observations according to Benford’s Law. Takes both the data and a benford object as inputs. See the example in the second case study.
- MAD: Gets the MAD of a benford object.
- suspectsTable: Shows the first digits ordered by the main discrepancies from Benford’s Law. Notice the difference from getSuspects.
39.5.2 Example-2: Corporate payments data
Problem Statement-2: Let us analyse red flags on the 2010 payments data (189470 records) of a division of a West Coast utility company. This data, corporate.payment, is also available with the package. This time we will use the first two digits in our analysis.
Step-1: Load the dataset and view its top rows. Let’s also see its summary.
data("corporate.payment")
head(corporate.payment)
## VendorNum Date InvNum Amount
## 1 2001 2010-01-02 0496J10 36.08
## 2 2001 2010-01-02 1726J10 77.80
## 3 2001 2010-01-02 2104J10 34.97
## 4 2001 2010-01-02 2445J10 59.00
## 5 2001 2010-01-02 3281J10 59.56
## 6 2001 2010-01-02 3822J10 50.38
summary(corporate.payment)
## VendorNum Date InvNum Amount
## Length:189470 Min. :2010-01-02 Length:189470 Min. : -71388
## Class :character 1st Qu.:2010-02-28 Class :character 1st Qu.: 50
## Mode :character Median :2010-06-04 Mode :character Median : 200
## Mean :2010-06-16 Mean : 2588
## 3rd Qu.:2010-09-30 3rd Qu.: 835
## Max. :2010-12-31 Max. :26763476
We can see it has 189470 records having
+ Vendor Numbers
+ Date of Transaction
+ Invoice Number
+ Amount of invoice/transaction
Step-2: Create benford object
corp_bfd <- benford(corporate.payment$Amount, number.of.digits = 2)
Step-3: Let us first visually inspect the results. This time we will use another argument of the plot function in the benford.analysis library, namely except. The function can create seven different plots, and by default it creates the five stated earlier. By writing except = "none" we can include all seven plots. Otherwise, we have to list the exclusions from c("digits", "second order", "summation", "mantissa", "chi squared", "abs diff", "ex summation"). There is one more argument, multiple, which is TRUE by default and plots multiple charts in the same window.
So let us build (i) Digit distribution and (ii) Second order digit distribution plots.
plot(
corp_bfd,
except = c(
"summation",
"mantissa",
"chi squared",
"abs diff",
"ex summation",
"chisq diff",
"legend"
),
multiple = TRUE
)


Figure 39.7: Benford Analysis results on Corporate payments Data
We can see that the data largely follows Benford’s Law, except for an abnormal peak at 50.
Step-4: Let us now see what is inside this object. The print function in the benford.analysis package has another argument, how.many, which tells it how many of the largest absolute differences to print.
print(corp_bfd, how.many = 7)
##
## Benford object:
##
## Data: corporate.payment$Amount
## Number of observations used = 185083
## Number of obs. for second order = 65504
## First digits analysed = 2
##
## Mantissa:
##
## Statistic Value
## Mean 0.496
## Var 0.092
## Ex.Kurtosis -1.257
## Skewness -0.002
##
##
## The 7 largest deviations:
##
## digits absolute.diff
## 1 50 5938.25
## 2 11 3331.98
## 3 10 2811.92
## 4 14 1043.68
## 5 98 889.95
## 6 90 736.81
## 7 92 709.01
##
## Stats:
##
## Pearson's Chi-squared test
##
## data: corporate.payment$Amount
## X-squared = 32094, df = 89, p-value < 2.2e-16
##
##
## Mantissa Arc Test
##
## data: corporate.payment$Amount
## L2 = 0.0039958, df = 2, p-value < 2.2e-16
##
## Mean Absolute Deviation (MAD): 0.002336614
## MAD Conformity - Nigrini (2012): Nonconformity
## Distortion Factor: -1.065467
##
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!
We can see that the digit group 50 indeed has the largest absolute difference. The concentration of invoices in this digit group may be due to some tax capping or another reason, which an auditor may need to investigate further.
Using suspectsTable()
we can also get similar information.
suspectsTable(corp_bfd) |>
head(7)
## digits absolute.diff
## <int> <num>
## 1: 50 5938.2544
## 2: 11 3331.9798
## 3: 10 2811.9177
## 4: 14 1043.6833
## 5: 98 889.9470
## 6: 90 736.8084
## 7: 92 709.0129
Step-5: Let us also get the Chi Square and other metrics
chisq(corp_bfd)
##
## Pearson's Chi-squared test
##
## data: corporate.payment$Amount
## X-squared = 32094, df = 89, p-value < 2.2e-16
Going strictly by the numbers and the p-value (which, as noted, we should not rely upon exclusively in Benford analytics), we see that the null hypothesis (ref: section 39.3.1) is rejected. In other words, the chi-square statistic tells us that the data does not follow Benford’s Law.
To get Mean Absolute Deviation
MAD(corp_bfd)
## [1] 0.002336614
To check whether the value conforms to the cutoffs suggested by Mark Nigrini, we can run
corp_bfd$MAD.conformity
## [1] "Nonconformity"
Step-6: Let us list the duplicated values, if any, available in the data. For the sake of brevity, we will print only the top 5 results.
duplicatesTable(corp_bfd) |>
head(5)
## number duplicates
## <num> <int>
## 1: 50.00 6022
## 2: 1153.35 2264
## 3: 1083.45 1185
## 4: 150.00 1056
## 5: 988.35 1018
Examining the output above, we can see that there are 6022 invoices of exactly USD 50 each. This could well be the reason for the rejection of the null hypothesis for this data.
Step-7: We can extract the complete distribution data using the getBfd function.
getBfd(corp_bfd) |>
head(10)
## digits data.dist data.second.order.dist benford.dist
## <int> <num> <num> <num>
## 1: 10 0.05658542 0.374786273 0.04139269
## 2: 11 0.05579119 0.015922692 0.03778856
## 3: 12 0.03236926 0.014609795 0.03476211
## 4: 13 0.03116440 0.013266365 0.03218468
## 5: 14 0.02432422 0.011113825 0.02996322
## 6: 15 0.03038637 0.011510747 0.02802872
## 7: 16 0.02385416 0.010365779 0.02632894
## 8: 17 0.02179563 0.009129213 0.02482358
## 9: 18 0.02085011 0.009358207 0.02348110
## 10: 19 0.02043408 0.008106375 0.02227639
## data.second.order.dist.freq data.dist.freq benford.dist.freq
## <num> <num> <num>
## 1: 24550 10473 7661.082
## 2: 1043 10326 6994.020
## 3: 957 5991 6433.875
## 4: 869 5768 5956.838
## 5: 728 4502 5545.683
## 6: 754 5624 5187.640
## 7: 679 4415 4873.039
## 8: 598 4034 4594.423
## 9: 613 3859 4345.952
## 10: 531 3782 4122.982
## benford.so.dist.freq data.summation abs.excess.summation difference
## <num> <num> <num> <num>
## 1: 2711.386 28701407 23224143 2811.9177
## 2: 2475.302 22324748 16847484 3331.9798
## 3: 2277.057 16258127 10780863 -442.8749
## 4: 2108.225 15520165 10042901 -188.8378
## 5: 1962.711 27393259 21915996 -1043.6833
## 6: 1835.994 49191988 43714724 436.3597
## 7: 1724.651 12523174 7045911 -458.0390
## 8: 1626.044 11994778 6517515 -560.4233
## 9: 1538.106 7545939 2068675 -486.9517
## 10: 1459.193 6987397 1510133 -340.9820
## squared.diff absolute.diff
## <num> <num>
## 1: 1032.084049 2811.9177
## 2: 1587.368773 3331.9798
## 3: 30.485235 442.8749
## 4: 5.986347 188.8378
## 5: 196.418497 1043.6833
## 6: 36.704517 436.3597
## 7: 43.053153 458.0390
## 8: 68.359901 560.4233
## 9: 54.561565 486.9517
## 10: 28.200147 340.9820
Step-8: To get the suspected/high-risk records, we may make use of the getSuspects function. As already stated, it requires both a benford object and the data as inputs.
# We are printing 10 records only
getSuspects(corp_bfd, corporate.payment) |>
head(10)
## VendorNum Date InvNum Amount
## <char> <Date> <char> <num>
## 1: 2001 2010-01-02 3822J10 50.38
## 2: 2001 2010-01-07 100107-2 1166.29
## 3: 2001 2010-01-08 11210084007 1171.45
## 4: 2001 2010-01-08 1585J10 50.42
## 5: 2001 2010-01-08 4733J10 113.34
## 6: 2001 2010-01-08 6263J10 117.22
## 7: 2001 2010-01-08 6673J10 50.80
## 8: 2001 2010-01-08 9181J10 114.78
## 9: 2001 2010-01-09 1510J10 50.49
## 10: 2001 2010-01-09 1532J10 50.45
Moreover, using the slice_max function from dplyr, we can also extract the n highest-valued 'suspects'.
getSuspects(corp_bfd, corporate.payment) |>
slice_max(order_by = Amount, n = 10, with_ties = FALSE)
## VendorNum Date InvNum Amount
## <char> <Date> <char> <num>
## 1: 2817 2010-10-27 10-10A 1156428.2
## 2: 17141 2010-04-05 040510 1135003.6
## 3: 2817 2010-11-30 1033500002 1112304.3
## 4: 16721 2010-09-16 SEE ATTACHED BALSHEET 1100000.0
## 5: 6118 2010-12-17 103511001 509093.7
## 6: 2817 2010-05-28 40821 506971.5
## 7: 17284 2010-03-24 032400 504580.6
## 8: 6118 2010-08-26 102381001 504334.6
## 9: 17284 2010-03-10 31000 502132.2
## 10: 2088 2010-03-24 1008300003 500000.0
Conclusion
Although, going by the statistics (goodness-of-fit metrics), the data did not conform to Benford's Law, the charts show that it otherwise conforms reasonably well, apart from an abnormally high number of records starting with the digits 50
. The reasons for this can be investigated further. We also extracted the suspected records for further investigation through other parameters/tests/verification. To sum up, Benford Analysis can be a good starting point for fraud/forensic analytics while auditing. Before closing, let us delve into one other example.
39.5.3 Example-3: Lakes Perimeter
Let us apply this to the lakes.perimeter
44 data, which is available with the package.
# Number of records
nrow(lakes.perimeter)
## [1] 248607
# View top rows
head(lakes.perimeter)
## perimeter.km
## 1 1.0
## 2 1.0
## 3 1.1
## 4 1.1
## 5 1.1
## 6 1.1
# Generate Benford Object
lake_ben <- benford(lakes.perimeter$perimeter.km, number.of.digits = 2)
Let us see the plots, metrics, and top outliers.
plot(lake_ben)

Figure 39.8: Benford Analysis - Lake Perimeter Data
# Chisq test
chisq(lake_ben)
##
## Pearson's Chi-squared test
##
## data: lakes.perimeter$perimeter.km
## X-squared = 88111, df = 89, p-value < 2.2e-16
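The degrees of freedom here come from the 90 possible first-two-digit patterns (10 to 99) minus one. As a hedged sketch in base R, we can compare the observed statistic against the 5% critical value to see just how far the data departs from conformity:

```r
# 90 possible first-two-digit patterns (10..99) give df = 90 - 1 = 89
critical_value <- qchisq(0.95, df = 89)   # roughly 112

# The observed statistic (88111) dwarfs the critical value,
# so the null hypothesis of conformity to Benford's Law is rejected
88111 > critical_value
```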
# MAD
MAD(lake_ben)
## [1] 0.006012766
# Does the data conform?
lake_ben$MAD.conformity
## [1] "Nonconformity"
# Get top-10 suspects
getSuspects(lake_ben, lakes.perimeter) |>
head(10)
## perimeter.km
## <num>
## 1: 1.5
## 2: 1.5
## 3: 1.5
## 4: 1.5
## 5: 1.5
## 6: 1.5
## 7: 1.5
## 8: 1.5
## 9: 1.5
## 10: 1.5
# Get top-10 suspects on Squared Differences
getSuspects(lake_ben, lakes.perimeter,
by = "squared.diff") |>
head(10)
## perimeter.km
## <num>
## 1: 3.6
## 2: 3.6
## 3: 3.6
## 4: 3.6
## 5: 3.6
## 6: 3.6
## 7: 3.6
## 8: 3.6
## 9: 3.6
## 10: 3.6
# Get top-10 suspects on Absolute Excess Summation
getSuspects(lake_ben, lakes.perimeter,
by = "abs.excess.summation") |>
head(10)
## perimeter.km
## <num>
## 1: 1.0
## 2: 1.0
## 3: 1.3
## 4: 1.3
## 5: 1.3
## 6: 1.3
## 7: 1.3
## 8: 1.3
## 9: 1.3
## 10: 1.3
Conclusion
We observed that the data does not conform to Benford's Law, which is evident from the plot as well as the MAD value. The chi-squared value of 88111
also exceeds the critical value by a very large margin. Nigrini and Miller offered some plausible explanations for this non-conformity in their research paper45. One possible reason they propose is that perimeter is not a proper measure of the size of a lake.
39.6 Conclusion
As we conclude this chapter on Benford Analytics, it’s clear that this statistical phenomenon holds remarkable potential across diverse fields. The inherent simplicity of Benford’s Law belies its complexity and applicability. Its ability to unveil anomalies, authenticate data integrity, and aid in forensic investigations underscores its significance in modern data analysis. As we delve deeper into its intricacies and practical applications, we unravel a tool that not only scrutinizes numbers but also illuminates new avenues for precision, authenticity, and trust in our data-driven world.
Further Reading
ISACA Journal Archives, "Understanding and Applying Benford's Law", 1 May 2011.
Newcomb, Simon. “Note on the Frequency of Use of the Different Digits in Natural Numbers.” American Journal of Mathematics, vol. 4, no. 1, 1881, pp. 39–40. JSTOR, https://doi.org/10.2307/2369148. Accessed 15 Jun. 2022.
Durtschi, Cindy & Hillison, William & Pacini, Carl. (2004). The Effective Use of Benford’s Law to Assist in Detecting Fraud in Accounting Data. J. Forensic Account.