What if you were told that you could catch manipulation in your dataset just by observing how often the digits 1 to 9 occur? Sounds intriguing, doesn’t it? Read on to find out how.
Benford’s law states that in many naturally occurring datasets the leading digits of the numbers follow a characteristic distribution: the digit “1” appears as the first digit most often, “2” second most often, and so on down to “9”. The law also extends to the second and third digits and to the leading two- and three-digit combinations. The general distribution of leading digits is shown below:
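The expected frequencies come from a simple formula: the probability that a number's leading digits form the integer n is log10(1 + 1/n). The following is a minimal sketch of that formula; the function names are hypothetical helpers chosen for illustration, not part of any library.

```python
import math

def benford_leading(n: int) -> float:
    """Expected probability that a number's leading digits form the integer n.

    n in 1..9 gives the first-digit law; n in 10..99 gives the first two digits.
    """
    return math.log10(1 + 1 / n)

def benford_second_digit(d: int) -> float:
    """Expected probability that the second digit equals d (0..9)."""
    # Sum over every possible first digit k followed by d.
    return sum(benford_leading(10 * k + d) for k in range(1, 10))

# Print the expected first-digit percentages.
for d in range(1, 10):
    print(d, f"{100 * benford_leading(d):.1f}%")
```

The first-digit values run from roughly 30.1% for “1” and 17.6% for “2” down to about 4.6% for “9”. The second-digit distribution is much flatter, which is one reason second-digit tests need more care.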
A large deviation of your dataset from these expected values is taken as a sign of possible data manipulation or other tampering with the dataset. Benford’s law has been found to apply to a wide variety of datasets: population numbers, death rates, stock prices, electricity and water bills, lengths of rivers, heights of the tallest buildings, etc. generally follow it.
Often Benford’s law applies to data where no simple explanation accounts for why it should, yet it still holds when the dataset has not been tampered with.
The law is named after Frank Benford, who stated it in 1938 in a paper titled “The Law of Anomalous Numbers”. However, its discovery goes back to 1881, when the American astronomer Simon Newcomb noticed that in logarithm tables the early pages (those for numbers starting with 1) were much more used and worn than the other pages. He then published the result. This is the first known instance of the observation, and it includes a distribution for the second digit as well.
In 1938, the physicist Frank Benford noticed the same pattern and tested it on data from 20 different domains, and hence was credited with it. His initial data set included the surface areas of 335 rivers, the sizes of 3259 US populations, 104 physical constants, 1800 molecular weights, 5000 entries from a mathematical handbook, 308 numbers contained in an issue of Reader’s Digest, the street addresses of the first 342 persons listed in American Men of Science, and 418 death rates. The total number of observations used in the paper was 20,229. The law was later named after Benford.
Benford’s law is widely used in fraud and error detection. Because a large class of number sets follows the law, accountants, auditors, economists, and tax professionals have a benchmark for how often each leading digit should normally appear in a set.
• Accounting fraud detection - The law can be used to detect possible fraud in lists of socio-economic data submitted in support of public planning decisions. When the first-digit frequency distribution of the data is compared with the distribution expected under Benford’s law, any major deviation is a cause for concern and needs to be investigated. Ponzi schemes can also be flagged this way: unrealistic returns fall far from the expected Benford distribution.
• Election data - Another major use of Benford’s law is detecting election fraud. It was used as evidence of fraud in the 2009 Iranian presidential election: an analysis found that the second digits in vote counts for the winner differed significantly from the expectations of Benford’s law, and upon further inspection widespread ballot stuffing was found.
Other applications include forensic auditing and fraud detection. Applied to data from the 2003 California gubernatorial election, the 2000 and 2004 United States presidential elections, and the 2009 German federal election, the Benford’s law test was found to be worth taking seriously as a statistical test for fraud.
• Macroeconomic data - The macroeconomic data that the Greek government reported to the European Union before entering the eurozone was shown, using Benford’s law, to be probably fraudulent, albeit years after the country joined. The figures had indeed been manipulated, which supports the reliability of the law.
• Genome data - The law is also widely used on genome data. The number of open reading frames and their relationship to genome size differ between eukaryotes and prokaryotes. Benford’s law has been used to test this observation, with an excellent fit to the data in both cases.
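The comparison described in the accounting bullet, observed first-digit frequencies against the Benford expectation, can be sketched as follows. Pearson's chi-square statistic is used here as one common way to measure the deviation; that choice, the function names, and the stand-in data are assumptions for illustration and do not come from the studies mentioned above.

```python
import math
from collections import Counter

def leading_digit(x) -> int:
    """First significant digit of a non-zero number."""
    if isinstance(x, int):
        return int(str(abs(x))[0])
    # Scientific notation puts the first significant digit first, e.g. '3.2e-04'.
    return int(f"{abs(x):.15e}"[0])

def chi_square_vs_benford(values) -> float:
    """Pearson chi-square statistic of observed first digits against Benford."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # expected count for digit d
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

# Stand-in data, not taken from the article:
powers_of_two = [2 ** k for k in range(1, 1001)]  # known to follow Benford closely
plain_counting = list(range(1, 1000))             # every leading digit equally common

print(chi_square_vs_benford(powers_of_two))
print(chi_square_vs_benford(plain_counting))
```

With nine digit classes there are 8 degrees of freedom, for which the 5% critical value is about 15.51. A statistic well above that is a reason to look closer at the data, not proof of fraud on its own.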
Detection Of Fraud
The following chart shows a series randomly generated in Excel using a random function. The chart is nowhere close to a Benford curve, and this flat, straight-line result repeats even when the random numbers are recalculated multiple times. Hence, on seeing such a plot we can conclude that the data was artificially produced, which it was, using a standard computerized random number generator.
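A rough way to reproduce this flat result is to draw uniform random numbers and tally their leading digits. This sketch assumes a uniform generator comparable to a spreadsheet random function (the article does not say which function was used); the seed, sample size, and function name are arbitrary choices for illustration.

```python
import random
from collections import Counter

def leading_digit_shares(values):
    """Share of each leading digit 1..9 among the non-zero values."""
    # Scientific notation puts the first significant digit first.
    digits = [int(f"{abs(v):.15e}"[0]) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

rng = random.Random(42)                         # fixed seed, arbitrary
sample = [rng.random() for _ in range(10_000)]  # uniform on [0, 1)
shares = leading_digit_shares(sample)

for d in range(1, 10):
    print(d, f"{100 * shares[d]:.1f}%")
```

Each digit should come out near 11%, since a uniform draw on [0, 1) is equally likely to start with any of the nine digits. Under Benford's law the digit “1” alone would account for about 30%.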
Similarly, if a person uses a computer’s numeric keypad to type out random numbers, the results will not be close to Benford’s distribution either. Even if a person fabricates numbers mentally (using his or her brain rather than a computer), there is little reason to believe such a mental exercise would produce results that adhere closely to Benford’s curve. It is more likely that the person producing numbers mentally would tend to repeat certain patterns, and charting the frequency of the resulting leading digits might reveal those patterns.
Here is a histogram of the areas of 196 countries (the units are km²).
Here is a table with percentages. The “BL prediction” column is the percentage that Benford’s law predicts for each digit.
Here is a histogram of the population of each of the 3,142 counties or county equivalents in the United States.
Here is a table with percentages.
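A table like the ones described here can be produced for any list of positive numbers. The sketch below is illustrative only: the country and county data are not reproduced, so it uses the first 500 Fibonacci numbers as stand-in data (a sequence known to follow Benford's law closely), and `benford_table` is a hypothetical helper name.

```python
import math
from collections import Counter

def benford_table(values):
    """Rows of (digit, observed %, Benford-predicted %) for positive integers."""
    digits = [int(str(v)[0]) for v in values if v > 0]
    counts = Counter(digits)
    n = len(digits)
    return [
        (d, 100 * counts.get(d, 0) / n, 100 * math.log10(1 + 1 / d))
        for d in range(1, 10)
    ]

# Stand-in data: the first 500 Fibonacci numbers (not the article's datasets).
fib = [1, 1]
while len(fib) < 500:
    fib.append(fib[-1] + fib[-2])

print("Digit  Observed %  BL prediction %")
for d, observed, predicted in benford_table(fib):
    print(f"{d:>5}  {observed:>10.1f}  {predicted:>15.1f}")
```

To use it on real data, replace `fib` with the list of values being checked, such as areas or population counts.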
Benford’s law is a useful and well-tested check for data manipulation. It has been applied to many cases of suspected fraud and malpractice, including the 2009 Iranian elections, the 2000 and 2004 United States presidential elections, the 2009 German federal election, and the Greek government’s data manipulation before entering the eurozone. However, a major limitation is that the law can neither be applied blindly to all datasets nor can its results be trusted blindly. One should therefore be cautious both in applying the law and in interpreting its results. This caution becomes even more important when interpreting the second and third digits of a dataset.