'Results management' — detection and diagnosis using Benford's Law

Can the fabrication of research results be prevented?
Can the peer review process be augmented with
automated checking?
These questions become more important
with automated submission of data to archives.
Automated methods for detecting at least some forms of
intentional or unintentional ‘result management’ would clearly be useful.

Benford’s Law is a postulated relationship governing the frequency
of digits (Benford 1938). It states that the significant digits
of data drawn from a mixture of random distributions
follow a logarithmic distribution (Hill 1998). Benford’s Law,
really more of a conjecture, gives the probability of
occurrence of a leading sequence of digits d as:

Prob(d) = log10(1+1/d)

For example, the probability of the sequence of digits 1,2,3 is
given by log10(1+1/123).
Below is the distribution predicted by Benford’s Law
for the first four digits.

Fig 1. Expected distributions of the first four digits according to Benford’s Law.
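These probabilities are straightforward to compute. The post's code is in R, but as a quick illustration of the formula (a sketch, not part of the Audit package), in Python:

```python
import math

def benford_prob(d):
    """Probability under Benford's Law that a value's leading digits are d."""
    return math.log10(1 + 1.0 / d)

# First-digit probabilities for d = 1..9; they sum to exactly 1
first_digit_probs = {d: benford_prob(d) for d in range(1, 10)}

# The example from the text: the leading digit sequence 1,2,3
p_123 = benford_prob(123)
```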

The frequency of digits can deviate from the law for a range of reasons,
mostly to do with constraints on possible values. Deviations due to human fabrication or alteration of data
have been shown to be useful for detecting fraud
in financial data (Nigrini 2000).
Although Benford’s Law has been shown to hold
for the first digits of some scientific data sets, particularly those
spanning several orders of magnitude, it is clearly not valid for data such as simple
time series where the variance is small relative to the mean.
As a simple example, data with
a mean of 5 and a standard deviation of 1 tend to have leading
digits of 4, 5 and 6, rather than 1.
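This failure mode is easy to demonstrate. Below is a small Python sketch (illustrative only; the post's own code is R) showing that normal data with mean 5 and standard deviation 1 almost never has a leading digit of 1:

```python
import random
from collections import Counter

random.seed(0)

def first_digit(x):
    """First significant digit of a nonzero number."""
    return int(f"{abs(x):e}"[0])   # scientific notation, e.g. 5.1234e+00

# 10,000 draws from a normal distribution with mean 5, sd 1
sample = [random.gauss(5, 1) for _ in range(10000)]
counts = Counter(first_digit(x) for x in sample if x != 0)

# Leading digits pile up around 4, 5 and 6; a leading 1 is rare,
# the opposite of the roughly 30% that Benford's Law predicts.
```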

Despite this, it is possible
that subsequent digits may conform better.
A recent experimental study suggested that the second digit was a much more
reliable indicator of fabricated experimental data (Diekmann 2004).
Such a relationship would be very useful for time series data
such as those generated by geophysical phenomena.

This post reports the results of some tests of digit frequency
as a practical methodology for detecting ‘result management’
in geophysical series data. The code is written in R and is available as an R package, Audit 0.1.

Simulated Results

Random numbers

As a first example I fabricated 157
data points to resemble random numbers.
I also generated a set of 200 random numbers with a normal distribution.
This simulated series is used to determine
if fabricated data can be detected.

Below are the results of the following R commands. The first- and second-digit
distributions, with fits, are shown on log-log plots.

dr <- rnorm(200)
df <- c(23, 34, 51, ...
benford(dr,plot=TRUE)
benford(df,plot=TRUE)

Fig 2. Distribution of 1st and 2nd digits, random data.

Fig 3. Distribution of 1st and 2nd digits, fabricated data.

The table below quantifies the distributions in the plots.
For the random data, the first digit
deviates significantly (p = 2×10^-58), but the second does not (p = 0.51).

For the fabricated data, p for both the first digit
and the second digit is zero (p = 0), indicating that the Chi2 test
on the second digit detects the deviation. Another way of
quantifying deviation is to sum the absolute
differences between the expected and observed
frequencies for each digit. This distance for the second digit is much higher for the fabricated data (D2 = 0.54) than for the random data (D2 = 0.17).
These results give some confidence that the second digit,
together with a simple statistical test, can detect fabricated data sets.
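For readers who want to see the mechanics, here is a minimal sketch of this kind of second-digit test: extract the second significant digit, compare observed frequencies with the Benford expectation via a Chi2 statistic, and compute the distance as the sum of absolute differences. This is an illustrative Python sketch, not the implementation in the Audit package:

```python
import math

def benford_second_digit_probs():
    """P(second significant digit = d) under Benford's Law:
    sum over first digits d1 of log10(1 + 1/(10*d1 + d))."""
    probs = [0.0] * 10
    for d1 in range(1, 10):
        for d2 in range(10):
            probs[d2] += math.log10(1 + 1.0 / (10 * d1 + d2))
    return probs

def second_digit(x):
    """Second significant digit of a nonzero number."""
    s = f"{abs(x):e}"   # e.g. 1.234500e-01
    return int(s[2])    # the digit just after the decimal point

def second_digit_test(data):
    """Chi2 statistic and distance D2 = sum |observed - expected|
    for second-digit frequencies against Benford's Law."""
    probs = benford_second_digit_probs()
    digits = [second_digit(x) for x in data if x != 0]
    n = len(digits)
    obs = [0] * 10
    for d in digits:
        obs[d] += 1
    chi2 = sum((obs[d] - n * probs[d]) ** 2 / (n * probs[d]) for d in range(10))
    dist = sum(abs(obs[d] / n - probs[d]) for d in range(10))
    return chi2, dist
```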

Table 2. Digit frequency and Benford’s Law
for randomly generated and fabricated data sets.

Set         p1         p2    D1    D2
Random      2×10^-58   0.51  0.81  0.17
Fabricated  0.00       0.00  0.37  0.54

For time series testing, I inserted the 157 fabricated numbers
between two sets of random numbers to form a single
series, and calculated the Chi2 statistic on the second digit
over a moving window, e.g.:

benford(c(dr,df,dr),plot=TRUE,n=50)

The red line is the probability of the Benford’s Law distribution in the 2nd digit,
calculated over a moving window of size 50. The green line
is a benchmark probability below which the deviation from Benford’s Law is significant. The second figure applies the same analysis to the first
difference of the initial series. The blue lines delineate the fabricated segment (although it is shifted slightly in the second figure).

Fig 4. Deviation in the second digit from Benford’s Law (red line) in
a moving window of size 50.

The simulated data consists of a random sequence on both sides, with fabricated data in the center.
In the figure above, the fabricated region is clearly detected in both the
original series and the differenced series when the
red line falls below the green line.
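The moving-window idea can be sketched as follows. This illustrative Python version (again, not the Audit package) reports the Chi2 statistic for each window rather than the probability plotted above, so deviation is flagged when the statistic rises above the 5% critical value (about 16.9 for 9 degrees of freedom), rather than when a probability falls below a benchmark:

```python
import math

# Benford second-digit probabilities P(d) = sum_d1 log10(1 + 1/(10*d1 + d))
PROBS = [sum(math.log10(1 + 1.0 / (10 * d1 + d2)) for d1 in range(1, 10))
         for d2 in range(10)]

def second_digit(x):
    """Second significant digit, via scientific notation."""
    return int(f"{abs(x):e}"[2])

def window_chi2(window):
    """Chi2 statistic of second-digit frequencies against Benford's Law."""
    digits = [second_digit(x) for x in window if x != 0]
    n = len(digits)
    obs = [0] * 10
    for d in digits:
        obs[d] += 1
    return sum((obs[d] - n * PROBS[d]) ** 2 / (n * PROBS[d]) for d in range(10))

def moving_window_chi2(series, size=50):
    """Slide a window along the series; values above ~16.9 (the 5%
    critical value of Chi2 with 9 degrees of freedom) flag significant
    deviation from Benford's Law in the second digit."""
    return [window_chi2(series[i:i + size])
            for i in range(len(series) - size + 1)]
```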

Tidal height data

The Port Arthur tidal height sets were obtained from John Hunter (Hunter et al. 2003).
The first file, present_dat_gmt, is the full (1184-day) data set from
the Port Arthur tide gauge from 24/6/1999 to 20/9/2002, collected
with an Aquatrak 4100 with a Vitel WLS2 logger.

Fig 5. The distribution of both the first and second digits in the instrumental Port Arthur data.

The figure above shows that the distribution of both the first and second digits has a smooth profile, as is to be expected
of a large set of instrumentally collected data. While the
distribution is similar to Benford’s, there does appear to be
a systematic difference.

The manually recorded data set, porta40.txt,
is a digitised version of sea level data collected by hand
in imperial units (feet and inches) by Thomas
Lempriere at Port Arthur, Tasmania, Australia. The data cover the period
1840 to 1842, but are incomplete for 1840. They were
converted to decimal feet prior to analysis.
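For concreteness, the conversion applied to each reading is simply (a trivial sketch):

```python
def imperial_to_decimal_feet(feet, inches):
    """Convert a reading in feet and inches to decimal feet (12 inches per foot)."""
    return feet + inches / 12.0

# e.g. a reading of 5 ft 6 in
reading = imperial_to_decimal_feet(5, 6)   # 5.5 decimal feet
```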

Fig 6. The distribution of the first two digits in the hand-collected Port Arthur data.

The figure above shows that the distribution of the second digit in the hand-collected data deviates significantly from the Benford’s Law prediction, due to elevated frequencies of 0’s and 5’s.

This may be due to human ‘rounding error’
biasing the data towards whole or half feet. The evidence for
some form of results management in the hand-collected data set is shown in the table below. The Chi2 test on
the second digit shows significant deviation from the
expected Benford distribution in both data sets. However,
the distance measure does discriminate between
the two sets, with the instrumental set giving a D2
value of 0.03 and the hand-collected set a value of 0.33.
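A simple way to quantify this rounding signature (an illustrative sketch only, not Nigrini's rounding index) is to compare the observed share of second digits equal to 0 or 5 with the Benford expectation of about 0.216:

```python
import math

# Benford's expected share of second digits equal to 0 or 5 (about 0.216)
EXPECTED_05 = sum(math.log10(1 + 1.0 / (10 * d1 + d2))
                  for d1 in range(1, 10) for d2 in (0, 5))

def rounding_excess(data):
    """Observed minus expected share of second digits equal to 0 or 5.
    A large positive value suggests rounding to whole or half units."""
    seconds = [int(f"{abs(x):e}"[2]) for x in data if x != 0]
    share = sum(1 for d in seconds if d in (0, 5)) / len(seconds)
    return share - EXPECTED_05
```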

Table 1. Digit frequency of Port Arthur tidal data sets for instrument and hand recorded data sets.

Set         Chi2-1st  Chi2-2nd   D1    D2
Instrument  0.00      3×10^-44   0.23  0.03
Hand        0.00      2×10^-40   0.76  0.33

The figure below shows the Chi2 test probability
dropping below the green line in a number of regions throughout
the series. John Hunter communicated that he believed different people were responsible for collecting the data over the period, which may explain the transitions in the graph.

Fig 7. The deviation of the 2nd digit for the hand collected Port Arthur series.

Conclusions

The program detected the difference in digit distribution between
randomly generated and fabricated data, and a moving window analysis correctly identified the location of the fabricated data in a time series.
The digit distribution of the tide gauge data set collected by instruments
deviated significantly from Benford’s Law under the Chi2 test, although this was not apparent from visual examination of the distributions. The tide gauge data set collected by hand
showed evidence of rounding of results to whole or half feet.
The instrumental data set is very large (n=28,179), which could
account for the sensitivity of the Chi2 test to small deviations. Another
explanation is that Benford’s Law, being a conjecture,
may simply not be exactly accurate in this case.

These results show that the distribution of the 2nd digit can be used
to detect and diagnose ‘results management’ in geophysical time series data. These methods could have wide applicability for
assisting in the quality control of a wide range of data sets, particularly
in conjunction with data archival processes.

An initial version of an R package for these analyses, called Audit 0.1, is available. The package is covered by the GPL, and assistance with its further development would be welcome.

References

1. Benford, F., 1938. The law of anomalous numbers. Proceedings of the American
Philosophical Society, Vol. 78, No. 4 (March 1938), pp. 551–572.

2. Nigrini, M., 2000. Digital Analysis Using Benford’s Law: Tests & Statistics for Auditors.
Global Audit Publications, Vancouver, Canada.

3. Diekmann, A., 2004. Not the First Digit! Using Benford’s Law to Detect
Fraudulent Scientific Data. Swiss Federal Institute of Technology Zurich, October 2004.

4. Hunter, J., Coleman, R. and Pugh, D., 2003. The sea level at Port Arthur,
Tasmania, from 1841 to the present. Geophysical Research Letters,
Vol. 30, No. 7, 54-1 to 54-4, doi:10.1029/2002GL016813.

5. Hill, T.P., 1998. The First Digit Phenomenon. American Scientist, Vol. 86, pp. 358–363.


1. Congratulations David

I suggest that you put some “classical” datasets of your field of study in the package, for example, those .rwl datasets… Maybe some links to them…

Cheers.

2. Thanks Marcos for pointing out the rwl files. It’s not in the write-up above, but the package can read them now. It would be good to have an accounting data set too. I wonder if it would be possible to get historic fraudulent data sets?

I need to test it on a lot more data sets. Massaging the data into a good form for analysis while keeping the patterns is an issue. It would be great for it to be virtually automatic. This is what the option “auto” is intended for. Much to be done though.

3. Steve McIntyre says:

David, you may notice a familiar name in Hans’ note. I was just getting started at looking at proxies and this was one of the first puzzling patterns that I noticed. Hans solved the problem.

It was amazing that the actual measurement had been superseded by the transformed data and, as I recall, Hans was never able to get Thompson to provide the original data.

4. I think there is a lot of results management out there. I wonder what the digit frequency of the Quelccaya data would look like? I don’t intend the analysis to be restricted to digit frequency. Other tests could be included.

One of the main reasons I got into this was to look at the WCDP archive. Imagine something that could churn through all the paleo data for suspect data and diagnose its origin.

5. Larry Huldén says:

This checking of digit distribution is very interesting. Availability of the original measurements is really important.
I think that there is an increasing need for statistical checking of very big data sets in different ways. The problems may arise from systematic “errors” or varying rounding effects. I am not thinking of typing errors, errors in interpreting handwritten texts, or corresponding errors, which are a problem of their own.
I have an interesting data set of the death causes of nearly two million people from 1750 to 1850 in Finland. The death causes have been typed in by hundreds (or thousands) of amateurs interested in tracing their relatives. All the data is available on the internet. When checking the yearly age distribution of different death causes, I observed that there was a systematically higher representation of the age classes 20, 30, 40, 50, 60 and so on. From that we can infer that in some cases the exact age of the person was not known, so the priest wrote a “proxy” for the age. I don’t know if the rounding is biased upwards or downwards. This causes some problems in detailed analysis of rare death causes. I can only statistically adjust for the age classes.
We used the raw data unchanged in the study of malaria in Finland when we compared malaria with some other diseases. The effect of the deviating age classes is visible in the graphs available at http://www.malariajournal.com/content/4/1/19
The results for malaria are not expected to have been affected by this problem because we think that the bias has the same distribution for each disease.

6. Hi Larry. Nigrini developed an index for financial figures that is supposed to diagnose rounding, either up or down. I have implemented it in the package already, so it may be possible to answer the question about the rounding of ages at death using that.

7. Thanks for this information.
