Can the fabrication of research results be prevented?
Can the peer review process be augmented with automated checks?
These questions become more important
with the automated submission of data to archives.
The potential usefulness of automated methods of detecting at least some forms of
either intentional or unintentional ‘result management’ is clear.
Benford’s Law is a postulated relationship for the frequency
of digits (Benford 1938). It states that the digits
of data drawn at random from a random mix of distributions
follow a logarithmic distribution (Hill 1998). Benford’s Law,
actually more of a conjecture, gives the probability of
occurrence of a sequence of leading digits d as:
Prob(d) = log10(1 + 1/d)
For example, the probability of the sequence of digits 1,2,3 is
given by log10(1+1/123).
Below is the distribution predicted by Benford’s Law
for the first four digits.
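These probabilities are simple to compute directly. Here is a minimal sketch (in Python, rather than the R used for the post's package):

```python
import math

def benford_prob(d):
    """Benford's Law probability for a leading digit sequence d."""
    return math.log10(1 + 1 / d)

# First-digit probabilities, d = 1..9 (they sum to 1)
first_digit = {d: benford_prob(d) for d in range(1, 10)}

# Longer sequences work the same way, e.g. the leading digits 1,2,3:
p_123 = benford_prob(123)
```

About 30% of conforming values lead with a 1 (first_digit[1] ≈ 0.301), falling to under 5% for a leading 9.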
The frequency of digits can deviate from the law for a range of reasons,
mostly to do with constraints on possible values. Deviations due to human fabrication or alteration of data
have been shown to be useful for detecting fraud
in financial data (Nigrini 2000).
Although Benford’s Law has been shown to hold
for the first digits of some scientific data sets, particularly those
spanning many orders of magnitude, it is clearly not valid for data such as simple
time series where the variance is small relative to the mean.
As a simple example, data with
a mean of 5 and a standard deviation of 1 would tend to have leading
digits of 4, 5 and 6, rather than 1.
Despite this, it is possible
that subsequent digits may conform better.
A recent experimental study suggested that the second digit was a much more
reliable indicator of fabricated experimental data (Diekmann 2004).
Such a relationship would be very useful on time series data
as generated by geophysical phenomena.
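Both points can be checked numerically: the second-digit law follows by summing over the possible first digits, and a quick simulation shows how the first digits of narrow data fail. A sketch (in Python; the simulation is my own illustration, not Diekmann's experiment):

```python
import math, random

def benford_second(d):
    """Marginal Benford probability that the second significant digit is d."""
    return sum(math.log10(1 + 1 / (10 * d1 + d)) for d1 in range(1, 10))

second = [benford_second(d) for d in range(10)]
# second is much flatter than the first-digit law:
# second[0] ~ 0.120 down to second[9] ~ 0.085

# First digits of N(5, 1) data cluster on 4, 5 and 6:
random.seed(1)
data = [random.gauss(5, 1) for _ in range(1000)]

def first_digit(x):
    # first character of the scientific-notation mantissa
    return int("{:e}".format(abs(x))[0])

share_456 = sum(1 for x in data if first_digit(x) in (4, 5, 6)) / len(data)
```

For this simulated series, share_456 comes out above 0.8, so the first-digit law is hopeless there, while the near-flat second-digit law still has a chance of applying.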
This post reports the results of some tests of digit frequency
as a practical methodology for detecting ‘result management’
in geophysical series data. The code is written in R and is available as the R package Audit 0.1.
As a first example I fabricated 157
numbers intended to resemble random data.
I also generated a set of 200 random numbers with a normal distribution.
This simulated series is used to determine
if fabricated data can be detected.
Below are the results of the following commands executed in R. The first- and second-digit
distributions with fits are shown on a log-log plot.
dr <- rnorm(200)
df <- c(23, 34, 51, ...
benford(dr, plot=TRUE)
benford(df, plot=TRUE)
The table below quantifies the distributions on the plots.
In the case of the random data, the first digit
deviates significantly (p = 10^-58), but the second does not (p = 0.51).
In the case of the fabricated data, p is zero (p = 0) for both the first
and the second digit, indicating that the Chi2 test
on the second digit detects the deviance. Another way of
quantifying deviation is to sum the absolute differences
between the expected and observed
frequencies for each digit. The value of D2 for the fabricated data (D2 = 0.54) is much higher than for the random data (D2 = 0.17).
These results give confidence that simple statistical tests
on the second digit can detect fabricated data sets.
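Both statistics are easy to reproduce outside R. A sketch in Python; the ‘fabricated’ set here is my own stand-in (values rounded to halves), not the fabricated series used in the post:

```python
import math, random

BENFORD2 = [sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
            for d in range(10)]  # expected second-digit frequencies

def second_digit(x):
    # second character of the scientific-notation mantissa, sans the point
    mant = "{:.6e}".format(abs(x)).split("e")[0].replace(".", "")
    return int(mant[1])

def chi2_and_d2(data):
    """Chi2 statistic and D2 distance of second digits vs Benford."""
    n = len(data)
    obs = [0] * 10
    for x in data:
        obs[second_digit(x)] += 1
    chi2 = sum((obs[d] - n * BENFORD2[d]) ** 2 / (n * BENFORD2[d])
               for d in range(10))
    d2 = sum(abs(obs[d] / n - BENFORD2[d]) for d in range(10))
    return chi2, d2

random.seed(42)
random_data = [random.gauss(0, 1) for _ in range(200)]
# stand-in 'fabricated' data: rounded to halves, so the second
# significant digit is almost always 0 or 5
fake_data = [round(random.uniform(1, 10) * 2) / 2 for _ in range(200)]

chi2_r, d2_r = chi2_and_d2(random_data)
chi2_f, d2_f = chi2_and_d2(fake_data)
```

The Chi2 statistic can be compared against the df = 9 critical value (about 16.9 at the 0.05 level); the rounded set produces a far larger statistic and distance than the random one.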
For time series testing, I inserted the 157 fabricated numbers
between the sets of random numbers to form a single
series, and calculated the Chi2 statistic on the second digit
over a moving window. For example:
The simulated data consists of a random sequence on both sides and fabricated data in the center.
The red line is the probability of the Benford’s Law distribution in the 2nd digit,
calculated on a moving window of size 50. The green line
is a benchmark probability below which the deviation from the Benford’s Law distribution is significant. The figure below the first one shows the same analysis applied to the first
difference of the initial series. The blue lines delineate the fabricated series (although it is shifted a bit in the second figure).
In the figure above, the fabricated region is clearly detected in both the
original series and the differenced series when the
red line falls below the green line.
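The moving-window scan itself is only a few lines. A hedged sketch (in Python; the simulated series, the window handling, and tracking the raw Chi2 statistic rather than the plotted probability are my own choices):

```python
import math, random

BENFORD2 = [sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
            for d in range(10)]  # expected second-digit frequencies

def second_digit(x):
    mant = "{:.6e}".format(abs(x)).split("e")[0].replace(".", "")
    return int(mant[1])

def moving_chi2(series, size=50):
    """Chi2 statistic of second digits over each moving window."""
    stats = []
    for i in range(len(series) - size + 1):
        obs = [0] * 10
        for x in series[i:i + size]:
            obs[second_digit(x)] += 1
        stats.append(sum((obs[d] - size * BENFORD2[d]) ** 2
                         / (size * BENFORD2[d]) for d in range(10)))
    return stats

random.seed(0)
clean = lambda n: [random.gauss(0, 1) for _ in range(n)]
# stand-in fabricated block: second digit almost always 0 or 5
fake = [round(random.uniform(1, 10) * 2) / 2 for _ in range(157)]
series = clean(100) + fake + clean(100)

stats = moving_chi2(series)
```

Windows lying wholly inside the fabricated block (indices 100 to 207 here) give far larger statistics than windows covering only the clean data, which is the same separation the red and green lines show.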
Tidal height data
The Port Arthur tidal height data sets were obtained from John Hunter (Hunter et al. 2003).
The first file called present_dat_gmt is the full (1184-day) data-set from
the Port Arthur tide gauge from 24/6/1999 to 20/9/2002 collected
with an Aquatrak 4100 with Vitel WLS2 logger.
The figure above shows the distribution of both the first and second digits has a smooth profile. This is to be expected
of a large set of instrumentally collected data. While the
distribution is similar to Benford’s, there does appear to be
a systematic difference.
The manually recorded data set porta40.txt
is a digitised version of sea level data hand
collected in imperial units (feet and inches) by Thomas
Lempriere at Port Arthur, Tasmania, Australia. They cover the period
1840 to 1842, but are incomplete for 1840. The data were
converted into decimal feet prior to analysis.
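The conversion is simply feet plus inches over twelve. A tiny illustrative sketch (the function name is my own, not from the Audit package):

```python
def to_decimal_feet(feet, inches):
    """Convert a feet-and-inches reading to decimal feet."""
    return feet + inches / 12

# a reading of 4 ft 6 in becomes 4.5 decimal feet, so half-foot
# readings land the second significant digit on 5
```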
The figure above shows the distribution of the second digit in the hand collected data deviates significantly from Benford’s Law predictions due to elevated frequencies of 0’s and 5’s.
This may have been due to human ‘rounding error’
biasing the data towards either whole or half feet. Evidence of
some form of ‘result management’ in the hand-collected data set is shown in the table below. The Chi2 test on
the second digit shows significant deviance from the
expected Benford’s distribution in both data sets. However
the distance measure appears to discriminate between
the two sets, with the instrumental set giving a D2
value of 0.03 and the hand-coded set a value of 0.33.
The figure shows the Chi2 probability
dropping below the green line in a number of regions throughout
the series. John Hunter communicated that he believed different people were responsible for collecting the data over the period, which may explain the transitions in the graph.
The program detected the difference in digit distribution between
randomly generated and fabricated data. A moving window analysis correctly identified the location of the fabricated data in a time series.
The digit distribution of the tidal gauge data set collected by instruments
showed significant deviation from Benford’s Law under the Chi2 test, although this was not apparent from visual examination of the distributions. The tidal gauge data set collected by hand
showed evidence of rounding of results to whole or half feet.
The instrumental data set is very large (n=28,179) and this could
account for the sensitivity of the Chi2 test to deviation. Another
explanation may simply be that
Benford’s Law does not hold exactly for this kind of data.
These results show that the distributions of the 2nd digit can be used
to detect and diagnose ‘result management’ in geophysical time series data. These methods could have wide applicability for
assisting in quality control of a wide range of data sets, particularly
in conjunction with data archival processes.
An initial version of an R package for these analyses, called Audit 0.1, is available. The package is released under the GPL, and assistance will be needed for its further development.
- Benford, F. (1938). The law of anomalous numbers. Proceedings of the American Philosophical Society, 78(4), 551–572.
- Nigrini, M. (2000). Digital Analysis Using Benford’s Law: Tests and Statistics for Auditors. Vancouver, Canada: Global Audit Publications.
- Diekmann, A. (2004). Not the First Digit! Using Benford’s Law to Detect Fraudulent Scientific Data. Swiss Federal Institute of Technology Zurich, October 2004.
- Hunter, J., Coleman, R. and Pugh, D. (2003). The sea level at Port Arthur, Tasmania, from 1841 to the present. Geophysical Research Letters, 30(7), 54-1 to 54-4, doi:10.1029/2002GL016813.
- Hill, T.P. (1998). The First Digit Phenomenon.