Can the fabrication of research results be prevented? Can the peer review process be augmented with automated checking? These questions become more important with the automated submission of data to archives. The potential usefulness of automated methods for detecting at least some forms of intentional or unintentional ‘result management’ is clear.

Benford’s Law is a postulated relationship for the frequency of digits (Benford 1938). It states that the distribution of leading digit sequences in data drawn from a broad mixture of random distributions follows a logarithmic relationship (Hill 1998). Benford’s Law, actually more of a conjecture, gives the probability of occurrence of a leading sequence of digits *d* as:

*Prob(d) = log₁₀(1 + 1/d)*

For example, the probability of the sequence of digits 1, 2, 3 is given by *log₁₀(1 + 1/123)*.

Below is the distribution predicted by Benford’s Law

for the first four digits.
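These predicted probabilities are straightforward to compute directly from the formula. A minimal Python sketch (illustrative only; the post's own code is in R):

```python
import math

def benford_prob(d: int) -> float:
    # Probability that a number's leading digits are the sequence d,
    # per Benford's Law: log10(1 + 1/d).
    return math.log10(1 + 1 / d)

# First-digit probabilities P(1)..P(9); by construction they sum to exactly 1.
first_digit = {d: benford_prob(d) for d in range(1, 10)}
print(round(first_digit[1], 5))     # 0.30103 -- leading 1s are the commonest
print(round(benford_prob(123), 5))  # probability of the leading sequence 123
```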

The frequency of digits can deviate from the law for a range of reasons, mostly to do with constraints on possible values. Deviations caused by human fabrication or alteration of data have been shown to be useful for detecting fraud in financial data (Nigrini 2000).

Although Benford’s Law has been shown to hold for the first digits of some scientific data sets, particularly those spanning many orders of magnitude, it is clearly not valid for data such as simple time series where the variance is small relative to the mean. As a simple example, data with a mean of 5 and a standard deviation of 1 would tend to have leading digits of 4, 5 and 6, rather than 1.
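This point is easy to check by simulation. A small Python sketch (illustrative; not from the post's package):

```python
import random

random.seed(1)

def first_digit(x):
    # First significant digit of |x|, read off the scientific-notation mantissa.
    return int(f"{abs(x):.15e}"[0])

# Data with mean 5 and standard deviation 1: leading digits pile up on 4, 5, 6.
data = [random.gauss(5, 1) for _ in range(10000)]
first = [first_digit(x) for x in data]
share_456 = sum(d in (4, 5, 6) for d in first) / len(first)
print(share_456)   # roughly 0.82 -- nowhere near Benford's 30% of leading 1s
```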

Despite this, it is possible that subsequent digits may conform better. A recent experimental study suggested that the second digit is a much more reliable indicator of fabricated experimental data (Diekmann 2004). Such a relationship would be very useful for time series data such as that generated by geophysical phenomena.

This post reports the results of some tests of digit frequency as a practical methodology for detecting ‘result management’ in geophysical series data. The code is written in R and is available as an R package, Audit 0.1.

## Simulated Results

### Random numbers

As a first example, I fabricated 157 data points intended to resemble random numbers. I also generated a set of 200 normally distributed random numbers. These simulated series are used to determine whether fabricated data can be detected.

Below are the results for the R commands shown here. The first- and second-digit distributions, with fits, are shown on a log-log plot.

    dr <- rnorm(200)
    df <- c(23, 34, 51, ...)
    benford(dr, plot=TRUE)
    benford(df, plot=TRUE)

The table below quantifies the distributions in the plots. For the random data, the first digit deviates significantly (p = 2×10^{-58}) but the second does not (p = 0.51). For the fabricated data, p is zero for both the first and second digits (p = 0.00), indicating that the Chi2 test on the second digit detects the deviance. Another way of quantifying deviation is to sum the absolute differences between the expected and observed frequencies for each digit. The value of D2 for the fabricated data (D2 = 0.54) is much higher than for the random data (D2 = 0.17). These results give one confidence that the second digit and a simple statistical test can detect fabricated data sets.

Set | p1 | p2 | D1 | D2 |
---|---|---|---|---|
Random | 2×10^{-58} | 0.51 | 0.81 | 0.17 |
Fabricated | 0.00 | 0.00 | 0.37 | 0.54 |
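The Chi2 test and the D distance measure can be sketched in a few lines. A hedged Python illustration (the post's own implementation is the R `benford` function; everything below, including the synthetic data, is an independent sketch):

```python
import math
import random
from collections import Counter

random.seed(42)

# Marginal Benford probabilities for the second digit 0-9, obtained by
# summing over the first digit 1-9 (Hill 1998).
SECOND = [sum(math.log10(1 + 1/(10*d1 + d2)) for d1 in range(1, 10))
          for d2 in range(10)]

def second_digit(x):
    # Second significant digit of |x|, read off the scientific-notation mantissa.
    return int(f"{abs(x):.15e}".split("e")[0].replace(".", "")[1])

def chi2_and_D(data):
    """Chi-square statistic against the second-digit law, and the D measure:
    summed absolute differences of observed vs expected frequencies."""
    n = len(data)
    counts = Counter(second_digit(x) for x in data)
    chi2 = sum((counts[d] - n*SECOND[d])**2 / (n*SECOND[d]) for d in range(10))
    D = sum(abs(counts[d]/n - SECOND[d]) for d in range(10))
    return chi2, D

# Benford-conforming data: exponents uniform over four orders of magnitude.
conforming = [10 ** random.uniform(0, 4) for _ in range(2000)]
# Crudely fabricated data: a human over-using mid-range second digits.
fabricated = [float(f"{random.choice('123456789')}{random.choice('34567')}")
              for _ in range(2000)]

print(chi2_and_D(conforming))   # small chi-square: consistent with the law
print(chi2_and_D(fabricated))   # huge chi-square, far above the 0.05 cutoff
```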

For time series testing, I inserted the 157 fabricated numbers between the two sets of random numbers to form a single series, and calculated the Chi2 statistic on the second digit over a moving window, e.g.:

    benford(c(dr, df, dr), plot=TRUE, n=50)

The red line is the probability of a Benford’s Law distribution in the 2nd digit, calculated on a moving window of size 50. The green line is a benchmark probability level; values below it indicate significant deviation from the Benford’s Law distribution. The second figure shows the same analysis applied to the first difference of the initial series. The blue lines delineate the fabricated series (although they are shifted a bit in the second figure). The simulated data consist of a random sequence on both sides, with the fabricated data in the center. In the figures above, the fabricated region is clearly detected, in both the original series and the differenced series, where the red line falls below the green line.
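The moving-window scan can also be sketched outside R. A minimal Python illustration (the window size and splicing mirror the example above, but the data, helper names, and threshold are all assumptions of this sketch):

```python
import math
import random
from collections import Counter

random.seed(0)

# Marginal Benford probabilities for the second digit 0-9.
SECOND = [sum(math.log10(1 + 1/(10*d1 + d2)) for d1 in range(1, 10))
          for d2 in range(10)]
CRITICAL = 16.92   # chi-square 0.05 critical value for 9 degrees of freedom

def second_digit(x):
    # Second significant digit of |x|, read off the scientific-notation mantissa.
    return int(f"{abs(x):.15e}".split("e")[0].replace(".", "")[1])

def window_chi2(series, n=50):
    """Chi-square of second-digit frequencies in each length-n moving window."""
    stats = []
    for i in range(len(series) - n + 1):
        counts = Counter(second_digit(x) for x in series[i:i + n])
        stats.append(sum((counts[d] - n * SECOND[d]) ** 2 / (n * SECOND[d])
                         for d in range(10)))
    return stats

# Benford-conforming background with a fabricated block spliced into the middle
# (every fabricated value has second digit 5).
good = [10 ** random.uniform(0, 4) for _ in range(200)]
fake = [float(f"{random.choice('123456789')}5") for _ in range(157)]
series = good[:100] + fake + good[100:]

stats = window_chi2(series)
# Windows lying entirely inside the fabricated block give huge chi-square values.
print(sum(s > CRITICAL for s in stats))
```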

### Tidal height data

The Port Arthur tidal height data sets were obtained from John Hunter (Hunter et al. 2003). The first file, present_dat_gmt, is the full (1184-day) data set from the Port Arthur tide gauge from 24/6/1999 to 20/9/2002, collected with an Aquatrak 4100 with a Vitel WLS2 logger.

The figure above shows that the distributions of both the first and second digits have a smooth profile. This is to be expected of a large set of instrumentally collected data. While the distribution is similar to Benford’s, there does appear to be a systematic difference.

The manually recorded data set porta40.txt is a digitised version of sea level data hand-collected in imperial units (feet and inches) by Thomas Lempriere at Port Arthur, Tasmania, Australia. The data cover the period 1840 to 1842, but are incomplete for 1840. They were converted into decimal feet prior to analysis.

The figure above shows that the distribution of the second digit in the hand-collected data deviates significantly from Benford’s Law predictions, due to elevated frequencies of 0s and 5s. This may be due to human ‘rounding error’ biasing data towards whole or half feet. Evidence of some form of results management in the hand-collected data set is shown in the table below. The Chi2 test on the second digit shows significant deviance from the expected Benford distribution in both data sets. However, the distance measure does discriminate between the two sets: the instrumental set gives a D2 value of 0.03 and the hand-collected set a value of 0.33.

Set | Chi2-1st | Chi2-2nd | D1 | D2 |
---|---|---|---|---|
Instrument | 0.00 | 3×10^{-44} | 0.23 | 0.03 |
Hand | 0.00 | 2×10^{-40} | 0.76 | 0.33 |

The figure shows the Chi2 probability dropping below the green line in a number of regions throughout the series. John Hunter communicated that he believed different people were responsible for collecting the data over the period, which may explain the transitions in the graph.
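The excess of 0s and 5s produced by rounding to whole or half feet is easy to reproduce in simulation. A hedged Python sketch using synthetic heights (not the Port Arthur data):

```python
import random
from collections import Counter

random.seed(7)

def second_digit(x):
    # Second significant digit of |x|, read off the scientific-notation mantissa.
    return int(f"{abs(x):.15e}".split("e")[0].replace(".", "")[1])

# Simulated tide heights in feet, "hand-rounded" to the nearest half foot,
# mimicking an observer reading a graduated staff.
true_heights = [random.uniform(1, 9) for _ in range(5000)]
rounded = [round(h * 2) / 2 for h in true_heights]

freq = Counter(second_digit(h) for h in rounded)
share_0_and_5 = (freq[0] + freq[5]) / len(rounded)
print(share_0_and_5)   # 1.0 -- rounding to half feet forces every second digit to 0 or 5
```

A real hand-collected series would mix rounded and unrounded readings, so the 0/5 share would be elevated rather than total, but the mechanism is the same.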

## Conclusions

The program detected the difference in digit distribution between randomly generated and fabricated data. A moving-window analysis correctly identified the location of the fabricated data in a time series.

The digit distribution of the tidal gauge data set collected by instruments showed deviation from Benford’s Law, although this was not apparent from visual examination of the distributions. The tidal gauge data set collected by hand showed evidence of rounding of results to whole or half feet.

The instrumental data set is very large (n=28,179), and this could account for the sensitivity of the Chi2 test to deviation. Another explanation for the deviation is that Benford’s Law may simply not be exactly accurate in this case.

These results show that the distribution of the 2nd digit can be used to detect and diagnose ‘results management’ in geophysical time series data. These methods could have wide applicability in assisting the quality control of a wide range of data sets, particularly in conjunction with data archival processes.

An initial version of an R package for these analyses, called Audit 0.1, is available. The package is licensed under the GPL, and assistance will be needed for its further development.

## References

- Benford, F. (1938). The law of anomalous numbers. Proceedings of the American Philosophical Society, 78(4), 551–572.
- Nigrini, M. (2000). Digital Analysis Using Benford’s Law: Tests and Statistics for Auditors. Vancouver, Canada: Global Audit Publications.
- Diekmann, A. (2004). Not the First Digit! Using Benford’s Law to Detect Fraudulent Scientific Data. Swiss Federal Institute of Technology Zurich, October 2004.
- Hunter, J., Coleman, R. and Pugh, D. (2003). The sea level at Port Arthur, Tasmania, from 1841 to the present. Geophysical Research Letters, 30(7), 54-1 to 54-4, doi:10.1029/2002GL016813.
- Hill, T.P. (1998). The First Digit Phenomenon.

Congratulations David

This page is great.

I suggest that you put some “classical” datasets of your field of study in the package, for example, those .rwl datasets… Maybe some links to them…

Cheers.


Thanks Marcos for pointing out the rwl files. It’s not in the write-up above, but the package can read them now. It would be good to have an accounting data set too. I wonder if it would be possible to get historic fraudulent data sets?

I need to test it on a lot more data sets. Massaging the data into a good form for analysis while keeping the patterns is an issue. It would be great for it to be virtually automatic; this is what the “auto” option is intended for. Much to be done though.


May I point to measurement rounding artifacts in the Quelccaya ice core?

http://home.casema.nl/errenwijlens/co2/quelccaya.htm


Hans, that is a very elegant study. Thanks.


David, you may notice a familiar name in Hans’ note. I was just getting started at looking at proxies and this was one of the first puzzling patterns that I noticed. Hans solved the problem.

It was amazing that the actual measurements had been superseded by the transformed data and, as I recall, Hans was never able to get Thompson to provide the original data.


I think there is a lot of results management out there. I wonder what the digit frequency of the Quelccaya data would look like? I don’t intend the analysis to be restricted to digit frequency; other tests could be included.

One of the main reasons I got into this was to look at the WCDP archive. Imagine something that could churn through all the paleo data looking for suspect data and diagnosing its origin.


This checking of digit distribution is very interesting. Availability of the original measurements is really important.

I think that there is an increasing need for statistical checking of very big data sets in different ways. The problems may arise from systematic “errors” or varying rounding effects. I am not thinking of typing errors, errors in interpreting handwritten texts, or similar errors, which are a problem of their own.

I have an interesting data set of the death causes of nearly two million people from 1750 to 1850 in Finland. The death causes have been typed in by hundreds (or thousands) of amateurs interested in tracing their relatives. All the data are available on the internet. When checking the yearly age distribution of different death causes, I observed a systematically higher representation of the age classes 20, 30, 40, 50, 60 and so on. From that we can infer that in some cases the exact age of the person was not known, so the priest wrote a “proxy” for the age. I don’t know whether the rounding is biased upwards or downwards. This causes some problems in detailed analysis of rare death causes. I can only statistically adjust for the age classes.

We used the raw data unchanged in our study of malaria in Finland, in which we compared malaria with several other diseases. The effect of the deviating age classes is visible in the graphs available at http://www.malariajournal.com/content/4/1/19

The results for malaria are not expected to have been affected by this problem, because we think the bias has the same distribution for each disease.


Hi Larry. Nigrini developed an index for financial figures that is supposed to diagnose rounding, either up or down. I have implemented it in the package already, so it may be possible to answer the question about rounding of age in deaths using that.

