'Results management' — detection and diagnosis using Benford's Law

Can the fabrication of research results be prevented?
Can the peer review process be augmented with
automated checking?
These questions become more important
with automated submission of data to archives.
The potential usefulness of automated methods of detecting at least some forms of
either intentional or unintentional ‘result management’ is clear.

Benford’s Law is a postulated relationship on the frequency
of digits (Benford 1938). It states that the distribution of the combination of digits
in a set of random data drawn from a set of random distributions
follows the log relationship (Hill 1998). Benford’s Law,
actually more of a conjecture, suggests the probability of
occurrence of a sequence of digits d is given by the equation:

Prob(d) = log10(1+1/d)

For example, the probability of the sequence of digits 1,2,3 is
given by log10(1+1/123).
Below is the distribution predicted by Benford’s Law
for the first four digits.


Fig 1. Expected distributions of the first four digits according to Benford’s Law.
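
The expected frequencies can be computed directly from the equation above. What follows is only a minimal sketch in R: the names benford_prob and leading_digit are made up for illustration, and the powers of 2 are used as stand-in data because they are a well-known Benford-like sequence, not because they appear in this post.

```r
# Benford probability that a number begins with the digit sequence d
# (d = 1 for a leading "1", d = 123 for a leading "1, 2, 3").
benford_prob <- function(d) log10(1 + 1 / d)

# Expected frequencies of the nine possible first digits.
first_digit <- benford_prob(1:9)
names(first_digit) <- 1:9
print(round(first_digit, 3))

# The worked example from the text: the sequence 1, 2, 3.
print(benford_prob(123))

# Helper (an assumption, not from the post): leading digit of each nonzero
# value, read from scientific notation to avoid floating point slips.
leading_digit <- function(x) {
  x <- abs(x[x != 0])
  as.integer(substr(sprintf("%.15e", x), 1, 1))
}

# Illustrative data only: observed versus expected first-digit frequencies.
x <- 2^(1:100)
observed <- table(factor(leading_digit(x), levels = 1:9)) / length(x)
print(round(rbind(observed = as.numeric(observed), expected = first_digit), 3))
```

A large gap between the observed and expected rows would be a reason to look more closely at a data set, not proof of result management by itself.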

Continue reading

Niche modeling — what is it?

There are a number of ways to answer this question.
There is a rich diversity of methods to predict
species’ distribution and they could be listed and described.
Alternatively, the biological relationships between
species and the environment could be emphasized, and approaches from
population dynamics used as a starting point.

A more general approach to niche modeling can be based on the
statistical idea of the probability distribution.

Definition: A niche model is a probability distribution defined on environmental variables.

Definition: A probability distribution f(E) is an assignment of a probability
to every interval on a set of environmental variables E.

This definition of the niche as a probability distribution
over sets of environmental variables allows for
developing niche models in new ways over
new entities.
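
As a rough illustration of the definition, the sketch below builds an empirical f(E) for a single environmental variable by assigning a probability to each interval. The temperature values are simulated and purely hypothetical, and the one-degree interval width is an arbitrary choice.

```r
set.seed(1)

# Hypothetical temperatures (deg C) at sites where a species was recorded.
temp <- rnorm(200, mean = 18, sd = 3)

# Intervals on the environmental variable E, one degree wide.
breaks <- seq(floor(min(temp)), ceiling(max(temp)), by = 1)

# f(E): the proportion of records falling in each interval.
f <- table(cut(temp, breaks, include.lowest = TRUE)) / length(temp)
print(round(f, 3))

# The probabilities over all intervals sum to 1.
print(sum(f))
```

The same construction extends to several variables by cutting each one into intervals and tabulating the combinations.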

Continue reading

Geographic models with R and netpbm

Geographic information is a major component of niche modeling in any
spatial science such as ecology.
Geographic Information Systems (GIS) are the tool of choice when
the main purpose is managing geographic information.

As in the previous chapter when R was used as a relational database,
R can be used to perform simple spatial tasks.
This both avoids the need for a separate GIS system when not necessary,
and helps to build knowledge of advanced use of the R language.
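
One simple spatial task of this kind is an overlay of gridded layers. The sketch below assumes two small rasters stored as R matrices; the values and the suitability thresholds are invented for illustration.

```r
# Hypothetical 4 x 5 grids: mean annual temperature (deg C) and rainfall (mm).
temp <- matrix(c(12, 14, 15, 17, 19,
                 11, 13, 16, 18, 20,
                 10, 12, 15, 17, 21,
                  9, 11, 14, 16, 18), nrow = 4, byrow = TRUE)
rain <- matrix(c(900, 850, 700, 650, 500,
                 950, 800, 720, 610, 480,
                 990, 870, 740, 590, 450,
                 980, 860, 760, 620, 470), nrow = 4, byrow = TRUE)

# Overlay: cells that satisfy both (assumed) environmental limits.
suitable <- temp >= 14 & temp <= 18 & rain >= 600

# Row and column indices of the suitable cells.
print(which(suitable, arr.ind = TRUE))

# Fraction of the grid that is suitable.
print(mean(suitable))
```

Because the comparison operators are vectorized, the overlay is a single expression with no explicit loop over cells.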

R is not very efficient for some of these operations as data
must be manipulated in a form suitable for mathematical operations,
and this limits the size of the data that can be handled.
Another more efficient way to perform basic ENM functions
on large sets of data is to use image processing.
For this, a good image processing package is called netpbm
and examples of the use of image utilities to perform fundamental
analytical operations for modeling are given.

Continue reading

Database system for niche modeling using R

In this section we show how to use R as a stand-alone database for niche modeling.
Even though R is a vector programming language, R has powerful
operations that replicate relational database operations including select and join.
Using R in this way avoids the need
to set up and interact with an additional piece of software for analysis.
In addition,
describing basic database operations in R helps to build knowledge of
R’s powerful indexing operations.

Loading and saving a database

One of the main languages used in databases is SQL, the Structured Query Language.
While not going into the syntax of this language, we will use it to compare with
similar operations written in R.
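
To give a feel for the comparison, here is a minimal sketch with two made-up tables, occ and env, held as R data frames. The SQL in the comments is the rough equivalent of each R expression. The table names, column names and values are all assumptions for illustration.

```r
# Hypothetical tables: species occurrences and site environments.
occ <- data.frame(site = c(1, 2, 2, 3, 4),
                  species = c("A", "A", "B", "B", "A"),
                  stringsAsFactors = FALSE)
env <- data.frame(site = 1:4,
                  temp = c(12.5, 15.0, 18.2, 21.1),
                  rain = c(900, 750, 600, 450))

# SELECT * FROM occ WHERE species = 'A'
occ_a <- occ[occ$species == "A", ]

# SELECT site, temp FROM env WHERE temp > 15
warm <- env[env$temp > 15, c("site", "temp")]

# SELECT * FROM occ JOIN env ON occ.site = env.site
joined <- merge(occ, env, by = "site")
print(joined)

# Saving and loading the "database" as a single R object.
path <- tempfile(fileext = ".rds")
saveRDS(joined, path)
reloaded <- readRDS(path)
print(all.equal(joined, reloaded))
```

Row selection goes before the comma in the index and column selection after it, which maps onto the WHERE and SELECT clauses of the SQL.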

Continue reading

Steps to improving numeracy in elementary school children

There is a problem in the education of children, and it’s nothing to do with the “No Child Left Behind” policy or funding levels. It is with the way elementary science is taught. Based on web searches of curricula around the country, science topics and examples seem to have a high profile in classroom activities. For example, weather, transportation, plants and so on are used in mini-projects to motivate journaling, thinking and communication, and to enrich the student’s knowledge of the world.

This emphasis on literacy is also the problem. While science topics feed developing literacy skills, there appears to be no attempt to integrate them with numeracy skills. Numeracy skills, like basic arithmetic, need to be evaluated in standardized tests but are poorly motivated in most classrooms.

Here are some examples of how science topics can be used to motivate numeracy skills. These are things I am doing with first graders.

  1. The balance beam. Using a domino as a fulcrum, a 12in ruler and Lego blocks, put a cross on graph paper where the numbers of blocks needed on each side balance. The number of blocks on the right hand side goes on the X-axis and the number of blocks on the left hand side on the Y-axis. Draw a straight line through the crosses.
  2. Balance beam doubling. Perform the same experiment, only placing the blocks on one side at half the distance from the fulcrum (i.e. at 9in instead of 12in). Talk about doubling.
  3. The weighing scale. Give each child a small spring scale to record the weights of various items in the classroom.
  4. Then make your own spring scale by attaching a ‘slinky’ to the ceiling, resting against the wall. Students then calibrate the scale by marking the length extension from a known weight on the graph paper, and drawing a line from the origin through the calibration point. Weigh other items and validate accuracy using the spring scales.
  5. Electricity. Give each student a battery, a piece of wire and a light bulb and ask them to make the bulb light. After they have worked that out, give them another battery — the brightness doubles. Talk about doubling, circuits.
  6. The triangle inequality. Get them to draw a triangle with sides of length 3in, 4in and 5in. Make up maths facts with these numbers and point out that the length of one side is always shorter than the sum of the lengths of the other two sides: the triangle inequality.
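
For teachers who want to check the arithmetic behind activities 2, 4 and 6 before class, here is a small sketch in R. The function names and the calibration numbers are made up for illustration, and the balance and spring are treated as ideal (blocks of equal weight, a spring that stretches in proportion to the load).

```r
# Balance beam: blocks times distance from the fulcrum must match on both sides.
balancing_blocks <- function(n_left, dist_left, dist_right) {
  n_left * dist_left / dist_right
}
# 2 blocks at 6in from the fulcrum balance this many blocks at 3in (half the distance).
print(balancing_blocks(2, 6, 3))

# Spring scale: a straight line through the origin and one calibration point.
known_weight <- 100     # grams, hypothetical calibration weight
known_extension <- 4    # inches of stretch, hypothetical
weight_from_extension <- function(extension) {
  extension * known_weight / known_extension
}
print(weight_from_extension(6))

# Triangle inequality: every side is shorter than the sum of the other two.
is_triangle <- function(sides) all(sides < sum(sides) - sides)
print(is_triangle(c(3, 4, 5)))
print(is_triangle(c(1, 2, 5)))
```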

Continue reading

Niche model basics in the R language

Successful modeling relies heavily on a few
basic concepts in mathematics and statistics. This post summarizes
the major areas you need to know for ecological niche modeling,
illustrated with examples in the vector language R.


We assume that readers have a basic knowledge of mathematics.
For people not familiar with the R language, it is helpful to have a summary of the
major types and operations.

R is a very powerful vector language
that supports the basic data types: integer,
numeric, logical, character/string, as well as more advanced
types factor, complex, and raw, and complex containers such as lists, vectors and matrices.
Some types not supported in most languages are as follows:

Continue reading

Google Earth Gets Avian Influenza Data

How many years have I been waiting for a fast, powerful, networked geospatial client with an open data format? Maybe this time. Google Earth could wipe other suitors off the proverbial map.

Nature magazine features an article on a dynamic map of Avian Influenza developed by Declan Butler, whose blog is here.

Here is a snapshot of the Google Earth Avian Influenza map showing the surge in cases in Europe and the Middle East this year.


Continue reading

For Science's Gatekeepers, a Credibility Gap

Readers of this blog were alerted early to the gathering storm with the post “Peer-censorship and scientific fraud.” Now the influential New York Times has a Health editorial on the topic entitled For Science’s Gatekeepers, a Credibility Gap.

Virtually every major scientific and medical journal has been humbled recently by publishing findings that are later discredited. The flurry of episodes has led many people to ask why authors, editors and independent expert reviewers all failed to detect the problems before publication.

The article is strong on problems but weak on solutions, e.g.:

Any influential system that profits from taxpayer-financed research should be held publicly accountable for how the revenues are spent. Journals generally decline to disclose such data.

I will add some links to commentary below as they become available.

Random numbers predict future temperatures

Previously “A New Temperature Reconstruction” used random data with long term persistence (LTP) to illustrate the circular reasoning behind the ‘hockey stick’ reconstruction of past temperatures. This one shows the potential for false positives due to the statistics used in the ‘hockey stick’. The dynamic simulation below shows future temperatures predicted using a random fractional differencing algorithm that generates realistic LTP behavior. Future temperatures and validation statistics are calculated each time the page is reloaded. One unusual statistic used in MBH98 suggests the future can be predicted using random numbers.
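
For readers who want to see what such a series looks like, here is a minimal sketch in R of one standard way to generate fractionally differenced noise, using a truncated moving-average expansion of (1 - B)^(-d). It is not the code behind the simulation on this page, and the values of n, d and m are assumptions chosen only for illustration.

```r
set.seed(42)

n <- 150    # number of simulated "years" (hypothetical)
d <- 0.3    # assumed fractional differencing parameter, 0 < d < 0.5
m <- 500    # truncation length of the moving-average expansion

# Weights of (1 - B)^(-d): psi_0 = 1, psi_k = psi_{k-1} * (k - 1 + d) / k.
k <- 1:m
psi <- c(1, cumprod((k - 1 + d) / k))

# White noise innovations, with m extra values to fill the filter.
e <- rnorm(n + m)

# One-sided convolution: x_t = sum over k of psi_k * e_{t-k}.
x <- stats::filter(e, psi, method = "convolution", sides = 1)
x <- as.numeric(x[(m + 1):(n + m)])

# Sample autocorrelation at lags 1 to 10; with LTP it decays slowly.
print(round(acf(x, lag.max = 10, plot = FALSE)$acf[-1], 2))
```

The weights psi fall off as a power of the lag, not exponentially, which is what gives the series its long memory.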

Note: This is a first version of the application; it may contain errors and could be improved considerably. The code is freely available under the GPL in order to promote open science. See The Reference Frame for more information.

Reload page for new prediction. Measured and predicted future temperatures, with years on the x axis, and temperature anomalies on the y axis. The measured temperatures are in blue and the simulated temperatures are in red. Black points are measured temperatures for years in the validation period.

Continue reading