Niche model basics in the R language

Successful modeling relies heavily on a few
basic concepts in mathematics and statistics. This post summarizes
the major areas you need to know for ecological niche modeling,
illustrated with examples in the vector language R.

Elements

We assume that readers have a basic knowledge of mathematics.
For people not familiar with the R language, it is helpful to have a summary of the
major types and operations.

R is a very powerful vector language
that supports the basic data types: integer,
numeric, logical, character/string, as well as more advanced
types factor, complex, and raw, and complex containers such as lists, vectors and matrices.
Some types not supported in most languages are as follows:

Factors are used to express categories, or enumerated types, and consist of a finite set of
named levels. They can also be ordered. They are important to know about because, before R 4.0.0,
R imported character columns of data tables as factors by default, and this gets confusing
when you want them to be numbers. The example shows a factor
of population density classes of a species.

> factor(c("1", "2", "3", "4"), ordered = TRUE)

[1] 1 2 3 4
Levels: 1 < 2 < 3 < 4
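The confusion between factors and numbers mentioned above can be sketched as follows. The density values are made up for illustration; the point is only that as.numeric on a factor returns the internal level codes, not the labels.

```r
# Hypothetical density classes that arrived as a factor
d <- factor(c("10", "20", "20", "40"))

as.numeric(d)                 # internal level codes: 1 2 2 3
as.numeric(as.character(d))   # the intended numbers: 10 20 20 40
```

Converting through as.character first is the usual way to get the labels back as numbers.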

Complex numbers are of the form x+yi where x and y
are real numbers and i is equal to the square root of -1. The two numbers are called
the real and imaginary part. These are useful abstract quantities as meaningful
solutions often result from basic numeric operations on them.
For example, the two parts can represent the coordinates of a point in a plane.

> j <- 154.1 - 22.3i
> j^2

[1] 23249.52-6872.86i
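As a small sketch of the idea that the two parts can hold the coordinates of a point, the block below stores two locations as complex numbers and measures the straight-line distance between them. The names p and q and the coordinate values are illustrative assumptions.

```r
# Two hypothetical locations, real part = x, imaginary part = y
p <- complex(real = 123.12, imaginary = -45)
q <- complex(real = 122,    imaginary = -41)

Re(p)        # x coordinate of p
Im(p)        # y coordinate of p
Mod(p - q)   # straight-line distance between p and q in the plane
```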

Type raw holds bytes and is useful for handling binary data in a compact form.
Arithmetic is not defined on raw values; the main operations are the bitwise ones, AND, OR and NOT.
Each byte is displayed in hex notation, where the digit values
0 to 15 are written as the characters 0 to f.
Raw values are most frequently used in images, where
the numbers represent intensity, e.g. 255 is white and 0 is black.
However, when images are used in a numeric context, the raw values can also represent
categories such as vegetation types in a vegetation map, or normalized
values such as average temperature or rainfall. Bitwise operations can then be
useful for tasks such as masking out areas.

> as.raw(255)

[1] ff

> as.raw(15) | as.raw(255)

[1] ff
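Masking with a bitwise AND might look like the sketch below. The class codes and the mask are invented for illustration; a mask byte of ff keeps a cell and 00 blanks it out.

```r
# Hypothetical vegetation class codes for five map cells
veg  <- as.raw(c(3, 7, 12, 7, 1))
# Mask: ff keeps a cell, 00 removes it
mask <- as.raw(c(255, 255, 0, 0, 255))

veg & mask   # 03 07 00 00 01
```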

Vectors are an ordered set of items of identical type and are one
of the most versatile features of R. Vectors can be created in a number of
ways. Below are some of the most common:

> x <- c(2, 1, 4, 6, 5)
> x

[1] 2 1 4 6 5

> y <- rep(3, 10)
> y

 [1] 3 3 3 3 3 3 3 3 3 3

> z <- seq(1980, 1990)
> z

 [1] 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990

Lists contain an ordered set of items, usually named, that can be of different types.
They are a general purpose type for holding all kinds of data.
Below is an example of a list holding a vector of locations of a species
and the species name.

> list(coords = c(123.12 - (0+45i), 122 - (0+41i), 130 - (0+40i)),
+     species = "Puma concolor")

$coords
[1] 123.12-45i 122.00-41i 130.00-40i

$species
[1] "Puma concolor"

Data frames are an extremely useful construct for organizing
data in R, very similar to tables or spreadsheets. A data frame is essentially
a list of vectors of equal length. That is, each column in a table can be a
different type, but they must all have the same number of items. Data frames
are very commonly used in reading in data for analysis via the read.table command.
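A data frame can also be built by hand, which makes the "list of equal-length vectors" idea concrete. The site names and values below are hypothetical.

```r
# Hypothetical survey records: one row per site
sites <- data.frame(
  site    = c("A", "B", "C"),
  temp    = c(12.5, 18.1, 9.7),     # mean temperature
  present = c(TRUE, TRUE, FALSE)    # was the species recorded?
)

sapply(sites, class)          # each column has its own type
sites$temp[sites$present]     # temperatures where the species occurs
nrow(sites)                   # every column has the same length
```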

Time series are another useful construct for handling series whose elements are observed
at regular intervals, where the frequency of observation can differ between series.
The ts command creates a time series
with information on the start, end, frequency and actual data.

> ts(1:10)

Time Series:
Start = 1
End = 10
Frequency = 1
 [1]  1  2  3  4  5  6  7  8  9 10
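A series with a frequency other than one could be set up as in this sketch, which assumes monthly data starting in January 1980; the values themselves are placeholders.

```r
# Two years of hypothetical monthly values starting in January 1980
m <- ts(1:24, start = c(1980, 1), frequency = 12)

frequency(m)                                        # 12 observations per year
w <- window(m, start = c(1980, 11), end = c(1981, 2))
w                                                   # November 1980 to February 1981
```

The window function uses the stored start and frequency to select by date rather than by position.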

A matrix is a two dimensional vector with a defined number of rows and columns.
The number of columns and rows is set or returned by the dim command.

> m <- matrix(0, 20, 20)  # fill value is an assumption
> dim(m)

[1] 20 20

Below is a table of the basic types with examples.

Type         Examples
Integer      7L
Numeric      5.6
Logical      TRUE, FALSE
Character    "here"
Factor       factor("1")
Complex      0+0i
Raw          as.raw(255), displayed as ff
Constants    pi, NULL, Inf, -Inf, NaN, NA
Vectors      1:10, rep(1,10), seq(0,10,1)
Matrices     matrix(0,3,3), array(0, c(3,3))
Lists        list(x=1, y=1)
Data frames  data.frame(x=numeric(10), y=character(10))

Operations

R is dynamically typed, and values are usually coerced to the appropriate
type automatically, e.g. integer + numeric = numeric.
Each of these types uses the usual operators available in most computer languages:

Type          Operators
Numeric       x+y, x-y, x*y, x/y, x^y
Logical       !x, x&y, x&&y, x|y, x||y, xor(x, y), isTRUE(x)
Bitwise       !, |, &
Relational    x<y, x>y, x<=y, x>=y, x==y, x!=y
Assignment    x<-value, x<<-value, value->x, value->>x
Accessors     x$y, x[y], pkg::name, pkg:::name
Constructors  x:y, x=y, y~model
Unlike most languages, however, R as a vector language overloads these operators even further to
operate on whole sets of numbers.
More complex structures such as vectors and lists can be treated
as basic types because many of the basic operators apply to them.
Basic arithmetic operations like addition are elementwise on vectors.
When vectors are of unequal length, the shorter one is recycled (it wraps around).
Logical operations on numeric vectors produce logical vectors, which are very important
for building complex expressions.
For example, let's use
the vectors we constructed using ‘c’, ‘seq’ and ‘rep’ and perform
some operations on them.

> x + y

 [1] 5 4 7 9 8 5 4 7 9 8

> !y

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> x > y

 [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
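One reason logical vectors matter so much is that they can be summed and used as indexes. The sketch below redefines two short vectors under the assumed names a and b so that it runs on its own.

```r
a <- c(2, 1, 4, 6, 5)   # length 5
b <- rep(3, 10)         # length 10

a + b                   # a is recycled twice to match the length of b
sum(a > b)              # TRUE counts as 1, so this counts the TRUE elements
a[a > 3]                # a logical vector used as an index: 4 6 5
```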

Functions

There are many ways to introduce functions. In most languages they
are given special treatment, but in R they can in many ways be regarded
as another simple type because, just like the other simple types,
they can be passed as arguments to other functions.
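Passing a function as an argument can be sketched with sapply, which applies a function to each element of a vector. The function name square is a made-up example.

```r
# A hypothetical function to pass around
square <- function(x) x^2

# sapply takes the function itself as its second argument
sapply(1:4, square)              # 1 4 9 16

# A function can also be written inline, without a name
sapply(1:4, function(x) x + 1)   # 2 3 4 5
```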

In mathematical terms they are described
as a mapping from one domain to another, such as numbers to numbers.
The identity function that returns whatever is input
would be defined in the following way:

> f <- function(x) x
> f("this")

[1] "this"

Another approach to a functional relationship is to define it on the Cartesian
product of two sets X and Y. Then f is a function provided there is at most one
element y of Y that is related to each x via f.
When y is uniquely determined by the value of x, x acts like an index.
Using this definition, indexing of a vector can be regarded as a basic function,
where x is the position of the element.

> f <- function(x, y) y[x]
> f(3, y) == y[3]

[1] TRUE

The examples so far have returned single values, but typical R functions return
much more complex values: vectors, lists and data frames.
Many functions operate on whole vectors, so they can be viewed as parallel functions.
The first example below simulates an annual cycle over twelve months. The second example is a very important
application of indexing a vector with a vector. The third shows the difference between
the simple max operation and the parallel pmax operation.

> daylight <- function(x) -cos(2 * pi * x/12)
> daylight(1:12)

 [1] -8.660254e-01 -5.000000e-01 -6.123234e-17  5.000000e-01  8.660254e-01
 [6]  1.000000e+00  8.660254e-01  5.000000e-01  1.836970e-16 -5.000000e-01
[11] -8.660254e-01 -1.000000e+00

> z[x]

[1] 1981 1980 1983 1985 1984

> max(x, 4)

[1] 6

> pmax(x, 4)

[1] 4 4 4 6 5

R also contains a rich set of built in functions that make many programming
tasks simple and efficient. Here is a list of some of the ones used throughout this book.

Function   Description
aggregate  Splits the data into subsets, computes summary statistics for each
cumsum     Returns a vector whose elements are the cumulative sums
filter     Applies linear filtering to a univariate time series
hist       Computes and plots a histogram of the given data values
spectrum   Estimates the spectral density of a time series
acf        Computes and plots estimates of the autocorrelation function
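Two of the functions in the table can be tried on a small made-up data set. The column names site and count and the values are assumptions for illustration only.

```r
# Hypothetical counts of a species at two sites over three visits
obs <- data.frame(site  = c("A", "A", "A", "B", "B", "B"),
                  count = c(2, 0, 5, 1, 1, 3))

cumsum(obs$count)                                       # running total: 2 2 7 8 9 12
tot <- aggregate(count ~ site, data = obs, FUN = sum)   # one total per site
tot                                                     # A = 7, B = 5
```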

Basic models used in niche modeling

A basic understanding of various kinds of functions is necessary for ecological niche modelling.
There are two main types. The first are those describing the response of a species to
an environmental driver, such as nutrients, rainfall, temperature, or many others.
The second are those describing the biological response over
time or space, which are modeled by functions that incorporate a large
degree of random error, called stochastic functions. These functions
can also be used in higher dimensions.

Response functions

The first example shows the basic relationships used for describing populations: linear, exponential and power.
Linear relations are the basic idealized way of describing the functional response of organisms to
anything from nutrients to predator populations. Typically responses are non-linear, and exponential
or power relationships are often seen and used.

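> i <- seq(0, 3, by = 0.1)  # assumed range; the original definition of i was lost
> y <- i                    # linear response
> plot(i, y, type = "l")
> lines(i, exp(i), lty = 2)
> lines(i, i^2, lty = 4)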

While useful, these cannot capture the basic concept of a niche model.
A niche is a concept that at minimum is the tendency of a species
to prefer a particular set or range of values. To express this with
a function requires a ‘hump’ or ‘inverted U’ shape centered on values optimal
to the species' growth and reproduction. Below are three ways to do this with
functions: a step function, a truncated quadratic, and an exponential.

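> i <- seq(-3, 3, by = 0.1)                 # assumed range
> y <- pmax(1 - i^2, 0)                     # truncated quadratic (assumed form)
> plot(i, y, type = "l")
> sf <- function(x) as.numeric(abs(x) < 1)  # step function (assumed form)
> lines(i, sf(i))
> lines(i, exp(-(i^2)), lty = 2)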

Another frequently encountered relationship is the periodic function, useful
for describing everything from daily and annual to multiyear cycles.
Relationships between periodic series to be aware of are period
doubling and additive cycles.

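> y1 <- sin(i)      # assumed form of the first cycle
> y2 <- sin(2 * i)  # assumed: a second cycle with half the period
> plot(i, y1 + y2, type = "l")
> lines(i, y1, lty = 2)
> lines(i, y2, lty = 4)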

Stochastic functions or series

The next important functions to know about are stochastic functions.
These are used to describe various forms of randomness in models.
While on the one hand a stochastic series can be regarded as a time series,
it can also be regarded as a function, with the input as the index according
to the equivalence above. The primary example is the rnorm function,
which returns a vector of normally distributed random numbers, by default with
mean zero and standard deviation one.

IID refers to the statistics of independent and identically distributed random numbers where
every y value is a simple random number with no reference to any other
number.

> par(mfrow = c(2, 1))
> y <- rnorm(1000)  # series length is an assumption
> plot(y)
> acf(y)

The second graph is an autocorrelation function, a useful tool for
discriminating different types of random variables. Correlation
is a quantification of the degree to which two variables are linearly
related to each other. Autocorrelation then is the degree to which
a variable is related to itself. The autocorrelation function (ACF) is
the correlation at each distance between points, called lags.

The IID series has an autocorrelation of one at zero lag, as every number
correlates perfectly with itself. Correlations at all other lags are below the level of
significance, indicated by the dashed line.
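The values behind the ACF plot can also be inspected directly. The sketch below assumes a series of 500 values and an arbitrary seed; the bound is the usual approximate 95% limit for an IID series, which is what the dashed lines in the plot represent by default.

```r
set.seed(1)                      # arbitrary seed, for repeatability
n <- 500
y <- rnorm(n)

a <- acf(y, plot = FALSE)        # compute the ACF without drawing it
a$acf[1]                         # lag 0: always 1
bound <- qnorm(0.975) / sqrt(n)  # approximate 95% limits for an IID series
mean(abs(a$acf[-1]) > bound)     # share of other lags outside the limits
```

For an IID series roughly one lag in twenty can still fall outside the limits by chance.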

A moving average is the application of an averaging window that moves
over the points in the series. As can be seen in the ACF plot, the averaging process
produces some autocorrelation between neighboring points at low lags. Note that we must
remove some values at each end of the averaged series, as the filter operation
consumes end points. The moving average is often called a low pass filter,
as it removes high frequency variations and leaves low frequency ones.

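> par(mfrow = c(2, 1))
> y <- filter(rnorm(1000), rep(1/3, 3))  # assumed: 3-point window, 1000 values
> y <- y[!is.na(y)]                      # drop the end points lost to the window
> plot(y)
> acf(y)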
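The loss of end points is easiest to see on a very short series. This sketch uses six made-up values and a centred 3-point window; stats::filter is spelled out in full in case another package masks the name filter.

```r
x  <- c(4, 8, 6, 2, 10, 12)
ma <- stats::filter(x, rep(1/3, 3))   # centred 3-point moving average

ma               # NA at both ends, where the window runs off the series
ma[!is.na(ma)]   # the four complete averages
```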

A random walk is a series where each value
is equal to the previous value plus some noise.
It is easily generated from the cumulative sum of IID random numbers
using the cumsum function in R.
The ACF plot shows that autocorrelations in a random walk are extremely persistent,
with significant correlation at lags of up to 100.

As an IID process has a finite average (zero) and a fixed variance, the series
tends to stay roughly level with the starting point no matter how many numbers are generated.
A random walk, however, has the property that its variance grows without bound as the series lengthens.
Over time, notice that the random walk appears to ‘trend’. The
series can diverge arbitrarily far from the starting point.

> par(mfrow = c(2, 1))
> y <- cumsum(rnorm(1000))  # series length is an assumption
> plot(y, type = "l")
> acf(y, lag.max = 100)
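How far a random walk can wander, compared with an IID series, can be checked by simulating many series at once. The sketch assumes 200 series of 1000 steps and an arbitrary seed, and compares the spread across series at three points in time.

```r
set.seed(1)   # arbitrary seed
# 200 independent series of 1000 steps, one series per column
steps <- matrix(rnorm(1000 * 200), nrow = 1000)
walks <- apply(steps, 2, cumsum)

# Standard deviation across series at steps 10, 100 and 1000
sd_iid  <- apply(steps[c(10, 100, 1000), ], 1, sd)   # IID: stays near 1
sd_walk <- apply(walks[c(10, 100, 1000), ], 1, sd)   # walk: keeps growing
sd_iid
sd_walk
```

The spread of the walks grows roughly with the square root of the number of steps, while the IID spread stays constant.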

The Markov process is a series where each value is
partially dependent on the previous value, but on no earlier
values. That is, there is no additional knowledge to be gained
about the future from the past values, as it is all captured in the present value.
This process describes such things
as water levels, where the daily evaporation is a random variable,
and allows an intermediate level of autocorrelation.

> par(mfrow = c(2, 1))
> y <- arima.sim(list(ar = 0.5), n = 1000)  # assumed: AR(1) with coefficient 0.5
> plot(y)
> acf(y)
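The dependence on only the previous value can be written out as an explicit loop. This is a sketch of a first-order autoregressive series under assumed settings: the weight phi = 0.7, the series length and the seed are all arbitrary choices.

```r
set.seed(1)        # arbitrary seed
n   <- 1000
phi <- 0.7         # assumed weight on the previous value
e   <- rnorm(n)    # IID noise

y <- numeric(n)
y[1] <- e[1]
for (t in 2:n) y[t] <- phi * y[t - 1] + e[t]   # previous value plus noise

# The autocorrelation should fall off roughly as phi^lag
r <- acf(y, plot = FALSE)$acf[2:4]
round(r, 2)
```

Setting phi to 0 gives an IID series and setting it to 1 gives a random walk, so this form sits between the two.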

Two dimensional models

All of these forms of functions can also apply in two dimensions.
A biological response function might represent the response
of a species to two environmental factors, such as temperature
and rainfall. The most common use of 2D functions is
a distribution over part of the earth’s surface, i.e. a map.

In R we can generate a two dimensional matrix of random numbers
and display it with the image command. We can also examine
the correlation structure treating the 2D matrix as a 1D vector.
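A minimal sketch of this, assuming a 100 by 100 grid of IID normal values and an arbitrary seed:

```r
set.seed(1)                                  # arbitrary seed
m <- matrix(rnorm(100 * 100), nrow = 100)    # 100 x 100 grid of IID values

par(mfrow = c(2, 1))
image(m)                           # the matrix drawn as a map
v <- as.vector(m)                  # columns are joined end to end
acf(v, lag.max = 200)
```

Because as.vector joins the columns end to end, a lag of 100 in the vector corresponds to neighboring cells in adjacent columns of the matrix.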

We can generate different types of autocorrelation structure
with the RandomFields package. Here we can see the
autocorrelation between neighboring points. The ACF also
shows a second peak of correlation at a lag of 100, corresponding to
the dimensions of the matrix, where each point is paired
with its neighbor in the adjacent column. Such features are called artifacts and are something
to watch out for in any form of analysis.

The third graph shows another useful diagnostic for autocorrelated series, the
spectral density, which gives the intensity of variation at particular frequencies.
IID series have a flat spectrum. However, autocorrelated series have
greater variation at the lower frequencies (longer lags), as shown in the figure.
The artifacts are more clearly revealed as regular ‘spikes’ in intensity in this plot.
Normally the periodogram should be smooth.
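The contrast between a flat and a low-frequency-dominated spectrum can be sketched in one dimension with the spectrum function. The series lengths and seed are arbitrary, and one-dimensional series stand in here for the two-dimensional fields discussed above.

```r
set.seed(1)                    # arbitrary seed
iid  <- rnorm(1024)
walk <- cumsum(rnorm(1024))

par(mfrow = c(2, 1))
s1 <- spectrum(iid,  main = "IID")           # roughly flat
s2 <- spectrum(walk, main = "Random walk")   # power concentrated at low frequencies
```

The returned objects hold the frequencies and spectral estimates in their freq and spec components, so the two series can also be compared numerically.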

There are many more forms of mathematical and statistical relationships that could be described
and used in models. This chapter covers the most basic and frequently used.
Others are introduced in subsequent chapters, but these will be used
frequently. It would be worthwhile to install R and experiment with them yourself and
become familiar with the various options and their effects. Playing with them
in this way will improve your intuition for models and, most importantly, prepare you
for recognizing the possible things that can go wrong.


davids@David-Stockwells-Computer.local
