Scale Invariance of the Aggregate

For computational, statistical, or display reasons, daily data are often aggregated to a coarser time scale. This is done by splitting the sequence into segments along a coarser index grid and calculating a summary statistic, such as the mean, for each segment.
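As a minimal sketch of this kind of aggregation in base R, the snippet below builds two years of synthetic daily values (made up for illustration, not station data), labels each day with its calendar month, and takes the mean of each monthly segment. The names temp, month_id and monthly_mean are my own and are not part of the package.

```r
# Two years of synthetic daily values (an assumption, not station data).
dates <- seq(as.Date("2001-01-01"), as.Date("2002-12-31"), by = "day")
set.seed(1)
temp <- 10 + 8 * sin(2 * pi * as.numeric(format(dates, "%j")) / 365.25) +
  rnorm(length(dates))

# Coarser index grid: one label per calendar month.
month_id <- format(dates, "%Y-%m")

# Split along the grid and summarise each segment with the mean.
monthly_mean <- tapply(temp, month_id, mean)
length(monthly_mean)  # 24 segments, one per month
```

The same pattern works for annual aggregation by changing the label to format(dates, "%Y").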

Missing values cause problems when calculating the mean. By default in R, a single NA causes most arithmetic operations, including the mean, to return NA. The alternative is to calculate the mean after omitting the NAs. In the first case the calculated means are valid, but data are lost because every segment containing an NA becomes an NA. In the second, no data are lost, but the means can deviate wildly when the data come from strongly cyclical series such as temperature.
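The two behaviours are easy to see in base R. The sketch below uses made-up numbers: first a four-element vector, then sixty values split into two segments (labelled "A" and "B" purely for illustration) with a single missing value in the first.

```r
x <- c(1, 2, NA, 4)
mean(x)                # NA
mean(x, na.rm = TRUE)  # 2.333333

# Sixty illustrative values in two 30-value segments, one value missing.
vals <- rep(c(5, 15), each = 30)
seg <- rep(c("A", "B"), each = 30)
vals[10] <- NA

default_means <- tapply(vals, seg, mean)                # segment A becomes NA
omit_means <- tapply(vals, seg, mean, na.rm = TRUE)     # both segments kept
```

With the default, one missing day is enough to discard the whole of segment A; with na.rm = TRUE the segment survives, but its mean rests on fewer observations.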

figure2.1

Figure 1 shows the reduction in monthly aggregate data when na.rm=F.

figure2.2

Figure 2 shows the reduction when na.rm=F, with almost total loss on annual aggregation. While data are not lost with the option na.rm=T, the outliers at the start and end of the Rutherglen minimum data series illustrate its unexpected biasing effect.

The figures illustrate that a heterogeneous sequence is not ‘invariant’ with respect to aggregation using a mean. The only way to ensure invariance, which confers a degree of reliability under aggregation, is for the missing data to be randomly distributed within each segment of the coarse index.
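A small simulation makes the point about random versus clustered gaps. This is only a sketch under simple assumptions: a noise-free annual cycle with a true mean of 10, and 60 missing days placed either at random or in a single block in the cold part of the cycle. None of the names or numbers come from the station data.

```r
set.seed(42)
day <- 1:365
cycle <- 10 + 8 * sin(2 * pi * day / 365)   # true annual mean is 10

# Case 1: 60 missing days scattered at random through the year.
random_gap <- cycle
random_gap[sample(day, 60)] <- NA

# Case 2: 60 missing days in one block in the cold season.
block_gap <- cycle
block_gap[244:303] <- NA

# Absolute error of the NA-omitting annual mean in each case.
err_random <- abs(mean(random_gap, na.rm = TRUE) - mean(cycle))
err_block  <- abs(mean(block_gap,  na.rm = TRUE) - mean(cycle))
c(random = err_random, block = err_block)
```

The randomly placed gaps leave the annual mean close to its true value, while the same number of missing days in a block shifts it by more than a degree.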

Most studies define rules about the number of allowable missing values, but either these rules are not clearly stated, or they do not guarantee invariance, such as a fixed number of missing values (e.g. CAWCR).

Because heterogeneous data are not invariant under aggregation, it is best to analyze data at their original resolution.

Heterogeneous Weather Sequences

The recorded sequences of temperature and rainfall from weather stations are often strikingly heterogeneous, with many different formats, protocols, and disruptions in the records. The Australian temperature record we see is the product of smoothing algorithms used to produce graphic displays that hide this structure. The problem of parameter estimation must be approached using methods that approximate the ideal homogeneous case.

While a standard format for water data exists (WDF), there do not appear to be standards for temperature data. The function sequences reads and detects two types of data file downloaded from the Australian Bureau of Meteorology: Climate Data Online (CDO) and the ACORN-SAT reference data set. You can use a wild card to select the stations you want, or name a specific one.

CDO=sequences("../inst/extdata",stations="082039",maxmin=11,type="CDO",na.rm=T)
ACORN=sequences("../inst/extdata",stations="082039",maxmin="min",type="ACORN",na.rm=T)

The sequences function returns a ‘zoo’ series, a particularly powerful time series structure in R. Zoo series can be combined on the union or intersection of their dates with the ‘merge’ command.
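For readers new to zoo, here is a minimal sketch of merging on the union and on the intersection of dates. It assumes the zoo package is installed; the two short series and their values are invented for illustration and have nothing to do with the Rutherglen files.

```r
library(zoo)

d1 <- as.Date("2000-01-01") + 0:4   # 1 to 5 January
d2 <- as.Date("2000-01-03") + 0:4   # 3 to 7 January
tmax <- zoo(c(25, 27, 30, 28, 26), order.by = d1)
tmin <- zoo(c(12, 11, 13, 14, 10), order.by = d2)

# all = TRUE is the default: union of dates, padded with NA.
both_union <- merge(tmax, tmin)

# all = FALSE keeps only the dates present in both series.
both_inter <- merge(tmax, tmin, all = FALSE)

nrow(both_union)  # 7 dates
nrow(both_inter)  # 3 dates
```

The union keeps every observation at the cost of introducing NAs, while the intersection is complete but shorter.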

Zoo series can also represent time of day using the ‘YYYY-mm-dd hh:mm:ss’ format, allowing separate maximum and minimum temperature series to be combined into a single daily temperature series.

figure1.1

Figure 1 plots the Rutherglen minimum daily temperature from the raw CDO data and the homogenized ACORN network. The difference between the raw CDO data and the ACORN series is shown in blue. These differences are the adjustments made to the Rutherglen minimum series.

All of the sequences that match criteria can also be loaded with a command such as the following.

comp=sequences("../inst/extdata",maxmin=11,type="CDO")
compNotNAs=summary.sequences(comp)

The summary function returns descriptive statistics about the heterogeneity of single or multiple series, such as the dates of the first and last values and the number and proportion of NAs.
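To give a feel for what such a summary contains, here is a base R sketch that computes the same kind of statistics for a single series. The helper describe_gaps is hypothetical and is not the package's summary function; the dates and values are invented.

```r
# describe_gaps() is a hypothetical helper, not part of the package.
describe_gaps <- function(values, dates) {
  observed <- !is.na(values)
  data.frame(
    first = min(dates[observed]),   # date of the first recorded value
    last = max(dates[observed]),    # date of the last recorded value
    n = length(values),             # length of the series
    n_na = sum(!observed),          # number of missing values
    prop_na = mean(!observed)       # proportion of missing values
  )
}

dates <- seq(as.Date("2000-01-01"), by = "day", length.out = 10)
values <- c(NA, 4.1, 5.0, NA, NA, 3.2, 2.8, 6.1, NA, NA)
s <- describe_gaps(values, dates)
s
```

Here the series nominally spans ten days, but the first recorded value falls on the second day, the last on the eighth, and half of the values are missing.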

figure1.2

Figure 2 shows the number of active stations (non-NAs) among 176 stations in south-eastern Australia. Note the extremely uneven collection of weather data over time. Such extreme heterogeneity can easily bias analysis that is sensitive to the number of missing values over time.

For example, calculation of a mean temperature sequence could only be done reliably if the missing values were distributed uniformly. If the weather stations that recorded during the latter 20th century tended to be situated inland where the weather is hotter, this would tend to bias a simple average towards warming over that period. If the sequences were standardized on a common time period such as the 1960s, there would be a bias between the periods before and after the standard period.
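The inland station effect can be reproduced with two invented stations. In this sketch neither station has any trend: the "coastal" one sits at a constant 15 degrees and the "inland" one at a constant 20 degrees, but the inland station only starts reporting in 1975. All names and numbers are assumptions chosen for illustration.

```r
years <- 1950:1999
coastal <- rep(15, length(years))   # constant, no trend
inland <- rep(20, length(years))    # constant, no trend, but hotter
inland[years < 1975] <- NA          # inland station only opens in 1975

# Simple average across whichever stations reported each year.
network <- rowMeans(cbind(coastal, inland), na.rm = TRUE)

before <- mean(network[years < 1975])    # 15
after <- mean(network[years >= 1975])    # 17.5
after - before                           # 2.5
```

The network average steps up by 2.5 degrees in 1975 even though no thermometer recorded any change, purely because the set of reporting stations changed.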

Clearly, averaging or any similar operation is fraught with danger when applied to extremely heterogeneous sequences such as these.

Anomalies, Breakouts and Homogenization

Anomalies are the secrets hidden in the output of temperature recorders, the price of trades, or server loads. Anomalies represent trading opportunities, calls for resource reallocation or, in the case of weather stations, the need for instrument re-calibration or replacement. Detecting them has been of interest to Twitter. Correcting them, known as homogenization, has been greatly criticized. What are the limits to reliable detection of anomalies in real-world data?

A sequence is a function f: N \rightarrow R where N is the natural numbers, usually a discrete time step, and R is the real numbers. A series is the progressive sum of the terms in a sequence, S_n = \sum_{i=1}^{n} y_i. A typical example of a series is the ‘random walk’ produced by the cumulative sum of random values. One occasionally sees more general series used to model a given sequence, such as the progression of higher order terms in a polynomial regression model y_t = a_1 t + a_2 t^2 + \ldots + a_n t^n and the periodic terms of a Fourier series. A sequence space is a closed set of sequences; subtracting or adding sequences gives another valid sequence.
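In R the step from a sequence to its series is a single call to cumsum. The sketch below uses 100 independent normal values as the sequence (an arbitrary choice for illustration), builds the random walk from them, and checks that differencing the series recovers the sequence.

```r
set.seed(1)
y <- rnorm(100)   # a sequence of independent random values
S <- cumsum(y)    # the series: S[n] is the sum of y[1], ..., y[n]

# Differencing the series recovers the sequence (apart from the first term).
all.equal(diff(S), y[-1])   # TRUE
```

Plotting y and S side by side shows how the series wanders away from zero while the sequence stays centred on it.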

Real-world sequences contain missing values or ‘gaps’, and truncated start and end dates. Such data series are called ‘heterogeneous’. Missing values are represented by the ‘na’ from the R language, by augmenting the range of the sequence to R \cup \{na\}. One of the main questions I want to address is how the analysis can be done reliably in the presence of arbitrary missing values.

The series of posts is organized as follows. Section 2 describes operations on sequences and the improvement of detection limits with series instead of sequences. Section 3 introduces anomaly tests on spaces of sequences, that is, the comparison of the target with regional neighbours that is widely adopted in climatology. Section 4 uses the package to try to find anomalies in the Rutherglen daily minimum temperature data set.

Rutherglen is a small town in north-eastern Victoria, Australia, known for wine production. The ACORN-SAT station number 082039 is based on the still-open Rutherglen Research station 082039 (Lat: 36.10S Lon: 146.51E Elevation: 175m) and has a virtually unbroken record of daily readings from the 7th of November 1912, apart from missing days in the earlier part of the record and a gap with no records between 1960 and 1965.

This R package is motivated by exploration of Rutherglen's extreme trend divergence (image above from KensKingdom) between the raw Climate Data Online (CDO) version and the Australian temperature reference network (ACORN-SAT), a ‘homogenized’ reference network published by the Australian Bureau of Meteorology (BoM) in 2012. In the process I hope we develop a useful R package for a whole range of applications.