Business decision-making that uses information such as customer profiles to predict returns on investment is coming to be known as the Predictive Enterprise (PE). While business analytics, or ‘quants’, are not new, engineering a greater degree of sophistication and integration of predictive modeling into business processes and structures is showing high returns in many industries.
A range of software tools for the predictive enterprise is becoming available. Well-known vendors like SPSS appear very serious about enabling the PE with off-the-shelf business solutions for the education, financial, marketing, insurance and telecommunications industries. A new book, “The Power to Predict”, tells how stepping beyond real-time to anticipate trends is changing the way web stats are analysed, software is deployed, and information systems are designed.
Successful predictive analytics relies heavily on a few basic concepts in mathematics and statistics. Off-the-shelf statistical packages hide the complexity and can be useful if they are tailored exactly to your application, but in many cases each situation is a little different. An understanding of the basics and the pitfalls is also necessary for communicating the results with confidence. Niche modeling, as practised in ecology, has developed sophisticated tools and treatments that are directly applicable to predictive analytics for business.
This post describes how predictive analytics can be expressed in terms of niche modeling.
The analysis is done in the R language, a powerful, reliable and free statistical environment modeled on the S statistical language.
Using R for predictive analytics is a low-cost and flexible solution, but it does require a basic knowledge of statistics and mathematics. R is a powerful language for a number of reasons, but its main feature is vector processing: the ability to perform operations on entire arrays of numbers without explicitly writing iteration loops. This allows code to be shortened considerably and loops to be implemented efficiently, and it encourages a parsimonious style of programming around larger data structures that is easier to maintain.
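As a minimal sketch of what vector processing means in practice, the snippet below rescales a set of incomes to the 0-1 range twice, once with an explicit loop and once with a single vectorized expression. The income values and variable names are invented for illustration only.

```r
# Hypothetical data: annual income (in thousands) for five customers.
income <- c(25, 40, 55, 70, 85)

# Loop version: rescale each income to the 0-1 range one element at a time.
scaled_loop <- numeric(length(income))
for (k in seq_along(income)) {
  scaled_loop[k] <- (income[k] - min(income)) / (max(income) - min(income))
}

# Vectorized version: the same arithmetic applied to the whole vector at once.
scaled_vec <- (income - min(income)) / (max(income) - min(income))

# Check that both approaches agree.
print(all.equal(scaled_loop, scaled_vec))
```

The vectorized line does the work of the whole loop, which is the style the rest of this post relies on.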
For readers not familiar with R, a summary comparing its major types and operations provides a rapid tutorial in using the language for niche modeling.
In the basic formalism for niche modelling, the main entity being modelled is the species, S. In a general sense, a species could be either a biological species, such as the mountain lion (Puma concolor), or a product, such as a model of digital camera, say the Nikon D50.
The niche of the species is a description of the environment for that species in terms of environment variables E. For example, environment variables for a biological species might be temperature and rainfall. Environment variables for a potential customer of the D50 might be annual income and years of photographic experience.
The environmental variables are defined on a space X, which could be zero dimensional, such as survey sites or individual people; one dimensional, such as sales over time; or two dimensional, such as sales in a spatial area. Finally, the niche of the species is defined as a probability distribution over both the environment and the space, e.g.
Prob(S) = f1(E) = f2(X)
Most applications of niche modelling can be described using these simple elements.
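The elements S, E and X can be sketched in R as follows. This is an illustration under stated assumptions, not a fitted model: the `sites` table, the response function `f1`, and its optimum (income of 60 thousand, 5 years of experience) and widths are all made up to show how a niche defined over the environment translates into probabilities over the space.

```r
# Hypothetical space X (zero dimensional here: individual customers),
# each with two environment variables E.
sites <- data.frame(
  id         = 1:5,
  income     = c(20, 45, 60, 75, 120),  # annual income, thousands (made up)
  experience = c(0, 2, 5, 8, 20)        # years of photography (made up)
)

# f1: an assumed niche response over the environment, peaking at
# income = 60 and experience = 5. Optimum and widths are illustrative only.
f1 <- function(income, experience) {
  exp(-((income - 60) / 30)^2) * exp(-((experience - 5) / 5)^2)
}

# f2: the same niche expressed over the space X, obtained by evaluating
# f1 at the environment values recorded for each site.
sites$prob <- f1(sites$income, sites$experience)
print(sites)
```

Each row of `sites` now carries a value between 0 and 1 that is highest for the customer whose income and experience sit at the assumed optimum.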
Basic models used in niche modeling
There is a fundamental difference between niche models and various kinds of well-known statistical models. Using the usual regression models, product sales would be described in terms of a linear regression on a combination of customer characteristics. The figure produced by the commands below shows some basic functions for describing physical relationships, including linear, exponential and power relationships.
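> i <- seq(0, 2, by = 0.01)  # assumed range; original values lost in extraction
> y <- i                     # linear (assumed)
> plot(i, y, type = "l")
> lines(i, exp(i), lty = 2)  # exponential
> lines(i, i^2, lty = 4)     # power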
One of the problems with the standard linear regression models is that, in most cases, sales do not keep on increasing indefinitely. More often, a relationship like sales shows an optimal range, or sweet spot, in customer descriptions such as annual income. To express this with a function requires a ‘hump’ or ‘inverted U’ shape centered on the optimal values. A niche, at a minimum, is the tendency of a species to prefer a particular set or range of values. Because business analytics of customers typically models relationships of preference, linear regression functions will be inaccurate or even completely misleading when used for predictive analytics.
The figure produced by the commands below shows three functions that give such a hump shape: a step function, a truncated quadratic, and an exponential.
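> i <- seq(-2, 2, by = 0.01)                  # assumed range; original values lost
> y <- pmax(1 - i^2, 0)                       # truncated quadratic (assumed form)
> plot(i, y, type = "l")
> sf <- function(x) ifelse(abs(x) < 1, 1, 0)  # step function (assumed form)
> lines(i, sf(i))
> lines(i, exp(-(i^2)), lty = 2)              # exponential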
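To illustrate why a straight line can mislead when the underlying response is hump shaped, the sketch below simulates sales that peak at a mid-range income and compares a linear fit with a fit that includes a squared term. The data are simulated, the peak at 60 and the noise level are arbitrary choices, and a quadratic term is only one simple way to allow an inverted U; it is not presented here as the method used elsewhere in this series.

```r
set.seed(1)  # for reproducibility

# Simulated data: sales peak at an income of 60 (thousands) and fall away
# on either side. All numbers are invented for illustration.
income <- seq(20, 100, by = 2)
sales  <- 100 * exp(-((income - 60) / 20)^2) + rnorm(length(income), sd = 5)

# Straight-line fit versus a fit with a squared term that allows a hump.
fit_linear <- lm(sales ~ income)
fit_hump   <- lm(sales ~ income + I(income^2))

# Proportion of variance explained by each model.
r2_linear <- summary(fit_linear)$r.squared
r2_hump   <- summary(fit_hump)$r.squared
print(c(linear = r2_linear, hump = r2_hump))

# Income at which the fitted quadratic peaks: -b1 / (2 * b2).
b <- coef(fit_hump)
peak <- unname(-b[2] / (2 * b[3]))
print(peak)
```

Because the simulated response is roughly symmetric about its optimum, the straight line has almost no slope to find, while the model with the squared term can recover an estimate of where the sweet spot lies.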
Niche modelling can be applied as well to predictive analytics of product sales as to organisms in the environment. It maintains that the usual forms of simple linear regression model are not suitable for representing the non-linear relationships in predictive analytics, which are better represented by non-linear, inverted ‘U’ shaped niche models. Other related articles on this blog explain more about practical niche modelling for successful predictive analytics.