Predictive models for business analytics often rely on
complex data-mining and other statistical techniques.
Here is a simple, efficient way of making predictions with images that
reduces the prediction process to its bare essentials.
All models are essentially generalizations:
simplifications into patterns that enable extrapolation into the unknown.
One of the simplest forms of generalization is categorization, in which
a large number of dissimilar items are sorted into a smaller number of
bins according to their similarity. Once a set of bins or categories is
established, and there is a basis for deciding which bin a new item
belongs in, new items can be categorized. In this way, a categorization,
or clustering, can serve as a predictive model. And because
categorization is the basic operation in producing a color palette for
an image, images can be used to develop models, and palette swaps can be
used for prediction.
To see how clustering works, consider kmeans, a basic clustering
algorithm available in R and other statistical languages.
In kmeans, the data to be clustered are partitioned into k groups
such that the sum of squared distances from the points to their
assigned cluster centers is minimized.
At the minimum, each cluster center is the mean of
the data points assigned to that category.
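
The assign-then-average loop behind kmeans can be sketched in a few lines of Python. This is only an illustration on one-dimensional data, not the R implementation; the function name kmeans_1d and its parameters (iterations, seed) are made up for this example.

```python
import random

def kmeans_1d(values, k, iterations=50, seed=0):
    """Illustrative k-means on a list of numbers; returns (centers, labels)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centers = rng.sample(sorted(set(values)), k)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each point joins the bin with the nearest center.
        labels = [min(range(k), key=lambda j: (v - centers[j]) ** 2)
                  for v in values]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break  # nothing moved, so the assignment is stable
        centers = new_centers
    return centers, labels

centers, labels = kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.4, 9.6], k=2)
```

With two well-separated groups like these, the centers settle on the two group means, and a new value would be categorized by whichever center is closer.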
The analogous operation in image processing is called color quantization
or color reduction. Reducing the number of colors is very useful:
it compresses the image, albeit at the expense
of image quality. But if the right colors are chosen
for the bins, the eye barely notices the difference.
Most image processing utilities will do this. For
example, the utility in netpbm for this purpose is called ppmquant: e.g.
ppmquant number_of_colors pnmfile
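
To make the idea of binning colors concrete, here is a minimal Python sketch that assigns each pixel to its nearest palette color. It is not how ppmquant chooses or applies its colors; the palette, the pixel values, and the function names are all invented for illustration.

```python
def nearest_index(color, palette):
    """Index of the palette entry closest to color (squared RGB distance)."""
    return min(range(len(palette)),
               key=lambda i: sum((c - p) ** 2 for c, p in zip(color, palette[i])))

def quantize(pixels, palette):
    """Replace each RGB pixel by the index of its nearest palette color."""
    return [[nearest_index(px, palette) for px in row] for row in pixels]

# Hypothetical three-color palette: black, red, white.
palette = [(0, 0, 0), (255, 0, 0), (255, 255, 255)]
# Hypothetical 2x2 image with four slightly different colors.
pixels = [[(10, 5, 0), (250, 20, 10)],
          [(240, 240, 250), (200, 30, 30)]]
indexed = quantize(pixels, palette)
```

The result is a grid of bin numbers plus a short palette, which is the representation that the palette swap operates on.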
After producing a reduced set of colors or bins from the image,
palette swapping can provide the prediction. Palette swapping
replaces the set of colors in the image with a new set of colors.
The corresponding utility in the netpbm package is pamlookup, which takes
an image as a lookup table for mapping the old colors to the new: i.e.
pamlookup -lookupfile=lookupfile -missingcolor=color [-fit] indexfile
Note that this is a very efficient operation because the data in the image
do not change; only the small set of values in the image's palette does.
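
A rough Python sketch of a palette swap on an indexed image follows. It only loosely mirrors what pamlookup does; the dictionary layout, the function names, and the missing argument are assumptions made for this example.

```python
def swap_palette(indexed_image, new_palette):
    """Return an image record that reuses the pixel indices and replaces only the palette."""
    return {"indices": indexed_image["indices"], "palette": list(new_palette)}

def render(indexed_image, missing=None):
    """Expand indices to palette values; indices beyond the palette get `missing`."""
    palette = indexed_image["palette"]
    return [[palette[i] if i < len(palette) else missing for i in row]
            for row in indexed_image["indices"]]

# Hypothetical indexed image: 0 = background, 1 = ring, 2 = center.
original = {
    "indices": [[0, 1, 0],
                [1, 2, 1],
                [0, 1, 0]],
    "palette": ["white", "gray", "black"],
}
# Hypothetical new palette with one value per bin.
swapped = swap_palette(original, [0.1, 0.8, 0.3])
```

The swap itself touches only three palette entries, however many pixels the image has; the pixel indices are shared between the two records.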
For example, say we have an image that represents a pattern of
environmental values. For concreteness, the donut in Figure 3A could be the
vicinity of a ring-road, the edges of an urban area, or any other feature.
Suppose certain environmental values are of interest, perhaps because
they indicate potential for future crimes. The frequency of those crimes
as a function of the environmental values is a niche model, shown in
Figure 2 as a peaked probability distribution over the environmental
values.
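
A peaked response of this kind can be sketched as a small lookup table with one probability per environmental bin. The Gaussian shape, the optimum, the width, and the bin values below are all assumptions chosen for illustration; they are not taken from Figure 2.

```python
import math

def niche_response(value, optimum, width):
    """Unnormalized Gaussian-shaped response that peaks at the optimum."""
    return math.exp(-0.5 * ((value - optimum) / width) ** 2)

# Hypothetical environmental bins (for example, quantized image values).
bin_values = [0, 1, 2, 3, 4]
weights = [niche_response(v, optimum=2, width=1) for v in bin_values]
# Normalize so the table sums to one across the bins.
total = sum(weights)
probability = [w / total for w in weights]
```

A table like this, with one entry per bin, is the kind of mapping that can then stand in as a new set of colors for the image.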
Swapping the colors in the original image for the new colors given
by the function in Figure 2 (i.e. mapping the values on the x axis to the
values on the y axis) changes Figure 3A into Figure 3B, essentially
producing a prediction of the probability of crime in the region.
This illustrates how predictions can be made from a model using only
palette swapping on images. Because images can be stored and manipulated very
efficiently by most computers, predictive algorithms such as WhyWhere
using this approach can handle very large datasets (stored as images) very