The conventional approach to information systems is ‘three-tiered’: consisting of
a database, an analysis system and a presentation system. For example,
a low-cost system for predictive business analytics might use a MySQL database,
the statistics language R for analytics, and a browser for presentation of results.
While there are advantages in partitioning a system into clearly delineated levels,
inefficiencies and errors are introduced when transferring
data between levels, particularly in the reading and writing of data to databases.
A novel approach, producing very efficient systems for financial analysis, is to
integrate database and analysis functions into a single, memory-resident application.
Using this approach, the vector-based K language has produced
some of the fastest trading applications in the industry, reportedly over 100 times
faster than equivalent three-tiered applications.
In this section we show how to use R as a database by replicating
relational database operations including select and join.
As well as demonstrating simplicity and efficiency for smaller systems,
the simple customer example helps to build knowledge of
R’s powerful indexing operations.
First we create a simple customer database consisting of three tables: customer, product and order.
> customer <- data.frame(customer_id=c(1,2,3))
> product <- data.frame(product_id=c("d40","d41","mx90"),
+   description=c("camera","camera","printer"),
+   price=c(599.00,233.33,123.00))
> order <- data.frame(order_no=c(12,13,14,15,16,17),
+   customer_id=c(1,1,2,2,2,3),
+   product_id=c("d40","mx90","d40","mx90","d41","d40"),
+   date=c(20060711,20050823,20060101,20060112,20060202,20060401))
Adding a new order for a d40 camera is a simple operation using rbind.
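As a sketch, assuming the order table created above, the new order (order 18 for customer 4, which appears in the SELECT output below) can be appended as a new row:

```r
# Append order 18 (customer 4, a d40 camera) as a new row
order <- rbind(order,
  data.frame(order_no=18, customer_id=4, product_id="d40", date=20060606))
```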
Adding a new column to
a table is similar, and much simpler than modifying a table in a database, using the operation cbind.
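For example, a hypothetical quantity column (not part of the original tables) could be added as:

```r
# Add a hypothetical 'quantity' column, defaulting each order to one item
order <- cbind(order, quantity=rep(1, nrow(order)))
```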
Now let’s use a basic database operation, SELECT. To select all orders for d40 cameras, type the following. The $ is used to specify a particular column in the order table. The comma is needed
because the table is two-dimensional: it selects entire rows with a particular column value.
> order[order$product_id=="d40",]
   order_no customer_id product_id     date
1        12           1        d40 20060711
3        14           2        d40 20060101
6        17           3        d40 20060401
11       18           4        d40 20060606
To count the number of sales of d40 cameras, the rows containing that product are selected
and the dimension of the result is used to return the number of rows.
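A sketch of this count, assuming the order table above:

```r
# Select the d40 rows, then take the first element of dim() (the row count)
dim(order[order$product_id=="d40",])[1]
# 4
```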
Now, say we wanted to determine the total sales of each product line.
In a conventional database we would first join the two tables, product and order,
with an SQL operation something like:
SELECT * FROM product, order WHERE product.product_id = order.product_id.
In R the equivalent command uses merge (saving the result in a temporary variable m):
> m <- merge(product, order, by="product_id")
  product_id description  price order_no customer_id     date
1        d40      camera 599.00       12           1 20060711
2        d40      camera 599.00       14           2 20060101
3        d40      camera 599.00       17           3 20060401
4        d40      camera 599.00       18           4 20060606
5        d41      camera 233.33       16           2 20060202
6       mx90     printer 123.00       13           1 20050823
7       mx90     printer 123.00       15           2 20060112
In R we would then use the function aggregate to sum the prices for each product.
> aggregate(m$price, by=list(m$product_id), sum)
  Group.1       x
1     d40 2396.00
2     d41  233.33
3    mx90  246.00
Finally, to save the database, including all the tables and temporary variables,
execute save.image(file="filename"). The data is recovered with
load("filename"), or automatically when R is started in that directory.
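A sketch of the round trip, using a hypothetical file name:

```r
# Save every object in the workspace to a single file (hypothetical name)
save.image(file="customerdb.RData")
# Later, restore all tables and temporary variables
load("customerdb.RData")
```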
These examples show that all of the major functions of a database that might be needed
for accessing and preparing data for predictive analytics can be performed with
simple statements in the R language. For more examples of using R as a database,
with examples from niche modeling, see the post here.