Pentaho/R Integration Plugin


http://dekarlab.de/wp/?p=5



Pentaho R Integration Plugin is used to execute R code direct from Pentaho Data Integration (PDI).
R Script Plugin can be downloaded here: Downloads

Installation
R script plugin can be installed by performing the following steps:
1) Copy the plugin folder (RScriptPlugin) into the folder of your PDI installation:
data-integration\plugins\steps
2) Install R project from the site http://www.r-project.org/
3) Install rJava package in R by executing: install.packages(“rJava”)
4) Copy JRI library rJava/jri/jri.dll (windows) or rJava/jri/libjri.so (linux) to: data-integration/libswt/{your operation system}
5) Specify location of R using R_HOME environment variable. (like: C:\R\R-3.0.2).
6) For windows, it is also needed to put in PATH variable the path to R.dll, for example: C:\R\R-3.0.2\bin\{your operation system}
7) The plugin can be found under Sripting Folder in Design Tab:

Note: to check the installation press Help button in R plugin in Spoon.

Usage
We demonstrate the usage of “Execute R Script” plugin by solving the following task. In CSV file we have the values for two variables a and b. We should calculate the sum of these variables for every row in CSV file.
For this we create transformation, which contains three steps:
  • CSV Step to load input values
  • Execute R Script Step
  • Select Values Step
The transformation is demonstrated on the following picture:

CSV file has the following values:

R script contains two sections. The first one is used to define mapping between input fields in PDI (Pentaho Data Integration) and variables in R script. The second one is used to define mapping between output variables from R script and output fields in PDI.

We have two fields in PDI with names a and b. In R file we have two variables with names a and b. In R script we define the third variable c as a sum of a and b.
In R script there is one service variable OUTPUT of type list. Using this variable we can define output values. In our case the output is only variable c. For this variable we can define a new field in PDI. In our case this is c_res field. In Select Values step we show the values for three fields a, b, and c_res.

The result is demonstrated on the picture below.

For more complicated tasks it is recommended to define functions and use direct database connection in R script. For example, we use Run.R in PDI and call function MainScript, which is defined in MainScript.R.
/code/r/Run.R

library(RMySQL)
library(R.utils)

sourceDirectory(“/code/r/func/”);
ctx.date_from=’01.01.2000′;#this value can be transferred from PDI
ctx.date_to=’01.12.2010′;#this value can be transferred from PDI
OUTPUT<-list(“err_train”=0, “err_gen”=0);
OUTPUT<-MainScript(ctx);
/code/r/func/MainScript.R
MainScript<-function(ctx) {
con <-dbConnect(MySQL(), user=”root”, password=””, dbname=”lab”, host=”localhost”);
on.exit(dbDisconnect(con));
printf(“\nTime Period: %s-%s”, ctx.date_from, ctx.date_to)
flush.console()
#Select data from database
query = paste(“”,”select * from table where column between “);
query = paste(query, “str_to_date(‘”, ctx.date_from, “‘, ‘%d.%m.%Y’) and str_to_date(‘”, ctx.date_to, “‘, ‘%d.%m.%Y’)”, sep=””);
rs <- dbSendQuery(con, query);
result <- fetch(rs, n = -1)
printf(“\nSize: %g”, nrow(result))
dbDisconnect(con)
#Here the model can be evaluated and results can be saved in variables err_train and err_gen.
#results to PDI
OUTPUT<-list(“err_train”=err_train, “err_gen”=err_gen);
return(OUTPUT)
}

Comments

Popular posts from this blog

Changing the Java VM Memory Limits in Pentaho BI

CAS with Pentaho Community Edition

utf8_general_ci VS utf8_unicode_ci what should we use?