15 Random forest

15.1 Practical experiments

15.1.1 Random forest for prediction of iris

The caret package (short for _C_lassification _A_nd _RE_gression _T_raining) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:

  • data splitting
  • pre-processing
  • feature selection
  • model tuning using resampling
  • variable importance estimation

15.1.1.1 Required R packages

Required packages:

caret   
AppliedPredictiveModeling 
ellipse 

Attribute Information:

  1. sepal length in cm

  2. sepal width in cm

  3. petal length in cm

  4. petal width in cm

  5. class:

– Iris Setosa – Iris Versicolour – Iris Virginica

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
dim(iris)
## [1] 150   5

15.1.1.2 Visulization

15.1.1.2.1 Scatterplot Matrix
#install.packages("AppliedPredictiveModeling")
library(AppliedPredictiveModeling)
transparentTheme(trans = .4)
#install.packages("caret")
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
featurePlot(x = iris[, 1:4], 
            y = iris$Species, 
            plot = "pairs",
            ## Add a key at the top
            auto.key = list(columns = 3))

15.1.1.2.2 Scatterplot Matrix with Ellipses
#install.packages("ellipse")
library(ellipse)
## 
## Attaching package: 'ellipse'
## The following object is masked from 'package:graphics':
## 
##     pairs
featurePlot(x = iris[, 1:4], 
            y = iris$Species, 
            plot = "ellipse",
            ## Add a key at the top
            auto.key = list(columns = 3))

15.1.1.2.3 Overlayed Density Plots
featurePlot(x = iris[, 1:4], 
            y = iris$Species,
            plot = "density", 
            ## Pass in options to xyplot() to 
            ## make it prettier
            scales = list(x = list(relation="free"), 
                          y = list(relation="free")), 
            adjust = 1.5, 
            pch = "|", 
            layout = c(4, 1), 
            auto.key = list(columns = 3))

15.1.1.2.4 Box Plots
featurePlot(x = iris[, 1:4], 
            y = iris$Species, 
            plot = "box", 
            ## Pass in options to bwplot() 
            scales = list(y = list(relation="free"),
                          x = list(rot = 90)),  
            layout = c(4,1 ), 
            auto.key = list(columns = 2))

15.1.1.2.5 Machine leraning - Random forest
15.1.1.2.6 Model Train
set.seed(186)
# Data Splitting
train_index <- createDataPartition(iris$Species, p = 0.75, , times=1, list = FALSE)
train_set = iris[train_index, ]
test_set  = iris[-train_index, ]
fit_rf_cv <- train(Species ~ ., data=train_set, method='rf', metric = "Accuracy",
                   trControl=trainControl(method="cv",number=5)) 
fit_rf_cv
## Random Forest 
## 
## 114 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 91, 91, 93, 91, 90 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9398551  0.9100274
##   3     0.9398551  0.9100274
##   4     0.9398551  0.9100274
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Variance importance
rfVarImpcv = varImp(fit_rf_cv)
rfVarImpcv
## rf variable importance
## 
##              Overall
## Petal.Width   100.00
## Petal.Length   95.11
## Sepal.Length   16.93
## Sepal.Width     0.00
15.1.1.2.7 Testing test data
## predict
test_set$predict_rf <- predict(fit_rf_cv, test_set, "raw")
confusionMatrix(test_set$predict_rf, test_set$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         11         1
##   virginica       0          1        11
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9444          
##                  95% CI : (0.8134, 0.9932)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.728e-14       
##                                           
##                   Kappa : 0.9167          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9167           0.9167
## Specificity                 1.0000            0.9583           0.9583
## Pos Pred Value              1.0000            0.9167           0.9167
## Neg Pred Value              1.0000            0.9583           0.9583
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3056           0.3056
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            0.9375           0.9375