Synopsis

In this report we aim to predict the manner in which weight lifting exercises (WLE) were performed. Devices such as Jawbone Up, Nike FuelBand, and Fitbit are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. The measurements used here come from accelerometers on the belt, forearm, arm, and dumbbell of the 6 participants, who performed barbell lifts correctly and incorrectly in 5 different ways.

Data Processing

The WLE dataset is described by the Human Activity Recognition research project and can be downloaded as below:

a. [Training data](https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv)
b. [Test data](https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv)

We read in the WLE dataset from the raw text files stored in CSV format, where fields are delimited by commas. Examining the dataset shows missing values appearing as empty strings, "NA", and "#DIV/0!".

wleTrainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
wleTrainFile<- "pml-training.csv"
download.file(wleTrainUrl,destfile=wleTrainFile,method="curl")
wleTrainRaw<-read.csv(wleTrainFile, sep = ",", na.strings = c("","NA","#DIV/0!"))

wleTestUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
wleTestFile<- "pml-testing.csv"
download.file(wleTestUrl,destfile=wleTestFile,method="curl")
wleTestRaw<-read.csv(wleTestFile, sep = ",", na.strings = c("","NA","#DIV/0!"))

Both datasets contain 160 columns; the training set has 19622 rows and the test set has 20 rows.

dim(wleTrainRaw)
## [1] 19622   160
dim(wleTestRaw)
## [1]  20 160

Data Cleaning

Remove the X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window and num_window columns.

wleTrainRaw<-wleTrainRaw[,-c(1:7)]

After removing the columns that contain more than 70% missing values, the training set is reduced to 53 columns.
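The filtering idiom below keeps only the columns whose NA count falls under the 70% threshold. A toy illustration with a hypothetical data frame (not the WLE data, which is downloaded at runtime):

```r
# Toy data frame: one column above the missing-value threshold, two below
df <- data.frame(a = c(1, NA, NA, NA),   # 75% missing -> dropped
                 b = c(1, 2, NA, 4),     # 25% missing -> kept
                 c = 1:4)                #  0% missing -> kept
kept <- df[, colSums(is.na(df)) < 0.7 * nrow(df)]
colnames(kept)
## [1] "b" "c"
```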

wleTrainRaw<-wleTrainRaw[,colSums(is.na(wleTrainRaw))<.7*nrow(wleTrainRaw)]
dim(wleTrainRaw)
## [1] 19622    53

Select from the raw testing set the same column names as the cleaned training set, with problem_id in place of classe. This set will serve as a validation set for the final prediction.

wleValidate<-wleTestRaw[,c(colnames(wleTrainRaw)[1:52],'problem_id')]
dim(wleValidate)
## [1] 20 53
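As a sanity check, the validation set should now share its predictor names with the training set, differing only in the final column. A minimal sketch with toy data frames (hypothetical columns, since the WLE files are downloaded at runtime):

```r
# Toy stand-ins for the cleaned training set and the validation set
train <- data.frame(roll_belt = 1:3, pitch_forearm = 4:6, classe = c("A", "B", "C"))
valid <- data.frame(roll_belt = 7:9, pitch_forearm = 1:3, problem_id = 1:3)

# All columns except the last should match by name
nPred <- ncol(train) - 1
stopifnot(identical(colnames(train)[1:nPred], colnames(valid)[1:nPred]))
```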

Data Slicing

Since the training set is large, it is split into 75% of the observations for a training subset and the remainder for a test subset. For reproducibility, the seed is set to 1221.

library(caret)
set.seed(1221)
wleTrainDP<-createDataPartition(y=wleTrainRaw$classe, p=0.75, list=FALSE)
wleTrainSS<-wleTrainRaw[wleTrainDP,]
wleTestSS<-wleTrainRaw[-wleTrainDP,]
rbind("Training subset"=dim(wleTrainSS), "Test subset"=dim(wleTestSS))
##                  [,1] [,2]
## Training subset 14718   53
## Test subset      4904   53

Checking Zero Covariates

There are no zero-variance or near-zero-variance predictors, so no predictors are removed when constructing the prediction model.

nzv<-nearZeroVar(wleTrainSS, saveMetrics=TRUE)
str(nzv, vec.len=2)
## 'data.frame':    53 obs. of  4 variables:
##  $ freqRatio    : num  1.16 1.01 ...
##  $ percentUnique: num  7.8 11.6 ...
##  $ zeroVar      : logi  FALSE FALSE FALSE ...
##  $ nzv          : logi  FALSE FALSE FALSE ...
nzv[nzv[,"zeroVar"]+nzv[,"nzv"]>0,]
## [1] freqRatio     percentUnique zeroVar       nzv          
## <0 rows> (or 0-length row.names)

Building Model With Classification Tree

The resulting tree splits on 4 predictors: the roll_belt, pitch_forearm, magnet_dumbbell_y and roll_forearm variables. Building this model took approximately 17 seconds.

time1<-proc.time()
ctreeFit<-train(classe~.,method="rpart",data=wleTrainSS)
## Loading required package: rpart
time2<-proc.time()
ctreeTime<-time2-time1
#the duration used for building the model
ctreeTime
##    user  system elapsed 
##  15.757   0.918  16.860
print(ctreeFit$finalModel)
## n= 14718 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 14718 10533 A (0.28 0.19 0.17 0.16 0.18)  
##    2) roll_belt< 130.5 13468  9291 A (0.31 0.21 0.19 0.18 0.11)  
##      4) pitch_forearm< -34.35 1178     4 A (1 0.0034 0 0 0) *
##      5) pitch_forearm>=-34.35 12290  9287 A (0.24 0.23 0.21 0.2 0.12)  
##       10) magnet_dumbbell_y< 439.5 10408  7466 A (0.28 0.18 0.24 0.19 0.11)  
##         20) roll_forearm< 123.5 6470  3831 A (0.41 0.18 0.18 0.17 0.06) *
##         21) roll_forearm>=123.5 3938  2625 C (0.077 0.18 0.33 0.23 0.18) *
##       11) magnet_dumbbell_y>=439.5 1882   926 B (0.032 0.51 0.038 0.23 0.19) *
##    3) roll_belt>=130.5 1250     8 E (0.0064 0 0 0 0.99) *
rattle::fancyRpartPlot(ctreeFit$finalModel)

Cross-Validation With Classification Tree Model

This classification tree model has poor accuracy (0.4888).

confusionMatrix(wleTestSS$classe, predict(ctreeFit, wleTestSS))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1264   20  105    0    6
##          B  385  330  234    0    0
##          C  405   36  414    0    0
##          D  374  137  293    0    0
##          E  138  124  250    0  389
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4888          
##                  95% CI : (0.4747, 0.5029)
##     No Information Rate : 0.5232          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3315          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4926  0.51005  0.31944       NA  0.98481
## Specificity            0.9440  0.85459  0.87777   0.8361  0.88645
## Pos Pred Value         0.9061  0.34773  0.48421       NA  0.43174
## Neg Pred Value         0.6290  0.91985  0.78217       NA  0.99850
## Prevalence             0.5232  0.13193  0.26427   0.0000  0.08055
## Detection Rate         0.2577  0.06729  0.08442   0.0000  0.07932
## Detection Prevalence   0.2845  0.19352  0.17435   0.1639  0.18373
## Balanced Accuracy      0.7183  0.68232  0.59861       NA  0.93563

Building Model With Random Forest

Building this model took approximately 111 seconds.

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
time1<-proc.time()
rforestFit<-randomForest(classe~.,data=wleTrainSS, importance=TRUE)
time2<-proc.time()
rforestTime<-time2-time1
#the duration used for building the model
rforestTime
##    user  system elapsed 
## 109.111   0.789 110.621
# return the first 6 rows of the second tree
head(getTree(rforestFit,k=2))
##   left daughter right daughter split var split point status prediction
## 1             2              3        34    35.50000      1          0
## 2             4              5         1   129.50000      1          0
## 3             6              7        27    41.27783      1          0
## 4             8              9        38   426.50000      1          0
## 5            10             11        35    61.50000      1          0
## 6            12             13         4     6.00000      1          0
# list out the variable importance
varImpPlot(rforestFit)

Cross-Validation With Random Forest Model

This random forest model has high accuracy (0.9947).

result=predict(rforestFit, wleTestSS)
cm<-confusionMatrix(wleTestSS$classe, result)
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B    3  946    0    0    0
##          C    0    6  848    1    0
##          D    0    0   11  793    0
##          E    0    0    0    5  896
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9947          
##                  95% CI : (0.9922, 0.9965)
##     No Information Rate : 0.2851          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9933          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9979   0.9937   0.9872   0.9925   1.0000
## Specificity            1.0000   0.9992   0.9983   0.9973   0.9988
## Pos Pred Value         1.0000   0.9968   0.9918   0.9863   0.9945
## Neg Pred Value         0.9991   0.9985   0.9973   0.9985   1.0000
## Prevalence             0.2851   0.1941   0.1752   0.1629   0.1827
## Detection Rate         0.2845   0.1929   0.1729   0.1617   0.1827
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9989   0.9965   0.9927   0.9949   0.9994

Expected Out-Of-Sample Error and Estimated Sample Error

The expected out-of-sample error (1 minus the accuracy on the held-out test subset) is 0.005301794; the sample error estimated directly from the misclassification rate agrees: 0.005301794.

#Expected Out-Of-Sample Error
expOutOfSampleError<-1-cm$overall['Accuracy'] 
names(expOutOfSampleError)<-"Expected Out-Of-Sample Error"
expOutOfSampleError
## Expected Out-Of-Sample Error 
##                  0.005301794
#Estimated Sample Error
estSampleError<-1-(sum(result==wleTestSS$classe)/length(result))
estSampleError
## [1] 0.005301794
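Both quantities reduce to one minus the proportion of correct predictions, as a toy base-R example shows (hypothetical label vectors, not the WLE predictions):

```r
# Toy predicted and true labels with 2 mismatches out of 10
pred  <- c("A", "A", "B", "C", "C", "E", "D", "B", "A", "E")
truth <- c("A", "B", "B", "C", "C", "E", "D", "B", "A", "A")
accuracy <- mean(pred == truth)  # proportion of correct predictions
error    <- 1 - accuracy         # misclassification rate
# accuracy = 0.8, error = 0.2
```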

Choosing the Final Model for the Prediction

Considering the high accuracy and the small sample error, I have decided to use the Random Forest model to perform the prediction on the validation set.

testingPred<-predict(rforestFit, wleValidate)
testingPred
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Generate Prediction Output

# Seed was set to 1221 to produce these results.
answer<-as.character(testingPred)
pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
pml_write_files(answer)
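A minimal, self-contained sketch of the same writer, pointed at a temporary directory so it can be run without touching the working directory (the dir argument and the dummy answers are additions for illustration, not part of the original function):

```r
# Variant of pml_write_files that writes into a given directory
pml_write_files_tmp <- function(x, dir = tempdir()) {
  for (i in seq_along(x)) {
    filename <- file.path(dir, paste0("problem_id_", i, ".txt"))
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
  invisible(dir)
}
out <- pml_write_files_tmp(c("B", "A", "B"))  # dummy answers
readLines(file.path(out, "problem_id_2.txt"))
## [1] "A"
```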

Conclusion

I am convinced that this Random Forest model, with its high accuracy and small error rate, can be used to predict the manner in which weight lifting exercises were executed.