In this report we aim to predict the manner in which weight lifting exercises (WLE) were performed. Devices such as Jawbone Up, Nike FuelBand, and Fitbit are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. The measurements come from accelerometers on the belt, forearm, arm, and dumbbell of the 6 participants performing barbell lifts correctly and incorrectly in 5 different ways.
The WLE dataset is described by the Human Activity Recognition research and can be downloaded as follows:
a. [Training data](https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv)
b. [Test data](https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv)
We read in the WLE dataset from the raw text files stored in CSV format, where fields are delimited with commas. Examining the dataset shows that missing values appear as empty strings, NA, and "#DIV/0!".
wleTrainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
wleTrainFile<- "pml-training.csv"
download.file(wleTrainUrl,destfile=wleTrainFile,method="curl")
wleTrainRaw<-read.csv(wleTrainFile, sep = ",", na.strings = c("","NA","#DIV/0!"))
wleTestUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
wleTestFile<- "pml-testing.csv"
download.file(wleTestUrl,destfile=wleTestFile,method="curl")
wleTestRaw<-read.csv(wleTestFile, sep = ",", na.strings = c("","NA","#DIV/0!"))
Both datasets contain 160 columns; the training set has 19622 rows and the test set has 20 rows.
dim(wleTrainRaw)
## [1] 19622 160
dim(wleTestRaw)
## [1] 20 160
Remove the X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window and num_window columns.
wleTrainRaw<-wleTrainRaw[,-c(1:7)]
After removing the columns that contain more than 70% missing values, the training set is reduced to 53 columns.
wleTrainRaw<-wleTrainRaw[,colSums(is.na(wleTrainRaw))<.7*nrow(wleTrainRaw)]
dim(wleTrainRaw)
## [1] 19622 53
From the test set, select the same columns kept in the training set (plus problem_id). This set will be used as the validation set for the final prediction.
wleValidate<-wleTestRaw[,c(colnames(wleTrainRaw)[1:52],'problem_id')]
dim(wleValidate)
## [1] 20 53
Since the training set is large, it is split into a training subset with 75% of the observations and a test subset with the remainder. For reproducibility, the seed is set to 1221.
library(caret)
set.seed(1221)
wleTrainDP<-createDataPartition(y=wleTrainRaw$classe, p=0.75, list=FALSE)
wleTrainSS<-wleTrainRaw[wleTrainDP,]
wleTestSS<-wleTrainRaw[-wleTrainDP,]
rbind("Training subset"=dim(wleTrainSS), "Test subset"=dim(wleTestSS))
## [,1] [,2]
## Training subset 14718 53
## Test subset 4904 53
No zero-variance or near-zero-variance predictors exist, so no predictors are removed in the construction of the prediction model.
nzv<-nearZeroVar(wleTrainSS, saveMetrics=TRUE)
str(nzv, vec.len=2)
## 'data.frame': 53 obs. of 4 variables:
## $ freqRatio : num 1.16 1.01 ...
## $ percentUnique: num 7.8 11.6 ...
## $ zeroVar : logi FALSE FALSE FALSE ...
## $ nzv : logi FALSE FALSE FALSE ...
nzv[nzv[,"zeroVar"]+nzv[,"nzv"]>0,]
## [1] freqRatio percentUnique zeroVar nzv
## <0 rows> (or 0-length row.names)
A classification tree is built first. The resulting model splits on 4 predictors: roll_belt, pitch_forearm, magnet_dumbbell_y and roll_forearm. Building this model took approximately 17 seconds.
time1<-proc.time()
ctreeFit<-train(classe~.,method="rpart",data=wleTrainSS)
## Loading required package: rpart
time2<-proc.time()
ctreeTime<-time2-time1
#the duration used for building the model
ctreeTime
## user system elapsed
## 15.757 0.918 16.860
print(ctreeFit$finalModel)
## n= 14718
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 14718 10533 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 13468 9291 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -34.35 1178 4 A (1 0.0034 0 0 0) *
## 5) pitch_forearm>=-34.35 12290 9287 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 439.5 10408 7466 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 123.5 6470 3831 A (0.41 0.18 0.18 0.17 0.06) *
## 21) roll_forearm>=123.5 3938 2625 C (0.077 0.18 0.33 0.23 0.18) *
## 11) magnet_dumbbell_y>=439.5 1882 926 B (0.032 0.51 0.038 0.23 0.19) *
## 3) roll_belt>=130.5 1250 8 E (0.0064 0 0 0 0.99) *
rattle::fancyRpartPlot(ctreeFit$finalModel)
## Warning: Failed to load RGtk2 dynamic library, attempting to install it.
## Please install GTK+ from http://r.research.att.com/libs/GTK_2.24.17-X11.pkg
## If the package still does not load, please ensure that GTK+ is installed and that it is on your PATH environment variable
## IN ANY CASE, RESTART R BEFORE TRYING TO LOAD THE PACKAGE AGAIN
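Since rattle pulls in the RGtk2/GTK+ dependency that failed to load above, a lighter alternative can draw the same tree: the rpart.plot package from CRAN (an assumption; it is not used in the original report). A self-contained sketch on the built-in iris data; for the model above, one would call `rpart.plot(ctreeFit$finalModel)` instead.

```r
library(rpart)
library(rpart.plot)  # assumption: installed from CRAN; needs no GTK+

# Fit a small classification tree on the built-in iris data and plot it;
# for the report's model, rpart.plot(ctreeFit$finalModel) gives the same view.
fit <- rpart(Species ~ ., data = iris)
rpart.plot(fit)
```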
## Cross-Validation With Classification Tree Model

This classification tree model has poor accuracy (0.4888).
confusionMatrix(wleTestSS$classe, predict(ctreeFit, wleTestSS))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1264 20 105 0 6
## B 385 330 234 0 0
## C 405 36 414 0 0
## D 374 137 293 0 0
## E 138 124 250 0 389
##
## Overall Statistics
##
## Accuracy : 0.4888
## 95% CI : (0.4747, 0.5029)
## No Information Rate : 0.5232
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3315
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4926 0.51005 0.31944 NA 0.98481
## Specificity 0.9440 0.85459 0.87777 0.8361 0.88645
## Pos Pred Value 0.9061 0.34773 0.48421 NA 0.43174
## Neg Pred Value 0.6290 0.91985 0.78217 NA 0.99850
## Prevalence 0.5232 0.13193 0.26427 0.0000 0.08055
## Detection Rate 0.2577 0.06729 0.08442 0.0000 0.07932
## Detection Prevalence 0.2845 0.19352 0.17435 0.1639 0.18373
## Balanced Accuracy 0.7183 0.68232 0.59861 NA 0.93563
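The validation here uses a single 75/25 hold-out split; caret can also perform explicit k-fold cross-validation through `trainControl`. A hedged sketch on the built-in iris data (the 5-fold choice is illustrative, not taken from the report):

```r
library(caret)

set.seed(1221)
# Use 5-fold cross-validation instead of caret's default bootstrap resampling;
# the same trControl argument could be passed to the train() call above.
ctrl <- trainControl(method = "cv", number = 5)
cvFit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)
cvFit$results  # accuracy and kappa averaged across the 5 folds
```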
A random forest model is built next. Building this model took approximately 111 seconds.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:ggplot2':
##
## margin
time1<-proc.time()
rforestFit<-randomForest(classe~.,data=wleTrainSS, importance=TRUE)
time2<-proc.time()
rforestTime<-time2-time1
#the duration used for building the model
rforestTime
## user system elapsed
## 109.111 0.789 110.621
# return the first 6 rows of the second tree
head(getTree(rforestFit,k=2))
## left daughter right daughter split var split point status prediction
## 1 2 3 34 35.50000 1 0
## 2 4 5 1 129.50000 1 0
## 3 6 7 27 41.27783 1 0
## 4 8 9 38 426.50000 1 0
## 5 10 11 35 61.50000 1 0
## 6 12 13 4 6.00000 1 0
# list out the variable importance
varImpPlot(rforestFit)
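varImpPlot only draws the scores; the underlying numbers come from `importance()`. A self-contained sketch on the built-in iris data; for the report's model one would call `importance(rforestFit)` instead.

```r
library(randomForest)

set.seed(1221)
# Fit a small forest and inspect the numeric importance measures that
# varImpPlot() visualises (mean decrease in accuracy and in Gini).
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
head(importance(rf))
```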
## Cross-Validation With Random Forest Model

This random forest model has high accuracy (0.9947).
result=predict(rforestFit, wleTestSS)
cm<-confusionMatrix(wleTestSS$classe, result)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 0 0 0 0
## B 3 946 0 0 0
## C 0 6 848 1 0
## D 0 0 11 793 0
## E 0 0 0 5 896
##
## Overall Statistics
##
## Accuracy : 0.9947
## 95% CI : (0.9922, 0.9965)
## No Information Rate : 0.2851
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9933
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9979 0.9937 0.9872 0.9925 1.0000
## Specificity 1.0000 0.9992 0.9983 0.9973 0.9988
## Pos Pred Value 1.0000 0.9968 0.9918 0.9863 0.9945
## Neg Pred Value 0.9991 0.9985 0.9973 0.9985 1.0000
## Prevalence 0.2851 0.1941 0.1752 0.1629 0.1827
## Detection Rate 0.2845 0.1929 0.1729 0.1617 0.1827
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9989 0.9965 0.9927 0.9949 0.9994
The expected out-of-sample error is 0.005301794, and the sample error estimated directly from the misclassifications on the test subset agrees at 0.005301794.
#Expected Out-Of-Sample Error
expOutOfSampleError<-1-cm$overall['Accuracy']
names(expOutOfSampleError)<-"Expected Out-Of-Sample Error"
expOutOfSampleError
## Expected Out-Of-Sample Error
## 0.005301794
#Estimated Sample Error
estSampleError<-1-(sum(result==wleTestSS$classe)/length(result))
estSampleError
## [1] 0.005301794
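Besides the hold-out estimate above, a random forest also carries an internal out-of-bag (OOB) error estimate, which requires no separate test set. A sketch on the built-in iris data; for the report's model, the analogous value would be `rforestFit$err.rate[rforestFit$ntree, "OOB"]`.

```r
library(randomForest)

set.seed(1221)
rf <- randomForest(Species ~ ., data = iris)
# OOB error after the final tree; err.rate holds one row per tree grown.
rf$err.rate[rf$ntree, "OOB"]
```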
Considering the high accuracy and the small sample error, the Random Forest model is used to perform the prediction on the validation set.
testingPred<-predict(rforestFit, wleValidate)
testingPred
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# The seed was set to 1221 to reproduce these results.
answer<-as.character(testingPred)
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("problem_id_",i,".txt")
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
}
pml_write_files(answer)
Given its high accuracy and small error rate, I am convinced that this Random Forest model can be used to predict the manner in which weight lifting exercises are performed.