Let's Wine About It

Introduction

Anthony Bourdain once said, “Meals make the society, hold the fabric together in lots of ways that were charming and interesting and intoxicating to me” Hoffower, Gal (2018). Along with the meals served in many different regions of the world, wine accompanies and complements the dishes that are served. People of different cultures come together to celebrate birthdays, holidays, religious beliefs, and promotions, and sometimes to share sorrow, but one thing is nearly certain: wine is more than likely present. Cultures have produced wine since biblical times, and it remains popular today. Wine has the ability to bring people together. Whether friend or foe, share a meal and a “quality” bottle of wine.

 

Choosing a bottle of wine can be based on many factors, including experts’ quality ratings, aromas, price, region, body, and chemical attributes. For the everyday wine connoisseur, “quality” is very important.

 

Data Acquisition

After researching many different data sets and the data mining techniques that could be applied to them, the Wine Quality data set was chosen. This set was retrieved from the UCI Machine Learning Repository at the University of California, Irvine. It was also chosen because it was one of the most complete sets explored and required very little transformation and cleaning.

 

The wine data was created by using data from the production of wine in the region of Vinho Verde. Vinho Verde is in the northwestern region of Portugal, and the wine is only made from grapes that are indigenous to the region. The Vinho Verde region, “is one of the largest and oldest wine regions in the world. It is home to thousands of producers, generates a wealth of economic activity and jobs and strongly contributes to the development of Minho and the country” Vinho Verde History (2018).

 

This wine region produces not only white and red wines, but also sparkling wines, brandy, and red or rosé vinegar. The wine quality data set contains chemical attributes that are important in determining the quality of wine scientifically. The set also contains a sensory attribute determined by wine experts, based on a minimum of three evaluations. The “quality” attribute is on a scale of 0 to 10, with 0 meaning very bad and 10 meaning outstanding.

 

Data Mining Opportunity and Experiment Design

Wine quality is an important consideration both when consumers purchase a bottle of wine and when the wine is produced. Using the wine quality data set from the Vinho Verde region, different models and algorithms will be applied to predict “quality” based on the chemical attributes.

 

The expert “quality” ratings currently in the wine data set are sensory driven. Different data mining techniques will be used to determine whether the chemical attributes can accurately predict the experts’ “quality” rating, including a search for possible association rules. The Classification to Clusters technique can also be applied to help determine different drivers of the “quality” ratings. Naïve Bayes, Decision Trees, and Random Forests can also be applied to identify which chemical attributes contribute most to the experts’ quality scores.

 

Data Exploration

When examining the chemical attributes of the wine data set, there are attributes that deal with the sugar content and acidity of the wines, as well as the sulfite levels and pH. It is also possible that all of these can indicate the alcohol level and quality of the wines. All of the attributes are numeric.

 

Intuitively, many people do not like wines with high tannins. Wines that are high in tannins can give some people headaches, but other people seek out such wines for their taste and quality.

 

The distribution of each variable is important for understanding what the data looks like. The histograms show that some of the variables are skewed and do not follow a normal distribution. The only variable with an approximately normal distribution is the “quality” attribute. This could indicate that there are outliers within the data set, and the ranges of the variables are also very large.

 

Frequency Distribution of Attributes:


Density Distribution of Attributes:

 

 

 

Calling the summary function on the entire data set shows the range of each variable.

 

Summary of wine quality data set:

 

 

To better understand the ranges of the variables, each variable needs to be researched to determine its industry standard, as well as whether one variable impacts another.

 

The acidity of wine impacts its taste, making it either sour or tart. When a wine lacks acidity, it can be considered flat in taste. The acids also play a major role in the pH level of a wine, as well as its color and lifespan. Acidity in wine is divided into two categories, fixed acidity and volatile acidity. Citric acid is one component of the fixed acidity. If a wine is produced with “warm climate grapes, [the wine] can be low in acid, more or less depending on the variety” Nierman (2004). A pH level as high as 4.0 makes a wine’s taste much softer and more popular with consumers. Volatile acidity levels should be low and barely detectable.

 

Also, more sugar available for fermentation results in a higher alcohol percentage, because the sugars are converted into alcohol during the fermentation process. Higher alcohol percentages can impact the aromatic effects of wine in either a positive or negative way, but they normally do not have an impact on a quality rating such as the one in the wine data set. Sulfur dioxide refers to the preservatives that are used during the winemaking process. The yeast used during fermentation also creates sulfites (sulfur dioxide). It is very rare to find a wine that is free from sulfur dioxide.

 

Correlation of Attributes:


The correlation of attributes shows that alcohol and density are negatively correlated, while the percentage of alcohol in the wine is positively correlated with quality. This is different from what would normally be expected, since the percentage of alcohol would not normally be correlated with a sensory attribute such as “quality.” The positive correlation between alcohol and quality could indicate that the percentage of alcohol positively impacts the aromatic effects of the wine, which could in turn lead experts to give a good quality score. More research would be needed to determine whether there is a causal relationship between the two attributes.
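As a rough illustration of what each cell of a correlation matrix measures, the Pearson coefficient can be computed by hand. The sketch below is plain Python rather than the R used for this report, and the alcohol and density values are made up for illustration only; they are not taken from the wine data set.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values (NOT from the wine data): higher alcohol, lower density.
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
density = [0.999, 0.997, 0.996, 0.993, 0.991]

r = pearson(alcohol, density)
print(f"correlation(alcohol, density) = {r:.3f}")
```

A value close to -1 means the two variables move in opposite directions almost linearly; it says nothing about which one drives the other.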

 

Data Preparation

The wine quality data set contains 4,898 white wine samples from the Vinho Verde region of Portugal, with 12 attributes covering the chemical composition and quality rating of each white wine. A second, red wine data set contained 1,599 observations and the same 12 attributes. Both data sets were combined into one, and an attribute was added to distinguish between red and white wine if needed. This resulted in a data set containing 6,497 observations and 13 attributes. The data was checked for complete cases and NAs; all observations and variables were complete, with no missing values present.

 

Discretization was performed on the quality attribute, which was our target variable. This variable ranged from 3 to 9 and was converted from an integer to a factor that classifies each wine as either “good” or “bad” quality.
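The discretization step can be sketched as a simple threshold rule. This is an illustrative Python stand-in for the R code used in the analysis, and the cut-off of 7 is an assumption: the text does not state which scores were labeled “good”.

```python
def label_quality(score, threshold=7):
    """Map an integer quality score to "good" or "bad".

    The threshold of 7 is an assumption for illustration; the cut-off
    actually used in the analysis is not stated in the text.
    """
    return "good" if score >= threshold else "bad"

# Quality scores in the data range from 3 to 9.
scores = [3, 4, 5, 6, 7, 8, 9]
labels = [label_quality(s) for s in scores]
print(list(zip(scores, labels)))
```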

Data slicing was then used to split the data into training and testing sets. The training set was used specifically for model building, and the test set was left untouched in its original form. The slicing followed the 2/3 rule, where 2/3 of the data is used for training and the remaining 1/3 for testing. The rows were sampled at random to create an index, so that the training set would not consist entirely of one color and the test set of the other.
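The 2/3 split can be sketched as a seeded random shuffle of row indices. This is a Python illustration rather than the R code used for the report; the function name and seed value are hypothetical. With 6,497 rows it yields 4,331 training rows and 2,166 test rows, which agrees with the sample counts in the model output later in this report.

```python
import random

def split_indices(n_rows, train_fraction=2 / 3, seed=123):
    """Return (train_idx, test_idx) from a seeded random shuffle of row indices."""
    rng = random.Random(seed)          # seed value is arbitrary, for repeatability
    indices = list(range(n_rows))
    rng.shuffle(indices)               # mixes red and white rows together
    n_train = int(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, test_idx = split_indices(6497)
print(len(train_idx), len(test_idx))
```

Shuffling before slicing is what prevents the stacked red and white rows from ending up in separate sets.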

kNN Model

The caret package provides a train() method for training models using various algorithms. Different parameter values for different algorithms are passed through the train() function. However, before train() is called, trainControl() is used to control the computational nuances of how the models are trained.

Three parameters are set within the trainControl() method. The “method” parameter holds the details about resampling. It can be set to various values such as “boot”, “boot632”, “cv”, “repeatedcv”, “LOOCV”, “LGOCV”, etc. For this analysis, “repeatedcv” (repeated cross-validation) was used.

The “number” parameter holds the number of resampling iterations; setting it to 10 specifies 10-fold cross-validation. The “repeats” parameter holds the number of complete sets of folds to compute for repeated cross-validation. Here it is set to 3, so the 10-fold cross-validation is run 3 times.
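To make the resampling scheme concrete, the sketch below generates the index sets for 10-fold cross-validation repeated 3 times. It is a simplified Python illustration, not caret's implementation: the function name and seed are hypothetical, and the folds here are purely random, whereas caret also balances the classes across folds.

```python
import random

def repeated_kfold(n_rows, number=10, repeats=3, seed=123):
    """Yield (train_idx, holdout_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)                      # new random partition per repeat
        folds = [indices[i::number] for i in range(number)]
        for held_out in folds:
            held = set(held_out)
            train = [i for i in indices if i not in held]
            yield train, held_out

resamples = list(repeated_kfold(4331))
print(len(resamples))                                   # fits per candidate value
print(sorted({len(train) for train, _ in resamples}))   # training sizes per fit
```

Each candidate tuning value is therefore fit 30 times, each time on roughly nine tenths of the 4,331 training rows.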

Before training the knn classifier, set.seed() is used to allow for replication of the results.

 

In training the knn classifier, the train() method is called with “method” set to “knn”. Since the model is trying to predict wine quality, the formula is passed first. The formula quality~. specifies that all other attributes are used as predictors, with quality on the left as the target variable. The “trControl” parameter is passed the result of the trainControl() method. The “preProcess” parameter preprocesses the training data, which is especially helpful because some of the variables, such as sulfur dioxide and sulphates, have outliers that need to be controlled for.

Two values are passed into the “preProcess” parameter: “center” and “scale”. These center and scale the data, which is needed because the variables have a wide variety of ranges and values. This matters for kNN in particular, because it relies on distances between observations. After preprocessing, each training variable has a mean of about 0 and a standard deviation of about 1; in other words, the data is standardized during this process.
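Centering and scaling amounts to subtracting each column's mean and dividing by its standard deviation. The sketch below shows this for one column in plain Python, as a stand-in for what the “center” and “scale” options do; the values are hypothetical and not taken from the wine data.

```python
import statistics

def center_and_scale(values):
    """Standardize a column: subtract the mean, divide by the sample std dev."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

# Hypothetical column on a wide scale (values are made up for illustration).
total_sulfur_dioxide = [34.0, 67.0, 54.0, 170.0, 132.0, 97.0]
scaled = center_and_scale(total_sulfur_dioxide)
print([round(z, 3) for z in scaled])
```

In practice the mean and standard deviation are estimated on the training data and then reused when transforming the test data, so the test set does not influence the fit.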

The “tuneLength” parameter holds an integer value that sets how many candidate values of the tuning parameter (here, k) are tried when tuning the algorithm.

 

Trained kNN Model Result

The kNN model shows Accuracy and Kappa metrics for different k-values, with the objective of optimizing the model by selecting the best k-value based on Accuracy. The kNN model selected k = 7, the k-value with the largest Accuracy.

Plotting Accuracy against the k-values shows how it varies:
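To make concrete what the k-value controls, the following Python sketch classifies a new observation by majority vote among its k nearest training rows. It is an illustration of the general kNN idea, not the caret implementation, and the two-feature rows are hypothetical standardized values rather than real wine data.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, new_row, k):
    """Classify new_row by majority vote among its k nearest training rows."""
    distances = sorted(
        (math.dist(row, new_row), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already-standardized two-feature rows (not real wine data).
train_rows = [(-1.0, -0.8), (-0.9, -1.1), (-0.7, -0.9), (1.0, 0.9), (0.8, 1.2)]
train_labels = ["bad", "bad", "bad", "good", "good"]

print(knn_predict(train_rows, train_labels, (0.9, 1.0), k=3))
print(knn_predict(train_rows, train_labels, (0.9, 1.0), k=5))
```

The same point is labeled differently at k = 3 and k = 5: with a larger k, the more numerous class can outvote the closest neighbors, which is why k is tuned against cross-validated Accuracy.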

 

 

Test Set Prediction using kNN Model

Now that the kNN model is trained with k = 7, it is ready to predict wine quality for the test data set using the predict() method.

Two arguments are passed into predict(): the first is the trained model (knn_fit), and the second, “newdata”, holds the testing data frame (wine.test) created earlier in the data slicing step. The predict() method returns the predicted values, which are saved as knn.pred for later reference.

caret’s confusionMatrix() method then returns the statistics of the model’s results, which are used to evaluate how accurately the model classified the test observations.

Confusion Matrix and Statistics

            Reference
Prediction   bad   good
      bad   1589    237
      good   143    197

Accuracy : 0.8246
95% CI : (0.8079, 0.8404)
No Information Rate : 0.7996
P-Value [Acc > NIR] : 0.001788

Kappa : 0.4042
Mcnemar's Test P-Value : 1.835e-06

Sensitivity : 0.9174
Specificity : 0.4539
Pos Pred Value : 0.8702
Neg Pred Value : 0.5794
Prevalence : 0.7996
Detection Rate : 0.7336
Detection Prevalence : 0.8430
Balanced Accuracy : 0.6857

'Positive' Class : bad        
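
The summary statistics follow directly from the four counts in the matrix. As a worked check, the Python sketch below recomputes them from the counts above, treating “bad” as the positive class as the output states. This is an illustrative recomputation using the standard definitions, not the confusionMatrix() source; the function name is hypothetical.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Summary statistics for a 2x2 confusion matrix.

    tp: reference positive, predicted positive
    fn: reference positive, predicted negative
    fp: reference negative, predicted positive
    tn: reference negative, predicted negative
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }

# Counts from the matrix above, with "bad" as the positive class.
metrics = confusion_metrics(tp=1589, fn=143, fp=237, tn=197)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because about 80% of the test wines are “bad”, sensitivity is high while specificity is low, which is why Kappa and Balanced Accuracy are more informative here than Accuracy alone.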

    

The kNN model thus reports an accuracy of 82.46% with k = 7. This is not a bad accuracy, but modifying the variables run through the model could potentially result in a higher accuracy. If the color variable is removed from the attributes in the model, will this improve the accuracy of the kNN model?

Again, the model shows Accuracy and Kappa metrics for different k-values, this time using 11 predictors (color is no longer among them), with the objective of optimizing the model by selecting the best k-value based on Accuracy. The model selected k = 17, the k-value with the largest Accuracy.

 

Confusion Matrix and Statistics

            Reference
Prediction   bad   good
      bad   1626    269
      good   106    165

Accuracy : 0.8269
95% CI : (0.8103, 0.8426)
No Information Rate : 0.7996
P-Value [Acc > NIR] : 0.0007083

Kappa : 0.3712
Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.9388
Specificity : 0.3802
Pos Pred Value : 0.8580
Neg Pred Value : 0.6089
Prevalence : 0.7996
Detection Rate : 0.7507
Detection Prevalence : 0.8749
Balanced Accuracy : 0.6595

'Positive' Class : bad
 

The model improved slightly, with an accuracy of 82.69% versus the previous 82.46%. It’s a small change, though, and Kappa fell from 0.4042 to 0.3712 while the number of nearest neighbors increased greatly, from 7 to 17. The color attribute would be an easy way for the model to group neighbors; since that attribute was removed, the model must work a little harder to get better accuracy.

The model was run one more time with some correlated variables removed. For instance, the acidity variables (fixed acidity, citric acid, and volatile acidity) are related measures, and sulfites are derivatives of our sulfur dioxide variables. This correlation can be seen strongly in our correlation matrix (plot_correlation(wine, type = ‘continuous’,‘Review.Date’)). Also, the color attribute was added back into the model for this run.

The model used 8 predictors and settled on a k-value of 17. The model ran a bit faster, which means computing time was saved. If model performance also holds up, the time savings can be a factor in choosing which model to use.

 

Confusion Matrix and Statistics

            Reference
Prediction   bad   good
      bad   1623    273
      good   109    161

Accuracy : 0.8236
95% CI : (0.8069, 0.8395)
No Information Rate : 0.7996
P-Value [Acc > NIR] : 0.002536

Kappa : 0.3588
Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.9371
Specificity : 0.3710
Pos Pred Value : 0.8560
Neg Pred Value : 0.5963
Prevalence : 0.7996
Detection Rate : 0.7493
Detection Prevalence : 0.8753
Balanced Accuracy : 0.6540

'Positive' Class : bad

 

Unfortunately, the accuracy decreased to 82.36%. Thus, our fit2 kNN model, the one without color, gave the highest prediction accuracy of the three kNN models.

Creating SVM and Random Forests models, also using the caret package, could potentially improve our prediction accuracy.

SVM Classifier using caret in R

The training and testing data were examined using SVM and the R caret package. The principle behind SVM (support vector machines) is to build a hyperplane that separates the data for different classes. Building this hyperplane is the main task of an SVM classifier, and the procedure can vary. The most important thing the model considers as it builds the hyperplane is maximizing the distance from the hyperplane to the nearest points of either class. Those nearest points are known as support vectors; hence, “support vector machine”.

 

Training the SVM Model

An svmctrl variable was created using the trainControl() method. In this method three parameters are set: the method specifies that training is controlled using repeated cross-validation, the number specifies that it is 10-fold (k-fold), and the repeats are set to 3.

The new SVM classification model is created by assigning the result of caret’s train() method to the svmLinear variable. It is a linear Support Vector Machine model, as the method is specified to be “svmLinear”. The label variable is the variable that the model is attempting to predict. The model is controlled by our svmctrl variable as described above, and the data is standardized using preProcess to center and scale the variables, which all have differing ranges, to a mean of zero and a standard deviation of 1. Finally, the tuneLength parameter allows the algorithm to be tuned.

The model output shows that the tuning parameter “C” was held constant at 1, which is reasonable given that this is a linear model.

Predictions using SVM

With the SVM model trained using C = 1 (the tuning parameter for a linear model), it is ready to predict the testing data labels. This is done in the same way as for kNN, using the predict() method.

 

confusionMatrix() is run to determine our test accuracy:

 

Confusion Matrix and Statistics

                       Reference
 Prediction     bad    good
            bad   1732     434
          good       0         0
                                          
 Accuracy : 0.7996         
 95% CI : (0.7821, 0.8163)
No Information Rate : 0.7996         
P-Value [Acc > NIR] : 0.5128         
                                          
Kappa : 0              
Mcnemar's Test P-Value : <2e-16          
                                        
Sensitivity : 1.0000         
Specificity : 0.0000         
Pos Pred Value : 0.7996         
Neg Pred Value :    NaN         
Prevalence : 0.7996         
Detection Rate : 0.7996         
Detection Prevalence : 1.0000         
Balanced Accuracy : 0.5000         
                                      
'Positive' Class : bad      

      

The matrix reveals that this model underperformed our kNN model from an accuracy perspective, with 79.96% of quality labels predicted correctly. It predicted “bad” for every wine in the test set, so its accuracy simply equals the No Information Rate and its Kappa is 0.

Now that the linear SVM model is built, a customized C value can be chosen using grid search. To do this, candidate values of C are placed (using expand.grid()) into a “grid” data frame, and the data frame is used to test our classifier at those specific C values. caret’s tuneGrid parameter is used to pass in this grid.
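The No Information Rate is the accuracy obtained by always predicting the most common class. A short Python check, using the reference totals from the matrix above, shows why a model that labels every wine “bad” lands exactly on it. The function name is hypothetical and this is only an illustrative recomputation.

```python
def no_information_rate(reference_counts):
    """Accuracy obtained by always predicting the most common reference class."""
    return max(reference_counts.values()) / sum(reference_counts.values())

# Reference class totals in the test set, read off the matrix above.
reference_counts = {"bad": 1732, "good": 434}
nir = no_information_rate(reference_counts)

# The linear SVM predicted "bad" for every wine, so only the "bad" wines are right.
svm_accuracy = 1732 / (1732 + 434)
print(f"NIR = {nir:.4f}, linear SVM accuracy = {svm_accuracy:.4f}")
```

This is also why the other models are best judged by how far they exceed 79.96%, not by their accuracy in isolation.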

 

The summary and plot displayed above show that the classifier reports its highest accuracy at C = 0.01. In fact, the accuracy is constant no matter which value is chosen, meaning that the cost level has no impact on the model. The model is ready to make predictions on our test set.

Confusion Matrix and Statistics

                      Reference
Prediction     bad      good
           bad   1732       434
         good       0           0
                                       
Accuracy : 0.7996         
95% CI : (0.7821, 0.8163)
No Information Rate : 0.7996         
P-Value [Acc > NIR] : 0.5128         
                                        
Kappa : 0              
Mcnemar's Test P-Value : <2e-16         
                                       
Sensitivity : 1.0000         
Specificity : 0.0000         
Pos Pred Value : 0.7996         
Neg Pred Value :    NaN         
Prevalence : 0.7996         
Detection Rate : 0.7996         
Detection Prevalence : 1.0000         
Balanced Accuracy : 0.5000         
                                        
'Positive' Class : bad            

 

Adjusting the C parameter to C = 0.01 does not increase prediction accuracy. The linear model still predicts with 79.96% accuracy, compared with our best kNN at 82.69%.

Non-Linear SVM

Another model is created to predict quality using a non-linear kernel, the radial basis function (RBF). To use the RBF kernel, the train() method’s “method” parameter is changed to “svmRadial” from the previous svmLinear. The radial model will select the appropriate cost (C) value and sigma (essentially standard deviation) for the model. Note that when a larger sigma is present, the model’s decision boundary tends to be flexible and smooth; though it tends to make wrong classifications more often, it avoids the hazard of overfitting.

 

As seen above, the model is trained on a constant sigma of 0.08496634. Also, the smaller sigma gives more accurate predictions - but also increases the risk of over-fitting. The C value selected by the model for optimality was C = 16.

The model is ready to be evaluated for accuracy on the test set.

Confusion Matrix and Statistics

                      Reference
Prediction    bad      good
           bad   1634       246
         good     98         188
                                         
Accuracy : 0.8412         
95% CI : (0.8251, 0.8563)
No Information Rate : 0.7996         
P-Value [Acc > NIR] : 3.983e-07      
                                        
Kappa : 0.4318         
Mcnemar's Test P-Value : 2.268e-15      
                                     
Sensitivity : 0.9434         
Specificity : 0.4332         
Pos Pred Value : 0.8691         
Neg Pred Value : 0.6573          
Prevalence : 0.7996         
Detection Rate : 0.7544         
Detection Prevalence : 0.8680         
Balanced Accuracy : 0.6883         
                                   
'Positive' Class : bad            

 

The accuracy is 84.12%, which is the best result so far, ahead of both the linear SVM models and the kNN models.

 

Further tuning the model with different values of C and sigma will hopefully increase the accuracy.  Again, the grid search approach was used to implement this.
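A grid search tries every combination of the candidate values, which is why its run time grows quickly. The Python sketch below builds such a grid with itertools.product, analogous to the rows R's expand.grid() would produce. The candidate values are hypothetical: the grid actually searched is not listed in the text.

```python
from itertools import product

# Hypothetical candidate values; the grid actually searched is not listed in the text.
sigma_values = [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1.0]
c_values = [0.5, 1, 2, 4, 8, 16]

# Every (sigma, C) combination, analogous to the rows of R's expand.grid().
grid = [{"sigma": s, "C": c} for s, c in product(sigma_values, c_values)]

# With 10-fold cross-validation repeated 3 times, each row is fit 30 times.
total_fits = len(grid) * 10 * 3
print(len(grid), total_fits)
```

Even this modest grid of 7 by 6 values implies more than a thousand model fits under the resampling scheme used here.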


After a lengthy run of the radial SVM model over the large expanded grid, the optimal sigma and C values returned were 0.75 and 2, respectively. This model can now be tested against the testing set for accuracy.

Confusion Matrix and Statistics

                    Reference
Prediction    bad      good
           bad   1644      210
         good     88        224
                                        
Accuracy : 0.8624         
95% CI : (0.8472, 0.8767)
No Information Rate : 0.7996         
P-Value [Acc > NIR] : 1.449e-14      
                                        
Kappa : 0.5201         
Mcnemar's Test P-Value : 2.394e-12      
                                       
Sensitivity : 0.9492         
Specificity : 0.5161         
Pos Pred Value : 0.8867         
Neg Pred Value : 0.7179         
Prevalence : 0.7996         
Detection Rate : 0.7590         
Detection Prevalence : 0.8560         
Balanced Accuracy : 0.7327         
                                         
'Positive' Class : bad            

 

Accuracy was gained relative to the base radial model from above, as this model’s accuracy is 86.24% compared to the previous 84.12%. This gain in accuracy also comes with a decreased risk of overfitting, particularly given how small sigma was in the base model, as mentioned above. As sigma has been increased from the initial radial model, the overall risk of overfitting has gone down while model performance has improved.

So far, the svm.pred.radial.grid model gives the most accurate prediction results. However, Random Forest models are another method of prediction that tends to give highly accurate classification results, and as such will be examined next.

Random Forest

Caret was used to train a model using the Random Forest method.

To use the Random Forest model, the train() method’s “method” parameter is set to “rf” from the previous svmRadial. The RF model then selects the appropriate mtry value based on accuracy. This automatic optimization of the model is followed by manual tuning, leveraging the tuneGrid approach to try to improve model performance.

In running this model, note that the same repeated cross-validation controls as in all of the previous models were used, with the addition of classProbs. classProbs is a logical control that indicates whether class probabilities should be computed, along with predicted values, in each resample. For this model classProbs was set to TRUE, whereas its default of FALSE was used for the previous models.

The model results are as follows:

 

Random Forest

4331 samples
12 predictor
2 classes: 'bad', 'good'

Pre-processing: centered (12), scaled (12)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 3898, 3898, 3898, 3898, 3898, 3897, ...
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa
   2    0.8805537  0.5612422
   3    0.8803237  0.5660164
   4    0.8791695  0.5650532
   5    0.8790928  0.5667277
   6    0.8784763  0.5660125
   7    0.8783204  0.5653053
   8    0.8765506  0.5592438
   9    0.8762434  0.5587958
  10    0.8758574  0.5589288
  12    0.8746272  0.5551558

 

The output shows that accuracy was used to select the optimal model, as noted above, and that the model settled on an mtry value of 2. Using this model to predict against the entire test data set gives the following accuracy:

 

Confusion Matrix and Statistics

            Reference
Prediction   bad   good
      bad   1680    209
      good    52    225

Accuracy : 0.8795
95% CI : (0.865, 0.8929)
No Information Rate : 0.7996
P-Value [Acc > NIR] : < 2.2e-16

Kappa : 0.565
Mcnemar's Test P-Value : < 2.2e-16

Sensitivity : 0.9700
Specificity : 0.5184
Pos Pred Value : 0.8894
Neg Pred Value : 0.8123
Prevalence : 0.7996
Detection Rate : 0.7756
Detection Prevalence : 0.8721
Balanced Accuracy : 0.7442

'Positive' Class : bad

 

Model accuracy is reported at 87.95%, which is easily the best of any model so far. The model can now be tuned using the caret package to see whether performance improves further.

Tuning will focus on one parameter: mtry. Along with ntree, mtry is widely recognized as one of the two parameters that have the biggest impact on an RF model.

According to the help page for the randomForest() function in R, mtry is the “Number of variables randomly sampled as candidates at each split” and ntree is the “Number of trees to grow”.

To tune, a new model is created using the same control and tuneLength as in previous models, but this time with mtry set manually to floor(sqrt(ncol(wine.train))), which works out to mtry = 3.
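The mtry value follows from simple arithmetic on the column count. The Python lines below reproduce it, assuming wine.train has 13 columns (the 12 predictors plus the quality target), as implied by the data preparation section.

```python
import math

n_columns = 13                      # assumed: 12 predictors plus the quality target
mtry = math.floor(math.sqrt(n_columns))
print(mtry)
```

Using only the 12 predictor columns would give the same value, since the square roots of 12 and 13 both round down to 3.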

The model runs with mtry held constant at 3, and the accuracy output is as follows:

Confusion Matrix and Statistics

                      Reference
Prediction    bad     good
           bad   1669     204
         good     63       230
                                          
Accuracy : 0.8767         
95% CI : (0.8621, 0.8903)
No Information Rate : 0.7996         
P-Value [Acc > NIR] : < 2.2e-16      
                                      
Kappa : 0.562          
Mcnemar's Test P-Value : < 2.2e-16      
                                         
Sensitivity : 0.9636         
Specificity : 0.5300         
Pos Pred Value : 0.8911         
Neg Pred Value : 0.7850         
Prevalence : 0.7996         
Detection Rate : 0.7705         
Detection Prevalence : 0.8647         
Balanced Accuracy : 0.7468         
                                         
'Positive' Class : bad            

 

The manually tuned model reaches 87.67% accuracy, slightly below the 87.95% of the automatically tuned model with mtry = 2.

Decision Tree

To train a Decision Tree classifier using caret, the train() method is called with the “method” parameter set to “rpart”. A separate package, “rpart”, is specifically available for decision tree implementation, and caret links its train function to it for a more “plug and play” approach.

For the decision tree model, the target variable quality is passed, as in previous models. The “quality~.” formula once again specifies that all attributes are used as predictors in the classifier. The “trControl” parameter is passed the result of our trainControl() method, similar to previous models.

In terms of tuning our model, different criteria can be used when splitting the nodes of the tree.

To select the specific strategy, the parameter “parms” is passed into the train() method. It should contain a list of parameters for the rpart method. For the splitting criterion, a “split” parameter is included, with a value of either “information” for information gain or “gini” for the Gini index. Both information gain and the Gini index are analyzed herein to find the most accurate model.

The result of the initially trained decision tree model is shown below, obtained by printing the dtree.fit variable. It shows the accuracy metrics for different values of cp, the complexity parameter of the tree.
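The two splitting criteria score how mixed a node is: the Gini index uses Gini impurity, while information gain is based on the drop in entropy from a parent node to its children. The Python sketch below computes both impurity measures from class counts, using the textbook definitions rather than rpart's internal code; the node counts are hypothetical.

```python
import math

def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Illustrative nodes: an evenly mixed node, a pure node, and a node with a
# roughly 80/20 bad/good mix (hypothetical counts).
for counts in ([50, 50], [100, 0], [80, 20]):
    print(counts, round(gini(counts), 4), round(entropy(counts), 4))
```

Both measures are largest for an evenly mixed node and zero for a pure one, so they often pick similar splits, which is consistent with the two criteria giving similar accuracy on this data.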

CART

4331 samples
12 predictor
2 classes: 'bad', 'good'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 3899, 3898, 3898, 3897, 3898, 3899, ...
Resampling results across tuning parameters:

 

cp           Accuracy   Kappa
0.004744958  0.8302984  0.3557012
0.005931198  0.8295302  0.3348221
0.006326611  0.8297606  0.3355760
0.006820878  0.8293753  0.3335772
0.007117438  0.8293753  0.3335772
0.008303677  0.8291413  0.3366335
0.013048636  0.8193654  0.2653197
0.016607355  0.8158271  0.2422359
0.020166074  0.8145943  0.2433389
0.025504152  0.8098205  0.1679176

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.004744958.

 

Prediction of Decision Tree on Test data

 

Confusion Matrix and Statistics

            Reference
Prediction   bad   good
      bad   1611    264
      good   121    170

Accuracy : 0.8223
95% CI : (0.8055, 0.8381)
No Information Rate : 0.7996
P-Value [Acc > NIR] : 0.00419

Kappa : 0.3672
Mcnemar's Test P-Value : 4.588e-13

Sensitivity : 0.9301
Specificity : 0.3917
Pos Pred Value : 0.8592
Neg Pred Value : 0.5842
Prevalence : 0.7996
Detection Rate : 0.7438
Detection Prevalence : 0.8657
Balanced Accuracy : 0.6609

'Positive' Class : bad

The above results show that the classifier using information gain as its criterion achieves 82.23% accuracy on the test set.

 

Training the Decision Tree Classifier with criterion as gini index

Next, a decision tree classifier is trained using the Gini index as the splitting criterion (versus the previous information gain). This model’s results show accuracy metrics for different values of cp, the complexity parameter of the tree.

 

CART
 

4331 samples
12 predictor
2 classes: 'bad', 'good'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 3899, 3898, 3898, 3897, 3898, 3899, ...
Resampling results across tuning parameters:

         cp   Accuracy      Kappa
0.004744958  0.8289826  0.3616752
0.005931198  0.8302182  0.3503554
0.006326611  0.8306792  0.3507520
0.006820878  0.8302187  0.3485428
0.007117438  0.8300648  0.3453694
0.008303677  0.8272936  0.3293491
0.013048636  0.8174423  0.2785669
0.016607355  0.8137475  0.2542771
0.020166074  0.8118971  0.2297410
0.025504152  0.8076646  0.1323638

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.006326611.
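The selection rule stated in the output, picking the cp with the largest cross-validated accuracy, can be reproduced from the table with a few lines of Python. This is only an illustration of the rule applied to the reported numbers; the variable names are hypothetical.

```python
# (cp, accuracy, kappa) rows copied from the Gini index resampling table
results = [
    (0.004744958, 0.8289826, 0.3616752),
    (0.005931198, 0.8302182, 0.3503554),
    (0.006326611, 0.8306792, 0.3507520),
    (0.006820878, 0.8302187, 0.3485428),
    (0.007117438, 0.8300648, 0.3453694),
    (0.008303677, 0.8272936, 0.3293491),
    (0.013048636, 0.8174423, 0.2785669),
    (0.016607355, 0.8137475, 0.2542771),
    (0.020166074, 0.8118971, 0.2297410),
    (0.025504152, 0.8076646, 0.1323638),
]

# Rule used in the report: largest cross-validated accuracy
best_by_accuracy = max(results, key=lambda row: row[1])

# Alternative rule for comparison: largest Kappa
best_by_kappa = max(results, key=lambda row: row[2])

print("Best cp by accuracy:", best_by_accuracy[0])
print("Best cp by Kappa:   ", best_by_kappa[0])
```

The two rules disagree here: Kappa favors the smallest cp in the grid (0.004744958), while accuracy favors 0.006326611. With classes this imbalanced, the choice of selection metric can change which tree is kept.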

 

Prediction using Gini Index Decision Tree

Confusion Matrix and Statistics

            Reference
Prediction   bad  good
      bad   1649   286
      good    83   148
                                      
Accuracy : 0.8296         
95% CI : (0.8131, 0.8452)
No Information Rate : 0.7996         
P-Value [Acc > NIR] : 0.000211       
                                       
Kappa : 0.3554         
Mcnemar's Test P-Value : < 2.2e-16      
                                      
Sensitivity : 0.9521         
Specificity : 0.3410         
Pos Pred Value : 0.8522         
Neg Pred Value : 0.6407         
Prevalence : 0.7996         
Detection Rate : 0.7613         
Detection Prevalence : 0.8934          
Balanced Accuracy : 0.6465         
                                        
'Positive' Class : bad      

      

The Gini index model achieved 82.96% accuracy on the test set, a slight improvement over the information gain model (82.23%), but it does not outperform the Random Forest model run previously.
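Because about 80% of the test wines are 'bad', both accuracies should be read against the No Information Rate of 0.7996 reported in the output. The sketch below recomputes the "P-Value [Acc > NIR]" lines for both trees, assuming they come from a one-sided exact binomial test of the number of correct predictions against the majority-class rate. That assumption, and the function name binom_tail, are mine rather than something stated in the report.

```python
import math


def binom_tail(successes, n, p):
    """P(X >= successes) for X ~ Binomial(n, p), summed in log space."""
    log_p = math.log(p)
    log_q = math.log1p(-p)
    total = 0.0
    for k in range(successes, n + 1):
        log_pmf = (
            math.lgamma(n + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)
            + k * log_p
            + (n - k) * log_q
        )
        total += math.exp(log_pmf)
    return total


n = 2166                 # test set size (sum of all confusion matrix cells)
nir = 1732 / n           # share of the majority class, 'bad'

# Correct predictions: diagonal of each confusion matrix
p_info_gain = binom_tail(1611 + 170, n, nir)
p_gini = binom_tail(1649 + 148, n, nir)

print(f"No Information Rate:        {nir:.4f}")
print(f"Information gain, p-value:  {p_info_gain:.6f}")
print(f"Gini index, p-value:        {p_gini:.6f}")
```

Both trees beat the majority-class baseline by a statistically detectable margin, but the margin itself is only two to three percentage points.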

Conclusion

In summary, various models were run to predict wine quality using 13 predominantly chemical and quantitative features. The models include k-nearest neighbors (KNN), Support Vector Machine (both linear and radial), Random Forest, and Decision Tree (using both the information gain criterion and the Gini index criterion).

Though none of the models exceeded 88% accuracy on the test set, the random forest model, tuned on the number of predictors randomly sampled as candidates at each node split (mtry, discussed previously), achieved the highest prediction accuracy at 87.95%.

Accordingly, the model rf.fit should be used for predicting wine quality.

Appendix

 

Attribute Descriptions:

 

fixed acidity: most acids involved with wine are fixed, or nonvolatile

volatile acidity: the amount of acetic acid in wine, which at high levels can lead to an unpleasant vinegar taste; it should be very low or undetectable

citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

residual sugar: the amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet

chlorides: the amount of salt in the wine

free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

sulfites: a wine additive that can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

alcohol: the percent alcohol content of the wine

quality (score between 0 and 10 given by wine experts) – Output variable

Sources:

 

Vinho Verde wine region. (2018). Retrieved June 10, 2018, from http://www.vinhoverde.pt/en/homepage

Vinho Verde history. (2018). Retrieved June 10, 2018, from http://www.vinhoverde.pt/en/history-of-vinho-verde

 

Hoffower, H., & Gal, S. (2018, June 8). 15 memorable Anthony Bourdain quotes that show why the celebrity chef and author was so beloved. Retrieved June 18, 2018, from http://www.businessinsider.com/anthony-bourdain-best-quotes-food-travel-life-2018-6

 

Nierman, D. (2004). Fixed acidity. Retrieved June 10, 2018, from http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity

 

Wine Quality Data Set http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/


© 2019 Kirby Hood

 
