Final Laboratory Assignment:
Using K-Nearest Neighbors, Apriori, and Decision Tree j48
The dataset I’ve decided to explore is the red wine quality dataset. There is a total of 1599 instances and 12 attributes that determine the quality of a wine. There are no missing values, and all attributes are numeric. I also cleaned up the dataset, as there were some unreadable characters in the data; I removed both semicolons and double quotes. The wine quality analysis should give an overall understanding of how the different attributes of wine produce a certain quality and type of wine.
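The cleanup step can be sketched in a few lines of Python. This is a minimal sketch; the header below is a hypothetical example based on the standard UCI download format, which uses ';' separators and double-quoted column names:

```python
def clean_line(raw: str) -> str:
    """Strip double quotes and turn semicolon separators into commas."""
    return raw.replace('"', '').replace(';', ',')

# Hypothetical raw header line in the UCI download format:
raw_header = '"fixed acidity";"volatile acidity";"quality"'
print(clean_line(raw_header))  # fixed acidity,volatile acidity,quality
```

Applying `clean_line` to every line of the raw file yields a plain comma-separated file with no stray quote characters.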
Understanding Wine Attributes
I thought this would be a good section to explain some of the attributes that make up our dataset. Understanding these attributes can provide additional insight.
The attributes of this dataset are:
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
- quality (class)
- Fixed acidity is part of a group of acidic compounds commonly found in grapes that give wine its bitter or flat taste. Fixed acidity can be considered together with volatile acidity, as they are from the same family.
- Citric acid is a type of fixed acid, which is found in grapes and can affect the color and taste of wine.
- Residual sugar balances out the acidic nature of some wines. It is, as the name says, the sugar left over from the grape fermentation process.
- Chlorides contribute to the saltiness of the wine's taste.
- Sulfur dioxide is mainly used as a preservative when aging wine. It is sometimes used more in white wine than in red.
- Density is the thickness or compactness of a substance, in this case the wine.
- pH measures the acidity of the wine, with red wines falling somewhere around 3.3 to 3.6.
- Sulphates are similar to the sulfur dioxide above and act as a preservative.
- Alcohol is produced in beverages that go through a fermentation process.
- Quality is our class variable, used to determine whether we can predict the quality of any wine, particularly the red wines in this dataset.
Looking at the charts for this data, pH and density have somewhat of a normal distribution, which makes sense, as most wines within a particular category (red, white, etc.) would have around the same pH levels. Other attributes, such as residual sugar, have outliers, which could affect our results when modeling.
The next steps I took were to normalize the data and discretize the class variable into 3 nominal values representing low, medium, and high quality. Discretizing the other attributes outside of the class showed an imbalance on some of the instances, allocating unbalanced weight. You can see this imbalance in the chart below, in how the data is skewed.
Below I will discuss each algorithm used and the results. For the following models I used:
- Normalization on all attributes except the class (quality)
- Discretization into 3 bins for the class (quality): low, med, high; binning any higher decreased accuracy
- Resampling on all attributes/instances, with a seed of 20, to make up for the unbalanced classes
- Cross-validation: 10-fold
For Apriori I had to convert the rest of the attributes to nominal; I also re-ran the other algorithms afterwards, with no marked improvement.
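The normalization and class-discretization steps above can be sketched in plain Python. This is a minimal sketch using min-max scaling and equal-width binning; Weka's filters work along these lines by default, though the exact bin edges may differ, and the quality scores below are made up:

```python
def min_max_normalize(values):
    """Scale a list of numbers into the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a constant column
    return [(v - lo) / span for v in values]

def discretize_3_bins(values, labels=("low", "med", "high")):
    """Equal-width discretization into 3 nominal bins."""
    lo, hi = min(values), max(values)
    width = ((hi - lo) / 3) or 1.0
    return [labels[min(int((v - lo) / width), 2)] for v in values]

quality = [3, 4, 5, 5, 6, 6, 7, 8]  # made-up quality scores
print(discretize_3_bins(quality))
# ['low', 'low', 'med', 'med', 'med', 'med', 'high', 'high']
```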
K-Nearest Neighbors
Accuracy, Performance and Discussion:
The accuracy for this algorithm was 92.8% after resampling; before, I could only get to about 83% accuracy. The algorithm incorrectly classified 115 of the 1599 instances (1484 correct). The F-measure was about 0.92 using the default of K=1 for K-NN. The confusion matrix shows how each separated bin classified each instance:
The most misclassified instances were in the “low” quality bin. We can also see this in the F-measure for this bin, which is 0.65.
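The report uses Weka's K-NN implementation; for intuition, a 1-NN classifier (K=1, as above) can be sketched from scratch. The feature vectors and labels below are hypothetical, standing in for normalized wine attributes:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(train, query):
    """train is a list of (features, label); return the nearest neighbor's label."""
    return min(train, key=lambda pair: euclidean(pair[0], query))[1]

# Hypothetical normalized points (e.g. alcohol, volatile acidity) with quality bins:
train = [([0.1, 0.2], "low"), ([0.5, 0.5], "med"), ([0.9, 0.8], "high")]
print(predict_1nn(train, [0.45, 0.55]))  # med
```

Because 1-NN copies the label of the single closest training point, normalizing the attributes first (as done above) matters: otherwise attributes on larger scales would dominate the distance.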
Apriori
Accuracy, Performance and Discussion:
For Apriori, I had to use 20 bins for the attributes other than the class (quality). I started off trying higher numbers, but there were too many unused bins in proportion to the rest of the data. The association mining created 4 itemsets with the following rules:
- citric acid='(0.2-0.25]' 185 ==> quality='(0.333333-0.666667]' 181 lift:(1.18) lev:(0.02)  conv:(6.22)
- chlorides='(0.099833-0.14975]' alcohol='(0.129231-0.172308]' 203 ==> quality='(0.333333-0.666667]' 198 lift:(1.17) lev:(0.02)  conv:(5.69)
- residual sugar='(0.049658-0.099315]' alcohol='(0.129231-0.172308]' 187 ==> quality='(0.333333-0.666667]' 182 lift:(1.17) lev:(0.02)  conv:(5.24)
- alcohol='(0.129231-0.172308]' 301 ==> quality='(0.333333-0.666667]' 292 lift:(1.17) lev:(0.03)  conv:(5.06)
- residual sugar='(0.049658-0.099315]' chlorides='(0.099833-0.14975]' sulphates='(0.1-0.15]' 189 ==> quality='(0.333333-0.666667]' 179 lift:(1.14) lev:(0.01)  conv:(2.89)
- chlorides='(0.099833-0.14975]' sulphates='(0.1-0.15]' 328 ==> quality='(0.333333-0.666667]' 309 lift:(1.13) lev:(0.02)  conv:(2.76)
- residual sugar='(0.049658-0.099315]' density='(0.45-0.5]' 194 ==> quality='(0.333333-0.666667]' 180 lift:(1.12) lev:(0.01)  conv:(2.18)
- fixed acidity='(0.243363-0.292035]' 222 ==> quality='(0.333333-0.666667]' 205 lift:(1.11) lev:(0.01)  conv:(2.07)
- residual sugar='(0.049658-0.099315]' sulphates='(0.1-0.15]' 284 ==> quality='(0.333333-0.666667]' 261 lift:(1.1) lev:(0.02)  conv:(1.99)
- chlorides='(0.099833-0.14975]' pH='(0.4-0.45]' 196 ==> quality='(0.333333-0.666667]' 179 lift:(1.1) lev:(0.01)  conv:(1.83)
The most interesting rule here is the citric acid rule, which by itself relates directly to quality. Most of the rules have a direct correlation to quality, but on the low end of the spectrum. I mostly used this model to see connections between the attributes and their role in determining a quality red wine. From this result, I would look closer at fixed acidity, residual sugar/alcohol, and citric acid.
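The metrics printed beside each rule can be reproduced from raw counts. A minimal sketch, using made-up counts rather than the actual dataset's:

```python
def rule_stats(n_total, n_antecedent, n_both, n_consequent):
    """Support, confidence, and lift for a rule 'antecedent ==> consequent'."""
    support = n_both / n_total          # fraction of instances matching the whole rule
    confidence = n_both / n_antecedent  # how often the antecedent implies the consequent
    lift = confidence / (n_consequent / n_total)  # confidence vs. the consequent's base rate
    return support, confidence, lift

# Made-up counts: 100 instances; antecedent in 20, both in 18, consequent in 60.
print(rule_stats(100, 20, 18, 60))  # → roughly (0.18, 0.9, 1.5)
```

A lift above 1, as in all the rules listed, means the antecedent makes the medium-quality bin more likely than its overall frequency in the data.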
Decision Tree - J48
Accuracy, Performance and Discussion:
The accuracy for the J48 decision tree was about 90.37%, misclassifying 9.63% of the instances. The F1-score (F-measure) was lower here than for K-NN, at 0.9, which is still pretty good, being close to 1. The tree was produced with a seed of 20 and had about 57 leaves and a size of 113, with alcohol chosen at the top and splits on fixed acidity and sulphates. Splitting on fixed acidity is important when we look back at our definitions of the data attributes: fixed acidity is what gives wine its flat or bitter taste, which makes it an important attribute in how wine is perceived. It is interesting that in this dataset the decision tree seems to think so as well.
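J48 chooses splits like the alcohol one at the root by how much they reduce class entropy (it uses gain ratio, a normalized form of information gain). A toy sketch of the underlying calculation, with made-up labels, shows why a clean split scores highly:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, groups):
    """Entropy reduction from splitting `labels` into the given subgroups."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Made-up labels: a split that separates 'med' from 'high' perfectly
# gains the full 1 bit of entropy in the parent node.
labels = ["med", "med", "high", "high"]
print(info_gain(labels, [["med", "med"], ["high", "high"]]))  # 1.0
```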
The confusion matrix for J48 also shows misclassified lower-quality instances, as K-NN's did, but this time it misclassified more of them than it correctly classified.
Overall, I would choose the K-NN algorithm as the model to analyze this dataset. I think the other models were great for comparing performance and actually seeing how the data is separated based on parameters. At first, I was concerned with the imbalanced instances/classes in the dataset, but it showed another side to interpreting the data. For instance, this particular batch could have an overwhelming amount of “medium” quality wines in its shipment; that isn't necessarily a bad thing, it could just mean that some adjustments could be made to particular attributes, such as changing or monitoring processes when nearing the fermentation stage, or during extraction afterwards, to produce higher quality. I think some improvement in the data cleaning could help the accuracy, possibly binning the instances themselves with the right tweaks (for J48). Finally, I am pretty happy with choosing K-NN to determine the quality of red wine.