Predicting wine quality using chemical properties
| Data Science | Complex Systems | Scientific Research |
As a food and drink enthusiast, I've always wondered whether we can use statistics to predict the flavor profile or quality of food. I know there is a whole research field of food science that explores these kinds of questions, but I wanted to see if I could try it on my own.
In particular, when I talk with people about drinks it is usually about beer or whisky, but when the conversation turns toward wine I often feel lost; I don't think I have the sophisticated palate that other people have. So when I realized there is a wine quality data set, first published in a paper, that includes different chemical properties along with a quality score given by wine experts, I decided to check it out and see whether I could use statistics (machine learning) to explore a question I've wondered about for a while: can we predict the quality/taste of a wine knowing its chemical properties?
This entry has the following sections:
- Loading imports and data
- Predicting wine type
- Can we predict the value of a chemical property?
- Predicting wine quality
Loading imports and data
First, we load some libraries we are going to use for this mini-project:
#importing libraries
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
import torch.nn.functional as F
import sklearn.metrics as skm
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython import display
display.set_matplotlib_formats('svg')
#getting the dataset - there are 2 sets: red wines and white wines
url_white = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
url_red = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
rwine_data = pd.read_csv(url_red, sep=';')
wwine_data = pd.read_csv(url_white, sep=';')
We assign labels for type of wine on each data frame and then concatenate them to have all the information in one single data frame:
#assigning labels
rwine_data['wine type'] = 1
wwine_data['wine type'] = 0
#concatenating data frames
allwine_data = pd.concat([rwine_data, wwine_data], axis=0, ignore_index=True)
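As a quick sanity check (not in the original notebook), the red set contains 1,599 samples and the white set 4,898, so the combined frame should have 6,497 rows and 13 columns (11 chemical properties plus quality and wine type):
#checking the shape and class counts of the combined data frame
print(allwine_data.shape)                        # expected: (6497, 13)
print(allwine_data['wine type'].value_counts())  # expected: 0 (white) -> 4898, 1 (red) -> 1599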
It is fairly common for data sets to contain unbalanced and non-normalized data, so it won't hurt to look at the distribution of each variable:
#plot data
fig, ax = plt.subplots(1, figsize=(17,4))
ax = sns.boxplot(data=allwine_data)
ax.set_xticklabels(allwine_data.columns,rotation=45)
# we drop the columns quality and wine type since those don't need normalization.
normed_alldata = allwine_data.drop(['quality', 'wine type'], axis=1)
for col in normed_alldata.columns:
#getting mean and standard deviation...
col_mean = np.mean(normed_alldata[col])
col_std = np.std(normed_alldata[col], ddof=1)
#normalizing data
normed_alldata[col] = (normed_alldata[col] - col_mean) / col_std
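As an aside, pandas can do the same z-scoring in a single vectorized expression; the following sketch is equivalent to the loop above (DataFrame.std() uses ddof=1 by default):
#equivalent vectorized z-scoring of all columns at once
normed_alldata = (normed_alldata - normed_alldata.mean()) / normed_alldata.std()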
Now we can check that the data is normalized by plotting the new distributions:
#plot the normalized data
fig, ax = plt.subplots(1, figsize=(17,4))
ax = sns.boxplot(data=normed_alldata)
ax.set_xticklabels(normed_alldata.columns,rotation=45)
Predicting wine type
Before trying to predict wine quality, a good exercise is to see how accurately we can predict other properties with the data we already have. The type of wine is probably the simplest one, so we will start there.
Let's see how many points we have of each type of wine:
#plotting the counts for each type of wine
plt.hist(allwine_data['wine type'].values)
plt.xticks([0,1], labels=['White','Red'])
To balance the dataset we would need to either create more data points for the minority class or reduce the number of points in the majority class; for simplicity we go with the second option and randomly select a subset of points from the white wine group.
# making a *randomized* selection of white wine data
white_ix = np.where(allwine_data['wine type']==0)[0]
red_ix = np.where(allwine_data['wine type']==1)[0]
#these are the indices we are going to use for our dataset
ix_selec = np.concatenate([np.random.choice(white_ix, len(red_ix), replace=False), red_ix])
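Note that this subsampling is random, so the exact subset (and the numbers reported below) will vary slightly between runs. If reproducibility matters, one could fix NumPy's seed before making the selection above (not done in the original notebook):
#optional: fix the random seed so the subsampling is reproducible
np.random.seed(0)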
now we can define the variables for our machine learning model
#defining the data and target
normed_allselect = normed_alldata.iloc[ix_selec]
wine_target = allwine_data['wine type'].values[ix_selec]
In PyTorch an easy way to feed data to the model is through data loaders, so we define a function that builds the loaders; this will make it easier to run more experiments with different datasets.
# function that will help us to build the data loaders for PyTorch
def get_loaders(data, target, test_size, batch_size):
#first converting to torch format
Tdata = torch.tensor(data.values).float()
Ttarget = torch.tensor(target).float()[:, None]
#split the data
train_data, test_data, train_labels, test_labels = train_test_split(Tdata, Ttarget, test_size=test_size)
#creating datasets for the loaders
train_data = TensorDataset(train_data, train_labels)
test_data = TensorDataset(test_data, test_labels)
#creating loaders, dropping last to have equal size loaders.
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_data, batch_size=test_data.tensors[0].shape[0])
return train_loader, test_loader
and we use the function we just defined to build the loaders with our data, using 80% of the data for training and the rest for testing with batches of size batch_size = 64
train_loader, test_loader = get_loaders(
normed_allselect, wine_target, test_size=0.2, batch_size=64)
Now we can think about our Artificial Neural Network (ANN) model. I decided to pick a model flexible enough to be reused across the different experiments. The model is fully connected (FCNN), which means that every node in one layer is connected to every node in the adjacent layers (there are no self-loops and no connections within a layer). It has an input layer, four hidden layers (of 16, 32, 20 and 16 nodes) and an output layer, and we can visualize the architecture in the next figure
To build this model we subclass PyTorch's base nn.Module class:
#defining our class for this ANN model fully connected with batch normalization
class ANN_wine(nn.Module):
def __init__(self, n_input): # initiating class, setting number of input nodes as variable
super().__init__()
#input layer: maps the n_input features to 16 nodes
self.input = nn.Linear(n_input, 16)
#First layer
self.fc1 = nn.Linear(16, 32)
self.bnorm1 = nn.BatchNorm1d(16)
#second layer
self.fc2 = nn.Linear(32, 20)
self.bnorm2 = nn.BatchNorm1d(32)
#third layer
self.fc3 = nn.Linear(20, 16)
self.bnorm3 = nn.BatchNorm1d(20)
#the output
self.output = nn.Linear(16, 1)
#defining the forward pass with ReLU activation functions
def forward(self, x):
#input forward
x = F.relu(self.input(x))
#forward of layer 1 with batch normalization
x = self.bnorm1(x)
x = F.relu(self.fc1(x))
#forward of layer 2 with batch normalization
x = self.bnorm2(x)
x = F.relu(self.fc2(x))
#forward of layer 3 with batch normalization
x = self.bnorm3(x)
x = F.relu(self.fc3(x))
return self.output(x)
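Before wiring the model into a training loop, it can be useful to instantiate it once and print it; this is just a sanity check I like to do, not something required by the rest of the analysis. The printout should list the linear and batch-normalization layers described above, and we can also count the trainable parameters:
#quick sanity check: inspect the architecture and count trainable parameters
tmp_model = ANN_wine(11)
print(tmp_model)
n_params = sum(p.numel() for p in tmp_model.parameters() if p.requires_grad)
print(f'Number of trainable parameters: {n_params}')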
Before training our model we need to make sure that the data loaders and the model have compatible shapes; to check this we just pass one batch from the loader through the model.
#Loading a batch from the training dataset
X, y = next(iter(train_loader))
print(X.shape)
print(y.shape)
#we initialize our model with 11 input features
ANN_winetype = ANN_wine(11)
#pass the model
y_hat = ANN_winetype(X)
print(y_hat.shape)
torch.Size([64, 11])
torch.Size([64, 1])
torch.Size([64, 1])
It seems that everything is in order. Now we need to train the model; we define a training function that uses binary cross-entropy as the loss function and the Adam optimizer:
def train_model(model,train_loader, test_loader, n_epochs, learn_rate):
#loss function: binary cross entropy.
lossfun = nn.BCEWithLogitsLoss()
#defining the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learn_rate)
#now we initialize the losses and accuracy
losses = torch.zeros(n_epochs)
train_acc = []
test_acc = []
for epoch_i in range(n_epochs):
#using the model in training mode
model.train()
batch_acc = []
batch_loss = []
#Now we loop over the batches
for X, y in train_loader:
#forward
y_hat = model(X)
loss = lossfun(y_hat, y)
#back propagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
#storing batch loss and accuracy
batch_loss.append(loss.item())
batch_acc.append( 100*torch.mean( ((y_hat>0) == y).float()).item())
#saving epoch's loss and accuracy
train_acc.append(np.mean(batch_acc))
losses[epoch_i] = np.mean(batch_loss)
#now we evaluate the model with the test data
#switch to evaluation mode
model.eval()
X, y = next(iter(test_loader))
#forward
with torch.no_grad():
y_hat = model(X)
#accuracy for test data
test_acc.append( 100*torch.mean(((y_hat>0) == y).float()).item())
return train_acc, test_acc, losses
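A brief note on the accuracy computation above: BCEWithLogitsLoss works directly on raw logits, so the model output is never explicitly passed through a sigmoid. Thresholding the logit at 0 is equivalent to thresholding the sigmoid probability at 0.5, which is why (y_hat > 0) gives the predicted class:
#sanity check: a logit of 0 corresponds to a predicted probability of 0.5
print(torch.sigmoid(torch.tensor(0.0)))  # tensor(0.5000)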
now we can train our model and plot the loss and accuracy over epochs
train_acc, test_acc, losses = train_model(ANN_winetype,
train_loader,
test_loader,
n_epochs=200,
learn_rate=0.001
)
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
ax[0].plot(losses, 'k-')
ax[0].set_xlabel('Epoch')
ax[0].set_ylabel('Loss')
ax[0].set_title('Loss per Epoch')
ax[0].grid()
ax[1].plot(train_acc)
ax[1].plot(test_acc)
ax[1].set_xlabel('Epoch')
ax[1].set_ylabel('Accuracy (%)')
ax[1].set_title(f'Final accuracy: {test_acc[-1]:.2f}%')
ax[1].grid()
To evaluate the model in more detail we can use some tools from sklearn, like the confusion matrix and a classification report that includes precision, recall and F1-score.
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
X, y = next(iter(test_loader))
y_hat = ANN_winetype(X)
pred_wine = (y_hat > 0).detach().numpy()
CM_winetype = confusion_matrix(y, pred_wine)
plotting the confusion matrix:
plt.rcParams.update({'font.size': 15})
plt.figure(figsize=(8, 8))
plt.imshow(CM_winetype, 'Oranges')
plt.xticks([0, 1], ['White', 'Red'])
plt.yticks([0, 1], ['White', 'Red'])
plt.xlabel('Predicted type')
plt.ylabel('True type')
plt.text(0, 0, f'True negatives:\n{CM_winetype[0,0]}', ha='center', va='center')
plt.text(
0, 1, f'False negatives:\n{CM_winetype[1,0]}', ha='center', va='center')
plt.text(1, 1, f'True positives:\n{CM_winetype[1,1]}', ha='center', va='center')
plt.text(
1, 0, f'False positives:\n{CM_winetype[0,1]}', ha='center', va='center')
plt.show()
cr = classification_report(y, pred_wine)
print(cr)
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       326
         1.0       1.00      1.00      1.00       314

    accuracy                           1.00       640
   macro avg       1.00      1.00      1.00       640
weighted avg       1.00      1.00      1.00       640
But why does this model perform so well? My first guess is that the chemical properties of white and red wines are very different. Instead of comparing all the properties in our data, I did a small search on Google for the chemical properties that contribute the most to the flavor profile and found here that the main component is sugar (residual sugar), followed by acid and tannins.
So we are going to drop some of the properties from the full dataset:
#selecting chemical properties that contribute to flavor
chem_prop = normed_allselect.columns.drop(
['free sulfur dioxide', 'total sulfur dioxide', 'sulphates', 'density', 'chlorides']).values
chem_prop
and then we can see how the distributions of the six remaining properties (fixed acidity, volatile acidity, citric acid, residual sugar, pH and alcohol) differ with the type of wine:
plt.rcParams.update({'font.size': 10})
fig, ax = plt.subplots(2, 3, figsize=(14, 9))
prop_c = 0 # counter
for i in range(2):
for j in range(3):
sns.boxplot(data=allwine_data, x='wine type', y = chem_prop[prop_c], ax=ax[i,j])
ax[i,j].set_xticklabels(['white','red'])
prop_c += 1
plt.show()
Can we predict the value of a chemical property?
Now let's see if we can predict the chemical properties related to the flavor profile. Since the two types of wine have different chemical properties, we will use only one of them: the white wine dataset, since it has the larger number of data points.
We normalize the data and construct loaders
#we are going to normalize wine properties except for the quality (or target)
normed_wdata = wwine_data.drop(['quality', 'wine type'], axis=1)
for col in normed_wdata.columns:
#getting mean and standard deviation...
col_mean = np.mean(normed_wdata[col])
col_std = np.std(normed_wdata[col], ddof=1)
#normalizing data
normed_wdata[col] = (normed_wdata[col] - col_mean) / col_std
We start with the first property in the list, fixed acidity:
#the first chemical property in the list
new_data = normed_wdata.drop(chem_prop[0], axis=1)
new_target = wwine_data[chem_prop[0]].values
#build loaders
train_loader, test_loader = get_loaders(new_data, new_target, test_size=0.2, batch_size=64)
Now we need to define another training function. It has to be different because this time we are performing a regression on real-valued targets, so we need a different loss function: mean squared error (MSE). As a measure of accuracy we will use the Pearson correlation coefficient, which quantifies the linear statistical dependence between two variables.
def train_model_2(model, train_loader, test_loader, n_epochs, learn_rate):
#loss function: Mean Square Error - for regression.
lossfun = nn.MSELoss()
#defining the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learn_rate)
#now we initialize the losses and accuracy
losses = torch.zeros(n_epochs)
train_acc = []
test_acc = []
for epoch_i in range(n_epochs):
#using the model in training mode
model.train()
batch_acc = []
batch_loss = []
#Now we loop over the batches
for X, y in train_loader:
#forward
y_hat = model(X)
loss = lossfun(y_hat, y)
#back propagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
#storing batch loss and accuracy
batch_loss.append(loss.item())
#computing pearson correlation for "accuracy"
batch_cor = np.corrcoef(np.concatenate(np.array(y_hat.detach())),np.concatenate(np.array(y)))[0,1]
batch_acc.append(batch_cor)
#saving epoch's loss and accuracy
train_acc.append(np.mean(batch_acc))
losses[epoch_i] = np.mean(batch_loss)
#now we evaluate the model with the test data
#switch to evaluation mode
model.eval()
X, y = next(iter(test_loader))
#forward
with torch.no_grad():
y_hat = model(X)
#accuracy for test data
test_cor = np.corrcoef(np.concatenate(
np.array(y_hat.detach())), np.concatenate(np.array(y)))[0, 1]
test_acc.append(test_cor)
return train_acc, test_acc, losses
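As a side note, scipy.stats (imported at the beginning but not used so far) provides pearsonr, which returns the same coefficient as the np.corrcoef calls inside train_model_2; a minimal example with made-up numbers:
#np.corrcoef and scipy.stats.pearsonr give the same Pearson correlation
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([1.2, 1.9, 3.1, 3.8, 5.2])
print(np.corrcoef(a, b)[0, 1])   # ~0.99
print(stats.pearsonr(a, b)[0])   # same value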
We initialize our model and test it by passing one batch through it. We are using the same class (ANN architecture), but with a different number of inputs this time (10 instead of 11):
#defining a new model and passing data through it
chemprop_model = ANN_wine(10)
X, y = next(iter(train_loader))
#evaluating
y_hat = chemprop_model(X)
print(X.shape)
print(y.shape)
print(y_hat.shape)
torch.Size([64, 10])
torch.Size([64, 1])
torch.Size([64, 1])
m2_trainacc, m2_testacc, m2_losses = train_model_2(chemprop_model,
train_loader,
test_loader,
n_epochs=200,
learn_rate=0.001)
fig, ax = plt.subplots(1, 2, figsize=(14, 6))
ax[0].plot(m2_losses, 'k-')
ax[0].set_xlabel('Epoch')
ax[0].set_ylabel('Loss')
ax[0].set_title('Loss per Epoch')
ax[0].grid()
ax[1].plot(m2_trainacc)
ax[1].plot(m2_testacc)
ax[1].set_xlabel('Epoch')
ax[1].set_ylabel(r'Pearson Correlation Coefficient ($\rho$)')
ax[1].set_title('Final ' r'$\rho$' f': {m2_testacc[-1]:.2f}')
ax[1].grid()
#loading all train and test data
train_data = train_loader.dataset.tensors
test_data = test_loader.dataset.tensors
#passing the data through the model to predict Fixed Acidity values
train_pred = chemprop_model(train_data[0])
test_pred = chemprop_model(test_data[0])
#computing pearson correlation coefficient of target and predicted values for train and test data
pcor_train = np.corrcoef(np.concatenate(np.array(train_pred.detach().numpy())),np.concatenate(train_data[1].detach().numpy()))[0,1]
pcor_test = np.corrcoef(np.concatenate(np.array(test_pred.detach().numpy())),np.concatenate(test_data[1].detach().numpy()))[0,1]
Plotting the predicted against the real values:
plt.figure(figsize=(7,7))
plt.plot(train_pred.detach(), train_data[1].detach(), 'o')
plt.plot(test_pred.detach(), test_data[1].detach(), 'o')
plt.grid()
plt.legend(('Train ' r'$\rho$' f': {pcor_train:.2f}', 'Test ' r'$\rho$' f': {pcor_test:.2f}'))
plt.title('Fixed Acidity')
#plt.xlim([0,60])
plt.xlabel('Predicted Value')
plt.ylabel('Real Value')
plt.axline((0, 0), slope=1, linestyle='--', color='k', label=None)
plt.show()
We do the same analysis for each chemical property:
#variables to store data
pcor_trains = np.zeros(6)
pcor_tests = np.zeros(6)
pred_trains = []
pred_tests = []
data_trains = []
data_tests = []
#looping over the chemical properties
for (nc,chemprop) in enumerate(chem_prop):
#building data
new_data = normed_wdata.drop(chemprop, axis=1)
new_target = wwine_data[chemprop].values
#loaders
train_loader, test_loader = get_loaders(
new_data, new_target, test_size=0.2, batch_size=64)
#initializing model
chemprop_model = ANN_wine(10)
#training...
_ = train_model_2(
chemprop_model, train_loader, test_loader, n_epochs=200, learn_rate=0.001)
#computing final accuracy
train_data = train_loader.dataset.tensors
test_data = test_loader.dataset.tensors
train_pred = chemprop_model(train_data[0])
test_pred = chemprop_model(test_data[0])
pcor_train = np.corrcoef(np.concatenate(np.array(train_pred.detach().numpy())),np.concatenate(train_data[1].detach().numpy()))[0,1]
pcor_test = np.corrcoef(np.concatenate(np.array(test_pred.detach().numpy())),np.concatenate(test_data[1].detach().numpy()))[0,1]
pcor_trains[nc] = pcor_train
pcor_tests[nc] = pcor_test
pred_trains.append(train_pred)
pred_tests.append(test_pred)
data_trains.append(train_data[1])
data_tests.append(test_data[1])
and plot the results
fig, ax = plt.subplots(2, 3, figsize=(18, 12))
prop_c = 0 #counter
for i in range(2):
for j in range(3):
xl = [np.min(pred_trains[prop_c].detach().numpy()), np.max(pred_trains[prop_c].detach().numpy())]
yl = [np.min(pred_trains[prop_c].detach().numpy()), np.max(pred_trains[prop_c].detach().numpy())]
ax[i,j].plot(pred_trains[prop_c].detach(), data_trains[prop_c].detach(), 'o')
ax[i,j].plot(pred_tests[prop_c].detach(), data_tests[prop_c].detach(), 'o')
ax[i,j].grid()
ax[i,j].set_xlim(xl)
ax[i,j].set_ylim(yl)
ax[i,j].legend(('Train ' r'$\rho$' f': {pcor_trains[prop_c]:.2f}',
'Test ' r'$\rho$' f': {pcor_tests[prop_c]:.2f}'))
ax[i,j].set_title(chem_prop[prop_c])
#ax[i,j].xlim([0, 60])
ax[i,j].set_xlabel('Predicted Value')
ax[i,j].set_ylabel('Real Value')
ax[i,j].axline((0, 0), slope=1, linestyle='--', color='k', label=None)
prop_c += 1
Almost all chemical properties can be predicted with a decent amount of precision; however, if we check the Pearson correlation values for the test data, there are a couple of poor performances: volatile acidity and citric acid.
Predicting wine quality
Finally, we will use the same model to predict wine quality. First, let's check how the quality scores are distributed:
target = wwine_data['quality'].values
plt.hist(target)
plt.xlabel('Quality Score')
plt.ylabel('Count')
plt.show()
There are too many categories and the data is not distributed uniformly. To simplify this we will turn it into a binary quality score (good, not good) and then balance the data so we don't bias the training process.
First we binarize the data:
#we set 0 = not good, 1 = good
boolean_target = np.array([0 if target[i] < 6 else 1 for i in range(len(target))] )
#plotting new distribution
plt.hist(boolean_target)
plt.xlabel('Quality Score')
plt.ylabel('Count')
plt.xticks([0,1], labels=['Not good', 'Good'])
plt.show()
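As an aside, the same binarization can be written as a single vectorized expression, equivalent to the list comprehension above:
#equivalent vectorized binarization: quality scores of 6 or more count as 'good'
boolean_target = (target >= 6).astype(int)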
#locating both categories
good_ix = np.where(boolean_target==1)[0]
bad_ix = np.where(boolean_target==0)[0]
#number of data points to be sampled
n_samps = len(bad_ix)
#selecting at random n_samps points from the 'good' category to match the 'not good' count
ix_selec = np.concatenate([np.random.choice(good_ix, n_samps, replace=False), bad_ix])
# we use the selection for the data
btarget_select = boolean_target[ix_selec]
normed_select = normed_wdata.iloc[ix_selec]
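A quick check (not in the original notebook) that the two categories are now balanced:
#both classes should now have the same number of samples
print(np.bincount(btarget_select))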
Now we can finally try to predict the quality of the wine. We build the loaders, initialize the model and pass data through it:
#building loaders
train_loader, test_loader = get_loaders(normed_select,
btarget_select,
test_size=0.2,
batch_size=64)
#initializing model
ANN_model = ANN_wine(11)
#passing data through the model
X,y = next(iter(train_loader))
print(X.shape)
print(y.shape)
y_hat = ANN_model(X)
print(y_hat.shape)
torch.Size([64, 11])
torch.Size([64, 1])
torch.Size([64, 1])
train_acc, test_acc, losses = train_model(ANN_model,
train_loader,
test_loader,
n_epochs=1000,
learn_rate=0.001
)
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
ax[0].plot(losses, 'k-')
ax[0].set_xlabel('Epoch')
ax[0].set_ylabel('Loss')
ax[0].set_title('Loss per Epoch')
ax[0].grid()
ax[1].plot(train_acc)
ax[1].plot(test_acc)
ax[1].set_xlabel('Epoch')
ax[1].set_ylabel('Accuracy (%)')
ax[1].set_title(f'Final accuracy: {test_acc[-1]:.2f}%')
ax[1].grid()
In this case the accuracy is not as high as with the chemical properties or the type of wine. It seems we could still get a lower loss and better training accuracy by training for more epochs; however, the lack of improvement in test accuracy tells us that training for longer is probably not necessary.
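One simple way to formalize that intuition is an early-stopping rule: stop once the test accuracy has not improved for a certain number of epochs. This is not something the original training function does; here is a minimal sketch of a helper that could be called inside the epoch loop of train_model:
#minimal early-stopping helper (sketch): stop if the best accuracy is older than `patience` epochs
def should_stop(acc_history, patience=50):
    if len(acc_history) <= patience:
        return False
    best_epoch = int(np.argmax(acc_history))
    return (len(acc_history) - 1 - best_epoch) > patience
#inside train_model's epoch loop one could then add:
#    if should_stop(test_acc): break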
Now let's see how good our ANN model is at predicting the wine quality given by the experts:
X, y = next(iter(test_loader))
y_hat = ANN_model(X)
pred_wwine = (y_hat > 0).detach().numpy()
#confusion matrix
c_matrix = confusion_matrix(y, pred_wwine)
plt.rcParams.update({'font.size': 15})
plt.figure(figsize=(8, 8))
plt.imshow(c_matrix, 'Blues')
plt.xticks([0,1], ['Bad','Good'])
plt.yticks([0,1],['Bad','Good'])
plt.xlabel('Predicted quality')
plt.ylabel('True quality')
plt.text(0, 0, f'True negatives:\n{c_matrix[0,0]}', ha='center', va='center')
plt.text(
0, 1, f'False negatives:\n{c_matrix[1,0]}', ha='center', va='center')
plt.text(1, 1, f'True positives:\n{c_matrix[1,1]}', ha='center', va='center')
plt.text(
1, 0, f'False positives:\n{c_matrix[0,1]}', ha='center', va='center')
plt.show()
cr_wwine = classification_report(y, pred_wwine)
print(cr_wwine)
              precision    recall  f1-score   support

         0.0       0.83      0.79      0.81       327
         1.0       0.80      0.84      0.82       329

    accuracy                           0.82       656
   macro avg       0.82      0.82      0.82       656
weighted avg       0.82      0.82      0.82       656
These results are decent enough to suggest that our model captures some of the hidden relationship between the chemical properties and the quality of the wine, but they fall short compared with the previous results.
But what about red wine? We repeat the same analysis for red wine:
#Normalizing data
normed_rdata = rwine_data.drop(['quality', 'wine type'], axis=1)
for col in normed_rdata.columns:
#getting mean and standard deviation...
col_mean = np.mean(normed_rdata[col])
col_std = np.std(normed_rdata[col], ddof=1)
#normalizing data
normed_rdata[col] = (normed_rdata[col] - col_mean) / col_std
#binarizing target
target = rwine_data['quality'].values
boolean_target = np.array(
[0 if target[i] < 6 else 1 for i in range(len(target))])
#plotting target distribution
plt.hist(boolean_target)
plt.xlabel('Quality Score')
plt.ylabel('Count')
plt.xticks([0, 1], labels=['Not good', 'Good'])
plt.show()
#building loaders
train_loader, test_loader = get_loaders(
normed_rdata, boolean_target, test_size=0.2, batch_size=64)
#checking that the loaders are correct
X, y = next(iter(train_loader))
print(X.shape)
print(y.shape)
#init model
ANN_model = ANN_wine(11)
#passing data
y_hat = ANN_model(X)
print(y_hat.shape)
torch.Size([64, 11])
torch.Size([64, 1])
torch.Size([64, 1])
train_acc, test_acc, losses = train_model(ANN_model,
train_loader,
test_loader,
n_epochs=1000,
learn_rate=0.001
)
X, y = next(iter(test_loader))
y_hat = ANN_model(X)
pred_rwine = (y_hat > 0).detach().numpy()
c_matrix = confusion_matrix(y, pred_rwine)
plt.rcParams.update({'font.size': 15})
plt.figure(figsize=(8, 8))
plt.imshow(c_matrix, 'Blues')
plt.xticks([0, 1], ['Bad', 'Good'])
plt.yticks([0, 1], ['Bad', 'Good'])
plt.xlabel('Predicted quality')
plt.ylabel('True quality')
plt.text(0, 0, f'True negatives:\n{c_matrix[0,0]}', ha='center', va='center')
plt.text(
0, 1, f'False negatives:\n{c_matrix[1,0]}', ha='center', va='center')
plt.text(1, 1, f'True positives:\n{c_matrix[1,1]}', ha='center', va='center')
plt.text(
1, 0, f'False positives:\n{c_matrix[0,1]}', ha='center', va='center')
plt.show()
cr_rwine = classification_report(y, pred_rwine)
print(cr_rwine)
              precision    recall  f1-score   support

         0.0       0.67      0.75      0.71       130
         1.0       0.82      0.75      0.78       190

    accuracy                           0.75       320
   macro avg       0.74      0.75      0.75       320
weighted avg       0.76      0.75      0.75       320
Our model performs slightly worse at predicting the quality score for red wine, which could be related to the smaller size of that dataset. But why can we reach higher accuracy when predicting chemical properties relevant to the flavor profile, such as the amount of residual sugar, than when predicting quality itself?
This discrepancy could have different explanations:
Our model (approach) is not good enough
Wine quality could be related to the acidity values we couldn't predict accurately
Wine quality scores have a subjective component from the experts
It is not simple to know exactly what the issue is, but we can set aside the first explanation, since we used the same model for every prediction task and it performed well on the others; of course we could use a better model, but that is not the point of this experiment. In the original paper for this dataset the authors report that a modified Support Vector Machine (SVM) model performs better than neural network or decision tree models.
The last two explanations are more plausible, but difficult to prove. For the second one we would need more statistical tests to evaluate the contribution of those acidity variables to the prediction. For the third, it is not hard to imagine a subjective component (expert preferences, physiological differences) in how wine quality is evaluated. Predicting human behavior is a complicated task, and I wouldn't be surprised if, even with a perfect model, we still couldn't get great results because we lack information about the individual preferences of the experts.
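On the second point, one rough probe we could try (not done in this post) is permutation importance: shuffle one input column of the test set at a time and measure how much the accuracy of the red wine quality model drops. A sketch, reusing the variables defined above:
#rough sketch of permutation feature importance for the red wine quality model
X, y = next(iter(test_loader))
ANN_model.eval()
with torch.no_grad():
    base_acc = ((ANN_model(X) > 0) == y).float().mean().item()
for ci, col in enumerate(normed_rdata.columns):
    X_perm = X.clone()
    X_perm[:, ci] = X_perm[torch.randperm(X_perm.shape[0]), ci]  # shuffle one column
    with torch.no_grad():
        perm_acc = ((ANN_model(X_perm) > 0) == y).float().mean().item()
    print(f'{col}: accuracy drop = {100*(base_acc - perm_acc):.1f}%')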
Don't forget to take a look at the notebook for this post here!