Binary Classification on Tabular Data - Predicting Abnormal ECG Scans¶
Introduction¶
In this notebook, you will train a neural network classifier to detect abnormal rhythms in the ECG5000 dataset. This dataset contains 5,000 electrocardiograms, each with 140 data points. You will use a simplified version of the dataset, where each example has been labeled either 0 (an abnormal rhythm) or 1 (a normal rhythm). You are interested in identifying the abnormal rhythms.
Technical preliminaries¶
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Seed the NumPy and TensorFlow random number generators so that the
# results are as reproducible as possible each time the notebook is run
np.random.seed(42)
tf.random.set_seed(42)
pd.options.mode.chained_assignment = None
Read in the data¶
Conveniently, the dataset is available online in CSV form, and we can load it into a Pandas dataframe with the very useful pd.read_csv command.
# Each column holds one data point of the ECG signal, so we name the columns
# by their position in the sequence (0, 1, 2, ..., 139)
names = []
for i in range(140):
    names.append(i)
# The last column will be the target or dependent variable
names.append("Target")
Read in the data from http://storage.googleapis.com/download.tensorflow.org/data/ecg.csv and set the column names from the list created in the cell above.
df = pd.read_csv(
"http://storage.googleapis.com/download.tensorflow.org/data/ecg.csv", header=None
)
df.columns = names
df.shape
df.head()
Preprocessing¶
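This dataset has only numeric variables. For consistency's sake, we will assign the feature column names to the variable numerics.
# Copy the column names, leaving out the dependent variable
# (a copy, so that the "names" list itself is not modified)
numerics = [name for name in names if name != "Target"]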
# Set the output to "target_metrics"
target_metrics = df.Target.value_counts(normalize=True)
print(target_metrics)
Extract the dependent variable.
# Remove the dependent variable from the dataframe and assign it to 'y'
y = df.pop("Target")
Before we normalize the numerics, let's split the data into an 80% training set and 20% test set (why should we split before normalization?).
from sklearn.model_selection import train_test_split
# split into train and test sets with the following naming conventions:
# X_train, X_test, y_train and y_test
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)
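To get a feel for what the stratify=y argument asks for, here is a minimal sketch of a stratified split written by hand with NumPy. The helper stratified_split_indices and the synthetic labels are assumptions made for illustration; this is not how scikit-learn implements train_test_split, only the idea behind it: each class contributes the same share of its rows to the test set.

```python
import numpy as np


def stratified_split_indices(labels, test_size=0.2, seed=42):
    """Hypothetical helper: split row indices so every class keeps its share in both sets."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the rows of this class, then send test_size of them to the test set
        cls_idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(len(cls_idx) * test_size))
        test_idx.extend(cls_idx[:n_test])
        train_idx.extend(cls_idx[n_test:])
    return np.array(train_idx), np.array(test_idx)


# Synthetic labels with made-up proportions: 60% ones, 40% zeros
labels = np.array([1] * 600 + [0] * 400)
train_idx, test_idx = stratified_split_indices(labels)

train_share = labels[train_idx].mean()  # share of 1s in the training set
test_share = labels[test_idx].mean()    # share of 1s in the test set
print(train_share, test_share)
```

Both shares match the 60% share in the full label vector, which is the property a stratified split is meant to preserve.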
OK, let's calculate the mean and standard deviation of every numeric variable in the training set.
# Assign the means to "means" and standard deviation to "sd"
means = X_train[numerics].mean()
sd = X_train[numerics].std()
print(means)
Let's normalize the train and test dataframes with these means and standard deviations.
# Normalize X_train
X_train[numerics] = (X_train[numerics] - means) / sd
# Normalize X_test
X_test[numerics] = (X_test[numerics] - means) / sd
X_train.head()
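As a sanity check on this kind of normalization, the sketch below repeats the same steps on small synthetic dataframes (random numbers standing in for the ECG data, so the names train and test are assumptions). After the transformation, the training columns have mean 0 and standard deviation 1 by construction, while the test columns are only close to those values because their statistics were never used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for X_train and X_test (not the ECG data)
train = pd.DataFrame(rng.normal(5.0, 2.0, size=(800, 3)))
test = pd.DataFrame(rng.normal(5.0, 2.0, size=(200, 3)))

# Statistics come from the training set only
means = train.mean()
sd = train.std()

train_norm = (train - means) / sd
test_norm = (test - means) / sd

print(train_norm.mean().round(6).tolist())  # effectively zero in every column
print(train_norm.std().round(6).tolist())   # one in every column
print(test_norm.mean().round(3).tolist())   # near zero, but not exactly
```

Computing the statistics on the training set alone keeps information about the test set out of the preprocessing step.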
The easiest way to feed data to Keras/TensorFlow is as NumPy arrays, so we convert our two dataframes to NumPy arrays.
# Convert X_train and X_test to NumPy arrays
X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
X_train.shape, y_train.shape
X_test.shape, y_test.shape
Build a model¶
Define model in Keras¶
Creating an NN is usually just a few lines of Keras code.
- We will start with a single hidden layer.
- Since this is a binary classification problem, we will use a sigmoid activation in the output layer.
# get the number of columns and assign it to "num_columns"
num_columns = X_train.shape[1]
# Define the input layer. assign it to "input"
input = keras.Input(shape=(num_columns,), dtype="float32")
# Feed the input vector to the hidden layer. Call it "h"
h = keras.layers.Dense(16, activation="relu", name="Hidden")(input)
# Feed the output of the hidden layer to the output layer. Call it "output"
output = keras.layers.Dense(1, activation="sigmoid", name="Output")(h)
# tell Keras that this (input,output) pair is your model. Call it "model"
model = keras.Model(input, output)
model.summary()
keras.utils.plot_model(model, show_shapes=True)
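The parameter counts reported by model.summary() can be checked by hand. A Dense layer has one weight per (input, unit) pair plus one bias per unit, so for the 140 inputs, 16 hidden units and single output unit used here the arithmetic works out as follows (the variable names are chosen for this illustration only):

```python
n_inputs = 140  # data points per ECG example
n_hidden = 16   # units in the hidden layer
n_outputs = 1   # single sigmoid output unit

# weights + biases for each Dense layer
hidden_params = n_inputs * n_hidden + n_hidden
output_params = n_hidden * n_outputs + n_outputs
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 2256 17 2273
```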
Set optimization parameters¶
Now that the model is defined, we need to tell Keras three things:
- What loss function to use - since our output variable is binary, we will select the binary_crossentropy loss function.
- Which optimizer to use - we will use a 'flavor' of SGD called adam, which is an excellent default choice.
- What metrics you want Keras to report - in classification problems like this one, accuracy is commonly used.
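To make the loss function less of a black box, here is a minimal plain-Python sketch of binary cross-entropy, the mean of -[y*log(p) + (1-y)*log(1-p)] over the examples. The function name, the clipping constant eps and the example values are assumptions for illustration; Keras computes the same quantity with its own numerically stable implementation.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]  # high probability on the correct class
mostly_wrong = [0.1, 0.9, 0.2, 0.8]  # high probability on the wrong class

loss_right = binary_crossentropy(y_true, mostly_right)
loss_wrong = binary_crossentropy(y_true, mostly_wrong)
print(round(loss_right, 4), round(loss_wrong, 4))
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong, which is what the optimizer tries to reduce.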
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
Train the model¶
To kick off training, we have to decide on three things:
- The batch size - 32 is a good default
- The number of epochs (i.e., how many passes through the training data). Start by setting this to 100, but you can experiment with different values.
- Whether we want to use a validation set. This will be useful for overfitting detection and regularization via early stopping, so we will ask Keras to automatically hold out 20% of the training data points as a validation set.
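The idea behind early stopping can be sketched in a few lines of plain Python: track the best validation loss seen so far and stop once it has not improved for a set number of epochs. The function early_stopping_epoch, the patience value and the validation losses below are made up for illustration; they are not results from this notebook.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Hypothetical helper: return (stop_epoch, best_epoch), both counted from 1."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up validation losses: they improve, then start to rise (overfitting)
val_losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.39, 0.41]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 7 4
```

In Keras, the same behaviour is available through the keras.callbacks.EarlyStopping callback (for example with monitor="val_loss" and a patience value), passed to model.fit via the callbacks argument.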
# Fit your model and assign the output to "history"
history = model.fit(
X_train, y_train, epochs=100, batch_size=32, validation_split=0.2, verbose=2
)
history_dict = history.history
history_dict.keys()
loss_values = history_dict["loss"]
val_loss_values = history_dict["val_loss"]
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, "bo", label="Training loss")
plt.plot(epochs, val_loss_values, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
plt.clf()
acc = history_dict["accuracy"]
val_acc = history_dict["val_accuracy"]
plt.plot(epochs, acc, "bo", label="Training acc")
plt.plot(epochs, val_acc, "b", label="Validation acc")
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
Evaluate the model¶
Let's see how well the model does on the test set. model.evaluate is a very handy function to calculate the performance of your model on any dataset.
# Getting the results of your model for grading
# ("score" is the test loss, "acc" is the test accuracy)
score, acc = model.evaluate(X_test, y_test)
y.value_counts(normalize=True)
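Because the goal is to identify the abnormal rhythms (label 0), accuracy is worth reading next to two other numbers: the accuracy of always predicting the majority class, and the share of abnormal examples the model actually catches. The sketch below uses made-up probabilities and labels, not output from this model, and thresholds the sigmoid output at 0.5, which is an assumed cut-off.

```python
import numpy as np

# Made-up sigmoid outputs and true labels (1 = normal, 0 = abnormal)
probs = np.array([0.95, 0.10, 0.70, 0.40, 0.85, 0.55, 0.05, 0.30])
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Turn probabilities into class predictions with a 0.5 threshold
y_pred = (probs >= 0.5).astype(int)
accuracy = (y_pred == y_true).mean()

# Share of truly abnormal examples that were predicted abnormal
abnormal = y_true == 0
abnormal_recall = (y_pred[abnormal] == 0).mean()

# Accuracy of always predicting the most common class
majority_baseline = max(y_true.mean(), 1 - y_true.mean())

print(accuracy, abnormal_recall, majority_baseline)  # 0.75 0.75 0.5
```

In the notebook itself, the probabilities would come from model.predict(X_test) and the labels from y_test.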
# Selecting a specific row (e.g., row index 300)
row_index = 300
y_values = X_train[row_index, :]
x_values = range(X_train.shape[1]) # X-axis: 0 to 139
# Plotting
plt.figure(figsize=(10, 5))
plt.plot(x_values, y_values, marker="o", linestyle="-")
plt.xlabel("X-Axis (Index)")
plt.ylabel("Y-Axis (Values)")
plt.title(f"Plot of Row {row_index}")
plt.grid(True)
plt.show()
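# Label of the plotted row (0 = abnormal, 1 = normal). Use .iloc because y_train
# keeps the shuffled original index, so y_train[row_index] would look up by label
print(y_train.iloc[row_index])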