The provided data contains several physical measurements of breast cancer tumors, with the goal of predicting malignancy. The final model is simple, requiring only three predictors: radius mean, concavity worst, and symmetry worst. Based on our test set, we expect the model to predict malignancy with about 94% accuracy on new data. The model also indicates a positive relationship between both concavity worst and symmetry worst and the probability of malignancy.
The provided data was scaled with a min-max scaler and fed into a logistic regression. The model uses an l1 penalty with an inverse regularization strength of C = 100 (equivalent to a penalty coefficient of 0.01, i.e., light regularization).
The predictors used for the model were radius mean, concavity worst, and symmetry worst. Because radius mean and concavity worst are correlated, an interaction term was also created. Radius mean and the interaction term were not statistically significant; however, removing them slightly reduces accuracy, and radius mean is on the cusp of significance, so both were kept in the model.
The data was split into a 70% train set and 30% test set, and below are the performances:
Train performance: 96% accuracy, 94% sensitivity, 97% specificity
Test performance: 94% accuracy, 89% sensitivity, 97% specificity
There is some drop-off in sensitivity between the train and test sets. If this is an issue when the model is deployed, it may be good to consider adding a slight penalty to false negatives.
A random forest was also created and performed at the same level as the logistic regression. However, I find the logistic model preferable due to its interpretability.
import pandas as pd
import requests
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc
from sklearn.ensemble import RandomForestClassifier
import statsmodels.api as sm
%matplotlib inline
plt.style.use('ggplot')
Loading in Data with Header
columns = requests.get("https://gist.githubusercontent.com/jeff-boykin/b5c536467c30d66ab97cd1f5c9a3497d/raw/5233c792af49c9b78f20c35d5cd729e1307a7df7/field_names.txt").text.split("\n")
cancer_df = pd.read_csv("https://gist.githubusercontent.com/jeff-boykin/b5c536467c30d66ab97cd1f5c9a3497d/raw/5233c792af49c9b78f20c35d5cd729e1307a7df7/breast-cancer.csv", names = columns)
Performing exploratory data analysis on the data
def eda(dataframe): # Performs exploratory data analysis on the dataframe
    print "columns \n", dataframe.columns
    print "head \n", dataframe.head()
    print "tail \n", dataframe.tail()
    print "missing values \n", dataframe.isnull().sum()
    print "dataframe types \n", dataframe.dtypes
    print "dataframe shape \n", dataframe.shape
    print "dataframe describe \n", dataframe.describe() # summary statistics
    for item in dataframe:
        print item # column name
        print dataframe[item].nunique() # number of unique values in the column
    print "%s duplicates out of %s records" % (len(dataframe) - len(dataframe.drop_duplicates()), len(dataframe))
eda(cancer_df)
Classes are slightly unbalanced, and we have a limited number of records.
cancer_df.diagnosis.value_counts().plot(kind = "bar", title = "Cancer Diagnoses")
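To put a number on the class imbalance noted above, the class proportions can be printed directly (a small added check, not part of the original analysis):
print(cancer_df.diagnosis.value_counts(normalize=True))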
The mean for both smoothness and compactness is greater than the median. This is caused by outliers in the upper tail of each distribution (a right skew). In the distributions below, the median line is blue and the mean line is green.
print "Smoothness"
print "Mean:", cancer_df.smoothness_mean.mean()
print "Median:", cancer_df.smoothness_mean.median()
print "\nCompactness"
print "Mean:", cancer_df.compactness_mean.mean()
print "Median:", cancer_df.compactness_mean.median()
ax = cancer_df.smoothness_mean.plot(kind = "hist", title = "Smoothness Distribution")
ax.axvline(cancer_df.smoothness_mean.mean(), color = "#4cbb17")
ax.axvline(cancer_df.smoothness_mean.median(), color = "#3B5998")
ax = cancer_df.compactness_mean.plot(kind = "hist", title = "Compactness Distribution")
ax.axvline(cancer_df.compactness_mean.mean(), color = "#4cbb17")
ax.axvline(cancer_df.compactness_mean.median(), color = "#3B5998")
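As a quick numeric check on the skew (an added sanity check), pandas can report the skewness of both columns; positive values are consistent with the right skew implied by the mean exceeding the median.
print(cancer_df[["smoothness_mean", "compactness_mean"]].skew())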
Below I used the function from one of the labs to create a bootstrap sample. The function takes a dataframe, converts it into a list of dictionaries, draws a random sample with replacement, and then converts it back into a dataframe.
def bootstrap(dataframe, iters=1000):
    list_of_dicts = dataframe.to_dict(orient = "records")
    random_sample = list(np.random.choice(list_of_dicts, replace=True, size = iters))
    bootstrap_frame = pd.DataFrame(random_sample)
    return bootstrap_frame
bootstrap_sample = bootstrap(cancer_df)
bootstrap_sample.ID.value_counts().head() #Shows duplicates within sample with replacement
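As a hypothetical illustration of how the bootstrap function could be used, the sketch below estimates a 95% confidence interval for the mean of smoothness_mean; the 200 resamples and the choice of column are arbitrary and not part of the original analysis.
boot_means = [bootstrap(cancer_df, iters=len(cancer_df)).smoothness_mean.mean() for _ in range(200)] # each resample is the size of the original dataset
print("95%% bootstrap CI for mean smoothness: (%0.4f, %0.4f)" % (np.percentile(boot_means, 2.5), np.percentile(boot_means, 97.5)))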
Using a min-max scaler so that no single variable dominates the model simply because of its larger scale.
scale_columns = [u'radius_mean', u'radius_sd_error',
u'radius_worst', u'texture_mean', u'texture_sd_error', u'texture_worst',
u'perimeter_mean', u'perimeter_sd_error', u'perimeter_worst',
u'area_mean', u'area_sd_error', u'area_worst', u'smoothness_mean',
u'smoothness_sd_error', u'smoothness_worst', u'compactness_mean',
u'compactness_sd_error', u'compactness_worst', u'concavity_mean',
u'concavity_sd_error', u'concavity_worst', u'concave_points_mean',
u'concave_points_sd_error', u'concave_points_worst', u'symmetry_mean',
u'symmetry_sd_error', u'symmetry_worst', u'fractal_dimension_mean',
u'fractal_dimension_sd_error', u'fractal_dimension_worst']
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(cancer_df[scale_columns]), columns = scale_columns)
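A quick sanity check (added here, not in the original notebook) confirms that every scaled column now lies in the [0, 1] range:
print(X_scaled.describe().loc[["min", "max"]])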
Making the diagnosis column into a binary target (1 = malignant, 0 = benign)
cancer_df["target"] = cancer_df.diagnosis.map({"M":1,"B":0})
The variables radius_mean, perimeter_mean, concavity_worst, symmetry_worst, and fractal_dimension_mean all look like they have a positive correlation with malignancy. In each of the violin plots below, the distribution for malignant tumors (the right-hand violin, target = 1) sits higher than the one for benign tumors.
There were other correlated variables, but I decided to focus on only the mean or the worst of any set (e.g., if both radius mean and radius sd error correlated with malignancy, I would plot only radius mean).
One issue with the data is that most of the variables are highly correlated with each other, causing issues for any models that assume independence.
for i, column in enumerate(["radius_mean", "perimeter_mean", "concavity_worst", "symmetry_worst", "fractal_dimension_mean"]):
    plt.figure(i)
    sns.violinplot(x = cancer_df.target, y = X_scaled[column])
    plt.title(column)
sns.heatmap(X_scaled[["radius_mean", "perimeter_mean", "concavity_worst", "symmetry_worst","fractal_dimension_mean"]].corr(), vmax=.8, square=True)
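To back up the visual impression with numbers, the correlation of each candidate feature with the binary target can be computed (an added check using pandas' corrwith; positive values support the claim above):
candidate_columns = ["radius_mean", "perimeter_mean", "concavity_worst", "symmetry_worst", "fractal_dimension_mean"]
print(X_scaled[candidate_columns].corrwith(cancer_df.target))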
One of the models I went with is a logistic regression, since it is a relatively familiar model and the coefficients can tell us how the features affect the probability of malignancy.
One of the drawbacks is that I have to be selective with my features: highly correlated features introduce multicollinearity, which would distort the coefficient estimates. Also, because the data is min-max scaled, the coefficients describe a change over each feature's full range rather than its natural units, so the odds ratios are not directly interpretable.
For this model I will select radius mean, concavity worst, and symmetry worst. Radius mean and concavity worst are highly correlated, so I will use an interaction term as well.
To prevent overfitting I will let grid search optimize the regularization coefficient and whether I should use l1 or l2 regularization.
X_scaled["radius_mean_concavity_worst_interaction"] = X_scaled["radius_mean"] * X_scaled["concavity_worst"]
X_train, X_test, y_train, y_test = train_test_split(X_scaled, cancer_df.target, stratify = cancer_df.target, test_size=0.3, random_state = 997407)
logit_columns = ["radius_mean", "concavity_worst", "symmetry_worst", "radius_mean_concavity_worst_interaction"]
X_train_logit = X_train[logit_columns]
X_test_logit = X_test[logit_columns]
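To quantify the multicollinearity concern mentioned above, the variance inflation factors of the chosen predictors can be inspected with statsmodels (an added diagnostic; large values for radius_mean and the interaction term are expected given how the interaction is constructed):
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_vif = sm.add_constant(X_train_logit) # VIFs are computed on the design matrix, including the constant
vif = pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])], index = X_vif.columns)
print(vif)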
parameters = {"C": [0.0001, 0.001, 0.01, 0.1, .15, .25, .275, .33, 0.5, .66, 0.75, 1.0, 2.5, 5.0, 10.0, 100.0, 1000.0], "penalty" : ["l1", "l2"], "random_state" : [326228]}
grid_logit = GridSearchCV(LogisticRegression(), parameters, cv = 10, verbose = True, n_jobs = -1)
grid_logit.fit(X_train_logit, y_train)
Here is the best model chosen by grid search. The inverse regularization strength C is 100 (i.e., light regularization), and the penalty is l1.
grid_logit.best_estimator_
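The cross-validated accuracy behind this choice can also be inspected directly (a small addition for completeness):
print(grid_logit.best_params_)
print("Best cross-validated accuracy: %0.3f" % grid_logit.best_score_)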
The intercept and coefficients for the best model are below.
print "Intercept", grid_logit.best_estimator_.intercept_[0]
for x, y in zip(list(X_train_logit.columns), grid_logit.best_estimator_.coef_[0]):
    print x, y
I did the same fit using statsmodels to check the confidence intervals of each predictor. Some of the parameters within statsmodels are different, and for that reason the coefficients do not exactly match the output from scikit-learn.
Notice that radius mean and the radius mean by concavity worst interaction term are not statistically significant. However, when I remove them from the model I lose a couple of points of accuracy, so I left them in.
Radius mean is on the cusp of significance, and at a 90% confidence level it probably would be significant.
X_train_sm = sm.add_constant(X_train_logit)
sm_model = sm.Logit(y_train, X_train_sm).fit_regularized(method = "l1", alpha = 0.01, maxiter = 100)
sm_model.summary()
def evaluate_model(y_true, y_predicted):
    a_score = accuracy_score(y_true, y_predicted)
    conmat = np.array(confusion_matrix(y_true, y_predicted, labels=[1,0])) # rows/columns ordered malignant (1) first
    confusion = pd.DataFrame(conmat, index=['actual_malignant', 'actual_benign'], columns=['predicted_malignant', 'predicted_benign'])
    c_report = classification_report(y_true, y_predicted)
    return a_score, confusion, c_report
def plot_roc(model, features, target, title):
    y_pred_proba = model.predict_proba(features)[:, 1] # predicted probability of malignancy; adapted from http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html
    FPR, TPR, _ = roc_curve(target, y_pred_proba)
    plt.figure(figsize=[11,9])
    plt.plot(FPR, TPR, linewidth=4, label="ROC curve (AUC = %0.3f)" % auc(FPR, TPR))
    plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate', fontsize=18)
    plt.ylabel('True Positive Rate', fontsize=18)
    plt.title(title, fontsize=18)
    plt.legend(loc="lower right")
    plt.show()
Below is the evaluation on the training set. There is a 96% accuracy, a 94% sensitivity, and a 97% specificity.
The ROC curve sits well above the dashed diagonal line, which is exactly what we want.
print "Train Evaluation"
a_score, confusion, c_report = evaluate_model(y_train, grid_logit.predict(X_train_logit))
print a_score
print confusion
print c_report
plot_roc(grid_logit, X_train_logit, y_train, "Train ROC")
Below is the evaluation on the testing set. There is a 94% accuracy, an 89% sensitivity, and a 97% specificity.
We also get a nice ROC curve here.
a_score, confusion, c_report = evaluate_model(y_test, grid_logit.predict(X_test_logit))
print a_score
print confusion
print c_report
plot_roc(grid_logit, X_test_logit, y_test, "Test ROC")
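Since auc is already imported, the area under the test ROC curve can also be reported as a single number (an added summary metric, not in the original write-up):
test_probs = grid_logit.predict_proba(X_test_logit)[:, 1] # predicted probability of malignancy on the test set
test_fpr, test_tpr, _ = roc_curve(y_test, test_probs)
print("Test AUC: %0.3f" % auc(test_fpr, test_tpr))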
This model performs very nicely. It is simple, using only three variables (plus an interaction term), and it performs consistently between the train and test sets. We can also say that increases in concavity worst and symmetry worst increase the probability of malignancy (although I'm reluctant to make the same claim for radius mean, because its coefficient and the interaction term's coefficient have conflicting signs and radius mean is not statistically significant).
The only drawback is that we lose some sensitivity on the test set. If we were to deploy this model, we might consider adding a slight penalty to false negatives, as sketched below.
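As a hedged sketch of what that could look like (the class weights below are illustrative, not tuned), the malignant class can be up-weighted so that false negatives cost more during training, and the refit model re-evaluated on the test set:
weighted_logit = LogisticRegression(C = 100.0, penalty = "l1", solver = "liblinear", class_weight = {0: 1, 1: 2}, random_state = 326228) # weight of 2 on the malignant class is an arbitrary example
weighted_logit.fit(X_train_logit, y_train)
w_score, w_confusion, w_report = evaluate_model(y_test, weighted_logit.predict(X_test_logit))
print(w_confusion)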
The other model I use is a random forest. Since it does not assume independence among the predictors, I'm going to throw in all the variables that appear to correlate with malignancy.
To help control overfitting, I will have grid search choose a max depth between 1 and 5, which prevents the trees from growing too deep.
forest_columns = ["radius_mean", "perimeter_mean", "concavity_worst", "symmetry_worst","fractal_dimension_mean"]
X_train_forest = X_train[forest_columns]
X_test_forest = X_test[forest_columns]
parameters = {"max_depth": [1,2,3,4,5], "random_state": [767102], "n_estimators": [10, 50]}
grid_forest = GridSearchCV(RandomForestClassifier(), parameters, cv = 10, verbose = True, n_jobs = -1)
grid_forest.fit(X_train_forest, y_train)
Here each tree in the forest makes at most 4 levels of splits, and 50 trees are aggregated within the forest.
grid_forest.best_estimator_
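As an additional guard against overfitting (an added check, not in the original analysis), the chosen settings can be refit with the out-of-bag score enabled, which gives an estimate of accuracy without touching the test set:
oob_forest = RandomForestClassifier(max_depth = grid_forest.best_params_["max_depth"], n_estimators = grid_forest.best_params_["n_estimators"], oob_score = True, random_state = 767102)
oob_forest.fit(X_train_forest, y_train)
print("Out-of-bag accuracy: %0.3f" % oob_forest.oob_score_)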
Below you can see feature importances, where concavity worst is the most important feature, and symmetry worst is the least important.
pd.Series(grid_forest.best_estimator_.feature_importances_, index = X_train_forest.columns).sort_values(ascending = False).plot(kind = "bar", title = "Feature Importances")
In the train set we have a 98% accuracy rate, a 95% sensitivity rate, and a 99% specificity rate.
The ROC curve here shows a nice separation of classes as well.
print "Train Evaluation"
a_score, confusion, c_report = evaluate_model(y_train, grid_forest.predict(X_train_forest))
print a_score
print confusion
print c_report
plot_roc(grid_forest, X_train_forest, y_train, "Train ROC")
In our test set we get a 94% accuracy rate, with an 89% sensitivity and a 96% specificity.
print "Test Evaluation"
a_score, confusion, c_report = evaluate_model(y_test, grid_forest.predict(X_test_forest))
print a_score
print confusion
print c_report
plot_roc(grid_forest, X_test_forest, y_test, "Test ROC")
The random forest also performed very nicely, although accuracy dropped from 98% on the train set to 94% on the test set. On the test set it performs at about the same level as the logistic regression.
However, you cannot make the same inferences with a random forest as you can with a logistic regression: the feature importances tell you which features matter, but not whether they push the probability of malignancy up or down.
For this reason and its simplicity, I would go with the logistic model.