General Assembly Breast Cancer Project

By Brendan Bailey

Executive Summary

The provided data gives several physical measurements of breast tumors, with the goal of predicting malignancy. The model is simple, requiring only three predictors: radius mean, concavity worst, and symmetry worst. Based on the held-out test set, we expect the model to predict malignancy with roughly 94% accuracy on new data. The model also indicates that both concavity worst and symmetry worst are positively associated with malignancy.

Technical Summary

The provided data was scaled with a min-max scaler and fed into a logistic regression. The logistic regression used an L1 penalty with a penalty coefficient of 0.01 (that is, scikit-learn's inverse regularization strength C = 100).
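
As a hedged aside on how the two numbers relate: assuming scikit-learn's documented L1-penalized objective for logistic regression (with labels coded as y_i in {-1, +1}), C multiplies the data-fit term. Dividing through by C shows that the weight on the penalty is 1/C, which is where 0.01 comes from.

```latex
\min_{w,\,b}\; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (x_i^{\top} w + b)}\right)
\;\;\Longleftrightarrow\;\;
\min_{w,\,b}\; \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (x_i^{\top} w + b)}\right) + \frac{1}{C}\,\lVert w \rVert_1,
\qquad \frac{1}{C} = \frac{1}{100} = 0.01 .
```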

The predictors used for the model were radius mean, concavity worst, and symmetry worst. Radius mean and concavity worst are correlated, so an interaction term was created. Radius mean and the interaction term were not statistically significant at the 95% level. However, removing them costs a couple of points of accuracy, and radius mean is on the cusp of significance (p = 0.064), so both were kept in the model.

The data was split into a 70% train set and 30% test set, and below are the performances:

Train Performance

  • Accuracy: 96%
  • Sensitivity: 94%
  • Specificity: 97%

Test Performance

  • Accuracy: 94%
  • Sensitivity: 89%
  • Specificity: 97%

There is some drop in sensitivity between the train and test sets (94% to 89%). If this is an issue when the model is deployed, it may be worth penalizing false negatives more heavily.

A random forest was also created and performed at the same level as the logistic regression. However, I find the logistic model preferable due to its interpretability.

Python Coding and Data Set

In [17]:
import pandas as pd
import requests
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc
from sklearn.ensemble import RandomForestClassifier
import statsmodels.api as sm
%matplotlib inline

plt.style.use('ggplot')

Loading in Data with Header

In [18]:
columns = requests.get("https://gist.githubusercontent.com/jeff-boykin/b5c536467c30d66ab97cd1f5c9a3497d/raw/5233c792af49c9b78f20c35d5cd729e1307a7df7/field_names.txt").text.split("\n")
cancer_df = pd.read_csv("https://gist.githubusercontent.com/jeff-boykin/b5c536467c30d66ab97cd1f5c9a3497d/raw/5233c792af49c9b78f20c35d5cd729e1307a7df7/breast-cancer.csv", names = columns)

Performing exploratory data analysis on the data

In [19]:
def eda(dataframe): #Performs exploratory data analysis on the dataframe
    print "columns \n", dataframe.columns
    print "head \n", dataframe.head()
    print "tail \n", dataframe.tail()
    print "missing values \n", dataframe.isnull().sum()
    print "dataframe types \n", dataframe.dtypes
    print "dataframe shape \n", dataframe.shape
    print "dataframe describe \n", dataframe.describe() #summary statistics
    for item in dataframe:
        print item
        print dataframe[item].nunique()
    print "%s duplicates out of %s records" % (len(dataframe) - len(dataframe.drop_duplicates()), len(dataframe))
In [20]:
eda(cancer_df)
columns 
Index([u'ID', u'diagnosis', u'radius_mean', u'radius_sd_error',
       u'radius_worst', u'texture_mean', u'texture_sd_error', u'texture_worst',
       u'perimeter_mean', u'perimeter_sd_error', u'perimeter_worst',
       u'area_mean', u'area_sd_error', u'area_worst', u'smoothness_mean',
       u'smoothness_sd_error', u'smoothness_worst', u'compactness_mean',
       u'compactness_sd_error', u'compactness_worst', u'concavity_mean',
       u'concavity_sd_error', u'concavity_worst', u'concave_points_mean',
       u'concave_points_sd_error', u'concave_points_worst', u'symmetry_mean',
       u'symmetry_sd_error', u'symmetry_worst', u'fractal_dimension_mean',
       u'fractal_dimension_sd_error', u'fractal_dimension_worst'],
      dtype='object')
head 
         ID diagnosis  radius_mean  radius_sd_error  radius_worst  \
0    842302         M        17.99            10.38        122.80   
1    842517         M        20.57            17.77        132.90   
2  84300903         M        19.69            21.25        130.00   
3  84348301         M        11.42            20.38         77.58   
4  84358402         M        20.29            14.34        135.10   

   texture_mean  texture_sd_error  texture_worst  perimeter_mean  \
0        1001.0           0.11840        0.27760          0.3001   
1        1326.0           0.08474        0.07864          0.0869   
2        1203.0           0.10960        0.15990          0.1974   
3         386.1           0.14250        0.28390          0.2414   
4        1297.0           0.10030        0.13280          0.1980   

   perimeter_sd_error           ...             concavity_worst  \
0             0.14710           ...                       25.38   
1             0.07017           ...                       24.99   
2             0.12790           ...                       23.57   
3             0.10520           ...                       14.91   
4             0.10430           ...                       22.54   

   concave_points_mean  concave_points_sd_error  concave_points_worst  \
0                17.33                   184.60                2019.0   
1                23.41                   158.80                1956.0   
2                25.53                   152.50                1709.0   
3                26.50                    98.87                 567.7   
4                16.67                   152.20                1575.0   

   symmetry_mean  symmetry_sd_error  symmetry_worst  fractal_dimension_mean  \
0         0.1622             0.6656          0.7119                  0.2654   
1         0.1238             0.1866          0.2416                  0.1860   
2         0.1444             0.4245          0.4504                  0.2430   
3         0.2098             0.8663          0.6869                  0.2575   
4         0.1374             0.2050          0.4000                  0.1625   

   fractal_dimension_sd_error  fractal_dimension_worst  
0                      0.4601                  0.11890  
1                      0.2750                  0.08902  
2                      0.3613                  0.08758  
3                      0.6638                  0.17300  
4                      0.2364                  0.07678  

[5 rows x 32 columns]
tail 
         ID diagnosis  radius_mean  radius_sd_error  radius_worst  \
564  926424         M        21.56            22.39        142.00   
565  926682         M        20.13            28.25        131.20   
566  926954         M        16.60            28.08        108.30   
567  927241         M        20.60            29.33        140.10   
568   92751         B         7.76            24.54         47.92   

     texture_mean  texture_sd_error  texture_worst  perimeter_mean  \
564        1479.0           0.11100        0.11590         0.24390   
565        1261.0           0.09780        0.10340         0.14400   
566         858.1           0.08455        0.10230         0.09251   
567        1265.0           0.11780        0.27700         0.35140   
568         181.0           0.05263        0.04362         0.00000   

     perimeter_sd_error           ...             concavity_worst  \
564             0.13890           ...                      25.450   
565             0.09791           ...                      23.690   
566             0.05302           ...                      18.980   
567             0.15200           ...                      25.740   
568             0.00000           ...                       9.456   

     concave_points_mean  concave_points_sd_error  concave_points_worst  \
564                26.40                   166.10                2027.0   
565                38.25                   155.00                1731.0   
566                34.12                   126.70                1124.0   
567                39.42                   184.60                1821.0   
568                30.37                    59.16                 268.6   

     symmetry_mean  symmetry_sd_error  symmetry_worst  fractal_dimension_mean  \
564        0.14100            0.21130          0.4107                  0.2216   
565        0.11660            0.19220          0.3215                  0.1628   
566        0.11390            0.30940          0.3403                  0.1418   
567        0.16500            0.86810          0.9387                  0.2650   
568        0.08996            0.06444          0.0000                  0.0000   

     fractal_dimension_sd_error  fractal_dimension_worst  
564                      0.2060                  0.07115  
565                      0.2572                  0.06637  
566                      0.2218                  0.07820  
567                      0.4087                  0.12400  
568                      0.2871                  0.07039  

[5 rows x 32 columns]
missing values 
ID                            0
diagnosis                     0
radius_mean                   0
radius_sd_error               0
radius_worst                  0
texture_mean                  0
texture_sd_error              0
texture_worst                 0
perimeter_mean                0
perimeter_sd_error            0
perimeter_worst               0
area_mean                     0
area_sd_error                 0
area_worst                    0
smoothness_mean               0
smoothness_sd_error           0
smoothness_worst              0
compactness_mean              0
compactness_sd_error          0
compactness_worst             0
concavity_mean                0
concavity_sd_error            0
concavity_worst               0
concave_points_mean           0
concave_points_sd_error       0
concave_points_worst          0
symmetry_mean                 0
symmetry_sd_error             0
symmetry_worst                0
fractal_dimension_mean        0
fractal_dimension_sd_error    0
fractal_dimension_worst       0
dtype: int64
dataframe types 
ID                              int64
diagnosis                      object
radius_mean                   float64
radius_sd_error               float64
radius_worst                  float64
texture_mean                  float64
texture_sd_error              float64
texture_worst                 float64
perimeter_mean                float64
perimeter_sd_error            float64
perimeter_worst               float64
area_mean                     float64
area_sd_error                 float64
area_worst                    float64
smoothness_mean               float64
smoothness_sd_error           float64
smoothness_worst              float64
compactness_mean              float64
compactness_sd_error          float64
compactness_worst             float64
concavity_mean                float64
concavity_sd_error            float64
concavity_worst               float64
concave_points_mean           float64
concave_points_sd_error       float64
concave_points_worst          float64
symmetry_mean                 float64
symmetry_sd_error             float64
symmetry_worst                float64
fractal_dimension_mean        float64
fractal_dimension_sd_error    float64
fractal_dimension_worst       float64
dtype: object
dataframe shape 
(569, 32)
dataframe describe 
                 ID  radius_mean  radius_sd_error  radius_worst  texture_mean  \
count  5.690000e+02   569.000000       569.000000    569.000000    569.000000   
mean   3.037183e+07    14.127292        19.289649     91.969033    654.889104   
std    1.250206e+08     3.524049         4.301036     24.298981    351.914129   
min    8.670000e+03     6.981000         9.710000     43.790000    143.500000   
25%    8.692180e+05    11.700000        16.170000     75.170000    420.300000   
50%    9.060240e+05    13.370000        18.840000     86.240000    551.100000   
75%    8.813129e+06    15.780000        21.800000    104.100000    782.700000   
max    9.113205e+08    28.110000        39.280000    188.500000   2501.000000   

       texture_sd_error  texture_worst  perimeter_mean  perimeter_sd_error  \
count        569.000000     569.000000      569.000000          569.000000   
mean           0.096360       0.104341        0.088799            0.048919   
std            0.014064       0.052813        0.079720            0.038803   
min            0.052630       0.019380        0.000000            0.000000   
25%            0.086370       0.064920        0.029560            0.020310   
50%            0.095870       0.092630        0.061540            0.033500   
75%            0.105300       0.130400        0.130700            0.074000   
max            0.163400       0.345400        0.426800            0.201200   

       perimeter_worst           ...             concavity_worst  \
count       569.000000           ...                  569.000000   
mean          0.181162           ...                   16.269190   
std           0.027414           ...                    4.833242   
min           0.106000           ...                    7.930000   
25%           0.161900           ...                   13.010000   
50%           0.179200           ...                   14.970000   
75%           0.195700           ...                   18.790000   
max           0.304000           ...                   36.040000   

       concave_points_mean  concave_points_sd_error  concave_points_worst  \
count           569.000000               569.000000            569.000000   
mean             25.677223               107.261213            880.583128   
std               6.146258                33.602542            569.356993   
min              12.020000                50.410000            185.200000   
25%              21.080000                84.110000            515.300000   
50%              25.410000                97.660000            686.500000   
75%              29.720000               125.400000           1084.000000   
max              49.540000               251.200000           4254.000000   

       symmetry_mean  symmetry_sd_error  symmetry_worst  \
count     569.000000         569.000000      569.000000   
mean        0.132369           0.254265        0.272188   
std         0.022832           0.157336        0.208624   
min         0.071170           0.027290        0.000000   
25%         0.116600           0.147200        0.114500   
50%         0.131300           0.211900        0.226700   
75%         0.146000           0.339100        0.382900   
max         0.222600           1.058000        1.252000   

       fractal_dimension_mean  fractal_dimension_sd_error  \
count              569.000000                  569.000000   
mean                 0.114606                    0.290076   
std                  0.065732                    0.061867   
min                  0.000000                    0.156500   
25%                  0.064930                    0.250400   
50%                  0.099930                    0.282200   
75%                  0.161400                    0.317900   
max                  0.291000                    0.663800   

       fractal_dimension_worst  
count               569.000000  
mean                  0.083946  
std                   0.018061  
min                   0.055040  
25%                   0.071460  
50%                   0.080040  
75%                   0.092080  
max                   0.207500  

[8 rows x 31 columns]
ID
569
diagnosis
2
radius_mean
456
radius_sd_error
479
radius_worst
522
texture_mean
539
texture_sd_error
474
texture_worst
537
perimeter_mean
537
perimeter_sd_error
542
perimeter_worst
432
area_mean
499
area_sd_error
540
area_worst
519
smoothness_mean
533
smoothness_sd_error
528
smoothness_worst
547
compactness_mean
541
compactness_sd_error
533
compactness_worst
507
concavity_mean
498
concavity_sd_error
545
concavity_worst
457
concave_points_mean
511
concave_points_sd_error
514
concave_points_worst
544
symmetry_mean
411
symmetry_sd_error
529
symmetry_worst
539
fractal_dimension_mean
492
fractal_dimension_sd_error
500
fractal_dimension_worst
535
0 duplicates out of 569 records

Classes are slightly unbalanced, and we have a limited number of records.

In [21]:
cancer_df.diagnosis.value_counts().plot(kind = "bar", title = "Cancer Diagnoses")
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x11e5cb7d0>

For both smoothness and compactness, the mean is greater than the median. This is because each distribution has a tail of large outliers, which pulls the mean upward while leaving the median largely unaffected. In the distributions below, the median line is blue and the mean line is green.

In [22]:
print "Smoothness"
print "Mean:", cancer_df.smoothness_mean.mean()
print "Median:", cancer_df.smoothness_mean.median()

print "\nCompactness"
print "Mean:", cancer_df.compactness_mean.mean()
print "Median:", cancer_df.compactness_mean.median()
Smoothness
Mean: 2.86605922671
Median: 2.287

Compactness
Mean: 0.0254781388401
Median: 0.02045
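
The gap between mean and median printed above is what a few large values produce. The snippet below is a toy illustration with made-up numbers (the `values` list is hypothetical and not taken from the data set); it only uses the standard library and assumes Python 3.

```python
import statistics

# Hypothetical values: four typical measurements plus one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

mean_value = statistics.mean(values)      # pulled upward by the outlier
median_value = statistics.median(values)  # middle value, insensitive to the outlier's size

print("Mean:", mean_value)
print("Median:", median_value)
```

Making the outlier larger moves the mean further but leaves the median where it is, which is the pattern seen in the smoothness and compactness columns.
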
In [23]:
ax = cancer_df.smoothness_mean.plot(kind = "hist", title = "Smoothness Distribution")
ax.axvline(cancer_df.smoothness_mean.mean(), color = "#4cbb17")
ax.axvline(cancer_df.smoothness_mean.median(), color = "#3B5998")
Out[23]:
<matplotlib.lines.Line2D at 0x11defab50>
In [24]:
ax = cancer_df.compactness_mean.plot(kind = "hist", title = "Compactness Distribution")
ax.axvline(cancer_df.compactness_mean.mean(), color = "#4cbb17")
ax.axvline(cancer_df.compactness_mean.median(), color = "#3B5998")
Out[24]:
<matplotlib.lines.Line2D at 0x11e1c8d90>

Below I used the function from one of the labs to create a bootstrap sample. The function takes a dataframe, converts it into a list of dictionaries, draws a random sample with replacement (1,000 records by default), and then converts the sample back to a dataframe.

In [25]:
def bootstrap(dataframe, iters=1000):
    list_of_dicts = dataframe.to_dict(orient = "records")
    random_sample = list(np.random.choice(list_of_dicts, replace=True, size = iters))
    bootstrap_frame = pd.DataFrame(random_sample)
    return bootstrap_frame
In [26]:
bootstrap_sample = bootstrap(cancer_df)
In [27]:
bootstrap_sample.ID.value_counts().head() #Shows duplicates within sample with replacement
Out[27]:
863031      7
869931      6
91805       6
89263202    6
893548      6
Name: ID, dtype: int64
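
The resampling idea can also be sketched without pandas or NumPy. The version below is a minimal illustration, not the lab function: `bootstrap_records`, its `size` and `seed` parameters, and the demo records are all hypothetical names introduced here. It defaults to drawing as many records as the input has, which is the conventional bootstrap sample size, whereas the function above draws a fixed 1,000.

```python
import random

def bootstrap_records(records, size=None, seed=None):
    """Draw `size` records with replacement from a list of records."""
    rng = random.Random(seed)          # a local generator keeps the draw reproducible
    if size is None:
        size = len(records)            # conventional choice: same size as the input
    return [rng.choice(records) for _ in range(size)]

# Hypothetical records standing in for rows of the dataframe.
demo_records = [{"ID": i, "target": i % 2} for i in range(10)]

sample_a = bootstrap_records(demo_records, seed=0)
sample_b = bootstrap_records(demo_records, seed=0)
big_sample = bootstrap_records(demo_records, size=25, seed=1)
```

Because the draw is with replacement, a sample larger than the input (like `big_sample`) must contain repeated records, just as the `value_counts` output above shows repeated IDs.
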

A min-max scaler is used to rescale each feature to the range [0, 1]. This helps prevent any one variable from dominating the model purely because of its scale.

In [28]:
scale_columns = [u'radius_mean', u'radius_sd_error',
       u'radius_worst', u'texture_mean', u'texture_sd_error', u'texture_worst',
       u'perimeter_mean', u'perimeter_sd_error', u'perimeter_worst',
       u'area_mean', u'area_sd_error', u'area_worst', u'smoothness_mean',
       u'smoothness_sd_error', u'smoothness_worst', u'compactness_mean',
       u'compactness_sd_error', u'compactness_worst', u'concavity_mean',
       u'concavity_sd_error', u'concavity_worst', u'concave_points_mean',
       u'concave_points_sd_error', u'concave_points_worst', u'symmetry_mean',
       u'symmetry_sd_error', u'symmetry_worst', u'fractal_dimension_mean',
       u'fractal_dimension_sd_error', u'fractal_dimension_worst']
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(cancer_df[scale_columns]), columns = scale_columns)
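
For readers unfamiliar with the transformation, min-max scaling maps each value x in a column to (x - min) / (max - min). The sketch below spells that out in plain Python; `min_max_scale` and the demo rows are hypothetical, and mapping a constant column to 0.0 is a choice made here to avoid dividing by zero.

```python
def min_max_scale(rows):
    """Rescale each column of a list of equal-length rows to [0, 1]."""
    columns = list(zip(*rows))               # transpose rows into columns
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has zero range, so map it to 0.0.
            (value - low) / float(high - low) if high != low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled

# Hypothetical two-feature data on very different scales.
demo = [[1, 10], [2, 20], [3, 40]]
scaled_demo = min_max_scale(demo)
```

After scaling, both columns run from 0 to 1, so neither dominates simply because its raw numbers are larger.
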

Converting the diagnosis column into a binary target (1 = malignant, 0 = benign)

In [29]:
cancer_df["target"] = cancer_df.diagnosis.map({"M":1,"B":0})

Exploratory Analysis

The variables radius_mean, perimeter_mean, concavity_worst, symmetry_worst, and fractal_dimension_mean all look like they have a positive correlation with malignancy. Below you can see the violin plot on the right is higher in each of the graphs.

There were other correlated variables, but I decided only to focus on either the mean or the worst of any set (i.e. if radius mean and radius std error correlated with malignancy, I would just plot radius mean).

One issue with the data is that most of the variables are highly correlated with each other, causing issues for any models that assume independence.

In [14]:
for i, column in enumerate(["radius_mean", "perimeter_mean", "concavity_worst", "symmetry_worst","fractal_dimension_mean"]):
    plt.figure(i)
    sns.violinplot(x = cancer_df.target, y = X_scaled[column])
    plt.title(column)
In [15]:
sns.heatmap(X_scaled[["radius_mean", "perimeter_mean", "concavity_worst", "symmetry_worst","fractal_dimension_mean"]].corr(), vmax=.8, square=True)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x11e348e10>

Modeling

Logistic Regression

One of the models I went with is a logistic regression, since it is a relatively familiar model and the coefficients can tell us how the features affect the probability of malignancy.

One drawback is that I have to be selective with my features: highly correlated features introduce multicollinearity, which makes the coefficient estimates unstable. Also, because the data is scaled, each coefficient describes a move across a feature's full observed range rather than a one-unit change in the original measurement, so the odds ratios are not directly useful.

For this model I will select radius mean, concavity worst, and symmetry worst. Radius mean and concavity worst are highly correlated, so I will use an interaction term as well.

To prevent overfitting I will let grid search optimize the regularization coefficient and whether I should use l1 or l2 regularization.

In [38]:
X_scaled["radius_mean_concavity_worst_interaction"] = X_scaled["radius_mean"] * X_scaled["concavity_worst"]
X_train, X_test, y_train, y_test = train_test_split(X_scaled, cancer_df.target, stratify = cancer_df.target, test_size=0.3, random_state = 997407)
In [39]:
logit_columns = ["radius_mean", "concavity_worst", "symmetry_worst", "radius_mean_concavity_worst_interaction"]
X_train_logit = X_train[logit_columns]
X_test_logit = X_test[logit_columns]
In [40]:
parameters = {"C": [0.0001, 0.001, 0.01, 0.1, .15, .25, .275, .33, 0.5, .66, 0.75, 1.0, 2.5, 5.0, 10.0, 100.0, 1000.0], "penalty" : ["l1", "l2"], "random_state" : [326228]}
grid_logit = GridSearchCV(LogisticRegression(), parameters, cv = 10, verbose = True, n_jobs = -1) 
In [41]:
grid_logit.fit(X_train_logit, y_train)
Fitting 10 folds for each of 34 candidates, totalling 340 fits
[Parallel(n_jobs=-1)]: Done 340 out of 340 | elapsed:    1.3s finished
Out[41]:
GridSearchCV(cv=10, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.0001, 0.001, 0.01, 0.1, 0.15, 0.25, 0.275, 0.33, 0.5, 0.66, 0.75, 1.0, 2.5, 5.0, 10.0, 100.0, 1000.0], 'random_state': [326228]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=True)
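
The "34 candidates, totalling 340 fits" reported above follows directly from the grid: every value of C is paired with every penalty, and each pair is fitted once per fold. The sketch below only enumerates the combinations; it does not fit any model, and the variable names are illustrative.

```python
from itertools import product

# The same values as the "C" and "penalty" entries of the parameter grid.
c_values = [0.0001, 0.001, 0.01, 0.1, .15, .25, .275, .33, 0.5, .66, 0.75,
            1.0, 2.5, 5.0, 10.0, 100.0, 1000.0]
penalties = ["l1", "l2"]
n_folds = 10

# Every (C, penalty) pair is one candidate model.
candidates = [{"C": c, "penalty": p} for c, p in product(c_values, penalties)]
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

The single-valued `random_state` entry in the grid does not add candidates; it only fixes the seed.
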

Here is the best model chosen by grid search. The inverse regularization strength C is 100 (a weak penalty), and the penalty type is L1.

In [42]:
grid_logit.best_estimator_
Out[42]:
LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=326228, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

The intercept and coefficients for the best model are below.

In [44]:
print "Intercept", grid_logit.best_estimator_.intercept_[0]
for x, y in zip(list(X_train_logit.columns), grid_logit.best_estimator_.coef_[0]):
    print x, y
Intercept -10.7096125705
radius_mean -28.0270346832
concavity_worst 54.7938986327
symmetry_worst 10.1837906608
radius_mean_concavity_worst_interaction 10.0014513013
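
To make the coefficients concrete, the sketch below turns them into a predicted probability with the logistic function. The intercept and coefficients are copied from the output above; the function name `malignancy_probability` and the input values are hypothetical, and the inputs are assumed to be already min-max scaled to [0, 1].

```python
import math

INTERCEPT = -10.7096125705
COEFFICIENTS = {
    "radius_mean": -28.0270346832,
    "concavity_worst": 54.7938986327,
    "symmetry_worst": 10.1837906608,
    "radius_mean_concavity_worst_interaction": 10.0014513013,
}

def malignancy_probability(radius_mean, concavity_worst, symmetry_worst):
    """Apply the fitted logistic model to already-scaled feature values."""
    features = {
        "radius_mean": radius_mean,
        "concavity_worst": concavity_worst,
        "symmetry_worst": symmetry_worst,
        # The interaction term is the product of the two scaled features.
        "radius_mean_concavity_worst_interaction": radius_mean * concavity_worst,
    }
    log_odds = INTERCEPT + sum(COEFFICIENTS[name] * value
                               for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical scaled inputs, chosen only to show the direction of each effect.
p_base = malignancy_probability(0.5, 0.3, 0.2)
p_more_concavity = malignancy_probability(0.5, 0.6, 0.2)
p_more_symmetry = malignancy_probability(0.5, 0.3, 0.6)
```

Holding the other inputs fixed, raising concavity worst or symmetry worst raises the predicted probability, which matches the positive coefficients.
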

I did the same fit using statsmodels to check the confidence intervals of each predictor. Some of the parameters within statsmodels are different, and for that reason the coefficients do not exactly equal the output from scikit-learn.

Notice that radius mean and the radius mean concavity worst interaction term are not statistically significant. However, when I remove them from the model, I do lose a couple of points in accuracy, and for that reason I left them in the model.

Radius mean is on the cusp of being significant (p = 0.064), and if I were to lower my confidence level to 90% it would be significant.

In [78]:
X_train_sm = sm.add_constant(X_train_logit)
sm_model = sm.Logit(y_train, X_train_sm).fit_regularized(method = "l1", alpha = 0.01, max_iter = 100)
sm_model.summary()
Optimization terminated successfully.    (Exit mode 0)
            Current function value: 0.118633849847
            Iterations: 129
            Function evaluations: 129
            Gradient evaluations: 129
QC check did not pass for 2 out of 5 parameters
Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers
Could not trim params automatically due to failed QC check.  Trimming using trim_mode == 'size' will still work.
Out[78]:
Logit Regression Results
Dep. Variable: target No. Observations: 398
Model: Logit Df Residuals: 393
Method: MLE Df Model: 4
Date: Thu, 11 May 2017 Pseudo R-squ.: 0.8246
Time: 19:35:20 Log-Likelihood: -46.058
converged: True LL-Null: -262.66
LLR p-value: 1.866e-92
coef std err z P>|z| [0.025 0.975]
const -10.4823 4.056 -2.584 0.010 -18.433 -2.532
radius_mean -28.8058 15.542 -1.853 0.064 -59.268 1.656
concavity_worst 54.1693 15.356 3.528 0.000 24.072 84.266
symmetry_worst 10.1725 2.230 4.561 0.000 5.801 14.544
radius_mean_concavity_worst_interaction 12.1618 34.730 0.350 0.726 -55.908 80.231
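
The z statistic, p-value, and interval in the radius_mean row of the summary table can be reproduced from the coefficient and its standard error alone. This is a sketch assuming the usual normal-approximation (Wald) formulas; `wald_interval` is a hypothetical helper, and `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

# radius_mean row of the summary table above.
coef, std_err = -28.8058, 15.542

z = coef / std_err                               # Wald z statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided p-value

def wald_interval(coef, std_err, confidence):
    """Normal-approximation confidence interval for a coefficient."""
    critical = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return coef - critical * std_err, coef + critical * std_err

ci_95 = wald_interval(coef, std_err, 0.95)
ci_90 = wald_interval(coef, std_err, 0.90)

print("z:", round(z, 3), "p:", round(p_value, 3))
print("95% interval:", ci_95)
print("90% interval:", ci_90)
```

The 95% interval spans zero, while the 90% interval lies entirely below zero, which is the sense in which radius mean is on the cusp of significance.
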
In [63]:
def evaluate_model(y_true, y_predicted):
    a_score = accuracy_score(y_true, y_predicted)
    conmat = np.array(confusion_matrix(y_true, y_predicted, labels=[1,0]))
    confusion = pd.DataFrame(conmat, index=['y_true', 'y_false'], columns=['predicted_true','predicted_false'])
    c_report = classification_report(y_true, y_predicted)
    return a_score, confusion, c_report
In [64]:
def plot_roc(model, features, target, title):
    # Predicted probability of the positive (malignant) class
    y_score = model.predict_proba(features)[:, 1] #http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html
    FPR, TPR, _ = roc_curve(target, y_score)
    plt.figure(figsize=[11,9])
    # Label both lines so that plt.legend() has entries to display
    plt.plot(FPR, TPR, linewidth=4, label="ROC curve (AUC = %0.2f)" % auc(FPR, TPR))
    plt.plot([0, 1], [0, 1], 'k--', linewidth=4, label="Chance")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate', fontsize=18)
    plt.ylabel('True Positive Rate', fontsize=18)
    plt.title(title, fontsize=18)
    plt.legend(loc="lower right")
    plt.show()

Below is the evaluation on the training set. There is a 96% accuracy, a 94% sensitivity, and a 97% specificity.

The ROC Curve is far from the dashed line, which is exactly what we want.
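
The dashed diagonal corresponds to a model that ranks cases no better than chance, which has an area under the curve (AUC) of 0.5. One way to read the AUC is as the probability that a randomly chosen malignant case gets a higher score than a randomly chosen benign one. The sketch below computes it that way on made-up scores; `pairwise_auc` and the demo data are hypothetical and are not the model's actual predictions.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0      # pair ranked correctly
            elif pos == neg:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities.
demo_labels = [1, 1, 1, 1, 0, 0, 0, 0]
demo_scores = [0.9, 0.7, 0.45, 0.35, 0.4, 0.2, 0.1, 0.05]

demo_auc = pairwise_auc(demo_labels, demo_scores)
```

Here 15 of the 16 positive-negative pairs are ordered correctly, so `demo_auc` is 0.9375; a curve hugging the top-left corner corresponds to a value near 1.
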

In [106]:
print "Train Evaluation"
a_score, confusion, c_report = evaluate_model(y_train, grid_logit.predict(X_train_logit))
print a_score
print confusion
print c_report
Train Evaluation
0.959798994975
         predicted_true  predicted_false
y_true              139                9
y_false               7              243
             precision    recall  f1-score   support

          0       0.96      0.97      0.97       250
          1       0.95      0.94      0.95       148

avg / total       0.96      0.96      0.96       398

In [107]:
plot_roc(grid_logit, X_train_logit, y_train, "Train ROC")

Below is the evaluation on the testing set. There is a 94% accuracy, an 89% sensitivity, and a 97% specificity.

We also get a nice ROC curve here.

In [112]:
a_score, confusion, c_report = evaluate_model(y_test, grid_logit.predict(X_test_logit))
print a_score
print confusion
print c_report
0.941520467836
         predicted_true  predicted_false
y_true               57                7
y_false               3              104
             precision    recall  f1-score   support

          0       0.94      0.97      0.95       107
          1       0.95      0.89      0.92        64

avg / total       0.94      0.94      0.94       171
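
The accuracy, sensitivity, and specificity quoted for the test set follow from the four cells of the confusion matrix above. The helper below is a small sketch (`rates` is a hypothetical name) that uses the test-set counts: 57 true positives, 7 false negatives, 3 false positives, and 104 true negatives.

```python
def rates(tp, fn, fp, tn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": float(tp + tn) / total,      # share of all cases classified correctly
        "sensitivity": float(tp) / (tp + fn),    # share of malignant cases caught
        "specificity": float(tn) / (tn + fp),    # share of benign cases cleared
    }

# Counts read from the test-set confusion matrix.
test_rates = rates(tp=57, fn=7, fp=3, tn=104)

for name, value in sorted(test_rates.items()):
    print(name, round(value, 3))
```

Sensitivity is the same quantity as the recall reported for class 1, and specificity is the recall reported for class 0.
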

In [113]:
plot_roc(grid_logit, X_test_logit, y_test, "Test ROC")

This model performs very nicely. It is simple because it only requires three variables (plus an interaction term), and its accuracy and specificity hold up between the train and test sets. We can also say that increases in concavity worst and symmetry worst lead to an increased probability of malignancy. (I am reluctant to draw a conclusion either way for radius mean, because its coefficient and the interaction term's coefficient have opposite signs and neither is statistically significant.)

The only drawback is that we lose some sensitivity with the test set. If we were to deploy this model we may consider adding a slight penalty to False Negatives.
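
One simple way to favor sensitivity, offered here as an assumption rather than something done in this project, is to lower the probability threshold at which a case is called malignant. (Another option would be the `class_weight` parameter that appears in the LogisticRegression output above.) The sketch below uses made-up labels and probabilities; `classify` and `sensitivity_specificity` are hypothetical helpers.

```python
def classify(probabilities, threshold=0.5):
    """Label a case malignant (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def sensitivity_specificity(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return float(tp) / (tp + fn), float(tn) / (tn + fp)

# Hypothetical labels and predicted probabilities of malignancy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probabilities = [0.9, 0.7, 0.45, 0.35, 0.4, 0.2, 0.1, 0.05]

default_rates = sensitivity_specificity(y_true, classify(probabilities, 0.5))
lowered_rates = sensitivity_specificity(y_true, classify(probabilities, 0.3))
```

In this toy data, lowering the threshold from 0.5 to 0.3 raises sensitivity from 0.5 to 1.0 while specificity falls from 1.0 to 0.75, which is the trade-off a false-negative penalty would be buying.
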

Random Forest

The other model I use is a random forest. I do not have to worry about independence of predictors, so I'm going to throw in all the variables that seem to correlate with malignancy.

To help control for overfitting, I will have grid search choose a max depth between 1 and 5, which prevents the trees from growing too large.

In [157]:
forest_columns = ["radius_mean", "perimeter_mean", "concavity_worst", "symmetry_worst","fractal_dimension_mean"]
X_train_forest = X_train[forest_columns]
X_test_forest = X_test[forest_columns]
In [158]:
parameters = {"max_depth": [1,2,3,4,5], "random_state": [767102], "n_estimators": [10, 50]}
grid_forest = GridSearchCV(RandomForestClassifier(), parameters, cv = 10, verbose = True, n_jobs = -1) 
In [159]:
grid_forest.fit(X_train_forest, y_train)
Fitting 10 folds for each of 10 candidates, totalling 100 fits
[Parallel(n_jobs=-1)]: Done  76 tasks      | elapsed:    6.6s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    8.8s finished
Out[159]:
GridSearchCV(cv=10, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [10, 50], 'random_state': [767102], 'max_depth': [1, 2, 3, 4, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=True)

Here each tree in our forest makes at most 4 levels of splits (max depth of 4), and there are 50 trees aggregated within the forest.

In [160]:
grid_forest.best_estimator_
Out[160]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=4, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=50, n_jobs=1, oob_score=False,
            random_state=767102, verbose=0, warm_start=False)

Below you can see feature importances, where concavity worst is the most important feature, and symmetry worst is the least important.

In [161]:
pd.Series(grid_forest.best_estimator_.feature_importances_, index = X_train_forest.columns).sort_values(ascending = False).plot(kind = "bar", title = "Feature Importances")
Out[161]:
<matplotlib.axes._subplots.AxesSubplot at 0x121ca9d10>

In the train set we have a 98% accuracy rate, a 95% sensitivity rate, and a specificity rate above 99%.

The ROC curve here shows a nice separation of classes as well.

In [162]:
print "Train Evaluation"
a_score, confusion, c_report = evaluate_model(y_train, grid_forest.predict(X_train_forest))
print a_score
print confusion
print c_report
Train Evaluation
0.979899497487
         predicted_true  predicted_false
y_true              141                7
y_false               1              249
             precision    recall  f1-score   support

          0       0.97      1.00      0.98       250
          1       0.99      0.95      0.97       148

avg / total       0.98      0.98      0.98       398

In [163]:
plot_roc(grid_forest, X_train_forest, y_train, "Train ROC")

In our test set we get a 94% accuracy rate, with an 89% sensitivity and a 96% specificity.

In [168]:
print "Test Evaluation"
a_score, confusion, c_report = evaluate_model(y_test, grid_forest.predict(X_test_forest))
print a_score
print confusion
print c_report
Test Evaluation
0.93567251462
         predicted_true  predicted_false
y_true               57                7
y_false               4              103
             precision    recall  f1-score   support

          0       0.94      0.96      0.95       107
          1       0.93      0.89      0.91        64

avg / total       0.94      0.94      0.94       171

In [167]:
plot_roc(grid_forest, X_test_forest, y_test, "Test ROC")

The random forest also performed very nicely. We did have some dropoff in accuracy from 98% in the train set to 94% in the test set. The test set shows that it may perform at the same level as the logistic regression.

However, you cannot make the same inferences with the random forest as you can with the logistic regression. You can tell what features are important, but you are unable to tell if they positively or negatively affect the probability of malignancy.

For this reason and its simplicity, I would go with the logistic model.