Machine learning model zoo

Nowadays everyone talks about 'artificial intelligence', and every research paper mentions 'deep learning'. In reality, much of the work is still done by traditional machine learning models. These models have been with us for decades and, within their limitations, do a decent job.

Traditional machine learning models are usually easier to interpret than neural networks: it is simpler to understand the decision process they have learnt. The rule of thumb is that for images, audio, or text, modern neural networks are superior to traditional machine learning methods; for tabular data it often goes the other way. It is worth studying the winning solutions on Kaggle to see which models typically win for different data types.

In [1]:
import numpy as np # np array & math
import pandas as pd # to handle data table
import seaborn as sns # high-level plotting package, built on matplotlib
import matplotlib.pyplot as plt # lower-level plotting package

from collections import Counter
# handy class for counting objects in lists

%matplotlib inline
# to have the plots displayed within the notebook


from sklearn import datasets, cluster
from sklearn import neighbors, ensemble, tree, linear_model
from sklearn import model_selection, metrics
# sklearn is the most popular machine learning library in Python
In [2]:
data = datasets.load_breast_cancer()

Breast Cancer Wisconsin (Diagnostic) Data Set

Fine needle aspirate (FNA) of a breast mass. The dataset contains features of cell nuclei, extracted from digitized images.

A few of the extracted features:

  • radius
  • texture
  • perimeter
  • area
  • smoothness
  • compactness

The target variable is binary, the diagnosis: malignant (M) = 0, benign (B) = 1.

Image: By Ed Uthman from Houston, TX, USA - Pancreas FNA; adenocarcinoma vs. normal ductal epithelium (400x)Uploaded by CFCF, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=30103637

Get familiar with the data

  • missing values
  • range of features
  • strange behaviours, outliers
In [3]:
X = pd.DataFrame(data['data'])
X.columns = data['feature_names']
X.head()
Out[3]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns

In [4]:
Counter(data['target'])
Out[4]:
Counter({0: 212, 1: 357})
In [5]:
data['target_names']
Out[5]:
array(['malignant', 'benign'], dtype='<U9')
In [6]:
X.describe()
Out[6]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 30 columns

In [7]:
plt.imshow(X.T.isna())
print(f'# NAs: {X.isna().sum().sum()}')
# NAs: 0
In [8]:
f'{1+1}foo{"bar".replace("a", " ")}'

# f-strings evaluate arbitrary Python expressions inside strings
# available from Python 3.6
Out[8]:
'2foob r'
In [9]:
X['target'] = data.target
In [10]:
plt.hist(X[X['target'] == 0]['mean radius'], alpha = 0.7, label='malignant', bins=20)
plt.hist(X[X['target'] == 1]['mean radius'], alpha = 0.7, label='benign', bins=20)
plt.xlabel('mean radius', fontsize=15)
plt.legend(fontsize=15)
Out[10]:
<matplotlib.legend.Legend at 0x7f77f1b63be0>
In [11]:
plt.figure(figsize=(20, 7))
data2 = pd.concat([X.target, ((X-X.mean())/X.std()).drop('target', 1)], axis=1)
data2 = pd.melt(data2,id_vars="target",
                    var_name="features",
                    value_name='value')
sns.violinplot(x="features", y="value", hue="target", data=data2, split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
In [12]:
plt.figure(figsize=(14, 14))
sns.heatmap(X.corr(), annot=True, linewidths=.5, fmt= '.1f')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f77f33ca640>
In [13]:
plt.figure(figsize=(14, 14))
sns.clustermap(X.T)
sns.clustermap(((X-X.mean())/X.std()).T)
/home/patbaa/anaconda3/lib/python3.8/site-packages/seaborn/matrix.py:649: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
Out[13]:
<seaborn.matrix.ClusterGrid at 0x7f77ed3be6d0>
<Figure size 1008x1008 with 0 Axes>

Impute missing data when needed, but always with care

  • is the missingness random?
  • is the fraction of missing values significant?
  • remove features / samples
  • fill with mean / median
  • new indicator feature: is_missing? (a sketch follows below)
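
A minimal sketch of the fill-and-flag options on a toy DataFrame (the column name radius and the values are illustrative, not from this dataset):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'radius': [1.0, np.nan, 3.0, 4.0]})       # toy data with one gap
toy['radius_missing'] = toy['radius'].isna()                  # new indicator feature
toy['radius'] = toy['radius'].fillna(toy['radius'].median())  # fill with the median
# removing instead: toy.dropna(axis=0) drops rows, toy.dropna(axis=1) drops columns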

Modeling

  • KMeans clustering - 2 clusters
  • Decision tree classifier
  • Random forest with feature importance
  • Logistic regression
  • K-nearest neighbours classifier
In [14]:
kmeans = cluster.KMeans(n_clusters=2, random_state=42)
rf     = ensemble.RandomForestClassifier(random_state=42)
dt     = tree.DecisionTreeClassifier()
lr     = linear_model.LogisticRegression()
knn    = neighbors.KNeighborsClassifier(5) 

# random states are important for reproducibility

KMeans is unsupervised, the rest are supervised

In [15]:
kmeans_clusters = kmeans.fit_predict(X.drop('target', 1))
(kmeans_clusters == X['target']).mean()
Out[15]:
0.14586994727592267

KMeans does not know which cluster is which; it is unsupervised!

In [16]:
kmeans_clusters = np.array([1 if i == 0 else 0 for i in kmeans_clusters])
(kmeans_clusters == X['target']).mean()
Out[16]:
0.8541300527240774
In [17]:
cm = metrics.confusion_matrix(y_true=X['target'],y_pred=kmeans_clusters)
sns.heatmap(cm,annot=True,fmt="d")
plt.xlabel('prediction', fontsize=15)
plt.ylabel('label', fontsize=15)
Out[17]:
Text(33.0, 0.5, 'label')

Supervised models, everything with default values!

  • K-fold cross-validation, with K=5
  • Leave-one-out is the special case K = len(X), sketched below
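
As a quick illustration of the last point, a sketch (not executed in this notebook): sklearn ships a dedicated splitter, and plain K-fold with K = len(X) yields the same one-sample test folds.

loo = model_selection.LeaveOneOut()           # one sample per test fold
kf  = model_selection.KFold(n_splits=len(X))  # equivalent: K = len(X)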
In [18]:
rf_preds   = model_selection.cross_val_predict(rf, X.drop('target', 1), 
                               X['target'], method='predict_proba', cv=5)
tree_preds = model_selection.cross_val_predict(dt, X.drop('target', 1), 
                               X['target'], method='predict_proba', cv=5)
lr_preds   = model_selection.cross_val_predict(lr, X.drop('target', 1), 
                               X['target'], method='predict_proba', cv=5)
knn_preds  = model_selection.cross_val_predict(knn, X.drop('target', 1), 
                               X['target'], method='predict_proba', cv=5)
/home/patbaa/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
(the same warning is repeated for each of the 5 CV folds)

ConvergenceWarning... possible fixes (sketched below):

  • increase the iteration limit (max_iter)
  • scale the data
  • tune the optimizer (e.g. the learning rate, for solvers that have one)
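
A sketch of the first two remedies, reusing the imports above (not run here):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lr_more_iter = linear_model.LogisticRegression(max_iter=5000)                      # raise the limit
lr_scaled    = make_pipeline(StandardScaler(), linear_model.LogisticRegression())  # scale first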
In [19]:
plt.figure(figsize=(8, 8))
for idx, preds in enumerate([rf_preds, tree_preds, lr_preds, knn_preds]):
    fpr, tpr, _ = metrics.roc_curve(y_score=preds[:,1], y_true=data['target'])
    auc = np.round(metrics.roc_auc_score(y_score=preds[:,1], y_true=data['target']), 3)
    plt.plot(fpr, tpr, label=['random forest', 'decision tree', 
                             'logistic regression', 'knn'][idx] + f': {auc}')
plt.legend(fontsize=15)
plt.plot([0, 1], [0, 1], '--', c='k')
plt.xlabel('False Positive Rate', fontsize=15)
plt.ylabel('True Positive Rate', fontsize=15)
Out[19]:
Text(0, 0.5, 'True Positive Rate')

Why do we have just...

  • one break point for decision tree
  • a few break points for KNN?

Remember: the ROC curve is generated by sweeping the probability threshold! Can these models provide real, continuous probabilities? A quick check follows.
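
A sketch using the predictions computed above: count the distinct probability values each model emits. A classifier that outputs hard 0/1 labels produces a single break point, and 5-NN can emit at most six values (0, 0.2, ..., 1.0).

for name, preds in [('random forest', rf_preds), ('decision tree', tree_preds),
                    ('logistic regression', lr_preds), ('knn', knn_preds)]:
    print(name, len(np.unique(preds[:, 1])))  # number of distinct scores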

In [20]:
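# trick: move the target into the index so the z-scoring below only rescales the features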
X = X.set_index('target')
X = (X - X.mean())/X.std()
X = X.reset_index()
rf_preds2   = model_selection.cross_val_predict(rf, X.drop('target', 1), 
                               X['target'], method='predict_proba', cv=5)
tree_preds2 = model_selection.cross_val_predict(dt, X.drop('target', 1), 
                               X['target'], method='predict_proba', cv=5)
lr_preds2   = model_selection.cross_val_predict(lr, X.drop('target', 1), 
                               X['target'], method='predict_proba', cv=5)
knn_preds2  = model_selection.cross_val_predict(knn, X.drop('target', 1), 
                               X['target'], method='predict_proba', cv=5)
In [21]:
plt.figure(figsize=(16, 8))
plt.subplot(121)
for idx, preds in enumerate([rf_preds, tree_preds, lr_preds, knn_preds]):
    fpr, tpr, _ = metrics.roc_curve(y_score=preds[:,1], y_true=data['target'])
    auc = np.round(metrics.roc_auc_score(y_score=preds[:,1], y_true=data['target']), 3)
    plt.plot(fpr, tpr, label=['random forest', 'decision tree', 
                             'logistic regression', 'knn'][idx] + f': {auc}')
plt.legend(fontsize=15)
plt.plot([0, 1], [0, 1], '--', c='k')
plt.xlabel('False Positive Rate', fontsize=15)
plt.ylabel('True Positive Rate', fontsize=15)
plt.title('Without normalization', fontsize=20)

plt.subplot(122)
for idx, preds in enumerate([rf_preds2, tree_preds2, lr_preds2, knn_preds2]):
    fpr, tpr, _ = metrics.roc_curve(y_score=preds[:,1], y_true=data['target'])
    auc = np.round(metrics.roc_auc_score(y_score=preds[:,1], y_true=data['target']), 3)
    plt.plot(fpr, tpr, label=['random forest', 'decision tree', 
                             'logistic regression', 'knn'][idx] + f': {auc}')
plt.legend(fontsize=15)
plt.plot([0, 1], [0, 1], '--', c='k')
plt.xlabel('False Positive Rate', fontsize=15)
plt.ylabel('True Positive Rate', fontsize=15)
plt.title('With normalization', fontsize=20)
plt.show()

Summary:

  • some models are sensitive to the scale of the input data
  • use sklearn, one of the best packages for ML purposes in Python
    • clean API
    • standard
    • community
  • there is no overall best model
    • Kaggle competitions are often won by
      • tree-based models (for tabular data): random-forest-like models or gradient boosting (XGBoost)
      • (convolutional) neural networks (for images, sound, text)
    • but the best model is data dependent
  • model interpretability can also be important

There are also models outside of sklearn, such as gradient boosting libraries (e.g. XGBoost) or deep learning frameworks.
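
For example, XGBoost exposes an sklearn-compatible estimator, so it plugs into the same cross-validation code as above (a sketch, assuming the xgboost package is installed):

import xgboost as xgb

xgb_model = xgb.XGBClassifier(random_state=42)
xgb_preds = model_selection.cross_val_predict(xgb_model, X.drop('target', 1),
                               X['target'], method='predict_proba', cv=5)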