15 Feb 2017
Analysis of the Adult data set from UCI Machine Learning Repository
Tags | On Computer Technology
This is an analysis of the Adult data set in the UCI Machine Learning Repository.
This data set is meant for binary classification: predicting whether a person's income exceeds 50K per year based on some census data.
Our objective here is to familiarize ourselves with scikit-learn by training a Logistic Regression classifier on the data set using grid search, cross validation and a pipeline, and to see how well we can do.
At the end of it all, I decided to do a writeup using a Jupyter notebook (which I have pretty much no experience with), so here we are. For those of you reading the HTML version of this notebook, you might have to scroll quite a bit, because I have no idea how to toggle the scrolling for the HTML export.
Gist with ipynb file and data set: https://gist.github.com/yanhan/355fb068eb5089b4de78b8de326e6358
First, we import the necessary libraries:
from IPython.display import display
from numpy.random import RandomState
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score, confusion_matrix, make_scorer, precision_recall_fscore_support, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn_pandas import DataFrameMapper
import numpy as np
import pandas as pd
We create a numpy.random.RandomState so that we can reproduce the same results each time we run this notebook.
rs = RandomState(130917)
This data set is small: it consists of 48842 rows with 14 columns (not counting the column giving the response variable), so we can load it entirely into memory. The fields of this data set are delimited by whitespace; we can make use of pandas' read_csv function to load it into memory as a dataframe.
df = pd.read_csv("Dataset.data", header=None, delimiter=r"\s+",)
Let's take a look at the first few rows of the dataframe.
print(df.head())
Seems like everything is according to the specifications. For ease of human consumption, we assign column names to the dataframe based on the specs.
df.columns = [
"Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
"MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
"CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]
See if there are any NaNs in the dataframe:
df.isnull().values.any()
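One caveat: some copies of the Adult data set encode missing values as the literal string "?" rather than as NaN, which isnull will not catch. A quick check for that encoding (a small sketch; adjust to however your copy of the data marks missing values):
(df.astype(str) == "?").any().any()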
Let's show the first few rows of the dataframe with the column names:
print(df.head())
Let's take a look at the values of the Income column:
df.Income.unique()
All good. Let's convert the <=50K values into -1 and the >50K values into +1:
df["Income"] = df["Income"].map({ "<=50K": -1, ">50K": 1 })
Let's extract the response variable into a numpy array and drop it from the dataframe:
y_all = df["Income"].values
df.drop("Income", axis=1, inplace=True,)
Let's look at the data again:
print(df.head())
Now, the Age, fnlwgt, EducationNum, CapitalGain, CapitalLoss and HoursPerWeek columns are clearly numerical. Let's get some summary statistics on these numerical columns:
df.describe()
Age, fnlwgt, EducationNum and HoursPerWeek look pretty ok. But CapitalGain and CapitalLoss both have an IQR of 0. Could they be power law distributions? Let's find out.
df.CapitalGain.value_counts()
Alright, a whopping 91.7% of the CapitalGain column consists of 0. Now onto the CapitalLoss column:
df.CapitalLoss.value_counts()
This is even worse: 95.3% of the CapitalLoss column consists of 0.
Frankly, I do not know how to handle columns distributed according to a power law, so I choose to drop them:
df.drop("CapitalGain", axis=1, inplace=True,)
df.drop("CapitalLoss", axis=1, inplace=True,)
Let's convert the Age, fnlwgt, EducationNum and HoursPerWeek columns to floating point so scikit-learn / numpy won't complain about scaling later.
df.Age = df.Age.astype(float)
df.fnlwgt = df.fnlwgt.astype(float)
df.EducationNum = df.EducationNum.astype(float)
df.HoursPerWeek = df.HoursPerWeek.astype(float)
Time for the categorical variables. First up, WorkClass:
df.WorkClass.unique()
Next up, Education:
df.Education.unique()
Next up, MaritalStatus:
df.MaritalStatus.unique()
Next up, Occupation:
df.Occupation.unique()
Next up, Relationship:
df.Relationship.unique()
Next up, Race:
df.Race.unique()
Next up, Gender:
df.Gender.unique()
Next up, NativeCountry:
df.NativeCountry.unique()
Looks like there are more unique values for NativeCountry compared to the other categorical variables. Exactly how many unique values are there?
len(df.NativeCountry.unique())
It is not entirely clear what Relationship means in this data set, since there is already a MaritalStatus column. While it makes sense for someone whose MaritalStatus column has the value Married-civ-spouse or Married-AF-spouse (whatever those 2 values mean) to have a Relationship column with the value Husband or Wife, can someone's MaritalStatus column have the value Married-civ-spouse while their Relationship column has the value Own-child (whatever that means)?
Common sense tells us that WorkClass, Education and Occupation are features with good predictive power with regard to one's income. Similarly for Race and Gender, but to a slightly lesser extent. But how about MaritalStatus and the not-so-clear Relationship column?
What about NativeCountry? A possible argument for this column is that there may be a higher concentration of people from certain countries in certain occupations, and we know that occupations certainly influence income. But doesn't that make NativeCountry redundant? Not entirely. If there can be a non-trivial difference between the average income of men and women (all else being equal), then there can also be a difference between the average income of people from the US and people not from the US who are in the same occupation (again, all else being equal). Hence this feature could have predictive power as well.
Long story short, we choose to retain all these columns and use one-hot encoding to transform them into numerical features that the Logistic Regression model we're using later can consume.
df = pd.get_dummies(df, columns=[
"WorkClass", "Education", "MaritalStatus", "Occupation", "Relationship",
"Race", "Gender", "NativeCountry",
])
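As a quick illustration of what one-hot encoding does, here is a toy example (the Color column is made up and not part of our data set):
toy = pd.DataFrame({ "Color": ["Red", "Green", "Red"] })
print(pd.get_dummies(toy, columns=["Color"]))
#    Color_Green  Color_Red
# 0            0          1
# 1            1          0
# 2            0          1
Each category becomes its own 0/1 indicator column.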
What are the dimensions of our dataframe?
df.shape
Is our data imbalanced towards one class?
pd.value_counts(pd.Series(y_all))
Turns out, yes. Only 23.9% of the data belongs to the positive class. This is something we'll have to keep in mind when training the classifier later.
We'll do a stratified split of the data set into a training set and a test set, with 25% of the data going into the test set. A stratified split will ensure that the percentage of every class in the training set and test set is very similar to that of the entire data set.
Recall the RandomState variable rs that we created right at the start? We pass it in to the random_state argument of the train_test_split function:
X_train, X_test, y_train, y_test = train_test_split(
df, y_all, test_size=0.25, stratify=y_all, random_state=rs,
)
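As a quick sanity check on the stratification, we can compare the fraction of positive instances in each split (a small sketch):
for name, labels in (("all", y_all), ("train", y_train), ("test", y_test)):
    print("{}: {:.4f}".format(name, np.mean(labels == 1)))
All three fractions should come out very close to 0.239.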
Recall that we have 4 numerical columns: Age, fnlwgt, EducationNum and HoursPerWeek. We will have to scale the data. For simplicity, we choose to use StandardScaler. No scaling / transformation will be done for the rest of the columns. We use the DataFrameMapper from the sklearn-pandas package to accomplish this.
standard_scaler_cols = ["Age", "fnlwgt", "EducationNum", "HoursPerWeek",]
other_cols = list(set(df.columns) - set(standard_scaler_cols))
mapper = DataFrameMapper(
[([col,], StandardScaler(),) for col in standard_scaler_cols] +
[(col, None,) for col in other_cols]
)
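For reference, StandardScaler standardizes a column to zero mean and unit variance, i.e. z = (x - mean) / std. A minimal sketch on a single made-up column:
x = np.array([[20.0], [30.0], [40.0]])
scaler = StandardScaler().fit(x)
print(scaler.transform(x).ravel())        # [-1.2247  0.      1.2247]
print((x.ravel() - x.mean()) / x.std())   # the same values, computed by hand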
Now we create our pipeline, which consists of 2 steps: the scaling mapper, followed by the Logistic Regression classifier. The RandomState variable rs is passed into the classifier.
clf = LogisticRegression(random_state=rs,)
pipeline = Pipeline([
("scale", mapper,),
("logit", clf,)
])
We are going to perform a grid search with cross validation over 2 "hyperparameters":

- The C argument to LogisticRegression, indicating the inverse of the regularization strength. We're going to vary this from 1e-04 to 1e4, with a 10x increase from one value to the next. This is a legit hyperparameter.
- The class_weight argument to LogisticRegression. Seems like setting this to balanced could help out in our case where the classes are imbalanced. I am not sure whether this is a real hyperparameter to Logistic Regression, which is why I put "hyperparameters" in quotes.

For scoring, we will make use of roc_auc_score instead of the default accuracy, because of the imbalance in classes. The ROC AUC score also has an intuitive appeal in that it represents the probability of the classifier ranking a randomly selected positive instance over a randomly selected negative instance. (One subtlety worth noting: make_scorer(roc_auc_score) scores the hard label predictions; passing scoring="roc_auc" would rank by the classifier's decision scores instead. We use the former here.)
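To make that interpretation concrete, here is a minimal sketch with made-up labels and scores, showing that the fraction of correctly ranked (positive, negative) pairs equals roc_auc_score:
y_toy = np.array([0, 0, 0, 1, 1])
scores_toy = np.array([0.1, 0.6, 0.35, 0.8, 0.5])
pos = scores_toy[y_toy == 1]
neg = scores_toy[y_toy == 0]
# Fraction of pairs where the positive instance gets the higher score
print(np.mean([float(p > n) for p in pos for n in neg]))  # 0.8333...
print(roc_auc_score(y_toy, scores_toy))                   # identical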
To make the results repeatable, we create a StratifiedKFold object and pass it the RandomState variable rs. Then we pass the StratifiedKFold object as the cross validation generator via the cv argument of the GridSearchCV object.
# shuffle=True is required for random_state to have any effect
strat_kfold = StratifiedKFold(10, shuffle=True, random_state=rs,)
estimator = GridSearchCV(
pipeline,
param_grid={
"logit__C": np.power(10, np.arange(-4.0, 5.0)),
"logit__class_weight": ["balanced", None,],
},
scoring=make_scorer(roc_auc_score),
cv=strat_kfold,
)
Now we perform the grid search with cross validation and print out the results.
I chose to disable this warning because there are so many instances of it:
DeprecationWarning: Estimator DataFrameMapper modifies parameters in __init__. This behavior is deprecated as of 0.18 and support for this behavior will be removed in 0.20.
It should have been resolved in one of the issues for the sklearn-pandas package, but for some reason it has resurfaced. Credits to this Stack Overflow answer for the workaround:
import warnings
warnings.filterwarnings("ignore")
estimator.fit(X_train, y_train)
Let's look at the results in a pandas DataFrame:
cv_results_df = pd.DataFrame(estimator.cv_results_)
cv_results_df.sort_values(by="rank_test_score").head(5)
Best parameters for Logistic Regression:
print(estimator.best_params_)
Ok, so setting class_weight to balanced does help for Logistic Regression. The best regularization parameter C is 0.01.
Now we write a function to build a pandas DataFrame from a confusion matrix:
def _build_df_from_confusion_matrix(confusion_matrix, as_fractions=False):
    if as_fractions:
        # Normalize each row so its entries are fractions of the actual class
        x = np.array(confusion_matrix)
        x = np.apply_along_axis(
            lambda row: [
                row[0] / (row[0] + row[1]),
                row[1] / (row[0] + row[1])
            ],
            1,
            x
        )
    else:
        x = confusion_matrix
    df = pd.DataFrame(
        x,
        index=["<= 50K", "> 50K"],
        columns=["<= 50K", "> 50K"]
    )
    df.index.names = ["Actual"]
    df.columns.names = ["Predicted"]
    return df
y_train_predicted = estimator.predict(X_train)
print("Training set accuracy score: {}".format(
accuracy_score(y_train, y_train_predicted)
))
print("Training set AUROC score: {}".format(estimator.score(X_train, y_train)))
print("\nConfusion matrix for training set:")
training_confusion_matrix = confusion_matrix(
y_train, y_train_predicted
)
display(_build_df_from_confusion_matrix(training_confusion_matrix))
print("Same as above but in fractions:")
display(_build_df_from_confusion_matrix(training_confusion_matrix, as_fractions=True))
print("Precision, recall, f-score:")
precision_recall_fscore_support(y_train, y_train_predicted)
While the AUROC score of 0.8071775040759195 seems decent, the precision, recall and f-score tell a different story.
We see that the precision for the negative class is awesome (94.135%) but the precision for the positive class sucks (53.286%). Even though 76.6% of the negative class instances were classified correctly, the sheer number of negative instances means that 46.7% of the instances predicted as positive are actually negative (strictly speaking a false discovery rate, rather than a false positive rate).
That said, the recall for the positive class is 84.826%. So this classifier captures a good part of the instances that belong to the positive class.
As for the f-score: it is the harmonic mean of precision and recall, and I will not dwell on it here.
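For reference, here is how the precision and recall for the positive class fall out of the confusion matrix counts (the numbers below are hypothetical, just to show the arithmetic):
tn, fp, fn, tp = 21000, 6500, 1300, 7400  # hypothetical counts
print(tp / (tp + fp))  # precision: of those predicted > 50K, the fraction that are right
print(tp / (tp + fn))  # recall: of the actual > 50K, the fraction we caught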
But this is for the training set. How about the test set?
y_test_predicted = estimator.predict(X_test)
print("Test set accuracy score: {}".format(
accuracy_score(y_test, y_test_predicted)
))
print("Test set AUROC score: {}".format(estimator.score(X_test, y_test)))
print("\nConfusion matrix for test set:")
test_confusion_matrix = confusion_matrix(y_test, y_test_predicted)
display(_build_df_from_confusion_matrix(test_confusion_matrix))
print("Same as above but in fractions:")
display(_build_df_from_confusion_matrix(test_confusion_matrix, as_fractions=True))
print("Precision, recall, f-score:")
precision_recall_fscore_support(y_test, y_test_predicted)
We get very similar numbers for the test set. Seems like the classifier performed slightly better on the test set than on the training set, which is kind of strange. But I guess this means that the Logistic Regression algorithm has learnt something from the data.
So we have a classifier with good recall but bad precision. This is great if we want to identify as many people with income exceeding 50K per annum as possible, but not so great when we realize that almost half of the people identified as having income exceeding 50K per annum actually earn less than that.
This not-so-great performance could be mainly attributed to us dropping the CapitalGain and CapitalLoss columns. I'm not entirely sure how to deal with those, but one idea is to take the logarithm of those columns; we can handle zeros by adding 1 to the entire column first, as sketched below.
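A minimal sketch of that idea - np.log1p computes log(1 + x), so zeros map cleanly to zero:
x = np.array([0.0, 10.0, 1000.0, 99999.0])
print(np.log1p(x))  # [ 0.      2.3979  6.9088 11.5129]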
Another idea is to add polynomial features, which can be done without too much effort in scikit-learn. The major downside is that we already had 106 features after dropping the CapitalGain and CapitalLoss columns; with those 2 columns back, we will have 108 features, and with polynomial features this number blows up (see the sketch below). Grid search will take a lot more time to complete.
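A quick sketch of that blow-up, assuming scikit-learn's PolynomialFeatures: with n = 108 input features, degree 2 yields the n original features, n squares, and 108 choose 2 = 5778 pairwise interactions:
from sklearn.preprocessing import PolynomialFeatures
n = 108
dummy = np.zeros((1, n))
poly_sketch = PolynomialFeatures(degree=2, include_bias=False)
print(poly_sketch.fit_transform(dummy).shape[1])  # 5994 = 108 + 108 + 5778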
We learnt how to:

- use the DataFrameMapper in the sklearn-pandas package to scale only selected columns
- use sklearn.metrics.roc_auc_score as the evaluation criterion during GridSearchCV

A pretty good deal, I'd say. The next time, I will perform some analysis / learning on a 'more real' dataset.
CapitalGain and CapitalLoss features

One fine day I recalled seeing that there are no negative values in the CapitalGain and CapitalLoss columns - the minimum value is zero. Perhaps we can take logarithms and see how the resulting values are distributed.
# Same code as above to re-create the dataframe
df = pd.read_csv("Dataset.data", header=None, delimiter=r"\s+",)
df.columns = [
"Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
"MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
"CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]
%matplotlib inline
import matplotlib.pyplot as plt
log_capital_gain = np.log(df.CapitalGain + 1)
print("Min: {}, Max: {}".format(
np.min(log_capital_gain), np.max(log_capital_gain)
))
hist, bins = np.histogram(log_capital_gain, bins=50)
plt.bar((bins[:-1] + bins[1:]) / 2, hist, align="center",)
plt.xlabel("log(CapitalGain)")
plt.ylabel("Counts")
log_capital_loss = np.log(df.CapitalLoss + 1)
print("Min: {}, Max: {}".format(
np.min(log_capital_loss), np.max(log_capital_loss)
))
hist, bins = np.histogram(log_capital_loss, bins=50)
plt.bar((bins[:-1] + bins[1:]) / 2, hist, align="center",)
plt.xlabel("log(CapitalLoss)")
plt.ylabel("Counts")
Hmm, so I remembered the wrong stuff. Even after the log transform, the values in these 2 columns are still distributed very similarly to before. If these columns really follow a power law, a log-log plot of the counts against the values should be approximately linear. Let's do that.
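Before that, as a point of reference for what a power law should look like on such a plot, here is a synthetic sketch using a Zipf distribution (purely illustrative, not our data):
zipf_samples = np.random.RandomState(0).zipf(2.0, size=100000)
vals, counts = np.unique(zipf_samples, return_counts=True)
plt.scatter(np.log(vals), np.log(counts), marker=".")
plt.xlabel("log(value)")
plt.ylabel("log(Counts)")
# For a power law, these points fall roughly on a straight line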
len_of_range = np.max(df.CapitalLoss) + 1
# Column 0 holds each possible CapitalLoss value, column 1 its count
m = np.hstack(
    (
        np.array(range(len_of_range)).reshape((len_of_range, 1,)),
        np.bincount(df.CapitalLoss).reshape((len_of_range, 1,)),
    )
)
# Keep only values that actually occur, add 1 to avoid log(0), then take logs
m = m[m[:, 1] > 0]
m[:, 0] += 1
m = np.log(m)
plt.plot(m[:, 0], m[:, 1])
plt.xlabel("log(CapitalLoss + 1)")
plt.ylabel("log(Counts)")
Oops, wrong type of plot. Let's try fitting a line and plotting the points as a scatter plot.
coeffs = np.polyfit(m[:, 0], m[:, 1], 1)
plt.scatter(m[:, 0], m[:, 1], marker="o")
x = np.arange(0, 10)
plt.plot(x, coeffs[0] * x + coeffs[1], color="green")
plt.xlabel("log(CapitalLoss + 1)")
plt.ylabel("log(Counts)")
Ok, this is pretty much the same, and totally not what we envisioned. Ack.
I recalled seeing someone at a talk plot the distribution of a single variable grouped by class, with each class's curve colored differently. A little googling yields these results: http://stackoverflow.com/a/31258153/732396 and http://stackoverflow.com/a/32748510/732396 , which we make use of in the plots below:
df.groupby("Income").CapitalLoss.hist(normed=1, alpha=0.6,)
df.groupby("Income").CapitalGain.hist(normed=1, alpha=0.6)
I also recall hearing that if there is a marked difference between the distributions of the points of the two classes, then that variable should be a good feature. In this case, what we see from these plots is not exactly great, because for both classes the majority of the values are concentrated near 0.
Just realized the capital gain / loss values aren't on a log scale. Let's take logs.
df2 = df.copy()
# Add 1 so zeros map to zero, then take logs
df2.CapitalLoss = np.log(df2.CapitalLoss + 1)
df2.groupby("Income").CapitalLoss.hist(normed=1, alpha=0.6,)
df2.CapitalGain = np.log(df2.CapitalGain + 1)
df2.groupby("Income").CapitalGain.hist(normed=1, alpha=0.6)
The situation is similar. Never mind.
Onto the machine learning. For the CapitalLoss and CapitalGain columns, we will add 1 to all the values and then take logs. This will not leak data from the test set into the training set, since the transform is applied pointwise with a fixed constant and estimates nothing from the data. Ultimately, we will be using StandardScaler to transform those 2 columns before passing the data to Logistic Regression.
# Add 1 so zeros map to zero, then take logs
df.CapitalGain = np.log(df.CapitalGain + 1)
df.CapitalLoss = np.log(df.CapitalLoss + 1)
# Apply what's necessary amongst the stuff we did earlier
df["Income"] = df["Income"].map({ "<=50K": -1, ">50K": 1 })
y_all = df.Income.values
df = pd.get_dummies(df, columns=[
"WorkClass", "Education", "MaritalStatus", "Occupation", "Relationship",
"Race", "Gender", "NativeCountry",
])
# We have to redo the train-test split because of the new columns
df.drop("Income", axis=1, inplace=True,)
X_train, X_test, y_train, y_test = train_test_split(
df, y_all, test_size=0.25, stratify=y_all, random_state=rs,
)
# Note the newly added `CapitalGain` and `CapitalLoss` columns
standard_scaler_cols = [
"Age", "fnlwgt", "EducationNum", "CapitalGain", "CapitalLoss", "HoursPerWeek",
]
other_cols = list(set(df.columns) - set(standard_scaler_cols))
# We need to create a new `DataFrameMapper` and `Pipeline` object to
# incorporate those 2 columns
mapper = DataFrameMapper(
[([col,], StandardScaler(),) for col in standard_scaler_cols] +
[(col, None,) for col in other_cols]
)
clf = LogisticRegression(random_state=rs,)
pipeline = Pipeline([
("scale", mapper,),
("logit", clf,)
])
# shuffle=True is required for random_state to have any effect
strat_kfold = StratifiedKFold(10, shuffle=True, random_state=rs,)
estimator = GridSearchCV(
pipeline,
param_grid={
"logit__C": np.power(10, np.arange(-4.0, 5.0)),
"logit__class_weight": ["balanced", None,],
},
scoring=make_scorer(roc_auc_score),
cv=strat_kfold,
)
estimator.fit(X_train, y_train)
cv_results_df = pd.DataFrame(estimator.cv_results_)
cv_results_df.sort_values(by="rank_test_score").head(5)
print(estimator.best_params_)
As opposed to the case without the CapitalGain and CapitalLoss columns, the best regularization parameter C here is 1.0 instead of 0.01.
y_train_predicted = estimator.predict(X_train)
print("Training set accuracy score: {}".format(
accuracy_score(y_train, y_train_predicted)
))
print("Training set AUROC score: {}".format(estimator.score(X_train, y_train)))
print("\nConfusion matrix for training set:")
training_confusion_matrix = confusion_matrix(
y_train, y_train_predicted
)
display(_build_df_from_confusion_matrix(training_confusion_matrix))
print("Same as above but in fractions:")
display(_build_df_from_confusion_matrix(training_confusion_matrix, as_fractions=True))
print("Precision, recall, f-score:")
precision_recall_fscore_support(y_train, y_train_predicted)
Results from without the CapitalGain and CapitalLoss columns:
Training set accuracy score: 0.785755234637329
Training set AUROC score: 0.8071775040759195
Precision, recall, f-score:
(array([ 0.94135285, 0.53286032]),
array([ 0.76609488, 0.84826013]),
array([ 0.84472934, 0.65454706]),
array([27866, 8765]))
There is an improvement in the precision for the positive class and recall for the positive class, but a drop in recall for the negative class.
y_test_predicted = estimator.predict(X_test)
print("Test set accuracy score: {}".format(
accuracy_score(y_test, y_test_predicted)
))
print("Test set AUROC score: {}".format(estimator.score(X_test, y_test)))
print("\nConfusion matrix for test set:")
test_confusion_matrix = confusion_matrix(y_test, y_test_predicted)
display(_build_df_from_confusion_matrix(test_confusion_matrix))
print("Same as above but in fractions:")
display(_build_df_from_confusion_matrix(test_confusion_matrix, as_fractions=True))
print("Precision, recall, f-score:")
precision_recall_fscore_support(y_test, y_test_predicted)
Results from without the CapitalGain and CapitalLoss columns:
Test set accuracy score: 0.7903529604455
Test set AUROC score: 0.8116512329133934
Precision, recall, f-score:
(array([ 0.94322224, 0.53917749]),
array([ 0.77080418, 0.85249829]),
array([ 0.84834123, 0.66056749]),
array([9289, 2922]))
Incorporating the CapitalGain and CapitalLoss columns has improved the accuracy and AUROC scores slightly, and similarly for the precision for the positive class. Recall for the positive class dropped slightly, but recall for the negative class increased.
Comparing the confusion matrices, we see that by incorporating these 2 columns, there is a decrease in the number of false positives but an increase in the number of false negatives.
Is this a better classifier? Arguably.
I definitely need more experience playing around with data sets and machine learning algorithms. Since I already have some stuff here, why not go further? What I want to do next: add degree 2 polynomial features, reduce dimensionality with PCA, and train the Logistic Regression classifier on the result.
NOTE: The stuff in this section can take really long to run on a normal PC and can take up to 16GB of RAM on some occasions. Eventually I decided to spin up a c4.8xlarge EC2 instance on AWS to run this. It costs over $2 an hour just to run the instance, and the cross validation took over an hour.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False,)
df_cheat = poly.fit_transform(df)
print(df_cheat.shape)
I wanted to use degree 3 polynomial features, but the memory required exceeded what my system has, because dense matrices are required. Using the degree 2 polynomial feature transformation, we get a total of 108 (original features) + 108 (original features raised to the 2nd power) + 5778 (108 choose 2 interaction features) = 5994 features. Wow.
Let's restore the column names. Code modified from http://stackoverflow.com/a/36735970
df_cheat = pd.DataFrame(data=df_cheat)
df_cheat.columns = [
" * ".join([
("{}^{}".format(col, power) if power > 1 else col)
for col, power in cols_zipped_with_powers if power > 0
])
for cols_zipped_with_powers in [zip(df.columns, p) for p in poly.powers_]
]
df_cheat.head()
Looks correct to me.
We'll do the train-test split before dimensionality reduction using PCA. We'll also scale the columns containing the generated polynomial features, so that PCA is not thrown off by the range of values of those features.
X_train, X_test, y_train, y_test = train_test_split(
df_cheat, y_all, test_size=0.25, stratify=y_all, random_state=rs,
)
standard_scaler_cols = [
"Age", "fnlwgt", "EducationNum", "CapitalGain", "CapitalLoss", "HoursPerWeek",
] + list(df_cheat.columns[108:])
other_cols = list(set(df_cheat.columns) - set(standard_scaler_cols))
mapper = DataFrameMapper(
[([col,], StandardScaler(),) for col in standard_scaler_cols] +
[(col, None,) for col in other_cols]
)
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(mapper.fit_transform(X_train))
np.cumsum(pca.explained_variance_ratio_)
Let's find the number of components such that the explained variance exceeds 0.9
np.argmax(np.cumsum(pca.explained_variance_ratio_) > 0.9)
So np.argmax reports index 1466, meaning the first 1466 components explain just under 90% of the variance (the 1467th pushes us over). Let's keep 1466 components, which is close enough.
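Equivalently, as a one-liner (remembering that np.argmax returns a zero-based index, so the strict component count is the index plus one):
print(int(np.argmax(np.cumsum(pca.explained_variance_ratio_) > 0.9)) + 1)  # 1466 + 1 = 1467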
Because PCA takes such a long time to run, and picking the number of components inside a pipeline would require writing a custom estimator, I do not want to include PCA in the pipeline for the cross validation step. So we'll just run it once, up front, on the training set. (Strictly speaking, we should pass X_train through the mapper first, as we did when choosing the number of components; that is another simplification here.) The drawback is that during cross validation there is some leakage from the training folds into the validation folds, but we'll live with that.
pca2 = PCA(n_components=1466)
X_train_pca = pca2.fit_transform(X_train)
clf = LogisticRegression(random_state=rs,)
# shuffle=True is required for random_state to have any effect
strat_kfold = StratifiedKFold(10, shuffle=True, random_state=rs,)
estimator = GridSearchCV(
clf,
param_grid={
"C": np.power(10, np.arange(-4.0, 5.0)),
"class_weight": ["balanced", None,],
},
scoring=make_scorer(roc_auc_score),
cv=strat_kfold,
)
estimator.fit(X_train_pca, y_train)
estimator.best_params_
y_train_predicted = estimator.predict(X_train_pca)
print("Training set accuracy score: {}".format(
accuracy_score(y_train, y_train_predicted)
))
print("Training set AUROC score: {}".format(estimator.score(X_train_pca, y_train)))
print("\nConfusion matrix for training set:")
train_confusion_matrix = confusion_matrix(y_train, y_train_predicted)
display(_build_df_from_confusion_matrix(train_confusion_matrix))
print("Same as above but in fractions:")
display(_build_df_from_confusion_matrix(train_confusion_matrix, as_fractions=True))
print("Precision, recall, f-score:")
precision_recall_fscore_support(y_train, y_train_predicted)
Numbers for training set without polynomial features and PCA:
Training set accuracy score: 0.8053561191340668
Training set AUROC score: 0.8188093217197174
(array([ 0.94194373, 0.56206818]),
array([ 0.7930094 , 0.84460924]),
array([ 0.86108405, 0.67496353]),
array([27866, 8765]))
Accuracy dropped by about 0.1, and the AUROC score dropped slightly, by about 0.03. For the negative class, precision increased but recall and f-score dropped. For the positive class, there is an increase in recall but a decrease in precision and f-score.
Now for the test set.
X_test_pca = pca2.transform(X_test)
y_test_predicted = estimator.predict(X_test_pca)
print("Test set accuracy score: {}".format(
accuracy_score(y_test, y_test_predicted)
))
print("Test set AUROC score: {}".format(estimator.score(X_test_pca, y_test)))
print("\nConfusion matrix for test set:")
test_confusion_matrix = confusion_matrix(y_test, y_test_predicted)
display(_build_df_from_confusion_matrix(test_confusion_matrix))
print("Same as above but in fractions:")
display(_build_df_from_confusion_matrix(test_confusion_matrix, as_fractions=True))
print("Precision, recall, f-score:")
precision_recall_fscore_support(y_test, y_test_predicted)
Numbers from the test set without polynomial features + PCA:
Test set accuracy score: 0.8048480877896979
Test set AUROC score: 0.818598282440006
(array([ 0.94201229, 0.56126392]),
array([ 0.79222737, 0.8449692 ]),
array([ 0.86065142, 0.67449802]),
array([9289, 2922]))
This is the same situation as for the training set. So overall, polynomial features followed by PCA actually did worse in terms of precision and f-score for the positive class, but performed better in terms of recall.
Disclaimer: Opinions expressed on this blog are solely my own and do not express the views or opinions of my employer(s), past or present.