Appendix - Predicting Churn
Prepare
The data was collected by Waze.
The data is considered 1st party and is considered trustworthy.
The data is stored in a CSV file of approximately 1.5 MB.
The dataset contains data from the last month; the exact month is intentionally left unspecified.
The table consists of 13 columns: ID, label, sessions, drives, total_sessions, n_days_after_onboarding, total_navigations_fav1, total_navigations_fav2, driven_km_drives, duration_minutes_drives, activity_days, driving_days, device
There are 14,999 rows of data.
700 rows are missing a label
Compared summary statistics of the 700 rows that are missing labels with those of the rows that have no missing values.
Calculated how many iPhone users and how many Android users had null values, and their percentages, and compared these with the device ratio in the full dataset.
Concluded that there are no apparent differences between the entries that have a label and those that are missing one. There is nothing to suggest a non-random cause of the missing data.
The maximum value in the driven_km_drives column is 21,183 km, which is more than half the circumference of the Earth and might indicate faulty values.
The model estimating total_sessions might not be accurate, since total_sessions is not much larger than sessions, which only counts the sessions in the last month.
Jupyter Notebook
# Importing packages for data manipulation
import pandas as pd
import numpy as np
# Loading dataset into dataframe
df = pd.read_csv('waze_dataset.csv')
#Viewing the data
df.head(10)
# Checking how many rows and columns there are, whether there are missing values, and what datatypes are used in the dataset
df.info()
# Isolating rows with null values
df_null = df[df['label'].isnull()]
# Displaying summary stats of rows with null values
df_null.describe()
# Isolating rows without null values
df_not_null = df[~df['label'].isnull()]
# Displaying summary stats of rows without null values
df_not_null.describe()
# Counting null values by device
df_null.value_counts('device')
# Calculating % of iPhone nulls and Android nulls
df_null.value_counts('device', normalize=True)
# Calculating % of iPhone users and Android users in full dataset
df.value_counts('device', normalize=True)
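# Follow-up checks (an addition, not in the original listing) backing the notes above:
# the suspicious maximum of 'driven_km_drives' and how last month's 'sessions'
# compare with the lifetime 'total_sessions' estimate.
print(df['driven_km_drives'].max())
print((df['sessions'] / df['total_sessions']).median())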
Process
Calculated the number of churned vs retained users.
Calculated median values of all columns for churned and retained users.
Calculated the median kilometers per drive in the last month, the median kilometers per driving day, and the median number of drives per driving day for both retained and churned users.
Calculated the number and percentage of Android users and iPhone users among churned and retained users.
Jupyter Notebook
# Calculating counts of churned vs. retained
print(df['label'].value_counts())
# Calculating percentages of churned vs. retained
print(df['label'].value_counts(normalize=True))
# Calculating median values of all columns for churned and retained users
df.groupby('label').median(numeric_only=True)
# Grouping data by `label` and calculating medians
medians_by_label = df.groupby('label').median(numeric_only=True)
# Calculating median kilometers per drive
medians_by_label['driven_km_drives'] / medians_by_label['drives']
# Calculating median kilometers per driving day
medians_by_label['driven_km_drives'] / medians_by_label['driving_days']
# Calculating median drives per driving day
medians_by_label['drives'] / medians_by_label['driving_days']
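# Optional consolidation (an addition, not in the original listing): collect the three
# derived medians into one small table for easier side-by-side comparison.
medians_by_label.assign(
    km_per_drive=medians_by_label['driven_km_drives'] / medians_by_label['drives'],
    km_per_driving_day=medians_by_label['driven_km_drives'] / medians_by_label['driving_days'],
    drives_per_driving_day=medians_by_label['drives'] / medians_by_label['driving_days'],
)[['km_per_drive', 'km_per_driving_day', 'drives_per_driving_day']]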
# For each label, calculating the number of Android users and iPhone users
df.groupby(['label', 'device']).size()
# For each label, calculating the percentage of Android users and iPhone users
df.groupby('label')['device'].value_counts(normalize=True)
Analyze
EDA - Exploratory Data Analysis
Plotted boxplots to determine outliers and where the bulk of the data points reside for variables: sessions, drives, total_sessions, n_days_after_onboarding, driven_km_drives, duration_minutes_drives, activity_days, driving_days
Plotted histograms to understand the distribution of variables for: sessions, drives, total_sessions, n_days_after_onboarding, driven_km_drives, duration_minutes_drives, activity_days, driving_days
Plotted pie charts for device and label
Plotted a histogram comparing driving_days and activity_days, and a scatter plot to visualize the relationship between these two variables.
Plotted retention by device
Calculated and plotted retention by kilometers driven per driving day
Plotted a histogram visualizing churn rate per number of driving days
Calculated and plotted the proportion of sessions that occurred in the last month
Calculated the 95th percentile for certain columns and replaced outliers with the value at the 95th percentile
Jupyter Notebook
# Importing packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Box plot 'sessions'
plt.figure(figsize=(5,2))
sns.boxplot(x=df['sessions'], fliersize=2)
plt.title('sessions box plot');
#Histogram 'sessions'
sns.histplot(x=df['sessions'])
median = df['sessions'].median()
plt.axvline(median, color='red', linestyle='--')
plt.text(75,1200, 'median=56.0', color='red')
plt.title('sessions histogram');
# Box plot 'drives'
plt.figure(figsize=(5,2))
sns.boxplot(x=df['drives'], fliersize=2)
plt.title('drives box plot');
# Histogram 'drives'
sns.histplot(x=df['drives'])
median = df['drives'].median()
plt.axvline(median, color='red', linestyle='--')
plt.text(75,1000, 'median=48.0', color='red')
plt.title('drives histogram');
# Box plot total_sessions
plt.figure(figsize=(5,2))
sns.boxplot(x=df['total_sessions'], fliersize=2)
plt.title('total_sessions box plot');
# Histogram 'total_sessions'
sns.histplot(x=df['total_sessions'])
median = df['total_sessions'].median()
plt.axvline(median, color='red', linestyle='--')
plt.text(170,700, 'median=159.6', color='red')
plt.title('total_sessions histogram');
# Box plot 'n_days_after_onboarding'
plt.figure(figsize=(5,2))
sns.boxplot(x=df['n_days_after_onboarding'], fliersize=2)
plt.title('n_days_after_onboarding box plot');
# Histogram 'n_days_after_onboarding'
sns.histplot(x=df['n_days_after_onboarding'])
median = df['n_days_after_onboarding'].median()
plt.axvline(median, color='red', linestyle='--')
plt.title('n_days_after_onboarding histogram');
# Box plot 'driven_km_drives'
plt.figure(figsize=(5,2))
sns.boxplot(x=df['driven_km_drives'], fliersize=2)
plt.title('driven_km_drives box plot');
# Histogram 'driven_km_drives'
sns.histplot(x=df['driven_km_drives'])
median = df['driven_km_drives'].median()
plt.axvline(median, color='red', linestyle='--')
plt.text(4000,700, 'median=3493.9', color='red')
plt.title('driven_km_drives histogram');
# Box plot 'duration_minutes_drives'
plt.figure(figsize=(5,2))
sns.boxplot(x=df['duration_minutes_drives'], fliersize=2)
plt.title('duration_minutes_drives box plot');
# Histogram 'duration_minutes_drives'
sns.histplot(x=df['duration_minutes_drives'])
median = df['duration_minutes_drives'].median()
plt.axvline(median, color='red', linestyle='--')
plt.text(2000,700, 'median=1478.2', color='red')
plt.title('duration_minutes_drives histogram');
# Box plot 'activity_days'
plt.figure(figsize=(5,2))
sns.boxplot(x=df['activity_days'], fliersize=2)
plt.title('activity_days box plot');
# Histogram 'activity_days'
sns.histplot(x=df['activity_days'], discrete=True)
median = df['activity_days'].median()
plt.axvline(median, color='red', linestyle='--')
plt.title('activity_days histogram');
# Box plot 'driving_days'
plt.figure(figsize=(5,2))
sns.boxplot(x=df['driving_days'], fliersize=2)
plt.title('driving_days box plot');
# Histogram 'driving_days'
sns.histplot(x=df['driving_days'], discrete=True)
median = df['driving_days'].median()
plt.axvline(median, color='red', linestyle='--')
plt.title('driving_days histogram');
# Pie chart 'device'
fig = plt.figure(figsize=(3,3))
data=df['device'].value_counts()
plt.pie(data,
labels=[f'{data.index[0]}: {data.values[0]}',
f'{data.index[1]}: {data.values[1]}'],
autopct='%1.1f%%'
)
plt.title('Users by device');
# Pie chart 'label'
fig = plt.figure(figsize=(3,3))
data=df['label'].value_counts()
plt.pie(data,
labels=[f'{data.index[0]}: {data.values[0]}',
f'{data.index[1]}: {data.values[1]}'],
autopct='%1.1f%%'
)
plt.title('Count of retained vs. churned');
# Histogram 'driving_days' vs 'activity_days'
plt.figure(figsize=(12,4))
label=['driving days', 'activity days']
plt.hist([df['driving_days'], df['activity_days']],
bins=range(0,33),
label=label)
plt.xlabel('days')
plt.ylabel('count')
plt.legend()
plt.title('driving_days vs. activity_days');
#Calculating maximum number of 'driving_days' and 'activity_days'
print(df['driving_days'].max())
print(df['activity_days'].max())
# Scatter plot 'driving_days' vs 'activity_days'
sns.scatterplot(data=df, x='driving_days', y='activity_days')
plt.title('driving_days vs. activity_days')
plt.plot([0,31], [0,31], color='red', linestyle='--');
# Histogram retention by 'device'
plt.figure(figsize=(5,4))
sns.histplot(data=df,
x='device',
hue='label',
multiple='dodge',
shrink=0.9
)
plt.title('Retention by device histogram');
#Retention by kilometers driven per driving day
# 1. Create `km_per_driving_day` column
df['km_per_driving_day'] = df['driven_km_drives'] / df['driving_days']
# 2. Call `describe()` on the new column
df['km_per_driving_day'].describe()
# 1. Convert infinite values to zero
df.loc[df['km_per_driving_day']==np.inf, 'km_per_driving_day'] = 0
# 2. Confirm that it worked
df['km_per_driving_day'].describe()
# Histogram: retention by 'km_per_driving_day'
plt.figure(figsize=(12,5))
sns.histplot(data=df,
x='km_per_driving_day',
bins=range(0,1201,20),
hue='label',
multiple='fill')
plt.ylabel('%', rotation=0)
plt.title('Churn rate by mean km per driving day');
# Histogram churn rate per number of driving_days
plt.figure(figsize=(12,5))
sns.histplot(data=df,
x='driving_days',
bins=range(1,32),
hue='label',
multiple='fill',
discrete=True)
plt.ylabel('%', rotation=0)
plt.title('Churn rate per driving day');
#Proportion of sessions that occurred in the last month
df['percent_sessions_in_last_month'] = df['sessions'] / df['total_sessions']
#Median
df['percent_sessions_in_last_month'].median()
# Histogram 'percent_sessions_in_last_month'
sns.histplot(x=df['percent_sessions_in_last_month'], hue=df['label'],)
median = df['percent_sessions_in_last_month'].median()
plt.axvline(median, color='red', linestyle='--')
plt.title('percent_sessions_in_last_month');
#Median 'n_days_after_onboarding'
df['n_days_after_onboarding'].median()
# Histogram 'n_days_after_onboarding'
data = df.loc[df['percent_sessions_in_last_month']>=0.4]
plt.figure(figsize=(5,3))
sns.histplot(x=data['n_days_after_onboarding'])
plt.title('Num. days after onboarding for users with >=40% sessions in last month');
#Handling outliers
#Calculate the 95th percentile of a given column, then replace values above it with the 95th-percentile value
def outlier_imputer(column_name, percentile):
    # Calculate threshold
    threshold = df[column_name].quantile(percentile)
    # Impute threshold for values > than threshold
    df.loc[df[column_name] > threshold, column_name] = threshold
    print('{:>25} | percentile: {} | threshold: {}'.format(column_name, percentile, threshold))
# Calculate 95th percentile for the following columns
for column in ['sessions', 'drives', 'total_sessions',
               'driven_km_drives', 'duration_minutes_drives']:
    outlier_imputer(column, 0.95)
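# Optional check (an addition, not in the original listing): confirm that the
# capped columns now top out near their 95th-percentile thresholds.
df.describe()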
Hypothesis Test
Conducted a hypothesis test, a two-sample t-test, to find out if drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices.
H0: There is no difference in the average number of drives between drivers who use iPhone devices and drivers who use Android devices.
HA: There is a difference in the average number of drives between drivers who use iPhone devices and drivers who use Android devices.
Jupyter Notebook
# Import relevant packages or libraries
import pandas as pd
import numpy as np
from scipy import stats
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')
# Create new dictionary `map_dictionary`
map_dictionary = {'Android':2,'iPhone':1}
# Create new `device_type` column
df['device_type'] = df['device']
# Map the new column to the dictionary
df['device_type'] = df['device_type'].map(map_dictionary)
# Calculate mean for device types
df.groupby('device_type')['drives'].mean()
# Isolate the `drives` column for iPhone users.
drives_iphone = df[df['device_type'] == 1]['drives']
# Isolate the `drives` column for Android users.
drives_android = df[df['device_type'] == 2]['drives']
# Perform the t-test
stats.ttest_ind(a=drives_iphone, b=drives_android, equal_var=False)
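# Follow-up interpretation sketch (an addition, not in the original listing):
# unpack the test result and compare the p-value with a 5% significance level.
t_stat, p_value = stats.ttest_ind(a=drives_iphone, b=drives_android, equal_var=False)
if p_value < 0.05:
    print('Reject H0: average drives differ between iPhone and Android users.')
else:
    print('Fail to reject H0: no evidence of a difference in average drives.')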
Regression Model
Identified that the following variables have outliers: sessions, drives, total_sessions, total_navigations_fav1, total_navigations_fav2, driven_km_drives and duration_minutes_drives. Calculated the 95th percentile of each column and replaced the outliers with the 95th-percentile value.
Created a new variable km_per_driving_day, driven_km_drives divided by driving_days, and converted infinite values to zero.
Created a new binary variable professional_driver that is set to 1 if the driver had 60 or more drives and 15 or more driving days in the last month, otherwise 0.
Created a new binary variable label2 that is set to 1 if the label value is 'churned' and 0 if it is 'retained'.
Checked the correlation among predictor variables and created a correlation heatmap.
Created a new binary variable device2 that is set to 1 if the device value is 'iPhone' and 0 if it is 'Android'.
Dropped the following columns: label, label2, device, sessions and driving_days
Split the data into training and test data.
Used the data to build a binomial logistic regression model to predict user churn.
Checked model assumptions and interpreted model results.
Evaluated the model using a confusion matrix.
Checked the importance of the model's features by generating a bar graph of the model's coefficients.
Jupyter Notebook
# Import relevant packages or libraries
# Packages for numerics + dataframes
import numpy as np
import pandas as pd
# Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Packages for Logistic Regression & Confusion Matrix
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, \
recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
# Load the dataset by running this cell
df = pd.read_csv('waze_dataset.csv')
# Create `km_per_driving_day` column
df['km_per_driving_day'] = df['driven_km_drives']/df['driving_days']
# Convert infinite values to zero
df.loc[df['km_per_driving_day']==np.inf, 'km_per_driving_day'] = 0
# Create `professional_driver` column
df['professional_driver'] = np.where((df['drives'] >= 60) & (df['driving_days'] >= 15), 1, 0)
# Check count of professionals and non-professionals
df['professional_driver'].value_counts(normalize=True)
# Check in-class churn rate
df.groupby(['professional_driver'])['label'].value_counts(normalize=True)
# Drop rows with missing data in `label` column
df = df.dropna(subset=['label'])
# Calculate the 95th percentile of the specified columns, and change outliers in those columns to this value
for column in ['sessions', 'drives', 'total_sessions', 'total_navigations_fav1',
               'total_navigations_fav2', 'driven_km_drives', 'duration_minutes_drives']:
    threshold = df[column].quantile(0.95)
    df.loc[df[column] > threshold, column] = threshold
# Encode categorical variable 'label'. Create binary `label2` column
df['label2'] = np.where((df['label'] == 'churned'), 1, 0)
df[['label', 'label2']].tail()
# Generate a correlation matrix
df.corr(method='pearson', numeric_only=True)
# Plot correlation heatmap
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(method='pearson', numeric_only=True), vmin=-1, vmax=1, annot=True, cmap='coolwarm')
plt.title('Correlation heatmap indicates many low correlated variables',
fontsize=18)
plt.show();
# Encode categorical variable 'device'. Create new `device2` variable
df['device2'] = np.where((df['device'] == 'iPhone'), 1, 0)
df[['device', 'device2']].tail()
# Isolate predictor variables
X = df.drop(columns = ['label', 'label2', 'device', 'sessions', 'driving_days'])
# Isolate target variable
y = df['label2']
# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
# Use .head()
X_train.head()
# Instantiate a logistic regression model. Add the argument `penalty = None`, since the predictors are unscaled.
model = LogisticRegression(penalty=None, max_iter=400)
# Fit the model on `X_train` and `y_train`.
model.fit(X_train, y_train)
# Create a series whose index is the column names and whose values are the coefficients in model.coef_
pd.Series(model.coef_[0], index=X.columns)
#Get the model intercept
model.intercept_
# Get the predicted probabilities of the training data
training_probabilities = model.predict_proba(X_train)
training_probabilities
# Copy the `X_train` dataframe and assign to `logit_data`
logit_data = X_train.copy()
# Create a new `logit` column in the `logit_data` df
logit_data['logit'] = [np.log(prob[1] / prob[0]) for prob in training_probabilities]
# Plot regplot of `activity_days` log-odds
sns.regplot(x='activity_days', y='logit', data=logit_data, scatter_kws={'s': 2, 'alpha': 0.5})
plt.title('Log-odds: activity_days');
# Generate predictions on X_test
y_preds = model.predict(X_test)
# Score the model (accuracy) on the test data
model.score(X_test, y_test)
# Display a confusion matrix
cm = confusion_matrix(y_test, y_preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=['retained', 'churned'],)
disp.plot();
# Calculate precision manually
precision = cm[1,1] / (cm[0, 1] + cm[1, 1])
# Calculate recall manually
recall = cm[1,1] / (cm[1, 0] + cm[1, 1])
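# Print the manually computed metrics (an addition for readability); the values
# should match the 'churned' row of the classification report below.
print('Precision:', round(precision, 3))
print('Recall:   ', round(recall, 3))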
# Create a classification report
target_labels = ['retained', 'churned']
print(classification_report(y_test, y_preds, target_names=target_labels))
#Generate a bar graph of the model's coefficients for a visual representation of the importance of the model's features.
# Create a list of (column_name, coefficient) tuples
feature_importance = list(zip(X_train.columns, model.coef_[0]))
# Sort the list by coefficient value
feature_importance = sorted(feature_importance, key=lambda x: x[1], reverse=True)
# Plot the feature importances
import seaborn as sns
sns.barplot(x=[x[1] for x in feature_importance], y=[x[0] for x in feature_importance],orient='h')
plt.title('Feature importance');
Building and Testing Random Forest and XGBoost Models
Created a feature representing the mean number of kilometers driven on each driving day in the last month for each user, 'km_per_driving_day' and converted infinite values to 0.
Created a new column 'percent_sessions_in_last_month' that represents the percentage of each user's total sessions that were logged in their last month of use.
Created a new, binary feature called 'professional_driver' that is a 1 for users who had 60 or more drives and drove on 15+ days in the last month.
Created a new feature called 'total_sessions_per_day', which represents the mean number of sessions per day since onboarding.
Created a new feature representing the mean kilometers per hour driven in the last month, 'km_per_hour'.
Created a new feature representing the mean number of kilometers per drive made in the last month for each user, 'km_per_drive', and converted infinite values to 0.
Created a new feature that represents the percentage of total sessions that were used to navigate to one of the users' favorite places, 'percent_of_drives_to_favorite'.
Dropped the 700 rows with missing values in the label column.
Encoded the categorical variables 'device' and 'label'.
Dropped the 'ID' column.
Split the data with a ratio of 60/20/20 for the training/validation/test sets. Set the random state to 42 and stratified the splits by the class proportions in y (retained 82.2645 %, churned 17.7355 %).
Created a random forest model tuned with GridSearchCV over a dictionary of hyperparameters. Used recall as the evaluation metric. Examined the best score.
Created an XGBoost model using a dictionary of hyperparameters. Used recall as the evaluation metric. Examined the best score and best parameters.
Used the best random forest model and the best XGBoost model to predict on the validation data.
Used the champion model, the XGBoost model, to predict on the test dataset.
Plotted a confusion matrix of the champion model's predictions on the test data
Inspected the most important features of the final model.
Jupyter Notebook
# Import packages for data manipulation
import numpy as np
import pandas as pd
# Import packages for data visualization
import matplotlib.pyplot as plt
# This lets us see all of the columns, preventing Jupyter from truncating them.
pd.set_option('display.max_columns', None)
# Import packages for data modeling
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# This is the function that helps plot feature importance
from xgboost import plot_importance
# This module lets us save our models once we fit them.
import pickle
# Import dataset
df = pd.read_csv('waze_dataset.csv')
# Create `km_per_driving_day` feature
df['km_per_driving_day'] = df['driven_km_drives'] / df['driving_days']
# Convert infinite values to zero
df.loc[df['km_per_driving_day']==np.inf, 'km_per_driving_day'] = 0
# Create `percent_sessions_in_last_month` feature
df['percent_sessions_in_last_month'] = df['sessions']/df['total_sessions']
# Create `professional_driver` feature
df['professional_driver'] = np.where((df['drives'] >= 60) & (df['driving_days'] >= 15), 1, 0)
# Create `total_sessions_per_day` feature
df['total_sessions_per_day'] = df['total_sessions'] / df['n_days_after_onboarding']
# Create `km_per_hour` feature
df['km_per_hour'] = df['driven_km_drives'] / (df['duration_minutes_drives'] / 60)
# Create `km_per_drive` feature
df['km_per_drive'] = df['driven_km_drives'] / df['drives']
# Convert infinite values to zero
df.loc[df['km_per_drive']==np.inf, 'km_per_drive'] = 0
# Create `percent_of_drives_to_favorite` feature
df['percent_of_drives_to_favorite'] = (df['total_navigations_fav1'] + df['total_navigations_fav2']) / df['total_sessions']
# Drop rows with missing values in the `label` column
df = df.dropna(subset=['label'])
# Encode 'device' column. Create new `device2` variable
df['device2'] = np.where(df['device']=='Android', 0, 1)
# Encode 'label' column. Create binary `label2` column
df['label2'] = np.where(df['label']=='churned', 1, 0)
# Drop `ID` column
df = df.drop(['ID'], axis=1)
# Get class balance of 'label' col
df['label'].value_counts(normalize=True)
# Isolate X variables
X = df.drop(columns=['label', 'label2', 'device'])
# Isolate y variable
y = df['label2']
# Split into train and test sets
X_tr, X_test, y_tr, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
# Split into train and validate sets
X_train, X_val, y_train, y_val = train_test_split(X_tr, y_tr, stratify=y_tr, test_size=0.25, random_state=42)
# Instantiate the random forest classifier
rf = RandomForestClassifier(random_state=42)
# Create a dictionary of hyperparameters to tune
cv_params = {'max_depth': [None],
'max_features': [1.0],
'max_samples': [1.0],
'min_samples_leaf': [2],
'min_samples_split': [2],
'n_estimators': [300],
}
# Define a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1'}
# Instantiate the GridSearchCV object
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='recall')
%%time
# Fit the model to the training data
rf_cv.fit(X_train, y_train)
# Examine best score
rf_cv.best_score_
# Examine best hyperparameter combo (only one value was supplied per hyperparameter, so this is informational)
rf_cv.best_params_
# Instantiate the XGBoost classifier
xgb = XGBClassifier(objective='binary:logistic', random_state=42)
# Create a dictionary of hyperparameters to tune
cv_params = {'max_depth': [6, 12],
'min_child_weight': [3, 5],
'learning_rate': [0.01, 0.1],
'n_estimators': [300]
}
# Define a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1'}
# Instantiate the GridSearchCV object
xgb_cv = GridSearchCV(xgb, cv_params, scoring=scoring, cv=4, refit='recall')
%%time
# Fit the model to the training data
xgb_cv.fit(X_train, y_train)
# Examine best score
xgb_cv.best_score_
# Examine best parameters
xgb_cv.best_params_
# Use random forest model to predict on validation data
rf_val_preds = rf_cv.best_estimator_.predict(X_val)
# Use XGBoost model to predict on validation data
xgb_val_preds = xgb_cv.best_estimator_.predict(X_val)
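# Comparison sketch (an addition, not in the original listing): score both sets of
# validation predictions on recall, the metric used to select the champion model.
print('RF  validation recall:', recall_score(y_val, rf_val_preds))
print('XGB validation recall:', recall_score(y_val, xgb_val_preds))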
# Use XGBoost model to predict on test data
xgb_test_preds = xgb_cv.best_estimator_.predict(X_test)
# Generate array of values for confusion matrix
cm = confusion_matrix(y_test, xgb_test_preds, labels=xgb_cv.classes_)
# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['retained', 'churned'])
disp.plot();
# Plot the most important features of the final model
plot_importance(xgb_cv.best_estimator_);
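# Saving the champion model with pickle, as the earlier import suggests.
# This step is an addition and the filename is illustrative.
with open('xgb_champion_model.pickle', 'wb') as to_write:
    pickle.dump(xgb_cv.best_estimator_, to_write)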