Predicting Churn
Background
Waze has developed a free real-time navigation app that makes it easier for drivers around the world to get to where they want to go. Waze’s community of map editors, beta testers, translators, partners, and users helps make each drive better and safer. Waze partners with cities, transportation authorities, broadcasters, businesses, and first responders to help as many people as possible travel more efficiently and safely.
Waze has launched a project with the goal of preventing user churn, as part of a larger effort at Waze to increase growth. Churn quantifies the number of users who have uninstalled the Waze app or stopped using it. This project focuses on monthly user churn. Typically, high retention rates indicate satisfied users who repeatedly use the Waze app over time. If Waze can identify a segment of users who are at high risk of churning, Waze can proactively engage these users with special offers to try to retain them. Otherwise, Waze may simply lose these users without knowing why.
Analyzing and interpreting the data to generate valuable insights will help Waze leadership optimize the company’s retention strategy, enhance user experience, and make informed, data-driven business decisions about product development.
Problem to Solve
Waze needs help analyzing user data and developing a machine-learning model that will predict user churn, improve user retention, and grow Waze’s business. An accurate model can also help identify specific factors that contribute to churn and answer questions such as:
- Who are the users that are most likely to churn?
- Why do users churn?
- When do users churn?
Preliminary data summary
Initial key insights
The dataset contains 82% retained users and 18% churned users.
Churned users averaged ~3 more drives in the last month than retained users, 50 compared with 47 drives.
Retained users used the app on over twice as many days as churned users in the last month, 17 compared with 8.
The median churned user drove ~200 more kilometers and 2.5 more hours during the last month than the median retained user.
Churned users had more drives in fewer days, and their trips were farther and longer in duration.
The median user from both groups drove ~73 km/drive.
Median drives per driving day for churned users is ~8.3 and for retained users ~3.4.
The median user who churned drove 608 kilometers on each day they drove last month, almost 250% of the per-drive-day distance of retained users, who drove 247 kilometers per day.
Regardless of user churn, the users represented in this data drive a lot. This suggests that this data does not represent typical drivers at large.
Android users comprised approximately 36% of the sample, while iPhone users made up about 64%.
The churn rate for iPhone and Android users was within one percentage point of each other. Nothing in the data suggests that churn is correlated with device type.
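These summary statistics can be reproduced with a few pandas group-bys. The frame below is a toy stand-in: the column names follow the report, but the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the Waze user dataset; column names follow the report,
# the values here are illustrative only.
df = pd.DataFrame({
    "label":  ["retained", "churned", "retained", "retained", "churned"],
    "device": ["iPhone", "Android", "iPhone", "Android", "iPhone"],
    "drives": [47, 50, 40, 52, 49],
})

# Overall churn rate
churn_rate = (df["label"] == "churned").mean()

# Median drives per churn group (medians are robust to the heavy right skew)
median_drives = df.groupby("label")["drives"].median()

# Churn rate per device type
churn_by_device = df.groupby("device")["label"].apply(
    lambda s: (s == "churned").mean()
)
print(churn_rate, median_drives.to_dict(), churn_by_device.to_dict())
```

Medians rather than means are used throughout the report because most of these variables are heavily right-skewed.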
Exploratory Data Analysis
Key Insights
The more times users used the app, the less likely they were to churn. While 40% of the users who didn't use the app at all last month churned, nobody who used the app on all 30 days churned.
Distance driven per driving day had a positive correlation with user churn. The farther a user drove on each driving day, the more likely they were to churn.
Number of driving days had a negative correlation with churn. Users who drove more days of the last month were less likely to churn.
Users of all tenures from brand new to ~10 years were relatively evenly represented in the data.
Nearly all the variables were either very right-skewed or uniformly distributed.
For the right-skewed distributions, this means that most users had values in the lower end of the range for that variable.
For the uniform distributions, this means that users were generally equally likely to have values anywhere within the range for that variable.
Several variables had highly improbable or perhaps even impossible outlying values, such as: driven_km_drives, activity_days and driving_days.
The overall churn rate is ~17%, and this rate is consistent between iPhone users and Android users when outliers are not accounted for.
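A quick way to quantify the skew and flag the improbable outliers is pandas' `skew()` plus the standard 1.5 × IQR rule. The sketch below uses a made-up right-skewed sample in place of driven_km_drives; the real analysis would run the same check on each numeric column.

```python
import pandas as pd

# Made-up right-skewed sample standing in for driven_km_drives
km = pd.Series([120, 300, 450, 500, 620, 700, 900, 1200, 15000],
               name="driven_km_drives")

skew = km.skew()  # > 0 indicates right skew

# Standard 1.5 x IQR rule to flag improbable upper outliers
q1, q3 = km.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = km[km > upper_fence]
print(round(skew, 2), upper_fence, list(outliers))
```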
Sessions last month
The sessions variable is a right-skewed distribution with half of the observations having 56 or fewer sessions. However, as indicated by the boxplot, some users have more than 700.
Drives last month
The drives information follows a distribution similar to the sessions variable. It is right-skewed, approximately log-normal, with a median of 48. However, some drivers had over 400 drives in the last month.
Total sessions
The total_sessions is a right-skewed distribution. The median total number of sessions is 159.6.
Compared with the median number of sessions in the last month, 56, this indicates that a large proportion of a user's total sessions might have taken place in the last month.
Number of days since onboarding
The number of days since onboarding is a uniform distribution with values ranging from near zero to ~3,500 (~9.5 years).
Total kilometers driven last month
The number of kilometers driven in the last month per user is a right-skewed distribution, with half the users driving under 3,495 kilometers.
Drive duration last month
The duration_minutes_drives variable has a heavily skewed right tail. Half of the users drove less than ~1,478 minutes (~25 hours), but some users clocked over 250 hours over the month.
Activity days last month
Users opened the app a median of 16 days. The box plot reveals a centered distribution. The histogram shows a nearly uniform distribution of ~500 people opening the app on each count of days. However, there are ~250 people who didn't open the app at all and ~250 people who opened the app every day of the month.
This distribution is noteworthy because it does not mirror the sessions distribution, which you might think would be closely correlated with activity_days.
Driving days last month
The number of days users drove each month is almost uniform.
However, almost twice as many users did not drive at all during the month (~1,000 vs. ~550). This might seem counterintuitive when considered together with the information from activity_days, and it needs further investigation.
Type of Device
There are nearly twice as many iPhone users as Android users represented in this data.
Churned vs Retained
Less than 18% of the users churned.
Driving days vs Activity days
The data suggest that users open the app more often than they use it to drive, perhaps to check drive times or route information, to update settings, or even just by mistake.
Retention by device
The proportion of churned users to retained users is consistent between device types.
Retention by kilometers driven per driving day
The churn rate tends to increase as the mean daily distance driven increases, indicating that long-distance drivers are more likely to stop using the app.
Retention by number of driving days
The churn rate is highest for people who didn't use Waze much during the last month. The more times they used the app, the less likely they were to churn. While 40% of the users who didn't use the app at all last month churned, nobody who used the app on all 30 days churned.
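The churn-rate-by-driving-days relationship boils down to a single group-by. The mini-sample below is hypothetical and only illustrates the shape of the calculation.

```python
import pandas as pd

# Hypothetical mini-sample: one row per user, with the number of driving
# days last month and whether the user churned (1) or was retained (0)
df = pd.DataFrame({
    "driving_days": [0, 0, 0, 0, 0, 15, 15, 15, 30, 30],
    "churned":      [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
})

# Churn rate at each driving-day count; in the report this relationship
# was visualized as a histogram with churn percentages per bar
churn_by_days = df.groupby("driving_days")["churned"].mean()
print(churn_by_days.to_dict())
```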
Retention by number of days since onboarding
The churn rate is highest in the first year, approximately 25%, and then slightly decreases: after 4 years fewer than 20% churn, and it keeps decreasing to approximately 15% after 6 years.
Percentage of sessions that occurred in the last month
Half of the people in the dataset had 40% or more of their sessions in just the last month, yet the overall median time since onboarding is almost five years (1741 days).
Number of days since onboarding for users with 40% or more sessions in last month
The number of days since onboarding for users with 40% or more of their total sessions occurring in just the last month is a uniform distribution.
This might be faulty data; it is unclear why so many long-time users would suddenly use the app so much in the last month.
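The percent-of-sessions feature behind this finding can be derived as below, assuming `sessions` (last month) and `total_sessions` (lifetime) columns as in the report; the values are illustrative.

```python
import pandas as pd

# Hypothetical sessions (last month) and total_sessions (lifetime) columns
df = pd.DataFrame({
    "sessions":       [56, 10, 80],
    "total_sessions": [160, 200, 100],
})

df["percent_sessions_in_last_month"] = df["sessions"] / df["total_sessions"]

# Users with 40%+ of all their sessions in the last month are the group
# flagged as suspicious above when they also have long tenure
heavy_recent = df[df["percent_sessions_in_last_month"] >= 0.4]
print(len(heavy_recent))
```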
Hypothesis Testing
Conducted a hypothesis test, a two-sample t-test, to answer the following question:
"Do drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices?"
The mean number of drives for Android users is ≈ 66.23 and for iPhone users ≈ 67.86
𝐻0: There is no difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.
𝐻𝐴: There is a difference in the average number of drives between drivers who use iPhone devices and drivers who use Androids.
Significance level set to 5%
The p-value is ≈ 0.143 (14.3%)
Since the p-value is larger than the chosen significance level (5%), I fail to reject the null hypothesis.
There is no statistically significant difference in the mean number of drives between iPhone users and Android users.
Drivers who use iPhone devices on average have a similar number of drives as those who use Androids.
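The test can be reproduced with `scipy.stats.ttest_ind`. The drive counts below are illustrative stand-ins (the report's actual sample means were ~66.2 for Android and ~67.9 for iPhone), and the sketch uses Welch's variant (`equal_var=False`), which does not assume equal variances.

```python
from scipy import stats

# Illustrative drive counts per device type
android_drives = [60, 65, 70, 66, 64, 68]
iphone_drives = [61, 66, 69, 67, 65, 70]

# Welch's two-sample t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(android_drives, iphone_drives,
                                  equal_var=False)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(round(p_value, 3), decision)
```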
Regression Model
Applied user data to build and analyze a binomial logistic regression model to predict user churn. The efficacy of the model is determined by its accuracy, precision, and recall scores. Recall is especially important here because it measures the proportion of actually churned users that the model identifies.
The model has mediocre precision: 53% of its positive predictions are correct.
The model has very low recall, with only 9% of churned users identified. This means the model makes a lot of false negative predictions and fails to capture users who will churn.
Activity_days was by far the most important feature in the model. It had a negative correlation with user churn.
In the model, km_per_driving_day was the second-least-important variable. It had a positive correlation with user churn.
The model is not a strong enough predictor, as made clear by its poor recall score. However, if the model is only being used to guide further exploratory efforts, then it can have value.
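A minimal sketch of fitting and scoring such a binomial logistic regression follows. The data are synthetic: the churn log-odds are constructed to fall with `activity_days` and rise with `km_per_driving_day`, mirroring the report's findings, not the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for two of the report's predictors
activity_days = rng.integers(0, 31, n)
km_per_driving_day = rng.gamma(2.0, 150.0, n)

# Assumed relationship (mirrors the report's findings): churn log-odds
# fall with activity_days and rise with km_per_driving_day
logit = 0.5 - 0.15 * activity_days + 0.002 * km_per_driving_day
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([activity_days, km_per_driving_day])
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy", round(accuracy_score(y_te, pred), 2),
      "precision", round(precision_score(y_te, pred, zero_division=0), 2),
      "recall", round(recall_score(y_te, pred, zero_division=0), 2))
```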
Correlation heatmap indicates that most variables have low correlations with each other
A Pearson correlation threshold of 0.7 was used: variable pairs with an absolute correlation greater than 0.7 are considered strongly multicollinear, so only one variable from each such pair should be used in the model.
sessions and drives are multicollinear with each other, with a correlation of 1.0.
driving_days and activity_days are multicollinear with each other, with a correlation of 0.95.
Dropped sessions and driving_days rather than drives and activity_days, because the features that were kept had slightly stronger correlations with the target variable than the features that were dropped.
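The multicollinearity screen can be sketched as below. The toy frame fabricates a perfectly correlated sessions/drives pair and a ~0.95+ correlated activity_days/driving_days pair to mirror the report.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy frame fabricating the report's correlated pairs
sessions = rng.poisson(60, 300)
activity_days = rng.integers(0, 31, 300)
df = pd.DataFrame({
    "sessions": sessions,
    "drives": sessions,  # identical -> correlation 1.0
    "activity_days": activity_days,
    "driving_days": np.round(0.97 * activity_days + rng.normal(0, 1, 300)),
})

corr = df.corr()

# Variable pairs above the 0.7 absolute-correlation threshold; one
# variable from each pair is dropped before modeling
threshold = 0.7
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > threshold]
print(high)
```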
The logistic regression assumptions are met, so the model results can be appropriately interpreted.
To meet the assumptions, the log-odds (logit) of the dependent variable should be linear with respect to each predictor variable.
The log-odds (logit) of the dependent variable is linear with respect to the predictor variables activity_days and km_per_driving_day.
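One common way to check this assumption is to bin the predictor, compute the empirical churn proportion per bin, and plot the resulting log-odds against the predictor. The sketch below uses synthetic data in which the log-odds are linear in `activity_days` by construction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 5000

# Synthetic predictor and a churn label whose log-odds are, by
# construction, linear in activity_days
activity_days = rng.integers(0, 31, n)
p_churn = 1 / (1 + np.exp(-(1.0 - 0.15 * activity_days)))
churned = (rng.random(n) < p_churn).astype(int)

df = pd.DataFrame({"activity_days": activity_days, "churned": churned})

# Empirical churn proportion per predictor value; clip avoids log(0)
grouped = df.groupby("activity_days")["churned"].mean().clip(0.01, 0.99)

# Plotting these log-odds against activity_days (e.g. a scatter plot
# with a fitted line) should show an approximately straight line
log_odds = np.log(grouped / (1 - grouped))
print(log_odds.iloc[0], log_odds.iloc[-1])
```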
Very low recall and mediocre precision
The confusion matrix shows a relatively high number of false negatives, 578. False negatives are churned users that the model labels as retained.
The model has very low recall, with only 9% of churned users identified; it makes many false negative predictions and fails to capture users who will churn.
Calculations based on the confusion matrix also show that the model has mediocre precision: 53% of its positive predictions are correct.
Accuracy, the proportion of data points that are correctly classified, is 82%.
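These three scores can be recomputed directly from confusion-matrix counts. The 578 false negatives come from the report; the remaining counts below are hypothetical stand-ins chosen to roughly reproduce the reported 9% recall, 53% precision, and 82% accuracy.

```python
# Hypothetical confusion-matrix counts (only fn=578 is from the report)
tp, fn = 57, 578
fp, tn = 50, 2900

recall = tp / (tp + fn)              # share of churned users caught
precision = tp / (tp + fp)           # share of churn predictions correct
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(round(recall, 2), round(precision, 2), round(accuracy, 2))
```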
Activity_days most important feature in the model
Bar graph of the model's coefficients shows that activity_days is the most important variable for the model.
Machine Learning Model
Developed two different models, random forest and XGBoost, to cross-compare results and obtain a model with the highest predictive power. Recall was used as the primary evaluation metric, but accuracy, precision, and f1 were also used to determine the efficacy of the model. The data was split into training, validation, and test sets. Splitting the data three ways leaves less data available to train the model; however, it gives a better estimate of future performance than a two-way split.
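A three-way split can be done with two successive calls to scikit-learn's `train_test_split`. The 60/20/20 proportions below are an illustrative choice, not necessarily the ones used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and imbalanced labels standing in for the user data
X = np.arange(1000).reshape(-1, 1)
y = (np.arange(1000) % 5 == 0).astype(int)

# Carve out the test set first, then split the remainder into train and
# validation (60/20/20 overall); stratify preserves the class balance
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # → 600 200 200
```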
The XGBoost model had the highest predictive power.
The XGBoost model fit the data better than the random forest model: its recall score was almost 50% better than the random forest model's while maintaining similar accuracy and precision scores.
The recall score for the XGBoost model was 17%, nearly double the logistic regression model's 9%.
The XGBoost model only identified 16.6% of the users who churned.
The model is not a strong enough predictor, as made clear by its poor recall score. The model can be used to guide further exploratory efforts.
XGBoost was the model with the highest predictive power
The XGBoost model fit the data better than the random forest model. Its recall score was almost 50% better than the random forest model's while maintaining similar accuracy and precision scores.
Test scores decreased only slightly from the training scores across all metrics, which indicates that the model did not overfit the training data.
The XGBoost model's validation scores were lower, but only very slightly.
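The comparison workflow can be sketched as below. Note that scikit-learn's `GradientBoostingClassifier` is used as an installable stand-in for XGBoost, and the data are synthetic, with a class imbalance mimicking the ~18% churn rate.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data mimicking the ~18% churn rate
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.82], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Recall on the positive (churned) class is the primary metric
recalls = {name: recall_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
           for name, model in models.items()}
print(recalls)
```

In practice each candidate model's hyperparameters would also be tuned (e.g. with cross-validated search scored on recall) before comparing them on the validation set.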
XGBoost has low recall score
The model predicted three times as many false negatives (423) as it did false positives (132). This means that the model labeled 423 churned users as retained.
It correctly identified only 16.6% of the users who actually churned, 84 churned users.
Most important feature is 'km_per_hour' in XGBoost model
The XGBoost model made use of more of the features than the logistic regression model did.
Engineered features accounted for six of the top 10 features (and three of the top five): km_per_hour, percent_sessions_in_last_month, total_sessions_per_day, percent_of_drives_to_favorite, km_per_drive, km_per_driving_day.
Feature engineering is often one of the best and easiest ways to boost model performance.
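The engineered ratio features can be derived from the raw columns as below; the two rows are illustrative values only.

```python
import pandas as pd

# Two illustrative users; column names follow the report's raw variables
df = pd.DataFrame({
    "driven_km_drives":        [3495.0, 1200.0],
    "duration_minutes_drives": [1478.0, 600.0],
    "drives":                  [48, 10],
    "driving_days":            [12, 4],
})

# Engineered ratio features of the kind used by the XGBoost model
df["km_per_hour"] = df["driven_km_drives"] / (df["duration_minutes_drives"] / 60)
df["km_per_drive"] = df["driven_km_drives"] / df["drives"]
df["km_per_driving_day"] = df["driven_km_drives"] / df["driving_days"]
print(df[["km_per_hour", "km_per_drive", "km_per_driving_day"]].round(1))
```

In the real data, users with zero drives or zero driving days would produce infinite ratios that need to be handled (for example, replaced with 0) before modeling.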
Conclusion
1
In the last month, the median churned user drove ~200 more kilometers and 2.5 more hours than the median retained user. Churned users had more drives in fewer days, and their trips were farther and longer in duration.
2
The more times users used the app, the less likely they were to churn. While 40% of the users who didn't use the app at all last month churned, nobody who used the app on all 30 days churned.
3
The churn rate is highest in the first year of using the app, approximately 25%, and then it slightly decreases. After 4 years fewer than 20% churn, and it keeps decreasing to approximately 15% after 6 years.
4
The XGBoost model had the highest predictive power with a recall score 50% better than the random forest model's recall score. Still, it only correctly identified 16.6% of the users who actually churned. Current data is insufficient to consistently predict churn.
Recommended Next Steps
Current data is insufficient to consistently predict churn. It partly answers the question of "Who are the users that are most likely to churn?" based on app usage from the last month, but it fails to answer the question "Why do users churn?". It would be helpful to have drive-level information for each user (such as drive times, geographic locations, satisfaction, etc.). It would probably also be helpful to have more granular data to know how users interact with the app. For example, how often they report or confirm road hazard alerts, give feedback on the proposed route, etc. Also, an exit survey with satisfaction questions and why a user chooses to stop using the app could give useful insights.
Regarding the question "When do users churn?", data indicates that users are more likely to churn during the first few years of using the app. It would be helpful to investigate this further by collecting user interaction data, perhaps some features are missing or features are not functioning as intended. By collecting feedback during the first year of usage some of these questions could be answered.
Once more data is collected, the recommendation is to run a second iteration of data analysis and modeling.
Further analysis
Further analysis is necessary to investigate the following areas:
Collect geo data, for example, the monthly count of unique starting and ending locations each driver inputs.
Investigate if data for typical drivers at large is missing since the users represented in this data drive a lot.
Investigate outliers for the following variables: driven_km_drives, activity_days, and driving_days. These variables had highly improbable or perhaps even impossible outliers.
There were ~250 users who were not active at all during the last month and ~1,000 users who did not drive at all. This needs further investigation: why are users opening the app if they don't drive?
Assess the demographic and geographic factors that may contribute to churn. This could allow Waze to understand if there are specific geographical areas where the app is not functioning according to expectations. Or if the app appeals more to people of a certain age and, if improvements can be made to make it more appealing to other age groups.
Consider implementing customer feedback mechanisms, such as feedback after every drive, post-churn surveys, or customer satisfaction ratings. This would help Waze gather insights directly from customers regarding their experience with the app and identify opportunities for improvement.
Appendix
The data cleaning and analysis process and all the Python code is available in the appendix for those interested in viewing the details.