Predicting Churn

Background

Waze has developed a free real-time navigation app that makes it easier for drivers around the world to get to where they want to go. Waze’s community of map editors, beta testers, translators, partners, and users helps make each drive better and safer. Waze partners with cities, transportation authorities, broadcasters, businesses, and first responders to help as many people as possible travel more efficiently and safely.

Waze has launched a project to help prevent user churn as part of a larger effort to increase growth. Churn quantifies the number of users who have uninstalled the Waze app or stopped using it. This project focuses on monthly user churn. Typically, high retention rates indicate satisfied users who repeatedly use the Waze app over time. If Waze can identify a segment of users who are at high risk of churning, it can proactively engage these users with special offers to try to retain them. Otherwise, Waze may simply lose these users without knowing why.

Analyzing and interpreting the data to generate valuable insights will help Waze leadership optimize the company’s retention strategy, enhance user experience, and make informed, data-driven business decisions about product development.

Problem to Solve

Waze needs help analyzing user data and developing a machine-learning model that will predict user churn, improve user retention, and grow Waze’s business. An accurate model can also help identify specific factors that contribute to churn and answer questions such as who is most likely to churn, why users churn, and when users churn.


Preliminary data summary 

Initial key insights

Exploratory Data Analysis 

Key Insights

Sessions last month


Drives last month


Total sessions


Number of days since onboarding


Total kilometers driven last month 

Duration drive last month 

Activity in days last month 

Driving in days last month

Type of Device

Churned vs Retained

Driving days vs Activity days

Retention by device

Retention by kilometers driven per driving day

Retention by number of driving days

Retention by number of days since onboarding

Percentage of sessions that occurred in the last month

Number of days since onboarding for users with 40% or more of their sessions in the last month

Hypothesis Testing 

Conducted a hypothesis test, a two-sample t-test, to answer the following question:


"Do drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices?"  

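A test of this kind can be sketched as follows. This is a minimal illustration, not the project's code: it uses only the Python standard library, assumes Welch's variant of the two-sample t-test (unequal variances), and approximates the p-value with a normal distribution, which is reasonable only for large samples. The function name `welch_t_test`, the synthetic drive counts, and the 5% significance level are all assumptions made for the example.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_t_test(sample_a, sample_b):
    """Return (t statistic, Welch degrees of freedom, approximate two-sided p-value)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group (sample variance / n).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    # Normal approximation to the t distribution (assumes large samples).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, dof, p_value


# Synthetic, made-up drive counts for illustration only; NOT the Waze data.
rng = random.Random(42)
iphone_drives = [rng.gauss(50, 20) for _ in range(5000)]
android_drives = [rng.gauss(50, 20) for _ in range(5000)]

t_stat, dof, p_value = welch_t_test(iphone_drives, android_drives)
alpha = 0.05  # assumed significance level
decision = "reject" if p_value < alpha else "fail to reject"
print(f"t = {t_stat:.3f}, dof = {dof:.1f}, p = {p_value:.3f}: {decision} the null hypothesis")
```

The null hypothesis here is that both device groups have the same mean number of drives. In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` runs the same test with an exact t-distribution p-value.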

Regression Model 

Used the user data to build and analyze a binomial logistic regression model to predict user churn. The model was evaluated on accuracy, precision, and recall. Recall is the most important of these for this project because it measures the share of users who actually churned that the model correctly identifies.
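
To make the three metrics concrete, the sketch below computes them from a small set of labels. The labels are made up for illustration and are not the project's results; the function name `classification_scores` is likewise hypothetical. It shows how a model can look acceptable on accuracy while still missing most churned users.

```python
def classification_scores(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = churned)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churned, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # retained, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churned, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # retained, not flagged
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of the users flagged as churning, how many really churned?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the users who really churned, how many were flagged?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up example: 20 users, 4 of whom churned; the model flags only 2 users.
y_true = [1, 1, 1, 1] + [0] * 16
y_pred = [1, 0, 0, 0] + [1] + [0] * 15

scores = classification_scores(y_true, y_pred)
print(scores)
```

In this toy case accuracy is 0.8, yet recall is only 0.25 because three of the four churned users were missed.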


The correlation heatmap shows many weakly correlated variables

Logistic regression assumptions are met, so the model results can be appropriately interpreted

Very low recall and mediocre precision

Activity_days is the most important feature in the model

Machine Learning Model

Developed two models, random forest and XGBoost, and compared their results to obtain the model with the highest predictive power. Recall was the primary evaluation metric, but accuracy, precision, and F1 score were also used to assess each model. The data was split into training, validation, and test sets. Splitting the data three ways leaves less data available to train the model, but because the final model is scored on a test set that played no part in model selection, it gives a better estimate of future performance than splitting the data two ways.
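
The three-way split can be sketched as below. This is an illustration under stated assumptions, not the project's code: the 60/20/20 proportions, the stratification by churn label, the function name `stratified_three_way_split`, and the labels themselves are all made up for the example, since the summary does not give the actual split.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return (train, validation, test) index lists that preserve class balance."""
    rng = random.Random(seed)
    # Group row indices by class so each class is split separately.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


# Made-up labels for illustration: 1 = churned, 0 = retained.
labels = [1] * 200 + [0] * 800
train_idx, val_idx, test_idx = stratified_three_way_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Candidate models are fit on the training rows and compared on the validation rows, and only the selected model is scored once on the test rows. With scikit-learn, the same effect is usually obtained by calling `train_test_split` twice with the `stratify` argument.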


XGBoost was the model with the highest predictive power

XGBoost has a low recall score

Most important feature is 'km_per_hour' in XGBoost model

Conclusion

1

In the last month, the median churned user drove 200 kilometers farther and 2.5 hours longer than the median retained user. Churned users had more drives in fewer days, and their trips were farther and longer in duration.

2

The more often users used the app, the less likely they were to churn. While 40% of the users who didn't use the app at all last month churned, nobody who used the app on 30 days churned.

3

The churn rate is highest in the first year of using the app, at approximately 25%, and then decreases slightly. After 4 years it is below 20%, and it keeps decreasing to approximately 15% after 6 years.


4

The XGBoost model had the highest predictive power with a recall score 50% better than the random forest model's recall score. Still, it only correctly identified 16.6% of the users who actually churned. Current data is insufficient to consistently predict churn.
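
The recall figures in the last point can be unpacked with a short calculation. The 16.6% recall comes from the text above; reading "50% better" as a relative improvement is an assumption, because the summary does not say whether the gain is relative or absolute, so the implied random forest value is indicative only.

```python
xgb_recall = 0.166    # share of churned users XGBoost identified (from the text)
relative_gain = 0.50  # ASSUMPTION: "50% better" is a relative improvement

# Share of churned users the XGBoost model failed to flag.
missed_share = 1 - xgb_recall

# Random forest recall implied by the relative-gain assumption.
implied_rf_recall = xgb_recall / (1 + relative_gain)

print(f"Churned users missed by XGBoost: {missed_share:.1%}")
print(f"Implied random forest recall: {implied_rf_recall:.1%}")
```

Under that reading, the better model still misses more than four out of five churned users, which is the basis for calling the current data insufficient.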

Recommended Next Steps

Current data is insufficient to consistently predict churn. It partly answers the question of "Who are the users that are most likely to churn?" based on app usage from the last month, but it fails to answer the question "Why do users churn?". It would be helpful to have drive-level information for each user (such as drive times, geographic locations, and satisfaction). It would probably also be helpful to have more granular data on how users interact with the app, for example how often they report or confirm road hazard alerts or give feedback on the proposed route. An exit survey asking about satisfaction and why a user chose to stop using the app could also give useful insights.

Regarding the question "When do users churn?", the data indicates that users are more likely to churn during the first few years of using the app. It would be helpful to investigate this further by collecting user interaction data; perhaps some features are missing or not functioning as intended. Collecting feedback during the first year of usage could answer some of these questions.

Once more data is collected, the recommendation is to run a second iteration of data analysis and modeling.

Further analysis

Further analysis is necessary to investigate the following areas:

Appendix

The data cleaning and analysis process and all the Python code are available in the appendix for those interested in viewing the details.