Another quick check-in during my data science bootcamp journey…

LeighannaJoHooper
5 min read · Jul 25, 2020

…And things I learned during my third project.

I just finished my third project in my online immersive data science bootcamp. This particular project was on supervised machine learning. We were given four datasets to choose from, each with its own unique problem to ‘solve.’

The first was a multi-class classification dataset of Chicago car crash data. The goal was to build a classifier that could predict patterns in the contributory causes of car accidents using data about the people in the car, the car itself, road conditions, and the information provided in the police reports. (Learn more about this dataset here.)

The second consisted of data from ‘Terry Stops,’ named after Terry v. Ohio, the Supreme Court case that dealt with search and seizure during police stops of civilians and the Fourth Amendment. (Click on the link for more information regarding this case.) ‘Terry Stop’ is another name for ‘stop and frisk.’ The goal here was to use binary classification to predict whether an arrest was actually made after one of these stops.

The third was the Tanzanian Water Well Data (click for more information). The goal here was to use ternary classification to predict water well conditions from the water-point data provided.

The fourth and last dataset, and the one I ended up choosing, involved customer churn. This was a binary classification problem using data from SyriaTel (click for access to the dataset.)

The churn rate, also known as the rate of attrition or customer churn, is the rate at which customers stop doing business with an entity. It is most commonly expressed as the percentage of service subscribers who discontinue their subscriptions within a given time period.

This definition was provided by the following site: https://www.investopedia.com/terms/c/churnrate.asp.
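To make the definition concrete, here is a minimal sketch of the arithmetic. The function name and the customer counts are hypothetical and chosen only for illustration; they are not taken from the SyriaTel data.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return churn as a percentage of the customers present at the start of the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_lost / customers_at_start


# Hypothetical numbers, purely for illustration (not from the SyriaTel dataset).
rate = churn_rate(customers_at_start=2000, customers_lost=290)
print(f"Churn rate for the period: {rate:.1f}%")
```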

The ability to predict that a particular customer is at a high risk of churning, while there is still time to do something about it, represents a huge additional potential revenue source for every business, especially online businesses, like Syria Telecom.

Using the CRISP-DM methodology, I tackled the dataset. Starting with understanding the business and exploring the data, I came up with some first-glance questions:

  • Does location of the customer have an impact?
  • What is the account length of contracts that are more prone to churn?
  • Does having an international plan on the customer’s account affect the churn rate?
  • Does the number of customer service calls made by the customer affect churn?
  • Would adding a voicemail plan to the customer’s account potentially affect churn rate?

I ended up choosing a K-Nearest Neighbors model as my predictor, evaluated with precision and recall metrics.
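For readers who want to see what that looks like in practice, here is a minimal sketch using scikit-learn. It is not the code from my project: the data is synthetic (generated with `make_classification`), and the feature scaling step, `n_neighbors=5`, and the 75/25 split are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for a churn dataset (about 15% positives).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.85, 0.15], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# KNN is distance-based, so features are scaled first (an assumption, not a given).
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision: of the customers flagged as churners, how many really churned?
# Recall: of the customers who really churned, how many were flagged?
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

With an imbalanced target like churn, plain accuracy can look good even when the model misses most churners, which is one reason to look at precision and recall instead.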

From my findings, I made the assumption that Syria Telecom is an internet-based company and that their main customers are office-based.

I concluded that customers that ended their contracts used on average about 28 more minutes a day than customers that remained loyal.

Customers that made 3 customer service calls were more likely to end their contract than others.

Customers that had an international plan were less likely to end their contract.
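Comparisons like these usually come down to a few pandas group-bys. The sketch below shows the general idea on a tiny made-up table; the column names and every value in it are hypothetical, so the numbers it prints say nothing about the real SyriaTel results.

```python
import pandas as pd

# Tiny made-up table; column names and values are hypothetical.
df = pd.DataFrame({
    "total_day_minutes": [180.0, 210.5, 150.2, 240.8, 199.9, 225.3],
    "customer_service_calls": [1, 4, 0, 5, 2, 3],
    "international_plan": ["no", "yes", "no", "no", "yes", "no"],
    "churned": [0, 1, 0, 1, 0, 1],  # 1 = customer ended the contract
})

# Average daily minutes for customers who stayed (0) versus churned (1).
avg_minutes = df.groupby("churned")["total_day_minutes"].mean()
minutes_gap = avg_minutes.loc[1] - avg_minutes.loc[0]

# Share of churners within each group (the mean of a 0/1 column is a rate).
churn_by_plan = df.groupby("international_plan")["churned"].mean()
churn_by_calls = df.groupby("customer_service_calls")["churned"].mean()

print(f"Churners used {minutes_gap:.1f} more day minutes on average")
print(churn_by_plan)
print(churn_by_calls)
```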

The coolest things that I found and implemented during this project were two plots, both from plotly.

One is the Parallel Categories Diagram.

‘Each variable in the data set is represented by a column of rectangles, where each rectangle corresponds to a discrete value taken on by that variable. The relative heights of the rectangles reflect the relative frequency of occurrence of the corresponding value. Combinations of category rectangles across dimensions are connected by ribbons, where the height of the ribbon corresponds to the relative frequency of occurrence of the combination of categories in the data set.’ (This explanation is provided by plotly.com.)

I used this diagram to show the link between a customer having an international plan and/or a voice mail plan and whether they are more likely to churn or not.

Another very cool plotly.com graph that I used was the Choropleth Map (click for full details.)

The code I used produced the following map.

But the truly fun part is that when you hover over each state, there is an interactive feature that you can use to provide extra information.

Pretty amazing, right!?

So, long story short, I had a lot of fun with this project.

There is still a lot that I need to learn and practice. But all in all, I feel like this has been a great experience, and I am yards farther along than I was when I wrote my first blog.

Thanks for following along on my journey!
