Data Mining and Data Analytics for Analysing Customer Churn Rate

One of the most critical factors in customer relationship management that directly impacts a company’s long-term profitability is customer attrition. When a company can better predict whether a customer is likely to cut ties, it can take a more targeted approach to mitigating customer turnover. A telecommunications company is concerned about the number of customers leaving its landline business for cable competitors. The company needs to know which customers are leaving so that it can attempt to mitigate continued customer loss. This paper describes methods for analyzing customer data to identify which customers are leaving and the potential indicators that explain why, so that the company can make an informed plan to mitigate further loss.


I. INTRODUCTION
One of the most critical factors in customer relationship management that directly impacts a company's long-term profitability is customer attrition. When a company can better predict if a customer is likely to cut ties, it can take a more targeted approach to mitigate customer turnover.
A telecommunications company is concerned about the number of customers leaving its landline business for cable competitors. The company needs to know which customers are leaving so that it can attempt to mitigate continued customer loss. This paper describes methods for analyzing customer data to identify which customers are leaving and the potential indicators that explain why, so that the company can make an informed plan to mitigate further loss.

A. Tool Selection
First, we need to choose the most appropriate tools to perform the analysis on the data set. For this paper we choose the R programming language to analyze customer churn, because R provides many built-in packages and features that make statistical analysis of large data sets simple. R has an integrated development environment, RStudio, and is accessible from a number of scripting languages, which makes it easier to view and analyze the code. Almost every tool a data scientist might need to manipulate and evaluate structured data is included with R. What isn't in the base package has often been built and shared by other programmers and is freely available to download. R has therefore been chosen for the analysis of the customer data.

B. Descriptive Method
Multiple correspondence analysis (MCA) will be used to identify the variables that contribute most to the customer churn rate. MCA is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. The variables will also be checked for collinearity, and collinear data will be eliminated from the analysis. Continuous variables, if there are any, will be binned to create categorical variables.
Published on July 30, 2020. Sharan Kumar Paratala Rajagopal, Capgemini America Inc., USA (corresponding e-mail: prsharankumar@gmail.com).

C. Prediction Method
Logistic regression will be used because the target (dependent) variable is categorical with two possible values, a case for which logistic regression is well suited.

III. DATA EXPLORATION AND PREPARATION
Before beginning the analysis, the target variable has to be identified in the data, along with the type of data it holds. The target variable in this research is the "Churn" variable. It is categorical and takes the value "Yes" or "No".
Fig. 1 shows the rest of the variables used in the analysis process.

A. Data preparation
• Data preparation is an important step in any data analysis, because the whole data set might not contribute to the result due to factors such as outliers.
• In this analysis, data preparation reduces the number of continuous independent variables that might not contribute much.
• Reduce the number of values of the categorical variables and convert the variables into factors, which helps in the analysis.
• MonthlyCharges and TotalCharges carry similar information, so both might not be required and one of the columns can be dropped.
• Recode "No Internet Service" as "No" wherever possible so that the variable becomes binary.
• Bin the "Tenure" column for the analysis instead of using its continuous values.
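
The recoding and binning steps listed above can be sketched as follows. This is a minimal Python illustration rather than the R code used for the paper. The bin edges, the output column name "TenureGroup", and the example record are assumptions made for illustration; the text only states that tenure is split into groups and that the service-level "No ..." values are collapsed to "No".

```python
# Illustrative Python sketch; the paper's own analysis is done in R.

# Values that the text collapses to a plain "No" (compared case-insensitively).
RECODE_TO_NO = {"no internet service", "no phone service"}

# ASSUMED bin edges as (upper bound in months, label). The exact edges are
# not given in the text and are chosen here only for illustration.
TENURE_BINS = [
    (12, "0-12 months"),
    (24, "12-24 months"),
    (48, "24-48 months"),
    (60, "48-60 months"),
    (72, "60-72 months"),
]


def recode_service_value(value):
    """Collapse 'No Internet Service' / 'No Phone Service' to 'No'."""
    if value.strip().lower() in RECODE_TO_NO:
        return "No"
    return value


def tenure_group(months):
    """Return the label of the first bin whose upper bound covers `months`."""
    if months < 0:
        raise ValueError("tenure must be non-negative")
    for upper, label in TENURE_BINS:
        if months <= upper:
            return label
    raise ValueError("tenure exceeds the largest assumed bin (72 months)")


def prepare_row(row):
    """Return a cleaned copy of one customer record (dict of column -> value)."""
    cleaned = {}
    for column, value in row.items():
        if column == "Tenure":
            # Replace the continuous column with its (hypothetical) group column.
            cleaned["TenureGroup"] = tenure_group(int(value))
        elif isinstance(value, str):
            cleaned[column] = recode_service_value(value)
        else:
            cleaned[column] = value
    return cleaned


# Hypothetical record, used only to show the transformation.
example = {
    "Tenure": 7,
    "OnlineSecurity": "No Internet Service",
    "Contract": "Month-to-month",
}
print(prepare_row(example))
```

Keeping the bin edges in one table makes it easy to try a different grouping without touching the rest of the preparation code.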

B. Statistical Identity
Fig. 3 lists the variables before and after data cleaning, divided into dependent and independent variables.

C. Raw data
The raw data has been cleaned, and the result is shown in Fig. 4. The possible values identified for each variable are described to give a better understanding of the data we are looking at. Fig. 6 shows the data after the missing values are removed. The raw data set chosen has 21 columns and 7032 rows of data. Other cleanup activities performed on this data set include:
• "No Internet service" is converted to "No".
• "No Phone Service" is converted to "No".
• "Tenure" is grouped into bins based on the number of months. The minimum value is 1 and the maximum value is 72, so the values are grouped into 5 tenure groups.
• The columns "CustomerID", "Gender" and "TotalCharges" are removed because they are not needed for the analysis. These columns do not predict the churn rate and are of least importance in the analysis. Whether a customer stays with or leaves the telecom company does not depend on gender.
• Fig. 7 shows that "Monthly Charges" and "Total Charges" are correlated, so one of them, "Total Charges", is removed from the model.
Fig. 8 shows bar plots for the categorical variables. Fig. 10 shows the result of the MCA: the month-to-month contract category lies near the target variable "Churn", which indicates that customers are more likely to churn when they are on a month-to-month contract. The overall customer churn rate is 26.6%.
Fig. 10. Customer churn rate analysis
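
The correlation check and the overall churn rate described above can be sketched as follows. This is a minimal Python illustration, not the R code behind Fig. 7 or Fig. 10. It assumes a Pearson correlation, which the text does not name explicitly, and the numeric values are hypothetical and are not taken from the paper's data set.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def churn_rate(labels):
    """Share of customers whose 'Churn' label is 'Yes'."""
    return sum(1 for label in labels if label == "Yes") / len(labels)


# Hypothetical values, NOT taken from the paper's data set.
monthly_charges = [20.0, 55.0, 70.0, 90.0, 105.0]
total_charges = [240.0, 1300.0, 2500.0, 4300.0, 7400.0]

r = pearson_correlation(monthly_charges, total_charges)
print(f"correlation between the two charge columns: {r:.2f}")
print(f"churn rate: {churn_rate(['Yes', 'No', 'No', 'No']):.1%}")
```

A coefficient close to 1 or -1 suggests that the two columns carry largely the same information, which is the reasoning given for dropping one of them.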

D. Logistic Regression analysis
Logistic regression is used for binary classification. Since the "Churn" variable is categorical and contains only the values "Yes" and "No", logistic regression is appropriate.
Fig. 11 provides the logistic regression model values. Logistic regression works well when a single decision is to be made, whereas a decision tree can be used when more than one decision is to be made. A decision tree can be scaled up to be more complex, but it then becomes more liable to overfit.
Logistic regression is useful when the variable to be predicted is categorical. It is simple, has low variance, and is less prone to overfitting.
Based on Fig. 16, "Contract" is the most important variable for predicting whether a customer churns. A customer on a one-year or two-year contract, even without paperless billing, is less likely to churn. A customer who is on a month-to-month contract, is in the 0-12 month tenure group, and uses paperless billing is more likely to churn.
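
How a logistic regression links indicator variables such as contract type and paperless billing to the probability of churn can be sketched as follows. This is a minimal Python illustration fitted by plain gradient descent; it is not the R model behind Fig. 11 and Fig. 16. The ten records, the two-feature encoding, and the function names are hypothetical and serve only to show the mechanics.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(rows, labels, learning_rate=0.5, iterations=5000):
    """Fit an intercept and weights by batch gradient descent on the mean log-loss."""
    n_features = len(rows[0])
    intercept = 0.0
    weights = [0.0] * n_features
    for _ in range(iterations):
        grad_intercept = 0.0
        grad_weights = [0.0] * n_features
        for x, y in zip(rows, labels):
            score = intercept + sum(w * xi for w, xi in zip(weights, x))
            error = sigmoid(score) - y  # predicted probability minus label
            grad_intercept += error
            for j, xi in enumerate(x):
                grad_weights[j] += error * xi
        intercept -= learning_rate * grad_intercept / len(rows)
        weights = [
            w - learning_rate * g / len(rows)
            for w, g in zip(weights, grad_weights)
        ]
    return intercept, weights


def predict_probability(intercept, weights, x):
    """Predicted probability of churn for one encoded customer."""
    return sigmoid(intercept + sum(w * xi for w, xi in zip(weights, x)))


# HYPOTHETICAL records, not the paper's data.
# Features: (month_to_month_contract, paperless_billing); label 1 means churn.
rows = [(1, 1), (1, 1), (1, 1), (1, 0), (1, 0),
        (0, 1), (0, 1), (0, 1), (0, 0), (0, 0)]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

intercept, weights = fit_logistic_regression(rows, labels)
print("weight for month-to-month contract:", round(weights[0], 2))
print("P(churn | month-to-month, paperless):",
      round(predict_probability(intercept, weights, (1, 1)), 2))
print("P(churn | longer contract, no paperless):",
      round(predict_probability(intercept, weights, (0, 0)), 2))
```

A positive weight means the corresponding indicator raises the predicted odds of churn, and the exponential of a weight can be read as an odds ratio. In practice a statistical package would be used for the fit instead of a hand-written optimizer.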