Welcome back to another Pandas Series with us. Today we will discuss the customer churn dataset and learn ways to analyze and visualize it.

Before we begin with this module, let's first understand what churn analysis means.

  • Churn Analysis : Companies use churn analytics to measure the rate at which customers quit the product, site, or service. It answers the questions “Are we losing customers?” and “If so, how?” to allow teams to take action. Lower churn rates lead to happier customers, larger margins, and higher profits. To prevent churn, teams must first measure it with analytics.

Now that we have understood what churn analysis is, we will learn various techniques to analyze any given churn dataset. For this purpose, we are using the Telecommunication Customer Churn dataset. Our main focus for this module will be solely on analyzing all the relevant customer data.

We want to do our analysis of customers who left within the last month and their relation with:-

  • Senior citizens
  • Total tenure period
  • Monthly charge and total charge
  • Different contracts
  • Different services

The link to access the dataset: https://www.kaggle.com/blastchar/telco-customer-churn

Learning Objectives

After this module, we should be able to do the following:

  1. Use pandas library to analyze different features of the dataset, which includes
    • Read the dataset
    • Extract relevant features for our analysis
    • Use functions like df.query
  2. Use plotly library to visualize the given results.
    • Plot graphs like bar graphs, scatter plots, pie charts, and violin plots.

Understanding the dataset

Each row in the dataset represents a customer, whereas, each column contains different attributes.

The data set includes information about:

  • Customers who left within the last month – This is the column under the name Churn.
  • Services each customer signed up for – Columns under the name phone service, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies.
  • Customer account information – How long they have been a customer, contract, payment method, paperless billing, monthly charges, and total charges.
  • Demographic information about customers – Gender, Senior citizen or not, and if they have partners and dependents.

However, for the purpose of learning, we will focus on a few of these columns only.

Pre-processing the data

To get started, we need to import some useful libraries that will help us import the dataset into our python environment, manipulate and analyze the same and later help us to visualize it.

#Importing libraries
import pandas as pd
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
print("The modules are imported")

#Importing dataset
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

Using df.shape function, we see that our dataset has 7043 rows and 21 columns. We have already discussed what these columns mean.  Also, we observe that there are no null values present in our dataset using the code df.isnull().sum().

Let's first understand our target variable Churn by observing the proportion of customers with churn status Yes and No. We can use the code df['Churn'].value_counts() to see the number of values in each category, we get that customers with churn status No are 5174, and that of Yes are 1869.

Since our goal for this module is not just to see the numbers but also to visualize the data, we will plot our results using the code below.

import plotly.graph_objects as go

churn_bar1 = go.Figure()
churn_bar1 .add_trace(go.Bar(
    name='Churn Yes',
    y=df['Churn'].value_counts(),x=df['Churn'].unique(),marker_color='paleturquoise'))

churn_bar1.update_layout(barmode='group',title='Churn status')
churn_bar1 .show()

Making use of the Graph Objects function of plotly, we have written a code for a simple vertical bar graph. The graph depicts the frequency count for the two categories (Yes, No). The x-axis has the categories – Yes and No, whereas, the y-axis has the counts for each.

This is how the graph looks like.

We see that the proportion for the Yes category is very less as compared to No.

Now let's jump straight to our analysis and see how each attribute is related to our target variable Churn.

Senior Citizen

We are starting with the simplest column, SeniorCitizen. There are only two distinct values in this column, i.e., Yes and No, corresponding to whether the customer who left the company was a Senior Citizen or now.

For this bit, we can separate our dataset into two different categories: Churn Status YES and Churn Status NO using the code df.query('Churn == "Yes"') and df.query('Churn == "No"').

If you have been following our pandas series, you know what the function df.query() does. If you are new here, let's quickly discuss how useful this function is.

  • query() - This is a filtering function that enables to select and filter the columns of a dataFrame with a boolean expression.

We are using the above method for this feature and making two separate dataframes namely, churn_yes and churn_no to analyze the proportion of senior citizens.

We will use the following lines of code:

#Churn yes dataset
churn_yes=pd.DataFrame(df.query('Churn == "Yes"'))

#Churn no dataset
churn_no=pd.DataFrame(df.query('Churn == "No"'))

We have successfully separated the two churn categories. Now, let's visualize the proportion of senior citizens in each of these dataframes. We will use the value_counts() function of pandas to analyze the values and plot them using a pie chart.

# To see the values in each category
churn_yes['SeniorCitizen'].value_counts() #print this result separately

#Plotting the values
colors = ['cadetblue','powderblue']

#Using the values for yes and no that we obtained from the above code
churn_pie1 = go.Figure(data=[go.Pie(labels = ['Yes','No'],
                             values =['476','1393'],pull=[0.3, 0, 0, 0])])
churn_pie1.update_traces(hoverinfo='label+percent', textinfo='value+percent', textfont_size=20, marker=dict(colors=colors))
churn_pie1.show()

The above gives the proportion of senior citizens for the customers that left the company. We see that only 25.5% (476) customers were senior citizens and the rest 74.5% (1393) were not.

We can use the same code to analyze the number of senior citizens for customers that did not leave the company. The following pie chart is how the result should look like.

The senior citizens are only 12.9% (666).

We see that even though the percentage of senior citizens for churn_no is only 12.9%, they are still more than that of churn_yes (666 > 476). This is only because of the fact that the total number of customers for churn_no is much higher than that of churn_yes. The percentage of proportions can hence be  very misleading and still does not make it easier for us to compare the two categories.

To solve this problem, we will plot a group bar plot and compare the results.

import plotly.graph_objects as go

churn_bar2 = go.Figure()
churn_bar2.add_trace(go.Bar(
    name='Senior Citizen - Yes',
    x=['Churn_Yes','Churn_No'], y=[476,1393],marker_color='turquoise'))
    
churn_bar2.add_trace(go.Bar(
    name='Senior Citizen - No',
    x=['Churn_Yes','Churn_No'], y=[666,4508],marker_color='gold'))

churn_bar2.update_layout(barmode='group',title='Senior Citizens corresponding to the churn status')
churn_bar2.update_xaxes(title='Churn status')
churn_bar2.update_yaxes(title='Counts')
churn_bar2.show()

The above plot has made it very easy to compare the senior citizens for the two churn categories.

Let's move on to the next attribute of our dataset.

Total Tenure Period

We use the column tenure for our analysis. We will only see the tenure period for customers who left the company, i.e., churn_yes.

#Dataframe
tenure_df=pd.DataFrame(churn_yes['tenure'].value_counts().reset_index())
tenure_df.head()  #print

#Rename columns
tenure_df.rename(columns = {'index':'count'}, inplace = True)
tenure_df.columns  #print

#Plotting
data = tenure_df #dataset
churn_scatter1 = px.scatter(data, x="tenure",y='count')

churn_scatter1.update_traces(mode='markers', marker_line_width=2, marker_size=7,opacity=0.7)
churn_scatter1.update_layout(title='Tenure period')
churn_scatter1.update_xaxes(title='Days')
churn_scatter1.update_yaxes(title='Count')

churn_scatter1.show()

We observe that most of my customers are between 0-100 days (we have assumed days to be the unit for the tenure column). This implies that most of the churn customers leave within the first 3 months. The highest tenure period is for 6 days with a count of 72. We also have an outlier on the 380th day.

We can also visualize the distribution of tenure for churn_yes and churn_no to observe the average tenure period of customers that did not leave the company.

For this purpose, we plot distribution plots.

import plotly.figure_factory as ff #importing a new function from plotly
x1=df[df["Churn"]=='No']["tenure"]
x2=df[df["Churn"]=='Yes']["tenure"]
hist_data = [x1,x2]

group_labels = ['Churn no','Churn yes']
colors = ['#DC3912', '#FFA15A']

# Create distplot with curve_type set to 'normal'
distplot1 = ff.create_distplot(hist_data, group_labels, show_hist=False, colors=colors)

# Add title
distplot1.update_layout(title_text='Tenure')
distplot1.update_xaxes(title_text='Days')
distplot1.update_yaxes(title_text='Density')
distplot1.show()

The above distribution plot helps us compare different distributions. The distribution for churn_yes is right-skewed, whereas, the distribution for churn_no is mostly flat.

Let's analyze the results for the next attributes.

Monthly Charge and Total Charge

We will analyze the monthly charge and total charge for customers who left the company. It's important to try out different visualization techniques for different variables, therefore, this time we are trying our hand on violin plot. Let's first understand what a violin plot is.

  • Violin plots - Violin plots are a method of plotting numeric data and can be considered a combination of the box plot with a kernel density plot. In the violin plot, we can find the same information as in the box plots, i.e., mediun, interquartile range and the lower/upper adjacent values (min, max).

First, we will extract the values in the columns monthly charge and total_charge.

#dataframe for monthly charge
month_yes=pd.DataFrame(churn_yes['MonthlyCharges'].value_counts().reset_index())
month_yes.head() #print

#renaming columns
month_yes.rename(columns ={'index':'monthlycharge','MonthlyCharges':'month_count'}, inplace = True)

#dataframe for total charge
total_yes=pd.DataFrame(churn_yes['TotalCharges'].value_counts().reset_index())
total_yes.head() #print

#rename columns
total_yes.rename(columns = {'index':'total_charge','TotalCharges':'total_count'}, inplace = True)

#concatenate
charge_yes=pd.concat([month_yes,total_yes],axis=1)

Plotting a violin plot.

data = charge_yes
violin1 = px.violin(data, y=charge_yes['monthlycharge'], box=True, # draw box plot inside the violin
                points='all', # can be 'outliers', or False)
violin1.show()

Most of the data is clustered from 80 to 100, which implies that the monthly charge for most of the customers that left was between 80 to 100 units.

Following the same procedure for total charge and visualizing the result.

The data points are scattered uniformly throughout the plot. We can observe the median, interquartile range, maximum and minimum values as well.

Try plotting a scatter plot for these columns and see if we can get better results to interpret.

Moving on to the next attribute.

Different Contracts

We will analyze the different contracts the customers were enrolled in for churn_yes and churn_no.

The procedure for this will remain the same, as we discussed for tenure and seniorcitizens. So let's get started.

#dataframe for churn yes
contract=pd.DataFrame(churn_yes['Contract'].value_counts().reset_index())

#rename columns
contract.rename(columns={'index':'contracts_churn_yes','Contract':'counts_yes'},inplace=True)

#df for churn no
contract_no=pd.DataFrame(churn_no['Contract'].value_counts().reset_index())

#rename columns
contract_no.rename(columns={'index':'contracts_churn_no','Contract':'counts_no'},inplace=True)

#concatenate
contract_status=pd.concat([contract,contract_no],axis=1)
contract_status

The resultant dataframe will show us the number of customers enrolled in each contract. We will get the following table as an output:

contracts_churn_yes counts_yes contracts_churn_no counts_no
0 Month-to-month 1655 Month-to-month 2220
1 One year 166 Two year 1647
2 Two year 48 One year 1307

We see that for churn_yes 1655 customers were enrolled in the 'Month-to-month' contract. We can also see the total counts for different contracts in the above table. Let's plot a horizontal bar graph and visualize our results for the same.

import plotly.graph_objects as go

top_labels = ['Month-to-month','One year','Two year']

colors = ['rgba(38, 24, 74, 0.8)', 'rgba(71, 58, 131, 0.8)',
          'rgba(122, 120, 168, 0.8)']

x_data = [[1655,166,48],
          [2220,1647,1307]]

y_data = ['For churn no','For churn yes']

bar_h = go.Figure()

for i in range(0, len(x_data[0])):
    for xd, yd in zip(x_data, y_data):
        bar_h.add_trace(go.Bar(
            x=[xd[i]], y=[yd],
            orientation='h',
            marker=dict(
                color=colors[i],
                line=dict(color='rgb(248, 248, 249)', width=1)
            )
        ))

bar_h.update_layout(
    xaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=False,
        zeroline=False,
        domain=[0.15, 1]
    ),
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=False,
        zeroline=False,
    ),
    barmode='stack',
    paper_bgcolor='rgb(248, 248, 255)',
    plot_bgcolor='rgb(248, 248, 255)',
    margin=dict(l=120, r=10, t=140, b=80),
    showlegend=False,
)

annotations = []

for yd, xd in zip(y_data, x_data):
    # labeling the y-axis
    annotations.append(dict(xref='paper', yref='y',
                            x=0.14, y=yd,
                            xanchor='right',
                            text=str(yd),
                            font=dict(family='Arial', size=14,
                                      color='rgb(67, 67, 67)'),
                            showarrow=False, align='right'))
    # labeling the first percentage of each bar (x_axis)
    annotations.append(dict(xref='x', yref='y',
                            x=xd[0] / 2, y=yd,
                            text=str(xd[0]),
                            font=dict(family='Arial', size=14,
                                      color='rgb(248, 248, 255)'),
                            showarrow=False))
    # labeling the top
    if yd == y_data[-1]:
        annotations.append(dict(xref='x', yref='paper',
                                x=xd[0] / 2, y=1.1,
                                text=top_labels[0],
                                font=dict(family='Arial', size=14,
                                          color='rgb(67, 67, 67)'),
                                showarrow=False))
    space = xd[0]
    for i in range(1, len(xd)):
            # labeling the rest of percentages for each bar (x_axis)
            annotations.append(dict(xref='x', yref='y',
                                    x=space + (xd[i]/2), y=yd,
                                    text=str(xd[i]),
                                    font=dict(family='Arial', size=14,
                                              color='rgb(248, 248, 255)'),
                                    showarrow=False))
            # labeling the Likert scale in the bar graph
            if yd == y_data[-1]:
                annotations.append(dict(xref='x', yref='paper',
                                        x=space + (xd[i]/2), y=1.1,
                                        text=top_labels[i],
                                        font=dict(family='Arial', size=14,
                                                  color='rgb(67, 67, 67)'),
                                        showarrow=False))
            space += xd[i]

bar_h.update_layout(annotations=annotations,title='Contract Status')
bar_h.show()

The above bar graph shows the same result. 'Month-to-month' contract has the highest count and 'Two-year' contract with the least count. This implies that customers that were enrolled in a contract with a longer time period left the company less in the past one month.

Note: The same procedure can be used to plot Likert Scale data as well.

Different Services

In this section, we will focus on only one service and learn ways to handle similar columns. We will analyze the InternetService column in this section.

We have 3 unique values in this column, namely, 'Fibre Optic', 'DSL', and 'No'.

Let's see how to extract the data from this column for customers that left the company and the ones that stayed ( churn_yes , churn_no ) and compare the results.

#for churn yes
churn_yes['InternetService'].value_counts()  #print

#for churn no
churn_no['InternetService'].value_counts()   #print

#Plot
import plotly.graph_objects as go

bar3= go.Figure()
bar3.add_trace(go.Bar(
    name='Churn Yes',
    x=['Fiber optic','DSL','No'], y=[1297,459,113],marker_color='crimson'))
bar3.add_trace(go.Bar(
    name='Churn No',
    x=['Fiber optic','DSL','No'], y=[1799,1962,1413],marker_color='lightpink'))
bar3.update_layout(barmode='group',title='Internet Service')
bar3.update_xaxes(title='Different Internet Services')
bar3.update_yaxes(title='Counts')
bar3.show()

We observe that customers with internet service 'Fibre Optics' (1297) has the highest count for leaving the company. Whereas, customers with internet service 'DSL' has the highest count for staying (1962) and comparatively lesser count for leaving (459). This gives us an intuition that 'DSL' is performing better than 'Fibre Optics', therefore, we can think about switching more customers to 'DSL'.

Go through the lines of code above and analyze the results for other services that the customers are taking.

With this, we come to the end of our analysis.

Conclusions

We gained some interesting insights into the dataset. They are as follows:-

  1. 25.5% (476) customers were senior citizens that left the company in the last month.
  2. Churn customers leave within the first 3 months. The highest tenure period is for 6 days with a count of 72.
  3. The monthly charge for most of the customers that left was between 80 to 100 units. While the total charge was scattered uniformly for all values.
  4. 'Month-to-month' contract has the highest count (1655) for  customers that left.
  5. Internet service 'Fibre Optics' (1297) has the highest count for leaving the company, we observe that 'DSL' performs better than 'Fibre Optics'.

We have found the answers to all the questions, and we can now share our report with the team, together with the processed data for further analysis.

Hope that now you feel more confident to handle similar datasets.

Waiting for you in our next pandas series.
See you then!