The virtual telephony service CallMeMaybe (call center) is developing a new function that will give supervisors information on the least effective operators. An operator is considered ineffective if they have a large number of missed incoming calls (internal and external) and a long waiting time for incoming calls. Moreover, if an operator is supposed to make outgoing calls, a small number of them is also a sign of ineffectiveness.
The datasets contain data on the use of the virtual telephony service CallMeMaybe. Its clients are organizations that need to distribute large numbers of incoming calls among various operators or make outgoing calls through their operators. Operators can also make internal calls to communicate with one another. These calls go through CallMeMaybe's network.
Link to presentation: https://drive.google.com/file/d/1EXTnZYiKVr9UPSqS_IO8U_-A0boTII16/view?usp=sharing
Link to dashboard: https://public.tableau.com/profile/alina7324#!/vizhome/CallMeMaybe_16106288661390/CallMeMaybedashboard
Task: Determine the thresholds for effective operator performance according to KPIs
Recommendations:
Ineffective operators:
Check missed calls (the abandonment rate) at the company level, not only at the operator level.
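As a minimal sketch (assuming the raw calls dataframe loaded later in this notebook; the variable names are illustrative), the company-level abandonment rate could be computed like this:
#hedged sketch: company-wide abandonment rate for external incoming calls
#assumes `calls` is the raw telecom_dataset_us.csv dataframe loaded below
incoming_external = calls[(calls['direction'] == 'in') & (calls['internal'] != True)]
missed = incoming_external.loc[incoming_external['is_missed_call'] == True, 'calls_count'].sum()
total = incoming_external['calls_count'].sum()
print('Company-level abandonment rate: {:.1%}'.format(missed / total))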
import pandas as pd
import numpy as np
from numpy import median
from scipy import stats as st
import math as mth
import datetime as dt
from datetime import datetime
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.gridspec as gs
import seaborn as sns
import plotly
import plotly.express as px
from plotly import graph_objects as go
import plotly.io as pio
pio.templates.default = "seaborn"
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans #explicit import instead of a wildcard
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
#import the files (try a local path first, then fall back to the /datasets folder)
location = '/some/local/path/'
try:
    calls = pd.read_csv(location + 'telecom_dataset_us.csv')
    clients = pd.read_csv(location + 'telecom_clients_us.csv')
except FileNotFoundError:
    calls = pd.read_csv('/datasets/telecom_dataset_us.csv')
    clients = pd.read_csv('/datasets/telecom_clients_us.csv')
def data_study(df):
    """This function runs methods to study the data and prints out the results.
    The input argument is a dataset.
    """
    print(df.info(memory_usage='deep'))
    print("")
    print("Head:")
    display(df.head())
    print("")
    print("Tail:")
    display(df.tail())
    print("")
    print("Check for Nulls:")
    display(df.isnull().sum())
    print("")
    print("Check for 0:")
    for c in df.columns:
        print(c, len(df[df[c] == 0]))
    print("")
    print('Number of duplicated rows:', df.duplicated().sum())
data_study(calls)
Missing values:
There are 8,172 rows with a missing operator_id, which is 15% of the data. I can't simply remove this much data, so I will investigate these rows to find the best solution.
#subset of rows with missing operators
missing_op = calls[calls['operator_id'].isnull()]
missing_op.sample(10)
#looking for commonalities of missing values
print(missing_op.user_id.nunique())
print(" ")
print(missing_op.date.nunique())
print(" ")
print(missing_op.direction.value_counts())
print(" ")
print(missing_op.internal.value_counts())
print(" ")
print(missing_op.is_missed_call.value_counts())
The missing values are distributed among 305 users, 119 dates, and all types of calls.
What is very clear is that almost all of these are unanswered external incoming calls. They might have happened outside working hours, or the callers may have hung up before being assigned to an operator.
Since we are working here to identify operator metrics, these calls are irrelevant to our task and should be dropped from the data. However, it is essential to point out that this reflects negatively on the call center as a whole.
I kept the original dataset as is to demonstrate the difference in the EDA section.
#drop rows with a missing operator_id
data = calls.dropna(subset=['operator_id']).reset_index(drop=True)
data.info()
Now let's check the missing values in the internal column:
print(data['internal'].isnull().sum())
print(" ")
data[data['internal'].isnull()].head(10)
data.internal.value_counts()
There are 60 missing values left in the internal column (0.1%). This column is not significant for our analysis, so I will fill these values with False when converting the column to bool in the next step (astype('bool') alone would turn NaN into True, since NaN is truthy).
Converting data types:
#convert date from object to datetime and drop the time of day (we can see it's insignificant here)
data['date'] = pd.to_datetime(data['date']).dt.date #this drops the time but returns Python date objects
data['date'] = pd.to_datetime(data['date']) #this converts back to datetime64
#convert operator_id from float to int
data['operator_id'] = data['operator_id'].astype('int')
#convert internal from object to bool, filling the remaining NaN with False
data['internal'] = data['internal'].fillna(False).astype('bool')
#convert direction from object to category
data['direction'] = data['direction'].astype('category')
#test
print(data.info(memory_usage='deep'))
print("")
data.head()
Conversion of data types reduced memory usage from 11.1 MB to 2.1 MB.
Duplicates
Before dropping rows, there were 4,900 duplicated rows in the data. First, let's see if that changed:
data.duplicated().sum()
data[data.duplicated()].sample(20)
There is no apparent connection between the duplicates.
As each row should be a unique record of an operator's activity, the duplicates are most likely caused by a bug in the logging system and should be dropped.
#dropping duplicates
data = data.drop_duplicates()
calls = calls.drop_duplicates()
#test
data.info()
Check for errors in the data:
#check if there are any errors in is_missed_call
data.query('is_missed_call == True & call_duration > 0').count()
data.query('is_missed_call == True & call_duration > 0').sample(5)
There are 296 rows where the call duration is greater than 0 but the call is marked as missed. This is an error in the is_missed_call column. To fix it, I'll change is_missed_call to False for these rows.
data.loc[(data['is_missed_call']) & (data['call_duration'] > 0), 'is_missed_call'] = False #.loc avoids chained-assignment issues
#test
data.query('is_missed_call == True & call_duration > 0').count()
#Check values in numeric columns
data.describe().T
data['date'].describe()
data.describe(include=['category','bool']).T
Everything else seems to be ok.
data_study(clients)
No missing values or duplicates here.
The date column should be converted to datetime:
#convert date to dt type
clients['date_start'] = pd.to_datetime(clients['date_start'])
#rename the column
clients = clients.rename(columns={'date_start':'join_date'})
clients['join_date'].describe()
clients.describe(include='object').T
I want to make sure that there are no logged calls for a user before the join date. For this purpose, and for further analysis that involves the clients' dataset, I will create a merged dataset that contains all the data.
#merge the two datasets
full_data = data.merge(clients, how='inner', on='user_id')
full_data.head()
#check for date anomalies
full_data.query('date < join_date').count()
#reorder columns
full_data = full_data.reindex(columns=['user_id','join_date','tariff_plan','date','direction','internal','operator_id',
'is_missed_call','calls_count','call_duration','total_call_duration'])
full_data.head()
print('The log features {} users and {} operators.'.format(full_data.user_id.nunique(),full_data.operator_id.nunique()))
We have significantly more operators than users.
grouped_operator = full_data.groupby('operator_id').agg({'user_id':'nunique'}).sort_values(by='user_id', ascending=False).reset_index()
grouped_operator.head()
Each operator is assigned to one user.
Now let's look from a user perspective:
grouped_user = full_data.groupby('user_id').agg({'operator_id':'nunique'}).sort_values(by='operator_id', ascending=False).reset_index()
grouped_user.head(20)
fig = px.histogram(grouped_user, x='operator_id', nbins=100)
fig.update_layout(title_text='Distribution of operators assigned to a user',
xaxis_title_text='Number of operators',
yaxis_title_text='Number of Users')
fig.show()
We can see that most users are assigned to just one operator, but we have some big clients that need more than one operator.
Let's determine what makes a client big:
def whiskers(df, column):
    """This function calculates the outlier thresholds (the box plot whiskers).
    The input arguments are a dataset and a column name.
    """
    stat = df[column].describe()
    iqr = stat['75%'] - stat['25%']
    left_whisker = round(stat['25%'] - 1.5 * iqr, 2)
    right_whisker = round(stat['75%'] + 1.5 * iqr, 2)
    #clip the whiskers to the observed range
    if left_whisker < stat['min']:
        left_whisker = stat['min']
    if right_whisker > stat['max']:
        right_whisker = stat['max']
    return [left_whisker, right_whisker]
whiskers(grouped_user,'operator_id')
We can now say that users assigned to more than 9 operators are outliers in the data (essentially, the top 19 users in the printout above). Since we are assessing the quality of each operator's work, the identity of the user is irrelevant, so we can keep the outliers and move on.
#group calls by direction
direction = full_data.groupby('direction').agg({'calls_count':'sum'}).reset_index()
direction
fig = go.Figure(data=[go.Pie(labels=direction['direction'], values=direction['calls_count'], hole=.2, pull=[0.2, 0])])
fig.update_layout(title_text='Proportion of call direction after eliminating missing operators')
fig.show()
86.6% of calls in the log are outgoing calls. Only a small share of the service done for the clients is answering incoming calls.
It is important to remember that we dropped a large number of incoming calls when handling missing values. Just to get a clear picture of the workload, let's compare to the original data:
direction_all = calls.groupby('direction').agg({'calls_count':'sum'}).reset_index()
fig = go.Figure(data=[go.Pie(labels=direction_all['direction'], values=direction_all['calls_count'], hole=.2, pull=[0.2, 0])])
fig.update_layout(title_text='Proportion of call direction of the entire data (including missing operators)')
fig.show()
We can see that the actual workload is split roughly 75%/25% between outgoing and incoming calls.
in_ex = full_data.groupby('internal').agg({'calls_count':'sum'}).reset_index()
in_ex
fig = go.Figure(data=[go.Pie(labels=in_ex['internal'], values=in_ex['calls_count'], hole=.2, pull=[0.2, 0])])
fig.update_layout(title_text='Distribution of internal and external calls')
fig.show()
Only a small portion of calls (2%) are internal calls (between operators).
full_data.date.describe()
fig = px.histogram(full_data, x='date', nbins=100)
fig.update_layout(title_text='Distribution of entries by date',
                  xaxis_title_text='Date',
                  yaxis_title_text='Number of occurrences')
fig.show()
full_data.join_date.describe()
We can see that our data covers four months (August-November), and the weekly pattern is clearly visible.
There is a gradual rise in workload during these months. The users join between 01/08/2019 and 31/10/2019, in parallel with the data collection, which could explain the rise in workload.
From now on, in the analysis, I will analyze the daily average for each operator. This will allow me to use all the data in the record.
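As a sketch of what this looks like (the daily helper frame below is my own illustration, not part of the project's pipeline), each operator's activity is first summed per day and then averaged across days:
#hedged sketch of the per-operator daily average; `daily` is an illustrative helper frame
daily = (full_data
         .groupby(['operator_id', 'date'], as_index=False)
         .agg({'calls_count': 'sum', 'total_call_duration': 'sum'}))
daily_avg = daily.groupby('operator_id')[['calls_count', 'total_call_duration']].mean()
daily_avg.head()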
def find_outliers(df, column):
    """This function prints the mean, median, and whisker thresholds of a column
    and draws its box plot.
    """
    print('Mean {} is: {}'.format(column, df[column].mean()))
    print('Median {} is: {}'.format(column, df[column].median()))
    print('{} whiskers: {}'.format(column, whiskers(df, column)))
    print("")
    fig = px.box(df, y=column)
    fig.update_layout(title_text='Box plot for {}'.format(column))
    fig.show()
find_outliers(full_data, 'total_call_duration')
At first glance, we can see that the data is very skewed: there is a big difference between the mean and the median.
There is a long tail, and the right whisker of 2,659.5 sec (about 44 min) seems like a very low cutoff for this metric, given that a typical workday runs 8-12 hours; using it would probably eliminate all the busiest, most effective operators from the data.
The max value in the data is 166,155 sec (about 46 hours). I want to take a closer look at those values.
I think the bar should be set at 12 working hours per day (43,200 sec). First, I want to check working hours per operator per day:
#group the total_call_duration by date and by operator
work_day=full_data.groupby(['date','operator_id']).agg({'total_call_duration':'sum'}).sort_values(by='total_call_duration', ascending=False).reset_index()
work_day.head()
work_day[work_day['total_call_duration'] > 43200]['operator_id'].nunique()
So, 7 operators logged implausible hours (more than 12 hours of call time in a single day). Perhaps several operators used the same login to the system. These operator_ids should be dropped from the data.
drop_op = work_day[work_day['total_call_duration'] > 43200]['operator_id'].unique().tolist()
drop_op
indexNames = full_data[full_data['operator_id'].isin(drop_op)].index
full_data = full_data.drop(indexNames).reset_index(drop=True) #drop=True keeps the old index from becoming a column
full_data.info()
find_outliers(full_data, 'total_call_duration')
The data still has a long tail, but now the values make sense. This tail is precisely what we need to analyze.
full_data['waiting_duration'] = full_data['total_call_duration'] - full_data['call_duration']
full_data['avg_wait_time'] = full_data['waiting_duration'] / full_data['calls_count']
full_data.head()
In this section, I will divide the data into incoming and outgoing calls, choose only the relevant columns for performance assessment, and run the KMeans clustering algorithm to determine effectiveness.
#subset of incoming calls
incoming = full_data.query('direction == "in"').copy() #copy the slice so later assignments don't trigger SettingWithCopyWarning
incoming.head()
There are two parameters that measure the ineffectiveness of an operator for incoming calls: a high number of missed calls and a long waiting time.
For clustering, I'll calculate the average missed calls and average waiting time for each operator. The clustering algorithm will determine the threshold for effectiveness.
#convert is_missed_call to a count: 0 if the calls were answered, the call count if they were missed
incoming['is_missed_call'] = np.where(incoming['is_missed_call'], incoming['calls_count'], 0)
incoming.head()
incoming_for_cluster = incoming.groupby(['operator_id']).agg({'is_missed_call':'mean','avg_wait_time':'mean'}).rename(columns={'is_missed_call':'avg_miss_call', 'avg_wait_time':'wait_time'})
incoming_for_cluster.head()
linked = linkage(incoming_for_cluster, method = 'ward')
plt.figure(figsize=(15, 10))
dendrogram(linked, orientation='top',no_labels=True)
plt.title('Hierarchical clustering for incoming calls data')
plt.show()
The dendrogram shows that indeed there are 2 clusters in this data.
km = KMeans(n_clusters=2, random_state=1) #random_state expects an integer seed (True was implicitly treated as 1)
labels = km.fit_predict(incoming_for_cluster)
incoming_for_cluster['labels'] = labels
display(incoming_for_cluster.groupby('labels').mean())
print("")
display(incoming_for_cluster.groupby('labels').min())
print("")
display(incoming_for_cluster.groupby('labels').max())
Several things to notice:
#subset of outgoing calls
outgoing = full_data.query('direction == "out"').copy() #copy the slice so later assignments don't trigger SettingWithCopyWarning
outgoing.head()
There was only one parameter set to measure the ineffectiveness of an operator for outgoing calls: a low number of outgoing calls made.
After reading about call center KPIs, I think that an additional parameter should be the answer success rate (ASR), the share of outgoing calls that were actually answered; a sketch of how it could be computed follows.
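As a rough sketch of how ASR could be derived from this subset before the conversion below (the answered and asr names are my own, for illustration):
#hedged sketch: answer success rate (ASR) per operator; `answered` and `asr` are illustrative names
answered = np.where(outgoing['is_missed_call'] == True, 0, outgoing['calls_count'])
asr = outgoing.assign(answered=answered).groupby('operator_id').agg(
    answered=('answered', 'sum'), total=('calls_count', 'sum'))
asr['asr'] = asr['answered'] / asr['total']
asr.head()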
For clustering, I'll calculate the average number of calls and average answered calls for each operator. The clustering algorithm will determine the threshold for effectiveness.
#convert is_missed_call to a count: the call count if the calls were answered, 0 if they were missed
outgoing['is_missed_call'] = np.where(outgoing['is_missed_call'], 0, outgoing['calls_count'])
outgoing.head()
outgoing_for_cluster = outgoing.groupby(['operator_id']).agg({'calls_count':'mean','is_missed_call':'mean'}).rename(columns={'calls_count':'avg_call_count','is_missed_call':'avg_answer_call'})
outgoing_for_cluster.head()
linked = linkage(outgoing_for_cluster, method = 'ward')
plt.figure(figsize=(15, 10))
dendrogram(linked, orientation='top', no_labels=True)
plt.title('Hierarchical clustering for outgoing calls data')
plt.show()
We can see two clusters here.
km = KMeans(n_clusters=2, random_state=1) #random_state expects an integer seed (True was implicitly treated as 1)
labels = km.fit_predict(outgoing_for_cluster)
outgoing_for_cluster['labels'] = labels
display(outgoing_for_cluster.groupby('labels').mean())
print("")
display(outgoing_for_cluster.groupby('labels').min())
print("")
display(outgoing_for_cluster.groupby('labels').max())
outgoing_for_cluster.groupby('labels').median()
Several things to notice:
#note: KMeans labels are arbitrary; this mapping follows the cluster statistics displayed above
effective = incoming_for_cluster.query('labels == 0')
ineffective = incoming_for_cluster.query('labels == 1')
To decide which statistical test to use and how to formulate the null hypothesis, I first need to know whether the data is normally distributed. For that, I will use the Shapiro-Wilk test of normality.
Null hypothesis H₀ - The data is normally distributed.
Alternative hypothesis H₁ - The data is not normally distributed.
The significance level is set to 5%, as is standard in business analysis.
def shapiro(df, column, alpha=0.05):
    """This function checks for normality using the Shapiro-Wilk test.
    The input arguments are a dataset and a column name. The alpha value is an optional argument.
    """
    stat, p = st.shapiro(df[column])
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    #interpret the test:
    if p > alpha:
        print('Sample is normally distributed (fail to reject H0)')
    else:
        print('Sample is not normally distributed (reject H0)')
shapiro(effective, 'avg_miss_call')
effective['avg_miss_call'].hist();
shapiro(ineffective, 'avg_miss_call')
ineffective['avg_miss_call'].hist();
Since my samples are not normally distributed but their distributions have similar shapes, I will use the Wilcoxon-Mann-Whitney (Mann-Whitney U) test. Unlike the paired Wilcoxon signed-rank test, it compares two independent samples and does not require equal sample sizes, so both clusters can be used in full:
print("Effective length is: ", len(effective['avg_miss_call']))
print("Ineffective length is: ", len(ineffective['avg_miss_call']))
Null hypothesis H₀ - There is no significant difference between the avg_miss_call distributions of the two clusters.
Alternative hypothesis H₁ - There is a significant difference between the avg_miss_call distributions of the two clusters.
The significance level is set to 5%, as is standard in business analysis.
alpha = .05 # critical statistical significance level
#Mann-Whitney U test on the two independent samples (st.wilcoxon is the paired signed-rank test and does not apply here)
results = st.mannwhitneyu(
    effective['avg_miss_call'],
    ineffective['avg_miss_call'],
    alternative='two-sided')
print('p-value: ', results.pvalue)
if results.pvalue < alpha:
    print("We reject the null hypothesis")
else:
    print("We can't reject the null hypothesis")
There is no significant difference between the performance of the two groups in this parameter.
For this section, I will use a z-test to test the proportions between the tariff plans.
First, let's create a subset that includes both tariff plans and clusters.
temp = outgoing_for_cluster.reset_index()
temp.head()
outgoing = outgoing.merge(temp,how='inner',on='operator_id')
outgoing.head()
grouped_by_tariff = outgoing.groupby('tariff_plan').agg({'labels':'sum','operator_id':'count'}).rename(columns={'labels':'labels_sum', 'operator_id':'count'}).reset_index()
grouped_by_tariff
In the outgoing clustering, 0 is the ineffective cluster and 1 is the effective cluster, so the higher a plan's labels_sum, the better the service it receives.
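For reference, the pooled two-proportion z-statistic computed by the function below is:
z = (p₁ - p₂) / √(P(1 - P)(1/n₁ + 1/n₂)), where P = (successes₁ + successes₂) / (n₁ + n₂)
Here p₁ and p₂ are the shares of effective operators in each plan, n₁ and n₂ are the operator counts, and the two-sided p-value comes from the standard normal distribution.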
#defining a function that runs the z-test
def check_hypothesis(tariff1, tariff2, alpha=0.05):
    """This function checks whether there is a significant difference between the proportions
    of two populations using the z-test.
    The input arguments are two tariff plans. The alpha value is an optional argument.
    """
    successes1 = grouped_by_tariff[grouped_by_tariff.tariff_plan == tariff1]['labels_sum'].iloc[0]
    successes2 = grouped_by_tariff[grouped_by_tariff.tariff_plan == tariff2]['labels_sum'].iloc[0]
    trials1 = grouped_by_tariff[grouped_by_tariff.tariff_plan == tariff1]['count'].iloc[0]
    trials2 = grouped_by_tariff[grouped_by_tariff.tariff_plan == tariff2]['count'].iloc[0]
    p1 = successes1 / trials1 #proportion of effective operators in the first plan
    p2 = successes2 / trials2 #proportion of effective operators in the second plan
    p_combined = (successes1 + successes2) / (trials1 + trials2)
    difference = p1 - p2
    z_value = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/trials1 + 1/trials2))
    distr = st.norm(0, 1)
    p_value = (1 - distr.cdf(abs(z_value))) * 2 #two-tailed p-value
    print('p-value: ', p_value)
    if p_value < alpha:
        print("Reject H0 for tariff plans {} and {}".format(tariff1, tariff2))
    else:
        print("Fail to reject H0 for tariff plans {} and {}".format(tariff1, tariff2))
Null hypothesis H₀- There is no significant difference between the proportions of the two groups.
Alternative hypothesis H₁- There is a significant difference between the proportions of the two groups.
The significance level is set to 5%, as is standard in business analysis.
check_hypothesis('A', 'B', alpha=0.05)
check_hypothesis('B', 'C', alpha=0.05)
check_hypothesis('A', 'C', alpha=0.05)
Tariffs B and C receive similarly effective service, while tariff A receives better-quality service for outgoing calls.
Note: I am aware of the multiple comparisons problem.
Applying the Bonferroni (or any other) correction lowers the alpha and thus increases the risk of a type II error. As I'm running only 3 tests, an adjustment of the alpha value is not strictly necessary; a sketch of what it would look like follows.
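For reference, a minimal sketch of the Bonferroni adjustment for these 3 pairwise tests (reusing the check_hypothesis function defined above):
#hedged sketch: Bonferroni-adjusted significance level for the 3 pairwise comparisons
n_tests = 3
bonferroni_alpha = 0.05 / n_tests #about 0.0167
check_hypothesis('A', 'B', alpha=bonferroni_alpha)
check_hypothesis('B', 'C', alpha=bonferroni_alpha)
check_hypothesis('A', 'C', alpha=bonferroni_alpha)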
To study the work of call centers and relevant KPIs, I used the following resources:
It is hard for me to point out exactly what I got from each link; together, they gave me a good understanding of the business and the analysis required here.
These links helped me decide what statistical test to use and learn more about it: