A/B Testing for an Online Store

Task:

You've received an analytical task from an international online store. Your predecessor failed to complete it: they launched an A/B test and then quit (to start a watermelon farm in Brazil). They left only the technical specifications and the test results.

Technical description

  • Test name: recommender_system_test
  • Groups: A (control), B (new payment funnel)
  • Launch date: 2020-12-07
  • Date when they stopped taking on new users: 2020-12-21
  • End date: 2021-01-01
  • Audience: 15% of the new users from the EU region
  • Purpose of the test: testing changes related to the introduction of an improved recommendation system
  • Expected result: within 14 days of signing up, users will show better conversion into product page views (the product_page event), product cart views (product_cart) and purchases (purchase). At each stage of the funnel product_page → product_cart → purchase, there will be at least a 10% increase.
  • Expected number of test participants: 6000 (a rough sample-size sanity check appears just below)

This is an A/B test, the purpose of which is to test changes related to the introduction of an improved recommendation system.
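
Before opening the data, it is worth a quick sanity check of whether 6,000 participants are enough to detect a 10% relative lift. The snippet below is only a sketch: the 30% baseline conversion is an assumption made up for illustration (the real baseline should come from historical data), and it relies on statsmodels being available in the environment.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.30                  # hypothetical baseline conversion rate (assumption, not from the data)
lifted = baseline * 1.10         # the expected 10% relative increase

effect_size = proportion_effectsize(lifted, baseline)
n_per_group = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.8, ratio=1)
print(f'Required participants per group: {n_per_group:.0f}')

With this made-up 30% baseline the requirement comes out at roughly 1,900 per group, so the planned 6,000 participants would be sufficient; a lower baseline conversion would push the requirement up.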

Opening data

In [1]:
import pandas as pd 
import numpy as np
from numpy import median
from scipy import stats as st
import math as mth

import datetime as dt
from datetime import datetime, timedelta

import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns

import plotly
import plotly.express as px
from plotly import graph_objects as go

import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
In [2]:
#import the files: try the local path first, fall back to the /datasets/ directory

location = '/some/local/path'

try: 
    calendar = pd.read_csv(location + 'ab_project_marketing_events_us.csv')
    events = pd.read_csv(location + 'final_ab_events_us.csv')
    users = pd.read_csv(location + 'final_ab_new_users_us.csv') 
    participants = pd.read_csv(location + 'final_ab_participants_us.csv') 
    
except: 
    calendar = pd.read_csv('/datasets/ab_project_marketing_events_us.csv')
    events = pd.read_csv('/datasets/final_ab_events_us.csv')
    users = pd.read_csv('/datasets/final_ab_new_users_us.csv') 
    participants = pd.read_csv('/datasets/final_ab_participants_us.csv') 
In [3]:
def data_study(data): 
    """This function runs methods to study the data and prints out the results.

    The input argument is a dataset.  
    """
    print(data.info(memory_usage='deep'))
    print("")
    print("Head:")
    display(data.head())
    print("")
    print("Tail:")
    display(data.tail())
    print("")
    print("Check for Nulls:")
    display(data.isnull().sum()) 
    print("")
    print("Check for 0:")
    for c in data.columns: 
        print(c , len(data[data[c] == 0]))
    print("")
    print('Number of duplicated rows:', data.duplicated().sum())
In [4]:
data_study(calendar)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   name       14 non-null     object
 1   regions    14 non-null     object
 2   start_dt   14 non-null     object
 3   finish_dt  14 non-null     object
dtypes: object(4)
memory usage: 4.0 KB
None

Head:
name regions start_dt finish_dt
0 Christmas&New Year Promo EU, N.America 2020-12-25 2021-01-03
1 St. Valentine's Day Giveaway EU, CIS, APAC, N.America 2020-02-14 2020-02-16
2 St. Patric's Day Promo EU, N.America 2020-03-17 2020-03-19
3 Easter Promo EU, CIS, APAC, N.America 2020-04-12 2020-04-19
4 4th of July Promo N.America 2020-07-04 2020-07-11
Tail:
name regions start_dt finish_dt
9 Victory Day CIS (May 9th) Event CIS 2020-05-09 2020-05-11
10 CIS New Year Gift Lottery CIS 2020-12-30 2021-01-07
11 Dragon Boat Festival Giveaway APAC 2020-06-25 2020-07-01
12 Single's Day Gift Promo APAC 2020-11-11 2020-11-12
13 Chinese Moon Festival APAC 2020-10-01 2020-10-07
Check for Nulls:
name         0
regions      0
start_dt     0
finish_dt    0
dtype: int64
Check for 0:
name 0
regions 0
start_dt 0
finish_dt 0

Number of duplicated rows: 0
In [5]:
# as calendar has only 14 entries we can look at the entire dataset
calendar
Out[5]:
name regions start_dt finish_dt
0 Christmas&New Year Promo EU, N.America 2020-12-25 2021-01-03
1 St. Valentine's Day Giveaway EU, CIS, APAC, N.America 2020-02-14 2020-02-16
2 St. Patric's Day Promo EU, N.America 2020-03-17 2020-03-19
3 Easter Promo EU, CIS, APAC, N.America 2020-04-12 2020-04-19
4 4th of July Promo N.America 2020-07-04 2020-07-11
5 Black Friday Ads Campaign EU, CIS, APAC, N.America 2020-11-26 2020-12-01
6 Chinese New Year Promo APAC 2020-01-25 2020-02-07
7 Labor day (May 1st) Ads Campaign EU, CIS, APAC 2020-05-01 2020-05-03
8 International Women's Day Promo EU, CIS, APAC 2020-03-08 2020-03-10
9 Victory Day CIS (May 9th) Event CIS 2020-05-09 2020-05-11
10 CIS New Year Gift Lottery CIS 2020-12-30 2021-01-07
11 Dragon Boat Festival Giveaway APAC 2020-06-25 2020-07-01
12 Single's Day Gift Promo APAC 2020-11-11 2020-11-12
13 Chinese Moon Festival APAC 2020-10-01 2020-10-07

We can see there are no missing values or duplicated rows in this dataset.
There are 2 columns (start_dt and finish_dt) that need to be converted to the datetime type.

In [6]:
calendar['start_dt'] = pd.to_datetime(calendar['start_dt'])
calendar['finish_dt'] = pd.to_datetime(calendar['finish_dt'])
calendar.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   name       14 non-null     object        
 1   regions    14 non-null     object        
 2   start_dt   14 non-null     datetime64[ns]
 3   finish_dt  14 non-null     datetime64[ns]
dtypes: datetime64[ns](2), object(2)
memory usage: 2.4 KB

We can see that converting lowered memory usage from 4.0KB to 2.4KB.

As our AB test deals with EU users, we can create a subset of EU promotions.

In [7]:
calendar_eu = calendar[calendar['regions'].str.contains('EU', regex=False)]
calendar_eu
Out[7]:
name regions start_dt finish_dt
0 Christmas&New Year Promo EU, N.America 2020-12-25 2021-01-03
1 St. Valentine's Day Giveaway EU, CIS, APAC, N.America 2020-02-14 2020-02-16
2 St. Patric's Day Promo EU, N.America 2020-03-17 2020-03-19
3 Easter Promo EU, CIS, APAC, N.America 2020-04-12 2020-04-19
5 Black Friday Ads Campaign EU, CIS, APAC, N.America 2020-11-26 2020-12-01
7 Labor day (May 1st) Ads Campaign EU, CIS, APAC 2020-05-01 2020-05-03
8 International Women's Day Promo EU, CIS, APAC 2020-03-08 2020-03-10
In [8]:
data_study(users)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61733 entries, 0 to 61732
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     61733 non-null  object
 1   first_date  61733 non-null  object
 2   region      61733 non-null  object
 3   device      61733 non-null  object
dtypes: object(4)
memory usage: 15.4 MB
None

Head:
user_id first_date region device
0 D72A72121175D8BE 2020-12-07 EU PC
1 F1C668619DFE6E65 2020-12-07 N.America Android
2 2E1BF1D4C37EA01F 2020-12-07 EU PC
3 50734A22C0C63768 2020-12-07 EU iPhone
4 E1BDDCE0DAFA2679 2020-12-07 N.America iPhone
Tail:
user_id first_date region device
61728 1DB53B933257165D 2020-12-20 EU Android
61729 538643EB4527ED03 2020-12-20 EU Mac
61730 7ADEE837D5D8CBBD 2020-12-20 EU PC
61731 1C7D23927835213F 2020-12-20 EU iPhone
61732 8F04273BB2860229 2020-12-20 EU Android
Check for Nulls:
user_id       0
first_date    0
region        0
device        0
dtype: int64
Check for 0:
user_id 0
first_date 0
region 0
device 0

Number of duplicated rows: 0

No missing values or duplicated rows here.

In [9]:
#convert first_date to dt type
users['first_date'] = pd.to_datetime(users['first_date'])
In [10]:
#change date column name to make it easier to distinguish from other dates
users = users.rename(columns={'first_date':'join_date'})

Next I will check the values in the region and device columns. Both look like low-cardinality columns that can eventually be converted to the category type to optimize memory usage.

In [11]:
#check values in region column
users.region.value_counts()
Out[11]:
EU           46270
N.America     9155
CIS           3155
APAC          3153
Name: region, dtype: int64
In [12]:
#check values in device column
users.device.value_counts()
Out[12]:
Android    27520
PC         15599
iPhone     12530
Mac         6084
Name: device, dtype: int64
In [13]:
#convert region and device to category type
users['region'] = users['region'].astype('category') 
users['device'] = users['device'].astype('category') 
In [14]:
users.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61733 entries, 0 to 61732
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   user_id    61733 non-null  object        
 1   join_date  61733 non-null  datetime64[ns]
 2   region     61733 non-null  category      
 3   device     61733 non-null  category      
dtypes: category(2), datetime64[ns](1), object(1)
memory usage: 4.9 MB

After data type conversion the memory usage dropped from 15.4MB to 4.9MB

In [15]:
#check there are no repeat user_ids
users.user_id.nunique()
Out[15]:
61733

The number of unique user ids equals the number of entries, which means there are no repeat ids.

In [16]:
#check the dates for anomalies
users.join_date.describe()
Out[16]:
count                   61733
unique                     17
top       2020-12-21 00:00:00
freq                     6290
first     2020-12-07 00:00:00
last      2020-12-23 00:00:00
Name: join_date, dtype: object
In [17]:
len(users[users['join_date'] > '2020-12-21'])
Out[17]:
5263

In this dataset we have users from all regions (we only need EU), and 8.5% of them joined after 2020-12-21.
For now I will leave those users in, since they will be dropped in the following steps.
In addition, this indicates that the test description written by the previous analyst is not accurate.
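
For reference, the share quoted above can be computed directly (a quick sketch using the users table loaded earlier):

# share of users whose join_date falls after the stated recruitment cut-off
late_joiners = users[users['join_date'] > '2020-12-21']
print(f'{len(late_joiners) / len(users):.1%} of users joined after 2020-12-21')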

In [18]:
data_study(events)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440317 entries, 0 to 440316
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   user_id     440317 non-null  object 
 1   event_dt    440317 non-null  object 
 2   event_name  440317 non-null  object 
 3   details     62740 non-null   float64
dtypes: float64(1), object(3)
memory usage: 93.4 MB
None

Head:
user_id event_dt event_name details
0 E1BDDCE0DAFA2679 2020-12-07 20:22:03 purchase 99.99
1 7B6452F081F49504 2020-12-07 09:22:53 purchase 9.99
2 9CD9F34546DF254C 2020-12-07 12:59:29 purchase 4.99
3 96F27A054B191457 2020-12-07 04:02:40 purchase 4.99
4 1FD7660FDF94CA1F 2020-12-07 10:15:09 purchase 4.99
Tail:
user_id event_dt event_name details
440312 245E85F65C358E08 2020-12-30 19:35:55 login NaN
440313 9385A108F5A0A7A7 2020-12-30 10:54:15 login NaN
440314 DB650B7559AC6EAC 2020-12-30 10:59:09 login NaN
440315 F80C9BDDEA02E53C 2020-12-30 09:53:39 login NaN
440316 7AEC61159B672CC5 2020-12-30 11:36:13 login NaN
Check for Nulls:
user_id            0
event_dt           0
event_name         0
details       377577
dtype: int64
Check for 0:
user_id 0
event_dt 0
event_name 0
details 0

Number of duplicated rows: 0
In [19]:
events.event_name.value_counts()
Out[19]:
login           189552
product_page    125563
purchase         62740
product_cart     62462
Name: event_name, dtype: int64

The missing values in the details column are to be expected, since it is filled only for purchase events.
Adding the count of purchase events (62,740) to the count of NaN values (377,577) gives us the total number of entries in the dataset.
This column is unnecessary for the analysis, so the missing values can be left untouched.
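
This relationship can be verified with a short check (a sketch):

# 'details' should be filled exactly for purchase events
n_purchases = (events['event_name'] == 'purchase').sum()
n_filled = events['details'].notna().sum()
print(n_purchases, n_filled, n_purchases == n_filled)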

In [20]:
# convert event_dt to dt type
events['event_dt'] = pd.to_datetime(events['event_dt'])
In [21]:
events.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440317 entries, 0 to 440316
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   user_id     440317 non-null  object        
 1   event_dt    440317 non-null  datetime64[ns]
 2   event_name  440317 non-null  object        
 3   details     62740 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 64.8 MB

Conversion to datetime type reduced memory usage from 93.4MB to 64.8MB.
I decided not to convert event_name to the category type, as that makes it harder to plot graphs.

In [22]:
events.event_dt.describe()
Out[22]:
count                  440317
unique                 267268
top       2020-12-23 02:37:24
freq                       10
first     2020-12-07 00:00:33
last      2020-12-30 23:36:33
Name: event_dt, dtype: object
In [23]:
events.user_id.nunique()
Out[23]:
58703

As with the users dataset, we have more data than necessary for the test. The unnecessary data will be dropped in future steps.

In [24]:
data_study(participants)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18268 entries, 0 to 18267
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  18268 non-null  object
 1   group    18268 non-null  object
 2   ab_test  18268 non-null  object
dtypes: object(3)
memory usage: 3.7 MB
None

Head:
user_id group ab_test
0 D1ABA3E2887B6A73 A recommender_system_test
1 A7A3664BD6242119 A recommender_system_test
2 DABC14FDDFADD29E A recommender_system_test
3 04988C5DF189632E A recommender_system_test
4 482F14783456D21B B recommender_system_test
Tail:
user_id group ab_test
18263 1D302F8688B91781 B interface_eu_test
18264 3DE51B726983B657 A interface_eu_test
18265 F501F79D332BE86C A interface_eu_test
18266 63FBE257B05F2245 A interface_eu_test
18267 79F9ABFB029CF724 B interface_eu_test
Check for Nulls:
user_id    0
group      0
ab_test    0
dtype: int64
Check for 0:
user_id 0
group 0
ab_test 0

Number of duplicated rows: 0
In [25]:
participants.ab_test.value_counts()
Out[25]:
interface_eu_test          11567
recommender_system_test     6701
Name: ab_test, dtype: int64

Comparing the data with the test description suggests that there was a mistake in the test name. The correct test appears to be 'interface_eu_test', not 'recommender_system_test' as stated in the description.

Creating a dataset for testing

Go back to the Table of Contents

In [26]:
#making a list of all test participants
test_participants = participants[participants['ab_test'] == 'interface_eu_test']['user_id']
In [27]:
#checking that the participants are from the EU region
users[users.user_id.isin(list(test_participants))]['region'].unique()
Out[27]:
[EU]
Categories (1, object): [EU]
In [28]:
#subset of test events
test_events = events[events.user_id.isin(list(test_participants))].reset_index(drop=True)
test_events.head()
Out[28]:
user_id event_dt event_name details
0 96F27A054B191457 2020-12-07 04:02:40 purchase 4.99
1 A92195E3CFB83DBD 2020-12-07 00:32:07 purchase 4.99
2 354D653172FF2A2D 2020-12-07 15:45:11 purchase 4.99
3 7FCD34F47C13A9AC 2020-12-07 22:06:13 purchase 9.99
4 0313C457F07C339E 2020-12-07 13:10:48 purchase 9.99
In [29]:
#now I can add the group column and drop unnecessary columns
test_events = test_events.merge(participants, how='inner', on='user_id').drop(columns=['details','ab_test'])
test_events.head()
Out[29]:
user_id event_dt event_name group
0 96F27A054B191457 2020-12-07 04:02:40 purchase B
1 96F27A054B191457 2020-12-08 09:43:14 purchase B
2 96F27A054B191457 2020-12-09 00:44:10 purchase B
3 96F27A054B191457 2020-12-26 00:33:57 purchase B
4 96F27A054B191457 2020-12-07 04:02:41 product_page B
In [30]:
test_events.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 87821 entries, 0 to 87820
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   user_id     87821 non-null  object        
 1   event_dt    87821 non-null  datetime64[ns]
 2   event_name  87821 non-null  object        
 3   group       87821 non-null  object        
dtypes: datetime64[ns](1), object(3)
memory usage: 18.1 MB

Stage summary:

Go back to the Table of Contents

  1. Our data for the test consists of 4 datasets.
  2. In each dataset I converted the date columns to datetime type.
  3. Comparing the actual data with the test description shows that the description is not accurate and should be double checked.
  4. The datasets contain more data than necessary for the test. I created a new subset that contains only the data needed going forward.

Exploratory Data Analysis

Are there users who are present in both samples?

Go back to the Table of Contents

In [31]:
groups = test_events.groupby('group').agg({'user_id':'nunique'})
groups
Out[31]:
user_id
group
A 5792
B 5499
In [32]:
users_by_group = test_events.groupby('user_id').agg({'group':'nunique'}).reset_index()
users_by_group.head()
Out[32]:
user_id group
0 0002CE61FF2C4011 1
1 001064FEAAB631A1 1
2 001E72F50D1C48FA 1
3 002412F1EB3F6E38 1
4 002540BE89C930FB 1
In [33]:
users_by_group.query('group > 1').count()
Out[33]:
user_id    441
group      441
dtype: int64

There are 441 users that were recruited into both groups. These users should be dropped.

In [34]:
drop_list = users_by_group.query('group > 1')['user_id'].tolist()

test_events = test_events.drop(test_events[test_events.user_id.isin(drop_list)].index).reset_index(drop=True)
test_events.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82259 entries, 0 to 82258
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   user_id     82259 non-null  object        
 1   event_dt    82259 non-null  datetime64[ns]
 2   event_name  82259 non-null  object        
 3   group       82259 non-null  object        
dtypes: datetime64[ns](1), object(3)
memory usage: 2.5+ MB

This eliminated 5,562 events from the log. Normally, discarding 6.33% of the data would be too much, but for the purposes of A/B testing we have no other choice: we can't keep participants who saw both versions of the system.
The fact that so many users ended up in both groups might indicate a bug in the recruitment process.

In [35]:
group_clean = test_events.groupby('group').agg({'user_id':'nunique'})
group_clean
Out[35]:
user_id
group
A 5351
B 5058

After removing those users we are left with fewer participants than the description requires (6,000).
Since we already know the description is inaccurate, there is no reason not to examine the results anyway.

Study conversion at different stages of the funnel

Go back to the Table of Contents

We'll start with a simple funnel divided by test groups:

Group A:

In [36]:
groupA = test_events.query('group == "A"')
groupB = test_events.query('group == "B"')
In [37]:
groupA_funnel = groupA.groupby('event_name').agg({'user_id':'nunique'}).sort_values(by='user_id', ascending=False).reset_index()
groupA_funnel
Out[37]:
event_name user_id
0 login 5350
1 product_page 3577
2 purchase 1903
3 product_cart 1726
In [38]:
groupA_funnel = groupA_funnel.reindex([0, 1, 3, 2])
groupA_funnel
Out[38]:
event_name user_id
0 login 5350
1 product_page 3577
3 product_cart 1726
2 purchase 1903
In [39]:
fig = go.Figure(go.Funnel(
    y = ['Login','Product page','Product cart','Purchase'],
    x = groupA_funnel['user_id'],
    textinfo='label + text + percent initial + percent previous', hoverinfo='skip'
    ))
fig.update_layout(title="Simple event funnel for group A")
fig.show() 

Group B:

In [40]:
groupB_funnel = groupB.groupby('event_name').agg({'user_id':'nunique'}).sort_values(by='user_id', ascending=False).reset_index()
groupB_funnel
Out[40]:
event_name user_id
0 login 5058
1 product_page 3318
2 product_cart 1716
3 purchase 1683
In [41]:
fig = go.Figure(go.Funnel(
    y = ['Login','Product page','Product cart','Purchase'],
    x = groupB_funnel['user_id'],
    textinfo='label + text + percent initial + percent previous', hoverinfo='skip'
    ))
fig.update_layout(title="Simple event funnel for group B")
fig.show() 

These funnels are interesting, as they show a difference in stage ordering between the two test groups.
While group B has the expected funnel of login --> product_page --> product_cart --> purchase, for group A the last two steps are reversed: more group A users make a purchase than view the cart. The site might have a "buy now" button that allows users to skip the cart.

To look into this further, I will plot a sequential funnel for each group separately.

Group A

In [42]:
usersA = groupA.pivot_table(
    index='user_id', 
    columns='event_name', 
    values='event_dt',
    aggfunc='min') 
usersA.head(20)
Out[42]:
event_name login product_cart product_page purchase
user_id
0002CE61FF2C4011 2020-12-07 04:37:36 2020-12-07 04:37:37 2020-12-07 04:37:42 NaT
002412F1EB3F6E38 2020-12-09 09:35:20 NaT 2020-12-09 09:35:20 NaT
002540BE89C930FB 2020-12-08 18:05:24 2020-12-08 18:05:24 2020-12-08 18:05:24 NaT
003346BB64227D0C 2020-12-17 05:24:01 NaT NaT NaT
00341D8401F0F665 2020-12-21 11:14:50 NaT NaT NaT
004C58ADE7CA8C4A 2020-12-18 15:13:33 NaT 2020-12-18 15:13:35 2020-12-18 15:13:33
0053DD654C9513D6 2020-12-12 04:35:37 NaT 2020-12-12 04:35:38 2020-12-12 04:35:37
005881C49580D705 2020-12-23 13:54:49 2020-12-23 13:54:49 NaT NaT
00807760AE4F0504 2020-12-21 09:24:41 NaT 2020-12-21 09:24:41 NaT
0093130DFBF1DC6A 2020-12-20 01:07:27 2020-12-20 01:07:27 2020-12-20 01:07:27 2020-12-20 01:07:27
00A29C54C162D2CF 2020-12-20 22:31:17 NaT NaT NaT
00B45E6226F03553 2020-12-14 06:00:44 NaT NaT NaT
00C2BA76985336C1 2020-12-12 15:56:42 NaT NaT NaT
00C3BF17FD8CBFC9 2020-12-22 15:56:59 NaT 2020-12-22 15:57:02 2020-12-22 15:56:59
00D627B7002B66CB 2020-12-17 07:03:57 NaT 2020-12-17 07:03:59 NaT
00D937C10E86138E 2020-12-15 04:14:42 NaT NaT NaT
00DCC58F2FAD6F66 2020-12-18 10:10:40 NaT NaT NaT
00E8D06BF3E5A23B 2020-12-14 21:59:06 NaT 2020-12-14 21:59:08 2020-12-14 21:59:06
00EECC8699408DD1 2020-12-09 08:24:37 2020-12-09 08:24:37 2020-12-09 08:24:37 2020-12-09 08:24:37
00EFA157F7B6E1C4 2020-12-09 14:12:14 2020-12-09 14:12:14 2020-12-09 14:12:14 NaT
In [43]:
step_1_A = ~usersA['login'].isna()
step_2_A = step_1_A & (usersA['product_page'] >= usersA['login'])
step_3_A = step_2_A & (usersA['product_cart'] >= usersA['product_page'])
step_4_A = step_3_A & (usersA['purchase'] >= usersA['product_cart'])

n_logins_A = usersA[step_1_A].shape[0]
n_product_A = usersA[step_2_A].shape[0]
n_checkout_A = usersA[step_3_A].shape[0]
n_payment_A = usersA[step_4_A].shape[0]

print('Logins:', n_logins_A)
print('Product page:', n_product_A)
print('Cart:', n_checkout_A)
print('Paid:', n_payment_A)
Logins: 5350
Product page: 3577
Cart: 820
Paid: 113
In [44]:
fig = go.Figure(go.Funnel(
    y = ['Login','Product Page','Cart','Payment'],
    x = [n_logins_A,n_product_A,n_checkout_A,n_payment_A],
    textinfo='label + text + percent initial + percent previous', hoverinfo='skip'
    ))
fig.update_layout(title="Sequential funnel for test group A")
fig.show() 

Group B

In [45]:
usersB = groupB.pivot_table(
    index='user_id', 
    columns='event_name', 
    values='event_dt',
    aggfunc='min') 
usersB.head(20)
Out[45]:
event_name login product_cart product_page purchase
user_id
001064FEAAB631A1 2020-12-20 14:43:27 NaT 2020-12-20 14:43:28 NaT
001E72F50D1C48FA 2020-12-17 15:44:05 NaT 2020-12-17 15:44:06 NaT
0031F1B5E9FBF708 2020-12-14 00:47:10 NaT NaT 2020-12-14 00:47:10
0036BE15EE4D319D 2020-12-21 01:56:05 NaT NaT NaT
003F86A34B575D27 2020-12-22 16:14:09 2020-12-22 16:14:09 NaT NaT
0050F43F34C955F4 2020-12-13 19:41:56 2020-12-13 19:41:59 2020-12-13 19:41:57 2020-12-13 19:41:56
0069105CBE9CA8DC 2020-12-07 17:50:08 NaT 2020-12-07 17:50:09 NaT
00722905AE9E2441 2020-12-13 08:39:06 NaT 2020-12-13 08:39:07 NaT
0074BEB9856FBC87 2020-12-13 13:02:23 2020-12-13 13:02:23 2020-12-13 13:02:23 NaT
0082A350AF0647FA 2020-12-22 21:31:54 NaT 2020-12-22 21:31:56 2020-12-22 21:31:54
009F1C78D679CDEC 2020-12-11 14:20:56 2020-12-11 14:20:56 2020-12-11 14:20:57 2020-12-11 14:20:56
00C8947A03ECB797 2020-12-21 17:02:08 NaT 2020-12-21 17:02:10 NaT
00C8FDF85B779D88 2020-12-13 03:03:02 NaT 2020-12-13 03:03:02 NaT
0107365328F9C3DE 2020-12-14 02:25:31 NaT 2020-12-14 02:25:31 NaT
01260F7CCAF2357F 2020-12-20 09:11:58 2020-12-20 09:12:00 2020-12-20 09:12:00 NaT
01292EDFD9CCF3FC 2020-12-21 15:09:37 NaT 2020-12-21 15:09:39 NaT
012E1FBA4CBBDD8C 2020-12-22 07:22:22 2020-12-22 07:22:22 2020-12-22 07:22:23 NaT
012EE9FDC4BCCC06 2020-12-07 22:08:03 NaT 2020-12-07 22:08:04 NaT
01503216FE61EE4D 2020-12-14 04:37:57 NaT NaT NaT
0156BAE5D3A05F3B 2020-12-10 21:00:46 2020-12-10 21:00:49 NaT NaT
In [46]:
step_1_B = ~usersB['login'].isna()
step_2_B = step_1_B & (usersB['product_page'] >= usersB['login'])
step_3_B = step_2_B & (usersB['product_cart'] >= usersB['product_page'])
step_4_B = step_3_B & (usersB['purchase'] >= usersB['product_cart'])

n_logins_B = usersB[step_1_B].shape[0]
n_product_B = usersB[step_2_B].shape[0]
n_checkout_B = usersB[step_3_B].shape[0]
n_payment_B = usersB[step_4_B].shape[0]

print('Logins:', n_logins_B)
print('Product page:', n_product_B)
print('Cart:', n_checkout_B)
print('Paid:', n_payment_B)
Logins: 5058
Product page: 3318
Cart: 805
Paid: 98
In [47]:
fig = go.Figure(go.Funnel(
    y = ['Login','Product Page','Cart','Payment'],
    x = [n_logins_B,n_product_B,n_checkout_B,n_payment_B],
    textinfo='label + text + percent initial + percent previous', hoverinfo='skip'
    ))
fig.update_layout(title="Sequential funnel for test group B")
fig.show() 

It is also very interesting that the sequential funnel charts for the two groups are almost identical (there are only minor differences).
So the sequential funnels of the two groups match, while the simple funnels differ in stage order.
Only about 2% of users complete a purchase by going through all 4 stages in this particular order, while roughly 33-36% of the test users complete a purchase on the site.
This tells us that the app has shortcuts that allow users to skip certain events, for example skipping the cart (checkout) and going straight to payment.

Note: While looking at the pivot tables for the sequential funnel I noticed some irregularities in the event timestamps. Events that should happen in a sequence have the same timestamp. I tried to handle this by using >= when calculating the funnel values. Nevertheless, those irregularities should be kept in mind when looking at the data.
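
A rough way to quantify this (a sketch using the usersA pivot table built above; the same can be repeated for usersB):

# users whose first product_page view carries exactly the same timestamp as their first login
same_ts_A = (usersA['product_page'] == usersA['login']).sum()
print('Group A users with identical login/product_page timestamps:', same_ts_A)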

Average number of events per user

Go back to the Table of Contents

Is the number of events per user distributed equally among the samples?

In [48]:
events_per_userA = groupA.groupby('user_id').agg({'event_name':'count'}).reset_index()
print('Average number of events per user in group A:', round(events_per_userA['event_name'].mean(), 2))
Average number of events per user in group A: 8.1
In [49]:
events_per_userB = groupB.groupby('user_id').agg({'event_name':'count'}).reset_index()
print('Average number of events per user in group B:', round(events_per_userB['event_name'].mean(), 2))
Average number of events per user in group B: 7.69

There is a small difference between the samples: users in the control group perform slightly more events on average.
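
To look beyond the averages, the full distributions can be compared as well (a sketch):

# compare the distributions of events per user, not just the means
plt.figure(figsize=(10, 5))
plt.hist(events_per_userA['event_name'], bins=30, alpha=0.5, label='A')
plt.hist(events_per_userB['event_name'], bins=30, alpha=0.5, label='B')
plt.legend()
plt.title('Events per user by group')
plt.xlabel('Number of events')
plt.ylabel('Number of users');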

What period of time does the data cover?

Go back to the Table of Contents

In [50]:
#histogram by date and time
plt.figure(figsize=(10,5))
test_events['event_dt'].hist(bins=150, xrot=60, color='royalblue');
plt.title('Distribution of events');
plt.xlabel('Dates');
plt.ylabel('Number of events');

There is a gradual rise in the number of events until Dec 13th. This makes sense: this is the recruitment period.
After that, the number of events reaches a peak at around Dec 21.
According to the EU events calendar, the site launched the Christmas&New Year Promo on Dec 25th, so we would expect a rise in events after that date. Instead, the number of events drops drastically (back to recruitment-period levels).
This can be explained by the holiday season: people don't actually shop on the holiday itself. The promo should have started before Christmas, not on the day of.
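
One way to make this easier to see is to aggregate the events by day and mark the promo start date (a sketch):

# daily event counts with the Christmas&New Year Promo start marked
daily_events = test_events.groupby(test_events['event_dt'].dt.date).size()
plt.figure(figsize=(10, 5))
plt.plot(daily_events.index, daily_events.values, marker='o', color='royalblue')
plt.axvline(dt.date(2020, 12, 25), color='red', linestyle='--', label='Christmas&New Year Promo start')
plt.legend()
plt.title('Events per day')
plt.xlabel('Date')
plt.ylabel('Number of events')
plt.xticks(rotation=60);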

In [51]:
calendar_eu
Out[51]:
name regions start_dt finish_dt
0 Christmas&New Year Promo EU, N.America 2020-12-25 2021-01-03
1 St. Valentine's Day Giveaway EU, CIS, APAC, N.America 2020-02-14 2020-02-16
2 St. Patric's Day Promo EU, N.America 2020-03-17 2020-03-19
3 Easter Promo EU, CIS, APAC, N.America 2020-04-12 2020-04-19
5 Black Friday Ads Campaign EU, CIS, APAC, N.America 2020-11-26 2020-12-01
7 Labor day (May 1st) Ads Campaign EU, CIS, APAC 2020-05-01 2020-05-03
8 International Women's Day Promo EU, CIS, APAC 2020-03-08 2020-03-10

Comments on the data

Go back to the Table of Contents

Are there any peculiarities in the data that you have to take into account before starting the A/B test?

There are several things to notice:

  1. There seems to be a problem at the recruitment stage of the test, which resulted in 441 participants (6.33% of the events) landing in both test groups. We had to remove these users. As a result, we are running the test with fewer participants than expected.
  2. There are some irregularities with the timestamps. For example, some users performed all events at the exact same moment. I tried to address this issue in the analysis, but I can't be sure the results correctly reflect user behavior.
In [52]:
#creating a pivot table by groups and events
pivot = test_events.pivot_table(index='event_name', values='user_id', columns='group', aggfunc=lambda x: x.nunique()).reset_index()
pivot
Out[52]:
group event_name A B
0 login 5350 5058
1 product_cart 1726 1716
2 product_page 3577 3318
3 purchase 1903 1683

I will define a function that runs the z-test and can be used to test all events.

In [53]:
#defining a function that runs the z-test
def check_hypothesis(event, alpha=0.05):
    """This function runs a z-test to check for statistical difference between the proportions.

    The input arguments are the event name to test and the alpha level.
    """
    
    successes1 = pivot[pivot.event_name == event]['A'].iloc[0]
    successes2 = pivot[pivot.event_name == event]['B'].iloc[0]
    
    trials1 = test_events[test_events.group == 'A']['user_id'].nunique()
    trials2 = test_events[test_events.group == 'B']['user_id'].nunique()
    
    p1 = successes1/trials1

    p2 = successes2/trials2

    p_combined = (successes1 + successes2) / (trials1 + trials2)

    difference = p1 - p2
        
    z_value = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/trials1 + 1/trials2))
  
    distr = st.norm(0, 1) 

    p_value = (1 - distr.cdf(abs(z_value))) * 2

    print('p-value: ', p_value)

    if (p_value < alpha):
        print("Reject H0 for", event)
    else:
        print("Fail to Reject H0 for", event)
    
    print('')

Now we can run the z-test on each event:

H0 - There is no significant difference between the proportions of the control group and the test group for the checked event.

H1 - There is a significant difference between the proportions of the control group and the test group for the checked event.

The significance level is set to 5%, as is commonly used in business analytics.
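
As a cross-check of the hand-written z-test, the same comparison can also be run with statsmodels (a sketch; it assumes statsmodels is installed and should give essentially the same p-values, since it also uses the pooled proportion):

from statsmodels.stats.proportion import proportions_ztest

def check_hypothesis_sm(event):
    """Cross-check of check_hypothesis using statsmodels' two-proportion z-test."""
    successes = [pivot[pivot.event_name == event]['A'].iloc[0],
                 pivot[pivot.event_name == event]['B'].iloc[0]]
    trials = [test_events[test_events.group == 'A']['user_id'].nunique(),
              test_events[test_events.group == 'B']['user_id'].nunique()]
    stat, p_value = proportions_ztest(successes, trials)
    print(event, 'p-value:', p_value)

for i in pivot.event_name.unique():
    check_hypothesis_sm(i)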

In [54]:
for i in pivot.event_name.unique():
    check_hypothesis(i, alpha=0.05)
p-value:  0.33090965942701267
Fail to Reject H0 for login

p-value:  0.07014790784441405
Fail to Reject H0 for product_cart

p-value:  0.17829662456570405
Fail to Reject H0 for product_page

p-value:  0.014023560498416465
Reject H0 for purchase

There is no statistically significant difference between the two groups for logins, product page visits, or product cart visits.
There is, however, a statistically significant difference in purchases: a larger share of group A participants made a purchase.
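
For context, the observed per-user purchase conversion in each group and the relative difference can be computed from the pivot table (a sketch):

# observed purchase conversion per group (share of unique users with at least one purchase)
n_A = test_events[test_events.group == 'A']['user_id'].nunique()
n_B = test_events[test_events.group == 'B']['user_id'].nunique()
conv_A = pivot[pivot.event_name == 'purchase']['A'].iloc[0] / n_A
conv_B = pivot[pivot.event_name == 'purchase']['B'].iloc[0] / n_B
print(f'Group A: {conv_A:.1%}, Group B: {conv_B:.1%}, relative change for B: {(conv_B - conv_A) / conv_A:.1%}')

Under the spec's success criterion (at least a 10% increase at each stage), group B falls short; if anything, its purchase conversion is lower than the control's.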

The test was not successful for several reasons:

  1. Holiday season is not a good time to run an A/B test. Shopping habits change over the month of December (a rise before the holidays and a drop during them, a pattern we can clearly see in the data).
  2. In addition, it is unwise to run a test while a promotion is running (unless the promotion is what's being tested), since we can't distinguish between behavior changes related to the feature and changes related to the promotion. In this particular case, the promotion started too late, so it might have had little effect on users (they weren't visiting the site anyway because of the holidays).
  3. There were fewer participants in the test than expected.
  4. The timestamps are inconsistent.

Despite the above, I note the following:

  1. There was a difference in the funnel between the two groups. A larger share of group B users visited the cart than in group A, but we can't pinpoint exactly at which stage of the funnel this happens.
  2. Group A participants made more purchases.

To conclude: The test should be run again in January (after the holiday season ends) and after technical issues are fixed.

Thank you for your attention!