One-Way ANOVA with Python
Welcome to our hands-on tutorial on performing one-way ANOVA (Analysis of Variance) with Python. In this guide, we'll walk through a practical example analyzing how customer engagement varies across three marketing channels: TV, Social Media, and Print advertising. New to ANOVA? Start with our comprehensive ANOVA guide for a solid theoretical foundation before diving into the implementation.
The analysis will be shown using two approaches:
- Using statistical libraries (scipy, statsmodels)
- Manual calculations with detailed explanations
Sample Data and Assumptions
Note on Assumptions:
For this tutorial, we treat the following ANOVA assumptions as satisfied:
- The observations are independent
- The data within each group is normally distributed
- The groups have homogeneous variances
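The tutorial takes these assumptions as given, but two of them can be checked quickly. The sketch below is one possible check, not part of the original analysis: it uses `scipy.stats.shapiro` for normality within each group and `scipy.stats.levene` for equality of variances. The `groups` dictionary and the variable names are illustrative, and with only five observations per group these tests have little power, so treat the p-values as a rough sanity check.

```python
from scipy import stats

# Engagement scores per channel, copied from the tutorial's dataset
groups = {
    "TV": [85, 90, 88, 84, 87],
    "Social Media": [88, 92, 89, 85, 91],
    "Print": [78, 80, 82, 79, 81],
}

# Shapiro-Wilk test: the null hypothesis is that the group is normally distributed
shapiro_p = {}
for name, scores in groups.items():
    w_stat, p = stats.shapiro(scores)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk ({name}): W = {w_stat:.4f}, p = {p:.4f}")

# Levene's test: the null hypothesis is that all groups have equal variances
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene: W = {levene_stat:.4f}, p = {levene_p:.4f}")
```

A large p-value in either test means the data give no evidence against the assumption; it does not prove the assumption holds. Independence cannot be tested this way and has to be justified by how the data were collected.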
Dataset:
Channel, Customer Engagement
TV, 85
TV, 90
TV, 88
TV, 84
TV, 87
Social Media, 88
Social Media, 92
Social Media, 89
Social Media, 85
Social Media, 91
Print, 78
Print, 80
Print, 82
Print, 79
Print, 81
Try it yourself:
Want to analyze this data without coding? Copy the data above and paste it into our One-Way ANOVA Calculator to get instant results.
Approach 1: Using Statistical Libraries
Python Implementation:
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import matplotlib.pyplot as plt
import seaborn as sns
# Step 1: Create the dataset
data = {
    'Channel': ['TV', 'TV', 'TV', 'TV', 'TV',
                'Social Media', 'Social Media', 'Social Media', 'Social Media', 'Social Media',
                'Print', 'Print', 'Print', 'Print', 'Print'],
    'Customer_Engagement': [85, 90, 88, 84, 87,
                            88, 92, 89, 85, 91,
                            78, 80, 82, 79, 81]
}
# Step 2: Create a DataFrame
df = pd.DataFrame(data)
# Step 3: Calculate descriptive statistics
descriptive_stats = df.groupby('Channel')['Customer_Engagement'].agg([
    'count', 'mean', 'std', 'min', 'max'
]).round(2)
print("Descriptive Statistics:")
print(descriptive_stats)
# Step 4: Perform one-way ANOVA
# Get the engagement scores for each channel
tv_scores = df[df['Channel'] == 'TV']['Customer_Engagement']
social_scores = df[df['Channel'] == 'Social Media']['Customer_Engagement']
print_scores = df[df['Channel'] == 'Print']['Customer_Engagement']
# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(tv_scores, social_scores, print_scores)
print("One-way ANOVA Results:")
print(f"F-statistic: {f_statistic:.4f}")
print(f"p-value: {p_value:.4f}")
# Step 5: Calculate Effect Size (Eta-squared)
def calculate_eta_squared(df, dv, between):
    """Calculate eta-squared effect size"""
    groups = df[between].unique()
    grand_mean = df[dv].mean()
    # Calculate SSt (Total Sum of Squares)
    ss_total = np.sum((df[dv] - grand_mean) ** 2)
    # Calculate SSb (Between Sum of Squares)
    ss_between = np.sum([
        len(df[df[between] == group]) *
        (df[df[between] == group][dv].mean() - grand_mean) ** 2
        for group in groups
    ])
    return ss_between / ss_total
eta_squared = calculate_eta_squared(df, 'Customer_Engagement', 'Channel')
print(f"Effect Size (η²): {eta_squared:.4f}")
Results:
Descriptive Statistics:
Channel | Mean | Std | N |
---|---|---|---|
Print | 80.0 | 1.58 | 5 |
Social Media | 89.0 | 2.74 | 5 |
TV | 86.8 | 2.39 | 5 |
F-statistic: F = 21.0318
p-value: p = 0.0001
Effect Size: η² = 0.7780
Approach 2: Manual Calculations
Python Implementation and Results:
import pandas as pd
import numpy as np
# Step 1: Create the dataset
print("Step 1: Create the dataset")
data = {
    'Channel': ['TV', 'TV', 'TV', 'TV', 'TV',
                'Social Media', 'Social Media', 'Social Media', 'Social Media', 'Social Media',
                'Print', 'Print', 'Print', 'Print', 'Print'],
    'Engagement': [85, 90, 88, 84, 87,
                   88, 92, 89, 85, 91,
                   78, 80, 82, 79, 81]
}
df = pd.DataFrame(data)
# Step 2: Calculate Grand Mean
grand_mean = df['Engagement'].mean()
print(f"Step 2: Grand Mean = {grand_mean:.2f}")
# Step 3: Calculate Group Means
group_means = df.groupby('Channel')['Engagement'].mean()
print("Step 3: Group Means:")
print(group_means)
Step 2: Grand Mean = 85.27
Step 3: Group Means:
Channel
Print 80.0
Social Media 89.0
TV 86.8
Name: Engagement, dtype: float64
# Steps 4-6: Calculate Sum of Squares
n_groups = len(df['Channel'].unique())
n_total = len(df)
n_per_group = df.groupby('Channel').size()
ssb = sum(n_per_group * (group_means - grand_mean)**2)
print(f"Step 4: Sum of Squares Between Groups (SSB) = {ssb:.2f}")
ssw = 0
for channel in df['Channel'].unique():
    group_data = df[df['Channel'] == channel]['Engagement']
    group_mean = group_means[channel]
    ssw += sum((group_data - group_mean)**2)
print(f"Step 5: Sum of Squares Within Groups (SSW) = {ssw:.2f}")
sst = sum((df['Engagement'] - grand_mean)**2)
print(f"Step 6: Total Sum of Squares (SST) = {sst:.2f}")
print(f"Verification: SST ({sst:.2f}) ≈ SSB ({ssb:.2f}) + SSW ({ssw:.2f})")
Step 4: Sum of Squares Between Groups (SSB) = 220.13
Step 5: Sum of Squares Within Groups (SSW) = 62.80
Step 6: Total Sum of Squares (SST) = 282.93
Verification: SST (282.93) ≈ SSB (220.13) + SSW (62.80)
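For reference, Steps 4-6 follow the standard one-way ANOVA decomposition of the total variation. The notation is introduced here for illustration and is not taken from the code: k is the number of groups, n_j the number of observations in group j, x_ij the i-th observation in group j, x̄_j the mean of group j, and x̄ the grand mean.

```latex
\begin{aligned}
SSB &= \sum_{j=1}^{k} n_j \left(\bar{x}_j - \bar{x}\right)^2 \\
SSW &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right)^2 \\
SST &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}\right)^2 = SSB + SSW
\end{aligned}
```

With the values printed above, 220.13 + 62.80 = 282.93, which matches SST.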
# Steps 7-9: Calculate df, MS, and F-statistic
df_between = n_groups - 1
df_within = n_total - n_groups
df_total = n_total - 1
print("Step 7: Degrees of Freedom:")
print(f"Between Groups (df_b) = {df_between}")
print(f"Within Groups (df_w) = {df_within}")
print(f"Total (df_t) = {df_total}")
ms_between = ssb / df_between
ms_within = ssw / df_within
print("Step 8: Mean Squares:")
print(f"Mean Square Between (MSB) = {ms_between:.2f}")
print(f"Mean Square Within (MSW) = {ms_within:.2f}")
f_statistic = ms_between / ms_within
print(f"Step 9: F-statistic = {f_statistic:.4f}")
Step 7: Degrees of Freedom:
Between Groups (df_b) = 2
Within Groups (df_w) = 12
Total (df_t) = 14
Step 8: Mean Squares:
Mean Square Between (MSB) = 110.07
Mean Square Within (MSW) = 5.23
Step 9: F-statistic = 21.0318
ANOVA Summary Table:
Source | SS | df | MS | F |
---|---|---|---|---|
Between | 220.13 | 2 | 110.07 | 21.0318 |
Within | 62.80 | 12 | 5.23 | - |
Total | 282.93 | 14 | - | - |
Effect Size: η² = 0.7780
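The manual walkthrough stops at the F-statistic. To recover the p-value reported in Approach 1, the F-statistic can be compared against the F distribution with (df_between, df_within) degrees of freedom. The sketch below assumes `scipy.stats.f` and a significance level of α = 0.05; that level is an assumption made for illustration, not something the tutorial specifies.

```python
from scipy import stats

# Sums of squares and degrees of freedom from the manual calculation (Steps 4-7)
ss_between, ss_within = 220.1333, 62.80
df_between, df_within = 2, 12

# Mean squares and F-statistic (Steps 8-9)
f_statistic = (ss_between / df_between) / (ss_within / df_within)

# p-value: upper-tail probability of the F distribution at the observed F
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value at alpha = 0.05 (assumed level, chosen for illustration)
alpha = 0.05
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print(f"F = {f_statistic:.4f}, p = {p_value:.6f}, F critical = {f_critical:.4f}")
```

If the observed F exceeds the critical value (equivalently, if the p-value is below α), the null hypothesis of equal group means is rejected.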
Interpreting the Results:
The one-way ANOVA shows a statistically significant difference in customer engagement across the three marketing channels (TV, Social Media, Print). With F(2, 12) = 21.0318 and p = 0.0001, well below the conventional 0.05 threshold, we reject the null hypothesis that all three channel means are equal: at least one channel's mean engagement differs from the others. The effect size (η²) of 0.778 indicates that 77.8% of the variance in customer engagement in this sample is explained by the marketing channel.
Post-hoc Analysis:
A significant ANOVA result tells you only that at least one group mean differs, not which ones. Post-hoc tests such as Tukey HSD, Bonferroni, Scheffé's test, or Fisher's LSD identify which specific pairs of groups differ significantly. You can use our Tukey HSD calculator to perform this analysis.
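As an illustration of one such test in Python, the sketch below runs Tukey's HSD with `pairwise_tukeyhsd` from statsmodels, which Approach 1 imports but does not call. The variable name `tukey` and the choice of `alpha=0.05` are assumptions made for this example.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Rebuild the tutorial's dataset so this block runs on its own
df = pd.DataFrame({
    "Channel": ["TV"] * 5 + ["Social Media"] * 5 + ["Print"] * 5,
    "Customer_Engagement": [85, 90, 88, 84, 87,
                            88, 92, 89, 85, 91,
                            78, 80, 82, 79, 81],
})

# Compare every pair of channels while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(
    endog=df["Customer_Engagement"],  # response values
    groups=df["Channel"],             # group label for each value
    alpha=0.05,                       # assumed significance level
)
print(tukey.summary())
```

The summary lists each pair of channels with its mean difference, adjusted p-value, confidence interval, and a `reject` flag. Given the group means above (80.0, 89.0, 86.8), Print sits well below both other channels, while the 2.2-point gap between Social Media and TV is much smaller.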