Two-Sample Z-Test
Definition
The two-sample z-test is a statistical test used to determine whether the means of two populations differ significantly when both population standard deviations are known. It is best suited to large samples; when the standard deviations must be estimated from the data, a two-sample t-test is the usual alternative.
Formula
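Test Statistic:
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
Where:
- $\bar{x}_1, \bar{x}_2$ = sample means
- $\mu_1, \mu_2$ = population means
- $\sigma_1, \sigma_2$ = known population standard deviations
- $n_1, n_2$ = sample sizes
Confidence Interval for Mean Difference:
$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$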
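Key Assumptions
- Independence: the two samples are drawn independently of each other, and observations within each sample are independent.
- Known variability: both population standard deviations are known.
- Normality: each population is normally distributed, or both samples are large enough for the Central Limit Theorem to apply.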
Practical Example
Comparing the efficiency of two production lines with known process variations:
Step 1: State the Data
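- Line 1: $n_1 = 50$, $\bar{x}_1 = 95.2$ units/hour, $\sigma_1 = 4.0$
- Line 2: $n_2 = 45$, $\bar{x}_2 = 93.8$ units/hour, $\sigma_2 = 3.8$
Step 2: State Hypotheses
- $H_0: \mu_1 - \mu_2 = 0$ (no difference)
- $H_a: \mu_1 - \mu_2 \neq 0$ (there is a difference)
Step 3: Calculate Test Statistic
Z-statistic:
$$z = \frac{95.2 - 93.8}{\sqrt{\frac{4.0^2}{50} + \frac{3.8^2}{45}}} = \frac{1.4}{0.8006} \approx 1.75$$
Step 4: Calculate P-value
For two-tailed test:
$$p = 2\,(1 - \Phi(|z|)) = 2\,(1 - \Phi(1.75)) \approx 0.080$$
Step 5: Calculate Confidence Interval
$$1.4 \pm 1.96 \times 0.8006 = (-0.17,\ 2.97)$$
Step 6: Draw Conclusion
Critical value at 5% significance level: $z_{0.025} = 1.96$
Since $|z| = 1.75 < 1.96$ and $p = 0.080 > 0.05$, we fail to reject $H_0$. There is no significant difference between the two production lines.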
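The six steps above can be sketched as a small function that works directly from summary statistics. This is a minimal illustration using only the Python standard library; the function name `two_sample_z` and the `delta0` parameter (the hypothesized difference under the null, zero here) are assumptions made for this sketch, not part of any library.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, delta0=0.0, alpha=0.05):
    """Two-sided two-sample z-test from summary statistics (known sigmas)."""
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    diff = mean1 - mean2
    # Standard error of the difference between two independent sample means
    se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = (diff - delta0) / se
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return {"z": z, "p_value": p_value, "z_crit": z_crit,
            "ci": ci, "reject_h0": abs(z) > z_crit}


# Production-line data from Step 1
result = two_sample_z(95.2, 93.8, 4.0, 3.8, 50, 45)
print(f"z = {result['z']:.2f}, p = {result['p_value']:.3f}")
print(f"95% CI: ({result['ci'][0]:.2f}, {result['ci'][1]:.2f})")
print(f"Reject H0: {result['reject_h0']}")
```

Working from summary statistics rather than raw observations keeps the arithmetic identical to the hand calculation, which makes it easy to check each step.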
Effect Size
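Cohen's d for two-sample z-test:
$$d = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{(\sigma_1^2 + \sigma_2^2)/2}}$$
Interpretation guidelines:
- Small effect: $|d| \approx 0.2$
- Medium effect: $|d| \approx 0.5$
- Large effect: $|d| \approx 0.8$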
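As a rough illustration, the effect size for the production-line example can be computed as below. The helper names are hypothetical, and turning Cohen's benchmarks (0.2, 0.5, 0.8) into hard cut-offs is a simplifying assumption of this sketch; the benchmarks are conventions, not strict boundaries.

```python
from math import sqrt


def cohens_d_known_sigma(mean1, mean2, sigma1, sigma2):
    """Cohen's d using the root-mean-square of the two known sigmas."""
    pooled_sd = sqrt((sigma1**2 + sigma2**2) / 2)
    return abs(mean1 - mean2) / pooled_sd


def effect_label(d):
    """Bin |d| with Cohen's conventional benchmarks (assumed cut-offs)."""
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"


# Summary statistics from the production-line example
d = cohens_d_known_sigma(95.2, 93.8, 4.0, 3.8)
print(f"d = {d:.2f} ({effect_label(d)})")
```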
Power Analysis
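Required sample size per group for equal sample sizes (two-sided test):
$$n = \frac{(\sigma_1^2 + \sigma_2^2)\,(z_{\alpha/2} + z_{\beta})^2}{\Delta^2}$$
Where:
- $\alpha$ = significance level
- $\beta$ = probability of Type II error
- $\Delta$ = minimum detectable difference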
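A sample-size calculation along these lines might look like the sketch below. It assumes a two-sided test with equal group sizes and the per-group form $n = (\sigma_1^2 + \sigma_2^2)(z_{\alpha/2} + z_\beta)^2 / \Delta^2$. The function name is hypothetical, and the target difference of 2.0 units/hour and 80% power are illustrative values, not figures from the example above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sigma1, sigma2, delta, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-sample z-test with equal group sizes."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = std_normal.inv_cdf(power)           # power = 1 - beta
    n = (sigma1**2 + sigma2**2) * (z_alpha + z_beta) ** 2 / delta**2
    return ceil(n)  # round up so the achieved power is at least the target


# Hypothetical target: detect a 2.0 units/hour difference with 80% power
n = sample_size_per_group(4.0, 3.8, delta=2.0)
print(f"Required sample size per group: {n}")
```

Halving the detectable difference roughly quadruples the required sample size, because $\Delta$ enters the formula squared.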
Decision Rules
Reject $H_0$ if:
- Two-sided test: $|z| > z_{\alpha/2}$
- Left-tailed test: $z < -z_{\alpha}$
- Right-tailed test: $z > z_{\alpha}$
- Or if $p\text{-value} < \alpha$
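The three rejection rules and their matching p-values can be sketched in one helper. The function name and the `alternative` labels ("two-sided", "less", "greater") are choices made for this illustration.

```python
from statistics import NormalDist


def z_test_decision(z, alpha=0.05, alternative="two-sided"):
    """Return (p_value, reject_h0) for a z statistic under the chosen alternative."""
    std_normal = NormalDist()
    if alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    elif alternative == "less":       # left-tailed
        p_value = std_normal.cdf(z)
        reject = z < -std_normal.inv_cdf(1 - alpha)
    elif alternative == "greater":    # right-tailed
        p_value = 1 - std_normal.cdf(z)
        reject = z > std_normal.inv_cdf(1 - alpha)
    else:
        raise ValueError("alternative must be 'two-sided', 'less', or 'greater'")
    return p_value, reject


decisions = {alt: z_test_decision(1.75, alternative=alt)
             for alt in ("two-sided", "less", "greater")}
for alt, (p, reject) in decisions.items():
    print(f"{alt}: p = {p:.4f}, reject H0 = {reject}")
```

Note that the same statistic, z = 1.75, is not significant at the 5% level for a two-sided test but is for a right-tailed test, which is why the direction of the alternative should be fixed before looking at the data.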
Reporting Results
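Standard format:
z = [z-statistic], p = [p-value], 95% CI [lower, upper], d = [effect size]
For the production-line example: z = 1.75, p = 0.080, 95% CI [-0.17, 2.97], d = 0.36.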
Code Examples
# R: two-sample z-test with known population standard deviations
library(tidyverse)

set.seed(42)

# Known population standard deviations
line1_pop_sd <- 4.0
line2_pop_sd <- 3.8

# Production Line 1 data (known σ₁ = 4.0)
line1 <- tibble(
  line = "Line 1",
  units = rnorm(50, mean = 95.2, sd = line1_pop_sd)
)

# Production Line 2 data (known σ₂ = 3.8)
line2 <- tibble(
  line = "Line 2",
  units = rnorm(45, mean = 93.8, sd = line2_pop_sd)
)

# Combine data
production_data <- bind_rows(line1, line2)

# Summarize the data
# (the simulated sample means will differ slightly from 95.2 and 93.8)
summary_stats <- production_data |>
  group_by(line) |>
  summarise(
    n = n(),
    mean = mean(units),
    .groups = "drop"
  ) |>
  mutate(known_sd = if_else(line == "Line 1", line1_pop_sd, line2_pop_sd))

# Perform two-sample Z-test
line1_stats <- summary_stats |> filter(line == "Line 1")
line2_stats <- summary_stats |> filter(line == "Line 2")

# Difference in sample means and its standard error
mean_diff <- line1_stats$mean - line2_stats$mean
se_diff <- sqrt((line1_stats$known_sd^2 / line1_stats$n) + (line2_stats$known_sd^2 / line2_stats$n))

# Calculate z-statistic
z_stat <- mean_diff / se_diff
print(str_glue("Z-statistic: {round(z_stat, 3)}"))

# 95% confidence interval
alpha <- 0.05
z_alpha <- qnorm(1 - alpha / 2)
margin_of_error <- z_alpha * se_diff
ci_lower <- mean_diff - margin_of_error
ci_upper <- mean_diff + margin_of_error
print(str_glue("95% CI: [{round(ci_lower, 2)}, {round(ci_upper, 2)}]"))

# Calculate p-value (two-sided test)
p_value <- 2 * pnorm(-abs(z_stat))
print(str_glue("P-value: {round(p_value, 4)}"))

# Calculate effect size (Cohen's d)
pooled_sd <- sqrt((line1_pop_sd^2 + line2_pop_sd^2) / 2)
cohens_d <- abs(mean_diff) / pooled_sd
print(str_glue("Effect size (Cohen's d): {round(cohens_d, 3)}"))

# Visualization
ggplot(production_data, aes(x = line, y = units, fill = line)) +
  geom_boxplot(alpha = 0.5) +
  geom_jitter(width = 0.2, alpha = 0.5) +
  theme_minimal() +
  labs(
    title = "Production Output by Line",
    y = "Units per Hour",
    x = "Production Line"
  )
# Python: two-sample z-test with known population standard deviations
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

# Known population standard deviations and sample sizes
sigma1, n1 = 4.0, 50
sigma2, n2 = 3.8, 45

# Generate sample data
# Production Line 1 (known σ₁ = 4.0)
line1_data = np.random.normal(95.2, sigma1, n1)

# Production Line 2 (known σ₂ = 3.8)
line2_data = np.random.normal(93.8, sigma2, n2)

# Calculate sample means
# (the simulated means will differ slightly from 95.2 and 93.8)
sample_mean1 = np.mean(line1_data)
sample_mean2 = np.mean(line2_data)

# Calculate z-statistic
z_numerator = sample_mean1 - sample_mean2
z_denominator = np.sqrt((sigma1**2 / n1) + (sigma2**2 / n2))  # standard error
z_stat = z_numerator / z_denominator
print(f"Z-statistic: {z_stat:.2f}")

# Calculate p-value (two-sided test)
p_value = 2 * stats.norm.sf(abs(z_stat))
print(f"P-value: {p_value:.4f}")

# Calculate 95% Confidence Interval
alpha = 0.05
z_critical = stats.norm.ppf(1 - alpha / 2)
margin_of_error = z_critical * z_denominator
ci_lower = z_numerator - margin_of_error
ci_upper = z_numerator + margin_of_error
print(f"95% Confidence Interval for mean difference: ({ci_lower:.2f}, {ci_upper:.2f})")

# Calculate effect size (Cohen's d)
pooled_sd = np.sqrt((sigma1**2 + sigma2**2) / 2)
cohens_d = abs(sample_mean1 - sample_mean2) / pooled_sd
print(f"Cohen's d: {cohens_d:.2f}")

# Create DataFrame for plotting
df = pd.DataFrame({
    'Production Line': ['Line 1'] * n1 + ['Line 2'] * n2,
    'Units': np.concatenate([line1_data, line2_data])
})

# Create visualization
plt.figure(figsize=(12, 5))

# Subplot 1: Boxplot
plt.subplot(1, 2, 1)
sns.boxplot(data=df, x='Production Line', y='Units')
plt.title('Units per Hour by Production Line')

# Subplot 2: Distribution
plt.subplot(1, 2, 2)
sns.histplot(data=df, x='Units', hue='Production Line',
             element="step", stat="density")
plt.title('Distribution of Units per Hour')

plt.tight_layout()
plt.show()