In this lab you will practice specifying, estimating, and interpreting interaction effects in OLS regression using the attain dataset. An interaction term tests whether the relationship between a predictor and an outcome depends on the value of a third variable — the moderator.
You will work through three models, one for each combination of variable types:
Model
Predictor (X₁)
Moderator (X₂)
Outcome (Y)
Type
1
educ
age
income91
Continuous × Continuous
2
educ
sex_f
income91
Continuous × Binary
3
union_member
sex_f
income91
Binary × Binary
All three use income91 as the outcome so you can compare results across models.
Setup
Load packages, read the data, drop rows with missing values on every variable used in any model, and create factor and binary variables.
# load librarieslibrary(tidyverse)library(here)library(modelsummary)library(ggeffects)# turn off scientific notationoptions(scipen =999)# load raw dataattain_raw <-read_csv(here("data", "attain.csv"))# Drop rows missing any variable used across all three models,# then create recoded variablesattain <- attain_raw |>filter(!is.na(income91),!is.na(educ),!is.na(age),!is.na(sex),!is.na(union) ) |>mutate(# Binary moderator: set male as reference groupsex_f =fct_relevel(sex, "male"),# Binary predictor: 1 = respondent is a union memberunion_member =as.integer(union %in%c("r belong", "r and sp")) )# How many rows did we drop?cat("Original rows:", nrow(attain_raw), "\n")
Original rows: 2992
cat("Rows after dropping missing:", nrow(attain), "\n")
We drop missing values in the setup step — before any model is run — so that all three models are estimated on the same sample. If you filter inside lm() or at different steps, sample sizes can vary across models, making comparisons misleading.
Model 1: Continuous × Continuous (Education × Age)
Research Question
Does the income return to each additional year of education depend on how old you are?
A positive interaction (β₃ > 0) would mean older workers capture more income per year of schooling — perhaps because experience and credentials compound over a career. A negative interaction (β₃ < 0) would suggest younger workers benefit more per year of education.
Use * in the formula to include both main effects and the interaction in one step.
model1 <-lm(income91 ~ educ * age, data = attain)
Note
educ * age in R automatically expands to educ + age + educ:age. Never write just educ:age without the main effects — the coefficients would be uninterpretable (this is called the hierarchy principle).
Regression Table
modelsummary( model1,vcov ="HC1",coef_rename =c("(Intercept)"="Intercept","educ"="Years of Education","age"="Age (years)","educ:age"="Education × Age" ),stars =TRUE,fmt =1,gof_map =c("nobs", "r.squared", "adj.r.squared"),title ="OLS: Income ~ Education × Age (HC1 Robust SEs)")
OLS: Income ~ Education × Age (HC1 Robust SEs)
(1)
Intercept
−1926.5
(7732.6)
Years of Education
2659.6***
(623.3)
Age (years)
−197.3
(129.6)
Education × Age
22.4*
(11.3)
Num.Obs.
1763
R2
0.166
R2 Adj.
0.165
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Interpret the Coefficients
Note
Reading a continuous × continuous interaction:
In the equation \(\hat{Y} = \alpha + \beta_1\,(\text{educ}) + \beta_2\,(\text{age}) + \beta_3\,(\text{educ} \times \text{age})\), the main effect \(\beta_1\) is the slope for education only when age = 0 — which is outside the range of the data and not meaningful. Instead, compute simple slopes: the effect of education at specific, substantively interesting values of age.
\[\text{Slope of educ at a given age} = \beta_1 + \beta_3 \times \text{age}\]
Question 1. What is the sign on the Education × Age coefficient? Is it statistically significant at α = 0.05? State your conclusion in one sentence.
Your answer:
Question 2. Using the formula above and the coefficients from your table, calculate the simple slope of education at three values of age. Round to the nearest dollar.
Age
Calculation
Income per year of education
Young (25 years)
β₁ + β₃ × 25 =
___
Middle (45 years)
β₁ + β₃ × 45 =
___
Older (65 years)
β₁ + β₃ × 65 =
___
Fill in the table using your model output.
Question 3. Based on the simple slopes you calculated: does the income return to education increase or decrease with age? What might explain this pattern sociologically?
Your answer:
Question 4. The main effect of Age (years) is the income change per year of age when education = 0 — also not meaningful on its own. Given that, what should you focus on when interpreting this model? What is the key coefficient?
Your answer:
Margins Plot
ggpredict() computes the effect of education at the mean, mean + 1 SD, and mean − 1 SD of age, then plots three separate lines. If the interaction is real, the lines will have different slopes.
ggpredict(model1, terms =c("educ", "age [meansd]")) |>plot() +labs(title ="Predicted Income by Education, at Low / Mean / High Age",subtitle ="Lines show predicted income across education levels at three ages",x ="Years of Education",y ="Predicted Income ($)",color ="Age" ) +theme_minimal()
Note
Reading the margins plot: Each line shows how predicted income changes across education levels for a different age group. If the lines are parallel, there is no interaction — the education slope is the same at all ages. If the lines fan out (diverge), older or younger respondents benefit more from each additional year of schooling.
Question 5. Do the lines in the plot appear parallel or non-parallel? Does this match your conclusion from the regression table?
Your answer:
Model 2: Continuous × Binary (Education × Sex)
Research Question
Does each additional year of education translate into the same income gain for men and women?
This is a classic stratification question: even when women have identical credentials, do they receive the same wage return per year of schooling? A negative interaction (β₃ < 0 on Education × Female) would be evidence of structural inequality — women receive a lower wage premium per credential.
model2 <-lm(income91 ~ educ * sex_f, data = attain)
Regression Table
modelsummary(list("No Interaction"=lm(income91 ~ educ + sex_f, data = attain),"With Interaction"= model2),vcov ="HC1",coef_rename =c("(Intercept)"="Intercept","educ"="Years of Education","sex_ffemale"="Female","educ:sex_ffemale"="Education × Female" ),stars =TRUE,fmt =1,gof_map =c("nobs", "r.squared", "adj.r.squared"),title ="OLS: Income ~ Education × Sex (HC1 Robust SEs)")
OLS: Income ~ Education × Sex (HC1 Robust SEs)
No Interaction
With Interaction
Intercept
−8871.8***
−9466.5**
(2578.7)
(3560.4)
Years of Education
3660.2***
3704.8***
(196.7)
(280.4)
Female
−3761.0**
−2628.4
(1194.2)
(4974.9)
Education × Female
−85.7
(393.3)
Num.Obs.
1763
1763
R2
0.167
0.167
R2 Adj.
0.166
0.166
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Note
Why show both models? Comparing the model without the interaction (parallel slopes assumed) to the model with the interaction lets you see how much the coefficients change when you allow slopes to differ by sex. The coefficient on Female means something different in each model: in the additive model it is the average income gap at any education level; in the interaction model it is the gap only at zero years of education.
Interpret the Coefficients
Note
Reading a continuous × binary interaction:
Group
Education slope
Formula
Men (reference, female = 0)
\(\beta_1\)
Main effect of education
Women (female = 1)
\(\beta_1 + \beta_3\)
Main effect + interaction term
Difference
\(\beta_3\)
The interaction term itself
The interaction term \(\beta_3\) is a difference in slopes — it tells you by how many dollars women’s per-year return to education differs from men’s.
Question 6. Using your model output, fill in the table:
Group
Education slope ($/year)
Men
β₁ = ___
Women
β₁ + β₃ = ___ + ___ = ___
Difference
β₃ = ___
Question 7. Is the Education × Female coefficient statistically significant at α = 0.05? State whether you reject or fail to reject H₀: β₃ = 0. What does this mean for whether the education slope differs by sex?
Your answer:
Question 8. Regardless of statistical significance: is the direction of the interaction what you expected? Based on what you know about gender inequality in the labor market, explain why β₃ might be negative.
Your answer:
Question 9. The Female coefficient in the interaction model is the income gap between men and women at zero years of education. Why is this not a substantively meaningful number? What would be a more useful way to describe the gender gap from this model?
Your answer:
Margins Plot
ggpredict(model2, terms =c("educ", "sex_f")) |>plot() +scale_color_manual(values =c("male"="#4E79A7", "female"="#E15759"),labels =c("Men", "Women")) +scale_fill_manual(values =c("male"="#4E79A7", "female"="#E15759"),labels =c("Men", "Women")) +labs(title ="Predicted Income by Education and Sex",subtitle ="Slopes differ if lines are non-parallel",x ="Years of Education",y ="Predicted Income ($)",color ="Sex",fill ="Sex" ) +theme_minimal()
Note
Reading the margins plot: Each line shows the predicted income trajectory across education levels for men (blue) and women (red). Parallel lines indicate no interaction — the education return is the same for both groups. Diverging lines indicate a positive interaction (the gap widens with more education). Converging lines indicate a negative interaction (the gap narrows).
Question 10. Do the lines for men and women appear parallel, diverging, or converging? At what education level, if any, do the confidence ribbons stop overlapping? What does this suggest about where in the education distribution the gender gap is largest?
Your answer:
Model 3: Binary × Binary (Union Membership × Sex)
Research Question
Is the union wage premium the same for men and women, or does union membership pay off differently depending on sex?
Unions historically compressed wages across skill levels. But if the union premium differs by sex, this would suggest that unions have not equalized wages across gender lines — and that the intersection of union status and sex produces unequal outcomes.
model3 <-lm(income91 ~ union_member * sex_f, data = attain)
Regression Table
modelsummary(list("No Interaction"=lm(income91 ~ union_member + sex_f, data = attain),"With Interaction"= model3),vcov ="HC1",coef_rename =c("(Intercept)"="Intercept","union_member"="Union Member","sex_ffemale"="Female","union_member:sex_ffemale"="Union × Female" ),stars =TRUE,fmt =1,gof_map =c("nobs", "r.squared", "adj.r.squared"),title ="OLS: Income ~ Union Membership × Sex (HC1 Robust SEs)")
OLS: Income ~ Union Membership × Sex (HC1 Robust SEs)
No Interaction
With Interaction
Intercept
39031.3***
39212.8***
(1034.0)
(1085.8)
Union Member
5221.6**
4181.1+
(1730.1)
(2324.7)
Female
−4277.4**
−4596.4**
(1312.2)
(1433.6)
Union × Female
2396.4
(3477.1)
Num.Obs.
1763
1763
R2
0.011
0.012
R2 Adj.
0.010
0.010
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Interpret the Coefficients
Note
Reading a binary × binary interaction:
With two binary variables, you can build a 2×2 table of predicted incomes. Each cell is a linear combination of the intercept and coefficients:
Not Union (X₁ = 0)
Union (X₁ = 1)
Men (X₂ = 0)
α
α + β₁
Women (X₂ = 1)
α + β₂
α + β₁ + β₂ + β₃
The interaction term β₃ captures non-additivity: whether the union premium for women (β₁ + β₃) is different from the union premium for men (β₁).
Question 11. Using the coefficients from your table, fill in the 2×2 predicted income table:
Not Union
Union
Union premium
Men
α = ___
α + β₁ = ___
β₁ = ___
Women
α + β₂ = ___
α + β₁ + β₂ + β₃ = ___
β₁ + β₃ = ___
Question 12. Is the union wage premium larger for men or women, based on the table you filled in? Is the Union × Female interaction term statistically significant?
Your answer:
Question 13. The Female coefficient in the interaction model is the income gap between non-union men and non-union women. The Union Member coefficient is the union premium for men only. Explain in plain language why you cannot use these main effects to describe women’s situation without also looking at the interaction term.
Your answer:
Margins Plot
ggpredict(model3, terms =c("union_member [0, 1]", "sex_f")) |>plot() +scale_color_manual(values =c("male"="#4E79A7", "female"="#E15759"),labels =c("Men", "Women")) +scale_fill_manual(values =c("male"="#4E79A7", "female"="#E15759"),labels =c("Men", "Women")) +labs(title ="Predicted Income by Union Status and Sex",subtitle ="Parallel lines = same union premium for both groups; non-parallel = interaction",x ="Union Member (0 = No, 1 = Yes)",y ="Predicted Income ($)",color ="Sex",fill ="Sex" ) +theme_minimal()
Note
Reading the margins plot: The two lines show how predicted income changes from non-union to union status for men and women separately. If the lines are parallel, the union premium is identical for both groups. If the lines have different slopes (one steeper than the other), the union premium differs by sex — this is the visual signature of an interaction.
Question 14. Do the lines in the margins plot appear parallel or non-parallel? Does the size of the union wage boost look different for men vs. women? Does this match the significance of the interaction term in your table?