Week 14

Sociology 106: Quantitative Sociological Methods

April 21, 2026

Agenda

Housekeeping:

  • Final paper, HW #11 (canceled), In-class presentations

Statistical content — three parts:

  • Part 1: Mediation vs. moderation — getting the concepts straight
  • Part 2: Mediation analysis — path models, Baron & Kenny, R implementation
  • Part 3: Multicollinearity — when predictors are too similar

Housekeeping

Weekly Assignment #11

  • HW 11 is canceled so that you have time to work on your final paper
  • You will still be able to drop your lowest score among the 10 assignments

In-class presentations

  • Tuesday, April 28 and Tuesday, May 5: 7–10 minutes per student
  • Slide template on the course site; structure: question → data → results → so what
  • Practice out loud at least once before class

Final paper

  • Due Thursday, May 7
  • Should include a regression model that answers your research question
  • Should follow the formatting described in the final paper outline

Where We Are in the Course

  • Week 11: OLS regression — estimating linear relationships
  • Week 12: Logistic regression — predicting binary outcomes
  • Week 13: Interaction effects — for whom does X affect Y?
  • This week (Week 14): Mediation — how or why does X affect Y?

The arc of the course: from “Is there a relationship?” to “What kind?”

Each week we’ve added a layer of nuance. This week we ask not just whether X predicts Y, but through what mechanism — and we tackle a practical issue (multicollinearity) you’ll likely encounter writing up your papers.

Questions?

Part 1: Getting the Concepts Straight

Moderation vs. Mediation — the distinction that trips everyone up

Two Questions, Two Models

Aspect | Moderation | Mediation
Question | For whom does X affect Y? | How/why does X affect Y?
Third variable | A moderator: changes the slope of X→Y | A mediator: carries the effect of X to Y
Path structure | X → Y, with slope depending on Z | X → M → Y
Regression term | Interaction: X * Z | Sequential models: Y ~ X, then Y ~ X + M
What you see | Different slopes for different groups | X effect shrinks when M is added

Mediation story:

Does education raise income because it leads to higher-prestige jobs?

\[\text{Education} \rightarrow \underbrace{\text{Job Prestige}}_{\text{mediator } M} \rightarrow \text{Income}\]

The mediator explains the mechanism.

Moderation story (Week 13):

Does education raise income more for men than for women?

\[\text{Education} \xrightarrow{\text{slope depends on Sex}} \text{Income}\]

The moderator changes who benefits.

Memory trick

Mediation → the mechanism (Mediator = Mechanism). Moderation → the modifier (Moderator = Modifier of the slope).

Mistake 1: Controlling for a mediator

If you add job prestige as a control in a model predicting income from education, you block the very path you’re trying to study. The coefficient on education will tell you only the direct effect — missing the part that works through prestige.

→ Week 13 warned: don’t control for mediators. Today we learn what to do instead.

Mistake 2: Calling a mediator a moderator

A mediator is on the causal path. A moderator is off the path, conditioning how strong it is.

  • “Prestige carries the education→income effect” → mediator
  • “Sex changes how strong the education→income effect is” → moderator

The test: where is the third variable?

Draw the arrows. If the third variable sits between X and Y (X → M → Y), it’s a mediator. If it sits beside the X→Y arrow, changing its slope, it’s a moderator.
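
The two stories can be sketched on simulated data. This is only an illustration: the variables x, m, z and every effect size below are made up, not taken from the course dataset. In the mediation world the coefficient on x shrinks once m enters the model; in the moderation world the slope of x differs between the two groups of z.

```r
set.seed(106)
n <- 2000

# Mediation world: x moves m, and m moves y (x -> m -> y)
x     <- rnorm(n)
m     <- 0.6 * x + rnorm(n)
y_med <- 0.3 * x + 0.5 * m + rnorm(n)

total  <- coef(lm(y_med ~ x))["x"]       # x -> y, ignoring m
direct <- coef(lm(y_med ~ x + m))["x"]   # x -> y, holding m fixed

# Moderation world: the slope of x depends on group z (z sits beside the arrow)
z     <- rbinom(n, 1, 0.5)
y_mod <- 0.2 * x + 0.6 * x * z + rnorm(n)

fit_mod  <- lm(y_mod ~ x * z)
slope_z0 <- coef(fit_mod)["x"]                          # slope when z = 0
slope_z1 <- coef(fit_mod)["x"] + coef(fit_mod)["x:z"]   # slope when z = 1

round(c(total = unname(total), direct = unname(direct)), 2)
round(c(slope_z0 = unname(slope_z0), slope_z1 = unname(slope_z1)), 2)
```

The first pair of numbers shows the mediation signature (direct smaller than total); the second pair shows the moderation signature (two different slopes).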

Part 2: Mediation Analysis

How and why does X affect Y?

The Three-Effect Framework

\[\underbrace{c'}_{\text{total}} = \underbrace{c}_{\text{direct}} + \underbrace{a \times b}_{\text{indirect}}\]

  • Total effect (c’): the overall X → Y relationship (before adding M)
  • Direct effect (c): X → Y after controlling for M
  • Indirect effect (a × b): the part of X’s effect that travels through M

The a, b, c Paths Explained

Does education predict job prestige?

model_a <- lm(prestg80 ~ educ, data = attain_med)
Path a: Education → Prestige
Term | Estimate (SE)
(Intercept) | 12.25 (1.40)
Education (yrs) | 2.34 (0.10)
Num.Obs. | 2525
R2 | 0.172

Each additional year of education is associated with 2.34 more prestige points.

Does prestige predict income, controlling for education?

model_b <- lm(income91 ~ educ + prestg80, data = attain_med)
Paths b & c: Educ + Prestige → Income
Term | Estimate (SE)
(Intercept) | -12704 (3501)
Education (yrs) | 3357 (278)
Job Prestige | 225 (49)
Num.Obs. | 2525
R2 | 0.092

Ignoring prestige — what’s the raw education effect?

model_total <- lm(income91 ~ educ, data = attain_med)
Total Effect: Education → Income
Term | Estimate (SE)
(Intercept) | -9947 (3463)
Education (yrs) | 3884 (254)
Num.Obs. | 2525
R2 | 0.085

Total effect: $3,884/year of education.

Putting it together

  • Total effect: $3,884 per year of educ
  • Direct effect: $3,357 per year of educ
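  • Indirect effect (a × b): 2.34 × $225 ≈ $527 per year of educ

A year of education is associated with 2.34 more prestige points, and each prestige point with $225 more income. The indirect path therefore accounts for about $527 of the $3,884 total, so about 14% ($527 / $3,884) of the education effect on income runs through job prestige.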
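
The arithmetic can be checked directly from the coefficients reported in the tables above. This is a small sketch that only re-uses the rounded slide values; the variable names are chosen here for illustration. Because the displayed coefficients are rounded, the product and the difference agree only to within about a dollar.

```r
# Rounded coefficients copied from the three model tables
a      <- 2.34   # path a: educ -> prestige
b      <- 225    # path b: prestige -> income, holding educ fixed
total  <- 3884   # educ -> income, prestige not in the model
direct <- 3357   # educ -> income, holding prestige fixed

indirect_product    <- a * b            # product-of-coefficients version
indirect_difference <- total - direct   # difference-in-coefficients version
prop_mediated       <- indirect_difference / total

c(product    = indirect_product,
  difference = indirect_difference,
  proportion = round(prop_mediated, 3))
```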

Why not just subtract?

You can estimate the indirect effect as c’ − c. In OLS models fit to the same sample, the a × b product gives the same number, and it generalizes better (especially with bootstrapped confidence intervals).
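
A quick way to convince yourself that the two versions agree is to fit the three regressions on simulated data and compare. The data and effect sizes below are invented for the demonstration; the equality itself holds for ordinary least squares whenever all three models use the same rows.

```r
set.seed(14)
n <- 500
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.4 * x + 0.7 * m + rnorm(n)

a      <- coef(lm(m ~ x))["x"]    # x -> m
fit_y  <- lm(y ~ x + m)
b      <- coef(fit_y)["m"]        # m -> y, holding x fixed
direct <- coef(fit_y)["x"]        # x -> y, holding m fixed
total  <- coef(lm(y ~ x))["x"]    # x -> y, m left out

# Difference and product match up to floating-point error
all.equal(unname(total - direct), unname(a * b))
```

If the models are fit to different subsets (for example, because of missing values on M), the two numbers can drift apart, which is one reason to fix the analysis sample first.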

Full vs. Partial Mediation

Full mediation

The direct effect c drops to (near) zero when M enters the model.

M fully explains why X affects Y.

The indirect path is the whole story.

Partial mediation

The direct effect c decreases but remains significant.

M explains part of the mechanism; something else also connects X directly to Y.

Both paths matter.

In practice: partial mediation is the norm

Full mediation is rare in observational social science. Expect the direct effect to shrink — but not disappear. That’s still an important substantive finding.

Baron & Kenny Four Steps

The Baron & Kenny (1986) approach tests mediation in four steps, using three regressions and one comparison:

Step | Model | What you’re checking
1 | lm(Y ~ X) | X significantly predicts Y (there’s something to explain)
2 | lm(M ~ X) | X significantly predicts M (X moves the mediator)
3 | lm(Y ~ X + M) | M significantly predicts Y, controlling for X
4 | Compare Step 1 vs. Step 3 | X→Y coefficient decreases when M is added

The logic

If X moves M (Step 2), and M moves Y independently of X (Step 3), then some of what looks like a direct X→Y effect is really X working through M. Step 4 quantifies how much.

# Step 1: Does education predict income?
step1 <- lm(income91 ~ educ,            data = attain_med)

# Step 2: Does education predict prestige?
step2 <- lm(prestg80 ~ educ,            data = attain_med)

# Step 3 & 4: Does prestige predict income? Does education effect shrink?
step3 <- lm(income91 ~ educ + prestg80, data = attain_med)
Baron & Kenny Steps 1–4
Term | Step 1: Y~X | Step 2: M~X | Step 3: Y~X+M
(Intercept) | -9947** (3463) | 12.25*** (1.40) | -12704*** (3501)
Education (X) | 3884*** (254) | 2.34*** (0.10) | 3357*** (278)
Job Prestige (M) | | | 225*** (49)
Num.Obs. | 2525 | 2525 | 2525
R2 | 0.085 | 0.172 | 0.092
Standard errors in parentheses. + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

What the table shows:

  • Step 1: Education → Income: +$3,884 per year ✓ (Step 1 met)
  • Step 2: Education → Prestige: +2.34 pts per year ✓ (Step 2 met)
  • Step 3: Prestige → Income (controlling educ): +$225 per prestige point ✓ (Step 3 met)
  • Step 4: Education effect on income drops from $3,884 → $3,357 when prestige enters ✓

Verdict: partial mediation

Education still has a direct effect on income, but part of its total effect works through occupational prestige. Both the direct and indirect paths matter.

Baron & Kenny tells you whether mediation exists, but not how much or how confident you should be in the indirect effect.

The indirect effect (a × b) needs its own standard error and confidence interval — and the distribution of a product of two estimates is not normal, so we can’t use a simple z-test.

Solution: bootstrapping (re-sample the data many times, compute a × b each time, use the percentiles as the CI).
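
The bootstrap idea can be sketched in a few lines of base R. This is a hand-rolled percentile bootstrap on simulated data, meant to show the logic rather than to replace a dedicated package; the data frame dat, the helper indirect(), and the effect sizes are all hypothetical.

```r
set.seed(2026)
n   <- 400
dat <- data.frame(x = rnorm(n))
dat$m <- 0.5 * dat$x + rnorm(n)
dat$y <- 0.4 * dat$x + 0.6 * dat$m + rnorm(n)

# Indirect effect a * b for any data frame with columns x, m, y
indirect <- function(d) {
  a <- coef(lm(m ~ x, data = d))["x"]
  b <- coef(lm(y ~ x + m, data = d))["m"]
  unname(a * b)
}

# Resample rows with replacement and recompute a * b each time
n_boot  <- 1000
boot_ab <- replicate(n_boot, {
  rows <- sample(nrow(dat), replace = TRUE)
  indirect(dat[rows, ])
})

point_est <- indirect(dat)
ci        <- quantile(boot_ab, c(0.025, 0.975))   # percentile 95% CI

point_est
ci
```

If the interval excludes zero, the indirect effect is distinguishable from zero at roughly the 5% level. Nothing here assumes the product a × b is normally distributed.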

Modern practice

The mediation package in R implements bootstrapped confidence intervals for the indirect effect (ACME = Average Causal Mediation Effect). Bootstrapped intervals of this kind are now routinely reported in published sociology papers.

Indirect Effects with mediation

library(mediation)

# Fit the two models first
model_m <- lm(prestg80 ~ educ,            data = attain_med)
model_y <- lm(income91 ~ educ + prestg80, data = attain_med)

# Run mediation with bootstrap
med_out <- mediate(
  model.m  = model_m,
  model.y  = model_y,
  treat    = "educ",       # X variable
  mediator = "prestg80",   # M variable
  boot     = TRUE,
  sims     = 1000          # bootstrap draws
)

summary(med_out)

Reading the output

  • ACME = Average Causal Mediation Effect = indirect effect (a × b)
  • ADE = Average Direct Effect = direct effect (c)
  • Total Effect = ACME + ADE (c’)
  • Prop. Mediated = ACME / Total Effect

Our results (from a run with 500 bootstrap draws; the code above sets sims = 1000):

Effect | Estimate | 95% CI
Indirect (ACME) | $527 | [$329, $747]
Direct (ADE) | $3,357 |
Total | $3,884 |
% Mediated | 14% |

Interpretation

About 14% of education’s effect on income is mediated by occupational prestige. The indirect path (education → prestige → income) is statistically distinguishable from zero (the CI excludes 0). But a substantial direct effect remains.

Visualizing the Indirect Effect

Causal Limitations

Mediation analysis tests a causal claim — that X works through M — but with observational data, we can never fully prove it.

Three threats to causal mediation:

Threat | What it means
No temporal ordering | We assume X precedes M, which precedes Y. With cross-sectional data, we can’t verify this.
M→Y confounding | A third variable may cause both prestige and income, creating a spurious M→Y association.
X→M→Y vs. X←M→Y | Without experiments, the arrow could point the other way (prestige affects educational attainment).

The gold standard

The cleanest mediation evidence comes from randomized experiments where X is randomly assigned. Then you know X precedes M, and nothing confounds X. The mediator itself is still not randomized, so M→Y confounding can remain even then. Experiments are rare in sociology, so we work with what we have, acknowledge the limits, and make the theoretical argument carefully.

If you’re running a mediation analysis:

  1. Fit the three models (B&K Steps 1–3)
  2. Report coefficients for all three — show the reader the direct, indirect, and total effects
  3. Use mediate() with boot = TRUE for the indirect effect CI
  4. State your theoretical argument for why M mediates X→Y
  5. Acknowledge the cross-sectional limitation — you can’t prove causal mediation, but you can show the pattern is consistent with it

Language template:

“Consistent with a mediation hypothesis, the coefficient on [X] declined from [total] to [direct] when [M] was included (ACME = [value], 95% CI [lo, hi]). This suggests that approximately [%] of the association between [X] and [Y] may operate through [M], though cross-sectional data preclude strong causal claims about the mediating mechanism.”

If you wanted stronger causal mediation evidence with observational data, you would:

  • Use panel data (before/after measurements) to verify temporal ordering
  • Control for all plausible M→Y confounders explicitly
  • Use sensitivity analysis (medsens() in the mediation package) to test how robust the indirect effect is to unmeasured M→Y confounding

For this course

For your final papers: fit the three models, bootstrap the indirect effect, state the caveats. That’s the sociological standard for cross-sectional mediation claims.

Part 3: Multicollinearity

When your predictors are too alike

What Is Multicollinearity?

Multicollinearity occurs when two or more predictors in a regression are highly correlated with each other.

Why it’s a problem:

When X₁ and X₂ are highly correlated:

  • The model can’t tell which one is doing the work
  • Coefficient estimates become unstable: tiny data changes → large coefficient changes
  • Standard errors inflate → wide confidence intervals → everything looks non-significant
  • Coefficients can flip sign or become implausibly large

The core issue

Multicollinearity doesn’t bias your coefficients — it just makes them imprecise. With enough data, standard errors shrink. But in typical survey samples, high multicollinearity can completely obscure real effects.

Detecting Multicollinearity: VIF

The Variance Inflation Factor (VIF) measures how much each predictor’s variance is inflated by correlation with the others.

\[\text{VIF}_j = \frac{1}{1 - R^2_j}\]

where \(R^2_j\) is the R² from regressing predictor \(j\) on all other predictors.
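
The formula can be applied by hand, which makes clear what a VIF function is reporting. The sketch below uses simulated predictors (x1, x2, x3 are invented names): it runs one auxiliary regression per predictor, then cross-checks the result against the diagonal of the inverse correlation matrix, which is an equivalent way to get the same numbers.

```r
set.seed(5)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # built to overlap heavily with x1
x3 <- rnorm(n)                        # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_by_hand <- sapply(names(X), function(j) {
  others <- setdiff(names(X), j)
  aux    <- lm(reformulate(others, response = j), data = X)
  1 / (1 - summary(aux)$r.squared)
})

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(X)))

round(vif_by_hand, 2)
round(vif_from_cor, 2)
```

The overlapping pair (x1, x2) gets clearly elevated VIFs, while the unrelated x3 sits near 1.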

Rules of thumb:

VIF | Interpretation
1 | No multicollinearity
1–5 | Low: acceptable
5–10 | Moderate: investigate
> 10 | Severe: take action

library(car)

# Fit model with education, prestige, and age
model_check <- lm(
  income91 ~ educ + prestg80 + age,
  data = attain_med
)

vif(model_check)
    educ prestg80      age 
1.291837 1.228938 1.069693 

What Multicollinearity Looks Like

The takeaway

The point estimates are approximately the same in both conditions — multicollinearity does not bias coefficients. What it does is inflate the standard errors: the confidence intervals are much wider when x₁ and x₂ are highly correlated. You’re more likely to miss a real effect (Type II error), but the estimates themselves are not systematically wrong.
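
A small simulation reproduces this pattern. It is an illustrative sketch under assumed values (true coefficient of 1 on each predictor, n = 500, correlations of 0.10 and 0.95 chosen for contrast); sim_fit() is a helper defined here, not part of any package.

```r
set.seed(42)
n <- 500

# Fit y ~ x1 + x2 where cor(x1, x2) is approximately rho,
# and return the estimate and standard error for x1
sim_fit <- function(rho) {
  x1 <- rnorm(n)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
  y  <- 1 * x1 + 1 * x2 + rnorm(n)
  summary(lm(y ~ x1 + x2))$coefficients["x1", c("Estimate", "Std. Error")]
}

low  <- sim_fit(0.10)   # nearly uncorrelated predictors
high <- sim_fit(0.95)   # highly correlated predictors

round(rbind(low = low, high = high), 3)
```

Both estimates should land near the true value of 1, but the standard error in the high-correlation row is several times larger, which is the precision loss described above.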

Is Multicollinearity Actually a Problem?

It depends on what you’re trying to do — and which variables are collinear.

Goal | Does multicollinearity matter?
Prediction | Usually no: collinear models predict just as well; R² is unaffected
Inference on a control variable | Often no: imprecise control coefficients don’t invalidate your key result
Inference on your variable of interest | Yes: if your key X is the collinear one, SEs inflate and you may miss a real effect
Understanding relative importance of two predictors | Yes: you can’t disentangle their individual contributions

The core issue: precision, not accuracy

Multicollinearity does not bias your coefficients — the estimates are still correct on average. What it does is make them imprecise: standard errors inflate, confidence intervals widen, and t-statistics shrink. The main consequence is more Type II errors — failing to detect real effects that are actually there.

Type II error = failing to reject H₀ when it’s actually false (a “false negative”).

Here’s the chain of consequences:

\[\underbrace{\text{High multicollinearity}}_{\text{predictors too similar}} \Rightarrow \underbrace{\uparrow \text{SE}}_{\text{wider uncertainty}} \Rightarrow \underbrace{\downarrow t\text{-statistic}}_{\frac{\hat\beta}{SE}} \Rightarrow \underbrace{\uparrow p\text{-value}}_{\text{harder to reject }H_0} \Rightarrow \underbrace{\text{Type II error}}_{\text{miss a real effect}}\]

Example:

Suppose education truly does affect income controlling for prestige (direct effect = $800/year). With low multicollinearity between educ and prestige, SE = 200, t = 4.0, p < 0.001. With high multicollinearity (r = 0.92), SE = 1100, t = 0.73, p = 0.47 — we’d conclude “no direct effect” even though one exists.

The effect is real. The data just can’t tell the two predictors apart.
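
The t-statistics and p-values in this example follow from simple arithmetic. The sketch below recomputes them from the stated coefficient and standard errors, using the normal approximation for the p-value (an assumption that is reasonable for survey-sized samples).

```r
beta    <- 800    # the true direct effect from the example
se_low  <- 200    # standard error with low multicollinearity
se_high <- 1100   # standard error with high multicollinearity

t_low  <- beta / se_low
t_high <- beta / se_high

# Two-sided p-values under the normal approximation
p_low  <- 2 * pnorm(-abs(t_low))
p_high <- 2 * pnorm(-abs(t_high))

round(c(t_low = t_low, t_high = t_high), 2)
c(p_low = signif(p_low, 2), p_high = round(p_high, 2))
```

The coefficient is identical in both scenarios; only the denominator of the t-statistic changes.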

The practical implication

If you run a multivariate model and a variable you expected to be significant is not — and its VIF is high — consider whether multicollinearity is masking the effect. This is not grounds to drop the variable; it’s grounds to note the limitation.

Multicollinearity is not just about two variables being correlated. Any set of predictors can jointly cause it, even if no two are highly correlated in isolation.

Example 1: Age, work experience, and tenure

  • Age and years of work experience: r ≈ 0.75 (moderate)
  • Age and job tenure: r ≈ 0.55 (moderate)
  • Experience and tenure: r ≈ 0.60 (moderate)
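  • Together they overlap more than any single pair shows, so each VIF is higher than its largest pairwise correlation alone implies. With exactly these three correlations the VIFs stay below 3; stronger overlap or additional related predictors can push them past 5

No single pairwise correlation screams “problem,” but the three together share more information than a pairwise check reveals.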

Example 2: Education, degree, and test score

  • Education in years, highest degree attained, and a cognitive ability test score
  • All three measure roughly the same underlying “human capital” concept
  • Including all three → VIFs can rise well beyond what any one pairwise correlation suggests, with unstable coefficients and possibly sign flips

This is why you check VIF on the fitted model, not just pairwise correlations.

# Always check VIF on the full model, not just cor()
library(car)
vif(your_model)   # catches joint multicollinearity that cor() misses
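
Joint multicollinearity is easy to produce in a simulation. In this sketch (invented variables, base R only) x3 is almost the sum of x1 and x2. No pairwise correlation reaches 0.8, yet every VIF is large because each predictor is nearly determined by the other two together. The VIFs are computed from the inverse correlation matrix so the example needs no extra packages.

```r
set.seed(7)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.3)   # nearly a sum of the other two
X  <- cbind(x1, x2, x3)

pairwise <- cor(X)
round(pairwise, 2)        # largest off-diagonal value is moderate

vifs <- diag(solve(pairwise))
round(vifs, 1)            # but every VIF is large
```

A pairwise screen would pass this set of predictors; the VIF check would not.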

The best news: collinear controls often don’t threaten your main conclusion.

If your research question is “Does education affect income?” and education has a low VIF, then:

  • The coefficient on education is precisely estimated ✓
  • Its SE and p-value are unaffected by multicollinearity among the controls ✓
  • You can still make your main inference confidently ✓

The collinear controls (say, age and experience) just have imprecise coefficients — but you probably don’t care about their individual effects.

Example from our data:

# education is our key IV; age and prestige are controls
model_key <- lm(income91 ~ educ + prestg80 + age, data = attain_med)
vif(model_key)
    educ prestg80      age 
1.291837 1.228938 1.069693 

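In this output every VIF is below 1.3, so there is nothing to worry about. Even in a hypothetical case where educ had VIF ≈ 2 while prestg80 and age had VIF ≈ 4, that would be fine: our inference about education would be largely unaffected.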

Rule of thumb for your papers

Check VIF on all predictors. If the variable of interest has low VIF (< 5), report it and move on. If only controls have elevated VIF, note it briefly but don’t panic — your main finding is still valid.

What to Do About Multicollinearity

Step 1: Check the correlation matrix

attain_med |>
  dplyr::select(educ, prestg80, age) |>
  cor(use = "complete.obs") |>
  round(2)
          educ prestg80   age
educ      1.00     0.42 -0.22
prestg80  0.42     1.00  0.02
age      -0.22     0.02  1.00

Step 2: Check VIF after fitting the model

model_check <- lm(income91 ~ educ + prestg80 + age, data = attain_med)
vif(model_check)
    educ prestg80      age 
1.291837 1.228938 1.069693 

If VIF < 5 for all predictors → you’re fine. Continue.

Problem | Solution
Including education and degree (two measures of same concept) | Drop one; keep the theoretically central one
Including age and birth year | These are perfectly collinear, so pick one
Two highly correlated controls | Drop the one less theoretically important
Several items measuring the same underlying concept | Combine into an index (average or sum the items)
Conceptually distinct variables that happen to correlate | Keep both, note the limitation, get more data

Creating an index when variables measure the same concept:

When several correlated predictors tap the same underlying idea (e.g., multiple survey items about socioeconomic status, or hours worked across multiple jobs), collapsing them into a single composite reduces multicollinearity and often improves interpretability.

library(dplyr)

# Example: respondent works multiple jobs, so add up total hours
# (pick() selects columns inside mutate(); "." is not defined with the native pipe)
attain <- attain |>
  mutate(hrs_total = rowSums(pick(hrs1, hrs2), na.rm = TRUE))

# Example: standardise and average two SES-related items into one index
# (as.numeric() drops the matrix attributes that scale() returns)
attain <- attain |>
  mutate(
    educ_z    = as.numeric(scale(educ)),
    prestg_z  = as.numeric(scale(prestg80)),
    ses_index = (educ_z + prestg_z) / 2   # simple equal-weight composite
  )

Index variables also carry a substantive interpretation: you’re saying these items measure one construct, not two separate concepts. Make that theoretical argument in your paper.

What NOT to do

Don’t use regularization (ridge/LASSO) without understanding it. Don’t drop variables that are theoretically important just because of high VIF. Report the issue and argue for your choice.

Common multicollinearity problems in student papers:

  1. Including both education (years) and degree → highly correlated (r ≈ 0.80). Pick one.
  2. Including age and years of work experience → age − 18 ≈ work experience for many. Drop one.
  3. Including income and earnings from multiple sources → summing them and using total is usually better.

In your write-up:

# Always report VIF after your multivariate model
library(car)
vif(your_final_model)

A single line: “VIF values ranged from X to Y (all < 5), indicating no problematic multicollinearity.”

Finishing Strong

Getting your final paper across the finish line

What Your Final Paper Needs

Required components:

Component | What to include
Bivariate model | lm(Y ~ X): your key IV predicting Y
Multivariate model | Add at least 2 controls; discuss why they matter
Extension | At least one of: interaction, mediation, or logistic (if binary Y)
Assumption check | Residual plot + VIF (a paragraph is sufficient)
Model table | modelsummary() comparing models side-by-side

library(modelsummary)

# The pattern for a final paper model table
modelsummary(
  list(
    "Bivariate"     = model_biv,
    "Multivariate"  = model_multi,
    "With Interaction" = model_interact    # or model with mediation
  ),
  stars  = TRUE,
  vcov   = "HC1",           # robust SEs — use these throughout
  gof_map = c("nobs", "r.squared", "adj.r.squared")
)

Don’t just report the number — tell the story:

Instead of… | Try…
“The coefficient on education is 1847.” | “Each additional year of education is associated with $1,847 higher income, holding sex and age constant.”
“The interaction term is significant.” | “The education–income association is steeper for men than women: each year of education yields approximately $[gap] more income for men.”
“ACME = 624, p < 0.05.” | “About [X]% of education’s effect on income appears to operate through occupational prestige.”

Lead with substance

Sociological audiences want to know: what does this mean for inequality / stratification / the social world? Start with the finding, then give the number.

Every quantitative paper needs a limitations section. Be specific, not generic.

Generic (avoid): “This study has some limitations. The data may not be perfectly representative.”

Specific (better): “Because the GSS uses cross-sectional data, we cannot establish the temporal ordering required for causal inference. Specifically, the mediation claim (that education affects income through occupational prestige) requires that education precede prestige, which we assume but cannot verify. Additionally, unmeasured confounders of the prestige–income relationship (e.g., parental occupational networks) may bias the indirect effect estimate.”

The goal of limitations

Not to undermine your argument — but to show the reader you understand the boundaries of your evidence. A specific limitation paragraph signals methodological maturity.

A coefficient plot shows each predictor’s estimated effect and its confidence interval as a dot-and-whisker. It lets readers immediately see which effects are large, which are small, and which cross zero (non-significant).

Use a coefficient plot in your presentation instead of the full regression table — tables are hard to read aloud. Use the table in your paper.

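library(broom)     # tidy()
library(dplyr)     # filter(), mutate(), case_match()
library(ggplot2)   # ggplot() and geoms
library(forcats)   # fct_reorder()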

model_coefplot <- lm(income91 ~ educ + prestg80 + sex_f + age,
                     data = attain_med)

tidy(model_coefplot, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(term = case_match(term,
    "educ"        ~ "Education (yrs)",
    "prestg80"    ~ "Job Prestige",
    "sex_ffemale" ~ "Female",
    "age"         ~ "Age"
  )) |>
  ggplot(aes(x = estimate, y = fct_reorder(term, estimate))) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
                 height = 0.2, linewidth = 1.1, color = "#4E79A7") +
  geom_point(size = 3.5, color = "#4E79A7") +
  scale_x_continuous(labels = scales::dollar_format()) +
  labs(x = "Effect on income (95% CI)", y = NULL,
       title = "Predictors of income — coefficient plot") +
  theme_minimal(base_size = 12)

Reading the plot

A dot to the right of zero means a positive effect; left means negative. If the whisker (CI) crosses the dashed line at zero, the effect is not statistically significant at the 5% level. The distance from zero shows effect size; the width of the whisker shows precision.

Key Takeaways

Mediation

  • Asks how/why X affects Y — through a mediator M on the causal path
  • Three models: total effect (c’), X→M (a path), X+M→Y (b and c paths)
  • Indirect effect = a × b; use mediation::mediate() for bootstrapped CIs
  • Full mediation = direct effect disappears; partial = it shrinks
  • Causal claims require strong theoretical justification with cross-sectional data

Multicollinearity

  • Occurs when predictors are highly correlated → inflated SEs, unstable coefficients
  • Diagnose with car::vif(): VIF > 5 warrants attention, > 10 is severe
  • Fix: drop redundant predictors, keep the theoretically central one
  • Always report VIF in your paper (one sentence is enough)

Final paper

  • Bivariate + multivariate + one extension (interaction, mediation, or logistic)
  • Use modelsummary() with vcov = "HC1" (robust SEs) throughout
  • Lead with substance: what does the finding mean sociologically?
  • A specific limitations paragraph signals methodological maturity

Why This Matters

Mediation is how sociology asks mechanism questions:

  • Does education reduce poverty because it raises job quality? → mediation
  • Does neighborhood segregation affect health through differential access to healthcare? → mediation
  • Does discrimination reduce earnings through occupational sorting? → mediation

These aren’t just statistical questions — they’re questions about how social structures reproduce themselves. Knowing the mechanism tells us where to intervene.

The big picture

This course has moved from “is there a relationship?” (correlations, chi-squared) → “how strong is it?” (OLS) → “is it real?” (controls, confounding) → “for whom?” (interactions) → “how does it work?” (mediation). That’s the arc of causal thinking in social science.

Questions?

Assignments

  • HW 11 is canceled so that you have time to work on your final paper
  • You will still be able to drop your lowest score among the 10 assignments

Final paper: due Thursday, May 7

Final paper components

  1. Introduction and research question
  2. Data and methods (describe attain.csv or your dataset, key variables, analytic strategy)
  3. Results: bivariate model → multivariate model → extension (interaction or mediation)
  4. Discussion: what do the results mean? connect to literature
  5. Limitations: cross-sectional data, measurement issues, omitted variables
  6. Model table using modelsummary() with robust SEs
  7. At least one visualization (scatter + regression line, coefficient plot, or marginal effects plot)

In-class presentations: Tuesday, April 28 and Tuesday, May 5

  • 7–10 minutes: question → data → results → so what
  • Don’t show the full model table to the class — use a coefficient plot or highlight 2–3 key numbers