Week 11

Sociology 106: Quantitative Sociological Methods

March 31, 2026

Agenda

Housekeeping:

HW #8, HW #9, Revised Proposal with Outline

Statistical content — four parts:

Part 1: Outcome variables and key predictors — what are we trying to explain?
Part 2: From correlation to regression — estimating the magnitude of a linear relationship
Part 3: Regression with categorical predictors — dummy variables and baseline categories
Part 4: OLS assumptions and multiple regression

In-class lab:

Applying OLS regression to GSS data

Housekeeping

Weekly Assignment #8

Due Thursday, April 2 — hypothesis testing from Week 9

Weekly Assignment #9

Due Thursday, April 9
Run an OLS regression on your research dataset — today’s lecture covers everything you need
Use modelsummary() for a clean regression table (covered today)
If using a categorical predictor, use fct_relevel() to set a baseline category (covered today)

Revised Proposal with Outline

Due Thursday, April 16
Should now describe the statistical techniques you plan to use — including regression!
Come to office hours if you need help connecting your research question to a regression model

Where We Are in the Course

Week 9: Hypothesis testing — is there a relationship, or is it just chance?
This week (Week 11): Linear regression — how strong is the relationship, and in what direction?
Week 12: Logistic regression — predicting binary outcomes
Week 13: Extensions — multiple controls, more complex models

The big picture: from “yes or no” to “how much?”

In Week 9 we answered: “Is there a statistically significant relationship?” Today’s tool, regression, gives a quantitative answer: “For every one-unit increase in X, how much does Y change on average?”

Part 1: Outcome Variables and Key Predictors

What are we trying to explain, and with what?

Two Variables, Two Roles

Every regression analysis has variables playing two distinct roles:

The outcome variable — what you are trying to explain or predict

Also called: dependent variable (DV), response variable, Y
This is the phenomenon your research question is about – the thing that is affected

The predictor — what you are using to explain the outcome

Also called: key independent variable (key IV), explanatory variable, X
This is the main theoretical variable your hypothesis is about – the thing that does the affecting

These labels reflect your theory, not your data:

Calling one variable the “outcome” and another the “predictor” encodes your theoretical argument about which variable explains which. Regression is a tool for formalizing that argument quantitatively.

Language Matters

You’ll encounter many terms for the same concept — here’s a cheat sheet:

Term	What it means	Also called
Outcome variable	What you are trying to explain	Dependent variable (DV), Y, response variable
Key predictor	Your main explanatory variable	Key IV, key independent variable, X
Control variable	Additional IV included to rule out alternatives	Covariate

In your research paper:

Your paper proposal asked you to identify your key independent and dependent variables. These map directly onto key predictor and outcome variable in regression. Today you learn to formally estimate the relationship between them.

Setting Up Your Research Question

Before running any regression, be explicit about your variables and expectations:

Step 1 — Research Question
Step 2 — Hypothesis
Step 3 — Visualize First

Your research question should name both roles:

“How does occupational prestige impact household income?”

Outcome variable (Y): income91 — household income, in dollars — what we’re explaining
Key predictor (X): prestg80 — occupational prestige score — our main theoretical variable

Why income as the outcome?

We have theoretical reasons to believe occupational status drives income (not the reverse). The direction of influence determines which variable goes on which side of the regression equation.

Write out your hypothesis before looking at the data:

\[H_0: \beta_{\text{prestige}} = 0 \quad \text{(prestige has no effect on income)}\]

\[H_1: \beta_{\text{prestige}} > 0 \quad \text{(higher prestige → higher income)}\]

This is a one-tailed test — theory strongly predicts the direction (positive).

In most cases, we’ll default to two-tailed unless we have strong prior theoretical reason for directionality.

Always plot your variables before running a regression:

ggplot(attain_reg, aes(x = prestg80, y = income91)) +
  geom_point(alpha = 0.2, color = "gray50", size = 1.2) +
  geom_smooth(method = "lm", se = TRUE, color = "#4E79A7",
              fill = "#4E79A7", alpha = 0.15) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(title = "Occupational Prestige and Household Income (GSS, 1991)",
       x = "Occupational Prestige Score (Key Predictor)",
       y = "Household Income — $ (Outcome)") +
  theme_minimal(base_size = 11)

Questions?

Part 2: From Correlation to Regression

From “is there a relationship?” to “how strong is it?”

What Correlation Tells Us — and What It Doesn’t

A correlation coefficient (r) describes direction and strength:

r	Strength
r ≈ 0.3	Weak
r ≈ 0.5	Moderate
r ≈ 0.8	Strong

But correlation cannot answer:

How many dollars does income increase per one-point gain in occupational prestige?
What is the predicted income for someone with a prestige score of 50?
Is this relationship significant after controlling for sex?

The key limitation:

Two datasets can have the same correlation (r = 0.82) but completely different slopes — different magnitudes, different units, different practical meanings.

Regression gives us the slope — the actual quantitative size of the relationship in the units of Y per unit of X.
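To make the limitation concrete, here is a minimal sketch using made-up toy numbers (not GSS data). Rescaling the outcome leaves the correlation untouched but multiplies the slope, so r alone cannot tell you how many units of Y go with one unit of X. The variable names are illustrative assumptions.

```r
# Toy data (hypothetical values, not from the GSS)
x     <- c(1, 2, 3, 4, 5)
y     <- c(2, 4, 5, 4, 5)
y_big <- 10 * y              # same pattern, outcome measured in units ten times larger

# The correlation is identical in both cases ...
r_small <- cor(x, y)
r_big   <- cor(x, y_big)

# ... but the OLS slopes differ by a factor of ten (0.6 vs. 6)
slope_small <- unname(coef(lm(y ~ x))["x"])
slope_big   <- unname(coef(lm(y_big ~ x))["x"])

print(c(r_small = r_small, r_big = r_big))
print(c(slope_small = slope_small, slope_big = slope_big))
```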

The Linear Regression Model

For any value $x$ of the key predictor, the expected value of the outcome is:

\[E[Y \mid x] = \alpha + \beta x\]

Symbol	Name	Meaning
$E[Y \mid x]$	Conditional mean	Expected (average) value of outcome $Y$ when $X = x$
$\alpha$	Intercept	Predicted value of $Y$ when $X = 0$ — the baseline anchor
$\beta$	Slope	Average change in $Y$ for a one-unit increase in $X$

In practice, we estimate from sample data:

\[y_i = a + bx_i + e_i\]

$a$ = estimated intercept (sample estimate of $\alpha$)
$b$ = estimated slope (sample estimate of $\beta$) — this is the key number
$e_i$ = residual for observation $i$ — the gap between observed $y_i$ and predicted $\hat{y}_i$

OLS (ordinary least squares) chooses $a$ and $b$ to minimize:

\[\text{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - bx_i)^2\]

Why square? Prevents cancellation, penalizes large misses, and produces a unique closed-form solution.
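The closed-form solution can be checked by hand. The sketch below uses five made-up data points (an assumption for illustration, not GSS data), computes $a$ and $b$ from the standard one-predictor formulas, and confirms that lm() returns the same values and that nudging the slope away from the OLS value makes the RSS larger.

```r
# Toy data (hypothetical values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Closed-form OLS solution for a single predictor
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

# Residual sum of squares for any candidate line (a, b)
rss <- function(a, b) sum((y - a - b * x)^2)
rss_ols   <- rss(a, b)         # RSS at the OLS solution
rss_other <- rss(a, b + 0.1)   # RSS with a slightly different slope

fit <- lm(y ~ x)
print(c(a = a, b = b))
print(coef(fit))
print(c(rss_ols = rss_ols, rss_other = rss_other))
```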

Observed vs. Predicted Values

Note

Residual $e_i = y_i - \hat{y}_i$: how far each observation is from the line. OLS minimizes the sum of squared residuals — hence “ordinary least squares.”

The lm() function runs OLS regression:

lm(outcome ~ key_predictor, data = your_data)

The variable on the left of ~ is always the outcome (Y)
The variable on the right is the key predictor (X)

Our example:

# Filter to complete cases
attain_reg <- attain |> filter(!is.na(prestg80), !is.na(income91))

# Fit the OLS model — outcome ~ key_predictor
model_biv <- lm(income91 ~ prestg80, data = attain_reg)

summary(model_biv)


Call:
lm(formula = income91 ~ prestg80, data = attain_reg)

Residuals:
   Min     1Q Median     3Q    Max 
-53305 -24664 -10902  12726 390467 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 21273.31    2134.09   9.968 <0.0000000000000002 ***
prestg80      472.13      45.94  10.277 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 37680 on 2525 degrees of freedom
Multiple R-squared:  0.04015,   Adjusted R-squared:  0.03977 
F-statistic: 105.6 on 1 and 2525 DF,  p-value: < 0.00000000000000022

modelsummary() from the modelsummary package produces a cleaner, publication-ready table — use this in hw9:

modelsummary(model_biv,
             coef_rename = c("(Intercept)" = "Intercept",
                             "prestg80"    = "Prestige Score"),
             stars   = TRUE,
             title   = "OLS: Household Income ~ Occupational Prestige",
             gof_map = c("nobs", "r.squared", "adj.r.squared"))

OLS: Household Income ~ Occupational Prestige
	(1)
Intercept	21273.309***
	(2134.090)
Prestige Score	472.133***
	(45.943)
Num.Obs.	2527
R2	0.040
R2 Adj.	0.040
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

What each row means:

Intercept: Predicted income when prestige = 0. Not substantively meaningful for prestige (no one has a prestige score of 0) — just the mathematical anchor for the line.

Prestige Score (slope): For each additional point in occupational prestige, household income is predicted to be about $472 higher, on average. This is your key result.

Standard error: How precisely we’ve estimated the slope. Smaller SE → more confidence.

Stars / p-value: Is the slope significantly different from zero? Here, $p < 0.001$ — we strongly reject $H_0: \beta = 0$.

R²: Occupational prestige explains about 4% of the variation in household income. Report and interpret this in hw9.

Coefficients and Hypothesis Tests

Every OLS coefficient has its own built-in hypothesis test:

\[t = \frac{b}{SE(b)}, \quad \text{compared to } t\text{-distribution with } df = n - k - 1\]

This is identical to Week 9’s t-test logic; the only thing new is what we’re testing.

# Each column has a precise meaning:
#   Estimate    = the coefficient (b)
#   Std. Error  = SE(b) — precision of the estimate
#   t value     = Estimate / Std. Error  ← same t-test from Week 9!
#   Pr(>|t|)    = two-tailed p-value for H0: beta = 0
summary(model_biv)$coefficients

              Estimate Std. Error   t value                        Pr(>|t|)
(Intercept) 21273.3090 2134.08993  9.968328 0.00000000000000000000005543108
prestg80      472.1334   45.94284 10.276539 0.00000000000000000000000268663

Connection to Week 9:

When $p < 0.05$ for a coefficient, we reject $H_0: \beta = 0$, which is evidence that the predictor is related to the outcome in the population. The coefficient tells us the direction and magnitude of that relationship; the p-value tells us how likely an estimate at least this far from zero would be if the true slope were zero.

In this example: $p < 0.001$ for the prestige coefficient, so we reject $H_0: \beta_{\text{prestige}} = 0$ — occupational prestige has a statistically significant effect on household income.
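As a quick check on the regression output, the t-statistic and p-values can be rebuilt from the reported estimate and standard error. This sketch copies the slope, SE, and degrees of freedom from the summary(model_biv) output shown earlier; the one-tailed line corresponds to the directional hypothesis $H_1: \beta > 0$ from Part 1.

```r
# Values copied from the summary(model_biv) output
b  <- 472.1334    # estimated slope for prestg80
se <- 45.94284    # its standard error
df <- 2525        # residual degrees of freedom: n - k - 1 = 2527 - 1 - 1

t_stat <- b / se

# Two-tailed p-value, the quantity reported in the Pr(>|t|) column
p_two <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# One-tailed p-value for H1: beta > 0
p_one <- pt(t_stat, df = df, lower.tail = FALSE)

print(t_stat)
print(c(p_two = p_two, p_one = p_one))
```

When the estimate has the hypothesized sign, the one-tailed p-value is half the two-tailed one.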

Interpreting Coefficients

The estimated regression equation:

\[\widehat{\text{income}} = 21,273 + 472 \times \text{prestige}\]

The slope ($b = 472$):

For every one-unit increase in occupational prestige, predicted household income increases by $472
So, a person who moves from a prestige score of 30 to 60 (+30 points) is predicted to earn $14,160 more per year (30 × $472)
The positive sign confirms a positive relationship — higher prestige = higher income

The intercept ($a = 21,273$):

Predicted income when prestige = 0 is $21,273 — not really meaningful (prestige scores are well above 0)
This is just a mathematical anchor. We almost never interpret it substantively.
The intercept matters for drawing the line; the slope is the theoretically meaningful number.
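The estimated equation can be used directly for prediction. The sketch below plugs the rounded coefficients into a small helper function (predict_income is a hypothetical name for illustration, not part of any package) and reproduces the 30-point gap discussed above.

```r
# Rounded coefficients from the estimated regression equation
a <- 21273   # intercept
b <- 472     # slope: dollars per prestige point

# Hypothetical helper for illustration only
predict_income <- function(prestige) a + b * prestige

pred <- predict_income(c(30, 50, 60))
names(pred) <- c("prestige_30", "prestige_50", "prestige_60")
print(pred)

# The gap between prestige 60 and prestige 30 is 30 * slope
gap <- predict_income(60) - predict_income(30)
print(gap)
```

With a fitted model object, predict(model_biv, newdata = ...) does the same job using the unrounded coefficients.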

Always report magnitude, not just significance:

A p-value tells you whether $b \neq 0$. The coefficient itself tells you how large and meaningful the relationship is. Report both.

What Is R²?

R² (R-squared) is the coefficient of determination — the proportion of variation in the outcome that is explained by the model:

\[R^2 = \frac{\text{variation in } Y \text{ explained by model}}{\text{total variation in } Y} \in [0, 1]\]

R² value	Interpretation
R² = 0.00	The model explains none of the variation in Y
R² = 0.04	Prestige explains ~4% of the variation in income
R² = 0.25	The model accounts for 25% of variation — strong for social science
R² = 1.00	The model perfectly predicts Y (never in real social science data)

What counts as a “good” R²?

It depends entirely on context. In social science, R² of 0.05–0.30 is common and meaningful. A low R² does not mean your predictor is unimportant — it means many other factors also shape the outcome. Always discuss what R² implies about how much your key predictor explains vs. how much remains unexplained.
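R² can also be computed by hand, which helps make the definition less abstract. This is a minimal sketch on made-up toy data (not the GSS), using the equivalent form $R^2 = 1 - \text{RSS}/\text{TSS}$; in a bivariate model it also equals the squared correlation.

```r
# Toy data (hypothetical values)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
fit <- lm(y ~ x)

tss <- sum((y - mean(y))^2)   # total variation in Y
rss <- sum(resid(fit)^2)      # variation left unexplained by the model
r2_by_hand <- 1 - rss / tss

print(r2_by_hand)
print(summary(fit)$r.squared)
print(cor(x, y)^2)
```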

Visualizing Regression Results

Two visualizations worth including in hw9:

Scatter + Regression Line
Coefficient Plot

Use geom_smooth(method = "lm") to overlay the regression line on a scatterplot:

ggplot(attain_reg, aes(x = prestg80, y = income91)) +
  geom_point(alpha = 0.2, color = "gray50", size = 1) +
  geom_smooth(method = "lm", se = TRUE,
              color = "#4E79A7", fill = "#4E79A7", alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(title = "Effect of Occupational Prestige on Household Income",
       x = "Occupational Prestige Score", y = "Household Income ($)") +
  theme_minimal(base_size = 11)

modelplot() from modelsummary shows coefficients with 95% confidence intervals. If the CI does not cross 0, the coefficient is significant at $\alpha = 0.05$:

modelplot(model_biv, coef_omit = "Intercept") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "#E15759") +
  labs(title = "OLS Coefficient: Effect of Prestige on Income",
       x = "Estimated Coefficient ($ per prestige point)") +
  theme_minimal(base_size = 11)
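The interval drawn in a coefficient plot can be approximated by hand. This sketch assumes the usual form $b \pm t^{*} \times SE(b)$ and uses the slope, SE, and degrees of freedom reported for model_biv; the exact method a plotting function uses may differ slightly.

```r
# Values copied from the summary(model_biv) output
b  <- 472.1334
se <- 45.94284
df <- 2525

t_crit <- qt(0.975, df = df)   # critical value, close to 1.96 with this many df
ci <- c(lower = b - t_crit * se, upper = b + t_crit * se)
print(ci)

# TRUE when the interval lies entirely on one side of zero
excludes_zero <- ci["lower"] > 0 || ci["upper"] < 0
print(unname(excludes_zero))
```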

Questions?

Part 3: Regression with Categorical Predictors

When your key predictor has categories

Categorical Variables: The Challenge

So far our key predictor has been continuous (occupational prestige). What if the key predictor is categorical — like educational degree, sex, or race?

The challenge: We can’t multiply a category label by a number. Regression needs numbers.

The solution: Replace a categorical variable with a set of dummy (indicator) variables — one for each category except the baseline.

We can see groups differ — but how much? And is the difference significant? Regression tells us.

Dummy Variables and Baseline Categories

For a categorical variable with k categories, we create k − 1 dummy variables. The omitted category is the baseline (reference) group — all comparisons are made against it.

Example: Degree (5 categories) → 4 dummies (baseline = lt high [less than high school]):

Degree	Dummy variable R creates	Predicted income
lt high (baseline)	(omitted — absorbed into intercept)	$\alpha$
High School	`degree_fhigh sch` = 1, all others = 0	$\alpha + b_1$
Junior College	`degree_fjunior c` = 1, all others = 0	$\alpha + b_2$
Bachelor’s	`degree_fbachelor` = 1, all others = 0	$\alpha + b_3$
Graduate	`degree_fgraduate` = 1, all others = 0	$\alpha + b_4$

Each coefficient ($b_1, b_2, \ldots$) = predicted income above or below the baseline group.

The baseline’s predicted income = the intercept α. Every other coefficient = income above or below that baseline.
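To see the dummy variables R builds behind the scenes, model.matrix() can be applied to a factor. The sketch below uses a made-up six-person dataset with only three degree levels (an assumption to keep the output short); the first factor level is treated as the baseline and gets no column of its own.

```r
# Hypothetical mini-dataset: six respondents, three degree levels
degree <- factor(
  c("lt high", "lt high", "high sch", "high sch", "bachelor", "bachelor"),
  levels = c("lt high", "high sch", "bachelor")   # first level = baseline
)

# model.matrix() shows the columns lm() would actually use
X <- model.matrix(~ degree)
print(X)

# k = 3 categories -> intercept + (k - 1) = 2 dummy columns
print(colnames(X))
```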

Choosing the baseline:

Pick the category that makes theoretical sense as a reference point
Common choices: lowest category, largest group, or the “control” condition
In R: fct_relevel(var, "category_name") (from forcats, loaded with tidyverse)

Setting Up Categorical IVs in R

Step 1: Check Levels
Step 2: fct_relevel
Step 3: Run + Display

Before setting a baseline, see what categories exist:

# What categories does degree have?
table(attain$degree)


bachelor graduate high sch junior c  lt high 
     497      216     1586      176      507

The first value alphabetically becomes the default baseline — we usually want to choose explicitly.

attain_cat <- attain |>
  filter(!is.na(degree), !is.na(income91)) |>
  mutate(
    # Specify all levels in educational order (lowest → highest)
    degree_f = fct_relevel(degree, "lt high", "high sch", "junior c", "bachelor", "graduate")
  )

# Verify: levels listed in educational order, first = baseline
levels(attain_cat$degree_f)

[1] "lt high"  "high sch" "junior c" "bachelor" "graduate"

Choosing the baseline category:

Always choose a category that makes theoretical sense as a reference point
Common strategy: use the group that represents “no treatment” or the lowest level — e.g., less than high school when testing whether higher degrees predict income
Avoid making the smallest group the baseline — comparisons to a tiny group are imprecise
Example: if your hypothesis is about the disadvantage of not having a college degree, set "bachelor" as the baseline so the coefficients show the penalty for each lower degree level

Baseline in our example:

"lt high" (less than high school) is the baseline. Every other coefficient shows how much more income people with that degree earn compared to respondents without a high school diploma.

model_cat <- lm(income91 ~ degree_f, data = attain_cat)

modelsummary(model_cat,
             coef_rename = c("(Intercept)"        = "Intercept",
                             "degree_fhigh sch"   = "High School",
                             "degree_fjunior c"   = "Junior College",
                             "degree_fbachelor"   = "Bachelor's Degree",
                             "degree_fgraduate"   = "Graduate Degree"),
             stars   = TRUE,
             title   = "OLS: Income ~ Degree (baseline = lt high)",
             gof_map = c("nobs", "r.squared"))

OLS: Income ~ Degree (baseline = lt high)
	(1)
Intercept	23181.028***
	(1765.437)
High School	14360.115***
	(2014.784)
Junior College	24106.743***
	(3357.712)
Bachelor's Degree	34134.344***
	(2459.053)
Graduate Degree	40821.525***
	(3156.575)
Num.Obs.	2632
R2	0.099
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Interpreting Categorical Coefficients

The Equation
How to Read
Predicted Values
Visualization

With degree as the key predictor (baseline = lt high [less than HS]):

\[\widehat{\text{income}} = \alpha + b_{\text{hs}} X_{\text{hs}} + b_{\text{jc}} X_{\text{jc}} + b_{\text{ba}} X_{\text{ba}} + b_{\text{grad}} X_{\text{grad}}\]

Each $X$ is a dummy variable (0/1) for that degree level; each coefficient = the income difference relative to the baseline group.

Reading categorical coefficients:

Intercept ($\alpha$): Predicted income for someone with less than a high school degree (the baseline). This is interpretable — it’s the baseline group’s average income.

Coefficient on High School ($b_{\text{hs}}$): On average, having a high school diploma (vs. less than HS) is associated with about $14,360 more in household income.

Coefficient on Graduate Degree ($b_{\text{grad}}$): On average, having a graduate degree (vs. less than HS) is associated with about $40,822 more in household income.

Key insight: Regression with a categorical IV is equivalent to comparing group means — but regression gives us p-values for each comparison and allows us to control for other variables.
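The equivalence with group means can be verified directly. This sketch uses made-up incomes for six hypothetical respondents (not GSS values): the intercept equals the baseline group's mean, and each coefficient equals that group's mean minus the baseline mean.

```r
# Hypothetical mini-dataset
degree <- factor(
  c("lt high", "lt high", "high sch", "high sch", "bachelor", "bachelor"),
  levels = c("lt high", "high sch", "bachelor")
)
income <- c(20000, 24000, 35000, 39000, 55000, 61000)

fit   <- lm(income ~ degree)
means <- tapply(income, degree, mean)   # group means by degree

print(coef(fit))                  # intercept + two differences
print(means)                      # raw group means
print(means - means["lt high"])   # differences from the baseline mean
```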

# Predicted income for each degree level — ordered lowest to highest
tibble(
  Degree   = c("Less than HS (baseline)", "High School", "Junior College",
               "Bachelor's Degree", "Graduate Degree"),
  degree_f = factor(
    c("lt high", "high sch", "junior c", "bachelor", "graduate"),
    levels = levels(attain_cat$degree_f)
  )
) |>
  mutate(Predicted_Income = scales::dollar(
    round(predict(model_cat, newdata = pick(degree_f)), 0)
  )) |>
  select(Degree, Predicted_Income)

# A tibble: 5 × 2
  Degree                  Predicted_Income
  <chr>                   <chr>           
1 Less than HS (baseline) $23,181         
2 High School             $37,541         
3 Junior College          $47,288         
4 Bachelor's Degree       $57,315         
5 Graduate Degree         $64,003

# Rename and reverse order so lowest degree appears at top (matching table above)
modelplot(
  model_cat,
  coef_omit = "Intercept",
  coef_rename = c(
    "degree_fhigh sch" = "High School",
    "degree_fjunior c" = "Junior College",
    "degree_fbachelor" = "Bachelor's Degree",
    "degree_fgraduate" = "Graduate Degree"
  )
) +
  scale_y_discrete(limits = rev) +   # lowest degree (HS) at top
  geom_vline(xintercept = 0, linetype = "dashed", color = "#E15759") +
  labs(
    title = "Income Gaps by Degree (vs. Less Than HS baseline)",
    x = "Estimated Income Difference ($)"
  ) +
  theme_minimal(base_size = 10)

Questions?

Part 4: OLS Assumptions

When can we trust our estimates?

The Four OLS Assumptions

1. Independence
2. Linearity
3. Normal Errors
4. Homoskedasticity

Observations are independent — typically satisfied by random sampling.

Violated when observations are clustered (e.g., students within schools)
Fix: use clustered standard errors; survey weights address unequal selection probabilities rather than clustering itself

# If your data have survey weights:
lm(outcome ~ predictor, data = data, weights = wt_var)

There is a linear relationship between the continuous IV and the outcome.

Always check this with a scatter plot of Y vs X before running the regression.

The errors are approximately normally distributed — most important in small samples. In large samples, violations are usually not a big problem (Central Limit Theorem applies).

What we did in this class:

The raw income91 variable is right-skewed (common with income data), which produces non-normal residuals in the bottom-left panel. Running log(income91) as the outcome — model_log above — produces the more symmetric residuals in the bottom-right panel. This is a standard fix: when your outcome is right-skewed, log-transforming it often corrects the non-normality of errors.
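The log-transform fix can be sketched on simulated data. Everything below is an assumption for illustration: the data are generated rather than taken from the GSS, the name model_log is reused only to echo the text, and skewness() is a small hand-written helper, not a package function.

```r
set.seed(106)  # arbitrary seed so the sketch is reproducible

# Simulated data with a right-skewed, strictly positive outcome
n        <- 500
prestige <- runif(n, min = 20, max = 80)
income   <- exp(9.5 + 0.015 * prestige + rnorm(n, sd = 0.8))

model_raw <- lm(income ~ prestige)
model_log <- lm(log(income) ~ prestige)   # log() requires income > 0

# Simple moment-based skewness of the residuals (0 = symmetric)
skewness <- function(e) mean((e - mean(e))^3) / sd(e)^3

skew_raw <- skewness(resid(model_raw))
skew_log <- skewness(resid(model_log))
print(c(raw = skew_raw, log = skew_log))
```

Note that the log model changes the units of the outcome, so its coefficients are no longer in dollars.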

The spread of residuals is constant across all values of X (equal variance).

Violations bias standard errors — important for t-tests. Fix: robust standard errors.

Checking Assumptions Visually

A residuals vs. fitted plot diagnoses both linearity and homoskedasticity at once:

What to look for — and what our model shows:

Linearity: Is the blue loess line approximately flat (horizontal)? In our model: mostly flat with slight curvature — no major concern.

Homoskedasticity: Is the vertical spread of points roughly constant across the x-axis? In our model: the spread is roughly similar across fitted values, though slightly wider at high incomes — common with income data.

Overall, our prestige → income model does not show dramatic assumption violations, but some heteroskedasticity is plausible. Using robust SEs with vcov = "HC1" (shown next) is a common precaution.
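A residuals vs. fitted plot of this kind can be sketched in a few lines. The example below uses simulated data in which the error spread grows with x (an assumption chosen to make heteroskedasticity visible) and base R graphics instead of ggplot2 so that it runs without extra packages.

```r
set.seed(106)  # arbitrary seed for reproducibility

# Simulated data: the error standard deviation increases with x
n <- 300
x <- runif(n, 20, 80)
y <- 20000 + 450 * x + rnorm(n, sd = 100 * x)

fit <- lm(y ~ x)

# Residuals on the vertical axis, fitted values on the horizontal axis
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted (simulated data)",
     col = "gray50", pch = 16)
abline(h = 0, lty = 2, col = "#E15759")                            # reference line at zero
lines(lowess(fitted(fit), resid(fit)), col = "#4E79A7", lwd = 2)   # smoothed trend
```

A roughly flat smoothed line suggests linearity is reasonable; a fan shape in the points suggests heteroskedasticity.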

Robust Standard Errors

If you suspect heteroskedasticity, use robust standard errors — they correct the SEs without changing the coefficient estimates:

The Problem
In R: modelsummary

Heteroskedasticity does not bias the coefficient estimates ($a$, $b$)
But it does bias the classical standard errors, which affects t-statistics and p-values
Robust SEs drop the constant-variance assumption; they can come out larger or smaller than the classical SEs

The easiest approach is to compare standard vs. robust SEs side-by-side using modelsummary():

modelsummary(
  list("Standard SEs" = model_biv, "Robust (HC1) SEs" = model_biv),
  vcov = list("classical", "HC1"),
  coef_rename = c("(Intercept)" = "Intercept",
                  "prestg80"    = "Prestige Score"),
  stars = TRUE,
  gof_map = c("nobs", "r.squared"),
  title = "Standard vs. Robust Standard Errors"
)

Standard vs. Robust Standard Errors
	Standard SEs	Robust (HC1) SEs
Intercept	21273.309***	21273.309***
	(2134.090)	(1947.377)
Prestige Score	472.133***	472.133***
	(45.943)	(44.781)
Num.Obs.	2527	2527
R2	0.040	0.040
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
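For the curious, here is a sketch of what an HC1 adjustment computes, written in base R on simulated data (not the GSS). It assumes the standard sandwich form $(X'X)^{-1} X' \text{diag}(e_i^2) X (X'X)^{-1}$ scaled by $n/(n-k)$, where $k$ counts all estimated coefficients. In practice you would let modelsummary() or a dedicated package do this; the point is only that the coefficients stay the same while the SEs are recomputed.

```r
set.seed(106)  # arbitrary seed for reproducibility

# Simulated heteroskedastic data
n <- 200
x <- runif(n, 20, 80)
y <- 20000 + 450 * x + rnorm(n, sd = 100 * x)

fit <- lm(y ~ x)
X   <- model.matrix(fit)   # design matrix (intercept column + x)
e   <- resid(fit)
k   <- ncol(X)             # number of estimated coefficients

bread  <- solve(crossprod(X))   # (X'X)^-1
meat   <- crossprod(X * e)      # X' diag(e^2) X: each row of X scaled by its residual
vc_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

se_classical <- sqrt(diag(vcov(fit)))
se_hc1       <- sqrt(diag(vc_hc1))

print(cbind(estimate = coef(fit), se_classical, se_hc1))
```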

What Do Control Variables Do?

When we add a control variable to a regression, we are partitioning the variation in the outcome:

Without a control:

\[\text{income} = a + b_1\text{prestige} + e\]

The coefficient $b_1$ captures all the ways prestige is associated with income — including variation that is really due to sex (because men and women differ in both prestige and income).

With a control for sex:

\[\text{income} = a + b_1\text{prestige} + b_2\text{sex} + e\]

Now $b_1$ captures the association between prestige and income after filtering out any variation that could be due to sex differences. We are comparing people who are identical on the control variable (same sex) and asking: does prestige still predict income?

The key intuition:

A control variable “filters away” variation in the outcome that is explained by that control. The coefficient on the key predictor then reflects only the variation that can’t be explained by the control variables. This is what “holding constant” means statistically.
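The “filtering away” idea can be demonstrated literally. In the sketch below (simulated data; all variable names and parameter values are assumptions), we first remove the part of income and the part of prestige that sex explains, then regress the leftover income on the leftover prestige. The resulting slope matches the prestige coefficient from the model that includes sex as a control.

```r
set.seed(106)  # arbitrary seed for reproducibility

# Simulated data: sex is related to both prestige and income
n        <- 400
female   <- rbinom(n, 1, 0.5)
prestige <- 45 - 3 * female + rnorm(n, sd = 12)
income   <- 24000 + 470 * prestige - 4700 * female + rnorm(n, sd = 30000)

# Slope on prestige from the model that controls for sex
b_full <- coef(lm(income ~ prestige + female))["prestige"]

# Filter sex out of both variables, then regress residual on residual
income_resid   <- resid(lm(income ~ female))
prestige_resid <- resid(lm(prestige ~ female))
b_partial <- coef(lm(income_resid ~ prestige_resid))["prestige_resid"]

print(unname(c(b_full, b_partial)))
```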

Multiple Regression

We can include multiple predictors — this is multiple regression. We’ll cover this in depth in Week 13.

R Code
Comparing Models
Interpretation

attain_multi <- attain |>
  filter(!is.na(prestg80), !is.na(income91), !is.na(sex)) |>
  mutate(sex_f = fct_relevel(sex, "male"))

model_biv2   <- lm(income91 ~ prestg80,          data = attain_multi)
model_multi  <- lm(income91 ~ prestg80 + sex_f,   data = attain_multi)

modelsummary(
  list("Bivariate" = model_biv2, "Multiple" = model_multi),
  coef_rename = c("(Intercept)" = "Intercept",
                  "prestg80"    = "Prestige Score",
                  "sex_ffemale" = "Female"),
  stars   = TRUE,
  title   = "Prestige and Income: Bivariate vs. Multiple",
  gof_map = c("nobs", "r.squared")
)

Prestige and Income: Bivariate vs. Multiple
	Bivariate	Multiple
Intercept	21273.309***	23949.643***
	(2134.090)	(2295.876)
Prestige Score	472.133***	469.766***
	(45.943)	(45.869)
Female		−4702.028**
		(1503.645)
Num.Obs.	2527	2527
R2	0.040	0.044
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Reading the multiple regression table:

Coefficient on Prestige Score (bivariate → multiple): $472 → $470 per prestige point. After controlling for sex, the prestige slope changes slightly — this shift tells us how much of the bivariate prestige effect was actually due to sex differences in prestige.

Coefficient on Female: −$4,702. Women report about $4,702 less household income per year than men at the same prestige level. This is the gender income gap net of occupational status.

R² increase: 4% (bivariate) → 4.4% (multiple regression). Adding sex explains additional variation in income beyond prestige alone — but most variation in income remains unexplained by these two predictors.

In Week 13, we’ll cover how to decide which controls to include, and what changes in coefficients across models mean theoretically.

Reading Research: Thompson & Keith (2001)

Seeing today’s tools published in a classic sociological study

“The Blacker the Berry” — Overview

Thompson, M.S. & Keith, V.M. (2001). The Blacker the Berry: Gender, Skin Tone, Self-Esteem, and Self-Efficacy. Gender & Society, 15(3), 336–357.

Research question: Does skin tone predict psychological well-being — specifically self-esteem and self-efficacy — for Black Americans? And does this relationship differ by gender?

Motivated by colorism: the documented pattern in which lighter-skinned Black individuals receive preferential treatment in U.S. society
“The blacker the berry, the sweeter the juice” is a common idiom that celebrates dark skin tones in Black individuals — serving as a counter-narrative to colorism. The authors ask whether this affirmation holds for Black Americans’ own self-perception

Why this paper?

This is a model of how to use OLS regression to answer a sociological question. The authors carefully choose their outcome variable, identify a clear key predictor, include theoretically motivated controls, and interpret the substantive meaning of their coefficients.

Research Design

Data & Sample
Variables
Regression Model

Data: National Survey of Black Americans (NSBA), 1979–1980
Sample: 2,107 Black American adults; analyses conducted separately for men (n ≈ 581) and women (n ≈ 1,035) who are employed
Why separate models? The authors hypothesize that the relationship between skin tone and self-esteem differs by gender. Running separate regressions for men and women is similar in spirit to including an interaction term (which we’ll cover in Week 13), but the authors chose this approach for simplicity — because they test multiple interactions across several controls, splitting the sample is more tractable than modeling every interaction explicitly

Role	Variable	Measurement
Outcome (DV) 1	Self-esteem	Rosenberg 10-item scale (1–4; higher = more positive)
Outcome (DV) 2	Self-efficacy	Personal efficacy scale (1–4; higher = more control)
Key predictor	Skin tone	Interviewer-rated 5-point scale: 1 = very dark to 5 = very light brown
Controls	Education, income, employment, age, marital status, family origin	Standard socioeconomic controls

Self-esteem vs. self-efficacy — what’s the difference?

Self-esteem is a person’s overall sense of their own value or worth — “Do I feel good about myself?” It reflects how positively someone regards themselves as a person.

Self-efficacy is a person’s belief in their ability to control outcomes in their own life — “Can I make things happen?” It reflects how much agency or personal power someone feels they have.

These are related but distinct psychological constructs. The authors study both because colorism may affect not just how you feel about yourself (self-esteem) but also your sense of whether your efforts can shape your future (self-efficacy).

The authors estimate (separately for men and women):

\[\text{self-esteem} = \alpha + \beta_1\text{skin tone} + \beta_2\text{educ} + \beta_3\text{income} + \cdots + e\]

Connecting to today:

Outcome variable: self-esteem or self-efficacy (continuous DV)
Key predictor: skin tone (treated as continuous — 5-point scale)
Controls: education, income, age, etc., “filtered away” so the skin tone coefficient reflects the association between skin tone and the outcome net of SES differences

Key Results

Table from Paper
For Women
For Men
The Big Finding

Simplified version of Tables 2 & 3 from Thompson & Keith (2001). Skin tone scale: 1 = very dark, 5 = very light — a higher score means lighter skin, so a positive coefficient indicates lighter skin → higher outcome.

	Women		Men
Variable	Self-Esteem	Self-Efficacy	Self-Esteem	Self-Efficacy
Skin tone	.187*	.029	.088	.208†
Education	.053†	.127***	.036	.126***
Income	.025	.054***	.019	.014
Age	.034***	.033***	.019**	.030***
Employed	.292	−.093	.449†	−.378
N	1,036	1,036	647	647
R²	.092	.096	.072	.093

†p ≤ .10. *p ≤ .05. **p ≤ .01. ***p ≤ .001. Coefficients are unstandardized OLS estimates from fully adjusted models (Model 4). Adapted from Thompson & Keith (2001), Tables 2 & 3.

The key result:

Lighter skin tone (higher score) significantly predicts higher self-esteem for Black women (.187*), but not for men, and not for self-efficacy once SES is controlled — colorism operates through gender and is outcome-specific.

The skin tone coefficient tells a split story across the two outcomes:

Self-esteem: coefficient = .187* ($p < 0.05$) — lighter skin (higher score) → higher self-esteem. A Black woman rated 4 (lighter) is predicted to have meaningfully higher self-esteem than a Black woman rated 2 (darker), even after controlling for education, income, and other SES factors
Self-efficacy: coefficient = .029 (not significant) — once SES is controlled, skin tone does not significantly predict self-efficacy for women

Interpretation: Colorism shapes how Black women feel about their worth (self-esteem) but not their sense of personal agency (self-efficacy) in the fully adjusted model.
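A rough worked example of the size of the women's self-esteem coefficient, assuming the simplified table's coefficient applies per point of the 5-point skin tone scale and holding the controls fixed: comparing a rating of 4 with a rating of 2 gives

\[\Delta\,\widehat{\text{self-esteem}} = b_{\text{skin tone}} \times (4 - 2) = 0.187 \times 2 = 0.374\]

That is, a predicted difference of about 0.37 points on the 1–4 self-esteem scale.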

The skin tone coefficients are weaker and largely non-significant for men:

Self-esteem: coefficient = .088 (not significant) — skin tone does not predict self-esteem for Black men
Self-efficacy: coefficient = .208† ($p ≤ .10$, borderline) — a marginal suggestion that lighter skin is associated with higher self-efficacy, but this falls short of conventional significance

Why the difference? The authors argue that beauty ideals — which prize lighter skin — operate much more strongly as a social evaluation system for women than for men. For men, occupational and economic status matter more than appearance-based judgments.

Colorism operates through gender — and is outcome-specific:

For Black women, lighter skin predicts higher self-esteem (.187*) even after controlling for SES — but skin tone does not significantly predict self-efficacy. For Black men, skin tone has no significant effect on self-esteem, and only a marginal effect on self-efficacy (.208†). This finding demonstrates that colorism is not just about race — it is deeply shaped by gender, and its psychological effects depend on which dimension of self-evaluation we examine.

The magnitude matters here: the authors don’t just report that the relationship is significant — they discuss whether the coefficient size is large enough to be practically meaningful. This is exactly the “statistical vs. substantive significance” distinction we covered today.

Connecting Thompson & Keith to Today

Today’s concept	In Thompson & Keith
Outcome variable	Self-esteem, self-efficacy
Key predictor	Skin tone (the main theoretical variable)
Control variables	Education, income, age — “filtered away”
OLS regression	`lm(selfesteem ~ skintone + educ + income + ...)`
Coefficient interpretation	Each one-point increase in skin tone → $b$-unit change in self-esteem
Statistical vs. substantive significance	Is the effect real AND meaningful?
Separate models by group	Week 13 — interaction terms capture this more formally
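The "separate models by group" row can be sketched in R as follows. This is an illustration on simulated data, not a reproduction of Thompson & Keith's analysis: the variable names `selfesteem`, `skintone`, `educ`, and `income` follow the formula in the table, `gender` is a hypothetical grouping variable, and every number is made up.

```r
# Simulated data for illustration only; these numbers are made up and do not
# reproduce Thompson & Keith's sample or results.
set.seed(106)
n <- 400
toy <- data.frame(
  gender   = rep(c("women", "men"), each = n / 2),  # hypothetical grouping variable
  skintone = sample(1:5, n, replace = TRUE),
  educ     = sample(8:20, n, replace = TRUE),
  income   = round(rnorm(n, mean = 40, sd = 10))    # in $1,000s
)
toy$selfesteem <- 2 + 0.1 * toy$skintone + 0.05 * toy$educ + rnorm(n)

# One OLS model per group, with the same outcome, key predictor, and controls.
fit_women <- lm(selfesteem ~ skintone + educ + income,
                data = subset(toy, gender == "women"))
fit_men   <- lm(selfesteem ~ skintone + educ + income,
                data = subset(toy, gender == "men"))

# Compare the key predictor's row (estimate, SE, t, p) across the two models.
rbind(
  women = coef(summary(fit_women))["skintone", ],
  men   = coef(summary(fit_men))["skintone", ]
)
```

If modelsummary is installed, `modelsummary(list(Women = fit_women, Men = fit_men))` places the two models side by side in one table.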

For your research paper:

Thompson & Keith is a model for how to write up regression results in a paper. Notice: they state the research question clearly, justify their variable choices, report coefficients with standard errors, discuss magnitude, and connect the statistical results back to the sociological theory.

Questions?

Key Takeaways

Part 1: Variables & OLS

The outcome variable (Y) is what you are trying to explain, also called the dependent variable
The key predictor (X) is the main theoretical variable, also called the key independent variable
OLS estimates the slope ($b$): average change in outcome per one-unit increase in the predictor
OLS minimizes the sum of squared residuals; that’s where “least squares” comes from
Every coefficient has its own t-test: $t = b / SE(b)$, same logic as Week 9
R² is the proportion of variation in Y explained by the model; always report and interpret it

Connection to Week 9:

Each regression coefficient has its own null hypothesis ($H_0: \beta = 0$) and its own t-test and p-value. The inference logic from Week 9 applies directly; the only thing new is the shape of the model.

Part 2: Categorical IVs & Assumptions

Use fct_relevel(var, "baseline") to set the comparison group
Each dummy variable coefficient = difference from the baseline in the outcome variable
The intercept = predicted outcome for the baseline group (interpretable!)
Always state your baseline category when reporting results
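The baseline rules above can be checked directly against group means. This is a minimal sketch on simulated data; `degree` and `hours` are hypothetical variable names, and base R's `relevel()` stands in for `fct_relevel()` only if forcats is not installed.

```r
# Simulated data for illustration only (hypothetical variable names).
set.seed(11)
toy <- data.frame(
  degree = factor(sample(c("high school", "bachelor", "graduate"),
                         300, replace = TRUE))
)
toy$hours <- 40 + rnorm(300, sd = 5)   # outcome unrelated to degree by construction

# Set "high school" as the baseline. forcats::fct_relevel() is the function
# named in lecture; relevel() is a base R fallback if forcats is absent.
if (requireNamespace("forcats", quietly = TRUE)) {
  toy$degree <- forcats::fct_relevel(toy$degree, "high school")
} else {
  toy$degree <- relevel(toy$degree, ref = "high school")
}

fit <- lm(hours ~ degree, data = toy)
coef(fit)

# Check the two interpretive rules against the group means.
group_means <- tapply(toy$hours, toy$degree, mean)
group_means["high school"]                            # equals the intercept
group_means["bachelor"] - group_means["high school"]  # equals the degreebachelor coefficient
```

With a single categorical predictor and no controls, these equalities hold exactly; once other predictors are added, the coefficients become adjusted differences rather than raw differences in means.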

Assumption	How to check	Fix if violated
Independence	Study design	Survey weights, clustered SEs
Linearity	Scatter plot Y vs X	Transform variables (e.g., log)
Normal errors	Histogram of residuals	Usually OK in large samples
Homoskedasticity	Residuals vs. fitted plot	Robust standard errors (`vcov = "HC1"`)
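The checks in the table can be sketched with base R plots. This is an illustration on simulated data with hypothetical variable names (`educ`, `income`); the robust standard errors use `sandwich::vcovHC()`, which runs only if the sandwich package is installed.

```r
# Simulated data for illustration only (hypothetical variable names).
set.seed(2026)
toy <- data.frame(educ = sample(8:20, 250, replace = TRUE))
toy$income <- 10 + 2 * toy$educ + rnorm(250, sd = 6)

fit <- lm(income ~ educ, data = toy)

# Linearity: scatter plot of Y against X with the fitted line.
plot(toy$educ, toy$income, xlab = "educ", ylab = "income")
abline(fit)

# Normal errors: histogram of the residuals.
hist(resid(fit), main = "Residuals", xlab = "Residual")

# Homoskedasticity: residuals vs. fitted values; look for a fan or funnel shape.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Robust (HC1) standard errors, if the sandwich package is installed.
if (requireNamespace("sandwich", quietly = TRUE)) {
  robust_se <- sqrt(diag(sandwich::vcovHC(fit, type = "HC1")))
  print(cbind(estimate = coef(fit), robust_se = robust_se))
}
```

In a regression table, the same robust standard errors can be requested with `modelsummary(fit, vcov = "HC1")`, the argument listed in the table above.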

Why This Matters for Your Research Paper

HW #9 asks you to run an OLS regression on your research data — today’s tools are exactly what you need
Your revised proposal (due 4/16) should now describe which regression model you plan to use
Your final paper will likely use OLS regression as its central analysis

What HW #9 asks — and what you now know how to do:

Identify your outcome and key predictor → Part 1 today
Run lm() and display results with modelsummary() → Part 2 today
Set a baseline if using a categorical predictor → Part 3 today
Interpret the coefficient magnitude, p-value, and R² → Part 2 today
Visualize the relationship with a scatter + regression line (continuous IV) or coefficient plot → Part 2 today
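The steps above can be sketched end to end as follows. The data are simulated and the variable names (`educ`, `income`) are hypothetical, so substitute your own dataset and variables; the modelsummary call runs only if the package is installed.

```r
# Simulated data for illustration only (hypothetical variable names).
set.seed(9)
toy <- data.frame(educ = sample(8:20, 200, replace = TRUE))
toy$income <- 5 + 1.5 * toy$educ + rnorm(200, sd = 8)

# Run OLS: outcome on the left of ~, key predictor on the right.
fit <- lm(income ~ educ, data = toy)
coefs <- coef(summary(fit))   # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs

# The t statistic is the coefficient divided by its standard error.
t_by_hand <- coefs["educ", "Estimate"] / coefs["educ", "Std. Error"]

# R-squared is 1 minus (residual sum of squares / total sum of squares).
r2_by_hand <- 1 - sum(resid(fit)^2) / sum((toy$income - mean(toy$income))^2)
c(t = t_by_hand, r2 = r2_by_hand)

# Clean regression table, if modelsummary is installed.
if (requireNamespace("modelsummary", quietly = TRUE)) {
  print(modelsummary::modelsummary(fit, output = "data.frame"))
}

# Scatter plot with the regression line (continuous predictor).
plot(toy$educ, toy$income, xlab = "educ", ylab = "income")
abline(fit)
```

The hand-computed t statistic and R² match what `summary(fit)` reports, which is a quick way to confirm you are reading the output correctly.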

One thing to add to your thinking for HW #9:

When interpreting your results, distinguish statistical significance (is $p < 0.05$?) from substantive significance (does the size of the coefficient actually matter in the real world?). Both belong in your write-up.
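A simulated example can make the distinction concrete. The sketch below assumes a very large sample and a tiny true slope, both chosen arbitrarily for illustration; `x` and `y` are generic placeholder variables, not course data.

```r
# Simulated data for illustration only: a very large sample with a tiny true slope.
set.seed(820)
n <- 100000
x <- rnorm(n)
y <- 0.02 * x + rnorm(n)

fit <- lm(y ~ x)
b  <- coef(summary(fit))["x", "Estimate"]
p  <- coef(summary(fit))["x", "Pr(>|t|)"]
r2 <- summary(fit)$r.squared

# With n this large the slope is very likely to have p < 0.05, yet it moves y
# by only about 0.02 standard deviations per unit of x and explains almost
# none of the variation in y.
c(slope = b, p_value = p, r_squared = r2)
```

A result like this is statistically significant but substantively negligible, so a write-up should report the p-value and also say what a one-unit change in the predictor means in the outcome's units.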

Questions?

Assignments

Weekly Assignment #9 — Due Thursday, April 9

Using your research dataset:

State a research question with a continuous outcome variable and at least one key predictor
Write your hypothesis (direction of expected relationship and justification)
Run OLS regression with lm() and display results with modelsummary()
If using a categorical predictor: use fct_relevel(var, "baseline_category") to set a baseline, and state what it is
Interpret the estimated coefficient(s): direction, magnitude, and what they mean
Report and interpret R²
Visualize: scatter plot with regression line (continuous IV) or coefficient plot
Discuss both statistical and substantive significance

Also due:

HW #8 — Due Thursday, April 2 (hypothesis testing)
Revised Proposal with Outline — Due Thursday, April 16

In-class lab today:

Practice running OLS on your own dataset and interpreting the output

Symbol	Name	Meaning
\(E[Y \mid x]\)	Conditional mean	Expected (average) value of outcome \(Y\) when \(X = x\)
\(\alpha\)	Intercept	Predicted value of \(Y\) when \(X = 0\) — the baseline anchor
\(\beta\)	Slope	Average change in \(Y\) for a one-unit increase in \(X\)
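
Read together, the three symbols in the table describe a single line. Written out under the usual simple-regression setup (an assumption here, since the table lists the symbols separately), the conditional mean and the meaning of the slope are:

$$
E[Y \mid x] = \alpha + \beta x, \qquad E[Y \mid x + 1] - E[Y \mid x] = \beta
$$

OLS then chooses the estimates, written here as $a$ and $b$ (with $a$ a hypothetical label for the estimated intercept and $b$ the estimated slope used in lecture), to minimize the sum of squared residuals:

$$
\min_{a,\, b} \; \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2
$$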