Lab 6: OLS Regression

Author

Your Name Here

Published

March 25, 2026

Overview

In this lab you will run a multiple OLS regression using the attain dataset. You will use lm() to fit the model, modelsummary() to display results with robust standard errors, modelplot() to visualize coefficients, and residual plots to check OLS assumptions. By the end, you should be able to replicate this workflow on your own research data for hw9.


Setup

Load packages, read the data, drop rows with missing values on the variables we will use, and create factor variables with explicit baseline categories.

# load libraries
library(tidyverse)
library(here)
library(modelsummary)

# set scientific notation off
options(scipen = 999)

# load data 
attain_raw <- read_csv(here("data", "attain.csv"))

# Drop rows missing any variable that enters the model
attain <- attain_raw |>
  # remove missing data on key variables
  filter(
    !is.na(income91),   # outcome
    !is.na(prestg80),   # continuous predictor
    !is.na(degree),     # categorical predictor
    !is.na(sex)         # binary control
  ) |>
  mutate(
    # Set explicit factor levels so baseline is the first level
    # Baseline for degree = "lt high" (less than high school)
    degree_f = fct_relevel(degree, "lt high", "high sch", "junior c", "bachelor", "graduate"),
    # Baseline for sex = "male"
    sex_f    = fct_relevel(sex, "male")
  )

# How many rows did we drop?
cat("Original rows:", nrow(attain_raw), "\n")
Original rows: 2992 
cat("Rows after dropping missing:", nrow(attain), "\n")
Rows after dropping missing: 2523 
cat("Rows dropped:", nrow(attain_raw) - nrow(attain), "\n")
Rows dropped: 469 
Note

We drop missing values in the data-loading step so that all models are estimated on the same sample. If you filter inside lm() or across different steps, sample sizes can vary across models, making comparisons misleading.


Step 1 — Research Question and Variables

We will estimate the following regression:

\[\widehat{\text{income}} = \alpha + b_1\text{prestige} + b_2\text{degree} + b_3\text{sex} + e\]

Role Variable Description
Outcome (Y) income91 Household income in dollars — continuous
Key predictor prestg80 Occupational prestige score — continuous
Categorical control degree_f Educational degree — 5 levels, baseline = lt high
Binary control sex_f Sex — binary, baseline = male

Our research question: After controlling for educational attainment and sex, how does occupational prestige impact household income?


Step 2 — Visualize Your Variables First

Always look at your data before running a regression.

Continuous predictor vs. outcome:

ggplot(attain, aes(x = prestg80, y = income91)) +
  geom_point(alpha = 0.2, color = "gray50", size = 1) +
  geom_smooth(method = "lm", se = TRUE,
              color = "#4E79A7", fill = "#4E79A7", alpha = 0.2) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(title = "Occupational Prestige and Household Income",
       x = "Occupational Prestige Score",
       y = "Household Income ($)") +
  theme_minimal()

Categorical predictor vs. outcome:

attain |>
  group_by(degree_f) |>
  summarize(mean_income = mean(income91, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = degree_f, y = mean_income)) +
  geom_col(fill = "#4E79A7") +
  scale_y_continuous(labels = scales::dollar_format(),
                     expand = expansion(mult = c(0, 0.1))) +
  labs(title = "Mean Household Income by Educational Degree",
       x = "Degree (left = lowest; right = highest)",
       y = "Mean Income") +
  theme_minimal()

Question 1. Looking at the two plots above, describe the direction of each relationship in one sentence each. Does income go up or down as prestige increases? How does income differ across degree levels?

Your answer:


Step 3 — Fit the OLS Model

Why This Model?

Every regression requires a theoretical justification for which variables belong in the equation. Here is the logic behind each choice:

  • Outcome — income91: We are interested in economic inequality. Household income is a direct and continuous measure of material well-being, making OLS appropriate.
  • Key predictor — prestg80: Occupational prestige captures social standing and job desirability. The core question is whether higher-prestige occupations translate to higher earnings after accounting for the credentials workers bring to those jobs and gender differences in pay.
  • Control — degree_f: Education is a strong driver of both occupational placement and earnings. If we omit it, any apparent prestige effect may really reflect the fact that more educated people hold more prestigious and higher-paying jobs — a classic confounder.
  • Control — sex_f: A well-documented gender pay gap exists in the U.S. labor market, and women are also sorted into lower-prestige occupations on average. Controlling for sex isolates the prestige–income relationship from those gender differences.

Including both controls means the prestige coefficient tells us: among people with the same education level and sex, how does a one-unit difference in occupational prestige impact income?

model <- lm(income91 ~ prestg80 + degree_f + sex_f, data = attain)

summary(model)

Call:
lm(formula = income91 ~ prestg80 + degree_f + sex_f, data = attain)

Residuals:
   Min     1Q Median     3Q    Max 
-57125 -17199  -5594  12495  84151 

Coefficients:
                 Estimate Std. Error t value             Pr(>|t|)    
(Intercept)      10640.83    2028.76   5.245     0.00000016930911 ***
prestg80           391.86      43.42   9.024 < 0.0000000000000002 ***
degree_fhigh sch 10211.60    1476.06   6.918     0.00000000000578 ***
degree_fjunior c 15180.33    2400.64   6.323     0.00000000030166 ***
degree_fbachelor 22987.25    1905.02  12.067 < 0.0000000000000002 ***
degree_fgraduate 28369.77    2488.95  11.398 < 0.0000000000000002 ***
sex_ffemale      -3804.14     985.80  -3.859             0.000117 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24600 on 2516 degrees of freedom
Multiple R-squared:  0.1912,    Adjusted R-squared:  0.1893 
F-statistic: 99.13 on 6 and 2516 DF,  p-value: < 0.00000000000000022
Note

summary() gives you the raw output. You can read the coefficient estimates, standard errors, t-values, and p-values here — but we will use modelsummary() below for a cleaner table. Note that the SEs from summary() are classical (not robust) — we’ll switch to robust SEs in Step 4.


Step 4 — Regression Table with Robust Standard Errors

Use modelsummary() with vcov = "HC1" to display results with heteroskedasticity-robust standard errors. This is best practice when working with income data, which often shows unequal variance across the range of fitted values.

modelsummary(
  model,
  vcov      = "HC1",
  coef_rename = c(
    "(Intercept)"        = "Intercept",
    "prestg80"           = "Occupational Prestige",
    "degree_fhigh sch"   = "High School (vs. lt high)",
    "degree_fjunior c"   = "Junior College (vs. lt high)",
    "degree_fbachelor"   = "Bachelor's Degree (vs. lt high)",
    "degree_fgraduate"   = "Graduate Degree (vs. lt high)",
    "sex_ffemale"        = "Female (vs. male)"
  ),
  stars     = TRUE,
  gof_map   = c("nobs", "r.squared", "adj.r.squared"),
  title     = "OLS: Household Income ~ Prestige + Degree + Sex (HC1 Robust SEs)"
)
OLS: Household Income ~ Prestige + Degree + Sex (HC1 Robust SEs)
 (1)
Intercept 10640.833***
(1949.074)
Occupational Prestige 391.857***
(43.663)
High School (vs. lt high) 10211.600***
(1219.854)
Junior College (vs. lt high) 15180.332***
(2222.930)
Bachelor's Degree (vs. lt high) 22987.247***
(1868.861)
Graduate Degree (vs. lt high) 28369.766***
(2609.891)
Female (vs. male) −3804.139***
(984.582)
Num.Obs. 2523
R2 0.191
R2 Adj. 0.189
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Note

What vcov = "HC1" does: It replaces the classical standard errors with robust ones that are valid even if the residual variance is not constant across fitted values (heteroskedasticity). The coefficient estimates do not change — only the standard errors, t-statistics, and p-values.


Step 5 — Visualize Coefficients

A coefficient plot shows each estimate with its 95% confidence interval. If the CI does not cross zero, the coefficient is statistically significant at α = 0.05.

modelplot(
  model,
  vcov      = "HC1",
  coef_omit = "Intercept",
  coef_rename = c(
    "prestg80"           = "Occupational Prestige",
    "degree_fhigh sch"   = "High School (vs. lt high)",
    "degree_fjunior c"   = "Junior College (vs. lt high)",
    "degree_fbachelor"   = "Bachelor's Degree (vs. lt high)",
    "degree_fgraduate"   = "Graduate Degree (vs. lt high)",
    "sex_ffemale"        = "Female (vs. male)"
  )
) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "#E15759") +
  labs(
    title = "OLS Coefficients with 95% Confidence Intervals (HC1 Robust SEs)",
    x     = "Estimated Coefficient ($)"
  ) +
  theme_minimal()

Question 2. Look at the coefficient plot. Which coefficients are statistically significant (CI does not cross the dashed red line at zero)? List them.

Your answer:

Note

Why does the prestige CI look so narrow compared to the degree gaps?

This is a scale artifact, not a substantive finding. The prestige score runs roughly 17–86, so a 1-point change is a tiny increment. The degree dummies, by contrast, compare entire education tiers against the baseline, producing large dollar gaps that dominate the x-axis. Both are estimated on the same dollar scale, so the prestige CI appears narrow even if it is far from zero.

Two ways to deal with this:

  1. Check the table directly. The stars and p-values in Step 4 are the authoritative test. A visually narrow CI on a shared-scale plot does not mean a coefficient is insignificant.
  2. Rescale the predictor. Replace prestg80 with I(prestg80 / 10) inside lm(). Now the coefficient represents income change per 10-point increase in prestige — a more meaningful unit — and the CI will visually stretch on the plot.

Back-of-napkin significance check — the “2× SE rule”: If |coefficient| ≥ 2 × SE, it is approximately significant at α = 0.05. This works because the critical t-value at α = 0.05 is ≈ 1.96 ≈ 2, so t = b / SE ≥ 2 implies p ≤ 0.05. Check this against the prestige row in your Step 4 table: multiply the SE by 2 and see if the coefficient is larger. It should be — by a lot.


Step 6 — Interpret Your Results

Use the regression table from Step 4 to answer the following questions. Write full sentences.

Question 3 — Continuous predictor. Interpret the coefficient on Occupational Prestige. Use this template:

“For every one unit increase in occupational prestige, predicted household income [increases/decreases] by $___, holding educational degree and sex constant.”

Your answer:


Question 4 — Categorical predictor. Interpret the coefficient on Bachelor's Degree (vs. lt high). Remember: the baseline category is less than high school. Use this template:

“Compared to respondents with less than a high school degree, those with a bachelor’s degree are predicted to earn $___ [more/less] per year, holding prestige and sex constant.”

Your answer:


Question 5 — Binary predictor. Interpret the coefficient on Female (vs. male). Remember: the baseline is male. Use this template:

“Compared to men, women are predicted to earn $___ [more/less] per year, holding prestige and degree constant.”

Your answer:


Question 6 — R². Report the R² value from the table and interpret it in one sentence. What proportion of variation in income does this model explain?

Your answer:


Question 7 — Statistical significance. Look at the stars on the Occupational Prestige coefficient. Is this coefficient statistically significant at the α = 0.05 level? What does that mean — do you reject or fail to reject \(H_0: \beta_{\text{prestige}} = 0\)?

Your answer:


Step 7 — Check OLS Assumptions

Residuals vs. Fitted (linearity + homoskedasticity)

If the relationship is linear and variance is constant, the points should be randomly scattered around the horizontal line with no clear pattern.

attain |>
  mutate(
    fitted = fitted(model),
    resid  = residuals(model)
  ) |>
  ggplot(aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.2, color = "gray40", size = 1) +
  geom_hline(yintercept = 0, color = "#E15759", linetype = "dashed", linewidth = 1) +
  geom_smooth(method = "loess", se = FALSE, color = "#4E79A7", linewidth = 1) +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(
    title = "Residuals vs. Fitted Values",
    x     = "Fitted Values (Predicted Income)",
    y     = "Residuals"
  ) +
  theme_minimal()

Question 8. Does the blue loess curve look approximately flat (suggesting linearity is satisfied)? Does the vertical spread of points look roughly constant across fitted values, or does it widen? What does this suggest about heteroskedasticity?

Your answer:


Histogram of Residuals (normality of errors)

If the normality assumption holds, residuals should look roughly bell-shaped. In large samples this matters less (CLT), but it is good practice to check.

tibble(resid = residuals(model)) |>
  ggplot(aes(x = resid)) +
  geom_histogram(bins = 40, fill = "#4E79A7", color = "white") +
  labs(
    title = "Distribution of Residuals",
    x     = "Residual ($)",
    y     = "Count"
  ) +
  theme_minimal()

Question 9. Does the histogram of residuals look approximately normal (symmetric, bell-shaped), or is it skewed? If skewed, what transformation of the outcome variable might help?

Your answer:


TipPutting It Together

Even if the residuals vs. fitted plot shows some heteroskedasticity, our results are still valid because we used vcov = "HC1" robust standard errors in Step 4. The coefficients themselves are unaffected by heteroskedasticity — only the standard errors are, and HC1 corrects for that.


TipBefore You Leave