Lab 4: Data Cleaning for Final Projects

Author

Your Name Here

Published

March 25, 2026

Overview

In this lab you will build a cleaned, analysis-ready dataset from the raw attain.csv file — showing you the major steps you will use to clean the dataset that you will use for your final paper. This lab is not exhaustive, but should walk through a realistic data-cleaning workflow: selecting the variables you need, recoding categorical variables to have meaningful labels, creating new measures that you might want, and saving the result so you can load it in future scripts without repeating this work.


Setup Instructions

  1. Download lab4.qmd from bCourse under “Labs” > “Lab #4”
  2. Place it in your labs folder. Your folder structure should look something like this:
soc106/
├── _quarto.yml
├── data/
│   └── attain.csv
├── assignments/
│   └── ...
└── labs/
    ├── lab1.qmd
    ├── lab2.qmd
    ├── lab3.qmd
    └── lab4.qmd
  1. Open lab4.qmd in Positron and let’s get to work!

Load Libraries

We’ll use four packages today. If you haven’t installed janitor or skimr before, run the block below once in your console (not when rendering). The janitor() package is a great package that will automate some basic cleaning and standardizing of variable names. I love to use it to take funky variable names and force a standard variable naming scheme on them: snake_case.

# Run once in the console if not already installed
#install.packages(c("janitor", "skimr"))
library(tidyverse)   # data manipulation and visualization
library(here)        # file path management — see https://here.r-lib.org/
library(janitor)     # data cleaning utilities — see https://sfirke.github.io/janitor/
library(skimr)       # quick summary statistics — see https://docs.ropensci.org/skimr/

Read and Inspect the Data

We read the file using here(), which builds paths relative to the project root regardless of where this .qmd file lives — no more broken paths when you move a file. Immediately after reading, we pipe to clean_names() from janitor, which converts all column names to snake_case: no spaces, no special characters, consistent lowercase.

attain <- read_csv(here("data", "attain.csv")) |>
    # converts column names to snake_case: lowercase, no spaces or special characters
    clean_names() |>
    glimpse()
Rows: 2,992
Columns: 44
$ x1       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ wrkstat  <chr> "keeping", "working", "working", "working", "working", "worki…
$ hrs1     <dbl> NA, 40, 50, 32, 20, 20, 35, 45, 40, 40, 40, 65, 35, 38, 46, 4…
$ prestg80 <dbl> 46, 22, 29, 42, 36, 43, 20, 44, 42, 46, 30, 75, 51, 50, 73, 4…
$ marital  <chr> "divorced", "married", "married", "married", "never ma", "nev…
$ agewed   <dbl> NA, 20, 28, NA, NA, NA, NA, NA, NA, NA, NA, NA, 29, 22, NA, N…
$ papres80 <dbl> 41, NA, NA, NA, NA, NA, 20, 44, NA, 34, NA, 51, 40, 51, 75, 4…
$ mapres80 <dbl> NA, 28, 36, NA, NA, 34, 28, 44, 23, NA, NA, NA, NA, NA, 60, 6…
$ sibs     <dbl> 4, 4, 2, 6, 3, 3, 5, 8, 2, 5, NA, 1, 4, 1, 4, 2, 0, 0, 2, 2, …
$ childs   <dbl> 2, 3, 2, 3, 0, 0, 4, 1, 6, 1, 0, 0, 2, 2, 0, 0, 2, 2, 0, 0, 0…
$ age      <dbl> 33, 59, NA, 59, 21, 22, 40, 25, 41, 45, 52, 31, 55, 56, 36, 2…
$ agekdbrn <dbl> 21, NA, 25, 23, NA, NA, 17, 23, 17, 17, NA, NA, 29, 32, NA, N…
$ educ     <dbl> 12, 12, 12, 8, 13, 15, 9, 12, 12, 12, 0, 19, 16, 16, 18, 16, …
$ paeduc   <dbl> 12, NA, NA, NA, NA, NA, 12, 8, NA, 6, NA, 14, 0, 12, 19, 16, …
$ maeduc   <dbl> 10, 12, NA, 5, 12, 20, NA, 0, 11, 6, 20, 16, 0, 16, 14, 18, 1…
$ degree   <chr> "high sch", "high sch", "lt high", "lt high", "high sch", "hi…
$ sex      <chr> "female", "male", "female", "male", "female", "female", "fema…
$ race     <chr> "black", "black", "black", "white", "black", "black", "other"…
$ weekswrk <dbl> 0, 52, 52, 44, 30, 52, 52, 52, 52, 52, 40, 52, 52, 52, 52, 45…
$ partfull <chr> NA, "full-tim", "full-tim", "full-tim", "part-tim", "part-tim…
$ region   <chr> "middle a", "middle a", "middle a", "middle a", "middle a", "…
$ xnorcsiz <chr> "city gt", "city gt", "city gt", "city gt", "city gt", "city …
$ srcbelt  <chr> "12 lrgst", "12 lrgst", "12 lrgst", "12 lrgst", "12 lrgst", "…
$ size     <dbl> 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7…
$ partyid  <chr> "strong d", "not str", "ind,near", "not str", "strong d", "ot…
$ polviews <chr> "slightly", "moderate", "slghtly", "slightly", "conserva", "l…
$ relig    <chr> "protesta", "catholic", "protesta", "catholic", "protesta", "…
$ attend   <chr> "sevrl ti", "every we", "more thn", "once a y", "once a y", "…
$ satjob   <chr> "mod. sat", "very sat", "very sat", "mod. sat", "mod. sat", "…
$ class    <chr> "working", "working", "working", "working", "lower cl", "work…
$ satfin   <chr> "more or", "more or", "more or", "not at a", "not at a", "not…
$ finalter <chr> "better", "stayed s", "stayed s", "worse", "worse", "worse", …
$ finrela  <chr> "below av", "below av", "average", "average", "far belo", "be…
$ wksub    <chr> NA, NA, "no", NA, "yes", "yes", "yes", NA, "yes", "no", NA, N…
$ wksup    <chr> NA, NA, "no", NA, "no", "no", "no", NA, "no", "no", NA, NA, N…
$ unemp    <chr> "yes", "no", NA, "yes", "no", NA, NA, "no", NA, "yes", "yes",…
$ union    <chr> "neither", "neither", NA, "neither", "neither", NA, NA, "neit…
$ parsol   <chr> "somewhat", NA, NA, "somewhat", NA, NA, NA, "much bet", NA, "…
$ tvhours  <dbl> 2, 3, 1, 3, NA, 0, 10, 4, 2, NA, 2, 2, 2, NA, 1, 1, 0, 3, 3, …
$ dwelown  <chr> "pays ren", "pays ren", "pays ren", "own or i", NA, "pays ren…
$ wordsum  <dbl> NA, NA, 6, 5, NA, 8, 5, 1, 5, NA, 5, 9, 9, NA, 7, 10, 3, 8, N…
$ income91 <dbl> 11250, NA, 16250, 18750, 13750, 45000, 23750, 11250, 27500, 1…
$ rincom91 <dbl> NA, NA, 16250, 18750, NA, 11250, 23750, 11250, 18750, 18750, …

Select Variables

One easy way to shrink a dataset is to keep only the columns relevant to the final project. You can use select(), which accepts bare column names, and the helper contains("inc") picks up any column whose name includes “inc” — useful when a dataset has several income-related fields.

attain_clean <- attain |>
    # select only the key variables we want for our analysis
    select(id, marital, sex, race, educ, paeduc, maeduc, age, partyid, contains("inc")) |>
    glimpse()
Rows: 2,992
Columns: 11
$ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ marital  <chr> "divorced", "married", "married", "married", "never ma", "nev…
$ sex      <chr> "female", "male", "female", "male", "female", "female", "fema…
$ race     <chr> "black", "black", "black", "white", "black", "black", "other"…
$ educ     <dbl> 12, 12, 12, 8, 13, 15, 9, 12, 12, 12, 0, 19, 16, 16, 18, 16, …
$ paeduc   <dbl> 12, NA, NA, NA, NA, NA, 12, 8, NA, 6, NA, 14, 0, 12, 19, 16, …
$ maeduc   <dbl> 10, 12, NA, 5, 12, 20, NA, 0, 11, 6, 20, 16, 0, 16, 14, 18, 1…
$ age      <dbl> 33, 59, NA, 59, 21, 22, 40, 25, 41, 45, 52, 31, 55, 56, 36, 2…
$ partyid  <chr> "strong d", "not str", "ind,near", "not str", "strong d", "ot…
$ income91 <dbl> 11250, NA, 16250, 18750, 13750, 45000, 23750, 11250, 27500, 1…
$ rincom91 <dbl> NA, NA, 16250, 18750, NA, 11250, 23750, 11250, 18750, 18750, …

Recode Variables

Education

The raw educ variable measures years of formal schooling (0–20). See the GSS codebook for the full definition. We use case_when() inside mutate() to create a categorical version with meaningful labels. Any educ value that is NA does not satisfy any condition and remains NA in the new variable automatically.

attain_clean <- attain_clean |>
  mutate(
    # recode educational attainment   
    educ_cat = case_when(
      educ %in% 0:8   ~ "Less than High School",   # 0–8 years
      educ %in% 9:11  ~ "Some High School",        # 9–11 years
      educ == 12      ~ "High School Graduate",    # 12 years
      educ %in% 13:15 ~ "Some College",            # 13–15 years
      educ == 16      ~ "College Graduate",        # 16 years (bachelor's)
      educ >= 17      ~ "Graduate Degree"          # 17–20 years (postgrad)
    )
  )

# Verify — counts should cover all non-NA rows
attain_clean |> count(educ_cat)
# A tibble: 7 × 2
  educ_cat                  n
  <chr>                 <int>
1 College Graduate        433
2 Graduate Degree         326
3 High School Graduate    929
4 Less than High School   198
5 Some College            744
6 Some High School        355
7 <NA>                      7

Party Identification

The raw partyid values are abbreviated strings. See the GSS codebook for definitions. Raw strings often carry invisible whitespace — in this dataset "not str" has a trailing space. We use str_trim() to strip it before comparing.

Note: clean_names() only standardizes column names (e.g., My Variablemy_variable). It has no effect on the values inside the columns, which is why we still need str_trim() here to remove whitespace from the partyid values.

This version of partyid does not distinguish which party the “not strong” identifiers lean toward — it bundles both “Not strong Democrat” and “Not strong Republican” together — so we label them as "Not Strong Party".

attain_clean <- attain_clean |>
  mutate(
    party = case_when(
      str_trim(partyid) == "strong d"  ~ "Strong Democrat",   # str_trim() removes leading/trailing whitespace from values
      str_trim(partyid) == "not str"   ~ "Not Strong Party",
      str_trim(partyid) == "ind,near"  ~ "Independent (Leaning)",
      str_trim(partyid) == "independ"  ~ "Independent",
      str_trim(partyid) == "strong r"  ~ "Strong Republican",
      str_trim(partyid) == "other pa"  ~ "Other Party"
    )
  )

attain_clean |> count(party)
# A tibble: 7 × 2
  party                     n
  <chr>                 <int>
1 Independent             369
2 Independent (Leaning)   623
3 Not Strong Party       1163
4 Other Party              44
5 Strong Democrat         423
6 Strong Republican       321
7 <NA>                     49

A quick bar chart to visualize the distribution. We use fct_relevel() to fix the order left to right on the political spectrum, and filter() to drop NA and "Other Party" before plotting.

attain_clean |>
  filter(!is.na(party), party != "Other Party") |>
  mutate(party = fct_relevel(
    party,
    "Strong Democrat",
    "Independent (Leaning)",
    "Not Strong Party",       # ambiguous: bundles weak D and weak R
    "Independent",
    "Strong Republican"
  )) |>
  ggplot(aes(x = party, fill = party)) +
  geom_bar() +
  geom_text(
    stat  = "count",
    aes(label = after_stat(count)),
    vjust = -0.5,   # nudge labels just above the top of each bar
    size  = 3.5
  ) +
  scale_fill_manual(values = c(
    "Strong Democrat"       = "#1259A5",
    "Independent (Leaning)" = "#8EB4DA",
    "Not Strong Party"      = "#9E9E9E",
    "Independent"           = "#E08C8C",
    "Strong Republican"     = "#C02020"
  )) +
  scale_y_continuous(breaks = seq(0, 1000, by = 200)) +
  labs(
    title = "Party Identification",
    x     = NULL,
    y     = "Count"
  ) +
  theme_minimal() +
  theme(
    legend.position  = "none",
    axis.text.x      = element_text(angle = 15, hjust = 1)
  )


Create Variables and Rename

This block handles two final cleanup tasks:

  1. college_years — years of education beyond high school. For respondents with 12 or fewer years we assign 0; NA values in educ propagate as NA automatically because no condition is satisfied.
  2. Rename marital to marital_status using rename().
attain_clean <- attain_clean |>
  mutate(
    college_years = case_when(
      educ > 12  ~ educ - 12,   # years of schooling beyond a HS diploma
      educ <= 12 ~ 0            # HS diploma or less: set college years to 0
    )) |>
  rename(marital_status = marital) |>  # rename(new_name = old_name)
  glimpse()
Rows: 2,992
Columns: 14
$ id             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ marital_status <chr> "divorced", "married", "married", "married", "never ma"…
$ sex            <chr> "female", "male", "female", "male", "female", "female",…
$ race           <chr> "black", "black", "black", "white", "black", "black", "…
$ educ           <dbl> 12, 12, 12, 8, 13, 15, 9, 12, 12, 12, 0, 19, 16, 16, 18…
$ paeduc         <dbl> 12, NA, NA, NA, NA, NA, 12, 8, NA, 6, NA, 14, 0, 12, 19…
$ maeduc         <dbl> 10, 12, NA, 5, 12, 20, NA, 0, 11, 6, 20, 16, 0, 16, 14,…
$ age            <dbl> 33, 59, NA, 59, 21, 22, 40, 25, 41, 45, 52, 31, 55, 56,…
$ partyid        <chr> "strong d", "not str", "ind,near", "not str", "strong d…
$ income91       <dbl> 11250, NA, 16250, 18750, 13750, 45000, 23750, 11250, 27…
$ rincom91       <dbl> NA, NA, 16250, 18750, NA, 11250, 23750, 11250, 18750, 1…
$ educ_cat       <chr> "High School Graduate", "High School Graduate", "High S…
$ party          <chr> "Strong Democrat", "Not Strong Party", "Independent (Le…
$ college_years  <dbl> 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 7, 4, 4, 6, 4, 6, 4, 0…

Taking a Sample

If your dataset is very large, working with a random sample can speed up iteration during development. If your dataset is really large and it’s taking a long time to load on your computer, you can just use a smaller sample for this class. (Normally, I would not advise doing this, but it is fine for the course). slice_sample() lets you draw either a fixed number of rows or a proportion of the data. Always call set.seed() before sampling so your results are reproducible — anyone running the same code will get the same rows.

# Approach 1: sample a fixed number of rows
set.seed(42)
attain_sample_n <- attain_clean |>
  slice_sample(n = 500)

# Approach 2: sample a proportion of rows (here 10%)
set.seed(42)
attain_sample_prop <- attain_clean |>
  slice_sample(prop = 0.10)

# Compare sizes
nrow(attain_clean)
[1] 2992
nrow(attain_sample_n)
[1] 500
nrow(attain_sample_prop)
[1] 299

For your final project, use the full attain_clean — sampling is just shown here as a technique for when datasets are too large to work with comfortably.

WarningWatch what you write to disk

Notice that attain_sample_prop contains only a fraction of the full dataset (in this case, roughly 298 rows). When you run write_csv() in the next section, R writes exactly the object you pass to it — it won’t warn you if you accidentally save a sample instead of the full cleaned data. Before saving, always double-check which object you are passing: attain_clean, attain_sample_n, or attain_sample_prop? Running code out of order can make this easy to get wrong.


Save the Cleaned Data

Write the cleaned dataset back to data/ under a new name so the raw file is never overwritten. This is a core principle of reproducible research: always keep the original data untouched.

write_csv() from readr produces a clean, portable CSV with no row numbers.

write_csv(attain_clean, here("data", "attain_clean.csv"))

From now on, any analysis script can start with read_csv(here("data", "attain_clean.csv")) and skip all the cleaning steps above.


Data Summary with skimr

skim() gives a thorough snapshot of the cleaned dataset: counts, missing rates, distributions, and a mini-histogram for each variable. This is a good final check before moving on to analysis.

skim(attain_clean)
Data summary
Name attain_clean
Number of rows 2992
Number of columns 14
_______________________
Column type frequency:
character 6
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
marital_status 1 1.00 7 8 0 5 0
sex 0 1.00 4 6 0 2 0
race 0 1.00 5 5 0 3 0
partyid 49 0.98 7 8 0 6 0
educ_cat 7 1.00 12 21 0 6 0
party 49 0.98 11 21 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 1496.50 863.86 1 748.75 1496.5 2244.25 2992 ▇▇▇▇▇
educ 7 1.00 13.16 2.97 0 12.00 13.0 16.00 20 ▁▁▇▇▂
paeduc 837 0.72 11.07 4.20 0 8.00 12.0 14.00 20 ▂▅▇▅▂
maeduc 432 0.86 11.08 3.46 0 9.00 12.0 12.00 20 ▁▃▇▃▁
age 6 1.00 45.97 17.05 18 32.25 43.0 57.00 89 ▆▇▆▃▂
income91 356 0.88 37596.55 27456.35 500 16250.00 32500.0 55000.00 100000 ▇▇▅▂▂
rincom91 1032 0.66 25852.17 20799.05 500 11250.00 21250.0 32500.00 100000 ▇▆▂▁▁
college_years 7 1.00 1.72 2.17 0 0.00 1.0 4.00 8 ▇▂▂▁▁

Wrap-Up

In this lab you built a reusable data cleaning pipeline. Key functions covered:

Task Function
Standardize column names clean_names()
Subset columns select()
Recode categoricals case_when() inside mutate()
Strip whitespace str_trim()
Create derived variables mutate()
Rename columns rename()
Random sampling slice_sample()
Save to disk write_csv()
Summarize data skim()

TipBefore You Leave