# Run once in the console if not already installed
#install.packages(c("janitor", "skimr"))Lab 4: Data Cleaning for Final Projects
Overview
In this lab you will build a cleaned, analysis-ready dataset from the raw attain.csv file — showing you the major steps you will use to clean the dataset that you will use for your final paper. This lab is not exhaustive, but should walk through a realistic data-cleaning workflow: selecting the variables you need, recoding categorical variables to have meaningful labels, creating new measures that you might want, and saving the result so you can load it in future scripts without repeating this work.
Setup Instructions
- Download
lab4.qmdfrom bCourse under “Labs” > “Lab #4” - Place it in your
labsfolder. Your folder structure should look something like this:
soc106/
├── _quarto.yml
├── data/
│ └── attain.csv
├── assignments/
│ └── ...
└── labs/
├── lab1.qmd
├── lab2.qmd
├── lab3.qmd
└── lab4.qmd
- Open
lab4.qmdin Positron and let’s get to work!
Load Libraries
We’ll use four packages today. If you haven’t installed janitor or skimr before, run the block below once in your console (not when rendering). The janitor() package is a great package that will automate some basic cleaning and standardizing of variable names. I love to use it to take funky variable names and force a standard variable naming scheme on them: snake_case.
library(tidyverse) # data manipulation and visualization
library(here) # file path management — see https://here.r-lib.org/
library(janitor) # data cleaning utilities — see https://sfirke.github.io/janitor/
library(skimr) # quick summary statistics — see https://docs.ropensci.org/skimr/Read and Inspect the Data
We read the file using here(), which builds paths relative to the project root regardless of where this .qmd file lives — no more broken paths when you move a file. Immediately after reading, we pipe to clean_names() from janitor, which converts all column names to snake_case: no spaces, no special characters, consistent lowercase.
attain <- read_csv(here("data", "attain.csv")) |>
# converts column names to snake_case: lowercase, no spaces or special characters
clean_names() |>
glimpse()Rows: 2,992
Columns: 44
$ x1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ wrkstat <chr> "keeping", "working", "working", "working", "working", "worki…
$ hrs1 <dbl> NA, 40, 50, 32, 20, 20, 35, 45, 40, 40, 40, 65, 35, 38, 46, 4…
$ prestg80 <dbl> 46, 22, 29, 42, 36, 43, 20, 44, 42, 46, 30, 75, 51, 50, 73, 4…
$ marital <chr> "divorced", "married", "married", "married", "never ma", "nev…
$ agewed <dbl> NA, 20, 28, NA, NA, NA, NA, NA, NA, NA, NA, NA, 29, 22, NA, N…
$ papres80 <dbl> 41, NA, NA, NA, NA, NA, 20, 44, NA, 34, NA, 51, 40, 51, 75, 4…
$ mapres80 <dbl> NA, 28, 36, NA, NA, 34, 28, 44, 23, NA, NA, NA, NA, NA, 60, 6…
$ sibs <dbl> 4, 4, 2, 6, 3, 3, 5, 8, 2, 5, NA, 1, 4, 1, 4, 2, 0, 0, 2, 2, …
$ childs <dbl> 2, 3, 2, 3, 0, 0, 4, 1, 6, 1, 0, 0, 2, 2, 0, 0, 2, 2, 0, 0, 0…
$ age <dbl> 33, 59, NA, 59, 21, 22, 40, 25, 41, 45, 52, 31, 55, 56, 36, 2…
$ agekdbrn <dbl> 21, NA, 25, 23, NA, NA, 17, 23, 17, 17, NA, NA, 29, 32, NA, N…
$ educ <dbl> 12, 12, 12, 8, 13, 15, 9, 12, 12, 12, 0, 19, 16, 16, 18, 16, …
$ paeduc <dbl> 12, NA, NA, NA, NA, NA, 12, 8, NA, 6, NA, 14, 0, 12, 19, 16, …
$ maeduc <dbl> 10, 12, NA, 5, 12, 20, NA, 0, 11, 6, 20, 16, 0, 16, 14, 18, 1…
$ degree <chr> "high sch", "high sch", "lt high", "lt high", "high sch", "hi…
$ sex <chr> "female", "male", "female", "male", "female", "female", "fema…
$ race <chr> "black", "black", "black", "white", "black", "black", "other"…
$ weekswrk <dbl> 0, 52, 52, 44, 30, 52, 52, 52, 52, 52, 40, 52, 52, 52, 52, 45…
$ partfull <chr> NA, "full-tim", "full-tim", "full-tim", "part-tim", "part-tim…
$ region <chr> "middle a", "middle a", "middle a", "middle a", "middle a", "…
$ xnorcsiz <chr> "city gt", "city gt", "city gt", "city gt", "city gt", "city …
$ srcbelt <chr> "12 lrgst", "12 lrgst", "12 lrgst", "12 lrgst", "12 lrgst", "…
$ size <dbl> 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7…
$ partyid <chr> "strong d", "not str", "ind,near", "not str", "strong d", "ot…
$ polviews <chr> "slightly", "moderate", "slghtly", "slightly", "conserva", "l…
$ relig <chr> "protesta", "catholic", "protesta", "catholic", "protesta", "…
$ attend <chr> "sevrl ti", "every we", "more thn", "once a y", "once a y", "…
$ satjob <chr> "mod. sat", "very sat", "very sat", "mod. sat", "mod. sat", "…
$ class <chr> "working", "working", "working", "working", "lower cl", "work…
$ satfin <chr> "more or", "more or", "more or", "not at a", "not at a", "not…
$ finalter <chr> "better", "stayed s", "stayed s", "worse", "worse", "worse", …
$ finrela <chr> "below av", "below av", "average", "average", "far belo", "be…
$ wksub <chr> NA, NA, "no", NA, "yes", "yes", "yes", NA, "yes", "no", NA, N…
$ wksup <chr> NA, NA, "no", NA, "no", "no", "no", NA, "no", "no", NA, NA, N…
$ unemp <chr> "yes", "no", NA, "yes", "no", NA, NA, "no", NA, "yes", "yes",…
$ union <chr> "neither", "neither", NA, "neither", "neither", NA, NA, "neit…
$ parsol <chr> "somewhat", NA, NA, "somewhat", NA, NA, NA, "much bet", NA, "…
$ tvhours <dbl> 2, 3, 1, 3, NA, 0, 10, 4, 2, NA, 2, 2, 2, NA, 1, 1, 0, 3, 3, …
$ dwelown <chr> "pays ren", "pays ren", "pays ren", "own or i", NA, "pays ren…
$ wordsum <dbl> NA, NA, 6, 5, NA, 8, 5, 1, 5, NA, 5, 9, 9, NA, 7, 10, 3, 8, N…
$ income91 <dbl> 11250, NA, 16250, 18750, 13750, 45000, 23750, 11250, 27500, 1…
$ rincom91 <dbl> NA, NA, 16250, 18750, NA, 11250, 23750, 11250, 18750, 18750, …
Select Variables
One easy way to shrink a dataset is to keep only the columns relevant to the final project. You can use select(), which accepts bare column names, and the helper contains("inc") picks up any column whose name includes “inc” — useful when a dataset has several income-related fields.
attain_clean <- attain |>
# select only the key variables we want for our analysis
select(id, marital, sex, race, educ, paeduc, maeduc, age, partyid, contains("inc")) |>
glimpse()Rows: 2,992
Columns: 11
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ marital <chr> "divorced", "married", "married", "married", "never ma", "nev…
$ sex <chr> "female", "male", "female", "male", "female", "female", "fema…
$ race <chr> "black", "black", "black", "white", "black", "black", "other"…
$ educ <dbl> 12, 12, 12, 8, 13, 15, 9, 12, 12, 12, 0, 19, 16, 16, 18, 16, …
$ paeduc <dbl> 12, NA, NA, NA, NA, NA, 12, 8, NA, 6, NA, 14, 0, 12, 19, 16, …
$ maeduc <dbl> 10, 12, NA, 5, 12, 20, NA, 0, 11, 6, 20, 16, 0, 16, 14, 18, 1…
$ age <dbl> 33, 59, NA, 59, 21, 22, 40, 25, 41, 45, 52, 31, 55, 56, 36, 2…
$ partyid <chr> "strong d", "not str", "ind,near", "not str", "strong d", "ot…
$ income91 <dbl> 11250, NA, 16250, 18750, 13750, 45000, 23750, 11250, 27500, 1…
$ rincom91 <dbl> NA, NA, 16250, 18750, NA, 11250, 23750, 11250, 18750, 18750, …
Recode Variables
Education
The raw educ variable measures years of formal schooling (0–20). See the GSS codebook for the full definition. We use case_when() inside mutate() to create a categorical version with meaningful labels. Any educ value that is NA does not satisfy any condition and remains NA in the new variable automatically.
attain_clean <- attain_clean |>
mutate(
# recode educational attainment
educ_cat = case_when(
educ %in% 0:8 ~ "Less than High School", # 0–8 years
educ %in% 9:11 ~ "Some High School", # 9–11 years
educ == 12 ~ "High School Graduate", # 12 years
educ %in% 13:15 ~ "Some College", # 13–15 years
educ == 16 ~ "College Graduate", # 16 years (bachelor's)
educ >= 17 ~ "Graduate Degree" # 17–20 years (postgrad)
)
)
# Verify — counts should cover all non-NA rows
attain_clean |> count(educ_cat)# A tibble: 7 × 2
educ_cat n
<chr> <int>
1 College Graduate 433
2 Graduate Degree 326
3 High School Graduate 929
4 Less than High School 198
5 Some College 744
6 Some High School 355
7 <NA> 7
Party Identification
The raw partyid values are abbreviated strings. See the GSS codebook for definitions. Raw strings often carry invisible whitespace — in this dataset "not str" has a trailing space. We use str_trim() to strip it before comparing.
Note:
clean_names()only standardizes column names (e.g.,My Variable→my_variable). It has no effect on the values inside the columns, which is why we still needstr_trim()here to remove whitespace from thepartyidvalues.
This version of partyid does not distinguish which party the “not strong” identifiers lean toward — it bundles both “Not strong Democrat” and “Not strong Republican” together — so we label them as "Not Strong Party".
attain_clean <- attain_clean |>
mutate(
party = case_when(
str_trim(partyid) == "strong d" ~ "Strong Democrat", # str_trim() removes leading/trailing whitespace from values
str_trim(partyid) == "not str" ~ "Not Strong Party",
str_trim(partyid) == "ind,near" ~ "Independent (Leaning)",
str_trim(partyid) == "independ" ~ "Independent",
str_trim(partyid) == "strong r" ~ "Strong Republican",
str_trim(partyid) == "other pa" ~ "Other Party"
)
)
attain_clean |> count(party)# A tibble: 7 × 2
party n
<chr> <int>
1 Independent 369
2 Independent (Leaning) 623
3 Not Strong Party 1163
4 Other Party 44
5 Strong Democrat 423
6 Strong Republican 321
7 <NA> 49
A quick bar chart to visualize the distribution. We use fct_relevel() to fix the order left to right on the political spectrum, and filter() to drop NA and "Other Party" before plotting.
attain_clean |>
filter(!is.na(party), party != "Other Party") |>
mutate(party = fct_relevel(
party,
"Strong Democrat",
"Independent (Leaning)",
"Not Strong Party", # ambiguous: bundles weak D and weak R
"Independent",
"Strong Republican"
)) |>
ggplot(aes(x = party, fill = party)) +
geom_bar() +
geom_text(
stat = "count",
aes(label = after_stat(count)),
vjust = -0.5, # nudge labels just above the top of each bar
size = 3.5
) +
scale_fill_manual(values = c(
"Strong Democrat" = "#1259A5",
"Independent (Leaning)" = "#8EB4DA",
"Not Strong Party" = "#9E9E9E",
"Independent" = "#E08C8C",
"Strong Republican" = "#C02020"
)) +
scale_y_continuous(breaks = seq(0, 1000, by = 200)) +
labs(
title = "Party Identification",
x = NULL,
y = "Count"
) +
theme_minimal() +
theme(
legend.position = "none",
axis.text.x = element_text(angle = 15, hjust = 1)
)Create Variables and Rename
This block handles two final cleanup tasks:
college_years— years of education beyond high school. For respondents with 12 or fewer years we assign 0;NAvalues ineducpropagate asNAautomatically because no condition is satisfied.- Rename
maritaltomarital_statususingrename().
attain_clean <- attain_clean |>
mutate(
college_years = case_when(
educ > 12 ~ educ - 12, # years of schooling beyond a HS diploma
educ <= 12 ~ 0 # HS diploma or less: set college years to 0
)) |>
rename(marital_status = marital) |> # rename(new_name = old_name)
glimpse()Rows: 2,992
Columns: 14
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ marital_status <chr> "divorced", "married", "married", "married", "never ma"…
$ sex <chr> "female", "male", "female", "male", "female", "female",…
$ race <chr> "black", "black", "black", "white", "black", "black", "…
$ educ <dbl> 12, 12, 12, 8, 13, 15, 9, 12, 12, 12, 0, 19, 16, 16, 18…
$ paeduc <dbl> 12, NA, NA, NA, NA, NA, 12, 8, NA, 6, NA, 14, 0, 12, 19…
$ maeduc <dbl> 10, 12, NA, 5, 12, 20, NA, 0, 11, 6, 20, 16, 0, 16, 14,…
$ age <dbl> 33, 59, NA, 59, 21, 22, 40, 25, 41, 45, 52, 31, 55, 56,…
$ partyid <chr> "strong d", "not str", "ind,near", "not str", "strong d…
$ income91 <dbl> 11250, NA, 16250, 18750, 13750, 45000, 23750, 11250, 27…
$ rincom91 <dbl> NA, NA, 16250, 18750, NA, 11250, 23750, 11250, 18750, 1…
$ educ_cat <chr> "High School Graduate", "High School Graduate", "High S…
$ party <chr> "Strong Democrat", "Not Strong Party", "Independent (Le…
$ college_years <dbl> 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 7, 4, 4, 6, 4, 6, 4, 0…
Taking a Sample
If your dataset is very large, working with a random sample can speed up iteration during development. If your dataset is really large and it’s taking a long time to load on your computer, you can just use a smaller sample for this class. (Normally, I would not advise doing this, but it is fine for the course). slice_sample() lets you draw either a fixed number of rows or a proportion of the data. Always call set.seed() before sampling so your results are reproducible — anyone running the same code will get the same rows.
# Approach 1: sample a fixed number of rows
set.seed(42)
attain_sample_n <- attain_clean |>
slice_sample(n = 500)
# Approach 2: sample a proportion of rows (here 10%)
set.seed(42)
attain_sample_prop <- attain_clean |>
slice_sample(prop = 0.10)
# Compare sizes
nrow(attain_clean)[1] 2992
nrow(attain_sample_n)[1] 500
nrow(attain_sample_prop)[1] 299
For your final project, use the full attain_clean — sampling is just shown here as a technique for when datasets are too large to work with comfortably.
Notice that attain_sample_prop contains only a fraction of the full dataset (in this case, roughly 298 rows). When you run write_csv() in the next section, R writes exactly the object you pass to it — it won’t warn you if you accidentally save a sample instead of the full cleaned data. Before saving, always double-check which object you are passing: attain_clean, attain_sample_n, or attain_sample_prop? Running code out of order can make this easy to get wrong.
Save the Cleaned Data
Write the cleaned dataset back to data/ under a new name so the raw file is never overwritten. This is a core principle of reproducible research: always keep the original data untouched.
write_csv() from readr produces a clean, portable CSV with no row numbers.
write_csv(attain_clean, here("data", "attain_clean.csv"))From now on, any analysis script can start with read_csv(here("data", "attain_clean.csv")) and skip all the cleaning steps above.
Data Summary with skimr
skim() gives a thorough snapshot of the cleaned dataset: counts, missing rates, distributions, and a mini-histogram for each variable. This is a good final check before moving on to analysis.
skim(attain_clean)| Name | attain_clean |
| Number of rows | 2992 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| marital_status | 1 | 1.00 | 7 | 8 | 0 | 5 | 0 |
| sex | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| race | 0 | 1.00 | 5 | 5 | 0 | 3 | 0 |
| partyid | 49 | 0.98 | 7 | 8 | 0 | 6 | 0 |
| educ_cat | 7 | 1.00 | 12 | 21 | 0 | 6 | 0 |
| party | 49 | 0.98 | 11 | 21 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 1496.50 | 863.86 | 1 | 748.75 | 1496.5 | 2244.25 | 2992 | ▇▇▇▇▇ |
| educ | 7 | 1.00 | 13.16 | 2.97 | 0 | 12.00 | 13.0 | 16.00 | 20 | ▁▁▇▇▂ |
| paeduc | 837 | 0.72 | 11.07 | 4.20 | 0 | 8.00 | 12.0 | 14.00 | 20 | ▂▅▇▅▂ |
| maeduc | 432 | 0.86 | 11.08 | 3.46 | 0 | 9.00 | 12.0 | 12.00 | 20 | ▁▃▇▃▁ |
| age | 6 | 1.00 | 45.97 | 17.05 | 18 | 32.25 | 43.0 | 57.00 | 89 | ▆▇▆▃▂ |
| income91 | 356 | 0.88 | 37596.55 | 27456.35 | 500 | 16250.00 | 32500.0 | 55000.00 | 100000 | ▇▇▅▂▂ |
| rincom91 | 1032 | 0.66 | 25852.17 | 20799.05 | 500 | 11250.00 | 21250.0 | 32500.00 | 100000 | ▇▆▂▁▁ |
| college_years | 7 | 1.00 | 1.72 | 2.17 | 0 | 0.00 | 1.0 | 4.00 | 8 | ▇▂▂▁▁ |
Wrap-Up
In this lab you built a reusable data cleaning pipeline. Key functions covered:
| Task | Function |
|---|---|
| Standardize column names | clean_names() |
| Subset columns | select() |
| Recode categoricals | case_when() inside mutate() |
| Strip whitespace | str_trim() |
| Create derived variables | mutate() |
| Rename columns | rename() |
| Random sampling | slice_sample() |
| Save to disk | write_csv() |
| Summarize data | skim() |