Lab 1: Introduction to the Tidyverse

Author

Your Name Here

Published

March 25, 2026

Setup Instructions

Let’s practice using the folder structure you already set up:

  1. Download lab1.qmd from bCourses under “labs” > “Lab #1”
  2. Create a labs folder inside your soc106 folder
  3. Place lab1.qmd in the labs folder. Your folder structure should now look like this:
soc106/
├── _quarto.yml
├── data/
│   └── attain.csv
├── assignments/
│   └── hw0.qmd
└── labs/
    └── lab1.qmd
  4. Use the Explorer button on the left to find and open lab1.qmd

  5. Let’s work through it together!


Load Libraries and Data

First, we need to load the tidyverse and here packages.

# Run these in the console only if not already installed
#install.packages("tidyverse")
#install.packages("here")

Load the libraries:

# load libraries
library(tidyverse)
library(here)

Now, let’s import our data and glimpse it.

# Load data
attain <- read_csv(here("data", "attain.csv"))
New names:
• `` -> `...1`
Rows: 2992 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): wrkstat, marital, degree, sex, race, partfull, region, xnorcsiz, s...
dbl (20): ...1, id, hrs1, prestg80, agewed, papres80, mapres80, sibs, childs...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preview the data structure
attain |> glimpse()
Rows: 2,992
Columns: 44
$ ...1     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ wrkstat  <chr> "keeping", "working", "working", "working", "working", "worki…
$ hrs1     <dbl> NA, 40, 50, 32, 20, 20, 35, 45, 40, 40, 40, 65, 35, 38, 46, 4…
$ prestg80 <dbl> 46, 22, 29, 42, 36, 43, 20, 44, 42, 46, 30, 75, 51, 50, 73, 4…
$ marital  <chr> "divorced", "married", "married", "married", "never ma", "nev…
$ agewed   <dbl> NA, 20, 28, NA, NA, NA, NA, NA, NA, NA, NA, NA, 29, 22, NA, N…
$ papres80 <dbl> 41, NA, NA, NA, NA, NA, 20, 44, NA, 34, NA, 51, 40, 51, 75, 4…
$ mapres80 <dbl> NA, 28, 36, NA, NA, 34, 28, 44, 23, NA, NA, NA, NA, NA, 60, 6…
$ sibs     <dbl> 4, 4, 2, 6, 3, 3, 5, 8, 2, 5, NA, 1, 4, 1, 4, 2, 0, 0, 2, 2, …
$ childs   <dbl> 2, 3, 2, 3, 0, 0, 4, 1, 6, 1, 0, 0, 2, 2, 0, 0, 2, 2, 0, 0, 0…
$ age      <dbl> 33, 59, NA, 59, 21, 22, 40, 25, 41, 45, 52, 31, 55, 56, 36, 2…
$ agekdbrn <dbl> 21, NA, 25, 23, NA, NA, 17, 23, 17, 17, NA, NA, 29, 32, NA, N…
$ educ     <dbl> 12, 12, 12, 8, 13, 15, 9, 12, 12, 12, 0, 19, 16, 16, 18, 16, …
$ paeduc   <dbl> 12, NA, NA, NA, NA, NA, 12, 8, NA, 6, NA, 14, 0, 12, 19, 16, …
$ maeduc   <dbl> 10, 12, NA, 5, 12, 20, NA, 0, 11, 6, 20, 16, 0, 16, 14, 18, 1…
$ degree   <chr> "high sch", "high sch", "lt high", "lt high", "high sch", "hi…
$ sex      <chr> "female", "male", "female", "male", "female", "female", "fema…
$ race     <chr> "black", "black", "black", "white", "black", "black", "other"…
$ weekswrk <dbl> 0, 52, 52, 44, 30, 52, 52, 52, 52, 52, 40, 52, 52, 52, 52, 45…
$ partfull <chr> NA, "full-tim", "full-tim", "full-tim", "part-tim", "part-tim…
$ region   <chr> "middle a", "middle a", "middle a", "middle a", "middle a", "…
$ xnorcsiz <chr> "city gt", "city gt", "city gt", "city gt", "city gt", "city …
$ srcbelt  <chr> "12 lrgst", "12 lrgst", "12 lrgst", "12 lrgst", "12 lrgst", "…
$ size     <dbl> 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7…
$ partyid  <chr> "strong d", "not str", "ind,near", "not str", "strong d", "ot…
$ polviews <chr> "slightly", "moderate", "slghtly", "slightly", "conserva", "l…
$ relig    <chr> "protesta", "catholic", "protesta", "catholic", "protesta", "…
$ attend   <chr> "sevrl ti", "every we", "more thn", "once a y", "once a y", "…
$ satjob   <chr> "mod. sat", "very sat", "very sat", "mod. sat", "mod. sat", "…
$ class    <chr> "working", "working", "working", "working", "lower cl", "work…
$ satfin   <chr> "more or", "more or", "more or", "not at a", "not at a", "not…
$ finalter <chr> "better", "stayed s", "stayed s", "worse", "worse", "worse", …
$ finrela  <chr> "below av", "below av", "average", "average", "far belo", "be…
$ wksub    <chr> NA, NA, "no", NA, "yes", "yes", "yes", NA, "yes", "no", NA, N…
$ wksup    <chr> NA, NA, "no", NA, "no", "no", "no", NA, "no", "no", NA, NA, N…
$ unemp    <chr> "yes", "no", NA, "yes", "no", NA, NA, "no", NA, "yes", "yes",…
$ union    <chr> "neither", "neither", NA, "neither", "neither", NA, NA, "neit…
$ parsol   <chr> "somewhat", NA, NA, "somewhat", NA, NA, NA, "much bet", NA, "…
$ tvhours  <dbl> 2, 3, 1, 3, NA, 0, 10, 4, 2, NA, 2, 2, 2, NA, 1, 1, 0, 3, 3, …
$ dwelown  <chr> "pays ren", "pays ren", "pays ren", "own or i", NA, "pays ren…
$ wordsum  <dbl> NA, NA, 6, 5, NA, 8, 5, 1, 5, NA, 5, 9, 9, NA, 7, 10, 3, 8, N…
$ income91 <dbl> 11250, NA, 16250, 18750, 13750, 45000, 23750, 11250, 27500, 1…
$ rincom91 <dbl> NA, NA, 16250, 18750, NA, 11250, 23750, 11250, 18750, 18750, …

The attain dataset comes from the General Social Survey and contains information about respondents’ demographics, education, and family background.
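
The read_csv(here("data", "attain.csv")) call above relies on here() to build the file path from the project folder rather than from wherever the .qmd file happens to live. A minimal sketch of what here() returns, assuming the here package is installed (the exact root folder printed depends on your machine):

```r
library(here)

# here() joins its arguments into a path that starts at the project root
csv_path <- here("data", "attain.csv")

csv_path               # full path ending in data/attain.csv
file.exists(csv_path)  # TRUE only if the file is really in your data folder
```

If file.exists() returns FALSE, the usual cause is that the project was opened from a different folder than soc106.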


Using select()

The select() function allows you to choose specific columns from a data frame.

Syntax: select(column1, column2, ...)

  • List the column names you want to keep, separated by commas
  • Column names don’t need quotes
  • The order you list them is the order they’ll appear
# Select education-related variables
attain |>
  select(educ, paeduc, maeduc) |>
  head(5)
# A tibble: 5 × 3
   educ paeduc maeduc
  <dbl>  <dbl>  <dbl>
1    12     12     10
2    12     NA     12
3    12     NA     NA
4     8     NA      5
5    13     NA     12
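
Besides listing the columns to keep, select() also accepts a minus sign to drop a column and a colon to keep a consecutive run of columns. The sketch below uses a small made-up table called toy (hypothetical, not the real attain data) so the result is easy to check by eye:

```r
library(dplyr)  # attached automatically by library(tidyverse)

# Hypothetical toy data, not the real attain file
toy <- tibble(
  id     = 1:3,
  educ   = c(12, 16, 8),
  paeduc = c(12, NA, 10),
  sex    = c("female", "male", "female")
)

# Drop one column with a minus sign; the others keep their order
no_id <- toy |> select(-id)
names(no_id)       # "educ" "paeduc" "sex"

# Keep a consecutive run of columns with the colon operator
educ_block <- toy |> select(educ:paeduc)
names(educ_block)  # "educ" "paeduc"
```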

Try it yourself: Modify the code above to select different columns like age, sex, race, and marital.

# Your code here

Using slice()

The slice() function selects specific rows by their position (row number).

Syntax: slice(row_numbers)

  • Use a single number for one row: slice(5)
  • Use : to specify a range: slice(50:55) means rows 50 through 55
  • Use c() for non-consecutive rows: slice(c(1, 5, 10))
# Get rows 50 through 55
attain |>
  slice(50:55)
# A tibble: 6 × 44
   ...1    id wrkstat  hrs1 prestg80 marital  agewed papres80 mapres80  sibs
  <dbl> <dbl> <chr>   <dbl>    <dbl> <chr>     <dbl>    <dbl>    <dbl> <dbl>
1    50    50 keeping    NA       52 separate     NA       NA       NA     8
2    51    51 keeping    NA       46 widowed      16       40       66    14
3    52    52 working    70       42 never ma     NA       30       42     5
4    53    53 retired    NA       50 widowed      19       40       NA    14
5    54    54 keeping    NA       33 married      22       51       NA     1
6    55    55 working    48       51 never ma     NA       51       NA     3
# ℹ 34 more variables: childs <dbl>, age <dbl>, agekdbrn <dbl>, educ <dbl>,
#   paeduc <dbl>, maeduc <dbl>, degree <chr>, sex <chr>, race <chr>,
#   weekswrk <dbl>, partfull <chr>, region <chr>, xnorcsiz <chr>,
#   srcbelt <chr>, size <dbl>, partyid <chr>, polviews <chr>, relig <chr>,
#   attend <chr>, satjob <chr>, class <chr>, satfin <chr>, finalter <chr>,
#   finrela <chr>, wksub <chr>, wksup <chr>, unemp <chr>, union <chr>,
#   parsol <chr>, tvhours <dbl>, dwelown <chr>, wordsum <dbl>, …
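
The c() form from the list above, plus negative positions (which drop rows instead of keeping them), can be sketched on a made-up ten-row table. The toy table is hypothetical and only stands in for attain:

```r
library(dplyr)

# Hypothetical toy data with ten rows
toy <- tibble(id = 1:10, age = seq(20, 65, by = 5))

# Non-consecutive rows: the 1st, 5th, and 10th
picked <- toy |> slice(c(1, 5, 10))
picked$id     # 1 5 10

# Negative positions drop rows: remove rows 1 through 8
last_two <- toy |> slice(-(1:8))
last_two$id   # 9 10
```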

Try it yourself: Modify the code to get the first 10 rows of the dataset.

# Your code here

Using filter()

The filter() function selects rows based on conditions.

Syntax: filter(condition)

  • Use logical operators to create conditions:
    • == equal to (note: two equals signs!)
    • != not equal to
    • >, <, >=, <= for comparisons
  • For text values, put them in quotes: filter(marital == "married")
  • For numbers, no quotes needed: filter(age > 50)

Filtering with numeric variables

# Find respondents over age 50
attain |>
  filter(age > 50) |>
  head(5)
# A tibble: 5 × 44
   ...1    id wrkstat  hrs1 prestg80 marital  agewed papres80 mapres80  sibs
  <dbl> <dbl> <chr>   <dbl>    <dbl> <chr>     <dbl>    <dbl>    <dbl> <dbl>
1     2     2 working    40       22 married      20       NA       28     4
2     4     4 working    32       42 married      NA       NA       NA     6
3    11    11 working    40       30 never ma     NA       NA       NA    NA
4    13    13 working    35       51 married      29       40       NA     4
5    14    14 working    38       50 married      22       51       NA     1
# ℹ 34 more variables: childs <dbl>, age <dbl>, agekdbrn <dbl>, educ <dbl>,
#   paeduc <dbl>, maeduc <dbl>, degree <chr>, sex <chr>, race <chr>,
#   weekswrk <dbl>, partfull <chr>, region <chr>, xnorcsiz <chr>,
#   srcbelt <chr>, size <dbl>, partyid <chr>, polviews <chr>, relig <chr>,
#   attend <chr>, satjob <chr>, class <chr>, satfin <chr>, finalter <chr>,
#   finrela <chr>, wksub <chr>, wksup <chr>, unemp <chr>, union <chr>,
#   parsol <chr>, tvhours <dbl>, dwelown <chr>, wordsum <dbl>, …
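
Notice that respondent 3, whose age is NA in the glimpse() output, does not appear in the result. filter() keeps only rows where the condition is TRUE, and a comparison with NA is NA, not TRUE. A small sketch with made-up data (the toy table is hypothetical):

```r
library(dplyr)

# Hypothetical toy data with one missing age
toy <- tibble(id = 1:4, age = c(59, NA, 21, 52))

# NA > 50 evaluates to NA, so row 2 is dropped
over_50 <- toy |> filter(age > 50)
over_50$id       # 1 4

# The opposite condition drops row 2 as well
not_over_50 <- toy |> filter(!(age > 50))
not_over_50$id   # 3
```

The two results together hold three rows, not four, so a row with a missing value can silently vanish from both sides of a comparison.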

Try it yourself: Find all respondents who are under age 30.

# Your code here

Filtering with character variables

You can also filter on character (text) variables using ==:

# Find married respondents
attain |>
  filter(marital == "married") |>
  head(5)
# A tibble: 5 × 44
   ...1    id wrkstat  hrs1 prestg80 marital agewed papres80 mapres80  sibs
  <dbl> <dbl> <chr>   <dbl>    <dbl> <chr>    <dbl>    <dbl>    <dbl> <dbl>
1     2     2 working    40       22 married     20       NA       28     4
2     3     3 working    50       29 married     28       NA       36     2
3     4     4 working    32       42 married     NA       NA       NA     6
4    13    13 working    35       51 married     29       40       NA     4
5    14    14 working    38       50 married     22       51       NA     1
# ℹ 34 more variables: childs <dbl>, age <dbl>, agekdbrn <dbl>, educ <dbl>,
#   paeduc <dbl>, maeduc <dbl>, degree <chr>, sex <chr>, race <chr>,
#   weekswrk <dbl>, partfull <chr>, region <chr>, xnorcsiz <chr>,
#   srcbelt <chr>, size <dbl>, partyid <chr>, polviews <chr>, relig <chr>,
#   attend <chr>, satjob <chr>, class <chr>, satfin <chr>, finalter <chr>,
#   finrela <chr>, wksub <chr>, wksup <chr>, unemp <chr>, union <chr>,
#   parsol <chr>, tvhours <dbl>, dwelown <chr>, wordsum <dbl>, …

Try it yourself: Now find all divorced respondents.

# Your code here

Combining Functions

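One of the most useful habits in tidyverse code is chaining multiple operations together with R’s native pipe (|>). The pipe passes the result on its left to the function on its right as that function’s first argument, so the steps read from top to bottom: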

# Find respondents over 40, select key demographics
attain |>
  filter(age > 40) |>
  select(age, sex, educ, marital) |>
  slice(1:6)
# A tibble: 6 × 4
    age sex     educ marital 
  <dbl> <chr>  <dbl> <chr>   
1    59 male      12 married 
2    59 male       8 married 
3    41 female    12 widowed 
4    45 female    12 divorced
5    52 female     0 never ma
6    55 male      16 married 
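
To see what the pipe is doing, it can help to write the same steps as nested function calls, which have to be read from the inside out. A sketch on made-up data (the toy table is hypothetical):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  age  = c(59, 33, 41, 45),
  sex  = c("male", "female", "female", "female"),
  educ = c(12, 12, 12, 16)
)

# Piped version: read top to bottom
piped <- toy |>
  filter(age > 40) |>
  select(age, educ) |>
  slice(1:2)

# Nested version: same steps, read from the innermost call outward
nested <- slice(select(filter(toy, age > 40), age, educ), 1:2)

all.equal(piped, nested)  # the two versions give the same table
```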

Try it yourself: Chain together functions to find married respondents, select their age and education, and show the first 10 rows.

# Your code here

Using mutate()

The mutate() function creates new variables based on existing ones.

Syntax: mutate(new_column_name = expression)

  • Left side of = is the name of your new column
  • Right side is the calculation or transformation
  • You can use math operations: +, -, *, /
  • You can reference existing columns by name
# Create a variable for years of education relative to 12 (high school);
# negative values mean fewer than 12 years
attain |>
  mutate(college_years = educ - 12) |>
  select(educ, college_years) |>
  head(5)
# A tibble: 5 × 2
   educ college_years
  <dbl>         <dbl>
1    12             0
2    12             0
3    12             0
4     8            -4
5    13             1
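
Row 4 above shows that educ - 12 goes negative for anyone with fewer than 12 years of schooling. If you would rather treat those cases as zero, one option (an alternative sketch, not part of the lab's own code) is base R's pmax(), which takes the larger of two values row by row. The toy table is hypothetical:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(educ = c(12, 8, 13, 16))

result <- toy |>
  mutate(
    college_years       = educ - 12,           # can be negative
    college_years_floor = pmax(educ - 12, 0)   # negatives become 0
  )

result$college_years        # 0 -4 1 4
result$college_years_floor  # 0 0 1 4
```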

Try it yourself: Create a new variable that calculates how many years ago the respondent got married (hint: use age and agewed).

# Your code here

Using rename()

The rename() function changes column names.

Syntax: rename(new_name = old_name)

  • Left side of = is the NEW name you want
  • Right side is the OLD name that currently exists
  • Think of it as: “new_name gets old_name”
# Rename a column
attain |>
  rename(years_of_education = educ) |>
  select(years_of_education, age) |>
  head(3)
# A tibble: 3 × 2
  years_of_education   age
               <dbl> <dbl>
1                 12    33
2                 12    59
3                 12    NA
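
Like the other functions in this lab, rename() returns a new table and leaves the original untouched. To keep the change, assign the result to a name with <-. A sketch on made-up data (toy and toy_renamed are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(educ = c(12, 16), age = c(33, 59))

# Without assignment the renamed table is only printed
toy |> rename(years_of_education = educ)
names(toy)          # still "educ" "age"

# With assignment the renamed table is stored
toy_renamed <- toy |> rename(years_of_education = educ)
names(toy_renamed)  # "years_of_education" "age"
```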

Try it yourself: Rename the marital column to marital_status.

# Your code here

Handling Missing Data with filter()

Missing values in R are represented as NA (Not Available).

Syntax:

  • is.na(column) — returns TRUE if the value is missing
  • !is.na(column) — returns TRUE if the value is NOT missing (the ! means “not”)
  • Use these inside filter() to keep or remove rows with missing data
# Count how many have missing father's education data
attain |>
  filter(is.na(paeduc)) |>
  nrow()
[1] 837
# Keep only respondents with non-missing father's education
attain |>
  filter(!is.na(paeduc)) |>
  glimpse()
Rows: 2,155
Columns: 44
$ ...1     <dbl> 1, 7, 8, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, …
$ id       <dbl> 1, 7, 8, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, …
$ wrkstat  <chr> "keeping", "working", "working", "working", "working", "worki…
$ hrs1     <dbl> NA, 35, 45, 40, 65, 35, 38, 46, 40, 40, NA, NA, 10, 5, 25, 75…
$ prestg80 <dbl> 46, 20, 44, 46, 75, 51, 50, 73, 41, 64, 59, 35, 46, 47, 74, 4…
$ marital  <chr> "divorced", "widowed", "never ma", "divorced", "never ma", "m…
$ agewed   <dbl> NA, NA, NA, NA, NA, 29, 22, NA, NA, 32, NA, NA, NA, NA, NA, N…
$ papres80 <dbl> 41, 20, 44, 34, 51, 40, 51, 75, 46, 50, 59, 75, 50, 51, 44, 5…
$ mapres80 <dbl> NA, 28, 44, NA, NA, NA, NA, 60, 66, NA, 30, NA, NA, NA, NA, 2…
$ sibs     <dbl> 4, 5, 8, 5, 1, 4, 1, 4, 2, 0, 0, 2, 2, 3, 2, 1, 1, 4, 6, 3, 8…
$ childs   <dbl> 2, 4, 1, 1, 0, 2, 2, 0, 0, 2, 2, 0, 0, 0, 1, 0, 0, 0, 8, 2, 3…
$ age      <dbl> 33, 40, 25, 45, 31, 55, 56, 36, 25, 52, 39, 71, 36, 26, 51, 5…
$ agekdbrn <dbl> 21, 17, 23, 17, NA, 29, 32, NA, NA, NA, 34, NA, NA, NA, 35, N…
$ educ     <dbl> 12, 9, 12, 12, 19, 16, 16, 18, 16, 18, 16, 12, 16, 16, 20, 12…
$ paeduc   <dbl> 12, 12, 8, 6, 14, 0, 12, 19, 16, 13, 16, 20, 16, 12, 20, 2, 2…
$ maeduc   <dbl> 10, NA, 0, 6, 16, 0, 16, 14, 18, 16, 14, 12, 14, NA, 16, 6, 1…
$ degree   <chr> "high sch", "lt high", "high sch", "high sch", "graduate", "b…
$ sex      <chr> "female", "female", "male", "female", "female", "male", "male…
$ race     <chr> "black", "other", "black", "black", "white", "white", "white"…
$ weekswrk <dbl> 0, 52, 52, 52, 52, 52, 52, 52, 45, 52, 0, 0, 0, 52, 52, 52, 0…
$ partfull <chr> NA, "full-tim", "part-tim", "full-tim", "full-tim", "full-tim…
$ region   <chr> "middle a", "middle a", "middle a", "middle a", "middle a", "…
$ xnorcsiz <chr> "city gt", "city gt", "city gt", "city gt", "city gt", "city …
$ srcbelt  <chr> "12 lrgst", "12 lrgst", "12 lrgst", "12 lrgst", "12 lrgst", "…
$ size     <dbl> 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7…
$ partyid  <chr> "strong d", "strong d", "strong d", "not str", "not str", "st…
$ polviews <chr> "slightly", "moderate", "conserva", "moderate", "slghtly", "c…
$ relig    <chr> "protesta", "catholic", "other", "protesta", "jewish", "other…
$ attend   <chr> "sevrl ti", "more thn", "more thn", "more thn", "once a y", "…
$ satjob   <chr> "mod. sat", "mod. sat", "very sat", "very sat", "very sat", "…
$ class    <chr> "working", "working", "working", "working", "middle c", "work…
$ satfin   <chr> "more or", "not at a", "satisfie", "not at a", "satisfie", "m…
$ finalter <chr> "better", "worse", "worse", "worse", "better", "worse", "stay…
$ finrela  <chr> "below av", "below av", "average", "below av", "average", "av…
$ wksub    <chr> NA, "yes", NA, "no", NA, NA, "yes", "yes", "yes", NA, "yes", …
$ wksup    <chr> NA, "no", NA, "no", NA, NA, "yes", "yes", "no", NA, "yes", NA…
$ unemp    <chr> "yes", NA, "no", "yes", "no", "no", "no", NA, NA, "no", NA, N…
$ union    <chr> "neither", NA, "neither", "neither", "neither", "neither", "n…
$ parsol   <chr> "somewhat", NA, "much bet", "somewhat", "about th", "much bet…
$ tvhours  <dbl> 2, 10, 4, NA, 2, 2, NA, 1, 1, 0, 3, 3, 2, 3, NA, 4, NA, 1, 8,…
$ dwelown  <chr> "pays ren", "pays ren", "pays ren", NA, "own or i", "own or i…
$ wordsum  <dbl> NA, 5, 1, NA, 9, 9, NA, 7, 10, 3, 8, NA, 9, 9, NA, 2, NA, 9, …
$ income91 <dbl> 11250, 23750, 11250, 18750, 45000, 32500, 100000, 45000, 2750…
$ rincom91 <dbl> NA, 23750, 11250, 18750, 45000, 32500, 100000, 45000, NA, 450…

Try it yourself: Count how many respondents have missing data for agewed (age at first marriage).

# Your code here

Combining Logical Conditions

You can combine multiple conditions inside filter():

Syntax:

  • & (and) — both conditions must be true
  • | (or) — at least one condition must be true
# Find respondents with non-missing data for BOTH educ AND paeduc
attain |>
  filter(!is.na(educ) & !is.na(paeduc)) |>
  select(educ, paeduc) |>
  head(5)
# A tibble: 5 × 2
   educ paeduc
  <dbl>  <dbl>
1    12     12
2     9     12
3    12      8
4    12      6
5    19     14

You can also use | (or) to match any of several conditions:

# Find respondents who are either divorced OR widowed
attain |>
  filter(marital == "divorced" | marital == "widowed") |>
  select(age, marital) |>
  head(5)
# A tibble: 5 × 2
    age marital 
  <dbl> <chr>   
1    33 divorced
2    40 widowed 
3    41 widowed 
4    45 divorced
5    58 divorced
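
When several | conditions all test the same column, base R's %in% operator is a shorter way to write them. This is an optional alternative, sketched here on made-up data (the toy table is hypothetical):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  age     = c(33, 59, 40, 21),
  marital = c("divorced", "married", "widowed", "never ma")
)

# Two conditions joined with | (or)
with_or <- toy |>
  filter(marital == "divorced" | marital == "widowed")

# Same rows using %in% with a vector of accepted values
with_in <- toy |>
  filter(marital %in% c("divorced", "widowed"))

with_or$age  # 33 40
with_in$age  # 33 40
```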

Try it yourself: Find respondents who are over 40 AND have more than 12 years of education.

# Your code here

Summary

In this lab, you learned how to use these key tidyverse functions:

Function              Purpose
select()              Choose columns
slice()               Choose rows by position
filter()              Choose rows by condition
mutate()              Create new variables
rename()              Rename columns
is.na() / !is.na()    Check for missing values

You also learned how to chain these functions together using the pipe (|>) to create readable data manipulation workflows.