# Run these in the console only if not already installed
#install.packages("tidyverse")
#install.packages("here")
Lab 1: Introduction to the Tidyverse
Setup Instructions
Let’s practice using the folder structure you already set up:
- Download lab1.qmd from bCourse under “labs” > “Lab #1”
- Create a labs folder inside your soc106 folder
- Place lab1.qmd in the labs folder. Your folder structure should now look like this:
soc106/
├── _quarto.yml
├── data/
│   └── attain.csv
├── assignments/
│   └── hw0.qmd
└── labs/
    └── lab1.qmd
Use the Explorer button on the left to find and open lab1.qmd.
Let’s work through it together!
Load Libraries and Data
First, we need to load the tidyverse and here packages.
Load the libraries:
# load libraries
library(tidyverse)
library(here)
Now, let’s import our data and glimpse it.
# Load data
attain <- read_csv(here("data", "attain.csv"))
New names:
• `` -> `...1`
Rows: 2992 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): wrkstat, marital, degree, sex, race, partfull, region, xnorcsiz, s...
dbl (20): ...1, id, hrs1, prestg80, agewed, papres80, mapres80, sibs, childs...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preview the data structure
attain |> glimpse()
Rows: 2,992
Columns: 44
$ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ wrkstat <chr> "keeping", "working", "working", "working", "working", "worki…
$ hrs1 <dbl> NA, 40, 50, 32, 20, 20, 35, 45, 40, 40, 40, 65, 35, 38, 46, 4…
$ prestg80 <dbl> 46, 22, 29, 42, 36, 43, 20, 44, 42, 46, 30, 75, 51, 50, 73, 4…
$ marital <chr> "divorced", "married", "married", "married", "never ma", "nev…
$ agewed <dbl> NA, 20, 28, NA, NA, NA, NA, NA, NA, NA, NA, NA, 29, 22, NA, N…
$ papres80 <dbl> 41, NA, NA, NA, NA, NA, 20, 44, NA, 34, NA, 51, 40, 51, 75, 4…
$ mapres80 <dbl> NA, 28, 36, NA, NA, 34, 28, 44, 23, NA, NA, NA, NA, NA, 60, 6…
$ sibs <dbl> 4, 4, 2, 6, 3, 3, 5, 8, 2, 5, NA, 1, 4, 1, 4, 2, 0, 0, 2, 2, …
$ childs <dbl> 2, 3, 2, 3, 0, 0, 4, 1, 6, 1, 0, 0, 2, 2, 0, 0, 2, 2, 0, 0, 0…
$ age <dbl> 33, 59, NA, 59, 21, 22, 40, 25, 41, 45, 52, 31, 55, 56, 36, 2…
$ agekdbrn <dbl> 21, NA, 25, 23, NA, NA, 17, 23, 17, 17, NA, NA, 29, 32, NA, N…
$ educ <dbl> 12, 12, 12, 8, 13, 15, 9, 12, 12, 12, 0, 19, 16, 16, 18, 16, …
$ paeduc <dbl> 12, NA, NA, NA, NA, NA, 12, 8, NA, 6, NA, 14, 0, 12, 19, 16, …
$ maeduc <dbl> 10, 12, NA, 5, 12, 20, NA, 0, 11, 6, 20, 16, 0, 16, 14, 18, 1…
$ degree <chr> "high sch", "high sch", "lt high", "lt high", "high sch", "hi…
$ sex <chr> "female", "male", "female", "male", "female", "female", "fema…
$ race <chr> "black", "black", "black", "white", "black", "black", "other"…
$ weekswrk <dbl> 0, 52, 52, 44, 30, 52, 52, 52, 52, 52, 40, 52, 52, 52, 52, 45…
$ partfull <chr> NA, "full-tim", "full-tim", "full-tim", "part-tim", "part-tim…
$ region <chr> "middle a", "middle a", "middle a", "middle a", "middle a", "…
$ xnorcsiz <chr> "city gt", "city gt", "city gt", "city gt", "city gt", "city …
$ srcbelt <chr> "12 lrgst", "12 lrgst", "12 lrgst", "12 lrgst", "12 lrgst", "…
$ size <dbl> 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7…
$ partyid <chr> "strong d", "not str", "ind,near", "not str", "strong d", "ot…
$ polviews <chr> "slightly", "moderate", "slghtly", "slightly", "conserva", "l…
$ relig <chr> "protesta", "catholic", "protesta", "catholic", "protesta", "…
$ attend <chr> "sevrl ti", "every we", "more thn", "once a y", "once a y", "…
$ satjob <chr> "mod. sat", "very sat", "very sat", "mod. sat", "mod. sat", "…
$ class <chr> "working", "working", "working", "working", "lower cl", "work…
$ satfin <chr> "more or", "more or", "more or", "not at a", "not at a", "not…
$ finalter <chr> "better", "stayed s", "stayed s", "worse", "worse", "worse", …
$ finrela <chr> "below av", "below av", "average", "average", "far belo", "be…
$ wksub <chr> NA, NA, "no", NA, "yes", "yes", "yes", NA, "yes", "no", NA, N…
$ wksup <chr> NA, NA, "no", NA, "no", "no", "no", NA, "no", "no", NA, NA, N…
$ unemp <chr> "yes", "no", NA, "yes", "no", NA, NA, "no", NA, "yes", "yes",…
$ union <chr> "neither", "neither", NA, "neither", "neither", NA, NA, "neit…
$ parsol <chr> "somewhat", NA, NA, "somewhat", NA, NA, NA, "much bet", NA, "…
$ tvhours <dbl> 2, 3, 1, 3, NA, 0, 10, 4, 2, NA, 2, 2, 2, NA, 1, 1, 0, 3, 3, …
$ dwelown <chr> "pays ren", "pays ren", "pays ren", "own or i", NA, "pays ren…
$ wordsum <dbl> NA, NA, 6, 5, NA, 8, 5, 1, 5, NA, 5, 9, 9, NA, 7, 10, 3, 8, N…
$ income91 <dbl> 11250, NA, 16250, 18750, 13750, 45000, 23750, 11250, 27500, 1…
$ rincom91 <dbl> NA, NA, 16250, 18750, NA, 11250, 23750, 11250, 18750, 18750, …
The attain dataset comes from the General Social Survey and contains information about respondents’ demographics, education, and family background.
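The `...1` column in the output above is the name readr assigns when the first header cell of a CSV is empty, which often happens when row numbers were saved along with the data. The following is a minimal sketch of that behavior, not the actual attain.csv file: the inline CSV text and the object name `toy` are invented for illustration.

```r
library(readr)

# Hypothetical inline CSV whose first header cell is empty
csv_text <- ",id,age\n1,1,33\n2,2,59\n"

# I() tells read_csv to treat the string as literal data, not a file path
toy <- read_csv(I(csv_text), show_col_types = FALSE)

# The unnamed first column is repaired to `...1`
names(toy)
```

Because `...1` simply repeats the row number, you can usually ignore it in this lab.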
Using select()
The select() function allows you to choose specific columns from a dataframe.
Syntax: select(column1, column2, ...)
- List the column names you want to keep, separated by commas
- Column names don’t need quotes
- The order you list them is the order they’ll appear
# Select education-related variables
attain |>
  select(educ, paeduc, maeduc) |>
  head(5)
# A tibble: 5 × 3
educ paeduc maeduc
<dbl> <dbl> <dbl>
1 12 12 10
2 12 NA 12
3 12 NA NA
4 8 NA 5
5 13 NA 12
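To check the ordering rule without the full dataset, here is a small sketch on a made-up tibble. The `toy` data and its values are hypothetical and are not taken from attain. It shows that the result follows the order in which you list the columns, not the order in the data.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(
  age  = c(33, 59, 21),
  sex  = c("female", "male", "female"),
  educ = c(12, 12, 13)
)

# Columns come back in the order listed, not the order in the data
reordered <- toy |>
  select(educ, age)

names(reordered)
```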
Try it yourself: Modify the code above to select different columns like age, sex, race, and marital.
# Your code here
Using slice()
The slice() function selects specific rows by their position (row number).
Syntax: slice(row_numbers)
- Use a single number for one row: slice(5)
- Use : to specify a range: slice(50:55) means rows 50 through 55
- Use c() for non-consecutive rows: slice(c(1, 5, 10))
# Get rows 50 through 55
attain |>
  slice(50:55)
# A tibble: 6 × 44
...1 id wrkstat hrs1 prestg80 marital agewed papres80 mapres80 sibs
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 50 50 keeping NA 52 separate NA NA NA 8
2 51 51 keeping NA 46 widowed 16 40 66 14
3 52 52 working 70 42 never ma NA 30 42 5
4 53 53 retired NA 50 widowed 19 40 NA 14
5 54 54 keeping NA 33 married 22 51 NA 1
6 55 55 working 48 51 never ma NA 51 NA 3
# ℹ 34 more variables: childs <dbl>, age <dbl>, agekdbrn <dbl>, educ <dbl>,
# paeduc <dbl>, maeduc <dbl>, degree <chr>, sex <chr>, race <chr>,
# weekswrk <dbl>, partfull <chr>, region <chr>, xnorcsiz <chr>,
# srcbelt <chr>, size <dbl>, partyid <chr>, polviews <chr>, relig <chr>,
# attend <chr>, satjob <chr>, class <chr>, satfin <chr>, finalter <chr>,
# finrela <chr>, wksub <chr>, wksup <chr>, unemp <chr>, union <chr>,
# parsol <chr>, tvhours <dbl>, dwelown <chr>, wordsum <dbl>, …
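The range form is shown above; the c() form for non-consecutive rows works the same way. This is a sketch on a hypothetical ten-row tibble (`toy` and its values are made up), so you can see exactly which rows come back.

```r
library(dplyr)

# Hypothetical toy data with an id that matches the row position
toy <- tibble(
  id  = 1:10,
  age = c(33, 59, 21, 22, 40, 25, 41, 45, 52, 31)
)

# Pick rows 1, 5, and 10 by position
picked <- toy |>
  slice(c(1, 5, 10))

picked$id
```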
Try it yourself: Modify the code to get the first 10 rows of the dataset.
# Your code here
Using filter()
The filter() function selects rows based on conditions.
Syntax: filter(condition)
- Use logical operators to create conditions:
  - == equal to (note: two equals signs!)
  - != not equal to
  - >, <, >=, <= for comparisons
- For text values, put them in quotes: filter(marital == "married")
- For numbers, no quotes needed: filter(age > 50)
Filtering with numeric variables
# Find respondents over age 50
attain |>
  filter(age > 50) |>
  head(5)
# A tibble: 5 × 44
...1 id wrkstat hrs1 prestg80 marital agewed papres80 mapres80 sibs
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 2 working 40 22 married 20 NA 28 4
2 4 4 working 32 42 married NA NA NA 6
3 11 11 working 40 30 never ma NA NA NA NA
4 13 13 working 35 51 married 29 40 NA 4
5 14 14 working 38 50 married 22 51 NA 1
# ℹ 34 more variables: childs <dbl>, age <dbl>, agekdbrn <dbl>, educ <dbl>,
# paeduc <dbl>, maeduc <dbl>, degree <chr>, sex <chr>, race <chr>,
# weekswrk <dbl>, partfull <chr>, region <chr>, xnorcsiz <chr>,
# srcbelt <chr>, size <dbl>, partyid <chr>, polviews <chr>, relig <chr>,
# attend <chr>, satjob <chr>, class <chr>, satfin <chr>, finalter <chr>,
# finrela <chr>, wksub <chr>, wksup <chr>, unemp <chr>, union <chr>,
# parsol <chr>, tvhours <dbl>, dwelown <chr>, wordsum <dbl>, …
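Notice that row 3, whose age is NA in the glimpse output, does not appear in the result. filter() keeps only rows where the condition is TRUE, so rows where the comparison evaluates to NA are dropped. A minimal sketch on hypothetical data (`toy` is made up, loosely mirroring the first few ages):

```r
library(dplyr)

# Hypothetical toy data; the third respondent has a missing age
toy <- tibble(
  id  = 1:5,
  age = c(33, 59, NA, 59, 21)
)

# NA > 50 is NA, not TRUE, so the row with the missing age is dropped
over_50 <- toy |>
  filter(age > 50)

over_50$id
```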
Try it yourself: Find all respondents who are under age 30.
# Your code here
Filtering with character variables
You can also filter on character (text) variables using ==:
# Find married respondents
attain |>
  filter(marital == "married") |>
  head(5)
# A tibble: 5 × 44
...1 id wrkstat hrs1 prestg80 marital agewed papres80 mapres80 sibs
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 2 working 40 22 married 20 NA 28 4
2 3 3 working 50 29 married 28 NA 36 2
3 4 4 working 32 42 married NA NA NA 6
4 13 13 working 35 51 married 29 40 NA 4
5 14 14 working 38 50 married 22 51 NA 1
# ℹ 34 more variables: childs <dbl>, age <dbl>, agekdbrn <dbl>, educ <dbl>,
# paeduc <dbl>, maeduc <dbl>, degree <chr>, sex <chr>, race <chr>,
# weekswrk <dbl>, partfull <chr>, region <chr>, xnorcsiz <chr>,
# srcbelt <chr>, size <dbl>, partyid <chr>, polviews <chr>, relig <chr>,
# attend <chr>, satjob <chr>, class <chr>, satfin <chr>, finalter <chr>,
# finrela <chr>, wksub <chr>, wksup <chr>, unemp <chr>, union <chr>,
# parsol <chr>, tvhours <dbl>, dwelown <chr>, wordsum <dbl>, …
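Matching with == is exact: the text in quotes has to equal the stored value character for character. In the glimpse output, some marital values appear shortened (for example "never ma"), so the full phrase would match nothing. The sketch below uses a hypothetical `toy` tibble to show the difference.

```r
library(dplyr)

# Hypothetical toy data using the shortened labels seen in the glimpse output
toy <- tibble(
  id      = 1:4,
  marital = c("married", "never ma", "divorced", "never ma")
)

# The full phrase is not how the value is stored, so nothing matches
full_phrase <- toy |>
  filter(marital == "never married")

# The stored (shortened) label matches two rows
stored_label <- toy |>
  filter(marital == "never ma")

nrow(full_phrase)
nrow(stored_label)
```

If a character filter returns zero rows, checking the exact spelling of the stored values is a good first step.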
Try it yourself: Now find the divorced respondents.
# Your code here
Combining Functions
One of the most powerful features of the tidyverse is chaining multiple operations together with the pipe (|>):
# Find respondents over 40, select key demographics
attain |>
  filter(age > 40) |>
  select(age, sex, educ, marital) |>
  slice(1:6)
# A tibble: 6 × 4
age sex educ marital
<dbl> <chr> <dbl> <chr>
1 59 male 12 married
2 59 male 8 married
3 41 female 12 widowed
4 45 female 12 divorced
5 52 female 0 never ma
6 55 male 16 married
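One way to see what the pipe does is to write the same steps as nested function calls. The sketch below uses a hypothetical `toy` tibble (values invented) and compares the two forms. Each `|>` passes the result on its left as the first argument of the function on its right.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(
  age  = c(59, 33, 41, 45),
  sex  = c("male", "female", "female", "female"),
  educ = c(12, 12, 12, 16)
)

# Piped version: read top to bottom
piped <- toy |>
  filter(age > 40) |>
  select(age, sex) |>
  slice(1:2)

# Same steps as nested calls: read from the inside out
nested <- slice(select(filter(toy, age > 40), age, sex), 1:2)

# Compare the two results
all.equal(piped, nested)
```

The piped version lists the steps in the order they happen, which is why it tends to be easier to read.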
Try it yourself: Chain together functions to find married respondents, select their age and education, and show the first 10 rows.
# Your code here
Using mutate()
The mutate() function creates new variables based on existing ones.
Syntax: mutate(new_column_name = expression)
- Left side of = is the name of your new column
- Right side is the calculation or transformation
- You can use math operations: +, -, *, /
- You can reference existing columns by name
# Create a variable for years of education beyond high school
attain |>
  mutate(college_years = educ - 12) |>
  select(educ, college_years) |>
  head(5)
# A tibble: 5 × 2
educ college_years
<dbl> <dbl>
1 12 0
2 12 0
3 12 0
4 8 -4
5 13 1
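Two details are easy to miss here: mutate() returns a new data frame rather than changing the original, and arithmetic on a missing value gives a missing result. A minimal sketch on hypothetical data (`toy` and `with_college` are made-up names):

```r
library(dplyr)

# Hypothetical toy data; the last respondent has missing education
toy <- tibble(educ = c(12, 8, 13, NA))

# Assign the result to keep the new column
with_college <- toy |>
  mutate(college_years = educ - 12)

# NA - 12 is NA, so the missing value carries through
with_college$college_years

# The original object is unchanged
names(toy)
```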
Try it yourself: Create a new variable that calculates how many years ago the respondent got married (hint: use age and agewed).
# Your code here
Using rename()
The rename() function changes column names.
Syntax: rename(new_name = old_name)
- Left side of = is the NEW name you want
- Right side is the OLD name that currently exists
- Think of it as: “new_name gets old_name”
# Rename a column
attain |>
  rename(years_of_education = educ) |>
  select(years_of_education, age) |>
  head(3)
# A tibble: 3 × 2
years_of_education age
<dbl> <dbl>
1 12 33
2 12 59
3 12 NA
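rename() also accepts several new = old pairs in one call and leaves every other column in place. The sketch below uses a hypothetical `toy` tibble, and the new names are chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(
  educ   = c(12, 9, 12),
  paeduc = c(12, 12, 8),
  maeduc = c(10, NA, 0)
)

# Rename two columns at once; educ keeps its name and position
renamed <- toy |>
  rename(
    fathers_education = paeduc,
    mothers_education = maeduc
  )

names(renamed)
```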
Try it yourself: Rename the marital column to marital_status.
# Your code here
Handling Missing Data with filter()
Missing values in R are represented as NA (Not Available).
Syntax:
- is.na(column): returns TRUE if the value is missing
- !is.na(column): returns TRUE if the value is NOT missing (the ! means “not”)
- Use these inside filter() to keep or remove rows with missing data
# Count how many have missing father's education data
attain |>
  filter(is.na(paeduc)) |>
  nrow()
[1] 837
# Keep only respondents with non-missing father's education
attain |>
  filter(!is.na(paeduc)) |>
  glimpse()
Rows: 2,155
Columns: 44
$ ...1 <dbl> 1, 7, 8, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, …
$ id <dbl> 1, 7, 8, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, …
$ wrkstat <chr> "keeping", "working", "working", "working", "working", "worki…
$ hrs1 <dbl> NA, 35, 45, 40, 65, 35, 38, 46, 40, 40, NA, NA, 10, 5, 25, 75…
$ prestg80 <dbl> 46, 20, 44, 46, 75, 51, 50, 73, 41, 64, 59, 35, 46, 47, 74, 4…
$ marital <chr> "divorced", "widowed", "never ma", "divorced", "never ma", "m…
$ agewed <dbl> NA, NA, NA, NA, NA, 29, 22, NA, NA, 32, NA, NA, NA, NA, NA, N…
$ papres80 <dbl> 41, 20, 44, 34, 51, 40, 51, 75, 46, 50, 59, 75, 50, 51, 44, 5…
$ mapres80 <dbl> NA, 28, 44, NA, NA, NA, NA, 60, 66, NA, 30, NA, NA, NA, NA, 2…
$ sibs <dbl> 4, 5, 8, 5, 1, 4, 1, 4, 2, 0, 0, 2, 2, 3, 2, 1, 1, 4, 6, 3, 8…
$ childs <dbl> 2, 4, 1, 1, 0, 2, 2, 0, 0, 2, 2, 0, 0, 0, 1, 0, 0, 0, 8, 2, 3…
$ age <dbl> 33, 40, 25, 45, 31, 55, 56, 36, 25, 52, 39, 71, 36, 26, 51, 5…
$ agekdbrn <dbl> 21, 17, 23, 17, NA, 29, 32, NA, NA, NA, 34, NA, NA, NA, 35, N…
$ educ <dbl> 12, 9, 12, 12, 19, 16, 16, 18, 16, 18, 16, 12, 16, 16, 20, 12…
$ paeduc <dbl> 12, 12, 8, 6, 14, 0, 12, 19, 16, 13, 16, 20, 16, 12, 20, 2, 2…
$ maeduc <dbl> 10, NA, 0, 6, 16, 0, 16, 14, 18, 16, 14, 12, 14, NA, 16, 6, 1…
$ degree <chr> "high sch", "lt high", "high sch", "high sch", "graduate", "b…
$ sex <chr> "female", "female", "male", "female", "female", "male", "male…
$ race <chr> "black", "other", "black", "black", "white", "white", "white"…
$ weekswrk <dbl> 0, 52, 52, 52, 52, 52, 52, 52, 45, 52, 0, 0, 0, 52, 52, 52, 0…
$ partfull <chr> NA, "full-tim", "part-tim", "full-tim", "full-tim", "full-tim…
$ region <chr> "middle a", "middle a", "middle a", "middle a", "middle a", "…
$ xnorcsiz <chr> "city gt", "city gt", "city gt", "city gt", "city gt", "city …
$ srcbelt <chr> "12 lrgst", "12 lrgst", "12 lrgst", "12 lrgst", "12 lrgst", "…
$ size <dbl> 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7323, 7…
$ partyid <chr> "strong d", "strong d", "strong d", "not str", "not str", "st…
$ polviews <chr> "slightly", "moderate", "conserva", "moderate", "slghtly", "c…
$ relig <chr> "protesta", "catholic", "other", "protesta", "jewish", "other…
$ attend <chr> "sevrl ti", "more thn", "more thn", "more thn", "once a y", "…
$ satjob <chr> "mod. sat", "mod. sat", "very sat", "very sat", "very sat", "…
$ class <chr> "working", "working", "working", "working", "middle c", "work…
$ satfin <chr> "more or", "not at a", "satisfie", "not at a", "satisfie", "m…
$ finalter <chr> "better", "worse", "worse", "worse", "better", "worse", "stay…
$ finrela <chr> "below av", "below av", "average", "below av", "average", "av…
$ wksub <chr> NA, "yes", NA, "no", NA, NA, "yes", "yes", "yes", NA, "yes", …
$ wksup <chr> NA, "no", NA, "no", NA, NA, "yes", "yes", "no", NA, "yes", NA…
$ unemp <chr> "yes", NA, "no", "yes", "no", "no", "no", NA, NA, "no", NA, N…
$ union <chr> "neither", NA, "neither", "neither", "neither", "neither", "n…
$ parsol <chr> "somewhat", NA, "much bet", "somewhat", "about th", "much bet…
$ tvhours <dbl> 2, 10, 4, NA, 2, 2, NA, 1, 1, 0, 3, 3, 2, 3, NA, 4, NA, 1, 8,…
$ dwelown <chr> "pays ren", "pays ren", "pays ren", NA, "own or i", "own or i…
$ wordsum <dbl> NA, 5, 1, NA, 9, 9, NA, 7, 10, 3, 8, NA, 9, 9, NA, 2, NA, 9, …
$ income91 <dbl> 11250, 23750, 11250, 18750, 45000, 32500, 100000, 45000, 2750…
$ rincom91 <dbl> NA, 23750, 11250, 18750, 45000, 32500, 100000, 45000, NA, 450…
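A common mistake is to test for missing values with == NA. Any comparison with NA evaluates to NA rather than TRUE, so filter() keeps nothing. The sketch below, on a hypothetical `toy` tibble, contrasts the two approaches.

```r
library(dplyr)

# Hypothetical toy data with two missing values
toy <- tibble(paeduc = c(12, NA, NA, 8, 14))

# Does not work: the comparison is NA for every row, so no rows are kept
wrong <- toy |>
  filter(paeduc == NA)

# Works: is.na() returns TRUE exactly where the value is missing
right <- toy |>
  filter(is.na(paeduc))

nrow(wrong)
nrow(right)
```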
Try it yourself: Count how many respondents have missing data for agewed (age at first marriage).
# Your code here
Combining Logical Conditions
You can combine multiple conditions inside filter():
Syntax:
- & (and): both conditions must be true
- | (or): at least one condition must be true
# Find respondents with non-missing data for BOTH educ AND paeduc
attain |>
  filter(!is.na(educ) & !is.na(paeduc)) |>
  select(educ, paeduc) |>
  head(5)
# A tibble: 5 × 2
educ paeduc
<dbl> <dbl>
1 12 12
2 9 12
3 12 8
4 12 6
5 19 14
You can also use | (or) to match any of several conditions:
# Find respondents who are either divorced OR widowed
attain |>
  filter(marital == "divorced" | marital == "widowed") |>
  select(age, marital) |>
  head(5)
# A tibble: 5 × 2
age marital
<dbl> <chr>
1 33 divorced
2 40 widowed
3 41 widowed
4 45 divorced
5 58 divorced
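To compare & and | side by side, it can help to run the same two conditions both ways on a tiny dataset. This sketch uses a hypothetical four-row `toy` tibble: & narrows the result, while | widens it.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy <- tibble(
  id  = 1:4,
  age = c(25, 45, 28, 60),
  sex = c("female", "female", "male", "male")
)

# AND: must be female and under 30 (only id 1)
both <- toy |>
  filter(sex == "female" & age < 30)

# OR: female, or under 30, or both (ids 1, 2, and 3)
either <- toy |>
  filter(sex == "female" | age < 30)

both$id
either$id
```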
Try it yourself: Find respondents who are over 40 AND have more than 12 years of education.
# Your code here
Summary
In this lab, you learned how to use these key tidyverse functions:
| Function | Purpose |
|---|---|
| select() | Choose columns |
| slice() | Choose rows by position |
| filter() | Choose rows by condition |
| mutate() | Create new variables |
| rename() | Rename columns |
| is.na() / !is.na() | Check for missing values |
You also learned how to chain these functions together using the pipe (|>) to create readable data manipulation workflows.