Lab 2: Summary Statistics in R
# load libraries
library(tidyverse)
Setup Instructions
- Download lab2.qmd from bCourse under “labs” > “Lab #2”
- Place lab2.qmd in your labs folder. Your folder structure should now look like this:
soc106/
├── _quarto.yml
├── data/
│   └── attain.csv
├── assignments/
│   ├── hw1.qmd
│   ├── hw2.qmd
│   └── hw3.qmd
└── labs/
    ├── lab1.qmd
    └── lab2.qmd
Use the Explorer button on the left to find and open lab2.qmd. Let’s work through it together!
Load Libraries and Data
First, load the tidyverse package by running library(tidyverse).
Today we’ll use the mpg dataset, which ships with ggplot2 and is available as soon as the tidyverse is loaded. It contains fuel economy data in 234 rows, one per car model and configuration, for model years 1999 and 2008. No need to download anything!
# Preview the data
mpg |> glimpse()
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
Frequency Tables
The count() function tallies how many observations fall into each category. We can add a proportion column with mutate().
# Frequency table for vehicle class
mpg |>
  count(class) |>
  mutate(proportion = n / sum(n))
# A tibble: 7 × 3
class n proportion
<chr> <int> <dbl>
1 2seater 5 0.0214
2 compact 47 0.201
3 midsize 41 0.175
4 minivan 11 0.0470
5 pickup 33 0.141
6 subcompact 35 0.150
7 suv 62 0.265
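As a variation on the table above, the sketch below sorts the categories from most to least common and adds a percent column. It uses the trans variable purely as an illustration; the names trans_freq and percent are made up for this example.

```r
library(tidyverse)

# Frequency table for transmission type, most common category first
trans_freq <- mpg |>
  count(trans, sort = TRUE) |>
  mutate(
    proportion = n / sum(n),
    percent = 100 * proportion
  )

trans_freq

# Sanity checks: counts add up to the number of rows, proportions add up to 1
sum(trans_freq$n) == nrow(mpg)
isTRUE(all.equal(sum(trans_freq$proportion), 1))
```

Checking that the proportions sum to 1 is a quick way to catch a table built from the wrong denominator.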
Try it yourself: Create a frequency table with proportions for the drv variable (drive type: f = front-wheel, r = rear-wheel, 4 = 4-wheel).
# Your code here
Marginal Frequencies
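Filtering to a subgroup first and then counting gives the distribution of one variable within that subgroup. Strictly speaking, these are conditional frequencies; marginal frequencies describe a variable across the whole dataset, as in the frequency table for class above.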
# Among SUVs, what is the distribution of drive type?
mpg |>
  filter(class == "suv") |>
  count(drv) |>
  mutate(proportion = n / sum(n))
# A tibble: 2 × 3
drv n proportion
<chr> <int> <dbl>
1 4 51 0.823
2 r 11 0.177
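Filtering handles one subgroup at a time. One possible way to get the same kind of within-group proportions for every subgroup at once is to count both variables and then compute proportions inside each group. The sketch below does this for cylinders within each model year; the choice of variables and the name cyl_by_year are only for illustration.

```r
library(tidyverse)

# Distribution of cylinder count within each model year
cyl_by_year <- mpg |>
  count(year, cyl) |>
  group_by(year) |>                       # proportions are computed per year
  mutate(proportion = n / sum(n)) |>
  ungroup()

cyl_by_year

# Within each year the proportions should sum to 1
cyl_by_year |>
  group_by(year) |>
  summarise(total = sum(proportion))
```

Because sum(n) is evaluated inside each year group, every row's proportion is relative to its own year rather than to all 234 rows.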
Try it yourself: Among compact cars (class == "compact"), what is the distribution of drive type (drv)? What proportion are front-wheel drive?
# Your code here
Central Tendency
We use summarise() (or summarize()) to calculate the mean and median. Setting na.rm = TRUE drops missing values (NA) before calculating; without it, a single NA in the column makes the result NA.
# Mean and median highway mpg
mpg |>
  summarise(
    mean_hwy = mean(hwy, na.rm = TRUE),
    median_hwy = median(hwy, na.rm = TRUE)
  )
# A tibble: 1 × 2
mean_hwy median_hwy
<dbl> <dbl>
1 23.4 24
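To see what na.rm = TRUE actually changes, here is a small sketch on a made-up column with one missing value. The scores tibble and its values are invented for illustration and are not part of the lab data.

```r
library(tidyverse)

# A toy column with one missing value (hypothetical data)
scores <- tibble(score = c(10, 20, NA, 40))

na_demo <- scores |>
  summarise(
    mean_with_na = mean(score),                         # NA: the missing value propagates
    mean_without_na = mean(score, na.rm = TRUE),        # mean of 10, 20, 40
    median_without_na = median(score, na.rm = TRUE),    # median of 10, 20, 40
    n_missing = sum(is.na(score))                       # how many values were dropped
  )

na_demo
```

Reporting the number of missing values alongside the statistic makes it clear how many observations the mean and median are based on.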
For the mode, we use count() and slice_max():
# Mode of manufacturer
mpg |>
  count(manufacturer) |>
  slice_max(n, n = 1)
# A tibble: 1 × 2
manufacturer n
<chr> <int>
1 dodge 37
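A variable can have more than one mode. By default slice_max() keeps ties, so it returns every category that shares the highest count. The sketch below shows this on a made-up pets tibble (hypothetical data, not part of the lab).

```r
library(tidyverse)

# Toy data where two categories are tied for most common
pets <- tibble(pet = c("cat", "cat", "dog", "dog", "fish"))

# Default behaviour (with_ties = TRUE): both tied categories are returned
pet_modes <- pets |>
  count(pet) |>
  slice_max(n, n = 1)

pet_modes

# with_ties = FALSE forces exactly one row, which hides the tie
pet_single <- pets |>
  count(pet) |>
  slice_max(n, n = 1, with_ties = FALSE)

pet_single
```

Keeping the default is usually the safer choice here, since a two-row result signals that the data are bimodal.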
Try it yourself: Calculate the mean and median of city fuel economy (cty). Then find the mode of the class variable.
# Your code here
Dispersion: Percentiles
We use quantile() inside summarise() to calculate percentiles.
# Percentiles for highway mpg
mpg |>
  summarise(
    p25 = quantile(hwy, 0.25),
    p50 = quantile(hwy, 0.50),
    p75 = quantile(hwy, 0.75)
  )
# A tibble: 1 × 3
p25 p50 p75
<dbl> <dbl> <dbl>
1 18 24 27
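The distance between the 25th and 75th percentiles is the interquartile range (IQR), a common single-number measure of dispersion. As a sketch, it can be computed by subtracting the two percentiles or with the built-in IQR() function; the column names iqr_manual and iqr_builtin are just labels chosen for this example.

```r
library(tidyverse)

hwy_spread <- mpg |>
  summarise(
    p25 = quantile(hwy, 0.25),
    p75 = quantile(hwy, 0.75),
    iqr_manual = p75 - p25,      # difference between the two percentiles above
    iqr_builtin = IQR(hwy)       # same quantity from the built-in function
  )

hwy_spread
```

With the percentiles shown above (18 and 27), both versions give an IQR of 9 highway mpg.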
Try it yourself: Calculate the 10th, 50th, and 90th percentiles for city fuel economy (cty).
# Your code here
Dispersion: Variance and Standard Deviation
We can calculate variance with var() and standard deviation with sd().
# Variance and standard deviation of highway mpg
mpg |>
  summarise(
    variance = var(hwy),
    std_dev = sd(hwy)
  )
# A tibble: 1 × 2
variance std_dev
<dbl> <dbl>
1 35.5 5.95
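It can help to see what var() computes. R's var() returns the sample variance, which divides the sum of squared deviations from the mean by n - 1 rather than n. The sketch below rebuilds that formula by hand for hwy and compares it with the built-in result; var_manual and var_divide_by_n are names made up for this illustration.

```r
library(tidyverse)

var_check <- mpg |>
  summarise(
    var_builtin = var(hwy),
    # Sample variance by hand: squared deviations divided by n - 1
    var_manual = sum((hwy - mean(hwy))^2) / (n() - 1),
    # Dividing by n instead gives a slightly smaller number
    var_divide_by_n = sum((hwy - mean(hwy))^2) / n()
  )

var_check

# The hand-built sample variance should match var()
isTRUE(all.equal(var_check$var_manual, var_check$var_builtin))
```

With 234 observations the two denominators give nearly the same answer, but the gap grows as the sample gets smaller.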
Try it yourself: Calculate the variance and standard deviation of engine displacement (displ). Verify that the standard deviation equals the square root of the variance by also calculating sqrt(var(displ)).
# Your code here
Association: Correlation
We can measure the linear association between two continuous variables using cor().
# Correlation between engine size and highway mpg
mpg |>
  summarise(
    correlation = cor(displ, hwy)
  )
# A tibble: 1 × 1
correlation
<dbl>
1 -0.766
A negative correlation means that as engine size increases, fuel economy tends to decrease. At -0.766, the linear relationship here is fairly strong.
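By default cor() returns the Pearson correlation coefficient, which is the covariance of the two variables divided by the product of their standard deviations. The sketch below rebuilds it from cov() and sd() for the same pair of variables; cor_manual is a name chosen only for this example.

```r
library(tidyverse)

cor_check <- mpg |>
  summarise(
    cor_builtin = cor(displ, hwy),
    # Pearson's r by hand: covariance scaled by both standard deviations
    cor_manual = cov(displ, hwy) / (sd(displ) * sd(hwy))
  )

cor_check

# Both versions should agree
isTRUE(all.equal(cor_check$cor_manual, cor_check$cor_builtin))
```

Dividing by the standard deviations removes the units, which is why the result always falls between -1 and 1 regardless of how the variables are measured.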
Try it yourself: Calculate the correlation between city fuel economy (cty) and highway fuel economy (hwy). Is the relationship positive or negative? Is it strong or weak?
# Your code here
Putting It All Together
Let’s combine what we’ve learned. We can calculate multiple summary statistics at once, and even compare across groups using group_by().
# Compare highway mpg across drive types
mpg |>
  group_by(drv) |>
  summarise(
    mean_hwy = mean(hwy),
    median_hwy = median(hwy),
    sd_hwy = sd(hwy),
    n = n()
  )
# A tibble: 3 × 5
drv mean_hwy median_hwy sd_hwy n
<chr> <dbl> <dbl> <dbl> <int>
1 4 19.2 18 4.08 103
2 f 28.2 28 4.21 106
3 r 21 21 3.66 25
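When a grouped summary has more than a few rows, sorting it makes the comparison easier to read. As a sketch, the example below groups by number of cylinders (chosen only for illustration) and uses arrange() with desc() to put the group with the highest mean first; hwy_by_cyl is a made-up name.

```r
library(tidyverse)

hwy_by_cyl <- mpg |>
  group_by(cyl) |>
  summarise(
    mean_hwy = mean(hwy),
    sd_hwy = sd(hwy),
    n = n()
  ) |>
  arrange(desc(mean_hwy))   # highest average highway mpg first

hwy_by_cyl
```

Keeping n in the output matters when ranking groups, because a mean based on only a handful of cars is less reliable than one based on many.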
Try it yourself: Compare city fuel economy (cty) across vehicle classes (class). Calculate the mean, median, and standard deviation for each class. Which class has the best average fuel economy?
# Your code here
Summary
In this lab, you practiced calculating summary statistics in R:
| Function | Purpose |
|---|---|
| count() | Frequency tables |
| mutate() | Add proportions |
| summarise() | Calculate summary statistics |
| mean(), median() | Central tendency |
| quantile() | Percentiles |
| var(), sd() | Variance and standard deviation |
| cor() | Correlation coefficient |
| group_by() | Compare statistics across groups |