Lab 2: Summary Statistics in R

Author

Your Name Here

Published

March 25, 2026

Setup Instructions

  1. Download lab2.qmd from bCourses under “labs” > “Lab #2”
  2. Place lab2.qmd in your labs folder. Your folder structure should now look like this:
soc106/
├── _quarto.yml
├── data/
│   └── attain.csv
├── assignments/
│   ├── hw1.qmd
│   ├── hw2.qmd
│   └── hw3.qmd
└── labs/
    ├── lab1.qmd
    └── lab2.qmd
  3. Use the Explorer button on the left to find and open lab2.qmd
  4. Let’s work through it together!


Load Libraries and Data

First, load the tidyverse package:

# load libraries
library(tidyverse)

Today we’ll use the mpg dataset, which ships with ggplot2 and is available as soon as the tidyverse is loaded. It contains fuel economy data for 234 cars, with every row recorded for either 1999 or 2008. No need to download anything!

# Preview the data
mpg |> glimpse()
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…
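
To see which years actually appear in the data, one option is to tally the year column. This is a small sketch rather than a required lab step, and year_counts is just an illustrative name.

```r
library(tidyverse)

# Tally the distinct values of year and how many rows fall in each
year_counts <- mpg |>
  count(year)

year_counts
```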

Frequency Tables

The count() function tallies how many observations fall into each category. We can add a proportion column with mutate().

# Frequency table for vehicle class
mpg |>
  count(class) |>
  mutate(proportion = n / sum(n))
# A tibble: 7 × 3
  class          n proportion
  <chr>      <int>      <dbl>
1 2seater        5     0.0214
2 compact       47     0.201 
3 midsize       41     0.175 
4 minivan       11     0.0470
5 pickup        33     0.141 
6 subcompact    35     0.150 
7 suv           62     0.265 
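
The table above is ordered alphabetically by class. As a sketch of one variation, count() accepts sort = TRUE to order rows from most to least common, and a percentage column can sit alongside the proportion. The names class_freq and percent are illustrative choices, not part of the lab template.

```r
library(tidyverse)

# Same frequency table, sorted from most to least common,
# with the proportion also expressed as a rounded percentage
class_freq <- mpg |>
  count(class, sort = TRUE) |>
  mutate(
    proportion = n / sum(n),
    percent = round(100 * proportion, 1)
  )

class_freq
```

Because the proportions are each category's share of the total, they should always sum to 1, which is a quick sanity check on any frequency table.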

Try it yourself: Create a frequency table with proportions for the drv variable (drive type: f = front-wheel, r = rear-wheel, 4 = 4-wheel).

# Your code here

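Conditional Frequencies

We can calculate conditional frequencies (the distribution of one variable within a subgroup) by filtering to that subgroup first, then counting. Marginal frequencies, by contrast, are the overall counts for a single variable, like the frequency table in the previous section.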

# Among SUVs, what is the distribution of drive type?
mpg |>
  filter(class == "suv") |>
  count(drv) |>
  mutate(proportion = n / sum(n))
# A tibble: 2 × 3
  drv       n proportion
  <chr> <int>      <dbl>
1 4        51      0.823
2 r        11      0.177
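
Filtering handles one subgroup at a time. To get the same kind of distribution within every subgroup at once, one possible approach is to count pairs of values and then divide by each group's total using group_by(), which this lab covers in more detail later. The sketch below conditions on year rather than class, and drv_by_year is an illustrative name.

```r
library(tidyverse)

# Distribution of drive type within each year:
# count every (year, drv) pair, then divide by that year's total
drv_by_year <- mpg |>
  count(year, drv) |>
  group_by(year) |>
  mutate(proportion = n / sum(n)) |>
  ungroup()

drv_by_year
```

Within each year the proportions sum to 1, because sum(n) is computed separately for each group.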

Try it yourself: Among compact cars (class == "compact"), what is the distribution of drive type (drv)? What proportion are front-wheel drive?

# Your code here

Central Tendency

We use summarise() (or summarize()) to calculate the mean and median. Setting na.rm = TRUE drops missing values before the calculation; without it, a single NA in the column makes the result NA.

# Mean and median highway mpg
mpg |>
  summarise(
    mean_hwy = mean(hwy, na.rm = TRUE),
    median_hwy = median(hwy, na.rm = TRUE)
  )
# A tibble: 1 × 2
  mean_hwy median_hwy
     <dbl>      <dbl>
1     23.4         24
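
To see what na.rm does, here is a minimal sketch on a small made-up vector (x is not part of the lab data) that contains one missing value.

```r
# A small made-up vector with one missing value
x <- c(10, 20, NA, 40)

mean_with_na <- mean(x)                   # NA: the missing value propagates
mean_without_na <- mean(x, na.rm = TRUE)  # drops the NA, averages 10, 20, 40

mean_with_na
mean_without_na
```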

R has no built-in function for the statistical mode, so we count each category with count() and keep the most frequent one with slice_max():

# Mode of manufacturer
mpg |>
  count(manufacturer) |>
  slice_max(n, n = 1)
# A tibble: 1 × 2
  manufacturer     n
  <chr>        <int>
1 dodge           37
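
One detail worth knowing: when two categories are tied for most frequent, slice_max() returns all of them by default. The sketch below uses a made-up tibble (toy and its group column are hypothetical) to show the default behavior and the with_ties = FALSE alternative.

```r
library(tidyverse)

# Made-up data in which "a" and "b" are tied for most frequent
toy <- tibble(group = c("a", "a", "b", "b", "c"))

# Default (with_ties = TRUE): every tied category is returned
modes_all <- toy |>
  count(group) |>
  slice_max(n, n = 1)

# with_ties = FALSE: exactly one row is returned
modes_one <- toy |>
  count(group) |>
  slice_max(n, n = 1, with_ties = FALSE)

modes_all
modes_one
```

Keeping the ties is usually the safer choice for reporting a mode, since a distribution can have more than one.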

Try it yourself: Calculate the mean and median of city fuel economy (cty). Then find the mode of the class variable.

# Your code here

Dispersion: Percentiles

We use quantile() inside summarise() to calculate percentiles.

# Percentiles for highway mpg
mpg |>
  summarise(
    p25 = quantile(hwy, 0.25),
    p50 = quantile(hwy, 0.50),
    p75 = quantile(hwy, 0.75)
  )
# A tibble: 1 × 3
    p25   p50   p75
  <dbl> <dbl> <dbl>
1    18    24    27
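
Outside of summarise(), quantile() can also take several probabilities at once, and the gap between the 75th and 25th percentiles gives the interquartile range. The following is a sketch using base R calls on the hwy column; the variable names are illustrative.

```r
library(tidyverse)

# quantile() accepts a vector of probabilities and returns a named vector
hwy_quartiles <- quantile(mpg$hwy, probs = c(0.25, 0.50, 0.75))

# The interquartile range is the 75th percentile minus the 25th
iqr_manual <- unname(hwy_quartiles["75%"] - hwy_quartiles["25%"])
iqr_builtin <- IQR(mpg$hwy)

hwy_quartiles
iqr_manual
iqr_builtin
```

Given the percentiles shown above, both calculations work out to 27 - 18 = 9.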

Try it yourself: Calculate the 10th, 50th, and 90th percentiles for city fuel economy (cty).

# Your code here

Dispersion: Variance and Standard Deviation

We can calculate variance with var() and standard deviation with sd().

# Variance and standard deviation of highway mpg
mpg |>
  summarise(
    variance = var(hwy),
    std_dev = sd(hwy)
  )
# A tibble: 1 × 2
  variance std_dev
     <dbl>   <dbl>
1     35.5    5.95
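
Both var() and sd() use the sample formula, which divides by n - 1 rather than n. As a sketch of what happens under the hood, the calculation can be reproduced by hand; var_manual and sd_manual are illustrative names.

```r
library(tidyverse)

hwy <- mpg$hwy
n <- length(hwy)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
var_manual <- sum((hwy - mean(hwy))^2) / (n - 1)

# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

var_manual
sd_manual
```

These should match the values returned by var(hwy) and sd(hwy).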

Try it yourself: Calculate the variance and standard deviation of engine displacement (displ). Verify that the standard deviation equals the square root of the variance by also calculating sqrt(var(displ)).

# Your code here

Association: Correlation

We can measure the linear association between two continuous variables using cor().

# Correlation between engine size and highway mpg
mpg |>
  summarise(
    correlation = cor(displ, hwy)
  )
# A tibble: 1 × 1
  correlation
        <dbl>
1      -0.766

A negative correlation means that as engine size increases, highway fuel economy tends to decrease. At -0.766, the relationship here is fairly strong.
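
By default, cor() computes the Pearson correlation, which is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch of that relationship, with illustrative variable names:

```r
library(tidyverse)

# Pearson correlation: covariance rescaled by both standard deviations
cor_manual <- cov(mpg$displ, mpg$hwy) / (sd(mpg$displ) * sd(mpg$hwy))
cor_builtin <- cor(mpg$displ, mpg$hwy)

cor_manual
cor_builtin
```

The rescaling is what keeps the result between -1 and 1 regardless of the units of either variable.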

Try it yourself: Calculate the correlation between city fuel economy (cty) and highway fuel economy (hwy). Is the relationship positive or negative? Is it strong or weak?

# Your code here

Putting It All Together

Let’s combine what we’ve learned. We can calculate multiple summary statistics at once, and even compare across groups using group_by().

# Compare highway mpg across drive types
mpg |>
  group_by(drv) |>
  summarise(
    mean_hwy = mean(hwy),
    median_hwy = median(hwy),
    sd_hwy = sd(hwy),
    n = n()
  )
# A tibble: 3 × 5
  drv   mean_hwy median_hwy sd_hwy     n
  <chr>    <dbl>      <dbl>  <dbl> <int>
1 4         19.2         18   4.08   103
2 f         28.2         28   4.21   106
3 r         21           21   3.66    25
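
When there are many groups, it can help to sort the summary so the highest or lowest group appears first. One way to do that, sketched here with the same drive-type comparison, is to add arrange() with desc(); hwy_by_drv is an illustrative name.

```r
library(tidyverse)

# Summarise by drive type, then sort from highest to lowest mean
hwy_by_drv <- mpg |>
  group_by(drv) |>
  summarise(
    mean_hwy = mean(hwy),
    n = n()
  ) |>
  arrange(desc(mean_hwy))

hwy_by_drv
```

With the values shown above, front-wheel drive (f) lands in the first row.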

Try it yourself: Compare city fuel economy (cty) across vehicle classes (class). Calculate the mean, median, and standard deviation for each class. Which class has the best average fuel economy?

# Your code here

Summary

In this lab, you practiced calculating summary statistics in R:

Function           Purpose
count()            Frequency tables
mutate()           Add proportions
summarise()        Calculate summary statistics
mean(), median()   Central tendency
quantile()         Percentiles
var(), sd()        Variance and standard deviation
cor()              Correlation coefficient
group_by()         Compare statistics across groups