Sociology 106: Quantitative Sociological Methods
March 17, 2026
Housekeeping:
Statistical content — three parts:
In-class lab:
Annotated Bibliography
Weekly Assignment #8
Spring Recess
Main idea: Last week we built confidence intervals — a range of plausible values for a parameter. Today we flip the question: given a specific claim about the population, how likely are our sample results if that claim were true?
The chain of inference so far:
Week 7: Population → Sample → Statistic → Sampling distribution
Week 8: Sample → Statistic → Confidence interval → Inference
Week 9: Null hypothesis + Sample statistic → Test statistic → p-value → Decision
| Concept | Key idea | Example |
|---|---|---|
| Point estimate | Best single guess for population parameter | \(\hat{p} = 0.449\) (GSS env. support) |
| Standard error | How uncertain is the estimate? | \(SE(\hat{p}) = 0.015\) |
| Confidence interval | Range likely to contain the true parameter | 95% CI: [0.420, 0.478] |
| z vs t | Proportions → z; Means → t | \(z^* = 1.96\); \(t^* = 2.447\) (\(df = 6\)) |
Connection to today:
Last week we estimated unknown population parameters and surrounded them with intervals. Today we start from a specific claim about the population — and ask whether our data are consistent with it.
Can we rule out chance as an explanation?
A hypothesis test is a formal statistical procedure for evaluating whether a finding is likely due to chance.
The core steps:
The core logic:
We don’t prove the null hypothesis wrong — we ask: if it were true, how surprising would our data be? If very surprising, we have grounds to doubt the null.
James Bond claims he can tell whether a martini is shaken or stirred just by tasting it.
The experiment: Give him 16 randomly prepared martinis. He correctly identifies 13 out of 16.
Null hypothesis (\(H_0\)): He’s just guessing — each trial is a 50-50 coin flip (\(\pi = 0.5\))
Alternative hypothesis (\(H_1\)): He can genuinely tell the difference — he performs better than random guessing (\(\pi > 0.5\))
The question: If he were just guessing, how likely is it that he’d get 13 or more right out of 16?
Probability ≈ 1.1% — very unlikely if he were just guessing. We have strong evidence he can actually tell the difference.
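The 1.1% figure can be reproduced from the binomial distribution. The following is a minimal sketch in R, assuming each of the 16 trials is an independent 50-50 guess under \(H_0\); the variable names are illustrative and not part of the original slides.

```r
# Probability of 13 or more correct out of 16 if Bond is only guessing
n_trials  <- 16    # martinis served
n_correct <- 13    # correct identifications
pi_null   <- 0.5   # chance of a correct guess under H0

# P(X >= 13) = 1 - P(X <= 12) for X ~ Binomial(16, 0.5)
p_upper <- 1 - pbinom(n_correct - 1, size = n_trials, prob = pi_null)
print(p_upper)     # about 0.0106, i.e. roughly 1.1%
```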
A p-value is the probability of observing a test statistic as extreme as (or more extreme than) the one we got, assuming the null hypothesis is true.
Critical misconception:
What the p-value is NOT: the probability that \(H_0\) is true. The null hypothesis is either true or false — it doesn’t have a probability.
What the p-value is: P(data this extreme | \(H_0\) true), the probability of observing results at least this extreme, assuming the null holds. It answers “if the null were true, how often would we see data like ours?”, not “how likely is the null to be true?”
| p-value | Interpretation |
|---|---|
| Very small (e.g., 0.001) | Data would be very surprising under \(H_0\) → strong evidence against \(H_0\) |
| Small (e.g., 0.03) | Data somewhat unlikely under \(H_0\) → moderate evidence against \(H_0\) |
| Large (e.g., 0.40) | Data consistent with \(H_0\) → no reason to doubt \(H_0\) |
Statistical ≠ substantive significance:
A result can be statistically significant (\(p < 0.05\)) but substantively trivial — the effect is real but tiny. With large samples, even meaningless effects can reach significance. Always report the size of the effect, not just whether it is significant.
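To see how sample size alone can push a trivial effect below 0.05, here is a small sketch in R. All numbers are hypothetical (a 0.1-hour difference from the null and a standard deviation of 14 hours), and a z-approximation stands in for the t-test for simplicity.

```r
# Hypothetical: the sample mean is 0.1 hours above the null value,
# with a standard deviation of 14 hours (both values are made up)
effect   <- 0.1
sd_hours <- 14
n        <- c(100, 10000, 1000000)   # three sample sizes to compare

se      <- sd_hours / sqrt(n)        # standard error shrinks as n grows
z       <- effect / se               # same effect, larger test statistic
p_value <- 2 * pnorm(-abs(z))        # two-tailed p-value

print(data.frame(n, se, z, p_value))
```

The effect is identical in every row; only \(n\) changes. This is why the effect size should be reported alongside the p-value.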
How small does the p-value need to be before we reject \(H_0\)?
We set a significance level \(\alpha\) in advance — the threshold below which we’ll reject:
Decision rule:
| Result | Decision |
|---|---|
| \(p < \alpha\) | Reject \(H_0\) — accept \(H_1\); say the result is “statistically significant” |
| \(p \geq \alpha\) | Fail to reject \(H_0\) — we do not “accept” \(H_0\), we just lack evidence to reject it |
Why “fail to reject” — not “accept”?
A negative result doesn’t prove the null is true. A small sample may simply be too underpowered to detect a real effect. We might be failing to reject because the effect doesn’t exist — or because we didn’t have enough data to see it.
Every decision has two possible mistakes:
| | \(H_0\) is True | \(H_0\) is False |
|---|---|---|
| Reject \(H_0\) | Type I error (false positive) | Correct ✓ |
| Fail to reject \(H_0\) | Correct ✓ | Type II error (false negative) |
In plain language: the paramedic
You arrive at an accident. Is the victim alive (\(H_0\)) or dead (\(H_1\))?
Here you’d want a very small \(\alpha\) — the cost of the errors is not symmetric.
Note
The trade-off: Moving the threshold right reduces Type I error but increases Type II error, and vice versa. You can shrink both by increasing \(n\), which narrows both sampling distributions so they overlap less.
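The trade-off can be made concrete with a rough calculation. This sketch assumes an upper one-tailed z-test and hypothetical values (a true mean 1 unit above the null, a standard deviation of 14); the function name type2_error is illustrative only.

```r
# Type II error rate (beta) for an upper one-tailed z-test.
# delta = true mean minus null mean; sigma = population SD (both hypothetical)
type2_error <- function(alpha, n, delta = 1, sigma = 14) {
  se     <- sigma / sqrt(n)
  z_crit <- qnorm(1 - alpha)            # rejection threshold on the z scale
  power  <- pnorm(delta / se - z_crit)  # P(reject H0 | H0 is false)
  1 - power
}

beta_a05_n400  <- type2_error(alpha = 0.05, n = 400)
beta_a01_n400  <- type2_error(alpha = 0.01, n = 400)   # stricter alpha
beta_a05_n1600 <- type2_error(alpha = 0.05, n = 1600)  # larger sample

print(c(beta_a05_n400  = beta_a05_n400,
        beta_a01_n400  = beta_a01_n400,
        beta_a05_n1600 = beta_a05_n1600))
```

Under these assumptions, lowering \(\alpha\) at a fixed \(n\) raises the Type II error rate, while quadrupling \(n\) at a fixed \(\alpha\) lowers it.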
A one-tailed test checks for a significant effect in one specific direction.
James Bond (one-tailed): Did he perform better than random guessing?
A two-tailed test checks for a significant effect in either direction.
James Bond (two-tailed): Did he perform differently from random guessing?

| Scenario | Recommended |
|---|---|
| Theory strongly predicts the direction | One-tailed (use sparingly!) |
| No strong directional prediction | Two-tailed (default) |
Default to two-tailed:
Unless you have a strong, pre-specified theoretical reason to predict direction, use a two-tailed test. Effects in the “unexpected” direction are often theoretically important!
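For the James Bond example, the two versions of the test can be compared with base R's binom.test(). This is a sketch using the 13-of-16 result from the experiment; the object names are illustrative.

```r
# One-tailed: did he do better than guessing?
p_one <- binom.test(13, 16, p = 0.5, alternative = "greater")$p.value

# Two-tailed: did he do differently from guessing (better or worse)?
p_two <- binom.test(13, 16, p = 0.5, alternative = "two.sided")$p.value

print(c(one_tailed = p_one, two_tailed = p_two))
```

Because the null distribution is symmetric when \(\pi = 0.5\), the two-tailed p-value here is twice the one-tailed value.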
Last week’s confidence intervals and today’s hypothesis tests are two sides of the same coin:

The equivalence:
The 95% CI is the set of null values you would fail to reject in a two-tailed test at \(\alpha = 0.05\). Testing and estimation are equivalent; they just frame the same question differently.
Testing associations involving a continuous outcome variable
| Test | When to use | Key statistic | Example |
|---|---|---|---|
| One-sample t-test | Comparing sample to population; \(\bar{x} = \mu\) | t-score | Do Berkeley students work more than 20 hrs/week on average? |
| Two-sample t-test | Comparing means of two groups; \(\bar{x}_1 = \bar{x}_2\) | t-score | Do men and women differ in hours worked per week? |
| Proportion test | Testing proportions; test if \(\hat{p} = \pi_0\) | z-score | Is union membership among women different from 20%? |
All three tests follow the same steps:
Compare the sample to the population.
General form of the one-sample t-test:
\[H_0: \mu = \mu_0 \qquad \text{(population mean equals a specified value)}\]
\(H_1\) depends on whether the research hypothesis predicts a direction:
| Alternative | Test type | When to use |
|---|---|---|
| \(H_1: \mu \neq \mu_0\) | Two-tailed | No predicted direction — default choice |
| \(H_1: \mu > \mu_0\) | One-tailed (upper) | Theory predicts the population mean is higher than \(\mu_0\) |
| \(H_1: \mu < \mu_0\) | One-tailed (lower) | Theory predicts the population mean is lower than \(\mu_0\) |
Running example: Is the mean hours worked per week by U.S. adults equal to 40 hours?
Under \(H_0\), the test statistic follows a t-distribution with \(df = n - 1\):
\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}\]
| Step | What to compute | Formula/Rule |
|---|---|---|
| 1 | State \(H_0\) and \(H_1\) | \(H_0: \mu = \mu_0\); choose \(H_1\) based on theory |
| 2 | One- or two-tailed? | Does theory predict a direction? → one-tailed; otherwise → two-tailed (default) |
| 3 | Sample mean & SE | \(\bar{x}\); \(SE = s / \sqrt{n}\) |
| 4 | t-statistic | \(t = (\bar{x} - \mu_0) / SE\) |
| 5 | p-value | See below |
Step 2 determines the p-value formula:
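| Test | R code |
|---|---|
| Two-tailed | 2 * pt(-abs(t), df = n - 1) |
| One-tailed (upper) | 1 - pt(t, df = n - 1) |
| One-tailed (lower) | pt(t, df = n - 1) |
In t.test(), mu = 40 sets the null hypothesis value: R tests whether the true population mean equals 40.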
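The three p-value rules can be wrapped in one small helper. This is only a sketch: the function name t_p_value and the example values (t = 2.1, df = 24) are hypothetical and do not come from the GSS data.

```r
# p-value for a t-statistic, given the direction of H1
t_p_value <- function(t_stat, df, tail = c("two", "upper", "lower")) {
  tail <- match.arg(tail)                       # defaults to "two"
  switch(tail,
         two   = 2 * pt(-abs(t_stat), df = df), # both tails
         upper = 1 - pt(t_stat, df = df),       # area above t
         lower = pt(t_stat, df = df))           # area below t
}

# Hypothetical test statistic and degrees of freedom
p_two   <- t_p_value(2.1, df = 24)
p_upper <- t_p_value(2.1, df = 24, tail = "upper")
p_lower <- t_p_value(2.1, df = 24, tail = "lower")
print(c(two = p_two, upper = p_upper, lower = p_lower))
```

For a positive t, the two-tailed value is twice the upper-tail value, and the upper and lower tails always sum to 1.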
One Sample t-test
data: hrs_clean$hrs1
t = 5.0511, df = 1899, p-value = 0.0000004813
alternative hypothesis: true mean is not equal to 40
95 percent confidence interval:
41.01450 42.30234
sample estimates:
mean of x
41.65842
Reading the t.test() output:
The sample mean hours worked per week was 41.66 hours. A one-sample t-test showed this is significantly different from 40 hours (\(p\) < 0.001). We reject \(H_0\); we accept \(H_1\): U.S. adults in this sample work significantly more than a standard 40-hour week.
The 95% CI [41.01, 42.30] does not contain 40, which is the CI equivalent of rejecting \(H_0\) at \(\alpha = 0.05\).
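The reported p-value and confidence interval can be recovered by hand from the numbers in the output above. This sketch copies the printed mean, t-statistic, and df; the standard error is backed out from them rather than computed from raw data, so small rounding differences are expected.

```r
# Values copied from the t.test() output above
x_bar <- 41.65842   # sample mean
mu_0  <- 40         # null value
t_obs <- 5.0511     # reported t-statistic
df    <- 1899       # reported degrees of freedom

# Back out the standard error from t = (x_bar - mu_0) / SE
se <- (x_bar - mu_0) / t_obs

# Two-tailed p-value
p_two <- 2 * pt(-abs(t_obs), df = df)

# 95% CI: x_bar plus or minus t* times SE
t_star <- qt(0.975, df = df)
ci <- c(lower = x_bar - t_star * se, upper = x_bar + t_star * se)

print(p_two)
print(round(ci, 2))
```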

Running example: Do union members work a different number of hours per week than non-members?
Under \(H_0\), the test statistic:
\[t = \frac{\bar{x}_1 - \bar{x}_2}{SE_{\text{diff}}} \quad \text{where} \quad SE_{\text{diff}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]
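Degrees of freedom: with a pooled-variance test, \(df = n_1 + n_2 - 2\). R’s t.test() defaults to Welch’s test, which does not assume equal variances and uses an adjusted df that is usually smaller and non-integer (316.26 in the output below).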
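The following sketch shows how the Welch statistic and its adjusted df (the Welch-Satterthwaite approximation) could be computed from summary statistics. The function name welch_t and every input value are hypothetical; they are not the GSS figures.

```r
# Welch two-sample t-test from summary statistics
welch_t <- function(mean1, s1, n1, mean2, s2, n2) {
  v1 <- s1^2 / n1                       # variance of the first group mean
  v2 <- s2^2 / n2                       # variance of the second group mean
  se_diff <- sqrt(v1 + v2)
  t_stat  <- (mean1 - mean2) / se_diff
  # Welch-Satterthwaite degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  list(t = t_stat, df = df, p_value = 2 * pt(-abs(t_stat), df = df))
}

# Hypothetical inputs: equal spreads and sizes vs. very unequal ones
equal   <- welch_t(42, 10, 50, 40, 10, 50)
unequal <- welch_t(42, 20, 30, 40, 10, 200)
print(c(equal_df = equal$df, unequal_df = unequal$df))
```

With equal standard deviations and equal group sizes the adjusted df equals \(n_1 + n_2 - 2\); when the groups differ sharply it falls well below that.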
Welch Two Sample t-test
data: union_hrs and nonunion_hrs
t = 1.8535, df = 316.26, p-value = 0.06475
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1122722 3.7619461
sample estimates:
mean of x mean of y
43.52284 41.69801
Reading the t.test() output:
Union members worked an average of 43.52 hours per week, compared to 41.70 hours for non-members (difference = 1.82 hrs). A two-sample t-test showed this difference was not statistically significant (\(p\) = 0.065). We fail to reject \(H_0\): we cannot conclude union members work different hours than non-members.
The 95% CI on the difference [-0.11, 3.76] includes 0 — consistent with failing to reject \(H_0\) at \(\alpha = 0.05\).
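As a check on the agreement between the test and the interval, the sketch below rebuilds both from the printed Welch output. The standard error is backed out from the reported t-statistic, so the results match the output only up to rounding.

```r
# Values copied from the Welch output above
diff_means <- 43.52284 - 41.69801   # union mean minus non-union mean
t_obs      <- 1.8535
df         <- 316.26                # Welch df can be non-integer

se_diff <- diff_means / t_obs       # implied SE of the difference
p_two   <- 2 * pt(-abs(t_obs), df = df)
ci      <- diff_means + c(-1, 1) * qt(0.975, df = df) * se_diff

# The two decisions agree: p >= 0.05 and the CI contains 0
reject_h0        <- p_two < 0.05
ci_excludes_zero <- ci[1] > 0 || ci[2] < 0
print(round(c(p = p_two, lower = ci[1], upper = ci[2]), 3))
print(c(reject_h0 = reject_h0, ci_excludes_zero = ci_excludes_zero))
```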

We follow the same setup, but the test statistic follows the standard normal (z) distribution rather than the t-distribution.
Running example: Is the proportion of U.S. adults who are union members equal to 20%?
Under \(H_0\), the sampling distribution of \(\hat{p}\) is approximately:
\[\hat{p} \sim N\!\left(\pi_0,\; SE_0\right) \quad \text{where} \quad SE_0 = \sqrt{\frac{\pi_0(1-\pi_0)}{n}}\]
Key distinction — testing vs. estimation:
When building a CI, use \(SE = \sqrt{\hat{p}(1-\hat{p})/n}\) (the sample estimate). When testing a hypothesis, use \(SE_0 = \sqrt{\pi_0(1-\pi_0)/n}\) (the null value).
Step 1: Compute the standard error under \(H_0\):
\[SE_0 = \sqrt{\frac{\pi_0(1-\pi_0)}{n}} = \sqrt{\frac{0.20 \times 0.80}{n}}\]
Step 2: Compute the z-statistic:
\[z = \frac{\hat{p} - \pi_0}{SE_0}\]
Step 3: Calculate the p-value:
| Test | R code |
|---|---|
| Two-tailed | 2 * pnorm(-abs(z)) |
| One-tailed (upper) | 1 - pnorm(z) |
| One-tailed (lower) | pnorm(z) |
prop.test() handles the SE and test statistic automatically — just supply the count of successes, the sample size, and the null value.
1-sample proportions test without continuity correction
data: x out of n, null probability 0.2
X-squared = 62.306, df = 1, p-value = 0.000000000000002941
alternative hypothesis: true p is not equal to 0.2
95 percent confidence interval:
0.1150532 0.1445757
sample estimates:
p
0.1290973
Reading the prop.test() output:
Note: correct = FALSE turns off the continuity correction — appropriate when \(n\) is large.
The sample union membership rate was 0.129 (12.9%). A proportion test showed this was significantly different from 20% (\(p\) < 0.001). We reject \(H_0\); we accept \(H_1\).
The 95% CI [0.115, 0.145] does not contain 0.20 — the null value falls entirely outside the plausible range for the true proportion, confirming we reject \(H_0\).
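Steps 1 to 3 can be carried out by hand and compared with prop.test(). The output above only says "x out of n", so this sketch assumes x = 256 union members out of n = 1983, the totals in the union-by-sex table later in these notes. Those counts reproduce the reported sample proportion, but treat them as an assumption.

```r
# Assumed inputs (see note above): 256 union members out of 1983 respondents
x    <- 256
n    <- 1983
pi_0 <- 0.20                           # null value

p_hat <- x / n                         # sample proportion
se_0  <- sqrt(pi_0 * (1 - pi_0) / n)   # Step 1: SE under H0
z     <- (p_hat - pi_0) / se_0         # Step 2: z-statistic
p_two <- 2 * pnorm(-abs(z))            # Step 3: two-tailed p-value

# prop.test() reports X-squared, which equals z^2 when correct = FALSE
built_in <- prop.test(x, n, p = pi_0, correct = FALSE)

print(c(p_hat = p_hat, z = z, z_squared = z^2, p_two = p_two))
print(unname(built_in$statistic))
```

The p-value is tiny but not literally zero, which is why it is written up as \(p\) < 0.001.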

Using the GSS attain dataset, we test whether respondents’ mean household income differs from the U.S. national median household income of $30,000 (1991 Census benchmark, the year these data were collected).
The income91 variable records household income as midpoints of GSS income brackets (e.g., $500, $2,000 … $100,000).
Research question: Does mean household income in the GSS sample differ from the national median?
Question: Write the hypothesis notation and specify whether this is a one- or two-tailed test.
Null hypothesis: Mean household income in the sample equals the national median:
\[H_0: \mu = \$30{,}000\]
Alternative hypothesis: Mean household income is different from the national median:
\[H_1: \mu \neq \$30{,}000\]
Two-tailed — the research question asks whether income differs, not specifically whether it is higher or lower.
The logic:
We have no directional prediction going in — the GSS sample could plausibly earn more or less than the national median. So we split the rejection region (\(\alpha = 5\%\)) equally across both tails (2.5% each). This is the more conservative and more common choice in social science.
One Sample t-test
data: inc_clean$income91
t = 14.205, df = 2635, p-value < 0.00000000000000022
alternative hypothesis: true mean is not equal to 30000
95 percent confidence interval:
36547.93 38645.17
sample estimates:
mean of x
37596.55
Since t.test() runs a two-tailed test by default, the p-value already reflects both tails.
Since \(p < 0.001 < \alpha = 0.05\), we reject \(H_0\); we accept \(H_1\).
Conclusion:
The mean household income in our sample ($37,597) is significantly higher than the national median of $30,000 (\(p\) < 0.001). The 95% CI [$36,548, $38,645] does not contain $30,000, confirming we reject \(H_0\). GSS respondents earn more than the national median, and this gap is unlikely to be due to sampling error alone.
Testing associations between two categorical variables
Every test we’ve covered this week asks the same fundamental question: is there a relationship between variables? The choice of test depends entirely on the types of variables involved.
| What we’re testing | Outcome variable | Grouping variable | Method |
|---|---|---|---|
| Is a proportion = \(\pi_0\)? | Binary | — | z-test |
| Is a mean = \(\mu_0\)? | Continuous | — | one-sample t-test |
| Do two group means differ? | Continuous | Binary (2 groups) | two-sample t-test |
| Do 3+ group means differ? | Continuous | Categorical (3+ groups) | ANOVA (not covered) |
| Are two categorical variables independent? | Categorical | Categorical | chi-squared test |
The key decision:
Is your outcome variable continuous? → use a t-test (or ANOVA for 3+ groups). Is your outcome variable categorical? → use chi-squared. Note: chi-squared works for any number of categories — not just binary variables.
Both tests ask “Is \(X\) related to \(Y\)?” — for different variable combinations:
Question: Does the mean of a continuous outcome differ across groups?
Our example: Do union members work a different number of hours than non-members? Focusing on the difference in means between the two groups.
Question: Does the distribution of a categorical outcome differ across groups?
Our example: Is union membership associated with sex?
| | Two-sample t-test | Chi-squared |
|---|---|---|
| \(H_0\) | \(\mu_1 = \mu_2\) | \(X\) and \(Y\) are independent |
| Outcome type | Continuous | Categorical |
| Group type | Binary/categorical | Categorical |
| Test statistic | \(t\) | \(\chi^2\) |
| Distribution | t (\(df = n_1+n_2-2\)) | \(\chi^2\) (\(df = (I{-}1)(J{-}1)\)) |
| Measure of effect | Difference in means | Difference in proportions |
| In R | t.test(x, y) | chisq.test(table) |
Choosing your test for hw8:
Does your research question compare a continuous outcome across groups? → two-sample t-test. Does it compare the rates of a categorical outcome across groups? → chi-squared.
A contingency table shows the joint distribution of two categorical variables: counts of observations in each combination of categories.
| | Non-member | Union member | Total |
|---|---|---|---|
| Female | \(n_{11}\) | \(n_{12}\) | \(n_{1\bullet}\) |
| Male | \(n_{21}\) | \(n_{22}\) | \(n_{2\bullet}\) |
| Total | \(n_{\bullet 1}\) | \(n_{\bullet 2}\) | \(n\) |
Non-member Union Sum
female 1003 105 1108
male 724 151 875
Sum 1727 256 1983
Non-member Union
female 0.905 0.095
male 0.827 0.173
The null hypothesis: The two categorical variables are independent, meaning that knowing someone’s category on one variable tells us nothing about their category on the other.
\[H_0: \text{no association between } X \text{ and } Y\]
Under independence, the expected count in each cell:
\[\hat{n}_{ij} = \frac{(\text{row total}_i) \times (\text{column total}_j)}{n}\]
This is what the table would look like if the two variables had no relationship at all.
The test asks: Are the observed cell counts far enough from the expected cell counts that we doubt independence?
\[\chi^2 = \sum_{i}\sum_{j} \frac{(n_{ij} - \hat{n}_{ij})^2}{\hat{n}_{ij}}\]
where \(n_{ij}\) = observed and \(\hat{n}_{ij}\) = expected count in cell \((i, j)\)
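The expected counts and the \(\chi^2\) statistic can be computed directly from these formulas. This sketch uses the sex-by-union counts shown in the contingency table above; the object names are illustrative.

```r
# Observed counts from the sex-by-union table
observed <- matrix(c(1003, 105,
                      724, 151),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(sex   = c("female", "male"),
                                   union = c("Non-member", "Union")))

n <- sum(observed)

# Expected count in each cell: (row total x column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Sum of (observed - expected)^2 / expected over all cells
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(round(expected))
print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```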
Assumptions:
Research question: Is there a significant association between sex and union membership?
# A tibble: 2 × 4
sex `Non-member` Union Total
<chr> <int> <int> <int>
1 female 1003 105 1108
2 male 724 151 875
Non-member Union
female 0.905 0.095
male 0.827 0.173
Non-member Union
female 965 143
male 762 113
Pearson's Chi-squared test
data: tab
X-squared = 26.325, df = 1, p-value = 0.0000002886
correct = FALSE turns off Yates’ continuity correction — standard when all expected counts ≥ 5
Men were 7.8 percentage points more likely to be union members than women (17.3% vs. 9.5%). A chi-squared test showed this association was statistically significant (\(\chi^2\)(1) = 26.3, \(p\) < 0.001). We reject \(H_0\); we accept \(H_1\): sex and union membership are not independent in this sample.
Important:
Chi-squared only tells you whether an association exists — not its direction or magnitude. Always supplement with a difference in proportions.
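One way to pair the test with an effect size is sketched below: run chisq.test() on the sex-by-union counts from this example, then take row proportions with prop.table() and subtract. The object names are illustrative.

```r
# Same sex-by-union counts as in the worked example
tab <- matrix(c(1003, 105,
                 724, 151),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex   = c("female", "male"),
                              union = c("Non-member", "Union")))

# Test of independence, without Yates' continuity correction
test <- chisq.test(tab, correct = FALSE)

# Row proportions: share of each sex that is unionized
row_props <- prop.table(tab, margin = 1)

# Direction and size of the association, in proportion units
diff_union <- row_props["male", "Union"] - row_props["female", "Union"]

print(round(row_props, 3))
print(c(chi_sq = unname(test$statistic), diff_union = diff_union))
```

The test statistic says whether an association exists; the sign and size of diff_union say which group has the higher rate and by how much.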
\(H_0\): Sex and college attainment are independent. \(H_1\): They are not. \(\alpha = 0.05\).
# A tibble: 2 × 4
sex `College+` `No college` Total
<chr> <int> <int> <int>
1 female 382 1316 1698
2 male 331 953 1284
College+ No college
female 406 1292
male 307 977
Pearson's Chi-squared test
data: tab2
X-squared = 4.3281, df = 1, p-value = 0.03749
# A tibble: 2 × 3
# Groups: sex [2]
sex `College+` `No college`
<chr> <dbl> <dbl>
1 female 0.225 0.775
2 male 0.258 0.742
Difference in proportions: -0.033
Women were 3.3 percentage points less likely than men to have a college degree (22.5% vs. 25.8%). A chi-squared test showed this association was statistically significant (\(\chi^2\)(1) = 4.3, \(p\) = 0.038). We reject \(H_0\); we accept \(H_1\): sex and college attainment are not independent in this sample.
Writing up chi-squared results — always include:
Connection to Week 8:
If the 95% CI does not contain the null value → reject the null at 0.05. Tests and CIs are two sides of the same coin.
| Test | Use when | R function | Example |
|---|---|---|---|
| One-sample t-test | Is the mean = some value? | t.test(x, mu = value) | Do GSS respondents earn differently from $30,000? |
| Two-sample t-test | Do two group means differ? | t.test(x, y) | Do union members work more hours than non-members? |
| Proportion test | Is a proportion = some value? | prop.test(x, n, p = value) | Is union membership different from 20%? |
| Chi-squared | Are two categorical variables independent? | chisq.test(table) | Is union membership associated with sex? |
Always report: the group statistic(s), the p-value, your decision, and a plain-language conclusion.
Key takeaway:
Hypothesis testing is not just a tool for this week — it is the core inferential logic running through all of statistics. Every result in your paper will be evaluated with the same p-value framework we built today.