
Sociology 106: Quantitative Sociological Methods
March 3, 2026
HW 4:
Open Week Topic:
Main idea: We’ve been building probability models of populations. Today we ask: what happens when we draw a sample—and then another, and another? The answer is what makes all of statistical inference possible.
The answer:
When we draw a sample—and then another, and another—the sample statistics vary in a systematic, mathematically predictable way. That predictable variation is what makes statistical inference possible.
Admin discussion:
Statistical content — three parts:
In-class lab:
Halfway done, will finish by next week. Overall, super interesting topics! Some general feedback:
Research design: How can you construct variables of interest? How do the variables in your data relate to your hypotheses? What units of analysis may be most appropriate? Are there other issues you may want to think about going forward?
What has already been written in the literature: some topics have been studied a lot, which is a good sign that your question is important! If your question seems already answered: try a different data source, or look for related unanswered questions in paper conclusions.
Due March 19
Identify ten scholarly sources related to your research question
For each source: in two short paragraphs:
Here is a link to an example of one source: you will need 10.
“effect of independent variable on dependent variable”
May need to use a UC Berkeley Library proxy to access some academic articles. This is a helpful webpage for using the proxy server and here is a link to make a virtual appointment for research help.
cited by icon under the citation
Don’t use AI for this:
AI will hallucinate citations more often than not, so don’t use it for this. Google scholar is probably your best bet here!
| Section | What to look for |
|---|---|
| Abstract | Overview: research question, data, main result — read first |
| Introduction | Slightly more detailed than abstract; same format |
| Conclusion | Results, alternative explanations, limitations — and future research topics |
| Background / Lit review | Previous research; a good source of additional citations |
| Data section | How the sample was constructed; what data sources researchers use |
| Methods | Don’t worry about this too much yet! |
Order of operations:
Start with abstract → conclusion → introduction. Only go deeper if the paper is clearly relevant.
How do we go from an abstract population to real data we can analyze?
| | Population | Sample |
|---|---|---|
| Concept | The whole universe a study aspires to generalize to | The subset of the population we actually observe |
| Quantity | Parameter — a number describing the population (usually unknown) | Statistic — a number computed from the data |
| Example | True proportion of Philadelphia residents who are African American: \(\pi = 0.422\) | Proportion of stopped drivers who were African American: \(\hat{p} = 0.79\) |
The goal of statistical inference: use a sample statistic to learn about a population parameter — we’ve been building toward this all semester
Start with a population: every individual you want to study. In our running example, this is all 1.45 million residents of Philadelphia in 1997. Since you can’t talk to everyone, you have to take a representative sample that you hope will approximate your population.
A sample is drawn from the population — the individuals we actually observe. The 262 drivers stopped by police are our sample, far smaller than the full population.
Key question: is our sample representative? Does it reflect the composition of the population? This depends entirely on the sampling design: random samples tend to be representative; convenience samples often are not.
To do inferential statistics, we need a representative sample — ideally, each unit of the population has an equal chance of being included
We also want samples that are large enough for precision, which increases with sample size
Important note
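Precision also decreases as the values in the population become more diverse. And because precision for a subgroup depends on how many of its members end up in the sample, larger samples are needed to draw inferences about small subgroups (by race, education, sexual orientation, etc.)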
Simple random sample: Use a random number generator to select from a population list

Stratified random sample: Divide population into homogeneous groups (strata), then randomly sample within each stratum

Cluster sampling:

Key difference: cluster sampling selects only some groups; stratified sampling samples from all strata.
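The three designs above can be sketched in base R. This is a minimal illustration on a made-up population; the data frame `pop`, its `stratum` and `cluster` columns, and all sizes are assumptions chosen only to show the mechanics.

```r
# Toy population: 1,000 people in 4 strata and 50 clusters of 20 people.
# All values are invented for illustration.
set.seed(106)
pop <- data.frame(
  id      = 1:1000,
  stratum = rep(c("A", "B", "C", "D"), each = 250),
  cluster = rep(1:50, each = 20)
)

# Simple random sample: every person has the same chance of selection
srs <- pop[sample(nrow(pop), size = 100), ]

# Stratified random sample: draw 25 people at random within every stratum
strat_rows <- unlist(lapply(split(pop$id, pop$stratum), sample, size = 25))
strat <- pop[strat_rows, ]

# Cluster sample: pick 5 clusters at random, then keep everyone in them
chosen <- sample(unique(pop$cluster), size = 5)
clust <- pop[pop$cluster %in% chosen, ]

table(strat$stratum)            # every stratum is represented
length(unique(clust$cluster))   # only the chosen clusters are represented
```

Note how the stratified sample includes every stratum, while the cluster sample includes only the selected clusters.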
When to use which:


True random samples are almost never feasible — we rarely have a complete population list
But we often know the probability that each individual would be selected, based on demographics, geography, or other characteristics
Sample weights are set inversely proportional to the probability of selection:
The GSS uses cluster sampling. Within each selected household, only one adult is chosen:
Always check whether your dataset uses weights, and apply them in R
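As a rough sketch of what applying weights looks like, the example below uses base R's `weighted.mean()`. The responses in `agree` and the selection probabilities in `p_select` are hypothetical; a real dataset such as the GSS ships its own weight variable, which you would use instead of computing one.

```r
# Hypothetical respondents: 1 = agrees with some survey item, 0 = does not
agree <- c(1, 0, 1, 1, 0, 1)

# Assumed probability that each respondent was selected into the sample
p_select <- c(0.50, 0.50, 0.25, 0.25, 0.10, 0.10)

# Weight = 1 / probability of selection, so rarely selected people count more
w <- 1 / p_select

mean(agree)               # unweighted proportion
weighted.mean(agree, w)   # weighted proportion
```

The two answers differ because respondents with a low chance of selection stand in for more people in the population.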
Statistics vary from sample to sample — they have their own distributions
A sampling distribution is the probability distribution of a sample statistic computed across many independent samples from the same population
Three distributions you must keep straight:
| Distribution | What it describes | Philadelphia example |
|---|---|---|
| Population | All individuals in the population | 1.45M Philadelphia residents, 42.2% African American |
| Sample (data) | The individuals in our sample | 262 police stops in 1997 |
| Sampling | How our statistic would vary across repeated samples | Distribution of \(\hat{p}\) across many random samples of 262 |
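One way to see the third row of the table is to simulate it. The sketch below draws many random samples of 262 from a population where 42.2% are African American and records the sample proportion each time. The number of repetitions (`reps`) and the seed are arbitrary choices.

```r
set.seed(1997)
n       <- 262     # sample size from the Philadelphia example
pi_true <- 0.422   # population proportion from the Philadelphia example
reps    <- 10000   # number of repeated samples (arbitrary)

# Each draw is the count of African American drivers in one random sample
# of size n; dividing by n turns the count into a sample proportion
p_hat <- rbinom(reps, size = n, prob = pi_true) / n

p_hat[1:5]    # five different samples give five different statistics
mean(p_hat)   # the proportions center near pi_true
sd(p_hat)     # their spread is the spread of the sampling distribution
```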
Key insights
Recall: \(X \sim B(n, \pi)\) has mean \(n\pi\) and SD \(\sqrt{n\pi(1-\pi)}\)
The sample proportion \(\hat{p} = X/n\) divides that count by \(n\). Its sampling distribution:
\[\mu_{\hat{p}} = \frac{n\pi}{n} = \pi \qquad SE(\hat{p}) = \frac{\sqrt{n\pi(1-\pi)}}{n} = \sqrt{\frac{\pi(1-\pi)}{n}}\]
Same formula as last week’s Binomial SD — divided by \(n\).
\[\mu_{\hat{p}} = \pi \qquad \text{(unbiased — centered on the true proportion)}\]
\[SE(\hat{p}) = \sqrt{\frac{\pi(1-\pi)}{n}} \qquad \text{(shrinks as } n \text{ grows)}\]
We use standard error (SE) rather than standard deviation to signal that this is the spread of a sampling distribution, not of raw data.
The SE answers: on average, how far will our \(\hat{p}\) fall from the true \(\pi\)?
Key insight:
This is the mechanism that makes large surveys more trustworthy — and why the GSS (≈3,000 respondents) gives more reliable estimates than a sample of 20.
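To put rough numbers on that comparison, the sketch below evaluates the SE formula at both sample sizes. The helper `se_prop` and the value `prop_assumed = 0.5` are assumptions for illustration; 0.5 is used because it gives the largest possible SE for a proportion.

```r
# Standard error of a sample proportion: sqrt(pi * (1 - pi) / n)
se_prop <- function(prop, n) sqrt(prop * (1 - prop) / n)

prop_assumed <- 0.5   # assumed population proportion (worst case for the SE)

se_prop(prop_assumed, n = 20)     # about 0.112
se_prop(prop_assumed, n = 3000)   # about 0.009
```

Under this assumption, the small sample's estimate is typically off by around 11 percentage points, while the GSS-sized sample is typically off by less than 1.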
Data from Philadelphia, 1997:
Question: If drivers were stopped at random, how likely is it that 79% of stopped drivers would be African American?
Assume random stopping: \(X \sim B(262, 0.422)\)
The sampling distribution of \(\hat{p}\) under random stopping:
\[\mu_{\hat{p}} = 0.422\]
\[SE(\hat{p}) = \sqrt{\frac{0.422 \times 0.578}{262}} \approx 0.030\]
Observed: \(\hat{p} = 0.79\)
Distance from expected:
\[z = \frac{0.79 - 0.422}{0.030} \approx 12 \text{ standard errors above the mean}\]
Under random stopping, \(\hat{p} = 0.79\) is essentially impossible: it lies 12 SEs above what we’d expect. The sampling distribution reveals that this pattern cannot be explained by chance alone; something other than random stopping, such as bias, must be at work.
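The calculation above can be reproduced in R. The last line adds an exact binomial tail probability as a cross-check; the count of 207 stops is an assumption obtained by rounding 0.79 × 262, not a figure given in the data.

```r
n       <- 262     # stops in the sample
pi_null <- 0.422   # proportion expected under random stopping
p_hat   <- 0.79    # observed proportion

se <- sqrt(pi_null * (1 - pi_null) / n)
z  <- (p_hat - pi_null) / se
se   # about 0.0305
z    # about 12

# Exact binomial tail: chance of at least 207 of 262 stops under random stopping.
# 207 is inferred by rounding 0.79 * 262 (an assumption, see above).
pbinom(206, size = n, prob = pi_null, lower.tail = FALSE)
```

The tail probability is vanishingly small, which matches the conclusion drawn from the z-score.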

As \(n\) increases, the sampling distribution of \(\hat{p}\) becomes tighter and more bell-shaped, which is the Central Limit Theorem in action:
For a random sample of size \(n\) from a population with mean \(\mu\) and standard deviation \(\sigma\):
\[\mu_{\bar{x}} = \mu \qquad \text{(sample mean is unbiased)}\]
\[SE(\bar{x}) = \frac{\sigma}{\sqrt{n}} \qquad \text{(precision increases with sample size)}\]
Doubling \(n\) cuts the SE by a factor of \(\sqrt{2}\) — not 2. Precision is expensive!
The problem: You manage a pizza restaurant and want to estimate your true average daily sales. You know from years of records that daily sales average \(\mu = \$900\) with a standard deviation of \(\sigma = \$300\) — but sales fluctuate day to day. If you observe only \(n = 7\) days, how close will your sample mean be to the true average?
\[SE(\bar{x}) = \frac{\$300}{\sqrt{7}} \approx \$113\]
A 7-day average will typically be within \(\pm\$226\) (2 SEs) of the true mean, so your estimate could easily be off by $100 or more. Observing 28 days instead cuts the SE in half: \(\frac{\$300}{\sqrt{28}} \approx \$57\), giving a much more reliable estimate.
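The pizza numbers are quick to check in R. The helper name `se_mean` is just for illustration.

```r
sigma <- 300   # SD of daily sales, in dollars

# Standard error of a sample mean: sigma / sqrt(n)
se_mean <- function(sigma, n) sigma / sqrt(n)

se_mean(sigma, 7)    # about 113
se_mean(sigma, 14)   # about 80: doubling n divides the SE by sqrt(2)
se_mean(sigma, 28)   # about 57: quadrupling n halves the SE
```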
As \(n\) grows, the sampling distribution narrows around the true mean \(\mu\):

The theorem that makes all of statistical inference possible
This is one of the most powerful theorems in all of statistics.
If repeated independent samples of size \(n\) are drawn from any population (regardless of its shape) having mean \(\mu\) and standard deviation \(\sigma\), then, as \(n\) becomes large, the sampling distribution of the sample mean approaches a Normal distribution:
\[\bar{x} \;\sim\; N\!\left(\mu,\; \frac{\sigma}{\sqrt{n}}\right)\]
Why is this remarkable? The population can be skewed, bimodal, uniform: it doesn’t matter. As long as \(n\) is large enough, \(\bar{x}\) is approximately Normal.
This is why the Normal distribution appears everywhere in statistics — and why the tools we build next all work.
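A small simulation can make the theorem concrete. The sketch below samples from an exponential population, which is strongly right-skewed, and checks that the sample means behave as the CLT predicts. The choice of population, the sample size of 50, and the number of repetitions are all assumptions for illustration.

```r
set.seed(106)
n    <- 50     # size of each sample (arbitrary)
reps <- 5000   # number of repeated samples (arbitrary)

# Exponential population with rate 1: right-skewed, with mean 1 and SD 1
x_bar <- replicate(reps, mean(rexp(n, rate = 1)))

mean(x_bar)   # close to the population mean, 1
sd(x_bar)     # close to sigma / sqrt(n) = 1 / sqrt(50), about 0.141

# Share of sample means within 2 SEs of the population mean;
# roughly 0.95 if the sampling distribution is approximately Normal
mean(abs(x_bar - 1) < 2 / sqrt(n))
```

Plotting `hist(x_bar)` next to a histogram of a single `rexp()` draw shows the skew largely disappearing in the sample means.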
The sampling distribution approaches normality faster for more symmetric populations:
Why this matters for your research: Most survey samples (GSS, IPUMS, etc.) have \(n\) in the hundreds or thousands, large enough for the CLT approximation to be reliable in most cases. This is what lets us do inference from survey data without knowing the full population distribution.
GSS data show that among employed US adults, weekly work hours are right-skewed: \(\mu = 40.5\) hours, \(\sigma = 14\) hours.
A labor researcher samples \(n = 35\) workers at a specific company and finds a mean of 45 hours/week.
Question: If this company is typical of the US workforce, how unusual is a sample mean of 45 hours or higher?
The population is right-skewed, but \(n = 35\) meets the common \(n > 30\) rule of thumb, so the CLT approximation is reasonable:
\[SE(\bar{x}) = \frac{\sigma}{\sqrt{n}} = \frac{14}{\sqrt{35}} \approx 2.37\]
\[\bar{x} \sim N(40.5,\; 2.37)\]
Even though individual work hours are skewed, the sample mean is approximately Normal.
Compute the probability in R
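A call along these lines gives the upper-tail probability, using the values from the example. The variable names are illustrative.

```r
mu    <- 40.5   # population mean of weekly hours
sigma <- 14     # population SD of weekly hours
n     <- 35     # sample size

# P(sample mean >= 45), using the Normal approximation to the
# sampling distribution of the mean
pnorm(45, mean = mu, sd = sigma / sqrt(n), lower.tail = FALSE)
```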
About 2.9% — if the company were typical, a mean this high would occur only 3% of the time by chance. This gives the researcher grounds to argue the company is unusually demanding.
Why sd = 14/sqrt(35) and not sd = 14?
The sd argument must match the distribution you’re asking about. Here the question is about a sample mean, not an individual worker:
The same gap feels very different depending on scale. A single worker putting in 45 hours is common — but a group average of 45 hours is rare, because averaging 35 people smooths out the extremes. That shrinkage by \(\sqrt{n}\) is exactly what the SE captures.
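The contrast can be seen by running the same `pnorm()` call with each choice of `sd`. Treating an individual worker's hours as Normal is only a rough approximation here, since the population is right-skewed, so read the first number as a ballpark figure.

```r
mu    <- 40.5
sigma <- 14
n     <- 35

# One worker at 45+ hours: the spread is the population SD.
# (Rough approximation: individual hours are right-skewed, not Normal.)
p_individual <- pnorm(45, mean = mu, sd = sigma, lower.tail = FALSE)

# Mean of 35 workers at 45+ hours: the spread is the SE, not the SD
p_mean <- pnorm(45, mean = mu, sd = sigma / sqrt(n), lower.tail = FALSE)

c(individual = p_individual, sample_mean = p_mean)   # roughly 0.37 vs 0.03
```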

Key takeaway:
Sample → statistic → sampling distribution → inference. This is the chain that runs through the rest of the course.
Key takeaway:
The CLT is not just this week’s topic. It is the engine of all classical statistical inference. Once you understand it, you understand why every test we’ll do actually works.
Weekly Assignment #6
Annotated Bibliography
Download lab4.qmd from bCourses under “assignments” > “Lab #4”
Save lab4.qmd in your labs folder
Use the Explorer button on the left to find and open lab4.qmd