Libraries and environments¶

Here is a list of all the Python libraries needed to replicate this tutorial. You will need to uncomment these lines of code and run them if they are not already installed.

In [ ]:
#
# Install Python libraries if necessary
# ----------------------------------------
#%pip install geopandas pandas matplotlib seaborn pygris
#%pip install "folium>=0.12" matplotlib mapclassify

1. Import Python libraries¶

Now that the appropriate libraries are installed, we need to import them for the tutorial.

In [13]:
#
# Import libraries
# ----------------------------------------
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

2. Get census tract and census place data¶

I prefer to import shapefiles directly from an API so that my work is reproducible and so that I do not have to manually download something. The easiest way to get Census shapefiles is to directly import them from the Census API. There are many libraries to do this; however my favorite is to use the pygris library. I prefer this library to others since it is relatively simple, the library is stable, and does not require a Census API key.

I pull all the shapefiles for all tracts and places in California so that we do not have to worry about whether the spatial extents of the tracts or places shapefiles are larger. If the place extent is larger than the tracts, then we could miss some tracts that are actually located within San Francisco proper when we do the spatial join.

In [14]:
#
# Pull shapefiles
# ----------------------------------------

# tracts
from pygris import tracts
ca_tracts = tracts(state = "CA", cb = True, year=2022, cache=True) # cb = true calls cartographic boundary files that are simplified and load/process faster

# places
from pygris import places
ca_places = places(state = "CA", cb = True, year=2022, cache=True) # cache set to true makes it easier to load if we call again
Using FIPS code '06' for input 'CA'
Using FIPS code '06' for input 'CA'

3. View tracts shapefile¶

Let's look at just the tracts shapefile to see what we have. Remember, we pulled in all tracts for the entire state of California so that when we conduct the spatial join of tracts to place, we won't be missing any tracts if, for some random reason, the place file for San Francisco is larger than than the spatial extent of San Francisco tracts. This is a conservative approach and you will likely get the same result if you were to subset the tracts for only San Francisco. Nevertheless, I've included them here.

I use .explore here to view the spatial object interactively. Since we have all the tracts for Califorinia, I use a location parameter in the call to initialize the interactive features on San Francisco, which I can adjust with the zoom_start parameter. Larger numbers zoom in.

In [15]:
#
# View tracts
# ----------------------------------------

# Coordinates for San Francisco (approximate center)
sf_coords = [37.7749, -122.4194]

# Plot interactive map, zoomed into San Francisco
ca_tracts.explore(
    color = "blue",       # set color to blue
    alpha=0.4,            # make face color somewhat transparent
    location=sf_coords,   # location of 
    zoom_start=11         # adjust zoom level as needed
    
)
Out[15]:
Make this Notebook Trusted to load map: File -> Trust Notebook

5. View place shapefile¶

Just as with the tracts shapefile, I want to view the places shapefile. Again, I'll focus on San Francisco as before. It's always good practice to view the data you want to spatially join.

Notice here that the polygon cover San Francisco is mostly solid and without the same borders dividing up the area as with the tracts. This is because the place polygon covers all of San Francisco, whereas the tracts detail where all the tracts are. Also notice that not all of the land is covered by places. This is because places are a US Census definition and not all areas meet those definitions to be considered places.

In [16]:
#
# View places
# ----------------------------------------

# Plot interactive map, zoomed into San Francisco
ca_places.explore(
    color = "red",        # set color to red
    alpha=0.4,            # make face color somewhat transparent
    location=sf_coords,   # location of focus
    zoom_start=11         # adjust zoom level as needed
    
)
Out[16]:
Make this Notebook Trusted to load map: File -> Trust Notebook

5. Spatially join tract and place data¶

Here I want to subset my census data to only the tracts that fall within the census place boundary of San Francisco. Since census tracts might not map on perfectly to the San Francisco place boundary, I selected all tracts whose centroid fell within the San Francisco place shapefile. This is a standard approach for determining which combining shapes that might overlap.

One key thing to note here is that you want both shapefiles to be in the same coordinate reference system (crs) so that the spatial join is being conducted with the proper precision. For more on crs' see this great geopandas explainer.

In [17]:
#
# Subset tract data to city boundary of San Francisco
# ----------------------------------------

# 1. prep
# ---------
# get only San Francisco place boundary for spatial join
san_fran_place = ca_places[ca_places["NAME"] == "San Francisco"]

# choose a projected CRS for California (meters)
proj_crs = "EPSG:3310"  # California Albers

# reproject both shapefiles into same projected CRS
tracts_proj = ca_tracts.to_crs(proj_crs)
san_fran_proj = san_fran_place.to_crs(proj_crs)

# create centroids in projected CRS
tracts_proj["centroid"] = tracts_proj.centroid

# 2. spatial join 
# ---------
san_fran_tracts = gpd.sjoin(
    tracts_proj.set_geometry("centroid"),  # use centroids for join
    san_fran_proj,
    predicate="within"
).drop(columns="geometry")  # drop centroid geometry after join

# 3. cleaning
# ---------
# remove Farallon Islands which are 30 miles off shore and not necessary for most demo calculations
san_fran_tracts = san_fran_tracts[san_fran_tracts["GEOID_left"] != "06075980401"] 

# restore original tract geometries for mapping (otherwise centroids will show)
san_fran_tracts = san_fran_tracts.set_geometry(tracts_proj.geometry)

# 4. plot
# ---------
sf_map = san_fran_tracts.explore(color = "purple", edgecolor="black", alpha=0.4)
sf_map
Out[17]:
Make this Notebook Trusted to load map: File -> Trust Notebook