College football with sportsdataverse-py
Saturdays in autumn, condensed into tidy DataFrames. In a few lines of Python you're about to pull a decade of play-by-play, full rosters, schedules, team info, plus live ESPN scoreboards, standings, polls and recruiting boards, all as clean polars frames ready to model.
CFB has no single native premium API, so our premium path is two-pronged:
- Release loaders (load_cfb_*): pre-built, EPA/WPA-enriched datasets served straight from the cfbfastR-data GitHub release. Fast, reliable, no key needed.
- ESPN families (espn_cfb_*): live scoreboards, team pages, standings, rankings, recruiting and per-play participants.
R user? Every verb here has a twin in cfbfastR. Let's kick off!
The toolbox
Everything returns a tidy polars DataFrame by default; pass return_as_pandas=True for pandas, or (on the espn_cfb_* wrappers) return_parsed=False for the raw JSON. ⭐ marks the premium path.
| Function | What it gives you | Source |
|---|---|---|
| load_cfb_pbp | Full play-by-play with EPA/WPA, since 2003 | ⭐ release |
| load_cfb_rosters | Season rosters (bio, position, hometown) | ⭐ release |
| load_cfb_schedule | Season schedule + results + Elo | ⭐ release |
| load_cfb_team_info | Team metadata: conference, colors, venue | ⭐ release |
| load_cfb_betting_lines | Historical betting market lines (spread/total/ML) | ⭐ release |
| espn_cfb_scoreboard | Live + recent scoreboard for a date/week | ⭐ ESPN |
| espn_cfb_schedule | ESPN schedule frame for a date/week | ⭐ ESPN |
| espn_cfb_teams | Every FBS/FCS team (grab team_ids) | ⭐ ESPN |
| espn_cfb_team_roster | One team's roster | ⭐ ESPN |
| espn_cfb_team_schedule | One team's schedule | ⭐ ESPN |
| espn_cfb_standings | Conference / division standings | ⭐ ESPN |
| espn_cfb_rankings | AP / Coaches / CFP polls | ⭐ ESPN |
| espn_cfb_leaders | League stat leaders by category | ⭐ ESPN |
| espn_cfb_recruits | Season recruiting class | ⭐ ESPN |
| espn_cfb_play_participants | Per-play athletes (passer/rusher/tackler...) | ⭐ ESPN |
| CFBPlayProcess | Full ESPN PBP pipeline (EPA/WPA + box) | ⭐ ESPN |
| most_recent_cfb_season | The current season year helper | helper |
Setup
pip install sportsdataverse
No API key required. The load_cfb_* loaders read public parquet from the
cfbfastR-data release, and the espn_cfb_* wrappers hit ESPN's public
endpoints.
import polars as pl
import sportsdataverse as sdv
from sportsdataverse.cfb import most_recent_cfb_season
SEASON = most_recent_cfb_season()
print('most recent CFB season:', SEASON)
ESPN's live endpoints are seasonal and occasionally rate-limited, so a tiny safe() helper runs the riskier calls defensively: you get the frame when the feed is up, and a friendly one-liner when it isn't (never a scary traceback). The release loaders are reliable, so we call those directly.
def safe(label, thunk):
    # Run thunk(); return its result, or None (with a short message) on any error.
    try:
        out = thunk()
        print(f'✅ {label}')
        return out
    except Exception as e:  # noqa: BLE001 -- demo resilience
        print(f'⚠️ {label}: unavailable right now ({type(e).__name__})')
        return None
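Since the ESPN endpoints are described above as occasionally rate-limited, one possible extension is to retry a few times before giving up. The sketch below is only an illustration under that assumption: safe_retry, its attempts, base_delay and sleep parameters, and the flaky_feed stand-in are hypothetical names, not part of sportsdataverse.

```python
import time


def safe_retry(label, thunk, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call thunk(), retrying with exponential backoff; return None if every attempt fails.

    Hypothetical helper for illustration; not part of sportsdataverse.
    """
    for attempt in range(1, attempts + 1):
        try:
            out = thunk()
            print(f'ok: {label} (attempt {attempt})')
            return out
        except Exception as e:  # noqa: BLE001 -- demo resilience
            print(f'{label}: {type(e).__name__} on attempt {attempt}')
            if attempt < attempts:
                # Wait base_delay, then 2x, 4x, ... between attempts.
                sleep(base_delay * 2 ** (attempt - 1))
    print(f'{label}: unavailable right now')
    return None


# Stand-in for a flaky endpoint: fails twice, then succeeds.
calls = {'n': 0}


def flaky_feed():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated rate limit')
    return {'games': 12}


# sleep is swapped for a no-op so the demo runs instantly.
result = safe_retry('demo feed', flaky_feed, sleep=lambda s: None)
```

Passing the sleep function in as a parameter keeps the helper easy to exercise without real waiting; in a notebook you would leave it at its default.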
Premium loaders: a whole season in one call
The load_cfb_* family is the fastest way to get clean, complete season data. Each takes a seasons= int or list (≥ 2003) and returns one tidy frame. Let's start with the schedule: one row per game, with final scores, conference flags, and pre/post-game Elo ratings baked in.
| Function | Grain | Highlights |
|---|---|---|
| load_cfb_schedule | one row / game | scores, Elo, neutral-site & conference flags |
schedule = sdv.cfb.load_cfb_schedule(seasons=[2023])
print('schedule shape:', schedule.shape)
schedule.select([
    'game_id', 'week', 'home_team', 'away_team',
    'home_points', 'away_points', 'home_conference', 'neutral_site',
]).head()
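Because the loaders accept either a single year or a list, it can be handy to normalize the argument before calling them. The helper below is a minimal sketch, not library code: normalize_seasons and FIRST_SEASON are hypothetical names, and the 2003 floor is simply the one quoted in the text above.

```python
FIRST_SEASON = 2003  # floor quoted in the text above; an assumption of this sketch


def normalize_seasons(seasons):
    """Return a sorted, de-duplicated list of season years, rejecting years below the floor."""
    if isinstance(seasons, int):
        seasons = [seasons]
    years = sorted(set(int(s) for s in seasons))
    too_early = [y for y in years if y < FIRST_SEASON]
    if too_early:
        raise ValueError(f'seasons before {FIRST_SEASON} are not served: {too_early}')
    return years


recent = normalize_seasons(range(2021, 2024))
single = normalize_seasons(2023)
```

The resulting list could then be passed as seasons=recent to any of the load_cfb_* loaders.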
Premium loaders: rosters
load_cfb_rosters gives you every listed player for a season: name, position, jersey, physicals and hometown. Perfect for joining onto play-by-play or building depth tables.
rosters = sdv.cfb.load_cfb_rosters(seasons=[2023])
print('rosters shape:', rosters.shape)
rosters.select([
    'athlete_id', 'first_name', 'last_name', 'team',
    'position', 'jersey', 'home_state',
]).head()
Premium loaders: team info
load_cfb_team_info
carries the reference metadata you'll want to label every chart: school
name, conference, classification (FBS/FCS), team colors, and venue.
team_info = sdv.cfb.load_cfb_team_info(seasons=[2023])
print('team_info shape:', team_info.shape)
team_info.select([
    'team_id', 'school', 'conference', 'classification',
    'venue_name', 'city', 'state', 'dome',
]).head()
Premium loaders: play-by-play with EPA
The crown jewel. load_cfb_pbp returns every play of a season with hundreds of engineered columns: down & distance, win probability, and Expected Points Added (EPA) already computed. (It's a big pull, so we grab a single season and peek.)
# The release serves PBP for whichever seasons are currently published.
# Try a few recent-ish seasons and keep the first one that comes back full,
# so the EPA recipes below always have real plays to chew on.
pbp = pl.DataFrame()
for yr in (2023, 2022, 2021, 2020):
    cand = safe(f'load_cfb_pbp {yr}', lambda yr=yr: sdv.cfb.load_cfb_pbp(seasons=[yr]))
    if cand is not None and cand.width > 0 and cand.height > 0:
        pbp, PBP_SEASON = cand, yr
        break
else:
    # for/else: runs only when no season in the loop produced a usable frame
    PBP_SEASON = None
print('pbp season:', PBP_SEASON, '| pbp shape:', pbp.shape)
cols = ['game_id', 'start.pos_team.name', 'down', 'distance',
        'play_type', 'EPA', 'wpa']
have = [c for c in cols if c in pbp.columns]
pbp.select(have).head() if have else 'pbp not published for these seasons right now'
Live from ESPN: the scoreboard
When you need today's slate (or a specific date), the ESPN wrappers shine.
espn_cfb_scoreboard takes a
dates=YYYYMMDD (or season year) and returns the games on the board. We wrap
it in safe() since live endpoints can be quiet in the offseason.
board = safe(
    'ESPN scoreboard',
    lambda: sdv.cfb.espn_cfb_scoreboard(dates=20231125),  # rivalry Saturday
)
if board is not None and getattr(board, 'height', 0):
    keep = [c for c in board.columns
            if c in ('game_id', 'name', 'short_name', 'status_type_description',
                     'home_team_abbreviation', 'away_team_abbreviation')]
    out = board.select(keep).head() if keep else board.head()
else:
    out = 'no games on the board for that date'
out
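The dates argument above is a plain YYYYMMDD integer, so it can be built from the standard library rather than typed by hand. The sketch below is one way to list the Saturdays in a date range in that format; saturdays_between is a hypothetical helper name, not part of sportsdataverse.

```python
from datetime import date, timedelta


def saturdays_between(start, end):
    """Yield every Saturday from start to end (inclusive) as a YYYYMMDD integer."""
    # date.weekday(): Monday is 0, so Saturday is 5.
    offset = (5 - start.weekday()) % 7
    day = start + timedelta(days=offset)
    while day <= end:
        yield int(day.strftime('%Y%m%d'))
        day += timedelta(days=7)


november_2023 = list(saturdays_between(date(2023, 11, 1), date(2023, 11, 30)))
print(november_2023)  # [20231104, 20231111, 20231118, 20231125]
```

Each value could then be handed to espn_cfb_scoreboard(dates=...) inside safe(), one Saturday at a time.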
Live from ESPN: teams (and their team_ids)
espn_cfb_teams lists every
team in a division (groups=80 FBS, groups=81 FCS). The team_id column
is the key you feed into every team-scoped ESPN call below.
teams = safe('ESPN teams', sdv.cfb.espn_cfb_teams)
if teams is not None and teams.height:
    cols = [c for c in ('team_id', 'team_location', 'team_name',
                        'team_abbreviation') if c in teams.columns]
    out = teams.select(cols).head(8)
else:
    out = 'teams unavailable right now'
out
Cookbook: common CFB tasks
Now the fun part: real questions, answered with a few expressions. The loaders are reliable so these recipes lean on them, reaching for ESPN where it adds something live.
Recipe 1: Highest-scoring games of the season
Straight from the loaded schedule: add the two scores and sort. No casting needed, since the release frame already stores points as integers.
(schedule
 .with_columns(
     (pl.col('home_points') + pl.col('away_points')).alias('total_points')
 )
 .sort('total_points', descending=True)
 .select(['week', 'home_team', 'away_team',
          'home_points', 'away_points', 'total_points'])
 .head(10))
Recipe 2: Team offensive EPA/play leaderboard
This is what premium EPA-tagged play-by-play unlocks. Filter to real scrimmage plays, group by the offense, average the EPA per play, and keep offenses with at least 500 plays: a clean efficiency ranking in a few lines.
team_col = 'start.pos_team.name'  # human-readable offense on each play
epa_cols = {team_col, 'EPA', 'play'}
if epa_cols.issubset(pbp.columns):
    leaderboard = (
        pbp
        .filter(pl.col('play') & pl.col('EPA').is_not_null())
        .group_by(team_col)
        .agg(
            pl.len().alias('plays'),
            pl.col('EPA').mean().round(3).alias('epa_per_play'),
        )
        .filter(pl.col('plays') >= 500)
        .sort('epa_per_play', descending=True)
        .rename({team_col: 'offense'})
        .head(15)
    )
    out = leaderboard
else:
    out = 'expected EPA columns not present in this pbp build'
out
Recipe 3: A team's roster, counted by position
Filter the loaded roster to a single school by its team name, then count the depth at each position.
team_name = 'Michigan'
squad = (
    rosters
    .filter(pl.col('team') == team_name)
    .select(['first_name', 'last_name', 'position', 'jersey',
             'height', 'weight', 'home_state'])
)
if squad.height:
    depth = (squad.group_by('position')
             .agg(pl.len().alias('players'))
             .sort('players', descending=True))
    print(f'{team_name}: {squad.height} players')
    out = depth.head(10)
else:
    out = f'no roster rows for {team_name} (try another school string)'
out
Recipe 4: Who was on the field? Per-play participants
espn_cfb_play_participants resolves the athletes involved in each play (passer, rusher, receiver, tackler...) straight from ESPN's authoritative participants[] array, which is far more reliable than regex-parsing the play text. Set resolve_missing=False to skip the per-athlete $ref fan-out and keep it snappy.
gid = 401628334  # 2024 CFP National Championship
participants = safe(
    f'play participants {gid}',
    lambda: sdv.cfb.espn_cfb_play_participants(
        game_id=gid, resolve_missing=False,
    ),
)
if participants is not None and getattr(participants, 'height', 0):
    name_cols = [c for c in participants.columns if c.endswith('_player_name')]
    show = ['play_id'] + name_cols[:4] if 'play_id' in participants.columns else name_cols[:5]
    out = participants.select([c for c in show if c in participants.columns]).head()
else:
    out = 'participants feed quiet right now (offseason / rate limit)'
out
Recipe 5: Build a standings table from the schedule
No standings endpoint needed: stack each team's home and away results, count wins and losses, and you've got a win-percentage table for any season the loader serves.
completed = schedule.filter(pl.col('completed') == True)
home = completed.select(
    pl.col('home_team').alias('team'),
    (pl.col('home_points') > pl.col('away_points')).alias('win'),
)
away = completed.select(
    pl.col('away_team').alias('team'),
    (pl.col('away_points') > pl.col('home_points')).alias('win'),
)
standings_tbl = (
    pl.concat([home, away])
    .group_by('team')
    .agg(
        pl.col('win').sum().alias('wins'),
        (~pl.col('win')).sum().alias('losses'),
    )
    .with_columns(
        (pl.col('wins') / (pl.col('wins') + pl.col('losses')))
        .round(3).alias('win_pct')
    )
    .sort(['wins', 'win_pct'], descending=True)
)
standings_tbl.head(10)
Recipe 6: End-of-season Elo power ratings
Every schedule row ships pre- and post-game Elo ratings. Grab each team's most recent post-game Elo (sort by week descending, take the first) for a tidy, ready-to-rank power table, with no model to fit.
elo = (
    pl.concat([
        schedule.select(
            pl.col('home_team').alias('team'),
            pl.col('week'),
            pl.col('home_postgame_elo').alias('elo'),
        ),
        schedule.select(
            pl.col('away_team').alias('team'),
            pl.col('week'),
            pl.col('away_postgame_elo').alias('elo'),
        ),
    ])
    .filter(pl.col('elo').is_not_null())
    .sort('week', descending=True)
    .group_by('team', maintain_order=True)
    .agg(pl.first('elo').alias('final_elo'))
    .sort('final_elo', descending=True)
)
elo.head(15)
Recipe 7: One team's full game log
Filter the schedule to a single program, then flip the home/away columns so every row reads from that team's perspective: opponent, points for, points against, and the margin. Swap team to scout anyone.
team = 'Michigan'
gamelog = (
    schedule
    .filter((pl.col('home_team') == team) | (pl.col('away_team') == team))
    .unique(subset=['game_id'])
    .with_columns(
        pl.when(pl.col('home_team') == team)
        .then(pl.col('away_team')).otherwise(pl.col('home_team'))
        .alias('opponent'),
        pl.when(pl.col('home_team') == team)
        .then(pl.col('home_points')).otherwise(pl.col('away_points'))
        .alias('pts_for'),
        pl.when(pl.col('home_team') == team)
        .then(pl.col('away_points')).otherwise(pl.col('home_points'))
        .alias('pts_against'),
    )
    .with_columns(
        (pl.col('pts_for') - pl.col('pts_against')).alias('margin')
    )
    .select(['week', 'opponent', 'pts_for', 'pts_against', 'margin',
             'neutral_site'])
    .sort('week')
)
gamelog.head(16) if gamelog.height else f'no games found for {team}'
Recipe 8: Rushing leaders, EPA included
Premium play-by-play means leaderboards aren't just totals: they carry efficiency. Filter to rushing plays, sum the yards, and average the EPA per carry (minimum 100 carries) to separate the bell-cows from the truly explosive backs.
rush_cols = {'rush', 'rusher_player_name', 'statYardage', 'EPA'}
if rush_cols.issubset(pbp.columns):
    rushers = (
        pbp
        .filter((pl.col('rush') == True)
                & pl.col('rusher_player_name').is_not_null())
        .group_by('rusher_player_name')
        .agg(
            pl.len().alias('carries'),
            pl.col('statYardage').sum().alias('rush_yds'),
            pl.col('EPA').mean().round(3).alias('epa_per_rush'),
        )
        .filter(pl.col('carries') >= 100)
        .sort('rush_yds', descending=True)
        .head(15)
    )
    out = rushers
else:
    out = 'rushing columns not present in this pbp build'
out
Recipe 9: The most thrilling games of the year
cfbfastR's schedule ships an excitement_index (a win-probability swinginess score). Sort it descending and you've ranked the season's white-knuckle finishes in one line.
thrillers = (
    schedule
    .filter(pl.col('excitement_index').is_not_null())
    .sort('excitement_index', descending=True)
    .select(['week', 'home_team', 'away_team',
             'home_points', 'away_points', 'excitement_index'])
    .head(10)
)
thrillers
Recipe 10: Where does the talent come from?
Roll the season roster up by home_state to map the recruiting footprint of college football, a quick reminder of just how much of the sport flows out of a handful of states.
talent_map = (
    rosters
    .filter(pl.col('home_state').is_not_null())
    .group_by('home_state')
    .agg(pl.len().alias('players'))
    .sort('players', descending=True)
    .head(15)
)
talent_map
Recipe 11: Conference vs. non-conference, by margin
The schedule's conference_game flag lets you split the slate. Restrict to completed games with an FBS home team, then compare the average final margin in league play versus the out-of-conference cupcakes; league games are (predictably) tighter.
fbs = schedule.filter(
    (pl.col('home_division') == 'fbs') & (pl.col('completed') == True)
)
splits = (
    fbs
    .with_columns(
        (pl.col('home_points') - pl.col('away_points')).abs().alias('margin')
    )
    .group_by('conference_game')
    .agg(
        pl.len().alias('games'),
        pl.col('margin').mean().round(1).alias('avg_margin'),
        pl.col('home_points').add(pl.col('away_points'))
        .mean().round(1).alias('avg_total_points'),
    )
    .sort('conference_game')
)
splits
Recipe 12: Biggest betting favorites of the latest season
load_cfb_betting_lines is a premium release frame of historical sportsbook lines. Take the most recent season in the frame, average the spread across books per game, and sort to surface the most lopsided favorites, the mismatches Vegas saw coming a mile away.
lines = safe('load_cfb_betting_lines', sdv.cfb.load_cfb_betting_lines)
if lines is not None and {'season', 'market_type', 'lines',
                          'game_desc', 'abbr'}.issubset(lines.columns):
    target = sorted(lines['season'].drop_nulls().unique().to_list())[-1]
    favorites = (
        lines
        .filter((pl.col('season') == target)
                & (pl.col('market_type') == 'spread')
                & pl.col('lines').is_not_null())
        .group_by(['game_desc', 'abbr'])
        .agg(pl.col('lines').mean().round(1).alias('avg_spread'))
        .filter(pl.col('avg_spread') < 0)  # negative spread = favorite
        .sort('avg_spread')
        .head(10)
    )
    print(f'biggest favorites, {int(target)} season:')
    out = favorites
else:
    out = 'betting-lines frame unavailable right now'
out
Recipe 13: Hand it to pandas
Every loader takes return_as_pandas=True, and any polars frame converts with .to_pandas(). Once it's a pandas DataFrame the whole pandas/numpy/scikit-learn world opens up; here, a one-call .describe() of scoring across the season.
score_pd = (
    schedule
    .select(['home_points', 'away_points'])
    .to_pandas()
)
score_pd['total_points'] = score_pd['home_points'] + score_pd['away_points']
print(type(score_pd).__module__)
score_pd.describe().round(1)
Live tour: standings, polls, leaders & recruits
A quick lap through the rest of the live ESPN surface. Each is wrapped in
safe() so the page renders cleanly whatever the feed is doing today.
| Function | Use it for |
|---|---|
| espn_cfb_standings | conference / division standings |
| espn_cfb_rankings | AP / Coaches / CFP polls |
| espn_cfb_leaders | league stat leaders by category |
| espn_cfb_recruits | a season's recruiting class |
standings = safe('ESPN standings', sdv.cfb.espn_cfb_standings)
rankings = safe('ESPN rankings (polls)', sdv.cfb.espn_cfb_rankings)
(standings.head()
 if standings is not None and getattr(standings, 'height', 0)
 else (rankings.head()
       if rankings is not None and getattr(rankings, 'height', 0)
       else 'standings & rankings unavailable right now'))
leaders = safe(
    'ESPN passing leaders',
    lambda: sdv.cfb.espn_cfb_leaders(category='passingYards', season=2023, limit=15),
)
recruits = safe(
    'ESPN recruiting class',
    lambda: sdv.cfb.espn_cfb_recruits(season=2024, limit=25),
)
(leaders.head()
 if leaders is not None and getattr(leaders, 'height', 0)
 else (recruits.head()
       if recruits is not None and getattr(recruits, 'height', 0)
       else 'leaders & recruits unavailable right now'))
Bonus: process one game from scratch with CFBPlayProcess
Want EPA/WPA on a single live game without loading a whole season?
CFBPlayProcess drives the
full ESPN pipeline: .espn_cfb_pbp() fetches the raw summary, then
.run_processing_pipeline() returns a dict whose plays key is the
fully-featured play list (alongside an advanced box score and metadata).
from sportsdataverse.cfb import CFBPlayProcess
def process_game(game_id):
    # Fetch the raw ESPN summary, run the pipeline, and frame the processed plays.
    game = CFBPlayProcess(gameId=game_id)
    game.espn_cfb_pbp()
    processed = game.run_processing_pipeline()
    return pl.DataFrame(processed['plays'], infer_schema_length=None)
plays = safe('CFBPlayProcess 401628334', lambda: process_game(401628334))
if plays is not None and plays.height:
    cols = [c for c in ('period', 'pos_team', 'down', 'distance',
                        'play_type', 'EPA') if c in plays.columns]
    out = plays.select(cols).head()
else:
    out = 'live PBP pipeline quiet right now'
out
Where to next
- Loaders are your premium fast-path; full reference on the Loaders page (load_cfb_pbp, load_cfb_rosters, load_cfb_schedule, load_cfb_team_info).
- ESPN families live across the Site, Web, Core and Additional reference pages.
- Pass return_as_pandas=True for pandas, or return_parsed=False on the espn_cfb_* wrappers for the raw JSON.
- R user? The same verbs live in cfbfastR.
- Part of the SportsDataverse ecosystem.
Now go chart some chunk plays, and may your EPA always be positive!