🏈 College football with sportsdataverse-py

Saturdays in autumn, condensed into tidy DataFrames. 🍂 In a few lines of Python you're about to pull a decade of play-by-play, full rosters, schedules, team info, plus live ESPN scoreboards, standings, polls and recruiting boards, all as clean polars frames ready to model.

CFB has no single native premium API, so our premium path is two-pronged:

  1. 🗄️ Release loaders (load_cfb_*): pre-built, EPA/WPA-enriched datasets served straight from the cfbfastR-data GitHub release. Fast, reliable, no key needed.
  2. 📡 ESPN families (espn_cfb_*): live scoreboards, team pages, standings, rankings, recruiting and per-play participants.

R user? Every verb here has a twin in cfbfastR. Let's kick off! 🏈

🧰 The toolbox​

Everything returns a tidy polars DataFrame by default; pass return_as_pandas=True for pandas, or (on the espn_cfb_* wrappers) return_parsed=False for the raw JSON. ⭐ marks the premium path.

Function | What it gives you | Source
load_cfb_pbp | Full play-by-play with EPA/WPA, since 2003 | ⭐ release
load_cfb_rosters | Season rosters (bio, position, hometown) | ⭐ release
load_cfb_schedule | Season schedule + results + Elo | ⭐ release
load_cfb_team_info | Team metadata: conference, colors, venue | ⭐ release
load_cfb_betting_lines | Historical betting market lines (spread/total/ML) | ⭐ release
espn_cfb_scoreboard | Live + recent scoreboard for a date/week | ⭐ ESPN
espn_cfb_schedule | ESPN schedule frame for a date/week | ⭐ ESPN
espn_cfb_teams | Every FBS/FCS team (grab team_ids) | ⭐ ESPN
espn_cfb_team_roster | One team's roster | ⭐ ESPN
espn_cfb_team_schedule | One team's schedule | ⭐ ESPN
espn_cfb_standings | Conference / division standings | ⭐ ESPN
espn_cfb_rankings | AP / Coaches / CFP polls | ⭐ ESPN
espn_cfb_leaders | League stat leaders by category | ⭐ ESPN
espn_cfb_recruits | Season recruiting class | ⭐ ESPN
espn_cfb_play_participants | Per-play athletes (passer/rusher/tackler…) | ⭐ ESPN
CFBPlayProcess | Full ESPN PBP pipeline (EPA/WPA + box) | ⭐ ESPN
most_recent_cfb_season | The current season year helper | helper

🔌 Setup

pip install sportsdataverse

No API key required. The load_cfb_* loaders read public parquet from the cfbfastR-data release, and the espn_cfb_* wrappers hit ESPN's public endpoints.

import polars as pl
import sportsdataverse as sdv
from sportsdataverse.cfb import most_recent_cfb_season

SEASON = most_recent_cfb_season()
print('most recent CFB season:', SEASON)

ESPN's live endpoints are seasonal and occasionally rate-limited, so a tiny safe() helper runs the riskier calls defensively: you get the frame when the feed is up, and a friendly one-liner when it isn't (never a scary traceback). The release loaders are reliable, so we call those directly. 🛟

def safe(label, thunk):
    # Run thunk(); return its result, or None (with a one-line note) on any error.
    try:
        out = thunk()
        print(f'✅ {label}')
        return out
    except Exception as e:  # noqa: BLE001 -- demo resilience
        print(f'⏭️ {label}: unavailable right now ({type(e).__name__})')
        return None
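
Rate limits are often transient, so one possible extension of safe() is to retry a few times with exponential backoff before giving up. The sketch below is an illustration only and not part of sportsdataverse: safe_retry, attempts, base_delay and the injectable sleep argument are all hypothetical names, and the flaky() thunk merely simulates a feed that fails twice.

```python
import time


def safe_retry(label, thunk, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Hypothetical variant of safe(): retry with exponential backoff.

    Returns the thunk's result, or None once every attempt has failed.
    """
    for attempt in range(attempts):
        try:
            out = thunk()
            print(f'ok: {label} (attempt {attempt + 1})')
            return out
        except Exception as e:  # noqa: BLE001 -- demo resilience
            print(f'retry: {label} failed with {type(e).__name__}')
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, 4x, ... between attempts.
                sleep(base_delay * 2 ** attempt)
    print(f'skip: {label} unavailable after {attempts} attempts')
    return None


# Toy thunk that fails twice and then succeeds, standing in for a flaky feed.
calls = {'n': 0}


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated rate limit')
    return 'frame'


waits = []  # record the requested delays instead of actually sleeping
result = safe_retry('demo feed', flaky, sleep=waits.append)
```

Passing sleep=waits.append keeps the demo instant; in real use the default time.sleep applies, and the delays should stay small enough to be polite to a public endpoint.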

🗄️ Premium loaders: a whole season in one call

The load_cfb_* family is the fastest way to get clean, complete season data. Each takes a seasons= int or list (≥ 2003) and returns one tidy frame. Let's start with the schedule: one row per game, with final scores, conference flags, and pre/post-game Elo ratings baked in.

Function | Grain | Highlights
load_cfb_schedule | one row / game | scores, Elo, neutral-site & conference flags

schedule = sdv.cfb.load_cfb_schedule(seasons=[2023])
print('schedule shape:', schedule.shape)
schedule.select([
    'game_id', 'week', 'home_team', 'away_team',
    'home_points', 'away_points', 'home_conference', 'neutral_site',
]).head()
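
Because seasons= accepts either a single int or a list, it can help to normalize and range-check the value before a long pull. The helper below is a hypothetical convenience, not a sportsdataverse function; its first_season default of 2003 simply mirrors the floor quoted above.

```python
def normalize_seasons(seasons, first_season=2003):
    """Hypothetical helper: coerce an int or iterable of ints to a sorted list.

    first_season=2003 mirrors the floor quoted for the release data.
    """
    if isinstance(seasons, int):
        seasons = [seasons]
    # De-duplicate and sort so repeated years are only requested once.
    years = sorted(set(int(y) for y in seasons))
    too_early = [y for y in years if y < first_season]
    if too_early:
        raise ValueError(f'seasons before {first_season} are not served: {too_early}')
    return years


print(normalize_seasons(2023))
print(normalize_seasons(range(2021, 2024)))
```

The returned list can be passed straight to seasons= on any of the load_cfb_* calls.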

👥 Premium loaders: rosters

load_cfb_rosters gives you every listed player for a season: name, position, jersey, physicals and hometown. Perfect for joining onto play-by-play or building depth tables.

rosters = sdv.cfb.load_cfb_rosters(seasons=[2023])
print('rosters shape:', rosters.shape)
rosters.select([
    'athlete_id', 'first_name', 'last_name', 'team',
    'position', 'jersey', 'home_state',
]).head()

🏟️ Premium loaders: team info​

load_cfb_team_info carries the reference metadata you'll want to label every chart: school name, conference, classification (FBS/FCS), team colors, and venue.

team_info = sdv.cfb.load_cfb_team_info(seasons=[2023])
print('team_info shape:', team_info.shape)
team_info.select([
    'team_id', 'school', 'conference', 'classification',
    'venue_name', 'city', 'state', 'dome',
]).head()

🎬 Premium loaders: play-by-play with EPA

The crown jewel. load_cfb_pbp returns every play of a season with hundreds of engineered columns: down & distance, win probability, and Expected Points Added (EPA) already computed. (It's a big pull, so we grab a single season and peek.) 📊

# The release serves PBP for whichever seasons are currently published.
# Try a few recent-ish seasons and keep the first one that comes back full,
# so the EPA recipes below always have real plays to chew on.
pbp = pl.DataFrame()
for yr in (2023, 2022, 2021, 2020):
    cand = safe(f'load_cfb_pbp {yr}', lambda yr=yr: sdv.cfb.load_cfb_pbp(seasons=[yr]))
    if cand is not None and cand.width > 0 and cand.height > 0:
        pbp, PBP_SEASON = cand, yr
        break
else:
    # for/else: reached only when no season produced a non-empty frame
    PBP_SEASON = None

print('pbp season:', PBP_SEASON, '| pbp shape:', pbp.shape)
cols = ['game_id', 'start.pos_team.name', 'down', 'distance',
        'play_type', 'EPA', 'wpa']
have = [c for c in cols if c in pbp.columns]
pbp.select(have).head() if have else 'pbp not published for these seasons right now'

📡 Live from ESPN: the scoreboard

When you need today's slate (or a specific date), the ESPN wrappers shine. espn_cfb_scoreboard takes a dates=YYYYMMDD (or season year) and returns the games on the board. We wrap it in safe() since live endpoints can be quiet in the offseason.

board = safe(
    'ESPN scoreboard',
    lambda: sdv.cfb.espn_cfb_scoreboard(dates=20231125),  # rivalry Saturday
)
if board is not None and getattr(board, 'height', 0):
    keep = [c for c in board.columns
            if c in ('game_id', 'name', 'short_name', 'status_type_description',
                     'home_team_abbreviation', 'away_team_abbreviation')]
    out = board.select(keep).head() if keep else board.head()
else:
    out = 'no games on the board for that date'
out
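
Since dates= is written as a YYYYMMDD value, a small standard-library helper can build those values for every Saturday in a window. This is a sketch under assumptions: espn_date and saturdays are hypothetical names, and it only constructs the integers (in the same form as the 20231125 literal above) without calling ESPN.

```python
from datetime import date, timedelta


def espn_date(d):
    """Format a date as a YYYYMMDD integer, the form passed to dates= above."""
    return int(d.strftime('%Y%m%d'))


def saturdays(start, end):
    """Yield every Saturday from start to end, inclusive."""
    # date.weekday(): Monday is 0, so Saturday is 5.
    d = start + timedelta(days=(5 - start.weekday()) % 7)
    while d <= end:
        yield d
        d += timedelta(days=7)


november = [espn_date(d) for d in saturdays(date(2023, 11, 1), date(2023, 11, 30))]
print(november)  # [20231104, 20231111, 20231118, 20231125]
```

Each value could then be passed through safe() to espn_cfb_scoreboard in a loop, ideally with a short pause between calls.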

🏫 Live from ESPN: teams (and their team_ids)​

espn_cfb_teams lists every team in a division (groups=80 FBS, groups=81 FCS). The team_id column is the key you feed into every team-scoped ESPN call below.

teams = safe('ESPN teams', sdv.cfb.espn_cfb_teams)
if teams is not None and teams.height:
    cols = [c for c in ('team_id', 'team_location', 'team_name',
                        'team_abbreviation') if c in teams.columns]
    out = teams.select(cols).head(8)
else:
    out = 'teams unavailable right now'
out

🍳 Cookbook: common CFB tasks​

Now the fun part: real questions, answered with a few expressions. The loaders are reliable so these recipes lean on them, reaching for ESPN where it adds something live.

Recipe 1: Highest-scoring games of the season 🔥

Straight from the loaded schedule: add the two scores and sort. No casting needed, because the release frame already stores points as integers.

(schedule
 .with_columns(
     (pl.col('home_points') + pl.col('away_points')).alias('total_points')
 )
 .sort('total_points', descending=True)
 .select(['week', 'home_team', 'away_team',
          'home_points', 'away_points', 'total_points'])
 .head(10))

Recipe 2: Team offensive EPA/play leaderboard 📈

This is what premium EPA-tagged play-by-play unlocks. Filter to real scrimmage plays, group by the offense, average the EPA per play, and keep offenses with at least 500 plays: a clean efficiency ranking in a few lines.

team_col = 'start.pos_team.name'  # human-readable offense on each play
epa_cols = {team_col, 'EPA', 'play'}
if epa_cols.issubset(pbp.columns):
    leaderboard = (
        pbp
        .filter(pl.col('play') & pl.col('EPA').is_not_null())
        .group_by(team_col)
        .agg(
            pl.len().alias('plays'),
            pl.col('EPA').mean().round(3).alias('epa_per_play'),
        )
        .filter(pl.col('plays') >= 500)
        .sort('epa_per_play', descending=True)
        .rename({team_col: 'offense'})
        .head(15)
    )
    out = leaderboard
else:
    out = 'expected EPA columns not present in this pbp build'
out
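
For intuition about what the EPA column holds, a play's EPA is the change in expected points from before the snap to after the play, and EPA per play is the mean of those changes. The dependency-free sketch below uses made-up expected-points values and hypothetical team names; real models treat scores and turnovers with more care than this simple subtraction.

```python
from collections import defaultdict

# Made-up plays: (offense, expected points before the snap, expected points after).
plays = [
    ('Team A', 1.0, 2.5),
    ('Team A', 2.5, 2.0),
    ('Team B', 0.5, 0.0),
    ('Team B', 0.0, 1.0),
]

epa_by_team = defaultdict(list)
for offense, ep_before, ep_after in plays:
    # EPA for one play is the change in expected points it produced.
    epa_by_team[offense].append(ep_after - ep_before)

epa_per_play = {
    team: round(sum(values) / len(values), 3)
    for team, values in epa_by_team.items()
}
print(epa_per_play)  # {'Team A': 0.5, 'Team B': 0.25}
```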

Recipe 3: A team's roster depth by position 🧩

Filter the loaded roster to one school by its team name, then count the depth at each position.

team_name = 'Michigan'
squad = (
    rosters
    .filter(pl.col('team') == team_name)
    .select(['first_name', 'last_name', 'position', 'jersey',
             'height', 'weight', 'home_state'])
)
if squad.height:
    depth = (squad.group_by('position')
             .agg(pl.len().alias('players'))
             .sort('players', descending=True))
    print(f'{team_name}: {squad.height} players')
    out = depth.head(10)
else:
    out = f'no roster rows for {team_name} (try another school string)'
out

Recipe 4: Who was on the field? Per-play participants 🕵️

espn_cfb_play_participants resolves the athletes involved in each play (passer, rusher, receiver, tackler…) straight from ESPN's authoritative participants[] array, which is far more reliable than regex-parsing the play text. Set resolve_missing=False to skip the per-athlete $ref fan-out and keep it snappy.

gid = 401628334  # 2024 CFP National Championship
participants = safe(
    f'play participants {gid}',
    lambda: sdv.cfb.espn_cfb_play_participants(
        game_id=gid, resolve_missing=False,
    ),
)
if participants is not None and getattr(participants, 'height', 0):
    name_cols = [c for c in participants.columns if c.endswith('_player_name')]
    show = (['play_id'] + name_cols[:4]
            if 'play_id' in participants.columns else name_cols[:5])
    out = participants.select([c for c in show if c in participants.columns]).head()
else:
    out = 'participants feed quiet right now (offseason / rate limit)'
out

Recipe 5: Build a standings table from the schedule 🏆

No standings endpoint needed: stack each team's home and away results, count wins and losses, and you've got a win-percentage table for any season the loader serves.

completed = schedule.filter(pl.col('completed') == True)  # noqa: E712
home = completed.select(
    pl.col('home_team').alias('team'),
    (pl.col('home_points') > pl.col('away_points')).alias('win'),
)
away = completed.select(
    pl.col('away_team').alias('team'),
    (pl.col('away_points') > pl.col('home_points')).alias('win'),
)
standings_tbl = (
    pl.concat([home, away])
    .group_by('team')
    .agg(
        pl.col('win').sum().alias('wins'),
        (~pl.col('win')).sum().alias('losses'),
    )
    .with_columns(
        (pl.col('wins') / (pl.col('wins') + pl.col('losses')))
        .round(3).alias('win_pct')
    )
    .sort(['wins', 'win_pct'], descending=True)
)
standings_tbl.head(10)
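
To sanity-check the stacking logic without any data pull, the same idea can be traced by hand on three made-up games. This standard-library sketch uses hypothetical team names A, B and C; each game credits one win and one loss, just as the home and away frames do once concatenated.

```python
from collections import Counter

# Made-up results: (home_team, home_points, away_team, away_points).
games = [
    ('A', 31, 'B', 17),
    ('B', 24, 'C', 21),
    ('C', 10, 'A', 35),
]

wins, losses = Counter(), Counter()
for home, home_pts, away, away_pts in games:
    # One game yields a result for each side, like stacking home and away rows.
    # The toy scores contain no ties, so a strict comparison is enough here.
    winner, loser = (home, away) if home_pts > away_pts else (away, home)
    wins[winner] += 1
    losses[loser] += 1

# Sort by wins (descending), then team name, as (team, wins, losses) rows.
table = sorted(
    ((team, wins[team], losses[team]) for team in set(wins) | set(losses)),
    key=lambda row: (-row[1], row[0]),
)
print(table)  # [('A', 2, 0), ('B', 1, 1), ('C', 0, 2)]
```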

Recipe 6: End-of-season Elo power ratings ⚡

Every schedule row ships pre- and post-game Elo ratings. Grab each team's most recent post-game Elo (sort by week descending, take the first) for a tidy, ready-to-rank power table, with no model to fit.

elo = (
    pl.concat([
        schedule.select(
            pl.col('home_team').alias('team'),
            pl.col('week'),
            pl.col('home_postgame_elo').alias('elo'),
        ),
        schedule.select(
            pl.col('away_team').alias('team'),
            pl.col('week'),
            pl.col('away_postgame_elo').alias('elo'),
        ),
    ])
    .filter(pl.col('elo').is_not_null())
    .sort('week', descending=True)
    .group_by('team', maintain_order=True)
    .agg(pl.first('elo').alias('final_elo'))
    .sort('final_elo', descending=True)
)
elo.head(15)
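
Elo ratings are easier to read as matchup odds. The sketch below applies the conventional Elo expectation formula with a 400-point scale; that these schedule ratings follow the conventional scale is an assumption, not something the release documents here, and elo_win_prob is a hypothetical helper that ignores home-field advantage.

```python
def elo_win_prob(elo_a, elo_b, scale=400.0):
    """Conventional Elo expectation that A beats B (no home-field adjustment)."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


print(round(elo_win_prob(1800, 1800), 3))  # 0.5
print(round(elo_win_prob(1900, 1500), 3))  # 0.909
```

Under that assumption, a 400-point gap corresponds to roughly 10-to-1 odds for the higher-rated team.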

Recipe 7: One team's full game log 📜

Filter the schedule to a single program, then flip the home/away columns so every row reads from that team's perspective: opponent, points for, points against, and the margin. Swap team to scout anyone.

team = 'Michigan'
gamelog = (
    schedule
    .filter((pl.col('home_team') == team) | (pl.col('away_team') == team))
    .unique(subset=['game_id'])
    .with_columns(
        pl.when(pl.col('home_team') == team)
        .then(pl.col('away_team')).otherwise(pl.col('home_team'))
        .alias('opponent'),
        pl.when(pl.col('home_team') == team)
        .then(pl.col('home_points')).otherwise(pl.col('away_points'))
        .alias('pts_for'),
        pl.when(pl.col('home_team') == team)
        .then(pl.col('away_points')).otherwise(pl.col('home_points'))
        .alias('pts_against'),
    )
    .with_columns(
        (pl.col('pts_for') - pl.col('pts_against')).alias('margin')
    )
    .select(['week', 'opponent', 'pts_for', 'pts_against', 'margin',
             'neutral_site'])
    .sort('week')
)
gamelog.head(16) if gamelog.height else f'no games found for {team}'

Recipe 8: Rushing leaders, EPA included 🏃

Premium play-by-play means leaderboards aren't just totals; they carry efficiency. Filter to designed runs, sum the yards, and average the EPA per carry to separate the bell-cows from the truly explosive backs.

rush_cols = {'rush', 'rusher_player_name', 'statYardage', 'EPA'}
if rush_cols.issubset(pbp.columns):
    rushers = (
        pbp
        .filter((pl.col('rush') == True)  # noqa: E712
                & pl.col('rusher_player_name').is_not_null())
        .group_by('rusher_player_name')
        .agg(
            pl.len().alias('carries'),
            pl.col('statYardage').sum().alias('rush_yds'),
            pl.col('EPA').mean().round(3).alias('epa_per_rush'),
        )
        .filter(pl.col('carries') >= 100)
        .sort('rush_yds', descending=True)
        .head(15)
    )
    out = rushers
else:
    out = 'rushing columns not present in this pbp build'
out

Recipe 9: The most thrilling games of the year 🎢

cfbfastR's schedule ships an excitement_index (a win-probability swinginess score). Sort it descending and you've ranked the season's white-knuckle finishes in one line.

thrillers = (
    schedule
    .filter(pl.col('excitement_index').is_not_null())
    .sort('excitement_index', descending=True)
    .select(['week', 'home_team', 'away_team',
             'home_points', 'away_points', 'excitement_index'])
    .head(10)
)
thrillers

Recipe 10: Where does the talent come from? 🗺️

Roll the season roster up by home_state to map the recruiting footprint of college football: a quick reminder of just how much of the sport flows out of a handful of states.

talent_map = (
    rosters
    .filter(pl.col('home_state').is_not_null())
    .group_by('home_state')
    .agg(pl.len().alias('players'))
    .sort('players', descending=True)
    .head(15)
)
talent_map

Recipe 11: Conference vs. non-conference, by margin 🔀

The schedule's conference_game flag lets you split the slate. Restrict to completed games with an FBS home team, then compare the average final margin in league play versus the out-of-conference cupcakes; conference games tend to be tighter.

fbs = schedule.filter(
    (pl.col('home_division') == 'fbs') & (pl.col('completed') == True)  # noqa: E712
)
splits = (
    fbs
    .with_columns(
        (pl.col('home_points') - pl.col('away_points')).abs().alias('margin')
    )
    .group_by('conference_game')
    .agg(
        pl.len().alias('games'),
        pl.col('margin').mean().round(1).alias('avg_margin'),
        pl.col('home_points').add(pl.col('away_points'))
        .mean().round(1).alias('avg_total_points'),
    )
    .sort('conference_game')
)
splits

Recipe 12: Biggest betting favorites of the latest season 💸

load_cfb_betting_lines is a premium release frame of historical sportsbook lines. Pick the most recent season in the frame, average the spread across books per game and team, and sort to surface the most lopsided favorites: the mismatches Vegas saw coming a mile away.

lines = safe('load_cfb_betting_lines', sdv.cfb.load_cfb_betting_lines)
if lines is not None and {'season', 'market_type', 'lines',
                          'game_desc', 'abbr'}.issubset(lines.columns):
    # Most recent season present in the frame
    target = sorted(lines['season'].drop_nulls().unique().to_list())[-1]
    favorites = (
        lines
        .filter((pl.col('season') == target)
                & (pl.col('market_type') == 'spread')
                & pl.col('lines').is_not_null())
        .group_by(['game_desc', 'abbr'])
        .agg(pl.col('lines').mean().round(1).alias('avg_spread'))
        .filter(pl.col('avg_spread') < 0)  # negative spread = favorite
        .sort('avg_spread')
        .head(10)
    )
    print(f'biggest favorites, {int(target)} season:')
    out = favorites
else:
    out = 'betting-lines frame unavailable right now'
out
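
The same frame is described as carrying moneyline (ML) markets alongside spreads. If those are stored as American odds, which is an assumption this page does not confirm, they convert to implied win probabilities with the standard formula sketched below; implied_prob is a hypothetical helper, and the result still includes the bookmaker's margin.

```python
def implied_prob(american_odds):
    """Convert American moneyline odds to an implied win probability (vig included)."""
    if american_odds < 0:
        # Favorite: risk |odds| to win 100.
        return -american_odds / (-american_odds + 100)
    # Underdog: risk 100 to win odds.
    return 100 / (american_odds + 100)


print(round(implied_prob(-150), 3))  # 0.6
print(round(implied_prob(+150), 3))  # 0.4
```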

Recipe 13: Hand it to pandas 🐼

Every loader takes return_as_pandas=True, and any polars frame converts with .to_pandas(). Once it's a pandas DataFrame the whole pandas/numpy/scikit-learn world opens up; here, a one-call .describe() of scoring across the season.

score_pd = (
    schedule
    .select(['home_points', 'away_points'])
    .to_pandas()
)
score_pd['total_points'] = score_pd['home_points'] + score_pd['away_points']
print(type(score_pd).__module__)
score_pd.describe().round(1)

🗞️ Live tour: standings, polls, leaders & recruits

A quick lap through the rest of the live ESPN surface. Each is wrapped in safe() so the page renders cleanly whatever the feed is doing today.

Function | Use it for
espn_cfb_standings | conference / division standings
espn_cfb_rankings | AP / Coaches / CFP polls
espn_cfb_leaders | league stat leaders by category
espn_cfb_recruits | a season's recruiting class

standings = safe('ESPN standings', sdv.cfb.espn_cfb_standings)
rankings = safe('ESPN rankings (polls)', sdv.cfb.espn_cfb_rankings)
(standings.head()
 if standings is not None and getattr(standings, 'height', 0)
 else (rankings.head()
       if rankings is not None and getattr(rankings, 'height', 0)
       else 'standings & rankings unavailable right now'))

leaders = safe(
    'ESPN passing leaders',
    lambda: sdv.cfb.espn_cfb_leaders(category='passingYards', season=2023, limit=15),
)
recruits = safe(
    'ESPN recruiting class',
    lambda: sdv.cfb.espn_cfb_recruits(season=2024, limit=25),
)
(leaders.head()
 if leaders is not None and getattr(leaders, 'height', 0)
 else (recruits.head()
       if recruits is not None and getattr(recruits, 'height', 0)
       else 'leaders & recruits unavailable right now'))

🧪 Bonus: process one game from scratch with CFBPlayProcess

Want EPA/WPA on a single live game without loading a whole season? CFBPlayProcess drives the full ESPN pipeline: .espn_cfb_pbp() fetches the raw summary, then .run_processing_pipeline() returns a dict whose plays key is the fully-featured play list (alongside an advanced box score and metadata).

from sportsdataverse.cfb import CFBPlayProcess

def process_game(game_id):
    # Fetch the raw ESPN summary, run the pipeline, and frame the processed plays.
    game = CFBPlayProcess(gameId=game_id)
    game.espn_cfb_pbp()
    processed = game.run_processing_pipeline()
    return pl.DataFrame(processed['plays'], infer_schema_length=None)

plays = safe('CFBPlayProcess 401628334', lambda: process_game(401628334))
if plays is not None and plays.height:
    cols = [c for c in ('period', 'pos_team', 'down', 'distance',
                        'play_type', 'EPA') if c in plays.columns]
    out = plays.select(cols).head()
else:
    out = 'live PBP pipeline quiet right now'
out

🎉 Where to next

  • 🗄️ Loaders are your premium fast-path: full reference on the Loaders page (load_cfb_pbp, load_cfb_rosters, load_cfb_schedule, load_cfb_team_info).
  • 📡 ESPN families live across the Site, Web, Core and Additional reference pages.
  • 🐼 Pass return_as_pandas=True for pandas, or return_parsed=False on the espn_cfb_* wrappers for the raw JSON.
  • 🟥 R user? The same verbs live in cfbfastR.
  • Part of the SportsDataverse ecosystem.

Now go chart some chunk plays, and may your EPA always be positive! 📈🏈