⚾ Baseball with sportsdataverse-py
Welcome to the ballpark! In just a few lines of Python you're about to pull official MLB data (schedules, standings, rosters, box scores, play-by-play) straight from the league's own MLB Stats API, plus pitch-level Statcast tracking from Baseball Savant. Every premium call hands you back a tidy polars DataFrame (or raw JSON when you want it), ready to model.
If you've used the R package baseballr, or Python's pybaseball, the data shapes will feel right at home. Let's play ball! ⚾
🧰 The toolbox
We lead with the premium sources: the MLB Stats API (mlb_*,
backed by statsapi.mlb.com) and the comprehensive Statcast surface
(mlb_statcast_*, from Baseball
Savant). ESPN (espn_mlb_*) is a handy secondary path. Click any name for the
full reference:
| Function | What it gives you | Source |
|---|---|---|
| mlb_schedule · parse_mlb_api_schedule | Games for a date / range, one row per game (with game_pk) | 🟢 MLB Stats API |
| mlb_teams · parse_mlb_api_teams | Every club, one row per team | 🟢 MLB Stats API |
| mlb_standings · parse_mlb_api_standings | Division standings: wins, losses, run diff | 🟢 MLB Stats API |
| mlb_team_roster | A team's roster, one row per player | 🟢 MLB Stats API |
| mlb_person | A player's bio (one tidy row) | 🟢 MLB Stats API |
| mlb_person_stats · parse_mlb_api_person_stats | A player's season stat splits | 🟢 MLB Stats API |
| mlb_boxscore | Full game box score | 🟢 MLB Stats API |
| mlb_play_by_play | Plate-appearance-level play-by-play | 🟢 MLB Stats API |
| mlb_stats_leaders | League leaders for any stat (HR, AVG, ERA, …) | 🟢 MLB Stats API |
| mlb_win_probability | Per-play win probability + WPA for a game | 🟢 MLB Stats API |
| mlb_awards · mlb_award_recipients | Award catalog + season winners (MVP, Cy Young, …) | 🟢 MLB Stats API |
| mlb_draft | Amateur draft board, one row per pick | 🟢 MLB Stats API |
| mlb_statcast_search | Every pitch matching a filter: ~110 cols/pitch; auto date-chunks past the 25k cap; friendly filters (batters_lookup, pitch_type, at_bat_result, …) | 🔵 Statcast |
| mlb_statcast_search_minors · mlb_statcast_search_wbc | Same pitch search for MiLB and the World Baseball Classic | 🔵 Statcast |
| mlb_statcast_leaderboard_* (37 of them), e.g. …_sprint_speed, …_expected_stats, …_bat_tracking, …_outs_above_average | Every Savant leaderboard: expected stats, sprint speed, bat tracking, pitch arsenals/movement/tempo, OAA, arm strength, catcher framing/blocking/throwing, baserunning, park factors, … | 🔵 Statcast |
| mlb_statcast_gamefeed | Savant single-game feed, one tidy row per pitch | 🔵 Statcast |
| mlb_statcast_player | A player's Savant page metrics | 🔵 Statcast |
| espn_mlb_teams · espn_mlb_schedule | ESPN teams / schedule (wide frames) | ⚪ ESPN |
| most_recent_mlb_season | Current season helper | ⚪ helper |
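The table notes that mlb_statcast_search "auto date-chunks past the 25k cap". This page doesn't show how the library does that, so here is only a minimal sketch of the general idea: split a long date range into short consecutive windows and query each one separately. The helper name chunk_date_range and the 5-day window are hypothetical choices for illustration, not the library's actual settings.

```python
from datetime import date, timedelta


def chunk_date_range(start, end, max_days=5):
    """Split [start, end] (inclusive) into consecutive windows of at most max_days days.

    Hypothetical helper: the window length is an assumption, not a library default.
    """
    if end < start:
        raise ValueError("end must not be before start")
    windows = []
    cursor = start
    while cursor <= end:
        # Each window covers max_days calendar days, clipped to the overall end date.
        stop = min(cursor + timedelta(days=max_days - 1), end)
        windows.append((cursor, stop))
        cursor = stop + timedelta(days=1)
    return windows


windows = chunk_date_range(date(2024, 7, 1), date(2024, 7, 12), max_days=5)
for first_day, last_day in windows:
    print(first_day.isoformat(), "->", last_day.isoformat())
```

Each window could then be sent as its own search and the results stacked, which keeps any single request under a row cap.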
Setup
pip install sportsdataverse
No API key needed for any of the premium MLB endpoints: the MLB Stats API and Baseball Savant are both public.
import polars as pl
import sportsdataverse.mlb as mlb
pl.Config.set_tbl_rows(12)
print("most recent MLB season:", mlb.most_recent_mlb_season())
most recent MLB season: 2026
The MLB Stats API and Savant are public and reliable, but they're still
live network calls: a date with no games, an offseason day, or a blip can
make a call come back empty or fail outright. So we use a tiny safe() helper:
you get the frame when the feed is up, and a friendly one-liner when it isn't
(never a scary traceback).
We also pick a stable completed-season date for our examples so the page renders the same in June as in October.
def safe(label, thunk):
    """Run a live call defensively: return its result, or print a one-liner."""
    try:
        out = thunk()
        print(f"✅ {label}")
        return out
    except Exception as e:  # noqa: BLE001 -- demo resilience
        print(f"⏭️ {label}: unavailable right now ({type(e).__name__})")
        return None
# A known completed regular-season slate, stable for the docs build.
SAMPLE_SEASON = 2024
SAMPLE_DATE = "2024-07-01"  # YYYY-MM-DD for the Stats API
JUDGE_ID = 592450           # Aaron Judge, NYY: our running example player
YANKEES_ID = 147            # New York Yankees team_id
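To see both branches of the helper without touching the network, here is a small offline sketch. It repeats safe() with plain-text markers so the snippet runs on its own, and flaky_feed is a hypothetical stand-in for a live call that fails.

```python
def safe(label, thunk):
    """Standalone copy of the helper above, with plain-text markers."""
    try:
        out = thunk()
        print(f"[ok] {label}")
        return out
    except Exception as e:  # demo resilience: swallow any failure
        print(f"[skip] {label}: unavailable right now ({type(e).__name__})")
        return None


def flaky_feed():
    # Hypothetical stand-in for a live endpoint that is down.
    raise ConnectionError("simulated network blip")


ok = safe("working feed", lambda: [1, 2, 3])  # thunk succeeds: result is returned
missing = safe("flaky feed", flaky_feed)      # thunk raises: None is returned
```

Because a failed call yields None, every later cell checks `is not None` before using the frame.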
The schedule (MLB Stats API)
mlb_schedule returns the raw JSON dict; its partner parse_mlb_api_schedule
flattens it to one row per game. The most important column is game_pk:
that's the id you feed to the box score and play-by-play endpoints. Filter
with a single date=, a start_date/end_date range, team_id, or season.
schedule = safe(
    "schedule",
    lambda: mlb.parse_mlb_api_schedule(mlb.mlb_schedule(date=SAMPLE_DATE)),
)
cols = ["game_pk", "status_detailed_state",
        "teams_away_team_name", "teams_away_score",
        "teams_home_team_name", "teams_home_score"]
(schedule.select([c for c in cols if c in schedule.columns]).head()
 if schedule is not None else "schedule unavailable right now")
✅ schedule
shape: (3, 6)
┌─────────┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬────────────────┐
│ game_pk ┆ status_detailed ┆ teams_away_team ┆ teams_away_scor ┆ teams_home_team ┆ teams_home_sco │
│ ---     ┆ _state          ┆ _name           ┆ e               ┆ _name           ┆ re             │
│ i64     ┆ ---             ┆ ---             ┆ ---             ┆ ---             ┆ ---            │
│         ┆ str             ┆ str             ┆ i64             ┆ str             ┆ i64            │
╞═════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪════════════════╡
│ 744914  ┆ Final           ┆ Houston Astros  ┆ 3               ┆ Toronto Blue    ┆ 1              │
│         ┆                 ┆                 ┆                 ┆ Jays            ┆                │
│ 744840  ┆ Final           ┆ New York Mets   ┆ 9               ┆ Washington      ┆ 7              │
│         ┆                 ┆                 ┆                 ┆ Nationals       ┆                │
│ 746535  ┆ Final           ┆ Milwaukee       ┆ 7               ┆ Colorado        ┆ 8              │
│         ┆                 ┆ Brewers         ┆                 ┆ Rockies         ┆                │
└─────────┴─────────────────┴─────────────────┴─────────────────┴─────────────────┴────────────────┘
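To make "flattens it to one row per game" concrete, here is a rough sketch of the idea using only the standard library. It is not the library's parser: to_snake and flatten are hypothetical helpers, and the payload is a hand-trimmed dict that assumes the nested dates → games → teams layout, filled with the first game from the output above.

```python
import re


def to_snake(name):
    """camelCase -> snake_case, e.g. detailedState -> detailed_state."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining key paths with underscores."""
    row = {}
    for key, value in obj.items():
        col = f"{prefix}_{to_snake(key)}" if prefix else to_snake(key)
        if isinstance(value, dict):
            row.update(flatten(value, col))
        else:
            row[col] = value
    return row


# Hand-trimmed payload (assumed shape), not a real API response.
payload = {
    "dates": [
        {
            "date": "2024-07-01",
            "games": [
                {
                    "gamePk": 744914,
                    "status": {"detailedState": "Final"},
                    "teams": {
                        "away": {"team": {"name": "Houston Astros"}, "score": 3},
                        "home": {"team": {"name": "Toronto Blue Jays"}, "score": 1},
                    },
                }
            ],
        }
    ]
}

# One flat dict per game, across every date in the payload.
rows = [flatten(game) for day in payload["dates"] for game in day["games"]]
print(rows[0])
```

The resulting keys line up with the column names in the frame above (game_pk, status_detailed_state, teams_away_team_name, and so on), which is why the nested path is visible in each name.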
Standings (MLB Stats API)
mlb_standings covers both leagues by default (league_id="103,104").
parse_mlb_api_standings returns one row per team with wins/losses, division
rank, and winning percentage.
standings = safe(
    "standings",
    lambda: mlb.parse_mlb_api_standings(mlb.mlb_standings(season=SAMPLE_SEASON)),
)
keep = ["team_name", "standings_division_name", "wins", "losses",
        "winning_percentage", "division_rank"]
(standings.select([c for c in keep if c in standings.columns])
 .sort("wins", descending=True).head(10)
 if standings is not None else "standings unavailable right now")
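If you want a feel for how the wins, losses, winning_percentage, and division_rank columns relate, here is a tiny offline sketch. The three clubs and their records are made up for illustration, and the ranking ignores tiebreakers, so treat it as arithmetic practice rather than how the API computes its fields.

```python
# Hypothetical records for one division (not real standings).
records = [
    {"team_name": "Club A", "wins": 90, "losses": 72},
    {"team_name": "Club B", "wins": 81, "losses": 81},
    {"team_name": "Club C", "wins": 95, "losses": 67},
]

# Winning percentage is wins divided by games decided.
for r in records:
    r["winning_percentage"] = round(r["wins"] / (r["wins"] + r["losses"]), 3)

# Rank from best to worst percentage (no tiebreaker rules applied).
ranked = sorted(records, key=lambda r: r["winning_percentage"], reverse=True)
for rank, r in enumerate(ranked, start=1):
    r["division_rank"] = rank

for r in ranked:
    print(r["division_rank"], r["team_name"], r["winning_percentage"])
```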