Intro to Statcast Data
This post is a short introduction to collecting Statcast data.
Focus on the Task
When I first started collecting sports data, I built a number of homemade packages in R and Python, which was a great learning experience: my package development, web scraping, and API interaction skills all leveled up. I did, however, spend a lot more time collecting the data than actually analyzing it. So, if data analysis is your end game, you can save a lot of time (and headaches!) by using existing packages. Beware the distraction of building your own bespoke solution. It probably will not do as much or work as well as what already exists (unless, of course, building it is your goal!).
For now, we are going to focus on a few existing packages that I find useful so we can get to the analysis. If you are interested in learning how to do it yourself, looking at the guts of these packages is also a great starting point!
Python Tools
pybaseball is a great package for collecting baseball data. Along with Statcast data, pybaseball can collect data from FanGraphs and Baseball Reference.
Check out more info here: https://github.com/jldbc/pybaseball
pip install pybaseball
To get pitching data for a particular player, use playerid_lookup to find the player's MLBAM ID, then pass that ID to statcast_pitcher.
# Only these two pybaseball helpers are used below; both return pandas DataFrames
from pybaseball import playerid_lookup, statcast_pitcher
player_id = playerid_lookup('Chapman', 'Aroldis')
## Gathering player lookup table. This may take a moment.
player_id
## name_last name_first ... mlb_played_first mlb_played_last
## 0 chapman aroldis ... 2010.0 2021.0
##
## [1 rows x 8 columns]
pid = player_id.key_mlbam[0]
print(pid)
## 547973
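Taking row 0 works here because the lookup returned exactly one player. A lookup on a more common name may return several rows, so a small guard can help. The sketch below is only an illustration: first_mlbam_id is a hypothetical helper (not part of pybaseball), and the toy table stands in for a real playerid_lookup result so the example runs without a network call.

```python
import pandas as pd


def first_mlbam_id(lookup: pd.DataFrame) -> int:
    """Return the MLBAM ID from a playerid_lookup-style table.

    Hypothetical helper: raises instead of silently picking row 0
    when the lookup matched zero players or more than one.
    """
    if len(lookup) != 1:
        raise ValueError(f"expected exactly one matching player, found {len(lookup)}")
    return int(lookup["key_mlbam"].iloc[0])


# Toy stand-in for playerid_lookup('Chapman', 'Aroldis'), reduced to three columns
toy_lookup = pd.DataFrame(
    {"name_last": ["chapman"], "name_first": ["aroldis"], "key_mlbam": [547973]}
)
pid = first_mlbam_id(toy_lookup)
```

With a real lookup table you would pass the DataFrame returned by playerid_lookup instead of toy_lookup.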
# Arguments: start date, end date, MLBAM player ID
chapman_stats = statcast_pitcher('2015-03-01', '2021-10-01', pid)
## Gathering Player Data
chapman_stats.columns
## Index(['pitch_type', 'game_date', 'release_speed', 'release_pos_x',
## 'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
## 'description', 'spin_dir', 'spin_rate_deprecated',
## 'break_angle_deprecated', 'break_length_deprecated', 'zone', 'des',
## 'game_type', 'stand', 'p_throws', 'home_team', 'away_team', 'type',
## 'hit_location', 'bb_type', 'balls', 'strikes', 'game_year', 'pfx_x',
## 'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b',
## 'outs_when_up', 'inning', 'inning_topbot', 'hc_x', 'hc_y',
## 'tfs_deprecated', 'tfs_zulu_deprecated', 'fielder_2', 'umpire', 'sv_id',
## 'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot',
## 'hit_distance_sc', 'launch_speed', 'launch_angle', 'effective_speed',
## 'release_spin_rate', 'release_extension', 'game_pk', 'pitcher.1',
## 'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6',
## 'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',
## 'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle',
## 'woba_value', 'woba_denom', 'babip_value', 'iso_value',
## 'launch_speed_angle', 'at_bat_number', 'pitch_number', 'pitch_name',
## 'home_score', 'away_score', 'bat_score', 'fld_score', 'post_away_score',
## 'post_home_score', 'post_bat_score', 'post_fld_score',
## 'if_fielding_alignment', 'of_fielding_alignment', 'spin_axis',
## 'delta_home_win_exp', 'delta_run_exp'],
## dtype='object')
Save the Data
I have found that an incredibly important step when collecting data is to save an untransformed version before conducting any analysis. It is much faster to work with locally stored data than to recollect every time you want to test your ETL pipeline.
So, our last step here is to save the data.
# index=False avoids writing the row index as an extra unnamed column
chapman_stats.to_csv("./chapman_20150301-20211001.csv", index=False)
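One way to make the "collect once, then work locally" habit automatic is a small cache wrapper. This is a minimal sketch rather than anything pybaseball provides: load_or_fetch is a hypothetical helper that reads the CSV if it already exists and otherwise calls whatever fetch function you hand it, saving the raw result (without the row index) before returning.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_fetch(path, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the data stored at `path`, collecting and saving it first if needed."""
    path = Path(path)
    if path.exists():
        # Reuse the local, untransformed copy
        return pd.read_csv(path)
    data = fetch()
    # Save the raw pull before any analysis touches it
    data.to_csv(path, index=False)
    return data


# Example usage (needs pybaseball and a network connection on the first call):
# chapman_stats = load_or_fetch(
#     "./chapman_20150301-20211001.csv",
#     lambda: statcast_pitcher('2015-03-01', '2021-10-01', pid),
# )
```

Note that data read back from CSV may have different dtypes than the original pull (dates come back as strings, for example), so the cached and freshly fetched frames are not guaranteed to be identical.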
Next Steps
In the next article, we will explore working with Statcast Data.