Collect Statcast Pitching Data

Posted by Nick Paul on Monday, September 20, 2021

TOC

  1. Intro
  2. Advice
  3. Collect Pitcher Data
  4. Next Steps

Intro to Statcast Data

This post is a short introduction to collecting Statcast data.

Focus on the Task

When I first started collecting sports data, I built a number of homemade packages in R and Python, which was a great learning experience: my package development, web scraping, and API interaction skills all leveled up. There was, however, a lot more time dedicated to collecting the data than to actually analyzing it. So, if data analysis is your end game, you can save a lot of time (and headaches!) by using existing packages. Beware the distraction of building your own bespoke solution; it probably will not do as much or work as well as what already exists (unless, of course, building it is your goal!).

For now, we are going to focus on a few existing packages that I find useful so we can get to the analysis. If you are interested in learning how to do it yourself, looking at the guts of these packages is also a great starting point!

Python Tools

pybaseball is a great package for collecting baseball data. Along with Statcast data, pybaseball can collect data from FanGraphs and Baseball Reference.

Check out more info here: https://github.com/jldbc/pybaseball

pip install pybaseball

To get pitching data for a particular player, use playerid_lookup and then statcast_pitcher.

from pybaseball import playerid_lookup, statcast_pitcher

player_id = playerid_lookup('Chapman', 'Aroldis')
## Gathering player lookup table. This may take a moment.
player_id
##   name_last name_first  ...  mlb_played_first mlb_played_last
## 0   chapman    aroldis  ...            2010.0          2021.0
## 
## [1 rows x 8 columns]
pid = player_id.key_mlbam[0]
print(pid)
## 547973
chapman_stats = statcast_pitcher('2015-03-01', '2021-10-01', pid)
## Gathering Player Data
chapman_stats.columns
## Index(['pitch_type', 'game_date', 'release_speed', 'release_pos_x',
##        'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',
##        'description', 'spin_dir', 'spin_rate_deprecated',
##        'break_angle_deprecated', 'break_length_deprecated', 'zone', 'des',
##        'game_type', 'stand', 'p_throws', 'home_team', 'away_team', 'type',
##        'hit_location', 'bb_type', 'balls', 'strikes', 'game_year', 'pfx_x',
##        'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b',
##        'outs_when_up', 'inning', 'inning_topbot', 'hc_x', 'hc_y',
##        'tfs_deprecated', 'tfs_zulu_deprecated', 'fielder_2', 'umpire', 'sv_id',
##        'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 'sz_bot',
##        'hit_distance_sc', 'launch_speed', 'launch_angle', 'effective_speed',
##        'release_spin_rate', 'release_extension', 'game_pk', 'pitcher.1',
##        'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6',
##        'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',
##        'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle',
##        'woba_value', 'woba_denom', 'babip_value', 'iso_value',
##        'launch_speed_angle', 'at_bat_number', 'pitch_number', 'pitch_name',
##        'home_score', 'away_score', 'bat_score', 'fld_score', 'post_away_score',
##        'post_home_score', 'post_bat_score', 'post_fld_score',
##        'if_fielding_alignment', 'of_fielding_alignment', 'spin_axis',
##        'delta_home_win_exp', 'delta_run_exp'],
##       dtype='object')
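
The lookup above returned exactly one row, so taking the first key_mlbam value was safe. A common last name could return several rows, or none at all. As a minimal sketch of a guard for that case, the helper below (pick_mlbam_id is a hypothetical name, not part of pybaseball) checks the row count before returning the ID. The example table is a hand-built stand-in that reuses the values printed above, so the snippet runs without a network call.

```python
import pandas as pd


def pick_mlbam_id(lookup: pd.DataFrame) -> int:
    """Return the MLBAM ID from a playerid_lookup-style table.

    Hypothetical helper: raises ValueError unless exactly one player matched.
    """
    if lookup.empty:
        raise ValueError("no player matched the lookup")
    if len(lookup) > 1:
        raise ValueError(f"{len(lookup)} players matched; narrow the lookup")
    # .iloc[0] selects by position, so it does not depend on the index labels.
    return int(lookup["key_mlbam"].iloc[0])


# Stand-in for the lookup table shown above (values copied from the post).
example_lookup = pd.DataFrame(
    {
        "name_last": ["chapman"],
        "name_first": ["aroldis"],
        "key_mlbam": [547973],
        "mlb_played_first": [2010.0],
        "mlb_played_last": [2021.0],
    }
)

pid = pick_mlbam_id(example_lookup)
print(pid)
```

With the real lookup, the call would be pick_mlbam_id(player_id) in place of player_id.key_mlbam[0].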

Save the Data

I have found that an incredibly important step when collecting data is to save an untransformed version before conducting any analysis. It is much faster to work with locally stored data than to recollect every time you want to test your ETL pipeline.

So, our last step here is to save the data.

chapman_stats.to_csv("./chapman_20150301-20211001.csv")
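
One way to make the "collect once, reuse locally" habit automatic is a small cache wrapper. The sketch below is an assumption about how you might structure it, not something pybaseball provides: load_or_fetch is a hypothetical helper that reads the CSV if it already exists and otherwise calls a fetch function and saves the result. The demo uses a fake fetch function with made-up placeholder rows so it runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_fetch(path: Path, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Read the raw data from path if it exists; otherwise fetch and save it."""
    if path.exists():
        return pd.read_csv(path)
    data = fetch()
    # index=False keeps the row numbers out of the file, so a later
    # read_csv does not pick up an extra unnamed column.
    data.to_csv(path, index=False)
    return data


# Demo with a fake fetch function; the rows are placeholders, not real pitches.
calls = []


def fake_fetch() -> pd.DataFrame:
    calls.append(1)  # record each time the "remote" source is hit
    return pd.DataFrame(
        {"pitch_type": ["FF", "SL"], "release_speed": [99.1, 87.4]}
    )


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "raw.csv"
    first = load_or_fetch(cache, fake_fetch)   # fetches and writes the file
    second = load_or_fetch(cache, fake_fetch)  # reads the saved file instead

print(len(calls))
```

For the data in this post, the fetch argument could be lambda: statcast_pitcher('2015-03-01', '2021-10-01', pid), with the path set to the CSV file name used above.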

Next Steps

In the next article, we will explore working with Statcast data.