Abstract:

Athletes in today’s world garner huge sums of money. For this reason, teams need to make informed decisions on how to spend their money effectively. They need accurate measures for assessing the true talent of their players. A lot of the variation in year-to-year performance is due to luck, but there is a significant element of skill as well. I attempted to isolate the proportion of pitching performance that was due to luck and skill, respectively, through simple and multiple linear regression analysis on ERA and other performance statistics. I did this using several models that utilized various combinations of luck-based and skill-based predictors.

I found that a slightly higher proportion of pitching performance is due to skill than to luck. I then created my own model for predicting ERA using significant skill indicators and compared it to existing ERA estimators, against which it fared decently. The most significant skill indicator was a pitcher’s strikeout minus walk rate. The two most significant luck indicators proved to be batting average on balls in play (BABIP) and home run to fly ball ratio. To my surprise, left on base percentage, which measures the rate at which pitchers strand runners on base at the end of an inning without allowing them to score, was not entirely luck-based. This result suggests that some pitchers may step up when there are runners on base, while others choke under the pressure of an imminent score.

Background:

There is a lot of money in sports, for both the players and the management. The management wants its team to win so that games sell out and it can negotiate lucrative television deals. Thus, teams look to recruit the right players to win, mostly by offering them large sums of money. With so much money involved, the need arose for accurate evaluation of talent. Scouts, who evaluated players by watching them, were often biased and inaccurate. Statistics, on the other hand, eventually proved to be a less biased and more accurate method of player evaluation. “The Moneyball Era” of the early 2000s saw the rise of statistical analysis in baseball (Sabermetrics).

Many statistics had existed for decades, but as it turned out, most were luck-based, unimportant for run scoring or run prevention, and/or fluctuated wildly from year to year. Sabermetricians focused on identifying which statistics were relatively stable across seasons and on creating new ones that were. A statistic that stays stable for each player across seasons, even though it differs from player to player, is reliable and can tell analysts something about a player’s true talent.

The most important statistic for pitchers in terms of run prevention, their main goal, is earned run average, or ERA. ERA reports the average number of earned runs (i.e., runs that did not score with the help of a fielding error) a pitcher allows per nine innings (the typical length of a baseball game). ERA is an unbiased statistic: it measures an objective and very important outcome. However, ERA does not have a high correlation from year to year; there is a certain degree of luck involved. How well can we predict ERA from batted ball and pitch data? Or is too much of the variation in ERA due to luck for us to come up with useful predictions?
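As a concrete illustration of the definition above, ERA is nine times earned runs divided by innings pitched. The sketch below assumes innings are given in box-score notation, where the digit after the point counts outs rather than tenths (so "100.1" means 100 1/3 innings). The function names and the pitcher’s numbers are hypothetical, not taken from the dataset.

```python
from fractions import Fraction


def innings_from_box_score(ip: str) -> Fraction:
    """Convert box-score innings (e.g. "100.1") to exact innings.

    The digit after the point counts outs, so "100.1" is 100 1/3 innings.
    """
    whole, _, outs = ip.partition(".")
    outs_count = int(outs) if outs else 0
    if outs_count not in (0, 1, 2):
        raise ValueError(f"invalid box-score innings: {ip!r}")
    return int(whole) + Fraction(outs_count, 3)


def era(earned_runs: int, innings: Fraction) -> float:
    """Earned runs allowed per nine innings pitched."""
    if innings <= 0:
        raise ValueError("innings pitched must be positive")
    return float(9 * earned_runs / innings)


# Hypothetical pitcher: 50 earned runs in 100 innings gives an ERA of 4.5.
print(era(50, innings_from_box_score("100.0")))
```

Using exact fractions avoids treating "100.1" as 100.1 innings, a common source of small ERA errors.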

Methods:

Data were gathered from the customizable leaderboards on the FanGraphs website. Because the leaderboards were customizable, I had to code only one new variable, NL, which indicated the league in which the individual pitched that year. The remaining variables are ERA, HR9, K, BB, KminusBB, WHIP, BABIP, LOB, Oswing, Zswing, Zone, SwStr, IFFB, HRFB, LD, Hard, Soft, Medium, GB, FB, GBFB, Team, and Name (of the player). Each of these variables (except for Name) appears twice in the dataset, once for 2017 and once for 2018. The dataset contained data from the 140 pitchers who threw at least 100 innings in 2018 and the 134 pitchers who threw at least 100 innings in 2017. Because 85 pitchers qualified in both years, the dataset contained a total of 189 distinct pitchers.
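How the two seasons were joined is not described; the sketch below only illustrates the layout described above, in which each variable appears once per season. It assumes each season’s leaderboard has been loaded as a dictionary keyed by pitcher Name (a hypothetical input format) and fills a season’s columns with None when the pitcher did not reach 100 innings that year. By inclusion-exclusion, 140 + 134 - 189 = 85 pitchers appear in both seasons.

```python
def merge_seasons(season_2017, season_2018, variables):
    """Combine two per-season leaderboards into one row per pitcher.

    Each season maps a pitcher's Name to a dict of that season's stats.
    Every output row gets one column per variable per year, e.g.
    "ERA_2017" and "ERA_2018"; a season in which the pitcher did not
    qualify (fewer than 100 innings) is filled with None.
    """
    merged = {}
    for name in sorted(set(season_2017) | set(season_2018)):
        row = {"Name": name}
        for year, season in (("2017", season_2017), ("2018", season_2018)):
            stats = season.get(name)
            for var in variables:
                row[f"{var}_{year}"] = stats.get(var) if stats is not None else None
        merged[name] = row
    return merged


def qualified_both_years(season_2017, season_2018):
    """Number of pitchers who appear in both seasons' leaderboards."""
    return len(set(season_2017) & set(season_2018))


# Hypothetical toy leaderboards, not real data.
toy_2017 = {"Pitcher A": {"ERA": 3.50}, "Pitcher B": {"ERA": 4.20}}
toy_2018 = {"Pitcher B": {"ERA": 3.90}, "Pitcher C": {"ERA": 5.00}}
toy = merge_seasons(toy_2017, toy_2018, ["ERA"])
```

Keying on Name mirrors the dataset described above; two pitchers who share a name would collide, so a unique player identifier would be a safer key in practice.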

Exploratory analysis was performed, and normality was assessed for the outcome variables. A single model containing all of the predictor terms was built to compare the effects of each predictor on ERA using their p-values, and predictors were then removed from it to reduce multicollinearity. The year-to-year correlation of each predictor term was used to determine which were luck-based and which were skill-based. The NL variable was excluded from this classification: although it has a high year-to-year correlation, the league in which a pitcher pitches is out of his control, so this variable was treated as luck-based.
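The year-to-year correlation can be computed by pairing each pitcher’s 2017 and 2018 values of a variable. For a simple regression of one season on the other, R-squared equals the squared Pearson correlation, which is what this sketch returns. It assumes each pitcher’s row is a dictionary with hypothetical keys such as "BABIP_2017" and "BABIP_2018"; pitchers missing either season are skipped, since they contribute no pair.

```python
from math import sqrt


def year_to_year_r_squared(rows, var):
    """Squared Pearson correlation between a variable's 2017 and 2018 values.

    rows: iterable of dicts with hypothetical keys like "BABIP_2017" and
    "BABIP_2018". Pitchers missing either season are skipped.
    """
    pairs = [
        (row[f"{var}_2017"], row[f"{var}_2018"])
        for row in rows
        if row.get(f"{var}_2017") is not None and row.get(f"{var}_2018") is not None
    ]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three pitchers with both seasons")
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError(f"{var} does not vary in one of the seasons")
    r = sxy / sqrt(sxx * syy)
    return r * r


# Hypothetical rows: the third pitcher lacks a 2018 value and is skipped.
toy_rows = [
    {"K_2017": 0.20, "K_2018": 0.22},
    {"K_2017": 0.25, "K_2018": 0.27},
    {"K_2017": 0.30, "K_2018": None},
    {"K_2017": 0.15, "K_2018": 0.17},
]
```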

Predictors with a year-to-year correlation (R-squared) of 0.400 or more were put in the skill-based category, while predictors with a year-to-year correlation of less than 0.010 were put in the luck-based category. Predictors with an R-squared between 0.010 and 0.400 were left without a category, since they contained some element of both skill and luck. These two categories were used to build two separate models comparing the effects of skill and of luck on ERA. Predictors were removed from each model to reduce multicollinearity. The skill-only model was then reduced using backward elimination to create a new estimator for ERA containing only predictor terms significant at the alpha = 0.05 level. This new model was then compared to existing ERA estimators.
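Backward elimination, as described above, repeatedly refits the model and drops the least significant remaining predictor until every term is significant. The sketch below takes the model-fitting step as a parameter, a hypothetical function that returns each predictor’s p-value, so it does not depend on any particular statistics package. The predictor names and p-values in the toy fit are made up for illustration and are not results from this study.

```python
from typing import Callable, Dict, List

# Hypothetical interface: given predictor names, fit the linear model for ERA
# and return each predictor's coefficient p-value.
FitFunction = Callable[[List[str]], Dict[str, float]]


def backward_elimination(predictors: List[str], fit: FitFunction,
                         alpha: float = 0.05) -> List[str]:
    """Drop the least significant predictor and refit until all have p < alpha."""
    remaining = list(predictors)
    while remaining:
        p_values = fit(remaining)
        weakest = max(remaining, key=lambda name: p_values[name])
        if p_values[weakest] < alpha:
            break  # every remaining predictor is significant
        remaining.remove(weakest)  # refit on the next pass without it
    return remaining


# Made-up p-values for illustration only; a real refit would change the
# p-values of the other terms each time a predictor is removed.
def toy_fit(names: List[str]) -> Dict[str, float]:
    made_up = {"x1": 1e-10, "x2": 0.03, "x3": 0.30, "x4": 0.60}
    return {name: made_up[name] for name in names}
```

Refitting after every removal matters because, with correlated predictors, dropping one term can change the significance of the others.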

Results:

After certain predictors were eliminated to reduce multicollinearity, the luck-based indicators (those with a year-to-year correlation of less than 0.01 or otherwise deemed luck indicators) turned out to be BABIP (R-squared: 0.003), HRFB (0.003), LD (0.006), and NL (N/A), and the skill-based indicators (those with a year-to-year correlation of 0.40 or more) were KminusBB (0.51), GBFB (0.66), Zone (0.56), Zswing (0.52), and Oswing (0.43). Variables between the thresholds were LOB (0.13), ERA (0.11), HR9 (0.11), WHIP (0.01), IFFB (0.05), Hard (0.15), Medium (0.02), and Soft (0.11). Table 2 lists year-to-year correlations for each predictor.
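The thresholds from the Methods can be applied mechanically to the year-to-year R-squared values reported above. In this sketch, a value of at least 0.400 is labeled skill, a value below 0.010 is labeled luck, and anything else falls between. NL is omitted because it was assigned to the luck category by judgment rather than by its correlation. WHIP, at exactly 0.01, lands between the thresholds, matching the grouping above.

```python
SKILL_THRESHOLD = 0.400  # R-squared at or above this: skill-based
LUCK_THRESHOLD = 0.010   # R-squared below this: luck-based


def classify(r_squared: float) -> str:
    """Assign a predictor to a category from its year-to-year R-squared."""
    if r_squared >= SKILL_THRESHOLD:
        return "skill"
    if r_squared < LUCK_THRESHOLD:
        return "luck"
    return "between"


# Year-to-year R-squared values as reported in the Results.
reported = {
    "BABIP": 0.003, "HRFB": 0.003, "LD": 0.006,
    "KminusBB": 0.51, "GBFB": 0.66, "Zone": 0.56, "Zswing": 0.52, "Oswing": 0.43,
    "LOB": 0.13, "ERA": 0.11, "HR9": 0.11, "WHIP": 0.01, "IFFB": 0.05,
    "Hard": 0.15, "Medium": 0.02, "Soft": 0.11,
}
categories = {var: classify(r2) for var, r2 in reported.items()}
```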

The two most significant luck-based terms were BABIP (p-value in reduced full model: < 2e-16) and HRFB (< 2e-16). BABIP measures batting average on balls in play, or how often the pitcher allows a hit on balls that a fielder can make a play on (i.e., not home runs or strikeouts). HRFB, home run to fly ball rate, divides the total number of home runs a pitcher allows by the total number of fly balls they allow. The two most significant skill-based terms were KminusBB (p-value in reduced full model: 1.26e-15) and Oswing (0.02). KminusBB is the percentage of batters a pitcher strikes out minus the percentage of batters they walk. Oswing is the percentage of a pitcher’s pitches outside of the strike zone that batters swing at. However, in the reduced skill model, Zone (p-value: 0.05) took the place of Oswing as the second most significant skill-based indicator. Zone is the percentage of pitches a pitcher throws in the strike zone.
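For reference, the rate statistics defined above can be computed from counting stats. The sketch below uses the commonly published formulas: BABIP as hits minus home runs over at-bats minus strikeouts minus home runs plus sacrifice flies, and KminusBB with both rates taken over total batters faced. The function names and the pitcher’s counts are hypothetical.

```python
def babip(h: int, hr: int, ab: int, k: int, sf: int) -> float:
    """Batting average on balls in play: non-HR hits / balls in play."""
    return (h - hr) / (ab - k - hr + sf)


def hr_per_fb(hr: int, fb: int) -> float:
    """Share of fly balls allowed that became home runs (HRFB)."""
    return hr / fb


def k_minus_bb(k: int, bb: int, batters_faced: int) -> float:
    """Strikeout rate minus walk rate, both per batter faced (KminusBB)."""
    return (k - bb) / batters_faced


def o_swing(swings_outside: int, pitches_outside: int) -> float:
    """Share of pitches outside the zone that batters swung at (Oswing)."""
    return swings_outside / pitches_outside


def zone_rate(pitches_in_zone: int, total_pitches: int) -> float:
    """Share of all pitches thrown in the strike zone (Zone)."""
    return pitches_in_zone / total_pitches


# Hypothetical season: 700 batters faced, 175 strikeouts, 56 walks.
print(round(k_minus_bb(175, 56, 700), 3))
```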

Other predictors that were significant in the reduced full model for ERA but not in any other model in which they were used were LOB (p-value: < 2e-16), GBFB (p-value: < 2e-16), and NL (p-value: 0.03). LOB, left on base percentage, is the percentage of runners who reach base that a pitcher strands without allowing them to score by the end of the inning. GBFB is the number of ground balls a pitcher allows divided by the number of fly balls they allow. NL indicates whether a pitcher pitched the full year in the National or the American League. Table 3 lists the p-values of the predictors in the reduced full model. All predictors in the reduced full model had negative coefficients except for BABIP (9.864), Zswing (0.0012), and HRFB (0.1004); increases in these three predictors would raise ERA. The full model had the lowest mean squared error (MSE), 0.03, closely followed by the reduced full model (0.05). The skill and luck models were close, with MSEs of 0.49 and 0.50, respectively. The reduced skill model (MSE: 0.49) stacked up fairly well against existing ERA estimators, beating SIERA (0.51) and xFIP (0.51). However, FIP (0.33) and tERA (0.34) were much more accurate in predicting ERA. Table 1 lists the predictive values of each model.
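Because MSE averages squared errors, it is measured in squared runs; its square root (RMSE) is on the same scale as ERA and reads more naturally as a typical miss. The sketch below computes MSE from paired actual and predicted ERAs, converts the MSEs reported above to RMSEs, and also shows adjusted R-squared, which requires the number of predictors in a model. The function names are hypothetical.

```python
from math import sqrt


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted ERA."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse_from_mse(mse: float) -> float:
    """Root-mean-squared error, in runs per nine innings like ERA itself."""
    return sqrt(mse)


def adjusted_r_squared(r_squared: float, n: int, num_predictors: int) -> float:
    """R-squared penalized for the number of predictors in the model."""
    if n - num_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r_squared) * (n - 1) / (n - num_predictors - 1)


# MSEs as reported in the Results section.
reported_mse = {"full": 0.03, "reduced full": 0.05, "skill": 0.49,
                "luck": 0.50, "reduced skill": 0.49, "SIERA": 0.51,
                "xFIP": 0.51, "FIP": 0.33, "tERA": 0.34}
for model, value in reported_mse.items():
    print(f"{model}: MSE {value:.2f}, RMSE {rmse_from_mse(value):.2f}")
```

On this scale, an MSE of 0.49 corresponds to a typical miss of about 0.70 runs of ERA, not 0.49 runs.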

Discussion:

My most surprising finding was that LOB, left on base percentage, correlated more strongly from year to year (R-squared: 0.13) than I expected. It exceeded the threshold that I set for the luck category, even though LOB is generally considered purely luck by Sabermetricians. This result suggests that some pitchers are better with runners on base than others. This could be because some thrive under the pressure of having runners close to scoring while others choke, an explanation most Sabermetricians scoff at. Another potential explanation is that strikeout pitchers have higher LOB percentages: when a batter puts the ball in play with a runner on base, they can advance the runner even if they make an out (as with a sacrifice bunt), and strikeout pitchers deny batters that chance. Yet another potential explanation is that worse pitchers have lower LOB percentages because they give up hits and walks more frequently, so runners already on base are more likely to be driven in. Each of these explanations could account for why some of the variance in LOB is not random.
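To make the LOB discussion concrete, the commonly published FanGraphs formula is LOB% = (H + BB + HBP - R) / (H + BB + HBP - 1.4 × HR), where R counts all runs allowed, earned or not. The sketch below implements that formula; the function name and the counts are hypothetical. Strikeouts do not appear in it directly, so any effect of strikeouts on LOB% would have to work through fewer runs scoring, as in the explanation above.

```python
def lob_pct(h: int, bb: int, hbp: int, runs: int, hr: int) -> float:
    """Left-on-base percentage, using the commonly published FanGraphs formula.

    Numerator: baserunners allowed who did not score.
    Denominator: baserunners allowed, discounted by 1.4 x home runs.
    runs is all runs allowed, earned or not.
    """
    baserunners = h + bb + hbp
    return (baserunners - runs) / (baserunners - 1.4 * hr)


# Hypothetical season: 150 H, 50 BB, 5 HBP, 70 R, 20 HR.
print(round(lob_pct(150, 50, 5, 70, 20), 3))
```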

The coefficients of the reduced full model were not particularly surprising. The significant coefficients belong to KminusBB, BABIP, LOB, Oswing, HRFB, GBFB, and NL. Pitching in the National League should indeed lower ERA because, in 2017 and 2018, National League lineups included the pitcher, whereas American League lineups used a designated hitter in his place, and pitchers are not good hitters. GBFB ratio should indeed lower ERA as it increases, because ground balls are less likely to go for doubles, triples, and home runs. HRFB ratio should indeed increase ERA as it increases, because the higher it is, the more home runs a pitcher allows. Oswing should indeed lower ERA as it increases; hitters generally do not hit well when they swing at bad pitches outside of the strike zone. LOB percentage should indeed lower ERA as it increases; the more runners a pitcher leaves on base without scoring, the better. BABIP increases ERA as it increases; the more hits that fall in against a pitcher, the more runs will be scored against them. KminusBB lowers ERA as it increases, because a higher value means strikeouts are being maximized and walks are being limited. Zone rate was not found to be significant here, although it was significant in the skills-only model. Pitching in the zone can be good or bad: while it is generally good to throw strikes because walks will be limited, if a pitcher throws only pitches right down the middle, batters will start hitting them. Hard percentage was also not significant in the reduced full model, probably because it does not correlate well from year to year (R-squared: 0.15) and, as such, is primarily due to luck. Zswing, the last term, was also insignificant, which makes sense. While swinging at pitches outside of the strike zone is almost always bad, swinging at pitches in the strike zone can be good or bad depending on the pitch. If it is a fastball, the hitter is more likely to get a hit, but if it is a curveball, they are less likely.

The skills-only model had slightly better predictive ability than the luck-only model (MSE of 0.49 versus 0.50). However, this result has limitations. The thresholds I set for the luck and skill categories were arbitrary. In addition, some significant terms were left out of both models because they fell in the gap between the thresholds. Most importantly, LOB percentage was left out of the luck and skill models, even though it was one of the most significant predictors of ERA in the reduced full model (p-value: < 2e-16).

For the reduced skill model, I am not surprised that KminusBB turned out to be the most significant term (see Figure 1). Strikeouts are vital, because if the batter cannot put the ball in play, they cannot get a hit or advance potential runners. If walks are limited while strikeouts are maximized, then the hitter is left with no easy way to get on base. The reduced skill model predicted ERA about as well as xFIP and SIERA, but tERA and FIP fared much better. FIP uses the actual number of home runs allowed (along with strikeouts, walks, and hit batters) rather than adjusting for HRFB ratio; this may improve its predictive value, but it makes FIP less skill-based, because the number of home runs allowed fluctuates with HRFB ratio. Meanwhile, tERA and SIERA rely heavily on batted ball data, which I deemed either insignificant in the reduced skill model (e.g., GBFB ratio) or primarily luck-based (Hard, LD, Soft, Medium, IFFB). xFIP’s formula leans on strikeouts and walks, much like the reduced skill model’s reliance on KminusBB, which may explain their comparable predictive values. I am not claiming that my model is better than these existing models; rather, it serves as a simplification of the complicated relationship between luck and skill in baseball. I did not report adjusted R-squared for the existing ERA estimators because their full formulas, with all of their predictors, were not readily available, and adjusted R-squared cannot be computed without knowing the number of predictors.
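The distinction drawn above between FIP and xFIP can be seen in their commonly published formulas: FIP weights home runs, walks plus hit batters, and strikeouts per inning, while xFIP replaces actual home runs with fly balls times the league-average HR/FB rate. The additive constant, which puts the result on the ERA scale, changes every season; the value 3.10 and the pitcher’s counts used below are illustrative assumptions, not figures from this study.

```python
def fip(hr: float, bb: int, hbp: int, k: int, ip: float, constant: float) -> float:
    """Fielding Independent Pitching (commonly published form)."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def xfip(fb: int, league_hr_per_fb: float, bb: int, hbp: int, k: int,
         ip: float, constant: float) -> float:
    """xFIP: FIP with home runs replaced by fly balls x league HR/FB rate."""
    expected_hr = fb * league_hr_per_fb
    return fip(expected_hr, bb, hbp, k, ip, constant)


# Hypothetical pitcher: 20 HR on 200 fly balls, 50 BB, 5 HBP, 175 K, 180 IP.
print(round(fip(20, 50, 5, 175, 180, 3.10), 2))
print(round(xfip(200, 0.10, 50, 5, 175, 180, 3.10), 2))
```

With a league HR/FB rate of 10% and 200 fly balls, xFIP equals FIP for a pitcher who allowed 20 home runs; a pitcher with an unusually high HR/FB would receive a lower xFIP than FIP, which is how xFIP strips out that luck-prone component.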

Conclusion:

To answer our initial question, we can predict ERA from batted ball and pitch data very well. Our full model had an MSE of 0.03125, which corresponds to a root-mean-squared error of about 0.18 runs of ERA, a minuscule amount compared to the league average of around 4.00. Even our reduced full model, adjusted for multicollinearity, produced an MSE of just 0.05268 (a root-mean-squared error of about 0.23). However, our second question is more complicated. While much of the variation in ERA is due to luck, we determined for our data, using arbitrary categorizations, that slightly more of the variation in ERA is due to skill. This result is probably not generalizable because of the arbitrary nature of the test and because some important predictors (e.g., LOB) were left out when they did not clearly fit in one category. Still, our reduced skill model managed a decent MSE of 0.4925, a root-mean-squared error of about 0.70 runs per nine innings. A typical miss of that size, roughly the difference between an ERA of 4.00 and one of 3.30, is larger than we would like, but the predictions track actual ERA fairly closely (see Figure 2), and an error of under a run every nine innings will usually not make or break a team’s chances. So, our reduced skill model does have some predictive ability after all, and should help to discern a pitcher’s true talent level.

Tables:

Table 1

Table 2

Table 3

Figures:

The two scatterplots depicted in the figures are almost exact mirror images of each other. This is because the reduced skill model relied primarily on KminusBB.

References:

McCracken, Voros. “Pitching and Defense: How Much Control Do Hurlers Have?” Baseball Prospectus, 23 Jan. 2001, http://www.baseballprospectus.com/news/article/878/pitching-and-defense-how-much-control-do-hurlers-have/.

“GB%, LD%, FB%.” FanGraphs Baseball, http://www.fangraphs.com/library/pitching/batted-ball/.

MacAree, Graham. “Sabermetrics 101: Pitching.” Lookout Landing, 3 Mar. 2010, 12:00PM ET, http://www.lookoutlanding.com/2010/3/3/1334154/sabermetrics-101-pitching.

“Plate Discipline (O-Swing%, Z-Swing%, Etc.).” FanGraphs Baseball, http://www.fangraphs.com/library/pitching/plate-discipline-o-swing-z-swing-etc/.

Beneventano, Philip, et al. “Predicting Run Production and Run Prevention in Baseball: The Impact of Sabermetrics.” ResearchGate, June 2012, http://www.researchgate.net/profile/Bruce_Weinberg/publication/266344641_Predicting_R Run-Production-and-Run-Prevention-in-Baseball-The-Impact-of-Sabermetrics.pdf.