When Do Stats Become Meaningful?

https://cdn.vox-cdn.com/thumbor/02nrOBeXv6dPF5fjVxr60OyDHY0=/0x760:2336x1983/fit-in/1200x630/cdn.vox-cdn.com/uploads/chorus_asset/file/23430043/72693380.jpg

2006 American League Most Valuable Player Justin Morneau

Sample size, "stabilization", and (appropriately) sorting signal from noise.

Every year, a few weeks into the season, the baseball internet is inundated with articles seeking to determine if a player's or team's start to the season is "real." If such a piece had been written anytime from 2003 to mid-2006 about former Twins first baseman Justin Morneau, the writer might well have concluded at varying points that Morneau was destined to become anything from a prospect bust to a future All-Star.

Truly, the first few seasons of the lanky Canadian's MLB career were a mixed bag.

Morneau was a third-round draft choice and a consensus top-100 prospect. Beginning with his 2003 debut, he collected 970 plate appearances across parts of three seasons and slashed a combined .248/.313/.461. That worked out to league-average-ish weighted on-base average (wOBA, .327) and weighted runs created plus (wRC+, 97) marks that paled compared to the typical production of that era and the offensive benchmarks expected of first basemen.

Within that up-and-down start to his career, Morneau had extended stretches where he looked the part of the run-producing, middle-of-the-order anchor his scouting report projected him to be, and others where he looked like an overmatched young player struggling to find his footing at the game's highest level (complete with multiple trips back to the minors).

Photo by MARLIN LEVISON/Star Tribune via Getty Images
Twins prospect Justin Morneau with the AAA Rochester Red Wings in 2004

The uneven beginning of Morneau's major league career makes for a useful case study for this installment of our analytics fundamentals series, which focuses on sample sizes, sorting signal from noise, and what it really means when we talk about statistics "stabilizing".

Signal vs. Random Noise

By now, most fans of baseball are familiar with the idea that small sample sizes might be misleading and should be treated with some degree of caution. That hotshot rookie who comes up and gets off to a good start is probably not a .400 hitter and the team that wins its first ten games of the season is probably not perfect.

Each event in baseball is more or less a random occurrence. Low probability and fluky results happen all the time when viewed in isolation. Morneau's teammate on the 2005-2007 Twins, Jason Tyner, homered once in 1,467 major league plate appearances, for instance. But throughout a longer series of events, such as a season or a career, a clearer picture of what's happened starts to emerge.
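To put a rough number on how often a hot start is just noise, here is a minimal sketch. It assumes a hypothetical hitter with a true .270 talent level and treats every at-bat as an independent coin flip with that probability, which is a simplification. The function name and the chosen at-bat totals are illustrative only.

```python
from math import comb, ceil

def prob_avg_at_least(true_avg, at_bats, target_avg):
    """Exact binomial chance of hitting target_avg or better over at_bats."""
    # Smallest hit total that reaches the target (epsilon guards float rounding).
    min_hits = ceil(target_avg * at_bats - 1e-9)
    return sum(
        comb(at_bats, hits) * true_avg**hits * (1 - true_avg)**(at_bats - hits)
        for hits in range(min_hits, at_bats + 1)
    )

# Hypothetical hitter: true .270 talent, asked to "look like" a .400 hitter.
chances = {ab: prob_avg_at_least(0.270, ab, 0.400) for ab in (25, 50, 100, 500)}
for ab, p in chances.items():
    print(f"{ab:>3} at-bats: {p:.6f}")
```

Under these assumptions, the chance of a fluky .400 stretch shrinks quickly as the at-bats pile up, which is the small-sample caution expressed as arithmetic.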

There is a reason the kind of articles I mentioned in the first paragraph tend to appear around the beginning of May each season. By then, many players have about a month's worth of games under their belts, which leads us to think they have enough data points for their numbers to be meaningful.

But how many plate appearances do we need to see before we have some confidence that the player's past performance is meaningful for predicting the future?

Baseball Stat Reliability

The generally accepted work on this topic is attributable to a handful of public baseball analysts, beginning with Russell Carleton, now of Baseball Prospectus, back in 2007. Harry Pavlidis, Derek Carty, Sean Dolinar, and Jonah Pemstein updated and extended Carleton's foundation with different sampling methodologies as better data sources became available over the next decade.

Central to those efforts was the concept of statistical reliability, which measures a stat's consistency. If you took the same measurement of highly reliable data multiple times from a large sample (e.g., a batter's seasonal strikeout rate across multiple seasons), you'd expect to get approximately the same result each time.

Reliability is expressed as a number between 0 and 1. The closer to 1 you get, the more reliable the metric, which means it has more signal, and less random error and uncertainty.

Stats with high reliability tell us something meaningful about the player's "true talent level," as opposed to other factors like luck, health, or any number of other things that affect the performance of humans.

And some stats indicate if a performance is meaningful with fewer data points than others. Said differently, those stats become reliable more quickly. We're interested in those because they can help us figure out if that hot start is a good representation of the player's ability or a stretch of good luck.

To find the sample sizes at which various stats become reliable, Carleton proposed using a technique from social psychology called split-half reliability, which measures the consistency between two subsets of a complete dataset.

The specific approaches for split-half reliability range from simply numbering data points and dividing them into evens and odds, or dividing up sequential series of data (e.g., 1-50, 51-100, etc.), to more complicated approaches that correlate all possible combinations of the two samples, like the Kuder-Richardson formulas and Cronbach's alpha.
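The simplest version, the evens-and-odds split, can be sketched in a few lines. This is only an illustration on simulated data, not the code any of the analysts above used. The players, their hypothetical "true" strikeout rates (drawn between 10% and 30%), and all function names are assumptions made for the example.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_correlation(outcomes_by_player):
    """Correlate each player's rate in odd-numbered vs. even-numbered events."""
    odd_rates, even_rates = [], []
    for outcomes in outcomes_by_player:
        odd = outcomes[0::2]   # 1st, 3rd, 5th, ... plate appearances
        even = outcomes[1::2]  # 2nd, 4th, 6th, ... plate appearances
        odd_rates.append(sum(odd) / len(odd))
        even_rates.append(sum(even) / len(even))
    return pearson(odd_rates, even_rates)

def simulate_players(n_players, n_events, rng):
    """Simulated data: 1 = strikeout, 0 = anything else (an assumption)."""
    players = []
    for _ in range(n_players):
        talent = rng.uniform(0.10, 0.30)  # hypothetical true strikeout rate
        players.append([1 if rng.random() < talent else 0
                        for _ in range(n_events)])
    return players

rng = random.Random(0)
players = simulate_players(300, 400, rng)
r = split_half_correlation(players)
print(f"split-half correlation: {r:.3f}")
```

The closer the two halves track each other across the whole group of players, the more of the stat is signal rather than noise at that sample size.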

Credit Sean Dolinar and Jonah Pemstein, FanGraphs | LINK
Example of Reliability at different sample sizes

The authors listed above used one of these techniques to obtain samples of plate appearances (or batted balls) and then correlated them with each other at different sample sizes looking for the point where the correlation between the two samples crossed 0.707.

Why 0.707? Because that's the point where the share of signal in the data begins to surpass the share of noise. Squaring the correlation gives r^2, the proportion of variance the two samples share, and .707 times .707 = ~0.500.
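As a rough sketch of that search, the code below builds two samples of n events per player, correlates them, and walks n upward until the correlation first reaches 0.707. It runs on simulated data under the same kind of assumptions as before: hypothetical players with true rates between 10% and 30%, and illustrative function names. The analysts' actual sampling methods differed in their details.

```python
import random

THRESHOLD = 0.5 ** 0.5  # about 0.707, the r at which r squared equals 0.5

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def correlation_at(players, n):
    """Correlate two n-event samples per player (odd vs. even events)."""
    first = [sum(p[0:2 * n:2]) / n for p in players]
    second = [sum(p[1:2 * n:2]) / n for p in players]
    return pearson(first, second)

def first_crossing(players, sizes, threshold=THRESHOLD):
    """Return the first sample size whose correlation reaches the threshold."""
    for n in sizes:
        if correlation_at(players, n) >= threshold:
            return n
    return None  # never crossed within the sizes tried

# Simulated data (an assumption): 1 = event occurred, 0 = it did not.
rng = random.Random(1)
players = []
for _ in range(400):
    talent = rng.uniform(0.10, 0.30)  # hypothetical true rate
    players.append([1 if rng.random() < talent else 0 for _ in range(600)])

crossing = first_crossing(players, range(10, 301, 10))
print("correlation first reaches 0.707 at n =", crossing)
```

In this toy setup the crossing point depends entirely on the made-up spread of talent, so the number it prints says nothing about real players. It only shows the mechanics of finding a threshold.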

Just Give Me the Numbers

Here's a side-by-side comparison of Carleton's and Carty's results for hitters and pitchers:

Credit Russell Carleton LINK | Derek Carty LINK
Comparison of reliability points for various hitter stats
Credit Russell Carleton LINK | Derek Carty LINK
Comparison of reliability points for various pitcher stats

The methods used were slightly different, and you can see that the results differ somewhat. That said, they are directionally very consistent, and the relative differences between stats are similar.

The stats most directly under the control of the player's particular skill set tend to stabilize most quickly. Strikeout and walk rates become meaningful in comparatively few events for both hitters and pitchers. The same goes for fly ball and ground ball rates. That passes muster because we know players' different playing styles significantly influence those types of outcomes. Joe Ryan isn't likely to suddenly become a ground ball pitcher and Miguel Sanó wasn't likely to start making lots of contact.

On the other hand, line drive rate and batting average on balls in play are much noisier stats and therefore take a lot more data before they become reliable. Similarly, the staple triple-slash stats take more events to become reliable: about half a season's worth of plate appearances for a hitter's on-base percentage, and about a season and a half for batting average.

Using Them Responsibly

In many of the early-season articles I mentioned in the introduction, the numbers in the tables above are used as a sort of proof that a player's stats are legitimate. Once a stat reaches its reliability threshold, it is treated as "real" and used as the expectation going forward.

That's not quite right.

There is a very important caveat that comes with reliability analysis that often gets overlooked. A measure is said to have high reliability if it produces similar results under consistent conditions.

Let's have Carleton explain:

Reliability analysis answers the question I see that Smith has had 100 PA this year and a walk rate of X. If I were to go back in time and give Smith another 100 PA in the same basic circumstances, how confident would I be that he would reproduce the same performance?

————————

When I say that strikeout rate for pitchers stabilizes at 70 batters faced, what I mean is that we can be reasonably sure that his strikeout rate over those 70 batters is a good reflection of his talent level over those 70 (now past) plate appearances.

This is different from saying that once a pitcher has gotten to 70 batters, we can assume that he will perform this way for the rest of the season. That's an assumption. It's not a bad one, but it is an assumption. Instead, what it means is that if his underlying skill set has changed in some meaningful way, we'll know in 70 plate appearances.

In short, the problem with using those thresholds as a definitive predictor of future performance is that the future is almost assuredly going to be a different set of circumstances. The player will be older, he might be hurt, or tired, the pitchers he faces will be different, the strategies opponents use against him might be different, he may have something happening in his personal life, or have gotten a new coach, and on and on.

That's where this comes back to Morneau.

Putting It All Together

Morneau was a major boost to the 2004 AL Central Division Champion Twins when he was a July callup and delivered .271/.340/.536 (.362 wOBA, 118 wRC+) with 19 home runs and 58 runs batted in over 312 plate appearances in 74 games. Better still were his 17.3% strikeout and 9.0% walk rates.

But a myriad of health issues — an appendectomy, chicken pox, pneumonia, and pleurisy — wrecked Morneau's offseason. He suffered a concussion in April of 2005 and had a bone spur in his left elbow that June.

That all likely contributed to Morneau producing a disappointing .239/.304/.437 (.314 wOBA, 91 wRC+) line with 22 home runs and 79 RBIs over 543 plate appearances and 141 games that season.

The struggles continued when Morneau hit just .208 and struck out in more than a quarter of his April 2006 plate appearances. May was better, but not nearly as good as what was to come, and Morneau was hitting just .237/.299/.453 when the Twins rolled into Seattle in early June.

The rest of Morneau's 2006 season is forever etched into Twins lore. On that trip to Seattle, he was benched by manager Ron Gardenhire and visited by his dad for a serious talk, both as much for his performance on the field as for his habits and activities off it.

Photo by Brace Hemmelgarn/Minnesota Twins/Getty Images
Justin Morneau and Ron Gardenhire of the Minnesota Twins pose for a photo in 2023

The wake-up call seemed to work (and stick).

Over 450 plate appearances the rest of that season, a re-focused Morneau hit .361/.411/.609, cracked 23 dingers, drove in 92 runs, and helped propel the Twins to another AL Central division crown on the way to being named the American League Most Valuable Player. For the next four seasons, Morneau was a consistent middle-of-the-order force, until his career path was dramatically altered by another concussion in mid-2010.

★ ★ ★ ★ ★

In summary, the stats and their reliability thresholds help us identify when something meaningful has probably changed. But whether that's truly a change in talent or skill that can carry forward to the future, or something else altogether, is something we can't be certain about. Time (and sometimes more analysis) will tell.

Unconsciously, we tend to think of players as static. But they aren't. They are (we all are) constantly changing, learning, growing, and being influenced and affected by things that may or may not have anything to do with the task at hand. It's not all limited to talent and luck, even when we have sufficient sample sizes.

We'd do well to leave some space in our evaluations and assessments for those other factors that might be affecting a player's ability to perform at his true talent level.


John writes for Twinkie Town, Twins Daily, and Pitcher List with an emphasis on analysis. He is a lifelong Twins fan and former college pitcher. Follow him on Twitter @JohnFoley_21.
