The individuals in our shared dataset are actually baseball players. Baseball and statistics go hand in hand, and the excellent Retrosheet website records the sequence of events for every at-bat in every major league baseball game dating all the way back to 1916! This makes it a wonderfully rich public domain dataset for experimenting with data analytics.

Introduction to the Game

If you’re not familiar with baseball, here are some of the many resources that describe the basics of the game:

Fantasy Baseball

Many of our applications involve using data science to detect individuals with unusual or exceptional characteristics. A large number of baseball players move through the major leagues every year. Some play for a long time and some only for a very short time. While they are all very gifted athletes, a smaller group still stands out as “exceptional”.

To identify these individuals, we use fantasy baseball scoring metrics to define a sub-population of players who have a significant positive impact on their teams. We then use various data features to predict which players will fall into that population.
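The idea of scoring players and then flagging a high-scoring sub-population can be sketched roughly as follows. This is only an illustration under stated assumptions: the event names, the point weights, the function names, and the top-fraction cutoff are all hypothetical placeholders, not the actual fantasy platform values or the selection rule used in our analysis.

```python
from typing import Dict, Set

# Hypothetical weights for illustration only; NOT the real fantasy scoring values.
ILLUSTRATIVE_WEIGHTS: Dict[str, float] = {
    "single": 1.0,
    "double": 2.0,
    "triple": 3.0,
    "home_run": 4.0,
    "walk": 1.0,
}


def fantasy_points(events: Dict[str, int],
                   weights: Dict[str, float] = ILLUSTRATIVE_WEIGHTS) -> float:
    """Sum weight * count over a player's event counts; unknown events score 0."""
    return sum(weights.get(name, 0.0) * count for name, count in events.items())


def exceptional_players(season: Dict[str, Dict[str, int]],
                        top_fraction: float = 0.1) -> Set[str]:
    """Return the ids of the highest-scoring fraction of players (at least one)."""
    scores = {player: fantasy_points(events) for player, events in season.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, round(len(ranked) * top_fraction))
    return set(ranked[:keep])


# Made-up event counts for three hypothetical players.
season = {
    "player_a": {"home_run": 3, "single": 5},
    "player_b": {"single": 2, "walk": 1},
    "player_c": {"double": 1},
}
print(exceptional_players(season, top_fraction=0.34))
```

The resulting set of player ids would then serve as the positive class when training a model to predict membership from other data features.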

Because of its popularity, we base our analysis on the Yahoo Sports fantasy baseball platform. The offensive impact of a batter is evaluated by assigning points as follows:

Data Transformation

To disguise Retrosheet data as health and emergency service patient data, we use the following mapping: