The Problem
Our task is to create a machine learning model that predicts which Major League Baseball player is most likely to get a hit on a given day. We use statistics from recent games as well as previous seasons to accomplish this. The output is the likelihood of a hit for each batter on a given day against a given pitcher. We chose this task because there is a fantasy game on MLB.com where, if you correctly predict a hit for 57 days in a row, you win $5.6 million. Our goal is a system that takes the current day's games and matchups as input and predicts which player is most likely to get a hit.
Our Solution
When we started, we had many different ideas for possible features. To narrow it down, we tested most of the ideas. After many iterations, we narrowed it down to the following list:
- Hitter batting average for the current season
- Hitter batting average in last 20 plate appearances (indicator of hot/cold streak)
- Hitter batting average in last 50 plate appearances (indicator of longer-run hot/cold streak)
- Hitter plate appearances per game (full season)
- Home/away
- Ballpark factor (an adjustment for the hitter-friendliness of the ballpark)
- Opposing pitcher ERA for the current season
- Gametime temperature (research says that warmer weather is better for hitters)
- Games with at least one hit ratio (calculated as the proportion of games in which a batter got at least one hit in all four years in the training set)
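As an illustration of how the streak and hit-ratio features in the list above could be computed, here is a minimal Python sketch. The input layout (one `(counts_as_at_bat, is_hit)` pair per plate appearance, and a list of hit counts per game) and the function names are assumptions made for this example, not the format of our actual data files.

```python
from typing import List, Optional, Tuple

# Assumed layout: one (counts_as_at_bat, is_hit) pair per plate appearance,
# ordered oldest to newest. Walks and similar outcomes are not at-bats.
PlateAppearance = Tuple[bool, bool]


def recent_batting_average(pas: List[PlateAppearance], window: int) -> Optional[float]:
    """Batting average (hits / at-bats) over the last `window` plate appearances."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = pas[-window:]
    at_bats = sum(1 for is_ab, _ in recent if is_ab)
    hits = sum(1 for is_ab, is_hit in recent if is_ab and is_hit)
    # No at-bats in the window means the average is undefined.
    return hits / at_bats if at_bats else None


def games_with_hit_ratio(hits_per_game: List[int]) -> Optional[float]:
    """Proportion of games in which the batter had at least one hit."""
    if not hits_per_game:
        return None
    return sum(1 for h in hits_per_game if h > 0) / len(hits_per_game)
```

Calling `recent_batting_average(pas, 20)` and `recent_batting_average(pas, 50)` on the same history would give the short-run and longer-run streak indicators.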
While this was better, it still was not good enough. To improve further, we decided to combine the two approaches, a J48 decision tree and a logistic regression. We wrote a Python script that took a J48 tree from Weka and converted it into a set of nested Python if statements. We used this to find the best several players. To choose the best of this group, we ran them through a Weka Bayesian logistic regression and used its top-ranked players for our final predictions.
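The two-stage idea can be sketched as follows: a tree expressed as nested if statements shortlists players, and a logistic score ranks the shortlist. Every split threshold, coefficient, and feature name below is hypothetical and chosen only for illustration; the real tree and regression were learned in Weka and are not reproduced here.

```python
import math
from typing import Dict, List

Features = Dict[str, float]


def tree_predicts_hit(f: Features) -> bool:
    """Stand-in for nested ifs generated from a J48 tree (hypothetical splits)."""
    if f["season_avg"] >= 0.280:
        if f["pa_per_game"] >= 4.0:
            return True
        return f["last20_avg"] >= 0.350
    return False


# Hypothetical logistic regression coefficients, not learned values.
WEIGHTS = {"season_avg": 6.0, "last20_avg": 2.0, "pa_per_game": 0.3}
BIAS = -2.5


def hit_probability(f: Features) -> float:
    """Logistic score used to rank the players that pass the tree."""
    z = BIAS + sum(w * f[name] for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(players: Dict[str, Features], top_n: int = 5) -> List[str]:
    """Keep players the tree labels as hits, then sort them by logistic score."""
    shortlisted = [name for name, f in players.items() if tree_predicts_hit(f)]
    shortlisted.sort(key=lambda name: hit_probability(players[name]), reverse=True)
    return shortlisted[:top_n]
```

The design choice is that the tree acts as a filter and the regression only has to order players that already passed it.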
Training and Testing
To train our models, we gathered our data from several sources, primarily Retrosheet, baseball-reference.com and ESPN. We used the 2009 to 2012 seasons for training and the 2013 season for testing. Some of the stats were simple averages, such as batting average and WHIP. Others were calculated from game data, such as batting hot streaks and hitter plate appearances per game. We then used Excel to merge the various CSV files into one file with all our data. Finally, we used a Python script to convert this into an .arff file readable by Weka.
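A conversion from CSV to ARFF might look like the sketch below. It assumes every column is numeric except one nominal class column; the column names and the function `csv_to_arff` are hypothetical, and this is not the script we actually used.

```python
import csv
import io


def csv_to_arff(csv_text: str, relation: str, class_column: str) -> str:
    """Convert CSV text with a header row into ARFF text.

    Assumes all columns are numeric except `class_column`, which is
    written as a nominal attribute listing the values seen in the data.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV contains no data rows")
    columns = list(rows[0].keys())
    class_values = sorted({row[class_column] for row in rows})

    lines = ["@relation " + relation, ""]
    for col in columns:
        if col == class_column:
            lines.append("@attribute " + col + " {" + ",".join(class_values) + "}")
        else:
            lines.append("@attribute " + col + " numeric")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row[col] for col in columns))
    return "\n".join(lines) + "\n"
```

For example, a file with the header `season_avg,temp,got_hit` would produce two numeric attributes and one nominal `got_hit` attribute, followed by the data rows unchanged.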
To test our final algorithm, we created a Python script that acted as if it were playing the game. For every day, it takes all possible players and runs them through our algorithm, then checks whether its prediction was correct. It keeps track of its current streak, the longest streak it reached over the season, and our prediction percentage. If we always chose the top-ranked player, the test would be deterministic: there would be only one possible run, and it always produces a 10-game streak. To get a better idea of how well we were choosing the best players, we instead chose randomly from the top 5 players.
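The simulation loop described above can be sketched roughly as follows. The input layout, one list of `(player, predicted_score, got_hit)` tuples per day, and the function name are assumptions for this example rather than the structure of our real script.

```python
import random
from typing import List, Optional, Tuple

# Assumed layout: (player name, predicted score, whether the player actually got a hit)
Candidate = Tuple[str, float, bool]


def simulate_season(days: List[List[Candidate]], top_k: int = 5,
                    seed: Optional[int] = None) -> Tuple[int, float]:
    """Play one simulated season; return (longest streak, fraction of correct picks)."""
    rng = random.Random(seed)
    current = longest = correct = picks = 0
    for candidates in days:
        if not candidates:
            continue  # no games that day
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        # Pick randomly among the top_k ranked players instead of always the best.
        _, _, got_hit = rng.choice(ranked[:top_k])
        picks += 1
        if got_hit:
            correct += 1
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a miss resets the streak
    accuracy = correct / picks if picks else 0.0
    return longest, accuracy
```

Running this many times with different seeds and averaging the first return value gives an average max streak, while `top_k=1` reproduces the deterministic single-run case.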
Results
Baseball is very unpredictable, and our results showed that. For evaluation metrics, we used the recall rate of the algorithm and the average max streak achieved during a season. We found that our combination of algorithms performed better than either algorithm independently. While the 1.6% improvement in recall rate over the J48 tree alone may seem insignificant, it doubles the chance of winning, although the chances are still very low. While we are disappointed that we are not close to winning, the game has been played by millions of people over many years and no one has reached a streak of 50 yet.
Metric | J48 Tree | Logistic Regression | Combination |
---|---|---|---|
Recall Rate | 70.8% | 67.3% | 72.4% |
Average Max Streak | 12.48 | 12.6 | 13.1 |
The most important thing to take from this table is that the combination of techniques is more effective than either technique individually. This is because the combined method uses the cost-sensitive tree to remove many false positives, which makes the ranking from the logistic regression more accurate.
Contact Information
Peter Haddad: peterhaddad2016@u.northwestern.edu
Jake Kobza: jacobkobza2016@u.northwestern.edu
Bruno Peynetti: bpeynetti@u.northwestern.edu