Beat the streak site by peterhad313

The Problem

Our task is to create a machine learning model that can predict which Major League Baseball player is most likely to get a hit in a day. We aim to use statistics from recent games as well as previous seasons to accomplish this. The output will be the likelihood of a hit for each batter on a given day against a given pitcher. We chose this task because there is a fantasy game on MLB.com where if you can predict a hit for 57 days in a row, you will win $5.6 million. Our goal is to make a system where we can input the current day's games and matchups and we can predict which player is most likely to get a hit.

Our Solution

When we started, we had many different ideas for possible features. To narrow it down, we tested most of the ideas. After many iterations, we narrowed it down to the following list:

Hitter batting average for the current season
Hitter batting average in last 20 plate appearances (indicator of hot/cold streak)
Hitter batting average in last 50 plate appearances (indicator of longer-run hot/cold streak)
Hitter plate appearances per game (full season)
Home/away
Ballpark factor (an adjustment for the hitter-friendliness of the ballpark)
Opposing pitcher season ERA for the current season
Gametime temperature (research says that warmer weather is better for hitters)
Games with at least one hit ratio (calculated as the proportion of games in which a batter got at least one hit in all four years in the training set)

When choosing a learner, we were limited by what type of output we needed. In order to choose the hitters most likely to succeed, we needed to output a percentage probability of success. The main algorithm that outputs this is a logistical regression. When we tested this, we got a recall rate of 67%. This was not good enough so we looked for other options. One of the most interesting ideas for this was a CostSensitiveClassifier. With this, we can create a classifier that is focused on avoiding false positives. By switching to a cost sensitive approach, we were able to increase the recall rate of a J48 tree from 59% to 71%.

Figure 1: Full Algorithm

While this was better, it still was not good enough. To improve this further, we decided to try and combine these two approaches. We created a python script that took a J48 tree from Weka and converted it into a python set of nested if statements. We used this to find the best several players. To choose the best out of this group, we ran them through a Weka Bayesian logistic regression. We used the top players from this for our final predictions

Training and Testing

To train our models, we gathered our data from several sources, primarily Retrosheet, baseball-refrence.com and ESPN. We used the 2009 to 2012 season for training and the 2013 season for testing. Some of the stats were just averages, such as batting average and WHIP. Others were calculated from game data, such as batting hot streaks, hitter plate appearances per game and plate appearances per game. We then used excel to merge the various csv files into one file with all our data. Finally, we used a python script to convert this into an .arff file readable by Weka.

To test our final algorithm, we created a python script that acted as if it was playing the game. For every day, it take all possible players and runs it through our algorithm. It would then check if its prediction was correct. It keeps track of its current streak, as well as the longest streak it got over the season, as well as our prediction percentage. If we always chose the best player, we would only have one possible test set which will always have a 10 game streak. To get a better idea of how well we were choosing the best players, we chose randomly from the top 5 players.

Results

Baseball is a very unpredictable and our results showed that. For evaluation metrics, we decided to use the Recall Rate of the algorithm and the average max streak achieved during a season. What we found was our combination of algorithms performed better than either algorithm independently. While 1.6% may seem insignificant, it doubles the chance of winning, although the chances are still very low. While we are disappointed that we are not close to winning, the game has been played for millions of people over many years and no one has reached a streak of 50 yet.

	J48 Tree	Logistical Regression	Combination
Recall Rate	70.8%	67.3%	72.4%
Average Max Streak	12.48	12.6	13.1

The most important thing to take from this table is that the combination of techniques is more effective than either individually. This is because this method is able to combine cost sensitive tree to remove a lot of false positives, which makes the ranking from the logistical regression more accurate.

Contact Information

Peter Haddad: peterhaddad2016@u.northwestern.edu

Jake Kobza: jacobkobza2016@u.northwestern.edu

Bruno Peynetti: bpeynetti@u.northwestern.edu

Beat the Streak

A Northwestern Univeristy Machine Learning Project

The Problem

Our Solution

Training and Testing

Results

Contact Information