Predicting the Winning Team with Machine Learning

November 17, 2019 · By Stanley Isaacs


Hello world, it's Siraj, and our task today is to try to predict whether a team is going to win a game or not. This is for football, or as Americans call it, soccer, which is the most popular sport globally, and of all the domestic leagues out there, the English Premier League is the most popular. So we're going to predict the outcome for an English Premier League team using a dataset of past games, and this dataset, which I'll show you right now, has a bunch of different statistics. This is what the dataset looks like: you've got a home team and an away team, which could be Arsenal, Chelsea, Brighton, Manchester City, and then a bunch of statistics. These are all acronyms, but I have definitions for all of them that we can look at right over here: the full-time home team goals, the away team goals, the shots, the shots on target, the corners, the number of yellow cards, the number of red cards. There are a lot of different statistics here, because there are so many things that go into what makes a team win or lose. So we're going to take all of these features and use them to try to predict the target, or label, which in our case is going to be FTR, the full-time result.
The FTR is right here, and it can be H, A, or D: the home team wins (H), the away team wins (A), or it's a draw (D). So this is a multi-class classification problem, not a binary classification problem; it's not just whether the home team wins or loses, it's multi-class because there are three possible labels: home win, away win, or draw. That's what we're going to try to predict, given all of the features in the dataset. Before I show you the steps, let me demo this quickly: I can take the first row of X_test, which contains all the features with the label removed, and the model predicts "home". So given all of those other features, it's able to predict whether a team is going to win, lose, or tie the game. Okay, back to the plan. We're going to try to predict the winning football team, and it's a four-step process. First we clean our dataset and make sure we only use the features we need. What do I mean by that? When it comes to predicting which team is going to win, there's an entire industry around this: pregame and postgame analyses by commentators, entire channels like ESPN dedicated to predicting who's going to win a match, and even during the game, at halftime, commentators predicting who's going to win the full match. This has been going on forever; since the gladiator days of Rome, people have been trying to predict who's going to win a contest. But we're going to do something people don't do often, and that is use statistical analysis, otherwise known as machine learning, mathematical optimization, to try to predict who's going to win.
If you think about it, this is one of the most perfect machine learning problems out there. Think of all the possible features, and those features don't necessarily have to do with the game itself: the sentiment of the crowd or of news articles, how people are talking about a team, which hashtags related to the team are trending on Twitter, whether the team is home or away, what the weather is like that day, what the forecasts predict. There are so many different data points from across the web that could potentially tell us whether a team is going to win or lose. But since I've never talked about this topic before, I'm going to start from a very basic level, and based on your feedback and how you feel about this topic, I can go deeper and do more advanced things later. Okay, so we're going to clean our dataset, then split it into a training and a testing set, and we're going to use scikit-learn to do that. I have yet to find a better library than scikit-learn for splitting training and testing data; it's still the best out there, and even if I'm using TensorFlow or PyTorch to build my model, I'll still use scikit-learn to split my training and testing data.
It's just a one-liner, super simple. Once we've split it, we're going to train three different classifiers; remember, this is a classification problem, a multi-class classification problem. We're going to use logistic regression and a support vector machine, both of which I've covered in my Math of Intelligence series (links in the description), though I'll also recap them a little in this video, and the third is a model I haven't talked about before called XGBoost; you can think of it as a technique or a model, same thing. We're going to train all three of them on the dataset, pick the classifier with the best result, and that is the classifier we'll use to predict the winning team. We're also going to optimize its hyperparameters using grid search. So we're comparing several machine learning methods, which scikit-learn makes very easy to do; once we pick the right one, we'll optimize that model and then use the optimized model to predict the winning team. As for the history of this: like I said, it's been going on for a long time, and sports betting has been increasing in popularity for many years. Over the past five years it has been growing at double-digit rates, and there are a few reasons for this. Number one is the accessibility of the internet: more people have internet access, and betting on the internet is easier than in person. Another reason is that machine learning is becoming democratized, so everybody is able to build these predictive models to try to predict scores. This is definitely a field that's increasing in popularity, and it's not something happening on the fringe of society.
This is a very mainstream task. Kaggle, the data science community, hosts a yearly competition called March Machine Learning Mania to predict the scores of the NCAA basketball tournament, and there's an entire community around it, with people trying out different models and discussing them, so definitely check that link out as well. I also found several papers on this topic, so it's not just something people do to make money; legitimate researchers at academic institutions look into it too. Quoting one paper verbatim: "It is possible to predict the winner of English county Twenty20 cricket games in almost two-thirds of instances." And from another paper: "Something that becomes clear from the results is that Twitter contains enough information to be useful for predicting outcomes in the Premier League." They used Twitter sentiment alone to try to predict who's going to win. So there are a lot of different angles we could take here: we could use sentiment analysis, we could use past score history, we could use a whole bunch of different things. We're going to use score history, but you could also try to simulate the game and see what the simulation predicts; there are a lot of possibilities here. And check this out: in 2014, Bing, which is owned by Microsoft, correctly predicted the outcomes of all 15 games in the knockout round of the 2014 World Cup. Every single game, 15 of them, 100 percent accuracy. So you can be sure that Bing's model is really good; however, they're not going to share it with us. It's like financial analysts at JPMorgan Chase: if they know how to predict stock prices,
they're not going to tell us. Why would they share their profits with us? So what we've got to do is figure it out for ourselves and try to reverse-engineer the techniques so that we can benefit from them. Okay, that was a little primer on the background, so back to the dataset. The dataset I used is from football-data.co.uk; you can find it at /data.php. What I did was select the England football results, and luckily for us they have datasets for every season going back about two decades, so it's perfect. If you want one, you just click on Premier League and boom, it downloads, just like that. I showed you the dataset already, and one thing we notice right off the bat, if we graph the results (I've done this beforehand, and it's in Markdown right here), is that the home team has the majority share of the graph. That means that without doing any machine learning, we already know that the home team has an advantage: probabilistically speaking, if you're the home team, you're more likely to win, from that fact alone. We can reason about this a couple of ways. Football is a team sport and a cheering crowd helps you; you didn't have to travel, so you're less fatigued; you're familiar with the pitch and the weather conditions. And you had a hot dog from the stand and it tasted really good. Just kidding, stadium food is never good.
You know what I'm saying. I've got two great repositories for us; I'm about to start the code here, but I've got one with another great EPL-prediction IPython notebook, or Jupyter notebook, and one for that Kaggle NCAA prediction competition I just talked about. Definitely check them both out; this guy Adeshpande has really great tutorials and software on his GitHub, so check out all of his repositories, because he has some really great example code. What we're going to do is code out a good part of this from the start, and then go over the rest. Okay, first things first: our dependencies. We're going to import pandas for data preprocessing, because it's the most popular data-processing library, and we're also going to import XGBoost, which is one of the machine learning models we want to use; it forms a prediction model based on an ensemble of decision trees, which I've talked about as well. Next we're going to import logistic regression, model two of three. There are three different models we're going to train on our dataset: one is XGBoost, and another is logistic regression, which is used whenever the response variable is categorical, meaning some kind of non-continuous, discrete value (yes or no, black, white, red, green, things like that), which is perfect for us: win, lose, or draw. So we have logistic regression, and then we have one more, which is the support vector machine.
I'll talk about that one as well. Finally, we're going to import the display function, because we're going to display our results. That's it for our dependencies, and now we can read our dataset. Pandas lets us read from the CSV file that we downloaded, which I've called final_dataset.csv, and once we have it, we'll preview the data. So I say: display the data I've just pulled into memory as a pandas DataFrame object, and look at its head, that is, the first few rows of the dataset. Once I have that, I can print it, and now we can see what this dataset looks like. Notice there's a whole bunch of acronyms here; lots of datasets have acronyms like this, and that can be confusing, but like I said, I've got a legend of what each acronym means: the home team goal difference, the difference in points, the difference in last year's standings, the number of wins over the past three games for the home team, the number of wins over the past three games for the away team. I've aggregated this data and made it into something a little more consumable. And remember, we still have one single target we're trying to predict, and that is FTR, the full-time result of the game: did the home team win, did the away team win, or was it a draw? That's the target we're trying to predict. So before we get into building the model, let's first explore the dataset. First of all, let's think about the win rate for the home team. How often does the home team win, all else aside? This is what we just talked about. How do we do it programmatically?
We get the total number of matches, which is the first entry of the DataFrame's shape, and then calculate the number of features from the second entry, subtracting one because one column, FTR, is the label, not a feature. Then we calculate the number of matches won by the home team, which is the length of the data filtered to rows where FTR equals "H". Finally we calculate the win rate for the home team, and once we have that, we can print the result, which tells us exactly how often the home team has won as a percentage of all matches. So I run that print statement and see the result. This matches the graph I showed at the beginning: about 46 percent of matches are won by the team playing at home. Right off the bat, that's something to know while we're exploring the data and thinking about which features matter most, which is feature selection.
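The loading and win-rate steps just described can be sketched as follows. This is a minimal sketch: the file name final_dataset.csv comes from the video, and a tiny inline CSV with made-up results stands in for it here so the snippet runs on its own.

```python
import pandas as pd
from io import StringIO

# Tiny inline stand-in for final_dataset.csv; in the real notebook you
# would call pd.read_csv("final_dataset.csv") on the downloaded file.
csv = StringIO(
    "HomeTeam,AwayTeam,FTHG,FTAG,FTR\n"
    "Arsenal,Chelsea,2,1,H\n"
    "Brighton,Man City,0,3,A\n"
    "Everton,Leeds,1,1,D\n"
    "Chelsea,Arsenal,2,0,H\n"
)
data = pd.read_csv(csv)
print(data.head())  # preview the first few rows

n_matches = data.shape[0]                   # total number of matches
n_features = data.shape[1] - 1              # minus 1: FTR is the label
n_homewins = len(data[data["FTR"] == "H"])  # matches won by the home team
win_rate = 100 * n_homewins / n_matches     # home win rate, in percent

print(f"{n_homewins} of {n_matches} matches won at home ({win_rate:.1f}%)")
```

On the real ~20 seasons of data this comes out to roughly 46 percent, as noted above.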
That's the process we're going through now. Remember, with deep learning we don't really have to think about what the ideal features are, because deep learning learns those features for us, but that's a next step; we're going to build some more basic models first, and then, based on your feedback about this topic, I might do a deep learning video on sports analytics later. Right now we're just going to build these three simple models, and thinking about feature selection is a really important skill to have as a data scientist. With deep learning you don't have to do it, but you do need a lot of GPUs and, crucially, a lot of data. In this case we don't have that much data: the dataset downloaded in about two seconds, and it's only about 500 data points. We'd want a huge amount of data, at least a hundred thousand points, and if we had that, then deep learning would be worth using, for example if we were trying to aggregate a bunch of different signals: sentiment from Twitter, past team scores, different talking points from other people. But in this case we want to visualize the distribution of the data. Pandas has a great tool for this called the scatter matrix, which basically shows how much one variable affects another. So we're going to build a scatter matrix for a subset of our features, to see just visually what the correlation between these different features is.
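A sketch of the scatter-matrix step. The column names here (HTGD, ATGD, HTP, ATP) follow the acronyms discussed above, but the values are synthetic random stand-ins so the snippet is self-contained; the Agg backend just lets it run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no window needed
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-ins for a few of the engineered features:
# goal difference and cumulative points for the home/away sides.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "HTGD": rng.normal(0, 1, 100),  # home team goal difference
    "ATGD": rng.normal(0, 1, 100),  # away team goal difference
    "HTP":  rng.normal(0, 1, 100),  # home team points
    "ATP":  rng.normal(0, 1, 100),  # away team points
})

# Each off-diagonal panel is a pairwise scatter plot: an upward trend
# hints at positive correlation, a downward trend at negative correlation.
axes = scatter_matrix(df, figsize=(8, 8))
print(axes.shape)  # a 4x4 grid, one panel per feature pair
```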
This will help us pick the relevant features we want to use. We have the home team goal difference, the away team goal difference, the home team points, the away team points, the difference in points, and the difference in last year's standings. Once we visualize this, some pairs have a positive correlation (the trend goes up) and some have a negative correlation; for instance, if the goals increase for the home team, maybe the points decrease for the away team. Looking at positive versus negative correlations is an indicator of how features relate to each other. This doesn't bear directly on what we're about to do, but it's good practice to think about ways of visualizing our data, seeing the relationships between different features, and trying to work out which features are best for our model. Okay, once we've explored our data, we're going to prepare it. Remember, we have one single target variable, one objective, or label as we like to call it, and that is FTR, the full-time result. What we want to do is, given all of the other features, try to predict the FTR. And make us some money? No, I'm just kidding. I mean, yes, you probably do want to make some money. We're trying to predict the full-time result, so we're going to split the data into the FTR and everything else. Then we'll standardize it, which means it will all be on the same scale; we want all of our data in numeric format,
and we want it all on the same scale, so it's not the case that one feature is in the hundreds of thousands while another is between 1 and 10. If they're going to be small values, we want them all to be small values, and doing this improves the predictive capability of our model. Once we've standardized the data, we're going to add three features for each side: the results of the last three games for both teams, HM1, HM2, HM3 and AM1, AM2, AM3, which we looked at before. Now, if we look back at the dataset, some of the data is categorical; for example, we have the referee and we have HTR, and we don't want any of that as raw text. We want all of our features to be numbers. So we're going to preprocess those features: create a new DataFrame, find the feature columns that are categorical by checking whether the column's dtype equals object instead of a numeric type, and convert each of them into numeric indicator columns. That way we remove all the categorical features; the only categorical variable left is our label, the FTR. We don't want our features to be categorical. Once we've done that, we've preprocessed our data: we've explored it, we've added the features we thought were most relevant, and we can see them all here, no more categorical features, they're all numbers. Now we can split our data into a training and a testing set with a very easy one-liner from scikit-learn, the train_test_split function.
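Before the split, the preprocessing just described, standardizing the numeric columns and turning categorical columns into numbers, might look like this minimal sketch. The column names are illustrative stand-ins for the real ones.

```python
import pandas as pd

# Toy frame mixing a numeric feature with a categorical one (dtype "object").
df = pd.DataFrame({
    "HTGD": [3.0, -1.0, 0.0, 2.0],  # numeric: home team goal difference
    "HomeTeam": ["Arsenal", "Chelsea", "Arsenal", "Brighton"],  # categorical
})

# Standardize numeric columns so every feature lives on a comparable scale.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

# Replace each object-dtype column with one-hot indicator columns,
# so the model sees only numbers.
df = pd.get_dummies(df)

print(df.columns.tolist())
```

After this, every feature column is numeric and zero-centered where it started out continuous.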
It's going to split that DataFrame into a training and a testing set; it already knows which column is the label, and it puts all of the labels, the FTR values for each of the associated inputs, into a one-dimensional array. We have 12 features for a single input. For the next step, we're going to actually build the models. I'll come back to the helper functions that train the model, but let's write the models out right now. We know the first model we want to try is logistic regression. I'll give it a random state as a seed; this could be any number, I'll just say 42, it doesn't matter, and we could try different seeds to see how the results vary, but I'm just going to put a magic number down to get a result out. The next classifier we build is a support vector machine. The order in which I initialize the classifiers doesn't matter; what matters is that these are the three models we're using. And my third classifier is XGBoost. I'll talk about what each of these is in a second, but let me write them out: we have an XGBoost classifier, C, with its seed set to 82. Then we train all three, A, B, and C, and print the results, and we'll see that the XGBoost model clearly did the best.
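The split-and-train steps above can be sketched like this. Caveats: the data here is synthetic random noise shaped like the real feature matrix (500 rows, 12 features, three classes), and scikit-learn's GradientBoostingClassifier stands in for xgboost.XGBClassifier in case xgboost isn't installed; both are gradient-boosted tree ensembles, so the shape of the code is the same.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# Stand-in for xgboost.XGBClassifier (same model family); swap it in
# with `from xgboost import XGBClassifier` if xgboost is available.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in: 500 matches, 12 features, labels H / A / D.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
y = rng.choice(["H", "A", "D"], size=500, p=[0.46, 0.30, 0.24])

# The scikit-learn one-liner that splits features and labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=2, stratify=y)

classifiers = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=500),
    "SVM": SVC(random_state=912, kernel="rbf"),
    "Gradient Boosting": GradientBoostingClassifier(random_state=82),
}

# Train each classifier and compare accuracy and weighted F1 on the test set.
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(f"{name}: acc={accuracy_score(y_test, pred):.2f} "
          f"F1={f1_score(y_test, pred, average='weighted'):.2f}")
```

On random noise the scores hover around chance; on the real engineered features the video reports the boosted-tree model coming out on top.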
So we already know that XGBoost is the best model for this data, and notice that it had an accuracy and an F1 score of about 74 percent on the testing set. That's really good; it's way better than just guessing which team is going to win. So let's go back and look at what these models actually are. I have a video on logistic regression and a video on support vector machines; just search either term on YouTube along with the word Siraj and they'll be the first links that show up. Logistic regression is used to predict the probability of an event occurring by fitting data to a logistic curve. A logistic curve is defined by the equation right here for the probability. Say you have two classes in a binary classification problem, whether someone is dead or alive: the x-axis would be the concentration of a toxin (you're trying to predict whether someone will live or die from that toxin), and the y-axis is the probability of each class. We use the logistic curve, denoted by this equation: you plug in the x value and it outputs a probability. In the multi-class case, as in our problem, we use multinomial logistic regression, which the library handles for us, but that's what logistic regression does. It's used extensively across a wide range of fields, and it's a very popular model. That's our first, and these are all classification models, by the way. Remember, once we frame our problem, we can pick the model we want to use.
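The logistic curve referenced above is the sigmoid function; here is a small sketch of it, just to make the "plug in x, get out a probability" idea concrete:

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function: maps any real x to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# As x (say, a suitably shifted and scaled toxin concentration) grows,
# the predicted probability of the positive class climbs toward 1.
print(sigmoid(-4))  # close to 0
print(sigmoid(0))   # exactly 0.5
print(sigmoid(4))   # close to 1
```

Multinomial logistic regression generalizes this to three or more classes (H, A, D) via the softmax function, which is what the library does for us under the hood.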
We know this is a classification problem, therefore we want models well suited for classification, and the next question is which of those models is best for our data. We don't always know that right off the bat; even very experienced data scientists don't, so they try out several models to see which works best. Now for support vector machines. Say we have two classes and we plot them in two-dimensional space, like in this image. A support vector machine finds the points from each class that are closest to each other, the support vectors, and builds a hyperplane right in the middle between them, positioned so that the margin, the distance from that boundary to the nearest points of each class, is as large as possible. The reason it does that is so that when we're given a new data point, whichever side of the line it falls on is the class we assign it; that's how we classify it. In the simple two-class case it's just a line, and the data falls on one of two sides; with all three classes, it draws boundaries (possibly curved) that segment the space into three regions, but the idea is the same: find the closest points and the boundary that best separates them. So that's support vector machines, and the last one is XGBoost. We've talked about random forests before, and it's very similar to a random forest. The XGBoost algorithm is one of the most popular algorithms among Kaggle winners; there's a lot of XGBoost happening there. The basis is the classification and regression tree, the decision tree used for both classification and regression, which is a good model.
It's not a great model, but it's a good, very simple model: you give it a bunch of features, and (there's a variety of ways of doing this) it builds a tree where each branch or level equates to one question. Say you're trying to predict whether it's going to rain: is the sky cloudy? Yes. Did it rain yesterday? Yes. Okay, then there's a 75 percent chance it's going to rain. That's a decision tree. What XGBoost does is gradient boosting: it creates a bunch of weak learners, decision trees whose individual predictive capability isn't that good, and combines the results of all of them. It's an ensemble method: it takes all of those trees and produces a result using the combined predictive capability of the whole set. That's XGBoost. And we have this tree right here, with inputs like age, gender, and occupation, a bunch of different features, and we're trying to answer the question: does this person like computer games? Different trees specialize in answering different parts of the question, like does this person use the computer daily, what's their age,
what's their gender, and then we combine the results from all of them, so the prediction for this kid is the result of the two trees' scores combined. So that's what each of the models is. Now, back up to those helper methods I said I'd talk about. By the way, the F1 score is just a measure of a model's accuracy; it's a very standard metric. Back to the three helper methods: we gave the train_predict function our classifier as well as the training and testing datasets. If we go back up, we'll see that in the train_predict method we indicated the classifier to use, trained it on the training data, and then predicted labels for both the training and testing sets. The train_predict method used the train_classifier method, which starts a clock, fits the model, and prints the elapsed time, and the predict_labels method, which starts a clock, makes a prediction, and stops the clock. That's all they do: fit the classifier and predict the labels. Once we did that, we found that XGBoost gave us the best result, about 74 percent accuracy, so XGBoost is the model we want to use. But that's not enough; now that we know XGBoost is the best model, we can optimize it.
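The helper methods described above might look like the sketch below. The function names follow the ones mentioned in the video (train_classifier, predict_labels, train_predict), but the bodies are my reconstruction, and the weighted-F1 choice is an assumption for the three-class labels.

```python
from time import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def train_classifier(clf, X_train, y_train):
    """Fit the model and report how long training took."""
    start = time()
    clf.fit(X_train, y_train)
    print(f"Trained model in {time() - start:.4f} seconds")

def predict_labels(clf, X, y):
    """Time the prediction step and return the weighted F1 score."""
    start = time()
    y_pred = clf.predict(X)
    print(f"Made predictions in {time() - start:.4f} seconds")
    return f1_score(y, y_pred, average="weighted")

def train_predict(clf, X_train, y_train, X_test, y_test):
    """Train, then report F1 on both the training and the test set."""
    train_classifier(clf, X_train, y_train)
    print("F1 (train):", predict_labels(clf, X_train, y_train))
    print("F1 (test): ", predict_labels(clf, X_test, y_test))

# Quick smoke test on synthetic 12-feature data with H / A / D labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 12))
y = rng.choice(["H", "A", "D"], size=200)
train_predict(LogisticRegression(max_iter=500), X[:150], y[:150], X[150:], y[150:])
```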
There are different ways we could optimize this model, but in our case we're going to optimize its hyperparameters; this is hyperparameter optimization. There are a bunch of different hyperparameters that go into XGBoost. We were shielded from them because we used the scikit-learn-style interface, but we can use scikit-learn, fittingly enough, to optimize hyperparameters we don't normally even see. So down here we import grid search, which is basically brute force: grid search tries all possible combinations of all the hyperparameters. We create an initial set of hyperparameters, initialize the XGBoost classifier, make an F1 scoring function, and then perform grid search on that classifier with the scoring function, given the parameters we just defined. It finds the best parameters for the model, and notice that afterwards the F1 score and the accuracy score both increased, which means our model is now much better tuned. Anyway, disclaimer: you could make money using this, and you could lose money using this; who knows. This is an educated guess, a statistical guess based on past data, but we can definitely improve this model. We could bring in more data and more relevant features, we could bring in sentiment analysis, we could add other features. But there's one more thing I want to say. The model can predict whether the home team will win, but only given a data point from this CSV file, and we don't always know all of those feature values ahead of time: how many fouls there will be, for instance.
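Circling back to the grid-search step above, here is a minimal sketch of it. The parameter grid is hypothetical (the video doesn't list its exact values), the data is synthetic, and scikit-learn's GradientBoostingClassifier again stands in for xgboost.XGBClassifier so the snippet runs without xgboost installed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Hypothetical grid; the real search tunes boosted-tree knobs like these.
param_grid = {
    "learning_rate": [0.1, 0.3],
    "max_depth": [2, 3],
    "n_estimators": [20, 40],
}

# Score each candidate with weighted F1, as in the rest of the pipeline.
f1_scorer = make_scorer(f1_score, average="weighted")

# Synthetic stand-in data: 120 matches, 6 features, labels H / A / D.
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 6))
y = rng.choice(["H", "A", "D"], size=120)

# Brute-force every combination (2*2*2 = 8 candidates, 3-fold CV each)
# and keep the parameter set with the best cross-validated F1.
grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                    param_grid, scoring=f1_scorer, cv=3)
grid.fit(X, y)
print("Best parameters:", grid.best_params_)
```

After the search, grid.best_estimator_ is the refit, tuned model you would use for final predictions.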
How are we supposed to know whether there will be x number of offsides? We don't necessarily know these features beforehand, so the trick is to pick features that are completely predictable. What do I mean by that? Features that help predict who's going to win, but that are themselves knowable in advance: how many players will be on the pitch (eleven per side, you know that for a fact), who the players are going to be, what the lineup is, when the game is happening, where the pitch is. Things you know for sure. If all of your features are known in advance, then your result won't require you to guess what those features are, which is the case in this very basic example. Another way to improve the model is to use way, way more high-quality data. We could also predict each of these features themselves: we could try to predict how many goals will occur in a game, predict all of these quantities, assign probabilistic values to all of the features, and then use those to predict whether the home team wins. So there's a lot of machine learning that could be happening here, but ideally we know for sure what all the features will be, and we can use them to predict the winning team. If you're interested in this topic, definitely check out all the links in the description, and let me know what you thought of it in the comments. Please subscribe for more programming videos, and for now, I've got to go play some soccer. I mean football. Thanks for watching.