Description:
A data science project that predicts a team's first-innings cricket score from a few features.
Problem Statement:
The Indian Premier League (IPL) is a Twenty20 cricket league in India. It is usually played in April and May every year. The league was founded by the Board of Control for Cricket in India (BCCI) and played its first season in 2008.
Technical Details:
The following diagram shows the steps that we followed:
The above picture shows how poorly run rate alone predicts the final score in a limited-overs cricket match. In ODI and T20 cricket, many factors play a key role in deciding the final score. Some of the key factors:
- Number of wickets left
- Number of balls left
- How many runs have the current batsmen scored?
- How many runs has the team scored in the last 5 overs?
- How many wickets has the team lost in the last 5 overs?
- The nature of the pitch
- How strong are the batting and bowling teams?
We will use some of these factors to predict the score using machine learning algorithms. Specifically, we use regression analysis to predict the final score of a T20 match.
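To make the limitation of a run-rate-only estimate concrete, here is a minimal sketch of that naive projection. The function name `project_by_run_rate` and the two match states are illustrative assumptions, not part of the original project.

```python
def project_by_run_rate(runs, balls_bowled, total_balls=120):
    """Extrapolate the final score assuming the current run rate never changes."""
    if balls_bowled <= 0:
        raise ValueError("at least one ball must have been bowled")
    runs_per_ball = runs / balls_bowled
    return runs_per_ball * total_balls


# Two made-up match states with the same run rate after 10 overs (60 balls).
# The projection is identical even though the wickets in hand differ widely,
# which is the kind of information a regression model can take into account.
healthy = {"runs": 80, "wickets": 1, "balls_bowled": 60}
collapsing = {"runs": 80, "wickets": 7, "balls_bowled": 60}

for state in (healthy, collapsing):
    projected = project_by_run_rate(state["runs"], state["balls_bowled"])
    print(state["wickets"], "wickets down, projected total:", round(projected))
```

Both states project to the same total, so the naive estimate cannot tell a stable innings from a collapsing one.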
1. Data collection:
The details below, covering the factors that influence the first-innings score, were collected and documented. The dataset includes both categorical and numerical data.
The data was collected from Kaggle: IPL Complete Dataset (2008-2020).
The dataset consists of the following columns (features):
- mid: Each match is given a unique number
- date: When the match happened
- venue: Stadium where match is being played
- bat_team: Batting team name
- bowl_team: Bowling team name
- batsman: Batsman name who faced that ball
- bowler: Bowler who bowled that ball
- runs: Total runs scored by team at that instance
- wickets: Total wickets fallen at that instance
- overs: Total overs bowled at that instance
- runs_last_5: Total runs scored in last 5 overs
- wickets_last_5: Total wickets that fell in last 5 overs
- striker: max(runs scored by striker, runs scored by non-striker)
- non-striker: min(runs scored by striker, runs scored by non-striker)
- total: Total runs scored by batting team after first innings.
2. Exploratory Data Analysis:
Go through all the features and try to understand the story behind each one. See how the features are inter-related. If the raw data is hard to interpret, use plots, which give a much clearer understanding of it.
Data is easier to understand when it is visualized: a picture conveys a lot of information at a glance.
From the barplot, we can easily infer the year in which a particular team recorded its maximum number of wins (and what that number was).
Barplot-
A bar plot or bar chart is a graph that represents categories of data with rectangular bars whose lengths or heights are proportional to the values they represent.
Joint plot-
A Jointplot comprises three plots. Out of the three, one plot displays a bivariate graph which shows how the dependent variable(Y) varies with the independent variable(X). Another plot is placed horizontally at the top of the bivariate graph and it shows the distribution of the independent variable(X). The third plot is placed on the right margin of the bivariate graph with the orientation set to vertical and it shows the distribution of the dependent variable(Y).
Box Plot-
- It shows too many outliers, so we can use winsorization or trimming.
Inferences from the above charts:
- The .xlsx file has data on IPL matches from the 2008 season to the 2017 season.
- We found 425 rows that are outliers.
- We will use trimming, as the dataset is very large.
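The text does not say how the outliers were detected, so the sketch below assumes the common 1.5 × IQR rule and contrasts the two options mentioned above: trimming drops the outlying rows, while winsorization caps them at the boundary. The helper names and the sample totals are made up for illustration.

```python
import numpy as np


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits using the k * IQR rule (an assumed rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trim(values, k=1.5):
    """Trimming: drop every value outside the IQR limits."""
    low, high = iqr_bounds(values, k)
    return values[(values >= low) & (values <= high)]


def winsorize(values, k=1.5):
    """Winsorization: keep every value but cap it at the IQR limits."""
    low, high = iqr_bounds(values, k)
    return np.clip(values, low, high)


# Made-up innings totals with one extreme value.
totals = np.array([120, 135, 140, 150, 155, 160, 165, 170, 180, 263], dtype=float)

print("trimmed:   ", trim(totals))
print("winsorized:", winsorize(totals))
```

Trimming shrinks the dataset, which is acceptable when the outliers are a small share of the rows; winsorization keeps the row count unchanged.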
After Trimming-
Description of pair plot of all the variables
- Pairplot visualizes given data to find the relationship between them where the variables can be continuous or categorical
Let’s plot the data using a pairplot:
From the picture below, we can observe the variations in each plot. The plots are arranged as a matrix in which each row name gives the y-axis variable and each column name gives the x-axis variable.
Heat map & Scatter Plot-
A heatmap is a way of representing data in a two-dimensional form, here the pairwise correlations between features.
- All the correlation values are below 0.5, so the features are not strongly correlated.
- The scatter plot also shows that the data is not strongly correlated.
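The numbers behind a correlation heatmap can be inspected directly. The sketch below assumes pandas and uses a small made-up table (not the IPL data) to list the feature pairs whose absolute Pearson correlation reaches a threshold; the helper name `strongly_correlated_pairs` is hypothetical.

```python
import pandas as pd


def strongly_correlated_pairs(frame, threshold=0.5):
    """List (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    corr = frame.corr()  # pairwise Pearson correlation matrix
    columns = list(corr.columns)
    pairs = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:  # upper triangle only, skips the diagonal
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs


# Toy values chosen for illustration; they are not taken from the dataset.
toy = pd.DataFrame({
    "runs": [10, 25, 48, 70, 95, 120],
    "overs": [1.0, 3.0, 6.0, 9.0, 13.0, 17.0],
    "wickets_last_5": [1, 0, 2, 0, 2, 1],
})

print(toy.corr().round(2))
print(strongly_correlated_pairs(toy, threshold=0.5))
```

Running the same helper on the real feature columns would show which pairs, if any, cross the 0.5 mark used above.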
3. Feature Engineering:
This is the most important step in any data science project. Sometimes the features are readily available. Use the visualizations to understand which features help predict the target better. If the given features are not sufficient, engineer new features using domain knowledge. This involves a lot of math, statistics, and geometric intuition.
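As an example of an engineered feature, a column like `runs_last_5` could be derived from ball-by-ball data. The sketch below is one possible approach, not necessarily how the dataset's column was built: it assumes `runs` is the cumulative team score after each delivery (as the column description states), treats a window as a fixed number of deliveries (ignoring extras), and uses a hypothetical helper name and made-up rows.

```python
import pandas as pd


def add_runs_last_n_balls(balls, window_balls=30):
    """Add 'runs_last_window': runs scored over the previous `window_balls` deliveries."""
    out = balls.copy()
    # Cumulative score as it stood `window_balls` deliveries earlier in the same
    # match; before that many balls have been bowled, the whole innings counts.
    earlier = out.groupby("mid")["runs"].shift(window_balls).fillna(0)
    out["runs_last_window"] = out["runs"] - earlier
    return out


# Two made-up matches; a 3-ball window keeps the example small.
balls = pd.DataFrame({
    "mid": [1, 1, 1, 1, 1, 1, 2, 2],
    "runs": [1, 5, 5, 6, 12, 12, 4, 4],
})

print(add_runs_last_n_balls(balls, window_balls=3))
```

Grouping by `mid` keeps the window from spilling across match boundaries.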
Code EDA-
4. Model building:
import pandas as pd
dataset = pd.read_csv('data/ipl.csv')
X = dataset.iloc[:, [7, 8, 9, 12, 13]].values  # Input features
y = dataset.iloc[:, 14].values  # Label
I have used the ‘ipl.csv’ data file here to predict scores in T20 cricket.
Features Used:
- runs
- wickets
- overs
- striker
- non-striker
Why didn’t I use other features?
While experimenting, the other features did not make much difference to the results. You can try a different combination of features and test the code on them.
Label Used: Total
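Positional indices are easy to get wrong, so it can help to check that the indices used with `iloc` line up with the feature names listed above. The sketch assumes the columns appear in the order given in the dataset description; the single row of values is made up.

```python
import pandas as pd

# Column order as listed in the dataset description (assumed to match the file).
COLUMNS = [
    "mid", "date", "venue", "bat_team", "bowl_team", "batsman", "bowler",
    "runs", "wickets", "overs", "runs_last_5", "wickets_last_5",
    "striker", "non-striker", "total",
]
FEATURES = ["runs", "wickets", "overs", "striker", "non-striker"]
LABEL = "total"

# One made-up row, only to exercise the selection logic.
row = [1, "2008-04-18", "Stadium A", "Team A", "Team B", "Batsman X",
       "Bowler Y", 61, 1, 5.1, 59, 1, 40, 15, 222]
dataset = pd.DataFrame([row], columns=COLUMNS)

X_by_name = dataset[FEATURES].values
X_by_position = dataset.iloc[:, [7, 8, 9, 12, 13]].values
y = dataset[LABEL].values

print([COLUMNS[i] for i in (7, 8, 9, 12, 13)])  # names behind the indices
print((X_by_name == X_by_position).all())
```

Selecting by name gives the same matrix and keeps working if the column order in the file ever changes.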
Splitting data into training and testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
We will train our model on 75 percent of the dataset and test it on the remaining 25 percent.
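One caveat with a purely random split: each row describes one ball of a match, so balls from the same match can land in both the training and the test set. A possible alternative, which is not what the original project does, is to split by match id (`mid`) with scikit-learn's `GroupShuffleSplit`. The data below is synthetic and only demonstrates the mechanics.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in: 8 matches with 5 rows each and 5 features per row.
n_matches, rows_per_match = 8, 5
match_ids = np.repeat(np.arange(n_matches), rows_per_match)
X = rng.normal(size=(n_matches * rows_per_match, 5))
y = rng.normal(size=n_matches * rows_per_match)

# Hold out 25 percent of the matches (not 25 percent of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=match_ids))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

train_matches = set(match_ids[train_idx])
test_matches = set(match_ids[test_idx])
print("matches in both sets:", train_matches & test_matches)
```

With this split, the test score reflects how the model performs on matches it has never seen.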
Feature Scaling the data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
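The scaler is fitted on the training data only, and the test data is transformed with the training mean and standard deviation. The sketch below, on made-up numbers, checks that `StandardScaler` matches the manual formula (x - mean) / std computed from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features (for example runs and wickets).
X_train = np.array([[50.0, 1.0], [80.0, 3.0], [110.0, 2.0], [140.0, 6.0]])
X_test = np.array([[95.0, 3.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean and std from training data
X_test_scaled = sc.transform(X_test)        # reuses the training statistics

# Same computation by hand; StandardScaler uses the population std (ddof=0).
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_train_scaled.mean(axis=0))
print(np.allclose(X_test_scaled, manual))
```

Fitting the scaler on the test set as well would leak information about the test data into the preprocessing step.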
Training the model
- Using the Linear Regression algorithm
from sklearn.linear_model import LinearRegression
lin = LinearRegression()
lin.fit(X_train, y_train)
- Using the Random Forest Regression algorithm
from sklearn.ensemble import RandomForestRegressor
lin = RandomForestRegressor(n_estimators=100, max_features=None)
lin.fit(X_train, y_train)
You can use either of these algorithms, but as shown later, random forest regression gives a higher R-squared value.
Testing the trained model on the test set
y_pred = lin.predict(X_test)
score = lin.score(X_test, y_test) * 100
print("R-squared value:", score)
R-squared value
R-squared is a statistic that gives some information about the goodness of fit of a model. In regression, the R-squared coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An R-squared value of 1 indicates that the regression predictions perfectly fit the data.
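For reference, R-squared can be computed by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below uses made-up totals; the function name `r_squared` is just illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [150, 160, 170, 180]  # made-up final totals

print(r_squared(actual, actual))                # perfect predictions
print(r_squared(actual, [165, 165, 165, 165]))  # always predicting the mean
print(r_squared(actual, [155, 158, 172, 175]))  # somewhere in between
```

A model that always predicts the mean scores 0, so R-squared measures the improvement over that baseline.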
Linear regression-
Random Forest Regression-
R-squared value (as a percentage):
- Linear regression: 60.4
- Random Forest regression: 71.9
Selected model: Random Forest
Using Lazy classifier techniques, various machine learning algorithms were explored, such as KNN, AdaBoost, Decision Tree, and other ensemble techniques. The model that gave the highest accuracy for this dataset is Random Forest.
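The kind of comparison described above can be sketched as a plain loop over scikit-learn regressors. This does not use the Lazy classifier tooling itself, and it runs on synthetic data with a deliberately non-linear target, so the printed scores say nothing about the IPL results; the model settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends non-linearly on the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(scale=0.2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

candidates = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit every candidate on the same split and record its test-set R-squared.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Using one shared split for every model keeps the comparison fair.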
Cricket score prediction uses the following Python packages and libraries:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestRegressor
from dataprep.eda import plot, plot_correlation, plot_missing, create_report
from sklearn.preprocessing import StandardScaler
6. Deployment using Flask:
The model was deployed as a web application using Flask.
Deployment Architecture
Created app.py file to show prediction of Cricket Score-
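The original app.py is not reproduced here, so the following is only a minimal sketch of what such a file might look like. The `/predict` route, the JSON field names, and the `create_app` factory are all assumptions; the factory takes an already fitted model (and, optionally, the fitted scaler) so that the same preprocessing is applied at prediction time.

```python
from flask import Flask, jsonify, request

# Feature order must match the order used when the model was trained.
FEATURES = ["runs", "wickets", "overs", "striker", "non-striker"]


def create_app(model, scaler=None):
    """Build a Flask app around a fitted model (hypothetical factory)."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        missing = [name for name in FEATURES if name not in payload]
        if missing:
            return jsonify({"error": f"missing features: {missing}"}), 400

        row = [[float(payload[name]) for name in FEATURES]]
        if scaler is not None:
            row = scaler.transform(row)  # same scaling as during training
        predicted_total = float(model.predict(row)[0])
        return jsonify({"predicted_total": predicted_total})

    return app
```

In an actual app.py, the fitted model and scaler would be loaded from disk (for example with pickle) and passed to `create_app(model, scaler)`, and the returned app would then be started with its `run()` method.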