
Part 2 - Numerical Data (40 marks) This question has been created to test your statistical analysis and programming know

Posted: Wed May 04, 2022 11:56 am
by answerhappygod
Part 2 - Numerical Data (40 marks)
This question has been created to test your statistical analysis
and programming knowledge in Python. You are given a csv file
which includes various data entries for each football match in the
English Premier League during the 2020-2021 season. To name a few
of these entries: date, referee name, number of goals, red cards,
etc. The csv data set you are provided contains one row per
football match. The column names are abbreviations and are given
as:
Div = League Division
Date = Match Date (dd/mm/yy)
Time = Time of match kick off
HomeTeam = Home Team
FTHG = Full Time Home Team Goals
FTAG = Full Time Away Team Goals
FTR = Full Time Result (H=Home Win, D=Draw, A=Away Win)
HTHG = Half Time Home Team Goals
HTAG = Half Time Away Team Goals
HTR = Half Time Result (H=Home Win, D=Draw, A=Away Win)
Referee = Match Referee
HS = Home Team Shots
AS = Away Team Shots
HST = Home Team Shots on Target
AST = Away Team Shots on Target
HF = Home Team Fouls Committed
AF = Away Team Fouls Committed
HC = Home Team Corners
AC = Away Team Corners
HY = Home Team Yellow Cards
AY = Away Team Yellow Cards
HR = Home Team Red Cards
AR = Away Team Red Cards
In this exercise, you are asked to (1) perform statistical analysis
of the data, and (2) gain insights from the data.
# suggested imports
import pandas as pd
import numpy as np
import statsmodels.api as sm
import scipy
from urllib import request
import scipy.stats as stats
from statsmodels import graphics
import arviz as az
import pymc3 as pm
from pymc3 import glm
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, RocCurveDisplay, auc, roc_curve
import seaborn as sns
sns.set_style(style="darkgrid", rc={"axes.facecolor": ".9",
"grid.color": ".8"})
sns.set_palette(palette="deep")
sns_c = sns.color_palette(palette="deep")
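As a starting point, loading the data might look like the sketch below. It is only an illustration: the two rows are made-up placeholders in the column layout described above (not real match data), the team names are hypothetical, and in the actual exercise the in-memory string would be replaced by the path to the provided csv file.

```python
import io

import pandas as pd

# Two made-up rows using a subset of the documented columns (not real data).
sample_csv = io.StringIO(
    "Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR\n"
    "E0,12/09/20,12:30,TeamA,TeamB,0,3,A\n"
    "E0,13/09/20,15:00,TeamC,TeamD,1,1,D\n"
)

# In the exercise this would be pd.read_csv("<path to the provided file>").
df = pd.read_csv(sample_csv)

# Dates are documented as dd/mm/yy, so parse them with an explicit format
# rather than relying on pandas guessing day-first versus month-first.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")

print(df.dtypes)
print(df[["FTHG", "FTAG"]].describe())
```

Parsing the date explicitly avoids silently swapping day and month for dates such as 12/09/20.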
P2.2 - Statistical Analysis (29 marks)
P2.2.1 - Model selection for Regression Analysis (9 marks)
In this question, we construct a regression analysis to
investigate how well FTHG (or FTAG) can be predicted from the other
variables in the data frame. The objective of this question is to
derive a sparse model (linear and polynomial) with fewer
variables.
P2.2.1.1 - Variable Selection for Linear Regression (5 marks)
In variable selection ('variable' means the same as 'predictor'),
variables are iteratively added to or removed from the regression
model. Once finished, the model typically contains only a subset of
the original variables. This makes the model easier to interpret,
and in some cases helps it generalise better to new data.
To perform variable selection, create a
function select_variable(df, main_pred, main_target, alpha),
where
main_pred is a dictionary of predictor variables. For this analysis,
either the Home or the Away teams are selected first, and the
corresponding predictors given below are used
Home: [Time, FTR, HTHG, HTR, HS, HST, HF, HC, HBP]
Away: [Time, FTR, HTAG, HTR, AS, AST, AF, AC, ABP].
main_target is the target variable for the regression, Home: FTHG
(or Away: FTAG)
alpha is the significance level for selecting significant
predictors
The function should return
To calculate regression fits and p-values you will
use statsmodels. The general procedure follows two stages:
Stage 1 (adding predictors): you build a model by adding
variables one after the other. You keep adding variables that
increase the adjusted R^2 value
(provided by the statsmodels package).
Start with an empty set of variables.
Fit multiple one-variable regression models. In each iteration,
use one of the variables provided in predictors. The variable that
leads to the largest increase in adjusted R^2 is added
to the model.
Now proceed by adding a second variable to the model. Starting
from the remaining variables, again choose the variable that leads
to the largest increase in adjusted R^2.
Continue in the same way for the third, fourth, … variable.
You are finished when there is no variable left that increases
adjusted R^2.
Stage 2 (removing non-significant predictors): if any of the
utilised predictors are not significant, you need to remove them.
Keep removing variables until all variables in the model are
significant.
Start by fitting a model using the variables that have been
added to the model in Stage 1.
If there is a variable that is not significant, remove the
variable with the largest p-value and fit the model again
with the reduced set of variables.
Keep removing variables and re-fitting the model until all
remaining variables are significant.
The remaining significant variables are the output of your
function.
P2.2.1.2 - Model Selection for Polynomial Regression (4 marks)
Often the relationship in the dataset provided is not linear, and a
simple linear regression model may not be able to capture the
relationship between the independent and dependent variables.
In such cases, a possible solution would be to implement polynomial
regression instead
(https://en.wikipedia.org/wiki/Polynomial_regression). Polynomial
regression is a form of regression analysis in which the
relationship between the independent variable x and the
dependent variable y is modelled as
an nth degree polynomial in x.
Example: Given y the dependent
variable, x_1, x_2 the independent
variables, b_0 the bias
and b_1, b_2, ..., b_n the weights, a polynomial
regression of degree 2 would have the form:
y = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_2 + b_4 x_2^2
Implement a function polynomial_model(df, main_pred,
main_target, degrees) which takes the subset of variables
selected by select_variable() as an argument and evaluates
all possible combinations of the variable set and
polynomial degrees. The function polynomial_model() finds
the degree that yields the best polynomial model (according to the
adjusted R-squared metric) to predict the value of FTHG or FTAG,
as in the linear regression part above.
Arguments and outputs of the function are given as
a dataframe df,
a dictionary main_pred indicating the predictors for
home and away,
a dictionary main_target indicating target variable
for home and away,
a list of integers degrees indicating the polynomial degrees to test.
The function should return
P2.2.2 - Predicting Match Result (5 marks)