P2.2.1 - Model selection for Regression Analysis (9 marks) In this question, we construct a regression analysis to inves

Post by **answerhappygod** » Mon May 02, 2022 12:46 pm

P2.2.1 - Model selection for Regression Analysis (9 marks)
In this question, we construct a regression analysis to
investigate how well FTHG (or FTAG) can be predicted from the other
variables in the dataframe. The objective of this question is to
derive a sparse model (linear and polynomial) with fewer
variables.
P2.2.1.1 - Variable Selection for Linear Regression (5 marks)
In variable selection ('variable'
means the same as 'predictor'), variables are iteratively added to or
removed from the regression model. Once finished, the model
typically contains only a subset of the original variables. This
makes the model easier to interpret, and in some cases it also
generalises better to new data.
To perform variable selection, create a function
select_variable(df, main_pred, main_target, alpha), where:
main_pred is a dictionary of variables. For this analysis,
either the Home or the Away teams are selected first, and the
corresponding predictors given below are used:
Home: [Time, FTR, HTHG, HTR, HS, HST, HF, HC, HBP]
Away: [Time, FTR, HTAG, HTR, AS, AST, AF, AC, ABP]
main_target is the target variable for the regression (Home: FTHG,
or Away: FTAG).
alpha is the significance level for selecting significant
predictors.
The function should return
To calculate regression fits and p-values you will use
statsmodels. The general procedure follows two stages:
Stage 1 (adding predictors): you build a model by adding
variables one at a time. You keep adding variables as long as they
increase the adjusted R2 value (provided
by the statsmodels package).
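For reference, adjusted R2 discounts the ordinary R2 by the number of predictors, which is why adding a weak variable can lower it. The sketch below uses the standard formula for a model with an intercept. The function name and arguments are illustrative only; in practice a fitted statsmodels OLS result exposes the value as rsquared_adj.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R2 for a linear model with an intercept (standard formula)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Illustrative numbers (not from the dataset): a second predictor that
# barely raises R2 ends up lowering the adjusted value.
one_var = adjusted_r2(0.500, n_obs=20, n_predictors=1)
two_var = adjusted_r2(0.505, n_obs=20, n_predictors=2)
```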
Start with an empty set of variables
Fit multiple one-variable regression models. In each iteration,
use one of the variables provided in predictors. The variable that
leads to the largest increase in adjusted R2 is added to the
model.
Now proceed by adding a second variable into the model. Starting
from the remaining variables, again choose the variable that leads
to the largest increase in adjusted R2.
Continue in the same way for the third, fourth, … variable.
You are finished when there is no variable left that increases
adjusted R2.
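The Stage 1 steps above could be sketched roughly as follows. This is one possible reading, not the expected solution: forward_select, make_adj_r2_scorer, and the baseline argument are hypothetical names, the greedy loop takes a scoring callable so it can be checked without a dataset, and the statsmodels helper assumes every predictor column is already numeric (categorical columns such as FTR or HTR would need encoding first).

```python
from typing import Callable, Iterable, List


def forward_select(candidates: Iterable[str],
                   score: Callable[[List[str]], float],
                   baseline: float = 0.0) -> List[str]:
    """Greedy Stage 1: repeatedly add the variable that most increases the score.

    baseline is the score of the empty model. Assumption: 0.0, which is the
    adjusted R2 of an intercept-only OLS fit.
    """
    selected: List[str] = []
    remaining = list(candidates)
    best_score = baseline
    while remaining:
        # Score every model formed by adding one of the remaining variables.
        trials = [(score(selected + [v]), v) for v in remaining]
        trial_score, trial_var = max(trials, key=lambda t: t[0])
        if trial_score <= best_score:
            break  # no remaining variable increases the score
        selected.append(trial_var)
        remaining.remove(trial_var)
        best_score = trial_score
    return selected


def make_adj_r2_scorer(df, target: str) -> Callable[[List[str]], float]:
    """Hypothetical helper: adjusted R2 of an OLS fit on the given columns."""
    import statsmodels.api as sm  # imported lazily; the search loop does not need it

    def score(variables: List[str]) -> float:
        X = sm.add_constant(df[variables])  # add the intercept column
        return sm.OLS(df[target], X).fit().rsquared_adj

    return score
```

With a dataframe loaded, a call might look like forward_select(predictors, make_adj_r2_scorer(df, "FTHG")), where predictors is the Home list of column names.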
Stage 2 (removing non-significant predictors): if any of the
utilised predictors are not significant, you need to remove them.
Keep removing variables until all variables in the model are
significant.
Start by fitting a model using the variables that have been
added to the model in Stage 1.
If there is a variable that is not significant, remove the
variable with the largest p-value and fit the model again with the
reduced set of variables.
Keep removing variables and re-fitting the model until all
remaining variables are significant.
The remaining significant variables are the output of your
function.
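The Stage 2 loop could be sketched along the same lines. Again this is a hedged illustration with hypothetical names (backward_eliminate, make_pvalue_fn): the loop takes a callable that refits the model and returns one p-value per variable, it treats a variable as significant when its p-value is below alpha, and the statsmodels helper assumes numeric predictor columns and ignores the intercept's p-value.

```python
from typing import Callable, Dict, List


def backward_eliminate(variables: List[str],
                       pvalues: Callable[[List[str]], Dict[str, float]],
                       alpha: float) -> List[str]:
    """Stage 2: drop the least significant variable until all have p < alpha."""
    kept = list(variables)
    while kept:
        p = pvalues(kept)                      # refit with the current variables
        worst = max(kept, key=lambda v: p[v])  # variable with the largest p-value
        if p[worst] < alpha:
            break  # every remaining variable is significant
        kept.remove(worst)
    return kept


def make_pvalue_fn(df, target: str) -> Callable[[List[str]], Dict[str, float]]:
    """Hypothetical helper: per-variable p-values from an OLS fit."""
    import statsmodels.api as sm  # imported lazily; the loop itself does not need it

    def pvalues(variables: List[str]) -> Dict[str, float]:
        X = sm.add_constant(df[variables])  # add the intercept column
        fit = sm.OLS(df[target], X).fit()
        # Keep only the predictors' p-values, not the intercept's.
        return fit.pvalues[variables].to_dict()

    return pvalues
```

Chaining the two stages would then amount to passing the Stage 1 result into backward_eliminate together with make_pvalue_fn(df, main_target) and alpha.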
def select_variable(df, main_pred, main_target, alpha):
    # your code here
    return main_pred
Dataset: