https://docs.google.com/spreadsheets/d/1ZBCKnn0lwhnN6cT_1Ct8VrbBUkuZskfieovB0h2NeZc/edit?usp=sharing
Problems

The file eBay Auctions.csv contains information on 1972 auctions that transacted on eBay.com during May-June 2004. The goal is to use these data to build a model that classifies auctions as competitive or non-competitive. A competitive auction is defined as an auction with at least 2 bids placed on the auctioned item. The data include variables that describe the auctioned item (auction category), the seller (their eBay rating), and the auction terms that the seller selected (auction duration, opening price, currency, day-of-week of auction close). In addition, we have the price that the auction closed at. The goal is to predict whether an auction will be competitive or not.

Notes: In the dataset, the original variables Category (11 categories), Currency (USD, nonUS), and EndDay (Weekend, Week) are categorical, so the dataset also contains corresponding dummy variables. One dummy variable from each group of dummy variables has already been excluded (e.g., there are only 10 category dummy variables) to avoid multicollinearity.

1. Import the dataset. Remove the Category, Currency, and EndDay variables from the imported dataset because we already have their corresponding dummy variables. (1 point)

2. Split the data into training and validation datasets using a 60%-40% ratio. (1 point)
3. Fit a classification tree. Use Competitive as the target variable and the rest of the variables as predictors. (As mentioned in the notes, you don't have to exclude one dummy variable from each dummy group for a categorical variable.) To avoid overfitting, set maxdepth=6.
   a. Report the tree: plot the tree and copy and paste the resulting diagram. You don't have to care too much about the aesthetics of the diagram. (1 point)
   b. List the decision rules. For example, if variable1<0 AND variable2<2, class=0. (0.5 point)
   c. Report the prediction confusion matrix of the validation data. (0.5 point)
   d. Which predictors are used by the tree? (0.5 point)

4. Are the rules practical for predicting the outcome of a new auction? (Hint: Can you use the rules to classify a new auction before the auction ends? In other words, do you know the values of all predictors used in the rules before the auction ends? Some of them may not be known before the end of the auction. What are those variables?) In short, which variables should NOT be included in the predictor set? (0.5 point) Explain why. (0.5 point)

5. Fit another classification tree using the same settings as in question 3. This time, use only the predictors that can be used for predicting the outcome of a new auction before the auction ends.
   a. Report the tree: plot the tree and copy and paste the resulting diagram. You don't have to care too much about the aesthetics of the diagram. (1 point)
   b. List the decision rules. For example, if variable1<0 AND variable2<2, class=0. (0.5 point)
   c. Report the prediction confusion matrix of the validation data. (0.5 point)
   d. Which predictors are used by the tree? (0.5 point)

6. Compare the overall performance (e.g., accuracy or error rates) of the two decision trees (from Q3 and Q5). Which model has better predictive performance? (1 point) Explain why. (1 point)
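
The steps in questions 1 to 6 could be sketched roughly as follows in Python with pandas and scikit-learn. This is only one possible approach, not a reference solution: the assignment does not name a tool, and its maxdepth=6 setting is mapped here to scikit-learn's max_depth=6. The column names Category, Currency, EndDay and Competitive come from the problem text; the helper names prepare and fit_and_report, the seed value, and the header "ClosePrice" in the usage comment are assumptions made for illustration and may not match the actual CSV.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

TARGET = "Competitive"                          # target name from the assignment
REDUNDANT = ["Category", "Currency", "EndDay"]  # originals of the dummy columns


def prepare(df, seed=1):
    """Q1-Q2: drop the redundant categorical columns, then split 60%/40%.

    The seed is an arbitrary choice so that the split is reproducible.
    """
    df = df.drop(columns=REDUNDANT)
    train, valid = train_test_split(df, train_size=0.6, random_state=seed)
    return train, valid


def fit_and_report(train, valid, exclude=()):
    """Q3/Q5: fit a depth-limited tree and collect what the questions ask for.

    `exclude` lists predictor columns to leave out, e.g. variables that are
    not known before an auction ends (hypothetical names, check the CSV).
    """
    features = [c for c in train.columns if c != TARGET and c not in exclude]
    tree = DecisionTreeClassifier(max_depth=6, random_state=1)
    tree.fit(train[features], train[TARGET])
    pred = tree.predict(valid[features])
    # Internal nodes store the index of the split feature; leaves store -2.
    used = sorted({features[i] for i in tree.tree_.feature if i >= 0})
    return {
        "tree": tree,
        "rules": export_text(tree, feature_names=features),    # text rules
        "confusion": confusion_matrix(valid[TARGET], pred),    # rows = actual
        "accuracy": accuracy_score(valid[TARGET], pred),
        "used": used,
    }


# Example usage (file name from the assignment; "ClosePrice" is a guessed header):
# train, valid = prepare(pd.read_csv("eBay Auctions.csv"))
# q3 = fit_and_report(train, valid)
# q5 = fit_and_report(train, valid, exclude=("ClosePrice",))
# print(q3["rules"], q3["confusion"], q3["used"], sep="\n")
# print(q3["accuracy"], q5["accuracy"])   # Q6: compare the two trees
```

For the diagram in 3a and 5a, sklearn.tree.plot_tree can draw the fitted tree if matplotlib is installed. Returning everything in one dictionary keeps the Q3 and Q5 runs directly comparable for question 6, since both use the same split and the same tree settings.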