
Breakfast Cereals. Data were collected on the nutritional information and consumer rating of 77 breakfast cereals.

Posted: Sun Jul 03, 2022 11:23 am
by answerhappygod
Breakfast Cereals. Data were collected on the nutritional information and consumer rating of 77 breakfast cereals. The consumer rating is a rating of cereal “healthiness” for consumer information (not a rating by consumers). For each cereal, the data include 13 numerical variables. For each cereal, the information is based on a bowl of cereal rather than a serving size. These data are also considered in a textbook example, Section 4.13. Your task is to explore and summarize the data as follows:
1) Load the data for the breakfast cereals stored in the file Cereals.csv. Which variables are quantitative/numerical? Which are ordinal? Which are nominal?
2) Compute the mean, median, min, max, and standard deviation for each of the quantitative variables. This can be done using R’s sapply() function (e.g., sapply(data, mean, na.rm = TRUE)).
3) Using the ggplot2 package in R, plot a histogram and probability density for each of the quantitative variables. Hint: if you wish to plot all histograms at once in a grid, you will need to rely on the facet_wrap() function of the ggplot2 package, which was introduced in part 2 of the module 3 slides. For this, you may also need to use the melt() function of the reshape2 package to convert your data frame from wide format to long format. Based on the histograms and the summary statistics from part 2 above, answer the following questions: Which variables have the largest variability? Which variables seem skewed? Are there any values that seem extreme?
4) Use the ggplot2 package in R to plot a side-by-side boxplot to compare the calories in hot vs. cold cereals. What can you learn from the plot? Use the ggplot2 package in R to plot a side-by-side boxplot of consumer rating as a function of the shelf height. If we were to predict consumer rating from shelf height, would we need to keep all three categories of shelf height?
5) Using the ggplot2 package in R, produce the correlation heatmap (Module 3, Part 2) for the quantitative variables in the cereals dataset. Which pair of variables is most strongly correlated? How can we reduce the number of variables based on these correlations? How would the correlations change if we normalized the data first? Obtain the first 5 principal components (PC) of the 13 numerical variables (those are also produced in Table 4.11 in the textbook). Describe briefly what the first PC represents.
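A rough sketch of how tasks 2, 3, and 5 might be approached in R is given below. It is not a worked solution: the small `cereals` data frame is a made-up stand-in for Cereals.csv, and its column names and values are assumptions for illustration only, not the real file's contents. With the real data you would start from something like `cereals <- read.csv("Cereals.csv")`. The plotting step runs only if ggplot2 and reshape2 are installed.

```r
# Toy stand-in for Cereals.csv. Column names and values are hypothetical
# and are NOT taken from the real file.
cereals <- data.frame(
  name     = c("A", "B", "C", "D", "E", "F"),
  type     = c("C", "C", "H", "C", "H", "C"),   # C = cold, H = hot
  calories = c(70, 120, 100, 110, NA, 90),
  sugars   = c(6, 8, 1, 12, 0, 5),
  rating   = c(68.4, 34.0, 64.5, 29.5, 54.9, 59.4)
)

# Task 2: keep the numeric columns, then apply each statistic with sapply().
num_cols <- names(cereals)[sapply(cereals, is.numeric)]
num <- cereals[num_cols]

summary_stats <- data.frame(
  mean   = sapply(num, mean,   na.rm = TRUE),
  median = sapply(num, median, na.rm = TRUE),
  min    = sapply(num, min,    na.rm = TRUE),
  max    = sapply(num, max,    na.rm = TRUE),
  sd     = sapply(num, sd,     na.rm = TRUE)
)
print(summary_stats)

# Task 3: one histogram with a density overlay per variable, in a grid.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("reshape2", quietly = TRUE)) {
  library(ggplot2)
  # Wide -> long: melt() yields one row per (variable, value) pair.
  long <- reshape2::melt(num, measure.vars = num_cols)
  p <- ggplot(long, aes(x = value)) +
    geom_histogram(aes(y = after_stat(density)), bins = 10, na.rm = TRUE) +
    geom_density(na.rm = TRUE) +
    facet_wrap(~ variable, scales = "free")
  print(p)
}

# Task 5: correlations and principal components, using complete rows only.
complete <- na.omit(num)
cor_mat <- cor(complete)
# Pearson correlation does not change when each column is centered and scaled.
cor_scaled <- cor(scale(complete))
# scale. = TRUE standardizes the variables before extracting components.
pca <- prcomp(complete, scale. = TRUE)
print(summary(pca))
```

Dropping incomplete rows with na.omit() before cor() and prcomp() is one simple choice among several. Whether it is appropriate for the real file depends on how many values are missing.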