
1 Data Cleaning This dataset to a large extent relies on user input, and these are users from around the world. Looking

Posted: Fri Apr 29, 2022 10:45 am
by answerhappygod
1 Data Cleaning

This dataset relies to a large extent on user input, and the users come from around the world. Looking at the data, you will find things like the following:

• date_fueled sometimes contains a description of what the user did instead of a date.
• Numerical fields like gallons, miles, and odometer may contain commas as a thousands delimiter (this generally depends on the user's location). So, instead of writing 1523.50, a user might write 1,523.50. Converting this to a float in pandas requires editing the string first.
• The fields relating to costs (cost_per_gallon and total_spent) include the currency symbol in the value (e.g. R500 or $500). Many different currencies are used.

1.1 Date Fields

1. Identify what percentage of date_fueled entries are not proper dates. [1]
2. If date_fueled is not entered correctly (or is not a date), and the date captured is a valid date, then fill in this value as a proxy. [1]
3. Convert the column to a date format, setting any invalid date_fueled entries to NaT. [2]
4. Remove dates that are in the future, or dates that are earlier than 2005. [1]
5. Plot the distribution of fueling dates and comment on the results. [2]

1.2 Numeric Fields

1. Identify what percentage of gallons, miles, and odometer entries are missing. [3]
2. The miles, gallons and mpg columns are interdependent. If one is missing, the other two can be used to calculate it. [3]
3. The values will be read in as objects (or strings) by pandas. Convert these values to float (note the point above about commas in the value). [5]
4. Plot the distributions and comment on the distributions. [3]
5. Compute the statistical description of the columns: mean, standard deviation, max, min, most frequent, and quartiles. Do these results make sense? [3]
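The cleaning steps above could be sketched roughly as follows. This is a minimal illustration on made-up rows, not a full solution. It assumes the captured date lives in a column named date_captured, that dates use the YYYY-MM-DD format, that "." is always the decimal separator, and that mpg = miles / gallons. The sample data and the to_float helper are hypothetical, and plotting is left out.

```python
import pandas as pd

# Hypothetical sample rows. Column names follow the assignment text, except
# date_captured, which is an assumed name for the captured date.
df = pd.DataFrame({
    "date_fueled": ["2021-04-07", "filled up at Shell", "2020-03-01", "1999-01-03"],
    "date_captured": ["2021-04-08", "2019-05-02", "2020-03-02", "2021-01-04"],
    "gallons": ["10", "1,000.5", None, "12"],
    "miles": ["300", None, "250", "360"],
    "mpg": [None, "20", "25", "30"],
    "total_spent": ["$35.70", "R500", "€1,200.00", None],
})

# 1.1 Date fields
# Assumed date format; adjust it to whatever the real data uses.
DATE_FORMAT = "%Y-%m-%d"
fueled = pd.to_datetime(df["date_fueled"], format=DATE_FORMAT, errors="coerce")
captured = pd.to_datetime(df["date_captured"], format=DATE_FORMAT, errors="coerce")

# Share of date_fueled entries that could not be parsed as dates.
pct_invalid_dates = fueled.isna().mean() * 100

# Use the captured date as a proxy; anything still unparseable stays NaT.
df["date_fueled"] = fueled.fillna(captured)

# Drop rows dated before 2005 or in the future. Comparisons with NaT are
# False, so rows without any usable date are kept here.
too_early = df["date_fueled"] < pd.Timestamp("2005-01-01")
in_future = df["date_fueled"] > pd.Timestamp.today()
df = df[~(too_early | in_future)].copy()


# 1.2 Numeric fields
def to_float(series: pd.Series) -> pd.Series:
    """Strip everything except digits, '.' and '-', then convert to float.

    Hypothetical helper: it removes thousands commas and currency symbols,
    and would misread values that use ',' as the decimal separator.
    """
    cleaned = series.str.replace(r"[^0-9.\-]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


num_cols = ["gallons", "miles", "mpg", "total_spent"]
for col in num_cols:
    df[col] = to_float(df[col])

# Percentage of missing values per column, before any filling.
missing_pct = df[["gallons", "miles", "mpg"]].isna().mean() * 100

# Fill each interdependent column from the other two (mpg = miles / gallons).
df["mpg"] = df["mpg"].fillna(df["miles"] / df["gallons"])
df["miles"] = df["miles"].fillna(df["mpg"] * df["gallons"])
df["gallons"] = df["gallons"].fillna(df["miles"] / df["mpg"])

# Mean, standard deviation, min, max and quartiles; mode gives most frequent.
summary = df[num_cols].describe()
most_frequent = df[num_cols].mode().iloc[0]
```

On the real data, values such as 1.523,50 (comma as the decimal separator) would need separate handling before a regex like this is applied, and the summary statistics are worth checking for implausible extremes left over from bad user input.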