1 Data Cleaning This dataset to a large extent relies on user input, and these are users from around the world. Looking



Post by answerhappygod

1 Data Cleaning

This dataset relies to a large extent on user input, and the users come from around the world. Looking at the data, you will find things like the following:

• date_fueled sometimes contains a description of what the user did instead of a date.
• Numerical fields like gallons, miles, and odometer may contain commas as a thousands delimiter (this generally depends on the location of the user). So, instead of writing 1523.50, a user might write 1,523.50. Converting this to a float in pandas will require editing the string.
• The fields relating to costs (cost_per_gallon and total_spent) have the currency symbol in the value (e.g. R500 or $500). Many different currencies are used.

1.1 Date Fields

1. Identify what percentage of date_fueled entries are not proper dates. [1]
2. If date_fueled is not entered correctly (or is not a date), and date_captured is a valid date, then fill in this value as a proxy. [1]
3. Convert the column to a date format, setting any invalid date_fueled entries to NaT. [2]
4. Remove dates that are in the future, or dates that are earlier than 2005. [1]
5. Plot the distribution of fueling dates and comment on the results. [2]

1.2 Numeric Fields

1. Identify what percentage of gallons, miles, and odometer entries are missing. [3]
2. The miles, gallons and mpg columns are interdependent. If one is missing, the other two can be used to calculate it. [3]
3. The values will be read in as objects (or strings) by pandas. Convert these values to float (note the point above about commas in the value). [5]
4. Plot the distributions and comment on the distributions. [3]
5. Compute the statistical description of the columns: mean, standard deviation, max, min, most frequent, and quartiles. Do these results make sense? [3]
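The cleaning steps above could be approached roughly as follows. This is only a minimal sketch, not a reference solution: the sample rows are made up for illustration, the helper name to_float is hypothetical, the column name date_captured is assumed from the task wording, and the code assumes a comma is always a thousands delimiter and a dot is always the decimal point. Plotting is left out.

```python
import pandas as pd

# Made-up sample standing in for the real dataset (illustrative values only).
df = pd.DataFrame({
    "date_fueled": ["2019-03-14", "filled up on the way home", "2150-01-01", "1999-06-30"],
    "date_captured": ["2019-03-15", "2018-07-02", "2019-01-05", "2019-01-06"],
    "gallons": ["10", "50", None, "12"],
    "miles": ["300", "1,523.50", "250", None],
    "mpg": ["30", None, "25", "30"],
    "total_spent": ["$31.50", "R500", "$1,031.50", None],
})

# 1.1 Date fields: anything that cannot be parsed becomes NaT.
parsed = pd.to_datetime(df["date_fueled"], errors="coerce")
pct_bad_dates = parsed.isna().mean() * 100

# Use date_captured as a proxy where date_fueled is not a valid date.
captured = pd.to_datetime(df["date_captured"], errors="coerce")
df["date_fueled"] = parsed.fillna(captured)


def to_float(col: pd.Series) -> pd.Series:
    """Hypothetical helper: drop commas, currency symbols and other
    non-numeric characters, then convert to float (NaN if nothing is left)."""
    stripped = col.str.replace(r"[^0-9.\-]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")


# 1.2 Numeric fields: convert the string columns to floats.
numeric_cols = ["gallons", "miles", "mpg", "total_spent"]
for name in numeric_cols:
    df[name] = to_float(df[name])

pct_missing = df[["gallons", "miles", "mpg"]].isna().mean() * 100

# mpg = miles / gallons, so any one of the three can be rebuilt from the other two.
df["mpg"] = df["mpg"].fillna(df["miles"] / df["gallons"])
df["miles"] = df["miles"].fillna(df["mpg"] * df["gallons"])
df["gallons"] = df["gallons"].fillna(df["miles"] / df["mpg"])

# Summary statistics; mode() gives the most frequent value(s) per column.
summary = df[["gallons", "miles", "mpg"]].describe()
most_frequent = df[["gallons", "miles", "mpg"]].mode()

# Drop dates before 2005 or in the future (done last so the sample keeps all rows above).
today = pd.Timestamp.today().normalize()
in_range = (df["date_fueled"] >= pd.Timestamp("2005-01-01")) & (df["date_fueled"] <= today)
cleaned = df[in_range]
```

Stripping every character except digits, the dot, and the minus sign handles the thousands commas and the currency symbols in one pass, but it would misread values written with a decimal comma (for example 40,10), so those would need separate handling.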