Posted: Thu May 05, 2022 1:11 pm
This dataset relies to a large extent on user input, and the users
come from around the world. Looking at the data, you will find
things like the following:
• date_fueled sometimes has a description of what the user did,
instead of a date.
• Numerical fields like gallons, miles and odometer will have commas
in them as a thousands delimiter (this generally depends on the
location of the user). So, instead of writing 1523.50, they might
write 1,523.50. Converting this to a float in pandas will require
editing the string.
• The fields relating to costs (cost_per_gallon and total_spent)
have the currency symbol in the value (e.g. R500 or $500). There are
many different currencies used.
1.1 Date Fields
1. Identify what percentage of date_fueled entries are not proper
dates. [1]
2. If date_fueled is not entered correctly (or is not a date),
and the date captured is a valid date, then use the date captured
as a proxy. [1]
3. Convert the column to a date format, setting any invalid
date_fueled entries to NaT. [2]
4. Remove dates that are in the future, or dates that are
earlier than 2005. [1]
5. Plot the distribution of fueling dates and comment on the
results. [2]
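As a rough illustration of steps 1 to 4, here is a minimal pandas sketch on toy data. The column name date_captured is an assumption based on the wording "the date captured", and the ISO date strings are invented for the example; the real entries may need an explicit format or extra cleaning before they parse.

```python
import pandas as pd

# Toy data. The column name date_captured is an assumption taken from the
# task wording; the values are made up for illustration only.
df = pd.DataFrame({
    "date_fueled": ["2020-01-05", "filled up at Shell", "2099-03-03", "1999-06-01"],
    "date_captured": ["2020-01-06", "2021-02-02", "2021-03-03", "2021-06-02"],
})

# errors="coerce" turns anything that cannot be parsed into NaT.
fueled = pd.to_datetime(df["date_fueled"], errors="coerce")
captured = pd.to_datetime(df["date_captured"], errors="coerce")

# Share of entries that are not proper dates, as a percentage.
pct_invalid = fueled.isna().mean() * 100

# Use the capture date as a proxy where the fuel date is missing.
df["date_fueled"] = fueled.fillna(captured)

# Keep only dates from 2005 onwards that are not in the future.
earliest = pd.Timestamp("2005-01-01")
latest = pd.Timestamp.now()
df = df[df["date_fueled"].between(earliest, latest)]
```

Rows that are still NaT after the fill are dropped by the range filter, because comparisons against NaT evaluate to False.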
1.2 Numeric Fields
1. Identify what percentage of gallons, miles and odometer
entries are missing. [3]
2. The miles, gallons and mpg columns are interdependent. If one
is missing, the other two can be used to calculate it. [3]
3. The values will be read in as objects (or strings) by Pandas.
Convert these values to float (note the point above about commas in
the value). [5]
4. Plot the distributions and comment on them.
[3]
5. Compute the statistical description of the columns: mean,
standard deviation, max, min, most frequent, and quartiles. Do
these results make sense? [3]
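Steps 2 and 3 could be sketched as follows, doing the conversion first so that the arithmetic works. The toy values are invented, and the sketch assumes that a comma is always a thousands delimiter and that mpg is defined as miles / gallons. Entries that use a comma as the decimal mark would need separate handling.

```python
import pandas as pd

# Toy data with values read in as strings, as pandas would for this file.
df = pd.DataFrame({
    "miles": ["1,523.50", "300", None],
    "gallons": ["50.0", None, "10"],
    "mpg": [None, "30", "25.5"],
})

for col in ["miles", "gallons", "mpg"]:
    # Drop the thousands delimiter, then convert; bad values become NaN.
    cleaned = df[col].str.replace(",", "", regex=False)
    df[col] = pd.to_numeric(cleaned, errors="coerce")

# Percentage of missing entries per column, before any filling.
pct_missing = df.isna().mean() * 100

# mpg = miles / gallons, so one missing value follows from the other two.
df["mpg"] = df["mpg"].fillna(df["miles"] / df["gallons"])
df["miles"] = df["miles"].fillna(df["mpg"] * df["gallons"])
df["gallons"] = df["gallons"].fillna(df["miles"] / df["mpg"])
```

Rows where two or more of the three values are missing stay NaN, since there is nothing to compute them from.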
2 Feature Engineering
We can use the existing features to create new features with more
usable information. Add the following features:
1. Create a new column with the currency. (Something to keep in
mind is that the Swiss Franc has a period in the abbreviation).
[2]
2. Create a new column containing the float value of the total
spend and the cost per gallon. (Swiss Franc comment as above).
[2]
3. Create columns for car make, model, year and user ID: use the
URL (the last value in the URL is the user ID). [4]
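For steps 1 and 2, one possible approach is a regular expression that treats the leading run of non-digit characters as the currency and the rest as the amount. This is only a sketch: split_money is a hypothetical helper, the "Fr." abbreviation is an illustrative guess at how the Swiss Franc might appear, and values where the symbol follows the number, or where the amount starts with a decimal point, are not handled.

```python
import re

# Leading non-digit run = currency marker, remainder = amount.
# A period directly after the letters (as in "Fr.") stays with the currency.
_MONEY = re.compile(
    r"^\s*(?P<currency>[^\d\s-]+)\s*(?P<amount>-?[\d,]*\.?\d+)\s*$"
)

def split_money(raw):
    """Return (currency, amount) or (None, None) if the value does not fit."""
    if not isinstance(raw, str):
        return None, None
    match = _MONEY.match(raw)
    if match is None:
        return None, None
    # Remove thousands delimiters before converting to float.
    amount = float(match.group("amount").replace(",", ""))
    return match.group("currency"), amount

examples = ["R500", "$1,523.50", "Fr.45.50", "no price"]
parsed = [split_money(value) for value in examples]
```

The helper can be applied to a pandas column with Series.map, producing tuples that are then split into the currency column and the float column.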
The data is given in imperial units, and in SA, we use proper
measurement standards.
1. litres filled: use the gallons - consider whether to use UK
or US gallons. [2]
2. km driven: use the miles driven to compute this [1]
3. litres per 100km: use the two new features to calculate this.
[1]
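The three conversions can be sketched with the standard conversion factors. The function names are hypothetical, and whether a given row should use the US or the UK gallon is left to your own reasoning, as the task says.

```python
# Exact definitions of the units involved.
LITRES_PER_US_GALLON = 3.785411784
LITRES_PER_UK_GALLON = 4.54609
KM_PER_MILE = 1.609344

def gallons_to_litres(gallons, uk=False):
    """Convert gallons to litres; uk=True uses the imperial (UK) gallon."""
    factor = LITRES_PER_UK_GALLON if uk else LITRES_PER_US_GALLON
    return gallons * factor

def miles_to_km(miles):
    return miles * KM_PER_MILE

def litres_per_100km(litres, km):
    """Fuel consumption: litres used for every 100 km driven."""
    return 100.0 * litres / km

# Example: 10 US gallons over 300 miles (30 mpg).
litres = gallons_to_litres(10)
km = miles_to_km(300)
consumption = litres_per_100km(litres, km)
```

In this example, 30 mpg in US gallons works out to roughly 7.8 litres per 100 km. The same 10 gallons read as UK gallons would give a noticeably higher figure, which is why the choice of gallon matters.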
3 Vehicle Exploration
We will see in the next few questions (and you should be aware of
it by now) that the data captured by users is not always accurate,
in particular the transaction-level data. There is probably more
accuracy in the user profile: their vehicle make and model, the
year of the vehicle, and, hopefully, the currency they use. We’ll
look at vehicle and user profile information for the global
population here, before we consider removing outliers and bad
transaction data.
1. Plot the number of unique users per country (remember, we
proxy this by currency). [2]
2. Look at the popularity of the app: plot the number of unique
users per day. [2]
3. Look at the distribution of age of the vehicles per country -
look at the year of the vehicle. Remember to look at the date it
was refueled, not the current date. [3]
4. Which makes and models of vehicles are the most popular?
[2]
4 Fuel Usage
It is particularly difficult to identify outliers in this dataset,
due, for example, to the multiple currencies. As an example,
refilling a vehicle in South Africa might cost R1000, but in the US
it would be $70. One would have to either perform outlier detection
on each currency separately or convert everything into a single
currency, based on the time of the transaction, and use that. The
conversion is not required in this assignment. We will focus on the
top five currencies only (Rands will be one of them) to simplify
things.
4.1 Outlier Removal
1. Identify the top 5 currencies by number of transactions.
[2]
2. For each of the top 5 currencies separately, remove outliers
by considering the total spend, litres, cost per litre, gallons,
etc. Choose values you believe are reasonable and provide your
reasoning. As an example of something you would want to look out
for, there are some SA users that have their currency set to
dollars. This will show a user refuelling with several hundred
dollars, but only putting in tens of litres, which is clearly
wrong. [10]
3. How many values have been removed after accounting for
outliers? [1]
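One way to catch the mismatch described in step 2 (a currency that does not fit the amounts) is to check the implied cost per litre against per-currency bounds. In the sketch below, is_plausible is a hypothetical helper and the bounds are placeholders only, not recommended values. Choosing and justifying the real thresholds is part of the task.

```python
# Placeholder bounds on cost per litre, keyed by currency symbol.
# These numbers are illustrative only; pick and justify your own.
BOUNDS = {"$": (0.3, 3.0), "R": (5.0, 30.0)}

def is_plausible(currency, total_spend, litres, bounds=BOUNDS):
    """True if the implied cost per litre is within the currency's bounds."""
    if currency not in bounds or litres <= 0:
        return False
    low, high = bounds[currency]
    return low <= total_spend / litres <= high

# 700 units of currency for 40 litres is 17.5 per litre:
# implausible in dollars, plausible in Rands.
dollar_case = is_plausible("$", 700, 40)
rand_case = is_plausible("R", 700, 40)
```

Counting the rows before and after applying such a filter gives the number of removed values asked for in step 3.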
4.2 Fuel Efficiency
Now that you have a much cleaner dataset, we can start to look at
some of the data more closely for insights. In particular, we want
to look at the fuel efficiency in litres per 100km. In general,
there are many confounding factors and unknown variables that can
make an analysis of fuel efficiency difficult: engine size, vehicle
type, fuel type (diesel vs petrol will show a massive difference),
aircon usage, vehicle load, weather, transmission type. With this
in mind, we need to be aware that the results found here are
unlikely to be completely representative and accurate, but
hopefully indicative. When you start this section, make sure you
have removed outliers as indicated in the previous question.
1. Look at the difference in cost per litre per country for
January 2022 - use the average currency conversion rate to Rands
(quote your values and source). Are there any notable differences?
Discuss reasons why this may/may not be the case. [5]
2. Looking at the odometer readings, find examples of where
users have missed logging a fill-up. Give a basic rule for
identifying this, and estimate how many there are in the dataset.
[5]
3. Plot the average distance (in km) per tank per country. Which
country has the largest average distance? Provide some explanations
for why this might be the case. [5]
4. Do newer vehicles drive further distances between fill-ups?
Provide a plot to show this. [4]
5. Take the top 5 most popular vehicles in SA (ie, those with
currency set to R). Compute their fuel efficiency and discuss
whether these values are realistic. [3]
6. Which vehicles are the most fuel efficient in each country?
(Make sure the values are reasonable! You can look up values of
fuel efficiency online to do a sanity check, but a value of 1 l per
100 km or 100 l per 100 km is clearly wrong.) [5]
7. Plot the difference in fuel efficiency for the top 5 Canadian
vehicles between seasons. Would you expect to see big differences,
and do you see them? [3]
8. Show the correlations between fuel efficiency and other
features. You should find that there is a relatively strong
correlation with distance travelled, the age of the vehicle, and
the model of vehicle. [5]
9. Use a random forest to get a list of the most important
variables. How different are they from each other, and how do these
relate to the variables from the correlations above? [5]
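For question 2 above (missed fill-ups), one possible basic rule is to compare the odometer gap between consecutive fill-ups of the same vehicle with the miles logged for that fill-up. The sketch assumes that miles is the distance driven since the previous fill-up; flag_missed_fillups and its tolerance of 1.5 are hypothetical and should be tuned against the data.

```python
def flag_missed_fillups(records, tolerance=1.5):
    """Flag fill-ups that were probably preceded by an unlogged one.

    records: list of (odometer, miles) tuples for ONE vehicle, sorted by date.
    A record is flagged when the odometer moved much further than the
    miles logged for that fill-up.
    """
    if not records:
        return []
    flags = [False]  # the first record has no previous reading to compare
    for (prev_odo, _), (odo, miles) in zip(records, records[1:]):
        gap = odo - prev_odo
        flags.append(miles > 0 and gap > tolerance * miles)
    return flags

# The third fill-up logs 320 miles, but the odometer moved 690.
history = [(10000, 300), (10310, 310), (11000, 320)]
flags = flag_missed_fillups(history)
```

Summing the flags over all vehicles gives an estimate of how many fill-ups were missed, bearing in mind that odometer typos will also trigger the rule.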
4.3 Fuel Usage in SA
In South Africa, fuel prices are always adjusted at midnight on the
first Tuesday of the month. If the price is going up, we expect
there to be more people refuelling on a Tuesday than usual. If the
price is going down, we might expect people to postpone refuelling
until the Wednesday.
1. Filter the above dataset to focus on SA drivers. [1]
2. Plot the fuel prices over time for SA. [2]
3. Add an indicator column to show the day of the week that the
transaction happened. [2]
4. Using a suitable plot, show whether there is a difference in
the number of people refueling on a Tuesday vs other days. [3]
5. Now reduce your dataset to only the entries on the 1st
Tuesday and 1st Wednesday in SA every month. [2]
6. For each Tuesday and Wednesday, add an indicator for whether
the price goes up or the price goes down that month. [2]
7. Do more people refuel on the first Wednesday of the month
when the price goes down? [2]
8. Do more people refuel on the first Tuesday of the month when
the price goes up? [2]
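Finding the first Tuesday and first Wednesday of a month (step 5) can be sketched with the standard library alone. The helper names are hypothetical; weekday numbers follow Python's date.weekday convention (Monday is 0, so Tuesday is 1 and Wednesday is 2).

```python
from datetime import date, timedelta

def first_weekday_of_month(year, month, weekday):
    """Return the first date in the month that falls on the given weekday."""
    first = date(year, month, 1)
    # Days to step forward from the 1st to reach the wanted weekday.
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset)

def is_first_tuesday(d):
    return d == first_weekday_of_month(d.year, d.month, 1)

def is_first_wednesday(d):
    return d == first_weekday_of_month(d.year, d.month, 2)

january_tuesday = first_weekday_of_month(2022, 1, 1)
january_wednesday = first_weekday_of_month(2022, 1, 2)
```

When a month starts on a Wednesday (June 2022, for example), the first Wednesday comes before the first Tuesday, so it is worth deciding whether you want the first Wednesday or the day after the first Tuesday.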