STEP 4: If you explore the Google Play data set long enough or look at the discussions section, you'll notice some apps
Posted: Thu May 05, 2022 1:16 pm
STEP 4: If you explore the Google Play data set long enough or look at the discussions section, you'll notice some apps have duplicate entries. For instance, Instagram has four entries:

    for app in android:
        name = app[0]
        if name == 'Instagram':
            print(app)

    ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
    ['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
    ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
    ['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

In total, there are 1,181 cases where an app occurs more than once:

    duplicate_apps = []
    unique_apps = []

    for app in android:
        name = app[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)

    print('Number of duplicate apps:', len(duplicate_apps))
    print('\n')
    print('Examples of duplicate apps:', duplicate_apps[:15])

    Number of duplicate apps: 1181


    Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat Chat Built for Teams', 'Xero Accounting Software']

Above, we:
- Created two lists: one for storing the name of duplicate apps, and one for storing the name of unique apps.
- Looped through the android data set (the Google Play data set), and for each iteration:
    - We saved the app name to a variable named name.
    - If name was already in the unique_apps list, we appended name to the duplicate_apps list.
    - Else (if name wasn't already in the unique_apps list), we appended name to the unique_apps list.

(As a side note, you may notice we used the in operator above to check for membership in a list. We only learned to use in to check for membership in dictionaries, but in also works with lists):

    app_names = ['Instagram', 'Facebook']
    print('Instagram' in app_names)
    print('Twitter' in app_names)
    print(232 in app_names)
    print('Facebook' in app_names)

    True
    False
    False
    True

We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you examine the rows we printed for the Instagram app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show the data was collected at different times.

    for app in android:
        name = app[0]
        if name == 'Instagram':
            print(app)
    ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
    ['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
    ['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
    ['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

We can use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.

We'll remove the rows on the next screen. Now it's your turn to write some code and confirm the data has duplicate entries. If you get stuck during the following exercise, you can check the solution notebook.

Instructions

1. Using a combination of narrative and code, explain to the reader that the Google Play data set has duplicate entries. Print a few duplicate rows to confirm.
2. Count the number of duplicates using the technique we learned above.
3. Explain that you won't remove the duplicates randomly. Describe the criterion you're going to use to remove the duplicates.
    - We already suggested a criterion above, but you can come up with another criterion if you want. Make sure you support your criterion with at least one argument.
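The criterion described above (keep only the row with the highest number of reviews for each app) could be sketched roughly as follows. This is only one possible approach, not necessarily the one used on the next screen. The names reviews_max, android_clean and already_added are assumptions made for this illustration, the small android list is a shortened stand-in for the real data set, and the 'Example App' rows are made up.

```python
# Shortened stand-in for the real `android` list (header row removed).
# Only the first four columns are kept: name is at index 0, reviews at index 3.
# The 'Example App' rows are invented purely for illustration.
android = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Example App', 'TOOLS', '4.0', '120'],
    ['Example App', 'TOOLS', '4.0', '120'],
]

# Pass 1: record the highest review count seen for each app name.
# Assumes every value in the reviews column can be parsed as an integer.
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = int(app[3])
    if name not in reviews_max or n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews

# Pass 2: keep one row per app, the one whose review count equals the maximum.
# `already_added` guards against ties (two rows sharing the same maximum).
android_clean = []
already_added = set()
for app in android:
    name = app[0]
    n_reviews = int(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.add(name)

print(reviews_max['Instagram'])  # 66577446
print(len(android_clean))        # 2
```

The second pass needs the already_added check because comparing against the maximum alone would keep both 'Example App' rows, which have identical review counts.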
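As a side sketch, the duplicate count shown earlier can also be written with a set instead of the unique_apps list. The in operator works on sets as well, and membership checks on a set generally stay fast as the collection grows, whereas checking a list scans it item by item. The function name count_duplicates and the sample rows below are assumptions made for this illustration; they are not part of the original exercise.

```python
def count_duplicates(rows):
    """Return (duplicate_names, unique_names) for rows whose first item is the app name."""
    unique_names = set()
    duplicate_names = []
    for row in rows:
        name = row[0]
        if name in unique_names:
            # Seen before: every repeat occurrence is recorded as a duplicate.
            duplicate_names.append(name)
        else:
            unique_names.add(name)
    return duplicate_names, unique_names


# Made-up rows that contain only the app name column.
sample_rows = [['Instagram'], ['Box'], ['Instagram'], ['Slack'], ['Box'], ['Instagram']]

duplicates, uniques = count_duplicates(sample_rows)
print('Number of duplicate apps:', len(duplicates))  # Number of duplicate apps: 3
print('Examples of duplicate apps:', duplicates)
```

The logic is the same as in the list-based version, so called on the full android data set it should report the same count.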