Python Data Science question:
(a) Read in the data and transform data frames.

i. Import the data from the csv file 'ratings.csv' into a pandas data frame. Each row corresponds to the rating given by a user to a particular movie. There is a header row that specifies the fields: userId, movieId, rating, timestamp. Transform the data into a movie-user ratings matrix such that rows refer to movies, columns refer to users, and each cell contains the rating for the associated movie/user pair. Assign a value of 0 for any missing values. (Note: For this operation, you can use the pivot_table() function in pandas with index as movie ids, columns as user ids, values as rating.)

ii. Import the data from the csv file 'movies.csv' into a pandas data frame. Each row corresponds to title and genre information for a particular movie. There is a header row that specifies the fields: movieId, title, genre. Each movie is associated with a list of genres, separated by '|'. Create a new column called 'firstGenre' that contains only the first genre from the list. You can do this easily with str.split(). Drop the original 'genre' column.

iii. Return the data frames from (i) and (ii).

In [7]:
# construct_data(ratingsFilename, genreFilename) takes as input the filenames to read
# in data from (as specified above), converts the ratings data into a movie-user
# dataframe df_A, and converts the movie data into a data frame df_B that includes a
# 'firstGenre' column.
# You will use these dataframes in subsequent questions.
def construct_data(ratingsFilename, genreFilename):
    ###
    ### YOUR CODE HERE

# Example: construct_data('ratings.csv', 'movies.csv')
# -> (df_A 9724x610, df_B 9742x3)
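One possible way to fill in part (a) is sketched below. It is not an official solution. It assumes the ratings header uses the spellings userId and movieId, and it accepts either 'genre' (as the question states) or 'genres' (an assumption about how the file may actually be labelled) for the genre column.

```python
import pandas as pd


def construct_data(ratingsFilename, genreFilename):
    # (i) Ratings: one row per (user, movie) pair.
    ratings = pd.read_csv(ratingsFilename)

    # Rows = movies, columns = users, cells = rating; missing pairs become 0.
    df_A = ratings.pivot_table(
        index="movieId", columns="userId", values="rating", fill_value=0
    )

    # (ii) Movies: title and a '|'-separated genre list.
    movies = pd.read_csv(genreFilename)

    # Assumption: the genre column is named either 'genre' or 'genres'.
    genre_col = "genres" if "genres" in movies.columns else "genre"

    # Keep only the first genre of each list, then drop the original column.
    movies["firstGenre"] = movies[genre_col].str.split("|").str[0]
    df_B = movies.drop(columns=[genre_col])

    # (iii) Return both frames.
    return df_A, df_B
```

With the files from the question, this would be called as construct_data('ratings.csv', 'movies.csv'). Note that pivot_table averages duplicate (movie, user) entries by default, which has no effect when each pair appears only once.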
(b) Apply PCA to the movie-user matrix A with a specified number of components k. Return the resulting explained variance ratio vector and the transformed data (converted back to a pandas dataframe).

In [10]:
# apply_pca(df, k) takes as input a movie-user ratings dataframe and target number
# of pca components k. The function should first mean center the data in the data
# frame, then perform PCA with k components and transform the data. Return the pca
# attribute explained_variance_ratio_ that records the explained variance for each
# of the k components, and the transformed data (in a new data frame).
def apply_pca(df, k):
    ###
    ### YOUR CODE HERE
    ###

df_A, df_B = construct_data('ratings.csv', 'movies.csv')
apply_pca(df_A, 2)
# -> (array([0.1762..., 0.0418...]), transformed_df 9724x2)
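A minimal sketch for part (b), assuming scikit-learn's PCA class is the intended tool (the attribute name explained_variance_ratio_ in the question suggests so). Mean centering is done per column here, i.e. per user; that reading of "mean center" is an assumption. Keeping the movie ids as the index of the returned frame is also a choice made for this sketch, so that the result can later be joined on movieId.

```python
import pandas as pd
from sklearn.decomposition import PCA


def apply_pca(df, k):
    # Mean center each column (assumption: centering is per user column).
    centered = df - df.mean()

    # Fit PCA with k components and project the centered data.
    pca = PCA(n_components=k)
    transformed = pca.fit_transform(centered.to_numpy())

    # Wrap the result in a data frame, keeping the original row index
    # (movie ids) so the rows can still be matched to movies.
    df_T = pd.DataFrame(transformed, index=df.index)

    return pca.explained_variance_ratio_, df_T
```

scikit-learn's PCA centers its input internally as well, so the explicit centering step mainly follows the wording of the question.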
(c) Join the data frames from Q1a (movie title and firstGenre) and Q1b (movies with dimensionality reduced by pca). The result should be a data frame with rows corresponding to movies. The first k columns should be the lower dimensional representation of the user ratings from pca, and the last three columns should be movieId, movie title, and firstGenre. Return the resulting data frame.

In [14]:
# join_movieDataFrames(pcaDF, genreDF) takes two data frames as input: (1) movies
# after dimensionality reduction with pca, and (2) movies with title and first genre.
# The two data frames should be joined together, using movieId as a key, and returned
# as a single data frame.
def join_movieDataFrames(pcaDF, genreDF):
    ###
    ### YOUR CODE HERE

(df_A, df_B) = construct_data('ratings.csv', 'movies.csv')
(exvar, df_T) = apply_pca(df_A, 2)
join_movieDataFrames(df_T, df_B)
# -> joined_df (9724x5)
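The join in part (c) could look roughly like the following. This sketch assumes the PCA data frame carries the movie ids as its row index (the question does not say how the ids are stored), and it uses an inner join so that only movies with ratings are kept.

```python
import pandas as pd


def join_movieDataFrames(pcaDF, genreDF):
    # Assumption: pcaDF is indexed by movie id. Turn the index into a
    # regular 'movieId' column so it can serve as the join key.
    left = pcaDF.rename_axis("movieId").reset_index()

    # Inner join: keep only movies present in both frames.
    joined = left.merge(genreDF, on="movieId", how="inner")

    # Order columns: the k PCA components first, then movieId, title, firstGenre.
    pca_cols = list(pcaDF.columns)
    other_cols = [c for c in joined.columns if c not in pca_cols]
    return joined[pca_cols + other_cols]
```

The inner join is consistent with the expected 9724 rows in the example, since movies.csv lists more movies than appear in the ratings matrix.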
(d) Consider the color mapping in the dictionary 'genre_color' below, which assigns a color to each genre. Use the methods above to apply PCA and reduce the dimensionality of the movies using k=2, join the results to movie title and firstGenre, then plot the results as a scatterplot, coloring each movie according to its firstGenre using the color map below. This question will be graded manually.

In [16]:
genre_color = {'Animation': 'r', 'Horror': 'b', 'Thriller': 'y', 'Drama': 'm',
               'Comedy': 'deeppink', 'Sci-Fi': 'gold', 'Western': 'orange',
               'Adventure': 'g', 'Documentary': 'brown', 'Musical': 'indigo',
               'Fantasy': 'yellow', 'Mystery': 'purple', 'Film-Noir': 'cyan',
               '(no genres listed)': 'coral', 'Action': 'teal', 'War': 'black',
               'Romance': 'skyblue', 'Children': 'lime', 'Crime': 'darkgreen'}
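For part (d), a scatterplot along these lines may work. The helper name plot_movies_by_genre, its x_col and y_col parameters, and the gray fallback color are all hypothetical choices for this sketch. It assumes the joined data frame has the two PCA components in columns named 0 and 1 plus a 'firstGenre' column.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_movies_by_genre(joined, genre_color, x_col=0, y_col=1):
    # Hypothetical helper: one scatter call per genre so that each genre
    # gets its own color and legend entry.
    fig, ax = plt.subplots(figsize=(8, 6))
    for genre, group in joined.groupby("firstGenre"):
        ax.scatter(
            group[x_col],
            group[y_col],
            color=genre_color.get(genre, "gray"),  # gray if genre is unmapped
            label=genre,
            s=10,
        )
    ax.set_xlabel("PCA component 1")
    ax.set_ylabel("PCA component 2")
    ax.set_title("Movies in PCA space, colored by first genre")
    ax.legend(fontsize="small")
    return ax


# Usage, assuming the functions from parts (a) to (c) are defined:
# df_A, df_B = construct_data('ratings.csv', 'movies.csv')
# exvar, df_T = apply_pca(df_A, 2)
# plot_movies_by_genre(join_movieDataFrames(df_T, df_B), genre_color)
# plt.show()
```

Plotting genre by genre keeps the legend readable; passing a list of per-row colors to a single scatter call would also work but would not produce per-genre legend entries.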