Part 1: The Dataset In this project, we are exploring movie screenplays. We'll be trying to predict each movie's genre f

Post by **answerhappygod** » Wed Apr 27, 2022 3:14 pm


Part 1: The Dataset

In this project, we are exploring movie screenplays. We'll be trying to predict each movie's genre from the text of its screenplay. In particular, we have compiled a list of 5,000 words that occur in conversations between movie characters. For each movie, our dataset tells us the frequency with which each of these words occurs in certain conversations in its screenplay. All words have been converted to lowercase.

Run the cell below to read the movies table. It may take up to a minute to load.

    movies = Table.read_table('movies.csv')

Here is one row of the table and some of the frequencies of words that were said in the movie.

    movies.where("Title", "runaway bride").select(0, 1, 2, 3, 4, 14, 49, 1042, 4884)

    Title          Year  Rating  Genre   # Words  breez  england  it         bravo
    runaway bride  1999  5.2     comedy  4895     0      0        0.0234092  0

The above cell prints a few columns of the row for the comedy movie Runaway Bride. The movie contains 4895 words. The word "it" appears 115 times, as it makes up 0.0234092 of the words in the movie. The word "england" doesn't appear at all.

Additional context: This numerical representation of a body of text, one that describes only the frequencies of individual words, is called a bag-of-words representation. This is a model that is often used in NLP. A lot of information is discarded in this representation: the order of the words, the context of each word, who said what, the cast of characters and actors, etc. However, a bag-of-words representation is often used for machine learning applications as a reasonable starting point, because a great deal of information is also retained and expressed in a convenient and compact format. In this project, we will investigate whether this representation is sufficient to build an accurate genre classifier.

All movie titles are unique. The row_for_title function provides fast access to the one row for each title.

Note: All movies in our dataset have their titles lower-cased.
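To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. It is not how the course dataset was built: the helper bag_of_words and the sample sentence are hypothetical, and splitting on whitespace is an assumed tokenization. The last lines redo the arithmetic for the Runaway Bride row, turning the proportion for "it" back into an approximate count.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercased word in `text` to its share of all words (hypothetical helper)."""
    words = text.lower().split()          # assumed tokenization: split on whitespace
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Hypothetical sample text, not taken from the dataset.
sample = "It is late and it is cold"
proportions = bag_of_words(sample)
print(proportions["it"])                  # share of the 7 words that are "it"

# Proportion times total word count recovers the approximate count
# quoted for Runaway Bride (0.0234092 of 4895 words).
approx_count = round(0.0234092 * 4895)
print(approx_count)                       # 115
```

Word order and speaker are gone after this step, which is exactly the information loss described above.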
    title_index = movies.index_by('Title')

    def row_for_title(title):
        """Return the row for a title, similar to the following expression (but faster)

        movies.where('Title', title).row(0)
        """
        return title_index.get(title)[0]

    row_for_title('toy story')

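Why an index is faster than filtering the table each time can be sketched with a plain dictionary. This is only an illustration of the idea, not the datascience library's implementation; the rows list and the local index_by helper are hypothetical stand-ins.

```python
# Hypothetical rows standing in for the movies table.
rows = [
    {"Title": "runaway bride", "Year": 1999, "Genre": "comedy"},
    {"Title": "example movie", "Year": 2000, "Genre": "thriller"},
]

def index_by(rows, key):
    """Group rows by the value in column `key` (hypothetical helper)."""
    index = {}
    for row in rows:
        index.setdefault(row[key], []).append(row)
    return index

# Built once: one pass over all rows.
title_index = index_by(rows, "Title")

def row_for_title(title):
    """Dictionary lookup instead of scanning every row; titles are unique."""
    return title_index[title][0]

print(row_for_title("runaway bride")["Year"])
```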
Question 1.0

Set expected_row_sum to the number that you expect will result from summing all proportions in each row, excluding the first five columns. Think about what any one row adds up to.

    # Set row_sum to a number that's the (approximate) sum of each row of word proportions.
    expected_row_sum = 1

    grader.check("q1_0")

    q1_0 passed!

This dataset was extracted from a dataset from Cornell University. After transforming the dataset (e.g., converting the words to lowercase, removing the naughty words, and converting the counts to frequencies), we created this new dataset containing the frequency of 5000 common words in each movie.

    print('Words with frequencies:', movies.drop(np.arange(5)).num_columns)
    print('Movies with genres:', movies.num_rows)

    Words with frequencies: 5000
    Movies with genres: 333

1.1. Word Stemming

The columns other than "Title", "Year", "Rating", "Genre", and "# Words" in the movies table are all words that appear in some of the movies in our dataset. These words have been stemmed, or abbreviated heuristically, in an attempt to make different inflected forms of the same base word into the same string. For example, the column "manag" is the sum of proportions of the words "manage", "manager", "managed", and "managerial" (and perhaps others) in each movie. This is a common technique used in machine learning and natural language processing.

Stemming makes it a little tricky to search for the words you want to use, so we have provided another table called vocab_table that will let you see examples of unstemmed versions of each stemmed word. Run the code below to load it.

Note: You should use vocab_table for the rest of Section 1.1, not vocab_mapping.

    # Just run this cell.
    vocab_mapping = Table.read_table('stem.csv')
    stemmed = np.take(movies.labels, np.arange(3, len(movies.labels)))
    vocab_table = Table().with_column('Stem', stemmed).join('Stem', vocab_mapping)
    vocab_table.take(np.arange(1100, 1110))

    Stem  Word
    bond  bonding
    book  booking
    book  booked
    book  book

Question 1.1.1

Using vocab_table, find the stemmed version of the word "elements" and assign the value to stemmed_message.

    stemmed_message = vocab_table.where(vocab_table.column("Word"), are.equal_to("elements").colu
    stemmed_message

    'element'

    grader.check("q1_1_1")

    q1_1_1 passed!

Question 1.1.2

What stem in the dataset has the most words that are shortened to it? Assign most_stem to that stem.

    most_stem = vocab_table.where("Stem").column(0).item(0)
    most_stem

    a

    grader.check("q1_1_2")

    q1_1_2 passed!

Question 1.1.3

What is the longest word in the dataset whose stem wasn't shortened? Assign that to longest_uncut. Break ties alphabetically from Z to A (so if your options are "cat" or "bat", you should pick "cat"). Note that when sorting letters, the letter a is smaller than the letter z.

Hint 1: vocab_table has 2 columns: one for stems and one for the unstemmed (normal) word. Find the longest word that wasn't cut at all (same length as stem).

Hint 2: There is a table function that allows you to compute a function on every element in a column. Check Python Reference if you aren't sure which one.

Hint 3: Check the comments in the cell below if you are stuck.
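The logic Question 1.1.3 asks for can be sketched in plain Python, without the Table API. The (stem, word) pairs below are hypothetical stand-ins for vocab_table rows, and this is only one possible approach: keep the words whose length equals the length of their stem, then take the longest one, breaking ties from Z to A. The result is a single string, which is what the grader's type check expects.

```python
# Hypothetical (stem, word) pairs standing in for rows of vocab_table.
pairs = [
    ("book", "book"),
    ("book", "booking"),
    ("manag", "manager"),
    ("cat", "cat"),
    ("bat", "bat"),
    ("water", "water"),
    ("paper", "paper"),
]

# A word is "uncut" when stemming removed nothing: same length as its stem.
uncut = [word for stem, word in pairs if len(word) == len(stem)]

# Compare by length first; among equally long words the alphabetically
# later one wins, which is the Z-to-A tie-break ("cat" beats "bat").
longest_uncut = max(uncut, key=lambda word: (len(word), word))

print(longest_uncut, type(longest_uncut) == str)
```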
    # In our solution, we found it useful to first add columns with
    # the length of the word and the length of the stem,
    # and then to add a column with the difference between those lengths.
    # What will the difference be if the word is not shortened?
    tbl_with_lens = vocab_table.apply(len, "Word")
    tbl_with_diff = vocab_table.apply(len, "Word") - vocab_table.apply(len, "Stem")
    longest_uncut = vocab_table.where("Word", tbl_with_diff)
    longest_uncut

    Stem  Word

    grader.check("q1_1_3")

    q1_1_3 results:

    q1_1_3 - 1 result:
        Trying:
            type(longest_uncut) == str
        Expecting:
            True
        Line 1, in q1_1_3
        Failed example:
            type(longest_uncut) == str
        Expected:
            True
        Got:
            False

Question 1.1.4

How many stems have only one word that is shortened to them? For example, if the stem "book" only maps to the word "books" and if the stem [...] only maps to the word [...], both should be counted as stems that map
1.2. Exploratory Data Analysis: Linear Regression

Let's explore our dataset before trying to build a classifier. To start, we'll use the associated proportions to investigate the relationship between different words.

The first association we'll investigate is the association between the proportion of words that are "outer" and the proportion of words that are "space".

As usual, we'll investigate our data visually before performing any numerical analysis. Run the cell below to plot a scatter diagram of "space" proportions vs "outer" proportions and to create the outer_space table. Each point on the scatter plot represents one movie.

    # Just run this cell!
    outer_space = movies.select("outer", "space")
    outer_space.scatter("outer", "space")
    plots.axis([-0.0005, 0.001, -0.0005, 0.003]);
    plots.xticks(rotation=45);

[Scatter plot: "space" proportion on the vertical axis against "outer" proportion on the horizontal axis, one point per movie]

Question 1.2.1

Looking at that chart it is difficult to see if there is an association. Calculate the correlation coefficient for the potential linear association between the proportion of words that are "outer" and the proportion of words that are "space" for every movie in the dataset, and assign it to outer_space_r.

Hint: If you need a refresher on how to calculate the correlation coefficient, check out Ch 15.1.
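As a refresher, the correlation coefficient can be computed as the average of the products of the two variables in standard units. The sketch below uses only the Python standard library and made-up numbers; the helper names standard_units and correlation are assumptions for illustration, and the population standard deviation is used, which I believe matches the textbook's convention.

```python
from statistics import mean, pstdev

def standard_units(values):
    """Convert values to standard units: (value - mean) / population SD."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def correlation(xs, ys):
    """Average of the products of xs and ys in standard units."""
    xs_su = standard_units(xs)
    ys_su = standard_units(ys)
    return mean([x * y for x, y in zip(xs_su, ys_su)])

# Made-up data, not taken from the movies table.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(r)
```

With the movies table, the two lists would be replaced by the "outer" and "space" columns.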
    # These two arrays should make your code cleaner!
    outer = movies.column("outer")
    space = movies.column("space")

    outer_su = ...
    space_su = ...

    outer_space_r = ...
    outer_space_r

    Ellipsis

    grader.check("q1_2_1")

Question 1.2.2

Choose two different words in the dataset with a magnitude (absolute value) of correlation higher than 0.2 and plot a scatter plot with a line of best fit for them. Please do not pick "outer" and "space" or "san" and "francisco". The code to plot the scatter plot and line of best fit is given for you; you just need to calculate the correct values for r, slope and intercept.

Hint 1: It's easier to think of words with a positive correlation, i.e. words that are often mentioned together. Try to think of common phrases or idioms.

Hint 2: Refer to Section 15.2 of the textbook for the formulas.
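The line of best fit follows from r: the slope is r times SD(y) / SD(x), and the intercept is mean(y) minus slope times mean(x). Here is a minimal standard-library sketch with made-up numbers; the helper names are assumptions and the data is not from the movies table.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average product of the two variables in standard units (population SD)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])

def fit_line(xs, ys):
    """Return (r, slope, intercept) of the regression line for predicting ys from xs."""
    r = correlation(xs, ys)
    slope = r * pstdev(ys) / pstdev(xs)
    intercept = mean(ys) - slope * mean(xs)
    return r, slope, intercept

# Made-up data standing in for two word-proportion columns.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r, slope, intercept = fit_line(xs, ys)
print(r, slope, intercept)
print(abs(r) > 0.2)   # the magnitude check the plot title reports
```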
    word_x = ...
    word_y = ...

    # These arrays should make your code cleaner!
    arr_x = movies.column(word_x)
    arr_y = movies.column(word_y)

    x_su = ...
    y_su = ...

    r = ...

    slope = ...
    intercept = ...

    # DON'T CHANGE THESE LINES OF CODE
    movies.scatter(word_x, word_y)
    max_x = max(movies.column(word_x))
    plots.title(f"Correlation: {r}, magnitude greater than .2: {abs(r) > 0.2}")
    plots.plot([0, max_x * 1.3], [intercept, intercept + slope * (max_x*1.3)], color='gold');

Question 1.2.3

Imagine that you picked the words "san" and "francisco" as the two words that you would expect to be correlated because they compose the city name San Francisco. Assign san_francisco to either the number 1 or the number 2 according to which statement is true regarding the correlation between "san" and "francisco."

1. "san" can also precede other city names like San Diego and San Jose. This might lead to "san" appearing in movies without "francisco," and would reduce the correlation between "san" and "francisco."

2. "san" can also precede other city names like San Diego and San Jose. The fact that "san" could appear more often in front of different cities and without "francisco" would increase the correlation between "san" and "francisco."

    san_francisco = ...

    grader.check("q1_2_3")