(b) (10 marks) We want to be able to carry out an analysis of words in long documents to find the most frequently used w
Posted: Sat May 14, 2022 6:49 pm
(b) (10 marks)
We want to be able to carry out an analysis of words in long
documents to find the most frequently used words. This can be used,
for example, to identify the most important words for language
learning or to try to identify authors in old literary works. Later
on we will ask you to analyse Shakespeare's Hamlet to find the 20
most frequent words and the number of times each word occurs.
Because the most common words are mainly stop words (articles,
prepositions, etc.) and the play's characters (Hamlet, Horatio,
etc.), we will also want the ability to exclude certain words from
the analysis. First we want to explore the problem in a more
general abstract form. Explain the algorithms and ADTs you would
use for the following problem. Given a filename (string), a
positive integer n and a list of excluded words (strings), find the
n most frequent words in the file, apart from the excluded words,
and their frequencies, given in descending order of frequency.
(i) (5 marks)
Write your answer in English, showing how your solution would
work. The main ADT you use should be a bag, but if you need other
ADTs or data structures you are free to choose from others covered
in this module so far, such as lists, sets, queues, priority queues,
etc.
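
As a rough illustration of one way part (i) could be approached, here is a minimal sketch that uses Python's collections.Counter as the bag and a set for the excluded words. The function name most_frequent_words, the case-insensitive matching, and the rule that a word is a run of letters or apostrophes are all assumptions for illustration; the question itself does not define how words are split.

```python
import re
from collections import Counter


def most_frequent_words(filename, n, excluded):
    """Return up to n (word, count) pairs, most frequent first.

    Hypothetical helper: the name and the tokenising rule are assumptions.
    """
    # A set gives fast membership tests for the excluded words.
    excluded_set = {word.lower() for word in excluded}
    # Counter plays the role of the bag ADT: it maps each word to its count.
    bag = Counter()
    with open(filename, encoding="utf-8") as file:
        for line in file:
            # Assumption: a word is a run of letters or apostrophes,
            # compared case-insensitively.
            for word in re.findall(r"[a-z']+", line.lower()):
                if word not in excluded_set:
                    bag[word] += 1
    # most_common(n) returns the n highest counts in descending order.
    return bag.most_common(n)
```

A call such as most_frequent_words("hamlet.txt", 20, ["the", "and", "hamlet"]) would then give the 20 most frequent remaining words, assuming a file of that name exists.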
(ii) (5 marks)
Now justify your solution by explaining the characteristics and
the expected performance of each ADT or algorithm used, in standard
Python implementations.
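
For part (ii), the main cost to reason about is usually the final selection step. The sketch below is only an illustration under assumptions: the bag is a plain dict of word to count, and the helper names top_n_with_heap and top_n_by_sorting are hypothetical. In CPython, dict and set are hash tables, so adding to a count or testing membership is expected constant time. With m distinct words, a full sort costs about O(m log m), while heapq.nlargest costs about O(m log n), which tends to be cheaper when n is much smaller than m.

```python
import heapq


def top_n_with_heap(counts, n):
    """Select the n largest counts without sorting every item."""
    # heapq.nlargest keeps track of at most n candidates at a time.
    return heapq.nlargest(n, counts.items(), key=lambda pair: pair[1])


def top_n_by_sorting(counts, n):
    """Sort every (word, count) pair by count, then keep the first n."""
    ordered = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return ordered[:n]
```

Both helpers return the pairs in descending order of count; for short documents the difference in running time is unlikely to be noticeable.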
Add your answer for Q1(b)(ii) here: