Part A: The Index
In any case, an index is structure where, for each term, we have
a list of the documents that this term appears in. Consider the
following expository examples of terms and documents they appear
in:
We could represent an index of these terms as something
like:
If we wanted to represent such a mapping from terms to a list of
documents (file names) in Python, a dictionary would be a natural
structure to use. Thus, the index could be represented as the
following Python dictionary:
Note that in the dictionary above, the terms (strings) are the
keys, and the values are lists of the names of files containing
those terms (the file names are strings).
Building an index
Your task in this part of the assignment is to write the code
the builds an index for a set of text files. More specifically, you
will be implementing a function:
This function is passed the following information:
You will build the index based on the set of files specified in
the parameter filenames. For each file in the filenames list, you
should parse out all the terms in the
file in order to add appropriate entries to the index that you are
building.
Terms are defined as follows:
To help implement the third bullet point above, the Python
string library (which is already imported in the starter code we
give you with the line import string) provides a constant
called string.punctuation, which is a string containing all
the punctuation marks in Python. You can use this constant in
conjunction with the strip function on strings to remove
punctuation marks from just the beginning and ending of a string,
as shown in the example from the Python console below:
Recall that the strip function only removes characters from the
beginning and ending of a string. strip will not remove punctuation
marks in the middle of a string. Thus, in the example above, the
period in the middle of 'j.lo' remains after the call to strip.
To make this all more concrete, say that we want to index the
text file 'doc1.txt' shown below:
doc1.txt
In parsing this file, we should produce the following terms:
As you produce terms from a file, you should add appropriate
entries into the index for each of the terms. For example, if we
started with an index that was the empty dictionary, the index
should look as follows after processing the file 'doc1.txt':
Say, we now process another file, 'doc2.txt', shown below:
doc2.txt
We should appropriately update the index we had before with the
terms found in 'doc2.txt', we should result in the index shown
below.
Note that in cases where a term found in 'doc2.txt' previously
existed in the index (such as 'are' and 'strong'), the list of
documents that contain that term was expanded to include the file
'doc2.txt'. In cases where a term found in 'doc2.txt' was not
previously in the index (such as 'you' and 'yoda'), a new entry is
added to the index to indicate that the given term appeared in
'doc2.txt'.
Building the dictionary file_titles
In real-world search engines, we often want to store additional
information about each file while we are processing it in order to
use this information later when we want to display search results.
For example, for web pages, we might store the title of the page.
For news articles, we might store the headline of the article. In
the news article data we provide in this assignment, the first line
of each file is a title for the article in that file, which we want
to store for future reference. Here is an example of the first
portion of two of the actual files (named '001.txt' and '002.txt',
respectively) in the BBC News article data that we provide for you
in the assignment:
[ `001.txt` ]{style="max-width:50%"} ``` {.console
style="white-space: pre-wrap;"} Broadband steams ahead in the US
More and more Americans are joining the internet's fast lane,
according to official figures. ... ``` [ `002.txt`
]{style="max-width:50%"} ``` {.console style="white-space:
pre-wrap;"} EA to take on film and TV giants Video game giant
Electronic Arts (EA) says it wants to become the biggest
entertainment firm in the world. ... ```
When we are process each file to build up the index, we also
want to add an entry to the dictionary file_titles to keep track of
the title for each file. Recall, that the title is simply the first
line of the file (with the "newline" character at the end of the
line removed). We store this information in file_titles, where the
file name (string) is the key and the title (string) is the value
for an entry in the dictionary.
So, if we started with file_titles as an empty dictionary (which
is the case when your create_index function is called), after we
process the files '001.txt' and '002.txt', the file_titles
dictionary should look as follows:
file_titles.txt
Running your create_index function
You can run your create_index function by running the
searchengine.py program and specifying the directory name for a set
of text files that you would like to index. We provide two such
data sets for you to test your code. The first is in a directory
named small, which includes three very short text files, making the
size of the resulting index manageable to inspect manually. You can
run your program with the small dataset in a Python Terminal using
this command (on a Mac, use python3 instead of py):
In the searchengine.py program, we provide a main function that
calls your create_index function and prints the resulting index and
file_titles on the terminal. The output printed by the program on
the small dataset, should look as shown below. (Note that the
output below is produced on a mac, where file paths use the
forward-slash character: '/'. On a PC, you would see a double
backslash '\\' in the file paths instead of a forward-slash).
``` {.console style="white-space: pre-wrap;"} Index: {'file1':
['small/1.txt'], 'title': ['small/1.txt', 'small/2.txt',
'small/3.txt'], 'apple': ['small/1.txt', 'small/2.txt',
'small/3.txt'], 'ball': ['small/1.txt', 'small/3.txt'], 'dog':
['small/1.txt', 'small/2.txt'], 'elephant': ['small/1.txt'],
'frog': ['small/1.txt'], 'file2': ['small/2.txt'], 'carrot':
['small/2.txt', 'small/3.txt'], 'file3': ['small/3.txt'], 'gerbil':
['small/3.txt'], 'hamster': ['small/3.txt'], 'iguana':
['small/3.txt'], 'lizard': ['small/3.txt']} File names ->
document titles: {'small/1.txt': '** File1 title ',
'small/2.txt': ' File2 title ',
'small/3.txt': ' File3 title **'}
The output produced in that case is too large to verify
manually, but running on such a large dataset is a good way to see
if you program crashes on any cases that might have been missed
with the small dataset.
Part B: Using the index to search
Your task in this part of the assignment is to write the code
that implements the search functionality making use of the index
you built in Part A. Specifically, you will be implementing a
function:
This function is passed the following information:
Your search function should return a list of the names of the
files that contain all of the terms in the given query. As we
discussed in class, you can determine which files contain all the
query terms using the index. Recall that in the index, the value
associated with each term (key) is a list of the files that contain
the given term. This list of files is called the "posting list" for
the term. In order to determine which files contain all of the
terms in a query, you start with the posting list for the first
term in the query. You then consecutively consider the overlap
(i.e., the common elements) of the posting list you have with the
posting list associated with each subsequent term in the query.
When you've processed all the terms in the query in this way, the
posting list you have left should contain only those files that
contain every term in the query.
To make things more concrete, if you were to build an index on
the small dataset, your search function should return the
respective results shown in the seven examples below. (Again, note
that the examples below are produced on a Mac, where file paths
use: /. On a PC, you would see \
in the file paths instead).
Running your search function
> py searchengine.py small -s
In search mode, the program will first build an index (and a
file_title dictionary) on the files in the directory you specify by
calling your create_index function. It will then repeatedly ask the
user (you) for a query, which is sent (along with the index) to
your search function to produce a posting list of results. These
results are then displayed on the terminal, where, for each result,
the title of the article and the associated file name are listed
(making use of the file_title dictionary produced by your
create_index function). You can end the program by simply pressing
Enter when asked for a query (i.e., giving the empy query).
When you've think you've gotten your program working with the
small dataset, try running your program with the BBC News data as
follows:
Here is the output of a working program running on some sample
queries on the BBC News data (user input is
in blue italics). You can see if
your program produces the same output. (Note that the order of the
articles printed might be different depending on how you wrote your
code, but the set of articles that match each query should be the
same as in this sample run.)
Part A: The Index In any case, an index is structure where, for each term, we have a list of the documents that this ter
-
- Site Admin
- Posts: 899603
- Joined: Mon Aug 02, 2021 8:13 am