•Our first task will be to extract the text data that we are
interested in. Take a moment and review the
file synthetic.txt.
•You will have noticed there are 17 lines in total, but only the
subset of data between the lines *** START OF SYNTHETIC TEST
CASE *** and *** END OF SYNTHETIC TEST CASE *** is
to be processed.
•Each of the files provided to you has a section defined like
this. Specifically:
•The string "*** START OF" indicates the beginning of
the region of interest
•The string "*** END" indicates the end of the region
of interest for that file
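The marker rules above can be sketched as a small helper that works on a list of lines. This is only one possible approach: the name lines_in_region is hypothetical, and the sketch assumes that each marker appears at the very start of its line and that the marker lines themselves are not part of the region.

```python
START_MARKER = "*** START OF"
END_MARKER = "*** END"


def lines_in_region(lines):
    """Return the lines strictly between the start and end marker lines.

    Assumption: markers sit at the start of their line, and the marker
    lines themselves are excluded from the result.
    """
    region = []
    inside = False
    for line in lines:
        if not inside:
            # Still in the header: look for the start marker.
            if line.startswith(START_MARKER):
                inside = True
        elif line.startswith(END_MARKER):
            # End marker reached: nothing after it is of interest.
            break
        else:
            region.append(line)
    return region


sample = [
    "header to ignore",
    "*** START OF SYNTHETIC TEST CASE ***",
    "first kept line",
    "second kept line",
    "*** END OF SYNTHETIC TEST CASE ***",
    "footer to ignore",
]
print(lines_in_region(sample))
```

If a file has no start marker, this sketch returns an empty list, which seems like a reasonable default but is not something the task statement specifies.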
•Write a function, get_words_from_file(filename), that
returns a list of lower case words that are within the region of
interest.
•The professor wants every word in the text file but does not
want any of the punctuation.
•They share with you a regular
expression: "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", that finds
all words that meet this definition.
•Here is an example of using this regular expression to process
a single line:
    import re
    line = "james' words included hypen-words!"
    words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line)
    print(words_on_line)
You don't need to understand how this regular expression works. You
just need to work out how to integrate it into your solution.
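Because the expression only matches the letters a to z, a line probably needs to be lower-cased before it is searched. The sketch below illustrates that; the helper name words_on_line and the constant WORD_PATTERN are made up for this example and are not part of the task.

```python
import re

# The regular expression from the task, stored under a hypothetical name.
WORD_PATTERN = "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"


def words_on_line(line):
    """Return the lowercase words found on a single line of text."""
    # Lower-case first: the pattern has no upper-case letters in it.
    return re.findall(WORD_PATTERN, line.lower())


print(words_on_line("Toby's code was RATHER interesting: a doc-string drought!"))
# Without lower(), the capital T is skipped and the word is cut short.
print(re.findall(WORD_PATTERN, "Toby's"))
```

The first call keeps the apostrophe in toby's and the hyphen in doc-string while dropping the colon and exclamation mark. The second call shows why the lower-casing step matters: it yields oby's instead of toby's.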
•Feel free to write helper functions as you see fit but remember
these will need to be included in your answer to this question and
subsequent questions.
•We have used books that were encoded in UTF-8, which means
you will need to use the optional encoding parameter when opening
files for reading. That is, your call to open should look
like open(filename, encoding='utf-8'). This will be especially
helpful if your operating system doesn't set Python's default
encoding to UTF-8.
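Putting the requirements together, one possible shape for get_words_from_file is sketched below. It is not the official solution. It assumes the markers start their lines, that the marker lines are excluded, and that lower-casing each line before applying the regular expression is what is intended; WORD_PATTERN is a hypothetical constant name.

```python
import re

# The regular expression from the task, stored under a hypothetical name.
WORD_PATTERN = "[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"


def get_words_from_file(filename):
    """Return the lowercase words between the start and end markers."""
    words = []
    inside = False
    # encoding='utf-8' as the task requires, whatever the platform default.
    with open(filename, encoding="utf-8") as infile:
        for line in infile:
            if not inside:
                # Assumption: the start marker begins its line.
                inside = line.startswith("*** START OF")
            elif line.startswith("*** END"):
                break  # everything after the end marker is ignored
            else:
                words.extend(re.findall(WORD_PATTERN, line.lower()))
    return words
```

Reading the file line by line avoids holding a whole book in memory, although for files of this size reading everything at once would work just as well.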
Test 1:
    filename = "abc.txt"
    words2 = get_words_from_file(filename)
    print(filename, "loaded ok.")
    print("{} valid words found.".format(len(words2)))
    print("Valid word list:")
    print("\n".join(words2))
Result 1:
    abc.txt loaded ok.
    3 valid words found.
    Valid word list:
    a
    ba
    bac

Test 2:
    filename = "synthetic.txt"
    words = get_words_from_file(filename)
    print(filename, "loaded ok.")
    print("{} valid words found.".format(len(words)))
    print("Valid word list:")
    for word in words:
        print(word)
Result 2 (each word is printed on its own line; the list is wrapped here to save space):
    synthetic.txt loaded ok.
    73 valid words found.
    Valid word list:
    toby's code was rather interesting it had the following issues
    short meaningless identifiers such as n and n deep complicated
    nesting a doc-string drought very long rambling and unfocused
    functions not enough spacing between functions inconsistent spacing
    before and after operators just like this here boy was he going to
    get a low style mark let's hope he asks his friend bob to help him
    bring his code up to an acceptable level