Our first task will be to extract the text data that we are interested in. Take a moment and review the file synthetic.t

correctanswer · Post by **correctanswer** » Fri Jun 10, 2022 11:58 am


Our first task will be to extract the text data that we are interested in. Take a moment and review the file synthetic.txt. You will have noticed there are 17 lines in total, but only the subset of data between the lines `*** START OF SYNTHETIC TEST CASE ***` and `*** END OF SYNTHETIC TEST CASE ***` is to be processed. Each of the files provided to you has a section defined like this. Specifically:

The string `"*** START OF"` indicates the beginning of the region of interest.
The string `"*** END"` indicates the end of the region of interest for that file.

Write a function, get_words_from_file(filename), that returns a list of lower case words that are within the region of interest. The professor wants every word in the text file, but does not want any of the punctuation. They share with you a regular expression, `"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"`, that finds all words that meet this definition. Here is an example of using this regular expression to process a single line:

    import re
    line = "james' words included hypen-words!"
    words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line)
    print(words_on_line)

You don't need to understand how this regular expression works. You just need to work out how to integrate it into your solution. Feel free to write helper functions as you see fit, but remember these will need to be included in your answer to this question and subsequent questions.

We have used books that were encoded in UTF-8, and this means you will need to use the optional encoding parameter when opening files for reading. That is, your open file call should look like open(filename, encoding="utf-8"). This will be especially helpful if your operating system doesn't set Python's default encoding to UTF-8.

For example:

Test:

    filename = "abc.txt"
    words2 = get_words_from_file(filename)
    print(filename, "loaded ok.")
    print("{} valid words found.".format(len(words2)))
    print("Valid word list:")
    print("\n".join(words2))

Result:

    abc.txt loaded ok.
    3 valid words found.
    Valid word list:
    a
    ba
    bac

Test:

    filename = "synthetic.txt"
    words = get_words_from_file(filename)
    print(filename, "loaded ok.")
    print("{} valid words found.".format(len(words)))
    print("Valid word list:")
    for word in words:
        print(word)

Result (the 73 words are printed one per line; they are shown here in order, wrapped for space):

    synthetic.txt loaded ok.
    73 valid words found.
    Valid word list:
    toby's code was rather interesting it had the following issues short
    meaningless identifiers such as n and n deep complicated nesting a
    doc-string drought very long rambling and unfocused functions not
    enough spacing between functions inconsistent spacing before and after
    operators just like this here boy was he going to get a low style mark
    let's hope he asks his friend bob to help him bring his code up to an
    acceptable level

Answer:

    import re

    def get_words_from_file(filename):
        """Returns a list of lower case words that are within the region of
        interest, every word in the text file, but, not any of the punctuation."""
        file = open(filename, 'r', encoding='utf-8')
        flag = False  # True while inside the region of interest
        words = []
        for line in file:
            # Both markers are compared against the full synthetic.txt marker lines
            if str(line).strip() == "*** START OF SYNTHETIC TEST CASE ***":
                flag = True
            elif str(line).strip() == "*** END OF SYNTHETIC TEST CASE ***":
                flag = False
                break
            elif flag:
                new_line = line.lower()
                words_on_line = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", new_line)
                words.extend(words_on_line)
        file.close()
        return words

Test results: for the abc.txt and synthetic.txt tests above, the Got column matches the Expected column. A further test that begins with filename = "short.txt" did not complete: testing was aborted, 2 tests were not run due to previous errors, and one or more hidden tests failed.
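One possible reason for the hidden-test failure is that the answer compares each line against the full synthetic.txt marker text, while the assignment only says that the string "*** START OF" begins the region and "*** END" ends it. Below is a minimal sketch that matches on those prefixes instead. The names WORD_PATTERN, START_MARKER, END_MARKER and in_region are made up for illustration, and the idea that other files (such as short.txt) use different marker wording after the prefix is an assumption, not something the post confirms.

```python
import re

# Regular expression quoted in the assignment text.
WORD_PATTERN = r"[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+"

# Marker prefixes taken from the assignment wording. Matching only the
# prefix (not the whole synthetic.txt marker line) is an assumption about
# what the hidden tests need.
START_MARKER = "*** START OF"
END_MARKER = "*** END"


def get_words_from_file(filename):
    """Return the lower-case words found between the start and end markers."""
    words = []
    in_region = False
    # The with-statement closes the file even if an error occurs.
    with open(filename, encoding="utf-8") as infile:
        for line in infile:
            stripped = line.strip()
            if not in_region:
                # Ignore everything until the start marker is seen.
                if stripped.startswith(START_MARKER):
                    in_region = True
            elif stripped.startswith(END_MARKER):
                # Stop reading once the region of interest ends.
                break
            else:
                words.extend(re.findall(WORD_PATTERN, line.lower()))
    return words
```

With this version, a file whose markers read, say, `*** START OF SHORT TEST ***` and `*** END OF SHORT TEST ***` would be handled the same way as synthetic.txt. Whether that is what the grader checks can only be confirmed by resubmitting.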