Python: 1. define a class which will be used to model the N-gram model over a text. The class must be called Ngram, and


Python:
1. define a class which will be used to model the N-gram model
over a text. The class must be called Ngram, and it needs a
constructor with arguments filename (the path to a file from which
the model will be extracted) and n (representing N). The default
value for filename must be the empty string, and n will be zero if
not specified otherwise. In addition to the filename and N, the
class must have three dictionaries as instance variables, which
need to be initialized by the constructor, but are going to be
filled during the later tasks. Their names must be raw_counts,
prob, and cond_prob.
Example usage:
>>> ngram_model = Ngram("example.txt", 2)
>>> ngram_model.n, ngram_model.filename
(2, 'example.txt')
>>> ngram_model.raw_counts, ngram_model.prob, ngram_model.cond_prob
({}, {}, {})
2. Extracting N-gram Counts
The first step towards modeling the conditional distribution is to
extract the raw N-gram counts from the sentence collection. This
time, we are providing you with an adapted smart tokenization
function tokenize_smart(sentence) which already splits the input
sentences well enough into tokens. Use this function in a method
extract_raw_counts() of your class, which fills the dictionary
assigned to the raw_counts instance variable with raw N-gram
counts. The logical structure of your code should be as
follows:
for each line in the file:
    cut off the trailing newline character
    tokenize the sentence (= the line) using the provided function
    prepend N-1 instances of "BOS" to the token list and append
        N-1 instances of "EOS" to its end
    for each position i in the token list:
        put the tokens from i to i+N into a new tuple (= the N-gram)
        increase the count of the N-gram in raw_counts by 1
            (adding the N-gram as a key if not previously present)
Your Code:
import re

class Ngram:
    """
    A class for n-gram models based on a text file.

    :attr filename: the name of the file that the model is based on
    :type filename: str
    :attr n: the number of tokens in the tuples
    :type n: int
    :attr raw_counts: the raw counts of the n-gram tuples
    :type raw_counts: dict[tuple[str, ...], int]
    :attr prob: the probabilities of the n-gram tuples
    :type prob: dict[tuple[str, ...], float]
    :attr cond_prob: the probability distributions over the respective
        next tokens for each (n-1)-gram
    :type cond_prob: dict[tuple[str, ...], dict[str, float]]
    """

    # 1
    def __init__(self, filename="", n=0):
        """
        Initialize an Ngram object.

        :param filename: The name of the file to base the model on.
        :param n: The number of tokens in the n-gram tuples.
        """
        pass

    # 2
    def extract_raw_counts(self):
        """
        Compute the raw counts for each n-gram occurring in the text.
        """
        pass