Python: define an Ngram class to model an N-gram model over a text
Posted: Sat Feb 19, 2022 3:22 pm
Python:
1. Define a class which will be used to represent an N-gram model
over a text. The class must be called Ngram, and it needs a
constructor with arguments filename (the path to a file from which
the model will be extracted) and n (representing N). The default
value for filename must be the empty string, and n will be zero if
not specified otherwise. In addition to the filename and N, the
class must have three dictionaries as instance variables, which
need to be initialized by the constructor, but are going to be
filled during the later tasks. Their names must be raw_counts,
prob, and cond_prob.
Example usage:
>>> ngram_model = Ngram("example.txt", 2)
>>> ngram_model.n, ngram_model.filename
(2, 'example.txt')
>>> ngram_model.raw_counts, ngram_model.prob, ngram_model.cond_prob
({}, {}, {})
2. Extracting N-gram Counts
The first step towards modeling the conditional distribution is to
extract the raw N-gram counts from the sentence collection. This
time, we are providing you with an adapted smart tokenization
function tokenize_smart(sentence) which already splits the input
sentences well enough into tokens. Use this function in a method
extract_raw_counts() of your class, which fills the dictionary
assigned to the raw_counts instance variable with raw N-gram
counts. The logical structure of your code should be as
follows:
for each line in the file
    cut off the trailing newline character
    tokenize the sentence (= the line) using the function provided
    prepend N-1 instances of "BOS" to the token list and append
    N-1 instances of "EOS" to its end
    for each starting position i in the token list
        put the N tokens from i up to (but not including) i+N
        into a new tuple (= the N-gram)
        increase the count of the N-gram in raw_counts by 1
        (adding the N-gram as a key if previously not present)
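As a sanity check, the padding-and-counting loop above can be sketched on a single token list; this is a minimal standalone version of the counting step (ngram_counts and the toy sentence are illustrative names, not part of the assignment), with a plain tuple-counting loop in place of the class method:

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Pad with N-1 boundary markers on each side, as in the pseudocode.
    padded = ["BOS"] * (n - 1) + tokens + ["EOS"] * (n - 1)
    counts = Counter()
    # Every window of N consecutive tokens is one N-gram.
    for i in range(len(padded) - n + 1):
        counts[tuple(padded[i:i + n])] += 1
    return dict(counts)

counts = ngram_counts(["the", "cat", "sat"], 2)
# Bigrams: ("BOS", "the"), ("the", "cat"), ("cat", "sat"), ("sat", "EOS")
```

Note that for N = 2 the padded list yields exactly len(tokens) + 1 bigrams, one per adjacent pair including the boundary markers.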
3. Computing N-gram Probabilities
The next step is to convert the raw frequency counts into
probabilities. For the implementation of the class
method extract_probabilities(), you need to compute the sum of all
raw counts (= the total number
of N-grams in the sentence list), and then fill the dictionary
assigned to the prob instance variable with the
raw counts divided by the sum of all counts.
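Taken in isolation, the count-to-probability step is just a normalization over the count dictionary; a minimal sketch on hypothetical bigram counts (counts_to_probabilities is an illustrative helper name, not required by the assignment):

```python
def counts_to_probabilities(raw_counts):
    # Total number of N-grams observed in the text.
    total = sum(raw_counts.values())
    # Relative frequency of each N-gram.
    return {ngram: count / total for ngram, count in raw_counts.items()}

probs = counts_to_probabilities({("the", "cat"): 3, ("cat", "sat"): 1})
# probs[("the", "cat")] == 0.75, probs[("cat", "sat")] == 0.25
```

Since every raw count is divided by the same total, the resulting values always sum to 1, which is a useful property to assert when testing.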
Your Code:
import re

class Ngram:
    """
    A class for n-gram models based on a text file.

    :attr filename: the name of the file that the model is based on
    :type filename: str
    :attr n: the number of tokens in the tuples
    :type n: int
    :attr raw_counts: the raw counts of the n-gram tuples
    :type raw_counts: dict[tuple[str, ...], int]
    :attr prob: the probabilities of the n-gram tuples
    :type prob: dict[tuple[str, ...], float]
    :attr cond_prob: the probability distributions over the respective
        next tokens for each (n-1)-gram
    :type cond_prob: dict[tuple[str, ...], dict[str, float]]
    """

    # Task 1
    def __init__(self, filename="", n=0):
        """
        Initialize an Ngram object.

        :param filename: The name of the file to base the model on.
        :param n: The number of tokens in the n-gram tuples.
        """
        self.filename = filename
        self.n = n
        self.raw_counts = {}
        self.prob = {}
        self.cond_prob = {}

    # Task 2
    def extract_raw_counts(self):
        """
        Compute the raw counts for each n-gram occurring in the text.
        """
        with open(self.filename, encoding="utf-8") as file:
            for line in file:
                # Cut off the trailing newline and tokenize the sentence
                # (tokenize_smart is the function provided with the task).
                tokens = tokenize_smart(line.rstrip("\n"))
                # Pad with N-1 boundary markers on each side.
                tokens = ["BOS"] * (self.n - 1) + tokens + ["EOS"] * (self.n - 1)
                for i in range(len(tokens) - self.n + 1):
                    ngram = tuple(tokens[i:i + self.n])
                    self.raw_counts[ngram] = self.raw_counts.get(ngram, 0) + 1

    # Task 3
    def extract_probabilities(self):
        """
        Compute the probability of each n-gram occurring in the text.
        """
        total = sum(self.raw_counts.values())
        for ngram, count in self.raw_counts.items():
            self.prob[ngram] = count / total