Page 1 of 1

Can some please help me with this problem in python? Thank you Background. To state the problem more formally, you are g

Posted: Thu May 12, 2022 2:22 pm
by answerhappygod
Can some please help me with this problem in python? Thank
you
Background. To state the problem more formally, you are given a
set of text documents and asked to determine whether or not, and
the degree to which, any pair of documents contains heavily copied
portions of text. If one document shares a large number of phrases
in common with another, it’s likely that the two documents are
plagiarized (or the result of two students “working together” or
one student “helping” another). Shown above is an actual graph of
lab reports submitted for Intro. Physics at a large University.
This graph represents the data collected for about 800 lab reports.
Each node in the graph represents some document. Each edge
indicates the number of 6-word phrases shared between the documents
it connects. To reduce noise a threshold of 200 common phrases has
been set so a document that shares fewer than 200 6-word phrases
with all other documents is not shown. The 2 Lab Manual is a sort
of style-guide for the lab report and the two brown boxes are
sample lab reports that were distributed. (Many people apparently
borrowed liberally from these help materials). Particularly
suspicious are clusters like the one in the top-right corner: those
documents have an inordinate number of 6-word phrases in common
with each other. It is likely that those people turned in
essentially the same lab report or copied large portions from each
other.
Assignment. Your task is very similar to the one described and
shown above: find the common word sequences amont pairs of
documents from a given set of documents. Simply put, your input
will be a set of plain-text documents, and a number n; your output
will be some representation showing the number of n-word sequences
each document has in common with every other document in the set.
Such n-word sequences are called n-grams. We often use 6-grams to
detect plagiarism, but all of the code should be written so that n
is provided as a parameter.
Document Sets. You are provided three document sets on Canvas. •
sm doc set.zip - 25 documents, about 33,946 words • med doc set.zip
- 75 documents, about 91,273 words • big doc set.zip - 1354
documents, about 1,615,193 words The medium and small sets of
documents are just subsets of the big set. All documents are (bad!)
“high school” essays obtained from the website www.freeessays.cc.
It is possible that some files have some html content or some
corrupted lines. More on this below.
Details. Additional details are discussed below. • Output: We
could think of the final output as an M × M grid (where M is the
total number of documents) of numbers, with a number in each cell
being the number of “hits” (common n-grams) between that pair of
documents. For example, consider the following 5 × 5 grid:
A B C D E
A - 4 50 700 0
A - - 0 0 5
A - - - 50 0
A - - - - 0
A - - - - -
From this table, we can conclude that documents A, C, and D
share a fair number of 6-word (let’s say n is 6) phrases. We can
probably safely conclude with a high degree of certainty that
documents A and D are plagiarized from each other. When the number
of documents is high, we probably only want to prnt the matrix for
those documents that have a number of hits that is above a certain
threshold. However, when M is large, it is unmanageable to print an
M × M matrix, so we should probably list the pairs of plagiarized
documents ordered by the number of hits. For example, as
follows:
700: A, D
50: A, C
50: C, D
5: B, E
4: A, B
• Documents: As stated above, sets of documents are provided.
One set will be small (25 or so documents) for testing purposes.
The other sets are larger (one has 75 documents and the other over
1300) which are provided for you to test if your solution is
scalable. Depending on your solution, processing the medium an
larger sets of documents could take a long time. However, it is
possible to design solutions that are fast. (My solution takes
about 12 seconds for the large document set.) Your program should
be written in such a way that it can process all of the documents
in a given folder/directory.
• Strategy: What approach should you take to solve this problem?
This is up to you. The straightforward approach of comparing every
6-gram (say) to all the other 6-grams in all the other documents
will be too expensive a solution. Loosely stated, if w is the total
number of words in all the documents, this approach will take about
w 2 steps to identify all the plagiarized pairs. For the medium and
large document sets, this is too expensive. It will work, but take
far too long. You will need to use a more clever approach and think
about how best to store the data.
• Important steps in the project: 1. Process the documents: You
will need to be able to process a set of documents in a
directory/folder and produce all possible n-word sequences for each
document. It should be easy to change n and see the results. Proof
that you have achieved this sub-goal will consist of showing that
you can output all n-word seqeuences in every file in a given
folder. (I will discuss in class how to process a set of files in a
given folder.)
Some suggestions:
– When your process the files, remove all punctuation marks,
html markers, and convert all characters to lower case.
– Some files may have some characters that produce an error when
read. Set the errors flag to “ignore” when reading (I will talk
about this in class).
– Process the file line by line, rather than reading in the
entire file as a string, because doing so may cause memory issues
when your program is running on the large document set.
2. Decide how to handle the data: Now that you know how to
generate all the n-grams, you need a suitable model for how you are
going to handle all the data you have generated. In particular, you
need to think about how you are going to compute similarities
between documents. Keep in mind that you will be doing this for
every pair of documents in a given directory/folder. This means
that you will need to pay some attention to the efficiency of your
approach. (See earlier discussion on strategy.)
3. Synthesize the first two steps: This is the final step of
synthesizing the work in steps 1 and 2 to identify the suspicious
pairs of documents. A nice final product would be a file that
contains a listing of all the pairs of documents that are suspected
of plagiarism, along with the number of common n-grams between
them. For example, in my implementation, this is done by a function
called
find plagiarized pairs(data set dir, n, threshold)
which processes all the documents in the folder data set dir and
identifies all document pairs that have at least threshold n-grams
in common. I can then make the function call find plagiarized
pairs("med doc set", 6, 100) to find all pairs of plagiarized
documents from the folder med doc set that have at least 100
6-grams in common.
Additional info:
This project is mean to find all the files in a given folder
that could be considered similar. We determine what similar is by
using n-grams. An n-gram is a consecutive sequence of n words. The
consecutive n-grams from a file should overlap by one word
Example:
6-grams:
twinkle twinkle little star how i, twinkle little star how i
wonder, little star how i wonder what, star how i wonder what you,
how i wonder what you are
4-grams:
twinkle twinkle little star, twinkle little star how, little
star how i, star how i wonder, how i wonder what, i wonder what
you, wonder what you are
By doing this you should be able to input the number of words
you want for n and the number of n-grams should be stored in a
dictionary as the value and the names of each file as the key. You
should be able to use the number of n-grams for each file as a way
to check for plagerism. You should also use the glob module to
create a list of all the files in a folder.
If you need anything more, please be more specific. Thank
you.