Text Mining Exercise with Google NGram
Posted: Mon Nov 15, 2021 5:03 pm
In this exercise, we shall walk through an example of text mining
using Google’s Ngram Viewer. Google maintains a large repository of
scanned books, processed with optical character recognition (OCR),
in its library known as Google Books. The Google Ngram Viewer is an
offshoot of this project: a simple, free text mining tool that lets
researchers search, compare and visualize information from this
extensive collection of published documents in the form of n-grams
(sequences of n consecutive words). This allows users to assess
cultural and human-interest trends across a long period (1500 to
2019, or any span of years between those dates). The collection
covers approximately 4% of all books ever published, or about
5.2 million books; it is the largest collection of published texts
anywhere that has been made available for mining. Google Ngram
Viewer allows you to rapidly quantify trends over time in areas
including:
o People
o Events
o Terms
o Grammar
o Spelling
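To make the idea of an n-gram concrete, here is a minimal Python sketch that splits a made-up sentence into 1-grams and 2-grams and counts them. The whitespace tokenizer and the sample sentence are assumptions for illustration only; this is not how Google builds its corpus.

```python
from collections import Counter


def ngrams(text, n):
    """Return the n-grams in `text` as space-joined strings."""
    tokens = text.split()  # naive whitespace tokenizer (an assumption)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample sentence, used only to illustrate the counting.
sample = "data science and data analytics build on data science methods"

unigrams = Counter(ngrams(sample, 1))  # 1-grams: single words
bigrams = Counter(ngrams(sample, 2))   # 2-grams: pairs of adjacent words

print(unigrams["data"])
print(bigrams["data science"])
```

The Ngram Viewer plots, for each year, how often a phrase like this occurs relative to all n-grams of the same length in that year's books.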
The Ngram Viewer allows you to follow trends in these topics, and it
may also reveal oddities in the results caused by censorship, the
invention of new technology, or events occurring around those times
(An introduction to text mining: 2. Case study: Ngram Viewer, 2019).
It also offers the option of separating English usage between UK and
US publications. Note that the Ngram Viewer is case-sensitive.
Exercise: Try it Yourself
Now it is your turn to have a go. Let’s first look at how usage of
the term ‘analytics’ has changed over the last two centuries. The
most common variants are ‘Analytics’, ‘Data Science’ and ‘Data
Analytics’. How does the usage of those terms change over time?
A.
1. Open the Ngram Viewer in another browser window
(https://books.google.com/ngrams).
2. In the ‘phrases’ box, type your search terms, separating each
item with a comma: ‘Data Analytics, Data Science’
3. Choose the date range 1800-2019 (this is the default setting)
and keep the default corpus, English (2019).
4. Choose a smoothing value (we chose 2 to begin).
5. Hit Enter on your keyboard.
What do you see? You should get a line graph in which both terms
gain significant mention in publications after 2010. If you have a
screen snipping tool, you may insert your screenshot below:
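The smoothing value in step 4 can be pictured as a centered moving average: with a smoothing of 2, each year's value is averaged with up to two years on either side. The sketch below assumes that behaviour and uses made-up yearly values; the function name `smooth` is hypothetical and not part of any Google tool.

```python
def smooth(values, window):
    """Centered moving average with up to `window` neighbours per side."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)                # clip at the start of the series
        hi = min(len(values), i + window + 1)  # clip at the end of the series
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up yearly frequencies with a single spike in the third year.
raw = [0.0, 0.0, 10.0, 0.0, 0.0, 0.0]

print(smooth(raw, 0))  # smoothing 0 leaves the raw values unchanged
print(smooth(raw, 2))  # smoothing 2 spreads the spike over nearby years
```

A higher smoothing value gives a cleaner line but can hide short-lived spikes, so it is worth rerunning the query with a smoothing of 0 before drawing conclusions about a particular year.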
B. Now add the term ‘Analytics’ to the mix. Type in: ‘Data
Analytics, Data Science, Analytics’
Describe what you observe. If you have a screen snipping tool, you
may insert your screenshot below:
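The searches in parts A and B can also be written as a URL, which makes it easy to share or repeat a query. The sketch below only builds the link and does not contact Google. The parameter names (`content`, `year_start`, `year_end`, `corpus`, `smoothing`) and the corpus identifier `en-2019` are assumptions based on what the Ngram Viewer shows in the browser address bar, so check them against a link you generate yourself. The helper name `ngram_url` is hypothetical.

```python
from urllib.parse import urlencode


def ngram_url(phrases, year_start=1800, year_end=2019,
              corpus="en-2019", smoothing=2):
    """Build an Ngram Viewer link for a list of phrases (no request is sent)."""
    params = {
        "content": ",".join(phrases),  # comma-separated, as in the phrases box
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,              # assumed identifier for English (2019)
        "smoothing": smoothing,
    }
    # urlencode escapes spaces and commas so the link is safe to paste.
    return "https://books.google.com/ngrams/graph?" + urlencode(params)


url_a = ngram_url(["Data Analytics", "Data Science"])
url_b = ngram_url(["Data Analytics", "Data Science", "Analytics"])
print(url_a)
print(url_b)
```

Pasting a printed link into a browser should open the same chart as typing the terms by hand, provided the assumed parameter names still match the live site.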
C. Now compare with lower-case instances of the same terms, using
the same settings: ‘Data Analytics, Data Science, Analytics, data
analytics, data science, analytics’
What do you observe? What could explain the difference in trends
between the capitalized and lower-case forms of these terms? Note
your observations below.
PS: Feel free to try your own combination of terms. For example,
try comparing capital cities of the world: ‘Washington, London,
Paris, Berlin, Delhi, Tokyo, Buenos Aires’
To learn more about Google Ngram Viewer’s usage and its
limitations, go to
https://port.sas.ac.uk/mod/book/view.ph ... pterid=328
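Because the Ngram Viewer is case-sensitive, ‘Analytics’ and ‘analytics’ are counted as different terms, which is why part C produces separate lines. The sketch below shows the same effect on a made-up passage; the text and the simple punctuation stripping are assumptions for illustration.

```python
from collections import Counter

# Made-up passage mixing capitalized and lower-case forms.
text = ("Analytics is growing. Modern analytics relies on data. "
        "Data Analytics differs from data analytics.")

# Naive tokenizer: split on whitespace and strip trailing punctuation.
tokens = [t.strip(".,") for t in text.split()]

case_sensitive = Counter(tokens)                       # 'Analytics' != 'analytics'
case_insensitive = Counter(t.lower() for t in tokens)  # forms merged together

print(case_sensitive["Analytics"], case_sensitive["analytics"])
print(case_insensitive["analytics"])
```

When comparing trends, it helps to decide up front whether the capitalized form (often found in titles, headings and at the start of sentences) should be treated as the same term as the lower-case form.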
Reference:
An introduction to text mining: 2. Case study: Ngram Viewer. (n.d.).
Port.sas.ac.uk. Retrieved November 8, 2021, from
https://port.sas.ac.uk/mod/book/view.ph ... pterid=327