Stemming words

#################STEMMING WORDS#############################

Stemming is a kind of normalization. Many variations of a word carry the same core meaning and differ only in tense or some other inflection.

We stem to shorten the lookup and to normalize sentences.

Consider:

I was taking a ride in the car.
I was riding in the car.

These two sentences mean the same thing. "in the car" is the same. "I was" is the same. The "was" marks a clear past tense in both cases, so is it truly necessary to differentiate between "ride" and "riding" when we are just trying to figure out what this past-tense activity was?

No, not really.

This is just one minor example, but imagine every word in the English language, every possible tense and affix you can put on a word. Having individual dictionary entries per version would be highly redundant and inefficient, especially since, once we convert to numbers, the "value" is going to be identical.

One of the most popular stemming algorithms is the Porter stemmer, which has been around since 1979.

First, we're going to grab and define our stemmer:

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

Now, let's choose some words with a similar stem, like:

example_words = ["python","pythoner","pythoning","pythoned","pythonly"]

Next, we can easily stem by doing something like:

for w in example_words:
    print(ps.stem(w))

Our output:

python
python
python
python
pythonli
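
To make the redundancy argument concrete, here is a minimal sketch that assigns integer ids to the example words with and without stemming. The helper `build_vocab` is a hypothetical name invented for this illustration; only `PorterStemmer` and its `stem` method come from NLTK.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

def build_vocab(tokens):
    """Hypothetical helper: give each distinct token the next free integer id."""
    vocab = {}
    for token in tokens:
        # len(vocab) is evaluated before the insert, so ids run 0, 1, 2, ...
        vocab.setdefault(token, len(vocab))
    return vocab

raw_vocab = build_vocab(words)
stem_vocab = build_vocab(ps.stem(w) for w in words)

print(len(raw_vocab), "ids without stemming")
print(len(stem_vocab), "ids with stemming")
```

With the five words above, the unstemmed vocabulary needs five entries, while the stemmed one needs only two (python and pythonli).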

Now let's try stemming a typical sentence, rather than some words:

new_text = "It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)

for w in words:
    print(ps.stem(w))

Now our result is (recent NLTK releases lowercase each token by default, so It and All may print as it and all):

It
is
import
to
be
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.


 

#########################SENT DESK ###########################

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# I was taking a ride in a car
# I was riding the car
# "ride" and "riding" above are forms of the same word; the form changes, not the core meaning
# Stemming reduces the various forms of a word (e.g. write, writes, writing) to a common stem

ps=PorterStemmer()

#Example 1
example_words=["python","Pythoner","pythoned","pythonly"]

for w in example_words:
    print(ps.stem(w))

#Example 2
new_text="It is very important to be pythonly while you are Pythoning with python. All Pythoners have pythoned"
words=word_tokenize(new_text)

for w in words:
    print(ps.stem(w))

#####################################

 

 

#################GEEKS ##############################

 

Introduction to Stemming

Stemming is the process of reducing the morphological variants of a word to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. Ideally, a stemming algorithm reduces the words “chocolates” and “chocolatey” to the root word “chocolate”, and “retrieval”, “retrieved”, “retrieves” to the stem “retrieve”. In practice the stem is often a truncated string rather than a dictionary word.

More examples of words that stem to the root word "like":
->"likes"
->"liked"
->"likely"
->"liking"

Errors in Stemming:
There are two main errors in stemming: over-stemming and under-stemming. Over-stemming occurs when two words that have different stems are reduced to the same root. Under-stemming occurs when two words that share a stem are not reduced to the same root.
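
Both error types can be reproduced with a deliberately crude suffix stripper. The sketch below is not a real stemming algorithm: `toy_stem`, `stems_of` and the `SUFFIXES` list are hypothetical, invented only to show over-stemming (unrelated words collapsing to one stem) and under-stemming (related words ending up with different stems).

```python
# Hypothetical toy rule list, checked in order; not taken from any real stemmer.
SUFFIXES = ("ity", "al", "es", "e", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stems_of(words):
    """Map each word to its toy stem."""
    return {w: toy_stem(w) for w in words}

# Over-stemming: three words with different meanings share one stem.
over = stems_of(["universal", "university", "universe"])
# Under-stemming: two forms of the same word get different stems.
under = stems_of(["alumnus", "alumni"])

print(over)
print(under)
```

Here all three "univers..." words collapse to univers, while alumnus and alumni stay apart as alumnu and alumni.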

Applications of stemming are:

  1. Stemming is used in information retrieval systems like search engines.
  2. It is used to determine domain vocabularies in domain analysis.

Google search adopted word stemming in 2003. Previously a search for “cat” would not have returned “catty” or “cats”.
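
The search-engine use case can be sketched as a tiny inverted index keyed by stem rather than by raw word. Everything here is an assumption made for illustration: `toy_stem` is a hypothetical stand-in that only strips a plural "s", and `build_index` and `search` are invented names, not part of any library.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase and drop a plural 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index

def search(index, query):
    """Look up a single-word query by its stem."""
    return sorted(index.get(toy_stem(query), set()))

docs = {1: "the cat sat", 2: "two cats played", 3: "a dog barked"}
index = build_index(docs)
print(search(index, "cats"))  # ids of documents mentioning cat or cats
```

Because both the documents and the query pass through the same stemmer, a search for "cat" or "cats" returns the same documents.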

Some Stemming algorithms are:

  • Porter’s Stemmer algorithm
    It is one of the most popular stemming methods, proposed in 1980. It is based on the idea that the suffixes in the English language are made up of a combination of smaller and simpler suffixes.
    Example: EED -> EE means “if the part of the word before the EED ending contains at least one vowel-consonant sequence, change the ending to EE”, so ‘agreed’ becomes ‘agree’.

    Advantage: It is generally regarded as producing better output than the other stemmers listed here, with a lower error rate.
    Limitation: The stems produced are not always real words.
  • Lovins Stemmer
    Proposed by Lovins in 1968, it removes the longest matching suffix from a word and then recodes the result to turn the stem into a valid word.
    Example: sitting -> sitt -> sit

    Advantage: It is fast and handles irregular plurals like 'teeth' and 'tooth'.
    Limitation: It relies on a large suffix table, and it frequently fails to form valid words from the stem.
  • Dawson Stemmer
    It is an extension of the Lovins stemmer in which suffixes are stored in reversed order, indexed by their length and last letter.

    Advantage: It is fast in execution and covers more suffixes.
    Limitation: It is very complex to implement.
  • Krovetz Stemmer
    It was proposed in 1993 by Robert Krovetz. The steps are:
    1) Convert the plural form of a word to its singular form.
    2) Convert the past tense of a word to its present tense, and remove the suffix ‘ing’.
    Example: ‘children’ -> ‘child’

    Advantage: It is lightweight and can be used as a pre-stemmer for other stemmers.
    Limitation: It is inefficient on large documents.
  • Xerox Stemmer
    Advantage: It works well on large documents, and the stems produced are valid words.
    Limitation: It is language dependent and mainly implemented for English, and over-stemming may occur.
  • N-Gram Stemmer
    An n-gram is a sequence of n consecutive characters extracted from a word. Similar words will have a high proportion of n-grams in common.
    Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S*

    Advantage: It is based on string comparisons and it is language independent.
    Limitation: It requires space to create and index the n-grams and it
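
The n-gram idea from the last item above can be sketched in a few lines. The padding character `*` follows the ‘INTRODUCTIONS’ example; scoring similarity with the Dice coefficient is an assumption on my part (one common choice, not something these notes specify), and `char_ngrams` and `dice_similarity` are hypothetical names.

```python
def char_ngrams(word, n=2, pad="*"):
    """Return the padded character n-grams of a word, in order."""
    padded = pad * (n - 1) + word.upper() + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def dice_similarity(a, b, n=2):
    """Assumed similarity measure: 2 * shared n-grams / total distinct n-grams."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

print(char_ngrams("introductions"))
print(dice_similarity("introduction", "introductions"))
print(dice_similarity("introduction", "reduction"))
```

Words whose similarity exceeds some chosen threshold would be grouped together and treated as sharing a stem; the threshold itself is a tuning decision not covered here.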

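The EED -> EE rule quoted in the list above depends on what Porter calls the measure of a stem: the number of vowel-consonant sequences it contains. Below is a minimal sketch of that single rule, not the full algorithm; the function names `is_consonant`, `measure` and `eed_rule` are my own.

```python
def is_consonant(word, i):
    """Porter's convention: 'y' counts as a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-consonant sequences, the m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1  # a consonant right after a vowel closes one VC sequence
        prev_vowel = not cons
    return m

def eed_rule(word):
    """Apply only the rule (m > 0) EED -> EE; other words pass through unchanged."""
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-1]
    return word

for w in ["agreed", "feed"]:
    print(w, "->", eed_rule(w))
```

The measure is what stops short words from being cut: "agr" has measure 1, so agreed becomes agree, while "f" has measure 0, so feed is left alone.
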
Python | Stemming words with NLTK


Stemming is desirable because it reduces redundancy: most of the time, a word stem and its inflected or derived forms mean the same thing.

Below is the implementation of stemming words using NLTK:

Code #1:

# import these modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
  
ps = PorterStemmer()
 
# choose some words to be stemmed
words = ["program", "programs", "programer", "programing", "programers"]
 
for w in words:
    print(w, " : ", ps.stem(w))

Output:

program  :  program
programs  :  program
programer  :  program
programing  :  program
programers  :  program

Code #2: Stemming words from sentences

# importing modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
  
ps = PorterStemmer()
  
sentence = "Programers program with programing languages"
words = word_tokenize(sentence)
  
for w in words:
    print(w, " : ", ps.stem(w))

Output :

Programers  :  program
program  :  program
with  :  with
programing  :  program
languages  :  languag