A very similar operation to stemming is called lemmatizing. The major difference between these is that, as you saw earlier, stemming can often create non-existent words, whereas lemmas are actual words. So your root stem, meaning the word you end up with, is not necessarily something you can look up in a dictionary, but you can always look up a lemma. Sometimes you will wind up with a very similar word, and sometimes you will wind up with a completely different word. Let's see some examples.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Nouns (the default part of speech) reduce to their singular lemmas:
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))

# "pos" tells the lemmatizer the part of speech; "a" means adjective:
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))

# Without a pos, "run" is treated as a noun; with "v" it's a verb:
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run", 'v'))
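One caveat before running this: WordNetLemmatizer needs the WordNet data on disk. If the calls above raise a LookupError, a one-time download fixes it (newer NLTK releases may also ask for an extra package such as omw-1.4; check the error message for your version):

import nltk

# One-time download of the WordNet data that WordNetLemmatizer uses.
nltk.download('wordnet')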
Here, we've got a bunch of examples of the lemma for the words that we used. The only major thing to note is that lemmatize() takes a part-of-speech parameter, "pos". If not supplied, the default is "noun". This means that an attempt will be made to find the closest noun, which can create trouble for you. Keep this in mind if you use lemmatizing!
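To make the pos behavior concrete, here is a minimal sketch; the outputs in the comments are what WordNet typically returns for these calls:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Treated as a noun (the default), "running" comes back unchanged:
print(lemmatizer.lemmatize("running"))           # running

# Told it is a verb, the lemmatizer finds the base form:
print(lemmatizer.lemmatize("running", pos="v"))  # run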
In the next tutorial, we're going to dive into the NLTK corpus that came with the module, looking at all of the awesome documents they have waiting for us there.
Lemmatization with NLTK
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming, but it brings context to the words, linking words with similar meanings to one word.

Text preprocessing includes both stemming and lemmatization. People often find the two terms confusing, and some treat them as the same thing. In practice, lemmatization is preferred over stemming because lemmatization does a morphological analysis of the words.
Applications of lemmatization are:

- Used in comprehensive retrieval systems like search engines.
- Used in compact indexing, since inflected forms collapse to a single entry (a small sketch of both uses follows this list).
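As a rough illustration, here is a minimal sketch of a lemma-based index; the build_index helper and the toy documents are made up for this example, not part of NLTK:

from collections import defaultdict
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def build_index(docs):
    # Map each lemma to the set of document ids containing it, so
    # "rocks" and "rock" share a single index entry.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[lemmatizer.lemmatize(token)].add(doc_id)
    return index

docs = ["the rocks were wet", "a rock fell"]
index = build_index(docs)
print(index["rock"])  # {0, 1} -- a search for "rock" matches both documents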
Examples of lemmatization:

-> rocks : rock
-> corpora : corpus
-> better : good
One major difference from stemming is that lemmatize() takes a part-of-speech parameter, "pos". If not supplied, the default is "noun".
Below is the implementation of lemmatization words using NLTK:
# import these modules
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# "a" denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))
Output:

rocks : rock
corpora : corpus
better : good
Corpora
In this part of the tutorial, I want us to take a moment to peek into the corpora we all downloaded! The NLTK corpus is a massive dump of all kinds of natural language data sets that are definitely worth taking a look at.
Almost all of the files in the NLTK corpus follow the same rules for accessing them by using the NLTK module, but nothing is magical about them. These files are plain text files for the most part, some are XML and some are other formats, but they are all accessible by you manually, or via the module and Python. Let's talk about viewing them manually.
Depending on your installation, your nltk_data directory might be hiding in a multitude of locations. To figure out where it is, head to your Python directory, where the NLTK module is. If you do not know where that is, use the following code:
import nltk
print(nltk.__file__)
Run that, and the output will be the location of the NLTK module's __init__.py. Head into the NLTK directory, and then look for the data.py file.
The important blurb of code is:
if sys.platform.startswith('win'):
    # Common locations on Windows:
    path += [
        str(r'C:\nltk_data'), str(r'D:\nltk_data'), str(r'E:\nltk_data'),
        os.path.join(sys.prefix, str('nltk_data')),
        os.path.join(sys.prefix, str('lib'), str('nltk_data')),
        os.path.join(os.environ.get(str('APPDATA'), str('C:\\')), str('nltk_data'))
    ]
else:
    # Common locations on UNIX & OS X:
    path += [
        str('/usr/share/nltk_data'),
        str('/usr/local/share/nltk_data'),
        str('/usr/lib/nltk_data'),
        str('/usr/local/lib/nltk_data')
    ]
There, you can see the various possible directories for the nltk_data. If you're on Windows, chances are it is under your AppData directory. To get there, open your file browser, go to the address bar at the top, and type in %appdata%
Next, click on Roaming, and then find the nltk_data directory. In there, you will have your corpora directory. The full path is something like:
C:\Users\yourname\AppData\Roaming\nltk_data\corpora
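You can confirm which of these locations your installation will actually search, without digging through data.py, by printing nltk.data.path:

import nltk

# The directories NLTK searches for nltk_data, in priority order.
print(nltk.data.path)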
Within that corpora directory, you have all of the available corpora, including things like books, chat logs, movie reviews, and a whole lot more.
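If you'd rather browse from Python than from the file browser, a short loop over that same search path will find the corpora directory on your machine and list what's inside (this assumes you've already downloaded some data):

import os
import nltk

# Find the first corpora directory that actually exists on this
# machine, then list a few of the corpora inside it.
for base in nltk.data.path:
    corpora_dir = os.path.join(base, 'corpora')
    if os.path.isdir(corpora_dir):
        print(corpora_dir)
        print(sorted(os.listdir(corpora_dir))[:10])
        break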
Now, we're going to talk about accessing these documents via NLTK. As you can see, these are mostly text documents, so you could just use normal Python code to open and read them. That said, the NLTK module has a few nice methods for handling the corpus, so you may find it useful to use their methodology. Here's an example of us opening the King James Bible from the Gutenberg corpus and reading the first few lines:
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

# sample text
sample = gutenberg.raw("bible-kjv.txt")

tok = sent_tokenize(sample)

# Print the first five sentences:
for x in range(5):
    print(tok[x])
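The same corpus reader can also tell you what else is available: gutenberg.fileids() lists every document in the Gutenberg corpus, and any of those names can be passed to raw():

from nltk.corpus import gutenberg

# Every plain-text document shipped with the Gutenberg corpus.
print(gutenberg.fileids())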
One of the more advanced data sets in here is WordNet. WordNet is a collection of words, definitions, examples of their use, synonyms, antonyms, and more. We'll dive into using WordNet next.