Chunking and Named Entity Recognition

Chunking


Now that we know the parts of speech, we can do what is called chunking: grouping words into hopefully meaningful chunks. One of the main goals of chunking is to group words into what are known as "noun phrases." These are phrases of one or more words that contain a noun, maybe some descriptive words, maybe a verb, and maybe something like an adverb. The idea is to group nouns with the words that relate to them.

In order to chunk, we combine the part-of-speech tags with regular expressions. From regular expressions, we will mainly use the following:

+ = match 1 or more
? = match 0 or 1 repetitions
* = match 0 or more repetitions
. = any character except a new line
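
If these quantifiers are unfamiliar, here is a quick illustration in plain Python (a minimal sketch, not part of the original tutorial):

import re

print(re.findall(r'ab+c', 'ac abc abbc'))  # ['abc', 'abbc']       + : 1 or more
print(re.findall(r'ab?c', 'ac abc abbc'))  # ['ac', 'abc']         ? : 0 or 1
print(re.findall(r'ab*c', 'ac abc abbc'))  # ['ac', 'abc', 'abbc'] * : 0 or more
print(re.findall(r'a.c', 'abc axc'))       # ['abc', 'axc']        . : any char but newline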

See the tutorial linked above if you need help with regular expressions. The last thing to note is that part-of-speech tags are denoted with "<" and ">", and that we can also place regular expressions within the tags themselves, to account for things like "all nouns" (<N.*>).
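
For instance, you can see how a tag pattern expands into an ordinary regular expression using an NLTK helper (a minimal sketch; the same helper appears again later in this section):

from nltk.chunk.regexp import tag_pattern2re_pattern

# <N.*> ("all nouns") expands into a plain regex over the tag text;
# this should print something like (<(N[^\{\}]*)>)
print(tag_pattern2re_pattern('<N.*>'))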

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            chunked.draw()     

    except Exception as e:
        print(str(e))

process_content()

Running this, chunked.draw() opens a tree drawing of each chunked sentence in a new window.

The main line here in question is:

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

This line, broken down:

<RB.?>* = "0 or more of any tense of adverb," followed by:

<VB.?>* = "0 or more of any tense of verb," followed by:

<NNP>+ = "One or more proper nouns," followed by

<NN>? = "zero or one singular noun."
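
To see the whole pattern in action without loading a corpus, here is a minimal, self-contained sketch that applies the same grammar to a hand-tagged sentence (the sentence is made up for illustration):

import nltk

# A hand-tagged sentence, purely for illustration
tagged = [('He', 'PRP'), ('quickly', 'RB'), ('praised', 'VBD'),
          ('President', 'NNP'), ('Bush', 'NNP')]

parser = nltk.RegexpParser(r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}""")
print(parser.parse(tagged))
# (S He/PRP (Chunk quickly/RB praised/VBD President/NNP Bush/NNP))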

Try playing around with combinations to group various instances until you feel comfortable with chunking.

Not covered in the video, but also a reasonable task, is to actually access the chunks specifically. This is something rarely talked about, but it can be an essential step depending on what you're doing. If you print the chunks out, you will see output like:

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (Chunk
    ADDRESS/NNP
    BEFORE/NNP
    A/NNP
    JOINT/NNP
    SESSION/NNP
    OF/NNP
    THE/NNP
    CONGRESS/NNP
    ON/NNP
    THE/NNP
    STATE/NNP
    OF/NNP
    THE/NNP
    UNION/NNP
    January/NNP)
  31/CD
  ,/,
  2006/CD
  THE/DT
  (Chunk PRESIDENT/NNP)
  :/:
  (Chunk Thank/NNP)
  you/PRP
  all/DT
  ./.)

Cool, that helps us visually, but what if we want to access this data via our program? Well, what is happening here is that our "chunked" variable is an NLTK tree. Each "chunk" and "non chunk" is a "subtree" of the tree. We can reference these via chunked.subtrees(). We can then iterate through these subtrees like so:

            for subtree in chunked.subtrees():
                print(subtree)

Next, we might be only interested in getting just the chunks, ignoring the rest. We can use the filter parameter in the chunked.subtrees() call.

            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

Now, we're filtering to only show the subtrees with the label of "Chunk." Keep in mind, this isn't "Chunk" as in the NLTK chunk attribute... this is "Chunk" literally because that's the label we gave it here: chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

Had we said instead something like chunkGram = r"""Pythons: {<RB.?>*<VB.?>*<NNP>+<NN>?}""", then we would filter by the label of "Pythons." The result here should be something like:

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk
  ADDRESS/NNP
  BEFORE/NNP
  A/NNP
  JOINT/NNP
  SESSION/NNP
  OF/NNP
  THE/NNP
  CONGRESS/NNP
  ON/NNP
  THE/NNP
  STATE/NNP
  OF/NNP
  THE/NNP
  UNION/NNP
  January/NNP)
(Chunk PRESIDENT/NNP)
(Chunk Thank/NNP)
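
If you want the chunk text itself rather than the subtree objects, each subtree's leaves() method returns its (word, tag) pairs. A minimal, self-contained sketch (the tagged sentence is hand-made for illustration):

import nltk

tagged = [('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP')]
chunked = nltk.RegexpParser(r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}""").parse(tagged)

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    # leaves() gives the (word, tag) pairs inside this chunk
    print(" ".join(word for word, tag in subtree.leaves()))
# PRESIDENT GEORGE W. BUSH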

Full code for this would be:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

            chunked.draw()

    except Exception as e:
        print(str(e))

process_content()

If you get particular enough, you may find that you would be better off if there were a way to chunk everything except some things. This process is known as chinking, and that's what we're going to be covering next.

Chinking

You may find that, after a lot of chunking, you have some words in your chunk you still do not want, but you have no idea how to get rid of them by chunking. You may find that chinking is your solution.

Chinking is a lot like chunking; it is basically a way for you to remove a chunk from a chunk. The chunk that you remove from your chunk is your chink.

The code is very similar; you just denote the chink, after the chunk, with }{ instead of the chunk's {}.

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)

            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)

            chunked.draw()

    except Exception as e:
        print(str(e))

process_content()

With this, you are given a tree drawing in which nearly everything is chunked, except the words we chink out.

Now, the main difference here is:

}<VB.?|IN|DT|TO>+{

This means we're removing from the chunk one or more verbs, prepositions, determiners, or the word 'to'.
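
Here is a minimal, self-contained sketch of the same chinking grammar on a hand-tagged sentence (made up for illustration), so you can see exactly which words get removed from the chunk:

import nltk

tagged = [('the', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ('jumped', 'VBD'),
          ('over', 'IN'), ('the', 'DT'), ('dog', 'NN')]

# Chunk everything, then chink out verbs, prepositions, determiners, and TO
grammar = r"""Chunk: {<.*>+}
                      }<VB.?|IN|DT|TO>+{"""

print(nltk.RegexpParser(grammar).parse(tagged))
# (S the/DT (Chunk quick/JJ fox/NN) jumped/VBD over/IN the/DT (Chunk dog/NN))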

Now that we've learned how to do some custom forms of chunking, and chinking, let's discuss a built-in form of chunking that comes with NLTK, and that is named entity recognition.

 

Named Entity Recognition

One of the most major forms of chunking in natural language processing is called "Named Entity Recognition." The idea is to have the machine immediately be able to pull out "entities" like people, places, things, locations, monetary figures, and more.

This can be a bit of a challenge, but NLTK has it built in for us. There are two major options with NLTK's named entity recognition: either recognize all named entities, or recognize named entities as their respective type, like people, places, locations, etc.

Here's an example:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))


process_content()

Here, with the option of binary = True, something either is a named entity or it is not; there is no further detail. The result is a tree in which every recognized entity is simply labeled NE.

If you set binary = False, then each entity is instead labeled with its type, such as PERSON, ORGANIZATION, or GPE.

Immediately, you can see a few things. With binary = False, it picked up the same things, but it wound up splitting terms like White House into "White" and "House" as if they were different, whereas with binary = True, the named entity recognition correctly treated White House as one named entity.

Depending on your goals, you may use the binary option how you see fit. Here are the types of named entities you can get if you have binary as False (a small extraction sketch follows the list):

NE Type and Examples
ORGANIZATION - Georgia-Pacific Corp., WHO
PERSON - Eddy Bonte, President Obama
LOCATION - Murray River, Mount Everest
DATE - June, 2008-06-29
TIME - two fifty a m, 1:30 p.m.
MONEY - 175 million Canadian Dollars, GBP 10.40
PERCENT - twenty pct, 18.75 %
FACILITY - Washington Monument, Stonehenge
GPE - South East Asia, Midlothian
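
Here is that extraction sketch: a minimal example (assuming the usual NLTK data packages such as punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words are downloaded) that collects a (type, text) pair for each entity instead of drawing the tree:

import nltk

sentence = "George Bush spoke to Congress in Washington."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = nltk.ne_chunk(tagged, binary=False)

# Every non-root subtree is a named-entity chunk
for subtree in tree.subtrees(filter=lambda t: t.label() != 'S'):
    entity = " ".join(word for word, tag in subtree.leaves())
    print(subtree.label(), "->", entity)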

Either way, you will probably find that you need to do a bit more work to get it just right, but this is pretty powerful right out of the box.

 

 

 

 


NLP | Chunking and chinking with RegEx

Chunk extraction, or partial parsing, is the process of extracting meaningful short phrases from a sentence that has been tagged with part-of-speech tags.
Chunks are made up of words, and the kinds of words are defined using part-of-speech tags. One can even define a pattern for words that can't be a part of a chunk, and such words are known as chinks. A ChunkRule class specifies what words or patterns to include and exclude in a chunk.

Defining Chunk Patterns:
Chunk patterns are normal regular expressions, modified to match sequences of part-of-speech tags. Angle brackets are used to specify an individual tag, for example <NN> to match a noun tag. One can define multiple tags in the same way.

Code #1 : Converting chunks to RegEx Pattern.

# Loading library
from nltk.chunk.regexp import tag_pattern2re_pattern

# Chunk pattern to RegEx pattern
print("Chunk Pattern : ", tag_pattern2re_pattern('<DT>?<NN.*>+'))

Output :

Chunk Pattern :  (<(DT)>)?(<(NN[^\{\}]*)>)+

Curly braces are used to specify a chunk, like {}, and to specify the chink pattern one can just flip the braces, }{. For a particular phrase type, these rules (a chunk and a chink pattern) can be combined into a grammar.

Code #2 : Parsing the sentence with RegexpParser.

from nltk.chunk import RegexpParser

# Introducing the pattern: the chunk rule first, then the chink rule
chunker = RegexpParser(r'''
NP:
{<DT><NN.*><.*>*<NN.*>}
}<VB.*>{
''')

chunker.parse([('the', 'DT'), ('book', 'NN'),
               ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])

Output :

Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), 
Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])






NLP | Chunking Rules

Below are the steps involved in chunking –

  • Conversion of the sentence to a flat tree.
  • Creation of a ChunkString using this tree.
  • Creation of a RegexpChunkParser by parsing the grammar using RegexpParser.
  • Applying the created chunk rule to the ChunkString that matches the sentence into a chunk.
  • Splitting the bigger chunk into smaller chunks using the defined chunk rules.
  • The ChunkString is then converted back to a tree, with two chunk subtrees.

Code #1 : ChunkString getting modified by applying each rule.

# Loading libraries
from nltk.chunk.regexp import ChunkString, ChunkRule, ChinkRule
from nltk.tree import Tree

# ChunkString() starts with the flat tree
tree = Tree('S', [('the', 'DT'), ('book', 'NN'),
                  ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])

# Initializing ChunkString()
chunk_string = ChunkString(tree)
print("Chunk String : ", chunk_string)

# Initializing and applying a ChunkRule
chunk_rule = ChunkRule('<DT><NN.*><.*>*<NN.*>', 'chunk determiners and nouns')
chunk_rule.apply(chunk_string)
print("\nApplied ChunkRule : ", chunk_string)

# Applying a ChinkRule
chink_rule = ChinkRule('<VB.*>', 'chink verbs')
chink_rule.apply(chunk_string)
print("\nApplied ChinkRule : ", chunk_string, "\n")

# Back to the chunk sub-tree
chunk_string.to_chunkstruct()

Output:

Chunk String :   <DT>  <NN>  <VBZ>  <JJ>  <NNS>

Applied ChunkRule :  {<DT>  <NN>  <VBZ>  <JJ>  <NNS>}

Applied ChinkRule :  {<DT>  <NN>} <VBZ> {<JJ>  <NNS>} 

Tree('S', [Tree('CHUNK', [('the', 'DT'), ('book', 'NN')]), 
    ('has', 'VBZ'), Tree('CHUNK', [('many', 'JJ'), ('chapters', 'NNS')])])

Note : This code works exactly in the same manner as explained in the ChunkRule steps above.

Code #2 : How to do this task directly with RegexpChunkParser.

# Loading libraries
from nltk.chunk.regexp import ChunkRule, ChinkRule
from nltk.tree import Tree
from nltk.chunk import RegexpChunkParser

# The flat input tree
tree = Tree('S', [('the', 'DT'), ('book', 'NN'),
                  ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])

# Initializing the ChunkRule and the ChinkRule
chunk_rule = ChunkRule('<DT><NN.*><.*>*<NN.*>', 'chunk determiners and nouns')
chink_rule = ChinkRule('<VB.*>', 'chink verbs')

# Applying RegexpChunkParser
chunker = RegexpChunkParser([chunk_rule, chink_rule])
chunker.parse(tree)

Output:

Tree('S', [Tree('CHUNK', [('the', 'DT'), ('book', 'NN')]), 
    ('has', 'VBZ'), Tree('CHUNK', [('many', 'JJ'), ('chapters', 'NNS')])])

Code #3 : Parsing with a different chunk label.

# Loading libraries
from nltk.chunk.regexp import ChunkRule, ChinkRule
from nltk.tree import Tree
from nltk.chunk import RegexpChunkParser

# The flat input tree
tree = Tree('S', [('the', 'DT'), ('book', 'NN'),
                  ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')])

# Initializing the ChunkRule and the ChinkRule
chunk_rule = ChunkRule('<DT><NN.*><.*>*<NN.*>', 'chunk determiners and nouns')
chink_rule = ChinkRule('<VB.*>', 'chink verbs')

# Applying RegexpChunkParser with a custom chunk label
chunker = RegexpChunkParser([chunk_rule, chink_rule], chunk_label='CP')
chunker.parse(tree)

Output:

Tree('S', [Tree('CP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), 
          Tree('CP', [('many', 'JJ'), ('chapters', 'NNS')])])





NLP | Chunking using Corpus Reader

What are Chunks?
Chunks are made up of words, and the kinds of words are defined using part-of-speech tags. One can even define a pattern for words that can't be a part of a chunk, and such words are known as chinks. A ChunkRule class specifies what words or patterns to include and exclude in a chunk.

How it works:

  • The ChunkedCorpusReader class works like the TaggedCorpusReader for getting tagged tokens, and it also provides three new methods for getting chunks.
  • An instance of nltk.tree.Tree represents each chunk.
  • Noun phrase trees look like Tree('NP', [...]), whereas sentence-level trees look like Tree('S', [...]).
  • chunked_sents() returns a list of sentence trees, with each noun phrase as a subtree of the sentence.
  • chunked_words() returns a list of noun phrase trees alongside tagged tokens of words that were not in a chunk.

Its three major chunk methods are chunked_words(), chunked_sents(), and chunked_paras().
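
Code #1 below expects .chunk files on disk. If you just want to try it out, here is a minimal sketch that writes one first (the file name sample.chunk is made up, and the default bracketed tagged-chunk format is assumed):

# Hypothetical sample file in the bracketed chunk format the default
# reader expects: square brackets mark chunks, word/TAG marks tokens
sample = ("[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP trimmed/VBN "
          "about/IN [300/CD jobs/NNS] ,/, [the/DT spokesman/NN] said/VBD ./.\n")

with open('sample.chunk', 'w') as f:
    f.write(sample)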

Code #1 : Creating a ChunkedCorpusReader for words

# Using ChunkedCorpusReader
from nltk.corpus.reader import ChunkedCorpusReader

# Initializing the reader over all .chunk files in the current directory
x = ChunkedCorpusReader('.', r'.*\.chunk')

words = x.chunked_words()
print("Words : \n", words)

Output :

Words : 
[Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), 
('moves', 'NNS')]), ('have', 'VBP'), ...]

Code #2 : For sentences

chunked_sent = x.chunked_sents()
print("Chunked Sentence : \n", chunked_sent)

Output :

Chunked Sentence : 
[Tree('S', [Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'),
('moves', 'NNS')]), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'),
Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]), (',', ','),
Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]), ('said', 'VBD'), ('.', '.')])]

Code #3 : For paragraphs

para = x.chunked_paras()
print("para : \n", para)

Output :

[[Tree('S', [Tree('NP', [('Earlier', 'JJR'), ('staff-reduction',
'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ('trimmed', 'VBN'),
('about', 'IN'),
Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]), (',', ','),
Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]), ('said', 'VBD'), ('.', '.')])]]




NLP | Classifier-based Chunking | Set 1

Unlike most part-of-speech taggers, the ClassifierBasedTagger class learns from features. A ClassifierChunker class can be created so that it learns from both the words and the part-of-speech tags, instead of just from the part-of-speech tags as the TagChunker class does.

chunk_trees2train_chunks() uses tree2conlltags() to get (word, pos, iob) 3-tuples, then converts them into ((word, pos), iob) 2-tuples, to remain compatible with the 2-tuple (word, pos) format required for training a ClassifierBasedTagger.

Code #1 : Let’s understand

# Loading libraries
from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import ClassifierBasedTagger

def chunk_trees2train_chunks(chunk_sents):

    # Using tree2conlltags to flatten each chunk tree
    tag_sents = [tree2conlltags(sent) for sent in chunk_sents]

    # Each (word, pos, iob) 3-tuple is converted to a ((word, pos), iob) 2-tuple
    return [[((w, t), c) for (w, t, c) in sent] for sent in tag_sents]

Now, a feature detector function is needed to pass into ClassifierBasedTagger. Any feature detector function used with the ClassifierChunker class (defined next) should recognize that tokens are a list of (word, pos) tuples, and have the same function signature as prev_next_pos_iob(). To give the classifier as much information as we can, this feature set contains the current, previous, and next word and part-of-speech tag, along with the previous IOB tag.

Code #2 : detector function

def prev_next_pos_iob(tokens, index, history):

    word, pos = tokens[index]

    if index == 0:
        prevword, prevpos, previob = ('<START>',) * 3
    else:
        prevword, prevpos = tokens[index - 1]
        previob = history[index - 1]

    if index == len(tokens) - 1:
        nextword, nextpos = ('<END>',) * 2
    else:
        nextword, nextpos = tokens[index + 1]

    # Build the feature set for the current token
    feats = {'word': word,
             'pos': pos,
             'nextword': nextword,
             'nextpos': nextpos,
             'prevword': prevword,
             'prevpos': prevpos,
             'previob': previob}
    return feats

Now the ClassifierChunker class is needed, which uses an internal ClassifierBasedTagger with training sentences from chunk_trees2train_chunks() and features extracted using prev_next_pos_iob(). As a subclass of ChunkParserI, ClassifierChunker implements the parse() method to convert the ((w, t), c) tuples produced by the internal tagger into Trees using conlltags2tree().

Code #3 :

class ClassifierChunker(ChunkParserI):
    def __init__(self, train_sents,
                 feature_detector=prev_next_pos_iob, **kwargs):
        if not feature_detector:
            feature_detector = self.feature_detector

        train_chunks = chunk_trees2train_chunks(train_sents)
        self.tagger = ClassifierBasedTagger(train=train_chunks,
                                            feature_detector=feature_detector,
                                            **kwargs)

    def parse(self, tagged_sent):
        if not tagged_sent:
            return None
        chunks = self.tagger.tag(tagged_sent)
        return conlltags2tree([(w, t, c) for ((w, t), c) in chunks])

NLP | Classifier-based Chunking | Set 2

Using the data from the treebank_chunk corpus, let us evaluate the chunkers (prepared in the previous article).

Code #1 :

# Loading libraries
from chunkers import ClassifierChunker
from nltk.corpus import treebank_chunk

train_data = treebank_chunk.chunked_sents()[:3000]
test_data = treebank_chunk.chunked_sents()[3000:]

# Initializing
chunker = ClassifierChunker(train_data)

# Evaluation
score = chunker.evaluate(test_data)
a = score.accuracy()
p = score.precision()
r = score.recall()
print("Accuracy of ClassifierChunker : ", a)
print("\nPrecision of ClassifierChunker : ", p)
print("\nRecall of ClassifierChunker : ", r)

Output :

Accuracy of ClassifierChunker : 0.9721733155838022

Precision of ClassifierChunker : 0.9258838793383068

Recall of ClassifierChunker : 0.9359016393442623

Code #2 : Let’s compare the performance of conll_train

from nltk.corpus import conll2000

# Assuming conll_train / conll_test are the chunked sentences
# of the conll2000 corpus (not defined in the original snippet)
conll_train = conll2000.chunked_sents('train.txt')
conll_test = conll2000.chunked_sents('test.txt')

chunker = ClassifierChunker(conll_train)
score = chunker.evaluate(conll_test)
a = score.accuracy()
p = score.precision()
r = score.recall()
print("Accuracy of ClassifierChunker : ", a)
print("\nPrecision of ClassifierChunker : ", p)
print("\nRecall of ClassifierChunker : ", r)

Output :

Accuracy of ClassifierChunker : 0.9264622074002153

Precision of ClassifierChunker : 0.8737924310910219

Recall of ClassifierChunker : 0.9007354620620346

The word can be passed through the tagger into our feature detector function by creating nested 2-tuples of the form ((word, pos), iob). The chunk_trees2train_chunks() method produces these nested 2-tuples.
The following features are extracted:

  • The current word and part-of-speech tag
  • The previous word and IOB tag, part-of-speech tag
  • The next word and part-of-speech tag

The ClassifierChunker class uses an internal ClassifierBasedTagger and prev_next_pos_iob() as its default feature_detector. The results from the tagger, which are in the same nested 2-tuple form, are then reformatted into 3-tuples to return a final Tree using conlltags2tree().

Code #3 : Using a different classifier builder

# Loading libraries
from chunkers import ClassifierChunker
from nltk.corpus import treebank_chunk
from nltk.classify import MaxentClassifier

train_data = treebank_chunk.chunked_sents()[:3000]
test_data = treebank_chunk.chunked_sents()[3000:]

# Using a maximum entropy classifier instead of the default
builder = lambda toks: MaxentClassifier.train(toks, trace=0,
                                              max_iter=10, min_lldelta=0.01)
chunker = ClassifierChunker(train_data, classifier_builder=builder)

score = chunker.evaluate(test_data)
a = score.accuracy()
p = score.precision()
r = score.recall()
print("Accuracy of ClassifierChunker : ", a)
print("\nPrecision of ClassifierChunker : ", p)
print("\nRecall of ClassifierChunker : ", r)

Output :

Accuracy of ClassifierChunker : 0.9743204362949285

Precision of ClassifierChunker : 0.9334423548650859

Recall of ClassifierChunker : 0.9357377049180328

The ClassifierBasedTagger class defaults to using NaiveBayesClassifier.train as its classifier_builder, but any classifier can be used by overriding the classifier_builder keyword argument.

NLP | Distributed chunking with Execnet

The article aims to perform chunking and tagging over an execnet gateway. Here two objects will be sent instead of one, and a Tree is received, which requires pickling and unpickling for serialization.

How it works:

  • Use a pickled tagger.
  • First, pickle the default chunker used by nltk.chunk.ne_chunk(), though any chunker would do.
  • Next, make a gateway for the remote_chunk module, get a channel, and send the pickled tagger and chunker over.
  • Then, receive a pickled Tree, which can be unpickled and inspected to see the result. Finally, exit the gateway:

Code : Explaining the working


# importing libraries
import execnet, remote_chunk
import nltk.data, nltk.tag, nltk.chunk
import pickle
from nltk.corpus import treebank_chunk
 
tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER))
chunker = pickle.dumps(
        nltk.data.load(nltk.chunk._MULTICLASS_NE_CHUNKER))
gw = execnet.makegateway()
 
channel = gw.remote_exec(remote_chunk)
channel.send(tagger)
channel.send(chunker)
channel.send(treebank_chunk.sents()[0])
 
chunk_tree = pickle.loads(channel.receive())
 
print (chunk_tree)
gw.exit()

Output :

Tree('S', [Tree('PERSON', [('Pierre', 'NNP')]), Tree('ORGANIZATION',
[('Vinken', 'NNP')]), (',', ','), ('61', 'CD'), ('years', 'NNS'),
('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the',
'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive',
'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')])

Communication this time is slightly different, as described below –

  • The remote_chunk.py module is just a little bit more complicated than the remote_tag.py module.
  • In addition to receiving a pickled tagger, it also expects to receive a pickled chunker that implements the ChunkerI interface.
  • Once it has both a tagger and a chunker, it expects to receive any number of tokenized sentences, which it tags and parses into a Tree. This Tree is then pickled and sent back over the channel:

Code : Explaining the above working


import pickle

if __name__ == '__channelexec__':
    tagger = pickle.loads(channel.receive())
    chunker = pickle.loads(channel.receive())

    # Tag and parse each incoming sentence, then send the pickled Tree back
    for sentence in channel:
        chunk_tree = chunker.parse(tagger.tag(sentence))
        channel.send(pickle.dumps(chunk_tree))

The remote_chunk module's only external dependency is the pickle module, which is part of the Python standard library. It doesn't need to import any NLTK modules in order to use the tagger or chunker, because all the necessary data is pickled and sent over the channel.