NLP frameworks

Why use an NLP framework?

How is it better than other options, e.g. manual implementation or using existing standalone tools? (Note: the benefits of using a framework listed below do not necessarily hold for every framework.)

Overview of NLP frameworks

NLTK tutorial

  1. Installation:
    # in terminal
    pip3 install --user nltk
    
    ipython3
    import nltk
    
    # optionally:
    # nltk.download()
    # usually, you should choose to download "all" (but it may get stuck)
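    
    # instead of "all", you can download just the data packages used in
    # this tutorial (package ids as listed in the NLTK data index):
    # nltk.download('punkt')                        # sentence & word tokenizers
    # nltk.download('averaged_perceptron_tagger')   # default POS tagger
    # nltk.download('maxent_ne_chunker')            # named entity chunker
    # nltk.download('words')                        # word list used by the chunker
    # nltk.download('treebank')                     # Penn Treebank sample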
    
  2. http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk
  3. http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize
  4. http://textminingonline.com/dive-into-nltk-part-iii-part-of-speech-tagging-and-pos-tagger
  5. http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization

Using existing tools in NLTK

NLTK provides ready-made tools for sentence segmentation, word tokenization, part-of-speech tagging, and named entity recognition.

import nltk

# read in a plain-text file ("genesis.txt" is assumed to be in the current directory)
with open("genesis.txt", "r") as f:
    genesis = f.read()

sentences = nltk.sent_tokenize(genesis)
# just the first sentence
tokens_0 = nltk.word_tokenize(sentences[0])
tagged_0 = nltk.pos_tag(tokens_0)
# all sentences
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]
tagged_sentences = nltk.pos_tag_sents(tokenized_sentences)

# run named entity recognition on the tagged first sentence
ne = nltk.ne_chunk(tagged_0)
print(ne)
# draw the chunk tree (opens a small graphical window)
ne.draw()
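
The chunker returns an nltk.Tree in which each named entity is a subtree labeled with its type (e.g. PERSON or GPE). A minimal sketch of collecting the entities programmatically, assuming the default non-binary chunker output:

# collect (entity text, entity type) pairs from the chunk tree;
# every subtree except the S root is a named-entity chunk
entities = [(" ".join(word for word, tag in subtree.leaves()), subtree.label())
            for subtree in ne.subtrees()
            if subtree.label() != "S"]
print(entities)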

Training a tagger

from nltk.corpus import treebank

# split the Penn Treebank sample into training and test data
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

from nltk.tag import tnt

# train a TnT (Trigrams'n'Tags) tagger on the training data
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)

# tag a new sentence; words unseen in training get the tag 'Unk' by default
tnt_pos_tagger.tag(nltk.word_tokenize("A platypus is a very special animal."))

# measure tagging accuracy on the held-out test data
tnt_pos_tagger.evaluate(test_data)

# save the trained tagger to a file and load it back
import pickle
with open('tnt_treebank_pos_tagger.pickle', 'wb') as f:
    pickle.dump(tnt_pos_tagger, f)
with open('tnt_treebank_pos_tagger.pickle', 'rb') as f:
    loaded_tagger = pickle.load(f)
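
The reloaded tagger behaves exactly like the one it was pickled from:

loaded_tagger.tag(nltk.word_tokenize("A platypus is a very special animal."))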

Trees in NLTK

Let's create a simple constituency tree for the sentence "A red bus stopped suddenly":

# what we want to create: 
#
#           S
#       /       \
#    NP           VP
#  / |  \      /      \
# A red bus stopped suddenly
#

from nltk import Tree

# Tree(root, [children])
np = Tree('NP', ['A', 'red', 'bus'])
vp = Tree('VP', ['stopped', 'suddenly'])
# children can be strings or Trees
s = Tree('S', [np, vp])

# print out the tree
print(s)

# draw the tree (opens a small graphical window)
s.draw()
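
Tree objects can also be inspected programmatically. A quick sketch of the basic accessors, using the tree built above:

print(s.label())    # the root label: 'S'
print(s.leaves())   # ['A', 'red', 'bus', 'stopped', 'suddenly']
print(s[0])         # the first child: (NP A red bus)
print(s.height())   # 3

# trees can also be read from bracketed notation
s2 = Tree.fromstring("(S (NP A red bus) (VP stopped suddenly))")
print(s2 == s)      # True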

And a dependency tree for the same sentence:

# what we want to create: 
#
#       stopped
#       /      \
#    bus    suddenly
#  / |
# A red

# can either use string leaf nodes:
t1 = Tree('stopped', [Tree('bus', ['A', 'red']), 'suddenly'])
t1.draw()

# or represent each leaf node as a Tree without children:
t2 = Tree('stopped', [Tree('bus', [Tree('A', []), Tree('red', [])]), Tree('suddenly', [])])
t2.draw()
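
With the t2 representation, where every node is a Tree, the head-dependent pairs can be recovered by a simple recursive walk -- a minimal sketch:

def edges(tree):
    """Yield (head, dependent) pairs from a dependency Tree."""
    for child in tree:
        yield (tree.label(), child.label())
        yield from edges(child)

print(list(edges(t2)))
# [('stopped', 'bus'), ('bus', 'A'), ('bus', 'red'), ('stopped', 'suddenly')]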

Tagging and parsing with UDPipe

Easy way: use the online service (also accessible through a REST API; see the sketch below)
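
For instance, the REST API can be called directly from Python -- a minimal sketch (the endpoint and parameter names are taken from the public API reference; double-check them at https://lindat.mff.cuni.cz/services/udpipe/api-reference.php):

import requests

# process raw text with the online UDPipe service;
# an empty value for tokenizer/tagger/parser enables it with default options
response = requests.get(
    "https://lindat.mff.cuni.cz/services/udpipe/api/process",
    params={"model": "english",
            "tokenizer": "", "tagger": "", "parser": "",
            "data": "A man went into a bar."})
print(response.json()["result"])  # the resulting CoNLL-U analysis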

Powerful way: use a local installation (more control, also supports training) -- see below

  1. Installation:
    # checkout the udpipe repository
    git clone https://github.com/ufal/udpipe.git
    
    # compile udpipe
    cd udpipe/src
    make
    cd ../..
    
    # install Python bindings
    pip3 install --user ufal.udpipe
    
    # download trained models
    wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1659/udpipe-ud-1.2-160523.zip
    unzip udpipe-ud-1.2-160523.zip
    
  2. Sample usage:
    # start ipython in the directory with the models (udpipe-ud-1.2-160523),
    # as this makes it easier to load the models just by the filename;
    # otherwise you have to specify the full path to the model
    cd udpipe-ud-1.2-160523
    ipython3
    
    from ufal.udpipe import Model, Pipeline
    
    # load model from the given file;
    # if the file does not exist, expect a Segmentation fault
    model = Model.load("english-ud-1.2-160523.udpipe")
    
    # create a UDPipe processing pipeline with the loaded model,
    # with "horizontal" input (a sentence with space-separated tokens),
    # default setting for tagger and parser,
    # and CoNLL-U output
    pipeline = Pipeline(model, "horizontal", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
    
    # analyze a tokenized sentence with UDPipe
    # and print out the resulting CoNLL-U analysis
    print(pipeline.process("A man went into a bar ."))
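    
    # the pipeline can also tokenize raw text itself: pass "tokenize" as
    # the input format instead of "horizontal" (per the UDPipe manual);
    # the remaining arguments stay the same
    raw_pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
    print(raw_pipeline.process("A man went into a bar. He ordered a beer."))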