NLP frameworks
Why use an NLP framework?
How is it better than other options, e.g. manual implementation or using existing standalone tools? (Note: the benefits of using a framework listed below do not necessarily hold for all frameworks.)
- You can read in data in various formats and convert them to a unified representation; no further conversions are needed to use the tools, and a unified, structured API gives you access to the annotated data
- You get a number of tools in one package, ready to use, with unified APIs
- You can often do everything from one or more Python scripts and run the whole pipeline at once, while standalone tools typically have to be run, and their inputs and outputs manipulated, from a terminal/bash script/Makefile
- Built-in visualisation
- You can not only apply the tools but also train them (for machine learning, go to NPFL054 Introduction to machine learning, NAIL029 Machine Learning, or NPFL104 Machine Learning Exercises)
Overview of NLP frameworks
- NLTK Natural Language Toolkit (http://www.nltk.org/, reasonable tutorial http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk, NLTK book http://www.nltk.org/book/) -- good for English, usable for other langs, but with little support for e.g. Czech (you have to read in the Czech corpora manually, process them into the required format and train the tools you need yourself); reasonably easy integration of existing standalone NLP tools (e.g. an API to run the Stanford tools -- you have to install them independently and set some system variables correctly so that NLTK finds them, but then you can invoke them directly from NLTK; see the sketch after this list)
- Treex (http://ufal.mff.cuni.cz/treex) -- ÚFAL NLP toolkit, best for Czech, good for English, built-in support for several other langs (nl, de, pt, es…), support for UD; Perl only; an attempt to port the API to Python: PyTreex (https://github.com/ufal/pytreex); web interface: TreexWeb (https://lindat.mff.cuni.cz/services/treex-web/)
- Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) -- EN, ZH, ES, AR, FR, DE; comprehensive framework, good performance; Java, command-line interface, web service, APIs in ~15 programming languages (incl. Python, PHP, JavaScript…), also some integration with NLTK
- OpenNLP (https://opennlp.apache.org/) -- comprehensive framework; Java, command-line interface
- GATE (https://gate.ac.uk/) -- good for abstracting over complex pipelines
- spaCy (https://spacy.io/) -- very easy to use; but it supports only 7 languages (EN, DE, ES, PT, FR, IT, NL) and cannot be trained
- CogComp-NLP -- simple online interface; supports only English
- UDPipe (https://ufal.mff.cuni.cz/udpipe) -- trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files (usable from the command line, with bindings for C++, Python, Perl, C#, Java) -- a good choice if you don't want to do anything sophisticated within the framework and just want to get the analyses (and then either do the processing yourself manually, do it within another framework, or need no further processing at all)
- Udapi (http://udapi.github.io/) -- lightweight toolkit for working with Universal Dependencies -- currently it can really only read in and write out data, but once the data are read in, you can access them through a rather nice API; Python, Perl, Java (go to NPFL070 Language Data Resources)
- search in corpora: PMLTQ (https://lindat.mff.cuni.cz/services/pmltq/) -- go to NPFL075 Prague Dependency Treebank
- deep learning: TensorFlow (https://www.tensorflow.org/) -- go to NPFL114 Deep Learning
- information retrieval: Retriever, Lucene -- go to NPFL103 Information Retrieval
- dialogue systems: Alex (https://github.com/UFAL-DSG/alex) -- go to NPFL099 Statistical dialogue systems
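To illustrate the integration of standalone tools into NLTK mentioned above: a minimal sketch of invoking the Stanford POS tagger from NLTK, assuming you have downloaded the tagger separately (the paths below are placeholders; the model file name comes from the Stanford distribution):
import os
from nltk.tag import StanfordPOSTagger

# placeholder paths -- point them at your local Stanford POS tagger installation
os.environ['CLASSPATH'] = '/path/to/stanford-postagger'
os.environ['STANFORD_MODELS'] = '/path/to/stanford-postagger/models'

# load the tagger with one of the models shipped with the Stanford distribution
st = StanfordPOSTagger('english-bidirectional-distsim.tagger')
print(st.tag('A red bus stopped suddenly .'.split()))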
NLTK tutorial
- Installation:
# in terminal:
pip3 install --user nltk
ipython3

import nltk
# optionally:
# nltk.download()
# usually, you should choose to download "all" (but it may get stuck)
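If downloading everything takes too long or gets stuck, you can instead download just the resources needed for the examples below (the package names are an assumption based on current NLTK versions):
import nltk
nltk.download('punkt')                       # tokenizers (sent_tokenize, word_tokenize)
nltk.download('averaged_perceptron_tagger')  # default POS tagger
nltk.download('maxent_ne_chunker')           # named entity chunker
nltk.download('words')                       # word list used by the NE chunker
nltk.download('treebank')                    # Penn Treebank sample (used for tagger training below)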
- http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk
- http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize
- http://textminingonline.com/dive-into-nltk-part-iii-part-of-speech-tagging-and-pos-tagger
- http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization
Using existing tools in NLTK
Sentence segmentation, word tokenization, part-of-speech tagging, named entity recognition.
with open("genesis.txt", "r") as f: genesis = f.read() sentences = nltk.sent_tokenize(genesis) # just the first sentence tokens_0 = nltk.word_tokenize(sentences[0]) tagged_0 = nltk.pos_tag(tokens_0) # all sentences tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences] tagged_sentences = nltk.pos_tag_sents(tokenized_sentences) ne=nltk.ne_chunk(tagged_0) print(ne) ne.draw()
Training a tagger
from nltk.corpus import treebank
from nltk.tag import tnt

# split the Penn Treebank sample into training and test data
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

# train a TnT tagger on the training data
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)

# tag a sentence with the trained tagger
tnt_pos_tagger.tag(nltk.word_tokenize("A platypus is a very special animal."))

# evaluate tagging accuracy on the test data
tnt_pos_tagger.evaluate(test_data)

# save the trained tagger to a file and load it back
import pickle
with open('tnt_treebank_pos_tagger.pickle', 'wb') as f:
    pickle.dump(tnt_pos_tagger, f)
with open('tnt_treebank_pos_tagger.pickle', 'rb') as f:
    loaded_tagger = pickle.load(f)
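The unpickled tagger can then be used exactly like the freshly trained one:
# the loaded tagger behaves just like tnt_pos_tagger
loaded_tagger.tag(nltk.word_tokenize("The platypus swam away."))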
Trees in NLTK
Let's create a simple constituency tree for the sentence A red bus stopped suddenly:
# what we want to create:
#
#           S
#         /   \
#       NP     VP
#     / | \    / \
#    A red bus stopped suddenly

from nltk import Tree

# Tree(root, [children])
np = Tree('NP', ['A', 'red', 'bus'])
vp = Tree('VP', ['stopped', 'suddenly'])
# children can be strings or Trees
s = Tree('S', [np, vp])

# print out the tree
print(s)
# draw the tree (opens a small graphical window)
s.draw()
And a dependency tree for the same sentence:
# what we want to create:
#
#      stopped
#      /     \
#    bus   suddenly
#    / |
#   A red

# can either use string leaf nodes:
t1 = Tree('stopped', [Tree('bus', ['A', 'red']), 'suddenly'])
t1.draw()

# or represent each leaf node as a Tree without children:
t2 = Tree('stopped', [
    Tree('bus', [Tree('A', []), Tree('red', [])]),
    Tree('suddenly', []),
])
t2.draw()
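Tree objects also offer a small API for inspecting the structure; a short sketch, using the constituency tree s created above:
print(s.label())     # label of the root node: 'S'
print(s.leaves())    # the tokens: ['A', 'red', 'bus', 'stopped', 'suddenly']
print(s.height())    # height of the tree
for subtree in s.subtrees():   # iterates over all subtrees, including s itself
    print(subtree.label())
s.pretty_print()     # draw the tree as ASCII art directly in the terminal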
Tagging and parsing with UDPipe
Easy way: use the online service (it also has a REST API; see the sketch below)
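A minimal sketch of calling the REST API from Python (the endpoint and parameter names follow the public LINDAT service documentation; check the service page for the current details and model names):
import requests

# public LINDAT UDPipe REST API endpoint (an assumption -- verify on the service page)
url = "https://lindat.mff.cuni.cz/services/udpipe/api/process"
params = {
    "model": "english",               # model name; list the available ones at .../api/models
    "tokenizer": "",                  # empty value = run the tokenizer with default settings
    "tagger": "",
    "parser": "",
    "data": "A man went into a bar.",
}
response = requests.post(url, data=params)
print(response.json()["result"])      # the resulting CoNLL-U analysis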
Powerful way: use local installation (more control, also supports training) -- see below
- Installation:
# check out the udpipe repository
git clone https://github.com/ufal/udpipe.git

# compile udpipe
cd udpipe/src
make
cd ../..

# install the Python bindings
pip3 install --user ufal.udpipe

# download the trained models
wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1659/udpipe-ud-1.2-160523.zip
unzip udpipe-ud-1.2-160523.zip
- Sample usage:
# start ipython in the directory with the models (udpipe-ud-1.2-160523),
# as this makes it easier to load the models just by the filename;
# otherwise you have to specify the full path to the model
cd udpipe-ud-1.2-160523
ipython3

from ufal.udpipe import *

# load the model from the given file;
# if the file does not exist, expect a Segmentation fault
model = Model.load("english-ud-1.2-160523.udpipe")

# create a UDPipe processing pipeline with the loaded model,
# with "horizontal" input (a sentence with space-separated tokens),
# default settings for the tagger and parser,
# and CoNLL-U output
pipeline = Pipeline(model, "horizontal", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")

# analyze a tokenized sentence with UDPipe
# and print out the resulting CoNLL-U analysis
print(pipeline.process("A man went into a bar ."))
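The pipeline above expects pre-tokenized input; to analyze raw text, you can instead use the "tokenize" input format, which first runs UDPipe's own tokenizer and sentence splitter:
# analyze raw, untokenized text
raw_pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
print(raw_pipeline.process("A man went into a bar. He ordered a beer."))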