NLP frameworks
Why use an NLP framework?
How is it better than other options, e.g. manual implementation or using existing standalone tools? (Note: the benefits of using a framework listed below do not necessarily hold for all frameworks.)
- You can read in data in various formats and convert them to a unified representation; no further conversions are needed to use the tools, and a unified, structured API gives you access to the annotated data
- You get a number of tools in one package, ready to use, with unified APIs
- You can often do everything from one or more Python scripts and run the whole pipeline at once, while standalone tools typically have to be run, and their inputs and outputs manipulated, from a terminal/bash script/Makefile
- Built-in visualisation
- You can not only apply the tools but also train them (for machine learning, go to NPFL054 Introduction to machine learning, NAIL029 Machine Learning, or NPFL104 Machine Learning Exercises)
Overview of NLP frameworks
- NLTK Natural Language Toolkit (http://www.nltk.org/, reasonable tutorial http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk, NLTK book http://www.nltk.org/book/) -- good for English, usable for other langs, but with little support for e.g. Czech (you have to read in the Czech corpora manually, process them into the required format and train the tools you need yourself); reasonably easy integration of existing standalone NLP tools (e.g. an API to run the Stanford tools -- you have to install them independently and set some system variables correctly so that NLTK finds them, but then you can invoke them directly from NLTK; see the sketch after this list)
- Treex (http://ufal.mff.cuni.cz/treex) -- ÚFAL NLP toolkit, best for Czech, good for English, built-in support for several other langs (nl, de, pt, es…), support for UD; Perl only; an attempt to port the API to Python: PyTreex (https://github.com/ufal/pytreex); web interface: TreexWeb (https://lindat.mff.cuni.cz/services/treex-web/)
- Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) -- EN, ZH, ES, AR, FR, DE; comprehensive framework, good performance; Java, command-line interface, web service, APIs in ~15 programming languages (incl. Python, PHP, JavaScript…), also some integration with NLTK
- OpenNLP (https://opennlp.apache.org/) -- comprehensive framework; Java, command-line interface
- GATE (https://gate.ac.uk/) -- good for abstracting over complex pipelines
- spaCy (https://spacy.io/) -- very easy to use; but it supports only 7 languages (EN, DE, ES, PT, FR, IT, NL) and cannot be trained
- CogComp-NLP -- simple online interface; supports only English
- UDPipe (https://ufal.mff.cuni.cz/udpipe) -- trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files (usable from the command line, with bindings for C++, Python, Perl, C#, Java) -- a good choice if you don't want to do anything sophisticated within the framework and just want to get the analyses (and then either do the processing yourself manually, do it within another framework, or need no further processing at all)
- Udapi (http://udapi.github.io/) -- lightweight toolkit for working with Universal Dependencies -- currently it can really only read in and write out data, but once the data are read in, you can access them through a rather nice API; Python, Perl, Java (go to NPFL070 Language Data Resources)
- search in corpora: PMLTQ (https://lindat.mff.cuni.cz/services/pmltq/) -- go to NPFL075 Prague Dependency Treebank
- deep learning: TensorFlow (https://www.tensorflow.org/) -- go to NPFL114 Deep Learning
- information retrieval: Retriever, Lucene -- go to NPFL103 Information Retrieval
- dialogue systems: Alex (https://github.com/UFAL-DSG/alex) -- go to NPFL099 Statistical dialogue systems
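To illustrate the integration of standalone tools into NLTK mentioned above: a minimal sketch of invoking the Stanford POS tagger from NLTK, assuming you have downloaded the tagger separately (the paths below are placeholders; the model file name comes from the Stanford distribution):
import os
from nltk.tag import StanfordPOSTagger

# placeholder paths -- point them at your local Stanford POS tagger installation
os.environ['CLASSPATH'] = '/path/to/stanford-postagger'
os.environ['STANFORD_MODELS'] = '/path/to/stanford-postagger/models'

# load the tagger with one of the models shipped with the Stanford distribution
st = StanfordPOSTagger('english-bidirectional-distsim.tagger')
print(st.tag('A red bus stopped suddenly .'.split()))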
NLTK tutorial
- Installation:
# in terminal:
pip3 install --user nltk
ipython3

import nltk
# optionally:
# nltk.download()
# usually, you should choose to download "all" (but it may get stuck)
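If downloading everything takes too long or gets stuck, you can instead download just the resources needed for the examples below (the package names are an assumption based on current NLTK versions):
import nltk
nltk.download('punkt')                       # tokenizers (sent_tokenize, word_tokenize)
nltk.download('averaged_perceptron_tagger')  # default POS tagger
nltk.download('maxent_ne_chunker')           # named entity chunker
nltk.download('words')                       # word list used by the NE chunker
nltk.download('treebank')                    # Penn Treebank sample (used for tagger training below)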
- http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk
- http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize
- http://textminingonline.com/dive-into-nltk-part-iii-part-of-speech-tagging-and-pos-tagger
- http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization
Using existing tools in NLTK
Sentence segmentation, word tokenization, part-of-speech tagging, named entity recognition.
with open("genesis.txt", "r") as f: genesis = f.read() sentences = nltk.sent_tokenize(genesis) # just the first sentence tokens_0 = nltk.word_tokenize(sentences[0]) tagged_0 = nltk.pos_tag(tokens_0) # all sentences tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences] tagged_sentences = nltk.pos_tag_sents(tokenized_sentences) ne=nltk.ne_chunk(tagged_0) print(ne) ne.draw()
Training a tagger
from nltk.corpus import treebank
from nltk.tag import tnt

# split the Penn Treebank sample into training and test data
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

# train a TnT tagger on the training data
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)

# tag a sentence with the trained tagger
tnt_pos_tagger.tag(nltk.word_tokenize("A platypus is a very special animal."))

# evaluate tagging accuracy on the test data
tnt_pos_tagger.evaluate(test_data)

# save the trained tagger to a file and load it back
import pickle
with open('tnt_treebank_pos_tagger.pickle', 'wb') as f:
    pickle.dump(tnt_pos_tagger, f)
with open('tnt_treebank_pos_tagger.pickle', 'rb') as f:
    loaded_tagger = pickle.load(f)
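The unpickled tagger can then be used exactly like the freshly trained one:
# the loaded tagger behaves just like tnt_pos_tagger
loaded_tagger.tag(nltk.word_tokenize("The platypus swam away."))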
Trees in NLTK
Let's create a simple constituency tree for the sentence A red bus stopped suddenly:
# what we want to create:
#
#           S
#         /   \
#       NP     VP
#     / | \    / \
#    A red bus stopped suddenly

from nltk import Tree

# Tree(root, [children])
np = Tree('NP', ['A', 'red', 'bus'])
vp = Tree('VP', ['stopped', 'suddenly'])
# children can be strings or Trees
s = Tree('S', [np, vp])

# print out the tree
print(s)
# draw the tree (opens a small graphical window)
s.draw()
And a dependency tree for the same sentence:
# what we want to create:
#
#      stopped
#      /     \
#    bus   suddenly
#    / |
#   A red

# can either use string leaf nodes:
t1 = Tree('stopped', [Tree('bus', ['A', 'red']), 'suddenly'])
t1.draw()

# or represent each leaf node as a Tree without children:
t2 = Tree('stopped', [
    Tree('bus', [Tree('A', []), Tree('red', [])]),
    Tree('suddenly', []),
])
t2.draw()
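Tree objects also offer a small API for inspecting the structure; a short sketch, using the constituency tree s created above:
print(s.label())     # label of the root node: 'S'
print(s.leaves())    # the tokens: ['A', 'red', 'bus', 'stopped', 'suddenly']
print(s.height())    # height of the tree
for subtree in s.subtrees():   # iterates over all subtrees, including s itself
    print(subtree.label())
s.pretty_print()     # draw the tree as ASCII art directly in the terminal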
Tagging and parsing with UDPipe
Easy way: use the online service (it also has a REST API; see the sketch below)
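A minimal sketch of calling the REST API from Python (the endpoint and parameter names follow the public LINDAT service documentation; check the service page for the current details and model names):
import requests

# public LINDAT UDPipe REST API endpoint (an assumption -- verify on the service page)
url = "https://lindat.mff.cuni.cz/services/udpipe/api/process"
params = {
    "model": "english",               # model name; list the available ones at .../api/models
    "tokenizer": "",                  # empty value = run the tokenizer with default settings
    "tagger": "",
    "parser": "",
    "data": "A man went into a bar.",
}
response = requests.post(url, data=params)
print(response.json()["result"])      # the resulting CoNLL-U analysis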
Powerful way: use local installation (more control, also supports training) -- see below
- Installation:
# check out the udpipe repository
git clone https://github.com/ufal/udpipe.git

# compile udpipe
cd udpipe/src
make
cd ../..

# install the Python bindings
pip3 install --user ufal.udpipe

# download the trained models
wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1659/udpipe-ud-1.2-160523.zip
unzip udpipe-ud-1.2-160523.zip
- Sample usage:
# start ipython in the directory with the models (udpipe-ud-1.2-160523),
# as this makes it easier to load the models just by the filename;
# otherwise you have to specify the full path to the model
cd udpipe-ud-1.2-160523
ipython3

from ufal.udpipe import *

# load the model from the given file;
# if the file does not exist, expect a Segmentation fault
model = Model.load("english-ud-1.2-160523.udpipe")

# create a UDPipe processing pipeline with the loaded model,
# with "horizontal" input (a sentence with space-separated tokens),
# default settings for the tagger and parser,
# and CoNLL-U output
pipeline = Pipeline(model, "horizontal", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")

# analyze a tokenized sentence with UDPipe
# and print out the resulting CoNLL-U analysis
print(pipeline.process("A man went into a bar ."))
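The pipeline above expects pre-tokenized input; to analyze raw text, you can instead use the "tokenize" input format, which first runs UDPipe's own tokenizer and sentence splitter:
# analyze raw, untokenized text
raw_pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
print(raw_pipeline.process("A man went into a bar. He ordered a beer."))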