Automatic processing of text data


NPFL098 / ATKL00345

Pavel Straňák

stranak@ufal.mff.cuni.cz

Tuesday 10.40–13.50
Malostranské nám. 25, SU1

11. 4. 2017

NLP Applications and tools

Analysis of text

  1. sentence segmentation
  2. tokenisation
  3. stemming, POS tagging, lemmatisation and morphological analysis
  4. (surface) syntactic parser, chunker (= identification of clauses)
  5. deep syntax (e.g. in Treex via modifying the surface parse tree)
  6. other units (on various layers of description):
    • named entity recognition (NER): person, place, date, …
    • coreference (pronominal, nominal)
    • time relations (X immediately after Y and together with Z, etc.)
    • Word Sense Disambiguation (see t-lemma)

Complex analysis: a chain of many steps, or a joint problem for a statistical system that tries to learn all of them together. (The joint approach makes sense: in reality the steps influence each other in complex ways and do not form a simple chain.)
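
The "chain of many steps" view can be illustrated with a minimal sketch of the first two steps from the list above. The rules are deliberately naive regular expressions and the function names (`split_sentences`, `tokenise`, `analyse`) are hypothetical; this is not how Treex, NLTK or UDPipe implement these steps.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenise(sentence):
    """Naive tokenisation: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def analyse(text):
    """Chain the steps: tokenisation consumes the output of segmentation."""
    return [tokenise(sentence) for sentence in split_sentences(text)]


for tokens in analyse("Treex is written in Perl. Is NLTK written in Python?"):
    print(tokens)
```

With such a chain, an error in an early step propagates to every later one: the segmenter above splits "Dr. Novak arrived." after "Dr.", so the tokeniser never sees the real sentence. This is one reason to treat the steps as a joint problem.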

NLP Toolkits

Treex

"Treex (formerly TectoMT) is a highly modular NLP software system implemented in Perl programming language under Linux."

NLTK

LINDAT Tools

Services run at LINDAT/CLARIN as simple web applications. Each app has a clickable GUI (an HTML web form) and a REST API.
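
A REST API of this kind can be called from any program, not only from the web form. The sketch below shows one possible way to query UDPipe from Python. The endpoint URL, the parameter names (`data`, `model`, `tokenizer`, `tagger`, `parser`), the model name `english` and the `result` field of the JSON reply are assumptions to be checked against the tool's "REST API Documentation". The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; verify it in the tool's "REST API Documentation".
UDPIPE_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"


def build_url(text, model="english"):
    """Build a GET URL that asks for tokenisation, tagging and parsing (assumed parameters)."""
    fields = {"data": text, "model": model, "tokenizer": "", "tagger": "", "parser": ""}
    return UDPIPE_URL + "?" + urllib.parse.urlencode(fields)


def extract_result(response_text):
    """Return the analysis, assuming the JSON reply carries it in a "result" field."""
    return json.loads(response_text)["result"]


def process(text, model="english"):
    """Send the text to the service and return its output (needs network access)."""
    with urllib.request.urlopen(build_url(text, model), timeout=30) as response:
        return extract_result(response.read().decode("utf-8"))


# Example call, left commented out because it contacts the live service:
# print(process("Play with the curl tool."))
```

A GET request keeps the sketch short but only suits small inputs. For whole files, a POST request is the more usual choice.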

Homework

Play with the curl tool and the examples of using it to process data in UDPipe and other LINDAT tools. See the "REST API Documentation" of the tools in the links above.