NPFL092 Technology for NLP (Natural Language Processing)

Course schedule overview

  1. Linux and Bash
    • survival in Linux
    • Bash command line and scripting
    • text-processing commands
  2. Python
    • introduction to Python
    • text processing
    • regular expressions
    • object-oriented interface for processing linguistic structures in Python
  3. XML
    • representing linguistic structures in XML
    • processing XML in Python
  4. Extras (covered fully or partially based on remaining time at the end of the term)
    • selected good practices in software development (not only in NLP, not only in Python)
    • NLP tools and frameworks, processing morphologically and syntactically annotated data, visualization, search
    • data and software licensing

More detailed course schedule

  1. Introduction
    • slides
    • Motivation
    • Course requirements: MFF linux lab account
    • Course plan, overview of required work, assignment requirements
    Survival in Linux:
    • keyboard shortcuts in KDE/GNOME, selected e.g. from here
    • motivation for scripting, command line features (completion, history...), keyboard shortcuts
    • bash in a nutshell (ls (-l,-a,-1,-R), cd, pwd, cp (-R), mv, rm (-r, -f), mkdir (-p), rmdir, chmod, ssh (-XY), less, more, cat, ln (-s), .bashrc, wget, head, tail, file, man...)
    • exercise: playing with text files (materials also available for download)
    • remote access to a unix machine: SSH (Secure Shell)
      • you can access a lab computer e.g. by opening a unix terminal and typing
        (replace yourlogin with your lab login and type your lab password when asked for it; instead of 17 you can use any number between 1 and something like 30 -- it is the number of the computer in the central lab that you are connecting to)
      • your home is shared across all the lab computers in all the MS labs (SU1, SU2, Rotunda), i.e. you will see your files everywhere
      • you can ssh even from non-unix machines
        • on Windows, you can use e.g. the Putty software
        • on any computer with the Chrome browser, you can use the Secure Shell extension (and there are similar extensions for other browsers as well) which allows you to open a remote terminal in a browser tab -- this is probably the most comfortable way
        • on an Android device, you can use e.g. JuiceSSH
    • Supplementary reading
    • Homework: Connect remotely from your home computer to the MS lab, check that you can see there the data from the class (or use wget and unzip to get the UDHR data to the computer -- see link above), and try practising some of the commands from the class: try renaming files, copying files, changing file permissions, etc. Try to create a shell script that prints some text, make it executable, and run it, e.g.:
      echo 'echo Hello World' >
      chmod u+x
      You can also try connecting to the MS lab from your smartphone and running a few commands -- this will let you experience the power of being able to work remotely in Bash from anywhere...
      This homework does not require you to submit anything to us; just practice as much as you feel you need so that you feel confident in Bash. Do this homework before coming to the next lab. And, as always, if you run into any problems, contact us by e-mail!
  2. Character encoding (very short)
    • ascii, 8-bits, unicode, conversions, locales (LC_*)
    • slides
    • Questions: answer the following questions:
      • What is ASCII?
      • What 8-bit encoding do you know for Czech or for your native language? How do they differ from ASCII?
      • What is Unicode?
      • What Unicode encodings do you know?
      • What is the relation between UTF-8 and ASCII?
      • Take a sample of Czech text (containing some diacritics), store it into a plain text file and convert it (by iconv) to at least two different 8-bit encodings, and to utf-8 and utf-16. Explain the differences in file sizes.
      • How can you detect file encoding?
      • Store any Czech web page into a file, change the file encoding and the encoding specified in the file header, and find out how it is displayed in your browser if the two encodings differ.
      • How do you specify file encoding when storing a plain text or a source code in your favourite text editor?
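The file-size differences asked about above can also be checked directly in Python (a small illustration with a made-up Czech phrase):

```python
s = "žluťoučký kůň"              # 13 characters, 6 of them with diacritics
latin2 = s.encode("iso-8859-2")  # an 8-bit encoding: 1 byte per character
utf8 = s.encode("utf-8")         # 1 byte for ASCII, 2 bytes for the Czech letters
utf16 = s.encode("utf-16")       # 2 bytes per character plus a 2-byte BOM
print(len(latin2), len(utf8), len(utf16))  # 13 19 28
```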
    Mastering your text editor
    • requirements on a modern source-code editor
      1. modes (progr. languages, xml, html...)
      2. syntax highlighting
      3. completion
      4. indentation
      5. support for encodings (utf-8)
      6. integration with compiler...
    • fallback mode for working in a text console
    • you can use any editor you like, as long as it has the capabilities listed above and you know how to use them
    • if you don't have a favourite Linux editor yet, we suggest e.g. atom (demonstration of atom in the class); Atom is installed in the labs, and is cross-platform, i.e. you can also use it on Windows and Mac
    • for a text-mode editor (without a graphical user interface, e.g. for working through ssh), we suggest nano
    • other good editors include e.g. Sublime (cross-platform); for Windows, e.g. Notepad++ and PSPad are good
    • for using emacs (if you really want to): look here
    • for using vim (if you really want to): run the vimtutor command to go through an introductory tutorial of using vim (vimtutor english to run the English version of the tutorial) (boring for those who already know or use vi, too long for 45 minutes)
    • Homework: make sure you know how to invoke all the mentioned features in your favourite text editor
  3. Text-processing commands in bash
    • sort, uniq, cat, cut, [e]grep, sed, head, tail, rev, diff, patch, set, pipelines, man...
    • regular expressions
    • exercises
    • Homework: read Unix for Poets by Kenneth Ward Church
    Bash scripting
    • if, while, for
    • xargs: compare
      sed 's/:/\n/g' <<< "$PATH" | \
      grep $USER | \
      while read path ; do
        ls "$path"
      done
      with
      sed 's/:/\n/g' <<< "$PATH" | \
      grep $USER | \
      xargs ls

    Shell script, patch to show changes we made -- just run

    patch -p0 <


    • Homework 01:
      • Write your Makefile with targets t2 to t18 from the Exercises. Put the HW into
        (and commit it and push it to Redmine)
      • Deadline: Wednesday 1st November 2017
      • Please create a fresh git clone of the homework repo in the unix lab (recall that you can access it remotely using ssh) to double-check that everything is in its place.
  4. Bash cont.
    • warm-up exercises:
      • Task 1: construct a bash pipeline that extracts words from an English text read from the input, and sorts them in the "rhyming" order (lexicographical ordering, but from the last letter to the first letter; "retrográdní uspořádání" in Czech) (hint: use the command rev for reversing individual lines)
      • Task 2: construct a bash pipeline that reads an English text from the input and finds the 3-letter "suffixes" that are most frequent in the words contained in the text, irrespective of the words' frequencies (suffixes not in the linguistic sense, simply just the last 3 letters of a word that contains at least 5 letters) (hint: you can use e.g. sed 's/./&\t/g' | rev | cut -f2,3,4 | rev for extracting the last three letters)
    • system variables
    • editing .bashrc (aliases, paths...)
    • looping, branching, e.g.
      for file in *; do
        if [ -x "$file" ]; then
          echo Executable file: $file
          echo Shebang line:  `head -n 1 $file`
        fi
      done
  5. Introduction to Python
    • Study the Python Tutorial as homework
    • To solve practical tasks, Google is your friend…
    • By default, we will use Python version 3:
      A day may come when you will need to use Python 2, so please note that there are some differences between these two. (Also note that you may encounter code snippets in either Python 2 or Python 3…)
    • To work interactively with Python, use IPython:
      • to save commands 5-10 from your IPython session to a file, run:
        %save mysession 5-10
      • to exit IPython, run exit (or press Ctrl-D)
    • For non-interactive work, use your favourite editor. (Rudolf uses vim, but has heard PyCharm is really good.)
    • First Python exercises (simple language modelling): we got up to the 4th item only
  6. Simple language modelling in Python
    • Finishing the Language modelling exercises from last class
      A sample solution to exercises 1 to 13 can be found in
    • the string data type in Python
      • a tutorial
      • case changing (lower, upper, capitalize, title, swapcase)
      • is* tests (isupper, isalnum...)
      • matching substrings (find, startswith, endswith, count, replace)
      • split, splitlines, join
      • other useful methods (not necessarily for strings): dir, sorted, set
      • my string ipython3 session from the lab (unfiltered)
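The methods above are best explored interactively in IPython; a few illustrative calls (the sample strings are ours):

```python
s = "Hello, World"
assert s.lower() == "hello, world"        # case changing
assert s.title() == "Hello, World"
assert s.startswith("Hello") and s.endswith("World")
assert s.count("l") == 3                  # matching substrings
assert s.replace("World", "NLP") == "Hello, NLP"
assert "a-b-c".split("-") == ["a", "b", "c"]
assert "-".join(["a", "b", "c"]) == "a-b-c"
assert sorted(set("banana")) == ["a", "b", "n"]  # sorted, set
```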
    • Warmdown: implement a simple wc-like tool in Python, so that running
      python3 textfile.txt
      will print out three numbers: the number of lines, words, and characters in the file (for words, you can simply use whitespace-delimited strings -- there is a string method that does just that...)
    • Homework hw02: Implement at least three items from the extensions of the language modelling exercises (extension 1 is obligatory; the simplest to do are then probably 2 and 3, the rest may require more googling). You can get bonus points for implementing more of the extensions.
      Put your homework into 2017-npfl092/hw02/ (and don't forget to add it, commit, and push).
      Deadline: Wednesday 15th November 2017 19:00
    • If you need help, try (preferably in this order):
      1. Google
      2. Google
      3. Google
      4. asking at/after the next lab
      5. asking by e-mail (please send the e-mail to both of us, as this increases your chances of getting an early reply)
  7. Python: text processing, regular expressions
    • a warm-up exercise: find palindrome words in English
      • A palindrome word reads the same forward and backward, e.g. "level"
      • Write a python script that reads text from stdin and prints detected palindromes (one per line) to stdout
      • print only palindrome words longer than three letters
      • apply your script on the English translation of Homer's The Odyssey available as an UTF-8 encoded Project Gutenberg ebook here.
      • a slightly more advanced extension (optional): try to find longer expressions that read the same in both directions after spaces are removed (two or more words; a contiguous segment of the input text, possibly crossing line boundaries)
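One possible shape of the basic version (a sketch; the tokenization by a simple `[a-z]+` regex is our choice):

```python
import re

def palindromes(text):
    # yield lowercase word tokens longer than three letters
    # that read the same forward and backward
    for word in re.findall(r"[a-z]+", text.lower()):
        if len(word) > 3 and word == word[::-1]:
            yield word

# usage on stdin: for w in palindromes( print(w)
```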
    • encoding in Python
      • differences in handling of encoded data between Python 2 and Python 3
      • a simple rule: use Unicode everywhere, and if conversions from other encodings are needed, do them as close to the physical data as possible (i.e., encoding should be handled properly already in the data reading/writing phase, not by decoding the content of variables internally)
      • example: f = open(fname, encoding="latin-1")
      • sys.stdout = codecs.getwriter('utf-8')(sys.stdout) (a Python 2 idiom; in Python 3, wrap the byte stream sys.stdout.buffer instead)
      • more about the topic can be found here
    • regular expressions in Python
      • a python regexp tutorial
      • to be able to use the regex module:
        1. in bash: pip3 install --user regex
        2. in python: import regex as re
        (Python has built-in regex support in the re module, but regex seems to be more powerful while using the same API.)
      • search, findall, sub
      • raw strings (r'...'), character classes ([[:alnum:]], \w, ...), flags (flags=re.I or r'(?i)...'), subexpressions r'(.) (...)' + backreferences r'\1 \2'
      • revision of regexes (^[abc]*|^[.+-]?[a-f]+[^012[:alpha:]]{3,5}(up|down)c{,5}$)
      • good text to play with: the first chapter of genesis again
      • my regex ipython3 session from the lab (unfiltered, from a lab taught in year 2016)
    • Homework 03: Redo hw01 in Python, implementing the targets t2 to t18 from the Exercises in one Python script, so that e.g. running the script with the argument t16 prints out the frequency list of letters in the skakalpes file; running your script with no parameters should invoke all the targets.
      Of course, do not take the tasks word for word, as they explicitly name Bash commands to use, while you have to use Python instead. E.g. for t2, you can use urllib.request.urlopen, which returns an object with many methods, including read() (you must first import urllib.request). In t3, just print the text once (you don't have to implement less). For t4, look for decode()...
      Put the HW to 2017-npfl092/hw03/
      Deadline: Monday 27th November 2017 23:59
  8. Python: modules, packages, classes
    • Specification: implement a simple Czech POS tagger in Python, choose any approach you want, required precision at least 50%
      • Tagger input format - data encoded in iso-8859-2 in a simple line-oriented plain-text format: empty lines separate sentences, non-empty lines contain word forms in the first column and a simplified (one-letter) POS tag in the second column, such as N for nouns or A for adjectives (you can look at the tagset documentation). Columns are separated by tabs.
      • Tagger output format: empty lines not changed, nonempty lines enriched with a third column containing the predicted POS for each line
      • Training data: tagger-devel.tsv
      • Evaluation data: tagger-eval.tsv (to be used only for evaluation!!!)
      • Performance evaluation (precision=correct/total):
           cat tagger-eval.tsv | ./ | ./  
      • Example baseline solution - everything is a noun, precision 34%:
        python -c'import sys;print"".join(l if l<"\r" else l[:-1]+"\tN\n" for l in sys.stdin)'<tagger-eval.tsv|./
    • Homework HW04: a simple POS tagger, this time OO solution
      • turn your warm-up exercise solution into an OO solution:
        • implement a class Tagger
        • the tagger class has a method tagger.see(word,pos) which gets a word-pos instance from the training data (and probably stores it into a dictionary or something)
        • the tagger class has a method tagger.train() that infers a model (if needed)
        • the tagger class has a method that saves the model to a file (it is recommended to use pickle)
        • the tagger class has a method tagger.load(filename) that loads the model from a file
        • the tagger class has a method tagger.predict(word) that predicts a POS tag for a word given the trained model
      • the tagger should be usable as a Python module:
        • e.g. if your Tagger class resides in its own file, you should be able to use it in another script, e.g. by importing it (from my_tagger_class import Tagger; note that hyphens are not allowed in Python module names)
        • one option of achieving this is by having just the Tagger class in the script, with no code outside of the class (you then need another script to use your tagger)
        • another option is to wrap any code which is outside the class into the name=main block, which is executed only if the script is run directly, not when it is imported into another script:
          # This is the Tagger class, which will be imported when you "import Tagger"
          class Tagger:
              def __init__(self):
                  self.model = dict()
              def see(self, word, pos):
                  self.model[word] = pos
          # This code is only executed when you run the script directly, e.g. "python3"
          if __name__ == "__main__":
              tagger = Tagger()
              tagger.see("big", "A")
      • wrap your solution into a Makefile with the following targets:
        • download - downloads the data
        • train - trains a tagging model given the training file and stores it into a file
        • predict - appends the column with predicted POS into the test file
        • eval - prints the accuracy
      • Put your solution into 2017-npfl092/hw04/
      • Deadline: Monday 11th December 2017, 23:59 CET
  9. A gentle introduction to XML
    • Motivation for XML, basics of XML syntax, examples, well-formedness/validity, dtd, xmllint
    • Slides
    • XML exercise: create an XML file representing some data structures (ideally NLP-related) manually in a text editor, or by a Python script. The file should contain at least 7 different elements, some of them should have attributes. Create a DTD file and make sure that the XML file is valid w.r.t. the DTD file.
    • Create a Makefile that has targets "wellformed" and "valid" and uses xmllint to verify the file's well-formedness and its validity with respect to the DTD file.
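If you generate the XML by a Python script, the standard xml.etree.ElementTree module is enough; a tiny sketch (the element and attribute names are made up):

```python
import xml.etree.ElementTree as ET

# build a tiny treebank-like document
corpus = ET.Element("corpus", lang="en")
sentence = ET.SubElement(corpus, "sentence", id="s1")
for i, (form, pos) in enumerate([("Dogs", "N"), ("bark", "V")], start=1):
    token = ET.SubElement(sentence, "token", id=f"s1.t{i}", pos=pos)
    token.text = form

# serialize; xmllint can then check the result against your DTD
print(ET.tostring(corpus, encoding="unicode"))
```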
    • Homework 05:
      • finish the exercise: XML+DTD files
      • store it into 2017-npfl092/hw05/ in your git repository (and don't forget to commit and push it)
      • Deadline: Monday 18th December 2017, 23:59 CET
  10. XML & JSON
    • Exercise: For each of the supplied files, check whether it is a well-formed XML file (e.g. by xmllint), and if not, fix it (possibly manually in a text editor, or any way you want).
    • Exercise: write a Python script that recognizes (at least some of) the well-formedness violations present in the above mentioned files, without using any specific library for XML processing
    • A very quick overview of some XML-related standards (namespaces, XPath, XSL, SAX, DOM): slides
    • Intro to XML and JSON processing in Python: xmljson.pdf
    • Homework 06:
      • download a simplified file with Universal Dependencies trees dependency_trees_from_ud.tsv (note: simplification = some columns removed from the standard conllu format)
      • write a Python script that converts this data into a reasonably structured XML file
      • write a Python script that reads the XML file and converts it into a JSON file
      • write a Python script that reads the JSON file and converts it back to the tsv file
      • check that the final output file is identical with the original input file
      • organize it all in a Makefile with targets download, tsv2xml, xml2json, json2tsv, and check for the individual steps, and a target all that runs them all
      • put your solution into 2017-npfl092/hw06/
      • deadline: 3rd January, 17:00
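For the JSON leg of the conversion, the json module gives you a lossless round trip as long as you stick to dicts, lists, strings, and numbers; a toy stand-in for one tree (our made-up keys):

```python
import json

tree = [
    {"id": 1, "form": "Dogs", "head": 2},
    {"id": 2, "form": "bark", "head": 0},
]
serialized = json.dumps(tree, ensure_ascii=False)
assert json.loads(serialized) == tree   # the round trip is lossless
```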
  11. NLTK and other NLP frameworks
    • NLP frameworks, including an intro to NLTK and UDPipe
    • exercise: once again processing genesis, this time in NLTK:
      • read in the text of the first chapter of Genesis
      • use NLTK to split the text into sentences, split the sentences into tokens, and tag the tokens for part-of-speech
      • print out the output as TSV, one token per line, wordform and POS tag separated by a tab, with an empty line separating sentences
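The final TSV-printing step can be sketched like this (assuming the sentences have already been tagged, e.g. by nltk.pos_tag, into lists of (form, tag) pairs):

```python
def to_tsv(tagged_sentences):
    # tagged_sentences: a list of sentences, each a list of (form, tag) pairs
    lines = []
    for sentence in tagged_sentences:
        for form, tag in sentence:
            lines.append(f"{form}\t{tag}")
        lines.append("")        # an empty line terminates each sentence
    return "\n".join(lines)
```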
    • Homework 07:
      • train and evaluate a Czech part-of-speech tagger in NLTK
      • use any of the trainable taggers available in NLTK (tnt looked quite promising), achieving some non-trivial accuracy (if your accuracy is e.g. 20%, then something is wrong)
      • you can experiment with multiple taggers and multiple settings and improvements to achieve a good accuracy (this is not required, but you can get bonus points)
      • use the data from the previous tagging homework: tagger-devel.tsv as training data, tagger-eval.tsv as evaluation data
      • note that you have to convert the input data appropriately into a format which is expected by the tagger
      • wrap your solution into a Makefile, with the targets download, train, predict, eval (as in hw04)
      • put your solution into 2017-npfl092/hw07/
      • Deadline: 8th January, 23:59
  12. Selected good practices in software development (not only in NLP, not only in Python)
    • warm-up exercise: find English word groups in which the words are derived one from the other, such as interest-interesting-interestingly; use the list of 10,000 most frequent English lemmas bnc_freq_10000.txt
    • good development practices - slides (testing, benchmarking, profiling, code reviewing, bug reporting)
    • exercise:
      • exchange solutions of HW05 with one of your colleagues
      • implement unit tests (using unittest) of his/her solution
      • if you find some problems, send him/her a bugreport
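A minimal unittest sketch to start from (word_count here is a toy stand-in for your colleague's code under test):

```python
import unittest

def word_count(text):
    # the function under test -- a stand-in for your colleague's code
    return len(text.split())

class TestWordCount(unittest.TestCase):
    def test_simple(self):
        self.assertEqual(word_count("two words"), 2)

    def test_empty(self):
        self.assertEqual(word_count(""), 0)

# run the tests with: python3 -m unittest
```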

    The future is under construction!!!

  13. Homework HW-- (not yet set): word frequency colorizer
    • write a Python script that reads some big text (e.g. the one from the morning exercise), tokenizes it, performs some trivial stemming (e.g. removing the most frequent inflectional and derivational suffixes like -ed or -ly), collects the number of occurrences of each such stem, and generates an HTML file which contains e.g. the first 1000 words colorized according to their stem's frequency (e.g. three bands: green for very frequent words, yellow for the middle band, red for very rare words)
    • Commit your solution into npfl092/hw--/
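The trivial stemming and counting part might look like this (the tiny suffix list is purely illustrative):

```python
import re
from collections import Counter

SUFFIXES = re.compile(r"(ing|ed|ly|s)$")   # an illustrative, tiny suffix list

def stem(word):
    # strip one frequent suffix, if present
    return SUFFIXES.sub("", word.lower())

def stem_counts(text):
    # count occurrences of each stem over the word tokens of the text
    return Counter(stem(w) for w in re.findall(r"[A-Za-z]+", text))
```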
  14. Data visualization

    Morning warm-up exercise: (1) make a frequency list of the HTML tags of this web page, (2) supposing the page is well-formed XML, write a converter that transforms its content into simply formatted plain text (such as \n, several spaces, and * in front of every list item). You can use any standard technique for processing XML files (Twig, SAX, XPath...).


    • gnuplot
    • dot/graphviz
    • figures/tables for latex
    • Homework: ACL-style article draft containing a learning curve of your tagger (or of any other trainable tool). Create a Makefile that
      1. applies your previously created POS tagger to gradually increasing training data (or applies any other tool for which a quantity dependent on the input data size can be measured) and evaluates it in each iteration (always on the same test data). It is recommended to use exponentially growing sizes of the training data (e.g. 100 words, 1kW, 10kW, 100kW ...). You can use any other trainable NLP tool (but not tools developed by your colleagues in the course). The simplest acceptable solution is a tool measuring the OOV rate (out-of-vocabulary rate: how many words in the test data have not been seen in the training data).
      2. collects the learning curve statistics from the individual iterations and converts them to a LaTeX table as well as to a graphical form: data size (preferably in log scale) on the horizontal axis, and tool performance on the vertical axis. Use gnuplot for the latter task.
      3. downloads the LaTeX article style for ACL 2011 conference papers and compiles your article into a PDF. Create a simple LaTeX article using this style, include the generated table and figure in it, and fill in the table's and figure's captions (the text in the rest of the article is not important).
      Commit the homework into 2016-npfl092/hw08/. Make sure that the Makefile performs all the steps correctly on a fresh checkout of the directory. Deadline: 16th January 2016, 12:00.
  15. Data and Software licensing
    • morning exercise: theater of the absurd is a form of drama; one of its characteristic devices is repetitive dialogue, sometimes with utterances swapped between two or more actors. Task: find occurrences of swapped utterances in Václav Havel's play Zahradní slavnost (The Garden Party), and print out whose lines were repeated by whom.
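Once the play is parsed into (speaker, utterance) pairs, the detection itself is a small dictionary exercise; a sketch with hypothetical toy data:

```python
from collections import defaultdict

def repeated_utterances(play):
    # play: a list of (speaker, utterance) pairs in textual order
    speakers_by_utterance = defaultdict(list)
    for speaker, utterance in play:
        speakers_by_utterance[utterance].append(speaker)
    # keep only utterances spoken by more than one distinct speaker
    return {u: s for u, s in speakers_by_utterance.items() if len(set(s)) > 1}
```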
    • Licenses
      • authors' rights in the Czech Republic, slides authors_rights_intro.pdf
      • open source movement
      • GPL, Artistic license
      • Creative Commons (mainly CC0 and Attribution) and Open Data Commons:
      • Licenses for PDT, CNK,
      • data distributors, ELRA/ELDA, LDC, currently emerging networks
    • Checking all your homework tasks.
    • Premium task (T.B.A.)
  16. Final written test

Required work

Rules for homework

Premium tasks

Rules for the final test

Determination of the final grade