NPFL092 Technology for NLP (Natural Language Processing)

Course schedule overview

  1. Linux and Bash
    • survival in Linux
    • Bash command line and scripting
    • text-processing commands
  2. Python
    • introduction to Python
    • text processing
    • regular expressions
    • object-oriented interface for processing linguistic structures in Python
  3. XML
    • representing linguistic structures in XML
    • processing XML in Python
  4. Extras (covered fully or partially based on remaining time at the end of the term)
    • selected good practices in software development (not only in NLP, not only in Python)
    • NLP tools and frameworks, processing morphologically and syntactically annotated data, visualization, search
    • data and software licensing

More detailed course schedule

  1. Introduction
    • slides
    • Motivation
    • Course requirements: MFF linux lab account
    • Course plan, overview of required work, assignment requirements
    Survival in Linux:
    • keyboard shortcuts in KDE/GNOME, selected e.g. from here
    • motivation for scripting, command line features (completion, history...), keyboard shortcuts
    • bash in a nutshell (ls (-l,-a,-1,-R), cd, pwd, cp (-R), mv, rm (-r, -f), mkdir (-p), rmdir, chmod, ssh (-XY), less, more, cat, ln (-s), .bashrc, wget, head, tail, file, man...)
    • exercise: playing with text files (udhr.zip, also available for download at bit.ly/2hQQeTH)
    • remote access to a unix machine: SSH (Secure Shell)
      • you can access a lab computer e.g. by opening a unix terminal and typing
        ssh yourlogin@u-pl17.ms.mff.cuni.cz
        (replace yourlogin with your lab login and type your lab password when asked for it; instead of 17 you can use any number between 1 and something like 30 -- it is the number of the computer in the central lab that you are connecting to)
      • your home is shared across all the lab computers in all the MS labs (SU1, SU2, Rotunda), i.e. you will see your files everywhere
      • you can ssh even from non-unix machines
        • on Windows, you can use e.g. the Putty software
        • on any computer with the Chrome browser, you can use the Secure Shell extension (and there are similar extensions for other browsers as well) which allows you to open a remote terminal in a browser tab -- this is probably the most comfortable way
        • on an Android device, you can use e.g. JuiceSSH
    • Supplementary reading
    • Homework: Connect remotely from your home computer to the MS lab, check that you can see the data from the class there (or use wget and unzip to get the UDHR data to the computer -- see the link above), and try practising some of the commands from the class: try renaming files, copying files, changing file permissions, etc. Try to create a shell script that prints some text, make it executable, and run it, e.g.:
      echo 'echo Hello World' > hello.sh
      chmod u+x hello.sh
      ./hello.sh
      You can also try connecting to the MS lab from your smartphone and running a few commands -- this will let you experience the power of being able to work remotely in Bash from anywhere...
      This homework does not require you to submit anything to us; just practise as much as you need to feel confident in Bash. Do this homework before coming to the next lab. And, as always, if you run into any problems, contact us by e-mail!
  2. Character encoding (very short)
    • ASCII, 8-bit encodings, Unicode, conversions, locales (LC_*)
    • slides
    • Questions: answer the following questions:
      • What is ASCII?
      • What 8-bit encodings do you know for Czech or for your native language? How do they differ from ASCII?
      • What is Unicode?
      • What Unicode encodings do you know?
      • What is the relation between UTF-8 and ASCII?
      • Take a sample of Czech text (containing some diacritics), store it into a plain text file and convert it (with iconv) to at least two different 8-bit encodings, and to utf-8 and utf-16. Explain the differences in file sizes. (A minimal Python illustration of the size differences follows after this list.)
      • How can you detect file encoding?
      • Store any Czech web page into a file, change the file encoding and the encoding specified in the file header, and find out how it is displayed in your browser if the two encodings differ.
      • How do you specify file encoding when storing a plain text or a source code in your favourite text editor?
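      A minimal Python illustration of why the file sizes differ (the sample sentence is just an example):
        # Compare how many bytes the same Czech text needs in different encodings.
        text = "Příliš žluťoučký kůň úpěl ďábelské ódy"
        for enc in ("iso-8859-2", "cp1250", "utf-8", "utf-16"):
            print(enc, len(text.encode(enc)), "bytes")
        # 8-bit encodings need 1 byte per character, UTF-8 needs 2 bytes for each
        # accented letter, and UTF-16 needs at least 2 bytes for every character.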
    Mastering your text editor
    • requirements on a modern source-code editor
      1. modes (progr. languages, xml, html...)
      2. syntax highlighting
      3. completion
      4. indentation
      5. support for encodings (utf-8)
      6. integration with compiler...
    • fallback mode for working in a text console
    • you can use any editor you like, as long as it has the capabilities listed above and you know how to use them
    • if you don't have a favourite Linux editor yet, we suggest e.g. Atom (demonstrated in class); Atom is installed in the labs and is cross-platform, i.e. you can also use it on Windows and Mac
    • for a text-mode editor (without a graphical user interface, e.g. for working through ssh), we suggest nano
    • other good editors include e.g. Sublime (cross-platform); for Windows, e.g. Notepad++ and PSPad are good
    • for using emacs (if you really want to): look here
    • for using vim (if you really want to): run the vimtutor command to go through an introductory tutorial of using vim (vimtutor english to run the English version of the tutorial) (boring for those who already know or use vi, too long for 45 minutes)
    • Homework: make sure you know how to invoke all the mentioned features in your favourite text editor
  3. Text-processing commands in bash
    • sort, uniq, cat, cut, [e]grep, sed, head, tail, rev, diff, patch, set, pipelines, man...
    • regular expressions
    • exercises
    • Homework: read Unix for Poets by Kenneth Ward Church
    Bash scripting
    • if, while, for
    • xargs: compare
      sed 's/:/\n/g' <<< "$PATH" | \
      grep $USER | \
      while read path ; do
        ls $path
      done
      with
      sed 's/:/\n/g' <<< "$PATH" | \
      grep $USER | \
      xargs ls

    Shell script, and a patch showing the changes we made -- just run

    patch -p0 < script.sh

    Makefiles

    Git
    • Homework: Write a Makefile with targets t2 to t18 from the Exercises. Put the HW into
      2017-npfl092/hw01/
      (and commit it and push it to Redmine)
  4. Introduction to Python
    • Study the Python Tutorial as homework
    • To solve practical tasks, Google is your friend…
    • By default, we will use Python version 3:
      python3
      A day may come when you will need to use Python 2, so please note that there are some differences between these two. (Also note that you may encounter code snippets in either Python 2 or Python 3…)
    • To work interactively with Python, use IPython:
      ipython3
      • to save the commands 5-10 from your IPython session to a file named mysession.py, run:
        %save mysession 5-10
      • to exit IPython, run:
        exit
    • For non-interactive work, use your favourite editor. (Rudolf uses vim, but has heard that PyCharm is really good.)
    • First Python exercises (simple language modelling)
    • Homework: Implement at least two items from the bonus extensions (extension 1 is obligatory; the simplest to do are then probably 2 and 3, the rest may require more googling). You can get bonus points for implementing more of the extensions.
      Commit your homework to SVN; you should put it into:
      https://svn.ms.mff.cuni.cz/svn/undergrads/students/<your-login>/2016-npfl092/hw02/
    • If you need help, try (preferably in this order):
      1. Google
      2. Google
      3. Google
      4. asking at/after the next lab
      5. asking by e-mail (please send the e-mail to both of us, as this increases your chances of getting an early reply)
  5. Basic text processing in Python
    • a warm-up exercise: find palindrome words in English
      • A palindrome word reads the same forward and backward, e.g. "level"
      • Write a python script that reads text from stdin and prints detected palindromes (one per line) to stdout
      • print only palindrome words longer than three letters
      • apply your script to the English translation of Homer's The Odyssey, available as a UTF-8 encoded Project Gutenberg ebook here.
      • a slightly more advanced extension (optional): try to find longer expressions that read the same in both directions after spaces are removed (two or more words; a contiguous segment of the input text, possibly crossing line boundaries)
    • encoding in Python
      • differences in handling of encoded data between Python 2 and Python 3
      • a simple rule: use Unicode everywhere, and if conversions from other encodings are needed, do them as close to the physical data as possible (i.e., the encoding should be handled already when reading/writing the data, not internally by decoding the contents of variables); see the short sketch below
      • example: f = open(fname, encoding="latin-1")
      • sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
      • more about the topic can be found here
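      • a minimal sketch of the decode-at-the-boundary rule (the file names are just placeholders):
        # Decode when reading, work with str internally, encode when writing.
        with open("input-latin2.txt", encoding="iso-8859-2") as fin, \
             open("output-utf8.txt", "w", encoding="utf-8") as fout:
            for line in fin:              # line is already a decoded str
                fout.write(line.upper())  # no byte juggling inside the program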
    • Homework HW03: word frequency colorizer
      • write a Python script that reads some big text (e.g. the one from the morning exercise), tokenizes it, performs some trivial stemming (e.g. removing the most frequent inflectional and derivational suffixes like -ed or -ly), collects the numbers of occurrences of the stems, and generates an HTML file which contains e.g. the first 1000 words colorized according to their stem's frequency (e.g. three bands: green for very frequent words, yellow for the middle band, red for very rare words)
      • Commit your solution into 2016-npfl092/hw03/
  6. Python: strings and regular expressions
    • Warmup: implement a simple wc-like tool in Python, so that running
      python3 wc.py textfile.txt
      will print out three numbers: the number of lines, words, and characters in the file (for words, you can simply use whitespace-delimited strings -- there is a string method that does just that...)
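      One possible minimal sketch (reading the file name from sys.argv):
        import sys

        with open(sys.argv[1], encoding="utf-8") as f:
            text = f.read()
        # number of lines, whitespace-delimited words, and characters
        print(len(text.splitlines()), len(text.split()), len(text))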
    • the string data type in Python
      • a tutorial
      • case changing (lower, upper, capitalize, title, swapcase)
      • is* tests (isupper, isalnum...)
      • matching substrings (find, startswith, endswith, count, replace)
      • split, splitlines, join
      • other useful methods (not necessarily for strings): dir, sorted, set
      • my string ipython3 session from the lab (unfiltered)
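      • a few of these methods in action (a minimal example):
        s = "To be or not to be"
        print(s.lower())                  # 'to be or not to be'
        print(s.startswith("To"))         # True
        print(s.count("be"))              # 2
        words = s.split()                 # ['To', 'be', 'or', 'not', 'to', 'be']
        print("-".join(words))            # 'To-be-or-not-to-be'
        print(sorted(set(w.lower() for w in words)))   # ['be', 'not', 'or', 'to']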
    • regular expressions in Python
      • a python regexp tutorial
      • to be able to use the regex module:
        1. in bash: pip3 install --user regex
        2. in python: import regex as re
        (Python has built-in regex support in the re module, but regex seems to be more powerful while using the same API.)
      • search, findall, sub
      • raw strings (r'...'), character classes ([[:alnum:]], \w, ...), flags (flags=re.I or r'(?i)...'), subexpressions r'(.) (...)' + backreferences r'\1 \2'
      • my regex ipython3 session from the lab (unfiltered)
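      • a few of these regex features in action (a minimal example; the same calls work with the regex module):
        import re
        text = "John loves Mary"
        print(re.findall(r'\w+', text))                              # ['John', 'loves', 'Mary']
        m = re.search(r'(\w+) loves (\w+)', text)
        print(m.group(1), m.group(2))                                # John Mary
        print(re.sub(r'(\w+) loves (\w+)', r'\2 loves \1', text))    # Mary loves John
        print(bool(re.search(r'john', text, flags=re.I)))            # True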
    • Homework 04: Redo hw01 in Python, implementing the targets t2 to t18 from the Exercises in one Python script called make.py, so that e.g. running python3 make.py t16 prints out the frequency list of letters in the skakalpes file; running your script with no parameters should invoke all the targets.
      Of course, do not take the task descriptions word for word, as they explicitly name Bash commands, while you have to use Python instead. E.g. for t2, you can use urllib.request.urlopen, which returns an object with many methods, including read() (you must first import urllib.request); a minimal sketch follows below. In t3, just print the text once (you don't have to implement less). For t4, look for decode()...
      Commit the HW to 2016-npfl092/hw04/
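      A minimal sketch of the download step mentioned above (the URL is only a placeholder):
        import urllib.request

        # t2-like step: download a file and decode the bytes into a str
        with urllib.request.urlopen("http://example.com/skakalpes.txt") as response:
            text = response.read().decode("utf-8")   # read() returns bytes
        print(text)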

    • Voluntary bonus task for bonus points: a simple lemmatizer of English
      • input: POS-tagged English text, one sentence per line, in the format word|tag word|tag word|tag (e.g. playing|VBG skipping|VBG deadliest|RBS)
      • output: a lemma for each word (e.g. play skip deadly)
      • The tags follow the Penn Treebank Tagset
      • You should try to lemmatize at least words with the following tags: NNS NNPS, VBG VBD VBN VBZ, JJR JJS, RBR RBS
      • Your solution does not have to be perfect (that could take years to develop), but it should try to handle some of the more regular changes.
      • Commit the task to 2016-npfl092/bonus01/
  7. Python: modules, packages, classes
    • Specification: implement a simple Czech POS tagger in Python, choose any approach you want, required precision at least 50%
      • Tagger input format - data encoded in iso-8859-2 in a simple line-oriented plain-text format: empty lines separate sentences, non-empty lines contain word forms in the first column and a simplified (one-letter) POS tag in the second column, such as N for nouns or A for adjectives (you can look at the tagset documentation). Columns are separated by tabs.
      • Tagger output format: empty lines not changed, nonempty lines enriched with a third column containing the predicted POS for each line
      • Training data: tagger-devel.tsv
      • Evaluation data: tagger-eval.tsv (to be used only for evaluation!!!)
      • Performance evaluation (precision=correct/total): eval-tagger.sh
           cat tagger-eval.tsv | ./my_tagger.py | ./eval-tagger.sh
      • Example baseline solution - everything is a noun, precision 34%:
          cat tagger-eval.tsv | perl -ne 'chomp;if($_){print "$_\tN\n"}' | ./eval-tagger.sh
          prec=897/2618=0.342627960275019        
    • Homework HW05: a simple POS tagger, this time OO solution
      • turn your warm-up exercise solution into an OO solution:
        • implement a class Tagger
        • the tagger class has a method tagger.see(word,pos) which expects a word-pos instance from the training data
        • the tagger class has a method tagger.train() that infers a model (if needed)
        • the tagger class has a method tagger.save(filename) that saves the model to a file (see the pickle sketch below)
        • the tagger class has a method tagger.load(filename) that loads the model from a file
        • the tagger class has a method tagger.predict(word) that predicts a POS tag for a word given the trained model
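        • the save/load pair can be implemented e.g. with the pickle module (a minimal sketch, assuming the model is a plain dict as in the example further below):
          import pickle

          def save(self, filename):
              with open(filename, "wb") as f:       # pickle writes bytes
                  pickle.dump(self.model, f)

          def load(self, filename):
              with open(filename, "rb") as f:
                  self.model = pickle.load(f)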
      • the tagger should be usable as a Python module:
        • e.g. if your Tagger class resides in my_tagger_class.py, you should be able to use it in another script (e.g. calling_my_tagger.py) by importing it (from my_tagger_class import Tagger); note that module names containing hyphens cannot be imported
        • one option of achieving this is by having just the Tagger class in the script, with no code outside of the class (you then need another script to use your tagger)
        • another option is to wrap any code outside the class into an if __name__ == "__main__": block, which is executed only if the script is run directly, not when it is imported into another script:
          # This is the Tagger class, which will be imported when you "import Tagger"
          class Tagger:
              model = dict()
              def see(self, word, pos):
                  self.model[word] = pos
          
          # This code is only executed when you run the script directly, e.g. "python3 my-tagger-class.py"
          if __name__ == "__main__":
              t = Tagger()
              t.see("big", "A")
      • wrap your solution into a Makefile with the following targets:
        • download - downloads the data
        • train - trains a tagging model given the training file and stores it into a file
        • predict - appends the column with predicted POS into the test file
        • eval - prints the accuracy
      • Commit your solution into 2016-npfl092/hw05/
  8. Introduction to XML
    • warm-up exercise: try to automatically find some reflexive verb forms in Czech (or any other language that has reflexive verbs), i.e., verb forms that frequently appear in a sentence together with the reflexive pronouns 'se' or 'si'. You may use any raw Czech text data, e.g. this book available from Project Gutenberg.
    • Motivation for XML, basics of XML syntax, examples, well-formedness/validity, dtd, xmllint
    • Slides
    • samples of linguistic data in XML (VALLEX, PDT 2.0 sample)
    • XML exercise: create an XML file representing some linguistic structures (your choice) manually in a text editor, or by a Python script. The file should contain at least 7 different elements, and some of them should have attributes. Create a DTD file and make sure that the XML file is valid w.r.t. the DTD file. Create a Makefile that has targets "wellformed" and "valid" and uses xmllint to check the file's well-formedness and its validity with respect to the DTD file.
    • Homework:
      • finish the exercise: XML+DTD files;
  9. XML, cont.
    • Exercise: For all files in sample.zip, check whether they are well-formed XML files (e.g. with xmllint), and if not, fix them (manually in a text editor, or any way you want).
    • Exercise: write a Python script that recognizes (at least some of) the well-formedness violations present in the above mentioned files, without using any specific library for XML processing
    • overview of Python modules for XML (DOM approach, SAX approach, ElementTree library); study materials: XML Chapter in the "Dive into Python 3" book, ElementTree module tutorial
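      A minimal ElementTree example (the element and attribute names are made up):
        import xml.etree.ElementTree as ET

        # build a tiny document: <sentence id="1"><word pos="NOUN">cat</word></sentence>
        sentence = ET.Element("sentence", id="1")
        word = ET.SubElement(sentence, "word", pos="NOUN")
        word.text = "cat"
        print(ET.tostring(sentence, encoding="unicode"))

        # parse a string and iterate over selected elements
        root = ET.fromstring('<sentence id="1"><word pos="NOUN">cat</word></sentence>')
        for w in root.iter("word"):
            print(w.get("pos"), w.text)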
    • Homework:
      • download a simplified file with Universal Dependencies trees dependency_trees_from_ud.tsv (note: simplification = some columns removed from the standard CoNLL-U format)
      • write a Python script that converts this data into a reasonably structured XML file
      • write a Python script that converts the XML file back into the original (tab-separated) format, and check that the output is identical to the original file
      • write a Python script that converts the XML file into a simply formatted HTML
      • organize it all in a Makefile with targets download, toxml, totsv, tohtml
      • commit your solution into 2016-npfl092/hw07/
  10. NLTK and other NLP frameworks
    • NLP frameworks
    • NLTK tutorial
    • Homework:
      • train and evaluate a Czech part-of-speech tagger in NLTK
      • use any of the trainable taggers available in NLTK (tnt looked quite promising; see the sketch below); you can experiment with multiple taggers, settings, and improvements to achieve good accuracy (this is not required and there is no minimum accuracy you must achieve, but you can get bonus points; still, your result should not be something obviously wrong, such as 20% accuracy)
      • use the data from the previous tagging homework: tagger-devel.tsv as training data, tagger-eval.tsv as evaluation data
      • note that you have to convert the input data appropriately into a format which is expected by the tagger
      • commit your solution into 2016-npfl092/hw08/
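      • a minimal sketch of the TnT part (the inline data is only an illustration; in your solution, load the sentences from the .tsv files):
        from nltk.tag import tnt

        # each sentence is a list of (word, tag) pairs
        train_sents = [[("Pes", "N"), ("štěká", "V"), (".", "Z")]]
        test_sents = [[("Kočka", "N"), ("mňouká", "V"), (".", "Z")]]

        tagger = tnt.TnT()
        tagger.train(train_sents)
        print(tagger.evaluate(test_sents))   # called tagger.accuracy() in newer NLTK versions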
  11. NLTK and other NLP frameworks, vol 2
    • warmup: once again processing Genesis, this time in NLTK:
      • read in the text of the first chapter of Genesis
      • use NLTK to split the text into sentences, split the sentences into tokens, and tag the tokens for part-of-speech
      • print out the output as TSV, one token per line, wordform POStag separated by a tab, with an empty line separating sentences
      • sample solutions: v1, v2, v3
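      • the core NLTK calls (a minimal sketch; you may first need nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')):
        import nltk

        text = "In the beginning God created the heaven and the earth. And the earth was without form."
        for sentence in nltk.sent_tokenize(text):
            for form, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
                print(form, tag, sep="\t")
            print()   # empty line between sentences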
    • named entities in NLTK
    • tree structure and visualization in NLTK
    • parsing in UDPipe
    • Voluntary bonus task for bonus points: conversion from UDPipe CoNLL-U outputs to NLTK Tree() structures
      • Input: a tokenized sentence in the "horizontal" format, e.g. "A cat sat on a mat ." (you can choose any language supported by UDPipe, or even make this configurable if you really want to, but your code has to work for any sentence in the language, not just one example sentence)
      • Process with the UDPipe pipeline, convert the resulting TSV output to a dependency tree in NLTK notation, e.g. tree = Tree('sat', [Tree('cat', ['A']), Tree('mat', ['on', 'a']), '.']) (if you want to, you can represent the leaves as trees with the leaf word as root and an empty list of children: Tree('on', []); this may actually make the task easier to solve, as you can first create a list containing one Tree for each token, with the token form as the root and an empty list of children, and then append each token to its parent tree)
      • Output: show the dependency tree using tree.draw()
      • Create a Makefile with a show target that runs your script on one example sentence
      • Commit the solution to 2016-npfl092/bonus02/
  12. Selected good practices in software development (not only in NLP, not only in Python)
    • warm-up exercise: find English word groups in which the words are derived from one another, such as interest-interesting-interestingly; use the list of the 10,000 most frequent English lemmas bnc_freq_10000.txt
    • good development practices - slides (testing, benchmarking, profiling, code reviewing, bug reporting)
    • exercise:
      • exchange solutions of HW05 with one of your colleagues
      • implement unit tests (using unittest) for his/her solution (a minimal sketch follows below)
      • if you find some problems, send him/her a bug report
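      • a minimal unittest sketch (it assumes the colleague's class can be imported as "from tagger import Tagger" -- adjust to the actual module name):
        import unittest
        from tagger import Tagger

        class TestTagger(unittest.TestCase):
            def test_known_word(self):
                t = Tagger()
                t.see("pes", "N")
                t.train()
                self.assertEqual(t.predict("pes"), "N")

        if __name__ == "__main__":
            unittest.main()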





    The future is under construction!!!


  13. Data visualization

    Morning warm-up exercise: (1) make a frequency list of the HTML tags of this web page, (2) supposing the page is well-formed XML, write a converter that transforms its content into simply formatted plain text (such as \n and several spaces, with * in front of every list item). You can use any standard technique for processing XML files (Twig, SAX, XPath...).

    Slides

    • gnuplot
    • dot/graphviz
    • figures/tables for latex
    • Homework: ACL-style article draft containing a learning curve of your tagger (or of any other trainable tool). Create a Makefile that
      1. applies your previously created POS tagger to gradually increasing training data (or applies any other tool for which you can measure a quantity that depends on the input data size) and evaluates it in each iteration (always on the same test data). It is recommended to use an exponentially growing training data size (e.g. 100 words, 1kW, 10kW, 100kW ...). You can use any other trainable NLP tool (but not tools developed by your colleagues in the course). The simplest acceptable solution is a tool measuring OOV (out-of-vocabulary rate -- how many words in the test data have not been seen in the training data).
      2. collects the learning curve statistics from the individual iterations and converts them to a LaTeX table as well as to a graphical form: data size (preferably in log scale) on the horizontal axis, and tool performance on the vertical axis. Use gnuplot for the latter task.
      3. downloads the LaTeX article style for ACL 2011 conference papers and compiles your article into PDF. Create a simple LaTeX article using this style, include the generated table and figure in it, and fill in the table's and figure's captions (the text in the rest of the article is not important).
      Commit the homework into 2016-npfl092/hw08/. Make sure that the Makefile performs all the steps correctly on a fresh checkout of the directory. Deadline: 16th January 2016, 12:00.
  14. Data and Software licensing
    • morning exercise: theater of the absurd is a form of drama; one of its characteristics is the use of repetitive dialogue, sometimes with utterances swapped between two or more characters. Task: find occurrences of swapped utterances in Václav Havel's play Zahradní slavnost (The Garden Party), and print out whose utterances were repeated by whom.
    • Licenses
      • authors' rights in the Czech Republic, slides authors_rights_intro.pdf
      • open source movement
      • GPL, Artistic license
      • Creative Commons (mainly CC0 and Attribution) and Open Data Commons: http://www.opendatacommons.org/
      • Licenses for PDT and CNK
      • data distributors, ELRA/ELDA, LDC, currently emerging networks
    • Checking all your homework tasks.
    • Premium task (T.B.A.)
  15. Final written test

Required work

Rules for homework

Premium tasks

Rules for the final test

Determination of the final grade