NPFL092 Technology for NLP (Natural Language Processing)

Course schedule overview

  1. Introduction, Survival in Linux, Intro to bash
  2. Character encoding, Mastering your text editor
  3. Text-processing commands in bash
  4. Introduction to Perl
  5. Perl, cont.
  6. Perl, cont.
  7. 'Musts' in software development
  8. Introduction to XML
  9. XML, cont.
  10. Data visualization
  11. Data and Software licensing

More detailed course schedule

  1. Introduction
    • slides
    • Motivation
    • Course requirements: MFF linux lab account
    • Course plan, overview of required work, assignment requirements
    Survival in Linux:
    • keyboard shortcuts in KDE/GNOME, selected e.g. from here
    • motivation for scripting, command line features (completion, history...), keyboard shortcuts
    • bash in a nutshell (ls (-l,-a,-1,-R), cd, pwd, cp (-R), mv, rm (-r, -f), mkdir (-p), rmdir, chmod, ssh (-XY), less, cat, ln (-s), .bashrc, ...), other shells
    • exercise: playing with text files (udhr.zip)
    • Supplementary reading
  2. Character encoding (very short)
    • ascii, 8-bits, unicode, conversions, locales (LC_*)
    • slides
    • Questions: answer the following questions:
      • What is ASCII?
      • What 8-bit encodings do you know for Czech or for your native language? How do they differ from ASCII?
      • What is Unicode?
      • What Unicode encodings do you know?
      • What is the relation between UTF-8 and ASCII?
      • Take a sample of Czech text (containing some diacritics), store it into a plain text file and convert it (by iconv) to at least two different 8-bit encodings, and to utf-8 and utf-16. Explain the differences in file sizes.
      • How can you detect file encoding?
      • Store any Czech web page into a file, change the file encoding and the encoding specified in the file header, and find out how it is displayed in your browser if the two encodings differ.
      • How do you specify file encoding when storing a plain text or a source code in your favourite text editor?
    Mastering your text editor
    • requirements on a modern source-code editor
      1. modes (progr. languages, xml, html...)
      2. syntax highlighting
      3. completion
      4. indentation
      5. support for encodings (utf-8)
      6. integration with compiler...
    • fallback mode for working in a text console
    • overview of editors in linux
    • detailed description of a selected editor (for those who do not master any), list of most important shortcuts
      • for emacs: here, .emacs
      • for vim: run the vimtutor command to go through an introductory tutorial on using vim (vimtutor english runs the English version of the tutorial) (boring for those who already know or use vi, too long for 45 minutes)
    • Homework: make sure you know how to invoke all the mentioned features in your favourite text editor
  3. Text-processing commands in bash
    • sort, uniq, cat, cut, [e]grep, sed, head, tail, rev, diff, patch, set, pipelines, man...
    • regular expressions
    • exercises
    • Homework: read Unix for Poets by Kenneth Ward Church
    Bash scripting
    • if, while, for
    • xargs : Compare
      sed 's/:/\n/g' <<< "$PATH" | \
      grep $USER | \
      while read path ; do
        ls $path
      done
      with
      sed 's/:/\n/g' <<< "$PATH" | \
      grep $USER | \
      xargs ls

    Shell script and a patch to show the changes we made; just run

    patch -p0 < script.sh

    You will be sent an e-mail. Your user name for the SVN is the name in the Subject: field, and your password is in the e-mail.

    Makefiles

    Subversion
    • Homework: Write your Makefile with targets t2 to t18 from the Exercises. Commit the HW to
      https://svn.ms.mff.cuni.cz/svn/undergrads/students/<your-login>/2016-npfl092/hw01/

      Results in ODS and in HTML
      Deadline: Thursday 3rd November 2016 13:00
    • Additional Homework: Write a script that can sum a given column (for example ls -l | ./sum 5 will sum the sizes of files). If no column is specified, sum the first one. Whitespace can precede the first column, and both integers and floats can appear in the column (a possible sketch follows below).
      Submit to 2016-npfl092/ahw01/
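      One possible approach, sketched here in Python (a bash solution is equally fine; the stdin-based interface and the silent skipping of non-numeric cells are assumptions, not a prescribed solution):

        #!/usr/bin/env python3
        # a hypothetical "sum" script: sum one column of whitespace-separated input
        import sys

        column = int(sys.argv[1]) if len(sys.argv) > 1 else 1   # columns are 1-based
        total = 0.0
        for line in sys.stdin:
            fields = line.split()               # split() also skips leading whitespace
            if len(fields) >= column:
                try:
                    total += float(fields[column - 1])
                except ValueError:
                    pass                        # ignore non-numeric cells (e.g. "total" lines)
        print(total)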

  4. Introduction to Python
    • Study the Python Tutorial as homework
    • To solve practical tasks, Google is your friend…
    • By default, we will use Python version 3:
      python3
      A day may come when you will need to use Python 2, so please note that there are some differences between these two. (Also note that you may encounter code snippets in either Python 2 or Python 3…)
    • To work interactively with Python, use IPython:
      ipython3
      • to save the commands 5-10 from your IPython session to a file named mysession.py, run:
        %save mysession 5-10
      • to exit IPython, run:
        exit
    • For non-interactive work, use your favourite editor. (Rudolf uses vim, but has heard PyCharm is really good.)
    • First Python exercises (simple language modelling)
      A sample solution to exercises 1 to 13 can be found in solution_1.py
    • Homework: Implement at least two items from the bonus extensions (extension 1 is obligatory; the simplest to do are then probably 2 and 3, the rest may require more googling). You can get bonus points for implementing more of the extensions.
      Commit your homework to SVN; you should put it into:
      https://svn.ms.mff.cuni.cz/svn/undergrads/students/<your-login>/2016-npfl092/hw02/
      Deadline: Monday 7th November 2016 17:00
      Results in ODS and in HTML
    • If you need help, try (preferably in this order):
      1. Google
      2. Google
      3. Google
      4. asking at/after the next lab
      5. asking per e-mail (please send the e-mail to both of us, as this increases your chances of getting an early reply)
  5. Basic text processing in Python
    • a warm-up exercise: find palindrome words in English
      • A palindrome word reads the same forward and backward, e.g. "level"
      • Write a Python script that reads text from stdin and prints detected palindromes (one per line) to stdout (one possible sketch is shown after this list)
      • print only palindrome words longer than three letters
      • apply your script to the English translation of Homer's The Odyssey, available as a UTF-8 encoded Project Gutenberg ebook here.
      • a slightly more advanced extension (optional): try to find longer expressions that read the same in both directions after spaces are removed (two or more words; a contiguous segment of the input text, possibly crossing line boundaries)
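      One possible minimal sketch (assuming naive regex tokenization; the optional multi-word extension is not handled):

        #!/usr/bin/env python3
        # print palindrome words longer than three letters, each only once
        import re
        import sys

        seen = set()
        for line in sys.stdin:
            for word in re.findall(r"[a-z]+", line.lower()):
                if len(word) > 3 and word == word[::-1] and word not in seen:
                    seen.add(word)
                    print(word)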
    • encoding in Python
      • differences in handling of encoded data between Python 2 and Python 3
      • a simple rule: use Unicode everywhere, and if conversions from other encodings are needed, do them as close to the physical data as possible (i.e., the encoding should be handled already in the data reading/writing phase, not internally by decoding the content of variables)
      • example: f = open(fname, encoding="latin-1")
      • in Python 2: sys.stdout = codecs.getwriter('utf-8')(sys.stdout); in Python 3, wrap the underlying binary buffer instead, e.g. sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
      • more about the topic can be found here
    • Homework HW03: word frequency colorizer
      • write a Python script that reads some big text (e.g. the one from the morning exercise), tokenizes it, performs some trivial stemming (e.g. removing the most frequent inflectional and derivational suffixes like -ed or -ly), collects the numbers of occurrences of such stems, and generates an HTML file which contains e.g. the first 1000 words colorized according to their stem's frequency (e.g. three bands: green for very frequent words, yellow for the middle band, red for very rare words); a rough sketch of one possible approach is shown below
      • Commit your solution into 2016-npfl092/hw03/
      • Deadline: Monday 14th November 2016 17:00
      • Results in ODS and in HTML
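      A rough sketch of one possible approach (the stemming rule, the frequency thresholds and the HTML layout are only illustrative, not required):

        #!/usr/bin/env python3
        # colorize the first 1000 words by the frequency of their crude "stem"
        import re
        import sys
        from collections import Counter

        def stem(word):
            return re.sub(r"(ing|ed|ly|s)$", "", word)          # trivial suffix stripping

        words = re.findall(r"\w+", sys.stdin.read().lower())
        freq = Counter(stem(w) for w in words)

        def color(word):
            n = freq[stem(word)]
            return "green" if n > 100 else "yellow" if n > 10 else "red"

        spans = ['<span style="color:%s">%s</span>' % (color(w), w) for w in words[:1000]]
        print("<html><body><p>" + " ".join(spans) + "</p></body></html>")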
  6. Python: strings and regular expressions
    • Warmup: implement a simple wc-like tool in Python, so that running
      python3 wc.py textfile.txt
      will print out three numbers: the number of lines, words, and characters in the file (for words, you can simply use whitespace-delimited strings -- there is a string method that does just that...)
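      A minimal sketch (assuming the file is UTF-8 encoded):

        #!/usr/bin/env python3
        # print the number of lines, words and characters of the file given as the first argument
        import sys

        text = open(sys.argv[1], encoding="utf-8").read()
        print(len(text.splitlines()), len(text.split()), len(text))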
    • the string data type in Python
      • a tutorial
      • case changing (lower, upper, capitalize, title, swapcase)
      • is* tests (isupper, isalnum...)
      • matching substrings (find, startswith, endswith, count, replace)
      • split, splitlines, join
      • other useful methods (not necessarily for strings): dir, sorted, set
      • my string ipython3 session from the lab (unfiltered)
    • regular expressions in Python (a small example is shown after this list)
      • a python regexp tutorial
      • to be able to use the regex module:
        1. in bash: pip3 install --user regex
        2. in python: import regex as re
        (Python has built-in regex support in the re module, but regex seems to be more powerful while using the same API.)
      • search, findall, sub
      • raw strings (r'...'), character classes ([[:alnum:]], \w, ...), flags (flags=re.I or r'(?i)...'), subexpressions r'(.) (...)' + backreferences r'\1 \2'
      • my regex ipython3 session from the lab (unfiltered)
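      A small example of the calls mentioned above (regex and the built-in re behave the same here):

        import regex as re

        text = "Cats and dogs."
        re.search(r"(?i)cats", text)                       # case-insensitive match object
        re.findall(r"\w+", text)                           # ['Cats', 'and', 'dogs']
        re.sub(r"(\w+) and (\w+)", r"\2 and \1", text)     # 'dogs and Cats.'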
    • Homework 04: Redo hw01 in Python, implementing the targets t2 to t18 from the Exercises in one Python script called make.py, so that e.g. running python3 make.py t16 prints out the frequency list of letters in the skakalpes file; running your script with no parameters should invoke all the targets.
      Of course, do not take the tasks word for word, as they explicitly name Bash commands, while you have to use Python instead. E.g. for t2, you can use urllib.request.urlopen, which returns an object with many methods, including read() (you must first import urllib.request). In t3, just print the text once (you don't have to implement less). For t4, look for decode()... A possible skeleton for dispatching the targets in make.py is sketched below.
      Commit the HW to 2016-npfl092/hw04/
      Deadline: Monday 28th November 2016 17:00
      Results in ODS and in HTML
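      A hypothetical skeleton of make.py (only the dispatching is shown; the target bodies are left out):

        #!/usr/bin/env python3
        # one function per target, selected via a command-line argument
        import sys

        def t2():
            pass        # e.g. download the file with urllib.request.urlopen

        def t3():
            pass        # e.g. print the downloaded text

        TARGETS = {"t2": t2, "t3": t3}          # add t4 ... t18 in the same way

        if len(sys.argv) > 1:
            TARGETS[sys.argv[1]]()              # run only the requested target
        else:
            for target in TARGETS.values():     # no argument: run all targets in order
                target()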
    • Voluntary bonus task for bonus points: a simple lemmatizer of English
      • input: POS-tagged English text, one sentence per line, in the format word|tag word|tag word|tag (e.g. playing|VBG skipping|VBG deadliest|RBS)
      • output: a lemma for each word (e.g. play skip deadly)
      • The tags follow the Penn Treebank Tagset
      • You should try to lemmatize at least words with the following tags: NNS NNPS, VBG VBD VBN VBZ, JJR JJS, RBR RBS
      • Your solution does not have to be perfect (that could take years to develop), but it should try to handle some of the more regular changes.
      • Commit the task to 2016-npfl092/bonus01/
      • Deadline: Mon 28th Nov 2016 17:00
  7. Python: modules, packages, classes
    • Specification: implement a simple Czech POS tagger in Python, choose any approach you want, required precision at least 50%
      • Tagger input format - data encoded in iso-8859-2 in a simple line-oriented plain-text format: empty lines separate sentences, non-empty lines contain the word form in the first column and a simplified (one-letter) POS tag in the second column, such as N for nouns or A for adjectives (you can look at the tagset documentation). Columns are separated by tabs.
      • Tagger output format: empty lines not changed, nonempty lines enriched with a third column containing the predicted POS for each line
      • Training data: tagger-devel.tsv
      • Evaluation data: tagger-eval.tsv (to be used only for evaluation!!!)
      • Performance evaluation (precision=correct/total): eval-tagger.sh
          cat tagger-eval.tsv | ./my_tagger.py | ./eval-tagger.sh
      • Example baseline solution - everything is a noun, precision 34%:
          cat tagger-eval.tsv | perl -ne 'chomp;if($_){print "$_\tN\n"}'| eval-tagger.sh
          prec=897/2618=0.342627960275019        
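      • The same baseline as a Python 3 sketch (only an illustration, not the required solution; it handles the iso-8859-2 encoding of the data explicitly):
          import io, sys

          stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="iso-8859-2")
          stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="iso-8859-2")
          for line in stdin:
              line = line.rstrip("\n")
              stdout.write(line + "\tN\n" if line else "\n")   # append the "everything is a noun" column
          stdout.flush()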
    • Homework HW05: a simple POS tagger, this time OO solution
      • turn your warm-up exercise solution into an OO solution:
        • implement a class Tagger
        • the tagger class has a method tagger.see(word,pos) which expects a word-pos instance from the training data
        • the tagger class has a method tagger.train() that infers a model (if needed)
        • the tagger class has a method tagger.save(filename) that saves the model to a file
        • the tagger class has a method tagger.load(filename) that loads the model from a file
        • the tagger class has a method tagger.predict(word) that predicts a POS tag for a word given the trained model
      • the tagger should be usable as a Python module:
        • e.g. if your Tagger class resides in my_tagger_class.py, you should be able to use it in another script (e.g. calling_my_tagger.py) by importing it (from my_tagger_class import Tagger); note that Python module names cannot contain hyphens, so use underscores in the file name
        • one option of achieving this is by having just the Tagger class in the script, with no code outside of the class (you then need another script to use your tagger)
        • another option is to wrap any code which is outside the class into the __name__ == "__main__" block, which is executed only if the script is run directly, not when it is imported into another script:
          # This is the Tagger class, which is what gets imported by
          # "from my_tagger_class import Tagger"
          class Tagger:
              def __init__(self):
                  # an instance attribute, so each Tagger object gets its own model
                  self.model = dict()
              def see(self, word, pos):
                  self.model[word] = pos

          # This code is only executed when you run the script directly,
          # e.g. "python3 my_tagger_class.py"
          if __name__ == "__main__":
              t = Tagger()
              t.see("big", "A")
      • wrap your solution into a Makefile with the following targets:
        • download - downloads the data
        • train - trains a tagging model given the training file and stores it into a file
        • predict - appends the column with predicted POS into the test file
        • eval - prints the accuracy
      • Commit your solution into 2016-npfl092/hw05/
      • Deadline: 5 December 2016, 17:00
      • Results in ODS and in HTML
  8. Introduction to XML
    • warm-up exercise: try to automatically find some reflexive verb forms in Czech (or any other language that has reflexive verbs), i.e., verb forms that frequently appear in a sentence together with the reflexive pronouns 'se' or 'si'. You may use any Czech raw text data, e.g. this book available from Project Gutenberg.
    • Motivation for XML, basics of XML syntax, examples, well-formedness/validity, dtd, xmllint
    • Slides
    • samples of linguistic data in XML (VALLEX, PDT 2.0 sample)
    • XML exercise: create an XML file representing some linguistic structures (your choice) manually in a text editor, or by a Python script. The file should contain at least 7 different elements, and some of them should have attributes. Create a DTD file and make sure that the XML file is valid w.r.t. the DTD file. Create a Makefile that has targets "wellformed" and "valid" and uses xmllint to check the file's well-formedness and its validity with respect to the DTD file.
    • Homework:
      • finish the exercise: XML+DTD files; commit them into 2016-npfl092/hw06/. Deadline: 12th December 2016, 17:00
      • Results in ODS and in HTML
  9. XML, cont.
    • Exercise: For all files in sample.zip, check whether they are well-formed XML files or not (e.g. by xmllint), and if not, fix them (possibly manually in a text editor, or any way you want).
    • Exercise: write a Python script that recognizes (at least some of) the well-formedness violations present in the above mentioned files, without using any specific library for XML processing
    • overview of Python modules for XML (DOM approach, SAX approach, ElementTree library); study materials: XML Chapter in the "Dive into Python 3" book, ElementTree module tutorial
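    • a tiny ElementTree example, as a sketch (the file name example.xml is just a placeholder):
        import xml.etree.ElementTree as ET

        tree = ET.parse("example.xml")
        root = tree.getroot()
        for element in root.iter():              # walk all elements in document order
            print(element.tag, element.attrib, (element.text or "").strip())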
    • Homework:
      • download a simplified file with Universal Dependencies trees dependency_trees_from_ud.tsv (note: simplification = some columns removed from the standard conllu format)
      • write a Python script that converts this data into a reasonably structured XML file
      • write a Python script that converts the XML file back into the original (tab-separated) format, check the identity of the output with the original file
      • write a Python script that converts the XML file into a simply formatted HTML
      • organize it all in a Makefile with targets download, toxml, totsv, tohtml
      • commit your solution into 2016-npfl092/hw07/
      • deadline: 2nd January, 17:00
      • Results in ODS and in HTML
  10. NLTK and other NLP frameworks
    • NLP frameworks
    • NLTK tutorial
    • Homework:
      • train and evaluate a Czech part-of-speech tagger in NLTK
      • use any of the trainable taggers available in NLTK (tnt looked quite promising); you can experiment with multiple taggers, settings and improvements to achieve a good accuracy (this is not required and there is no minimum accuracy you must achieve, but you can get bonus points; still, your result should not be something obviously wrong, such as 20% accuracy)
      • use the data from the previous tagging homework: tagger-devel.tsv as training data, tagger-eval.tsv as evaluation data
      • note that you have to convert the input data appropriately into the format which is expected by the tagger (see the sketch after this list)
      • commit your solution into 2016-npfl092/hw08/
      • Deadline: 16th January, 17:00
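      One possible sketch using NLTK's TnT tagger (assuming the same iso-8859-2 encoding and word<TAB>tag format as in the earlier tagging homework; this is only an illustration, not the required solution):

        from nltk.tag import tnt

        def read_tsv(path):
            """Read word<TAB>tag lines into a list of sentences of (word, tag) pairs."""
            sentences, sentence = [], []
            for line in open(path, encoding="iso-8859-2"):
                line = line.rstrip("\n")
                if line:
                    word, tag = line.split("\t")[:2]
                    sentence.append((word, tag))
                elif sentence:
                    sentences.append(sentence)
                    sentence = []
            if sentence:
                sentences.append(sentence)
            return sentences

        tagger = tnt.TnT()
        tagger.train(read_tsv("tagger-devel.tsv"))
        print(tagger.evaluate(read_tsv("tagger-eval.tsv")))   # newer NLTK versions call this .accuracy()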
  11. NLTK and other NLP frameworks, vol 2
    • warmup: once again processing Genesis, this time in NLTK:
      • read in the text of the first chapter of Genesis
      • use NLTK to split the text into sentences, split the sentences into tokens, and tag the tokens for part-of-speech
      • print out the output as TSV, one token per line, with the word form and POS tag separated by a tab, and an empty line separating sentences (one possible sketch is shown after this list)
      • sample solutions: v1, v2, v3
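      One possible sketch (reading the text from stdin; the linked sample solutions may differ; nltk.sent_tokenize, nltk.word_tokenize and nltk.pos_tag need the corresponding NLTK models to be downloaded first):

        import sys
        import nltk

        text = sys.stdin.read()
        for sentence in nltk.sent_tokenize(text):
            for token, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
                print(token + "\t" + tag)
            print()                              # empty line between sentences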
    • named entities in NLTK
    • tree structure and visualization in NLTK
    • parsing in UDPipe
    • Voluntary bonus task for bonus points: conversion from Udapi CoNLL-U outputs to NLTK Tree() structures
      • Input: a tokenized sentence in the "horizontal" format, e.g. "A cat sat on a mat ." (you can choose any language supported by UDPipe, or even make this configurable if you really want to, but your code has to work for any sentence in the language, not just one example sentence)
      • Process the sentence with the UDPipe pipeline and convert the resulting TSV output to a dependency tree in NLTK notation, e.g. tree = Tree('sat', [Tree('cat', ['A']), Tree('mat', ['on', 'a']), '.']). If you want to, you can represent the leaves as trees with the leaf word as root and an empty list of children: Tree('on', []); this may actually make the task easier to solve, as you can first create a list containing one Tree for each token (with the token form as the root and an empty list of children) and then append each token to its parent tree, as sketched after this task.
      • Output: show the dependency tree using tree.draw()
      • Create a Makefile with a show target that runs your script on one example sentence
      • Commit the solution to 2016-npfl092/bonus02/
      • Deadline: Mon 16th Jan 2017 17:00
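      A minimal sketch of the tree-building step (the parsed sentence is hard-coded here as hypothetical (id, form, head) triples; in your solution these would come from the UDPipe output):

        from nltk import Tree

        parsed = [(1, "A", 2), (2, "cat", 3), (3, "sat", 0), (4, "on", 6),
                  (5, "a", 6), (6, "mat", 3), (7, ".", 3)]     # head 0 marks the root

        nodes = {tid: Tree(form, []) for tid, form, head in parsed}   # one Tree per token

        root = None
        for tid, form, head in parsed:
            if head == 0:
                root = nodes[tid]
            else:
                nodes[head].append(nodes[tid])                 # attach the token to its parent

        root.draw()                                            # show the dependency tree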
  12. Selected good practices in software development (not only in NLP, not only in Python)
    • warm-up exercise: find English word groups in which the words are derived one from the other, such as interest-interesting-interestingly; use the list of 10,000 most frequent English lemmas bnc_freq_10000.txt
    • good development practices - slides (testing, benchmarking, profiling, code reviewing, bug reporting)
    • exercise:
      • exchange solutions of HW05 with one of your colleagues
      • implement unit tests (using unittest) of his/her solution (a minimal example is sketched after this list)
      • if you find some problems, send him/her a bugreport
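      A minimal example of such a test (the module name my_tagger_class and the expected behaviour are assumptions; adapt them to your colleague's code):

        import unittest
        from my_tagger_class import Tagger

        class TestTagger(unittest.TestCase):
            def test_predict_seen_word(self):
                tagger = Tagger()
                tagger.see("big", "A")
                tagger.train()
                self.assertEqual(tagger.predict("big"), "A")

        if __name__ == "__main__":
            unittest.main()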



  13. The future is under construction!!!
  14. Perl, cont.
    • warm-up exercise: write a Perl script that analyzes your daily svn activity. It reads the output from svn log and counts the number of commits per hour from 00:00-00:59 to 23:00-23:59, along the whole history of your personal directory in the repository undergrads (regardless of the date, i.e. the output should have 24 lines)
    • slightly more advanced topics in Perl (map,grep,sort,references,locale,POD,packages), slides npfl092_perl_cont.pdf
    • premium exercise (to be announced, worth the same number of points as one homework!)
    • Homework:
      • finish all exercises from the morning lab (the svn-log exercise and two exercises from the slides) and commit them as svn_traffic.pl, ex1.pl, and ex2.pl into 2016-npfl092/hw03/
      • read the first three chapters in Damian Conway's Perl Best Practices
      • create a simple Perl module
        • the module contains a function that expects an array of words as its argument and returns an array of their (guessed) part-of-speech tags, such as N for nouns, A for adjectives, etc.
        • you can choose any language and any tagset
        • the actual solution can be quite stupid (e.g. just a few regular expressions for typical word endings)
        • the module must contain POD
        • create a script test.pl for testing the module's functionality (for more advanced Perl programmers: use Test::More)
        • commit both files into 2016-npfl092/hw03/ in your directory in the undergrads repository, deadline: 12:00 noon on Monday NOV 30, 2016
  15. Perl, cont. 2
    • Morning exercise: find a piece of Perl code on the web (at least 20 LOCs) and try to find out whether (and how) it violates the recommendations from the PBP book. Could you improve the code? How?
    • Slides
    • see App::cpanminus - a zero-configuration, dependency-free installer of CPAN modules
    • Introduction to object-oriented programming in Perl, slides
    • a short glimpse at selected OO API for processing linguistic data
      1. btred macros, tutorial
      2. tectomt blocks, sample block
    • homework
      • explore NLP-oriented modules at CPAN
      • install Moose from CPAN
    • further reading (optional, but recommended):
  16. Selected good practices in software development (not only in NLP, not only in Perl)
    • warm-up exercise: find English word groups in which the words are derived one from the other, such as interest-interesting-interestingly; use the list of 10,000 most frequent English lemmas bnc_freq_10000.txt
    • good development practices - slides (testing, benchmarking, profiling, code reviewing, bug reporting)
    • premium task (to be announced; worth the same number of points as one homework!)
    • code reviewing
    • Homework (Deadline: 4th January 2016, 12:00 ):
      • finish the morning exercise, commit it as 2016-npfl092/hw04/morning.pl
      • code reviewing
        • review the code of the HW3 module of your colleague, commit the reviewed code into 2016-npfl092/hw04/codereview/, and commit the review summary as 2016-npfl092/hw04/codereview/summary.txt
        • write tests for this module using Test::More, commit them into the directory 2016-npfl092/hw04/codereview/t/
        • if you find a bug in it, write a bug report, send it to the module author and commit it as 2016-npfl092/hw04/codereview/bugreport.txt
      • Moose
        • create a Moose-based class for a linguistically oriented data structure of your choice (really anything, but related to language), and write a test that demonstrates its usage
        • commit the files into 2016-npfl092/hw04/moose/
    • Additional homework ahw04: Download an ispell dictionary here. Extract the file english.0 from it. Write a Perl program that will list all the words in the file that are substrings of any other word in the list. A word is not considered a substring if
      1. The subword equals the word
      2. The word equals the subword plus "s"
      3. The word equals the subword plus "'s".
      The program should finish in a reasonable time, i.e. less than a minute.
    • further reading (optional, but recommended):
      1. Benchmarking Perl by brian d foy
      2. Solving any Perl problem by brian d foy
      3. continue reading the Perl Best Practices book
  17. XML, cont.
    • namespaces, xsl, xpath, DOM/SAX, LibXML, xsh, xml in NLP (pml, tigerxml..., standoff)
    • Slides
    • Exercise: file-format conversion from a proprietary format into XML (input: sample of morphologically tagged data in a line oriented format sample0.txt)
    • Homework: Write a program to convert your XML data from the morning exercise back to its original format. Write a program that transforms your XML data from the previous homework to HTML. Organize it all in Makefile that has three targets: txt2xml, xml2txt, xml2html. Commit into hw07/, deadline: 11th January 2016, 12:00
  18. Data visualization

    Morning warm-up exercise: (1) make a frequency list of the HTML tags on this web page, (2) supposing the page is well-formed XML, write a converter that transforms its content into simply formatted plain text (such as \n, several spaces and * in front of every list item). You can use any standard technique for processing XML files (Twig, SAX, XPath...).

    Slides

    • gnuplot
    • dot/graphviz
    • figures/tables for latex
    • Homework: ACL-style article draft containing a learning curve of your tagger (or of any other trainable tool). Create a Makefile that
      1. applies your previously created POS tagger to gradually increasing training data (or any other tool for which a quantity dependent on the input data size can be measured) and evaluates it in each iteration (always on the same test data). It is recommended to use exponentially growing sizes of the training data (e.g. 100 words, 1kW, 10kW, 100kW ...). You can use any other trainable NLP tool (but not tools developed by your colleagues in the course). The simplest acceptable solution is a tool measuring OOV (out-of-vocabulary rate - how many words in the test data have not been seen in the training data).
      2. collects the learning curve statistics from the individual iterations and converts them to a LaTeX table as well as to a graphical form: data size (preferably in log scale) on the horizontal axis, and tool performance on the vertical axis. Use gnuplot for the latter task.
      3. downloads the LaTeX article style for ACL 2011 conference papers and compiles your article into PDF. Create a simple LaTeX article using this style, include the generated table and figure into it, and fill in the table's and figure's captions (the text in the rest of the article is not important).
      Commit the homework into 2016-npfl092/hw08/. Make sure that the Makefile performs all the steps correctly on a fresh checkout of the directory. Deadline: 16th January 2016, 12:00.
  19. Data and Software licensing
    • morning exercise: theater of the absurd is a form of drama; one of its characteristics is the use of repetitive dialogue, sometimes with utterances swapped between two or more actors. Task: find occurrences of swapped utterances in Václav Havel's play Zahradní slavnost (The Garden Party), and print out whose lines were repeated by whom.
    • Licenses
      • authors' rights in the Czech Republic, slides authors_rights_intro.pdf
      • open source movement
      • GPL, Artistic license
      • Creative Commons (mainly CC0 and Attribution) and Open Data Commons: http://www.opendatacommons.org/
      • Licenses for PDT, CNK
      • data distributors, ELRA/ELDA, LDC, currently emerging networks
    • Checking all your homework tasks.
    • Premium task (T.B.A.)
  20. Final written test

Required work

Rules for homework

Premium tasks

Rules for the final test

Determination of the final grade