NPFL092 – Technology for NLP (Natural Language Processing)

About

The aim of the course is to get students familiar with basic software tools used in natural language processing.

SIS code: NPFL092
Semester: winter
E-credits: 5
Examination: 1/2 MC (KZ)

Teachers

Whenever you have a question or need some help (and Googling does not work), contact us as soon as possible! Please always e-mail both of us.

Classes

  • the classes combine lectures and practicals
  • in 2018/2019, the classes are held on Wednesday in SU2, 14:00 - 16:15

Requirements

To pass the course, you will need to submit homework assignments and do a written test. See Grading for more details.

Classes

1. Introduction; Survival in Linux, Bash UNIX (Czech) Bash (English) hw_ssh Questions

2. Encoding; Editors Encoding Python in Atom hw_editors Questions

3. Git Git intro tryGit Branching hw_git Questions

4. Bash Unix for Poets hw_makefile Questions

5. Python, basic manipulation with strings Python tutorial hw_string1 Questions

6. Python: string manipulation cont. Strings Unicode Text files Regexes hw_string2 Questions

7. Python: modules, packages, classes Reading hw_tagger Questions

8. NLTK and other NLP frameworks hw_nltk Questions

9. A gentle introduction to XML XML hw_xml Questions

10. XML & JSON XML+ XML&JSON hw_xml2json

11. Selected good practices in software development (not only in NLP, not only in Python) Good practices Questions

12. OVERFLOW BUFFER

13. Final test Questions


Legend: Slides Video Homework assignment Additional reading Test questions


1. Introduction; Survival in Linux, Bash

 Oct 3

xkcd comics

  • Introduction

    • Motivation
    • Course requirements: MFF linux lab account
    • Course plan, overview of required work, assignment requirements
  • keyboard shortcuts in KDE/GNOME, selected e.g. from here

  • motivation for scripting, command line features (completion, history...), keyboard shortcuts

  • bash in a nutshell

    • ls (-l,-a,-1,-R), cd, pwd
    • cp (-R), mv, rm (-r, -f), mkdir (-p), rmdir, ln (-s)
    • file, cat, less, head, tail
    • chmod, wget, ssh (-XY), .bashrc, man...
  • exercise: playing with text files udhr.zip, also available for download at bit.ly/2hQQeTH

  • remote access to a unix machine: SSH (Secure Shell)

    • you can access a lab computer e.g. by opening a unix terminal and typing:

      ssh yourlogin@u-pl17.ms.mff.cuni.cz
      

      (replace yourlogin with your login into the lab and type your lab password when asked for it; instead of 17 you can use any number between 1 and something like 30 — it is the number of the computer in the central lab that you are connecting to)

    • your home is shared across all the lab computers in all the MS labs (SU1, SU2, Rotunda), i.e. you will see your files everywhere

    • you can ssh even from non-unix machines

      • on Windows, you can use e.g. the Putty software
      • on any computer with the Chrome browser, you can use the Secure Shell extension (and there are similar extensions for other browsers as well) which allows you to open a remote terminal in a browser tab — this is probably the most comfortable way
      • on an Android device, you can use e.g. JuiceSSH

UNIX (Czech) Bash (English)

hw_ssh Questions

2. Encoding; Editors

 Oct 10

Character encoding

Encoding

xkcd comics

  • ASCII, 8-bit encodings, Unicode, conversions, locales (LC_*)
  • Questions: answer the following:
    • What is ASCII?
    • What 8-bit encoding do you know for Czech or for your native language? How do they differ from ASCII?
    • What is Unicode?
    • What Unicode encodings do you know?
    • What is the relation between UTF-8 and ASCII?
    • Take a sample of Czech text (containing some diacritics), store it into a plain text file and convert it (by iconv) to at least two different 8-bit encodings, and to utf-8 and utf-16. Explain the differences in file sizes.
    • How can you detect file encoding?
    • Store any Czech web page into a file, change the file encoding and the encoding specified in the file header, and find out how it is displayed in your browser if the two encodings differ.
    • How do you specify file encoding when storing a plain text or a source code in your favourite text editor?
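  • for illustration, a minimal sketch of such a conversion done in Python instead of iconv (the file names are made up):

    # read a file in one 8-bit encoding, write it back as UTF-8
    with open("input-il2.txt", encoding="iso-8859-2") as fin:
        text = fin.read()
    with open("output-utf8.txt", "w", encoding="utf-8") as fout:
        fout.write(text)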

Mastering your text editor

Python in Atom

xkcd comics

  • requirements on a modern source-code editor

    1. modes (programming languages, xml, html...)
    2. syntax highlighting
    3. completion
    4. indentation
    5. support for encodings (utf-8)
    6. integration with compiler...
  • fallback mode for working in a text console

  • you can use any editor you like, as long as it has the capabilities listed above and you know how to use them

  • if you don't have a favourite Linux editor yet, we suggest e.g. atom (demonstration of atom in the class); Atom is installed in the labs, and is cross-platform, i.e. you can also use it on Windows and Mac

  • for a text-mode editor (without a graphical user interface, e.g. for working through ssh), we suggest nano

  • other good editors include e.g. Sublime (cross-platform); for Windows, e.g. Notepad++ and PSPad are good

  • for using emacs (if you really want to): look here

  • for using vim (if you really want to): run the vimtutor command to go through an introductory tutorial of using vim (vimtutor english to run the English version of the tutorial)

hw_editors Questions

3. Git

 Oct 17 Git intro tryGit

These are instructions for using Git and the Redmine repository. You will need to set this up and learn to use it to be able to submit homework assignments.

Notes

  • the first section (first setup) and the concrete URLs are specific to Redmine
  • the rest would be the same with any other remote machine which offers storing of git repositories (e.g. GitHub, GitLab, BitBucket)

First setup on Redmine

  • Login and set a new password
    • go to https://redmine.ms.mff.cuni.cz/login
    • log in with your username (name.surname) and password
    • your first password is temporary — go to "My account" — "Change password" and set a new password (and remember it)
  • Setup the repository for this course
    • go to https://redmine.ms.mff.cuni.cz/projects/undergrads
    • click on your subproject
    • Settings — unselect "Public" — Save
    • Settings — Members — New member
      • add "Rudolf Rosa" and "Zdenek Zabokrtsky" as "Reporters"
    • Settings — Repositories — New repository
      • SCM: Gitolite
      • Identifier: 2018-npfl092

Filling your new repository — VERSION A (git add remote)

xkcd comics

  • create a local repository
    • cd
    • mkdir 2018-npfl092
    • cd 2018-npfl092
    • git init
  • add a README file
    • echo 'This is my repository for NPFL092.' > README
    • git status
    • git add README
    • git status
  • commit the changes locally
    • git commit -m'adding README'
    • git status
  • push your repository to Redmine
    • git remote add origin https://name.surname@redmine.ms.mff.cuni.cz/undergrads/surname/2018-npfl092.git
    • git push -u origin master
      • you will need to enter your Redmine password
      • you can also use SSH instead of HTTPS, which saves you some password typing, but requires you to set up SSH keys
    • git status

Filling your new repository — VERSION B (git clone)

  • clone your repository
    • cd
    • git clone https://name.surname@redmine.ms.mff.cuni.cz/undergrads/surname/2018-npfl092.git
    • cd 2018-npfl092
    • Note: you can also use SSH instead of HTTPS, which saves you some password typing, but requires you to set up SSH keys.
  • add a README file
    • echo 'This is my repository for NPFL092.' > README
    • git status
    • git add README
    • git status
  • commit the changes locally
    • git commit -m'adding README'
    • git status
  • push changes to Redmine
    • git push
    • you will need to enter your Redmine password
    • git status

Synchronizing changes

  • make a new clone of the repository at a different place
    • cd; mkdir new_clone_of_repo; cd new_clone_of_repo
    • git clone https://name.surname@redmine.ms.mff.cuni.cz/undergrads/surname/2018-npfl092.git
    • cd 2018-npfl092
  • make some changes here, stage them, commit them locally, and push them to Redmine
    • echo 'This repo will contain my homework.' >> README
    • git add README
    • git commit -m'adding more info'
    • git push
  • go back to your first local repo and get the new changes from Redmine
    • cd ~/2018-npfl092
    • cat README
    • git pull
    • cat README

Regular work with your repo

  • go to a directory containing a clone of your repository (or make a new one with git clone if on a different computer)
  • synchronize your local repo with the repo on Redmine with git pull
  • do any changes to the files, create new files, etc.
  • view the changes with git status (and with git diff to see changes inside files)
  • stage new/changed files that you want to become part of the repo with git add (untracked files are ignored by git)
  • create a new snapshot in your local repo with git commit
  • synchronize the repo on Redmine with your local repo with git push

Going back to previous versions

  • to throw away current uncommitted changes:
    • git checkout filename to revert to the last committed version of file filename
    • beware, there is no undo, i.e. with this command you immediately lose any uncommitted changes!
  • to only show info about commits:
    • git log to figure out which commit you are interested in
    • git show commitid to show the details about a commit with id commitid
  • to temporarily switch to a previous state of the repository:
    • commit all your changes
    • git checkout commitid to go to the state after the commit commitid
    • git checkout master to return to the current state

Branching

Branching

  • git branch branchname to create a new branch called branchname
  • git checkout branchname to switch to the branch branchname
  • git checkout master to switch back to master
  • git merge branchname to merge branch branchname into the current branch
    • typically you merge into master
    • i.e. you first git checkout master
    • and then git merge branchname
  • git branch -d branchname to remove the branch called branchname

hw_git Questions

4. Bash

 Oct 24

  • Bash scripting
    • text processing commands: sort, uniq, cat, cut, [e]grep, sed, head, tail, rev, diff, patch, set, pipelines, man...

    • regular expressions

    • if, while, for

    • xargs: Compare

      sed 's/:/\n/g' <<< "$PATH" | \
      grep "$USER" | \
      while read path ; do
          ls "$path"
      done
      

      with

      sed 's/:/\n/g' <<< "$PATH" | \
      grep "$USER" | \
      xargs ls
      
    • Shell script, patch to show the changes we made; to apply it, just run

      patch -p0 < script.sh
      

xkcd comics

  • Makefiles

  • warm-up exercises:

    1. construct a bash pipeline that extracts words from an English text read from the input, and sorts them in the "rhyming" order (lexicographical ordering, but from the last letter to the first letter; "retrográdní uspořádání" in Czech) (hint: use the command rev for reversing individual lines)
    2. construct a bash pipeline that reads an English text from the input and finds 3-letter "suffixes" that are most frequent in the words that are contained in the text, irrespective of the words' frequencies (suffixes not in the linguistic sense, simply just the last 3 letters from a word that contains at least 5 letters) (hint: you can use e.g. sed 's/./&\t/g' | rev | cut -f2,3,4 | rev for extracting the last three letters)
  • system variables

  • editing .bashrc (aliases, paths...)

  • looping, branching, e.g.

    #!/bin/bash
    for file in *; do
        if [ -x "$file" ]
        then
            echo "Executable file: $file"
            echo "Shebang line:  $(head -n 1 "$file")"
            echo
        fi
    done
    

Unix for Poets

hw_makefile Questions

5. Python, basic manipulation with strings

 Oct 31

xkcd comics

  • To solve practical tasks, Google is your friend!

  • By default, we will use Python version 3: python3. A day may come when you will need to use Python 2, so please note that there are some differences between the two. (Also note that you may encounter code snippets in either Python 2 or Python 3…)

  • To work interactively with Python, use IPython: ipython3

    • to save the commands 5-10 from your IPython session to a file named mysession.py, run:

      %save mysession 5-10
      
    • to exit IPython, run:

      exit
      
  • For non-interactive work, use your favourite text editor.

  • Python types

    • int: a = 1
    • float: a = 1.0
    • bool: a = True
    • str: a = '1 2 3' or a = "1 2 3"
    • list: a = [1, 2, 3]
    • dict: a = {"a": 1, "b": 2, "c": 3}
    • tuple: a = (1, 2, 3) (something like a fixed-length immutable list)

First Python exercises (simple language modelling)

  1. Create a string containing the first chapter of Genesis. Print out the first 40 characters.

    str[from:to]  # from is inclusive, to is exclusive
    

    Print out 4th to 6th character 1-based (=3rd to 5th 0-based)
    Check the length of the result using len().

  2. Split the string into tokens (use str.split(); see ?str.split for help).
    Print out the first 10 tokens. (A list slice behaves similarly to a substring.)
    Print out the last 10 tokens.
    Print out the 11th to 18th token.
    Check the length of the result using len().
    Just printing a list slice is fine; also see ?str.join

  3. Compute the unigram counts into a dictionary.

    # Built-in dict (need to explicitly initialize keys):
    unigrams = {}
    
    # The Python way is to use the foreach-style loops;
    # and horizontal formatting matters!
    for token in tokens:
        pass  # do something with each token
    
    # defaultdict, supports autoinitialization:
    from collections import defaultdict
    # int = values for non-set keys initialized to 0:
    unigrams = defaultdict(int)
    
    # Even easier:
    from collections import Counter
    
  4. Print out the most frequent unigram.

    max(something)
    max(something, key=function_to_get_key)
    
    # getting value stored under a key in a dict:
    unigrams[key]
    unigrams.get(key)
    

    Or use Counter.most_common()

  5. Print out the unigrams sorted by count.
    Use sorted() — behaves similarly to max()
    Or use Counter.most_common()

  6. Get unigrams with count > 5; can be done with list comprehension:

    [token for token in unigrams if unigrams[token] > 5]
    
  7. Count bigrams in the text into a dict of Counters

    bigrams = defaultdict(Counter)
    bigrams[first][second] += 1
    
  8. For each unigram with count > 5, print it together with its most frequent successor.

    [(token, something) for …]
    
  9. Print the successor together with its relative frequency rounded to 2 decimal digits.

    max(), sum(), dict.values(), round(number, ndigits)
    
  10. Print a random token. Print a random unigram disregarding their distribution.

    import random
    ?random.choice
    list(dict.keys())
    
  11. Pick a random word, generate a string of 20 words by always picking the most frequent follower.

    range(10)
    
  12. Put that into a function, with the number of words to be generated as a parameter.
    Return the result in a list.

    list.append(item)
    
    def function_name(parameter_name=default):
        # do something
        return 123
    
  13. Sample the next word according to the bigram distribution

    import numpy as np
    ?np.random.choice
    np.random.choice(list, p=list_of_probs)
    
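A possible compact sketch of exercises 3 to 13 (one way to do it, assuming the text is already in the variable text):

    from collections import Counter, defaultdict

    tokens = text.split()
    unigrams = Counter(tokens)

    # bigram counts: for each token, count its successors
    bigrams = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        bigrams[first][second] += 1

    def generate(start, n=20):
        # generate n words by always picking the most frequent follower
        words = [start]
        for _ in range(n - 1):
            followers = bigrams[words[-1]]
            if not followers:
                break
            words.append(followers.most_common(1)[0][0])
        return words

    print(" ".join(generate(tokens[0])))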

Python tutorial

hw_string1 Questions

6. Python: string manipulation cont.

 Nov 7

The string data type in Python

Strings

  • str.*: useful methods you can invoke on a string

    • case changing (lower, upper, capitalize, title, swapcase)
    • is* tests (isupper, isalnum...)
    • matching substrings (find, startswith, endswith, count, replace)
    • split, splitlines, join
    • other useful methods (not necessarily for strings): dir, sorted, set
    • my ipython3 session from the lab (unfiltered)
  • exercise: implement a simple wc-like tool in Python, so that running

    ./wc.py textfile.txt
    

    will print out three numbers: the number of lines, words, and characters in the file (for words, you can simply use whitespace-delimited strings -- there is a string method that does just that...); a minimal sketch follows after this exercise list

  • exercise: find palindrome words in English

    • A palindrome word reads the same forward and backward, e.g. "level"
    • Write a python script that reads text from stdin and prints detected palindromes (one per line) to stdout
    • print only palindrome words longer than three letters
    • apply your script on the English translation of Homer's The Odyssey (available as a UTF-8 encoded Project Gutenberg ebook)
    • a slightly more advanced extension (optional): try to find longer expressions that read the same in both directions after spaces are removed (two or more words; a contiguous segment of the input text, possibly crossing line boundaries).
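
For the two exercises above, minimal sketches might look as follows (one possible shape, not the only one). A wc-like tool:

    #!/usr/bin/env python3
    # print the number of lines, words, and characters in a file
    import sys

    with open(sys.argv[1], encoding="utf-8") as f:
        text = f.read()
    print(len(text.splitlines()), len(text.split()), len(text))

And a palindrome detector reading from stdin:

    #!/usr/bin/env python3
    # print palindrome words longer than three letters, one per line
    import sys

    for line in sys.stdin:
        for word in line.split():
            w = word.lower()
            if len(w) > 3 and w == w[::-1]:
                print(word)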

Encoding in Python

Unicode Text files

  • a simple rule: use Unicode everywhere, and if conversions from other encodings are needed, then do them as close to the physical data as possible (i.e., encoding should be processed properly already in the data reading/writing phase, and not internally by decoding the content of variables)

  • example:

    import sys, codecs

    f = open(fname, encoding="latin-1")
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)  # wrap the binary buffer in Python 3
    

Regular expressions in Python

xkcd comics Regexes

  • Python has built-in regex support in the re module, but the regex module seems to be more powerful while using the same API. To be able to use it, you need to:

    1. install it (in Bash):

      pip3 install --user regex
      
    2. import it (in Python)

      import regex as re
      
  • search, findall, sub

  • raw strings r'...'

  • character classes [[:alnum:]], \w, ...

  • flags flags=re.I or r'(?i)...'

  • subexpressions r'(.) (...)' + backreferences r'\1 \2'

  • revision of regexes
    ^[abc]*|^[.+-]?[a-f]+[^012[:alpha:]]{3,5}(up|down)c{,5}$

  • good text to play with: the first chapter of Genesis again

  • my regex ipython3 session from the lab (unfiltered, from a lab taught in year 2016)
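
A few of the constructs above in action (a sketch; any short text will do):

    import regex as re   # or: import re

    text = "In the beginning God created the heaven and the earth."

    re.search(r'\w+', text).group()                   # first word: 'In'
    re.findall(r'(?i)the', text)                      # case-insensitive matches
    re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')   # 'world hello'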

hw_string2 Questions

7. Python: modules, packages, classes

 Nov 14

  • Exercise: implement a simple Czech POS tagger in Python, choose any approach you want, required precision at least 50%

    • Tagger input format - data encoded in iso-8859-2 in a simple line-oriented plain-text format: empty lines separate sentences, non-empty lines contain word forms in the first column and simplified (one-letter) POS tags in the second column, such as N for nouns or A for adjectives (you can look at the tagset documentation). Columns are separated by tabs.

    • Tagger output format: empty lines not changed, nonempty lines enriched with a third column containing the predicted POS for each line

    • Training data: tagger-devel.tsv

    • Evaluation data: tagger-eval.tsv (to be used only for evaluation!!!)

    • Performance evaluation (precision=correct/total): eval-tagger.sh

      cat tagger-eval.tsv | ./my_tagger.py | ./eval-tagger.sh
      
    • Example baseline solution (a Python 2 one-liner) - everything is a noun, precision 34%:

      python -c'import sys;print"".join(s if s<"\r" else s[:-1]+"\tN\n"for s in sys.stdin)'<tagger-eval.tsv|./eval-tagger.sh
      prec=897/2618=0.342627960275019
      
  • Classes in Python

    • classic examples with classes representing dogs
    • class A:, def foo(self, x, y):, a = A(), a.foo(x, y)
    • def __init__(self, x, y), def __str__(self), from Module import Class, if __name__ == "__main__":
      • a module is typically a .py file; you can just import the module, or even import specific classes from the module
      • beware of name clashes; but you can always import MyModule as SomeOtherName
    • inheritance: class B(A); overriding is the default, just redefine the method; use super().foo() to invoke parent's implementation
    • static members (without self, belong to class): class A, a = 5, A.a = 10, def b(x, y), A.b(x, y)
    • a package is basically a directory containing multiple modules -- packA/modA.py, packA/modB.py, from packA.modB import classC...
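    • a toy example combining the constructs above (a sketch in the spirit of the classic dog example):

      class Dog:
          species = 'Canis familiaris'   # static member, shared by the whole class

          def __init__(self, name):
              self.name = name           # instance member

          def speak(self):
              return self.name + ' says Woof!'

      class Puppy(Dog):                  # inheritance
          def speak(self):               # overriding is the default
              return super().speak() + ' (in a tiny voice)'

      if __name__ == "__main__":
          print(Puppy('Rex').speak())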
  • Virtual environments

    • Sometimes you need several different "installations" of Python -- you need version 1.2.3 of a package for project A, but version 3.5.6 for project B, etc.
    • The answer is to create several separate virtual environments:
      1. Once for each project, create a venv for the project; specify any path you like to store the environment:

        python3 -m venv ~/venv_proj_A
        
      2. Every time you start working on project A, switch to the right venv:

        source ~/venv_proj_A/bin/activate
        
      3. Checking that everything looks fine:

        • Your prompt should now show something like (venv_proj_A)
        • Your python and python3 should now be local just for this venv:
          • Try running which python and which python3
          • This should print out paths within the venv, e.g. /home/rosa/venv_proj_A/bin/python3
        • Your pip should now be a local pip just for this venv (and pip and pip3 should be identical):
          • which pip should say something like /home/rosa/venv_proj_A/bin/pip
          • pip --version should mention python 3
      4. To install Python packages just for this project:

        • Use pip install package_name (instead of the usual pip3 install --user package_name)
        • The package will be installed locally just for this venv
      5. To get out of the venv:

        • run deactivate
        • or close the terminal

Reading hw_tagger Questions

8. NLTK and other NLP frameworks

 Nov 21

Why use an NLP framework?

How is it better than other options, i.e. manual implementation or using existing standalone tools? (Note: the benefits of using a framework listed below are not necessarily true for all frameworks.)

  • You can read in data in various formats and convert them to a unified representation; no further conversions are needed to use the tools, and you get a unified structured API to access the annotated data
  • You get a number of tools in one batch, ready to use, with unified APIs
  • You can often do everything from one or more Python scripts and run the whole pipeline at once, while standalone tools typically have to be run and their inputs and outputs manipulated from a terminal/bash script/Makefile
  • Built-in visualisation
  • You can not only apply but also train the tools (for machine learning you can go to: NPFL054 Introduction to machine learning, NAIL029 Machine Learning, NPFL104 Machine Learning Exercises)

Overview of NLP frameworks

NLTK tutorial

  1. Installation:

    # in terminal
    pip3 install --user nltk
    
    ipython3
    import nltk
    
    # optionally:
    # nltk.download()
    # usually, you should choose to download "all" (but it may get stuck)
    
  2. https://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk

  3. https://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize

  4. https://textminingonline.com/dive-into-nltk-part-iii-part-of-speech-tagging-and-pos-tagger

  5. https://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization

Using existing tools in NLTK

Sentence segmentation, word tokenization, part-of-speech tagging, named entity recognition.

with open("genesis.txt", "r") as f:
    genesis = f.read()

sentences = nltk.sent_tokenize(genesis)
# just the first sentence
tokens_0 = nltk.word_tokenize(sentences[0])
tagged_0 = nltk.pos_tag(tokens_0)
# all sentences
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]
tagged_sentences = nltk.pos_tag_sents(tokenized_sentences)

ne=nltk.ne_chunk(tagged_0)
print(ne)
ne.draw()

Exercise

Once again processing genesis, this time in NLTK:

  • read in the text of the first chapter of Genesis
  • use NLTK to split the text into sentences, split the sentences into tokens, and tag the tokens for part-of-speech
  • print out the output as TSV, one token per line, wordform POStag separated by a tab, with an empty line separating sentences
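
One possible shape of the solution, reusing the calls shown above (a sketch, assuming the text is in genesis.txt):

    import nltk

    with open("genesis.txt", "r") as f:
        genesis = f.read()

    for sentence in nltk.sent_tokenize(genesis):
        for form, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            print(form, tag, sep="\t")
        print()   # empty line between sentences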

Training a tagger

from nltk.corpus import treebank
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

from nltk.tag import tnt
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)

tnt_pos_tagger.tag(nltk.word_tokenize("A platypus is a very special animal."))

tnt_pos_tagger.evaluate(test_data)

import pickle
with open('tnt_treebank_pos_tagger.pickle', 'wb') as f:
    pickle.dump(tnt_pos_tagger, f)
with open('tnt_treebank_pos_tagger.pickle', 'rb') as f:
    loaded_tagger = pickle.load(f)

Trees in NLTK

Let's create a simple constituency tree for the sentence A red bus stopped suddenly:

# what we want to create: 
#
#           S
#       /       \
#    NP           VP
#  / |  \      /      \
# A red bus stopped suddenly
#

from nltk import Tree

# Tree(root, [children])
np = Tree('NP', ['A', 'red', 'bus'])
vp = Tree('VP', ['stopped', 'suddenly'])
# children can be strings or Trees
s = Tree('S', [np, vp])

# print out the tree
print(s)

# draw the tree (opens a small graphical window)
s.draw()

And a dependency tree for the same sentence:

# what we want to create: 
#
#       stopped
#       /      \
#    bus    suddenly
#  / |
# A red

# can either use string leaf nodes:
t1=Tree('stopped', [Tree('bus', ['A', 'red']), 'suddenly'])
t1.draw()

# or represent each leaf node as a Tree without children:
t2=Tree('stopped', [Tree('bus', [ Tree('A', []), Tree('red', []) ]), Tree('suddenly', []) ])
t2.draw()

Tagging and parsing with UDPipe

Easy way: use the online service (also has a REST API)

Powerful way: use local installation (more control, also supports training) — see below:

  1. Installation:

    # checkout the udpipe repository
    git clone https://github.com/ufal/udpipe.git
    
    # compile udpipe
    cd udpipe/src
    make
    cd ../..
    
    # install Python bindings
    pip3 install --user ufal.udpipe
    
    # download trained models
    wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1659/udpipe-ud-1.2-160523.zip
    unzip udpipe-ud-1.2-160523.zip
    
  2. Sample usage:

    # start ipython in the directory with the models (udpipe-ud-1.2-160523),
    # as this makes it easier to load the models just by the filename;
    # otherwise you have to specify the full path to the model
    cd udpipe-ud-1.2-160523
    ipython3
    
    from ufal.udpipe import *
    
    # load model from the given file;
    # if the file does not exist, expect a Segmentation fault
    model = Model.load("english-ud-1.2-160523.udpipe")
    
    # create a UDPipe processing pipeline with the loaded model,
    # with "horizontal" input (a sentence with space-separated tokens),
    # default setting for tagger and parser,
    # and CoNLL-U output
    pipeline = Pipeline(model, "horizontal", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
    
    # analyze a tokenized sentence with UDPipe
    # and print out the resulting CoNLL-U analysis
    print(pipeline.process("A man went into a bar ."))
    

hw_nltk Questions

9. A gentle introduction to XML

 Nov 28 XML

  • Motivation for XML, basics of XML syntax, examples, well-formedness/validity, dtd, xmllint
  • XML exercise:
    • Create an XML file representing some data structures (ideally NLP-related) manually in a text editor, or by a Python script.
    • The file should contain at least 7 different elements, some of them should have attributes.
    • Create a DTD file and make sure that the XML file is valid w.r.t. the DTD file.
    • Create a Makefile that has targets wellformed and valid and uses xmllint to verify the file's well-formedness and its validity with respect to the DTD file.
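
If you go the Python way, creating and saving such an XML file could start like this (a sketch; the element names are made up):

    import xml.etree.ElementTree as ET

    root = ET.Element("corpus", lang="en")
    sentence = ET.SubElement(root, "sentence", id="1")
    token = ET.SubElement(sentence, "token", pos="DET")
    token.text = "The"

    ET.ElementTree(root).write("sample.xml", encoding="utf-8", xml_declaration=True)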

hw_xml Questions

10. XML & JSON

 Dec 5

  • Exercise: For all files in xmlsamples.zip, check whether they are well-formed XML files or not (e.g. by xmllint), and if not, then fix them (possibly manually in a text editor, or any way you want).
  • Exercise: write a Python script that recognizes (at least some of) the well-formedness violations present in the above mentioned files, without using any specific library for XML processing

XML+

  • A very quick overview of some XML-related standards (namespaces, XPath, XSL, SAX, DOM)

XML&JSON

  • Intro to XML and JSON processing in Python
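
For a first idea of what this looks like (a sketch; the file names are made up):

    import json
    import xml.etree.ElementTree as ET

    # DOM-like XML processing: parse a file and walk through all elements
    tree = ET.parse("sample.xml")
    for element in tree.getroot().iter():
        print(element.tag, element.attrib, element.text)

    # JSON: read, modify, write back
    with open("data.json") as f:
        data = json.load(f)
    data["new_key"] = [1, 2, 3]
    with open("data.json", "w") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)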

hw_xml2json

11. Selected good practices in software development (not only in NLP, not only in Python)

 Dec 12 Good practices

  • warm-up exercise:
    • find English word groups in which the words are derived one from the other, such as interest-interesting-interestingly
    • use the list of 10,000 most frequent English lemmas bnc_freq_10000.txt
  • good development practices
    • testing
    • benchmarking
    • profiling
    • code reviewing
    • bug reporting
  • exercise:
    • exchange solutions of HW tagger with one of your colleagues
    • implement unit tests (using unittest) of his/her solution
    • if you find some problems, send him/her a bug report
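
A minimal unittest sketch for this exercise (assuming the colleague's solution can be imported as from tagger import Tagger; the module name is hypothetical):

    import unittest
    from tagger import Tagger   # hypothetical module name

    class TestTagger(unittest.TestCase):
        def test_seen_word(self):
            # a word seen in training should get the observed tag
            tagger = Tagger()
            tagger.see("dog", "N")
            tagger.train()
            self.assertEqual(tagger.predict("dog"), "N")

    if __name__ == "__main__":
        unittest.main()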

Questions

12. OVERFLOW BUFFER

 Dec 19 Very probably, we will not manage to keep up with the schedule, so something planned for earlier will eventually be moved into this class...

13. Final test

 Jan 9 Questions

Assignments

1. hw_ssh

2. hw_editors

3. hw_git

4. hw_makefile

5. hw_string1

6. hw_string2

7. hw_tagger

8. hw_nltk

9. hw_xml

10. hw_xml2json


Notes

  • Submit assignments via Git (except for the first two assignments). Use the assignment names as directory names.
  • We will only look at the last version submitted before the deadline.
  • The estimated durations are only approximate. If possible, please let us know how much time you spent with each assignment, so that we can improve the estimates for future students.

1. hw_ssh

 Duration: 10-30min  100 points  Deadline: Oct 13 23:59

In this homework, you will practice working through SSH.

  • Connect remotely from your home computer to the MS lab

  • Check that you can see there the data from the class (or use wget and unzip to get the UDHR data to the computer from https://ufal.mff.cuni.cz/~rosa/courses/npfl092/data/udhr.zip)

  • Try practising some of the commands from the class: try renaming files, copying files, changing file permissions, etc.

  • Try to create a shell script that prints some text, make it executable, and run it, e.g.:

    echo 'echo Hello World' > hello.sh
    chmod u+x hello.sh
    ./hello.sh
    
  • List your "friends":

    • Create an executable script called friends.sh that lists all users whose username starts with the same character as yours.
    • Hint: in our lab, all users whose username starts with "r" have their home directories in /afs/ms/u/r/.
    • So you just need to list the contents of such a directory.
  • Put your scripts into a shared directory:

    • Go to /afs/ms/doc/vyuka/INCOMING/TechnoNLP/.
    • Create a new directory there for yourself.
      • Use your last name as the name for the directory.
    • In this new directory, create another directory called hw_ssh.
    • Copy your two scripts into the hw_ssh directory.
  • You can also try connecting to the MS lab from your smartphone and running a few commands -- this will let you experience the power of being able to work remotely in Bash from anywhere...

You should be absolutely confident in doing these tasks. If you are not, take some more time to practice.

And, as always, contact us per e-mail if you run into any problems!

2. hw_editors

 Duration: 30min-2h  100 points  Deadline: Oct 20 23:59

Choose your text editor or editors you will be using for the course, and make sure you can invoke all the features mentioned in the class:

  1. modes (programming languages, xml, html...)
  2. syntax highlighting
  3. completion
  4. indentation
  5. support for encodings (utf-8)
  6. integration with compiler...

If your chosen editor is only graphical and cannot work in a text console, make sure to also choose a fallback text-based editor to use through SSH.

Then, try out your editor on the following task of improving a badly formatted source code:

  • download a badly formatted code: bad.py
  • run the code (python3 bad.py) to check it works (it should)
  • get some basic information about the code with file
  • open the code in your editor and improve its formatting (but do not change its function), including at least the following improvements:
    • convert the code into UTF-8 (and remove the first line specifying the encoding; UTF-8 is the default in Python 3)
    • switch it from windows newlines (CRLF) to unix newlines (LF)
    • make indentation and spacing consistent
    • make quotes consistent
    • add a shebang
    • make it executable
    • rerun it to check you did not break it
    • ideally, you should be able to do all of the above in your editor, without going to Bash
  • if you do not know Python yet, then this will also serve as your first introduction to some basic Python constructions
    • indentation is important in Python, as it marks the start and end of a block; so in the script, the first two prints are part of the first for block, while the third print is not; any number of spaces can be used for indentation, but it is common to use 4 spaces
    • there is no difference between single quotes and double quotes
  • save the improved script as good.py
  • take a screenshot of the code open in your editor, and save it as good.png or good.jpg (so that we can see the syntax highlighting)
  • put the good code and the screenshot into /afs/ms/doc/vyuka/INCOMING/TechnoNLP/your-surname/hw_editors/ (of course, replace your-surname with your surname).
    • this cannot be done directly through SSH:
      • you can use scp (it is similar to cp but can copy to and from remote machines; the colon : between the machine identifier and the path is what makes it clear that you want to use a remote machine):
        • scp good.py yourlogin@u-pl13.ms.mff.cuni.cz:/afs/ms/...
        • on Windows, use WinSCP
      • or you can simply do that when you are physically in the lab
      • or you can wait till next lesson and then use Git for that

3. hw_git

 Duration: 1-2h  100 points  Deadline: Oct 27 23:59

Go again through the instructions for using Git and Redmine, and make sure everything works both on the lab computers (connect through SSH to check this) and on your home computer. Then proceed with the following "toy" homework assignment:

  1. On your home computer, clone your repository from Redmine and go into it.

  2. In your Git repository, create a directory called hw_git and add it into Git (git add hw_git).

  3. In this directory, create a text file that contains at least 10 lines of text, e.g. copied from a news website, and add it into Git.

  4. Commit the changes locally (e.g. git commit -m'adding text file').

  5. Create a new Bash script called sample.sh in the directory. When you run the Bash script (./sample.sh), it should write out the first 5 lines from the text file.

  6. Commit the changes locally.

  7. Push the changes to Redmine.

  8. Connect to a lab computer through SSH, clone the repository from Redmine, try to find your script and run it to see that everything works fine. (If it does not, fix it.)

  9. Still through SSH, change the script to only print the first 2 lines from the file.

  10. Commit and push the changes. (Even though the script file is already part of the Git repository, i.e. it is "versioned", the new changes are not, so you still need to either add the current version of the script again (git add sample.sh), or use commit with the -a switch which automatically adds all changes to versioned files.)

  11. Go back to your local clone of the repository on your home computer, pull the changes, and check that everything works correctly, i.e. that the script prints the first 2 lines from the file. (If it does not, fix it.)

  12. In the local clone, change the script once more, so that it now prints the last 5 lines from the text file. Commit and push.

  13. Go again into the repository clone stored in the lab, pull the changes, and check that the script works correctly. (If it does not, fix it.)

  14. Copy your solutions for hw_ssh and hw_editor into the Git repository. Again, make sure to add them, commit them, push them, and check that they work.

  15. If you run into problems which you are unable to solve, ask for help!

You will submit all of the following assignments in this way, i.e. through Git, in a directory named identically to the assignment. Once you finish an assignment, always use SSH to connect to the lab, pull the assignment, and check that it works correctly.

4. hw_makefile

 Duration: 1-3h  100 points  Deadline: Nov 3 23:59

Create a Makefile with targets t1-t18, performing the tasks 1-18 listed below.

Put your Makefile into a new directory called hw_makefile/ and submit using Git.

  1. print the text Hello world

  2. using wget, download skakalpes-il2.txt

  3. view the file using cat and less

  4. using iconv, convert the file from iso-8859-2 to utf-8 and store it into skakalpes-utf8.txt

  5. view the new file

  6. count the number of lines in the file using wc

  7. using head and tail, view the first 15 lines, the last 15 lines, and lines 10-20

  8. using cut, print the first two words on each line

  9. using grep, print all lines containing a digit

  10. using sed, substitute spaces and punctuation marks with the newline symbol (\n), so that there is at most one word per line

  11. using grep, filter out empty lines

  12. using sort, sort the words alphabetically

  13. using wc, count the number of words in the text

  14. using sort|uniq, count the number of distinct words in the text

  15. using sort|uniq -c|sort -nr, create a frequency list of words

  16. create a frequency list of letters

  17. using paste, create the frequency list of word bigrams (create another file with lines shifted upwards by one, merge it by paste with the original file and make a frequency list of the lines)

  18. Longer exercise: write a shell script that downloads the main web page of some news server and finds all word bigrams in it in which both words are capitalized. Make a frequency list of HTML tags used in the document.

5. hw_string1

 Duration: 1-4h  100 points  Deadline: Nov 16 23:59

The homework is focused on basic string processing operations in Python.

  • use the Czech version of UDHR from udhr.zip (the file udhr/czc) as the standard input for all the exercises
  • create a sequence of Python scripts corresponding to the following exercises. Name them according to their exercise numbers in the list, e.g. 05.py
  • create a Makefile that executes them: after typing make E=05, the script 05.py is executed, udhr/czc is pushed to its standard input, its standard output is stored into 05.stdout, and its standard error output is stored into 05.stderr
  • make all executes them all (in the expected order)

Text input/output

  1. Read a plain-text input from STDIN and print it to STDOUT, line by line.

  2. Read a plain-text input from STDIN and print it to STDOUT, this time without looping over lines.

  3. Read a plain-text input from STDIN and store it into a file named 'udhr-czc-win', encoded in cp-1250.

Basic string manipulation (without regular expressions)

  1. Print only the first 5 characters from each line.

  2. Print only the last 5 characters from each line.

  3. Print all the lines lowercased.

  4. Print the text in which spaces are substituted with underscores.

  5. Join the input text into one string and, before printing it, break the text into short lines again by replacing the nearest space after the 40th character by the newline symbol

  6. Print the longest word from the input text.

  7. Replace single-digit numbers by their Roman equivalents.

Regular expressions

  1. Print lines that are typeset only in capital letters.

  2. Replace all spaces and punctuation marks by the newline symbol.

Put your solution (i.e., the Python scripts and the Makefile) into a new directory called hw_string1/ and submit using Git as usual.

6. hw_string2

 Duration: 1-3h  100 points  Deadline: Nov 19 23:59

We'll continue practicing basic string/text processing in Python.

  • The very same instructions as specified for the previous homework apply.

Regular expressions, cont.

  1. Reimplement the above 40-char-line-break task using regular expressions. Can you do it using a single RE substitution?

  2. Extract words that end with a particular suffix from the text (btw such words are mostly adverbs)

  3. Try to read the input line supposing that it's encoded in ISO-8859-2.

  4. Replace the following three Czech accented letters "ž","š","č" by "z","s","c".

  5. Print words containing a long vowel (e.g. "á","é")

  6. Print words containing at least two subsequent vowels.

  7. Store the source code created in the previous exercise in ISO-8859-2, run it and see what happens.

Counting

  1. Print the three most frequent words from the input text.

  2. Print the frequency list of punctuation marks (approximate it by excluding alphanumerical and white-space symbols).

  3. Print the three most frequent word bigrams.

  4. Print the three most frequent letter bigrams, this time lowercased.

  5. Remove "stop words" from the text. Approximate the list of stop words by the list of words that have at least 10 occurrences in the text.

Put your solution (i.e., the Python scripts and the Makefile) into a new directory called hw_string2/ and submit using Git as usual.

7. hw_tagger

 Duration: 1-4h  100 points  Deadline: Nov 24 23:59

Implement a simple POS tagger using Object-Oriented programming. Do not forget to also include the Makefile!

  • turn your solution to the in-class tagger exercise into an OO solution:

    • implement a class Tagger
    • the tagger class has a method tagger.see(word,pos) which gets a word-pos instance from the training data (and probably stores it into a dictionary or something)
    • the tagger class has a method tagger.train() that infers a model (if needed)
    • the tagger class has a method tagger.save(filename) that saves the model to a file (again, it is recommended to use pickle)
    • the tagger class has a method tagger.load(filename) that loads the model from a file
    • the tagger class has a method tagger.predict(word) that predicts a POS tag for a word given the trained model
  • the tagger should be usable as a Python module:

    • e.g. if your Tagger class resides in my_tagger_class.py, you should be able to use it in another script (e.g. calling_my_tagger.py) by importing it (from my_tagger_class import Tagger); note that a module you want to import cannot have hyphens in its name

    • one option of achieving this is by having just the Tagger class in the script, with no code outside of the class (you then need another script to use your tagger)

    • another option is to wrap any code which is outside the class into the if __name__ == "__main__": block, which is executed only if the script is run directly, not when it is imported into another script:

      # This is the Tagger class, which will be imported when you "import Tagger"
      class Tagger:
          def __init__(self):
              self.model = dict()
      
          def see(self, word, pos):
              self.model[word] = pos
      
      # This code is only executed when you run the script directly, e.g. "python3 my_tagger_class.py"
      if __name__ == "__main__":
          tagger = Tagger()
          tagger.see("big", "A")
      
  • wrap your solution into a Makefile with the following targets:

    • download - downloads the data
    • train - trains a tagging model given the training file and stores it into a file
    • predict - appends the column with predicted POS into the test file
    • eval - prints the accuracy

8. hw_nltk

 Duration: 30min-3h  100 points  Deadline: Dec 1 23:59

Part-of-speech tagging again, this time in NLTK.

  • train and evaluate a Czech part-of-speech tagger in NLTK
  • use any of the trainable taggers available in NLTK (tnt looked quite promising), achieving some non-trivial accuracy (if your accuracy is e.g. 20%, then something is wrong)
  • you can experiment with multiple taggers and multiple settings and improvements to achieve a good accuracy (this is not required, but you can get bonus points)
  • use the data from HW Tagger: tagger-devel.tsv as training data, tagger-eval.tsv as evaluation data
  • note that you have to convert the input data appropriately into a format which is expected by the tagger
  • wrap your solution into a Makefile, with the targets download, train, predict, eval, as in HW Tagger:

9. hw_xml

 Duration: 1-2h  100 points  Deadline: Dec 8 23:59

Finish the XML+DTD exercise from the class. Do not forget to also include the Makefile!

  • XML exercise:
    • Create an XML file representing some data structures (ideally NLP-related) manually in a text editor, or by a Python script.
    • The file should contain at least 7 different elements, some of them should have attributes.
    • Create a DTD file and make sure that the XML file is valid w.r.t. the DTD file.
    • Create a Makefile that has targets wellformed and valid and uses xmllint to verify the file's well-formedness and its validity with respect to the DTD file.

10. hw_xml2json

 Duration: 2-6h  100 points  Deadline: Dec 15 23:59

Implement conversions between TSV, XML and JSON.

  • download a simplified file with Universal Dependencies trees dependency_trees_from_ud.tsv (note: simplification = some columns removed from the standard conllu format)
  • write a Python script that converts this data into a reasonably structured XML file
  • write a Python script that reads the XML file and converts it into a JSON file
  • write a Python script that reads the JSON file and converts it back to the TSV file
  • check that the final output file is identical with the original input file
  • organize it all in a Makefile with targets download, tsv2xml, xml2json, json2tsv, and check for the individual steps, and a target all that runs them all

Sample test questions

Sample questions for the final written test. The test is not limited to the following list. However, all the test questions will come from the areas illustrated below.

  1. Basic survival in Linux (or rather in Bash)
  2. Character encoding
  3. Text-processing in Bash
  4. Git
  5. Python basics
  6. Simple string processing in Python
  7. Python modules, packages, and classes
  8. Introduction to XML
  9. NLTK and other NLP frameworks
  10. Selected good practices in software development (not only in NLP, not only in Python)
  11. General text-processing problem solving

  1. Basic survival in Linux (or rather in Bash)
    1. Name and describe at least two options for each of the following commands in bash: ls, sort, cut, iconv, grep (1 point).

    2. Give examples of what the .bashrc file can be used for (1 point).

    3. Explain how command line pipelining works (1 point).

    4. Create a bash script that counts the total number of words in all *txt files in all subdirectories of the current directory (2 points).

    5. You created a new file called doit.sh and wrote some Bash commands into it, e.g.:

      echo "ls -t | head -n 5 | cat -n" > doit.sh
      

      How do you run it now? (1 point)

    6. What do you think the following command does?

      ls -t | head -n 5 | cat -n
      

      How would you check what it really does (without running it)? (1 point)

  2. Character encoding
    1. Explain the notions "character set" and "character encoding" (1 point).

    2. Explain the main properties of ASCII (1 point).

    3. What 8-bit encoding do you know for Czech or other European languages (or your native language)? Name at least three. How do they differ from ASCII? (1 point)

    4. What is Unicode and what Unicode encodings do you know? (1 point)

    5. Explain the relation between UTF-8 and ASCII. (1 point)

    6. How can you detect the encoding of a file? (1 point)

    7. You have three files containing identical Czech text. One of them is encoded using the ISO charset, one of them uses UTF-8, and one uses UTF-16. How can you tell which is which? (1 point)

    8. How would you proceed if you are supposed to read a file encoded in ISO-8859-1, add a line number to each line and store it in UTF8? (a source code snippet in your favourite programming language is expected here) (2 points)

    9. Name three Unicode encodings (1 point).

    10. Explain the size difference between a file containing a text in Czech (or in your native language) stored in an 8-bit encoding and the same file stored in UTF-8. (1 point)

    11. How do you convert a file from one encoding to another, for instance from a non-UTF-8 encoding to UTF-8? (1 point)

    12. Write a Python script that reads a text content from STDIN encoded in ISO-8859-2 and prints it to STDOUT in utf8. (2 points)

    13. Explain what BOM is (in the context of file encoding). (1 point)

  3. Text-processing in Bash
    1. Using the Bash command line, get all lines from a file that contain one or two digits, followed by a dot or a space. (1 point)

    2. Using the Bash command line, remove all punctuation from a given file. (1 point)

    3. Using the Bash command line, split text from a given file into words, so that there is one word on each line. (1 point)

    4. Using the Bash command line, download a webpage from a given URL and print the frequency list of opening HTML tags contained in the page. (2 points)

    5. Using the Bash command line, print out the first 5 lines of each file (in the current directory) whose name starts with "abc". (2 points)

    6. Using the Bash command line, find the most frequent word in a text file. (2 points)

    7. Assume you have some linguistically analyzed text in a tab-separated file (TSV). You are just interested in the word form, which is in the second column, and the part-of-speech tag, which is in the fourth column. How do you extract only this information from the file using the Bash command line? (2 points)

    8. Create a Makefile with three targets. The "download" target downloads the webpage nic.nikde.eu into a file, the "show" target prints out the file, and the "clean" target deletes the file. (2 points)

    9. Create a Makefile with two targets. When the first target is called, a web page is downloaded from a given URL. When the second target is called, the number of HTML paragraphs (<p> elements) contained in the file is printed. (2 points)

    10. Suppose there is a plain-text file containing an English text. Write a Bash pipeline of commands which prints the frequency list of 50 most frequent tokens contained in the text. (Simplification: it is sufficient to use only whitespace characters as token separators) (2 points).

    11. Assume you have some linguistic data in a text file. However, some lines are comments (these lines start with a "#" sign) and some lines are empty, and you are not interested in those. How do you get only the non-empty non-comment lines using the Bash command line? (2 points)

    12. Assume you have some linguistically analyzed text in a comma-separated file (CSV). The first column is the token index — for regular tokens, this is simply a natural number (e.g. 1 or 128), for multiword tokens this is a number range (e.g. 5-8), and for empty tokens it is a decimal number (e.g. 6.1). How do you get only the lines that contain a regular token? (2 points)

    13. Explain the following bash code:

      grep . table.txt | rev | cut -f2,3 | rev
      

      (1 point)

    14. Create a bash script that reads an English text from STDIN and prints only interrogative sentences extracted from the text to STDOUT, one sentence per line (simplification: let's suppose that sentences can be ended only by full stops and question marks). (2 points)

    15. Write a bash script that returns a word-bigram frequency "table" (in the tab-separated format) for its input (2 points).

    16. Write a Bash script that returns a letter-bigram frequency "table" (in the tab-separated format) for its input (2 points).

  4. Git
    1. Name 4 Git commands and briefly explain what each of them does (a few words or a short sentence for each command) (1 point).

    2. Assume you already are in a local clone of a remote Git repository. Create a new file called "a.txt" with the text "This is a file.", and do everything that is necessary so that the file gets into the remote repository (2 points).

    3. Name two advantages of versioning your source codes (with Git) versus not versioning it (e.g. just having it in a directory on your laptop) (1 point).

    4. You and your colleague are working together on a project versioned with Git. Line 27 of script.py is empty. You change that line to initialize a variable ("a = 10"), while your colleague changes it to modify another variable ("b += 20"). He is faster than you, so he commits and pushes first. What happens now? Can you push? Can you commit? What do you need to do now? (2 points)

    5. What's probably wrong with the following sequence of commands? What did the author probably want to do? How would you correct it?

      echo aaa > a; git add a; git push; git commit -m'creating a'
      

      (2 points)

    6. What's probably wrong with the following sequence of commands? What did the author probably want to do? How would you correct it?

      echo aaa > a; git commit -m'creating a'; git push
      

      (2 points)

    7. What's probably wrong with the following sequence of commands? What did the author probably want to do? How would you correct it?

      echo aaa > a; git add a; git push
      

      (2 points)

  5. Python basics
    1. What should the first line of a Python script look like? (1 point)
    2. How do you install a Python module? (1 point)
    3. How do you use a Python module in your Python script? (1 point)
    4. What Python data types do you know? What do they represent? (1 point)
    5. In Python, given a string called text, how do you get the following: first character, last character, first 3 characters, last 4 characters, 3rd to 5th character? (2 points)
    6. Write a minimal Python script that prints "Hello NAME", where NAME is given to it as the first commandline argument; include the correct shebang line in the script. (2 points)
    7. In Python, define a function that takes a string, splits it into tokens, and prints out the first N tokens (10 by default). (2 points)
    8. In Python, given a text split into a list of tokens, print out the 10 most frequent tokens. (1 point)
    9. In Python, given a text split into a list of tokens, print out all tokens that have a frequency higher than 5. (1 point)
    10. In Python, given a text split into a list of tokens, print out all tokens that have a frequency above the median. (2 points)
    11. In Python, implement an improved version of wc: write a script that reads in the contents of a file, and prints out the number of characters, whitespace characters, words, lines and empty lines in the file. (2 points)
    12. In Python, assume the variable genesis_text contains a text, with punctuation removed, i.e. there are just words separated by spaces. Print out the most frequent word. (2 points)
  6. Simple string processing in Python
    1. Name 5 string methods and explain what they do. (1 point)

    2. Write a piece of code that prints out all numbers in a text (tokens that consist only of digits 0-9) joined by underscores (e.g. "L33t Peter has 5 apples, 123 oranges, an iPhone7 and 6466868 pears." becomes "5_123_6466868") (1 point)

    3. Write a piece of code that replaces all occurrences of "Python" by "vicious snake". (1 point)

    4. Write a piece of code that decides whether a string looks like a name — one word consisting of an uppercase letter followed by lowercase letters. (1 point)

    5. Write a piece of code that converts all dates in text from the format "nth/nd/rd Month" to "Month n", so e.g. "I was born on 29th January and my sister on 3rd February" becomes "I was born on January 29 and my sister on February 3" (1 point)

    6. Write a piece of code that replaces all words that start with "pwd" by *****. (1 point)

    7. Write a piece of code that converts the "'s" possessive to the "of" possessive, so that e.g. "I like Peter's car the most." becomes "I like car of Peter the most." (1 point)

    8. Write a piece of code that takes a text in which some lines start with an asterisk and a space ("* ") and replaces the asterisks with consecutive ordinal numbers followed by a dot, starting with 1; e.g.:

      Do not forget to buy:
      * cheese
      * wine
      (just a cheap one)
      * some bread
      

      becomes:

      Do not forget to buy:
      1. cheese
      2. wine
      (just a cheap one)
      3. some bread
      

      (2 points)
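
      A possible sketch (assuming the input text is in a variable text):

      numbered_lines = []
      counter = 0
      for line in text.splitlines():
          if line.startswith("* "):
              counter += 1
              line = str(counter) + ". " + line[2:]
          numbered_lines.append(line)
      text = "\n".join(numbered_lines)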

    9. Write a Python script that reads an English text from STDIN and prints the same text with 'highlighted' personal pronouns (e.g. by placing them between two asterisks *) (2 points).

    10. Write a Python script that returns a word-bigram frequency table for its input. A text is expected on STDIN and a two-column table is expected to be printed on STDOUT (2 points).
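
      A possible sketch of the word-bigram version (one of many valid solutions):

      #!/usr/bin/env python3
      import sys
      from collections import Counter

      tokens = sys.stdin.read().split()
      bigram_counts = Counter(zip(tokens, tokens[1:]))
      for (first, second), count in bigram_counts.most_common():
          print(f"{first} {second}\t{count}")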

    11. Write a Python script that returns a letter-bigram frequency table for its input (2 points).

    12. Suppose you have a file containing a list of first names, one per line. Using Python, process another file containing an English text so that every first name directly followed by a surname is shortened to just the initial letter and a dot. ("John Smith called me yesterday" → "J. Smith called me yesterday") (2 points)

    13. Write a Python script that removes all leading and trailing whitespace from each input line, and replaces all the remaining sequences of whitespace characters with just one space. (2 points)
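
      A possible sketch (str.split with no arguments already splits on any run of whitespace and ignores leading and trailing whitespace):

      #!/usr/bin/env python3
      import sys

      for line in sys.stdin:
          print(" ".join(line.split()))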

    14. Create a Python script that reads an English text from STDIN and prints only interrogative sentences extracted from the text to STDOUT (simplification: let's suppose that sentences can be ended only by full stops and question marks). (2 points)

  7. Python modules, packages, and classes
    1. Create a very simple Python object-oriented tree representation: create a class Node which has attribute children which keeps the list of the node's children, and attribute lemma. There should be a method nodeA.add_child(lemma) which creates a new node (a child of nodeA) labelled with the given lemma. You can disregard any absolute and relative ordering of nodes (2 points).
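
      A minimal sketch of one possible answer:

      class Node:
          def __init__(self, lemma=None):
              self.lemma = lemma
              self.children = []

          def add_child(self, lemma):
              # create a new node labelled with the given lemma
              # and attach it as a child of this node
              child = Node(lemma)
              self.children.append(child)
              return child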

    2. Explain the differences between the notion of a function and the notion of a method in Python (1 point).

  8. Introduction to XML
    1. What is XML? (1 point)

    2. Explain the XML terms 'tag', 'attribute', and 'element'. (1 point)

    3. What is a well-formed XML file? (1 point)

    4. What is a valid XML file? (1 point)

    5. What is a DTD? Give a short example. (1 point)

    6. What is the difference between XML well-formedness and XML validity? (1 point)

    7. How can you check an XML file's well-formedness? (1 point)

    8. How can you check an XML file's validity? (1 point)

    9. Explain the difference between DOM(-like) and SAX(-like) approaches to processing XML data (1 point).

    10. Modify the following code so that it prints not only tags and attributes of elements directly embedded in the root element, but tags and attributes of all elements in the XML file (i.e., including the root and all deeper elements).

      import xml.etree.ElementTree as ET
      tree = ET.ElementTree(file='example.xml')
      root = tree.getroot()
      for child in root:
          print(child.tag, child.attrib)
      

      (2 points)
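
      One possible answer, using the iter() method, which walks through all elements of the tree, including the root:

      import xml.etree.ElementTree as ET

      tree = ET.ElementTree(file='example.xml')
      for element in tree.iter():
          print(element.tag, element.attrib)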

    11. Create a Python script that reads a simple frequency list from STDIN (tab separated lemma and frequency on each line) and turns it into a simple XML formatted file printed to STDOUT (2 points).
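
      A possible sketch (the names freqlist, item, lemma, and frequency are illustrative choices, not prescribed by the assignment):

      #!/usr/bin/env python3
      import sys
      import xml.etree.ElementTree as ET

      root = ET.Element('freqlist')
      for line in sys.stdin:
          lemma, frequency = line.rstrip('\n').split('\t')
          ET.SubElement(root, 'item', lemma=lemma, frequency=frequency)
      ET.ElementTree(root).write(sys.stdout, encoding='unicode')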

  9. NLTK and other NLP frameworks
    1. What are some advantages of using an existing NLP framework over writing all the code yourself? (1 point)

    2. What are some disadvantages of using an existing NLP framework over writing all the code yourself? (1 point)

    3. Name at least 4 things NLTK can do (1 point).

    4. Given a list of tokens, write code that POS-tags the tokens, using NLTK (2 points).
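
      A possible answer (assumes the necessary tagger models have already been downloaded via nltk.download):

      import nltk

      tokens = ['Time', 'flies', 'like', 'an', 'arrow']
      print(nltk.pos_tag(tokens))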

    5. Write a script that reads in English text which has one sentence per line and prints out POS tags for the words (one sentence per line, POS tags separated by spaces), using NLTK (2 points).

    6. Write code using NLTK that takes English text and prints out the POS tag of the sentence-initial words (i.e. for each sentence, only print out the tag of its first word). (2 points)

    7. Given a list of tokens, POS-tag them with NLTK and print out a frequency list of the tags (2 points).
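
      A possible sketch (assuming tokens holds the given list of tokens):

      import nltk
      from collections import Counter

      tag_counts = Counter(tag for token, tag in nltk.pos_tag(tokens))
      for tag, count in tag_counts.most_common():
          print(f"{tag}\t{count}")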

    8. Name at least 2 NLP frameworks or framework-like tools and say something about them in 1-2 lines (at least what they are good for) (1 point).
  10. Selected good practices in software development (not only in NLP, not only in Python)
    1. Explain what unit testing is. How is it done in Python? (1 point)

    2. Suggest at least three tests for the module MagicalTagger whose synopsis is shown below:

      from MagicalTagger import MagicalTagger
      tagger = MagicalTagger('English')
      sentence = ['Time','flies','like','an','arrow']
      pos_tags = tagger.tag(sentence)
      

      (2 points)
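
      Three example tests, sketched with the standard unittest module (the behaviour checked in the last test is an assumption about the tagger, not something stated in the synopsis):

      import unittest
      from MagicalTagger import MagicalTagger

      class TestMagicalTagger(unittest.TestCase):
          def test_one_tag_per_token(self):
              # there should be exactly one tag per input token
              tagger = MagicalTagger('English')
              sentence = ['Time', 'flies', 'like', 'an', 'arrow']
              self.assertEqual(len(tagger.tag(sentence)), len(sentence))

          def test_empty_sentence(self):
              # an empty sentence should yield an empty list of tags
              tagger = MagicalTagger('English')
              self.assertEqual(tagger.tag([]), [])

          def test_unknown_language(self):
              # we assume that an unsupported language raises an error
              with self.assertRaises(Exception):
                  MagicalTagger('Klingon')

      if __name__ == '__main__':
          unittest.main()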

    3. What are crucial properties of a good bug report? (1 point)

    4. What is code profiling? (1 point)

    5. Let's say that you should develop a graph storage Python module (based only on the standard library) for representing the co-occurrence of two words within a single sentence (i.e., whether or not two words appeared together in the same sentence somewhere in a corpus). You are considering two alternative representations: an adjacency list (every word keeps the list of its neighbors in the graph) and an adjacency matrix (a two-dimensional array keeping just ones and zeros). How would you decide which representation is better (for a given NLP application)? (2 points)

  11. General text-processing problem solving
    1. Suppose there are three files, a, b, and c. One of them contains text in English, the other two contain texts in other languages. Try to automatically detect which is the English one (i.e. "I look into the files with my eyes." is not a valid solution because this is not automatic) (2 points).
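
      One possible approach, sketched in Python: score each file by the proportion of very frequent English function words it contains (the word list here is a small illustrative sample):

      ENGLISH_WORDS = {'the', 'of', 'and', 'to', 'in', 'is', 'that', 'it'}

      def english_score(filename):
          # fraction of tokens that are common English function words
          with open(filename, encoding='utf-8') as f:
              tokens = f.read().lower().split()
          if not tokens:
              return 0
          return sum(token in ENGLISH_WORDS for token in tokens) / len(tokens)

      print(max(['a', 'b', 'c'], key=english_score))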

    2. Assume that Rudolf simply runs the code you submit for homework on his computer without looking into the code. Why is that a bad idea? What could happen? Show why this is a bad idea by inventing a short part of code you could have submitted as homework. (2 points)

    3. Assume you have a text file with one sentence on each line. Print only sentences that have exactly four words (2 points).
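
      A possible sketch (assuming the file is called sentences.txt and a "word" is a whitespace-separated token):

      with open('sentences.txt', encoding='utf-8') as f:
          for line in f:
              if len(line.split()) == 4:
                  print(line, end='')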

    4. In NLP, we often lowercase all data, so that e.g. "And" (e.g. at the start of a sentence) and "and" (inside a sentence) are treated the same way. Why might this not be the best idea? What problems could we have because of that? What could be a better approach? (Don't write code, just explain this briefly with your own words.) (1 point)

Homework assignments

  • There will be 10 homework assignments.
  • For each assignment, you will get points, up to a given maximum (the maximum is specified with each assignment).
    • If your submission is especially good, you can get extra points (up to +10% of the maximum).
  • All assignments will have a fixed deadline (usually 10 days after they are assigned).
  • If you submit the assignment after the deadline, you will get:
    • up to 50% of the maximum points if it is less than 2 weeks after the deadline;
    • 0 points if it is more than 2 weeks after the deadline.
  • Once we check the submitted assignments, you will see the points you got and our comments.
  • To be allowed to take the test (which is required to pass the course), you need to get at least 50% of the total points from the assignments.

Test

Grading

Your grade is based on your average performance on the test and the homework assignments, weighted 1:1.

  • ≥ 90%: grade 1 (excellent)
  • ≥ 70%: grade 2 (very good)
  • ≥ 50%: grade 3 (good)
  • < 50%: grade 4 (fail)

For example, if you get 600 out of 1000 points for homework assignments (60%) and 36 out of 40 points for the test (90%), your total performance is 75% and you get a 2.

No cheating

  • Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty.
  • Discussing homework assignments with your classmates is OK. Sharing code is not OK (unless explicitly allowed); by default, you must complete the assignments yourself.
  • All students involved in cheating will be punished. E.g. if you share your assignment with a friend, both you and your friend will be punished.