Basic survival in Linux (or rather in Bash)
- Name and describe at least two options for each of the following
commands in bash: ls, sort, cut, iconv, grep (1 point).
- Give examples of what the .bashrc file can be used for (1 point).
- Explain how command line pipelining works (1 point).
- Create a bash script that counts the total number of words in all *txt files in all subdirectories of the current directory (2 points).
- You created a new file called doit.sh and wrote some Bash commands into it, e.g.:
echo "ls -t | head -n 5 | cat -n" > doit.sh
How do you run it now? (1 point)
- What do you think the following command does?
ls -t | head -n 5 | cat -n
How would you check what it really does (without running it)? (1 point)
Character encoding
- Explain the notions "character set" and "character encoding" (1 point).
- Explain the main properties of ASCII (1 point).
- What 8-bit encodings do you know for Czech or other European languages (or your native language)? Name at least three. How do they differ from ASCII? (1 point)
- What is Unicode and what Unicode encodings do you know? (1 point)
- Explain the relation between UTF-8 and ASCII. (1 point)
- How can you detect the encoding of a file? (1 point)
- You have three files containing identical Czech text. One of them is encoded
using the ISO-8859-2 charset, one of them uses UTF-8, and one uses UTF-16. How can
you tell which is which? (1 point)
- How would you read a file encoded in ISO-8859-1, add a line number to each line, and store the result in UTF-8? (A source code snippet in your favourite programming language is expected here.) (2 points)
- Name three Unicode encodings (1 point).
- Explain the size difference between a file containing a text
in Czech (or in your native language) stored in an 8-bit encoding
and the same file stored in UTF-8. (1 point)
- How do you convert a file from one encoding to another, for instance from a non-UTF-8 encoding to UTF-8? (1 point)
- Write a Python script that reads text encoded in ISO-8859-2 from STDIN and prints it to STDOUT in UTF-8. (2 points)
- Explain what BOM is (in the context of file encoding). (1 point)
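One possible way to approach the re-encoding, line-numbering, and file-size questions above is sketched below in Python. The function names `recode` and `number_lines` and the sample sentence are illustrative assumptions, not part of the course materials.

```python
def recode(data: bytes, source: str = "iso-8859-2", target: str = "utf-8") -> bytes:
    """Decode raw bytes from the source encoding and re-encode them in the target one."""
    return data.decode(source).encode(target)


def number_lines(text: str) -> str:
    """Prefix every line with its 1-based line number and a tab."""
    return "".join(
        f"{number}\t{line}\n"
        for number, line in enumerate(text.splitlines(), start=1)
    )


# Hypothetical sample: a short Czech phrase with several accented letters.
sample = "Příliš žluťoučký kůň"
latin2_bytes = sample.encode("iso-8859-2")  # one byte per character
utf8_bytes = recode(latin2_bytes)           # accented letters take two bytes each

# The UTF-8 version is longer because every non-ASCII letter needs more than one byte.
print(len(latin2_bytes), len(utf8_bytes))
print(number_lines("first\nsecond"), end="")
```

In a real script, the bytes would typically come from `sys.stdin.buffer.read()` and go to `sys.stdout.buffer.write()`, so that the locale settings of the terminal do not interfere with the conversion.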
Text-processing in Bash
- Using the Bash command line, get all lines from a file that contain one or two digits, followed by a dot or a space. (1 point)
- Using the Bash command line, remove all punctuation from a given file. (1 point)
- Using the Bash command line, split text from a given file into words, so that there is one word on each line. (1 point)
- Using the Bash command line, download a webpage from a given URL and print the frequency list of opening HTML tags contained in the page. (2 points)
- Using the Bash command line, print out the first 5 lines of each file (in the current directory) whose name starts with "abc". (2 points)
- Using the Bash command line, find the most frequent word in a text file. (2 points)
- Assume you have some linguistically analyzed text in a tab-separated file (TSV). You are just
interested in the word form, which is in the second column, and the
part-of-speech tag, which is in the fourth column. How do you extract only
this information from the file using the Bash command line? (2 points)
- Create a Makefile with three targets. The "download" target downloads the
webpage nic.nikde.eu into a file, the "show" target prints out the file, and
the "clean" target deletes the file. (2 points)
- Create a Makefile with two targets. When the first target is called,
a web page is downloaded from a given URL. When the second
target is called, the number of HTML paragraphs (<p> elements) contained in the
file is printed. (2 points)
- Suppose there is a plain-text file containing an English text.
Write a Bash pipeline of commands which prints the frequency list
of the 50 most frequent tokens contained in the
text. (Simplification: it is sufficient to use only whitespace
characters as token separators) (2 points).
- Assume you have some linguistic data in a text file. However, some lines are
comments (these lines start with a "#" sign) and some lines are empty, and you
are not interested in those. How do you get only the non-empty, non-comment
lines using the Bash command line? (2 points)
- Assume you have some linguistically analyzed text in a comma-separated file
(CSV). The first column is the token index -- for regular tokens, this is
simply a natural number (e.g. 1 or 128), for multiword tokens this is a
number range (e.g. 5-8), and for empty tokens it is a decimal number (e.g.
6.1). How do you get only the lines that contain a regular token? (2 points)
- Explain the following bash code:
grep . table.txt | rev | cut -f2,3 | rev
(1 point)
- Create a Bash script that reads an English text from STDIN and prints only the interrogative sentences extracted from the text to STDOUT, one sentence per line (simplification: let's suppose that sentences can end only with full stops and question marks). (2 points)
- Write a Bash script that returns a word-bigram frequency "table" (in the tab-separated format) for
its input (2 points).
- Write a Bash script that returns a letter-bigram frequency "table" (in the tab-separated format) for
its input (2 points).
Git
- Name 4 Git commands and briefly explain what each of them does (a few words or
a short sentence for each command) (1 point).
- Assume you are already in a local clone of a remote Git repository.
Create a new file called "a.txt" with the text "This is a file.", and do
everything that is necessary so that the file gets into the remote repository (2 points).
- Name two advantages of versioning your source code (with Git) versus not
versioning it (e.g. just having it in a directory on your laptop) (1 point).
- You and your colleague are working together on a project versioned with Git.
Line 27 of script.py is empty. You change that line to initialize a variable
("a = 10"), while your colleague changes it to modify another variable ("b +=
20"). He is faster than you, so he commits and pushes first. What happens
now? Can you push? Can you commit? What do you need to do now? (2 points)
- What's probably wrong with the following sequence of commands? What did the author
probably want to do? How would you correct it?
echo aaa > a; git add a; git push; git commit -m'creating a'
(2 points)
- What's probably wrong with the following sequence of commands? What did the author
probably want to do? How would you correct it?
echo aaa > a; git commit -m'creating a'; git push
(2 points)
- What's probably wrong with the following sequence of commands? What did the author
probably want to do? How would you correct it?
echo aaa > a; git add a; git push
(2 points)
Python basics
- What should the first line of a Python script look like? (1 point)
- How do you install a Python module? (1 point)
- How do you use a Python module in your Python script? (1 point)
- What Python data types do you know? What do they represent? (1 point)
- In Python, given a string called text, how do you get the following: first character, last character, first 3 characters, last 4 characters, 3rd to 5th character? (2 points)
- Write a minimal Python script that prints "Hello NAME", where NAME is given to it as the first command-line argument; include the correct shebang line in the script. (2 points)
- In Python, define a function that takes a string, splits it into tokens, and prints out the first N tokens (10 by default). (2 points)
- In Python, given a text split into a list of tokens, print out the 10 most frequent tokens. (1 point)
- In Python, given a text split into a list of tokens, print out all tokens that have a frequency higher than 5. (1 point)
- In Python, given a text split into a list of tokens, print out all tokens that have a frequency above the median. (2 points)
- In Python, implement an improved version of wc: write a script that reads in the contents of a file, and prints out the number of characters, whitespace characters, words, lines and empty lines in the file. (2 points)
- In Python, assume the variable genesis_text contains a text, with punctuation removed,
i.e. there are just words separated by spaces. Print out the most frequent
word. (2 points)
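The token-frequency questions above can all be approached with `collections.Counter`. The sketch below is one possible solution, not the expected one; the function names and the sample sentence are assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def most_frequent(tokens, n=10):
    """Return the n most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(n)


def more_frequent_than(tokens, threshold=5):
    """Return tokens whose frequency is strictly higher than the threshold."""
    counts = Counter(tokens)
    return [token for token, count in counts.items() if count > threshold]


def above_median(tokens):
    """Return tokens whose frequency is strictly above the median frequency."""
    counts = Counter(tokens)
    if not counts:
        return []
    middle = median(counts.values())
    return [token for token, count in counts.items() if count > middle]


# Hypothetical input: a whitespace-tokenized sentence.
tokens = "to be or not to be that is the question".split()
print(most_frequent(tokens, 2))
print(above_median(tokens))
```

For the most frequent word of a punctuation-free text such as genesis_text, `most_frequent(genesis_text.split(), 1)` would be enough under the same assumptions.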
Simple string processing in Python
- Name 5 string methods and explain what they do. (1 point)
- Write a piece of code that prints out all numbers in a text (tokens that consist only of digits 0-9) joined by underscores (e.g. "L33t Peter has 5 apples, 123 oranges, an iPhone7 and 6466868 pears." becomes "5_123_6466868") (1 point)
- Write a piece of code that replaces all occurrences of "Python" by "vicious snake". (1 point)
- Write a piece of code that decides whether a string looks like a name, i.e. one word consisting of an uppercase letter followed by lowercase letters. (1 point)
- Write a piece of code that converts all dates in text from the format "nth/nd/rd Month" to "Month n", so e.g. "I was born on 29th January and my sister on 3rd February" becomes "I was born on January 29 and my sister on February 3" (1 point)
- Write a piece of code that replaces all words that start with "pwd" by *****. (1 point)
- Write a piece of code that converts the "'s" possessive to the "of" possessive, so that e.g. "I like Peter's car the most." becomes "I like car of Peter the most." (1 point)
- Write a piece of code that takes a text in which some lines start with an asterisk and a space ("* ") and replaces the asterisks with consecutive ordinal numbers followed by a dot, starting with 1; e.g.:
Do not forget to buy:
* cheese
* wine
(just a cheap one)
* some bread
becomes:
Do not forget to buy:
1. cheese
2. wine
(just a cheap one)
3. some bread
(2 points)
- Write a Python script that reads an English text from STDIN and
prints the same text with 'highlighted' personal pronouns (e.g. by
placing them between two asterisks *) (2 points).
- Write a Python script that returns a word-bigram frequency table for
its input. A text is expected on STDIN and a two column table is expected to be printed on STDOUT (2 points).
- Write a Python script that returns a letter-bigram frequency table for
its input (2 points).
- Suppose you have a file containing a list of first names, one
per line. Process another file containing an English text with
Python, so that all personal names are shortened just to the initial
letter and a dot, if a surname follows the first name. ("John
Smith called me yesterday" → "J. Smith called me yesterday") (2 points)
- Write a Python script that removes all leading and trailing
whitespace from each input line, and replaces all the remaining
sequences of whitespace characters with just one space. (2 points)
- Create a Python script that reads an English text from STDIN and prints only the interrogative sentences extracted from the text to STDOUT
(simplification: let's suppose that sentences can end only with full stops and question marks). (2 points)
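Several of the string-processing tasks above come down to a single regular expression or string method. The following sketch shows one possible approach to three of them (numbers joined by underscores, numbering of asterisk lines, whitespace normalization); the function names are assumptions, and other solutions are equally valid.

```python
import itertools
import re


def join_numbers(text):
    """Join all tokens consisting only of digits 0-9 with underscores."""
    # \b keeps digits inside words such as "L33t" or "iPhone7" from matching.
    return "_".join(re.findall(r"\b[0-9]+\b", text))


def number_bullets(text):
    """Replace a leading "* " on each line with consecutive numbers "1. ", "2. ", ..."""
    counter = itertools.count(1)
    return re.sub(
        r"^\* ",
        lambda match: f"{next(counter)}. ",
        text,
        flags=re.MULTILINE,  # let ^ match at the start of every line
    )


def normalize_whitespace(line):
    """Strip leading and trailing whitespace and squeeze inner runs to one space."""
    return " ".join(line.split())


shopping = "Do not forget to buy:\n* cheese\n* wine\n(just a cheap one)\n* some bread"
print(number_bullets(shopping))
```

Note that this simple pattern in `join_numbers` would also split a decimal such as "3.14" into two numbers; whether that matters depends on how "token" is defined for the task.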
Python modules, packages, and classes
- Create a very simple Python object-oriented tree representation: create a class Node which has attribute children which keeps the list of the node's children, and attribute lemma. There should be a method nodeA.add_child(lemma) which creates a new node (a child of nodeA) labelled with the given lemma. You can disregard any absolute and relative ordering of nodes (2 points).
- Explain the differences between the notion of a function and the notion of a method in Python (1 point).
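A minimal sketch of the tree representation described above might look as follows. Returning the new child from `add_child` and the helper `count_nodes` are assumptions added for illustration; the helper is a plain function, in contrast to the method `add_child`, which is defined inside the class and receives the instance as `self`.

```python
class Node:
    """A tree node labelled with a lemma; the order of children is not significant."""

    def __init__(self, lemma):
        self.lemma = lemma
        self.children = []

    def add_child(self, lemma):
        """Method: create a new node with the given lemma as a child of this node."""
        child = Node(lemma)
        self.children.append(child)
        return child  # returned for convenience (an assumption, not required by the task)


def count_nodes(node):
    """Function: count the nodes in the subtree rooted at the given node."""
    return 1 + sum(count_nodes(child) for child in node.children)


# Hypothetical usage with made-up lemmas.
root = Node("like")
flies = root.add_child("fly")
flies.add_child("time")
print(count_nodes(root))
```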
Introduction to XML
- What is XML? (1 point)
- Explain the XML terms 'tag', 'attribute', and 'element'. (1 point)
- What is a well-formed XML file? (1 point)
- What is a valid XML file? (1 point)
- What is DTD? Give a short example (1 point).
- What is the difference between XML well-formedness and XML validity? (1 point)
- How can you check an XML file's well-formedness? (1 point)
- How can you check an XML file's validity? (1 point)
- Explain the difference between DOM(-like) and SAX(-like) approaches to processing XML data (1 point).
- Modify the following code so that it prints not only tags and attributes of elements directly embedded in the root element, but tags and attributes of all elements in the XML file (i.e., including the root and all deeper elements).
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='example.xml')
root = tree.getroot()
for child in root:
    print(child.tag, child.attrib)
(2 points)
- Create a Python script that reads a simple frequency list from STDIN (tab separated lemma and frequency on each line) and turns it into a simple XML formatted file printed to STDOUT (2 points).
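For the two programming tasks above, the standard `xml.etree.ElementTree` module is probably sufficient. The sketch below is one possible approach; the function names and the element and attribute names (`frequency_list`, `entry`, `frequency`) are assumptions, since the tasks do not prescribe an output schema.

```python
import xml.etree.ElementTree as ET


def describe_elements(xml_text):
    """Return (tag, attributes) for the root and all deeper elements, in document order."""
    root = ET.fromstring(xml_text)
    # Element.iter() walks the whole subtree, including the root itself.
    return [(element.tag, dict(element.attrib)) for element in root.iter()]


def frequency_list_to_xml(lines):
    """Turn tab-separated "lemma<TAB>frequency" lines into a simple XML string."""
    root = ET.Element("frequency_list")  # hypothetical element name
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        lemma, frequency = line.split("\t")
        entry = ET.SubElement(root, "entry", frequency=frequency)
        entry.text = lemma
    return ET.tostring(root, encoding="unicode")


print(describe_elements('<a x="1"><b><c y="2"/></b></a>'))
print(frequency_list_to_xml(["dog\t3\n", "cat\t2\n"]))
```

In a script, `frequency_list_to_xml(sys.stdin)` would read the list from STDIN, because a file object yields its lines one by one.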
NLTK and other NLP frameworks
- What are some advantages of using an existing NLP framework over writing all
the code yourself? (1 point)
- What are some disadvantages of using an existing NLP framework compared to writing all
the code yourself? (1 point)
- Name at least 4 things NLTK can do (1 point).
- Given a list of tokens, write code that POS-tags the tokens, using NLTK (2 points).
- Write a script that reads in English text which has one sentence per line and prints out POS tags for the words (one sentence per line, POS tags separated by spaces), using NLTK (2 points).
- Write code using NLTK that takes English text and prints out the POS tag of the sentence-initial words (i.e. for each sentence, only print out the tag of its first word) (1 point).(2 points)
- Given a list of tokens, POS-tag them with NLTK and print out a frequency list of the tags (2 points).
- Name at least 2 NLP frameworks or framework-like tools, say something about them in 1-2 lines (at least what they are good for) (1 point).
Selected good practices in software development (not only in NLP, not only in Python)
- Explain what unit testing is. How is it done in Python? (1 point)
- Suggest at least three tests for the module MagicalTagger, whose synopsis is shown below:
from MagicalTagger import MagicalTagger
tagger = MagicalTagger('English')
sentence = ['Time','flies','like','an','arrow']
pos_tags = tagger.tag(sentence)
(2 points)
- What are crucial properties of a good bug report? (1 point)
- What is code profiling? (1 point)
- Let's say that you should develop a graph storage Python module (based only on the standard library) for representing the co-occurrence of two words within a single sentence (i.e., whether two words appeared in the same sentence in some corpus, or not). You are considering two alternative representations: an adjacency list (every word keeps the list of its neighbors in the graph), and an incidence matrix (a two-dimensional array keeping just ones and zeros). How would you decide which representation is better (for a given NLP application)? (2 points)
General text-processing problem solving
- Suppose there are three files, a, b, and c. One of them contains text in
English, the other two contain texts in other languages. Try to automatically
detect which is the English one (i.e. "I look into the files with my eyes." is
not a valid solution because this is not automatic) (2 points).
- Assume that Rudolf simply runs the code you submit for homework on his computer
without looking into the code. Why is that a bad idea? What could happen? Show
why this is a bad idea by inventing a short part of code you could have
submitted as homework. (2 points)
- Assume you have a text file with one sentence on each line. Print only sentences
that have exactly four words (2 points).
- In NLP, we often lowercase all data, so that e.g. "And" (e.g. at the start of a
sentence) and "and" (inside a sentence) are treated the same way. Why might
this not be the best idea? What problems could we have because of that? What
could be a better approach? (Don't write code, just explain this briefly in
your own words.) (1 point)
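Two of the problems above (finding the English file and filtering four-word sentences) can be sketched in a few lines of Python. This is only one naive approach under stated assumptions: the tiny stopword list is made up for illustration, words are taken to be whitespace-separated, and the function names are hypothetical.

```python
# A tiny, hand-picked list of frequent English words (an assumption; a real
# solution would likely use a longer list or character n-gram statistics).
ENGLISH_STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "that", "it", "was"}


def english_score(text):
    """Return the share of tokens that are frequent English words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in ENGLISH_STOPWORDS for token in tokens) / len(tokens)


def guess_english(texts):
    """Given a dict mapping file names to texts, return the most English-looking name."""
    return max(texts, key=lambda name: english_score(texts[name]))


def four_word_sentences(lines):
    """Keep only the lines (sentences) that consist of exactly four words."""
    return [line for line in lines if len(line.split()) == 4]


# Hypothetical contents of the three files a, b, and c.
texts = {
    "a": "der Hund ist in dem Haus",
    "b": "the dog is in the house",
    "c": "le chien est dans la maison",
}
print(guess_english(texts))
print(four_word_sentences(["I like green tea", "So do I", "Tea is very good indeed"]))
```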