NPFL092 - sample questions for the final written test

Note: the test is not limited to the following list; however, all test questions will come from the areas illustrated below.

Basic survival in Linux (or rather in Bash)

Name and describe at least two options for each of the following commands in bash: ls, sort, cut, iconv, grep (1 point).
Give examples of what the .bashrc file can be used for (1 point).
Explain how command line pipelining works (1 point).
Create a bash script that counts the total number of words in all *txt files in all subdirectories of the current directory (2 points).
You created a new file called doit.sh and wrote some Bash commands into it, e.g.:
```
 echo "ls -t | head -n 5 | cat -n" > doit.sh 
```
How do you run it now? (1 point)
What do you think the following command does?
```
 ls -t | head -n 5 | cat -n 
```
How would you check what it really does (without running it)? (1 point)

Character encoding

Explain the notions "character set" and "character encoding" (1 point).
Explain the main properties of ASCII (1 point).
Which 8-bit encodings do you know for Czech or other European languages (or your native language)? Name at least three. How do they differ from ASCII? (1 point)
What is Unicode and what Unicode encodings do you know? (1 point)
Explain the relation between UTF-8 and ASCII. (1 point)
How can you detect the encoding of a file? (1 point)
You have three files containing identical Czech text. One of them is encoded using the ISO charset, one of them uses UTF-8, and one uses UTF-16. How can you tell which is which? (1 point)
How would you proceed if you were asked to read a file encoded in ISO-8859-1, add a line number to each line and store the result in UTF-8? (A source code snippet in your favourite programming language is expected here.) (2 points)
Name three Unicode encodings (1 point).
Explain the size difference between a file containing a text in Czech (or in your native language) stored in an 8-bit encoding and the same file stored in UTF-8. (1 point)
How do you convert a file from one encoding to another, for instance from a non-UTF-8 encoding to UTF-8? (1 point)
Write a Python script that reads text encoded in ISO-8859-2 from STDIN and prints it to STDOUT in UTF-8. (2 points)
Explain what BOM is (in the context of file encoding). (1 point)
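
As background for the questions on file size, the UTF-8/ASCII relation and the BOM, the following minimal sketch compares the byte lengths of one string in three encodings. The sample text is an arbitrary illustration, not part of the test, and the sketch assumes Python's "utf-16" codec, which prepends a BOM when encoding.

```python
# Illustrative sample; any text with accented Czech letters would do.
text = "Příliš žluťoučký kůň"

latin2 = text.encode("iso-8859-2")  # 8-bit encoding: one byte per character
utf8 = text.encode("utf-8")         # each accented Czech letter takes two bytes
utf16 = text.encode("utf-16")       # two bytes per character here, plus a 2-byte BOM

print(len(text), len(latin2), len(utf8), len(utf16))

# Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
assert "kun".encode("ascii") == "kun".encode("utf-8")

# Decoding with the matching codec restores the original string.
assert latin2.decode("iso-8859-2") == utf8.decode("utf-8") == text
```

The same bytes decoded with the wrong codec would either raise an error or produce garbled characters, which is one practical way of telling the encodings apart.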

Text-processing in Bash

Using the Bash command line, get all lines from a file that contain one or two digits, followed by a dot or a space. (1 point)
Using the Bash command line, remove all punctuation from a given file. (1 point)
Using the Bash command line, split text from a given file into words, so that there is one word on each line. (1 point)
Using the Bash command line, download a webpage from a given URL and print the frequency list of opening HTML tags contained in the page. (2 points)
Using the Bash command line, print out the first 5 lines of each file (in the current directory) whose name starts with "abc". (2 points)
Using the Bash command line, find the most frequent word in a text file. (2 points)
Assume you have some linguistically analyzed text in a tab-separated file (TSV). You are just interested in the word form, which is in the second column, and the part-of-speech tag, which is in the fourth column. How do you extract only this information from the file using the Bash command line? (2 points)
Create a Makefile with three targets. The "download" target downloads the webpage nic.nikde.eu into a file, the "show" target prints out the file, and the "clean" target deletes the file. (2 points)
Create a Makefile with two targets. When the first target is called, a web page is downloaded from a given URL. When the second target is called, the number of HTML paragraphs (<p> elements) contained in the file is printed. (2 points)
Suppose there is a plain-text file containing an English text. Write a Bash pipeline of commands which prints the frequency list of 50 most frequent tokens contained in the text. (Simplification: it is sufficient to use only whitespace characters as token separators) (2 points).
Assume you have some linguistic data in a text file. However, some lines are comments (these lines start with a "#" sign) and some lines are empty, and you are not interested in those. How do you get only the non-empty, non-comment lines using the Bash command line? (2 points)
Assume you have some linguistically analyzed text in a comma-separated file (CSV). The first column is the token index -- for regular tokens, this is simply a natural number (e.g. 1 or 128), for multiword tokens this is a number range (e.g. 5-8), and for empty tokens it is a decimal number (e.g. 6.1). How do you get only the lines that contain a regular token? (2 points)

Explain the following bash code:

grep . table.txt | rev | cut -f2,3 | rev

(1 point)

Create a Bash script that reads an English text from STDIN and prints only interrogative sentences extracted from the text to STDOUT, one sentence per line (simplification: let's suppose that sentences can be ended only by full stops and question marks). (2 points)
Write a Bash script that returns a word-bigram frequency "table" (in the tab-separated format) for its input (2 points).
Write a Bash script that returns a letter-bigram frequency "table" (in the tab-separated format) for its input (2 points).

Git

Name 4 Git commands and briefly explain what each of them does (a few words or a short sentence for each command) (1 point).
Assume you already are in a local clone of a remote Git repository. Create a new file called "a.txt" with the text "This is a file.", and do everything that is necessary so that the file gets into the remote repository (2 points).
Name two advantages of versioning your source code (with Git) versus not versioning it (e.g. just having it in a directory on your laptop) (1 point).
You and your colleague are working together on a project versioned with Git. Line 27 of script.py is empty. You change that line to initialize a variable ("a = 10"), while your colleague changes it to modify another variable ("b += 20"). He is faster than you, so he commits and pushes first. What happens now? Can you push? Can you commit? What do you need to do now? (2 points)
What's probably wrong with the following sequence of commands? What did the author probably want to do? How would you correct it?
```
  echo aaa > a; git add a; git push; git commit -m'creating a'
```
(2 points)
What's probably wrong with the following sequence of commands? What did the author probably want to do? How would you correct it?
```
  echo aaa > a; git commit -m'creating a'; git push
```
(2 points)
What's probably wrong with the following sequence of commands? What did the author probably want to do? How would you correct it?
```
  echo aaa > a; git add a; git push
```
(2 points)

Python basics

What should the first line of a Python script look like? (1 point)
How do you install a Python module? (1 point)
How do you use a Python module in your Python script? (1 point)
What Python data types do you know? What do they represent? (1 point)
In Python, given a string called text, how do you get the following: first character, last character, first 3 characters, last 4 characters, 3rd to 5th character? (2 points)
Write a minimal Python script that prints "Hello NAME", where NAME is given to it as the first command-line argument; include the correct shebang line in the script. (2 points)
In Python, define a function that takes a string, splits it into tokens, and prints out the first N tokens (10 by default). (2 points)
In Python, given a text split into a list of tokens, print out the 10 most frequent tokens. (1 point)
In Python, given a text split into a list of tokens, print out all tokens that have a frequency higher than 5. (1 point)
In Python, given a text split into a list of tokens, print out all tokens that have a frequency above the median. (2 points)
In Python, implement an improved version of wc: write a script that reads in the contents of a file, and prints out the number of characters, whitespace characters, words, lines and empty lines in the file. (2 points)
In Python, assume the variable genesis_text contains a text, with punctuation removed, i.e. there are just words separated by spaces. Print out the most frequent word. (2 points)
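
Several of the frequency questions above can be approached with `collections.Counter` from the standard library. The following is only one possible sketch; the function name `most_frequent` and the sample sentence are made up for illustration.

```python
from collections import Counter


def most_frequent(tokens, n=10):
    """Return the n most frequent tokens together with their counts."""
    return Counter(tokens).most_common(n)


# Hypothetical sample input, split on whitespace.
tokens = "the cat saw the dog and the dog saw the cat".split()

for token, count in most_frequent(tokens, 3):
    print(token, count, sep="\t")
```

A threshold-based variant (e.g. tokens with frequency higher than 5) could iterate over `Counter(tokens).items()` and filter on the count instead of calling `most_common`.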

Simple string processing in Python

Name 5 string methods and explain what they do. (1 point)
Write a piece of code that prints out all numbers in a text (tokens that consist only of digits 0-9) joined by underscores (e.g. "L33t Peter has 5 apples, 123 oranges, an iPhone7 and 6466868 pears." becomes "5_123_6466868") (1 point)
Write a piece of code that replaces all occurrences of "Python" by "vicious snake". (1 point)
Write a piece of code that decides whether a string looks like a name -- one word consisting of an uppercase letter followed by lowercase letters. (1 point)
Write a piece of code that converts all dates in text from the format "nth/nd/rd Month" to "Month n", so e.g. "I was born on 29th January and my sister on 3rd February" becomes "I was born on January 29 and my sister on February 3" (1 point)
Write a piece of code that replaces all words that start with "pwd" by *****. (1 point)
Write a piece of code that converts the "'s" possessive to the "of" possessive, so that e.g. "I like Peter's car the most." becomes "I like car of Peter the most." (1 point)
Write a piece of code that takes a text in which some lines start with an asterisk and a space ("* ") and replaces the asterisks with consecutive ordinal numbers followed by a dot, starting with 1; e.g.:
```
Do not forget to buy:
* cheese
* wine
(just a cheap one)
* some bread
```
becomes:
```
Do not forget to buy:
1. cheese
2. wine
(just a cheap one)
3. some bread
```
(2 points)
Write a Python script that reads an English text from STDIN and prints the same text with 'highlighted' personal pronouns (e.g. by placing them between two asterisks *) (2 points).
Write a Python script that returns a word-bigram frequency table for its input. A text is expected on STDIN and a two column table is expected to be printed on STDOUT (2 points).
Write a Python script that returns a letter-bigram frequency table for its input (2 points).
Suppose you have a file containing a list of first names, one per line. Process another file containing an English text with Python, so that all personal names are shortened just to the initial letter and a dot, if a surname follows the first name. ("John Smith called me yesterday" → "J. Smith called me yesterday") (2 points)
Write a Python script that removes all leading and trailing whitespace from each input line, and replaces all the remaining sequences of whitespace characters with just one space. (2 points)
Create a Python script that reads an English text from STDIN and prints only interrogative sentences extracted from the text to STDOUT (simplification: let's suppose that sentences can be ended only by full stops and question marks). (2 points)
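
For the task of replacing leading asterisks with consecutive ordinal numbers, one possible approach is `re.sub` with a replacement function that draws numbers from a counter. This is a sketch rather than the expected answer; the function name `number_bullets` is an assumption made for the example.

```python
import re
from itertools import count


def number_bullets(text):
    """Replace each line-initial "* " with "1. ", "2. ", ... in order of appearance."""
    counter = count(1)
    # re.MULTILINE makes ^ match at the start of every line, not only of the string.
    return re.sub(
        r"^\* ",
        lambda match: f"{next(counter)}. ",
        text,
        flags=re.MULTILINE,
    )


shopping = "Do not forget to buy:\n* cheese\n* wine\n(just a cheap one)\n* some bread"
print(number_bullets(shopping))
```

Using a callable as the replacement lets each match receive a different string, which a plain replacement string cannot do.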

Python modules, packages, and classes

Create a very simple Python object-oriented tree representation: create a class Node which has attribute children which keeps the list of the node's children, and attribute lemma. There should be a method nodeA.add_child(lemma) which creates a new node (a child of nodeA) labelled with the given lemma. You can disregard any absolute and relative ordering of nodes (2 points).
Explain the differences between the notion of a function and the notion of a method in Python (1 point).
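
A minimal sketch of the tree representation described above might look as follows. Returning the new child from `add_child` is a design choice of this sketch, not something the task requires.

```python
class Node:
    """A tree node labelled with a lemma; the order of children is not significant."""

    def __init__(self, lemma):
        self.lemma = lemma
        self.children = []  # list of child Node objects

    def add_child(self, lemma):
        """Create a new node labelled with the given lemma and attach it as a child."""
        child = Node(lemma)
        self.children.append(child)
        return child


# add_child is a method: it is called on an instance and receives it as self.
root = Node("fly")
subject = root.add_child("time")
root.add_child("arrow")
print(root.lemma, [child.lemma for child in root.children])
```

Returning the child makes it easy to build deeper trees by chaining calls, e.g. `root.add_child("arrow").add_child("an")`.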

Introduction to XML

What is XML? (1 point)
Explain the XML terms 'tag', 'attribute', and 'element'. (1 point)
What is a well-formed XML file? (1 point)
What is a valid XML file? (1 point)
What is DTD? Give a short example (1 point).
What is the difference between XML well-formedness and XML validity? (1 point)
How can you check an XML file's well-formedness? (1 point)
How can you check an XML file's validity? (1 point)
Explain the difference between DOM(-like) and SAX(-like) approaches to processing XML data (1 point).
Modify the following code so that it prints not only tags and attributes of elements directly embedded in the root element, but tags and attributes of all elements in the XML file (i.e., including the root and all deeper elements).
```
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='example.xml')
root = tree.getroot()
for child in root:
	print(child.tag, child.attrib)
```
(2 points)
Create a Python script that reads a simple frequency list from STDIN (tab separated lemma and frequency on each line) and turns it into a simple XML formatted file printed to STDOUT (2 points).
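
For the frequency-list task, one possible sketch builds the document with `xml.etree.ElementTree` so that escaping of special characters is handled by the library. The element and attribute names (`freqlist`, `entry`, `freq`) are assumptions of this example, and the function takes an iterable of lines so that it can be fed `sys.stdin` or a plain list.

```python
import xml.etree.ElementTree as ET


def freqlist_to_xml(lines):
    """Turn tab-separated "lemma<TAB>frequency" lines into an XML string."""
    root = ET.Element("freqlist")  # element names are illustrative assumptions
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        lemma, freq = line.split("\t")
        entry = ET.SubElement(root, "entry", freq=freq)
        entry.text = lemma
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(root, encoding="unicode")


print(freqlist_to_xml(["the\t120\n", "cat\t7\n"]))
```

In a script, the call would be `print(freqlist_to_xml(sys.stdin))` after `import sys`.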

NLTK and other NLP frameworks

What are some advantages of using an existing NLP framework over writing all the code yourself? (1 point)
What are some disadvantages of using an existing NLP framework compared to writing all the code yourself? (1 point)
Name at least 4 things NLTK can do (1 point).
Given a list of tokens, write code that POS-tags the tokens, using NLTK (2 points).
Write a script that reads in English text which has one sentence per line and prints out POS tags for the words (one sentence per line, POS tags separated by spaces), using NLTK (2 points).
Write code using NLTK that takes English text and prints out the POS tag of the sentence-initial words (i.e. for each sentence, only print out the tag of its first word) (1 point).(2 points)
Given a list of tokens, POS-tag them with NLTK and print out a frequency list of the tags (2 points).
Name at least 2 NLP frameworks or framework-like tools, say something about them in 1-2 lines (at least what they are good for) (1 point).

Selected good practices in software development (not only in NLP, not only in Python)

Explain what unit testing is. How is it done in Python? (1 point)

Suggest at least three tests for the module MagicalTagger whose synopsis is shown below:

 from MagicalTagger import MagicalTagger
 tagger = MagicalTagger('English')
 sentence = ['Time','flies','like','an','arrow']
 pos_tags = tagger.tag(sentence)

(2 points)
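
As an illustration of how unit tests are commonly written with Python's standard `unittest` module, the sketch below tests a tagger with the interface from the synopsis. Since MagicalTagger is not available here, `StandInTagger` is a hypothetical placeholder that only mimics that interface; the three tests are examples, not a complete list.

```python
import unittest


class StandInTagger:
    """Hypothetical stand-in so the tests can run; not the real MagicalTagger."""

    def __init__(self, language):
        self.language = language

    def tag(self, tokens):
        return ["X" for _ in tokens]  # dummy tag for every token


class TaggerTest(unittest.TestCase):
    def setUp(self):
        self.tagger = StandInTagger("English")

    def test_one_tag_per_token(self):
        sentence = ["Time", "flies", "like", "an", "arrow"]
        self.assertEqual(len(self.tagger.tag(sentence)), len(sentence))

    def test_empty_sentence(self):
        self.assertEqual(self.tagger.tag([]), [])

    def test_input_not_modified(self):
        sentence = ["Time", "flies"]
        self.tagger.tag(sentence)
        self.assertEqual(sentence, ["Time", "flies"])


# Run the tests explicitly instead of calling unittest.main(), which would exit.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TaggerTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a standalone test file, the last two lines would typically be replaced by `if __name__ == "__main__": unittest.main()`.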

What are crucial properties of a good bug report? (1 point)
What is code profiling? (1 point)
Let's say that you should develop a graph storage Python module (based only on the standard library) for representing co-occurrence of two words within a single sentence (i.e., whether two words appeared in the same sentence in some corpus, or not). You are considering two possible alternative representations: an adjacency list (every word keeps the list of its neighbors in the graph), and an adjacency matrix (a two-dimensional array keeping just ones and zeros). How would you decide which representation is better (for a given NLP application)? (2 points)

General text-processing problem solving

Suppose there are three files, a, b, and c. One of them contains text in English, the other two contain texts in other languages. Try to automatically detect which is the English one (i.e. "I look into the files with my eyes." is not a valid solution because this is not automatic) (2 points).
Assume that Rudolf simply runs the code you submit for homework on his computer without looking into the code. Why is that a bad idea? What could happen? Show why this is a bad idea by inventing a short part of code you could have submitted as homework. (2 points)
Assume you have a text file with one sentence on each line. Print only sentences that have exactly four words (2 points).
In NLP, we often lowercase all data, so that e.g. "And" (e.g. at the start of a sentence) and "and" (inside a sentence) are treated the same way. Why might this not be the best idea? What problems could we have because of that? What could be a better approach? (Don't write code, just explain this briefly in your own words.) (1 point)
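
For the task of printing only the sentences with exactly four words, a minimal sketch could look like this. It assumes that words are simply whitespace-separated strings, so punctuation attached to a word is not counted separately; the function name is made up for the example.

```python
def four_word_sentences(lines):
    """Return the lines that contain exactly four whitespace-separated words."""
    return [line.rstrip("\n") for line in lines if len(line.split()) == 4]


# Hypothetical sample input, one sentence per line.
sample = ["This has four words\n", "Too short\n", "One two three four five\n"]
for sentence in four_word_sentences(sample):
    print(sentence)
```

With a real file, the function can be given the open file object directly, since iterating over a file yields its lines.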