First Python exercises (simple language modelling)

  1. Create a string containing the first chapter of Genesis.
    Print out first 40 characters.
    str[from:to]  # from is inclusive, to is exclusive
    Print out 4th to 6th character 1-based (=3rd to 5th 0-based)
    Check the length of the result using len().
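A sketch of these steps, using a short placeholder string where the exercise expects the full chapter text:

```python
# Placeholder standing in for the first chapter of Genesis.
text = "In the beginning God created the heaven and the earth."

print(text[:40])     # first 40 characters
chars = text[3:6]    # 4th to 6th character, 1-based = indices 3..5, 0-based
print(chars)
print(len(chars))    # a [3:6] slice always has 6 - 3 = 3 characters
```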
  2. Split the string into tokens (use str.split(); see ?str.split for help).
    Print out first 10 tokens. (List slicing behaves similarly to substring slicing.)
    Print out last 10 tokens.
    Print out 11th to 18th token.
    Check the length of the result using len().
    Just printing a list slice is fine; also see ?str.join
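The tokenization steps might look like this, on the same placeholder text:

```python
text = "In the beginning God created the heaven and the earth."
tokens = text.split()          # split on whitespace
print(tokens[:10])             # first 10 tokens
print(tokens[-10:])            # last 10 tokens
print(tokens[10:18])           # 11th to 18th token, 1-based
print(len(tokens[:10]))
print(" ".join(tokens[:10]))   # join tokens back into one string
```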
  3. Compute the unigram counts into a dictionary.
    # Built-in dict (need to explicitly initialize keys):
    unigrams = {}
    # The Pythonic way is to use foreach-style loops;
    # and note that indentation (whitespace) matters!
    for token in tokens:
        # do something
    # defaultdict, supports autoinitialization:
    from collections import defaultdict
    # int = values for non-set keys initialized to 0:
    unigrams = defaultdict(int)
    # Even easier:
    from collections import Counter
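All three variants produce the same counts; a sketch on a toy token list:

```python
from collections import defaultdict, Counter

tokens = "the cat sat on the mat the cat".split()

# Plain dict: initialize each key explicitly before counting.
unigrams = {}
for token in tokens:
    if token not in unigrams:
        unigrams[token] = 0
    unigrams[token] += 1

# defaultdict(int): missing keys start at 0 automatically.
unigrams_dd = defaultdict(int)
for token in tokens:
    unigrams_dd[token] += 1

# Counter: the counting loop built in.
unigrams_c = Counter(tokens)

print(unigrams)
```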
  4. Print out most frequent unigram.
    max(something)
    max(something, key=function_to_get_key)
    
    # getting value stored under a key in a dict:
    unigrams[key]
    unigrams.get(key)
    Or use Counter.most_common()
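For example, with the unigram counts from a toy text:

```python
from collections import Counter

unigrams = Counter("the cat sat on the mat the cat".split())

# max() over the keys alone compares tokens alphabetically;
# the key= function makes it compare by count instead.
best = max(unigrams, key=unigrams.get)
print(best, unigrams[best])

# Counter's shortcut returns (token, count) pairs:
print(unigrams.most_common(1))
```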
  5. Print out the unigrams sorted by count.
    Use sorted() -- behaves similarly to max()
    Or use Counter.most_common()
  6. Get unigrams with count > 5; can be done with list comprehension:
    [token for token in unigrams if unigrams[token] > 5]
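Exercises 5 and 6 in one sketch; the threshold is lowered to > 1 here so the toy text actually produces output:

```python
from collections import Counter

unigrams = Counter("the cat sat on the mat the cat".split())

# sorted() takes the same key= argument as max():
by_count = sorted(unigrams, key=unigrams.get, reverse=True)
print(by_count)
print(unigrams.most_common())    # (token, count) pairs, most frequent first

# Tokens above a frequency threshold, via list comprehension:
frequent = [token for token in unigrams if unigrams[token] > 1]
print(frequent)
```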
  7. Count bigrams in the text into a dict of Counters
    bigrams = defaultdict(Counter)
    bigrams[first][second] += 1
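A sketch of the counting loop; zip(tokens, tokens[1:]) pairs each token with its successor:

```python
from collections import defaultdict, Counter

tokens = "the cat sat on the mat the cat sat".split()

# bigrams[first][second] = how often "first second" occurs
bigrams = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    bigrams[first][second] += 1

print(bigrams["the"])   # all successors of "the" with their counts
```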
  8. For each unigram with count > 5, print it together with its most frequent successor.
    [(token, something) for …]
  9. Print the successor together with its relative frequency rounded to 2 decimal digits.
    max(), sum(), dict.values(), round(number, ndigits)
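Exercises 8 and 9 combined on a toy text (threshold again lowered to > 1): the relative frequency divides the successor's count by the total count of all successors of the token:

```python
from collections import defaultdict, Counter

tokens = "the cat sat on the mat the cat sat".split()
unigrams = Counter(tokens)
bigrams = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    bigrams[first][second] += 1

for token in unigrams:
    if unigrams[token] > 1:
        successor, count = bigrams[token].most_common(1)[0]
        rel = count / sum(bigrams[token].values())
        print(token, successor, round(rel, 2))
```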
  10. Print a random token. Print a random unigram disregarding their distribution.
    import random
    ?random.choice
    list(dict.keys())
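random.choice picks uniformly from a sequence, so choosing from the token list weights tokens by how often they occur, while choosing from the dict's keys treats every distinct unigram equally:

```python
import random

tokens = "the cat sat on the mat the cat".split()
unigrams = {"the": 3, "cat": 2, "sat": 1, "on": 1, "mat": 1}

print(random.choice(tokens))            # "the" is most likely
print(random.choice(list(unigrams)))    # all five types equally likely
```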
  11. Pick a random word, generate a string of 20 words by always picking the most frequent follower.
    range(20)
  12. Put that into a function, with the number of words to be generated as a parameter.
    Return the result in a list.
    list.append(item)
    def function_name(parameter_name=default):
        # do something
        return 123
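Exercises 11 and 12 together: the greedy generator wrapped in a function, assuming the bigram counts from exercise 7; the break guards against a word that was never observed with a successor:

```python
import random
from collections import Counter, defaultdict

tokens = "the cat sat on the mat the cat sat".split()
bigrams = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    bigrams[first][second] += 1

def generate(n=20):
    """Start from a random word, then always pick the most frequent follower."""
    word = random.choice(tokens)
    result = [word]
    for _ in range(n - 1):
        if not bigrams[word]:    # no successor ever observed
            break
        word = bigrams[word].most_common(1)[0][0]
        result.append(word)
    return result

print(" ".join(generate(20)))
```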
  13. Sample the next word according to the bigram distribution
    import numpy as np
    ?np.random.choice
    np.random.choice(list, p=list_of_probs)
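A sketch of the sampling step: the successor counts are normalized into probabilities for np.random.choice:

```python
import numpy as np
from collections import Counter, defaultdict

tokens = "the cat sat on the mat the cat sat".split()
bigrams = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    bigrams[first][second] += 1

def sample_next(word):
    """Sample a successor of `word` proportionally to its bigram counts."""
    successors = list(bigrams[word])
    counts = np.array([bigrams[word][s] for s in successors], dtype=float)
    return np.random.choice(successors, p=counts / counts.sum())

print(sample_next("the"))
```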

Extensions (homework)

As homework, implement the 1st extension and at least two other extensions. You can get bonus points if you implement more.
  1. make a file generate.py that generates the text according to exercise 13 (running python3 generate.py should print out a sequence of 20 words sampled from the bigram distribution computed from the text)
  2. load the text from a file, e.g.:
    file = open("filename", "r")
    text = file.read()
    file.close()
    or e.g. (cleaner):
    with open("filename", "r") as file:
        text = file.read()
  3. take configuration (input file, N) from the command line
    import sys
    n = int(sys.argv[1])  # command-line arguments are strings, so convert
  4. do a better tokenization, lowercase...
  5. detect sentence boundaries, generate a sentence starting at a sentence start and ending at a sentence end (but max N tokens)
  6. save (and load) the unigram and bigram counts into a file (use pickle)
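For extension 6, pickle can serialize the count dictionaries directly; a sketch using a hypothetical filename counts.pickle:

```python
import pickle
from collections import Counter

unigrams = Counter("the cat sat on the mat the cat".split())

# Save the counts to a file...
with open("counts.pickle", "wb") as f:
    pickle.dump(unigrams, f)

# ...and load them back.
with open("counts.pickle", "rb") as f:
    loaded = pickle.load(f)

print(loaded == unigrams)
```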

Review of Python types

int
a = 1
float
a = 1.0
bool
a = True
str
a = "1 2 3"
a = '1 2 3'
list
a = [1, 2, 3]
dict
a = {"a": 1, "b": 2, "c": 3}
tuple (something like a fixed-length immutable list)
a = (1, 2, 3)
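One difference the list above doesn't show: lists and dicts are mutable, while strings and tuples are not. A quick check:

```python
a = [1, 2, 3]
a.append(4)          # lists can be changed in place
print(a)

d = {"a": 1}
d["b"] = 2           # dicts too
print(d)

t = (1, 2, 3)
try:
    t[0] = 99        # tuples cannot be modified
except TypeError:
    print("tuples are immutable")
```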

Solution of the exercises

A sample solution to the exercises 1 to 13 can be found in solution_1.py