First Python exercises (simple language modelling)
-
Create a string containing the first chapter of genesis.
Print out first 40 characters.
str[from:to] # from is inclusive, to is exclusive
Print out 4th to 6th character 1-based (=3rd to 5th 0-based)
Check the length of the result using len().
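A minimal sketch of the slicing above, with a short placeholder string standing in for the Genesis text:

```python
# Placeholder string; the exercise uses the first chapter of Genesis.
text = "In the beginning God created the heaven and the earth."

print(text[:40])   # first 40 characters

# 4th to 6th character, 1-based (= indices 3 to 5, 0-based):
sub = text[3:6]
print(sub)         # the
print(len(sub))    # 3
```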
-
Split the string into tokens (use str.split(); see ?str.split for help).
Print out first 10 tokens. (List slice behaves similarly to substring.)
Print out last 10 tokens.
Print out 11th to 18th token.
Check the length of the result using len().
Just printing a list slice is fine; also see ?str.join
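A sketch of the token slicing, again on a placeholder text (any string with at least 18 tokens behaves the same way):

```python
text = ("In the beginning God created the heaven and the earth. "
        "And the earth was without form, and void.")
tokens = text.split()        # splits on any whitespace

print(tokens[:10])           # first 10 tokens
print(tokens[-10:])          # last 10 tokens
part = tokens[10:18]         # 11th to 18th token, 1-based
print(" ".join(part))        # join gives readable output
print(len(part))             # 8
```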
-
Compute the unigram counts into a dictionary.
# Built-in dict (need to explicitly initialize keys):
unigrams = {}
# The Python way is to use the foreach-style loops;
# and horizontal formatting matters!
for token in tokens:
    # do something
# defaultdict supports auto-initialization:
from collections import defaultdict
# int = values for non-set keys initialized to 0:
unigrams = defaultdict(int)
# Even easier:
from collections import Counter
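The three variants side by side, on a toy token list (all of them produce the same counts):

```python
from collections import Counter, defaultdict

tokens = "the cat and the dog and the bird".split()

# 1) Plain dict: keys must be initialized explicitly.
unigrams = {}
for token in tokens:
    unigrams[token] = unigrams.get(token, 0) + 1

# 2) defaultdict(int): missing keys start at 0.
unigrams_dd = defaultdict(int)
for token in tokens:
    unigrams_dd[token] += 1

# 3) Counter counts an iterable directly.
unigrams_c = Counter(tokens)

print(unigrams_c["the"])   # 3
```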
-
Print out most frequent unigram.
max(something)
max(something, key=function_to_get_key)
# getting value stored under a key in a dict:
unigrams[key]
unigrams.get(key)
Or use Counter.most_common()
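Both ways of getting the most frequent unigram, assuming the counts sit in a Counter:

```python
from collections import Counter

unigrams = Counter("the cat and the dog and the bird".split())

# max() iterates over the keys; key= ranks them by their counts:
best = max(unigrams, key=unigrams.get)
print(best, unigrams[best])        # the 3

# Counter has this built in:
print(unigrams.most_common(1))     # [('the', 3)]
```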
-
Print out the unigrams sorted by count.
Use sorted() -- behaves similarly to max()
Or use Counter.most_common()
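sorted() takes the same key= argument as max(); reverse=True gives descending counts:

```python
from collections import Counter

unigrams = Counter("the cat and the dog and the bird".split())

for token in sorted(unigrams, key=unigrams.get, reverse=True):
    print(token, unigrams[token])

# Equivalent, already sorted by count:
print(unigrams.most_common())
```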
-
Get unigrams with count > 5; can be done with list comprehension:
[token for token in unigrams if unigrams[token] > 5]
-
Count bigrams in the text into a dict of Counters
bigrams = defaultdict(Counter)
bigrams[first][second] += 1
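A runnable sketch of the bigram counting; zip(tokens, tokens[1:]) pairs each token with its successor:

```python
from collections import defaultdict, Counter

tokens = "the cat saw the dog and the cat ran".split()

# bigrams[first][second] = how often `second` follows `first`
bigrams = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    bigrams[first][second] += 1

print(bigrams["the"])   # Counter({'cat': 2, 'dog': 1})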
-
For each unigram with count > 5, print it together with its most frequent successor.
[(token, something) for …]
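One possible shape of that comprehension; the > 5 threshold needs a text with frequent tokens, hence the repeated toy sentence:

```python
from collections import defaultdict, Counter

tokens = ("the cat and the dog and the cat and the bird " * 3).split()
unigrams = Counter(tokens)
bigrams = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    bigrams[first][second] += 1

pairs = [(token, bigrams[token].most_common(1)[0][0])
         for token in unigrams if unigrams[token] > 5]
print(pairs)   # [('the', 'cat'), ('cat', 'and'), ('and', 'the')]
```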
-
Print the successor together with its relative frequency rounded to 2 decimal digits.
max(), sum(), dict.values(), round(number, ndigits)
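Those four building blocks combined for one token's successor Counter (the counts here are made up for the example):

```python
from collections import Counter

successors = Counter({"cat": 6, "dog": 3, "bird": 3})  # followers of "the"

best = max(successors, key=successors.get)
rel_freq = successors[best] / sum(successors.values())
print(best, round(rel_freq, 2))   # cat 0.5
```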
-
Print a random token. Print a random unigram disregarding their distribution.
import random
?random.choice
list(dict.keys())
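random.choice needs a sequence, so the dict keys have to be converted to a list first:

```python
import random

tokens = "the cat and the dog and the bird".split()
unigrams = {"the": 3, "cat": 1, "and": 2, "dog": 1, "bird": 1}

# Picking from the token list respects the original frequencies:
print(random.choice(tokens))

# Picking from the keys is uniform over distinct unigrams,
# i.e. it disregards the distribution:
print(random.choice(list(unigrams.keys())))
```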
-
Pick a random word, generate a string of 20 words by always picking the most frequent follower.
range(10)
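A greedy-generation sketch; note it assumes every token has at least one recorded successor (true for this toy text, where the final token also occurs mid-sentence):

```python
import random
from collections import defaultdict, Counter

tokens = "the cat saw the dog and the cat ran to the dog".split()
bigrams = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    bigrams[first][second] += 1

word = random.choice(tokens)
generated = [word]
for _ in range(19):       # 19 more steps -> 20 words in total
    # assumes `word` has at least one recorded successor
    word = bigrams[word].most_common(1)[0][0]
    generated.append(word)
print(" ".join(generated))
```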
-
Put that into a function, with the number of words to be generated as a parameter.
Return the result in a list.
list.append(item)
def function_name(parameter_name=default):
    # do something
    return 123
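The previous exercise wrapped in a function with a default parameter (the function and parameter names are just illustrative):

```python
import random
from collections import defaultdict, Counter

def generate_greedy(tokens, n_words=20):
    """Start at a random token, then always append the most
    frequent follower; return the words as a list."""
    bigrams = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        bigrams[first][second] += 1
    result = [random.choice(tokens)]
    while len(result) < n_words and bigrams[result[-1]]:
        result.append(bigrams[result[-1]].most_common(1)[0][0])
    return result

tokens = "the cat saw the dog and the cat ran to the dog".split()
print(generate_greedy(tokens, 10))
```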
-
Sample the next word according to the bigram distribution
import numpy as np
?np.random.choice
np.random.choice(list, p=list_of_probs)
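Sampling one successor with np.random.choice; the probabilities are the normalized bigram counts:

```python
import numpy as np
from collections import Counter

successors = Counter({"cat": 6, "dog": 3, "bird": 3})  # followers of "the"

words = list(successors.keys())
counts = np.array([successors[w] for w in words], dtype=float)
probs = counts / counts.sum()          # must sum to 1

next_word = np.random.choice(words, p=probs)
print(next_word)
```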
Extensions (homework)
As a homework, implement the 1st extension and at least two other extensions. You can get bonus points if you implement more extensions.
-
make a file generate.py that generates the text according to exercise 13 (running python3 generate.py should print out a sequence of 20 words sampled from the bigram distribution computed from the text)
-
load the text from a file, e.g.:
file = open("filename", "r")
text = file.read()
file.close()
or e.g. (cleaner):
with open("filename", "r") as file:
    text = file.read()
-
take configuration (input file, N) from command line
import sys
n = sys.argv[1]
- do a better tokenization, lowercase...
- detect sentence boundaries, generate a sentence starting at a sentence start and ending at a sentence end (but max N tokens)
- save (and load) the unigram and bigram counts into a file (use pickle)
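A pickle round-trip sketch (the file name counts.pkl is just a placeholder):

```python
import pickle
from collections import Counter

unigrams = Counter("the cat and the dog and the bird".split())

# Save the counts:
with open("counts.pkl", "wb") as f:
    pickle.dump(unigrams, f)

# Load them back:
with open("counts.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded == unigrams)   # True
```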
Review of Python types
- int
- a = 1
- float
- a = 1.0
- bool
- a = True
- str
- a = "1 2 3"
- a = '1 2 3'
- list
- a = [1, 2, 3]
- dict
- a = {"a": 1, "b": 2, "c": 3}
- tuple (something like a fixed-length immutable list)
- a = (1, 2, 3)
Solution of the exercises
A sample solution to exercises 1 to 13 can be found in solution_1.py