Making sense of language is no easy task, all the more so because we have such an intimate experience of it: the trees that we regularly encounter in everyday communication can make it hard to see the forest. In this situation, computers can help us take a step back and look at large quantities of text from a fifty-thousand-foot view. [1]
Say you're interested in a particular text -- the Bible -- and the patterns of occurrence of some of its large cast of notoriously well-known characters. Combing through the entire Bible and figuring this out manually, while possible, is a daunting task. Fortunately, you can use a programming language like Python to do that for you in a few lines of code:
    import nltk
    from nltk import Text, word_tokenize
    from requests import get

    # word_tokenize needs NLTK's Punkt tokenizer data (a one-time download);
    # recent NLTK releases name this resource "punkt_tab", older ones "punkt"
    nltk.download("punkt")
    nltk.download("punkt_tab")

    # download the raw text of the Bible
    raw_text = get("http://www.gutenberg.org/cache/epub/10/pg10.txt").text
    # split it into tokens (~ words)
    tokens = word_tokenize(raw_text)
    # create a Text object out of the tokens
    text = Text(tokens)
    # take a look at the occurrences of some words throughout the Bible
    text.dispersion_plot(["God", "Jesus", "Adam", "Eve", "Moses", "Job", "Noah"])
An invaluable side effect of using a programming language is that it makes your analyses easily reproducible. If anyone wonders how you came up with the plot shown above, the step-by-step recipe is right there: the source data you used, the transformations you performed, the outputs you generated. And the best part is, if you notice a flaw or something that can be improved, you can fix just that part, leave everything else as is, and re-run the analysis with no additional effort.
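To make the "fix just that part and re-run" idea concrete, here is a minimal sketch of an analysis split into small, replaceable steps. It is not the code from the example above: the function names (`simple_tokenize`, `lowercase_tokenize`, `word_offsets`, `analyze`) and the sample sentence are made up for illustration, and the regex-based tokenizer is only a stand-in for NLTK's `word_tokenize`. The word positions it computes are roughly the kind of data a dispersion plot draws.

```python
import re
from typing import Callable


def simple_tokenize(raw_text: str) -> list[str]:
    """Stand-in tokenizer (hypothetical): every run of word characters is a token."""
    return re.findall(r"\w+", raw_text)


def lowercase_tokenize(raw_text: str) -> list[str]:
    """An 'improved' tokenization step that additionally ignores case."""
    return [token.lower() for token in simple_tokenize(raw_text)]


def word_offsets(tokens: list[str], words: list[str]) -> dict[str, list[int]]:
    """For each word of interest, collect the token positions where it occurs."""
    offsets: dict[str, list[int]] = {word: [] for word in words}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


def analyze(
    raw_text: str,
    words: list[str],
    tokenize: Callable[[str], list[str]] = simple_tokenize,
) -> dict[str, list[int]]:
    """The whole recipe: source text -> tokens -> positions of the chosen words."""
    return word_offsets(tokenize(raw_text), words)


# A tiny made-up sample instead of the full downloaded text
sample = (
    "In the beginning God created the heaven and the earth. "
    "And GOD said, Let there be light."
)

# First run: case-sensitive matching misses the all-caps "GOD"
first_run = analyze(sample, ["God", "light"])

# Fix only the tokenization step; everything else is re-run unchanged
second_run = analyze(sample, ["god", "light"], tokenize=lowercase_tokenize)

print(first_run)
print(second_run)
```

The first run misses the all-caps "GOD"; swapping in a different tokenizer fixes that without touching the counting step or the rest of the recipe, which is the kind of targeted repair that is hard to repeat reliably by hand.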
True, some tasks may be more easily achieved using special-purpose point-and-click software. But nothing beats a full-fledged programming language where flexibility and reproducibility are concerned.
In this course, you will learn:
Recommended resources (no need to study these in advance though):
[1] Just don't completely forget about the trees afterwards; they're important too!