A Gentle Introduction to Natural Language Processing and Corpus Linguistics

Making sense of language is no easy task, all the more so because we experience it so intimately: the trees that we encounter in everyday communication can make it hard to see the forest. Computers can help us take a step back and look at large quantities of text from a fifty-thousand-foot view.¹

Say you're interested in a particular text -- the Bible -- and in where and how often some of its many well-known characters are mentioned. Combing through the entire Bible to work this out by hand, while possible, is a daunting task. Fortunately, a programming language like Python can do it for you in a few lines of code:

In [1]:

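import nltk
from nltk import Text, word_tokenize
from requests import get

# fetch the tokenizer model that word_tokenize needs (one-time download;
# recent NLTK releases may ask for "punkt_tab" instead of "punkt")
nltk.download("punkt")

# download the raw text of the Bible and stop early if the request failed
response = get("http://www.gutenberg.org/cache/epub/10/pg10.txt")
response.raise_for_status()
raw_text = response.text
# split it into tokens (~ words)
tokens = word_tokenize(raw_text)
# create a Text object out of the tokens
text = Text(tokens)
# plot where in the Bible each of these words occurs (requires matplotlib)
text.dispersion_plot(["God", "Jesus", "Adam", "Eve", "Moses", "Job", "Noah"])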

An invaluable side effect of using a programming language is that your analyses become easily reproducible. If anyone wonders how you came up with the plot shown above, the step-by-step recipe is right there: the source data you used, the transformations you performed, the outputs you generated. Best of all, if you notice a flaw or something that can be improved, you can fix just that part, leave everything else as is, and re-run the analysis with no additional effort.
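To make the recipe idea a little more concrete, here is a minimal sketch of the kind of data a dispersion plot is built from: for each word of interest, the token positions at which it occurs. This is not how NLTK does it internally. The function name `word_offsets`, the crude regular-expression tokenizer, and the short sample string are all assumptions made for illustration, and only the Python standard library is used.

```python
import re


def word_offsets(text, targets):
    """Map each target word to the token positions where it occurs."""
    # crude tokenizer for illustration only: runs of letters and apostrophes
    tokens = re.findall(r"[A-Za-z']+", text)
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets, len(tokens)


# a short sample standing in for the full text
sample = (
    "In the beginning God created the heaven and the earth. "
    "And God said, Let there be light."
)
offsets, n_tokens = word_offsets(sample, ["God", "light", "Moses"])

# one line per word: how often it occurs, and where
for word, positions in offsets.items():
    print(word, len(positions), positions)
```

If the tokenizer turns out to be too crude, you can swap in a better one and re-run everything else unchanged, which is exactly the kind of targeted fix described above.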

True, some tasks may be more easily achieved using special-purpose point-and-click software. But nothing beats a full-fledged programming language where flexibility and reproducibility are concerned.

In this course, you will learn:

the basics of the Python programming language, which will help you automate the tedious, repetitive parts of analyzing textual data, and empower you to learn more about your material in less time and with less effort
how to access NLP (natural language processing) tools from Python, which will help you gain a better understanding of your texts, e.g. by automatically adding part-of-speech labels
how to quickly explore existing (often carefully curated and richly annotated) large collections of texts (corpora) through convenient graphical user interfaces, in order to gain instant insights, generate hypotheses, and speedily answer questions that don't require the full flexibility of a programming language

Recommended resources (no need to study these in advance though):

Jupyter Notebook -- a powerful data analysis environment for Python
Natural Language Toolkit a.k.a. NLTK -- a Python library for linguists which comes with a great introductory textbook on Python and working with language data

¹ Just don't forget about the trees entirely; they're important too!

V4Py

A V4&DARIAH-CEH Summer Python school
