The humanities have seen an irreversible paradigm shift towards Digital Humanities, which rely on automatic quantitative analysis of (big) data. This trend has reached even scholarly fields such as history and literary studies, not to mention linguistics and translation studies with their notable tradition of corpus-driven and quantitative methods. Beyond research, data science is widely used in journalism, public administration, and consulting. Competence in data science can hence give you a competitive advantage on the labor market.
We will teach you:  
- to clean and structure data into neat tables;
- to discover trends, recurring patterns, and outliers;
- to apply the basics of modern data visualization.

We use the open-source programming language R along with the RStudio IDE and the tidyverse, a globally popular collection of professional data science tools. We mostly use data sets bundled with R and its packages (mtcars, diamonds, and iris) to explain the concepts and functions, and later present a case study. The case study uses linguistic data by default, but we gladly tailor this part of the course to interesting data sets and tasks supplied by students (with reasonable notice).
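
To give a flavor of the code the course works toward, here is a minimal sketch, assuming the dplyr and ggplot2 packages are installed. The object names (`mpg_by_cyl`, `p`) are illustrative only and are not taken from the course materials.

```r
library(dplyr)
library(ggplot2)

# mtcars ships with base R: one row per car model.
# Aggregate: number of cars and mean fuel economy per cylinder count.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n_cars = n(), mean_mpg = mean(mpg))

print(mpg_by_cyl)

# Visualize: weight against fuel economy, coloured by cylinder count.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)
```

The same pattern of grouping, summarising, and plotting carries over to other tabular data, including linguistic data sets.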

Depending on their current pricing and promotion policy, DataCamp, a respected MOOC hub for data science, may offer our students complimentary access to premium content for the whole term.

Prerequisites: English, basic computer literacy, frustration tolerance, and discipline for regular homework. No programming skills are required.
Grade requirements: active participation in all lessons (exceptions are up to the teachers), timely submission of homework, and thorough preparation for discussions of selected readings (3–4 papers per term).

Syllabus:
- Getting to know RStudio. Essential concepts. Data science as a subdomain of programming.
- Packages/libraries, functions, arguments, and parameters.
- Selected data structures: vector, factor, data frame, table, tibble, list, matrix.
- Reporting in R Markdown.
- Data aggregation.
- Visual grammar in the ggplot2 plotting library.
- Visual data exploration: variable types and combinations, appropriate plots and mapping to aesthetic scales.
- Handling overplotting.
- Smoothing in ggplot2.
- Statistical transformation objects ("stat_xxx") and their interaction with the geometrical objects ("geom_xxx").
- The Tidy Data concept.
- Data wrangling: the essential functions of dplyr and tidyr for table transformations.
- Operations on strings (the stringr library).
- Import and export of diverse file formats and objects.
- Case study.
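
Several of the syllabus items above (the Tidy Data concept, table transformations with dplyr and tidyr, and string operations with stringr) can be sketched together in a few lines. This is only a sketch under stated assumptions: the word counts are invented toy data, and the names `counts_wide`, `counts_tidy`, `text_a`, and `text_b` are hypothetical.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Invented toy data: counts of four word forms in two hypothetical texts,
# stored in "wide" format (one column per text).
counts_wide <- tibble(
  word   = c("The", "the", "Corpus", "data"),
  text_a = c(3, 10, 2, 5),
  text_b = c(1, 8, 0, 7)
)

counts_tidy <- counts_wide %>%
  # Tidy the table: one row per (word, text) observation.
  pivot_longer(cols = c(text_a, text_b), names_to = "text", values_to = "n") %>%
  # Normalize case so that "The" and "the" count as the same word.
  mutate(word = str_to_lower(word)) %>%
  # Aggregate the counts of the merged word forms.
  group_by(text, word) %>%
  summarise(n = sum(n), .groups = "drop")

print(counts_tidy)
```

In the tidy result every row is one observation and every column one variable, which is the shape that dplyr and ggplot2 expect.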

References:

Hadley Wickham and Garrett Grolemund. 2017. R for Data Science. O'Reilly. Free online: http://r4ds.had.co.nz/
Garrett Grolemund. 2014. Hands-On Programming with R. O'Reilly.
Nina Zumel and John Mount. 2014. Practical Data Science with R. Manning.
Julia Silge and David Robinson. 2017. Text Mining with R. A Tidy Approach. O'Reilly.
Stefan Th. Gries. 2013. Statistics for Linguistics with R. A practical introduction. De Gruyter.
Stefan Th. Gries. 2009. Quantitative Corpus Linguistics with R. Routledge.
Matthew L. Jockers. 2014. Text Analysis with R for Students of Literature. Springer.
Natalia Levshina. 2015. How to do Linguistics with R. Data exploration and statistical analysis. John Benjamins.
Simon Munzert, Christian Rubba, Peter Meissner, and Dominic Nyhuis. 2015. Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. Wiley.

 

The course usually takes place in the summer term, on Friday mornings (9-12), in the SU1 lab at Malostranské náměstí 25.