Winter Semester
The course is taught under the code NPFL112 at FSV UK, U Kříže 8, Prague 5 – Jinonice, Building C, Room C420, on Fridays from 9:30 to 10:50. The course is open to everyone, but priority is given to students enrolled in the Certificate in Digital Humanities program. The course capacity is 22 students. Heads up if you need credits at the Faculty of Arts: in the winter semester, this course does not count as AMLV00046, because it lacks the statistics lectures by V. Cvrček!
Summer Semester
The course is taught on Fridays at MFF UK, Malostranské náměstí 25, under the code NPFL112 (9:00–10:30) and under the code AMLV00046, where it also includes a mandatory statistics lecture by Prof. Václav Cvrček (ÚČNK FFUK) and runs from 9:00 to 12:10. NPFL112 and AMLV00046 can differ in credits and grades; please check SIS if this matters to you.
The humanities have seen an irreversible paradigm shift towards Digital Humanities, based on automatic quantitative analysis of (big) data. This trend has even reached scholarly fields such as history and literary studies, not to mention linguistics and translation studies with their notable tradition of corpus-driven and quantitative methods. Beyond research, data science is widely used in journalism, public administration, and consulting. Competence in data science can hence give you a competitive advantage on the labor market.
We will teach you:
- to clean and structure data into neat tables;
- to discover trends, recurring patterns, and outliers;
- to apply the basics of modern data visualization.
We use the open-source programming language R along with the RStudio IDE and the tidyverse, a widely used collection of data science packages. To explain the concepts and functions, we mostly use data sets that ship with R and its packages (mtcars, diamonds, and iris), and later on we present a case study. The case study uses linguistic data by default, but we gladly tailor this part of the course to interesting data sets and tasks brought by course students (with reasonable notice).
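To give a flavor of this workflow, here is a minimal sketch using the mtcars data set, assuming the dplyr and ggplot2 packages are installed. The variable names `mpg_by_cyl` and `weight_plot` are chosen for illustration only and do not come from the course materials.

```r
# Minimal sketch: aggregate and plot the mtcars data set that ships with R.
library(dplyr)    # data wrangling verbs and the %>% pipe
library(ggplot2)  # grammar-of-graphics plotting

# Average fuel economy (miles per gallon) for each cylinder count.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), cars = n())

print(mpg_by_cyl)

# Scatter plot of car weight against fuel economy, coloured by cylinder count.
weight_plot <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(weight_plot)
```

The first half produces one summary row per cylinder count; the second half maps three variables of the same table to position and colour.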
DataCamp, a respected MOOC hub for data science, offers our students complimentary access to premium content for the whole term.
Prerequisites: English, basic computer literacy, frustration tolerance, and discipline for regular homework. No programming skills are required.
Grade requirements:
active participation in all lessons (exceptions are up to the teachers), timely submission of homework
The course is formally concluded with an examination, but there is no final test. Instead, your grade depends on which of the following requirements you fulfill:
Grade C: 30,000 DataCamp XP, active participation (or equivalent: each absence increases your passing limit by 1,000 DataCamp XP), one home assignment submitted on time and approved by the teacher.
Grade B: 30,000 DataCamp XP, active participation (or equivalent: each absence increases your passing limit by 1,000 DataCamp XP), two home assignments submitted on time and approved by the teacher.
Grade A: 30,000 DataCamp XP, active participation (or equivalent: each absence increases your passing limit by 1,000 DataCamp XP), three home assignments submitted on time and approved by the teacher.
Only DataCamp XP that you acquire in the DataCamp courses listed for home assignments, and during your current term, count towards your limit. If you have already completed these courses in the past, you must agree on an alternative list of DataCamp courses with the teacher in advance.
Your free DataCamp license is valid for six months from the start of the course and cannot be extended. You must complete your assignments within that period. No alternative assignments can be negotiated.
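The grading arithmetic can be sketched in R as follows. This is only an illustration of one reading of the rules above, not an official calculator: the function names `required_xp` and `course_grade` are hypothetical, and the sketch assumes that fewer than one approved assignment, or XP below the adjusted limit, means no passing grade.

```r
# Hypothetical helper: XP limit after accounting for absences
# (30,000 XP base, plus 1,000 XP for each absence).
required_xp <- function(absences, base_xp = 30000, xp_per_absence = 1000) {
  stopifnot(absences >= 0)
  base_xp + absences * xp_per_absence
}

# Hypothetical helper: grade implied by the rules, or NA if they are not met.
# Assumption: the grade depends only on the number of approved assignments
# once the XP limit is reached.
course_grade <- function(xp, absences, approved_assignments) {
  if (xp < required_xp(absences) || approved_assignments < 1) {
    return(NA_character_)
  }
  if (approved_assignments >= 3) {
    "A"
  } else if (approved_assignments == 2) {
    "B"
  } else {
    "C"
  }
}

required_xp(2)             # two absences raise the limit to 32000
course_grade(32500, 2, 2)  # "B"
course_grade(31000, 2, 3)  # NA: the limit of 32000 XP is not met
```

For example, a student with two absences needs 32,000 XP, regardless of which grade they are aiming for.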
Syllabus
- Getting to know RStudio. Essential concepts. Data science as a subdomain of programming.
- Packages/libraries, functions, arguments, and parameters.
- Selected data structures: vector, factor, data frame, table, tibble, list, matrix.
- Reporting in R Markdown.
- Data aggregation.
- Visual grammar in the ggplot2 plotting library.
- Visual data exploration: variable types and combinations, appropriate plots and mapping to aesthetic scales.
- Handling overplotting.
- Smoothing in ggplot2.
- Statistical transformation objects ("stat_xxx") and their interaction with the geometrical objects ("geom_xxx").
- The Tidy Data concept.
- Data wrangling: the essential functions of dplyr and tidyr for table transformations.
- Operations on strings (the stringr library).
- Import and export of diverse file formats and objects.
- Case study.
Teaching materials
https://ufal.github.io/R_BEGINNERS_SHORT/ (website currently shaped as a summer school)
References:
Hadley Wickham and Garrett Grolemund. 2017. R for Data Science. O'Reilly. free online: http://r4ds.had.co.nz/
Garrett Grolemund. 2014. Hands-On Programming with R. O'Reilly.
Nina Zumel and John Mount. 2014. Practical Data Science with R. Manning.
Julia Silge and David Robinson. 2017. Text Mining with R. A tidy approach. O'Reilly.
Stefan Th. Gries. 2013. Statistics for Linguistics with R. A practical introduction. De Gruyter.
Stefan Th. Gries. 2009. Quantitative Corpus Linguistics with R. Routledge.
Matthew L. Jockers. 2014. Text Analysis with R for Students of Literature. Springer.
Natalia Levshina. 2015. How to do Linguistics with R. Data exploration and statistical analysis. John Benjamins.
Simon Munzert, Christian Rubba, Peter Meissner, and Dominic Nyhuis. 2015. Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. Wiley.



