Winter Semester
The course is taught under the code NPFL112 at FSV UK, U Kříže 8, Prague 5 – Jinonice, on Fridays from 9:30 to 10:50, Building C, Room C420. The course is open to everyone, but priority is given to students enrolled in the Certificate in Digital Humanities program. The course capacity is 22 students. Heads up if you need credits at the Faculty of Arts: in the winter semester, this course does not map to AMLV00046, because it lacks the statistics lectures by V. Cvrček!
Summer Semester
The course is taught on Fridays at MFF UK, Malostranské náměstí 25, under the code NPFL112 (9:00–10:30) and under the code AMLV00046, where it also includes a mandatory statistics lecture by Prof. Václav Cvrček (ÚČNK FFUK) and runs from 9:00 to 12:10. NPFL112 and AMLV00046 can differ in credits and grades; please check SIS if this matters to you.
Course annotation
The humanities have seen an irreversible paradigm shift towards Digital Humanities, based on automatic quantitative analysis of (big) data. This trend has even reached scholarly fields such as history and literary studies, not to mention linguistics and translatology with their notable tradition of corpus-driven and quantitative methods. Beyond research, data science is widely used in journalism, public administration, and consulting. Competence in data science can therefore give you a competitive advantage on the labor market.
We will teach you:
- to clean and structure data into neat tables;
- to discover trends, recurring patterns, and outliers;
- basics of modern data visualization.
We use the open-source programming language R along with the RStudio IDE and tidyverse, a globally popular collection of data science tools. To explain concepts and functions, we mostly use the data sets mtcars, diamonds, and iris, which ship with R and its packages; later on, we present a case study. The case study uses linguistic data by default, but we gladly tailor this part of the course to interesting data sets and tasks brought by course students (with reasonable notice).
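As a rough illustration of a first look at these data sets, here is a minimal sketch. It uses base R only so that it runs without any packages installed; that is an assumption made for this example, since the course itself works with tidyverse. mtcars and iris ship with R, while diamonds comes with the ggplot2 package and is therefore left out. The variable name mpg_by_cyl is made up for the illustration.

```r
# mtcars and iris ship with base R (package "datasets"); no library() call needed.
str(mtcars)          # structure: 32 cars, 11 numeric variables
head(iris, 3)        # first three rows of the iris measurements

# Average fuel economy (miles per gallon) per number of cylinders.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)

# A rough tidyverse equivalent (assumes the dplyr package is installed):
# mtcars %>% group_by(cyl) %>% summarise(mpg = mean(mpg))
```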
DataCamp, a respected MOOC hub for data science, offers our students complimentary access to premium content for the whole term.
Prerequisites: English, basic computer literacy, frustration tolerance, and discipline for regular homework. No programming skills are required.
Grade requirements:
Active participation in all lessons (exceptions are at the teachers' discretion) and timely submission of homework.
The course is completed with an examination, but there is no final test. Instead, your grade depends on which of the following requirements you fulfill:
Grade C: 30,000 DataCamp XP, active participation (or equivalent: each absence increases your passing limit by 1,000 DataCamp XP), one home assignment submitted on time and approved by the teacher.
Grade B: 30,000 DataCamp XP, active participation (or equivalent: each absence increases your passing limit by 1,000 DataCamp XP), two home assignments submitted on time and approved by the teacher.
Grade A: 30,000 DataCamp XP, active participation (or equivalent: each absence increases your passing limit by 1,000 DataCamp XP), three home assignments submitted on time and approved by the teacher.
Only DataCamp XP that you earn in the DataCamp courses listed for home assignments, and in your current term, count towards your limit. If you have already completed these courses in the past, you must negotiate an alternative list of DataCamp courses with the teacher in advance.
Your free DataCamp license is valid for six months from the start of the course and cannot be extended. You must complete your assignments within that period. No alternative assignments can be negotiated.
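The grading rules above can be restated as a small calculation. The following is a sketch only, not an official grading tool: the function name course_grade and its arguments are hypothetical, and it assumes that every absence simply adds 1,000 XP to the 30,000 XP limit and that a student below the limit or without an approved assignment gets no passing grade (returned as NA).

```r
# Hypothetical helper that only restates the rules above.
base_xp <- 30000        # XP required for any passing grade
xp_per_absence <- 1000  # extra XP required per absence

course_grade <- function(xp, absences, approved_assignments) {
  required_xp <- base_xp + absences * xp_per_absence
  if (xp < required_xp || approved_assignments < 1) {
    return(NA_character_)  # requirements for a passing grade not met
  }
  # 1 approved assignment -> C, 2 -> B, 3 or more -> A
  c("C", "B", "A")[min(approved_assignments, 3)]
}

course_grade(xp = 32000, absences = 2, approved_assignments = 2)  # "B"
course_grade(xp = 31000, absences = 2, approved_assignments = 3)  # NA: 32,000 XP needed
```

For example, two absences raise the limit to 32,000 XP, so 31,000 XP is not enough even with three approved assignments.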
Syllabus
- Getting to know RStudio. Essential concepts. Data science as a subdomain of programming.
- Packages/libraries, functions, arguments, and parameters.
- Selected data structures: vector, factor, data frame, table, tibble, list, matrix.
- Reporting in R Markdown.
- Data aggregation.
- Visual grammar in the ggplot2 plotting library.
- Visual data exploration: variable types and combinations, appropriate plots and mapping to aesthetic scales.
- Handling overplotting.
- Smoothing in ggplot2.
- Statistical transformation objects ("stat_xxx") and their interaction with the geometrical objects ("geom_xxx").
- The Tidy Data concept.
- Data wrangling: the essential functions of dplyr and tidyr for table transformations.
- Operations on strings (the stringr library).
- Import and export of diverse file formats and objects.
- Case study.
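To give a flavor of the data structures named in the syllabus, here is a minimal base R sketch. All object names and values (word_lengths, pos, tokens, bundle, m) are invented for this illustration. The tibble is left out because it requires the tibble package, which this example does not assume to be installed.

```r
# A vector holds values of a single type.
word_lengths <- c(3, 5, 2, 8)

# A factor stores categorical values together with their set of levels.
pos <- factor(c("noun", "verb", "noun", "adj"))
levels(pos)                 # "adj" "noun" "verb" (sorted alphabetically)

# table() counts how often each level occurs.
table(pos)

# A data frame combines equally long columns of possibly different types.
tokens <- data.frame(length = word_lengths, pos = pos)

# A list can hold objects of any type and size.
bundle <- list(lengths = word_lengths, tokens = tokens)

# A matrix is a two-dimensional arrangement of values of a single type.
m <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns, filled column by column
dim(m)                      # 2 3
```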
Presentations
https://github.com/ufal/NPFL112
Course website with session plans and homework assignments for the current term: https://ufal.github.io/NPFL112/
RStudio on the Jupyter Lab cloud
We use a cloud instance of the RStudio IDE running here: https://aic.ufal.mff.cuni.cz/jlab/hub/login. To get a user name and password, you must be a registered student of this course. You will receive your credentials by e-mail from the teacher at the start of the term (please keep an eye on your spam folder, too). You can work with this RStudio in any common web browser.
How to...
- Change your password at AIC Jupyter Lab
- Launch RStudio right after you log in at AIC UFAL Jupyter Lab
- Copy a directory or file from a teacher's directory to your directory
- Make a directory your working directory (you may need this especially when file paths do not work for you)
- Install a software package (aka library)
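The last two items can also be done from the R console. The following is a minimal sketch under stated assumptions: the directory name npfl112_demo is hypothetical and is created in a temporary location, and stringr merely serves as an example of a package name. The installation line is commented out so that running the sketch does not trigger a download.

```r
# Create a throwaway directory so the example does not touch your real files.
demo_dir <- file.path(tempdir(), "npfl112_demo")   # hypothetical directory name
dir.create(demo_dir, showWarnings = FALSE)

old_dir <- getwd()   # remember where we started
setwd(demo_dir)      # make demo_dir the working directory
getwd()              # relative file paths are now resolved against demo_dir

# Relative paths now point into demo_dir.
writeLines("hello", "note.txt")
file.exists(file.path(demo_dir, "note.txt"))   # TRUE

setwd(old_dir)       # switch back

# Install a package only if it is missing (needs an internet connection).
# if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")
```

If file paths in a script do not work for you, checking getwd() first usually shows whether the working directory is the one you expected.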
References:
Hadley Wickham and Garrett Grolemund. 2017. R for Data Science. O'Reilly. Free online: http://r4ds.had.co.nz/
Garrett Grolemund. 2014. Hands-On Programming with R. O'Reilly.
Nina Zumel and John Mount. 2014. Practical Data Science with R. Manning.
Julia Silge and David Robinson. 2017. Text Mining with R. A Tidy Approach. O'Reilly.
Stefan Th. Gries. 2013. Statistics for Linguistics with R. A Practical Introduction. De Gruyter.
Stefan Th. Gries. 2009. Quantitative Corpus Linguistics with R. Routledge.
Matthew L. Jockers. 2014. Text Analysis with R for Students of Literature. Springer.
Natalia Levshina. 2015. How to do Linguistics with R. Data Exploration and Statistical Analysis. John Benjamins.
Simon Munzert, Christian Rubba, Peter Meissner, and Dominic Nyhuis. 2015. Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. Wiley.



