SIS code: 
Semester: 
summer
E-credits: 
3
Instructor: 
Sylvie Archaimbault, Martin Hájek, Jana Plaňavová Latanowicz

Data Analytics for Students of Social Studies and Humanities

Aim of the course

We encourage students to use data in their projects.

This course is a gentle, programming-free combination of lectures and practical demonstrations of real-life data workflows in various Social Studies and Humanities (SSH) research areas. It aims to motivate SSH students to further improve their digital literacy in more advanced data analytics courses. The curriculum is a joint effort of Charles University (CU), University of Warsaw (UW), and Sorbonne University (SU).

This course does not require any prior data analysis or computer science experience. All you need to get started is basic computer literacy.

You will learn how to tell data stories and captivate your future audiences with Tableau Public, how to use the Transkribus and Pero systems to digitize historical documents, and how to annotate texts in TEITOK. We will acquaint you with André Mazon's digitized correspondence archive and with the migrant stories published at i am a migrant.


We cordially invite you to the 2023 workshop.

 

Calendar 2022/23

No. Date Topic Teaching materials
1. Feb 14  Introduction (CU)
    -- course organization, motivation, outline
    -- basic terminology
npfl134-lec-1.pdf
2. Feb 21 Collection of André Mazon's correspondence I (SU)
    -- Mazon’s correspondence
    -- digitization
npfl134-lec-2-part-1.pdf
npfl134-lec-2-part-2.pdf
3. Feb 28 Beginner's guide to data analysis with Google Sheets (CU)
    -- Titanic dataset
    -- pivot tables, box plots, histograms
    -- missing values, duplicates
npfl134-lec-3.pdf
    -- Titanic dataset in Google Sheets (url)
    -- Titanic train.csv at Kaggle (url)
4. Mar 7 Collection of André Mazon's correspondence II
    -- analysis of metadata using the Tableau system

Lecture slides: npfl134_lec-4-TableauMazonMetadata.pdf
Three Orchids data: three_orchids.csv
The resulting viz: https://public.tableau.com/app/profile/silvie.cinkova/viz/Three_orchids/Dashboard1?publish=yes

Mazon metadata
   --  as a Google spreadsheet: https://drive.google.com/file/d/1a6jtVCWw-k_1lAzdQI8PPQqkNkjnPOZO/view?usp=sharing
   --  as an Excel spreadsheet: mazon_gps.csv
   --  Silvie's visualizations of Mazon from the lecture slides: https://public.tableau.com/app/profile/silvie.cinkova/viz/2023-03-07Mazon/Sheet5?publish=yes

Tableau tutorials in a mind map: https://www.orgpad.com/s/RVO0h1pEGYd
   --  Create your Tableau account here: https://public.tableau.com/app/discover?authMode=signUp
   --  Titanic train.csv with survival as Boolean values: Titanic_train_Kaggle_boolean.csv

5. Mar 14 Collection of André Mazon's correspondence III
    -- analysis of letters (images and transcriptions)
    -- Optical Character Recognition, Handwritten Text Recognition
    -- Transkribus and Pero systems

Presentation about OCR and HTR, slides: npfl134-lec-5-transkribus.pdf
Sign up on the Transkribus home page (https://readcoop.eu/transkribus/), then download and install the client (https://readcoop.eu/transkribus/download/).
By the time of the lecture, you should have received an e-mail with data and instructions for the homework - if not, let us know.

6. Mar 21 Introduction to the Universal Dependencies framework & Corpus Linguistics for Information Extraction

- Presentation slides (PDF): UD_infoextr_handouts_big.pdf
- Strudel paper: https://onlinelibrary.wiley.com/doi/10.1111/j.1551-6709.2009.01068.x
7. Mar 28 Collection of André Mazon's correspondence IV
    -- annotating data
    -- linguistic processing using the UDPipe and NameTag tools
    -- searching and querying data in TEITOK

- Introduction slides to the class

- Follow the search examples in two corpora in TEITOK:
   -- Mazon collection
   -- Migrant stories
lecture video (2021/22)

8. Apr 4 Quantitative textual analysis in Sociology
   -- Migrant stories 

lecture-2023-04-04-hajek.pdf
lecture video (2021/22)

9. Apr 11 Quantitative textual analysis in Sociology
   -- Computer-assisted qualitative data analysis software
   -- reQual tool

lecture-2023-04-11-hajek.pdf
lecture video

10. Apr 18 Network analysis of Migrant Stories
    -- visualization in Gephi, part I

- lecture video
- Download this Gephi file with the guided example and load it into your Gephi installation:

migrants_country_or_de.gephi

Presentation in html: https://cunicz-my.sharepoint.com/:u:/g/personal/50243070_cuni_cz/EZsPFYz...

To view speaker's notes, put your cursor on the presentation in your web browser and press s.

11. Apr 25 Network analysis of Migrant Stories
    -- visualization in Gephi, part II
- lecture video
12. May 2 Introduction to Machine Learning

npfl134-lec-12.pdf
- lecture video

13. May 9 Student presentations

 

14. May 16 Sharing data in repositories

npfl134-lec-14.pdf
- lecture video from 2021/22 

 

Literature

  1. Brett, M.R. Topic Modeling: A Basic Introduction. The Journal of Digital Humanities 2(1): 12-16. 2012. on-line
  2. Corrales Compagnucci, Marcelo. Big Data, Databases and "Ownership" Rights in the Cloud. https://doi.org/10.1007/978-981-15-0349-8. 2020.
  3. Erjavec, T., Ogrodniczuk, M., Osenova, P. et al. The ParlaMint corpora of parliamentary proceedings. Lang Resources & Evaluation (2022). https://doi.org/10.1007/s10579-021-09574-0
  4. Foster, Ian, Ghani, Rayid, Jarmin, R.S., Kreuter, F. and Lane, J. (ed.). Big Data and Social Science: A Practical Guide to Methods and Tools (Chapman & Hall/CRC Statistics in the Social and Behavioral Sciences). 2017.
  5. Hladká Barbora, Holub Martin: A Gentle Introduction to Machine Learning for Natural Language Processing: How to start in 16 practical steps. In: Language and Linguistics Compass, vol. 9, No. 2, pp. 55-76, 2015.
  6. Jurafsky, Dan, Martin, James H. Speech and Language Processing. 2021. url
  7. Piotrowski, Michael. Natural Language Processing for Historical Texts. Morgan & Claypool Publishers. 2012. pdf
  8. Glossary of common terms used in the course: url

Acknowledgement

By courtesy of DataCamp, you will receive six months of access to their e-learning materials. These will help you master Tableau Public to the level you wish.

The dataset of André Mazon's correspondence is available for the course's activities based on the Partnership Agreement between the Center of Slavic Studies (Sorbonne University) and the Institute of Formal and Applied Linguistics (Charles University).

This course is funded by the 4EU+ Alliance under grant agreement No. 2021_F3_10; for details, visit this site.