Currently, (deep) neural networks are widely used to solve Natural Language Processing (NLP) tasks. In these networks, language units (e.g. words) are represented by vectors of real numbers, so-called word embeddings. For many tasks (e.g. machine translation), we assume that the embeddings capture the meaning of words and other information useful for solving the task. Contextual representations, obtained from pre-trained models, are currently growing in importance.
In this project, we will focus on examining the information contained in vector representations (mainly word and contextual embeddings) with the help of independent and principal component analysis (ICA, PCA). We show what semantic and syntactic information is captured in embeddings, and how it differs across NLP models, in both Czech and English.
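As a minimal sketch of the analysis described above, the snippet below decomposes an embedding matrix with PCA and ICA using scikit-learn. The random matrix is a stand-in: in the actual project, word or contextual embeddings (e.g. from word2vec or a pre-trained language model) would be loaded instead, and the resulting components would be inspected for semantic and syntactic content.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Stand-in for a real embedding matrix: 1000 "words", 300-dimensional vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 300))

# PCA: directions of maximal variance in the embedding space.
pca = PCA(n_components=10)
pca_coords = pca.fit_transform(embeddings)

# ICA: statistically independent directions, often more interpretable.
ica = FastICA(n_components=10, random_state=0, max_iter=1000)
ica_coords = ica.fit_transform(embeddings)

# Each word is now described by 10 principal / independent component scores.
print(pca_coords.shape, ica_coords.shape)
```

One would then examine which words score highest on each component, looking for coherent semantic or syntactic groupings.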