Note: There are no homework assignments this year (2017/2018); instead, invest some extra energy into the project. You were, however, supposed to send me pseudocode for an inflectional morphological analyzer during the semester.

 

Use the Linguistica program to generate signatures (roughly, paradigms) from plain text. You can use a text of your choice, or you can process Švejk (taken from the Prague Municipal Library, converted to UTF-8).

  1. You need Python 3.4. If you do not have it, I recommend using the Anaconda distribution.
  2. Install Linguistica from GitHub; you do not need the graphical user interface.
  3. Download the corpus:  
    wget https://ufal.mff.cuni.cz/~hana/2016/docs/svejk_1_a_2.txt
  4. Run it:  
    python3 -m linguistica cli
  5. Tell it to analyze your text:
    Path to your file: svejk_1_a_2.txt
  6. For all questions, accept the defaults (you can play with them later)
  7. The corpus is analyzed and the result is saved to the lxa_outputs directory
  8. Write a half-page to one-page report discussing the results: what is surprisingly (in)correct, what was missed, what is linguistically wrong but technically fine, etc. Full prose is not required; bullet points are enough, just so that you have something in hand when we discuss it in class.
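
To make the notion of a signature concrete while reading the output, here is a toy sketch in Python. It is not Linguistica's algorithm, only a naive illustration under simplifying assumptions: every word is cut at every position, stems are grouped by the set of suffixes they occur with, and a suffix set shared by at least two stems is reported as a signature. The function name `toy_signatures` and its parameters (`min_stem`, `max_suffix`, `min_stems`) are made up for this example.

```python
from collections import defaultdict


def toy_signatures(words, min_stem=3, max_suffix=3, min_stems=2):
    """Group stems by the set of suffixes they occur with (toy version).

    Returns a dict: frozenset of suffixes -> set of stems.
    The empty suffix is written as "NULL".
    """
    vocabulary = set(words)

    # stem -> set of suffixes seen with that stem
    stem_to_suffixes = defaultdict(set)
    for word in vocabulary:
        # Try every cut that leaves a stem of at least min_stem characters.
        for cut in range(min_stem, len(word) + 1):
            stem, suffix = word[:cut], word[cut:]
            if len(suffix) <= max_suffix:
                stem_to_suffixes[stem].add(suffix or "NULL")

    # suffix set (candidate signature) -> stems that take exactly this set
    signature_to_stems = defaultdict(set)
    for stem, suffixes in stem_to_suffixes.items():
        if len(suffixes) >= 2:  # a single suffix does not make a paradigm
            signature_to_stems[frozenset(suffixes)].add(stem)

    # Keep only signatures supported by enough stems.
    return {sig: stems for sig, stems in signature_to_stems.items()
            if len(stems) >= min_stems}


if __name__ == "__main__":
    forms = ["žena", "ženy", "ženu", "škola", "školy", "školu"]
    for signature, stems in toy_signatures(forms).items():
        print(sorted(signature), sorted(stems))
```

On these six forms the sketch groups the stems žen and škol under the signature {a, u, y}. The competing cut ško + la/ly/lu is discarded only because no second stem shares it, which is the kind of "technically fine, linguistically wrong" decision worth looking for in the real output.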

An alternative homework (requires my approval)

  • Write a skeleton of a morphological analyzer for an inflectional language.
  • Use your favorite language (preferably Java, Python, or C++).
  • Input: a word
  • Output: lemma candidates with tag candidates 
  • Be sure to specify all the necessary data structures (for storing paradigms, lexical entries, etc.)
  • Do not parse any grammar specification; simply specify one or two examples directly in code (e.g., instantiate two paradigms)
  • Ignore efficiency
  • Use standard data structures such as Maps/Hashtables, Multimaps, Sets, Lists, ...
  • Write basic comments
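
The bullets above could be sketched in Python roughly as follows. This is only one possible shape, not a reference solution: the class names `Paradigm` and `Analyzer`, the dot-separated tags such as `N.F.SG.NOM`, and the helper `build_example_analyzer` are all invented for the example, and the two Czech paradigms (žena, město) are deliberately incomplete.

```python
from collections import defaultdict


class Paradigm:
    """One inflectional pattern: which ending expresses which tags."""

    def __init__(self, name, lemma_ending, endings):
        self.name = name
        self.lemma_ending = lemma_ending  # ending that turns a stem into the lemma
        self.endings = endings            # dict: ending -> list of tags


class Analyzer:
    def __init__(self):
        self.paradigms = {}               # paradigm name -> Paradigm
        self.lexicon = defaultdict(set)   # multimap: stem -> paradigm names

    def add_paradigm(self, paradigm):
        self.paradigms[paradigm.name] = paradigm

    def add_stem(self, stem, paradigm_name):
        self.lexicon[stem].add(paradigm_name)

    def analyze(self, word):
        """Return a sorted list of (lemma, tag) candidates for a word form."""
        analyses = set()
        # Try every stem/ending split, including the empty ending.
        for cut in range(len(word) + 1):
            stem, ending = word[:cut], word[cut:]
            for name in self.lexicon.get(stem, ()):
                paradigm = self.paradigms[name]
                for tag in paradigm.endings.get(ending, ()):
                    analyses.add((stem + paradigm.lemma_ending, tag))
        return sorted(analyses)


def build_example_analyzer():
    """Instantiate two partial paradigms and two lexical entries in code."""
    analyzer = Analyzer()
    analyzer.add_paradigm(Paradigm("žena", "a", {
        "a": ["N.F.SG.NOM"],
        "y": ["N.F.SG.GEN", "N.F.PL.NOM", "N.F.PL.ACC"],
        "u": ["N.F.SG.ACC"],
        "": ["N.F.PL.GEN"],
    }))
    analyzer.add_paradigm(Paradigm("město", "o", {
        "o": ["N.N.SG.NOM", "N.N.SG.ACC"],
        "a": ["N.N.SG.GEN", "N.N.PL.NOM", "N.N.PL.ACC"],
        "u": ["N.N.SG.DAT", "N.N.SG.LOC"],
        "": ["N.N.PL.GEN"],
    }))
    analyzer.add_stem("žen", "žena")
    analyzer.add_stem("měst", "město")
    return analyzer


if __name__ == "__main__":
    analyzer = build_example_analyzer()
    for form in ["ženy", "města", "žen", "xyz"]:
        print(form, analyzer.analyze(form))
```

The lexicon is a multimap because one stem may belong to several paradigms, and the analyses are collected in a set so that duplicate (lemma, tag) pairs reached through different paths appear only once. Efficiency is ignored on purpose: every split of the word is tried.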