A pragmatic approach to morphological analysis and tagging that combines a limited amount of low-cost, high-impact manual resources with resources acquired automatically.
Morphological analysis, tagging and lemmatization are essential for many Natural Language Processing (NLP) applications, both practical and theoretical. Modern taggers and analyzers are very accurate. However, the standard way to create them for a particular language requires a substantial amount of expertise, time and money. A tagger is usually trained on a large corpus (100,000 words or more) annotated with the correct tags. Morphological analyzers usually rely on large manually created lexicons. For example, the Czech analyzer (Hajic 2004) uses a lexicon with 300,000+ entries. As a result, most of the world's languages and dialects have no realistic prospect of getting morphological taggers or analyzers created in this way.
We have been developing a method for creating morphological taggers and analyzers of fusional languages without the need for large-scale knowledge- and labor-intensive resources for the target language. Instead, we rely on (i) resources available for a related language and (ii) a limited amount of high-impact, low-cost manually created resources. This greatly reduces cost, time requirements and the need for (language-specific) linguistic expertise.
See this bibliography list, especially: