Under construction.
A pragmatic approach to morphological analysis and tagging that combines a limited amount of low-cost, high-impact manual resources with automatically acquired resources.
Intro
Morphological analysis, tagging and lemmatization are essential for many Natural Language Processing (NLP) applications of both practical and theoretical nature.
Modern taggers and analyzers are very accurate. However, the standard way to create them for a
particular language requires a substantial amount of expertise, time and money. A tagger is usually trained
on a large corpus (100,000+ words) annotated with the correct tags. Morphological analyzers
usually rely on large manually created lexicons. For example, the Czech analyzer (Hajic 2004) uses
a lexicon with 300,000+ entries. As a result, most of the world's languages and dialects have no realistic
prospect of getting morphological taggers or analyzers created in this way.
We have been developing a method for creating morphological taggers and analyzers for fusional languages
without the need for large-scale knowledge- and labor-intensive resources
for the target language. Instead, we rely on (i) resources available for a related language and (ii) a limited amount of
high-impact, low-cost manually created resources. This greatly reduces cost, time requirements and the need for (language-specific) linguistic expertise.
Bibliography
See my bibliography list, esp.:
Sources, manuals