feat -- Flexible Error Annotation Tool

Under construction.

A pragmatic approach to morphological analysis and tagging that combines a limited amount of low-cost, high-impact manual resources with resources acquired automatically.

Intro

Morphological analysis, tagging and lemmatization are essential for many Natural Language Processing (NLP) applications, both practical and theoretical. Modern taggers and analyzers are very accurate. However, the standard way to create them for a particular language requires a substantial amount of expertise, time and money. A tagger is usually trained on a large corpus (100,000+ words) annotated with the correct tags. Morphological analyzers usually rely on large manually created lexicons. For example, the Czech analyzer (Hajic 2004) uses a lexicon with 300,000+ entries. As a result, most of the world's languages and dialects have no realistic prospect of getting morphological taggers or analyzers created in this way.

We have been developing a method for creating morphological taggers and analyzers of fusional languages without the need for large-scale knowledge- and labor-intensive resources for the target language. Instead, we rely on (i) resources available for a related language and (ii) a limited amount of high-impact, low-cost manually created resources. This greatly reduces cost, time requirements and the need for (language-specific) linguistic expertise.

Bibliography

See my bibliography list, especially:

Hana & Feldman (2012): Resource-light approaches to computational morphology. Part I: Monolingual Approaches.
Hana et al. (2012): Building a Corpus of Old Czech.
Feldman & Hana (2010): A resource-light approach to morpho-syntactic tagging.
Hana et al. (2006): Tagging Portuguese with a Spanish Tagger Using Cognates.
Hana, Feldman & Brew (2004): A Resource-light Approach to Russian Morphology: Tagging Russian using Czech resources.

Sources, manuals

svn (requires a password; email me if you need one)
Morph system resources reference
Morph system Guide
A manual for resource creators (this is a supplement to Hana & Feldman 2010)