analysis/analysis_thread [ Modules ]
SYNOPSIS
    struct TMorfoAnalyzer ma;
    morfo_analyzer_load(&ma, fd);
    struct TMorfoAnalysis ms;
    morfo_analysis_init(&ms, &ma);

    ...
    /* repeat running, using and resetting */
    morfo_analysis_run(&ms, "nesu");
    /* use the results stored in ms */
    ...
    morfo_analysis_results_reset(&ms);
    ...

    morfo_analysis_dispose(&ms);
    morfo_analyzer_dispose(&ma);
FUNCTION
This module uses the compiled analyzer structure to do morphological analysis. It handles iterating over the structure and constructing and storing the results.
SEE ALSO
A complete usage example is the morfo-analysis program, which provides a command line interface to the analysis. See the morfo-analysis.c source file.
See the documentation of the analysis_thread.c module to read about private functions.
analysis_thread/enum TRTagSrc [ Types ]
NAME
enum TRTagSrc -- symbols to mark the source of the tag
SOURCE
    enum TRTagSrc {
        rtsDict,
        rtsForceNG,
        rtsPrefix,
        rtsFallback,
        rtsNumber,
        rtsCOUNT /* number of items/no src type */
    };
analysis_thread/morfo_analysis_attr_as_csts [ Functions ]
NAME
morfo_analysis_attr_as_csts -- converts attributes to a string in the CSTS fashion
SYNOPSIS
    natural idx = morfo_analysis_attr_as_csts(ms, attr_atom);
FUNCTION
The function converts attributes to a string in the CSTS fashion. The string is stored on the top of the ms->char_stack.
The CSTS fashion is understood as a concatenation of all single attributes in the following patterns and order:
- _:<SYNTACTIC>
- _;<SEMANTIC>
- _,<STYLE>
- _^(<LSDER>)
- _^(<LSCOM>)
Note that a word in angle brackets is an attribute category name.
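The pattern list above can be illustrated with a minimal sketch. It assumes the attributes are already available as plain C strings tagged with a category; the names `struct Attr`, `CAT_*` and `attrs_to_csts` are hypothetical and not part of the module, which works on attribute atoms and stores the result in ms->char_stack instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical categories, in the order the patterns are listed above. */
enum { CAT_SYNTACTIC, CAT_SEMANTIC, CAT_STYLE, CAT_LSDER, CAT_LSCOM, CAT_COUNT };

/* Text written before and after an attribute of each category. */
static const char *const cat_open[CAT_COUNT]  = { "_:", "_;", "_,", "_^(", "_^(" };
static const char *const cat_close[CAT_COUNT] = { "",   "",   "",   ")",   ")"   };

/* Hypothetical attribute record: a category and the attribute text. */
struct Attr {
    int cat;
    const char *val;
};

/*
 * Concatenates the attributes into out, category by category.
 * Returns the length of the string, or -1 if out is too small.
 */
static int attrs_to_csts(const struct Attr *attrs, size_t n,
                         char *out, size_t out_size)
{
    size_t used = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (int cat = 0; cat < CAT_COUNT; cat++) {
        for (size_t i = 0; i < n; i++) {
            if (attrs[i].cat != cat)
                continue;
            int w = snprintf(out + used, out_size - used, "%s%s%s",
                             cat_open[cat], attrs[i].val, cat_close[cat]);
            if (w < 0 || (size_t)w >= out_size - used)
                return -1;
            used += (size_t)w;
        }
    }
    return (int)used;
}
```

With this sketch, a style attribute `arch`, a syntactic attribute `x` and an LSDER attribute `d` are emitted in category order regardless of the input order.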
INPUTS
- ms - the analysis structure
- attr_atom - the atom of an attribute to convert
RESULT
- the index referencing the resulting string in the ms->char_stack
analysis_thread/morfo_analysis_attr_as_xml [ Functions ]
NAME
morfo_analysis_attr_as_xml -- converts attributes to a string in the XML fashion
SYNOPSIS
    natural idx = morfo_analysis_attr_as_xml(ms, attr_atom);
FUNCTION
The function converts attributes to a string in the XML fashion, i.e. a space separated sequence of key="val" pairs where val is a concatenation of space separated attributes. The string is stored on the top of the ms->char_stack.
The output may look like:
syn="a b" sty="cd e f"
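A minimal sketch of this output format follows. It assumes the attributes are already grouped by category into arrays of C strings; `struct XmlAttr`, `append` and `attrs_to_xml` are hypothetical names, and the sketch writes to a caller buffer rather than to ms->char_stack. It also assumes the values contain no quotes or markup characters, so nothing is escaped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: one category key and its attribute values. */
struct XmlAttr {
    const char *key;          /* e.g. "syn" */
    const char *const *vals;  /* attribute values of this category */
    size_t n_vals;
};

/* Appends s at out[*used]; returns 0 on success, -1 if it does not fit. */
static int append(char *out, size_t out_size, size_t *used, const char *s)
{
    size_t len = strlen(s);

    if (len >= out_size - *used)
        return -1;
    memcpy(out + *used, s, len + 1);  /* copy including the terminator */
    *used += len;
    return 0;
}

/*
 * Writes space separated key="v1 v2 ..." pairs into out.
 * Returns the length of the string, or -1 if out is too small.
 */
static int attrs_to_xml(const struct XmlAttr *cats, size_t n,
                        char *out, size_t out_size)
{
    size_t used = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t c = 0; c < n; c++) {
        if (cats[c].n_vals == 0)
            continue;  /* categories without attributes are skipped */
        if (used > 0 && append(out, out_size, &used, " ") != 0)
            return -1;
        if (append(out, out_size, &used, cats[c].key) != 0 ||
            append(out, out_size, &used, "=\"") != 0)
            return -1;
        for (size_t v = 0; v < cats[c].n_vals; v++) {
            if (v > 0 && append(out, out_size, &used, " ") != 0)
                return -1;
            if (append(out, out_size, &used, cats[c].vals[v]) != 0)
                return -1;
        }
        if (append(out, out_size, &used, "\"") != 0)
            return -1;
    }
    return (int)used;
}
```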
INPUTS
- ms - the analysis structure
- attr_atom - the atom of an attribute to convert
RESULT
- the index referencing the resulting string in the ms->char_stack.
analysis_thread/morfo_analysis_dispose [ Functions ]
NAME
morfo_analysis_dispose -- free the allocated resources
SYNOPSIS
    struct TMorfoAnalysis ms;
    morfo_analysis_init(&ms, ma);
    ...
    morfo_analysis_dispose(&ms);
FUNCTION
The function frees the resources allocated with the ms structure but not the structure itself.
INPUTS
- ms - the analysis structure to dispose of
NOTES
The analyzer structure ma is yours and thus it is up to you to dispose of it.
RESULT
void but changes ms
analysis_thread/morfo_analysis_free [ Functions ]
NAME
morfo_analysis_free -- free the allocated resources
SYNOPSIS
    struct TMorfoAnalysis *ms = morfo_analysis_new(ma);
    ...
    morfo_analysis_free(ms); ms = NULL;
FUNCTION
The function frees the resources allocated with the ms structure and the structure itself.
INPUTS
- ms - the analysis structure to free
NOTES
The analyzer structure ma is yours and thus it is up to you to dispose of it.
RESULT
void but changes ms
analysis_thread/morfo_analysis_init [ Functions ]
NAME
morfo_analysis_init -- initialize the structure for analysis
SYNOPSIS
    struct TMorfoAnalysis ms;
    morfo_analysis_init(&ms, ma);
FUNCTION
The function initializes the ms structure. It does not allocate memory for it.
INPUTS
- ms - the analysis structure to initialize
- ma - the analyzer to use for analysis
NOTES
Use morfo_analysis_dispose to free the allocated resources.
RESULT
void but changes ms
analysis_thread/morfo_analysis_new [ Functions ]
NAME
morfo_analysis_new -- allocate and initialize the structure for analysis
SYNOPSIS
    struct TMorfoAnalysis *ms = morfo_analysis_new(ma);
FUNCTION
The function allocates and initializes the ms structure.
INPUTS
- ma - the analyzer to use for analysis
NOTES
Use morfo_analysis_free to free the allocated resources.
RESULT
- the pointer to the newly allocated and initialized analysis structure
analysis_thread/morfo_analysis_punct [ Functions ]
NAME
morfo_analysis_punct -- merge the punct_str as punctuation
SYNOPSIS
    morfo_analysis_punct(ms, punct_str);
FUNCTION
The function merges the punct_str with other results. The string is considered to be punctuation and thus it is marked with the tag ma->punct_tg.
INPUTS
- ms - the analysis structure
- punct_str - the string to merge
RESULT
void, but changes ms->lresults
analysis_thread/morfo_analysis_results_reset [ Functions ]
NAME
morfo_analysis_results_reset -- clears the internal stacks holding the result
SYNOPSIS
    morfo_analysis_results_reset(ms);
FUNCTION
Clears the internal stacks holding the result. Call it after the results have been processed and before the next analysis is run.
INPUTS
- ms - the analysis structure
RESULT
void, but changes ms->lresults
analysis_thread/morfo_analysis_run [ Functions ]
NAME
morfo_analysis_run -- run the partial analyses
SYNOPSIS
    morfo_analysis_run(ms, form_str);
FUNCTION
The function runs the partial analyses in a deliberate order, described below.
* NUMBER
Before the morphological analysis (MA) starts, the input word is checked to see whether it is a number. If so, it is reported as a number and no analysis is done at all. Otherwise several paths are tried in succession.
Note that punctuation also bypasses the analysis.
* BASIC ANALYSIS
First, the form string is used as is to track the tree of forms. This basic case may result in
- one or more lemmas with morphological tags found as belonging to the form,
- or no analysis found.
This procedure is reused in all later stages of the analysis.
* NEGATION AND GRADING
Then, regardless of whether any (basic) analysis was found, the beginning of the given form is checked for a morpheme for negation and/or for grading to the superlative. The morphemes present are recognized and the rest of the form is tracked with the basic analysis.
There are two possible morphemes and one combination considered:
- the negative morpheme "ne" forming negation,
- the morpheme "nej" forming superlative grade
- and the combination of both, "nejne", forming the negated superlative.
For example, the word "nejnepořádnější" is split in four ways:
- (1) 0 + nejnepořádnější
- (2) ne + jnepořádnější
- (3) nej + nepořádnější
- (4) nejne + pořádnější
The first is the common way for all forms (basic analysis). The other three are driven by the letters (morphemes) present at the beginning of the form.
If a split is successfully tracked, it may or may not be considered a valid analysis. Only a few parts of speech can be combined with the mentioned morphemes, so the results are filtered after the rest of the form has been tracked and the part of speech is known.
Two kinds of reasons can justify combining a morpheme with the form. First, the morphological tag may allow connecting the morpheme, either for negation or for the superlative, by means of a joker marker. Second, the part of speech is known to be usable with the morpheme: the negative morpheme may be used with affirmative nouns, adjectives and verbs, and superlatives may be formed from comparatives of adjectives and adverbs.
If both the negation and the superlative morphemes are present, both must be allowed by a rule; if not, the analysis is discarded. Otherwise the analysis is accepted and the morphological tag is adjusted to match the recognized morphemes.
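The enumeration of candidate splits described in this section can be sketched as follows. This is only an illustration under stated assumptions: `struct Split` and `candidate_splits` are hypothetical names, the form is assumed to be already lowercased, and the sketch only lists the candidates. Tracking the remainder in the forms tree and filtering by part of speech are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One way of cutting leading morphemes off a form (hypothetical type). */
struct Split {
    size_t skip;  /* bytes taken by the recognized morphemes */
    int neg;      /* 1 if the negative morpheme "ne" is assumed */
    int grade;    /* 1 if the superlative morpheme "nej" is assumed */
};

/*
 * Fills out[] with every candidate split of form and returns their count
 * (1 to 4). The empty split always comes first; it is the basic analysis.
 */
static size_t candidate_splits(const char *form, struct Split out[4])
{
    static const struct { const char *text; int neg; int grade; } morphemes[] = {
        { "",      0, 0 },
        { "ne",    1, 0 },
        { "nej",   0, 1 },
        { "nejne", 1, 1 },
    };
    size_t count = 0;

    assert(form != NULL && out != NULL);
    for (size_t i = 0; i < sizeof morphemes / sizeof morphemes[0]; i++) {
        size_t len = strlen(morphemes[i].text);
        /* A candidate exists whenever the form begins with the letters. */
        if (strncmp(form, morphemes[i].text, len) == 0) {
            out[count].skip = len;
            out[count].neg = morphemes[i].neg;
            out[count].grade = morphemes[i].grade;
            count++;
        }
    }
    return count;
}
```

For "nejnepořádnější" the sketch yields the four splits listed above; for "nesu" it yields only the empty split and the "ne" split. Whether a candidate survives depends on the tracking and filtering steps.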
* PREFIX DERIVATION
If no analysis has been found so far, the form may be a word derived with a prefix. This case is considered only if the prefix tree was compiled in addition to the word form tree. The prefix tree is compiled from a list of derivational prefixes that may be provided as an extra input to the compiler.
Although both trees are of the same data structure type, they are built and treated independently.
If the derivation analysis takes place, a prefix is first searched for in the prefix tree. When a prefix is recognized, tracking continues in the forms tree with the rest of the given form (i.e. basic analysis).
If the rest proves to be a word form, its part of speech is checked against those allowed by the prefix. If they match, the analysis is accepted; otherwise it is discarded. Abbreviations are never allowed to be derived with a prefix.
* PREFIX TYPES
Prefixes are tagged with flags that say which word classes (parts of speech) may follow them. Currently two flags are used:
- the nominal flag, which combines with nouns and adjectives,
- and the verb flag, which combines only with verbs.
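The flag check can be pictured with a small sketch. The constants `PREFIX_NOMINAL` and `PREFIX_VERB`, the enum `Pos` and the function `prefix_allows` are hypothetical; the module's actual flag values and part of speech encoding are not shown in this documentation. The rule that abbreviations are never derived with a prefix is included as an explicit parameter.

```c
#include <assert.h>

/* Hypothetical prefix flags; a prefix may carry one or both. */
enum { PREFIX_NOMINAL = 1 << 0, PREFIX_VERB = 1 << 1 };

/* Hypothetical part of speech of the form that follows the prefix. */
enum Pos { POS_NOUN, POS_ADJECTIVE, POS_VERB, POS_ADVERB, POS_OTHER };

/* Returns 1 if a prefix with the given flags may precede the form. */
static int prefix_allows(unsigned flags, enum Pos pos, int is_abbreviation)
{
    /* Only the two flags described above are known to this sketch. */
    assert((flags & ~(unsigned)(PREFIX_NOMINAL | PREFIX_VERB)) == 0);

    if (is_abbreviation)
        return 0;  /* abbreviations are never derived with a prefix */
    switch (pos) {
    case POS_NOUN:
    case POS_ADJECTIVE:
        return (flags & PREFIX_NOMINAL) != 0;
    case POS_VERB:
        return (flags & PREFIX_VERB) != 0;
    default:
        return 0;  /* no flag covers other parts of speech */
    }
}
```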
* PREFIX SEARCHING
The analysis continues by searching for a longer prefix (and a shorter form), which may be another division of the derivation.
At most one derivational prefix can be recognized.
After no more prefixes and forms can be recognized, one more way of analysis is considered: the word derivation may be combined with negation and/or grading.
The superlative morpheme may be followed by the negative morpheme, and one or both are followed by a derivational prefix, which is finally followed by the form. The part of speech of the form must match all the morphemes present (negative, superlative) and the prefix flags.
Again, just one derivational prefix is recognized.
* PROCESSING SCHEME
The following list gives a brief view of the processing:
- 0. punctuation bypasses the analysis
- 1. if it is a number, say it is a number and finish
- 2. look for the form as is
- 3. consider negative and superlative morphemes if present
- 4. if we have any analysis found then finish
- 5. consider all derivational prefixes that the given form begins with
- 6. consider negative and superlative morphemes if present followed with derivational prefixes
- 7. if we have any analysis found then finish
- 8. if the given form begins with an uppercase letter, guess it is a proper name; otherwise tag the form as unknown
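The scheme above can be read as the following control flow. This is a sketch, not the module's code: the stages are passed in as hypothetical callbacks that return the number of analyses found, the `toy_*` stages are made-up stand-ins with a two-word vocabulary, and the uppercase test uses ASCII `isupper`, which is a simplification for Czech input.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical outcome labels, one per exit of the scheme. */
enum Outcome { OUT_PUNCT, OUT_NUMBER, OUT_FOUND, OUT_PREFIX,
               OUT_PROPER_GUESS, OUT_UNKNOWN };

/* Hypothetical stage callbacks; counts are numbers of analyses found. */
struct Stages {
    int (*is_punct)(const char *form);            /* step 0 */
    int (*is_number)(const char *form);           /* step 1 */
    int (*basic)(const char *form);               /* step 2 */
    int (*neg_grade)(const char *form);           /* step 3 */
    int (*prefixes)(const char *form);            /* step 5 */
    int (*neg_grade_prefixes)(const char *form);  /* step 6 */
};

static enum Outcome run_scheme(const struct Stages *st, const char *form)
{
    int found;

    assert(st != NULL && form != NULL);
    if (st->is_punct(form))
        return OUT_PUNCT;
    if (st->is_number(form))
        return OUT_NUMBER;
    found = st->basic(form);
    found += st->neg_grade(form);  /* tried even if step 2 succeeded */
    if (found > 0)
        return OUT_FOUND;          /* step 4 */
    found = st->prefixes(form);
    found += st->neg_grade_prefixes(form);
    if (found > 0)
        return OUT_PREFIX;         /* step 7 */
    /* step 8: ASCII-only uppercase test, an assumption of this sketch */
    return isupper((unsigned char)form[0]) ? OUT_PROPER_GUESS : OUT_UNKNOWN;
}

/* Toy stages for demonstration only. */
static int toy_is_punct(const char *f)
{
    return f[0] != '\0' && f[1] == '\0' && ispunct((unsigned char)f[0]);
}

static int toy_is_number(const char *f)
{
    if (*f == '\0')
        return 0;
    for (; *f != '\0'; f++)
        if (!isdigit((unsigned char)*f))
            return 0;
    return 1;
}

static int toy_basic(const char *f)    { return strcmp(f, "nesu") == 0; }
static int toy_prefixes(const char *f) { return strcmp(f, "donesu") == 0; }
static int toy_none(const char *f)     { (void)f; return 0; }

static const struct Stages toy_stages = {
    toy_is_punct, toy_is_number, toy_basic, toy_none, toy_prefixes, toy_none
};
```

Calling `run_scheme(&toy_stages, form)` with the toy stages shows each exit: a single punctuation character, a digit string, a known form, a prefixed form, a capitalized unknown word and a lowercase unknown word.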
INPUTS
- ms - the analysis structure
- form_str - the string to analyze
RESULT
void, but the results are accumulated in ms->lresults
analysis_thread/morfo_analysis_sort_lr [ Functions ]
NAME
morfo_analysis_sort_lr -- sort the results by the lemma string
SYNOPSIS
    morfo_analysis_sort_lr(ms);
FUNCTION
The function sorts the results by the lemma string.
INPUTS
- ms - the analysis structure
RESULT
void, but changes ms->lresults
analysis_thread/morfo_analysis_sort_r [ Functions ]
NAME
morfo_analysis_sort_r -- sort all the results in lemmas and tags
SYNOPSIS
    morfo_analysis_sort_r(ms);
INPUTS
- ms - the analysis structure
RESULT
void, but changes ms->lresults and ms->tresults
analysis_thread/morfo_analysis_sort_tr [ Functions ]
NAME
morfo_analysis_sort_tr -- sort the tags in a list
SYNOPSIS
    natural head = morfo_analysis_sort_tr(ms, tag_list, s);
FUNCTION
The function sorts the tags in a list by their string representation.
INPUTS
- ms - the analysis structure
- tag_list - the list head index into the ms->tresults
- s - sets the order of the sorting (0 = ascending, 1 = descending)
RESULT
- the index of the head of the sorted list in ms->tresults; the function also changes ms->tresults
analysis_thread/morfo_analysis_tag_as_str [ Functions ]
NAME
morfo_analysis_tag_as_str -- get a positional representation of the tag
SYNOPSIS
    char *t_str = morfo_analysis_tag_as_str(ms, t_atom);
FUNCTION
The function converts the tag atom to a string with its positional representation. The string is stored in a static buffer and will be overwritten by a subsequent call.
INPUTS
- ms - the analysis structure
- t_atom - the atom of a tag to convert
RESULT
- the pointer to the static buffer
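Because the result lives in a static buffer, a caller that needs two tag strings at once has to copy the first before requesting the second. The sketch below shows the pitfall and one way around it. `fake_tag_as_str` is a hypothetical stand-in that only mimics the buffer behaviour (it does not produce real positional tags), and `tag_str_dup` is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: formats into a static buffer and returns it. */
static char *fake_tag_as_str(unsigned atom)
{
    static char buf[16];

    snprintf(buf, sizeof buf, "TAG%u", atom);
    return buf;  /* every call returns the same address */
}

/*
 * Returns a heap copy of the tag string that stays valid across later
 * calls. The caller owns it and must free() it. NULL on allocation failure.
 */
static char *tag_str_dup(unsigned atom)
{
    const char *s = fake_tag_as_str(atom);
    size_t size;
    char *copy;

    assert(s != NULL);
    size = strlen(s) + 1;
    copy = malloc(size);
    if (copy != NULL)
        memcpy(copy, s, size);
    return copy;
}
```

Keeping the raw pointers from two consecutive calls would leave both pointing at the text of the second tag; the copies stay independent.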
analysis_thread/struct TMorfoAnalysis [ Structures ]
NAME
struct TMorfoAnalysis -- a structure to hold the context of the analysis
ATTRIBUTES
See below.
FUNCTION
The structure holds the analyzer, state, configuration and results of the current analysis.
SOURCE
    struct TMorfoAnalysis {
        struct TMorfoAnalyzer *ma;
        struct TVTriePage *pg;            /* the currently traced page of the analysis tree */
        struct TVTriePage *pr;            /* the currently traced page of the prefix tree */
        struct TGrowingStack char_stack;  /* repository to store found lemmas */
        struct TGrowingStack lresults;    /* of struct TRLemma */
        struct TGrowingStack tresults;    /* of struct TRTag */
        struct TGrowingStack lc;          /* of char to store lowercased form */
        struct TVHT_nat lht;              /* lemma hash table (to group results by lemma) */

        /* variables stored here to save space on the stack while doing a recursion */
        const char *form_str;   /* points to the beginning of the form with possible prefix removed (nej^rychlejší) */
        const char *form_end;   /* points to the next character of the form that ought to be processed */
        const char *lemma_p;    /* points to the beginning of the whole form (prefixes included) */
        uint16_t lemma_p_len;   /* the length of the recognized prefix */
        uint16_t lemma_p_type;  /* the type of the recognized prefix */
        uint8_t neg;            /* indicates the presence of the ne- prefix */
        uint8_t grade;          /* indicates the presence of the nej- prefix */

        TMorfoAnalysisMergeTags merge_tags;  /* the method of the tags filtering */
    };
analysis_thread/struct TRLemma [ Structures ]
NAME
struct TRLemma -- a structure to store a part of the analysis joined with a particular lemma
ATTRIBUTES
- lm_idx -- the index to the ms->char_stack to the beginning of the lemma
- tag_list -- the index in the ms->tresults repository of the head of the list of struct TRTag
- attr_atom -- references the first attribute blob; the scheme is ms->ma->string_rep[ ms->ma->attr_idx[atom] ]. The blob begins with its length. Then the flags are concatenated, separated by a zero byte. A double zero byte divides the flag categories.
SOURCE
    struct TRLemma {
        natural lm_idx;
        natural tag_list;
        natural attr_atom; /* uint16_t is enough to hold attr_atom */
    };
analysis_thread/struct TRTag [ Structures ]
NAME
struct TRTag -- a list item structure storing a tag as a part of the analysis result
ATTRIBUTES
- atom -- the tag representation
- next -- the next item index or NATURAL_MAX at the end
- src -- the analysis source attribute
SOURCE
    struct TRTag {
        natural atom;
        natural next;
        uint8_t src; /* of enum TRTagSrc */
    };
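Since the list is linked by indices rather than pointers, walking it means following `next` until NATURAL_MAX is reached. The sketch below does this over a plain array that stands in for the ms->tresults repository. The definitions of `natural` and `NATURAL_MAX` are assumptions (the real ones are not shown here), and `tag_list_collect` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t natural;       /* assumption: the real typedef is not shown */
#define NATURAL_MAX UINT32_MAX  /* assumption: end-of-list marker value */

struct TRTag {
    natural atom;
    natural next;
    uint8_t src; /* of enum TRTagSrc */
};

/*
 * Follows the list starting at head through the tresults array and copies
 * up to max tag atoms into atoms[]. Returns the number of atoms copied.
 */
static size_t tag_list_collect(const struct TRTag *tresults, natural head,
                               natural *atoms, size_t max)
{
    size_t n = 0;

    assert(tresults != NULL && atoms != NULL);
    for (natural i = head; i != NATURAL_MAX && n < max; i = tresults[i].next)
        atoms[n++] = tresults[i].atom;
    return n;
}
```

The head index would come from the `tag_list` member of a struct TRLemma; an empty list is represented here by a head equal to NATURAL_MAX, which is also an assumption.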