./analysis/analysis

analysis/analysis_thread [ Modules ]

SYNOPSIS

    struct TMorfoAnalyzer ma;
    morfo_analyzer_load(&ma, fd);
    struct TMorfoAnalysis ms;
    morfo_analysis_init(&ms, &ma);

    ...
    /* repeat running, using and resetting */
    morfo_analysis_run(&ms, "nesu");
    /* use the results stored in ms */
    ...
    morfo_analysis_results_reset(&ms);
    ...

    morfo_analysis_dispose(&ms);
    morfo_analyzer_dispose(&ma);

FUNCTION

This module uses the compiled analyzer structure to do morphological analysis. It handles traversing the structure and constructing and storing the results.

analysis_thread/enum TRTagSrc [ Types ]

NAME

enum TRTagSrc -- symbols to mark the source of the tag

SOURCE

enum TRTagSrc {
    rtsDict,
    rtsForceNG,
    rtsPrefix,
    rtsFallback,
    rtsNumber,
    rtsCOUNT /* number of items/no src type */
};

analysis_thread/morfo_analysis_attr_as_csts [ Functions ]

NAME

morfo_analysis_attr_as_csts -- converts attributes to a string in the CSTS fashion

SYNOPSIS

    natural idx = morfo_analysis_attr_as_csts(ms, attr_atom);

FUNCTION

The function converts attributes to a string in the CSTS fashion. The string is stored on the top of the ms->char_stack.

The CSTS fashion is understood as a concatenation of all single attributes in the following patterns and order:

_:<SYNTACTIC>
_;<SEMANTIC>
_,<STYLE>
_^(<LSDER>)
_^(<LSCOM>)

Note that a word in angle brackets is an attribute category name.
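
The concatenation described above can be sketched as follows. This is an illustration only: struct csts_attrs and csts_format are hypothetical names that are not part of the module, which works on attribute atoms and stores its output on ms->char_stack. The sketch also assumes that a category with no attributes produces no output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container: one string per attribute category (NULL = absent). */
struct csts_attrs {
    const char *syntactic;
    const char *semantic;
    const char *style;
    const char *lsder;
    const char *lscom;
};

/* Writes the categories in the documented order and patterns.
 * Returns the number of characters written, or -1 if buf is too small. */
static int csts_format(const struct csts_attrs *a, char *buf, size_t size)
{
    static const char *const lead[5] = { "_:", "_;", "_,", "_^(", "_^(" };
    static const char *const tail[5] = { "", "", "", ")", ")" };
    const char *values[5];
    size_t used = 0;
    int i;

    assert(a != NULL && buf != NULL && size > 0);
    values[0] = a->syntactic;
    values[1] = a->semantic;
    values[2] = a->style;
    values[3] = a->lsder;
    values[4] = a->lscom;

    buf[0] = '\0';
    for (i = 0; i < 5; i++) {
        int n;
        if (values[i] == NULL || values[i][0] == '\0')
            continue; /* assumption: absent categories are skipped */
        n = snprintf(buf + used, size - used, "%s%s%s",
                     lead[i], values[i], tail[i]);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

With syntactic set to "T", style set to "hov" and lsder set to "x", the helper produces the string _:T_,hov_^(x).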

INPUTS

ms - the analysis structure
attr_atom - the atom of an attribute to convert

RESULT

the index referencing the resulting string in the ms->char_stack

analysis_thread/morfo_analysis_attr_as_xml [ Functions ]

NAME

morfo_analysis_attr_as_xml -- converts attributes to a string in the XML fashion

SYNOPSIS

    natural idx = morfo_analysis_attr_as_xml(ms, attr_atom);

FUNCTION

The function converts attributes to a string in the XML fashion, i.e. a space separated sequence of key="val" pairs where val is a concatenation of space separated attributes. The string is stored on the top of the ms->char_stack.

The output may look like:

    syn="a b" sty="cd e f"
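
A minimal sketch of how such a string could be assembled is given here. The names struct xml_category, buf_append and xml_attr_format are hypothetical and do not belong to the module. The sketch writes into a caller-supplied buffer instead of ms->char_stack and does no XML escaping of the values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical category: a key and a NULL-terminated list of attributes. */
struct xml_category {
    const char *key;
    const char *const *values;
};

/* Appends s to buf at offset *used; returns -1 if it does not fit. */
static int buf_append(char *buf, size_t size, size_t *used, const char *s)
{
    size_t len = strlen(s);
    if (len >= size - *used)
        return -1;
    memcpy(buf + *used, s, len + 1);
    *used += len;
    return 0;
}

/* Builds space separated key="val" pairs, where val is the space separated
 * list of the category's attributes. Returns the length or -1 on overflow. */
static int xml_attr_format(const struct xml_category *cats, size_t ncats,
                           char *buf, size_t size)
{
    size_t used = 0, i, j;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';
    for (i = 0; i < ncats; i++) {
        if (i > 0 && buf_append(buf, size, &used, " ") < 0)
            return -1;
        if (buf_append(buf, size, &used, cats[i].key) < 0 ||
            buf_append(buf, size, &used, "=\"") < 0)
            return -1;
        for (j = 0; cats[i].values[j] != NULL; j++) {
            if (j > 0 && buf_append(buf, size, &used, " ") < 0)
                return -1;
            if (buf_append(buf, size, &used, cats[i].values[j]) < 0)
                return -1;
        }
        if (buf_append(buf, size, &used, "\"") < 0)
            return -1;
    }
    return (int)used;
}
```

Called with the categories syn = {a, b} and sty = {cd, e, f}, the helper reproduces the sample output shown above.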

INPUTS

ms - the analysis structure
attr_atom - the atom of an attribute to convert

RESULT

the index referencing the resulting string in the ms->char_stack.

analysis_thread/morfo_analysis_dispose [ Functions ]

NAME

morfo_analysis_dispose -- free the allocated resources

SYNOPSIS

    struct TMorfoAnalysis ms;
    morfo_analysis_init(&ms, ma);
    ...
    morfo_analysis_dispose(&ms);

FUNCTION

The function frees the resources allocated with the ms structure but not the structure itself.

INPUTS

ms - the analysis structure to dispose

NOTES

The analyzer structure ma belongs to the caller, so it is up to you to dispose of it.

RESULT

void but changes ms

analysis_thread/morfo_analysis_free [ Functions ]

NAME

morfo_analysis_free -- free the allocated resources

SYNOPSIS

    struct TMorfoAnalysis *ms = morfo_analysis_new(ma);
    ...
    morfo_analysis_free(ms); ms = NULL;

FUNCTION

The function frees the resources allocated with the ms structure and, as the counterpart of morfo_analysis_new, the structure itself.

INPUTS

ms - the analysis structure to free

NOTES

The analyzer structure ma belongs to the caller, so it is up to you to dispose of it.

RESULT

void but changes ms

analysis_thread/morfo_analysis_init [ Functions ]

NAME

morfo_analysis_init -- initialize the structure for analysis

SYNOPSIS

    struct TMorfoAnalysis ms;
    morfo_analysis_init(&ms, ma);

FUNCTION

The function initializes the ms structure. It does not allocate memory for it.

INPUTS

ms - the analysis structure to initialize
ma - the analyzer to use for analysis

NOTES

Use morfo_analysis_dispose to free the allocated resources.

RESULT

void but changes ms

analysis_thread/morfo_analysis_new [ Functions ]

NAME

morfo_analysis_new -- allocate and initialize the structure for analysis

SYNOPSIS

    struct TMorfoAnalysis *ms = morfo_analysis_new(ma);

FUNCTION

The function allocates and initializes the ms structure.

INPUTS

ma - the analyzer to use for analysis

NOTES

Use morfo_analysis_free to free the allocated resources.

RESULT

the pointer to the newly allocated and initialized analysis structure

analysis_thread/morfo_analysis_punct [ Functions ]

NAME

morfo_analysis_punct -- merge the punct_str as punctuation

SYNOPSIS

    morfo_analysis_punct(ms, punct_str);

FUNCTION

The function merges the punct_str with other results. The string is considered to be punctuation and thus it is marked with the tag ma->punct_tg.

INPUTS

ms - the analysis structure
punct_str - the string to merge

RESULT

void, but changes ms->lresults

analysis_thread/morfo_analysis_results_reset [ Functions ]

NAME

morfo_analysis_results_reset -- clears the internal stacks holding the result

SYNOPSIS

    morfo_analysis_results_reset(ms);

FUNCTION

Clears the internal stacks holding the result. It is recommended to call it after the results have been processed and before another analysis is executed.

INPUTS

ms - the analysis structure

RESULT

void, but changes ms->lresults

analysis_thread/morfo_analysis_run [ Functions ]

NAME

morfo_analysis_run -- run the partial analyses

SYNOPSIS

    morfo_analysis_run(ms, form_str);

FUNCTION

The function runs the partial analyses in a deliberate order. The description follows.

* NUMBER

Before the morphological analysis (MA) is started, the input word is checked to see whether it is a number. If so, it is reported as a number and no analysis is done at all. Otherwise several paths are tried in succession.

Note that punctuation also bypasses the analysis.

* BASIC ANALYSIS

First the form string is used as is to track the tree of forms. This basic case may result in

  * one or more lemmas with morphological tags found as belonging to the form
  * or no analysis found.

This procedure is reused in all later ways of the analysis.

* NEGATION AND GRADING

Then, regardless of whether any (basic) analysis was found, if the given form begins with a morpheme for negation and/or for grading to the superlative, the morphemes present are recognized and the rest of the form is tracked with the basic analysis.

There are two possible morphemes and one combination considered:

the negative morpheme "ne" forming negation,
the morpheme "nej" forming superlative grade
and combination of both "nejne" forming negated superlative.

For example, the word "nejnepořádnější" is split in four ways:

(1) 0 + nejnepořádnější
(2) ne + jnepořádnější
(3) nej + nepořádnější
(4) nejne + pořádnější

The first is the common way for all forms (basic analysis). The other three are driven by the letters (morphemes) present at the beginning of the form.
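
The four splits can be enumerated with a few prefix comparisons. The helper below is a hypothetical sketch, not the module's code: it only reports the byte length of each candidate leading morpheme and leaves the tracking of the rest of the form to the basic analysis.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: stores the byte lengths of the candidate leading
 * morphemes ("", "ne", "nej", "nejne") that the form begins with in out[]
 * and returns how many candidates were found. */
static size_t candidate_splits(const char *form, size_t out[4])
{
    static const char *const morphemes[4] = { "", "ne", "nej", "nejne" };
    size_t n = 0, i;

    assert(form != NULL && out != NULL);
    for (i = 0; i < 4; i++) {
        size_t len = strlen(morphemes[i]);
        /* the empty morpheme always matches: this is the basic analysis */
        if (strncmp(form, morphemes[i], len) == 0)
            out[n++] = len;
    }
    return n;
}
```

For "nejnepořádnější" the helper reports the cut points 0, 2, 3 and 5, which correspond to splits (1) to (4). For a form such as "nesu" only the cut points 0 and 2 are reported.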

If any split is successfully tracked, it may or may not be considered a valid analysis. Only a few parts of speech can be combined with the mentioned morphemes, so the results are filtered after the rest of the form has been tracked and the part of speech is known.

Two kinds of reasons can justify combining a morpheme with the form. First, the morphological tag may permit attaching the morpheme for negation or for the superlative by means of a joker marker. Second, the part of speech is known to combine with the morpheme. The negative morpheme may be used with affirmative nouns, adjectives and verbs. Superlatives may be formed from comparatives of adjectives and adverbs.

If both the negation and the superlative morphemes are present, both must be allowed by a rule. If not, the analysis is discarded. Otherwise the analysis is accepted and the morphological tag is adjusted to match the recognized morphemes.
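
The filtering rule can be sketched as a pair of predicates. All names below (enum pos, struct form_info, split_accepted) are hypothetical. The real analyzer derives this information from the morphological tag, and how the joker marker is encoded there is not shown in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical part-of-speech codes. */
enum pos { POS_NOUN, POS_ADJ, POS_VERB, POS_ADV, POS_OTHER };

/* Hypothetical summary of what the tag of a tracked form says. */
struct form_info {
    enum pos pos;
    bool affirmative;  /* the form is not negated yet */
    bool comparative;  /* the form is in the comparative grade */
    bool neg_joker;    /* the tag explicitly permits the "ne" morpheme */
    bool sup_joker;    /* the tag explicitly permits the "nej" morpheme */
};

static bool neg_allowed(const struct form_info *f)
{
    return f->neg_joker ||
           (f->affirmative &&
            (f->pos == POS_NOUN || f->pos == POS_ADJ || f->pos == POS_VERB));
}

static bool sup_allowed(const struct form_info *f)
{
    return f->sup_joker ||
           (f->comparative && (f->pos == POS_ADJ || f->pos == POS_ADV));
}

/* Every morpheme that is present must be allowed, else the split is dropped. */
static bool split_accepted(const struct form_info *f, bool has_neg, bool has_sup)
{
    assert(f != NULL);
    if (has_neg && !neg_allowed(f))
        return false;
    if (has_sup && !sup_allowed(f))
        return false;
    return true;
}
```

Under these assumptions a comparative adjective accepts "nejne", while a comparative adverb accepts "nej" but not "nejne", because the negative morpheme is not listed for adverbs.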

* PREFIX DERIVATION

If no analysis has been found so far, the form may be a word derived with a prefix. This case is considered only if the prefix tree was compiled in addition to the word form tree. The prefix tree is compiled from a list of derivational prefixes that may be provided as an extra input to the compiler.

Although both trees are of the same data structure type, they are built and treated independently.

If the derivation analysis takes place, a prefix is first searched for in the prefix tree. When a prefix is recognized, tracking continues in the forms tree with the rest of the given form (i.e. basic analysis).

If the rest proves to be a word form, its part of speech is checked against those allowed by the prefix. If it matches, the analysis is accepted; otherwise it is discarded. Abbreviations are never allowed to be derived with a prefix.

* PREFIX TYPES

Prefixes are tagged with flags that say which word class (part of speech) may follow them. Currently two flags are used:

nominal flag that combines with nouns and adjectives
and verb flag that combines just with verbs.
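
The two flags can be pictured as a small bit mask. The constant names and values below are hypothetical, since the actual encoding of ms->lemma_p_type is not shown in this document.

```c
#include <assert.h>

/* Hypothetical flag values for the two prefix types. */
enum prefix_flag { PF_NOMINAL = 1, PF_VERB = 2 };

/* Hypothetical word classes of the form that follows the prefix. */
enum word_class { WC_NOUN, WC_ADJ, WC_VERB, WC_OTHER };

/* Returns nonzero if a prefix with the given flags may precede the class. */
static int prefix_allows(unsigned flags, enum word_class wc)
{
    /* only the two documented flags are expected */
    assert((flags & ~(unsigned)(PF_NOMINAL | PF_VERB)) == 0);
    switch (wc) {
    case WC_NOUN:
    case WC_ADJ:
        return (flags & PF_NOMINAL) != 0;
    case WC_VERB:
        return (flags & PF_VERB) != 0;
    default:
        return 0; /* other classes are never derived */
    }
}
```

A prefix may carry both flags, in which case it combines with nouns, adjectives and verbs alike.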

* PREFIX SEARCHING

The analysis continues by searching for a longer prefix (and a shorter form) that may be another division of the derivation.

At most one derivational prefix can be recognized.

When no more prefixes and forms can be recognized, one more way of analysis is considered: the word derivation may be combined with negation and/or grading.

The superlative morpheme may be followed by the negative morpheme, and one or both are followed by a derivational prefix, followed finally by the form. The part of speech of the form must match all the morphemes present (negative, superlative) and the prefix flags.

Again, just one derivational prefix is recognized.

* PROCESSING SCHEME

The following list gives a brief view of the processing:

0. punctuation bypasses
1. if it is a number, say it is a number and finish
2. look for the form as is
3. consider negative and superlative morphemes if present
4. if any analysis has been found, finish
5. consider all derivational prefixes that the given form begins with
6. consider negative and superlative morphemes if present followed with derivational prefixes
7. if any analysis has been found, finish
8. if the given form begins with an uppercase letter, guess it is a proper name, otherwise tag the form as unknown
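
The control flow of the list above can be condensed into one decision function. This is a hypothetical sketch: struct stage_hits merely records what each stage would have found for one form, and the outcome labels are illustrative and are not the module's enum TRTagSrc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome labels for one input form. */
enum scheme_outcome {
    SO_PUNCT, SO_NUMBER, SO_FORM, SO_PREFIX, SO_PROPER_NAME_GUESS, SO_UNKNOWN
};

/* Hypothetical summary of what each stage would find. */
struct stage_hits {
    int is_punct;          /* step 0 */
    int is_number;         /* step 1 */
    int basic;             /* analyses of the form as is (step 2) */
    int neg_grade;         /* analyses after ne-/nej- (step 3) */
    int prefix;            /* analyses after a derivational prefix (step 5) */
    int neg_grade_prefix;  /* ne-/nej- followed by a prefix (step 6) */
    int starts_upper;      /* the form begins with an uppercase letter */
};

static enum scheme_outcome scheme_outcome_of(const struct stage_hits *h)
{
    assert(h != NULL);
    if (h->is_punct)
        return SO_PUNCT;                         /* step 0 */
    if (h->is_number)
        return SO_NUMBER;                        /* step 1 */
    if (h->basic + h->neg_grade > 0)
        return SO_FORM;                          /* steps 2 to 4 */
    if (h->prefix + h->neg_grade_prefix > 0)
        return SO_PREFIX;                        /* steps 5 to 7 */
    return h->starts_upper ? SO_PROPER_NAME_GUESS : SO_UNKNOWN; /* step 8 */
}
```

Note that steps 2 and 3 are both evaluated before the check in step 4, so a form can collect basic and negated or graded analyses at once, whereas the prefix stages run only when both came up empty.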

INPUTS

ms - the analysis structure
form_str - the string to analyze

RESULT

void, but the results are accumulated in ms->lresults

analysis_thread/morfo_analysis_sort_lr [ Functions ]

NAME

morfo_analysis_sort_lr -- sort the results by the lemma string

SYNOPSIS

    morfo_analysis_sort_lr(ms);

FUNCTION

The function sorts the results by the lemma string.

INPUTS

ms - the analysis structure

RESULT

void, but changes ms->lresults

analysis_thread/morfo_analysis_sort_r [ Functions ]

NAME

morfo_analysis_sort_r -- sort all the results in lemmas and tags

SYNOPSIS

    morfo_analysis_sort_r(ms);

INPUTS

ms - the analysis structure

RESULT

void, but changes ms->lresults and ms->tresults

analysis_thread/morfo_analysis_sort_tr [ Functions ]

NAME

morfo_analysis_sort_tr -- sort the tags in a list

SYNOPSIS

    natural head = morfo_analysis_sort_tr(ms, tag_list, s);

FUNCTION

The function sorts the tags in a list by their string representation.

INPUTS

ms - the analysis structure
tag_list - the list head index into the ms->tresults
s - sets the order of the sorting (0 = ascending, 1 = descending)

RESULT

the index of the head of the sorted list in ms->tresults; the function also changes ms->tresults
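
Because the list is linked through indices, sorting may change which item comes first, which is presumably why the synopsis stores a returned head. The sketch below shows one way to sort such a list, using an insertion sort on a plain array. The names struct str_item and sort_index_list are hypothetical, the definitions of natural and NATURAL_MAX are assumptions, and the algorithm actually used by the module is not documented here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

typedef unsigned int natural;   /* assumption: the real typedef is not shown */
#define NATURAL_MAX UINT_MAX    /* assumption: value of the end-of-list mark */

/* Hypothetical list item: a string to compare and the index of the next item. */
struct str_item {
    const char *str;
    natural next;
};

/* Insertion sort over an index-linked list; returns the new head index.
 * s follows the documented convention: 0 = ascending, 1 = descending. */
static natural sort_index_list(struct str_item *items, natural head, int s)
{
    natural sorted = NATURAL_MAX;

    assert(items != NULL);
    while (head != NATURAL_MAX) {
        natural cur = head;
        natural *link = &sorted;

        head = items[cur].next;
        /* find the first sorted item that must come after cur */
        while (*link != NATURAL_MAX) {
            int c = strcmp(items[*link].str, items[cur].str);
            if (s ? c < 0 : c > 0)
                break;
            link = &items[*link].next;
        }
        items[cur].next = *link;
        *link = cur;
    }
    return sorted;
}
```

The caller has to continue with the returned head, since the item that used to be first may now sit anywhere in the list.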

analysis_thread/morfo_analysis_tag_as_str [ Functions ]

NAME

morfo_analysis_tag_as_str -- get a positional representation of the tag

SYNOPSIS

    char *t_str = morfo_analysis_tag_as_str(ms, t_atom);

FUNCTION

The function converts the tag atom to a string with its positional representation. The string is stored in a static buffer and will be overwritten by a subsequent call.

INPUTS

ms - the analysis structure
t_atom - the atom of a tag to convert

RESULT

the pointer to the static buffer
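
Since the buffer is shared between calls, a caller that needs two tag strings at once has to copy the first one before asking for the second. The sketch below demonstrates this with a stand-in function: demo_tag_as_str and keep_tag_str are hypothetical, and only the static buffer discipline mirrors the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-in with the same buffer discipline; NOT the real function. */
static char *demo_tag_as_str(unsigned atom)
{
    static char buf[32];
    snprintf(buf, sizeof buf, "TAG%u", atom);
    return buf;
}

/* Copies the string for atom into caller-owned storage.
 * Returns 0 on success or -1 if dst is too small. */
static int keep_tag_str(unsigned atom, char *dst, size_t size)
{
    const char *s;
    size_t len;

    assert(dst != NULL);
    s = demo_tag_as_str(atom);
    len = strlen(s);
    if (len >= size)
        return -1;
    memcpy(dst, s, len + 1);
    return 0;
}
```

Every call to the stand-in returns the same pointer, so only the copy made by keep_tag_str survives the next conversion.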

analysis_thread/struct TMorfoAnalysis [ Structures ]

NAME

struct TMorfoAnalysis -- a structure to hold the context of the analysis

ATTRIBUTES

See below.

FUNCTION

The structure holds the analyzer, state, configuration and results of the current analysis.

SOURCE

struct TMorfoAnalysis {
    struct TMorfoAnalyzer *ma;
    struct TVTriePage *pg;              /* the currently traced page of the analysis tree  */
    struct TVTriePage *pr;              /* the currently traced page of the prefix tree  */
    struct TGrowingStack char_stack;    /* repository to store found lemmas */
    struct TGrowingStack lresults;      /* of struct TRLemma */
    struct TGrowingStack tresults;      /* of struct TRTag */
    struct TGrowingStack lc;            /* of char to store lowercased form */
    struct TVHT_nat lht;                /* lemma hash table (to group results by lemma) */

    /* variables stored here to save space on the stack while doing a recursion */
    const char *form_str;               /* points to the beginning of the form with possible prefix removed (nej^rychlejší) */
    const char *form_end;               /* points to the next character of the form that ought to be processed */
    const char *lemma_p;                /* points to the beginning of the whole form (prefixes included) */
    uint16_t lemma_p_len;               /* the length of the recognized prefix */
    uint16_t lemma_p_type;              /* the type of the recognized prefix */
    uint8_t  neg;                       /* indicates the presence of the ne- prefix */
    uint8_t  grade;                     /* indicates the presence of the nej- prefix */

    TMorfoAnalysisMergeTags merge_tags; /* the method of the tags filtering */
};

analysis_thread/struct TRLemma [ Structures ]

NAME

struct TRLemma -- a structure to store a part of the analysis joined with a particular lemma

ATTRIBUTES

lm_idx -- the index to the ms->char_stack to the beginning of the lemma
tag_list -- the index in the ms->tresults repository of the head of the list of struct TRTag
attr_atom -- references the first attribute blob; the scheme is ms->ma->string_rep[ ms->ma->attr_idx[atom] ]. The blob begins with its length. Then the flags are concatenated, separated by a zero byte. A double zero byte separates the flag categories.

SOURCE

struct TRLemma {
    natural lm_idx;
    natural tag_list;
    natural attr_atom;  /* uint16_t is enough to hold attr_atom */
};

analysis_thread/struct TRTag [ Structures ]

NAME

struct TRTag -- a list item structure storing a tag as a part of the analysis result

ATTRIBUTES

atom -- the tag representation
next -- the next item index or NATURAL_MAX at the end
src -- the analysis source attribute

SOURCE

struct TRTag {
    natural atom;
    natural next;
    uint8_t src;  /* of enum TRTagSrc */
};
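
A list of such items is walked by following the next indices until NATURAL_MAX is reached. The sketch below is hypothetical: it assumes that natural is an unsigned int and that NATURAL_MAX equals UINT_MAX, neither of which is shown in this document, and it uses a plain array in place of the ms->tresults stack.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned int natural;   /* assumption: the real typedef is not shown */
#define NATURAL_MAX UINT_MAX    /* assumption: value of the end-of-list mark */

struct TRTag {
    natural atom;
    natural next;
    uint8_t src;  /* of enum TRTagSrc */
};

/* Counts the tags in the list that starts at head.
 * tags[] stands in for the contents of ms->tresults. */
static size_t tag_list_length(const struct TRTag *tags, size_t ntags, natural head)
{
    size_t n = 0;
    natural i;

    for (i = head; i != NATURAL_MAX; i = tags[i].next) {
        assert(i < ntags); /* every link must point inside the repository */
        n++;
    }
    return n;
}
```

The same loop shape serves for visiting each tag of a lemma, starting from the tag_list index stored in struct TRLemma.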