Intro

Russian, as a fusional language, has rich inflection. As a consequence, a tagset capturing its morphological features is necessarily large. A natural way to make it manageable is to use a structured system. We have decided to use a positional tagset, inspired by the Czech Positional Tagset (Hajic, 2004). We have used preliminary versions of this tagset in our previous work (e.g., Hana et al. (2004); Feldman (2006); Feldman and Hana (2010)).

See our LREC 2010 paper and poster for more details. (The poster contains some errors; they are corrected on this page, see the history below.)
Jirka Hana and Anna Feldman (2010). A Positional Tagset for Russian. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010). European Language Resources Association. pp. 1277-1284.
Abstract: Fusional languages have rich inflection. As a consequence, tagsets capturing their morphological features are necessarily large. A natural way to make a tagset manageable is to use a structured system. In this paper, we present a positional tagset for describing morphological properties of Russian. The tagset was inspired by the Czech positional system (Hajic, 2004). We have used preliminary versions of this tagset in our previous work (e.g., Hana et al. (2004, 2006); Feldman (2006); Feldman and Hana (2010)). Here, we both systematize and extend these preliminary versions (by adding information about animacy, aspect and reflexivity); give a more detailed description of the tagset and provide comparisons with the Czech system.
BibTeX:
@inproceedings{hana:feldman:2010:lrec,
	title = {A Positional Tagset for Russian},
	author = {Jirka Hana and Anna Feldman},
	year = {2010},
	booktitle = {Proceedings of the 7th International Conference on Language Resources and Evaluation ({LREC} 2010)},
	publisher = {European Language Resources Association},
	address = {Valletta, Malta},
	pages = {1278--1284},
	isbn = {2-9517408-6-7},
}
    

History

Click here to see the history of important changes to this document.

Positions

Pos  Abbr  Name                              Nr. of values  Values
1    p     Part of Speech                    12             NAPCVDRJITZX
2    s     SubPOS (Detailed Part of Speech)  42             N ACGcU P5SDQqWwZz =}nrjuav Bfie bg FRV ,^ I T #: 0X
3    g     Gender                            4              FMNX
4    y     Animacy                           3              AIX
5    n     Number                            3              SPX
6    c     Case                              7              123467X
7    f     Possessor's Gender                4              FMNX
8    m     Possessor's Number                2              SP
9    e     Person                            4              123X
10   r     Reflexivity                       2              IR
11   t     Tense                             4              FPRX
12   b     Verbal aspect                     3              PIX
13   d     Degree of comparison              3              123
14   a     Negation                          2              AN
15   v     Voice                             2              AP
16   i     Variant, Abbreviation             7              -1235678
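
Because every tag has exactly 16 characters and each character sits in a fixed slot, a tag can be read off position by position. The following is a minimal sketch of that idea, not part of any released tool; the dictionary keys and the function name `decode_tag` are hypothetical names chosen for illustration.

```python
# Position names in tag order, following the Positions table above.
# The key names themselves are illustrative, not an official naming scheme.
POSITIONS = [
    "pos", "subpos", "gender", "animacy", "number", "case",
    "possessor_gender", "possessor_number", "person", "reflexivity",
    "tense", "aspect", "degree", "negation", "voice", "variant",
]


def decode_tag(tag: str) -> dict[str, str]:
    """Map each of the 16 characters of a tag to the name of its position."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return dict(zip(POSITIONS, tag))


decoded = decode_tag("AAFIS4------1A--")
print(decoded["gender"], decoded["case"], decoded["degree"])  # F 4 1
```

A value of - in the result means that the position does not apply to the word in question.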

Part of Speech
Most POS values are traditional and include nouns, adjectives, verbs, prepositions, etc. However, there are some other distinctions as well. Participles in Russian behave similarly to adjectives: they agree with the nouns they modify in gender, number, and case, and their forms are identical to those of adjectives. Since their syntactic distribution is different, they are distinguished from adjectives in the SubPOS position. Gerunds (“verbal adverbs”) are treated as a special form of verbs and are likewise distinguished in the SubPOS position.
Detailed Part of Speech
This position specifies the POS in more detail. For example, it distinguishes long and short adjectives; various types of pronouns (personal, possessive, demonstrative, interrogative, and so on); finite verbs and infinitives; and so on. One might object that these distinctions do not all belong to the same linguistic level. We agree with this criticism. However, the design is a compromise between the size of the tagset and the ability to identify fine-grained distinctions that are extremely rare in the language and perhaps do not deserve a separate slot in the tag. This is a price of using a positional tagset, and one we are ready to pay.
Animacy
Animacy manifests itself only in the accusative masculine singular and the accusative plural of all genders. For nouns, we consider it a lexical feature, on a par with gender, and thus mark it for all forms. It is also encoded for all noun modifiers that have different forms depending on the animacy of the noun they modify. Therefore, certain adjectives, pronouns, and numerals have animate and inanimate forms in acc.masc.sg. and acc.pl., and a single form (tagged with the wildcard X value) otherwise.
Gender
The gender position stands for grammatical gender, which captures both lexical gender of nouns and agreement gender of adjectives, pronouns, numerals and verbs.
Possessor’s Gender & Number
The tagset distinguishes two number slots (and, similarly, two gender slots): one for the agreement number and one for the possessor's number. In the following example, našu has S as its agreement number (there is one photograph) and P as the possessor's number:
On kupil našu staruju fotografiju.
he bought our fem.sg old fem.sg photograph fem.sg
PPM-S1--3I------ VBM-S----IR----- PSFIS4-P1I------ AAFIS4------1A-- NNFIS4-------A--
‘He bought the old photograph of ours.’
Reflexivity
This position captures the traditional category of reflexivity: a form is either irreflexive (I) or reflexive (R).
Tense
This position encodes morphological tense. For example, morphologically present verbs used to express the past are annotated as being in the present tense (P), and an infinitive in the compound future tense is tagged as not distinguishing tense (-).
Verbal aspect
Aspect is encoded (at least to some extent) in the verb morphology of Russian, mostly by prefixes. Most linguists prefer, more or less confidently, to categorize Russian aspect as a derivational category (Karcevski, 1927; Ruzicka, 1952; Dahl, 1985; Bermel, 1997); only very few claim that aspect is an inflectional category (e.g. Isačenko, 1968).
Negation
The negation slot refers to the presence (value N) or absence (A) of the negative prefix ne for open-class words. For pronouns, the slot always has the value -. Words that are not negated synchronically do not have N in this slot (they may still have negative semantics, but the initial ne is no longer a morphological prefix); for example, nenavist’ ‘hate’ is tagged as NNFIS1-------A--. All adjectives, including participles, allow such negation, at least in theory.
Variant, Abbreviation
The main function of the last slot is to enable unique generation of forms, i.e. to ensure that a lemma with a tag corresponds to a single form. Therefore, if a particular combination of morphological categories can be expressed by more than one form of a single lemma, the variant slot can be used to distinguish between them. The values are assigned to forms based on their register (standard, colloquial, or archaic) and frequency (common vs. rare). Unlike for the other slots, however, these are just basic guidelines; strictly speaking, the assignment is arbitrary. Specifying this position is optional: in applications where such a distinction between forms is not needed, or is even undesirable, all forms should be assigned the basic variant (-). Note that this slot is used only to distinguish variants of forms of a single lemma, not to provide information about the register or frequency of lemmas. Therefore, forms of a colloquial or archaic lemma are assigned the basic variant value.
The final slot serves one more function: it marks abbreviations as such. In theory, we could have introduced a dedicated slot. However, because there is very little need to distinguish variants of abbreviations (abbreviations rarely, if ever, inflect), this would make the tagset more complex without bringing much benefit. This is also how abbreviations are marked in the Czech tagset. An abbreviation could be seen as a form of a lemma (e.g. gr. being a form of graždanin ‘citizen’). However, because the abbreviation is not really an inflection of the lemma and because many words can be abbreviated in several ways, we decided to use the abbreviation itself as its lemma.
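
For applications that do not need the variant distinction, collapsing all forms to the basic variant could look roughly like the sketch below. The function name `to_basic_variant` and the `keep_abbreviation` switch are hypothetical; in particular, whether the abbreviation marker 8 should survive the normalization is an assumption made here, not something the tagset prescribes.

```python
ABBREVIATION = "8"  # value of the last slot that marks abbreviations


def to_basic_variant(tag: str, keep_abbreviation: bool = True) -> str:
    """Reset the variant slot (position 16) of a tag to the basic variant '-'.

    Assumption of this sketch: the abbreviation marker is kept by default,
    because it records a different kind of information than the variants.
    """
    if len(tag) != 16:
        raise ValueError("a positional tag has exactly 16 characters")
    if keep_abbreviation and tag[-1] == ABBREVIATION:
        return tag
    return tag[:-1] + "-"


print(to_basic_variant("NNFIS1-------A-2"))  # NNFIS1-------A--
```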

Values

Position 1 - POS
A Adjective
C Numeral
D Adverb
I Interjection
J Conjunction
N Noun
P Pronoun
V Verb
R Preposition
T Particle
X Unknown, special use
Z Punctuation
Position 2 - SubPOS
N N: Noun
A A: Adjective (long, non-participle) (xorošij, ploxoj)
C A: Short adjective (non-participle) (surov, krasiv)
G A: Participle, active or long passive (čitajuščij, čitavšij, pročitavšij, čitaemyj; but not pročitannyj (AA), pročitan (Ac))
c A: Short passive participle (pročitan)
U A: Possessive adjective (mamin, oveč'ju)
P P: Personal pronoun (ja, my, ty, vy, on, ona, ono, oni, sebja)
5 P: 3rd person pronoun in prepositional forms (nego, nej, ...)
S P: Possessive pronoun (moj, ego, svoj, ..)
D P: Pronoun demonstrative (ètot, tot, sej, takoj, èkij, ... )
Q P: Relative/interrogative pronoun with nominal declension (kto, čto)
q P: Relative/interrogative pronoun with adjectival declension (kakoj, kotoryj, čej, ...)
W P: Negative pronoun with nominal declension (ničto, nikto)
w P: Negative pronoun with adjectival declension (nikakoj, ničej)
Z P: Indefinite pronoun with nominal declension (kto-to, kto-nibud', čto-to, ...)
z P: Indefinite pronoun with adjectival declension (samyj, ves', ...)
= C: Number written using digits
} C: Number written using Roman numerals (XIV)
n C: Cardinal numeral (odin, tri, sorok)
r C: Ordinal numeral (pervyj, tretij)
j C: Generic/collective numeral (dvoje, četvero)
u C: Interrogative numeral (skol'ko)
a C: Indefinite numeral (mnogo, neskol'ko)
v C: Multiplicative numeral (dvaždy, triždy)
B V: Verb in present, past or rarely future form (čitaju, splju, pišem, spal, ždal)
f V: Infinitive (delat', spat')
i V: Imperative (spi, sdelaj, pročti)
e V: Gerund (delaja; pridja, otpisav)
b D: Adverb without a possibility to form negation and degrees of comparison (vverxu, vnizu, potom)
g D: Adverb forming negation and degrees of comparison (vysoko, daleko)
F R: Part of a preposition; never appears isolated (nesmotrja)
R R: Nonvocalized preposition (ob, pered, s, v, ...)
V R: Vocalized preposition (obo, peredo, so, vo, ...)
, J: Subordinate conjunction (esli, čto, kotoryj)
^ J: Non-subordinate conjunction (i, a, xotja, pričem)
I I: Interjection (oj, aga, m-da)
T T: Particle (li)
# Z: Sentence boundary
: Z: Punctuation
0 X: Part of a multiword foreign phrase
X X: Unknown, special use
Position 3 - Gender Distinguished for: N, A{ACGUc}, P{P5DLwSq8}, C{nra}, VB
F Feminine
M Masculine
N Neuter
X Any gender
Position 4 - Animacy Distinguished for: N, A{AGU}, P{SDwqz}, C{nrja}
A Animate
I Inanimate
X Either
Position 5 - Number    Distinguished for: N, A{ACGUc}, P{P5DwSq}, C{nra}, VB
P Plural
S Singular
X Any number
Position 6 - Case Distinguished for: N, A{AGU}, P, C{nrjua}
1 Nominative
2 Genitive
3 Dative
4 Accusative
6 Locative
7 Instrumental
X Any case
Position 7 - Possessor's Gender Distinguished for: PS, AU
F Feminine possessor
M Masculine possessor
N Neuter possessor
X Possessor of any gender
Position 8 - Possessor's Number Distinguished for: PS
P Plural possessor
S Singular possessor
Position 9 - Person Distinguished for: P{P5S}, V{Bi}
1 1st person
2 2nd person
3 3rd person
X Any person
Position 10 - Reflexivity Distinguished for: AG, P{P5S}, V
I Irreflexive
R Reflexive
Position 11 - Tense Distinguished for: A{G}, V{Bp}
F Future
P Present
R Past
X Any (Past, Present, or Future)
Position 12 - Aspect Distinguished for: AG, V
P Perfective
I Imperfective
X Either aspect
Position 13 - Degree of comparison Distinguished for: AA, Dg
1 Positive
2 Comparative
3 Superlative
Position 14 - Negation Distinguished for: N, A, Dg
A Affirmative (not negated)
N Negated
Position 15 - Voice Distinguished for: AG, Ac
A Active
P Passive
Position 16 - Variant Distinguished for: As needed
- Basic variant
1 Variant (generally less frequent)
2 Variant (generally rarely used, bookish, or archaic)
3 Variant (very archaic)
5 Variant (colloquial)
6 Variant (colloquial, generally less frequent)
7 Variant (colloquial, generally less frequent)
8 Abbreviations
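
The value inventory above lends itself to a simple character-level sanity check of tags. The sketch below copies the allowed values from the Positions table; the names `ALLOWED` and `invalid_positions` are hypothetical. It assumes that - (not applicable) is acceptable in any position except the first two, and it deliberately does not check the "Distinguished for" restrictions, i.e. which positions are meaningful for which SubPOS.

```python
# Allowed values per position, copied from the Positions table.
ALLOWED = [
    "NAPCVDRJITZX",                               # 1  POS
    "NACGcUP5SDQqWwZz=}nrjuavBfiebgFRV,^IT#:0X",  # 2  SubPOS
    "FMNX",     # 3  gender
    "AIX",      # 4  animacy
    "SPX",      # 5  number
    "123467X",  # 6  case
    "FMNX",     # 7  possessor's gender
    "SP",       # 8  possessor's number
    "123X",     # 9  person
    "IR",       # 10 reflexivity
    "FPRX",     # 11 tense
    "PIX",      # 12 aspect
    "123",      # 13 degree of comparison
    "AN",       # 14 negation
    "AP",       # 15 voice
    "1235678",  # 16 variant, abbreviation
]


def invalid_positions(tag: str) -> list[int]:
    """Return the 1-based positions whose character is not an allowed value."""
    if len(tag) != len(ALLOWED):
        raise ValueError(f"expected {len(ALLOWED)} characters, got {len(tag)}")
    bad = []
    for index, (char, allowed) in enumerate(zip(tag, ALLOWED), start=1):
        # Assumption: '-' (N/A) may appear anywhere except POS and SubPOS.
        not_applicable = char == "-" and index > 2
        if char not in allowed and not not_applicable:
            bad.append(index)
    return bad


print(invalid_positions("NNFIS1-------A--"))  # []
print(invalid_positions("NNFIS5-------A--"))  # [6], since 5 is not a case value
```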

Overview of the tagset

The following table provides an overview of the Russian tagset by POS. A template denotes a set of tags. Roman letters refer to particular values, while italics denote variables. Thus, for example, to obtain the set of tags corresponding to the template NNgync-------a--, one needs to instantiate all the possible combinations of the g (gender), y (animacy), n (number), c (case), and a (negation) variables. In this case, g ∈ {F, M, N, X}, y ∈ {A, I, X}, n ∈ {P, S, X}, c ∈ {1, 2, 3, 4, 6, 7, X}, a ∈ {A, N}. A variable never stands for the - (N/A) value. If a single SubPOS allows a particular position to have both the N/A value and other values, we list them as separate templates.

template  description  sample word  sample tag
N - Nouns
NNgync-------a-- noun golos NNMIS4-------A--
A - Adjectives (incl. Participles)
AAgync------da-- long adjective tjaželyj AAMIS4------1A--
ACg-n--------a-- short adjective krasiv ACM-S--------A--
AGgync---rtb-av- long participle čitajuščij AGMXS1---IPI-AA-
tv ∈ {PA,RA,XP} pročitavšij AGMXS1---IRP-AA-
i.e. present/past active, passive čitavšij AGMXS1---IRI-AA-
čitaemyj AGMXS1---IXI-AP-
AUgyncf------a-- possessive adjective mužnin AUMXS2M------A--
Acg-n----r-P-aP- pass.perf.short participle pročitan AcM-S----I-P-AP-
P - pronoun
PP--nc--eI------ personal pronoun, e ∈ {1,2} nam PP--P3--1I------
PPg-nc--3I------ personal pronoun 3rd person on PPM-S1--3I------
PP---c---R------ personal reflexive sebja sebja PP---4---R------
P5g-nc--3I------ personal p. in prep. forms nego P5M-S2--3I------
PDgync---------- demonstrative ètu PDFXS4----------
PW---c---------- negative (nominal declension) ničto PW---1----------
Pwgync---------- negative (adj declension) nikakoj PwMXS1----------
PSgync-meI------ possessive, e ∈ {1,2} moja PSFXS1-S1I------
PSXXXXfm3I------ possessive ego PSXXXXMS3I------
PSgync---R------ possessive reflexive svoj PSMXS1---R------
PQ---c---------- relative/interrogative (nom decl) čto, kto PQ---1----------
Pqgync---------- relative/interrogative (adj decl) kakoj PqMXS1----------
PZ---c---------- indefinite (nominal declension) kogo-to PZ---4----------
Pzgync---------- indefinite (adjectival declension) kakoj-to PzMXS1----------
C - Numeral
C=-------------- numbers (using digits) 3.14 C=--------------
C}-------------- roman numeral XVII C}--------------
Cngync---------- cardinal numeral 1 odnomu CnMXS3----------
Cngy-c---------- cardinal numeral 2, poltora dvux CnMX-2----------
Cn-y-c---------- cardinal numeral 3, 4 trëx Cn-A-4----------
Cn---c---------- cardinal numeral 5+ pjati Cn---2----------
Crgync---------- ordinal pervyj CrMXS1----------
Cj-y-c---------- generic/collective numeral dvoix Cj-A-3----------
Cu---c---------- interrogative skol'ko Cu---X----------
Ca---c---------- indefinite numeral neskol'ko Ca---1----------
Cagync---------- indefinite num. (adj decl.) mnogomu CaMXS3----------
Cv-------------- multiplicative triždy Cv--------------
V - verb
VB--n---ertb---- present; (rarely fut.) finite form otryvaeš' VB--S---2IPI----
VBg-n----rRb---- past tense čital VBM-S----IRI----
Ve-------r-b---- gerund grozja Ve-------I-I----
napisav Ve-------I-P----
Vf-------r-b---- infinitive spat' Vf-------I-I----
Vi--n---er-b---- imperative rabotaj Vi--S---2I-I----
D - Adverb
Db-------------- adv. not forming negation/degrees tam Db--------------
Dg----------da-- adv. forming negation/degrees sil'nee Dg----------2A--
R - Preposition
RR---c---------- nonvocalized prep. with c case nad RR---7----------
RV---c---------- vocalized prep. with c case nado RV---7----------
RF-------------- part of a multiword prep. nesmotrja RF--------------
J - Conjunction
J^-------------- coordinating conj. i J^--------------
J,-------------- subordinating conj. čto J,--------------
T - particle
TT-------------- particle net TT--------------
I - Interjection
II-------------- Interjection II--------------
Z - punctuation
Z#-------------- Sentence boundary Z#--------------
Z:-------------- Punctuation ! Z:--------------
X - special
X0-------------- part of a multiword foreign phrase X0--------------
XX-------------- unknown XX--------------
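
The template notation used in this overview can be instantiated mechanically. Since italics are lost in plain text, the sketch below assumes that a character is a variable exactly when it equals the abbreviation letter of its own position (g in position 3, y in position 4, and so on); this works because no position has its own abbreviation letter among its values. The names `ABBR`, `DOMAINS` and `expand_template` are hypothetical, and extra restrictions stated next to a template (such as e ∈ {1,2} or tv ∈ {PA,RA,XP}) are not applied.

```python
from itertools import product

# Abbreviation letter of each of the 16 positions, in order.
ABBR = "psgyncfmertbdavi"

# Values a variable may take; '-' (N/A) is never among them.
DOMAINS = {
    "g": "FMNX", "y": "AIX", "n": "PSX", "c": "123467X",
    "f": "FMNX", "m": "SP", "e": "123X", "r": "IR",
    "t": "FPRX", "b": "PIX", "d": "123", "a": "AN", "v": "AP",
}


def expand_template(template: str) -> list[str]:
    """Instantiate every variable of a template with all of its values."""
    if len(template) != len(ABBR):
        raise ValueError(f"expected {len(ABBR)} characters, got {len(template)}")
    choices = []
    for char, abbr in zip(template, ABBR):
        if char == abbr and abbr in DOMAINS:
            choices.append(DOMAINS[abbr])  # variable: all of its values
        else:
            choices.append(char)           # literal value or '-'
    return ["".join(combination) for combination in product(*choices)]


noun_tags = expand_template("NNgync-------a--")
print(len(noun_tags))  # 504 = 4 genders * 3 animacy * 3 numbers * 7 cases * 2 negation
```

The resulting set is an upper bound on what a template describes: combinations ruled out by additional restrictions would still have to be filtered out.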

Additional restrictions

Additional Notes

Numerals
Participles
Tag abbreviations
Sometimes it is convenient to abbreviate tags (e.g. in manuals for annotators, or when tags are entered manually). We suggest the conventions illustrated by the following examples:
noun          NNgync      NNFIS1 = NNFIS1-------A--
adjective     Asgync      AAXXXX = AAXXXX------1A--
adverb        Db          Db = Db--------------
              Dg          Dg = Dg----------1A--
              Dgd         Dg2 = Dg----------2A--
preposition   RRc         RR7 = RR---7----------
              RVc
conjunction   J^          J^ = J^--------------
              J,
particle      TT          TT = TT-------------- (similarly II, Z#, Z:, X0, XX, ...)
noun abbr.    NNgyXX-8    NNFIXX-8 = NNFIXX-------A-8

Downloads

Support

Development of this tagset has been partially supported by: