Principal investigator (ÚFAL): 
Grant id: 
ÚFAL budget: 
100k CZK in 2014

Čapek GAUK

An alternative way of getting more annotated linguistic data

The purpose of the project “An alternative way of getting more annotated linguistic data” is to verify whether sentence diagrams made by schoolchildren can be used to enlarge the training data for automatic syntactic analysis. The object language is Czech, and sentence diagrams are taught according to the national curriculum. The target linguistic data correspond to the annotation schemes of the Prague dependency treebanks.

The performance of supervised learning techniques generally improves with the size of the training data: the more annotated data, the better. In natural language processing tasks, the training texts are enriched with linguistic annotation. Because the annotation process is very resource-intensive, we have been seeking alternative ways of annotating data faster.

In Czech schools, sentence diagrams are an obligatory part of language lessons. We ask whether it is possible to collect these diagrams and transform them into the annotation schemes of language corpora, thereby enlarging the training data. In the initial phase, a prototype sentence diagramming editor, Čapek, was implemented, and an initial set of transformation rules was created. A collection of annotated sentences for testing the rules was created in cooperation with teachers and schoolchildren. The proposed project will be a continuation of this pilot study.