The field of comparative syntax aims to develop a theoretical model of the syntactic properties that all languages have in common and of the range and limits of syntactic variation. Massive automatic comparison of languages in parallel corpora can greatly speed up and enhance the development of such a model. In this talk I will present results obtained so far and briefly touch on ideas for future research.
First I will discuss a preprocessing tool that selects parallel sentence pairs suitable for comparative syntactic research by filtering out pairs that are syntactically too different. Results were obtained through experiments on Dutch, German and English, and suggest that a graph edit distance computed on parse trees yields the best results.
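To make the filtering idea concrete, the following is a minimal sketch and not the tool described above. It assumes that both sentences of a pair have already been parsed into trees over a shared label set, and it substitutes a simple ordered-tree edit distance with unit costs for the graph edit distance mentioned above. The tree encoding, the function names, the normalisation by tree size, and the threshold of 0.3 are all hypothetical choices made for illustration.

```python
from functools import lru_cache

# A tree is a nested tuple: (label, (child, child, ...)).
# This encoding is an assumption made for the sketch.

def size(tree):
    """Number of nodes in a tree."""
    _, children = tree
    return 1 + sum(size(child) for child in children)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(size(t) for t in f2)   # insert everything in f2
    if not f2:
        return sum(size(t) for t in f1)   # delete everything in f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the root of the rightmost tree in f1; its children join the forest.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the root of the rightmost tree in f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, relabelling if their labels differ.
    match = (forest_distance(kids1, kids2)
             + forest_distance(f1[:-1], f2[:-1])
             + (label1 != label2))
    return min(delete, insert, match)

def tree_distance(t1, t2):
    return forest_distance((t1,), (t2,))

def select_pairs(pairs, max_normalized_distance=0.3):
    """Keep pairs whose size-normalised distance is at most the (hypothetical) threshold."""
    kept = []
    for src, tgt in pairs:
        normalized = tree_distance(src, tgt) / (size(src) + size(tgt))
        if normalized <= max_normalized_distance:
            kept.append((src, tgt))
    return kept

# Toy trees, invented for illustration only.
t1 = ("S", (("NP", ()), ("VP", ())))
t2 = ("S", (("NP", ()), ("VP", (("V", ()),))))
far = ("X", (("Y", ()), ("Z", ()), ("W", ()), ("Q", ())))
selected = select_pairs([(t1, t2), (t1, far)])
```

In this toy run only the first pair survives the filter. In practice the distance measure, the cost model and the threshold would have to be tuned per language pair.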
I will furthermore discuss recent results in extracting syntactic differences from parallel corpora. We build on the method of Wiersma et al. (2011) and apply the Minimum Description Length principle to this task. After mining for characteristic part-of-speech patterns by compressing the data, we extract differences between languages in the distribution of the patterns found. Results were obtained through experiments on Dutch, English and Czech, and show useful and meaningful differences that can guide linguists in their comparative syntactic research.
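The second step, comparing how patterns are distributed across two languages, can be sketched as follows. This is an illustration under stated assumptions and not the method presented in the talk: plain part-of-speech n-gram counting stands in for the compression-based pattern mining, both languages are assumed to share one tagset, and all function names and the toy sentences are hypothetical.

```python
from collections import Counter

def ngram_patterns(tagged_sentences, n=2):
    """Count contiguous part-of-speech n-grams (stand-in for mined patterns)."""
    counts = Counter()
    for tags in tagged_sentences:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts

def relative_frequencies(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pattern: count / total for pattern, count in counts.items()}

def distribution_differences(sents_a, sents_b, n=2):
    """Rank patterns by the absolute difference in relative frequency.

    A positive value means the pattern is relatively more frequent in
    language A, a negative value that it is more frequent in language B.
    """
    freq_a = relative_frequencies(ngram_patterns(sents_a, n))
    freq_b = relative_frequencies(ngram_patterns(sents_b, n))
    patterns = set(freq_a) | set(freq_b)
    diffs = {p: freq_a.get(p, 0.0) - freq_b.get(p, 0.0) for p in patterns}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)

# Toy tag sequences, invented for illustration only.
lang_a = [["DET", "ADJ", "NOUN"]]
lang_b = [["DET", "NOUN", "ADJ"]]
ranking = distribution_differences(lang_a, lang_b)
```

On real corpora a significance test or a compression-based score would be needed to separate genuine syntactic differences from sampling noise; the raw frequency difference used here is only the simplest possible ranking.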