mrg2fs.pl and negra2fs.pl -- readme
+++++++++++++++++++++++++++++++++++

The scripts mrg2fs.pl and negra2fs.pl convert treebanks in other formats
to the FS format.

Supported formats
=================

Penn Treebank
-------------

The Penn Treebank format is handled by the script mrg2fs.pl:

  ( (S
      (NP-SBJ
        (NP (NNP Pierre) (NNP Vinken) )
        (, ,)
        (ADJP
          (NP (CD 61) (NNS years) )
          (JJ old) )
        (, ,) )
      (VP (MD will)
        (VP (VB join)
          (NP (DT the) (NN board) )
          (PP-CLR (IN as)
            (NP (DT a) (JJ nonexecutive) (NN director) ))
          (NP-TMP (NNP Nov.) (CD 29) )))
      (. .) ))

Line breaks inside a sentence are not significant, as long as every
sentence starts on a separate line.

Negra
-----

The Negra export format is handled by the script negra2fs.pl. The
introductory section of the export file (BOT ... EOT) is ignored; only
the sentences (BOS ... EOS) are converted. Secondary edges are ignored
as well.

  %% word         tag    morph            edge  parent  secedge  comment
  #BOS 1 1 985275570 1
  Mögen           VMFIN  3.Pl.Pres.Konj   HD    508
  Puristen        NN     Masc.Nom.Pl.*    NK    505
  aller           PIDAT  *.Gen.Pl         NK    500
  Musikbereiche   NN     Masc.Gen.Pl.*    NK    500
  auch            ADV    --               MO    508
  ...

Execution
=========

Both scripts share the same command line format and options:

  {mrg,negra}2fs.pl flags file

The input files are processed one by one and converted to the FS
format. Each output file has the same name as its input file, with the
suffix .fs appended.

  -d directory
      Directory in which to save the converted files.
      Default: current directory.

  -m output_file_name
      Merge all input files and write the result to output_file_name.

  -n number
      Use in combination with -m. Limits the number of trees allowed in
      one output file to number; additional files are created for the
      remaining trees. In this case output_file_name should contain %d
      or a similar sprintf conversion specifier, which is replaced by
      the output file number.
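
To illustrate how -m and -n are described as interacting, the following is a minimal Python sketch of the naming scheme, not the scripts' actual code. The function name chunk_output_names, the pattern all_%03d.fs and the first_index parameter are hypothetical; in particular, this readme does not say whether file numbering starts at 0 or 1, so starting at 1 is an assumption.

```python
def chunk_output_names(pattern, n_trees, limit, first_index=1):
    """Return (file_name, tree_count) pairs for a merged, size-limited output.

    pattern     -- output file name with a sprintf-style specifier, e.g. "all_%03d.fs"
    n_trees     -- total number of trees across all input files
    limit       -- maximum number of trees per output file (the -n value)
    first_index -- number of the first output file (ASSUMPTION: not stated in the readme)
    """
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    result = []
    index = first_index
    remaining = n_trees
    while remaining > 0:
        count = min(limit, remaining)
        # Python's % operator expands %d / %03d much like sprintf does.
        result.append((pattern % (index,), count))
        index += 1
        remaining -= count
    return result


if __name__ == "__main__":
    for name, count in chunk_output_names("all_%03d.fs", 250, 100):
        print(name, count)
```

Under these assumptions, 250 trees with a limit of 100 would be spread over three files, the last one holding the remaining 50 trees. A pattern without a specifier raises a TypeError in this sketch, which mirrors why the output file name needs %d when -n is used: otherwise every chunk would map to the same name.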