This section describes the command line tool. The REST API is described on the API Reference page.
The simplest way to run SouDeC is to supply plain text on the standard input and read the result in the txt format from the standard output:
./soudec.pl --stdin
The input is assumed to be UTF-8 encoded and can be either plain text or, with the --input-format presegmented switch, pre-segmented text with one sentence per line.
The following command runs SouDeC on a presegmented (one sentence per line) plain text file and writes the result in the CoNLL-U format to the standard output:
./soudec.pl --input-file [input_file_name] --input-format presegmented --output-format conllu
The result in the selected output format always goes to the standard output. Additionally, for logging purposes, the result can be stored to a file in the format given by --store-format (txt, html or conllu). For example, the following command sends the result in HTML to the standard output and also stores the result in a file in the CoNLL-U format:
./soudec.pl --input-file [input_file_name] --output-format html --store-format conllu
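The invocations above can also be driven from a script. The following is a minimal Python sketch, not part of SouDeC itself: the helper names build_command and run_soudec are hypothetical, the script path ./soudec.pl is assumed to be relative to the current directory, and only the flags and format names are taken from this page. It also assumes that passing the default format values explicitly is harmless.

```python
import subprocess

# Format names as listed in the usage text; the helper names are hypothetical.
INPUT_FORMATS = {"txt", "presegmented"}
OUTPUT_FORMATS = {"txt", "html", "conllu"}


def build_command(script="./soudec.pl", input_file=None,
                  input_format="txt", output_format="txt", store_format=None):
    """Build the argument list for one SouDeC run (stdin is used if no file is given)."""
    if input_format not in INPUT_FORMATS:
        raise ValueError(f"unsupported input format: {input_format}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    if store_format is not None and store_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported store format: {store_format}")

    cmd = [script]
    if input_file is None:
        cmd.append("--stdin")
    else:
        cmd += ["--input-file", input_file]
    cmd += ["--input-format", input_format, "--output-format", output_format]
    if store_format is not None:
        cmd += ["--store-format", store_format]
    return cmd


def run_soudec(text=None, **options):
    """Run SouDeC and return its standard output as a string (UTF-8 in and out)."""
    result = subprocess.run(build_command(**options), input=text,
                            capture_output=True, encoding="utf-8", check=True)
    return result.stdout
```

For instance, run_soudec("Some text.", output_format="conllu") would feed the text through the standard input, while run_soudec(input_file="input.txt", output_format="html", store_format="conllu") corresponds to the last command above.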
Usage: soudec.pl [options]
options:
  -i|--input-file [input text file name]
  -a|--ann-file [manual annotation file name]
  -si|--stdin (input text provided via stdin)
  -if|--input-format [input format: txt (default) or presegmented]
  -p|--phrase-file [phrases reliability file name]
  -r|--reliability [minimal required phrase reliability]
  -of|--output-format [output format: txt (default), html, conllu]
  -os|--output-statistics (add SouDeC statistics to the output; if present, output is JSON with two items: data (in output-format) and stats (in HTML))
  -ne|--named-entities (add NameTag marks to classes in the output)
  -aa|--add-antecedent (add the antecedent if coreference is used to determine the class)
  -sf|--store-format [format: log the output in the given format: txt, html, conllu]
  -ss|--store-statistics (log SouDeC statistics to an HTML file)
  -v|--version (prints the version and ends)
  -h|--help (prints a short help and ends)
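With --output-statistics, the usage text says the output is JSON with two items, data and stats. The Python sketch below shows one way to separate them. It assumes the two items are top-level keys of a JSON object, which the usage text suggests but does not spell out. The function name split_output and the sample payload are made up for illustration.

```python
import json


def split_output(raw):
    """Split the JSON produced with --output-statistics into (data, stats)."""
    payload = json.loads(raw)
    # Assumption: "data" and "stats" are top-level keys of a JSON object.
    return payload["data"], payload["stats"]


# Invented sample payload; real output will differ.
sample = json.dumps({"data": "Some annotated text.", "stats": "<p>statistics</p>"})
data, stats = split_output(sample)
```

Here data holds the result in the selected output format and stats holds the statistics as HTML.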
The input format can be specified using the --input-format option. Currently supported input formats are:
txt (default): the input is plain text
presegmented: the input is presegmented plain text, i.e. each sentence is on a single line; empty lines mark paragraph breaks
The output format is specified using the --output-format option. Currently supported output formats are:
txt (default): the output is plain text; detected phrases are enclosed in @ marks and detected sources are enclosed in >> and << marks, followed by the class of the source in square brackets (one of five classes: anonymous, anonymous-partial, unofficial, official-non-political, official-political).
html: the output is HTML; detected phrases and sources are colour-marked, and each source is followed by its class in square brackets (one of five classes: anonymous, anonymous-partial, unofficial, official-non-political, official-political).
conllu: the CoNLL-U format, with information about detected phrases and detected and classified sources in a single column, the misc attribute. The item key is SD=, followed by P or S for a phrase or source, respectively, followed by an underscore and a numeric id (sources and phrases are numbered separately and independently). For sources, a source class follows after another underscore:
a for anonymous,
ap for anonymous-partial,
u for unofficial,
onp for official-non-political,
op for official-political.
For example, SD=P_3 marks the third detected phrase of the document, and SD=S_2_onp marks the second detected source in the document, classified as official-non-political. If a source consists of several tokens, all its tokens carry the same mark.
TODO