This section describes the command line tool. The REST API is described on the API Reference page.
The simplest way to run SouDeC is to supply plain text on the standard input; the result is written to the standard output in the txt format.
./soudec.pl --stdin
The input is assumed to be UTF-8 encoded. It can be either plain text or, with the --input-format presegmented switch, presegmented text, i.e. one sentence per line.
The following command runs SouDeC on a presegmented (sentence per line) plain text file and writes the result in the CoNLL-U format to the standard output.
./soudec.pl --input-file [input_file_name] --input-format presegmented --output-format conllu
The result in the selected output format always goes to the standard output. Additionally, for logging purposes, the result can be stored to a file in the format given by --store-format. For example, the following command sends the result in HTML to the standard output and also stores it in a file in the CoNLL-U format.
./soudec.pl --input-file [input_file_name] --output-format html --store-format conllu
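When SouDeC is driven from a script, the invocations above can be assembled programmatically. The following is a minimal Python sketch, assuming soudec.pl sits in the current directory. The helper names build_soudec_command and run_soudec and the file name input.txt are hypothetical; only options listed in the usage summary are used.

```python
import subprocess

def build_soudec_command(input_file, input_format="txt", output_format="txt",
                         store_format=None, script="./soudec.pl"):
    """Assemble the argument list for a SouDeC run (hypothetical helper)."""
    cmd = [script, "--input-file", input_file,
           "--input-format", input_format,
           "--output-format", output_format]
    if store_format is not None:
        # Also log the result in a second format.
        cmd += ["--store-format", store_format]
    return cmd

def run_soudec(cmd):
    """Run SouDeC and return its standard output, decoded as UTF-8 (assumed)."""
    completed = subprocess.run(cmd, capture_output=True, check=True,
                               encoding="utf-8")
    return completed.stdout

# Only builds and prints the command; call run_soudec(cmd) to execute it.
cmd = build_soudec_command("input.txt", output_format="html",
                           store_format="conllu")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.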
Usage: soudec.pl [options]
options: -i|--input-file [input text file name]
-a|--ann-file [manual annotation file name]
-si|--stdin (input text provided via stdin)
-if|--input-format [input format: txt (default), presegmented, conllu]
-p|--phrase-file [phrases reliability file name]
-r|--reliability [minimal required phrase reliability]
-of|--output-format [output format: txt (default), html, conllu]
-os|--output-statistics (format: add statistics to the output in the given format (html, tsv, or a comma-separated list thereof); if present, output is JSON with items: data (in output-format) and stats_html and/or stats_tsv)
-uil|--ui-language [language: localize the response whenever possible to the given language: en (default), cs]
-ne|--named-entities (add NameTag marks to classes in the output)
-aa|--add-antecedent (add the antecedent if coreference is used to determine the class)
-sf|--store-format [format: log the output in the given format: txt, html, conllu]
-ss|--store-statistics (format: log statistics in the given format ('html', 'tsv', or a comma-separated list thereof))
-ll|--logging-level (override the default (minimal) logging level (0=full, 1=limited, 2=minimal))
-e|--experimental (use the listed experimental features ('perspron', 'gen', or a comma-separated list thereof))
-v|--version (prints the version and ends)
-h|--help (prints a short help and ends)
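With --output-statistics, the output is described as JSON with the items data and stats_html and/or stats_tsv. A consumer might unpack it roughly as in the Python sketch here. It assumes that these items are top-level keys of a single JSON object, which the usage text suggests but does not spell out; the function name split_soudec_json and the sample string are made up for illustration.

```python
import json

def split_soudec_json(raw):
    """Split SouDeC JSON output into the main result and the statistics.

    Assumes (not confirmed by the usage text) that 'data', 'stats_html'
    and 'stats_tsv' are top-level keys of one JSON object.
    """
    payload = json.loads(raw)
    data = payload["data"]
    # Keep only the statistics items that are actually present.
    stats = {key: payload[key]
             for key in ("stats_html", "stats_tsv") if key in payload}
    return data, stats

# Made-up sample for illustration only; real statistics will look different.
sample = '{"data": "some output", "stats_tsv": "key\\tvalue"}'
data, stats = split_soudec_json(sample)
print(data)
print(sorted(stats))
```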
The input format can be specified using the --input-format option. Currently supported input formats are:
txt (default): the input is plain text
presegmented: the input is presegmented plain text, i.e. each sentence is on a single line; empty lines mark paragraph breaks
conllu: the input is in the CoNLL-U format, already parsed and annotated with NameTag; the calls to UDPipe and NameTag are skipped
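To illustrate the presegmented format, the following Python sketch renders already segmented text as one sentence per line with an empty line between paragraphs. The function name to_presegmented and the sample sentences are hypothetical; sentence segmentation itself is assumed to have been done elsewhere.

```python
def to_presegmented(paragraphs):
    """Render paragraphs (lists of sentences) as presegmented text.

    One sentence per line; an empty line separates paragraphs.
    """
    blocks = []
    for sentences in paragraphs:
        # Collapse inner whitespace so each sentence stays on one line.
        lines = [" ".join(s.split()) for s in sentences if s.strip()]
        if lines:
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"

text = to_presegmented([
    ["First sentence.", "Second sentence."],
    ["A new\nparagraph starts here."],
])
print(text, end="")
```

The resulting text can be saved as a UTF-8 file and passed to SouDeC with --input-file and --input-format presegmented.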
The output format is specified using the --output-format option. Currently supported output formats are:
txt (default): the output is plain text; detected phrases are enclosed in @ marks and detected sources are enclosed in >> and << marks, followed by the class of the source in square brackets (one of five classes: anonymous, anonymous-partial, unofficial, official-non-political, official-political).
html: the output is HTML; detected phrases and sources are colour-marked, and each source is followed by its class in square brackets (one of the same five classes: anonymous, anonymous-partial, unofficial, official-non-political, official-political).
conllu: the CoNLL-U format, with information about detected phrases and about detected and classified sources given in a single item of the MISC column. The item key is SD=, followed by P or S for a phrase or a source, respectively, then an underscore and a numeric id (sources and phrases are numbered separately and independently). For sources, another underscore and a source class follow:
a for anonymous,
ap for anonymous-partial,
u for unofficial,
onp for official-non-political,
op for official-political.
For example, SD=P_3 marks the third detected phrase in the document, and SD=S_2_onp marks the second detected source in the document, classified as official-non-political. If a source consists of several tokens, all its tokens carry the same mark.
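The SD marks can be decoded with a few lines of code. The Python sketch below reads the MISC column (the tenth, tab-separated column of a CoNLL-U token line, with items separated by |) and interprets the SD item. It assumes a token carries at most one SD item; the function names and the sample token line are made up for illustration.

```python
import re

SOURCE_CLASSES = {
    "a": "anonymous",
    "ap": "anonymous-partial",
    "u": "unofficial",
    "onp": "official-non-political",
    "op": "official-political",
}

# P_<id> for a phrase, S_<id>_<class> for a source.
SD_VALUE = re.compile(r"^(?:P_(\d+)|S_(\d+)_(a|ap|u|onp|op))$")

def parse_sd(misc):
    """Return (kind, id, class_name) for the SD item of a MISC field, or None."""
    if misc == "_":
        return None
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        if key != "SD" or not sep:
            continue
        match = SD_VALUE.match(value)
        if match is None:
            return None  # unrecognized SD value
        if match.group(1) is not None:
            return ("phrase", int(match.group(1)), None)
        return ("source", int(match.group(2)), SOURCE_CLASSES[match.group(3)])
    return None

def misc_of(token_line):
    """Return the MISC column of a CoNLL-U token line."""
    return token_line.rstrip("\n").split("\t")[9]

# Made-up token line for illustration only.
line = "1\tPolice\tpolice\tNOUN\t_\t_\t2\tnsubj\t_\tSpaceAfter=No|SD=S_2_onp"
print(parse_sd(misc_of(line)))
print(parse_sd("SD=P_3"))
```

Because all tokens of a multi-token source carry the same mark, grouping tokens by the returned kind and id would recover the full source span.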