SouDeC User's Manual

This section describes the command line tool. The REST API is described on the API Reference page.

1. Running SouDeC

The simplest way to run SouDeC is to supply plain text on the standard input and read the result in the txt format from the standard output.

./soudec.pl --stdin

The input is assumed to be UTF-8 encoded and can be either plain text or, with the --input-format presegmented switch, pre-segmented text, i.e. one sentence per line.

The following command runs SouDeC on a presegmented (one sentence per line) plain text file and writes the result in the CoNLL-U format to the standard output.

./soudec.pl --input-file [input_file_name] --input-format presegmented --output-format conllu

The result in the selected output format always goes to the standard output. Additionally, for logging purposes, the result can be stored to a file in the format given by --store-format (txt, html or conllu). For example, the following command sends the result in HTML to the standard output and also stores the result in a file in the CoNLL-U format.

./soudec.pl --input-file [input_file_name] --output-format html --store-format conllu
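
For scripted use, the invocations above can be assembled programmatically. The following is a minimal Python sketch, assuming soudec.pl sits in the current directory; build_soudec_command and run_soudec are hypothetical helper names, not part of SouDeC, and only options listed in this manual are used.

```python
import subprocess

FORMATS = ("txt", "html", "conllu")  # formats named in this manual


def build_soudec_command(input_file, output_format="txt", store_format=None,
                         presegmented=False, script="./soudec.pl"):
    """Return the argument list for one SouDeC run (hypothetical helper)."""
    if output_format not in FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    cmd = [script, "--input-file", input_file]
    if presegmented:
        cmd += ["--input-format", "presegmented"]
    cmd += ["--output-format", output_format]
    if store_format is not None:
        if store_format not in FORMATS:
            raise ValueError(f"unsupported store format: {store_format}")
        cmd += ["--store-format", store_format]
    return cmd


def run_soudec(cmd):
    """Run SouDeC and return its standard output decoded as UTF-8."""
    done = subprocess.run(cmd, capture_output=True, text=True,
                          encoding="utf-8", check=True)
    return done.stdout
```

For instance, run_soudec(build_soudec_command("article.txt", output_format="html", store_format="conllu")) would correspond to the last command above, with article.txt as a placeholder file name.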

The full command syntax of SouDeC:

Usage: soudec.pl [options]
options:  -i|--input-file [input text file name]
          -a|--ann-file [manual annotation file name]
         -si|--stdin (input text provided via stdin)
         -if|--input-format [input format: txt (default) or presegmented]
          -p|--phrase-file [phrases reliability file name]
          -r|--reliability [minimal required phrase reliability]
         -of|--output-format [output format: txt (default), html, conllu]
         -os|--output-statistics (add SouDeC statistics to the output; if present, output is JSON with two items: data (in output-format) and stats (in HTML))
         -ne|--named-entities (add NameTag marks to classes in the output)
         -aa|--add-antecedent (add the antecedent if coreference is used to determine the class)
         -sf|--store-format [format: log the output in the given format: txt, html, conllu]
         -ss|--store-statistics (log SouDeC statistics to an HTML file)
          -v|--version (prints the version and ends)
          -h|--help (prints a short help and ends)
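
According to the option list above, --output-statistics turns the output into JSON with two items, data and stats. The following Python sketch separates the two; it assumes that the JSON is a single top-level object whose keys are literally "data" and "stats", which this manual does not state explicitly. The function name is hypothetical.

```python
import json


def split_output_and_stats(raw_json):
    """Split SouDeC JSON output into (data, stats).

    Assumption: the top level is an object with the keys "data"
    (result in the selected output format) and "stats" (HTML).
    """
    payload = json.loads(raw_json)
    return payload["data"], payload["stats"]
```

If the actual structure differs (for example, a two-element array), only the two lookups in the return statement would need to change.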

1.1. Input Formats

The input format can be specified using the --input-format option. Currently supported input formats are:

  • txt (default): the input is plain text
  • presegmented: the input is presegmented plain text, i.e. each sentence is on a single line; empty lines mark paragraph breaks
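
The presegmented layout described above can be produced from already segmented text with a few lines of code. This is a minimal Python sketch; to_presegmented is a hypothetical helper, and the sentence segmentation itself is assumed to have been done elsewhere.

```python
def to_presegmented(paragraphs):
    """Build presegmented input from an iterable of paragraphs,
    each given as a list of sentence strings.

    Every sentence is put on a single line (inner line breaks and
    repeated spaces are collapsed); paragraphs are separated by one
    empty line.
    """
    blocks = []
    for sentences in paragraphs:
        lines = [" ".join(s.split()) for s in sentences if s.strip()]
        if lines:
            blocks.append("\n".join(lines))
    if not blocks:
        return ""
    return "\n\n".join(blocks) + "\n"
```

The returned string can be written to a UTF-8 file and passed to SouDeC with --input-format presegmented.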

1.2. Output Formats

The output format is specified using the --output-format option. Currently supported output formats are:

  • txt (default): plain text output; detected phrases are enclosed in @ marks and detected sources are enclosed in >> and << marks, followed by the class of the source in square brackets (one of five classes: anonymous, anonymous-partial, unofficial, official-non-political, official-political).
  • html: HTML output; detected phrases and sources are colour-marked, and the sources are followed by their class in square brackets (one of five classes: anonymous, anonymous-partial, unofficial, official-non-political, official-political).
  • conllu: the CoNLL-U format, with information about detected phrases and detected and classified sources stored in a single item of the MISC column. The item key is SD; its value starts with P or S for a phrase or source, respectively, followed by an underscore and a numeric id (sources and phrases are numbered separately and independently). For sources, another underscore and the source class follow:
    • a for anonymous,
    • ap for anonymous-partial,
    • u for unofficial,
    • onp for official-non-political,
    • op for official-political.
    For example, SD=P_3 marks the third detected phrase of the document, SD=S_2_onp marks the second detected source in the document, classified as official-non-political. If a source consists of several tokens, all its tokens carry the same mark.
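
The SD marks described above can be read back from the conllu output as follows. This is a minimal Python sketch, assuming standard CoNLL-U layout (ten tab-separated columns, MISC last, items separated by |) and at most one SD item per token; the function names are hypothetical.

```python
import re

SOURCE_CLASSES = {
    "a": "anonymous",
    "ap": "anonymous-partial",
    "u": "unofficial",
    "onp": "official-non-political",
    "op": "official-political",
}

# Value part of an SD item: P_<id> or S_<id>_<class>
_SD_VALUE = re.compile(r"^(P|S)_(\d+)(?:_([a-z]+))?$")


def parse_sd_mark(misc):
    """Return a dict describing the SD item of a MISC value, or None."""
    if misc == "_":
        return None
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        if key != "SD" or not sep:
            continue
        match = _SD_VALUE.match(value)
        if match is None:
            return None
        kind, number, cls = match.groups()
        if kind == "P":
            return {"kind": "phrase", "id": int(number)}
        return {"kind": "source", "id": int(number),
                "class": SOURCE_CLASSES.get(cls, cls)}
    return None


def iter_sd_marks(conllu_text):
    """Yield (word form, mark) for every token carrying an SD item."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a regular token line
        mark = parse_sd_mark(columns[9])
        if mark is not None:
            yield columns[1], mark
```

Tokens of a multi-token source share the same id, so grouping the yielded marks by kind and id reassembles whole phrases and sources.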

2. Running the SouDeC REST Server

TODO