MasKIT User's Manual

This section describes the command line tool. The REST API is described on the API Reference page.

1. Running MasKIT

The simplest way to run MasKIT is to supply plain text on the standard input and read the result in txt format from the standard output.

./maskit.pl --stdin

The input is assumed to be UTF-8 encoded and can be either plain text or, with the --input-format presegmented switch, presegmented text, i.e. one sentence per line.

The following command runs MasKIT on a presegmented (sentence-per-line) plain text file and writes the result in the CoNLL-U format to the standard output.

./maskit.pl --input-file [input_file_name] --input-format presegmented --output-format conllu

The result in the selected output format always goes to the standard output. For logging purposes, it can additionally be stored in a file; for example, the following command sends the result in HTML to the standard output and also stores it in the CoNLL-U format in a file.

./maskit.pl --input-file [input_file_name] --output-format html --store-format conllu
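
When MasKIT is driven from another program, the invocations above can be assembled programmatically. The following is a minimal Python sketch, not part of MasKIT: the helper names build_maskit_command and run_maskit and the script path ./maskit.pl are assumptions made for illustration, and only options listed in this manual are used.

```python
import subprocess


def build_maskit_command(input_file=None, input_format="txt",
                         output_format="txt", store_format=None,
                         script="./maskit.pl"):
    """Assemble an argument list for maskit.pl (hypothetical helper)."""
    cmd = [script]
    if input_file is None:
        cmd.append("--stdin")                    # text comes from standard input
    else:
        cmd += ["--input-file", input_file]
    cmd += ["--input-format", input_format, "--output-format", output_format]
    if store_format is not None:
        cmd += ["--store-format", store_format]  # additionally log the result
    return cmd


def run_maskit(text, output_format="txt"):
    """Send UTF-8 text to MasKIT via stdin and return its standard output."""
    result = subprocess.run(build_maskit_command(output_format=output_format),
                            input=text.encode("utf-8"),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.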

The full command syntax for running MasKIT:

Usage: maskit.pl [options]
options:  -i|--input-file [input text file name]
         -si|--stdin (input text provided via stdin)
         -if|--input-format [input format: txt (default) or presegmented]
         -rf|--replacements-file [replacements file name]
          -r|--randomize (if used, the replacements are selected in random order)
          -c|--classes (if used, classes are used as replacements)
         -of|--output-format [output format: txt (default), html, conllu]
          -d|--diff (display the original expressions next to the anonymized versions)
         -ne|--named-entities [scope: 1 - add NameTag marks to the anonymized versions, 2 - to all recognized tokens]
         -os|--output-statistics (add MasKIT statistics to output; if present, output is JSON with two items: data (in output-format) and stats (in HTML))
         -sf|--store-format [format: log the output in the given format: txt, html, conllu]
         -ss|--store-statistics (log statistics to an HTML file)
         -ls|--log-states (log intermediate states in CoNLL-U format for debugging; possible values (separated by a comma):
                           UD (after UDPipe), NT (after NameTag), PA (after parsing to Tree::Simple), FN (after fixing NameTag errors),
                           UN (after unification of single-word NEs))
         -ll|--logging-level (override the default (anonymous) logging level (0=full, 1=limited, 2=anonymous))
         -uu|--url-udpipe [URL: set a custom UDPipe URL]
         -un|--url-nametag [URL: set a custom NameTag URL]
          -v|--version (prints the version of the program and ends)
          -n|--info (prints the program version and supported features as JSON and ends)
          -h|--help (prints a short help and ends)
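
With --output-statistics, the output is described above as JSON with two items, data and stats. A minimal Python sketch of separating them follows. It assumes that the two items are top-level keys of a JSON object named exactly "data" and "stats"; the function name is hypothetical and the sample string is invented for illustration, not real MasKIT output.

```python
import json


def split_output_and_stats(raw_output):
    """Return (data, stats) from output produced with --output-statistics."""
    parsed = json.loads(raw_output)
    # Assumption: the two items are top-level keys named "data" and "stats".
    return parsed["data"], parsed["stats"]


# Invented sample, not real MasKIT output.
sample = json.dumps({"data": "Some anonymized text.",
                     "stats": "<p>statistics</p>"})
data, stats = split_output_and_stats(sample)
```

Here data holds the result in the selected output format and stats holds the statistics as HTML.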

1.1. Input Formats

The input format can be specified using the --input-format option. Currently supported input formats are:

  • txt (default): the input is plain text in UTF-8
  • presegmented: the input is presegmented plain text in UTF-8, i.e. each sentence is on a single line; empty lines mark paragraph breaks
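
Text that has already been split into sentences can be written out in the presegmented layout described above. The following Python sketch is an illustration only: the function name to_presegmented is hypothetical, and sentence segmentation is assumed to have been done beforehand by the caller.

```python
def to_presegmented(paragraphs):
    """Join paragraphs (lists of sentences) into presegmented text."""
    blocks = []
    for sentences in paragraphs:
        # One sentence per line; collapse inner whitespace and newlines.
        lines = [" ".join(s.split()) for s in sentences if s.strip()]
        if lines:
            blocks.append("\n".join(lines))
    # An empty line between blocks marks a paragraph break.
    return "\n\n".join(blocks) + "\n"
```

The returned string can be saved as a UTF-8 file and passed to MasKIT with --input-format presegmented.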

1.2. Output Formats

The output format is specified using the --output-format option. Currently supported output formats are:

  • txt (default): the output is plain text; the original texts (if included in the output by the --diff option) are displayed next to the replacements (separated from the replacement by an underscore and enclosed in square brackets).
  • html: the output is HTML; the replacements are colour-marked, and the original texts (if included in the output by the --diff option) are in subscript, enclosed in square brackets and struck through.
  • conllu: the CoNLL-U format; the original text is unchanged and all MasKIT-related information is put in the misc column with the prefix MK=. Replacements are put after r:, e.g. MK=r:Praha, or MK=r:MĚSTO-x if the --classes option is used, where x stands for a numeric index of this class occurrence in the document (e.g. MK=r:MĚSTO-1 for the first town/city encountered in the document); see below for a list of classes. Tokens to be hidden (in multiword anonymized expressions) are marked MK=h.
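
The MK= information in the conllu output can be read back with a few lines of code. The following Python sketch is an illustration under stated assumptions: token lines have the ten tab-separated columns of CoNLL-U, the misc column is the last one, and MK= appears as one of its "|"-separated attributes. The function names are hypothetical and any line passed to them here is an invented example, not real MasKIT output.

```python
def maskit_annotation(conllu_line):
    """Return the MK value from the misc column of a token line, or None."""
    line = conllu_line.rstrip("\n")
    if not line or line.startswith("#"):
        return None                      # sentence boundary or comment line
    columns = line.split("\t")
    if len(columns) != 10:
        return None                      # not a well-formed token line
    for attribute in columns[9].split("|"):
        if attribute.startswith("MK="):
            return attribute[len("MK="):]
    return None


def describe(value):
    """Classify an MK value as a replacement, a hidden token, or other."""
    if value is None:
        return ("none", None)
    if value == "h":
        return ("hidden", None)
    if value.startswith("r:"):
        return ("replacement", value[2:])
    return ("other", value)
```

For a token line whose misc column contains MK=r:Praha, describe(maskit_annotation(line)) yields ("replacement", "Praha"); a token marked MK=h is reported as hidden.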

1.2.1. Classes

If the --classes option is used, MasKIT uses class names instead of fake text replacements (i.e., it anonymizes instead of pseudonymizing). Currently, the following classes are used (in the CoNLL-U output format, the M- prefix is dropped):

  • M-MUŽ-JMÉNO: a first name (male)
  • M-ŽENA-JMÉNO: a first name (female)
  • M-MUŽ-PŘÍJMENÍ: a family name (male)
  • M-ŽENA-PŘÍJMENÍ: a family name (female)
  • M-PŘÍJMENÍ-ZKRATKA: an abbreviated family name
  • M-ULICE: a street/square name
  • M-ČÍSLO-ULICE: a street/square number
  • M-OBEC: a town/city name
  • M-PSČ: a zip code
  • M-PSČ1: the first part of a two-part zip code
  • M-PSČ2: the second part of a two-part zip code
  • M-PHONE: a phone/fax number
  • M-RČ1: the first part of a birth registration number
  • M-RČ2: the second part of a birth registration number
  • M-DEN-NAROZENÍ: a date of birth (day)
  • M-MĚSÍC-NAROZENÍ: a date of birth (month)
  • M-DEN-MĚSÍC-NAROZENÍ: a date of birth (day and month)
  • M-ROK-NAROZENÍ: a date of birth (year)
  • M-DEN-ÚMRTÍ: a date of death (day)
  • M-MĚSÍC-ÚMRTÍ: a date of death (month)
  • M-DEN-MĚSÍC-ÚMRTÍ: a date of death (day and month)
  • M-ROK-ÚMRTÍ: a date of death (year)
  • M-EMAIL: an e-mail address
  • M-WWW: a WWW address
  • M-FIRMA: a company name
  • M-AGENTURA: (since ver. 0.69) a name of a government or political institution
  • M-INSTITUCE: a name of a cultural, educational or scientific institution
  • M-IČO: a commercial register number
  • M-DIČ: a tax register number
  • M-ČÍSLO-POZEMKU: a land register number
  • M-SPZ: a vehicle registration number
  • M-ČÍSLO-JEDNACÍ: an agenda reference number
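
Because many class names themselves contain hyphens, a class replacement taken from the CoNLL-U output has to be split at its last hyphen to separate the class from its index. The following Python sketch assumes that such a replacement has the form NAME-x, with the M- prefix dropped and x a numeric index, as described for the conllu output format; the function name is hypothetical. It should only be applied to output produced with --classes, since a fake text replacement that happens to end in a hyphen and digits would be misread.

```python
def split_class_label(replacement):
    """Split a class replacement such as 'OBEC-1' into ('M-OBEC', 1)."""
    # Split at the LAST hyphen: class names may contain hyphens themselves.
    name, _, index = replacement.rpartition("-")
    if not name or not index.isdigit():
        raise ValueError("not a class replacement: %r" % replacement)
    # Restore the M- prefix used in the list of classes.
    return "M-" + name, int(index)
```

For example, split_class_label("DEN-MĚSÍC-NAROZENÍ-2") returns ("M-DEN-MĚSÍC-NAROZENÍ", 2).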