Encode::Arabic::ArabTeX - Perl extension for multi-purpose processing of the ArabTeX notation of Arabic


NAME

Encode::Arabic::ArabTeX - Perl extension for multi-purpose processing of the ArabTeX notation of Arabic


REVISION

    $Revision: 1.34 $             $Date: 2004/08/22 20:42:10 $


SYNOPSIS

    use Encode::Arabic::ArabTeX;        # imports just like 'use Encode' would, plus extended options
    while ($line = <>) {                # maps the ArabTeX notation for Arabic into the Arabic script
        print encode 'utf8', decode 'arabtex', $line;       # 'arabtex' alias 'ArabTeX'
    }
    # ArabTeX lower ASCII transliteration <--> Arabic script in Perl's internal format
    $string = decode 'ArabTeX', $octets;
    $octets = encode 'ArabTeX', $string;
    Encode::Arabic::ArabTeX->encoder('dump' => '!./encoder.code');  # dump the encoder engine to file
    Encode::Arabic::ArabTeX->decoder('load');   # load the decoder engine from module's extra sources


DESCRIPTION

ArabTeX is an excellent extension to TeX/LaTeX designed for typesetting the right-to-left scripts of the Orient. It comes with very intuitive and comprehensible lower ASCII transliterations, the expressive power of which is even greater than that of the scripts themselves.

Encode::Arabic::ArabTeX implements the rules needed for proper interpretation of the ArabTeX notation of Arabic. The conversion itself is done by Encode::Mapper, and the user interface is built on the Encode::Encoding module.

ENCODING BUSINESS

Since the ArabTeX notation is not a simple mapping to the graphemes of the Arabic script, encoding the script into the notation is ambiguous. Two different strings in the notation may correspond to identical strings in the script. Heuristics must be engaged to decide which of the representations is more appropriate.

In addition to this bottleneck, encoding may not be perfectly invertible by the decode operation, due to over-generation or approximations in the encoding algorithm.
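
Whether a particular text survives the round trip can be checked directly. The following is only a sketch: round_trips is a hypothetical helper, not part of the module, and it takes the two conversions as code references so that it runs without the module or any Arabic data. The stand-in conversions in the demo are unrelated to Arabic and are lossy on purpose.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Encode::Arabic::ArabTeX): returns true
# if $decode->( $encode->($script) ) reproduces $script exactly.
sub round_trips {
    my ($encode, $decode, $script) = @_;
    my $notation = $encode->($script);
    return $decode->($notation) eq $script;
}

# Stand-in conversions for the demo: lower-casing discards information,
# much as encoding the script into the notation may.
my $to_notation = sub { lc $_[0] };
my $to_script   = sub { uc $_[0] };

for my $text ('ABC', 'AbC') {
    printf "%s: %s\n", $text,
        round_trips($to_notation, $to_script, $text) ? 'invertible' : 'not invertible';
}
```

With the module loaded, the same check could be called with sub { encode 'arabtex', $_[0] } and sub { decode 'arabtex', $_[0] } in place of the stand-ins.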

There are situations where conversion from the Arabic script to the ArabTeX notation is still convenient and useful. Imagine you need to edit the data, enhance it with vowels or other diacritical marks, produce phonetic transcripts and trim the typography of the script ... Do it in the ArabTeX notation, having an unrivalled control over your acts!

Nonetheless, encoding is not the very purpose for this module's existence ;)

DECODING BUSINESS

The module decodes the ArabTeX notation as defined in the User Manual Version 4.00 of March 11, 2004, ftp://ftp.informatik.uni-stuttgart.de/pub/arabtex/doc/arabdoc.pdf. The implementation uses three levels of Encode::Mapper engines to decode the notation:

Hamza writing
Hamza carriers are determined from the context in accordance with the Arabic orthographical conventions. The first level of mapping expands every <'> into the verbatim encoding of the relevant carrier. This level of processing can become optional, if people ever need to encode the hamza carriers explicitly.

Unlike in ArabTeX, interpretation of geminated hamza <''> is correct here. We have experimented with imaginable Arabic spellings of <ra''asa>, <ru''isa>, <tara''usuN> etc. on http://www.arabic-morphology.com/ to deduce the proper ones.

Grapheme generation
The core level includes most of the rules needed, and converts the ArabTeX notation to Arabic graphemes in Unicode. The engine recognizes all the consonants of Modern Standard Arabic, plus the following letters:

                    [ "|",           ""         ],              # ArabTeX's "invisible consonant"
                    [ "B",           "\x{0640}" ],              # ArabTeX's "consonantal ta.twil"
                    [ "T",           "\x{0629}" ],              # ta' marbu.ta
                    [ "p",           "\x{067E}" ],              # pa'
                    [ "v",           "\x{06A4}" ],              # va'
                    [ "g",           "\x{06AF}" ],              # gaf
                    [ "c",           "\x{0681}" ],              # .ha with hamza
                    [ "^c",          "\x{0686}" ],              # gim with three dots
                    [ ",c",          "\x{0685}" ],              # _ha with three dots
                    [ "^z",          "\x{0698}" ],              # zay with three dots
                    [ "^n",          "\x{06AD}" ],              # kaf with three dots
                    [ "^l",          "\x{06B5}" ],              # lam with a bow above
                    [ ".r",          "\x{0695}" ],              # ra' with a bow below

There are many nice features in the notation, like assimilation, gemination, hyphenation, all implemented here. Defective and historical writings of vowels are supported, too! Try yourself if your fonts can handle these ;)
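
To see how the pairs in the letter table above behave as rewrite rules, here is a minimal sketch in plain Perl. It is not how Encode::Mapper works internally: map_letters is a hypothetical helper that applies only these thirteen pairs, trying longer notations before shorter ones, and ignores every other rule of the notation.

```perl
use strict;
use warnings;

# The [ notation, grapheme ] pairs copied from the table above.
my @rules = (
    [ "|",  ""         ], [ "B",  "\x{0640}" ], [ "T",  "\x{0629}" ],
    [ "p",  "\x{067E}" ], [ "v",  "\x{06A4}" ], [ "g",  "\x{06AF}" ],
    [ "c",  "\x{0681}" ], [ "^c", "\x{0686}" ], [ ",c", "\x{0685}" ],
    [ "^z", "\x{0698}" ], [ "^n", "\x{06AD}" ], [ "^l", "\x{06B5}" ],
    [ ".r", "\x{0695}" ],
);

my %map = map { $_->[0] => $_->[1] } @rules;

# Longer notations come first in the alternation, so "^c" wins over "c".
my $alternatives = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %map;

# Hypothetical helper: rewrite every listed notation into its grapheme.
sub map_letters {
    my ($text) = @_;
    $text =~ s/($alternatives)/$map{$1}/g;
    return $text;
}

# Show the resulting code points instead of relying on an Arabic font.
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, map_letters('^c,c')), "\n";
```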

Wasla and ligatures
Wasla is introduced if a long or short vowel precedes and the intervening blank space is one newline, one tab, or up to four single spaces. Optionally, diacritical marks between laam and 'alif are moved after the latter letter, since most current systems rendering the Arabic script do not produce the desired ligatures unless the two graphemes are immediately adjacent.
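
The whitespace condition can be written down as a single pattern. This is a sketch of one reading of the rule, not the module's code: is_wasla_gap is a hypothetical predicate that looks at the blank space only and leaves out the test for a preceding vowel.

```perl
use strict;
use warnings;

# Hypothetical predicate: true if $gap is exactly one newline, exactly one
# tab, or between one and four spaces; anything else is rejected.
sub is_wasla_gap {
    my ($gap) = @_;
    return $gap =~ /\A(?:\n|\t| {1,4})\z/ ? 1 : 0;
}

for my $gap ("\n", "\t", ' ' x 4, ' ' x 5, "\n\n") {
    # Make the whitespace visible in the report.
    (my $shown = $gap) =~ s/\n/\\n/g;
    $shown =~ s/\t/\\t/g;
    printf "%-8s %s\n", "'$shown'",
        is_wasla_gap($gap) ? 'allows wasla' : 'blocks wasla';
}
```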

There are modes and options in ArabTeX that have not yet been dealt with in Encode::Arabic::ArabTeX. Still, mutual consistency of the systems is very high. This release supports vowel quoting and works in ArabTeX's \vocalize mode by default. The other conversion modes are implemented, too, as described below under the enmode and demode methods.

EXPORTS, ENGINES & MODES

The module exports as if use Encode also appeared in the package. The import options, except for a leading subsequence of :xml, :simple or :describe, are simply delegated to Encode, which performs the imports.

If the first element in the list to use is :xml, all XML markup, or rather any data enclosed in the well-paired and non-nested angle brackets < and >, will be preserved. Properties of the Encode::Arabic::ArabTeX engines can be generally controlled through the Encode::Mapper API.
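
The effect of preserving well-paired, non-nested markup can be pictured with a small stand-alone sketch. It assumes nothing about how the module implements :xml; convert_outside_markup is a hypothetical helper, and uc merely stands in for the real conversion so that the result is visible in ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper: apply $convert to everything outside <...> spans
# and pass the spans themselves through untouched.
sub convert_outside_markup {
    my ($text, $convert) = @_;
    # The capturing group makes split return the markup pieces as well.
    my @pieces = split /(<[^<>]*>)/, $text;
    return join '', map { /\A<[^<>]*>\z/ ? $_ : $convert->($_) } @pieces;
}

print convert_outside_markup(
    '<p class="x">kitAb</p> wa-<b>qalam</b>',
    sub { uc $_[0] },
), "\n";
```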

If the next element in this list, possibly the first, is :simple, the rules in the engines are simplified so that quotes are mapped to empty strings and infrequent or experimental notations of vowels are not interpreted in the extra manner of ArabTeX. Using :simple is recommended for simple everyday tasks where these nuances have no impact and where full initialization would be a burden.

The :describe option calls the Encode::Mapper's describe method on the module's engines right after their compilation.

Initialization of the engines takes place the first time they are used, unless they have already been defined. There are two explicit methods for it:

encoder
Initialize or redefine the encoder engine. If no parameters are given, rules in the module are compiled into a list of Encode::Mapper objects. Currently, the --dump and --load options are experimental.

decoder
See the description of encoder.

There are five conversion modes currently recognized in this module, and their aliases are mapped according to the module's %modemap hash. Selection of the appropriate mode is done best through the enmode and demode functions of Encode::Arabic, or with a direct call of the namesake methods in Encode::Arabic::ArabTeX:

    our %Encode::Arabic::ArabTeX::modemap = (           # the module provides these definitions
            'default'       => 3,                           'undef'         => 0,
            'fullvocalize'  => 4,   'full'          => 4,
            'vocalize'      => 3,   'nosukuun'      => 3,
            'novocalize'    => 2,   'novowels'      => 2,   'none'          => 2,
            'noshadda'      => 1,   'noneplus'      => 1,
        );
    # the function calls might be preferred as more comfortable
    Encode::Arabic::demode 'arabtex', 'full';           # like 'encode' and 'decode' of Encode
    Encode::Arabic::ArabTeX->demode('fullvocalize');    # like the Encode::Encoding interfaces
    # how modes can be set easily
    use Encode::Arabic ':modes';   enmode 'arabtex', 'undef';   demode 'arabtex', 'noneplus';

enmode
Currently in development. The mode is fixed to 'undef' internally.

demode
Enforces the proper version of the final, third level of the Encode::Mapper engines.
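
The way aliases collapse onto the five numeric modes can be checked with a short sketch. It works on a copy of the alias table shown above so that it runs without the module; resolve_mode is a hypothetical helper, and returning undef for an unknown alias is an assumption of the sketch, not documented behaviour.

```perl
use strict;
use warnings;

# Copy of the alias table shown above.
my %modemap = (
    'default'      => 3, 'undef'    => 0,
    'fullvocalize' => 4, 'full'     => 4,
    'vocalize'     => 3, 'nosukuun' => 3,
    'novocalize'   => 2, 'novowels' => 2, 'none' => 2,
    'noshadda'     => 1, 'noneplus' => 1,
);

# Hypothetical lookup: the numeric mode for an alias, or undef if the
# alias is not in the table (an assumption made for this sketch).
sub resolve_mode {
    my ($alias) = @_;
    return $modemap{$alias};
}

# Group the aliases by the mode they select and list them per mode.
my %aliases;
push @{ $aliases{ $modemap{$_} } }, $_ for sort keys %modemap;
printf "%d: %s\n", $_, join(', ', @{ $aliases{$_} })
    for sort { $a <=> $b } keys %aliases;
```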


SEE ALSO

Encode::Arabic, Encode::Mapper, Encode::Encoding, Encode

ArabTeX system ftp://ftp.informatik.uni-stuttgart.de/pub/arabtex/arabtex.htm

Klaus Lagally http://www.informatik.uni-stuttgart.de/ifi/bs/people/lagall_e.htm

External Tools Not Only for ArabTeX Documents http://ufal.mff.cuni.cz/publications/year2002/FLM2002.zip

Arabeyes Arabic Unix Project http://www.arabeyes.org/


AUTHOR

Otakar Smrz, http://ufal.mff.cuni.cz/~smrz/

    eval { '<' . 'smrz' . "\x40" . ( join '.', qw 'ufal mff cuni cz' ) . '>' }

Perl is also designed to make the easy jobs not that easy ;)


COPYRIGHT AND LICENSE

Copyright 2003, 2004 by Otakar Smrz

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
