Prague Arabic Dependency Treebank ++

Sunday, March 21, 2010

This Blog Has Moved

This blog is now located at http://padt-online.blogspot.com/. Feed subscribers, please update your subscriptions to http://padt-online.blogspot.com/feeds/posts/default.

# posted by Otakar Smrz : 3:57 PM

Wednesday, January 13, 2010

ElixirFM 1.1 Update + Wiki + API

The ElixirFM Functional Arabic Morphology project has released an update of its libraries, executables, data, and documentation at SourceForge.

The current version, 1.1.927, improves the performance of the system and comes with enhanced user and programming interfaces. Besides the ElixirFM Online Interface, the project also features:

ElixirFM Wiki: documentation for the project, aimed at computational linguists and developers who would like to explore the ElixirFM system more deeply and use it in their applications
ElixirFM API: a programming interface for Perl that lets you invoke the elixir executable from your code and then parse and process the results
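
As a rough illustration of what invoking the elixir executable from your own code can look like, here is a minimal Python sketch. It is not the Perl API mentioned above and does not reproduce its interface. It assumes that the executable is named `elixir`, that the mode (resolve, inflect, derive, or lookup) is passed as the first command-line argument, and that input is read from standard input; the function names `build_command` and `run_elixir` are hypothetical. Please check the ElixirFM Wiki for the actual usage.

```python
import shutil
import subprocess

# The four modes named in the ElixirFM announcements.
MODES = ("resolve", "inflect", "derive", "lookup")


def build_command(mode, extra_args=(), executable="elixir"):
    """Return the argument list for one run of the executable.

    Assumption: the mode is passed as the first command-line argument.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return [executable, mode, *extra_args]


def run_elixir(mode, text, extra_args=(), executable="elixir"):
    """Send `text` to the executable and return whatever it prints.

    Assumption: input is read from standard input and results are written
    to standard output as UTF-8 text. Returns None if the executable
    cannot be found on the PATH.
    """
    if shutil.which(executable) is None:
        return None
    completed = subprocess.run(
        build_command(mode, extra_args, executable),
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return completed.stdout
```

The raw output returned by `run_elixir` would still need to be parsed; the sketch leaves that step out because the output format is not described here.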

The ElixirFM lexicon has been extended and refined, and a number of words have been encoded in a way that makes their deep word structure more explicit. The sources of the lexicon and the editing software are freely available upon request.

ElixirFM now operates more smoothly in all its modes. In particular, the resolve mode now prunes solutions, and its morphological analyses comply with most linguistic constraints. Likewise, the online inflect and derive modes have been integrated with lookup, which makes word form generation much more intuitive and more enjoyable.

ElixirFM is published under the GNU General Public License (GNU GPL 3). Everyone is welcome to participate in this project!

# posted by Otakar Smrz : 1:38 AM

Tuesday, March 3, 2009

ElixirFM 1.1 Online Interface

In recent months, the ElixirFM project has improved considerably in several respects. We have worked mostly on developing the programming library and on refining the lexicon. On top of these essential components, we have built a user-friendly web application, the ElixirFM 1.1 Online Interface.

ElixirFM is a computational model of the morphology of Modern Written Arabic. It provides the user with four different modes of operation, in addition to the unique lexical resource and the other open-source functions of the implementation.

Resolve: provides tokenization and morphological analysis of the text you enter, even if you omit some symbols or do not spell everything correctly. You can enter the text not only in the original script and orthography, but also in other notations, including a purely phonetic transcription.
Inflect: lets you inflect words into the forms required by context. You only need to define the grammatical parameters of the expected word forms. You can either enter natural language descriptions or specify the parameters using the positional morphological tags.
Derive: lets you derive words of similar meaning but different grammatical category. You only need to specify the desired grammatical categories, using either natural language descriptions or the positional morphological tags.
Lookup: looks up lexical entries by their citation form and nests of entries by their root. You can even search the dictionary using English.

The online interface includes example queries for each of the modes. It further incorporates several interactive tools to facilitate the browsing of the results returned by the system.

Information on the programming libraries and the research context of the project is partly available in our papers. We would like to extend the documentation according to the needs of the users, and we are happy to discuss any unclear issues with anyone interested.

ElixirFM is published under the GNU General Public License (GNU GPL 3). Everyone is welcome to participate in this project!

Enjoy ... and let us know in case of questions or comments :)

# posted by Otakar Smrz : 11:52 PM

Wednesday, July 9, 2008

SourceForge Projects

The SourceForge open-source software repository offers a number of projects related to computational processing of Arabic:

ElixirFM: High-level implementation of Functional Arabic Morphology
Encode Arabic: Implementations for encodings of Arabic, in Haskell and Perl
AraMorph: Buckwalter Arabic morphological analyzer
Arabic WordNet: Multi-lingual concept dictionary mapping word senses in Arabic to those in the English Princeton WordNet
Sarf: Arabic morphology system that can generate and inflect Arabic verbs, derivative nouns, and gerunds
Arabic Spellchecker Word Lists: Arabic word list for spell checkers

Users can register with SourceForge and subscribe to the monitoring service of any project to receive notifications of new updates.

# posted by Otakar Smrz : 11:20 AM

Friday, May 2, 2008

A Word on the Million Words

Work on the new PADT 2.0 is now in progress. The recent developments are described in our submission to the LREC 2008 Workshop on Arabic & Local Languages:

Prague Arabic Dependency Treebank: A Word on the Million Words: [paper]

According to the paper, the expected contents of PADT 2.0 will include these annotations:

PADT 2.0	Corpus	Fun. Morphology	Dep. Syntax	Tectogrammatics	Notes
Total	1,095,610	1,281,858	1,001,908	30,894	merged annotations
Prague	328,240	383,482	282,252	30,894	original annotations
Penn	767,370	898,376	719,656		converted annotations
Prague	Corpus	Fun. Morphology	Dep. Syntax	Tectogrammatics	Notes
AEP	99,360	116,717	116,717	9,690	Arabic English Parallel News
EAT	48,371	55,097	55,097	13,934	English-Arabic Treebank
ASB	11,881	14,254	14,254		Arabic Gigaword
NHR	21,445	25,329	12,613		Arabic Gigaword
HYT	85,683	100,537	41,855	5,228	Arabic Gigaword
XIN	61,500	71,548	41,716	2,042	Arabic Gigaword
Penn	Corpus	Fun. Morphology	Dep. Syntax	Tectogrammatics	Notes
1v3	151,546	172,386	172,386		Penn Arabic Treebank 1v3
2v2	141,515	161,217	161,217		Penn Arabic Treebank 2v2
3v2	335,250	394,466	394,466		Penn Arabic Treebank 3v2
4v1	149,784	178,720			Penn Arabic Treebank 4v1
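
The summary rows of the table can be cross-checked with a little arithmetic. The Python sketch below is only a consistency check on the figures as transcribed here, not part of the paper. It verifies that the Total row equals Prague plus Penn in every column, and that the Prague row equals the sum of its six subcorpora. Empty cells are treated as 0, which is an assumption, and the helper names `column_sums` and `matches` are hypothetical.

```python
# Numeric columns of the table, in order.
COLUMNS = ("Corpus", "Fun. Morphology", "Dep. Syntax", "Tectogrammatics")

# Figures copied from the table; empty cells are recorded as 0 (assumption).
ROWS = {
    "Total": (1_095_610, 1_281_858, 1_001_908, 30_894),
    "Prague": (328_240, 383_482, 282_252, 30_894),
    "Penn": (767_370, 898_376, 719_656, 0),
    "AEP": (99_360, 116_717, 116_717, 9_690),
    "EAT": (48_371, 55_097, 55_097, 13_934),
    "ASB": (11_881, 14_254, 14_254, 0),
    "NHR": (21_445, 25_329, 12_613, 0),
    "HYT": (85_683, 100_537, 41_855, 5_228),
    "XIN": (61_500, 71_548, 41_716, 2_042),
}


def column_sums(labels):
    """Add up the given rows column by column."""
    return tuple(
        sum(ROWS[label][i] for label in labels) for i in range(len(COLUMNS))
    )


def matches(summary_label, part_labels):
    """Return True if the summary row equals the sum of its parts."""
    return ROWS[summary_label] == column_sums(part_labels)


if __name__ == "__main__":
    print(matches("Total", ["Prague", "Penn"]))
    print(matches("Prague", ["AEP", "EAT", "ASB", "NHR", "HYT", "XIN"]))
```

Both checks hold for the figures above.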

Your suggestions and comments are very welcome. Thank you.

# posted by Otakar Smrz : 3:34 PM
