Technology for NLP


NPFL092

Zdeněk Žabokrtský & Rudolf Rosa

{zabokrtsky,rosa}@ufal.mff.cuni.cz

Tuesday 9.00–11.20
SU2

Introduction to XML, Part 2

XML Namespace

XML Namespace (cont.)

Navigation in XML

+ XLink, XPointer

XPath

XPath (cont.)

More examples

/*/*/*/para elements para in the 4th level
//chapter elements chapter anywhere in the document
//bold | //italic all elements bold or italic anywehere in the document
/book/chapter[2] the second chapter
/book/chapter[last()] the last chapter
//chapter[@id="ch2"] The chapter(s) whose attribute id has the value "ch2"
../@lang attribute lang of the parent of the current node
//*[count(para)=3] all elements that contain exactly three para elements
//chapter/para[position()<3] the first two paragraphs of every chapter
//song[@lang="en"]/title titles of all the English songs

XML Transformations

Why?
Presenting the content of an XML document (users are not interested in tags)
Style (style sheet)
Relation between logical structure of a document and its presentation form
XSL = eXtensible Stylesheet Language
Description of transformations of XML documents
XSLT = XSL Transformation
Transforming process realized by an XSLT processor
xsltproc test.xsl test.xml > test.html

XSL

Stylesheet declaration
XML document in xsl namespace
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    … declaration of templates …
</xsl:stylesheet>
Style body
At least one transformation rule (“template”)
  • Matching expression: finds nodes to apply the template on
  • Output: found nodes substituted by the specified output
<xsl:template match="matching_expression">
    … output …
</xsl:template>

XSL Example

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <html>
      <head></head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="chapter">
    <h1>
      <xsl:number/>.
      <xsl:value-of select="title"/>
      (id = <xsl:value-of select="@id"/>)
    </h1>
    <p>
      <xsl:apply-templates/>
    </p>
  </xsl:template>
  <xsl:template match="title"/><!-- do not process again -->

</xsl:stylesheet>
<book>
 <chapter id="k1">
  <title>Intro</title>
   Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
 </chapter>
 <chapter id="k2">
  <title>Conclusion</title>
  Etiam euismod scelerisque dapibus.
 </chapter>
</book>

API for XML

Application programming interface

SAX—Simple API for XML
  • a document is read serially
  • its contents are reported as “callbacks”
  • fast and efficient to implement, but user applications have to keep track of the current position
DOM—Document Object Model
  • navigation of the entire document
  • a document represented as a tree of objects
  • language-neutral interface to an in-memory representation

SAX Event Handlers in Perl

Each type of event is processed by a subroutine with a dedicated name:

SAX in Perl: an Example

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

use XML::SAX;
use XML::SAX::Writer;

my $out;
my $writer = XML::SAX::Writer->new(Output=>\$out);
my $filter = Element_Attribute_Counter->new(Handler => $writer);

my $parser = XML::SAX::ParserFactory->parser(Handler => $filter);
$parser->parse_uri($ARGV[0]);
print "$out\n",Dumper $filter->get_count;

package Element_Attribute_Counter;
use base qw/XML::SAX::Base/;
use Scalar::Util qw/refaddr/;

my %count;

sub start_element {
    my ($self,$element) = @_;
    my $addr = refaddr $self;
    $count{$addr}{$element->{Name}}++;
    my $attributes = $element->{Attributes};
    foreach my $attribute (keys %$attributes) {
        $count{$addr}{"@".$attributes->{$attribute}{Name}}++;
    }
    $element->{Name} = uc $element->{Name};
    $self->SUPER::start_element($element);
} # start_element

sub get_count {
    my $self = shift;
    return $count{refaddr $self};
} # get_counter

XML::Twig — Using Handlers

#!/usr/bin/perl
use warnings;
use strict;

use XML::Twig;
use Data::Dumper;

my %count;

sub element_handler {
    my ($self, $current) = @_;
    $count{ $current->name }++;
    $count{ '@' . $_ }++ for $current->att_names;
    $self->purge;
} # element_handler

my $twig = XML::Twig->new(
               twig_handlers => {
                   "*" => \&element_handler,
               }
           );
$twig->parsefile($ARGV[0]);

print Dumper \%count;

XML Pulldown Parser — XML::LibXML::Reader

#!/usr/bin/perl
use warnings;
use strict;

use XML::LibXML::Reader;
use Data::Dumper;

my %count;

my $reader = XML::LibXML::Reader->new(location => $ARGV[0])
    or die "Cannot read $ARGV[0]\n";
while ($reader->nextElement) {
    $count{$reader->name}++;
    if ($reader->hasAttributes) {
        $reader->moveToFirstAttribute;
        do {{
            $count{'@' . $reader->name}++;
        }} while $reader->moveToNextAttribute == 1;
    }
}

print Dumper \%count;

DOM

DOM in Perl

XML::DOM uses the expat library

#!/usr/bin/perl
use strict;
use warnings;

use XML::DOM;
my $parser = XML::DOM::Parser->new();
my $doc = $parser->parsefile('book.xml');

foreach my $chap ($doc->getElementsByTagName('chapter')) {
    print 'Attribute id contains: ', $chap->getAttribute('id'), "\n";
    foreach my $child ($chap->getChildNodes) {
        my $type = $child->getNodeType;
        if ($type == ELEMENT_NODE) {
            print 'Element ', $child->getTagName,
                ' contains: ', $child->getFirstChild->getNodeValue,
                "\n";
        } elsif ($type == TEXT_NODE) {
            print 'Text: ', $child->getNodeValue, "\n";
        }
    }
}

XML::LibXML

Similar to XML::DOM, but uses the libxml library

#!/usr/bin/perl
use warnings;
use strict;

use XML::LibXML;
my $doc = XML::LibXML->load_xml(location => 'book.xml');

for my $chap ($doc->getElementsByTagName('chapter')) {
    print 'Attribute id contains: ', $chap->getAttribute('id'), "\n";
    foreach my $child ($chap->getChildNodes) {
        my $type = $child->nodeType;
        if ($type == XML_ELEMENT_NODE) {
            print 'Element ', $child->nodeName,
                ' contains: ', $child->getFirstChild->nodeValue,
                "\n";
        } elsif ($type == XML_TEXT_NODE) {
            print 'Text: ', $child->nodeValue, "\n";
        }
    }
}

XSH

XML::XSH2 is built on top of XML::LibXML

open book.xml ;
my $ch := insert element chapter append book ;
insert attribute 'id="a"' into $ch ;
insert attribute 'nonumber=1' into $ch ;
insert chunk "<title>Appendix</title>" prepend $ch ;
insert text "Quamquam sunt sub aqua, sub aqua maledicere temptant." append $ch ;
insert text {"\n"} after /book/chapter[last()] ;

echo "<html><head></head><body>" ;
my $count = 0;
for my $ch in /book/chapter {
    echo "<h1>" ;
    if not($ch[@nonumber=1]) echo :s {++$count} "." ;
    echo :s $ch/title " (id = " $ch/@id ")</h1>" ;
    for $ch/text() echo "<p>" (.) "</p>" ;
}
echo "</body></html>" ;

XML in NLP

Alternatives to XML

Not a markup language, but a human readable data exchange format: