Technology for NLP


NPFL092

Zdeněk Žabokrtský & Rudolf Rosa

{zabokrtsky,rosa}@ufal.mff.cuni.cz

Tuesday 9.00–11.20
SU2

Introduction to XML

eXtensible Markup Language

<firm id="bsi001">
  <name>Boot &amp; Shoe Inc.</name>
  <address>
    <street>246 Walker Drive</street>
    <town>Lacetown</town>
    <state>OH</state>
    <zipcode>44312</zipcode>
    <phone>732754689</phone>
  </address>
  <specialization>
    Shoemaking, cobbler and cordwainer work: shoes, boots, sandals, clogs and moccasins.
  </specialization>
</firm>

Resources: http://www.kosek.cz

History

XML versus Database

Database Table

Surname  Firstname  E-mail             Phone
Smith    John       jsmith@gmail.com   2554119897
Walker   David      dwalk@hotmail.com  8446225653

The same data in XML

<addressbook>
  <person>
    <surname>Smith</surname>
    <firstname>John</firstname>
    <email>jsmith@gmail.com</email>
    <phone>2554119897</phone>
  </person>
  <person>
    <surname>Walker</surname>
    <firstname>David</firstname>
    <email>dwalk@hotmail.com</email>
    <phone>8446225653</phone>
  </person>
</addressbook>
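
The mapping from table rows to XML elements can be sketched in a few lines of Python. This is only an illustration using the standard-library xml.etree.ElementTree module; the names rows, fields and rows_to_addressbook are made up for this example, and ET.indent assumes Python 3.9 or newer.

```python
import xml.etree.ElementTree as ET

# Rows of the database table above: (surname, firstname, email, phone).
rows = [
    ("Smith", "John", "jsmith@gmail.com", "2554119897"),
    ("Walker", "David", "dwalk@hotmail.com", "8446225653"),
]
# Element names used for the columns (same as in the XML version above).
fields = ("surname", "firstname", "email", "phone")


def rows_to_addressbook(rows):
    """Build an <addressbook> element with one <person> child per table row."""
    root = ET.Element("addressbook")
    for row in rows:
        person = ET.SubElement(root, "person")
        for tag, value in zip(fields, row):
            ET.SubElement(person, tag).text = value
    return root


addressbook = rows_to_addressbook(rows)
ET.indent(addressbook)  # pretty-print in place (Python 3.9+)
print(ET.tostring(addressbook, encoding="unicode"))
```

Each table row becomes one person element and each column becomes a child element, so the structure that the table keeps in its header is repeated explicitly in every record.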

Advantages of XML

Quick Syntax Tour

Terms

XML document
text file in the XML format
Element(s)
what a document consists of
Tags
mark element boundaries (start/end tags)
Attributes
additional information associated with elements
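
To see how these terms show up in a parser, here is a minimal sketch that loads a shortened version of the firm example with Python's standard xml.etree.ElementTree module. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# A shortened version of the <firm> document from the introduction.
xml_text = """<firm id="bsi001">
  <name>Boot &amp; Shoe Inc.</name>
  <address>
    <town>Lacetown</town>
    <state>OH</state>
  </address>
</firm>"""

root = ET.fromstring(xml_text)  # parse the XML document into a tree of elements

print(root.tag)                 # element name, taken from the start tag
print(root.attrib)              # attributes of the element as a dict
print(root.find("name").text)   # text content; the entity &amp; is decoded to &

# Child elements of <address>, in document order.
for child in root.find("address"):
    print(child.tag, child.text)
```

Tags themselves are not visible in the parsed tree: the parser uses them only to find element boundaries and then exposes elements, attributes and text.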

Quick Syntax Tour (2)

Quick Syntax Tour (3)

An XML document can (and should) contain instructions for the XML processor
XML declaration:

<?xml version="1.0" encoding="utf-8" ?>

Document type declaration:

<!DOCTYPE MojeKniha SYSTEM "MojeKniha.DTD">

Comments

<!-- Here be dragons -->

(comments are not allowed inside tags and cannot contain --)

Well-Formed XML Document

a document that conforms to all syntactic requirements of XML
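
One way to experiment with well-formedness is to hand a few small strings to a parser and see which ones it rejects. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed is made up for this example. Note that this checks syntax only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<a><b>text</b></a>"))  # properly nested elements
print(is_well_formed("<a><b>text</a></b>"))  # overlapping elements
print(is_well_formed("<a>1</a><a>2</a>"))    # more than one root element
print(is_well_formed("<a x=1/>"))            # attribute value without quotes
```

Only the first string is accepted; the other three each break one syntactic rule (proper nesting, a single root element, quoted attribute values).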

Time to experiment

Document Type Definition (DTD)

DTD Structure

Four types of declarations

Declaration of Elements

Elements containing text

Element content description

Declaration of Attributes

Types of Attribute Values

Required Attribute

Implicit Attribute

DTD Sample

<!ELEMENT collection (description,recipe*)>

<!ELEMENT description ANY>

<!ELEMENT recipe (title,ingredient*,preparation,comment?,nutrition)>

<!ELEMENT title (#PCDATA)>

<!ELEMENT ingredient (ingredient*,preparation)?>
<!ATTLIST ingredient name CDATA #REQUIRED
                     amount CDATA #IMPLIED
                     unit CDATA #IMPLIED>

<!ELEMENT preparation (step*)>

<!ELEMENT step (#PCDATA)>

<!ELEMENT comment (#PCDATA)>

<!ELEMENT nutrition EMPTY>
<!ATTLIST nutrition protein CDATA #REQUIRED
                    carbohydrates CDATA #REQUIRED
                    fat CDATA #REQUIRED
                    calories CDATA #REQUIRED
                    alcohol CDATA #IMPLIED>

Validation

xmllint (distributed with libxml2)
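
A possible way to call xmllint from a script is sketched below. It assumes xmllint is installed and on PATH; the file names recipes.xml and recipes.dtd and the helper names xmllint_command and validate are hypothetical. The options used are --noout (do not print the document), --valid (validate against the DTD named in the DOCTYPE) and --dtdvalid (validate against an explicitly given DTD).

```python
import shutil
import subprocess


def xmllint_command(xml_path, dtd_path=None):
    """Build the xmllint command line for validating xml_path.

    Without dtd_path, the DTD referenced in the document's DOCTYPE is used;
    with dtd_path, the document is validated against that external DTD.
    """
    cmd = ["xmllint", "--noout"]
    if dtd_path is None:
        cmd.append("--valid")
    else:
        cmd += ["--dtdvalid", dtd_path]
    cmd.append(xml_path)
    return cmd


def validate(xml_path, dtd_path=None):
    """Run xmllint and return (is_valid, error_messages)."""
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint not found on PATH")
    result = subprocess.run(
        xmllint_command(xml_path, dtd_path),
        capture_output=True,
        text=True,
    )
    # xmllint exits with 0 on success and a non-zero code on any error.
    return result.returncode == 0, result.stderr


# Hypothetical file names, shown only to illustrate the resulting command.
print(" ".join(xmllint_command("recipes.xml", "recipes.dtd")))
```

Calling validate("recipes.xml", "recipes.dtd") would then return a pair: whether the document is valid, and the messages xmllint wrote to standard error.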