
Beginning Python (2005)

Chapter 15: Using Python for XML

    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="author" type="xs:string" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="owner" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

This expresses exactly the same data model as the DTD, but some differences are immediately apparent.

Schemas Are Pure XML

To begin with, this document’s top-level node contains a namespace declaration, specifying that all tags starting with xs: belong to the namespace identified by the URI “http://www.w3.org/2001/XMLSchema”. What this means for practical purposes is that you now have a document model that you can validate your schema against, using the same tools you would use to validate any other XML document.

Schemas Are Hierarchical

Next, notice that the preceding document has a hierarchy very similar to the document it is describing. Rather than create individual elements and link them together using references, the document model mimics the structure of the document as closely as possible. You can also create global elements and then reference them in a structure, but you are not required to use references; they are optional. This creates a more intuitive structure for visualizing the form of possible documents that can be created from this model.

Other Advantages of Schemas

Finally, schemas support attributes such as maxOccurs, which takes either a nonnegative integer or the value unbounded, which expresses that any number of that element or grouping may occur. Although this schema doesn’t illustrate it, schemas can require that an element’s content match a specific regular expression, using the pattern facet, and schemas can express more flexible content models by mixing the choice and sequence content models.

Schemas Are Less Widely Supported

One of the downsides of schemas is that they haven’t been around as a standard for very long. If you are using commercial processors and XML editors, they are more likely to support DTDs than schemas. Schemas are slowly gaining popularity in the marketplace, but DTDs are still the language of choice, and if you want to include other vocabularies in yours, especially from the W3C, odds are good that they’ll be specified as DTDs, not schemas. RSS (Rich Site Summary, which you’ll learn more about in this chapter) is specified using a DTD.

XPath

XPath is a language for describing locations and node sets within an XML document. Entire books have been written on it. However, the basics are fairly simple. An XPath expression contains a description of a pattern that a node must match. If the node matches, it is selected; otherwise, it is ignored. Patterns are composed of a series of steps, either relative to a context node or absolutely defined from the document root. An absolute path begins with a slash, a relative one does not, and each step is separated by a slash.

A step contains three parts: an axis that describes the direction to travel, a node test to select nodes along that axis, and optional predicates, which are Boolean (true or false) tests that a node must meet. An example step might be ancestor-or-self::book[1], where ancestor-or-self is the axis to move along, book is the node test, and [1] is a predicate specifying to select the first node that meets all the other conditions. If the axis is omitted, it is assumed to refer to the child axis for the current node, so library/book[1]/author[1] would select the first author of the first book in the library.

A node test can be a function as well as a node name. For instance, book/node() will return all nodes below the selected book node, regardless of whether they are text or elements.
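As a rough illustration of the child-axis paths described above, the following sketch uses xml.etree.ElementTree from the Python 3 standard library, which implements only a limited subset of XPath (it has no node() test and no standalone attribute axis, for example). The sample document, its titles and author names, and the helper function names are all assumptions made up for this example. Because the root element here is already the library element, the path is written relative to it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the titles and names are placeholders.
LIBRARY_XML = """<library owner="John Q. Reader">
  <book>
    <title>First Book</title>
    <author>Author A</author>
    <author>Author B</author>
  </book>
  <book>
    <title>Second Book</title>
    <author>Author C</author>
  </book>
</library>"""

# fromstring returns the root element, here <library>.
root = ET.fromstring(LIBRARY_XML)


def first_author_of_first_book(library):
    # Equivalent to library/book[1]/author[1], evaluated relative to <library>.
    node = library.find("book[1]/author[1]")
    return node.text if node is not None else None


def all_titles(library):
    # .//title uses the // shortcut to match title elements at any depth.
    return [title.text for title in library.findall(".//title")]


def owner(library):
    # ElementTree has no standalone attribute axis, so read the attribute directly.
    return library.get("owner")
```

Calling first_author_of_first_book(root) selects the same node that library/book[1]/author[1] would select from the document root in a full XPath engine.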

The following table describes a handful of shortcuts for axes.

Shortcut   Meaning
@          Specifies the attribute axis. This is an abbreviation for attribute::
*          Matches all element children of the current node
//         Specifies any descendant of the current node. This is an abbreviation for /descendant-or-self::node()/. If used at the beginning of an XPath, matches elements anywhere in the document.

For a more thorough coverage of the subject, you may want to visit http://w3schools.org or pick up a book on XPath.

HTML as a Subset of XML

XML bears a striking resemblance to HTML. This isn’t entirely by accident. XML and HTML both sprang from SGML and share a number of syntactic features. Earlier versions of HTML aren’t directly compatible with XML, because XML requires that every tag be closed, and certain HTML tags don’t require a closing tag, such as <br> and <img>. However, the W3C has declared the XHTML schema in an attempt to bring the two standards in line with each other. XHTML can be manipulated using the same sets of tools as pure XML. However, Python also comes with specialized libraries designed specifically for dealing with HTML.


The HTML DTDs

The current version of HTML is 4.01, which includes 4.01 Transitional, 4.01 Strict, and 4.01 Frameset, specifically for dealing with frames. However, many people still use HTML 3.2, so it’s useful to be able to parse documents from earlier DTDs.

HTMLParser

The HTMLParser class, unlike the htmllib class, is not based on an SGML parser and can be used for both XHTML and earlier versions of HTML.

Try It Out

Using HTMLParser

1. Create a sample HTML file named headings.html that contains at least one h1 tag.

2. Cut and paste the following code from the wrox.com web site into a file:

from HTMLParser import HTMLParser

class HeadingParser(HTMLParser):
    # Flag that is True while the parser is inside an <h1> element
    inHeading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.inHeading = True
            print "Found a Heading 1"

    def handle_data(self, data):
        # Print text only when it appears inside an <h1>
        if self.inHeading:
            print data

    def handle_endtag(self, tag):
        if tag == "h1":
            self.inHeading = False

hParser = HeadingParser()

# Read the whole file and feed it to the parser in one call
file = open("headings.html", "r")
html = file.read()
file.close()

hParser.feed(html)

3. Run the code.

How It Works

The HTMLParser class defines methods that are called when the parser finds certain types of content, such as a beginning tag, an end tag, or a processing instruction. By default, these methods do nothing. To parse an HTML document, you must create a class that inherits from HTMLParser and implements the necessary methods. After a parser class has been created and instantiated, the parser is fed data using the feed method. Data can be fed to it one line at a time or all at once.


This example class only handles tags of type <h1>. When an HTMLParser encounters a tag, the handle_starttag method is called, and the tag name and any attached attributes are passed to it. This handle_starttag method determines whether the tag is an <h1>. If so, it prints a message saying it has encountered an h1 and sets a flag indicating that it is currently in an <h1>.

If text data is found, the handle_data method is called, which determines whether it is in an h1, based on the flag. If the flag is true, the method prints the text data.

If a closing tag is encountered, the handle_endtag method is called, which determines whether the tag that was just closed was an <h1>. If so, it sets the flag to false.
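The listing above is written for Python 2. As a hedged sketch of the same flag-based technique under Python 3, where the class lives in the html.parser module, the following version collects heading text in a list instead of printing it. The names HeadingCollector and collect_headings are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text of every <h1> element (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.in_heading = False  # True while inside an <h1>
        self.headings = []       # one string per <h1> found

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_heading = True
            self.headings.append("")

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current heading.
        if self.in_heading:
            self.headings[-1] += data

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False


def collect_headings(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings
```

Because handle_data appends to the current heading rather than replacing it, the result should not depend on whether the document is fed all at once or in smaller pieces.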

htmllib

htmllib is a parser based on the sgmllib SGML parser. It defines an HTMLParser class that extends the SGML parser class, and in turn, expects to be extended as a subclass to implement its handler methods. It must be provided with input in string form via a method, and it makes calls to methods of a formatter object in order to produce output. It does not work with XHTML. It comes with predefined methods for all HTML 2.0 elements and a number of 3.0 and 3.2 elements.

To parse an HTML document, your subclass must override the handler methods for each HTML element it cares about. Handler methods for tags that don’t have closing tags, such as <br>, take the form do_<tagname>. Tags that have both a closing and opening tag have handler methods of the form start_<tagname> and end_<tagname>.

Try It Out

Using htmllib

To see how the htmllib can be used, try the following example:

from formatter import AbstractFormatter, DumbWriter
from htmllib import HTMLParser

class HeadingParser(HTMLParser):
    # start_<tagname> handlers receive the tag's attribute list
    def start_h1(self, attrs):
        print "found H1"

writer = DumbWriter()
formatter = AbstractFormatter(writer)
parser = HeadingParser(formatter)
parser.feed(open('headings.html').read())
parser.close()

print "Finished parsing"

How It Works

The HeadingParser class extends the HTMLParser class. As an example, it implements a handler method for the h1 element. The HTMLParser class expects a formatter object to handle formatted output. The formatter, in turn, expects a writer object. Fortunately, the formatter module contains some simple default implementations of these interfaces called AbstractFormatter and DumbWriter. When the formatter for the HeadingParser has been set, the feed method is used to feed data into the parser, either all at once, as this example shows, or one line at a time. Because the parser is event-driven, either way of feeding data will have the same result. When the parser is done, it should be closed to release any open handles.

XML Libraries Available for Python

Python comes standard with a number of libraries designed to help you work with XML. You have your choice of several DOM (Document Object Model) implementations, an interface to the nonvalidating Expat XML parser, and several libraries for using SAX (the Simple API for XML).

The available DOM implementations are as follows:

xml.dom: A fully compliant DOM processor

xml.dom.minidom: A lightweight and much faster but not fully compliant implementation of the DOM specification

The PyXML package is a freely available open-source collection of third-party libraries to process XML with Python. Documentation and downloads are available from Sourceforge at http://pyxml.sourceforge.net/. It contains a number of useful utility libraries for dealing with XML, such as a pretty printer for outputting easy-to-read XML, as well as some additional parsers. The full list includes the following:

xmlproc: A validating XML parser

Expat: A fast nonvalidating parser

sgmlop: A C helper module that can speed up xmllib.py and sgmllib.py by a factor of 5

PySAX: SAX1 and SAX2 libraries with drivers for most of the parsers

4DOM: A fully compliant DOM Level 2 implementation

javadom: An adapter from Java DOM implementations to the standard Python DOM binding

pulldom: A DOM implementation that supports lazy instantiation of nodes

marshal: Enables Python objects to be serialized to XML

If you don’t already have PyXML installed in your system, please install it now. You will need it to complete examples later in this chapter. Detailed installation instructions are available with the download.

Validating XML Using Python

Document models are wonderful things for describing the kind of data that’s expected, but they aren’t very useful if documents aren’t verified against them. Surprisingly, many XML processors don’t do this automatically; you are expected to supply your own code for verifying the XML. Luckily, there are libraries that do just that.


What Is Validation?

Validation is the process of verifying that a document matches the document model that has been specified for it. It verifies that tag names match the vocabulary specified, that attributes match the enumeration or pattern that has been specified for them, and so on.

Well-Formedness versus Validation

All of the XML parsers available will check documents for well-formedness. This guarantees that any documents being processed are complete, that every tag opened has been closed, that opening and closing tags are properly matched and nested, and so on.

If these properties are all satisfied, then the document is well-formed. But validation involves more than that.
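To make the distinction concrete, here is a minimal sketch of a well-formedness check, assuming Python 3 and the standard library's xml.parsers.expat interface to the nonvalidating Expat parser. The function name is_well_formed is hypothetical. Note that the check knows nothing about any document model, so a document that uses unknown tag names still passes.

```python
import xml.parsers.expat


def is_well_formed(xml_text):
    """Return True if xml_text parses without a well-formedness error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True
```

A check like this only answers the first question (is the document well-formed?); deciding whether the document matches a DTD or schema still requires a validating parser.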

Available Tools

Only one of the parsers available for Python today actually validates against a document model, and that is xmlproc. The xmlproc parser is available as part of the PyXML package; it is not a part of the core Python libraries. To continue with the XML examples in this chapter, you will need to download and install the PyXML package.

Try It Out

Validation Using xmlproc

1. Change the line reading <library owner="John Q. Reader"> to the following line in your example XML library and save it to a file called library.xml:

<library owner="John Q. Reader" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="library.xsd">

2. Save the example schema from earlier in the chapter to a file called library.xsd.

3. Download and install PyXML on your system if you haven’t already. The following code has been tested using PyXML 0.8.4.

4. Place the following code into a file called validator.py:

#!/usr/bin/python

from xml.parsers.xmlproc import xmlval

class docErrorHandler(xmlval.ErrorHandler):
    def warning(self, message):
        print message

    def error(self, message):
        print message

    def fatal(self, message):
        print message

parser = xmlval.XMLValidator()
parser.set_error_handler(docErrorHandler(parser))
parser.parse_resource("library.xml")

5. From the command line, run python validator.py.

How It Works

Including the line <library owner="John Q. Reader" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="library.xsd"> in the file registers the prefix xsi to point to the namespace and then uses the noNamespaceSchemaLocation attribute from that namespace to specify that this document uses the library.xsd schema as a content model.

The xmlval module from xmlproc is a module for doing document validation. XMLValidator is a validating parser. It can also use an external parser such as Expat and validate after the external parser has parsed the document.

The XMLValidator class relies on four interfaces: Application, ErrorHandler, PubIdResolver, and InputSourceFactory. An Application object handles document events, and an ErrorHandler handles document parse errors. In a full-fledged XML application, you would implement the Application interface as described later in the section on SAX, but for pure validation, only the ErrorHandler interface needs to be implemented, so that any validation errors that might occur can be printed.

The ErrorHandler has three methods that will need to be implemented: the warning, error, and fatal methods. As the names might indicate, warning handles all warnings, error handles nonfatal errors, and fatal handles fatal errors. For a simple validator, it is only necessary to print any warnings, errors, or fatal errors that may occur, so each of these simply prints the error message.

After the ErrorHandler interface has been implemented, the validating parser needs to be instantiated, and the ErrorHandler needs to be registered with it, using parser.set_error_handler(docErrorHandler(parser)). The __init__ method for an ErrorHandler requires a locator parameter to locate error events, which needs to be of the Parser type.

When everything has been configured, the parse_resource method takes a filename as an argument and parses it, using the ErrorHandler as a callback interface when parsing and validation errors are found.

What Is SAX?

When parsing XML, you have your choice of two different types of parsers: SAX and DOM. SAX stands for the Simple API for XML. Originally only implemented for Java, it was added to Python as of version 2.0. It is a stream-based, event-driven parser. The events are known as document events, and a document event might be the start of an element, the end of an element, encountering a text node, or encountering a comment. For example, the following simple document:

<?xml version="1.0"?>
<author>
  <name>Ursula K. LeGuin</name>
</author>


might fire the following events:

start document
start element: author
start element: name
characters: Ursula K. LeGuin
end element: name
end element: author
end document

Whenever a document event occurs, the parser fires an event for the calling application to handle. More precisely, it fires an event for the calling application’s Content Handler object to handle. Content Handlers are objects that implement a known interface specified by the SAX API from which the parser can call methods. In the preceding example, the parser would call the startDocument method of the content handler, followed by two calls to the startElement method, and so on.
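The event sequence above can be reproduced with a small content handler. This is a sketch assuming Python 3 and the built-in xml.sax package; EventRecorder and record_events are hypothetical names. The parser also reports the whitespace between elements as character data, and it may deliver text in more than one piece, so this handler buffers text and records it only when it is not purely whitespace.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records document events as strings (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._text = []  # buffered character data

    def _flush_text(self):
        # Join buffered pieces and drop whitespace-only text between elements.
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.events.append("characters: " + text)

    def startDocument(self):
        self.events.append("start document")

    def startElement(self, name, attrs):
        self._flush_text()
        self.events.append("start element: " + name)

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.events.append("end element: " + name)

    def endDocument(self):
        self.events.append("end document")


def record_events(xml_bytes):
    handler = EventRecorder()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


SAMPLE = b"""<?xml version="1.0"?>
<author>
  <name>Ursula K. LeGuin</name>
</author>"""
```

Calling record_events(SAMPLE) returns the events in document order, which is the order in which the parser invoked the handler's methods.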

Stream-based

When parsing a document with SAX, the document is read and parsed in the order in which it appears. The parser opens the file or other datasource (such as a URL) as a stream of data (which means that it doesn’t have to have it all at once) and then fires events whenever an element is encountered.

Because the parser does not wait for the whole document to load before beginning parsing, SAX can start firing events very soon after it starts reading the document. However, because SAX does not read the whole document before it begins, it may process part of a document before discovering that the document is badly formed. SAX-based applications should implement error-checking for such conditions.
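The following sketch shows what partial processing looks like in practice, assuming Python 3 and the built-in xml.sax package with its default error handling, which raises SAXParseException on a fatal error. The names NameCollector and parse_partial, and the malformed sample input used with them, are made up for illustration.

```python
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Remembers the name of every element that was started."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)


def parse_partial(xml_bytes):
    """Parse xml_bytes and return (element names seen, error or None)."""
    handler = NameCollector()
    error = None
    try:
        xml.sax.parseString(xml_bytes, handler)
    except xml.sax.SAXParseException as exc:
        # Events fired before the error have already reached the handler.
        error = exc
    return handler.seen, error
```

For input such as b"<library><book><title>T</title></library>", where the book element is never closed, the handler has already seen library, book, and title by the time the error is raised. An application that acts on events as they arrive needs a way to discard or roll back that partial work.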

Event-driven

When working with SAX, document events are handled by event handlers, similar to a GUI. You declare callback functions for specific types of document events, which are then passed to the parser and called when a document event occurs that matches the callback function.

What Is DOM?

At the heart of DOM lies the document object. This is a tree-based representation of the XML document. Tree-based models are a natural fit for XML’s hierarchical structure, making this a very intuitive way of working with XML. Each element in the tree is called a Node object, and it may have attributes, child nodes, text, and so on, all of which are also objects stored in the tree. DOM objects have a number of methods for creating and adding nodes, for finding nodes of a specific type or name, and for reordering or deleting nodes.

In-memory Access

The major difference between SAX and DOM is the latter’s ability to store the entire document in memory and manipulate and search it as a tree, rather than force you to parse the document repeatedly, or force you to build your own in-memory representation of the document. The document is parsed once, and then nodes can be added, removed, or changed in memory and then written back out to a file when the program is finished.
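As a small sketch of this parse-once, modify-in-memory workflow, the following function assumes Python 3 and the standard library's xml.dom.minidom. The function name add_author and the sample element content are hypothetical. It returns the serialized tree as a string; writing that string to a file is left out to keep the example self-contained.

```python
from xml.dom.minidom import parseString


def add_author(xml_text, name):
    """Append an <author> element to the first <book> and return the new XML."""
    doc = parseString(xml_text)  # the document is parsed once
    book = doc.getElementsByTagName("book")[0]

    # New nodes are created by the document and then attached to the tree.
    author = doc.createElement("author")
    author.appendChild(doc.createTextNode(name))
    book.appendChild(author)

    # Serialize the modified tree, starting at the root element.
    return doc.documentElement.toxml()
```

All of the changes happen on the in-memory tree; the source text is never reparsed.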

Why Use SAX or DOM

Although either SAX or DOM can do almost anything you might want to do with XML, there are reasons why you might want to use one over the other for a given task. For instance, if you are working on an application in which you will be modifying an XML document repeatedly based on user input, you might want the convenient random access capabilities of DOM. On the other hand, if you’re building an application that needs to process a stream of XML quickly with minimal overhead, SAX might be a better choice for you. Following are some of the advantages and disadvantages you might want to be aware of when architecting your application to use XML.

Capability Trade-Offs

DOM is architected with random access in mind. It provides a tree that can be manipulated at runtime and needs to be loaded into memory only once. SAX is stream-based, so data arrives as a stream, one character after the next, and processing starts before the document has been seen in its entirety. Therefore, if you want to randomly access data, you have to either build a partial tree of the document in memory based on document events, or reparse the document every time you want a different piece of data.

Most people find the object-oriented behavior of DOM very intuitive and easy to learn. The event-driven model of SAX is more similar to functional programming and can be more challenging to get up to speed on.

Memory Considerations

If you are working in a memory-limited environment, DOM is probably not the right choice. Even on a fairly high-end system, constructing a DOM tree for a 2 or 3 MB XML document can bring the computer grinding to a halt while it processes. Because SAX treats the document as a stream, it never loads the whole document into memory, so it is preferable if you are memory constrained or working with very large documents.

Speed Considerations

Using DOM requires a great deal of up-front processing time while the document tree is being built, but once the tree is built DOM allows for much faster searching and manipulation of nodes because the entire document is in memory. SAX is somewhat fast for searching documents, but not as efficient for their manipulation. However, for document transformations, SAX is considered to be the parser of choice because the event-driven model is fast and very compatible with how XSLT works.

SAX and DOM Parsers Available for Python

The following Python SAX and DOM parsers are available: PyXML, xml.sax, and xml.dom.minidom. They each behave a bit differently, so here is an overview of each of them.


PyXML

PyXML contains the following parsers:

Name      Description
xmlproc   A validating XML parser
Expat     A fast nonvalidating parser
PySAX     SAX1 and SAX2 libraries with drivers for most of the parsers
4DOM      A fully compliant DOM Level 2 implementation
javadom   An adapter from Java DOM implementations to the standard Python DOM binding
pulldom   A DOM implementation that supports lazy instantiation of nodes

xml.sax

xml.sax is the built-in SAX package that comes with Python. It uses the Expat nonvalidating parser by default, but you can pass it a list of parser module names to try instead, which changes this behavior.

xml.dom.minidom

xml.dom.minidom is a lightweight DOM implementation, designed to be simpler and smaller than a full DOM implementation.

Try It Out

Working with XML Using DOM

1. If you haven’t already, save the example XML file from the beginning of this chapter in a file called library.xml.

2. Either type in or get the following code from this book’s web site, and save it to a file called xml_minidom.py:

from xml.dom.minidom import parse
import xml.dom.minidom

def printLibrary(library):
    # Use the library element passed in rather than a global
    books = library.getElementsByTagName("book")
    for book in books:
        print "*****Book*****"
        print "Title: %s" % book.getElementsByTagName("title")[0].childNodes[0].data
        for author in book.getElementsByTagName("author"):
            print "Author: %s" % author.childNodes[0].data

# open an XML file and parse it into a DOM
myDoc = parse('library.xml')
myLibrary = myDoc.getElementsByTagName("library")[0]
