
Beginning Python - From Novice To Professional (2005)
.pdf330 C H A P T E R 1 5 ■ P Y T H O N A N D T H E W E B
Configuring Apache
Assuming everything went well (if not, check out the sources of information given earlier), you now have to configure Apache to use mod_python. Find the Apache configuration file that is used for specifying modules; it is usually called httpd.conf or apache.conf. Add the line that corresponds to your operating system:
# UNIX
LoadModule python_module libexec/mod_python.so
# Windows
LoadModule python_module modules/mod_python.so
There may be slight variations in how to write this (for example, the exact path to mod_python.so), though the correct version for UNIX should have been reported as a result of running configure, earlier.
Now Apache knows where to find mod_python, but it has no reason to use it: You need to tell it when to do so. To do that, you must add some lines to your Apache configuration, either in some main configuration file (possibly commonapache2.conf, depending on your installation), or in a file called .htaccess in the directory where you place your scripts for Web access. (The latter option is only available if it has been allowed in the main configuration of the server using the AllowOverride directive.) In the following, I assume that you’re using the .htaccess method; otherwise, you have to wrap the directives like this (remember to use quotes around the path if you are a Windows user):
<Directory /path/to/your/directory> (Add the directives here)
</Directory>
The specific directives to use are described in the following sections.
CGI Handler
The CGI handler simulates the environment your program runs in when you actually use CGI. So you’re really using mod_python to run your program, but you can still (mostly) write it as if it were a CGI script, using the cgi and cgitb modules, for example. (There are some limitations; see the documentation for details.)
The main reason for using the CGI handler as opposed to plain CGI is performance. According to a simple test in the mod_python documentation, you can increase your performance by about one order of magnitude (a factor of about 10) or even more. The publisher (described later) is faster than this, and writing your own handler is even faster, possibly tripling the speed of the CGI handler. If you only want speed, the CGI handler may be an easy option. If you’re writing new code, though, and want some extra functionality and flexibility, using one of the other solutions (described in the following sections) is probably a better idea—the CGI handler doesn’t really tap into the great potential of mod_python, and is best used with legacy code.
Anyway, to get the CGI show on the road, put the following in an .htaccess file in the directory where you keep your CGI scripts:
SetHandler mod_python
PythonHandler mod_python.cgihandler

C H A P T E R 1 5 ■ P Y T H O N A N D T H E W E B |
331 |
For debugging information (which can be useful when something goes wrong, as it usually will), you can add the following:
PythonDebug On
You should remove this directive when you’re done developing; there’s no point in exposing the innards of your program to the (potentially malevolent) public.
Once you’ve set things up properly, you should be able to run your CGI scripts just like before.
■Note In order to get this to work, you might need to give your script a .py ending, even if you access it with a URL ending in .cgi. mod_python converts the .cgi to a .py when it looks for a file to fulfill the request.
PSP
If you’ve used PHP (the PHP: Hypertext Preprocessor, originally known as Personal Home Page), Microsoft ASP (Active Server Pages), JSP (JavaServer Pages), or something similar, the concepts underlying PSP, or Python Server Pages, should be familiar. PSP documents are a mix of HTML (or, for that matter, some other form of document) and Python code, with the Python code enclosed in special-purpose tags. Any HTML (or other plain data) will be converted to calls to an output function.
Setting Apache up to serve your PSP pages is as simple as putting the following in your
.htaccess file:
AddHandler mod_python .psp
PythonHandler mod_python.psp
This will treat files with the .psp file extension as PSP files.
■Caution While developing your PSP pages, using the directive PythonDebug On can be useful. You should not, though, keep it on when the system is used for real, because any error in the PSP page will result in an exception traceback including the source code being served to the user. Letting a potentially hostile user see the source code of your program is something that should not be done lightly. If you publish the code deliberately, others may help you find security flaws, and this can definitely be one of the strong sides to open source software development. However, simply letting users glimpse your code through error messages is probably not useful, and potentially a security risk.
There are two main sets of PSP tags: one set for statements, another for expressions. The values of expressions in expression tags are directly put into the output document. Listing 15-8 is a simple PSP example, which first performs some setup code (statements) and then outputs some random data as part of the Web page, using an expression tag.
332 C H A P T E R 1 5 ■ P Y T H O N A N D T H E W E B
Listing 15-8. A Slightly Stochastic PSP Example
<%
from random import choice adjectives = ['beautiful', 'cruel'] %>
<html>
<head>
<title>Hello</title>
</head>
<body>
<p>Hello, <%=choice(adjectives)%> world. My name is Mr. Gumby.</p> </body>
</html>
You can mix plain output, statements, and expressions in any way you like. You can write comments (that will not be part of the output) <%- like this -%>. There is really very little to PSP programming beyond these basics. You need to be aware of one issue, though: If code in a statement tag starts an indented block, the block will persist, with the following HTML being put inside the block. One way to close such a block is to insert a comment, as in the following:
A <%
for i in range(3): %> merry, <%
# End the for loop
%> merry christmas time.
In general, if you’ve used PHP or JSP or the like, you will probably notice that PSP is more picky about newlines and indentation; this is, of course, a feature inherited from Python itself.
There are many, many other systems that somewhat resemble mod_python’s PSP, and even some that are almost identical (such as the Webware PSP system, available from http:// webwareforpython.org) or similarly named, but with a rather different syntax (such as the Spyce PSP, available from http://spyce.sf.net). The Web development system Zope (see http:// zope.org) has its own template languages (such as ZPT). The rather innovative template system ClearSilver (see http://clearsilver.net) has Python bindings, and could be an interesting alternative for the curious. A visit to the Parnassus Web category (http://py.vaults.ca/apyllo. py?i=127386987) or a Web search for “python template system” (or something similar) should point you toward several other interesting systems.
The Publisher
This is where mod_python really comes into its own: It lets you write Python programs that have a much more interesting environment than CGI scripts. To use the publisher handler, put the following in your .htaccess file (again, optionally adding PythonDebug On while you’re developing):
AddHandler mod_python .py
PythonHandler mod_python.publisher

C H A P T E R 1 5 ■ P Y T H O N A N D T H E W E B |
333 |
This will run any file with a name ending in .py as a Python script, using the publisher handler. The first thing to know about the publisher is that it exposes functions to the Web as if they were documents. For example, if you have a script called script.py available from http:// example.com/script.py that contains a function called func, the URL http://example.com/ script.py/func will make the publisher first run the function (with a special request object as the only parameter) and display whatever is returned as the document displayed to the user. As is the custom with ordinary Web documents, the default “document” (that is, function) is called index, so the URL http://example.com/script.py will call the function by that name. In other words, something like the following is sufficient to make use of the publisher handler:
def index(req):
return "Hello, world!"
The request object lets you access several pieces of information about the request received, as well as setting custom HTTP headers and the like—consult the mod_python documentation for instructions on how to use it. If you don’t care about it, you can just drop it, like this:
def index():
return "Hello, world!"
The publisher actually checks how many arguments the given function takes as well as what they’re called and supplies only what it can accept.
■Tip You can also do this sort of magic checking if you want to. It is not necessarily portable across Python implementations (for example, to Jython), but if you’re sticking to CPython, you can use the inspect module to poke at such corners of functions (and other objects) to see how many arguments they take and what the arguments are called.
You can give your function more (or just other) arguments than the request object, too:
def greet(name='world'): return 'Hello, %s!' % name
Note that the dispatcher uses the names of the arguments, so when there is no argument called req, you won’t receive the request object. You can now access this function and supply it with an argument using a URL such as http://example.com/script.py/greet?name=Gumby.
The resulting Web page should now contain the greeting “Hello, Gumby!”.
Note that the default argument is quite useful; if the user (or the calling program) doesn’t supply all parameters, it’s better to display a default page of some sort than to confront the user with a rather noninformative “internal server error” message. Also, it would be problematic if supplying extra arguments (not used by the function) would lead to an error condition; luckily, it won’t. The dispatcher only uses the arguments it needs.
One nice thing about the dispatcher is that access control and authorization is very easy to implement. The path given in the URL (after the script name) is actually a series of attribute lookups; for each step in the series of lookups, mod_python also looks for the attributes __auth__ and __access__ in the same object (or module) as the attribute itself. If you have

334 C H A P T E R 1 5 ■ P Y T H O N A N D T H E W E B
defined the _ _auth_ _ attribute, and it is callable (for example, a function or method), the user is queried for a user name and password, and _ _auth_ _ is called with the request object, the user name, and the password. If the return value is true, the user is authenticated. If _ _auth_ _ is a dictionary, the user name will be looked up, and the password will be matched against the corresponding key. The _ _auth_ _ attribute can also be some constant value; if it is false, the user is never authorized. (You can use the _ _auth_realm_ _ attribute to give the realm name, usually used in the login query dialog box.) Once a user has been authenticated, it is time to check whether he or she should be granted access to a given object (for example, the module or script itself); for this you use the _ _access_ _ attribute. If you have defined _ _access_ _ and it is callable, it is called with the request object and the user name, and, again, the truth value returned determines whether the user is granted access (with a true value granting access). If __access__ is a list, then the user is granted access if the user name is found in the list. Just like _ _auth_ _, _ _access_ _ can be a Boolean constant.
Listing 15-9 gives a simple example of a script with authentication and access control.
Listing 15-9. Simple Authentication with the mod_python Publisher
from sha import sha
__auth_realm__ = "A simple test"
def __auth__(req, user, pswd):
return user == "gumby" and sha(pswd).hexdigest() == \ '17a15a277d43d3d9514ff731a7b5fa92dfd37aff'
def __access__(req, user): return True
def index(req, name="world"):
return "<html>Hello, %s!</html>" % name
Note that the script in Listing 15-9 uses the sha module to avoid storing the password (which is “goop”, by the way) in plain text. Instead, a digest of the correct password is compared with a digest of the password supplied by the user. This doesn’t give a great increase in security, but it’s better than nothing. The __access__ function doesn’t really do anything useful here. In a real application, you might have a common authentication function, to check that the users really are who they claim to be (that is, verify that the passwords fit the user names), and then use specialized __access__ functions (or lists) in different objects to restrict access to a subset of the users. For more information about how objects are published, see the section “The Publishing Algorithm” in the mod_python documentation.
■Note The __auth__ mechanism uses HTTP authentication, as opposed to the cookie-based authentication used by some systems.
C H A P T E R 1 5 ■ P Y T H O N A N D T H E W E B |
335 |
Web Services: Scraping Done Right
Web services are a bit like computer-friendly Web pages. They are based on standards and protocols that enable programs to exchange information across the network, usually with one program, the client or service requester, asking for some information or service, and the other program, the server or service provider, providing this information or service. Yeah, I know. Glaringly obvious stuff. And it also seems very similar to the network programming discussed in Chapter 14. There are differences, though . . .
Web services often work on a rather high level of abstraction. They use HTTP (the “Web protocol”) as the underlying protocol; on top of this, they use more content-oriented protocols, for example, using some XML format to encode requests and responses. This means that a Web server can be the platform for Web services. As the title of this section indicates, it’s Web scraping taken to another level; one could see the Web service as a dynamic Web page designed for a computerized client, rather than for human consumption.
There are standards for Web services that go really, really far in capturing all kinds of complexity, but you can get a lot done with utter simplicity as well. In this section, I discuss two simple Web service protocols: I start with the simplest, RSS, which you could even argue is so simple that it isn’t really a Web service protocol at all. Then I show you how to use XML-RPC from the client side; the server side is dealt with in more detail in Chapter 27. There are several other standards you might want to check out. For example, SOAP is, in some sense, XML-RPC on steroids, and WSDL is a format for describing Web services formally. A good Web search engine will, as always, be your friend here.
RSS
RSS, which stands for either Rich Site Summary, RDF Site Summary, or Really Simple Syndication (depending on the version number), is, in its simplest form, a format for listing news items in XML. What makes RSS documents (or feeds) more of a service than simply a static document is that they’re expected to be updated regularly (or irregularly). They may even be computed dynamically, representing, for example, the most recent additions to a Web log or the like. There are plenty of RSS readers out there, and because the RSS format is so easy to deal with, it’s easy to find new applications for it. For example, some browsers (such as Mozilla Firefox) will let you bookmark an RSS feed, and will then give you a dynamic bookmark submenu with the individual news items as menu items. Some people are even using RSS feeds to “broadcast” sound or video files (called podcasting).
One slightly confusing part of the RSS picture is that versions 0.9x and 2.0.x, now mainly called Really Simple Syndication (with 0.9x originally called Rich Site Summary), are sort of compatible with each other, but completely incompatible with RSS 1.0. There are also other formats for this sort of news feeds and site syndication, such as the more recent Atom (see, for example, http://ietf.org/html.charters/atompub-charter.html). The problem is that if you want to write a client program that handles feeds from several sites, you must be prepared to parse several different formats; you may even have to parse HTML fragments found in the messages themselves. In this section, I’ll use a tiny subset of RSS 2.0. Listing 15-10 shows an example RSS file; for full specifications of recent RSS 2.0 versions, see http://blogs.law.harvard. edu/tech/rss. For a specification of RSS 1.0, see http://web.resource.org/rss/1.0.
336 C H A P T E R 1 5 ■ P Y T H O N A N D T H E W E B
Listing 15-10. A Simple RSS 2.0 File
<?xml version="1.0"?> <rss version="2.0">
<channel>
<title>Example Top Stories</title> <link>http://www.example.com<link> <description>
Example News is a top notch provider of meaningless news items. </description>
<item>
<title>Interesting stuff</title>
<description>Something really interesting happened today</description> <link>http://www.example.com/newsitem1.html</link>
</item>
<item>
<title>More interesting stuff</title>
<description>Then something even more interesting happened</description> <link>http://www.example.com/newsitem2.html</link>
</item>
</channel>
</rss>
The RSS 2.0 standard specifies a few mandatory elements, and many optional ones. You can count on an RSS 2.0 channel element having a title, link, and description. They can contain (among other things) zero or more item elements, which, at the very least, have either a title or a description. If you’re writing a program to deal with a specific feed, a good idea might be to simply find out which elements it provides.
Another thing making the parsing a bit challenging is the sad fact that even though RSS is supposed to be valid XML, and therefore easy to parse, chances are you will come across illformed RSS feeds. If nothing else, the news messages themselves may contain such illegalities as unescaped ampersands (&) or the like.
There aren’t really (at the time of writing) any obvious standard RSS modules for Python that will handle these difficulties, so you’re more or less back to screen scraping (for now, at least). Luckily, the handy Beautiful Soup parser can deal with XML as well as HTML, and it won’t complain about a bit of sloppiness on the part of the RSS feed. To round off this little introduction to RSS, Listing 15-11 is an example program that will get the top stories from Wired News (http://wired.com). Note that it uses the class BeautifulStoneSoup, rather than BeautifulSoup, to parse the RSS feed; this class can deal with XML in general, while BeautifulSoup is targeted specifically at HTML. (In order to use the BeautifulStoneSoup class, you will, of course, need to download BeautifulSoup, as discussed earlier in this chapter.) The program also demonstrates how you can use the wrap function from the standard Python module textwrap to make text fit nicely on the screen.

C H A P T E R 1 5 ■ P Y T H O N A N D T H E W E B |
337 |
■Tip You can find an RSS feed of Python news at http://python.org/channews.rdf. This feed is in RSS 1.0 format, though.
Listing 15-11. Getting the Wired News
from BeautifulSoup import BeautifulStoneSoup from urllib import urlopen
from textwrap import wrap
URL = 'http://www.wired.com/news/feeds/rss2'
soup = BeautifulStoneSoup(urlopen(URL).read())
for item in soup('item'): print item.title.string
print '-'*len(item.title.string)
print '\n'.join(wrap(item.description.string)) print '[%s]\n' % item.link.string
XML-RPC
RPC stands for Remote Procedure Call, and XML is a language you’ve already encountered a few times by now. The XML-RPC Web site at http://www.xmlrpc.com is a good source of information. The standard library module xmlrpclib module is used to connect to XML-RPC servers, and SimpleXMLRPCServer is a class you can use to write such servers. In Chapter 27 you will learn more about the server stuff—here I will only look at the client side.
■Caution The version of SimpleXMLRPCServer that ships with Python 2.2 contains a serious flaw in its handling of exceptions. If you use this version, the programs in this chapter and Chapters 27 and 28 may appear to work, but in fact they won’t be working correctly. Either use a newer version of Python (for example, version 2.3) or download a new version of SimpleXMLRPCServer.py from http://www.sweetapp.com/ xmlrpc, and replace your current version with the downloaded one. Also, in Python 2.2, 2.3, and 2.4 the XML-RPC code contains some vulnerabilities; see http://www.python.org/security/PSF-2005-001 for information on this, as well as fixes for the problem. The easiest solution is to use either version 2.3.5 or newer in the 2.3 series, or 2.4.1 or newer in the 2.4 series, as the problems have been fixed there.
To show you how easy xmlrpclib is to use, here is an example that connects to the XMLRPC server at The Covers Project (http://coversproject.com), a database of cover songs:

338C H A P T E R 1 5 ■ P Y T H O N A N D T H E W E B
>>>from xmlrpclib import ServerProxy
>>>server = ServerProxy('http://coversproject.com/RPC.php')
>>>covers = server.covers.Covered('Monty Python')
>>>for cover in covers:
... if cover['song'] == 'Brave Sir Robin':
... print cover['artist']
...
Happy Rhodes
The ServerProxy object looks like a normal object with various methods you can call, but in fact, whenever you call one of its methods, it sends a request to the server, which responds to the requests and returns an answer. So, in a way, you’re calling the method covers.Covered on the server itself. Network programming could hardly be any easier than this. (For more information about the XML-RPC methods supplied by The Covers Project, see http:// coversproject.com/about/xmlrpc.html.)
■Note XML-RPC procedure (function/method) names may contain dots, as in the preceding example. The name covers.Covered does not imply the existence of an object named covers on the server—the dots are only used to structure the names.
Here is a slightly more complicated example, which uses the news service Meerkat to find some articles about Python:
>>>from xmlrpclib import ServerProxy
>>>query = {'search': 'Python', 'num_items': 5}
>>>s = ServerProxy('http://www.oreillynet.com/meerkat/xml-rpc/server.php')
>>>items = s.meerkat.getItems(query)
>>>[i['title'] for i in items]
['MacHack: Meet Dylan the Yoot', 'Hap Debugger', 'Project and proposal for integrating validation with processing pipelines', 'Spam Check', 'ZCoMIX 1.0 Final Released']
>>>items[0].keys()['link', 'description', 'title']
>>>items[3]['link']
'http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/134945'
As you can see, the method meerkat.getItems is called with a mapping parameter that contains various arguments (in this case a search query and the number of items to be returned) and returns a list of mappings, each of which has a title, a description, and a link. Actually, there is a lot more to Meerkat than this—if you want to experiment, take a look at the Meerkat Web site (http://www.oreillynet.com/meerkat) or one of the many online tutorials about
the Meerkat XML-RPC API (for example, http://oreillynet.com/pub/a/rss/2000/11/14/ meerkat_xmlrpc.html).
C H A P T E R 1 5 ■ P Y T H O N A N D T H E W E B |
339 |
A Quick Summary
As always, new topics were covered—and here they are again:
Screen scraping. This is the practice of downloading Web pages automatically, and extracting information from them. The Tidy program and its library version are useful tools for fixing ill-formed HTML before using an HTML parser. Another option is to use Beautiful Soup, which is very forgiving of messy input.
CGI. The Common Gateway Interface is a way of creating dynamic Web pages, by making a Web server run and communicate with your programs and display the results. The cgi and cgitb modules are useful for writing CGI scripts. CGI scripts are usually invoked from HTML forms.
mod_python. The mod_python handler framework makes it possible to write Apache handlers in Python. It includes three useful standard handlers: the CGI handler, the PSP handler, and the publisher handler.
Web services. Web services are to programs what (dynamic) Web pages are to people. You may see them as a way of making it possible to do network programming at a higher level of abstraction. Two example Web service standards discussed in this chapter are RSS and XML-RPC.
New Functions in This Chapter
Function |
Description |
cgitb.enable() |
Enables tracebacks in CGI scripts |
|
|
What Now?
I’m sure you’ve tested the programs you’ve written so far by running them. In the next chapter, you will learn how you can really test them—thoroughly and methodically. Maybe even obsessively, if you’re lucky . . .