
Beginning Perl Web Development - From Novice To Professional (2006)
90 C H A P T E R 5 ■ LW P M O D U L E S
SOLVING A REAL-WORLD PROBLEM WITH THE LWP
The LWP even helped me to get a console gaming system. In 2004, a popular beverage company had a contest that involved collecting a number of points to earn prizes. These prizes were made available online, but limited quantities of specific prizes were available. For the more popular items, these quantities were quickly depleted. In order to ensure that I was one of the lucky people to get the item I wanted—a gaming console— I needed a method to monitor the web page to see when the item became available. Perl to the rescue!
Using the LWP, I was able to quickly create a script that looked for certain text (“Now Available,” for example) to appear on the page and then sent an e-mail alert when the text was found. With this script set to check every five minutes, I got the gaming console.
Of course, this is just one example of how the LWP can be used to solve a real-world problem, albeit a simple one.
#!/usr/bin/perl -w
use LWP;
use strict;
# Create a browser object that identifies itself as "Perly v1"
my $browser = LWP::UserAgent->new(agent => 'Perly v1');
my $result = $browser->get("http://www.braingia.org/ewfojwefoj");
die "An error occurred: ", $result->status_line()
    unless $result->is_success;
# Do something more meaningful with the content than this!
print $result->content;
When you run this code, it will output the raw web page. It will probably fly past on the screen, likely ending with something like this:
</script>
</body>
</html>
You’ll take a closer look at this code later in the chapter, in the “Retrieving a Web Page” section.
HTTP from 29,999 Feet
While not quite a high-altitude flyover of HTTP—thus 29,999 feet instead of 30,000 feet—this section gives you a primer on HTTP’s inner workings. RFC 2616 (which can be found at http://www.rfc-editor.org/) defines the Hypertext Transfer Protocol (HTTP) and provides the model under which web traffic operates. HTTP is based on requests and responses. In HTTP communications, the requester of a document is the client, and the responder is the server. When you visit a web page in a browser such as Mozilla Firefox, the browser sends the request to the server, which then responds accordingly.

HTTP Requests
An HTTP request contains the method for the request, information about the resource being requested, and the protocol version. These three pieces of information are contained on the first line, known as the request line. Next follow one or more optional header lines, which normally consist of key:value pairs. Finally, an optional body is included in the HTTP request. The body of the HTTP request frequently contains form values being passed as part of the request, but it can include any number of other objects.
Consider this example, which is created with the following command:
telnet www.braingia.org 80
The HTTP request looks like this:
GET / HTTP/1.1
Host: www.braingia.org
The first line is the request line, which contains three pieces of information: the method (GET), the resource (/, which indicates the root directory or that the default file should be served from this directory), and the protocol version (HTTP/1.1). Following the request line is a header. In this case, it is the Host header, which specifies the host (www.braingia.org) that should receive the request. The Host header enables multiple web sites to share the same IP address. It’s up to the web server itself, such as Apache, to handle the request correctly, based on the value of the Host header (see footnote 1). Notice the extra empty line after the header. This carriage return/line feed (CRLF) marks the end of the headers and is required in an HTTP request.
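To make the pieces of the request concrete, here is a minimal sketch in plain Perl that assembles the same request by hand. The build_request() helper is hypothetical (it is not part of the LWP), and the sketch only builds the string; it does not open a connection to the server.

```perl
use strict;
use warnings;

# build_request() is a hypothetical helper, not an LWP function.
# It joins the request line, one Host header, and the terminating
# blank line with CRLF pairs.
sub build_request {
    my ($method, $resource, $host) = @_;
    my $crlf = "\015\012";
    return join $crlf,
        "$method $resource HTTP/1.1",   # request line
        "Host: $host",                  # header line
        '',                             # blank line that ends the headers
        '';                             # empty body
}

my $request = build_request('GET', '/', 'www.braingia.org');

# Make the CRLF pairs visible when printing
(my $visible = $request) =~ s/\015\012/<CRLF>\n/g;
print $visible;
```

Printing the CRLF pairs as visible markers shows that the request ends with an empty line, which is the part that is easiest to forget when typing a request into telnet.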
HTTP Responses
The web server will receive the HTTP request and respond to it. The first line of the response, known as the status line, contains the protocol version, followed by a numeric status code and the text response corresponding to that code.
Following the status line are optional response headers and entity headers. Finally, the optional body is included after an additional blank line (CRLF), as is the case in the request.
Here’s an example of a response, based on the request shown in the previous section:
HTTP/1.1 200 OK
Date: Wed, 06 Apr 2005 15:47:45 GMT
Server: Apache/1.3.26 (Unix) Debian GNU/Linux mod_mono/0.11 mod_perl/1.26
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1
<body follows here. . .>
1. The Host header is quite common in HTTP requests, but that was not always the case. Prior to the existence of the Host header, every web site with its own host and domain name was required to have its own IP address. This contributed to IP address space depletion as the Internet grew. By using the Host header, a single IP address can house thousands of web sites, all using different domain names.

As you can see from this example, the first line (the status line) contains the protocol version (HTTP/1.1), the status (200), and the text associated with that numeric response (OK). The numeric status codes are divided into classes based on the first digit in the code, as listed in Table 5-1.
Table 5-1. HTTP Status Codes
Code Class    Type
1nn           Informational
2nn           Success
3nn           Redirection
4nn           Client error
5nn           Server error
Following the status line are a number of optional header lines, including the date and the server version, Transfer-Encoding, and Content-Type. A blank line (CRLF) is included, followed by the body. In this case, I’ve snipped the body of the response, which was the HTML and other bits from the actual web page.
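The structure just described can be sketched in a few lines of plain Perl. The parse_response() helper below is hypothetical and deliberately simple (it ignores repeated and folded headers); it splits a raw response into status line, headers, and body, and looks up the class from Table 5-1 by the first digit of the status code.

```perl
use strict;
use warnings;

# Classes from Table 5-1, keyed by the first digit of the status code
my %class_for = (
    1 => 'Informational',
    2 => 'Success',
    3 => 'Redirection',
    4 => 'Client error',
    5 => 'Server error',
);

# parse_response() is a hypothetical helper, not an LWP function.
sub parse_response {
    my ($raw) = @_;

    # The headers end at the first blank line; the rest is the body
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    my ($status_line, @header_lines) = split /\r?\n/, $head;

    my ($protocol, $code, $text) =
        $status_line =~ m{^(HTTP/\d+\.\d+)\s+(\d{3})\s*(.*)$}
        or die "Malformed status line: $status_line";

    my %headers;
    for my $line (@header_lines) {
        my ($name, $value) = split /:\s*/, $line, 2;
        $headers{$name} = $value;
    }

    return {
        protocol => $protocol,
        code     => $code,
        text     => $text,
        class    => $class_for{ substr $code, 0, 1 } || 'Unknown',
        headers  => \%headers,
        body     => defined $body ? $body : '',
    };
}

# A shortened copy of the response shown earlier
my $raw = join "\r\n",
    'HTTP/1.1 200 OK',
    'Date: Wed, 06 Apr 2005 15:47:45 GMT',
    'Content-Type: text/html; charset=iso-8859-1',
    '',
    '<html>...</html>';

my $response = parse_response($raw);
print "$response->{code} $response->{text} is a $response->{class} response\n";
print "Content-Type: $response->{headers}{'Content-Type'}\n";
```

In practice the LWP does this parsing for you; the sketch is only meant to show where each part of the response sits.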
■Note Some of the headers that I referred to as optional may be required, depending on the type of request and response. However, most requests and responses won’t require additional headers. Additionally, the body is included in most HTTP transactions, since the body is the content of the web page itself, but realize that the body is indeed optional according to RFC 2616; the response to a HEAD request, for example, never includes one. For more information about HTTP requirements, see RFC 2616 (http://www.rfc-editor.org/).
Without the LWP, the Perl programmer would need to manually code each portion of the HTTP request, in much the same way that the CGI programmer would need to code each portion of the HTTP response if it weren’t for the CGI module. The LWP modules provide functions and object-oriented classes for working with HTTP.
Keeping It Simple with LWP::Simple
The LWP::Simple module gives the programmer a simple interface into common uses of the LWP for working with web resources. It provides five functions that enable you to use the GET and HEAD HTTP methods very easily: get(), getprint(), getstore(), head(), and mirror(). These functions give the programmer just enough control to be dangerous, but they don’t offer the full power provided with the LWP through the LWP::UserAgent module, which I’ll cover after describing the LWP::Simple functions.

Get Functions
Most requests for web pages on the Internet use the GET method. LWP::Simple includes functions to perform GET requests on Internet resources, including the aptly titled get() function:
$page = get("http://www.braingia.org/");
Using this function, the body of the resulting resource will be saved to the variable $page. If the GET request fails, the value of $page will be undefined.
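Because get() signals failure only by returning an undefined value, it is worth testing the result with defined before using it. The following is a minimal sketch: the fetch_or_default() helper is hypothetical, and to keep the example runnable without network access it fetches a temporary local file through a file: URL (an assumption made for illustration; an http: URL is used the same way).

```perl
use strict;
use warnings;
use LWP::Simple;                 # exports get()
use File::Temp qw(tempfile);

# fetch_or_default() is a hypothetical helper: it returns the body of
# the resource, or a fallback string when get() returns undef.
sub fetch_or_default {
    my ($url, $default) = @_;
    my $page = get($url);
    return defined $page ? $page : $default;
}

# Assumption: a temporary local file stands in for a real web page
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "<html><body>Now Available</body></html>\n";
close $fh;

print fetch_or_default("file://$path", "(request failed)\n");
print fetch_or_default("file://$path.missing", "(request failed)\n");
```

The second call asks for a file that does not exist, so get() returns undef and the fallback text is printed instead.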
Related to the get() function are two other functions: getprint() and getstore(). The getprint() function prints the retrieved document to the currently selected filehandle and returns the HTTP status code. Since STDOUT is usually that filehandle, getprint() will normally just output to the screen. This function is useful for simple Perl commands executed from the shell, as opposed to commands from within full-blown Perl programs. For example, a cron job could be created to automatically check the contents of a web page using a command line such as this, which tests the returned status code with is_success() because even an error code such as 404 is a true value in Perl:
> perl -MLWP::Simple -e "is_success(getprint('http://www.braingia.org/')) or die"
The getstore() function takes the output of a web page and automatically stores it in an external file. Obviously, if you actually want to work with that resulting output from within your Perl program, you will need to then open the file and read in its contents.
The getstore() function returns the HTTP status code of the GET request. Pass that code to is_success(), which returns true if the status is in the 200 range, or to is_error(), which returns true if the status is in the 400 or 500 range. This effectively means that you can test to ensure that the GET request was successful by looking to see if is_success() is true. Consider the example shown in Listing 5-1 (Example1.pl).
Listing 5-1. Using is_success() with getstore()
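#!/usr/bin/perl -w

use LWP::Simple;
use strict;

# Save the page to a local file; getstore() returns the HTTP status code
my $status = getstore("http://www.braingia.org/", "/tmp/braingia");
unless (is_success($status)) {
    die "Couldn't retrieve page: $status";
}

# Read the stored copy back and print it
open(my $page, '<', '/tmp/braingia') or die "Cannot open /tmp/braingia: $!";
while (<$page>) {
    print;
}
close($page);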
If the getstore() function is successful, the raw HTML and other page items will be printed to STDOUT, similar to the output shown for the first example in this chapter (Getua.pl).
If you would like to see what happens when an error is returned, simply point the URL for the getstore() function to a file that doesn’t exist, as shown in Listing 5-2 (Example2.pl).

Listing 5-2. Using getstore() to Print an Invalid Page
#!/usr/bin/perl -w

use LWP::Simple;
use strict;

# Request a page that doesn't exist; getstore() returns the HTTP status code
my $status =
    getstore("http://www.braingia.org/nofile.aspx", "/tmp/braingia");
unless (is_success($status)) {
    die "Couldn't retrieve page: $status";
}

open(my $page, '<', '/tmp/braingia') or die "Cannot open /tmp/braingia: $!";
while (<$page>) {
    print;
}
close($page);
There won’t be a file named nofile.aspx on my web site (I’d be surprised if I ever have anything named *.aspx on my site), so the getstore() function will return a 404, for a Page Not Found error, which will, in turn, cause is_success to be false. The script will die and output the status message:
Couldn't retrieve page: 404 at ./example2.pl line 10.
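The status helpers used in Listings 5-1 and 5-2 can also be tried on their own, without making a request. This short sketch feeds a few status codes to is_success() and is_error(); the verdict_for() helper is hypothetical and exists only to label the result.

```perl
use strict;
use warnings;
use LWP::Simple;    # exports is_success() and is_error()

# verdict_for() is a hypothetical helper that labels a status code
sub verdict_for {
    my ($status) = @_;
    return is_success($status) ? 'success'
         : is_error($status)   ? 'error'
         :                       'neither';
}

print "$_: ", verdict_for($_), "\n" for (200, 304, 404, 500);
```

Note that a code such as 304 is neither a success nor an error, so testing only is_error() is not the same as testing is_success().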
The Head Function
The HEAD method is normally used to test hypertext links for validity and, when implemented by the server, returns the header information in the same way that a GET request would. The HEAD method never returns the body of the resource.
■Caution Unfortunately, the HEAD method is not supported by all web servers and is turned off by others. This means that the use of the HEAD method is unreliable.
LWP::Simple implements the HEAD method with the head() function. You can use this function in either a scalar or list context.
In a scalar context, head() returns true or false based on the status of the response. You can use this form in an if or unless control structure to test for success:
die "Wasn't able to run the HEAD method on the URL"
    unless head('http://www.braingia.org');

When called in a list context, the head() function returns five items from the response header:
• Content type
• Document length
• Modified time
• Expires
• Server
For example, the head() function might be called in this manner in order to capture the five values:
my ($content_type, $doclen, $modified, $expires, $server) =
    head('http://www.braingia.org');
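As a rough sketch of what you might do with those five values, the hypothetical describe_head() helper below formats them for display. It assumes that the two time values arrive as epoch seconds (which, as far as I know, is how LWP::Simple returns them) and that any of the values may be undefined when the server omits the corresponding header. The sample values are made up so the sketch runs without network access.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# describe_head() is a hypothetical helper. It accepts the five values
# that head() returns in list context and formats them as text.
sub describe_head {
    my ($content_type, $doclen, $modified, $expires, $server) = @_;
    my $show = sub { defined $_[0] ? $_[0] : 'unknown' };
    my $time = sub {
        defined $_[0]
            ? strftime('%Y-%m-%d %H:%M:%S GMT', gmtime $_[0])
            : 'unknown';
    };
    return join '',
        "Content type:    ", $show->($content_type), "\n",
        "Document length: ", $show->($doclen),       "\n",
        "Modified time:   ", $time->($modified),     "\n",
        "Expires:         ", $time->($expires),      "\n",
        "Server:          ", $show->($server),       "\n";
}

# With network access you would write:
#   print describe_head(head('http://www.braingia.org'));
# Made-up sample values are used here instead.
print describe_head('text/html', 1024, 1112802465, undef, 'Apache/1.3.26');
```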
The Mirror Function
The mirror() function works in much the same way as the getstore() function, but also includes a check to compare the modification time of the local file and the modification time of the remote resource, based on the If-Modified-Since request header. If the local copy is already up to date, nothing is downloaded and mirror() returns a 304 (Not Modified) status. Listing 5-3 shows an example of the mirror() function in action (Example3.pl):
Listing 5-3. Using the mirror() Function
#!/usr/bin/perl -w
use LWP::Simple;
use strict;
my $url = "http://www.braingia.org/";
my $file = "/tmp/braingiamirrorweb";
my $status = mirror($url, $file);
# mirror() returns 304 (Not Modified) when the local copy is already current
die "Cannot retrieve $url"
    unless is_success($status) || $status == 304;
This program won’t produce any output to the terminal unless there is an error. If it’s successful, there will be a file in /tmp called braingiamirrorweb. Inside that file will be raw output such as HTML and other bits as found on the web page. The contents will be similar to the following:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/
DTD/xhtml11.dtd">
<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/> <script type="text/javascript">
<!--
The user agent object also has a mirror() method, and the lwp-mirror program implements the mirror() function. Both of these are discussed in the “Mirroring a Web Site” section later in this chapter.
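When a script calls mirror() repeatedly (from cron, for instance), it helps to tell apart the three typical outcomes: a fresh download, an unchanged local copy, and a failure. The mirror_outcome() helper in this sketch is hypothetical, and the status codes are simulated so that it runs without network access.

```perl
use strict;
use warnings;
use LWP::Simple;    # exports mirror() and is_success()

# mirror_outcome() is a hypothetical helper that turns the status code
# returned by mirror() into a short message.
sub mirror_outcome {
    my ($status) = @_;
    return 'downloaded a fresh copy'       if is_success($status);
    return 'local copy is already current' if $status == 304;   # Not Modified
    return "failed with status $status";
}

# With network access you would write:
#   my $status = mirror($url, $file);
#   print mirror_outcome($status), "\n";
# Here the three typical outcomes are simulated instead.
print "$_: ", mirror_outcome($_), "\n" for (200, 304, 404);
```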
Getting More Functionality with LWP::UserAgent
The user agent plays a central role in web transactions. The user agent is roughly synonymous with the browser or client side of an HTTP request and response transaction. The LWP includes a UserAgent namespace, LWP::UserAgent, which implements many functions and has numerous attributes that you would find in a web browser.
The LWP::UserAgent class is frequently used to create a new browser object. This object can have a number of attributes set to define the behavior and operation of the resulting browser object. Table 5-2 summarizes the LWP::UserAgent attributes and their corresponding default values for the browser object.
Table 5-2. LWP::UserAgent Attributes
Attribute                Default Value
agent                    libwww-perl/NNNN (where NNNN is the version)
conn_cache               No default
cookie_jar               No default
from                     No default
keep_alive               No default
max_redirect             7
max_size                 No default
parse_head               1
protocols_allowed        No default
protocols_forbidden      No default
requests_redirectable    GET HEAD
timeout                  180
To set one or more of the attributes, pass them as key/value pairs to the new() call when creating the LWP::UserAgent object. Here’s an example:
use LWP;
my $browser = LWP::UserAgent->new(agent => 'Mozilla');
print "the browser agent is ", $browser->agent(), "\n";

These attributes can also be changed after the browser object has been created, as shown here:
use LWP;
my $browser = LWP::UserAgent->new();
$browser->agent("Mozilla");
print "the browser agent is ", $browser->agent(), "\n";
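The two approaches can be combined. The following sketch sets a couple of the attributes from Table 5-2 in the constructor, sets another one afterward, and then reads them all back; the e-mail address is a made-up placeholder. No request is sent, so it runs without network access.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Set several attributes from Table 5-2 when creating the object
my $browser = LWP::UserAgent->new(
    agent        => 'Perly v1',
    max_redirect => 3,
);

# Attributes can also be set after the object exists
$browser->from('webmaster@example.com');   # hypothetical address

# Called without arguments, each accessor returns the current value
printf "%-12s %s\n", $_, $browser->$_() for qw(agent from max_redirect);
```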
■Caution Some (poorly designed) web sites use the user agent value to prevent users of certain browsers from accessing the site. As you just saw, the user agent can be trivially changed by the user. Stick to the web standards set by organizations such as the W3C, and you won’t have to use stupid tricks such as these on sites that you design.
In the upcoming examples of using the LWP, you’ll see how many of the attributes for the user agent object are put into action.
Using the LWP
Now that you’ve seen some of the LWP components, this section looks at some common uses of the LWP. These include retrieving a web page, submitting a web form, handling cookies, handling password-protected sites, mirroring a web site, and handling proxies.
Retrieving a Web Page
The LWP makes the process of screen scraping rather trivial. Screen scraping refers to programmatically capturing the document being served in an HTTP request, through a means other than a standard web browser. A common goal of screen scraping is to look for certain text on the document and do something if that text is found. Listing 5-4 shows an example of how to do this (Get.pl).
Listing 5-4. Retrieving a Web Page with get()
#!/usr/bin/perl -w
use LWP::Simple;
use strict;
my $webpage = get("http://www.braingia.org/");
# get() returns undef on failure, so check before searching the text
if (defined $webpage && $webpage =~ /Steve/) {
    print "I found the text\n";
}

This example uses the get() function from LWP::Simple, which enables you to quickly and easily retrieve a web page using the GET method, as explained earlier in the chapter. The program will perform a GET against the web page at http://www.braingia.org/, and then search for some text within the page, including any HTML, scripts, or other material returned. If that text is found, the program will print a simple message to STDOUT indicating that it found the text, something like this:
I found the text
The choices for working with the resulting text from the get() function are limited only by what you would like to do with the results.
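As one illustration of working with the results, this sketch checks a page for several phrases at once, in the spirit of the “Now Available” monitor described at the start of the chapter. The find_phrases() helper and the sample HTML are hypothetical; in a real script, $webpage would come from get().

```perl
use strict;
use warnings;

# find_phrases() is a hypothetical helper: it returns the phrases that
# occur literally in the page text (no regular expression syntax).
sub find_phrases {
    my ($html, @phrases) = @_;
    return () unless defined $html;
    return grep { index($html, $_) >= 0 } @phrases;
}

# Made-up page content standing in for the result of get()
my $webpage = '<html><body><p>Gaming console: Now Available</p></body></html>';

my @found = find_phrases($webpage, 'Now Available', 'Sold Out');
print "Found: $_\n" for @found;
print "Nothing found\n" unless @found;
```

Using index() rather than a pattern match avoids surprises when a phrase contains characters that are special in regular expressions.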
Setting Additional Parameters
The get() function works well for simple GET method requests. However, some sites require you to set additional parameters, such as authentication, user agent, and other values. When you need to set these additional parameters, use the LWP::UserAgent class.
Consider the example in Listing 5-5 (Getua.pl), which performs a GET on a URL and also sets the agent parameter.
Listing 5-5. Setting a User Agent and Retrieving a Web Page
#!/usr/bin/perl -w
use LWP;
use strict;
# Create a browser object that identifies itself as "Perly v1"
my $browser = LWP::UserAgent->new(agent => 'Perly v1');
my $result = $browser->get("http://www.braingia.org/ewfojwefoj");
die "An error occurred: ", $result->status_line()
    unless $result->is_success;
# Do something more meaningful with the content than this!
print $result->content;
You may recognize this as the example I showed you at the beginning of this chapter. The program will report itself as “Perly v1” to the web server. You can use this to mimic any web browser or make up your own, as shown in the example. The output from this program is raw HTML and JavaScript, as shown previously.
■Note For more information about user agent strings, see the appropriately titled “User-Agent Strings” document at http://www.mozilla.org/build/revised-user-agent-strings.html.
Setting Timeouts
Sometimes, the web server is slow to respond, or other network-type issues cause the browser to time out. You can set the timeout of the browser to a value appropriate for your application.

Recall that the default is 180 seconds. You can set the timeout either when you create the browser object or at any time during its life. Assume you have a browser object called $browser. In this example, you set the timeout to 30 seconds, instead of the default 180:
$browser->timeout(30);
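For completeness, here is a small sketch of both ways to set the timeout: once in the constructor and once later on the existing object. The values 30 and 10 are arbitrary choices for illustration, and no request is sent.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Set the timeout when the browser object is created...
my $browser = LWP::UserAgent->new(timeout => 30);
print "Timeout at creation: ", $browser->timeout, " seconds\n";

# ...or change it at any point afterward
$browser->timeout(10);
print "Timeout now: ", $browser->timeout, " seconds\n";
```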
Controlling Browser Redirects
Browser objects created through the LWP::UserAgent class accept HTTP redirects for the GET and HEAD methods by default. You can change this behavior to accept redirects for other combinations of HTTP methods or disallow redirects entirely. The requests_redirectable attribute accepts a reference to a list of HTTP methods that can be redirected:
$browser->requests_redirectable(\@methods);
The list you pass replaces the existing one, so if you merely call the function with one method as an argument, you overwrite what’s already there. To accept redirects for the POST method (discussed in the “Submitting a Web Form” section a little later in this chapter) as well, add it to the existing list instead:
push @{$browser->requests_redirectable}, 'POST';
Realize that the requests_redirectable attribute already contains two values: GET and HEAD. Therefore, if you want to add a method to that list, you must use a function such as push (as in this example). If you don’t push a new value onto the list, you’ll be replacing what’s already there. This can cause no end of confusion.
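The difference between adding to the list and replacing it is easy to see by printing the list at each step. In this sketch the redirectable() helper is hypothetical; it simply joins the current list into a string. No request is sent.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# redirectable() is a hypothetical helper that shows the current list
sub redirectable {
    my ($ua) = @_;
    return join ' ', @{ $ua->requests_redirectable || [] };
}

my $browser = LWP::UserAgent->new;
print "Default:       ", redirectable($browser), "\n";

# Adding to the list keeps GET and HEAD
push @{ $browser->requests_redirectable }, 'POST';
print "After push:    ", redirectable($browser), "\n";

# Passing a new list replaces whatever was there
$browser->requests_redirectable(['POST']);
print "After replace: ", redirectable($browser), "\n";

# An empty list means no method is redirected
$browser->requests_redirectable([]);
print "No redirects:  ", redirectable($browser), "\n";
```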
Based on that note of caution, it’s sometimes helpful to see whether a given browser object will follow a redirect for a particular method. The browser object consults its redirect_ok() method when a redirect arrives, but that method expects request and response objects rather than a method name, so the simplest check is to look for the method in the requests_redirectable list. Consider this example:
if (grep { $_ eq 'GET' } @{ $browser->requests_redirectable || [] }) {
    print "The browser object would accept a redirect for GET\n";
}
Sending Additional Headers
In some cases, you may need to specify additional header lines as part of the request for a URL. In these instances, you can send them along with the request as key/value pairs. For example, a GET method using the get() function would normally look like this:
$browser->get($url);
To include additional headers, place them after the URL, as in this example:
$browser->get($url, Header => Value, Header => Value . . .)
A use for this might be to send the acceptable character set to the server:
$browser->get($url, 'Accept-Charset' => 'iso-8859-1');
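If you want to see which headers would go out before actually sending anything, one option is to build the request object yourself. This sketch uses the GET function from HTTP::Request::Common, which accepts the same URL and header pairs; the Accept-Language header is an extra added purely for illustration. The request is only printed, not sent.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

my $url = 'http://www.braingia.org/';

# Build (but do not send) a GET request with two additional headers
my $request = GET $url,
    'Accept-Charset'  => 'iso-8859-1',
    'Accept-Language' => 'en';            # illustrative extra header

print $request->as_string;

# To send it, hand the object to a browser object:
#   my $response = $browser->request($request);
```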
Cloning the Browser
If you already have a browser object set up in your program and configured as you like it, you can clone it to create another browser object. Assume that