
Advanced PHP Programming
248 Chapter 10 Data Component Caching
A key feature of both advisory locking methods is that they release any locks held by a process when it exits. This means that if a process holding a lock suffers an unexpected failure (for example, the Web server process running it incurs a segmentation fault), the lock held by that process is released, preventing a deadlock from occurring.
PHP opts for whole-file locking with its flock() function. Ironically, on most systems, this is actually implemented internally by using fcntl. Here is the caching example reworked to use file locking:
<?php
$file = $_SERVER['PHP_SELF'];
$cachefile = "$file.cache";
$lockfp = @fopen($cachefile, "a");
if(filesize($cachefile) && flock($lockfp, LOCK_SH | LOCK_NB)) {
    readfile($cachefile);
    flock($lockfp, LOCK_UN);
    exit;
}
else if(flock($lockfp, LOCK_EX | LOCK_NB)) {
    $cachefp = fopen($cachefile, "w");
    ob_start();
}
?>
<HTML>
<BODY>
<!-- Cacheable for a day -->
Today is <?= strftime("%A, %B %e %Y") ?>
</BODY>
</HTML>
<?php
if(isset($cachefp)) {
    $file = ob_get_contents();
    fwrite($cachefp, $file);
    fclose($cachefp);
    flock($lockfp, LOCK_SH | LOCK_NB);
    ob_end_flush();
}
fclose($lockfp);
?>
This example is a bit convoluted, but let’s look at what is happening.
First, you open the cache file in append (a) mode and acquire a nonblocking shared lock on it. Nonblocking (option LOCK_NB) means that the operation will return immediately if the lock cannot be taken. If you did not specify this option, the script would simply pause at that point until the lock became available. Shared (LOCK_SH) means that

you are willing to share the lock with other processes that also have the LOCK_SH lock. In contrast, an exclusive lock (LOCK_EX) means that no other locks, exclusive or shared, can be held simultaneously. Usually you use shared locks to provide access for readers because it is okay if multiple processes read the file at the same time. You use exclusive locks for writing because (unless extensive precautions are taken) it is unsafe for multiple processes to write to a file at once or for a process to read from a file while another is writing to it.
If the cache file has nonzero length and the lock succeeds, you know the cache file exists, so you call readfile to return its contents. You could also use include() on the file. That would cause any literal PHP code in the cache file to be executed. (readfile just dumps it to the output buffer.) Depending on what you are trying to do, this might or might not be desirable. You should play it safe here and call readfile.
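The difference is easy to demonstrate with a throwaway file (the path and contents here are made up for the illustration):

```php
<?php
// Hypothetical cache file that happens to contain a literal PHP tag
file_put_contents('/tmp/demo.cache', 'Hi <?php echo 2 + 2; ?>');

readfile('/tmp/demo.cache'); // dumps the raw bytes, PHP tag and all
include '/tmp/demo.cache';   // executes the embedded tag, printing "Hi 4"
```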
If you fail this check, you acquire an exclusive lock on the file. You use a nonblocking operation in case someone has beaten you to this point. If you acquire the lock, you can open the cache file for writing and start output buffering.
When you complete the request, you write the buffer to the cache file. If you somehow missed both the shared reader lock and the exclusive writer lock, you simply generate the page and quit.
Advisory file locks work well, but there are a few reasons to consider not using them:
- If your files reside on an NFS (the Unix Network File System) filesystem, flock is not guaranteed to work at all.
- Certain operating systems (Windows, for example) implement flock() on a process level, so multithreaded applications might not correctly lock between threads. (This is mainly a concern with the ISAPI SAPI (Server Abstraction API), the PHP SAPI for Microsoft's IIS Web server.)
- By acquiring a nonblocking lock, you cause any request to the page while the cache file is being written to do a full dynamic generation of the page. If the generation is expensive, a spike occurs in resource usage whenever the cache is refreshed. Acquiring a blocking lock can reduce the system load during regeneration, but it causes all pages to hang while the page is being generated.
- Writing directly to the cache file can result in partial cache files being created if an unforeseen event occurs (for example, if the process performing the write crashes or times out). Partial files are still served (the reader process has no way of knowing whether an unlocked cache file is complete), rendering the page corrupted.
- On paper, advisory locks are guaranteed to be released when the process holding them exits. In practice, many operating systems have had bugs that, under certain rare circumstances, could cause locks not to be released on process death. Many of the PHP SAPIs (including mod_php, the traditional way of running PHP on Apache) are not single-request execution architectures. This means that if you leave a lock lying around at request shutdown, the lock will continue to exist until the process running that script exits, which may be hours or days later. This could result in an interminable deadlock. I've never experienced one of these bugs personally; your mileage may vary.
File swaps work by taking advantage of a nuance mentioned earlier in this chapter. When you use unlink() on a file, what really happens is that the filename-to-inode mapping is removed. The filename no longer exists, but the storage associated with it remains unchanged (for the moment), and it still has the same inode associated with it. In fact, the operating system does not reallocate that space until all open file handles on that inode are closed. This means that any processes that are reading from that file while it is unlinked are not interrupted; they simply continue to read from the old file data. When the last of the processes holding an open descriptor on that inode closes, the space allocated for that inode is released back for reuse.
After the file is removed, you can reopen a new file with the same name. Even though the name is identical, the operating system does not connect this new file with the old inode, and it allocates new storage for the file. Thus you have all the elements necessary to preserve integrity while updating the file.
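This behavior is easy to verify with a few lines of code (the path is a throwaway; this assumes a POSIX filesystem, not Windows):

```php
<?php
file_put_contents('/tmp/inode_demo', 'old contents');

$reader = fopen('/tmp/inode_demo', 'r'); // hold a descriptor on the old inode
unlink('/tmp/inode_demo');               // the name is gone; the inode lives on

file_put_contents('/tmp/inode_demo', 'new contents'); // same name, new inode

echo fread($reader, 12); // still reads "old contents"
fclose($reader);         // last descriptor closes; the old storage is freed
```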
Converting the locking example to a swapping implementation is simple:
<?php
$cachefile = "{$_SERVER['PHP_SELF']}.cache";
if(file_exists($cachefile)) {
    include($cachefile);
    return;
}
else {
    $cachefile_tmp = $cachefile.".".getmypid();
    $cachefp = fopen($cachefile_tmp, "w");
    ob_start();
}
?>
<HTML>
<BODY>
<!-- Cacheable for a day -->
Today is <?= strftime("%A, %B %e %Y") ?>
</BODY>
</HTML>
<?php
if(isset($cachefp)) {
    $file = ob_get_contents();
    fwrite($cachefp, $file);
    fclose($cachefp);
    rename($cachefile_tmp, $cachefile);
    ob_end_flush();
}
?>
Because you are never writing directly to the cache file, you know that if it exists, it must be complete, so you can unconditionally include it. If the cache file does not exist, you need to create it yourself. You open a temporary file that has the process ID of the process appended to its name:
$cachefile_tmp = $cachefile.".".getmypid();
Only one process can have a given process ID at any one time, so you are guaranteed that the filename is unique. (If you are doing this over NFS or on another networked filesystem, you have to take some additional steps; you'll learn more about that later in this chapter.) You open your private temporary file and turn output buffering on. Then you generate the entire page, write the contents of the output buffer to your temporary cache file, and rename the temporary cache file as the "true" cache file. If more than one process does this simultaneously, the last one wins, which is fine in this case.
You should always make sure that your temporary cache file is on the same filesystem as the ultimate cache target. The rename() function performs atomic moves when the source and destination are on the same filesystem, meaning that the operation is instantaneous. No copy occurs; the directory entry for the destination is simply updated with the inode of the source file. This results in rename() being a single operation in the kernel. In contrast, when you use rename() on a file across filesystems, the system must actually copy the entire file from one location to the other. You've already seen why copying cache files is a dangerous business.
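That rule can be captured in a small helper (the function name and paths are illustrative, not part of the book's code):

```php
<?php
// Build the temporary file in the same directory as the target so that
// rename() is guaranteed to be an atomic, same-filesystem operation.
function atomic_write($target, $contents)
{
    $tmp = dirname($target) . '/.' . basename($target) . '.' . getmypid();
    file_put_contents($tmp, $contents);
    rename($tmp, $target);
}

atomic_write('/tmp/page.cache', 'rendered page contents');
```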
These are the benefits of using this methodology:
- The code is much shorter and incurs fewer system calls (and thus is generally faster).
- Because you never modify the true cache file directly, you eliminate the possibility of writing a partial or corrupted cache file.
- It works on network filesystems (with a bit of finessing).
The major drawback of this method is that you still have resource usage peaks while the cache file is being rewritten. (If the cache file is missing, everyone requesting it dynamically generates content until someone has created a fresh cached copy.) There are some tricks for getting around this, though, and we will examine them later in this chapter.
DBM-Based Caching
A frequently overlooked storage medium is DBM files. Often relegated to the role of a "poor man's database," DBM files are extremely fast and are designed to provide high-speed, concurrent read/write access to local data. DBM file caches excel over flat-file caches in that they are designed to have multiple data sources stored in them (whereas flat-file caches work best with a single piece of data per file), and they are

designed to support concurrent updates (whereas you have to build concurrency into a flat-file cache yourself).
Using DBM files is a good solution when you need to store specific data as key/value pairs (for example, a database query cache). In contrast with the other methods described in this chapter, DBM files work as a key/value cache out-of-the-box.
In PHP the dba (DBM database abstraction) extension provides a universal interface to a multitude of DBM libraries, including the following:
- dbm—The original Berkeley DB file driver
- ndbm—Once a cutting-edge replacement for dbm, now largely abandoned
- gdbm—The GNU dbm replacement
- Sleepycat DB versions 2–4—Not IBM's DB2, but an evolution of dbm brought about by the nice folks at Berkeley
- cdb—A constant database library (nonupdatable) by djb of Qmail fame
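Whichever back end you pick, the dba calls look the same. Here is a minimal round trip (the file path and the choice of the gdbm handler are assumptions; use whatever handler your PHP build supports):

```php
<?php
$db = dba_open('/tmp/demo.dbm', 'c', 'gdbm'); // open, creating if missing
dba_replace('promo:17', '42', $db);           // keys and values are strings
if (($value = dba_fetch('promo:17', $db)) !== false) {
    echo $value;                              // prints 42
}
dba_close($db);
```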
Licenses
Along with the feature set differences between these libraries, there are license differences as well. The original dbm and ndbm are BSD licensed, gdbm is licensed under the GNU General Public License (GPL), and the Sleepycat libraries have an even more restrictive GPL-style license.
License differences may not mean much to you if you are developing as a hobby, but if you are building a commercial software application, you need to be certain you understand the ramifications of the licensing on the software you use. For example, if you link against a library under the GPL, you need to make the source code of your application available to anyone you sell the application to. If you link against Sleepycat's DB4 dbm in a commercial application, you need to purchase a license from Sleepycat.
You might use a DBM file to cache some data. Say you are writing a reporting interface to track promotional offers. Each offer has a unique ID, and you have written this function:
int showConversion(int promotionID)
which finds the number of distinct users who have signed up for a given promotion. On the back end, showConversion might look like this:
function showConversion($promotionID) {
    $db = new DB_MySQL_Test;
    $row = $db->execute("SELECT count(distinct(userid)) cnt FROM promotions
                         WHERE promotionid = $promotionID")->fetch_assoc();
    return $row['cnt'];
}
This query is not blindingly fast, especially with the marketing folks reloading it constantly, so you would like to apply some caching.

To add caching straight to the function, you just need to open a DBM file and preferentially fetch the result from there if it exists:
function showConversion($promotionID) {
    $gdbm = dba_popen("promotionCounter.dbm", "c", "gdbm");
    if(($count = dba_fetch($promotionID, $gdbm)) !== false) {
        return $count;
    }
    $db = new DB_MySQL_Test;
    $row = $db->execute("SELECT count(distinct(userid)) cnt FROM promotions
                         WHERE promotionid = $promotionID")->fetch_assoc();
    dba_replace($promotionID, $row['cnt'], $gdbm);
    return $row['cnt'];
}
Cache Concurrency and Coherency
A nice feature of DBM files is that concurrency support is built into them. The exact locking method is internal to the specific back end being used (or at least is not exposed to the user from PHP), but safe concurrent access is guaranteed.
Cache Invalidation and Management
If you are an astute reader, you probably noticed the serious flaw in the caching scheme discussed earlier in this chapter, in the section "DBM-Based Caching": you have no method to invalidate the cache! The counts that you've cached will never update. While this certainly makes the results return quickly, it also renders the results useless. A good caching system strives to make its impact transparent—or at least barely noticeable.
Unlike in the flat-file implementations discussed earlier in this chapter, the difficulty here is not how to update the files—the dba_replace and dba_insert functions take care of all the work for you. The issue is how to know that you should update them at all. DBM files do not carry modification times on individual rows, so how do you know whether the value available is from one second ago or one week ago?
Probably the cleverest approach I have seen to this problem is the probabilistic approach. You look at the frequency at which the data is requested and figure out how many requests you get, on average, before you should invalidate the cache. For example, if you receive 10 requests per second to the page where the data is displayed and you would like to cache the data for 5 minutes, you should flush the data according to the following formula:
5 minutes × (60 seconds/minute) × (10 requests/second) = 3,000 requests
Sharing a global access count between all processes is impractical. It would require storing access time information for every row in the DBM file.That is not only complicated, but it’s slow as well, as it means you have to write to the DBM file (to record the

time) on every read call. Instead, you can take the probabilistic approach: rather than updating exactly on the 3,000th request, you assign a 1/3,000 probability of updating on any given request, and probabilistically you end up with the same number of refreshes over a long period of time.
Here is a reimplementation of showConversion() that implements probabilistic removal:
function showConversion($promotionID) {
    $gdbm = dba_popen("promotionCounter.dbm", "c", "gdbm");
    // unless this is our 1-in-3000 chance, try the cache first;
    // otherwise skip the lookup and simply reinsert the value
    if(rand(1, 3000) > 1) {
        if(($count = dba_fetch($promotionID, $gdbm)) !== false) {
            return $count;
        }
    }
    $db = new DB_MySQL_Test;
    $row = $db->execute("SELECT count(distinct(userid)) cnt FROM promotions
                         WHERE promotionid = $promotionID")->fetch_assoc();
    dba_replace($promotionID, $row['cnt'], $gdbm);
    return $row['cnt'];
}
The beauty of this method is its simplicity. You cache only the data you are really interested in, and you let mathematics handle all the tough choices. The downside of this method is that it requires you to really know the access frequency of an application; making poor choices for the removal probability can result in values staying in the cache much longer than they should. This is especially true if there are lulls in traffic, which break the mathematical model. Still, it is an interesting example of thinking out-of-the-box, and it may be a valid choice if the access patterns for your data are particularly stable or as an enhancement to a deterministic process.
To implement expiration in the cache, you can wrap all the calls to it with a class that adds modification times to all the entries and performs internal expiration:
<?php
class Cache_DBM {
    private $dbm;
    private $expiration;
    function __construct($filename, $expiration = 3600) {
        $this->dbm = dba_popen($filename, "c", "ndbm");
        $this->expiration = $expiration;
    }
    function put($name, $tostore) {
        $storageobj = array('object' => $tostore, 'time' => time());
        dba_replace($name, serialize($storageobj), $this->dbm);
    }
    function get($name) {
        $getobj = unserialize(dba_fetch($name, $this->dbm));
        if(time() - $getobj['time'] < $this->expiration) {
            return $getobj['object'];
        }
        else {
            dba_delete($name, $this->dbm);
            return false;
        }
    }
    function delete($name) {
        return dba_delete($name, $this->dbm);
    }
}
?>
You would use this class by constructing a new cache object:
<?php
require_once 'Cache/DBM.inc';
$cache = new Cache_DBM("/path/to/cachedb");
?>
This cache object calls dba_popen to open the cache DBM file (and to create it if it does not exist). The cache object also sets the expiration time to the default of 3,600 seconds (1 hour). If you wanted a different time, say 1 day, you could specify the expiration as well:
$cache = new Cache_DBM("/path/to/cachedb", 86400);
Cache storage and lookups are performed by a keyname, which you need to provide. For example, to store and then reinstantiate a Foo object, you would use this:
$foo = new Foo();
//store it
$cache->put('foo', $foo);
In the library, this creates an array that contains $foo as well as the current time and serializes it. This serialization is then stored in the cache DBM with the key foo. You have to serialize the object because a DBM file can only store strings. (Actually, it can store arbitrary contiguous binary structures, but PHP sees them as strings.) If there is existing data under the key foo, it is replaced. Some DBM drivers (DB4, for example) can support multiple data values for a given key, but PHP does not yet support this.
To get a previously stored value, you use the get() method to look up the data by key:
$obj = $cache->get('foo');

get() is a bit more complicated internally. To get back a stored object, it must first be looked up by key. Then it is deserialized into its container, and the insert time is compared against the expiration time specified in the cache constructor to see whether it is stale. If it passes the expiration check, it is returned to the user; otherwise, it is deleted from the cache.
When using this class in the real world, you perform a get() first to see whether a valid copy of the data is in the cache, and if it is not, you use put():
<?php
class Foo {
    public function id() {
        return "I am a Foo";
    }
}
require_once 'Cache/DBM.inc';
$dbm = new Cache_DBM("/data/cachefiles/generic");
if($obj = $dbm->get("foo")) {
    // Cache hit; $obj is what we were looking for
    print $obj->id();
}
else {
    // Cache miss; create a new $obj and insert it into the cache
    $obj = new Foo();
    $dbm->put("foo", $obj);
    print $obj->id();
}
// ... use $obj however we want
?>
The following are some things to note about the wrapper class:
- Any sort of data structure (for example, an object, an array, or a string) can be handled automatically, with the exception of resources; resources cannot be effectively shared between processes anyway, so this is no real loss.
- You can perform a put() to recache an object at any time. This is useful if you take an action that you know invalidates the cached value.
- Keynames are not autodetermined, so you must know that foo refers to the Foo object you are interested in. This works well enough for singletons (where this naming scheme makes perfect sense), but for anything more complicated, a naming convention needs to be devised.
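One possible convention (purely illustrative; this helper is not part of the Cache_DBM class) is to derive keys from a class name plus an identifier:

```php
<?php
// Hypothetical key-naming helper: "Promotion:17" and "User:17" can
// never collide, even though they share the same numeric ID.
function cache_key($class, $id)
{
    return $class . ':' . $id;
}

echo cache_key('Promotion', 17); // prints "Promotion:17"
```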
With the exception of cdb, DBM implementations dynamically extend their backing storage to handle new data. This means that if left to its own devices, a DBM cache will function as long as the filesystem it is on has free space. The DBM library does not track access statistics, so without wrapping the library to provide this functionality, you can't do intelligent cache management.

One idiosyncrasy of DBM files is that they do not shrink. Space is reused inside the file, but the actual size of the file can only grow, never shrink. If the cache sees a lot of activity (such as frequent inserts and significant turnover of information), some form of cache maintenance is necessary. As with file-based caches, for many applications the lowest-maintenance option is simply to remove and re-create the DBM files periodically.
If you do not want to take measures that draconian, you can add a garbage-collection method to Cache_DBM:
function garbageCollection() {
    $keys = array();
    $cursor = dba_firstkey($this->dbm);
    while($cursor) {
        $keys[] = $cursor;
        $cursor = dba_nextkey($this->dbm);
    }
    foreach($keys as $key) {
        $this->get($key);
    }
}
You use a cursor to walk through the keys of the cache, store them, and then call get() on each key in succession. As shown earlier in this section, get() removes an entry if it has expired, and you simply ignore its return value otherwise. This method may seem longer than necessary; putting the call to get() inside the first while loop would make the code more readable and eliminate an entire loop. Unfortunately, most DBM implementations do not correctly handle keys being removed from under them while looping through the keys. Therefore, you need this two-step process to ensure that you visit all the entries in the cache.
Garbage collection such as this is not cheap, and it should not be done more frequently than is needed. I have seen implementations where the garbage collector was called at the end of every page request, to ensure that the cache was kept tight. This can quickly become a serious bottleneck in the system. A much better solution is to run the garbage collector as part of a scheduled job from cron. This keeps the impact negligible.
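Such a job can be as small as a standalone CLI script (the file paths here are assumptions) that simply instantiates the cache and runs the collector:

```php
#!/usr/bin/env php
<?php
// cache_gc.php - run from cron, e.g. hourly:
//   0 * * * * /usr/bin/php /path/to/cache_gc.php
require_once 'Cache/DBM.inc';

$cache = new Cache_DBM('/data/cachefiles/generic');
$cache->garbageCollection();
```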
Shared Memory Caching
Sharing memory space between processes in Unix is done either with the BSD methodology or the System V methodology. The BSD methodology uses the mmap() system call to allow separate processes to map the same memory segment into their own address spaces. The PHP semaphore and shmop extensions provide two alternative interfaces to System V shared memory and semaphores.
The System V interprocess communication (IPC) implementation is designed to provide an entire IPC facility. Three facilities are provided: shared memory segments, semaphores, and message queues. For caching data, in this section you use two of the three System V IPC capabilities: shared memory and semaphores. Shared memory provides the cache storage, and semaphores allow you to implement locking on the cache.
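As a preview of the raw interface, here is a minimal shmop round trip (the ftok() project ID and the 1KB segment size are arbitrary choices for this sketch):

```php
<?php
$key = ftok(__FILE__, 'c');               // derive a System V IPC key
$shm = shmop_open($key, 'c', 0644, 1024); // create (or attach to) a 1KB segment

shmop_write($shm, 'cached-value', 0);     // write at offset 0
echo shmop_read($shm, 0, 12);             // prints "cached-value"

shmop_delete($shm);                       // mark the segment for removal
shmop_close($shm);
```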