Burgess M.Principles of network and system administration.2004.pdf

Chapter 8

Diagnostics, fault and change management

All complex systems behave unexpectedly some of the time. They fail to operate within the limits set by policy, and the reason can be understood by comparing information content: policy is generally a set of simple, high-level rules that cannot capture the same level of detail as the true human–computer interaction in its real environment. One must therefore expect failure and plan for it. If a failure occurs in a manner that was expected, its effects can be controlled and mitigated.

Principle 43 (Predictable failure). Systems should fail predictably so that they can be recovered quickly. Predictability is encouraged by adopting standardized (or well-understood) protocols and procedures for quality assurance in design and maintenance.

This chapter is about learning what to expect of a non-deterministic system: how to understand its flaws, and how to insure oneself against the unexpected.

8.1 Fault tolerance and propagation

How do errors penetrate a system? Faults travel from part to part as if in a network of interconnections. If errors can propagate freely, then a small error in one part of a system can have consequences for another part. By studying different kinds of network, we can learn about the likelihood of error propagation.

Networks come in a variety of forms. Figure 8.1 shows the progression from a highly ordered, centralized structure to a decentralized form, to a generalized mesh. This classification was originally discussed by Paul Baran of the RAND Corporation in 1964 as part of a project to develop a communications system that would be robust to failure in the case of a nuclear attack [36, 25].

The same argument applies to the propagation of errors through any set of interconnected parts.
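As a minimal sketch of this idea, the following Python treats a system as a directed graph in which a link from A to B means that a fault in A can reach B. The part names and the `reachable_from` helper are hypothetical, chosen only for illustration. The set of parts that a fault can reach is found by a breadth-first search.

```python
from collections import deque

def reachable_from(graph, start):
    """Return the set of parts that a fault starting at `start` can reach."""
    seen = {start}
    queue = deque([start])
    while queue:
        part = queue.popleft()
        for neighbor in graph.get(part, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Hypothetical dependency links: A -> B means a fault in A can affect B.
links = {
    "disk": ["database"],
    "database": ["web", "mail"],
    "web": [],
    "mail": [],
    "printer": [],
}

affected = reachable_from(links, "disk")
print(sorted(affected))  # ['database', 'disk', 'mail', 'web']
```

In this toy example the printer has no links, so a fault there stays where it started, whereas a disk fault reaches every part that depends on the disk, directly or indirectly.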

Figure 8.1: Network topologies: (a) centralized, (b) decentralized or hierarchical, and (c) distributed mesh.

Many complex systems exhibit a surprising degree of tolerance against errors. This is because they have built-in redundancy. Certain types of network also have this property in the routes between the nodes. If we think of networks not so much as communication lines between computers, but as abstract links between the interdependent parts of a whole, then the idea of a network is of more general importance than as a means of communication between computers and humans. Networks are webs of influence. If a system is tolerant to faults and security breaches, then we can look at it in one of two complementary ways:

The access network that allows problems to propagate is poorly connected; i.e. connections (security breaches) between nodes (resources) are absent.

The resource network is well connected and is resilient to removal of nodes (resources) and connections (supply channels).

The first of these viewpoints is useful for modeling penetration of a system by faults or intruders, while the second is useful for securing a system against lack of access to critical resources.

A tolerant network is robust to node removal and connection removal. Node removal is usually more serious (see figure 8.2).
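One way to see why node removal tends to be more serious is that removing a node also removes every connection attached to it. The sketch below is an illustration only: the five-node centralized network and the `largest_component` helper are hypothetical. It compares the size of the largest connected group after cutting one connection with its size after removing the central node.

```python
def largest_component(nodes, edges):
    """Size of the largest connected group in an undirected graph."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        # Links that touch a removed node are ignored.
        if a in neighbors and b in neighbors:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best

# Hypothetical centralized network: one hub with four leaves.
nodes = ["hub", "a", "b", "c", "d"]
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]

intact = largest_component(nodes, edges)
one_link_cut = largest_component(nodes, edges[1:])
hub_removed = largest_component([n for n in nodes if n != "hub"], edges)
print(intact, one_link_cut, hub_removed)  # 5 4 1
```

Cutting one connection isolates a single leaf, while removing the hub leaves every remaining node isolated.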

Figure 8.2: Network tolerance to node removal: nodes are more important than connectors.

One type of network of special importance is the random network. A random network is formed by making random connections between nodes within a set. Randomness is a good strategy for covering a large number of possibilities without making exhaustive use of resources. In the absence of precise knowledge about a system, random ‘shots in the dark’ are an efficient way of hitting an unpredictable or moving target, such as a random fault [263]. Conversely, random links lead to a high probability of connecting together all of the elements in a set of nodes, provided their density is sufficient [6]. While this double dose of unpredictability sounds like an unlikely combination for success, it works and leads to highly robust networks.
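The dependence of connectivity on link density can be explored with a small experiment. The sketch below assumes the simplest random-graph construction, in which every pair of nodes is linked independently with probability `p`. The function names, the node count and the list of probabilities are illustrative choices, not taken from the text.

```python
import random

def random_graph(n, p, rng):
    """Link each pair of n nodes independently with probability p."""
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors

def largest_component_fraction(neighbors):
    """Fraction of nodes that belong to the largest connected group."""
    seen, best = set(), 0
    for start in neighbors:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best / len(neighbors)

rng = random.Random(0)  # fixed seed so that repeated runs build the same graphs
n = 200
for p in (0.0, 0.002, 0.005, 0.01, 0.05, 1.0):
    frac = largest_component_fraction(random_graph(n, p, rng))
    print(f"p={p:<6} largest connected fraction={frac:.2f}")
```

With no links, every node is its own group; with all links present, the whole set is connected. In between, the largest group typically grows sharply once the average number of links per node passes about one, which is one way to read the requirement that the density be sufficient.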

Humans do not build technology at random so, apart from their robustness to failure, why should random networks be of interest to the study of human–computer systems? The answer lies in so-called small-world networks, which approximate random ones.

8.2 Networks and small worlds

You have probably heard the maxim that no two individuals in the world are more than six degrees of separation from one another. In other words, I know someone, who knows someone, who knows someone, ... who knows you. In the world of system administration, the degree of separation is probably much less than six; but on average, the value is around six for arbitrary people on the planet. How could this possibly be?

This strange, almost counter-intuitive, idea is not a freak coincidence of human social structures; it is a property of a kind of network known as a small-world network [319].

Definition 6 (Small-world network). There is a class of highly clustered graphs that behave like random graphs. These are called small-world networks.

Small-world networks have local clustering, i.e. they have a centralized structure at the level of small groups, but this is not the reason they are called small-world networks. The ‘small world’ phenomenon is rather the opposite of this, namely that someone in a small cluster will be closely connected to someone in a rather distant cluster. The reason for this is the existence of a few long-range or ‘weak’ links. In a small-world network, these weak links play a vital role in connecting distant parts of the network together. If we add a sufficient number of random, long-distance links something magical happens: suddenly a group of small clusters starts to achieve the connectivity of a random network (figure 8.3 and 8.4).

Figure 8.3: A network is built up by adding connections between neighbors. As more distant neighbors become connected, small, local clusters become connected over longer distances.
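The effect of a few long-range links can be illustrated numerically. The sketch below is a simplified construction, not the one used in the cited literature: it starts from a ring in which each node is linked only to its nearest neighbors, then adds a handful of random shortcuts rather than rewiring existing links. All function names and parameter values are assumptions made for illustration.

```python
import random
from collections import deque

def ring_lattice(n, k):
    """Link each node to its k nearest neighbors on each side of a ring."""
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k + 1):
            j = (i + step) % n
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors

def add_shortcuts(neighbors, count, rng):
    """Add `count` random links between pairs of nodes not yet linked."""
    nodes = list(neighbors)
    added = 0
    while added < count:
        a, b = rng.sample(nodes, 2)
        if b not in neighbors[a]:
            neighbors[a].add(b)
            neighbors[b].add(a)
            added += 1

def average_path_length(neighbors):
    """Mean number of hops between pairs of nodes (connected graph assumed)."""
    total, pairs = 0, 0
    for source in neighbors:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

rng = random.Random(1)  # fixed seed for repeatable shortcuts
graph = ring_lattice(100, 2)
before = average_path_length(graph)
add_shortcuts(graph, 10, rng)
after = average_path_length(graph)
print(f"clustered ring: {before:.2f} hops, with 10 shortcuts: {after:.2f} hops")
```

The ring keeps its local clusters after the shortcuts are added, yet the average separation between nodes falls, because a shortcut lets a path jump across the ring instead of stepping from neighbor to neighbor.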

The study of networks reveals that networks with the small-world property often also exhibit so-called ‘scale-free’ behavior: over a wide range of scales, they appear to have the same properties regardless of how one zooms in or out and looks at different regions of the network.

Scale-free networks form spontaneously when there is some form of preferential attachment, i.e. when the likelihood of a new connection depends on the connections that already exist. This is observed, for instance, in the links to sites on the World Wide Web, and is exploited by search engines like Google to rank the importance of sites. When someone sees that a site is well connected, they tend to refer to it themselves, thus making it even better connected. This ‘rich get richer’ phenomenon leads to a form of connectivity that is not deliberately ordered, but which nevertheless exhibits a form of ordering.
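Preferential attachment is straightforward to simulate. The following sketch grows a network one node at a time, in the spirit of the Barabási–Albert growth model (a model the text does not name, used here as an assumption). Each newcomer makes a single link to an existing node chosen with probability proportional to that node's current number of links. The function name and the network size are illustrative.

```python
import random
from collections import Counter

def preferential_attachment(n, rng):
    """Grow a network to n nodes; each new node links to one existing node,
    chosen with probability proportional to that node's current degree."""
    edges = [(0, 1)]
    # Each node appears in `endpoints` once per link it has, so a uniform
    # draw from this list is a degree-weighted draw over the nodes.
    endpoints = [0, 1]
    for new in range(2, n):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges

rng = random.Random(0)  # fixed seed for a repeatable run
edges = preferential_attachment(1000, rng)

degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

histogram = Counter(degree.values())
print("largest degree:", max(degree.values()))
print("nodes with a single link:", histogram[1])
```

In runs of this kind, most nodes typically end up with only one or two links while a few early nodes accumulate many, which is the ‘rich get richer’ pattern described above.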

Why should this phenomenon be of more than passing interest to network and system administration? The small-world phenomenon is a sociological phenomenon, but it is mimicked in the deployment of technology. Studies of real networks, both of humans and of computers, reveal that the small-world property crops up in a wide range of circumstances in computer technologies. The effect that humans have on the systems we create is not negligible; it has some important consequences, because of the scale-free nature of the graphs. In particular, the lack of actual randomness leads to so-called ‘heavy-tailed’ distributions in the properties that are associated with small-world configurations. This is leading technology designers to re-examine their strategies for handling traffic flow around networks.