Sunday, 12 June 2011

Consistency and replication

An important issue in distributed systems is the replication of data. Data are generally replicated to enhance reliability or improve performance. One of the major problems is keeping replicas consistent. Informally, this means that when one copy is updated, we need to ensure that the other copies are updated as well; otherwise the replicas will no longer be the same. In this chapter, we take a detailed look at what consistency of replicated data actually means and the various ways that consistency can be achieved.
We start with a general introduction discussing why replication is useful and how it relates to scalability. We then continue by focusing on what consistency actually means. An important class of what are known as consistency models assumes that multiple processes simultaneously access shared data. Consistency for these situations can be formulated with respect to what processes can expect when reading and updating the shared data, knowing that others are accessing that data as well.
Consistency models for shared data are often hard to implement efficiently in large-scale distributed systems. Moreover, in many cases simpler models can be used, which are also often easier to implement. One specific class is formed by client-centric consistency models, which concentrate on consistency from the perspective of a single (possibly mobile) client. Client-centric consistency models are discussed in a separate section.
Consistency is only half of the story. We also need to consider how consistency is actually implemented. There are essentially two, more or less independent, issues we need to consider. First of all, we start with concentrating on managing replicas, which takes into account not only the placement of replica servers, but also how content is distributed to these servers.
The second issue is how replicas are kept consistent. In most cases, applications require a strong form of consistency. Informally, this means that updates are to be propagated more or less immediately between replicas. There are various alternatives for implementing strong consistency, which are discussed in a separate section. Also, attention is paid to caching protocols, which form a special case of consistency protocols.
In this section, we start by discussing the important reasons for wanting to replicate data in the first place. We concentrate on replication as a technique for achieving scalability, and motivate why reasoning about consistency is so important.
7.1.1 Reasons for Replication
There are two primary reasons for replicating data: reliability and performance. First, data are replicated to increase the reliability of a system. If a file system has been replicated, it may be possible to continue working after one replica crashes by simply switching to one of the other replicas. Also, by maintaining multiple copies, it becomes possible to provide better protection against corrupted data. For example, imagine there are three copies of a file and every read and write operation is performed on each copy. We can safeguard ourselves against a single, failing write operation by considering the value that is returned by at least two copies as being the correct one.
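The three-copy example can be sketched as a majority vote over the values read from each replica. This is only an illustration under simple assumptions: the function name `majority_read` is hypothetical, replicas are modelled as plain values in a list, and a read with no strict majority is treated as an error.

```python
from collections import Counter


def majority_read(values):
    """Return the value reported by a strict majority of replicas.

    `values` holds one entry per replica. Raises ValueError when the list
    is empty or when no value is held by more than half of the replicas.
    """
    if not values:
        raise ValueError("no replicas to read from")
    value, count = Counter(values).most_common(1)[0]
    if count * 2 <= len(values):
        raise ValueError("no majority among replicas")
    return value


# Hypothetical state: the third copy missed the latest write.
replica_values = ["v2", "v2", "v1"]
print(majority_read(replica_values))
```

With three copies, one failed write still leaves two copies in agreement, so the vote returns the correct value. Two simultaneous failures could defeat it.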
The other reason for replicating data is performance. Replication for performance is important when the distributed system needs to scale in numbers and geographical area. Scaling in numbers occurs, for example, when an increasing number of processes needs to access data that are managed by a single server. In that case, performance can be improved by replicating the server and subsequently dividing the work.
Scaling with respect to the size of a geographical area may also require replication. The basic idea is that by placing a copy of data in the proximity of the process using them, the time to access the data decreases. As a consequence, the performance as perceived by that process increases. This example also illustrates that the benefits of replication for performance may be hard to evaluate. Although a client process may perceive better performance, it may also be the case that more network bandwidth is now consumed keeping all replicas up to date.
If replication helps to improve reliability and performance, who could be against it? Unfortunately, there is a price to be paid when data are replicated. The problem with replication is that having multiple copies may lead to consistency problems. Whenever a copy is modified, that copy becomes different from the rest. Consequently, modifications have to be carried out on all copies to ensure consistency. Exactly when and how those modifications need to be carried out determines the price of replication.
To understand the problem, consider improving access times to Web pages. If no special measures are taken, fetching a page from a remote Web server may sometimes even take seconds to complete. To improve performance, Web browsers often locally store a copy of a previously fetched Web page (i.e., they cache a Web page). If a user requires that page again, the browser automatically returns the local copy. The access time as perceived by the user is excellent. However, if the user always wants to have the latest version of a page, he may be in for bad luck. The problem is that if the page has been modified in the meantime, modifications will not have been propagated to cached copies, making those copies out-of-date.
One solution to the problem of returning a stale copy to the user is to forbid the browser to keep local copies in the first place, effectively letting the server be fully in charge of replication. However, this solution may still lead to poor access times if no replica is placed near the user. Another solution is to let the Web server invalidate or update each cached copy. But this requires that the server keep track of all caches and send them messages. This, in turn, may degrade the overall performance of the server. We return to performance versus scalability issues below.
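The second solution, server-driven invalidation, can be sketched as follows. This is a toy in-memory model, not how any real browser or Web server works: the class names `OriginServer` and `BrowserCache`, the `messages_sent` counter, and the `/index` page are all hypothetical. It illustrates why the server has to know every cache and pay one message per cache on each update.

```python
class OriginServer:
    """Toy server that tracks every cache and invalidates on update."""

    def __init__(self):
        self.pages = {}
        self.caches = []        # the server must remember every cache
        self.messages_sent = 0  # invalidation messages sent so far

    def register(self, cache):
        self.caches.append(cache)

    def fetch(self, url):
        return self.pages[url]

    def update(self, url, content):
        self.pages[url] = content
        # One invalidation message per registered cache, per update.
        for cache in self.caches:
            cache.invalidate(url)
            self.messages_sent += 1


class BrowserCache:
    """Toy browser cache that serves local copies until invalidated."""

    def __init__(self, server):
        self.server = server
        self.local = {}
        server.register(self)

    def get(self, url):
        if url not in self.local:
            self.local[url] = self.server.fetch(url)
        return self.local[url]

    def invalidate(self, url):
        self.local.pop(url, None)


server = OriginServer()
server.pages["/index"] = "v1"
browser_a = BrowserCache(server)
browser_b = BrowserCache(server)
browser_a.get("/index")          # cached locally as "v1"
server.update("/index", "v2")    # invalidates both caches
print(browser_a.get("/index"), server.messages_sent)
```

In this sketch the cost of an update grows linearly with the number of registered caches, which is the server-side overhead the paragraph above warns about.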

7.1.2 Replication as Scaling Technique
Replication and caching for performance are widely applied as scaling techniques. Scalability issues generally appear in the form of performance problems. Placing copies of data close to the processes using them can improve performance through reduction of access time and thus solve scalability problems.
A possible trade-off that needs to be made is that keeping copies up to date may require more network bandwidth. Consider a process P that accesses a local replica N times per second, whereas the replica itself is updated M times per second. Assume that an update completely refreshes the previous version of the local replica. If N << M, that is, the access-to-update ratio is very low, we have the situation where many updated versions of the local replica will never be accessed by P, rendering the network communication for those versions useless. In this case, it may have been better not to install a local replica close to P, or to apply a different strategy for updating the replica. We return to these issues below.
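A rough back-of-the-envelope version of this argument: if each update overwrites the previous version, then in any one second P can read at most N distinct versions out of the M that were pushed. The helper below is a hypothetical sketch of that bound. It assumes the best case where every read sees a different version, so the real waste can only be higher.

```python
def wasted_update_fraction(reads_per_sec, updates_per_sec):
    """Lower bound on the fraction of pushed versions that are never read.

    Assumes each update fully replaces the previous version and that, in
    the best case, every read observes a distinct version.
    """
    if updates_per_sec <= 0:
        return 0.0
    versions_read = min(reads_per_sec, updates_per_sec)
    return 1.0 - versions_read / updates_per_sec


# Hypothetical rates: N = 1 read per second, M = 4 updates per second.
print(wasted_update_fraction(1, 4))
```

With N = 1 and M = 4, at least three of every four transferred versions are never looked at. When N is at least M the bound drops to zero, although some updates may still go unread in practice.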
A more serious problem, however, is that keeping multiple copies consistent may itself be subject to serious scalability problems. Intuitively, a collection of copies is consistent when the copies are always the same. This means that a read operation performed at any copy will always return the same result. Consequently, when an update operation is performed on one copy, the update should be propagated to all copies before a subsequent operation takes place, no matter at which copy that operation is initiated or performed.
This type of consistency is sometimes informally (and imprecisely) referred to as tight consistency as provided by what is also called synchronous replication. (In the next section, we will provide precise definitions of consistency and introduce a range of consistency models.) The key idea is that an update is performed at all copies as a single atomic operation, or transaction. Unfortunately, implementing atomicity involving a large number of replicas that may be widely dispersed across a large-scale network is inherently difficult when operations are also required to complete quickly.
Difficulties come from the fact that we need to synchronize all replicas. In essence, this means that all replicas first need to reach agreement on when exactly an update is to be performed locally. For example, replicas may need to decide on a global ordering of operations using Lamport timestamps, or let a coordinator assign such an order. Global synchronization simply takes a lot of communication time, especially when replicas are spread across a wide-area network.
We are now faced with a dilemma. On the one hand, scalability problems can be alleviated by applying replication and caching, leading to improved performance. On the other hand, keeping all copies consistent generally requires global synchronization, which is inherently costly in terms of performance. The cure may be worse than the disease.
In many cases, the only real solution is to loosen the consistency constraints. In other words, if we can relax the requirement that updates need to be executed as atomic operations, we may be able to avoid (instantaneous) global synchronization, and may thus gain performance. The price paid is that copies may not always be the same everywhere. As it turns out, to what extent consistency can be loosened depends highly on the access and update patterns of the replicated data, as well as on the purpose for which those data are used.
In the following sections, we first consider a range of consistency models by providing precise definitions of what consistency actually is. We then continue with a discussion of the different ways to implement consistency models through what are called distribution and consistency protocols. Different approaches to classifying consistency and replication can be found in Gray et al. (1996) and Wiesmann et al. (2000).
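To make the Lamport-timestamp option concrete, here is a minimal sketch of a logical clock. The class `LamportClock` and its method names are hypothetical, and message passing is simulated by handing a timestamp directly to the receiver. Ties between equal counters are broken by process id, which is one common way to obtain a total order.

```python
class LamportClock:
    """Minimal Lamport logical clock for one process."""

    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def local_event(self):
        """Advance the clock for a local update and return its timestamp."""
        self.time += 1
        return (self.time, self.pid)

    def receive(self, stamp):
        """Merge the timestamp carried by an incoming message."""
        sender_time, _sender_pid = stamp
        self.time = max(self.time, sender_time) + 1
        return (self.time, self.pid)


p1 = LamportClock(pid=1)
p2 = LamportClock(pid=2)

u1 = p1.local_event()   # update issued at replica 1
u2 = p2.local_event()   # concurrent update issued at replica 2
p2.receive(u1)          # replica 2 learns about u1
u3 = p2.local_event()   # later update at replica 2

# Every replica that sorts by (counter, pid) applies updates in the same order.
global_order = sorted([u3, u2, u1])
print(global_order)
```

The sketch only shows how timestamps are assigned and compared. A real protocol still has to exchange messages so that each replica knows it has seen every update with a smaller timestamp before applying one, and that exchange is where the communication cost comes from.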