This paper explains how we avoid JINI lease failures, building a distributed directory service (FADA -Federated Advanced Directory Architecture), which is key component of the Digital Business Ecosystem project (DBE from now on).
As we introduce before, FADA is a distributed directory of services that has been used by the DBE project. It is written in Java, and uses the Jini Networking Technology as base. The DBE members will offer their services through FADA, and those services need a way to be searched, and a starting point to find them and use them.
Every service provider will provide with a Java class called the service proxy. Those service proxies are downloaded by clients and executed in the local client's Java Virtual Machine (JVM from now on). These proxies will communicate from the client with the real service. Thus proxies are a consistent unified way to access services from anywhere in the world. These service proxies must be stored in some place and FADA is that place. FADA is a distributed network of FADA nodes that collaborate to provide with a distributed database of service proxies.
A distributed system is a set of entities that exist in geographically distinct places and collaborate to provide a functionality. The difference with traditional or non-distributed systems is that in the latter all components of the system exist in the same place, and communication between them is achieved via the computer buses, which are fast, reliable and always up and working. The difference with centralized remote systems is that in the latter there is a single node that acts as referee for the communication between all collaborating entities, while in a real distributed system all components are self contained (although they need to communicate with their peers to provide the system functionality).
A distributed system has several advantages over non-distributed systems:
- A centralized system offers a single point of failure, which is the central server. If this node crashes the system functionality is damaged, even if the rest of entities are up an running, because they are not able to collaborate anymore.
- A distributed system, if so designed, is much easier to scale, and the throughput is dramatically improved by scalability.
- Price is also an issue: it is much cheaper to make several low-cost entities to collaborate than to make a single entity more powerful. Example: If top technology gave you a 100 MIPS CPU, pushing technology to achieve a 200 MIPS CPU is much more expensive than scaling a system with 20 5 MIPS CPUs to 40 5 MIPS CPUs. And, if well designed, the throughput will be much higher, as 40 CPUs are able to work in parallel, while a single CPU can do only a few things per cycle (if it is able to do several things in a single CPU cycle).
But it also has some drawbacks:
- In a non-remote system all components are physically in the same place. Communication is achieved by the internal buses. A failure is unlikely to happen.
- In a centralized remote system communication is easily managed by a single node. There is no need to synchronize with any other peer, as the notion of event ordering and control is managed by the central node.
- A truly distributed system has to deal with network failures, and since no single part is self contained, every involved entity must be aware of such failures.
- A truly distributed system has to deal with out of order service requests and transaction control in a much more complicated way than centralized or stand-alone systems.
The trend in the last years is going towards truly distributed systems, in which data replication is also an issue to be considered. In a centralized system, if the central host goes down everything is lost. In a distributed system, if some host goes down the information can have been replicated over several other hosts to provide high availability. The conclusion is that advantages overcome difficulties in design and implementation of distributed systems.
Kinds of failures
When a running program gets in a situation where a critical error has occurred and there is no way to continue from that state, execution is simply aborted. For example, if it has to do some calculations based on two variables, a and b, and it must compute a/b, and both a and b equal zero, there is no way to continue once you get the undefinite number 0/0. Or, if the program has a pointer and it is expected to do something with data that is supposed to be where the pointer points, but the pointer value is null (points nowhere) there is no way for the program to continue. Of course, all these conditions can be checked beforehand by the programmer, and try to avoid the conditions that lead to this kind of errors, known as fail-stop errors.
Another kind of error are known as the Bizantium errors. This is when some internal program state (for instance, some variables) are computed without triggering a fail-stop error, but their state is inconsistent and will make calculations based on them to be erroneous. These errors are harder to detect and prevent.
Distributed systems share these kinds of errors with their non-distributed counterparts, and it also has another kind of errors (shared with centralized remote systems): partial failures.
Say you have to perform an action that requires two remote operations that can only be performed once (for instance, you have to take money from an account and transfer it to another). If everything is OK money is first substracted from the first account, then that amount is added to the second account. Now imagine something gets in the way of the first remote operation. Something we are not aware of. The money is therefore not substracted from the first account, and it was added to the second. You have just given away an amount of money. Your boss will fire you off shortly after he realizes your mistake.
You could go one step further and request an acknowledgement for every remote operation. This eliminates the above conditions, you think. Right? Wrong!
Imagine you decide to reverse the order of the operations. Both of them have to be achieved at once in order to finalize the transaction, so no matter which order you call them the result (in case everything runs fine) will be the same. Now imagine you add an amount of money to one account, wait for the acknowledgement, then substract that amount from the second account. What happens if you don't receive the acknowledgement for the first operation? What has happened? Didn't the operation take place? Or it did but the acknowledgement got lost in the way? You don't know, and you just can't call the operation again (if you've already learned the lesson).
This example shows one of the most frustrating experiences with distributed systems: things are quite different from stand-alone systems.
Availability of services
Imagine you're offering a service to someone. You could advertise your service by a number of means. One of them could be to join a repository where users look for services. This repository can be a distributed system or not, but the important thing here is that your service will be remotely accessed by clients. Imagine a client wants to contact your service and goes to the repository, and it finds a reference to your service. The client tries to contact your service, but he gets no response from you, because your service (due to unknown causes) has crashed. He tries later to see if you have realized the fact and solved the problem, but he sees you haven't. After several tries he gives up and tries another service. Another client tries to contact your service, and he runs into the same trouble. All clients wish the repository to have a way to discard non-working services to avoid waste of time, bandwidth and money. One way the repository could achieve that is by periodically polling the state of the registered services. But if there are hundreds of services it means that the repository has to initiate several hundreds of connections to keep the `database' clean.
One of the solutions to this overload problem is to reverse the actions. The service itself is the only responsible to grant the service availability, so it must make aware the repository of the fact that the service is still working.
When the service is registered in the repository the latter doesn't grant the registration forever. Instead it says something like “OK, I will store your service here for the next 5 minutes. If in that time I don't have news from you, I'll assume you have crashed, and delete you from the repository”. Now every service must renew the registration in order to not to be deleted. If a service crashes it won't be able to renew its registration, and after the specified period of time the repository”will delete the service from its registry. If the service is alive it will renew the registration for an extended amount of time.
Services get an entity called Lease upon registration in the repository. The Lease entity has already been used in DHCP, Jini and now in the FADA.
The service proxy registration lease
All registered services get a registration lease upon registration request. This lease is an object that identifies the registration, and has an internal counter that, when arrives to zero, deregisters the service proxy from the FADA infrastructure. The lease is said to have `expired'. This approach leads to an automatic cleanup of service proxies whose real service is not reachable anymore, thus eliminating the need to maintain the database of registered proxies.
This mechanism is used extensively in Jini, and has been inherited by FADA. The idea of the lease has already been used in Java distributed garbage collection, and other domains as DHCP, the dynamic host configuration protocol. Instead of having a server periodically polling the state of the client, it is the client that periodically renews interest in some item, in this case, the registration. This approach frees the server side of much of its work, and distributes the load among clients (in this case service providers are clients of the FADA).
Leasing and network delays
Jini already offers the concept of lease and the mechanisms to obtain, renew and cancel leases, but the classes that offer automated lease renewal are designed and implemented thinking in the low delays of a LAN, for which Jini was thought. In our case we'll be dealing with the high delays of the Internet, and the standard Jini behavior must be modified.
The situation in Jini is the following: Jini Lookup Servers and services are within the same LAN. The service, upon registration in the Jini Lookup Server, gets a Lease for its registration. This Lease object has a method that tells the date of expiration. Getting the actual time it is easy to compute the remaining time before lease expiration and service proxy deregistration. But the lease renewal operation takes some time, part of which is due to CPU time used for calculations, and part of which is used in the transmission of the physical frame on the LAN. In a LAN the network delays are quite low, and usually under a fixed worst-case value. If we make sure we perform the call to lease renewal before deadline minus delay time happens, we will make sure the lease has been renewed in time.
But in the Internet, where FADA is supposed to work, the distribution of the delays is quite different. Network delay is dependent of many factors as frame length, congestion state of the links, state of the routers, damage on the physical connections that lead to alternate paths searching and reconfiguration of path tables in the routers, etc. Those factors are beyond our reach. We must be able to estimate the worst delay in a given instant.
The Fetish toolkit class FetishImpl will take care of registering services in the FADA, and it will obtain the lease for those registrations. This lease is wrapped by the FADA in a new class that implements the interface FetishLease, a descendant of the standard Jini Lease. This class keeps information needed by the FetishImpl.
A new class called FetishLeaseRenewalManager has also been provided. It is very similar to the Jini LeaseRenewalManager, from which the name has been taken and modified. It does the same basic job as the mentioned Jini LeaseRenewalManager, but with a very important difference. Instead of presuming a low delay, that will be under a fixed value, when a FetishLease is given to the FetishLeaseRenewalManager for renewal it first attempts a renewal, and calculates the round-trip time of the RMI call. This round-trip time gives a hint of the network delay that should be taken into account when renewing the lease (if the lease renewal request is started before the lease expires, but the network delay is very high, the request can finally arrive at destination well after the lease has already expired).
But this round-trip time is not enough, as it is an instant value of the state of the network, but that state might evolve to better or worse conditions. If it evolves to worse conditions we'll be late. If it evolves to better conditions we'll be wasting time and bandwidth. Averaging the last 5 values is definitely not enough, as we still get a very wiggly measure. We need another approach.
The Kalman Filter
In 1960, R.E. Kalman published a paper describing a recursive solution to the discrete-data linear filtering problem.
The Kalman filter is a set of mathematical equations that provides an efficient computational (recursive) solution of the least-squares method. The filter is very powerful in several aspects: it supports estimations of past, present, and even future states, and it can do so even when the precise nature of the modeled system is unknown.
This approach is what we needed: a mathematical model that was able to predict the future state of the network. Not only the Kalman filter gives the best estimation for the next value, but it also gives the error variance, which is much more useful in our case, as we do not need the exact predicted value, but an upper bound for that prediction. In other words: we don't need to know that the most probable next delay time will be, but what is the confidence interval for that value, and use the worse value as our estimation.
The Kalman filter as-is, though, did not fit our case, as it makes some assumptions that didn't hold true. It assumes the real value is disturbed by a white noise signal. In our case the delay evolution of the network is not a continous function, but we can model it as an average continous function disturbed by another function. This other function doesn't follow a gaussian distribution. Further investigation led us to a different formulation of the Kalman filter, in which some of the calculations had been changed to fit the case. The noise added to the signal was supposed to follow an exponential distribution, and the filter was used at intervals of time, between which the filter was kind of `resetted'. Although not exactly our case, it did much better fit it. Simulation and experiments have shown a much better response than simple averaging: the failure rate of the modified Kalman filter was much lower. We call a failure a lease that couldn't be renewed in time.
The figure shows the behavior of the filter we're using when facing Internet delays. We implemented a simple server that received a request of a client, waited a random time (whose distribution was uniform around ±30% of a constant time, that was modifiable at run-time) and then issued the response. The spikes correspond to the Internet delays. The filter adapts quickly to new conditions on the network, although not so quickly that it would follow the previously sampled value, which would render the filter useless.
Even with the Kalman filter we still needed to ensure that service proxy registrations where not deleted from the registry when the service was still alive. A reregistration method was added to the class FetishImpl, that could be used by the FetishLeaseRenewalManager. Reregistration method registers the service proxy again, but it keeps the same parameters: attributes (given by the service provider), the interfaces implemented (given by the proxy itself) and ServiceID (given by Jini and FADA, but kept the same by FetishImpl and FetishLeaseRenewalManager). Even when a lease is not renewed in time, the state of the FADA remains stable, and users will only notice a delay in the response or a miss in the lookup, that will be fixed on the next lookup.
An experiment done over a period of 24 hours showed that, with a lease time of 30 seconds (which is really short; Jini generally gives lease times of around 5 minutes) only 107 of the lease renewals arrived late. The total amount of renewal attempts was around 720. This throws a failure rate of around 14.86%. The service registration, after the 24 hours period, still kept the same ServiceID of the initial registration, and the service was still findable and callable.
In RFC889, p. 11, section 3.3, a formulation of the filter used in IP routers is given. It is very similar to the already chosen one, that is based in the Kalman formulation. The same document also gives some statistics about the usual delays found in the Internet. Delays depend on many parameters, such as packet length, bit rate of middle networks, congestion, etc. Packet loss is also significant.
Distributed systems, though harder to develop, offer a number of advantages that overcomes their drawbacks, such as availability, load distribution and location.
The FADA architecture is a distributed repository of services that relies on Jini and extends its functionality to work in the Internet.
The high delays of the Internet are a serious problem to be treated, and a good estimation of their values is needed in order for things to work properly.
The Kalman filter formulation is a good starting point for estimating these delays. One of its derivatives has shown good performance in the FADA envionment, and has been adopted as the solution to the problem.