|
Unexpected failures and outages will continue to affect the operation of cyber
infrastructures like Amazon EC2 and network infrastructures like GENI. For
many applications running in such infrastructures, such as long-running
scientific jobs and networked system emulations, failure recovery means
re-running the application from the beginning, which discards the (partial)
work already done and wastes system resources. It is desirable for the infrastructure to
provide efficient, application-transparent failure recovery capability that
takes live "snapshots" of an infrastructure for future recovery or replay.
With advances in virtualization technologies, live snapshotting is feasible for a single virtual machine. However, this single-machine technique is not adequate for suspending and resuming distributed experiments that run on GENI. GENI-VIOLIN's goal is to provide fast "live snapshotting" that allows an entire GENI experiment, distributed across multiple sites and spanning multiple networks, to be suspended and resumed. This project is part of the GENI-alpha plenary demos planned for GENI Engineering Conference 9 (GEC9). GENI-VIOLIN can be used for
|
|
|
The key challenge in suspending and resuming a distributed experiment is
coordinating the multiple independent checkpoints performed at the
end hosts. We leverage Purdue University's earlier work, VNSNAP,
built on top of VIOLIN.
Our primary contribution is the development of a distributed live snapshot
algorithm that performs snapshotting entirely in the network, with minimal
changes to end-host systems and minimal performance degradation.
We have implemented Mattern's snapshotting algorithm using Xen's live migration and OpenFlow. There are two key components in the implementation.
|
|
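To make the coordination problem concrete, the following is a minimal in-memory sketch of a color-based snapshot in the style of Mattern's algorithm. It is a toy simulation only, not the GENI-VIOLIN implementation: it does not involve Xen or OpenFlow, and every name in it (`Node`, `Message`, `run_demo`, `snapshot_complete`, `snapshot_total`) is hypothetical. Each node starts white and turns red when it takes its snapshot. Every message carries the sender's color. A white node that receives a red message snapshots before delivering it, and a red node that receives a white message logs it as in-transit channel state. Message counters tell the collector how many white messages are still outstanding, so channels do not need to be FIFO.

```python
from dataclasses import dataclass, field
from typing import List, Optional

WHITE, RED = "white", "red"  # white = before snapshot, red = after snapshot


@dataclass
class Message:
    color: str    # color of the sender at send time
    payload: int  # units of application state being transferred


@dataclass
class Node:
    name: str
    state: int = 0                    # application state (a simple balance)
    color: str = WHITE
    snapshot: Optional[int] = None    # local state recorded at snapshot time
    in_transit: List[int] = field(default_factory=list)  # logged channel state
    white_sent: int = 0
    white_received: int = 0
    deficit_at_snapshot: int = 0      # white sent minus white received, at snapshot

    def take_snapshot(self) -> None:
        """Record local state once and turn red."""
        if self.color == WHITE:
            self.snapshot = self.state
            self.deficit_at_snapshot = self.white_sent - self.white_received
            self.color = RED

    def send(self, payload: int) -> Message:
        """Debit local state and emit a message tagged with the current color."""
        self.state -= payload
        if self.color == WHITE:
            self.white_sent += 1
        return Message(self.color, payload)

    def receive(self, msg: Message) -> None:
        if msg.color == RED:
            # A post-snapshot message must not leak into a pre-snapshot state.
            self.take_snapshot()
        elif self.color == RED:
            # Pre-snapshot message arriving after our snapshot: channel state.
            self.in_transit.append(msg.payload)
        else:
            self.white_received += 1
        self.state += msg.payload


def snapshot_complete(nodes: List[Node]) -> bool:
    """All nodes are red and every outstanding white message has been logged."""
    expected = sum(n.deficit_at_snapshot for n in nodes)
    logged = sum(len(n.in_transit) for n in nodes)
    return all(n.color == RED for n in nodes) and expected == logged


def snapshot_total(nodes: List[Node]) -> int:
    """Sum of recorded local states plus recorded in-transit messages."""
    return sum(n.snapshot for n in nodes) + sum(sum(n.in_transit) for n in nodes)


def run_demo():
    a, b = Node("A", state=100), Node("B", state=100)
    m1 = a.send(10)      # white message, still in flight
    a.take_snapshot()    # A initiates the snapshot and turns red
    m2 = a.send(5)       # red message
    b.receive(m2)        # overtakes m1: B snapshots before delivering m2
    b.receive(m1)        # late white message is logged as in-transit state
    return a, b


if __name__ == "__main__":
    a, b = run_demo()
    print(a.snapshot, b.snapshot, b.in_transit, snapshot_total([a, b]))
```

In this scenario the two nodes hold 200 units in total, and the recorded snapshot (both local states plus the logged in-transit message) accounts for the same 200 units even though the messages arrive out of order. That is the consistency property a distributed snapshot has to preserve.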