Tolerating latency in replicated state machines through client speculation
Benjamin Wester, James Cowling, Edmund B. Nightingale, Peter M. Chen, Jason Flinn, and Barbara Liskov
Abstract
Replicated state machines are an important and widely-studied
methodology for tolerating a wide range of faults. Unfortunately,
while replicas should be distributed geographically for maximum fault
tolerance, current replicated state machine protocols tend to magnify
the effects of high network latencies caused by geographic
distribution. In this paper, we examine how to use speculative
execution at the clients of a replicated service to reduce the impact
of network and protocol latency. We first give design principles for
using client speculation with replicated services, such as generating
early replies and prioritizing throughput over latency. We then
describe a mechanism that allows speculative clients to make new
requests through replica-resolved speculation and predicated writes.
We implement a detailed case study that applies this approach to a
standard Byzantine fault tolerant protocol (PBFT) for replicated NFS
and counter services. Client speculation trades in 18% maximum
throughput to decrease the effective latency under light workloads,
letting us speed up run time on single-client micro-benchmarks
1.08--19x when the client is co-located with the primary. On a
macro-benchmark, reduced latency gives the client a speedup of up to 5x.