Building Reliable Large-Scale Distributed Systems: When Theory Meets Practice

Lidong Zhou
Microsoft Research Asia
lidongz@microsoft.com

Introduction

Large-scale distributed systems are in vogue with the increasing popularity of cloud computing, because a cloud is usually constructed from a large number of commodity machines. Because commodity hardware is failure-prone, reliability has been central to the design of those systems. The theoretical distributed systems community is in theory well prepared to embrace this (possibly a bit hyped) new era: the community started uncovering the fundamental principles and theories more than thirty years ago. The core consensus problem and the replicated state-machine approach [19, 32] based on consensus are directly applicable to the reliability problem in large-scale distributed systems. Although some basic concepts and core algorithms (such as Paxos [20]) do find their way into practical systems, the research in the community does not appear as relevant as expected to the practical large-scale distributed systems that have been developed and deployed. There is undeniably a significant gap between theory and practice in this particular context. This article reflects our rather biased personal perspectives on the gap between theory and