Tutorial (P. Veríssimo)

Beyond the glamour of Byzantine Fault Tolerance:
OR why resisting intrusions means more than BFT

Paulo Veríssimo
Univ. Lisboa, Faculdade de Ciências, LaSIGE,
Lisbon, Portugal
pjv@di.fc.ul.pt ; http://www.di.fc.ul.pt/~pjv

Byzantine Fault Tolerance (BFT) has become a reference paradigm for dealing with faults and intrusions, achieving security (and dependability) in an automatic way, much along the lines of classical fault tolerance. However, BFT is a means to an end, namely intrusion tolerance and resilience, and resilience to intrusions actually means more than BFT.
The explosive combination of the desired asynchrony of these systems with the real-life (and real-time) power of attackers has exposed limitations of the paradigm as a basis for designing resilient systems. Several researchers have addressed these limitations, some of which were quite unexpected. Although recent practical algorithmic or systems fixes have partially improved the situation, we show that the problems have a formal root: exhaustion failure, and the susceptibility of current BFT systems to it. We give several practical examples of the phenomenon.
The tutorial consolidates recent results showing that there is more to designing resilient systems than BFT and that, surprisingly or not, not all BFT algorithms lead to resilient designs. Here, resilience means the capacity of a system to fulfill its mission to the end in the presence of accidents and attacks (i.e., faults and intrusions), which may be harsh and even uncertain.

First, we discuss the theoretical underpinnings. We propose a system predicate, called exhaustion safety (ES), that should be met by any BFT algorithm and system intended to be resilient. We show impossibility results for ES in asynchronous BFT systems and show that they can be overcome under hybrid distributed systems models. We then review recent algorithmic lower bounds that show the power of this latter model.
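
The intuition behind exhaustion failure can be sketched with a toy model. The following Python fragment is an illustration only, not the tutorial's formal definition: it assumes a hypothetical attacker who compromises one replica every `attack_every` time units, a system that tolerates at most `f` compromised replicas, and an optional recovery mechanism that rejuvenates all replicas every `recover_every` time units. All names and parameters are assumptions introduced for this sketch.

```python
from typing import Optional


def first_exhaustion(f: int, attack_every: int, mission_time: int,
                     recover_every: Optional[int] = None) -> Optional[int]:
    """Return the first time at which more than f replicas are compromised,
    or None if that never happens within mission_time.

    Toy model (assumption): one replica is compromised every `attack_every`
    ticks; if `recover_every` is given, all replicas are rejuvenated at
    every multiple of it, before any attack at the same tick.
    """
    compromised = 0
    for t in range(1, mission_time + 1):
        if recover_every is not None and t % recover_every == 0:
            compromised = 0          # rejuvenation clears all intrusions
        if t % attack_every == 0:
            compromised += 1         # attacker gains one more replica
        if compromised > f:
            return t                 # fault assumption violated: exhaustion
    return None


# No recovery: the f = 1 assumption is violated at the second intrusion.
print(first_exhaustion(f=1, attack_every=10, mission_time=100))
# Timely recovery (period 10): the assumption holds for the whole mission.
print(first_exhaustion(f=1, attack_every=10, mission_time=100, recover_every=10))
# Recovery delayed to period 25: exhaustion is no longer prevented.
print(first_exhaustion(f=1, attack_every=10, mission_time=100, recover_every=25))
```

In this toy model, recovery only helps if its period is bounded relative to the attacker's pace. An asynchronous system gives no such bound, since the recovery can be delayed arbitrarily, which is one informal way to read why a synchronous component in a hybrid model matters.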

Then, we review recent research results that address a complete approach to designing resilient BFT systems, especially in dynamic and long-lived environments. Concepts like consensus, state machine replication, proactive/reactive recovery/resilience, diversity, distributed systems hybridization, and exhaustion safety are put in context in a coherent whole, giving insight into the correct design of resilient systems: how to structure a BFT hybrid distributed system; how to design and show the correctness of BFT algorithms under hybrid models; how to actually solve the above-mentioned problems of BFT.

Finally, extensive literature pointers are given, in particular to works that aim to achieve actual resilience against Byzantine faults. The relevance of exhaustion safety is illustrated with examples from systems projects in critical areas like security information and event processing, cloud-based critical information infrastructures, or privacy-sensitive biobank data storage and processing.

The material of this tutorial has been presented and refined over several editions, for example in Ph.D.-level courses at U. Roma La Sapienza, Carnegie Mellon, and the Swiss Romande Ph.D. Spring School, and more recently at the INRIA Winter School on Hot Topics in Distributed Computing and the DISC 2012 conference.

Tutorial Elements
Duration: 3h + breaks (half day)

  • General problem definition: prevention vs. tolerance vs. resilience
  • Specific problem definition: misconceptions and limitations w.r.t. Byzantine Fault Tolerance
  • Formalisation: exhaustion failure and exhaustion safety
  • Practical examples of the problems
  • Solutions: hybrid distributed system models
  • Validity of the hybrid approach: algorithms, lower bounds, related work
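
As a small illustration of the last item above: the abstract does not say which bounds the tutorial reviews, so the numbers below are an assumption based on widely cited results, not a summary of the tutorial. BFT state machine replication without trusted components is commonly stated to need at least 3f+1 replicas to tolerate f Byzantine replicas, whereas designs in which each replica includes a small trusted component are commonly stated to need 2f+1. The function name and the `hybrid` flag are hypothetical.

```python
def min_replicas(f: int, hybrid: bool = False) -> int:
    """Minimum replica count to tolerate f Byzantine replicas.

    Assumption: `hybrid=True` stands for a model in which every replica
    has a small trusted component that cannot fail arbitrarily.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1 if hybrid else 3 * f + 1


for f in (1, 2, 3):
    print(f, min_replicas(f), min_replicas(f, hybrid=True))
```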

Short Curriculum
Paulo Veríssimo is a Professor in the Department of Computer Science and Engineering, U. of Lisbon Faculty of Sciences (FCUL), member elect of the Board of the U. of Lisbon and of the Scientific Council of the FCUL, and Director of LaSIGE (http://lasige.di.fc.ul.pt). He belonged to the European Security & Dependability Advisory Board. He is currently Chair of the IFIP WG 10.4 on Dependable Computing and Fault-Tolerance and vice-Chair of the Steering Committee of the IEEE/IFIP DSN conference. He is a Fellow of the IEEE and of the ACM, and an associate editor of the Elsevier Int’l Journal on Critical Infrastructure Protection. Veríssimo leads the Navigators group of LaSIGE and is currently interested in distributed architectures, middleware and algorithms for adaptability and safety of real-time networked embedded systems, and for resilience of secure and dependable large-scale systems. He is the author of over 170 peer-refereed publications and co-author of 5 books.