2024

Brief Announcement: Clock Distribution with Gradient TRIX.

Summary

Gradient clock synchronisation (GCS) algorithms minimise the worst-case clock offset between the nodes in a distributed network of diameter D and size n. They achieve optimal offsets of Θ(log D) locally, i.e. between adjacent nodes [Lenzen et al., 2010], and Θ(D) globally [Biaz and Welch, 2001]. A key open problem in this area is to achieve fault tolerance at minimal overhead in terms of the number of edges. In this work, we achieve this goal under the assumption of an average-case distribution of faults, i.e., nodes fail with independent probability p ∈ o(n^{-1/2}). In more detail, we present a self-stabilising GCS algorithm for a grid-like directed graph with in- and out-degrees of 3. Note that even for tolerating a single fault, this degree is necessary. Moreover, the failure probability p is the largest possible ensuring the necessary condition that for each node at most one in-neighbour fails with probability 1-o(1). Our algorithm achieves asymptotically optimal local skew of Θ(log D) with probability 1-o(1); this holds under general worst-case assumptions on link delay and clock speed variations, provided they change slowly relative to the speed of the system. On the one hand, our results are of practical interest. As we discuss with care, the fault model is suitable for synchronously clocked hardware. Since our algorithm can simultaneously sustain a constant number of arbitrary changes due to faults in each clock cycle, it achieves sufficient robustness to dramatically increase the size of synchronously clocked Systems-on-Chip. On the other hand, our result is of a theoretical and algorithmic nature. We show that for a worst-case distribution of f 1-local faulty nodes within our fault model’s locality constraints, our algorithm achieves a local skew of O(5^flog D), while for our model with probabilistic distribution of faults the algorithm achieves O(log D). Our work raises questions for further theoretical investigation on techniques for fault tolerance and trade-offs between fault distribution and edge density of graphs.