[trustable-software] No silent failures?

Niall Dalton niall.dalton at gmail.com
Thu Jul 28 17:40:27 UTC 2016


On Mon, Jul 25, 2016 at 2:33 PM, Duncan Hart <dah at seriousaboutsecurity.com>
wrote:

> I have come to accept that silent component failure is a contributor to
> system failure like no other.
>
>   Imagine you have a system with 3-way redundancy :
>
>   If one component fails then nothing bad happens.
>
>   Even if 2 components fail nothing bad happens.
>
>   But if the first fails, and then the second, AND you don't know that
> they have, then on the one hand the redundancy can be said to be effective
> and, on the other hand, each failure that you do not notice, because the
> redundancy is covering you, brings you one step closer to an entire system
> failure.
>
>   When an entire system failure occurs, you will declaim that it was
> impossible because you have (had) 3-way redundancy, but you didn't. You
> once did, then you had 2-way, then no redundancy at all, then you had a
> failure.
>
> Does the logic hold true? How might this manifest itself in a software
> environment?
>


I'm sure plenty of folks suffer from silent failures. Plenty of others
suffer from the opposite problem though. To take your point in a slightly
different direction..

Imagine you have a large distributed system that is instrumented up the
wazoo: tens or hundreds of millions of operations per second in flight, and
an order of magnitude more possible measurements. It can be hard to
understand and react appropriately to what is going on, and to know whether
an emergency reaction is appropriate or likely to increase the damage. Say
you have some master service with r=5, and some storage system doing simple
r=3 replication. You suddenly lose 2 masters and 40% of your storage.
You're heavily instrumented, so you realize this.

But should lots of sub-components leap into action to recover to full
redundancy? In this case, almost certainly not, since in practice it's
unlikely that you've genuinely lost 40%. Unless your environmental sensors
are picking up a rather exciting amount of heat and particles in the air,
you've probably had some network excursion. Swing aggressively into action
to solve it, and you're likely to mount a denial-of-service attack against
yourself.

Now in this very simple case it's easy to imagine understanding it and
setting thresholds to stop the system swinging wildly. But in reality, with
many interacting services that you don't understand running on the system,
and the percentages not so blatantly large.. what do you do? In big systems
you also get some entertaining oscillations where the system never quite
seems to settle, as a result of lots of interacting recoveries.
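As a very rough illustration of that kind of damping (a sketch only, with
made-up names and thresholds rather than anything from a real system), the
guard might look like this in Python:

    # Decide whether automatic re-replication should kick in, or whether
    # the apparent loss is so large that a network partition or monitoring
    # glitch is the more likely explanation. Threshold values here are
    # illustrative assumptions, not tuned numbers.
    def should_rebalance(reported_loss_fraction: float,
                         partition_suspect_threshold: float = 0.10,
                         cooldown_active: bool = False) -> bool:
        if cooldown_active:
            # A recent recovery storm is still settling; don't pile on.
            return False
        if reported_loss_fraction >= partition_suspect_threshold:
            # "Losing" 40% of storage at once is almost certainly not real
            # hardware loss; hold off and escalate rather than re-replicate.
            return False
        # Small, plausible losses (one box, one disk) can be repaired in
        # the background without DoS-ing yourself.
        return reported_loss_fraction > 0

    should_rebalance(0.40)   # False: smells like a partition, not a fire
    should_rebalance(0.001)  # True: one box down, quietly re-replicate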

Scale the failure down to losing a single storage box in the cluster.
Rebalance or not? Eh, you could, but you might also want to consider just
running fractionally below r=3 on the data until someone swaps in a new
power supply. And so on.

I go off on this tangent to raise the whole bucket of pain involved in
building a trustworthy system like this. Not that we necessarily want to
try to build a 100% trustworthy system. Life gets a /lot/ easier if we
tolerate a little slop rather than demanding perfection. Which is not to
say that systems designers shouldn't learn a bit more combinatorics,
probability and statistics. E.g. you may have imagined randomly
distributing chunks of data across the cluster above, and imagined it
relatively unlikely to lose data in some major outage. But take a genuinely
large outage where 1% of a 1000-machine cluster is irrevocably lost and you
do random r=3 replication: the probability is around 99.7% that you've lost
all 3 copies of some piece of data. (Breaking the cluster into replication
sets, which limits the number of combinations chosen for placement of
replicas, can get that down to roughly 0.02%.)
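A back-of-the-envelope check of those figures, as a sketch (the chunk count
below is my assumption for illustration, not a number from the original):

    from math import comb

    n, r = 1000, 3      # machines in the cluster, replication factor
    failed = 10         # 1% of the machines irrevocably lost

    # Probability that one specific set of r machines lies entirely
    # inside the failed set.
    p_set = comb(failed, r) / comb(n, r)

    # Random placement: each chunk picks its own r machines, so with many
    # chunks you expose millions of distinct replica sets to the failure.
    chunks = 8_000_000  # assumed chunk count
    p_loss_random = 1 - (1 - p_set) ** chunks

    # Replication sets: carve the cluster into a small fixed number of
    # r-machine groups and keep every chunk's replicas inside one group.
    groups = n // r
    p_loss_grouped = 1 - (1 - p_set) ** groups

    print(f"random placement:  {p_loss_random:.1%}")   # ~99.7%
    print(f"replication sets:  {p_loss_grouped:.3%}")  # ~0.02%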

Some services learned this the hard way, going through the transition of
"we have no clue what's happening" -> "we have lots of data but no
understanding" -> "oh bugger, we have exquisite data on just how badly
we're screwed" -> "practically speaking, we're doing ok". But most systems
I look at are disturbingly early in the cycle.