Call me maybe: Redis redux (aphyr.com)
161 points by llambda on Dec 11, 2013 | hide | past | favorite | 23 comments


What Aphyr tested was not my "toy" example model, which was never proposed as something to actually implement, but only to show that WAIT per se is neither broken nor OK: it is just a low-level building block. The consistency achieved depends on the whole system, especially on the safety guarantees of the failover procedure.

What I proposed is a toy system as described here: https://gist.github.com/antirez/7901666

It is a toy because it has a super-strong coordinator that can partition away instances, is never itself partitioned, can reconfigure clients magically, and so forth. Under those assumptions, I believe the theoretical system is trivially capable of reaching linearizability.

Aphyr tested a different model, with an implementation that is not even capable of guaranteeing some weak assumptions of the model (for example, the slave resets its replication offset to 0 when restarted), so I'm not sure what the result means.

I could test the actual model I proposed even with Redis, by manually following the steps I outlined in the Redis mailing list thread. The point was: if you can guarantee certain properties in the system, data and higher offsets are always "transferred" to a majority of replicas, and the system becomes strongly consistent.
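As a rough sketch of that "transfer" argument (the model and names below are mine, not the gist's): if a write is acknowledged only once a majority of replicas stores it, then any failover that consults a majority and promotes the highest replication offset cannot lose an acknowledged write.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    log: list = field(default_factory=list)  # replicated writes, in order

    @property
    def offset(self):
        return len(self.log)

def write(replicas, value, quorum):
    """Replicate to every reachable replica; ack only on majority."""
    acks = 0
    for r in replicas:
        if r is not None:      # None models a partitioned-away replica
            r.log.append(value)
            acks += 1
    return acks >= quorum      # WAIT-style acknowledgement

def failover(reachable):
    """The 'super strong coordinator' promotes the reachable replica
    with the highest replication offset."""
    return max(reachable, key=lambda r: r.offset)

replicas = [Replica() for _ in range(3)]

assert write(replicas, "a", quorum=2) is True                    # acked
assert write([replicas[0], None, None], "b", quorum=2) is False  # not acked

# Failover over any majority (here replicas 1 and 2) still finds "a",
# because "a" was only acknowledged once a majority stored it.
new_master = failover(replicas[1:])
assert "a" in new_master.log
```

The hard part, as the next paragraph notes, is making the coordinator's guarantees real once it is no longer mythical.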

The properties are hard to achieve in practice once you try to move the features from the mythical super-strong coordinator into the actual system; this is why, for example, Raft uses epochs and other mechanisms to guarantee both safety and liveness.

Unfortunately the focus is on showing that other people are wrong, without even caring where the discussion is headed.

--- EDIT ---

Btw now that I have read the full post carefully, Aphyr also cherry-picked parts of the thread to construct a story that does not exist, as if I were going to implement strong consistency in Redis based on the proposed toy system, which was only useful to show that WAIT per se is not a system but just a building block. Note that yesterday I wrote the opposite in my blog: there is no interest in strong consistency in Redis Cluster.

Very unfair IMHO... I read only the analysis part at first, and thought this was just a "let's check this model anyway against the current implementation".


So to clarify, you see WAIT as a replication primitive. It can be used as part of a larger scheme with a strong coordinator to provide strong consistency.


Yes, with a strong coordinator or with a distributed system that can provide similar guarantees. But currently there is no plan to add this in Redis Cluster. Three main reasons:

* Our main business is low latency, so very few will use synchronous replication.

* Redis Cluster is composed of multiple master-slave systems, each holding a subset of the key space. This means that if you need a majority of replicas to promote a new master in order to achieve consistency, after a partition you may end up with different hash slots having their majority on different sides of the partition.

* With the current weaker consistency guarantees, Redis Cluster can elect a replica that is isolated from the others but is on the right side of the partition, where the majority of the other masters are. Similarly, you can have just two total nodes for every hash slot (a setup that I believe will be widely used) and still get some availability if the master fails.

So the consistency premises and tradeoffs of Redis Cluster are not exactly compatible with strong consistency.
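The second point above can be illustrated with a toy model (the node names and slot layout are mine): because each hash slot has its own replica group, a single partition can leave different slots with their majority on different sides.

```python
def majority_side(group, side_a):
    """Return which side of the partition ('A' or 'B') holds a majority
    of this slot's replica group, or None if neither side does."""
    in_a = sum(node in side_a for node in group)
    if in_a > len(group) // 2:
        return "A"
    if len(group) - in_a > len(group) // 2:
        return "B"
    return None

# Two hash slots, each replicated on three nodes (hypothetical layout):
slot_groups = {
    0: ["n1", "n2", "n3"],
    1: ["n3", "n4", "n5"],
}
side_a = {"n1", "n2", "n3"}  # nodes on side A of the partition

assert majority_side(slot_groups[0], side_a) == "A"
assert majority_side(slot_groups[1], side_a) == "B"
```

So requiring a majority per slot for promotion would leave each side of the partition able to serve only some of the key space.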

Is WAIT still a useful tool? I believe yes, if documented as such:

WAIT in the context of Redis Cluster / Sentinel is not able to provide strong consistency, but it lowers the probability that a failure mode resulting in data loss happens.

A trivial example: what happens if a client is partitioned away with just a master? Without WAIT there is a window of NODE_TIMEOUT in which data can be lost; with WAIT, this specific failure scenario goes away. There are still a number of failure modes for WAIT, but they are fewer than the full set of failure modes without it, so in practice it does not provide strong consistency but gives the user a smaller probability of data loss.
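That trade-off can be sketched with a toy model (the model below is mine, not Redis internals): without WAIT the partitioned master acks the write and the data silently dies at failover; with a majority WAIT the ack never arrives, so the loss is at least visible to the client.

```python
QUORUM = 2  # majority of a group with one master and two replicas

def write_during_partition(use_wait, replicas_reached):
    """One write on a master that can reach only `replicas_reached` of
    its replicas. The write survives a later majority-side failover
    only if a majority of replicas stored it."""
    stored_on_majority = replicas_reached >= QUORUM
    acked = stored_on_majority if use_wait else True
    return {"acked": acked, "survives": stored_on_majority}

# Client partitioned away with just the master (no replicas reachable):
plain = write_during_partition(use_wait=False, replicas_reached=0)
waited = write_during_partition(use_wait=True, replicas_reached=0)

assert plain["acked"] and not plain["survives"]  # silent data loss
assert not waited["acked"]                       # loss visible, not silent
```

The point is not that WAIT makes loss impossible, only that it removes this particular way of losing an acknowledged write.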

Another example: manual failover for very important data of some kind. You may have 5 nodes and use WAIT 4 to always write everywhere and improve durability.

And so forth.


> Ultimately I was hoping that antirez and other contributors might realize why their proposal for a custom replication protocol was unsafe nine months ago, and abandon it in favor of an established algorithm with a formal model and a peer-reviewed proof, but that hasn’t happened yet. Redis continues to accrete homegrown consensus and replication algorithms without even a cursory nod to formal analysis.

That is kind of my feeling too. Redis is an outstanding product with a beautiful code base. This replication feature has been tough, though. It is partly due to external factors, as I mentioned in the previous post. Everyone and their cousin are talking about distributed databases; everyone likes CAP, CRDTs, vector clocks, Raft, Zookeeper and so on. It is hard to come out and say "Here, I have made this custom replication protocol." Everyone stares and asks, "Hey, where is your whitepaper or your partition tolerance tests?" 5-7 years ago, there would have been only nods and approvals. The other aspect is that this is about a database, so it is potentially touching users' valuable data. If that gets lost, whether by a bug, miscommunication in docs, a bad default, anything, it will not be taken lightly.

In the end I think it is fine to have it as what it is, with the warnings and disclaimers that data could be lost and avoiding papering over or hiding issues.

As an extra side note, simply put, partition tolerance is hard. Net-splits are the devil of the distributed world. Some claim it doesn't exist or doesn't happen often. Others fear and tremble when its name is mentioned. When it does happen, it means having to resolve conflicts, throwing away user data, or killing your availability by stopping some nodes from accepting writes in order to provide consistency. This is a tough test (the one Aphyr runs) and not very many databases fare well in it. But it is good these things are discussed.


The author, Kyle Kingsbury (aka aphyr[1]), has done an amazing experimental analysis of several distributed databases/datastores recently[2]. I hope his experiments and writings not only shine some light into how these systems actually behave in practice, but also influence the underlying projects to improve.

[1] https://github.com/aphyr

[2] http://www.infoq.com/presentations/partitioning-comparison


I think the blog post is valuable, but I'd point out that aphyr is also not publishing formal proofs or using the verification he suggests is necessary. I don't think this is intentional, but he should realize and own that he is borrowing the authority of formal methods he is not demonstrating, then criticizing others for the same lack of demonstration.


Well, this one's a proof by construction, so I don't feel the need to go particularly deep into the math. If you prefer a more rigorous approach, take a look at the TLA proofs earlier in the Jepsen series; I think I linked one on Redis + async replication in the article.


I read and enjoyed them, thanks. I'd just like people to be in a more collaborative vs combative mode, and I think a big part of that is not overly relying on rhetoric or overreaching what you've actually demonstrated. Using formalism on a blog is commendable, but it's also not the same thing as a properly reviewed paper. People should keep straight the levels of dialog here and value them each for the light they shed. This has been mostly a productive exchange despite a few "yer doin' it wrong bro" attitude folks, and I just want to throw my hat in to strongly advocate for keeping it that way.


As a way of checking, when a phrase is in character for "Comic Book Guy", you know it's probably a bit too much on the mean-spirited side of things.


He writes very good partitioning and consistency tests. Those are invaluable. As we have seen, most "reliable", "distributed" and "fault tolerant" databases fall to their knees in those tests. I think he is honest and doesn't fake his results; I believe these are hard corner cases, but they are not impossible.

Overall it is very good for the industry: better for Aphyr to lose one of his integer key-value pairs and write a blog post about it than for your health insurer to lose your health records.


Isn't the simulation sufficient? A proof that something works requires logic, but a single counter-example suffices to show that something doesn't work.


To rephrase antirez from the previous thread:

People use Redis, in large part, for its time and space guarantees on data structure operations. (Without those, you may as well be using a serialized object store.) Strong consistency requires rollbacks; and the book-keeping necessary to do rollbacks throws away the time and space guarantees. So either you have Strong Consistency, or you have Redis, but you don't get both.

But Redis Cluster is a compromise: something which is roughly good enough for most cases people actually use Redis for, while failing horribly at things Redis isn't used for anyway, and still providing Redis's time and space guarantees.

Theorists balk, because there are obvious places where Redis Cluster falls down, and they can demonstrate this. Engineers shrug, because Redis isn't being used in their companies in such a way that those demonstrations are relevant to their problems.

Most people who need Redis Cluster have already Greenspunned a Redis Cluster themselves, and they're already happily living with the compromise it entails. They'll gladly hand the support burden of writing cluster-management code upstream to antirez; it won't change any of the facts about the compromise.


Most companies also simply don't have Redis-type data problems that can't be solved by throwing a pair of 768 GB ($40k) or 2 TB ($120k) servers at them.

When that option is available it tends to beat complex software solutions in every way.


A person who found themselves sympathetic to the kind of hand-wavey feel-good explanation of things in yesterday's Redis thread might find this conclusion kind of snotty:

> I wholeheartedly encourage antirez, myself, and every other distributed systems engineer: keep writing code, building features, solving problems–but please, please, use existing algorithms, or learn how to write a proof.

That person should be sure to note these experimental results:

> These results are catastrophic. In a partition which lasted for roughly 45% of the test, 45% of acknowledged writes were thrown away. To add insult to injury, Redis preserved all the failed writes in place of the successful ones.


Regardless of the fact that a random, meaningless model was tested? This does not appear to be very formal to me.


I would treat this as a great bug report. Aphyr has done a lot of work, showing that a (reasonable to me) formal interpretation of your gist produces catastrophically bad results.

If you're unhappy with the formal interpretation of your gist, publishing the formal interpretation you intended (or, better yet, code) would allow others to build the system you actually intended.


1) The thing that Aphyr tested is not the gist.

2) There was never any plan for it to go inside the Redis implementation. This was just an argument in a mailing list to show that synchronous replication as implemented by WAIT is dependent on the rest of the system.

What bug report are we talking about?


I was saying that the blog post is the best bug report I've ever seen. To extend the analogy: in my mind, the way you're treating it is like closing it as "INVALID" without any comment, which tends to annoy bug reporters :-)

If your argument is #1 (that Aphyr tested the wrong thing), then a reasonable reply would be to provide the model you did intend. If you tested it as well, that would be great, but it is reasonable to require the "bug submitter" to retest.

If your argument is #2 (that WAIT is just best-effort replication, and that it does not provide any guarantees) then that's fine, just say so clearly. But you should then stop disputing the model that Aphyr tested, because to do so implies the existence of a model which does provide guarantees.


Note that antirez's reply (in the comments) begins with "thanks to Aphyr for spending the time to try stuff, but the model he tried here is not what I proposed..."


As a gut check, if you're solving some replication problem and you'd consider using Paxos to solve the problem, be /very/ wary and reason extremely carefully about why your weaker solution will provide the same guarantees. Chances are, it will fail in certain cases of network outage or system failure.


And if you're considering Paxos, PaxosLease[0] is a nice, fast variation that's much less complex to actually implement in a production setting, and it doesn't require coordinated clocks (clocks only need to be consistent, i.e. monotonic, locally).

It's amazing how far you can get with a reliable leader election protocol in a distributed system. We're getting to the point where all of the other stuff is just picking specific algorithms with specific tradeoffs, much like you do today in a non-distributed setting.
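The lease idea itself is simple to sketch (this toy is mine and is not PaxosLease: making the grant fault-tolerant across a set of acceptors is exactly the part PaxosLease solves). Leadership is a time-bounded grant, measured on the grantor's own monotonic clock, so no cross-node clock coordination is needed:

```python
class Lease:
    """A time-bounded leadership grant. Expiry is measured on the
    grantor's own monotonic clock; nodes never compare clocks."""

    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.granted_at = None

    def try_acquire(self, node, now):
        expired = (self.holder is None
                   or now - self.granted_at >= self.duration)
        if expired:
            self.holder, self.granted_at = node, now
        return self.holder == node

lease = Lease(duration=1.0)
assert lease.try_acquire("A", now=0.00) is True   # A becomes leader
assert lease.try_acquire("B", now=0.25) is False  # lease still held by A
assert lease.try_acquire("B", now=1.50) is True   # expired; B takes over
```

In practice the holder must stop acting as leader slightly before its own clock says the lease ends, to absorb bounded clock-rate skew between nodes.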

[0] http://arxiv.org/pdf/1209.4187.pdf


Guys, use Redis for your real time data. Why else would you care about having the benefits of in-memory speed? Jesus. If a partition happened to my redis setup, you know what I'd do? Trash the whole thing and start again.


I'm not personally very familiar with Redis or its HA tooling, but (based solely on reading this article) it seems to suffer from problems (improper handling of non-quorum situations) that have been solved by tools like Pacemaker and Corosync.

Has anyone attempted to use Pacemaker to wrangle Redis instances?



