Jepsen – Beat the hell out of all distributed systems

Recently I have discovered an amazing tool called Jepsen, developed by Kyle Kingsbury, an engineer located in San Francisco. Jepsen is designed to evaluate a broad category of distributed systems in case of correctness and consistency of data, in mayhem situations such as network partitions and also when group of nodes are going down all at once. A lot of distrbuted systems such as Elastic Search, MongoDB, RethinkDB and etc, claim to keep consistency of data and correctness of operations during disastrous situations. Unfortunately, there is a huge gap between what they claim and what the reality is and it’s due to the lack of information about client’s user cases. There are few cloud service providers publishing their system logs annualy or quarterly, in order to help developers analyze the system’s behaviour in real world scenarios. On the other side, there are communities like RethinkDB, which  they have a slack channel, introducing a close contact between the users and the developers. This helps the developer to detect scenarios where the systems is not behaving the way it supposed to. This can help them fix the bugs and their design decisions right away.

Beside all discussions above, there is one amazing tool available on github, called Jepsen. Jepsen provides a full automated environment which you can setup a cluster of nodes, and the software middleware being deployed, run and tested automatically. After that, Jepsen executes queries and in the mean time also turn up or down the nodes, to force the system turning into recovery mode. As a result, the system would be forced to execute queries and fix the partition in the same time, which is really hard. Making right decisions regarding the query execution may be sensitive, since it could lead into data inconsistency after the recovery.

Some of the famous distributed systems that Jepsen supports are listed as:

  • Aerospike
  • Elastic Search
  • etcd ( This is only distributed log protocol, like ZooKeeper )
  • MongoDB
  • MySQL – Cluster
  • RabbitMQ
  • RethinkDB
  • ZooKeeper

I myself, were working on the RethinkDB section. This happened while I’m working on the current startup, and we have chosen the RethinkDB as the target system for analysis. RethinkDB is claimed to be the right solutions for realtime web applications. Systems like MySQL requires the user to send queries into the system, in order to get realtime update about the status of a specific set of data they are interested in. RethinkDB provides a real-time push notification model, where the user can register to get updated for a specific set of information. If data being updated at some point by some other user, all subscribed users gonna be informed.

RethinkDB is designed to be distributed. In summary, it has the same model as KV store systems, where data is being sharded and  distributed among multiple nodes. In order to provide consistency and correctness of data, RethinkDB is using RAFT as their core distributed protocol. Raft is a new protocol designed by a PhD student in Berkley, which received a great attention from both the industry and academia. We were aiming to first analyze how much raft is sophisticated compared to the rivals, such as ZAB and Paxos (Which has already been done by Kyle Kingsbury). It was really interesting that in all other experiments, Kyle was completely thrashing the systems. But in RethinkDB, the scenario was completely different. RethinkDB was acting much better in disastrous situations, and that was really interesting how Raft is working amazingly. Our aim was to show, RethinkDB integrated with our solutions is working faster and in more consistent fashion.

I would recommend Jepsen to whoever wants to make comparisons between different distributed storage systems, watching how well they behave in a real deployment. All you need to do is checking out the source code of Jepsen from github and then cd into the target data store directory, and then do lein test, and you’ll have everything set up. Remember, since Jepsen is accessing target nodes through ssh and with root access, you need to set up password-less ssh from the client machine to all nodes in the cluster. You may also need to (1) modify the sshd config file, in order to enable rootPermission and also setup dsa private key instead of the rsa key. Jepsen has a specific capability which even makes it easier for you guys to setup and start a cluster of nodes. Right now, I myself creating virtual machines and then run the Jepsen test. Jepsen already has Docker integration, which does all the required steps in order to deploy all worker nodes and start the test. Unfortunately, RethinkDB has not yet been integrated into this automated scenario. One of my plans is to integrate Jepsen RethinkDB with Jepsen Docker mechanism, so it would be much easier for us to test our technology over and over again.

I have done so many performance and consistency tests using the Jepsen RethinkDB module, and I’ll be more than happy to help anyone, understanding how Jepsen works, specifically in RethinkDB.