MongoDB Elections

Note: This article was originally published at Planet PHP on 27 October 2012.

On Monday of this week, Amazon's EC2 service suffered a major outage, which they call "performance issues", which we all know is simply not true.

This is not a post about how Amazon has failed us. Everyone goes down. We use AWS because it's flexible, and we need the flexibility. This is a post about how Gimme Bar went down due to this outage, despite our intentions of making everything resilient to these types of failures. It is a post about how I accidentally misconfigured our MongoDB Replica Set ("RS").

When one of the us-east availability zones died (aside: this was us-east-1c on the Fictive Kin AWS account, but I've learned that the letter is assigned on a per-account basis, so you might have lost 1a, 1e etc.), I knew what was wrong with the RS right away. In talking this over with a few friends, it became clear that the way MongoDB elections take place can be confusing. I'll describe our scenario, and hopefully that will serve as an example of how to not do things. I'll also share how we fixed the problem.

Gimme Bar is powered by a three-node MongoDB replica set: a primary and a secondary, plus a voting-but-zero-priority delayed secondary. The two main nodes are nearly identical, puppeted, and are in different Amazon AWS/EC2 Availability Zones ("AZ"s). The delayed secondary actually runs on one of our web nodes. It serves as a mostly-hot "oops, we totally screwed up the data" failsafe, and is allowed to vote in RS elections, but it is not allowed to become primary, and the clients (API nodes) are configured to not read from it.
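
To make that concrete, here is a minimal sketch (in the mongo shell) of the kind of replica set configuration described above. The hostnames, the set name, and the one-hour delay are hypothetical stand-ins, not Gimme Bar's actual settings.

    // Hypothetical three-member set: two regular members in separate AZs,
    // plus a delayed member on a web node that votes but can never be
    // elected primary and is hidden from client reads.
    rs.initiate({
      _id: "gimmebar",
      members: [
        { _id: 0, host: "db1.example.com:27017" },  // main node, AZ 1
        { _id: 1, host: "db2.example.com:27017" },  // main node, AZ 2
        { _id: 2, host: "web1.example.com:27017",   // delayed secondary on a web node
          priority: 0,       // never eligible to become primary (still gets a vote)
          hidden: true,      // clients will not read from it
          slaveDelay: 3600   // stays an hour behind, as an "oops" failsafe
        }
      ]
    });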

In the past, we did not have the delayed secondary. In fact, at one point, we had three main nodes in the cluster (a primary and two secondaries, all configured for reads (and writes to the primary) by the API nodes).

In order for MongoDB elections to work at all, you need at least three votes. Those votes need to be on separate networks in order for the election to work properly. I'll get back to our specific configuration below, but first, let's look at why you need at least three votes in three locations.

To examine the two-node, two-vote scenario, let's say we have two hypothetical, identical (for practical values of "identical") nodes in the RS, in two separate locations: Castle Black and Winterfell. Now, let's say that there's a network connection failure between these two cities. Because the nodes can't see each other, they each think that the other node is down. This makes both nodes attempt an election, but neither election can succeed because there is not a majority. (A majority is (("number of nodes" ÷ 2) + 1), or in this scenario: 2 nodes.) The election fails, the nodes demote themselves to secondary, and your app goes down (because there's no primary).
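
The majority rule is worth spelling out with a tiny sketch (mongo shell JavaScript, just arithmetic):

    // Votes needed to elect a primary: more than half of the voting members.
    function majorityNeeded(votingMembers) {
      return Math.floor(votingMembers / 2) + 1;
    }

    majorityNeeded(2); // 2 -- a lone surviving node can never reach this
    majorityNeeded(3); // 2 -- any two members that can still see each other can elect a primary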

To solve this problem, you really need a third voting node in a third location: King's Landing. Then, let's say that Castle Black loses network connectivity. This means that King's Landing and Winterfell can both vote, and they do because they have a majority. They come to a consensus and nominate Winterfell (or King's Landing; it doesn't matter) to be Primary, and you stay up. When Castle Black comes back online, it syncs, becomes a secondary, and the subjects rejoice.
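
If you want to see how such an election settled, you can ask any reachable member from the mongo shell; a small sketch (using the shell's standard status helpers, whose output shape can vary a bit between MongoDB versions):

    // Which member is currently primary, according to this node?
    db.isMaster().primary;   // "host:port" of the primary, or absent if there is none

    // Or scan the replica set status for the member in the PRIMARY state.
    rs.status().members
      .filter(function (m) { return m.stateStr === "PRIMARY"; })
      .map(function (m) { return m.name; });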

MongoDB has non-data nodes (called arbiters). These can be helpful if you're only running two MongoDB nodes, and don't want to replicate your data to a third location. Imagine it's really expensive to get data over the wall into King's Landing, but you still want to use it to vote. You could place an arbiter there, and in the scenario above where Castle Black loses connectivity, King's Landing and Winterfell both vote. Since King's Landing can't become primary (it has no data), they both vote for Winterfell, and you stay up. When Castle Black rejoins the continent, it syncs and becomes a secondary, and the subjects rejoice.
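
Setting up an arbiter is a one-liner from the mongo shell; the hostname below is a hypothetical example:

    // Add a voting, data-free member to an existing replica set
    // (run this against the current primary).
    rs.addArb("kings-landing.example.com:27017");

    // The equivalent member entry, if you build the config document yourself:
    // { _id: 2, host: "kings-landing.example.com:27017", arbiterOnly: true }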

So, back to Gimme Bar. In our old configuration, we had three (nearly) identical nodes in three AZs. When one went down, the other two would elect a primary, and our users never noticed.

Truncated by Planet PHP, read more at the original (another 3556 bytes)