March 27, 2022 Postgres Transacton ID Wraparound

Overview

On March 27th at 3:23 UTC (10:23 PM EST) our primary database responsible for all account, server, and player information went to read-only mode. We use Postgres for many of our core features and it is necessary for the website and backend services to operate.

This outage seems to have been caused by an index data corruption issue in Postgres. The data corruption was replicated to our backup servers and our downloadable backups that we would normally use in the event of a database failure.

In order to recover from the corruption we were forced to shut down our database and repair it. It took approximately 3 hours for the database to be repaired.

Once the database was online we decided to restore as much of the BattleMetrics service as we could as quickly as possible. We stopped all query workers and prioritized RCON updates.

At this point we were running with a single Postgres server. Due to the limited capacity one server provides, we were unable to keep up with incoming RCON updates. Larger servers with more frequent updates were especially affected by this delay with some users experiencing up to two hours of delay. On average, we were able to keep the delay below 5 minutes for most servers.

Investigation

Some systems continued to function for a limited period without the ability to write data. We first saw the issue as users failing to load the website. We generate API keys for users that are good for up to 15 minutes. These API keys are used to interact with our API on behalf of the user. While the API key was valid, users were still able to load the website and view information.

As the API tokens expired users began to report issues loading the website. Roughly 30 minutes before these issues our DDOS protection activated due to a small attack. Initially we thought the issue was related to this attack.

We soon started to see RCON updates and queries failing. Going off of the DDOS attack notice, we thought these failures were related. Upon further investigation we quickly discovered that our main database was no longer accepting writes, causing failure.

One of our largest tables in Postgres is used for storing player identifiers. This includes the type of identifier, when it was last seen, and the metadata associated with it. We have an index on the identifier type and the identifier itself that we heavily use for matching players when they join a server.

Background

The cause of this problem is one that is not unique to BattleMetrics. Other large services have suffered this same calamity in the past, including companies like Sentry. See their postmortem here: https://blog.sentry.io/2015/07/23/transaction-id-wraparound-in-postgres. Sentry’s article links to a more in-depth analysis here: https://devcenter.heroku.com/articles/postgresql-concurrency

Fundamentally, Postgres can tolerate only so many transactions (about 2.1 billion) before it cannot accept any more write commands. Postgres mitigates the risk of this problem by “vacuuming” transaction IDs periodically, resetting the value to 0. This is normally not a problem for us. We had autovacuum policies set up and for years they had been working without issue.

As we would later discover, we made a critical mistake by foregoing health checks on the size of the transaction IDs in Postgres. For some unknown reason, a corruption event occurred in the index for our player identifiers table. Our best guess, based on our logs, is that this corruption occurred approximately a month ago (February 26, 2022). This corruption did not impact site performance so it went unnoticed. Unfortunately it had the side effect of causing any vacuum job to error out. If Postgres was a car, this is when the brakes were cut. Without effective monitoring in place on the number of transaction IDs, we had no idea that the database was slowly creeping closer and closer to disaster.

Recovery

After discovering that our Postgres cluster had failed we went straight to the logs and thought we had uncovered the reason. We saw the following error repeated thousands of times:

2022-03-27 03:23:49.291 UTC [291738] battlemetrics@battlemetrics ERROR:  database is not accepting commands to avoid wraparound data loss in database "battlemetrics"
2022-03-27 03:23:49.291 UTC [291738] battlemetrics@battlemetrics HINT:  Stop the postmaster and vacuum that database in single-user mode.
        You might also need to commit or roll back old prepared transactions, or drop stale replication slots.

We started by looking up the error and trying to learn more. After being unable to find if replicas would be affected by this we decided to initiate a manual failover to one of our replicas. Unfortunately this had no effect as the transaction ids that were wanting to wraparound along with the as-yet undiscovered data corruption had been replicated too.

At this point we decided to stop Postgres, follow the instructions in the error message, and hope it finished quickly.

While we waited we continued to look for any other possible remedies and made use of several of our replicas. We currently have 5 Postgres servers with 4 acting as read replicas. We used three servers to try three different approaches to getting online as fast as possible.

The first server is the one that was the primary. We followed directions by stopping the server, entering single--user mode, and starting a full vacuum.
We wiped another server and started the process to restore from backup.
Upon coming across documentation from Google Cloud on how to handle this issue we decided to try following their advice.

Number 3 was the winning option. Contrary to the error message it is not necessary to enter single-user mode to run a full vacuum on a server that has entered wraparound protection. We increased Postgres settings to give as many resources to the vacuum process as possible, restarted the server to apply the changes, and started the vacuum. After about 30 minutes we received an error and the vacuum was canceled.

We received the following error:

failed to re-find parent key in index "player_identifiers_type_identifier_index" for deletion target page 615856

A few searches later and we knew why the autovacuum process had failed. Our index for looking up player identifiers was corrupted and causing vacuums to fail.

We tried to drop the index and were met with the familiar `database is not accepting commands to avoid wraparound data loss in database` error. We shut the database down, entered single-user mode, dropped the corrupted index, and started the database again.

Once the database was running again we started a vacuum again and hoped for the best. This time the vacuum succeeded and the database was again willing to accept writes.

Since we had dropped the player identifier index we needed to rebuild it before we could restore services. With the player identifier index dropped, looking up a player by identifier would require a full table scan. Rebuilding the player identifier index required about two hours.

We knew the single Postgres server wouldn't be able to handle the full load of RCON and queries so we went ahead and shut down our query workers.

Once the index was back in place we pointed our services at the new Postgres primary and BattleMetrics services started to come online.

The Aftermath

Even with queries stopped it quickly became apparent that the single server was not up to the task of handling the reduced load. We saw the CPU, Disk, and network become fully saturated.

We decided to allow things to continue to run, albeit heavily delayed. In our estimation it is better to have a slower service than no service. We monitor P50, P90, and P99 RCON processing delay (the delay from when we receive to when we process it). Our P50 metrics showed a delay of two and a half minutes while P90 and P99 were maxed out at their limit of 5 minutes. We were able to determine from logs that some servers experienced a delay of up to two hours at its worst. The more updates we received from a server the more it was delayed.

After an initial spike in server load things began to stabilize giving us enough resources to bootstrap a new replica. We started the bootstrap process and watched to get an idea of how long it would take. Based on the data transfer rate we estimated it would take about two hours for the first replica to come online. After getting the replica set to come online automatically we decided it was time to get some sleep. At around 10:00 AM UTC (5 AM EST) we left things running and decided to reconvene in two hours when the first replica should be coming online.

The first replica came online at 12:40 UTC (7:40 AM EST) after taking 3 hours and 44 minutes to bootstrap due to the strain on the primary server.

By the time the replica was online and in use we had fallen behind by 660 thousand updates. The single replica was enough to allow us to slowly start catching up. We continued to bootstrap more replicas with each one easing the load on the primary and eventually caught up at 16:30 UTC (11:13 AM EST).

Once we were fully caught up with RCON updates we started queries again. We throttled queries to ensure that RCON updates continued to run without delay. Queries are prioritized based on a server's current population and its rank.

As we processed queries we also caught up on player search indexing and Steam profile lookups. Player search indexing caught up at 16:46 UTC (11:46 AM EST) and Steam lookups at 17:53 UTC (12:53 PM EST).

Finally at 00:42 (7:42 PM EST) all replicas were back online and queries were allowed to run at full speed again.

Next Steps

After struggling with this situation and losing hours of sleep, we identified several things that needed to happen to prevent this sort of issue in the future. They are as follows:

Monitoring errors in Postgres logs;
Monitoring/alerting on transaction id;
Setting up a logical replica in addition to our physical replication; and
Implementing a faster network to speed up recovery from failure.

We have already accomplished implementing health checks on transaction IDs. The remaining solutions will be implemented in the coming days.

References

We found the following documents/articles useful. If you ever find yourself in a similar situation or think you might be vulnerable to the same problem, these are worth your time: