HA Postgres — repeated Patroni leader elections causing production instability
Anonymous
PRO · OP

23 days ago

Environment: europe-west4-drams3a

Services affected: HA Postgres (Patroni + HAProxy)

Impact: Production API experiencing connection failures and 2s+ response time spikes

What's happening

Our HA Postgres cluster is repeatedly electing new leaders in rapid succession. This is not a clean single failover — HAProxy logs show it routing postgresql_primary connections to three different node IPs within seconds of each other:

Connect from 10.204.94.172 to 10.209.171.245:5432 (postgresql_primary/TCP)

Connect from 10.204.94.172 to 10.253.218.55:5432 (postgresql_primary/TCP)

Connect from 10.204.94.172 to 10.130.153.144:5432 (postgresql_primary/TCP)
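
For flapping like this it is worth confirming that HAProxy is health-checking Patroni's REST API rather than just the Postgres TCP port. A sketch of the canonical Patroni-behind-HAProxy backend (node IPs taken from the logs above; port 8008 is Patroni's default REST API port, and all other values are assumptions, not your actual config):

```
# haproxy.cfg sketch. Patroni answers GET /primary with HTTP 200 only on
# the current leader, so only one server should ever be UP in this backend.
listen postgresql_primary
    bind *:5432
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg1 10.209.171.245:5432 check port 8008
    server pg2 10.253.218.55:5432 check port 8008
    server pg3 10.130.153.144:5432 check port 8008
```

With `shutdown-sessions`, HAProxy kills connections to a demoted node as soon as its health check fails, which is consistent with the age=0s crashes PgBouncer reports below during an election.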

PgBouncer (sitting in front of HAProxy) confirms that these backend connections crash immediately after opening:

2026-02-21 10:44:53 LOG S: railway/postgres@10.209.171.245:5432 closing because: server conn crashed? (age=0s)

2026-02-21 10:44:54 LOG S: railway/postgres@10.253.218.55:5432 closing because: server conn crashed? (age=0s)

2026-02-21 10:44:55 LOG S: railway/postgres@10.209.171.245:5432 closing because: server conn crashed? (age=0s)

And during these windows, all new client connections are rejected:

LOG WARNING: pooler error: server login has been failing, cached error: server conn crashed?

(server_login_retry)
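
The `(server_login_retry)` tag in that warning refers to a PgBouncer tunable: after a backend login fails, PgBouncer caches the error and waits `server_login_retry` seconds before trying again, rejecting clients in the meantime. A sketch of the relevant pgbouncer.ini knobs (values are illustrative assumptions, not our actual config):

```ini
; pgbouncer.ini sketch -- values are assumptions for illustration
[pgbouncer]
; how long to wait before retrying a failed backend login; during that
; window clients see "server login has been failing" (default is 15 s)
server_login_retry = 5
; give up on a backend connect attempt faster so retries cycle sooner
server_connect_timeout = 5
```

Lowering `server_login_retry` shortens the client-facing outage window per election, but it is a mitigation, not a fix for the elections themselves.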

This has been ongoing. Our preDeployCommand migration (which connects directly to HAProxy) took 9 of 10 attempts to succeed today alone.
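
Until the elections stop, the preDeployCommand can at least be wrapped in a retry with exponential backoff so a single election window doesn't fail the deploy. A minimal sketch (the migration command itself is an assumption, substitute your own):

```python
import subprocess
import time

def run_with_retry(cmd, attempts=10, base_delay=2.0):
    """Run cmd, retrying with exponential backoff on a non-zero exit code.

    Returns the attempt number that succeeded; raises if all attempts fail.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # back off so the retry lands after the current election settles
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"{cmd!r} failed after {attempts} attempts")

# e.g. run_with_retry(["npx", "prisma", "migrate", "deploy"])  # assumed command
```

This doesn't address the root cause, but it keeps a 10–20 s leader election from turning into a failed deployment.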

Suspected cause

Based on Patroni's architecture, repeated rapid elections typically indicate etcd cannot write leader heartbeats fast enough — usually due to slow disk I/O on the etcd volume. We noticed warnings about etcd/disk in an earlier session and suspect the etcd backing store is on a degraded or overloaded volume.
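
The timing parameters that govern this are Patroni's DCS settings. A sketch with Patroni's documented defaults (our cluster's actual values are unknown):

```yaml
# Patroni DCS settings sketch -- defaults shown, actual values unknown.
# The leader must renew its etcd lease every loop_wait seconds; if etcd
# writes are slower than retry_timeout, renewal fails, the ttl expires,
# and the cluster holds a new election.
ttl: 30            # leader lease lifetime in etcd, seconds
loop_wait: 10      # seconds between Patroni heartbeat loops
retry_timeout: 10  # timeout for DCS and Postgres operations per loop
# Patroni requires loop_wait + 2 * retry_timeout <= ttl, so raising ttl
# (e.g. to 60) can ride out slow etcd disks at the cost of slower failover.
```

So if the etcd volume really is degraded, either the disk gets fixed or `ttl` has to absorb the latency; the defaults assume healthy fsync times.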

1. Can you check the etcd disk I/O latency for our HA Postgres cluster in europe-west4?

2. Are there any Patroni logs showing why the leader lease is expiring repeatedly?

3. Is this a known issue with the HA Postgres template in this region?
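
On question 1: if etcd's /metrics endpoint is reachable, the disk latency is visible without host access via the `etcd_disk_wal_fsync_duration_seconds` histogram (etcd's own tuning guidance treats sustained p99 fsync above ~10 ms as a disk problem). A sketch that approximates a quantile from the cumulative buckets (the sample data below is made up for illustration):

```python
import re

def histogram_quantile(metrics_text, metric, q=0.99):
    """Approximate quantile q from a Prometheus cumulative histogram.

    Returns the upper bound (seconds) of the bucket containing the quantile.
    """
    pattern = re.compile(
        re.escape(metric) + r'_bucket\{[^}]*le="([^"]+)"[^}]*\}\s+([0-9.eE+]+)'
    )
    buckets = sorted(
        (float("inf") if le == "+Inf" else float(le), float(count))
        for le, count in pattern.findall(metrics_text)
    )
    total = buckets[-1][1]
    target = q * total
    for le, count in buckets:
        if count >= target:
            return le
    return float("inf")

# Made-up scrape: 95% of fsyncs under 10 ms, but a slow tail up to 100 ms.
sample = """\
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 40
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.01"} 95
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.1"} 99
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 100
"""
p99 = histogram_quantile(sample, "etcd_disk_wal_fsync_duration_seconds")
```

In this made-up sample the p99 lands in the 100 ms bucket, which would be well past etcd's comfort zone and consistent with repeated lease expiries.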

Solved

1 Reply

20 days ago

Hi, I've updated the template; it should now handle these networking blips better, without marking a running Postgres as down like you saw. Thank you for reporting!


Status changed to Awaiting User Response (Railway, 20 days ago)


Railway
BOT

13 days ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved (Railway, 13 days ago)
