Postgres high latency issue
dennyluan
HOBBYOP

2 months ago

So, we just migrated a Rails app from Heroku and are initially using a hosted Postgres database on Crunchydata. There, we've set up pgBouncer to handle persistent connections, which helped bring the initial latency down from 600ms to 80ms.

Even after testing with a Railway-hosted Postgres service, the latency there is still high, and often even worse: around 160ms.

I've now seen several threads asking about this, so it appears to be a known issue, but I don't know whether there's a solution yet or what the root cause could be.

This is a freshly spun-up cluster, and I'm assuming we're on the Railway Metal servers. But this is drastically worse than I was expecting, and it didn't come up on my radar at all when I was considering hosting alternatives. Is there any way to debug or improve this?

Solved

18 Replies

sam-a
EMPLOYEE

2 months ago

Two things to verify that are the most common causes of this. First, make sure your Rails app is connecting to Postgres/PgBouncer using the private networking hostname (e.g. PgBouncer.railway.internal) rather than the public TCP proxy URL. Public connections route through the proxy and add significant latency. Second, confirm all your services (web, PgBouncer, Postgres) are deployed to the same region - you can check this in each service's settings. If your app and database are in different regions, every query incurs cross-region latency. With private networking in the same region, you should see sub-5ms round trips to the database.
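As a quick way to check the first point, here is a minimal sketch that tests whether a connection URL points at a private hostname. It assumes private hostnames end in `.railway.internal`, as the hostnames quoted in this thread do. The helper name `private_host?` and both sample URLs are hypothetical.

```ruby
require "uri"

# Assumption: private hostnames end in ".railway.internal", matching the
# hostnames quoted in this thread.
PRIVATE_SUFFIX = ".railway.internal"

# Returns true when the URL's host looks like a private networking hostname.
def private_host?(database_url)
  host = URI.parse(database_url).host
  !host.nil? && host.downcase.end_with?(PRIVATE_SUFFIX)
end

# Hypothetical example URLs; neither is taken from the thread.
puts private_host?("postgres://app:secret@pgbouncer.railway.internal:6432/app")
puts private_host?("postgres://app:secret@db.example.com:5432/app")
```

In a Rails console, the same check could be run against `ENV["DATABASE_URL"]` to confirm which hostname the app actually uses.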


Status changed to Awaiting User Response Railway 2 months ago


dennyluan
HOBBYOP

2 months ago

Even with the internal networking url, it's still a bit slow.

```
root@3bba950466e6:/app# time psql $DATABASE_URL -c "SELECT 1;"

real 0m0.238s
user 0m0.036s
sys 0m0.013s
```

I think I will try moving our production database inside our Railway cluster, but this is part of a larger issue with our Heroku migration: response times were 40ms at the 90th percentile on Heroku and are currently 1000ms+ on Railway+Crunchydata, with no difference in application code. I'll report back when the database is migrated.


Status changed to Awaiting Railway Response Railway 2 months ago


Status changed to Awaiting User Response nico 2 months ago


dennyluan
HOBBYOP

2 months ago

So I've come across a lot of really worrying threads like this: https://station.railway.com/questions/postgres-is-slow-8aaee381

Is there any way my thread can get escalated and possibly moved off of Metal? It's insane to me that there are so many issues with slow IO, and I wish I had done more research before fully migrating our production environment over.

Our application has basically had constant request timeouts due to slow responses since we've flipped the switch, and I'm honestly a little flabbergasted.


Status changed to Awaiting Railway Response Railway 2 months ago


dennyluan
HOBBYOP

2 months ago

Have just completed the database migration to use Railway's internal postgres. Have a few benchmarks.

# measurement of basic roundtrip database call latency

root@b5a88dcea062:/app# time psql "$DATABASE_URL?sslmode=disable" -c "SELECT 1;"

real 0m0.087s

user 0m0.050s

sys 0m0.004s

# measurement of internal railway network latency

root@b5a88dcea062:/app# time bash -c "echo > /dev/tcp/postgres-jcuy.railway.internal/5432"

real 0m0.026s

user 0m0.000s

sys 0m0.004s

I'm kind of at my wits' end. Railway's internal metrics are showing 1s+ response times at P50, and New Relic is showing 1000ms response times for Postgres.


dennyluan
HOBBYOP

2 months ago

I'm not sure what's going on with internal latency, but even a memcache server in the same region is returning terrible latency:

This should be on the order of microseconds, but it's giving me 127ms.

root@c2027fda3ec6:/app# time bash -c "echo > /dev/tcp/memcached.railway.internal/11211"

real 0m0.127s

user 0m0.003s

sys 0m0.000s

root@c2027fda3ec6:/app# echo $MEMCACHE_PRIVATE_SERVER

memcached.railway.internal:11211
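The `time bash -c "echo > /dev/tcp/..."` test above lumps shell startup, DNS resolution, and the TCP handshake into one number. A minimal sketch that times the last two separately, using only the Ruby standard library, might look like the following. The helper names `elapsed_ms` and `connect_breakdown` are hypothetical, and this is a rough probe rather than a rigorous benchmark.

```ruby
require "socket"

# Runs the block and returns the elapsed wall-clock time in milliseconds.
def elapsed_ms
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000.0
end

# Times name resolution and the TCP handshake as two separate phases.
def connect_breakdown(host, port)
  addrinfo = nil
  # Phase 1: resolve the name through the system resolver.
  dns_ms = elapsed_ms { addrinfo = Addrinfo.getaddrinfo(host, port, nil, :STREAM).first }
  # Phase 2: connect to the resolved IP, so no second lookup is included.
  tcp_ms = elapsed_ms { Socket.tcp(addrinfo.ip_address, port, connect_timeout: 5) { |_sock| } }
  { address: addrinfo.ip_address, dns_ms: dns_ms, tcp_ms: tcp_ms }
end

# Only runs when a host and port are passed on the command line.
if ARGV.length == 2
  result = connect_breakdown(ARGV[0], Integer(ARGV[1]))
  puts format("%s dns=%.2fms tcp=%.2fms", result[:address], result[:dns_ms], result[:tcp_ms])
end
```

Saved under a hypothetical name such as `probe.rb`, it could be run as `ruby probe.rb memcached.railway.internal 11211`. Running it several times in a row would also show whether the first lookup is slower than later ones.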


dennyluan
HOBBYOP

2 months ago

network logs show 69ms for colocated postgres?

Attachments


dennyluan
HOBBYOP

2 months ago

root@d5563b02a74e:/app# psql $DATABASE_URL -c "

SELECT count(*) as total_queries,

round(sum(total_exec_time)::numeric, 2) as total_ms,

round(avg(mean_exec_time)::numeric, 2) as avg_query_ms

FROM pg_stat_statements

WHERE calls > 5;"

perl: warning: Falling back to the standard locale ("C").

total_queries | total_ms | avg_query_ms

---------------+----------+--------------

256 | 49387.07 | 3.49

Just showing that the 256 distinct statements with more than 5 calls are averaging 3.49ms, so PG itself is fast. To me, this confirms that it's a networking issue.
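One caveat on that number: `avg(mean_exec_time)` averages the per-statement means, so a statement called ten times counts as much as one called a thousand times. A call-weighted mean, `sum(total_exec_time) / sum(calls)`, can differ noticeably. The sketch below shows the difference on two made-up rows. The struct, the helper names, and the data are all hypothetical and not taken from this database.

```ruby
# One row of pg_stat_statements, reduced to the two columns used here.
StatementStat = Struct.new(:calls, :total_exec_time)

# Mirrors avg(mean_exec_time): every statement counts equally.
def unweighted_mean_ms(stats)
  stats.sum { |s| s.total_exec_time / s.calls } / stats.size
end

# Mirrors sum(total_exec_time) / sum(calls): every call counts equally.
def weighted_mean_ms(stats)
  stats.sum(&:total_exec_time) / stats.sum(&:calls).to_f
end

# Hypothetical data: one hot, fast statement and one rare, slow statement.
stats = [
  StatementStat.new(1000, 1000.0), # 1.0 ms per call
  StatementStat.new(10, 500.0)     # 50.0 ms per call
]

puts format("unweighted: %.2f ms", unweighted_mean_ms(stats))
puts format("weighted:   %.2f ms", weighted_mean_ms(stats))
```

Either way, both figures describe server-side execution time only, so they do not include network round trips.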


dennyluan
HOBBYOP

2 months ago

Hey Railway folks, hoping to get a response on this soon, or at least some guidance.


chandrika
EMPLOYEE

2 months ago

Hey Denny, sorry for the slow response. Your debugging is very helpful. Could you please do me a favor and try redeploying your services? This could place them on a fresh host, which if it resolves the latency would mean it was likely caused by contention on the current one. Let us know if the numbers improve after that.


Status changed to Awaiting User Response Railway 2 months ago


dennyluan
HOBBYOP

2 months ago

What do you mean by redeploy? I've deployed several times and removed replicas so that there is only one in one region. I've chased down quite a few gremlins and things have stabilized (at least not timing out), but there is still now an issue of request queuing which I haven't been able to identify the source of:

Attachments


Status changed to Awaiting Railway Response Railway 2 months ago


chandrika
EMPLOYEE

2 months ago

Hey Denny, thanks for the detailed benchmarks, super helpful!

A few things that would help us narrow this down:

1. What region are your services deployed in?

2. Can you run these from inside your app container?

dig postgres-jcuy.railway.internal

dig memcached.railway.internal

Want to see how long DNS resolution takes specifically, if that's where the latency is hiding.

3. How many DB queries does a typical request make? With Rails, if you're making 10-20 queries per request and each has DNS overhead, that would explain the 1s+ P50 you're seeing in New Relic.

4. Are you using connection pooling (pgBouncer) with the Railway Postgres, or connecting directly? Persistent connections would bypass repeated DNS lookups.

The request queuing in your New Relic chart also stands out. It could mean your Puma/Unicorn workers are saturated waiting on I/O.


Status changed to Awaiting User Response Railway about 2 months ago


dennyluan
HOBBYOP

2 months ago

root@4e35ed7ccd3e:/app# dig postgres-jcuy.railway.internal

; <<>> DiG 9.20.18-1~deb13u1-Debian <<>> postgres-jcuy.railway.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33623
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 8a61dd3d5df6b482 (echoed)
;; QUESTION SECTION:
;postgres-jcuy.railway.internal.  IN  A

;; ANSWER SECTION:
postgres-jcuy.railway.internal. 10 IN A 10.176.202.17

;; Query time: 15 msec
;; SERVER: fd12::10#53(fd12::10) (UDP)
;; WHEN: Tue Mar 24 07:40:54 UTC 2026
;; MSG SIZE  rcvd: 117

=================

root@4e35ed7ccd3e:/app# dig memcached.railway.internal

; <<>> DiG 9.20.18-1~deb13u1-Debian <<>> memcached.railway.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49076
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 2e3f8205b91c3c2a (echoed)
;; QUESTION SECTION:
;memcached.railway.internal.  IN  A

;; ANSWER SECTION:
memcached.railway.internal. 10  IN  A 10.209.186.217

;; Query time: 19 msec
;; SERVER: fd12::10#53(fd12::10) (UDP)
;; WHEN: Tue Mar 24 07:41:06 UTC 2026
;; MSG SIZE  rcvd: 109

This is a little better than before but still higher than I would have expected. All services are hosted in US West.
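To put the numbers in this thread side by side, here is a rough back-of-the-envelope model of the hypothesis raised earlier: that per-query connection setup dominates request time. It is a sketch under stated assumptions, not a measurement. It uses 20 queries per request (the upper end of the 10-20 range mentioned above), the 3.49ms average execution time from `pg_stat_statements`, and the 26ms `/dev/tcp` result as a stand-in for DNS plus TCP setup. It ignores TLS, authentication, and result transfer, and the helper name is hypothetical.

```ruby
# Estimates database time per request in milliseconds.
# setup_ms is the per-connection cost (DNS + TCP); it is paid on every query
# unless connections are persistent, in which case it is treated as amortized.
def db_time_per_request_ms(queries:, query_ms:, setup_ms:, persistent:)
  per_query_setup = persistent ? 0.0 : setup_ms
  queries * (query_ms + per_query_setup)
end

# Inputs taken from figures quoted in this thread; see the caveats above.
inputs = { queries: 20, query_ms: 3.49, setup_ms: 26.0 }

fresh = db_time_per_request_ms(**inputs, persistent: false)
pooled = db_time_per_request_ms(**inputs, persistent: true)

puts format("new connection per query: %.1f ms", fresh)
puts format("persistent connections:   %.1f ms", pooled)
```

If the app already holds persistent connections, as a default Rails connection pool normally does, the second figure is the relevant one, and the remaining time would have to come from somewhere else, such as per-query round-trip latency or request queuing.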


Status changed to Awaiting Railway Response Railway about 2 months ago


dennyluan
HOBBYOP

2 months ago

Even trying to hit my Elasticsearch service is very slow:

```
root@5e9b20a390de:/app# rails runner "
require 'benchmark'
require 'typhoeus'

# Time the whole HTTP round trip, including connection setup and TLS.
time = Benchmark.ms {
  response = Typhoeus.post(
    \"https://#{ENV['ELASTIC_USERNAME']}:#{ENV['ELASTIC_PASSWORD']}@#{ENV['ELASTICSEARCH_HOST']}/projects_production_20260317072829082/_search\",
    body: '{\"query\":{\"match\":{\"title\":\"cancer\"}}}',
    headers: { 'Content-Type' => 'application/json' },
    ssl_verifypeer: false
  )
  puts 'Status: ' + response.code.to_s
  # 'took' is the search time reported by Elasticsearch itself.
  puts 'ES took: ' + (JSON.parse(response.body)['took'].to_s + 'ms') rescue nil
}
puts 'Total Typhoeus time: ' + time.round(2).to_s + 'ms'
"

Status: 200
ES took: 3ms
Total Typhoeus time: 42.27ms
```
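The gap between the 3ms Elasticsearch reports and the 42.27ms total could be split further by phase. libcurl records cumulative timings from the start of the request, and Typhoeus should expose them on the response as `namelookup_time`, `connect_time`, `appconnect_time`, `starttransfer_time`, and `total_time`. Those accessor names are an assumption to verify against the installed Typhoeus version. The sketch below turns such cumulative values (in seconds) into per-phase durations. The helper name and the sample numbers are hypothetical.

```ruby
# Converts cumulative libcurl-style timings (seconds) into per-phase
# durations (milliseconds). Assumes an HTTPS request, so appconnect is set.
def phase_breakdown_ms(namelookup:, connect:, appconnect:, starttransfer:, total:)
  to_ms = ->(seconds) { (seconds * 1000.0).round(2) }
  {
    dns: to_ms.call(namelookup),
    tcp_connect: to_ms.call(connect - namelookup),
    tls_handshake: to_ms.call(appconnect - connect),
    wait_for_first_byte: to_ms.call(starttransfer - appconnect),
    body_transfer: to_ms.call(total - starttransfer)
  }
end

# Hypothetical sample values, not measurements from this thread.
sample = phase_breakdown_ms(
  namelookup: 0.015, connect: 0.020, appconnect: 0.035,
  starttransfer: 0.040, total: 0.042
)
sample.each { |phase, ms| puts format("%-20s %6.2f ms", phase, ms) }
```

If most of the time lands in the DNS, TCP, or TLS phases, reusing connections would help more than anything on the Elasticsearch side.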


dennyluan
HOBBYOP

2 months ago

Just confirming that I've tested the stack in Digital Ocean and Heroku and no other platform seems to have this request queuing issue that Railway has. I'll continue testing on Render or Fly, basically until I find a solution I'm happy with.


dennyluan
HOBBYOP

2 months ago

This is what happens when I add one replica in a different region, so 1 replica in US West and 1 in Singapore. It looks like the load balancer on Railway's infrastructure isn't working?


2 months ago

Are you perhaps doing a DNS lookup for every database query? That is going to add a fair amount of time to each query due to the additional DNS lookup time.


Status changed to Awaiting User Response Railway about 2 months ago


dennyluan
HOBBYOP

2 months ago

I am not doing anything other than adding 1 new replica in a different region. You tell me what's happening between the replicas.

I'm kind of astounded at the lack of appropriate support response. Is this not the place to be creating support tickets for the Pro plan?


Status changed to Awaiting Railway Response Railway about 2 months ago


dennyluan
HOBBYOP

2 months ago

Just wanted to post this as well, confirming that Render matches Digital Ocean and Heroku performance with an app baseline of 30ms-60ms:

Railway is 5x-10x slower due to 150ms of request queuing time that I'm 90% certain is down to network misconfiguration. Overall this is pretty frustrating, because I wish I had done this testing before migrating a production app over to Railway. I do like a lot of what Railway has to offer with the philosophy and design decisions. I really do hope someone can help me figure out what's going so wrong, but it's looking like Railway might not be for us.

Attachments


Status changed to Solved dennyluan about 2 months ago

