2 months ago
So, we just migrated a Rails app from Heroku and are initially using a hosted Postgres database on Crunchydata. There, we've set up pgBouncer to handle persistent connections, which helped bring the initial latency down from 600ms to 80ms.
Even after testing with a Railway-hosted Postgres service, the latency is still high, and often even worse: 160ms.
I've now seen several threads asking about this, so it's apparently a known issue, but I don't know if there's a solution yet or what the root cause could be.
This is a freshly spun up cluster, and I'm assuming we're on the Railway Metal servers. But this is drastically worse than I was expecting, and it didn't come up on my radar at all when I was considering hosting alternatives. Is there any way to debug or improve this?
18 Replies
2 months ago
Two things to verify that are the most common causes of this. First, make sure your Rails app is connecting to Postgres/PgBouncer using the private networking hostname (e.g. PgBouncer.railway.internal) rather than the public TCP proxy URL. Public connections route through the proxy and add significant latency. Second, confirm all your services (web, PgBouncer, Postgres) are deployed to the same region - you can check this in each service's settings. If your app and database are in different regions, every query incurs cross-region latency. With private networking in the same region, you should see sub-5ms round trips to the database.
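A quick way to run the first check from a Rails console is to look at the hostname in the connection URL. This is only a minimal sketch: the helper name `private_network_url?` and both example URLs are made up for illustration, and it assumes private hostnames end in `.railway.internal` as in the example above.

```ruby
require "uri"

# Hypothetical helper: true when a connection URL points at a private
# networking hostname (assumed here to end in ".railway.internal").
def private_network_url?(url)
  host = URI.parse(url).host
  !host.nil? && host.downcase.end_with?(".railway.internal")
end

# Both URLs below are invented examples, not real services.
puts private_network_url?("postgres://app:secret@pgbouncer.railway.internal:6432/app")
puts private_network_url?("postgres://app:secret@db.example.com:5432/app")
```

In the app itself you would pass `ENV["DATABASE_URL"]` instead of a literal string.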
Status changed to Awaiting User Response Railway • 2 months ago
2 months ago
Even with the internal networking url, it's still a bit slow.
```
root@3bba950466e6:/app# time psql $DATABASE_URL -c "SELECT 1;"
real 0m0.238s
user 0m0.036s
sys 0m0.013s
```
I think I will try moving our production database inside our Railway cluster, but this is part of a larger issue with our Heroku migration: response times were 40ms at the 90th percentile and are currently 1000ms+ on Railway+Crunchydata, with no difference in application code. I'll report back when the database is migrated.
Status changed to Awaiting Railway Response Railway • 2 months ago
Status changed to Awaiting User Response nico • 2 months ago
2 months ago
So I've come across a lot of really worrying threads like this: https://station.railway.com/questions/postgres-is-slow-8aaee381
Is there any way my thread can get escalated and possibly moved off of Metal? It's insane to me that there are so many issues with slow IO, and I wish I had done more research before fully migrating our production environment over.
Our application has basically had constant request timeouts due to slow responses since we flipped the switch, and I'm honestly a little flabbergasted.
Status changed to Awaiting Railway Response Railway • 2 months ago
2 months ago
Have just completed the database migration to use Railway's internal postgres. Have a few benchmarks.
```
# measurement of basic roundtrip database call latency
root@b5a88dcea062:/app# time psql "$DATABASE_URL?sslmode=disable" -c "SELECT 1;"
real 0m0.087s
user 0m0.050s
sys 0m0.004s

# measurement of internal railway network latency
root@b5a88dcea062:/app# time bash -c "echo > /dev/tcp/postgres-jcuy.railway.internal/5432"
real 0m0.026s
user 0m0.000s
sys 0m0.004s
```
I'm kind of at my wits' end. Railway's internal metrics are showing 1s+ response times at P50, and New Relic is showing 1000ms response times for Postgres.
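One caveat with `time psql ... -c "SELECT 1;"` is that it times a whole process: startup, name resolution, connect, authentication, and the query together. A rough way to separate the one-off connect cost from the per-query round trip is to time repeated queries over a single open connection. The sketch below is an assumption-laden illustration, not a definitive benchmark: `time_ms`, `percentile`, and `sample_latency` are hypothetical helper names, and the commented usage assumes the `pg` gem is installed.

```ruby
# Time a block in milliseconds using a monotonic clock.
def time_ms
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000.0
end

# Nearest-rank percentile over a list of samples.
def percentile(samples, pct)
  sorted = samples.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Run the block `runs` times and summarize the timings.
def sample_latency(runs: 50)
  samples = Array.new(runs) { time_ms { yield } }
  { p50: percentile(samples, 50), p90: percentile(samples, 90), max: samples.max }
end

# Usage sketch (assumes the `pg` gem and a reachable DATABASE_URL):
#   require "pg"
#   conn = nil
#   connect_ms = time_ms { conn = PG.connect(ENV["DATABASE_URL"]) }
#   stats = sample_latency { conn.exec("SELECT 1") }
#   puts "connect: #{connect_ms.round(1)}ms, per-query: #{stats}"
```

If the per-query numbers are small while the connect time is large, the latency is probably concentrated in connection setup rather than in every query.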
2 months ago
I'm not sure what's going on with internal latency, but even a memcache server in the same region is returning terrible latency:
This should be on the order of microseconds but it's giving me 127ms.
```
root@c2027fda3ec6:/app# time bash -c "echo > /dev/tcp/memcached.railway.internal/11211"
real 0m0.127s
user 0m0.003s
sys 0m0.000s
root@c2027fda3ec6:/app# echo $MEMCACHE_PRIVATE_SERVER
memcached.railway.internal:11211
```
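Note that the `/dev/tcp` test lumps together bash startup, the name lookup, and the TCP handshake. To see which part dominates, the two network steps can be timed separately. This is a minimal sketch using only Ruby's standard `socket` library; `probe_tcp` and `monotonic_ms` are hypothetical helper names, and the hostname in the usage comment is the one from this thread.

```ruby
require "socket"

def monotonic_ms
  Process.clock_gettime(Process::CLOCK_MONOTONIC) * 1000.0
end

# Hypothetical helper: time name resolution and the TCP handshake separately.
def probe_tcp(host, port, timeout: 5)
  t0 = monotonic_ms
  addr = Addrinfo.getaddrinfo(host, port, nil, :STREAM).first
  t1 = monotonic_ms
  # Connect to the already-resolved IP so this step excludes the DNS lookup.
  Socket.tcp(addr.ip_address, addr.ip_port, connect_timeout: timeout) { |_sock| }
  t2 = monotonic_ms
  { ip: addr.ip_address, dns_ms: t1 - t0, connect_ms: t2 - t1 }
end

# Usage sketch:
#   p probe_tcp("memcached.railway.internal", 11211)
```

Running it a few times in a row also shows whether the first lookup is slower than later ones, which would hint at resolver caching.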
2 months ago
```
root@d5563b02a74e:/app# psql $DATABASE_URL -c "
SELECT count(*) as total_queries,
       round(sum(total_exec_time)::numeric, 2) as total_ms,
       round(avg(mean_exec_time)::numeric, 2) as avg_query_ms
FROM pg_stat_statements
WHERE calls > 5;"
perl: warning: Falling back to the standard locale ("C").
 total_queries | total_ms | avg_query_ms
---------------+----------+--------------
           256 | 49387.07 |         3.49
```
Just showing that 256 PG queries are averaging 3.49ms, so PG itself is fast. To me, this confirms that it's a networking issue.
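One thing to keep in mind when reading that number: `avg(mean_exec_time)` gives every distinct statement equal weight, regardless of how often it runs. A call-weighted average (in SQL terms, `sum(total_exec_time) / sum(calls)`) can differ from it. The sketch below shows the difference on invented rows; the values are not from this database and the helper names are hypothetical.

```ruby
# Each row mirrors two pg_stat_statements columns; the values are invented.
rows = [
  { calls: 10_000, mean_exec_time: 0.5 },  # a cheap, very frequent statement
  { calls: 10,     mean_exec_time: 40.0 }, # a slow, rare statement
]

# Average of per-statement means: every statement counts once.
def unweighted_mean_ms(rows)
  rows.sum { |r| r[:mean_exec_time] } / rows.length
end

# Average over all executions: every call counts once.
def weighted_mean_ms(rows)
  total_ms = rows.sum { |r| r[:calls] * r[:mean_exec_time] }
  total_ms / rows.sum { |r| r[:calls] }
end

puts "unweighted: #{unweighted_mean_ms(rows).round(2)}ms"
puts "weighted:   #{weighted_mean_ms(rows).round(2)}ms"
```

Either way, both figures only cover execution time inside Postgres, so they say nothing about the network time in between.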
2 months ago
Hey Railway folks, hoping to get a response on this soon, or at least some guidance.
2 months ago
Hey Denny, sorry for the slow response. Your debugging is very helpful. Could you please do me a favor and try redeploying your services? This could place them on a fresh host, which if it resolves the latency would mean it was likely caused by contention on the current one. Let us know if the numbers improve after that.
Status changed to Awaiting User Response Railway • 2 months ago
2 months ago
What do you mean by redeploy? I've deployed several times and removed replicas so that there is only one in one region. I've chased down quite a few gremlins and things have stabilized (at least not timing out), but there is still now an issue of request queuing which I haven't been able to identify the source of:
Attachments
Status changed to Awaiting Railway Response Railway • 2 months ago
2 months ago
Hey Denny, thanks for the detailed benchmarks, super helpful!
A few things that would help us narrow this down:
1. What region are your services deployed in?
2. Can you run these from inside your app container?
dig postgres-jcuy.railway.internal
dig memcached.railway.internal
Want to see how long DNS resolution takes specifically, if that's where the latency is hiding.
3. How many DB queries does a typical request make? With Rails, if you're making 10-20 queries per request and each has DNS overhead, that would explain the 1s+ P50 you're seeing in New Relic.
4. Are you using connection pooling (pgBouncer) with the Railway Postgres, or connecting directly? Persistent connections would bypass repeated DNS lookups.
The request queuing in your New Relic chart also stands out; it could mean your Puma/Unicorn workers are saturated waiting on I/O.
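To make the arithmetic behind point 3 concrete, here is a rough back-of-the-envelope model of how per-query overhead adds up across a request. It is only a sketch: `request_overhead_ms` is a hypothetical helper, and every number in the example is a placeholder rather than a measurement from this thread.

```ruby
# Rough model: total network overhead added to one web request.
# persistent: true  -> each query pays only the round trip.
# persistent: false -> each query also pays a DNS lookup and a TCP connect.
def request_overhead_ms(queries:, rtt_ms:, dns_ms: 0.0, connect_ms: 0.0, persistent: true)
  per_query = rtt_ms + (persistent ? 0.0 : dns_ms + connect_ms)
  queries * per_query
end

# Placeholder inputs: 20 queries, 5ms round trip, 15ms DNS, 10ms connect.
fresh  = request_overhead_ms(queries: 20, rtt_ms: 5, dns_ms: 15, connect_ms: 10, persistent: false)
pooled = request_overhead_ms(queries: 20, rtt_ms: 5, dns_ms: 15, connect_ms: 10, persistent: true)
puts "new connection per query: #{fresh}ms"
puts "persistent connection:    #{pooled}ms"
```

The model ignores query execution time and any parallelism, so it is only useful for judging whether connection setup could plausibly account for the gap.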
Status changed to Awaiting User Response Railway • about 2 months ago
2 months ago
```
root@4e35ed7ccd3e:/app# dig postgres-jcuy.railway.internal
; <<>> DiG 9.20.18-1~deb13u1-Debian <<>> postgres-jcuy.railway.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33623
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 8a61dd3d5df6b482 (echoed)
;; QUESTION SECTION:
;postgres-jcuy.railway.internal. IN A
;; ANSWER SECTION:
postgres-jcuy.railway.internal. 10 IN A 10.176.202.17
;; Query time: 15 msec
;; SERVER: fd12::10#53(fd12::10) (UDP)
;; WHEN: Tue Mar 24 07:40:54 UTC 2026
;; MSG SIZE rcvd: 117
=================
root@4e35ed7ccd3e:/app# dig memcached.railway.internal
; <<>> DiG 9.20.18-1~deb13u1-Debian <<>> memcached.railway.internal
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49076
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 2e3f8205b91c3c2a (echoed)
;; QUESTION SECTION:
;memcached.railway.internal. IN A
;; ANSWER SECTION:
memcached.railway.internal. 10 IN A 10.209.186.217
;; Query time: 19 msec
;; SERVER: fd12::10#53(fd12::10) (UDP)
;; WHEN: Tue Mar 24 07:41:06 UTC 2026
;; MSG SIZE rcvd: 109
```
This is a little better than before but still higher than I would have expected. All services are hosted in US West.
Status changed to Awaiting Railway Response Railway • about 2 months ago
2 months ago
Even trying to hit my Elasticsearch service is very slow:
```
root@5e9b20a390de:/app# rails runner "
require 'benchmark'
require 'typhoeus'
time = Benchmark.ms {
response = Typhoeus.post(
body: '{\"query\":{\"match\":{\"title\":\"cancer\"}}}',
headers: { 'Content-Type' => 'application/json' },
ssl_verifypeer: false
)
puts 'Status: ' + response.code.to_s
puts 'ES took: ' + (JSON.parse(response.body)['took'].to_s + 'ms') rescue nil
}
puts 'Total Typhoeus time: ' + time.round(2).to_s + 'ms'
"
Status: 200
ES took: 3ms
Total Typhoeus time: 42.27ms
```
2 months ago
Just confirming that I've tested the stack in Digital Ocean and Heroku and no other platform seems to have this request queuing issue that Railway has. I'll continue testing on Render or Fly, basically until I find a solution I'm happy with.
2 months ago
Are you perhaps doing a DNS lookup for every database query? That would add a fair amount of time to each query because of the extra DNS lookup.
Status changed to Awaiting User Response Railway • about 2 months ago
2 months ago
I am not doing anything other than adding 1 new replica in a different region. You tell me what's happening between the replicas.
I'm kind of astounded at the lack of appropriate support response. Is this not the place to be creating support tickets for the Pro plan?
Status changed to Awaiting Railway Response Railway • about 2 months ago
2 months ago
Just wanted to post this as well, confirming that Render matches Digital Ocean and Heroku performance with an app baseline of 30ms-60ms:
Railway is 5x-10x slower due to 150ms of request queuing time that I'm 90% certain is down to network misconfiguration. Overall this is pretty frustrating, because I wish I had done this testing before migrating a production app over to Railway. I do like a lot of what Railway has to offer with its philosophy and design decisions. I really do hope someone can help me figure out what's going so wrong, but it's looking like Railway might not be for us.
Attachments
Status changed to Solved dennyluan • about 2 months ago