4 months ago
I'm currently testing Railway as a possible host for my existing app that has both a high write and read volume.
I made a copy of my production database setup and the results are VERY worrisome. I ran VACUUM ANALYZE on the whole DB right after the transfer.
Right now I'm sending 0% of my write load and less than 1% of my read load, but requests often time out (especially after a few hours of site inactivity). Even with no hot cache, the DB should be able to fulfill requests in time, especially since it has about double the RAM (8GB vs. 4GB) and more CPU than the database I currently have at Heroku (Standard 0).
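For context, this is roughly the kind of sanity check I'd run from psql to confirm the memory picture (a minimal sketch; the setting names are standard Postgres, the values on this instance are omitted):
```
-- Standard settings that determine how much of the DB can stay cached;
-- the values on this particular instance are intentionally not shown here.
SHOW shared_buffers;
SHOW effective_cache_size;
SHOW work_mem;

-- Total on-disk size of the database, to compare against the 8GB of RAM.
SELECT pg_size_pretty(pg_database_size(current_database()));
```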
12 Replies
At this point it's clear something is just wrong with the volume because disk I/O is super problematic.
That's an interesting error:
VACUUM(ANALYZE);
ERROR: could not resize shared memory segment "/PostgreSQL.1056445896" to 118765504 bytes: No space left on device
And somehow this is the last ANALYZE time, even though I did run it at least twice:
db=# SELECT max(GREATEST(COALESCE(last_analyze,'epoch'),
db(# COALESCE(last_autoanalyze,'epoch'))) AS db_last_analyze
db-# FROM pg_stat_all_tables;
db_last_analyze
------------------------
1970-01-01 00:00:00+00
(1 row)
It seems like VACUUM(FULL, VERBOSE); is able to start though, so I'm waiting on results from that. I'm really hoping this is something that went wrong with my deployment, but at this point I'm very skeptical of Railway's performance.
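For what it's worth, the segment in that error looks like the dynamic shared memory used by parallel maintenance workers, which usually means /dev/shm is too small for them. A possible workaround, assuming that's actually the cause here (I haven't confirmed it):
```
-- Disable parallel maintenance for this session so VACUUM/ANALYZE stays within
-- the available shared memory (assumes the error comes from parallel workers).
SET max_parallel_maintenance_workers = 0;
VACUUM (ANALYZE, VERBOSE);

-- On PG 13+ the same thing can be requested per command:
VACUUM (PARALLEL 0, ANALYZE, VERBOSE);
```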
Can this be moved to Central Station? I see @rahul moved at least one client to the legacy system three days ago, which isn't very reassuring, but at least things will function:
https://station.railway.com/questions/urgent-production-instance-very-sl-d5482327
4 months ago
OK, I found it's already on Central Station. Things I did (some multiple times):
- VACUUM FULL
- ANALYZE
- Tuned settings
- Tuned shm_size
Nothing seems to have helped so far.
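For anyone following along, this is roughly how to confirm the tuned values actually took effect after a redeploy (the parameter list below is just an illustrative subset, not necessarily everything I touched):
```
-- Show current values and where they came from (config file, default, etc.).
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size',
               'maintenance_work_mem', 'random_page_cost');
```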
4 months ago
I'm not a big Postgres expert but I don't think this is good news:
```
2025-08-09 00:38:05.568 UTC [27] LOG: checkpoint complete: wrote 93 buffers (0.1%); 0 WAL file(s) added, 0 removed, 67 recycled; write=8.964 s, sync=6.407 s, total=33.174 s; sync files=43, longest=5.783 s, average=0.149 s; distance=1106118 kB, estimate=1106118 kB; lsn=10/97001A50, redo lsn=10/85FFF3F0
2025-08-09 00:43:08.495 UTC [27] LOG: checkpoint complete: wrote 347 buffers (0.3%); 0 WAL file(s) added, 0 removed, 63 recycled; write=35.078 s, sync=0.025 s, total=35.827 s; sync files=43, longest=0.010 s, average=0.001 s; distance=1026903 kB, estimate=1098196 kB; lsn=10/C4AD5348, redo lsn=10/C4AD5310
```
Is it normal for a checkpoint to take > 35s while adding 0 WAL files and writing only a few hundred buffers?
This is a DB with almost 0 writes, too.
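In case it helps anyone debugging along with me, the checkpoint write/sync totals can also be read from the stats views (a sketch; the column names assume pg_stat_bgwriter, as on PG 16 and earlier; newer versions moved these into pg_stat_checkpointer):
```
-- Cumulative checkpoint timings since the last stats reset; a sync time that is
-- large relative to the write time and buffer count points at slow fsyncs on disk.
SELECT checkpoints_timed, checkpoints_req,
       checkpoint_write_time, checkpoint_sync_time,
       buffers_checkpoint
FROM pg_stat_bgwriter;

-- Relevant checkpoint pacing settings.
SHOW checkpoint_timeout;
SHOW max_wal_size;
```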
4 months ago
I have moved your service to a different stacker with a different drive setup; can you let me know if you continue to see disk I/O issues?
And just a friendly reminder that we don't ping the team members 🙂
4 months ago
That didn’t help at all. The site is extremely slow. It’s not networking, because query times are similar when run in psql.
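For reference, this is the kind of comparison I mean (a minimal sketch; the query itself is just a placeholder):
```
-- \timing reports the client-observed round trip, while EXPLAIN ANALYZE's
-- "Execution Time" is measured on the server. If the two are close, the time
-- is going into the database itself rather than the network.
\timing on
EXPLAIN (ANALYZE, BUFFERS) SELECT 1;  -- substitute a representative app query
```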
4 months ago
Gotcha, thank you for letting me know.
Would you happen to have any tracing that involves the slow database queries? That would be most helpful as we debug this further.
4 months ago
Except for information containing actual queries, I’m willing to provide any info that’s needed. I honestly think the volume is totally botched; I can’t believe an SSD would ever behave this way. (Also, see my other thread regarding volume usage discrepancies.)
Could it be that the volume didn’t get moved when I moved the Postgres deployment to us-east-1?
Is there an option to redeploy me on the “legacy” (non-metal) stack just so we confirm this is the issue?
Edit: I'm attaching an anonymized explain (I hope AI didn't screw this up):
```
Gather  (cost=155858.36..244117.56 rows=1612 width=4) (actual time=12661.460..14843.236 rows=1151 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=55247 read=199199
  ->  Parallel Hash Join  (cost=154858.36..242956.36 rows=672 width=4) (actual time=12640.228..14795.726 rows=384 loops=3)
        Hash Cond: (table_b.foreign_key_id = table_c.id)
        Buffers: shared hit=55247 read=199199
        ->  Nested Loop  (cost=40.51..87977.69 rows=61266 width=4) (actual time=0.493..2120.871 rows=54997 loops=3)
              Buffers: shared hit=30272 read=50787
              ->  Parallel Bitmap Heap Scan on table_a u0  (cost=40.07..7187.39 rows=1701 width=4) (actual time=0.365..9.130 rows=1287 loops=3)
                    Recheck Cond: (user_id = 123)
                    Heap Blocks: exact=208
                    Buffers: shared hit=425 read=118
                    ->  Bitmap Index Scan on idx_table_a_user_id  (cost=0.00..39.05 rows=4082 width=0) (actual time=0.248..0.249 rows=3862 loops=1)
                          Index Cond: (user_id = 123)
                          Buffers: shared hit=6
              ->  Index Scan using idx_table_b_reference_id on table_b  (cost=0.44..47.17 rows=33 width=8) (actual time=0.402..1.629 rows=43 loops=3862)
                    Index Cond: (reference_id = u0.reference_id)
                    Buffers: shared hit=29847 read=50669
        ->  Parallel Hash  (cost=153911.60..153911.60 rows=72500 width=4) (actual time=12638.661..12638.662 rows=57789 loops=3)
              Buckets: 262144  Batches: 1  Memory Usage: 8864kB
              Buffers: shared hit=24917 read=148412
              ->  Parallel Index Only Scan using idx_table_c_composite on table_c  (cost=0.42..153911.60 rows=72500 width=4) (actual time=7.640..12551.648 rows=57789 loops=3)
                    Index Cond: ((boolean_field = false) AND (date_field >= '2025-05-01'::date) AND (date_field <= '2025-05-31'::date))
Planning Time: 18.226 ms
Execution Time: 14843.236 ms
```
4 months ago
I've moved the database back to the Legacy region; let me know if that solves anything.
4 months ago
As a start, cold cache queries are not timing out. They’re slow (I guess that’s somewhat expected).
I will wait a few more hours, then run the queries again and post an update.
4 months ago
Intermediate update:
I haven't had a single request time out since brody moved the database to the "legacy" region. This points to a serious disk I/O issue in Metal regions, especially when it comes to Postgres.
Cold-cache queries are still very slow (often reaching 10-20s), which is surprising for an SSD. This makes me wonder: from what I've seen, at least where I looked, Railway does not mention IOPS anywhere in its pricing or documentation. Maybe that's the hidden problem? Wherever there's a disk there's an IOPS limit, but I can't see what my dedicated IOPS is for the Postgres volume, for example.
I'll keep updating here; I've had to spin up some monitoring resources to benchmark things a bit more carefully.
I might spin up a DB at Neon to compare the results.
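One way to see how much of the runtime is pure disk wait (a rough sketch; it assumes the role is allowed to enable track_io_timing, and the query is a placeholder based on the anonymized plan above):
```
-- With track_io_timing on, EXPLAIN (ANALYZE, BUFFERS) adds an "I/O Timings"
-- line, which separates time spent waiting on the disk from everything else.
SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM table_c
WHERE boolean_field = false
  AND date_field BETWEEN '2025-05-01' AND '2025-05-31';
```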
4 months ago
So, anything that has not been queried very recently still takes an abysmal 9-25s to execute.
I'm going to test Neon in North Virginia now, because those results (with 0% write load and less than 1% of the read load) are just not acceptable.
4 months ago
I deployed a comparable instance on Neon — although this wasn't my original plan — and the difference is night and day despite the "unfair" advantage the Railway Postgres deployment has (private networking).
It could be argued that this is expected given that Neon specializes in Postgres, but I don't think the differences come down to tuning. They are just too BIG.
This has got to be one of two things:
- Some disk I/O quirk that Railway ought to figure out. Disk write issues have already been acknowledged by an employee as something that's being worked on, which is great. But my Postgres instance couldn't handle 0 writes and single, non-concurrent reads when hosted on Metal, and was still painfully slow after it was moved to "Legacy".
- Something I screwed up big time with the deployment.
I have a few days before I have to make up my mind about a new host for these workloads, but I'm definitely not hosting my DB at Railway. The 8GB deployment either couldn't really keep its cache warm or (most probably) its disk I/O is disastrous.
4 months ago
Can this be escalated to the engineering team, or is it presumed that this is an application-level issue?
4 months ago
Hi, I'm experiencing the same for MySQL and MongoDB: very slow even though the load is not high, below 100 QPS. Moving the same workload to another provider immediately gave a 5-10x improvement in I/O speed.
4 months ago
I've had something similar on my pg instance. My app is still in dev, so I'm the only one using it and the metrics are almost flat and very low. I have an e2e test suite I run to be sure all is working fine: locally all is good, but online I can see that I sometimes get timeout issues, and when this happens I get these logs in the DB:
```
2025-08-13 16:49:14.715 UTC [74] LOG: checkpoint complete: wrote 21 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=2.027 s, sync=0.021 s, total=2.098 s; sync files=9, longest=0.015 s, average=0.003 s; distance=108 kB, estimate=108 kB; lsn=0/2E19A18, redo lsn=0/2E199E0
```
4 months ago
!t
4 months ago
This thread has been escalated to the Railway team.
Status changed to Awaiting Railway Response adam • 4 months ago
shxkm
Can this be escalated to the engineering team, or is it presumed that this is an application-level issue?
4 months ago
We are working around the clock on the Metal disk I/O issues, but your database was moved back to GCP for the time being, where there were no disk I/O issues. We had users happily running multi-terabyte scale databases on GCP without issue.
Given that you are still experiencing issues while on a GCP host, the remaining issues would likely come down to unoptimized queries and/or a lack of database tuning. However, any such optimizations would be completely up to you, as our databases are not managed - there are no pooling, caching, or other managed features.
As for everyone else in this thread reporting issues, you would be on Metal, and I would ask you to open your own threads so that we can properly track your reports.
Status changed to Awaiting User Response Railway • 4 months ago
3 months ago
This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!
Status changed to Solved Railway • 3 months ago