2 months ago
Hey, this issue started maybe a week ago. I use a Django backend with Postgres and a Next.js frontend, all in the same region on Metal instances (Amsterdam). I tried disabling the Metal Edge networking but it didn't help.
This seems to be some kind of caching or connection-creation issue. The first save/create request to my backend takes really long, but if I send another request right after, it's fast. Then, after a few minutes, I try to send another write request and the same loop happens again.
So perhaps the optimization of DB connections changed on Railway? It makes my app look unresponsive to users. This problem doesn't appear on my local dev environment, so I suppose it's infra-specific.
When I try to use the Railway Data UI in the "data" tab of my Postgres service, it loads indefinitely and I can't really view my data from there, so that may perhaps be connected.
Both RAM and vCPU on the DB are maxed out at 8 each; on the other app services, 2-4 each.
If I should include any more information, please let me know.
13 Replies
2 months ago
What happens when you move your DB to a different Metal region? Does it make it available? Do you have any custom DB logic?
Status changed to Awaiting User Response Railway • 2 months ago
angelo-railway
2 months ago
Thanks for the reply, I didn't get a notification, hence my late reply. I suppose I could try to move it to a different region, but given that all of my other services, as well as the database, are in the Netherlands/EU region, moving to Asia or the US would probably make the wait time longer. I'm also not sure if I can move the region without losing DB data?
"Does it make it available?" - the database is available, it's just that the write time is extremely slow for the first request; maybe for some reason the connection-establishing time got longer on Railway?
I don't have any custom logic, just the most basic case of Django ORM, no funny business. The connection time is normal on the dev environment - the issue of the first write request being slow doesn't happen locally.
Status changed to Awaiting Railway Response Railway • 2 months ago
2 months ago
I switched to the US to test it. The read time was super slow as well since it's far away; the write was similar (probably normal speed plus distance overhead). I switched back to the EU so that at least reading is faster. This definitely looks like an issue on your side - probably the EU servers are overloaded. It maybe wasn't an issue before Metal - is it possible for me to go back?
2 months ago
-- Direct timing measurement via the Railway CLI --
I connected to the Postgres DB using the CLI; saving/reading a single record with raw SQL takes around ~40ms - which might be because the connection is already warm and established by psql.
-- Timing save() - Django operation in code (all on the same entity) --
1st call: 1816ms
2nd call: 18ms
3rd call: 4ms
4th call: 7010ms !!!
5-7: I started spam-clicking the submit button, and after that 7s call the next ones were fast (30, 120, 150ms; one 3200ms though)
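For anyone wanting to reproduce these measurements, here is a minimal sketch of how such timings can be captured around ORM calls (the `timed` helper and print-logging are illustrative, not from the actual codebase or Django itself):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, log=print):
    # Log the wall-clock duration of the wrapped block in milliseconds.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        log(f"{label}: {elapsed_ms:.0f}ms")

# Usage around any ORM call:
# with timed("entity.save()"):
#     entity.save()
```

Wrapping only the `save()` call (rather than the whole view) is what lets you tell connection/DB time apart from the rest of request handling.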
Angelo, do you have access to look into my project setup, or can I somehow grant you those permissions? It might be easier that way :)
2 months ago
It sounds like your application is spinning up a connection for the first request and then reusing it.
That's pretty standard. You'll want to look at creating a connection to your DB when your application first spins up.
Unfortunately we can't help with application-specific issues, but maybe someone in the community can help with your code? I'll attach a bounty!
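A minimal sketch of that advice in Django terms, assuming an app named `core` (the app name and class are illustrative): open a connection in `AppConfig.ready()` so the first user request doesn't pay the connection-setup cost.

```python
# core/apps.py -- hypothetical sketch: warm the DB connection at worker startup
from django.apps import AppConfig
from django.db import connection

class CoreConfig(AppConfig):
    name = "core"

    def ready(self):
        # Establish the connection once per worker process so the first
        # request served by this worker doesn't pay for connection setup.
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
```

Note that `ready()` also runs for management commands, so in practice you may want to guard this; treat it as a starting point, not a drop-in fix.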
Status changed to Awaiting User Response Railway • 2 months ago
2 months ago
Yes, it is possible that hardware is the issue. Our PostgreSQL service started experiencing similar unpredictable slowdowns in the EU-West region, ultimately solved by moving the project to US-East at Angelo's suggestion.
If you truly believe you may be affected by the same problem, you will have to perform a lot of due diligence investigating your application behavior before you can claim it's caused by the hardware (it likely isn't).
You'll have to instrument your application with telemetry that allows you to ensure:
- you pool DB connections and keep them alive (as you should)
- ensure waiting is not caused by pessimistic locking, serializable transaction isolation etc.
- ensure it isn't your application code in other ways, for example logging during transactions. If your logger gets stuck while your transaction holds a lock on a table, you may be blocking other transactions. Infinite possibilities
For example I happen to know that Django loves table-level pessimistic locking and I think there's no connection pool by default if you don't configure one (I may be wrong).
As a quick check if you're suffering from I/O, you can use the following query DURING your slow request to see what the database is working on:

```
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       left(query, 6),
       time(xact_start) AS xact_start,
       now() - xact_start AS duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY xact_start;
```
If you are affected by the same thing we were, you should expect something along these lines:

```
| pid | state  | wait_event_type | wait_event   | query  | xact_start                        | duration              |
|:----|:-------|:----------------|:-------------|:-------|:----------------------------------|:----------------------|
| 478 | active | LWLock          | WALWrite     | COMMIT | 2025-06-19 09:10:11.148374 +00:00 | 0 mins 35.76775 secs  |
| 480 | active | LWLock          | WALWrite     | COMMIT | 2025-06-19 09:10:18.710823 +00:00 | 0 mins 28.205301 secs |
| 482 | active | LWLock          | WALWrite     | COMMIT | 2025-06-19 09:10:35.004183 +00:00 | 0 mins 11.911941 secs |
| 483 | active | IO              | DataFileRead | select | 2025-06-19 09:10:35.070579 +00:00 | 0 mins 11.845545 secs |
| 216 | active | null            | null         | SELECT | 2025-06-19 09:10:46.916124 +00:00 | 0 mins 0.0 secs       |
```
You would see that the database is waiting for I/O operations. If you see similar output, you have some reason to suspect underlying I/O issues, though you must perform your due diligence regardless before claiming that.
If indeed it turns out you have to move to US-East, you'll want to move the database and all services that access it as well. You cannot have a service that sends traffic from Europe to USA on each database access during request processing. Imagine you had 20 DB operations per request, and each of them has to make a roundtrip to the USA before a response can be returned to your user - that's bad.
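To put rough numbers on that last point (the 90ms round-trip figure below is an assumed, illustrative EU to US-East latency, not a measurement from this project):

```python
# Back-of-the-envelope cost of sequential cross-Atlantic DB round trips.
# rtt_ms is an assumed EU <-> US-East round-trip time, not a measured value.
rtt_ms = 90
queries_per_request = 20

# Sequential queries each pay the full round trip before the next can start.
added_latency_ms = rtt_ms * queries_per_request
print(f"{added_latency_ms}ms of pure network wait per request")  # 1800ms
```

That is nearly two seconds of pure network wait on every request, before the database does any actual work.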
2 months ago
I tried to format the table as a code block, but it didn't format properly, so here's a screenshot of the query output
Attachments
okarmazin
2 months ago
Thank you so much for this elaborate reply. I'll check the connection pooling and other things you mentioned. The weird thing is that I haven't changed anything about the DB setup since the start of my project (3-4 months ago); it's all standard Django stuff, yet the database problems only started maybe a week ago.
I'll try the WAL query check as well, and will let you know. Thanks!
2 months ago
I ran your SQL query during long updates, and indeed it seems that WALSync takes a long time. The total save() call on the model was 6358ms here:

```
 pid  | state  | wait_event_type | wait_event | query_snippet | xact_start | duration
------+--------+-----------------+------------+---------------+------------+-----------------
 2215 | active | IO              | WALWrite   | UPDATE        | 08:57:13   | 00:00:00.672851
 2211 | active |                 |            | SELECT        | 08:57:14   | 00:00:00
```

then:

```
 2215 | active | IO              | WALSync    | UPDATE        | 08:57:13   | 00:00:03.513413
```

So it seems like WALSync is hanging. From what I've read, it's responsible for flushing the WAL to disk - it may be waiting for confirmation from the OS.
After this, the next 20 queries ran at normal speed; I probably need to wait a bit until I get another slow one.
jake
2 months ago
Thank you Jake, after your input I'm looking at creating a persistent connection from Django - the current one was closed after each request. Thank you for the bounty as well!
rvdlaar
The most likely fix is implementing proper connection pooling with CONN_MAX_AGE and a connection pool library. The pattern you're seeing is classic connection-establishment overhead, and keeping connections alive should resolve the issue.

- Try a different region: as suggested by the Railway employee, test with a database in a different Metal region
- Consider upgrading: your current resources are maxed out, which could contribute to the issue
- Enable query logging: add LOGGING configuration to identify slow queries
- Monitor connection counts: use SELECT count(*) FROM pg_stat_activity; to monitor active connections

1. Optimize Django database connection settings

Add these settings to your Django settings.py:

```
DATABASES = {
    'default': {
        # ... your existing config
        'CONN_MAX_AGE': 60,  # Keep connections alive for 60 seconds
        'OPTIONS': {
            'MAX_CONNS': 20,  # Limit concurrent connections
            'MIN_CONNS': 1,   # Keep minimum connections open
            'connect_timeout': 10,
            'application_name': 'your_app_name',
        },
    }
}
```

2. Implement proper connection pooling

Install and configure django-db-connection-pool:

```
pip install django-db-connection-pool[postgresql]
```

Update your database engine:

```
DATABASES = {
    'default': {
        'ENGINE': 'dj_db_conn_pool.backends.postgresql',
        # ... rest of your config
        'POOL_OPTIONS': {
            'POOL_SIZE': 10,
            'MAX_OVERFLOW': 10,
            'RECYCLE': 300,    # Recycle connections every 5 minutes
            'PRE_PING': True,  # Validate connections before use
        },
    }
}
```

3. Add database connection health checks

Create a management command to warm up connections:

```
# management/commands/warm_db.py
from django.core.management.base import BaseCommand
from django.db import connection

class Command(BaseCommand):
    def handle(self, *args, **options):
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
```

4. Optimize PostgreSQL configuration

Add these environment variables to your Railway PostgreSQL service:

```
POSTGRES_SHARED_PRELOAD_LIBRARIES=pg_stat_statements
POSTGRES_MAX_CONNECTIONS=100
POSTGRES_SHARED_BUFFERS=2GB
POSTGRES_EFFECTIVE_CACHE_SIZE=6GB
POSTGRES_MAINTENANCE_WORK_MEM=512MB
POSTGRES_CHECKPOINT_COMPLETION_TARGET=0.9
POSTGRES_WAL_BUFFERS=64MB
```

5. Implement application-level solutions

Add connection warming middleware:

```
# middleware.py
from django.db import connection
from django.utils.deprecation import MiddlewareMixin

class DatabaseConnectionMiddleware(MiddlewareMixin):
    def process_request(self, request):
        # Ensure the connection is alive
        if connection.connection and not connection.is_usable():
            connection.close()
```

6. Immediate debugging steps

Add this to your Django view to monitor connection behavior:

```
import time
import logging

from django.db import connection

logger = logging.getLogger(__name__)

def your_view(request):
    start_time = time.time()
    logger.info(f"Connection queries before: {len(connection.queries)}")
    # Your database operations here
    end_time = time.time()
    logger.info(f"Request took: {end_time - start_time:.2f}s")
    logger.info(f"Connection queries after: {len(connection.queries)}")
```
2 months ago
Thanks, not to be mean but it looks a bit like AI wrote it :) Though you made some valid points. I already implemented CONN_MAX_AGE as stated in my previous comment. The pooling library is also a good idea; if CONN_MAX_AGE doesn't fix the issue I'll add it as well. I already tried a write DB ping every 60s, which didn't help, and logging of write times is already in place (that's where I got the timings in my first comments from).
Currently it seems to have stabilized - weirdly, right after I ran okarmazin's SQL query to check idle connections - although that's probably just luck. I'll monitor it today and comment here with updates.
jake
2 months ago
I did additional measurements and research, and sadly it again seems to be a hardware I/O bottleneck. Though I must say the number of those slow requests has decreased.
Below is a screenshot of 2 requests that took way too long. We can see this is not an issue with connecting from Django to Postgres, because the wait happens in Postgres itself. Those requests were hanging for 24s and 32s respectively on WALSync. From what I understand, Postgres first writes the change to the WAL buffer in memory and then syncs that buffer to disk - WALSync. So it looks like an issue with I/O on the Postgres disk itself, sadly.
Request one - hanging 24s, finally finished and saved
Request two - hanging over 32s (!!!), eventually crashed the gunicorn worker due to timeout
So I'm positive it's not an issue with the DB connection or pooling, given that the request reaches Postgres without much delay and gets stuck there.
Would upgrading to the Pro plan fix this? Maybe the EU servers are overloaded and the disk can't keep up?
2 months ago
UPDATE: I moved the whole backend stack to Metal US East. The DB write issue seems to be mitigated, but normal load times took a hit since it's further away. I kept the frontend in the EU so the JS can load fast, but for example the Django Admin UI is noticeably slower to load. It's just a temporary solution.
UPDATE 2: I optimized the admin interface to cut down on load queries and it works at an acceptable rate now.
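For context, cutting admin load queries usually means avoiding one extra query per row on list pages; a sketch of the kind of change involved (the `Order`/`customer` names are made up for illustration, not from my actual models):

```python
# admin.py -- illustrative sketch; model and field names are hypothetical
from django.contrib import admin
from .models import Order

@admin.register(Order)
class OrderAdmin(admin.ModelAdmin):
    # Fetch the related row via a JOIN in the list query
    # instead of issuing one extra query per displayed row.
    list_select_related = ("customer",)
    list_per_page = 50  # smaller pages mean fewer rows fetched per load
```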
So we have a temporary solution, but it would be nice to be able to actually use the EU. Please let me know when you guys upgrade the EU servers or increase capacity!