a month ago
Hello, a few days ago (Sunday at the latest) we started experiencing service disruptions and request timeouts due to extremely slowed database write operations. The service had been running smoothly on Metal infrastructure for over a month until that point.
Investigation leads me to believe that application code is not the cause of this issue. The following is an AI-written report (which I have reviewed and confirm to be an accurate assessment) of our issue:
---------------------
Issue Report: PostgreSQL WAL Synchronization Delays Impacting Application Performance
Issue Summary
Our application is experiencing significant performance degradation due to PostgreSQL Write-Ahead Log (WAL) synchronization delays. Database transactions are consistently blocked in WALSync and WALWrite wait states for extended periods (10+ seconds), creating cascading transaction locks and severely impacting application response times. This appears to be an underlying I/O performance issue with the database storage subsystem, not an application design problem.
Environment Details
Database: PostgreSQL
Application: Java/Kotlin application using Quarkus framework
Time Period: Issue observed consistently since June 15, 2025
Impact: High/Severe - affecting all database write operations
Diagnostic Evidence
Process activity captured from pg_stat_activity shows multiple transactions blocked in WAL sync operations:
PID 424: COMMIT - Blocked in WALSync for 12.26 seconds
PID 422: COMMIT - Blocked in WALWrite for 4.21 seconds
PID 423: UPDATE - Blocked waiting on transaction ID for 1.26 seconds
This indicates that the database is experiencing significant I/O delays when attempting to flush WAL records to disk, causing a chain reaction of blocked transactions.
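A snapshot like the one above can be captured with a query along these lines (a sketch; the column list is trimmed for readability):
```
-- Sessions that are currently waiting, longest-running transaction first
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       left(query, 60)    AS query,
       now() - xact_start AS xact_duration
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event IS NOT NULL
ORDER BY xact_start;
```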
Remediation Steps Already Attempted
WAL Warmup Implementation:
Developed and deployed a dedicated WAL warmup service that performs regular database operations (INSERT, UPDATE, DELETE) at configurable intervals
Service executes regularly in an attempt to keep the WAL subsystem active - tried intervals from 2 minutes down to 10 seconds to no effect
Designed to exercise multiple database operation patterns to prevent "cold WAL" issues
Despite these efforts, WAL sync delays persist
Transaction Pattern Optimization:
Reviewed and optimized application transaction patterns
Implemented proper transaction boundaries
Ensured connections are properly managed and released
Connection Pool Tuning:
Configured appropriate connection pool settings
Verified connection acquisition/release patterns
Performed VACUUM FULL
Technical Assessment
Based on our investigation, this appears to be an infrastructure I/O performance issue rather than an application problem:
The consistent WALSync and WALWrite wait events indicate the bottleneck is in the physical writing of WAL records to disk
The delays occur even during periods of relatively low application activity
Our WAL warmup strategy should have mitigated any "cold WAL" issues, but the fundamental I/O performance problem remains
The observed pattern is consistent with insufficient I/O throughput or high latency in the storage subsystem: identical SQL statements perform wildly differently, with execution times varying from milliseconds to 10+ seconds depending on when WAL writes occur
Database CPU utilization remains near zero during these events, ruling out query optimization issues, which would manifest as CPU spikes
If the issue were due to poorly optimized queries or missing indexes, we would expect to see high CPU consumption and different wait event types (not WAL-related), as the database would be performing intensive data processing rather than waiting on disk I/O
The application has been running in this configuration for some time without any issues. We aren't aware of any problematic or significant changes in data access patterns.
Requested Action
We request the Railway team's assistance in investigating and remediating these issues.
Business Impact
These WAL synchronization delays are causing significant degradation in our application's performance and reliability:
User-facing operations are experiencing unpredictable latency spikes
Background processing jobs are being delayed
We consider this a high-priority issue requiring immediate attention, as it impacts our ability to provide reliable service to our users.
Edit: removed a potentially misleading wording about "complete unavailability" of the service.
14 Replies
a month ago
Hello!
We're acknowledging your issue and attaching a ticket to this thread.
We don't have an ETA for it, but our engineering team will take a look, and you will be updated as we update the ticket.
Please reply to this thread if you have any questions!
a month ago
Hey there Ondrej,
The LLM made a pretty good report. I am sorry that you are facing this issue. The good news is that we found a few fixes. The machines on Metal have more memory, so is it possible for you to tune your PG to use more WAL cache, so that you rely on the memory of your PG instance vs. disk?
Status changed to Awaiting User Response Railway • about 2 months ago
a month ago
Hey, thanks for taking a look. Could you please clarify what you mean by relying more on memory? Write operations must be written to disk in order to provide durability guarantees, no? This doesn't seem to be a problem with cache that I could solve by increasing memory settings - it does point to disk I/O.
Status changed to Awaiting Railway Response Railway • about 2 months ago
a month ago
Or do you mean switching off synchronous commit? I wouldn't want to do that as the service stores payment data. I cannot risk losing those types of records if there's a crash.
Let me reiterate that this service has been running smoothly and fast as hell until now in the identical configuration. Usage patterns and traffic volume have also stayed consistent.
a month ago
Or do you mean switching off synchronous commit?
Well, the thing we are fighting is I/O pressure. My view here is to try to get the DB to not write so eagerly with the WAL setup that you have.
A good example is WAL checkpoints:
```
checkpoint_timeout = 15min
max_wal_size = 2GB
min_wal_size = 1GB
```
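A minimal sketch of applying those settings without a restart (ALTER SYSTEM plus pg_reload_conf, which is how the change was later applied in this thread):
```
-- Persist the suggested checkpoint/WAL sizing, then reload the config without a restart
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET max_wal_size = '2GB';
ALTER SYSTEM SET min_wal_size = '1GB';
SELECT pg_reload_conf();
```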
We also spun up more machines in East; we can also try moving you to a calmer box.
Status changed to Awaiting User Response Railway • about 1 month ago
a month ago
OK, here's what I'm thinking, please correct any misunderstanding:
- The I/O problem has been acknowledged as a real issue on Railway's side. The storage hardware is currently so busy in EU-West that the default PostgreSQL configuration of the standard Railway template has problems storing even low amounts of data (low thousands of transactions per day, very little actual data written to disk), with essentially no transaction concurrency (at most 2 concurrent write transactions a few times per day)
- This is unexpected, shouldn't happen, is/will be worked on. It correlates with something on Railway side - from my understanding it's the forced final migration of the remaining users to Metal?
- A possible hands-off (from our side) solution is suggested - to move data from EU-West to US-East, because there are recent hardware upgrades there. After EU-West gets hardware upgrades, data can be moved back to the EU to prevent a round trip under the Atlantic Ocean on each database access.
- I'll try changing WAL options and/or transaction batching configurations to try to reduce the frequency of WAL syncs. I will have to fork the Railway database template to do this, because that's what's being used in our project.
- However, I/O operations are unavoidable for durable transactions. The transaction literally cannot end before the commit gets written to disk. There is an option to turn off synchronous commit, which would allow the database to report transactions as committed while they are not actually written to disk yet. This can significantly improve commit speeds, but carries the risk of data loss in the event of a server crash (see the sketch after this list).
- While yes, transaction operations are performed in-memory, the in-memory buffers must be written to disk during transaction commit before the database server can return the transaction as successfully committed.
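For completeness, if asynchronous commit were ever acceptable for a subset of writes, it can be scoped to a single transaction instead of being switched off globally - a sketch only, and not something I would use for the payment records mentioned above:
```
BEGIN;
-- Only this transaction reports COMMIT before its WAL is flushed to disk;
-- a crash in that window can lose the transaction, so never use this for critical data
SET LOCAL synchronous_commit = off;
-- ... non-critical writes would go here ...
COMMIT;
```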
I will now see if I can help by changing configuration - especially wal_buffers, commit_delay, commit_siblings - though I don't expect them to help much due to the nature of the problem. Also I'll try to reduce the frequency of WAL checkpoints to reduce overall I/O pressure. Again, I don't know how helpful that can be unless you tell all other tenants on those boxes to do the same...
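For the record, the kind of change I'm planning looks roughly like this - the values are illustrative, not a recommendation:
```
-- Illustrative values only
ALTER SYSTEM SET wal_buffers = '64MB';  -- takes effect only after a restart (default is sized from shared_buffers, capped at 16MB)
ALTER SYSTEM SET commit_delay = 1000;   -- microseconds to wait before flushing WAL, hoping to group several commits into one fsync
ALTER SYSTEM SET commit_siblings = 2;   -- only delay when at least this many other transactions are open; with our low concurrency this rarely triggers
SELECT pg_reload_conf();                -- picks up commit_delay/commit_siblings; wal_buffers still needs a restart
```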
I'd like to ask the following of you:
1) If there's a Railway representative in the EU zone, please hand this over to them so that we can more easily communicate without waiting for the other person to sleep through their night during our office hours.
2) Could you kindly get an infrastructure person - a database expert - from your team to look at this thread to say whether there realistically is something I can do from the application side, or if this can only be helped by upgrading your physical hardware or abandoning Railway as a data storage platform?
3) I expect that other people on EU-West are experiencing the same issue. If you are able to detect and confirm they do, please reach out to them proactively to assure them you're working on it.
Thanks
Status changed to Awaiting Railway Response Railway • about 1 month ago
a month ago
1) Your responses will be handled according to the SLO that your plan has, which doesn't entitle you to a dedicated resource.
2) I am that infrastructure person, I lead the migration to Metal and have been personally dealing with all regressions as we have debugged them and then surpassed GCP performance. As mentioned, the machines have much greater memory on the VM with shared I/O. (GCP disks behaved slightly differently on the mount) There's a number of steps we are taking at the hardware level in tuning the disk to be faster. The workaround I suggested was an immediate step that didn't require intervention on Railway's side.
3) We have seen the issue present from different customers, but we have addressed it according to the profile of their applications. You are the first person to report an issue with Postgres, which I am addressing based on the report you have provided.
Status changed to Awaiting User Response Railway • about 1 month ago
a month ago
| Timestamp | Operation | Duration |
|---------------------|-----------------|----------|
| 2025-06-19 09:08:35 | scheduledWarmup | 27 ms |
| 2025-06-19 09:08:05 | scheduledWarmup | 3.21 s |
| 2025-06-19 09:07:35 | scheduledWarmup | 3.66 s |
| 2025-06-19 09:07:05 | scheduledWarmup | 2.14 s |
| 2025-06-19 09:06:35 | scheduledWarmup | 956 ms |
| 2025-06-19 09:06:05 | scheduledWarmup | 31.0 s |
| 2025-06-19 09:05:35 | scheduledWarmup | 30 ms |
| 2025-06-19 09:05:05 | scheduledWarmup | 34 ms |
| 2025-06-19 09:04:35 | scheduledWarmup | 9.46 s |
| 2025-06-19 09:04:05 | scheduledWarmup | 7.13 s |
| 2025-06-19 09:03:35 | scheduledWarmup | 37.0 s |
| 2025-06-19 09:03:05 | scheduledWarmup | 1.66 s |
| 2025-06-19 09:02:35 | scheduledWarmup | 526 ms |
For illustration for anyone else reading this thread, here's an example of how that manifests - a few recent traces (table above) from the attempted WAL warmup mentioned earlier. Each `scheduledWarmup` is a single insert, a single update and a single delete, with no other transactions running concurrently. Durations range from tens to tens of thousands of milliseconds.
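Purely for illustration, one warmup cycle amounts to roughly the following - the wal_warmup table and its columns are hypothetical placeholders, not our real schema:
```
-- Hypothetical warmup table; the real schema differs
CREATE TABLE IF NOT EXISTS wal_warmup (
    id         bigserial PRIMARY KEY,
    payload    text,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- One warmup cycle: a single insert, update and delete; the WAL flush happens at COMMIT
BEGIN;
INSERT INTO wal_warmup (payload) VALUES ('ping');
UPDATE wal_warmup SET payload = 'pong' WHERE id = (SELECT max(id) FROM wal_warmup);
DELETE FROM wal_warmup WHERE created_at < now() - interval '1 hour';
COMMIT;
```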
------------------------------
Oh I see now - I didn't think an infra team member would be the first to respond in the forums. I saw some chatter about similar experiences on Discord and encouraged those people to write some reports too.
I'll try the suggested database changes and if it doesn't help I'll try to move the volume to US-East, does that sound right?
Status changed to Awaiting Railway Response Railway • about 1 month ago
a month ago
Look, I don't know how to react when you say "Lemme talk to the SME" when we try very hard to make it so that the people who work on the systems are the ones talking to people.
---
That is right, we are actively spinning up more machines each day - there are a few fixes that we are targeting to improve overall DB performance and get us back above GCP performance. To note, this isn't a widespread issue, just on certain workloads. Hence why we're trying to nail this down with you.
Status changed to Awaiting User Response Railway • about 1 month ago
a month ago
I take blame for my badly worded comment that looks entitled and belligerent, which was not at all the intention. Given English is not my first language, that's on me and I apologize for the miscommunication.
It was merely a friendly suggestion to involve - if you thought it would be at all appropriate - someone from a closer European time zone to prevent unnecessary delays, that's all. No Karen moment was intended! Had I known you are the lead migration engineer, I wouldn't have made that comment at all, as it is moot.
Status changed to Awaiting Railway Response Railway • about 1 month ago
a month ago
It's all good, I am a night owl.
Let me know how the move works as well; gut feeling here is that it can be the host.
Status changed to Awaiting User Response Railway • about 1 month ago
a month ago
| pid | state | wait_event_type | wait_event | query | xact_start | duration |
|:----|:-------|:------------------|:-------------|:-------|:----------------------------------|:----------------------|
| 478 | active | LWLock | WALWrite | COMMIT | 2025-06-19 09:10:11.148374 +00:00 | 0 mins 35.76775 secs |
| 480 | active | LWLock | WALWrite | COMMIT | 2025-06-19 09:10:18.710823 +00:00 | 0 mins 28.205301 secs |
| 482 | active | LWLock | WALWrite | COMMIT | 2025-06-19 09:10:35.004183 +00:00 | 0 mins 11.911941 secs |
| 483 | active | IO | DataFileRead | select | 2025-06-19 09:10:35.070579 +00:00 | 0 mins 11.845545 secs |
| 216 | active | null | null | SELECT | 2025-06-19 09:10:46.916124 +00:00 | 0 mins 0.0 secs |
Here's a snippet from pg_stat_activity after changing the suggested WAL settings using ALTER SYSTEM followed by pg_reload_conf - checkpoint, min wal, max wal; those did not influence the query characteristics. WALWrite is still the leading cause, though I have seen some transactions waiting in DataFileRead for up to 10 seconds before moving on to WALWrite and waiting for another 10 seconds there.
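As a sanity check, a query along these lines can confirm that the reloaded values are actually in effect and whether any of them is still waiting on a restart (a sketch for reference):
```
-- Current values of the tuned parameters, plus whether a restart is still pending
SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'max_wal_size', 'min_wal_size',
               'wal_buffers', 'commit_delay', 'commit_siblings');
```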
I'll do the migration to US-East and report.
Status changed to Awaiting Railway Response Railway • about 1 month ago
a month ago
I can confirm that moving the database service from EU-West to US-East immediately fixed all observed issues! Thanks for your guidance, Angelo.
a month ago
Are you able to share whether EU-West is being scaled up as well at the moment, so that it would hopefully be possible to move the data back to our side of the pond at some point? Since I can't get the service to fail any more, I'm marking this as resolved.
Status changed to Solved okarmazin • about 1 month ago