PostgreSQL Database WAL Write Blocking - 30+ Second Transaction Commits
skullxa
HOBBYOP

4 months ago

# Railway Support Ticket: Critical Database Performance Issue

## Subject: 
**URGENT: PostgreSQL Database WAL Write Blocking - 30+ Second Transaction Commits**

## Summary:
Our production PostgreSQL database is experiencing severe WAL (Write-Ahead Log) blocking, causing transaction commits to take 30+ seconds instead of milliseconds. This has rendered our e-commerce application unusable.

## Database Details:
- **Database**: PostgreSQL on Railway
- **Issue Started**: Approximately January 8th, 2025
- **Impact**: 100% of write operations affected

## Technical Evidence:

### 1. Database Activity Monitoring Shows WAL Blocking:
```
🔥 ACTIVE QUERIES:
PID: 54164 | Duration: 30.26s | State: active
Query: COMMIT;
Waiting: LWLock/WALWrite

🔥 ACTIVE QUERIES:
PID: 54172 | Duration: 16.13s | State: active
Query: COMMIT;
Waiting: LWLock/WALWrite
```
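
For reference, wait-state snapshots like the one above can be pulled from `pg_stat_activity` (the `wait_event_type` / `wait_event` columns exist since PostgreSQL 9.6). Below is a minimal sketch of such a monitor, assuming Node.js with the `pg` client and a `DATABASE_URL` environment variable; it is not the exact script we ran, just the shape of it.

```ts
// monitor-active-queries.ts -- minimal sketch, not the exact monitor behind the output above.
// Assumes: `npm install pg` and DATABASE_URL pointing at the Railway database.
import { Client } from "pg";

async function main(): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // Active backends other than ourselves, longest-running first,
  // with their wait_event_type/wait_event (e.g. LWLock/WALWrite).
  const { rows } = await client.query(`
    SELECT pid,
           round(extract(epoch FROM now() - query_start)::numeric, 2) AS duration_s,
           state,
           wait_event_type || '/' || wait_event AS waiting,
           left(query, 80) AS query
    FROM pg_stat_activity
    WHERE state = 'active' AND pid <> pg_backend_pid()
    ORDER BY duration_s DESC NULLS LAST
  `);

  for (const r of rows) {
    console.log(`PID: ${r.pid} | Duration: ${r.duration_s}s | State: ${r.state}`);
    console.log(`Query: ${r.query}`);
    console.log(`Waiting: ${r.waiting ?? "none"}\n`);
  }

  await client.end();
}

main().catch((err) => { console.error(err); process.exit(1); });
```

Any backend stuck on `LWLock/WALWrite` shows up with its PID, how long the `COMMIT` has been running, and the wait event it is blocked on.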

### 2. Application Performance Impact:
**Before (Normal):**
- Cart operations: ~200ms
- Database commits: ~50ms

**After (Broken):**
- Cart operations: 15,000-30,000ms (15-30 seconds)
- Database commits: 30,000ms+ (30+ seconds)

### 3. Application Logs Showing Slow Operations:
```
🕐 [LINE-ITEM-DEBUG] Medusa handler completed after 17226ms
http: POST /store/carts/.../line-items ← (200) - 17338.187 ms

🕐 [LINE-ITEM-DEBUG] Medusa handler completed after 8957ms  
http: POST /store/carts/.../line-items ← (200) - 9059.047 ms
```
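
For context, the `🕐 [LINE-ITEM-DEBUG]` lines come from simple timing instrumentation around the request. A minimal sketch of that kind of wrapper is below, written as an Express-style middleware (Medusa's HTTP layer is built on Express); the function name, path filter, and log format are illustrative, not our exact code.

```ts
// line-item-timing.ts -- illustrative timing middleware, not our exact instrumentation.
import type { NextFunction, Request, Response } from "express";

export function lineItemTiming(req: Request, res: Response, next: NextFunction): void {
  const start = Date.now();
  // Log once the response has been sent, so time spent inside the DB COMMIT is included.
  res.on("finish", () => {
    if (req.path.includes("/line-items")) {
      console.log(
        `🕐 [LINE-ITEM-DEBUG] Medusa handler completed after ${Date.now() - start}ms`
      );
    }
  });
  next();
}

// Hypothetical usage: app.use(lineItemTiming) before the store routes are mounted.
```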

### 4. Definitive Proof - Local PostgreSQL Comparison:
We tested the same application with local PostgreSQL:

**Railway Database (Broken):**
```
🕐 [LINE-ITEM-DEBUG] Medusa handler completed after 17226ms
```

**Local PostgreSQL (Working):**
```
🕐 [LINE-ITEM-DEBUG] Medusa handler completed after 178ms
```

**Result: 100x performance difference** - proving this is a Railway infrastructure issue, not our application code.

### 5. Database Connection Testing:
Direct database tests show:
- **Connection time**: 170-180ms (acceptable)
- **Simple SELECT queries**: 40-45ms (acceptable)  
- **Write operations/commits**: 15,000-30,000ms (BROKEN; see the timing sketch below)
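
The shape of the test behind these numbers is easy to reproduce. A minimal sketch follows, assuming Node.js with the `pg` client and `DATABASE_URL` set; the `speed_test` table is created and dropped by the script and is purely illustrative. Running the same script against a local PostgreSQL instance gives the comparison shown in section 4.

```ts
// db-speed-test.ts -- minimal sketch of the connection / read / write timing test.
// Assumes: `npm install pg` and DATABASE_URL set; the speed_test table is illustrative.
import { Client } from "pg";

async function timed(label: string, fn: () => Promise<unknown>): Promise<void> {
  const start = Date.now();
  await fn();
  console.log(`${label}: ${Date.now() - start}ms`);
}

async function main(): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });

  await timed("📡 Connection time", () => client.connect());
  await timed("⚡ Simple query time", () => client.query("SELECT 1"));

  // A real (non-TEMP) table so the COMMIT actually has to flush WAL to disk.
  await client.query(
    "CREATE TABLE IF NOT EXISTS speed_test (id serial PRIMARY KEY, note text)"
  );
  await timed("📝 Insert + COMMIT time", async () => {
    await client.query("BEGIN");
    await client.query("INSERT INTO speed_test (note) VALUES ($1)", ["ping"]);
    await client.query("COMMIT");
  });

  await client.query("DROP TABLE IF EXISTS speed_test");
  await client.end();
}

main().catch((err) => { console.error(err); process.exit(1); });
```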

## Root Cause Analysis:
The `LWLock/WALWrite` wait events indicate:
1. **Disk I/O bottleneck** on Railway's PostgreSQL server
2. **WAL files cannot be written** to disk in reasonable time (see the `pg_stat_wal` sketch below)
3. **All transactions block** waiting for WAL writes to complete
4. **Infrastructure-level issue** - not application-related
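
One way to corroborate the disk I/O hypothesis from inside the database is to sample `pg_stat_wal` (PostgreSQL 14+), which exposes cumulative WAL write/sync counts and, when `track_wal_io_timing` is on, the time spent writing and fsyncing WAL. A minimal sketch, assuming the same Node.js `pg` setup as above; whether `track_wal_io_timing` is enabled on Railway's managed Postgres is an assumption on our part.

```ts
// wal-stats.ts -- sample pg_stat_wal twice to estimate time spent flushing WAL.
// Requires PostgreSQL 14+; wal_write_time / wal_sync_time stay at 0 unless
// track_wal_io_timing = on (assumed here, may not be enabled on a managed instance).
import { Client } from "pg";

const SAMPLE_SQL = `
  SELECT wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time
  FROM pg_stat_wal`;

async function main(): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  const before = (await client.query(SAMPLE_SQL)).rows[0];
  await new Promise((resolve) => setTimeout(resolve, 10_000)); // 10-second window
  const after = (await client.query(SAMPLE_SQL)).rows[0];

  // Deltas over the window: if wal_sync_time approaches the window length,
  // commits are serialized behind WAL fsyncs, i.e. a disk I/O bottleneck.
  const delta = (col: string) => Number(after[col]) - Number(before[col]);
  console.log("WAL syncs:", delta("wal_sync"));
  console.log("WAL write time (ms):", delta("wal_write_time"));
  console.log("WAL sync time (ms):", delta("wal_sync_time"));

  await client.end();
}

main().catch((err) => { console.error(err); process.exit(1); });
```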

## Business Impact:
- **E-commerce application unusable** - customers cannot add items to cart
- **15-30 second delays** for all database write operations
- **Production system down** since January 8th, 2025
- **Revenue loss** due to inability to process orders

## Request:
1. **Immediate investigation** of WAL subsystem on our database server
2. **Database migration** to healthy server if current server has hardware issues
3. **Root cause analysis** of what caused this WAL blocking
4. **Prevention measures** to avoid future occurrences
5. **Service credit** consideration given the severity and duration

## Additional Information:
- We can provide **real-time monitoring logs** during the issue
- **Database appears healthy** for read operations
- **Issue is specific to write operations and transaction commits**
- **No application code changes** were made before the issue started


**Technical Contact**: Available for immediate debugging session if Railway engineers need real-time assistance.

**Note**: We have documented evidence of 100x performance improvement when switching to local PostgreSQL, confirming this is definitively a Railway infrastructure issue.

# Evidence Package for Railway Support

## Files to Attach to Support Ticket:

### 1. Database Monitoring Logs
**File**: `database-monitoring-logs.txt`
**Content**: Copy the terminal output showing WAL blocking:
```
🔥 ACTIVE QUERIES:
PID: 54164 | Duration: 30.26s | State: active
Query: COMMIT;
Waiting: LWLock/WALWrite
```


### 3. Local vs Railway Comparison
**File**: `performance-comparison.txt`
**Content**:
```
RAILWAY DATABASE (BROKEN):
🕐 [LINE-ITEM-DEBUG] Medusa handler completed after 17226ms
http: POST /store/carts/.../line-items ← (200) - 17338.187 ms

LOCAL POSTGRESQL (WORKING):  
🕐 [LINE-ITEM-DEBUG] Medusa handler completed after 178ms
http: POST /store/carts/.../line-items ← (200) - 183.380 ms

PERFORMANCE DIFFERENCE: 100x slower on Railway
```

### 4. Database Speed Test Results
**File**: `database-speed-test.txt` 
**Content**: Copy your Railway speed test results:
```
🔍 Testing Railway database speed...
📡 Connection time: 177ms
⚡ Simple query time: 40ms  
🛒 Cart query time: 43ms
✅ Connection is OK
✅ Simple queries are OK  
✅ Cart queries are OK
(BUT write operations take 30+ seconds)
```

Solved

11 Replies

shxkm
PRO

4 months ago

While this is surely AI-generated, I’m experiencing performance issues as well, possibly related to disk I/O.


Railway
BOT

4 months ago

Hello!

We've escalated your issue to our engineering team.

We aim to provide an update within 1 business day.

Please reply to this thread if you have any questions!

Status changed to Awaiting User Response Railway 4 months ago


angelo-railway

Update y'all, we've shipped another patch that should address this. We have two more coming down the line that will help with outstanding p95 performance.

shxkm
PRO

4 months ago

Is this only for writes, or cold-reads as well?


Status changed to Awaiting Railway Response Railway 4 months ago


angelo-railway

Both reads and writes, but the issue is VERY obvious for writes.

Aware and trying to work through it. I think we owe you and everyone else an engineering blog post on what happened once it's all said and done.


Status changed to Awaiting User Response Railway 4 months ago


In the meantime: when we rolled out Metal, we had the option to either go fully native on the storage system, which would have invalidated everyone's backups (a no-no), or run the storage solution on an "emulation" bridge that kept compatibility so we wouldn't have two different types of volumes. This bridged approach is proving to be a lot of trouble, hence the patches.

We have engaged a hardware storage consultant who has been advising us as we address these performance issues as they come up. P50 perf is back to nominal, but the edge cases are what's hitting OP, and likely you, at times.


If it were like it was during beta, you'd see the forum light up in disgust, so I want to be clear that not everyone is facing this. But for the unlucky 0.01 of users, that's still 15,000 people. Not insignificant.



shxkm
PRO

4 months ago

Thank you. I appreciate this getting acknowledged, and as an engineer, the details give important context and are reassuring. I gotta say my initial impression was that Railway wasn’t taking this nearly as seriously as they should.

I’m not sure what makes my case an “edge case”. When I was testing Postgres on Railway my database (basically a clone) had 0% of its write throughput (and it gets written to a lot when the workers are active) and less than 0.5% of its read throughput.

I’ve since moved the web and workers to Railway but for now my database stays somewhere else (and I have to pay some gross egress fees). The risk is just too high for me. Each such database move is 5 hours of total downtime in my case.


Status changed to Awaiting Railway Response Railway 4 months ago



shxkm
PRO

4 months ago

> that would have invalidated everyone's backup (which is a no-no) or run the storage solution on a "emulation" bridge that kept compatibility so that we wouldn't have two different types of volumes.

Also, kinda curious why not do the bridging only for existing volumes, and have new setups (like mine for example) get the “full on native” experience?

(Don’t worry, you don’t have to tell me that it’s not that simple. I know.)


angelo-railway

So we did try this, but then it bifurcated the experience. Customers would not know whether they had a "new" volume vs. a "legacy" volume, and we wouldn't know whether backups or certain functionality did or didn't work.

Varying amounts of pain. This was the least-worst option, we feel, as bad as it is.


Status changed to Awaiting User Response Railway 4 months ago


Okay, we're down: avg. write latency is down by a lot. We should see relief, knock on wood.


Railway
BOT

4 months ago

✅ The ticket Slow transaction processing on database has been marked as completed.


Railway
BOT

3 months ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway 3 months ago

