2 months ago
## Summary
The internal PostgreSQL URL (postgres.railway.internal:5432) was completely unreachable for roughly 30+ minutes, causing all API requests to time out. The public proxy URL worked fine throughout. After the issue self-resolved, internal URL latency remained 8-10x higher than the public URL's.
## Environment
- Project ID: 9e652f25-f137-48d5-b587-5f1057f75ac6
- Project Name: firebackup.io
- Environment: production
- Affected Services: api, backup-worker, pitr-worker
- Database: Railway-managed PostgreSQL
- Region: (Railway default)
## Issue Timeline (January 6, 2026)
### 14:21 UTC - Issue Started
- API requests started timing out after exactly 30 seconds
- Health endpoint (`/health`) returned 503; the database health check timed out
- All endpoints requiring database access failed
- Logs showed: `[PrismaService] Disconnecting from PostgreSQL database...`
### 14:21 - 15:07 UTC - Investigation
- Confirmed database was reachable via public proxy URL (trolley.proxy.rlwy.net:21973)
- Local connection test to public URL: 610ms, successful
- Internal URL (postgres.railway.internal:5432) was completely unreachable from the API service
### 15:07 UTC - Workaround Applied
- Changed DATABASE_URL from internal to public URL
- Service immediately recovered
- Health check: 4ms latency, healthy
### 15:28 UTC - Internal URL Recovered
- Created diagnostic endpoint to test both URLs
- Internal URL now working but significantly slower:
  - Internal: 223-238ms
  - Public: 2-30ms
## Diagnostic Evidence
### During Outage
```
# Health endpoint (using internal URL)
HTTP 503 - Service Unavailable
Response time: 5.3 seconds (timeout)
Database health check: FAILED - timed out after 5000ms
```
### After Switching to Public URL
```
# Health endpoint (using public URL)
HTTP 200 - OK
Response time: 0.4 seconds
Database: healthy, latency 4ms
```
### Current State (Both URLs Working)
```json
{
  "internalUrl": {
    "masked": "postgresql://postgres.railway.internal:5432/railway",
    "status": "healthy",
    "latencyMs": 238
  },
  "publicUrl": {
    "masked": "postgresql://trolley.proxy.rlwy.net:21973/railway",
    "status": "healthy",
    "latencyMs": 30
  }
}
```
## Expected vs Actual Behavior
| Aspect | Expected | Actual |
|--------|----------|--------|
| Internal URL availability | Always available | Was completely down for 30+ minutes |
| Internal URL latency | Lower than public (same network) | 8-10x slower than public URL |
| Service recovery | Automatic | Required manual switch to public URL |
## Questions
1. What caused the internal networking outage? Was there maintenance or an infrastructure issue around 14:21 UTC?
2. Why is internal URL latency higher than public? Internal networking should be faster, not slower. Is there a DNS resolution or routing issue?
3. How can we be notified of internal networking issues? We had no visibility into this problem until users reported timeouts.
4. Is the public proxy URL recommended for production? Given the reliability issues with internal URLs, should we continue using the public proxy?
## Diagnostic Endpoint
We've created a diagnostic endpoint for ongoing monitoring:
```
GET https://api.firebackup.io/health-internal
```
This tests both internal and public database URLs and reports latency/status for each.
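For reference, the core of such an endpoint can be sketched as a probe wrapper that times an async check with a timeout and never throws, so one failing URL can't break the whole diagnostic response. This is a hedged sketch, not our actual implementation; it assumes a probe such as a `SELECT 1` per URL:

```javascript
// Runs an async probe (e.g. a `SELECT 1` via a Prisma client created for
// each URL, which is an assumption here) with a timeout, and reports
// status + latency instead of throwing, so both URLs always get reported.
async function timedCheck(probe, timeoutMs = 5000) {
  const start = process.hrtime.bigint();
  let timer;
  try {
    await Promise.race([
      probe(),
      new Promise((_, reject) => {
        timer = setTimeout(
          () => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
      }),
    ]);
    return { status: 'healthy', latencyMs: Number(process.hrtime.bigint() - start) / 1e6 };
  } catch (err) {
    return { status: 'unhealthy', error: err.message };
  } finally {
    clearTimeout(timer); // don't leave the timeout pending after success
  }
}
```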
## Current Workaround
We've switched all services to use the public proxy URL:
```
DATABASE_URL=postgresql://postgres:***@trolley.proxy.rlwy.net:21973/railway
```
The internal URL is configured separately for monitoring:
```
DATABASE_INTERNAL_URL=postgresql://postgres:***@postgres.railway.internal:5432/railway
```
## Impact
- Duration: ~45 minutes of degraded service
- User Impact: All authenticated API requests failed
- Services Affected: API, backup-worker, pitr-worker
- Data Loss: None (database itself was healthy)
---
Submitted by: Firebackup.io Team
Date: January 6, 2026
Priority: High (production outage)
6 Replies
2 months ago
Hello,
I have checked our systems and nothing has gone wrong on our end.
However, I have noticed that your API service was pinned above 1 vCPU. This is an issue because Node is single-threaded. Anything at or above 1 vCPU causes Node's network stack to falter.
Since this isn't an issue with our platform or product, I will open this up to the community so they can help you pinpoint any potential issues with your application.
Best,
Brody
Status changed to Awaiting User Response Railway • 2 months ago
2 months ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open brody • 2 months ago
2 months ago
can you try checking the latency using the cli from inside the container? that way you can tell whether it's a network issue or an app issue. are you using SSL with the internal url? (if yes, you should remove it)
2 months ago
@brody confirmed that your api was pinned above 1 vcpu, and that's likely your culprit. node's single-threaded, so when the cpu maxes out, the event loop can't process anything, including db connections.
check your railway metrics for cpu usage during 14:21-15:07 utc. if it was pegged at 100%+ that explains everything.
quick things to try:
- scale your api service horizontally (more instances)
- look for blocking operations or cpu intensive code in your request handlers
- check if you have any sync file operations or heavy compute running on the main thread
- profile what's eating cpu during peak load
if cpu wasn't actually high during the outage then something else is going on and we can dig deeper. but the symptoms (exact 30s timeouts, affecting all network io) match cpu saturation perfectly.
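one way to confirm or rule this out in future incidents is to log event-loop lag: if the main thread is blocked, every timer fires late, and db timeouts follow. a minimal sketch (not from the firebackup codebase):

```javascript
// Measures event-loop lag: schedules a timer and reports how late it
// fires. If the main thread is CPU-bound, all I/O (including database
// connections) stalls, and the lag spikes.
function loopLagMs(sampleMs = 100) {
  return new Promise((resolve) => {
    const start = process.hrtime.bigint();
    setTimeout(() => {
      const elapsed = Number(process.hrtime.bigint() - start) / 1e6;
      resolve(Math.max(0, elapsed - sampleMs)); // lag beyond the scheduled delay
    }, sampleMs);
  });
}

// Demo: synchronous busy work on the main thread shows up as lag.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {} // simulates CPU-bound request handling
}
```

logging `loopLagMs()` once a second would have shown whether the event loop was actually saturated during 14:21-15:07 utc.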
once you fix the cpu issue switch back to internal urls, they should be faster than public proxy
hope this helps :)
2 months ago
I'm not sure that's the real problem. If you look at the link he shared, you can see the latency difference between the internal and external db URLs, and it's large. I would expect the external URL's latency to be higher than the internal's, but it's the opposite. I'd also expect both latencies to be high if it were a CPU issue:
```json
{
  "timestamp": "2026-01-06T16:43:52.071Z",
  "internalUrl": {
    "configured": true,
    "masked": "postgresql://postgres.railway.internal:5432/railway",
    "status": "healthy",
    "latencyMs": 235
  },
  "publicUrl": {
    "configured": true,
    "masked": "postgresql://trolley.proxy.rlwy.net:21973/railway",
    "status": "healthy",
    "latencyMs": 2
  },
  "recommendation": "Both URLs working. Internal URL can be used for lower latency."
}
```
The public URL has 2ms latency and the internal 235ms; that's a big difference. Given that both checks run in the same container (I assume), they use the same CPU, so I would expect both latencies to be high, not just one. I don't think scaling the service will resolve this.
2 months ago
Their testing is flawed. The test for the private network involves creating a new Prisma client, creating a pool, doing a DNS lookup, etc., and the test for the public network uses the already existing Prisma client with an already established connection.
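If that's the case, a fairer comparison is to warm each client up first and report a median over several runs, so one-time costs (client creation, DNS lookup, pool setup) aren't attributed to only one URL. A sketch, assuming the probe is something like a `SELECT 1` through each client:

```javascript
// Median latency over several runs, after one discarded warm-up call,
// so first-call setup cost (new client, DNS, pool) is excluded.
async function medianLatencyMs(probe, samples = 5) {
  await probe(); // warm-up: pays the one-time setup cost once, discarded
  const times = [];
  for (let i = 0; i < samples; i++) {
    const start = process.hrtime.bigint();
    await probe();
    times.push(Number(process.hrtime.bigint() - start) / 1e6);
  }
  times.sort((a, b) => a - b);
  return times[Math.floor(times.length / 2)];
}
```

Run the same way against both URLs, this would make the 235ms vs 2ms numbers directly comparable.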
2 months ago
ok, I can't see the code, so I assumed they used the same logic for both; in that case, ignore my comment!

