2 months ago
## Summary
The internal PostgreSQL URL (postgres.railway.internal:5432) was completely unreachable for roughly 30+ minutes, causing all API requests to time out. The public proxy URL worked fine throughout. After the issue self-resolved, internal URL latency remained 8-10x higher than the public URL's.
## Environment
- Project ID: 9e652f25-f137-48d5-b587-5f1057f75ac6
- Project Name: firebackup.io
- Environment: production
- Affected Services: api, backup-worker, pitr-worker
- Database: Railway-managed PostgreSQL
- Region: (Railway default)
## Issue Timeline (January 6, 2026)
### 14:21 UTC - Issue Started
- API requests started timing out after exactly 30 seconds
- Health endpoint (`/health`) returned 503; the database health check timed out
- All endpoints requiring database access failed
- Logs showed: `[PrismaService] Disconnecting from PostgreSQL database...`
### 14:21 - 15:07 UTC - Investigation
- Confirmed database was reachable via public proxy URL (trolley.proxy.rlwy.net:21973)
- Local connection test to public URL: 610ms, successful
- Internal URL (postgres.railway.internal:5432) was completely unreachable from the API service
### 15:07 UTC - Workaround Applied
- Changed DATABASE_URL from internal to public URL
- Service immediately recovered
- Health check: 4ms latency, healthy
### 15:28 UTC - Internal URL Recovered
- Created diagnostic endpoint to test both URLs
- Internal URL now working but significantly slower:
  - Internal: 223-238ms
  - Public: 2-30ms
## Diagnostic Evidence
### During Outage
```
# Health endpoint (using internal URL)
HTTP 503 - Service Unavailable
Response time: 5.3 seconds (timeout)
Database health check: FAILED - timed out after 5000ms
```
### After Switching to Public URL
```
# Health endpoint (using public URL)
HTTP 200 - OK
Response time: 0.4 seconds
Database: healthy, latency 4ms
```
### Current State (Both URLs Working)
```json
{
  "internalUrl": {
    "masked": "postgresql://postgres.railway.internal:5432/railway",
    "status": "healthy",
    "latencyMs": 238
  },
  "publicUrl": {
    "masked": "postgresql://trolley.proxy.rlwy.net:21973/railway",
    "status": "healthy",
    "latencyMs": 30
  }
}
```
## Expected vs Actual Behavior
| Aspect | Expected | Actual |
|--------|----------|--------|
| Internal URL availability | Always available | Was completely down for 30+ minutes |
| Internal URL latency | Lower than public (same network) | 8-10x slower than public URL |
| Service recovery | Automatic | Required manual switch to public URL |
## Questions
1. What caused the internal networking outage? Was there maintenance or an infrastructure issue around 14:21 UTC?
2. Why is internal URL latency higher than public? Internal networking should be faster, not slower. Is there a DNS resolution or routing issue?
3. How can we be notified of internal networking issues? We had no visibility into this problem until users reported timeouts.
4. Is the public proxy URL recommended for production? Given the reliability issues with internal URLs, should we continue using the public proxy?
## Diagnostic Endpoint
We've created a diagnostic endpoint for ongoing monitoring:
```
GET https://api.firebackup.io/health-internal
```
This tests both internal and public database URLs and reports latency/status for each.
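For reference, the core of such an endpoint can be sketched as a probe wrapper that times an async check with a timeout and never throws, so one failing URL can't break the whole diagnostic response. This is a hedged sketch, not our actual implementation; it assumes a probe such as a `SELECT 1` per URL:

```javascript
// Runs an async probe (e.g. a `SELECT 1` via a Prisma client created for
// each URL, which is an assumption here) with a timeout, and reports
// status + latency instead of throwing, so both URLs always get reported.
async function timedCheck(probe, timeoutMs = 5000) {
  const start = process.hrtime.bigint();
  let timer;
  try {
    await Promise.race([
      probe(),
      new Promise((_, reject) => {
        timer = setTimeout(
          () => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
      }),
    ]);
    return { status: 'healthy', latencyMs: Number(process.hrtime.bigint() - start) / 1e6 };
  } catch (err) {
    return { status: 'unhealthy', error: err.message };
  } finally {
    clearTimeout(timer); // don't leave the timeout pending after success
  }
}
```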
## Current Workaround
We've switched all services to use the public proxy URL:
```
DATABASE_URL=postgresql://postgres:***@trolley.proxy.rlwy.net:21973/railway
```
The internal URL is configured separately for monitoring:
```
DATABASE_INTERNAL_URL=postgresql://postgres:***@postgres.railway.internal:5432/railway
```
## Impact
- Duration: ~45 minutes of degraded service
- User Impact: All authenticated API requests failed
- Services Affected: API, backup-worker, pitr-worker
- Data Loss: None (database itself was healthy)
---
Submitted by: Firebackup.io Team
Date: January 6, 2026
Priority: High (production outage)
6 Replies
2 months ago
Hello,
I have checked our systems and nothing has gone wrong on our end.
However, I have noticed that your API service was pinned above 1 vCPU. This is an issue because Node is single-threaded. Anything at or above 1 vCPU causes Node's network stack to falter.
Since this isn't an issue with our platform or product, I will open this up to the community so they can help you pinpoint any potential issues with your application.
Best,
Brody
Status changed to Awaiting User Response Railway • 2 months ago
2 months ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open brody • 2 months ago
2 months ago
can you try checking the latency using the cli from inside the container? that way you can tell whether it's a network issue or an app issue. are you using SSL with the internal url? (if yes, you should remove it)
2 months ago
@brody confirmed that your api was pinned above 1 vcpu, and that's likely your culprit. node's single-threaded, so when the cpu maxes out, the event loop can't process anything, including db connections.
check your railway metrics for cpu usage during 14:21-15:07 utc. if it was pegged at 100%+ that explains everything.
quick things to try:
- scale your api service horizontally (more instances)
- look for blocking operations or cpu intensive code in your request handlers
- check if you have any sync file operations or heavy compute running on the main thread
- profile what's eating cpu during peak load
if cpu wasn't actually high during the outage then something else is going on and we can dig deeper. but the symptoms (exact 30s timeouts, affecting all network io) match cpu saturation perfectly.
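one way to confirm or rule this out in future incidents is to log event-loop lag: if the main thread is blocked, every timer fires late, and db timeouts follow. a minimal sketch (not from the firebackup codebase):

```javascript
// Measures event-loop lag: schedules a timer and reports how late it
// fires. If the main thread is CPU-bound, all I/O (including database
// connections) stalls, and the lag spikes.
function loopLagMs(sampleMs = 100) {
  return new Promise((resolve) => {
    const start = process.hrtime.bigint();
    setTimeout(() => {
      const elapsed = Number(process.hrtime.bigint() - start) / 1e6;
      resolve(Math.max(0, elapsed - sampleMs)); // lag beyond the scheduled delay
    }, sampleMs);
  });
}

// Demo: synchronous busy work on the main thread shows up as lag.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {} // simulates CPU-bound request handling
}
```

logging `loopLagMs()` once a second would have shown whether the event loop was actually saturated during 14:21-15:07 utc.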
once you fix the cpu issue switch back to internal urls, they should be faster than public proxy
hope this helps :)
2 months ago
I'm not sure that's the real problem. If you look at the link he shared, you can see the latency difference between the internal and external db URLs, and it's large. I would expect the external URL's latency to be higher than the internal's, but it's the opposite. I'd also expect both latencies to be high if it were a CPU issue:
```json
{
  "timestamp": "2026-01-06T16:43:52.071Z",
  "internalUrl": {
    "configured": true,
    "masked": "postgresql://postgres.railway.internal:5432/railway",
    "status": "healthy",
    "latencyMs": 235
  },
  "publicUrl": {
    "configured": true,
    "masked": "postgresql://trolley.proxy.rlwy.net:21973/railway",
    "status": "healthy",
    "latencyMs": 2
  },
  "recommendation": "Both URLs working. Internal URL can be used for lower latency."
}
```
The public URL has 2ms latency and the internal 235ms; that's a big difference. Given that both checks run in the same container (I assume), they use the same CPU, so I would expect both latencies to be high, not just one. I don't think scaling the service will resolve this.
2 months ago
Their testing is flawed. The test for the private network involves creating a new Prisma client, creating a pool, doing a DNS lookup, etc., and the test for the public network uses the already existing Prisma client with an already established connection.
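If that's the case, a fairer comparison is to warm each client up first and report a median over several runs, so one-time costs (client creation, DNS lookup, pool setup) aren't attributed to only one URL. A sketch, assuming the probe is something like a `SELECT 1` through each client:

```javascript
// Median latency over several runs, after one discarded warm-up call,
// so first-call setup cost (new client, DNS, pool) is excluded.
async function medianLatencyMs(probe, samples = 5) {
  await probe(); // warm-up: pays the one-time setup cost once, discarded
  const times = [];
  for (let i = 0; i < samples; i++) {
    const start = process.hrtime.bigint();
    await probe();
    times.push(Number(process.hrtime.bigint() - start) / 1e6);
  }
  times.sort((a, b) => a - b);
  return times[Math.floor(times.length / 2)];
}
```

Run the same way against both URLs, this would make the 235ms vs 2ms numbers directly comparable.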
2 months ago
ok, I can't see the code, so I assumed they used the same logic for both; in that case, ignore my comment!

