Summary
My n8n Primary service intermittently returns 500 Internal Server Error responses from Railway's Metal Edge proxy layer. The errors do not originate from my application — they occur at the edge proxy level before the request reaches my container. My application logs show zero errors, zero crashes, and normal resource usage during these incidents. I need to understand what is causing this and whether there is any configuration available to prevent it.
Service Details
Service: Primary
Application: n8n (self-hosted workflow automation, Node.js)
Replicas: 1
The Problem
Intermittent 500 Internal Server Error responses are returned by the Metal Edge proxy when external services send POST requests to my service. The pattern is:
Multiple different webhook paths fail simultaneously
Some requests succeed while others fail at the exact same time — it is not a complete outage
The failures cluster in windows of 1-5 minutes
My application logs show zero errors during these windows — the request never reaches my container
CPU, memory, and network metrics show no anomalies during failures
The 500 response is returned in under 15ms, which is too fast to be an application-level timeout
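Because the sub-15ms turnaround is the clearest fingerprint of an edge-level failure, the response latency itself can be used to classify failures. A minimal sketch of that check (the 15 ms cutoff comes from the observations above; the label names and the `probe` helper are illustrative, not part of any Railway API):

```javascript
// Sketch only: the 15 ms cutoff is taken from the observed responses;
// the label strings are arbitrary.
function classifyFailure(status, elapsedMs) {
  if (status < 500) return 'not-a-server-error';
  // A 500 returned faster than the app could plausibly answer points at the proxy.
  return elapsedMs < 15 ? 'likely-edge-proxy' : 'likely-application';
}

// How a monitoring probe could use it (Node 18+ / Workers global fetch assumed):
async function probe(url) {
  const start = performance.now();
  const resp = await fetch(url, { method: 'POST', body: '{}' });
  const elapsedMs = performance.now() - start;
  return { status: resp.status, elapsedMs, verdict: classifyFailure(resp.status, elapsedMs) };
}
```

Recording the verdict alongside each failure makes it easy to show that the incidents cluster entirely on the "likely-edge-proxy" side.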
Evidence
Incident: February 24, 2026 — 04:04 to 04:08 UTC
What my external monitoring observed (Cloudflare Worker):
05:04:42 [FAILED] /webhook/XXX → 500 after 3 attempts
...

What my application logs show during the same window:
Normal operation. Executions being enqueued and completed throughout:
Feb 24 2026 05:04:02 Primary Enqueued execution 1485572 (job 146607)
Feb 24 2026 05:04:03 Primary Execution 1485572 (job 146607) finished
Feb 24 2026 05:04:04 Primary Enqueued execution 1485573 (job 146608)
Feb 24 2026 05:04:05 Primary Execution 1485573 (job 146608) finished
Feb 24 2026 05:04:11 Primary Execution 1485545 (job 146606) finished
Feb 24 2026 05:04:15 Primary Enqueued execution 1485575 (job 146609)
Feb 24 2026 05:04:15 Primary Execution 1485575 (job 146609) finished

No error messages, no crashes, no restarts, no "Last session crashed" entries. The application was running normally and successfully processing the requests that made it through.
Resource metrics during the incident:
CPU: ~0.05 vCPU (normal, no spike)
Memory: ~400-500MB (normal, no spike)
No deployment or restart occurred
Traffic Volume During the Incident Was Below Average
Hour of the incident (04:00–05:00 UTC) — 213 total messages:
Time Slot (UTC) Messages
04:00 – 04:05 14 ← failures started
04:05 – 04:10 4 ← peak failure period
04:10 – 04:15 14
04:15 – 04:20 21
04:20 – 04:25 17
04:25 – 04:30 15
04:30 – 04:35 20
04:35 – 04:40 12
04:40 – 04:45 22
04:45 – 04:50 26
04:50 – 04:55 26
04:55 – 05:00 22
The 500 errors occurred at below-average load, ruling out application overload as a cause.
Previous Incidents (Before Metal Edge)
Before the domain was moved to Metal Edge, similar issues occurred through Railway's Fastly/Varnish CDN layer:
403 Forbidden responses with HTML body from Varnish
Response headers:
Server: Varnish
X-Railway-CDN-Edge: fastly/cache-fra-etou8220198-FRA
Railway network logs showed TCP_OVERWINDOW and NO_SOCKET errors.
Same pattern: application healthy, no errors in logs, low resource usage.
My Hypothesis
The edge proxy appears to have aggressive TCP connection timeout thresholds. When my Node.js application's event loop takes slightly longer to service its listening socket (due to normal operations like database writes, Redis queue management, or garbage collection), the edge proxy seems to consider the backend unreachable and returns a 500 immediately, rather than letting the connection wait in the kernel's TCP listen backlog.
This would explain why:
Some requests succeed (arrive when event loop is free to accept)
Others fail simultaneously (arrive when event loop is in the middle of a tick)
The 500 is returned in under 15ms (edge proxy decision, not application timeout)
Application logs show no errors (request never reaches the application)
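A single-process sketch of the mechanism I suspect (assumptions: the busy-wait stands in for GC or synchronous database/queue work, and the zero-delay timer stands in for the callback that would service a newly accepted connection; in reality the kernel completes the TCP handshake via the listen backlog, but the application cannot respond until the loop is free):

```javascript
// Demonstrates that synchronous work stalls every event-loop callback,
// including the one that would service an incoming request.
function blockEventLoop(ms) {
  const until = Date.now() + ms;
  while (Date.now() < until) {} // stand-in for GC / synchronous DB or queue work
}

// A zero-delay timer scheduled before the stall cannot fire until the stall
// ends, just as a request arriving mid-tick cannot be serviced until then.
const scheduledAt = Date.now();
setTimeout(() => {
  console.log(`timer fired after ${Date.now() - scheduledAt} ms (scheduled for 0 ms)`);
}, 0);
blockEventLoop(120);
```

If the edge proxy's give-up threshold is shorter than such a stall, requests arriving during the stall would fail even though the process is healthy, which matches the partial-failure pattern above.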
Current Workaround
I have deployed a Cloudflare Worker that sits between my callers and Railway and retries requests that come back with a fast 500 (up to 3 attempts, as seen in the monitoring output above). This masks the symptom but does not address the root cause.
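A stripped-down sketch of that Worker (hypothetical names: UPSTREAM is a placeholder for my Railway domain, and the 3-attempt linear backoff is my choice, not anything Railway recommends):

```javascript
const UPSTREAM = 'https://example.up.railway.app'; // placeholder, not my real domain

// Retry policy kept as pure functions so it is easy to reason about:
const MAX_ATTEMPTS = 3;
const shouldRetry = (status, attempt) => status === 500 && attempt < MAX_ATTEMPTS;
const backoffMs = (attempt) => 250 * attempt; // linear backoff between attempts

const worker = {
  async fetch(request) {
    // Buffer the body once so it can be replayed on each attempt.
    const body = request.method === 'GET' || request.method === 'HEAD'
      ? undefined
      : await request.arrayBuffer();
    const url = new URL(request.url);
    const target = UPSTREAM + url.pathname + url.search;

    let resp;
    for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      resp = await fetch(target, { method: request.method, headers: request.headers, body });
      if (!shouldRetry(resp.status, attempt)) break;
      await new Promise((r) => setTimeout(r, backoffMs(attempt)));
    }
    return resp;
  },
};
// In the actual Worker this object is the module's `export default`.
```

Retrying only bare 500s keeps legitimate application errors (4xx, 502/503, slow 500s) flowing through unchanged.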
Environment
Railway region: GCP europe-west4
Single replica (cannot run multiple due to application architecture constraints)
Typical request volume: ~10,000 POST requests/day, peaks of 20-30 requests/minute
Average response time when healthy: 50-200ms