2 months ago
Hello
We had a traffic spike on January 8th around 20:00 UTC.
On our Node API service we saw the following errors in the logs:
"upstreamErrors": "[{\"deploymentInstanceID\":\"f55af9d8-2152-4036-aac3-bf051f52f287\",\"duration\":13121,\"error\":\"an unknown error occurred\"},{\"deploymentInstanceID\":\"6acb3dab-a3cc-4818-8bde-8fd77bdc9222\",\"duration\":5000,\"error\":\"connection dial timeout\"},{\"deploymentInstanceID\":\"d257e79c-a2c4-485d-b114-1ea1a1d50dbd\",\"duration\":5000,\"error\":\"connection dial timeout\"}]"
Could you help us understand what went wrong? We don't see any spike in CPU/RAM, but the service returned some HTTP 502s to clients.
Thanks
8 Replies
2 months ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open brody • 2 months ago
2 months ago
Do you use third-party services? It seems like your service is making a request that times out...
2 months ago
If you are using Node.js, a single synchronous operation or a blocked event loop can cause this. However, that usually spikes the CPU. If CPU was flat, it's more likely the event loop was waiting on a promise that never resolved or on a slow external API call: since CPU was low, your application was probably waiting on I/O.
Under normal load, a server accepts a connection in milliseconds. A 5-second dial timeout means the process was running but unresponsive to new network traffic.
You should check your database metrics to see if it's the bottleneck, and whether active connections hit the pool limit. An external API you call could also be the cause.
I also recommend implementing timeouts for database queries and external API calls, so you are never awaiting a slow promise without a deadline.
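The timeout idea above can be sketched with a small wrapper. This is a minimal illustration, not part of Node or Railway; `withTimeout` is a hypothetical helper name:

```typescript
// Minimal sketch of a timeout guard for awaited calls, so a stalled
// dependency fails fast instead of holding the request open until the
// proxy gives up. `withTimeout` is an illustrative helper, not a Node API.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer to avoid leaks.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

You would then write, for example, `await withTimeout(pool.query(sql), 3000, "db query")` so a hung query surfaces as an error in your own logs instead of a 502 at the edge.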
19 days ago
A 502 in Railway usually means the proxy didn’t receive a valid response from your service in time.
Based on your logs:
• connection dial timeout (5000 ms)
• unknown upstream error
• multiple deploymentInstanceID entries
This suggests the service was reachable but not responding fast enough under peak load.
Even if CPU/RAM didn’t spike, common causes in Node services are:
1. Exhausted database connection pool
2. Blocking operations in the event loop
3. Slow external API dependencies
4. Missing horizontal scaling
5. Long-running synchronous code
I would recommend checking:
• DB pool size vs concurrent requests
• Any synchronous CPU-heavy operations
• External API latency
• Whether autoscaling was enabled
• Railway service concurrency limits
Also verify that your service responds on process.env.PORT and does not block the event loop under load.
If possible, simulate traffic with a load test to reproduce the behavior.
19 days ago
If you're using a DB (Postgres, Mongo, etc.), monitor connection pool saturation during traffic spikes.
A common pattern is that requests queue waiting for DB connections, causing the proxy timeout.
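That queueing behaviour is easy to model. The class below is illustrative only (not the pg API; node-postgres exposes its own `waitingCount` on a `Pool`), but it shows why latency climbs while CPU stays flat: a fixed-size pool is effectively a semaphore.

```typescript
// Illustrative sketch (not the pg API): once all `max` slots are busy,
// new requests queue, and their latency includes the wait time even
// though the CPU is idle.
class PoolSemaphore {
  private queue: Array<() => void> = [];
  private free: number;
  constructor(max: number) { this.free = max; }
  get waiting(): number { return this.queue.length; }
  async acquire(): Promise<void> {
    if (this.free > 0) { this.free--; return; }
    await new Promise<void>((grant) => this.queue.push(grant)); // wait in line
  }
  release(): void {
    const next = this.queue.shift();
    if (next) next(); // hand the slot directly to the next waiter
    else this.free++;
  }
}
```

If the real pool's waiting count stays above zero during a spike, the proxy will time out exactly the requests stuck in that line.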
18 days ago
Thank you all for your answers and sorry for the delay.
We see nothing pointing to the API's CPU/RAM being high. Nothing on the DB (NeonDB) side either: CPU/RAM and connection counts are all low.
Our research suggests it could come from a mismatch between the keepAliveTimeout settings of the Railway load balancer and our API:
-----
Problem: Timeouts/502 errors occur only at the Railway load balancer; requests never reach the API (its logs are empty).
Cause: Railway keeps keep-alive connections open for 60 seconds, but Node.js closes them after 5 seconds by default, so zombie sockets get reused by the LB.
(~1000 concurrent connections currently active, well below the 10k max.)
Solution, in main.ts:

const server = app.getHttpServer();
server.keepAliveTimeout = 65000;
server.headersTimeout = 70000;

Why it works: with keepAliveTimeout > 60 s, Node.js keeps sockets alive longer than the LB, so the LB closes them cleanly first.
The higher headersTimeout prevents a race condition in Node.js when receiving headers on reused connections.
-----
We are currently testing this.
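For anyone hitting this without NestJS, the same fix applies to a plain Node http server; the handler and port fallback below are assumptions for illustration:

```typescript
// Same keep-alive fix sketched for a plain Node http server (no NestJS).
import * as http from "node:http";

const server = http.createServer((_req, res) => {
  res.end("ok");
});

// Keep idle sockets open longer than the LB's 60 s keep-alive, so the LB
// always closes first and never reuses a socket Node has half-closed.
server.keepAliveTimeout = 65_000;
// headersTimeout should exceed keepAliveTimeout, so reused sockets are not
// torn down while request headers are still arriving.
server.headersTimeout = 70_000;

server.listen(Number(process.env.PORT) || 0);
```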
17 days ago
Since my update above, we have fewer 502 errors, but there are still big differences in HTTP request duration between Railway metrics and what we see in Datadog. Postgres (Neon) and Redis (Railway) are not the issue. We don't handle traffic spikes smoothly, and we didn't have this problem before on Heroku.
See attachments: Railway API metrics / NeonDB metrics / Datadog API metrics
Could it be connection limits on Railway's side (load balancer concurrency, HTTP rate limits) as described in this documentation: https://docs.railway.com/networking/public-networking/specs-and-limits ?
7 days ago
We still have the issue every 2-3 days. It happened again today and I managed to capture it (see screenshot): a 503 "Backend.max_conn reached" was returned while I tried to access our production app.
It appears to come from Railway's Fastly edge proxy.
Railway, please do something.
Attachments