Traffic spike errors: HTTP 502
epether
PRO · OP

2 months ago

Hello

We had a traffic spike January 8th around 20h UTC.

On our Node API service we saw these errors in the logs:

"upstreamErrors": "[{\"deploymentInstanceID\":\"f55af9d8-2152-4036-aac3-bf051f52f287\",\"duration\":13121,\"error\":\"an unknown error occurred\"},{\"deploymentInstanceID\":\"6acb3dab-a3cc-4818-8bde-8fd77bdc9222\",\"duration\":5000,\"error\":\"connection dial timeout\"},{\"deploymentInstanceID\":\"d257e79c-a2c4-485d-b114-1ea1a1d50dbd\",\"duration\":5000,\"error\":\"connection dial timeout\"}]"

Could you help us understand what went wrong? We don't see any spike in CPU/RAM, but the service returned some HTTP 502 errors to the clients.

Thanks

$30 Bounty

8 Replies

Railway
BOT

2 months ago

Hey there! We've found the following might help you get unblocked faster:

If you find the answer from one of these, please let us know by solving the thread!


2 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open brody 2 months ago


fra
HOBBY · Top 10% Contributor

2 months ago

Do you use third-party services? It seems like your service is making a request that times out…


darseen
HOBBY · Top 5% Contributor

2 months ago

If you are using Node.js, a single synchronous operation or a blocked event loop can cause this. However, that usually spikes the CPU. If CPU was flat, it's more likely the event loop was waiting on a promise that never resolved or on a slow external API call. Since CPU was low, your application was likely waiting.

Under normal load, a server accepts a connection in milliseconds. A 5-second delay means the server was running but completely unresponsive to new network traffic.

You should check your database metrics to see if it's causing the bottleneck. Check if active connections hit the limit as well. Or maybe you're calling an external API that's causing this.

I also recommend that you implement timeouts for database queries and external API calls to ensure you aren't awaiting a slow promise without a timeout.
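The timeout advice above can be sketched with a small generic wrapper. This is a hedged, illustrative sketch, not a Railway or thread-specific API: `withTimeout` and its parameters are names I made up, and the `db.query`/`fetch` calls in the usage comment are hypothetical.

```typescript
// Hedged sketch: wrap any awaited promise so it fails fast instead of
// hanging forever. All names here are illustrative.
function withTimeout<T>(promise: Promise<T>, ms: number, label = "operation"): Promise<T> {
  let timer!: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer so it can't leak.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage (hypothetical calls):
// const rows = await withTimeout(db.query("SELECT ..."), 2_000, "db query");
// const res  = await withTimeout(fetch(externalUrl), 3_000, "external API");
```

With this, a stalled external dependency surfaces as a clear timeout error in your own logs instead of an opaque 502 at the proxy.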


angeldanielmartinez424-glitch
FREE

19 days ago

A 502 in Railway usually means the proxy didn’t receive a valid response from your service in time.

Based on your logs:

• connection timeout (5000ms)

• unknown upstream error

• multiple deploymentInstanceID entries

This suggests the service was reachable but not responding fast enough under peak load.

Even if CPU/RAM didn’t spike, common causes in Node services are:

1. Exhausted database connection pool

2. Blocking operations in the event loop

3. Slow external API dependencies

4. Missing horizontal scaling

5. Long-running synchronous code

I would recommend checking:

• DB pool size vs concurrent requests

• Any synchronous CPU-heavy operations

• External API latency

• Whether autoscaling was enabled

• Railway service concurrency limits

Also verify that your service responds on process.env.PORT and does not block the event loop under load.

If possible, simulate traffic with a load test to reproduce the behavior.
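The load-test suggestion above can be sketched with nothing more than concurrent requests. This is a minimal, assumption-laden sketch: `burst` is a made-up helper, and the `send` function you pass in would be your own request (e.g. a `fetch` against your deployment's URL).

```typescript
// Hedged sketch: fire N concurrent requests and collect their latencies.
// `send` is any request function you supply, e.g.
//   () => fetch("https://<your-app>.up.railway.app/health")  // placeholder URL
async function burst(send: () => Promise<unknown>, concurrency: number): Promise<number[]> {
  const one = async (): Promise<number> => {
    const start = Date.now();
    await send().catch(() => undefined); // count failed requests as samples too
    return Date.now() - start;
  };
  const latencies = await Promise.all(Array.from({ length: concurrency }, one));
  latencies.sort((a, b) => a - b);
  return latencies; // inspect the tail, e.g. latencies[Math.floor(0.95 * concurrency)]
}
```

If the p95/p99 latency climbs sharply as concurrency rises while CPU stays flat, that points at waiting (pool, external API, sockets) rather than compute.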


angeldanielmartinez424-glitch
FREE

19 days ago

If you're using a DB (Postgres, Mongo, etc.), monitor connection pool saturation during traffic spikes.

A common pattern is that requests queue waiting for DB connections, causing the proxy timeout.
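That queueing pattern can be illustrated with a toy semaphore standing in for a connection pool. Everything here is illustrative (pool size, timings, names); it only demonstrates the mechanism, not any real driver's API.

```typescript
// Hedged sketch: a semaphore as a stand-in for a DB connection pool.
class Semaphore {
  private queue: Array<() => void> = [];
  private available: number;
  constructor(size: number) { this.available = size; }
  async acquire(): Promise<void> {
    if (this.available > 0) { this.available--; return; }
    await new Promise<void>((resolve) => this.queue.push(resolve)); // wait in line
  }
  release(): void {
    const next = this.queue.shift();
    if (next) next(); else this.available++;
  }
}

// Each "request" holds a connection for queryMs; returns end-to-end latency.
// With 10 concurrent requests on a pool of 2, the last wave waits ~5 rounds,
// so latency balloons far past the query time — with near-zero CPU — which is
// exactly when an upstream proxy timeout fires.
async function handleRequest(pool: Semaphore, queryMs: number): Promise<number> {
  const start = Date.now();
  await pool.acquire();
  try { await new Promise((r) => setTimeout(r, queryMs)); }
  finally { pool.release(); }
  return Date.now() - start;
}
```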


epether
PRO · OP

18 days ago

Thank you all for your answers and sorry for the delay.

We see nothing pointing to the API's CPU/RAM being high. Nothing on the DB (NeonDB) side either: CPU/RAM and connections are all still low.

Our research suggests it could come from a mismatch between the Railway load balancer's keep-alive timeout and our API's keepAliveTimeout setting:

-----

Problem: Timeouts/502 errors only at the Railway load balancer. Requests never reach the API (its logs are empty).

Cause: Railway keeps keep-alive connections open for 60 seconds. Node.js closes them after 5 seconds by default → zombie sockets get reused by the LB.

(~1000 concurrent connections currently active, well below the 10k max.)

Solution: In main.ts:

TypeScript

const server = app.getHttpServer();
server.keepAliveTimeout = 65000;
server.headersTimeout = 70000;

Why it works: keepAliveTimeout > 60 s means Node.js keeps the sockets alive longer than the LB, so the LB closes them cleanly first.

The higher headersTimeout prevents a race condition in Node.js when receiving headers on reused connections.

-----

We are currently testing this.
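For reference, the same fix can be written against a bare Node http server (the Nest snippet in the post exposes this same underlying server). This is a hedged sketch: the 60 s load-balancer idle timeout is the assumption from the post above, and the exact values just need to exceed it, with headersTimeout strictly greater than keepAliveTimeout.

```typescript
import http from "node:http";

// Hedged sketch of the keep-alive fix on a plain Node server.
const server = http.createServer((_req, res) => {
  res.end("ok");
});
server.keepAliveTimeout = 65_000; // > LB's assumed 60s: the LB closes idle sockets first
server.headersTimeout = 70_000;   // > keepAliveTimeout: avoids the reused-socket race
// server.listen(process.env.PORT);
```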


epether
PRO · OP

17 days ago

Since my update above, we have fewer 502 errors, but there are still big differences in HTTP request duration between Railway metrics and what we see in Datadog. The Postgres database (Neon) and Redis (Railway) are not the issue. We don't handle traffic spikes smoothly, and we didn't have this problem before on Heroku.

See attachments: Railway API metrics / NeonDB metrics / Datadog API metrics

Could it be connection limits on Railway's side (load balancer concurrency / HTTP rate limits), as described in this documentation: https://docs.railway.com/networking/public-networking/specs-and-limits ?


epether
PRO · OP

7 days ago

We still have the issue every 2-3 days. We got it today and I managed to capture it (see screenshot): a 503 "Backend.max_conn reached" was returned while I tried to access our production app.

It appears to be coming from the Railway proxy edge Fastly.

Railway, please do something.

Attachments
