Issues for the entire day

ilvalerione

PROOP

4 months ago

Hi,

Today was terrible: we had issues on our worker service for the entire day, and it's causing a lot of problems on the business side. I understand that short interruptions can happen, but today our application could not process data for almost all working hours.

I don't know what to do. It's a very low level of service.

Solved • $30 Bounty

Pinned Solution

ilvalerione

PROOP

2 months ago

What I understood from playing with the configuration is that the number of workers per replica was too high, so when the workers started to process a high number of jobs, the single replica went down. The problem doesn't show up when the workers start, so it stays invisible until runtime under heavy workload. One problem is that the Railway UI doesn't show me what the limits of a single replica are in terms of memory, and I think memory is the real problem here. I can only see the general limit of 32GB of RAM for the entire service. Being aware of replica limits would help set the right expectations and avoid this turmoil. I decreased the number of workers per queue and increased the number of replicas, and it seems stable now.
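For readers applying the same fix, here is a minimal sketch of the "fewer workers per replica, more replicas" arithmetic. It is written in Python purely for illustration (the thread's application appears to use Laravel Horizon), and `plan_replicas` plus the cap of 20 workers are hypothetical values taken from the numbers mentioned in this thread, not a Railway or Horizon API.

```python
import math


def plan_replicas(total_workers: int, max_workers_per_replica: int) -> tuple[int, int]:
    """Return (replicas, workers_per_replica) so that no replica exceeds the cap."""
    if total_workers <= 0 or max_workers_per_replica <= 0:
        raise ValueError("both arguments must be positive")
    # Smallest number of replicas that can hold all workers under the cap.
    replicas = math.ceil(total_workers / max_workers_per_replica)
    # Spread the workers evenly so every replica carries a similar load.
    workers_per_replica = math.ceil(total_workers / replicas)
    return replicas, workers_per_replica
```

For example, `plan_replicas(120, 20)` gives 6 replicas of 20 workers each. The cap itself has to be found empirically, as described above, because the per-replica limit is not stated anywhere in this thread.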

15 Replies

chandrika

EMPLOYEE

4 months ago

We can see your Worker service in the Neuron Stream project is currently running successfully, and there are no failed deployments recorded today. We did observe several deployment attempts and cancellations throughout the day.

Could you share more details about the specific issues you experienced? For example, were you seeing error messages, service crashes, timeouts, or failed data processing? Any logs or screenshots from your application showing the problem would help us investigate further.

Status changed to Awaiting User Response Railway • 4 months ago

ilvalerione

PROOP

4 months ago

The application was intermittently unreachable, returning gateway errors. But the most important problem to me is that the worker service is configured to have 120 processes consuming messages from Dragonfly queues, and it also has 6 replicas. Yet in the Worker metrics I see CPU consumption stay very, very low, and in fact the messages on the queues continue to accumulate. I don't really understand how this kind of service works. It doesn't scale!

Status changed to Awaiting Railway Response Railway • 4 months ago

jake

EMPLOYEE

4 months ago

We investigated the networking between your Worker service and Dragonfly and found a pattern of dropped TCP connections on port 6379 occurring consistently every ~5 minutes throughout the afternoon. These drops are on the private network path between the two services. This would explain the low CPU usage and queue accumulation you observed - if the workers cannot establish or maintain connections to Dragonfly, they would sit idle while messages pile up. We also see multiple deployment-cancellation cycles on your Worker service during this period, which may have compounded the issue. We're looking into this further on our side.

Status changed to Awaiting User Response Railway • 4 months ago

ilvalerione

PROOP

4 months ago

I have not canceled any deployments, just to let you know. I tried restarting some containers, hoping they would work again.

Status changed to Awaiting Railway Response Railway • 4 months ago

chandrika

EMPLOYEE

3 months ago

Understood, that makes sense. The deployment cancellations we saw were from your container restarts, not manual cancels on your part. The underlying issue, dropped TCP connections between your Worker and Dragonfly over the private network, has been a known pattern that our engineering team previously investigated and resolved. Your experience on March 20 may indicate a localized recurrence. We've added your case to the tracking cluster so the team has visibility. If you see this happen again, please let us know immediately so we can correlate it in real time.
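To make a recurrence easier to report and correlate, one option is to log timestamped connection attempts from inside the Worker service. The sketch below is an assumption-laden illustration in Python using only the standard library. The function names are hypothetical, and it only checks whether a new TCP connection can be opened, so it would not detect an already established connection being dropped.

```python
import socket
import time
from datetime import datetime, timezone


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def probe(host: str, port: int, attempts: int, interval: float = 0.0) -> list[tuple[str, bool]]:
    """Try to connect `attempts` times; return (UTC timestamp, success) pairs."""
    results = []
    for i in range(attempts):
        stamp = datetime.now(timezone.utc).isoformat()
        results.append((stamp, can_connect(host, port)))
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

In practice the host would be the Dragonfly service's private hostname and the port 6379, as mentioned above. The failed entries and their timestamps are what would be worth sharing in a follow-up.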

Status changed to Awaiting User Response Railway • 4 months ago

Railway

BOT

3 months ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway • 3 months ago

ilvalerione

PROOP

3 months ago

Once again there are issues between the worker 1 service and the Dragonfly service whose queue it consumes. It's becoming a problem, guys.

Status changed to Awaiting Railway Response Railway • 3 months ago

ilvalerione

PROOP

3 months ago

This level of service is completely unreliable

brody

EMPLOYEE

3 months ago

We've looked into this further and the connection drops between your Worker and Dragonfly are isolated to your services. We're not seeing this pattern elsewhere on the platform. With 120 processes across 6 replicas, you're opening a very high number of concurrent connections to Dragonfly on port 6379. The consistent drop pattern every ~5 minutes on specific connection paths suggests a client-side connection management issue, likely related to how your worker processes handle connection pooling or reconnection to Dragonfly. We'd recommend reviewing your Dragonfly client configuration, particularly around connection pool sizing, keepalive settings, and reconnection logic to ensure it's tuned for this volume of concurrent connections.
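As an illustration of the reconnection logic mentioned above, here is a minimal sketch of retrying with capped exponential backoff and jitter. It is generic Python, not the configuration of any specific Dragonfly or Redis client: `operation` and `reconnect` are hypothetical callables supplied by the caller, and catching the built-in `ConnectionError` is an assumption, since real client libraries usually raise their own exception classes.

```python
import random
import time


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0):
    """Yield one capped exponential delay (with full jitter) per retry."""
    for attempt in range(retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_reconnect(operation, reconnect, retries: int = 5, sleep=time.sleep):
    """Run operation(); on ConnectionError, wait, reconnect, and try again."""
    delays = backoff_delays(retries)
    needs_reconnect = False
    while True:
        try:
            if needs_reconnect:
                reconnect()  # a failed reconnect is retried like a failed operation
                needs_reconnect = False
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: let the process supervisor see the error
            sleep(delay)
            needs_reconnect = True
```

The jitter matters when many worker processes lose their connection at the same moment: without it, they would all reconnect in lockstep and hit Dragonfly with a burst of simultaneous connection attempts.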

Status changed to Awaiting User Response Railway • 3 months ago

ilvalerione

PROOP

3 months ago

After testing many different configurations I noticed that the supervisor crashes if configured with too many workers. I can't configure a supervisor with 100 workers because it crashes. I need to keep the number of workers under a certain threshold (e.g. 20 workers). This means I need to add replicas to scale the number of processes consuming my queues. Is this some Railway limitation?

Status changed to Awaiting Railway Response Railway • 3 months ago

ilvalerione

PROOP

3 months ago

This problem prevents me from configuring an autoscaling worker strategy like the "auto" balance supported by Laravel Horizon, because I can't scale up beyond a limited number of workers in a single service instance. I always need to keep multiple replicas alive to reach the number of workers I need to process the volume of jobs in my queues.

angelo-railway

EMPLOYEE

3 months ago

The supervisor crash at higher worker counts is likely hitting the container's memory limit. Each worker process consumes its own memory, so 100+ workers in a single container can exceed available RAM and trigger an OOM kill. Check your service's memory metrics — if it's hitting the ceiling, the fix is exactly what you're doing: fewer workers per replica, more replicas. This isn't a Railway limitation per se, it's a memory constraint. You can bump your service's memory limit if you want more workers per replica, or keep scaling horizontally with replicas.
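The memory reasoning above can be turned into a rough estimate. This is a back-of-the-envelope sketch in Python, and every number passed to it is an assumption: the thread does not state the per-replica memory limit or the memory used by a single worker, so both would need to be measured.

```python
def max_workers_for_memory(limit_mb: float, per_worker_mb: float, overhead_mb: float = 0.0) -> int:
    """Largest worker count whose combined memory stays within the limit.

    limit_mb      -- memory available to one replica (assumed, measure it)
    per_worker_mb -- peak memory of one worker under load (assumed, measure it)
    overhead_mb   -- memory used by everything else in the container
    """
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable = limit_mb - overhead_mb
    return max(0, int(usable // per_worker_mb))
```

With hypothetical values of 8192 MB per replica, 128 MB per worker, and 512 MB of overhead, this gives 60 workers. Peak memory under heavy load is the figure that matters, since idle workers use less and the crash only appears once jobs are being processed.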

Status changed to Awaiting User Response Railway • 3 months ago

ilvalerione

PROOP

3 months ago

The service's memory metric is way below its limit: around 10GB used of 32GB available. My concern was about running an excess of workers that consume more resources than necessary. Now I have 4 replicas with 20 workers each, but the CPU metric seems to range between 0.5 and 3 CPUs. So the number of workers does not affect CPU consumption; CPUs seem to scale only when they are actually used to process a job. I'm sorry for the back-and-forth, but I feel the need to better understand how your platform works, because it seems very different from having a fixed amount of resources in a VM.

Another strange pattern I noticed in the CPU chart of all services is that it keeps dropping to 0 at fairly regular intervals. All services. Is that normal, or am I missing something?

Status changed to Awaiting Railway Response Railway • 3 months ago

Railway

BOT

3 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway • 3 months ago

ilvalerione

PROOP

2 months ago

Today is another terrible day for the stability of the Railway infrastructure. My worker service keeps going through periods of downtime. If you don't have any suggestion, fix, or anything else that can give my architecture appropriate reliability, I have no option other than migrating my product elsewhere. I would gladly collaborate to help understand what the problem is, but honestly I'm not sure you want to take it into consideration. Opening a bug bounty and waiting is not a valid strategy. Since my average spend is about $800/month, I would like to speak with a technical specialist on your side, if possible.

ilvalerione

PROOP

2 months ago

chinanderm

PRO

2 months ago

Another absolutely frustrating day as a Railway customer.

Status changed to Solved noahd • 2 months ago
