2 months ago
Hello Railway Support,
We are experiencing recurring short service interruptions on our app hosted on Railway. Almost every day, there are small outage windows lasting around 2-5 minutes during which either the backend, the UI, or both become unavailable. In some cases, this happens more than once per day.
This does not appear to affect only a single service or a specific deployment. During these incidents, services in both our production and development environments can become unreachable at the same time. So this seems to be broader than an issue isolated to one service.
The most recent incident just happened today at around 3:12 PM, Europe/Rome timezone.
During these interruptions:
- The host becomes unreachable.
- Requests time out.
- There are no visible error logs on our side that explain the issue.
- After a few minutes, the service starts working again on its own.
This pattern has been happening repeatedly, and it is affecting the reliability of our application. We would like to understand whether there are platform-side issues, restarts, networking problems, or any other infrastructure events that could explain these short downtime periods.
Could you please investigate the incident, and also check whether there is a broader pattern affecting our project?
If useful, we can provide project details and any additional information you may need.
Thank you,
Francesco
6 Replies
2 months ago
All 15 services in your project currently show a healthy status, and we found no error logs or service restarts around the reported incident time (13:12 UTC on June 18). Your custom domains (app.docsy.it, app-dev.docsy.it, n8n.docsy.it, sgtm.docsy.it) all have valid certificates and are verified, but all traffic is routed through Cloudflare's proxy (orange cloud), meaning requests pass through Cloudflare before reaching us.
Since both production and development environments share the same Cloudflare zone and the interruptions affect multiple services simultaneously with no corresponding errors on our side, we recommend testing during the next interruption by hitting your Railway-generated domains directly (e.g., app-ui-prod.up.railway.app and docsy-be-prod.up.railway.app), as these bypass Cloudflare entirely and will help determine whether the issue originates between Cloudflare and your users or between Cloudflare and our infrastructure.
Also confirm that your Cloudflare SSL/TLS mode is set to "Full" (not "Full (Strict)"), as Strict mode can cause intermittent 526 errors during certificate renewal windows.
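The comparison suggested above can be sketched as a small probe. This is only an illustration: the hostnames come from this reply, while the URL paths, the 10-second timeout, the "5xx means unhealthy" rule, and the labels returned by `classify` are assumptions.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional

# Hostnames are from the reply above; paths and timeout are assumptions.
CUSTOM_URL = "https://app.docsy.it/"                # proxied through Cloudflare
DIRECT_URL = "https://app-ui-prod.up.railway.app/"  # bypasses Cloudflare


def probe(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except (OSError, http.client.HTTPException):
        return None  # DNS failure, refused connection, timeout, broken response


def is_healthy(status: Optional[int]) -> bool:
    """Treat a missing response or any 5xx status as unhealthy (an assumption)."""
    return status is not None and status < 500


def classify(custom_ok: bool, direct_ok: bool) -> str:
    """Map the two probe results to a rough, hypothetical diagnosis label."""
    if custom_ok and direct_ok:
        return "ok"
    if direct_ok:
        return "cloudflare-path"  # only the proxied domain fails
    if custom_ok:
        return "direct-only"      # unexpected; worth a second look
    return "both-down"            # Cloudflare alone does not explain it
```

During an incident, something like `classify(is_healthy(probe(CUSTOM_URL)), is_healthy(probe(DIRECT_URL)))` would give a first hint about which hop is failing.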
Status changed to Awaiting User Response Railway • about 2 months ago
Status changed to Awaiting Railway Response Railway • about 2 months ago
Railway
2 months ago
Also the incident reported happened today, not on June 18 lol
2 months ago
This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.
Status changed to Open Railway • about 2 months ago
2 months ago
Common Causes for Recurring 2-5 Minute Outages on Railway
Short, self-resolving interruptions (host unreachable, requests time out, no error logs) that affect multiple services and both prod/dev environments are usually platform-side or infrastructure-related rather than caused by your code.
Most Likely Causes (in order of frequency):
1. Automatic Restarts / Health Check Failures: Railway restarts containers when they fail health checks, run out of memory, or hit resource limits. Even brief restarts cause 1–5 minutes of downtime.
2. Ephemeral Networking / Internal Connectivity Flakes: Temporary issues with Railway’s internal networking or edge proxy (especially in EU regions). Services become unreachable for a few minutes, then recover automatically.
3. Resource Pressure (Memory/CPU): On Hobby/Pro plans, if your services spike and hit limits, Railway may kill/restart replicas silently.
4. Database Connection Pool Exhaustion or Slow Queries: If the backend depends on Postgres/Redis, connection timeouts or pool issues can make the whole app appear down.
5. Platform-wide Degraded Performance: Check https://status.railway.com for any recent incidents (even small ones).
Recommended Actions Right Now:
- Go to each service → Logs → filter for the exact time of the outage (today ~13:12 UTC / 3:12 PM Rome) and look for:
  - Container restarts
  - OOMKilled / memory errors
  - Health check failures
  - Prisma/Postgres/Redis timeout messages
- Enable Always Restart policy temporarily (Service Settings → Restart Policy) to see if it helps.
- Add better health checks and increase resource allocation (RAM/CPU) for a test.
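When filtering logs by time, it helps to convert the local incident time to UTC first. A minimal sketch, assuming Rome is on summer time (CEST, UTC+2) on the incident date; the date is taken from the Railway reply, the year and the 5-minute margin are hypothetical:

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Assumption: the incident falls in summer, when Rome observes CEST (UTC+2).
CEST = timezone(timedelta(hours=2), "CEST")


def log_window_utc(local_time: datetime,
                   margin_minutes: int = 5) -> Tuple[datetime, datetime]:
    """Return a (start, end) UTC window around a timezone-aware local time."""
    center = local_time.astimezone(timezone.utc)
    margin = timedelta(minutes=margin_minutes)
    return center - margin, center + margin


# 3:12 PM in Rome; June 18 comes from the thread, the year is hypothetical.
incident = datetime(2025, 6, 18, 15, 12, tzinfo=CEST)
start, end = log_window_utc(incident)
print(start.isoformat(), end.isoformat())
# prints: 2025-06-18T13:07:00+00:00 2025-06-18T13:17:00+00:00
```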
jabalf15ai
2 months ago
Hello, thanks for the detailed reply.
I checked the metrics for each service, and there are no visible spikes in CPU or memory usage during these incidents. Resource limits are already set well above our current needs, and I had already increased them a few days ago specifically to test whether this would stop the outages, but the issue is still happening.
Because of that, I do not think this is a resource exhaustion issue on our side. Also, when the problem happens, it is not just one service going down: all web services across both environments become inaccessible at the same time. This affects both production and development simultaneously, so it does not look related to one specific deployment or service-level resource problem.
I also checked the logs, and the services do not appear to be restarting during these events. They remain marked as online, but they become extremely slow. For example, during the outage around 3:12 PM Europe/Rome, looking at the backend service, requests were essentially stuck at the OPTIONS stage and were taking around 28-30 seconds to complete.
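For anyone who wants to reproduce this kind of measurement, here is a rough sketch of timing a CORS preflight request. The function name, the `Origin` value, the requested method, and the 60-second timeout are all assumptions for illustration, not what our app actually sends.

```python
import http.client
import time
from typing import Tuple
from urllib.parse import urlsplit


def time_options(url: str, origin: str, timeout: float = 60.0) -> Tuple[int, float]:
    """Send an OPTIONS preflight and return (status code, elapsed seconds)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    headers = {
        "Origin": origin,                         # hypothetical frontend origin
        "Access-Control-Request-Method": "POST",  # assumed method of the real call
    }
    started = time.monotonic()
    try:
        conn.request("OPTIONS", parts.path or "/", headers=headers)
        status = conn.getresponse().status
    finally:
        conn.close()
    return status, time.monotonic() - started
```

A healthy preflight normally returns well under a second, so an elapsed time close to 30 seconds would match what we observed.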
I also checked Railway Status, and yesterday there were reported incidents including “Request timeouts and Service Unavailable responses on Edge Network” and “We’re currently experiencing elevated latency in our EU-West region”.
Given both the timing and the behavior we are seeing, would it be reasonable to assume that this points more toward a platform-wide degraded performance issue rather than a service-specific problem on our side?
The reason I am asking is that this seems to be happening quite frequently, which is concerning for a production workload.
Thank you
2 months ago
This is most likely a networking issue. To verify it, I recommend using the Railway-generated domain instead of your custom domain the next time this issue occurs. If the generated domain has no problems, then your issue is related to Cloudflare. In that case you can disable the proxy in your Cloudflare DNS settings so that traffic goes straight to Railway instead of passing through Cloudflare first.
One other thing I need to know: is this issue reported by your users, or is it happening only to you? It's possible that the issue is related to your own network. If so, run some tests from different networks to rule this out.
Hope this helps.
darseen
2 months ago
Thanks for the reply and suggestions.
It shouldn’t be related to Cloudflare, because for the backend we don’t use any custom domain. All requests go directly to the default Railway-hosted URL, and the issue still occurs even in that case.
The problem is also not limited to my network. It’s been reported by our users, and during those moments neither I nor my colleagues (who are on different networks, since we work remotely) are able to access the app or make direct backend calls (for example, via Postman). Everything becomes unreachable for a few minutes, then eventually recovers on its own.
We also set up Instatus automatic health checks, which fail during these incidents.
Given that, it seems unlikely to be a DNS or local network issue.
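To back this up with data, timestamped probe results can be collapsed into outage windows. The sketch below is only an illustration: the sample format and function name are hypothetical, it does not use the Instatus API, and it counts a window from the first failed probe to the first successful one after it.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple

Sample = Tuple[datetime, bool]       # (probe time, probe succeeded)
Window = Tuple[datetime, datetime]   # (first failure, recovery or last failure)


def outage_windows(samples: Iterable[Sample]) -> List[Window]:
    """Group consecutive failed probes into (start, end) outage windows."""
    windows: List[Window] = []
    start: Optional[datetime] = None
    last_failure: Optional[datetime] = None
    for ts, ok in sorted(samples, key=lambda s: s[0]):
        if not ok:
            if start is None:
                start = ts
            last_failure = ts
        elif start is not None:
            windows.append((start, ts))  # recovered at this probe
            start = None
    if start is not None and last_failure is not None:
        windows.append((start, last_failure))  # still failing at the last probe
    return windows


# Illustrative data only: one probe per minute, failing for minutes 1 to 3.
base = datetime(2025, 6, 18, 13, 10)  # hypothetical timestamps
samples = [(base + timedelta(minutes=m), m not in (1, 2, 3)) for m in range(6)]
for begin, finish in outage_windows(samples):
    print(begin.isoformat(), "->", finish.isoformat(), finish - begin)
# prints: 2025-06-18T13:11:00 -> 2025-06-18T13:14:00 0:03:00
```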