a month ago
Our requests to our API have been timing out or just never responding. It started happening earlier today, then stopped. I redeployed the API and nothing's working. This is very urgent: nothing in our setup has changed, and it's been fine all week. This is happening to about 6 different databases.
a month ago
What region is this deployed in?
So this is most likely not an app issue on my end, or at least I can't see anything leading me to believe that.
This happened again about 3 hours ago. I went to bed and just woke up to many messages, so I'd like some confirmation from someone on whether this is on my side or Railway's, if possible.
Idk how accurate this is since it's AI, but I just want to try to figure out what caused this.
Scaling to 2 replicas didn't work. Databases seem fine, Redis seems fine, no clue what's going on.
Checking DB logs:
```
2026-05-12 16:49:27.871 UTC [37676] LOG: unexpected EOF on client connection with an open transaction
2026-05-12 16:49:27.880 UTC [37671] LOG: could not receive data from client: Connection reset by peer
2026-05-12 16:49:27.890 UTC [37675] LOG: could not receive data from client: Connection reset by peer
2026-05-12 16:49:27.884 UTC [37671] LOG: unexpected EOF on client connection with an open transaction
2026-05-12 16:49:27.890 UTC [37675] LOG: unexpected EOF on client connection with an open transaction
2026-05-12 16:49:28.634 UTC [37604] LOG: could not receive data from client: Connection reset by peer
2026-05-12 16:49:28.635 UTC [37682] LOG: could not receive data from client: Connection reset by peer
```
I redeployed the DB and now it's just stuck on deploying and inaccessible.
I'm not using Postgres, btw. Could be an internal networking issue, since I'm using Dragonfly.
Here are more logs:
```
return cors_after_request(app.make_response(f(*args, **kwargs)))
File "/app/.venv/lib/python3.10/site-packages/flask/app.py", line 865, in full_dispatch_request
rv = self.ensure_sync(before_func)()
g.db_conn = get_connection()
File "/app/app/db.py", line 29, in get_connection
_pools[host] = pool.ThreadedConnectionPool(1, 20, dsn)
File "/app/.venv/lib/python3.10/site-packages/psycopg2/pool.py", line 59, in __init__
File "/app/.venv/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
psycopg2.OperationalError: connection to server at "postgres-88e0f11e.railway.internal" (fd12:685a:3643:0:a000:2f:ca06:13c5), port 5432 failed: Connection timed out
Is the server running on that host and accepting TCP/IP connections?
```
I meant that I started a Redis instance on Upstash and pointed my app to the external URL.
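The traceback shows `get_connection` building a psycopg2 `ThreadedConnectionPool` on demand and blocking until the TCP connect times out, which lines up with requests hanging until the client gives up (the 499s). A minimal sketch of one way an app might fail fast and retry while the network is flapping follows. The `connect_with_retry` helper, its retry counts, and its delays are hypothetical and illustrative only; they are not from this thread or from Railway's guidance, and retries can soften short flaps but won't help during a sustained outage.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_retry(
    factory: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 4,
    base_delay: float = 0.25,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `factory` until it succeeds, retrying only on `retry_on` errors.

    Hypothetical helper: the delay ceiling grows exponentially
    (base_delay * 2**attempt), is capped at max_delay, and each actual
    wait is drawn uniformly from [0, ceiling] ("full jitter") so many
    workers reconnecting at once don't retry in lockstep.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return factory()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


# Hypothetical usage inside get_connection (not the thread's actual code).
# connect_timeout is a libpq connection parameter in seconds; psycopg2's
# pool forwards extra kwargs to psycopg2.connect, so each attempt gives up
# quickly instead of waiting on the OS-level TCP timeout:
#
# from psycopg2 import OperationalError, pool
# _pools[host] = connect_with_retry(
#     lambda: pool.ThreadedConnectionPool(1, 20, dsn, connect_timeout=5),
#     retry_on=(OperationalError,),
# )
```

Retrying only on the listed exception types keeps unrelated bugs from being retried silently. Note that psycopg2 also raises `OperationalError` for things like bad credentials, so a real app might want to inspect the error before retrying.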
a month ago
Your workspace has been restricted. You cannot create new resources. Contact Support
Why?
The server works for the first 10 seconds after a redeploy; after that it goes back to 499 errors. We've had this API running for a year+ and have never encountered these issues, nor have we changed anything recently that could cause this.
Okay, this has escalated to full downtime now. Is it possible to get someone on this?
Or Claude is lying. Claude was lying.
Can anyone from the team advise? If it were an app issue, it most likely wouldn't get resolved by itself, no?
a month ago
!t
Status changed to Awaiting Railway Response Railway • about 1 month ago
If anyone from the team does have an idea of the cause, I would love to know now; I just don't want it happening overnight again.
Hi, reaching out again: this has happened multiple times, and I don't want it to get overlooked.
a month ago
Hey Andre — sorry for the delay here; it took a bit to dig up all the details. The issue was on a physical machine in our us-east region that hosts your services and databases. One of its network cables started failing late May 11 and kept flapping — dropping and recovering on its own — which is why you saw intermittent timeouts across that whole window instead of one clean outage.
It also explains why redeploys, scaling replicas, and switching to public DB URLs didn't help: the problem was at the host's network layer, not your app or your DB config. A platform engineer fully isolated the bad cable on May 12, which is when your Postgres recovered, and your remaining services settled that day. A technician was onsite the next day to physically replace the failed hardware, so it can't recur the same way.
Status changed to Awaiting User Response Railway • 29 days ago
a month ago
Hi, I appreciate the response. Is there anything we can do to mitigate this in the future? I understand things break, but honestly, lately it feels like I'm on these threads pretty often due to issues, and without communication, especially when our production servers are just failing, I'm left with no idea what to do but wait. I love Railway and do not want to migrate away, but nothing's been said about what the plans are to mitigate these types of issues, or at least about better recognition or support when they do arise. I had to wait a whole day for this to be acknowledged while our whole server was completely down, which also meant wasting hours of my time trying to fix the server and keeping up with my clients as they continued to message. I'd just like to know if there's anything ongoing to avoid these problems, or at least some better communication with the users they impact.
Status changed to Awaiting Railway Response Railway • 29 days ago
a month ago
Yeah, before I say anything else, I just want to say: I hear you, and I totally get where you're coming from. The feedback is well heard, and while I obviously can't copy/paste internal conversations, know that stability is something we're spending a lot of energy on internally. Things are scaling really quickly, and like you mentioned, things do break, but ultimately it doesn't feel great on the receiving end.
There's a lot in flight to improve stability in these areas, and a lot of it is already in place - despite some of these bumps that come up.
Your ask is very fair (better comms, sooner, and ways to mitigate). I can for sure take that feedback to the team, and it's something we're taking very seriously right now.
Really sorry about how this one played out, and about the lag in response. We'll keep working on making it better.
On this one, let me know if there's anything I can do to help ease the worries/frustrations. I've got this thread tagged; I'll be checking back in.
Status changed to Awaiting User Response Railway • 28 days ago
a month ago
Hey, I completely understand. I guess there's not much that can be done now; I just hope that in the future these issues will be responded to faster. Without any responses I can't tell my clients what's going on, and I can't even guarantee them when the dashboard will be up, which is obviously a huge issue.
Status changed to Awaiting Railway Response Railway • 27 days ago
a month ago
Completely agree, but I feel obligated to mention that this was posted on Discord. The team does not actively monitor Discord, and Discord threads do not come with any response guarantees whatsoever.
The Central Station would be the correct place to open a thread when the issue pertains to the Railway platform.
Good to know now. It wasn't clear at first, since Discord threads appear in my threads in Central Station.
a month ago
They are bridged to Central Station, but they are not put in front of the admin view.
a month ago
Though in this case, the thread got picked up because Noah escalated it, and that's not something to count on always happening.
andremaytorena
Hey, I completely understand, I guess there's not much that can be done now, I just hope in the future these issues will be responded to faster. Without any responses I can't tell my clients what's going on, and I can't even guarantee them when the dashboard will be up which is obviously a huge issue.
a month ago
Totally fair take on this, and I agree. Thanks for being open to talking about it, and keeping the feedback coming. Let us/me know how we can help.
Status changed to Awaiting User Response Railway • 26 days ago
24 days ago
I guess this didn't age well, server down for the whole night :(
Status changed to Awaiting Railway Response Railway • 24 days ago
23 days ago
Hey there,
Closing the loop on this one: the May 11-12 timeouts were caused by a flapping network cable in our US East region, which our platform team isolated on the 12th and an on-site tech replaced the next day. Apologies for the disruption.
The May 20 overnight issue you mentioned was a separate event: our GCP cloud account got auto-restricted, taking down our API for a few hours. Full writeup here: https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage. Recovery is complete on our side.
For future urgent issues, posting via Central Station is the right channel; Discord isn't actively monitored by support and doesn't carry response guarantees.
Thanks,
Angelo
Status changed to Awaiting User Response Railway • 23 days ago
16 days ago
This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!
Status changed to Solved Railway • 16 days ago