a year ago
Hi team, since switching to multi-region we've been observing frequent downtime during certain periods. A restart/redeploy helps most of the time. Most of the disruption is located in the US region. Could we get some help looking into this?
ProjectId: bd3c882d-f3ad-4f0d-b553-e27e0201d165
42 Replies
a year ago
hello
a year ago
can you define what downtime in practice means for you?
Hi @Brody, it seems to us that our front-end was unable to establish a connection to the servers. The issue is quite subtle, and we couldn't tell from our monitoring what was going wrong.
a year ago
can you provide a little more detail? that could mean many hundreds of things
From 2024-12-11 6:15am ET to 9:15am ET, we observed elevated errors from 3 out of 10 of our backend servers. Our engineers located in the US were unable to access the site.
We use Node.js with WebSockets as the means of communication between our BE <> FE. Below are the errors we saw during the disruption.
Other regions seemed unaffected; folks there could still access the site.
No sign of resource exhaustion; CPU + memory look normal.
Our Cloudflare metrics look normal as well.
Error: Parse Error
at socketOnEnd (node:_http_server:803:22)
at Socket.emit (node:events:531:35)
at Socket.emit (node:domain:488:12)
at endReadableNT (node:internal/streams/readable:1696:12)
at processTicksAndRejections (node:internal/process/task_queues:82:21)
This was not the only time we saw the issue, though. It happened to us twice this week.
That's all we have on our end. Posting the question here hoping we could get more insight from Railway into what's causing the disruption.
a year ago
there's nothing on our side from our network availability monitoring, and we have no observability into what your code is doing, so unfortunately we can't tell you much about what went wrong at your application layer
@Brody did you observe any abnormalities from any of our instances during that time? Any sign of degraded system metrics on an individual instance? Did any of them bear more load than the others?
What bothers us is that this only happened in a certain region (US East, to be more specific), so we doubt the application layer is the culprit here.
a year ago
you can check your service metrics to see cpu / memory / networking within the service panel
a year ago
we don't have any application level monitoring so we can't tell you what happened with the application itself
The issue is that it's hard for us to see what is happening at the application level, even locally, because we can only reproduce it on Railway in certain replicas.
For example, we debugged at one point by changing VPNs, but the above error is at such a low level that it wasn't useful. And the issue consistently does not occur across all replicas, which is why we were wondering if it's possibly a Railway issue.
For example, we use websockets. Is it possible that websocket connections over streams on your replicas are causing this?
a year ago
There are no issues with websockets here!
Handling all possible errors, adding verbose logging, and tracing will help you debug any further application-level issues.
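In that spirit, one hedged sketch of what "verbose logging" could look like here: structured log lines tagged with the replica and region they came from, so output from a misbehaving replica can be isolated. `RAILWAY_REPLICA_ID` and `RAILWAY_REPLICA_REGION` are assumptions about platform-injected environment variables; the code falls back to placeholders when they're absent (e.g. locally):

```javascript
// Structured, replica-tagged logging sketch. The env var names below are
// assumptions; substitute whatever your platform actually injects.
const REPLICA = process.env.RAILWAY_REPLICA_ID ?? 'local';
const REGION = process.env.RAILWAY_REPLICA_REGION ?? 'unknown';

function formatLog(level, msg, extra = {}) {
  return JSON.stringify({
    ts: new Date().toISOString(),
    level,
    replica: REPLICA,
    region: REGION,
    msg,
    ...extra,
  });
}

function log(level, msg, extra) {
  console.log(formatLog(level, msg, extra));
}

log('info', 'websocket server started');
```

With every line carrying a replica/region tag, grepping the combined logs for one replica's output during a disruption window becomes trivial.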
a year ago
really sorry we couldn't help here, but there's just nothing we can tell you about this issue from the platform side of things.
@Brody to help test a theory regarding replicas, is it possible to add a header to a request to ensure a certain replica is being used?
a year ago
like have your replica return a header with its specific ID?
a year ago
Oh pinning your request to a specific replica, no that's not possible, we don't support sticky sessions
Gotcha, ty. Any known issues with running multiple replicas with Inngest? We're wondering if Inngest may be the issue too, particularly when it makes calls between replicas.
a year ago
Make calls between replicas? I'm sorry, I don't follow.
The only issue with replicas and websockets would be if you needed sticky sessions.
a year ago
Nope no issues there, we run replicas ourselves for the docs, and help station
Thanks for all the help @Brody! Another question: do you have more granular metrics (CPU, RAM) for each replica?
a year ago
The UI adds them all together
Is there any way we can get the breakdown? Or do you have any tips for determining which region is struggling / needs to scale up?
a year ago
I'll check. What service is this in regard to? basement-feathers-api?
a year ago
the two metal east 4 regions did most of the compute
a year ago
you removed the two metal east 4 regions and replaced them with 2 GCP east 4 regions, but since then east 4 is still the region doing the most compute
a year ago
even right now, the two east regions are doing more than double the compute of Asia
Will you guys expose these metrics in the dashboard in the future? That'd be helpful for developers.
a year ago
What exactly do you mean by contention?
a year ago
it's something we want to do but just don't have the cycles for
a year ago
!s
Status changed to Solved brody • about 1 year ago
