Multi-regional Deployment down for certain regions
aqt-dev
PROOP

a year ago

Hi team, since switching to multi-region we frequently observe downtime during certain periods. A restart/redeploy helps most of the time. Most of the disruption is located in the US region. Could we get some help looking into this?

ProjectId: bd3c882d-f3ad-4f0d-b553-e27e0201d165

Solved

42 Replies

aqt-dev
PROOP

a year ago

Hi, can we get someone to take a look at this please!


a year ago

hello


a year ago

can you define what downtime means for you in practice?


aqt-dev
PROOP

a year ago

Hi @Brody, it seems to us that our front-end was unable to establish a connection to the servers. The issue is quite subtle, and we couldn't tell what was going wrong from our monitoring.


a year ago

can you provide a little more detail? that could mean many hundreds of things


aqt-dev
PROOP

a year ago

  • From 2024-12-11 6:15am ET to 9:15am ET, we observed elevated errors from 3 out of 10 of our backend servers. Our engineers located in the US were unable to access the site.

  • We use Node.js with WebSockets as the means of communication between our BE <> FE. Below are the errors we saw during the disruption:

  • Other regions did not seem impacted; folks there could still access the site.

  • No sign of resource exhaustion; CPU + memory look normal.

  • Our Cloudflare looks normal as well.

  Error: Parse Error

      at socketOnEnd (node:_http_server:803:22)

      at Socket.emit (node:events:531:35)

      at Socket.emit (node:domain:488:12)

      at endReadableNT (node:internal/streams/readable:1696:12)

      at processTicksAndRejections (node:internal/process/task_queues:82:21)

aqt-dev
PROOP

a year ago

No deployments before the time of the disruption


aqt-dev
PROOP

a year ago

1317242210348957700


aqt-dev
PROOP

a year ago

This was not the only time we saw the issue though. It happened to us twice this week.

That's all we have on our end. Posting the question here hoping we could get more insight from Railway to figure out what's causing the disruption.


a year ago

there's nothing on our side from our network availability monitoring, and we have no observability into what your code is doing, so unfortunately we can't tell you much about what went wrong at your application layer


aqt-dev
PROOP

a year ago

@Brody do you observe any abnormalities during that time on any of our instances? Any sign of degradation in the system metrics of an individual instance? Did any of them bear more load than the others?


aqt-dev
PROOP

a year ago

What bothers us is that this only happened in a certain region (US East, to be more specific), so we doubt the application layer is the culprit here.


a year ago

you can check your service metrics to see cpu / memory / networking within the service panel


a year ago

we don't have any application level monitoring so we can't tell you what happened with the application itself


gio
PRO

a year ago

The issue is that it's hard for us to see what is happening at an application level, even locally for example, because we can only reproduce it on Railway in certain replicas


gio
PRO

a year ago

For example, we debugged at one point by changing VPNs, but the above error is at such a low level that it wasn't useful. And this consistently does not happen across all replicas, which is why we were wondering if it's possibly a Railway issue


gio
PRO

a year ago

For example, we use websockets. Is it possible that websocket connections over streams on your replicas are causing this?


gio
PRO

a year ago

I'd imagine if so, maybe someone on your team has seen others experience this prior


a year ago

There are no issues with websockets here!

Handling all possible errors, verbose logging, and tracing will help you debug any further application-level issues.
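A stdlib-only sketch of that advice: attach error listeners to every accepted connection and keep process-level last-resort handlers, so nothing fails silently while debugging (handler bodies here are illustrative):

```javascript
// Last-resort handlers: anything that escapes local error handling
// gets logged with a timestamp instead of killing the process silently.
// (For debugging only; in production you'd normally exit after logging.)
process.on('uncaughtException', (err) => {
  console.error(`[${new Date().toISOString()}] uncaughtException:`, err.stack);
});
process.on('unhandledRejection', (reason) => {
  console.error(`[${new Date().toISOString()}] unhandledRejection:`, reason);
});

const http = require('node:http');
const server = http.createServer((req, res) => res.end('ok'));

// Per-connection error logging: raw TCP errors (resets, timeouts)
// are otherwise easy to lose in a multi-replica setup.
server.on('connection', (socket) => {
  socket.on('error', (err) => {
    console.error(`[${new Date().toISOString()}] socket error:`, err.message);
  });
});
```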


gio
PRO

a year ago

Gotcha


a year ago

really sorry we couldn't help here, but there's just nothing we can tell you about this issue from the platform side of things.


gio
PRO

a year ago

Understood. Appreciate it


gio
PRO

a year ago

@Brody, to help test a theory regarding replicas: is it possible to add a header to a request to ensure a certain replica is used?


a year ago

like have your replica return a header with its specific ID?


gio
PRO

a year ago

Or rather, when going from our FE to BE, specify which replica to use, etc.


a year ago

Oh, pinning your request to a specific replica? No, that's not possible; we don't support sticky sessions.


gio
PRO

a year ago

Gotcha, ty. Any known issues with having multiple replicas on Inngest? We're wondering if Inngest may be the issue too, particularly when it makes calls between replicas.


a year ago

Make calls between replicas? I'm sorry, I don't follow.

The only issue with replicas and websockets would be if you needed sticky sessions.


gio
PRO

a year ago

I meant possible issues between replicas and Inngest, particularly


a year ago

Nope, no issues there; we run replicas ourselves for the docs and help station


aqt-dev
PROOP

a year ago

Thanks for all the help @Brody! Another question: do you have more granular metrics (CPU, RAM) for each replica?


a year ago

The UI adds them all together


aqt-dev
PROOP

a year ago

Is there any way we can get the breakdown? Or do you have any tips to determine which region is struggling / needs to scale up?


a year ago

I'll check, what service is this in regard to? basement-feathers-api?


a year ago

the two metal east-4 regions did most of the compute


a year ago

you removed the two metal east-4 regions and replaced them with two GCP east-4 regions, but since then east-4 is still the region doing the most compute


a year ago

even right now, the two east regions are doing more than double the compute of Asia


aqt-dev
PROOP

a year ago

Thank you. Is there any sign of contention on those two instances?


aqt-dev
PROOP

a year ago

Will you guys expose these metrics in the dashboard in the future? That'd be helpful for developers.


a year ago

What exactly do you mean by contention?


a year ago

it's something we want to do but just don't have the cycles for


a year ago



Status changed to Solved brody about 1 year ago

