Multi-regional Deployment down for certain regions
aqt-dev
PROOP

a year ago

Hi team, since switching to multi-region we frequently observe downtime during certain periods. A restart/redeploy helps most of the time. Most of the disruption is located in the US region. Could we get some help looking into this?

ProjectId: bd3c882d-f3ad-4f0d-b553-e27e0201d165

Solved

42 Replies

aqt-dev
PROOP

a year ago

Hi, can we get someone to take a look at this please!


a year ago

hello


a year ago

can you define what downtime means for you in practice?


aqt-dev
PROOP

a year ago

Hi @Brody, it seems to us that our front-end was unable to establish a connection to the servers. The issue is quite subtle, and we couldn't tell what was going wrong from our monitoring.


a year ago

can you provide a little more detail? that could mean many hundreds of things


aqt-dev
PROOP

a year ago

  • From 2024-12-11 6:15am ET to 9:15am ET, we observed elevated errors from 3 out of 10 of our backend servers. Our engineers located in the US were unable to access the site.

  • We use Node.js with WebSockets as the means of communication between our BE <> FE. Below are the errors we saw during the disruption:

  • Other regions did not seem impacted; folks there could still access the site.

  • No sign of resource exhaustion; CPU + memory look normal.

  • Our Cloudflare looks normal as well.

  Error: Parse Error

      at socketOnEnd (node:_http_server:803:22)

      at Socket.emit (node:events:531:35)

      at Socket.emit (node:domain:488:12)

      at endReadableNT (node:internal/streams/readable:1696:12)

      at processTicksAndRejections (node:internal/process/task_queues:82:21)

aqt-dev
PROOP

a year ago

No deployments before the time of the disruption


aqt-dev
PROOP

a year ago

1317242210348957700


aqt-dev
PROOP

a year ago

This was not the only time we saw the issue though. It happened to us twice this week.

That's all we have on our end. Posting the question here hoping we could get more insight from Railway to figure out what's causing the disruption.


a year ago

there's nothing on our side from our network availability monitoring, and we have no observability into what your code is doing, so unfortunately we can't tell you much about what went wrong at your application layer


aqt-dev
PROOP

a year ago

@Brody do you observe any abnormalities during that time on any of our instances? Any sign of degradation in the system metrics of an individual instance? Did any of them bear more load than the others?


aqt-dev
PROOP

a year ago

What bothers us is that this only happened in a certain region (US East, to be more specific), so we doubt the application layer is the culprit here.


a year ago

you can check your service metrics to see cpu / memory / networking within the service panel


a year ago

we don't have any application level monitoring so we can't tell you what happened with the application itself


gio
PRO

a year ago

The issue is that it's hard for us to see what is happening at an application level, even locally for example, because we can only reproduce it on Railway in certain replicas


gio
PRO

a year ago

For example, we debugged at one point by changing VPNs, but the above error is at such a low level that it wasn't useful. And this consistently does not happen across all replicas, which is why we were wondering if it's possibly a Railway issue


gio
PRO

a year ago

For example, we use websockets. Is it possible that websocket connections over streams on your replicas are causing this?


gio
PRO

a year ago

I'd imagine if so, maybe someone on your team has seen others experience this prior


a year ago

There are no issues with websockets here!

Handling all possible errors, verbose logging, and tracing will help you debug any further application-level issues.
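A stdlib-only sketch of that advice: attach error listeners to every accepted connection and keep process-level last-resort handlers, so nothing fails silently while debugging (handler bodies here are illustrative):

```javascript
// Last-resort handlers: anything that escapes local error handling
// gets logged with a timestamp instead of killing the process silently.
// (For debugging only; in production you'd normally exit after logging.)
process.on('uncaughtException', (err) => {
  console.error(`[${new Date().toISOString()}] uncaughtException:`, err.stack);
});
process.on('unhandledRejection', (reason) => {
  console.error(`[${new Date().toISOString()}] unhandledRejection:`, reason);
});

const http = require('node:http');
const server = http.createServer((req, res) => res.end('ok'));

// Per-connection error logging: raw TCP errors (resets, timeouts)
// are otherwise easy to lose in a multi-replica setup.
server.on('connection', (socket) => {
  socket.on('error', (err) => {
    console.error(`[${new Date().toISOString()}] socket error:`, err.message);
  });
});
```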


gio
PRO

a year ago

Gotcha


a year ago

really sorry we couldn't help here, but there's just nothing we can tell you about this issue from the platform side of things.


gio
PRO

a year ago

Understood. Appreciate it


gio
PRO

a year ago

@Brody, to help test a theory regarding replicas: is it possible to add a header to a request to ensure a certain replica is used?


a year ago

like have your replica return a header with its specific ID?


gio
PRO

a year ago

Or rather, when going from our FE to BE, specify which replica to use, etc.


a year ago

Oh, pinning your request to a specific replica? No, that's not possible; we don't support sticky sessions.


gio
PRO

a year ago

Gotcha, ty. Any known issues with having multiple replicas on Inngest? We're wondering if Inngest may be the issue too, particularly when it makes calls between replicas.


a year ago

Make calls between replicas? I'm sorry, I don't follow.

The only issue with replicas and websockets would be if you needed sticky sessions.


gio
PRO

a year ago

I meant possible issues between replicas and Inngest, particularly


a year ago

Nope, no issues there; we run replicas ourselves for the docs and help station


aqt-dev
PROOP

a year ago

Thanks for all the help @Brody! Another question: do you have more granular metrics (CPU, RAM) for each replica?


a year ago

The UI adds them all together


aqt-dev
PROOP

a year ago

Is there any way we can get the breakdown? Or do you have any tips to determine which region is struggling / needs to scale up?


a year ago

I'll check, what service is this in regard to? basement-feathers-api?


a year ago

the two metal east-4 regions did most of the compute


a year ago

you removed the two metal east-4 regions and replaced them with two GCP east-4 regions, but since then east-4 is still the region doing the most compute


a year ago

even right now, the two east regions are doing more than double the compute of Asia


aqt-dev
PROOP

a year ago

Thank you. Is there any sign of contention on those two instances?


aqt-dev
PROOP

a year ago

Will you guys expose these metrics in the dashboard in the future? That'd be helpful for developers.


a year ago

What exactly do you mean by contention?


a year ago

it's something we want to do but just don't have the cycles for


a year ago



Status changed to Solved brody about 1 year ago

