Slow performance MySQL<->Node
gazhay
PROOP

2 months ago

Possibly linked to https://discord.com/channels/713503345364697088/1480754509318848676

Seeing horribly degraded performance accessing the database over the internal routing in recent days.

Things that are instant on our development server are taking many seconds to complete on production (Railway deployment).

6a6c4123-f4c9-4c41-ba5f-c3115ddc1753

No relevant incidents that are still open as far as I can see.

$20 Bounty

65 Replies

2 months ago

How is this performance measured? do you have tracing?


2 months ago

If possible, please follow the comment I posted in that Discord thread: https://discord.com/channels/713503345364697088/1480754509318848676/1480893347936862310. Let me know if you need any help with it!


gazhay
PROOP

2 months ago

Slow internal routing compared to internet routing and some TCP errors

image.png



2 months ago

31 MB on a single packet scares me a little bit. Perhaps your queries are becoming excessively large?


gazhay
PROOP

2 months ago

There can be a large amount of data required for processing some of these requests, yes.


gazhay
PROOP

2 months ago

It is also happening on smaller queries too though, so size is probably a bit of a red herring

image.png



2 months ago

You said that it's lower over public networking, can you confirm? Network Logs also register internet traffic but don't show the exact service, if I recall correctly.


gazhay
PROOP

2 months ago

I meant the latency was lower to external connections but the database internal routing was showing high latency.

Our dev server where everything is mirrored doesn't experience any latency, but i really don't know enough about railway "services" to know why they have suddenly become more latent than previously.

I'm making code changes to cut down on request size but was concerned that even a small size was getting errors.

Again not knowledgeable enough to know what the TCP errors are


2 months ago

Those network flow logs are still sub-20 ms, so the issue seems like it's from processing the data once it gets to your application, not the network speed itself.


2 months ago

Yep, I was thinking of that too, so the tcp errors are just normal?


2 months ago

How much time are your requests taking? If it's much higher than the latencies observed in the network flow logs, then we can rule those out as the cause.


2 months ago

We would need app level tracing to be sure.


gazhay
PROOP

2 months ago

Seems to be back to the 5xx first byte errors - so i think this really is something beyond the scope of the application to be honest - I can only go on the fact that the same code and db runs much faster on a laptop than it does being deployed.

I'll continue to monitor but I don't really have the expertise to debug routing/deployments in a data centre


gazhay
PROOP

2 months ago

Basically every access to mysql is slow.

This is from production - If I clone the mysql database to a laptop and run our application there - I see maybe 1 or 2 of these ever.

Even if I run the mysql database on a server over a LAN the results are the same.

The routing between services in the railway infra seems to be adding many ms of latency. AFAIK.

Like I said I'm not experienced enough to know how railway infra & data centres work at low levels, but the latency here is adding up to very slow application.

Like I also said, this is mostly a new phenomenon, the database used to be quite quick - yes, i've tried redeploying everything to the same results.

image.png



2 months ago

What are those times measuring specifically?


gazhay
PROOP

2 months ago

They are sequelize benchmarking the queries.

A redeploy which also has sequelize sync the tables to the models, would normally take in the region of 2 minutes - just took over 6.

I don't know what more I can do - everything I look at points to a latency between node and mysql - I don't think there is anything I can do to optimise routing between two railway services in 2 railway containers in the same data centre - or however that works.

Is there any way to deploy node and sequelize within the same service/container to eliminate this latency?


2 months ago

Have you SSH'd into the service and pinged the database to isolate this to L3?


gazhay
PROOP

2 months ago

there is no ping or nc on the service

root@d0b4e400b5ab:/app# node -e "const t=Date.now(); require('net').connect(3306, 'mysql.railway.internal', () => { console.log('Latency: ' + (Date.now()-t) + 'ms'); process.exit() })"
Latency: 166ms
root@d0b4e400b5ab:/app# node -e "const t=Date.now(); require('net').connect(3306, 'mysql.railway.internal', () => { console.log('Latency: ' + (Date.now()-t) + 'ms'); process.exit() })"
Latency: 164ms
root@d0b4e400b5ab:/app# 

2 months ago

That measures how fast the database can accept the TCP connection; unfortunately, it's not a good test of L3 speeds.
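One way to separate the two costs is to time the DNS lookup and the TCP handshake independently. The one-liner above passes a hostname to `net.connect`, so its figure includes name resolution as well as the connection itself. The sketch below is only an illustration: `probe` and `summarize` are hypothetical helper names, and it relies on nothing beyond Node's built-in `dns` and `net` modules.

```javascript
const dnsPromises = require('dns').promises;
const net = require('net');

// Milliseconds elapsed since a process.hrtime.bigint() timestamp.
function msSince(start) {
  return Number(process.hrtime.bigint() - start) / 1e6;
}

// Time the DNS lookup and the TCP connect as two separate steps.
async function probe(host, port) {
  const t0 = process.hrtime.bigint();
  const { address } = await dnsPromises.lookup(host);
  const dnsMs = msSince(t0);

  const t1 = process.hrtime.bigint();
  await new Promise((resolve, reject) => {
    // Connecting to the IP literal means no second lookup is performed here.
    const socket = net.connect({ host: address, port }, () => {
      socket.destroy();
      resolve();
    });
    socket.once('error', reject);
  });
  const connectMs = msSince(t1);

  return { address, dnsMs, connectMs };
}

// Minimum, average and maximum of a list of timings in ms.
function summarize(samples) {
  if (samples.length === 0) return null;
  const total = samples.reduce((sum, value) => sum + value, 0);
  return {
    min: Math.min(...samples),
    avg: total / samples.length,
    max: Math.max(...samples),
  };
}
```

From inside the container, something like `probe('mysql.railway.internal', 3306).then(console.log)` would print the two timings side by side. Running it a few times and passing the `dnsMs` or `connectMs` values to `summarize` gives a min/avg/max for each step.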


gazhay
PROOP

2 months ago

🤷‍♂️

Like i keep saying. I'm not a network infra guy.

Part of choosing railway is that this is all handled and i can just do the code.

I have not a clue what L3 is.

I just have the anecdotal experience.


gazhay
PROOP

2 months ago

Even when someone has gone and installed ping ??

bash: /usr/bin/ping: Operation not permitted

gazhay
PROOP

2 months ago

So even if I did know a little bit more about infra and Railway, it seems I cannot diagnose this myself either?


2 months ago

Some versions of ping require elevated permissions.

Either way, unfortunately all I am seeing is that the MySQL database itself is slow to accept TCP conns, nothing looks wrong with the networking stack on our end.


2 months ago

I am going to deploy some testing software on the same host your app and database are on just to quadruple check.


gazhay
PROOP

2 months ago

Just fyi - there has been no network logging since 1400ish today

image.png



gazhay
PROOP

2 months ago

In about 1 hour I would imagine the server is going to be hit by a wave of activity as it has to process results data then.

Yesterday it ground to a halt then - where in the many months before it would process the results in a few moments.

I probably won't be back until after the rush this evening to fix any issues that arise


gazhay
PROOP

2 months ago

The only thing really left is the ip4/6 and dns resolution.

I have to use mysql.railway.internal to avoid egress costs, unless you can advise a ipv6 address instead?

Research tells me mysql might be doing reverse dns causing further issues with resolution?

But all in all, this remains a "new" issue. We've been fine for months, so i don't see how dns resolution or ipv4/6 issues would cause this in the last few days?

As i asked earlier, is there a way to deploy both node and mysql as one container to avoid the railway latency?


2 months ago

As i asked earlier, is there a way to deploy both node and mysql as one container to avoid the railway latency?

There is no practical way.


2 months ago

Are you opening a new database connection for every SQL call?


2 months ago

Could my deploy logs not loading be related to the ongoing network issues you reported in the previous thread? Thank you!


2 months ago

The utilities app is deployed to the same host your app is deployed to, and the hello world service is deployed to the same host as your database.

reply from 10.218.64.162 - 2.062141ms
reply from 10.218.64.162 - 2.301472ms
reply from 10.218.64.162 - 549.625µs
reply from 10.218.64.162 - 564.215µs

minimum: 549.625µs
average: 1.369363ms
maximum: 2.301472ms
total: 5.477453ms

The only thing remaining would be DNS lookups, and if you are making a new connection to MySQL for all SQL queries, you will see latency with that.

10.218.64.162

dns lookup took 147.624692ms

So if you are making a new connection per SQL query, please use a pool.
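For reference, Sequelize takes its pool settings through the `pool` option of the constructor. The sketch below shows the shape of such a configuration. The numbers are illustrative assumptions rather than recommendations, and the database name and credentials in the commented lines are placeholders. Keeping `min` above zero is meant to hold a few connections open, so that most queries would not need a fresh lookup and handshake.

```javascript
// Illustrative values only; tune them for the actual workload.
const poolOptions = {
  max: 10,        // upper bound on open connections
  min: 2,         // connections kept open while idle
  acquire: 30000, // ms to wait for a free connection before erroring
  idle: 10000,    // ms a connection may sit unused before being released
};

const sequelizeOptions = {
  host: 'mysql.railway.internal',
  port: 3306,
  dialect: 'mysql',
  pool: poolOptions,
  benchmark: true, // passes the elapsed ms as the second argument to `logging`
  logging: (sql, ms) => console.log(`${ms} ms  ${sql}`),
};

// With the sequelize package installed (placeholder name and credentials):
// const { Sequelize } = require('sequelize');
// const sequelize = new Sequelize('dbname', 'user', 'password', sequelizeOptions);
```

If the per-query times logged this way stay high even with warm connections in the pool, connection setup is less likely to be the whole story.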


dev-nolant

Could my deploy logs not loading be related to the ongoing network issues you reported in the previous thread? Thank you!

2 months ago

I'm going to assume so as they were drastically slow to load in ~10 mins for like 20 lines of logs haha. Much love to Railway ❤ hope the server issues we've had recently are from you guys just blowing up and nothing bad - my problem is solved


dev-nolant

Could my deploy logs not loading be related to the ongoing network issues you reported in the previous thread? Thank you!

2 months ago

As I have just shown above, there are no networking issues, at least for this specific user, so I believe your issue would be unrelated.


gazhay
PROOP

2 months ago

No. Sequelize is handling pooling.


gazhay
PROOP

2 months ago

We would also experience the latency here if it was an application layer thing such as DNS lookups....


2 months ago

DNS lookup time is the only thing on our end that would add time here, and that is a solvable problem at the application level; otherwise, there are no other networking latencies, as shown above.

So with that said, I will go ahead and open this thread up for further community support and disengage myself.


gazhay
PROOP

2 months ago

https://discord.com/channels/713503345364697088/1480873601052835970/1481349349748248588

I asked here - is there a way we can move to ipv6 addresses rather than dns at all?

Is there a way to disable any DNS lookup by MySQL that could be causing double hopping when IPv4 fails over to IPv6?

Why did all this start in the last week or so?


gazhay
PROOP

2 months ago

I mean - every AI I ask - even ones with access to our code base indicate the likely cause is "Railway's network routing"

We don't really want to have to migrate, but if we cannot get to the bottom of the routing issue we may have to.


gazhay
PROOP

2 months ago

I've trimmed a lot of queries, added indexes to speed up queries, but we've narrowed it down to the routing between the app and mysql - sequelize is appropriately pooling connections - it seemingly is the routing, though brody disagrees.

Seemingly we are at an impasse. Is there any way to diagnose the DNS resolution? Is mysql.railway.internal giving IPv4 results and/or then IPv6 results, or vice versa, causing latency? Is there a user-accessible configuration of the containers that can help with resolving? Can we switch to direct IP addresses?


2 months ago

Just catching up on this thread. Is OpenTelemetry set up on your application? You could confirm that it's DNS by using https://www.npmjs.com/package/@opentelemetry/instrumentation-dns


2 months ago

Also, in theory you could switch to a direct IP by resolving the DNS name and then connecting to that address. Be aware that these IPs are not persistent, but for a test it should be fine.


gazhay
PROOP

2 months ago

I can maybe look into this tomorrow - unfortunately right now I have the task of trying to process results.

At the moment this has taken 11 minutes, yet on our duplicate dev server it finished after 2 minutes.

That includes all the calls to external apis that are involved.

The only difference between the code here and the production code is the railway dockers around the app and mysql and the routing between them.

I scored this result running our entire app on my laptop to the dev server mysql - so it had real network (wifi) latency and still completed in 2 mins.

In fact - the scoring on the production (railway) server just died with the pictured error 😕

image.png

image.png


gazhay
PROOP

2 months ago

The scoring did however complete at 9:21pm


gazhay
PROOP

2 months ago

There are at least 7 more that need this level of processing (that's 1hr 30 of processing that should take 14 minutes)

Our only other option is to export the data to our dev server - do the scoring - and re-upload it - but that would be egress costs and it would be really silly to have to export things like that to get them to process at a reasonable speed.


2 months ago

Railway has a limit of 15 minutes per request, so maybe your request took longer than that.

As Brody has shown, the only issue that could be here is DNS latency, you could try getting the IP directly by resolving the private domain inside your container.


2 months ago

For that, you could use the Railway CLI:

  1. railway link
  2. railway ssh
  3. dig

and then specify that IP on your database URL


gazhay
PROOP

2 months ago

isn't the private database just mysql.railway.internal?

is dig installed as there was no ping or nc earlier - though ping appeared at some point but I didn't have permission to use it


2 months ago

But again, that's just for testing to confirm that your ORM is doing a DNS query per SQL query, those IPs are not going to be forever static.
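A dependency-free way to check this is to wrap Node's `dns.lookup` and count how often it is called while the application runs queries. This is a rough sketch, not the OpenTelemetry instrumentation mentioned earlier. `lookupStats` is a hypothetical name, and the approach assumes the MySQL driver connects through Node's `net` module, which resolves hostnames via `dns.lookup` by default.

```javascript
const dns = require('dns');

// Running totals for every dns.lookup call made by this process.
const lookupStats = { count: 0, totalMs: 0, maxMs: 0 };
const originalLookup = dns.lookup;

dns.lookup = function timedLookup(hostname, options, callback) {
  // dns.lookup(hostname, callback) is also valid, so normalise the arguments.
  if (typeof options === 'function') {
    callback = options;
    options = {};
  }
  const start = process.hrtime.bigint();
  return originalLookup.call(dns, hostname, options, (...results) => {
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    lookupStats.count += 1;
    lookupStats.totalMs += ms;
    lookupStats.maxMs = Math.max(lookupStats.maxMs, ms);
    // Forward whatever the real lookup returned, unchanged.
    callback(...results);
  });
};
```

This would need to be loaded before the ORM opens its first connection. Logging `lookupStats` every so often then shows the pattern: if `count` grows with the number of queries rather than with the number of pooled connections, a lookup is happening per query.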


2 months ago

You could install it depending on your container, try apk or apt (common package managers)


2 months ago

Or run a temporary Ubuntu image too


gazhay
PROOP

2 months ago

Right - so I'm just going to have to live with the latency of https://discord.com/channels/713503345364697088/1480873601052835970/1481297250385920070

There isn't any way around that in production?

I am currently stripping down functions to limit database use and probably will spend more time on that.

But again - this started suddenly in the last couple of days


2 months ago

If your ORM is doing a DNS query per SQL query, we could then fix that by using pooling (maybe it's not working, that's why I want to confirm with a static IP)


2 months ago

I also hate asking you to do that, but we really get a lot of people here with "it was working fine a few days ago" only to confirm it's a code issue.

If the static IP doesn't solve the issue, then we could investigate the possibility of it being a Railway issue.


gazhay
PROOP

2 months ago

It's literally npm sequelize, set up pretty standard

I just don't understand why the exact same setup (from the git that railway builds from) doesn't have the same issues on multiple locations and setups and this is only on railway - but "is not a railway routing issue" - is it a docker to docker latency issue?

image.png



2 months ago

Maybe your other hosts have faster DNS resolution. As you saw previously, Railway DNS may take a while to respond (147 ms, as shown by Brody).


2 months ago

Also, you could modify your code to resolve the DNS first and then pass it down to sequelize.
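A minimal sketch of that idea, assuming the connection settings live in a plain object: resolve the private hostname once at startup, then hand the resulting address to the ORM in place of the name. `resolveOnce` and `withResolvedHost` are hypothetical helper names, not part of Sequelize.

```javascript
const dnsPromises = require('dns').promises;

// Resolve the hostname a single time, using the same resolver path
// that Node's net module uses by default.
async function resolveOnce(hostname) {
  const { address, family } = await dnsPromises.lookup(hostname);
  return { address, family };
}

// Return a copy of the connection config with the hostname swapped
// for the resolved address; the original object is left untouched.
function withResolvedHost(config, address) {
  return { ...config, host: address };
}

// Intended use at startup (not executed here):
//   const base = { host: 'mysql.railway.internal', port: 3306, dialect: 'mysql' };
//   const { address } = await resolveOnce(base.host);
//   const options = withResolvedHost(base, address);
//   // pass `options` to the Sequelize constructor along with the credentials
```

Since the private addresses are not guaranteed to stay the same across deploys or restarts, this only suits a test, or a setup that resolves again whenever a connection fails.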


gazhay
PROOP

2 months ago

If the process is long enough lived is that practical?

If the application and the mysql hosts stay "up" will they retain the same address so that this might work?


gazhay
PROOP

2 months ago

as in, is the transient nature of dynamic addresses limited to not changing while they are alive - just each deployment / restart can obtain new addresses?


2 months ago

It should be ok as long as both stay up, I'm not too familiar with sequelize but maybe they have a callback method to be called when a reconnection happens? you could then refresh the DNS.


2 months ago

But again, I would first confirm that it's a DNS issue, then we can look for fixes. Resolving the DNS name in code at runtime might be the best way to test it, and then we can think about the side effects.


gazhay
PROOP

2 months ago

Ok, well I'll look into that tomorrow - thanks for your assistance - I'm gonna focus on getting stuff processed now will return with an update whatever happens - thanks again


2 months ago

No problem, if you really need to process it one time, maybe public networking is a good workaround (just be aware of the egress).


gazhay
PROOP

2 months ago

Hi, sorry I didn't return sooner, we took the decision to migrate away from railway for the time being.

Whilst we were very happy, issues like this and ultimately outages just pushed us away.

We have decided to go the "own server" route and just FYI we migrated everything and the routing between the two dockers on our new server is orders of magnitude quicker than it was here.

Whilst you couldn't find anything and @Brody insisted L3 was "ok", something clearly was an issue, I hope you can find it for other customers in the future.

Thanks


2 months ago

We are indeed working on making DNS resolutions for the private network way faster.


2 months ago

Sad to hear that, but I understand the issue you were having with DNS. As Brody mentioned, they are working on it. I hope to see you here again sometime in the future!

