2 months ago
Possibly linked to https://discord.com/channels/713503345364697088/1480754509318848676
Seeing horribly degraded performance accessing the database over the internal routing in recent days.
Things that are instant on our development server are taking many seconds to complete on production (Railway deployment).
6a6c4123-f4c9-4c41-ba5f-c3115ddc1753
No relevant incidents that are still open as far as I can see.
65 Replies
2 months ago
How is this performance measured? do you have tracing?
2 months ago
If possible, please follow the comment I posted in that Discord thread: https://discord.com/channels/713503345364697088/1480754509318848676/1480893347936862310. Let me know if you need any help with it!
Slow internal routing compared to internet routing, and some TCP errors
2 months ago
31 MB on a single packet scares me a little bit. Perhaps your queries are becoming excessively large?
There can be a large amount of data required for processing some of these requests, yes.
It is also happening on smaller queries too though, so size is probably a bit of a red herring
2 months ago
You said that it's lower over public networking, can you confirm? Network Logs also register internet traffic but don't show the exact service, if I recall correctly.
I meant the latency was lower to external connections but the database internal routing was showing high latency.
Our dev server where everything is mirrored doesn't experience any latency, but i really don't know enough about railway "services" to know why they have suddenly become more latent than previously.
I'm making code changes to cut down on request size but was concerned that even a small size was getting errors.
Again not knowledgeable enough to know what the TCP errors are
2 months ago
Those network flow logs are still sub-20 ms, so the issue seems like it's from processing the data once it gets to your application, not the network speed itself.
2 months ago
Yep, I was thinking of that too, so the tcp errors are just normal?
2 months ago
How much time are your requests taking? If it's too high, then we can eliminate the latencies observed in the network flow logs.
2 months ago
We would need app level tracing to be sure.
Seems to be back to the 5xx first byte errors - so i think this really is something beyond the scope of the application to be honest - I can only go on the fact that the same code and db runs much faster on a laptop than it does being deployed.
I'll continue to monitor but I don't really have the expertise to debug routing/deployments in a data centre
Basically every access to mysql is slow.
This is from production - If I clone the mysql database to a laptop and run our application there - I see maybe 1 or 2 of these ever.
Even if I run the mysql database on a server over a LAN the results are the same.
The routing between services in the railway infra seems to be adding many ms of latency. AFAIK.
Like I said I'm not experienced enough to know how railway infra & data centres work at low levels, but the latency here is adding up to very slow application.
Like I also said, this is mostly a new phenomenon, the database used to be quite quick - yes, i've tried redeploying everything to the same results.
2 months ago
What are those times measuring specifically?
They are sequelize benchmarking the queries.
A redeploy which also has sequelize sync the tables to the models, would normally take in the region of 2 minutes - just took over 6.
I don't know what more I can do - everything I look at points to a latency between node and mysql - I don't think there is anything I can do to optimise routing between two railway services in 2 railway containers in the same data centre - or however that works.
Is there any way to deploy node and sequelize within the same service/container to eliminate this latency?
2 months ago
Have you SSH'd into the service and pinged the database to isolate this to L3?
there is no ping or nc on the service
root@d0b4e400b5ab:/app# node -e "const t=Date.now(); require('net').connect(3306, 'mysql.railway.internal', () => { console.log('Latency: ' + (Date.now()-t) + 'ms'); process.exit() })"
Latency: 166ms
root@d0b4e400b5ab:/app# node -e "const t=Date.now(); require('net').connect(3306, 'mysql.railway.internal', () => { console.log('Latency: ' + (Date.now()-t) + 'ms'); process.exit() })"
Latency: 164ms
root@d0b4e400b5ab:/app#
2 months ago
That measures how fast the database can accept the TCP connection; unfortunately, it's not a good test of L3 speeds.
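For anyone following along: the one-liner above passes a hostname to `net.connect`, so the 164 to 166 ms it reports covers the DNS lookup for `mysql.railway.internal` as well as the TCP handshake. Below is a minimal sketch, using only Node's built-in `dns` and `net` modules, that times the two steps separately. The function name `timeDnsAndConnect` is made up for illustration.

```javascript
const dns = require('dns').promises;
const net = require('net');
const { performance } = require('perf_hooks');

// Hypothetical helper: time the DNS lookup and the TCP connect separately.
async function timeDnsAndConnect(host, port) {
  // Step 1: resolve the hostname (same getaddrinfo path net.connect uses).
  const dnsStart = performance.now();
  const { address, family } = await dns.lookup(host);
  const dnsMs = performance.now() - dnsStart;

  // Step 2: connect to the resolved IP, so no second lookup is involved.
  const connectStart = performance.now();
  await new Promise((resolve, reject) => {
    const socket = net.connect({ host: address, port }, () => {
      socket.destroy();
      resolve();
    });
    socket.once('error', reject);
  });
  const connectMs = performance.now() - connectStart;

  return { address, family, dnsMs, connectMs };
}
```

Inside the app container it could be called as `timeDnsAndConnect('mysql.railway.internal', 3306).then(console.log)`. A large `dnsMs` next to a small `connectMs` would point at name resolution rather than the network path.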
🤷♂️
Like i keep saying. I'm not a network infra guy.
Part of choosing railway is that this is all handled and i can just do the code.
I have not a clue what L3 is.
I just have the anecdotal experience.
Even when someone has gone and installed ping ??
bash: /usr/bin/ping: Operation not permitted
So even if I did know a little bit more about infra and Railway, it seems I cannot diagnose this myself either?
2 months ago
Some versions of ping require elevated permissions.
Either way, unfortunately all I am seeing is that the MySQL database itself is slow to accept TCP conns, nothing looks wrong with the networking stack on our end.
2 months ago
I am going to deploy some testing software on the same host your app and database are on just to quadruple check.
Just fyi - there has been no network logging since 1400ish today
In about 1 hour I would imagine the server is going to be hit by a wave of activity as it has to process results data then.
Yesterday it ground to a halt then - where in the many months before it would process the results in a few moments.
I probably won't be back until after the rush this evening to fix any issues that arise
The only thing really left is the ip4/6 and dns resolution.
I have to use mysql.railway.internal to avoid egress costs, unless you can advise an IPv6 address instead?
Research tells me mysql might be doing reverse dns causing further issues with resolution?
But all in all, this remains a "new" issue. We've been fine for months, so i don't see how dns resolution or ipv4/6 issues would cause this in the last few days?
As i asked earlier, is there a way to deploy both node and mysql as one container to avoid the railway latency?
2 months ago
There is no practical way.
2 months ago
Are you opening a new database connection for every SQL call?
2 months ago
Could my deploy logs not loading be related to the ongoing network issues you reported in the previous thread? Thank you!
2 months ago
The utilities app is deployed to the same host your app is deployed to, and the hello world service is deployed to the same host as your database.
reply from 10.218.64.162 - 2.062141ms
reply from 10.218.64.162 - 2.301472ms
reply from 10.218.64.162 - 549.625µs
reply from 10.218.64.162 - 564.215µs
minimum: 549.625µs
average: 1.369363ms
maximum: 2.301472ms
total: 5.477453ms
The only thing remaining would be DNS lookups, and if you are making a new connection to MySQL for all SQL queries, you will see latency with that.
10.218.64.162
dns lookup took 147.624692ms
So if you are making a new connection per SQL query, please use a pool.
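A rough sketch of why connection reuse matters here, assuming the driver resolves the hostname each time it opens a new socket (which is how Node's `net.connect` behaves when given a hostname). Plain TCP sockets stand in for MySQL connections, and `connectCounted`, `connectPerQuery` and `reuseConnection` are hypothetical names, not part of sequelize or any MySQL driver.

```javascript
const dns = require('dns');
const net = require('net');

// Open one TCP connection and count every hostname lookup it triggers.
// `resolver` defaults to dns.lookup; it is injectable only for testing.
function connectCounted(host, port, stats, resolver = dns.lookup) {
  return new Promise((resolve, reject) => {
    const lookup = (hostname, options, callback) => {
      stats.lookups += 1;
      resolver(hostname, options, callback);
    };
    const socket = net.connect({ host, port, lookup }, () => resolve(socket));
    socket.once('error', reject);
  });
}

// Pattern 1: a fresh connection for every "query" (one lookup each time).
async function connectPerQuery(host, port, queries, resolver) {
  const stats = { lookups: 0 };
  for (let i = 0; i < queries; i++) {
    const socket = await connectCounted(host, port, stats, resolver);
    socket.destroy();
  }
  return stats.lookups;
}

// Pattern 2: one connection reused for every "query", as a pool would do.
async function reuseConnection(host, port, queries, resolver) {
  const stats = { lookups: 0 };
  const socket = await connectCounted(host, port, stats, resolver);
  for (let i = 0; i < queries; i++) {
    // Stand-in for sending a query over the already-open socket.
    await new Promise((resolve) => socket.write('ping\n', resolve));
  }
  socket.destroy();
  return stats.lookups;
}
```

With a lookup costing around 147 ms as measured above, the first pattern would pay that on every query, while the second pays it once per connection.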
2 months ago
I'm going to assume so as they were drastically slow to load in ~10 mins for like 20 lines of logs haha. Much love to Railway ❤ hope the server issues we've had recently are from you guys just blowing up and nothing bad - my problem is solved
2 months ago
As I have just shown above, there are no networking issues, at least for this specific user, so I believe your issue would be unrelated.
We would also experience the latency here if it was an application layer thing such as DNS lookups....
2 months ago
DNS lookup time is the only thing on our end that would add time here, and that is a solvable problem at the application level; otherwise, there are no other networking latencies, as shown above.
So with that said, I will go ahead and open this thread up for further community support and disengage myself.
https://discord.com/channels/713503345364697088/1480873601052835970/1481349349748248588
I asked here - is there a way we can move to ipv6 addresses rather than dns at all?
Is there a way to disable any DNS lookup by MySQL that could be causing double hopping when IPv4 fails over to IPv6?
Why did all this start in the last week or so?
I mean - every AI I ask - even ones with access to our code base indicate the likely cause is "Railway's network routing"
We don't really want to have to migrate, but if we cannot get to the bottom of the routing issue we may have to.
I've trimmed a lot of queries and added indexes to speed up queries, but we've narrowed it down to the routing between the app and mysql - sequelize is appropriately pooling connections - it seemingly is the routing, though Brody disagrees.
Seemingly we are at an impasse. Is there any way to diagnose the DNS resolving? Is mysql.railway.internal giving IPv4 results and then IPv6 results, or vice versa, causing latency? Is there a user-accessible configuration of the containers that can help with resolving? Can we switch to direct IP addresses?
2 months ago
Just catching up on this thread. Is OpenTelemetry set up on your application? You could confirm that it's DNS by using https://www.npmjs.com/package/@opentelemetry/instrumentation-dns
2 months ago
Also, in theory you could switch to a direct IP by resolving the DNS name and using the resulting address. Be aware that those addresses are not persistent, but for a test it should be fine.
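A small sketch of how that resolution could be done from inside the app container with Node's built-in `dns` module, which also shows whether the name returns IPv4, IPv6 or both, and how long the lookup takes. `listAddresses` is a hypothetical helper name.

```javascript
const dns = require('dns').promises;
const { performance } = require('perf_hooks');

// Hypothetical helper: resolve a hostname via getaddrinfo (the same path
// net.connect uses) and return every address plus the time the lookup took.
async function listAddresses(hostname) {
  const start = performance.now();
  // { all: true } returns an array of { address, family } objects.
  const addresses = await dns.lookup(hostname, { all: true });
  const elapsedMs = performance.now() - start;
  return { addresses, elapsedMs };
}
```

Calling `listAddresses('mysql.railway.internal').then(console.log)` from the app service would print the candidate addresses, one of which could then be tried as the database host for a test run.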
I can maybe look into this tomorrow - unfortunately right now I have the task of trying to process results.
At the moment this has taken 11 minutes, yet on our duplicate dev server it finished after 2 minutes.
That includes all the calls to external apis that are involved.
The only difference between the code here and the production code is the railway dockers around the app and mysql and the routing between them.
I scored this result running our entire app on my laptop to the dev server mysql - so it had real network (wifi) latency and still completed in 2 mins.
In fact - the scoring on the production (railway) server just died with the pictured error 😕
There are at least 7 more that need this level of processing (thats 1hr 30 of processing that should take 14 minutes)
Our only other option is to export the data to our dev server - do the scoring - and re-upload it - but that would be egress costs and it would be really silly to have to export things like that to get them to process at a reasonable speed.
2 months ago
Railway has a limit of 15 minutes per request, so maybe your request took longer than that.
As Brody has shown, the only issue that could be here is DNS latency. You could try getting the IP directly by resolving the private domain inside your container.
2 months ago
For that, you could use the Railway CLI:
railway link
railway ssh
dig
and then specify that IP on your database URL
isn't the private database just mysql.railway.internal?
is dig installed as there was no ping or nc earlier - though ping appeared at some point but I didn't have permission to use it
2 months ago
But again, that's just for testing, to confirm whether your ORM is doing a DNS query per SQL query. Those IPs are not going to stay static forever.
2 months ago
You could install it depending on your container, try apk or apt (common package managers)
2 months ago
Or run a temporary Ubuntu image too
Right - so I'm just going to have to live with the latency of https://discord.com/channels/713503345364697088/1480873601052835970/1481297250385920070
There isn't any way around that in production?
I am currently stripping down functions to limit database use and probably will spend more time on that.
But again - this started suddenly in the last couple of days
2 months ago
If your ORM is doing a DNS query per SQL query, we could then fix that by using pooling (maybe it's not working, that's why I want to confirm with a static IP)
2 months ago
I also hate asking you to do that, but we really get a lot of people here with "it was working fine a few days ago" only to confirm it's a code issue.
If the static IP doesn't solve the issue, then we could investigate the possibility of it being a Railway issue.
It's literally a pretty standard npm sequelize setup.
I just don't understand why the exact same setup (from the git that railway builds from) doesn't have the same issues on multiple locations and setups and this is only on railway - but "is not a railway routing issue" - is it a docker to docker latency issue?
2 months ago
Maybe your other hosts have faster DNS resolution. As you saw previously, Railway DNS may take a while to respond (147ms, as shown by Brody).
2 months ago
Also, you could modify your code to resolve the DNS first and then pass it down to sequelize.
If the process is long enough lived is that practical?
If the application and the mysql hosts stay "up" will they retain the same address so that this might work?
as in, is the transient nature of dynamic addresses limited to not changing while they are alive - just each deployment / restart can obtain new addresses?
2 months ago
It should be ok as long as both stay up. I'm not too familiar with sequelize, but maybe they have a callback that is called when a reconnection happens? You could then refresh the DNS.
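A minimal sketch of the resolve-once idea, assuming the address stays valid while both services are up and needs refreshing after a restart or redeploy. `createHostCache` is a hypothetical helper built on Node's `dns` module; how it would be hooked into sequelize's reconnect handling is left open, since that part is not confirmed in this thread.

```javascript
const dns = require('dns').promises;

// Hypothetical helper: resolve each hostname once and reuse the answer
// until the caller invalidates it (e.g. after a connection error).
function createHostCache() {
  const cache = new Map(); // hostname -> Promise<string> (resolved address)

  return {
    resolve(hostname) {
      if (!cache.has(hostname)) {
        const pending = dns.lookup(hostname).then(({ address }) => address);
        // Drop failed lookups so the next call retries instead of
        // returning the same rejected promise forever.
        pending.catch(() => {
          if (cache.get(hostname) === pending) cache.delete(hostname);
        });
        cache.set(hostname, pending);
      }
      return cache.get(hostname);
    },
    // Forget a cached address so the next resolve() performs a new lookup.
    invalidate(hostname) {
      cache.delete(hostname);
    },
  };
}
```

The resolved address would then be passed as the `host` option when constructing the Sequelize instance, with `invalidate('mysql.railway.internal')` called before reconnecting after a failure.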
2 months ago
But again, maybe I would confirm first that it's a DNS issue then we could look for fixes to it, resolving the DNS query with code at runtime might be the best way to test it and then we can think of the side effects
Ok, well I'll look into that tomorrow - thanks for your assistance - I'm gonna focus on getting stuff processed now will return with an update whatever happens - thanks again
2 months ago
No problem, if you really need to process it one time, maybe public networking is a good workaround (just be aware of the egress).
Hi, sorry I didn't return sooner, we took the decision to migrate away from railway for the time being.
Whilst we were very happy, issues like this, and ultimately outages, just pushed us away.
We have decided to go the "own server" route and just FYI we migrated everything and the routing between the two dockers on our new server is orders of magnitude quicker than it was here.
Whilst you couldn't find anything and @Brody insisted L3 was "ok", something clearly was an issue, I hope you can find it for other customers in the future.
Thanks
2 months ago
We are indeed working on making DNS resolutions for the private network way faster.
2 months ago
Sad to hear that, but I understand the issue you were having with DNS. As Brody mentioned, they are working on it. I hope to see you here again sometime in the future!