Random database slowness
stan
HOBBYOP

a year ago

See Sentry traces: https://imgur.com/a/gtEQALx

For the same request, the database calls are suddenly an order of magnitude slower: from 1-5ms to 15-80ms.

All the services in my project are in the same region, us-west (Oregon, non-metal), and I'm using the private network between my web service and my Postgres service.

What could be the reason?

26 Replies

adam
MODERATOR

a year ago

Can you please verify that your services are in fact in the same region? Latency like this should only be seen when services are in different regions.


stan
HOBBYOP

a year ago

If you mean double-check in the UI, then yes, I have verified that they are in the same region (us-west, Oregon, non-metal) 🙂 When the web service is running on us-west (California/metal), the DB latency is worse.

Here it's pretty random, but sometimes I am able to get ~1ms latency to the DB (cf. the Sentry traces), which indicates that they are close.


stan
HOBBYOP

a year ago

I can observe the same with Redis: sometimes sub-ms response times, sometimes more than 10ms. So I would rule out a Postgres-specific issue.

I can see the variance not only across requests but also within a single request. It looks like some network or resource (e.g. CPU) issue. 🤔


stan
HOBBYOP

a year ago

For the same endpoint, queried every 3s by curl (no other concurrent requests), which does one Redis GET, one Postgres SELECT, and one Redis PUT, I can see response times ranging from 10ms to 100ms+ (up to 300ms): https://i.imgur.com/dNBLe6Y.png


stan
HOBBYOP

a year ago

This might be anecdotal, but by querying continuously (while true; curl), the slowness seems somewhat periodic, every second or so: https://i.imgur.com/27cvZ37.png


stan
HOBBYOP

a year ago

Here are 3 traces: https://imgur.com/a/DogtacM

  • a very fast one (30ms)

  • a very slow one (300ms)

  • one that is slow in the first half and fast in the second half

This is a Laravel app, FYI.

I tried both caddy+frankenphp+octane and a more classic nginx+php-fpm; I see very bad response times on both (terrible with php-fpm).

And considering this happens with both Redis and Postgres (and with the processing of the request itself), with no other traffic to the app, I think this is an issue with Railway 😢

Any clue?


brody
EMPLOYEE

a year ago

Hello,

I've checked the hosts your web service, database, and Redis are running on; there are no noisy neighbors in terms of CPU or network.

So I don't think it's an issue on our side at this time, also given that no one else has reported such issues.

Would it be possible to ask you to put together a minimal reproducible example in something other than PHP, to rule out any issues with how you have PHP set up? If this is indeed an issue on our side, it would be reproducible with other languages too.


stan
HOBBYOP

a year ago

Thanks for the feedback 🫡 I was wondering the same!

I launched a quick Go PoC (code: https://gist.github.com/angristan/9139b26a390a9ae236e5fe711ae58ec7#file-main-go)

I can see the exact same behaviour, with overall response times ranging from 1ms to 100ms.

Sample traces: https://imgur.com/a/Emd175W

I don't want to blindly blame Railway, but I'm out of ideas at this point 🙂


brody
EMPLOYEE

a year ago

Very compelling evidence against us; thank you for the MRE. Long shot, but can you try redeploying the database?


stan
HOBBYOP

a year ago

I just did, and I can see the same behaviour with a brand-new Postgres service.


brody
EMPLOYEE

a year ago

Would it be too much to ask you to try with Postgres 16 instead?


stan
HOBBYOP

a year ago

Sure, I did, and the behaviour is the same. But since I see the exact same behaviour with Redis, I don't think it's a matter of database engine.

To confirm this, I tried doing an HTTP call between two Go services through the private network (region: Oregon). 👀

Here are some sample traces: https://imgur.com/a/QItf6xN

We can see that in some cases, again, we go as low as 2ms (without reusing the connection, so DNS + TCP handshake each time), and in others as high as dozens of ms. According to the traces (from otelhttptrace), the slowdown is in http.dns and/or http.connect.


brody
EMPLOYEE

a year ago

Hmmm, we do cache DNS for a few seconds, so occasionally you will see increased DNS resolution times. Refresh this a few times:

https://utilities.up.railway.app/dns-lookup?value=hello-world.railway.internal&type=ip

But that wouldn't explain why the Postgres and Redis tests see slowdowns, since you connect at the app's start. Unless the driver reconnects every few seconds?


brody
EMPLOYEE

a year ago

Also, apologies for the slow responses, we are currently on winter break.


stan
HOBBYOP

a year ago

Indeed, from your link the DNS resolution time is often pretty high 👀

But as you pointed out, the connections should be reused. Also, from my traces we can see that the http.connect spans show the same behaviour: in one example 0.5ms to establish the connection, in another 30ms. So I wouldn't say it's DNS.

> Also, apologies for the slow responses, we are currently on winter break.

No worries, I am too 🙂


stan
HOBBYOP

a year ago

I can see the same with curl; here the TCP handshake takes anywhere from 6ms up to 127ms:

bash -c 'while true; do curl -w "\ndnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | redirect: %{time_redirect} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download}\n" -s http://test-railway.railway.internal:8080; sleep 1;done'
dnslookup: 0.060784 | connect: 0.062704 | appconnect: 0.000000 | pretransfer: 0.062846 | redirect: 0.000000 | starttransfer: 0.252572 | total: 0.252660 | size: 110

Jan 04 12:50:02

dnslookup: 0.004345 | connect: 0.006257 | appconnect: 0.000000 | pretransfer: 0.006343 | redirect: 0.000000 | starttransfer: 0.181636 | total: 0.181733 | size: 110

Jan 04 12:50:04

dnslookup: 0.125589 | connect: 0.127878 | appconnect: 0.000000 | pretransfer: 0.128008 | redirect: 0.000000 | starttransfer: 0.328557 | total: 0.328668 | size: 110

Jan 04 12:50:05

dnslookup: 0.004370 | connect: 0.006810 | appconnect: 0.000000 | pretransfer: 0.006980 | redirect: 0.000000 | starttransfer: 0.180946 | total: 0.181041 | size: 110

Jan 04 12:50:06

dnslookup: 0.061076 | connect: 0.063473 | appconnect: 0.000000 | pretransfer: 0.063604 | redirect: 0.000000 | starttransfer: 0.238792 | total: 0.238885 | size: 110

stan
HOBBYOP

a year ago

OK, even with connections reused, I have mysterious time lost somewhere: https://i.imgur.com/foxQYeo.png

No DB involved: I have two instances of the following code deployed, and I call /test on one, which calls /fake on the other.

package main

import (
	"context"
	"log"
	"net/http"
	"net/http/httptrace"
	"net/http/httputil"
	"os"
	"time"

	"github.com/getsentry/sentry-go"
	"github.com/gin-gonic/gin"
	"go.opentelemetry.io/contrib/instrumentation/net/http/httptrace/otelhttptrace"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	// Base URL of the peer instance on the private network,
	// e.g. http://peer.railway.internal:8080 (API_URL is a placeholder name).
	api := os.Getenv("API_URL")

	transport := &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second,
	}

	client := http.Client{
		Transport: otelhttp.NewTransport(
			transport,
			otelhttp.WithClientTrace(func(ctx context.Context) *httptrace.ClientTrace {
				return otelhttptrace.NewClientTrace(ctx)
			}),
		),
	}

	r := gin.Default()

	r.GET("/fake", func(c *gin.Context) {
		c.JSON(200, gin.H{"status": "ok"})
	})

	r.GET("/test", func(c *gin.Context) {
		ctx := c.Request.Context()

		req, err := http.NewRequestWithContext(ctx, "GET", api+"/fake", nil)
		if err != nil {
			sentry.CaptureException(err)
			c.JSON(500, gin.H{"error": "Failed to create request"})
			return
		}

		dump, _ := httputil.DumpRequestOut(req, true)
		log.Printf("Request: %s", string(dump))

		resp, err := client.Do(req)
		if err != nil {
			sentry.CaptureException(err)
			c.JSON(500, gin.H{"error": "Failed to send request"})
			return
		}
		defer resp.Body.Close()

		dump, _ = httputil.DumpResponse(resp, true)
		log.Printf("Response: %s", string(dump))

		c.JSON(200, gin.H{"status": resp.Status})
	})

	r.Run(":8080")
}

brody
EMPLOYEE

a year ago

Well, I'm out of ideas. I'll have to raise this with our infra team after winter break, and I'll then get back to you, likely with further questions, so we can all get to the bottom of what's happening here.

I want to thank you for all the in-depth testing you have done, coming to us with traces is something I can't express my appreciation for enough!


stan
HOBBYOP

a year ago

Sure! I'm very curious to find out what's happening as well 🕵️ Unfortunately, I don't have access to lower-level tools to investigate, as I'm constrained by the container layer. But I'd be happy to help. Have a nice winter break ❄️


stan
HOBBYOP

a year ago

👋 Any clue? 🙂


brody
EMPLOYEE

a year ago

I'm so sorry; in all honesty, this had slipped off my plate.

And in further honesty, our infra team unfortunately does not have the cycles to look into one-off reports of this nature at this time; it's all hands on deck over there to bring all the metal regions into GA, plus everything surrounding that (volumes on metal).

So for right now, I will have to punt on this issue. I'm sorry.


Railway
BOT

a year ago

An internal ticket has been attached to this thread. Updates will be relayed in this thread, so keep an eye out!

Status changed to In Progress Railway about 1 year ago


brody
EMPLOYEE

a year ago

I've attached a ticket so that we can track reports from other users, if enough users report the issue we will set aside cycles to look into it.


Status changed to Awaiting User Response Railway about 1 year ago


stan
HOBBYOP

a year ago

No worries! I get it. I wish I could help from the inside, good luck for the GA 🫡


Status changed to Awaiting Railway Response Railway about 1 year ago


stan
HOBBYOP

a year ago

Just for reference (not trying to resurrect the topic, just sharing more data):

These are traces for my app hosted on Railway, and for the same app (same Docker image and DB contents copied) on a self-hosted bare-metal server with k8s: https://imgur.com/a/xQocSQM

Only about 1 request out of 10 has a normal response time (32ms); the rest can be up to 100x slower 🥲 Fortunately this is just a hobby project, but still, something is very wrong. I don't know why I would be the only one affected by this.


stan
HOBBYOP

a year ago

Couldn't this be something about CPU time scheduling? It reminds me of the CPU quotas on Fly that make everything slow as well.


brody
EMPLOYEE

a year ago

You may not be the only one affected, but you are the only one to report such issues. I'm not saying the issue doesn't exist, but it's just not something we can look into given the circumstances here.

But if it is a CPU scheduling issue, or even a noisy neighbor, you may be pleased to know that we will be moving away from a container environment to a VM-based environment that should offer better resource isolation!

