Random database slowness
stan
HOBBYOP

a year ago

See Sentry traces: https://imgur.com/a/gtEQALx

For the same request, the database calls are suddenly an order of magnitude slower: from 1-5ms to 15-80ms.

All the services in my project are in the same region, us-west (Oregon, non-metal), and I'm using the private network between my web service and my Postgres service.

What could be the reason?

26 Replies

adam
MODERATOR

a year ago

Can you please verify that your services are in fact in the same region? Latency like this should only be seen when services are in different regions.


stan
HOBBYOP

a year ago

If you mean double-check in the UI, then yes, I have verified that they are in the same region (us-west, Oregon, non-metal) 🙂 When the web service is running on us-west (California/metal), the DB latency is worse.

Here it's pretty random, but sometimes I am able to get ~1ms latency to the DB (cf. the Sentry traces), which indicates that they are close.


stan
HOBBYOP

a year ago

I can observe the same with Redis: sometimes sub-ms response times, sometimes more than 10ms. So I would rule out a Postgres-specific issue.

I can see the variance not only across requests but also within a single request. It looks like some network or resource (e.g. CPU) issue. 🤔


stan
HOBBYOP

a year ago

For the same endpoint, queried every 3s by curl (no other concurrent requests), which does one Redis GET, one Postgres SELECT, and one Redis PUT, I can see response times ranging from 10ms to 100ms+ (up to 300ms): https://i.imgur.com/dNBLe6Y.png


stan
HOBBYOP

a year ago

This might be anecdotal, but by querying continuously (while true; curl), the slowness seems somewhat periodic, every second or so: https://i.imgur.com/27cvZ37.png


stan
HOBBYOP

a year ago

Here are 3 traces: https://imgur.com/a/DogtacM

  • a very fast one (30ms)

  • a very slow one (300ms)

  • one that is slow in the first half and fast in the second half

This is a Laravel app, FYI.

I tried both caddy+frankenphp+octane and a more classic nginx+php-fpm; I see very bad response times on both (terrible with php-fpm).

And considering this happens with both Redis and Postgres (and with the processing of the request itself), with no other traffic to the app, I think this is an issue with Railway 😢

Any clue?


brody
EMPLOYEE

a year ago

Hello,

I've checked the hosts your web service, database, and Redis are running on; there are no noisy neighbors in terms of CPU or network.

So I don't think it's an issue on our side at this time, also given that no one else has reported such issues.

Would it be possible to ask you to put together a minimal reproducible example in something other than PHP, to rule out any issues with how you have PHP set up? If this is indeed an issue on our side, it would be reproducible with other languages too.


stan
HOBBYOP

a year ago

Thanks for the feedback 🫡 I was wondering the same!

I launched a quick Go PoC (code: https://gist.github.com/angristan/9139b26a390a9ae236e5fe711ae58ec7#file-main-go)

I can see the exact same behaviour, with overall response times ranging from 1ms to 100ms.

Sample traces: https://imgur.com/a/Emd175W

I don't want to blindly blame Railway, but I'm out of ideas at this point 🙂


brody
EMPLOYEE

a year ago

Very compelling evidence against us; thank you for the MRE. Long shot, but can you try redeploying the database?


stan
HOBBYOP

a year ago

I just did, and I can see the same behaviour with a brand-new Postgres service.


brody
EMPLOYEE

a year ago

Would it be too much to ask you to try with Postgres 16 instead?


stan
HOBBYOP

a year ago

Sure, I did, and the behaviour is the same. But since I see the exact same behaviour with Redis, I don't think it's a matter of database engine.

To confirm this, I tried doing an HTTP call between two Go services through the private network (region: Oregon). 👀

Here are some sample traces: https://imgur.com/a/QItf6xN

We can see that in some cases, again, we go as low as 2ms (without reusing the connection, so DNS + TCP handshake each time), and in others as high as dozens of ms. According to the traces (from otelhttptrace), the slowdown is in http.dns and/or http.connect.


brody
EMPLOYEE

a year ago

Hmmm, we do cache DNS for a few seconds, so occasionally you will see increased DNS resolution times. Refresh this a few times:

https://utilities.up.railway.app/dns-lookup?value=hello-world.railway.internal&type=ip

But that wouldn't explain why the Postgres and Redis tests see slowdowns, since you connect at the app's start. Unless the driver reconnects every few seconds?


brody
EMPLOYEE

a year ago

Also, apologies for the slow responses, we are currently on winter break.


stan
HOBBYOP

a year ago

Indeed, from your link the DNS resolution time is often pretty high 👀

But as you pointed out, the connections should be reused. Also, from my traces we can see that the http.connect spans show the same behaviour: in one example 0.5ms to establish the connection, in another 30ms. So I wouldn't say it's DNS.

> Also, apologies for the slow responses, we are currently on winter break.

No worries, I am too 🙂


stan
HOBBYOP

a year ago

I can see the same with curl; here the TCP handshake takes anywhere from 6ms up to 127ms:

bash -c 'while true; do curl -w "\ndnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | redirect: %{time_redirect} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download}\n" -s http://test-railway.railway.internal:8080; sleep 1;done'
dnslookup: 0.060784 | connect: 0.062704 | appconnect: 0.000000 | pretransfer: 0.062846 | redirect: 0.000000 | starttransfer: 0.252572 | total: 0.252660 | size: 110

Jan 04 12:50:02

dnslookup: 0.004345 | connect: 0.006257 | appconnect: 0.000000 | pretransfer: 0.006343 | redirect: 0.000000 | starttransfer: 0.181636 | total: 0.181733 | size: 110

Jan 04 12:50:04

dnslookup: 0.125589 | connect: 0.127878 | appconnect: 0.000000 | pretransfer: 0.128008 | redirect: 0.000000 | starttransfer: 0.328557 | total: 0.328668 | size: 110

Jan 04 12:50:05

dnslookup: 0.004370 | connect: 0.006810 | appconnect: 0.000000 | pretransfer: 0.006980 | redirect: 0.000000 | starttransfer: 0.180946 | total: 0.181041 | size: 110

Jan 04 12:50:06

dnslookup: 0.061076 | connect: 0.063473 | appconnect: 0.000000 | pretransfer: 0.063604 | redirect: 0.000000 | starttransfer: 0.238792 | total: 0.238885 | size: 110

stan
HOBBYOP

a year ago

OK, even with connections reused, I have mysterious time lost somewhere: https://i.imgur.com/foxQYeo.png

No DB involved: I have two instances of the following code deployed, and I call /test on one, which calls /fake on the other.

package main

import (
	"context"
	"log"
	"net/http"
	"net/http/httptrace"
	"net/http/httputil"
	"os"
	"time"

	"github.com/getsentry/sentry-go"
	"github.com/gin-gonic/gin"
	"go.opentelemetry.io/contrib/instrumentation/net/http/httptrace/otelhttptrace"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	// Base URL of the peer instance on the private network,
	// e.g. http://peer.railway.internal:8080 (API_URL is a placeholder name).
	api := os.Getenv("API_URL")

	transport := &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second,
	}

	client := http.Client{
		Transport: otelhttp.NewTransport(
			transport,
			otelhttp.WithClientTrace(func(ctx context.Context) *httptrace.ClientTrace {
				return otelhttptrace.NewClientTrace(ctx)
			}),
		),
	}

	r := gin.Default()

	r.GET("/fake", func(c *gin.Context) {
		c.JSON(200, gin.H{"status": "ok"})
	})

	r.GET("/test", func(c *gin.Context) {
		ctx := c.Request.Context()

		req, err := http.NewRequestWithContext(ctx, "GET", api+"/fake", nil)
		if err != nil {
			sentry.CaptureException(err)
			c.JSON(500, gin.H{"error": "Failed to create request"})
			return
		}

		dump, _ := httputil.DumpRequestOut(req, true)
		log.Printf("Request: %s", string(dump))

		resp, err := client.Do(req)
		if err != nil {
			sentry.CaptureException(err)
			c.JSON(500, gin.H{"error": "Failed to send request"})
			return
		}
		defer resp.Body.Close()

		dump, _ = httputil.DumpResponse(resp, true)
		log.Printf("Response: %s", string(dump))

		c.JSON(200, gin.H{"status": resp.Status})
	})

	r.Run(":8080")
}

brody
EMPLOYEE

a year ago

Well, I'm out of ideas. I'll have to raise this with our infra team after winter break, and I'll then get back to you, likely with further questions, so we can all get to the bottom of what's happening here.

I want to thank you for all the in-depth testing you have done, coming to us with traces is something I can't express my appreciation for enough!


stan
HOBBYOP

a year ago

Sure! I'm very curious to find out what's happening as well 🕵️ Unfortunately, I don't have access to lower-level tools to investigate, as I'm constrained by the container layer. But I'd be happy to help. Have a nice winter break ❄️


stan
HOBBYOP

a year ago

👋 Any clue? 🙂


brody
EMPLOYEE

a year ago

I'm so sorry; in all honesty, this had slipped off my plate.

And in further honesty, our infra team unfortunately does not have the cycles to look into one-off reports of this nature at this time; it's all hands on deck over there to bring all the metal regions into GA, plus everything surrounding that (volumes on metal).

So for right now, I will have to punt on this issue. I'm sorry.


Railway
BOT

a year ago

An internal ticket has been attached to this thread. Updates will be relayed in this thread, so keep an eye out!

Status changed to In Progress Railway about 1 year ago


brody
EMPLOYEE

a year ago

I've attached a ticket so that we can track reports from other users, if enough users report the issue we will set aside cycles to look into it.


Status changed to Awaiting User Response Railway about 1 year ago


stan
HOBBYOP

a year ago

No worries! I get it. I wish I could help from the inside, good luck for the GA 🫡


Status changed to Awaiting Railway Response Railway about 1 year ago


stan
HOBBYOP

a year ago

Just for reference (not trying to resurrect the topic, just sharing more data):

These are traces for my app hosted on Railway, and for the same app (same Docker image and DB contents copied) on a self-hosted bare-metal server with k8s: https://imgur.com/a/xQocSQM

Only about 1 request out of 10 has a normal response time (32ms); the rest can be up to 100x slower 🥲 Fortunately this is just a hobby project, but still, something is very wrong. I don't know why I would be the only one affected by this.


stan
HOBBYOP

a year ago

Couldn't this be something about CPU time scheduling? It reminds me of the CPU quotas on Fly that make everything slow as well.


brody
EMPLOYEE

a year ago

You may not be the only one affected, but you are the only one to report such issues. I'm not saying the issue doesn't exist, but it's just not something we can look into given the circumstances here.

But if it is a CPU scheduling issue, or even a noisy neighbor, you may be pleased to know that we will be moving away from a container environment to a VM-based environment that should offer better resource isolation!

