a year ago
See Sentry traces: https://imgur.com/a/gtEQALx
For the same request, the database calls are suddenly an order of magnitude slower, from 1-5ms to 15-80ms.
All the services in my project are in the same region, us-west (oregon, non metal). I'm using the private network between my web service and my Postgres service.
What could be the reason?
26 Replies
a year ago
Can you please verify that your services are in fact in the same region? This kind of latency should only be seen when services are in different regions.
a year ago
If you mean double check in the UI, then yes I have verified that they are in the same region (us-west, oregon, non metal)
When the web service is running on us-west (california/metal), the db latency is worse.
Here, it's pretty random, but sometimes I am able to get ~1ms latency to the DB (cf sentry traces), which indicates that they are close.
a year ago
I can observe the same with Redis: sometimes sub-ms response time, sometimes more than 10ms. So that would make me rule out a Postgres-specific issue.
I can see the variance not only across requests but within a single request as well. It looks like some network or resource (e.g. CPU) issue.
a year ago
For the same endpoint, queried via curl every 3s (no other concurrent requests), which does 1 Redis GET, 1 Postgres SELECT and 1 Redis SET, I can see response times ranging from 10ms to 100ms+ (up to 300ms): https://i.imgur.com/dNBLe6Y.png
a year ago
This might be anecdotal, but by querying continuously (while true; curl) the slowness seems somewhat periodic, recurring every second or so: https://i.imgur.com/27cvZ37.png
a year ago
Here are 3 traces: https://imgur.com/a/DogtacM
a very fast one (30ms)
a very slow one (300ms)
one that is slow in the first half and fast in the second half
This is a laravel app FYI.
I tried both with caddy+frankenphp+octane and a more classic nginx+php-fpm; I see very bad response times on both (terrible with php-fpm).
And considering this happens with both redis and pgsql (and the processing of the req itself), with no other traffic to the app, I think this is an issue with Railway 
Any clue?
a year ago
Hello,
I've checked the hosts your web service, database, and Redis are running on; there are no noisy neighbors in terms of CPU or network.
So I don't think it's an issue on our side at this time, especially given that no one else has reported such issues.
Would it be possible to ask you to put together a minimal reproducible example in something other than PHP, so as to rule out any issues with how you have PHP set up? If this is indeed an issue on our side, it would be reproducible with other languages too.
a year ago
Thanks for the feedback 🫡 I was wondering the same!
I launched a quick Go PoC (code: https://gist.github.com/angristan/9139b26a390a9ae236e5fe711ae58ec7#file-main-go)
I can see the exact same behaviour, overall response times ranging from 1ms to 100ms
Sample traces: https://imgur.com/a/Emd175W
I don't want to blindly blame Railway but I'm out of ideas at this point 
a year ago
Very compelling evidence against us, thank you for the MRE. Long shot, but can you try redeploying the database?
a year ago
I just did, I can see the same behaviour with a brand new pgsql service
a year ago
Sure, I did and the behaviour is the same. But since I see the exact same behaviour with redis, I don't think it's a matter of database engine.
To confirm this, I tried doing an HTTP call between two Go services through the private network (region: oregon). 
Here are some sample traces: https://imgur.com/a/QItf6xN
We can see that in some cases, again, it goes as low as 2ms (without reusing the connection, so DNS+TCP again), and as high as dozens of ms. The slowdown is in http.dns and/or http.connect according to the traces (from otelhttptrace).
a year ago
Hmmm, I know we have DNS cached for a few seconds, so occasionally you will see increased DNS resolution times, refresh this a few times -
https://utilities.up.railway.app/dns-lookup?value=hello-world.railway.internal&type=ip
But that wouldn't explain why the Postgres and Redis tests have slowdowns since you connect at the app's start unless the driver reconnects every few seconds?
a year ago
Indeed from your link the DNS resolution time is often pretty high 
But as you pointed out, the connections should be reused. Also, from my traces we can see that the http.connect spans show the same behaviour: in one example, 0.5ms to establish the connection; in another, 30ms. So I wouldn't say it's DNS.
Also, apologies for the slow responses, we are currently on winter break.
No worries, I am too 
a year ago
I can see the same with curl; here the TCP handshake takes anywhere from 6ms up to 127ms:
bash -c 'while true; do curl -w "\ndnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | redirect: %{time_redirect} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download}\n" -s http://test-railway.railway.internal:8080; sleep 1; done'
dnslookup: 0.060784 | connect: 0.062704 | appconnect: 0.000000 | pretransfer: 0.062846 | redirect: 0.000000 | starttransfer: 0.252572 | total: 0.252660 | size: 110
Jan 04 12:50:02
dnslookup: 0.004345 | connect: 0.006257 | appconnect: 0.000000 | pretransfer: 0.006343 | redirect: 0.000000 | starttransfer: 0.181636 | total: 0.181733 | size: 110
Jan 04 12:50:04
dnslookup: 0.125589 | connect: 0.127878 | appconnect: 0.000000 | pretransfer: 0.128008 | redirect: 0.000000 | starttransfer: 0.328557 | total: 0.328668 | size: 110
Jan 04 12:50:05
dnslookup: 0.004370 | connect: 0.006810 | appconnect: 0.000000 | pretransfer: 0.006980 | redirect: 0.000000 | starttransfer: 0.180946 | total: 0.181041 | size: 110
Jan 04 12:50:06
dnslookup: 0.061076 | connect: 0.063473 | appconnect: 0.000000 | pretransfer: 0.063604 | redirect: 0.000000 | starttransfer: 0.238792 | total: 0.238885 | size: 110
a year ago
OK, even with connections reused I have mysterious time lost somewhere: https://i.imgur.com/foxQYeo.png
No db involved, I have two instances of the following code deployed, I call /test on one which calls /fake on the other
// Excerpt of the test app; "api" holds the peer service's internal URL
// and "r" is the gin router, both defined elsewhere in the full program.
transport := &http.Transport{
    MaxIdleConns:        100,
    MaxIdleConnsPerHost: 100,
    IdleConnTimeout:     90 * time.Second,
}
// otelhttp/otelhttptrace instrument each outgoing request so DNS,
// connect, and time-to-first-byte show up as spans in the traces.
client := http.Client{
    Transport: otelhttp.NewTransport(
        transport,
        otelhttp.WithClientTrace(func(ctx context.Context) *httptrace.ClientTrace {
            return otelhttptrace.NewClientTrace(ctx)
        }),
    ),
}

// Target endpoint: does nothing but answer immediately.
r.GET("/fake", func(c *gin.Context) {
    c.JSON(200, gin.H{"status": "ok"})
})

// Probe endpoint: calls /fake on the other instance over the private network.
r.GET("/test", func(c *gin.Context) {
    ctx := c.Request.Context()
    req, err := http.NewRequestWithContext(ctx, "GET", api+"/fake", nil)
    if err != nil {
        sentry.CaptureException(err)
        c.JSON(500, gin.H{"error": "Failed to create request"})
        return
    }
    dump, _ := httputil.DumpRequestOut(req, true)
    log.Printf("Request: %s", string(dump))
    resp, err := client.Do(req)
    if err != nil {
        sentry.CaptureException(err)
        c.JSON(500, gin.H{"error": "Failed to send request"})
        return
    }
    defer resp.Body.Close()
    dump, _ = httputil.DumpResponse(resp, true)
    log.Printf("Response: %s", string(dump))
    c.JSON(200, gin.H{"status": resp.Status})
})
a year ago
Well, I'm out of ideas. I'll have to raise this with our infra team after winter break; I'll then get back to you, likely with further questions, so we can all get to the bottom of what's happening here.
I want to thank you for all the in-depth testing you have done, coming to us with traces is something I can't express my appreciation for enough!
a year ago
Sure! I'm very curious to find out what's happening as well
Unfortunately, I don't have access to lower-level tools to investigate as I'm constrained by the container layer. But I'd be happy to help. Have a nice winter break 
a year ago
Any clue? 
a year ago
I'm so sorry, in all honesty, this had slipped off my plate.
And in further honesty, our infra team unfortunately does not have the cycles to look into one-off reports of this nature at this time; it's all hands on deck over there to bring all the metal regions into GA, and everything surrounding that (volumes on metal).
So for right now, I will have to punt on this issue. I'm sorry.
a year ago
An internal ticket has been attached to this thread. Updates will be relayed in this thread, so keep an eye out!
Status changed to In Progress Railway • about 1 year ago
a year ago
I've attached a ticket so that we can track reports from other users, if enough users report the issue we will set aside cycles to look into it.
Status changed to Awaiting User Response Railway • about 1 year ago
a year ago
No worries! I get it. I wish I could help from the inside, good luck for the GA 🫡
Status changed to Awaiting Railway Response Railway • about 1 year ago
a year ago
Just for reference (not trying to resurrect the topic, just sharing more data):
these are traces for my app hosted on Railway and the same app (same Docker image, with the DB contents copied over) on a self-hosted bare-metal server running k8s: https://imgur.com/a/xQocSQM
Only about 1 out of 10 requests has a normal response time (32ms); the rest can be up to 100x slower.
Fortunately this is just a hobby project, but still, there is something very wrong. I don't know why I would be the only one affected by this.
a year ago
Isn't this something about CPU time scheduling? This reminds me of the CPU quotas on Fly that make everything slow as well
a year ago
You may not be the only one affected, but you are the only one to report such issues. I'm not saying the issue doesn't exist, but it's just not something we can look into given the circumstances here.
But if it is a CPU scheduling issue, or even a noisy neighbor, you may be pleased to know that we will be moving away from a container environment to a VM-based environment that should offer better resource isolation!


