6 months ago
DNSException:
syscall: "getaddrinfo",
errno: 4,
code: "ENOTFOUND"
Following the automatic migration to metal, one of our services (which references a private URL that is also on metal) has begun to fail after a redeploy with the above error.
I've tried redeploying the services, but there seems to be a DNS resolution issue between them; all of them are on metal.
My redis service exposes a REDIS_PRIVATE_URL variable defined as redis://default:${{REDIS_PASSWORD}}@${{RAILWAY_PRIVATE_DOMAIN}}:6379, but when I reference this value from a second service, the evaluated value has redis's RAILWAY_TCP_PROXY_DOMAIN in place of RAILWAY_PRIVATE_DOMAIN. This is not what I expected, and I cannot see any way to override or fix it. I presume this is a migration shim designed to help us, but I need to override it. How can I fix this?
14 Replies
6 months ago
Hi there - I'm seeing the latest deployment as active. Could you please link the failed deployment for us to look into?
Status changed to Awaiting User Response Railway • 6 months ago
6 months ago
The deployment does not fail (does not have a healthcheck) – it cannot properly connect to redis and spews DNSException logs. Example: https://railway.com/project/26efef52-073b-4468-a124-8073e2678b0d/service/44f08ca4-5b7a-4604-8f2e-13288068518f?environmentId=dfcbb9f3-1c03-4ea6-9cf7-b46c63a09d43&id=68ef5c19-33be-499f-b955-b37d271a71ff#deploy
Status changed to Awaiting Railway Response Railway • 6 months ago
6 months ago
Hey there DrMarshall,
I have triggered a new deployment that has gone live. I am wondering if you can repro the DNSException error again. I have a gut feeling about what caused this issue, but I want to see if we can reproduce it to confirm my suspicion.
Thanks,
Angelo
Status changed to Awaiting User Response Railway • 6 months ago
6 months ago
Let's keep any repro'ing to our staging env to prevent customer-impacting interruptions/outages. Our jobs are long-running, so triggering redeploys impacts our end-user wait times.
Repro'ed here in staging: https://railway.com/project/26efef52-073b-4468-a124-8073e2678b0d/service/44f08ca4-5b7a-4604-8f2e-13288068518f?environmentId=de957125-5be9-4819-b8c0-ca4fd4bc6e43
Status changed to Awaiting Railway Response Railway • 6 months ago
6 months ago
In production, I can confirm that updating the REDIS_URL to ${{Redis-tUX2.REDIS_PRIVATE_URL}} and previewing the value (no deploy necessary), it is still pulling the "wrong" value (templating in RAILWAY_TCP_PROXY_DOMAIN instead of RAILWAY_PRIVATE_DOMAIN) in the evaluated value
6 months ago
Curiously, in staging it does not appear the template value is wrong, but still seeing the DNSException – trying to append ?family=1 as perhaps the resolution is using ipv6 now?
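One way to check the IPv6 hypothesis is to resolve the hostname once per address family and compare the results. This is only a minimal sketch, assuming a Node-compatible runtime that provides `node:dns/promises`; `probeFamilies` and `systemLookup` are hypothetical helper names, not part of Railway or any Redis client.

```typescript
import { lookup } from "node:dns/promises";

// A resolver takes a hostname and an address family and returns one address.
type Lookup = (host: string, family: 4 | 6) => Promise<string>;

// Default resolver: the system resolver, restricted to a single family.
const systemLookup: Lookup = async (host, family) =>
  (await lookup(host, { family })).address;

// Try IPv4 and IPv6 separately and report what each one returns.
async function probeFamilies(
  host: string,
  resolve: Lookup = systemLookup,
): Promise<Record<string, string>> {
  const report: Record<string, string> = {};
  for (const family of [4, 6] as const) {
    try {
      report[`IPv${family}`] = await resolve(host, family);
    } catch (err) {
      // Keep the error code (e.g. ENOTFOUND) so the two families can be compared.
      const code = (err as { code?: string }).code ?? "unknown";
      report[`IPv${family}`] = `failed: ${code}`;
    }
  }
  return report;
}
```

If the IPv4 entry fails with ENOTFOUND while the IPv6 entry returns an address, the hostname is probably only reachable over IPv6 from that service.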
6 months ago
Hello!
We're acknowledging your issue and attaching a ticket to this thread.
We don't have an ETA for it, but, our engineering team will take a look and you will be updated as we update the ticket.
Please reply to this thread if you have any questions!
6 months ago
That would likely be it. With that said, I still want to repro on my project to see if I can get the variable to point to the wrong value.
Give me a moment.
Status changed to Awaiting User Response Railway • 6 months ago
6 months ago
So, I can confirm the DNSException issue is resolved by using ?family=6 in the redis connection string (it seems like there is an ipv6 resolution happening under the hood?), but the mismatched variable certainly is a bug of sorts and made debugging the ipv6 resolution MUCH harder than it needed to be.
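For anyone applying the same workaround in code rather than by editing the variable, here is a rough sketch of appending the parameter. It assumes the Redis client reads a `family` option from the URL query string, as reported above; `withFamily` is a hypothetical helper and the example hostname is made up.

```typescript
// Hypothetical helper: return the connection URL with `family` set in the query string.
// 4 = IPv4 only, 6 = IPv6 only. Whether the client honors this parameter depends on
// the client library, so check its documentation.
function withFamily(redisUrl: string, family: 4 | 6): string {
  const url = new URL(redisUrl);
  // set() replaces an existing `family` value instead of adding a duplicate.
  url.searchParams.set("family", String(family));
  return url.toString();
}

// Example with a made-up private hostname:
const patched = withFamily("redis://default:secret@redis.railway.internal:6379", 6);
console.log(patched);
```

Going through the URL parser rather than string concatenation avoids producing a second `?` when the URL already carries query parameters.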
Status changed to Awaiting Railway Response Railway • 6 months ago
6 months ago
Previewing the value is giving me a `.internal` value. Is that the case on your end now, after you re-deployed? I am just very confused about how this could happen.
(If you can get a screenshot that would help too)
---
Re: IPv6, noted, yea, I am more concerned about the domain swap if there is one.
Status changed to Awaiting User Response Railway • 6 months ago
6 months ago
In production, after updating the Redis.REDIS_PRIVATE_URL, it seems like the downstream values have updated to .internal (no more mismatch).
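A quick startup check can catch this kind of mismatch before it turns into connection errors. The sketch below assumes private hostnames end in `.internal`, as seen in the previewed value; `isPrivateHost` is a hypothetical helper and both example URLs are made up.

```typescript
// Hypothetical helper: true when the URL's host ends with the expected private suffix.
// The ".internal" default is an assumption taken from the value previewed in this thread.
function isPrivateHost(connectionUrl: string, suffix: string = ".internal"): boolean {
  const { hostname } = new URL(connectionUrl);
  return hostname.toLowerCase().endsWith(suffix);
}

// Made-up examples: a private-network URL and a public proxy style URL.
console.log(isPrivateHost("redis://default:secret@redis.railway.internal:6379"));
console.log(isPrivateHost("redis://default:secret@proxy.example.net:12345"));
```

Logging a warning when this returns false for a variable that is meant to be private would have pointed at the reference mismatch right away.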
Status changed to Awaiting Railway Response Railway • 6 months ago
Status changed to Solved drmarshall • 6 months ago
6 months ago
✅ The internal ticket "Service pulling the incorrect reference variable" has been marked as completed.
6 months ago
I think the variable mismatch was a UI/caching issue. I'll continue to look for any related problems, but the connection errors seem to be because of Redis and ipv6. Sorry for all the confusion here!
Status changed to Awaiting Railway Response Railway • 6 months ago
6 months ago
Can you confirm that the issue is resolved after appending the ?family=6 value to the end of the URL?
Status changed to Awaiting User Response Railway • 6 months ago
Status changed to Solved drmarshall • 5 months ago
