9 months ago
We are using an ingress application to route traffic to the appropriate backend service based on URI path. The backend services are not exposed to the public internet; the ingress reaches them via their <domain>.railway.internal domains.
Following the recommendation to delay startup, the ingress app's entrypoint script sleeps for 3 seconds before starting the ingress process so that those internal domains have time to become routable. This works about 50% of the time; the rest of the time I have to manually restart the ingress app (I assume that just kills the container and lets the orchestrator roll the deployment), which fixes the routing/DNS issue.
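For reference, the delay amounts to something like the following in the entrypoint (simplified sketch, not the exact script):

    #!/bin/sh
    # Simplified sketch of the workaround described above: give the
    # *.railway.internal names a few seconds to become routable, then
    # start the ingress process (NGINX in our case).
    sleep 3
    exec nginx -g 'daemon off;'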
13 Replies
9 months ago
Heya, could you try swapping to our v2 runtime? We fixed the private network init issues on it.
Lmk if you have any concerns or questions
Status changed to Awaiting User Response railway[bot] • 9 months ago
9 months ago
I've just updated all the apps to the v2 runtime. I'll continue to monitor over the next few days and also remove the sleep for the ingress app. I'll report back after a few deploys have been rolled out. Cheers!
Status changed to Awaiting Railway Response railway[bot] • 9 months ago
9 months ago
Perfect! I'll close out the thread for now, but feel free to reopen it if anything comes up.
Status changed to Solved itsrems • 9 months ago
Status changed to Awaiting Railway Response calvinbrewer • 8 months ago
8 months ago
Howdy again - we are seeing intermittent timeouts when connecting upstream to the services through the Railway internal network. We've done the following since we last spoke:
Upgraded all projects to runtime v2
Removed the sleep for 3 seconds on the ingress controller start up process
After restarting the ingress controller, the issue is immediately resolved.
Here is an example of our nginx error logs from the ingress app when connecting upstream to the IPv6 address:
2024/09/10 21:32:42 [warn] 31#31: *1 upstream server temporarily disabled while connecting to upstream, client: 100.64.0.2, server: ..., request: "GET /... HTTP/1.1", upstream: "http://[fd12:7651:d842::8f:ff40:b2f5]:3001/...", host: "..."
2024/09/10 21:32:42 [error] 31#31: *1 upstream timed out (110: Operation timed out) while connecting to upstream, client: 100.64.0.2, server: ..., request: "GET /... HTTP/1.1", upstream: "http://[fd12:7651:d842::8f:ff40:b2f5]:3001/...", host: "..."
Any thoughts?
It looks like DNS resolves just fine, but the networking layer handling the TCP connections is defunct on the initial deployment.
8 months ago
Are you able to share your NGINX config?
I have a strong suspicion that this is due to NGINX caching DNS entries for its lifetime, and that poses an issue because a given service does not have a static internal IP.
Status changed to Awaiting User Response railway[bot] • 8 months ago
8 months ago
We are using the default nginx conf:
user nginx;
worker_processes auto;

error_log /var/log/nginx/error.log notice;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    #tcp_nopush on;

    keepalive_timeout 65;

    #gzip on;

    include /etc/nginx/conf.d/*.conf;
}
And our default.conf file:
server {
    listen ...;
    server_name ...;

    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_http_version 1.1;
    proxy_set_header "Connection" "";

    location /example {
        rewrite /example/(.*) /$1 break;
        proxy_pass http://example-app.railway.internal:3000;
    }

    ...
}
If it's a caching issue, we could override the resolver directive to specify a shorter TTL, e.g. resolver ... valid=10s;, or use the nginx upstream module:

resolver ...;

upstream example {
    server example-app.railway.internal resolve;
}
Status changed to Awaiting Railway Response railway[bot] • 8 months ago
8 months ago
Yeah, I'm quite certain this is a caching / stale DNS issue. Every time you deploy an upstream service, for any reason, it gets a new IPv6 address, and that breaks this setup.
Even with resolver [fd12::10] valid=1s; ([fd12::10] being the DNS resolver you'd need to use), you could still hit a stale DNS value and get the same error, even in conjunction with an upstream directive.
It's my understanding that NGINX does not provide a way to completely disable its DNS cache; if that's true, it's not a good fit for our private network, and I would recommend Caddy instead.
Status changed to Awaiting User Response railway[bot] • 8 months ago
8 months ago
I should be able to use a variable-based approach for the proxy_pass directive, which forces NGINX to resolve DNS at request time.
From https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_pass
Parameter value can contain variables. In this case, if an address is specified as a domain name, the name is searched among the described server groups, and, if not found, is determined using a resolver.
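Something like this untested sketch is what I have in mind, reusing the example-app location from our config and the [fd12::10] resolver mentioned above:

    resolver [fd12::10] valid=1s;

    server {
        listen ...;

        location /example {
            rewrite /example/(.*) /$1 break;
            # Holding the upstream in a variable makes NGINX resolve the name
            # at request time via the resolver above, instead of caching the
            # address for the lifetime of the worker process.
            set $example_upstream example-app.railway.internal;
            proxy_pass http://$example_upstream:3000;
        }
    }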
I'll give that a shot, and will look into another solution if it continues to be an issue.
Do you all modify the DNS resolver used by the running containers or just use the localhost resolver?
Status changed to Awaiting Railway Response railway[bot] • 8 months ago
8 months ago
[fd12::10] would be the resolver you need to use. It doesn't change between services or projects, and it has been the same address since private networking was first introduced, so I'd say it's okay to hardcode in your config.
Even so, you could still end up with DNS that is stale for up to 1s, because you cannot set the resolver's cache validity to 0s.
Status changed to Awaiting User Response railway[bot] • 8 months ago
8 months ago
Yeah, that's very true. I'll look into using Caddy or another solution if nginx does not pan out for us. Cheers, I appreciate your help.
Status changed to Awaiting Railway Response railway[bot] • 8 months ago
Status changed to Solved ray-chen • 8 months ago
Status changed to Awaiting Railway Response calvinbrewer • 8 months ago
8 months ago
We went ahead and switched our ingress over to Caddy and things are working much better. We are still experiencing a short window after a microservice deployment where the ingress returns an HTTP 502 error. I've configured Caddy not to cache DNS resolutions for the dynamic upstream directive, so I'm thinking it's more of a container/network runtime behavior causing those short periods of service outage.
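For context, the dynamic upstream configuration is roughly along these lines (simplified sketch; the listen port and refresh interval here are illustrative, not our exact values):

    # Listen port assumed to come from the PORT environment variable.
    :{$PORT} {
        reverse_proxy {
            # Re-resolve the internal hostname frequently instead of caching it.
            dynamic a {
                name    example-app.railway.internal
                port    3000
                refresh 1s
            }
        }
    }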
How does Railway perform deployments under the hood? Is it similar to K8s rolling deployments for a Service resource?
8 months ago
That's great! I'd also recommend implementing the load balancer settings and passive health checks shown here - https://github.com/railwayapp-templates/caddy-reverse-proxy/blob/main/Caddyfile
Then see if that improves anything.
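For reference, those settings look roughly like this inside the reverse_proxy block (illustrative sketch using the example-app upstream as a static address for brevity; the linked Caddyfile has the exact directives and values):

    reverse_proxy example-app.railway.internal:3000 {
        # Load balancer retries: keep retrying for a short window instead of
        # immediately returning an error to the client.
        lb_try_duration 5s
        lb_try_interval 250ms

        # Passive health checks: temporarily mark an upstream unhealthy after
        # failed requests so traffic backs off while it recovers.
        fail_duration 10s
        unhealthy_status 5xx
    }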
Status changed to Awaiting User Response railway[bot] • 8 months ago
8 months ago
Hey Brody - those recommendations did the trick. Appreciate all your help on this; we are unblocked and good to go.
Status changed to Awaiting Railway Response railway[bot] • 8 months ago
Status changed to Solved ray-chen • 8 months ago