Reliability: Internal networking

calvinbrewerPRO

9 months ago

We are leveraging an ingress application to route traffic based on URI path to the appropriate backend service. Each backend service is not exposed to the public internet, and the ingress uses the <domain>.railway.internal domain to route traffic.

I've followed the recommendation to delay the ingress app's entrypoint script with a 3-second sleep before starting the ingress process so that those internal domains become routable. This works about 50% of the time; otherwise I have to manually restart the ingress app (I assume that just kills the container and lets the orchestrator roll the deployment), which fixes the routing/DNS issue.

Solved

13 Replies

9 months ago

Heya, could you try swapping to our v2 runtime? We fixed the private network init issues on it.

Lmk if you have any concerns or questions


Status changed to Awaiting User Response railway[bot] 9 months ago


calvinbrewerPRO

9 months ago

I've just updated all the apps to the v2 runtime - I'll continue to monitor over the next few days and also remove the sleep from the ingress app. I'll report back after a few deploys have been rolled out. Cheers!


Status changed to Awaiting Railway Response railway[bot] 9 months ago


9 months ago

Perfect! I'll close out the thread for now, but feel free to reopen it if anything comes up.


Status changed to Solved itsrems 9 months ago


Status changed to Awaiting Railway Response calvinbrewer 8 months ago


calvinbrewerPRO

8 months ago

Howdy again - we are seeing intermittent timeouts when connecting upstream to services through the Railway internal network. We've done the following since we last spoke:

  1. Upgraded all projects to runtime v2

  2. Removed the sleep for 3 seconds on the ingress controller start up process

After restarting the ingress controller the issue is immediately resolved.

Here is an example of our nginx error logs from the ingress app when connecting upstream to the IPv6 address:

2024/09/10 21:32:42 [warn] 31#31: *1 upstream server temporarily disabled while connecting to upstream, client: 100.64.0.2, server: ..., request: "GET /... HTTP/1.1", upstream: "http://[fd12:7651:d842::8f:ff40:b2f5]:3001/...", host: "..."

2024/09/10 21:32:42 [error] 31#31: *1 upstream timed out (110: Operation timed out) while connecting to upstream, client: 100.64.0.2, server: ..., request: "GET /... HTTP/1.1", upstream: "http://[fd12:7651:d842::8f:ff40:b2f5]:3001/...", host: "..."

Any thoughts?

It looks like DNS resolves just fine, but the networking layer handling the TCP connections is defunct on initial deployment.


8 months ago

Are you able to share your NGINX config?

I have a strong suspicion that this is due to NGINX caching DNS entries for its lifetime, and that poses an issue because a given service does not have a static internal IP.


Status changed to Awaiting User Response railway[bot] 8 months ago


calvinbrewerPRO

8 months ago

We are using the default nginx conf:

user  nginx;
worker_processes  auto;

error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;


events {
    worker_connections  1024;
}


http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    keepalive_timeout  65;

    #gzip  on;

    include /etc/nginx/conf.d/*.conf;
}

And our default.conf file:

server {
  listen ...;
  server_name ...;

  proxy_set_header Host $host;
  proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  proxy_set_header X-Forwarded-Proto $scheme;

  proxy_http_version 1.1;
  proxy_set_header "Connection" "";

  location /example {
    rewrite /example/(.*) /$1 break;
    proxy_pass http://example-app.railway.internal:3000;
  }

  ...
}

If it's a caching issue, we could override the resolver directive to specify a shorter TTL, e.g. resolver ... valid=10s;, or use the nginx upstream module:

resolver ...;

upstream example {
    server example-app.railway.internal resolve;
}

Status changed to Awaiting Railway Response railway[bot] 8 months ago


8 months ago

Yeah, I'm quite certain that this is a caching / stale DNS issue. Every time you deploy the upstream service for any reason, it will get a new IPv6 address, and that would break this setup.

Even with a resolver [fd12::10] valid=1s; (the DNS resolver you'd need to use) you could still hit a stale DNS value and get the same error, even in conjunction with an upstream directive.

It's my understanding that NGINX does not provide a way to completely disable its DNS cache. If that's true, it would not be a good fit for use on our private network; I would recommend Caddy instead.


Status changed to Awaiting User Response railway[bot] 8 months ago


calvinbrewerPRO

8 months ago

I should be able to leverage a variable-based approach for the proxy_pass directive, which will force DNS resolution.

From https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_pass

Parameter value can contain variables. In this case, if an address is specified as a domain name, the name is searched among the described server groups, and, if not found, is determined using a resolver.

I'll give that a shot, and will look into another solution if it continues to be an issue.
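
Roughly what I have in mind, as a sketch only (reusing the [fd12::10] resolver mentioned above for our existing /example location):

location /example {
    # Railway's private-network resolver with a short TTL
    resolver [fd12::10] valid=1s;

    # Putting the hostname in a variable forces nginx to re-resolve it
    # via the resolver instead of resolving once at startup
    set $example_upstream example-app.railway.internal;

    rewrite /example/(.*) /$1 break;
    proxy_pass http://$example_upstream:3000;
}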

Do you all modify the DNS resolver used by the running containers or just use the localhost resolver?


Status changed to Awaiting Railway Response railway[bot] 8 months ago


8 months ago

[fd12::10] would be the resolver you need to use. It doesn't change regardless of service or project, and it has been the same address since private networking was first introduced, so I'd say it would be okay to hardcode in your config.

Even so, you could end up with stale DNS for up to 1s, because you cannot set the resolver's valid time to 0s.


Status changed to Awaiting User Response railway[bot] 8 months ago


calvinbrewerPRO

8 months ago

Yeah, that's very true. I'll look into using Caddy or another solution if nginx does not pan out for us. Cheers, I appreciate your help.


Status changed to Awaiting Railway Response railway[bot] 8 months ago


Status changed to Solved ray-chen 8 months ago


Status changed to Awaiting Railway Response calvinbrewer 8 months ago


calvinbrewerPRO

8 months ago

We went ahead and switched our ingress over to Caddy and things are working much better. We are still seeing a short window immediately after a microservice deployment where the ingress returns an HTTP 502 error. I've configured Caddy not to cache DNS resolutions for the dynamic upstream directive, so I'm thinking it's more of a container/network runtime behavior causing those short periods of service outage.

How does Railway perform deployments under the hood? Is it similar to K8s rolling deployments for a Service resource?


8 months ago

That's great! I'd also recommend implementing the load balancer settings and passive health checks as shown here - https://github.com/railwayapp-templates/caddy-reverse-proxy/blob/main/Caddyfile
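
For reference, this is roughly the shape of those settings (a rough sketch with a placeholder path and upstream, not the exact contents of that Caddyfile):

:80 {
    handle_path /example/* {
        reverse_proxy http://example-app.railway.internal:3000 {
            # keep retrying the dial for a few seconds while a freshly
            # deployed upstream is still coming up
            lb_try_duration 5s
            lb_try_interval 250ms

            # passive health checks: briefly mark an upstream as
            # unhealthy after a failed request instead of sending
            # more traffic to it
            fail_duration 10s
            unhealthy_status 5xx
        }
    }
}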

Then see if that improves anything.


Status changed to Awaiting User Response railway[bot] 8 months ago


calvinbrewerPRO

8 months ago

Hey Brody - those recommendations did the trick. Appreciate all your help on this; we are unblocked and good to go.


Status changed to Awaiting Railway Response railway[bot] 8 months ago


Status changed to Solved ray-chen 8 months ago