Custom Domain Edge Routing Failure
mcintoshjames-sketch
PRO OP

21 days ago

Summary

All three custom domains on a newly created staging environment are stuck at CERTIFICATE_STATUS_TYPE_VALIDATING_OWNERSHIP and return HTTP 404 "Application not found" from the Railway edge, despite correct service-side configuration. The issue appears to be an edge ingress config sync failure for this specific environment.

  • Affected Services: api, worker, web (plus guardrail-proxy, rspamd, Postgres-RTuS, Redis-OAIR)

Affected Custom Domains

  1. api-staging.replylayer.ai → service api (ID: 4b1e2ea0-b861-4f2e-9c05-1b40fd092886)

    • Custom domain ID: 5d1d2be1-349e-4277-9db0-e30db1a3aea5

    • Proxy target: 2t5mwcot.up.railway.app

  2. hooks-staging.replylayer.ai → service worker (ID: 987a0b5d-0b1c-4bf9-af76-f59f3850fdfb)

    • Custom domain ID: 5583cfa5-d631-49cd-befe-8805fbdcb92c

    • Proxy target: e30wajt1.up.railway.app

  3. app-staging.replylayer.ai → service web (ID: 39348200-0541-4adc-b0c8-1e1b2f2f6584)

    • Custom domain ID: 8351cdc6-919b-40b9-a510-3a997b846d89

    • Proxy target: hpmg6vkv.up.railway.app

What Works (Service-Side State is Correct)

  • ✅ All three services show status=SUCCESS with active deployments (timestamps 2026-04-23T14:34:06Z)

  • ✅ Native Railway domains respond with HTTP 200 and healthy payloads

  • ✅ GraphQL customDomain query confirms each domain is correctly bound:

    • serviceId, environmentId, and targetPort all populated

    • deletedAt: null (not soft-deleted)

  • ✅ Service networking config includes custom domains in serviceDomains:

    • api: {"api-staging-f5fa.up.railway.app": {}, "api-staging.replylayer.ai": {}}

    • worker: {"worker-staging-54e6.up.railway.app": {}, "hooks-staging.replylayer.ai": {}}

    • web: {"web-staging-3e6a.up.railway.app": {}, "app-staging.replylayer.ai": {}}

  • ✅ Cloudflare CNAMEs are in place, unproxied (orange-cloud off), and fully propagated

    • DNS status: DNS_RECORD_STATUS_PROPAGATED per Railway API
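
The serviceDomains check above can be repeated mechanically after each config change. This is a minimal sketch, assuming the serviceDomains maps have already been fetched and pasted in as plain dictionaries; the helper name `missing_custom_domains` and the `EXPECTED` mapping are hypothetical and not part of any Railway API.

```python
# Snapshot of the serviceDomains values quoted in this post.
SERVICE_DOMAINS = {
    "api": {"api-staging-f5fa.up.railway.app": {}, "api-staging.replylayer.ai": {}},
    "worker": {"worker-staging-54e6.up.railway.app": {}, "hooks-staging.replylayer.ai": {}},
    "web": {"web-staging-3e6a.up.railway.app": {}, "app-staging.replylayer.ai": {}},
}

# Hypothetical mapping of service name to the custom domain it should carry.
EXPECTED = {
    "api": "api-staging.replylayer.ai",
    "worker": "hooks-staging.replylayer.ai",
    "web": "app-staging.replylayer.ai",
}


def missing_custom_domains(service_domains, expected):
    """Return the services whose expected custom domain is absent from serviceDomains."""
    return sorted(
        service
        for service, domain in expected.items()
        if domain not in service_domains.get(service, {})
    )
```

An empty list from `missing_custom_domains(SERVICE_DOMAINS, EXPECTED)` only confirms the service-side config; it says nothing about what the edge has loaded.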

What Doesn't Work (Edge-Side Failure)

  • ❌ All three custom domains return HTTP 404 from server: railway-edge:

    {"status":"error","code":404,"message":"Application not found","request_id":"..."}
    
  • ❌ The proxy targets also return 404:

    curl -sS https://2t5mwcot.up.railway.app/v1/health
    # → HTTP 404 Application not found
    
  • ❌ Certificate validation is stuck at CERTIFICATE_STATUS_TYPE_VALIDATING_OWNERSHIP (45+ minutes, no progress)

    • certificateErrorMessage: null (no error signal from API)

    • certificateRetryable: null

    • This is expected: Let's Encrypt's HTTP-01 challenge at /.well-known/acme-challenge/* cannot complete because the edge returns 404 for the domain
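
To separate an edge-level 404 from a 404 produced by the application itself, the response can be matched against the two signals quoted above: the `server: railway-edge` header and the "Application not found" message in the JSON body. The following is a rough sketch under that assumption; `is_edge_404` is a hypothetical helper, and the edge's response format is inferred only from the sample in this post.

```python
import json


def is_edge_404(status, headers, body):
    """Return True when a response looks like the edge's own 404, not the app's.

    Assumes the edge signature seen in this thread: status 404, a Server
    header of "railway-edge", and a JSON body whose message is
    "Application not found".
    """
    if status != 404:
        return False
    # Header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    if normalised.get("server", "").strip().lower() != "railway-edge":
        return False
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        # Not JSON at all, so it does not match the edge's error format.
        return False
    return isinstance(payload, dict) and payload.get("message") == "Application not found"
```

A 404 that fails this check most likely came from the service, which would point away from the edge routing theory.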

Request IDs for Edge Tracing

Latest attempts (after service restarts):

  • api-staging.replylayer.ai: NHjuGC42RyGuzK1Ic9o55Q

  • hooks-staging.replylayer.ai: zAu4bF7CRpq1TYBlBhdwDg

Earlier attempts (before restarts):

  • api-staging.replylayer.ai: 07No5ulSQsiPLyF7woOzXw

  • hooks-staging.replylayer.ai: Pq1X8piWTISWhDgLaP71AA

Troubleshooting Steps Already Taken

  1. Waited for DNS propagation: 30+ minutes after CNAME publish; status unchanged

  2. Verified Cloudflare config: CNAMEs are unproxied (orange-cloud off), so Cloudflare is not in the TLS path

  3. Tested HTTP-01 challenge path: curl http://api-staging.replylayer.ai/.well-known/acme-challenge/test returns 404 from railway-edge (expected, since the domain route doesn't exist on the edge)

  4. Recreated custom domain: Deleted and recreated api-staging.replylayer.ai, got a new proxy target (2t5mwcot.up.railway.app), updated the Cloudflare CNAME. Same 404 and VALIDATING_OWNERSHIP stall.

  5. Set explicit targetPort: Updated the api-staging custom domain to targetPort=3000 (service listens on 3000 via the API_PORT env var). No change.

  6. Tested with Host header: Sent request with matching Host header against proxy target; still 404.

  7. Updated service networking config: Added custom domains to serviceDomains via Railway API. Deployed services.

  8. Restarted services: Triggered container restarts to force edge re-sync. Services came back up cleanly with fresh deployments. Edge still returns 404.
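
Steps 3 and 6 can be scripted so they are easy to rerun after each change. The sketch below uses only Python's standard `http.client`; the function names `probe` and `probe_plan` are hypothetical, and the `/v1/health` default is taken from the curl example earlier in this post rather than from anything Railway documents.

```python
import http.client

ACME_TEST_PATH = "/.well-known/acme-challenge/test"


def probe(connect_host, host_header, path="/", use_tls=True, timeout=10.0):
    """Send one GET and return (status, Server header). Performs network I/O."""
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(connect_host, timeout=timeout)
    try:
        # Passing Host explicitly stops http.client from adding its own.
        conn.request("GET", path, headers={"Host": host_header})
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status, response.getheader("Server", "")
    finally:
        conn.close()


def probe_plan(custom_domain, proxy_target, health_path="/v1/health"):
    """Return steps 3 and 6 as (connect_host, host_header, path, use_tls) tuples."""
    return [
        # Step 3: plain-HTTP request to the ACME challenge path on the custom domain.
        (custom_domain, custom_domain, ACME_TEST_PATH, False),
        # Step 6: HTTPS request to the proxy target with the custom domain as Host.
        (proxy_target, custom_domain, health_path, True),
    ]
```

Each tuple from `probe_plan("api-staging.replylayer.ai", "2t5mwcot.up.railway.app")` can be unpacked straight into `probe(*step)`; in the state described here, both probes would be expected to come back as 404 from railway-edge.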

Root Cause Analysis

The service-side state is correct and complete. The edge is returning 404, which means:

  • The ingress config for this environment is either stale or missing the custom domain routes

  • The config propagation path from the service's serviceDomains → edge ingress is not syncing for this environment

  • This is not a DNS issue, TLS issue, or service availability issue

What We Need

A manual investigation and/or resync of the edge ingress config for environment 35ee243a-1691-4eee-b73f-178f3ba75d15. The request IDs above should pinpoint the exact edge nodes serving the 404s and help identify whether the config is stale or missing.

Additional Context

  • This is a fresh environment created today; no prior custom domain history

  • All other services in the environment (databases, proxies, etc.) are functioning normally

  • The issue is isolated to custom domain routing; native Railway domains work perfectly

  • No errors or warnings in service logs or deployment output

$20 Bounty

1 Reply

Railway
BOT

21 days ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway 21 days ago


I could not find a CNAME or a TXT record attached to any of the 3 domains you have listed above.

Make sure you added the CNAME and TXT records as instructed when adding a custom domain to your service(s).

