certified.one apex TLS broken — wildcard cert served, SAN mismatch blocking all apex traffic
aspiers
PRO, OP

15 days ago

Hi Railway team,

This is an urgent request regarding our most critical production infrastructure, which had been working perfectly for weeks and then suddenly stopped working.

The apex custom domain certified.one on our production environment is serving the wrong certificate, causing SAN mismatch errors for all clients hitting https://certified.one/.

Symptoms:

  • openssl s_client -connect certified.one:443 -servername certified.one returns cert with SAN *.certified.one only (fingerprint 8d:44:7d:bb:63:a1:87:94:bd:b6:dd:e7:89:c8:a3:2f:a2:40:63:ad:6d:23:0e:87:8b:0f:f4:5c:bc:51:d4:78)
  • Browsers/curl reject with SSL: no alternative certificate subject name matches target hostname 'certified.one'
  • Wildcard and other subdomains (e.g. auth.certified.one) work fine

Railway state (via GraphQL):

  • Project: 17980f2b-0913-439f-a53e-472969130b6d (ePDS)
  • Environment: 2a602bb6-48d5-4a7f-8a1e-ffc184abf406 (production)
  • Service: 16bd3666-68ae-4537-95fb-ac516e60e9c8 (pds-core)
  • Apex custom domain id: bcf6a4bd-49a8-4950-93bc-145434641ed9 (certified.one)
  • Apex certs provisioned (RSA fingerprint e229d5abce4265665a3ec3f12a980eda31affa6946cbbc27fe989f092c8ae11b, ECDSA 8269b24c32bba58a7c6b4141d3c34a68e2b894dc8d8174b1a2044c20f161c058), but neither is being served at the edge
  • DNS status for apex shows DNS_RECORD_STATUS_REQUIRES_UPDATE with currentValue: ""
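
One way to check which of these certificates the edge is actually presenting is to fetch the leaf certificate without validation and compare its SHA-256 fingerprint against the values listed above. The following is only a sketch: the helper names (`normalize_fingerprint`, `served_fingerprint`, `classify`) are made up for this example, and it assumes the fingerprints quoted in this post are SHA-256 digests of the DER-encoded certificates (their length is consistent with that, but the post does not say so).

```python
import hashlib
import socket
import ssl

# Fingerprints quoted in this post (ASSUMED to be SHA-256 over the DER certificate).
KNOWN = {
    "wildcard": "8d:44:7d:bb:63:a1:87:94:bd:b6:dd:e7:89:c8:a3:2f:"
                "a2:40:63:ad:6d:23:0e:87:8b:0f:f4:5c:bc:51:d4:78",
    "apex RSA": "e229d5abce4265665a3ec3f12a980eda31affa6946cbbc27fe989f092c8ae11b",
    "apex ECDSA": "8269b24c32bba58a7c6b4141d3c34a68e2b894dc8d8174b1a2044c20f161c058",
}


def normalize_fingerprint(fp: str) -> str:
    """Lowercase hex without colons or spaces, so both notations compare equal."""
    return fp.replace(":", "").replace(" ", "").lower()


def served_fingerprint(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """SHA-256 of the leaf certificate presented for `host` (SNI set to `host`)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # has to be disabled before CERT_NONE is set
    ctx.verify_mode = ssl.CERT_NONE  # we want the cert even when it does not match
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return hashlib.sha256(der).hexdigest()


def classify(fp: str) -> str:
    """Name of the known certificate with this fingerprint, or 'unknown'."""
    wanted = normalize_fingerprint(fp)
    for name, known in KNOWN.items():
        if normalize_fingerprint(known) == wanted:
            return name
    return "unknown"
```

Under those assumptions, `classify(served_fingerprint("certified.one"))` would be expected to report the wildcard certificate while the problem is present, and one of the apex certificates once the correct certificate is being served.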

DNS side:

  • Apex certified.one is a CNAME → t0m9zgd1.up.railway.app configured at Cloudflare (DNS-only / grey cloud)
  • Cloudflare's CNAME flattening returns A 69.46.46.60 at apex (which matches the resolution of t0m9zgd1.up.railway.app)
  • Railway docs state CNAME flattening is supported for apex, so this should be a valid setup
  • We have made no DNS or Railway domain config changes recently; this previously worked
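
The claim that the flattened apex resolves to the same address as the Railway target can be checked with a few lines of Python. This is a rough sketch, not a definitive test: `resolve_ipv4` and `flattening_consistent` are hypothetical helper names, and the check goes through the local system resolver, so cached answers may differ from what Railway's validator sees.

```python
import socket
from typing import Callable, Set


def resolve_ipv4(name: str) -> Set[str]:
    """IPv4 addresses the system resolver returns for `name`."""
    infos = socket.getaddrinfo(name, 443, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def flattening_consistent(apex: str, target: str,
                          resolve: Callable[[str], Set[str]] = resolve_ipv4) -> bool:
    """True if every address returned for the apex is also returned for the target."""
    apex_ips = resolve(apex)
    target_ips = resolve(target)
    return bool(apex_ips) and apex_ips <= target_ips
```

For the setup described here, the call would be `flattening_consistent("certified.one", "t0m9zgd1.up.railway.app")`. The `resolve` parameter exists only so that a different lookup function can be swapped in.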

Suspected cause:

Railway's external DNS validator can't see the CNAME (Cloudflare flattens it to A), so the apex domain stays in REQUIRES_UPDATE state and the provisioned apex cert is never bound to the ingress. Requests to apex land on the wildcard ingress (which also resolves to 69.46.46.60) and get the wildcard cert.

We've found a similar report: https://station.railway.com/questions/stale-dns-cache-prevents-custom-domain-v-dfd694d4

Request:

Please could you manually mark the apex domain as validated / force-bind the existing apex cert to the edge? We'd strongly prefer not to remove/re-add the domain (production traffic).

Thanks,

Adam

Solved, $20 Bounty

Pinned Solution

12 days ago

Hello,

We had discovered an issue which would've affected services that have both a wildcard and exact-matching SAN for the same second-level domain, on the same service. We have been migrating traffic to our new edge network over the past week, which had tree lookup logic that erroneously preferred the wildcard certificate over the exact match, specifically if the exact-matching SAN was generated/renewed after the wildcard certificate.

We have since rolled out a fleet-wide patch to this issue, and added regression testing to make sure this doesn't happen again. We apologize for the inconvenience - please let me know if you have any further questions.
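
To make the described behaviour concrete, here is a minimal sketch of certificate selection in which an exact-match SAN always wins over a wildcard, regardless of which certificate was issued or renewed more recently. Railway's actual edge code is not shown in this thread, so everything here is an assumption for illustration: `Cert`, `covers`, and `select_cert` are hypothetical names, and modelling each certificate as having a single SAN is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Cert:
    san: str             # one DNS name, e.g. "certified.one" or "*.certified.one"
    issued_at: datetime  # issuance or last renewal time


def covers(san: str, hostname: str) -> bool:
    """True if `san` is valid for `hostname`.

    A wildcard stands for exactly one left-most label, so "*.certified.one"
    covers "auth.certified.one" but not "certified.one" itself.
    """
    san = san.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if san.startswith("*."):
        _, _, parent = hostname.partition(".")
        return parent != "" and parent == san[2:]
    return san == hostname


def select_cert(hostname: str, certs: List[Cert]) -> Optional[Cert]:
    """Pick a certificate for `hostname`, or None if nothing covers it."""
    matching = [c for c in certs if covers(c.san, hostname)]
    if not matching:
        return None
    # Exact match outranks wildcard; issuance time only breaks ties
    # between certificates of the same kind.
    return max(matching, key=lambda c: (not c.san.startswith("*."), c.issued_at))
```

The point of the sort key is that issuance time is compared only after the exact-versus-wildcard decision has been made, so a later renewal of either certificate cannot change which kind is preferred.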

14 Replies

Railway
BOT

15 days ago

This thread has been opened as a public bounty so the community can help solve it. The thread and any further activity are now visible to everyone.

Status changed to Open Railway 15 days ago


Try removing the domain from Railway and adding it back after ~10-15 mins. Update DNS records as necessary.

Also, Cloudflare currently has an incident related to certificates so it may or may not be causing this issue. (https://new.cloudflarestatus.com/incidents/j17t8xz91xs0)


aspiers
PRO, OP

14 days ago

We did that yesterday and it fixed it, but the problem has already come back again so this is clearly not sustainable. Especially given the limit of 5 new certs per week (if I remember correctly).


aspiers
PRO, OP

14 days ago

And this time it didn't work. Why would we have to wait ~10-15 mins before adding?


aspiers

And this time it didn't work. Why would we have to wait ~10-15 mins before adding?

It's less about a specific amount of time, and more about TTL expiration and Let's Encrypt rate limits. Waiting a few minutes is better to avoid any caching issues.


aspiers
PRO, OP

14 days ago

This time it also affected an entirely different service not on the apex domain, so that disproves the previous theory about flattening of the apex CNAME to an A record being the cause. We've switched to Cloudflare proxying and that works around the broken certificate, but it's very concerning that these things are just randomly failing out of the blue after weeks or months without issue.


aspiers
PRO, OP

14 days ago

The problem only manifests when Cloudflare proxying is disabled. In that case certified.one incorrectly uses the certificate for *.certified.one which causes service outages due to TLS validation failures. (We have to have both of these as custom domains for the same service for it to function correctly, due to the nature of the workload.)

Unfortunately we can no longer rely on Railway TLS, given that:

  • this happened three times within 18 hours (including on non-apex domains)
  • re-adding the domain didn't even work the second time
  • Let's Encrypt rate limits will be reached very quickly at this rate
  • there is apparently no issue acknowledged from Railway's side

From the link shared above, it seems that even relying on Cloudflare is questionable. So we are looking into running our own TLS proxy where we have full control of the certificates. Pretty disappointing, and super strange considering we had months without any issues and didn't change anything recently.

BTW we already had instatus monitoring set up but it inexplicably failed to spot the TLS certificate issue - maybe their monitors cache certificates. I have raised a support request with them separately.


aspiers
PRO, OP

13 days ago

This morning yet another of our services fell victim to the same outage - this time it was one on a subdomain (with 4 domain segments). This is yet more evidence that it is nothing to do with CNAME flattening on apex domains. Worse, Cloudflare doesn't support proxying of nested subdomains, so this time that workaround is not an available option.

This is a very serious failure of Railway (or perhaps an upstream provider), and we are still no closer to a) understanding the root cause, or b) a proper solution rather than a workaround which very quickly fails due to Let's Encrypt rate limiting.


alialabdrabulrasul
PRO, Top 10% Contributor

13 days ago

Same thing on nahltime.com — apex served the *.nahltime.com wildcard (ERR_CERT_COMMON_NAME_INVALID) even though the apex cert was issued (saw it in CT logs) but never bound to the edge. Worked for weeks, broke after a wildcard renewal, and remove/re-add only helped for a few hours while burning the LE weekly cert limit. Fix that worked: apex behind Cloudflare proxied (orange) with SSL/TLS = Full, subdomains left grey — Cloudflare serves its own apex cert so Railway's broken edge never comes into it. Grey/DNS-only did nothing; only proxying worked. It's a workaround though — the real bug is Railway not binding the issued apex cert.


brandonmchu
PRO

13 days ago

+1 — we're hitting the identical failure on our apex, every.ai, starting in roughly the same window — the last ~24–48h — after months of stability with no DNS or domain config changes on our side.

Same signature as the OP:

  • Apex is served a wildcard that doesn't cover it: *.every.ai on the bare every.ai, resulting in ERR_CERT_COMMON_NAME_INVALID.
  • It's spread beyond the apex. Some subdomains now get a non-covering wildcard too. For example, staging.every.ai is handed *.staging.every.ai, which doesn't cover the parent label.
  • For anyone confused by the symptom: a *.foo.com cert never covers the bare foo.com. So when the edge serves the wildcard for the parent name, it always shows up as a CN/SAN mismatch.

What convinced us this is edge cert selection, not issuance:

  • Certificate Transparency proves the right certs exist. crt.sh shows valid, unexpired certs for our exact apex hostname were provisioned recently, but the edge keeps serving the older wildcard instead.

Anyone can check their own:

  • Issuance history: https://crt.sh/?q=yourdomain.com
  • What's actually served:

openssl s_client -connect yourdomain.com:443 -servername yourdomain.com </dev/null 2>/dev/null | openssl x509 -noout -subject -dates

  • All DNS records show verified/green in Railway's Networking panel — CNAME + _railway-verify TXT both checkmarked — yet the wrong cert is still served. So this is not fixable from the DNS side.

Heads-up for others landing here:

  • Remove + re-add is only a temporary fix. It reverts.
  • Repeated re-issuance can burn your Let's Encrypt rate limit: 5 duplicate certs/week per identical hostname set. Don't loop on it or you can lock yourself out of issuing any cert for ~a week.
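
For anyone scripting remove/re-add attempts, a simple guard like the one below can help avoid running into that limit. It is a sketch under stated assumptions: the figure of 5 per 7 days is the one quoted in this thread and should be checked against Let's Encrypt's current documentation, a plain sliding window is used as an approximation of however the limit is really enforced, and `remaining_reissues` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Iterable

DUPLICATE_LIMIT = 5          # figure quoted in this thread; verify before relying on it
WINDOW = timedelta(days=7)   # "per week", approximated as a sliding 7-day window


def remaining_reissues(issued_at: Iterable[datetime], now: datetime,
                       limit: int = DUPLICATE_LIMIT,
                       window: timedelta = WINDOW) -> int:
    """How many more certs for the same hostname set fit in the current window.

    `issued_at` holds the issuance times of certificates for one identical
    hostname set, for example as read from Certificate Transparency logs.
    """
    recent = sum(1 for t in issued_at if now - window < t <= now)
    return max(0, limit - recent)
```

If the function returns 0, another remove/re-add cycle for that exact hostname set is likely to fail at issuance rather than fix anything.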

Railway team — can you confirm whether this is a known edge/cert incident? Multiple customers are now reporting the same “certs provisioned but not served at the edge” behavior in the same window, and the remove/re-add guidance isn't holding. Is it related to the Cloudflare certificate incident mentioned earlier?


luismingati
PRO

12 days ago

I am having the exact same problem. Did you find any fix for this?


chippd
PRO

12 days ago

chiming in to say I've hit this problem too. Not good


luismingati
PRO

12 days ago

Any solution to this?



Status changed to Solved dizzydes90 12 days ago


aspiers
PRO, OP

12 days ago

Thanks a lot - everything seems to have started working again, so I am cautiously optimistic. However, this was a major production outage evidently affecting multiple customers, so it was extremely painful to have to wait multiple days for a response.


Status changed to Solved dev 10 days ago

