SSL cert failing (and we are live on Hacker News)
respectify-dave
PRO OP

12 days ago

Hi. Our SaaS backend, app.respectify.ai, is serving an SSL cert for railway.com.

This happened a few minutes ago. The app settings show the domains with green checkmarks. Our site, which is #1 on Show HN right now, is giving errors.

It seems to be Railway's wildcard cert (*.up.railway.app) in Chrome, and curl gives: curl: (60) SSL: no alternative certificate subject name matches target host name 'app.respectify.ai'
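
For anyone hitting the same symptom, here is a minimal Python sketch of how to check which names the served certificate actually covers. It assumes the served cert still chains to a trusted CA (as a platform wildcard cert normally would), so only the hostname check is turned off. The helper names are made up for illustration, and the wildcard matching is a simplified version of the real rules.

```python
import socket
import ssl


def san_dns_names(cert: dict) -> list[str]:
    """Pull the DNS entries out of the subjectAltName of a getpeercert() dict."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def name_matches(pattern: str, hostname: str) -> bool:
    """Simplified match: '*' may only stand in for the whole leftmost label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        return p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


def served_names(hostname: str, port: int = 443, timeout: float = 5.0) -> list[str]:
    """Connect with SNI set to `hostname` and return the SAN names of the cert served."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # chain is still verified; only name matching is skipped
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            return san_dns_names(tls.getpeercert())
```

Calling `served_names("app.respectify.ai")` and then `name_matches(...)` on each returned name would show whether the cert covers the custom domain or only something like `*.up.railway.app`.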

Awaiting User Response

7 Replies

respectify-dave
PRO OP

12 days ago

OK folks, it's back after a redeployment; downtime was about 6 minutes. This was a redeploy of what was already deployed.

I truly have no idea what caused it, but thankfully it is okay now. With far less urgency, if possible, I'd love to know if you are able to identify a cause. Thank you 🙂


12 days ago

Hey,

Deeply sorry about that, and glad you are back online now. We would like to look into this. Would you happen to have some timestamps for us to look into?


Status changed to Awaiting User Response Railway 12 days ago


respectify-dave
PRO OP

11 days ago

Hi Brody. Thanks for the reply! Apologies for taking some time to reply -- it was a busy evening, I was up late, and took a breather today as things are currently stable.

Our Slack logs show I saw it at 12:54AM my time, which is 10:54PM GMT on the evening of 25 Feb. I think it likely started with 393966cb, and then was resolved with fdd225ce, which I think was a re-deploy of the same commit.

I do want to note: it was late at night, we were getting a bunch of server traffic and a bunch of comments on HN at the same time, and I was making adjustments to the site in direct response to feedback. The commit (5dff88c1) just before the one that I think had the SSL issue actually built but failed to go live due to my own error: I made some changes to reduce build time, and the healthcheck failed (this is embarrassing). Is it possible that SSL somehow broke due to that?

Second, for the same reasons (late night, moving fast), it is entirely possible there was user error on my side. I don't want to rule that out 😉

We / I did not knowingly change any DNS or SSL etc. settings -- or any settings at all of any kind, for that matter. We were deploying only. Our workflow is that when main is pushed, it builds and deploys production off that branch (the same is also set up for staging off our staging branch). We do not usually push to main anywhere near as often as we did last night. Staging, which also has SSL, can sometimes have very frequent pushes and builds, though. I do not recall ever seeing an SSL error with staging, even when redeploying often.

Please let me know if I can help. As noted, all good now, but I'm grateful you want to look into it. And I must repeat I am not ruling out my own error 🙂

Thank you,

David


Status changed to Awaiting Railway Response Railway 11 days ago


11 days ago

Thanks for the details, David, this is very helpful. We're going to investigate what went wrong on our end - we've made some changes re: SSL handling/termination on our part recently, so this is indeed suspicious.


Status changed to Awaiting User Response Railway 11 days ago


respectify-dave
PRO OP

11 days ago

Awesome. Thanks again. I'm curious to hear what you find if you have anything you're able to share (honestly, at this point it's curiosity or interest only, I don't need to know :)) Either way I appreciate you looking into it.


Status changed to Awaiting Railway Response Railway 11 days ago


Status changed to Awaiting User Response noahd 11 days ago


respectify-dave
PRO OP

11 days ago

Something else was transiently wrong, and I got:

curl: (56) Recv failure: Connection reset by peer

accessing the same app. It was timing out and my Astro frontend (also on Railway) was giving multiple 'fetch failed' errors trying to contact it.

Just as I was diagnosing, it came back. Nothing in our own logs, where it seemed to be functioning fine. Dashboard was green.


Status changed to Awaiting Railway Response Railway 11 days ago


We've pushed a fix for the certificate issue - it was related to a race condition in our network routing layer when a new deployment is initiated. It presented as a cert error because the lookup for your domain's cert failed and we ended up serving the default fallback certificate for *.up.railway.app.
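
To make that failure mode concrete, here is a purely illustrative Python sketch. It is not Railway's actual routing code; the `CertTable` class, the remove-then-add ordering, and the cert labels are all assumptions. It only shows how a lookup that misses during a redeploy window can fall through to a default wildcard cert.

```python
from typing import Callable

FALLBACK_CERT = "*.up.railway.app"  # the default cert named in the post above


class CertTable:
    """Hypothetical SNI-to-certificate table (assumed, not a real data model)."""

    def __init__(self) -> None:
        self._certs: dict[str, str] = {}

    def put(self, domain: str, cert: str) -> None:
        self._certs[domain] = cert

    def remove(self, domain: str) -> None:
        self._certs.pop(domain, None)

    def select(self, sni: str) -> str:
        # A failed lookup silently falls back to the default certificate.
        return self._certs.get(sni, FALLBACK_CERT)


def redeploy_non_atomic(table: CertTable, domain: str, cert: str,
                        handshake: Callable[[], str]) -> str:
    """Remove-then-add leaves a window in which lookups for the domain miss."""
    table.remove(domain)
    seen = handshake()  # a TLS handshake arriving inside the window
    table.put(domain, cert)
    return seen


def redeploy_atomic(table: CertTable, domain: str, cert: str,
                    handshake: Callable[[], str]) -> str:
    """Overwriting the entry in place leaves no window for a miss."""
    table.put(domain, cert)
    return handshake()
```

The interleaving is forced deterministically here to show the window; in a real system it would depend on timing, which would fit with it showing up only occasionally.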

re: "Connection reset by peer" error - do you see this consistently? Are you able to provide timestamps, and links to the services where this happens (e.g. service A dialing service B)?
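
One way to capture those timestamps is a small probe that records the UTC time and error type of each request. This is a minimal sketch using only the Python standard library; the `probe` function, its result shape, and the injectable `opener` parameter are assumptions for illustration, and the URL to poll is whatever endpoint the calling service normally hits.

```python
import datetime as dt
import urllib.error
import urllib.request
from typing import Any, Callable


def probe(url: str,
          opener: Callable[..., Any] = urllib.request.urlopen,
          timeout: float = 5.0) -> dict:
    """Make one request and return a timestamped record of the outcome."""
    ts = dt.datetime.now(dt.timezone.utc).isoformat(timespec="seconds")
    try:
        with opener(url, timeout=timeout) as resp:
            return {"ts": ts, "url": url, "ok": True, "detail": f"HTTP {resp.status}"}
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return {"ts": ts, "url": url, "ok": False, "detail": f"HTTP {exc.code}"}
    except OSError as exc:
        # Covers URLError, timeouts, and ConnectionResetError (reset by peer).
        return {"ts": ts, "url": url, "ok": False, "detail": repr(exc)}
```

Running `probe(...)` in a loop with a short sleep and printing only the records where `ok` is false would give a list of UTC timestamps to paste into a thread like this one.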


Status changed to Awaiting User Response Railway 7 days ago

