Sometimes internal services (postgres) aren't discoverable in PR environments
monotykamary
PROOP

2 months ago

This isn't specific to my workspace, but it has been happening much more often recently on my colleague's projects. Occasionally, when a PR environment spins up, it isn't able to pick up the DATABASE_URL from PostgreSQL (the railway.internal URL). I often have to update the environment variable directly to use DATABASE_PUBLIC_URL instead.

Funny coincidence that this happens only on rainy days :fafuke:. The issue is intermittent, but once it happens it sticks around for that PR environment across deploys.

Here are my colleague's workspace, project, and environment IDs for reference:

  • workspaceId: ec546b49-3e78-4f40-8c34-1027b00aca2d

  • projectId: 7ed2d64d-500b-4c1e-964d-36b071b51601

  • environmentId: f4abcd7f-966d-4d85-bee7-95b92dbb7e4b

Solved

38 Replies

2 months ago

Can you go more in depth on what you mean when you say it can't pick it up?


monotykamary
PROOP

2 months ago

ah gotcha, let me see if I can capture the logs and some pics


2 months ago

Does your base environment use reference variables?


monotykamary
PROOP

2 months ago

ah yess, one of our services sets SCOUT_DATABASE_URL=${{Postgres.DATABASE_URL}}?schema=scout_objs


monotykamary
PROOP

2 months ago

most of the time this works, but occasionally, on some rainy day, our app just won't connect to it, so we switch to SCOUT_DATABASE_URL=${{Postgres.DATABASE_PUBLIC_URL}}?schema=scout_objs
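Because Railway substitutes the reference into a plain string, `${{Postgres.DATABASE_URL}}?schema=scout_objs` only stays valid while the base URL has no query string of its own. A minimal Python sketch of appending the `schema` parameter more defensively; the helper `with_schema` and the `secret` password are hypothetical placeholders, and `schema` is assumed to be the Prisma-style PostgreSQL option this app reads:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_schema(base_url: str, schema: str) -> str:
    """Return base_url with its `schema` query parameter set to `schema`.

    Unlike naive `base_url + "?schema=..."`, this keeps any query
    parameters already present and replaces an existing `schema` value.
    """
    parts = urlsplit(base_url)
    # Keep existing parameters, dropping any previous `schema` entry.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "schema"]
    query.append(("schema", schema))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical credentials, not the real DATABASE_URL.
internal = "postgresql://postgres:secret@postgres.railway.internal:5432/railway"
print(with_schema(internal, "scout_objs"))
# -> postgresql://postgres:secret@postgres.railway.internal:5432/railway?schema=scout_objs
```

The same helper works unchanged whether the base is the internal or the public URL, since only the query component is touched.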


2 months ago

Does the reference resolve, or is this purely an application-level issue?


monotykamary
PROOP

2 months ago

the DATABASE_URL resolves to postgresql://postgres:...@postgres.railway.internal:5432/railway and the application usually picks it up and does its thing
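A reference variable resolving (Railway filling in the URL) is a separate step from the hostname in that URL resolving through DNS inside the container. A minimal Python sketch that asks the container's system resolver what the host in a connection URL maps to, grouped by address family; `address_families` is a hypothetical helper name, and falling back to port 5432 is an assumption:

```python
import socket
from urllib.parse import urlsplit


def address_families(url: str) -> dict[str, list[str]]:
    """Resolve the host of a connection URL through the system resolver.

    Returns the distinct addresses found, grouped as {"IPv4": [...], "IPv6": [...]}.
    Raises socket.gaierror if the name does not resolve at all.
    """
    parts = urlsplit(url)
    found: dict[str, list[str]] = {"IPv4": [], "IPv6": []}
    # Assume the default Postgres port if the URL omits one.
    infos = socket.getaddrinfo(parts.hostname, parts.port or 5432,
                               type=socket.SOCK_STREAM)
    for family, _socktype, _proto, _canonname, sockaddr in infos:
        if family == socket.AF_INET:
            label = "IPv4"
        elif family == socket.AF_INET6:
            label = "IPv6"
        else:
            continue  # ignore any non-IP families
        if sockaddr[0] not in found[label]:
            found[label].append(sockaddr[0])
    return found


# Inside the app container, for example:
#   import os; print(address_families(os.environ["DATABASE_URL"]))
```

Python's `socket.getaddrinfo` goes through the same libc resolver (musl on Alpine, glibc on Debian) that Node's `dns.lookup` uses, so a `socket.gaierror` here points at name resolution rather than at Postgres itself.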


monotykamary
PROOP

2 months ago

but on rare occasions, our Next.js server no longer wants to connect to it


monotykamary
PROOP

2 months ago

so I would have to change it to the public URL for it to pick it up


monotykamary
PROOP

2 months ago

although I'm testing that environment with the internal URL again and it's working fine now :monkaS:


monotykamary
PROOP

2 months ago

so it's a really intermittent heisenbug


2 months ago

Do you have the errors from the times when it won't connect?


monotykamary
PROOP

2 months ago

let me see if I can capture one on my work machine


monotykamary
PROOP

2 months ago

oh at least the old deployments still have it

[screenshot of the error from an old deployment]


monotykamary
PROOP

2 months ago

this is all I have I think 💀


2 months ago

Haha, I think you know that error is extraordinarily vague and wouldn't be helpful here.


monotykamary
PROOP

2 months ago

going to pick a rainy day and see if I can reproduce it with some network traces


monotykamary
PROOP

2 months ago

I'll ask my colleague to add some sidecars


2 months ago

Sounds good, I'll be here when you have more information!


monotykamary
PROOP

2 months ago

https://github.com/monotykamary/railway-network-sidecar
I've added a sidecar to my colleague's project to check whether there were intermittent connection issues on Railway
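The linked sidecar is described later in the thread as a Zig sidecar, and its code is not reproduced here. As a rough Python illustration of the same kind of check, under assumed parameters (2 s connect timeout, 5 s interval; `probe_once` and `probe_loop` are hypothetical names): resolve the internal hostname, try a TCP connection to every address returned, and tally outcomes per address family, so intermittent failures, and whether they hit only IPv4 or only IPv6, show up in the logs.

```python
import socket
import time
from collections import Counter


def probe_once(host: str, port: int, timeout: float = 2.0) -> Counter:
    """Resolve host, then try a TCP connect to each address it returns.

    Returns a Counter keyed by (family, outcome), e.g. ("IPv6", "ok").
    A resolver failure is recorded as ("dns", "fail").
    """
    results: Counter = Counter()
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        results[("dns", "fail")] += 1
        return results
    for family, socktype, proto, _canonname, sockaddr in infos:
        label = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}.get(family)
        if label is None:
            continue  # skip non-IP families
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            try:
                sock.connect(sockaddr)
                results[(label, "ok")] += 1
            except OSError:  # timeouts, refusals, unreachable networks
                results[(label, "fail")] += 1
    return results


def probe_loop(host: str, port: int, rounds: int, interval: float = 5.0) -> Counter:
    """Run probe_once `rounds` times, printing and accumulating the tallies."""
    total: Counter = Counter()
    for i in range(rounds):
        result = probe_once(host, port)
        total.update(result)
        print(time.strftime("%H:%M:%S"), dict(result), flush=True)
        if i < rounds - 1:
            time.sleep(interval)
    return total
```

For example, `probe_loop("postgres.railway.internal", 5432, rounds=720)` probes every 5 seconds for roughly an hour; a run of `("IPv4", "fail")` entries alongside `("IPv6", "ok")` would match the IPv4-only timeouts mentioned later in the thread.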


monotykamary
PROOP

2 months ago

what I found was that it had nothing to do with Railway


monotykamary
PROOP

2 months ago

and everything to do with Alpine's musl :facepalm:


monotykamary
PROOP

2 months ago

even after all these years it's still flaky


monotykamary
PROOP

2 months ago

but thankfully nothing on Railway's side


monotykamary
PROOP

2 months ago

sorry for the trouble :chemat:


2 months ago

Oh, tell me more about how you came to that conclusion. What about Alpine's musl was causing issues?


monotykamary
PROOP

2 months ago

ah, I noticed that after several days of running, the network sidecar had no issues: it resolved the internal DNS just fine, with perfect reliability


monotykamary
PROOP

2 months ago

so I looked into what was so different with our webapp container (since it was the only one having network issues)


monotykamary
PROOP

2 months ago

turns out it was using Alpine, and I faintly remembered that this was a common problem back in the pre-CoreDNS k8s era


monotykamary
PROOP

2 months ago

switching from Alpine to debian-slim solved it immediately


monotykamary
PROOP

2 months ago

likely a DNS-over-TCP issue (I think)?


2 months ago

Well, now I'm wondering whether this was Alpine, or the recent IPv4 private networking issue we found, which affected a very small subset of hosts (and has since been fixed).


monotykamary
PROOP

2 months ago

oh, it might have cascaded to Alpine in that sense


2 months ago

Yeah, any request made over IPv4 would time out, while IPv6 worked fine.


2 months ago

If only Prisma gave a better error message, you could have been the first user to find the IPv4 issue, haha.


2 months ago

Either way, I'm happy you are in a good place now, and that was a very nice solution with your Zig sidecar.


monotykamary
PROOP

2 months ago

likewise, thanks for the support as always


2 months ago

Happy to help!


Status changed to Solved brody about 2 months ago

