Sometimes internal services (postgres) aren't discoverable in PR environments

monotykamary

PROOP

6 months ago

Not specific to my workspace, but this has happened much more recently on my colleague's projects. Occasionally, when a PR environment spins up, it isn't able to pick up the DATABASE_URL from PostgreSQL (the railway.internal URL). I often have to update the environment variable directly to use DATABASE_PUBLIC_URL.

Funny coincidence that this happens only on rainy days. The issue is intermittent, but once it happens it sticks around for that PR environment across deploys.

Here is my colleague's workspace, project, and environment ID for reference:

workspaceId: ec546b49-3e78-4f40-8c34-1027b00aca2d
projectId: 7ed2d64d-500b-4c1e-964d-36b071b51601
environmentId: f4abcd7f-966d-4d85-bee7-95b92dbb7e4b

Solved

38 Replies

brody

EMPLOYEE

6 months ago

Can you go more in depth on what you mean when you say it can't pick it up?

monotykamary

PROOP

6 months ago

ah gotcha, let me see if I can capture the logs and some pics

brody

EMPLOYEE

6 months ago

Does your base environment use reference variables?

monotykamary

PROOP

6 months ago

ah yess, we have an environment variable in one service that sets SCOUT_DATABASE_URL=${{Postgres.DATABASE_URL}}?schema=scout_objs

monotykamary

PROOP

6 months ago

most of the time this works, but occasionally, on some rainy day, we noticed our app just not connecting to it, so we switched to SCOUT_DATABASE_URL=${{Postgres.DATABASE_PUBLIC_URL}}?schema=scout_objs
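For reference, the two variable values above append `?schema=scout_objs` by plain string concatenation, which holds up only while the resolved URL has no query string of its own. A minimal Python sketch of a safer way to add the parameter in application code follows; the helper name `with_schema` and the sample password are hypothetical and are not part of Railway or of the project discussed here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_schema(database_url: str, schema: str) -> str:
    """Return database_url with its `schema` query parameter set to `schema`.

    Any other query parameters already present on the URL are preserved.
    """
    parts = urlsplit(database_url)
    # Drop any existing `schema` entry so the parameter is not duplicated.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "schema"
    ]
    params.append(("schema", schema))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

Passing either the internal or the public URL through the same helper keeps the two variants identical apart from the host, which makes it easier to tell a networking problem from a malformed connection string.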

brody

EMPLOYEE

6 months ago

Does the reference resolve? Or is this purely an application-level issue?

monotykamary

PROOP

6 months ago

the DATABASE_URL resolves to postgresql://postgres:...@postgres.railway.internal:5432/railway and the application usually picks it up and does its thing

monotykamary

PROOP

6 months ago

but on rare occasions, our nextjs server no longer wants to connect to it

monotykamary

PROOP

6 months ago

so I would have to change it to the public URL for it to pick it up

monotykamary

PROOP

6 months ago

although I am testing that environment again with the internal URL and it's working fine again

monotykamary

PROOP

6 months ago

so it's a really intermittent heisenbug

brody

EMPLOYEE

6 months ago

Do you have the errors from the times when it won't connect?

monotykamary

PROOP

6 months ago

let me see if I can capture one on my work machine

monotykamary

PROOP

6 months ago

oh at least the old deployments still have it

[attachment]

monotykamary

PROOP

6 months ago

this is all I have I think 💀

brody

EMPLOYEE

6 months ago

Haha, I think you know that error is extraordinarily vague and wouldn't be helpful here.

monotykamary

PROOP

6 months ago

going to pick a rainy day and see if I can reproduce it with some network traces

monotykamary

PROOP

6 months ago

i'll ask my colleague to add some sidecars

brody

EMPLOYEE

6 months ago

Sounds good, I'll be here when you have more information!

monotykamary

PROOP

6 months ago

https://github.com/monotykamary/railway-network-sidecar

I've added a sidecar on my colleague's project to debug whether there were intermittent connection issues on railway

monotykamary

PROOP

6 months ago

what I found was it had nothing to do with railway

monotykamary

PROOP

6 months ago

and everything to do with alpine's musl

monotykamary

PROOP

6 months ago

even after all these years it's still flaky

monotykamary

PROOP

6 months ago

but thankfully nothing on railway's side

monotykamary

PROOP

6 months ago

sorry for the trouble

brody

EMPLOYEE

6 months ago

Oh, tell me more about how you came to that conclusion, and what about Alpine's musl was causing issues?

monotykamary

PROOP

6 months ago

ah, I noticed that after several days the network sidecar had run with no issues; it resolved the internal DNS just fine and gave perfect reliability

monotykamary

PROOP

6 months ago

so I looked into what was so different with our webapp container (since it was the only one having network issues)

monotykamary

PROOP

6 months ago

turns out it was using alpine, and I faintly remembered this being a common problem in the pre-CoreDNS k8s era

monotykamary

PROOP

6 months ago

switching out alpine for debian-slim solved it immediately

monotykamary

PROOP

6 months ago

likely a dns over tcp issue (I think)?

brody

EMPLOYEE

6 months ago

Well, now I am here wondering if this was Alpine, or the recent issue with IPv4 private networking that we found that would affect a very small subset of hosts (and have since fixed).

monotykamary

PROOP

6 months ago

oh it might have cascaded to alpine in that sense

brody

EMPLOYEE

6 months ago

Yeah, any request made over IPv4 would time out, while IPv6 worked fine.
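One way to tell an IPv4-only timeout apart from a DNS failure is to probe each address family separately from inside the affected container. The following is a minimal Python sketch using only the standard library; the function name `probe` and its return strings are illustrative assumptions, not something Railway or the sidecar linked earlier provides.

```python
import socket


def probe(host: str, port: int, family: socket.AddressFamily,
          timeout: float = 3.0) -> str:
    """Attempt one TCP connection to host:port over a single address family.

    Returns "ok", "dns-failure", "timeout", or "error: <reason>".
    """
    try:
        infos = socket.getaddrinfo(
            host, port, family=family, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        # The name did not resolve to any address in this family.
        return "dns-failure"

    # Only the first resolved address is tried, to keep the sketch short.
    fam, socktype, proto, _canonname, sockaddr = infos[0]
    try:
        with socket.socket(fam, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc.strerror or exc}"
    return "ok"
```

Calling `probe("postgres.railway.internal", 5432, socket.AF_INET)` and then again with `socket.AF_INET6` would show whether only one family fails. A "timeout" on IPv4 alongside "ok" on IPv6 would match the behaviour described here, while "dns-failure" on both would point back at name resolution in the container.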

brody

EMPLOYEE

6 months ago

If only Prisma gave a better error message, you could have been the first user to find the IPv4 issue, haha.