8 months ago
We use Railway endpoints for our CI/CD and rely on them for our ephemeral deployments (since we use external DBs such as Neon), using a GitHub Action to set the variables on the Railway deployment. We haven't changed anything in a very long time, but the GitHub Action, which normally works well, has started failing intermittently. See attached screenshot
97 Replies
On a different project it sometimes succeeded, but then it would fail on this step

Is the infrastructure currently flaky, or is the API changing in a rolling blue/green deployment? Without this, our preview environment infrastructure fails
8 months ago
Hey, from what I see in your screenshot, the Railway API is saying that an environment by that name already exists. Maybe your CI is somehow running twice, or the environment actually exists?
8 months ago
oh seems like it's timing out
yeah, this was intermittent; it would sometimes make it past this step, then reach the final one here, and then fail too
8 months ago

8 months ago
it seems you're getting the same issue as the following thread:
https://discord.com/channels/713503345364697088/1381999325587968010
8 months ago
maybe your environment is so big that it times out on Railway's API
Quite literally the final step from here https://github.com/Faolain/railway-pr-deploy/blob/d775c351a1d135637db3278066732a59ffe7deca/index.js#L376
8 months ago
AI generated deployment…?
this has always worked fwiw; I even redeployed PRs that were working earlier today, but now they fail with the timeout
it varies between the first one and the latter, but neither succeeds. All GitHub Action deployments were working until now, so something must have changed on Railway's GraphQL API end
8 months ago
I'll tag the team here to take a look, but if you really want something working now, I'd recommend sending the environment duplicate request and then polling the environments API to see if it was actually created, or maybe just adding a sleep between actions.
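The poll-after-request workaround suggested above could be sketched roughly like this (a hypothetical helper, not Railway's SDK; `listEnvironments` is a caller-supplied function that would wrap whatever environments query the CI already uses):

```javascript
// Sketch of the suggested workaround: fire the duplicate request, then poll
// the environments list until the new environment actually shows up, instead
// of trusting the (possibly timed-out) mutation response.
async function waitForEnvironment(listEnvironments, envName, { attempts = 10, delayMs = 5000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    // listEnvironments() is expected to resolve to an array of { name, ... }
    const envs = await listEnvironments();
    const found = envs.find((e) => e.name === envName);
    if (found) return found;
    // plain sleep between polls; exponential backoff would work just as well
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`environment "${envName}" not visible after ${attempts} polls`);
}
```

With a short delay and a generous attempt count, this turns an intermittent 504 into a bounded wait rather than a hard failure.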
yeah, it seems similar-ish, but nothing has changed on the services end for us in many months
8 months ago
latter action?
this is the second one it fails on, if it makes it past the environmentCreate/duplication
8 months ago
is that one always failing?
if it helps, I've seen this flaky behavior in the past, but after a few hours it would start working again; it would usually coincide with some reported issue on the Railway side
8 months ago
I remember seeing some issues related to GitHub, let me make sure, one minute
8 months ago
Are you able to reinstall the GitHub app from Railway? I remember seeing some issues with GitHub invalidating tokens that Railway wasn't able to handle. It might be another issue, but it's worth a try
https://github.com/Faolain/railway-pr-deploy/blob/d775c351a1d135637db3278066732a59ffe7deca/index.js#L376 code here unchanged in > a year
8 months ago
but like, is your repository somehow attached to your Railway service?
8 months ago
but how do you push your code to Railway's infra?
I open a PR, which triggers this GitHub Action. The action duplicates the existing staging environment, sets the upstream branch to the PR branch, and redeploys all the services (the last step, which fails). Railway itself then pulls the code from GitHub, since the branch is set to the PR one
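The flow just described, reduced to its steps, looks roughly like this (all function names here are illustrative stand-ins injected by the caller, not Railway's actual API):

```javascript
// Rough shape of the preview-deploy flow: duplicate staging, point the new
// environment at the PR branch, redeploy. Each external call is injected so
// the orchestration itself stays testable.
async function deployPreview(prNumber, prBranch, { duplicateEnv, setUpstreamBranch, redeployServices }) {
  const env = await duplicateEnv(`pr-${prNumber}`); // duplicate the staging environment
  await setUpstreamBranch(env.id, prBranch);        // point the new env at the PR branch
  await redeployServices(env.id);                   // redeploy all services (the step that fails)
  return env;                                       // Railway then pulls the PR branch from GitHub
}
```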
8 months ago
it pulls code from GitHub, so it's connected to GitHub somehow
sure, I guess there's a Railway app, but my point is that it's not deploying via that integration
8 months ago
yep, do that
but the flakiness seems to be on the Railway side when it comes to duplicating or redeploying services
and it's not deterministic; if the token were bad, it would consistently fail on one step, right?
8 months ago
anyway cc @Brody
504 timeouts on the environment create endpoint, also happening on the dashboard (https://discord.com/channels/713503345364697088/1381999325587968010)
8 months ago
yep, environment creation is probably Railway's fault here
8 months ago
also, can I ask why you're not using Railway's auto PR feature?
8 months ago
can I ask why you aren't using more up-to-date mutations 😆
never got around to it, Brody; if there were an official Railway GitHub Action that did that, though, we would happily transition to it 😉
8 months ago
oh so auto branch on PR and stuff?
yeah, what our steps currently do is: create a branch on Neon (along with other services), then create a branch off of staging on Railway too, and set the env vars on the Railway branch with the env vars of all the 3rd-party services that were created
8 months ago
so the user's at fault here, not a Railway issue?
I uninstalled and reinstalled the GitHub app; no change in behavior, environmentCreate API requests are still timing out
8 months ago
pretty sure the environmentCreate issue will remain after the GitHub reinstall; maybe that fixed the "no commit found" issue
other times, after timing out, it would fail and no env would get created (in both cases still getting the environmentCreate error)
8 months ago
yep, if I'm correct, Railway implemented some kind of global timeout on all requests, and some of those slow mutations fail to respond within that time frame (even though they're still processing in the background).
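If that theory is right, the failure mode is "mutation 504s at the gateway but still completes in the background", which a client can handle by treating the timeout as an unknown outcome and verifying with a follow-up read. A minimal sketch, assuming the caller supplies both the mutation and a read-back check (hypothetical stand-ins, not Railway's API):

```javascript
// Handle an ambiguous gateway timeout: if the mutation 504s, it may still
// have succeeded server-side, so read the state back before declaring failure.
async function mutateThenVerify(runMutation, readBackResult) {
  try {
    return await runMutation();
  } catch (err) {
    if (err.status !== 504) throw err;   // only a gateway timeout is ambiguous
    const result = await readBackResult();
    if (result) return result;           // the mutation finished despite the 504
    throw err;                           // it really did fail
  }
}
```

This matches the observation later in the thread that the environment is sometimes created even though the request times out.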
8 months ago
what is missing from this? -
8 months ago
no we are
so I just ran it for the 3rd time on one repo, and it got past environmentCreate
8 months ago
hmm, so "commit not found" still persists; I really don't know what could be causing this other than an invalid commit hash
8 months ago
a bit new
8 months ago
if you need any help on setting it up let me know
I just tried swapping the custom GitHub Action for the Railway-provided CLI action and I'm still getting failures.
GitHub Action:
    - name: Link CLI to project
      run: railway link --project "$LINK_PROJECT_ID" --environment "$DUPLICATE_FROM_ID"
    - name: Create Railway env (multi-vars)
      env:
        DB_URL: >-
          postgresql+asyncpg://${{ secrets.NEON_USERNAME }}:
          ${{ steps.create_branch.outputs.password }}
          @${{ steps.create_branch.outputs.db_host }}/espresso-staging
      run: |
        railway environment new pr-${{ github.event.pull_request.number }} \
          --copy "${{ env.DUPLICATE_FROM_ID }}" \
          -v "${{ env.SERVICE_ID }}" "DATABASE_URL=${{ env.DB_URL }}" \
          -v "${{ env.SERVICE_ID }}" "APP_ENVIRONMENT=pr-preview" \
          -v "${{ env.SERVICE_ID }}" "LOG_DD_AGENT_HOST="
And if I try to hit https://backboard.railway.com/graphql/v2 manually myself, which seems to be the endpoint timing out, I get
I've noticed that half the time the environment is created, but this timeout still occurs
The problem is that since this step fails, it's never able to go on to the next step, which takes the domain and passes it to Vercel. It's been 12 days since this was reported; is it possible for this to be fixed?
8 months ago
pretty sure that the CLI still uses the copy environment endpoint under the hood so the issue still persists
8 months ago
that's normal, does not impact anything on the queries itself
8 months ago
we're waiting on the team for this; unfortunately there's nothing we can do until then, sorry about that.
Appreciate it @ThallesComH, is there any idea on a timeline? As of right now we've been unable to have ephemeral environments, and if this continues for much longer we'll have to consider migrating away from Railway, which we'd rather not do. We chose Railway for its ease of use and the ability to quickly spin up ephemeral environments in a customizable way. Are there any workarounds the team suggests for the interim?
Bumping this 🙏 (if there are workarounds definitely open to it but we need our ephemeral PRs to work again)
7 months ago
I would recommend opening a private thread on https://station.railway.com/
7 months ago
the team is more likely to solve your problem there.
the deploy's going through, but it's erroring out before it can pull from the correct git branch
7 months ago
Hello, we made some changes in this regard, could you confirm if you still see this issue?
7 months ago
cc @Faolain
Hmm, I had a private thread where I saw there were updates. I noticed that the deploys are working without timing out, but the environment copy is no longer working (whether using the Railway CLI or my old, unchanged GitHub Action)
aka all the old environment variables were being kept from the branch it forked from (which in and of itself led to 4 different PRs pointing at our staging Redis, leading to… issues)