Intermittent failures on Railway API Endpoints
faolain
PROOP

9 months ago

We use Railway endpoints for our CI/CD and rely on them for our ephemeral deployments (since we use external DBs such as Neon), with a GitHub Action that sets the variables on the Railway deployment. We have changed nothing in a very long time, but the GitHub Action, which normally works well, is beginning to fail intermittently. See attached screenshot

97 Replies

faolain
PROOP

9 months ago

45c5d9aa-4168-473c-8491-c139f300457f


faolain
PROOP

9 months ago

On a different project it sometimes succeeded, but then it would fail on this step

(screenshot attached)


faolain
PROOP

9 months ago

Is the infrastructure currently flaky, or is the API changing in a rolling blue/green deployment? Without this, our preview environment infrastructure fails


faolain
PROOP

9 months ago

Maybe that trace ID can help


9 months ago

Hey, from what I see in your screenshot, the Railway API is saying that an environment by that name already exists. Maybe your CI is somehow running twice, or the environment actually exists?


faolain
PROOP

9 months ago

Sorry I attached the wrong screenshot <:Facepalm:1181339308687380531>


faolain
PROOP

9 months ago

(screenshot attached)


faolain
PROOP

9 months ago

this should have been the initial screenshot


9 months ago

oh seems like it's timing out


faolain
PROOP

9 months ago

yeah, this was intermittent; sometimes it would make it past this step and reach the final one here, and then that would fail too


9 months ago

(screenshot attached)


9 months ago

it seems that you're getting the same issue from the following thread:
https://discord.com/channels/713503345364697088/1381999325587968010


9 months ago

maybe your environment is so big that it times out on Railway's API



9 months ago

AI generated deployment…?


faolain
PROOP

9 months ago

this has always worked, fwiw; I even redeployed PRs that were working earlier today, but now they fail with the timeout


faolain
PROOP

9 months ago

it varies between the first one and the latter, but neither succeeds. All the GitHub Action deployments were working until now, so something must have changed on the Railway GraphQL API end


9 months ago

I'll tag the team here to take a look, but if you really want something working now, I would recommend sending the environment duplicate request and then polling the environments API to see if it was actually created, or maybe just implementing a sleep between actions.
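A minimal sketch of that workaround, assuming a caller-supplied `post` function; the GraphQL payload strings are placeholders, not Railway's documented schema:

```python
import time

def create_environment_with_retry(post, name, max_attempts=3, delay=10):
    """Retry-then-poll sketch of the workaround above.

    `post(payload)` is a caller-supplied function that sends one GraphQL
    request and returns the decoded response, raising TimeoutError on a
    504. The payload strings below are placeholders, not Railway's real
    schema.
    """
    create = f"mutation {{ environmentCreate(name: {name!r}) }}"  # placeholder
    lookup = "query { environments }"                             # placeholder
    for attempt in range(1, max_attempts + 1):
        try:
            return post(create)
        except TimeoutError:
            # The mutation may still have completed server-side despite
            # the 504, so poll before retrying instead of blindly
            # re-creating and hitting "already exists".
            existing = post(lookup)
            if name in existing:
                return existing
            time.sleep(delay * attempt)  # back off between attempts
    raise RuntimeError(f"environment {name!r} was not created "
                       f"after {max_attempts} attempts")
```

Polling before retrying also avoids the "environment already exists" failure mode from earlier in this thread.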


faolain
PROOP

9 months ago

yeah, seems similar-ish, but nothing has changed on our services' end in many months


faolain
PROOP

9 months ago

what about for the latter action?


faolain
PROOP

9 months ago

(since it's two endpoints which are flaky)


9 months ago

latter action?


faolain
PROOP

9 months ago

two endpoints are failing/being flaky


faolain
PROOP

9 months ago

this is the second one it fails on, if it makes it past the environmentCreate/duplication


9 months ago

that one is always failing?


faolain
PROOP

9 months ago

yep if it makes it past the first environmentCreate


faolain
PROOP

9 months ago

haven't gotten it to pass yet


faolain
PROOP

9 months ago

if it helps, I have seen this flaky behavior in the past, but after a few hours it would start working again; it would usually coincide with some reported issue on the Railway side


faolain
PROOP

9 months ago

unsure if it's a canary of sorts haha


9 months ago

I remember seeing some issues related to GitHub, let me make sure, one minute


9 months ago

Are you able to reinstall the GitHub app from Railway? I remember seeing some issues where GitHub invalidated the tokens and Railway wasn't able to handle that. It might be another issue, but it's worth a try


faolain
PROOP

9 months ago

Ah, so I'm not using the GitHub App; it's a GitHub Action



faolain
PROOP

9 months ago

it calls the GraphQL endpoint via the API


9 months ago

but like, is your repository somehow attached to your Railway service?


faolain
PROOP

9 months ago

nope


faolain
PROOP

9 months ago

it just makes pure API requests to Railway via this GitHub Action


9 months ago

but how do you push your code to Railway's infra?


faolain
PROOP

9 months ago

I open a PR, which triggers this GitHub Action. The action duplicates the existing staging environment, sets the upstream branch to the PR branch, and redeploys all the services (the last step, which fails). Railway itself then pulls in the code from GitHub (since the branch is set to the PR one)
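The flow described above, reduced to a sketch with injected callables; every step name here is a hypothetical stand-in for one action in the workflow, not a real Neon or Railway API:

```python
def provision_preview(pr_branch, steps):
    """Orchestrates the ephemeral-preview flow described above.

    `steps` is a dict of caller-supplied callables; every key is a
    hypothetical name for one action in the GitHub workflow, not a
    real Neon or Railway API.
    """
    db_url = steps["create_neon_branch"](pr_branch)        # branch the external DB
    env = steps["duplicate_environment"]("staging")        # copy the staging env on Railway
    steps["set_variables"](env, {"DATABASE_URL": db_url})  # wire the new DB into the copy
    steps["set_upstream_branch"](env, pr_branch)           # point services at the PR branch
    steps["redeploy_services"](env)                        # Railway then pulls the PR code
    return env
```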


9 months ago

it pulls code from GitHub, so it's connected to GitHub somehow


faolain
PROOP

9 months ago

sure, I guess there's a Railway app, but my point is that it's not deploying via that integration


faolain
PROOP

9 months ago

I can try reinstalling the app I guess


9 months ago

yep, do that


faolain
PROOP

9 months ago

but the flakiness seems to be on the railway side when it comes to duplication or redeploying services


faolain
PROOP

9 months ago

and it's not deterministic; if the token were bad, it would just constantly fail on one step, right?


9 months ago

anyway cc @Brody
504 timeouts on environment create endpoint, also happening on the dashboard (https://discord.com/channels/713503345364697088/1381999325587968010)


9 months ago

yep environment creation is probably Railway at fault here


faolain
PROOP

9 months ago

and an internal server error on the serviceRedeploy endpoint ^


9 months ago

also, can I ask why you're not using Railway's auto PR feature?


faolain
PROOP

9 months ago

because we have our frontend on Vercel and our database on Neon


9 months ago

can I ask why you aren't using more up-to-date mutations 😆


faolain
PROOP

9 months ago

(among some other 3rd party services that railway doesn't support)


faolain
PROOP

9 months ago

never got around to it, Brody; if there were an official Railway GitHub Action that did that, though, we would happily transition to it 😉


9 months ago

oh so auto branch on PR and stuff?


faolain
PROOP

9 months ago

yeah. What our steps currently do is create a branch on Neon (along with other services), then create a branch off of staging on Railway too, and set the env vars on the Railway branch to the env vars of all the 3rd-party services that were created


9 months ago

so the user is at fault here, not a Railway issue?


faolain
PROOP

9 months ago

I uninstalled and reinstalled the GitHub app; no change in behavior, environmentCreate API requests are still timing out


9 months ago

pretty sure the environmentCreate issue will remain after the GitHub reinstall; maybe that fixed the "no commit found" issue


faolain
PROOP

9 months ago

sometimes the environment is created correctly despite the error

(screenshot attached)


faolain
PROOP

9 months ago

(I've tried it a few times while on this chat)


faolain
PROOP

9 months ago

other times, after timing out, it would fail and no env gets created (in both cases while still getting the environmentCreate error)


9 months ago

yep, if I'm correct, Railway implemented some kind of global timeout on all requests, and some of those slow mutations fail to respond within that time frame (even though they're still processing in the background).
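That hypothesis can be illustrated with a toy sketch (all names here are invented; this is not Railway's actual code): a caller-side timeout fires while the slow work keeps running in the background, so the environment can show up even though the request "failed".

```python
import threading
import time

def slow_mutation(result, done):
    # Stand-in for a slow server-side mutation (e.g. a big environment
    # copy) that keeps running after the client has already given up.
    time.sleep(0.2)
    result.append("created")
    done.set()

def call_with_timeout(timeout):
    """Illustrative sketch of the hypothesis above, not Railway's code:
    the caller enforces a global timeout, but the work still finishes in
    the background, so the environment can exist despite the error."""
    result, done = [], threading.Event()
    threading.Thread(target=slow_mutation, args=(result, done),
                     daemon=True).start()
    timed_out = not done.wait(timeout)  # what a 504 looks like to the client
    return timed_out, result, done
```

This matches the symptom reported earlier: sometimes the environment is created correctly despite the timeout error.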


faolain
PROOP

9 months ago

hmm is this recent?


faolain
PROOP

9 months ago

as in within the last day


9 months ago

what is missing from this? -


9 months ago

no we are


faolain
PROOP

9 months ago

so I just ran it for the 3rd time on one repo, and it got past environmentCreate


faolain
PROOP

9 months ago

But then hit this again

1382167402456027400


faolain
PROOP

9 months ago

oh hm is this new?


9 months ago

hmm, so "commit not found" still persists; well, I really do not know what could be causing this other than an invalid commit hash


9 months ago

a bit new


9 months ago

if you need any help on setting it up let me know


faolain
PROOP

9 months ago

I just tried swapping the custom GitHub Action for the Railway-provided CLI action and I'm still getting failures


faolain
PROOP

9 months ago

(screenshot attached)


faolain
PROOP

9 months ago

(screenshot attached)


faolain
PROOP

9 months ago

GitHub Action

      - name: Link CLI to project
        run: railway link --project "$LINK_PROJECT_ID" --environment "$DUPLICATE_FROM_ID"

      - name: Create Railway env (multi-vars)
        env:
          # kept on one line: a folded (>-) scalar would join these pieces
          # with spaces and produce a broken URL
          DB_URL: "postgresql+asyncpg://${{ secrets.NEON_USERNAME }}:${{ steps.create_branch.outputs.password }}@${{ steps.create_branch.outputs.db_host }}/espresso-staging"
        run: |
          railway environment new pr-${{ github.event.pull_request.number }} \
            --copy "${{ env.DUPLICATE_FROM_ID }}"                            \
            -v "${{ env.SERVICE_ID }}" "DATABASE_URL=${{ env.DB_URL }}"      \
            -v "${{ env.SERVICE_ID }}" "APP_ENVIRONMENT=pr-preview"          \
            -v "${{ env.SERVICE_ID }}" "LOG_DD_AGENT_HOST="

faolain
PROOP

9 months ago

and if I try manually going to https://backboard.railway.com/graphql/v2, which seems to be the endpoint that's timing out, I get


faolain
PROOP

9 months ago

(screenshot attached)


faolain
PROOP

9 months ago

Is the endpoint down?


faolain
PROOP

9 months ago

(screenshot attached)


faolain
PROOP

9 months ago

I've noticed that half the time the environment is created but this timeout still occurs


faolain
PROOP

9 months ago

(screenshot attached)


faolain
PROOP

9 months ago

the problem is that since this step fails, it's never able to go to the next step, which takes the domain and passes it to Vercel. It's been 12 days since this was reported; is it possible for this to be fixed?


9 months ago

pretty sure the CLI still uses the copy environment endpoint under the hood, so the issue still persists


9 months ago

that's normal, does not impact anything on the queries itself


9 months ago

we're waiting on the team for this; unfortunately there's nothing we can do until then, sorry about that.


faolain
PROOP

9 months ago

Appreciate it @ThallesComH, is there any idea of a timeline? As of right now we've been unable to have ephemeral environments, and if this continues for any more time we will have to consider migrating away from Railway, which we'd rather not do. We chose Railway for its ease of use and the ability to quickly spin up ephemeral environments in a customizable way. Are there any workarounds the team suggests in the interim?


faolain
PROOP

9 months ago

Bumping this 🙏 (if there are workarounds definitely open to it but we need our ephemeral PRs to work again)


9 months ago

I would recommend opening a private thread on https://station.railway.com/


9 months ago

there the team is more likely to solve your problem.


ryanlieu
PRO

9 months ago

I'm running into the same issue


ryanlieu
PRO

9 months ago

the deploy's going through but it's erroring out before it can pull from the correct git branch


ryanlieu
PRO

9 months ago

so it just defaults to main


9 months ago

Hello, we made some changes in this regard, could you confirm if you still see this issue?


9 months ago

cc @Faolain


faolain
PROOP

9 months ago

Hmm, I've had a private thread where I saw there were updates. I noticed that the deploys are working without timing out, but the environment copy was no longer working (whether using the Railway CLI or my old, unchanged GitHub Action)


faolain
PROOP

9 months ago

aka all the old environment variables were kept from the branch it was forked from (which in turn led to 4 different PRs pointing to our staging Redis, leading to….issues)

