Intermittent failures on Railway API Endpoints
faolain
PROOP

9 months ago

We use Railway endpoints for our CI/CD and rely on them for our ephemeral deployments (since we use external DBs such as Neon), with a GitHub Action that sets the variables on the Railway deployment. We have changed nothing in a very long time, but the GitHub Action, which normally works well, is beginning to fail intermittently. See attached screenshot

97 Replies

faolain
PROOP

9 months ago

45c5d9aa-4168-473c-8491-c139f300457f


faolain
PROOP

9 months ago

On a different project it sometimes succeeded, but then it would fail on this step

(screenshot attached)


faolain
PROOP

9 months ago

Is the infrastructure currently flaky, or is the API changing in a rolling blue/green deployment? Without this, our preview environment infrastructure fails


faolain
PROOP

9 months ago

Maybe that trace ID can help


9 months ago

Hey, from what I see in your screenshot, the Railway API is saying that an environment by that name already exists. Maybe your CI is somehow running twice, or the environment actually exists?


faolain
PROOP

9 months ago

Sorry I attached the wrong screenshot <:Facepalm:1181339308687380531>


faolain
PROOP

9 months ago

(screenshot attached)


faolain
PROOP

9 months ago

this should have been the initial screenshot


9 months ago

oh seems like it's timing out


faolain
PROOP

9 months ago

yeah, this was intermittent; sometimes it would make it past this step and reach the final one here, and then that would fail too


9 months ago

(screenshot attached)


9 months ago

it seems that you're getting the same issue from the following thread:
https://discord.com/channels/713503345364697088/1381999325587968010


9 months ago

maybe your environment is so big that it times out on Railway's API



9 months ago

AI generated deployment…?


faolain
PROOP

9 months ago

this has always worked, fwiw; I even redeployed PRs that were working earlier today, but now they fail with the timeout


faolain
PROOP

9 months ago

it varies between the first one and the latter, but neither succeeds. All the GitHub Action deployments were working until now, so something must have changed on the Railway GraphQL API end


9 months ago

I'll tag the team here to take a look, but if you really want something working now, I would recommend sending the environment duplicate request and then polling the environments API to see if it was actually created, or maybe just implementing a sleep between actions.
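A minimal sketch of that workaround, assuming a caller-supplied `post` function; the GraphQL payload strings are placeholders, not Railway's documented schema:

```python
import time

def create_environment_with_retry(post, name, max_attempts=3, delay=10):
    """Retry-then-poll sketch of the workaround above.

    `post(payload)` is a caller-supplied function that sends one GraphQL
    request and returns the decoded response, raising TimeoutError on a
    504. The payload strings below are placeholders, not Railway's real
    schema.
    """
    create = f"mutation {{ environmentCreate(name: {name!r}) }}"  # placeholder
    lookup = "query { environments }"                             # placeholder
    for attempt in range(1, max_attempts + 1):
        try:
            return post(create)
        except TimeoutError:
            # The mutation may still have completed server-side despite
            # the 504, so poll before retrying instead of blindly
            # re-creating and hitting "already exists".
            existing = post(lookup)
            if name in existing:
                return existing
            time.sleep(delay * attempt)  # back off between attempts
    raise RuntimeError(f"environment {name!r} was not created "
                       f"after {max_attempts} attempts")
```

Polling before retrying also avoids the "environment already exists" failure mode from earlier in this thread.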


faolain
PROOP

9 months ago

yeah, seems similar-ish, but nothing has changed on our services' end in many months


faolain
PROOP

9 months ago

what about for the latter action?


faolain
PROOP

9 months ago

(since it's two endpoints which are flaky)


9 months ago

latter action?


faolain
PROOP

9 months ago

two endpoints are failing/being flaky


faolain
PROOP

9 months ago

this is the second one it fails on, if it makes it past the environmentCreate/duplication


9 months ago

that one is always failing?


faolain
PROOP

9 months ago

yep if it makes it past the first environmentCreate


faolain
PROOP

9 months ago

haven't gotten it to pass yet


faolain
PROOP

9 months ago

if it helps, I have seen this flaky behavior in the past, but after a few hours it would start working again; it would usually coincide with some reported issue on the Railway side


faolain
PROOP

9 months ago

unsure if it's a canary of sorts haha


9 months ago

I remember seeing some issues related to GitHub, let me make sure, one minute


9 months ago

Are you able to reinstall the GitHub app from Railway? I remember seeing some issues where GitHub invalidated the tokens and Railway wasn't able to handle that. It might be another issue, but it's worth a try


faolain
PROOP

9 months ago

Ah, so I'm not using the GitHub App; it's a GitHub Action



faolain
PROOP

9 months ago

it calls the GraphQL endpoint via the API


9 months ago

but like, is your repository somehow attached to your Railway service?


faolain
PROOP

9 months ago

nope


faolain
PROOP

9 months ago

it just makes pure API requests to Railway via this GitHub Action


9 months ago

but how do you push your code to Railway's infra?


faolain
PROOP

9 months ago

I open a PR, which triggers this GitHub Action. The action duplicates the existing staging environment, sets the upstream branch to the PR branch, and redeploys all the services (the last step, which fails). Railway itself then pulls in the code from GitHub (since the branch is set to the PR one)
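The flow described above, reduced to a sketch with injected callables; every step name here is a hypothetical stand-in for one action in the workflow, not a real Neon or Railway API:

```python
def provision_preview(pr_branch, steps):
    """Orchestrates the ephemeral-preview flow described above.

    `steps` is a dict of caller-supplied callables; every key is a
    hypothetical name for one action in the GitHub workflow, not a
    real Neon or Railway API.
    """
    db_url = steps["create_neon_branch"](pr_branch)        # branch the external DB
    env = steps["duplicate_environment"]("staging")        # copy the staging env on Railway
    steps["set_variables"](env, {"DATABASE_URL": db_url})  # wire the new DB into the copy
    steps["set_upstream_branch"](env, pr_branch)           # point services at the PR branch
    steps["redeploy_services"](env)                        # Railway then pulls the PR code
    return env
```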


9 months ago

it pulls code from GitHub, so it's connected to GitHub somehow


faolain
PROOP

9 months ago

sure, I guess there's a Railway app, but my point is that it's not deploying via that integration


faolain
PROOP

9 months ago

I can try reinstalling the app I guess


9 months ago

yep, do that


faolain
PROOP

9 months ago

but the flakiness seems to be on the railway side when it comes to duplication or redeploying services


faolain
PROOP

9 months ago

and it's not deterministic; if the token were bad, it would just constantly fail on one step, right?


9 months ago

anyway cc @Brody
504 timeouts on environment create endpoint, also happening on the dashboard (https://discord.com/channels/713503345364697088/1381999325587968010)


9 months ago

yep environment creation is probably Railway at fault here


faolain
PROOP

9 months ago

and an internal server error on the serviceRedeploy endpoint ^


9 months ago

also, can I ask why you're not using Railway's auto PR feature?


faolain
PROOP

9 months ago

because we have our frontend on Vercel and our database on Neon


9 months ago

can I ask why you aren't using more up-to-date mutations 😆


faolain
PROOP

9 months ago

(among some other 3rd party services that railway doesn't support)


faolain
PROOP

9 months ago

never got around to it, Brody; if there were an official Railway GitHub Action that did that, though, we would happily transition to it 😉


9 months ago

oh so auto branch on PR and stuff?


faolain
PROOP

9 months ago

yeah. What our steps currently do is create a branch on Neon (along with other services), then create a branch off of staging on Railway too, and set the env vars on the Railway branch to the env vars of all the 3rd-party services that were created


9 months ago

so the user is at fault here, not a Railway issue?


faolain
PROOP

9 months ago

I uninstalled and reinstalled the GitHub app; no change in behavior, environmentCreate API requests are still timing out


9 months ago

pretty sure the environmentCreate issue will remain after the GitHub reinstall; maybe that fixed the "no commit found" issue


faolain
PROOP

9 months ago

sometimes the environment is created correctly despite the error

(screenshot attached)


faolain
PROOP

9 months ago

(I've tried it a few times while on this chat)


faolain
PROOP

9 months ago

other times, after timing out, it would fail and no env gets created (in both cases while still getting the environmentCreate error)


9 months ago

yep, if I'm correct, Railway implemented some kind of global timeout on all requests, and some of those slow mutations fail to respond within that time frame (even though they're still processing in the background).
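That hypothesis can be illustrated with a toy sketch (all names here are invented; this is not Railway's actual code): a caller-side timeout fires while the slow work keeps running in the background, so the environment can show up even though the request "failed".

```python
import threading
import time

def slow_mutation(result, done):
    # Stand-in for a slow server-side mutation (e.g. a big environment
    # copy) that keeps running after the client has already given up.
    time.sleep(0.2)
    result.append("created")
    done.set()

def call_with_timeout(timeout):
    """Illustrative sketch of the hypothesis above, not Railway's code:
    the caller enforces a global timeout, but the work still finishes in
    the background, so the environment can exist despite the error."""
    result, done = [], threading.Event()
    threading.Thread(target=slow_mutation, args=(result, done),
                     daemon=True).start()
    timed_out = not done.wait(timeout)  # what a 504 looks like to the client
    return timed_out, result, done
```

This matches the symptom reported earlier: sometimes the environment is created correctly despite the timeout error.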


faolain
PROOP

9 months ago

hmm is this recent?


faolain
PROOP

9 months ago

as in within the last day


9 months ago

what is missing from this? -


9 months ago

no we are


faolain
PROOP

9 months ago

so I just ran it for the 3rd time on one repo, and it got past environmentCreate


faolain
PROOP

9 months ago

But then hit this again

1382167402456027400


faolain
PROOP

9 months ago

oh hm is this new?


9 months ago

hmm, so "commit not found" still persists; well, I really do not know what could be causing this other than an invalid commit hash


9 months ago

a bit new


9 months ago

if you need any help on setting it up let me know


faolain
PROOP

9 months ago

I just tried swapping the custom GitHub Action for the Railway-provided CLI action and I'm still getting failures


faolain
PROOP

9 months ago

(screenshot attached)


faolain
PROOP

9 months ago

(screenshot attached)


faolain
PROOP

9 months ago

GitHub Action

      - name: Link CLI to project
        run: railway link --project "$LINK_PROJECT_ID" --environment "$DUPLICATE_FROM_ID"

      - name: Create Railway env (multi-vars)
        env:
          # kept on one line: a folded (>-) scalar would join these pieces
          # with spaces and produce a broken URL
          DB_URL: "postgresql+asyncpg://${{ secrets.NEON_USERNAME }}:${{ steps.create_branch.outputs.password }}@${{ steps.create_branch.outputs.db_host }}/espresso-staging"
        run: |
          railway environment new pr-${{ github.event.pull_request.number }} \
            --copy "${{ env.DUPLICATE_FROM_ID }}"                            \
            -v "${{ env.SERVICE_ID }}" "DATABASE_URL=${{ env.DB_URL }}"      \
            -v "${{ env.SERVICE_ID }}" "APP_ENVIRONMENT=pr-preview"          \
            -v "${{ env.SERVICE_ID }}" "LOG_DD_AGENT_HOST="

faolain
PROOP

9 months ago

and if I try manually going to https://backboard.railway.com/graphql/v2, which seems to be the endpoint that's timing out, I get


faolain
PROOP

9 months ago

(screenshot attached)


faolain
PROOP

9 months ago

Is the endpoint down?


faolain
PROOP

9 months ago

(screenshot attached)


faolain
PROOP

9 months ago

I've noticed that half the time the environment is created but this timeout still occurs


faolain
PROOP

9 months ago

(screenshot attached)


faolain
PROOP

9 months ago

the problem is that since this step fails, it's never able to go to the next step, which takes the domain and passes it to Vercel. It's been 12 days since this was reported; is it possible for this to be fixed?


9 months ago

pretty sure the CLI still uses the copy environment endpoint under the hood, so the issue still persists


9 months ago

that's normal, does not impact anything on the queries itself


9 months ago

we're waiting on the team for this; unfortunately there's nothing we can do until then, sorry about that.


faolain
PROOP

9 months ago

Appreciate it @ThallesComH, is there any idea of a timeline? As of right now we've been unable to have ephemeral environments, and if this continues for any more time we will have to consider migrating away from Railway, which we'd rather not do. We chose Railway for its ease of use and the ability to quickly spin up ephemeral environments in a customizable way. Are there any workarounds the team suggests in the interim?


faolain
PROOP

9 months ago

Bumping this 🙏 (if there are workarounds definitely open to it but we need our ephemeral PRs to work again)


9 months ago

I would recommend opening a private thread on https://station.railway.com/


9 months ago

there the team is more likely to solve your problem.


ryanlieu
PRO

9 months ago

I'm running into the same issue


ryanlieu
PRO

9 months ago

the deploy's going through but it's erroring out before it can pull from the correct git branch


ryanlieu
PRO

9 months ago

so it just defaults to main


9 months ago

Hello, we made some changes in this regard, could you confirm if you still see this issue?


9 months ago

cc @Faolain


faolain
PROOP

9 months ago

Hmm, I've had a private thread where I saw there were updates. I noticed that the deploys are working without timing out, but the environment copy was no longer working (whether using the Railway CLI or my old, unchanged GitHub Action)


faolain
PROOP

9 months ago

aka all the old environment variables were kept from the branch it was forked from (which in turn led to 4 different PRs pointing to our staging Redis, leading to….issues)

