Railway nightmare today
lbds137
PROOP

a month ago

My project industrious-analysis has been experiencing Redis timeouts ever since the earlier outage, and despite trying to redeploy all services after the outage was declared resolved, I am still experiencing the issue. I don't know what else to try at this point since I haven't changed anything on my end since last night.

Solved

8 Replies

aleks
HOBBY

a month ago

Does the Redis service show any logs?


lbds137
PROOP

a month ago

nothing that seems notable. I am not sure the request is even making it there


lbds137
PROOP

a month ago

I suspect a network issue


aleks
HOBBY

a month ago

Are you using the private service URL to connect to it?


lbds137
PROOP

a month ago

yes


aleks
HOBBY

a month ago

What exactly are the error messages you're getting?


lbds137
PROOP

a month ago

timeouts


aleks
HOBBY

a month ago

Could you send the exact error message?


lbds137
PROOP

a month ago

well that was weird. it's working now despite me not doing anything??


lbds137
PROOP

a month ago

hmm it is likely to time out again though


lbds137
PROOP

a month ago

it worked once in prod and now it's taking forever in dev


lbds137
PROOP

a month ago

BTW the outage broke both my environments


lbds137
PROOP

a month ago

@name:"api-gateway" AND @err.message:"failed with status code 500" AND @err.stack:"Error: failed with status code 500\n    at onResFinished (/app/node_modules/.pnpm/pino-http@11.0.0/node_modules/pino-http/logger.js:115:39)\n    at ServerResponse.onResponseComplete (/app/node_modules/.pnpm/pino-http@11.0.0/node_modules/pino-http/logger.js:178:14)\n    at ServerResponse.emit (node:events:531:35)\n    at onFinish (node:_http_outgoing:1082:10)\n    at callback (node:internal/streams/writable:766:21)\n    at afterWrite (node:internal/streams/writable:710:5)\n    at afterWriteTick (node:internal/streams/writable:696:10)\n    at process.processTicksAndRejections (node:internal/process/task_queues:89:21)" AND @err.type:"Error" 

lbds137
PROOP

a month ago

@req.id:3 AND @req.method:"POST" AND @req.remoteAddress:"fd12:59bf:c23a:0:a000:75:159b:3779" AND @req.remotePort:38392 AND @req.url:"/ai/generate?wait=true" AND @req.query.wait:"true" AND @req.headers.accept:"*/*" AND @req.headers.accept-encoding:"gzip, deflate" AND @req.headers.accept-language:"*" AND @req.headers.connection:"keep-alive" AND @req.headers.content-length:"56045" AND @req.headers.content-type:"application/json" AND @req.headers.host:"api-gateway.railway.internal:3000" AND @req.headers.sec-fetch-mode:"cors" AND @req.headers.user-agent:"node" 

lbds137
PROOP

a month ago

I can't find redis failures anymore after all the redeployments


lbds137
PROOP

a month ago

I hate doing this from my phone


lbds137
PROOP

a month ago

I haven't been home all day


lbds137
PROOP

a month ago

lol only one message made it through. we're back to timeouts


lbds137
PROOP

a month ago

[attachment]


lbds137
PROOP

a month ago

it's still broken. ugh


lbds137
PROOP

a month ago

I was hoping it would be a transient thing that would self heal but nope it's been hoooours


lbds137
PROOP

a month ago

stopped working a bit after lunchtime


lbds137
PROOP

a month ago

I'll work on adding better logging in the meantime


lbds137
PROOP

a month ago

Railway nightmare today


lbds137
PROOP

a month ago

grrr now my production Postgres is hanging and the redeploy is too
today has been a bad day
it finally deployed but I still can't connect to it. wtf
doesn't even work on the website


passos
MODERATOR

a month ago

about the Postgres service, do you see any kind of error logs?


lbds137
PROOP

a month ago

I redeployed it like twice. it says active but nothing shows up here

[attachment]


lbds137
PROOP

a month ago

can't even connect via the website

[attachment]


passos
MODERATOR

a month ago

Can you remove your current deployment and then try a fresh one by pressing CTRL + K and choosing to deploy the latest commit?


noahd
EMPLOYEE

a month ago

Can I get a link to that service?



lbds137
PROOP

a month ago

it doesn't give me that option but I did deploy source image or whatever


noahd
EMPLOYEE

a month ago

That option doesn't exist on a Postgres image. It's alright


noahd
EMPLOYEE

a month ago

waiting for it to come online and looking at it


passos
MODERATOR

a month ago

That's indeed odd. I'll wait for the team's response on that; sorry for the trouble.


lbds137
PROOP

a month ago

it's taking its sweet time


lbds137
PROOP

a month ago

this is like, my third attempt at redeploying it


noahd
EMPLOYEE

a month ago

The "creating containers" step can sometimes take a while. We are working on a project that will fix that and a few other issues to make things much faster


lbds137
PROOP

a month ago

I'm really frustrated that both my dev and prod environments stopped working after lunch today and have been hosed for hours


lbds137
PROOP

a month ago

even now the dev bot isn't working


lbds137
PROOP

a month ago

the incident earlier broke my app


noahd
EMPLOYEE

a month ago

The degraded performance in US-East?


lbds137
PROOP

a month ago

yes. everything was fine until that incident


lbds137
PROOP

a month ago

afterwards? it has been broken all day


lbds137
PROOP

a month ago

tbh it might also be partly a Qdrant issue. my app logs show it hanging a lot. or it's a networking thing idk. AWS has been unreliable lately


noahd
EMPLOYEE

a month ago

Is your Qdrant hosted on AWS?


noahd
EMPLOYEE

a month ago

I don't see qdrant on this project


lbds137
PROOP

a month ago

I hosted it externally because I didn't know any better. does Railway offer hosting for it?


lbds137
PROOP

a month ago

it's on qdrant.io


lbds137
PROOP

a month ago

and they use us-east-1 for my cluster


noahd
EMPLOYEE

a month ago

Railway does! I host my qdrant instances on Railway always


lbds137
PROOP

a month ago

hmm then it would be good to just migrate


passos
MODERATOR

a month ago

Direct link for it:



noahd
EMPLOYEE

a month ago

beat me to it


lbds137
PROOP

a month ago

it shows as active but the db connection still isn't working


noahd
EMPLOYEE

a month ago

Interesting… Let me check something internally


lbds137
PROOP

a month ago

lol it's still unable to be connected to. I'm working on setting up Qdrant on Railway and I hope it actually works


noahd
EMPLOYEE

a month ago

Need to bring this up to the right team. You should be good to connect with an external DB viewer like TablePlus


noahd
EMPLOYEE

a month ago

going to get this fixed though


lbds137
PROOP

a month ago

it's not working


lbds137
PROOP

a month ago

I use Jetbrains and their DB connectivity stuff


lbds137
PROOP

a month ago

I can connect to my DEV postgres but not PROD


noahd
EMPLOYEE

a month ago

Fixed (Paulo the goat)
Reload your web page and it should be good!


lbds137
PROOP

a month ago

the web page works but the external connection is still not working


lbds137
PROOP

a month ago

tbh I don't like using the web page because it's so limited


lbds137
PROOP

a month ago

hard to do any serious database management


lbds137
PROOP

a month ago

it's handy when I'm on my phone but that's about it


brody
EMPLOYEE

a month ago

External connections to Postgres are working.


lbds137
PROOP

a month ago

my dev one works fine, but prod doesn't


lbds137
PROOP

a month ago

I double checked the password and port on the proxy and everything. it was working before today's mess


brody
EMPLOYEE

a month ago

Prod works fine with regard to both private and public connections; we have verified that.


lbds137
PROOP

a month ago

[two attachments]


lbds137
PROOP

a month ago

as I said, one works, the other fails


passos
MODERATOR

a month ago

Make sure that your username credential is correct, from what I remember Railway uses the railway username.


lbds137
PROOP

a month ago

they use postgres for the username actually


lbds137
PROOP

a month ago

I double checked all the credentials. which didn't change since yesterday, when everything was fine


lbds137
PROOP

a month ago

my suspicion is that some networking stuff broke badly during the incident earlier


lbds137
PROOP

a month ago

as a separate example, for some reason I can't write to Qdrant.io but I can read. bizarre


brody
EMPLOYEE

a month ago

Any connection issues would be on your end. We have verified that the database is accessible via both private and public connections.


lbds137
PROOP

a month ago

the proxy domain might be broken idk


lbds137
PROOP

a month ago

I guess I can try generating a new one


brody
EMPLOYEE

a month ago

There could be a firewall on your end blocking the current TCP port that is in use. Generating a new TCP proxy will get you a different port.


lbds137
PROOP

a month ago

tried a new domain and same issue


lbds137
PROOP

a month ago

again, this was working before the incident. I had zero problems. so it's not my firewall


brody
EMPLOYEE

a month ago

Again, we have verified that the database is accessible via both private and public connections.


lbds137
PROOP

a month ago

if it was my firewall I would have trouble accessing both dev and prod. and since nothing changed in my firewall between yesterday and now, it is not the issue


lbds137
PROOP

a month ago

basic process of elimination


brody
EMPLOYEE

a month ago

I'm sorry, but I don't know what to tell you at this point. The database is accessible to the public internet without issue.


lbds137
PROOP

a month ago

how exactly are you verifying that?


lbds137
PROOP

a month ago

I am using the TCP proxy feature


brody
EMPLOYEE

a month ago

We have internal tools that I couldn't disclose, but I can show you Telnet being able to communicate.

[attachment]


passos
MODERATOR

a month ago

Is that another database? As it differs from her screenshot (the proxy domain)


lbds137
PROOP

a month ago

telnet doesn't tell me anything really. a real test would be to connect from outside using the same approach (via the postgres connection)
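
For what it's worth, a protocol-level check needs nothing beyond the standard library. The sketch below is Python, and the function name is made up for illustration. It opens a TCP connection and sends the Postgres SSLRequest packet; a real Postgres server answers with a single byte, `S` or `N`, so a reply shows that the database itself responded and not just that a port is open. It does not test authentication.

```python
import socket
import struct

# SSLRequest: int32 length (8) followed by int32 request code 80877103,
# as defined by the PostgreSQL frontend/backend protocol.
SSL_REQUEST = struct.pack("!II", 8, 80877103)


def probe_postgres(host: str, port: int, timeout: float = 5.0) -> str:
    """Return a short description of how far a connection attempt got."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(SSL_REQUEST)
            reply = sock.recv(1)
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, resets, and similar
        return f"network error: {exc}"
    if reply == b"S":
        return "postgres answered (TLS supported)"
    if reply == b"N":
        return "postgres answered (TLS not supported)"
    if reply == b"":
        return "connection closed without a reply"
    return f"unexpected reply: {reply!r}"
```

Calling `probe_postgres(host, port)` with the TCP proxy host and port from the service settings would separate "nothing is listening" from "Postgres is there but the login fails".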


lbds137
PROOP

a month ago

*her screenshot


lbds137
PROOP

a month ago

and I changed it already as a debugging step


passos
MODERATOR

a month ago

sorry!


lbds137
PROOP

a month ago

it's the correct proxy

[attachment]


brody
EMPLOYEE

a month ago

I am most certainly outside. I am not anywhere near a Railway data center.


passos
MODERATOR

a month ago

When using a database client, I get a "wrong password" error instead of a connection failure like you're experiencing.

[attachment]


lbds137
PROOP

a month ago

still fails on the new proxy

[attachment]


lbds137
PROOP

a month ago

I have pasted the password so many times


passos
MODERATOR

a month ago

Also, I can see that DBeaver is showing some information about the database (like versions), so I'm guessing the issue here is the credentials. I also believe that you're copying the right password.


lbds137
PROOP

a month ago

maybe I should regen the password?


passos
MODERATOR

a month ago

That would be the best option, but I'm unsure if changing the environment variable would make a difference. You would need to SSH into the container and change it manually.


brody
EMPLOYEE

a month ago

We have a way to regen the password in our UI now.


lbds137
PROOP

a month ago

it didn't work lol


lbds137
PROOP

a month ago

I tried regen and it didn't make a new password


lbds137
PROOP

a month ago

so not even the UI for the password regen is working


lbds137
PROOP

a month ago

it worked earlier when I generated a new dev password


passos
MODERATOR

a month ago

Do you get any kind of errors when trying it?


lbds137
PROOP

a month ago

nope, just doesn't actually update it


passos
MODERATOR

a month ago

Maybe the password regeneration feature uses the current credentials to reset it?


lbds137
PROOP

a month ago

oh actually it popped up an error that said failed to fetch


lbds137
PROOP

a month ago

this time it didn't say "failed to fetch"

[attachment]


lbds137
PROOP

a month ago

the issue with even regenerating the password may be a hint about what's happening under the hood


lbds137
PROOP

a month ago

I was an idiot and accidentally leaked my dev password on GitHub today and had no issue making a new password for my dev DB via the same UI


shxkm
PRO

a month ago

Same happening to me. Website completely down with Redis errors.


passos
MODERATOR

a month ago

Hey, would it be possible to open a help thread about your problem? It helps us organize threads and get you a faster response.


lbds137
PROOP

a month ago

I tried again and same issue


shxkm
PRO

a month ago

I did. I replied here so it doesn’t get treated as a misconfiguration or isolated incident.


brody
EMPLOYEE

a month ago

shxkm, please open your own thread.


lbds137
PROOP

a month ago

I'm trying to use this in my development environment and I either get refused connections or timeouts (the former for the internal networking, the latter for the public URL). not very promising, unless I'm doing something dumb


lbds137
PROOP

a month ago

trying the TCP proxy now I guess


lbds137
PROOP

a month ago

proxy works, but it's dumb that I have to use that for internal services talking to each other


shxkm
PRO

a month ago

As I said clearly, I DID open my own thread. But Railway tends to not acknowledge its own bugs and issues so I replied here as a “me too”.

By the way, the thread you told me to open has ZERO replies from Railway employees. My production app has been down for more than 12 hours. Friday is the busiest day for my app. I wish I didn’t move here from Heroku.

https://station.railway.com/questions/redis-ttimeouts-all-over-site-not-respo-e871fa03


lbds137
PROOP

a month ago

I'm lucky that my app is just a project for myself and a few friends, but unfortunately for me I rely on it heavily and it really ruined my day to have it down for that many hours.


shxkm
PRO

a month ago

I have hundreds of customers. Some of them will be issuing refund requests because of this.

I hope you learned your lesson because I surely learned mine.


lbds137
PROOP

a month ago

Unfortunately I don't have a ton of energy to do lots of manual infrastructure provisioning myself, which is why I use Railway. This is the first major problem I've had since starting to use the service a few months ago, but admittedly my application got more complex recently, with multiple interconnected services. It's frustrating because most of it works, but because my app is for an AI use case, RAG is very important to me, and my connection to an external Qdrant provider has been failing repeatedly. I was advised that Railway now offers Qdrant, so I've been trying to sync my stuff to stay in the ecosystem, but even that has been failing badly due to timeouts. I'm really not impressed and hope that this gets resolved soon so I can go back to actually using my app rather than losing sleep over it. I stayed up till like 3:30am and I'm kinda screwed for today because I have an AWS certification exam that I'll be taking on very little sleep.


lbds137
PROOP

a month ago

I'll have to check the status of my app when I have a chance but it would be helpful to know if anyone is looking into my issues or if I have to pull teeth and do most of the debugging myself


lbds137
PROOP

a month ago

hmm, I was able to regen the password and can connect locally again. thanks to whoever fixed it


lbds137
PROOP

a month ago

having issues with creating collections on Railway Qdrant now


lbds137
PROOP

a month ago

ugh


lbds137
PROOP

a month ago

if it's not one thing it's another


lbds137
PROOP

a month ago

nvm it's listing collections that keeps timing out


lbds137
PROOP

a month ago

either way


lbds137
PROOP

a month ago

and deleting apparently


lbds137
PROOP

a month ago

I'm gonna just give up on Qdrant tbh. too much hassle


lbds137
PROOP

a month ago

at least my Postgres is working. I can switch to pgvector and call it a day


lbds137
PROOP

a month ago

ok new issue - how do I rotate credentials with pgvector?


lbds137
PROOP

a month ago

that particular image doesn't give me access to the database tab like a regular Postgres instance


passos
MODERATOR

a month ago

that's expected as it isn't a database template made by Railway


lbds137
PROOP

a month ago

how do I do it then?

[attachment]


lbds137
PROOP

a month ago

that popup is wrong because there is no such tab


passos
MODERATOR

a month ago

I can't confirm it right now but can't you install the pgvector extensions directly onto the official template?


lbds137
PROOP

a month ago

I tried


lbds137
PROOP

a month ago

it's not a thing


lbds137
PROOP

a month ago

I wouldn't have made a new db if I could have just installed that


passos
MODERATOR

a month ago

you would need to do it via SQL then, that modal is detecting it as an official template when it's not
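
For anyone landing here later, doing it via SQL comes down to an `ALTER ROLE` statement. Below is a minimal Python sketch that generates a random password and prints the statement to run in psql or another DB client. The role name `postgres` is the one mentioned earlier in this thread, and the helper names are made up for illustration. Any connection string or service variable that stores the old password would presumably need updating as well.

```python
import secrets


def build_rotation_sql(role: str, password: str) -> str:
    """Build an ALTER ROLE statement with the identifier and literal quoted."""
    # Double any embedded quotes so the role and password cannot break out
    quoted_role = '"' + role.replace('"', '""') + '"'
    quoted_password = "'" + password.replace("'", "''") + "'"
    return f"ALTER ROLE {quoted_role} WITH PASSWORD {quoted_password};"


def new_password(num_bytes: int = 24) -> str:
    # token_urlsafe only emits letters, digits, '-' and '_'
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(build_rotation_sql("postgres", new_password()))
```

If psql is available, its `\password` command is an alternative that avoids putting the plaintext password in the statement.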


lbds137
PROOP

a month ago

yeah that's what I ended up doing


lbds137
PROOP

a month ago

this is a usability problem though


lbds137
PROOP

a month ago

that modal is not applicable to this template


lbds137
PROOP

a month ago

don't you love it when you're vibe coding and the AI leaks your password 🙃 happened twice now


passos
MODERATOR

a month ago

a #🤗|feedback thread about it is more than welcome :)


passos
MODERATOR

a month ago

can't you tell your vibe coding tool to ignore .env files?


lbds137
PROOP

a month ago

.env is already ignored


lbds137
PROOP

a month ago

it committed it to a todo list file 🤦‍♀️
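
Ignoring .env doesn't catch this kind of leak, because the secret ended up in an ordinary file. Here is a rough Python sketch of a scan that could run before committing: it flags connection URLs with an embedded password in any text and redacts them in the report. The regex and function name are illustrative assumptions, not an existing tool, and it will miss secrets that aren't in URL form.

```python
import re

# Matches URLs of the form scheme://user:password@host
CREDENTIAL_URL = re.compile(
    r"\b(?:postgres(?:ql)?|redis|rediss|mysql|mongodb(?:\+srv)?)://"
    r"[^\s:/@]*:(?P<password>[^\s@]+)@",
    re.IGNORECASE,
)


def _redact(match: re.Match) -> str:
    # Keep everything up to the password, then mask it
    prefix_len = match.start("password") - match.start()
    return match.group(0)[:prefix_len] + "***@"


def find_credential_urls(text: str) -> list[tuple[int, str]]:
    """Return (line number, redacted line) for each line with an embedded password."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if CREDENTIAL_URL.search(line):
            hits.append((lineno, CREDENTIAL_URL.sub(_redact, line)))
    return hits
```

Running it over the staged files and refusing the commit when the list is non-empty would be one way to wire it in.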


lbds137
PROOP

a month ago

the credentials are rotated now. but man that was annoying


lbds137
PROOP

a month ago

done: #pgvector template doesn't allow UI-based password rotation


lbds137
PROOP

a month ago

alright I'm good to close this thread


lbds137
PROOP

a month ago

got my stuff fixed after a very stressful couple of days


passos
MODERATOR

a month ago

!s


Status changed to Solved passos about 1 month ago

