Railway nightmare today

4 months ago

My project industrious-analysis has been experiencing Redis timeouts ever since the earlier outage, and despite trying to redeploy all services after the outage was declared resolved, I am still experiencing the issue. I don't know what else to try at this point since I haven't changed anything on my end since last night.

Solved

162 Replies

4 months ago

Does the Redis service show any logs?


4 months ago

nothing that seems notable. I am not sure the request is even making it there


4 months ago

I suspect a network issue


4 months ago

Are you using the private service URL to connect to it?


4 months ago

yes


4 months ago

What exactly are the error messages you're getting?


4 months ago

timeouts


4 months ago

Could you send the exact error message?


4 months ago

well that was weird. it's working now despite me not doing anything??


4 months ago

hmm it is likely to time out again though


4 months ago

it worked once in prod and now it's taking forever in dev


4 months ago

BTW the outage broke both my environments


4 months ago

@name:"api-gateway" AND @err.message:"failed with status code 500" AND @err.stack:"Error: failed with status code 500\n    at onResFinished (/app/node_modules/.pnpm/pino-http@11.0.0/node_modules/pino-http/logger.js:115:39)\n    at ServerResponse.onResponseComplete (/app/node_modules/.pnpm/pino-http@11.0.0/node_modules/pino-http/logger.js:178:14)\n    at ServerResponse.emit (node:events:531:35)\n    at onFinish (node:_http_outgoing:1082:10)\n    at callback (node:internal/streams/writable:766:21)\n    at afterWrite (node:internal/streams/writable:710:5)\n    at afterWriteTick (node:internal/streams/writable:696:10)\n    at process.processTicksAndRejections (node:internal/process/task_queues:89:21)" AND @err.type:"Error" 

4 months ago

@req.id:3 AND @req.method:"POST" AND @req.remoteAddress:"fd12:59bf:c23a:0:a000:75:159b:3779" AND @req.remotePort:38392 AND @req.url:"/ai/generate?wait=true" AND @req.query.wait:"true" AND @req.headers.accept:"*/*" AND @req.headers.accept-encoding:"gzip, deflate" AND @req.headers.accept-language:"*" AND @req.headers.connection:"keep-alive" AND @req.headers.content-length:"56045" AND @req.headers.content-type:"application/json" AND @req.headers.host:"api-gateway.railway.internal:3000" AND @req.headers.sec-fetch-mode:"cors" AND @req.headers.user-agent:"node" 

4 months ago

I can't find redis failures anymore after all the redeployments


4 months ago

I hate doing this from my phone


4 months ago

I haven't been home all day


4 months ago

lol only one message made it through. we're back to timeouts



4 months ago

it's still broken. ugh


4 months ago

I was hoping it would be a transient thing that would self heal but nope it's been hoooours


4 months ago

stopped working a bit after lunchtime


4 months ago

I'll work on adding better logging in the meantime
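
A minimal sketch of the kind of logging that helps here, assuming the goal is to see which dependency call hangs and for how long. The thread's app is a Node service; Python is used purely for illustration, and the `timed` helper and the "redis"/"GET" labels are hypothetical names, not anything from the project.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("dependency-timing")


@contextmanager
def timed(dependency: str, operation: str):
    """Log how long a call to a dependency took and whether it failed."""
    start = time.monotonic()
    try:
        yield
    except Exception as exc:
        elapsed_ms = (time.monotonic() - start) * 1000
        # Record the exception type so timeouts can be told apart from refusals.
        logger.error(
            "%s %s failed after %.0f ms: %s: %s",
            dependency, operation, elapsed_ms, type(exc).__name__, exc,
        )
        raise
    else:
        elapsed_ms = (time.monotonic() - start) * 1000
        logger.info("%s %s ok in %.0f ms", dependency, operation, elapsed_ms)
```

Wrapping each Redis, Postgres, or Qdrant call in `with timed("redis", "GET"): ...` would make it visible in the logs whether a request ever reached the dependency and how long it waited before failing.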


4 months ago

Railway nightmare today


4 months ago

grrr now my production Postgres is hanging and the redeploy is too
today has been a bad day
it finally deployed but I still can't connect to it. wtf
doesn't even work on the website


4 months ago

about the Postgres service, do you see any kind of error logs?


4 months ago

I redeployed it like twice. it says active but nothing shows up here



4 months ago

can't even connect via the website



4 months ago

Can you remove your current deployment and then try a fresh one by pressing CTRL + K and choosing "deploy latest commit"?


4 months ago

Can I get a link to that service?



4 months ago

it doesn't give me that option but I did deploy source image or whatever


4 months ago

That option doesn't exist on a Postgres image. It's alright


4 months ago

waiting for it to come online and looking at it


4 months ago

That's indeed odd. I'll wait for the team's response on that, sorry for any issue.


4 months ago

it's taking its sweet time


4 months ago

this is like, my third attempt at redeploying it


4 months ago

The creating containers step randomly might take some time. We are working on a project that will fix that and a few other issues to make things much faster


4 months ago

I'm really frustrated that both my dev and prod environments stopped working after lunch today and have been hosed for hours


4 months ago

even now the dev bot isn't working


4 months ago

the incident earlier broke my app


4 months ago

The degraded performance in US-East?


4 months ago

yes. everything was fine until that incident


4 months ago

afterwards? it has been broken all day


4 months ago

tbh it might also be partly a Qdrant issue. my app logs show it hanging a lot. or it's a networking thing idk. AWS has been unreliable lately


4 months ago

Is your Qdrant hosted on AWS?


4 months ago

I don't see qdrant on this project


4 months ago

I hosted it externally because I didn't know any better. does Railway offer hosting for it?


4 months ago

it's on qdrant.io


4 months ago

and they use us-east-1 for my cluster


4 months ago

Railway does! I host my qdrant instances on Railway always


4 months ago

hmm then it would be good to just migrate


4 months ago

Direct link for it:



4 months ago

beat me to it


4 months ago

it shows as active but the db connection still isn't working


4 months ago

Interesting… Let me check something internally


4 months ago

lol it's still unable to be connected to. I'm working on setting up Qdrant on Railway and I hope it actually works


4 months ago

Need to bring this up to the right team. You should be good to connect with an external DB viewer like tableplus


4 months ago

going to get this fixed though


4 months ago

it's not working


4 months ago

I use Jetbrains and their DB connectivity stuff


4 months ago

I can connect to my DEV postgres but not PROD


4 months ago

Fixed (Paulo the goat)
Reload your web page and it should be good!


4 months ago

the web page works but the external connection is still not working


4 months ago

tbh I don't like using the web page because it's so limited


4 months ago

hard to do any serious database management


4 months ago

it's handy when I'm on my phone but that's about it


4 months ago

External connections to Postgres are working.


4 months ago

my dev one works fine, but prod doesn't


4 months ago

I double checked the password and port on the proxy and everything. it was working before today's mess


4 months ago

Prod works fine with regard to both private and public connections; we have verified that.


4 months ago



4 months ago

as I said, one works, the other fails


4 months ago

Make sure that your username is correct. From what I remember, Railway uses the railway username.


4 months ago

they use postgres for the username actually


4 months ago

I double checked all the credentials. which didn't change since yesterday, when everything was fine


4 months ago

my suspicion is that some networking stuff broke badly during the incident earlier


4 months ago

as a separate example, for some reason I can't write to Qdrant.io but I can read. bizarre


4 months ago

Any connection issues would be on your end. We have verified that the database is accessible via both private and public connections.


4 months ago

the proxy domain might be broken idk


4 months ago

I guess I can try generating a new one


4 months ago

There could be a firewall on your end blocking the current TCP port that is in use. Generating a new TCP proxy will get you a different port.


4 months ago

tried a new domain and same issue


4 months ago

again, this was working before the incident. I had zero problems. so it's not my firewall


4 months ago

Again, we have verified that the database is accessible via both private and public connections.


4 months ago

if it was my firewall I would have trouble accessing both dev and prod. and since nothing changed in my firewall between yesterday and now, it is not the issue


4 months ago

basic process of elimination


4 months ago

I'm sorry, but I don't know what to tell you at this point. The database is accessible to the public internet without issue.


4 months ago

how exactly are you verifying that?


4 months ago

I am using the TCP proxy feature


4 months ago

We have internal tools that I couldn't disclose, but I can show you Telnet being able to communicate.



4 months ago

Is that another database? As it differs from her screenshot (the proxy domain)


4 months ago

telnet doesn't tell me anything really. a real test would be to connect from outside using the same approach (via the postgres connection)
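
To illustrate the gap between the two tests: a successful telnet only shows that something accepted the TCP connection. A slightly stronger check, sketched below with the standard library only, sends the PostgreSQL `SSLRequest` startup message and looks for the one-byte `S` or `N` reply that a Postgres server gives. The function name is hypothetical, and this still does not test authentication; a real client login is the only full test.

```python
import socket
import struct

# Request code defined by the PostgreSQL wire protocol for SSLRequest.
SSL_REQUEST_CODE = 80877103


def speaks_postgres(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers a PostgreSQL SSLRequest."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # SSLRequest is two big-endian Int32s: total length (8) and the code.
            sock.sendall(struct.pack("!ii", 8, SSL_REQUEST_CODE))
            reply = sock.recv(1)
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False
    # 'S' means the server will negotiate TLS, 'N' means it will not.
    return reply in (b"S", b"N")
```

Running this from the same machine and network as the failing database client, against the proxy host and port, would show whether a Postgres server is actually answering at the far end of the proxy.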


4 months ago

*her screenshot


4 months ago

and I changed it already as a debugging step


4 months ago

sorry!


4 months ago

it's the correct proxy



4 months ago

I am most certainly outside. I am not anywhere near a Railway data center.


4 months ago

When using a database client, I get a "wrong password" error instead of a connection failure like you're experiencing.



4 months ago

still fails on the new proxy



4 months ago

I have pasted the password so many times


4 months ago

Also, I can see that DBeaver is showing some information about the database (like versions), so I'm guessing the issue here is the credentials. I also believe that you're copying the right password.


4 months ago

maybe I should regen the password?


4 months ago

That would be the best option, but I'm unsure if changing the environment variable would make a difference. You would need to SSH into the container and change it manually.


4 months ago

We have a way to regen the password in our UI now.


4 months ago

it didn't work lol


4 months ago

I tried regen and it didn't make a new password


4 months ago

so not even the UI for the password regen is working


4 months ago

it worked earlier when I generated a new dev password


4 months ago

Do you get any kind of errors when trying it?


4 months ago

nope, just doesn't actually update it


4 months ago

Maybe the password regeneration feature uses the current credentials to reset it?


4 months ago

oh actually it popped up an error that said failed to fetch


4 months ago

this time it didn't say "failed to fetch"



4 months ago

the issue with even regenerating the password may be a hint about what's happening under the hood


4 months ago

I was an idiot and accidentally leaked my dev password on Github today and had no issue making a new password for my dev DB via the same UI


shxkm
PRO

4 months ago

Same happening to me. Website completely down with Redis errors.



4 months ago

Hey, would it be possible to open a help thread about your problem? It helps us organize threads and get you a faster response.


4 months ago

I tried again and same issue



shxkm
PRO

4 months ago

I did. I replied here so it doesn’t get treated as a misconfiguration or isolated incident.


4 months ago

shxkm, please open your own thread.


4 months ago

I'm trying to use this in my development environment and I either get refused connections or timeouts (the former for the internal networking, the latter for the public URL). not very promising, unless I'm doing something dumb
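
The two failure modes mentioned above usually point in different directions: a refused connection generally means the host was reached but nothing is listening on that port, while a timeout generally means packets are being dropped or never routed. A small sketch for telling them apart, using only the standard library (the function name and return labels are hypothetical):

```python
import socket


def classify_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and describe how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.gaierror:
        # The hostname did not resolve at all.
        return "dns-failure"
    except ConnectionRefusedError:
        # Host reachable, but no listener on this port (or an active reject).
        return "refused"
    except socket.timeout:
        # No answer within the timeout: dropped packets, filtering, or no route.
        return "timeout"
    except OSError as exc:
        return f"other: {exc}"
```

Calling this once with the internal hostname and port and once with the public one, from inside the service that is failing, would confirm which of the two cases each address falls into. A "refused" on the internal address is often worth checking against the port and bind address the target service actually listens on.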


4 months ago

trying the TCP proxy now I guess


4 months ago

proxy works, but it's dumb that I have to use that for internal services talking to each other



shxkm
PRO

4 months ago

As I said clearly, I DID open my own thread. But Railway tends to not acknowledge its own bugs and issues so I replied here as a “me too”.

By the way, the thread you told me to open has ZERO replies from Railway employees. My production app has been down for more than 12 hours. Friday is the busiest day for my app. I wish I didn’t move here from Heroku.

https://station.railway.com/questions/redis-ttimeouts-all-over-site-not-respo-e871fa03



4 months ago

I'm lucky that my app is just a project for myself and a few friends, but unfortunately for me I rely on it heavily and it really ruined my day to have it down for that many hours.



shxkm
PRO

4 months ago

I have hundreds of customers. Some of them will be issuing refund requests because of this.

I hope you learned your lesson because I surely learned mine.



4 months ago

Unfortunately I don't have a ton of energy to do lots of manual infrastructure provisioning myself, which is why I use Railway. This is the first major problem I've had since starting to use the service a few months ago, but admittedly my application got more complex recently, with multiple interconnected services. It's frustrating because most of it works, but because my app is for an AI use case, RAG is very important to me, and my connection to an external Qdrant provider has been failing repeatedly. I was advised that Railway now offers Qdrant, so I've been trying to sync my stuff to stay in the ecosystem, but even that has been failing badly due to timeouts. I'm really not impressed and hope that this gets resolved soon so I can go back to actually using my app rather than losing sleep over it. I stayed up till like 3:30am and I'm kinda screwed for today because I have an AWS certification exam that I'll be taking on very little sleep.


4 months ago

I'll have to check the status of my app when I have a chance but it would be helpful to know if anyone is looking into my issues or if I have to pull teeth and do most of the debugging myself


4 months ago

hmm, I was able to regen the password and can connect locally again. thanks to whoever fixed it


4 months ago

having issues with creating collections on Railway Qdrant now


4 months ago

ugh


4 months ago

if it's not one thing it's another


4 months ago

nvm it's listing collections that keeps timing out


4 months ago

either way


4 months ago

and deleting apparently


4 months ago

I'm gonna just give up on Qdrant tbh. too much hassle


4 months ago

at least my Postgres is working. I can switch to pgvector and call it a day


4 months ago

ok new issue - how do I rotate credentials with pgvector?


4 months ago

that particular image doesn't give me access to the database tab like a regular Postgres instance


4 months ago

that's expected as it isn't a database template made by Railway


4 months ago

how do I do it then?



4 months ago

that popup is wrong because there is no such tab


4 months ago

I can't confirm it right now but can't you install the pgvector extensions directly onto the official template?


4 months ago

I tried


4 months ago

it's not a thing


4 months ago

I wouldn't have made a new db if I could have just installed that


4 months ago

you would need to do it via SQL then, that modal is detecting it as an official template when it's not


4 months ago

yeah that's what I ended up doing
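
For anyone landing here with the same question, rotating the password via SQL comes down to an `ALTER ROLE ... WITH PASSWORD` statement. The sketch below only generates a random password and builds that statement; the helper name is hypothetical, and it assumes the role is `postgres`, as mentioned earlier in the thread, and that the server uses the default `standard_conforming_strings` setting.

```python
import secrets


def build_rotate_statement(role: str, new_password: str) -> str:
    """Build an ALTER ROLE statement with the identifier and literal quoted."""
    # Identifiers are wrapped in double quotes, with embedded quotes doubled.
    quoted_role = '"' + role.replace('"', '""') + '"'
    # String literals are wrapped in single quotes, with embedded quotes doubled.
    quoted_password = "'" + new_password.replace("'", "''") + "'"
    return f"ALTER ROLE {quoted_role} WITH PASSWORD {quoted_password};"


# token_urlsafe yields only letters, digits, '-' and '_', so no escaping surprises.
new_password = secrets.token_urlsafe(32)
statement = build_rotate_statement("postgres", new_password)
```

The statement would then be run in any SQL client connected as a sufficiently privileged role. In an interactive `psql` session, `\password` is an alternative that avoids having the plaintext password in the statement. After the change, whichever environment variables the services read for the password or connection URL need to be updated by hand, since the database tab is not available for this image.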


4 months ago

this is a usability problem though


4 months ago

that modal is not applicable to this template


4 months ago

don't you love it when you're vibe coding and the AI leaks your password 🙃 happened twice now


4 months ago

a #🤗|feedback thread about it is more than welcome :)


4 months ago

can't you tell your vibe coding tool to ignore .env files?


4 months ago

.env is already ignored


4 months ago

it committed it to a todo list file 🤦‍♀️
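
Since ignoring `.env` does not help when a connection string is pasted into some other file, a content-based check is one way to catch this before a commit. Below is a rough sketch that flags URLs with an embedded `user:password@` part; the function name is hypothetical, the pattern is deliberately simple, and it is not a substitute for a dedicated secret scanner.

```python
import re

# scheme://user:password@ with a non-empty password; groups allow redaction.
CREDENTIAL_URL = re.compile(
    r"([a-z][a-z0-9+.-]*://[^\s:/@]+:)([^\s@/]+)(@)",
    re.IGNORECASE,
)


def find_leaked_credentials(text: str) -> list[tuple[int, str]]:
    """Return (line number, redacted line) for each line with a credential URL."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if CREDENTIAL_URL.search(line):
            # Replace only the password group so the report itself leaks nothing.
            redacted = CREDENTIAL_URL.sub(r"\1****\3", line)
            hits.append((lineno, redacted))
    return hits
```

Run over the staged files in a pre-commit hook, a non-empty result would be the signal to block the commit. A URL with only a port, such as `https://example.com:8080/path`, is not flagged because nothing after the colon is followed by `@`.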


4 months ago

the credentials are rotated now. but man that was annoying


4 months ago

done: #pgvector template doesn't allow UI-based password rotation


4 months ago

alright I'm good to close this thread


4 months ago

got my stuff fixed after a very stressful couple of days


4 months ago

!s


Status changed to Solved passos 4 months ago

