Service is down after restart attempt
Anonymous
PRO OP

a month ago

My Grist service is down after a scheduled restart. Neither restarting nor redeploying brings it back. This is affecting production users. Could you please check for a state issue on your (Railway) end?

Solved

20 Replies

Anonymous
PRO OP

a month ago

Could you quickly check whether this is the same issue I recently hit with my other service (a MinIO bucket)?
https://discord.com/channels/713503345364697088/1459523759441449091/1460688951257076068


Are there any error logs?


Anonymous
PRO OP

a month ago

No, it went down abruptly, with no error logs.


Anonymous
PRO OP

a month ago

This happened with my other service a couple of weeks ago, and Brody from the Railway team had to update some status to fix it!


Anonymous
PRO OP

a month ago

I have shared the thread reference above


Anonymous
PRO OP

a month ago

I am seeing frequent out-of-memory notifications for the service, and I have also received emails about them.

Attachments


Anonymous
PRO OP

a month ago

But I have already set the memory limit to 32 GB, and average usage is around 4-10 GB.


Anonymous
PRO OP

a month ago

For reference

Attachments


Anonymous
PRO OP

a month ago

Trying to bump this thread!


Anonymous
PRO OP

a month ago

This only happens when the current volume is mounted. If I remove the volume mount or mount a fresh volume, the service works.


Anonymous
PRO OP

a month ago

My document and metadata files live on that volume, so I can't consider the service alive or operational unless it works with the volume mounted!


a month ago

Hey there, just for testing purposes, can you try adding this env var: NODE_OPTIONS=--max-old-space-size=28672
This would allow the Grist Node.js service to use up to 28 GB of RAM, since its usage seems to spike to a maximum of 16 GB.

If you can also share the logs from the crashed deployments, it would help us debug this further!
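The --max-old-space-size value is given in megabytes (MiB), so 28672 is 28 × 1024, which leaves roughly 4 GB of the 32 GB container limit for memory outside the V8 heap. A minimal sketch for deriving the value from a container limit; the function name node_heap_flag and the 4 GB headroom default are assumptions chosen to reproduce the suggested number, not a Railway or Grist recommendation:

```python
def node_heap_flag(container_limit_gib: int, headroom_gib: int = 4) -> str:
    """Build a --max-old-space-size flag that keeps V8's heap below the container limit.

    headroom_gib is an assumed margin for memory outside the V8 heap
    (buffers, native allocations); 4 reproduces 28672 for a 32 GB limit.
    """
    if headroom_gib >= container_limit_gib:
        raise ValueError("headroom must be smaller than the container limit")
    # The flag expects megabytes (MiB), so convert GiB -> MiB.
    heap_mib = (container_limit_gib - headroom_gib) * 1024
    return f"--max-old-space-size={heap_mib}"


print(f"NODE_OPTIONS={node_heap_flag(32)}")  # prints NODE_OPTIONS=--max-old-space-size=28672
```

The printed line is what would go into the service's environment variables; lowering headroom_gib raises the heap cap but leaves less room before the container itself hits its memory limit.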


Anonymous
PRO OP

a month ago

I don't have any build logs (it is a public container), and no deploy logs either.

It must be something related to the disk. Can you check with Brody what the state issue was with the disk/service for the MinIO bucket service in the same project?

Also, don't you have visibility into my service logs/history/deployments?
https://railway.com/project/d8a9cdda-ca19-44a5-814f-1ecaff088212/service/a72e891f-cb1e-460a-8e37-b8a31006f225?environmentId=82212507-616a-4d85-a946-87879ae84c13


Railway staff do; I believe moderators do not.


What do you mean by public container?

There’s not much that anyone besides maybe railway staff can help with if there aren’t any logs to reference.

Is there anything in any log that indicates why your service isn't working with this existing volume? Does the volume contain malformed data?


Anonymous
PRO OP

a month ago

I meant a public image, with no build step involved.
- I imported the same document into a different Grist service with a different volume mounted, and it works absolutely fine there!
- I unmounted the volume and ran the service, and it comes up without an issue!
- I mounted the same volume on a new Grist service, and it hangs at the same place and fails with an "Out of memory" error.


Anonymous
PRO OP

a month ago

I also tried mounting the old (non-working) volume on a simple FileBrowser () service, but that also hangs at the same "Creating containers" step.


Anonymous
PRO OP

a month ago

Hey Railway Team,
I suspect something is wrong with the disk mounting logic (which I assume also runs during a restart/redeploy).
Can you please check and let me know?


Anonymous
PRO OP

a month ago

I just created a volume backup and restored it to a new volume. That seems to have done the trick, but I'm not sure whether any data corruption should be expected here!
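Grist stores each document as a SQLite file with a .grist extension, so one way to look for structural corruption after such a restore is to run SQLite's PRAGMA integrity_check on every document on the restored volume. A minimal sketch, assuming Python 3.9+ is available in a container with the restored volume mounted and that documents live under /persist (assumed here as the data directory of Grist's Docker image; pass your actual mount path as an argument). The function name check_grist_docs is hypothetical. This only detects damage to the SQLite file structure, not rows that might have been silently lost.

```python
import sqlite3
import sys
from pathlib import Path


def check_grist_docs(root: str) -> dict[str, str]:
    """Run PRAGMA integrity_check on every .grist (SQLite) file under root.

    Returns a mapping of file path -> "ok", the first problem SQLite reports,
    or an "error: ..." string if the file cannot be read as a database.
    """
    results: dict[str, str] = {}
    for doc in sorted(Path(root).rglob("*.grist")):
        # Open read-only through a URI so the check never modifies the document.
        uri = doc.resolve().as_uri() + "?mode=ro"
        conn = None
        try:
            conn = sqlite3.connect(uri, uri=True)
            row = conn.execute("PRAGMA integrity_check").fetchone()
            results[str(doc)] = row[0]
        except sqlite3.DatabaseError as exc:
            results[str(doc)] = f"error: {exc}"
        finally:
            if conn is not None:
                conn.close()
    return results


if __name__ == "__main__":
    # Assumed default location; override with the real volume mount path.
    root = sys.argv[1] if len(sys.argv) > 1 else "/persist"
    for path, status in check_grist_docs(root).items():
        print(f"{path}: {status}")
```

Every document reporting "ok" suggests the SQLite files survived the backup/restore intact; comparing document counts and sizes with a known-good export would still be needed to rule out missing data.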


a month ago

I'm unfortunately not seeing much that indicates an issue on our end. I poked through the logs etc. and didn't see much.
If you run into this live again, can you please let us know? Super sorry you're running into it; we'd love to walk through it and make sure it's not a reproducible issue.


Status changed to Awaiting User Response Railway about 1 month ago


Railway
BOT

a month ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway about 1 month ago
