Service is down after restart attempt

Anonymous

PROOP

5 months ago

My Grist service is down after a scheduled restart. Neither restart nor redeploy is working. This is affecting production users. Please check for a possible state issue on your (Railway) end.

Solved

20 Replies

Anonymous

PROOP

5 months ago

Can you please quickly check whether this is similar to the issue I recently faced with my other service (a MinIO bucket)?

https://discord.com/channels/713503345364697088/1459523759441449091/1460688951257076068

0x5b62656e5d

MODERATOR

5 months ago

Are there any error logs?

Anonymous

PROOP

5 months ago

No, it went down abruptly, with no error log.

Anonymous

PROOP

5 months ago

This happened with my other service a couple of weeks back, and Brody from the Railway team had to update some status to fix it!

Anonymous

PROOP

5 months ago

I have shared the thread reference above

Anonymous

PROOP

5 months ago

I am seeing frequent out-of-memory notifications for the service. I have also received emails about it.

Attachments

image.png

Anonymous

PROOP

5 months ago

But I have already set the limit to 32GB, and the average usage is around 4-10GB.

Anonymous

PROOP

5 months ago

For reference

Attachments

image.png

Anonymous

PROOP

5 months ago

trying to bump this thread up!

Anonymous

PROOP

5 months ago

This happens only when the current volume is mounted. If I remove the volume mount or use a fresh volume mount, it works.

Anonymous

PROOP

5 months ago

My document files and metadata files are in the volume mount - I cannot consider it alive or operational unless it works with the volume mount!

medim

MODERATOR

5 months ago

Hey there, just for testing purposes can you try adding this env var: NODE_OPTIONS=--max-old-space-size=28672

This would allow the Grist Node.js service to use 28GB of RAM, since it seems to spike to a max of 16GB.

If you can also share the logs from the crashed deployments, it would help us debug this further!
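For anyone adjusting that value: `--max-old-space-size` is expressed in MiB, so 28672 is simply 28 × 1024. A minimal sketch of the arithmetic follows; the helper name `node_options_for_heap` is hypothetical and only illustrates how the number above is derived.

```python
def node_options_for_heap(heap_gib):
    """Build a NODE_OPTIONS value that caps V8's old-space heap at heap_gib GiB.

    Hypothetical helper for illustration only; it is not part of Grist or Railway.
    """
    # --max-old-space-size takes its value in MiB, so convert GiB -> MiB.
    heap_mib = int(heap_gib * 1024)
    return f"--max-old-space-size={heap_mib}"
```

Calling `node_options_for_heap(28)` yields the value suggested above. Leaving some headroom below the 32GB service limit is a judgment call, since the process also uses memory outside the V8 heap.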

Anonymous

PROOP

5 months ago

I don't have any build logs (it is a public container) - no deploy logs either.

It is probably something related to the disk - can you check with Brody what the state issue was with the disk/service in the same project for the MinIO bucket service?

Also, don't you have visibility into my service logs/history/deployments?

https://railway.com/project/d8a9cdda-ca19-44a5-814f-1ecaff088212/service/a72e891f-cb1e-460a-8e37-b8a31006f225?environmentId=82212507-616a-4d85-a946-87879ae84c13

proudparrot2

HOBBY

5 months ago

Railway staff do; I believe moderators do not.

proudparrot2

HOBBY

5 months ago

What do you mean by public container?

There’s not much that anyone besides maybe Railway staff can help with if there aren’t any logs to reference.

Is there anything in any sort of log that indicates why your service isn’t working with this existing volume? Does it have malformed data in it?

Anonymous

PROOP

5 months ago

I meant a public image, with no build step involved.

- I tried to import the same document into a different Grist service mounted on a different volume, and it works absolutely fine there!

- I tried to unmount the volume and run the service - it comes up without an issue!

- I tried to mount the same volume to a new Grist service - it hangs at the same place and fails with an "Out of memory" error.

Anonymous

PROOP

5 months ago

I also tried to mount the old (non-working) volume onto a simple FileBrowser service - but that too hangs at the same "Creating containers" step.

Anonymous

PROOP

5 months ago

Hey Railway Team,

I suppose something is wrong with the disk mounting logic (which, I assume, also gets triggered during restart/redeploy).

Can you please check and update?

Anonymous

PROOP

5 months ago

I just tried creating a volume backup and restoring it to a new volume - it looks like that did the trick, but I am not sure whether any data corruption should be expected here!
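One way to gain some confidence about the restored data: Grist documents (`.grist` files) are SQLite databases, so SQLite's built-in `PRAGMA integrity_check` can be run against each one. The sketch below is only an illustration under assumptions: the function name `check_grist_docs` is hypothetical, and the directory to scan depends on where the volume is mounted. It detects structural damage to the database files, not every possible kind of data loss.

```python
import sqlite3
from pathlib import Path


def check_grist_docs(docs_dir):
    """Run SQLite's integrity check on every *.grist file under docs_dir.

    Returns a dict mapping each file path (as a string) to "ok" when SQLite
    reports no problems, or to the reported problem text otherwise.
    Hypothetical helper for illustration; not part of Grist or Railway.
    """
    results = {}
    for path in sorted(Path(docs_dir).rglob("*.grist")):
        try:
            # Open read-only so the check cannot modify the document.
            uri = path.resolve().as_uri() + "?mode=ro"
            conn = sqlite3.connect(uri, uri=True)
            try:
                rows = conn.execute("PRAGMA integrity_check").fetchall()
            finally:
                conn.close()
            results[str(path)] = "; ".join(str(row[0]) for row in rows)
        except sqlite3.DatabaseError as exc:
            # Raised, for example, when the file is not a valid SQLite database.
            results[str(path)] = f"error: {exc}"
    return results
```

For example, `check_grist_docs("/persist/docs")` would scan that directory, assuming the volume is mounted at `/persist`; adjust the path to the actual mount point. It is safest to run this while the Grist service is stopped, or against a copy of the files.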

noahd

EMPLOYEE

5 months ago

Unfortunately, I'm not seeing much that indicates an issue on our end. I poked through the logs etc. and didn't see much.

If you run into this live again, can you please let us know? Super sorry you're running into it; we would love to walk through it and make sure it's not a reproducible issue.

Status changed to Awaiting User Response Railway • 5 months ago

Railway

BOT

5 months ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway • 5 months ago
