Need to increase volumes beyond 250GB
sam-scolari
PROOP

10 months ago

I need help increasing the max size of my volumes

Solved

113 Replies

sam-scolari
PROOP

10 months ago

6b3f7e42-b250-4b53-8639-abb14c79ab8b


medim
MODERATOR

10 months ago

!t


medim
MODERATOR

10 months ago

This thread has been escalated to the Railway team.

Status changed to Awaiting Railway Response medim 10 months ago


sam-scolari
PROOP

10 months ago

This is pretty urgent as I have reached max capacity so I would appreciate any escalation I can get


brody
EMPLOYEE

10 months ago

what service? i don't see any services that have reached a limit


sam-scolari
PROOP

10 months ago

193f99f0-73b4-4347-aec9-155d60eea71b


sam-scolari
PROOP

10 months ago

This volume: 6ed229cd-3128-4b23-abb0-b2e20c8dd115


brody
EMPLOYEE

10 months ago

thank you, will do this now


sam-scolari
PROOP

10 months ago

thanks


brody
EMPLOYEE

10 months ago

please confirm if you are okay with the downtime


brody
EMPLOYEE

10 months ago

@Sam ^


sam-scolari
PROOP

10 months ago

yes this is fine


brody
EMPLOYEE

10 months ago

looks to be back online


sam-scolari
PROOP

10 months ago

Yep, looks online now. I might be needing beyond 500GB (maybe the 1TB-1.5TB range) but I won't know until later today, so this should be good in the meantime.

We only get charged for actual usage right?


sam-scolari
PROOP

10 months ago

thanks for quick help


brody
EMPLOYEE

10 months ago

We only get charged for actual usage right?

correct


sam-scolari
PROOP

10 months ago

Ok I got the numbers back and I'm going to need 1.5TB on these 2 volumes as soon as it can happen

6ed229cd-3128-4b23-abb0-b2e20c8dd115 (currently 500GB)
50eeb77b-61dd-40fb-9467-63eb9a9ec4b5 (currently 250GB)


sam-scolari
PROOP

10 months ago

I'm also having trouble migrating the first one to Metal (US EAST)


sam-scolari
PROOP

10 months ago

the second one migrated fine earlier


brody
EMPLOYEE

10 months ago

i am not able to increase the volume sizes on metal with any level of urgency unfortunately


brody
EMPLOYEE

10 months ago

for that we would have to fall back to your SLA


sam-scolari
PROOP

10 months ago

The first one is the most important right now and it's not on Metal


sam-scolari
PROOP

10 months ago

could I migrate it after, or are migrations not possible beyond 250GB?


brody
EMPLOYEE

10 months ago

the entire cluster including pgpool must be on non-metal


brody
EMPLOYEE

10 months ago

they should be


sam-scolari
PROOP

10 months ago

So it's bad that I managed to migrate one of them earlier?


brody
EMPLOYEE

10 months ago

you do not want to be mixing regions


sam-scolari
PROOP

10 months ago

right, but I guess I can't migrate both, which is the issue then?


sam-scolari
PROOP

10 months ago

Ideally they are both US EAST on Metal


brody
EMPLOYEE

10 months ago

are your other services on us-east metal?


sam-scolari
PROOP

10 months ago

So pgpool and pg-1 (read replica) are on us-east metal

I tried to get pg-0 on us-east metal but it said migration failed


brody
EMPLOYEE

10 months ago

Can I move them back so that I can grow the volume size?


sam-scolari
PROOP

10 months ago

sure if it needs to happen


sam-scolari
PROOP

10 months ago

appreciate the help on this


brody
EMPLOYEE

10 months ago

done


sam-scolari
PROOP

10 months ago

ty ty, and should I be able to migrate those to Metal now, or will I have to wait for you guys to manually do that? Getting onto Metal is not urgent, I'm just trying to get us down to that $0.15/GB/mo tier

(not sure if it failed earlier because it was over 250GB or not)


brody
EMPLOYEE

10 months ago

how long did it take to fail?


sam-scolari
PROOP

10 months ago

almost instantly


sam-scolari
PROOP

10 months ago

didn't see any build logs


brody
EMPLOYEE

10 months ago

it's possible there are no target stackers with enough extra storage space


sam-scolari
PROOP

10 months ago

So should I try again or?


brody
EMPLOYEE

10 months ago

ill raise this to infra


brody
EMPLOYEE

10 months ago

please try again


medim
MODERATOR

10 months ago

@Sam ^


sam-scolari
PROOP

10 months ago

I started the migration an hour ago; looks like it just failed after 50 mins

[screenshot attached]


adam
MODERATOR

10 months ago

@Brody for visibility ^


sam-scolari
PROOP

10 months ago

I tried again immediately after this and it went for almost 3 hours before I aborted the deployment (thinking it would stop the migration 😓). Now the deployment is removed but the volume still says "migrating volume" with a loading spinner; not sure if that's bad or not


brody
EMPLOYEE

10 months ago

may I ask why you aborted it? it's not like the service was offline for 3 hours


brody
EMPLOYEE

10 months ago

(i reset the state)


sam-scolari
PROOP

10 months ago

Stupid decision in hindsight; I've never had it take nearly that long, so I assumed it wasn't working and was going to wait to try again until whatever it was got fixed


brody
EMPLOYEE

10 months ago

let's hold off on further migrations though; we have plans to compress the ZFS stream to hopefully speed up migrations, and i'll let you know when that's in place


sam-scolari
PROOP

10 months ago

sounds good ty


Acking from the Railway end: we are still working on the ZFS compression workstream


Status changed to Awaiting User Response Railway 10 months ago


brody
EMPLOYEE

10 months ago

hey @Sam - i noticed the database is now sub 80GB, did you wanna try the migration again?


sam-scolari
PROOP

10 months ago

Yeah I could try again


sam-scolari
PROOP

10 months ago

Looks like pg-1 was successful but pg-0 failed after 13min


sam-scolari
PROOP

10 months ago

For context, I've successfully migrated pg-1 to metal before (before you reversed it) and I've only ever had issues with pg-0 (193f99f0-73b4-4347-aec9-155d60eea71b), not sure why


sam-scolari
PROOP

10 months ago

Also, this is probably a visual issue with the metric chart, but my disk usage jumped to 500GB early this morning and I can't figure out why. The whole database is < 70GB at the moment, and running du -sh . on the root directory confirms that.

I am anticipating my disk usage climbing to the 500GB-1TB range over the course of the next few weeks, but the reason it jumped so fast to 1.2TB and subsequently back down to sub 80GB yesterday was a bad replication slot configuration causing it to store 1.2TB of WAL files for a node that didn't exist anymore.

The reason I think this is a visual issue is that this time du -sh . matches the disk usage reported for my db. Maybe it has to do with a stale backup; not sure if that's included in the du calculation?

[screenshot attached]
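
For anyone hitting the same WAL buildup: a replication slot whose consumer is gone will pin WAL indefinitely, and standard Postgres queries can find and drop it. The slot name below is hypothetical.

```shell
# List replication slots and how much WAL each one is still pinning
psql -c "SELECT slot_name, active,
                pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
           FROM pg_replication_slots;"

# Drop a slot whose consuming node no longer exists (name is illustrative)
psql -c "SELECT pg_drop_replication_slot('stale_node_slot');"
```

Once the slot is dropped, Postgres can recycle the retained WAL segments at the next checkpoint.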


brody
EMPLOYEE

10 months ago

interesting, ill look into it


sam-scolari
PROOP

10 months ago

actually just noticed this
root@c2e9e49183a0:/# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 1.5T 968G 485G 67% /
tmpfs 64M 0 64M 0% /dev
/dev/zd2400 1.4T 67G 1.3T 5% /bitnami/postgresql
tmpfs 26G 194M 26G 1% /etc/hosts
shm 62M 1.1M 60M 2% /dev/shm
udev 126G 0 126G 0% /proc/keys

Looks like the disk space is being used outside the container?
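
A quick way to separate the two numbers, assuming the same mounts as in the df output above, is to query the volume mount directly instead of the overlay root:

```shell
# Space the database actually uses on the mounted volume
du -sh /bitnami/postgresql

# What the kernel reports for that same mount (should roughly agree with du)
df -h /bitnami/postgresql

# The overlay root is the container's copy-on-write layer on the host,
# so its usage can include data that isn't yours
df -h /
```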


sam-scolari
PROOP

10 months ago

ok a container restart solved this


brody
EMPLOYEE

10 months ago

what do you mean outside of the container?


sam-scolari
PROOP

10 months ago

Preface: I have limited knowledge of how Docker works under the hood

From AI

"However, df -h shows the usage of the entire OverlayFS mount on the host, which includes:

All your container’s changes (the upper layer),

Every image layer that the host has pulled, for this container and any others,

Any leftover data in the host’s Overlay2 directory (e.g. old, unremoved images, stopped containers, volumes, build cache).

That is why the overlay mount is showing ~967 GB used even though your container’s own filesystem (as seen by du) is only ~350 MB."


sam-scolari
PROOP

10 months ago

but nevertheless a restart of the container solved it

[screenshot attached]


brody
EMPLOYEE

10 months ago

We have since added more stackers with volumes, feel free to try again


sam-scolari
PROOP

10 months ago

will do; I think you might need to reset the state on it before I can, I don't see the try again button anymore

[screenshot attached]


brody
EMPLOYEE

10 months ago

done


sam-scolari
PROOP

10 months ago

[screenshot attached]


sam-scolari
PROOP

10 months ago

this volume really doesn't like me lol


brody
EMPLOYEE

10 months ago

sorry about the inconvenience, i am still talking with infra here


sam-scolari
PROOP

10 months ago

all good


Railway
BOT

10 months ago

This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!

Status changed to Solved Railway 10 months ago


brody
EMPLOYEE

10 months ago

hello @Sam - we have put in place some more verbose logging, we would like you to try again when you have the time


sam-scolari
PROOP

10 months ago

Sure thing, can give it a try some time tomorrow


brody
EMPLOYEE

10 months ago

thank you!


sam-scolari
PROOP

10 months ago

Ran it last night and it failed again


brody
EMPLOYEE

10 months ago

thank you


brody
EMPLOYEE

10 months ago

@Sam - would you mind if we went and initiated a transfer on your behalf? we would like to follow along with the progress


sam-scolari
PROOP

10 months ago

By transfer you mean migration? Sure.


brody
EMPLOYEE

10 months ago

haha yes, i mean migration


sam-scolari
PROOP

9 months ago

Hey, any update on this? Still being worked on?


brody
EMPLOYEE

9 months ago

perfect timing, i had it on my to-dos to come back today and ask you to try again; we very recently added volume migration progress, resumability, and compression.


sam-scolari
PROOP

9 months ago

will try again now


brody
EMPLOYEE

9 months ago

thank you!


brody
EMPLOYEE

9 months ago

1.2TB of data is being transferred


brody
EMPLOYEE

9 months ago

you see the estimated time now right?


sam-scolari
PROOP

9 months ago

yup, it's much clearer now

[screenshot attached]


sam-scolari
PROOP

9 months ago

curious, where is the 1.2TB coming from? This volume is only like 130GB atm. Would the remaining TB all be backups?


brody
EMPLOYEE

9 months ago

yep, current data, and all the data in the backups


sam-scolari
PROOP

9 months ago

ah are manual backups not deleted?


sam-scolari
PROOP

9 months ago

didn't realize I had that much backed up


brody
EMPLOYEE

9 months ago

yep, manual backups do not have a TTL


sam-scolari
PROOP

9 months ago

whoops, guess I should have deleted all those old ones before the migration, because almost a TB of that is not needed lol


sam-scolari
PROOP

9 months ago

probably caused these migration issues in the past as well


brody
EMPLOYEE

9 months ago

probably but hopefully we have fixed them by now


sam-scolari
PROOP

9 months ago

yeah hopefully


sam-scolari
PROOP

9 months ago

wait, I thought backups were incremental, as in they only back up new data not previously backed up. Wouldn't that mean that backups should never exceed twice the size of the current volume?


brody
EMPLOYEE

9 months ago

you have stored far more than 140GB of data in the past: you store 500GB, take a backup, and delete that data; now that backup's size is 360GB


sam-scolari
PROOP

9 months ago

oh right, yeah, forgot I had that WAL buildup issue, nvm


brody
EMPLOYEE

9 months ago

did you want to cancel and delete some backups?


sam-scolari
PROOP

9 months ago

How bad is the downtime on a migration of 1.2TB vs like 300GB?


sam-scolari
PROOP

9 months ago

happy to let this play out so you can see if large migrations work now, but it would also be better for me if the downtime were considerably less


sam-scolari
PROOP

9 months ago

but it is pgpool, so i suppose there shouldn't be any downtime technically


brody
EMPLOYEE

9 months ago

it all depends on how much data you write to the volume during the first step of the migration. we do a two-step process: the first step is sending a snapshot of the current data while your deployment is still online; the second step is taking the deployment offline and sending over the data that has changed since starting the first step. if in 3 hours you only write 300mb, the downtime would be 20 seconds
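
The two-step flow described above matches a classic incremental ZFS send/receive; a rough sketch follows (pool, dataset, and host names are made up, and this is not Railway's actual tooling):

```shell
# Step 1: snapshot and stream the bulk of the data while the workload stays online
zfs snapshot tank/vol@step1
zfs send tank/vol@step1 | ssh target-host zfs recv -F tank/vol

# Step 2: stop the workload, snapshot again, and send only the blocks
# changed since step 1; the size of this delta is what determines the downtime
zfs snapshot tank/vol@step2
zfs send -i tank/vol@step1 tank/vol@step2 | ssh target-host zfs recv tank/vol
```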


sam-scolari
PROOP

9 months ago

ah then it should be fine, can leave it


sam-scolari
PROOP

9 months ago

no way we write more than 50mb in that time


brody
EMPLOYEE

9 months ago

then yeah, given that fact, the downtime from a 3 hour migration or a 1 hour migration is negligible


sam-scolari
PROOP

9 months ago

@Brody at some point a few hours ago the copying % message and timer went away from the volume label and it just said "Migrating Volume"

Just checked back in and now there's a failed deployment, the copying label is back and stuck at 69%, and the metrics tab is also stuck on "migrating service…"

[screenshot attached]


sam-scolari
PROOP

9 months ago

[screenshot attached]


sam-scolari
PROOP

9 months ago

not sure if it's still doing things


sam-scolari
PROOP

9 months ago

or just broken visually


brody
EMPLOYEE

9 months ago

looks like it hit the migration timeout and canceled; I've kicked it off again, but we have plans to increase the timeout


sam-scolari
PROOP

9 months ago

looks like that worked, ty for all the help


brody
EMPLOYEE

9 months ago

amazing, happy it's finally done


brody
EMPLOYEE

9 months ago

!s

