10 months ago
I need help increasing the max size of my volumes
113 Replies
10 months ago
!t
10 months ago
This thread has been escalated to the Railway team.
Status changed to Awaiting Railway Response medim • 10 months ago
This is pretty urgent as I have reached max capacity so I would appreciate any escalation I can get
10 months ago
what service? i don't see any services that have reached a limit
10 months ago
thank you, will do this now
10 months ago
please confirm if you are okay with the downtime
10 months ago
@Sam ^
10 months ago
looks to be back online
Yep, looks online now. I might need to go beyond 500GB (maybe the 1TB-1.5TB range) but I won't know until later today, so this should be good in the meantime.
We only get charged for actual usage, right?
10 months ago
We only get charged for actual usage, right?
correct
OK, I got the numbers back, and I'm going to need 1.5TB on these 2 volumes as soon as it can happen:
6ed229cd-3128-4b23-abb0-b2e20c8dd115 (currently 500GB)
50eeb77b-61dd-40fb-9467-63eb9a9ec4b5 (currently 250GB)
10 months ago
i am not able to increase the volume sizes on metal with any level of urgency, unfortunately
10 months ago
for that we would have to fall back to your SLA
10 months ago
the entire cluster including pgpool must be on non-metal
10 months ago
they should be
10 months ago
you do not want to be mixing regions
10 months ago
are your other services on us-east metal?
So pgpool and pg-1 (read replica) are on us-east metal
I tried to get pg-0 on us-east metal but it said migration failed
10 months ago
Can I move them back so that I can grow the volume size?
10 months ago
done
ty ty. Should I be able to migrate those to Metal now, or will I have to wait for you guys to do that manually? Getting onto Metal is not urgent; I'm just trying to get us down to that $0.15/GB-mo tier
(not sure if it failed earlier because it was over 250GB or not)
10 months ago
how long did it take to fail?
10 months ago
it's possible there are no target stackers with enough extra storage space
10 months ago
i'll raise this to infra
10 months ago
please try again
10 months ago
@Sam ^
I started the migration an hour ago; looks like it just failed after 50 mins
10 months ago
@Brody for visibility ^
I tried again immediately after this and it went for almost 3 hours before I aborted the deployment (thinking it would stop the migration 😓). Now the deployment is removed, but the volume still says "Migrating volume" with a loading spinner; not sure if that's bad or not
10 months ago
may i ask why you aborted it? it's not like the service was offline for 3 hours
10 months ago
(i reset the state)
Stupid decision in hindsight. I've never had it take nearly that long, so I assumed it wasn't working and was going to wait to try again until whatever it was got fixed
10 months ago
let's hold off on further migrations though, we have plans to compress the ZFS stream to hopefully speed up migrations, and i'll let you know when that's in place
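(For reference, "compressing the ZFS stream" usually just means piping the send through a compressor on the way to the target. A minimal sketch, assuming a hypothetical dataset tank/vol and zstd on both hosts; this is not Railway's actual tooling:)
# hypothetical dataset and host names, for illustration only
zfs snapshot tank/vol@migrate
zfs send tank/vol@migrate | zstd -T0 | ssh target-stacker 'zstd -d | zfs receive tank/vol'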
10 months ago
Acking from the Railway end: we are still working on the ZFS compression workstream
Status changed to Awaiting User Response Railway • 10 months ago
10 months ago
hey @Sam - i noticed the database is now sub-80GB, did you wanna try the migration again?
For context, I've successfully migrated pg-1 to metal before (before you reversed it), and I've only ever had issues with pg-0 (193f99f0-73b4-4347-aec9-155d60eea71b); not sure why.
Also, this is probably a visual issue with the metric chart, but my disk usage jumped to 500GB early this morning for some reason and I can't figure out why. The whole database is < 70GB at the moment, and running du -sh . on the root directory confirms that.
I anticipate my disk usage will climb to the 500GB-1TB range over the next few weeks, but the reason it jumped so fast to 1.2TB and subsequently back down to sub-80GB yesterday was a bad replication slot configuration that caused it to store 1.2TB of WAL files for a node that didn't exist anymore.
The reason I think this is a visual issue is that this time du -sh . confirms the actual disk usage of my DB. Maybe it has to do with a stale backup; not sure if that's included in the du calculation?
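(For anyone hitting the same WAL bloat: an unused replication slot pins WAL indefinitely, and you can check and drop it from psql. A minimal sketch, assuming psql can connect with your credentials; the slot name 'dead_node_slot' is made up:)
# see how much WAL each replication slot is pinning
psql -c "SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;"
# drop the slot left behind by the removed node ('dead_node_slot' is hypothetical)
psql -c "SELECT pg_drop_replication_slot('dead_node_slot');"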
10 months ago
interesting, ill look into it
actually just noticed this
root@c2e9e49183a0:/# df -h
Filesystem     Size  Used  Avail  Use%  Mounted on
overlay        1.5T  968G   485G   67%  /
tmpfs           64M     0    64M    0%  /dev
/dev/zd2400    1.4T   67G   1.3T    5%  /bitnami/postgresql
tmpfs           26G  194M    26G    1%  /etc/hosts
shm             62M  1.1M    60M    2%  /dev/shm
udev           126G     0   126G    0%  /proc/keys
Looks like the disk space is being used outside the container?
10 months ago
what do you mean outside of the container?
Preface: I have limited knowledge of how Docker works under the hood
From AI:
"However, df -h shows the usage of the entire OverlayFS mount on the host, which includes:
All your container’s changes (the upper layer),
Every image layer that the host has pulled, for this container and any others,
Any leftover data in the host’s Overlay2 directory (e.g. old, unremoved images, stopped containers, volumes, build cache).
That is why the overlay mount is showing ~967 GB used even though your container’s own filesystem (as seen by du) is only ~350 MB."
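(That lines up with the df output above: the overlay line reflects the host's shared storage pool, while the volume is its own filesystem. A quick way to separate the two from inside the container; a generic sketch, not Railway-specific:)
# -x stays on one filesystem, so this measures only the container's writable layer
du -shx /
# the volume the database actually writes to
du -sh /bitnami/postgresql
# per-mount df shows host pool usage (overlay) vs volume usage side by side
df -h / /bitnami/postgresql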
10 months ago
We have since added more stackers with volumes, feel free to try again
will do. I think you might need to reset the state on it before I can; I don't see the "try again" button anymore
10 months ago
done
10 months ago
sorry about the inconvenience here, i am still talking with infra
10 months ago
This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!
Status changed to Solved Railway • 10 months ago
10 months ago
hello @Sam - we have put in place some more verbose logging; we would like you to try again when you have the time
10 months ago
thank you!
10 months ago
thank you
10 months ago
@Sam - would you mind if we went and initiated a transfer on your behalf? we would like to follow along with the progress
10 months ago
haha yes, i mean migration
9 months ago
perfect timing, i had it on my to-do list to come back today and ask you to try again; we very recently added volume migration progress, resumability, and compression.
9 months ago
thank you!
9 months ago
1.2TB of data is being transferred
9 months ago
you see the estimated time now, right?
curious, where is the 1.2TB coming from? This volume is only like 130GB atm. Would the remaining TB all be backups?
9 months ago
yep, current data, and all the data in the backups
9 months ago
yep, manual backups do not have a TTL
whoops, guess I should have deleted all those old ones before the migration, because almost a TB of that is not needed lol
9 months ago
probably, but hopefully we have fixed them by now
wait, I thought backups were incremental, as in they only back up new data that wasn't previously backed up. Wouldn't that mean backups should never exceed twice the size of the current volume?
9 months ago
you have stored far more than 140GB of data in the past. say you store 500GB, take a backup, and then delete that data; well, now that backup's size is 360GB
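(If the backups are snapshot-based, this is standard snapshot space accounting: deleted data stays pinned by any snapshot that still references it. A hypothetical ZFS illustration, with made-up dataset names and numbers echoing the example above:)
# 'used' is space held only by that snapshot; 'refer' is what it saw at creation
zfs list -t snapshot -o name,used,referenced tank/vol
# NAME              USED  REFER
# tank/vol@backup1  360G  500G   <- the deleted data is still pinned here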
9 months ago
did you want to cancel and delete some backups?
happy to let this play out so you can see if large migrations work now, but it would also be better for me if the downtime was considerably less
but it is pgpool, so i suppose there shouldn't be any downtime technically
9 months ago
it all depends on how much data you write to the volume during the first step of the migration. we do a two-step process: the first step sends a snapshot of the current data while your deployment is still online; the second step takes the deployment offline and sends over the data that has changed since the first step started. if in 3 hours you only write 300MB, the downtime would be about 20 seconds
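(For the curious, that two-step pattern maps onto incremental ZFS sends. A minimal sketch with hypothetical dataset and host names, not Railway's actual implementation:)
# step 1: full send while the deployment stays online
zfs snapshot tank/vol@step1
zfs send tank/vol@step1 | ssh target 'zfs receive tank/vol'
# step 2: stop the deployment, then send only what changed since step 1
zfs snapshot tank/vol@step2
zfs send -i tank/vol@step1 tank/vol@step2 | ssh target 'zfs receive -F tank/vol'
# downtime covers only the step-2 delta transfer (e.g. ~300MB), not the whole volume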
9 months ago
then yeah, given that, the downtime from a 3-hour migration or a 1-hour migration is negligible
@Brody at some point a few hours ago the copying % message and timer went away from the volume label and it just said "Migrating Volume"
Just checked back in and now there's a failed deployment, the copying label is back and stuck at 69%, and in the metrics tab it is also stuck on "migrating service…"
9 months ago
looks like it hit the migration timeout and was canceled. I've kicked it off again, but we have plans to increase the timeout
9 months ago
amazing, happy it's finally done
9 months ago
!s