10 months ago
I need help increasing the max size of my volumes
113 Replies
10 months ago
!t
10 months ago
This thread has been escalated to the Railway team.
Status changed to Awaiting Railway Response medim • 10 months ago
This is pretty urgent as I have reached max capacity so I would appreciate any escalation I can get
10 months ago
what service? i don't see any services that have reached a limit
10 months ago
thank you, will do this now
10 months ago
please confirm if you are okay with the downtime
10 months ago
@Sam ^
10 months ago
looks to be back online
Yep, looks online now. I might need to go beyond 500GB (maybe the 1TB-1.5TB range) but I won't know until later today, so this should be good in the meantime.
We only get charged for actual usage, right?
10 months ago
We only get charged for actual usage, right?
correct
OK, I got the numbers back, and I'm going to need 1.5TB on these 2 volumes as soon as it can happen:
6ed229cd-3128-4b23-abb0-b2e20c8dd115 (currently 500GB)
50eeb77b-61dd-40fb-9467-63eb9a9ec4b5 (currently 250GB)
10 months ago
i am not able to increase the volume sizes on metal with any level of urgency, unfortunately
10 months ago
for that we would have to fall back to your SLA
10 months ago
the entire cluster including pgpool must be on non-metal
10 months ago
they should be
10 months ago
you do not want to be mixing regions
10 months ago
are your other services on us-east metal?
So pgpool and pg-1 (read replica) are on us-east metal
I tried to get pg-0 on us-east metal but it said migration failed
10 months ago
Can I move them back so that I can grow the volume size?
10 months ago
done
ty ty. Should I be able to migrate those to Metal now, or will I have to wait for you guys to do that manually? Getting onto Metal is not urgent; I'm just trying to get us down to that $0.15/GB-mo tier
(not sure if it failed earlier because it was over 250GB or not)
10 months ago
how long did it take to fail?
10 months ago
it's possible there are no target stackers with enough extra storage space
10 months ago
i'll raise this to infra
10 months ago
please try again
10 months ago
@Sam ^
I started the migration an hour ago; looks like it just failed after 50 mins
10 months ago
@Brody for visibility ^
I tried again immediately after this and it went for almost 3 hours before I aborted the deployment (thinking it would stop the migration 😓). Now the deployment is removed, but the volume still says "Migrating volume" with a loading spinner; not sure if that's bad or not
10 months ago
may i ask why you aborted it? it's not like the service was offline for 3 hours
10 months ago
(i reset the state)
Stupid decision in hindsight. I've never had it take nearly that long, so I assumed it wasn't working and was going to wait to try again until whatever it was got fixed
10 months ago
let's hold off on further migrations though, we have plans to compress the ZFS stream to hopefully speed up migrations, and i'll let you know when that's in place
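(For reference, "compressing the ZFS stream" usually just means piping the send through a compressor on the way to the target. A minimal sketch, assuming a hypothetical dataset tank/vol and zstd on both hosts; this is not Railway's actual tooling:)
# hypothetical dataset and host names, for illustration only
zfs snapshot tank/vol@migrate
zfs send tank/vol@migrate | zstd -T0 | ssh target-stacker 'zstd -d | zfs receive tank/vol'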
10 months ago
Acking from the Railway end: we are still working on the ZFS compression workstream
Status changed to Awaiting User Response Railway • 10 months ago
10 months ago
hey @Sam - i noticed the database is now sub-80GB, did you wanna try the migration again?
For context, I've successfully migrated pg-1 to metal before (before you reversed it), and I've only ever had issues with pg-0 (193f99f0-73b4-4347-aec9-155d60eea71b); not sure why.
Also, this is probably a visual issue with the metric chart, but my disk usage jumped to 500GB early this morning for some reason and I can't figure out why. The whole database is < 70GB at the moment, and running du -sh . on the root directory confirms that.
I anticipate my disk usage will climb to the 500GB-1TB range over the next few weeks, but the reason it jumped so fast to 1.2TB and subsequently back down to sub-80GB yesterday was a bad replication slot configuration that caused it to store 1.2TB of WAL files for a node that didn't exist anymore.
The reason I think this is a visual issue is that this time du -sh . confirms the actual disk usage of my DB. Maybe it has to do with a stale backup; not sure if that's included in the du calculation?
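(For anyone hitting the same WAL bloat: an unused replication slot pins WAL indefinitely, and you can check and drop it from psql. A minimal sketch, assuming psql can connect with your credentials; the slot name 'dead_node_slot' is made up:)
# see how much WAL each replication slot is pinning
psql -c "SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal FROM pg_replication_slots;"
# drop the slot left behind by the removed node ('dead_node_slot' is hypothetical)
psql -c "SELECT pg_drop_replication_slot('dead_node_slot');"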
10 months ago
interesting, ill look into it
actually just noticed this
root@c2e9e49183a0:/# df -h
Filesystem     Size  Used  Avail  Use%  Mounted on
overlay        1.5T  968G   485G   67%  /
tmpfs           64M     0    64M    0%  /dev
/dev/zd2400    1.4T   67G   1.3T    5%  /bitnami/postgresql
tmpfs           26G  194M    26G    1%  /etc/hosts
shm             62M  1.1M    60M    2%  /dev/shm
udev           126G     0   126G    0%  /proc/keys
Looks like the disk space is being used outside the container?
10 months ago
what do you mean outside of the container?
Preface: I have limited knowledge of how Docker works under the hood
From AI:
"However, df -h shows the usage of the entire OverlayFS mount on the host, which includes:
All your container’s changes (the upper layer),
Every image layer that the host has pulled, for this container and any others,
Any leftover data in the host’s Overlay2 directory (e.g. old, unremoved images, stopped containers, volumes, build cache).
That is why the overlay mount is showing ~967 GB used even though your container’s own filesystem (as seen by du) is only ~350 MB."
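(That lines up with the df output above: the overlay line reflects the host's shared storage pool, while the volume is its own filesystem. A quick way to separate the two from inside the container; a generic sketch, not Railway-specific:)
# -x stays on one filesystem, so this measures only the container's writable layer
du -shx /
# the volume the database actually writes to
du -sh /bitnami/postgresql
# per-mount df shows host pool usage (overlay) vs volume usage side by side
df -h / /bitnami/postgresql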
10 months ago
We have since added more stackers with volumes, feel free to try again
will do. I think you might need to reset the state on it before I can; I don't see the "try again" button anymore
10 months ago
done
10 months ago
sorry about the inconvenience here, i am still talking with infra
10 months ago
This thread has been marked as solved automatically due to a lack of recent activity. Please re-open this thread or create a new one if you require further assistance. Thank you!
Status changed to Solved Railway • 10 months ago
10 months ago
hello @Sam - we have put in place some more verbose logging; we would like you to try again when you have the time
10 months ago
thank you!
10 months ago
thank you
10 months ago
@Sam - would you mind if we went and initiated a transfer on your behalf? we would like to follow along with the progress
10 months ago
haha yes, i mean migration
9 months ago
perfect timing, i had it on my to-do list to come back today and ask you to try again; we very recently added volume migration progress, resumability, and compression.
9 months ago
thank you!
9 months ago
1.2TB of data is being transferred
9 months ago
you see the estimated time now, right?
curious, where is the 1.2TB coming from? This volume is only like 130GB atm. Would the remaining TB all be backups?
9 months ago
yep, current data, and all the data in the backups
9 months ago
yep, manual backups do not have a TTL
whoops, guess I should have deleted all those old ones before the migration, because almost a TB of that is not needed lol
9 months ago
probably, but hopefully we have fixed them by now
wait, I thought backups were incremental, as in they only back up new data that wasn't previously backed up. Wouldn't that mean backups should never exceed twice the size of the current volume?
9 months ago
you have stored far more than 140GB of data in the past. say you store 500GB, take a backup, and then delete that data; well, now that backup's size is 360GB
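(If the backups are snapshot-based, this is standard snapshot space accounting: deleted data stays pinned by any snapshot that still references it. A hypothetical ZFS illustration, with made-up dataset names and numbers echoing the example above:)
# 'used' is space held only by that snapshot; 'refer' is what it saw at creation
zfs list -t snapshot -o name,used,referenced tank/vol
# NAME              USED  REFER
# tank/vol@backup1  360G  500G   <- the deleted data is still pinned here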
9 months ago
did you want to cancel and delete some backups?
happy to let this play out so you can see if large migrations work now, but it would also be better for me if the downtime was considerably less
but it is pgpool, so i suppose there shouldn't be any downtime technically
9 months ago
it all depends on how much data you write to the volume during the first step of the migration. we do a two-step process: the first step sends a snapshot of the current data while your deployment is still online; the second step takes the deployment offline and sends over the data that has changed since the first step started. if in 3 hours you only write 300MB, the downtime would be about 20 seconds
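(For the curious, that two-step pattern maps onto incremental ZFS sends. A minimal sketch with hypothetical dataset and host names, not Railway's actual implementation:)
# step 1: full send while the deployment stays online
zfs snapshot tank/vol@step1
zfs send tank/vol@step1 | ssh target 'zfs receive tank/vol'
# step 2: stop the deployment, then send only what changed since step 1
zfs snapshot tank/vol@step2
zfs send -i tank/vol@step1 tank/vol@step2 | ssh target 'zfs receive -F tank/vol'
# downtime covers only the step-2 delta transfer (e.g. ~300MB), not the whole volume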
9 months ago
then yeah, given that, the downtime from a 3-hour migration or a 1-hour migration is negligible
@Brody at some point a few hours ago the copying % message and timer went away from the volume label and it just said "Migrating Volume"
Just checked back in and now there's a failed deployment, the copying label is back and stuck at 69%, and in the metrics tab it is also stuck on "migrating service…"
9 months ago
looks like it hit the migration timeout and was canceled. I've kicked it off again, but we have plans to increase the timeout
9 months ago
amazing, happy it's finally done
9 months ago
!s