5 months ago
hi, whenever i deploy my project it runs perfectly fine for a day or two, then requests start getting 499 and 502 errors. the only way to fix it is by redeploying. i tried searching for what could cause it but couldn't really find any solutions
project-id: 6bf56526-ebe4-41fa-ac7f-573821bcbf55
26 Replies
5 months ago
Any deploy logs from your app that you can share (make sure nothing sensitive is in there), and the metrics?
Maybe it's a resource limit issue/memory leak?
there are no errors in the deploy logs, and it doesn't seem like it's going over the resource limits according to the metrics


5 months ago
No errors on the project activity? Like an OOM warning
5 months ago
What's the runtime of your app? Node/Python/etc
and is this service connecting to a database?
it's python using fastapi
no, the service isn't connected to a database, but it is connected to a volume (if that helps)
do note that the project is serverless too
5 months ago
Could you describe what the /health endpoint does?
is there a chance that the volume mounting overlapped with the "/health" request, causing the 499 error, which then caused the other 502 errors? (sorry if that sounds kinda stupid, i am not really experienced in hosting stuff)
5 months ago
🤔
5 months ago
This picture of the logs here: that was your attempt to restart the service, but it still didn't work, right?
From the healthchecks image, I can see it kinda stopped working at around 18:15, and then you attempted a restart to fix it, but that also failed.
Am I following correctly?
no, that wasn't me, that was the serverless function. it goes to sleep (which in this case means shutting down the server) when it's inactive for 10 minutes
5 months ago
Gotcha
5 months ago
When your serverless function uses the volume, does it access it in a non-blocking way?
blocking:

```python
import os
import time

VOLUME_DIR = "data"  # folder backed by the mounted volume

def _blocking_probe_write():
    """Synchronous disk touch that can block the event loop."""
    path = os.path.join(VOLUME_DIR, ".probe")
    with open(path, "a") as f:
        f.write(f"{time.time()}\n")
```

non-blocking:

```python
import asyncio

async def _nonblocking_probe_write(timeout=1.0):
    """Disk touch off the event loop, with a timeout."""
    await asyncio.wait_for(asyncio.to_thread(_blocking_probe_write), timeout=timeout)
```

5 months ago
This is about that possibility: any app code that tries to touch the volume before it mounts and blocks the event loop
i dont know?
the volume thing is mostly on railway's side, there is no code involved. i only tell it which folder in the repository to save in the volume (which is the data folder)
and i just access it like a normal file in a folder, like this:
```python
with open('data/ADV.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
file.close()  # redundant: the with block already closed the file
```

5 months ago
cool. yea that has blocking potential (no asyncio)
5 months ago
i'm gonna give you some snippets in a sec
i am a bit confused tbh. the only time it reads or writes in the data folder is when it receives a "/run" request, which according to the logs wasn't called during the time of the crash
are you sure it's related to the volume mounting?
5 months ago
I'm not 100% sure it's at that point specifically
But it'd be worth a shot to have something like this before your app runs:
```python
import asyncio
import os
import time

from fastapi import HTTPException

@app.get("/ready")
async def ready():
    try:
        def probe_io():
            # Touch a file on the mounted folder: write, then read it back
            p = os.path.join("data", ".probe")
            os.makedirs("data", exist_ok=True)
            open(p, "a").write(f"{time.time()}\n")
            open(p, "r").readline()
        # Run the probe off the event loop, with a 1-second timeout
        await asyncio.wait_for(asyncio.to_thread(probe_io), timeout=1.0)
        return {"ready": True}
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"volume_unready:{type(e).__name__}")
```

And set it as your service healthcheck

5 months ago
It's been ages since I ran servers using python, so I might well be wrong, just trying to help here <:salute:1137099685417451530>
There's a chance Railway doesn't load/run anything before the volume mounts, but I'm not sure about that either.
tbh i got lost on what that code does lol, i will try it and see if that changes anything, ty ^^
5 months ago
It just makes sure the data directory exists on the volume and writes/reads a small probe file to confirm everything is mounted and ok, in a non-blocking way (it doesn't block python's event loop)
Let me know if that makes anything better, if not, I'm pretty sure we have a lot of python enthusiasts that might have run into a similar problem before <:salute:1137099685417451530>
The code crashed again, but now I can confirm that it's because the volume mounting overlaps with the /health request (probably because I am waking the service up right before it's done mounting the volume)
The /ready endpoint doesn't really help, because the deployment never restarts or anything when that happens
Is there a way to make the service wait until it's done mounting the volume before it starts to wake up?
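As far as I know Railway doesn't expose a "mount finished" signal, but the app itself could hold off on serving until the folder answers. A rough sketch; `wait_for_volume`, the attempt count, and the delay are all made-up names/values:

```python
import asyncio
import os
import time

VOLUME_DIR = "data"  # assumed mount path

def _volume_ready() -> bool:
    # Probe the mounted folder with a tiny append; fails until the mount is live
    try:
        with open(os.path.join(VOLUME_DIR, ".probe"), "a") as f:
            f.write(f"{time.time()}\n")
        return True
    except OSError:
        return False

async def wait_for_volume(attempts: int = 30, delay: float = 1.0) -> None:
    """Hold app startup until the volume accepts writes, or give up."""
    for _ in range(attempts):
        if await asyncio.to_thread(_volume_ready):
            return
        await asyncio.sleep(delay)
    raise RuntimeError("volume never became ready")
```

You could await this from a FastAPI startup hook, so the server doesn't accept traffic (and the platform healthcheck can't race the mount) until the probe succeeds.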
