5 months ago
hi, whenever i deploy my project it runs perfectly fine for a day or two, then requests start getting 499 and 502 errors. the only way to fix it is by redeploying. i tried searching for what could cause it but couldn't really find any solutions
project-id: 6bf56526-ebe4-41fa-ac7f-573821bcbf55
26 Replies
5 months ago
Any deploy logs from your app that you can share (make sure nothing sensitive is in there), and the metrics?
Maybe it's a resource limit issue/memory leak?
there are no errors in the deploy logs, and it doesn't seem like it's going over the resource limits according to the metrics


5 months ago
No errors on the project activity? Like an OOM warning
5 months ago
What's the runtime of your app? Node/Python/etc
and is this service connecting to a database?
it's python using fastapi
no, the service isn't connected to a database, but it is connected to a volume (if that helps)
do note that the project is serverless too
5 months ago
Could you describe what the /health endpoint does?
is there a chance that the volume mounting overlapped with the "/health" request, causing the 499 error, which then caused the other 502 errors? (sorry if that sounds kinda stupid, i am not really experienced in hosting stuff)
5 months ago
🤔
5 months ago
This picture of the logs here: that was your attempt to restart the service, but it still didn't work, right?
From the healthchecks image, I can see it kinda stopped working at around 18:15, and then you attempted a restart to fix it, but that also failed.
Am I following correctly?
no, that wasn't me, that was the serverless function. it goes to sleep (which in this case means shutting down the server) when it's inactive for 10 minutes
5 months ago
Gotcha
5 months ago
When your serverless function uses the volume, does it access it in a non-blocking way?
blocking:

```python
import os
import time

VOLUME_DIR = "data"  # folder backed by the mounted volume

def _blocking_probe_write():
    """Synchronous disk touch that can block the event loop."""
    path = os.path.join(VOLUME_DIR, ".probe")
    with open(path, "a") as f:
        f.write(f"{time.time()}\n")
```

non-blocking:

```python
import asyncio

async def _nonblocking_probe_write(timeout=1.0):
    """Disk touch off the event loop, with a timeout."""
    await asyncio.wait_for(asyncio.to_thread(_blocking_probe_write), timeout=timeout)
```

5 months ago
This is about that possibility: any app code that tries to touch the volume before it mounts and blocks the event loop
i dont know?
the volume thing is mostly on railway's side, there is no code involved. i only tell it which folder in the repository to save in the volume (which is the data folder)
and i just access it like a normal file in a folder, like this:
```python
with open('data/ADV.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
file.close()  # redundant: the with block already closed the file
```

5 months ago
cool. yea that has blocking potential (no asyncio)
5 months ago
i'm gonna give you some snippets in a sec
i am a bit confused tbh. the only time it reads or writes in the data folder is when it receives a "/run" request, which according to the logs wasn't called during the time of the crash
are you sure it's related to the volume mounting?
5 months ago
I'm not 100% sure it's at that point specifically
But it'd be worth a shot to have something like this before your app runs:
```python
import asyncio
import os
import time

from fastapi import HTTPException

@app.get("/ready")
async def ready():
    try:
        def probe_io():
            # Touch a file on the mounted folder: write, then read it back
            p = os.path.join("data", ".probe")
            os.makedirs("data", exist_ok=True)
            open(p, "a").write(f"{time.time()}\n")
            open(p, "r").readline()
        # Run the probe off the event loop, with a 1-second timeout
        await asyncio.wait_for(asyncio.to_thread(probe_io), timeout=1.0)
        return {"ready": True}
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"volume_unready:{type(e).__name__}")
```

And set it as your service healthcheck

5 months ago
It's been ages since I ran servers using python, so I might well be wrong, just trying to help here <:salute:1137099685417451530>
There's a chance Railway doesn't load/run anything before the volume mounts, but I'm not sure about that either.
tbh i got lost on what that code does lol, i will try it and see if that changes anything, ty ^^
5 months ago
It just makes sure the data directory exists on the volume and writes/reads a small probe file to confirm everything is mounted and ok, in a non-blocking way (it doesn't block python's event loop)
Let me know if that makes anything better, if not, I'm pretty sure we have a lot of python enthusiasts that might have run into a similar problem before <:salute:1137099685417451530>
The code crashed again, but now I can confirm that it's because the volume mounting overlaps with the /health request (probably because I am waking the service up right before it's done mounting the volume)
The /ready endpoint doesn't really help, because the deployment never restarts or anything when that happens
Is there a way to make the service wait until it's done mounting the volume before it starts to wake up?
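As far as I know Railway doesn't expose a "mount finished" signal, but the app itself could hold off on serving until the folder answers. A rough sketch; `wait_for_volume`, the attempt count, and the delay are all made-up names/values:

```python
import asyncio
import os
import time

VOLUME_DIR = "data"  # assumed mount path

def _volume_ready() -> bool:
    # Probe the mounted folder with a tiny append; fails until the mount is live
    try:
        with open(os.path.join(VOLUME_DIR, ".probe"), "a") as f:
            f.write(f"{time.time()}\n")
        return True
    except OSError:
        return False

async def wait_for_volume(attempts: int = 30, delay: float = 1.0) -> None:
    """Hold app startup until the volume accepts writes, or give up."""
    for _ in range(attempts):
        if await asyncio.to_thread(_volume_ready):
            return
        await asyncio.sleep(delay)
    raise RuntimeError("volume never became ready")
```

You could await this from a FastAPI startup hook, so the server doesn't accept traffic (and the platform healthcheck can't race the mount) until the probe succeeds.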
