Production: outbound HTTPS to Supabase starts timing out after 10–20 minutes; redeploy temporarily fixes
andrebarros2
PRO

3 months ago

Body

  • Project: humorous-possibility

  • Service: Adros-SaaS

  • Environment/Region: production, US East (Virginia)

  • Stack: Node.js 20 on Railway, Express, undici fetch (no custom dispatcher), offline JWT verification (no Auth round‑trips)

Issue

  • After a fresh deploy, everything works for ~10–20 minutes.

  • Then outbound requests to Supabase REST (and only those) begin to hang and end in AbortError timeouts (~20–25 s). Our own /api/health continues to respond immediately.

  • A redeploy “resets” the behavior for another 10–20 minutes.
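For context, the per-attempt timeout is the plain AbortController pattern. A sketch of our wrapper (the function name and the 20 s default are illustrative); aborting without a reason is what produces the DOMException code 20 / “This operation was aborted” in the logs below:

```javascript
// Sketch (illustrative names/values): default undici fetch wrapped in a
// per-attempt timeout. Calling ac.abort() with no reason rejects the fetch
// with an AbortError DOMException ("This operation was aborted", code 20).
async function fetchWithTimeout(url, init = {}, timeoutMs = 20000) {
  const ac = new AbortController();
  const timer = setTimeout(() => ac.abort(), timeoutMs);
  try {
    return await fetch(url, { ...init, signal: ac.signal });
  } finally {
    clearTimeout(timer); // don't leak the timer on success or failure
  }
}
```

During an incident, calls through this wrapper to Supabase reject at the timeout while the same wrapper against local endpoints returns instantly.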

Representative logs (UTC)

  • 2025‑09‑10 02:53:25: GET …/rest/v1/appointments?... ms=21844 code=20 error=“This operation was aborted”

  • 2025‑09‑10 02:53:25: GET …/rest/v1/clinics?select=… ms=23691 code=20 error=“This operation was aborted”

  • Our lightweight ping to Supabase also fails during the incident: “[supabase] ping_error AbortError: This operation was aborted”

What we’ve ruled out

  • DB slowness on our side: we added the right indexes (appointments (clinic_id, starts_at), scheduled_messages partial (clinic_id, send_at) WHERE sent=false AND paused=false, patients (clinic_id, updated_at), clinic_professionals (clinic_id), tag_assignments indexes). EXPLAIN ANALYZE shows ms-level execution (e.g., appointments ~0.07 ms, scheduled_messages ~0.04 ms).

  • Client settings: removed custom undici dispatcher and aggressive keep‑alive; now using default undici fetch with per‑attempt timeout and short retry/backoff; added a lightweight circuit breaker to avoid cascades.
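A minimal sketch of the retry/backoff and circuit breaker described above (thresholds, delays, and names are illustrative, not our production values):

```javascript
// Exponential backoff with full jitter (illustrative base/cap).
function backoffMs(attempt, baseMs = 250, capMs = 2000) {
  return Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
}

// Short retry loop: re-attempt the call a few times, sleeping backoffMs
// between attempts, then surface the last error.
async function withRetry(fn, attempts = 3) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) await new Promise((r) => setTimeout(r, backoffMs(i)));
    }
  }
  throw lastErr;
}

// Lightweight circuit breaker: after N consecutive failures, skip calls
// entirely until a cooldown elapses, so a stuck egress path doesn't pile
// up hung requests.
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = 0;
  }
  get state() {
    if (this.failures < this.failureThreshold) return 'closed';
    return Date.now() - this.openedAt >= this.cooldownMs ? 'half-open' : 'open';
  }
  async exec(fn) {
    if (this.state === 'open') throw new Error('circuit open, skipping call');
    try {
      const res = await fn();
      this.failures = 0; // any success closes the circuit
      return res;
    } catch (err) {
      this.failures += 1;
      if (this.failures === this.failureThreshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

The breaker keeps the process responsive during an incident, but it only masks the underlying egress problem rather than fixing it.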

Suspicion

  • An infra/network path issue between our Railway egress and Supabase (edge/CDN/PostgREST pool/rate‑limit/NAT). The problem appears only in production and correlates with container uptime, not query plans.

Asks for Railway

1) Share the current egress IP(s) for this service and check if those IPs are hitting any outbound throttling/connection limits/timeouts to jnprhvhkxggvqowmwppf.supabase.co:443.

2) Look for signs of NAT/egress gateway idle-socket recycling or per-destination concurrency limits after ~10 minutes of uptime.

3) Check DNS resolution/route health from our node to that hostname around the provided timestamps; any spikes in TLS handshake failures or SYN timeouts?

4) Confirm if there’s any shared egress policy that could intermittently impact long‑lived services, and whether a static egress IP or different egress pool/region would help.

5) Provide recommended best practices for undici/Node networking on Railway in this scenario (keep‑alive expectations, retry patterns).

Notes

  • The same code and region works fine against our staging Supabase project.

Happy to provide additional logs (full lines with URLs/timestamps) or run targeted probes you suggest.

Solved · $10 Bounty

2 Replies

jake
EMPLOYEE

3 months ago

Apologies, but this looks like an issue with the application-level code. Due to volume, we can only answer platform-level issues.

I've made this thread public so that the community might be able to help with your query.


Status changed to Awaiting User Response Railway 3 months ago


jake
EMPLOYEE

3 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open jake 3 months ago


Status changed to Solved andrebarros2 3 months ago

