Production: outbound HTTPS to Supabase starts timing out after 10–20 minutes; redeploy temporarily fixes
andrebarros2
PRO

3 months ago

Body

  • Project: humorous-possibility

  • Service: Adros-SaaS

  • Environment/Region: production, US East (Virginia)

  • Stack: Node.js 20 on Railway, Express, undici fetch (no custom dispatcher), offline JWT verification (no Auth round‑trips)

Issue

  • After a fresh deploy, everything works for ~10–20 minutes.

  • Then outbound requests to Supabase REST (and only those) begin to hang and end in AbortError timeouts (~20–25 s). Our own /api/health continues to respond immediately.

  • A redeploy “resets” the behavior for another 10–20 minutes.
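For context, the per-attempt timeout is the plain AbortController pattern. A sketch of our wrapper (the function name and the 20 s default are illustrative); aborting without a reason is what produces the DOMException code 20 / “This operation was aborted” in the logs below:

```javascript
// Sketch (illustrative names/values): default undici fetch wrapped in a
// per-attempt timeout. Calling ac.abort() with no reason rejects the fetch
// with an AbortError DOMException ("This operation was aborted", code 20).
async function fetchWithTimeout(url, init = {}, timeoutMs = 20000) {
  const ac = new AbortController();
  const timer = setTimeout(() => ac.abort(), timeoutMs);
  try {
    return await fetch(url, { ...init, signal: ac.signal });
  } finally {
    clearTimeout(timer); // don't leak the timer on success or failure
  }
}
```

During an incident, calls through this wrapper to Supabase reject at the timeout while the same wrapper against local endpoints returns instantly.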

Representative logs (UTC)

  • 2025‑09‑10 02:53:25: GET …/rest/v1/appointments?... ms=21844 code=20 error=“This operation was aborted”

  • 2025‑09‑10 02:53:25: GET …/rest/v1/clinics?select=… ms=23691 code=20 error=“This operation was aborted”

  • Our lightweight ping to Supabase also fails during the incident: “[supabase] ping_error AbortError: This operation was aborted”

What we’ve ruled out

  • DB slowness on our side: we added the right indexes (appointments (clinic_id, starts_at), scheduled_messages partial (clinic_id, send_at) WHERE sent=false AND paused=false, patients (clinic_id, updated_at), clinic_professionals (clinic_id), tag_assignments indexes). EXPLAIN ANALYZE shows ms-level execution (e.g., appointments ~0.07 ms, scheduled_messages ~0.04 ms).

  • Client settings: removed custom undici dispatcher and aggressive keep‑alive; now using default undici fetch with per‑attempt timeout and short retry/backoff; added a lightweight circuit breaker to avoid cascades.
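A minimal sketch of the retry/backoff and circuit breaker described above (thresholds, delays, and names are illustrative, not our production values):

```javascript
// Exponential backoff with full jitter (illustrative base/cap).
function backoffMs(attempt, baseMs = 250, capMs = 2000) {
  return Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
}

// Short retry loop: re-attempt the call a few times, sleeping backoffMs
// between attempts, then surface the last error.
async function withRetry(fn, attempts = 3) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) await new Promise((r) => setTimeout(r, backoffMs(i)));
    }
  }
  throw lastErr;
}

// Lightweight circuit breaker: after N consecutive failures, skip calls
// entirely until a cooldown elapses, so a stuck egress path doesn't pile
// up hung requests.
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = 0;
  }
  get state() {
    if (this.failures < this.failureThreshold) return 'closed';
    return Date.now() - this.openedAt >= this.cooldownMs ? 'half-open' : 'open';
  }
  async exec(fn) {
    if (this.state === 'open') throw new Error('circuit open, skipping call');
    try {
      const res = await fn();
      this.failures = 0; // any success closes the circuit
      return res;
    } catch (err) {
      this.failures += 1;
      if (this.failures === this.failureThreshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

The breaker keeps the process responsive during an incident, but it only masks the underlying egress problem rather than fixing it.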

Suspicion

  • An infra/network path issue between our Railway egress and Supabase (edge/CDN/PostgREST pool/rate‑limit/NAT). The problem appears only in production and correlates with container uptime, not query plans.

Asks for Railway

1) Share the current egress IP(s) for this service and check if those IPs are hitting any outbound throttling/connection limits/timeouts to jnprhvhkxggvqowmwppf.supabase.co:443.

2) Look for signs of NAT/egress gateway idle-socket recycling or per-destination concurrency limits after ~10 minutes of uptime.

3) Check DNS resolution/route health from our node to that hostname around the provided timestamps; any spikes in TLS handshake failures or SYN timeouts?

4) Confirm if there’s any shared egress policy that could intermittently impact long‑lived services, and whether a static egress IP or different egress pool/region would help.

5) Provide recommended best practices for undici/Node networking on Railway in this scenario (keep‑alive expectations, retry patterns).

Notes

  • The same code and region works fine against our staging Supabase project.

Happy to provide additional logs (full lines with URLs/timestamps) or run targeted probes you suggest.

Solved · $10 Bounty

2 Replies

jake
EMPLOYEE

3 months ago

Apologies, but this looks like an issue with the application-level code. Due to volume, we can only answer platform-level issues.

I've made this thread public so that the community might be able to help with your query.


Status changed to Awaiting User Response Railway 3 months ago


jake
EMPLOYEE

3 months ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open jake 3 months ago


Status changed to Solved andrebarros2 3 months ago

