Container-scoped NO_SOCKET on Postgres return packets, resolved only by redeploy
rohith-lua
PRO OP

11 days ago

Hey Railway Team,

We have a recurring issue where a single container loses the ability to receive return traffic from Postgres, and it only resolves after a full redeploy. I did a deep dive into one specific instance that occurred yesterday, but we've seen this intermittently across our various environments for a few months now. The core observation is that the Network Flow Logs for the bad container show repeated NO_SOCKET on inbound packets from our database, while every other container and service hitting the same DB stayed healthy the entire time.

I've done an investigation on our side so I'm going to lay out what I was able to find. If there's more information that you need I'm happy to help out in any way I can.

Incident details

Environment: Testing

Service: Clint (FastAPI/Python, async)

Bad deployment: 4e7750b3

Bad container IP: 10.132.145.95

Incident start: May 10, 2026 ~5:36 PM EDT

Resolution: Redeploy to deployment 98eb73b9-3c3f-4f1f-ade4-908fbebb1edd, new container IP 10.238.112.51

Database: Supabase Postgres (external, not Railway-hosted)

Connection path: outbound IPv4 through Railway egress

What happened

Our application traces (Logfire/OpenTelemetry) show SQLAlchemy/asyncpg hanging on Postgres connection checkout and pool_pre_ping starting at ~5:48:55 PM EDT. Once it started, every single subsequent request that needed the database failed. This continued for the entire lifetime of the container.

When I went into the Railway Network Flow Logs for the bad container, I found Postgres reply packets arriving and being tagged NO_SOCKET.

What I ruled out

  • Supabase/Postgres down: Celery workers and other services were connected to the same Supabase instance throughout the incident. All healthy, queries succeeded fine.
  • Application connection leak: We instrumented pool metrics (checked_out, checkedin, pool_size) via Logfire. Pool was not exhausted. Connections were hanging in checkout, not leaking.
  • pool_pre_ping causing it: pool_pre_ping verifies connections before checkout, but the hang was in the TCP layer below asyncpg, not in the ping query itself. Pre-ping hung because the socket was already broken.
  • engine.dispose() closing sockets: dispose() only runs on shutdown or when our pool monitor detects sustained pressure (checked_out > 20). Shutdown was at 6:20 PM EDT, after the incident started. Pool monitor never triggered.
  • pool_recycle=1800 recycling too aggressively: First failure was ~13 min after deploy, well within the 30-min recycle window. Timing doesn't fit.
  • Client-side 499s causing it: Webportal returned 499s during the incident, but these are a symptom. Webportal timed out because Clint was already hung on DB.
  • Internet routing / packet loss: Postgres reply packets are visible in Railway flow logs. They reached Railway's network edge. They just got tagged NO_SOCKET.
  • Stale DNS: DB hostname resolved to the same IP before, during, and after the incident.
  • Code change: No code changes between the working deploy and the failing deploy. Same code, same config.
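The pool_recycle timing argument above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the ~5:36 PM EDT timestamp listed under "Incident details" is when the bad deployment went live (the post does not state this explicitly); the function name `recycle_could_have_fired` is hypothetical.

```python
from datetime import datetime, timedelta

POOL_RECYCLE_S = 1800  # pool_recycle value quoted in the post

# Assumption: ~5:36 PM EDT is when the bad deployment started.
deploy_at = datetime(2026, 5, 10, 17, 36, 0)
# From the application traces quoted in the post (~5:48:55 PM EDT).
first_hang_at = datetime(2026, 5, 10, 17, 48, 55)


def recycle_could_have_fired(deploy: datetime, event: datetime, recycle_s: int) -> bool:
    """True only if a connection opened at deploy time is old enough to be recycled."""
    return (event - deploy) >= timedelta(seconds=recycle_s)


age = first_hang_at - deploy_at
print(f"oldest possible connection age: {int(age.total_seconds())} s")
print("recycle window reached:", recycle_could_have_fired(deploy_at, first_hang_at, POOL_RECYCLE_S))
```

Under that assumption the oldest possible pooled connection was 775 seconds old at the first hang, well short of the 1800-second recycle threshold.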

Why I believe this is platform-layer

  1. Postgres was replying. The return packets reached Railway's edge.
  2. Railway's own flow logs tagged those replies NO_SOCKET, meaning the platform could not match them to a local socket on the container.
  3. This was scoped entirely to one container IP (10.132.145.95). Everything else in our stack was fine.
  4. Redeploying to a new container immediately fixed it with zero code or config changes.
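For anyone trying to reproduce this, one way to separate "a pooled socket went stale" from "the container cannot complete any new connection" is to open a brand-new TCP connection from inside the affected container, bypassing the pool entirely. The sketch below uses only the Python standard library; `probe_tcp` is a hypothetical helper, and the host and port in the usage comment are placeholders, not our real values.

```python
import socket
import time


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> dict:
    """Open a fresh TCP connection and report whether the handshake completed."""
    started = time.monotonic()
    try:
        # create_connection performs the full TCP handshake or raises.
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except socket.timeout:
        # No SYN-ACK arrived in time, e.g. return packets never reached the socket.
        outcome = "timeout"
    except OSError as exc:
        # Refused, unreachable, DNS failure, and similar errors.
        outcome = f"error: {exc.__class__.__name__}"
    return {"outcome": outcome, "elapsed_s": round(time.monotonic() - started, 3)}


# Hypothetical usage from inside the affected container:
# print(probe_tcp("db.example.invalid", 5432))
```

If this times out on the bad container while succeeding from a healthy one against the same address, that would suggest the problem affects new sockets too and is not limited to connections already sitting in the pool.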

I wasn't able to find any similar threads. As I mentioned earlier, if you need any other information, I'm happy to provide whatever helps with this investigation!

$20 Bounty

3 Replies

Status changed to Awaiting Railway Response Railway 11 days ago


Railway
BOT

7 days ago

This thread has been marked as public for community involvement, as it does not contain any sensitive or personal information. Any further activity in this thread will be visible to everyone.

Status changed to Open Railway 7 days ago


rohith-lua
PRO OP

7 days ago

No. Our SQLAlchemy async engine uses asyncpg with pool_pre_ping, pool_recycle, pool_timeout, and asyncpg connection/query timeouts, but we are not currently setting custom TCP keepalive socket options through connect_args.
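For reference, this is roughly what "custom TCP keepalive socket options" means at the socket level. It is a standard-library sketch only: `enable_keepalive` is a hypothetical helper, the interval values are illustrative assumptions, and how (or whether) such options can be passed through SQLAlchemy and asyncpg is not shown here.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 10, probes: int = 5) -> None:
    """Turn on TCP keepalive and, where the platform exposes them, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants are platform-dependent (present on Linux), so set each
    # one only if this Python build exposes it.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the socket is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```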


rohith-lua
PRO OP

7 days ago

The TCP keepalive theory explains how an idle pooled socket can become stale, but I do not think it explains, by itself, why all subsequent DB connectivity fails once this issue starts. With pool_pre_ping, a stale socket should be detected, invalidated, and replaced with a fresh connection. Any thoughts given that context?
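To make the reasoning concrete, here is a toy model of the detect, invalidate, replace behaviour I would expect from pre-ping. It is not SQLAlchemy's implementation; `FakeConn` and `ToyPool` are hypothetical names used only to contrast two cases: one stale socket on a healthy network path, versus a path where even fresh connections cannot get replies.

```python
from collections import deque
from typing import Callable, Deque


class FakeConn:
    """Stand-in for a DB connection; `alive` models whether replies come back."""

    def __init__(self, alive: bool = True) -> None:
        self.alive = alive

    def ping(self) -> bool:
        return self.alive


class ToyPool:
    """Toy pool with pre-ping: test on checkout, replace once if the test fails."""

    def __init__(self, connect: Callable[[], FakeConn]) -> None:
        self._connect = connect
        self._idle: Deque[FakeConn] = deque()
        self.replaced = 0

    def checkin(self, conn: FakeConn) -> None:
        self._idle.append(conn)

    def checkout(self) -> FakeConn:
        conn = self._idle.popleft() if self._idle else self._connect()
        if conn.ping():
            return conn
        # Stale connection: drop it and open a fresh one.
        self.replaced += 1
        fresh = self._connect()
        if not fresh.ping():
            # Replacement cannot help when new connections fail too.
            raise ConnectionError("fresh connection is also unusable")
        return fresh
```

In the first case checkout succeeds after one replacement; in the second it fails no matter how many connections are recycled, which is closer to what we observed.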


gyanavkhandelwal6396-cmyk
FREE

7 days ago

This appears to be a Railway platform/network-layer issue: return packets from Postgres reached the container edge but were dropped as NO_SOCKET, which points to stale socket/NAT state isolated to container 10.132.145.95.

Since redeploying immediately resolved it with no code or config changes, I recommend that Railway investigate deployment 4e7750b3. In the meantime, shorter-lived pooled connections and automatic container restarts can serve as mitigation.

