a month ago
Postgres connections hang silently via *.railway.internal — reproduces on brand-new project
Started 2026-04-24 ~00:50 UTC, immediately after the Apr 23 IPv6 egress rollout. ~50 deploys over 32+ hours all fail identically. Reproduced today on a brand-new project — not project-scoped state.
Stack
Node 20.20.2, Medusa 2.13.2, Knex 3.1.0, node-postgres 8.16.0. postgres-ssl:17.9 (old project), :18 (new). Railpack builder. Region us-east4-eqdc4a.
Symptom
Every medusa start hangs in Knex pool acquisition for 60s, then:
KnexTimeoutError: Timeout acquiring a connection. Pool is probably full.
at Client_PG.acquireConnection (knex/lib/client.js:332)
at async pgConnectionLoader (@medusajs/framework/database/pg-connection-loader.ts:74)
sql: 'SELECT 1'
Retries every 60s for 5 min, exits 1. No ECONNREFUSED, ENETUNREACH, ETIMEDOUT, EHOSTUNREACH, or any TCP error logged.
NODE_DEBUG=net,tls,dns produces zero socket events during the 60s pool-acquire window — pg never reaches net.connect(). Postgres-side logs show zero connection attempts during the failure window.
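The 60s wait matches Knex's default `acquireConnectionTimeout` of 60000 ms, and node-postgres applies no connect timeout unless `connectionTimeoutMillis` is set, so a connection attempt that never settles may surface only as the pool error above. A minimal sketch of a fail-fast config, assuming a plain Knex config object; `buildKnexConfig` and the timeout values are illustrative, and how Medusa forwards driver options is not covered here:

```javascript
// Hypothetical helper: returns a Knex config that reports a stalled
// connection attempt quickly instead of waiting out the pool timeout.
function buildKnexConfig(databaseUrl) {
  return {
    client: "pg",
    connection: {
      connectionString: databaseUrl,
      // node-postgres option; the default of 0 means no connect timeout.
      connectionTimeoutMillis: 5000,
    },
    // Illustrative pool size, not taken from the failing project.
    pool: { min: 0, max: 5 },
    // Knex option; defaults to 60000 ms, which is the 60s seen above.
    acquireConnectionTimeout: 10000,
  };
}
```

With the driver timeout shorter than the pool timeout, a stalled connect should be reported by node-postgres itself rather than hidden behind KnexTimeoutError.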
Definitive reproducer (no app code)
From inside the Postgres container itself:
psql -h localhost -U postgres -d railway -c 'SELECT 1'
# → connects, returns 1
psql -h postgres.railway.internal -U postgres -d railway -c 'SELECT 1'
# → hangs indefinitely
getent ahosts postgres.railway.internal returns the container's own IPv4 + IPv6 (confirmed via hostname -I). Postgres logs listening on 0.0.0.0:5432 and :::5432. Traffic to those addresses doesn't reach Postgres. bash exec 3<>/dev/tcp/postgres.railway.internal/5432 reports exit 0 (TCP handshake "completes") but no Postgres-protocol data flows.
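One caveat on the /dev/tcp result: a Postgres server sends nothing until the client sends a startup or SSLRequest message, so a silent open socket is not abnormal on its own. A probe that speaks first is more telling. A minimal Node sketch, assuming only the standard `net` module; `probePostgres` and the other function names are hypothetical:

```javascript
const net = require("node:net");

// The 8-byte SSLRequest message: Int32 length (8), then Int32 code 80877103.
function buildSslRequest() {
  const buf = Buffer.alloc(8);
  buf.writeInt32BE(8, 0);
  buf.writeInt32BE(80877103, 4);
  return buf;
}

// The server answers SSLRequest with a single byte: 'S' or 'N'.
function classifyReply(byte) {
  if (byte === 0x53) return "ssl-supported";
  if (byte === 0x4e) return "ssl-refused";
  return "unexpected";
}

// Connects, sends SSLRequest, and reports how far the exchange got.
// family: 0 = let the resolver choose, 4 = IPv4 only, 6 = IPv6 only.
function probePostgres(host, port = 5432, timeoutMs = 5000, family = 0) {
  return new Promise((resolve) => {
    const started = Date.now();
    let phase = "connecting";
    const socket = net.connect({ host, port, family });
    const timer = setTimeout(() => finish("timeout"), timeoutMs);

    function finish(outcome) {
      clearTimeout(timer);
      socket.destroy();
      resolve({ outcome, phase, ms: Date.now() - started });
    }

    socket.once("error", (err) => finish(err.code || "error"));
    socket.once("connect", () => {
      phase = "tcp-connected";
      socket.write(buildSslRequest());
    });
    socket.once("data", (chunk) => {
      phase = "server-replied";
      finish(classifyReply(chunk[0]));
    });
  });
}
```

Usage would be along the lines of `probePostgres("postgres.railway.internal").then(console.log)`. An outcome of `timeout` with phase `tcp-connected` would match the hang described here: the handshake completes but the server never answers. Passing `family` as 4 or 6 separates the two address families.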
Brand-new project reproduces (today, 2026-04-25)
- Created brand-new project, fresh Postgres template, fresh Redis, same workspace/region
- Restored 102MB schema dump from old project — clean (136 tables, exit 0)
- Uploaded same Medusa code via railway up
- Same KnexTimeoutError: SELECT 1, same 60s cadence, same 5-min crash.
Eliminates project-state, deploy cache, lockfile drift, and Medusa version as causes. The bug is in Railway's private-network routing.
Ruled out
| Hypothesis | How tested | Result |
| --- | --- | --- |
| Wrong creds | Compare DATABASE_URL pw vs PGPASSWORD | Match |
| Postgres unhealthy | External psql via TCP proxy | Connects, queries fine |
| DNS failure | getent ahosts from app container | Returns IPv4 + IPv6 |
| pg IPv6-first preference | NODE_OPTIONS=--dns-result-order=ipv6first | No effect |
| ipv6EgressEnabled config | Toggled false → true, redeploys | No effect |
| SSL handshake hang | Removed ?sslmode=require | No effect |
| Public proxy bypass | DATABASE_URL → DATABASE_PUBLIC_URL | Same hang |
| Stale build cache | Multiple fresh Railpack rebuilds | No effect |
| Schema mismatch | Last mikro_orm migration verified | No drift |
| Project-scoped state | Brand-new project repro | Bug confirmed wider |
Timeline (UTC)
- 04-22 22:56 — Postgres image pinned :17 → :17.9 (only diff: pgvector)
- 04-22 23:01 — Postgres redeploy
- 04-22 23:04 — Last successful backend deploy, ran healthy ~25 hrs
- 04-23 — Railway outbound IPv6 egress rollout (per changelog)
- 04-24 00:49 — Postgres redeploy
- 04-24 00:50 — First backend-app failure
- 04-24 → 25 — ~50 deploys, all fail identically
- 04-25 — Brand-new project reproduces
Prior similar incidents
Two Central Station threads describe similar symptoms during prior platform rollouts, both resolved by Railway-side rollbacks:
- station.railway.com/questions/internal-network-connection-issue-app-8ddfdab9 (Dec 3, 2025)
- station.railway.com/questions/connection-issue-to-postgre-sql-via-priva-35b9ae64 (Dec 11–16, 2025)
Asks
- Is this a known regression of the Apr 23 rollout for Medusa/Knex/node-postgres connection patterns?
- Can private-network routing config be reapplied for our services?
- Workaround: any way to force *.railway.internal → IPv4-only until fixed?
- If migration is required, please advise on volume-data migration path between projects.
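On the IPv4-only question, one client-side approach would be to resolve the hostname to an IPv4 address at startup and connect to that literal address. This is untested against this failure and is only a sketch; `pinHostToAddress` and `ipv4OnlyDatabaseUrl` are hypothetical names:

```javascript
const dns = require("node:dns").promises;

// Swap the hostname in a connection URL for a literal IPv4 address,
// leaving credentials, port, database name, and query string untouched.
function pinHostToAddress(connectionUrl, ipv4) {
  const url = new URL(connectionUrl);
  url.hostname = ipv4;
  return url.toString();
}

// Ask the resolver for IPv4 results only, then pin the URL to that address.
async function ipv4OnlyDatabaseUrl(connectionUrl) {
  const { hostname } = new URL(connectionUrl);
  const { address } = await dns.lookup(hostname, { family: 4 });
  return pinHostToAddress(connectionUrl, address);
}
```

The result of `ipv4OnlyDatabaseUrl(process.env.DATABASE_URL)` could be passed to the database client in place of the original URL. Two caveats: a private address can change on redeploy, so the lookup should run at each start, and connecting by IP address would break hostname verification under stricter TLS modes.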
Project IDs, deployment IDs, dump file, and full logs available on request.
Severity: critical. Production launch blocked.
2 Replies
Status changed to Awaiting Railway Response Railway • about 1 month ago
Status changed to Open Railway • about 1 month ago
a month ago
Update — partial resolution found, root cause not fully isolated.
Worked around the issue by spinning up a brand-new Railway project today
(2026-04-25/26) and deploying a clean Medusa 2.14.1 codebase via railway up.
That deploy connected to Railway-hosted Postgres without any KnexTimeout.
Server started, migrations ran, /health returns 200, pg_stat_activity shows
healthy idle connections.
What we changed between the broken and working setup:
- Bootstrapped fresh from create-medusa-app@2.14.0 (then bumped to 2.14.1)
- Flattened the bootstrap's nested monorepo wrapper to single-package
- Eliminated the lockfile drift (the old project had @medusajs/* packages
resolving to a mix of 2.13.1, 2.13.2, and 2.14.0 across the dep tree)
- Set startCommand per Medusa's deployment guide:
cd .medusa/server && npm install --legacy-peer-deps && npm run predeploy && npm run start
- Added .npmrc with legacy-peer-deps=true
What we did NOT change:
- Same Railway region (us-east4-eqdc4a)
- Same *.railway.internal private DNS pattern
- Same Medusa pgConnectionLoader code path
- Same Knex/pg version
Open question for Railway team: the original failure mode (TCP appears to
connect via bash /dev/tcp but pg never reaches net.connect(), no error
codes logged, repro from inside the Postgres container itself with
psql -h postgres.railway.internal hanging while psql -h localhost worked)
is still not explained. If that asymmetry was a transient platform issue
that's since resolved, please confirm. Otherwise it's a latent issue that
could re-trigger.
For now: production unblocked, no longer urgent. Happy to leave the ticket
open if you want the diagnostic data, or close it as worked-around.
Status changed to Solved ianrothfuss • about 1 month ago
Status changed to Open brody • about 1 month ago
10 days ago
A few things in your repro strongly point away from Medusa/Knex itself and toward Railway’s internal networking layer — specifically the interaction between the private DNS resolver, dual-stack routing, and loopback/self-resolution behavior after the Apr 23 IPv6 rollout.
The key signal for me is this:
psql -h localhost ...
psql -h postgres.railway.internal ...
from inside the Postgres container itself.
That effectively removes Knex, Medusa, pg pools, and most app-layer causes from the equation.
What’s especially interesting is this part:
bash exec 3<>/dev/tcp/postgres.railway.internal/5432
meaning TCP SYN/SYN-ACK completed, but the PostgreSQL protocol handshake never progressed. Combined with:
- no net.connect() activity
- no Postgres logs
- DNS resolving both IPv4 + IPv6
- issue starting immediately after IPv6 egress rollout
this smells like a blackhole/routing asymmetry where the internal hostname resolves to an address reachable at the TCP layer but not correctly hairpinned back into the container/service network namespace.
Possible theory:
postgres.railway.internal may now resolve preferentially to an IPv6 address that successfully accepts the TCP handshake through Railway’s overlay/proxy layer, but packets containing the actual Postgres protocol stream are never forwarded back to the originating container/service correctly.
The strongest evidence is that the issue reproduced even inside the database container itself. That should normally be the simplest successful path possible.
A few things I would still test if you haven’t already:
getent ahostsv4 postgres.railway.internal
getent ahostsv6 postgres.railway.internal
then force explicit address-family tests:
psql "hostaddr=<IPv4 address> host=postgres.railway.internal ..."
psql "hostaddr=<IPv6 address> host=postgres.railway.internal ..."
Also:
PGHOSTADDR=<IPv4 address> psql ...
If IPv4 succeeds while hostname resolution hangs, that would strongly confirm a resolver/routing regression around IPv6 preference ordering or internal overlay translation.
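The per-family connection strings can be generated rather than typed by hand. A rough Node sketch, assuming the `user=postgres dbname=railway` values from the original repro; `splitByFamily`, `conninfoFor`, and `perFamilyConninfo` are names made up for illustration:

```javascript
const dns = require("node:dns").promises;

// Group resolver results by address family.
function splitByFamily(records) {
  return {
    ipv4: records.filter((r) => r.family === 4).map((r) => r.address),
    ipv6: records.filter((r) => r.family === 6).map((r) => r.address),
  };
}

// Build a libpq conninfo string that pins the address but keeps the
// hostname; connect_timeout (seconds) makes psql give up instead of hanging.
function conninfoFor(host, address, rest = "user=postgres dbname=railway") {
  return `hostaddr=${address} host=${host} ${rest} connect_timeout=5`;
}

// Resolve every address for the host and emit one conninfo per address,
// IPv4 entries first.
async function perFamilyConninfo(host) {
  const records = await dns.lookup(host, { all: true });
  const { ipv4, ipv6 } = splitByFamily(records);
  return [...ipv4, ...ipv6].map((address) => conninfoFor(host, address));
}
```

Each returned string can be passed to psql as a single quoted argument. If the IPv4 string connects and the IPv6 string times out, that points at the address family rather than the hostname.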
Another interesting datapoint is that /dev/tcp succeeded but libpq stalled. That can happen if:
- SYN succeeds
- socket established
- but no readable bytes ever arrive after startup packet transmission
which usually indicates transparent proxying / conntrack / overlay routing weirdness rather than application failure.
The fact that rebuilding into a clean Medusa 2.14.x project “fixed” it may simply mean the underlying Railway networking issue was transiently corrected during later deploys/reallocations, not necessarily that dependency drift caused the original symptom.
I think the real fix here was likely forcing Railway to allocate fresh internal networking state by creating a brand-new project.
Your own repro isolates this below the application layer:
psql -h postgres.railway.internal ...
A hang on that command, run from inside the Postgres container, should never happen if the private overlay routing is healthy.
My guess is one of:
- stale service-discovery/network metadata after the Apr 23 IPv6 rollout
- broken IPv6 hairpin routing for *.railway.internal
- corrupted overlay routing state tied to the original project allocation
Creating a fresh project likely forced:
- new internal DNS entries
- new overlay network allocation
- new service mesh routes
- fresh IPv6 bindings
which explains why the exact same stack suddenly worked without meaningful pg/Knex changes.
You could probably validate this by comparing:
dig postgres.railway.internal AAAA
dig postgres.railway.internal A
ip -6 route
ss -ltnp
between the broken and working projects.
If Railway engineering can repro the “self-connect via internal hostname hangs” behavior, I suspect they’ll find an overlay or IPv6 routing regression introduced during the Apr 23 rollout.