a day ago
Postgres connections hang silently via *.railway.internal — reproduces on brand-new project
Started 2026-04-24 ~00:50 UTC, immediately after the Apr 23 IPv6 egress rollout. ~50 deploys over 32+ hours all fail identically. Reproduced today on a brand-new project — not project-scoped state.
Stack
Node 20.20.2, Medusa 2.13.2, Knex 3.1.0, node-postgres 8.16.0. postgres-ssl:17.9 (old project), :18 (new). Railpack builder. Region us-east4-eqdc4a.
Symptom
Every medusa start hangs in Knex pool acquisition for 60s, then:
KnexTimeoutError: Timeout acquiring a connection. Pool is probably full.
at Client_PG.acquireConnection (knex/lib/client.js:332)
at async pgConnectionLoader (@medusajs/framework/database/pg-connection-loader.ts:74)
sql: 'SELECT 1'
Retries every 60s for 5 min, exits 1. No ECONNREFUSED, ENETUNREACH, ETIMEDOUT, EHOSTUNREACH, or any TCP error logged.
NODE_DEBUG=net,tls,dns produces zero socket events during the 60s pool-acquire window — pg never reaches net.connect(). Postgres-side logs show zero connection attempts during the failure window.
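Side note for anyone reproducing: the 60s cadence matches Knex's default acquireConnectionTimeout of 60000 ms. A minimal standalone probe, independent of Medusa (a sketch only; the file name, pool sizes, and shortened timeout are illustrative, not from the failing app):

```ts
// probe-knex.ts (illustrative): reproduce the pool-acquire hang outside
// Medusa. Assumes DATABASE_URL is the same *.railway.internal string.
import knex from "knex";

const db = knex({
  client: "pg",
  connection: process.env.DATABASE_URL,
  // Knex defaults to 60000 ms here, which explains the 60s log cadence;
  // shortened so each attempt fails fast.
  acquireConnectionTimeout: 10_000,
  pool: {
    min: 0,
    max: 1,
    // Fires only once pg actually hands the pool a live connection.
    // During the failure window this never logs, consistent with pg
    // stalling before net.connect().
    afterCreate: (conn: unknown, done: (err: Error | null, conn: unknown) => void) => {
      console.log("pool afterCreate: connection established");
      done(null, conn);
    },
  },
});

db.raw("SELECT 1")
  .then((res) => console.log("ok:", res.rows))
  .catch((err) => console.error("failed:", err.message))
  .finally(() => db.destroy());
```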
Definitive reproducer (no app code)
From inside the Postgres container itself:
psql -h localhost -U postgres -d railway -c 'SELECT 1'
# → connects, returns 1
psql -h postgres.railway.internal -U postgres -d railway -c 'SELECT 1'
# → hangs indefinitely
getent ahosts postgres.railway.internal returns the container's own IPv4 + IPv6 (confirmed via hostname -I). Postgres logs listening on 0.0.0.0:5432 and :::5432. Traffic to those addresses doesn't reach Postgres. bash exec 3<>/dev/tcp/postgres.railway.internal/5432 reports exit 0 (TCP handshake "completes") but no Postgres-protocol data flows.
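To separate the two layers more precisely than /dev/tcp can, a small Node probe (a sketch, assuming the same host/port; not one of the commands above) sends the first real wire-protocol message, an SSLRequest, after connecting. A reachable Postgres answers it with a single 'S' or 'N' byte immediately, even with SSL disabled:

```ts
// tcp-vs-protocol.ts (sketch): distinguish "TCP handshake completes"
// from "Postgres protocol traffic flows".
import net from "node:net";

const sock = net.connect(5432, "postgres.railway.internal", () => {
  console.log("TCP connected (same result as the /dev/tcp test)");
  // First bytes of the Postgres wire protocol: an SSLRequest message,
  // Int32 length = 8 followed by the magic code 80877103. A reachable
  // server replies with a single byte, 'S' or 'N'.
  const sslRequest = Buffer.alloc(8);
  sslRequest.writeInt32BE(8, 0);
  sslRequest.writeInt32BE(80877103, 4);
  sock.write(sslRequest);
});

sock.setTimeout(5_000, () => {
  console.error("TCP is up but no protocol reply in 5s");
  sock.destroy();
});

sock.once("data", (buf) => {
  console.log(`protocol reply: '${buf.toString("latin1", 0, 1)}'`);
  sock.end();
});

sock.on("error", (err) => console.error("socket error:", err.message));
```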
Brand-new project reproduces (today, 2026-04-25)
Created brand-new project, fresh Postgres template, fresh Redis, same workspace/region
Restored 102MB schema dump from old project — clean (136 tables, exit 0)
Uploaded the same Medusa code via railway up: same KnexTimeoutError: SELECT 1, same 60s cadence, same 5-min crash.
Eliminates project-state, deploy cache, lockfile drift, and Medusa version as causes. The bug is in Railway's private-network routing.
Ruled out
| Hypothesis | How tested | Result |
|---|---|---|
| Wrong creds | Compared DATABASE_URL password vs PGPASSWORD | Match |
| Postgres unhealthy | External psql via TCP proxy | Connects, queries fine |
| DNS failure | getent ahosts from app container | Returns IPv4 + IPv6 |
| pg IPv6-first preference | NODE_OPTIONS=--dns-result-order=ipv6first | No effect |
| ipv6EgressEnabled config | Toggled false → true, redeployed | No effect |
| SSL handshake hang | Removed ?sslmode=require | No effect |
| Public proxy bypass | DATABASE_URL → DATABASE_PUBLIC_URL | Same hang |
| Stale build cache | Multiple fresh Railpack rebuilds | No effect |
| Schema mismatch | Last mikro_orm migration verified | No drift |
| Project-scoped state | Brand-new project repro | Bug confirmed wider |
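For the DNS rows above: net.connect, and therefore pg, resolves through dns.lookup, so the Node-side equivalent of getent ahosts is dns.lookup with all: true. A quick sketch (the verbatim flag preserves resolver order, which is what --dns-result-order reorders):

```ts
// dns-check.ts (sketch): what pg's default resolver path would see for
// the private hostname, in the order Node would try the addresses.
import dns from "node:dns/promises";

const host = "postgres.railway.internal";
const addrs = await dns.lookup(host, { all: true, verbatim: true });
for (const { address, family } of addrs) {
  console.log(`${host} -> ${address} (IPv${family})`);
}
```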
Timeline (UTC)
04-22 22:56 — Postgres image pinned :17 → :17.9 (only diff: pgvector)
04-22 23:01 — Postgres redeploy
04-22 23:04 — Last successful backend deploy, ran healthy ~25 hrs
04-23 — Railway outbound IPv6 egress rollout (per changelog)
04-24 00:49 — Postgres redeploy
04-24 00:50 — First backend-app failure
04-24 → 04-25 — ~50 deploys, all fail identically
04-25 — Brand-new project reproduces
Prior similar incidents
Two Central Station threads describe similar symptoms during prior platform rollouts, both resolved by Railway-side rollbacks:
station.railway.com/questions/internal-network-connection-issue-app-8ddfdab9 (Dec 3, 2025)
station.railway.com/questions/connection-issue-to-postgre-sql-via-priva-35b9ae64 (Dec 11–16, 2025)
Asks
Is this a known regression of the Apr 23 rollout for Medusa/Knex/node-postgres connection patterns?
Can private-network routing config be reapplied for our services?
Workaround: is there any way to force *.railway.internal to resolve IPv4-only until this is fixed? (A client-side sketch of what we mean is just below.)
If migration is required, please advise on the volume-data migration path between projects.
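On the IPv4-only ask, the closest client-side approximation we can think of (untested against this failure; a sketch, not a confirmed workaround) is to resolve the hostname ourselves with family: 4 and hand pg the literal address:

```ts
// ipv4-only.ts (sketch): keep the AAAA record out of the picture by
// resolving *.railway.internal to IPv4 before pg ever sees the name.
import dns from "node:dns/promises";
import { Client } from "pg";

const host = "postgres.railway.internal";
const { address } = await dns.lookup(host, { family: 4 });

const client = new Client({
  host: address, // literal IPv4 address instead of the hostname
  port: 5432,
  user: "postgres",
  password: process.env.PGPASSWORD,
  database: "railway",
  connectionTimeoutMillis: 10_000,
});

await client.connect();
console.log((await client.query("SELECT 1 AS ok")).rows);
await client.end();
```

Caveat: connecting by literal IP can interfere with TLS certificate verification if SSL is ever re-enabled, so this would be a stopgap at best.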
Project IDs, deployment IDs, dump file, and full logs available on request.
Severity: critical. Production launch blocked.
Status changed to Awaiting Railway Response Railway • about 22 hours ago
Status changed to Open Railway • about 22 hours ago
10 hours ago
Update — partial resolution found, root cause not fully isolated.
Worked around the issue by spinning up a brand-new Railway project today
(2026-04-25/26) and deploying a clean Medusa 2.14.1 codebase via railway up.
That deploy connected to Railway-hosted Postgres without any KnexTimeout.
Server started, migrations ran, /health returns 200, pg_stat_activity shows
healthy idle connections.
What we changed between the broken and working setup:
- Bootstrapped fresh from create-medusa-app@2.14.0 (then bumped to 2.14.1)
- Flattened the bootstrap's nested monorepo wrapper to single-package
- Eliminated the lockfile drift (the old project had @medusajs/* packages
resolving to a mix of 2.13.1, 2.13.2, and 2.14.0 across the dep tree; a
quick way to check for this is sketched after these lists)
- Set startCommand per Medusa's deployment guide:
cd .medusa/server && npm install --legacy-peer-deps && npm run predeploy && npm run start
- Added .npmrc with legacy-peer-deps=true
What we did NOT change:
- Same Railway region (us-east4-eqdc4a)
- Same *.railway.internal private DNS pattern
- Same Medusa pgConnectionLoader code path
- Same Knex/pg version
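For the lockfile-drift item above, a hypothetical helper (names are ours) that flags mixed @medusajs/* versions in an npm v2/v3 package-lock.json:

```ts
// check-lockfile-drift.ts (hypothetical helper): list every resolved
// @medusajs/* version in package-lock.json and flag mixed versions.
import { readFileSync } from "node:fs";

const lock = JSON.parse(readFileSync("package-lock.json", "utf8"));
const versions = new Map<string, Set<string>>();

// Lockfile v2/v3 keeps a flat "packages" map keyed by install path.
for (const [path, meta] of Object.entries<any>(lock.packages ?? {})) {
  const match = path.match(/node_modules\/(@medusajs\/[^/]+)$/);
  if (match && meta.version) {
    const set = versions.get(match[1]) ?? new Set<string>();
    set.add(meta.version);
    versions.set(match[1], set);
  }
}

for (const [pkg, vs] of versions) {
  console.log(`${pkg}: ${[...vs].join(", ")}${vs.size > 1 ? "  <-- drift" : ""}`);
}
```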
Open question for Railway team: the original failure mode (TCP appears to
connect via bash /dev/tcp but pg never reaches net.connect(), no error
codes logged, repro from inside the Postgres container itself with
psql -h postgres.railway.internal hanging while psql -h localhost worked)
is still not explained. If that asymmetry was a transient platform issue
that's since resolved, please confirm. Otherwise it's a latent issue that
could re-trigger.
For now: production unblocked, no longer urgent. Happy to leave the ticket
open if you want the diagnostic data, or close it as worked-around.
Status changed to Solved ianrothfuss • about 10 hours ago
Status changed to Open brody • about 10 hours ago