Railway Private Network: Postgres Connections Hang Silently — Reproduces on Brand-New Project
ianrothfuss
HOBBYOP

a day ago


Started 2026-04-24 ~00:50 UTC, immediately after the Apr 23 IPv6 egress rollout. ~50 deploys over 32+ hours all fail identically. Reproduced today on a brand-new project — not project-scoped state.

Stack

Node 20.20.2, Medusa 2.13.2, Knex 3.1.0, node-postgres 8.16.0. postgres-ssl:17.9 (old project), :18 (new). Railpack builder. Region us-east4-eqdc4a.

Symptom

Every medusa start hangs in Knex pool acquisition for 60s, then:

KnexTimeoutError: Timeout acquiring a connection. Pool is probably full.
  at Client_PG.acquireConnection (knex/lib/client.js:332)
  at async pgConnectionLoader (@medusajs/framework/database/pg-connection-loader.ts:74)
  sql: 'SELECT 1'

Retries every 60s for 5 min, exits 1. No ECONNREFUSED, ENETUNREACH, ETIMEDOUT, EHOSTUNREACH, or any TCP error logged.
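For context, the 60s cadence matches Knex's default acquireConnectionTimeout of 60000 ms, so each retry is one full pool-acquire timeout. A minimal sketch of the relevant knob (values illustrative, not our production config):

import knex from "knex";

const db = knex({
  client: "pg",
  connection: process.env.DATABASE_URL,
  // Knex default is 60000 ms; this is the timer behind each 60s hang.
  // Lowering it makes the failure surface faster while debugging.
  acquireConnectionTimeout: 10_000,
});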

NODE_DEBUG=net,tls,dns produces zero socket events during the 60s pool-acquire window — pg never reaches net.connect(). Postgres-side logs show zero connection attempts during the failure window.
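To take Knex and Medusa out of the loop entirely, a standalone pg probe (hypothetical script; assumes the service's own DATABASE_URL) should either connect or raise a timeout directly from pg:

import { Client } from "pg";

// connectionTimeoutMillis makes pg fail loudly instead of waiting on the
// pool, so any hang here points below Knex.
const client = new Client({
  connectionString: process.env.DATABASE_URL, // same var the service uses
  connectionTimeoutMillis: 10_000,
});

client
  .connect()
  .then(() => client.query("SELECT 1"))
  .then((res) => { console.log("ok:", res.rows); return client.end(); })
  .catch((err) => { console.error("pg failed:", err.message); process.exit(1); });

Running it under NODE_DEBUG=net,dns confirms whether net.connect() is ever reached.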

Definitive reproducer (no app code)

From inside the Postgres container itself:

psql -h localhost -U postgres -d railway -c 'SELECT 1'
# → connects, returns 1

psql -h postgres.railway.internal -U postgres -d railway -c 'SELECT 1'
# → hangs indefinitely

getent ahosts postgres.railway.internal returns the container's own IPv4 + IPv6 (confirmed via hostname -I). Postgres logs listening on 0.0.0.0:5432 and :::5432. Traffic to those addresses doesn't reach Postgres. bash exec 3<>/dev/tcp/postgres.railway.internal/5432 reports exit 0 (TCP handshake "completes") but no Postgres-protocol data flows.
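The handshake-only behavior can be double-checked without psql. A raw-socket sketch (hostname from the report; the probe itself is illustrative) sends a Postgres SSLRequest and waits for the single 'S' or 'N' byte a live server answers with:

import net from "node:net";

const host = process.argv[2] ?? "postgres.railway.internal";
const socket = net.connect(5432, host, () => {
  console.log("TCP connected to", host);
  // Postgres SSLRequest: Int32 length = 8, Int32 code = 80877103.
  const msg = Buffer.alloc(8);
  msg.writeInt32BE(8, 0);
  msg.writeInt32BE(80877103, 4);
  socket.write(msg);
});
socket.setTimeout(5000, () => {
  console.log("no protocol reply in 5s: handshake-only connection");
  socket.destroy();
});
socket.on("data", (buf) => {
  // A real Postgres replies with a single 'S' or 'N' byte.
  console.log("server replied:", buf.toString("latin1", 0, 1));
  socket.end();
});
socket.on("error", (err) => console.error("socket error:", err.message));

If this prints "TCP connected" and then times out, it reproduces the /dev/tcp result at the protocol level.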

Brand-new project reproduces (today, 2026-04-25)

  1. Created brand-new project, fresh Postgres template, fresh Redis, same workspace/region

  2. Restored 102MB schema dump from old project — clean (136 tables, exit 0)

  3. Uploaded same Medusa code via railway up

  4. Same KnexTimeoutError: SELECT 1, same 60s cadence, same 5-min crash.

Eliminates project-state, deploy cache, lockfile drift, and Medusa version as causes. The bug is in Railway's private-network routing.

Ruled out

Hypothesis                | How tested                                    | Result
Wrong creds               | Compared DATABASE_URL password vs PGPASSWORD  | Match
Postgres unhealthy        | External psql via TCP proxy                   | Connects, queries fine
DNS failure               | getent ahosts from app container              | Returns IPv4 + IPv6
pg IPv6-first preference  | NODE_OPTIONS=--dns-result-order=ipv6first     | No effect
ipv6EgressEnabled config  | Toggled false → true, redeploys               | No effect
SSL handshake hang        | Removed ?sslmode=require                      | No effect
Public proxy bypass       | DATABASE_URL → DATABASE_PUBLIC_URL            | Same hang
Stale build cache         | Multiple fresh Railpack rebuilds              | No effect
Schema mismatch           | Last mikro_orm migration verified             | No drift
Project-scoped state      | Brand-new project repro                       | Bug confirmed wider

Timeline (UTC)

  • 04-22 22:56 — Postgres image pinned :17 → :17.9 (only diff: pgvector)

  • 04-22 23:01 — Postgres redeploy

  • 04-22 23:04 — Last successful backend deploy, ran healthy ~25 hrs

  • 04-23 — Railway outbound IPv6 egress rollout (per changelog)

  • 04-24 00:49 — Postgres redeploy

  • 04-24 00:50 — First backend-app failure

  • 04-24 → 25 — ~50 deploys, all fail identically

  • 04-25 — Brand-new project reproduces

Prior similar incidents

Two Central Station threads describe similar symptoms during prior platform rollouts; both were resolved by Railway-side rollbacks.

Asks

  1. Is this a known regression of the Apr 23 rollout for Medusa/Knex/node-postgres connection patterns?

  2. Can private-network routing config be reapplied for our services?

  3. Workaround: any way to force *.railway.internal → IPv4-only until fixed? (One candidate approach is sketched below this list.)

  4. If migration is required, please advise on volume-data migration path between projects.
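On ask 3, one candidate stopgap (an assumption on our side, not a confirmed fix): pin Node's resolver order to IPv4 and, belt and suspenders, hand pg a literal A-record address so no AAAA result can be chosen:

import { setDefaultResultOrder } from "node:dns";
import { resolve4 } from "node:dns/promises";
import knex from "knex";

// Process-wide: prefer A records wherever Node uses dns.lookup().
setDefaultResultOrder("ipv4first");

async function makeDb() {
  // Resolve the internal hostname to its A record up front (getent above
  // shows one exists), then connect to the literal IPv4 address.
  const [ipv4] = await resolve4("postgres.railway.internal");
  return knex({
    client: "pg",
    connection: {
      host: ipv4,
      port: 5432,
      user: "postgres",
      password: process.env.PGPASSWORD, // credentials assumed from service env
      database: "railway",
    },
  });
}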

Project IDs, deployment IDs, dump file, and full logs available on request.

Severity: critical. Production launch blocked.

$10 Bounty

1 Reply

Status changed to Awaiting Railway Response Railway about 22 hours ago


Status changed to Open Railway about 22 hours ago


ianrothfuss
HOBBYOP

10 hours ago

Update — partial resolution found, root cause not fully isolated.

Worked around the issue by spinning up a brand-new Railway project today (2026-04-25/26) and deploying a clean Medusa 2.14.1 codebase via railway up. That deploy connected to Railway-hosted Postgres without any KnexTimeout. Server started, migrations ran, /health returns 200, pg_stat_activity shows healthy idle connections.

What we changed between the broken and working setup:

- Bootstrapped fresh from create-medusa-app@2.14.0 (then bumped to 2.14.1)

- Flattened the bootstrap's nested monorepo wrapper to single-package

- Eliminated the lockfile drift (the old project had @medusajs/* packages resolving to a mix of 2.13.1, 2.13.2, and 2.14.0 across the dep tree)

- Set startCommand per Medusa's deployment guide:

cd .medusa/server && npm install --legacy-peer-deps && npm run predeploy && npm run start

- Added .npmrc with legacy-peer-deps=true

What we did NOT change:

- Same Railway region (us-east4-eqdc4a)

- Same *.railway.internal private DNS pattern

- Same Medusa pgConnectionLoader code path

- Same Knex/pg versions

Open question for Railway team: the original failure mode (TCP appears to connect via bash /dev/tcp but pg never reaches net.connect(), no error codes logged, repro from inside the Postgres container itself with psql -h postgres.railway.internal hanging while psql -h localhost worked) is still not explained. If that asymmetry was a transient platform issue that's since resolved, please confirm. Otherwise it's a latent issue that could re-trigger.

For now: production unblocked, no longer urgent. Happy to leave the ticket open if you want the diagnostic data, or close it as worked-around.


Status changed to Solved ianrothfuss about 10 hours ago


Status changed to Open brody about 10 hours ago

