Railway Private Network: Postgres Connections Hang Silently — Reproduces on Brand-New Project
ianrothfuss
HOBBYOP

2 months ago

Postgres connections hang silently via *.railway.internal — reproduces on brand-new project

Started 2026-04-24 ~00:50 UTC, immediately after the Apr 23 IPv6 egress rollout. ~50 deploys over 32+ hours all fail identically. Reproduced today on a brand-new project — not project-scoped state.

Stack

Node 20.20.2, Medusa 2.13.2, Knex 3.1.0, node-postgres 8.16.0. postgres-ssl:17.9 (old project), :18 (new). Railpack builder. Region us-east4-eqdc4a.

Symptom

Every medusa start hangs in Knex pool acquisition for 60s, then:

KnexTimeoutError: Timeout acquiring a connection. Pool is probably full.
  at Client_PG.acquireConnection (knex/lib/client.js:332)
  at async pgConnectionLoader (@medusajs/framework/database/pg-connection-loader.ts:74)
  sql: 'SELECT 1'

Retries every 60s for 5 min, exits 1. No ECONNREFUSED, ENETUNREACH, ETIMEDOUT, EHOSTUNREACH, or any TCP error logged.

NODE_DEBUG=net,tls,dns produces zero socket events during the 60s pool-acquire window — pg never reaches net.connect(). Postgres-side logs show zero connection attempts during the failure window.
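
For a standalone repro outside Medusa, shorter timeouts make each hang show up in seconds rather than after the 60s pool-acquire default. A minimal sketch, assuming a plain Knex + node-postgres setup: `buildKnexConfig` is a hypothetical helper (not part of Knex or Medusa) and the 5s/10s values are arbitrary. `connectionTimeoutMillis` is the node-postgres client option and `acquireConnectionTimeout` is the Knex option.

```javascript
// Sketch only: a Knex config object that fails fast instead of waiting out
// the default 60s pool-acquire timeout. buildKnexConfig is a hypothetical
// helper name, and the timeout values are illustrative assumptions.
function buildKnexConfig(databaseUrl) {
  return {
    client: 'pg',
    connection: {
      connectionString: databaseUrl,
      // node-postgres: abandon a single connection attempt after 5s
      connectionTimeoutMillis: 5000,
    },
    // Knex: how long to wait for a pooled connection (default is 60000)
    acquireConnectionTimeout: 10000,
    pool: { min: 0, max: 5 },
  };
}
```

Passing the result to `require('knex')(...)` and running `SELECT 1` should then either succeed or fail within roughly ten seconds, which shortens each test cycle.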

Definitive reproducer (no app code)

From inside the Postgres container itself:

psql -h localhost -U postgres -d railway -c 'SELECT 1'
# → connects, returns 1

psql -h postgres.railway.internal -U postgres -d railway -c 'SELECT 1'
# → hangs indefinitely

getent ahosts postgres.railway.internal returns the container's own IPv4 + IPv6 (confirmed via hostname -I). Postgres logs listening on 0.0.0.0:5432 and :::5432. Traffic to those addresses doesn't reach Postgres. bash exec 3<>/dev/tcp/postgres.railway.internal/5432 reports exit 0 (TCP handshake "completes") but no Postgres-protocol data flows.
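
The /dev/tcp check above only shows that a TCP handshake completes. A protocol-level probe can check whether the server ever answers. A minimal sketch using only Node's built-in `net` module: it sends the Postgres SSLRequest packet (length 8, request code 80877103) and waits for the single-byte reply (`S` or `N`). `probePostgres` and its result labels are hypothetical names chosen for illustration.

```javascript
const net = require('node:net');

// Postgres SSLRequest: Int32 length (8) followed by Int32 code 80877103.
function buildSslRequest() {
  const buf = Buffer.alloc(8);
  buf.writeInt32BE(8, 0);
  buf.writeInt32BE(80877103, 4);
  return buf;
}

// Connect, send SSLRequest, and report how far the exchange got.
// Resolves to a label such as 'reply:S', 'reply:N', 'connected-no-reply',
// 'connect-timeout', 'closed-without-reply', or 'error:<code>'.
function probePostgres(host, port = 5432, timeoutMs = 5000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    let connected = false;
    let settled = false;

    const finish = (result) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      socket.destroy();
      resolve(result);
    };

    // Distinguish "never connected" from "connected but server stayed silent".
    const timer = setTimeout(
      () => finish(connected ? 'connected-no-reply' : 'connect-timeout'),
      timeoutMs
    );

    socket.once('connect', () => {
      connected = true;
      socket.write(buildSslRequest());
    });
    socket.once('data', (chunk) => finish(`reply:${String.fromCharCode(chunk[0])}`));
    socket.once('error', (err) => finish(`error:${err.code}`));
    socket.once('close', () => finish('closed-without-reply'));
  });
}
```

Usage would be `probePostgres('postgres.railway.internal').then(console.log)`. A `connected-no-reply` result would match the behaviour described here: the socket is established but no Postgres-protocol bytes come back.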

Brand-new project reproduces (today, 2026-04-25)

  1. Created brand-new project, fresh Postgres template, fresh Redis, same workspace/region
  2. Restored 102MB schema dump from old project — clean (136 tables, exit 0)
  3. Uploaded same Medusa code via railway up
  4. Same KnexTimeoutError: SELECT 1, same 60s cadence, same 5-min crash.

Eliminates project-state, deploy cache, lockfile drift, and Medusa version as causes. The bug is in Railway's private-network routing.

Ruled out

| Hypothesis | How tested | Result |
| --- | --- | --- |
| Wrong creds | Compare DATABASE_URL pw vs PGPASSWORD | Match |
| Postgres unhealthy | External psql via TCP proxy | Connects, queries fine |
| DNS failure | getent ahosts from app container | Returns IPv4 + IPv6 |
| pg IPv6-first preference | NODE_OPTIONS=--dns-result-order=ipv6first | No effect |
| ipv6EgressEnabled config | Toggled false → true, redeploys | No effect |
| SSL handshake hang | Removed ?sslmode=require | No effect |
| Public proxy bypass | DATABASE_URL → DATABASE_PUBLIC_URL | Same hang |
| Stale build cache | Multiple fresh Railpack rebuilds | No effect |
| Schema mismatch | Last mikro_orm migration verified | No drift |
| Project-scoped state | Brand-new project repro | Bug confirmed wider |

Timeline (UTC)

  • 04-22 22:56 — Postgres image pinned :17 → :17.9 (only diff: pgvector)
  • 04-22 23:01 — Postgres redeploy
  • 04-22 23:04 — Last successful backend deploy, ran healthy ~25 hrs
  • 04-23 — Railway outbound IPv6 egress rollout (per changelog)
  • 04-24 00:49 — Postgres redeploy
  • 04-24 00:50 — First backend-app failure
  • 04-24 → 25 — ~50 deploys, all fail identically
  • 04-25 — Brand-new project reproduces

Prior similar incidents

Two Central Station threads describe similar symptoms during prior platform rollouts, both resolved by Railway-side rollbacks:

Asks

  1. Is this a known regression of the Apr 23 rollout for Medusa/Knex/node-postgres connection patterns?
  2. Can private-network routing config be reapplied for our services?
  3. Workaround: any way to force *.railway.internal → IPv4-only until fixed?
  4. If migration is required, please advise on volume-data migration path between projects.
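
On the IPv4-only workaround in ask 3, one client-side approach is to resolve the private hostname to an IPv4 address at startup and pin the connection URL to it. This is only a sketch under assumptions: `withIpv4Host` and `pinToIpv4` are hypothetical helper names, and it assumes the private hostname has an A record (the `getent ahosts` output above suggests it does).

```javascript
const dns = require('node:dns');

// Replace the hostname in a connection URL with a literal IPv4 address,
// leaving credentials, port, database, and query string untouched.
function withIpv4Host(databaseUrl, ipv4) {
  const url = new URL(databaseUrl);
  url.hostname = ipv4;
  return url.toString();
}

// Resolve the hostname with family: 4 so only an IPv4 result is returned,
// then pin the URL to that address.
async function pinToIpv4(databaseUrl) {
  const { hostname } = new URL(databaseUrl);
  const { address } = await dns.promises.lookup(hostname, { family: 4 });
  return withIpv4Host(databaseUrl, address);
}
```

The pinned address may change whenever the Postgres service is redeployed, so the lookup would need to run on every start. Any TLS hostname verification would also see the IP rather than the original name.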

Project IDs, deployment IDs, dump file, and full logs available on request.

Severity: critical. Production launch blocked.

$10 Bounty

Status changed to Awaiting Railway Response Railway about 2 months ago


Status changed to Open Railway about 2 months ago


ianrothfuss
HOBBYOP

2 months ago

Update: partial resolution found, root cause not fully isolated.

Worked around the issue by spinning up a brand-new Railway project today (2026-04-25/26) and deploying a clean Medusa 2.14.1 codebase via railway up. That deploy connected to Railway-hosted Postgres without any KnexTimeoutError. Server started, migrations ran, /health returns 200, and pg_stat_activity shows healthy idle connections.

What we changed between the broken and working setup:

- Bootstrapped fresh from create-medusa-app@2.14.0 (then bumped to 2.14.1)
- Flattened the bootstrap's nested monorepo wrapper to single-package
- Eliminated the lockfile drift (the old project had @medusajs/* packages resolving to a mix of 2.13.1, 2.13.2, and 2.14.0 across the dep tree)
- Set startCommand per Medusa's deployment guide: cd .medusa/server && npm install --legacy-peer-deps && npm run predeploy && npm run start
- Added .npmrc with legacy-peer-deps=true
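
The lockfile drift mentioned in the list above can be checked mechanically. A minimal sketch, assuming an npm package-lock.json with lockfileVersion 2 or 3 (which has a top-level `packages` map keyed by install path). `findMedusaDrift` and `hasDrift` are hypothetical names.

```javascript
// Group every installed @medusajs/* package by its resolved version.
// Assumes `lockfile` is a parsed package-lock.json (lockfileVersion 2 or 3).
function findMedusaDrift(lockfile) {
  const byVersion = {};
  // Match only paths that END in node_modules/@medusajs/<name>, so nested
  // copies count but unrelated packages nested under a Medusa package do not.
  const pattern = /(?:^|\/)node_modules\/(@medusajs\/[^/]+)$/;
  for (const [installPath, entry] of Object.entries(lockfile.packages || {})) {
    const match = pattern.exec(installPath);
    if (!match || !entry.version) continue;
    if (!byVersion[entry.version]) byVersion[entry.version] = [];
    byVersion[entry.version].push(match[1]);
  }
  return byVersion;
}

// Drift here simply means more than one distinct version is installed.
function hasDrift(byVersion) {
  return Object.keys(byVersion).length > 1;
}
```

It could be run as `findMedusaDrift(JSON.parse(fs.readFileSync('package-lock.json', 'utf8')))`. Some @medusajs/* packages may be versioned independently of the core, so a mix is a prompt to look closer rather than proof of a problem.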

What we did NOT change:

- Same Railway region (us-east4-eqdc4a)
- Same *.railway.internal private DNS pattern
- Same Medusa pgConnectionLoader code path
- Same Knex/pg version

Open question for Railway team: the original failure mode (TCP appears to connect via bash /dev/tcp but pg never reaches net.connect(), no error codes logged, repro from inside the Postgres container itself with psql -h postgres.railway.internal hanging while psql -h localhost worked) is still not explained. If that asymmetry was a transient platform issue that's since resolved, please confirm. Otherwise it's a latent issue that could re-trigger.

For now: production unblocked, no longer urgent. Happy to leave the ticket open if you want the diagnostic data, or close it as worked-around.


Status changed to Solved ianrothfuss about 2 months ago


Status changed to Open brody about 2 months ago



dev-charles254
PRO

a month ago

A few things in your repro strongly point away from Medusa/Knex itself and toward Railway’s internal networking layer — specifically the interaction between the private DNS resolver, dual-stack routing, and loopback/self-resolution behavior after the Apr 23 IPv6 rollout.

The key signal for me is that, from inside the Postgres container itself, this connects:

psql -h localhost ...

while this hangs:

psql -h postgres.railway.internal ...

That effectively removes Knex, Medusa, pg pools, and most app-layer causes from the equation.

What’s especially interesting is this part:

bash exec 3<>/dev/tcp/postgres.railway.internal/5432

meaning TCP SYN/SYN-ACK completed, but the PostgreSQL protocol handshake never progressed. Combined with:

  • no net.connect() activity
  • no Postgres logs
  • DNS resolving both IPv4 + IPv6
  • issue starting immediately after IPv6 egress rollout

this smells like a blackhole/routing asymmetry where the internal hostname resolves to an address reachable at the TCP layer but not correctly hairpinned back into the container/service network namespace.

Possible theory:

postgres.railway.internal may now resolve preferentially to an IPv6 address that successfully accepts the TCP handshake through Railway’s overlay/proxy layer, but packets containing the actual Postgres protocol stream are never forwarded back to the originating container/service correctly.

The strongest evidence is that the issue reproduced even inside the database container itself. That should normally be the simplest successful path possible.

A few things I would still test if you haven’t already:

getent ahostsv4 postgres.railway.internal

getent ahostsv6 postgres.railway.internal

then force explicit address-family tests:

psql "hostaddr=<IPv4 address> host=postgres.railway.internal ..."

psql "hostaddr=<IPv6 address> host=postgres.railway.internal ..."

Also:

PGHOSTADDR=<address> psql ...

If IPv4 succeeds while hostname resolution hangs, that would strongly confirm a resolver/routing regression around IPv6 preference ordering or internal overlay translation.
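
The per-family test can be scripted so that every resolved address gets tried explicitly. A minimal sketch, assuming Node's built-in `dns` module and the user and database names from the original repro (postgres / railway). `groupByFamily`, `psqlCommands`, and `addressFamilyCommands` are hypothetical names.

```javascript
const dns = require('node:dns');

// Split dns.lookup({ all: true }) records into IPv4 and IPv6 lists.
function groupByFamily(records) {
  const groups = { ipv4: [], ipv6: [] };
  for (const { address, family } of records) {
    (family === 6 ? groups.ipv6 : groups.ipv4).push(address);
  }
  return groups;
}

// Build one psql invocation per address. libpq's hostaddr parameter makes it
// connect to that exact address instead of resolving the host name.
function psqlCommands(hostname, addresses) {
  return addresses.map(
    (addr) =>
      `psql "hostaddr=${addr} host=${hostname} port=5432 user=postgres dbname=railway" -c 'SELECT 1'`
  );
}

// Resolve every address for the hostname and return the commands to run,
// IPv4 first, then IPv6.
async function addressFamilyCommands(hostname) {
  const records = await dns.promises.lookup(hostname, { all: true });
  const { ipv4, ipv6 } = groupByFamily(records);
  return psqlCommands(hostname, [...ipv4, ...ipv6]);
}
```

Running each printed command in turn shows which address family, if any, completes the Postgres handshake.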

Another interesting datapoint is that /dev/tcp succeeded but libpq stalled. That can happen if:

  • SYN succeeds
  • socket established
  • but no readable bytes ever arrive after startup packet transmission

which usually indicates transparent proxying / conntrack / overlay routing weirdness rather than application failure.

The fact that rebuilding into a clean Medusa 2.14.x project “fixed” it may simply mean the underlying Railway networking issue was transiently corrected during later deploys/reallocations, not necessarily that dependency drift caused the original symptom.

I think the real fix here was likely forcing Railway to allocate fresh internal networking state by creating a brand-new project.

Your own repro isolates this below the application layer:

psql -h postgres.railway.internal ...

A hang on that self-connect should never happen if the private overlay routing is healthy.

My guess is one of:

  • stale service-discovery/network metadata after the Apr 23 IPv6 rollout
  • broken IPv6 hairpin routing for *.railway.internal
  • corrupted overlay routing state tied to the original project allocation

Creating a fresh project likely forced:

  • new internal DNS entries
  • new overlay network allocation
  • new service mesh routes
  • fresh IPv6 bindings

which explains why the exact same stack suddenly worked without meaningful pg/Knex changes.

You could probably validate this by comparing:

dig postgres.railway.internal AAAA

dig postgres.railway.internal A

ip -6 route

ss -ltnp

between the broken and working projects.

If Railway engineering can repro the “self-connect via internal hostname hangs” behavior, I suspect they’ll find an overlay or IPv6 routing regression introduced during the Apr 23 rollout.

