Since GCP outage 5/20 - Phoenix container can't connect to PostgreSQL
bdouble
HOBBYOP

14 hours ago

Hi, hoping someone can help diagnose a private networking issue in my staging environment.

The situation

I have a phoenix service (arizephoenix/phoenix:latest) that needs to connect to a PostgreSQL instance (pgvector/pgvector:pg17) via Railway private networking. Phoenix consistently times out on every connection attempt. However, a separate Node.js/Next.js app in the same environment connects to the exact same hostname, port, and credentials without any issues and is serving live traffic.

Both services use identical connection strings pointing to pgvector.railway.internal:5432. The problem appears to be specific to the Phoenix container or Python/asyncpg, not a general networking failure.

Environment details

  • Project ID: 5f32b81b-b267-4b61-8d8c-6cbf7da35584
  • Environment ID: 4760bda0-1190-47db-b317-7063e5d59058
  • Environment: staging

pgvector service (the database)

  • Image: pgvector/pgvector:pg17
  • Private domain: pgvector.railway.internal
  • Listening on port 5432
  • Active deployment: e7b9e39e-5609-400a-9c2c-e24c9e583534 (deployed 2026-05-21T01:14:45Z)

phoenix service (the failing service)

  • Image: arizephoenix/phoenix:latest
  • Connecting to: pgvector.railway.internal:5432
  • Active deployment: d57252c1-86ef-46e7-94df-4b222368e1e8 (deployed 2026-05-21T01:15:20Z)

ProductLobster App (Node.js service that works fine)

  • Custom Dockerfile, Node.js/Next.js, Prisma ORM
  • Connecting to: same pgvector.railway.internal:5432, same credentials, same database
  • Active deployment: 281c4717-46f5-47f6-9e29-0c5c62fb212e (deployed 2026-05-21T01:14:36Z)

Evidence that pgvector is healthy

pgvector logs from the current deployment show a clean startup and ongoing checkpoint activity:

[01:29:21Z] starting PostgreSQL 17.10, listening on IPv4 0.0.0.0, port 5432

[01:29:21Z] database system is ready to accept connections

[01:29:21Z] checkpoint complete: wrote 2 buffers

No errors after startup. The service is stable.

Evidence that Node.js connects fine

The ProductLobster App was deployed fresh at 01:14:36Z, nine seconds before the current pgvector deployment (01:14:45Z) even existed, so every connection it holds now must have been opened against the new pgvector container after the redeploy. It is serving live database-backed traffic without errors. The private network route to pgvector.railway.internal:5432 is clearly reachable from at least one container in this environment.

Evidence that Phoenix always times out

Phoenix uses asyncpg (Python async PostgreSQL driver) and connects on startup to run Alembic migrations. It has failed on every single restart across multiple redeployment cycles. The error appears to be a TCP-level connection timeout: asyncpg never completes the handshake, so Alembic never runs. The surface error is labeled "PhoenixMigrationError", but the real cause is buried in the stack trace:

File "asyncpg/connection.py", line 2442, in connect

File "phoenix/server/cli/commands/serve.py", line 260, in run

    async with compat.timeout(timeout):

File "asyncio/timeouts.py", line 116, in __aexit__

    raise TimeoutError from exc_val

TimeoutError

Each restart cycle looks like this in the logs:

[01:29:28Z] Mounting volume — container starts

[01:29:44Z] Running migrations on the database — asyncpg.connect() begins

[01:30:46Z] Mounting volume — container restarted (62s timeout, crash, restart)

This repeated 8+ times across two full redeployment rounds of both phoenix and pgvector, with pgvector confirmed healthy throughout.
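
The trace above only shows that asyncpg.connect() hit its timeout; it doesn't say whether DNS resolution, the TCP connect, or something after the connect is what hangs. A minimal stdlib-only sketch for separating those cases from inside the Phoenix container, assuming a Python interpreter is available there. The function name probe_tcp is made up for illustration; it resolves the hostname and tries a plain TCP connect to every address returned, IPv4 and IPv6 alike.

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0):
    """Resolve host and attempt a plain TCP connect to each address.

    Returns a list of (family_name, address, outcome) tuples.
    probe_tcp is a hypothetical helper, not part of Phoenix or asyncpg.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # DNS itself failed, so no TCP attempt is possible.
        return [("dns", host, f"resolution failed: {exc}")]

    results = []
    for family, socktype, proto, _canon, sockaddr in infos:
        name = "IPv6" if family == socket.AF_INET6 else "IPv4"
        started = time.monotonic()
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            outcome = "connected"
        except OSError as exc:  # covers timeouts and refusals
            outcome = f"failed: {exc!r}"
        elapsed = time.monotonic() - started
        results.append((name, sockaddr[0], f"{outcome} in {elapsed:.2f}s"))
    return results
```

Calling probe_tcp("pgvector.railway.internal", 5432) and printing each row would show which addresses the name resolves to from this container and whether each one accepts a TCP connection. If every row connects quickly, the hang is happening after the TCP connect rather than at the network level.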

What I've ruled out

  • pgvector is not down: its logs show it accepting connections, and the Node.js app proves it.
  • Stale persistent connections: the Node.js app deployment is fresh (post-incident), so that isn't it.
  • Config mismatch: both services use identical hostnames, ports, and credentials.
  • Redeploys: I've redeployed both services multiple times, including the Command-K redeploy, which pulls a fresh server image.
  • DNS propagation delay: Phoenix fails even 8+ minutes after pgvector is fully up, so that doesn't explain it.

The core mystery

The only meaningful difference between the service that works (Node.js/Prisma) and the one that fails (Python/asyncpg) is the runtime and driver. This makes me wonder if there's something specific to how asyncpg probes for SSL before establishing a plain connection, how Python resolves the private DNS hostname versus Node.js, or something in the arizephoenix/phoenix image's network namespace that behaves differently inside Railway's mesh.
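
One way to test the SSL-probe theory without involving asyncpg at all is to send the PostgreSQL SSLRequest message by hand and see whether the server answers. This is a rough sketch under assumptions: as far as I understand, asyncpg's default SSL mode tries TLS first, which means it opens with this same 8-byte message, but I have not confirmed that this is what the Phoenix image does. The helper name probe_ssl_request is hypothetical.

```python
import socket
import struct

# PostgreSQL SSLRequest: Int32 length (8) followed by Int32 code 80877103.
SSL_REQUEST = struct.pack("!II", 8, 80877103)


def probe_ssl_request(host, port, timeout=5.0):
    """Send an SSLRequest and return the server's one-byte answer.

    Returns "S" (server is willing to negotiate TLS), "N" (server
    declines TLS), or "" if the server closed without replying.
    Raises OSError (including timeouts) if connecting or reading fails.
    probe_ssl_request is a hypothetical helper for illustration.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(SSL_REQUEST)
        reply = sock.recv(1)
    return reply.decode("ascii", errors="replace")
```

Running probe_ssl_request("pgvector.railway.internal", 5432) from the Phoenix container should return "S" or "N" almost immediately if the path is healthy. A timeout on the connect points at routing or DNS; a timeout on the read would suggest the connection opens but the first protocol exchange is being dropped.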

Has anyone seen Railway private networking work for one runtime but not another in the same environment? Any ideas on what to check or try?

$10 Bounty

2 Replies



froopledesign
PRO

9 hours ago

Probably not your issue since both your services use the same hostname and only Phoenix fails — but worth a quick sanity check: after the May 19/20 incident, our Postgres connection variables silently re-resolved to different values even though we never changed them manually (we use Railway's dynamic reference variables). A redeploy fixed it for us by pulling the current values.

In your case, since the Node.js app works against pgvector.railway.internal:5432, the endpoint is clearly correct from somewhere — but it might be worth shelling into the Phoenix container and printing the actual resolved DATABASE_URL (or whatever env var asyncpg reads) to confirm it matches what Node.js is using. If they diverge, that's your answer. If they match, you've at least ruled it out.
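
A minimal sketch of that check in Python, with the password hidden so the output is safe to paste into a thread like this one. DATABASE_URL is only a placeholder default here; substitute whichever variable your Phoenix deployment actually reads. The helper names describe_dsn and describe_env are made up for illustration.

```python
import os
from urllib.parse import urlsplit


def describe_dsn(dsn):
    """Return the connection target from a DSN with the password hidden."""
    parts = urlsplit(dsn)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password_set": parts.password is not None,  # never echo the secret
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "query": parts.query,  # e.g. sslmode or other driver options
    }


def describe_env(var="DATABASE_URL"):
    """Describe the DSN stored in an environment variable.

    The default name is an assumption, not Phoenix's documented setting.
    """
    return describe_dsn(os.environ[var])
```

Printing describe_env() in both containers and comparing the two dictionaries field by field would show any divergence in host, port, database, or query options without exposing credentials.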


bdouble
HOBBYOP

an hour ago

Thanks @froopledesign, I checked this and have ruled it out. It did not fix the problem.

