Since GCP outage 5/20 - Phoenix container can't connect to PostgreSQL

bdouble

HOBBY OP

2 months ago

Hi, hoping someone can help diagnose a private networking issue in my staging environment.

The situation

I have a phoenix service (arizephoenix/phoenix:latest) that needs to connect to a PostgreSQL instance (pgvector/pgvector:pg17) via Railway private networking. Phoenix consistently times out on every connection attempt. However, a separate Node.js/Next.js app in the same environment connects to the exact same hostname, port, and credentials without any issues and is serving live traffic.

Both services use identical connection strings pointing to pgvector.railway.internal:5432. The problem appears to be specific to the Phoenix container or Python/asyncpg, not a general networking failure.

Environment details

Project ID: 5f32b81b-b267-4b61-8d8c-6cbf7da35584
Environment ID: 4760bda0-1190-47db-b317-7063e5d59058
Environment: staging

pgvector service (the database)

Image: pgvector/pgvector:pg17
Private domain: pgvector.railway.internal
Listening on port 5432
Active deployment: e7b9e39e-5609-400a-9c2c-e24c9e583534 (deployed 2026-05-21T01:14:45Z)

phoenix service (the failing service)

Image: arizephoenix/phoenix:latest
Connecting to: pgvector.railway.internal:5432
Active deployment: d57252c1-86ef-46e7-94df-4b222368e1e8 (deployed 2026-05-21T01:15:20Z)

ProductLobster App (Node.js service that works fine)

Custom Dockerfile, Node.js/Next.js, Prisma ORM
Connecting to: same pgvector.railway.internal:5432, same credentials, same database
Active deployment: 281c4717-46f5-47f6-9e29-0c5c62fb212e (deployed 2026-05-21T01:14:36Z)

Evidence that pgvector is healthy

pgvector logs from the current deployment show a clean startup and ongoing checkpoint activity:

[01:29:21Z] starting PostgreSQL 17.10, listening on IPv4 0.0.0.0, port 5432

[01:29:21Z] database system is ready to accept connections

[01:29:21Z] checkpoint complete: wrote 2 buffers

No errors after startup. The service is stable.

Evidence that Node.js connects fine

The ProductLobster App was deployed fresh at 01:14:36Z, nine seconds before the current pgvector deployment (01:14:45Z), so every connection it now holds to pgvector was opened after the redeploy rather than carried over from before the incident. It is serving live database-backed traffic without errors. The private network route to pgvector.railway.internal:5432 is clearly reachable from at least one container in this environment.

Evidence that Phoenix always times out

Phoenix uses asyncpg (Python async PostgreSQL driver) and connects on startup to run Alembic migrations. It has failed on every single restart across multiple redeployment cycles. The error is a TCP-level connection timeout — asyncpg never completes the handshake. Alembic never runs. The surface error is labeled "PhoenixMigrationError" but the real cause is buried in the stack trace:

File "asyncpg/connection.py", line 2442, in connect

File "phoenix/server/cli/commands/serve.py", line 260, in run

    async with compat.timeout(timeout):

File "asyncio/timeouts.py", line 116, in __aexit__

    raise TimeoutError from exc_val

TimeoutError

Each restart cycle looks like this in the logs:

[01:29:28Z] Mounting volume — container starts

[01:29:44Z] Running migrations on the database — asyncpg.connect() begins

[01:30:46Z] Mounting volume — container restarted (62s timeout, crash, restart)

This repeated 8+ times across two full redeployment rounds of both phoenix and pgvector, with pgvector confirmed healthy throughout.

What I've ruled out

pgvector being down: its logs show it accepting connections, and the Node.js app proves it.
Stale persistent connections: the Node.js app deployment is fresh (post-incident).
Config mismatch: both services use identical hostnames, ports, and credentials.
Stale deployments: I've redeployed both services multiple times, including the Command-K redeploy, which pulls a fresh server image.
DNS propagation delay: Phoenix fails even 8+ minutes after pgvector is fully up.

The core mystery

The only meaningful difference between the service that works (Node.js/Prisma) and the one that fails (Python/asyncpg) is the runtime and driver. This makes me wonder if there's something specific to how asyncpg probes for SSL before establishing a plain connection, how Python resolves the private DNS hostname versus Node.js, or something in the arizephoenix/phoenix image's network namespace that behaves differently inside Railway's mesh.
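One way to separate the DNS hypothesis from the SSL-probe hypothesis is a stdlib-only check run from inside the Phoenix container, with no asyncpg involved. The sketch below resolves the hostname the way Python does, tries a plain TCP connect to every address returned, and then sends the PostgreSQL SSLRequest message (the 8-byte packet a client sends before negotiating TLS) to see whether the server answers. The names probe and family_name are hypothetical helpers, not part of Phoenix or asyncpg, and it assumes a Python interpreter is reachable inside the container. Since the pgvector log above reports listening on IPv4 0.0.0.0, the address family of each result may be worth a look, though whether that matters here is only a guess.

```python
import socket
import struct

# PostgreSQL SSLRequest: int32 message length (8) followed by int32 code 80877103.
SSL_REQUEST = struct.pack("!ii", 8, 80877103)


def family_name(family):
    """Map a socket address family to a readable label."""
    return {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}.get(family, str(family))


def probe(host, port, timeout=5.0):
    """Resolve host, then try TCP connect + SSLRequest against every address.

    Returns a list of (family, address, outcome) tuples, one per resolved address.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [("n/a", host, f"DNS lookup failed: {exc}")]

    results = []
    for family, socktype, proto, _canon, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)       # plain TCP connect
                sock.sendall(SSL_REQUEST)    # ask whether the server will do TLS
                reply = sock.recv(1)         # b"S" = willing, b"N" = not willing
            outcome = f"connected, SSLRequest reply {reply!r}"
        except OSError as exc:               # includes timeouts and refusals
            outcome = f"failed: {exc!r}"
        results.append((family_name(family), sockaddr[0], outcome))
    return results
```

Calling probe("pgvector.railway.internal", 5432) and printing each row shows which addresses the name resolves to and how each one responds. If every address fails at the connect step, the problem sits below the driver; if the connect succeeds and a reply byte comes back, the focus shifts to asyncpg or Phoenix configuration.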

Has anyone seen Railway private networking work for one runtime but not another in the same environment? Any ideas on what to check or try?

$10 Bounty

2 Replies


froopledesign

PRO

2 months ago

Probably not your issue since both your services use the same hostname and only Phoenix fails — but worth a quick sanity check: after the May 19/20 incident, our Postgres connection variables silently re-resolved to different values even though we never changed them manually (we use Railway's dynamic reference variables). A redeploy fixed it for us by pulling the current values.

In your case, since the Node.js app works against pgvector.railway.internal:5432, the endpoint is clearly correct from somewhere — but it might be worth shelling into the Phoenix container and printing the actual resolved DATABASE_URL (or whatever env var asyncpg reads) to confirm it matches what Node.js is using. If they diverge, that's your answer. If they match, you've at least ruled it out.
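A minimal sketch of that check, runnable inside either container, is below. It prints the connection URL with the password masked so the output can be compared between services or pasted into a thread safely. The variable names in CANDIDATES are assumptions (PHOENIX_SQL_DATABASE_URL is my guess at what Phoenix reads; DATABASE_URL is the usual Prisma name), and redact and report are hypothetical helpers, so adjust them to whatever your services actually use.

```python
import os
from urllib.parse import urlsplit

# Assumption: candidate variable names; replace with the ones your services read.
CANDIDATES = ("PHOENIX_SQL_DATABASE_URL", "DATABASE_URL")


def redact(dsn):
    """Return the DSN with any password replaced by '***'."""
    parts = urlsplit(dsn)
    if parts.password is None:
        return dsn
    host = parts.hostname or ""
    if ":" in host:                      # re-bracket IPv6 literals
        host = f"[{host}]"
    netloc = f"{parts.username or ''}:***@{host}"
    if parts.port is not None:
        netloc += f":{parts.port}"
    return parts._replace(netloc=netloc).geturl()


def report(env=os.environ):
    """Return (name, redacted value or None) for each candidate variable."""
    return [(name, redact(env[name]) if name in env else None) for name in CANDIDATES]


for name, value in report():
    print(f"{name} = {value!r}")
```

If the redacted host, port, and database name differ between the two containers, the variables have diverged; if they are identical, configuration drift is ruled out.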

bdouble

HOBBY OP

2 months ago

Thanks @froopledesign, I checked this and ruled it out. Phoenix still times out.
