14 hours ago
Hi, hoping someone can help diagnose a private networking issue in my staging environment.
The situation
I have a phoenix service (arizephoenix/phoenix:latest) that needs to connect to a PostgreSQL instance (pgvector/pgvector:pg17) via Railway private networking. Phoenix consistently times out on every connection attempt. However, a separate Node.js/Next.js app in the same environment connects to the exact same hostname, port, and credentials without any issues and is serving live traffic.
Both services use identical connection strings pointing to pgvector.railway.internal:5432. The problem appears to be specific to the Phoenix container or Python/asyncpg, not a general networking failure.
Environment details
- Project ID: 5f32b81b-b267-4b61-8d8c-6cbf7da35584
- Environment ID: 4760bda0-1190-47db-b317-7063e5d59058
- Environment: staging
pgvector service (the database)
- Image: pgvector/pgvector:pg17
- Private domain: pgvector.railway.internal
- Listening on port 5432
- Active deployment: e7b9e39e-5609-400a-9c2c-e24c9e583534 (deployed 2026-05-21T01:14:45Z)
phoenix service (the failing service)
- Image: arizephoenix/phoenix:latest
- Connecting to: pgvector.railway.internal:5432
- Active deployment: d57252c1-86ef-46e7-94df-4b222368e1e8 (deployed 2026-05-21T01:15:20Z)
ProductLobster App (Node.js service that works fine)
- Custom Dockerfile, Node.js/Next.js, Prisma ORM
- Connecting to: same pgvector.railway.internal:5432, same credentials, same database
- Active deployment: 281c4717-46f5-47f6-9e29-0c5c62fb212e (deployed 2026-05-21T01:14:36Z)
Evidence that pgvector is healthy
pgvector logs from the current deployment show a clean startup and ongoing checkpoint activity:
[01:29:21Z] starting PostgreSQL 17.10, listening on IPv4 0.0.0.0, port 5432
[01:29:21Z] database system is ready to accept connections
[01:29:21Z] checkpoint complete: wrote 2 buffers
No errors after startup. The service is stable.
Evidence that Node.js connects fine
The ProductLobster App was deployed fresh at 01:14:36Z — nine seconds before the current pgvector container even existed — so it established entirely new TCP connections post-redeploy. It is serving live database-backed traffic without errors. The private network route to pgvector.railway.internal:5432 is clearly reachable from at least one container in this environment.
Evidence that Phoenix always times out
Phoenix uses asyncpg (Python async PostgreSQL driver) and connects on startup to run Alembic migrations. It has failed on every single restart across multiple redeployment cycles. The error is a TCP-level connection timeout — asyncpg never completes the handshake. Alembic never runs. The surface error is labeled "PhoenixMigrationError" but the real cause is buried in the stack trace:
File "asyncpg/connection.py", line 2442, in connect
File "phoenix/server/cli/commands/serve.py", line 260, in run
async with compat.timeout(timeout):
File "asyncio/timeouts.py", line 116, in __aexit__
raise TimeoutError from exc_val
TimeoutError
Each restart cycle looks like this in the logs:
[01:29:28Z] Mounting volume — container starts
[01:29:44Z] Running migrations on the database — asyncpg.connect() begins
[01:30:46Z] Mounting volume — container restarted (62s timeout, crash, restart)
This repeated 8+ times across two full redeployment rounds of both phoenix and pgvector, with pgvector confirmed healthy throughout.
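To check whether the timeout really happens at the TCP stage rather than during the Postgres handshake, a stdlib-only probe can be run from inside the Phoenix container. This is a minimal sketch, not what asyncpg does internally: the function name `probe` and the 10-second timeout are my own choices. It opens a plain TCP connection and then sends the standard PostgreSQL SSLRequest message to see how the server answers.

```python
import socket
import struct
import time

# PostgreSQL SSLRequest message: int32 length (8) followed by int32 code 80877103.
SSL_REQUEST = struct.pack("!ii", 8, 80877103)


def probe(host: str, port: int = 5432, timeout: float = 10.0) -> str:
    """Return a short string naming the stage the probe reached.

    Hypothetical helper for diagnosis only; it does not authenticate.
    """
    start = time.monotonic()
    try:
        # Covers DNS resolution and the TCP three-way handshake.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:  # includes timeouts, refusals, and DNS errors
        elapsed = time.monotonic() - start
        return f"tcp connect failed after {elapsed:.1f}s: {exc!r}"
    with sock:
        try:
            sock.sendall(SSL_REQUEST)
            reply = sock.recv(1)
        except OSError as exc:
            return f"tcp ok, SSLRequest failed: {exc!r}"
    # b"S" means the server will negotiate TLS, b"N" means plain connections only.
    return f"tcp ok, SSLRequest reply: {reply!r}"
```

Calling `print(probe("pgvector.railway.internal"))` in the Phoenix container and again in the Node.js container (if Python is available there) would show whether the two containers fail at different stages.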
What I've ruled out
- pgvector is not down: its logs show it accepting connections, and the Node.js app proves it.
- The Node.js app deployment is fresh (post-incident), so this isn't about stale persistent connections.
- Both services use identical hostnames, ports, and credentials, so there's no config mismatch.
- I've redeployed both services multiple times, including the Command-K redeploy, which pulls a fresh server image.
- Phoenix fails even 8+ minutes after pgvector is fully up, so DNS propagation delay doesn't explain it.
The core mystery
The only meaningful difference between the service that works (Node.js/Prisma) and the one that fails (Python/asyncpg) is the runtime and driver. This makes me wonder if there's something specific to how asyncpg probes for SSL before establishing a plain connection, how Python resolves the private DNS hostname versus Node.js, or something in the arizephoenix/phoenix image's network namespace that behaves differently inside Railway's mesh.
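One way to test the DNS part of this hypothesis is to list what Python's resolver returns for the private hostname, since the pgvector log excerpt above only shows an IPv4 listener. This is a rough sketch using only the standard library; the helper name `resolve` is hypothetical, and whether the address family matters here is an assumption I have not confirmed.

```python
import socket


def resolve(host: str, port: int = 5432) -> list[tuple[str, str]]:
    """Return (family, address) pairs in the order Python's resolver yields them."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    results = []
    for family, _type, _proto, _canon, sockaddr in infos:
        if family == socket.AF_INET6:
            name = "IPv6"
        elif family == socket.AF_INET:
            name = "IPv4"
        else:
            name = str(family)
        # sockaddr[0] is the address for both IPv4 and IPv6 tuples.
        results.append((name, sockaddr[0]))
    return results
```

Running `print(resolve("pgvector.railway.internal"))` inside the Phoenix container shows which addresses Python sees and in what order. If only IPv6 addresses come back while Postgres is bound to 0.0.0.0, that would be worth following up.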
Has anyone seen Railway private networking work for one runtime but not another in the same environment? Any ideas on what to check or try?
2 Replies
9 hours ago
Probably not your issue since both your services use the same hostname and only Phoenix fails — but worth a quick sanity check: after the May 19/20 incident, our Postgres connection variables silently re-resolved to different values even though we never changed them manually (we use Railway's dynamic reference variables). A redeploy fixed it for us by pulling the current values.
In your case, since the Node.js app works against pgvector.railway.internal:5432, the endpoint is clearly correct from somewhere — but it might be worth shelling into the Phoenix container and printing the actual resolved DATABASE_URL (or whatever env var asyncpg reads) to confirm it matches what Node.js is using. If they diverge, that's your answer. If they match, you've at least ruled it out.
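A small sketch of that check, with the password masked so the output is safe to paste into a thread. The variable names in `CANDIDATES` are assumptions on my part; swap in whatever your Phoenix service actually sets.

```python
import os
from urllib.parse import urlsplit

# Assumption: candidate variable names; adjust to match the service's real config.
CANDIDATES = ("DATABASE_URL", "PHOENIX_SQL_DATABASE_URL")


def redact(url: str) -> str:
    """Return the URL with the password replaced by ***."""
    parts = urlsplit(url)
    user = parts.username or ""
    host = parts.hostname or ""
    port = f":{parts.port}" if parts.port else ""
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{user}:***@{host}{port}{parts.path}{query}"


def show() -> None:
    """Print each candidate variable, redacted, or <unset> if missing."""
    for name in CANDIDATES:
        value = os.environ.get(name)
        print(name, "=", redact(value) if value else "<unset>")


if __name__ == "__main__":
    show()
```

Comparing this output between the Phoenix and Node.js containers shows whether the host, port, database name, or query parameters such as `sslmode` differ.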
an hour ago
Thanks @froopledesign - ruled this out, did not work.