Internal network latency causing database and Redis connection timeouts

ranyu696
PRO

a month ago

## Issue Summary
I'm experiencing severe internal network latency between my microservices and PostgreSQL/Redis instances, causing frequent timeouts and poor performance despite all services being deployed in the same Railway project.

## Environment
- **Region**: All services deployed in US region
- **Architecture**: 20+ Go microservices using private networking
- **Database**: PostgreSQL (shared by all services)
- **Cache**: Redis (shared by all services)

## Symptoms

### 1. Database Performance Issues
- Simple COUNT queries taking 10+ seconds:
  ```sql
  SELECT count(*) FROM "games" WHERE status = 'published'
  [10041.353ms] [rows:1]
  ```
- `gorm.Open()` connection taking 10+ seconds
- Queries that should be instant (with proper indexes) are extremely slow
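
To narrow down whether the 10-second figure comes from the connection path or from the query itself, one option is to time a cheap round trip and the slow query separately. The following is a minimal sketch, not part of the original report: `measure` is a hypothetical helper, and the stand-in operation in `main` only sleeps so the example runs without a database.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// measure runs op n times and returns the fastest, middle, and slowest
// wall-clock duration (for even n the middle value is the upper one).
// It stops at the first error.
func measure(n int, op func() error) (fastest, median, slowest time.Duration, err error) {
	if n <= 0 {
		return 0, 0, 0, fmt.Errorf("n must be positive, got %d", n)
	}
	samples := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		if err := op(); err != nil {
			return 0, 0, 0, fmt.Errorf("run %d: %w", i, err)
		}
		samples = append(samples, time.Since(start))
	}
	sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
	return samples[0], samples[n/2], samples[n-1], nil
}

func main() {
	// Stand-in operation so the sketch runs without a database. In a real
	// service, pass a closure that pings the database, then one that runs
	// the slow COUNT query, and compare the two sets of numbers.
	fastest, median, slowest, err := measure(5, func() error {
		time.Sleep(2 * time.Millisecond)
		return nil
	})
	if err != nil {
		fmt.Println("measure failed:", err)
		return
	}
	fmt.Printf("fastest=%v median=%v slowest=%v\n", fastest, median, slowest)
}
```

If the ping is fast and only the query is slow, the database side is the more likely suspect; if both are slow, the connection path is.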

### 2. Redis Timeout Issues
- Frequent i/o timeout errors even with 30-second timeout configuration:

      read tcp 10.235.217.216:33404->10.187.111.239:6379: i/o timeout

- Simple Redis SET/GET operations timing out
- Timeouts occur even for small cache entries (<1KB)

### 3. Connection Pool Challenges
- Had to reduce connection pools significantly:
  - PostgreSQL: MaxOpenConns from 25 → 2 per service (20 services = 40 total)
  - Redis: PoolSize from 10 → 3 per service (20 services = 60 total)
- Even with reduced pools, still experiencing timeouts
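
The pool totals above are a simple product of service count and per-service pool size. As a sketch of that arithmetic, the snippet below compares the old and new PostgreSQL settings against an assumed server limit. The limit of 100 is PostgreSQL's stock default for `max_connections`; the value actually configured on this instance is not stated in the thread, and `totalConns` is a hypothetical helper.

```go
package main

import "fmt"

// totalConns returns the worst-case number of connections the backend sees
// when every service fills its pool.
func totalConns(services, perService int) int {
	return services * perService
}

func main() {
	const services = 20
	// Assumption: PostgreSQL's stock default for max_connections is 100.
	// The real limit on this instance is not given in the thread.
	const assumedPostgresLimit = 100

	for _, perService := range []int{25, 2} {
		total := totalConns(services, perService)
		fmt.Printf("MaxOpenConns=%d -> %d total (assumed limit %d, over=%t)\n",
			perService, total, assumedPostgresLimit, total > assumedPostgresLimit)
	}
}
```

Under that assumption, the original setting of 25 per service could ask for up to 500 connections, while the reduced setting of 2 stays at 40.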

## Configuration Details

### Current Connection Settings

    // PostgreSQL
    MaxOpenConns:    2
    MaxIdleConns:    1
    ConnMaxLifetime: 3 * time.Minute

    // Redis
    DialTimeout:  30 * time.Second
    ReadTimeout:  30 * time.Second
    WriteTimeout: 30 * time.Second
    PoolTimeout:  30 * time.Second
    PoolSize:     3
    MinIdleConns: 1
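
For reference, the PostgreSQL values listed above correspond to three setters on the standard library's `*sql.DB`, which GORM exposes through its `DB()` method. The sketch below shows one way they might be applied. It is an illustration only: `applyPoolSettings`, `openNoop`, and the `noop` driver are hypothetical names, and the driver is a stand-in so the example runs without a database.

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
	"time"
)

// noopDriver is a hypothetical stand-in so the sketch runs without a
// database; a real service would register a PostgreSQL driver instead.
type noopDriver struct{}

func (noopDriver) Open(name string) (driver.Conn, error) {
	return nil, errors.New("noop driver: no real connections")
}

func init() { sql.Register("noop", noopDriver{}) }

// applyPoolSettings applies the PostgreSQL pool limits quoted in the post.
func applyPoolSettings(db *sql.DB) {
	db.SetMaxOpenConns(2)                  // at most 2 connections per service
	db.SetMaxIdleConns(1)                  // keep 1 idle connection around
	db.SetConnMaxLifetime(3 * time.Minute) // recycle connections after 3 minutes
}

// openNoop returns a pool handle backed by the stand-in driver.
// sql.Open only validates its arguments; it does not connect.
func openNoop() (*sql.DB, error) {
	return sql.Open("noop", "")
}

func main() {
	db, err := openNoop()
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer db.Close()

	applyPoolSettings(db)
	fmt.Println("max open connections:", db.Stats().MaxOpenConnections)
}
```

`db.Stats()` also reports wait counts and wait durations, which can help show whether requests are queuing on a pool this small rather than on the network.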

## Questions
1. Are all services (apps, PostgreSQL, Redis) guaranteed to be in the same datacenter/availability zone?
2. Is there known latency in Railway's private network between services?
3. What are the recommended connection pool sizes for this architecture?
4. Should I consider using public endpoints instead of private networking for better performance?

## Expected Behavior
Internal private network connections should have low latency (<10ms) since all resources are in the same Railway project and region.

## Actual Behavior
Network operations taking 10-30+ seconds, suggesting high inter-service latency or network congestion.

## Impact
- Services unable to handle production traffic
- User requests timing out
- Poor user experience

Solved

3 Replies

Railway
BOT

a month ago

Hey there! We've found the following might help you get unblocked faster:

If you find the answer from one of these, please let us know by solving the thread!


a month ago

Hey there.

Let me start with a friendly observation: your support ticket reads very, very LLM-written. While there's nothing wrong with using an LLM for some help summing up an issue, the length and format simply make it harder for us to read, and therefore to help you (block format, repeated sentences, weird categorization). We'd appreciate it if you kept your answers to the point and uniform next time, as it will make our job easier.

Could you point me to your services in order for me to take a look?

Here are the answers to your questions:

  1. If all services are deployed in the same region, then yes, they should land in the same datacenter (we sometimes divert stateless workflows to other datacenters).

  2. We are aware of some spike issues with our private networking and our team is working on fixing these this quarter.

  3. We're unable to make recommendations for your architecture. I suggest opening a public thread - our community will be happy to help!

  4. There is no world where we would recommend using public networking. You'll see increased base latency, and you'll lose the isolation given by private networking, exposing your service to the internet.

I believe you're running into our known spike issues with private networking. Our team is picking this up asap and we're hoping to resolve it within the next few weeks.

Best,
Nico


Status changed to Awaiting User Response Railway • about 1 month ago


ranyu696
PRO

a month ago

I don't speak English, so I have to ask AI for help. The problem I'm facing now is that queries over the database's private address are very slow, but they are normal when I use the public address. I can't tell whether it's my problem or a platform problem. The production environment I deployed also uses the database's private network, and it works normally.
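
One way to compare the two paths without involving the database at all is to time a bare TCP connect to each address. The sketch below is an illustration under assumptions: `dialLatency` and `localListener` are hypothetical helpers, and the loopback listener stands in for the real endpoints so the example runs anywhere. In a real check, the address list would hold the private and public `host:port` of the database, which are not given in this thread.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialLatency times a bare TCP connect to addr ("host:port"). The result
// includes name resolution and the TCP handshake, but no database traffic.
func dialLatency(addr string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return 0, err
	}
	defer conn.Close()
	return time.Since(start), nil
}

// localListener opens a loopback listener so the sketch has something to
// connect to without any external service.
func localListener() (net.Listener, error) {
	return net.Listen("tcp", "127.0.0.1:0")
}

func main() {
	ln, err := localListener()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()

	// Replace this list with the private and public host:port pairs to
	// compare (hypothetical; the real addresses are not in the thread).
	for _, addr := range []string{ln.Addr().String()} {
		d, err := dialLatency(addr, 5*time.Second)
		if err != nil {
			fmt.Printf("%s: dial failed: %v\n", addr, err)
			continue
		}
		fmt.Printf("%s: connected in %v\n", addr, d)
	}
}
```

If the private address takes seconds to connect while the public one takes milliseconds, that points at the network path rather than at queries or indexes.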


Status changed to Awaiting Railway Response Railway • 30 days ago


Status changed to Solved ranyu696 • 30 days ago

