Scaling out causes request errors and 3× higher response times
injung
PRO · OP

a month ago

Hi community,

I'm experiencing an issue when scaling out my Rails server instances (I just migrated my workload to Railway). With a single instance, everything works fine: the request error rate is 0%, and the P90 response time stays below 1 second.

However, after increasing the instance count to two, around 10% of requests start failing, and the P90 response time increases to about 3 seconds. Has anyone experienced a similar issue?

At the moment, a single instance is sufficient even during peak traffic (~100k requests per minute, based on what we saw on Cloudflare Workers). Still, I'd like to resolve this scaling issue now to avoid potential problems in the future.

Attachments

Solved · $20 Bounty

14 Replies

brody
EMPLOYEE

a month ago

Hello,
The errors are from your application returning error codes, mainly 500s. Nothing on our end is returning an error. Given that this is not a problem with our platform or product, I have opened this thread up for the community to help you debug.
Best,
Brody


Status changed to Awaiting User Response · Railway · about 1 month ago


fra
HOBBY · Top 10% Contributor

a month ago

Do you see any errors in the logs?

Could you share more about your project architecture? What services are you running, etc.?


injung
PRO · OP

a month ago

Hi, thanks for taking a look, and I appreciate any help or ideas.

My architecture is simple: one Rails API service and one Rails worker.

Here are the logs:

[dff571b3-a775-4207-a814-dd5f592c76f3] Completed 500 Internal Server Error in 3043ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 17.2ms)
[5b7d8787-3591-4e18-ab5f-30b263b4316b] Completed 500 Internal Server Error in 3049ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 5.0ms)
[e72ca602-ee42-405a-9b45-446c3b9b57b2] Completed 500 Internal Server Error in 3050ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.7ms)
[01f909dd-1379-4b55-9dc4-88d9ed92d47a] Completed 500 Internal Server Error in 3034ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.8ms)
[b010f55b-245d-4572-a8ca-158d0bee91d8] Completed 500 Internal Server Error in 3055ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 2.0ms)
[baffbcfe-f23f-470b-b641-5cca345a504b] Completed 500 Internal Server Error in 3035ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[69b02c1d-5ff2-42da-b7bc-a9ae6631a2fe] Completed 500 Internal Server Error in 3051ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[59aba690-bc7b-40b6-9f8a-e415ecbf7727] Completed 500 Internal Server Error in 3031ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 7.4ms)
[b38b0077-d271-40a4-a840-964564ac5c8e] Completed 500 Internal Server Error in 3045ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[613b6d8d-4068-4e0c-96f3-2cd767087495] Completed 500 Internal Server Error in 3035ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[ea79c2e5-40b6-4b97-9f77-4b9405a82712] Completed 500 Internal Server Error in 3047ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[ad5c81c9-41c1-4be6-b416-6f3dd2c9c92a] Completed 500 Internal Server Error in 3038ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[7c0e5b3e-427d-47ab-a507-2f77213af92d] Completed 500 Internal Server Error in 3045ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 3.0ms)
[06e81b14-b4b8-4660-8ff5-80cf1dc4359b] Completed 500 Internal Server Error in 3041ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[e4678dae-e6a1-4fae-9892-a0beda59e61a] Completed 500 Internal Server Error in 3048ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)

Here's my database.yml:

default: &default
  adapter: postgresql
  encoding: unicode
  max_connections: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>


development:
  primary: &primary_development
    <<: *default
    database: intoss_development
  cache:
    <<: *primary_development
    database: intoss_development_cache
    migrations_paths: db/cache_migrate
  queue:
    <<: *primary_development
    database: intoss_development_queue
    migrations_paths: db/queue_migrate

test:
  <<: *default
  url: <%= ENV["RAILS_DATABASE_URL"] %>
  database: intoss_test

preview:
  primary: &primary_preview
    <<: *default
    url: <%= ENV["RAILS_DATABASE_URL"] %>
    database: intoss_preview
  cache:
    <<: *primary_preview
    database: intoss_preview_cache
    migrations_paths: db/cache_migrate
  queue:
    <<: *primary_preview
    database: intoss_preview_queue
    migrations_paths: db/queue_migrate

production:
  primary: &primary_production
    <<: *default
    url: <%= ENV["RAILS_DATABASE_URL"] %>
    database: intoss_production
  cache:
    <<: *primary_production
    database: intoss_production_cache
    migrations_paths: db/cache_migrate
  queue:
    <<: *primary_production
    database: intoss_production_queue
    migrations_paths: db/queue_migrate

I'm using RAILS_DATABASE_URL and not specifying a database name in the URL, following this Railway community thread:
https://station.railway.com/community/tips-for-deploying-a-rails-app-with-soli-593eb702


injung
PRO · OP

a month ago

Architecture diagram here

Attachments


fra
HOBBY · Top 10% Contributor

a month ago

Just throwing some ideas in the bucket: are your services using a connection pool? How many connections per pool? Is it possible that you've already used all the available connections and the replica can't connect to the DB?

Also, unrelated to the problem: in theory, when using microservices, each service should have its own DB, but in this diagram I can see two services connected to the same DB. I think it would be better to connect to the DB only from the API and expose some REST endpoints in the API for the worker to use. That way just one app deals with the DB, and connection handling will be easier.
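For reference, a stock Rails `database.yml` sizes the connection pool with the `pool` key (Rails doesn't read `max_connections` for that), so it's worth double-checking which key your config actually uses. The default block looks roughly like:

```yaml
# Stock Rails default: pool size follows the server's thread count.
default: &default
  adapter: postgresql
  encoding: unicode
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
```

If the key is missing or misspelled, Rails silently falls back to its default pool size, which can hide or mask connection-limit issues.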


injung
PRO · OP

a month ago

For additional context: PostgreSQL max_connections is set to 100 (Railway default), and my Rails connection pool size is 5.

With 1 API instance and 1 worker instance running, I'm seeing ~3 and ~6 active connections respectively.

Based on that, it doesn't seem like we're hitting the database connection limit. Does that sound right, or am I missing something?
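For what it's worth, the worst-case math (every thread in every replica holding a connection at once) backs that up. A quick sketch, using the pool size from this thread and assuming two API replicas plus one worker:

```ruby
# Worst-case Postgres connection demand: each replica can open up to
# `pool_size` connections (one per thread). Pool size is from this
# thread; the replica counts are the scaled-out scenario.
pool_size       = 5   # RAILS_MAX_THREADS fallback in database.yml
api_replicas    = 2
worker_replicas = 1

worst_case = (api_replicas + worker_replicas) * pool_size
puts worst_case  # 15 connections, far below the Postgres default of 100
```

So even at full saturation the app would use a small fraction of the 100-connection limit.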

Attachments


fra

Just throwing some ideas in the bucket, are your services using pool? how many connection per pool? is it possible that you already used all the available connections and the replica can't connect to the db? Also, unrelated to the problem, in theory when using micro services, each service should have it's own db, in this diagram I can see 2 services connected to the same db, I think it would be better if you connect to the db only from the api, and you can expose some rest endpoints in the api used by the worker, doing this you have just 1 app dealing with the db and the connection handling will be easier...

injung
PRO · OP

a month ago

Thanks for the suggestion!

We're intentionally running a monolithic architecture, not microservices. The API and worker are just separate processes of the same app, so splitting the database doesn't give us much benefit.

Having the worker go through the API would instead add unnecessary network RTT, in my view. Please let me know if I'm missing something.


thaumanovic

a month ago

Has this service previously been scaled like this on your old platform? Just wondering if it's some kind of application-specific issue, i.e. holding locks etc.


thaumanovic

Has this service previously been scaled like this on your old platform? Just wondering if it's some kind of application-specific issue, i.e. holding locks etc.

injung
PRO · OP

a month ago

I was previously running this on Cloudflare Workers + D1, and I recently migrated to Rails + Postgres on Railway. So the two setups aren't directly comparable.

That said, I never saw this issue on Workers/D1, and it only started after the migration.


injung

For additional context: PostgreSQL max_connections is set to 100 (Railway default), and my Rails connection pool size is 5. With 1 API instance and 1 worker instance running, I'm seeing ~3 and ~6 active connections respectively. Based on that, it doesn't seem like we're hitting the database connection limit. Does that sound right, or am I missing something?

fra
HOBBY · Top 10% Contributor

a month ago

Yeah, as you said, that should be fine. The logs don't say much about the error, so it could be anything. What I would try is improving the logging so it tells you where the app is throwing. Does the library you use for the DB support events, something like onError? Does your app expect sticky sessions?
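One plain-Ruby way to sketch that idea (the helper name is hypothetical): wrap the suspect calls so each bare "Completed 500" line gets a companion line naming the actual exception class:

```ruby
# Hypothetical helper: log the exception class and message to stderr
# before re-raising, so a 500 in the Rails log is accompanied by the
# real failure (e.g. Net::OpenTimeout) and where it happened.
def with_error_logging(label)
  yield
rescue StandardError => e
  warn "[#{label}] #{e.class}: #{e.message}"
  raise
end

# Usage sketch (client and path are placeholders):
# with_error_logging("external-api") { client.get("/v1/thing") }
```

The re-raise keeps the existing 500 behavior intact; only the logging changes.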


injung
PRO · OP

a month ago

I found the root cause: we make an outbound request via Faraday to an external service, and open_timeout was set to 3 seconds.
That explains the ~3s failures, but I still don't understand why the issue only shows up after scaling from 1 to 2 instances. Any thoughts on what could make outbound connections more likely to hit open_timeout when scaling out?

[a4fc6294-d269-4a5d-8dea-b47fb4173095] Faraday::ConnectionFailed (Failed to open TCP connection to [filtered]:443 (execution expired))
[a4fc6294-d269-4a5d-8dea-b47fb4173095] Caused by: Net::OpenTimeout (Failed to open TCP connection to [filtered]:443 (execution expired))
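For context on that trace: `Net::OpenTimeout` comes from Ruby's stdlib Net::HTTP, which Faraday's default adapter uses underneath. The 3 s cutoff applies only to establishing the TCP/TLS connection, separately from waiting for the response. A standalone sketch of the same knobs (host is a placeholder):

```ruby
require "net/http"

# The timeout knobs behind Faraday's default adapter:
# - open_timeout bounds the TCP/TLS connect (what expired in the logs)
# - read_timeout bounds waiting for response data
http = Net::HTTP.new("example.com", 443)
http.use_ssl      = true
http.open_timeout = 3   # raises Net::OpenTimeout if the connect stalls
http.read_timeout = 10

# http.start { |h| h.get("/") }  # would raise Net::OpenTimeout on a stalled connect
```

So "execution expired" here means the remote end (or something in between) never completed the handshake within 3 seconds, not that the request itself was slow.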

injung

I found the root cause: we make an outbound request via Faraday to an external service, and open_timeout was set to 3. That explains the ~3s failures, but I still don't understand why the issue only shows up after scaling from 1 to 2 instances. Any thoughts on what could make outbound connections more likely to hit open_timeout when scaling out? [a4fc6294-d269-4a5d-8dea-b47fb4173095] Faraday::ConnectionFailed (Failed to open TCP connection to [filtered]:443 (execution expired)) [a4fc6294-d269-4a5d-8dea-b47fb4173095] Caused by: Net::OpenTimeout (Failed to open TCP connection to [filtered]:443 (execution expired))

thaumanovic

a month ago

Does the external service have any kind of connection/rate limiting?


thaumanovic

Does the external service have any kind of connection/rate limiting?

fra
HOBBY · Top 10% Contributor

a month ago

Yeah, this would be my first guess. Maybe the requests are sent from different IPs (I'm not sure about this). Given you have a Pro account, you could try using a static IP and see if it changes anything.
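If it does turn out to be connection or rate limiting, a small retry with backoff around the outbound call usually absorbs transient connect failures. A plain-Ruby sketch (the helper name is hypothetical; with Faraday you would rescue `Faraday::ConnectionFailed`, but `Net::OpenTimeout` keeps this stdlib-only):

```ruby
require "net/http"  # defines Net::OpenTimeout

# Hypothetical retry helper: re-run the block on connect-style failures,
# doubling the sleep between attempts (0.5s, 1s, ...). Raises the last
# error once the attempt budget is spent.
def with_retries(attempts: 3, base_delay: 0.5)
  tries = 0
  begin
    yield
  rescue Net::OpenTimeout, Errno::ECONNRESET
    tries += 1
    raise if tries >= attempts
    sleep(base_delay * (2**(tries - 1)))
    retry
  end
end

# Usage sketch:
# with_retries { conn.get("/v1/thing") }
```

Backoff only hides the symptom, though; if the remote end is limiting concurrent connections per IP, the static IP test above is the more direct diagnostic.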


Status changed to Solved · injung · 29 days ago
