2 months ago
Hi community,
I'm experiencing an issue when scaling out my Rails server instances (I just migrated my workload to Railway). With a single instance, everything works fine: the request error rate is 0%, and the P90 response time stays below 1 second.
However, after increasing the instance count to two, around 10% of requests start failing, and the P90 response time increases to about 3 seconds. Has anyone experienced a similar issue?
At the moment, a single instance is sufficient even during peak traffic (~100k requests per minute when using Cloudflare Workers). Still, I'd like to resolve this scaling issue now to avoid potential problems in the future.
Attachments
14 Replies
2 months ago
Hello,
The errors are from your application returning error codes, mainly 500s. Nothing on our end is returning an error. Given that this is not a problem with our platform or product, I have opened this thread up for the community to help you debug.
Best,
Brody
Status changed to Awaiting User Response Railway • about 2 months ago
2 months ago
do you see any error in the logs?
2 months ago
Could you share more about your project architecture? What services you are running etc.
2 months ago
Hi, thanks for taking a look, and I appreciate any help or ideas.
My architecture is simple: one Rails API service and one Rails worker.
Here are the logs:
[dff571b3-a775-4207-a814-dd5f592c76f3] Completed 500 Internal Server Error in 3043ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 17.2ms)
[5b7d8787-3591-4e18-ab5f-30b263b4316b] Completed 500 Internal Server Error in 3049ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 5.0ms)
[e72ca602-ee42-405a-9b45-446c3b9b57b2] Completed 500 Internal Server Error in 3050ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.7ms)
[01f909dd-1379-4b55-9dc4-88d9ed92d47a] Completed 500 Internal Server Error in 3034ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.8ms)
[b010f55b-245d-4572-a8ca-158d0bee91d8] Completed 500 Internal Server Error in 3055ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 2.0ms)
[baffbcfe-f23f-470b-b641-5cca345a504b] Completed 500 Internal Server Error in 3035ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[69b02c1d-5ff2-42da-b7bc-a9ae6631a2fe] Completed 500 Internal Server Error in 3051ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[59aba690-bc7b-40b6-9f8a-e415ecbf7727] Completed 500 Internal Server Error in 3031ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 7.4ms)
[b38b0077-d271-40a4-a840-964564ac5c8e] Completed 500 Internal Server Error in 3045ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[613b6d8d-4068-4e0c-96f3-2cd767087495] Completed 500 Internal Server Error in 3035ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[ea79c2e5-40b6-4b97-9f77-4b9405a82712] Completed 500 Internal Server Error in 3047ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[ad5c81c9-41c1-4be6-b416-6f3dd2c9c92a] Completed 500 Internal Server Error in 3038ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[7c0e5b3e-427d-47ab-a507-2f77213af92d] Completed 500 Internal Server Error in 3045ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 3.0ms)
[06e81b14-b4b8-4660-8ff5-80cf1dc4359b] Completed 500 Internal Server Error in 3041ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
[e4678dae-e6a1-4fae-9892-a0beda59e61a] Completed 500 Internal Server Error in 3048ms (ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
Here's my database.yml:
default: &default
  adapter: postgresql
  encoding: unicode
  max_connections: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>

development:
  primary: &primary_development
    <<: *default
    database: intoss_development
  cache:
    <<: *primary_development
    database: intoss_development_cache
    migrations_paths: db/cache_migrate
  queue:
    <<: *primary_development
    database: intoss_development_queue
    migrations_paths: db/queue_migrate

test:
  <<: *default
  url: <%= ENV["RAILS_DATABASE_URL"] %>
  database: intoss_test

preview:
  primary: &primary_preview
    <<: *default
    url: <%= ENV["RAILS_DATABASE_URL"] %>
    database: intoss_preview
  cache:
    <<: *primary_preview
    database: intoss_preview_cache
    migrations_paths: db/cache_migrate
  queue:
    <<: *primary_preview
    database: intoss_preview_queue
    migrations_paths: db/queue_migrate

production:
  primary: &primary_production
    <<: *default
    url: <%= ENV["RAILS_DATABASE_URL"] %>
    database: intoss_production
  cache:
    <<: *primary_production
    database: intoss_production_cache
    migrations_paths: db/cache_migrate
  queue:
    <<: *primary_production
    database: intoss_production_queue
    migrations_paths: db/queue_migrate

I'm using RAILS_DATABASE_URL and not specifying a database name in the URL, following this Railway community thread:
https://station.railway.com/community/tips-for-deploying-a-rails-app-with-soli-593eb702
2 months ago
Just throwing some ideas in the bucket: are your services using a connection pool? How many connections per pool? Is it possible that you've already used all the available connections and the replica can't connect to the DB?
Also, unrelated to the problem: in theory, when using microservices, each service should have its own DB. In this diagram I can see two services connected to the same DB. I think it would be better to connect to the DB only from the API and expose some REST endpoints in the API for the worker to use. That way only one app deals with the DB, and the connection handling will be easier.
2 months ago
For additional context: PostgreSQL max_connections is set to 100 (Railway default), and my Rails connection pool size is 5.
With 1 API instance and 1 worker instance running, I'm seeing ~3 and ~6 active connections respectively.
Based on that, it doesn't seem like we're hitting the database connection limit. Does that sound right, or am I missing something?
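For reference, the worst-case math behind that conclusion can be sketched in a few lines (pool size and max_connections are the numbers quoted above; the instance counts are illustrative assumptions, not measured values):

```ruby
# Worst-case Postgres connection usage if every pool fills up.
# Assumptions: 2 API instances + 1 worker, pool of 5 per process,
# Railway's default max_connections of 100.
pool_size       = 5
max_connections = 100
processes       = 2 + 1  # 2 API instances + 1 worker (illustrative)

worst_case = processes * pool_size
puts worst_case                    # 15
puts worst_case < max_connections  # true: nowhere near the limit
```

Even at full pool saturation this stays well under 100, which matches the observed ~3 and ~6 active connections.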
Attachments
fra
Just throwing some ideas in the bucket: are your services using a connection pool? How many connections per pool? Is it possible that you've already used all the available connections and the replica can't connect to the DB? Also, unrelated to the problem: in theory, when using microservices, each service should have its own DB. In this diagram I can see two services connected to the same DB. I think it would be better to connect to the DB only from the API and expose some REST endpoints in the API for the worker to use. That way only one app deals with the DB, and the connection handling will be easier.
2 months ago
Thanks for the suggestion!
We're intentionally running a monolithic architecture, not microservices. The API and worker are just separate processes of the same app, so splitting the database doesn't give us much benefit.
Having the worker go through the API would instead add unnecessary network RTT, in my view. Please let me know if I'm missing something.
2 months ago
Has this service previously been scaled like this on your old platform? Just wondering if it's some kind of application-specific issue, i.e. holding locks etc.
thaumanovic
Has this service previously been scaled like this on your old platform? Just wondering if it's some kind of application-specific issue, i.e. holding locks etc.
2 months ago
I was previously running this on Cloudflare Workers + D1, and I recently migrated to Rails + Postgres on Railway. So the two setups aren't directly comparable.
That said, I never saw this issue on Workers/D1, and it only started after the migration.
injung
For additional context: PostgreSQL max_connections is set to 100 (Railway default), and my Rails connection pool size is 5. With 1 API instance and 1 worker instance running, I'm seeing ~3 and ~6 active connections respectively. Based on that, it doesn't seem like we're hitting the database connection limit. Does that sound right, or am I missing something?
2 months ago
Yeah, as you said, it should be fine. The logs don't say much about the error, so it could be anything. What I would try is improving the logs to give you more info about where the app is throwing. Does the library you use for the DB support events, something like onError? Does your app expect sticky sessions?
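One generic way to make those "Completed 500" lines more useful is to log the full exception cause chain instead of just the status. A plain-Ruby sketch (the helper name is hypothetical, not a Rails or Faraday API):

```ruby
# Walks an exception's cause chain so the log shows the underlying error
# (e.g. a timeout) rather than only the top-level 500.
def error_chain(error)
  chain = []
  while error
    chain << "#{error.class}: #{error.message}"
    error = error.cause
  end
  chain
end

begin
  begin
    raise IOError, "connect failed"  # the real, low-level cause
  rescue IOError
    raise "request failed"           # what the controller layer sees
  end
rescue => e
  # Ruby links the inner exception automatically via #cause.
  puts error_chain(e).join(" <- ")
  # RuntimeError: request failed <- IOError: connect failed
end
```

Hooked into a controller-level rescue or error reporter, this would have shown the timeout class immediately instead of a bare 500.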
2 months ago
I found the root cause: we make an outbound request via Faraday to an external service, and open_timeout was set to 3.
That explains the ~3s failures, but I still don't understand why the issue only shows up after scaling from 1 to 2 instances. Any thoughts on what could make outbound connections more likely to hit open_timeout when scaling out?
[a4fc6294-d269-4a5d-8dea-b47fb4173095] Faraday::ConnectionFailed (Failed to open TCP connection to [filtered]:443 (execution expired))
[a4fc6294-d269-4a5d-8dea-b47fb4173095] Caused by: Net::OpenTimeout (Failed to open TCP connection to [filtered]:443 (execution expired))
injung
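For context on the error class: Faraday's default Net::HTTP adapter raises Net::OpenTimeout when the TCP/TLS connection isn't established within open_timeout, which matches the ~3s failures above. A minimal stdlib sketch of the two timeout knobs (the hostname is a placeholder and the values are illustrative, not a recommendation):

```ruby
require "net/http"

# open_timeout bounds connection establishment; read_timeout bounds
# waiting for response data once connected.
http = Net::HTTP.new("api.example.com", 443)  # placeholder host
http.use_ssl = true
http.open_timeout = 3   # seconds allowed to open the TCP connection
http.read_timeout = 10  # seconds allowed per read

# No request is made here. Raising open_timeout or adding a single retry
# would mask slow-handshake blips, though it wouldn't by itself explain
# why they appear only when running two instances.
```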
I found the root cause: we make an outbound request via Faraday to an external service, and open_timeout was set to 3. That explains the ~3s failures, but I still don't understand why the issue only shows up after scaling from 1 to 2 instances. Any thoughts on what could make outbound connections more likely to hit open_timeout when scaling out? [a4fc6294-d269-4a5d-8dea-b47fb4173095] Faraday::ConnectionFailed (Failed to open TCP connection to [filtered]:443 (execution expired)) [a4fc6294-d269-4a5d-8dea-b47fb4173095] Caused by: Net::OpenTimeout (Failed to open TCP connection to [filtered]:443 (execution expired))
2 months ago
Does the external service have any kind of connection/rate limiting?
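If the external service does throttle per-client connections, one mitigation is a client-side limiter that smooths bursts before they hit the remote end. A hypothetical token-bucket sketch (not part of the OP's code; class and parameters are made up for illustration):

```ruby
# Token-bucket limiter: allows short bursts up to `capacity`, then
# refills at `rate` tokens per second. The injectable clock makes it
# deterministic for testing.
class TokenBucket
  def initialize(rate:, capacity:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @rate = rate.to_f          # tokens added per second
    @capacity = capacity.to_f  # maximum burst size
    @tokens = capacity.to_f
    @clock = clock
    @last = clock.call
  end

  # Returns true if a request may proceed now, false if it should wait.
  def allow?
    now = @clock.call
    @tokens = [@tokens + (now - @last) * @rate, @capacity].min
    @last = now
    return false if @tokens < 1
    @tokens -= 1
    true
  end
end

bucket = TokenBucket.new(rate: 5, capacity: 2)
p bucket.allow?  # true
p bucket.allow?  # true
p bucket.allow?  # false (burst of 2 exhausted back-to-back)
```

Each instance would gate its outbound Faraday calls through such a limiter, so adding a second instance doesn't double the burst rate seen by the external service.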
thaumanovic
Does the external service have any kind of connection/rate limiting?
2 months ago
Yeah, this would be my first guess. Maybe the requests are sent from different IPs (I'm not sure about this). Given you have a Pro account, you can try using a static IP and see if it changes anything.
Status changed to Solved injung • about 2 months ago
