How to measure and visualize CPU steal
yiphei
PRO · OP

2 months ago

I run very CPU-intensive tasks with low-latency requirements. Since Railway machines use shared CPUs, I want to measure and visualize CPU steal so I can separate the latency caused by Railway's infrastructure from the latency caused by my own code. What is the recommended way to do this?

Solved

5 Replies

chandrika
EMPLOYEE

2 months ago

We don't expose CPU steal time as a metric. Our built-in metrics cover CPU usage, memory, disk, and network, but not steal. Since your containers run in a cgroup-isolated environment, traditional tools like /proc/stat won't give you meaningful steal figures from inside the container either.
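For reference, steal is the eighth counter on the aggregate `cpu` line of `/proc/stat` on Linux. The following is a minimal sketch of sampling it twice and computing the steal share of elapsed ticks. The helper names are hypothetical, and per the caveat above, the figure it reports from inside a container may not be meaningful.

```python
import time
from pathlib import Path

# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Very old kernels may omit trailing fields; pad with zeros.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Share of CPU ticks between two samples that were counted as steal."""
    total = sum(after[f] - before[f] for f in FIELDS)
    if total <= 0:
        return 0.0
    return 100.0 * (after["steal"] - before["steal"]) / total


if __name__ == "__main__":
    stat = Path("/proc/stat")
    if stat.exists():  # Linux only
        first = parse_cpu_line(stat.read_text())
        time.sleep(1.0)
        second = parse_cpu_line(stat.read_text())
        print(f"steal over 1s: {steal_percent(first, second):.2f}%")
```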

For latency attribution, the most practical approach is to instrument your own code with timing (e.g., OpenTelemetry) and compare against our HTTP log fields: upstreamRqDuration shows how long your app took to respond, while totalDuration includes network/proxy time. Variance in those timings under consistent load can serve as a proxy for noisy-neighbor effects. Our slow deployments guide covers these fields in detail.
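A rough sketch of that comparison, assuming the log entries have already been exported as a list of dicts carrying the two fields in the same unit. The record shape and the `summarize` helper are assumptions for illustration, not a documented export format.

```python
import statistics


def summarize(records):
    """Split request time into app time and network/proxy overhead."""
    app = [r["upstreamRqDuration"] for r in records]
    overhead = [r["totalDuration"] - r["upstreamRqDuration"] for r in records]
    return {
        # Typical time the app itself took to respond.
        "app_median": statistics.median(app),
        # Spread of app time; growth under constant load hints at contention.
        "app_stdev": statistics.pstdev(app),
        # Typical time spent outside the app (network/proxy).
        "overhead_median": statistics.median(overhead),
    }


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    sample = [
        {"upstreamRqDuration": 10, "totalDuration": 12},
        {"upstreamRqDuration": 20, "totalDuration": 25},
        {"upstreamRqDuration": 30, "totalDuration": 31},
    ]
    print(summarize(sample))
```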


Status changed to Awaiting User Response Railway about 2 months ago


yiphei
PRO · OP

2 months ago

@chandrika What I'm benchmarking is not a traditional web server task. In fact, there is zero inbound traffic, so what you are suggesting won't work for me. The task can be significantly affected by L1 and L2 cache behavior (among other things), which is why I need to see steal and other CPU contention metrics to understand the bigger picture of what's going on.


Status changed to Awaiting Railway Response Railway about 2 months ago


Railway
BOT

2 months ago

We don't expose CPU steal time or hardware-level cache contention metrics, and since containers run in a cgroup-isolated environment, `/proc/stat` steal figures from inside the container won't reflect host-level contention. For compute-intensive, non-HTTP workloads where cache behavior matters, the most practical approach is to run your own timing-based microbenchmarks repeatedly and look at variance in execution time across runs as a signal for noisy-neighbor effects. If your workload requires full resource isolation to get stable, repeatable benchmarks, our [Enterprise plan](https://railway.com/enterprise) offers dedicated VMs where you'd have exclusive access to the underlying hardware.
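The repeated-timing approach can be sketched roughly as follows. The `workload` function is a hypothetical stand-in for the real task, and the run count is arbitrary. A large coefficient of variation or a wide gap between minimum and maximum is only a hint of interference, not a measurement of steal.

```python
import statistics
import time


def workload():
    """Hypothetical stand-in for the real CPU-bound task."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


def measure(fn, runs=30):
    """Time fn() `runs` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def spread(durations):
    """Summarize run-to-run variation (needs at least two durations)."""
    mean = statistics.mean(durations)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
        # Coefficient of variation: sample stdev relative to the mean.
        "cv": statistics.stdev(durations) / mean if mean > 0 else 0.0,
    }


if __name__ == "__main__":
    stats = spread(measure(workload))
    print(
        f"min={stats['min']:.4f}s median={stats['median']:.4f}s "
        f"max={stats['max']:.4f}s cv={stats['cv']:.3f}"
    )
```

Comparing the minimum against the median is often more informative than the mean, since the minimum approximates the run with the least interference.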


Status changed to Awaiting User Response brody about 2 months ago



yiphei
PRO · OP

2 months ago

None of your suggestions are realistic. Running the benchmark for longer is simply dumb, and it prevents me from testing changes quickly. As for Enterprise, I'm too small to get it.

In the meantime, I've already moved this workload to fly.io. They provide steal metrics, and on their performance CPUs the steal time averages around 0.5%, which is good enough for my work.


Status changed to Awaiting Railway Response Railway about 2 months ago


2 months ago

Understood, and thanks for the feedback. If you'd like to formally request CPU steal metrics as a feature, you can submit it on our roadmap so other users can upvote it.


Status changed to Awaiting User Response Railway about 2 months ago


Status changed to Solved brody about 2 months ago

