2 months ago
This is a big issue for me. About 75% of the time, Railway says it sent SIGTERM, but my application does not receive it. This means the deployment will effectively be alive forever unless removed, since it doesn't even clear after the draining timeout.
Bad deploy logs:
Starting Container
File ingestion cancel channel subscribed
File ingestion worker ready
sending signal SIGTERM to container
Good deploy logs:
Starting Container
File ingestion cancel channel subscribed
File ingestion worker ready
sending signal SIGTERM to container
Received SIGTERM. Starting graceful shutdown... 0 jobs processing
Shutdown complete. Exiting with code 0.
Stopping Container
Again, the missed SIGTERM only happens around 75% of the time. But it means I have to stop the deployment manually at some point, not knowing whether I'm interrupting an active job and impacting a user, let alone the fact that an active deployment (still accepting new jobs/connections) is running old code. I am pretty certain that this is not a bug in my own code.
7 Replies
Here is a currently active deployment that never received the SIGTERM even though Railway said it sent the signal: https://railway.com/project/0f75896b-b4ee-4932-929b-68a33d865542/service/e4fd84e5-c7e2-4f34-888d-a5f6240ea67c?environmentId=e745a7cf-ab5c-4d45-bf14-37f3cda4292e&id=3b29ce2f-1cd6-4610-83d9-1ce87efaa86a#deploy
Relevant code (pretty much the same across all the services I deploy, although some services hit this bug more often than others):
import { fileIngestionWorker } from './workers/file-ingestion.ts';
import { FILE_INGESTION_CANCEL_CHANNEL } from '@supai/queue-contracts/files';
import { getRedisSubscriber } from '@supai/redis';

// Track in-flight jobs so the shutdown logs can report them.
let activeJobCount = 0;
fileIngestionWorker.on('active', () => {
  activeJobCount++;
});
fileIngestionWorker.on('completed', () => {
  activeJobCount--;
});
fileIngestionWorker.on('failed', () => {
  activeJobCount--;
});

let isExiting = false;
const exit = async (signal: 'SIGTERM' | 'SIGINT') => {
  // Ignore repeated signals once shutdown has started.
  if (isExiting) {
    console.log(`Received ${signal}, but already shutting down. Ignoring.`);
    return;
  }
  isExiting = true;
  console.log(
    `Received ${signal}. Starting graceful shutdown... ${activeJobCount.toLocaleString()} job${activeJobCount === 1 ? '' : 's'} processing`,
  );
  const interval = setInterval(() => {
    console.log(
      `Waiting for shutdown... ${activeJobCount.toLocaleString()} job${activeJobCount === 1 ? '' : 's'} still processing`,
    );
  }, 5_000);
  try {
    // Wait for the worker to finish its active jobs before exiting.
    await fileIngestionWorker.close();
  } catch (error) {
    console.error('Error during exit:', error);
    process.exitCode = 1;
  } finally {
    clearInterval(interval);
  }
  console.log(`Shutdown complete. Exiting with code ${process.exitCode ?? 0}.`);
  process.exit();
};
process.on('SIGTERM', () => exit('SIGTERM'));
process.on('SIGINT', () => exit('SIGINT'));

await Promise.all([
  (async () => {
    const subscribe = await getRedisSubscriber();
    await subscribe(FILE_INGESTION_CANCEL_CHANNEL, fileId => {
      fileIngestionWorker.cancelJob(fileId);
    });
    console.log('File ingestion cancel channel subscribed');
  })(),
  (async () => {
    // Promise<void> so that resolve() can be called without an argument.
    await new Promise<void>((resolve, reject) => {
      fileIngestionWorker.once('ready', () => {
        resolve();
      });
      fileIngestionWorker.once('error', error => {
        reject(error);
      });
    });
    console.log('File ingestion worker ready');
  })(),
]);
2 months ago
Are you able to reproduce this with an ultra minimal example?
In a new project, or do you want me to redeploy one of my existing services?
2 months ago
I'd like a minimum code example so that we can rule out any effects of your application code.
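For reference, a minimal reproduction could be as small as the sketch below. It is only an illustration, not the poster's code: it drops the worker and Redis entirely, and the names `createShutdownHandler` and `keepAlive` are hypothetical. It assumes a plain Node.js process started directly as the container's main process.

```typescript
type Signal = 'SIGTERM' | 'SIGINT';

// Hypothetical helper: returns an idempotent signal handler.
// `log` and `exit` are injected so the handler can be exercised without
// terminating the process.
function createShutdownHandler(
  log: (message: string) => void,
  exit: (code: number) => void,
): (signal: Signal) => boolean {
  let isExiting = false;
  return (signal: Signal): boolean => {
    if (isExiting) {
      log(`Received ${signal}, but already shutting down. Ignoring.`);
      return false;
    }
    isExiting = true;
    log(`Received ${signal}. Exiting with code 0.`);
    exit(0);
    return true;
  };
}

const handleSignal = createShutdownHandler(console.log, code => process.exit(code));
process.on('SIGTERM', () => handleSignal('SIGTERM'));
process.on('SIGINT', () => handleSignal('SIGINT'));

// Log the PID so it is visible whether this process is the one being signaled.
console.log(`Started with PID ${process.pid}`);

// Keep the event loop alive and emit a heartbeat, so a deployment that
// never received the signal is easy to spot in the logs.
const keepAlive = setInterval(() => {
  console.log('Still alive');
}, 5_000);
```

If a deployment of this logs "sending signal SIGTERM to container" but never logs "Received SIGTERM", application code can be ruled out as the cause.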