Ongoing issue: SIGTERM is not received by service about 75% of the time
kenmueller
PRO · OP

2 months ago

This is a big issue for me. About 75% of the time, Railway says it sent SIGTERM, but my application does not receive it. This means the deployment effectively stays alive forever unless I remove it manually, since it isn't even cleared after the draining timeout.

Bad deploy logs:

Starting Container
File ingestion cancel channel subscribed
File ingestion worker ready
sending signal SIGTERM to container

Good deploy logs:

Starting Container
File ingestion cancel channel subscribed
File ingestion worker ready
sending signal SIGTERM to container
Received SIGTERM. Starting graceful shutdown... 0 jobs processing
Shutdown complete. Exiting with code 0.
Stopping Container

Again, this happens around 75% of the time. It means I have to stop the old deployment manually at some point, without knowing whether I'm interrupting an active job and impacting a user. Meanwhile, that deployment keeps running old code and still accepts new jobs and connections. I am pretty certain that this is not a bug in my own code.

7 Replies

kenmueller
PRO · OP

2 months ago


kenmueller
PRO · OP

2 months ago

Relevant code (pretty much the same across all the services I deploy, although some services hit this bug more often than others):

import { fileIngestionWorker } from './workers/file-ingestion.ts';
import { FILE_INGESTION_CANCEL_CHANNEL } from '@supai/queue-contracts/files';
import { getRedisSubscriber } from '@supai/redis';

let activeJobCount = 0;

fileIngestionWorker.on('active', () => {
  activeJobCount++;
});

fileIngestionWorker.on('completed', () => {
  activeJobCount--;
});

fileIngestionWorker.on('failed', () => {
  activeJobCount--;
});

let isExiting = false;

const exit = async (signal: 'SIGTERM' | 'SIGINT') => {
  if (isExiting) {
    console.log(`Received ${signal}, but already shutting down. Ignoring.`);
    return;
  }

  isExiting = true;

  console.log(
    `Received ${signal}. Starting graceful shutdown... ${activeJobCount.toLocaleString()} job${activeJobCount === 1 ? '' : 's'} processing`,
  );

  const interval = setInterval(() => {
    console.log(
      `Waiting for shutdown... ${activeJobCount.toLocaleString()} job${activeJobCount === 1 ? '' : 's'} still processing`,
    );
  }, 5_000);

  try {
    await fileIngestionWorker.close();
  } catch (error) {
    console.error('Error during exit:', error);
    process.exitCode = 1;
  } finally {
    clearInterval(interval);
  }

  console.log(`Shutdown complete. Exiting with code ${process.exitCode ?? 0}.`);
  process.exit();
};

process.on('SIGTERM', () => exit('SIGTERM'));
process.on('SIGINT', () => exit('SIGINT'));

await Promise.all([
  (async () => {
    const subscribe = await getRedisSubscriber();

    await subscribe(FILE_INGESTION_CANCEL_CHANNEL, fileId => {
      fileIngestionWorker.cancelJob(fileId);
    });

    console.log('File ingestion cancel channel subscribed');
  })(),
  (async () => {
    await new Promise<void>((resolve, reject) => {
      fileIngestionWorker.once('ready', () => {
        resolve();
      });

      fileIngestionWorker.once('error', error => {
        reject(error);
      });
    });

    console.log('File ingestion worker ready');
  })(),
]);

kenmueller
PRO · OP

2 months ago

[Attachment: Screenshot_2026-03-27_at_5.22.45_PM.png]


2 months ago

Are you able to reproduce this with an ultra minimal example?


kenmueller
PRO · OP

2 months ago

With a new project, or do you want me to redeploy one of my same services?


2 months ago

I'd like a minimal code example so that we can rule out any effects of your application code.
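
For illustration, a reproduction along these lines might look like the sketch below. It assumes a Node.js runtime and uses no application dependencies: it only logs the process ID, logs any SIGTERM or SIGINT it receives, and otherwise stays alive. The names `installSignalLogger` and `main`, and the `--run-repro` flag, are hypothetical and not taken from the original service. The PID is logged because a container runtime delivers the stop signal to PID 1, and a wrapper process such as a shell or package-manager script may not forward it; whether that applies here has not been established in this thread.

```typescript
import { EventEmitter } from 'node:events';

type Log = (message: string) => void;

// Registers logging handlers for SIGTERM and SIGINT on `target` and returns
// the array that records each signal as it arrives.
export function installSignalLogger(
  target: EventEmitter,
  log: Log = console.log,
  onSignal: (signal: string) => void = () => {},
): string[] {
  const received: string[] = [];
  for (const signal of ['SIGTERM', 'SIGINT']) {
    target.on(signal, () => {
      received.push(signal);
      log(`Received ${signal} in pid ${process.pid}`);
      onSignal(signal);
    });
  }
  return received;
}

export function main(): void {
  // A PID other than 1 suggests the app runs under a wrapper process.
  console.log(`Started, pid ${process.pid}`);
  installSignalLogger(process, console.log, () => process.exit(0));
  // Keep the event loop alive so the process only exits on a signal.
  setInterval(() => console.log('Still alive'), 10_000);
}

// Hypothetical opt-in flag so that importing this file has no side effects.
if (process.argv.includes('--run-repro')) {
  main();
}
```

If "Received SIGTERM" never shows up in the deploy logs for a service running only this file, the application code can reasonably be ruled out as the cause.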


kenmueller
PRO · OP

2 months ago

Ok, will do. I'll have it by tomorrow.

