Railway throws 502 on certain requests under very odd conditions

djulian
PRO · OP

a month ago

Hi 👋🏻
I have a service that has been running for months without any issues. However, recently a couple of users have been getting the Railway 502 on one very specific endpoint. I have zero logs on my side, the service does not crash, and all other requests are served. There are no unusual RAM or CPU usage spikes. Here are more details:

  • NestJS + Fastify 10.0.0

  • Every request is logged and traced. When the 502 happens, no logs appear. All other requests are served successfully.

  • I saw in the front-end logs that sometimes the user retries the request and it goes through.

  • The route is POST /scan/upload, which takes exactly 8 files in a multipart body. Here is the multipart config:

    app.register(fastifyMultipart, {
      limits: {
        fileSize: 100 * 1024 * 1024, // 100 MB
        files: 10,
        fieldNameSize: 500,
      },
    });

Most payloads are 2-8MB.

The scans are uploaded from Android and iOS apps. The problem occurs on both OSes.

I have no idea how to debug this and have tried a lot of things unsuccessfully, so I'll take any idea at this point!

Solved · $40 Bounty

7 Replies

Railway
BOT

a month ago

Hey there! We've found the following might help you get unblocked faster:

If you find the answer from one of these, please let us know by solving the thread!


a month ago

Are you using a proxy, or is the 502 coming from Railway? A 502 makes me think you have a proxy somewhere, and the lack of logs makes me believe the request never reaches your application in the first place. Maybe I'm barking up the wrong tree, but I wonder whether the proxy you're using (assuming I'm right about there being one) has some configuration that prevents the server and client from communicating, maybe a buffer limit? Sorry for the highly speculative response, but without any reproducibility and no access to the code, it's hard to offer ideas that aren't speculative in nature.


djulian
PRO · OP

a month ago

I'm not using any proxy or load balancer; the request is sent straight to the server. The 502 is coming from Railway, and those specific requests never reach the server. They also time out after 5 minutes or fail with "client disconnected" after a couple of minutes. When they time out, the Railway-side error is {"error":"i/o timeout"}.


colinrm000
HOBBY

a month ago

Hello!

I don't think this is an issue with your app; I believe it could be Railway's ingress proxy timing out before your container receives the request.

The "i/o timeout" and lack of logs show that the upload never reached your service.

Possible fixes to try:

  1. Use signed uploads (S3, Cloudflare R2, etc.) instead of posting files directly to your API.

  2. If that's not possible yet, ensure your Fastify handler starts reading the stream immediately with an onFile handler (see the sketch at the end of this reply).

  3. Reduce upload size or network latency.

The reason retries sometimes succeed is that faster uploads complete before Railway’s proxy drops the connection.
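
For point 2, here's a minimal sketch using @fastify/multipart's attachFieldsToBody and onFile options, so each file stream is consumed as soon as it arrives (plain Fastify shown for brevity; the destination path and route body are illustrative, not your actual handler):

    import Fastify from 'fastify';
    import fastifyMultipart, { MultipartFile } from '@fastify/multipart';
    import { createWriteStream } from 'node:fs';
    import { pipeline } from 'node:stream/promises';
    import { join } from 'node:path';
    import { tmpdir } from 'node:os';

    const app = Fastify();

    // Called once per incoming file; draining part.file right away means the
    // client upload is never left waiting on an unread stream.
    async function onFile(part: MultipartFile) {
      const dest = join(tmpdir(), `${Date.now()}-${part.filename}`);
      await pipeline(part.file, createWriteStream(dest));
    }

    app.register(fastifyMultipart, {
      attachFieldsToBody: true, // onFile is used together with attachFieldsToBody
      onFile,
      limits: { fileSize: 100 * 1024 * 1024, files: 10 },
    });

    app.post('/scan/upload', async () => {
      // By the time this handler runs, every file has already been streamed to disk.
      return { ok: true };
    });

    app.listen({ port: 3000 });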


fra
HOBBY · Top 10% Contributor

a month ago

just throwing ideas in the bucket:

  • Is it possible that you're exceeding the max header size, or that a user is trying to upload files that are too big?

  • Are you able to add more logs on the consumer side?

  • Are you using Cloudflare or something similar?


Anonymous
PRO

15 days ago

Given that these are mobile apps uploading files, Option 1 (direct-to-storage) is the industry-standard solution.

Every major app (Instagram, WhatsApp, etc.) uses this pattern because:

- Mobile networks are unreliable

- It's more scalable

- It's cheaper

- It provides better UX (can show real upload progress)
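
A minimal sketch of what the API side of that flow could look like, assuming S3-compatible storage (S3, Cloudflare R2, ...) and AWS SDK v3; the bucket and key names are illustrative:

    import { randomUUID } from 'node:crypto';
    import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
    import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

    const s3 = new S3Client({}); // region/credentials come from the environment

    // The API only hands out a short-lived PUT URL; the app uploads the bytes
    // straight to storage and reports the resulting key back for processing.
    export async function createUploadUrl(contentType: string) {
      const key = `scans/${randomUUID()}`;
      const url = await getSignedUrl(
        s3,
        new PutObjectCommand({ Bucket: 'my-scan-bucket', Key: key, ContentType: contentType }),
        { expiresIn: 900 }, // 15 minutes
      );
      return { key, url };
    }

The large upload traffic then never passes through Railway's ingress at all; your API only sees small JSON requests.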


djulian
PRO · OP

14 days ago

The solution was to manually split the single 8-file multipart request into smaller batches. With this update on both iOS and Android, we now have a 100% success rate.
At the moment, we're not considering direct-to-storage because we have to do a lot of validation and post-upload processing on the media files. Implementing triggers and cascading jobs is unfortunately not on the roadmap and a bit over-engineered for this project.
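
Roughly, the client-side idea looks like this (illustrative TypeScript only; our actual clients are native Android and iOS, and the batch size, field name, and endpoint URL below are placeholders):

    // Split the files into small multipart batches instead of sending all 8 at once.
    async function uploadInBatches(files: File[], batchSize = 2): Promise<void> {
      for (let i = 0; i < files.length; i += batchSize) {
        const form = new FormData();
        for (const file of files.slice(i, i + batchSize)) {
          form.append('files', file, file.name); // placeholder field name
        }

        const res = await fetch('https://api.example.com/scan/upload', {
          method: 'POST',
          body: form,
        });
        if (!res.ok) {
          throw new Error(`Batch starting at file ${i} failed with status ${res.status}`);
        }
      }
    }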

Thanks everyone!


Status changed to Solved · djulian · 14 days ago

