Container DNS Resolver Failing ([Errno -3] Lookup timed out)
truth0530
HOBBYOP

7 months ago

Solved · $20 Bounty

63 Replies

7 months ago

the domain works fine for me


7 months ago

1398220665508593700


7 months ago

ef4e9876-c1a3-495b-8a59-773a83136cf0


truth0530
HOBBYOP

7 months ago

[2025-07-23 01:01:24 +0000] [1] [INFO] Handling signal: term

[2025-07-23 01:01:25 +0000] [4] [INFO] Worker exiting (pid: 4)

[2025-07-23 01:01:25 +0000] [1] [INFO] Shutting down: Master

1398224352272453600


7 months ago

oh I see, I misunderstood, it's only affecting outbound traffic


7 months ago

this is a very weird case, I've never seen this before


7 months ago

I'm especially confused because the DNS lookup is timing out instead of outright failing


7 months ago

I'm out of my depth here so I'll escalate to the team


7 months ago

!t


7 months ago

This thread has been escalated to the Railway team.

Status changed to Awaiting Railway Response dev 8 months ago


truth0530
HOBBYOP

7 months ago

Thank you so much for looking into this and for escalating the issue. I sincerely appreciate it.

To give you some context on the impact, I've spent the last three sleepless nights trying to debug this – modifying code and even reverting to old GitHub commits – before we finally isolated it as a platform issue.

This is a critical blocker for my service, so I would be extremely grateful for any updates as your team investigates. Please let me know if there's anything more I can provide.


7 months ago

@carg - What domain is failing to lookup?


7 months ago

DNS resolution is working fine for these domains; these tests were done from within the container.

1398349480843018500




Status changed to Awaiting User Response itsrems 8 months ago


truth0530
HOBBYOP

7 months ago

Thank you for the quick response and for investigating this, Brody.

I see that DNS resolution is working correctly in the container you tested. However, I am very confused because the diagnostic results run directly from my live service container are consistently showing a different outcome.

As requested by the previous helper, I added a debug endpoint to my app. When I access https://nemc.up.railway.app/debug-dns, I get this result every time:


truth0530
HOBBYOP

7 months ago

{
"googlemaindns": "FAIL: [Errno -3] Lookup timed out",
"googleoauthdns": "FAIL: [Errno -3] Lookup timed out",
"resolv_conf": "search internal railway.internal | nameserver fd12::10 | "
}
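For readers following along, a debug helper that produces output of this shape might look like the sketch below. This is not the actual endpoint code, which was never shared in the thread; the function names `check_dns`, `read_resolv_conf`, and `debug_dns` are hypothetical, and only standard-library calls are used.

```python
import socket


def check_dns(hostname: str) -> str:
    """Resolve hostname and return a short OK/FAIL string."""
    try:
        return f"OK: {socket.gethostbyname(hostname)}"
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"FAIL: {exc}"


def read_resolv_conf(path: str = "/etc/resolv.conf") -> str:
    """Return the resolver config on one line, entries joined by ' | '."""
    try:
        with open(path, encoding="utf-8") as fh:
            lines = [line.strip() for line in fh if line.strip()]
    except OSError as exc:
        return f"unreadable: {exc}"
    return " | ".join(lines)


def debug_dns() -> dict:
    """Collect the same kind of fields shown in the post above."""
    return {
        "googleoauthdns": check_dns("oauth2.googleapis.com"),
        "resolv_conf": read_resolv_conf(),
    }
```

Returning the error as a string rather than raising keeps the endpoint usable even when every lookup fails, which is the situation being diagnosed.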


truth0530
HOBBYOP

7 months ago

This brings up two key questions about the test you performed:

Test Environment: Did you run the dig commands from a new test container, or were you able to shell directly into my service's currently running container instance? My issue might be specific to the node my service is currently deployed on.

Queried Domain: I noticed that the test result for oauth2.googleapis.com, the exact domain that is failing in my application's authentication flow, is missing from your screenshot.

Request:
To get to the bottom of this, could you please try to shell into my currently active service container and run dig +short oauth2.googleapis.com?

I believe this is the most accurate way to reproduce the issue. Thank you again for your help.


7 months ago

Please provide a link to the service you want me to shell into


7 months ago

^ @carg



7 months ago

Here is dig being run inside your application's container; DNS resolution is all good on our side.

1398435758507036700


truth0530
HOBBYOP

7 months ago

Thank you for running the test directly inside my container, Brody. This is very confusing, because my application is still failing with the exact same DNS timeout error.

Your dig command shows that DNS resolution works correctly from the container's shell. However, when the DNS lookup is performed by my Python application running under Gunicorn/eventlet, it consistently fails.

Here is the output from my /debug-dns endpoint, which I accessed just now. It shows a definitive failure:

{
"googlemaindns": "FAIL: [Errno -3] Lookup timed out",
"googleoauthdns": "FAIL: [Errno -3] Lookup timed out",
"resolv_conf": "search internal railway.internal | nameserver fd12::10 | "
}
This suggests the problem might not be with the container itself, but with the specific runtime environment of the Python/Gunicorn process. Is it possible for the application worker to have different network permissions or configurations than the interactive shell you used?

This is a very critical and persistent issue for my service. Since dig from the shell works, but socket.gethostbyname from within the Python app fails, could you please advise on what could cause such a discrepancy within the Railway environment?
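One concrete detail in the debug output is that the only nameserver listed, fd12::10, is an IPv6 address. Whether that is relevant to the failure is not established anywhere in this thread, but it is easy to confirm what the resolver config contains. The sketch below parses the ' | '-joined `resolv_conf` string from the debug output; the function name `nameservers_from_debug` is hypothetical.

```python
import ipaddress


def nameservers_from_debug(resolv_conf: str) -> list[tuple[str, int]]:
    """Extract (address, ip_version) pairs from the ' | '-joined debug string."""
    found = []
    for entry in resolv_conf.split("|"):
        parts = entry.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            addr = ipaddress.ip_address(parts[1])  # raises ValueError if malformed
            found.append((parts[1], addr.version))
    return found


sample = "search internal railway.internal | nameserver fd12::10 | "
print(nameservers_from_debug(sample))  # [('fd12::10', 6)]
```

Knowing that the resolver is reachable only over IPv6 would at least narrow down which code paths a DNS client has to support inside the container.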


7 months ago

When I SSH'd in, I was in the exact same environment that your Python code was running from. Notice the /app path; that is where your code lives in the container.

But I'm sorry, I cannot comment on why your code is not working. All I can say is that everything is good at the platform level, meaning I am unable to assist further here.

You can SSH into the container yourself for further testing -

So now I'll let the community help you debug this further.


truth0530
HOBBYOP

7 months ago

Ok, thank you for clarifying and providing the documentation for the SSH access.

I understand that from your perspective the platform is working correctly. I will follow your advice, SSH into the container myself to perform further tests from within the Python runtime, and will post my findings back to the community.

I appreciate you taking the time to look into this.


truth0530
HOBBYOP

7 months ago

Hi Brody, I've completed the final diagnostic test.

I deployed a minimal, barebones Flask application with only the required gunicorn and eventlet dependencies. The app's only function is to perform a single DNS lookup.

The test failed with the exact same DNS timeout error.

Here is the direct output from this minimal application:
FAIL: Could not resolve 'oauth2.googleapis.com' Error: [Errno -3] Lookup timed out

This definitively proves the issue is not with my application code or its dependencies, but with the underlying network environment of the container itself. A basic DNS lookup is failing at the infrastructure level.

Please escalate this issue to your network engineering team for investigation. My service remains critically blocked by this. Thank you.


7 months ago

Have you SSH'd in and run dig? (you might need to install dig)


truth0530
HOBBYOP

7 months ago

Hi Brody, thank you. Here is the link to the service dashboard you requested for inspection:

https://railway.com/project/0fcaf2cf-b205-4259-93b0-ec4d62597b30/service/dcfe4054-8cc9-4e56-a117-1b28643c19ec?environmentId=df39c076-7636-46e0-8783-7d4345eacc77

As a follow-up, I deployed a separate, minimal Flask application to isolate the issue, as we discussed. This app's only function is to perform a single DNS lookup.

This minimal test app failed with the exact same DNS timeout error.

Here is the direct output from the minimal application's URL:
FAIL: Could not resolve 'oauth2.googleapis.com' Error: [Errno -3] Lookup timed out

This definitively proves the issue is not with my main application's code or its dependencies, but with the underlying network environment of the container itself. A basic DNS lookup is failing at the infrastructure level.

Please escalate this to your network engineering team for investigation. My service remains critically blocked by this. Thank you.


7 months ago

Please let me know!


truth0530
HOBBYOP

7 months ago

Hi Brody, I have followed your instructions and run dig from within the container's shell.

You are correct, the dig command works successfully from the interactive shell. However, this only confirms the confusing discrepancy we are seeing, because the DNS lookup consistently fails when initiated from the running Python application.

Here is the evidence of the two conflicting results from the exact same container:

  1. Test from Shell (via ssh, as you requested):

dig +short oauth2.googleapis.com

142.251.10.95
(This works)

  2. Test from the running Python App (via my /debug-dns endpoint):

{
"googleoauthdns": "FAIL: [Errno -3] Lookup timed out"
}
The critical question now is: Why does a DNS lookup succeed from the bash shell, but consistently fail when initiated by the Python process running under Gunicorn/eventlet in the exact same container?

This points to an issue with the specific runtime environment Railway provides for Python applications, not a general platform issue. Could you please help investigate this specific runtime discrepancy?
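A closer comparison than dig versus the app is to run the same Python lookup twice: once inside the worker process and once in a fresh interpreter that has not imported the application. The sketch below is one way to do that with the standard library only; the function names are hypothetical, and it assumes the worker is allowed to spawn a child process.

```python
import socket
import subprocess
import sys

# Code run by the child interpreter; the hostname arrives as sys.argv[1].
_CHILD = "import socket, sys; print(socket.gethostbyname(sys.argv[1]))"


def lookup_in_process(hostname: str) -> str:
    """Resolve using whatever socket module this process currently has."""
    try:
        return f"OK: {socket.gethostbyname(hostname)}"
    except OSError as exc:
        return f"FAIL: {exc}"


def lookup_in_fresh_interpreter(hostname: str, timeout: float = 10.0) -> str:
    """Resolve in a new Python process that has not imported the app."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _CHILD, hostname],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "FAIL: child timed out"
    if proc.returncode != 0:
        lines = proc.stderr.strip().splitlines()
        detail = lines[-1] if lines else f"exit code {proc.returncode}"
        return f"FAIL: {detail}"
    return f"OK: {proc.stdout.strip()}"
```

If the fresh interpreter succeeds while the in-process call fails, the difference lies in what the worker process has loaded rather than in the container's network.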


7 months ago

I'm sorry, but I cannot help debug application-level issues, so please provide your minimal reproducible example to the community so that they can help you.


truth0530
HOBBYOP

7 months ago

Hi Brody, as you requested, I am providing the minimal reproducible example to the community.

I have made the simple GitHub repository public. You or anyone from the community can deploy it directly to reproduce the error:

https://github.com/truth0530/flask-gunicorn-app

This repository contains only three files: test_app.py (for the DNS test), requirements.txt (with Gunicorn/eventlet), and runtime.txt (to force a Python 3.11 environment).

When this repository is deployed, the service starts successfully. However, accessing the generated public URL will result in a DNS lookup timeout error.

This should allow your team to easily reproduce the [Errno -3] Lookup timed out bug on your end. Thank you for your time.
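For anyone who wants to reproduce this without cloning the repository, a DNS test app of the kind described above can be sketched as a plain WSGI callable. This is not the repository's test_app.py (which uses Flask); it is a standard-library stand-in, and `make_app` is a hypothetical name.

```python
import socket


def make_app(hostname: str = "oauth2.googleapis.com"):
    """Build a WSGI callable that resolves `hostname` on every request."""

    def app(environ, start_response):
        try:
            address = socket.gethostbyname(hostname)
            body = f"OK: '{hostname}' resolved to {address}"
            status = "200 OK"
        except OSError as exc:
            body = f"FAIL: Could not resolve '{hostname}' Error: {exc}"
            status = "500 Internal Server Error"
        payload = body.encode("utf-8")
        start_response(status, [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(payload))),
        ])
        return [payload]

    return app


# Module-level callable for a WSGI server to import.
app = make_app()
```

Assuming the file is saved as main.py, Gunicorn could serve it as `main:app`, with the worker class chosen separately so the same app can be tried under different workers.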


truth0530
HOBBYOP

7 months ago

Hi Brody, I've seen your latest reply. I understand and respect that from your dig test, the platform's networking appears functional.

However, I would be very grateful if you could also consider the evidence from my perspective.

My minimal test app—with just a few lines of standard Python and basic dependencies—is failing on a fundamental network task. This leads me to a critical question: how can I confidently deploy my complex, production application on an environment where even this simple, standard code fails?

Respectfully, the discrepancy between the dig command working and Python's standard socket library failing points to a subtle but critical issue within the Railway runtime environment itself, not just my "application code."

I am not asking you to debug my application logic. I am asking for your help in understanding why the platform cannot seem to provide a stable networking environment for a basic Python process. Could you please help investigate this specific discrepancy?


truth0530
HOBBYOP

7 months ago

Hi Brody, no rush on your end, I know you're likely busy investigating.

In the meantime, I'm continuing to debug this on my side. My current hypothesis, based on the discrepancy between the shell and the Python runtime, is that eventlet's monkey-patching feature might be conflicting with Railway's network environment. I am currently running tests inside the container to confirm this theory.

I was wondering if you had any thoughts on this. In your experience, do you think the eventlet worker is generally not a good fit for Railway?
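The monkey-patching hypothesis can be checked from inside the running worker by looking at where the socket module's resolver functions are defined. The sketch below assumes that a patching library replaces these attributes with functions from its own module, so that their `__module__` no longer points at the standard library; that assumption is not verified in this thread, and the function names are hypothetical.

```python
import socket

# In an unpatched CPython, these functions live in 'socket' or '_socket'.
_STDLIB_MODULES = {"socket", "_socket"}
_NAMES = ("getaddrinfo", "gethostbyname", "gethostbyname_ex")


def resolver_origins() -> dict:
    """Map each resolver function name to the module that defines it."""
    return {
        name: getattr(getattr(socket, name), "__module__", None)
        for name in _NAMES
    }


def looks_patched() -> bool:
    """True if any resolver function comes from outside the standard library."""
    return any(mod not in _STDLIB_MODULES for mod in resolver_origins().values())
```

Calling `resolver_origins()` from the debug endpoint and again from an SSH session's Python prompt would show whether the two environments are using the same resolver implementation.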

My plan, after I confirm the eventlet issue, is to try one of these two solutions:

Change the worker class to sync to see if a standard worker resolves the DNS issue.

Switch to a different async worker like gevent (by replacing eventlet with gevent in requirements.txt).

Do you think this is a reasonable approach, or would you recommend a better path? Any advice you have would be extremely helpful. Thank you.
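The first option in the plan above can be expressed as a Gunicorn config file, which is ordinary Python with settings as module-level names. The sketch below is a hypothetical gunicorn.conf.py, not a file from the poster's project; it assumes the platform provides a PORT environment variable, and the worker count and timeout are illustrative values only.

```python
# gunicorn.conf.py (hypothetical file; settings are module-level names)
import os

# PORT is assumed to be injected by the platform; 8000 is a local fallback.
bind = f"0.0.0.0:{os.environ.get('PORT', '8000')}"

# Option 1 from the plan: the default synchronous worker.
worker_class = "sync"
# Option 2 would instead be: worker_class = "gevent" (requires gevent installed).

# Illustrative values, not recommendations.
workers = 1
timeout = 30
```

Such a file can be passed to Gunicorn with `-c gunicorn.conf.py`. Switching only `worker_class` between deployments keeps everything else constant, so a change in DNS behaviour can be attributed to the worker.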


7 months ago

If this were a stability issue, I would definitely be able to help, but your code is failing to look up the DNS every time, while dig works perfectly every time.

This issue is solely within your application's code and not directly related to the platform; thus, neither I nor the Railway team will be able to assist.

You have provided your example code to the community, and I have put up a bounty for this thread, so I'm sure the community will come to your aid.


truth0530
HOBBYOP

7 months ago

Thank you for the clarification and for adding the bounty. I will continue to work with the community to resolve this.


itcc275
HOBBY

7 months ago

@carg @Brody was this issue resolved? It seems my Django application is having a similar issue. It was working fine until yesterday, but suddenly I am not able to get any response from my APIs.


7 months ago

Maybe you can work with Carg to come to a solution, since the issue is not with the Railway platform.


itcc275
HOBBY

7 months ago


itcc275
HOBBY

7 months ago

@carg are you also getting an error like this: This site can’t be reached
Check if there is a typo in tipitakachantingcouncil.up.railway.app.
DNS_PROBE_FINISHED_NXDOMAIN

1398520733638070300


7 months ago

That's a completely different issue; please open your own thread.


truth0530
HOBBYOP

7 months ago

Dear Brody,

I hope this message finds you well. I am writing to express my concern regarding the recent interaction we had concerning the issues with my Django application hosted on the Railway platform.

As I previously mentioned, my application was functioning well until recently, and I am now unable to receive any responses from my APIs. I encountered a DNS error that states, "This site can't be reached," along with the error code DNS_PROBE_FINISHED_NXDOMAIN.

In your response, you suggested that I collaborate with another user, Carg, to find a solution. While I appreciate the intent to foster a collaborative environment, I believe that, as a representative of the Railway platform, it is your responsibility to provide direct assistance to users facing technical issues.

Given the critical nature of this problem, I kindly request that you review my situation more closely and provide specific guidance or next steps to resolve this matter. Your expertise is essential for me to get my application back online.

Thank you for your attention to this matter. I look forward to your prompt response.

Best regards,


7 months ago

I'm sorry, but as I've said a few times already, I cannot provide application-level support; I can only provide support for the Railway platform and product. This is an issue with your application or configuration, and thus it falls outside what the Railway team or I can provide support for.


7 months ago

Additionally, I will not be conversing further with AI-generated responses. Thank you for understanding.


truth0530
HOBBYOP

7 months ago

Please do not assume that this was generated by AI. I translated it to manage my strong emotions, and your response comes across as quite rude and unhelpful.


truth0530
HOBBYOP

7 months ago

I haven't slept for several nights because of this issue. I am considering discontinuing my use of Railway and will make sure to inform others about this terrible service. I am extremely frustrated with such irresponsible responses.


truth0530
HOBBYOP

7 months ago

Are you not willing to apologize? Apologize for assuming I am AI so hastily and for your rudeness. Your behavior is truly unacceptable.


7 months ago

Hey carg, it's reasonable for you to be frustrated considering how long you've spent trying to fix this, but I feel it's worth noting that on the Hobby plan you're only entitled to community support; Brody has no obligation to help you here under your subscription.

The fact is that this isn't an issue with Railway's DNS resolver (as brody proved) because DNS is resolving fine within your container, just not within your application, which would naturally mean it's an application-level issue. It's not within Railway's responsibilities to help you fix your own code.


7 months ago

I'll see if I can reproduce this with your example application


truth0530
HOBBYOP

7 months ago

Thank you for your realistic advice. I haven't been getting much sleep lately, which has made me a bit sensitive. I've inserted numerous debug statements, tried various approaches, reverted to previous versions of the code, and explored every method I could think of, but nothing seems to work.

Since I don't live in an English-speaking country, I had no choice but to use AI to translate my language into English. However, I found it quite frustrating to be told that AI-generated responses would not be answered, which made it difficult for me to keep my emotions in check at that moment.

I appreciate your suggestions and will make sure to double-check everything accordingly. Thank you for your understanding and support.


7 months ago

No worries, I completely understand


7 months ago

I can't seem to reproduce btw, it works fine for me

1398537368931143700


7 months ago

I deployed your application 1:1, the only difference is I changed test_app.py to main.py so railpack could start the app for me


7 months ago

what region are you deploying to?


7 months ago

and have you deployed your test application yourself? does it error in your test application as well?


truth0530
HOBBYOP

7 months ago

I'm sorry, but I didn't fully understand your message. My actual project has undergone many changes since I posted my question, and the situation has changed a lot since then. However, the problem is still unresolved, and issues have also been discovered with Google authentication, so I am currently unable to deploy the app while trying to implement a different authentication method.

As for the second project I uploaded to GitHub for testing purposes, I deployed it with minimal functionality as described in my initial question. After hearing your suggestion, I changed the filename from test_app.py to main.py, then committed and pushed it back to GitHub.

Of course, changing the filename doesn't seem like a core solution, but it's the only option I have at the moment.

Regarding your question about the region, I initially set it to Singapore, then changed it to the USA, and now I have switched it back to Singapore.


7 months ago

"I'm sorry, but I didn't fully understand your message"

Railway asked you for a minimal reproducible example earlier, and you then provided flask-gunicorn-app. I deployed this application, but it didn't reproduce the issue you're having. It resolved the DNS lookup successfully.

My question is whether you have deployed the app yourself. Try changing test_app.py to main.py in your repo, then deploy it and see if it resolves successfully.

"Of course, changing the filename doesn't seem like a core solution, but it's the only option I have at the moment."

Changing the file name is superficial; it's just so that Railpack can start the app automatically for me. I could've kept it the same, but then I would've had to write a start command.

"I initially set it to Singapore"

I tested flask-gunicorn-app in Singapore too just now, and it's working there as well.


truth0530
HOBBYOP

7 months ago

I'm sorry for any confusion regarding the previous situation with the flask-gunicorn-app. After deploying it on Railway, I took your advice and changed the filename to main.py. I initially deployed it to California, but I have now switched it to Singapore and redeployed the application immediately. Thank you for your guidance!


truth0530
HOBBYOP

7 months ago

1398544588347539500


truth0530
HOBBYOP

7 months ago

However, I've encountered a new issue where, perhaps due to the numerous deployments, both the project that had previous problems and the test deployment are showing as "queued" without any indication of success or failure, as seen in the screenshot. This has created a bit of a roadblock for me.


7 months ago

make sure there's nothing in the pre-deploy command slot that could be halting the execution of the deploy command


7 months ago

if you do then make sure to remove your previous deployments otherwise they'll yield forever




7 months ago

All clear to solve?


7 months ago

!s


Status changed to Solved dev 7 months ago

