station.railway.com drops TLS ClientHello unless MSS clamping is applied
raxod502
HOBBY OP

a year ago

Hi, I have what I believe to be a specific networking error in your infrastructure to report. I will present findings below; the concrete implication of the issue is that some (spec-compliant) browsers are unable to complete the TLS handshake with 66.33.22.11 and will time out negotiating the connection.

I initially noticed that I was unable to load https://station.railway.com on my laptop (Pop!_OS 22.04, Firefox 135.0.1). I tested on my other laptop, running Ubuntu 24.04, and found the same result. I also provisioned an Ubuntu virtual machine and found the same result in both Firefox and Chromium. I tested on both my residential internet connection and a mobile hotspot, with identical findings.

However, connections to the server worked fine from my phone (Android 15, GrapheneOS) and a MacBook. Furthermore, curl worked fine on all devices; the problem affected browsers exclusively, and it did not come from configuration or plugins: I tested in an isolated Firefox profile, as well as from within the freshly provisioned VM.

I used Wireshark on my laptop and tcpdump on my router (OpenWrt) to look at what was actually happening at the protocol layer. The results were surprising. My client sends a SYN packet, gets SYN/ACK back, sends its own ACK, so the TCP three-way handshake is completed. It then sends a TLS ClientHello packet, which (due to modern browsers' use of the TLS key sharing extension) is much larger than the ClientHello used by curl, and in particular exceeds 1500 bytes. The TCP payload is automatically segmented by the Linux networking stack into two segments (1:1448 and 1449:1905); note that a TCP segment of 1448 octets corresponds to an encapsulated IP datagram of 1500 bytes.

This should be fine, since your server is negotiating an MSS of 1460, corresponding to MTU of 1500 - and I can verify with ping -s1472 that 66.33.22.11 accepts 1500-byte ICMP frames just fine. Nonetheless, what comes back from the server is a TCP selective ACK (SACK) for only the 1449:1905 segment. The OS then repeatedly retransmits the initial 1448-octet segment, to no avail. So, the server accepts 1500-byte ICMP frames, but not 1500-byte TCP frames, contrary to the negotiated MSS.

Wireshark packet capture

tcpdump excerpt: https://paste.sr.ht/~raxod502/0b0f11b027b511fa39d2eb166546e34bad261c9d
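The byte counts in the analysis above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes IPv4 without IP options and that the TCP timestamps option (12 bytes including padding) is in use on the connection, which the post does not state explicitly but which would explain why a 1460 MSS yields 1448-octet segments. The function names are made up for this example.

```python
IPV4_HEADER = 20     # bytes, assuming no IP options
TCP_HEADER = 20      # bytes, fixed part of the TCP header
TCP_TIMESTAMPS = 12  # assumption: timestamps option (10 bytes + 2 bytes padding)
ICMP_HEADER = 8      # bytes, ICMP echo request header


def tcp_datagram_size(payload: int, options: int = TCP_TIMESTAMPS) -> int:
    """Size of the IP datagram that carries `payload` bytes of TCP data."""
    return IPV4_HEADER + TCP_HEADER + options + payload


def ping_datagram_size(payload: int) -> int:
    """Size of the IP datagram produced by `ping -s <payload>`."""
    return IPV4_HEADER + ICMP_HEADER + payload


def split_into_segments(total: int, per_segment: int) -> list[tuple[int, int]]:
    """Split `total` bytes into inclusive, 1-based (first, last) byte ranges."""
    ranges = []
    first = 1
    while first <= total:
        last = min(first + per_segment - 1, total)
        ranges.append((first, last))
        first = last + 1
    return ranges


negotiated_mss = 1460
per_segment = negotiated_mss - TCP_TIMESTAMPS      # 1448 bytes of payload

print(tcp_datagram_size(per_segment))              # 1500
print(ping_datagram_size(1472))                    # 1500
print(split_into_segments(1905, per_segment))      # [(1, 1448), (1449, 1905)]
```

Under these assumptions, the first ClientHello segment and the ping -s1472 probe both produce 1500-byte IP datagrams, which is why the differing treatment of the two is notable.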

I have an in-house tool for making network requests with custom TLS ClientHello fingerprints. Using it, I verified that a command-line request with a ClientHello large enough to trigger TCP segmentation reliably reproduces the error, unlike a request with curl's default options, whose ClientHello fits in a single packet. I also verified that if I manually decrease the MTU of my outbound network interfaces to less than their default of 1500, then full network connectivity to https://station.railway.com magically starts working in a browser.

I also checked router traffic while a working client, such as the aforementioned MacBook, was making a request. It sends the same segments initially, and they are blackholed in exactly the same way. After several retransmits, however, it seems Apple's network stack decides to unilaterally decrease the MSS it uses, perhaps specifically to cope with issues like this one; it starts segmenting the payload into smaller packets, which are then delivered successfully.
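The difference between the two client behaviors described above can be sketched as a toy model. This is not Apple's actual algorithm, which the post only infers from packet captures; it merely illustrates the general idea of falling back to a smaller segment size after repeated retransmission failures. The retry count (4), the fallback size (1200) and the largest deliverable payload (1428) are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SendResult:
    delivered: bool
    final_mss: int   # segment payload size in use when sending stopped
    attempts: int    # transmissions of the first segment, including retries


def send_with_fallback(
    payload_len: int,
    mss: int,
    max_deliverable: int,
    retries: int = 4,             # hypothetical retransmit budget per size
    fallback_mss: int = 1200,     # hypothetical smaller size to fall back to
    enable_fallback: bool = True,
) -> SendResult:
    """Toy model: segments larger than `max_deliverable` are silently dropped."""
    candidates = [mss, fallback_mss] if enable_fallback else [mss]
    attempts = 0
    for candidate in candidates:
        largest_segment = min(candidate, payload_len)
        for _ in range(retries):
            attempts += 1
            if largest_segment <= max_deliverable:
                return SendResult(True, candidate, attempts)
    return SendResult(False, candidate, attempts)


# A stack that keeps retransmitting at the negotiated size never gets through.
print(send_with_fallback(1905, 1448, 1428, enable_fallback=False))
# A stack that shrinks its segments after repeated failures eventually does.
print(send_with_fallback(1905, 1448, 1428))
```

In this model the first call exhausts its retries without delivery, while the second succeeds on its fifth transmission once the segment size has been reduced.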

Solved

13 Replies

jake
EMPLOYEE

a year ago

Hi! Thank you for the report, and apologies that you're running into this.

Would you mind going to your service settings and disabling "Metal Edge"? This will change the IP broadcasting. I'd like to see if this fixes it.


Status changed to Awaiting User Response Railway 12 months ago


raxod502
HOBBY OP

a year ago

No - this is not in reference to my service. It is in reference to your service, https://station.railway.com.


Status changed to Awaiting Railway Response Railway 12 months ago


angelo-railway

Thanks, Radon.

It's always nice seeing a report from you. I am going to disable Metal Edge on our service to alleviate this issue while we wait for Phineas, the network engineer working on our proxy, to respond to your report.

I have switched it back. Can you run the tests again?


Status changed to Awaiting User Response Railway 12 months ago


raxod502
HOBBY OP

a year ago

Unfortunately, no dice. I see the same behavior when testing just now:

I'm assuming the metal edge change is something internal that I wouldn't be able to see as a change in the DNS entry; for me station.railway.com still resolves to 66.33.22.11.

Attachments


Status changed to Awaiting Railway Response Railway 12 months ago


Well, it's an option that we (and others) can set to put a service on that set of proxies. It may also be a bug in how we move workloads on and off the proxy. I have forwarded the details, and I appreciate your energy on the issue.

(Our door is always open for you if you ever want to test or push the bounds of our hardware.)


Status changed to Awaiting User Response Railway 12 months ago


Got a quick response from the team: did you clear your DNS cache?


Okay- it was us... we hardcoded the value. One more time after the cache is clear?


raxod502
HOBBY OP

a year ago

Well, I see the TTL is down to 60s, but I think it is still resolving to 66.33.22.11 globally: https://www.whatsmydns.net/#A/station.railway.com

So if the intent is to change the DNS resolution I think there is something else that is preventing that from going out.

Attachments


Status changed to Awaiting Railway Response Railway 12 months ago


We put it back but then changed the IPIP tunnel to 1440 MTU. Can you run the test again?


Status changed to Awaiting User Response Railway 12 months ago


raxod502
HOBBY OP

a year ago

Yeah, still the same results. The page fails to load when my local interface MTU is 1500, but loads when it is 1400:

Attachments


Status changed to Awaiting Railway Response Railway 12 months ago


Update from the team, Radon. Phin (currently on PTO) is tracking this issue and should be able to drive it to resolution once he is back. He mentioned the client should respect 1440, but you never know. The next message will likely come from him.

(Again, thanks for reporting.)


Status changed to Awaiting User Response Railway 11 months ago


phin
EMPLOYEE

a year ago

Hey Radon,

Thanks a bunch for this detailed report. MTU strikes again it seems! I've updated our advertised MSS to 1440 to account for our L4LB's IPIP encap. We're likely going to migrate toward jumbo frames internally so this shouldn't be an issue in the near future.

Please try to establish a TCP session again and let me know if it's working 🙂
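The 1440 figure is consistent with simple header arithmetic for IP-in-IP encapsulation. The sketch below assumes a 1500-byte underlay MTU and a 20-byte outer IPv4 header with no options; neither value is stated in the thread, and the function names are hypothetical.

```python
IPV4_HEADER = 20    # bytes, inner IPv4 header without options
TCP_HEADER = 20     # bytes, fixed part of the TCP header
IPIP_OVERHEAD = 20  # assumption: outer IPv4 header added by IP-in-IP


def advertised_mss(underlay_mtu: int = 1500,
                   encap_overhead: int = IPIP_OVERHEAD) -> int:
    """Largest MSS whose full-size segments still fit after encapsulation."""
    inner_mtu = underlay_mtu - encap_overhead
    return inner_mtu - IPV4_HEADER - TCP_HEADER


def encapsulated_size(mss: int, encap_overhead: int = IPIP_OVERHEAD) -> int:
    """On-the-wire size of a full-size segment once the outer header is added."""
    return mss + TCP_HEADER + IPV4_HEADER + encap_overhead


print(advertised_mss())          # 1440
print(encapsulated_size(1460))   # 1520, too large for a 1500-byte underlay
print(encapsulated_size(1440))   # 1500, fits
```

Because the client sizes its segments from the MSS announced in the server's SYN/ACK, lowering that value is what keeps full-size segments within the tunnel's limit under these assumptions.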


raxod502
HOBBY OP

a year ago

Looking great now! Thanks for fixing, and happy to help!


Status changed to Awaiting Railway Response Railway 11 months ago


Status changed to Solved itsrems 11 months ago

