17 days ago
Hi!
We've hit this same problem twice now (May 4 and again on May 22): intermittent connectivity issues took the internal mesh down, our backend couldn't communicate with other services via private networking, and external traffic returned SSL handshake failures and 529s. Both incidents affected Southeast Asia (Singapore).
Two questions:
- Is this a known recurring issue with the networking control plane / edge proxy? What's being done structurally to prevent it from happening again?
- Is there anything we can do architecturally to survive mesh outages? We considered aggressive health checks to trigger a redeploy, since a redeploy seemed to fix it on May 22, but that doesn't fix the root cause, which is on the network layer, and might not work every time. Are there patterns you'd recommend (e.g. public networking fallback, multi-region) for services that need higher availability?
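To make the "public networking fallback" idea concrete, here is a minimal sketch of what we have in mind, not a recommended or platform-endorsed implementation. It assumes the service is reachable at both a private and a public URL; both hostnames below are hypothetical placeholders, and `fetch_with_fallback` is a helper name made up for illustration.

```python
import urllib.request
from typing import Callable, Sequence

# Hypothetical endpoints: neither hostname comes from this thread.
PRIVATE_URL = "http://backend.internal:8080/health"
PUBLIC_URL = "https://backend.example.com/health"


def http_get(url: str, timeout: float = 2.0) -> bytes:
    """Fetch a URL with a short timeout so a dead mesh fails fast."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_fallback(
    urls: Sequence[str],
    fetch: Callable[[str], bytes] = http_get,
) -> tuple[str, bytes]:
    """Try each URL in order; return (url_used, body) from the first that answers."""
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:
            # URLError, timeouts and ConnectionResetError (ECONNRESET)
            # are all OSError subclasses, so they all trigger the fallback.
            errors.append(f"{url}: {exc}")
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))
```

Calling `fetch_with_fallback([PRIVATE_URL, PUBLIC_URL])` would prefer the private route and only go over the public one when the private call errors out. This only covers service-to-service calls; it would not help with the external SSL handshake failures.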
9 Replies
17 days ago
(+1 on this today, had an ECONNRESET at 12pm UTC)
is the related service that was affected.
15 days ago
did you observe this again?
hey @angelo we didn't observe this again, but we want to better understand why this happens so we can put more proactive measures in place to recover from it
hey @angelo @Brody this just happened to us again. We redeployed twice and it only worked on the second redeploy.
Could we please get some guidance and visibility into why this happens and what we can do to fix it?
6 days ago
Please do not ping the team - #🛂|readme #5
Really sorry about that, but we just faced the same problem again and we still haven't gotten any help on this matter.