5 days ago
Hi!
We've now hit this same problem twice (May 4 and again on May 22): intermittent connectivity issues took the internal mesh down, our backend couldn't communicate with other services via private networking, and external traffic returned SSL handshake failures and 529s. Both incidents affected Southeast Asia (Singapore).
Two questions:
- Is this a known recurring issue with the networking control plane / edge proxy? What's being done structurally to prevent it from happening again?
- Is there anything we can do architecturally to survive mesh outages? We considered aggressive health checks that trigger a redeploy, since a redeploy seemed to fix it on May 22, but that doesn't address the root cause, which is at the network layer, and might not work next time. Are there patterns you'd recommend (e.g. public networking fallback, multi-region) for services that need higher availability?
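To make the public networking fallback idea concrete, here is a rough sketch of what we have in mind. It is not what we run today. The hostnames and the `fetch_with_fallback` helper are hypothetical placeholders, and it assumes the service is also reachable on a public domain. It tries the private address first with a short timeout and falls back to the public one on any connection-level error.

```python
import urllib.request
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical endpoints: substitute the service's real private and public URLs.
PRIVATE_URL = "http://backend.internal:8080/health"
PUBLIC_URL = "https://backend.example.com/health"


def http_get(url: str, timeout: float = 2.0) -> bytes:
    """Fetch a URL with a short timeout so a dead mesh fails fast."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_fallback(
    urls: Sequence[str],
    fetch: Callable[[str], bytes] = http_get,
) -> Tuple[str, bytes]:
    """Try each URL in order and return (url_used, body) from the first success."""
    last_error: Optional[Exception] = None
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:
            # Connection resets, timeouts, and urllib's URLError are all
            # OSError subclasses, so each of them moves on to the next URL.
            last_error = exc
    raise ConnectionError(f"all endpoints failed: {last_error}") from last_error
```

Calling `fetch_with_fallback([PRIVATE_URL, PUBLIC_URL])` would prefer private networking and only go out through the public domain when the private path errors out. Note that `urllib` also raises an `OSError` subclass for HTTP error statuses, so a 5xx on the private path triggers the fallback as well.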
6 Replies
5 days ago
(+1 on this today, had an ECONNRESET at 12pm UTC)
is the related service that was affected.
2 days ago
did you observe this again?
hey @angelo, we didn't observe this again, but we wanted to better understand why it happens so we can put more proactive recovery measures in place