Despite all efforts, a single crashed agent took down the overlay network again, causing many stuck deployments and, as a result, containers that are not running. We're looking for a solution.
Update 1, 14:13 CEST: As a first measure, we cancelled all trial accounts and suspended their apps. We're now focusing on restoring major containers as fast as possible.
Update 2, 18:24 CEST: We still haven't found a way to permanently fix our overlay network. Restarting the whole cluster no longer has any meaningful effect. Finding a real solution is slow going, while many containers are still down and can't be started anymore.
Update 3, August 16th, 09:04 CEST: We still don't have a working overlay network, and the cluster health has degraded even further. Efforts to reproduce the issue in a separate environment have not been successful.
Update 4, 09:21 CEST: We're once more shutting everything down to try and reset the overlay network to a fresh state.
Update 5, 10:04 CEST: We're testing a new cluster configuration for the overlay network (based on Weave Net). We're hopeful that this will finally allow us to bring the cluster back up with a stable network.
Update 6, 11:32 CEST: The new Weave cluster is up and we're restoring internal services.
Update 7, 11:52 CEST: We're starting to restore customer containers.
Update 8, 14:03 CEST: A good chunk of the cluster is up and looks stable. We're continuing with more deployments.
Update 9, 16:45 CEST: The cluster is up and running and shows no signs of issues. Given the last few days, we remain skeptical and are monitoring closely.
Our services for managing logs, stats, and the console aren't up yet. We're working on those while also doing our best to recover all containers that were running before the outage started. If you see any problems with your containers at this point, please contact us via chat.