14th August 2018

Container Platform: Partial Outage

We see failures on parts of our infrastructure and are investigating.

Update 1, 14:07 CEST: Our overlay network is collapsing and we're isolating agents to avoid failure of the whole cluster.

Update 2, 15:43 CEST: We're doing our best to keep applications running and restart any stopped ones, while provisioning new servers to take over the load from the servers removed from the overlay network.

Update 3, 16:55 CEST: To avoid a full cluster restart, we are working on a fix for the overlay network. By doing so, we also hope to prevent future problems of a similar nature.

Update 4, 19:37 CEST: We continue to look for a solution to our overlay network issue, both to fix the current outage and to prevent the overlay network from causing further issues in the future.

Update 5, 21:26 CEST: Still working on it.

Update 6, 22:31 CEST: We see no way to restore the overlay network other than a full cluster restart. We're starting that now, along with an upgrade of the overlay network component. Besides ending this outage, we hope this resolves the overlay network issues for good.

Update 7, 23:59 CEST: After a full shutdown, an upgrade of the overlay network component, and a check that all components are running properly, we're restarting all containers.

Update 8, 02:45 CEST: A majority of containers are running again and we're gradually starting the remaining ones, while closely monitoring the platform, especially the upgraded overlay network.

Update 9, 03:02 CEST: Everything is up and running, including the newly provisioned agents and the upgraded overlay network. We're continuing to monitor.

Update 10, 03:34 CEST: The platform is restored and stable. We're going home.

Please note that console, logs and stats might not yet be available for your apps.