6th August 2018

Container Platform Partial outage

We see failures on parts of our infrastructure and are investigating.

Update 1, 13:44 CEST: While restoring affected agents, we're moving containers to healthy agents, reducing downtime as much as possible.

Update 2, 14:38 CEST: We're restarting agents to restore them properly.

Update 3, 18:06 CEST: Network layer problems are still not fully resolved; rebooting agents didn't help enough. We're testing other approaches, which unfortunately take even more time.

Update 4, 19:05 CEST: Many containers are still not running. We continue working on a solution.

Update 5, 19:58 CEST: Previous attempts at gradually restoring the network layer have failed, so we're now taking down more agents to first bring back the network layer properly, then restore all remaining containers.

Update 6, 20:55 CEST: We're still working on a solution.

Update 7, 21:49 CEST: We're stopping all containers to do a controlled restart of the whole platform.

Update 8, 22:18 CEST: We've suspended api.sloppy.io (the backend for admin.sloppy.io and the CLI) while restoring containers.

Update 9, 23:35 CEST: We're still working on a solution.

Update 10, 00:10 CEST: We've isolated the broken agents. While fixing those, we're slowly restarting customer containers on stable agents.

Update 11, 00:36 CEST: About one fifth of containers are running again.

Update 12, 01:22 CEST: Half of containers are back up.

Update 13, 02:56 CEST: The network layer is broken again; both deployments and already-running apps are failing. Starting over.

Update 14, 03:46 CEST: New attempt at booting everything up properly.

Update 15, 10:18 CEST: All containers are back up.