We see failures on parts of our infrastructure and are investigating.
Update 1: Problem identified. Working on a fix.
Update 2: Most of the failed containers are back. We are still in the process of restoring the rest.
Update 3 29.07.2019: We are still working to get everything back to normal.
Update 4 29.07.2019 - 11:23: We disabled the API and login to prepare a restart for users on shared clusters. Dedicated node users are not affected.
Update 5 29.07.2019 - 16:09: Reset done. Starting projects now.
Update 6 29.07.2019 - 20:27: Restoring projects is done and login is enabled again. Some containers failed to start; in most cases this was because they used unpinned image tags such as mysql:latest and postgres:latest, or ran npm builds inside the container that referenced outdated or no longer existing npm modules. A full post-mortem will follow.
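To illustrate the tag-pinning issue mentioned above: a container built from an unpinned image tag can come back with a different, incompatible version after a forced restart. The image versions below are hypothetical examples, not a recommendation for a specific release:

```
# Unpinned: "latest" resolves to whatever the newest image is at
# restart time, which may differ from what the project was built with.
FROM mysql:latest

# Pinned: the container restarts with the exact version it was
# originally deployed with (5.7.27 is just an example tag).
FROM mysql:5.7.27
```

The same reasoning applies to the npm case: committing a lockfile (package-lock.json) and installing from it keeps a rebuild from pulling in modules that have since changed or disappeared.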
On Sunday, July 27, 2019, at 18:45 UTC we experienced an outage for some of our customers, significantly impacting your businesses and your users. We are very sorry for causing this downtime and for the impact it had on your businesses.
Here are some additional details about what happened and steps we are taking to mitigate future outages of a similar nature.
Who was affected?
First, a subset of around 400 containers was affected. The redistribution of these containers caused trouble for other parts of the platform, which required us to restart more of the containers running on our shared nodes. Our dedicated node customers were not affected, except for one whose load balancer was running on the failed cluster.
What happened?
We use several clusters for our platform: fully self-managed hardware clusters and managed Xen clusters. One of the managed Xen clusters performed a safety shutdown of all VMs running on it because a system volume was full. As this cluster is managed by our data center provider, we are not able to monitor or administer this system volume ourselves. We contacted the provider's support hotline, they increased the size of the volume, and we were able to restart all the VMs.
Based on our past experience with single failing nodes, we had changed the cluster configuration so that one or two failing nodes cause no problems. This time, however, and for the first time, a complete cluster failed and a large number of servers stopped working at the same time. This caused trouble for the rest of the platform, as it tried to take over all the tasks from the failed nodes too fast. At a certain point we had to fix many nodes by restarting them, and we suspended a large number of containers so that we could restart them in controlled batches. For this, we also disabled the frontend.
What will we do to mitigate problems like this in the future?
The current configuration worked fine for a long time: due to maintenance work or failures we have several node reboots a month, and they are not even noticed because of the fast takeover. To prevent an outage like this one when a complete cluster fails, we need to change this configuration so that failed containers are restarted more slowly. It is now on us to find a good compromise between restarting not too slowly when one or two nodes fail and not too fast when a complete cluster fails.
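As a rough sketch of the kind of knobs involved: in a Docker Swarm-style orchestrator, restart and rescheduling behavior can be throttled per service. The orchestrator, flag values, and service name below are assumptions for illustration, not our actual production configuration:

```
# Hypothetical throttling settings (values are illustrative only):
#   --restart-delay        wait between restart attempts of a failed task
#   --update-parallelism   move at most this many tasks at a time
#   --update-delay         pause between batches of moved tasks
docker service update \
  --restart-delay 30s \
  --update-parallelism 2 \
  --update-delay 10s \
  example-service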
Furthermore, we are in contact with our data center provider to prevent cluster outages caused by full volumes from happening at all.