Nodes/Inter-Region connectivity issue
Incident Report for Scrapfly
Resolved
Starting at 23:30, we observed network issues resulting in a service disruption.

The root cause was identified as the cilium-agent (https://cilium.io, which manages the networking layer inside our Kubernetes cluster) becoming unresponsive and entering a crash loop backoff without recovering. Although restarting the agent should have resolved the issue, it did not recover, for reasons that were unknown at the time. The agent repeatedly failed with the following fatal error:

```
{"level":"fatal","msg":"Failed to create k8s client: exec plugin: invalid apiVersion \"client.authentication.k8s.io/v1alpha1\"","subsys":"daemon"}
```
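For context, a crash-looping agent like this is typically visible with checks along the following lines. This is an illustrative sketch, not our exact runbook: the label, namespace, and pod name are assumptions (Cilium conventionally runs in `kube-system` with the `k8s-app=cilium` label).

```
# List cilium-agent pods and their restart counts; a CrashLoopBackOff status
# shows up here. Assumes the standard k8s-app=cilium label in kube-system.
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

# Retrieve the logs of the previously crashed container to see the fatal error.
# <cilium-pod-name> is a placeholder.
kubectl -n kube-system logs <cilium-pod-name> --previous --tail=50
```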

This disconnected all applications running on the affected nodes, taking them offline one after another. Not all nodes were impacted, but the web pool responsible for exposing our web applications, such as the website and the API, was disrupted.

Regrettably, our engineering team was unavailable at 23:00 UTC, and alerts started coming in overnight, around 01:00 UTC. The team responsible for resolving this kind of issue is based in Europe (UTC+1) and was unfortunately unreachable at that hour.

By 08:30 AM, we became aware of the situation and started investigating. By 09:00 AM, we had determined that Google Kubernetes Engine (GKE) had updated some internal components, which affected certain nodes and caused them to lose network connectivity. We then began remediation.
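As an illustration, the node state and recent GKE maintenance activity can be inspected along these lines. The location and filter value below are assumptions for the sketch, not our actual configuration:

```
# Spot nodes that have dropped out (NotReady) after the update.
kubectl get nodes -o wide

# List recent GKE operations to look for automated node upgrades.
# <REGION> is a placeholder; the filter assumes the UPGRADE_NODES
# operation type reported by GKE.
gcloud container operations list \
  --region=<REGION> \
  --filter="operationType=UPGRADE_NODES"
```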

Because the nodes were locked during the update process, we were unable to perform any operations on them. We had to manually stop the underlying VMs, remove them from the Kubernetes node pool, and scale the pool down to 0 so that its state was back in sync with GKE. We then gradually scaled the nodes back up, verified that the updates were applying correctly, and increased capacity to handle traffic before restoring API access.
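In practice, the recovery looked roughly like the sketch below. Cluster, node pool, instance, node, and location names are placeholders, and the exact sequence on our side differed slightly:

```
# Stop the underlying VMs that were stuck in the locked update state.
gcloud compute instances stop <INSTANCE> --zone=<ZONE>

# Remove the dead nodes from the cluster so Kubernetes stops scheduling onto them.
kubectl delete node <NODE_NAME>

# Scale the affected node pool down to 0 so its state is back in sync with GKE...
gcloud container clusters resize <CLUSTER> --node-pool=<POOL> \
  --num-nodes=0 --region=<REGION> --quiet

# ...then scale it back up gradually, watching that new nodes come up healthy
# before routing traffic to them.
gcloud container clusters resize <CLUSTER> --node-pool=<POOL> \
  --num-nodes=3 --region=<REGION>
kubectl get nodes -w
```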

We are currently gathering more information and will work with the Google team to confirm the root cause and put preventive measures in place.

We sincerely apologize for the disruption caused and assure you that we will learn from this incident to enhance our system's reliability.
Posted Jan 31, 2024 - 23:30 CET