Kubernetes

Why am I doing this?

Currently, my entire homelab/workflow is hosted at my mom’s house. When a power issue occurs, or just an “oops, I pulled the plug” moment, my entire workflow stops and all of my self-hosted apps become unavailable.

Current infrastructure

I have an R620 server at my mom’s house and a VPS.

What does the R620 host?

The R620 runs Proxmox, which hosts a few VMs, my networking experiments, all my apps, and so on.

What about the VPS?

On the VPS, I host my mail server, a Matrix Synapse server, a WireGuard server, and a “super” gateway that uses BGP between my R620, my VPS, an OCI instance, and all my devices.

What am I planning?

Remove the Oracle Cloud instance for good (in the end, I didn’t).
Switch to a smaller VPS that only runs nginx, WireGuard, and the BGP setup.
Have one node at my own house and one at my mom’s place.

Every Kubernetes node will act as control plane, worker, and etcd member, to keep services running if a problem occurs.
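Since every node is also an etcd member, the number of nodes that can fail depends on etcd’s majority rule. Here is a minimal sketch of that arithmetic, assuming the usual Raft-style majority quorum; the function names are made up for illustration and are not part of any tool:

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can be lost while a majority still remains."""
    return members - quorum(members)


# Print the quorum size and tolerated failures for small cluster sizes.
for n in range(1, 6):
    print(f"{n} member(s): quorum={quorum(n)}, tolerated failures={fault_tolerance(n)}")
```

With only two members, losing either one leaves the cluster without a majority, which is why an odd number of at least three members is usually recommended for HA.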

What kind of machines will I use?

I plan to use mini PCs for this cluster. At first, however, I will set up one node on my R620 server and one node on a mini PC.

This is where my experiment begins

Now the cluster is in production

“The true kube cluster” is the production cluster. It replaces a large part of my homelab and gives me a form of HA.

If a node goes down, Kubernetes will automatically (depending on the kind of resource) reschedule apps onto the other nodes.

You might think, “What about the storage?” Since I use Longhorn, volumes are replicated across the cluster.

Issues with the cluster

After a few days, I discovered some issues with the cluster.

K3s uses SQLite by default, and the cluster was really slow: any operation could take up to 30 seconds to respond. I decided to switch to etcd as the cluster database, thinking the database would be in memory. All it did was slow the cluster down, but it was still usable in most cases.

Then I had an issue with secret management in my cluster, so I chose to install Vault in-cluster to manage those secrets.

Finally, I had an issue with Longhorn: on some volumes, but not all the time, Longhorn failed to mount them on the node.

I have described what I did here.

Adding more critical workflows to the cluster

As my cluster grew, I decided to add my password manager to the Kubernetes cluster, but I quickly ran into a problem: with my repository public, I would be sharing all the URLs with anyone.
Until then, I had not thought about the URL problem, since most of the URLs are internal and not accessible without a VPN (all the *.apps.legodard.fr ones).
Faced with this issue, I decided to make my infrastructure repository private to avoid any “Hey guys, please hack this URL, it is a password manager”.

Reborn from the flames

With my cluster breaking more often, even when I was not touching it, I decided to rebuild the cluster entirely, this time with true HA.

I described the process here, but since all the workloads were, and still are, managed through GitOps, the restoration was really easy and straightforward.