Migrate master node to another VPS
I noticed that my master node is not powerful enough: K3s sometimes gets killed, or I get errors like “etcd leader election lost”, on a single master node…
From what I found online, this seems to be related to storage performance: etcd is very sensitive to disk write latency and is happiest on NVMe-class storage, whereas my provider only offers a regular SSD.
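A rough way to check whether a disk is fast enough for etcd is to measure fdatasync latency with fio. The sketch below assumes fio is installed and uses the commonly quoted guideline of a 99th percentile below 10 ms; the function names, the threshold variable and the scratch directory are illustrative and not part of my setup.

```shell
# Guideline often quoted for etcd: p99 fdatasync latency below 10 ms (10000 us).
# Treat this threshold as an assumption, not a hard limit.
ETCD_FSYNC_P99_MAX_US=10000

# Succeeds when the given p99 latency (integer, in microseconds) is within the guideline.
fsync_latency_ok() {
  [ "$1" -le "$ETCD_FSYNC_P99_MAX_US" ]
}

# Runs a small sequential write test with an fdatasync after every write,
# which approximates the write pattern of etcd's write-ahead log.
# $1: scratch directory on the disk to test (defaults to the current directory).
etcd_disk_bench() {
  bench_dir="${1:-.}"
  if ! command -v fio >/dev/null 2>&1; then
    echo "fio is not installed" >&2
    return 1
  fi
  fio --name=etcd-fsync-test --directory="$bench_dir" \
      --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300
}
```

After running `etcd_disk_bench` on the disk that will hold the K3s data, read the 99th percentile from the fsync/fdatasync section of fio's output and compare it with the guideline, for example `fsync_latency_ok 2500`.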
Architecture
I planned to use Oracle Cloud's generous free tier and to switch my account to PAYG (pay as you go) so that Oracle does not delete my instances.
With this migration, I am switching from a single master node to three, connected by a fully meshed WireGuard network.
Old architecture
It was a pretty basic setup: the master node was on a VPS, and the workers were at my apartment and at my parents’.

New architecture
For the new setup, I chose to have 3 master nodes: 2 on Oracle servers and 1 at my apartment (on my old laptop).
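The reason for going to three masters is etcd's quorum rule: a majority of the members must be reachable for the cluster to keep accepting writes. The arithmetic can be sketched as follows (the function names are only for illustration):

```shell
# Number of members that must be up for a cluster of $1 etcd members to have quorum.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still has quorum.
etcd_fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

etcd_quorum 3            # prints 2
etcd_fault_tolerance 3   # prints 1
etcd_fault_tolerance 1   # prints 0
```

With three masters, any single node can go down. Losing both Oracle servers at the same time would still cost the cluster its quorum, since only one member out of three would remain.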

Set up the cluster
WireGuard configuration
Install Ubuntu 24.04
Update the system just in case
Install WireGuard:
apt install wireguard -y
Create private and public keys:
umask 077 && wg genkey | tee privatekey | wg pubkey > publickey
Set up the mesh network:
[Interface]
Address = 10.222.0.x, fd5b:b1f6:46f2::x
ListenPort = 51820
PrivateKey =
PostUp = iptables -A FORWARD -i %i -j ACCEPT; iptables -A FORWARD -o %i -j ACCEPT; iptables -t nat -A POSTROUTING -o enxc8a3620caea5 -j MASQUERADE
PostDown = iptables -D FORWARD -i %i -j ACCEPT; iptables -D FORWARD -o %i -j ACCEPT; iptables -t nat -D POSTROUTING -o enxc8a3620caea5 -j MASQUERADE

[Peer]
Endpoint = <ip>:<port>
PublicKey =
AllowedIPs = 10.222.0.2, fd5b:b1f6:46f2::2
PersistentKeepalive = 25

[Peer]
Endpoint = <ip>:<port>
PublicKey =
AllowedIPs = 10.222.0.3, fd5b:b1f6:46f2::3
PersistentKeepalive = 25

[Peer]
Endpoint = <ip>:<port>
PublicKey =
AllowedIPs = 10.222.0.1, fd5b:b1f6:46f2::1
PersistentKeepalive = 25
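In a full mesh, each node lists every other node as a peer and leaves itself out. As a sketch of that rule, the hypothetical helper below prints the [Peer] sections for one node, reusing the addressing scheme from the configuration above; the endpoint and public key are left as placeholders to fill in per node.

```shell
# Prints the [Peer] sections for node number $1 in a mesh of $2 nodes.
# Every node except $1 itself gets one section.
mesh_peers() {
  self="$1"
  total="$2"
  i=1
  while [ "$i" -le "$total" ]; do
    if [ "$i" -ne "$self" ]; then
      printf '[Peer]\n'
      printf 'Endpoint = <ip>:<port>\n'   # placeholder: public endpoint of node $i
      printf 'PublicKey =\n'              # placeholder: public key of node $i
      printf 'AllowedIPs = 10.222.0.%s, fd5b:b1f6:46f2::%s\n' "$i" "$i"
      printf 'PersistentKeepalive = 25\n\n'
    fi
    i=$((i + 1))
  done
}

# Peers for node 1 in a three-node mesh: nodes 2 and 3.
mesh_peers 1 3
```

Assuming the finished file is saved as /etc/wireguard/wg0.conf, the tunnel can then be brought up with `wg-quick up wg0` and inspected with `wg show`.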
K3s configuration
First node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.33.1+k3s1 sh -s - server \
--write-kubeconfig-mode 644 \
--cluster-init \
--tls-san="ip4,ip6" \
--node-external-ip "ip4,ip6" \
--flannel-iface wg0 \
--node-ip "wg_ip4,wg_ip6" \
--advertise-address "wg_ip4" \
--cluster-cidr '10.42.0.0/16,2001:cafe:42::/56' \
--service-cidr '10.43.0.0/16,2001:cafe:43::/112'
Second and subsequent nodes
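The other servers need the token generated by the first node. A minimal sketch for reading it, assuming a standard K3s install where the token is stored under /var/lib/rancher/k3s/server/node-token (the helper name is hypothetical):

```shell
# Prints the join token of an existing K3s server.
# $1: token file (defaults to the assumed standard location).
k3s_join_token() {
  token_file="${1:-/var/lib/rancher/k3s/server/node-token}"
  if [ ! -r "$token_file" ]; then
    echo "cannot read $token_file (run as root on the first node?)" >&2
    return 1
  fi
  cat "$token_file"
}

# Usage on the first node, as root:
#   k3s_join_token
```

The printed value is what goes into K3S_TOKEN when joining the other servers.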
curl -sfL https://get.k3s.io | K3S_TOKEN="the_token" sh -s - server \
--write-kubeconfig-mode 644 \
--server https://wg_ip4:6443 \
--tls-san="ip4,ip6" \
--flannel-iface wg0 \
--node-ip "wg_ip4,wg_ip6" \
--node-external-ip "ip4,ip6" \
--advertise-address wg_ip4 \
--cluster-cidr "10.42.0.0/16,2001:cafe:42::/56" \
--service-cidr "10.43.0.0/16,2001:cafe:43::/112"
Bootstrap ArgoCD
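Before bootstrapping anything, it is worth checking that all the servers have joined and are Ready. The sketch below counts the Ready control-plane nodes from the output of `kubectl get nodes --no-headers`; it assumes the usual column order (name, status, roles) and the helper name is made up for this example.

```shell
# Reads `kubectl get nodes --no-headers` output on stdin and prints how many
# control-plane nodes report exactly the Ready status.
count_ready_servers() {
  awk '$2 == "Ready" && $3 ~ /control-plane/ { n++ } END { print n + 0 }'
}

# Usage on a server node (K3s bundles kubectl):
#   k3s kubectl get nodes --no-headers | count_ready_servers
```

For this setup the expected result is 3 once every server has joined.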
As ArgoCD depends on secrets I stored in HCP Vault Secrets, I need to start with the vault stack.
It’s pretty simple: the vault stack configures the vault controller to fetch secrets from HCP or from a self-hosted Vault, and it also installs cluster-secrets to share the HCP Vault secrets across all namespaces.
ArgoCD also depends on the WAF system and cert-manager.
Once the dependencies are installed, ArgoCD itself can be installed: this deploys argocd and argocd-apps, and sets up the repository credentials and the root application.
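The order described above (vault stack first, then the WAF and cert-manager, and ArgoCD last) can be sketched as a small loop that stops at the first failure. The stack names are labels for this example and the installer is a hypothetical command or function of your own; neither is an actual chart or application name from my repository.

```shell
# Stacks in dependency order. These are labels for this sketch only.
BOOTSTRAP_ORDER="vault waf cert-manager argocd"

# Calls the given installer once per stack, in order, and stops at the first
# failure so that ArgoCD is never installed without its dependencies.
# $1: installer command or function taking a stack name (hypothetical).
bootstrap_stacks() {
  installer="$1"
  for stack in $BOOTSTRAP_ORDER; do
    echo "installing $stack"
    "$installer" "$stack" || { echo "failed on $stack" >&2; return 1; }
  done
}

# Usage, with install_stack being your own function:
#   bootstrap_stacks install_stack
```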
Install CSI
As my cluster is “multi-cloud”, I need something to replicate the data stored in any volume. For this, I use Longhorn.
It is installed through ArgoCD, as the longhorn and longhorn-config applications.
Volume restorations
Thanks to Longhorn and my working backups, restoring was pretty easy. The hardest part was remembering the filesystem of each volume (ext4 or xfs).
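One way to avoid having to remember the filesystems is to record them on the old cluster before migrating. This is a sketch, assuming kubectl still has access to the old cluster and that the volumes are CSI volumes, which store the filesystem in `.spec.csi.fsType`; the helper name is made up.

```shell
# Lists each PersistentVolume with its namespace, claim and filesystem type.
list_pv_fstypes() {
  kubectl get pv \
    -o custom-columns='NAME:.metadata.name,NAMESPACE:.spec.claimRef.namespace,CLAIM:.spec.claimRef.name,FSTYPE:.spec.csi.fsType'
}

# Usage: list_pv_fstypes > pv-filesystems.txt
```

Keeping that file next to the backups gives the filesystem of every volume at restore time.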
To avoid any issues with the volumes, create each namespace before restoring its volumes, and DO NOT deploy any workload in those namespaces, so that no new volume gets created during the cluster restoration.
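Creating the empty namespaces ahead of the restore can be scripted. A minimal sketch, assuming kubectl points at the new cluster; the namespace names in the usage line are placeholders.

```shell
# Creates each namespace given as an argument unless it already exists.
# Nothing else is deployed, so no volume can be provisioned by accident.
ensure_namespaces() {
  for ns in "$@"; do
    if kubectl get namespace "$ns" >/dev/null 2>&1; then
      echo "namespace $ns already exists"
    else
      kubectl create namespace "$ns"
    fi
  done
}

# Usage: ensure_namespaces media monitoring
```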
Redeploy all the workloads
This part is simple: select all your ArgoCD apps and sync them.
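The same thing can be done from the command line instead of the UI. This sketch assumes the argocd CLI is installed and already logged in to the new instance; the function name is only for illustration.

```shell
# Syncs every application known to ArgoCD, one at a time.
sync_all_apps() {
  argocd app list -o name | while IFS= read -r app; do
    [ -n "$app" ] || continue
    echo "syncing $app"
    # stdin is redirected so the sync cannot consume the list being read
    argocd app sync "$app" </dev/null
  done
}

# Usage: sync_all_apps
```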
Conclusion
With this new setup, I only had to rebuild the cluster once: the Paris node went down while I was installing an operator, and some cosmic magic decided to mess with etcd. I preferred to lose 2 hours of data (I back up all the volumes every 2 hours) and restore the whole cluster without the new operator.
But in general, this setup is solid: a node can go down and the services keep running without major disruption.