Kubernetes, in the home, for fun

Cluster boards mounted on the wall for testing

The current cluster nodes

Pine64 graciously sent me the set of SoQuartz boards + carrier boards used for this cluster (but I was not paid, and they didn’t know what I would use them for). Additionally, some of the time spent first setting up the cluster was done as part of my employment at Propeller Aero.

My home lab setup has evolved a lot over the past ~10 years.

  • Cheap Android TV Dongle flashed to Linux
  • Cheap Intel Atom Motherboard
  • HP Microserver Gen8
  • ODroid N2
  • Custom TrueNAS storage server

Over the last year, I’ve been running TrueNAS SCALE on my storage server; but with more and more programs creeping into my list of deployed apps, it’s time to grow into a proper setup. (Some of the credit for this goes to [Awesome Self Hosted](https://github.com/awesome-selfhosted/awesome-selfhosted).)

This cluster aims to be mostly a learning area as well as help keep some “critical” services like DNS up for the house.

Cluster purpose

First and foremost, the desire for the cluster is to move load off my NAS and increase availability. Occasionally the NAS needs to reboot for maintenance; being able to have jobs shunt to a new device automatically is the key reason to run something like k8s at home. Obviously, if you only need “enough” uptime, there is no point in going to this level of effort… unless it’s fun :).

So, if it’s not obvious: this is an at-home cluster, built for fun and learning. It’s not meant to be perfect, just “good enough”. This means things are sometimes not what you would deploy in production; that said, I do like to stay close to a production-style setup (unless it’s fun to be overkill).

Cluster design

For this cluster, I’m aiming for 2 control nodes and 3 etcd nodes. I’m overlapping these so that some nodes are left entirely free to just run the workload.

Node 1 -> Compute + Storage
Node 2 -> Compute + Storage
Node 3 -> Compute only
Node 4 -> Compute + etcd
Node 5 -> Controller + Compute + etcd + Storage
Node 6 -> Controller + Compute + etcd + Storage

Once the rough cluster plan has been shaken out, it’s time to prepare the nodes and deploy a rough and ready k8s cluster.

Deploying an OS to the nodes

The largest issue I ran into when deploying this cluster is that DietPi is missing a lot of the kernel modules required for K8s to run on the nodes, and Kubespray doesn’t support Arch/Manjaro (I could totally have deployed it all manually… just effort).

Thankfully, just as I was dreading having to build my own image, I stumbled upon the apparently-not-quite-ready-but-already-shockingly-good Plebian Linux project by CounterPillow and friends. This is a really good project and mega kudos to them. After writing the image to the eMMC modules for each cluster unit, the units came up perfectly and we were off to the races.
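For reference, writing the image from a Linux machine looks roughly like this; the image filename and target device below are placeholders, so double-check with lsblk before running anything like it:

xzcat plebian-<your-board>.img.xz | sudo dd of=/dev/sdX bs=4M status=progress conv=fsync   # sdX is the USB eMMC/SD adapter, NOT your system disk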

I highly suggest doing a brief bit of locking down first here:

  • Set up ssh keys
  • Disable password login
  • Disable root login
  • Clean up installed packages
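For the middle two points, the relevant sshd_config lines look roughly like this (standard OpenSSH options; restart ssh after editing):

# /etc/ssh/sshd_config
PasswordAuthentication no   # keys only from here on
PermitRootLogin no          # no direct root logins
# then: sudo systemctl restart ssh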

Also, I find this a good point in time to lock in the local IPs for the nodes to make sure they can’t move around. The only thing worse than physically losing an SBC is having to find the darn thing’s new IP.
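One way to pin a node’s address on the node itself is a small systemd-networkd profile like the sketch below (assuming systemd-networkd is managing the interface; the interface name and addresses are placeholders). A DHCP reservation on your router achieves the same end:

# /etc/systemd/network/10-cluster.network
[Match]
Name=eth0

[Network]
Address=192.168.0.201/23
Gateway=192.168.0.1
DNS=192.168.0.1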

Kubespray

For the vast majority of how to use kubespray, please refer to their guide. This post is a snapshot in time, so it will age fast. Also, I’m certain I’ll miss little things.

Hyper short TL;DR

  1. Download kubespray
  2. Install dependencies for kubespray
  3. Copy the template somewhere to work from: cp ../kubespray/sample/* src/cluster/
  4. Generate your inventory: declare -a IPS=(192.168.0.200 192.168.0.201...); CONFIG_FILE=inventory/prod/hosts.yml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
  5. Edit addons.yml to enable MetalLB and assign its IP range
  6. Set strict_arp to true for MetalLB
  7. Set the upstream DNS fields: upstream_dns_servers
  8. Turn on NTP on all nodes: ntp_enabled: true
  9. Run the deployment: cd ../kubespray && ansible-playbook -i ../src/cluster/hosts.yaml --become --become-user=root cluster.yml -kK
  10. kubectl should now work on the nodes; copy the kube config to your machine to use it locally

Prepare

I don’t like to infect or pollute my computer with random software or mixed versions, so I’m running kubespray in a docker container. This means I can match exactly the software versions it wants, without any risk to my host.

I’ve pushed the Dockerfile + docker-compose.yml to my GitHub, for reference. You are welcome to use this.
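It boils down to something like the compose sketch below; the paths and mounts here are illustrative assumptions rather than the exact published file:

# docker-compose.yml (sketch)
services:
  kubespray:
    build: .                              # Dockerfile that pip-installs kubespray's requirements.txt
    volumes:
      - ../kubespray:/kubespray           # the kubespray checkout
      - ./src/cluster:/src/cluster        # my inventory + group_vars, kept in git
      - ~/.ssh:/root/.ssh:ro              # keys so ansible can reach the nodes
    working_dir: /kubespray
    command: sleep infinity               # exec in and run ansible-playbook by hand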

Create the inventory of nodes

Kubespray wants to have a hosts.yml file that is used to define each machine it’s going to configure and its roles. For example, this is what mine worked out to be:

all:
  hosts:
    node1:
      ansible_host: 192.168.0.201
      ip: 192.168.0.201
      access_ip: 192.168.0.201
    node2:
      ansible_host: 192.168.0.202
      ip: 192.168.0.202
      access_ip: 192.168.0.202
    node3:
      ansible_host: 192.168.0.203
      ip: 192.168.0.203
      access_ip: 192.168.0.203
    node4:
      ansible_host: 192.168.0.204
      ip: 192.168.0.204
      access_ip: 192.168.0.204
    node5:
      ansible_host: 192.168.0.205
      ip: 192.168.0.205
      access_ip: 192.168.0.205
    node6:
      ansible_host: 192.168.0.206
      ip: 192.168.0.206
      access_ip: 192.168.0.206
  children:
    kube_control_plane:
      hosts:
        node6:
        node5:
    kube_node:
      hosts:
        node1:
        node2:
        node3:
        node4:
        node5:
        node6:
    etcd:
      hosts:
        node4:
        node5:
        node6:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Above you can see that I used a fairly simple static IP assignment to keep things easy to remember. When I magically run out of the ~55 node IPs I might need to get creative… but that’s a lot of $$ worth of SBCs to buy first.

This should match the plan you made earlier, since this is exactly what Kubespray will deploy.

Edit kubespray settings

I changed some settings away from the defaults to have kubespray set up more of my cluster for me.

First copy the cluster defaults out of the kubespray repo so that you have something to edit. Then you can edit these outside of docker.

cp -r inventory/sample ../src/

Network Plugin

For this cluster, I’ve settled on flannel for the network management plugin, for no better reason than (1) it just worked and (2) it sounded good in online reviews. The default calico also worked fairly well in testing, so I think either should serve you well in your own deployment.
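In kubespray terms that is a one-line change in the copied group vars; the path below is where it lives in the sample inventory, though it can shift between releases:

# src/cluster/group_vars/k8s_cluster/k8s-cluster.yml
kube_network_plugin: flannel   # the default is calico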

Modify required packages

As of the time of writing, kubespray is designed around older Debian releases, so it expects python-apt to exist rather than the newer python3-apt. For now, I’m manually patching roles/kubernetes/preinstall/vars/debian.yml to enact the name change of the package (see my Dockerfile). Also, if you want to have ansible automatically install extra packages for you, insert them here.
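The patch itself is tiny; in my Dockerfile it amounts to something like this one-liner (a sketch of the idea rather than the exact line):

sed -i 's/python-apt/python3-apt/g' roles/kubernetes/preinstall/vars/debian.yml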

MetalLB

MetalLB has been a fantastic load balancer for me so far; it does exactly what I want on bare-metal machines (provides resiliency) without any extra hardware. It handles this by using ARP to advertise which node is hosting the LAN IP for a service; that way, if the node goes down, the IP can be moved as needed.

For MetalLB to function, you need to tell it what IP space in your network it can assign to services. For my home network, I run in 192.168.0.0/23. This gives me 192.168.0.0-192.168.0.255 for devices and then I use 192.168.1.0-192.168.1.255 for services and IO~~S~~T devices. The main DHCP server only allocates the devices range, and anything in the services range is outside its control. Thus when configuring MetalLB, I’ve given it 192.168.1.10-192.168.1.128. This is plenty of room for a home lab :).

I needed to set kube_proxy_strict_arp to true as well for this to work in my configuration.
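Pulled together, the relevant kubespray variables end up looking something like the snippet below; the key names have moved around between kubespray releases, so treat this as a sketch and cross-check the sample files for your version:

# src/cluster/group_vars/k8s_cluster/addons.yml
metallb_enabled: true
metallb_speaker_enabled: true
metallb_ip_range:
  - "192.168.1.10-192.168.1.128"

# src/cluster/group_vars/k8s_cluster/k8s-cluster.yml
kube_proxy_strict_arp: true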

Deploy against the nodes

ansible-playbook -i ../src/cluster/hosts.yaml --become --become-user=root cluster.yml -kK is the magic command I used. To expand on it: I keep my hosts and config files in a folder outside the kubespray directory so that they can be checked into git and preserved across restarts of my docker environment. --become --become-user=root means that after Ansible logs into a device it will elevate to that user; I use this to raise to root to perform the setup, so that the initial ssh user does not need the ability to log in as root. The -kK flags simply make ansible prompt for the ssh and sudo passwords at run time.

Persist your configuration for accessing the cluster

After ansible has deployed the changes to all of your nodes, the cluster configuration file for access will be placed in the cluster/artifacts/admin.conf file. Copy this to your ~/.kube/ folder and tools like OpenLens will pick it up automatically.

If you save this as ~/.kube/config it will be used automatically; otherwise, when interacting with the cluster from the CLI, you will need to point at the file by setting KUBECONFIG, e.g. export KUBECONFIG=/home/$USER/.kube/admin.conf
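In shell terms that is just the following; the artifact path assumes the inventory layout used above:

mkdir -p ~/.kube
cp ../src/cluster/artifacts/admin.conf ~/.kube/admin.conf
export KUBECONFIG=$HOME/.kube/admin.conf
kubectl get nodes   # all six nodes should report Ready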

Setting up LAN-accessible services

For a lot of the services I plan to run on the cluster, it’s optimal for them to be assigned a static IP so that I can reliably find them. NodePorts are a clunky solution (unless you only have 1 node), since the service ends up on a high, semi-random port on the node IPs rather than a stable, well-known address that clients can be pointed at.

The first service I deployed to the cluster was blocky which is a great tool to have on one’s home network. Many have probably heard of Pi-Hole, and blocky feels like the juiced-up and streamlined pi-hole. It acts as my primary DNS at home and catches all the DNS requests from client devices.

Setting up Blocky is a relatively straightforward affair:

  1. Create a configuration file (or a close best guess)
  2. Inject the configuration file into a k8s config-map
  3. Create the deployment to define the redundant pods
  4. Create an ingress to have MetalLB assign a static IP to the pods

Blocky has a relatively simple configuration that we can store inside of k8s, allowing the pods to run without any special volume mounts. During pod creation, the file is injected into the containers from our config map. Personally, I consider this a hard requirement: I don’t want bringing the cluster up from a cold start to involve dependency loops, and having my DNS depend on a storage system feels like a fast way into circular dependency hell. Instead, by using the config stored in k8s, I only need the usual k8s requirements of a controller and etcd.

Create the configuration file

Please refer to the blocky documentation when creating your config, but a minimal one looks like this:

upstream:
  default:
    - 1.1.1.1
    - 8.8.8.8

startVerifyUpstream: true

blocking:
  blackLists:
    ads:
      - https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
  whiteLists:
    ads:
      - |
        fonts.gstatic.com        
  clientGroupsBlock:
    default:
      - ads

  blockType: zeroIp
  blockTTL: 5m
  startStrategy: failOnError

prometheus:
  enable: true
  path: /metrics

This is a barebones configuration for brevity: it has one block list and two upstream DNS servers, and enables the Prometheus metrics. Save this to a yaml file on disk, such as blocky-configuration.yml. We can then create a configmap via kubectl: kubectl create configmap blocky-configuration --from-file=blocky-configuration.yml (the configmap name needs to match the one referenced by the deployment below).

Create the main deployment

Before I jump into the deployment, a quick note: you can have multiple blocks in one yaml file. If you separate the sections with a --- line, each section is treated as its own document, so a single yaml file can contain your entire deployment. Also, remember that you “apply” a yaml file with kubectl apply -f <path>, or you can apply a whole folder of yaml files at once if that path is a directory.

To run Blocky with good uptime, we instruct the cluster to always try to keep two copies running at once. This way, if one fails, MetalLB can migrate the IP as soon as possible to a ready-to-go pod. This also means that upgrades will roll out across the pods one by one, preventing downtime where possible.

Here is the deployment I’m running at the moment; I’ve annotated the lines with comments so they make more sense:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: blocky #Name for human interaction and used in the base pod labels
  labels:
    app: blocky
spec:
  replicas: 2 # We want to keep 2 replicas (copies) running at all times
  selector:
    matchLabels:
      app: blocky
  template: # This is the template that is used when spawning each replica
    metadata:
      labels:
        app: blocky
        name: blocky
    spec:
      containers: #The list of containers to run in the "pod"
        - name: blocky
          image: spx01/blocky:v0.20 # Pin to a specific image tag; a :latest tag also works but makes upgrades less predictable
          imagePullPolicy: IfNotPresent # Only download the image if we don't already have one on the machine
          env:
            - name: TZ
              value: "Australia/Sydney" # Force the timezone to be local in the pod
            - name: BLOCKY_CONFIG_FILE
              value: "/app/config/blocky-configuration.yml" # The path blocky should read to grab its config
          volumeMounts:
            - name: blocky-configuration # This volume mount specifies that we want the volume (declared later) to be mounted in the folder blocky wants
              mountPath: /app/config/ # The folder blocky reads its config from
          resources:
            requests: # Requests help k8s guess how much it needs when packing a node with jobs, here we declare 0.1 CPU's (10% of one core)
              cpu: "0.1"
            limits:
              cpu: "2" # This is a limit; the pod will be throttled if it tries to use more than this
          readinessProbe: # This probe (polling the TCP DNS socket) is used to allow K8s to know when the pod is up and ready to handle traffic
            tcpSocket:
              port: 53
            initialDelaySeconds: 20
            periodSeconds: 10
          livenessProbe: # This probe (polling the TCP DNS socket) is polled periodically for K8S to make sure the pod is still alive and happy
            tcpSocket:
              port: 53
            initialDelaySeconds: 60
            periodSeconds: 10

      volumes:
        - name: blocky-configuration # We define the volume we mounted above
          configMap: # And specify it comes from a configmap in k8s
            name: blocky-configuration

Once this is applied to the cluster, you can inspect the pods in OpenLens and see that they will move through to the ready state (It may take a bit at first while the image is pulled).
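If you prefer the CLI over OpenLens, the same check is a one-liner:

kubectl get pods -l app=blocky -w   # watch the two replicas come up and pass their readiness probes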

Where possible I also try to include a PodDisruptionBudget. This is a fancy way of telling k8s to always keep at least one pod available. It means that when you make a change, it will bring down all but one, change them, let them come back up and take the load, then do the remaining one. Doing this makes deployments slower, but it minimises interruptions. Reminder: as noted earlier, you can put this in the same yaml file separated by a ---.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: blocky-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: blocky

Create an ingress to assign a static IP

To expose blocky to the network we are going to define a service for it with an exposed endpoint. Also, since I need to be able to configure this static IP to be advertised by DHCP, here I’m pinning it to 192.168.1.10. This is the start of the MetalLB range, so MetalLB will happily allocate it for us.

apiVersion: v1
kind: Service
metadata:
  name: blocky
spec:
  type: LoadBalancer
  loadBalancerIP: 192.168.1.10
  selector:
    app: blocky # Find the app called blocky to attach to
  ports:
    - port: 4000
      targetPort: 4000
      protocol: TCP
      name: blocky-admin # port we expose the admin interface/API and the Prometheus metrics
    - port: 53
      targetPort: 53 # Open up port 53 TCP and UDP for queries
      protocol: TCP
      name: dns-tcp
    - port: 53
      targetPort: 53
      protocol: UDP
      name: dns-udp

Ingress with nginx-proxy & cert-manager

For nicer, public-internet-facing applications, it’s optimal to give them an HTTPS cert to make things both faster and more secure. To achieve this we also want to (1) automate everything and (2) have load balancing so that more than one pod can take the load where possible. I have moderately decent home internet, so exposing services over it works quite well. Normally I don’t expose things and instead use Tailscale to route to them privately, but there are some services I want others to be able to reach, such as a hosted Git server. It’s for these shared resources that we want to set this up.

Nginx-proxy

This deployment uses Nginx to act as a load balancer and reverse proxy. It looks at the hostname on the incoming request and matches it against the rules for the applications we have defined. This means we can use as many subdomains as we like and fan them out to multiple services behind a single public IP.

Additionally, Nginx will terminate TLS for us, using the certificate that cert-manager stores in the cluster. This means that our services do not have to use HTTPS themselves (or can use a cluster-trusted self-signed cert), and Nginx handles the work of wrapping the connection in TLS. Again, it’s magic.

For setting up nginx I’ve just used their deployment directly: kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.5.1/deploy/static/provider/cloud/deploy.yaml. This has worked excellently so far.

Cert-manager

Cert-manager handles the magic API requests to complete the challenges and generate a Let’s Encrypt certificate. Whenever a service is defined with a public domain name, cert-manager will fetch the certificate from Let’s Encrypt and store it in k8s, ready for Nginx to use.

Similarly to nginx, I am deploying cert-manager directly via kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.10.1/cert-manager.yaml

Additionally; once cert-manager itself is deployed; it needs to be given a configuration to tell it where to fetch certs from:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
  namespace: cert-manager
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: <your email here>
    privateKeySecretRef:
      name: letsencrypt
    solvers:
      - http01:
          ingress:
            class: nginx

In this case; it’s a ClusterIssuer so it will be able to issue certificates for the whole cluster.

Finally, for each subdomain we want to associate with a service, we add a Certificate for it as well, like so:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: gitea-cert
  namespace: gitea
spec:
  dnsNames:
    - gitea-example.ralimtek.com
  secretName: gitea-tls-cert
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

You can then apply this configuration to the cluster as well, and it will be picked up automatically for you :)

Note that you will need to create the service this points at for it to work (otherwise it will complain that it can’t find the service you are referring to).
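To tie the pieces together, a matching Ingress looks roughly like the sketch below; the gitea service name and port are assumptions for illustration, while the host and secret names line up with the Certificate above:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitea
  namespace: gitea
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - gitea-example.ralimtek.com
      secretName: gitea-tls-cert  # the secret cert-manager fills in for us
  rules:
    - host: gitea-example.ralimtek.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gitea-http  # assumption: whatever your gitea Service is called
                port:
                  number: 3000    # assumption: gitea's default HTTP port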

Tada

So at this point, in theory, you have a running cluster of nodes, one working service, and automatic TLS set up for your future deployments.

FAQ / Help

Reset cluster and try again

Once the power of reproducible deployments ~~goes to your head~~ becomes familiar, you will most likely want to try changing settings and testing different configurations. To do a full node reset that takes down the whole cluster and its software, you want the reset playbook (ansible-playbook -i ../src/cluster/hosts.yaml --become --become-user=root reset.yml -kK). This will take you roughly back to before you started. It does leave some crud on the filesystem, as ansible makes file backups every time you run things, but that’s minor for a home lab and for testing.

Trying to reset when DNS lookups are broken

If you are trying to reset the cluster because you broke networking (because you may or may not have been changing network settings…), you may get stuck as DNS lookups time out/fail: the node’s resolver (what resolvectl shows) is still pointed at the k8s node-local DNS service, but that service is down due to your testing.

This is relatively easy to fix. Generally, ansible inserts a block at the top of the /etc/dhcp/dhclient.conf file that applies this override. You can just remove the lines between the two ansible comment markers and reboot to get out of this situation.
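The block left behind looks something like the sketch below (marker text and addresses are from memory, so yours may differ slightly); deleting everything between the markers and rebooting restores normal DNS:

# Ansible entries BEGIN
supersede domain-name-servers 169.254.25.10, 192.168.0.1;
# Ansible entries END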