Using the k1 cluster
This document describes the differences between the k0 and k1 clusters. Familiarity with the cookbook is assumed.
k1 is meant to replace k0 as the main production cluster. After years of failing to bring k0 entirely up-to-date (due to the difficulty and risk of breaking prod), we’re starting from scratch with up-to-date dependencies, new hardware, better tooling, and architectural learnings from k0. Once k1 is productionized, we intend to move services over from k0, eventually adding k2 as the crash/test cluster and retiring k0.
Important: k1 is currently work-in-progress. Do not use it for serious production services just yet.
Jsonnet differences
Instead of:

```jsonnet
local kube = import "../../../kube/hscloud.libsonnet";
```

use:

```jsonnet
local kube = import "../../../kube/k1.libsonnet";
```
There are many minor differences in Kubernetes manifests due to the large Kubernetes version gap between the two clusters. For example, custom Ingress rule paths now require a pathType: 'ImplementationSpecific'.
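As an illustration, a custom rule path on k1 might look like this (a hedged sketch following the TLSIngress/extraPaths usage shown later in this document; the /metrics path and metricsService backend are hypothetical):

```jsonnet
// Hypothetical example: an extra Ingress rule path on k1.
// Note the now-required pathType on the path entry.
ingress: ns.Contain(kube.TLSIngress(cfg.name)) {
    hosts:: cfg.domains,
    target:: top.service,
    extraPaths:: [
        { path: '/metrics', pathType: 'ImplementationSpecific', backend: top.metricsService.name_port },
    ],
},
```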
Using jsonnet libraries
When using .libsonnets that are shared between clusters, kube.libsonnet needs to be passed as a parameter, for example:
```jsonnet
local kube = import "../../../kube/k1.libsonnet";
// ...
pki: ns.Contain(hspki) {
    kube: kube,  // <-- new
    cfg+: {
        // ...
    },
},
```
Some libraries might also need a cluster parameter now. For example, hspki now takes cfg.cluster, which should be set to "k1".
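Putting both changes together, an hspki instantiation on k1 might look like this (a sketch based only on the parameters mentioned above; check the library for the full set of fields):

```jsonnet
pki: ns.Contain(hspki) {
    kube: kube,  // the k1 kube.libsonnet imported above
    cfg+: {
        cluster: "k1",  // new cluster parameter
        // ...
    },
},
```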
Cluster tooling
To authenticate, use prodaccess -cluster k1.
Once authenticated, you can see the currently selected cluster using kubectl config current-context and switch between them using kubectl config use-context {k0,k1}.hswaw.net.
You can also pass --context {k0,k1}.hswaw.net to all standard kube tooling like kubectl, kubecfg or stern.
DNS
For test Ingress domains, you can use these as equivalents of *.cloud.of-a.cat:
- *.k8.s-a.cat
- *.kubernete.s-a.cat
- *.kartongip.s-a.cat
For your own domains, use CNAME to ingress.k1.hswaw.net (or equivalent A, AAAA records).
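For example, in a BIND-style zone file (mything.example.com is a placeholder for your own domain):

```
; Point your domain at k1's ingress:
mything.example.com.    300     IN      CNAME   ingress.k1.hswaw.net.
```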
Block Storage
Instead of k0’s waw-hdd-redundant-3 storage class name, use one of these:
- waw-hdd-redundant-4 (cheap bulk storage, 2x replication)
- waw-ssd-experimental-4 (experimental only, do not use in production)
- waw-nvme-redundant-4 (fast NVMe storage, 2x replication)
k1 also supports volume expansion, so it’s not necessary to overprovision volumes as much as with k0.
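A minimal PVC sketch using one of the new storage classes (field names follow the bitnami-style kube.libsonnet helpers used in hscloud; adjust if your library version differs):

```jsonnet
// Hedged sketch: a 20Gi claim on the cheap bulk storage class.
data: ns.Contain(kube.PersistentVolumeClaim("data")) {
    storage:: "20Gi",
    storageClass:: "waw-hdd-redundant-4",
},
```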
Object Storage
For now, use k0’s S3/radosgw as per the cookbook.
CockroachDB
For now, use k0’s CockroachDB as per the cookbook.
Cross-cluster services
Both clusters are on the same network, so kube services can talk to each other across clusters. You can also use cluster DNS: instead of service.namespace or service.namespace.svc.cluster.local, use service.namespace.svc.{k0,k1}.hswaw.net.
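For example, a workload on k1 could reach a service still running on k0 through its cross-cluster DNS name (the db service, mything namespace, and env variable below are hypothetical):

```jsonnet
// Sketch: a container on k1 talking to the 'db' service
// in the 'mything' namespace on k0.
container: kube.Container("app") {
    env_: {
        DB_HOST: "db.mything.svc.k0.hswaw.net",
    },
},
```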
TODO: cross-sign hspki certificates so that hspki mTLS authentication works cross-cluster as well
IPv6
k1 supports IPv6, currently with the following limitations:
- pod and service IPs are not accessible outside of the cluster network
- pod egress is currently NATed with node IP
To use dual-stack services, set .spec.ipFamilyPolicy = "PreferDualStack".
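In jsonnet, that could look like this (a sketch following the Service style used elsewhere in this document):

```jsonnet
service: ns.Contain(kube.Service(cfg.name)) {
    target:: top.deployment,
    spec+: {
        ipFamilyPolicy: "PreferDualStack",  // request both IPv4 and IPv6 addresses
    },
},
```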
To explicitly specify load balancer IPs to use with dual-stack, instead of the deprecated spec.loadBalancerIP, use this annotation:
```jsonnet
kube.Service("name") {
    metadata+: {
        annotations+: {
            "metallb.universe.tf/loadBalancerIPs": "185.236.240.157, 2a0d:eb00:2137:40::2:157",
        },
    },
}
```
Note that Ingress supports IPv6 by default (you don’t have to set the target Service to be dual-stack for this to work).
User Namespaces
k1 supports user namespaces, a security hardening feature not available on k0. Please set hostUsers: false on all your Pod specs and only opt out if it causes issues. (Note that this will likely be made into an opt-out default value later on.)
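For a typical Deployment, this means setting hostUsers deep inside the pod template (a sketch; merge this into your own Deployment definition):

```jsonnet
deployment: ns.Contain(kube.Deployment(cfg.name)) {
    spec+: {
        template+: {
            spec+: {
                hostUsers: false,  // run containers in a user namespace
            },
        },
    },
},
```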
Migrating web services to k1 without downtime
It’s possible to move services from k0 to k1 while avoiding downtime due to DNS propagation (ingress address change). It’s a bit involved, so you probably want to ignore this tutorial unless dealing with a high-traffic or otherwise critical service.
Here’s how:
Step 1: Deploy to k1 under a test domain
This is to check that the service will continue to work on the new cluster.
```jsonnet
// (k1)
cfg:: {
    // ...
    domains: ['mything.k8.s-a.cat'],
},
// ...
ingress: ns.Contain(kube.TLSIngress(cfg.name)) {
    hosts:: cfg.domains,
    target:: top.service,
},
```
Step 2: Add original domain to k1 deployment
This needs to be a separate step. Since the original domain does not yet point to k1, a certificate covering it cannot be issued yet; requesting both domains in a single step would leave the test domain without a working certificate as well.
```jsonnet
// (k1)
cfg:: {
    // ...
    domains: ['mything.k8.s-a.cat', 'mything.hackerspace.pl'],
},
```
If you run kubectl -n mything get pod,ing, you should see a cm-acme-http-solver pod running, trying (for now, unsuccessfully) to obtain a certificate for mything.hackerspace.pl.
Now, run this sanity check:
```shell
curl -i --resolve "mything.hackerspace.pl:443:185.236.240.161" "https://mything.hackerspace.pl"
```
This will try to access the service (which still points to ingress.k0.hswaw.net) by contacting ingress.k1.hswaw.net. You should see something like curl: (60) SSL certificate problem. If it doesn’t fail, you probably messed up the --resolve option.
Step 3: Forward ACME requests from k0 to k1
Back on k0, modify the Ingress so that ACME challenges (TLS certificate verification) on mything.hackerspace.pl get forwarded to k1, like so:
```jsonnet
// (k0)
k1IngressProxy: ns.Contain(kube.Service("k1-ingress")) {
    spec: {
        type: 'ExternalName',
        ports: [
            { port: 80, name: 'http', targetPort: 80 },
        ],
        externalName: "ingress.k1.hswaw.net",
    },
},
ingress: ns.Contain(kube.TLSIngress(cfg.name)) {
    hosts:: cfg.domains,
    target:: top.service,
    extraPaths:: [
        { path: '/.well-known/acme-challenge', backend: top.k1IngressProxy.name_port },
    ],
},
```
After applying, observe cm-acme-http-solver on k1. It should quickly succeed in obtaining the certificate. Now we can serve HTTPS content on this domain on k1.
To be sure, you can check kubectl -n mything get cert to see if the READY status changed to True. You can also re-run the curl check: now it should successfully return content through k1’s ingress. (This is why we did the sanity check earlier: to rule out a bad curl command that was really checking k0 all along. TODO: Come up with a simpler way of reliably verifying this.)
Step 4: Forward all traffic to k1
Now we’re ready to switch all traffic from k0’s Deployment to k1, with zero downtime.
To do that, we’ll modify the k0’s Service to point to k1’s Service (using ExternalName and cross-cluster DNS) instead of k0’s Pods:
```jsonnet
// (k0)
service: ns.Contain(kube.Service(cfg.name)) {
    // target:: top.deployment,  // <-- comment this out
    spec: {
        type: 'ExternalName',
        clusterIP: null,
        ports: [
            { port: 8080, name: 'http', targetPort: 8080 },
        ],
        externalName: "%s.k1.hswaw.net" % top.service.host,
    },
},
```
Please note:
- clusterIP: null is needed when modifying an existing Service (otherwise the kube apiserver will reject the change)
- Make sure that the targetPort is the correct port on which k1’s service serves content
- externalName assumes that the k1 version has the same service and namespace names as on k0
If we didn’t make any mistakes, traffic should flow through k1 immediately after applying. Otherwise, you might see 502 errors, or a long delay followed by a 504; most likely, externalName or ports are wrong.
Step 5: Clean up
- Scale k0 Deployment to 0 replicas
- Remove test domain from k1’s Ingress
- Switch DNS from CNAME ingress.k0.hswaw.net to CNAME ingress.k1.hswaw.net
TODO: Can we avoid the ACME ingress shenanigans by just copying certificates&secrets?
Migrating PVC
This procedure requires ceph admin access.
First, find the name of the PersistentVolume (PV) backing the PVC you want to migrate, e.g.:
```shell
kubectl --context k0.hswaw.net -n mything get pvc/data -o=jsonpath="{.spec.volumeName}"
```
You’ll see something like pvc-79ee4efb-bacb-4173-96f7-e41cf75c687d. Note that in ceph-waw3, rbd image names are the same as pv names.
Then, from ssh root@dcr01s22.hswaw.net, dump the rbd image (note the -metadata, yes it’s necessary):
```shell
rbd -c ceph-waw3.conf export waw-hdd-redundant-3-metadata/pvc-79ee4efb-bacb-4173-96f7-e41cf75c687d waw3-dumps/mything-data
```
On k1, create new PVCs as per usual, but don’t bind them to a pod yet. On ceph-waw4, rbd image names are different from PV names, so extract the rbd image name like so:
```shell
kubectl --context k1.hswaw.net get pv/$(kubectl --context k1.hswaw.net -n mything get pvc/data -o=jsonpath="{.spec.volumeName}") -o=jsonpath="{.spec.csi.volumeAttributes.imageName}"
```
You’ll see something like csi-vol-0c2c1b2d-9eff-43b3-92fd-635c641061d3. We’ll delete it from Ceph (be super careful!):

```shell
rbd rm waw-hdd-redundant-4/csi-vol-0c2c1b2d-9eff-43b3-92fd-635c641061d3
```
Then we import the waw3 volume as an rbd image with the same name as what the CSI mounter will expect:
```shell
rbd import waw3-dumps/mything-data waw-hdd-redundant-4/csi-vol-0c2c1b2d-9eff-43b3-92fd-635c641061d3
```
TODO: Explore RBD Live Migration for low/zero-downtime migration: https://docs.ceph.com/en/reef/rbd/rbd-live-migration/
k1 ops and architecture differences
k1 stuff resides at //cluster/k1/. For applying changes, instead of multiple “view” jsonnets, use: kubecfg diff cluster/k1/k1-view.jsonnet -A view=NAME.
We have a NixOS integration test for the k1 cluster. Use it to test cluster/node-level changes before applying to the live cluster. See //cluster/k1/test.nix for details.
Networking: the Calico/MetalLB interaction is now simplified. MetalLB more or less only serves as a kube operator for IP allocation, while BGP announcements are now handled by Calico.
Storage: We plan a new Ceph cluster that is managed not in the kube cluster (by Rook), but directly on NixOS nodes.
Dependencies: For cluster-level dependencies (like calico, coredns, cert-manager, etc.), we now try to use vendored yaml manifests as much as possible, applying jsonnet “patches” as needed, instead of recreating them entirely in jsonnet. This is meant to simplify updates.
Wanna help? Talk to k1 ops (radex, informatic, mikedlr, et al) or ask on #hswaw-infra what you can do to help productionize k1 and migrate services to it from k0/boston-packets.