Using Lease Resources to Manage Concurrency in Tekton Builds
I’ve recently started using Tekton as my main build system. In combination with Argo CD for GitOps management of the pipeline scripts, it has some pretty nice features. One of the things I’m appreciating is the ability to get down to the platform level and customize my pipeline.
A challenge we’ve recently been puzzling over is the integration test phase. The integration tests are the sort of juicy integration tests that you’re really glad you run, because they involve OpenShift on IBM Cloud, a number of IBM Cloud managed services, another cloud provider, virtual machines, scripts, legacy systems, data pushes, data pulls, and stored procedures. I wish I could say everything ran perfectly the first time we put it all together, but it did not. With so many moving parts, this is a great example of where proper continuous integration and continuous deployment practices can help reduce risk.
Continuous integration means many git pushes a day, which means many builds. Those builds are sometimes concurrent. At this point, things get awkward. On the application side, we’re fine — Kubernetes allows us to deploy our image with a unique name, and run tests against isolated instances. Once we get into databases, things get more complicated. Sharing a single database between multiple concurrent integration test runs can get pretty non-deterministic: “is it the first build or the second build which is doing unwelcome things to our data?”
We could easily solve the problem by spinning up a database container inside our cluster, except that wouldn’t allow us to test everything we need. As well as the main application, two services respond to changes in the data and push data into the database. Excluding them from the scope of our integration tests would be beautiful and clean, but it would leave out the interactions most likely to create unexpected results. Dynamically wiring the remote systems to a test database (several test databases, in fact) would require significant infrastructure investment.
The tests don’t take that long to run, so the simplest fix is to throttle concurrency and only run one suite of tests at a time. This is where Tekton starts to show its newness as a technology. Jenkins has a plugin to manage concurrency; Tekton has several issues with feature requests, but nothing actually implemented yet. However, it turns out to be a fairly straightforward problem to home-roll a solution for. I was inspired by the Akka Leases feature, which takes advantage of Kubernetes’ built-in concurrency management. It’s not labelled “BUILT-IN CONCURRENCY MANAGEMENT” in big letters (much to my disappointment when I went looking in the docs), but the features are there.
The core of the solution is the fact that in Kubernetes, we can make resources for anything. I mean anything. The corollary is that almost everything we might want to interact with is a resource; Tekton runtime artefacts are all resources.
(As an aside, I wonder whether in a few years’ time we’ll be fed up of custom resource definitions, and we’ll see lots of polemics exhorting devs “stop defining resources! we don’t need any more! not everything should be a resource!” … However, in mid-2020, using the platform’s extensibility to tune it to our domain needs is cool and fun and I plan to do more of it.)
I defined a Lease resource, as follows:
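A minimal sketch of what such a definition could look like, assuming the somegroup.org group and v1 version that the scripts later in this post refer to. The apiextensions.k8s.io/v1 API and the empty schema are assumptions, not necessarily what was used at the time. The lease needs no spec, because its mere existence acts as the lock.

```yaml
# Sketch only: a namespaced custom resource called Lease in the somegroup.org group.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # The CRD name must be <plural>.<group>
  name: leases.somegroup.org
spec:
  group: somegroup.org
  scope: Namespaced
  names:
    kind: Lease
    singular: lease
    plural: leases
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          # No spec fields: the presence of the object is the lock
          type: object
```

Kubernetes already ships a built-in Lease kind in the coordination.k8s.io group, which is one reason to address this one by its fully qualified name, lease.somegroup.org.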
These resources also need some RBAC. Here, pipeline should be the service account id used to run your pipelines:
- kind: ServiceAccount
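A sketch of the kind of Role and RoleBinding this implies, assuming the lease lives in the somegroup.org group. The names lease-manager and lease-manager-binding and the namespace my-namespace are placeholders, and the verb list is an assumption meant to cover what kubectl create, wait, apply and delete need.

```yaml
# Sketch only: let the pipeline service account manage leases in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: lease-manager            # placeholder name
  namespace: my-namespace        # placeholder namespace
rules:
  - apiGroups: ["somegroup.org"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: lease-manager-binding    # placeholder name
  namespace: my-namespace        # placeholder namespace
subjects:
  - kind: ServiceAccount
    name: pipeline               # the service account that runs the pipelines
    namespace: my-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: lease-manager
```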
Once these are defined, we can define tasks to acquire and release leases:
- name: label
- name: create-lease
# EOF in yaml is hard, so make a file the simple way
echo 'apiVersion: "somegroup.org/v1"' > e2e-lease.yaml
echo 'kind: Lease' >> e2e-lease.yaml
echo 'metadata:' >> e2e-lease.yaml
echo '  name: e2e-test-lease' >> e2e-lease.yaml
echo '  labels:' >> e2e-lease.yaml
echo '    pipeline: "$(inputs.params.label)"' >> e2e-lease.yaml
# Try to create a lease. Either it succeeds, and we are good, or it fails,
# and then we wait for the lease to be deleted or a timeout.
# In the event of a timeout, clear out the dead lease so it doesn't mess up future builds.
kubectl create -f e2e-lease.yaml || (echo "Waiting for lease" && kubectl wait --for=delete lease.somegroup.org/e2e-test-lease --timeout=$TIMEOUT || (echo "Grabbing abandoned lease." && kubectl delete lease.somegroup.org/e2e-test-lease))
# We could be here for three reasons:
# either we successfully created a lease,
# we waited and another run's lease got deleted,
# or we waited and the other lease is still there.
# Run an apply to make sure a lease with our label now exists.
kubectl apply -f e2e-lease.yaml
The first part creates the resource; if creation fails, that means the lease must already exist. In that case, it uses kubectl wait to pause until the other pipeline releases the lease (shell || is super-handy for this kind of logic). What if the other lease never gets released, because a build hung or crashed? In that case, it waits for twenty minutes, then times out, and deletes the existing lease. In both of the error paths a lease doesn’t get made, so I run a final kubectl apply to ensure a lease gets created. These command chains aren’t atomic, so I wouldn’t use them for business-critical concurrency … but for a queueing system for a build, it’s good enough.
The release lease task is simpler. The only non-obvious part is that we add a pipeline label to the lease, and only delete the lease for the current pipeline. Kubernetes field selectors can’t match on labels, so the delete uses a label selector for the pipeline and a field selector for the lease name. There shouldn’t ever be any other leases until release-task is run, but I wanted to be cautious in what got deleted. This becomes important in a cleanup context, as we’ll see below.
- name: label
- name: delete-lease
script: kubectl delete lease.somegroup.org --ignore-not-found=true --selector pipeline=$(inputs.params.label) --field-selector metadata.name=e2e-test-lease
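For context, here is a sketch of how a release task like this might be wrapped up as a complete Tekton Task. It assumes the tekton.dev/v1beta1 API, that the lease carries the pipeline name as a Kubernetes label under the hypothetical key pipeline, and that the step image has kubectl and a shell available. The task name release-lease and the bitnami/kubectl image are illustrative choices. In v1beta1 the parameter is usually written $(params.label) rather than $(inputs.params.label).

```yaml
# Sketch only: a Task that deletes the lease owned by one pipeline run.
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: release-lease            # hypothetical task name
spec:
  params:
    - name: label
      type: string
      description: Unique name of the pipeline run that owns the lease
  steps:
    - name: delete-lease
      image: bitnami/kubectl     # assumption: any image with kubectl and sh works
      script: |
        #!/bin/sh
        # Match on both the lease name and the owning pipeline's label,
        # so a run can never delete a lease that belongs to another run.
        kubectl delete lease.somegroup.org \
          --ignore-not-found=true \
          --selector "pipeline=$(params.label)" \
          --field-selector metadata.name=e2e-test-lease
```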
In the pipeline, all tasks which touch the database runAfter the acquisition task:
- name: acquire-lease
- name: label
The label is a unique pipeline name which the pipeline runner sets on the pipeline (you could use a git commit hash, or a timestamp). Similarly, the release lease task is set to runAfter everything which touches the database.
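The ordering described here can be sketched as a small pipeline. This assumes the tekton.dev/v1beta1 API. The pipeline name build-and-test, the integration-tests task and the release-lease task name are placeholders for whatever your own pipeline uses.

```yaml
# Sketch only: acquire the lease, run the database-touching work, then release.
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: build-and-test           # placeholder name
spec:
  params:
    - name: label                # unique per run, e.g. a commit hash
      type: string
  tasks:
    - name: acquire-lease
      taskRef:
        name: acquire-lease
      params:
        - name: label
          value: $(params.label)
    - name: integration-tests    # placeholder for any task that touches the database
      taskRef:
        name: integration-tests
      runAfter:
        - acquire-lease
    - name: release-lease
      taskRef:
        name: release-lease
      runAfter:
        - integration-tests      # list every database-touching task here
      params:
        - name: label
          value: $(params.label)
```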
This works fine in the happy path, but what if an intermediate task crashes? Recent builds of Tekton include finally support, which is ideal for sending build status notifications and tidying up resources like leases.
At the end of the pipeline, the pipeline deletes the lease in a finally block. Since this could run after the lease has been released, another build might have already grabbed the lease. Trampling over running leases in a misguided tidy-up attempt would be bad, so it’s critical that each pipeline only deletes its own lease. This is where the pipeline label in the acquire and release tasks comes in. Making sure it’s unique and using that in the selector ensures a pipeline only deletes its own leases:
- name: final-release-lease
- name: label
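A sketch of how the finally section might sit in the pipeline, assuming the tekton.dev/v1beta1 API and a Tekton release that supports finally. It reuses a release task under the placeholder name release-lease, and the pipeline name is also a placeholder.

```yaml
# Sketch only: release the lease in a finally task, whatever happened earlier.
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: build-and-test           # placeholder name
spec:
  params:
    - name: label
      type: string
  tasks:
    - name: acquire-lease
      taskRef:
        name: acquire-lease
      params:
        - name: label
          value: $(params.label)
  finally:
    # Runs after all tasks have finished, including when one of them failed
    - name: final-release-lease
      taskRef:
        name: release-lease      # placeholder: the same task used for the normal release
      params:
        - name: label
          value: $(params.label)
```

Because the release only matches leases carrying this run’s label, it is harmless when the lease has already been released or re-acquired by another build.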
If you’re on a version of Tekton which isn’t new enough to have finally capability, or using one of the downstream products like OpenShift Pipelines, things mostly work; the first build after an integration test failure will hang around for twenty minutes waiting for the timeout on the lease, but subsequent builds will then run normally.