February 1, 2024

It’s coming up on a year since I switched to Hugo. During that time, the most annoying thing about it has been waiting for builds. Sure, I have it all neatly packaged and automated via GitLab Auto DevOps, but it’s just plain slow.

It takes between three and five minutes to run a build depending on what’s going on in my cluster at the time. This may not sound like much, but that’s a lot of minutes wasted waiting for a review environment to become available so I can proofread a post. Especially if I have to make a lot of incremental corrections.

It needs to be sped up…

Why Is It Slow?

Hugo itself is blazingly fast when I use the local server, so why is the build so abominably slow in the pipeline? There are actually two main issues involved:

  • Hugo has no build cache available to it. As a result, it has to re-process both CSS and all the images for every build, which takes time – and it will only get worse as the amount of content grows.

  • I’m building into a container, and docker-in-docker builds are annoyingly slow in general.

I already have most of the code quality checks disabled – they’re not needed in my environment – so those aren’t an issue. No, just those two things are slowing the build down something fierce.

Fixing Hugo

The obvious first task is to fix the Hugo build speed. Historically, Hugo takes somewhere in the neighborhood of a minute to generate the content it needs for the site; this includes a lot of image resizing and CSS processing. All we need to do is preserve Hugo’s cache data so it doesn’t have to redo that work every time, and we instantly shave a minute or more off the build.

Fortunately, GitLab supplies a native mechanism: it allows you to define caches for your build jobs.
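
In its simplest form, that's just a cache block on the job. A generic sketch (not my actual configuration, which comes later) looks something like this:

some-job:            # placeholder job name
  cache:
    key: my-cache-key
    paths:
      - generated/   # any directory you want carried between runs

GitLab saves the listed paths when the job finishes and restores them the next time the job runs.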

Unfortunately, this isn’t going to work with my existing build model.

Fixing The Build Paradigm

Currently I’m using docker to do the entire build. That means that I use a two-stage Dockerfile: the first stage uses a custom image to run hugo and build the site, and the second stage builds the actual nginx container that gets deployed; it just packages up the content from the first stage.
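
Roughly speaking, that Dockerfile was shaped like this (the image names and paths here are placeholders rather than my real ones):

# Stage one: run hugo in a custom build image to generate the site
FROM example.registry/images/hugo:latest AS builder
WORKDIR /site
COPY . .
RUN hugo --minify

# Stage two: package the generated content into the nginx image that gets deployed
FROM nginx:stable-alpine
COPY --from=builder /site/public /usr/share/nginx/html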

That was fairly handy early on because it allowed Auto DevOps to work out of the box. Unfortunately, the multi-stage docker build isn’t going to work if we want to preserve the build cache. In order to do that, GitLab needs to have visibility into the state of things after the build completes.

And that’s gone after the build container terminates, which means that GitLab has no access to it.

As a result, we’re going to have to split the hugo build out into its own pipeline stage. And that leads to…

Customizing GitLab Auto DevOps

There is no “easy” way to add stages to GitLab Auto DevOps. I thought about doing an entirely custom pipeline, but the truth is that the Auto DevOps features are rather nice; chief among them, it handles everything from review environments to production deployments automagically.

The only way to do this is to create a .gitlab-ci.yml file and import the Auto DevOps pieces you still want to use:

include:
  - template: Auto-DevOps.gitlab-ci.yml

stages:
  - hugo
  - build
  - test
  - deploy  # dummy stage to follow the template guidelines
  - review
  - dast
  - staging
  - canary
  - production
  - incremental rollout 10%
  - incremental rollout 25%
  - incremental rollout 50%
  - incremental rollout 100%
  - performance
  - cleanup

The stages were copied directly from the original template. Note that I added a new stage there: the hugo stage. That will allow me to “pre-build” the Hugo content so that the container can slurp it in:

hugo:
  image: gitlab.s.gtu.floating.io:5000/floating.io/images/hugo:0-122-0
  stage: hugo
  script:
    - "hugo --minify --logLevel info"
  artifacts:
    expire_in: 1 day
    paths:
      - public/
  rules:
    - if: '$BUILD_DISABLED'
      when: never
    - if: '$AUTO_DEVOPS_PLATFORM_TARGET == "EC2"'
      when: never
    - if: '$CI_COMMIT_TAG || $CI_COMMIT_BRANCH'

The rules were stolen from the original build job template and simply ensure that it works correctly with the rest of Auto DevOps. The rest is fairly straightforward: build the thing, and mark public/* as a collection of build artifacts that we can pass on to the build stage.

Of course, for this to work properly, we need to make the build stage dependent on the hugo stage so it will pull those artifacts back in when it runs:

build:
  dependencies:
    - hugo

From there we replace our Dockerfile with a single-stage version that just builds the container and packages up the appropriate content, and we’re done. Now Auto DevOps will run our build in two pipeline stages instead of one.
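
Something along these lines does the job (again, a sketch rather than my exact file); the public/ directory is the artifact handed over from the hugo stage:

# The hugo stage already generated public/ and passed it along as an
# artifact, so all this image has to do is serve it.
FROM nginx:stable-alpine
COPY public/ /usr/share/nginx/html/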

It’s a start, anyway.

Making Hugo Cache Stuff

This is, relatively speaking, the easy part. In a properly configured environment with shared runner caches, all you need to do is add the cache configuration to the hugo stage:

hugo:
  # ...
  cache:
    key:
      prefix: hugo-gen
      files:
        - config.yaml
    paths:
      - resources
    unprotect: true
  script:
    - "hugo --minify --logLevel info --cacheDir $CI_PROJECT_DIR/resources"
  # ...

This tells GitLab to use a cache key of hugo-gen-${hash-of(config.yaml)}, and to use the same cache for both protected and unprotected branches (unprotect: true). The cache data lives in the resources directory, so that’s the one that needs to be preserved.

To ensure that all of Hugo’s cache data is preserved for the future, I also updated the command to specify the cache directory as shown above, and then modified hugo’s config.yaml to add this:

caches:
  assets:
    dir: :cacheDir/_gen
    maxAge: -1
  images:
    dir: :cacheDir/_gen
    maxAge: -1

This wasn’t necessary, strictly speaking, but it future-proofs things a little. Really, it’s a result of all the gymnastics I went through in arriving at this solution.

Of course, this all assumes a correctly-configured environment with shared runner caches, which I did not happen to have. I had to install a MinIO instance (which is a topic for another day), and then reconfigure my GitLab runners to use it. From the values I pass to helm when I install the runner:

[[runners]]
  [runners.kubernetes]
    namespace = "{{.Release.Namespace}}"
    image = "alpine"
    privileged = true
  [runners.cache]
    Type = "s3"
    Path = "/"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "minio-gitlab-cache.ci.floating.io"
      AccessKey = "gitlab-cache"
      SecretKey = "gitlab-cache"
      BucketName = "gitlab-cache"
      BucketLocation = "gtu"

And with that, everything Just Worked. Suddenly, all my builds are a minute shorter!

Interlude: The Score So Far

So what have we accomplished with this? Well, we’ve certainly made it a little faster. If I cherry-pick a random build from a week ago, we get timings that look something like this:

Stage              Before   After
hugo               -        30s
build              1m48s    39s
secret_detection   14s      19s
review             43s      47s
Total Runtime      2m47s    2m16s

So we’ve only saved about thirty seconds overall. Strange, right? If we look at just the build stages, we get 1m48s before versus 1m9s (hugo plus build) after. But if caching shaved a minute off the Hugo generation itself, where did the rest of the savings go?

Put simply, it’s the multiple stages combined with docker-in-docker.

First, for each stage that executes, GitLab has to spin up a new runner instance and make things go. That means downloading docker images and so forth. It all takes time. The fewer stages you have, the faster your build will run.

Second, docker-in-docker is notoriously slow. When you use this, the build system spins up a service (a.k.a. another container in the runner pod) that runs a docker daemon. Then it has to wait for that daemon to be ready to serve traffic before it can even start the build.

Yeah, slow.
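
For context, a docker-in-docker build job is roughly shaped like this; Auto DevOps wires up the equivalent for you through its build template, and the versions and TLS settings here are illustrative:

build:
  image: docker:24.0.5
  services:
    - docker:24.0.5-dind   # a second container running a full docker daemon
  variables:
    DOCKER_HOST: tcp://docker:2375
    DOCKER_TLS_CERTDIR: ""   # TLS disabled to keep the example short
  script:
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" .

Every run pays for pulling the docker image, starting that service, and waiting for the daemon to answer before a single layer gets built.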

Time to get rid of docker-in-docker, and consolidate back into a single build stage…

Using buildah

There are a number of newer methods of building containers, and many of them focus on not requiring an actual docker daemon to do so. If we can use one of those, we can eliminate a lot of that overhead; if we also consolidate everything back into a single build stage, we get a much faster build.

For this, I’ve chosen to use buildah. It’s a daemonless build tool that is almost a drop-in replacement for docker.
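
In practice the commands map almost one-to-one; for example, with a placeholder image name:

# docker needs a running daemon for these:
docker build -t myimage:latest .
docker push myimage:latest

# buildah does the same work with no daemon at all
# ("bud" is short for build-using-dockerfile):
buildah bud -t myimage:latest .
buildah push myimage:latest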

The first step was to rebuild my hugo build image to add buildah. My Dockerfile for that looks something like this:

FROM rockylinux:8-minimal AS source
COPY fio-image-sources.repo /tmp
RUN mv /tmp/fio-image-sources.repo /etc/yum.repos.d                  && \
    microdnf module enable nodejs:18                                 && \
    microdnf install -y git npm                                      && \
    microdnf install -y buildah                                      && \
    microdnf clean all                                               && \
    npm install --omit=dev -g postcss postcss-cli autoprefixer
    
# Do this as a separate layer for cached build purposes, not that it
# will matter in the end...
RUN microdnf install -y hugo-0.122.0 && microdnf clean all

This provides a build image with everything I need to build the site and package it up with buildah. Why do I use rocky as a base? I honestly can’t remember. At some point I might switch it over to alpine to gain that little bit of extra speed, but for now this will do.

For those wondering, yes, I bothered to package Hugo as an RPM. I use it in more than just docker containers, and that makes it easy to keep up to date in the rest of my environment.

And with that built and tagged, my .gitlab-ci.yml becomes simultaneously simpler and more complex, because I’m just going to override the default build job.

The new file looks like this:

include:
  - template: Auto-DevOps.gitlab-ci.yml

build:
  image: gitlab.s.gtu.floating.io:5000/floating.io/images/hugo:0-122-0-a
  stage: build
  cache:
    key:
      prefix: hugo-gen
      files:
        - config.yaml
    paths:
      - resources
    unprotect: true

  # Disable docker dind service; we don't need it.
  services: []

  before_script:
    - buildah login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY

  script:
    - "hugo --minify --logLevel info --cacheDir $CI_PROJECT_DIR/resources"
    - |
      if [[ -z "$CI_COMMIT_TAG" ]]; then
        export CI_APPLICATION_REPOSITORY=${CI_APPLICATION_REPOSITORY:-$CI_REGISTRY_IMAGE/$CI_COMMIT_REF_SLUG}
        export CI_APPLICATION_TAG=${CI_APPLICATION_TAG:-$CI_COMMIT_SHA}
      else
        export CI_APPLICATION_REPOSITORY=${CI_APPLICATION_REPOSITORY:-$CI_REGISTRY_IMAGE}
        export CI_APPLICATION_TAG=${CI_APPLICATION_TAG:-$CI_COMMIT_TAG}
      fi
      export image_previous="$CI_APPLICATION_REPOSITORY:$CI_COMMIT_BEFORE_SHA"
      export image_tagged="$CI_APPLICATION_REPOSITORY:$CI_APPLICATION_TAG"
      export image_latest="$CI_APPLICATION_REPOSITORY:latest"
      echo "Getting cache data..."
      buildah pull --quiet "$image_previous" 2>/dev/null || \
        buildah pull --quiet "$image_latest" 2>/dev/null
      echo "Building..."
      buildah bud --pull=true -t "$image_tagged" -t "$image_latest" .
      echo "Pushing container and tags..."
      buildah push "$image_tagged"
      buildah push "$image_latest"      

Largely, the script bits were interpolated from what exists in the auto build container (the source of which can be found here) and the build job template. Most of that is just working out the right image names and tags to use when pulling, pushing, and tagging. The rest is just a fairly standard buildah build.

And with that, we have a working build process with buildah, and docker-in-docker is nowhere to be found!

The Final Countdown

So where did that leave us? With a much faster build process! The table now looks like this:

Stage              Original   2-Stage   Buildah
hugo               -          30s       -
build              1m48s      39s       43s
secret_detection   14s        19s       18s
review             43s        47s       46s
Total Runtime      2m47s      2m16s     1m49s

Isn’t that a big difference? We saved just shy of a minute off the original time, and another half minute compared to the two-stage version that only cached the hugo build. I suspect that there are other gains to be made, but this is still a pretty huge win.

I doubt there is much more to be gained in this process without substantial work, though. The simple truth is that the majority of the remaining time is in the deployment phase. That’s always going to take time; not only does it use helm for that process (which isn’t the fastest thing in the world), but Kubernetes is going to have to pull the new image, spin up pods, and so on.

For now, I’m happy. The faster this is, the less frustrating it is to work on blog posts. Maybe someday I’ll try to make it even faster, but I think I’ve hit the point of diminishing returns.

As an aside, I really wish the GitLab people would fix this bug. I ended up having to tail my runner logs with kubectl quite frequently during this process because of it…