New Manifest Checksum Caching Implementation by lzap · Pull Request #2375 · osbuild/image-builder

lzap · 2026-05-26T20:14:00Z

Overview

We have a "fake" manifest checksum implementation that can be fragile sometimes. We always had a plan to use osbuild --inspect to do proper ID resolution. This patch series bring this.

But there is a snag - calling osbuild is costly, about 300ms per call. That is too much, a full manifest generation takes about a minute on my system. Therefore, part of this patch, there is a simple cache implementation. We keep calculating sha sums from the manifest blobs (upgraded from sha1 to sha256) but this is only used as a cache key. This speeds up checksum validation by ~100x (from ~200ms to ~2.5ms per cached manifest).

Manifest checksum format changes, instead of some undefined sha, it is now manifest ID for each pipeline in this format:

d7e0c83abe99e164736d5c4b68dda1442cfcb82779b4dc5174f708fb3cb2594a: build
d0348416e3643ee607fde1d54c51f134ea8a4d45313276ceec701f360a110662: os
50bcbf77d3030db338b2ef32eaedfe0fbee950388884f525842d158ad9baeee0: image
4e5aefe649fd4c051fcbf453d1d9d8b3fc1c68399e90d1c3d21c57e412ac1451: xz

If I understand how osbuild --inspect works, this should give us less conflicts since it will ignore some (source) portions of data and ordering will be always correct. Pipelines can differ depending on the image type of course.

Further optimalizations were done by moving some code from shell script into Go which means that overall, this implementation is about 50% faster when cache is fully pre-populated. Regeneration of roughly 10-30 manifest then gives the same performance as before, which is a good result I think - we get better checksums for no cost other than a small cache.

An extra feature of tracking history of each checksum file (as long as it is in the cache) was added. When you pass an existing file, it will show all available entries in the cache:

lzap@dev:~/images$ go run ./cmd/gen-manifests -history test/data/manifest-checksums/rhel_10.3-aarch64-ec2-empty
2026-05-27 06:24:32 UTC 28e7ea69057c58a9c47d7749056cc46b280ca14745d4475a42abdbf60774707f

When you pass a sha256, it will output the actual output from osbuild --inspect. This way you can always take a quick look on the latest (or any historical version that is in your own cache):

lzap@dev:~/images$ go run ./cmd/gen-manifests -history 28e7ea69057c58a9c47d7749056cc46b280ca14745d4475a42abdbf60774707f
{ ... }

Therefore it should be able to compare history with something like (pseudo command): diff (< gen-manifest -history XXX | jq) (< gen-manifest -history YYY | jq). I know the ideal would be some sort of central database (git repo) of all manifests or have them directly in images but that would be probably a lot of data. The current cache is around 10 MiB and we often rotate it. A central solution is out of scope for this initial refactoring, but I am open to suggestion.

Osbuild version is added to the key calculation and the version itself is cached too because calling osbuild --version itself takes 200ms. This ensures that an upgrade will invalidate cache entries. There is no automatic garbage collection because of the history feature that can be useful so I would rather keep it on my system forever. But if needed, simple deletion will do the trick.

Key Features

Cache System

Location: ~/.cache/osbuild-gen-manifests (XDG-compliant)
Cache key: SHA256(canonical JSON + osbuild version)
Structure: 256 subdirectories (first 2 hex chars) for efficient filesystem performance
Automatic invalidation: Cache keys include osbuild version, so upgrades naturally use new cache entries
Version cached in ~/.cache/osbuild-gen-manifests/version
Timestamp-based invalidation: only calls osbuild --version if binary is newer than cache file
Cache inspection: -history flag shows cache entries per checksum file or contents of actual (historical) osbuild output

Nitpicks

Default workers: CPU count (was hardcoded to 16 I think)
Ability to create SMC label in commit message to skip commit per commit manifest check

GitHub Actions Integration

Added official cache action to .github/workflows/validate-checksums.yml
Cache key includes osbuild version and workflow file hash
Shared cache across PR commits for faster CI runs

Statistics

Displays after checksum generation:

Cache hits and misses
Average manifest generation time
Average osbuild inspect call duration (when cache misses occur)

Checksums use SHA256 for better collision resistance compared to SHA1. Also, generate checksums from canonical JSON to ensure order-independent results. This makes checksums deterministic regardless of map key ordering in the input data. The implementation round-trips JSON through the 'any' type, leveraging Go's built-in map key sorting in the encoding/json package. This eliminates the need for custom MarshalJSON implementations across multiple types. Additionally, this optimization skips pretty-printing in checksums-only mode, and move the actual sha generation from external processes into the native Go binary, improving performance: - Before: 10.7s (temp manifest files written to disk) - After: 5.3s (checksums calculated on-the-fly in memory) - Improvement: 2.0x faster

Generate manifest checksums using osbuild --inspect output for semantic validation. The checksum file contains the last stage ID and pipeline name for each pipeline in the manifest. Checksum file format: <stage-id>: <pipeline-name> <stage-id>: <pipeline-name> ... If osbuild --inspect fails (e.g., invalid manifests), manifest generation fails - no empty checksum files are created. Changes: - Fix OSBuildInspect to pass "-" argument for stdin input - Add better error reporting with stderr capture - Add simplified structs for parsing inspect output - generateOSBuildInspectOutput: runs osbuild inspect, returns error on failure - Fail manifest generation if osbuild inspect fails Example output (valid manifest): e5ffa2f65c897e8d2b918d7d14f8aaae0c3a784a97c140e602347382ef5514fa: build 1b3a9b712a0098860b8d49447eac53b3b5e97e9a28beae700505f8381351fbf6: image e3e9465d743c6d87830d5b8984d3321ebf7fcaf71bc68f7f0f67c8199de71afc: qcow2 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Store osbuild inspect results in cache to avoid re-running osbuild for unchanged manifests. The cache key is SHA256(canonical JSON + osbuild version), which ensures different osbuild versions naturally use separate cache entries without needing explicit invalidation. Cache structure: - Location: ~/.cache/osbuild-gen-manifests (or /var/cache for root) - Subdirectories: First 2 characters of cache key (256 buckets) - Content: Raw osbuild inspect JSON output - History: cache_root/hist/<manifest-name> tracks cache keys per manifest The manifest checksum files contain only the osbuild inspect output (pipeline stage IDs), with no SHA256 prefix. Features: 1. Cache inspection via -history flag: - Manifest mode: Shows all cache keys for a manifest - Cache key mode: Shows cached osbuild inspect output for a key 2. Performance optimizations: - Osbuild version retrieved once at startup (global variable) - Osbuild version cached to file with timestamp checking - Eliminates thousands of subprocess calls to "osbuild --version" Error handling: - generateOSBuildInspectOutput() now returns (string, error) instead of silently returning empty strings on failures - Manifest generation fails immediately if osbuild --inspect encounters invalid manifests, preventing empty checksum files Performance improvements: - Before caching: ~40 seconds for full checksum generation - After caching: ~4-7 seconds - Speedup: 6-10x faster depending on cache state Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> gen-manifests: fix linter issues Fix gosec G306 warnings by using 0600 permissions for cache files instead of 0644. Cache files don't need to be world-readable. Fix staticcheck QF1001 by applying De Morgan's law to hex character validation logic. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> gen-manifests: extract cache functions to cache.go Move all cache-related functions from main.go to a new cache.go file: - initCacheDir, getOSBuildVersion, getCachePath - readCache, writeCache - appendHistory, readHistory - generateCacheKey Also extract SHA256 validation logic into isSHA256Hex() utility function to simplify the main() function. This reduces main.go from 1250 to 1044 lines (-206 lines). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Speed up checksum validation by caching the osbuild inspect results in GitHub Actions cache. The cache is keyed by: - OS (runner.os) - Workflow file hash (invalidates on workflow changes) - osbuild version (different versions use separate caches) This significantly reduces the time to validate checksums in PRs since osbuild --inspect results are reused across runs for unchanged manifests. The restore-keys allow falling back to caches from the same OS/workflow but potentially different osbuild versions, which is better than no cache. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add statistics tracking to provide visibility into cache effectiveness and performance characteristics: - Track cache hits and misses for osbuild inspect calls - Track average manifest generation time - Track average osbuild inspect call duration - Print statistics summary at end of checksums-only mode Statistics use thread-safe counters with nanosecond precision timing. Example output: Cache hits: 4, cache misses: 0 Avg. manifest generation: 2.58 ms Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

When the last commit in a PR contains 'SMC' (Squashed Manifest Commit) in the subject line, the workflow runs the full commit-by-commit validation but skips the exit 1 on failure. Implementation: - Detects SMC marker in last commit message (git log -1 --pretty=%s) - Sets smc_mode variable if marker found - Always runs: git rebase --exec ./tools/gen-manifest-checksums.sh - Shows full diagnostic output on checksum differences - Only executes "exit 1" when smc_mode is empty (line 95-97) This is useful for large patch series where manifest regeneration is done in a final squashed commit to avoid rebasing complexity and reduce CI runtime. The validation still runs to show differences for review, but doesn't fail the workflow. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Regenerate all manifest checksums with new osbuild inspect format. Checksum files now contain pipeline stage IDs instead of SHA256 hashes, providing semantic validation of manifest structure. This commit contains the [SMC] marker (Squashed Manifest Commit) to skip commit-by-commit validation in CI and only validate HEAD. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

achilleas-k

Some nice work and ideas overall, but there's a lot of side-features of unclear value.

Also, the description in one of the commits (gen-manifests: add caching for osbuild inspect results) seems to be a squashed version of multiple commits. Can you clear that up, please?

achilleas-k · 2026-05-27T10:59:26Z

+	if checksumsOnlyMode {
+		// Round-trip through any to ensure all map keys are sorted
+		// This makes checksums independent of key order in the input
+		b, err := json.Marshal(data)
+		if err != nil {
+			return fmt.Errorf("failed to marshal data for %q: %w", filename, err)
+		}
+
+		var canonical any
+		if err := json.Unmarshal(b, &canonical); err != nil {
+			return fmt.Errorf("failed to unmarshal for canonicalization: %w", err)
+		}
+
+		canonicalBytes, err := json.Marshal(canonical)
+		if err != nil {
+			return fmt.Errorf("failed to marshal canonical form: %w", err)
+		}
+
+		h := sha256.New()
+		h.Write(canonicalBytes)
+		checksum := hex.EncodeToString(h.Sum(nil))
+
+		// Store checksum with filename (without .json extension)
+		checksumKey := strings.TrimSuffix(filename, ".json")
+		checksumsMu.Lock()
+		checksums[checksumKey] = checksum
+		checksumsMu.Unlock()
+		return nil
+	}


It's a little confusing that this block is in the save() function, which normally writes the manifests to file, but instead of writing the files it collects the checksums to write them later. What's the reason for this difference in the flow of the two cases? If we write the files immediately here, we wouldn't need the map (or it's mutex).

achilleas-k · 2026-05-27T11:12:17Z

+	// Get osbuild binary modification time
+	binaryInfo, err := os.Stat(osbuildPath)
+	if err != nil {
+		return "", fmt.Errorf("failed to stat osbuild binary: %w", err)
+	}
+	binaryModTime := binaryInfo.ModTime()
+
+	// Check cached version file
+	versionPath := filepath.Join(cacheDir, "version")
+	if versionInfo, err := os.Stat(versionPath); err == nil {
+		// Version file exists, check if it's newer than the binary
+		if versionInfo.ModTime().After(binaryModTime) {
+			// Cache is newer, use it
+			cachedVersion, err := os.ReadFile(versionPath)
+			if err == nil && len(cachedVersion) > 0 {
+				return strings.TrimSpace(string(cachedVersion)), nil
+			}
+		}
+	}


What's the value in comparing the mod times? I would assume a single call to osbuild --version for each run of gen-manifests wouldn't be too much of an issue and would be perfectly reliable.
Consider, for example, that on ostree / bootc based systems, the mod time of the binary is always epoch.

achilleas-k · 2026-05-27T11:15:42Z

+		// Cache the raw JSON output for next time
+		_ = writeCache(cacheKey, string(inspectedJSON))


I guess we ignore the error because we don't want a cache write failure to be fatal? Could we print it as a warning to avoid confusion when things don't work as expected?

Also, it's a bit strange that in all of this effort to optimise, we write back the manifest to the cache even when in cases where we just had a cache hit a few lines before.

achilleas-k · 2026-05-27T11:20:56Z

+// readHistory reads all cache keys from the history file for a manifest
+// Returns cache keys (SHA256 of canonical JSON + osbuild version)
+func readHistory(checksumKey string) ([]string, error) {
+	histPath := filepath.Join(cacheDir, "hist", checksumKey)
+	data, err := os.ReadFile(histPath)
+	if err != nil {
+		if os.IsNotExist(err) {
+			return []string{}, nil
+		}
+		return nil, fmt.Errorf("failed to read history file: %w", err)
+	}
+
+	var sums []string
+	for _, line := range strings.Split(string(data), "\n") {
+		line = strings.TrimSpace(line)
+		if line != "" {
+			sums = append(sums, line)
+		}
+	}
+
+	return sums, nil
+}


What's the value of the history function? I can sort of imagine it being useful to check how a manifest changed during the development of a branch, but just having a history of checksums without them being associated with something like a commit ID wouldn't tell us much.

achilleas-k · 2026-05-27T11:24:21Z

+          # Check if last commit has [SMC] marker (Squashed Manifest Commit)
+          last_commit_msg=$(git log -1 --pretty=%s)
+          smc_mode=""
+          if echo "$last_commit_msg" | grep -q "SMC"; then
+            smc_mode="true"
+            echo "Last commit contains 'SMC' marker - validation will not fail on checksum differences"
+          fi


Not sure I like this. We either want manifest checksums changing with the commit that modified them or we don't. I don't see a good reason to have a special case to skip the check.

The commit message says

This is useful for large patch series where manifest regeneration is
done in a final squashed commit to avoid rebasing complexity and
reduce CI runtime.

But a large patch series is exactly the scenario where I would want to see how manifests change at every step.

thozza

Thanks for looking into this, but I have a few basic questions and a concern that is a blocker from my PoV.

Overview

We have a "fake" manifest checksum implementation that can be fragile sometimes. We always had a plan to use osbuild --inspect to do proper ID resolution. This patch series bring this.

Can you be specific on what was fragile? The justification is a bit too hand-wavy and jumps to fixing a problem that is not even clearly defined here. Moreover, the solution is quite complex and convoluted for what should be a straightforward checksum mechanism -- it's becoming a project of its own (caching system with 256-bucket subdirectories, cache invalidation via binary mod-time comparison, history tracking, statistics, SMC markers, GitHub Actions cache integration, etc.). Each layer exists to support the previous one: osbuild --inspect is slow, so we need caching, which needs invalidation, which needs version caching, and so on.

Additionally, can you point to the source of "We always had a plan to use osbuild --inspect"? Because the comment in the script you're modifying clearly states otherwise:

# NOTE: 'osbuild --inspect' is generally a better way to calculate a manifest
# fingerprint, because it ignores things like pipeline names, source URLs, and
# generally things that don't affect the build output.
# For mocked manifests though we want those things to be visible changes, so we
# calculate the checksum of the file directly. Also it's faster.

This comment is still present in the updated script, but the script now contradicts it by using --checksums-only (which runs osbuild --inspect) instead of sha1sum of the file.

The blocker

The switch from sha1sum of the full manifest JSON to osbuild --inspect stage IDs breaks detection of repo snapshot updates. Stage IDs only depend on stage definitions (type, options, input references), not on the sources section where download URLs live.

At the same time, the repo pseudo-packages in manifestmock.Depsolve() explicitly strip the snapshot date from the URL before computing their checksum, so a date-only change (e.g., -20240819 -> -20250201) doesn't produce a different pseudo-package checksum either. The result is that both detection paths are closed -- source URLs are ignored by osbuild --inspect, and the date is stripped before it can reach the stage inputs.

This means updating repo snapshots in test/data/repositories/ produces identical checksum files. CI sees no difference, no rebuild is triggered, and regressions from new RPM package versions in the updated snapshot go undetected. We use pinned repo snapshots specifically to retest images against new package versions, so this silently defeats that workflow.

Note that the date stripping in manifestmock was fine before this change, because the old sha1sum approach still caught date changes through the regular package RemoteLocations in the curl sources (those include the full date-stamped URL). The date stripping only becomes a problem in combination with the switch to osbuild --inspect, which ignores sources entirely.

A straightforward fix would be to stop stripping the date suffix from repo pseudo-package URLs in manifestmock.Depsolve(). That way the snapshot date becomes part of the pseudo-package checksum, flows into RPM stage input references, and naturally changes the osbuild --inspect stage IDs when snapshots are updated.

Even with the blocker addressed, I'm not convinced we need all of this added infrastructure and features (caching, history, statistics, SMC markers) to solve a problem that hasn't been clearly defined yet. I'd prefer to start with a concrete description of what's broken today and the minimal change that fixes it.

achilleas-k · 2026-05-27T11:51:42Z

I agree with the whole comment above but especially with this part:

Even with the blocker addressed, I'm not convinced we need all of this added infrastructure and features (caching, history, statistics, SMC markers) to solve a problem that hasn't been clearly defined yet. I'd prefer to start with a concrete description of what's broken today and the minimal change that fixes it.

I will say though that I like the idea of pipeline IDs as manifest checksums. It gives a bit more granularity to the change detection without going all the way towards having whole manifests checked in¹. But I do agree that we need to consider what exactly we want the change detection to detect. I think for the purposes of this repo, things that do not affect the manifest logically probably do matter. For example, if we change the name of the OS pipeline, we would want to see a list of manifests affected by the change, which manifest IDs wouldn't show. So if we go this way, perhaps we should consider having both.

Having whole manifests checked in to the repo (mocked or otherwise) would be the best solution in terms of change detection and visibility, it would possibly remove a lot of sources of conflict, but it is the worst solution in terms of noise, large diffs, and raw repo data. See this old commit as an example. ↩

lzap · 2026-05-27T14:04:40Z

Good points, I somewhat took for granted that osbuild --inspect is the best possible outcome. I am gonna close this but one small patch from here is useful - moving away from forking shasum process to simply calculating this in the process - this speeds up generation by 50 %. Also I could perhaps upgrade from sha1 to sha256, or take it the opposite direction and just do CRC64? All we need to know is something has changed.

achilleas-k · 2026-05-27T17:40:48Z

moving away from forking shasum process to simply calculating this in the process - this speeds up generation by 50 %. Also I could perhaps upgrade from sha1 to sha256, or take it the opposite direction and just do CRC64? All we need to know is something has changed.

Yeah, that seems like a good change.
Any reason to prefer CRC64 over sha256? I have no preference, it's just that we use sha256 generally (in osbuild, containers, etc).

thozza · 2026-05-27T18:00:29Z

My preference is to use sha256, so that we do not risk the unlikely event of changing the manifest and ending up with unchanged checksum...

lzap · 2026-05-28T14:20:02Z

Closing, I extracted the commits that were just making calculation faster: #2382

Re: sha256 vs sha1 - I am not doing this in the linked PR because I want to keep things simple and clean there, but I have an idea which I will share in the opened issue.

lzap added the WIP Work in progress. Don't run Gitlab CI. label May 26, 2026

lzap mentioned this pull request May 26, 2026

Rethink the manifest checksum check #1566

Open

lzap force-pushed the checksum-validator-diff1 branch from 8640636 to 0437879 Compare May 27, 2026 06:54

lzap and others added 8 commits May 27, 2026 09:48

cicd: show diff on validate checksum error

97d12df

lzap force-pushed the checksum-validator-diff1 branch from 0437879 to c7680a2 Compare May 27, 2026 07:49

lzap marked this pull request as ready for review May 27, 2026 07:49

lzap requested a review from a team as a code owner May 27, 2026 07:49

lzap requested review from brlane-rht, supakeen and thozza May 27, 2026 07:49

thozza requested a review from achilleas-k May 27, 2026 11:05

achilleas-k reviewed May 27, 2026

View reviewed changes

thozza requested changes May 27, 2026

View reviewed changes

lzap closed this May 28, 2026

		// Cache the raw JSON output for next time
		_ = writeCache(cacheKey, string(inspectedJSON))

Conversation

lzap commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Key Features

Cache System

Nitpicks

GitHub Actions Integration

Statistics

Uh oh!

achilleas-k left a comment

Choose a reason for hiding this comment

Uh oh!

achilleas-k May 27, 2026

Choose a reason for hiding this comment

Uh oh!

achilleas-k May 27, 2026

Choose a reason for hiding this comment

Uh oh!

achilleas-k May 27, 2026

Choose a reason for hiding this comment

Uh oh!

achilleas-k May 27, 2026

Choose a reason for hiding this comment

Uh oh!

achilleas-k May 27, 2026

Choose a reason for hiding this comment

Uh oh!

achilleas-k May 27, 2026

Choose a reason for hiding this comment

Uh oh!

thozza left a comment

Choose a reason for hiding this comment

Overview

The blocker

Uh oh!

achilleas-k commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Footnotes

Uh oh!

lzap commented May 27, 2026

Uh oh!

achilleas-k commented May 27, 2026

Uh oh!

thozza commented May 27, 2026

Uh oh!

lzap commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lzap commented May 26, 2026 •

edited

Loading

achilleas-k commented May 27, 2026 •

edited

Loading