Skip to content

New Manifest Checksum Caching Implementation#2375

Closed
lzap wants to merge 8 commits into
osbuild:mainfrom
lzap:checksum-validator-diff1
Closed

New Manifest Checksum Caching Implementation#2375
lzap wants to merge 8 commits into
osbuild:mainfrom
lzap:checksum-validator-diff1

Conversation

@lzap

@lzap lzap commented May 26, 2026

Copy link
Copy Markdown
Contributor

Overview

We have a "fake" manifest checksum implementation that can be fragile sometimes. We always had a plan to use osbuild --inspect to do proper ID resolution. This patch series bring this.

But there is a snag - calling osbuild is costly, about 300ms per call. That is too much, a full manifest generation takes about a minute on my system. Therefore, part of this patch, there is a simple cache implementation. We keep calculating sha sums from the manifest blobs (upgraded from sha1 to sha256) but this is only used as a cache key. This speeds up checksum validation by ~100x (from ~200ms to ~2.5ms per cached manifest).

Manifest checksum format changes, instead of some undefined sha, it is now manifest ID for each pipeline in this format:

d7e0c83abe99e164736d5c4b68dda1442cfcb82779b4dc5174f708fb3cb2594a: build
d0348416e3643ee607fde1d54c51f134ea8a4d45313276ceec701f360a110662: os
50bcbf77d3030db338b2ef32eaedfe0fbee950388884f525842d158ad9baeee0: image
4e5aefe649fd4c051fcbf453d1d9d8b3fc1c68399e90d1c3d21c57e412ac1451: xz

If I understand how osbuild --inspect works, this should give us less conflicts since it will ignore some (source) portions of data and ordering will be always correct. Pipelines can differ depending on the image type of course.

Further optimalizations were done by moving some code from shell script into Go which means that overall, this implementation is about 50% faster when cache is fully pre-populated. Regeneration of roughly 10-30 manifest then gives the same performance as before, which is a good result I think - we get better checksums for no cost other than a small cache.

An extra feature of tracking history of each checksum file (as long as it is in the cache) was added. When you pass an existing file, it will show all available entries in the cache:

lzap@dev:~/images$ go run ./cmd/gen-manifests -history test/data/manifest-checksums/rhel_10.3-aarch64-ec2-empty
2026-05-27 06:24:32 UTC 28e7ea69057c58a9c47d7749056cc46b280ca14745d4475a42abdbf60774707f

When you pass a sha256, it will output the actual output from osbuild --inspect. This way you can always take a quick look on the latest (or any historical version that is in your own cache):

lzap@dev:~/images$ go run ./cmd/gen-manifests -history 28e7ea69057c58a9c47d7749056cc46b280ca14745d4475a42abdbf60774707f
{ ... }

Therefore it should be able to compare history with something like (pseudo command): diff (< gen-manifest -history XXX | jq) (< gen-manifest -history YYY | jq). I know the ideal would be some sort of central database (git repo) of all manifests or have them directly in images but that would be probably a lot of data. The current cache is around 10 MiB and we often rotate it. A central solution is out of scope for this initial refactoring, but I am open to suggestion.

Osbuild version is added to the key calculation and the version itself is cached too because calling osbuild --version itself takes 200ms. This ensures that an upgrade will invalidate cache entries. There is no automatic garbage collection because of the history feature that can be useful so I would rather keep it on my system forever. But if needed, simple deletion will do the trick.

Key Features

Cache System

  • Location: ~/.cache/osbuild-gen-manifests (XDG-compliant)
  • Cache key: SHA256(canonical JSON + osbuild version)
  • Structure: 256 subdirectories (first 2 hex chars) for efficient filesystem performance
  • Automatic invalidation: Cache keys include osbuild version, so upgrades naturally use new cache entries
  • Version cached in ~/.cache/osbuild-gen-manifests/version
  • Timestamp-based invalidation: only calls osbuild --version if binary is newer than cache file
  • Cache inspection: -history flag shows cache entries per checksum file or contents of actual (historical) osbuild output

Nitpicks

  • Default workers: CPU count (was hardcoded to 16 I think)
  • Ability to create SMC label in commit message to skip commit per commit manifest check

GitHub Actions Integration

  • Added official cache action to .github/workflows/validate-checksums.yml
  • Cache key includes osbuild version and workflow file hash
  • Shared cache across PR commits for faster CI runs

Statistics

Displays after checksum generation:

  • Cache hits and misses
  • Average manifest generation time
  • Average osbuild inspect call duration (when cache misses occur)

@lzap lzap added the WIP Work in progress. Don't run Gitlab CI. label May 26, 2026
@lzap lzap force-pushed the checksum-validator-diff1 branch from 8640636 to 0437879 Compare May 27, 2026 06:54
lzap and others added 8 commits May 27, 2026 09:48
Checksums use SHA256 for better collision resistance compared to SHA1.

Also, generate checksums from canonical JSON to ensure order-independent
results. This makes checksums deterministic regardless of map key
ordering in the input data.

The implementation round-trips JSON through the 'any' type, leveraging
Go's built-in map key sorting in the encoding/json package. This
eliminates the need for custom MarshalJSON implementations across
multiple types.

Additionally, this optimization skips pretty-printing in checksums-only
mode, and move the actual sha generation from external processes into
the native Go binary, improving performance:

- Before: 10.7s (temp manifest files written to disk)
- After: 5.3s (checksums calculated on-the-fly in memory)
- Improvement: 2.0x faster
Generate manifest checksums using osbuild --inspect output for semantic
validation. The checksum file contains the last stage ID and pipeline
name for each pipeline in the manifest.

Checksum file format:
  <stage-id>: <pipeline-name>
  <stage-id>: <pipeline-name>
  ...

If osbuild --inspect fails (e.g., invalid manifests), manifest
generation fails - no empty checksum files are created.

Changes:
- Fix OSBuildInspect to pass "-" argument for stdin input
- Add better error reporting with stderr capture
- Add simplified structs for parsing inspect output
- generateOSBuildInspectOutput: runs osbuild inspect, returns error on failure
- Fail manifest generation if osbuild inspect fails

Example output (valid manifest):
  e5ffa2f65c897e8d2b918d7d14f8aaae0c3a784a97c140e602347382ef5514fa: build
  1b3a9b712a0098860b8d49447eac53b3b5e97e9a28beae700505f8381351fbf6: image
  e3e9465d743c6d87830d5b8984d3321ebf7fcaf71bc68f7f0f67c8199de71afc: qcow2

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Store osbuild inspect results in cache to avoid re-running osbuild
for unchanged manifests. The cache key is SHA256(canonical JSON +
osbuild version), which ensures different osbuild versions naturally
use separate cache entries without needing explicit invalidation.

Cache structure:
- Location: ~/.cache/osbuild-gen-manifests (or /var/cache for root)
- Subdirectories: First 2 characters of cache key (256 buckets)
- Content: Raw osbuild inspect JSON output
- History: cache_root/hist/<manifest-name> tracks cache keys per manifest

The manifest checksum files contain only the osbuild inspect output
(pipeline stage IDs), with no SHA256 prefix.

Features:
1. Cache inspection via -history flag:
   - Manifest mode: Shows all cache keys for a manifest
   - Cache key mode: Shows cached osbuild inspect output for a key

2. Performance optimizations:
   - Osbuild version retrieved once at startup (global variable)
   - Osbuild version cached to file with timestamp checking
   - Eliminates thousands of subprocess calls to "osbuild --version"

Error handling:
- generateOSBuildInspectOutput() now returns (string, error) instead
  of silently returning empty strings on failures
- Manifest generation fails immediately if osbuild --inspect encounters
  invalid manifests, preventing empty checksum files

Performance improvements:
- Before caching: ~40 seconds for full checksum generation
- After caching: ~4-7 seconds
- Speedup: 6-10x faster depending on cache state

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

gen-manifests: fix linter issues

Fix gosec G306 warnings by using 0600 permissions for cache files
instead of 0644. Cache files don't need to be world-readable.

Fix staticcheck QF1001 by applying De Morgan's law to hex character
validation logic.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

gen-manifests: extract cache functions to cache.go

Move all cache-related functions from main.go to a new cache.go file:
- initCacheDir, getOSBuildVersion, getCachePath
- readCache, writeCache
- appendHistory, readHistory
- generateCacheKey

Also extract SHA256 validation logic into isSHA256Hex() utility
function to simplify the main() function.

This reduces main.go from 1250 to 1044 lines (-206 lines).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Speed up checksum validation by caching the osbuild inspect results
in GitHub Actions cache. The cache is keyed by:
- OS (runner.os)
- Workflow file hash (invalidates on workflow changes)
- osbuild version (different versions use separate caches)

This significantly reduces the time to validate checksums in PRs since
osbuild --inspect results are reused across runs for unchanged manifests.

The restore-keys allow falling back to caches from the same OS/workflow
but potentially different osbuild versions, which is better than no cache.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add statistics tracking to provide visibility into cache effectiveness
and performance characteristics:

- Track cache hits and misses for osbuild inspect calls
- Track average manifest generation time
- Track average osbuild inspect call duration
- Print statistics summary at end of checksums-only mode

Statistics use thread-safe counters with nanosecond precision timing.
Example output:
  Cache hits: 4, cache misses: 0
  Avg. manifest generation: 2.58 ms

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
When the last commit in a PR contains 'SMC' (Squashed Manifest Commit)
in the subject line, the workflow runs the full commit-by-commit
validation but skips the exit 1 on failure.

Implementation:
- Detects SMC marker in last commit message (git log -1 --pretty=%s)
- Sets smc_mode variable if marker found
- Always runs: git rebase --exec ./tools/gen-manifest-checksums.sh
- Shows full diagnostic output on checksum differences
- Only executes "exit 1" when smc_mode is empty (line 95-97)

This is useful for large patch series where manifest regeneration is
done in a final squashed commit to avoid rebasing complexity and
reduce CI runtime. The validation still runs to show differences for
review, but doesn't fail the workflow.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Regenerate all manifest checksums with new osbuild inspect format.
Checksum files now contain pipeline stage IDs instead of SHA256 hashes,
providing semantic validation of manifest structure.

This commit contains the [SMC] marker (Squashed Manifest Commit) to
skip commit-by-commit validation in CI and only validate HEAD.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@lzap lzap force-pushed the checksum-validator-diff1 branch from 0437879 to c7680a2 Compare May 27, 2026 07:49
@lzap lzap marked this pull request as ready for review May 27, 2026 07:49
@lzap lzap requested a review from a team as a code owner May 27, 2026 07:49
@lzap lzap requested review from brlane-rht, supakeen and thozza May 27, 2026 07:49
@thozza thozza requested a review from achilleas-k May 27, 2026 11:05

@achilleas-k achilleas-k left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nice work and ideas overall, but there's a lot of side-features of unclear value.

Also, the description in one of the commits (gen-manifests: add caching for osbuild inspect results) seems to be a squashed version of multiple commits. Can you clear that up, please?

Comment thread cmd/gen-manifests/main.go
Comment on lines +425 to +453
if checksumsOnlyMode {
// Round-trip through any to ensure all map keys are sorted
// This makes checksums independent of key order in the input
b, err := json.Marshal(data)
if err != nil {
return fmt.Errorf("failed to marshal data for %q: %w", filename, err)
}

var canonical any
if err := json.Unmarshal(b, &canonical); err != nil {
return fmt.Errorf("failed to unmarshal for canonicalization: %w", err)
}

canonicalBytes, err := json.Marshal(canonical)
if err != nil {
return fmt.Errorf("failed to marshal canonical form: %w", err)
}

h := sha256.New()
h.Write(canonicalBytes)
checksum := hex.EncodeToString(h.Sum(nil))

// Store checksum with filename (without .json extension)
checksumKey := strings.TrimSuffix(filename, ".json")
checksumsMu.Lock()
checksums[checksumKey] = checksum
checksumsMu.Unlock()
return nil
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little confusing that this block is in the save() function, which normally writes the manifests to file, but instead of writing the files it collects the checksums to write them later. What's the reason for this difference in the flow of the two cases? If we write the files immediately here, we wouldn't need the map (or it's mutex).

Comment on lines +50 to +68
// Get osbuild binary modification time
binaryInfo, err := os.Stat(osbuildPath)
if err != nil {
return "", fmt.Errorf("failed to stat osbuild binary: %w", err)
}
binaryModTime := binaryInfo.ModTime()

// Check cached version file
versionPath := filepath.Join(cacheDir, "version")
if versionInfo, err := os.Stat(versionPath); err == nil {
// Version file exists, check if it's newer than the binary
if versionInfo.ModTime().After(binaryModTime) {
// Cache is newer, use it
cachedVersion, err := os.ReadFile(versionPath)
if err == nil && len(cachedVersion) > 0 {
return strings.TrimSpace(string(cachedVersion)), nil
}
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the value in comparing the mod times? I would assume a single call to osbuild --version for each run of gen-manifests wouldn't be too much of an issue and would be perfectly reliable.
Consider, for example, that on ostree / bootc based systems, the mod time of the binary is always epoch.

Comment thread cmd/gen-manifests/main.go
Comment on lines +483 to +484
// Cache the raw JSON output for next time
_ = writeCache(cacheKey, string(inspectedJSON))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we ignore the error because we don't want a cache write failure to be fatal? Could we print it as a warning to avoid confusion when things don't work as expected?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's a bit strange that in all of this effort to optimise, we write back the manifest to the cache even when in cases where we just had a cache hit a few lines before.

Comment on lines +168 to +189
// readHistory reads all cache keys from the history file for a manifest
// Returns cache keys (SHA256 of canonical JSON + osbuild version)
func readHistory(checksumKey string) ([]string, error) {
histPath := filepath.Join(cacheDir, "hist", checksumKey)
data, err := os.ReadFile(histPath)
if err != nil {
if os.IsNotExist(err) {
return []string{}, nil
}
return nil, fmt.Errorf("failed to read history file: %w", err)
}

var sums []string
for _, line := range strings.Split(string(data), "\n") {
line = strings.TrimSpace(line)
if line != "" {
sums = append(sums, line)
}
}

return sums, nil
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the value of the history function? I can sort of imagine it being useful to check how a manifest changed during the development of a branch, but just having a history of checksums without them being associated with something like a commit ID wouldn't tell us much.

Comment on lines +68 to +74
# Check if last commit has [SMC] marker (Squashed Manifest Commit)
last_commit_msg=$(git log -1 --pretty=%s)
smc_mode=""
if echo "$last_commit_msg" | grep -q "SMC"; then
smc_mode="true"
echo "Last commit contains 'SMC' marker - validation will not fail on checksum differences"
fi

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I like this. We either want manifest checksums changing with the commit that modified them or we don't. I don't see a good reason to have a special case to skip the check.

The commit message says

This is useful for large patch series where manifest regeneration is
done in a final squashed commit to avoid rebasing complexity and
reduce CI runtime.

But a large patch series is exactly the scenario where I would want to see how manifests change at every step.

@thozza thozza left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this, but I have a few basic questions and a concern that is a blocker from my PoV.

Overview

We have a "fake" manifest checksum implementation that can be fragile sometimes. We always had a plan to use osbuild --inspect to do proper ID resolution. This patch series bring this.

Can you be specific on what was fragile? The justification is a bit too hand-wavy and jumps to fixing a problem that is not even clearly defined here. Moreover, the solution is quite complex and convoluted for what should be a straightforward checksum mechanism -- it's becoming a project of its own (caching system with 256-bucket subdirectories, cache invalidation via binary mod-time comparison, history tracking, statistics, SMC markers, GitHub Actions cache integration, etc.). Each layer exists to support the previous one: osbuild --inspect is slow, so we need caching, which needs invalidation, which needs version caching, and so on.

Additionally, can you point to the source of "We always had a plan to use osbuild --inspect"? Because the comment in the script you're modifying clearly states otherwise:

# NOTE: 'osbuild --inspect' is generally a better way to calculate a manifest
# fingerprint, because it ignores things like pipeline names, source URLs, and
# generally things that don't affect the build output.
# For mocked manifests though we want those things to be visible changes, so we
# calculate the checksum of the file directly. Also it's faster.

This comment is still present in the updated script, but the script now contradicts it by using --checksums-only (which runs osbuild --inspect) instead of sha1sum of the file.

The blocker

The switch from sha1sum of the full manifest JSON to osbuild --inspect stage IDs breaks detection of repo snapshot updates. Stage IDs only depend on stage definitions (type, options, input references), not on the sources section where download URLs live.

At the same time, the repo pseudo-packages in manifestmock.Depsolve() explicitly strip the snapshot date from the URL before computing their checksum, so a date-only change (e.g., -20240819 -> -20250201) doesn't produce a different pseudo-package checksum either. The result is that both detection paths are closed -- source URLs are ignored by osbuild --inspect, and the date is stripped before it can reach the stage inputs.

This means updating repo snapshots in test/data/repositories/ produces identical checksum files. CI sees no difference, no rebuild is triggered, and regressions from new RPM package versions in the updated snapshot go undetected. We use pinned repo snapshots specifically to retest images against new package versions, so this silently defeats that workflow.

Note that the date stripping in manifestmock was fine before this change, because the old sha1sum approach still caught date changes through the regular package RemoteLocations in the curl sources (those include the full date-stamped URL). The date stripping only becomes a problem in combination with the switch to osbuild --inspect, which ignores sources entirely.

A straightforward fix would be to stop stripping the date suffix from repo pseudo-package URLs in manifestmock.Depsolve(). That way the snapshot date becomes part of the pseudo-package checksum, flows into RPM stage input references, and naturally changes the osbuild --inspect stage IDs when snapshots are updated.

Even with the blocker addressed, I'm not convinced we need all of this added infrastructure and features (caching, history, statistics, SMC markers) to solve a problem that hasn't been clearly defined yet. I'd prefer to start with a concrete description of what's broken today and the minimal change that fixes it.

@achilleas-k

achilleas-k commented May 27, 2026

Copy link
Copy Markdown
Member

I agree with the whole comment above but especially with this part:

Even with the blocker addressed, I'm not convinced we need all of this added infrastructure and features (caching, history, statistics, SMC markers) to solve a problem that hasn't been clearly defined yet. I'd prefer to start with a concrete description of what's broken today and the minimal change that fixes it.

I will say though that I like the idea of pipeline IDs as manifest checksums. It gives a bit more granularity to the change detection without going all the way towards having whole manifests checked in1. But I do agree that we need to consider what exactly we want the change detection to detect. I think for the purposes of this repo, things that do not affect the manifest logically probably do matter. For example, if we change the name of the OS pipeline, we would want to see a list of manifests affected by the change, which manifest IDs wouldn't show. So if we go this way, perhaps we should consider having both.

Footnotes

  1. Having whole manifests checked in to the repo (mocked or otherwise) would be the best solution in terms of change detection and visibility, it would possibly remove a lot of sources of conflict, but it is the worst solution in terms of noise, large diffs, and raw repo data. See this old commit as an example.

@lzap

lzap commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Good points, I somewhat took for granted that osbuild --inspect is the best possible outcome. I am gonna close this but one small patch from here is useful - moving away from forking shasum process to simply calculating this in the process - this speeds up generation by 50 %. Also I could perhaps upgrade from sha1 to sha256, or take it the opposite direction and just do CRC64? All we need to know is something has changed.

@achilleas-k

Copy link
Copy Markdown
Member

moving away from forking shasum process to simply calculating this in the process - this speeds up generation by 50 %. Also I could perhaps upgrade from sha1 to sha256, or take it the opposite direction and just do CRC64? All we need to know is something has changed.

Yeah, that seems like a good change.
Any reason to prefer CRC64 over sha256? I have no preference, it's just that we use sha256 generally (in osbuild, containers, etc).

@thozza

thozza commented May 27, 2026

Copy link
Copy Markdown
Member

My preference is to use sha256, so that we do not risk the unlikely event of changing the manifest and ending up with unchanged checksum...

@lzap

lzap commented May 28, 2026

Copy link
Copy Markdown
Contributor Author

Closing, I extracted the commits that were just making calculation faster: #2382

Re: sha256 vs sha1 - I am not doing this in the linked PR because I want to keep things simple and clean there, but I have an idea which I will share in the opened issue.

@lzap lzap closed this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

WIP Work in progress. Don't run Gitlab CI.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants