New Manifest Checksum Caching Implementation#2375
Conversation
8640636 to
0437879
Compare
Checksums use SHA256 for better collision resistance compared to SHA1. Also, generate checksums from canonical JSON to ensure order-independent results. This makes checksums deterministic regardless of map key ordering in the input data. The implementation round-trips JSON through the 'any' type, leveraging Go's built-in map key sorting in the encoding/json package. This eliminates the need for custom MarshalJSON implementations across multiple types. Additionally, this optimization skips pretty-printing in checksums-only mode, and move the actual sha generation from external processes into the native Go binary, improving performance: - Before: 10.7s (temp manifest files written to disk) - After: 5.3s (checksums calculated on-the-fly in memory) - Improvement: 2.0x faster
Generate manifest checksums using osbuild --inspect output for semantic validation. The checksum file contains the last stage ID and pipeline name for each pipeline in the manifest. Checksum file format: <stage-id>: <pipeline-name> <stage-id>: <pipeline-name> ... If osbuild --inspect fails (e.g., invalid manifests), manifest generation fails - no empty checksum files are created. Changes: - Fix OSBuildInspect to pass "-" argument for stdin input - Add better error reporting with stderr capture - Add simplified structs for parsing inspect output - generateOSBuildInspectOutput: runs osbuild inspect, returns error on failure - Fail manifest generation if osbuild inspect fails Example output (valid manifest): e5ffa2f65c897e8d2b918d7d14f8aaae0c3a784a97c140e602347382ef5514fa: build 1b3a9b712a0098860b8d49447eac53b3b5e97e9a28beae700505f8381351fbf6: image e3e9465d743c6d87830d5b8984d3321ebf7fcaf71bc68f7f0f67c8199de71afc: qcow2 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Store osbuild inspect results in cache to avoid re-running osbuild for unchanged manifests. The cache key is SHA256(canonical JSON + osbuild version), which ensures different osbuild versions naturally use separate cache entries without needing explicit invalidation. Cache structure: - Location: ~/.cache/osbuild-gen-manifests (or /var/cache for root) - Subdirectories: First 2 characters of cache key (256 buckets) - Content: Raw osbuild inspect JSON output - History: cache_root/hist/<manifest-name> tracks cache keys per manifest The manifest checksum files contain only the osbuild inspect output (pipeline stage IDs), with no SHA256 prefix. Features: 1. Cache inspection via -history flag: - Manifest mode: Shows all cache keys for a manifest - Cache key mode: Shows cached osbuild inspect output for a key 2. Performance optimizations: - Osbuild version retrieved once at startup (global variable) - Osbuild version cached to file with timestamp checking - Eliminates thousands of subprocess calls to "osbuild --version" Error handling: - generateOSBuildInspectOutput() now returns (string, error) instead of silently returning empty strings on failures - Manifest generation fails immediately if osbuild --inspect encounters invalid manifests, preventing empty checksum files Performance improvements: - Before caching: ~40 seconds for full checksum generation - After caching: ~4-7 seconds - Speedup: 6-10x faster depending on cache state Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> gen-manifests: fix linter issues Fix gosec G306 warnings by using 0600 permissions for cache files instead of 0644. Cache files don't need to be world-readable. Fix staticcheck QF1001 by applying De Morgan's law to hex character validation logic. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> gen-manifests: extract cache functions to cache.go Move all cache-related functions from main.go to a new cache.go file: - initCacheDir, getOSBuildVersion, getCachePath - readCache, writeCache - appendHistory, readHistory - generateCacheKey Also extract SHA256 validation logic into isSHA256Hex() utility function to simplify the main() function. This reduces main.go from 1250 to 1044 lines (-206 lines). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Speed up checksum validation by caching the osbuild inspect results in GitHub Actions cache. The cache is keyed by: - OS (runner.os) - Workflow file hash (invalidates on workflow changes) - osbuild version (different versions use separate caches) This significantly reduces the time to validate checksums in PRs since osbuild --inspect results are reused across runs for unchanged manifests. The restore-keys allow falling back to caches from the same OS/workflow but potentially different osbuild versions, which is better than no cache. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add statistics tracking to provide visibility into cache effectiveness and performance characteristics: - Track cache hits and misses for osbuild inspect calls - Track average manifest generation time - Track average osbuild inspect call duration - Print statistics summary at end of checksums-only mode Statistics use thread-safe counters with nanosecond precision timing. Example output: Cache hits: 4, cache misses: 0 Avg. manifest generation: 2.58 ms Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
When the last commit in a PR contains 'SMC' (Squashed Manifest Commit) in the subject line, the workflow runs the full commit-by-commit validation but skips the exit 1 on failure. Implementation: - Detects SMC marker in last commit message (git log -1 --pretty=%s) - Sets smc_mode variable if marker found - Always runs: git rebase --exec ./tools/gen-manifest-checksums.sh - Shows full diagnostic output on checksum differences - Only executes "exit 1" when smc_mode is empty (line 95-97) This is useful for large patch series where manifest regeneration is done in a final squashed commit to avoid rebasing complexity and reduce CI runtime. The validation still runs to show differences for review, but doesn't fail the workflow. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Regenerate all manifest checksums with new osbuild inspect format. Checksum files now contain pipeline stage IDs instead of SHA256 hashes, providing semantic validation of manifest structure. This commit contains the [SMC] marker (Squashed Manifest Commit) to skip commit-by-commit validation in CI and only validate HEAD. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
0437879 to
c7680a2
Compare
achilleas-k
left a comment
There was a problem hiding this comment.
Some nice work and ideas overall, but there's a lot of side-features of unclear value.
Also, the description in one of the commits (gen-manifests: add caching for osbuild inspect results) seems to be a squashed version of multiple commits. Can you clear that up, please?
| if checksumsOnlyMode { | ||
| // Round-trip through any to ensure all map keys are sorted | ||
| // This makes checksums independent of key order in the input | ||
| b, err := json.Marshal(data) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to marshal data for %q: %w", filename, err) | ||
| } | ||
|
|
||
| var canonical any | ||
| if err := json.Unmarshal(b, &canonical); err != nil { | ||
| return fmt.Errorf("failed to unmarshal for canonicalization: %w", err) | ||
| } | ||
|
|
||
| canonicalBytes, err := json.Marshal(canonical) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to marshal canonical form: %w", err) | ||
| } | ||
|
|
||
| h := sha256.New() | ||
| h.Write(canonicalBytes) | ||
| checksum := hex.EncodeToString(h.Sum(nil)) | ||
|
|
||
| // Store checksum with filename (without .json extension) | ||
| checksumKey := strings.TrimSuffix(filename, ".json") | ||
| checksumsMu.Lock() | ||
| checksums[checksumKey] = checksum | ||
| checksumsMu.Unlock() | ||
| return nil | ||
| } |
There was a problem hiding this comment.
It's a little confusing that this block is in the save() function, which normally writes the manifests to file, but instead of writing the files it collects the checksums to write them later. What's the reason for this difference in the flow of the two cases? If we write the files immediately here, we wouldn't need the map (or it's mutex).
| // Get osbuild binary modification time | ||
| binaryInfo, err := os.Stat(osbuildPath) | ||
| if err != nil { | ||
| return "", fmt.Errorf("failed to stat osbuild binary: %w", err) | ||
| } | ||
| binaryModTime := binaryInfo.ModTime() | ||
|
|
||
| // Check cached version file | ||
| versionPath := filepath.Join(cacheDir, "version") | ||
| if versionInfo, err := os.Stat(versionPath); err == nil { | ||
| // Version file exists, check if it's newer than the binary | ||
| if versionInfo.ModTime().After(binaryModTime) { | ||
| // Cache is newer, use it | ||
| cachedVersion, err := os.ReadFile(versionPath) | ||
| if err == nil && len(cachedVersion) > 0 { | ||
| return strings.TrimSpace(string(cachedVersion)), nil | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
What's the value in comparing the mod times? I would assume a single call to osbuild --version for each run of gen-manifests wouldn't be too much of an issue and would be perfectly reliable.
Consider, for example, that on ostree / bootc based systems, the mod time of the binary is always epoch.
| // Cache the raw JSON output for next time | ||
| _ = writeCache(cacheKey, string(inspectedJSON)) |
There was a problem hiding this comment.
I guess we ignore the error because we don't want a cache write failure to be fatal? Could we print it as a warning to avoid confusion when things don't work as expected?
There was a problem hiding this comment.
Also, it's a bit strange that in all of this effort to optimise, we write back the manifest to the cache even when in cases where we just had a cache hit a few lines before.
| // readHistory reads all cache keys from the history file for a manifest | ||
| // Returns cache keys (SHA256 of canonical JSON + osbuild version) | ||
| func readHistory(checksumKey string) ([]string, error) { | ||
| histPath := filepath.Join(cacheDir, "hist", checksumKey) | ||
| data, err := os.ReadFile(histPath) | ||
| if err != nil { | ||
| if os.IsNotExist(err) { | ||
| return []string{}, nil | ||
| } | ||
| return nil, fmt.Errorf("failed to read history file: %w", err) | ||
| } | ||
|
|
||
| var sums []string | ||
| for _, line := range strings.Split(string(data), "\n") { | ||
| line = strings.TrimSpace(line) | ||
| if line != "" { | ||
| sums = append(sums, line) | ||
| } | ||
| } | ||
|
|
||
| return sums, nil | ||
| } |
There was a problem hiding this comment.
What's the value of the history function? I can sort of imagine it being useful to check how a manifest changed during the development of a branch, but just having a history of checksums without them being associated with something like a commit ID wouldn't tell us much.
| # Check if last commit has [SMC] marker (Squashed Manifest Commit) | ||
| last_commit_msg=$(git log -1 --pretty=%s) | ||
| smc_mode="" | ||
| if echo "$last_commit_msg" | grep -q "SMC"; then | ||
| smc_mode="true" | ||
| echo "Last commit contains 'SMC' marker - validation will not fail on checksum differences" | ||
| fi |
There was a problem hiding this comment.
Not sure I like this. We either want manifest checksums changing with the commit that modified them or we don't. I don't see a good reason to have a special case to skip the check.
The commit message says
This is useful for large patch series where manifest regeneration is
done in a final squashed commit to avoid rebasing complexity and
reduce CI runtime.
But a large patch series is exactly the scenario where I would want to see how manifests change at every step.
thozza
left a comment
There was a problem hiding this comment.
Thanks for looking into this, but I have a few basic questions and a concern that is a blocker from my PoV.
Overview
We have a "fake" manifest checksum implementation that can be fragile sometimes. We always had a plan to use
osbuild --inspectto do proper ID resolution. This patch series bring this.
Can you be specific on what was fragile? The justification is a bit too hand-wavy and jumps to fixing a problem that is not even clearly defined here. Moreover, the solution is quite complex and convoluted for what should be a straightforward checksum mechanism -- it's becoming a project of its own (caching system with 256-bucket subdirectories, cache invalidation via binary mod-time comparison, history tracking, statistics, SMC markers, GitHub Actions cache integration, etc.). Each layer exists to support the previous one: osbuild --inspect is slow, so we need caching, which needs invalidation, which needs version caching, and so on.
Additionally, can you point to the source of "We always had a plan to use osbuild --inspect"? Because the comment in the script you're modifying clearly states otherwise:
# NOTE: 'osbuild --inspect' is generally a better way to calculate a manifest
# fingerprint, because it ignores things like pipeline names, source URLs, and
# generally things that don't affect the build output.
# For mocked manifests though we want those things to be visible changes, so we
# calculate the checksum of the file directly. Also it's faster.
This comment is still present in the updated script, but the script now contradicts it by using --checksums-only (which runs osbuild --inspect) instead of sha1sum of the file.
The blocker
The switch from sha1sum of the full manifest JSON to osbuild --inspect stage IDs breaks detection of repo snapshot updates. Stage IDs only depend on stage definitions (type, options, input references), not on the sources section where download URLs live.
At the same time, the repo pseudo-packages in manifestmock.Depsolve() explicitly strip the snapshot date from the URL before computing their checksum, so a date-only change (e.g., -20240819 -> -20250201) doesn't produce a different pseudo-package checksum either. The result is that both detection paths are closed -- source URLs are ignored by osbuild --inspect, and the date is stripped before it can reach the stage inputs.
This means updating repo snapshots in test/data/repositories/ produces identical checksum files. CI sees no difference, no rebuild is triggered, and regressions from new RPM package versions in the updated snapshot go undetected. We use pinned repo snapshots specifically to retest images against new package versions, so this silently defeats that workflow.
Note that the date stripping in manifestmock was fine before this change, because the old sha1sum approach still caught date changes through the regular package RemoteLocations in the curl sources (those include the full date-stamped URL). The date stripping only becomes a problem in combination with the switch to osbuild --inspect, which ignores sources entirely.
A straightforward fix would be to stop stripping the date suffix from repo pseudo-package URLs in manifestmock.Depsolve(). That way the snapshot date becomes part of the pseudo-package checksum, flows into RPM stage input references, and naturally changes the osbuild --inspect stage IDs when snapshots are updated.
Even with the blocker addressed, I'm not convinced we need all of this added infrastructure and features (caching, history, statistics, SMC markers) to solve a problem that hasn't been clearly defined yet. I'd prefer to start with a concrete description of what's broken today and the minimal change that fixes it.
|
I agree with the whole comment above but especially with this part:
I will say though that I like the idea of pipeline IDs as manifest checksums. It gives a bit more granularity to the change detection without going all the way towards having whole manifests checked in1. But I do agree that we need to consider what exactly we want the change detection to detect. I think for the purposes of this repo, things that do not affect the manifest logically probably do matter. For example, if we change the name of the OS pipeline, we would want to see a list of manifests affected by the change, which manifest IDs wouldn't show. So if we go this way, perhaps we should consider having both. Footnotes
|
|
Good points, I somewhat took for granted that |
Yeah, that seems like a good change. |
|
My preference is to use sha256, so that we do not risk the unlikely event of changing the manifest and ending up with unchanged checksum... |
|
Closing, I extracted the commits that were just making calculation faster: #2382 Re: sha256 vs sha1 - I am not doing this in the linked PR because I want to keep things simple and clean there, but I have an idea which I will share in the opened issue. |
Overview
We have a "fake" manifest checksum implementation that can be fragile sometimes. We always had a plan to use
osbuild --inspectto do proper ID resolution. This patch series bring this.But there is a snag - calling
osbuildis costly, about 300ms per call. That is too much, a full manifest generation takes about a minute on my system. Therefore, part of this patch, there is a simple cache implementation. We keep calculating sha sums from the manifest blobs (upgraded from sha1 to sha256) but this is only used as a cache key. This speeds up checksum validation by ~100x (from ~200ms to ~2.5ms per cached manifest).Manifest checksum format changes, instead of some undefined sha, it is now manifest ID for each pipeline in this format:
If I understand how
osbuild --inspectworks, this should give us less conflicts since it will ignore some (source) portions of data and ordering will be always correct. Pipelines can differ depending on the image type of course.Further optimalizations were done by moving some code from shell script into Go which means that overall, this implementation is about 50% faster when cache is fully pre-populated. Regeneration of roughly 10-30 manifest then gives the same performance as before, which is a good result I think - we get better checksums for no cost other than a small cache.
An extra feature of tracking history of each checksum file (as long as it is in the cache) was added. When you pass an existing file, it will show all available entries in the cache:
When you pass a sha256, it will output the actual output from
osbuild --inspect. This way you can always take a quick look on the latest (or any historical version that is in your own cache):Therefore it should be able to compare history with something like (pseudo command):
diff (< gen-manifest -history XXX | jq) (< gen-manifest -history YYY | jq). I know the ideal would be some sort of central database (git repo) of all manifests or have them directly inimagesbut that would be probably a lot of data. The current cache is around 10 MiB and we often rotate it. A central solution is out of scope for this initial refactoring, but I am open to suggestion.Osbuild version is added to the key calculation and the version itself is cached too because calling
osbuild --versionitself takes 200ms. This ensures that an upgrade will invalidate cache entries. There is no automatic garbage collection because of the history feature that can be useful so I would rather keep it on my system forever. But if needed, simple deletion will do the trick.Key Features
Cache System
~/.cache/osbuild-gen-manifests(XDG-compliant)SHA256(canonical JSON + osbuild version)~/.cache/osbuild-gen-manifests/versionosbuild --versionif binary is newer than cache file-historyflag shows cache entries per checksum file or contents of actual (historical) osbuild outputNitpicks
GitHub Actions Integration
.github/workflows/validate-checksums.ymlStatistics
Displays after checksum generation: