Skip to content

🐛 Wait for disks to be registered before snapshot creation#1673

Open
aruneshpa wants to merge 1 commit into
vmware-tanzu:mainfrom
aruneshpa:bug/fix-vm-snapshot-wait-for-crv
Open

🐛 Wait for disks to be registered before snapshot creation#1673
aruneshpa wants to merge 1 commit into
vmware-tanzu:mainfrom
aruneshpa:bug/fix-vm-snapshot-wait-for-crv

Conversation

@aruneshpa

@aruneshpa aruneshpa commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do, and why is it needed?

If we invoke VM snapshot API after the CnsRegisterVolume resource for an unmanaged disk has been created, but before CSI has had a chance to register the disk, then CRV will fail in a terminal way because CreateVolume call will refuse to reject a disk that is not the running point.

This change fixes this behavior by waiting for all disks to be registered before a VM snapshot is triggered. A potential side effect is that a VM snapshot can not be taken if disk registration is broken. But that would have been the same behavior today anyway.

Testing Done:

  • Unit tests
  • Ran the failing end to end tests from our reco job a few times to make sure they pass

Please add a release note if necessary:

 Wait for disks to be registered before snapshot creation

@aruneshpa aruneshpa requested review from a team and faisalabujabal as code owners June 18, 2026 22:48
@github-actions github-actions Bot added the size/L Denotes a PR that changes 100-499 lines. label Jun 18, 2026
@aruneshpa aruneshpa force-pushed the bug/fix-vm-snapshot-wait-for-crv branch 2 times, most recently from b8a45fc to 9995d6d Compare June 19, 2026 06:01
// created), CNS processing, and error cases — without requiring an API call.
func hasUnmanagedVolumes(vm *vmopv1.VirtualMachine) bool {
for i := range vm.Status.Volumes {
if vm.Status.Volumes[i].Type == vmopv1.VolumeTypeClassic {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we vm.spec.promoteDiskMode is Disabled?

// VolumeTypeManaged entry for the same disk UUID. This covers the full
// registration lifecycle — Phase 1 (PVC created, no CRV yet), Phase 2 (CRV
// created), CNS processing, and error cases — without requiring an API call.
func hasUnmanagedVolumes(vm *vmopv1.VirtualMachine) bool {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I’m surprised we don’t have a public util that does this

// a delta, FilterOutLinkedClones drops the disk from info.Disks in
// the register reconciler, which may not have corrected the
// condition yet on this reconcile).
waitForRegistration = hasUnmanagedVolumes(vmCtx.VM)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't this cause an issue if we revert FCDs for image disks? or atleast will require a change here too. it just means that the definition of disk promotion is done is in multiple controllers, which i'm not a fan off. what is the issue if the condition lags behind, doesn't this mean it will do the right thing on the next reconcile?

If we invoke VM snapshot API after the CnsRegisterVolume resource for
an unmanaged disk has been created, but before CSI has had a chance to
register the disk, then CRV will fail in a terminal way because
CreateVolume call will refuse to reject a disk that is not the running
point.

Changes:
- Add VirtualMachineSnapshotWaitingForDiskRegistrationReason to v1alpha5/v6
- Fix VirtualMachineUnmanagedVolumesRegistered condition being incorrectly
  set True when CRVs exist with Registered=false
- Return nil (not ErrPendingRegister) from the hasPendingCRVs path so that
  power state and other operations are not gated on CRV completion
- Replace hasPendingCRVsForVM K8s List call with hasUnmanagedVolumes, an
  in-memory scan of vm.Status.Volumes for Classic / Unmanaged volumes
- Fix getVirtualMachineSnapshotsForVM to use server-side VMNameForSnapshotLabel
  filtering instead of listing all snapshots in the namespace

Test changes:
- Add VMNameForSnapshotLabel to DummyVirtualMachineSnapshot (webhook sets
  this in production but tests bypass it)
- Add CnsRegisterVolume to KnownObjectTypes(): it has +kubebuilder:subresource:status
  but was omitted, causing fake-client Status().Update() to be silently ignored

fixes vmop-3871.

Testing Done:
- Unit tests
- Ran the failing end to end tests from our reco job a few times to
make sure they pass
- WIP
@aruneshpa aruneshpa force-pushed the bug/fix-vm-snapshot-wait-for-crv branch from 9995d6d to 5e9035e Compare June 19, 2026 20:20
@github-actions

Copy link
Copy Markdown

Code Coverage

Package Line Rate Health
github.com/vmware-tanzu/vm-operator/controllers/contentlibrary/clustercontentlibraryitem 67%
github.com/vmware-tanzu/vm-operator/controllers/contentlibrary/contentlibraryitem 67%
github.com/vmware-tanzu/vm-operator/controllers/contentlibrary/utils 85%
github.com/vmware-tanzu/vm-operator/controllers/infra/capability/configmap 92%
github.com/vmware-tanzu/vm-operator/controllers/infra/capability/crd 100%
github.com/vmware-tanzu/vm-operator/controllers/infra/configmap 75%
github.com/vmware-tanzu/vm-operator/controllers/infra/node 77%
github.com/vmware-tanzu/vm-operator/controllers/infra/secret 76%
github.com/vmware-tanzu/vm-operator/controllers/infra/validatingwebhookconfiguration 87%
github.com/vmware-tanzu/vm-operator/controllers/infra/zone 73%
github.com/vmware-tanzu/vm-operator/controllers/storage/storageclass 93%
github.com/vmware-tanzu/vm-operator/controllers/storage/storagepolicy 96%
github.com/vmware-tanzu/vm-operator/controllers/storage/storagepolicyquota 91%
github.com/vmware-tanzu/vm-operator/controllers/storage/volumeattributesclass 93%
github.com/vmware-tanzu/vm-operator/controllers/util/encoding 73%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachine/storagepolicyusage 96%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachine/virtualmachine 65%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachine/volume 86%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachine/volumebatch 89%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachineclass 73%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinegroup 89%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinegrouppublishrequest 88%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachineimagecache 89%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinepublishrequest 84%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinereplicaset 67%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachineservice 90%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachineservice/providers 93%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinesetresourcepolicy 81%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinesnapshot 92%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinewebconsolerequest 72%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinewebconsolerequest/v1alpha1 72%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinewebconsolerequest/v1alpha1/conditions 88%
github.com/vmware-tanzu/vm-operator/controllers/virtualmachinewebconsolerequest/v1alpha1/patch 78%
github.com/vmware-tanzu/vm-operator/controllers/vspherepolicy/policyevaluation 85%
github.com/vmware-tanzu/vm-operator/pkg/bitmask 100%
github.com/vmware-tanzu/vm-operator/pkg/builder 89%
github.com/vmware-tanzu/vm-operator/pkg/conditions 90%
github.com/vmware-tanzu/vm-operator/pkg/config 100%
github.com/vmware-tanzu/vm-operator/pkg/config/capabilities 97%
github.com/vmware-tanzu/vm-operator/pkg/config/env 100%
github.com/vmware-tanzu/vm-operator/pkg/context 37%
github.com/vmware-tanzu/vm-operator/pkg/context/generic 100%
github.com/vmware-tanzu/vm-operator/pkg/context/operation 100%
github.com/vmware-tanzu/vm-operator/pkg/crd 77%
github.com/vmware-tanzu/vm-operator/pkg/errors 74%
github.com/vmware-tanzu/vm-operator/pkg/exit 100%
github.com/vmware-tanzu/vm-operator/pkg/log 100%
github.com/vmware-tanzu/vm-operator/pkg/mem 100%
github.com/vmware-tanzu/vm-operator/pkg/patch 78%
github.com/vmware-tanzu/vm-operator/pkg/prober 89%
github.com/vmware-tanzu/vm-operator/pkg/prober/probe 90%
github.com/vmware-tanzu/vm-operator/pkg/prober/worker 77%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere 75%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/clustermodules 73%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/config 88%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/contentlibrary 75%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/credentials 100%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/network 81%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/placement 70%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/session 52%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/storage 44%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/upgrade/virtualmachine 96%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/upgrade/virtualmachine/backfill 96%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/vcenter 80%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/virtualmachine 84%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/virtualmachine/extraconfig 88%
github.com/vmware-tanzu/vm-operator/pkg/providers/vsphere/vmlifecycle 74%
github.com/vmware-tanzu/vm-operator/pkg/record 84%
github.com/vmware-tanzu/vm-operator/pkg/topology 91%
github.com/vmware-tanzu/vm-operator/pkg/util 78%
github.com/vmware-tanzu/vm-operator/pkg/util/cloudinit 89%
github.com/vmware-tanzu/vm-operator/pkg/util/cloudinit/validate 91%
github.com/vmware-tanzu/vm-operator/pkg/util/image 100%
github.com/vmware-tanzu/vm-operator/pkg/util/kube 92%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/cource 100%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/internal 100%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/networksettings 100%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/proxyaddr 73%
github.com/vmware-tanzu/vm-operator/pkg/util/kube/spq 99%
github.com/vmware-tanzu/vm-operator/pkg/util/linuxprep 97%
github.com/vmware-tanzu/vm-operator/pkg/util/netplan 100%
github.com/vmware-tanzu/vm-operator/pkg/util/nil 100%
github.com/vmware-tanzu/vm-operator/pkg/util/ovfcache 75%
github.com/vmware-tanzu/vm-operator/pkg/util/ovfcache/internal 100%
github.com/vmware-tanzu/vm-operator/pkg/util/paused 100%
github.com/vmware-tanzu/vm-operator/pkg/util/ptr 100%
github.com/vmware-tanzu/vm-operator/pkg/util/resize 98%
github.com/vmware-tanzu/vm-operator/pkg/util/sysprep 98%
github.com/vmware-tanzu/vm-operator/pkg/util/vmopv1 85%
github.com/vmware-tanzu/vm-operator/pkg/util/volumes 100%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/client 66%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/datastore 100%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/fault 100%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/library 95%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/storage 82%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/task 100%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/vm 78%
github.com/vmware-tanzu/vm-operator/pkg/util/vsphere/watcher 85%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig 95%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig/anno2extraconfig 100%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig/bootoptions 88%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig/cdrom 88%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig/crypto 92%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig/diskpromo 100%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig/extraconfig 97%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig/policy 97%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig/virtualcontroller 93%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig/volumes/unmanaged/backfill 98%
github.com/vmware-tanzu/vm-operator/pkg/vmconfig/volumes/unmanaged/register 92%
github.com/vmware-tanzu/vm-operator/pkg/webconsolevalidation 100%
github.com/vmware-tanzu/vm-operator/services/vm-watcher 85%
github.com/vmware-tanzu/vm-operator/webhooks/common 98%
github.com/vmware-tanzu/vm-operator/webhooks/persistentvolumeclaim/validation 95%
github.com/vmware-tanzu/vm-operator/webhooks/unifiedstoragequota/validation 89%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachine/mutation 86%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachine/validation 96%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachineclass/mutation 62%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachineclass/validation 89%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinegroup/mutation 87%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinegroup/validation 92%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinegrouppublishrequest/mutation 86%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinegrouppublishrequest/validation 88%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinepublishrequest/validation 90%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinereplicaset/validation 90%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachineservice/mutation 67%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachineservice/validation 92%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinesetresourcepolicy/validation 89%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinesnapshot/mutation 76%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinesnapshot/validation 91%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinewebconsolerequest/v1alpha1/validation 92%
github.com/vmware-tanzu/vm-operator/webhooks/virtualmachinewebconsolerequest/validation 92%
Summary 84% (20241 / 24171)

Minimum allowed line rate is 79%

vmCtx,
&snapshotList,
ctrlclient.InNamespace(vmCtx.VM.Namespace),
ctrlclient.MatchingLabels{vmopv1.VMNameForSnapshotLabel: vmCtx.VM.Name},

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 High: The switch from manual filtering to a label selector in getVirtualMachineSnapshotsForVM assumes all existing VirtualMachineSnapshot objects have the new label. If existing snapshots are not backfilled with this label, they will be ignored by the controller, causing a regression in snapshot reconciliation.

return nil
}

// When AllDisksArePVCs is enabled, wait for all unmanaged disk registrations

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 High: Taking the address of a loop variable (snapshotToProcess = &snapshot) is a common bug in Go versions prior to 1.22. In older versions, snapshotToProcess will point to the last element of the slice instead of the first one found, leading to incorrect snapshot processing.

Suggested change
// When AllDisksArePVCs is enabled, wait for all unmanaged disk registrations
for i := range vmSnapshots {
snapshot := &vmSnapshots[i]
// ...
if snapshotToProcess == nil {
snapshotToProcess = snapshot
}
}

Comment on lines +985 to +989
// check in the register reconciler keeps the condition False when a CRV
// exists but its disk has been filtered out by FilterOutLinkedClones (e.g.
// after a snapshot converted the backing to a delta chain). Checking
// VolumeTypeClassic entries directly would block snapshots permanently on
// VMs where promoteDiskMode is disabled, because those disks are always

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium: In ReconcileCurrentSnapshot, only the oldest pending snapshot is marked as waiting for disk registration. This is inconsistent with ReconcileSnapshotWaitForCRVCondition, which marks all pending snapshots, and may leave subsequent snapshots appearing 'stuck' with empty conditions in the UI.

// to guard individual operations that care about registration state.
if hasPendingCRVs {
pkgcond.MarkFalse(
vm,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium: When registration is pending due to a CnsRegisterVolume error, the VirtualMachine condition is marked False with an empty message. Surfacing the error from the CRV status in the condition would improve observability, especially since manual intervention is required for permanent errors as noted in the code comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-reviewed size/L Denotes a PR that changes 100-499 lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants