🐛 Wait for disks to be registered before snapshot creation#1673
🐛 Wait for disks to be registered before snapshot creation#1673aruneshpa wants to merge 1 commit into
Conversation
b8a45fc to
9995d6d
Compare
| // created), CNS processing, and error cases — without requiring an API call. | ||
| func hasUnmanagedVolumes(vm *vmopv1.VirtualMachine) bool { | ||
| for i := range vm.Status.Volumes { | ||
| if vm.Status.Volumes[i].Type == vmopv1.VolumeTypeClassic { |
There was a problem hiding this comment.
What happens if we vm.spec.promoteDiskMode is Disabled?
| // VolumeTypeManaged entry for the same disk UUID. This covers the full | ||
| // registration lifecycle — Phase 1 (PVC created, no CRV yet), Phase 2 (CRV | ||
| // created), CNS processing, and error cases — without requiring an API call. | ||
| func hasUnmanagedVolumes(vm *vmopv1.VirtualMachine) bool { |
There was a problem hiding this comment.
nit: I’m surprised we don’t have a public util that does this
| // a delta, FilterOutLinkedClones drops the disk from info.Disks in | ||
| // the register reconciler, which may not have corrected the | ||
| // condition yet on this reconcile). | ||
| waitForRegistration = hasUnmanagedVolumes(vmCtx.VM) |
There was a problem hiding this comment.
wouldn't this cause an issue if we revert FCDs for image disks? or atleast will require a change here too. it just means that the definition of disk promotion is done is in multiple controllers, which i'm not a fan off. what is the issue if the condition lags behind, doesn't this mean it will do the right thing on the next reconcile?
If we invoke VM snapshot API after the CnsRegisterVolume resource for an unmanaged disk has been created, but before CSI has had a chance to register the disk, then CRV will fail in a terminal way because CreateVolume call will refuse to reject a disk that is not the running point. Changes: - Add VirtualMachineSnapshotWaitingForDiskRegistrationReason to v1alpha5/v6 - Fix VirtualMachineUnmanagedVolumesRegistered condition being incorrectly set True when CRVs exist with Registered=false - Return nil (not ErrPendingRegister) from the hasPendingCRVs path so that power state and other operations are not gated on CRV completion - Replace hasPendingCRVsForVM K8s List call with hasUnmanagedVolumes, an in-memory scan of vm.Status.Volumes for Classic / Unmanaged volumes - Fix getVirtualMachineSnapshotsForVM to use server-side VMNameForSnapshotLabel filtering instead of listing all snapshots in the namespace Test changes: - Add VMNameForSnapshotLabel to DummyVirtualMachineSnapshot (webhook sets this in production but tests bypass it) - Add CnsRegisterVolume to KnownObjectTypes(): it has +kubebuilder:subresource:status but was omitted, causing fake-client Status().Update() to be silently ignored fixes vmop-3871. Testing Done: - Unit tests - Ran the failing end to end tests from our reco job a few times to make sure they pass - WIP
9995d6d to
5e9035e
Compare
Minimum allowed line rate is |
| vmCtx, | ||
| &snapshotList, | ||
| ctrlclient.InNamespace(vmCtx.VM.Namespace), | ||
| ctrlclient.MatchingLabels{vmopv1.VMNameForSnapshotLabel: vmCtx.VM.Name}, |
There was a problem hiding this comment.
🟠 High: The switch from manual filtering to a label selector in getVirtualMachineSnapshotsForVM assumes all existing VirtualMachineSnapshot objects have the new label. If existing snapshots are not backfilled with this label, they will be ignored by the controller, causing a regression in snapshot reconciliation.
| return nil | ||
| } | ||
|
|
||
| // When AllDisksArePVCs is enabled, wait for all unmanaged disk registrations |
There was a problem hiding this comment.
🟠 High: Taking the address of a loop variable (snapshotToProcess = &snapshot) is a common bug in Go versions prior to 1.22. In older versions, snapshotToProcess will point to the last element of the slice instead of the first one found, leading to incorrect snapshot processing.
| // When AllDisksArePVCs is enabled, wait for all unmanaged disk registrations | |
| for i := range vmSnapshots { | |
| snapshot := &vmSnapshots[i] | |
| // ... | |
| if snapshotToProcess == nil { | |
| snapshotToProcess = snapshot | |
| } | |
| } |
| // check in the register reconciler keeps the condition False when a CRV | ||
| // exists but its disk has been filtered out by FilterOutLinkedClones (e.g. | ||
| // after a snapshot converted the backing to a delta chain). Checking | ||
| // VolumeTypeClassic entries directly would block snapshots permanently on | ||
| // VMs where promoteDiskMode is disabled, because those disks are always |
There was a problem hiding this comment.
🟡 Medium: In ReconcileCurrentSnapshot, only the oldest pending snapshot is marked as waiting for disk registration. This is inconsistent with ReconcileSnapshotWaitForCRVCondition, which marks all pending snapshots, and may leave subsequent snapshots appearing 'stuck' with empty conditions in the UI.
| // to guard individual operations that care about registration state. | ||
| if hasPendingCRVs { | ||
| pkgcond.MarkFalse( | ||
| vm, |
There was a problem hiding this comment.
🟡 Medium: When registration is pending due to a CnsRegisterVolume error, the VirtualMachine condition is marked False with an empty message. Surfacing the error from the CRV status in the condition would improve observability, especially since manual intervention is required for permanent errors as noted in the code comments.
What does this PR do, and why is it needed?
If we invoke VM snapshot API after the CnsRegisterVolume resource for an unmanaged disk has been created, but before CSI has had a chance to register the disk, then CRV will fail in a terminal way because CreateVolume call will refuse to reject a disk that is not the running point.
This change fixes this behavior by waiting for all disks to be registered before a VM snapshot is triggered. A potential side effect is that a VM snapshot can not be taken if disk registration is broken. But that would have been the same behavior today anyway.
Testing Done:
Please add a release note if necessary: