Validate legacy IVFFlat per-list codes size against ids size#5259
Validate legacy IVFFlat per-list codes size against ids size#5259palios-taey wants to merge 1 commit into
Conversation
The legacy IvFl/IvFL on-disk inverted-list formats serialize per-list
ids and codes as two independent READVECTOR calls. The reader never
cross-checks codes[i].size() against ids[i].size() * code_size before
storing them, so a malformed file can produce a list where the two
disagree.
This matters at the CPU->GPU clone path: GpuIndexIVFFlat::copyFrom ->
IVFBase::copyInvertedListsFrom -> IVFBase::addEncodedVectorsToList_
sizes its memcpy from list_size (which derives from ids.size()) and
copies that many bytes out of the codes buffer. If codes[i] is shorter
than ids.size() * code_size, the memcpy reads past the heap allocation.
ASAN repro with a malformed legacy IVFFlat file (d=128, nlist=1,
ids[0].size()=1, codes[0].size()=511) prior to this change:
ERROR: AddressSanitizer: heap-buffer-overflow
READ of size 512 at 0x... thread T0
#0 memcpy
facebookresearch#1 faiss::gpu::IVFBase::addEncodedVectorsToList_
facebookresearch#2 faiss::gpu::IVFBase::copyInvertedListsFrom
facebookresearch#3 faiss::gpu::GpuIndexIVFFlat::copyFrom
facebookresearch#4 faiss::gpu::ToGpuCloner::clone_Index
facebookresearch#5 faiss::gpu::index_cpu_to_gpu
0x... is located 0 bytes after 511-byte region
The same shape is already validated in the current 'ilar'/IwFl
InvertedLists reader (see read_InvertedLists_up) via the
mul_no_overflow + FAISS_THROW_IF_NOT_FMT pattern. This patch brings
the legacy IvFl/IvFL branches in read_index_up to parity with that
validation. After the patch, the same malformed file produces a clean
exception at parse time:
Legacy IVFFlat inverted list 0: codes size 511 bytes does not match
ids size 1 * code_size 512 = 512 bytes
No interface change. No on-disk format change. Well-formed legacy
files (where codes.size() == ids.size() * code_size, which the writer
preserves) are unaffected.
Verified on Spark 4 (NVIDIA GB10, driver 580.95.05, CUDA 13.0,
aarch64) with -fsanitize=address,undefined -fno-omit-frame-pointer
-g -O1 against faiss::read_index() + faiss::gpu::index_cpu_to_gpu()
on the canonical PoC and six adjacent variants spanning codes sizes
64..511 bytes (all sizes < ids.size() * code_size = 512). Pre-patch:
all 7 fire heap-buffer-overflow READ. Post-patch: all 7 fire the
above FAISS_THROW_IF_NOT_FMT at parse time and the GPU clone path is
never reached.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hi @palios-taey! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
Summary
The legacy
IvFl/IvFLon-disk inverted-list formats inread_index_upserialize per-listidsandcodesas two independentREADVECTORcalls with no cross-check between their lengths. A malformed file can produce a list wherecodes[i].size()does not equalids[i].size() * code_size.This matters at the CPU → GPU clone path:
GpuIndexIVFFlat::copyFrom→IVFBase::copyInvertedListsFrom→IVFBase::addEncodedVectorsToList_derives itsmemcpylength fromlist_size(which derives fromids.size()) and copies that many bytes out of the codes buffer. Whencodes[i]is shorter thanids[i].size() * code_size, thememcpyreads past the heap allocation.The current
ilar/IwFlreader (read_InvertedLists_up) already validates this invariant via themul_no_overflow+FAISS_THROW_IF_NOT_FMTpattern. This PR brings the legacyIvFl/IvFLbranches inread_index_upto parity with that validation.Reproduction (pre-patch)
A malformed legacy IVFFlat file with
d=128,nlist=1,ids[0].size()=1,codes[0].size()=511bytes driven through the public API pathfaiss::read_index()→faiss::gpu::index_cpu_to_gpu()produces (host AddressSanitizer):The read is 512 bytes from a 511-byte heap allocation — boundary-minimal 1-byte over-read. The same shape fires deterministically across seeds with codes sizes 64..511 bytes (per-seed mapping available on request).
After the patch
The same file now produces a clean throw at parse time, before the GPU clone path is reached:
The
memcpysite inIVFBase::addEncodedVectorsToList_is never reached.Scope
The patched lane affects consumers who load a legacy
IvFl/IvFL-format.indexfile and clone it to GPU viafaiss::gpu::index_cpu_to_gpu. Pure CPU-side consumers of legacy files that don't exercise the GPU clone path were not demonstrated to hit this exact sink; this PR is the parity fix at the reader. The sink (IVFBase::addEncodedVectorsToList_(idx_t, const void*, const idx_t*, idx_t)) takes the codes asconst void*with no length and physically cannot validate its own input — the reader is the only layer that knows both the buffer and its true size, which is why the fix lives there.Verification
Built and verified on the actual target hardware: NVIDIA GB10 (Blackwell, aarch64), driver 580.95.05, CUDA 13.0. Build flags
-fsanitize=address,undefined -fno-omit-frame-pointer -g -O1. Harness invokesfaiss::read_index(IOReader*)thenfaiss::gpu::index_cpu_to_gpu(provider, device, index, opts), deterministic 3× repro of the original failure. Pre-patch: all 7 adversarial seeds (codes sizes 64, 128, 256, 320, 384, 511 bytes against expected 512; and one withids[0].size()=4, codes=512 bytes against expected 2048) fireheap-buffer-overflow READ. Post-patch: all 7 fire the newFAISS_THROW_IF_NOT_FMTat parse time and the GPU clone path is never reached.Affected file
faiss/impl/index_read.cpp— 27 insertions, 0 removals. No interface change, no on-disk format change. Well-formed legacy files (wherecodes.size() == ids.size() * code_size, which the writer preserves) are unaffected.Follow-up (out of scope here)
The deeper fix — deprecating the legacy
IvFl/IvFLformat in favor of the validated currentIwFlformat — is a maintainer policy call rather than something to bundle in this PR. Happy to file a separate discussion issue if useful.🤖 Generated with Claude Code