Our RAG stack degraded slowly over months.
- Text-shape differences created different embedding vectors
- Hidden characters slipped in from OCR
- Partial updates mixed old and new embeddings
- Incremental index rebuilds drifted from ground truth
Retrieval looked random at times, but the retriever wasn’t the problem.
We enforced a consistent embedding pipeline:
- Canonical preprocessing that never changes silently
- Full re-embeddings instead of patching
- Version-pinned embedding model
- Stable index rebuild rules tied to segmentation changes
Impact:
- Retrieval reliability improved immediately
- Embedding clusters became predictable
- Fewer “mysterious RAG failures”
- Debug time dropped dramatically
Have you seen embedding drift show up in long-running systems?
[link] [comments]