Has ingestion drift quietly broken your RAG pipeline before?
Has ingestion drift quietly broken your RAG pipeline before?

Has ingestion drift quietly broken your RAG pipeline before?

We’ve been working on an Autonomous Agentic AI, and the thing that keeps surprising me is how often performance drops come from ingestion changing quietly in the background, not from embeddings or the retriever.

Sometimes the extractor handles a doc differently than it did a month ago. Sometimes the structure collapses. Sometimes small OCR glitches creep in. Or the team updates a file and forgets to re-ingest it.

I’ve been diffing extraction outputs over time and checking token count changes, which helps a bit. But I still see drift when different export tools or file types get mixed in.

If you’ve run RAG in the wild for a while, what kinds of ingestion surprises have bitten you?

submitted by /u/coolandy00
[link] [comments]