The unsexy part of AI apps: glue work that breaks everything (and how we stopped it)

I used to think building an AI feature was mostly model choice + prompts.

Then we shipped one.

What went wrong:
The assistant started giving different answers to the same questions. We didn’t change the model. We didn’t change the UI. It looked like the AI got worse.

Turns out the cause was boring: the system that feeds information into the AI changed slightly (a document extraction update). The text the AI searched over was subtly different, so it pulled different passages and answered differently.

What was observed:

  • documents ingested live with whatever parser happened to run
  • no record of what text was actually used
  • no simple test to detect changes early
  • debugging was basically guesswork

Changes applied:

  • we saved the cleaned/extracted text as a build artifact
  • we made the chunking rules (how we slice documents) explicit and versioned
  • we added a tiny regression test: a few questions that must still cite the same sources (or at least show what changed)
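A minimal sketch of the artifact step, assuming a hypothetical `save_artifact` helper and made-up chunking parameters — the point is that the exact text the AI searches over gets written to disk alongside a versioned chunking config and a content hash, so any change is detectable:

```python
import hashlib
import json
from pathlib import Path

# Hypothetical chunking rules; bump "version" whenever they change.
CHUNK_CONFIG = {"version": 3, "max_chars": 1200, "overlap": 200}

def chunk(text: str, max_chars: int, overlap: int) -> list[str]:
    """Slice text into overlapping windows (the explicit, versioned rule)."""
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, max(len(text), 1), step)]

def save_artifact(doc_id: str, extracted_text: str, out_dir: Path) -> dict:
    """Persist the exact text the AI will search over, plus a fingerprint."""
    chunks = chunk(extracted_text, CHUNK_CONFIG["max_chars"], CHUNK_CONFIG["overlap"])
    meta = {
        "doc_id": doc_id,
        "chunk_config": CHUNK_CONFIG,
        "sha256": hashlib.sha256(extracted_text.encode("utf-8")).hexdigest(),
        "num_chunks": len(chunks),
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{doc_id}.txt").write_text(extracted_text, encoding="utf-8")
    (out_dir / f"{doc_id}.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```

With this in the build, a parser update shows up as a changed `sha256` in the artifact metadata rather than as a mystery behavior shift at query time.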

Impact:
Failures became explainable. “The AI changed” turned into “this document’s extracted text changed; here’s the diff.”

If you’ve built AI apps: what’s the most annoying reliability issue you didn’t expect until you shipped?

submitted by /u/coolandy00