Many services advertise "we'll index your PDF, DOCX, etc. files," but we all (should) know that formats like these are bloated with extraneous data that isn't needed for the content itself and makes them slower to parse.
At what point do you think we'll start to see a negligible performance (accuracy) difference between structured and unstructured data?
I understand that for some specific models structured data will always be necessary, but what about common LLMs?
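
To put a rough number on the "extraneous data" point: a DOCX file is a ZIP archive of XML parts, so a quick way to see the overhead is to compare the size of the XML with the size of the text it carries. This is only a minimal sketch; `SAMPLE_XML` is a hypothetical, hand-written fragment in the style of `word/document.xml` (not taken from a real file), and `extract_text` and `markup_overhead` are illustrative helper names, not part of any indexing service's API.

```python
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical fragment written by hand for illustration only.
SAMPLE_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Normal"/></w:pPr>'
    '<w:r><w:rPr><w:b/><w:sz w:val="24"/></w:rPr>'
    '<w:t>Quarterly report</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Revenue grew 4% year over year.</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def extract_text(xml_source: str) -> str:
    """Return the visible text, one line per paragraph (<w:p>)."""
    root = ET.fromstring(xml_source)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Text lives in <w:t> elements nested inside runs (<w:r>).
        runs = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def markup_overhead(xml_source: str) -> float:
    """Fraction of the XML's bytes that is not extracted text."""
    text_bytes = len(extract_text(xml_source).encode("utf-8"))
    total_bytes = len(xml_source.encode("utf-8"))
    return 1 - text_bytes / total_bytes


if __name__ == "__main__":
    print(extract_text(SAMPLE_XML))
    print(f"markup overhead: {markup_overhead(SAMPLE_XML):.0%}")
```

A real file adds styles, fonts, relationships, and embedded media on top of this, so the ratio for an actual document will differ; the sketch only shows how the comparison could be made.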