Best way to programmatically extract data from a set of .pdf files?

I’m wondering if the SaaS LLM offerings aren’t quite good enough yet for my use case. I need to extract about thirty key pieces of information from sets of PDF files programmatically.

Each file set will contain between 2 and 20 files, and the data is fairly complex legal content. A reasonably intelligent person could do most of this work without a legal background: for example, identifying a court case number and the name of the plaintiff.
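
To make the task concrete, here is a minimal sketch of pulling those two example fields out of text. It assumes the text has already been extracted from the PDF, that case numbers look like `1:23-cv-04567`, and that the caption contains a line of the form `NAME, Plaintiff,`. Both patterns, the sample text, and the `extract_fields` name are hypothetical; real filings vary by court.

```python
import re

# Hypothetical pattern: real case-number formats vary by court.
CASE_NO = re.compile(r"\b\d{1,2}:\d{2}-[a-z]{2}-\d{3,6}\b", re.IGNORECASE)
# Assumes a caption line of the form "<NAME>, Plaintiff," (illustrative only).
PLAINTIFF = re.compile(r"^\s*(.+?),\s*Plaintiffs?\b", re.IGNORECASE | re.MULTILINE)


def extract_fields(text: str) -> dict:
    """Return the first case number and plaintiff found, or None for each."""
    case = CASE_NO.search(text)
    plaintiff = PLAINTIFF.search(text)
    return {
        "case_number": case.group(0) if case else None,
        "plaintiff": plaintiff.group(1).strip() if plaintiff else None,
    }


# Made-up sample text standing in for one page of an extracted document.
sample = """UNITED STATES DISTRICT COURT
Case No. 1:23-cv-04567
JANE DOE, Plaintiff,
v.
ACME CORP., Defendant."""

print(extract_fields(sample))
```

Patterns like these only cover the simplest fields and return `None` as soon as the layout differs, which is why the more complex fields are the hard part.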

Some of the documents are several MB but most are smaller than 1 MB. Altogether I have about three thousand of these documents and will be collecting several hundred new ones every day.

Anyone doing something like this right now?

submitted by /u/tech_tuna