I wasn't exactly sure how to title my question but here is what I'm looking for:
Lecture Transcripts
I have been collecting transcripts from lectures (usually 30-60 minutes long, so around 5k-10k words per transcript file) on various topics that I follow. I've been able to download the transcripts from YouTube. They don't have any punctuation, but when I feed a single transcript into an LLM to summarize, it usually has no problem giving a great summary back.
Transcripts to LLM
I thought it would be great to somehow train an LLM with all of the transcripts I've collected for a certain lecturer/speaker, and then be able to interact, ask questions, and use it as a study guide. With hundreds of lectures for a given individual, it seems like an LLM would be able to quickly pull out insights and connections that would take me a long time to make.
Options I've found
Google NotebookLM
I was actually pretty excited when I saw Google's NotebookLM, but it seems to choke when I feed it "larger" chunks of text or when I try to feed it many files. Honestly, if NotebookLM could handle documents with 10k words and handle a thousand documents in each notebook - that would be exactly what I'm looking for. Since it can't (yet), I am here looking for ideas.
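One workaround I can think of is splitting each transcript into smaller pieces before feeding it in. Here is a minimal sketch of that idea in Python. Since the YouTube transcripts have no punctuation, it splits on word count rather than sentences. The function name `split_transcript` and the 2,000-word default are placeholders I made up, not a documented NotebookLM limit.

```python
def split_transcript(text, max_words=2000):
    """Split transcript text into chunks of at most max_words words.

    Splits on whitespace only, because the transcripts have no
    punctuation to mark sentence boundaries. max_words is an
    assumed limit, not one taken from any tool's documentation.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list in strides of max_words and rejoin each slice.
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

This only makes individual pieces smaller. It doesn't help with the limit on how many files a notebook can hold, and it may cut a thought in half at a chunk boundary.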
Other options
Another option I've seen is AssemblyAI. I haven't been able to find a way to feed it text transcriptions though. It seems to work from the actual audio first, and from that you can produce transcriptions and summaries, and train their LLM with the transcriptions.
Ideas?
With all of that said, are there products (services, software I could run on my own servers, or even Python libraries I could use to implement my own solution) that can take transcripts from podcasts and YouTube lectures, train some kind of custom LLM on them, and let me use that as a learning/research tool?