Building knowledge bases from YouTube data using LLMs — my workflow after 52 guides

I've been building a system that turns YouTube channels into structured knowledge bases. Thought I'd share the workflow since Karpathy's recent post about LLM knowledge bases

The problem I was solving: I wanted to extract structured, cross-referenced knowledge from YouTube channels. Not just transcripts, but the full picture: what the creator teaches, what the audience actually cares about, what questions remain unanswered, and what patterns, playbooks, etc. emerge across multiple channels.

Data ingestion:

  • Pull transcripts from up to 100 videos per channel
  • Fetch comments per channel
  • Generate video summaries from transcripts
  • For real-time context, I also pull recent Reddit and X discussions on the same topics
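
A rough sketch of how the ingestion steps above could be wired together. This is not the actual pipeline code: the fetchers and the summarizer are passed in as plain callables, and every name here (`ChannelBundle`, `ingest_channel`, the `fetch_*` parameters) is a placeholder I made up for illustration. Only the 100-video cap comes from the list above.

```python
from dataclasses import dataclass, field

MAX_VIDEOS_PER_CHANNEL = 100  # cap taken from the ingestion list above


@dataclass
class ChannelBundle:
    """Everything collected for one channel (hypothetical layout)."""
    channel_id: str
    transcripts: dict[str, str] = field(default_factory=dict)  # video_id -> transcript
    summaries: dict[str, str] = field(default_factory=dict)    # video_id -> summary
    comments: list[str] = field(default_factory=list)
    discussions: list[str] = field(default_factory=list)       # Reddit / X snippets


def ingest_channel(channel_id, video_ids, fetch_transcript, fetch_comments,
                   fetch_discussions, summarize):
    """Collect raw data for one channel using caller-supplied fetchers.

    All four callables are assumptions: swap in whatever API client
    or LLM call you actually use.
    """
    bundle = ChannelBundle(channel_id)
    for video_id in video_ids[:MAX_VIDEOS_PER_CHANNEL]:
        text = fetch_transcript(video_id)
        if not text:
            continue  # skip videos that have no transcript
        bundle.transcripts[video_id] = text
        bundle.summaries[video_id] = summarize(text)
    bundle.comments = list(fetch_comments(channel_id))
    bundle.discussions = list(fetch_discussions(channel_id))
    return bundle
```

Passing the fetchers in as arguments keeps the sketch testable with stubs and avoids tying it to any particular API client.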

Analysis layer: I run 10 parallel analyzers on each channel's data:

  1. Transcript analysis: themes, lessons, counter-intuitive insights
  2. Sentiment analysis: what the audience praises vs. complains about
  3. Viewer questions: clustered by theme, scored by demand
  4. Superfan detection: identifying the most engaged community members
  5. Content requests: what the audience explicitly asks for
  6. Product mentions: tools/products referenced in content
  7. Historical trends: how audience interests shift over time
  8. Engagement quality: depth of comments, substantive vs. shallow ratio
  9. Content intelligence: top opportunities based on demand signals
  10. Sponsor analysis: brand mentions and audience reception

Each analyzer reads from preprocessed data files tailored to its needs.
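
To make the "parallel analyzers, each with its own preprocessed input" idea concrete, here is a minimal sketch using Python's standard `concurrent.futures`. The two toy analyzers (`count_questions`, `substantive_ratio`) and the 12-word threshold are illustrative assumptions, not the real analyzers, which would presumably call an LLM.

```python
from concurrent.futures import ThreadPoolExecutor


def count_questions(comments):
    """Toy stand-in for a viewer-questions analyzer."""
    return sum(1 for c in comments if c.strip().endswith("?"))


def substantive_ratio(comments, min_words=12):
    """Toy stand-in for engagement quality: share of longer comments.

    The 12-word cutoff is an arbitrary assumption.
    """
    if not comments:
        return 0.0
    deep = sum(1 for c in comments if len(c.split()) >= min_words)
    return deep / len(comments)


def run_analyzers(analyzers, inputs):
    """Run each analyzer on its own input in a thread pool.

    analyzers: {name: callable}, inputs: {name: preprocessed data}.
    A failing analyzer is recorded instead of aborting the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=len(analyzers) or 1) as pool:
        futures = {name: pool.submit(fn, inputs[name])
                   for name, fn in analyzers.items()}
        for name, future in futures.items():
            try:
                results[name] = {"ok": True, "value": future.result()}
            except Exception as exc:
                results[name] = {"ok": False, "error": repr(exc)}
    return results
```

Threads are a reasonable fit if the analyzers mostly wait on network calls. For CPU-heavy work a process pool would be the usual alternative.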

Cross-channel synthesis: The most valuable output isn't any single channel analysis. It's what emerges when you compare 20+ channels.

I generate programmatic "topic pages" that aggregate insights across all channels covering a given subject. This gives you a demand-weighted map of what audiences actually want to know.
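
One way the topic-page aggregation could look, as a sketch only. It assumes each channel's analysis has already produced `(topic, question, demand_score)` rows. That row shape, the function name `build_topic_pages`, and the choice to sum demand scores are my own assumptions for illustration.

```python
from collections import defaultdict


def build_topic_pages(channel_questions):
    """Group viewer questions by topic across channels.

    channel_questions: {channel: [(topic, question, demand_score), ...]}
    Returns {topic: page}, ordered by total demand, highest first.
    """
    pages = defaultdict(lambda: {"channels": set(), "demand": 0.0, "questions": []})
    for channel, rows in channel_questions.items():
        for topic, question, demand in rows:
            page = pages[topic.strip().lower()]  # normalize topic keys
            page["channels"].add(channel)
            page["demand"] += demand
            page["questions"].append((demand, question, channel))
    for page in pages.values():
        # Highest-demand questions first within each topic page.
        page["questions"].sort(key=lambda q: q[0], reverse=True)
    return dict(sorted(pages.items(), key=lambda kv: kv[1]["demand"], reverse=True))
```

Summing scores is the simplest weighting. Normalizing per channel first would stop one very large channel from dominating a topic.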

Output: So far I've built 52 guides across health, business, AI, and finance -- each synthesized from multiple channels' worth of transcripts and comments. They're structured as searchable knowledge bases, not blog posts.

Stack: The whole pipeline is automated: point it at a channel and it handles everything.

Happy to answer any questions.

submitted by /u/recmend