I've been building a system that turns YouTube channels into structured knowledge bases. Thought I'd share the workflow, given Karpathy's recent post about LLM knowledge bases.
The problem I was solving: I wanted to extract structured, cross-referenced knowledge from YouTube channels. Not just transcripts, but the full picture: what the creator teaches, what the audience actually cares about, what questions remain unanswered, and what patterns and playbooks emerge across multiple channels.
Data ingestion:
- Pull transcripts from up to 100 videos per channel
- Fetch comments per channel
- Generate video summaries from transcripts
- For real-time context, I also pull recent Reddit and X discussions on the same topics
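To make the ingestion step concrete, here is a minimal sketch in Python. It is not the actual pipeline code: the `Video` and `ChannelData` types and the four injected callables (`list_video_ids`, `fetch_transcript`, `fetch_comments`, `summarize`) are hypothetical stand-ins for whatever API clients and summarizer are used. Only the 100-video cap comes from the list above.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_VIDEOS = 100  # per-channel cap mentioned in the post


@dataclass
class Video:
    video_id: str
    transcript: str
    summary: str = ""


@dataclass
class ChannelData:
    channel_id: str
    videos: List[Video] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)


def ingest_channel(
    channel_id: str,
    list_video_ids: Callable[[str], List[str]],   # hypothetical: channel -> video ids
    fetch_transcript: Callable[[str], str],       # hypothetical: video id -> transcript
    fetch_comments: Callable[[str], List[str]],   # hypothetical: channel -> comments
    summarize: Callable[[str], str],              # hypothetical: transcript -> summary
) -> ChannelData:
    """Collect transcripts, summaries, and comments for one channel."""
    data = ChannelData(channel_id)
    # Only the first MAX_VIDEOS ids are processed.
    for video_id in list_video_ids(channel_id)[:MAX_VIDEOS]:
        transcript = fetch_transcript(video_id)
        data.videos.append(Video(video_id, transcript, summarize(transcript)))
    data.comments = fetch_comments(channel_id)
    return data
```

Passing the fetchers in as arguments keeps the sketch independent of any particular YouTube client, and makes it easy to test with stubs.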
Analysis layer: I run 10 parallel analyzers on each channel's data:
- Transcript analysis: themes, lessons, counter-intuitive insights
- Sentiment analysis: what the audience praises vs. complains about
- Viewer questions: clustered by theme, scored by demand
- Superfan detection: identifying the most engaged community members
- Content requests: what the audience explicitly asks for
- Product mentions: tools/products referenced in content
- Historical trends: how audience interests shift over time
- Engagement quality: depth of comments, substantive vs. shallow ratio
- Content intelligence: top opportunities based on demand signals
- Sponsor analysis: brand mentions and audience reception
Each analyzer reads from preprocessed data files tailored to its needs.
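The fan-out described above could be sketched roughly like this, assuming each analyzer is a plain function that takes its own preprocessed input. The two toy analyzers and their heuristics (a trailing "?" counts as a question, 8 or more words counts as substantive) are illustrative assumptions, not the real analyzers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def count_questions(comments: List[str]) -> int:
    # Toy heuristic (assumption): a comment ending in "?" is a viewer question.
    return sum(1 for c in comments if c.strip().endswith("?"))


def substantive_ratio(comments: List[str], min_words: int = 8) -> float:
    # Toy heuristic (assumption): "substantive" means at least min_words words.
    if not comments:
        return 0.0
    substantive = sum(1 for c in comments if len(c.split()) >= min_words)
    return substantive / len(comments)


def run_analyzers(
    analyzers: Dict[str, Callable[[Any], Any]],
    inputs: Dict[str, Any],
) -> Dict[str, Any]:
    """Run every analyzer in parallel, each on the input prepared for it."""
    if not analyzers:
        return {}
    with ThreadPoolExecutor(max_workers=len(analyzers)) as pool:
        futures = {
            name: pool.submit(fn, inputs[name]) for name, fn in analyzers.items()
        }
        # result() blocks until that analyzer finishes and re-raises its errors.
        return {name: future.result() for name, future in futures.items()}
```

Threads are a reasonable default here if the analyzers mostly wait on network calls (for example LLM requests); CPU-heavy analyzers would be better served by a process pool.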
Cross-channel synthesis: The most valuable output isn't any single channel analysis; it's what emerges when you compare 20+ channels.
I programmatically generate "topic pages" that aggregate insights across all channels covering a given subject. This gives you a demand-weighted map of what audiences actually want to know.
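One way a demand-weighted topic page could be assembled is sketched below. It assumes each channel analysis yields `(topic, question, demand)` tuples; that shape, the `build_topic_pages` name, and summing demand scores as the weighting are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed input shape: one (topic, question, demand score) tuple per question.
Question = Tuple[str, str, float]


def build_topic_pages(channel_questions: Dict[str, List[Question]]) -> Dict[str, dict]:
    """Group questions by topic across channels and rank topics by total demand."""
    pages = defaultdict(lambda: {"channels": set(), "demand": 0.0, "questions": []})
    for channel, items in channel_questions.items():
        for topic, question, demand in items:
            page = pages[topic]
            page["channels"].add(channel)
            page["demand"] += demand
            page["questions"].append((demand, question, channel))
    # Within each page, put the highest-demand questions first.
    for page in pages.values():
        page["questions"].sort(key=lambda q: q[0], reverse=True)
    # Order pages by total demand, highest first.
    ranked = sorted(pages.items(), key=lambda kv: kv[1]["demand"], reverse=True)
    return dict(ranked)
```

A topic that several channels' audiences keep asking about rises to the top, which is the cross-channel signal a single-channel analysis cannot show.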
Output: So far I've built 52 guides across health, business, AI, and finance -- each synthesized from multiple channels' worth of transcripts and comments. They're structured as searchable knowledge bases, not blog posts.
Stack: The whole pipeline is automated. Point it at a channel and it handles everything.
Happy to answer any questions.