An LLM-maintained exploration wiki for the DataTalks.Club podcast archive.
This project applies Andrej Karpathy's LLM wiki pattern to the podcast content in
../datatalksclub.github.io/_podcast: raw episode files stay in the website
repo, while this repo holds exploration pages that help agents and readers
understand, connect, and reuse the podcast archive. Public podcast links should
point to canonical https://datatalks.club/podcast/<slug>.html pages on the
main website.
The podcast archive already has episode pages, timestamped clips, topics, and full transcripts. The missing layer is a persistent topic-oriented wiki:
- episode discovery by theme, not only by publication order
- insight hub drafts for topics like LLMs, MLOps, career transitions, and open source
- cross-links between episodes, guests, clips, and recurring ideas
- a workflow where LLM analysis is filed back into Markdown instead of disappearing into chat history
This maps directly to DataTalksClub issue #111: taxonomy, clip categorization, and thematic "Insight Hub" pages.
- sources/ documents where raw source material lives. The raw Markdown files are not copied here.
- CONTENT_TODO.md keeps the durable backlog for roles, transitions, portfolio projects, roadmaps, and comparison pages.
- _podcast_summaries/ contains source-derived agent records. Full transcripts and public podcast pages stay in the website repo.
- _people/ contains source-derived person node records that resolve to the main website.
- _wiki/ contains all public content. Some pages use tags: such as guide, comparison, roadmap, transition, or how-to.
- search/ contains the browser fallback corpus copied into the static site.
- graph/graph.json contains generated static graph data used by /graph.html.
- artifacts/search/ contains build artifacts for the Zerosearch Lambda.
- sources/podcast-archive-summary.md contains the generated agent-first map of all synced episodes, people, chapter summaries, and topic candidates.
- search_lambda/ contains the AWS Lambda handler for exploration-page search.
- scripts/ contains deterministic helpers for search packaging and checking the static site.
- AGENTS.md tells Codex or another LLM agent how to maintain the wiki.
Build with Rustkyll:

    make build

The Makefile prefers .bin/rustkyll when present, matching the pinned binary
used by the GitHub Pages deploy. If that file is absent, it falls back to
uvx --no-config --from rustkyll==0.5.1 rustkyll; the --no-config matters
because a global uv exclude-newer setting can hide fresh Rustkyll releases and
silently run an older binary without WASM extension support.
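
The binary selection described above can be restated as a small Python sketch. This is not the Makefile itself, only an illustration of the stated rule; the function name `rustkyll_command` and the `repo_root` parameter are hypothetical.

```python
import os
from typing import List


def rustkyll_command(repo_root: str = ".") -> List[str]:
    """Return the command prefix used to invoke Rustkyll.

    Mirrors the rule described in this README: prefer the pinned
    .bin/rustkyll binary, otherwise fall back to uvx with --no-config.
    """
    pinned = os.path.join(repo_root, ".bin", "rustkyll")
    # Use the pinned binary only if it exists and is executable.
    if os.path.isfile(pinned) and os.access(pinned, os.X_OK):
        return [pinned]
    # --no-config keeps a global uv setting from changing which release runs.
    return ["uvx", "--no-config", "--from", "rustkyll==0.5.1", "rustkyll"]


if __name__ == "__main__":
    print(" ".join(rustkyll_command()))
```

The returned list can be passed to `subprocess.run` with extra arguments appended, which avoids shell quoting issues.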
Serve locally:

    make serve

The search page is available at /search.html. Set search_api_url in
_config.yml to the deployed Lambda Function URL to use server-side Zerosearch.
When search_api_url is empty, the page falls back to a simple client-side search
over /search/search-corpus.json for local development. Search indexes the
exploration collections, not full podcast transcripts.
The graph page is available at /graph.html. It visualizes topics, tagged wiki
pages, source-derived episode records, people, and source-episode relationships
from graph/graph.json. Clicking a node opens a side panel with canonical page
links, search links, related nodes, and a copyable graph URL such as
/graph.html#topic%3Allms.
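
The example fragment `topic%3Allms` is the percent-encoded form of `topic:llms`. A minimal sketch of building and decoding such a URL follows; the `type:slug` node ID shape is inferred from that one example, and the helper names are hypothetical.

```python
from urllib.parse import quote, unquote


def graph_url(node_id: str, page: str = "/graph.html") -> str:
    """Build a shareable graph URL with the node ID percent-encoded in the fragment."""
    # safe="" forces ":" and "/" to be encoded as well.
    return f"{page}#{quote(node_id, safe='')}"


def node_id_from_url(url: str) -> str:
    """Recover the node ID from the fragment of a graph URL."""
    _, _, fragment = url.partition("#")
    return unquote(fragment)


if __name__ == "__main__":
    print(graph_url("topic:llms"))  # /graph.html#topic%3Allms
    print(node_id_from_url("/graph.html#topic%3Allms"))  # topic:llms
```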
Rebuild graph data after content changes:

    make graph

Check generated internal links after a build:

    python scripts/check_links.py

The GitHub Pages workflow runs the same link check on each push to main with
the deployed /podwiki base URL.
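
To illustrate what an internal-link check under a base URL involves, here is a minimal sketch. It is not the logic of scripts/check_links.py, which is not shown here; the function names and the mapping of directory links to index.html are assumptions.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_targets(html: str, base_url: str = "/podwiki") -> List[str]:
    """Return site-relative file paths for links under base_url."""
    collector = LinkCollector()
    collector.feed(html)
    targets = []
    for href in collector.hrefs:
        # Drop the fragment and query string before resolving the path.
        path = href.split("#", 1)[0].split("?", 1)[0]
        if path == base_url or path.startswith(base_url + "/"):
            rel = path[len(base_url):].lstrip("/")
            # Assumption: directory-style links resolve to index.html.
            if rel == "" or rel.endswith("/"):
                rel += "index.html"
            targets.append(rel)
    return targets


def broken_links(html: str, existing: Iterable[str], base_url: str = "/podwiki") -> List[str]:
    """Return internal targets that are missing from the built site."""
    known = set(existing)
    return [t for t in internal_targets(html, base_url) if t not in known]
```

External links, such as canonical podcast pages on datatalks.club, are skipped because they do not start with the base URL.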
Build the packed Zerosearch artifact:

    make index

This creates artifacts/search/search-index.zsx. To prepare the minimal SAM
package directory, run:

    make lambda-package

The Lambda in
search_lambda/podwiki_search/handler.py loads that file and exposes:
- GET /health
- GET /?q=<query>
- GET /?q=<query>&level=wiki
- GET /?q=<query>&level=guide
- GET /?q=<query>&level=comparison
- GET /?q=<query>&level=roadmap
- GET /?q=<query>&level=how_to
- GET /?q=<query>&level=podcast_summary
- GET /?q=<query>&level=person
- GET /?q=<query>&level=section
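
A client can assemble these request URLs with the standard library. The sketch below only builds URLs and sends nothing; the base URL `https://example.invalid` is a placeholder for the deployed Function URL, and `search_url` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlencode

# Level values taken from the endpoint list above.
LEVELS = {
    "wiki", "guide", "comparison", "roadmap",
    "how_to", "podcast_summary", "person", "section",
}


def search_url(base_url: str, query: str, level: Optional[str] = None) -> str:
    """Build a search request URL, optionally restricted to one level."""
    if level is not None and level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    params = {"q": query}
    if level is not None:
        params["level"] = level
    # urlencode percent-encodes values and turns spaces into "+".
    return f"{base_url.rstrip('/')}/?{urlencode(params)}"


if __name__ == "__main__":
    print(search_url("https://example.invalid", "career transitions", "guide"))
```

Validating `level` on the client side surfaces typos such as `how-to` (the tag spelling) versus `how_to` (the query value) before a request is made.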
The SAM template is template.yaml. The GitHub Actions workflow in
.github/workflows/deploy-search.yml rebuilds the corpus and .zsx artifact,
then deploys on push to main. It expects these repository secrets:
- AWS_DEPLOY_ROLE_ARN
- AWS_REGION

Optional repository variable:

- CORS_ORIGIN

- Add or update the source episode Markdown in ../datatalksclub.github.io/_podcast.
- Run make sources to regenerate _podcast_summaries/, _people/, artifacts/podcast/source-index.json, and sources/podcast-archive-summary.md.
- Link the new evidence from relevant _wiki/ pages when it adds useful support.
- Run make check in this repo. This refreshes graph data, search corpus, the Lambda package, static HTML, and generated internal-link checks.
- Push this repo to rebuild and deploy the search Lambda through GitHub Actions.

- Ask the LLM to synthesize one wiki page at a time using existing exploration pages plus source transcript excerpts from ../datatalksclub.github.io.
- File durable public synthesis into _wiki/<topic>.md; use tags: for keyword-driven editorial pages.
- Cross-link editorial pages to relevant wiki pages, canonical podcast episode URLs, and the podcast evidence that grounds each claim.
- Refresh compact _podcast_summaries/ and _people/ records as reusable agent context, not as public mirrors.
- Periodically ask the LLM to lint for stale summaries, missing cross-links, weak taxonomy assignments, and orphan pages.