I knew what was going to happen if I tried to teach myself information retrieval from a textbook.
I'd read chapter 1, follow it, "get it," move on. Read chapter 2, follow it, get it, move on. Three weeks in I'd have read six chapters and learned almost none of it — because reading is not learning, and following an explanation is not retrieval, and the fluency illusion is the thing that ruins self-taught technical work for most people who try.
The problem with self-taught technical learning isn't that you're not smart enough. It's that there's nobody in the room to call you on it when you fake the rep. School had professors and exams. A bootcamp has cohort pressure. Reading a textbook alone in your apartment has neither. The work is on you. So is the bullshit detection.
Before I opened Manning's IR textbook, I read Justin Skycak's Advice on Upskilling. Skycak runs Math Academy. The whole book is one argument with consequences: learning is a positive change in long-term memory. Not a feeling. Not pages read. If you can't retrieve it cold without help, you haven't learned it. The thing you're constantly fighting is the fluency illusion — confusing "I followed it in the moment" with "I know it."
I needed a tutor. Not someone to teach me — I had the textbook. Someone to catch me when I was faking the work. Someone who'd refuse to let me move on when a retrieval was fuzzy. Someone who'd push back when I tried to skip a rep because I felt like I "already had this one."
I built that tutor as a Claude skill before I started chapter 1. Six chapters in, the system has caught me more times than I can count.
What the skill does
It turns Claude into a learning coach grounded in Skycak's framework. Three modes, picked at the start of every session: Diagnose (find the gaps), Plan (build the prereq tree), Coach (run the actual retrieval loop). The skill is specifically designed to fight a single failure mode. From the SKILL.md, the line that aligns the entire system:
The thing you are fighting: the user mistaking comfortable fluency (information sitting in working memory during a conversation) for mastery. You will catch this constantly. Name it when you see it.
That sentence is the whole alignment. Most "AI tutor" skills are aligned to help the user feel smart. This one is aligned to fight the user when they're tricking themselves. Different objective function.
The skill enforces eight non-negotiable rules — no passive re-ingestion as review, prereq audit before new material, minimum effective dose of instruction, review should feel challenging, bite-sized pieces, measurable progress over feelings, no shortcut validation, top-down plan with bottom-up execution. Every retrieval gets graded into one of five categories: clean, clean but slow, peeked, failed, didn't attempt. The distinction between peeked and clean is what separates a real spaced-rep system from a fake one. Peeking does not count as a retrieval rep. The system tracks them separately because they are separate.
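The peeked-vs-clean distinction is mechanical enough to sketch. A minimal version in Rust (the language of the rest of the build), with names of my own choosing; the skill files define these categories in prose, not code:

```rust
// The five grades as a type. Names are mine, not from the skill;
// the one rule the post is explicit about is that a peek is logged
// but does not count as a retrieval rep.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Grade {
    Clean,        // retrieved cold, correct
    CleanButSlow, // correct, but labored
    Peeked,       // looked at notes before answering
    Failed,       // attempted cold, came up wrong or fuzzy
    NotAttempted, // skipped entirely
}

impl Grade {
    // A failed cold attempt is still effortful retrieval (my reading,
    // not a quote from the skill); a peek or a skip is not.
    fn counts_as_rep(self) -> bool {
        !matches!(self, Grade::Peeked | Grade::NotAttempted)
    }
}
```

Tracking `Peeked` as its own variant, rather than folding it into `Failed`, is the point: the two look identical in a notebook and completely different in a log.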
The part most people miss
A spaced-rep system is necessary but not sufficient. The harder problem is: when you're tired, when working memory is loaded, when the rep is owed and you really want to skip it — what stops you?
Two answers. First, the system knows my flinch patterns and is instructed to call them out. Second, it doesn't soften when I push back. The contract is explicit. From SESSION_SUMMARY.md, what Claude reads at the start of every chat:
Push me toward clearer thinking, expose blind spots, no empty praise. Hold high standards. When I set up a learning contract, hold the line even when I flinch. Don't apologize for pushing. Don't soften when a retrieval is actually fuzzy. Empty reassurance = violation of the contract.
The flinch list isn't theoretical. Each pattern was named after I watched myself do it during a real session, then written into the system so future-Claude could catch it earlier than I could. The system doesn't know my flinches because Skycak listed them. It knows because it caught me mid-dodge, I named the pattern, and I wrote it down.
What it actually says
The system runs across two timescales. The strategic layer is OPERATING_PRINCIPLES.md — re-read monthly, not every session. It names what I'm optimizing for and the operating bugs that compound when I don't manage them. From near the end of the file:
Speed is a moat, but only when speed is applied to the things that compound. The rep is the gate. Always.
The tactical layer is SESSION_SUMMARY.md — pasted into every new chat to bootstrap context. It tells the next Claude where we left off, which retrievals are due, and what flinches to watch for this week. The session-opening protocol on it is non-negotiable:
1. Read this file. 2. Check learning_log.md for any retrieval rep due today. 3. The rep is the gate. Do it cold, before any new building. No exceptions. The most important rule I have set for myself.
The reps themselves live in learning_log.md. Every concept gets retrieval questions, a review schedule, and a grade per attempt. The schedule isn't fixed — it adapts to the grade. A clean retrieval doubles the next interval. A fuzzy one shrinks it back.
Picture each concept as a dot moving through the review intervals: 1 day, 3 days, 1 week, 2 weeks, 1 month, 3 months. Clean retrievals push it forward a rung. Fuzzy retrievals pull it back. The distance between rungs is what makes the rep work; review that feels easy means the gap was too short and nothing got consolidated.
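The ladder logic is simple enough to write down. A sketch in Rust, with the rung values taken from the intervals above; the actual schedule lives in learning_log.md as text, not code:

```rust
// The interval ladder from the post, in days. Advancing a rung
// roughly doubles the gap, which matches the "clean doubles,
// fuzzy shrinks" rule.
const LADDER_DAYS: [u32; 6] = [1, 3, 7, 14, 30, 90];

/// `rung` indexes a concept's position on the ladder. A clean
/// retrieval advances one rung; a fuzzy one pulls it back one,
/// clamped at both ends.
fn next_rung(rung: usize, clean: bool) -> usize {
    if clean {
        (rung + 1).min(LADDER_DAYS.len() - 1)
    } else {
        rung.saturating_sub(1)
    }
}
```

So a concept at the 1-week rung that comes back clean is next due in 2 weeks; fuzzy, and it drops back to 3 days.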
What the system catches
The most useful entries in the log are the failures. They're the receipts that the system is doing its job, not letting me off the hook. After a six-hour session implementing tiered indexes, the most important question of the day (about an implicit assumption breaking under cosine normalization) got logged like this:
Q4 (implicit assumption + where it breaks): Fuzzy. Trailed off.
User's answer: "if tau is low or —" (incomplete)
The user had observed this empirically in the same session. So the conceptual understanding is there — the verbal retrieval under cognitive load was fuzzy. Working memory issue, not a knowledge issue. Re-rep scheduled day-3 cold.
I'd have called that rep "good enough" if I were grading myself. The system didn't. It logged the trail-off, traced the cause to fatigue rather than a knowledge gap, and scheduled the re-rep three days out so I'd hit it cold instead of cached.
The flinch patterns in SESSION_SUMMARY.md were observed the same way. From the current list:
Mechanics-dump on conceptual questions — when asked why, I drop into describing how. Concrete-over-abstract is a comfort move. Name it; redirect to the why.
Article-as-rep substitution — when retrieval is owed and an article is also pending, I use the article to force the retrieval. The article gets done either way, but this pattern means the rep is fragile: skip the article and the retrieval gets skipped too. Healthier flow: retrieval first, article reflects locked-in understanding.
Each one was caught during a session, named in plain language, and added to the file so it would get caught faster next time. That's what makes this different from any spaced-rep app — the system doesn't just track what I know. It tracks how I dodge knowing it, and pushes back on the dodge.
The architecture
Four files do the work, layered from durable strategy down to the daily rep: OPERATING_PRINCIPLES.md (what I'm optimizing for, re-read monthly), SESSION_SUMMARY.md (current state, pasted into every new chat), SKILL.md (the coach's modes and rules), and learning_log.md (the reps, schedules, and grades).
At the end of every session I zip the directory. To resume a week later, I drop the zip into a fresh chat, paste SESSION_SUMMARY at the top, and Claude has the full state. No context loss between sessions. No drift between Claudes.
What running it has produced
Six chapters into Manning's IR textbook. Seven articles shipped documenting the build. Twelve Rust files implementing a search engine that handles 11,314 documents, 138,743 terms, tiered indexes with data-driven thresholds, and a streaming k-way merge that holds constant memory at any scale. Code is on GitHub.
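The streaming merge is the piece worth showing. A minimal sketch of a k-way merge over sorted runs, holding one head element per run in a min-heap so memory stays O(k) regardless of run length; the real engine merges postings files on disk, but the core loop is the same idea:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Streaming k-way merge over sorted runs. Only the current head of
/// each run lives in the heap, so memory is O(k) no matter how long
/// the runs are. (A sketch of the idea; the engine's version streams
/// postings from files rather than in-memory iterators.)
fn kway_merge<I>(mut runs: Vec<I>) -> Vec<u32>
where
    I: Iterator<Item = u32>,
{
    // BinaryHeap is a max-heap; Reverse turns it into a min-heap.
    let mut heap = BinaryHeap::new();
    // Seed the heap with the head of each run, tagged with its index.
    for (idx, run) in runs.iter_mut().enumerate() {
        if let Some(v) = run.next() {
            heap.push(Reverse((v, idx)));
        }
    }
    let mut out = Vec::new();
    // Pop the global minimum, then refill from the run it came from.
    while let Some(Reverse((v, idx))) = heap.pop() {
        out.push(v);
        if let Some(next) = runs[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    out
}
```

Collecting into a `Vec` here is only for demonstration; making the merge itself an iterator that's written straight to disk is what keeps the memory constant at any scale.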
The more interesting output is the meta-layer. The learning log shows things I couldn't have seen in real time — which concepts I had to re-rep at day 3 versus which ones held cold at day 7, the hours of the day my answers go fuzzy, the fact that I outsource micro-decisions specifically when working memory is loaded. The fix wasn't "stop outsourcing." It was "stop pushing past hour 4."
None of this is theoretical. It's data on how I learn, generated by the system, that I now use to operate differently.
Why this matters more than the search engine
The search engine is the artifact I'll point at in interviews. The system that built the engine is the more durable thing. Every domain I want to learn after this — production retrieval systems, neural rerankers, distributed systems, whatever's next — runs through the same four files. The skill doesn't care what the topic is. It cares whether I'm faking the rep.
I use AI to do the work on me — to force the reps I'd skip, call out the flinches I can't see, and grade me fuzzy when I want to be graded clean. That's the trade I've found most useful for what I'm trying to build.
The rep is the gate. The skill is what holds it.
Skill on GitHub: github.com/sreenish27/learning-skill · Search engine build: krithik.xyz/search-engine