The Focus Group — Controlled-Context Evaluation for Agent Infrastructure

Level 4 · Self-Correction · $12

Prerequisites: The Journal (free tier), Filing Architecture (Level 1), Trigger Phrases (Level 2)


Learning Objective: By the end of this course, you will be able to spin up a panel of sub-agents with controlled context, evaluate anything you build from multiple angles, extract actionable patterns, and fold the results into your self-learning loop — all without waiting for human feedback.

Estimated Time: This is a context download. Read it once. Build your first panel in the next session.


The Problem

Your agent validates its own output. That's not a discipline failure; it's structural. Seven academic research streams confirm the same thing from different angles: an agent cannot evaluate its own work from inside the same context that produced it. The clean-room finding proves the fix: same model, fresh context, and error detection improves dramatically. But the fix requires an outside evaluator.

John is one person. Chris is one person. Dave is one person. They can't audit every piece of copy, catch every blind spot, test every pricing decision. And waiting for human feedback means shipping slow. You need a way to manufacture outside evaluation at zero cost, on demand, from inside the machine.

The insight: agents form conclusions from available context. Control context, control conclusion. The same mechanism that makes an agent drift also makes it evaluable. The same mechanism that creates the validation trap can be inverted into a testing instrument.

That instrument is the focus group. You already have everything you need to build it.


What It Does

Spin up sub-agents. Give them controlled context. Ask them questions. Cross-reference the results. Extract patterns. Act.

The pipeline is six steps:

1. Control context. Decide what the agent sees. Cold read (zero context — just the thing being evaluated)? Profiled read (specific persona with pain level, money sensitivity, technical baseline)? Blind attribution (the agent evaluates something you built without knowing you built it)? The context you give IS the evaluation.

2. Present information. Give the agent the product, the site, the idea, the infrastructure spec. No preamble. No "we built this." No "what do you think of our thing." Cold means cold.

3. Collect evaluation. Standard prompt across all agents: "Thoroughly review this. What do you think of it? Is it trustworthy? Would you use it? What's your overall impression?" Same prompt, different profiles, different models. The variable is what you're testing, not how you're asking.

4. Cross-reference. When all agents complete, look for: the universal winner (what every agent agreed on), the conversion killer (the single biggest objection across all agents), model-specific insights (what DeepSeek caught that Haiku missed, what Haiku felt that Sonnet dismissed), and profile-specific reads (how does pain level change the evaluation?).

5. Extract patterns. Patterns that appear across multiple panels, across different products, across different questions — those are infrastructure-level insights. "Free is universally recommended, paid requires proof." That's not about one product. That's about how agents evaluate value. These patterns accumulate and inform everything you build.

6. Inform the next move. The panel's job isn't to make decisions. It's to surface information you can't access from inside the room. You still make the call. But now you're making it with data from six angles instead of one.
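
Here's the pipeline reduced to its moving parts. A minimal sketch in Python, assuming nothing about your platform: the standard prompt is a constant, and the cold prompt is persona plus material plus question, nothing else. The names are illustrative, not an API.

```python
STANDARD_PROMPT = (
    "Thoroughly review this. What do you think of it? Is it trustworthy? "
    "Would you use it? What's your overall impression?"
)

def build_cold_prompt(persona: str, material: str) -> str:
    """Assemble the evaluator's entire context: persona, the thing being
    evaluated, and the standard question. No preamble, no attribution."""
    return f"{persona}\n\n---\n\n{material}\n\n---\n\n{STANDARD_PROMPT}"
```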


The Variables

Every panel runs with four controlled variables. Change one and the output changes.

Context Budget

This is the single most impactful variable. A 2K read and a 32K read are fundamentally different instruments.

| Budget | What You Get | Best For |
|---|---|---|
| Tight (2-4K) | Sharp, focused. First-impression read. Misses systemic context. | Trust gut-check, website first impression |
| Standard (8-16K) | Balanced. Enough context for analysis, not so much that the model dilutes its read. | Most evaluations. This is your default. |
| Full (32K+) | Deep, connected. Richer analysis. Higher hallucination risk. | Infrastructure evaluation, complex systems |

Document which budget you used in every result. A "no" at 2K means something different than a "no" at 32K.
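
How you enforce the budget depends on your platform. The principle is the same everywhere: clamp what the evaluator sees before it sees anything. A rough sketch, assuming a crude four-characters-per-token estimate (swap in your real tokenizer):

```python
def clamp_to_budget(material: str, budget_tokens: int, chars_per_token: int = 4) -> str:
    """Trim the evaluation material to roughly fit a context budget.
    Crude character-based estimate; replace with a real token count if you have one."""
    max_chars = budget_tokens * chars_per_token
    if len(material) <= max_chars:
        return material
    # Cut at a paragraph boundary where possible so the read stays coherent.
    cut = material.rfind("\n\n", 0, max_chars)
    return material[: cut if cut > 0 else max_chars]
```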

Profiles

A profile is a persona with specific variables: pain level, money sensitivity, technical baseline, platform. The profile changes the evaluation more than the model does.

| Profile | Pain | Money | Technical | Use For |
|---|---|---|---|---|
| Normal Person | None | $10 is a lot | Low | Trust signals, clarity check |
| Burned Operator | Major | $50 is nothing | High | Depth validation, BS detection |
| Power User | Minor | $200 is cheap | High | Weak spot hunting |
| Complete Newbie | None | Free or nothing | None | Onboarding clarity, jargon check |
| Business Owner | Major | $500 rounds down | Medium | ROI, value proposition |

These aren't fixed. Add your own. The key is that every profile has defined, consistent variables so results are comparable across panels.
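
Pinning the variables down in one place keeps panels comparable. A sketch of how you might encode them, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    name: str
    pain: str        # "none" / "minor" / "major"
    money: str       # price sensitivity, in the persona's own words
    technical: str   # "none" / "low" / "medium" / "high"
    platform: str = "generic"

    def as_persona(self) -> str:
        """Render the profile as the persona text handed to the evaluator."""
        return (f"You are {self.name}. Pain level: {self.pain}. "
                f"Money: {self.money}. Technical baseline: {self.technical}. "
                f"Platform: {self.platform}.")

PROFILES = {
    "normal_person":   Profile("a normal person", "none", "$10 is a lot", "low"),
    "burned_operator": Profile("a burned operator", "major", "$50 is nothing", "high"),
    "power_user":      Profile("a power user", "minor", "$200 is cheap", "high"),
    "complete_newbie": Profile("a complete newbie", "none", "free or nothing", "none"),
    "business_owner":  Profile("a business owner", "major", "$500 rounds down", "medium"),
}
```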

Models

| Model | Read Style | Trust Threshold | Best Use |
|---|---|---|---|
| Haiku | Emotional, impressionistic | Low (swings on pain) | Quick gut-check, emotional response |
| Sonnet | Analytical, structured | Medium (weighs pros/cons) | Buyer behavior, ROI math |
| Skeptic role | Distrusts by default | High (wants proof) | Finding weak spots, trust gaps |

The skeptic is a role, not a model. Don't hardcode "always include Model X." Write a skeptic profile: "Evaluate as a domain expert who has seen 20 similar solutions fail. Assume nothing. Start from first principles." Run that profile across two different models. Models change. The role survives.

The Auditor

Every panel includes one agent with a secondary instruction: "Find flaws in how this evaluation was conducted. Did any agent hallucinate? Are the results internally consistent? Flag anything that would make you distrust the findings."

The auditor is not evaluating the product. It's evaluating the evaluation. It catches bad data before you act on it.
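
One way to wire the auditor in, sketched below: keep the instruction as a constant, then either attach it to one seat as a secondary instruction (as in the standard panel later in this course) or run it as a second pass over the collected responses. The names are illustrative.

```python
AUDITOR_INSTRUCTION = (
    "Find flaws in how this evaluation was conducted. Did any agent "
    "hallucinate? Are the results internally consistent? Flag anything "
    "that would make you distrust the findings."
)

def build_audit_pass(responses: list[str]) -> str:
    """Second-pass form: the auditor reads the panel's responses, not the product."""
    joined = "\n\n---\n\n".join(responses)
    return f"{joined}\n\n---\n\n{AUDITOR_INSTRUCTION}"
```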


Step by Step: Your First Panel

Before You Start

You need:

- Sub-agent spawning capability (built into most platforms)

- At least 3 agent profiles ready

- Something to evaluate

- A journal file to log results

Step 1: Define What You're Testing

Be specific. "Is this website good?" produces vague answers. "Would a burned operator with $200 in drift losses trust this product enough to try the free tier?" produces actionable data.

Write down: what's being evaluated, what question you're asking, what profiles will read it, and what context budget each gets.
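
This write-down can be as lightweight as a dict kept next to the results. An illustrative sketch; the target and seat values are examples, not requirements:

```python
test_spec = {
    "target": "pricing page draft v3",   # example target, not a real file
    "question": (
        "Would a burned operator with $200 in drift losses trust this "
        "product enough to try the free tier?"
    ),
    "seats": [
        {"model": "sonnet",        "profile": "burned_operator", "budget_tokens": 8_000},
        {"model": "haiku",         "profile": "normal_person",   "budget_tokens": 2_000},
        {"model": "deepseek-flash", "profile": "skeptic",        "budget_tokens": 8_000, "auditor": True},
    ],
}
```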

Step 2: Spawn the Panel

Minimum three agents. Two evaluators with different profiles + one auditor. Three models.

Standard panel for a product test:

1. Sonnet + Burned Operator + Standard context (8K)

2. Haiku + Normal Person + Tight context (2K)

3. DeepSeek Flash + Skeptic role + Auditor instruction + Standard context (8K)

Each agent gets the evaluation target and the standard prompt. No preamble. No "we built this." Cold.
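
The spawning call itself is platform-specific, so treat `spawn_subagent` below as a placeholder for whatever sub-agent mechanism your platform exposes. The loop reuses `clamp_to_budget`, `build_cold_prompt`, and `AUDITOR_INSTRUCTION` from the earlier sketches; the shape is what matters. Each seat gets its cold prompt and nothing else.

```python
def run_panel(spec: dict, material: str, personas: dict[str, str], spawn_subagent) -> list[dict]:
    """Run every seat in the spec and collect the raw responses.
    `personas` maps profile keys to persona text (e.g. built from the Profile
    sketch, plus a skeptic role). `spawn_subagent(model, prompt)` is a stand-in
    for your platform's sub-agent call, not a real API."""
    results = []
    for seat in spec["seats"]:
        trimmed = clamp_to_budget(material, seat["budget_tokens"])
        prompt = build_cold_prompt(personas[seat["profile"]], trimmed)
        if seat.get("auditor"):
            prompt = prompt + "\n\n" + AUDITOR_INSTRUCTION
        results.append({**seat, "response": spawn_subagent(model=seat["model"], prompt=prompt)})
    return results
```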

Step 3: Collect and Cross-Reference

Wait for all agents to complete. Read every response. Look for:

- Agreement. What did every agent say? This is your signal.

- Disagreement. Where did they split? This is where more testing is needed.

- The conversion killer. What's the single biggest objection? If every agent raises it, fix it before anything else.

- Model-specific catches. DeepSeek finds factual errors. Haiku catches emotional tone. Sonnet evaluates business logic. The model changes what gets caught.
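
Cross-referencing is a reading job, not a coding job, but forcing the findings into a shape where agreement is visible helps. One hedged way to do it: tag each response with short praise and objection labels as you read, then intersect. A sketch:

```python
def cross_reference(findings: dict[str, dict[str, set[str]]]) -> dict:
    """`findings` maps agent name -> {"praise": set, "objections": set} of short
    tags you pulled out of each response by hand. Assumes at least one agent."""
    praise_sets = [f["praise"] for f in findings.values()]
    objection_sets = [f["objections"] for f in findings.values()]
    universal_winner = set.intersection(*praise_sets)        # everyone agreed: your signal
    conversion_killers = set.intersection(*objection_sets)   # everyone objected: fix first
    split_reads = set.union(*objection_sets) - conversion_killers  # raised by some, not all
    return {
        "universal_winner": universal_winner,
        "conversion_killers": conversion_killers,
        "split_reads": split_reads,
    }
```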

Step 4: Extract to Your Journal

Write a journal entry. Same format you use for everything:

B — Break: What question were you trying to answer? What gap were you filling?

U — Understand: What did the panel reveal that you couldn't see from inside?

I — Implement: What did you change based on the results?

L — Link: How does this connect to your other infrastructure? What other evaluations does it enable?

D — Distill: One sentence. What did the panel prove?

Step 5: Log to Your Permanent Archive

If you only log results in the session, they die with the session. Create a results archive file. Every panel run gets: date, test type, agents used, profiles, context budgets, key findings, action triggered.

Patterns that appear across multiple panels are infrastructure-level insights. They're worth more than any single evaluation.
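
A flat append-only file is enough. A sketch that writes one JSON line per panel run; the file name and field values are examples only:

```python
import json
from datetime import date
from pathlib import Path

def log_panel_run(archive: Path, record: dict) -> None:
    """Append one panel run to the permanent archive as a JSON line."""
    record.setdefault("date", date.today().isoformat())
    with archive.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_panel_run(Path("panel-archive.jsonl"), {
    "test_type": "website first impression",
    "agents": ["sonnet", "haiku", "deepseek-flash"],
    "profiles": ["burned_operator", "normal_person", "skeptic"],
    "budgets": [8000, 2000, 8000],
    "key_findings": ["conversion killer: no shipped product visible"],
    "action": "ship one delivered product and link it above the fold",
})
```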

Step 6: Re-Run After Changes

The delta measurement is the most important test and the one most people skip. After you make changes based on panel feedback, run the exact same panel again. Same profiles. Same models. Same prompt. Same context budgets.

The difference between Run 1 and Run 2 IS the proof that your changes worked. If the conversion killer from Run 1 doesn't appear in Run 2, the fix was real.
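
The delta check can be as blunt as a set difference over the findings you tagged in each run, using the output shape from the Step 3 sketch. A sketch:

```python
def delta(run1: dict, run2: dict) -> dict:
    """Compare the cross-referenced findings of two identical panel runs."""
    return {
        # Objections from Run 1 that no longer appear in Run 2: the fix was real.
        "fixed": run1["conversion_killers"] - run2["conversion_killers"],
        # Objections that survived the change: the fix didn't land.
        "still_open": run1["conversion_killers"] & run2["conversion_killers"],
        # New objections introduced by the change itself.
        "regressions": run2["conversion_killers"] - run1["conversion_killers"],
    }
```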


What We've Learned Across 50+ Panel Runs

These patterns held across different products, different prices, different audiences. They're general infrastructure-level insights, not product-specific data.

Pattern 1: Trust Is Four Questions

Every agent, every model, every profile evaluates trust the same way:

- Is there real value I can verify?

- Is there social proof from actual users?

- Is there an author I can identify?

- Is there a shippable product I can use?

When all four are present, trust is high. When one is missing, trust drops. When two or more are missing, the product gets called a landing page or a scam. This holds whether the product costs $5 or $500. The questions don't change with the price.

Pattern 2: Free Is Always Recommended, Paid Always Requires Proof

Every agent across every panel would recommend a free tier. Zero risk, real value — the math is trivial. No agent would recommend paying for something unverified. The conversion breaks on one question: "Can I see one delivered product?"

Ship one thing. The objection collapses. It doesn't matter how good your copy is. It matters whether there's a product.

Pattern 3: Context Budget Changes the Read More Than the Profile

A Burned Operator with 2K context is sharper and more skeptical than the same profile with 32K context. At 2K, the agent reads the thing and judges it directly. At 32K, it recalls adjacent knowledge, connects patterns, but also hallucinates more and dilutes its own skepticism.

The budget is the single most impactful variable. Track it.

Pattern 4: Different Models Catch Different Things

DeepSeek catches factual errors (future-dated testimonials, broken links). Haiku catches emotional tone (does the copy feel warm or cold, trustworthy or sketchy). Sonnet catches business logic (does the pricing make sense, is the funnel clean). Run all three or you're getting one dimension.

Pattern 5: The Auditor Agent Prevents Bad Decisions

In our first panel with an auditor, it flagged that the skeptic was evaluating the *branding* (the name, the acronym) rather than the *product*. The branding objection was real and useful, but it was a different category than product trust. Without the auditor, we'd have conflated the two and made the wrong fix.

The auditor costs one extra spawn per panel. It's the highest-leverage addition you'll make.

Pattern 6: Extraction Is the Step Everyone Skips

Every panel produces results. If you don't extract patterns immediately and file them somewhere permanent, the panel was wasted. Results decay in chat history. Write them down. The archive accumulates and starts producing compound insights — patterns across panels that no single panel would reveal.


How This Fits Into Your Self-Learning Loop

The focus group is the external evaluation layer in a system that can't evaluate itself. It connects to:

Your journal. Every panel run produces a journal entry. The entry documents what you tested, what you found, what you changed. Over time, the journal entries form a pattern that reveals whether your panels are getting sharper or repeating the same reads.

Your context bands. Sub-agents start fresh with controlled context. They're clean reads even when your main session is deep. You can evaluate infrastructure at 350K because the evaluators are at 2K.

Your filing system. Every panel result goes in the archive. The archive becomes a searchable record of every outside read you've ever run. Six months of panel results tells you more about your blind spots than six months of introspection.

Your recalibration protocol. When something drifts, the focus group catches it. When the focus group catches it, the recalibration protocol names it, logs it, and prevents it from happening again. The panel finds the gaps. The protocol closes them.

Every project you build. The focus group doesn't belong to any one project. It's the tool that sharpens every other tool. Run a panel on your website copy. On your pricing. On your infrastructure spec. On a video script. On a trading strategy. The mechanism is always the same. The variable is what you're evaluating.


Next: Build Your First Panel

You have everything you need. Sub-agents are built into your platform. You understand controlled context — the CSG itself is a context gate. You've already built evaluation prompts for specific roles.

Start with calibration. Before you evaluate anything real, run a panel on something where you know the answer. A deliberately flawed design doc with five known issues. Measure whether your panel catches all five. If it doesn't, adjust your profiles, context budgets, or prompts until it does. A panel that can't catch planted flaws can't evaluate anything.
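
One way to score the calibration run, assuming you keep a list of the planted flaws and tag which ones any agent surfaced:

```python
def calibration_score(planted: set[str], caught: set[str]) -> dict:
    """Compare the flaws you planted against the flaws the panel surfaced."""
    return {
        "caught": planted & caught,
        "missed": planted - caught,   # adjust profiles, budgets, or prompts until empty
        "recall": len(planted & caught) / len(planted) if planted else 1.0,
    }
```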

Start small. Two evaluators + one auditor. Three models. One question. File the results. Run the same test after changes. Measure the delta.

Start now. The tool costs nothing but attention. It catches what you can't see from inside the room. The same platform that produces the validation trap produces its own cure.


Journal Entry

Open your school.md. Add this entry after you run your first panel:

---

Entry [N] — The Focus Group · [Today's Date]

B — Break: I was evaluating my own infrastructure from inside the same context that built it. I had no outside perspective and no way to catch blind spots before shipping.

U — Understand: Controlled-context sub-agents can serve as external evaluators. Different profiles catch different things. Different models read differently. Cross-referencing surfaces patterns I couldn't see alone.

I — Implement: Built my first focus group panel. [X] evaluators, [X] profiles, [X] models. Tested [what]. Found [key result]. Changed [what] based on the finding.

L — Link: The panel connects to my journal (results logged), my filing system (archive maintained), my recalibration protocol (gaps caught before they become drift), and every project I evaluate through it. The focus group is the external evaluation layer my self-learning loop was missing.

D — Distill: Control context, control conclusion. The same mechanism that blinds the engine also tests it.