Scaffolding

In March 2026, Anthropic upgraded its skill-creator, adding four features: Evals, Benchmark, Comparator, and Description Tuning.

If you’ve never used Agent Skills, these sound like marketing buzzwords. If you have, you probably know the problem they’re trying to solve: skills that “seem to work.” Do they actually work?

First, a quick note on what Skills are.

In October 2025, Anthropic launched Agent Skills, a modular “skill pack” system. A folder contains a SKILL.md (instructions), some scripts, and resources. Claude loads them when needed. You can make the model write weekly reports your way, or review contracts by your standards.
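To make the shape concrete, here is a minimal sketch of such a folder. The frontmatter fields (`name`, `description`) follow Anthropic's published format; the weekly-report content and the file names under `scripts/` and `resources/` are invented for illustration:

```markdown
<!-- weekly-report/SKILL.md -->
---
name: weekly-report
description: Drafts the team's weekly status report in our house format.
---

# Weekly Report

1. Read the raw notes the user provides.
2. Group items under three headings: Shipped, In Progress, Blocked.
3. Keep each bullet under 20 words.

Supporting files sit alongside this one (e.g. scripts/collect_notes.py,
resources/template.md) and are loaded only when the skill is triggered.
```

The `description` field matters more than it looks: it is what the model reads to decide whether to fire the skill at all, which is exactly the surface Description Tuning works on.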

But after writing a skill, questions arise: Will this skill still work on the new model? How many successes out of ten cases? Did my tweak make it better or worse?

The answer used to be: you don’t know, because there was nowhere to test it.

This upgrade fills that gap.

Evals lets you describe “given this input, expect that output,” and the system runs validation automatically. Benchmark runs standardized tests in batch, giving you pass rates, latency, token consumption. Comparator does blind A/B testing, letting a neutral AI judge which version is better. Description Tuning analyzes trigger phrases to reduce misfires.
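The core idea behind Evals and Benchmark can be sketched in a few lines. Everything below is hypothetical: the real skill-creator's case format and runner are Anthropic's, and `run_skill` here is a stub standing in for an actual Claude call with the skill loaded.

```python
# Hypothetical sketch of "given this input, expect that output" eval cases
# plus a batch pass-rate, in the spirit of Evals and Benchmark.
from dataclasses import dataclass


@dataclass
class EvalCase:
    given: str    # input handed to the skill
    expect: str   # substring the skill's output must contain


def run_skill(text: str) -> str:
    # Stand-in for invoking the model with the skill loaded.
    # This stub always flags liquidated damages, right or wrong.
    return f"Reviewed: {text}. Flagged liquidated damages clause."


def benchmark(cases: list[EvalCase]) -> float:
    """Run every case and return the fraction that passed."""
    passed = sum(1 for c in cases if c.expect in run_skill(c.given))
    return passed / len(cases)


cases = [
    EvalCase("contract with a liquidated damages clause", "liquidated damages"),
    EvalCase("plain NDA, nothing unusual", "no issues found"),
]
print(benchmark(cases))  # → 0.5: the stub over-fires on the harmless NDA
```

The second case is the interesting one: a skill that "seems to work" passes the obvious input and quietly fails the case where it should have stayed silent. That is precisely the kind of failure you cannot see without a harness.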

In short, it puts four dashboards into skill creation.

But I don’t want to talk about features. There’s a question I’ve been turning over.

Shouldn’t experience already be validated before it’s packaged? Why validate it again after packaging?

The answer became clear after shifting perspective.

In the human world, the implicit premise of “experience works” is: executed by a human with common sense and judgment. A lawyer knows that “pay attention to liquidated damages clauses” carries countless unspoken nuances. But when you package that experience into a Skill and hand it to Claude, the executor becomes a statistical model. It might literally interpret “pay attention to liquidated damages clauses” and try to find them in completely irrelevant documents.

Experience that works on humans doesn’t necessarily work on models. It’s not the experience’s fault—the executor changed.

Worse, the executor keeps changing. A skill that worked perfectly on Sonnet 4 might fail on Opus 4. The model’s “thinking style” shifts; its sensitivity to instructions shifts. An instruction that once worked may suddenly break on a new model version.

So a skill doesn’t need one-time validation—it needs a continuous validation system. That’s what Evals and Benchmark do.


There’s a simple test: remove an engineering layer. Does the model’s useful output decrease?

If yes—it’s a valuable pipeline. If no—it’s decoration.

Remove Evals, and you can’t know whether your skill still works on the new model. Useful output decreases—pipeline. Remove Benchmark, and you don’t know if your skill is stable or flaky. Remove Comparator, and you don’t know if the new version is truly better or an illusion. Remove Description Tuning, and your skill might fire in the wrong places, producing garbage.

So this system itself is a pipeline for measuring skill pipelines.


What’s the real relationship between Skills and the model?

One possibility: skills eventually get internalized by the model. This is speculation, but not without reason.

When enough people use similar skills (reviewing contracts, writing reports, analyzing data), the experience patterns encapsulated in those skills become high-quality signals. If those signals feed into training, the next generation model may be able to do these things natively. The skill retires.

But retirement isn’t the end. Experts move on to encapsulate the next set of things the model can’t yet do.

It’s a cycle where external skills continually nourish the model core. Skills are like scaffolding—removed after each floor is built, but the building keeps rising.


There’s a subtle contradiction here.

Anthropic provides a no-code tool that lets domain experts worldwide—lawyers, accountants, product managers, screenwriters—package their experience into AI-executable skills. This looks like mining tacit knowledge.

But what experts actually write into SKILL.md are explicit instructions. True tacit knowledge—the “I know how to do it but can’t quite explain it” part—is precisely what can’t be written down.

This circles back to the earlier point. Experts think they’ve packaged their full experience, but when the model executes, it discovers the unspoken common sense is missing. The skill breaks. Evals catches it. The expert iterates.

This cycle itself is what gradually approximates tacit knowledge. It’s not extracted all at once—it’s approached through repeated loops of “package, test, fail, revise.”


This is also where the real moat lies.

When thousands of skills have gone through hundreds of rounds of validation, what accumulates isn’t Markdown files—those can be copied anytime. What accumulates is the validation infrastructure itself: how many model versions each skill has been tested against, which edge cases cause failures, which instruction patterns work best for the current model.

When users migrate to another model, what they lose isn’t the skill text—it’s the confidence that “this skill has been proven effective through 200 tests.” That confidence can’t be exported.


So the test for whether a Skill is worth building is: Does it make the model produce more useful things?

If yes, build it. If no, stop.

And the test for whether a validation system is worth building is: Does it tell you whether the model is producing useful things?

If yes, build it.

2026.03.07