
Friday night a few weeks ago. Everyone's out. I'm home with 7 worktrees open across 3 repos, 2-3 AI coding sessions each. My machine is at 97% swap. And I'm writing in my journal instead of coding: "I probably shouldn't write any code until I have the basic agentic system in place."
I'd been using Claude Code every day for ~3 months. Running it across a dozen projects, personal and work. Best tool I'd ever used. But each project was its own island. The bookkeeping automation didn't share learnings with the content pipeline. The growth engineering setup didn't inform the blog workflow. Every skill I built worked in isolation, and the integration between them was me, manually copy-pasting patterns and re-discovering solutions I'd already found.
I was using the tool. I hadn't built the system around it.
One layer gets all the attention
I sometimes watch this guy on YouTube, IndyDevDan, who breaks agentic systems into four layers: context, tools, prompts, model.
The prompt gets all the attention. Better instructions, maybe a system message, done. That's one layer out of four. The other three are where the system compounds.
Here's what I learned building 55 skills across personal and work projects. For each layer, I'll show what I built, what techniques actually matter, and where to start.
Context: the compound layer
Every project has a CLAUDE.md that loads automatically at the start of every session. Mine is 182 lines. Role, current projects, file conventions, writing style. But the interesting part is what accumulates over time:
> Anti-AI Slop (IMPORTANT)
Never use: [30+ banned words that spike in AI-generated text]
Replace: [15+ word swaps, simpler alternatives]
Never open with: "In today's...", "Let's dive in"
Never close with: "In conclusion", "At the end of the day"
Structural bans: dramatic reversals, binary contrast punchlines,
staccato fragmentation, zoom-out conclusions

That's from a real file. There's a JSON registry behind it with 30+ hard-banned words, 15+ soft-banned words, and suggested replacements. An automated linter reads the registry and checks every draft before it ships.
The registry grows. Every time I flag a tone pattern, it goes in. Every time a reviewer catches AI-sounding language, it goes in. The system gets more opinionated with every piece of content because the context accumulates judgment.
Context is the difference between a tool with amnesia and a tool with opinions. My anti-slop registry started with 5 words. It has 45 now. Nobody decided to add 40 more. They accumulated through feedback and iterations. If the correction lives in your head, it dies when the session ends. If it lives in the config, it persists.
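The linter itself doesn't need to be clever. Here's a minimal sketch in Python; the registry fields and banned words are illustrative, not my actual schema:

```python
import json
import re

# Illustrative registry shape -- the real file has 45+ entries.
REGISTRY = json.loads("""
{
  "hard_banned": ["delve", "tapestry", "game-changer"],
  "replacements": {"utilize": "use", "leverage": "use"}
}
""")

def lint_draft(text, registry):
    """Return a list of violations; an empty list means the draft is clean."""
    issues = []
    for word in registry["hard_banned"]:
        if re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE):
            issues.append(f"hard-banned word: {word}")
    for word, repl in registry["replacements"].items():
        if re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE):
            issues.append(f"replace '{word}' with '{repl}'")
    return issues
```

Wire something like this into a hook and every draft gets checked without anyone remembering to run it.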

Beyond the basics
A CLAUDE.md file is just the starting point. Three mechanisms make context really compound:
Hooks run shell commands before or after every tool call. Mine run a linter on every draft automatically. You can trigger tests after file edits, validate formatting before commits, or inject environment data before each session. The agent doesn't have to remember these checks; the hooks run them every time.
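As a concrete example, a post-tool-use hook in `.claude/settings.json` can run a lint script after every file write. The matcher and script path here are placeholders; check the Claude Code hooks docs for the exact schema:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "python scripts/lint_draft.py" }
        ]
      }
    ]
  }
}
```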
Rules files (.claude/rules/) load context conditionally. Instead of one giant CLAUDE.md, you split context into modules: one for active projects, one for writing style, one for API conventions. The right context loads based on what you're working on.
Memory persists learnings across sessions. Not chat history. Structured observations: "this user prefers X over Y", "last time this broke because Z", "the deploy workflow changed to W." The agent starts every new session a little smarter than the last one. (Claude Code docs cover all three.)
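The mechanics can be as simple as a JSONL file of observations. This is a hypothetical sketch; the file location and field names are my own convention, not anything Claude Code mandates:

```python
import json
from pathlib import Path

def remember(memory_file, kind, note):
    """Append one structured observation; it survives across sessions."""
    path = Path(memory_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps({"kind": kind, "note": note}) + "\n")

def recall(memory_file):
    """Load all observations, e.g. to inject at the start of a session."""
    path = Path(memory_file)
    if not path.exists():
        return []
    return [json.loads(line) for line in path.open()]
```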
The engineers on my team who ship fastest invested here first. For example, my Pi harness has a datetime extension that injects the current date into every prompt automatically, plus a mode switcher that cycles between Normal, Plan, Quick Plan, and Execute modes with a keyboard shortcut. Small customizations that save 30+ seconds each, hundreds of times.
Tools: what the agent can touch
The agent's value scales with what it can reach.
My setup connects to Slack, Gmail, Google Docs, Sheets, Notion, Ahrefs, a CRM, and a research router. None of these are plugins from a marketplace. They're Python scripts the agent calls as shell commands. Ugly, but they work.
The bookkeeping system is the one I explain first because everyone relates to it. It parses my German bank CSV, classifies every transaction against an auto-updated 23-vendor registry, pulls matching invoices from Gmail, renames them, and organizes everything into monthly folders:
> Bookkeeping Q1 - 2026/
├── January 2026/
│ ├── Incoming Invoices/
│ │ ├── 020126_VendorA.pdf
│ │ ├── 030126_VendorB.pdf
│ │ └── 090126_VendorC.pdf
│ └── Outgoing Invoices/
│ └── 050126_VendorD.pdf
├── February 2026/
│ └── ...
└── March 2026/
    └── ...

Zipped for the accountant. This used to take me a full day. Even after I hired a bookkeeper, collecting missing invoices and checking accuracy still took me 3+ hours.
Takes about 20 minutes now.
I'll be honest: the first bookkeeping run with my agent was painful. But by the second iteration, with a closed feedback loop in place, the agent did 98% of the work and suggested contacting 4 vendors about missing invoices.
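The German-format parsing that broke the first run fits in a few lines. A sketch with made-up column names; real bank exports vary:

```python
import csv
import io

def parse_amount(value: str) -> float:
    """German format: '.' groups thousands, ',' is the decimal separator."""
    return float(value.replace(".", "").replace(",", "."))

def parse_bank_csv(raw: str):
    """German bank exports often use ';' as the field delimiter."""
    return [
        {"date": row["Datum"], "payee": row["Empfänger"],
         "amount": parse_amount(row["Betrag"])}
        for row in csv.DictReader(io.StringIO(raw), delimiter=";")
    ]
```

Once this survives a few months of real exports, the classification and Gmail steps stack on top of it.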
Beyond the basics
Each new tool connection is a multiplier, not an addition. The bookkeeping agent works because it can read bank data AND search email AND write files AND organize folders. Remove any one connection and the whole workflow breaks.
Two approaches to connecting tools:
MCP servers (Model Context Protocol) let you connect external services through a standard interface. Claude Code, Pi, and other tools support them natively. Connect Notion, GitHub, Slack, databases. Less code than custom scripts, more standardized.
Custom scripts do exactly what you need and nothing else. My Google API scripts are 6 Python files that handle OAuth, read/write Gmail, Calendar, Docs, and Sheets. No framework, no dependencies beyond the Google client library. They break when APIs change. I fix them and they work for another 6 months.
Both approaches compound. The more tools your agent can reach, the more complex workflows it can handle without you switching between apps.
Prompts: skills beat improvisation
This is the layer people spend the most time on, usually in the wrong way. Writing a better one-shot prompt helps. Writing a multi-phase execution plan helps a lot more.
Each skill is a markdown file (or folder of them) with phase-by-phase instructions. The eng-blog skill is one of my most complex, 6,843 lines across 66 files:
> .claude/skills/eng-blog/
├── SKILL.md # commands, phases, guardrails
├── phases/ # ingest, outline, draft, review, SEO, publish
├── references/ # article patterns, rubric, code safety rules
├── scripts/ # diagram generation, page builder
└── learnings.md # accumulated feedback from reviewers

An engineer rambles about a migration for 5-10 minutes. The system transcribes it, scrapes Slack for related context, generates 9 competing outlines from 3 different angles, scores them, drafts the post, checks it against the anti-slop registry, generates diagrams, and publishes everything to a Google Doc for team review. I ran the full pipeline this week. Made about 5 real decisions. The system handled the other 200.
The important thing: the agent doesn't improvise inside a skill. Each phase has explicit instructions, expected inputs, expected outputs, and criteria for when to stop. When something goes wrong, the fix goes into the skill file. Every future run benefits.
Beyond the basics
The difference between a prompt and a skill is structure. A prompt is a single instruction you send once. A skill is an execution plan that improves with every run (if you set it up right).
The decomposition pattern that works:
- SKILL.md is the entry point. Commands, phases, guardrails. The agent reads this first and knows what to do.
- phases/ breaks the workflow into steps. Each phase file has explicit inputs, outputs, and stop criteria. The agent finishes one phase before starting the next.
- references/ holds supporting context: style guides, rubrics, templates, API docs. Loaded on demand, not all at once, so you don't waste context window.
- scripts/ contains automation the agent calls. Python for data processing, bash for system commands. The agent knows these exist and calls them at the right phase.
- learnings.md is where the skill compounds. Every reviewer correction, every failed run, every edge case goes here. The agent reads it as part of one of the phases. This file is the difference between a skill that executes and one that improves.
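The runtime contract is simple enough to sketch. The phase names are hypothetical, and `execute` stands in for the agent working through one phase file:

```python
from pathlib import Path

def run_skill(skill_dir, phases, execute):
    """Run phases strictly in order; each phase's output feeds the next.

    Failures are appended to learnings.md so the next run starts smarter.
    """
    learnings = Path(skill_dir) / "learnings.md"
    output = None
    for phase in phases:
        try:
            output = execute(phase, output)
        except Exception as exc:
            # Feed the failure back into the skill before re-raising.
            with learnings.open("a") as f:
                f.write(f"- phase '{phase}' failed: {exc}\n")
            raise
    return output
```

The point isn't the loop; it's that failure has somewhere to go other than your memory.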
Inside complex skills, subagents handle independent work in parallel. My snapshot skill launches 7 agents simultaneously: one reads my calendar, one checks health data, one scans recent journals, one loads project status. They all report back and the main agent synthesizes. What would take 10 minutes sequentially finishes in 2.
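The fan-out/fan-in shape is the same whatever the subagents do. A sketch with stub subagents; real ones would each be a model call with its own context:

```python
import asyncio

async def subagent(name):
    # Stub: a real subagent would run its own prompt against its own data.
    await asyncio.sleep(0.05)  # simulates independent I/O-bound work
    return name, f"{name} report"

async def snapshot():
    names = ["calendar", "health", "journals", "projects"]
    # gather() runs all subagents concurrently; total wall time is the
    # slowest one, not the sum of all of them.
    results = await asyncio.gather(*(subagent(n) for n in names))
    return dict(results)  # the main agent synthesizes from here

report = asyncio.run(snapshot())
```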
Model: the least interesting layer
I use the same model for most things. The architecture around the model matters more than which one sits inside it.
This is counterintuitive because model selection is what everyone debates. But once you have context that compounds, tools that expand reach, and skills that don't improvise, the model becomes a component, not the system.
The one thing worth customizing: route fast lookups to smaller, cheaper models and save the expensive model for complex reasoning. Most tool calls are simple and don't need the most powerful model in the lineup.
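A router can be a lookup table. The tier names here are hypothetical; swap in whatever model IDs your provider actually uses:

```python
# Hypothetical tiers -> model names; substitute your provider's real IDs.
ROUTES = {
    "simple": "haiku",    # lookups, formatting, classification
    "medium": "sonnet",   # most everyday tasks
    "complex": "opus",    # multi-step reasoning, architecture
}

def pick_model(complexity: str) -> str:
    """Route a task to the cheapest model that can handle it."""
    return ROUTES.get(complexity, ROUTES["medium"])
```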
My current go-to models (as of April 2026):
- Opus 4.6 for complex tasks and reasoning
- Sonnet 4.6 for medium complexity, smart enough for most things
- Haiku 4.6 for simple tasks, fast execution
- GPT 5.3 Codex for coding and reviews
- Gemini 3.1 Pro for reviews and design

What breaks
Every single skill broke on the first run.
The bookkeeping CSV parser choked on German number formats. Commas as decimal separators, semicolons as delimiters. Gmail search pulled wrong attachments for three vendors before I got the query filters right. This week, a hero image generator rendered a blog title but the text was too long for the template and got cropped off the edges. I deployed it to a Google Doc without checking. Had to fix it in front of the team.
I found coding sessions that had been running for days doing nothing useful. Memory climbing because sessions don't clean up after themselves if the parent process dies. Context windows filling with irrelevant tool results because I hadn't designed the information flow carefully enough.
Agentic systems fail silently. A chatbot tells you it can't help. An agent does it wrong and you don't find out until the output is garbage or the process has been burning RAM for 72 hours.
If you only see the polished output (bookkeeping on autopilot, 8-phase blog pipeline), you're missing the weeks of broken runs, wrong outputs, and scripts that work on the third try because the first two revealed edge cases nobody anticipated.
The compound effect
I came into my current engineering role from a non-engineering background only a few weeks earlier. The engineers on my team have been writing TypeScript for years. I'm shipping production features alongside them, and the system is a big part of why I can keep up. I wrote about what that week actually looked like in a previous post.
Code quality stopped being the differentiator. We have models that build live UI on demand and solve problems in real-time. Jack Dorsey put it well recently: the capability is here, we just haven't put the components together yet. The system around the model is the missing piece. That's what compounding across four layers actually looks like.
Every broken run that gets fed back into the context makes the next run smarter. Every ugly script that connects a new data source expands what the system can do. Every skill that gets a fix becomes more reliable. The system accumulates judgment I don't have to carry in my head.
One of the engineers on my team is having Claude Code proactively watch for tool-call errors on our platform, create a Linear ticket, investigate, and write draft PRs. He built a system around it.

Which layers are you running on defaults?
The four layers exist whether you design them or not. Three of them usually stay on defaults.
Share this article with your Claude Code agent (or Pi, or whatever you use) and paste the checklist below. Let the agent audit your current setup and tell you where the gaps are.
> Read this article about the four layers of agentic engineering.
Then audit my current setup against each layer.

For each layer, tell me:
What I'm currently doing (check my config files, project structure)
What I'm missing
One specific thing I should build this week
Context layer:
Check: Do I have a CLAUDE.md? How detailed is it?
Check: Am I using hooks, rules files, or memory?
Look for: Context I keep re-explaining every session
If gaps exist: Draft a CLAUDE.md with my role, project conventions,
and 3 rules based on corrections you've seen me make
Tools layer:
Check: What external services am I accessing manually mid-session?
Check: Do I have any shell scripts or MCP servers connected?
Look for: Workflows where I copy-paste between apps
If gaps exist: Pick my most-used external service and write a
script that connects it, or suggest an MCP server to install
Prompts layer:
Check: Do I have any skills/custom commands set up?
Check: Am I giving the same multi-step instructions repeatedly?
Look for: My most-repeated workflow that could be a skill
If gaps exist: Turn that workflow into a skill file with phases,
expected inputs/outputs, and a learnings section
Model layer:
Check: Am I using the same model for everything?
If so: Suggest which of my tasks could route to a faster/cheaper model
Format your response as a scorecard:
| Layer | Current state | Biggest gap | This week's build |

I rebuild parts of my system every other day. It compounds anyway.