How I designed a system where agents and I build, remember, and improve together.
A lot of people have started building "personal OS" systems recently: systems that manage your work and daily workflows through an AI agent. AI coding agents can do more than write code. They can manage files, orchestrate workflows, and interact with external services. The idea of running your life from the terminal is suddenly practical, not theoretical.
I could have cloned someone else's setup. There are good ones out there. But I wanted to learn by building — and building from scratch turned out faster than finding the right existing one to adopt. How do you design skills that agents follow reliably? When do you use an MCP server versus a CLI tool versus a skill file? How do you structure context so agents stay useful across sessions? What breaks when you push agents beyond simple tasks?
I started playing around with AI coding agents in early 2026. About a month in, the system runs my daily life — managing side projects, triaging email, tracking expenses, planning work. It's not a toy or a weekend experiment. It's the system I actually live in.
The first few weeks were just exploration — trying things, seeing what agents could handle. But one friction kept coming back: agents don't remember anything. Every new conversation starts blank. You re-explain the project, re-describe what you did last time, re-state the decisions you already made. The longer a project runs, the worse it gets.
I wanted to spend my time on the actual work — building, discussing, solving problems — not manually feeding context into every conversation. So the first thing I built into my personal OS was a project/task/session layer that manages context for both me and the agent: what we're working on, what's been decided, and what to do next — without me re-explaining it every time. The rest of the system grew around it — each piece added as I hit the next limitation.
The entire system is markdown files in a single workspace and an AI coding agent running in a terminal — I use Claude Code, which is why you'll see CLAUDE.md and .claude/skills/ in the file structure. No database, no server, no special infrastructure.
```
workspace/
  CLAUDE.md                 <- system-wide rules, data model, coordination
  projects/
    side-project-a/
      PROJECT.md            <- identity, resume checklist, practices
      ROADMAP.md            <- milestones, progress log
  tasks/
    build-auth-flow.md      <- plan, decisions, session links
    migrate-database.md
  sessions/
    2026-03-15-auth-flow-draft.md
    2026-03-16-auth-flow-testing.md
  templates/
    task-template.md        <- defines file structure for new files
  .claude/
    skills/
      project/              <- teaches agent project workflows
      task/                 <- teaches agent task workflows
      session/              <- teaches agent session workflows
```
Why markdown? Agents read it natively — no special tooling, no parsing, no API. Why flat folders instead of nesting tasks under projects? Tasks and sessions can exist without a project, which keeps things flexible for quick one-off work. Why templates? So the agent knows exactly what structure to expect when creating or reading files.
CLAUDE.md is the system's entry point. The agent reads it at the start of every conversation. It defines the data model, the rules for when to create a task versus just a session, how context loads across layers, and which skill to reach for.
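As a rough illustration, the coordination rules in CLAUDE.md might read something like this — the wording and headings here are my paraphrase, not the actual file:

```markdown
## Data model
- Project: long-running effort with a PROJECT.md and ROADMAP.md
- Task: a piece of work with a plan; lives in tasks/, links to a milestone
- Session: one sitting of work; lives in sessions/, links to a task if any

## When to create what
- Quick one-off work: a session file is enough
- Work that needs a plan or spans multiple sessions: recommend a task
- A new long-running effort: suggest /project create

## Context loading
- Always load PROJECT.md (Resume Checklist) before task or session files
- Load the latest session last, and follow its Next Steps
```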
Skill files are step-by-step instructions for specific workflows — creating a project, planning a task, closing out a session. Slash commands like /project create or /task plan invoke them, and the agent follows the instructions inside to handle scaffolding, confirmation points, and updates across files.
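A skill file is just step-by-step instructions in markdown. A hypothetical sketch of a session close-out skill, with steps invented for illustration:

```markdown
# Skill: session close-out

1. Review the full conversation and draft a session file from
   templates/session-template.md: what was done, what was decided,
   and next steps.
2. Show the draft to the user before writing anything; wait for approval.
3. On approval, write sessions/YYYY-MM-DD-<slug>.md.
4. Cascade updates: check off completed plan items in the task file and
   append a dated one-line entry to the roadmap's progress log.
5. Propose knowledge promotions (practices, rule updates) as a list for
   the user to approve, reject, or adjust.
```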
The core system is available as a starter repo — personal-os. The rest of this article walks through how it works.
When I start something new, I run /project create — the agent asks for a description, importance level, and an optional first milestone, then scaffolds a project folder from templates with two files.
PROJECT.md holds the project's identity. When first created, most of it is sparse — the important parts fill up through use. Two sections matter most: the Resume Checklist, which tells the agent what to load and check when picking the project back up, and the project's practices, which accumulate conventions and lessons over time.
ROADMAP.md tracks milestones — what needs to happen, in what order, with dependency links between them. Each milestone has a key (like p1/auth) that connects it to tasks. The agent uses these to find the next milestone to work on — the first incomplete one with all dependencies met — so it always knows what to focus on next. A progress log at the bottom grows with dated one-line entries after each session.
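A sketch of what a roadmap might look like under this scheme — the milestone keys and entries are made up for illustration:

```markdown
# Roadmap — side-project-a

## Milestones
- [x] p1/schema — design the data model
- [ ] p1/auth — build the auth flow (depends: p1/schema)
- [ ] p2/billing — billing integration (depends: p1/auth)

## Progress log
- 2026-03-15: drafted auth flow, picked session-token approach
- 2026-03-16: auth flow tested end to end
```

Here the agent's next-milestone rule would pick `p1/auth`: the first unchecked milestone whose dependencies are all complete.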
Planning is conversational throughout — I don't fill in templates, I discuss with the agent and it captures the results.
It starts with the roadmap. When a project is created, ROADMAP.md has at most a first milestone. The real planning happens in conversation — we discuss the vision, break it into phases, figure out what depends on what. The agent writes the milestones; I review and adjust. As the project evolves, the roadmap evolves with it.
When it's time to work on a milestone, CLAUDE.md requires a requirements conversation first — what does it do, what are the constraints, what does "done" look like. This prevents mid-build pivots when the goal wasn't clear.
The agent then recommends creating a task — I can invoke /task create directly, but in most cases the agent suggests it on its own. CLAUDE.md tells the agent to recommend a task when work needs a plan or will span multiple sessions. For simpler work, a session file is enough.
The task file is the plan and decision hub for a specific piece of work. It's what ties multiple sessions together into one coherent thread — without it, each session knows what happened in that sitting, but nothing tracks overall progress or connects them. Specifically, it holds the plan as a checklist, the decisions made along the way, and links to every session that touched it.
The agent links the task to the project and milestone, pulls the due date from the roadmap if one exists, and generates the file from a template.
For complex tasks, the agent goes deeper with /task plan. It first checks past context — memory, past sessions, existing decisions and project practices — for anything that should inform the approach. Then we discuss: break down the work into steps, order by dependencies, capture decisions and rejected alternatives. The plan gets written as a checklist directly in the task file.
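Put together, a planned task file might look roughly like this — all contents invented for illustration:

```markdown
# Task: build-auth-flow

Project: side-project-a · Milestone: p1/auth · Due: 2026-03-20

## Plan
- [x] Draft login/logout endpoints
- [x] Decide session storage (see Decisions)
- [ ] Wire up token refresh
- [ ] End-to-end test against staging

## Decisions
- Session tokens over JWT: simpler revocation
  (rejected: JWT — stateless, but revocation gets messy)

## Sessions
- 2026-03-15-auth-flow-draft.md
- 2026-03-16-auth-flow-testing.md
```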
There's a hard stop here: the agent doesn't execute anything until I explicitly say "go." This is a deliberate pattern throughout the system — the agent proposes, I approve, the agent executes. Planning, session capture, knowledge promotion — all follow this loop.
This is the part where the system gets out of the way. The task has a plan, the project has context, the agent knows what to do. I focus on the actual problem — building, discussing, solving. The agent and I work together on the current plan item. As we go, plan items get checked off and decisions get recorded.
When I'm done with a piece of work — whether it's the end of the day or just switching to a different task — the agent captures the session. Sometimes I invoke /session create directly, but usually the agent recognizes when work is wrapping up and suggests it. This is the most important workflow in the system.
The agent reviews the entire conversation and extracts what matters into a session file: what was done, what was decided, and what comes next. It shows me a preview before writing anything — I can edit, add context, or correct what it got wrong.
The most important part is Next Steps, which separates three things: what to do next, what questions to ask when resuming, and which files to load. The agent is writing instructions for its future self — it decides what context the next conversation needs, so when I come back, the agent picks up where we left off without me re-explaining anything.
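The Next Steps section of a session file might look like this (hypothetical contents):

```markdown
## Next steps
**Do next:** wire up token refresh (plan item 3 in build-auth-flow.md)

**Ask first:** how did the staging migration go overnight?

**Load:** tasks/build-auth-flow.md (always);
sessions/2026-03-15-auth-flow-draft.md only if staging failed
```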
But session close-out doesn't stop at capturing. It cascades updates across the system: completed plan items get checked off in the task file, the roadmap's progress log gets a dated one-line entry, and the project files stay current.
Everything stays in sync without me manually touching files across the system. And then comes compounding — more on that in The System Gets Smarter.
Days later, I come back. I can resume at any level:
- `/project work my-saas-app` — loads PROJECT.md, runs the Resume Checklist, finds the frontier milestone, surfaces linked tasks and their progress, reads the latest session, and presents: where I am, what I did last, what to do next.
- `/task resume migrate-database` — loads the task, checks the plan progress, finds the latest session, follows its resume instructions.
- `/session resume 2026-03-15-migration` — loads that specific session and follows its Next Steps directly.

All three follow the same pattern: load context layer by layer. First the project baseline (the Resume Checklist, always loaded), then the session specifics on top. The agent doesn't just load context — it executes the resume instructions from the last session. If the instructions say "ask how the staging migration went," it asks. If a file is marked "read only if staging failed," the agent checks with me first before loading it.
No re-explaining. Straight into the work.
I borrowed the idea of compounding from compound engineering — a pattern for building systems where the outputs of one part become inputs to another, creating returns that accelerate over time. Applied to a personal OS, that means: every work session should make the system slightly better, not just capture what happened.
After the cascade updates, the agent reviews the conversation for knowledge that should live somewhere permanent. It presents a list of proposed updates — I approve, reject, or adjust each one. Anything in the system is a valid target: project practices, CLAUDE.md rules, skill files, past decisions.
Over time, practices accumulate, stale rules get flagged, and decisions made weeks ago surface when they're relevant again. The system improves itself through use.
With the project/task/session backbone in place, I started building everything else on top of it — each one a project managed through the same system, compounding as I went.
Morning routine. One command — /daily-brief — pulls together everything the system knows. It triages my Gmail inbox (reads, classifies, applies actions), loads today's calendar, checks active projects for what I worked on yesterday and what's next, and writes it all into an Obsidian daily note. Then we discuss what to focus on today. This is where the structured project data pays off — the agent can tell me which milestones are in progress, which tasks are waiting, and what I might be forgetting.
Expense tracking. I wrote a separate piece about this. The short version: a database schema, a skill file, and an MCP connection replaced what would normally be a full backend. Agents log expenses from conversation, reconcile against bank portals and government invoice systems, and surface discrepancies for me to review.
Agent tools. Screen capture with OCR, clipboard history access, a Google Workspace CLI (Gmail, Drive, Docs, and more), a calendar CLI, a Linear integration for issue tracking, a file opener that routes to the right app — and more. Some I built, some I adopted. The tools themselves are a mix of skills, MCP servers, CLIs, and subagents — knowing when to use which is a design judgment I developed by building all of them.
It's still far from complete or perfect — but after a month of daily use and iteration, I have a clear sense of how a well-designed system and a capable agent can make my life meaningfully easier. And thanks to compounding, every session of building these tools also improved both the backbone and the tools themselves.
This system runs my side projects. I use it every day and it genuinely saves me time. But it's a personal system — one person, one knowledge base, no production constraints.
I haven't used it in a team environment. The practices that work for one person managing their own knowledge might not survive a team of five with different workflows. The compounding principle assumes one person building up context over time. What happens when multiple people contribute to the same project? When decisions conflict? When knowledge needs to be shared across different agents with different context? I run multiple agent sessions in parallel on different tasks — but the knowledge layer still assumes a single author. Those are real problems I haven't solved because I haven't needed to yet.
There are smaller rough edges too. Loading project + roadmap + task + session every conversation costs tokens — the two-layer design helps, but it's still more than starting from scratch. Resume quality depends on the previous session's close-out — a bad close-out means a bad resume next time. The system is only as good as the instructions the agent wrote for itself last time.
The tools themselves keep evolving — Claude Code's built-in memory, better project context, improving session continuity. Some of what I built manually is becoming available out of the box. That's fine. As new features land, I fold them in. As patterns change, the system adapts — that's what compounding is for. The goal was never to build a static system. It was to learn how agents work by building something I actually use, and to keep making it more agentic as the tools improve.
A month in, I understand context engineering, skill design, tool orchestration, and agent failure modes not because I read about them, but because I built a system that depends on all of them daily. The biggest lesson: CLAUDE.md and skill files are the real product — they're what turn a general-purpose agent into a system that knows how to manage projects, close out sessions, and compound knowledge. Getting those instructions right — the right level of detail, the right confirmation points, the right context loading — is the design work that makes everything else possible. That's what I was after — and it's what I'll bring to whatever I work on next.
Want to try building your own? The core project/task/session system is available as a starter repo: personal-os