AI-Augmented Backlog Refinement: How I Run Every Session with a PO Agent
How to configure Claude Code as a PO agent with session types and run AI-augmented backlog refinement that challenges your assumptions, not just generates boilerplate.
You run backlog refinement every week. Half the stories are vague. Acceptance criteria miss obvious edge cases — the kind your team discovers mid-sprint, triggering unplanned conversations and scope negotiations. Someone raises a concern about a feature approach that was resolved three sprints ago, and the team spends twenty minutes relitigating a closed decision because nobody remembers the rationale.
You've tried AI. You paste a user story into ChatGPT and ask for feedback. It tells you to "consider adding acceptance criteria for error handling" and "ensure the story follows INVEST criteria." Thanks. You already knew that. The suggestions are technically correct and completely generic — they could apply to any story on any product for any team.
The problem isn't AI capability. It's context. A generic LLM doesn't know your product's architectural constraints, your team's Definition of Ready, or that you already decided to defer mobile support until Q3. It can't challenge you on scope because it doesn't know your scope.
What follows is how I run every backlog refinement session: with Claude Code configured as a PO agent that knows my product, enforces my standards, and pushes back when I'm wrong. Not a SaaS tool. Not a prompt template. A configured agent with persistent context.
The Refinement Waste Stream
Product owners spend more time in backlog refinement than any other ceremony. And most of that time is waste — not because refinement is unnecessary, but because it's inefficient in specific, measurable ways.
Waste #1: Manual story decomposition. Large features need breaking into sprint-sized stories. Without structured context, this decomposition happens from scratch every session. Nothing in the session remembers that you already broke down the authentication epic last month, or that your team's velocity means stories over 5 points reliably spill.
Waste #2: Acceptance criteria that miss edge cases. Your Definition of Ready says stories need complete ACs. In practice, ACs get written at speed during refinement, and the edge cases surface during development. StoriesOnBoard reports a 72% reduction in per-story refinement time using AI. But the real metric isn't speed — it's how many stories come back from development with "AC was unclear" or "discovered a new edge case during implementation."
Waste #3: Relitigated decisions. Someone asks "why don't we support offline mode?" The team discusses it for fifteen minutes. Nobody remembers that you decided against offline in Sprint 12 because the complexity didn't justify the use case frequency. Without institutional memory in the refinement session, every settled question is open for debate.
Waste #4: Opinion-driven prioritization. WSJF scoring, RICE frameworks — these exist because priority arguments without data waste time. But running the scoring frameworks manually is its own overhead, and most teams default to gut-feel prioritization. An agent with your product context can apply WSJF or RICE scoring consistently across stories, eliminating the debate about methodology while surfacing genuinely data-informed priority.
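For reference, the standard formulations (WSJF from SAFe, RICE from Intercom) are simple enough for the agent to apply as long as the inputs are scored consistently:

```latex
\mathrm{WSJF} = \frac{\text{Cost of Delay}}{\text{Job Size}}
             = \frac{\text{user/business value} + \text{time criticality} + \text{risk reduction \& opportunity enablement}}{\text{job size}},
\qquad
\mathrm{RICE} = \frac{\text{Reach} \times \text{Impact} \times \text{Confidence}}{\text{Effort}}
```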
These aren't abstract inefficiencies. They're specific, recurring wastes that compound across every sprint. Scrum.org's approach to AI-enhanced sprint planning provides prompt templates for ChatGPT — write a Sprint Goal, generate acceptance criteria, decompose a story. It's useful as far as it goes. But prompt templates are stateless: each interaction starts from zero, with no memory of your product, your decisions, or your constraints.
The tool-based approach — Jira AI features, ChatGPT prompts, backlog management SaaS — treats AI as a speedup for the same broken process. The agent-based approach treats AI as a thinking partner with your actual product context loaded.
Here's the difference: a tool generates output when you ask. An agent generates output and challenges your assumptions based on what it knows about your product.
Configuring Your AI Product Owner Workflow
If you've built a structured knowledge system — an Obsidian vault with typed frontmatter where axioms, principles, decisions, and rules are explicit and traversable — that vault is about to earn its keep. If you haven't, you can still start here with just a CLAUDE.md file.
Claude Code uses a CLAUDE.md file as persistent context — the instructions and knowledge that shape every interaction. For a PO agent, this file becomes the operating manual for how the agent participates in refinement. Here's the configuration I use:
```markdown
# PO Agent Context

## Product
Building an AI-native learning platform for technical product owners.
Target users: POs/PMs at Series A-C startups running 2-4 dev teams.

## Refinement Standards
- Definition of Ready: Story has clear AC (Given/When/Then), estimated at ≤5 points, no unresolved dependencies, matches sprint goal
- Story format: "As [persona], I want [action], so that [outcome]"
- AC format: Given/When/Then with edge cases explicitly listed
- Scope rule: If a story touches >2 services, decompose further

## Challenge Rules
- Flag any story that contradicts a prior decision in decision/
- Question scope that exceeds the current sprint goal
- Push back on acceptance criteria that lack edge case coverage
- Reference axiom/ and principle/ when challenging design decisions

## Knowledge References
- See decision/ for past architectural and product decisions
- See axiom/ for foundational beliefs that constrain choices
- See principle/ for operating behaviors derived from axioms
```
This isn't theoretical. Each section encodes real product context: your architectural constraints, your refinement standards, and your explicit instructions for the agent to push back.
The critical difference between a prompt and a configured agent becomes clear on the first interaction. Compare these two approaches to the same story:
Generic prompt approach:
"Help me refine this user story: As a user, I want to filter dashboard widgets by date range."
The generic LLM responds with textbook suggestions — add ACs for invalid dates, consider performance, include accessibility. Correct. Generic. Applicable to literally any date filter on any product.
Configured PO agent: The agent with your CLAUDE.md loaded responds differently. It knows you decided to use server-side filtering (Decision #47). It knows your axiom about working systems over beautiful architecture. So instead of generic suggestions, it asks: "This story touches the widget service and the analytics service — that's right at your scope rule's limit of 2 services. Should we split it now into a filter UI story and a data pipeline story before a third dependency shows up?" It references your actual constraints.
That's the difference between AI as a tool (it does what you ask) and AI as an agent (it does what you ask while knowing enough to challenge you).
If you have MCP servers configured for Linear, GitHub, or your project management tool, the agent can pull live context too — existing backlog state, in-flight PRs, recent decisions. But the vault + CLAUDE.md is the minimum viable setup that transforms output quality.
One more thing: add your vault path to the CLAUDE.md so the agent can traverse your axioms, principles, and decisions during refinement. The vault is the agent's institutional memory — the thing that prevents relitigated decisions.
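In practice that means one more section (or an extension of Knowledge References) in CLAUDE.md. A minimal sketch, assuming the vault sits alongside the project; the section name, path, and folder names are placeholders for wherever yours live:

```markdown
## Vault
<!-- illustrative; point this at your actual vault location -->
- Vault root: ./vault/ (contains axiom/, principle/, decision/, rule/)
- Before challenging scope or design, check decision/ for a prior call and
  cite the specific note (e.g. decision/notification-strategy.md) in the challenge
- Treat axiom/ and principle/ as constraints, not suggestions
```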
Agent-Assisted Refinement Session Types: Triage, Deep, and Spike
Not all refinement is the same. Mixing backlog hygiene with deep story decomposition is like using the same meeting format for standup and quarterly planning. Different tasks need different agent behaviors.
I run three types of refinement sessions, each with a distinct agent configuration:
Triage Refinement (15-20 minutes)
Purpose: Quick backlog hygiene. Clean up, re-prioritize, flag what needs deeper work.
Agent behavior: Scanning and flagging mode. The agent reads through the backlog and identifies: stale items (no update in 3+ sprints), duplicate intents (two stories that accomplish the same outcome differently), priority misalignment (stories that don't serve the current sprint goal), and stories missing required fields from your Definition of Ready.
Output: A clean, prioritized backlog ready for deep refinement. Stale items archived or updated. Duplicates merged. Gaps flagged.
```markdown
## Triage Session Rules
- Scan the full backlog; flag, don't fix
- Group related stories by theme
- Identify top 5 candidates for deep refinement
- Time limit: 20 minutes max
- Output: prioritized shortlist + cleanup log
```
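To keep triage output consistent from week to week, I ask for the cleanup log in a fixed shape. A sketch of that template; the section names are my convention, not anything Claude Code prescribes, and the entries are placeholders:

```markdown
## Triage log: <sprint>
<!-- illustrative template; entries below are placeholders -->
### Deep refinement shortlist (top 5)
1. <story>: why it needs work (missing ACs, unclear scope, ...)
### Stale (no update in 3+ sprints)
- <story>: recommend archive or refresh
### Duplicate intents
- <story A> overlaps <story B>: recommend merge
### Priority misalignment
- <story>: does not serve the current sprint goal
```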
Deep Refinement (30-45 minutes)
Purpose: Take 3-5 stories and make them sprint-ready. This is where the agent does its heaviest lifting.
Agent behavior: Generative and adversarial. For each story, the agent: decomposes large stories into sprint-sized pieces (using your team's velocity as a constraint), generates acceptance criteria in Given/When/Then format with explicit edge cases, challenges scope against the sprint goal and prior decisions, and detects dependencies between stories and with in-flight work.
Output: Sprint-ready stories with complete ACs, scope verified, dependencies mapped.
```markdown
## Deep Refinement Rules
- Process 3-5 stories maximum per session
- For each story: decompose, write ACs, challenge scope, map dependencies
- Reference decision/ for scope challenges
- Apply INVEST criteria check before marking ready
- Time limit: 45 minutes max
```
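For a sense of the target output, here's what a sprint-ready slice of the date-filter story from earlier might look like after a deep session. The outcome clause, criteria, and estimate are illustrative, not a prescription:

```markdown
## Story: Filter dashboard widgets by date range (UI slice)
As a user, I want to filter dashboard widgets by date range,
so that I can focus the dashboard on a specific reporting period.

### Acceptance criteria
- Given a dashboard with widgets, when I apply a start and end date,
  then every widget reloads with data scoped to that range
- Given an end date earlier than the start date, when I apply the filter,
  then I see a validation error and the previous range stays active
- Given a range with no matching data, when the filter is applied,
  then widgets show an explicit empty state rather than an error

### Notes
- Server-side filtering per Decision #47; pipeline changes live in a separate story
- Estimate: 3 points (inside the ≤5 point Definition of Ready cap)
```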
Spike Research (20-30 minutes)
Purpose: Investigate stories that need technical or domain research before they can be refined.
Agent behavior: Research and synthesis mode. The agent investigates: technical feasibility (can our architecture handle this?), alternative approaches (are there simpler ways to achieve the outcome?), risk assessment (what could go wrong, and how bad is the worst case?), and prior art (have we solved something similar before?).
Output: A spike summary with a clear recommendation: build as specified, modify the approach, defer to a future sprint, or decompose differently.
```markdown
## Spike Research Rules
- One spike per session; depth over breadth
- Research: architecture fit, alternative approaches, risk profile
- Reference existing decision/ and principle/ for prior art
- Output: spike summary with build/defer/decompose recommendation
- Time limit: 30 minutes max
```
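The spike summary gets the same treatment: a fixed shape so the recommendation doesn't get buried in research notes. A sketch of the template I ask for; the field names are mine, not a required format:

```markdown
## Spike: <question under investigation>
<!-- illustrative template -->
- Architecture fit: can the current system support this, and where it strains
- Alternatives: simpler paths to the same outcome
- Risk profile: worst case, likelihood, blast radius
- Prior art: related decision/ and principle/ notes
- Recommendation: build as specified | modify approach | defer | decompose differently
- Follow-up stories: proposed split, if the recommendation is to decompose
```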
The session type concept introduced here becomes a complete taxonomy in a later article on session types and flow-based batching. For now, the key insight is simple: matching agent behavior to the type of work prevents the "refinement meeting that tries to do everything and accomplishes nothing" antipattern.
The Challenge Pattern: Before and After
Everything so far has been configuration. Here's the proof that it works.
Take a real story: "As a team lead, I want to receive weekly digest emails summarizing team activity, so I can stay informed without checking the dashboard daily."
Without context — generic AI refinement:
A stateless LLM gives you standard output: add ACs for email delivery, include unsubscribe, consider frequency options, handle no-activity weeks. Fine. Nothing you wouldn't have written yourself given ten minutes. The AI accelerates your existing thinking.
With context — configured PO agent refinement:
The PO agent with your product context loaded responds differently:
"This story touches the notification service and the analytics aggregation pipeline — that's 2 services. Your scope rule caps at 2 before decomposing, but this is at the limit. Flagging for review."
"I checked decision/notification-strategy.md — you decided in Sprint 8 to use in-app notifications as the primary channel and email as opt-in only. This story assumes email as default. Should we align with the prior decision or explicitly override it?"
"Your axiom 'working systems over beautiful architecture' suggests starting with a simple template rather than a customizable digest builder. Recommended AC: v1 uses a fixed template. Customization is a separate story."
Three differences from the generic approach: it caught a scope boundary, it referenced a prior decision, and it applied a foundational axiom to constrain the implementation. That's not prompt engineering. That's context engineering — giving the agent enough structured knowledge to think with, not just generate from.
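That second challenge only works because the decision exists in a shape the agent can traverse. A sketch of what decision/notification-strategy.md might look like, using typed frontmatter in the style of the vault article; the field names and wording are illustrative:

```markdown
---
type: decision
id: notification-strategy
status: accepted
decided: Sprint 8
related: [axiom/working-systems-over-beautiful-architecture]
---
# Notification strategy
Decision: in-app notifications are the primary channel; email is opt-in only.
Context: <why email-as-default was rejected at the time>
Consequences: any story that assumes email as a default channel must either
align with this decision or explicitly supersede it here.
```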
This is the core belief shift of this article: AI backlog refinement isn't about generating user stories. It's about giving the agent enough context to challenge your assumptions and catch what you missed.
The challenge pattern requires three things:
- Product context in CLAUDE.md (so the agent knows your constraints)
- Decision history in the vault (so the agent can reference prior choices)
- Explicit challenge instructions (so the agent knows it should push back, not just agree)
Without all three, you get a helpful assistant. With all three, you get a thinking partner.
If your team isn't ready for AI in the refinement ceremony itself, start upstream. Run triage refinement async — let the agent clean and prioritize the backlog before the team session. The team sees a better-prepared backlog without changing the meeting format. Once they see the quality improvement, the resistance to AI-augmented deep refinement drops because the value is already demonstrated.
Start Your First AI-Augmented Refinement Session
You don't need the full setup to start. Here's the minimum path:
- Create a CLAUDE.md in your project root with three sections: Product (what you're building and for whom), Refinement Standards (your Definition of Ready and story format), and Challenge Rules (explicit instructions to push back on scope and flag missing ACs).
- Run a triage session first. Export your current backlog to markdown (see the sketch after this list). Point Claude Code at it with your CLAUDE.md loaded. Ask for a triage scan: stale items, duplicates, priority misalignment. This is low-risk, high-signal — you'll immediately see whether the agent understands your product context.
- Pick one story for deep refinement. Take the agent's triage output and select a story that needs work. Run a deep refinement session: decompose, write ACs, challenge scope. Compare the output to what you'd have written manually.
- Add your vault. Once CLAUDE.md is working, connect your knowledge vault from Article 1. Add the vault path to the agent's context. The agent now has access to your axioms, principles, and decisions — and refinement quality steps up again.
- Establish session types. After a few sessions, you'll naturally discover which refinement activities benefit from different agent behaviors. Formalize them. The triage/deep/spike taxonomy works as a starting point, adapted to your team's rhythm.
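For the triage scan in the second step, the backlog export doesn't need to be elaborate; the agent mostly needs titles, status, and enough metadata to spot staleness and goal fit. A minimal sketch of what each entry in the exported markdown might carry; the fields are suggestions, not a required schema:

```markdown
# Backlog export: <date>
<!-- illustrative; one block per story -->
## <story title>
- Status: ready | needs-refinement | blocked
- Last updated: <sprint or date>
- Priority: <rank or WSJF/RICE score>
- Sprint goal fit: <which goal this serves, if any>
- Notes: <one line of context>
```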
The goal isn't to replace the PO's judgment. It's to augment it — surfacing what you'd miss, referencing what you'd forget, and challenging what you'd let slide because it's easier not to argue with yourself.
Can you show AI augmenting a real refinement session — not just generating stories, but challenging scope, flagging risks, and referencing your existing knowledge system? If yes, you've moved from using AI as a tool to using AI as a thinking partner.
You've seen context engineering in action: structured knowledge transforming generic AI into domain-aware augmentation. Context Engineering Principles extracts the principles behind why this works — the mental models that make everything compound into a systematic approach.
Related Reading
Building a Second Brain for AI Agents
How to architect an Obsidian vault with typed frontmatter so AI agents can reason with your knowledge, not just search it.
Context Engineering Principles: The Mental Models I Use Every Day with AI Agents
Named principles for context engineering — less is more, knowledge gardening, structure over retrieval — discovered through daily agent usage, not academic theory.