A year ago our engineering team was 8 people.
It still is. But we ship like we're 24.
Everyone benchmarks AI coding wrong. They ask "how much faster is Claude Code than a good engineer typing manually?" The answer is 1.5x to 2x. Not bad. Also not 3x.
The 3x came from running ten Claude Code sessions at once.
This post is the Claude Code best practices we actually use at Warmly. The CLAUDE.md rules, the subagent architecture, the MCP server setup, the memory loop, the container config. 606 commits in, with the bruises to match.
If you're a founder or VP Eng trying to turn Claude Code from "the tool one engineer uses" into a system that compounds across your whole team, read on.
Why we went all-in on agentic coding
I'm a GTM founder. But I've been coding again the last two years because the tools got good enough that I can keep up on small things.
Last October I watched one of our engineers solve a nasty enrichment bug in 40 minutes using Claude Code. The same bug took me two hours a few months before, and I'm the person who built the original system. That's when I got it. Agentic coding isn't hype. It's the biggest productivity shift since the move from on-prem to cloud.
But out of the box, Claude Code is general-purpose. It doesn't know your database schema. It doesn't know your deploy flow. It doesn't know that "enrichment issue" at Warmly means check MongoDB first, then the AlloyDB replica, then GCP logs, then BullMQ queues.
Every engineer was reinventing the wheel. Writing their own CLAUDE.md. Copying prompts between Slack DMs. So we built a real system on top of Claude Code. We call it Warmly Intelligence. It's two things: a plugin marketplace every engineer installs, and a headless engine that runs Claude Code programmatically, 24/7, in the background.
Here's how the pieces fit.
Claude Code rules and custom instructions that actually work
The foundation is boring. CLAUDE.md files and rules. Everyone skips this part because it's not sexy. Don't skip it.
After writing, rewriting, and deleting about fifty CLAUDE.md files over eight months, here's what we learned:
Rules belong in CLAUDE.md. Context belongs in skills. A rule is "never mutate production data without SET statement_timeout = '20s'". Context is "here's our deploy flow, here's the schema, here's how to query it safely." Mix them up and both get worse.
Write rules in second person. "You always check the Linear ticket before touching code." Not "Claude should..." Not "Always...". Second person lands better. I don't know why. It just does.
Use the negative. "Never suggest a fix without reading the failing test first" lands harder than "always read the failing test first." We learned this the expensive way, burning two days because Claude was "optimistically patching" tests we hadn't read.
Check your CLAUDE.md into git. It lives in the repo. It gets code-reviewed. If someone wants to change how Claude behaves, they open a PR. Half the teams I talk to still have their rules sitting in one engineer's home directory. That's not a system. That's a hobby.
Separate global from project rules. ~/.claude/CLAUDE.md is for personal preferences. The repo's CLAUDE.md is for the team. Project rules win. Keep them that way.
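To make the rules-versus-context split concrete, here's a trimmed sketch of a repo-level CLAUDE.md that follows the conventions above. The rules are the ones quoted in this post; the skill names it points to are illustrative placeholders, not our actual file layout:

```markdown
## Rules

- You always check the Linear ticket before touching code.
- You never suggest a fix without reading the failing test first.
- You never mutate production data without `SET statement_timeout = '20s'`.

## Context

- Deploy flow: see the `deploy` skill for the full runbook.
- Schema and safe query patterns: see the `db-query` skill.
```

Note the shape: rules are short, second-person, and often negative; context is a pointer to a skill, not an inline essay.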
That's the boring part. Now the interesting part.
How we use Claude Code subagents as force multipliers
Claude Code subagents are the single most underused feature in the product. This is where the 3x lives.
A subagent is a specialized Claude session spawned by a parent session. The parent delegates a narrow task. The subagent works in isolation. It returns a structured summary. Parent continues. Exactly how a senior engineer delegates to a junior, except the junior is also Claude and doesn't take sick days.
We ship 20+ subagent skills across two plugins (warm-dev for engineering, warm-pm for product). The most important one is called warm-debugger.
A senior engineer at Warmly has a mental map. "Ad spend issue means check the Meta webhook, then the GTM handler, then the attribution table." "Enrichment issue means MongoDB, then AlloyDB replica, then BullMQ queues." That mental map took five years to build. We wrote it down. Literally. As a SKILL.md file with a domain signal table mapping symptom to evidence source.
New engineers install the plugin on day one and debug like someone who's been at Warmly for five years. The tribal knowledge isn't trapped in someone's head anymore. It's executable code Claude runs in real time.
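A condensed sketch of what that domain signal table looks like inside the SKILL.md, built from the two routes mentioned in this post (the column names and exact ordering here are illustrative):

```markdown
## Domain signal table

| Symptom | Evidence sources, in order |
| --- | --- |
| Ad spend issue | Meta webhook → GTM handler → attribution table |
| Enrichment issue | MongoDB → AlloyDB replica → GCP logs → BullMQ queues |
```

The table is the mental map. Claude reads the symptom, walks the evidence sources in order, and stops when it finds the contradiction.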
Three rules we learned writing subagents:
One task per subagent. Don't build a debugger that also writes tests. Build two subagents. Claude will pick the right one based on context.
The prompt is not a description. It's a spec. Most subagent configs I see in the wild are a one-liner. Ours are 200-300 lines each. The length isn't bloat. It's precision. The subagent knows exactly what to check, in what order, and what output format to return.
Return structured output, not prose. We have a report_findings tool every subagent calls at the end with a typed schema: claim, source_url, confidence. The parent agent gets clean data it can act on, not paragraphs it has to re-parse.
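A minimal sketch of that typed finding shape. Our real report_findings tool schema is richer; the `Finding` dataclass and `validate` helper below are illustrative, not our production code:

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One structured claim a subagent returns instead of prose."""
    claim: str         # what the subagent concluded
    source_url: str    # where the evidence lives (log link, PR, dashboard)
    confidence: float  # 0.0-1.0, how sure the subagent is


def validate(finding: Finding) -> Finding:
    """Reject malformed findings before the parent agent acts on them."""
    if not finding.claim.strip():
        raise ValueError("empty claim")
    if not 0.0 <= finding.confidence <= 1.0:
        raise ValueError("confidence out of range")
    return finding
```

The point of the schema isn't the fields themselves. It's that the parent agent can filter, sort, and threshold on `confidence` instead of re-reading paragraphs.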
The Claude Code MCP server setup that gives Claude access to everything
Most Claude Code setups I see in the wild have one or two MCP servers wired up. Ours has 18 attached to every task.
| MCP Server | Purpose |
| --- | --- |
| Linear, Linear-read | Ticket context and updates |
| Notion, Notion-read | Internal docs and specs |
| Statsig, Statsig-read | Feature flag state |
| Grafana, Grafana-read | Production metrics |
| Rootly, Rootly-read | Incident history |
| Slack, Slack-read | Team context and decisions |
| Pylon, Pylon-read | Customer support tickets |
| HubSpot, HubSpot-read | CRM data |
| Knowledge Base | Self-maintaining internal wiki |
Every server has a read variant and a write variant. You almost always want Claude to read freely and write carefully. Separating them lets you grant read access broadly and gate writes behind approval.
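In Claude Code, project-scoped servers live in a checked-in `.mcp.json` at the repo root. Here's a trimmed sketch of the read/write split for one service; the commands and package name are illustrative placeholders, not our actual servers:

```json
{
  "mcpServers": {
    "linear-read": {
      "command": "npx",
      "args": ["-y", "our-linear-mcp", "--read-only"]
    },
    "linear": {
      "command": "npx",
      "args": ["-y", "our-linear-mcp"]
    }
  }
}
```

Same server binary, two registrations. Tasks that only need context get `linear-read`; tasks that update tickets get `linear`, gated behind approval.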
The biggest unlock though isn't consuming MCP servers. It's building them.
We wrote a persona MCP that knows about our customer personas. A kb MCP that queries our self-maintained knowledge base. These didn't exist until we built them. Every company should have at least five custom MCP servers specific to their domain. If your internal systems don't speak MCP, Claude can't use them.
One small tactical note: use read-only MCP servers in your code review bots. You don't want your PR reviewer accidentally flipping Statsig flags in production.
The memory loop that makes Claude Code smarter every week
This is the part I'm most excited about and the hardest to explain.
After every completed task, a separate Sonnet process analyzes the transcript and extracts reusable memories. Four types: user preferences, work feedback, project decisions, external references. Memories get deduplicated, confidence-scored, stored. The next task loads relevant ones before it begins.
Lots of systems do that. What's different is what we do with negative feedback.
Our Slack assistant has a thumbs-down button. When someone downvotes an answer, a dedicated pipeline runs. It reads the conversation. It asks "what went wrong, what would have been correct, what domain knowledge was missing." It writes a targeted feedback memory. Every future Slack task gets that memory injected.
The 100th time someone asks about CRM sync, the answer is measurably better than the 1st time. Nobody trained a model. Nobody edited a prompt. The system noticed it was wrong and remembered.
A Claude Code setup without a feedback loop that updates memory automatically is a static system pretending to be dynamic. Build the loop. It's the difference between a tool that plateaus and one that compounds.
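To make the loop concrete, here's a deliberately small sketch of its storage side. Production retrieval would use embeddings rather than keyword overlap, and every name here is illustrative, but the shape is the same: deduplicate on write, score, inject the most relevant memories on read:

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    kind: str          # "preference" | "feedback" | "decision" | "reference"
    text: str
    confidence: float


@dataclass
class MemoryStore:
    memories: list[Memory] = field(default_factory=list)

    def add(self, memory: Memory) -> bool:
        """Deduplicate on normalized text; keep the higher-confidence copy."""
        key = " ".join(memory.text.lower().split())
        for existing in self.memories:
            if " ".join(existing.text.lower().split()) == key:
                existing.confidence = max(existing.confidence, memory.confidence)
                return False
        self.memories.append(memory)
        return True

    def relevant(self, task: str, limit: int = 5) -> list[Memory]:
        """Return the memories whose words overlap the task the most."""
        words = set(task.lower().split())
        scored = [(len(words & set(m.text.lower().split())), m) for m in self.memories]
        scored = [(score, m) for score, m in scored if score > 0]
        scored.sort(key=lambda pair: (-pair[0], -pair[1].confidence))
        return [m for _, m in scored[:limit]]
```

The thumbs-down pipeline described above is just a second writer into the same store: it produces a `Memory` with `kind="feedback"`, and `relevant()` surfaces it the next time the topic comes up.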
Claude Code tips from 8 months in production
Rapid fire, the things we learned the hard way.
Rotate OAuth tokens.
Run multiple Claude Code sessions concurrently and you will hit rate limits. We maintain multiple CLAUDE_CODE_OAUTH_TOKEN env vars and round-robin between them. Our code picks them up automatically: CLAUDE_CODE_OAUTH_TOKEN, CLAUDE_CODE_OAUTH_TOKEN_2, CLAUDE_CODE_OAUTH_TOKEN_3.
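A sketch of that pickup logic, with the environment passed in explicitly so it's testable. The variable names match the convention above; the round-robin itself is just a cycling iterator:

```python
import itertools
from typing import Iterator, Mapping


def load_oauth_tokens(env: Mapping[str, str]) -> list[str]:
    """Collect CLAUDE_CODE_OAUTH_TOKEN, then _2, _3, ... until a gap."""
    tokens: list[str] = []
    name = "CLAUDE_CODE_OAUTH_TOKEN"
    index = 1
    while name in env:
        tokens.append(env[name])
        index += 1
        name = f"CLAUDE_CODE_OAUTH_TOKEN_{index}"
    return tokens


def round_robin(tokens: list[str]) -> Iterator[str]:
    """Yield tokens forever; each new session takes the next one."""
    return itertools.cycle(tokens)
```

In practice you'd call `load_oauth_tokens(os.environ)` once at startup and hand each spawned session `next(rotation)`.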
Use git worktrees for parallel tasks.
Never run two sessions in the same directory. Each task gets its own worktree: .worktrees/<taskId>/. They stay isolated. No branch conflicts. No git state collisions.
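The mechanics, shown against a throwaway repo so the commands run standalone. In a real setup you'd run only the `worktree` commands, from your actual repo root; the task id is a placeholder:

```shell
set -e
# Throwaway repo for the demo; skip this part in a real checkout.
demo=$(mktemp -d) && cd "$demo"
git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "init"

taskId="TASK-123"
# One isolated checkout per task, each on its own branch.
git worktree add ".worktrees/${taskId}" -b "task/${taskId}"
git worktree list

# Cleanup once the task's PR merges.
git worktree remove ".worktrees/${taskId}"
git branch -q -D "task/${taskId}"
```

Each session gets its own working directory and its own branch, so ten concurrent sessions never fight over the index or HEAD.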
Set CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY=6.
Default is lower. Higher means parallel tool calls within a single session. For debugging investigations this is huge. Claude pulls GCP logs, Grafana metrics, and Linear context simultaneously instead of serially.
Use CLAUDE_CODE_COORDINATOR_MODE=1 for orchestrator tasks.
Changes how the main agent handles subagent delegation. Better for plan-and-delegate workflows.
BullMQ + Redis is the right queue.
We tried alternatives. BullMQ has the primitives: job dependencies, retry policies, backoff, rate limiting. Don't roll your own.
Automated PR reviews should run in multiple phases.
Ours runs three: acceptance check against the Linear ticket's criteria, deep code review, refinement pass that deduplicates findings. Single-pass reviews are noisy. Multi-phase reviews are shippable.
Generate deploy narratives, not diffs.
Our /warmly-dev:deploy command reads commit history, extracts Linear ticket IDs, fetches each ticket's details, and writes a prose changelog. We post it in the deploy thread. Reviewers actually understand what they're approving.
Where it still breaks
This system doesn't work perfectly. Five places it fails:
Long-context refactors are still hard. When a task spans 40+ files and requires holding the entire mental model at once, Claude loses the thread. We break these into phased tickets now, but a senior engineer on a big refactor end-to-end is still faster than any agentic setup I've seen.
Memory has a cold-start problem. New topics with no feedback history get generic answers. We manually seed memories when we know a new domain needs to land, but there's no clean automated solution yet.
Flaky tests lie to the agent. If a test passes 80% of the time, Claude merges the fix because the test is green on its run. Then staging fails an hour later. We added re-run logic. Flaky tests are still an adversarial input.
Cost is real. We pay low five figures per month across the company. Not small. The ROI case is strong because we'd need to hire more engineers to ship this volume, but at the seed stage this isn't free.
Anthropic rate limits during peak hours. Even with OAuth rotation across multiple subscriptions, we hit the ceiling. We've built in backoff and queueing. Better than six months ago. Not solved.
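One common shape for that backoff is full-jitter exponential. A minimal sketch; the base, cap, and jitter strategy here are assumptions for illustration, not a description of our exact production values:

```python
import random


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: sleep a random amount in
    [0, min(cap, base * 2^attempt)] before retrying a rate-limited call."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

The jitter matters as much as the exponent: ten sessions that all hit a 429 at the same moment must not all retry at the same moment too.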
The real 3x: concurrency, not speed
Most teams benchmarking AI coding ask the wrong question: "How much faster is Claude Code than manual coding for task X?" The answer is 1.5x-2x, and that's boring.
The right question is: how many tasks can my team run in parallel without adding headcount?
There are ten Claude Code sessions running right now as I write this paragraph. Three are reviewing open PRs. Two are implementing Linear tickets assigned this morning. Four are answering questions in Slack channels. One is writing the staging deploy changelog.
Nobody is supervising any of them. Eight humans are doing their actual work. The AI department is doing the repetitive 60%.
That's the 3x. Not "make one engineer faster." It's "run ten specialized agents in parallel so your engineers only touch the 40% that requires judgment."
Every B2B startup has this in front of them right now. The ones that figure it out in the next twelve months are going to look dramatically more efficient than the ones that don't. Not because their engineers are better. Because their systems compound.
At Warmly we do the same thing on the GTM side. Instead of ten agents reviewing PRs, we run agents identifying companies visiting your website in real time, enriching buying committees, and routing high-intent accounts to your SDRs. Same concurrency thesis. Different department. If that's interesting to you, come see what we've built at warmly.ai.
How to actually start
If this post got you fired up, here's the minimum path to your first real win.
Week 1. Write a real CLAUDE.md for your main repo. Not a one-pager. 300 lines covering schema, deploy flow, testing standards, and the three most common bug investigation patterns at your company.
Week 2. Write your first two skills. One debugger playbook for your most common bug class. One database query helper that knows your connection patterns and safety rules.
Week 3. Stand up one MCP server for your most important internal system. Probably your CRM or your production database.
Month 2. Deploy a headless Claude Code runner on a single VM watching one GitHub repo. Start with automated PR reviews only. Do not try ticket-to-PR automation yet.
Month 3. Add memory extraction. Even a simple version that runs after every task and appends to a shared file is a huge unlock.
Month 6. You'll have enough signal to decide whether to build out the full platform or stay lean.
The patterns matter more than the specific code. Copy what applies to your stack. Ignore what doesn't.
FAQ
What are Claude Code best practices for teams?
Check CLAUDE.md into git, separate rules from context, write one-task-per-subagent with 200+ line prompts, build internal MCP servers for your own systems, run multiple sessions concurrently in git worktrees with OAuth token rotation, and add a memory extraction loop that learns from negative feedback.
What's the difference between Claude Code rules and custom instructions?
Rules are constraints (never do X, always do Y). Custom instructions are context (here's our schema, here's our deploy flow). Both live in CLAUDE.md but serve different purposes. Mixing them makes both weaker.
How do Claude Code subagents work?
A subagent is a specialized Claude session spawned by a parent. The parent delegates a narrow task, the subagent works in isolation, returns a structured summary, parent continues. The key is one-task-per-subagent with a detailed spec prompt, not a one-line description.
Do you need MCP servers to use Claude Code effectively?
You can start without them but the real unlock is wrapping your internal APIs as MCP servers so Claude has programmatic access to your actual systems. Separate read-only and write variants.
How does Claude Code memory work in production?
Claude Code has native memory primitives, but real production memory is something you build on top. Extract reusable memories after every task, deduplicate against existing entries, inject relevant ones into future tasks, and close the loop by triggering targeted extraction when users give negative feedback.
Is agentic coding actually 3x faster?
A single session is 1.5-2x faster than manual coding. The 3x comes from running 5-10 sessions concurrently on different tasks. Speed is linear. Concurrency is the multiplier.
How do I set up Claude Code for a team?
Start with a committed, code-reviewed CLAUDE.md. Distribute organizational knowledge as a Claude Code plugin with skills and slash commands, not as shared docs. Set up at least one internal MCP server wrapping your company's core API. Use git worktrees and OAuth token rotation once you scale to concurrent agents.
What's the difference between Claude Code and Cursor?
Cursor is an IDE with AI built in. Claude Code is a terminal-native agent that can be run interactively, headlessly via the Agent SDK, or as a background worker in production. For team workflows like automated PR review, deploy automation, Slack Q&A, and ticket-to-PR pipelines, Claude Code's headless mode is the key differentiator.
Last Updated: April 2026