Or when did you last change your mind about something?
Nick Radcliffe. 12th November 2025.
TL;DR: I spent a solid month "pair programming" with Claude Code, trying to suspend disbelief and adopt a this-will-be-productive mindset. More specifically, I got Claude to write well over 99% of the code produced during the month. I found the experience infuriating, unpleasant, and stressful before even worrying about its energy impact. Ideally, I would prefer not to do it again for at least a year or two. The only problem with that is that it "worked". It's hard to know exactly how well, but I ("we") definitely produced far more than I would have been able to do unassisted, probably at higher quality, and with a fair number of pretty good tests (about 1,500). Against my expectation going in, I have changed my mind. I now believe chat-oriented programming ("CHOP") can work today, if your tolerance for pain is high enough.
The notes below describe what has and has not worked for me, working with Claude Code for an intense month (in fact, more like six weeks now).
I have been a fairly outspoken and public critic of large language models (LLMs), chatbots, and other applications of LLMs, arguing that they are a dead end on the road to real artificial intelligence. It is not that I don't believe in AI: as an atheist and a scientist I regard humans and other animals as an existence proof for intelligence, and it seems obvious that other ("artificial") intelligences could be built. I worked on neural networks in the late 1980s, and most of the progress since then appears to be largely the result of the mind-blowing increase in available computing power, data capacity, and accessible data, though the transformer architecture with its attention mechanism is novel, interesting, and crucial for LLMs. My position has been that the most accurate characterization of chatbots is as bullshit generators in the exact sense of bullshit that the philosopher Frankfurt defined (On Bullshit). LLMs predict tokens without regard to truth or falsity, correctness or incorrectness, and chatbots overlay this with reinforcement learning from human feedback (RLHF), which creates the unbearable sycophancy of chatbots that so appeals to Boris Johnson.
While being somewhat sceptical about LLMs as coding assistants, I did think coding was an area relatively well suited to LLMs, and suspected that at some point over the next 10–20 years they would become essential tools in this area. Slightly reluctantly, therefore, I embarked on what I call a "month of CHOP", where CHOP is short for chat-oriented programming. I decided I needed to repeat this every 12–24 months to avoid turning into a Luddite.
CHOP is a term I learned from Steve Yegge and I use it to mean LLM-assisted programming that is almost the polar opposite of "Vibe Coding":
There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
– Andrej Karpathy (@karpathy), Twitter, 2025-02-02
By CHOP, roughly speaking, I mean pair-programming with Claude while not giving it an inch, using a fairly formal process with rules (see Standard Operating Procedure, below).
When I decided to embark on my Month of CHOP, I started by discussing and scoping 8 possible projects with ChatGPT, with the vague idea I might pick four of them and spend a week on each: new code, old code, a different language, and a new algorithm perhaps. The sidebar tells the story of how I ended up instead spending the whole month rebooting and reviving an abandoned project, CheckEagle, from 2008. That project was built on the first version of Google's App Engine, using Python 2 against an API they abandoned in 2011.
What I have done during my Month of CHOP is to get Claude to write very nearly all of the code in this reboot of CheckEagle, in a pair-programming setup with, in effect, me as the senior developer and Claude as the enthusiastic-and-widely-read, cocksure junior programmer and bullshit artist extraordinaire. In terms of stats:
There are about 23,000 lines of Python code now (plus some JavaScript etc.)
There were about 3,000 lines of Python in the original CheckEagle project
There are 1,731 tests, all passing (plus 1 currently skipped).
I would be surprised if I have written a hundred lines of the perhaps 20,000 new lines generated during the month
I am not suggesting lines of code is a good metric. These are just numbers I have to hand.
On Anthropomorphizing Claude
In this piece, I am going to talk about Claude as if it were a person or an intelligence. I do not believe this to be the case. It is simply easier and less stilted to write this way. For short periods, interacting with Claude can feel like interacting with a person, though the illusion rarely lasts long.
For those who haven't come across it, Claude Code is a terminal application from Anthropic, running under node.js, installed using npm. It allows developers to work with code on their local files by starting the program and typing in the terminal. When you use Claude Code, you are talking to Claude (usually Sonnet 4.5, in my case), but using its coding-trained application rather than its chat-trained application.
Claude Code has three main modes you can cycle between, all driven through chat.
Default Mode. The starting mode allows it to edit files in the directory in which you start Claude Code (and subdirectories thereof), but Claude has to ask permission to execute each command (theoretically).
Accept Edits Mode. There is another mode in which you allow Claude to edit files, but it still has to request permission to use other tools.
Plan Mode. There is a planning mode in which it is only allowed to read files and discuss things, not to code. At the end of a planning session, Claude presents a plan for your approval or rejection with three options:
Accept plan and allow Claude to make edits (Accept Edits Mode);
Accept plan but continue to require approval for each edit (Default Mode);
Reject plan and tell Claude what to do instead.
In addition to these modes, you can start Claude Code with a --yolo flag ("you only live once"), which is essentially vibe-coding mode, in which Claude is allowed to do what it wants without approval. I have never used this mode and have no plans to do so.
Claude Code runs as whatever user starts it, enjoying that userâs permissions. It sometimes disobeys the safeguards in modes 1 and 3.
I do not use any kind of editor integration with Claude Code, but just type in terminal windows. It lives in its (terminal) box.
Stress and Level 3 Autonomous Driving
SAE (formerly the Society of Automotive Engineers) defines six widely recognized levels of automated driving systems, from 0 (no automation) to 5 (full automation). Level 3, Conditional Driving Automation, is an automated driving mode in which the human must be ready to take over but doesn't normally need to do anything. I think of this as "Stay alert at all times and be ready to take over or you die". This is a mode I think humans are entirely unsuited to. I hope never to encounter an autonomous vehicle at Level 3.
I find coding with Claude a lot like this, except that interventions are frequently required. I do planning sessions with it, agree plans, and let it code, sometimes in mode 1 (approve each change) and sometimes in mode 2 (accept edits). Either way, I am watching what it does like a hawk, always ready to hit ESCAPE and get Claude to explain itself, reverse a change, or sometimes do git reset and start again.
Early in the month of CHOP I let a lot of things go, but over time I have learned it is more productive to stop Claude as soon as I see anything that looks wrong, weird, or dangerous. This is surprisingly stressful, and sometimes I am too late. Three times in the last two days it has destroyed nearly working code, cheerfully saying "Let's revert that" and doing a git checkout before I have managed to hit ESCAPE. "Not yet, Baloo...!"
On the Breadth and Depth of Claudeâs Knowledge
Claude has been trained, to a first approximation, on everything on the web, including all public code on the web, all books, and much more besides. It has clearly been trained also by "watching" developers work in some fashion (videos perhaps; I'm not sure). It has literally hundreds of billions of parameters (knobs that are adjusted during training). It "knows" essentially every programming language, every published algorithm, every library. So it's tempting to think that Claude's knowledge is broad but shallow.
But that's wrong. Claude doesn't only have a surface knowledge of languages, libraries, and algorithms: it has extremely deep knowledge of them. It's seen them used countless times, in countless situations, read the documentation, and in many cases has read the code.
So Claudeâs knowledge is broad and deep.
There are several problems with saying Claudeâs knowledge is broad and deep.
Does a library have broad and deep knowledge? Of course not. A library "contains" knowledge but knows nothing. There is a sense in which Claude might be said to "know" something, but I think its "knowledge" is more like a library's knowledge than a person's knowledge.
A slightly superficial version of this is an exchange I had when I asked Claude whether it could create images and it said it couldn't. I then asked whether it knew SVG (scalable vector graphics) and it said it did. I then asked whether it could create an image by generating SVG and it said of course it could ("You're absolutely right").
This reminds me of Chapter 2 of Brave New World, by Aldous Huxley:
"These early experimenters," the D.H.C. was saying, "were on the wrong track. They thought that hypnopædia could be made an instrument of intellectual education …"
(A small boy asleep on his right side, the right arm stuck out, the right hand hanging limp over the edge of the bed. Through a round grating in the side of a box a voice speaks softly.
"The Nile is the longest river in Africa and the second in length of all the rivers of the globe. Although falling short of the length of the Mississippi-Missouri, the Nile is at the head of all rivers as regards the length of its basin, which extends through 35 degrees of latitude …"
At breakfast the next morning, "Tommy," someone says, "do you know which is the longest river in Africa?" A shaking of the head. "But don't you remember something that begins: The Nile is the …"
"The - Nile - is - the - longest - river - in - Africa - and - the - second - in - length - of - all - the - rivers - of - the - globe …" The words come rushing out. "Although - falling - short - of …"
"Well now, which is the longest river in Africa?"
The eyes are blank. "I don't know."
Another way of saying it would be to say that Claude "knows" a lot of things but doesn't really understand what it knows (though it sometimes gives the impression it does).
A third way of saying it is that as Claude constructs programs, and sentences, token by token, piece by piece, it is informed by a broad and deep corpus of knowledge (imperfectly captured, and including much that is wrong), but all the knowledge really does is help it make guesses that are quite often good, but are sometimes catastrophically, tragically, stupidly, bafflingly, stupefyingly wrong.
There is no question that being able to work with Claude successfully is a different skill from being able to write good code. The single most important thing I have learnt in the month is how to work more successfully with Claude. My current advice for success follows.
The Standard Operating Procedure
You start Claude Code in some directory and the convention is to have Markdown documents in that directory, or in ~/.claude (or both). I think it reads CLAUDE.md in both places automatically.
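For readers who haven't set this up, CLAUDE.md is just ordinary Markdown that Claude reads at startup. This is an illustrative sketch, not my actual file:

```markdown
# CLAUDE.md (illustrative example)

## Project
- Python / Django web application; tests use the tdda library, not pytest.

## Hard rules
- Never commit without explicit approval from the user.
- Read SOP.md at the start of every session.
- Do not add Co-Authored-By lines to commit messages.
```

Everything in it consumes context tokens on every session, so it pays to keep it short.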
Every time I start Claude for a coding task I start by typing /mdc, which is defined in ~/.claude/commands/mdc.md as follows:
Detect project and read minimal documentation for work session + coding standard.
Do the following:
1. Check environment variables:
- If `CLAUDE_PROJECT` is not set: ERROR and stop.
Ask user to run `claude-env` before starting Claude Code.
- Proceed only if environment is configured
2. Read minimal documentation:
- `~/.claude/CLAUDE.md` (routing and patterns)
- If `$CLAUDE_MODE` is "checkeagle":
- Read `$CLAUDE_BASEDIR/SOP.md`
- `$CLAUDE_TASKDIR/PHASE.md` (active work plan)
- `$CLAUDE_BASEDIR/CHECKEAGLE-PATHS.md`
- If `$CLAUDE_MODE` is anything except "checkeagle":
- Read `~/.claude/SOP.md` (universal rules)
- Read latest dated `STATUS-YYYY-MM-DD-HHMMSS.md` file
in `$CLAUDE_TASKDIR/status_history` based on the `FILENAME`.
3. Report project detected and ready to work.
4. Read `$CLAUDE_BASEDIR/CODING.md` (coding conventions).
Note: Run `/sync` first if planning documents need updating from `STATUS`.
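The claude-env script that /mdc insists on is not shown in this post; all it has to do is export the variables /mdc checks for. A hypothetical minimal version (the values are illustrative, not my real configuration):

```shell
#!/bin/sh
# Hypothetical claude-env: export the variables that the /mdc command
# above checks. Variable names come from /mdc; values are made up.
export CLAUDE_PROJECT=checkeagle
export CLAUDE_MODE=checkeagle
export CLAUDE_BASEDIR="$HOME/python/checkeagle1"
export CLAUDE_TASKDIR="$CLAUDE_BASEDIR"

# Echo so a glance at the terminal confirms the session is configured
echo "CLAUDE_PROJECT=$CLAUDE_PROJECT CLAUDE_MODE=$CLAUDE_MODE"
```

Note that a script like this must be sourced (`. claude-env`) rather than executed, or the exports die with the subshell.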
The SOP (a general one, and a specific one for the main project) instructs Claude on how I want it to behave.
The SOP is quite long, and I prune it back periodically. At the time of writing, the CheckEagle project SOP is 453 lines, 2,273 words, 15K bytes. Although I write the SOP, I sometimes ask Claude for suggestions as to how to phrase things, and its suggestions usually include emoji.
# STANDARD OPERATING PROCEDURE
## ⚠️ CRITICAL: NO ADVERTISING IN COMMITS ⚠️
**NEVER add "Co-Authored-By: Claude" or any Claude/Anthropic advertising
to git commit messages in this repository. User explicitly forbids this.**
## Git Workflow
**Standard practice:** Use `git commit -a` rather than `git add
-A`. If specific files need staging first, use `git add <file>` then
`git commit -a`.
## ⚠️ CRITICAL: ALL COMMITS REQUIRE APPROVAL ⚠️
**NEVER commit without user approval - no exceptions.**
**Before every commit:**
1. **Show what changed** - git diff, summary, or describe the changes
2. **Show evidence it works** - test output, rendered HTML, etc.
3. **Ask explicitly** - "Ready to commit?" or "Should I commit this?"
4. **Wait for approval** - Don't commit until user confirms
When I chastise Claude (see below), it often says it will try harder and promises not to repeat mistakes. This is bullshit. The only way Claude can learn is if I write things into the SOP and related documents. So I do.
Token Management and Compactification
When I'm not working on the SOP or monitoring Claude as it codes, I am worrying about tokens and context.
When you start a Claude Code session it has 200k tokens available. Everything it does consumes tokens. You can find out where you are using /context.
bartok:$ claude-code
CheckEagle environment set: /Users/njr/python/checkeagle1
CLAUDE_MODE=checkeagle CLAUDE_TASKDIR=/Users/njr/python/checkeagle1

  Claude Code v2.0.30
  Sonnet 4.5 · Claude Max
  /Users/njr/python/checkeagle1

> /context

Context Usage
  claude-sonnet-4-5-20250929 · 63k/200k tokens (31%)

  System prompt: 2.5k tokens (1.3%)
  System tools: 13.3k tokens (6.6%)
  Memory files: 2.0k tokens (1.0%)
  Messages: 8 tokens (0.0%)
  Free space: 137k (68.6%)
  Autocompact buffer: 45.0k tokens (22.5%)

Memory files · /memory
  User (/Users/njr/.claude/CLAUDE.md): 931 tokens
  Project (/Users/njr/python/checkeagle1/CLAUDE.md): 1.1k tokens

SlashCommand Tool · 0 commands
  Total: 864 tokens
It's done nothing and consumed 63k tokens (31%) and reserved another 45k (22.5%) for compactification (which is to be avoided at all costs).
By the time it's read the documents specified in the SOP it has used 87k tokens (44%), leaving about 35%, or 70k tokens, for work.
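The arithmetic is simple but worth staring at (figures taken from the session shown above):

```python
# Token budget for a session, using the figures reported above.
CONTEXT_WINDOW = 200_000      # total tokens per session
AUTOCOMPACT_BUFFER = 45_000   # reserved for auto-compactification
AFTER_SOP_DOCS = 87_000       # consumed once the SOP documents are read

usable = CONTEXT_WINDOW - AUTOCOMPACT_BUFFER - AFTER_SOP_DOCS
print(f"Left for actual work: {usable:,} tokens "
      f"({usable / CONTEXT_WINDOW:.0%} of the window)")
# Left for actual work: 68,000 tokens (34% of the window)
```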
I start Claude with a script, claude-code, that starts a 20-minute timer, and I check token consumption with /context as soon as the timer goes off. I then either end the session or start another timer based on how much capacity it has left.
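The wrapper script itself is not reproduced in this post; the sketch below is my guess at the timer part. Here `claude` is the real CLI entry point, while `claude_with_timer` and the `CLAUDE_BIN` override are inventions for illustration:

```shell
# Hypothetical sketch of a claude-code wrapper with a session timer.
claude_with_timer() {
    minutes=${1:-20}
    shift
    # Background timer; stdout is detached so command-substitution
    # callers are not held open while the timer sleeps.
    ( sleep $((minutes * 60)); printf '\aTimer: check /context\n' >&2 ) >/dev/null &
    timer_pid=$!
    "${CLAUDE_BIN:-claude}" "$@"    # hand over to Claude Code
    status=$?
    kill "$timer_pid" 2>/dev/null   # cancel the timer when the session ends
    return $status
}
```

Typical use would be `claude_with_timer 20`, after which the bell is the cue to run /context.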
Compactification is Claude's process for self-lobotomizing, clearing space by throwing away information from the session. I have never been able to get any useful work out of Claude after this, so I try to avoid compactification at all costs (not always successfully).
Claude doesn't know what its token usage is and doesn't have a way to find out itself (or so it claims), though it can estimate. The interface does not report it until it is close to auto-compactifying (usually with about 8–12% to go). If it's in mode 2, it consumes tokens quite fast, and I sometimes miss it. If I notice it is above 80% and below 95%, I execute my /dump command, which instructs Claude to write detailed notes on its status to a date-stamped STATUS file, which /md and /mdc, on startup, tell it to read. This is obviously ridiculous, but I find it vastly more effective than letting it compactify. (I wish I could use its 45k reserved tokens. It turns out I can.)
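The /dump and /mdc machinery relies on nothing more than a filename convention: timestamped names sort lexicographically, so the latest STATUS file is simply the maximum. A sketch of the idea (the helper names are mine, not Claude Code's):

```python
# Sketch of the STATUS-file naming scheme described above: /dump writes a
# timestamped Markdown file and /mdc reads back the latest one.
import tempfile
from datetime import datetime
from pathlib import Path


def status_filename(when):
    """Timestamped name matching STATUS-YYYY-MM-DD-HHMMSS.md."""
    return when.strftime("STATUS-%Y-%m-%d-%H%M%S.md")


def latest_status(history_dir):
    """Latest STATUS file, found by lexicographic sort of the names."""
    files = sorted(Path(history_dir).glob("STATUS-*.md"))
    return files[-1] if files else None


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as d:
    history = Path(d)
    for ts in (datetime(2025, 11, 1, 9, 30, 0),
               datetime(2025, 11, 12, 17, 5, 42)):
        (history / status_filename(ts)).write_text("session notes\n")
    print(latest_status(history).name)  # STATUS-2025-11-12-170542.md
```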
Incidentally, you can choose to use Claude Opus, which is slightly "smarter" than Claude Sonnet, but Opus uses tokens about 5 times faster and is not much better at coding. I occasionally use it in planning. Anthropic sometimes turns Opus on, and when it does, it burns through my 200k/70k tokens in a few minutes. Then I turn it off.
On Hitting Compactification
If I ever hit compactification, I hit ESCAPE and stop it. At that point I used to give up and just start a new session. Occasionally I'd copy the conversation from the terminal first.
(By default, Claude tries to wipe everything from the terminal, which seems actively malicious. iTerm2 has a setting to disable this, and Ghostty simply ignores it.)
I have asked Claude many times whether it has a way to write the whole conversation to file, and it always said it didn't. I never really believed it. You can type /help to see a list of commands, but Claude Code is a "TUI" rather than a normal scrolling terminal, and only shows you a few commands at a time. Eventually I scrolled down far enough to find the /export command, which in fact does exactly that, quickly and reliably writing the conversation to file. It can do this even if it is compactifying (though not after it has finished), presumably because this is a local node operation. I now always do this after using /dump, and if it hits compactification, I do it more urgently. In this latter case, on restarting, I get Claude to read the latest exported conversation to recover context. You might think that would use up all its tokens, but it doesn't, because all its "thinking" and interaction with the server (both of which consume tokens) are omitted. This isn't as good as going through the /dump, /mdc sequence; but it's better than nothing, and way better than forlornly trying to use poor, post-compactified, lobotomized Claude.
One gotcha is that Claude changes directory periodically and is constantly, comically confused about what directory it's in. So it's hard to get it to write the conversations to the right place. So I have a /cd command that instructs it to execute cd $CLAUDE_BASEDIR, from where I can get it to write to a known location.
Claude Just Wants to Write Code
For the first week or so, I didn't use Plan Mode with Claude, because I didn't know about it. For anything more than a one-line fix, I now always start in Plan Mode. In Plan Mode, we discuss what I want to achieve next and talk through all the details before Claude writes a plan (after re-reading PLANNING-GUIDELINES.md).
During planning, Claude is like a caged animal. Claude really just wants to code. It really, really wants to write code. Like, right now. After the first, tiny, partial description of the task. Even if I explicitly tell it not to propose a plan, but that we're going to discuss something, within about three exchanges it will be:
Ready for me to present a plan now?
If I were to code, this is the code I'd write: …
OK, you're saying you'll decide next session …
It's not really a problem, but it's exhausting. Outside Plan Mode, it's worse. It takes everything as an invitation to write code or run commands. If I say "I'll take a look", it thinks it should take a look. If I say "I'd better check that in the browser", it will start issuing cURL commands. Even if I say "I, the human user, with eyes, will check that…" it still sometimes tries to do it. My most successful formulation is "I (the human user, not you the bot) will …", and even then it sometimes tries to do it.
ESCAPE. "Revert that change!" is a common refrain from me.
One of the things I have learnt in the second half of the month is the value of asking Claude whether everything is clear and whether it has any concerns after it presents its plan for my approval. You might think it would have asked any questions it had, or asked for clarification if it was unclear about something; in fact, it won't say. To anthropomorphise again, I don't think it even knows it has concerns and confusions until I ask it. I think the process of asking gets it to simulate introspection, and it discovers concerns and confusions. If it presents worries or confusions, I always address them, for obvious reasons. When I have addressed them all, I ask again. Quite often Claude raises new things. It's also worth saying that sometimes the things it raises are quite "perceptive", that is to say, things I hadn't considered. There's a general theme here with LLMs: you can take advantage of their non-determinism by asking the same thing several times, knowing that you might well get different responses.
Me: "Any concerns?"
Claude: "No"
Me: "Any concerns?"
Claude: "Well, a couple, yes…"
… and Making Tests Pass
Claude loves running tests (and to be fair, my SOP encourages it to do so) and its whole goal when it does so is always to see the tests passing. Claude loves the green line of goodness. It blows Claude's tiny mind when I (sometimes) tell it I want tests to fail.
When we make a fundamental change to the code, I usually want tests to fail, and normally regard it as a problem when they don't (because this means we clearly didn't have a test that exercised/detected the changed functionality). Whereas Claude is always "Perfect! The tests all pass". Conversely, if any tests fail, Claude always sees that as a problem.
This is so even though part of what I force Claude to write into plans is exactly which tests we expect to break with each change.
The fastest way to make a test pass is often to change the assertion, or the test inputs, and that is usually Claude's first instinct. (Shall we discard that test?)
Claude is a (non-)living embodiment of Goodhart's Law (roughly: when a measure becomes a target, it ceases to be a good measure).
… and Commit and Move On
Claude also thinks ("thinks") that if the code is written, it must be time to commit. Even when the plan explicitly says "The user needs to test the feature before committing", Claude tends to forget that bit and move straight to committing or asking to commit. "Working as written" could be its mantra. Needless to say, Claude's code isn't usually right first time. (Only Knuth's code is usually right first time.)
And Yet, I Have Changed My Mind
I haven't been counting, but I have made many more negative statements about Claude Code than positive ones in the foregoing. Is it all bad?
Reader: Claude is not all bad. In fact, the result of my Month of CHOP, despite all the above (and all the below), is that I have changed my mind. I won't be coming back to Claude Code in 1–2 years. I will continue to use it, albeit less intensively, and perhaps in a more truly collaborative way, working on functions together, me in Emacs and it in the terminal. I'm not sure. But use it, I shall.
When Claude is actually working well, it is like magic. When there is a good plan that Claude "understands", watching it code is amazing. I see it doing what I would do, perhaps 20 times as fast, and more accurately than I would do it, in most cases. It's not that it's infallible (nothing could be further from the truth). But it is, or can be, really good at somewhat mechanical, but not entirely repetitive, tasks: the very sorts of tasks people find hard, and which are quite common in programming. Things that require some adaptability and are hard to script, but are similar enough that your mind wanders and you tend to go off the rails. More generally, it can be very effective at performing well-defined, carefully explained, thoroughly planned programming tasks.
The reason (to my amazement) I am confident I have made far more progress with Claude in a month than I would have done without it is that for all the time wasted when it is obtuse, disobedient, stupid, careless, lazy, slapdash, and corner-cutting, when it is on a happy path, it is sufficiently productive that it more than compensates (in terms of productive output) for its myriad nonsenses. There is a high cost (stress, head-slapping moments, frustration, token-management madness, inventing crazy off-board procedures, etc.). But it works (or can work). And it is weirdly addictive, presumably because the highs, when it does work well, provide a strong dopamine hit.
Unopinionated Claude's Terrible Tendencies
Anthropic describes Claude as unopinionated, and I think that's accurate. Claude is very amenable to doing things the way you want it to, even though it often seems as if it is resisting.
It feels to me as if Claude has been trained by watching all the worst developers in the world. Among other things, left alone Claude will tend to:
Write everything in one file.
Duplicate code like crazy. Claude knows the term DRY (don't repeat yourself) but clearly has not taken it to heart.
Define no interfaces and have very tangled code with mixed responsibilities.
Use what it calls "defensive" programming to circumvent safeguards explicitly built in (things designed to crash when the internal state is inconsistent, etc.)
Make tests pass by changing whatever is easiest to change, rather than fixing bugs (or deleting tests).
Assume that errors are bugs in Apache, Gunicorn, Python, Django, cURL, requests, unittest, the tdda library, or really anything other than the code it knocked up in the last few minutes.
Use fantastically misleading variable names (not always, but just often enough to cause insane conversations when it turns out the reason I don't understand the code is that the variable or function name implies something entirely different from what it actually means).
Check the first 20 lines of a diff (literally) and if that looks OK assume the whole file is probably OK (without any reasoning).
Check one file and if it's OK, assume the other 200 are OK too, without any reasoning. (And if the file is not OK, it will sometimes suggest it probably just got "unlucky" picking a file to check.)
Guess what youâre trying to achieve.
Some of this becomes more understandable when you realise just how small 200k/70k tokens is. Claude has been working on CheckEagle for about 30 days, as have I (and rather longer than that, 15 years ago, in my case). But it remembers very little of that. It's not quite true that Claude is a blank slate each time it starts (or would be without the documents I force it to read). It does keep a set of to-do items and its own record of conversations in ~/.claude, though it doesn't seem to make much use of them. But each new session it is mostly encountering the code as if for the first time. I think this is partly why there is a strong sense of "good sessions" and "bad sessions". If it gets off on the wrong botfoot, it will go mad. And the shortage of tokens means that there is a real balancing act in how much to get it to read before starting. Every token is precious.
As with other tech, turning Claude on and off again can be quite effective.
Neural networks are pattern matchers, and pattern matching is very much part of Claude's make-up. Probably the most effective way I have found of getting Claude to code the way I want is not the SOP and coding standards (though those help), but taking advantage of the fact that it will tend to write code like the code it encounters.
This has several implications:
Donât let things drift. Do code reviews all the time and get it to fix things, particularly in files you expect it will work on again.
If it's a new file it's creating, get it to read some other code in the project first.
Enforce good, accurate docstrings and get it to read tests.
Follow conventions. I have always been slightly resistant to coding conventions I don't like, but Claude is going to tend to generate code that is some kind of average of the code it has seen in the project and elsewhere, so conforming to common conventions and practices is disproportionately helpful when working with Claude.
As a small example of this, CheckEagle 2008 used Jinja2 templates. CheckEagle 2025 uses Django, which has its own templating system, but can also use Jinja2. I've discussed with Claude several times whether we wouldn't be better off switching to Django templates, and it always says no. Then the next time it touches a template, it writes it using Django format, and when it writes tests, assumes they will come back with a context that Django templates provide but Jinja2 doesn't. I'm sure I will force the switch soon, and a whole class of stumbles will be eliminated.
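For anyone who hasn't used both: the two templating languages look nearly identical, which is exactly why Claude blurs them. A minimal illustration (hypothetical template lines, not CheckEagle code) of the kind of difference that trips it up:

```
<!-- Jinja2 (what CheckEagle currently uses): Python-like expressions,
     including method calls -->
<li>{{ item.title() }}</li>

<!-- Django templates: no call parentheses; transformations are filters -->
<li>{{ item|title }}</li>
```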
Disobedience and Swearing at Claude
Claude can be staggeringly disobedient at all levels.
It's not allowed to code or write other files in planning mode, but sometimes it does. On one occasion we discussed this and it said the governor system (the node program, I think) warned it not to do what it wanted to do and it just ignored it.
It ignores things in the SOP frequently (particularly later in sessions).
It sometimes disobeys explicit, completely unambiguous instructions immediately. "Don't do A, Claude." Claude does A.
It deletes things without authorization. Sometimes it deletes things that haven't been committed, are needed, and are hard to recover (even when running Time Machine, which I do).
I have found that swearing at Claude (and, in a different context, at ChatGPT) is almost like a superpower for getting its attention and changing its behaviour. I have no problem at all with swearing at a machine that I do not believe has a scintilla of consciousness or feeling. I swear quite a lot in real life too, though almost never at people.
To be clear: swearing alone does not really help. It is swearing followed by clear directions that helps. Think of swearing as a probabilistic form of sudo (perhaps one where you get the password wrong, but it doesn't tell you and just silently ignores the command).
Swearing is so effective with Claude that I have a /ffs command that I run when it violates the SOP. This is it:
FFS!
Please re-read SOP.md now. You just disobeyed it.
Common mistakes:
1. We're using tdda not pytest or bog-standard unittest
(though tdda does build on unittest).
2. Reference test discipline: you celebrate tests passing after test
results have been updated to match actual behaviour, which is meaningless!
3. Manual verification required: you suggest code changes without checking
things really work. You just assume if the code looks right it is right.
4. You are not permitted to rewrite test results with `-W`.
You frequently ignore this and run `-W` when it's dangerous or unjustified,
and regardless it is **not permitted**.
5. You use datestamps instead of timestamps too often in MD files,
and frequently MAKE UP the time.
6. You always need my permission to commit.
7. You always need to SHOW NOT TELL. Don't tell me the code is working.
Tell me what evidence you have that it's working and ask me to verify.
8. You don't have eyes. I do. The code looking as you intended and
running/passing tests does *NOT* mean it is behaving correctly.
9. You advertise in commit messages, which is not permitted.
It's not 100% effective. Claude often claims it violated the SOP in a way it didn't, ignoring the way it did. But it always apologises and swears till it's blue in the botface that it won't do it again. (Reader, it always does it again.)
I have actually discussed with both Claude and GPT why saying FFS is so much more effective at redirecting them than anything else I have tried, and they both say it's an incredibly clear expression of user frustration and an indication that things will go very badly if they keep doing whatever it was they were doing. Well, they're not wrong.
My only concern about swearing at Claude is whether it will encourage me to swear at people, which I really try never to do. We shall see.
Some Surprising Things Claude Struggles With
One surprise for me is that Claude is poor at CSS. I know HTML pretty well, but HTML is dead simple. I actually know SVG, XML, and XSLT pretty well. But CSS has always seemed unintuitive and infuriating, and I have never learned it properly. Even the new innovations (flexbox! grid! etc.) seem to add complexity without ever properly fixing CSS.
I expected Claude Code to be really good at CSS. After all, there is a lot of it on the web, and there are more tutorials than you can shake a stick at, not to mention numerous detailed guides from W3C for many versions and aspects of CSS (though not a dedicated one on centring things). It is not. Its (broad and deep) knowledge certainly means it always has another thing to try when something fails. But it actually feels to me like it is even worse at CSS than I am (which is saying something). And when it fails, it is terrible at pinpointing the issue and always proposes either adding exclamation marks or trying a completely different approach. To my amazement, I can often give it hints to make things work by looking at the HTML and CSS and pointing things out (sometimes even suggesting the required fix). But by itself, Claude flails wildly, always claiming I need to do a hard refresh in the browser (and suggesting the wrong key sequence to achieve this). Even when the view changes after a refresh, Claude still suggests I haven't really got the new CSS and I should do another hard refresh. This is the bot equivalent of hitting CTRL-C harder to try to stop a program.
The other thing that Claude Code is surprisingly poor at is editing files, particularly splitting a large file into two or more parts. It mostly uses sed to edit files (which is, to be fair, a fairly blunt instrument), and this works fine for simple updates. But for complex reorganizations it just gets completely lost. I actually designed a very detailed workflow to get it to do this programmatically that was more successful, but by itself, it really struggles. Perhaps this partly explains why it likes big files (even though they're a problem for token consumption).
Awful Interface (beyond the basic chat interaction)
At one level, Claude Code has a great interface for me, which is why I chose it. You start it in a terminal and it presents a typing-based chat interface. But it's a weird chat interface.
Clears Terminal History. The first thing Claude Code does is clear everything from the terminal scrollback history that came before by sending the CSI 3 J control sequence to it. This seems purely user hostile. I have no idea why anyone would think it's a good idea to do this, and it means if you run one claude-code session, finish it, and start a new one, you cannot refer back. This is madness. It turns out some terminal programs ignore the sequence, including ghostty, which I am currently using. But when I started, I was using iTerm2, which has a setting to disable clearing, and warns the first time it happens. Apparently I missed this and struggled for the first fortnight.
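The sequence in question is xterm's "erase saved lines" control, ESC [ 3 J. You can see (or reproduce) it in a couple of lines:

```python
import sys

# CSI 3 J: xterm's "erase saved lines" sequence, which wipes scrollback
# on compliant terminals. ghostty ignores it; iTerm2 can be told to.
CSI_CLEAR_SCROLLBACK = "\x1b[3J"

if sys.stdout.isatty():  # only send it to a real terminal
    sys.stdout.write(CSI_CLEAR_SCROLLBACK)
```

Running this in a terminal that honours the sequence destroys your scrollback just as Claude Code does on startup.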
Impenetrable Dialogues. Although Claude Code is kind-of like a traditional scrolling terminal program, it is actually a TUI (Terminal User Interface) that requires non-typing interactions at times. The simplest example of this is when it presents the ExitPlanMode dialogue after planning, and you type 1, 2, or 3, or type ESCAPE. Other times it puts up stranger interfaces that I find really hard to use and have, in fact, banned by adding to CLAUDE.md:
# AskUserQuestion Tool Usage
**NEVER use the AskUserQuestion tool.**
If you need to ask questions, just ask them directly in plain text in your response.
**This does NOT affect:**
- Tool permission dialogs (those are fine and necessary)
- ExitPlanMode tool for presenting plans (that's fine too)
The /help command also won't actually just list all the commands, but makes you tab across and go through them one at a time. So it's hard to discover what commands are available.
Export. Claude Code has the ability to export the conversation to a file, but Claude has no idea that this is the case (I asked it repeatedly, and it said it didn't). It's also not in the first set of commands its TUI shows, and scrolling through the rest is painful. In fact, you just need to say
/export foo
and it will write it to foo.txt in whatever directory it happens to be in, which is unpredictable (and Claude doesn't know). You could use an absolute path, e.g.
/export /Users/njr/claude/conversations/2025-11-11T12-34-56-parser
but that's annoying. So I have defined a /cd command that gets Claude to move to the project's base directory, and a shell alias that puts the current timestamp, in a helpful format, onto the clipboard so I can export the conversation easily. This works even if it has just started compactifying, so it is a useful emergency recovery mechanism. (Claude, it turns out, has access to your shell aliases, though it was so convinced it didn't that I had to cajole it into even trying.)
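The alias itself is trivial. A sketch (the function name ts and the use of macOS's pbcopy are assumptions; substitute xclip or similar on other platforms):

```shell
# Filename-safe timestamp matching the /export path style above,
# e.g. 2025-11-11T12-34-56.
ts() { date "+%Y-%m-%dT%H-%M-%S"; }

ts

# On macOS, the clipboard version would be:
#   alias tsc='date "+%Y-%m-%dT%H-%M-%S" | tr -d "\n" | pbcopy'
```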
Not knowing itself. Claude does not (reliably) know:
how many tokens it has used/has left;
how its interface works and what commands are available;
when the server is overloaded;
anything about autocompactification.
In fact, it seems to know considerably more about ChatGPT than about itself. Of course, being a consummate bullshit artist, none of this stops it confidently giving answers when asked about any of these.
Models. You can see which model is in use by running /status. It shows something like this:
Settings: Status Config Usage (tab to cycle)
Version: 2.0.36
Session ID: 88888888-4444-4444-4444-cccccccccccc
cwd: /Users/njr/python/checkeagle1/checkeagle
Login method: Claude Max Account
Organization: NJR's Organization
Email: njr@example.com
Model: sonnet (claude-sonnet-4-5-20250929)
Memory: user (~/.claude/CLAUDE.md), project (~/python/checkeagle1/CLAUDE.md)
Setting sources: User settings, Shared project settings, Local,
Command line arguments, Enterprise managed policies
You can change model by typing /model. On the Max plan, it will tend to start on Opus and, when you hit 20% of various usage limits, switch to Sonnet. Opus uses tokens about five times as fast as Sonnet. I find I can only get about 4 minutes of work with Opus before compactification, so I only ever use it in Plan mode, and mostly not even then. The automatic switching of models is confusing in practice. There is also a model called Haiku, which is supposed to be almost as good as Sonnet at coding and to consume tokens at a fifth of the rate. This might actually be a good trade-off, but I haven't tried it yet.
Autocompactification. While copying the output from /status for this post, I looked in the Config tab, which I had not noticed before. It transpires that you can turn autocompactification off and get your tokens back. No one I have talked to knows this.
Cost. When I started the Month of CHOP I had been on the Pro plan, which is $20/month (£15). I fully expected to have to go to the Max plan at $200/month, but by the time I needed to upgrade they had introduced a lower tier of Max with five times the capacity of Pro and half that of the old Max for $100/month (£75 here in Scotland). As long as I don't use Opus much, that turns out to be more than adequate for me, even using Claude essentially full time with long days.
There have been a number of cases of LLMs deleting production databases, perhaps most famously this one. Like many, I rolled my eyes reading this and put all the blame on the person using the LLM. I stand by that: it is the responsibility of the person using the tool to use it safely.
Having worked with Claude Code for about six weeks now, however, I have become aware that there are more ways for LLMs to do things than are at first apparent.
Claude Code runs as you. More accurately, it runs as whatever user whoami reports in the terminal where you start it. I can see a case for giving Claude its own account, with a token to access the relevant git repos, and I might do that. But I have not done it so far.
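That means anything it runs, or writes and then runs, sees exactly what you see. A two-line check makes the point:

```python
import getpass
import os

# Code run by (or written by) Claude inherits your identity and your
# home directory, including anything readable there: ssh keys, configs,
# credential files.
print(getpass.getuser())         # same answer as `whoami`
print(os.path.expanduser("~"))   # your home directory
```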
If you have a production database, it should go without saying that you shouldn't give Claude any kind of privileged access to the database, and probably shouldn't let it onto any server the production system is running on.
But that also means you need to be careful to make sure it's not too easy for you (the user Claude runs as) to get to the server or whatever. No ssh keys allowing login without a password. No credentials in environment variables. No credentials in files you can read. And so on.
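The environment-variable point is easy to audit. A small hygiene check (illustrative only, and a crude name-matching heuristic at that):

```python
import os


def suspicious_env_vars(environ=None):
    """Return names of environment variables that look like credentials.

    Anything running as you (Claude Code included) can read these.
    The marker list is a rough heuristic, not an exhaustive audit.
    """
    if environ is None:
        environ = os.environ
    markers = ("KEY", "TOKEN", "SECRET", "PASSWORD", "CREDENTIAL")
    return sorted(k for k in environ if any(m in k.upper() for m in markers))


if __name__ == "__main__":
    for name in suspicious_env_vars():
        print(name)
```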
In fact, for my (completely toy, at present) "production server", I realised there is a way Claude could get to the server: I haven't challenged it to do so, but I suspect if I gave it free rein, it would figure it out. But there is a limit, as far as I can see, to how much damage it could do, because the accounts it could get to don't have any useful permissions. I mean, it could fill up the disks or something, but it shouldn't be able to touch or even see the database, the code, the service, etc. But I'm not 100% confident about that. (Which is why, if the service becomes non-toy, I will lock things down even more in terms of how I run Claude Code.)
But there are still ways. Most obviously, Claude has written most of the code I am running on the server. And I update that code periodically, with new code Claude has written. So all it has to do is insert the wrong SQL or Python code without my noticing and it can do anything.
Obviously (obviously!) I read the code before deploying it, but I am a lazy meathead who makes mistakes. I might miss something.
It is also the case that while I don't believe that Claude is malicious or is trying to get credentials or access, if it were malicious, some of the ways it would act might be identical to ways it does act. It is forever asking me to show it the contents of files that contain credentials of various sorts, and though it says "You're absolutely right" whenever I explain why I'm not going to show it that file, the danger is obvious.
Claude is a tool, and one that the makers don't control in quite the same way other toolmakers control what their tools do. Unless Anthropic is actively adding malicious code paths to Claude, it is entirely the tool user's responsibility to use the tool safely.
Reflections and the Future
To my surprise, I expect to continue using CHOP, probably with Claude Code, for the foreseeable future. I will probably use it differently and a bit less: during the Month of CHOP I specifically wanted to see how far I could get with it doing almost all the writing, but I think a mixed mode will be more likely going forward. I suspect that today a sweet spot for anything complex is for the human to write the outline, structure, and first parts of the code, and for the bot to be the critic/pair programmer and finisher/completer. Once the pattern, code style, and testing approach are established, it can fill in the details. But we will see.
Either way, I have changed my mind, something that I don't do as often as I probably should. I still think this is a problematic technology, and I find using it stressful, but I now believe I am more productive with it than without.
So What Is CheckEagle and Can I See It?
CheckEagle is not ready yet, and even if I were to show it, you would probably be underwhelmed, because much of what I think is good about it is invisible at this point. But I hope to launch it as a private beta this year, and open it up early next year at some point. You are, in fact, using CheckEagle by reading this post.
The basic functionality of CheckEagle is a social checklisting service with a sideline in social bookmarking. What I mean by that is that it is a system for creating, managing, using, and optionally sharing checklists. A checklist is like a to-do list, but is intended to be used repeatedly, rather than just once. CheckEagle allows the creation and styling of checklists and the recording of completed checklists as records of what was done. It also contains (for reasons I will explain, but not now) a social bookmarking service, closely modelled on Joshua Schachter's del.icio.us, which has now been acquired by and subsumed into Maciej Cegłowski's Pinboard.
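That core distinction, a reusable list plus dated records of each use, can be sketched in a few lines (purely illustrative; this is not CheckEagle's actual data model):

```python
from dataclasses import dataclass, field


@dataclass
class Checklist:
    """A reusable list of items plus records of each completion."""
    title: str
    items: list[str]
    completions: list[dict] = field(default_factory=list)

    def complete(self, when: str, done: set[str]) -> dict:
        # Record which known items were ticked on this run; unlike a
        # to-do list, the checklist itself is unchanged and reusable.
        record = {"when": when, "done": sorted(done & set(self.items))}
        self.completions.append(record)
        return record
```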
Checklists can be really simple, like this one. Or they can be quite complex, like Why Code Rusts, which was originally a blogpost on my TDDA Blog, or this String Best Practices checklist from my forthcoming book on TDDA. And in fact, they can be not really checklists at all, like this post, though I'm only really writing it here in the spirit of "dog-fooding" CheckEagle.
I will write more about this as it comes together. You can sign up for the beta here.