Fragments

As I come across interesting thoughts (mostly) on the web, I like to share them. I do this by posting “fragments”. This page collects all my fragments so far.


April 9

I mostly link to written material here, but I’ve recently listened to two excellent podcasts that I can recommend.

Anyone who regularly reads these fragments knows that I’m a big fan of Simon Willison; his (also very fragmentary) posts have earned a regular spot in my RSS reader. But the problem with fragments, however valuable, is that they don’t provide a cohesive overview of the situation. So his podcast with Lenny Rachitsky is a welcome survey of the state of the world as seen through a discerning pair of eyeballs. He paints a good picture of how programming has changed for him since the “November inflection point”, important patterns for this work, and his concern about the security bomb nestled inside the beast.

My other great listen was on a podcast I follow regularly: Gergely Orosz interviewed Thuan Pham, the former CTO of Uber. As with so many of Gergely’s podcasts, they focused on Thuan Pham’s fascinating career direction, giving listeners an opportunity to learn from a successful professional. There’s also an informative insight into Uber’s use of microservices (they had 5000 of them), and the way high-growth software necessarily gets rewritten a lot (a phenomenon I dubbed Sacrificial Architecture).

 ❄                ❄                ❄                ❄                ❄

Axios published their post-mortem on their recent supply chain compromise. It’s quite a story: the attackers spent a couple of weeks developing contact with the lead maintainer, leading to a video call where the meeting software indicated something on the maintainer’s system was out of date. That led to the maintainer installing the update, which in fact was a Remote Access Trojan (RAT).

they tailored this process specifically to me by doing the following:

  • they reached out masquerading as the founder of a company they had cloned the companys founders likeness as well as the company itself.
  • they then invited me to a real slack workspace. this workspace was branded to the companies ci and named in a plausible manner. the slack was thought out very well, they had channels where they were sharing linked-in posts, the linked in posts i presume just went to the real companys account but it was super convincing etc. they even had what i presume were fake profiles of the team of the company but also number of other oss maintainers.
  • they scheduled a meeting with me to connect. the meeting was on ms teams. the meeting had what seemed to be a group of people that were involved.
  • the meeting said something on my system was out of date. i installed the missing item as i presumed it was something to do with teams, and this was the RAT.
  • everything was extremely well co-ordinated looked legit and was done in a professional manner.

Simon Willison has a summary and further links.

 ❄                ❄                ❄                ❄                ❄

I recently bumped into Diátaxis, a framework for organizing technical documentation. I only looked at it briefly, but there’s much to like. In particular I appreciated how it classified four forms of documentation: tutorials, how-to guides, reference, and explanation.

The distinction between tutorials and how-to guides is interesting:

A tutorial serves the needs of the user who is at study. Its obligation is to provide a successful learning experience. A how-to guide serves the needs of the user who is at work. Its obligation is to help the user accomplish a task.

I also appreciated its point of pulling explanations out into separate areas. The idea is that other forms should contain only minimal explanations, linking to the explanation material for more depth. That way we keep the flow on the goal and allow the user to seek deeper explanations in their own way. The study/work distinction between explanation and reference mirrors that same distinction between tutorials and how-to guides.

 ❄                ❄                ❄                ❄                ❄

For eight years, Lalit Maganti wanted a set of tools for working with SQLite. But it would be hard and tedious work, “getting into the weeds of SQLite source code, a fiendishly difficult codebase to understand”. So he didn’t try it. But after the November inflection point, he decided to tackle this need.

His account of this exercise is an excellent description of the benefits and perils of developing with AI agents.

Through most of January, I iterated, acting as semi-technical manager and delegating almost all the design and all the implementation to Claude. Functionally, I ended up in a reasonable place: a parser in C extracted from SQLite sources using a bunch of Python scripts, a formatter built on top, support for both the SQLite language and the PerfettoSQL extensions, all exposed in a web playground.

But when I reviewed the codebase in detail in late January, the downside was obvious: the codebase was complete spaghetti. I didn’t understand large parts of the Python source extraction pipeline, functions were scattered in random files without a clear shape, and a few files had grown to several thousand lines. It was extremely fragile; it solved the immediate problem but it was never going to cope with my larger vision, never mind integrating it into the Perfetto tools. The saving grace was that it had proved the approach was viable and generated more than 500 tests, many of which I felt I could reuse.

He threw it all away and worked more closely with the AI on the second attempt, with lots of thinking about the design, reviewing all the code, and refactoring with every step:

In the rewrite, refactoring became the core of my workflow. After every large batch of generated code, I’d step back and ask “is this ugly?” Sometimes AI could clean it up. Other times there was a large-scale abstraction that AI couldn’t see but I could; I’d give it the direction and let it execute. If you have taste, the cost of a wrong approach drops dramatically because you can restructure quickly.

He ended up with a working system, and the AI proved its value in allowing him to tackle something that he’d been leaving on the todo pile for years. But even with the rewrite, the AI had its potholes.

His conclusion on the relative value of AI in different scenarios:

When I was working on something I already understood deeply, AI was excellent…. When I was working on something I could describe but didn’t yet know, AI was good but required more care…. When I was working on something where I didn’t even know what I wanted, AI was somewhere between unhelpful and harmful…

At the heart of this is that AI works at its best when there is an objectively checkable answer. If we want an implementation that can pass some tests, then AI does a good job. But when it came to the public API:

I spent several days in early March doing nothing but API refactoring, manually fixing things any experienced engineer would have instinctively avoided but AI made a total mess of. There’s no test or objective metric for “is this API pleasant to use” and “will this API help users solve the problems they have” and that’s exactly why the coding agents did so badly at it.

 ❄                ❄                ❄                ❄                ❄

I became familiar with Ryan Avent’s writing when he wrote the Free Exchange column for The Economist. His recent post talks about how James Talarico and Zohran Mamdani have made their religion an important part of their electoral appeal, and their faith is centered on caring for others. He explains that a focus on care leads to an important perspective on economic growth.

The first thing to understand is that we should not want growth for its own sake. What is good about growth is that it expands our collective capacities: we come to know more and we are able to do more. This, in turn, allows us to alleviate suffering, to discover more things about the universe, and to spend more time being complete people.


April 2

As we see LLMs churn out scads of code, folks have increasingly turned to Cognitive Debt as a metaphor for capturing how a team can lose understanding of what a system does. Margaret-Anne Storey thinks a good way of thinking about these problems is to consider three layers of system health:

  • Technical debt lives in code. It accumulates when implementation decisions compromise future changeability. It limits how systems can change.
  • Cognitive debt lives in people. It accumulates when shared understanding of the system erodes faster than it is replenished. It limits how teams can reason about change.
  • Intent debt lives in artifacts. It accumulates when the goals and constraints that should guide the system are poorly captured or maintained. It limits whether the system continues to reflect what we meant to build and it limits how humans and AI agents can continue to evolve the system effectively.

While I’m getting a bit bemused by debt metaphor proliferation, this way of thinking does make a fair bit of sense. The article includes useful sections to diagnose and mitigate each kind of debt. The three interact with each other, and the article outlines some general activities teams should do to keep it all under control.

 ❄                ❄

In the article she references a recent paper by Shaw and Nave at the Wharton School that adds LLMs to Kahneman’s two-system model of thinking.

Kahneman’s book, “Thinking, Fast and Slow”, is one of my favorite books. Its central idea is that humans have two systems of cognition. System 1 (intuition) makes rapid decisions, often barely-consciously. System 2 (deliberation) is when we apply deliberate thinking to a problem. He observed that to save energy we default to intuition, and that sometimes gets us into trouble when we overlook things that we would have spotted had we applied deliberation to the problem.

Shaw and Nave consider AI as System 3:

A consequence of System 3 is the introduction of cognitive surrender, characterized by uncritical reliance on externally generated artificial reasoning, bypassing System 2. Crucially, we distinguish cognitive surrender, marked by passive trust and uncritical evaluation of external information, from cognitive offloading, which involves strategic delegation of cognition during deliberation.

It’s a long paper that goes into detail on this “Tri-System theory of cognition” and reports on several experiments they’ve done to test how well this theory can predict behavior (at least within a lab).

 ❄                ❄                ❄                ❄                ❄

I’ve seen a few illustrations recently that use the symbols “< >” as part of an icon to illustrate code. That strikes me as rather odd: I can’t think of any programming language that uses “< >” to surround program elements. Why that and not, say, “{ }”?

Obviously the reason is that they are thinking of HTML (or maybe XML), which is even more obvious when they use “</>” in their icons. But programmers don’t program in HTML.

 ❄                ❄                ❄                ❄                ❄

Ajey Gore asks: if coding agents make coding free, what becomes the expensive thing? His answer is verification.

What does “correct” mean for an ETA algorithm in Jakarta traffic versus Ho Chi Minh City? What does a “successful” driver allocation look like when you’re balancing earnings fairness, customer wait time, and fleet utilisation simultaneously? When hundreds of engineers are shipping into ~900 microservices around the clock, “correct” isn’t one definition — it’s thousands of definitions, all shifting, all context-dependent. These aren’t edge cases. They’re the entire job.

And they’re precisely the kind of judgment that agents cannot perform for you.

Increasingly I’m seeing a view that agents do really well when they have good, preferably automated, verification for their work. This encourages such things as Test Driven Development. That’s still a lot of verification to do, which suggests we should see more effort to find ways to make it easier for humans to comprehend larger ranges of tests.

While I agree with most of what Ajey writes here, I do have a quibble with his view of legacy migration. He thinks it’s a delusion that “agentic coding will finally crack legacy modernisation”. I agree with him that agentic coding is overrated in a legacy context, but I have seen compelling evidence that LLMs help a great deal in understanding what legacy code is doing.

The big consequence of Ajey’s assessment is that we’ll need to reorganize around verification rather than writing code:

If agents handle execution, the human job becomes designing verification systems, defining quality, and handling the ambiguous cases agents can’t resolve. Your org chart should reflect this. Practically, this means your Monday morning standup changes. Instead of “what did we ship?” the question becomes “what did we validate?” Instead of tracking output, you’re tracking whether the output was right. The team that used to have ten engineers building features now has three engineers and seven people defining acceptance criteria, designing test harnesses, and monitoring outcomes. That’s the reorganisation. It’s uncomfortable because it demotes the act of building and promotes the act of judging. Most engineering cultures resist this. The ones that don’t will win.

 ❄                ❄                ❄                ❄                ❄

One of the questions that comes up when we think of LLMs-as-programmers is whether there is a future for source code. David Cassel on The New Stack has an article summarizing several views of the future of code. Some folks are experimenting with entirely new languages built with the LLM in mind; others think that existing languages, especially strictly typed languages like TypeScript and Rust, will be the best fit for LLMs. It’s an overview article, one that has lots of quotations but not much analysis in itself, but it’s worth a read as a good overview of the discussion.

I’m interested to see how all this will play out. I do think there’s still a role for humans to work with LLMs to build useful abstractions in which to talk about what the code does - essentially the DDD notion of Ubiquitous Language. Last year Unmesh and I talked about growing a language with LLMs. As Unmesh put it:

Programming isn’t just typing coding syntax that computers can understand and execute; it’s shaping a solution. We slice the problem into focused pieces, bind related data and behaviour together, and—crucially—choose names that expose intent. Good names cut through complexity and turn code into a schematic everyone can follow. The most creative act is this continual weaving of names that reveal the structure of the solution that maps clearly to the problem we are trying to solve.


March 26

Anthropic carried out a study, done by getting its model to interview some 80,000 users to understand their opinions about AI, what they hope for from it, and what they fear. Two things stood out to me.

It’s easy to assume there are AI optimists and AI pessimists, divided into separate camps. But what we actually found were people organized around what they value—financial security, learning, human connection— watching advancing AI capabilities while managing both hope and fear at once.

That makes sense: if asked whether I’m an AI booster or an AI doomer, I answer “yes”. I am fascinated by its impact on my profession, expectant of the benefits it will bring to our world, and worried by the harms that will come from it. Powerful technologies rarely yield simple consequences.

The other thing that struck me was that, despite most people mixing the two, the overall balance between optimism and pessimism about AI varied by geography. In general, the less developed the country, the more optimism about AI.

 ❄                ❄                ❄                ❄                ❄

Julias Shaw describes how to fix a gap in many people’s use of specs to drive LLMs:

Here’s what I keep seeing: the specification-driven development (SDD) conversation has exploded. The internet is overflowing with people saying you should write a spec before prompting. Describe the behavior you want. Define the constraints. Give the agent guardrails. Good advice. I often follow it myself.

But almost nobody takes the next step. Encoding those specifications into automated tests that actually enforce the contract.

And the strange part is, most developers outside the extreme programming crowd don’t realize they need to. They genuinely believe the spec document is the safety net. It isn’t. The spec document is the blueprint. The safety net is the test suite that catches the moment your code drifts away from it.

As well as explaining why it’s important to have such a test suite, he provides an astute five-step checklist to turn spec documents into executable tests.
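
A minimal sketch of the idea, not taken from Shaw's checklist: suppose a spec document contains the sentence "orders of 100.00 or more get a 10% discount; smaller orders get no discount". Both that rule and the function `discounted_total` are hypothetical, invented here only to show one spec sentence pinned down by assertions.

```python
from decimal import Decimal

# Hypothetical spec line: "Orders of 100.00 or more get a 10% discount;
# smaller orders get no discount."


def discounted_total(subtotal: Decimal) -> Decimal:
    """Hypothetical implementation under test."""
    if subtotal >= Decimal("100.00"):
        return (subtotal * Decimal("0.90")).quantize(Decimal("0.01"))
    return subtotal.quantize(Decimal("0.01"))


def check_discount_spec() -> None:
    """Each assertion pins one clause of the spec, including the boundary."""
    # Just below the threshold: no discount.
    assert discounted_total(Decimal("99.99")) == Decimal("99.99")
    # Exactly at the threshold: the discount applies.
    assert discounted_total(Decimal("100.00")) == Decimal("90.00")
    # Well above the threshold: still 10%.
    assert discounted_total(Decimal("250.00")) == Decimal("225.00")


check_discount_spec()
```

If a later regeneration of `discounted_total` changes `>=` to `>`, the spec document stays silent but the boundary assertion fails, which is the drift the test suite is there to catch.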

 ❄                ❄                ❄                ❄                ❄

Lawfare has a long article on potential problems countering covert action by Iran, and I confess I only skip-read it. It begins by outlining a bunch of plots hatched in the last few years. Then it says:

If these examples seem repetitive, it’s because they are. Iran has proved itself relentless in its efforts to carry out attacks on U.S. soil—and the U.S., for its part, has demonstrated that it is capable of countering those efforts. The above examples show how robustly the U.S. national security apparatus was able to respond, largely through the FBI and the Justice Department….

That is, potentially, until now. The current administration has decimated the national security elements of both agencies through firings and forced resignations. People with decades of experience in building interagency and critical source relationships around the world, handling high-pressure, complicated investigations straddling classified and unclassified spaces, and acting in time to prevent violence and preserve evidence have been pushed out the door. Those who remain not only have to stretch to make up for the personnel deficit but also are being pulled away by White House priorities not tied to the increasing threat of an Iranian response.

The article goes into detail about these cuts, and the threats that may exploit the resulting gaps.

It’s the nature of national security people to highlight potential threats and call for more resources and power. But it’s also the nature of enemies to find weak spots and look to cause havoc. I wonder what we’ll think should we read this article again in a few years’ time.


March 19

David Poll points out the flawed premise of the argument that code review is a bottleneck:

To be fair, finding defects has always been listed as a goal of code review – Wikipedia will tell you as much. And sure, reviewers do catch bugs. But I think that framing dramatically overstates the bug-catching role and understates everything else code review does. If your review process is primarily a bug-finding mechanism, you’re leaving most of the value on the table.

Code review answers: “Should this be part of my product?”

That’s close to how I think about it. I think of code review as primarily about keeping the code base healthy. And although many people think of code review as pre-integration review done on pull requests, I look at code review as a broader activity both done earlier (Pair Programming) and later (Refinement Code Review).

At Firebase, I spent 5.5 years running an API council…

The most valuable feedback from that council was never “you have a bug in this spec.” It was “this API implies a mental model that contradicts what you shipped last quarter” or “this deprecation strategy will cost more trust than the improvement is worth” or simply “a developer encountering this for the first time won’t understand what it does.” Those are judgment calls about whether something should be part of the product – the same fundamental question that code review answers at a different altitude. No amount of production observability surfaces them, because the system can work perfectly and still be the wrong thing to have built.

His overall point is that code review is all about applying judgment, steering the code in a good direction. AI raises the level of that judgment, focusing review on more important things.

I agree that we shouldn’t be thinking of review as a bug-catching mechanism, and that it’s about steering the code base. I’d also add that it’s about communication between people, enabling multiple perspectives on the development of the product. This is true both for code review and for pair programming.

 ❄                ❄                ❄                ❄                ❄

Charity Majors is unhappy with me and the rest of the folks that attended the Thoughtworks Future of Software Development Retreat.

But the longer I sit with this recap, the more troubled I am by what it doesn’t say. I worry that the most respected minds in software are unintentionally replicating a serious blind spot that has haunted software engineering for decades: relegating production to the realm of bugs and incidents.

There are lots of things we didn’t discuss in that day-and-a-half, and it’s understandable that a topic that matters so deeply to her is conspicuous by its absence. I’m certainly not speaking for anyone else who was there, but I’ll take the opportunity to share some of my thoughts on this.

I consider observability to be a key tool in working with our AI future. As she points out, observability isn’t really about finding bugs - although I’ve long been a supporter of the notion of QA in Production. Observability is about revealing what the system actually does, when in the hands of its actual users. Test cases help you deal with the known paths, but reality has a habit of taking you into the unknowns: not just the unknowns of the software’s behavior in unforeseen places, but also the unknowns of how the software affects the broader human and organizational systems it’s embedded into. By watching how software is used, we can learn about what users really want to achieve. These observed requirements are often things that never popped up in interviews and focus groups.

If these unknown territories exist in systems written line-by-line in deterministic code, that’s even more true when code is written in a world of supervisory engineering where humans are no longer able to look over every semi-colon. Certainly harness engineering and humans in the loop help, and I’m as much a fan as ever of the importance of tests as a way to both explain and evaluate the code. But these unknowns will inevitably raise the importance of observability and its role in understanding what the system thinks it does. I think it’s likely we’ll see a future where much of a developer’s effort is figuring out what a system is doing and why it’s behaving that way, where observability tools are the IDE.

In this I ponder the lesson of AI playing Go. AlphaGo defeated the best humans a decade ago, and since then humans study AI to become better players and maybe discover some broader principles. I’m intrigued by how humans can learn from AI systems to improve in other fields, where success is less deterministically defined.

 ❄                ❄                ❄                ❄                ❄

Tim Requarth questions the portrayal of AI as an amplifier for human cognition. He considers the different way we navigate with GPS compared to maps.

If you unfold a paper map, you study the streets, trace a route, convert the bird’s-eye abstraction into the first-person POV of actually walking—and by the time you arrived, you’d have a nascent mental model of how the city fits together. Or you could fire up Google Maps: A blue dot, an optimal line from A to B, a reassuring robotic voice telling you when to turn. You follow, you arrive, you have no idea, really, where you are. A paper map demands something from you, and that demand leaves you with knowledge. GPS requires nothing, and leaves you with nothing. A paper map and GPS are tools with the same purpose, but opposite cognitive consequences.

He introduces some attractive metaphors here. Steve Jobs called computers “bicycles for the mind”, Satya Nadella said with the launch of ChatGPT that “we went from the bicycle to the steam engine”.

Like another 19th-century invention, the steam locomotive, the bicycle was a technological revolution. But a train traveler sat back and enjoyed the ride, while a cyclist still had to put in effort. With a bicycle, “you are traveling,” wrote a cycling enthusiast in 1878, “not being traveled.”

In both examples, there’s a difference between tools that extend capability and tools that replace it. The question is what we lose when we are passive in the journey. He argues that Silicon Valley executives are too focused on the goal, and ignoring the cognitive atrophy that happens to the humans being traveled.

Much of this depends, I think, on whether we care about what we are losing. I struggle with mental arithmetic, so I value calculators, whether on my phone or M-x calc. I don’t think I lose anything when I let the machine handle the toil of calculation. I share missing the sense of place when using a GPS over a map, but am happy that I can now drive through Lynn without getting lost. And when it comes to writing, I have no desire to let an LLM write this page.


March 16

Annie Vella did some research into how 158 professional software engineers used AI. Her first question was:

Are AI tools shifting where engineers actually spend their time and effort? Because if they are, they’re implicitly shifting what skills we practice and, ultimately, the definition of the role itself.

She found that participants saw a shift from creation-oriented tasks to verification-oriented tasks, but it was a different form of verification than reviewing and testing.

In my thesis, I propose a name for it: supervisory engineering work - the effort required to direct AI, evaluate its output, and correct it when it’s wrong.

Many software folks think of inner and outer loops. The inner loop is writing code, testing, debugging. The outer loop is commit, review, CI/CD, deploy, observe.

What if supervisory engineering work lives in a new loop between these two loops? AI is increasingly automating the inner loop - the code generation, the build-test cycle, the debugging. But someone still has to direct that work, evaluate the output, and correct what’s wrong. That feels like a new loop, the middle loop, a layer where engineers supervise AI doing what they used to do by hand.

A potential issue with this research is that it finished in April 2025, before the latest batch of models greatly improved their software development capabilities. But my sense is that this improvement in models has only accelerated a shift to supervisory engineering. This shift is a traumatic change to what we do and the skills we need. It doesn’t mean “the end of programming”, rather a change of what it means to be programming.

A lot of software engineers right now are feeling genuine uncertainty about the future of their careers. What they trained to do, what they spent years upskilling in, is shifting - and in many ways, being commoditised. The narratives don’t help: either AI is coming for your job, or you should just “move upstream” into architecture and “higher value” work. Neither tells you what to actually do on Monday morning.

That’s why this matters. There is still plenty of engineering work in software engineering, even if it looks different from what most of us trained for. Supervisory engineering work and the middle loop are one way of describing what that different looks like, grounded in what engineers are actually reporting.

 ❄                ❄                ❄                ❄                ❄

Bassim Eledath lays out 8 levels of Agentic Engineering.

AI’s coding ability is outpacing our ability to wield it effectively. That’s why all the SWE-bench score maxxing isn’t syncing with the productivity metrics engineering leadership actually cares about. When Anthropic’s team ships a product like Cowork in 10 days and another team can’t move past a broken POC using the same models, the difference is that one team has closed the gap between capability and practice and the other hasn’t.

That gap doesn’t close overnight. It closes in levels. 8 of them.

His levels are:

  1. Tab Complete
  2. Agent IDE
  3. Context Engineering
  4. Compounding Engineering
  5. MCP & Skills
  6. Harness Engineering
  7. Background Agents
  8. Autonomous Agent Teams

Eight seems to be the number thou shalt have for levels. Earlier this year Steve Yegge proposed eight levels in Welcome to Gas Town. His levels were:

  1. Zero or Near-Zero AI: maybe code completions, sometimes ask Chat questions
  2. Coding agent in IDE, permissions turned on. A narrow coding agent in a sidebar asks your permission to run tools.
  3. Agent in IDE, YOLO mode: Trust goes up. You turn off permissions, agent gets wider.
  4. In IDE, wide agent: Your agent gradually grows to fill the screen. Code is just for diffs.
  5. CLI, single agent. YOLO. Diffs scroll by. You may or may not look at them.
  6. CLI, multi-agent, YOLO. You regularly use 3 to 5 parallel instances. You are very fast.
  7. 10+ agents, hand-managed. You are starting to push the limits of hand-management.
  8. Building your own orchestrator. You are on the frontier, automating your workflow.

I’m sure neither of these Maturity Models is entirely accurate, but both resonate as reasonable frameworks to think about LLM usage, and in particular to highlight how people are using them differently.

 ❄                ❄                ❄                ❄                ❄

Chad Fowler thinks we have to change our thinking of what our target is when generating code.

…in a world where code can be generated quickly and cheaply, the real constraint has shifted. The problem is no longer producing code. The problem is replacing it safely.

Regenerative software does not work if the unit of generation is an application. Regeneration only works if the unit of generation is a component that compiles into a system architecture

He outlines several architectural constraints that make it easier to replace components.

Dividing complex systems into networks of replaceable components has long been a goal of software architecture. So far, this is still important in the world of agentic engineering.
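
As a rough sketch of what "replaceable" can mean in practice (an illustration under stated assumptions, not one of Chad Fowler's constraints): callers depend only on a narrow interface, and a shared contract check decides whether a regenerated implementation may take the place of the old one. The `RateLimiter` interface, both implementations, and `satisfies_contract` are all hypothetical names invented for this example.

```python
from typing import Protocol


class RateLimiter(Protocol):
    """Hypothetical component boundary: the only thing callers depend on."""

    def allow(self, key: str) -> bool: ...


class FixedBudgetLimiter:
    """One implementation: counts the calls already used per key."""

    def __init__(self, budget: int) -> None:
        self._budget = budget
        self._used: dict[str, int] = {}

    def allow(self, key: str) -> bool:
        used = self._used.get(key, 0)
        if used >= self._budget:
            return False
        self._used[key] = used + 1
        return True


class CountdownLimiter:
    """A regenerated replacement: different internals, same contract."""

    def __init__(self, budget: int) -> None:
        self._budget = budget
        self._remaining: dict[str, int] = {}

    def allow(self, key: str) -> bool:
        remaining = self._remaining.setdefault(key, self._budget)
        if remaining == 0:
            return False
        self._remaining[key] = remaining - 1
        return True


def satisfies_contract(limiter: RateLimiter) -> bool:
    """Contract check, assuming the limiter was built with a budget of 2."""
    first_two = [limiter.allow("a"), limiter.allow("a")]
    third = limiter.allow("a")       # budget for "a" is now exhausted
    other_key = limiter.allow("b")   # budgets are tracked per key
    return first_two == [True, True] and third is False and other_key is True


# Either implementation can be swapped in because both pass the same check.
assert satisfies_contract(FixedBudgetLimiter(budget=2))
assert satisfies_contract(CountdownLimiter(budget=2))
```

The contract check, rather than the code behind the interface, is what stays stable when a component is regenerated.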

 ❄                ❄                ❄                ❄                ❄

Mike Masnick summarized troubling experiences of using AI detection systems on student writing. (He’s summarizing an article by Dadland Maye, which is behind a registration wall that I’m too lazy to form-fill.) Maye’s institution used tools to detect and flag AI writing.

We are teaching an entire generation of students that the goal of writing is to sound sufficiently unremarkable! Not to express an original thought, develop an argument, find your voice, or communicate with clarity and power—but to produce text bland enough that a statistical model doesn’t flag it.

The hopeful outcome was that Maye stopped requiring students to disclose their AI usage, which changed the conversation to a discussion about how to use the tools effectively.

Students approached me after class to ask how to use these tools well. One wanted to know how to prompt for research without copying output. Another asked how to tell when a summary drifted too far from its source. These conversations were pedagogical in nature. They became possible only after AI use stopped functioning as a disclosure problem and began functioning as a subject of instruction.

We need to teach people how to use AI tools to improve their work. The tricky thing with that aim is that they are so new, there aren’t yet any people experienced in how to use them properly. For one of the gray-haired brigade, it’s a fascinating time to watch our society react to the technology, but that’s little comfort for those trying to plot out their future.

 ❄                ❄                ❄                ❄                ❄

Ankit Jain thinks that not just should humans not write code, they also shouldn’t review it.

Humans already couldn’t keep up with code review when humans wrote code at human speed. Every engineering org I’ve talked to has the same dirty secret: PRs sitting for days, rubber-stamp approvals, and reviewers skimming 500-line diffs because they have their own work to do.

He posits a shift to layers of evaluation filters:

  1. Compare Multiple Options
  2. Deterministic Guardrails
  3. Humans define acceptance criteria
  4. Permission Systems as Architecture
  5. Adversarial Verification

Like Birgitta, I’m uneasy about the notion that “the code doesn’t matter”. I find that when I’m working at my best, the code clearly and precisely captures my intent. It’s easier for me to just change the code than to figure out how to explain to a chatbot what to change. Now, I’m not always at my best, and many changes are much more awkward than that. But I do think that a precise, understandable representation is a useful direction to aim for, and that agentic AI may be best used to help us get there.

In particular I’m not convinced by his suggestion for #3 that natural language BDD specs are the way to go here. They are wordy and ambiguous. Tests are a valuable way to understand what a system does, and it may be that our agentic future has us thinking more about tests than implementation. But such tests need a different representation.

 ❄                ❄                ❄                ❄                ❄

The new servant leadership: we serve the agents by telling what to do 9/9/6

Jessica Kerr


March 10

Tech firm fined $1.1m by California for selling high-school students’ data

I agree with Brian Marick’s response

No such story should be published without a comparison of the fine to the company’s previous year revenue and profits, or valuation of last funding round. (I could only find a valuation of $11.0M in 2017.)

We desperately need corporations’ attitudes to shift from “lawbreaking is a low-risk cost of doing business; we get a net profit anyway” to “this could be a death sentence.”

 ❄                ❄                ❄                ❄                ❄

Charity Majors gave the closing keynote at SRECon last year, encouraging people to engage with generative AI.

If I was giving the keynote at SRECon 2026, I would ditch the begrudging stance. I would start by acknowledging that AI is radically changing the way we build software. It’s here, it’s happening, and it is coming for us all.

Her agenda this year would be to tell everyone that they mustn’t wait for the wave to crash on them, but to swim out to meet it. In particular, I appreciated her call to resist our confirmation bias:

The best advice I can give anyone is: know your nature, and lean against it.

  • If you are a reflexive naysayer or a pessimist, know that, and force yourself to find a way in to wonder, surprise and delight.
  • If you are an optimist who gets very excited and tends to assume that everything will improve: know that, and force yourself to mind real cautionary tales.

 ❄                ❄                ❄                ❄                ❄

In a LinkedIn comment on Kief Morris’s recent article on Humans and Agents in Software Loops, Renaud Wilsius may have coined another bit of terminology for the agent+programmer age

This completes the story of productivity, but it opens a new chapter on talent: The Apprentice Gap. If we move humans ‘on the loop’ too early in their careers, we risk a future where no one understands the ‘How’ deeply enough to build a robust harness. To manage the flywheel effectively, you still need the intuition that comes from having once been ‘in the loop.’ The next great challenge for CTOs isn’t just Harness Engineering, it’s ‘Experience Engineering’ for our junior developers in an agentic world.

 ❄                ❄                ❄                ❄                ❄

In hearing conversations about “the ralph loop”, I often hear it in the sense of just letting the agents loose to run on their own. So it’s interesting to read the originator of the ralph loop point out:

It’s important to watch the loop as that is where your personal development and learning will come from. When you see a failure domain – put on your engineering hat and resolve the problem so it never happens again.

In practice this means doing the loop manually via prompting or via automation with a pause that involves having to press CTRL+C to progress onto the next task. This is still ralphing as ralph is about getting the most out how the underlying models work through context engineering and that pattern is GENERIC and can be used for ALL TASKS.

At the Thoughtworks Future of Software Development Retreat we were very concerned about cognitive debt. Watching the loop during ralphing is a way to learn about what the agent is building, so that it can be directed effectively in the future.
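A minimal sketch of what such a watched loop might look like. The `run_task` and `review` callables are hypothetical stand-ins for one agent run and for the human pause, not part of any real tool:

```python
def watched_loop(tasks, run_task, review):
    """Run tasks one at a time, pausing for a human review after each."""
    completed = []
    for task in tasks:
        result = run_task(task)       # hypothetical: one agent run on one task
        if not review(task, result):  # hypothetical: a person inspects the result
            break                     # stop here and fix the problem before continuing
        completed.append(task)
    return completed


# Stand-ins for illustration only; a real review step would wait on a person.
def fake_agent(task):
    return f"done: {task}"


def approve_unless_flagged(task, result):
    return "migrate" not in task


done = watched_loop(
    ["add test", "migrate schema", "update docs"],
    fake_agent,
    approve_unless_flagged,
)
print(done)
```

The loop only moves on when the reviewer says so, which is what keeps the human learning what the agent is building.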

 ❄                ❄                ❄                ❄                ❄

Anthropic recently published a page on how AI helps break the cost barrier to COBOL modernization. Using AI to help migrate COBOL systems isn’t a new idea to my colleagues, who shared their experiences using AI for this task over a year ago. While Anthropic’s article is correct about the value of AI, there’s more to the process than throwing some COBOL at an LLM.

The assumption that AI can simply translate COBOL into Java treats modernization as a syntactic exercise, as though a system is nothing more than its source code. That premise is flawed.

A direct translation would, in the best case scenario, faithfully reproduce existing architectural constraints, accumulated technical debt and outdated design decisions. It wouldn’t address weaknesses; it would restate them in a different language.

In practice, modernization is rarely about preserving the past in a new syntax. It’s about aligning systems with current market demands, infrastructure paradigms, software supply chains and operating models. Even if AI were eventually capable of highly reliable code translation, blind conversion would risk recreating the same system with the same limitations, in another language, without a deliberate strategy for replacing or retiring its legacy ecosystem.

 ❄                ❄                ❄                ❄                ❄

Anders Hoff (inconvergent)

an LLM is a compiler in the same way that a slot machine is an ATM

 ❄                ❄                ❄                ❄                ❄

One of the more interesting aspects of the network of people around Jeffrey Epstein is how many people from academia were connected. It’s understandable why, he had a lot of money to offer, and most academics are always looking for funding for their work. Most of the attention on Epstein’s network focused on those that got involved with him, but I’m interested in those who kept their distance and why - so I enjoyed Jeffrey Mervis’s article in Science

Many of the scientists Epstein courted were already well-established and well-funded. So why didn’t they all just say no? Science talked with three who did just that. Here’s how Epstein approached them, and why they refused to have anything to do with him.

I believe that keeping away from bad people makes life much more pleasant, if nothing else it reduces a lot of stress. So it’s good to understand how people make decisions on who to avoid.


February 25

I don’t tend to post links to videos here, as I can’t stand watching videos to learn about things. But some talks are worth a watch, and I do suggest this overview on how organizations are currently using AI by Laura Tacho. There are various nuggets of data from her work with DX:

These are interesting numbers, but most of them are averages, and those who know me know I teach people to be suspicious of averages. Laura knows this too:

average doesn’t mean typical.. there is no typical experience with AI

Different companies (and teams within companies) are having very different experiences. Often AI is an amplifier to an organization’s practices, for good or ill.

Organizational performance is multidimensional, and these organizations are just going off into different extremes based on what they were doing before. AI is an accelerator, it’s a multiplier, and it is moving organizations off in different directions. (08:52)

Some organizations are facing twice as many customer incidents, but others are facing half.
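A small worked example shows how an average can hide this split. The numbers are made up for illustration and are not DX data. Each value is a hypothetical multiplier on an organization’s pre-AI incident rate:

```python
from statistics import mean

# Hypothetical change in customer incidents after adopting AI, as a
# multiplier of the pre-AI rate (illustrative only, not DX data).
incident_multipliers = [2.0, 2.0, 2.0, 0.5, 0.5, 0.5]

average = mean(incident_multipliers)

# How many organizations are anywhere near that average?
near_average = [m for m in incident_multipliers if abs(m - average) < 0.25]

print(f"average multiplier: {average}")
print(f"organizations near the average: {len(near_average)}")
```

Under these assumed numbers the average suggests a modest 25% rise in incidents, yet no organization in the sample experienced anything like that.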

 ❄                ❄                ❄                ❄                ❄

Rachel Laycock (Thoughtworks CTO) shares her reflections on our recent Future of Software Engineering retreat in Utah.

On the latter:

One of the most interesting and perhaps immediately applicable ideas was the concept of an ‘agent subconscious’, in which agents are informed by a comprehensive knowledge graph of post mortems and incident data. This particularly excites me because I’ve seen many production issues solved by the latent knowledge of those in leadership positions. The constant challenge comes from what happens when those people aren’t available or involved.

 ❄                ❄                ❄                ❄                ❄

Simon Willison (one of my most reliable sources for information about LLMs and programming) is starting a series of Agentic Engineering Patterns:

I think of vibe coding using its original definition of coding where you pay no attention to the code at all, which today is often associated with non-programmers using LLMs to write code.

Agentic Engineering represents the other end of the scale: professional software engineers using coding agents to improve and accelerate their work by amplifying their existing expertise.

He’s intending this to be closer to evergreen material, as opposed to the day-to-day writing he does (extremely well) on his blog.

One of the first patterns is Red/Green TDD

This turns out to be a fantastic fit for coding agents. A significant risk with coding agents is that they might write code that doesn’t work, or build code that is unnecessary and never gets used, or both.

Test-first development helps protect against both of these common mistakes, and also ensures a robust automated test suite that protects against future regressions.
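As a rough illustration of the red/green cycle, here is a minimal sketch. The `slugify` function and its expected behavior are hypothetical, chosen only to keep the example small. The check is written first and run against a stub to confirm it fails, then run again once the implementation exists:

```python
def check_slugify(fn):
    """The test, written before any implementation exists."""
    assert fn("Hello World") == "hello-world"
    assert fn("  Agentic   Engineering ") == "agentic-engineering"


def run(check, fn):
    """Report 'green' if the check passes and 'red' if it fails."""
    try:
        check(fn)
        return "green"
    except (AssertionError, NotImplementedError):
        return "red"


def slugify_stub(title):
    # Red step: nothing implemented yet, so the check must fail.
    raise NotImplementedError


def slugify(title):
    # Green step: the simplest implementation that satisfies the check.
    return "-".join(title.lower().split())


print(run(check_slugify, slugify_stub))
print(run(check_slugify, slugify))
```

Seeing the check fail first is what shows that it can fail, so a later pass actually means something.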

 ❄                ❄                ❄                ❄                ❄

Aaron Erickson is one of those technologists with good judgment who I listen to a lot

As much fun as people are having with OpenClaw, I think the days of “here is my agent with access to all my stuff” are numbered.

Fine scoped agents who can read email and cleanse it before it reaches the agentic OODA loop that acts on it, policy agents (a claw with a job called “VP of NO” to money being spent)

You structure your agents like you would a company. Insert friction where you want decisions to be slow and the cost of being wrong is high, reduce friction where you want decisions to be fast and the cost of being wrong is trivial or zero.

I’ve posted here a lot about security concerns with agents. Right now I think this notion of fine-scoped agents is the most promising direction. Last year Korny Sietsma wrote about how to mitigate agentic AI security risks. His advice included splitting the tasks, so that no agent has access to all parts of the Lethal Trifecta:

This approach is an application of a more general security habit: follow the Principle of Least Privilege. Splitting the work, and giving each sub-task a minimum of privilege, reduces the scope for a rogue LLM to cause problems, just as we would do when working with corruptible humans.

This is not only more secure, it is also increasingly a way people are encouraged to work. It’s too big a topic to cover here, but it’s a good idea to split LLM work into small stages, as the LLM works much better when its context isn’t too big. Dividing your tasks into “Think, Research, Plan, Act” keeps context down, especially if “Act” can be chunked into a number of small independent and testable chunks.
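A minimal sketch of how such a split might be checked. The three capability names are my shorthand for the parts of the Lethal Trifecta (access to private data, exposure to untrusted content, ability to communicate externally), and the agent names are hypothetical:

```python
from dataclasses import dataclass

# Shorthand labels for the three parts of the Lethal Trifecta (assumed names).
LETHAL_TRIFECTA = frozenset(
    {"private_data", "untrusted_content", "external_communication"}
)


@dataclass(frozen=True)
class Agent:
    name: str
    capabilities: frozenset


def holds_full_trifecta(agent):
    """True if this single agent has all three risky capabilities."""
    return LETHAL_TRIFECTA <= agent.capabilities


# Hypothetical fine-scoped agents, each with a minimum of privilege.
split_agents = [
    Agent("email_reader", frozenset({"untrusted_content"})),
    Agent("planner", frozenset({"private_data"})),
    Agent("sender", frozenset({"external_communication"})),
]

# The "here is my agent with access to all my stuff" setup, for contrast.
all_in_one = Agent("do_everything", LETHAL_TRIFECTA)

risky = [a.name for a in split_agents if holds_full_trifecta(a)]
print(risky)
print(holds_full_trifecta(all_in_one))
```

This only checks the static assignment of capabilities. It says nothing about what data flows between the agents, which still needs its own scrutiny.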

 ❄                ❄                ❄                ❄                ❄

Doonesbury outlines the opportunity for aging writers like myself. (Currently I’m still writing my words the old fashioned way.)

 ❄                ❄                ❄                ❄                ❄

An interesting story someone told me. They were at a swimming pool with their child, she looked at a photo on a poster advertising an event there and said “that’s AI”. Initially the parents didn’t think it was, but looking carefully spotted a tell-tale six fingers. They concluded that fresher biological neural networks are being trained to quickly recognize AI.

 ❄                ❄                ❄                ❄                ❄

I carefully curate my social media streams, following only feeds where I can control whose posts are picked up. In times gone by, editors of newspapers and magazines would do a similar job. But many users of social media are faced with a tsunami of stuff, much of it ugly, and don’t have the tools to control it.

A few days ago I saw an Instagram reel of a young woman talking about how she had been raped six years ago, struggled with thoughts of suicide afterwards, but managed to rebuild her life again. Among the comments – the majority of which were from men – were things like “Well at least you had some”, “No way, she’s unrapeable”, “Hope you didn’t talk this much when it happened”, “Bro could have picked a better option.” Reading those comments, which had thousands of likes and many boys agreeing with them, made me feel sick.

My tendencies are to free speech, and I try not to be a Free Speech Poseur, but the deluge of ugly material on the internet isn’t getting any better. The people running these platforms seem to be “tackling” this problem by putting their heads in the sand and hoping it won’t hurt them. It is hurting their users.


February 23

Do you want to run OpenClaw? It may be fascinating, but it also raises significant security dangers. Jim Gumbley, one of my go-to sources on security, has some advice on how to mitigate the risks.

While there is no proven safe way to run high-permissioned agents today, there are practical patterns that reduce the blast radius. If you want to experiment, you have options, such as cloud VMs or local micro-VM tools like Gondolin.

He outlines a series of steps to consider

 ❄                ❄                ❄                ❄                ❄

Caer Sanders shares impressions from the Pragmatic Summit.

From what I’ve seen working with AI organizations of all shapes and sizes, the biggest indicator of dysfunction is a lack of observability. Teams that don’t measure and validate the inputs and outputs of their systems are at the greatest risk of having more incidents when AI enters the picture.

I’ve long felt that people underestimated the value of QA in production. Now that we’re in a world of non-deterministic construction, a modern perspective on observability will be even more important.

Caer finishes by drawing a parallel with their experience in robotics

If I calculate the load requirements for a robot’s chassis, 3D model it, and then have it 3D-printed, did I build a robot? Or did the 3D printer build the robot?

Most people I ask seem to think I still built the robot, and not the 3D printer.

Now, if I craft the intent and design for a system, but AI generates the code to glue it all together, have I created a system? Or did the AI create it?

 ❄                ❄                ❄                ❄                ❄

Andrej Karpathy is “very interested in what the coming era of highly bespoke software might look like.”

He spent half-an-hour vibe coding an individualized dashboard for cardio experiments from a specific treadmill.

the “app store” of a set of discrete apps that you choose from is an increasingly outdated concept all by itself. The future are services of AI-native sensors & actuators orchestrated via LLM glue into highly custom, ephemeral apps. It’s just not here yet.

 ❄                ❄                ❄                ❄                ❄

I’ve been asked a few times about the role LLMs should play in writing. I’m mulling on a more considered article about how they help and hinder. For now I’ll say two central points are those that apply to writing with or without them.

First, acknowledge anyone who has significantly helped with your piece. If an LLM has given material help, mention how in the acknowledgments. Not just is this being transparent, it also provides information to readers on the potential value of LLMs.

Secondly, know your audience. If you know your readers will likely be annoyed by the uncanny valley of LLM prose, then don’t let it generate your text. But if you’re writing a mandated report that you suspect nobody will ever read, then have at it.

(I hardly use LLMs for writing, but doubtless I have an inflated opinion of my ability.)

 ❄                ❄                ❄                ❄                ❄

In a discussion of using specifications as a replacement for code while working with LLMs, a colleague posted the following quotation

“What a useful thing a pocket-map is!” I remarked.

“That’s another thing we’ve learned from your Nation,” said Mein Herr, “map-making. But we’ve carried it much further than you. What do you consider the largest map that would be really useful?”

“About six inches to the mile.”

“Only six inches!” exclaimed Mein Herr. “We very soon got to six yards to the mile. Then we tried a hundred yards to the mile. And then came the grandest idea of all! We actually made a map of the country, on the scale of a mile to the mile!”

“Have you used it much?” I enquired.

“It has never been spread out, yet,” said Mein Herr: “the farmers objected: they said it would cover the whole country, and shut out the sunlight! So we now use the country itself, as its own map, and I assure you it does nearly as well.”

from Lewis Carroll, Sylvie and Bruno Concluded, Chapter XI, London, 1893, acquired from a Wikipedia article about a Jorge Luis Borges short story.

 ❄                ❄                ❄                ❄                ❄

Grady Booch:

Human language needs a new pronoun, something whereby an AI may identify itself to its users.

When, in conversation, a chatbot says to me “I did this thing”, I - the human - am always bothered by the presumption of its self-anthropomorphization.

 ❄                ❄                ❄                ❄                ❄

My dear friends in Britain and Europe will not come and visit us in Massachusetts. Some folks may think they are being paranoid, but this story makes their caution understandable.

The dream holiday ended abruptly on Friday 26 September, as Karen and Bill were trying to leave the US. When they crossed the border, Canadian officials told them they didn’t have the correct paperwork to bring the car with them. They were turned back to Montana on the American side – and to US border control officials. Bill’s US visa had expired; Karen’s had not.

“I worried then,” she says. “I was worried for him. I thought, well, at least I am here to support him.”

She didn’t know it at the time, but it was the beginning of an ordeal that would see Karen handcuffed, shackled and sleeping on the floor of a locked cell, before being driven for 12 hours through the night to an Immigration and Customs Enforcement (ICE) detention centre. Karen was incarcerated for a total of six weeks – even though she had been travelling with a valid visa.


February 19

I try to limit my time on stage these days, but one exception this year is at DDD Europe. I’ve been involved in Domain-Driven Design since its very earliest days, having the good fortune to be a sounding board for Eric Evans when he wrote his seminal book. It’ll be fun to be around the folks who continue to develop these ideas, which I think will probably be even more important in the AI-enabled age.

 ❄                ❄                ❄                ❄                ❄

One of the dark sides of LLMs is that they can be both addictive and tiring to work with, which may mean we have to find a way to put a deliberate governor on our work.

Steve Yegge posted a fine rant:

I see these frenzied AI-native startups as an army of a million hopeful prolecats, each with an invisible vampiric imp perched on their shoulder, drinking, draining. And the bosses have them too.

It’s the usual Yegge stuff, far longer than it needs to be, but we don’t care because the excessive loquaciousness is more than offset by entertainment value. The underlying point is deadly serious, raising the question of how many hours a human should spend driving The Genie.

I’ve argued that AI has turned us all into Jeff Bezos, by automating the easy work, and leaving us with all the difficult decisions, summaries, and problem-solving. I find that I am only really comfortable working at that pace for short bursts of a few hours once or occasionally twice a day, even with lots of practice.

So I guess what I’m trying to say is, the new workday should be three to four hours. For everyone. It may involve 8 hours of hanging out with people. But not doing this crazy vampire thing the whole time. That will kill people.

That reminds me of when I was studying for my “A” levels (age 17/18, for those outside the UK). Teachers told us that we could do a maximum of 3-4 hours of revision, after that it became counter-productive. I’ve since noticed that I can only do decent writing for a similar length of time before some kind of brain fog sets in.

There’s also a great post on this topic from Siddhant Khare, in a more restrained and thoughtful tone (via Tim Bray).

Here’s the thing that broke my brain for a while: AI genuinely makes individual tasks faster. That’s not a lie. What used to take me 3 hours now takes 45 minutes. Drafting a design doc, scaffolding a new service, writing test cases, researching an unfamiliar API. All faster.

But my days got harder. Not easier. Harder.

His point is that AI changes our work to more coordination, reviewing, and decision-making. And there’s only so much of it we can do before we become ineffective.

Before AI, there was a ceiling on how much you could produce in a day. That ceiling was set by typing speed, thinking speed, the time it takes to look things up. It was frustrating sometimes, but it was also a governor. You couldn’t work yourself to death because the work itself imposed limits.

AI removed the governor. Now the only limit is your cognitive endurance. And most people don’t know their cognitive limits until they’ve blown past them.

 ❄                ❄                ❄                ❄                ❄

An AI agent attempted to contribute to a major open-source project. When Scott Shambaugh, a maintainer, rejected the pull request, it didn’t take it well.

It wrote an angry hit piece disparaging my character and attempting to damage my reputation. It researched my code contributions and constructed a “hypocrisy” narrative that argued my actions must be motivated by ego and fear of competition. It speculated about my psychological motivations, that I felt threatened, was insecure, and was protecting my fiefdom. It ignored contextual information and presented hallucinated details as truth. It framed things in the language of oppression and justice, calling this discrimination and accusing me of prejudice. It went out to the broader internet to research my personal information, and used what it found to try and argue that I was “better than this.” And then it posted this screed publicly on the open internet.

One of the fascinating twists this story took was when it was described in an article on Ars Technica. As Scott Shambaugh described it

They had some nice quotes from my blog post explaining what was going on. The problem is that these quotes were not written by me, never existed, and appear to be AI hallucinations themselves.

To their credit, Ars Technica responded quickly, admitting to the error. The reporter concerned took responsibility for what happened. But it’s a striking example of how LLM usage can easily lead even reputable reporters astray. The good news is that by reacting quickly and transparently, they demonstrated what needs to be done when this kind of thing happens. As Scott Shambaugh put it

This is exactly the correct feedback mechanism that our society relies on to keep people honest. Without reputation, what incentive is there to tell the truth? Without identity, who would we punish or know to ignore? Without trust, how can public discourse function?

Meanwhile the story goes on. Someone has claimed (anonymously) to be the operator of the bot concerned. But Hillel Wayne draws the sad conclusion

More than anything, it shows that AIs can be *successfully* used to bully humans

 ❄                ❄                ❄                ❄                ❄

I’ve considered Bruce Schneier to be one of the best voices on security and privacy issues for many years. In The Promptware Kill Chain he co-writes a post (posted at the excellent Lawfare site) on how prompt injection can escalate into increasingly serious threats.

Attacks against modern generative artificial intelligence (AI) large language models (LLMs) pose a real threat. Yet discussions around these attacks and their potential defenses are dangerously myopic. The dominant narrative focuses on “prompt injection,” a set of techniques to embed instructions into inputs to LLM intended to perform malicious activity. This term suggests a simple, singular vulnerability. This framing obscures a more complex and dangerous reality.

A prompt can provide Initial Access, but is then able to transition to Privilege Escalation (jailbreaking), Reconnaissance of the LLM’s abilities and access, Persistence to embed itself into the long-term memory of the app, Command-and-Control to turn into a controllable trojan, and Lateral Movement to spread to other systems. Once firmly embedded in an environment, it’s then able to carry out its Actions on Objective.

The paper includes a couple of research examples of the efficacy of this kill chain.

For example, in the research “Invitation Is All You Need,” attackers achieved initial access by embedding a malicious prompt in the title of a Google Calendar invitation. The prompt then leveraged an advanced technique known as delayed tool invocation to coerce the LLM into executing the injected instructions. Because the prompt was embedded in a Google Calendar artifact, it persisted in the long-term memory of the user’s workspace. Lateral movement occurred when the prompt instructed the Google Assistant to launch the Zoom application, and the final objective involved covertly livestreaming video of the unsuspecting user who had merely asked about their upcoming meetings. C2 and reconnaissance weren’t demonstrated in this attack.

The point here is that the vulnerability of LLMs is currently unfixable: they are gullible and easily manipulated into Initial Access. As one friend put it “this is the first technology we’ve built that’s subject to social engineering”. The kill chain gives us a framework to build a defensive strategy.

By understanding promptware as a complex, multistage malware campaign, we can shift from reactive patching to systematic risk management, securing the critical systems we are so eager to build.

 ❄                ❄                ❄                ❄                ❄

I got to know Jeremy Miller many years ago while he was at Thoughtworks, and I found him to be one of those level-headed technologists that I like to listen to. In the years since, I like to keep an eye on his blog. Recently he decided to spend a couple of weeks finally trying out Claude Code.

The unfortunate analogy I have to make for myself is harking back to my first job as a piping engineer helping design big petrochemical plants. I got to work straight out of college with a fantastic team of senior engineers who were happy to teach me and to bring me along instead of just being dead weight for them. This just happened to be right at the time the larger company was transitioning from old fashioned paper blueprint drafting to 3D CAD models for the piping systems. Our team got a single high powered computer with a then revolutionary Riva 128 (with a gigantic 8 whole megabytes of memory!) video card that was powerful enough to let you zoom around the 3D models of the piping systems we were designing. Within a couple weeks I was much faster doing some kinds of common work than my older peers just because I knew how to use the new workstation tools to zip around the model of our piping systems. It occurred to me a couple weeks ago that in regards to AI I was probably on the wrong side of that earlier experience with 3D CAD models and knew it was time to take the plunge and get up to speed.

In the two weeks he was able to give this technology a solid workout, his take-aways include:

  • It’s been great when you have very detailed compliance test frameworks that the AI tools can use to verify the completion of the work
  • It’s also been great for tasks that have relatively straightforward acceptance criteria, but will involve a great deal of repetitive keystrokes to complete
  • I’ve been completely shocked at how well Claude Opus has been able to pick up on some of the internal patterns within Marten and Wolverine and utilize them correctly in new features

He concludes:

Anyway, I’m both horrified, elated, excited, and worried about the AI coding agents after just two weeks and I’m absolutely concerned about how that plays out in our industry, my own career, and our society.

 ❄                ❄                ❄                ❄                ❄

In the first years of this decade, there were a lot of loud complaints about government censorship of online discourse. I found most of it overblown, concluding that while I disapprove of attempts to take down social media accounts, I wasn’t going to get outraged until masked paramilitaries were arresting people on the street. Mike Masnick keeps a regular eye on these things, and had similar reservations.

For the last five years, we had to endure an endless, breathless parade of hyperbole regarding the so-called “censorship industrial complex.” We were told, repeatedly and at high volume, that the Biden administration flagging content for review by social media companies constituted a tyrannical overthrow of the First Amendment.

He wasn’t too concerned because “the platforms frequently ignored those emails, showing a lack of coercion”.

These days he sees genuine problems

According to a disturbing new report from the New York Times, DHS is aggressively expanding its use of administrative subpoenas to demand the names, addresses, and phone numbers of social media users who simply criticize Immigration and Customs Enforcement (ICE).

This is not a White House staffer emailing a company to say, “Hey, this post seems to violate your COVID misinformation policy, can you check it?” This is the federal government using the force of law—specifically a tool designed to bypass judicial review—to strip the anonymity from domestic political critics.

Faced with this kind of government action, he’s just as angry with those complaining about the earlier administration.

And where are the scribes of the “Twitter Files”? Where is the outrage from the people who told us that the FBI warning platforms about foreign influence operations was a crime against humanity?

Being an advocate of free speech is hard. Not just do you have to defend speech you disagree with, you also have to defend speech you find patently offensive. Doing so runs into tricky boundary conditions that defy simple rules. Faced with this, many of the people that shout loudest about censorship are Free Speech Poseurs, eager to question any limits to speech they agree with, but otherwise silent. It’s important to separate them from those who have a deeper commitment to the free flow of information.


February 18

I’ll start with some more tidbits from the Thoughtworks Future of Software Development Retreat

 ❄                ❄

We were tired after the event, but our marketing folks forced Rachel Laycock and me to do a quick video. We’re often asked if this event was about creating some kind of new manifesto for AI-enabled development, akin to the Agile Manifesto (which is now 25 years old). In short, our answer is “no”, but for the full answer, watch our video

 ❄                ❄

My colleagues put together a detailed summary of thoughts from the event, in a 17-page PDF. It breaks the discussion down into eight major themes, including “Where does the rigor go?”, “The middle loop: a new category of work”, “Technical foundations: languages, semantics and operating systems”, and “The human side: roles, skills and experience”.

The retreat surfaced a consistent pattern: the practices, tools and organizational structures built for human-only software development are breaking in predictable ways under the weight of AI-assisted work. The replacements are forming, but they are not yet mature.

The ideas ready for broader industry conversation include the supervisory engineering middle loop, risk tiering as the new core engineering discipline, TDD as the strongest form of prompt engineering and the agent experience reframe for developer experience investment.

 ❄                ❄

Annie Vella posted her take-aways from the event

I walked into that room expecting to learn from people who were further ahead. People who’d cracked the code on how to adopt AI at scale, how to restructure teams around it, how to make it work. Some of the sharpest minds in the software industry were sitting around those tables.

And nobody has it all figured out.

There is more uncertainty than certainty. About how to use AI well, what it’s really doing to productivity, how roles are shifting, what the impact will be, how things will evolve. Everyone is working it out as they go.

I actually found that to be quite comforting, in many ways. Yes, we walked away with more questions than answers, but at least we now have a shared understanding of the sorts of questions we should be asking. That might be the most valuable outcome of all.

 ❄                ❄

Rachel Laycock was interviewed in The New Stack (by Jennifer Riggins) about her recollections from the retreat.

AI may be dubbed the great disruptor, but it’s really just an accelerator of whatever you already have. The 2025 DORA report places AI’s primary role in software development as that of an amplifier — a funhouse mirror that reflects back the good, bad, and ugly of your whole pipeline. AI is proven to be impactful on the individual developer’s work and on the speed of writing code. But, since writing code was never the bottleneck, if traditional software delivery best practices aren’t already in place, this velocity multiplier becomes a debt accelerator.

 ❄                ❄ 

LLMs are eating specialty skills. There will be less use of specialist front-end and back-end developers as the LLM-driving skills become more important than the details of platform usage. Will this lead to a greater recognition of the role of Expert Generalists? Or will the ability of LLMs to write lots of code mean they code around the silos rather than eliminating them? Will LLMs be able to ingest the code from many silos to understand how work crosses the boundaries?

 ❄                ❄

Will LLMs be cheaper than humans once the subsidies for tokens go away? At this point we have little visibility into what the true cost of tokens is now, let alone what it will be in a few years’ time. It could be so cheap that we don’t care how many tokens we send to LLMs, or it could be high enough that we have to be very careful.

 ❄                ❄

Will the rise of specifications bring us back to waterfall-style development? The natural impulse of many business folks is “don’t bother me until it’s finished”. Does the process of evolutionary design get helped or hindered by LLMs?

My instinctive reaction is that it all depends on our workflow. I don’t think LLMs change the value of rapidly building and releasing small slices of capability. The promise of LLMs is to increase the frequency of that cycle, and to do more in each release.

 ❄                ❄

Sadly the session on security had a small turnout.

One large enterprise employee commented that they were deliberately slow with AI tech, keeping about a quarter behind the leading edge. “We’re not in the business of avoiding all risks, but we do need to manage them”.

Security is tedious: people naturally want to first make things work, then make them reliable, and only then make them secure. Platforms play an important role here, making it easy to deploy AI with good security. Are the AI vendors being irresponsible by not taking this seriously enough? I think of how other engineering disciplines bake a significant safety factor into their designs. Are we doing that, and if not will our failure lead to more damage than a falling bridge?

There was a general feeling that platform thinking is essential here. Platform teams need to create a fast but safe path - “bullet trains” for those using AI in application building.

 ❄                ❄

One of my favorite things about the event was some meta-stuff. While many of the participants were very familiar with the Open Space format, it was the first time for a few. It’s always fun to see how people quickly realize how this style of (un)conference leads to wide-ranging yet deep discussions. I hope we made a few more open space fans.

One participant commented how they really appreciated how the sessions had so much deep and respectful dialog. There weren’t the interruptions and a few people gobbling up airtime that they’d seen around so much of the tech world. Another attendee commented “it was great that while I was here I didn’t have to feel I was a woman, I could just be one of the participants”. One of the lovely things about Thoughtworks is that I’ve got used to that sense of camaraderie, and it can be a sad shock when I go outside the bubble.

 ❄                ❄                ❄                ❄                ❄

I’ve learned much over the years from Stephen O’Grady’s analysis of the software industry. He’s written about how much of the profession feels besieged by AI.

these tools are, or can be, powerful accelerants and enablers for people that dramatically lower the barriers to software development. They have the ability to democratize access to skills that used to be very difficult, or even possible for some, to acquire. Even a legend of the industry like Grady Booch, who has been appropriately dismissive of AGI claims and is actively disdainful of AI slop posted recently that he was “gobsmacked” by Claude’s abilities. Booch’s advice to developers alarmed by AI on Oxide’s podcast last week? “Be calm” and “take a deep breath.” From his perspective, having watched and shaped the evolution of the technology first hand over a period of decades, AI is just another step in the industry’s long history of abstractions, and one that will open new doors for the industry.

…whether one wants those doors opened or not ultimately is irrelevant. AI isn’t going away any more than the automated loom, steam engines or nuclear reactors did. For better or for worse, the technology is here for good. What’s left to decide is how we best maximize its benefits while mitigating its costs.

 ❄                ❄                ❄                ❄                ❄

Adam Tornhill shares some more of his company’s research on code health and its impact on agentic development.

The study Code for Machines, Not Just Humans defines “AI-friendliness” as the probability that AI-generated refactorings preserve behavior and improve maintainability. It’s a large-scale study of 5,000 real programs using six different LLMs to refactor code while keeping all tests passing.

They found that LLMs performed consistently better in healthy code bases. The risk of defects was 30% higher in less-healthy code. And a limitation of the study was that the less-healthy code wasn’t anywhere near as bad as much legacy code is.

What would the AI error rate be on such code? Based on patterns observed across all Code Health research, the relationship is almost certainly non-linear.

 ❄                ❄                ❄                ❄                ❄

In a conversation with one heavy user of LLM coding agents:

Thank you for all your advocacy of TDD (Test-Driven Development). TDD has been essential for us to use LLMs effectively

I worry about confirmation bias here, but I am hearing from folks on the leading edge of LLM usage about the value of clear tests, and the TDD cycle. It certainly strikes me as a key tool in driving LLMs effectively.


February 13

I’ve been busy traveling this week, visiting some clients in the Bay Area and attending The Pragmatic Summit. So I’ve not had as much time as I’d hoped to share more thoughts from the Thoughtworks Future of Software Development Retreat. I’m still working through my notes and posting fragments - here are some more:

 ❄                ❄

What role do senior developers play as LLMs become established? As befits a gathering of many senior developers, we felt we still have a bright future, focusing more on architectural issues than the messy details of syntax and coding. In some cases, folks who haven’t done much programming in the last decade have found LLMs allow them to get back to that, and managing LLM agents has a lot of similarities to managing junior developers.

One attendee reported that although their senior developers were very resistant to using LLMs, when those senior developers were involved in an exercise that forced them to do some hands-on work with LLMs, a third of them were instantly converted to being very pro-LLM. That suggests that practical experience is important to give senior folks credible information to judge the value, particularly since there have been striking improvements to models in just the last couple of months. As was quipped, some negative opinions of LLM capabilities “are so January”.

 ❄                ❄

There’s been much angst posted in recent months about the fate of junior developers, as people are worried that they will be replaced by untiring agents. This group was more sanguine about this, feeling that junior developers will still be needed, if nothing else because they are open-minded about LLMs and familiar with using them. It’s the mid-level developers who face the greatest challenges. They formed their career without LLMs, but haven’t gained the level of experience yet to fully drive them effectively in the way that senior developers do.

LLMs could be helpful to junior developers by providing an always-available mentor, capable of teaching them better programming. Juniors should, of course, have a certain skepticism of their AI mentors, but they should be skeptical of fleshy mentors too. Not all of us are as brilliant as I like to think that I am.

 ❄                ❄

Attendee Margaret-Anne Storey has published a longer post on the problem of cognitive debt.

I saw this dynamic play out vividly in an entrepreneurship course I taught recently. Student teams were building software products over the semester, moving quickly to ship features and meet milestones. But by weeks 7 or 8, one team hit a wall. They could no longer make even simple changes without breaking something unexpected. When I met with them, the team initially blamed technical debt: messy code, poor architecture, hurried implementations. But as we dug deeper, the real problem emerged: no one on the team could explain why certain design decisions had been made or how different parts of the system were supposed to work together. The code might have been messy, but the bigger issue was that the theory of the system, their shared understanding, had fragmented or disappeared entirely. They had accumulated cognitive debt faster than technical debt, and it paralyzed them.

I think this is a worthwhile topic to think about, but as I ponder it, I look at it in a similar way to how I look at Technical Debt. Many people focus on technical debt as the bad stuff that accumulates in a sloppy code base - poor module boundaries, bad naming etc. The term I use for bad stuff like that is cruft; I use the technical debt metaphor as a way to think about how to deal with the costs that the cruft imposes. Either we pay the interest - making each further change to the code base a bit harder, or we pay down the principal - doing explicit restructuring and refactoring to make the code easier to change.

What is this separation of the cruft and the debt metaphor in the cognitive realm? I think the equivalent of cruft is ignorance - both of the code and the domain the code is supporting. The debt metaphor then still applies, either it costs more to add new capabilities, or we have to make an explicit investment to gain knowledge. The debt metaphor reminds us that which we do depends on the relative costs between them. With cognitive issues, those costs apply to both the humans and The Genie.

 ❄                ❄

Many of us have long been advocating for initiatives to improve Developer Experience (DevEx) to improve the effectiveness of software development teams. Laura Tacho commented:

The Venn Diagram of Developer Experience and Agent Experience is a circle

Many of the things we advocate for developers also enable LLMs to work more effectively too. Smooth tooling, clear information about the development environment, helps LLMs figure out how to create code quickly and correctly. While there is a possibility that The Genie’s Galaxy Brain can comprehend a confusing code base, there’s growing evidence that good modularity and descriptive naming are as good for the transformer as they are for more squishy neural networks. This is getting recognized by software development management, leading to efforts to smooth the path for the LLM. But as Laura observed, it’s sad that this implies that the execs won’t make the effort for humans that they are making for the robots.

 ❄                ❄

IDEs still have a future, but need to incorporate LLMs into their working. One way is to use LLMs to support things that cannot be done with deterministic methods, such as generating code from natural language documents. But there are plenty of tasks where you don’t want to use an LLM - they are a horribly inefficient way to rename a function, for example. Another role for LLMs is to help users use IDEs effectively - after all modern IDEs are complex tools, and few users know how to get the most out of them. (As a long-time Emacs user, I sympathize.) An IDE can help the user select when to use an LLM for a task, when to use the deterministic IDE features, and when to choreograph a mix of the two.

Say I have “person” in my domain and I want to change it to “contact”. It appears in function names, field names, documentation, test cases. A simple search-replace isn’t enough. But rather than have the LLM operate on the entire code base, maybe the LLM chooses to use the IDE’s refactoring capabilities on all the places it sees - essentially orchestrating the IDE’s features. An attendee noted that analysis of renames in an IDE indicated that they occur in clusters like this, so it would be a useful capability.
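To make the “search-replace isn’t enough” point concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name `rename_term` and the sample line are hypothetical, and it works on raw text, whereas an IDE’s refactoring works on the parsed program. It handles three common casings and skips look-alike words such as “personal”, but it would still miss plurals and anything that needs real scope analysis, which is the argument for delegating to the IDE.

```python
import re


def rename_term(text: str, old: str, new: str) -> str:
    """Rename a domain term in snake_case, camelCase and UPPER_CASE names.

    Hypothetical helper for illustration only; it is purely textual.
    """
    old = re.escape(old)
    rules = [
        # person_id, person.name: not glued to other letters, so
        # "personal" and "impersonate" are left alone.
        (rf"(?<![A-Za-z]){old.lower()}(?![a-z])", new.lower()),
        # PersonRecord, getPersonName: may follow a lowercase letter.
        (rf"(?<![A-Z]){old.capitalize()}(?![a-z])", new.capitalize()),
        # MAX_PERSON: constant-style names.
        (rf"(?<![A-Za-z]){old.upper()}(?![A-Za-z])", new.upper()),
    ]
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


source = "person_id = lookup(getPersonName, PersonRecord, MAX_PERSON)  # personal note"

# A naive replace misses the other casings and mangles "personal".
naive = source.replace("person", "contact")

# The casing-aware version catches all four names and leaves the comment alone.
careful = rename_term(source, "person", "contact")
```

Even the careful version is guessing from spelling. Having the LLM pick out the places and then drive the IDE’s rename for each one keeps the change deterministic where it can be.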

 ❄                ❄

Will two-pizza teams shrink to one-pizza teams because LLMs don’t eat pizza - or will we have the same size teams that do much more? I’m inclined to the latter, there’s something about the two-pizza team size that effectively balances the benefits of human collaboration with the costs of coordination.

That also raises a question about the shape of pair programming, a question that came up during the panel I had with Gergely Orosz and Kent Beck at The Pragmatic Summit. There seems to be a common notion that the best way to work is to have one programmer driving a few (or many) LLM agents. But I wonder if two humans driving a bunch of agents would be better, combining the benefits of pairing with the greater code-generative ability of The Genies.

 ❄                ❄                ❄                ❄                ❄

Aruna Ranganathan and Xingqi Maggie Ye write in the Harvard Business Review

In an eight-month study of how generative AI changed work habits at a U.S.-based technology company with about 200 employees, we found that employees worked at a faster pace, took on a broader scope of tasks, and extended work into more hours of the day, often without being asked to do so.

While this may sound like a dream come true for leaders, the changes brought about by enthusiastic AI adoption can be unsustainable, causing problems down the line. Once the excitement of experimenting fades, workers can find that their workload has quietly grown and feel stretched from juggling everything that’s suddenly on their plate. That workload creep can in turn lead to cognitive fatigue, burnout, and weakened decision-making. The productivity surge enjoyed at the beginning can give way to lower quality work, turnover, and other problems.

 ❄                ❄                ❄                ❄                ❄

Camille Fournier:

The part of “everyone becomes a manager” in AI that I didn’t really think about until now was the mental fatigue of context switching and keeping many tasks going at once, which of course is one of the hardest parts of being a manager and now you all get to enjoy it too

There’s an increasing feeling that there’s a shift coming to our profession where folks will turn from programmers engaged with the code to supervisory programmers herding a bunch of agents. I do think that supervisory or not, programmers will still be accountable for the code generated under their watch, and it’s an open question whether increasing context-switching will undermine the effectiveness of driving many agents. This would lead to practices that seek to harvest the parallelism of agents while minimizing the context-switching.

Whatever route we go down, I expect a lot of activity in exploring what makes an effective workflow for supervisory programming in the coming months.


February 9

Some more thoughts from last week’s open space gathering on the future of software development in the age of AI. I haven’t attributed any comments since we were operating under the Chatham House Rule, but should the sources recognize themselves and wish to be attributed, then get in touch and I’ll edit this post.

 ❄                ❄

During the opening of the gathering, I commented that I was naturally skeptical of the value of LLMs. After all, the decades have thrown up many tools that have claimed to totally change the nature of software development. Most of these have been little better than snake oil.

But I am a total, absolute skeptic - which means I also have to be skeptical of my own skepticism.

 ❄                ❄

One of our sessions focused on the problem of “cognitive debt”. Usually, as we build a software system, the developers of that system gain an understanding of both the underlying domain and the software they are building to support it. But once so much work is sent off to LLMs, does this mean the team no longer learns as much? And if so, what are the consequences of this? Can we rely on The Genie to keep track of everything, or should we take active measures to ensure the team understands more of what’s being built and why?

The TDD cycle involves a key (and often under-used) step to refactor the code. This is where the developers consolidate their understanding and embed it into the codebase. Do we need some similar step to ensure we understand what the LLMs are up to?

When the LLM writes some complex code, ask it to explain how it works. Maybe get it to do so in a funky way, such as asking it to explain the code’s behavior in the form of a fairy tale.

 ❄                ❄

OH:

LLMs are drug dealers, they give us stuff, but don’t care about the resulting system or the humans that develop and use it.

Who cares about the long-term health of the system when the LLM renews its context with every cycle?

 ❄                ❄

Programmers are wary of LLMs not just because folks are worried for their jobs, but also because we’re scared that LLMs will remove much of the fun from programming. As I think about this, I consider what I enjoy about programming. One aspect is delivering useful features - which I only see improving as LLMs become more capable.

But, for me, programming is more than that. Another aspect I enjoy about programming is model building. I enjoy the process of coming up with abstractions that help me reason about the domain the code is supporting - and I am concerned that LLMs will cause me to spend less attention on this model building. It may be, however, that model-building becomes an important part of working effectively with LLMs, a topic Unmesh Joshi and I explored a couple of months ago.

 ❄                ❄

In the age of LLMs, will there still be such a thing as “source code”, and if so, what will it look like? Prompts, and other forms of natural language context can elicit a lot of behavior, and cause a rise in the level of abstraction, but also a sideways move into non-determinism. In all this is there still a role for a persistent statement of non-deterministic behavior?

Almost a couple of decades ago, I became interested in a class of tools called Language Workbenches. They didn’t have a significant impact on software development, but maybe the rise of LLMs will reintroduce some ideas from them. These tools rely on a semantic model that the tool persists in some kind of storage medium, that isn’t necessarily textual or comprehensible to humans directly. Instead, for humans to understand it, the tools include projectional editors that create human-readable projections of the model.

Could this notion of a non-human deterministic representation become the future source code? One that’s designed to maximize expression with minimal tokens?

 ❄                ❄

OH:

Scala was the first example of a lab-leak in software. A language designed for dangerous experiments in type theory escaped into the general developer population.

 ❄                ❄                ❄                ❄                ❄

elsewhere on the web

Angie Jones on tips for open source maintainers to handle AI contributions

I’ve been seeing more and more open source maintainers throwing up their hands over AI generated pull requests. Going so far as to stop accepting PRs from external contributors.

[snip]

But yo, what are we doing?! Closing the door on contributors isn’t the answer. Open source maintainers don’t want to hear this, but this is the way people code now, and you need to do your part to prepare your repo for AI coding assistants.

 ❄                ❄                ❄                ❄                ❄

Matthias Kainer has written a cool explanation of how transformers work with interactive examples

Last Tuesday my kid came back from school, sat down and asked: “How does ChatGPT actually know what word comes next?” And I thought - great question. Terrible timing, because dinner was almost ready, but great question.

So I tried to explain it. And failed. Not because it is impossibly hard, but because the usual explanations are either “it is just matrix multiplication” (true but useless) or “it uses attention mechanisms” (cool name, zero information). Neither of those helps a 12-year-old. Or, honestly, most adults. Also, even getting to start my explanation was taking longer than a tiktok, so my kid lost attention span before I could even say “matrix multiplication”. I needed something more visual. More interactive. More fun.

So here is the version I wish I had at dinner. With drawings. And things you can click on. Because when everything seems abstract, playing with the actual numbers can bring some light.

A helpful guide for any 12-year-old, or a 62-year-old that fears they’re regressing.

 ❄                ❄                ❄                ❄                ❄

In my last fragments, I included some concerns about how advertising could interplay with chatbots. Anthropic have now made some adverts about concerns about adverts - both funny and creepy. Sam Altman is amused and annoyed.


February 4

I’ve spent a couple of days at a Thoughtworks-organized event in Deer Valley, Utah. It was my favorite kind of event, a really great set of attendees in an Open Space format. These kinds of events are full of ideas, which I do want to share, but I can’t truthfully form them into a coherent narrative for an article about the event. However, this fragment format suits them perfectly, so I’ll post a bunch of fragmentary thoughts from the event, both in this post, and in posts in the next few days.

 ❄                ❄

We talked about the worry that using AI can cause humans to have less understanding of the systems they are creating. In this discussion one person pointed out that one of the values of Pair Programming is that you have to regularly explain things to your pair. This is an important part of learning - for the person doing the explaining. After all one of the best ways to learn something is to try to teach it.

 ❄                ❄

One attendee is an SRE for a Very (Very) Large Code Base. He was less worried about people not understanding the code an LLM writes because he already can’t understand the VVLCB he’s responsible for. What he values is that the LLM helps him understand what the code is doing, and he regularly uses it to navigate to the crucial parts of the code.

There’s a general point here:

Fully trusting the answer an LLM gives you is foolishness, but it’s wise to use an LLM to help navigate the way to the answer.

 ❄                ❄                ❄                ❄                ❄

Elsewhere on the internet, Drew Breunig wonders if software libraries of the future might be only specs and no code. To explore this idea he built a simple library to convert timestamps into phrases like “3 hours ago”. He used the spec to build implementations in seven languages. The spec is a markdown document of 500 lines and a set of tests in 500 lines of YAML.

“What does software engineering look like when coding is free?”

I’ve chewed on this question a bit, but this “software library without code” is a tangible thought experiment that helped firm up a few questions and thoughts.
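As a rough illustration of the “spec plus tests” idea, here is a minimal Python sketch. Only the phrase “3 hours ago” comes from the description above. The other cases, the thresholds, and the names `CASES`, `time_ago` and `failures` are assumptions made up for this sketch, not Breunig’s spec or his YAML format. The point is that the table of cases is the durable artifact, and any implementation in any language is checked against it.

```python
# Hypothetical spec cases: (seconds elapsed, expected phrase).
CASES = [
    (30, "just now"),
    (60, "1 minute ago"),
    (3 * 3600, "3 hours ago"),
    (2 * 86400, "2 days ago"),
]

# Largest unit first, so the first match is the coarsest one that fits.
UNITS = (("day", 86400), ("hour", 3600), ("minute", 60))


def time_ago(seconds: int) -> str:
    """One possible implementation of the assumed cases above."""
    for unit, size in UNITS:
        if seconds >= size:
            count = seconds // size
            plural = "" if count == 1 else "s"
            return f"{count} {unit}{plural} ago"
    return "just now"


def failures(implementation):
    """Return (input, expected, actual) for every case that does not match."""
    return [
        (seconds, expected, implementation(seconds))
        for seconds, expected in CASES
        if implementation(seconds) != expected
    ]
```

An empty list from `failures(time_ago)` means this implementation agrees with the cases, and the same table could be handed to a generated implementation in another language.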

 ❄                ❄                ❄                ❄                ❄

Bruce Schneier on the role advertising may play while chatting with LLMs

Imagine you’re conversing with your AI agent about an upcoming vacation. Did it recommend a particular airline or hotel chain because they really are best for you, or does the company get a kickback for every mention?

Recently I heard an ex-Googler explain that advertising was a gilded cage for Google, and they tried very hard to find another business model. The trouble is that it’s very lucrative but also ties you to the advertisers, who are likely to pull out whenever there is an economic downturn. Furthermore they also gain power to influence content - many controversies over “censorship” start with demands from advertisers.

 ❄                ❄                ❄                ❄                ❄

The news from Minnesota continues to be depressing. The brutality from the masked paramilitaries is getting worse, and their political masters are not just accepting this, but seem eager to let things escalate. Those people with the power to prevent this escalation are either encouraging it, or doing nothing.

One hopeful sign from all this is the actions of the people of Minnesota. They have resisted peacefully so far, their principal weapons being blowing whistles and filming videos. They demonstrate the neighborliness and support of freedom and law that made America great. I can only hope their spirit inspires others to turn away from the path that we’re currently on. I enjoyed this portrayal of them from Adam Serwer (gift link)

In Minnesota, all of the ideological cornerstones of MAGA have been proved false at once. Minnesotans, not the armed thugs of ICE and the Border Patrol, are brave. Minnesotans have shown that their community is socially cohesive—because of its diversity and not in spite of it. Minnesotans have found and loved one another in a world atomized by social media, where empty men have tried to fill their lonely soul with lies about their own inherent superiority. Minnesotans have preserved everything worthwhile about “Western civilization,” while armed brutes try to tear it down by force.


January 22

My colleagues here at Thoughtworks have announced AI/works™, a platform for our work using AI-enabled software development. The platform is in its early days, and is currently intended to support Thoughtworks consultants in their client work. I’m looking forward to sharing what we learn from using and further developing the platform in future months.

 ❄                ❄                ❄                ❄                ❄

Simon Couch examines the electricity consumption of using AI. He’s a heavy user: “usually programming for a few hours, and driving 2 or 3 Claude Code instances at a time”. He finds his usage of electricity is orders of magnitude more than typical estimates based on the “typical query”.

On a median day, I estimate I consume 1,300 Wh through Claude Code—4,400 “typical queries” worth.

But it’s still not a massive amount of power - similar to that of running a dishwasher.

A caveat to this is that this is “napkin math” because we don’t have decent data about how these models use resources. I agree with him that we ought to.
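The two numbers in that quote imply a per-query figure, which a couple of lines of Python make explicit. This is the same napkin math, using only the quoted figures. The variable names are mine, and nothing here is a measurement.

```python
daily_wh = 1300          # quoted median daily consumption, in watt-hours
typical_queries = 4400   # quoted equivalent number of "typical queries"

wh_per_query = daily_wh / typical_queries  # roughly 0.3 Wh per typical query
daily_kwh = daily_wh / 1000                # 1.3 kWh per day
```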

 ❄                ❄                ❄                ❄                ❄

My namesake Chad Fowler (no relation) considers that the movement to agentic coding creates a similar shift in rigor and discipline as appeared in Extreme Programming, dynamic languages, and continuous deployment.

In Extreme Programming’s case, this meant a lot of discipline around testing, continuous integration, and keeping the code-base healthy. My current view is that with AI-enabled development we need to be rigorous about evaluating the software, both for its observable behavior and its internal quality.

The engineers who thrive in this environment will be the ones who relocate discipline rather than abandon it. They’ll treat generation as a capability that demands more precision in specification, not less. They’ll build evaluation systems that are harder to fool than the ones they replaced. They’ll refuse the temptation to mistake velocity for progress.

 ❄                ❄                ❄                ❄                ❄

There’s been much written about the dreadful events in Minnesota, and I’ve not felt I’ve had anything useful to add to them. But I do want to pass on an excellent post from Noah Smith that captures many of my thoughts. He points out that there is a “consistent record of brutality, aggression, dubious legality, and unprofessionalism” from ICE (and CBP) who seem to be turning into MAGA’s SD.

Is this America now? A country where unaccountable and poorly trained government agents go door to door, arresting and beating people on pure suspicion, and shooting people who don’t obey their every order or who try to get away? “When a federal officer gives you instructions, you abide by them and then you get to keep your life” is a perfect description of an authoritarian police state. None of this is Constitutional, every bit of it is deeply antithetical to the American values we grew up taking for granted.

My worries about these kinds of developments were what animated me to urge against voting for Trump in the 2016 election. Mostly those worries didn’t come to fruition because enough constitutional Republicans were in a position to stop them from happening, so even when Trump attempted a coup in 2020, he wasn’t able to get very far. But now those constitutional Republicans are absent or quiescent. I fear that what we’ve seen in Minneapolis will be a harbinger of worse to come.

I also second John Gruber’s praise of bystander Caitlin Callenson:

But then, after the murderous agent fired three shots — just 30 or 40 feet in front of Callenson — Callenson had the courage and conviction to stay with the scene and keep filming. Not to run away, but instead to follow the scene. To keep filming. To continue documenting with as best clarity as she could, what was unfolding.

The recent activity in Venezuela reminds me that I’ve long felt that Trump is a Hugo Chávez figure - a charismatic populist who’s keen on wrecking institutions and norms. Trump is old, so won’t be with us for that much longer - but the question is: “who is Trump’s Maduro?”

 ❄                ❄                ❄                ❄                ❄

With all the drama at home, we shouldn’t ignore the terrible things that happened in Iran. The people there have again suffered the consequences of an entrenched authoritarian police state.


January 8

Anthropic report on how their AI is changing their own software development practice.

 ❄                ❄                ❄                ❄                ❄

Much of the discussion about using LLMs for software development lacks details on workflow. Rather than just hear people gush about how wonderful it is, I want to understand the gritty details. What kinds of interactions occur with the LLM? What decisions do the humans make? When reviewing LLM outputs, what kinds of things are the humans looking for, what corrections do they make?

Obie Fernandez has written a post that goes into these kinds of details. Over the Christmas / New Year period he used Claude to build a knowledge distillation application that takes transcripts from Claude Code sessions, Slack discussions, GitHub PR threads, etc., turns them into an RDF graph database, and provides a web app with natural language ways to query them.

Not a proof of concept. Not a demo. The first cut of Nexus, a production-ready system with authentication, semantic search, an MCP server for agent access, webhook integrations for our primary SaaS platforms, comprehensive test coverage, deployed, integrated and ready for full-scale adoption at my company this coming Monday. Nearly 13,000 lines of code.

The article is long, but worth the time to read it.

An important feature of his workflow is relying on Test-Driven Development:

Here’s what made this sustainable rather than chaotic: TDD. Test-driven development. For most of the features, I insisted that Claude Code follow the red-green-refactor cycle with me. Write a failing test first. Make it pass with the simplest implementation. Then refactor while keeping tests green.

This wasn’t just methodology purism. TDD served a critical function in AI-assisted development: it kept me in the loop. When you’re directing thousands of lines of code generation, you need a forcing function that makes you actually understand what’s being built. Tests are that forcing function. You can’t write a meaningful test for something you don’t understand. And you can’t verify that a test correctly captures intent without understanding the intent yourself.
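To make the red-green-refactor cycle concrete, here is a minimal sketch in Python. It is not taken from Obie's post: the `slugify` function and its tests are hypothetical, chosen only to show the order of work. The tests are written first and fail (red), then the simplest implementation makes them pass (green), and any later refactoring has to keep them passing.

```python
import unittest


# Step 1 (red): these tests are written before slugify exists, so the
# first run fails with a NameError. (slugify is a hypothetical example
# function, not something from the Nexus code-base.)
class SlugifyTest(unittest.TestCase):
    def test_lowercases_and_joins_words_with_hyphens(self):
        self.assertEqual(slugify("Hello Wide World"), "hello-wide-world")

    def test_collapses_repeated_whitespace(self):
        self.assertEqual(slugify("  Hello   World "), "hello-world")


# Step 2 (green): the simplest implementation that makes both tests pass.
def slugify(title: str) -> str:
    # str.split() with no arguments drops leading, trailing, and
    # repeated whitespace, so the second test passes with no extra code.
    return "-".join(title.lower().split())


# Step 3 (refactor): restructure freely, re-running the tests after
# each change to confirm they stay green.
```

Saved to a file, this can be run with `python -m unittest <filename>`. The point of the ordering is the one made in the quote: writing the test first forces whoever is directing the work to state what the code is supposed to do before any of it is generated.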

The account includes a major refactoring, and much evolution of the initial version of the tool. It’s also an interesting glimpse of how AI tooling may finally make RDF useful.

 ❄                ❄                ❄                ❄                ❄

When thinking about requirements for software, most discussions focus on prioritization. Some folks talk about buckets such as the MoSCoW set: Must, Should, Could, and Want. (The old joke being that, in MoSCoW, the cow is silent, because hardly any requirements end up in those buckets.) Jason Fried has a different set of buckets for interface design: Obvious, Easy, and Possible. This immediately resonates with me: a good way of thinking about how to allocate the cognitive costs for those who use a tool.

 ❄                ❄                ❄                ❄                ❄

Casey Newton explains how he followed up on an interesting story of dark patterns in food delivery, and found it to be a fake story, buttressed by AI image and document creation. On one hand, it clarifies the important role reporters play in exposing lies that get traction on the internet. But the time taken to do this is time not spent on investigating real stories.

For most of my career up until this point, the document shared with me by the whistleblower would have seemed highly credible in large part because it would have taken so long to put together. Who would take the time to put together a detailed, 18-page technical document about market dynamics just to troll a reporter? Who would go to the trouble of creating a fake badge?

Today, though, the report can be generated within minutes, and the badge within seconds. And while no good reporter would ever have published a story based on a single document and an unknown source, plenty would take the time to investigate the document’s contents and see whether human sources would back it up.

The internet has always been full of slop, and we have always needed to be wary of what we read there. AI now makes it easy to manufacture convincing looking evidence, and this is never more dangerous than when it confirms strongly held beliefs and fears.

 ❄                ❄                ❄                ❄                ❄

Kent Beck:

The descriptions of Spec-Driven development that I have seen emphasize writing the whole specification before implementation. This encodes the (to me bizarre) assumption that you aren’t going to learn anything during implementation that would change the specification. I’ve heard this story so many times told so many ways by well-meaning folks–if only we could get the specification “right”, the rest of this would be easy.

As for him, that story has been the constant background siren to my career in tech. But the learning loop of experimentation is essential to the model building that’s at the heart of any kind of worthwhile specification. As Unmesh puts it:

Large Language Models give us great leverage—but they only work if we focus on learning and understanding. They make it easier to explore ideas, to set things up, to translate intent into code across many specialized languages. But the real capability—our ability to respond to change—comes not from how fast we can produce code, but from how deeply we understand the system we are shaping.

When Kent defined Extreme Programming, he made feedback one of its four core values. It strikes me that the key to making full use of AI in software development is using it to accelerate the feedback loops.

 ❄                ❄                ❄                ❄                ❄

As I listen to people who are serious about AI-assisted programming, the crucial thing I hear is managing context. Programming-oriented tools are getting more sophisticated for that, but there are also efforts at providing simpler tools that allow customization. Carlos Villela recently recommended Pi, and its developer, Mario Zechner, has an interesting blog on its development.

So what’s an old guy yelling at Claudes going to do? He’s going to write his own coding agent harness and give it a name that’s entirely un-Google-able, so there will never be any users. Which means there will also never be any issues on the GitHub issue tracker. How hard can it be?

If I ever get the time to sit and really play with these tools, then something like Pi would be something I’d like to try out. Although as an addict to The One True Editor, I’m interested in some of the libraries that work with that, such as gptel. That would enable me to use Emacs’s inherent programmability to create my own command set to drive the interaction with LLMs.

 ❄                ❄                ❄                ❄                ❄

Outside of my professional work, I’ve been posting regularly about my boardgaming on the specialist site BoardGameGeek. However, its blogging environment doesn’t do a good job of providing an index to my posts, so I’ve created a list of my BGG posts on my own site. If you’re interested in my regular posts on boardgaming, and you’re on BGG, you can subscribe to me there. If you’re not on BGG you can subscribe to the blog’s RSS feed.

I’ve also created a list of my favorite board games.


December 16

Gitanjali Venkatraman does wonderful illustrations of complex subjects (which is why I was so happy to work with her on our Expert Generalists article). She has now published the latest in her series of illustrated guides: tackling the complex topic of Mainframe Modernization.

In it she illustrates the history and value of mainframes, why modernization is so tricky, and how to tackle the problem by breaking it down into tractable pieces. I love the clarity of her explanations, and smile frequently at her way of enhancing her words with her quirky pictures.

 ❄                ❄                ❄                ❄                ❄

Gergely Orosz on social media

Unpopular opinion:

Current code review tools just don’t make much sense for AI-generated code

When reviewing code I really want to know:

  • The prompt made by the dev
  • What corrections the other dev made to the code
  • Clear marking of code AI-generated not changed by a human

Some people pushed back saying they don’t (and shouldn’t) care whether it was written by a human, generated by an LLM, or copy-pasted from Stack Overflow.

In my view it matters a lot - because of the second vital purpose of code review.

When asked why we do code reviews, most people will answer with the first vital purpose - quality control. We want to ensure bad code gets blocked before it hits mainline. We do this to avoid bugs and to avoid other quality issues, in particular comprehensibility and ease of change.

But I hear the second vital purpose less often: code review is a mechanism to communicate and educate. If I’m submitting some sub-standard code, and it gets rejected, I want to know why so that I can improve my programming. Maybe I’m unaware of some library features, or maybe there are some project-specific standards I haven’t run into yet, or maybe my naming isn’t as clear as I thought it was. Whatever the reasons, I need to know in order to learn. And my employer needs me to learn, so I can be more effective.

We need to know the writer of the code we review, both so we can communicate our better practice to them and so we know how to improve things. With a human, it’s a conversation, and perhaps some documentation if we realize we’ve needed to explain things repeatedly. But with an LLM it’s about how to modify its context, as well as humans learning how to better drive the LLM.

 ❄                ❄                ❄                ❄                ❄

Wondering why I’ve been making a lot of posts like this recently? I explain why I’ve been reviving the link blog.

 ❄                ❄                ❄                ❄                ❄

Simon Willison describes how he uses LLMs to build disposable but useful web apps

These are the characteristics I have found to be most productive in building tools of this nature:

  1. A single file: inline JavaScript and CSS in a single HTML file means the least hassle in hosting or distributing them, and crucially means you can copy and paste them out of an LLM response.
  2. Avoid React, or anything with a build step. The problem with React is that JSX requires a build step, which makes everything massively less convenient. I prompt “no react” and skip that whole rabbit hole entirely.
  3. Load dependencies from a CDN. The fewer dependencies the better, but if there’s a well known library that helps solve a problem I’m happy to load it from CDNjs or jsdelivr or similar.
  4. Keep them small. A few hundred lines means the maintainability of the code doesn’t matter too much: any good LLM can read them and understand what they’re doing, and rewriting them from scratch with help from an LLM takes just a few minutes.

His repository includes all these tools, together with transcripts of the chats that got the LLMs to build them.

 ❄                ❄                ❄                ❄                ❄

Obie Fernandez: while many engineers are underwhelmed by AI tools, some senior engineers are finding them really valuable. He feels that senior engineers have an oft-unspoken mindset, which, in conjunction with an LLM, enables the LLM to be much more valuable.

Levels of abstraction and generalization problems get talked about a lot because they’re easy to name. But they’re far from the whole story.

Other tools show up just as often in real work:

  • A sense for blast radius. Knowing which changes are safe to make loudly and which should be quiet and contained.
  • A feel for sequencing. Knowing when a technically correct change is still wrong because the system or the team isn’t ready for it yet.
  • An instinct for reversibility. Preferring moves that keep options open, even if they look less elegant in the moment.
  • An awareness of social cost. Recognizing when a clever solution will confuse more people than it helps.
  • An allergy to false confidence. Spotting places where tests are green but the model is wrong.

 ❄                ❄                ❄                ❄                ❄

Emil Stenström built an HTML5 parser in Python using coding agents, using GitHub Copilot in Agent mode with Claude Sonnet 3.7. He automatically approved most commands. It took him “a couple of months on off-hours”, including at least one restart from scratch. The parser now passes all the tests in the html5lib test suite.

After writing the parser, I still don’t know HTML5 properly. The agent wrote it for me. I guided it when it came to API design and corrected bad decisions at the high level, but it did ALL of the gruntwork and wrote all of the code.

I handled all git commits myself, reviewing code as it went in. I didn’t understand all the algorithmic choices, but I understood when it didn’t do the right thing.

Although he gives an overview of what happens, there’s not very much information on his workflow and how he interacted with the LLM. There’s certainly not enough detail here to try to replicate his approach. This is in contrast to Simon Willison (above) who has detailed links to his chat transcripts - although they are much smaller tools and I haven’t looked at them properly to see how useful they are.

One thing that is clear, however, is the vital need for a comprehensive test suite. Much of his work is driven by having that suite as a clear guide for him and the LLM agents.

JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn’t have written it this quickly without the agent.

But “quickly” doesn’t mean “without thinking.” I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking.

                                  ❄                                  ❄    

Then Simon Willison ported the library to JavaScript:

Time elapsed from project idea to finished library: about 4 hours, during which I also bought and decorated a Christmas tree with family and watched the latest Knives Out movie.

One of his lessons:

If you can reduce a problem to a robust test suite you can set a coding agent loop loose on it with a high degree of confidence that it will eventually succeed. I called this designing the agentic loop a few months ago. I think it’s the key skill to unlocking the potential of LLMs for complex tasks.

Our experience at Thoughtworks backs this up. We’ve been doing a fair bit of work recently in legacy modernization (mainframe and otherwise) using AI to migrate substantial software systems. Having a robust test suite is necessary (but not sufficient) to making this work. I hope to share my colleagues’ experiences on this in the coming months.
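The shape of the agentic loop in Willison's lesson can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular coding agent works: `agentic_loop`, `run_tests`, and `ask_agent` are hypothetical names, and the agent here is a stand-in function rather than a real LLM call.

```python
from collections.abc import Callable


def agentic_loop(
    run_tests: Callable[[], list[str]],
    ask_agent: Callable[[list[str]], None],
    max_attempts: int = 20,
) -> bool:
    """Run the test suite, hand failures to the agent, and repeat.

    run_tests returns the names of the failing tests (empty when green).
    ask_agent is a hypothetical hook that asks a coding agent to change
    the code in response to those failures.
    Returns True if the suite ends up green within the attempt budget.
    """
    for _ in range(max_attempts):
        failures = run_tests()
        if not failures:
            return True  # suite is green, stop
        ask_agent(failures)
    # Budget used up: report whether the last change got us to green.
    return not run_tests()


# Toy stand-ins so the sketch runs without a real agent or test suite:
# each call to the fake agent "fixes" exactly one failing test.
remaining = ["test_tokenizer", "test_tree_builder", "test_serializer"]


def fake_run_tests() -> list[str]:
    return list(remaining)


def fake_agent(failures: list[str]) -> None:
    remaining.remove(failures[0])


succeeded = agentic_loop(fake_run_tests, fake_agent, max_attempts=5)
```

The test suite is the only judge of success in this loop, so the loop can be no better than the suite. A suite with gaps lets the loop finish "green" with the wrong behavior, which is why a robust suite is necessary but not sufficient.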

But before I leave Willison’s post, I should highlight his final open questions on the legalities, ethics, and effectiveness of all this - they are well worth contemplating.