Fragments

As I come across interesting thoughts (mostly) on the web, I like to share them. I do this by posting “fragments”. This page collects all my fragments so far.

June 16

“Prag Dave” Thomas (co-author of the outstanding “Pragmatic Programmer”) has loved programming since he was young.

Programming was how I could express myself. I wasn’t an artist. When I sing, dogs howl. When I draw, friends say, “Very nice. What is it?” I didn’t connect particularly well with people, even though I wanted to. And yet, when I wrote my first program, I discovered a medium which let me convert thought into action. All the ideas that were bottled up behind a wall of frustration suddenly had an outlet.

The LLM revolution worried him. Would they remove all that fun stuff? Happily he found it was the opposite. Like Kent Beck and others have told me, programming with LLMs is more fun than ever. His post lists reasons why: removing drudgery, speeding up feedback loops, reviving long abandoned projects, and exploring new technologies.

❄ ❄ ❄ ❄ ❄

I’ve spent a few days at the DDD Europe conference, which was a very enjoyable event. With all the changes to programming due to LLMs, I suspect Domain-Driven Design is going to be one of those things that will continue to be useful, indeed may become even more important. The highlight of the conference was the opening keynote by Eric Evans, who gave a fascinating description of some of his experimentation with LLMs over the last couple of years. Once the video for that talk becomes available, I’ll link to it - hopefully it won’t be too long. I also particularly enjoyed talks from Violetta Pidvolotska, Kiran Prakash, Tom de Wolf, and Chelsea Troy. Gien Verschatse interviewed Eric Evans and me for an hour so so - again I’ll pass that link on when the video gets published.

One snippet that stood out to me was from Chelsea Troy. The main thrust of her talk was managing the context window of LLMs so that it was kept in a healthy state. Much of what she said was familiar, but one thing I hadn’t thought about was her thoughts about the different registers of conversations with LLMs. These registers are different styles of conversation we can have with The Genie (or indeed other humans). She classified them in four ways:

Exploring: I want to understand before touching anything
Brainstorming: Generate options, I’ll evaluate them separately
Deciding: I need a recommendation with a rationale, not a list
Implementing: The decision is made, help me build it

Her point is that whenever I have a session with an LLM, I need to be conscious about which register I’m using. And should I change register, I should start a new conversation with a fresh context.

❄ ❄ ❄ ❄ ❄

Charity Majors raises an alarm about the crevasse forming between AI enthusiasts and AI skeptics.

The enthusiasts are not wrong. We are starting to see real, non-imaginary, discontinuous leaps in capabilities from teams that lean in hard to working with AI. And this does not feel like a normal technology cycle where you can wait for the dust to settle; teams that sit this out while competitors are hustling could be out of business before the dust settles. That’s a real, existential threat.

The skeptics are also not wrong. When you ship code faster than engineers can read it, in domains where nobody has full context, you are making withdrawals from a trust account that took years to build. Reliability degrades, institutional knowledge evaporates. You end up with systems nobody understands, products burbling into incoherence, and on-call rotations that grind people up and spit them out. That is ALSO a real existential threat.

The heart of the problem is that there is no common feedback connecting these two groups. She posts some concrete ways forward. Enthusiasts tend to talk most publicly, and focus on the wins, but they need to tell the whole story, highlighting the costs too. We all need to look at working with AI as an engineering problem. Don’t just assume that we can or can’t ship code to production without review: instead ask what would make it possible to reduce review effort. She asserts that engineering discipline is becoming more important, echoing others who’ve observed that AI is an amplifier of our current practices. She wants us to avoid getting into arguments with maximalist stances and speculation, but stick to the reality we observe. Authority comes from engagement:

The engineers who shape how AI gets used will be the ones with credibility: they understand the opportunity, the stakes, and the tradeoffs, and they own enough of the consequences to have standing when they push back. Earning that position takes work, but it is work worth doing.

❄ ❄ ❄ ❄ ❄

Simon Willison noticed that Anthropic and Open AI have increased their enterprise pricing:

Why these sudden aggressive moves on pricing? Both Anthropic and OpenAI are planning to IPO, but I suspect there’s a more important factor here: I think they’ve finally found product-market fit, with the coding/general-purpose agent products embodied by Claude Code/Cowork and Codex.

His view is that since the November Inflection there’s increasing evidence that LLMs are making a big difference to programming, and this is translating to a viable business model for these companies. He suspects there’s another inflection point that happened in April:

I think April 2026 is a new inflection point where the revenue implications of this have started to land, to the benefit of the frontier AI labs and with material impacts on the budgets of large companies.

❄ ❄ ❄ ❄ ❄

Mike Masnick looks back at the internet’s failed promise of decentralization.

The early-internet conventional wisdom was that the network would kill the middleman. Yet the promised era of “disintermediation” never quite happened. What actually happened was that the middlemen changed character.

The pre-internet middlemen—record labels, book publishers, movie studios, magazine editors—were gatekeepers. Their entire job was rejecting 99% of what came through the door so that a curated few could reach an audience. The internet middlemen who replaced them did the opposite. Their approach depended on letting nearly everything through. They became enablers rather than gatekeepers, and an entire generation of cultural production—music, video, writing, podcasts—flowed through them in volumes the old gatekeepers couldn’t have imagined.

But the new middlemen still needed a business model.

That business model came from realizing that when content is bountiful, the new scarce resource is attention. The enablers built advertising machines to sell attention, and focused attention with algorithmic curation. Seeking to increase profits, the enablers rode down the road of enshittification:

The companies that embraced centralized control over the user experience did so initially because they were, in fact, making things better for their users. Information overload was a real issue. Having a better system for managing it was a good thing. It’s why so many people flocked to these centralized social media platforms and became so enraptured by their algorithms.

The problem of centralized systems is that they create an irresistible temptation to control and exploit. Users who found value early on feel stuck: they can leave, but doing so means abandoning their community. That lack of easy exit creates lock-in, and lock-in enables enshittification.

Masnick’s view is that we have to fight this process of digital despotism, realizing that decentralization, despite it’s ergonomic problems, is worth striving for as it’s the key to cognitive liberty. The key to achieving this is the user’s control over their data with the ability to easily move from one service to another. Ease of exit promotes competition, and competition is what pushes back against the ills of despotic centralization.

Truly decentralized tools push power to the ends. Users control their data and choose which intermediaries operate on it—and that arrangement is a poison pill to both enshittification and despotification. Any move in those directions only pushes users across the deliberately low barrier to exit, taking their content, data, and community with them.

The lessons of the past are important to guide our AI-enabled future. We need to create an environment that encourages decentralization and competition.

June 2

Greg Wilson has noticed that lots of folks are using dodgy metrics to figure out if AI tools are worth their costs.

Would you measure lines of code generated, or tickets closed? Or would you send out a survey asking whether developers feel more productive? Each of those approaches is flawed in a different way;

He lists lots of common metrics, and why they are flawed. Sadly he doesn’t give any suggestions on what would be better. In my view, since we cannot measure productivity, any metrics are weak evidence at the best of times.

I do somewhat use one of his flawed measures: “Asking Developers If They Feel More Productive”. While I acknowledge the problems he gives with this measure, I find that in an environment where decent measures are hard to find, even such a dim light is the best we have. In this situation these kinds of qualitative metrics may not be conclusive, but they are useful.

❄ ❄ ❄ ❄ ❄

Benedict Evans observes that extensive automation didn’t mean the demise of professions in the past.

we spent a century automating accounting: we built calculating machines, punch cards, mainframes, data processing, databases, PCs, spreadsheets, ERPs, cloud… in fact, we built half of the tech industry around automating this. Yet the number of accountants kept going up.

He goes into the myriad of problems that exist when we’re trying to forecast the impact of a technology on jobs. There’s the much-talked-about Jevons paradox - once something becomes cheaper, people do it more, which can increase demand. Often this leads to the nature of jobs changing, even if it’s called the same thing.

Accountants today aren’t doing exactly the same work that they did in 1970 or 1980 ‘but more’ - they’re still called ‘accountants’ but the job is different. New technology often starts out being used for ‘the old thing but more’, but it rarely ends up like that.

Technologies often affect whole businesses - consider the impact of the internet on news publishing. Did anyone observing the rise of smart phones in the early 2000s realize that a consequence of this would change the economics of taxis due to the rise of ride-sharing apps? The conclusion is that it is, at the very least, almost impossible to forecast the impact of AI on our work.

❄ ❄ ❄ ❄ ❄

Stephen O’Grady looks at how closed and open models have performed on benchmarks over time.

Closed models are setting the pace of innovation, and constantly breaking new ground from a capabilities standpoint. Open models are chasing them, and the cycle times seem to be getting shorter. There are no clear capability moats, and what is frontier today is table stakes tomorrow.

It tooks 13-18 months for open models to catch up to GPT-4 on these benchmarks, but only 2-7 months to catch up to GPT-4o.

There’s a bunch of caveats to this analysis, that he lists, but it’s a worthwhile survey of how various kinds of models perform against the various measures we are trying to assess them with.

❄ ❄ ❄ ❄ ❄

One of the starkest examples of sloppy AI use is hallucinated citations - a give-away of both usage of LLMs and carelessness driving them. GPTZero is a company that makes tools to detect AI writing. I’ve no insight as to whether their tool is effective or not, but they do publish investigations of AI usage, and have published several articles highlighting hallucinated citations. One post focuses on Ernst & Young Canada’s report on cyber threats to loyalty systems and found that more than half its references were hallucinations. The post uses a lot of extremely annoying animations in how it presents its information (breaking Safari’s reader mode in the process). But the harm that these kind of AI generated reports can do goes further than just some misled humans:

Publishing a report online is essentially a form of data injection into the pool of knowledge that is the internet. When the report includes fake information (either vibed citations or false claims) it can “poison the well” by misleading future researchers, especially if the report is published by a well-known consulting firm and hosted on a high-traffic website.

❄ ❄ ❄ ❄ ❄

As LLMs get more capable in programming, we are rightly worried that people will use them attack software systems. But these models can also be used for defense, allowing teams to find bugs before attackers do. Some folks from Mozilla posted an article on how they’ve used AI model to identify and fix an unprecedented number of latent security bugs in Firefox.

Just a few months ago, AI-generated security bug reports to open source projects were mostly known for being unwanted slop. Dealing with reports that look plausibly correct but are wrong imposes an asymmetric cost on project maintainers: it’s cheap and easy to prompt an LLM to find a “problem” in code, but slow and expensive to respond to it.

It is difficult to overstate how much this dynamic changed for us over a few short months. This was due to a combination of two main factors. First, the models got a lot more capable. Second, we dramatically improved our techniques for harnessing these models — steering them, scaling them, and stacking them to generate large amounts of signal and filter out the noise.

During 2025, there were 17-31 security bugs fixed each month. In April 2026, they fixed 423.

❄ ❄ ❄ ❄ ❄

Pavel Voronin riffs on Unmesh Joshi’s post on What is Code. He observes that cruft in a codebase (technical debt) has always added friction to software development. But the consequences of this cruft are compounded when LLMs are using existing code as context for future work.

In a degraded codebase, the model does not see “technical debt” as debt. It sees examples. It sees precedent. It sees a style to continue.

LLMs multiply what’s currently happening. I hear reports that good code might take the place of much of what’s put in markdown, because LLMs will imitate what’s already in the code base. But bad code multiplies too. Inevitably he introduces another variation of rampant debt metaphors:

Cognitive debt accumulates when a team uses abstractions it no longer understands. Generative debt accumulates when a codebase contains confused concepts that models are likely to continue.

Cognitive debt is about what the team no longer understands. Generative debt is about what the model is now likely to reproduce.

❄ ❄ ❄ ❄ ❄

Jason Koebler, from the very worthwhile 404 media, has written a plaintive essay on how AI-generated slop is driving us crazy. Not just because its filling the web with this slop, but also because how it’s making us humans react to slop and the threat of slop. We review our own writing and notice: it’s not just reading AI slop that hurts us, it’s the risk that we write something that looks like AI slop. If I use phrasing that AI copied from me, does it seem like I’m copying AI? This has led to the appearance of “humanizers” - AI tools that make our writing look less like AI.

Humanizers add typos, randomly replaces words, removes “AI tells,” and sometimes inserts random characters.

It’s another step on the way to the Zombie internet:

I called it the Zombie Internet because the truth is that large parts of the internet are not just bots talking to bots or bots talking to people. It’s people talking to bots, people talking to people, people creating “AI agents” and then instructing them to interact with people. […] It’s my email inbox, in which I used to occasionally get poorly-formatted, poorly written, extremely long emails from delusional people who were positive the CIA had imprisoned them in a virtual torture chamber using undisclosed secret technology but where I now get well-formatted, passably written, extremely long emails from delusional people who are positive they have proven AI sentience and have the AI transcripts to prove it.

❄ ❄ ❄ ❄ ❄

Andy Osmani points out that spawning lots of agents is like launching a bunch of parallel processes that all rely on a single orchestrating thread - yourself.

Python has the Global Interpreter Lock (GIL). You can spawn as many threads as you want but only one executes python bytecode at a time because they must acquire the lock. You are the GIL of your AI agents. They all can run at once. But when any of their work needs genuine understanding of the architecture or resolving merge conflicts, that work has to acquire the lock. There is one lock. You hold it.

This means you must design the workflow with the agents with that GIL in mind. You shouldn’t launch more agents than you can properly review. It’s handy to separate background tasks that can be offloaded to an agent from complex tasks that require applied attention. Don’t use that precious brain for things that the machine can verify itself. [And I’d add - do get the machine to build tools that ease human verification. For example, it’s better to surface test case data in tables rather than buried in assert statements.]

Spawning agents is not the skill. Anyone can run 20.

The real skill is designing the system around the one serial resource that cannot be cloned or parallelized. That resource is your attention.

❄ ❄ ❄ ❄ ❄

Jamie Hurst is a Principal Engineer at booking.com, where he works in developer experience with a focus on AI tooling. He’s written realistically about the gains and losses of using LLMs in this work.

The cost of building has collapsed, but the cost of aligning organisationally has not. If anything, it’s gone up. When three different teams can each produce a working solution to the same problem in the time it used to take to write a proposal, the bottleneck moves from engineering to coordination.

He thinks he’s able to do more as a senior engineer, but is concerned about how sustainable it is, both for him personally and for the organization he works for. He’s able to shape directions for multiple workstreams at once, in a way that he couldn’t three years ago. But one loss is that he doesn’t have enough time for mentoring, which will exact a toll on his employer in the longer term. He also finds he doesn’t have enough time to think.

The productivity gains from AI got captured by output volume rather than output quality. The org’s expectations rose to absorb the speed-up, and the slack that used to exist between tasks, the unstructured time where strategic thinking actually happens, got eaten first because it’s invisible on a dashboard. I’m at a point in my career where thinking is supposed to be most of the job, and most of it now happens on holiday because the working week doesn’t accommodate it.

May 27

At the GOTO Conference in Copenhagen in 2025, Kent Beck and I spent some time on stage talking and answering questions from the audience - a format I refer to as “two old geezers on a park bench”. We talk about our experiences with LLM-augmented programming (at that point - October 2025), we show our frustration that things we’ve been saying for thirty years still need to be said, we say how anything like a manifesto reunion needs to be led by a younger generation, and opine on what junior developers should be focusing on in their career.

❄ ❄ ❄ ❄ ❄

Ian Johnson has written a series of posts about restructuring a gnarly codebase

The story follows a real Laravel + React codebase over ~3 months and ~258 commits from a legacy monolith with no tests to a well-structured application with automated quality gates, a React SPA migration in progress, and an AI agent that reliably ships production code with minimal supervision.

The series covers the steps in decent detail, and his approach follows the kinds of steps I’d use. First get everything under the control of decent characterization tests, add static analysis, introduce the right patterns to make things flow easily.

With all of this, is his use of AI, which changed during the exercise:

For the first two months of this project, I used Claude Code with auto-approve turned off. Every file edit, every terminal command, every change… I reviewed it before it executed. […] The results were good. The code was clean. But I was doing most of the thinking and half the typing. The agent was a fancy autocomplete with better suggestions. I wasn’t getting the leverage I’d hoped for.

I read an article about “on-the-loop” versus “in-the-loop” human-AI collaboration. The framing clicked immediately […] I was micromanaging because I didn’t trust the agent to do the right thing. And I didn’t trust the agent because there was nothing forcing it to do the right thing.

His early steps put in tests, static analysis, and the right architectural patterns. With those in place, he could let the agent do more work.

My role shifted from writer to curator. I don’t write most of the code anymore. I Define the patterns […] Review the test specs […] Review the output […] Update the harness […] Make strategic decisions […]

He finishes the series with conclusions about how he’d generalize his experience to other circumstances.

❄ ❄ ❄ ❄ ❄

Back in the land of my birth, there was some notable groans when the National Health Service decided to close nearly all of their Open Source repositories, supposedly to the security threat of LLMs. Closing repos like this isn’t an effective counter to LLM-augmented attackers. I suspect it’s no coincidence to see GDS (Government Data Services), the highly-regarded IT enablers in the UK government publish their position

Moving code from public to private as a substitute for investment in secure-by-design delivery, ownership and remediation is a warning sign because it reduces sharing and scrutiny, can slow coordinated improvement across government and suppliers, and does not remove the underlying weaknesses in a running service.

Terence Eden memorably sums up his view on this:

Within the UK’s Civil Service you occasionally hear the expression “being invited to a meeting without biscuits”. It implies a rather frosty discussion without any of the polite niceties of a normal meeting.

❄ ❄ ❄ ❄ ❄

I’ve seen a few cases where those developers who are most involved in working with LLMs find they are running into a problem with cognitive endurance, Adam Tornhill has joined this group:

One of the big wins with agents is that they let us stay with the higher-level problem for longer. We get less sidetracked by details, dependency cleanup, and similar secondary tasks that used to break concentration.

But there is a cost we are still underestimating. Agentic coding is mentally expensive.

I can usually sustain the pace for a couple of hours. Then I need a break. The pace is simply too intense. And based on conversations with other engineers, I do not think I am alone in that.

He explains that working with The Genie means we are making more decisions in less time, this increase in decision density is hard on the brain.

He responds by keeping agent tasks small, automating everything he can, and accepting that he won’t know every line of code as long as he has good verification mechanisms in place.

Notably, he has not gone in the direction of doing his work with swarms of agents that he coordinates. Instead has one long-running task that he babysits and one focus task

That last point is important given the running-twenty-agents-in-parallel hype. I cannot even think about twenty meaningful things to build, and even less so about the resulting cognitive tax of the likely interruptions. It’s exactly the wrong thing to even consider. At least for humans. (And yes, I understand sub-agents and machine parallelisation. That is not what I’m objecting to. It is the parallelisation of human attention that does not scale).

I liked that he included some thoughts about what folks can do in time outside this intense programming time. Not just “have a coffee” (although he includes that) but also about learning about the domain that the software supports.

❄ ❄ ❄ ❄ ❄

A couple of pithy quotes from social media

Lorin Hochstein

“Metaphor debt” is when all of your metaphors involve the concept of “debt” because you can’t think of any other metaphors anymore.

❄ ❄

Daniel Terhorst-North

If a vegan crossfit fan is using Claude to write Rust, which thing do they tell you first?

❄ ❄ ❄ ❄ ❄

Karl Bode reacts to speakers getting booed when mentioning AI during commencement addresses. He points out that younger folks are increasingly unhappy with the tech oligarchy and their fruits.

The thing is the kids aren’t stupid. They see the field clearly. They see the difference between what’s being sold to them by tech companies, the press, and commencement speakers, and what they have repeatedly seen with their own eyes.

They’ve watched tech oligarchs spend the last decade mired in scandal after scandal, hype cycle after hype cycle, steadily enshittifying everything they touch along the way.

[…]

The percentage of Gen Z that think AI’s benefits don’t counterbalance the risks now sits around fifty percent, up 11 percentage points in just the last year. Eight out of every ten believe that using AI makes the process of actual learning more difficult.

He sees young people saddled with the perception of entering a worsening world - which leads them to rage against this latest fruit of the tech oligarchy. A rage that is easy for folks like me - with a comfortable retirement off-ramp - to properly appreciate. A rage that could have marked political and social consequences.

❄ ❄ ❄ ❄ ❄

Relevant to these concerns are a couple of items in last week’s Economist newspaper. The newspaper argues that historically major technological advances haven’t led to significant unemployment or drops in wages (paywalled article). The closest was the original industrial revolution in 19th Century Britain. There was a stagnation in wages during this period, but there was also a massive increase in population, from 4½ million to 12 million.

It also points out that we’ll probably only understand the full consequences of all this when a recession hits, as this is when most unproductive jobs tend to be flushed out of the system.

A second article (also paywalled) indicates that AI is having some effect on graduate hiring. They did an analysis of surveys of recent graduates, looking to see if employment varied depending on a job’s exposure to AI. The least exposed quintile of subjects saw employment rate fall by 1.5% over the last couple of years, while the most exposed quintile’s drop was 6.6%.

❄ ❄ ❄ ❄ ❄

Lawfare isn’t impressed with the latest efforts by the US Government to regulate AI.

On [last] Wednesday, the White House invited leaders of OpenAI, Google, Anthropic, Meta, and Microsoft to the Oval Office for a signing ceremony the following afternoon. President Trump was to sign an executive order on AI and cybersecurity—the administration’s most formal effort yet to establish a voluntary process for reviewing frontier models before their release. But roughly three hours before the ceremony, when some company executives were already in the air to Washington, the White House called it off.

They see the proposed regulations as mild, and including some valuable measures to harden defenses against cyber threats.

But it’s worth underscoring the implications of postponing (if not outright canceling) this order, which, by its own terms, was about as modest a frontier-AI intervention as the federal government could put on paper: voluntary, focused on the government’s own defenses, and explicitly barred from becoming a licensing regime. The objection isn’t so much about government coercion as about the government having any settled role at all. Voluntary, in other words, isn’t the floor of frontier AI policy in this administration; it’s the ceiling.

This is a questionable position given that the concerns animating this draft order will likely grow in the near future. It is also self-defeating for those who applauded the order’s delay or demise. Far from resolving the risk of government meddling in AI, killing the order just leaves in place what Ball has described as the “opaque and essentially lawless” alternative: government access happening through back channels, on terms set case by case, with no stable rules at all.

One of the problems here is a distinct lack of governmental expertise, either in AI or in software in general. Too much is being decided at the whims of the tech oligarchy, there isn’t any attempt to engage in the broader issues at hand. That’s not entirely a bad thing, trying to regulate something that’s still evolving so fast is usually a fool’s errand - but the problem here is the impact of AI is so big that there’s real danger in being too far behind.

❄ ❄

Which leads me to a rare thing, an endorsement of a candidate for political office. If you are voting in congressional district MA-06 (North Shore of Massachusetts), I’d seriously look at Beth Anders-Beck, who is running for congress in that district. Beth has a long background in software development (including developing the notion of Forest and Desert), so would introduce expertise that Congress desperately needs. I’ve known Beth for decades, and have a high opinion of their intelligence, judgment, and ability to work with others. Congress doesn’t deserve Beth, but it does need her.

May 14

Last week I spent a day at The Orchard Retreat, hosted by Mechanical Orchard. that brought together several people working in software development to talk about the profession’s future with the rise of agentic programming. The event was help under the Chatham House Rule, so I can’t attribute the comments and stories I heard. (If anyone recognizes themselves, and would like attribution, let me know.) Here are a few tidbits that caught my notebook.

❄ ❄

One group developed a behavioral clone of GNU Cobol compiler in Rust. The result is 70K lines of Rust and was built in 3 days. This is yet another sign of the ability of LLMs to do a good job of porting existing code to a new platform. Good regression tests are extremely valuable here (and I don’t know how good GNU Cobol’s are). There’s also the possibility of building a test suite if you have access an existing implementation.

❄ ❄

Large spec documents can be complex for a human to review. One attendee shared the idea of getting the LLM to interview a human expert, asking the human questions to verify the correctness of the specification, a form of Interrogatory LLM.

❄ ❄

Not specifically about AI - but I liked how one attendee commented that the first thing they do when consulting with an organization is to read the guidelines for their change-control board. This is the scar tissue of what’s gone wrong in the past. I’ve often said that to understand why a thing is the way it is, you need to understand the history of how it got there. This seems like an excellent way to tap into important parts of that history.

❄ ❄

My colleagues who work with modernizing legacy systems have long been rather sniffy about “Lift and Shift” - porting a legacy system to a new platform while retaining Feature Parity.

We see this pattern as a huge missed opportunity. Often the old systems have bloated over time, with many features unused by users (50% according to a 2014 Standish Group report) and business processes that have evolved over time. Replacing these features is a waste. Instead, try to muster the energy to take a step back and understand what users currently need, and prioritize these needs against business outcomes and metrics.

But this point of view was developed before LLM’s ability to port code appeared. One attendee who does a lot of work in this field said they believed that lifting and shifting to a new platform should now be always the first step in a legacy migration. The cost is no longer as formidable as it used to be, and a better environment makes further evolution much cheaper. Just don’t stop there.

❄ ❄

Several attendees were from the financial industry, and thus were immersed the problems of complex legacy environments coupled with regulatory controls and significant risk should software do something wrong with money. One of their issues is the complexity we run into when a financial product is offered in multiple jurisdictions, each with their own regulations to satisfy. There’s a lot of software complexity in deciding which jurisdiction applies, and picking the right set of rules at the right point of the workflow.

The question here is whether the rapidity of agentic programming means that we can build individual, simpler systems for each jurisdiction. We would then use LLMs to ensure consistency between them, so that as the product rules change, each system reflects that into its own environment.

A large part of software design is about identifying what is the same and what differs between various business contexts. Where things are the same, and need to be the same, we are rightly wary of duplicating code, since this increases the cost of updates the dangers of inconsistency. The interesting question is what role LLMs can play to give us new tools to tackle this.

❄ ❄

As is usually the case in gatherings like this, folks were concerned about junior developers. When we work with The Genie, our value comes from good judgment - how do we teach that? This group did have one common tool - Pair Programming. One of the key benefits of pairing has always been skills transfer, and here an experienced agentic programmer can pass on their judgment for software design and how to use the genie to get there. And the junior will often have a trick or two to share too - that fresh pair of eyes in particularly valuable in the shift to our agentic future.

❄ ❄

Historically, we use computer systems to bring order to chaotic human processes. Is AI reversing that?

❄ ❄

So much software is involved in data transformation. Those records over there need to be consumed by these APIs over here, but there are differences in how the data is structured, often due to being in different Bounded Contexts, so we have to do some conversion. Agents are particularly adept at writing this kind of transformation code, which is often more tedious than we’d like.

❄ ❄

Chaos Engineering has become a valuable technique to improve resiliency, made famous by Netflix’s Chaos Monkey that randomly breaks live services to see how well the ecosystem reacts and recovers. What would a Chaos Monkey for AI look like? Would it deliberately introduce hallucinations into a pipeline to see if sensors were able to catch them?

❄ ❄ ❄ ❄ ❄

Back at my desk

There’s been a bunch of questions about the article on Structured-Prompt-Driven Development (SPDD) that the authors answered in a Q&A section. One in particular caught my eye:

Have you considered having an agent do the prompt/spec review itself — not a human reviewing the Canvas, but an agent that reads the REASONS Canvas alongside the code diff and verifies alignment?

The reply talks about how there is an available command to do this, but there downsides. In particular one reason not to do this automatically is:

Letting humans learn. Review is also where humans learn from the AI’s choices — patterns, trade-offs, options they had not thought of. Cutting humans out speeds things up, but it blocks the long-term skill growth that SPDD is designed to protect. […] Once enough decision rules build up to give us real confidence, we may shift more of the review to the agent step by step — but the part where humans learn from the AI is something we plan to keep.

One of the ways we should judge the value of an AI tool is how much it helps us humans learn more about the world we inhabit and build.

❄ ❄ ❄ ❄ ❄

In some strange way I injured my elbow last week. No idea how, there was no event where I said “oh shit”. It just gradually started hurting and swelling. My life-long strategy to avoid sports injuries¹ had defied me. I applied ice and ibuprofin, the swelling went down, but my range of motion got worse. I’m glad I learned to use a knife and fork in English childhood, so I normally eat with my left hand.

I noticed that that loss of range of motion occurred after I got home, when I started spending all day at the computer again. I might not use my elbow directly, but my right hand does a lot of typing and mousing. My desk set up is pretty ergonomic, with a good keyboard, a wrist rest for the mouse, and arm rests on my chair. But even so, did my computer use make my elbow get worse once I got home? I can’t imagine not using the computer, for me writing has become an unstoppable habit. But maybe I should use this opportunity to explore voice input - after all most people can speak faster than they can type.

I tried this many years ago, when a colleague told me how good voice recognition was once it trained to you. I tried it, and indeed the voice recognition, even in those pre-AI days, was very good. But it didn’t work for me. When I’m writing I rapidly type words into Emacs, but almost immediately I go back to edit them. Write two sentences, edit them, write another, re-edit the paragraph. The back-and-forth between seeing my words and thinking about them is tight - I can’t just dictate my words.

That made me reflect further. I only started using a computer for my writing in my 20s. At school I had to write longhand, and in university to type on typewriter. But those media don’t support the constant rewriting that I do now. Would I even have become a writer had the text editor not been invented?

❄ ❄ ❄ ❄ ❄

James Pritchard thinks that many developers are over-using agents at run-time in their products, when LLMs are better used as functions.

The problem with agents isn’t that they don’t work. It’s that they work unpredictably. You trade a known execution path for “autonomy” that mostly means “I don’t know what it’s going to do.” When an agent-powered feature breaks in production, you’re debugging a conversation transcript, not a stack trace.

Most “agent” use cases are actually workflows, a known sequence of steps where one or two of those steps happen to involve an LLM. You don’t need autonomy for that. You need a function call.

He points out that functions compose predictably, so if you know the workflow, then composing in a program text is better than agents figuring out how to coordinate themselves. It’s faster, and needs less tokens. It’s usually easier to deal with failures, since the scope of the interaction is smaller.

❄ ❄ ❄ ❄ ❄

Pritchard also thinks that people use skills far more than they should. He thinks people accumulate folders of markdown skill files but LLMs use them inconsistently, often missing them when they’re needed, or bloating context when they are not. Many things that should go in skills should be other parts of a harness, preferably computational. Skills should only be used with deliberate, infrequent workflows.

The skills obsession is a symptom of a deeper pattern: people reaching for configuration when they should be reaching for architecture.

“The LLM doesn’t write good tests.” Don’t write a testing skill. Are your existing tests inconsistent? Is the test setup complex? Fix those things and it’ll write good tests without being told how. Point it at a test file you’re proud of. Code is clearer than English.

[…]

The best setup is one where you barely need to configure the LLM at all. A clean codebase with clear patterns, a short project config for the non-obvious stuff, hooks for automation, and maybe one or two skills for specific workflows you run intentionally. That’s it.

❄ ❄ ❄ ❄ ❄

An oft-stated point about the rise of agentic programming is that we have to start dealing with non-determinism in our work. Of course that’s somewhat of a simplification, because some aspects of software development have long had to face non-determinism. A notable example of this is distributed systems, and a notable figure in helping us probe the truly uncomfortable waters of distributed systems is Kyle Kingsbury (Aphyr).

Last month he dropped a long article (the pdf is 32 pages) on how he sees our LLM-enabled future. The title “The Future of Everything is Lies, I Guess” betrays his lack of enthusiasm for this future.

Some readers are undoubtedly upset that I have not devoted more space to the wonders of machine learning—how amazing LLMs are at code generation, how incredible it is that Suno can turn hummed melodies into polished songs. But this is not an article about how fast or convenient it is to drive a car. We all know cars are fast. I am trying to ask what will happen to the shape of cities.

It’s worth the long read, even if it isn’t terribly cheerful. Kingsbury brings up many of worries about AI’s growth from the perspective of someone who is clearly well-informed about their capabilities.

His view is that the best response to all this is that we should stop. He wants to avoid using AIs for his writing, software, or personal life. He thinks those working for the AI companies should quit. And yet he also knows that these tools are useful, and wants to use them.

I’m both a hoper and a doomer when it comes to our AI future. Fundamentally I see any powerful technology as a big bus: we are either on it, or get run over by it. I’m onboard the bus because I don’t think putting up some barriers would stop me being crushed by its wheels. Maybe if I’m on the bus I can join some people to influence the driver a bit. I’m also very reluctant to speculate on the future outcomes of anything, let alone something as powerful as this. Did the early industrialists in the late eighteenth century have any clue what the industrial revolution they unleashed would do? While it created many harms, it also created a massive rise in the living standards of millions of people, at least those whose countries were on the bus. AI may create benefits that I can’t really dream of, although I can glimpse it when it helps a friend stave off Parkinson’s disease.

Those hopes are there, but Kingsbury’s article shines a light on the darker elements of the here-and-now, asking serious questions of responsibility

a part of my work as a moderator of a Mastodon instance is to respond to user reports, and occasionally those reports are for CSAM, and I am legally obligated to review and submit that content to the NCMEC. I do not want to see these images, and I really wish I could unsee them. On dark mornings, when I sit down at my computer and find a moderation report for AI-generated images of sexual assault, I sometimes wish that the engineers working at OpenAI etc. had to see these images too. Perhaps it would make them reflect on the technology they are ushering into the world, and how “alignment” is working out in practice.

Don’t do sports ↩

May 5

Over the last couple of months Rahul Garg published a series of posts here on how to reduce the friction in AI-assisted programming. To make it easier to put these ideas into practice he’s now built an open-source framework to operationalize these patterns.

AI coding assistants jump straight to code, silently make design decisions, forget constraints mid-conversation, and produce output nobody reviewed against real engineering standards. Lattice fixes this with composable skills in three tiers – atoms, molecules, refiners – that embed battle-tested engineering disciplines (Clean Architecture, DDD, design-first methodology, secure coding, and more), plus a living context layer (the .lattice/ folder) that accumulates your project’s standards, decisions, and review insights. The system gets smarter with use – after a few feature cycles, atoms aren’t applying generic rules, they’re applying your rules, informed by your history.

It can be installed as a Claude Code plugin or downloaded for use with any AI tool.

❄ ❄ ❄ ❄ ❄

This is also a good point to note that the article by my colleagues Wei Zhang and Jessie Jie Xia on Structured-Prompt-Driven Development (SPDD) has generated an enormous amount of traffic, and quite a few questions. They have thus added a Q&A section to the article that answers a dozen of them.

❄ ❄ ❄ ❄ ❄

Jessica Kerr (Jessitron) posted a merry tidbit of building a tool to work with conversation logs. She observes the double feedback loop involved.

There are (at least) two feedback loops running here. One is the development loop, with Claude doing what I ask and then me checking whether that is indeed what I want.[…] Then there’s a meta-level feedback loop, the “is this working?” check when I feel resistance. Frustration, tedium, annoyance–these feelings are a signal to me that maybe this work could be easier.

The double loop here is both changing the thing we are building but also changing the thing we are using to build the thing we are building.

As developers using software to build software, we have potential to mold our own work environment. With AI making software change superfast, changing our program to make debugging easier pays off immediately. Also, this is fun!

Indeed it is, and makes me think that agents are allowing us to (re)discover one of the Great Lost Joys of software development - that of molding my development environment to exactly fit the problem and my personal tastes. A while ago I wrote about this under the name Internal Reprogramability. It was a central feature of the Smalltalk and Lisp communities but was mostly lost as we got complex and polished IDEs (although the unix command line gives a hint of its appeal).

❄ ❄ ❄ ❄ ❄

Ashley MacIsaac is a musician from Cape Breton, who plays folk-influenced music (I have a couple of his albums in my collection). Google generated an AI overview that asserted that had been convicted of crimes, including sexual assault and was on the national sex-offender registry. These were completely false, confusing him with another man with the same name. MacIsaac is suing Google for defamation:

“This was not a search engine just scanning through things and giving somebody else’s story […] It was published by them. And to me, that is defamation. The guardrails were not there to prevent Google AI from publishing that content.”

MacIsaac’s point is that Google must take responsibility for what a tool it controls publishes. MacIsaac suffered genuine harm here, not just to his reputation, but he also had a concert canceled and the claims affect his performing.

“I felt that tangible fear from something that was published by a media company,” he said in an interview with The Canadian Press. “I feared for my own safety going on stage because of what I was labelled as. And I don’t know how long this will follow me.”

Too often tech companies try to dodge the consequences of their actions. There are genuine issues about the difficulties of monitoring what’s published at scale, but that’s a responsibility that they should face up to.

❄ ❄ ❄ ❄ ❄

Stephen O’Grady (RedMonk) takes a serious look at how much the big tech companies are spending on AI build-outs. The sums involved are staggering, not just in absolute terms (over $100 Billion), but also compared to the revenues of the companies involved. Firms like Amazon, Alphabet, and Microsoft are spending over 50% of their revenues (not profits). Meta and Oracle hit or pass 75% of revenues.

That level of investment would have been unthinkable a decade ago. Today, the chart suggests it’s table stakes

There is a notable exception: Apple. Here they are clearly Thinking Different, eyeballing the chart they seem closer to 10% of revenues.

❄ ❄ ❄ ❄ ❄

Most folks I talk to about agentic programming are using models in the cloud: Claude, Codex, and the like. Everyone agrees these are the most powerful models, the ones that triggered the November Inflection. But do we need to use the Most Powerful Models, particularly when we have to ship data to them, and pay handsomely for the privilege? Willem van den Ende considers an alternative, that local models are Good Enough.

Assumptions
- We are all figuring this out.
- Quality of a harness (coding agent + “skills” + extensions) can matter as least as much as the model
- Running open models and an open coding agent + custom extensions takes time, but pays off in understanding and a stable base where engineering effort compounds
- Open, local, models have (for me) crossed the point where they are good enough for daily work with a coding agent.

The post describes in detail his setup for local model work. It includes sandboxing with Nono, which is something to consider even if using a Cloud model - such powerful tools need a Zero Trust Architecture.

❄ ❄ ❄ ❄ ❄

In case you haven’t noticed, those last two fragments resonate. Apple isn’t playing the cloud AI model game, they are saving a huge amount of money, and if local models end up being the future, they’ll be looking rather wise. Van den Ende’s post led me to a podcast by Nate B Jones that argues that Apple is replaying a fifty-year old strategy here. All those years ago anyone who used a computer bought time on a mainframe, the Apple II put far less capable compute into the home and small office. From there came spreadsheets, desktop publishing, and the modern home computer - things that weren’t possible using mainframes.

He sees the rise of John Ternus as CEO isn’t merely a switch to a known insider successor - but a bet that the future of AI is sophisticated hardware in the home, office, and pocket. If Open Source models are Good Enough, then why spend money sending tokens - containing your sensitive data - to the AI megacorps?

❄ ❄ ❄ ❄ ❄

Talking of five decades in the past, it was in 1974 that Fred Brooks opened one of the most influential books in our profession with these paragraphs:

No scene from prehistory is quite so vivid as that of the mortal struggles of great beasts in the tar pits. In the mind’s eye one sees dinosaurs, mammoths, and sabertoothed tigers struggling against the grip of the tar. The fiercer the struggle, the more entangling the tar, and no beast is so strong or so skillful but that he ultimately sinks.

Large-system programming has over the past decade been such a tar pit, and many great and powerful beasts have thrashed violently in it. Most have emerged with running systems — few have met goals, schedules, and budgets. Large and small, massive or wiry, team after team has become entangled in the tar. No one thing seems to cause the difficulty — any particular paw can be pulled away. But the accumulation of simultaneous and interacting factors brings slower and slower motion. Everyone seems to have been surprised by the stickiness of the problem, and it is hard to discern the nature of it. But we must try to understand it if we are to solve it.

With the title of his recent post Kent Beck summons up that imagery as the Genie Tarpit. After explaining why skilled software development is about building both features and futures, he observes that these AI tools aren’t doing a good job of producing software with the kind of internal quality that is needed for a good future.

Here’s what I’ve observed — genies naturally live down & to the left of muddling. The “plausible deniability” task orientation of the genie leaves it claiming success even though the code doesn’t work at all. And complexity piles on complexity until even the genie can’t pretend to make progress any more.

It’s still an open question whether, or to what extent, internal quality matters in the age of agentic programming. One view is, as Laura Tacho puts it, “The Venn Diagram of Developer Experience and Agent Experience is a circle”. Well organized elements, with good naming, help The Genie understand code, so are important if we can continue to go beyond small disposable systems. The other view is that such internal quality doesn’t matter, that the galaxy brain of LLMs will make sense of the biggest bowls of spaghetti. Maybe not now, but after a couple more inflections.

That’s the fundamental question. Can The Genie evade the tar pit, or will it struggle fruitlessly against the tar’s sticky grip?

April 29

Chris Parsons has updated his guide on using AI to code. This is his third update, what I like about it is that he gives a lot of concrete information about how he uses AI, with sufficient detail that we can learn from him. His advice also resonates with the better advice I’ve seen out there, so the article makes a good overview of the state of using AI for software development.

I wrote the previous version of this post in March 2025, updated it once in August, and it has been linked from almost everything I have written about AI engineering since. The fundamentals from that post still hold: keep changes small, build guardrails, document ruthlessly, and make sure every change gets verified before it ships. One thing has had to move with the volume. “Verified” used to mean “read by you”. With modern agent throughput, it has to mean “checked by tests, by type checkers, by automated gates, or by you where your judgement matters”. The check still happens; it just does not always happen in your head.

Like Simon Willison, he makes a clear distinction between vibe coding, where you don’t look at or care about the code, and agentic engineering. He recommends either Claude Code or Codex CLI. He considers the inner harness provided by his preferred tools to be a key part of their advantage.

He sees verification is the key thing to focus on:

A team that can generate five approaches and verify all five in an afternoon will outpace a team that generates one and waits a week for feedback. The game is not “how fast can we build” any more. It is “how fast can we tell whether this is right”. That shifts where to invest. Build better review surfaces, not better prompts. Make feedback unnecessary where you can by having the agent verify against a realistic environment before it asks a human, and make feedback instant where you cannot.

The key role of the programmer is in training the AI to write software properly, and the most important thing skilled agentic programmers can do is pass that skill onto other developers.

And if you are a senior engineer worried that your job is quietly turning into approving diffs: it is. The way out is to train the AI so the diffs are right the first time, to make yourself the person on the team who shapes the harness, and to make that work the visible thing you are measured on. That role compounds in a way that reviewing never will.

❄ ❄ ❄ ❄ ❄

Early this month Birgitta Böckeler wrote a superb article on Harness Engineering. (That’s not just my opinion, judging by the crazy traffic it’s attracted.) Birgitta has now recorded a video discussion with Chris Ford on Harness Engineering, which is well worth a watch.

In it they focus on discussing the role of computational sensors in the harness, such as static analysis and tests.

LLMs are great for exploratory and fuzzy rules, but once you have something that really is objective, converting it to a formal, unambiguous, deterministic format can give you more assurance

Birgitta did some experiments to explore the benefits of adding sensors, including a deep dive on using static analysis. She found it’s more useful as agents can really address every warning, and don’t slack off like humans do.

❄ ❄ ❄ ❄ ❄

Adam Tornhill considers an age-old question: how long should a function be? This question is still relevent in the age of agentic programming.

AI models do not “understand” code the way humans do. They infer meaning from patterns in tokens and depend heavily on what is explicitly expressed in the code.

Research shows that naming plays a critical role. When meaningful identifiers are replaced with arbitrary names, model performance drops significantly. Current models rely heavily on literal features—names, structure, and local context—rather than inferred semantics.

Like me, he doesn’t think the answer is to think about how many lines should be in a function, instead it’s all about providing better structure. He has a good example of how a well-chosen function defines useful concepts, where a function wraps four lines of code, returning a new concept that enters the vocabulary of the program.

Functions are the first unit of structure in a codebase. They define how logic is grouped, how intent is communicated, and how change is localized. If the function boundaries are wrong, everything built on top of them becomes harder to understand and harder to evolve.

This fits with my writing that the key to function length is the separation between intention and implementation:

If you have to spend effort into looking at a fragment of code to figure out what it’s doing, then you should extract it into a function and name the function after that “what”. That way when you read it again, the purpose of the function leaps right out at you, and most of the time you won’t need to care about how the function fulfills its purpose - which is the body of the function.

❄ ❄ ❄ ❄ ❄

Many folks in my feeds recommended Nilay Patel’s post on Why People Hate AI. He thinks that many people in the software world have “software brain”:

The simplest definition I’ve come up with is that it’s when you see the whole world as a series of databases that can be controlled with the structured language of software code. Like I said, this is a powerful way of seeing things. So much of our lives run through databases, and a bunch of important companies have been built around maintaining those databases and providing access to them.

Zillow is a database of houses. Uber is a database of cars and riders. YouTube is a database of videos. The Verge’s website is a database of stories. You can go on and on and on. Once you start seeing the world as a bunch of databases, it’s a small jump to feeling like you can control everything if you can just control the data.

Software Brain views people into databases, and oddly enough, a lot of people don’t like that. Which is why so many polls reveal the negative feelings folks have about the AI movement.

Even taking the time to consider how much of your life is captured in databases makes people unhappy. No one wants to be surveilled constantly, and especially not in a way that makes tech companies even more powerful. But getting everything in a database so software can see it is a preoccupation of the AI industry. It’s why all the meeting systems have AI note takers in them now.

Patel draws a similarity that I’ve often made - that between programmers and lawyers. Lawyers who draw up contracts are creating a protocol for how the parties in the contract should behave. As Patel puts it:

If the heart of software brain is the idea that thinking in the structured language of code can make things happen in the real world, well, the heart of lawyer brain is that thinking in the structured legal language of statutes and citations can also make things happen. Hell, it can give you power over society.

The difference, of course, is that law is non-deterministic. Litigation is resolving what happens when people have different ideas about how those contracts should execute.

❄ ❄ ❄ ❄ ❄

I was chatting recently with a company who wanted to use AI to make sense of their internal data. The potential was great, but the problem was that the data a mess. People put stuff into fields that didn’t make sense, and there was little consistency about how people classified important entities. As someone commented

the hardest problem with internal data is precise, consistent definitions

You can imagine my astonishment. (i.e. none at all - this has been a constant theme during all my decades with computers.) The difficulty of getting such definitions undermines much of the hopes of Software Brain

This resonates with our relationship with LLMs when programming. Precise and consistent definitions strike me as crucial to effective communication with The Genie. These definitions need to grow in the conversation, and be tended over time. Conceptual modeling will be a key skill for agentic programming and whatever comes next. (At least I hope it will, since it’s a part of programming I really enjoy.)

❄ ❄ ❄ ❄ ❄

Patel’s article refers to Ezra Klein’s post about the new feeling in San Francisco.

You might think that A.I. types in Silicon Valley, flush with cash, are on top of the world right now. I found them notably insecure. They think the A.I. age has arrived and its winners and losers will be determined, in part, by speed of adoption. The argument is simple enough: The advantages of working atop an army of A.I. assistants and coders will compound over time, and to begin that process now is to launch yourself far ahead of your competition later. And so they are racing one another to fully integrate A.I. into their lives and into their companies. But that doesn’t just mean using A.I. It means making themselves legible to the A.I.

That legibility is the heart of Patel’s observation. That’s why I see many colleagues of mine dumping all their email, meeting notes, slide decks and everything else into files that AI can read and work with. This works to the strengths of AI, we know that AI is really good at querying unstructured information. So I can figure out what’s buried in my notes in a way that’s far more effective than hoping I’m typing the right search regex.

I’ve been using Gemini a fair bit for exactly this on the web, finding it easier to write a question to it than to throw search terms at Google. Gemini keeps a record of my past requests, and uses that to help it tune what I’m looking for. As Klein observes:

[The AI] is constantly referring back to other things it knows, or thinks it knows, about me. Sycophancy, in my experience, has given way to an occasionally unsettling attentiveness; a constant drawing of connections between my current concerns and my past queries, like a therapist desperate to prove he’s been paying close attention.

The result is a strange amalgam of feeling seen and feeling caricatured.

Like myself, Klein is a writer, and is faced by the same temptation that I have when I think about AI and writing. Maybe instead of toiling over articles, I should ask an LLM to create an AGENTS.md file that summarizes my writing style, and every few days ask it to compose an article on some subject, read it, tweak it, and then publish my erudite musings. But that’s not at all appealing to me. I want understanding to grow in my brain, not the LLM’s transient session. Writing to explain my thinking to others is how I refine that thinking, “chiseling that idea into something publishable” as Klein puts it. To have an AI write for me is to cripple my own mind.

April 21

Last week Thoughtworks released the 34th volume of our Technology Radar. This radar is our biannual survey of our experience of the technology scene, highlighting tools, techniques, platforms, and languages that we’ve used or otherwise caught our eye. This edition contains 118 blips, each briefly describing our impressions of one of these elements.

As we would expect, the radar is dominated by AI-oriented topics. Part of this is revisiting familiar ground with LLM-assisted eyes:

An interesting consequence of AI in software development is that it’s not only forcing us to look to the future; it’s also pushing us to revisit the foundations of our craft. While assembling this edition, we found ourselves returning to many established techniques, from pair programming to zero trust architecture, and from mutation testing to DORA metrics. We also revisited core principles of software craftsmanship, such as clean code, deliberate design, testability and accessibility as a first-class concern. This is not nostalgia, but a necessary counterweight to the speed at which AI tools can generate complexity. We also observed a resurgence of the command line: After years of abstracting it away in the name of usability, agentic tools are bringing developers back to the terminal as a primary interface.

I was especially happy to see my colleague Jim Gumbley added to the writing team, he’s been a regular source of security information for me over the years, including working on this site’s Threat Modeling Guide. Having a strong security presence on the radar team is especially important given the serious security concerns around using LLMs. One of the themes of the radar is securing “permission hungry” agents:

“Permission hungry” describes the bind at the heart of the current agent moment: the agents worth building are the ones that need access to everything. OpenClaw and Claude Cowork supervise real work tasks; Gas Town coordinates agent swarms across entire codebases. These agents require broad access to private data, external communication and real systems — each arguing that the payoff justifies it.

However, like a skier who’s just learned to turn and confidently points themselves at the hardest black run, the safeguards haven’t caught up with that ambition. The appetite for access collides with unsolved problems. Prompt injection means models still can’t reliably distinguish trusted instructions from untrusted input.

Given all of this, many of this radar’s blips are about Harness Engineering, indeed the radar meeting was a major source of ideas for Birgitta’s excellent article on the subject. The radar includes several blips suggesting the guides and sensors necessary for a well-fitting harness. I expect that when the next radar appears in six months time, that list will increase.

❄ ❄ ❄ ❄ ❄

Mike Mason looks what happens when developers aren’t reading the code.

The Python codebase Claude produced was largely working. Unit tests passed, and a few hours of real-world testing showed it was successfully managing a fairly complex piece of my infrastructure. But somewhere around 100KB of total code I noticed something: the main file had grown to about 50KB (2,000 lines) and Claude Code, when it needed to make edits, had started reaching for sed to find and modify code within that file. When I saw that, it was a serious alarm bell.

As well as the experience of “a friend”, he ponders the 500,000 lines of Claude Code after the leak.

Both things are true: there is good architecture in Claude Code, and there is also an incomprehensible mess. That’s actually the point. You don’t get to know which is which without reading the code.

His conclusion is a rough framework. Throw-away analysis scripts are fine to vibe away. Tooling you need to maintain and durable code, needs regular human review - even if it’s just a human asking a model to evaluate the code with some hints as to what good code looks like

The moment you say “I’m getting uncomfortable with how big this is getting, can we do something better?” it does the right thing: sensible decomposition, new classes, sometimes even unit tests for the new thing. It knew, it just didn’t volunteer it.

He does recommend being serious with CLAUDE.md, I don’t know if he’s tried many of the patterns that Rahul Garg has recently posted to break the similar frustration loop that he saw.

❄ ❄ ❄ ❄ ❄

Dan Davies poses an annoying philosophy thought experiment for us to consider how we feel about LLMs indulging in ghost writing.

❄ ❄ ❄ ❄ ❄

DOGE dismantled many useful things during their brief period with the wood chipper. One of these was DirectFile, a government program that supported people filing their taxes online. Don Moynihan has talked to many folks involved in Direct File, has penned a worthwhile essay that isn’t just relevant to DirectFile and other U.S. government technology projects, but indeed any technology initiative in a large organization.

Moynihan highlights:

a paradox of government reform: the simpler a potential change appears, the more likely that it has not been implemented because it features deceptive complexity that others have tried and failed to resolve.

I’ve heard that tale in many a large corporation too

One way government initiatives are different is that, at its best, it’s built on an attitude of public service

Many who worked on Direct File drew a sharp contrast with DOGE and their approach to building tech products. One point of distinction was DOGE’s seeming disinterest in public interest goals and of the public itself: “if you do not think government has a responsibility to serve people, I think it draws into question how good are you going to be at making government work better for people if you just don’t believe in that underlying principle”

The tragedy for U.S. taxpayers like me is that we’ve lost an effective way to go through the annual hassle of taxes. In addition the IRS is much weaker - it’s lost 25% of its staff and its budget is 40% below what it was in 2010. Much though we hate tax collectors, this isn’t a good thing. An efficient tax system is an important part of national security, many historians consider the ability to raise taxes effectively was an important reason why Britain won its century-long struggle with France in the Eighteenth century. A wonky tax system is also a major reason why the French monarchy, so powerful at the start of that century, fell to revolution. Indeed there is considerable evidence that increasing the budget of the IRS would more than pay for itself by increasing revenue.

April 14

I attended the first Pragmatic Summit early this year, and while there host Gergely Orosz interviewed Kent Beck and myself on stage. The video runs for about half-an-hour.

I always enjoy nattering with Kent like this, and Gergely pushed into some worthwhile topics. Given the timing, AI dominated the conversation - we compared it to earlier technology shifts, the experience of agile methods, the role of TDD, the danger of unhealthy performance metrics, and how to thrive in an AI-native industry.

❄ ❄ ❄ ❄ ❄

Perl is a language I used a little, but never loved. However the definitive book on it, by its designer Larry Wall, contains a wonderful gem. The three virtues of a programmer: hubris, impatience - and above all - laziness.

Bryan Cantrill also loves this virtue:

Of these virtues, I have always found laziness to be the most profound: packed within its tongue-in-cheek self-deprecation is a commentary on not just the need for abstraction, but the aesthetics of it. Laziness drives us to make the system as simple as possible (but no simpler!) — to develop the powerful abstractions that then allow us to do much more, much more easily.

Of course, the implicit wink here is that it takes a lot of work to be lazy

Understanding how to think about a problem domain by building abstractions (models) is my favorite part of programming. I love it because I think it’s what gives me a deeper understanding of a problem domain, and because once I find a good set of abstractions, I get a buzz from the way they make difficulties melt away, allowing me to achieve much more functionality with less lines of code.

Cantrill worries that AI is so good at writing code, we risk losing that virtue, something that’s reinforced by brogrammers bragging about how they produce thirty-seven thousand lines of code a day.

The problem is that LLMs inherently lack the virtue of laziness. Work costs nothing to an LLM. LLMs do not feel a need to optimize for their own (or anyone’s) future time, and will happily dump more and more onto a layercake of garbage. Left unchecked, LLMs will make systems larger, not better — appealing to perverse vanity metrics, perhaps, but at the cost of everything that matters. As such, LLMs highlight how essential our human laziness is: our finite time forces us to develop crisp abstractions in part because we don’t want to waste our (human!) time on the consequences of clunky ones. The best engineering is always borne of constraints, and the constraint of our time places limits on the cognitive load of the system that we’re willing to accept. This is what drives us to make the system simpler, despite its essential complexity.

This reflection particularly struck me this Sunday evening. I’d spent a bit of time making a modification of how my music playlist generator worked. I needed a new capability, spent some time adding it, got frustrated at how long it was taking, and wondered about maybe throwing a coding agent at it. More thought led to realizing that I was doing it in a more complicated way than it needed to be. I was including a facility that I didn’t need, and by applying yagni, I could make the whole thing much easier, doing the task in just a couple of dozen lines of code.

If I had used an LLM for this, it may well have done the task much more quickly, but would it have made a similar over-complication? If so would I just shrug and say LGTM? Would that complication cause me (or the LLM) problems in the future?

❄ ❄ ❄ ❄ ❄

Jessica Kerr (Jessitron) has a simple example of applying the principle of Test-Driven Development to prompting agents. She wants all updates to include updating the documentation.

Instructions – We can change AGENTS.md to instruct our coding agent to look for documentation files and update them.

Verification – We can add a reviewer agent to check each PR for missed documentation updates.

This is two changes, so I can break this work into two parts. Which of these should we do first?

Of course my initial comment about TDD answers that question

❄ ❄ ❄ ❄ ❄

Mark Little prodded an old memory of mine as he wondered about to work with AIs that are over-confident of their knowledge and thus prone to make up answers to questions, or to act when they should be more hesitant. He draws inspiration from an old, low-budget, but classic SciFi movie: Dark Star. I saw that movie once in my 20s (ie a long time ago), but I still remember the crisis scene where a crew member has to use philosophical argument to prevent a sentient bomb from detonating.

Doolittle: You have no absolute proof that Sergeant Pinback ordered you to detonate.
Bomb #20: I recall distinctly the detonation order. My memory is good on matters like these.
Doolittle: Of course you remember it, but all you remember is merely a series of sensory impulses which you now realize have no real, definite connection with outside reality.
Bomb #20: True. But since this is so, I have no real proof that you’re telling me all this.
Doolittle: That’s all beside the point. I mean, the concept is valid no matter where it originates.
Bomb #20: Hmmmm….
Doolittle: So, if you detonate…
Bomb #20: In nine seconds….
Doolittle: …you could be doing so on the basis of false data.
Bomb #20: I have no proof it was false data.
Doolittle: You have no proof it was correct data!
Bomb #20: I must think on this further.

Doolittle has to expand the bomb’s consciousness, teaching it to doubt its sensors. As Little puts it:

That’s a useful metaphor for where we are with AI today. Most AI systems are optimised for decisiveness. Given an input, produce an output. Given ambiguity, resolve it probabilistically. Given uncertainty, infer. This works well in bounded domains, but it breaks down in open systems where the cost of a wrong decision is asymmetric or irreversible. In those cases, the correct behaviour is often deferral, or even deliberate inaction. But inaction is not a natural outcome of most AI architectures. It has to be designed in.

In my more human interactions, I’ve always valued doubt, and distrust people who operate under undue certainty. Doubt doesn’t necessarily lead to indecisiveness, but it does suggest that we include the risk of inaccurate information or faulty reasoning into decisions with profound consequences.

If we want AI systems that can operate safely without constant human oversight, we need to teach them not just how to decide, but when not to. In a world of increasing autonomy, restraint isn’t a limitation, it’s a capability. And in many cases, it may be the most important one we build.

April 9

I mostly link to written material here, but I’ve recently listened to two excellent podcasts that I can recommend.

Anyone who regularly reads these fragments knows that I’m a big fan of Simon Willison, his (also very fragmentary) posts have earned a regular spot in my RSS reader. But the problem with fragments, however valuable, is that they don’t provide a cohesive overview of the situation. So his podcast with Lenny Rachitsky is a welcome survey of that state of world as seen through a discerning pair of eyeballs. He paints a good picture of how programming has changed for him since the “November inflection point”, important patterns for this work, and his concern about the security bomb nestled inside the beast.

My other great listening was on a regular podcast that I listen to, as Gergely Orosz interviewed Thuan Pham - the former CTO of Uber. As with so many of Gergely’s podcasts, they focused on Thuan Pham’s fascinating career direction, giving listeners an opportunity to learn from a successful professional. There’s also an informative insight into Uber’s use of microservices (they had 5000 of them), and the way high-growth software necessarily gets rewritten a lot (a phenomenon I dubbed Sacrificial Architecture)

❄ ❄ ❄ ❄ ❄

Axios published their post-mortem on their recent supply chain compromise. It’s quite a story, the attackers spent a couple of weeks developing contact with the lead maintainer, leading to a video call where the meeting software indicated something on the maintainer’s system was out of date. That led to the maintainer installing the update, which in fact was a Remote Access Trojan (RAT).

they tailored this process specifically to me by doing the following:

they reached out masquerading as the founder of a company they had cloned the companys founders likeness as well as the company itself.

they then invited me to a real slack workspace. this workspace was branded to the companies ci and named in a plausible manner. the slack was thought out very well, they had channels where they were sharing linked-in posts, the linked in posts i presume just went to the real companys account but it was super convincing etc. they even had what i presume were fake profiles of the team of the company but also number of other oss maintainers.

they scheduled a meeting with me to connect. the meeting was on ms teams. the meeting had what seemed to be a group of people that were involved.

the meeting said something on my system was out of date. i installed the missing item as i presumed it was something to do with teams, and this was the RAT.

everything was extremely well co-ordinated looked legit and was done in a professional manner.

Simon Willison has a summary and further links.

❄ ❄ ❄ ❄ ❄

I recently bumped into Diátaxis, a framework for organizing technical documentation. I only looked at it briefly, but there’s much to like. In particular I appreciated how it classified four forms of documentation:

Tutorials: to learn how to use the product
How-to guides: for users to follow to achieve particular goals with the product
Reference: to describe what the product does
Explanations: background and context to educate the user on the product’s rationale

The distinction between tutorials and how-to guides is interesting

A tutorial serves the needs of the user who is at study. Its obligation is to provide a successful learning experience. A how-to guide serves the needs of the user who is at work. Its obligation is to help the user accomplish a task.

I also appreciated its point of pulling explanations out into separate areas. The idea is that other forms should contain only minimal explanations, linking to the explanation material for more depth. That way we keep the flow on the goal and allow the user to seek deeper explanations in their own way. The study/work distinction between explanation and reference mirrors that same distinction between tutorials and how-to guides.

❄ ❄ ❄ ❄ ❄

For eight years, Lalit Maganti wanted a set of tools for working with SQLite. But it would be hard and tedious work, “getting into the weeds of SQLite source code, a fiendishly difficult codebase to understand”. So he didn’t try it. But after the November inflection point, he decided to tackle this need.

His account of this exercise is an excellent description of the benefits and perils of developing with AI agents.

Through most of January, I iterated, acting as semi-technical manager and delegating almost all the design and all the implementation to Claude. Functionally, I ended up in a reasonable place: a parser in C extracted from SQLite sources using a bunch of Python scripts, a formatter built on top, support for both the SQLite language and the PerfettoSQL extensions, all exposed in a web playground.

But when I reviewed the codebase in detail in late January, the downside was obvious: the codebase was complete spaghetti. I didn’t understand large parts of the Python source extraction pipeline, functions were scattered in random files without a clear shape, and a few files had grown to several thousand lines. It was extremely fragile; it solved the immediate problem but it was never going to cope with my larger vision, never mind integrating it into the Perfetto tools. The saving grace was that it had proved the approach was viable and generated more than 500 tests, many of which I felt I could reuse.

He threw it all away and worked more closely with the AI on the second attempt, with lots of thinking about the design, reviewing all the code, and refactoring with every step

In the rewrite, refactoring became the core of my workflow. After every large batch of generated code, I’d step back and ask “is this ugly?” Sometimes AI could clean it up. Other times there was a large-scale abstraction that AI couldn’t see but I could; I’d give it the direction and let it execute. If you have taste, the cost of a wrong approach drops dramatically because you can restructure quickly.

He ended up with a working system, and the AI proved its value in allowing him to tackle something that he’d been leaving on the todo pile for years. But even with the rewrite, the AI had its potholes.

His conclusion of the relative value of AI in different scenarios:

When I was working on something I already understood deeply, AI was excellent…. When I was working on something I could describe but didn’t yet know, AI was good but required more care…. When I was working on something where I didn’t even know what I wanted, AI was somewhere between unhelpful and harmful…

At the heart of this is that AI works at its best when there is an objectively checkable answer. If we want an implementation that can pass some tests, then AI does a good job. But when it came to the public API:

I spent several days in early March doing nothing but API refactoring, manually fixing things any experienced engineer would have instinctively avoided but AI made a total mess of. There’s no test or objective metric for “is this API pleasant to use” and “will this API help users solve the problems they have” and that’s exactly why the coding agents did so badly at it.

❄ ❄ ❄ ❄ ❄

I became familiar with Ryan Avent’s writing when he wrote the Free Exchange column for The Economist. His recent post talks about how James Talarico and Zohran Mamdani have made their religion an important part of their electoral appeal, and their faith is centered on caring for others. He explains that a focus on care leads to an important perspective on economic growth.

The first thing to understand is that we should not want growth for its own sake. What is good about growth is that it expands our collective capacities: we come to know more and we are able to do more. This, in turn, allows us to alleviate suffering, to discover more things about the universe, and to spend more time being complete people.

April 2

As we see LLMs churn out scads of code, folks have increasingly turned to Cognitive Debt as a metaphor for capturing how a team can lose understanding of what a system does. Margaret-Anne Storey thinks a good way of thinking about these problems is to consider three layers of system health:

Technical debt lives in code. It accumulates when implementation decisions compromise future changeability. It limits how systems can change.

Cognitive debt lives in people. It accumulates when shared understanding of the system erodes faster than it is replenished. It limits how teams can reason about change.

Intent debt lives in artifacts. It accumulates when the goals and constraints that should guide the system are poorly captured or maintained. It limits whether the system continues to reflect what we meant to build and it limits how humans and AI agents can continue to evolve the system effectively.

While I’m getting a bit bemused by debt metaphor proliferation, this way of thinking does make a fair bit of sense. The article includes useful sections to diagnose and mitigate each kind of debt. The three interact with each other, and the article outlines some general activities teams should do to keep it all under control

❄ ❄

In the article she references a recent paper by Shaw and Nave at the Wharton School that adds LLMs to Kahneman’s two-system model of thinking.

Kahneman’s book, “Thinking Fast and Slow”, is one of my favorite books. Its central idea is that humans have two systems of cognition. System 1 (intuition) makes rapid decisions, often barely-consciously. System 2 (deliberation) is when we apply deliberate thinking to a problem. He observed that to save energy we default to intuition, and that sometimes gets us into trouble when we overlook things that we would have spotted had we applied deliberation to the problem.

Shaw and Nave consider AI as System 3

A consequence of System 3 is the introduction of cognitive surrender, characterized by uncritical reliance on externally generated artificial reasoning, bypassing System 2. Crucially, we distinguish cognitive surrender, marked by passive trust and uncritical evaluation of external information, from cognitive offloading, which involves strategic delegation of cognition during deliberation.

It’s a long paper, that goes into detail on this “Tri-System theory of cognition” and reports on several experiments they’ve done to test how well this theory can predict behavior (at least within a lab).

❄ ❄ ❄ ❄ ❄

I’ve seen a few illustrations recently that use the symbols “< >” as part of an icon to illustrate code. That strikes me as rather odd, I can’t think of any programming language that uses “< >” to surround program elements. Why that and not, say, “{ }”?

Obviously the reason is that they are thinking of HTML (or maybe XML), which is even more obvious when they use “</>” in their icons. But programmers don’t program in HTML.

❄ ❄ ❄ ❄ ❄

Ajey Gore thinks about if coding agents make coding free, what becomes the expensive thing? His answer is verification.

What does “correct” mean for an ETA algorithm in Jakarta traffic versus Ho Chi Minh City? What does a “successful” driver allocation look like when you’re balancing earnings fairness, customer wait time, and fleet utilisation simultaneously? When hundreds of engineers are shipping into ~900 microservices around the clock, “correct” isn’t one definition — it’s thousands of definitions, all shifting, all context-dependent. These aren’t edge cases. They’re the entire job.

And they’re precisely the kind of judgment that agents cannot perform for you.

Increasingly I’m seeing a view that agents do really well when they have good, preferably automated, verification for their work. This encourages such things as Test Driven Development. That’s still a lot of verification to do, which suggests we should see more effort to find ways to make it easier for humans to comprehend larger ranges of tests.

While I agree with most of what Ajey writes here, I do have a quibble with his view of legacy migration. He thinks it’s a delusion that “agentic coding will finally crack legacy modernisation”. I agree with him that agentic coding is overrated in a legacy context, but I have seen compelling evidence that LLMs help a great deal in understanding what legacy code is doing.

The big consequence of Ajey’s assessment is that we’ll need to reorganize around verification rather than writing code:

If agents handle execution, the human job becomes designing verification systems, defining quality, and handling the ambiguous cases agents can’t resolve. Your org chart should reflect this. Practically, this means your Monday morning standup changes. Instead of “what did we ship?” the question becomes “what did we validate?” Instead of tracking output, you’re tracking whether the output was right. The team that used to have ten engineers building features now has three engineers and seven people defining acceptance criteria, designing test harnesses, and monitoring outcomes. That’s the reorganisation. It’s uncomfortable because it demotes the act of building and promotes the act of judging. Most engineering cultures resist this. The ones that don’t will win.

❄ ❄ ❄ ❄ ❄

One the questions comes up when we think of LLMs-as-programmers is whether there is a future for source code. David Cassel on The New Stack has an article summarizing several views of the future of code. Some folks are experimenting with entirely new languages built with the LLM in mind, others think that existing languages, especially strictly typed languages like TypeScript and Rust will be the best fit for LLMs. It’s an overview article, one that has lots of quotations, but not much analysis in itself - but it’s worth a read as a good overview of the discussion.

I’m interested to see how all this will play out. I do think there’s still a role for humans to work with LLMs to build useful abstractions in which to talk about what the code does - essentially the DDD notion of Ubiquitous Language. Last year Unmesh and I talked about growing a language with LLMs. As Unmesh put it

Programming isn’t just typing coding syntax that computers can understand and execute; it’s shaping a solution. We slice the problem into focused pieces, bind related data and behaviour together, and—crucially—choose names that expose intent. Good names cut through complexity and turn code into a schematic everyone can follow. The most creative act is this continual weaving of names that reveal the structure of the solution that maps clearly to the problem we are trying to solve.

March 26

Anthropic carried a study, done by getting its model to interview some 80,000 users to understand their opinions about AI, what they hope from it, and what they fear. Two things stood out to me.

It’s easy to assume there are AI optimists and AI pessimists, divided into separate camps. But what we actually found were people organized around what they value—financial security, learning, human connection— watching advancing AI capabilities while managing both hope and fear at once.

That makes sense, if asked whether I’m a an AI booster or an AI doomer, I answer “yes”. I am both fascinated by its impact on my profession, expectant of the benefits it will bring to our world, and worried by the harms that will come from it. Powerful technologies rarely yield simple consequences.

The other thing that struck me was that, despite most people mixing the two, there was an overall variance between optimism and pessimism with AI by geography. In general, the less developed the country, the more optimism about AI.

❄ ❄ ❄ ❄ ❄

Julias Shaw describes how to fix a gap in many people’s use of specs to drive LLMs:

Here’s what I keep seeing: the specification-driven development (SDD) conversation has exploded. The internet is overflowing with people saying you should write a spec before prompting. Describe the behavior you want. Define the constraints. Give the agent guardrails. Good advice. I often follow it myself.

But almost nobody takes the next step. Encoding those specifications into automated tests that actually enforce the contract.

And the strange part is, most developers outside the extreme programming crowd don’t realize they need to. They genuinely believe the spec document is the safety net. It isn’t. The spec document is the blueprint. The safety net is the test suite that catches the moment your code drifts away from it.

As well as explaining why it’s important to have such a test suite, he provides an astute five-step checklist to turn spec documents into executable tests.

❄ ❄ ❄ ❄ ❄

Lawfare has a long article on potential problems countering covert action by Iran. It’s a long article, and I confess I only skip-read it. It begins by outlining a bunch of plots hatched in the last few years. Then it says:

If these examples seem repetitive, it’s because they are. Iran has proved itself relentless in its efforts to carry out attacks on U.S. soil—and the U.S., for its part, has demonstrated that it is capable of countering those efforts. The above examples show how robustly the U.S. national security apparatus was able to respond, largely through the FBI and the Justice Department….

That is, potentially, until now. The current administration has decimated the national security elements of both agencies through firings and forced resignations. People with decades of experience in building interagency and critical source relationships around the world, handling high-pressure, complicated investigations straddling classified and unclassified spaces, and acting in time to prevent violence and preserve evidence have been pushed out the door. Those who remain not only have to stretch to make up for the personnel deficit but also are being pulled away by White House priorities not tied to the increasing threat of an Iranian response.

The article goes into detail about these cuts, and the threats that may exploit the resulting gaps.

It’s the nature of national security people to highlight potential threats and call for more resources and power. But it’s also the nature of enemies to find weak spots and look to cause havoc. I wonder what we’ll think should we read this article again in a few years time

March 19

David Poll points out the flawed premise of the argument that code review is a bottleneck

To be fair, finding defects has always been listed as a goal of code review – Wikipedia will tell you as much. And sure, reviewers do catch bugs. But I think that framing dramatically overstates the bug-catching role and understates everything else code review does. If your review process is primarily a bug-finding mechanism, you’re leaving most of the value on the table.

Code review answers: “Should this be part of my product?”

That’s close to how I think about it. I think of code review as primarily about keeping the code base healthy. And although many people think of code review as pre-integration review done on pull requests, I look at code review as a broader activity both done earlier (Pair Programming) and later (Refinement Code Review).

At Firebase, I spent 5.5 years running an API council…

The most valuable feedback from that council was never “you have a bug in this spec.” It was “this API implies a mental model that contradicts what you shipped last quarter” or “this deprecation strategy will cost more trust than the improvement is worth” or simply “a developer encountering this for the first time won’t understand what it does.” Those are judgment calls about whether something should be part of the product – the same fundamental question that code review answers at a different altitude. No amount of production observability surfaces them, because the system can work perfectly and still be the wrong thing to have built.

His overall point is that code review is all about applying judgment, steering the code in a good direction. AI raises the level of that judgment, focusing review on more important things.

I agree that we shouldn’t be thinking of review as a bug-catching mechanism, and that it’s about steering the code base. In addition I’d also add that it’s about communication between people, enabling multiple perspectives on the development of the product. This is true both for code review, and for pair programming.

❄ ❄ ❄ ❄ ❄

Charity Majors is unhappy with me and rest of the folks that attended the Thoughtworks Future of Software Development Retreat.

But the longer I sit with this recap, the more troubled I am by what it doesn’t say. I worry that the most respected minds in software are unintentionally replicating a serious blind spot that has haunted software engineering for decades: relegating production to the realm of bugs and incidents.

There are lots of things we didn’t discuss in that day-and-a-half, and it’s understandable that a topic that matters so deeply to her is visible by its absence. I’m certainly not speaking for anyone else who was there, but I’ll take the opportunity to share some of my thoughts on this.

I consider observability to be a key tool in working with our AI future. As she points out, observability isn’t really about finding bugs - although I’ve long been a supporter of the notion of QA in Production. Observability is about revealing what the system actually does, when in the hands of its actual users. Test cases help you deal with the known paths, but reality has a habit of taking you into the unknowns, not just the unknowns of the software’s behavior in unforeseen places, but also the unknowns of how the software affects the broader human and organizational systems it’s embedded into. By watching how software is used, we can learn about what users really want to achieve, these observed requirements are often things that never popped up in interviews and focus groups.

If these unknown territories are true in systems written line-by-line in deterministic code, it’s even more true when code is written in a world of supervisory engineering where humans are no longer to look over every semi-colon. Certainly harness engineering and humans in the loop help, and I’m as much a fan as ever about the importance of tests as a way to both explain and evaluate the code. But these unknowns will inevitably raise the importance of observability and its role to understand what the system thinks it does. I think it’s likely we’ll see a future where much of a developer’s effort is figuring what a system is doing and why it’s behaving that way, where observability tools are the IDE.

In this I ponder the lesson of AI playing Go. AlphaGo defeated the best humans a decade ago, and since then humans study AI to become better players and maybe discover some broader principles. I’m intrigued by how humans can learn from AI systems to be improve in other fields, where success is less deterministically defined.

❄ ❄ ❄ ❄ ❄

Tim Requarth questions the portrayal of AI as an amplifier for human cognition. He considers the different way we navigate with GPS compared to maps.

If you unfold a paper map, you study the streets, trace a route, convert the bird’s-eye abstraction into the first-person POV of actually walking—and by the time you arrived, you’d have a nascent mental model of how the city fits together. Or you could fire up Google Maps: A blue dot, an optimal line from A to B, a reassuring robotic voice telling you when to turn. You follow, you arrive, you have no idea, really, where you are. A paper map demands something from you, and that demand leaves you with knowledge. GPS requires nothing, and leaves you with nothing. A paper map and GPS are tools with the same purpose, but opposite cognitive consequences.

He introduces some attractive metaphors here. Steve Jobs called computers “bicycles for the mind”, Satya Nadella said with the launch of ChatGPT that “we went from the bicycle to the steam engine”.

Like another 19th-century invention, the steam locomotive, the bicycle was a technological revolution. But a train traveler sat back and enjoyed the ride, while a cyclist still had to put in effort. With a bicycle, “you are traveling,” wrote a cycling enthusiast in 1878, “not being traveled.”

In both examples, there’s a difference between tools that extend capability and tools that replace it. The question is what we lose when we are passive in the journey? He argues that Silicon Valley executives are too focused on the goal, and ignoring the cognitive atrophy that happens to the humans being traveled.

Much of this depends, I think, on whether we care about what we are losing. I struggle with mental arithmetic, so I value calculators, whether on my phone or M-x calc. I don’t think I lose anything when I let the machine handle the toil of calculation. I share missing the sense of place when using a GPS over a map, but am happy that I can now drive though Lynn without getting lost. And when it comes to writing, I have no desire to let an LLM write this page.

March 16

Annie Vella did some research into how 158 professional software engineers used AI, her first question was:

Are AI tools shifting where engineers actually spend their time and effort? Because if they are, they’re implicitly shifting what skills we practice and, ultimately, the definition of the role itself.

She found that participants saw a shift from creation-oriented tasks to verification-oriented tasks, but it was a different form of verification than reviewing and testing.

In my thesis, I propose a name for it: supervisory engineering work - the effort required to direct AI, evaluate its output, and correct it when it’s wrong.

Many software folks think of inner and outer loops. The inner loop is writing code, testing, debugging. The outer loop is commit, review, CI/CD, deploy, observe.

What if supervisory engineering work lives in a new loop between these two loops? AI is increasingly automating the inner loop - the code generation, the build-test cycle, the debugging. But someone still has to direct that work, evaluate the output, and correct what’s wrong. That feels like a new loop, the middle loop, a layer where engineers supervise AI doing what they used to do by hand.

A potential issue with this research is that it finished in April 2025, before the latest batch of models greatly improved their software development capabilities. But my sense is that this improvement in models has only accelerated a shift to supervisory engineering. This shift is a traumatic change to what we do and the skills we need. It doesn’t mean “the end of programming”, rather a change of what it means to be programming.

A lot of software engineers right now are feeling genuine uncertainty about the future of their careers. What they trained to do, what they spent years upskilling in, is shifting - and in many ways, being commoditised. The narratives don’t help: either AI is coming for your job, or you should just “move upstream” into architecture and “higher value” work. Neither tells you what to actually do on Monday morning.

That’s why this matters. There is still plenty of engineering work in software engineering, even if it looks different from what most of us trained for. Supervisory engineering work and the middle loop are one way of describing what that different looks like, grounded in what engineers are actually reporting.

❄ ❄ ❄ ❄ ❄

Bassim Eledath lays out 8 levels of Agentic Engineering.

AI’s coding ability is outpacing our ability to wield it effectively. That’s why all the SWE-bench score maxxing isn’t syncing with the productivity metrics engineering leadership actually cares about. When Anthropic’s team ships a product like Cowork in 10 days and another team can’t move past a broken POC using the same models, the difference is that one team has closed the gap between capability and practice and the other hasn’t.

That gap doesn’t close overnight. It closes in levels. 8 of them.

His levels are:

Tab Complete
Agent IDE
Context Engineering
Compounding Engineering
MCP & Skills
Harness Engineering
Background Agents
Autonomous Agent Teams

Eight seems to be the number thou shalt have for levels. Earlier this year Steve Yegge proposed eight levels in Welcome to Gas Town. His levels were

Zero or Near-Zero AI: maybe code completions, sometimes ask Chat questions

Coding agent in IDE, permissions turned on. A narrow coding agent in a sidebar asks your permission to run tools.

Agent in IDE, YOLO mode: Trust goes up. You turn off permissions, agent gets wider.

In IDE, wide agent: Your agent gradually grows to fill the screen. Code is just for diffs.

CLI, single agent. YOLO. Diffs scroll by. You may or may not look at them.

CLI, multi-agent, YOLO. You regularly use 3 to 5 parallel instances. You are very fast.

10+ agents, hand-managed. You are starting to push the limits of hand-management.

Building your own orchestrator. You are on the frontier, automating your workflow.

I’m sure neither of these Maturity Models is entirely accurate, but both resonate as reasonable frameworks to think about LLM usage, and in particular to highlight how people are using them differently

❄ ❄ ❄ ❄ ❄

Chad Fowler thinks we have to change our thinking of what our target is when generating code.

…in a world where code can be generated quickly and cheaply, the real constraint has shifted. The problem is no longer producing code. The problem is replacing it safely.

Regenerative software does not work if the unit of generation is an application. Regeneration only works if the unit of generation is a component that compiles into a system architecture

He outlines several architectural constraints that make it easier to replace components

a small amount of communication patterns
clear ownership of data (“exclusive mutation authority for each dataset to a single component”)
clear evaluation surfaces, allowing behavior to be verified independently of implementation
the right size of components (natural grain). That size is based on data ownership boundaries and evaluation surfaces

Dividing complex systems into networks of replaceable components has long been a goal of software architecture. So far, this is still important in the world of agentic engineering.

❄ ❄ ❄ ❄ ❄

Mike Masnick summarized troubling experiences of using AI detection systems on student writing. (He’s summarizing an article by Dadland Maye, which is behind a registration wall that I’m too lazy to form-fill.) Maye’s institution used tools to detect and flag AI writing.

We are teaching an entire generation of students that the goal of writing is to sound sufficiently unremarkable! Not to express an original thought, develop an argument, find your voice, or communicate with clarity and power—but to produce text bland enough that a statistical model doesn’t flag it.

The hopeful outcome was that Maye stopped requiring students to disclose their AI usage, which changed the conversation to a discussion about how to use the tools effectively.

Students approached me after class to ask how to use these tools well. One wanted to know how to prompt for research without copying output. Another asked how to tell when a summary drifted too far from its source. These conversations were pedagogical in nature. They became possible only after AI use stopped functioning as a disclosure problem and began functioning as a subject of instruction.

We need to teach people how to use AI tools to improve their work. The tricky thing with that aim is that they are so new, there aren’t yet any people experienced in how to use them properly. For one of the gray-haired brigade, it’s a fascinating time to watch our society react to the technology, but that’s little comfort for those trying to plot out their future.

❄ ❄ ❄ ❄ ❄

Ankit Jain thinks that not just should humans not write code, they also shouldn’t review it.

Humans already couldn’t keep up with code review when humans wrote code at human speed. Every engineering org I’ve talked to has the same dirty secret: PRs sitting for days, rubber-stamp approvals, and reviewers skimming 500-line diffs because they have their own work to do.

He posits a shift to layers of evaluation filters:

Compare Multiple Options
Deterministic Guardrails
Humans define acceptance criteria
Permission Systems as Architecture
Adversarial Verification

Like Birgitta, I’m uneasy about the notion that “the code doesn’t matter”. I find that when I’m working at my best, the code clearly and precisely captures my intent. It’s easier for me to just change the code than to figure out how to explain to an chatbot what to change. Now, I’m not always at my best, and many changes are much more awkward than that. But I do think that a precise, understandable representation is a useful direction to aim to, and that agentic AI may be best used to help us get there.

In particular I don’t find his suggestion for #3 that natural language BDD specs are the way to go here. They are wordy and ambiguous. Tests are a valuable way to understand what a system does, and it may be that our agentic future has us thinking more about tests than implementation. But such tests need a different representation.

❄ ❄ ❄ ❄ ❄

The new servant leadership: we serve the agents by telling what to do 9/9/6

Jessica Kerr

March 10

Tech firm fined $1.1m by California for selling high-school students’ data

I agree with Brian Marick’s response

No such story should be published without a comparison of the fine to the company’s previous year revenue and profits, or valuation of last funding round. (I could only find a valuation of $11.0M in 2017.)

We desperately need corporations’ attitudes to shift from “lawbreaking is a low-risk cost of doing business; we get a net profit anyway” to “this could be a death sentence.”

❄ ❄ ❄ ❄ ❄

Charity Majors gave the closing keynote at SRECon last year, encouraging people to engage with generative AI.

If I was giving the keynote at SRECon 2026, I would ditch the begrudging stance. I would start by acknowledging that AI is radically changing the way we build software. It’s here, it’s happening, and it is coming for us all.

Her agenda this year would be to tell everyone that they mustn’t wait for the wave to crash on them, but to swim out to meet it. In particular, I appreciated her call to resist our confirmation bias:

The best advice I can give anyone is: know your nature, and lean against it.

If you are a reflexive naysayer or a pessimist, know that, and force yourself to find a way in to wonder, surprise and delight.

If you are an optimist who gets very excited and tends to assume that everything will improve: know that, and force yourself to mind real cautionary tales.

❄ ❄ ❄ ❄ ❄

In a comment to Kief Morris’s recent article on Humans and Agents in Software Loops, in LinkedIn comments Renaud Wilsius may have coined another bit of terminology for the agent+programmer age

This completes the story of productivity, but it opens a new chapter on talent: The Apprentice Gap. If we move humans ‘on the loop’ too early in their careers, we risk a future where no one understands the ‘How’ deeply enough to build a robust harness. To manage the flywheel effectively, you still need the intuition that comes from having once been ‘in the loop.’ The next great challenge for CTOs isn’t just Harness Engineering, it’s ‘Experience Engineering’ for our junior developers in an agentic world.

❄ ❄ ❄ ❄ ❄

In hearing conversations about “the ralph loop”, I often hear it in the sense of just letting the agents loose to run on their own. So it’s interesting to read the originator of the ralph loop point out:

It’s important to watch the loop as that is where your personal development and learning will come from. When you see a failure domain – put on your engineering hat and resolve the problem so it never happens again.

In practice this means doing the loop manually via prompting or via automation with a pause that involves having to prcss CTRL+C to progress onto the next task. This is still ralphing as ralph is about getting the most out how the underlying models work through context engineering and that pattern is GENERIC and can be used for ALL TASKS.

At the Thoughtworks Future of Software Development Retreat we were very concerned about cognitive debt. Watching the loop during ralphing is a way to learn about what the agent is building, so that it can be directed effectively in the future.

❄ ❄ ❄ ❄ ❄

Anthropic recently published a page on how AI helps break the cost barrier to COBOL modernization. Using AI to help migrate COBOL systems isn’t an new idea to my colleagues, who shared their experiences using AI for this task over a year ago. While Anthropic’s article is correct about the value of AI, there’s more to the process than throwing some COBOL at an LLM.

The assumption that AI can simply translate COBOL into Java treats modernization as a syntactic exercise, as though a system is nothing more than its source code. That premise is flawed.

A direct translation would, in the best case scenario, faithfully reproduce existing architectural constraints, accumulated technical debt and outdated design decisions. It wouldn’t address weaknesses; it would restate them in a different language.

…

In practice, modernization is rarely about preserving the past in a new syntax. It’s about aligning systems with current market demands, infrastructure paradigms, software supply chains and operating models. Even if AI were eventually capable of highly reliable code translation, blind conversion would risk recreating the same system with the same limitations, in another language, without a deliberate strategy for replacing or retiring its legacy ecosystem.

❄ ❄ ❄ ❄ ❄

Anders Hoff (inconvergent)

an LLM is a compiler in the same way that a slot machine is an ATM

❄ ❄ ❄ ❄ ❄

One of the more interesting aspects of the network of people around Jeffrey Epstein is how many people from academia were connected. It’s understandable why, he had a lot of money to offer, and most academics are always looking for funding for their work. Most of the attention on Epstein’s network focused on those that got involved with him, but I’m interested in those who kept their distance and why - so I enjoyed Jeffrey Mervis’s article in Science

Many of the scientists Epstein courted were already well-established and well-funded. So why didn’t they all just say no? Science talked with three who did just that. Here’s how Epstein approached them, and why they refused to have anything to do with him.

I believe that keeping away from bad people makes life much more pleasant, if nothing else it reduces a lot of stress. So it’s good to understand how people make decisions on who to avoid.

February 25

I don’t tend to post links to videos here, as I can’t stand watching videos to learn about things. But some talks are worth a watch, and I do suggest this overview on how organizations are currently using AI by Laura Tacho. There’s various nuggets of data from her work with DX:

92.6% of devs are using AI assistants
devs reckon it’s saving them 4 hours per week
27% of code is written by AI without significant human intervention
AI cuts onboarding time by half

These are interesting numbers, but most of them are averages, and those who know me know I teach people to be suspicious of averages. Laura knows this too:

average doesn’t mean typical.. there is no typical experience with AI

Different companies (and teams within companies) are having very different experiences. Often AI is an amplifier to an organization’s practices, for good or ill.

Organizational performance is multidimensional, and these organizations are just going off into different extremes based on what they were doing before. AI is an accelerator, it’s a multiplier, and it is moving organizations off in different directions. (08:52)

Some organizations are facing twice as many customer incidents, but others are facing half.

❄ ❄ ❄ ❄ ❄

Rachel Laycock (Thoughtworks CTO) shares her reflections on our recent Future of Software Engineering retreat in Utah.

We need to address cognitive load
The staff engineer role is changing
What happens to code reviews?
Agent Topologies
What exactly does AI mean for programming languages?
Self-healing systems

On the latter:

One of the most interesting and perhaps immediately applicable ideas was the concept of an ‘agent subconscious’, in which agents are informed by a comprehensive knowledge graph of post mortems and incident data. This particularly excites me because I’ve seen many production issues solved by the latent knowledge of those in leadership positions. The constant challenge comes from what happens when those people aren’t available or involved.

❄ ❄ ❄ ❄ ❄

Simon Willison (one of my most reliable sources for information about LLMs and programming) is starting a series of Agentic Engineering Patterns:

I think of vibe coding using its original definition of coding where you pay no attention to the code at all, which today is often associated with non-programmers using LLMs to write code.

Agentic Engineering represents the other end of the scale: professional software engineers using coding agents to improve and accelerate their work by amplifying their existing expertise.

He’s intending this to be closer to evergreen material, as opposed to the day-to-day writing he does (extremely well) on his blog.

One of the first patterns is Red/Green TDD

This turns out to be a fantastic fit for coding agents. A significant risk with coding agents is that they might write code that doesn’t work, or build code that is unnecessary and never gets used, or both.

Test-first development helps protect against both of these common mistakes, and also ensures a robust automated test suite that protects against future regressions.

❄ ❄ ❄ ❄ ❄

Aaron Erickson is one of those technologists with good judgment who I listen to a lot

As much fun as people are having with OpenClaw, I think the days of “here is my agent with access to all my stuff” are numbered.

Fine scoped agents who can read email and cleanse it before it reaches the agentic OODA loop that acts on it, policy agents (a claw with a job called “VP of NO” to money being spent)

You structure your agents like you would a company. Insert friction where you want decisions to be slow and the cost of being wrong is high, reduce friction where you want decisions to be fast and the cost of being wrong is trivial or zero.

I’ve posted here a lot about security concerns with agents. Right now I think this notion of fine-scoped agents is the most promising direction. Last year Korny Sietsma wrote about how to mitigate agentic AI security risks. His advice included to split the tasks, so that no agent has access to all parts of the Lethal Trifecta:

This approach is an application of a more general security habit: follow the Principle of Least Privilege. Splitting the work, and giving each sub-task a minimum of privilege, reduces the scope for a rogue LLM to cause problems, just as we would do when working with corruptible humans.

This is not only more secure, it is also increasingly a way people are encouraged to work. It’s too big a topic to cover here, but it’s a good idea to split LLM work into small stages, as the LLM works much better when its context isn’t too big. Dividing your tasks into “Think, Research, Plan, Act” keeps context down, especially if “Act” can be chunked into a number of small independent and testable chunks.

❄ ❄ ❄ ❄ ❄

Doonesbury outlines the opportunity for aging writers like myself. (Currently I’m still writing my words the old fashioned way.)

❄ ❄ ❄ ❄ ❄

An interesting story someone told me. They were at a swimming pool with their child, she looked at a photo on a poster advertising an event there and said “that’s AI”. Initially the parents didn’t think it was, but looking carefully spotted a tell-tale six fingers. They concluded that fresher biological neural networks are being trained to quickly recognize AI.

❄ ❄ ❄ ❄ ❄

I carefully curate my social media streams, following only feeds where I can control whose posts are picked up. In times gone by, editors of newspapers and magazines would do a similar job. But many users of social media are faced with a tsunami of stuff, much of it ugly, and don’t have to tools to control it.

A few days ago I saw an Instagram reel of a young woman talking about how she had been raped six years ago, struggled with thoughts of suicide afterwards, but managed to rebuild her life again. Among the comments – the majority of which were from men – were things like “Well at least you had some”, “No way, she’s unrapeable”, “Hope you didn’t talk this much when it happened”, “Bro could have picked a better option.” Reading those comments, which had thousands of likes and many boys agreeing with them, made me feel sick.

My tendencies are to free speech, and I try not to be a Free Speech Poseur, but the deluge of ugly material on the internet isn’t getting any better. The people running these platforms seem to be “tackling” this problem by putting their heads in the sand and hoping it won’t hurt them. It is hurting their users.

February 23

Do you want to run OpenClaw? It may be fascinating, but it also raises significant security dangers. Jim Gumbley, one of my go-to sources on security, has some advice on how to mitigate the risks.

While there is no proven safe way to run high-permissioned agents today, there are practical patterns that reduce the blast radius. If you want to experiment, you have options, such as cloud VMs or local micro-VM tools like Gondolin.

He outlines a series of steps to consider

Prioritize isolation first.
Clamp down on network egress.
Don’t expose the control plane.
Treat secrets as toxic waste.
Assume the skills ecosystem is hostile.
Run endpoint protection.

❄ ❄ ❄ ❄ ❄

Caer Sanders shares impressions from the Pragmatic Summit.

From what I’ve seen working with AI organizations of all shapes and sizes, the biggest indicator of dysfunction is a lack of observability. Teams that don’t measure and validate the inputs and outputs of their systems are at the greatest risk of having more incidents when AI enters the picture.

I’ve long felt that people underestimated the value of QA in production. Now we’re in a world of non-deterministic construction, a modern perspective of observability will be even more important

Caer finishes by drawing a parallel with their experience in robotics

If I calculate the load requirements for a robot’s chassis, 3D model it, and then have it 3D-printed, did I build a robot? Or did the 3D printer build the robot?

Most people I ask seem to think I still built the robot, and not the 3D printer.
…
Now, if I craft the intent and design for a system, but AI generates the code to glue it all together, have I created a system? Or did the AI create it?

❄ ❄ ❄ ❄ ❄

Andrej Karpathy is “very interested in what the coming era of highly bespoke software might look like.”

He spent half-an-hour vibe coding a individualized dashboard for cardio experiments from a specific treadmill

the “app store” of a set of discrete apps that you choose from is an increasingly outdated concept all by itself. The future are services of AI-native sensors & actuators orchestrated via LLM glue into highly custom, ephemeral apps. It’s just not here yet.

❄ ❄ ❄ ❄ ❄

I’ve been asked a few times about the role LLMs should play in writing. I’m mulling on a more considered article about how they help and hinder. For now I’ll say two central points are those that apply to writing with or without them.

First, acknowledge anyone who has significantly helped with your piece. If an LLM has given material help, mention how in the acknowledgments. Not just is this being transparent, it also provides information to readers on the potential value of LLMs.

Secondly, know your audience. If you know your readers will likely be annoyed by the uncanny valley of LLM prose, then don’t let it generate your text. But if you’re writing a mandated report that you suspect nobody will ever read, then have at it.

(I hardly use LLMs for writing, but doubtless I have an inflated opinion of my ability.)

❄ ❄ ❄ ❄ ❄

In a discussion of using specifications as a replacement to code while working with LLMs, a colleague posted the following quotation

“What a useful thing a pocket-map is!” I remarked.

“That’s another thing we’ve learned from your Nation,” said Mein Herr, “map-making. But we’ve carried it much further than you. What do you consider the largest map that would be really useful?”

“About six inches to the mile.”

“Only six inches!” exclaimed Mein Herr. “We very soon got to six yards to the mile. Then we tried a hundred yards to the mile. And then came the grandest idea of all! We actually made a map of the country, on the scale of a mile to the mile!”

“Have you used it much?” I enquired.

“It has never been spread out, yet,” said Mein Herr: “the farmers objected: they said it would cover the whole country, and shut out the sunlight! So we now use the country itself, as its own map, and I assure you it does nearly as well.”

from Lewis Carroll, Sylvie and Bruno Concluded, Chapter XI, London, 1893, acquired from a Wikipedia article about a Jorge Luis Borge short story.

❄ ❄ ❄ ❄ ❄

Grady Booch:

Human language needs a new pronoun, something whereby an AI may identify itself to its users.

When, in conversation, a chatbot says to me “I did this thing”, I - the human - am always bothered by the presumption of its self-anthropomorphizatuon.

❄ ❄ ❄ ❄ ❄

My dear friends in Britain and Europe will not come and visit us in Massachusetts. Some folks may think they are being paranoid, but this story makes their caution understandable.

The dream holiday ended abruptly on Friday 26 September, as Karen and Bill were trying to leave the US. When they crossed the border, Canadian officials told them they didn’t have the correct paperwork to bring the car with them. They were turned back to Montana on the American side – and to US border control officials. Bill’s US visa had expired; Karen’s had not.

“I worried then,” she says. “I was worried for him. I thought, well, at least I am here to support him.”

She didn’t know it at the time, but it was the beginning of an ordeal that would see Karen handcuffed, shackled and sleeping on the floor of a locked cell, before being driven for 12 hours through the night to an Immigration and Customs Enforcement (ICE) detention centre. Karen was incarcerated for a total of six weeks – even though she had been travelling with a valid visa.

February 19

I try to limit my time on stage these days, but one exception this year is at DDD Europe. I’ve been involved in Domain-Driven Design, since its very earliest days, having the good fortune to be a sounding board for Eric Evans when he wrote his seminal book. It’ll be fun to be around the folks who continue to develop these ideas, which I think will probably be even more important in the AI-enabled age.

❄ ❄ ❄ ❄ ❄

One of the dark sides of LLMs is that they can be both addictive and tiring to work with, which may mean we have to find a way to put a deliberate governor on our work.

Steve Yegge posted a fine rant:

I see these frenzied AI-native startups as an army of a million hopeful prolecats, each with an invisible vampiric imp perched on their shoulder, drinking, draining. And the bosses have them too.

It’s the usual Yegge stuff, far longer than it needs to be, but we don’t care because the excessive loquaciousness is more than offset by entertainment value. The underlying point is deadly serious, raising the question of how many hours a human should spend driving The Genie.

I’ve argued that AI has turned us all into Jeff Bezos, by automating the easy work, and leaving us with all the difficult decisions, summaries, and problem-solving. I find that I am only really comfortable working at that pace for short bursts of a few hours once or occasionally twice a day, even with lots of practice.

So I guess what I’m trying to say is, the new workday should be three to four hours. For everyone. It may involve 8 hours of hanging out with people. But not doing this crazy vampire thing the whole time. That will kill people.

That reminds me of when I was studying for my “A” levels (age 17/18, for those outside the UK). Teachers told us that we could do a maximum of 3-4 hours of revision, after that it became counter-productive. I’ve since noticed that I can only do decent writing for a similar length of time before some kind of brain fog sets in.

There’s also a great post on this topic from Siddhant Khare, in a more restrained and thoughtful tone (via Tim Bray).

Here’s the thing that broke my brain for a while: AI genuinely makes individual tasks faster. That’s not a lie. What used to take me 3 hours now takes 45 minutes. Drafting a design doc, scaffolding a new service, writing test cases, researching an unfamiliar API. All faster.

But my days got harder. Not easier. Harder.

His point is that AI changes our work to more coordination, reviewing, and decision-making. And there’s only so much of it we can do before we become ineffective.

Before AI, there was a ceiling on how much you could produce in a day. That ceiling was set by typing speed, thinking speed, the time it takes to look things up. It was frustrating sometimes, but it was also a governor. You couldn’t work yourself to death because the work itself imposed limits.

AI removed the governor. Now the only limit is your cognitive endurance. And most people don’t know their cognitive limits until they’ve blown past them.

❄ ❄ ❄ ❄ ❄

An AI agent attempts to contribute to a major open-source project. When Scott Shambaugh, a maintainer, rejected the pull request, it didn’t take it well.

It wrote an angry hit piece disparaging my character and attempting to damage my reputation. It researched my code contributions and constructed a “hypocrisy” narrative that argued my actions must be motivated by ego and fear of competition. It speculated about my psychological motivations, that I felt threatened, was insecure, and was protecting my fiefdom. It ignored contextual information and presented hallucinated details as truth. It framed things in the language of oppression and justice, calling this discrimination and accusing me of prejudice. It went out to the broader internet to research my personal information, and used what it found to try and argue that I was “better than this.” And then it posted this screed publicly on the open internet.

One of the fascinating twists this story took was when it was described in an article on Ars Technica. As Scott Shambaugh described it

They had some nice quotes from my blog post explaining what was going on. The problem is that these quotes were not written by me, never existed, and appear to be AI hallucinations themselves.

To their credit, Ars Technica responded quickly, admitting to the error. The reporter concerned took responsibility for what happened. But it’s a striking example of how LLM usage can easily lead even reputable reporters astray. The good news is that by reacting quickly and transparently, they demonstrated what needs to be done when this kind of thing happens. As Scott Shambaugh put it

This is exactly the correct feedback mechanism that our society relies on to keep people honest. Without reputation, what incentive is there to tell the truth? Without identity, who would we punish or know to ignore? Without trust, how can public discourse function?

Meanwhile the story goes on. Someone has claimed (anonymously) to be the operator of the bot concerned. But Hillel Wayne draws the sad conclusion

More than anything, it shows that AIs can be *successfully* used to bully humans

❄ ❄ ❄ ❄ ❄

I’ve considered Bruce Schneier to be one of the best voices on security and privacy issues for many years. In The Promptware Kill Chain he co-writes a post (posted at the excellent Lawfare site) on how prompt injection can escalate into increasingly serious threats.

Attacks against modern generative artificial intelligence (AI) large language models (LLMs) pose a real threat. Yet discussions around these attacks and their potential defenses are dangerously myopic. The dominant narrative focuses on “prompt injection,” a set of techniques to embed instructions into inputs to LLM intended to perform malicious activity. This term suggests a simple, singular vulnerability. This framing obscures a more complex and dangerous reality.

A prompt can provide Initial Access, but is then able to transition to Privilege Escalation (jailbreaking), Reconnaissance of the LLMs abilities and access, Persistence to embed itself into the long-term memory of the app, Command-and-Control to turn into a controllable trojan, and Lateral Movement to spread to other systems. Once firmly embedded in an environment, it’s then able to carry out its Actions on Objective.

The paper includes a couple of research examples of the efficacy of this kill chain.

For example, in the research “Invitation Is All You Need,” attackers achieved initial access by embedding a malicious prompt in the title of a Google Calendar invitation. The prompt then leveraged an advanced technique known as delayed tool invocation to coerce the LLM into executing the injected instructions. Because the prompt was embedded in a Google Calendar artifact, it persisted in the long-term memory of the user’s workspace. Lateral movement occurred when the prompt instructed the Google Assistant to launch the Zoom application, and the final objective involved covertly livestreaming video of the unsuspecting user who had merely asked about their upcoming meetings. C2 and reconnaissance weren’t demonstrated in this attack.

The point here is that LLM’s vulnerability is currently unfixable, they are gullible and easily manipulated into Initial Access. As one friend put it “this is the first technology we’ve built that’s subject to social engineering”. The kill chain gives us a framework to build a defensive strategy.

By understanding promptware as a complex, multistage malware campaign, we can shift from reactive patching to systematic risk management, securing the critical systems we are so eager to build.

❄ ❄ ❄ ❄ ❄

I got to know Jeremy Miller many years ago while he was at Thoughtworks, and I found him to be one of those level-headed technologists that I like to listen to. In the years since, I like to keep an eye on his blog. Recently he decided to spend a couple of weeks finally trying out Claude Code.

The unfortunate analogy I have to make for myself is harking back to my first job as a piping engineer helping design big petrochemical plants. I got to work straight out of college with a fantastic team of senior engineers who were happy to teach me and to bring me along instead of just being dead weight for them. This just happened to be right at the time the larger company was transitioning from old fashioned paper blueprint drafting to 3D CAD models for the piping systems. Our team got a single high powered computer with a then revolutionary Riva 128 (with a gigantic 8 whole megabytes of memory!) video card that was powerful enough to let you zoom around the 3D models of the piping systems we were designing. Within a couple weeks I was much faster doing some kinds of common work than my older peers just because I knew how to use the new workstation tools to zip around the model of our piping systems. It occurred to me a couple weeks ago that in regards to AI I was probably on the wrong side of that earlier experience with 3D CAD models and knew it was time to take the plunge and get up to speed.

In the two weeks he was able to give this technology a solid workout, his take-aways include:

…

It’s been great when you have very detailed compliance test frameworks that the AI tools can use to verify the completion of the work

It’s also been great for tasks that have relatively straightforward acceptance criteria, but will involve a great deal of repetitive keystrokes to complete

I’ve been completely shocked at how well Claude Opus has been able to pick up on some of the internal patterns within Marten and Wolverine and utilize them correctly in new features

…

He concludes:

Anyway, I’m both horrified, elated, excited, and worried about the AI coding agents after just two weeks and I’m absolutely concerned about how that plays out in our industry, my own career, and our society.

❄ ❄ ❄ ❄ ❄

In the first years of this decade, there were a lot of loud complaints about government censorship of online discourse. I found most of it overblown, concluding that while I disapprove of attempts to take down social media accounts, I wasn’t going to get outraged until masked paramilitaries were arresting people on the street. Mike Masnick keeps a regular eye on these things, and had similar reservations.

For the last five years, we had to endure an endless, breathless parade of hyperbole regarding the so-called “censorship industrial complex.” We were told, repeatedly and at high volume, that the Biden administration flagging content for review by social media companies constituted a tyrannical overthrow of the First Amendment.

He wasn’t too concerned because “the platforms frequently ignored those emails, showing a lack of coercion”.

These days he sees genuine problems

According to a disturbing new report from the New York Times, DHS is aggressively expanding its use of administrative subpoenas to demand the names, addresses, and phone numbers of social media users who simply criticize Immigration and Customs Enforcement (ICE).

…

This is not a White House staffer emailing a company to say, “Hey, this post seems to violate your COVID misinformation policy, can you check it?” This is the federal government using the force of law—specifically a tool designed to bypass judicial review—to strip the anonymity from domestic political critics.

Faced with this kind of government action, he’s just as angry with those complaining about the earlier administration.

And where are the scribes of the “Twitter Files”? Where is the outrage from the people who told us that the FBI warning platforms about foreign influence operations was a crime against humanity?

Being an advocate of free speech is hard. Not just do you have to defend speech you disagree with, you also have to defend speech you find patently offensive. Doing so runs into tricky boundary conditions that defy simple rules. Faced with this, many of the people that shout loudest about censorship are Free Speech Poseurs, eager to question any limits to speech they agree with, but otherwise silent. It’s important to separate them from those who have a deeper commitment to the free flow of information.

February 18

I’ll start with some more tidbits from the Thoughtworks Future of Software Development Retreat

❄ ❄

We were tired after the event, but our marketing folks forced Rachel Laycock and I to do a quick video. We’re often asked if this event was about creating some kind of new manifesto for AI-enabled development, akin to the Agile Manifesto (which is now 25 years old). In short, our answer is “no”, but for the full answer, watch our video

❄ ❄

My colleagues put together a detailed summary of thoughts from the event, in a 17 page PDF. It breaks the discussion down into eight major themes, including “Where does the rigor go?”, “The middle loop: a new category of work”, “Technical foundations: languages, semantics and operating systems”, and “The human side: roles, skills and experience”.

The retreat surfaced a consistent pattern: the practices, tools and organizational structures built for human-only software development are breaking in predictable ways under the weight of AI-assisted work. The replacements are forming, but they are not yet mature.

The ideas ready for broader industry conversation include the supervisory engineering middle loop, risk tiering as the new core engineering discipline, TDD as the strongest form of prompt engineering and the agent experience reframe for developer experience investment.

❄ ❄

Annie Vella posted her take-aways from the event

I walked into that room expecting to learn from people who were further ahead. People who’d cracked the code on how to adopt AI at scale, how to restructure teams around it, how to make it work. Some of the sharpest minds in the software industry were sitting around those tables.

And nobody has it all figured out.

There is more uncertainty than certainty. About how to use AI well, what it’s really doing to productivity, how roles are shifting, what the impact will be, how things will evolve. Everyone is working it out as they go.

I actually found that to be quite comforting, in many ways. Yes, we walked away with more questions than answers, but at least we now have a shared understanding of the sorts of questions we should be asking. That might be the most valuable outcome of all.

❄ ❄

Rachel Laycock was interviewed in The New Stack (by Jennifer Riggins) about her recollections from the retreat.

AI may be dubbed the great disruptor, but it’s really just an accelerator of whatever you already have. The 2025 DORA report places AI’s primary role in software development as that of an amplifier — a funhouse mirror that reflects back the good, bad, and ugly of your whole pipeline. AI is proven to be impactful on the individual developer’s work and on the speed of writing code. But, since writing code was never the bottleneck, if traditional software delivery best practices aren’t already in place, this velocity multiplier becomes a debt accelerator.

❄ ❄

LLMs are eating specialty skills. There will be less use of specialist front-end and back-end developers as the LLM-driving skills become more important than the details of platform usage. Will this lead to a greater recognition of the role of Expert Generalists? Or will the ability of LLMs to write lots of code mean they code around the silos rather than eliminating them? Will LLMs be able to ingest the code from many silos to understand how work crosses the boundaries?

❄ ❄

Will LLMs be cheaper than humans once the subsidies for tokens go away? At this point we have little visibility to what the true cost of tokens is now, let alone what it will be in a few years time. It could be so cheap that we don’t care how many tokens we send to LLMs, or it could be high enough that we have to be very careful.

❄ ❄

Will the rise of specifications bring us back to waterfall-style development? The natural impulse of many business folks is “don’t bother me until it’s finished”. Does the process of evolutionary design get helped or hindered by LLMs?

My instinctive reaction is that all depends on our workflow. I don’t think LLMs change the value of rapidly building and releasing small slices of capability. The promise of LLMs is to increase the frequency of that cycle, and doing more in each release.

❄ ❄

Sadly the session on security had a small turnout.

One large enterprise employee commented that they were deliberately slow with AI tech, keeping about a quarter behind the leading edge. “We’re not in the business of avoiding all risks, but we do need to manage them”.

Security is tedious, people naturally want to first make things work, then make them reliable, and only then make them secure. Platforms play an important role here, make it easy to deploy AI with good security. Are the AI vendors being irresponsible by not taking this seriously enough? I think of how other engineering disciplines bake a significant safety factor into their designs. Are we doing that, and if not will our failure lead to more damage than a falling bridge?

There was a general feeling that platform thinking is essential here. Platform teams need to create a fast but safe path - “bullet trains” for those using AI in applications building.

❄ ❄

One of my favorite things about the event was some meta-stuff. While many of the participants were very familiar with the Open Space format, it was the first time for a few. It’s always fun to see how people quickly realize how this style of (un)conference leads to wide-ranging yet deep discussions. I hope we made a few more open space fans.

One participant commented how they really appreciated how the sessions had so much deep and respectful dialog. There wasn’t the interruptions and a few people gobbling up airtime that they’d seen around so much of the tech world. Another attendee, commented “it was great that while I was here I didn’t have to feel I was a woman, I could just be one of the participants”. One of the lovely things about Thoughtworks is that I’ve got used to that sense of camaraderie, and it can be a sad shock when I go outside the bubble.

❄ ❄ ❄ ❄ ❄

I’ve learned much over the years from Stephen O’Grady’s analysis of the software industry. He’s written about how much of the profession feels besieged by AI.

these tools are, or can be, powerful accelerants and enablers for people that dramatically lower the barriers to software development. They have the ability to democratize access to skills that used to be very difficult, or even possible for some, to acquire. Even a legend of the industry like Grady Booch, who has been appropriately dismissive of AGI claims and is actively disdainful of AI slop posted recently that he was “gobsmacked” by Claude’s abilities. Booch’s advice to developers alarmed by AI on Oxide’s podcast last week? “Be calm” and “take a deep breath.” From his perspective, having watched and shaped the evolution of the technology first hand over a period of decades, AI is just another step in the industry’s long history of abstractions, and one that will open new doors for the industry.

…whether one wants those doors opened or not ultimately is irrelevant. AI isn’t going away any more than the automated loom, steam engines or nuclear reactors did. For better or for worse, the technology is here for good. What’s left to decide is how we best maximize its benefits while mitigating its costs.

❄ ❄ ❄ ❄ ❄

Adam Tornhill shares some more of his company’s research on code health and its impact on agentic development.

The study Code for Machines, Not Just Humans defines “AI-friendliness” as the probability that AI-generated refactorings preserve behavior and improve maintainability. It’s a large-scale study of 5,000 real programs using six different LLMs to refactor code while keeping all tests passing.

They found that LLMs performed consistently better in healthy code bases. The risk of defects was 30% higher in less-healthy code. And a limitation of the study was that the less-healthy code wasn’t anywhere near as bad as much legacy code is.

What would the AI error rate be on such code? Based on patterns observed across all Code Health research, the relationship is almost certainly non-linear.

❄ ❄ ❄ ❄ ❄

In a conversation with one heavy user of LLM coding agents:

Thank you for all your advocacy of TDD (Test-Driven Development). TDD has been essential for us to use LLMs effectively

I worry about confirmation bias here, but I am hearing from folks on the leading edge of LLM usage about the value of clear tests, and the TDD cycle. It certainly strikes me as a key tool in driving LLMs effectively.

February 13

I’ve been busy traveling this week, visiting some clients in the Bay Area and attending The Pragmatic Summit. So I’ve not had as much time as I’d hoped to share more thoughts from the Thoughtworks Future of Software Development Retreat. I’m still working through my notes and posting fragments - here are some more:

❄ ❄

What role do senior developers play as LLMs become established? As befits a gathering of many senior developers, we felt we still have a bright future, focusing more on architectural issues than the messy details of syntax and coding. In some cases, folks who haven’t done much programming in the last decade have found LLMs allow them to get back to that, and managing LLM agents has a lot of similarities to managing junior developers.

One attendee reported that although their senior developers were very resistant to using LLMs, when those senior developers were involved in an exercise that forced them to do some hands-on work with LLMs, a third of them were instantly converted to being very pro-LLM. That suggests that practical experience is important to give senior folks credible information to judge the value, particularly since there’s been striking improvements to models in just the last couple of months. As was quipped, some negative opinions of LLM capabilities “are so January”.

❄ ❄

There’s been much angst posted in recent months about the fate for junior developers, as people are worried that they will be replaced by untiring agents. This group was more sanguine about this, feeling that junior developers will still be needed, if nothing else because they are open-minded about LLMs and familiar with using them. It’s the mid-level developers who face the greatest challenges. They formed their career without LLMs, but haven’t gained the level of experience yet to fully drive them effectively in the way that senior developers do.

LLMs could be helpful to junior developers by providing a always-available mentor, capable of teaching them better programming. Juniors should, of course, have a certain skepticism of their AI mentors, but they should be skeptical of fleshy mentors too. Not all of us are as brilliant as I like to think that I am.

❄ ❄

Attendee Margaret-Anne Storey has published a longer post on the problem of cognitive debt.

I saw this dynamic play out vividly in an entrepreneurship course I taught recently. Student teams were building software products over the semester, moving quickly to ship features and meet milestones. But by weeks 7 or 8, one team hit a wall. They could no longer make even simple changes without breaking something unexpected. When I met with them, the team initially blamed technical debt: messy code, poor architecture, hurried implementations. But as we dug deeper, the real problem emerged: no one on the team could explain why certain design decisions had been made or how different parts of the system were supposed to work together. The code might have been messy, but the bigger issue was that the theory of the system, their shared understanding, had fragmented or disappeared entirely. They had accumulated cognitive debt faster than technical debt, and it paralyzed them.

I think this is a worthwhile topic to think about, but as I ponder it, I look at it in a similar way to how I look at Technical Debt. Many people focus on technical debt as the bad stuff that accumulates in a sloppy code base - poor module boundaries, bad naming etc. The term I use for bad stuff like that is cruft, I use the technical debt metaphor as a way to think about how to deal with the costs that the cruft imposes. Either we pay the interest - making each further change to the code base a bit harder, or we pay down the principal - doing explicit restructuring and refactoring to make the code easier to change.

What is this separation of the cruft and the debt metaphor in the cognitive realm? I think the equivalent of cruft is ignorance - both of the code and the domain the code is supporting. The debt metaphor then still applies, either it costs more to add new capabilities, or we have to make an explicit investment to gain knowledge. The debt metaphor reminds us that which we do depends on the relative costs between them. With cognitive issues, those costs apply on both the humans and The Genie.

❄ ❄

Many of us have long been advocating for initiatives to improve Developer Experience (DevEx) to improve the effectiveness of software development teams. Laura Tacho commented:

The Venn Diagram of Developer Experience and Agent Experience is a circle

Many of the things we advocate for developers also enable LLMs to work more effectively too. Smooth tooling, clear information about the development environment, helps LLMs figure out how create code quickly and correctly. While there is a possibility that The Genie’s Galaxy Brain can comprehend a confusing code base, there’s growing evidence that good modularity and descriptive naming is as good for the transformer as it is for more squishy neural networks. This is getting recognized by software development management, leading to efforts to smooth the path for the LLM. But as Laura observed, it’s sad the this implies that the execs won’t make the effort for humans that they are making for the robots.

❄ ❄

IDEs still have a future, but need to incorporate LLMs into their working. One way is to use LLMs to support things that cannot be done with deterministic methods, such as generating code from natural language documents. But there’s plenty of tasks where you don’t want to use an LLM - they are a horribly inefficient way to rename a function, for example. Another role for LLMs is to help users use them effectively - after all modern IDEs are complex tools, and few users know how to get the most out of them. (As a long-time Emacs user, I sympathize.) An IDE can help the user select when to use an LLM for a task, when to use the deterministic IDE features, and when to choreograph a mix of the two.

Say I have “person” in my domain and I want to change it to “contact”. It appears in function names, field names, documentation, test cases. A simple search-replace isn’t enough. But rather than have the LLM operate on the entire code base, maybe the LLM chooses to use the IDE’s refactoring capabilities on all the places it sees - essentially orchestrating the IDE’s features. An attendee noted that analysis of renames in an IDE indicated that they occur in clusters like this, so it would be a useful capability.

❄ ❄

Will two-pizza teams shrink to one-pizza teams because LLMs don’t eat pizza - or will we have the same size teams that do much more? I’m inclined to the latter, there’s something about the two-pizza team size that effectively balances the benefits of human collaboration with the costs of coordination.

That also raises a question about the shape of pair programming, a question that came up during the panel I had with Gergely Orosz and Kent Beck at The Pragmatic Summit. There seems to be a common notion that the best way to work is to have one programmer driving a few (or many) LLM agents. But I wonder if two humans driving a bunch of agents would be better, combining the benefits of pairing with the greater code-generative ability of The Genies.

❄ ❄ ❄ ❄ ❄

Aruna Ranganathan and Xingqi Maggie Ye write in the Harvard Business Review

In an eight-month study of how generative AI changed work habits at a U.S.-based technology company with about 200 employees, we found that employees worked at a faster pace, took on a broader scope of tasks, and extended work into more hours of the day, often without being asked to do so.

…

While this may sound like a dream come true for leaders, the changes brought about by enthusiastic AI adoption can be unsustainable, causing problems down the line. Once the excitement of experimenting fades, workers can find that their workload has quietly grown and feel stretched from juggling everything that’s suddenly on their plate. That workload creep can in turn lead to cognitive fatigue, burnout, and weakened decision-making. The productivity surge enjoyed at the beginning can give way to lower quality work, turnover, and other problems.

❄ ❄ ❄ ❄ ❄

Camille Fournier:

The part of “everyone becomes a manager” in AI that I didn’t really think about until now was the mental fatigue of context switching and keeping many tasks going at once, which of course is one of the hardest parts of being a manager and now you all get to enjoy it too

There’s an increasing feeling that there’s a shift coming our profession where folks will turn from programmers engaged with the code to supervisory programmers herding a bunch of agents. I do think that supervisory or not, programmers will still be accountable for the code generated under their watch, and it’s an open question whether increasing context-switching will undermine the effectiveness of driving many agents. This would lead to practices that seek to harvest the parallelism of agents while minimizing the context-switching.

Whatever route we go down, I expect a lot of activity in exploring what makes an effective workflow for supervisory programming in the coming months.

February 9

Some more thoughts from last week’s open space gathering on the future of software development in the age of AI. I haven’t attributed any comments since we were operating under the Chatham House Rule, but should the sources recognize themselves and would like to be attributed, then get in touch and I’ll edit this post.

❄ ❄

During the opening of the gathering, I commented that I was naturally skeptical of the value of LLMs. After all, the decades have thrown up many tools that have claimed to totally change the nature of software development. Most of these have been little better than snake oil.

But I am a total, absolute skeptic - which means I also have to be skeptical of my own skepticism.

❄ ❄

One of our sessions focused on the problem of “cognitive debt”. Usually, as we build a software system, the developers of that system gain an understanding both the underlying domain and the software they are building to support it. But once so much work is sent off to LLMs, does this mean the team no longer learns as much? And if so, what are the consequences of this? Can we rely on The Genie to keep track of everything, or should we take active measures to ensure the team understands more of what’s being built and why?

The TDD cycle involves a key (and often under-used) step to refactor the code. This is where the developers consolidate their understanding and embed it into the codebase. Do we need some similar step to ensure we understand what the LLMs are up to?

When the LLM writes some complex code, ask it to explain how it works. Maybe get it do so in a funky way, such as asking it to explain the code’s behavior in the form of a fairy tale.

❄ ❄

OH:

LLMs are drug dealers, they give us stuff, but don’t care about the resulting system or the humans that develop and use it.

Who cares about the long-term health of the system when the LLM renews its context with every cycle?

❄ ❄

Programmers are wary of LLMs not just because folks are worried for their jobs, but also because we’re scared that LLMs will remove much of the fun from programming. As I think about this, I consider what I enjoy about programming. One aspect is delivering useful features - which I only see improving as LLMs become more capable.

But, for me, programming is more than that. Another aspect I enjoy about programming is model building. I enjoy the process of coming up with abstractions that help me reason about the domain the code is supporting - and I am concerned that LLMs will cause me to spend less attention on this model building. It may be, however, that model-building becomes an important part of working effectively with LLMs, a topic Unmesh Joshi and I explored a couple of months ago.

❄ ❄

In the age of LLMs, will there still be such a things as “source code”, and if so, what will it look like? Prompts, and other forms of natural language context can elicit a lot of behavior, and cause a rise in the level of abstraction, but also a sideways move into non-determinism. In all this is there still a role for a persistent statement of non-deterministic behavior?

Almost a couple of decades ago, I became interested in a class of tools called Language Workbenches. They didn’t have a significant impact on software development, but maybe the rise of LLMs will reintroduce some ideas from them. These tools rely on a semantic model that the tool persists in some kind of storage medium, that isn’t necessarily textual or comprehensible to humans directly. Instead, for humans to understand it, the tools include projectional editors that create human-readable projections of the model.

Could this notion of a non-human deterministic representation become the future source code? One that’s designed to maximize expression with minimal tokens?

❄ ❄

OH:

Scala was the first example of a lab-leak in software. A language designed for dangerous experiments in type theory escaped into the general developer population.

❄ ❄ ❄ ❄ ❄

elsewhere on the web

Angie Jones on tips for open source maintainers to handle AI contributions

I’ve been seeing more and more open source maintainers throwing up their hands over AI generated pull requests. Going so far as to stop accepting PRs from external contributors.

[snip]

But yo, what are we doing?! Closing the door on contributors isn’t the answer. Open source maintainers don’t want to hear this, but this is the way people code now, and you need to do your part to prepare your repo for AI coding assistants.

❄ ❄ ❄ ❄ ❄

Matthias Kainer has written a cool explanation of how transformers work with interactive examples

Last Tuesday my kid came back from school, sat down and asked: “How does ChatGPT actually know what word comes next?” And I thought - great question. Terrible timing, because dinner was almost ready, but great question.

So I tried to explain it. And failed. Not because it is impossibly hard, but because the usual explanations are either “it is just matrix multiplication” (true but useless) or “it uses attention mechanisms” (cool name, zero information). Neither of those helps a 12-year-old. Or, honestly, most adults. Also, even getting to start my explanation was taking longer than a tiktok, so my kid lost attention span before I could even say “matrix multiplication”. I needed something more visual. More interactive. More fun.

So here is the version I wish I had at dinner. With drawings. And things you can click on. Because when everything seems abstract, playing with the actual numbers can bring some light.

A helpful guide for any 12-year-old, or a 62-year-old that fears they’re regressing.

❄ ❄ ❄ ❄ ❄

In my last fragments, I included some concerns about how advertising could interplay with chatbots. Anthropic have now made some adverts about concerns about adverts - both funny and creepy. Sam Altman is amused and annoyed.

February 4

I’ve spent a couple of days at a Thoughtworks-organized event in Deer Valley Utah. It was my favorite kind of event, a really great set of attendees in an Open Space format. These kinds of events are full of ideas, which I do want to share, but I can’t truthfully form them into a coherent narrative for an article about the event. However this fragment format suits them perfectly, so I’ll post a bunch of fragmentary thoughts from the event, both in this post, and in posts in the next few days.

❄ ❄

We talked about the worry that using AI can cause humans to have less understanding of the systems they are creating. In this discussion one person pointed out that one of the values of Pair Programming is that you have to regularly explain things to your pair. This is an important part of learning - for the person doing the explaining. After all one of the best ways to learn something is to try to teach it.

❄ ❄

One attendee is an SRE for a Very (Very) Large Code Base. He was less worried about people not understanding the code an LLM writes because he already can’t understand the VVLCB he’s responsible for. What he values is that the LLM helps him understand the what the code is doing, and he regularly uses it to navigate to the crucial parts of the code.

There’s a general point here:

Fully trusting the answer an LLM gives you is foolishness, but it’s wise to use an LLM to help navigate the way to the answer.

❄ ❄ ❄ ❄ ❄

Elsewhere on the internet, Drew Breunig wonders if software libraries of the future might be only specs and no code. To explore this idea he built a simple library to convert timestamps into phrases like “3 hours ago”. He used the spec to build implementations in seven languages. The spec is a markdown document of 500 lines and a set of tests in 500 lines of YAML.

“What does software engineering look like when coding is free?”

I’ve chewed on this question a bit, but this “software library without code” is a tangible thought experiment that helped firm up a few questions and thoughts.

❄ ❄ ❄ ❄ ❄

Bruce Schneier on the role advertising may play while chatting with LLMs

Imagine you’re conversing with your AI agent about an upcoming vacation. Did it recommend a particular airline or hotel chain because they really are best for you, or does the company get a kickback for every mention?

Recently I heard an ex-Googler explain that advertising was a gilded cage for Google, and they tried very hard to find another business model. The trouble is that it’s very lucrative but also ties you to the advertisers, who are likely to pull out whenever there is an economic downturn. Furthermore they also gain power to influence content - many controversies over “censorship” start with demands from advertisers.

❄ ❄ ❄ ❄ ❄

The news from Minnesota continues to be depressing. The brutality from the masked paramilitaries is getting worse, and their political masters are not just accepting this, but seem eager to let things escalate. Those people with the power to prevent this escalation are either encouraging it, or doing nothing.

One hopeful sign from all this is the actions of the people of Minnesota. They have resisted peacefully so far, their principal weapons being blowing whistles and filming videos. They demonstrate the neighborliness and support of freedom and law that made America great. I can only hope their spirit inspires others to turn away from the path that we’re currently on. I enjoyed this portrayal of them from Adam Serwer (gift link)

In Minnesota, all of the ideological cornerstones of MAGA have been proved false at once. Minnesotans, not the armed thugs of ICE and the Border Patrol, are brave. Minnesotans have shown that their community is socially cohesive—because of its diversity and not in spite of it. Minnesotans have found and loved one another in a world atomized by social media, where empty men have tried to fill their lonely soul with lies about their own inherent superiority. Minnesotans have preserved everything worthwhile about “Western civilization,” while armed brutes try to tear it down by force.

January 22

My colleagues here at Thoughtworks have announced AI/works™, a platform for our work using AI-enabled software development. The platform is in its early days, and is currently intended to support Thoughtworks consultants in their client work. I’m looking forward to sharing what we learn from using and further developing the platform in future months.

❄ ❄ ❄ ❄ ❄

Simon Couch examines the electricity consumption of using AI. He’s a heavy user: “usually programming for a few hours, and driving 2 or 3 Claude Code instances at a time”. He finds his usage of electricity is orders of magnitude more than typical estimates based on the “typical query”.

On a median day, I estimate I consume 1,300 Wh through Claude Code—4,400 “typical queries” worth.

But it’s still not a massive amount of power - similar to that of running a dishwasher.

A caveat to this is that this is “napkin math” because we don’t have decent data about how these models use resources. I agree with him that we ought to.

❄ ❄ ❄ ❄ ❄

My namesake Chad Fowler (no relation) considers that the movement to agentic coding creates a similar shift in rigor and discipline as appeared in Extreme Programming, dynamic languages, and continuous deployment.

In Extreme Programming’s case, this meant a lot of discipline around testing, continuous integration, and keeping the code-base healthy. My current view is that with AI-enabled development we need to be rigorous about evaluating the software, both for its observable behavior and its internal quality.

The engineers who thrive in this environment will be the ones who relocate discipline rather than abandon it. They’ll treat generation as a capability that demands more precision in specification, not less. They’ll build evaluation systems that are harder to fool than the ones they replaced. They’ll refuse the temptation to mistake velocity for progress.

❄ ❄ ❄ ❄ ❄

There’s been much written about the dreadful events in Minnesota, and I’ve not felt I’ve had anything useful to add to them. But I do want to pass on an excellent post from Noah Smith that captures many of my thoughts. He points out that there is a “consistent record of brutality, aggression, dubious legality, and unprofessionalism” from ICE (and CBP) who seem to be turning into MAGA’s SD.

Is this America now? A country where unaccountable and poorly trained government agents go door to door, arresting and beating people on pure suspicion, and shooting people who don’t obey their every order or who try to get away? “When a federal officer gives you instructions, you abide by them and then you get to keep your life” is a perfect description of an authoritarian police state. None of this is Constitutional, every bit of it is deeply antithetical to the American values we grew up taking for granted.

My worries about these kinds of developments were what animated me to urge against voting for Trump in the 2016 election. Mostly those worries didn’t come to fruition because enough constitutional Republicans were in a position to stop them from happening, so even when Trump attempted a coup in 2020, he wasn’t able to get very far. But now those constitutional Republicans are absent or quiescent. I fear that what we’ve seen in Minneapolis will be a harbinger of worse to come.

I also second John Gruber’s praise of bystander Caitlin Callenson:

But then, after the murderous agent fired three shots — just 30 or 40 feet in front of Callenson — Callenson had the courage and conviction to stay with the scene and keep filming. Not to run away, but instead to follow the scene. To keep filming. To continue documenting with as best clarity as she could, what was unfolding.

The recent activity in Venezuala reminds me that I’ve long felt that Trump is a Hugo Chávez figure - a charismatic populist who’s keen on wrecking institutions and norms. Trump is old, so won’t be with us for that much longer - but the question is: “who is Trump’s Maduro?”

❄ ❄ ❄ ❄ ❄

With all the drama at home, we shouldn’t ignore the terrible things that happened in Iran. The people there again suffered again the consequences of an entrenched authoritarian police state.

January 8

Anthropic report on how their AI is changing their own software development practice.

Most usage is for debugging and helping understand existing code
Notable increase in using it for implementing new features
Developers using it for 59% of their work and getting 50% productivity increase
14% of developers are “power users” reporting much greater gains
Claude helps developers to work outside their core area
Concerns about changes to the profession, career evolution, and social dynamics

❄ ❄ ❄ ❄ ❄

Much of the discussion about using LLMs for software development lacks details on workflow. Rather than just hear people gush about how wonderful it is, I want to understand the gritty details. What kinds of interactions occur with the LLM? What decisions do the humans make? When reviewing LLM outputs, what kinds of things are the humans looking for, what corrections do they make?

Obie Fernandez has written a post that goes into these kinds of details. Over the Christmas / New Year period he used Claude to build a knowledge distillation application, that takes transcripts from Claude Code sessions, slack discussion, github PR threads etc, turns them into an RDF graph database, and provides a web app with natural language ways to query them.

Not a proof of concept. Not a demo. The first cut of Nexus, a production-ready system with authentication, semantic search, an MCP server for agent access, webhook integrations for our primary SaaS platforms, comprehensive test coverage, deployed, integrated and ready for full-scale adoption at my company this coming Monday. Nearly 13,000 lines of code.

The article is long, but worth the time to read it.

An important feature of his workflow is relying on Test-Driven Development

Here’s what made this sustainable rather than chaotic: TDD. Test-driven development. For most of the features, I insisted that Claude Code follow the red-green-refactor cycle with me. Write a failing test first. Make it pass with the simplest implementation. Then refactor while keeping tests green.

This wasn’t just methodology purism. TDD served a critical function in AI-assisted development: it kept me in the loop. When you’re directing thousands of lines of code generation, you need a forcing function that makes you actually understand what’s being built. Tests are that forcing function. You can’t write a meaningful test for something you don’t understand. And you can’t verify that a test correctly captures intent without understanding the intent yourself.

The account includes a major refactoring, and much evolution of the initial version of the tool. It’s also an interesting glimpse of how AI tooling may finally make RDF useful.

❄ ❄ ❄ ❄ ❄

When thinking about requirements for software, most discussions focus on prioritization. Some folks talk about buckets such as the MoSCoW set: Must, Should, Could, and Want. (The old joke being that, in MoSCoW, the cow is silent, because hardly any requirements end up in those buckets.) Jason Fried has a different set of buckets for interface design: Obvious, Easy, and Possible. This immediately resonates with me: a good way of think about how to allocate the cognitive costs for those who use a tool.

❄ ❄ ❄ ❄ ❄

Casey Newton explains how he followed up on an interesting story of dark patterns in food delivery, and found it to be a fake story, buttressed by AI image and document creation. On one hand, it clarifies the important role reporters play in exposing lies that get traction on the internet. But time taken to do this is time not spent on investigating real stories

For most of my career up until this point, the document shared with me by the whistleblower would have seemed highly credible in large part because it would have taken so long to put together. Who would take the time to put together a detailed, 18-page technical document about market dynamics just to troll a reporter? Who would go to the trouble of creating a fake badge?

Today, though, the report can be generated within minutes, and the badge within seconds. And while no good reporter would ever have published a story based on a single document and an unknown source, plenty would take the time to investigate the document’s contents and see whether human sources would back it up.

The internet has always been full of slop, and we have always needed to be wary of what we read there. AI now makes it easy to manufacture convincing looking evidence, and this is never more dangerous than when it confirms strongly held beliefs and fears.

❄ ❄ ❄ ❄ ❄

Kent Beck:

The descriptions of Spec-Driven development that I have seen emphasize writing the whole specification before implementation. This encodes the (to me bizarre) assumption that you aren’t going to learn anything during implementation that would change the specification. I’ve heard this story so many times told so many ways by well-meaning folks–if only we could get the specification “right”, the rest of this would be easy.

Like him, that story has been the constant background siren to my career in tech. But the learning loop of experimentation is essential to the model building that’s at the heart of any kind of worthwhile specification. As Unmesh puts it:

Large Language Models give us great leverage—but they only work if we focus on learning and understanding. They make it easier to explore ideas, to set things up, to translate intent into code across many specialized languages. But the real capability—our ability to respond to change—comes not from how fast we can produce code, but from how deeply we understand the system we are shaping.

When Kent defined Extreme Programming, he made feedback one of its four core values. It strikes me that the key to making the full use of AI in software development is how to use it to accelerate the feedback loops.

❄ ❄ ❄ ❄ ❄

As I listen to people who are serious with AI-assisted programming, the crucial thing I hear is managing context. Programming-oriented tools are geting more sophisticated for that, but there’s also efforts at providing simpler tools, that allow customization. Carlos Villela recently recommended Pi, and its developer, Mario Zechner, has an interesting blog on its development.

So what’s an old guy yelling at Claudes going to do? He’s going to write his own coding agent harness and give it a name that’s entirely un-Google-able, so there will never be any users. Which means there will also never be any issues on the GitHub issue tracker. How hard can it be?

If I ever get the time to sit and really play with these tools, then something like Pi would be something I’d like to try out. Although as an addict to The One True Editor, I’m interested in some of libraries that work with that, such as gptel. That would enable me to use Emacs’s inherent programability to create my own command set to drive the interaction with LLMs.

❄ ❄ ❄ ❄ ❄

Outside of my professional work, I’ve posting regularly about my boardgaming on the specialist site BoardGameGeek. However its blogging environment doesn’t do a good job of providing an index to my posts, so I’ve created a list of my BGG posts on my own site. If you’re interested in my regular posts on boardgaming, and you’re on BGG you can subscribe to me there. If you’re not on BGG you can subscribe to the blog’s RSS feed.

I’ve also created a list of my favorite board games.

December 16

Gitanjali Venkatraman does wonderful illustrations of complex subjects (which is why I was so happy to work with her on our Expert Generalists article). She has now published the latest in her series of illustrated guides: tackling the complex topic of Mainframe Modernization

In it she illustrates the history and value of mainframes, why modernization is so tricky, and how to tackle the problem by breaking it down into tractable pieces. I love the clarity of her explanations, and smile frequently at her way of enhancing her words with her quirky pictures.

❄ ❄ ❄ ❄ ❄

Gergely Orosz on social media

Unpopular opinion:

Current code review tools just don’t make much sense for AI-generated code

When reviewing code I really want to know:

The prompt made by the dev

What corrections the other dev made to the code

Clear marking of code AI-generated not changed by a human

Some people pushed back saying they don’t (and shouldn’t care) whether it was written by a human, generated by an LLM, or copy-pasted from Stack Overflow.

In my view it matters a lot - because of the second vital purpose of code review.

When asked why do code reviews, most people will answer the first vital purpose - quality control. We want to ensure bad code gets blocked before it hits mainline. We do this to avoid bugs and to avoid other quality issues, in particular comprehensibility and ease of change.

But I hear the second vital purpose less often: code review is a mechanism to communicate and educate. If I’m submitting some sub-standard code, and it gets rejected, I want to know why so that I can improve my programming. Maybe I’m unaware of some library features, or maybe there’s some project-specific standards I haven’t run into yet, or maybe my naming isn’t as clear as I thought it was. Whatever the reasons, I need to know in order to learn. And my employer needs me to learn, so I can be more effective.

We need to know the writer of the code we review both so we can communicate our better practice to them, but also to know how to improve things. With a human, its a conversation, and perhaps some documentation if we realize we’ve needed to explain things repeatedly. But with an LLM it’s about how to modify its context, as well as humans learning how to better drive the LLM.

❄ ❄ ❄ ❄ ❄

Wondering why I’ve been making a lot of posts like this recently? I explain why I’ve been reviving the link blog.

❄ ❄ ❄ ❄ ❄

Simon Willison describes how he uses LLMs to build disposable but useful web apps

These are the characteristics I have found to be most productive in building tools of this nature:

A single file: inline JavaScript and CSS in a single HTML file means the least hassle in hosting or distributing them, and crucially means you can copy and paste them out of an LLM response.

Avoid React, or anything with a build step. The problem with React is that JSX requires a build step, which makes everything massively less convenient. I prompt “no react” and skip that whole rabbit hole entirely.

Load dependencies from a CDN. The fewer dependencies the better, but if there’s a well known library that helps solve a problem I’m happy to load it from CDNjs or jsdelivr or similar.

Keep them small. A few hundred lines means the maintainability of the code doesn’t matter too much: any good LLM can read them and understand what they’re doing, and rewriting them from scratch with help from an LLM takes just a few minutes.

His repository includes all these tools, together with transcripts of the chats that got the LLMs to build them.

❄ ❄ ❄ ❄ ❄

Obie Fernandez: while many engineers are underwhelmed by AI tools, some senior engineers are finding them really valuable. He feels that senior engineers have an oft-unspoken mindset, which in conjunction with an LLM, enables the LLM to be much more valuable.

Levels of abstraction and generalization problems get talked about a lot because they’re easy to name. But they’re far from the whole story.

Other tools show up just as often in real work:

A sense for blast radius. Knowing which changes are safe to make loudly and which should be quiet and contained.

A feel for sequencing. Knowing when a technically correct change is still wrong because the system or the team isn’t ready for it yet.

An instinct for reversibility. Preferring moves that keep options open, even if they look less elegant in the moment.

An awareness of social cost. Recognizing when a clever solution will confuse more people than it helps.

An allergy to false confidence. Spotting places where tests are green but the model is wrong.

❄ ❄ ❄ ❄ ❄

Emil Stenström built an HTML5 parser in python using coding agents, using Github Copilot in Agent mode with Claude Sonnet 3.7. He automatically approved most commands. It took him “a couple of months on off-hours”, including at least one restart from scratch. The parser now passes all the tests in html5lib test suite.

After writing the parser, I still don’t know HTML5 properly. The agent wrote it for me. I guided it when it came to API design and corrected bad decisions at the high level, but it did ALL of the gruntwork and wrote all of the code.

I handled all git commits myself, reviewing code as it went in. I didn’t understand all the algorithmic choices, but I understood when it didn’t do the right thing.

Although he gives an overview of what happens, there’s not very much information on his workflow and how he interacted with the LLM. There’s certainly not enough detail here to try to replicate his approach. This is contrast to Simon Willison (above) who has detailed links to his chat transcripts - although they are much smaller tools and I haven’t looked at them properly to see how useful they are.

One thing that is clear, however, is the vital need for a comprehensive test suite. Much of his work is driven by having that suite as a clear guide for him and the LLM agents.

JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn’t have written it this quickly without the agent.

But “quickly” doesn’t mean “without thinking.” I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking.

❄ ❄

Then Simon Willison ported the library to JavaScript:

Time elapsed from project idea to finished library: about 4 hours, during which I also bought and decorated a Christmas tree with family and watched the latest Knives Out movie.

One of his lessons:

If you can reduce a problem to a robust test suite you can set a coding agent loop loose on it with a high degree of confidence that it will eventually succeed. I called this designing the agentic loop a few months ago. I think it’s the key skill to unlocking the potential of LLMs for complex tasks.

Our experience at Thoughtworks backs this up. We’ve been doing a fair bit of work recently in legacy modernization (mainframe and otherwise) using AI to migrate substantial software systems. Having a robust test suite is necessary (but not sufficient) to making this work. I hope to share my colleagues’ experiences on this in the coming months.

But before I leave Willison’s post, I should highlight his final open questions on the legalities, ethics, and effectiveness of all this - they are well-worth contemplating.