How to tackle unreliability of coding assistants

This article is part of “Exploring Gen AI”, a series capturing our explorations of using generative AI technology for software development.

29 Nov 2023

One of the trade-offs to the usefulness of coding assistants is their unreliability. The underlying models are quite generic and based on a huge amount of training data, both relevant and irrelevant to the task at hand. Also, Large Language Models make things up; they “hallucinate”, as it’s commonly called. (Side note: There is a lot of discourse about the term “hallucination”: about how it is not actually the right psychology metaphor to describe this, but also about using psychology terms in the first place, as they anthropomorphize the models.)

That unreliability creates two main risks: it can negatively affect the quality of my code, and it can waste my time. Given those risks, it is crucial that I can quickly and effectively assess my confidence in the coding assistant’s input.

How I determine my confidence in the assistant’s input

The following are some of the questions that typically go through my head when I try to gauge the reliability and risk of using a suggestion. This applies to “auto complete” suggestions while typing code as well as to answers from the chat.

Do I have a quick feedback loop?

The quicker I can find out if the answer or the generated information works, the lower the risk that the assistant is wasting my time.
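As a concrete illustration, here is a minimal sketch of what a quick feedback loop can look like: an AI-suggested helper paired with a small test I can run in seconds. The function name and its behaviour are made up for this example.

```python
# Hypothetical illustration: an AI-suggested helper plus a quick check.
# parse_iso_date and its expected behaviour are invented for this sketch.
from datetime import date


def parse_iso_date(value: str) -> date:
    """The kind of small helper a coding assistant might suggest."""
    year, month, day = (int(part) for part in value.split("-"))
    return date(year, month, day)


def test_parse_iso_date() -> None:
    # Takes seconds to run and gives an immediate signal whether the suggestion works
    assert parse_iso_date("2023-11-29") == date(2023, 11, 29)


if __name__ == "__main__":
    test_parse_iso_date()
    print("quick check passed")
```

If verifying a suggestion is this cheap, accepting it is low risk even when it turns out to be wrong.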

Do I have a reliable feedback loop?

As well as the speed of the feedback loop for the AI input, I also reflect on the reliability of that feedback loop: can I actually trust the signal it gives me, or do I have to interpret the results myself?
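To make that difference tangible, here is a hypothetical sketch of two feedback loops for the same AI-suggested function: one where I eyeball printed output, and one where I assert against values I worked out independently of the assistant. The function and the numbers are invented for illustration.

```python
# Hypothetical illustration: the same AI-suggested function, checked two ways.

def apply_discount(price: float, percent: float) -> float:
    """The kind of function a coding assistant might suggest."""
    return round(price * (1 - percent / 100), 2)


# Less reliable loop: I have to eyeball the output, and a subtly wrong
# number is easy to glance over.
print(apply_discount(19.99, 15))

# More reliable loop: the expected values were worked out independently
# of the assistant, so a wrong suggestion fails loudly.
assert apply_discount(19.99, 15) == 16.99
assert apply_discount(100.0, 0) == 100.0
print("checks passed")
```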

What is the margin of error?

I also reflect on the margin of error for what I’m doing. The lower the margin of error (for example, when I’m touching production code rather than a throwaway spike), the more critical I will be of the AI input.

Do I need very recent information?

The more recent and the more specific (e.g. to a particular version of a framework) I need the answer to be, the higher the risk that it is wrong, because it is more likely that the information I’m looking for is not available to the AI, or not distinguishable from similar but outdated information. For this assessment it’s also good to know if the AI tool at hand has access to more information than just the training data. If I’m using a chat, I want to be aware if it has the ability to take online searches into account, or if it is limited to the training data.

Give the assistant a timebox

To mitigate the risk of wasting my time, one approach I take is to give the assistant a kind of ultimatum: if a suggestion doesn’t bring me value with little additional effort, I move on. If an input is not helping me quickly enough, I assume the worst about the assistant rather than giving it the benefit of the doubt and spending 20 more minutes on making it work.

The example that comes to mind is when I was using an AI chat to help me generate a mermaid.js class diagram. I’m not very familiar with the mermaid.js syntax, so I kept trying to make the suggestion work, thinking I had maybe included it in my markdown file the wrong way. It turned out the syntax was totally wrong, which I only discovered when I finally went to the online documentation after 10 minutes or so.

Make up a persona for the assistant

When preparing this memo, I started wondering if making up a persona for the assistant could help me use it responsibly, with as little wasted time as possible. Maybe anthropomorphizing the AI could actually help in this case?

Thinking about the types of unreliability, I’d imagine the AI persona with these traits: eager to help, stubborn, very well-read but inexperienced, and unwilling to admit when it doesn’t actually “know” something.

I tried a few prompts with an image generator, asking it for variations of eager beavers and stubborn donkeys. Here’s the one I liked the best (“eager stubborn donkey happy books computer; cartoon, vector based, flat areas of color” in Midjourney):

a cartoonish picture of a very excited donkey among a stack of books

You could even come up with a fun name for your persona, and talk about it on the team. “Dusty was an annoying know-it-all during that session, we had to turn them off for a bit”, or “I’m glad Dusty was there, I got that task done before lunch”. But the one thing you should never say is “Dusty caused that incident!”, because Dusty is basically underage; they don’t have a license to commit. We are kind of the parents who are ultimately responsible for the commits, and “parents are liable for their children”.

Conclusion

The list of situation assessments might seem like a lot to apply every single time you’re using a coding assistant. But I believe we’re all going to get better at it the more we use these tools. We make quick assessments with multiple dimensions like this all the time when we are coding, based on our experience. I’ve found that I’ve gotten better at deciding when to use and trust the assistant the more often I’ve run into the situations mentioned above - the more often I’ve touched the hot stove, so to speak.

You might also think, “If the AI assistants are unreliable, then why would I use them in the first place?”. There is a mindset shift we have to make when using Generative AI tools in general. We cannot use them with the same expectations we have for “regular” software. GitHub Copilot is not a traditional code generator that gives you 100% of what you need. But in 40-60% of situations, it can get you 40-80% of the way there, which is still useful. When you adjust these expectations, and give yourself some time to understand the behaviours and quirks of the eager donkey, you’ll get more out of AI coding assistants.

Thanks to Brandon Cook, Jörn Dinkla, Paul Sobocinski and Ryder Dain for their feedback and input.

This memo was written with GitHub Copilot active, in markdown files. It helps with ideas and turns of phrase, and sometimes when I’m stuck, but suggestions very rarely end up as they were in the final text. I use ChatGPT as a thesaurus, and to find a good name for a donkey.