The toolchain

This article is part of “Exploring Gen AI”, a series capturing our explorations of using gen AI technology for software development.

26 Jul 2023

Let’s start with the toolchain. Whenever there is a new area with still-evolving patterns and technology, I try to develop a mental model of how things fit together. It helps me deal with the wave of information coming at me. What types of problems are being solved in the space? What are the common types of puzzle pieces needed to solve those problems? How do things fit together?

How to categorise the tools

The following are the dimensions of my current mental model of tools that use LLMs (Large Language Models) to assist with coding.

Assisted tasks

These are the types of tasks I see most commonly tackled when it comes to coding assistance, although there would be a lot more if I expanded the scope to other tasks in the software delivery lifecycle.

Interaction modes

I’ve seen three main types of interaction modes: chat interfaces, in-line assistance while typing in the code editor, and prompt-driven interfaces such as CLIs or web UIs.

Prompt composition

The quality of the prompt obviously has a big impact on the usefulness of the tools, in combination with the suitability of the LLM used in the backend. Prompt engineering does not have to be left purely to the user though; many tools apply prompting techniques for you in the backend.
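To illustrate the idea, here is a minimal sketch of what such backend prompt composition might look like. The prompt structure and the `compose_prompt` helper are assumptions for illustration, not the implementation of any particular tool; the point is that the user only types a short request, and the tool enriches it with context before calling the model.

```python
# A minimal sketch (not any specific tool's implementation) of backend prompt
# composition: the user only types a short request, and the tool enriches it
# with editor context before sending it to the model.

def compose_prompt(user_request: str, open_files: dict[str, str]) -> str:
    """Combine the user's request with the currently open files into one prompt."""
    context = "\n\n".join(
        f"### File: {path}\n{content}" for path, content in open_files.items()
    )
    return (
        "You are a coding assistant.\n\n"
        f"Relevant files currently open in the editor:\n\n{context}\n\n"
        f"User request: {user_request}\n"
        "Answer with code and a short explanation."
    )

if __name__ == "__main__":
    prompt = compose_prompt(
        "Add input validation to the order endpoint",
        {"orders/api.py": "def create_order(payload): ...\n"},
    )
    print(prompt)  # this composed prompt would then be sent to the backing LLM
```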

Properties of the model

Origin and hosting

Examples

Here are some common examples of tools in the space, and how they fit into this model. (The list is not an endorsement of these tools, or a dismissal of other tools; it’s just meant to help illustrate the dimensions.)

| Tool | Tasks | Interaction | Prompt composition | Model | Origin / Hosting |
| --- | --- | --- | --- | --- | --- |
| GitHub Copilot | Code generation | In-line assistance | Composed by IDE extension | Trained with code, vulnerability filters | Commercial |
| GitHub Copilot Chat | All of them | Chat | Composed of user chat + open files | Trained with code | Commercial |
| ChatGPT | All of them | Chat | All done by user | Trained with code | Commercial |
| GPT Engineer | Code generation | CLI | Prompt composed based on user input | Choice of OpenAI models | Open source, connecting to OpenAI API |
| “Team AIs” | All of them | Web UI | Prompt composed based on user input and use case | Most commonly with OpenAI’s GPT models | Maintained by a team for their use cases, connecting to OpenAI APIs |
| Meta’s CodeCompose | Code generation | In-line assistance | Composed by editor extension | Model fine-tuned on internal use cases and codebases | Self-hosted |

What are people using today, and what’s next?

Today, people most commonly use a combination of direct chat interaction (e.g. via ChatGPT or Copilot Chat) and coding assistance in the code editor (e.g. via GitHub Copilot or Tabnine). In-line assistance in the context of an editor is probably the most mature and effective way to use LLMs for coding assistance today. It supports the developer in their natural workflow with small steps: smaller steps make it easier to follow along and review the quality diligently, and it’s easy to just move on in the cases where it does not work.

There is a lot of experimentation going on in the open source world with tooling that provides prompt composition to generate larger pieces of code (e.g. GPT Engineer, Aider). I’ve seen similar usage of small prompt composition applications tuned by teams for their specific use cases, e.g. by combining a reusable architecture and tech stack definition with user stories to generate task plans or test code, similar to what my colleague Xu Hao is describing here. Prompt composition applications like this are most commonly used with OpenAI’s models today, as they are the most easily available and relatively powerful. Experiments are moving more and more towards open source models and the big hyperscalers’ hosted models though, as people look for more control over their data.
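To make that concrete, here is a minimal sketch of such a team-tuned prompt composition application. The example architecture text, the user story, and the `call_model` stub are illustrative assumptions rather than any team’s actual setup; the sketch just pairs a reusable architecture and tech stack description with a single user story and asks the model for a task plan.

```python
# Sketch of a small, team-tuned prompt composition application: a reusable
# architecture / tech stack definition is combined with a user story, and the
# resulting prompt is sent to whichever model the team uses. The call_model
# stub and the example texts are illustrative assumptions.

ARCHITECTURE = """\
Service-based backend in Kotlin with Spring Boot, PostgreSQL for persistence,
React frontend, contract tests between services."""

def build_prompt(architecture: str, user_story: str) -> str:
    """Pair the reusable team context with a single user story."""
    return (
        "You are helping a delivery team break down work.\n\n"
        f"Team architecture and tech stack:\n{architecture}\n\n"
        f"User story:\n{user_story}\n\n"
        "Produce a step-by-step implementation task plan that fits this "
        "architecture, and list the test cases each task needs."
    )

def call_model(prompt: str) -> str:
    # Placeholder for the model call; today this would most commonly be an
    # OpenAI API request, but it could also point at an open source or
    # hyperscaler-hosted model.
    raise NotImplementedError

if __name__ == "__main__":
    story = "As a customer, I want to cancel an order before it has shipped."
    print(build_prompt(ARCHITECTURE, story))  # call_model(prompt) would go here
```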

As a next step beyond advanced prompt composition, people are placing a lot of hope for future improvements in the model component. Do larger models, or smaller but more specifically trained models, work better for coding assistance? Will models with larger context windows enable us to feed them more code, so they can reason about the quality and architecture of larger parts of our codebases? At what scale does it pay off for an organization to fine-tune a model with its own code? What will happen in the space of open source models? Questions for a future memo.

Thanks to Kiran Prakash for his input.