The toolchain

This article is part of “Exploring Gen AI”, a series capturing our explorations of using gen AI technology for software development.

26 Jul 2023

Let’s start with the toolchain. Whenever there is a new area with still-evolving patterns and technology, I try to develop a mental model of how things fit together. It helps me deal with the wave of information coming at me. What types of problems are being solved in the space? What are the common types of puzzle pieces needed to solve those problems? How do things fit together?

How to categorise the tools

The following are the dimensions of my current mental model of tools that use LLMs (Large Language Models) to assist with coding.

Assisted tasks

These are the types of tasks I see most commonly tackled when it comes to coding assistance, although there would be a lot more if I expanded the scope to other tasks in the software delivery lifecycle.

Interaction modes

I’ve seen three main types of interaction modes: chat interfaces, in-line assistance while typing in the code editor, and prompt-driven interfaces such as CLIs or web UIs.

Prompt composition

The quality of the prompt obviously has a big impact on the usefulness of the tools, in combination with the suitability of the LLM used in the backend. Prompt engineering does not have to be left purely to the user though; many tools apply prompting techniques for you in the backend.
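To illustrate the idea, here is a minimal sketch of what such backend prompt composition might look like. The prompt structure and the `compose_prompt` helper are assumptions for illustration, not the implementation of any particular tool; the point is that the user only types a short request, and the tool enriches it with context before calling the model.

```python
# A minimal sketch (not any specific tool's implementation) of backend prompt
# composition: the user only types a short request, and the tool enriches it
# with editor context before sending it to the model.

def compose_prompt(user_request: str, open_files: dict[str, str]) -> str:
    """Combine the user's request with the currently open files into one prompt."""
    context = "\n\n".join(
        f"### File: {path}\n{content}" for path, content in open_files.items()
    )
    return (
        "You are a coding assistant.\n\n"
        f"Relevant files currently open in the editor:\n\n{context}\n\n"
        f"User request: {user_request}\n"
        "Answer with code and a short explanation."
    )

if __name__ == "__main__":
    prompt = compose_prompt(
        "Add input validation to the order endpoint",
        {"orders/api.py": "def create_order(payload): ...\n"},
    )
    print(prompt)  # this composed prompt would then be sent to the backing LLM
```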

Properties of the model

Origin and hosting

Examples

Here are some common examples of tools in the space, and how they fit into this model. (The list is not an endorsement of these tools, or a dismissal of other tools; it’s just meant to help illustrate the dimensions.)

| Tool | Tasks | Interaction | Prompt composition | Model | Origin / Hosting |
| --- | --- | --- | --- | --- | --- |
| GitHub Copilot | Code generation | In-line assistance | Composed by IDE extension | Trained with code, vulnerability filters | Commercial |
| GitHub Copilot Chat | All of them | Chat | Composed of user chat + open files | Trained with code | Commercial |
| ChatGPT | All of them | Chat | All done by user | Trained with code | Commercial |
| GPT Engineer | Code generation | CLI | Prompt composed based on user input | Choice of OpenAI models | Open source, connecting to OpenAI API |
| “Team AIs” | All of them | Web UI | Prompt composed based on user input and use case | Most commonly with OpenAI’s GPT models | Maintained by a team for their use cases, connecting to OpenAI APIs |
| Meta’s CodeCompose | Code generation | In-line assistance | Composed by editor extension | Model fine-tuned on internal use cases and codebases | Self-hosted |

What are people using today, and what’s next?

Today, people most commonly use a combination of direct chat interaction (e.g. via ChatGPT or Copilot Chat) and coding assistance in the code editor (e.g. via GitHub Copilot or Tabnine). In-line assistance in the context of an editor is probably the most mature and effective way to use LLMs for coding assistance today. It supports the developer in their natural workflow with small steps: smaller steps make it easier to follow along and review the quality diligently, and it’s easy to just move on in the cases where it does not work.

There is a lot of experimentation going on in the open source world with tooling that provides prompt composition to generate larger pieces of code (e.g. GPT Engineer, Aider). I’ve seen similar usage of small prompt composition applications tuned by teams for their specific use cases, e.g. by combining a reusable architecture and tech stack definition with user stories to generate task plans or test code, similar to what my colleague Xu Hao is describing here. Prompt composition applications like this are most commonly used with OpenAI’s models today, as they are the most easily available and relatively powerful. Experiments are moving more and more towards open source models and the big hyperscalers’ hosted models though, as people look for more control over their data.
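To make that concrete, here is a minimal sketch of such a team-tuned prompt composition application. The example architecture text, the user story, and the `call_model` stub are illustrative assumptions rather than any team’s actual setup; the sketch just pairs a reusable architecture and tech stack description with a single user story and asks the model for a task plan.

```python
# Sketch of a small, team-tuned prompt composition application: a reusable
# architecture / tech stack definition is combined with a user story, and the
# resulting prompt is sent to whichever model the team uses. The call_model
# stub and the example texts are illustrative assumptions.

ARCHITECTURE = """\
Service-based backend in Kotlin with Spring Boot, PostgreSQL for persistence,
React frontend, contract tests between services."""

def build_prompt(architecture: str, user_story: str) -> str:
    """Pair the reusable team context with a single user story."""
    return (
        "You are helping a delivery team break down work.\n\n"
        f"Team architecture and tech stack:\n{architecture}\n\n"
        f"User story:\n{user_story}\n\n"
        "Produce a step-by-step implementation task plan that fits this "
        "architecture, and list the test cases each task needs."
    )

def call_model(prompt: str) -> str:
    # Placeholder for the model call; today this would most commonly be an
    # OpenAI API request, but it could also point at an open source or
    # hyperscaler-hosted model.
    raise NotImplementedError

if __name__ == "__main__":
    story = "As a customer, I want to cancel an order before it has shipped."
    print(build_prompt(ARCHITECTURE, story))  # call_model(prompt) would go here
```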

As a next step beyond advanced prompt composition, people are placing a lot of hope for future improvements in the model component. Do larger models, or smaller but more specifically trained models, work better for coding assistance? Will models with larger context windows enable us to feed them more code, so they can reason about the quality and architecture of larger parts of our codebases? At what scale does it pay off for an organization to fine-tune a model with its own code? What will happen in the space of open source models? Questions for a future memo.

Thanks to Kiran Prakash for his input.