What role does LLM reasoning play for software tasks?
This article is part of “Exploring Gen AI”. A series capturing our explorations of using Gen AI technology for software development.
18 Feb 2025
The general principle of how Large Language Models work is rooted in pattern matching and statistical prediction of the next token (“stochastic parrots”). One of the somewhat unexpected capabilities that emerged from that approach is that they can also “reason” through problems, to an extent. Some models are better at reasoning than others: OpenAI’s “o1” and “o3” models are two of the prominent reasoning models, and DeepSeek’s “R1” recently caused quite a stir. But what role does this capability play when we use AI for coding tasks?
Spoiler alert: I don’t have the answer yet! But I have questions and thoughts.
I’ll start with two things that, as I understand it, limit these reasoning abilities and are relevant in the context of coding. Then I’ll share my thoughts on where reasoning could be useful for coding tasks, and where not.
Context is key, especially for reasoning
A paper by Apple about the limits of Large Language Models’ ability to reason got a lot of attention last year. The authors introduce a new benchmark to test how good LLMs are at “Mathematical Reasoning”. They based their benchmark on an existing one that contains a set of grade school math problems. They took 100 of those problems and turned them into templates with variable placeholders, then created 50 variations of each, resulting in a data set of 5,000 problems. In a second step, they also created a new data set where they added irrelevant information to the problems.
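To make that setup a bit more concrete, here is a minimal sketch of how such templated variations might be generated. This is my own illustration, not the paper’s actual code; the template text, names, and number ranges are made up.

```python
import random

# A grade school problem turned into a template with placeholders, loosely
# in the spirit of the paper's benchmark: only surface details vary between
# instances, the reasoning steps needed to solve them stay the same.
TEMPLATE = (
    "{name} is watching her {relative}. She has {total} stuffed animals "
    "and gives {given} of them away. How many does she have left?"
)
DISTRACTOR = " {n} of the stuffed animals are smaller than average."

NAMES = ["Sophie", "Anita", "Mara"]
RELATIVES = ["nephew", "granddaughter", "niece"]

def make_variation(rng: random.Random, add_irrelevant: bool = False):
    """Return one (question, ground-truth answer) pair."""
    total = rng.randint(8, 20)
    given = rng.randint(1, total - 1)
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        relative=rng.choice(RELATIVES),
        total=total,
        given=given,
    )
    if add_irrelevant:
        # a detail that is true but does not matter for the answer
        question += DISTRACTOR.format(n=rng.randint(1, total))
    return question, total - given

rng = random.Random(0)
dataset = [make_variation(rng) for _ in range(50)]
noisy_dataset = [make_variation(rng, add_irrelevant=True) for _ in range(50)]
```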
They found that
- The variation of names and numbers influences how well the models solve the problems. Even when the reasoning steps stay exactly the same (Sophie watching her nephew becomes Anita watching her granddaughter, or it’s 12 stuffed animals instead of 8), the models’ performance is not consistent, and even drops slightly in comparison to the original benchmark.
- Performance declines further when the difficulty and size of the problems increase.
- Finally, they found a big negative impact on performance when they added irrelevant information to the problem.
First of all, this is a great example of why we should take LLM benchmarks with a grain of salt.
And in the coding context, I found the last finding particularly interesting. Coding assistants are getting better and better at orchestrating the relevant code context for LLMs, so that the model has just the right information to suggest useful code to us. The paper would suggest that it’s particularly important for good reasoning that our tool can discard irrelevant code snippets that could distract the model and negatively impact the quality of the reasoning.
Basically: data quality matters, for reasoning even more than for other task types. When we have identified a problem that does need multi-step reasoning, we should take particular care about the context we feed to the model.
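As an illustration of what that could mean in practice, here is a minimal sketch of the kind of relevance filtering a coding assistant might do before handing context to a model. The scoring, threshold, and size budget are placeholders I made up; real assistants use more sophisticated retrieval.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    path: str
    text: str
    relevance: float  # e.g. embedding similarity to the current task, 0..1

def build_context(task: str, snippets: list[Snippet],
                  min_relevance: float = 0.55,
                  max_chars: int = 12_000) -> str:
    """Keep only snippets above a relevance threshold, most relevant first,
    and stop before the context grows too large. The point of the cutoff:
    irrelevant code is not just wasted space, it can actively distract the
    model and degrade its reasoning."""
    prompt_parts = [f"Task: {task}"]
    used = 0
    for s in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        if s.relevance < min_relevance or used + len(s.text) > max_chars:
            break
        prompt_parts.append(f"# {s.path}\n{s.text}")
        used += len(s.text)
    return "\n\n".join(prompt_parts)
```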
No function calling during the reasoning step
Reasoning models like o1 and R1 work in two steps: first they “reason” or “think” about the user’s prompt, then they return a final result in a second step. In the reasoning step, the model goes through a chain of thought to come to a conclusion. Whether you can fully see the contents of this reasoning step depends on the user interface in front of the model. OpenAI, for example, only shows users summaries of each step. DeepSeek’s platform shows the full reasoning chain (and of course you also have access to the full chain when you run R1 yourself). At the end of the reasoning step the chatbot UIs will show messages like “Thought for 36 seconds”, or “Reasoned for 6 seconds”. However long it takes, and regardless of whether the user can see it or not, tokens are being generated in the background, because LLMs think through token generation.
As far as I can tell, during this reasoning mode, the models cannot call any functions available to them. I’m not sure if that’s because of API limitations or model limitations, but my guess is that the whole point of the reasoning step is to generate an uninterrupted chain of tokens, and that it would defeat part of the purpose to interrupt this with a function call.
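To illustrate the difference, here is a schematic sketch that contrasts the two interaction shapes. Everything in it is a stand-in (a fake model and a fake tool), not a real SDK; the point is only that a regular agent loop can interleave model turns with tool calls, while the reasoning phase is one uninterrupted generation.

```python
from typing import Callable

def fake_model(messages: list[dict], reasoning: bool = False) -> dict:
    """Pretend model: in normal mode it asks for a tool once, then answers;
    in reasoning mode it returns an answer after thinking purely in tokens."""
    if reasoning:
        return {"content": "answer produced by an uninterrupted chain of thought"}
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": "read_file:app/validation.py"}
    return {"content": "answer that takes the tool result into account"}

TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": lambda path: f"<contents of {path}>",
}

def agent_loop(task: str) -> str:
    """Ordinary tool-using loop: model turns and tool results alternate."""
    messages = [{"role": "user", "content": task}]
    while True:
        reply = fake_model(messages)
        if "tool_call" in reply:
            name, arg = reply["tool_call"].split(":", 1)
            messages.append({"role": "tool", "content": TOOLS[name](arg)})
            continue
        return reply["content"]

def reasoning_call(task: str) -> str:
    """Reasoning step: a single generation, with no way to inject tool calls."""
    return fake_model([{"role": "user", "content": task}], reasoning=True)["content"]
```

A coding assistant could still run tools before handing a prompt to reasoning_call, or after it returns; what seems not to be possible is interleaving them with the chain of thought itself.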
When is reasoning relevant for coding tasks?
Many of the reasoning benchmarks use grade school math problems, so those are my frame of reference when I try to find analogous problems in software where a chain of thought would be helpful. It seems to me that this is about problems that need multiple steps to come to a solution, where each step depends on the output of the previous one.
Based on that definition, reasoning doesn’t seem important at all for the following tasks, as these things can be covered by LLMs without step-by-step reasoning:
- Inline coding suggestions
- Generating code for a new function
- Answering questions about how to do something (e.g., “how do I delete all local Docker images?”)
- Generating documentation, or summaries
- Turning natural language into formal queries (e.g. SQL)
This is where models that are fine-tuned for coding excel, regardless of their reasoning capabilities.
Reasoning models are brought up mainly in these two contexts:
Debugging
Debugging seems like an excellent use case for chain of thought. My main question is how much our use of reasoning for debugging will be hindered by the lack of function calling.
Let’s consider a debugging task: I have an error, and I want AI to think about potential causes. At that point I myself have no idea where the problem originates, and I cannot dump the whole codebase into the context window, so I want the AI tool to look up information along the way.
For example, let’s say there is an error in the logs, “ERROR: duplicate key value violates unique constraint “users_email_key””, and I’m asking AI to help me debug. One of the first hypotheses might be that there is an email validation that is not working, so I would want it to look up my validation code and then continue reasoning. This is very typical of debugging: we come up with a hypothesis, look up information that confirms or invalidates it, and then either take the next step or discard the hypothesis altogether. From my understanding of an LLM’s reasoning step, looking up more information along the way through function calling is currently not possible.
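To make that hypothesis concrete, here is the kind of code it points at (entirely made up for illustration, assuming a Postgres database accessed through a psycopg-style connection): an email check that validates the format and even looks for existing rows, but is not atomic, so two concurrent registrations can both pass the check and one of them then trips the users_email_key constraint.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def register_user(conn, email: str) -> None:
    """Hypothetical registration path. The format check passes, and even the
    duplicate lookup passes, but check-then-insert is not atomic: a second
    concurrent request with the same email can slip in between, and the
    database's users_email_key unique constraint is what finally fires."""
    if not EMAIL_RE.match(email):
        raise ValueError("invalid email address")
    with conn.cursor() as cur:
        cur.execute("SELECT 1 FROM users WHERE email = %s", (email,))
        if cur.fetchone():
            raise ValueError("email already registered")
        # race window: another transaction can insert the same email here
        cur.execute("INSERT INTO users (email) VALUES (%s)", (email,))
    conn.commit()
```

Whether this hypothesis holds can only be decided by actually reading the real validation and insert code, which is exactly the kind of lookup I would want to happen in the middle of the reasoning.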
Making a plan for an implementation
I’ve seen a lot of reports of people using a reasoning model for a planning step in their coding, and then a model that is particularly good at coding (like Claude Sonnet) for the implementation of that plan. The popular open source coding assistant Cline recently released a feature where the developer can switch between a “Plan” and an “Act” mode, which holds the LLM back from jumping straight to implementation, but also makes it easier to use a different model for the first step, potentially one with reasoning capabilities.
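As a sketch of what that workflow boils down to (hypothetical model names and a placeholder complete() function, not Cline’s actual implementation): a reasoning-oriented model drafts a plan without writing code, and a coding-oriented model then implements it step by step.

```python
def complete(model: str, prompt: str) -> str:
    """Placeholder for whatever client is used to call a model
    (OpenAI, Anthropic, a local runner, ...)."""
    raise NotImplementedError

def plan_then_act(task: str, repo_context: str) -> list[str]:
    # "Plan": a reasoning model produces numbered steps, but no code yet.
    plan = complete(
        model="some-reasoning-model",
        prompt=(
            "Plan the implementation of this task as numbered steps, "
            "without writing any code yet.\n\n"
            f"Task: {task}\n\nRelevant context:\n{repo_context}"
        ),
    )
    # "Act": a coding model implements the plan one step at a time.
    changes = []
    for step in plan.splitlines():
        if step.strip():
            changes.append(complete(
                model="some-coding-model",
                prompt=f"Implement this step, output only the code change:\n{step}",
            ))
    return changes
```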
I have tried using a reasoning model for planning a few times, but to be honest, so far I haven’t noticed a particular difference compared to a good coding model. That made me wonder: is planning a coding task really a chain of thought problem? Is “we need an API endpoint and a change in this web component” something a model should reason through, or can it be covered with the more basic pattern matching capabilities?
I have not used it enough to draw a conclusion on this yet. But the second finding in the paper mentioned above could also be relevant here: when the authors increased the size and difficulty of the math problems, they found that “[…] the probability of arriving at an accurate answer decreases exponentially with the number of tokens or steps involved”, and that “[…] as the difficulty increases, the performance decreases and the variance increases”. In the paper, “difficulty” is defined by adding more factors and conditions to the original problem.
Do these findings set a glass ceiling for the size and difficulty of problems that could be planned and solved by a model with strong reasoning capabilities? We might find that o1 and R1 play a smaller role in coding than they are made out to be at the moment. Consider as an example this approach by Qodo, where they found that a step-by-step agent flow can outperform a reasoning model on its own. If problem size and complexity indeed remain a limiting factor for the models, then to make AI better at solving larger problems we have to break things down inside some kind of agent that gives smaller problems to multiple models. In which case I wonder whether reasoning would only be absolutely necessary for a small subset of those smaller problems.
In Summary
I’m personally not quite sure yet whether reasoning models are really as big a game changer for software development tasks as they are made out to be. Apart from being slow and expensive, the models have the potential limitations outlined above, and in my various attempts at using them for coding or debugging issues, I haven’t found a case yet where o1 gave me noticeably better results than Claude-Sonnet-3.5 in an IDE with good AI context orchestration.