Understanding the agentic loop

AI token usage can increase rapidly and become expensive. To see why, you need to understand how agents work and, in particular, the agentic loop. This page explains the mechanics of agents and how tokens are consumed during their operation.

From chatbots to agents

A few years ago, AI assistants were simple question-and-answer systems. You asked a chatbot (like ChatGPT), the LLM reasoned and generated an answer based on its training data, and you received the result. There was no ability to use tools or take action on your behalf.

The key limitation was that LLMs operate on the principle of stateless inference – they generate responses based solely on the input prompt and their training data. They cannot execute code, run commands, or interact with external systems.

With modern agents, this limitation is removed. Agents can recognise when a task requires more than inference alone and can request tools to be used. Tools are snippets of code (for example, fetching a web page, running a shell command, or querying a database) that execute on your machine (if running Copilot locally) or on a hosted machine (if running in GitHub cloud). These tool executions are relatively inexpensive compared to LLM inference – they use existing APIs and command-line utilities that have been in production for years.

How the agentic loop works

An agent operates in a loop that repeats until a task is complete. Here’s what happens at each step:

The four-step cycle

  1. Send prompt to LLM – Your request is sent to the language model along with the current context (previous responses, tool outputs, current state)

  2. LLM reasons and decides – The model analyses the prompt and determines:
    • Can this task be solved with reasoning alone?
    • Do I need to use a tool?
    • Which tool should I use (if any)?

  3. Act: invoke the tool – If the LLM decides a tool is needed, the agent executes it. This is the cheap part – a tool call might be an HTTP request, a shell command, or a database query. No LLM inference is happening here.

  4. Return output to LLM – The tool’s output is sent back to the LLM, which reads it and decides:
    • Is the task complete? If so, generate the final answer
    • Do I need more information? If so, invoke another tool
    • Should I repeat the loop?

The loop repeats

The key point is that this cycle may repeat many times before you get the final answer. Each time the loop iterates, the LLM performs another inference pass, consuming tokens. The tool executions themselves are cheap, but the LLM invocations are expensive.
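
The cycle can be sketched in a few lines of Python. This is a minimal illustration, not how any particular agent is implemented: fake_llm stands in for a real model call, and the word_count tool, the Decision type, and run_agent are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical tool registry; the tool name is illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
}


@dataclass
class Decision:
    """What the (stubbed) model decided on one inference pass."""
    tool: Optional[str] = None  # tool to call, or None when the task is done
    argument: str = ""          # input passed to the tool
    answer: str = ""            # final answer when no tool is needed


def fake_llm(context: list[str]) -> Decision:
    """Stand-in for a real model call; a real agent would call an LLM here."""
    if not any(line.startswith("tool_output:") for line in context):
        return Decision(tool="word_count", argument=context[0])
    count = context[-1].split(":", 1)[1]
    return Decision(answer=f"The prompt has {count} words.")


def run_agent(prompt: str, max_iterations: int = 10) -> tuple[str, int]:
    """Run the loop and return the answer plus the number of inference passes."""
    context = [prompt]
    inferences = 0
    for _ in range(max_iterations):
        decision = fake_llm(context)  # steps 1-2: send context, model decides
        inferences += 1
        if decision.tool is None:     # task complete: return the final answer
            return decision.answer, inferences
        output = TOOLS[decision.tool](decision.argument)  # step 3: cheap tool call
        context.append(f"tool_output:{output}")           # step 4: feed result back
    return "stopped: iteration limit reached", inferences


answer, inferences = run_agent("count the words in this prompt")
```

Even this single-tool task needs two inference passes: one to choose the tool and one to read its output and answer.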

Token consumption in agents

To understand why costs increase, consider a concrete example:

Simple chatbot (no agent):

1 prompt → 1 LLM inference → 1 response = cheap

Agent solving a multi-step problem:

1 prompt → 1st LLM inference (decides to use tool A)
         → execute tool A (cheap)
         → 2nd LLM inference (reads tool output, decides to use tool B)
         → execute tool B (cheap)
         → 3rd LLM inference (reads tool output, generates final answer)
         → return result = roughly 3× the token cost or more, since each inference re-reads the growing context

In this simple example, the agent consumed 3 LLM inferences instead of 1. Complex tasks requiring 5, 10, or even 20 iterations are common. This is why agents can rapidly increase costs.
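
Counting inferences understates the effect, because each pass receives the accumulated context. The sketch below estimates input tokens under simplifying assumptions: every pass re-reads the whole context, every earlier pass added one model reply and one tool output, and no caching applies. The function name and all token figures are illustrative, not measurements.

```python
def total_input_tokens(prompt_tokens: int, tool_output_tokens: int,
                       reply_tokens: int, inferences: int) -> int:
    """Sum the input tokens read across all inference passes.

    Assumes each pass re-reads the full context, and that each earlier pass
    appended one model reply and one tool output to that context.
    """
    total = 0
    context = prompt_tokens
    for _ in range(inferences):
        total += context                              # this pass reads everything so far
        context += reply_tokens + tool_output_tokens  # context grows for the next pass
    return total


# Illustrative numbers only.
chatbot = total_input_tokens(1_000, 2_000, 200, inferences=1)
agent_3 = total_input_tokens(1_000, 2_000, 200, inferences=3)
agent_15 = total_input_tokens(1_000, 2_000, 200, inferences=15)
```

With these made-up figures, three passes read 9,600 input tokens rather than 3,000, and fifteen passes read 246,000, because later passes re-read every earlier tool output.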

The cost of each loop iteration also depends on the model tier, because higher-tier models typically charge more per token. A complex task requiring 15 reasoning steps with a high-tier model (e.g., Claude Opus, GPT-5.5) can consume significantly more tokens, at a higher price per token, than a straightforward prompt to a basic model.

Why agents are still valuable

Despite the potential for higher token consumption, agents are valuable because:

  • They handle complexity – Tasks that would require multiple manual prompts and context switches can be solved in a single agent request
  • They reduce human iteration – Agents can autonomously retry and self-correct, reducing the human effort needed
  • They perform actions – Agents can create files, run tests, analyse code, and perform other actions that simple LLMs cannot

The goal is not to avoid agents, but to use them strategically and monitor their token consumption.

Reducing agent costs

To keep token costs low when using agents:

  • Use the appropriate model tier – Agents running on lower-tier models (Claude Haiku, GPT-mini) typically cost less per token
  • Provide clear context – Well-scoped prompts with relevant code snippets and constraints reduce the number of iterations needed
  • Set iteration limits – Some agents allow you to limit the maximum number of tool invocations or reasoning steps
  • Monitor usage – Use the Chronicle command in the GitHub Copilot CLI to track which agents consume the most tokens
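
Where an agent does not expose a built-in limit, the same idea can be approximated in your own wrapper code. The sketch below is one possible shape for such a guard; RunBudget and BudgetExceeded are hypothetical names, not part of any Copilot API, and the per-pass token figures are made up.

```python
class BudgetExceeded(Exception):
    """Raised when a run goes past its configured limits."""


class RunBudget:
    """Tracks iterations and tokens for one agent run (hypothetical helper)."""

    def __init__(self, max_iterations: int, max_tokens: int) -> None:
        self.max_iterations = max_iterations
        self.max_tokens = max_tokens
        self.iterations = 0
        self.tokens = 0

    def record(self, tokens_used: int) -> None:
        """Call once per inference pass with the tokens that pass consumed."""
        self.iterations += 1
        self.tokens += tokens_used
        if self.iterations > self.max_iterations:
            raise BudgetExceeded(f"iteration limit {self.max_iterations} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")


budget = RunBudget(max_iterations=5, max_tokens=10_000)
stopped_reason = ""
try:
    for pass_tokens in [3_000, 4_000, 5_000]:  # made-up per-pass usage
        budget.record(pass_tokens)
except BudgetExceeded as exc:
    stopped_reason = str(exc)  # the run stops here instead of looping on
```

Raising an exception rather than silently truncating makes it obvious in logs when a task was cut short by its budget.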

For detailed strategies on optimising token usage, see Efficient token usage.

Key takeaways

  • Agents operate in a loop: prompt → LLM inference → tool execution → output → repeat
  • Each loop iteration consumes tokens; complex tasks can require many iterations
  • Token costs scale with task complexity and model tier
  • Agents are valuable for complex, multi-step tasks despite higher costs
  • Monitor usage and match model tiers to task complexity to manage expenses

Further reading