To a hammer, everything looks like a nail
Large Language Models are powerful but non-deterministic. Existing tools are still better for deterministic and efficient answers. Why not both?
Definitions and Background
Tools: simple things humans have used since forever =p
- Single Responsibility Principle - having a singular purpose and doing it very well.
- Unix (and Linux) philosophy: connecting software tools with pipes
Example: a tool like cat is focused on displaying the contents of a file
Tools in the AI context: "functionality that is specifically provided to the model"
This relies on foundation model vendors who have trained the model to "understand" a tool call.
- LLM: stochastic highest probability token generator
- MCP is the new "model context protocol" for allowing an LLM (Large Language Model) to connect externally
If you need further background: My previous articles more focused on (local) LLMs and MCP:
-
https://blog.john-pfeiffer.com/cars-not-helicopters-or-running-a-local-llm-with-mlx-on-a-macbook-pro/
uv venv; uv pip install mlx-lm;uv run mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt "tell me a joke" -
https://blog.john-pfeiffer.com/intro-to-mcp-give-your-llm-tools-with-model-context-protocol/
Simplest Chat Loop
Start with the simplest way to interact with a (local) LLM at the command line.
In a loop, the user message is provided to the LLM, and the LLM response is displayed.
source .venv/bin/activate; python 01-chat-loop.py
from mlx_lm import load, generate
def main():
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
system_message = "You are a helpful, concise Assistant"
messages = [{"role": "system", "content": system_message}]
run(model, tokenizer, messages)
def run(model, tokenizer, messages):
""" run loops getting the user message and displaying the LLM response"""
while True:
user_input, ok = get_user_message()
if not ok:
break
messages.append({"role": "user", "content": user_input})
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=10000)
print(f"\nAssistant: {response}\n")
messages.append({"role": "assistant", "content": response})
def get_user_message():
try:
user_input = input("You (provide input or type quit): ").strip()
if user_input.lower() in {"exit", "quit"}:
return "", False
return user_input, True
except (EOFError, KeyboardInterrupt):
print()
return "", False
if __name__ == "__main__":
main()
You (provide input or type quit): tell me a joke
Assistant: A man walked into a library and asked the librarian, "Do you have any books on Pavlov's dogs and Schrödinger's cat?" The librarian replied, "It rings a bell, but I'm not sure if it's here or not."
You (provide input or type quit): summarize the file 02-chat-loop-with-read-file-tool.py
Assistant: I don't have access to your local files. However... [long unhelpful hallucination continues]
Tools and Agents
Is there a simple way to give an LLM precise accurate context to answer a query?
Simpler than training data, pre-training and post-training (RLHF), easier than an MCP Server?
Yes! Give the LLM a "tool".
An AI agent is software that is looping with repeated calls to an LLM until it achieves a goal.
Many other definitions of AI agents are out there - feel free to pick your favorite ;)
"Tool use" is a more lightweight way of giving an AI agent access to external or recent information. Just like giving a tool to a human, this provides the LLM incredible leverage.
Tool Definition
The technical specification definition of a "Tool" is quite small and simple, and it's mostly english:
Name: a unique identifier for the tool,
Description: a natural language explanation of what the tool does, its primary purpose, any important capabilities, and potentially its limitations
Input Schema: JSON object defining the expected (required and optional) parameters
Example Tool - Read File
Name: "read_file"
Description: "Read the contents of a given file path. Use this to see everything in a file. Do not use this with directory names.",
InputSchema:
"type": "object",
"properties": {
"file_path": {
"type": "string",
"description": "The path of the file to read",
},
},
"required": ["file_path"],
Note in developer jargon, tool calling for LLMs is often called "function-calling"
MVP Agent with a Tool
The complexity of tool calls expands from the simpler "user input to LLM and response", with a more complex system prompt
import json
from mlx_lm import load, generate
def main():
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
system_message = """You are a helpful, concise Assistant.
You have access to a read_file tool, but ONLY use them when the user's request REQUIRES one.
MOST messages need NO tool call. Just reply in plain text.
DO NOT call a tool for:
- Greetings, chitchat, or general questions
- Questions you already know the answer to
ONLY call a tool when you literally cannot answer without it.
When you do NOT need a tool, respond in plain text. Never output JSON unless you are calling a tool."""
messages = [{"role": "system", "content": system_message}]
run(model, tokenizer, messages)
def run(model, tokenizer, messages):
""" run loops getting the user message and displaying the LLM response"""
while True:
user_input, ok = get_user_message()
if not ok:
break
messages.append({"role": "user", "content": user_input})
user_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, tools=TOOLS)
# dedicated helper function to handle tool calling complexity
response = generate_response(model, tokenizer, messages, user_prompt)
print(f"\nAssistant: {response}\n")
messages.append({"role": "assistant", "content": response})
def get_user_message():
try:
user_input = input("You (provide input or type quit): ").strip()
if user_input.lower() in {"exit", "quit"}:
return "", False
return user_input, True
except (EOFError, KeyboardInterrupt):
print()
return "", False
def generate_response(model, tokenizer, messages, user_prompt):
"""generate_response checks if the LLM believes the User query may need a tool call, if the first LLM response is not JSON just ignore"""
response = generate(model, tokenizer, prompt=user_prompt, max_tokens=10000)
try:
response_as_json = json.loads(response)
if response_as_json.get("name") == "read_file":
# DEBUG print(f"\nAssistant believes it needs a tool call: {response}\n")
tool_input = response_as_json["parameters"]
tool_result = read_file(tool_input["file_path"])
messages.append({
"role": "tool",
"name": "read_file",
"content": tool_result,
})
# create a user friendly response based on the tool call result
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=10000)
except (json.JSONDecodeError, KeyError, TypeError):
pass
return response
def read_file(file_path):
try:
with open(file_path, "r") as f:
return f.read()
except Exception as e:
return f"Error reading file: {e}"
TOOLS = [
{
"type": "function",
"name": "read_file",
"function": "read_file",
"description": "Read the contents of a given file path. Use this to see everything in a file. Do not use this with directory names.",
"inputSchema": {
"type": "object",
"properties": {
"file_path": {
"type": "string",
"description": "The path of the file to read",
},
},
"required": ["file_path"],
},
},
]
if __name__ == "__main__":
main()
TOOLS is messy to be compatible with the many every changing specifications
Chatting with tool power
You (provide input or type quit): tell me a joke
Assistant: I'm a large language model, I don't have a collection of jokes stored in a file. I can try to come up with a joke on the spot, though!
Why couldn't the bicycle stand up by itself? Because it was two-tired!
You (provide input or type quit): summarize the file 02-chat-loop-with-read-file-tool.py
Assistant: This script is a chat loop that uses a large language model (LLM) to respond to user input. The LLM is loaded from a model and tokenizer, and the chat loop continues until the user types "quit" or "exit".
The script also includes a "read_file" tool that can be used to read the contents of a file. If the LLM believes that a user query may require a tool call, it will print a message indicating that it needs to call a tool, and then call the "read_file" tool with the specified file path. The result of the tool call is then used to create a user-friendly response.
The Tool Calling Flow
- User requests to the LLM are always accompanied by a list of possible tools available
- The LLM response indicates in a JSON format which tool with which parameter(s)
- The harness/code executes on the tool call request
- The tool/function output is then sent to the LLM
- The LLMs final response integrates an LLM answer based on both the user request and the tool output
You can have multiple tools, and multiple sub loops to process the tool calls sequentially or even in parallel!
<-\[o_o]/--c
x--|---|--C
_/|___|\_
Lack of Standards
This is a rapidly evolving area of technology. Tool specifications are different for every model...
The "check if the LLM response is JSON with a key 'name'" is my hack to work around a lack of standards:
-
OpenAI: inspect output items where type == "function_call" , https://developers.openai.com/api/docs/guides/function-calling
-
Anthropic: inspect content blocks where type == "tool_use" , https://platform.claude.com/docs/en/agents-and-tools/tool-use/define-tools#model-responses-with-tools
-
Gemini: inspect content parts that have a function_call object https://ai.google.dev/gemini-api/docs/function-calling
How frameworks handle this:
- MLX indicate "tool call format is model specific" https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/examples/tool_use.py
- The open source "OpenCode" repo deliberately handles each separately: https://github.com/anomalyco/opencode/tree/dev/packages/llm/src/protocols
MCP as a standard pushes LLM vendors to provide a compatible developer experience:
More Resources
2026 Addendum
The well crafted open source agent harness "Pi" handles the problem elegantly (supporting multiple AI vendors via the Adapter Pattern):
- https://pi.dev/
-
https://github.com/earendil-works/pi/blob/main/packages/agent/src/agent-loop.ts#L202
Brittle - so use SDKs
It's a difficult challenge to work with each different vendor's raw model expectations, hence SDKs ;p
- https://pypi.org/project/openai/
- https://pypi.org/project/anthropic/
- https://pypi.org/project/google-genai/
Below is the raw agent tool loop adapted for the local Google model Gemma4. Notably it has "thinking traces" that mlx-lm==0.31.2 does not (yet) remove.
Assistant: <|channel>thought
The user is asking for a joke. This is a general knowledge request that does not require any tools (like `read_file`). I should respond with a joke in plain text.<channel|>
Why don't scientists trust atoms?
Because they make up everything!
Success via contortions to parse <|tool_call>...
You (provide input or type quit): summarize the file 02-chat-loop-with-read-file-tool.py
Assistant: <|channel>thought
The user wants a summary of the Python file `02-chat-loop-with-read-file-tool.py`.
I have successfully read the file content.
Now I need to summarize the code.
The code implements a chat loop application that uses a large language model (LLM) and includes a `read_file` tool.
...
MLX Gemma4 with tool calling
import json
from mlx_lm import load, generate
def main():
model, tokenizer = load("mlx-community/gemma-4-e4b-it-4bit")
system_message = """You are a helpful, concise Assistant.
You have access to a read_file tool, but ONLY use them when the user's request REQUIRES one.
MOST messages need NO tool call. Just reply in plain text.
DO NOT call a tool for:
- Greetings, chitchat, or general questions
- Questions you already know the answer to
ONLY call a tool when you literally cannot answer without it.
When you do NOT need a tool, respond in plain text. Never output JSON unless you are calling a tool."""
messages = [{"role": "system", "content": system_message}]
run(model, tokenizer, messages, get_user_message)
# run loops getting the user message and displaying the LLM response
def run(model, tokenizer, messages, get_msg_fn):
while True:
user_input, ok = get_msg_fn()
if not ok:
break
messages.append({"role": "user", "content": user_input})
user_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, tools=TOOLS)
response = generate_response(model, tokenizer, messages, user_prompt)
print(f"\nAssistant: {response}\n")
messages.append({"role": "assistant", "content": response})
def get_user_message():
try:
user_input = input("You (provide input or type quit): ").strip()
if user_input.lower() in {"exit", "quit"}:
return "", False
return user_input, True
except (EOFError, KeyboardInterrupt):
print()
return "", False
def generate_response(model, tokenizer, messages, user_prompt):
"""generate_response checks if the LLM believes the User query may need a tool call, if the first LLM response is not JSON just ignore"""
response = generate(model, tokenizer, prompt=user_prompt, max_tokens=10000)
try:
if "<|tool_call>" in response:
response_as_json = parse_gemma_tool_call(response)
else:
response_as_json = json.loads(response)
if response_as_json.get("name") == "read_file":
tool_input = response_as_json["parameters"]
tool_result = read_file(tool_input.get("file_path") or tool_input["filename"])
# prepare_gemma_tool_call_result follows the format https://ai.google.dev/gemini-api/docs/function-calling
messages.append({"role": "assistant", "content": "",
"tool_calls": [
{
"id": "read_file_1",
"function": {
"name": "read_file",
"arguments": response_as_json.get("parameters"),
},
}
],
})
messages.append({
"role": "tool",
"name": "read_file",
"content": tool_result,
"tool_call_id": "read_file_1",
})
# create a user friendly response based on the tool call result
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=10000)
except (json.JSONDecodeError, KeyError, TypeError):
pass
return response
def parse_gemma_tool_call(response):
"""parse_gemma_tool_call skips past the THINKING and handles strings like <|tool_call>call:read_file{file_path:<|"|>FILENAME<|"|>}<tool_call|>"""
_, _, after_thought = response.partition("<channel|>")
# print("DEBUG", after_thought)
tool_call = between(response, "<|tool_call>", "<tool_call|>")
if not tool_call.startswith("call:"):
return {}
name = between(tool_call, "call:", "{")
# hack way assuming a single parameter but could be file_path or filepath or filename or...
filepath = between(tool_call, ':<|"|>', '<|"|>}')
return {"name": name, "parameters": {"file_path": filepath}}
def between(text, left, right):
before, found_left, after_left = text.partition(left)
if not found_left:
raise ValueError(f"Missing left delimiter: {left!r}")
middle, found_right, after = after_left.partition(right)
if not found_right:
raise ValueError(f"Missing right delimiter: {right!r}")
return middle
def read_file(file_path):
try:
with open(file_path, "r") as f:
return f.read()
except Exception as e:
return f"Error reading file: {e}"
TOOLS = [
{
"type": "function",
"name": "read_file",
"function": "read_file",
"description": "Read the contents of a given file path. Use this to see everything in a file. Do not use this with directory names.",
"inputSchema": {
"type": "object",
"properties": {
"file_path": {
"type": "string",
"description": "The path of the file to read",
},
},
"required": ["file_path"],
},
},
]
if __name__ == "__main__":
main()