I. Theoretical Definition of Agency
Agency in Large Language Models (LLMs) is defined as the capacity to generate goal-directed actions rather than token predictions alone. Unlike standard "chat" models, which optimize strictly for next-token likelihood \( P(x_t \mid x_{<t}) \), an agentic system optimizes a multi-step objective in which intermediate tokens (reasoning traces, or "thoughts") serve as latent variables guiding external execution.
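To make the latent-variable framing concrete, the action distribution can be written as a marginal over reasoning traces. As a sketch (the symbols here are illustrative, not drawn from a specific source): $$ P_\theta(a \mid s) = \sum_{z} P_\theta(a \mid z, s)\, P_\theta(z \mid s) $$ where \( s \) is the current context, \( a \) the emitted action, and \( z \) a sampled reasoning trace. The model first samples a thought conditioned on the state, then conditions its action on both.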
The Agency Hypothesis formalizes this requirement: genuine intelligence demands grounding in an interactive environment. The system operates in a continuous control loop: $$ S_t \xrightarrow{\pi} A_t \xrightarrow{\text{env}} O_{t+1},\, R_{t+1} $$ where \( S \) is the State, \( A \) the Action, \( O \) the Observation, \( R \) the Reward/Feedback, and \( \pi \) the policy implemented by the LLM.
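A minimal sketch of this loop in Python, assuming a hypothetical `llm_policy` (maps state text to an action string) and an `environment` exposing `reset()` and `step()`; neither interface comes from a specific library:

```python
from dataclasses import dataclass, field

@dataclass
class Transcript:
    """Running record of the S -> A -> O, R loop."""
    events: list = field(default_factory=list)

def run_agent(llm_policy, environment, max_steps: int = 10) -> Transcript:
    """Drive the control loop S_t --pi--> A_t --env--> O_{t+1}, R_{t+1}."""
    transcript = Transcript()
    state = environment.reset()  # initial state S_0 as text
    for _ in range(max_steps):
        action = llm_policy(state)                     # pi: state -> action (an LLM call)
        obs, reward, done = environment.step(action)   # env executes A_t
        transcript.events.append((state, action, obs, reward))
        if done:
            break
        # Fold the new observation back into the context to form S_{t+1}.
        state = f"{state}\nACTION: {action}\nOBSERVATION: {obs}"
    return transcript
```

The key design point is that the state is textual: each iteration appends the action and its observed consequence, so the policy always conditions on the full interaction history.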
The Perception Loop
The perception loop converts unstructured environmental feedback (API errors, HTML DOMs, sensor logs) into compact textual representations that fit the model's context window without saturating the KV cache.
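As an illustration, one common tactic is to clip each observation to a fixed budget before appending it to the prompt. A sketch; the character budget and the head-and-tail truncation heuristic are assumptions, not a prescribed method:

```python
def compact_observation(raw: str, budget_chars: int = 2000) -> str:
    """Fit a raw observation (stack trace, DOM dump, log) into a fixed budget.

    Keeps the head and tail of the text, since errors and final states
    usually carry the most signal; the middle is elided with a marker.
    """
    if len(raw) <= budget_chars:
        return raw
    head = raw[: budget_chars // 2]
    tail = raw[-(budget_chars // 2):]
    return f"{head}\n... [{len(raw) - budget_chars} chars elided] ...\n{tail}"

# Example: compacting a long HTML DOM before it enters the context window.
dom = "<html>" + "<div>row</div>" * 5000 + "</html>"
print(compact_observation(dom, budget_chars=200))
```

Head-and-tail truncation is only one option; summarizing the observation with a cheaper model or extracting structured fields are common alternatives.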
Primary Sources & Further Reading
- Yao et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models.
- Shinn et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning.
- Wang et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models.
- Wu et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.
- Lilian Weng (2023). LLM Powered Autonomous Agents (Blog).
- OpenAI (2024). Function Calling and Tool Use Guides.