
LangGraph recursion limit of 25 reached: the actual fix (not what the docs say)

LangGraph throws GraphRecursionError when your agent runs past the recursion limit (25 steps by default). Raising the limit just hides the bug. Here is the diagnostic procedure that finds the real cause, and the 5 patterns that resolve 90% of cases.


The symptom

Your LangGraph agent throws this:

GraphRecursionError: Recursion limit of 25 reached without hitting a stop condition.

(or 20, or whatever your recursion_limit is set to; 25 is the default). You raise the limit. It throws again at 50. You raise it to 100. It still throws, and now your token bill has tripled.

What is actually happening

recursion_limit is not a bug. It is a tripwire that is doing exactly its job: warning you that your graph is not converging. Raising the limit hides the symptom and lets the loop run longer before it finally errors out, which costs you tokens and makes the bug harder to diagnose.

Recursion limit errors are almost always caused by one of five patterns. All five are structural problems in how your graph is built, not problems with the limit itself.

The 3 things people try first that do not fix it

  1. Raising the limit. Hides the bug, costs more, fails again later.
  2. Switching to a "stronger" model. The model is not the loop. The graph topology is.
  3. Adding time.sleep or retry delays. Does not change behavior — your nodes are still revisiting the same state.

The 5 actual causes

Cause 1: A node returns to a previous state with no progress

The most common cause. Your graph has node A → B → C, and node C conditionally returns to A. The condition for "we are done" never fires because nothing in the state changed between iterations.

Diagnostic: log the state hash at the start of every node invocation:

import hashlib, json

def state_fingerprint(state: dict) -> str:
    return hashlib.md5(
        json.dumps(state, sort_keys=True, default=str).encode()
    ).hexdigest()[:8]

def my_node(state):
    print(f"[my_node] entering, fp={state_fingerprint(state)}")
    ...

If you see the same fingerprint appearing more than twice, the graph is going in circles without producing new information. That is the loop.

Fix: add a counter to the state, increment it on each loop entry, and exit when it exceeds a threshold:

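def my_node(state):
    # Compute the new count and return it as a state update;
    # mutating `state` in place is not what gets written back.
    loop_count = state.get("loop_count", 0) + 1
    if loop_count > 3:
        return {"loop_count": loop_count, "next": "END", "reason": "max_iterations"}
    # Normal work goes here; include "loop_count": loop_count in the returned update.
    ...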

Cause 2: The agent keeps calling the same tool hoping for a different result

You see this in the trace: the agent calls search_database(query="X") and gets a result. The result is not what it wanted. So it calls search_database(query="X") again. Same arguments. Same result. It tries a third time. Then a fourth.

This is the LLM's analog of clicking "refresh" hoping a website fixes itself.

Diagnostic: log every tool call with its arguments. If you see identical (tool, arguments) tuples appearing more than once consecutively, you have this pattern.
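If you already collect tool calls as a list, the check can be automated. The sketch below assumes each logged call is a (tool_name, args) pair with JSON-serializable args; call_key and consecutive_repeats are hypothetical helper names.

```python
import json
from itertools import groupby

def call_key(tool_name: str, args: dict) -> str:
    """Canonical string for a tool call; sort_keys makes argument order irrelevant."""
    return f"{tool_name}:{json.dumps(args, sort_keys=True, default=str)}"

def consecutive_repeats(calls):
    """Return (tool_name, args, run_length) for every run of identical back-to-back calls."""
    repeats = []
    for _, group in groupby(calls, key=lambda call: call_key(*call)):
        run = list(group)
        if len(run) > 1:
            tool_name, args = run[0]
            repeats.append((tool_name, args, len(run)))
    return repeats

# Example log: the same search three times in a row, then a different call.
log = [
    ("search_database", {"query": "X"}),
    ("search_database", {"query": "X"}),
    ("search_database", {"query": "X"}),
    ("fetch_record", {"id": 1}),
]
print(consecutive_repeats(log))
```

Any non-empty result means the agent repeated itself with identical arguments.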

Fix: cache tool calls by (tool_name, arguments_hash) and short-circuit duplicate invocations:

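import hashlib, json

def cached_tool_node(state):
    # Use a stable string key: built-in hash() of a str differs between processes,
    # and tuple keys do not survive JSON serialization of the state.
    args_hash = hashlib.md5(
        json.dumps(state["next_args"], sort_keys=True, default=str).encode()
    ).hexdigest()
    key = f"{state['next_tool']}:{args_hash}"
    cache = state.get("tool_cache", {})
    if key in cache:
        # Force the planner to try something different
        return {
            "messages": state["messages"] + [
                {"role": "system", "content": f"Tool {state['next_tool']} with these args was already called. Result: {cache[key]}. Try a different approach."}
            ]
        }
    # call_tool is your own tool dispatcher
    result = call_tool(state["next_tool"], state["next_args"])
    # Return only the changed field instead of mutating and returning the whole state
    return {"tool_cache": {**cache, key: result}}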

Cause 3: The final-answer state is not being recognized

Your agent did reach a valid answer at step 7. But your conditional edge that routes to END checks for a field that is never populated, so the graph keeps looping past it.

Diagnostic: print the full state right before your "should we end?" routing function evaluates. Check whether the field it is looking for is actually set.
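One way to do that without touching the routing logic is to wrap the router. This is a minimal sketch: debug_router is a hypothetical helper, and the "END" / "continue" labels and the final_answer key are assumptions standing in for whatever your own router uses.

```python
from functools import wraps

def debug_router(router_fn, watched_key):
    """Wrap a routing function to print what it sees and what it decides."""
    @wraps(router_fn)
    def wrapper(state):
        decision = router_fn(state)
        # Keys that actually hold a truthy value right now
        populated = sorted(k for k, v in state.items() if v)
        print(
            f"[{router_fn.__name__}] watched={watched_key!r} "
            f"value={state.get(watched_key)!r} "
            f"populated_keys={populated} -> {decision}"
        )
        return decision
    return wrapper

def should_end(state):
    return "END" if state.get("final_answer") else "continue"

traced_should_end = debug_router(should_end, "final_answer")

# The answer landed in "response", so the router never sees it.
traced_should_end({"response": "42", "messages": ["..."]})
```

If the watched value stays None while another key such as response shows up under populated_keys, you have found the mismatch.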

Fix: most teams check state.get("final_answer") but their final-answer node sets state["response"] or state["output"]. Pick a single field name, write it once, and have your end-condition check exactly that field. Constants prevent this:

ANSWER_KEY = "final_answer"

def answer_node(state):
    return {ANSWER_KEY: produce_answer(state)}

def should_end(state):
    return "END" if state.get(ANSWER_KEY) else "continue"

Cause 4: A subgraph is recursing inside the parent's count

The recursion limit counts every step (super-step) the graph executes, including steps taken inside subgraphs. A subgraph that loops 15 times internally will eat 15 of the parent's 25-step budget before the parent has done anything.

Diagnostic: check whether the loop is inside a subgraph node. If so, the subgraph has its own state that needs its own exit condition — the parent's limit will not save you.

Fix: give the subgraph its own recursion_limit and explicit termination conditions, and make the subgraph return a structured failure if it cannot resolve within budget:

from langgraph.errors import GraphRecursionError

subgraph_app = subgraph.compile()
parent_app = parent_graph.compile()

def subgraph_node(state):
    try:
        result = subgraph_app.invoke(
            state,
            config={"recursion_limit": 10},  # Budget independent of parent
        )
        return result
    except GraphRecursionError:
        # Structured failure the parent can route on, instead of a crash
        return {"subgraph_failed": True, "reason": "subgraph_budget_exhausted"}

Cause 5: The planner has no exit condition in its prompt

The planner LLM was never instructed when to stop. So it keeps proposing next steps forever. The graph faithfully executes whatever the planner proposes.

Diagnostic: read your planner system prompt. Does it explicitly say under what circumstances to return "DONE" / "FINISH" / "END"? If not, you have this bug.

Fix: add explicit termination conditions to the planner prompt:

You plan the next step. Return one of:
- {"next": "tool_X", "args": {...}} — to call a tool
- {"next": "answer", "content": "..."} — when you have enough information to answer
- {"next": "refuse", "reason": "..."} — when the request is out of scope

CRITICAL: return "answer" or "refuse" within at most 5 tool calls.
If after 5 tool calls you still cannot answer, return "refuse" with reason="insufficient_information".
Do NOT keep calling tools indefinitely.
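A prompt is an instruction, not a guarantee, so it can help to enforce the same cap in code. The sketch below assumes the planner's output has already been parsed into a dict of the shape shown in the prompt, and that you track the number of tool calls yourself. enforce_planner_budget and MAX_TOOL_CALLS are hypothetical names.

```python
MAX_TOOL_CALLS = 5  # mirrors the cap stated in the planner prompt

def enforce_planner_budget(decision: dict, tool_calls_so_far: int) -> dict:
    """Override the planner if it asks for another tool call after the budget is spent."""
    if decision.get("next") in ("answer", "refuse"):
        return decision  # terminal decisions always pass through
    if tool_calls_so_far >= MAX_TOOL_CALLS:
        return {"next": "refuse", "reason": "insufficient_information"}
    return decision

# The planner wants a sixth tool call; the guard turns it into a refusal.
print(enforce_planner_budget({"next": "tool_X", "args": {}}, tool_calls_so_far=5))
```

With this guard in place the loop ends even when the model ignores the prompt.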

The full diagnostic procedure

If you hit GraphRecursionError, run this in order:

  1. Do not raise the limit. Resist this. It costs you tokens and hides the bug.
  2. Log state fingerprints at every node entry (Cause 1 check).
  3. Log tool calls with arguments (Cause 2 check).
  4. Check your end-condition field name matches what your answer node actually sets (Cause 3 check).
  5. Check whether the loop is inside a subgraph (Cause 4 check).
  6. Read your planner prompt for explicit exit conditions (Cause 5 check).

In nine production audits out of ten, one of those five is the cause. The fix is structural, not configurational.

What "good" looks like

A well-structured LangGraph agent has:

  • A hard step counter in state, with an explicit cap (3-7 for most workflows)
  • Tool-call caching for idempotent operations
  • Exactly one field name used to signal completion, used consistently
  • Subgraphs with their own independent budgets and structured failure returns
  • A planner prompt that explicitly names the exit conditions

If you do not have all five, the recursion_limit error is not going away — it is just biding its time.
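As a rough sketch of how three of those pieces (the step counter, the single completion field, and the structured subgraph failure) can meet in one routing function: the field names, the cap of 5, and the "end" / "continue" labels are illustrative assumptions, not LangGraph API.

```python
from typing import Optional, TypedDict

MAX_LOOPS = 5                # hard cap on planner iterations (assumed value)
ANSWER_KEY = "final_answer"  # the one field that signals completion

class AgentState(TypedDict, total=False):
    messages: list
    loop_count: int
    tool_cache: dict
    final_answer: Optional[str]
    subgraph_failed: bool

def route(state: AgentState) -> str:
    """Decide whether the graph should stop; every exit path is explicit."""
    if state.get(ANSWER_KEY):
        return "end"  # we have an answer
    if state.get("subgraph_failed"):
        return "end"  # a subgraph ran out of budget
    if state.get("loop_count", 0) >= MAX_LOOPS:
        return "end"  # hard step cap reached
    return "continue"

print(route({"loop_count": 2}))                           # continue
print(route({"loop_count": 5}))                           # end
print(route({"loop_count": 1, "final_answer": "done"}))   # end
```

Every branch that returns "end" corresponds to one named exit condition, so a run that never terminates has to be missing one of them.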

When to get help

If you have run the diagnostic above and cannot identify which of the five patterns is causing your loop, the issue is usually that multiple patterns are intersecting. That is harder to diagnose by reading code; it requires tracing actual runs.

We do this on the free 30-minute audit — screen-share your traces, we identify the exact loop pattern within the call, and you have a fix path within 24 hours.
