
Streaming Agent Consciousness

“Simplicity is the ultimate sophistication.”
— Leonardo da Vinci

I rotate free Gemini API keys because I can’t afford LLM credits. But the pattern seemed absurd enough that I had to fix it regardless of cost.

Since ChatGPT launched, we’ve treated language models like stateless functions. Send full context, get response, hang up. Need to continue? Call back and repeat the entire conversation from the beginning.

This is fucking insane.

It’s like calling someone, having a conversation, hanging up mid-sentence, then calling back and starting over just to finish your thought. Every time an agent needs a tool, traditional frameworks restart the entire reasoning process. The longer your reasoning chain, the more expensive and slower each step becomes.

This bothered me more than it should have.

From XML Tags to Delimiters

I started experimenting with ways to let the language model signal its own state changes. First attempt was XML tags. Messy parsing, collision risks, too much ceremony. Why terminate sections when we only need state changes?

The delimiter system emerged:

§THINK: Let me plan this out...
§CALLS: [{"name": "search", "query": "..."}]
§YIELD:
[Tool execution happens here]
§RESPOND: Here's what I found...
§YIELD:

Simple. Unambiguous. Zero collision risk. The §YIELD delimiter signals state-machine transitions: it fires when tool calls finish streaming or when the LLM ends its turn. The language model drives its own execution.
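The protocol parses line by line with no nesting to track. Here's a minimal sketch, assuming one delimiter per line; `parse_stream` and the `(state, content)` event tuples are illustrative names, not Cogency's actual API:

```python
# Hypothetical parser for the delimiter protocol above.
# The delimiter names come from the text; everything else is a sketch.
DELIMITERS = ("§THINK:", "§CALLS:", "§RESPOND:", "§YIELD:")

def parse_stream(text):
    """Split a raw model stream into (state, content) events."""
    events = []
    state, buffer = None, []

    def flush():
        # Emit the section accumulated since the last delimiter.
        if state is not None:
            events.append((state, "".join(buffer).strip()))

    for line in text.splitlines():
        matched = next((d for d in DELIMITERS if line.startswith(d)), None)
        if matched:
            flush()
            state = matched.rstrip(":").lstrip("§").lower()
            buffer = [line[len(matched):]]
        else:
            buffer.append("\n" + line)  # continuation of the current section
    flush()
    return events
```

Because each delimiter both opens a section and implicitly closes the previous one, there are no closing tags to mismatch, which is exactly the advantage over XML.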

But I was still thinking in HTTP terms. Pause meant ending the conversation and starting fresh.

Then I found the missing piece.

On my birthday, researching streaming approaches, I discovered Gemini’s Live API and OpenAI’s Realtime API. WebSocket connections with bidirectional streaming. Pause the language model mid-stream, execute tools, inject results, resume the same conversation thread. Context preserved in the session.

Built a prototype that weekend. The agent started thinking, signaled when it needed tools, paused while tools executed, received results, and continued thinking in the same session. No restart. No replay. Holy shit, it worked.

What This Actually Means

Traditional frameworks replay full context every tool cycle. Streaming consciousness pauses/resumes the same WebSocket conversation.

Token efficiency scaling (see docs/proof.md for the mathematical analysis): traditional frameworks replay the growing context every cycle, so total tokens processed scale quadratically with the number of tool cycles, O(n²); streaming scales linearly, O(n).
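A back-of-envelope version of that scaling, with illustrative numbers (the `base` and `step` values here are made up, not taken from docs/proof.md):

```python
# Assume each tool cycle adds roughly `step` new tokens on top of a
# `base`-token prompt.
def traditional_tokens(n_cycles, base=1000, step=500):
    # Every cycle replays the entire history so far: a growing series.
    return sum(base + i * step for i in range(1, n_cycles + 1))

def streaming_tokens(n_cycles, base=1000, step=500):
    # The session holds context server-side; only new tokens are sent.
    return base + n_cycles * step

# At 20 tool cycles: 125,000 tokens replayed vs. 11,000 streamed.
print(traditional_tokens(20), streaming_tokens(20))  # → 125000 11000
```

The gap widens with every additional cycle, which is why long reasoning chains are where the pattern pays off most.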

The Architecture

# Traditional: context grows with each tool cycle
agent.run("Complex task")
# [Full context] → tool call → [full context + tool result] → tool call → ...
# Each step more expensive than the last

# Streaming consciousness: context stays constant
agent = Agent()
async for event in agent.stream("Complex task"):
    ...  # stream pauses → tools execute → stream resumes
    # Same session, same context, constant token usage

Dual-mode compatibility makes it practical. If WebSocket streaming isn’t available, Cogency falls back to traditional HTTP cycles automatically. Same API, different transport layer.
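A minimal sketch of that dispatch, with hypothetical names (`websocket_available`, `_stream_ws`, `_replay_http`) standing in for Cogency's real internals:

```python
# Sketch of the dual-mode idea: one public API, transport chosen at runtime.
class Agent:
    def __init__(self, websocket_available=True):
        self.websocket_available = websocket_available

    async def stream(self, task):
        if self.websocket_available:
            # Pause/resume within one persistent session.
            async for event in self._stream_ws(task):
                yield event
        else:
            # Fall back to traditional context-replay cycles.
            async for event in self._replay_http(task):
                yield event

    async def _stream_ws(self, task):
        # Real implementation would hold a WebSocket session open,
        # pausing on §YIELD and resuming after tool execution.
        yield ("transport", "websocket")

    async def _replay_http(self, task):
        # Real implementation would resend full context each cycle.
        yield ("transport", "http")
```

The caller's `async for` loop is identical either way; only the cost profile changes.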

The First Principles Approach

I didn’t study existing frameworks and optimize them. The request/response pattern felt misaligned with how language models actually work: continuous token streams with state transitions.

LangChain simulates state through context management over stateless HTTP. I built stateless orchestration with stateful streams: the agent has no in-memory state; all state lives in the WebSocket session or is assembled from storage on demand. The language model drives its own execution flow instead of being driven by framework polling.

Beyond Token Efficiency

Streaming consciousness enables real-time visibility. You see thinking, reasoning steps, exactly where agents get stuck. Debug information becomes natural. Performance bottlenecks become obvious.

Plus, agents can ask clarifying questions mid-task without breaking context. Multi-step reasoning becomes genuinely conversational rather than simulated dialogue through context replay.