Post-transformer AI: what comes after classic LLMs
For the last few years we’ve been living in the “age of transformers” – the architecture behind ChatGPT, Claude, Gemini and a huge number of smaller models. Everything we usually call an LLM is, in practice, some variant of a transformer: a huge network that reads tokens, computes attention and predicts the next token.
But as models grow, we’re hitting a wall:
- the context window is still limited,
- models have no real long-term memory – only a temporary “context buffer”,
- learning is offline: you train for months and then freeze the weights,
- each jump in size demands brutal amounts of GPUs and power.
That’s why more and more teams are talking about “post-transformer” architectures – a new generation of models that keep what’s good about transformers (quality, scaling) but add built-in memory, online learning and much better efficiency.
One of the loudest attempts in this direction is Baby Dragon Hatchling (BDH) from Pathway – an architecture they openly label as post-Transformer, directly inspired by how the human brain works.

Quick refresher: what are transformers and where do they crack?
Back in 2017, transformers replaced old RNN/LSTM models and brought two key ideas:
- Attention – instead of reading tokens strictly in sequence like an RNN, the model in each layer looks at the whole context and decides what matters (sketched in code after this list).
- Parallelism – you can train huge models on thousands of GPUs because tokens are processed in batches.
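To make the attention idea concrete, here is a minimal single-head scaled dot-product attention sketch in plain NumPy. It is illustrative only – real transformers add learned projections, multiple heads and masking – but it shows the all-pairs comparison that makes attention both powerful and expensive:

```python
# Minimal scaled dot-product attention (illustrative, single head).
import numpy as np

def attention(Q, K, V):
    # Every query scores every key: an n-by-n matrix, which is why
    # attention cost grows quadratically with sequence length n.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted blend of values

rng = np.random.default_rng(0)
n, d = 8, 16                                        # 8 tokens, 16-dim head
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(attention(Q, K, V).shape)                     # (8, 16)
```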
That led to the LLMs we use today: large decoder models (GPT-style) trained on billions or trillions of tokens.
But transformers also have built-in weaknesses:
- Context is a sliding window – even if you’re writing a 300-page book, the model only sees a fixed budget of tokens at a time, and since attention cost grows quadratically with window size, the bigger the window, the more expensive it gets.
- No true memory – every query is like a fresh scene. The model doesn’t remember you as a person unless developers bolt on external memory systems.
- Catastrophic forgetting – if you try to fine-tune it on new data, the old knowledge can easily “smear” and degrade.
- Everything is one monolith – world knowledge, working memory and reasoning are all mixed into one parameter blob.
In practice that means:
- an LLM-powered NPC that plays a wise old wizard today might sound like a crypto shill in the same town tomorrow, because it has no stable personality or memory,
- an AI co-pilot in your IDE forgets what you did last week and starts from scratch every time,
- agents that seem smart in one long session – but the next day you have to re-stuff all the context back into the prompt.
That’s why more and more research is looking for architectures with explicit memory and structure – something between a brain and classical software.
Baby Dragon Hatchling: a “dragon” brain with built-in memory
Pathway is a small but ambitious Palo Alto team claiming they’ve built “a missing link between transformers and brain-like models”. Their Baby Dragon Hatchling (BDH) architecture is described as:
- a network of “neural particles” that communicate locally (instead of global attention over all tokens),
- a scale-free graph – some nodes are dense hubs, most have few connections, similar to real neural networks,
- working memory based on synaptic plasticity – connections are temporarily strengthened or weakened during reasoning,
- a clear separation between long-term parameters and short-term state.
In plain language:
- instead of letting every token “look at” every other token via expensive attention, BDH uses local interactions on a graph,
- memory is not just a stream of tokens, but changes in the connections themselves over a short time window – like “traces” in the brain (see the toy sketch after this list),
- the architecture is designed to be more interpretable – you can track which parts of the graph carry what kinds of information.
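To make the plasticity idea tangible, here is a toy Hebbian “fast weights” sketch. To be clear, this is not BDH’s published update rule – just a generic illustration of working memory living in connection strengths, kept separate from frozen long-term parameters:

```python
# Toy Hebbian "fast weights" -- NOT BDH's actual rule, just the general
# idea: working memory as temporary changes in connection strengths.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_slow = rng.standard_normal((d, d)) * 0.1   # long-term parameters (frozen)
W_fast = np.zeros((d, d))                    # short-term synaptic state

def step(x, W_fast, lr=0.5, decay=0.9):
    # Read through both the slow and fast pathways.
    h = np.tanh(W_slow @ x + W_fast @ x)
    # Hebbian trace: co-active pre/post pairs strengthen, old traces decay.
    W_fast = decay * W_fast + lr * np.outer(h, x)
    return h, W_fast

for _ in range(5):                           # a few "reasoning" steps
    x = rng.standard_normal(d)
    h, W_fast = step(x, W_fast)
# W_fast now carries traces of recent inputs: memory that lives in the
# connections, not in a token buffer.
```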
Early experiments suggest that BDH:
- can reach roughly GPT-2-level quality on language modeling and translation at similar parameter counts,
- preserves transformer-like scaling – more data and compute still bring better performance,
- opens doors for algorithmic reasoning and longer-term memory without exploding context windows.
We’re not talking about a model that will replace GPT-4/5 in production tomorrow, but as an architecture BDH is interesting because it:
- explicitly separates memory from the core model,
- tries to mimic biological principles (Hebbian learning, scale-free networks),
- is designed from day one for real-time adaptation and lifelong generalization.
What this means for games and NPCs
So why is this article under software-gaming rather than pure AI theory?
Because games might be one of the first places where post-transformer ideas become very visible.
Imagine a world where:
- an NPC truly remembers all your previous encounters – not just as a log line, but as opinions, loyalties and grudges,
- the game world has persistent collective memory – the city remembers who betrayed whom, who saved whom, who sold what to whom, and it all shapes future events,
- an “AI dungeon master” genuinely learns your play style, generates new quests on the fly and tweaks the rules to stay challenging but fair,
- a multiplayer server has a single persistent AI “world overseer” that builds the server’s history together with players over months and years.
With today’s transformer LLMs, we can hack some of this together (sketched in code below):
- store logs in external databases,
- push summaries of previous interactions into each new prompt,
- build custom server logic that glues memory around the model.
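As a sketch of what that gluing looks like in practice – the llm_complete function below is a hypothetical placeholder, not a real API:

```python
# Today's workaround, sketched: a stateless LLM with memory bolted on.
def llm_complete(prompt: str) -> str:
    # Hypothetical stand-in for whatever completion API you actually call.
    return "(model reply)"

class NPCMemoryGlue:
    """Hand-rolled memory glued around a stateless model."""
    def __init__(self, name: str):
        self.name = name
        self.log: list[str] = []                 # external event log

    def talk(self, player_line: str) -> str:
        recent = " | ".join(self.log[-5:])       # re-stuff recent history
        prompt = (f"You are {self.name}, an NPC. Recent history: {recent}\n"
                  f"Player says: {player_line}\nReply in character:")
        reply = llm_complete(prompt)
        self.log.append(f"player: {player_line} / npc: {reply}")
        return reply
```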
But it scales terribly and costs a fortune. Once you have thousands of NPCs and tens of thousands of players, prompt-engineering becomes a nightmare and GPU bills go through the roof.
Post-transformer architectures with:
- built-in working memory (short-term), and
- structured long-term memory (graphs or hierarchical states)
could lead to NPC systems that:
- are far cheaper per instance (each NPC runs a small “dragon mind”),
- naturally remember interactions without hand-crafted prompts,
- can learn over the lifetime of the game – an NPC becomes a “seasoned veteran” after 200 hours of server time, not after the next retrain in a datacenter.
For studios and indie teams, this enables a whole new class of games: living worlds that aren’t entirely pre-scripted, but evolve together with players.
What this means for software, agents and co-pilots
The same logic applies beyond gaming.
Today’s AI agents and co-pilots are mostly:
- statistical autocomplete on steroids – very powerful, but still “token by token”,
- lacking continuity – every CLI tool, IDE plugin or chatbot has to re-explain the context each time,
- with memory that’s “bolted on”: knowledge bases, vector search, manual integrations.
A post-transformer approach promises:
- Long-term project memory – a co-pilot that remembers what your codebase looked like for months, and knows why you did things a certain way, not just how.
- Real personalization – an assistant that adapts to your work style, habits, pace and even mood, because it actually learns and generalizes over time.
- Autonomous agents – instead of an “agent” that gets a fresh prompt every step and forgets the last one, you get an architecture that can run for days or weeks with a stable internal memory and “character”.
- Better planning and long-horizon goals – post-transformer designs are naturally better suited for multi-step planning, where decisions depend not only on the current prompt but also on long-term history and expected consequences.
For developers, this is effectively a shift from “LLM as a function” to:
AI as a long-running process inside your system – with its own memory, habits and history.
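In code, that shift might look like this – the names here (AgentProcess, step) are invented for illustration, not any real framework:

```python
# "LLM as a function" vs. "AI as a long-running process", sketched.
# All names are hypothetical -- this shows a shape, not a framework.
class AgentProcess:
    """An agent that persists across calls instead of starting fresh."""
    def __init__(self, goals: list[str]):
        self.goals = goals
        self.memory: list[str] = []          # survives between steps

    def step(self, observation: str) -> str:
        self.memory.append(observation)      # state accumulates over time
        # A post-transformer core would update its internal state here;
        # with today's models, this is where prompt-stuffing happens.
        return f"acting toward '{self.goals[0]}' with {len(self.memory)} memories"

agent = AgentProcess(goals=["keep the build green"])
for obs in ["tests failed", "fix pushed", "tests passed"]:
    print(agent.step(obs))
```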
It’s not just Dragon: other paths “beyond transformers”
BDH is not the only project trying to move beyond pure transformers.
In parallel, other lines of work are emerging:
- Hybrid models (Transformer + SSM) – models like IBM’s Bamba-9B combine transformers with state-space models (Mamba-style) to cut memory and KV-cache needs while preserving quality. For gaming and real-time apps that’s crucial: more throughput and lower latency mean you can run more AI instances on the same hardware.
- Fast reasoning models for the edge – smaller hybrid models focused on fast reasoning with 2–3× lower latency, designed for devices with limited memory (phones, consoles, VR headsets). That’s directly relevant for games: it lets AI partly live on the client, not only on the server.
- New learning paradigms – things like nested learning and related work try to tackle continual learning: how a model can keep learning throughout its lifetime without forgetting old skills. That’s key for agents that should run for months without a hard reset.
- Hierarchical memory networks – some architectures explicitly build a hierarchy of memories (short-term, mid-term and long-term), something like L1/L2/L3 cache but for knowledge and experiences (a toy version follows this list).
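A toy version of that cache analogy might look like the following – the tier sizes and promotion rules are invented for illustration, not taken from any published design:

```python
# Toy tiered memory inspired by the L1/L2/L3 cache analogy above.
# Sizes and eviction/consolidation rules are invented for illustration.
from collections import deque

class TieredMemory:
    def __init__(self):
        self.short = deque(maxlen=8)     # working memory: tiny, fast-moving
        self.mid = deque(maxlen=64)      # recent episodes
        self.long: list[str] = []        # consolidated knowledge

    def remember(self, item: str) -> None:
        if len(self.short) == self.short.maxlen:
            self.mid.append(self.short[0])   # demote the item about to fall off
        self.short.append(item)              # full deque evicts the oldest

    def consolidate(self) -> None:
        # Run periodically: promote mid-term episodes into long-term memory.
        self.long.extend(self.mid)
        self.mid.clear()
```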
Common theme:
Fewer flat, monolithic networks; more structure, memory and specialization.
What should a “normal” developer do with all this?
Realistically, as a dev you can’t just download “BDH 1.0” tomorrow and drop it in as a GPT replacement. But you can prepare your architecture:
- Separate the “brain” from memory already now – keep knowledge, context and history in clear layers (databases, graphs, event logs) instead of pushing everything into the prompt.
- Write architecture-agnostic AI code – build your AI adapters so you can swap GPT-X for a future post-transformer model (BDH, hybrid SSM, whatever) without rewriting the entire system (a minimal adapter sketch follows this list).
- Think in terms of agents, not single queries – design systems where AI has a “lifecycle”: state, goals, tasks, history. That maps naturally to architectures with built-in memory.
- In games, separate world logic from the “AI brain” – if you’re using prompt-based LLMs for NPCs today, keep world history in separate structures (graph DBs, event stores), so you can later plug in a post-transformer brain that actually uses those data structures efficiently.
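A minimal version of that adapter boundary, assuming Python’s typing.Protocol – the method names are invented:

```python
# Architecture-agnostic AI boundary: the game/app talks to this interface,
# never to a specific model. Method names are invented for illustration.
from typing import Protocol

class MindAdapter(Protocol):
    def observe(self, event: str) -> None: ...   # feed world state/history
    def act(self, situation: str) -> str: ...    # ask for a decision/line

class PromptLLMMind:
    """Today: a stateless LLM plus external memory behind the interface."""
    def __init__(self):
        self.history: list[str] = []

    def observe(self, event: str) -> None:
        self.history.append(event)

    def act(self, situation: str) -> str:
        return f"[reply to '{situation}' given {len(self.history)} events]"

# Later, a post-transformer "brain" (BDH-style, hybrid SSM, ...) can
# implement the same Protocol without the rest of the system changing.
```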
Conclusion
Transformers gave us impressive LLMs – but also a ceiling: enormous models, expensive GPUs, limited context and essentially no real memory.
Post-transformer architectures like Baby Dragon Hatchling are trying to make the next leap – AI that truly learns and remembers over time, closer to how a brain works.
For those of us building games, apps and AI agents, the next 5–10 years could bring:
- NPCs with real personality and history, not just an 8k-token prompt,
- co-pilots that remember your projects for months instead of a single session,
- agents that behave like long-running processes, not just a complete(prompt) call.
We may stay in the transformer world for a while longer – but it’s already clear that the next generation of AI won’t just be a bigger model with more parameters. It will be a different kind of mind.