The Scaling Debate
There is a prevailing assumption in artificial intelligence research that scale is the primary driver of capability. Larger models, more parameters, more data, more compute. This philosophy has delivered remarkable — and at times genuinely surprising — results. Large language models can now draft legal arguments, write functional code, pass medical licensing exams, and exhibit emergent behaviors like chain-of-thought reasoning that were never explicitly trained for (Wei et al., 2022).
We do not dismiss these achievements. They are real, and they demand serious engagement from anyone arguing that scaling alone is insufficient.
Yet we believe the trajectory of ever-increasing scale bears a troubling resemblance to a strategy that evolution already tested — and ultimately moved beyond. To understand why, we need to look at what biology actually optimized for over 500 million years. The answer is not size. It is structure. But critically, it is also not structure alone. The story is more nuanced than either camp tends to acknowledge.
What Elephants Teach Us About the Limits of Scale
Consider the African elephant. Its brain weighs approximately 4.8 kilograms — roughly three times the mass of the human brain. It contains an estimated 257 billion neurons, more than triple our own count of 86 billion (Herculano-Houzel et al., 2014). By any naive scaling metric, elephants should be the dominant intellectual species on this planet.
They are not.
The reason lies in a distinction that may carry implications for AI architecture: intelligence is not a simple function of how many computational units you have. It is a function of where those units are concentrated and how they are organized.
In the elephant brain, the vast majority of neurons — approximately 251 billion of the 257 billion total — reside in the cerebellum, the region responsible for motor coordination, balance, and the fine control of a 5,000-kilogram body and a remarkably dexterous trunk (Herculano-Houzel et al., 2014). Only about 5.6 billion neurons populate the elephant's cerebral cortex, the structure most closely associated with abstract reasoning, planning, and flexible problem-solving.
Humans, by contrast, pack roughly 16 billion neurons into the cerebral cortex (Azevedo et al., 2009). We have fewer neurons overall, but nearly three times as many in the region most critical for higher cognition.
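The contrast is easy to quantify. A back-of-the-envelope calculation from the counts cited above (a sketch using approximate published figures, nothing more):

```python
# Approximate neuron counts in billions, from Herculano-Houzel et al. (2014)
# and Azevedo et al. (2009), as cited in the text.
elephant_total, elephant_cortex = 257.0, 5.6
human_total, human_cortex = 86.0, 16.0

# Fraction of each brain's neurons allocated to the cerebral cortex.
elephant_frac = elephant_cortex / elephant_total
human_frac = human_cortex / human_total

print(f"Elephant: {elephant_frac:.1%} of neurons in cortex")   # ~2.2%
print(f"Human:    {human_frac:.1%} of neurons in cortex")      # ~18.6%
print(f"Human cortical advantage: {human_cortex / elephant_cortex:.1f}x")
```

Despite a threefold deficit in total neurons, humans allocate nearly an order of magnitude larger fraction of them to the cortex — the allocation, not the total, is what tracks cognition.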
The lesson appears straightforward: total parameter count is not what produces intelligence. Architectural allocation matters enormously. But we should be careful not to overextend this analogy. The human cortex is also significantly larger than that of most other primates. Scale and structure are not opposing forces — they are likely multiplicative. The question is which variable offers greater marginal returns, and at what point.
The Corvid Paradox: Extreme Density as an Alternative Path
If the elephant case complicates the argument for pure scale, the corvid case introduces something more provocative — evidence that radical miniaturization, paired with the right structural principles, can achieve cognitive performance far beyond what raw size would predict.
The brain of a New Caledonian crow weighs roughly 7.5 grams. A raven's, about 15 grams. These are brains smaller than a shelled walnut. And yet corvids demonstrate tool manufacturing, causal reasoning, prospective planning, mirror self-recognition, and social deception — capacities that place them cognitively alongside the great apes in several domains (Emery & Clayton, 2004; Taylor et al., 2009).
How is this possible?
Research by Olkowicz et al. (2016), published in the Proceedings of the National Academy of Sciences, revealed that avian brains pack neurons at densities far exceeding those found in mammalian brains of equivalent size. A macaw's forebrain contains approximately 1.9 billion neurons — comparable to a macaque monkey, whose brain is an order of magnitude larger by mass. The neurons in bird brains are dramatically smaller and more tightly packed, enabling a level of computational density that mammals never evolved.
This is not a marginal difference. It represents a fundamentally different engineering solution to the same problem: how to maximize cognitive capability within the constraints of an energy and weight budget. Birds solved it not by building larger brains, but by building denser, more efficient ones.
However, a critical caveat: corvids are not as intelligent as humans. They outperform expectations given their brain size, but they do not match human capabilities in language, cumulative culture, or abstract mathematics. The densest known brain is not the smartest known brain. This suggests that density alone is also insufficient — that true general intelligence may require both efficient architecture and sufficient scale, along with specific structural innovations (such as the six-layered neocortex) that remain poorly understood.
The Implications for Artificial Intelligence
The parallels to contemporary AI research are worth examining carefully — with the understanding that biological analogies carry both insight and risk.
The dominant strategy in machine learning today is, in essence, the elephant strategy. We are scaling models to hundreds of billions — soon trillions — of parameters. We are training them on increasingly vast corpora. We are drawing megawatts of power during single training runs that cost tens of millions of dollars. And we are achieving genuinely impressive results.
But there are empirical signals that brute-force scaling may face diminishing returns. Current LLMs require exposure to orders of magnitude more data than any human child encounters during language acquisition. A child achieves fluency from roughly 10 to 100 million words of input over the first five years of life (Hart & Risley, 1995; Gilkerson et al., 2017). GPT-class models train on trillions of tokens. This is not a difference of degree. It is a difference of kind, and it points to a gap in learning efficiency that scaling alone has not closed.
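The scale of that gap is worth making explicit. A rough calculation (a sketch only: the tokens-per-word ratio and the LLM corpus size are round-number assumptions consistent with the figures in the text, not measurements):

```python
# Order-of-magnitude comparison of language-learning data budgets.
child_words = 100e6           # upper end of the 10-100M word estimate
tokens_per_word = 1.3         # rough English tokens-per-word ratio (assumed)
child_tokens = child_words * tokens_per_word

llm_tokens = 10e12            # ~10T tokens, typical of recent GPT-class runs (assumed)

gap = llm_tokens / child_tokens
print(f"LLM / child data ratio: ~{gap:,.0f}x")
```

Even with generous assumptions in the model's favor, the ratio comes out to tens of thousands — four to five orders of magnitude more linguistic input than a child needs.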
Moreover, despite their fluency, LLMs continue to struggle with tasks that require systematic compositional generalization — applying known rules to novel combinations. Chollet (2019) argued in "On the Measure of Intelligence" that current benchmarks conflate skill (narrow task performance) with intelligence (broad generalization from minimal experience), and that LLMs excel at the former while largely failing at the latter. The ARC (Abstraction and Reasoning Corpus) benchmark, designed to test fluid intelligence, remains largely unsolved by LLMs even as they dominate natural language benchmarks.
The counterargument deserves honest engagement. Proponents of scaling — including researchers at OpenAI, Google DeepMind, and Anthropic — point to a track record of emergent capabilities that appeared unpredicted at scale. In-context learning, few-shot adaptation within the context window, and rudimentary reasoning chains were not explicitly designed. They emerged from scale (Brown et al., 2020; Wei et al., 2022). The scaling laws documented by Kaplan et al. (2020) and Hoffmann et al. (2022) show smooth, predictable improvements in loss as a function of compute, with no obvious ceiling yet reached.
This is genuine evidence, and it would be intellectually dishonest to dismiss it. The question is whether these emergent capabilities represent steps toward general intelligence, or increasingly impressive approximations of it that will asymptote short of the real thing. That question remains empirically open.
From Brute Force to Biological Principles: A Research Agenda
Rather than claiming that scaling is wrong, we propose that it is incomplete — and that the field would benefit from parallel investment in architectures inspired by the structural principles biology has already validated. We identify three specific research directions, each grounded in existing (if early-stage) work.
1. Structural Specialization Over Homogeneous Scaling
Biological brains are not uniform arrays of identical units. They are heterogeneous systems with specialized regions, each optimized for different types of computation. The cortical column, the hippocampal circuit for episodic memory, the entorhinal grid cells for spatial reasoning, the cerebellar microcircuit for predictive motor control — these are qualitatively different architectures coexisting within a single system (Mountcastle, 1997; Moser et al., 2008).
Current transformer-based models rely on repeated application of the same attention mechanism across all layers. Mixture-of-experts architectures (Shazeer et al., 2017; Fedus et al., 2022) represent a step toward heterogeneity by routing different inputs to different specialist sub-networks. But current MoE models differ from biological specialization in a crucial way: their experts are architecturally identical and differ only in learned weights. True structural specialization would involve qualitatively different computational modules — some optimized for sequential reasoning, others for spatial representation, others for episodic recall — integrated within a single system.
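The routing idea above can be sketched in a few lines of plain Python. This is illustrative only: real MoE layers use learned gating networks over high-dimensional activations, and the "experts" here are toy functions standing in for full sub-networks — but it makes the structural point that the experts differ only in parameters, not in architecture:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy "experts": architecturally identical functions differing only in
# their parameters -- mirroring how current MoE experts differ only in weights.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
    lambda x: [v * v for v in x],
]

def moe_layer(x, gate_weights, k=2):
    """Route input x to the top-k experts by gate score, then mix their outputs."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mix = softmax([scores[i] for i in topk])   # renormalize over selected experts
    outs = [experts[i](x) for i in topk]
    return [sum(m * o[d] for m, o in zip(mix, outs)) for d in range(len(x))]

gate = [[0.5, -0.2], [0.1, 0.9], [-0.3, 0.4], [0.7, 0.2]]  # one row per expert
y = moe_layer([1.0, 2.0], gate, k=2)
```

Only k of the experts run per input, which is why MoE buys capacity without proportional compute — but swapping in qualitatively different modules (a planner here, a spatial map there) would require changing the expert functions themselves, which is exactly what current MoE models do not do.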
2. Computational Density Over Raw Parameter Count
The corvid brain demonstrates that cognitive capability scales more closely with neuron density in task-relevant regions than with total brain volume. The analogous challenge for AI: can we design architectures that achieve equivalent capability with radically fewer parameters?
There is preliminary evidence that this is possible. Sparse attention mechanisms (Child et al., 2019), knowledge distillation (Hinton et al., 2015), and structured pruning have shown that large fractions of parameters in trained models are redundant. The Chinchilla findings (Hoffmann et al., 2022) already demonstrated that smaller models trained on more data can outperform larger models trained on less — a finding that reframes the scaling question around efficiency rather than raw size.
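The Chinchilla result reduces to a simple allocation rule. Using the commonly quoted approximations C ≈ 6ND (training FLOPs from parameter count N and token count D) and a compute-optimal budget of roughly 20 tokens per parameter — both approximations from the scaling-laws literature, not exact constants — the optimal model for a fixed budget can be sketched as:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal (N, D) under C ~= 6*N*D and D ~= 20*N (rough approximations)."""
    # Substituting D = 20*N into C = 6*N*D gives C = 120*N^2, so N = sqrt(C/120).
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a ~5.8e23 FLOP budget, roughly Chinchilla-scale.
n, d = chinchilla_optimal(5.76e23)
print(f"params ~{n / 1e9:.0f}B, tokens ~{d / 1e12:.1f}T")
```

For that budget the rule yields roughly a 70B-parameter model trained on about 1.4T tokens — close to Chinchilla's actual configuration, and far smaller than the models that budget was previously spent on.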
More radical approaches — including spiking neural networks (Maass, 1997), neuromorphic computing on platforms like Intel's Loihi 2 and IBM's NorthPole, and energy-based models (LeCun, 2022) — attempt to achieve biological levels of computational efficiency. These approaches have not yet matched transformer performance on standard benchmarks, and intellectual honesty requires acknowledging this directly. But benchmark performance on current tasks may not be the right metric if the goal is architectures capable of efficient generalization rather than massive memorization.
3. World Models and Causal Learning Over Statistical Compression
Perhaps the most critical gap between biological and artificial intelligence lies in how each system learns. Biological systems construct generative models of the world — internal simulations that support prediction, counterfactual reasoning, and rapid generalization from sparse data (Friston, 2010; Lake et al., 2017). A child who watches a ball roll behind a screen expects it to emerge on the other side. This expectation reflects a causal model of object permanence, learned from a handful of experiences.
LLMs perform next-token prediction over massive corpora — a form of statistical compression that produces remarkable fluency but operates over surface-level patterns rather than causal structure. They can describe what typically follows what, but they do not build the kind of manipulable world models that allow biological agents to reason about novel situations from first principles.
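The "surface statistics" point can be made concrete with the simplest possible case: a bigram model that predicts the next word purely from co-occurrence counts. This is a deliberately minimal sketch — real LLMs condition on long contexts with learned representations — but the learning signal is the same kind of statistical association:

```python
from collections import Counter, defaultdict

corpus = ("the ball rolled behind the screen . "
          "the ball rolled behind the couch . "
          "the cat ran behind the screen .").split()

# Count next-word frequencies for each word -- pure surface statistics.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict(word):
    """Most frequent continuation; no notion of objects, causes, or permanence."""
    return bigrams[word].most_common(1)[0][0]

print(predict("behind"))  # "the" -- what typically follows, not what will happen
```

The model can say what word tends to come next, but it has no representation of the ball, the screen, or the occlusion event — nothing it could use to expect the ball to re-emerge.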
Yann LeCun's proposed Joint Embedding Predictive Architecture (JEPA) represents one concrete attempt to address this gap — learning abstract representations by predicting in embedding space rather than pixel space, potentially enabling the kind of world modeling that current architectures lack (LeCun, 2022). DeepMind's work on Dreamer (Hafner et al., 2020) and related model-based reinforcement learning systems demonstrates that learned world models can dramatically improve sample efficiency in specific domains. Lake et al. (2017) have argued for hybrid architectures that combine neural network pattern recognition with structured probabilistic programs capable of causal inference.
None of these approaches has yet produced a system that matches LLM performance at scale. That is the uncomfortable truth. The alternative paradigm we are advocating for does not yet have a flagship result. But we believe the theoretical foundations are strong enough — and the limitations of pure scaling are becoming visible enough — to justify substantial parallel investment.
An Honest Assessment
We want to be precise about what we are claiming and what we are not.
We are not claiming that large language models are useless. They are among the most transformative tools ever built. They will reshape industries, augment human productivity, and fundamentally change how we interact with information. The calculator transformed mathematics without understanding a single equation. LLMs may represent a similar inflection — a tool that operates powerfully in the space of language and reasoning, regardless of whether it possesses either in any deep sense.
We are not claiming that scaling is irrelevant. The human brain is both efficiently structured and large. Scale clearly matters. The question is whether scale alone, applied to architecturally homogeneous systems trained through statistical prediction, can cross the threshold into genuine general intelligence.
What we are claiming is this: comparative neuroscience offers strong evidence that the relationship between computational resources and cognitive capability is mediated by architecture in ways that current AI systems do not fully exploit. The corvid brain achieves primate-level cognition at a fraction of the size through radical gains in density and organization. The elephant brain demonstrates that scale without appropriate structural allocation produces diminishing cognitive returns.
If these biological lessons translate — and we acknowledge that biological analogies can mislead as easily as they can illuminate — then the field may benefit from diversifying its bets. The scaling paradigm should continue. But alongside it, we believe there is a compelling case for sustained, well-funded research into architectures that prioritize computational density, structural heterogeneity, and efficient causal learning.
The crows have shown us that intelligence does not require brute force. The open question — the truly hard question — is whether we can engineer what evolution stumbled upon.
The views expressed here represent an emerging research perspective on the architectural limitations of current AI paradigms and the potential for neuroscience-informed alternatives. We have attempted to engage honestly with counterevidence and to distinguish between what the data supports and what remains speculative. The field benefits most when competing hypotheses are tested, not when they are dismissed prematurely — in either direction.
