Can The LLM Market Scale To Artificial General Intelligence?
Introduction
Scaling current large-language-model (LLM) infrastructure yields steady – but slowing – gains. Fundamental constraints in compute, data supply, energy, cost, and safety indicate that brute-force scaling is unlikely to cross the remaining gap to human-level, general intelligence without substantial algorithmic advances and new system designs.
1. What Pure Scaling Has Achieved
| Year | Frontier model (public) | Train compute (FLOP) | Cost (USD, est.) | ARC-AGI-1 score | Notable capabilities |
|---|---|---|---|---|---|
| 2020 | GPT-3 (175 B) | 3.1e23 | $2–4 M | 0% | few-shot text generation |
| 2023 | GPT-4 | ≈6e24 | $41–78 M | 5% | chain-of-thought, tool use |
| 2024 | Claude 3.5 | n/a | “few tens of millions” | 14% | improved coding & reasoning |
| 2025 | o3-medium | ≈1e25 | $30–40 M | 53% on ARC-AGI-1, ≤3% on harder ARC-AGI-2 | beats graduate-level STEM tests, 25% on FrontierMath |
Raw scale has pushed LLMs from near-random performance to superhuman scores on many benchmarks, showing that power-law “scaling laws” hold over five orders of magnitude in compute. Yet even the most compute-hungry model still fails most ARC-AGI-2 tasks that ordinary humans solve easily.
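In the scaling-law literature these curves are usually fit as a power law in training compute. A minimal sketch of that form is below; the constants are placeholders, with the exponent on the order of the early published Kaplan-style fits:

```latex
% Test loss as a power law in training compute C.
% L_inf is the irreducible loss floor; C_0 and alpha are fitted constants,
% with alpha small (on the order of 0.05 in early published fits).
L(C) \;\approx\; L_{\infty} + \left(\frac{C_{0}}{C}\right)^{\alpha}
```

With an exponent that small, a 10× increase in compute shrinks the reducible loss by only around 10 percent (10^(-0.05) ≈ 0.89), which is why the gains in the table are real but steadily more expensive to buy.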
2. Why Scaling Laws Flatten
- Compute and Cost
  - Training cost for the largest runs has grown roughly 2.4× per year since 2016, which extrapolates to runs costing more than $1 billion by 2027 (see the back-of-envelope sketch after this list).
  - Inference cost also rises with the test-time “long-thinking” strategies that drive recent gains.
- Energy and Carbon
  - A single 65 B-parameter model can draw 0.3–1 kW per inference job at scale; training GPT-3 emitted approximately 550 t CO₂-eq.
  - Running 3.5 M H100 GPUs at 60% utilisation would consume approximately 13 TWh yr⁻¹, more than many small countries use (also reproduced in the sketch below).
- Data Exhaustion
  - The stock of human-generated high-quality text (about 300 T tokens) will be fully consumed between 2026 and 2032 if current trends continue.
  - Heavy reliance on synthetic data risks “model collapse” and degraded diversity.
- Networking & Memory Limits
  - Clusters above about 30 k GPUs suffer steep efficiency losses from interconnect and fault-tolerance bottlenecks.
  - Sparse mixture-of-experts routing helps but increases VRAM pressure and system complexity.
- Safety & Governance Friction
  - Labs have adopted Responsible Scaling Policies that require pauses when dangerous capabilities emerge; ever-larger models hit these checkpoints sooner.
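The two headline numbers above can be sanity-checked with a quick back-of-envelope calculation. The sketch below assumes the GPT-3 cost midpoint from the table and a 700 W board power per H100-class GPU; both are assumptions for illustration, not reported figures:

```python
# Back-of-envelope check of the cost-growth and energy claims above.
# Assumptions for illustration: ~$3M for GPT-3 in 2020 (midpoint of the table's
# $2-4M range), the headline 2.4x/year growth rate, and ~700 W per H100-class GPU.

GROWTH_PER_YEAR = 2.4
BASE_YEAR, BASE_COST_USD = 2020, 3e6

cost_2027 = BASE_COST_USD * GROWTH_PER_YEAR ** (2027 - BASE_YEAR)
print(f"Projected frontier-run cost in 2027: ${cost_2027 / 1e9:.1f}B")  # ~= $1.4B

NUM_GPUS = 3.5e6        # installed H100-class accelerators
POWER_W = 700           # assumed board power per GPU (W)
UTILISATION = 0.60      # average utilisation
HOURS_PER_YEAR = 8760

energy_twh = NUM_GPUS * POWER_W * UTILISATION * HOURS_PER_YEAR / 1e12  # Wh -> TWh
print(f"Annual fleet draw: {energy_twh:.1f} TWh")  # ~= 12.9 TWh, i.e. the ~13 TWh above
```

Both results land within the ranges quoted above, to within rounding.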
3. Evidence That Scale Alone Is Insufficient
- ARC-AGI-2 glass ceiling: o3’s ≤3% score, after roughly 50,000× compute growth since 2019, shows diminishing returns on tasks demanding systematic abstraction.
- Diminishing log-log slopes: updated scaling fits show the exponents flattening as models reach the Chinchilla-optimal data/parameter ratio (the relation is sketched after this list).
- From pattern learning to planning: current LLMs remain brittle at multi-step novel reasoning, long-horizon planning, and grounding in the physical world.
- Economic infeasibility: a $1 billion training run would need to recoup more than $10 billion in revenue just to cover cloud depreciation, excluding alignment research and liability risk.
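For reference, the Chinchilla-optimal ratio mentioned above falls out of the parametric loss fit of Hoffmann et al. (2022). A sketch of that relation is below; the constants are quoted approximately:

```latex
% Parametric loss in parameters N and training tokens D (Hoffmann et al., 2022):
L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Minimising L under a fixed compute budget C \approx 6 N D gives
N_{\mathrm{opt}}(C) \;\propto\; C^{a}, \qquad
D_{\mathrm{opt}}(C) \;\propto\; C^{b}, \qquad a \approx b \approx 0.5
% i.e. parameters and training tokens should grow together, at roughly
% 20 tokens per parameter for the fitted constants.
```

Once frontier models sit near this ratio, pushing loss down further requires growing data in lockstep with parameters, which is exactly where the data-exhaustion ceiling from Section 2 begins to bind.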
4. Paths Beyond Brute Scaling
- Algorithmic Efficiency
  - Chinchilla showed that smarter allocation of training tokens beats simply building larger models at equal compute.
  - Retrieval-augmented generation, sparse routing, and neuromorphic techniques cut costs by 5–20×.
- Test-time Adaptation & Agents
  - Tree search, majority voting, and tool-use agents outperform naïve parameter scaling on maths and code (a minimal voting sketch follows this list).
- Multimodal & Continual-Learning Systems
  - Grounding in images, actions, and feedback loops may supply richer gradients than extra text alone.
- Synthetic-Data Science
  - SynthLLM finds power-law scaling in generated curricula up to approximately 300 B tokens before plateauing.
  - Theory warns that mutual-information bottlenecks, not sheer volume, drive generalization.
- Architecture Innovation
  - New memory-augmented, modular, or hybrid neuro-symbolic models aim to break the quadratic attention wall and enable compositional generality.
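To make the test-time item above concrete, here is a minimal majority-voting (self-consistency) sketch; `sample_answer` is a hypothetical stand-in for a stochastic model call, not any particular API:

```python
# Minimal majority-voting (self-consistency) sketch: sample several candidate
# answers at non-zero temperature and return the most common one.
import random
from collections import Counter

def sample_answer(prompt: str) -> str:
    """Hypothetical stand-in for one stochastic LLM call."""
    # Simulate a model that answers "12" most of the time but sometimes slips.
    return random.choice(["12", "12", "12", "13", "7"])

def majority_vote(prompt: str, n_samples: int = 16) -> str:
    votes = Counter(sample_answer(prompt) for _ in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer

if __name__ == "__main__":
    print(majority_vote("What is 3 * 4?"))  # usually "12": occasional errors are out-voted
```

Tree search and tool-use agents elaborate on the same idea: spend extra inference-time compute generating and filtering candidates rather than adding parameters.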
5. Outlook: Toward AGI Requires More Than Bigger Clusters
Scaling current transformer-based LLM infrastructure will continue to deliver valuable, superhuman skills – especially when paired with clever inference algorithms – yet multiple converging ceilings suggest it will not by itself close the remaining qualitative gap to general intelligence:
- Compute, energy, and cost grow faster than capabilities.
- High-quality data is finite; synthetic data helps but introduces new failure modes.
- Benchmarks designed to detect genuine abstraction (ARC-AGI-2) still expose large deficits.
- Safety regimes and public policy are already nudging labs to slow or pivot from raw scale.
The most plausible route to AGI therefore lies in hybrid progress: continued, economically tempered scaling combined with breakthroughs in architecture, efficient learning algorithms, richer data modalities, and robust alignment methods. Pure scale remains a crucial ingredient, yet it is neither all we need nor, on its own, a guaranteed path to human-level general intelligence.