Open Thinkering

AI's energy problem is a systems problem

[Image: a collage of a vintage Japanese landscape and three large black data servers emitting a cloud of smoke]

Recently, I came across an article suggesting that powerful AI tools could run on the number of watts drawn from the average smartphone battery. Given my interest in these technologies, I thought it deserved some further investigation.

What I found was that there are four well-understood engineering techniques that can reduce the energy used per useful AI output by ~100x. That's a lot.

So without burying the lede, what are those techniques?

  1. Training models more efficiently (smaller models, better data)
  2. Replacing dense transformer arithmetic with leaner algorithms
  3. Running those algorithms on hardware that stops shuffling data back and forth unnecessarily
  4. Serving requests intelligently rather than wastefully

These aren't exactly exotic techniques, and they're all – in principle, at least – combinable.

My feeds are full of either AI boosterism (usually LinkedIn) or AI doomerism (everywhere else). AI is either going to solve climate change if we build enough data centres, or it's an environmental catastrophe. Both of these narratives, as we argued in a paper for Friends of the Earth last year, obscure the real question: not whether AI uses energy, but who decides how much, and on what basis.

A four-layered problem that comes down to governance

One of the reasons I'm drawn to systems thinking is that it's interdisciplinary. But much of the academic and academic-adjacent world exists firmly within set disciplines, each with its own research community, conferences, and sets of incentives. Coordination between these worlds is currently, it seems, largely being done by Big Tech companies. So the focus is on profit rather than efficiency and optimisation.

Let's take them in turn:

Training practice

A 2022 paper from DeepMind, known informally as Chinchilla, shows that most LLMs have been trained with too many parameters and too few tokens. What does this mean in practice? That you can hit the same capability level with smaller models trained on more, higher-quality, deduplicated data.

This requires a fraction of the compute cost, and the finding has been public for several years at this point. Yet many frontier models continue to be trained in ways that do not reflect the research, because there are marketing benefits to announcing ever-larger models. It's easier for people to understand big parameter counts in a press release than careful training decisions.
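To make the trade-off concrete, here is a rough back-of-envelope sketch. It assumes two widely quoted rules of thumb that are not stated in this post: that training compute is roughly 6 × parameters × tokens (in FLOPs), and that the compute-optimal ratio is around 20 tokens per parameter. The function names are hypothetical, and real training runs will deviate from both approximations.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumption: compute ≈ 6 * params * tokens
TOKENS_PER_PARAM = 20       # assumption: Chinchilla-style rule of thumb


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal_split(budget_flops: float) -> tuple[float, float]:
    """Split a fixed compute budget into (params, tokens) at ~20 tokens/param."""
    # budget = 6 * N * (20 * N)  =>  N = sqrt(budget / 120)
    params = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


# Hypothetical oversized model: 280 billion parameters, 300 billion tokens.
budget = training_flops(280e9, 300e9)
params, tokens = compute_optimal_split(budget)
print(f"{params / 1e9:.0f}B parameters on {tokens / 1e12:.1f}T tokens")
```

Under these assumptions, the same compute budget would be better spent on a model with roughly a quarter of the parameters, trained on about four times as much data.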

Algorithms

Most modern AI models use a transformer design that relies on attention. The cost of this attention grows with the square of the input length, which in practice means that if you give the model a longer prompt, the energy required rises much faster than the length of the prompt itself.

There are alternatives, however, including State Space Models (SSMs). Take the Mamba architecture, for example, which scales linearly with input length, meaning that longer inputs can be dealt with far more efficiently. On long-context tasks, such models can run 10x faster than transformers.
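The difference between quadratic and linear scaling is easy to underestimate, so here is a toy cost model. The units are relative, not real FLOPs or joules, and the function names are made up for illustration; the only thing it shows is how each cost grows as the prompt gets longer.

```python
def attention_cost(seq_len: int, dim: int = 1) -> int:
    """Relative cost of self-attention: every token attends to every token."""
    return seq_len * seq_len * dim


def linear_scan_cost(seq_len: int, dim: int = 1) -> int:
    """Relative cost of a linear-time sequence model: one step per token."""
    return seq_len * dim


base = 1_000
for n in (1_000, 2_000, 4_000):
    print(n, attention_cost(n) // attention_cost(base), linear_scan_cost(n) // linear_scan_cost(base))
# 1000 1 1
# 2000 4 2
# 4000 16 4
```

Doubling the prompt doubles the cost of the linear approach but quadruples the cost of attention, and the gap keeps widening from there.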

And then, separately from this, research in “quantisation” demonstrates that you can shrink the precision of model weights dramatically. There's a good recent example on the Google Research blog. Quantisation reduces the number of bits used to store each model weight – for example, instead of using 16-bit floating-point numbers, getting down to an effective average of around 1.58 bits per weight.

What does that mean in practice? Using just three possible values for each weight: -1, 0, or 1. In theory, this means basic operations inside the model become around 30x cheaper to compute. The BitNet b1.58 paper from early 2024 demonstrated this at scale.
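A minimal sketch of what three-valued weights look like in code follows. It uses an "absmean" scaling step (divide by the mean absolute weight, round, then clip to -1..1), which is how the BitNet b1.58 scheme is usually summarised; treat that, and the function names, as assumptions rather than the paper's actual implementation. It also shows where the 1.58 figure comes from: three possible values carry log2(3) bits of information.

```python
import math


def ternary_quantise(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Map each weight to -1, 0 or 1 and return the scale used."""
    scale = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale


def ternary_dot(ternary: list[int], activations: list[float], scale: float) -> float:
    """Dot product using only additions and subtractions, then one rescale."""
    total = 0.0
    for t, a in zip(ternary, activations):
        if t == 1:
            total += a
        elif t == -1:
            total -= a
        # t == 0: the weight contributes nothing, so the activation is skipped
    return total * scale


BITS_PER_TERNARY_WEIGHT = math.log2(3)  # about 1.58

ternary, scale = ternary_quantise([0.9, -0.05, -1.2, 0.4])
print(ternary)  # [1, 0, -1, 1]
```

The saving comes from `ternary_dot`: there are no per-weight multiplications left, only adds, subtracts, and skips.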

Hardware

Graphics Processing Units (GPUs) are what tend to be used for AI hardware. This is why the price of gaming PCs has gone through the roof recently! GPUs repeatedly move data back and forth, pulling numbers from memory before doing calculations on them, and then writing the results back to memory.

Doing this thousands of times for each layer and token requires a lot of shuttling of data, which costs time and energy. IBM has demonstrated a different way of doing this through its NorthPole chip, which puts the computing units right next to the memory so that the data hardly needs to move at all. This was described in Science in 2023 and then explained in practice in a 2025 paper.

The upshot is a chip that, on vision and language tasks, uses around 25x less energy while responding 5x faster than a similar GPU built on similar manufacturing technology.

In addition, there are experimental hardware ideas such as spiking neural networks, which only transmit information when a neuron fires, promising 10x to 100x greater efficiency in research settings. And the example which initially piqued my interest, dendrocentric AI designs, goes even further by mimicking the branching structure of biological neurons. None of these hardware approaches is mainstream yet, but neither are they mere science fiction.

Server / operations

Even before you change the model or the hardware, there are big efficiency gains to be had in how AI systems are run. For example, moving training jobs to times and places where the electricity grid is cleaner cuts carbon emissions by 30-40% without affecting how well the model works.
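As a sketch of the scheduling idea, the function below picks the start time for a job so that it runs during the lowest-carbon stretch of a forecast. The forecast values are invented, the function name is hypothetical, and a real scheduler would also have to weigh deadlines, data location, and hardware availability.

```python
def cleanest_window(forecast: list[float], job_hours: int) -> tuple[int, float]:
    """Return (start_hour, mean_intensity) of the lowest-carbon contiguous window.

    forecast: hypothetical grid carbon intensity per hour, in gCO2/kWh.
    """
    if not 0 < job_hours <= len(forecast):
        raise ValueError("job_hours must be between 1 and len(forecast)")
    best_start, best_sum = 0, sum(forecast[:job_hours])
    window_sum = best_sum
    for start in range(1, len(forecast) - job_hours + 1):
        # Slide the window one hour: add the hour entering, drop the hour leaving.
        window_sum += forecast[start + job_hours - 1] - forecast[start - 1]
        if window_sum < best_sum:
            best_start, best_sum = start, window_sum
    return best_start, best_sum / job_hours


# Made-up six-hour forecast; a three-hour job is cleanest if it starts at hour 2.
start, mean_intensity = cleanest_window([300, 280, 150, 120, 140, 260], job_hours=3)
print(start, round(mean_intensity, 1))  # 2 136.7
```

The same job run from hour 0 would average about 243 gCO2/kWh in this made-up forecast, so the only thing that changed is when it ran.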

Then there's the vLLM system, which borrows the way operating systems handle virtual memory, splitting it into small fixed-size pages, and applies it to the memory a transformer uses during inference. In effect, it allows the system to serve more requests at once, which means the cost of loading the model into memory is shared across more users.
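A toy accounting sketch shows why handing out memory in small blocks lets more requests fit at once. This is not vLLM's code or API; the function names, request lengths, and block size are all hypothetical, and it simply compares two ways of reserving cache slots for in-flight requests.

```python
import math


def contiguous_slots(request_lengths: list[int], max_len: int) -> int:
    """Slots reserved if every request pre-allocates room for max_len tokens."""
    return len(request_lengths) * max_len


def paged_slots(request_lengths: list[int], block_size: int) -> int:
    """Slots reserved if each request only holds the fixed-size blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lengths)


lengths = [100, 900, 250]  # tokens currently held by three made-up requests
print(contiguous_slots(lengths, max_len=2048))  # 6144
print(paged_slots(lengths, block_size=16))      # 1280
```

In this example the block-based approach reserves about a fifth of the memory, which is room that can go to additional concurrent requests.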

Other approaches? Speculative decoding uses a smaller, faster “draft” model to quickly suggest the next tokens, while a larger model checks those suggestions in parallel. Meanwhile, FlashAttention-3 restructures attention computation in a way which minimises expensive memory reads and writes.
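Speculative decoding is easier to follow in code. The sketch below is the simplest greedy variant, with the two models stood in for by plain functions that return the next token; production systems use a probabilistic acceptance rule and check all the drafted tokens in a single batched pass. All names here are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token sequence to the next token


def speculative_step(target: NextToken, draft: NextToken, tokens: List[int], k: int) -> List[int]:
    """Extend `tokens` by one round of greedy speculative decoding.

    The draft proposes k tokens; the target keeps the longest prefix it agrees
    with, then contributes one token of its own. The result always matches
    what the target alone would have produced.
    """
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft(proposal))

    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        # In a real system these k checks run as one batched forward pass.
        expected = target(proposal[:i])
        if proposal[i] != expected:
            accepted.append(expected)  # first disagreement: use the target's token
            return accepted
        accepted.append(proposal[i])
    accepted.append(target(accepted))  # all k accepted: target adds one more token
    return accepted


def big_model(seq: List[int]) -> int:
    """Toy target model: counts upwards modulo 10."""
    return (seq[-1] + 1) % 10


def small_model(seq: List[int]) -> int:
    """Toy draft model: agrees with the target except after a 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


print(speculative_step(big_model, small_model, [1], k=4))  # [1, 2, 3, 4, 5]
```

When the draft is usually right, the large model produces several tokens per round instead of one, without changing the output.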

It's all down to governance

It's hard not to feel optimistic when reading what's technically possible. So let's temper that by talking about governance. You'll be dismayed to learn that there is currently no standard for reporting energy per useful output. That means, for example, per token generated, per query answered, per benchmark task completed.

Companies report power usage effectiveness (PUE) figures for data centres, and occasionally their total electricity consumption, but almost never energy per unit of capability. Our report for Friends of the Earth noted that:

[A]ccurate data about the whole-life impact of generative AI models remains extremely difficult to access, despite the efforts of campaigners like Sasha Luccioni to develop assessment models, largely due to the lack of access to data from Big Tech companies.

It's all very well talking about the numerator (the amount of energy used) but without a denominator (how much capability was delivered) it's impossible to compare the efficiency of different systems – or to hold providers accountable for improving this.

The culture around benchmarking AI systems makes this worse. The most popular way of evaluating LLMs focuses on accuracy on standardised tests rather than accuracy-per-watt. For example, a quantised model that uses 10x less energy but scores 2% lower on a benchmark looks like a 'loser'.

In other words, there is no equivalent of miles-per-gallon (MPG) for AI. Just as we wouldn't choose a car solely on how fast it goes, so we shouldn't be choosing AI models just on how capable they are.
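Here is a sketch of what an MPG-style figure could look like when reported alongside accuracy. The class, field names, and numbers are all hypothetical; the example assumes "2% lower" means two percentage points, and it uses energy (joules) rather than power (watts) as the denominator because that is what a completed benchmark run actually consumes.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    """One evaluation run; all fields are hypothetical measurements."""
    name: str
    accuracy: float        # fraction of benchmark tasks answered correctly
    energy_joules: float   # total energy drawn during the run
    tokens_generated: int

    @property
    def joules_per_token(self) -> float:
        return self.energy_joules / self.tokens_generated

    @property
    def accuracy_per_kilojoule(self) -> float:
        return self.accuracy / (self.energy_joules / 1000)


# Made-up numbers: the quantised model uses 10x less energy, scores 2 points lower.
baseline = BenchmarkRun("full-precision", accuracy=0.80, energy_joules=10_000, tokens_generated=50_000)
quantised = BenchmarkRun("quantised", accuracy=0.78, energy_joules=1_000, tokens_generated=50_000)

for run in (baseline, quantised):
    print(f"{run.name}: {run.joules_per_token:.2f} J/token, "
          f"{run.accuracy_per_kilojoule:.2f} accuracy per kJ")
# full-precision: 0.20 J/token, 0.08 accuracy per kJ
# quantised: 0.02 J/token, 0.78 accuracy per kJ
```

On an accuracy-only leaderboard the quantised model loses; once energy is in the table, it is the full-precision model that has to justify itself.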

As I mentioned above, the insights from the Chinchilla paper about ways to optimise training have been public knowledge for years. And yet many frontier models continue to be trained with parameter counts and token ratios that don't reflect them.

The incentive structures here are, of course, partly commercial. As I alluded to above, announcing a model with hundreds of billions of parameters signals ambition and resource availability in ways that a smaller, more efficient model trained on carefully curated data does not (even if the latter performs comparably).

Approaches such as carbon-aware scheduling, which works at scale without any performance penalty, require somebody to be measuring and reporting on carbon intensity in the first place. Here in the UK, for example, this is not on the political agenda: the government's AI Opportunities Action Plan (January 2025) didn't mention the environment or climate at all. This is a policy failure – especially when emissions from data centres could be hundreds of times higher than originally estimated.

The Friends of the Earth report points to the EU's Corporate Sustainability Reporting Directive (CSRD), which came into effect in 2024. It requires large companies to report on social and environmental risks – including supply chains. In addition, the EU AI Act discusses environmental impact reporting for foundation models. These are moves in the right direction, but of course (thanks, Brexit!) the UK sits outside these frameworks, and voluntary disclosure from Big Tech has not gone far enough to enable meaningful comparison or accountability.

Where are the buttons and levers here?

We can't wait for Big Tech to gain a conscience when the supposedly most-ethical version of frontier AI is paying Elon Musk for access to gas-powered compute facilities. Instead, there are at least four levers that don't require new legislation in order to pull.

  1. Procurement – As we've been discussing with the TechFreedom pilot cohort, procurement allows organisations to align their technology choices with their values. So universities, government departments, and large civil society organisations can add energy-per-token disclosures to their tender criteria. If buyers start asking, providers will start measuring.
  2. Benchmarks – The most commonly-cited AI evaluation frameworks are HELM at Stanford, the EleutherAI evaluation suite, and the Hugging Face Open LLM Leaderboard. These do a good job of reporting accuracy, latency, and sometimes the cost of AI models. None of them, however, currently reports energy per token as a standard output. This would be an easy addition and would immediately change the conversation about what counts as a “good” model. The research data to populate such a metric already exists. Sasha Luccioni's work at Hugging Face has already demonstrated methodologies for measuring inference energy.
  3. Standards bodies – The AI research community has produced the empirical basis for an energy efficiency standard, and the IEEE and ISO have processes for exactly this kind of standardisation.
  4. Open models as competitive pressure – Open source models such as Llama, Mistral, and their derivatives, already publish efficiency data alongside their capability data. This creates a point of comparison that closed providers can't ignore indefinitely. The principles we came up with in the Friends of the Earth report include transparency as a first-order concern because without public data, there is no basis for accountability.

And finally...

I'm not predicting an overnight 100x reduction in energy usage for AI. But it is a good estimate of what's possible when looking across all four layers of the stack at once and asking what would happen if we were serious about combining the best available techniques.

The barrier here isn't physics or engineering. What's missing is a way to make efficiency legible to organisations procuring AI systems, to policymakers, and to the public. And then, on top of that, a set of incentives that reward improving that efficiency.

AI is neither a catastrophe nor a means of salvation. Arguing one of two binary points of view just leads to an intractable situation. What's more tractable is asking: what would it take for “how much energy does this cost per useful output” to be as normal a question as “how accurate is it”?

We don't have to wait for new hardware or breakthrough research. We just need to do some systems thinking and coordination across layers that already exist but aren't currently connected. And that requires political will and governance.