This is a living document, in which I track efficiency progress at the bottom of the article.
“I fed a picture of donald duck to my computer and asked it to identify said duck. I then asked my computer to draw donald duck’s girlfriend in svg format on my shitty decade-old gtx 1080. It ran into an infinite loop since the number of polka dots kept exceeding its context window. So finally I decided to generate a textured 3D mesh from donald duck and it did so within 5 minutes without much error, then I approved for my computer to send it to my 3D printer”
This comment is not something I would have made 10 years ago, nor something I would have expected 5 years ago, or even 3 years ago, once LLM quantization techniques stopped being theoretical. Nor did I think it was a comment I would be making 6 months ago. It’s not exactly surprising, except for how quickly I was dragged into it, within the past few months, with my limited resources.
If we split the comment into sentences and pair each with the year it became believable:
“… asked my computer to identify said duck”1 – 2021
“… draw donald duck’s girlfriend in svg format” – 2022
“… on my decade-old gtx 1080” – ~2023 - 2024
“… generate a textured 3D mesh from donald duck” – ~2023
“… within 5 minutes” – 2025
Now try saying that to most people in 2000. The utter lack of shared dependent concepts makes the sentence so alien it would be hard to distinguish from gibberish or insanity. So what will the next 25 years look like?
Today we are living in donald duck times. Next time you look at your box of 20-year-old 32-bit thinkpads, consider keeping it. If you can’t see the utility, a species native to its environment would.
Since publishing this post:
- February 2026: “Run Llama 70B on 24GB RTX 3090” https://github.com/xaskasdf/ntransformer (though this won’t run on a GTX 1080 since CUDA support was dropped)
- March 2026: “inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU”
https://github.com/microsoft/BitNet
- “bitnet.cpp achieves speedups of 1.37x to 5.07x on ARM CPUs, with larger models experiencing greater performance gains. Additionally, it reduces energy consumption by 55.4% to 70.0%, further boosting overall efficiency. On x86 CPUs, speedups range from 2.37x to 6.17x with energy reductions between 71.9% to 82.2%. Furthermore, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices.”
- April 2026: “1-bit Bonsai”
https://github.com/PrismML-Eng/Bonsai-demo/tree/main
- “On raw benchmark averages, 1-bit Bonsai 8B remains competitive with leading 8B-class models, but it does so at just 1.15 GB memory footprint, roughly 12-14x smaller than its peers. […] On an RTX 4090, it reaches 368 tokens per second […] 1-bit Bonsai 8B uses substantially less energy than its 16-bit full-precision counterparts, delivering roughly 4-5x better energy efficiency. […] these gains come primarily from the reduced memory footprint of 1-bit models, not yet from fully exploiting the 1-bit structure of the weights during inference. In other words, Bonsai already delivers substantial advantages on hardware that was not built for this class of model. […] In linear layers such as MLPs, 1-bit weights make it possible to perform inference with little or no multiplication, replacing much of the computation with simple additions.”
- I am skeptical of organizations using their own benchmarks to demonstrate performance gains… but we will see; I can run this on my own hardware, and it supports CUDA 12 for my trusty ol’ GTX 1080
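The claim about multiplication-free linear layers is easy to illustrate. Below is a minimal sketch (my own toy example, not the actual bitnet.cpp or Bonsai kernels): a BitNet b1.58-style ternary layer stores each weight as −1, 0, or +1, so a matrix-vector product reduces to adding some inputs and subtracting others, with no multiplications at all.

```python
# Toy sketch of inference with ternary (1.58-bit) weights.
# Each weight is -1, 0, or +1, so y = W @ x needs no multiplications:
# each output element is a sum of selected inputs minus a sum of others.

def ternary_matvec(weights, x):
    """weights: rows of -1/0/+1 entries; x: input vector (floats)."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # add instead of multiply
            elif w == -1:
                acc -= xi      # subtract instead of multiply
            # w == 0: contributes nothing and can be skipped entirely
        out.append(acc)
    return out

# Usage: a 2x4 ternary weight matrix applied to a 4-vector.
W = [[1, 0, -1, 1],
     [-1, 1, 0, 0]]
x = [0.5, 2.0, 1.0, -1.5]
print(ternary_matvec(W, x))  # -> [-2.0, 1.5], same as a float matmul with W
```

In a real kernel the ternary weights are also bit-packed, which is where the ~12-14x memory-footprint reductions quoted above come from; the add/subtract trick is the part “not yet fully exploited” on today’s GPUs.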
… in natural language ↩︎