A CPU that runs entirely on GPU

(github.com)

193 points by cypres12 hours ago

29 comments

jagged-chisel6 hours ago
“A CPU that runs entirely on the GPU” I imagine a carefully crafted set of programming primitives used to build up the abstraction of a CPU… “Every ALU operation is a trained neural network.” Oh… oh. Fun. Just not the type of “interesting” I was hoping for.
- koolala6 hours ago
 Isn't it interesting it doesn't instantly crash from a precision error? That sounds carefully crafted to me.
- amelius4 hours ago
 Is it emulating a Pentium processor? :)
 - vessenes3 hours ago
 ARM64(!?!) I know you were joking, but still.
- robertcprice14 hours ago
 Please tell me what you had in mind so I can try something different!
 - anthk4 hours ago
 Begin by reimplementing a subleq/muxleq VM with GPU primitive commands: https://github.com/howerj/muxleq (it has both muxleq, a multiplexed subleq that is the same idea but much faster thanks to its mux instruction, and plain subleq). As you can see, the implementation is trivial. Once it's compiled, you can run eforth, although I run a tweaked one with floats and some better commands. Edit muxleq.fth and set the float option to 1 in that file:
     1 constant opt.float
 Do the same for the classic do..loop structure from Forth, which is not enabled by default (only the weird for..next one from EForth is):
     1 constant opt.control
 Then recompile:
     ./muxleq ./muxleq.dec < muxleq.fth > new.dec
 and run:
     ./muxleq new.dec
 Once you have a new.dec image, you can just use that from now on.
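To make "the implementation is trivial" concrete, here is a minimal sketch of a subleq interpreter. It is not the code from the linked repository: the halt-on-negative-pc rule, the missing I/O, and the demo program layout are all assumptions chosen to keep the example short.

```python
def run_subleq(mem, max_steps=10_000):
    """Run a SUBLEQ program stored in `mem`, modifying it in place.

    Each instruction is three cells (a, b, c):
        mem[b] -= mem[a]; jump to c if the result is <= 0, else fall through.
    Assumptions for this sketch: a negative program counter means "halt",
    and there is no I/O, unlike real subleq VMs.
    """
    pc = 0
    for _ in range(max_steps):          # guard against runaway programs
        if pc < 0 or pc + 2 >= len(mem):
            break
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
    return mem


# Hypothetical demo program: B += A, using Z as a scratch cell.
A, B, Z = 9, 10, 11
program = [
    A, Z, 3,     # Z = Z - A = -A
    Z, B, 6,     # B = B - Z = B + A
    Z, Z, -1,    # Z = 0, then jump to -1 (halt)
    7, 5, 0,     # data cells: A = 7, B = 5, Z = 0
]
result = run_subleq(program)
print(result[B])  # prints 12
```

The whole machine is the three lines inside the loop, which is why it is a popular first target for unusual substrates.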
 - Retr0id3 hours ago
 I was imagining something more like Xeon Phi
user____name5 hours ago
Someone needs to implement LLVMPipe to target this ISA; then one can run software OpenGL emulation and call it "hardware accelerated".
- yjftsjthsd-h9 minutes ago
  Surely that would be hardware decelerated
- jagged-chisel2 hours ago
  This causes me discomfort.
bob10298 hours ago
A fun experiment, but I wonder how many out there seriously think we could ever completely rid ourselves of the CPU. It seems to be a rising sentiment. The cost of communicating information through space is dealt with in fundamentally different ways here. On the CPU it is addressed directly. The actual latency is minimized as much as possible, usually by predicting the future in various ways and keeping the spatial extent of each device (core complex) as small as possible. The GPU hides latency with massive parallelism. That's why we can put them across relatively slow networks and still see excellent performance. Latency hiding does not cope well with workloads that are branchy and serialized, because you can only have one logical thread throughout. The CPU dominates this area because it doesn't cheat. It directly targets the objective. Making efficient, accurate control flow decisions tends to be more valuable than being able to process data in large volumes. It just happens that there are a few exceptions to this rule that are incredibly popular.
- st_goliath4 hours ago
 > I wonder how many out there seriously think we could ever completely rid ourselves of the CPU. It seems to be a rising sentiment. This sentiment is not a recent thing. Ever since GPGPU became a thing, there have been people who first hear about it, don't understand processor architectures and get excited about GPUs magically making everything faster. I vividly recall a discussion with some management type back in 2011, who was gushing about getting PHP to run on the new Nvidia Teslas, how amazingly fast websites will be! Similar discussions also spring up around FPGAs again and again. The more recent change in sentiment is a different one: the "graphics" origin of GPUs seems to have been lost to history. I have met people (plural) in recent years who thought (surprisingly long into the conversation) that I meant Stable Diffusion when talking about rendering pictures on a GPU. Nowadays, the 'G' in GPU probably stands for GPGPU.
 - ecshafer2 hours ago
 The dream, I think, has always been heterogeneous computing. The closest here is probably Apple, with their multi-core CPUs with different cores and a GPU with unified memory (someone with more knowledge of computer architecture could probably correct me here). Have a CPU, GPU, FPGA, and other specialized chips like neural chips, all with unified memory, and somehow pipeline specific workloads to each chip optimally. I wasn't really aware people thought we would be running websites on GPUs.
- volemo7 hours ago
 I see us not getting rid of the CPU, but the CPU and GPU eventually being consolidated into one system of heterogeneous computing units.
 - nine_k5 hours ago
 CPUs and GPUs have very different ways of scheduling instructions, requiring somewhat different interfaces and programming models. I'd hazard to say that a GPU and CPU with unified memory access (like Apple's M series, and most mobile chips) is already such a consolidated system.
 - amelius4 hours ago
 nVidia Jetson also has unified memory access btw.
 - jagged-chisel6 hours ago
 Agreed. Much like “RISC is gonna replace everything” - it didn’t. Because the CPU makers incorporated lessons from RISC into their designs. I can see the same happening to the CPU. It will just take on the appropriate functionality to keep all the compute in the same chip. It’s gonna take a while because Nvidia et al like their moats.
 - StilesCrisis3 hours ago
 CISC only survived because CPUs now dedicate a ton of silicon to decoding the CISC stream into RISC-y microcode. RISC CPUs can avoid this completely, but it turns out backwards compatibility was important to the market and the transistor cost of "instruction decode" just adds like +1 pipeline depth or something.
 - zephen1 hour ago
 > CPUs now dedicate a ton of silicon to decoding the CISC stream into RISC-y microcode. In absolute terms, this is true. But in relative terms, you're talking less than 1% of the die area on a modern, heavily cached, heavily speculative, heavily predictive CPU.
 - zozbot2346 hours ago
 > It will just take on the appropriate functionality to keep all the compute in the same chip. So, an iGPU/APU? Those exist already. Regardless, the most GPU-like CPU architecture in common use today is probably SPARC, with its 8-way SMT. Add per-thread vector SIMD compute to something like that, and you end up with something that has broadly similar performance constraints to an iGPU.
 - junon2 hours ago
 We're getting there already with e.g. Grace-Blackwell chips.
- spot50103 hours ago
 I don't think we get rid of the CPU. But the relationship will be inverted. Instead of the CPU calling the GPU, it might be that the GPU becomes the central controller and builds programs and calls the CPU to execute tasks.
 - pklausler1 hour ago
 Sounds reminiscent of the CDC 6600, a big fast compute processor with a simple peripheral processor whose barreled threads ran lots of the O/S and took care of I/O and other necessary support functions.
 - treyd2 hours ago
 This will never happen without completely reimagining how process isolation works and rewriting any OS you'd want to run on that architecture.
- fc417fc8026 hours ago
 > I wonder how many out there seriously think we could ever completely rid ourselves of the CPU. How do you class systems like the PS5 that have an APU plugged into GDDR instead of regular RAM? The primary remaining issue is the limited memory capacity. I wonder if we might see a system with GPU-class HBM on the package in lieu of VRAM, coupled with regular RAM on the board for the CPU portion?
 - chris_money2026 hours ago
 I don’t think the remaining issue is memory capacity. CPUs are designed to handle nonlinear memory access, and that is how all modern software targeting a CPU is written. GPUs are designed for linear memory access. These are fundamentally different access patterns, so the optimal solution is to have two distinct processing units.
 - zozbot2345 hours ago
 If anything, GPUs combine large private per-compute-unit address spaces with a separate shared/global memory, which doesn't mesh very well with linear memory access, just high locality. You can kinda get to the same arrangement on CPU by pushing NUMA (Non-Uniform Memory: only the "global" memory is truly Unified on a GPU!) to the extreme, but that's quite uncommon. "Compute-in-memory" is a related idea that kind of points to the same constraint: you want to maximize spatial locality these days, because moving data in bulk is an expensive operation that burns power.
robertcprice14 hours ago
Hey everyone, thank you for taking a look at my project. This was purely a “can I do it” type of deal, but ultimately my goal is to make a running OS purely on GPU, or one composed of learned systems.
- yjftsjthsd-h7 minutes ago
 This is hilarious and profoundly in the spirit of Hacker News. Thanks for posting :)
- StilesCrisis4 hours ago
 I think it's curious that you're saying "on GPU" when you mean "using tensors." GPUs run compute shaders naturally and can trivially act like CPUs, just use CUDA. This is more akin to "a CPU on NPU" and your NPU happens to be a GPU.
- lstevens144 hours ago
 Hi! I think that the idea is certainly a fun one. There is a long history of trying to make a good parallel operating system. I do not think that any of the projects succeeded though. This article [0] is a good read if you are interested in that. I am not sure why the economics of parallel computer operating systems have not worked out so far. I think it most likely has to do with the operating systems that we have being good enough and familiar. [0] https://news.ycombinator.com/item?id=43440174
 - activestore57 minutes ago
 The Blue Gene Active Storage project demonstrated compute in highly parallel “storage” where storage was HPC memory. It could work for the relationship between CPU and GPU, FPGA, etc. https://www.fz-juelich.de/en/jsc/downloads/slides/bgas-bof/bgas-bof-fitch/@@download/file
bmc750510 hours ago
As foretold six years ago. [1] [1]: https://breandan.net/2020/06/30/graph-computation#roadmap
- toolslive7 hours ago
 https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing ?
 - DiabloD35 hours ago
 https://en.wikipedia.org/wiki/Larrabee_(microarchitecture) ?
 - anthk2 hours ago
 Before that there was Forth running in the Transputer, which looks really close to current parallel computing.
RandyOrion44 minutes ago
Cool. However, one still needs a CPU to send commands to the GPU in order to let the GPU do CPU things.
- palmotea38 minutes ago
 > Cool. However, one still needs a CPU to send commands to the GPU in order to let the GPU do CPU things. Doesn't the Raspberry Pi's GPU boot up first, and then the GPU initializes the CPU? With this technology, we've eliminated the need for that superfluous second step.
GeertB1 hour ago
I don't quite understand how multiply doesn't require addition as well to combine the various partial products.
Nevermark2 hours ago
Time to benchmark Doom. Now we know future genius models won't even need CPUs, just tensor/rectifier circuits. If they need a CPU, they will just imagine them. A low-bit model with adaptive sparse execution might even be able to imagine with performance. Effectively, neural PGA capability.
nomercy4008 hours ago
I was taught years ago that MUL and ADD can be implemented in one or a few cycles. They can be the same complexity. What am I missing here? Also, is it possible to use the GPU's ADD/MUL implementation? It is what a GPU does best.
- volemo7 hours ago
 To multiply two arbitrary numbers in a single cycle, you need dedicated hardware in your ALU; without it, you have to combine several additions and logical shifts. As to why not use the ADD/MUL capabilities of the GPU itself, I guess it wasn’t in the spirit of the challenge. ;)
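A minimal sketch of the "several additions and logical shifts" route, assuming a 64-bit register width (the `bits` parameter and the function name are illustrative, not taken from the project):

```python
def shift_add_mul(a, b, bits=64):
    """Multiply two unsigned integers using only shifts, ANDs and additions.

    The result wraps modulo 2**bits, like a fixed-width register would.
    """
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    acc = 0
    while b:
        if b & 1:                   # this bit of b contributes a partial product
            acc = (acc + a) & mask  # ... which is folded in with an addition
        a = (a << 1) & mask         # next partial product is a shifted left by one
        b >>= 1
    return acc


print(shift_add_mul(6, 7))  # prints 42
```

Each set bit of `b` costs one addition, so a hardware multiplier that finishes in a cycle or two is doing the same sum of partial products with many adders working side by side.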
andrewdb8 hours ago
Why do we call them GPUs these days? Most GPUs, sitting in racks in datacenters, aren't "processing graphics" anyhow.
- xeonmc8 hours ago
 General Processing Units
 Gross-Parallelization Units
 Generative Procedure Units
 Gratuitously Profiteering Unscrupulously
 - incognito1247 hours ago
 Greed Processing Units
 - wartywhoa234 hours ago
 This is just brilliant!
 - allreduce3 hours ago
 Sometimes Gibberish Producing Units
- jgtrosh8 hours ago
 The dedicated term GPGPU [0] didn't catch on. [0]: https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units
- CompuHacker7 hours ago
     CPU = Compute
     GPU = Impute
- anthk4 hours ago
 VPU. Vector/Video Processing Unit.
deep128310 hours ago
This is a fun idea. What surprised me is the inversion where MUL ends up faster than ADD because the neural LUT removes sequential dependency while the adder still needs prefix stages.
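For readers wondering what "prefix stages" means here: a textbook parallel-prefix (Kogge-Stone) adder computes all carries in log2(width) steps, and each step depends on the one before it. The sketch below is that generic scheme, not the project's actual adder, whose internals are not described in this thread; the function name and the `bits` parameter are illustrative.

```python
def kogge_stone_add(a, b, bits=64):
    """Add two unsigned integers with a Kogge-Stone style carry computation.

    Returns (sum modulo 2**bits, number of sequential prefix stages).
    """
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    generate = a & b        # bit i produces a carry on its own
    propagate = a ^ b       # bit i passes an incoming carry along
    half_sum = propagate    # sum bits before carries are applied
    stages = 0
    shift = 1
    while shift < bits:
        # All bit positions update at once; only the stages are sequential.
        generate |= propagate & (generate << shift)
        propagate &= propagate << shift
        generate &= mask
        propagate &= mask
        shift <<= 1
        stages += 1
    total = (half_sum ^ (generate << 1)) & mask
    return total, stages


print(kogge_stone_add(0xFFFF, 1, bits=16))  # prints (0, 4)
```

A 64-bit add needs six such dependent stages, whereas a pure table lookup has no chain of that kind, which is one plausible reading of why the lookup-based MUL can come out ahead in this design.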
lorenzohess11 hours ago
Out of curiosity, how much slower is this than an actual CPU?
- bastawhiz11 hours ago
 Based on addition and subtraction, 625000x slower or so than a 2.5 GHz CPU.
 - zackmorris21 minutes ago
 I wish the project said how many CPUs could be run simultaneously on one GPU. It might be worth having a CPU that's 100 times slower (25 MHz) if 1000 of them could be run simultaneously to potentially reach a 10 times speedup for embarrassingly parallel computation. But starting from a hole that's 625000x slower seems unlikely to lead to practical applications. Still a cool project though!
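The numbers in this subthread work out as follows. This is only arithmetic on the figures quoted above (2.5 GHz, 625000x, 100x, 1000 instances), under the simplifying assumption that every instance scales perfectly:

```python
native_hz = 2.5e9          # reference CPU clock quoted above
slowdown = 625_000         # estimated slowdown quoted above

# Effective rate of one emulated CPU: about 4 kHz.
emulated_hz = native_hz / slowdown

# The hypothetical from the comment above: 100x slower, 1000 copies in parallel.
hypothetical_hz = native_hz / 100                 # 25 MHz per instance
aggregate_speedup = 1000 * hypothetical_hz / native_hz

# Instances needed just to match one native core at the quoted slowdown,
# assuming perfect scaling and an embarrassingly parallel workload.
break_even_instances = native_hz / emulated_hz

print(emulated_hz, aggregate_speedup, break_even_instances)
```

So at the quoted slowdown a single emulated core runs at roughly 4 kHz, and it would take on the order of 625000 parallel instances to break even with one native core.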
 - medi8r9 hours ago
 So it could run Doom?
 - repelsteeltje8 hours ago
 Yes: https://github.com/robertcprice/nCPU?tab=readme-ov-file#doom-raycaster-demo
 - medi8r7 hours ago
 Oh I forgot to Doom scroll.
 - binsquare8 hours ago
 Can we run doom inside of doom yet?
 throawayonthe7 hours ago
 Yes: https://github.com/kgsws/doom-in-doom
 PowerElectronix6 hours ago
 What a time to be alive
 - anthk6 hours ago
 Doom is easy. Better to try the ZMachine with an interpreter based on DFrotz, or another port; then a game can run even on a Game Boy. For a similar case, check Eforth+Subleq. If this guy can emulate a subleq CPU on a GPU (the implementation is something like 5 lines of C; the rest is C headers and the file opening function), it can run Eforth and maybe Sokoban.
DonThomasitos6 hours ago
I don't understand why you would train a NN for an operation like sqrt that the GPU supports in silicon.
- nine_k5 hours ago
  I see it as a practical joke or a fun hack, like CPUs implemented in the Game of Life, or in Minecraft.
  - anthk4 hours ago
    I actually ran Sokoban under EForth running on top of subleq/muxleq, with the VM interpreted in a few lines of AWK.
low_tech_punk3 hours ago
Saw the DOOM raycast demo at the bottom of the page. Can't wait for someone to build a DOOM that runs entirely on GPU!
sudo_cowsay11 hours ago
"Multiplication is 12x faster than addition..."Wow. That's cool but what happens to the regular CPU?
- adrian_b10 hours ago
 This CPU simulator does not attempt to achieve the maximum speed that could be obtained when simulating a CPU on a GPU. For that a completely different approach would be needed, e.g. by implementing something akin to qemu, where each CPU instruction would be translated into a graphic shader program. On many older GPUs, it is impossible or difficult to launch a graphic program from inside a graphic program (instead of from the CPU), but where this is possible one could obtain a CPU emulation that would be many orders of magnitude faster than what is demonstrated here. Instead of going for speed, the project demonstrates a simpler self-contained implementation based on the same kind of neural networks used for ML/AI, which might work even on an NPU, not only on a GPU. Because it uses inappropriate hardware execution units, the speed is modest and the speed ratios between different kinds of instructions are weird, but nonetheless this is an impressive achievement, i.e. simulating the complete Aarch64 ISA with such means.
 - 5o1ecist9 hours ago
 > where each CPU instruction would be translated into a graphic shader program You really think having a shader per CPU instruction is going to get you closer to the highest possible speed one can achieve?
 - adrian_b4 hours ago
 You could coalesce multiple instructions per shader, but even with a single CPU instruction (which would be translated to a sequence of GPU instructions), you could reach orders of magnitude greater speed than in this neural network implementation, by using the arithmetic-logic execution units of the GPU. Once translated, the shader programs would be reused. All this could be inserted in qemu, where a CPU is emulated by generating for each instruction a short program that is compiled and then the resulting executable functions are cached and executed during the interpretation of the program for the emulated CPU. In qemu, one could replace the native CPU compiler with a GPU compiler, either for CUDA or for a graphic shader language, depending on the target GPU. Then the compiled shaders could be loaded in the GPU memory, where, if the GPU is recent enough to support this feature, they could launch each other in execution. Eventually, one might be able to use a modified qemu running on the CPU to bootstrap a qemu + a shader compiler that have been translated to run on the GPU, so that the entire simulation of a CPU is done on the GPU.
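The translate-once, cache, and reuse idea described above can be sketched in a few lines. Everything here is hypothetical: the toy three-register instruction format, the names, and the use of Python closures where a real system would emit compiled shaders or kernels. It is not qemu's actual API.

```python
from typing import Callable, Dict, List, Tuple

MASK = (1 << 64) - 1
Instr = Tuple[str, int, int, int]        # (op, dst, src1, src2) in a toy ISA
Compiled = Callable[[List[int]], None]   # stands in for a compiled shader

translation_cache: Dict[Instr, Compiled] = {}
translations_done = 0                    # counts how often we had to "compile"


def translate(instr: Instr) -> Compiled:
    """Turn one guest instruction into an executable function (the slow path)."""
    global translations_done
    translations_done += 1
    op, dst, s1, s2 = instr
    if op == "add":
        def run(regs: List[int]) -> None:
            regs[dst] = (regs[s1] + regs[s2]) & MASK
    elif op == "mul":
        def run(regs: List[int]) -> None:
            regs[dst] = (regs[s1] * regs[s2]) & MASK
    else:
        raise ValueError(f"unknown op: {op}")
    return run


def execute(program: List[Instr], regs: List[int]) -> List[int]:
    """Run the program, translating each distinct instruction only once."""
    for instr in program:
        fn = translation_cache.get(instr)
        if fn is None:
            fn = translation_cache[instr] = translate(instr)
        fn(regs)
    return regs


regs = execute([("add", 0, 1, 2), ("mul", 3, 0, 0), ("add", 0, 1, 2)], [0, 2, 3, 0])
print(regs, translations_done)  # prints [5, 2, 3, 25] 2
```

The third instruction hits the cache, so only two translations happen for three executed instructions; real translators get further gains by caching whole blocks of instructions rather than single ones.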
 - koolala7 hours ago
 If it's bindless and pre-compiled, why not? What's a faster way?
koolala6 hours ago
Exciting if an AI that is helping with its own improvement finds this and incorporates it into its own architecture. Then it starts reading and running all the world's binaries and gains intelligence as a fully actualized "computer", finally becoming a master of both language and binary bits, thinking in poetry and in pure, precise numerical calculations.
artemonster7 hours ago
Every clueless person who suggests that we move to GPUs entirely has zero idea how things work and is basically suggesting using Lambos to plow fields and tractors to race in NASCAR.
- madwolf5 hours ago
 Bad comparison. Lambos are regularly plowing fields and they're quite good at it. https://www.lamborghini-tractors.com/en-eu/
throawayonthe7 hours ago
very tangentially related is whatever vectorware et al are doing: https://www.vectorware.com/blog/
jleyank5 hours ago
How is this different than the (various?) efforts back then to build a machine based on the Intel i860? Didn’t work, although people gave it a good try.
taofor43 hours ago
What is the purpose of this project? I didn't get it. How will it be useful?
- jebarker3 hours ago
 > How will it be useful? Does it need to be?
RagnarD11 hours ago
Being able to perform precise math in an LLM is important, glad to see this.
- jdjdndnzn11 hours ago
 Just want to point out this comment is highly ironic. This is all a computer does :P We need llms to be able to tap that, not add the same functionality a layer above and MUCH less efficiently.
 - Nuzzerino11 hours ago
 > We need llms to be able to tap that, not add the same functionality a layer above and MUCH less efficiently. Agents, tool-integrated reasoning, even chain of thought (limited, for some math) can address this.
 - RagnarD9 hours ago
 You're both completely missing the point. It's important that an LLM be able to perform exact arithmetic reliably without a tool call. Of course the underlying hardware does so extremely rapidly, that's not the point.
 - kruffalon5 hours ago
 Could you explain why that is?
 koolala5 hours ago
 A tool call is like 100,000,000x slower, isn't it?
 kruffalon3 hours ago
 No idea really, but if it is speed-related I would have thought that OP would have said "faster" rather than "important" to make their point.
 koolala57 minutes ago
 It's both. Being directly a part of it makes it integrated into its intelligence for training and operation.
 - jdjdndnzn7 hours ago
 The computer ALREADY does do math reliably. You are missing the point.
- koolala7 hours ago
 That would be cool. A way to read CPU assembly bytecode and then think in it. It's slower than real CPU code obviously, but still crazy fast for 'thinking' about it. They wouldn't need to actually simulate an entire program in a never-ending hot loop like a real computer. Just a few loops would explain a lot about a process and calculate a lot of precise information.
- 5o1ecist9 hours ago
 Why?
wartywhoa234 hours ago
Oh, these brave new ways to paraphrase the good old "fuck fuel economy"... Thank you, Mr. Do-because-I-can! Yours truly, - GPU company CEO, - Electric company CEO.
nicman2310 hours ago
can i run linux on a nvidia card though?
- micw10 hours ago
  Linux runs everywhere
  - volemo7 hours ago
    Except on my stupid iPad “Pro”. :(
mrlonglong9 hours ago
Now I've seen it all. Time to die... (meant humorously)
Surac10 hours ago
Well, GPUs are just special-purpose CPUs.
MadnessASAP9 hours ago
Ya know, just today I was thinking about a way to compile a neural network down to assembly, matching and replacing neural network structures with their closest machine code equivalent. This is way cooler though! Instead of efficiently running a neural network on a CPU, I can inefficiently run my CPU on a neural network! With the work being done to make more powerful GPUs and ASICs, I bet in a few years I'll be able to run a 486 at 100MHz(!!) with power consumption just under a megawatt! The mind boggles at the sort of computations this will unlock! A few more years and I'll even be able to realise the dream of self-hosting ChatGPT on my own neural network simulated CPU!