Oh, it's already over after 3 minutes 20 seconds; a very short video.
The main message is about the motivation behind CUDA: people don't want to learn a completely new language, they want to invest as little as possible. So the idea was: just C, but on the GPU. The motivation behind CUDA was to make GPU programming as easy as possible for someone who already knows C.
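To illustrate that point (my own minimal sketch, not from the video): a CUDA program is essentially C with a couple of extensions. The kernel body below is ordinary C; the visible additions are the `__global__` qualifier, the built-in `blockIdx`/`blockDim`/`threadIdx` indices, and the `<<<...>>>` launch syntax.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// The kernel body is plain C; only the qualifier and the
// built-in thread indices are CUDA-specific.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void) {
    const int n = 1024;
    float hx[1024], hy[1024];
    for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    // <<<blocks, threads>>> is the other visible extension to C.
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);  // 3*1 + 2 = 5
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

Everything else (loops, pointers, `printf`) is the C a developer already knew, which is exactly the adoption story the video is telling.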
yep I noticed it right away - watched it twice to confirm -- and came here to post the same thing :)
not sure if it was a slip and corrected in post ... they are Nvidia after all, so they have all the compute they want :).
But if it was indeed corrected in post then they did an excellent job of getting the acoustics / ambient noise perfectly right. Can someone do forensics on the audio track to see any editing artifacts?
PS: this bit is at timestamp 0:12 though
PS2: Channel is "Nvidia Tesla" and as someone commented on youtube he looks a bit like Elon Musk :)
what happened is they took the recording of him saying "2004" at the 00:47 mark and spliced that audio over the 00:07 mark. Strange, but if you listen to 00:47 and then go back to 00:07, it's quite clear it's the same.
Ian Buck's doctoral thesis Stream Computing on Graphics Hardware (2006) [1] and a shorter article about it (2004).
From the 2004 article: "It is also possible that future streaming hardware will share the same memory as the CPU, eliminating the need for data transfer altogether." Unified memory foreseen 16 years before Apple Silicon (and the point of this comparison is to indicate how hard it is to go from a prediction in a paper to mass manufacturing/popularity, not that Apple invented unified memory).
More seriously, people need to stop with the Apple comparisons. Unified memory has been a thing for far longer. Heck, around 2014 AMD had integrated GPUs with not just unified memory but fully unified address spaces with the host, and unified memory itself existed well before that.
Not to mention that mobiles have always been unified archs. It’s just a design decision.
Ian Buck's 2004 prediction is still 10 years before 2014. I did not say Apple invented unified memory, it just got popular with Apple Silicon, and looking at local LLM inference on M1/M2 and the 192 GB of memory M2 Ultra allows, it will surely get more important.
"Unified memory" has been around forever in one form or another, as it's simply a single address space (and physical location) for various independent subsystems. In graphics, it's probably been used since Amiga. (?) This is common in console GPUs which always punched above their weight. The ubiquitous Intel shared memory has been around for ages, although it was not entirely unified (reserved area for GPU, which it cannot escape; zero-copy still possible by allocating inside it and addressing data on CPU).
I obviously did not mean "unified memory" in that sense. In that sense even Apple I in 1976 had "unified memory" [1] [2]. The sense in which I meant it, following the spirit of the above paper/thesis, which no one seems to have read because they were too quick to jump on the bashing bandwagon, was unified memory performing "stream computing", e.g. an Apple Silicon chip running a local large language model. And if you get to run Vicuna 13B or something else on an Intel Tiger Lake or similar, more power to you, and to us if you make it open source.
>Unified memory foreseen 16 years before Apple Silicon.
Honestly, it's good to get some more background information before claiming that Apple invented every innovation since sliced bread.
Microcontrollers, SoCs from various vendors, gaming consoles and Intel CPUs with integrated graphics have also had unified memory since .. forever(?), or at least nearly 30 years, because it was as efficient back then as it is now in terms of silicon and SW usage.
Apple didn't reinvent the wheel in this regard, it was already there as a low hanging fruit.
I did not say Apple invented unified memory, just that Ian Buck's prediction was 16 years before Apple Silicon. Also, in practice, it is not only about unified memory but also about performance. An Intel chip with integrated graphics can boil an egg with the heat it dissipates; the M1/M2 stay cool as if they weren't even running while handling far more workload.
Having unified memory and having much more compute performance per watt are two orthogonal issues. Unified memory already existed before Apple silicon and was already present in most consoles, cheap tablets and smartphones, given how ubiquitous SoCs and Intel chips with iGPUs were.
Ian Buck's thesis is about how GPUs can be used for "stream computing": "In this paper (Buck, 2004), we present Brook for GPUs, a system for general-purpose computation on programmable graphics hardware. Brook extends C to include simple data-parallel constructs, enabling the use of the GPU as a streaming co-processor." That's the whole point: Apple Silicon allows running local large language models (and other ML models/algorithms) in a way, at a price point, and with enough performance that other chips with unified memory don't.
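For flavor, the saxpy kernel in the Brook for GPUs paper looks roughly like this (quoting from memory, so treat it as a sketch of the syntax rather than exact code):

```c
/* Brook extends C: the <> suffix marks stream arguments, and the
   compiler maps this per-element body onto GPU fragment programs. */
kernel void saxpy(float a, float4 x<>, float4 y<>, out float4 result<>) {
    result = a * x + y;
}
```

Streams were declared with a size, e.g. `float4 x<100>;`, and the kernel was then invoked like an ordinary C function over the whole stream. You can see the lineage from this to CUDA's "just C on the GPU" pitch pretty directly.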
There's a difference between CPU-GPU shared memory and unified memory, although not everyone seems to be using "unified" in the same sense.
What Apple appear to have with their M2 chips is shared memory meaning that the CPU and GPU are directly accessing the same memory chips. On the just-announced M2 Ultra chip they are claiming 800GB/sec memory bandwidth, which compares well to the 1TB/sec on a recent NVIDIA card.
Unified memory, at least as NVIDIA use the term, only refers to a unified address space such that the GPU and CPU (located on opposite sides of the PCI bus) can use the same address space to access memory. However, the memory being mapped by this unified address space may be on either side of the PCI bus (i.e. be CPU memory or GPU memory) and may migrate from one side to the other to optimize performance. Given how slow PCI bus transfers are compared to GPU memory bandwidth, the use cases for this are not at all the same as for true shared memory... It's really just a developer convenience feature so you don't have to explicitly orchestrate CPU-GPU memory transfers yourself (which you may be better off doing anyway to maximize performance).
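Concretely, this is CUDA's managed-memory path (a sketch, error handling omitted): one allocation, one pointer, usable from both sides, with the driver migrating pages over the bus on demand rather than you calling `cudaMemcpy`.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *data;
    // One allocation in a single address space, visible to both
    // CPU and GPU; pages migrate across the PCI bus on demand.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; i++) data[i] = (float)i;  // CPU writes
    scale<<<(n + 255) / 256, 256>>>(data, n);        // GPU reads/writes
    cudaDeviceSynchronize();
    printf("data[1] = %f\n", data[1]);               // CPU reads again

    cudaFree(data);
    return 0;
}
```

Convenient, but the migrations still cross the bus, which is why it's not a substitute for memory that is physically shared like Apple's.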
NVIDIA seem to be going in the same direction as Apple here, with their latest designs integrating GPU and CPU on a single module.
Unified memory was a key feature of Silicon Graphics's low end O2 workstation released in 1996 (well before that 2004 date.) It enabled both "unlimited" texture mapping memory and ability to map video streams as texture maps without extra memory moves or copies.
When SGI's viability became questionable, I always thought there might be some value in Apple scooping them up for innovative bits of value like that but that never came to pass. Would be interested to know if it was ever considered/rejected and why.
Quite deserved: CUDA is probably the reason Nvidia became a trillion dollar co.
In his 2004 PhD he says ATI had much better performance than Nvidia... but even if that were still the case, it wouldn't matter, as their tools and drivers are terrible.
Yes, but the point is "future streaming hardware" allowing for "stream computing" as Ian Buck puts it in 2004. The prediction is not: we will have unified memory and we will be able to boil eggs on a chip (such as Intel integrated graphics chip), but that we will have unified memory and be able to run Vicuna 13B locally (such as the M1/M2 chips).
I’ve been wanting to get into graphics card programming for a while and have found the documentation to be extremely difficult to understand. I was wondering if anyone knew of any good tutorials that can help me out here.
They have their own alternatives to CUDA & cuDNN called ROCm & MIOpen, as well as tools (HIP, HIPify) that let you write code that'll run on both NVIDIA/CUDA and AMD cards.
There are also AMD versions of PyTorch and TensorFlow.
The problem is that all of these efforts are not quite 100% there ... there are bugs and incompatibilities that seem to make most people abandon them. It's a shame since the hardware itself seems great.
Another big issue is that AMD does not officially support their CUDA alternative on consumer hardware. Finding a list of supported GPUs is basically impossible so I am not surprised nobody bothers adding ROCm/HIP support to their software.
That is a stark contrast to Nvidia where everything works on even the most entry level GPU, encouraging adoption in third-party software.
Feels like this (and the drivers, as recently rediscovered by geohot) should be their #1 priority to be frank. Or should have been for the last few years but better late than never.
That would be the most bang for the buck they could get to improve their competitiveness (and share price).