Nvidia
Nvidia Corporation is an American multinational corporation and technology company. It is a software and fabless company that designs and supplies graphics processing units (GPUs), application programming interfaces (APIs) for data science and high-performance computing, and system-on-a-chip units (SoCs) for the mobile computing and automotive markets. Nvidia is also a dominant supplier of artificial intelligence (AI) hardware and software.
Nvidia's professional line of GPUs is used for edge-to-cloud computing and in supercomputers and workstations for applications in fields such as architecture, engineering and construction, media and entertainment, automotive, scientific research, and manufacturing design. Its GeForce line of GPUs is aimed at the consumer market and is used in applications such as video editing, 3D rendering, and PC gaming. The company expanded its presence in the gaming industry with the introduction of the Shield Portable (a handheld game console), Shield Tablet (a gaming tablet), and Shield TV (a digital media player), as well as its cloud gaming service GeForce Now.
If someone is interested in moderating this community, message @[email protected].
The article says nothing really relevant about the architecture or implementation. "Unified memory" by itself says nothing. Is it using a GPU-like math coprocessor, or just extra CPU cores? If it is just the CPU, it will have the same cache bus width limitations as any other CPU. If it is a split workload, it will be limited to tools that can split the math.

The article also compares this to ancient standards of compute, like 32 GB of system memory, when AI has been in the public space for nearly two years now. For an AI setup, 64 GB to 128 GB is pretty standard. It also talks about running models that are impossible to fit on even a dual-station setup in their full form. You generally need memory equal to roughly twice the parameter count in gigabytes to load the full version, and you need the full version for any kind of training, though not for general use. Even two of these systems, at 256 GB of memory combined, are not going to load a 405B model at full precision. Sure, they can run a quantized version, and that will be like having your own ChatGPT 3.5, but that is not training or a real developer use case. Best case scenario, this would load a 70B at full precision.

For reference: on an Intel 12th-gen CPU with 20 logical cores, 64 GB of DDR5 at max spec speed, and a 16 GB 3080 Ti, a 70B with Q4L quantization loads and streams slightly slower than my natural reading pace. There is no chance that setup could be used for anything like an agent or for more complex tasks that require interaction.

My entire reason for clicking on the article was to see the potential architecture difference that might get larger models past the extremely limited cache bus width present in all CPU architectures. The real-world design life cycle of hardware is about 10 years, so real AI-specific hardware is still at least 8 years away. Anything in the present is marketing wank and hackery. Hackery is interesting, but it is not in this text.
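To put rough numbers on the capacity point, here is a quick back-of-the-envelope sketch (my own illustrative figures, counting weights only and ignoring KV cache and runtime overhead, so real requirements run somewhat higher):

```python
# Approximate memory needed just to hold model weights.
# Ignores KV cache, activations, and runtime overhead, so real
# requirements are somewhat higher. Bytes-per-parameter values
# are standard rules of thumb.

BYTES_PER_PARAM = {
    "fp32": 4.0,  # true full precision
    "fp16": 2.0,  # how "full" weights are usually distributed
    "q8":   1.0,  # 8-bit quantization
    "q4":   0.5,  # 4-bit quantization (Q4-style quants)
}

def weights_gb(params_billion: float, precision: str) -> float:
    """Approximate gigabytes required for the weights alone."""
    return params_billion * BYTES_PER_PARAM[precision]

for params in (70, 405):
    for precision in ("fp16", "q4"):
        print(f"{params}B @ {precision}: ~{weights_gb(params, precision):.0f} GB")

# Roughly: 140 GB for a 70B at fp16 (fits across two 128 GB boxes, not one),
# 810 GB for a 405B at fp16 (far beyond 256 GB from two linked systems),
# and about 202 GB for a 405B at q4 (possible across two boxes, but quantized).
```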
Can you tell us more about the CPU cache width problem?
It's not a real problem for a system like this. The system uses CXL. Their rant is just because they didn't take the time to click through to what the specs actually are.
The system uses the CXL/AMBA CHI specs under NVLink-C2C, which means the memory is linked directly to the GPU as well as to the CPU.
In that case, most of their complaints are unfounded; they would have to restate any concerns with those specs in mind.
Check https://www.nvidia.com/en-us/project-digits/ which is where I did my next level dive on this.
EDIT: This all assumes they are talking about the bandwidth requirements of treating all memory as a plain CPU allocation, rather than enabling concepts like LikelyShared vs. Unique.
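As a rough illustration of why bandwidth is the question that actually matters here: single-token decode is approximately memory-bandwidth bound, since each generated token has to stream the resident weights once. A crude sketch of that rule of thumb follows; the bandwidth figures are placeholders I picked for illustration, not published specs for this system:

```python
# Rule of thumb: during autoregressive decode, each new token reads
# (roughly) all resident weights once, so the ceiling is
#   tokens/sec ≈ usable memory bandwidth / model size in bytes.
# The bandwidth values below are illustrative placeholders only.

def decode_ceiling_tok_s(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on token rate for a bandwidth-bound decoder."""
    return bandwidth_gb_s / model_gb

MODEL_GB = 35  # e.g. a 70B model at ~4-bit quantization

SCENARIOS = {
    "dual-channel DDR5, CPU-only path": 90,
    "hypothetical unified CPU+GPU pool": 250,
    "discrete GPU VRAM (model must fit)": 900,
}

for name, bandwidth in SCENARIOS.items():
    print(f"{name}: ~{decode_ceiling_tok_s(MODEL_GB, bandwidth):.1f} tok/s ceiling")
```

Whatever the real figures turn out to be, a coherent, directly GPU-attached pool is what moves a box like this off the CPU-only line, which is what the cache/bus-width complaint was really about.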