We’re excited to talk with one of AMD’s top engineers today about RDNA 3. Sam Naffziger has a great track record of success in the industry, and we’re thrilled to have him here with us to talk about his work on this new architecture. Chiplets were a relatively unproven concept in 2016, when Naffziger first proposed them to the company. He successfully convinced everyone at AMD that they were the way to go, and now his work is paying off in a big way. Thanks to chiplets, AMD GPUs are more powerful than ever. In this interview, Sam will give us the straight scoop on why RDNA 3 uses chiplets and what this means for AMD’s future prospects. We only get opportunities like this every few years, so be sure to read carefully!
AMD Zen to AMD RDNA
On the heels of AMD's Radeon RX 7000 series announcements, I have Sam Naffziger here, who previously joined us to talk about Zen 1. He has a long history in CPUs and power efficiency, having led the chiplet architecture for AMD's Ryzen and Epyc lines before moving into graphics in 2017. Most recently, he worked on the chiplet architecture for RDNA 3. It's been exciting stuff. Sam gave an excellent presentation here for the press, and I wanted to go over a few points with all the attendees so they could benefit from his wealth of knowledge.
The first question is obviously about chiplet design:
What is the motivation for chiplets? Cost is the most obvious one, but as Sam explained, there are other motivations at play beyond simple economics. You mentioned the split between components that scale poorly with new process nodes, such as memory and cache, and the logic components that do the actual computation and benefit far more from each node shrink.
The economically viable way to take advantage of these costly advanced nodes is therefore to shrink only the components that benefit, integrating them as smaller dies, also known as chiplets. This not only puts each function on the process best suited to it, but also keeps the electrical interfaces between the composite IP blocks of the system-on-a-chip (SoC) fast, thanks to efficient short-reach signaling, and avoids the crowding and crosstalk issues that plague dense monolithic designs where everything, including the proverbial kitchen sink, ends up sharing a single piece of silicon.
But wait! There are even more advantages. As we shrink to smaller and smaller geometries, with thinner wires (the metal interconnects between blocks), and beyond the obvious marketing value of being able to say you have a new product, another wrinkle appears: power. At such reduced dimensions, transistors begin to leak current, a phenomenon known as sub-threshold conduction, in which current trickles through the channel even when the gate voltage is below the threshold and the transistor is nominally switched off.
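To make that effect concrete, here is a minimal sketch of the textbook subthreshold relation, I ≈ I0 · exp((Vgs − Vth) / (n · kT/q)). The constants i0 and n below are illustrative placeholders, not values for any real AMD process; the point is only that off-state current grows exponentially as threshold voltages come down.

```python
import math

def subthreshold_leakage(v_gs, v_th, n=1.5, temp_k=300.0, i0=1e-7):
    """Rough subthreshold current model: I = I0 * exp((Vgs - Vth) / (n * kT/q)).

    i0 and n are illustrative fitting constants, not real process values.
    """
    kt_q = 8.617e-5 * temp_k  # thermal voltage kT/q in volts (~25.9 mV at 300 K)
    return i0 * math.exp((v_gs - v_th) / (n * kt_q))

# An "off" transistor (Vgs = 0 V) still conducts. Lowering Vth by 100 mV
# to chase frequency multiplies that off-state leakage by more than 10x.
for v_th in (0.40, 0.30):
    leak = subthreshold_leakage(v_gs=0.0, v_th=v_th)
    print(f"Vth = {v_th:.2f} V -> off-state leakage ~ {leak:.2e} A")
```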
The Many Advantages of Chiplet Technology
This increases electricity consumption enough to matter on a home power bill, and at commercial-production scale it becomes a genuinely enormous energy cost; the polar ice caps will not literally melt, but the point stands. Physics put a ceiling on clock speed years ago: after decades of frequencies doubling every few years, power limits caused that scaling to plateau. So again, it is good to have a flexible chiplet philosophy, as the various parts of the system can be updated at different rates rather than being refreshed all at once.
The other motivation Sam talked about was leveraging different vendor processes that specialize in certain types of technology to create a more efficient SoC overall. For example, one company may excel at producing high-quality, high-precision ADCs on its 14-nanometer node, while another working on a 7-nanometer node can produce memory with much higher bit density at a lower price. It would be silly to use a technologically inferior part at twice the area, wouldn't it? Being able to select the best components in the industry, regardless of which manufacturer makes them, greatly reduces the risk of uncompetitive silicon coming off the line. A toy comparison after this paragraph shows the arithmetic.
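Here is that sketch. Every number, the cost per square millimeter and the scaling factors alike, is a made-up placeholder for illustration, not real foundry pricing: logic that halves in area can justify the pricier node, while SRAM or analog that barely shrinks cannot.

```python
# Illustrative cost-per-block comparison across two nodes. All numbers are
# hypothetical placeholders, not real foundry pricing or scaling factors.

def block_cost(area_mm2, cost_per_mm2):
    return area_mm2 * cost_per_mm2

OLD_NODE_COST = 0.08   # $/mm2, hypothetical mature node
NEW_NODE_COST = 0.17   # $/mm2, hypothetical leading-edge node

# Logic shrinks well (~0.5x area); SRAM and analog barely shrink (~0.8x).
logic_old, logic_new = 100.0, 50.0     # mm2
sram_old,  sram_new  = 100.0, 80.0     # mm2

print("logic: old node $%.2f vs new node $%.2f" %
      (block_cost(logic_old, OLD_NODE_COST), block_cost(logic_new, NEW_NODE_COST)))
print("sram:  old node $%.2f vs new node $%.2f" %
      (block_cost(sram_old, OLD_NODE_COST), block_cost(sram_new, NEW_NODE_COST)))
# Logic: $8.00 vs $8.50 (roughly a wash, and the new node is faster);
# SRAM:  $8.00 vs $13.60 (the mature node is the clear economic winner).
```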
Defective silicon is not something you want to experience firsthand! So chiplets contribute to fewer bad dies, far more reliable products, and improved customer satisfaction!
 In conclusion, it is clear that chiplets offer immense advantages over traditional monolithic designs in terms of cost, performance, and flexibility. As the industry continues to move towards smaller nodes, chiplet architecture will become increasingly important. We can’t wait to see what the future holds for this technology.
AMD was transformed by chiplets
Chiplets have been gaining traction in the media as a way to reduce costs, but not many people have considered what that means from a chip-manufacturing perspective. Back in 2016, when chiplets were still unproven, I spent a lot of time convincing everyone in the company that they were the way to go. One result is that our internal cost models can now predict pretty accurately what it would cost to build a full 16-core CPU monolithically.
Monolithic CPUs, with all their associated interfaces and components, are much larger and therefore suffer far worse yields than smaller chiplets. Wafers inevitably carry defects that can ruin entire dies, but if the dies are small enough, a given defect kills only one small die rather than a single huge one. This makes more efficient use of costly production runs and reduces overall waste.
In other words, packaging fewer cores on each individual die yields more good silicon per wafer for any given node or transistor budget. In addition, the advantages of a chiplet-based approach, such as faster time to market, lower complexity, and easier adaptability to new nodes, make it a much more attractive option overall. When we can focus most of our engineering team on the components that have the greatest impact on product performance, we end up with better products released on a more predictable schedule.
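The yield argument can be sketched with the classic Poisson model, Y = exp(−A·D). The defect density below is an assumed illustrative figure, not a number from AMD or any foundry; the die sizes are likewise hypothetical.

```python
import math

def poisson_yield(area_mm2, defects_per_cm2):
    """Simple Poisson yield model: Y = exp(-A * D)."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)

D = 0.1  # defects per cm^2 -- an illustrative defect density

mono = poisson_yield(400.0, D)     # hypothetical monolithic 16-core die
chiplet = poisson_yield(80.0, D)   # hypothetical 80 mm^2 chiplet

print(f"monolithic 400 mm^2 die yield: {mono:.1%}")    # ~67.0%
print(f"80 mm^2 chiplet yield:         {chiplet:.1%}")  # ~92.3%

# A defect kills a whole 400 mm^2 die, but only one small chiplet out of
# five covering the same silicon area, so far less good material is lost.
```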
GPU vs CPU Design Requirements
When talking about the differences between GPUs and CPUs, one of the key issues is bandwidth: a GPU has to move far more data in a given time than a CPU does, because GPUs are designed to process huge amounts of data simultaneously. Within a GPU, there are shader engines that act as modules. In RDNA 3, these shader engines have decoupled clocks, making them relatively independent of each other. They still require enormous bandwidth, however, because of the work they must perform, such as distributing textures and vertices among themselves. If all of that traffic were routed through the kind of chiplet interfaces our CPUs use, it would be far too slow by comparison.
Microscope shot of RDNA3
This is an MCD interface. What you see here are about 50 of the hundreds of wires that make up the GCD-to-MCD interface. We call it a fanout link, and it provides the Infinity Fabric interface. From a logical perspective it looks like a single wire, but it actually pumps bits in a quad fashion, four bits per wire per clock, to keep the routing manageable here at the receiver end.
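As a back-of-the-envelope check on what quad-pumping buys, here is a minimal sketch; the wire count and clock below are hypothetical placeholders, not the actual parameters of AMD's fanout links.

```python
# Bandwidth of a quad-pumped link: each wire carries 4 bits per clock cycle.
# Wire count and clock are illustrative, not AMD's real link parameters.

def link_bandwidth_gbps(wires, clock_ghz, bits_per_wire_per_clock=4):
    return wires * clock_ghz * bits_per_wire_per_clock

bw = link_bandwidth_gbps(wires=384, clock_ghz=2.0)
print(f"{bw:.0f} Gb/s (~{bw / 8:.0f} GB/s) per link")  # 3072 Gb/s, ~384 GB/s
```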
Fanout Routing
Fan-out routing is a packaging technology used in many smartphones. It provides a rigid base on which to mount chiplets and allows much finer line spacing than is possible with organic packages, yet it does not require a full silicon interposer and is therefore less expensive to manufacture.
Decreasing the Die Area
It's been a pleasure chatting with you about some of the graphics innovations we've made recently. As I mentioned, one area we are really focused on is reducing the size of our GPUs. We knew the wire count on traditional monolithic graphics chips was too high, and we noticed that there was a clean boundary between the Infinity Cache plus memory interfaces and the rest of the GPU.
Porting the GDDR6 interfaces to N6 (TSMC's 6 nm) saved us a lot of time and money. In addition, this allowed us to split the memory functions into their own separate small dies, the MCDs, which, as you can see from the results, let us increase the number of compute units by 20%. We added new features and, at the same time, reduced the graphics die from 520 mm² to roughly 300 mm². I think this is a great achievement, and we are looking forward to seeing where this technology takes us in the future.
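Those die sizes translate directly into wafer economics. The sketch below uses a common textbook approximation for gross dies per wafer (edge loss and scribe lines ignored), not AMD's internal model, to show roughly how many more leading-edge die candidates the shrink buys on a standard 300 mm wafer.

```python
import math

def dies_per_wafer(die_mm2, wafer_mm=300):
    """Common gross-dies-per-wafer approximation for a round wafer."""
    r = wafer_mm / 2
    return int(math.pi * r**2 / die_mm2
               - math.pi * wafer_mm / math.sqrt(2 * die_mm2))

print("520 mm^2 monolithic die:", dies_per_wafer(520))  # ~106 candidates
print("300 mm^2 GCD:           ", dies_per_wafer(300))  # ~197 candidates
# Nearly twice as many leading-edge die candidates per wafer, with the
# Infinity Cache and GDDR6 interfaces built separately on cheaper N6 MCDs.
```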