Let's get back to the original topic…
Jeff Stuecheli, who has the title of chief nest architect for the Power8 processor, gave the presentation at Hot Chips going over the feeds and speeds. If the cores on a Power chip are the eggs, then the chief nest architect worries about all of the other things that surround the cores – what Intel calls the uncore regions when it talks about chips.
The Power8 nest is lined with L3 caches, PCI-Express and DDR memory controllers, various other accelerators to speed up functions that might otherwise run on the cores, and the NUMA interconnects for implementing shared memory across multiple sockets.
With the Power8 chip, IBM has a few goals. First, the company is shifting from the 32-nanometer processes used for the relatively recent Power7+ chips to a 22-nanometer process. The shrinking of the transistor gates allows IBM to add more features to a die, crank the clocks, or do a little of both.
Judging from the Power8, it looks like IBM is content to stay in the same clock speed range as the Power7+ chips – around 4GHz, give or take a little. It’ll also move PCI-Express 3 controllers into the chip package to keep those hungry little Power8 cores fed; these controllers will offer a coherent memory protocol to external accelerators. The chip also gets a new cache hierarchy that goes all the way out to the L4 cache.
As expected, IBM is also goosing the number of processor threads per core with Power8, doubling it up to eight per core. IBM has been vague about how many cores it might squeeze onto a die with the 22-nanometer shrink, and it could probably have done as many as sixteen cores if it had not added so much eDRAM L3 cache memory with the Power7+ and then boosted it even further with the Power8.
On the workloads that Big Blue is targeting with its Power Systems iron, having more cache and cores running at near peak utilisation is more important than having lots of cores on a die. Just as is the case for mainframes, at the prices that IBM has to charge for Power Systems servers, the chip has to be architected to run at close to full-tilt-boogie in a sustained manner. If IBM can do that, then it can garner the prices it commands and the profits we all presume it gets from Power Systems.
The Power8 chip is implemented in IBM’s familiar high-k metal gate processes, which include copper and silicon-on-insulator technologies in a 22-nanometer process. The precise transistor count was not given during the presentation, but the Power8 chip weighs in at 650 square millimetres; this is a bit bigger than Power7+, which used a 32-nanometer process, had 2.1 billion transistors, and a surface area of 567 square millimetres.
The Power8 core has a total of sixteen execution pipes. These include two load store units (LSUs), a condition register unit (CRU), a branch register unit (BRU), and two instruction fetch units (IFUs). There are also two fixed-point units (FXUs), two vector math units (VMXs), a decimal floating unit (DFU), and one cryptographic unit (not labeled in the core diagram above).
Each core now has eight threads implemented using simultaneous multithreading (what IBM calls SMT8), instead of four threads per core with the Power7 and Power7+ chips. And like earlier Power chips, this SMT is dynamically tuneable so a core can have one, two, four, or eight threads fired up.
Putting it all together: What does a complete package look like?
If single-thread performance is the most important thing for a piece of work, a core or set of cores will step down the threading automagically and run it with fewer processor threads. The Power8 core, said Stuecheli, has twice as much L1 data cache at 64KB compared to its predecessor (L1 instruction cache remains the same). Data buses from L1 to L2 cache on the die are now twice as wide at 64 bytes. The core has larger issue queues, improved branch prediction, can handle twice as many data cache misses, and has significantly beefed up prefetching of instructions and data. Add it all up, and at a 4GHz clock speed, a Power8 chip will yield about 1.6 times the single-threaded performance of a Power7 chip from 2010.
Each core has 512KB of SRAM L2 cache memory etched right near it. A segmented NUMA-like L3 cache, using what IBM calls a “non-uniform cache architecture”, or NUCA for short, spans all twelve cores on the die, for a total of 96MB of L3 cache. That’s only 8MB of L3 cache per core, compared to 10MB per core for the Power7+ chip announced last year, but the Power8 has a much more sophisticated main memory subsystem and an L4 cache that obviates the need for so much L3 cache on the die. (More on that in a second.) The L3 cache is implemented using embedded DRAM, as was the case with the Power7 and Power7+ processors.
At a 4GHz clock speed, you can move data into L3 cache from the external L4 cache at 128GB/sec and from the L3 cache out to L4 at 64GB/sec. Data can be crammed into L2 cache from L3 at 128GB/sec (or back out at the same bandwidth). The pipe from L2 cache into the cores has 256GB/sec of bandwidth, but only 64GB/sec in the other direction. Add it all up, across a twelve-core Power8 chip that works out to 4TB/sec of L2 cache bandwidth and 3TB/sec of L3 cache bandwidth.
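For anyone who wants to check the sums, here is a rough Python sketch of how those chip-wide totals can be reproduced. It assumes the L2 and L3 figures quoted above are per core and that the aggregate simply adds both directions across twelve cores; IBM did not spell out its method, so treat the variable names and the approach as illustrative only.

```python
# Per-core cache bandwidths quoted in the article, in GB/sec at 4GHz.
# Assumption: these figures are per core and both directions are summed.
CORES = 12
L2_TO_CORE = 256   # L2 cache into the core
CORE_TO_L2 = 64    # core back out to L2
L3_TO_L2 = 128     # L3 cache into L2
L2_TO_L3 = 128     # L2 back out to L3 (same bandwidth)


def aggregate_tb_per_sec(per_core_gb, cores=CORES):
    """Scale a per-core GB/sec figure to a whole chip, in TB/sec (1TB = 1,000GB)."""
    return per_core_gb * cores / 1000


l2_total = aggregate_tb_per_sec(L2_TO_CORE + CORE_TO_L2)
l3_total = aggregate_tb_per_sec(L3_TO_L2 + L2_TO_L3)

print(f"L2: {l2_total:.2f}TB/sec, L3: {l3_total:.2f}TB/sec")
```

Under those assumptions the sums come to 3.84TB/sec for L2 and about 3.07TB/sec for L3, which round to the 4TB/sec and 3TB/sec quoted.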
Chip makers have been putting memory controllers onto processors for quite some time now, but IBM has done something clever with the Power8. Instead of picking either an existing DDR3 or a future DDR4 controller for the die, Big Blue has instead created a generic memory controller for the die that speaks out over a high-speed bus to a memory buffer (and now quasi-controller) chip called Centaur. This chip is so named, says Stuecheli, because it is half L4 cache and half memory controller.
In this case, the Centaur chip is implementing DDR3 main memory, but should IBM want to shift out to DDR4 at some future time, it can swap out the memory cards and their integrated L4 cache and buffer chips that were designed for DDR3 memory for ones that use DDR4 chips without changing anything on the processors.
All of the memory scheduling logic, caching structures, and energy management features of what was an on-die memory controller with prior Power chips are now in the Centaur chip. That memory link between the Power8 package and the Centaur memory buffer chip has a 40-nanosecond latency and 9.6GB/sec of bandwidth. That Centaur chip is also implemented in IBM’s 22-nanometer processes and includes 16MB of cache memory which is used as L4 cache by the processor.
Each Power8 chip can have up to eight of these Centaur chips, for a total of 128MB of L4 cache in a fully loaded socket. That socket would have eight memory channels, for a total of 230GB/sec of sustained bandwidth into and out of the processor and the 32 DDR memory ports hanging off one twelve-core chip would have 410GB/sec of peak bandwidth at the DRAM level.
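A quick back-of-the-envelope sketch in Python breaks those socket-level numbers down. It assumes one memory channel per Centaur chip and an even split of bandwidth across channels and DDR ports, neither of which IBM stated explicitly; the names are made up for illustration.

```python
# Socket-level figures quoted in the article.
CENTAURS_PER_SOCKET = 8
L4_PER_CENTAUR_MB = 16
SUSTAINED_GB_PER_SEC = 230   # sustained, into and out of the processor
DDR_PORTS = 32
PEAK_DRAM_GB_PER_SEC = 410   # peak, at the DRAM level

# Total L4 cache in a fully loaded socket.
l4_total_mb = CENTAURS_PER_SOCKET * L4_PER_CENTAUR_MB

# Assumption: one channel per Centaur and an even split across them.
ports_per_centaur = DDR_PORTS // CENTAURS_PER_SOCKET
sustained_per_channel = SUSTAINED_GB_PER_SEC / CENTAURS_PER_SOCKET
peak_per_port = PEAK_DRAM_GB_PER_SEC / DDR_PORTS

print(f"L4 per socket: {l4_total_mb}MB")
print(f"DDR ports per Centaur: {ports_per_centaur}")
print(f"Sustained per channel: {sustained_per_channel:.2f}GB/sec")
print(f"Peak per DDR port: {peak_per_port:.2f}GB/sec")
```

On that even-split assumption, each channel sustains a little under 29GB/sec and each DDR port peaks at roughly 12.8GB/sec.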
With 32GB DDR3 memory sticks, each Power8 socket will be able to support 1TB of main memory, and presuming the high-end Power8 machine has 32 sockets like the Power7-based Power 795 server does, that means IBM can deliver a box with 32TB of memory across 384 cores and 3,072 processor threads.
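The arithmetic behind that hypothetical 32-socket box is easy to sketch in Python. The one assumption here is that each socket takes 32 memory sticks, one per DDR port, which is what 1TB per socket with 32GB sticks implies; the function and field names are invented for this example.

```python
# Figures from the article; DIMMS_PER_SOCKET is an inferred assumption
# (one 32GB stick on each of the 32 DDR ports).
SOCKETS = 32
CORES_PER_SOCKET = 12
THREADS_PER_CORE = 8      # SMT8
DIMM_GB = 32
DIMMS_PER_SOCKET = 32


def system_totals(sockets=SOCKETS):
    """Return memory (TB), cores, and threads for a machine with `sockets` sockets."""
    memory_per_socket_tb = DIMM_GB * DIMMS_PER_SOCKET / 1024  # 1TB = 1,024GB
    cores = sockets * CORES_PER_SOCKET
    return {
        "memory_tb": memory_per_socket_tb * sockets,
        "cores": cores,
        "threads": cores * THREADS_PER_CORE,
    }


totals = system_totals()
print(totals)
```

Calling system_totals() with a smaller socket count gives the same breakdown for smaller machines, assuming they use the same twelve-core part and memory loadout.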
The Power8 chip will also have integrated PCI-Express 3.0 controllers, bringing IBM’s Power chips on par with competing Sparc T5 and M5 chips from Oracle and Xeon E5 (and soon Xeon E7) chips from Intel. Those PCI-Express ports have an aggregate of 48GB/sec of I/O bandwidth, significantly more than the 20GB/sec that the Power7 and Power7+ chips offered with the combination of the GX++ bus and I/O bridge chip that was used to implement PCI-Express 2.0 slots.
These integrated PCI-Express 3.0 controllers on the Power8 die provide the transport layer for what IBM is calling the Coherence Attach Processor Interface, or CAPI. And this interface will allow accelerators plugged into the PCI bus of a system – possibly GPU coprocessors or field programmable gate arrays – to easily access data and follow pointers in main memory just like processors themselves do. This is going to be very handy, and has a good chance of getting Big Blue back into the supercomputer racket in a way that didn’t happen with the Power7-based beast formerly known as “Blue Waters”.
Depending on the workload, a Power8 chip will yield somewhere around 2.5 times the performance of a baseline Power7+ chip. Again, we presume those are comparisons for chips running at 4GHz.
IBM will offer memory cards with 32GB, 64GB, and 128GB capacities, will have a variety of chip packaging options and will use the Power8 chip across a full line of machines, William Starke, the SMP architect for the Power processors, told El Reg. IBM is not being precise about when the Power8 will come to market, with rumours ranging from late 2014 to early 2015, but Starke said those rumours were wrong and that mid-2014 is a better timeline for system launches using the Power8 chips.
IBM was showing off a part, has systems of all sizes up and running in its labs using the Power8 chips, and has been designing the Power9 processor for quite a while already, according to Starke.