Archive for the ‘Architecture’ Category

Thoughts on Magny-Cours

Monday, September 21st, 2009

AMD just announced their Magny-Cours[1] 12-core processor.

The second-most interesting thing here is AMD’s retreat to MCM-based (multi-chip module) designs, as Jon Stokes describes. Intel made their first “multicore” chips by gluing two normal chips together; AMD made fun of them for not actually “designing” anything. But right now, AMD is behind in the processor wars, and to catch up they’re taking the time-saving and cost-saving measure of … gluing two normal chips together. In the fullness of time all statements eventually become ironic.

The most interesting thing here is the directory-based cache coherence scheme. Quoting Jon Stokes again:

The solution that AMD has adopted with Istanbul and Magny-Cours involves setting aside 1MB of each chip’s 6MB cache to store a directory of the contents of the other chips’ caches, so that by consulting this local directory each chip can avoid broadcasting a significant number of traffic-increasing snoop requests to the other chips.

Directory-based coherence is nothing new, but I think it’s new to commodity workstation- and server-class processors.


Starting with this column, I’m going to explain what I’m talking about. I should aim these columns toward an audience less specialized than me in architecture, rather than assume the audience knows more than me. I hope this will make my columns more accessible, and also clarify my knowledge of architecture. “To explain is to understand”.

Cache coherence is too big a topic for me to cover right now, so I’ll glide over it, and caches as well. There are two main schemes for cache coherence: snoopy and directory.

Snoopy coherence (or snoopy-bus) has all processors connected to a shared bus to memory. Whenever a processor reads from or writes to main memory, the other processors “snoop” on the bus to make sure their own caches are up to date.

In directory coherence, there is a single directory (centralized or distributed) that tracks the location and status of each line of memory. When a processor wants to read or write memory, it consults the directory first, then sends a (usually) targeted message to any processors that already have that line.
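To make the directory idea concrete, here is a minimal sketch of my own in Python. It is a toy, not AMD's design: the `Directory` class, its method names, and the message tuples are all made up for illustration, and real protocols track more state per line (modified, exclusive, shared and so on) and move data as well as invalidations.

```python
from collections import defaultdict


class Directory:
    """Toy directory: maps each memory line to the set of CPUs caching it."""

    def __init__(self):
        self.sharers = defaultdict(set)  # line address -> set of CPU ids
        self.messages = []               # log of (kind, from_cpu, to_cpu, line)

    def read(self, cpu, line):
        # A reader is simply recorded as one more sharer of the line.
        self.sharers[line].add(cpu)

    def write(self, cpu, line):
        # Only CPUs that actually hold the line receive an invalidation.
        for other in sorted(self.sharers[line] - {cpu}):
            self.messages.append(("invalidate", cpu, other, line))
        # After the write, the writer is the sole holder of the line.
        self.sharers[line] = {cpu}


d = Directory()
d.read(0, 0x40)
d.read(2, 0x40)
d.write(1, 0x40)  # invalidations go to CPUs 0 and 2 and nobody else
print(d.messages)
```

The point of the sketch is the loop in `write`: a CPU that never touched the line is never bothered, which is exactly what a broadcast cannot promise.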

Directory coherence involves more messages than snoopy-bus coherence, but almost all of them are targeted point-to-point, rather than broadcast. In these days of growing wire delay, broadcast buses are a bad idea (just ask Michael Taylor). From Jon Stokes’s article, it sounds like AMD was also concerned about the limited off-chip bandwidth. Broadcasting between sockets must be several times slower than broadcasting between cores.
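A rough way to see the trade-off is to count what one write costs under each scheme. The sketch below is a back-of-the-envelope model of my own, not a description of any real protocol: the function names are hypothetical, and it assumes a directory write costs one lookup message plus an invalidate and an acknowledgement for each sharer.

```python
def snoopy_cost(num_cpus, sharers):
    """One write on a snoopy bus: a single broadcast every other CPU must observe.

    `sharers` is ignored on purpose: the broadcast goes out no matter who holds the line.
    """
    return {"broadcasts": 1, "point_to_point": 0, "cpus_disturbed": num_cpus - 1}


def directory_cost(num_cpus, sharers):
    """One write with a directory: a lookup, then an invalidate and an ack per sharer.

    `num_cpus` is ignored on purpose: only the actual sharers are contacted.
    """
    return {"broadcasts": 0, "point_to_point": 1 + 2 * sharers, "cpus_disturbed": sharers}


# Four chips, with zero, one, or two other chips holding the line being written.
for sharers in (0, 1, 2):
    print(sharers, snoopy_cost(4, sharers), directory_cost(4, sharers))
```

With two sharers the directory sends five messages where the bus sends one, but the bus disturbs all three other chips every time, even when nobody else holds the line.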

  1. My high school French tells me that’s pronounced very much like “many core”. Nice branding, AMD.

Thoughts on the Rumored IBM/Sun Buyout

Wednesday, March 25th, 2009

There are rumors floating around that IBM plans to buy Sun. A few friends and I discussed the effects on Twitter.

cwhitney: I kinda like the IBM + Sun idea. That actually works, although it would then be basically SunBM versus HP versus genericroSoft.
jauricchio: I’m in favor of hitting ZFS, DTrace, and OSol with the GPLHammer. I’m not in favor of axing Rock and Niagara. You know it’s true.
cwhitney: They would have a stable of old but mission critical ($$$) unix OSes too. GPLv3 Hammer is a no for me (you do want OS X ZFS?)
jauricchio: It’d be v2. THEY want it in Linux.
cwhitney: Rock and Niagara may go away, but a future of Sun arch guys + PPC team + actual in-house fab = fun times.
jauricchio: Not stoked about POWER5, 6. Niagara and Rock broke more notable architectural ground.
cwhitney: Most certainly but most all the non-embedded, non-x86 CPU arch work comes from those teams. IBM also still has fabs, unlike most
djcapelis: You mean unlike… AMD? :( Yeah it would be interesting to see them together. I would hope for people to jump between them more.
cwhitney: I understand the $ reasons, but that AMD move was dumb. Losing vertical integration = bad.

I’m very pleased by Sun’s work in architecture. Niagara and Rock are both bold experiments. At a time when most of the chip vendors were just starting to realize single-threaded scaling was going to get harder, Niagara threw away single-thread performance for radical parallelism. For the workloads Sun targets (network serving, mostly), that turned out to be a very, very good trade-off. These days you can get a 4U with 256 hardware threads and 512GB RAM. That’s a lot of threads. Matched with Sun’s reliably solid memory systems, that’s some pretty serious multi-thread performance.

Rock is something out of a research paper. Somebody finally built a hardware transactional memory system? Suddenly all those papers become relevant to the real world! I’ve got more to say on Rock, but that’s another column. Let’s just say it’s a Good Thing.

On the other hand, IBM’s architecture team hasn’t impressed me lately. The POWER6 looks like a solid chip, but it’s just more of the same: all the old tricks with bigger numbers.

  • Two-way SMT is good.
  • The semi-shared L2 looks like a cute idea: if you have up to four threads working intensely on the same data, you can fit up to 8MB in their L2. Without semi-sharing, you’d only get the same speed for 4MB between two threads. That wider and larger sharing could squeeze some more parallel speedup out of code that can be parallelized but still contends for the same data. To put everything in the right places, you’d need a good scheduler that can see the coherence patterns.
  • The L3 is huge! 32MB? What is this, a PA-8800 Mako?
  • Clock speeds are ever higher. Anybody running lots of POWER chips doesn’t care about power and heat: they’ll just put a little more in the budget for their new supercomputing center. Because of who they sell to, IBM is in some ways immune to the general purpose computer power/heat crunch. The first wave of the crunch (laptops and desktops) only hit Intel, AMD, and IBM’s PowerPC. The second wave (datacenters) is hitting everyone but the POWER team. At least, that’s sure how it looks from where I’m standing.

Pretty much the only things I find interesting in POWER6’s architecture are the semi-shared cache and the retreat to shallow in-order pipes. Even the latter was foreshadowed by Niagara and the Cell/Xenon.

Don’t get me wrong: I’m not trying to belittle the POWER6 in any way. It’s a great work of engineering. I’m just not impressed with it as research. In contrast, Sun’s doing research with every processor they make. If IBM does buy Sun, I really hope they let Sun’s architects and chip engineers keep doing their thing. The best future, as Chris said, is Sun’s creativity on IBM’s resources.

Thoughts on the Atom

Monday, May 19th, 2008

The RISC vs CISC war isn’t over, and the next battle will be for handheld devices. Intel’s new Atom microarchitecture looks like a very interesting competitor to ARM and PowerPC in the “embedded systems with muscle” space (roughly: smartphones and set-tops). Hannibal nicely sums up the issue in an article that’s made the rounds of Slashdot et al, so I’ll let him do the talking for a few moments.

RISC vs CISC in the Mobile Era

I’m surprised at how strongly Intel is now embracing SMT. The Core lost HyperThreading for power and heat concerns a few years ago, and it stayed out of Core 2. But this year, Nehalem brings back SMT… and it’s in Atom too!

SMT in an in-order low-power chip is an interesting choice. Historically, SMT was about performance (not about perf per watt). In 2000, if you had a big honkin’ superscalar, you probably didn’t care about power consumption much. Hannibal makes the very strong and clear point that because of Atom’s x86 legacy (the excess of transistors burned on predecode, length decode, and complex-op microcode hardware), it’s impossible to follow the ARM Cortex strategy of building a tiny core and stamping them out (see also Sun Niagara!). The front-end is so heavy that its power cost has to be shared by/amortized over a few threads.
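The amortization argument can be put as a toy model. This is my own sketch with made-up, unitless numbers, not measurements of Atom or any other chip; `power_per_thread` and both constants are hypothetical, and the model optimistically assumes every hardware thread stays busy.

```python
def power_per_thread(front_end, per_thread, threads):
    """Shared front-end cost is split across threads; per-thread cost is not."""
    return front_end / threads + per_thread


# Made-up, unitless numbers chosen only to show the shape of the trade-off.
HEAVY_FRONT_END = 6.0  # hypothetical cost of an x86-style decode front-end
PER_THREAD = 1.0       # hypothetical cost of one thread's state and execution

one = power_per_thread(HEAVY_FRONT_END, PER_THREAD, threads=1)
two = power_per_thread(HEAVY_FRONT_END, PER_THREAD, threads=2)
print(one, two)
```

The heavier the shared front-end is relative to the per-thread cost, the more a second thread helps, which is the case Hannibal makes for SMT on Atom.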

I’d suspect, for comparable parts, Atom will outperform Cortex on multithreaded workloads (no surprise), Cortex will beat Atom for complex single threads, and Cortex will use much less power than Atom on easy single threaded code.

Finally, I’m still not convinced by Intel’s “x86 everywhere” strategy. This is the embedded space, where different system boards share nothing in common. In answering the question, “What does this device look like to my code?” the ISA is the least interesting thing to examine. The embedded community has to support many many wildly different systems, and they do a very good job of it. The x86 community has not had any experience like this, and I don’t think giving them the option to adapt to this new world is necessarily a productive thing to do.[1]

Case in point: the Linux i386 branch is almost exclusively intended for “PCs”… even a diskless workstation like Scott’s little Cyrix is way out in the boonies of supported systems. But Linux also supports dozens of fantastically varied embedded systems: I count 59 ARM-based, 27 MIPS-based, 22 PPC-based, and 22 others including Super-H, SHARC, Blackfin, Tensilica, and FPGA soft-cores. There are only ten x86-based embedded systems. It is the embedded community that can most effectively accommodate new devices. All x86 could bring to the table is an arrogant assumption that things “ought to work like they do on PCs” and binary compatibility with software nobody cares about. If I’m building a set-top box, I don’t care if it can run Word ’97. That’s just not a selling point I see for the Atom.

  1. Of course the PC world has many different devices, and Windows users have been dealing with driver problems as long as there have been PCs or drivers. But it’s one thing to have to track down the right driver for your old ISA sound card. It’s something completely different when your CPU talks to the sound chip over memory-mapped registers that go through a Spartan-3’s GPIO pins.