NXP has been heading for the exits from the PowerPC business for many years. The microcontroller-class e200 still ekes out some relevance in the automotive sector, and sees occasional new silicon, but the big Freescale/NXP cores of yesteryear are essentially dead. NXP's network processor lineup has moved almost entirely to ARM.
The final generation of these cores, though, is noteworthy for its unconventional design. The PowerPC e6500 is a unique fused design vaguely reminiscent of AMD's Bulldozer, but with a very different implementation. Used in large network processors like the 12-core (or 24-core, depending on one's perspective) QorIQ T4240, the e6500 represents a road not taken for multithreaded architectures.
An e6500 "core" consists of two replicated "thread" structures. However, these "threads" very heavily resemble complete cores. They contain their their own fetch pipeline, their own decoders, their own register files, and mostly, their own functional units. Unlike Bulldozer, e6500's 32K+32K L1 cache is fully shared between threads, though d-cache tags are replicated per thread. TLBs are also mostly shared - with the exception of the L1DTLB structures, which are thread-private. Separate tables within the TLB serve variable-sized pages and 4K pages.
e6500 shares more functional units across its thread pair than Bulldozer does. The two VMX SIMD units - one handling permutes, the other all remaining VMX instructions - are shared. So is the single scalar FPU, and unlike Bulldozer, e6500 also shares the complex-integer unit, which handles multiply and divide instructions. Each thread has a private branch unit, a private load/store unit, and two private simple integer units. The simple ALUs are mostly symmetrical, though only one of them can execute population-count instructions.
=======================================================
|| Shared       | Private                            ||
||---------------------------------------------------||
|| Complex int  | 2x simple int                      ||
|| Vector       | Branch                             ||
|| FPU          | 256b Fetch                         ||
|| L1 cache     | 2-wide Decode                      ||
|| L1 ITLB      | L1 DTLB                            ||
|| L2 TLB       | Reorder and scheduling structures  ||
|| L2 cache     | LSU                                ||
=======================================================
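As a concrete picture of what lands on the shared vector hardware, here is a minimal AltiVec/VMX sketch of my own (not from any NXP material), assuming a PowerPC toolchain with -maltivec; the data and permute pattern are arbitrary. The vec_perm maps to the single shared permute unit, while vec_add and the rest of the VMX arithmetic map to the other shared vector unit - and both threads of a core contend for both.

    /* Minimal AltiVec sketch: the two classes of VMX work e6500 routes to its
     * two shared vector units. Illustrative only; data and pattern are arbitrary. */
    #include <altivec.h>
    #include <stdio.h>

    int main(void)
    {
        vector unsigned char data = {0, 1, 2, 3, 4, 5, 6, 7,
                                     8, 9, 10, 11, 12, 13, 14, 15};
        /* Byte-reversal control vector for vec_perm. */
        vector unsigned char rev  = {15, 14, 13, 12, 11, 10, 9, 8,
                                     7, 6, 5, 4, 3, 2, 1, 0};
        vector unsigned char one  = vec_splat_u8(1);

        for (int i = 0; i < 4; i++) {
            data = vec_perm(data, data, rev);  /* shared permute unit */
            data = vec_add(data, one);         /* shared simple VMX unit */
        }

        unsigned char out[16] __attribute__((aligned(16)));
        vec_st(data, 0, out);
        for (int i = 0; i < 16; i++)
            printf("%d ", out[i]);
        printf("\n");
        return 0;
    }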
e6500 is an out-of-order machine, albeit with tiny structure sizes. Each functional unit, both private and shared, has exactly one reservation station entry per thread, each fed by one of five thread-private issue queues - one each for branch, "general" (simple and complex ALU), load/store, FP, and SIMD. The branch IQ contains three instructions; all others contain four. Each thread enjoys a 16-entry ROB, which NXP calls a Completion Queue, as well as a separate 16-entry rename register file.
An e6500 thread can only sustain execution of two instructions per cycle - contrasting with the 8-wide instruction fetch available to each thread.[1] Decode is 2-wide, as is dispatch to the issue queues. The IQs can each accept two instructions per cycle from dispatch, and can each issue either one or two instructions per cycle to the functional units. Notably, the General Issue Queue is backed by three functional units (two private simple ALUs, one shared complex ALU) but can only issue to two of them in one cycle.
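To get a feel for what sustaining two per cycle per thread means in practice, here is a microbenchmark skeleton of my own (nothing from the e6500 manual): one dependent add chain against two independent ones. The two-chain loop does twice the payload adds per iteration; if a thread really can keep both of its simple ALUs busy, it should take noticeably less than twice as long. Dispatch width, the small issue queues, and the compiler's code generation all get a vote, so inspect the generated asm before trusting any numbers.

    /* Sketch: dependent vs. independent add chains as a dual-issue probe.
     * Iteration counts are arbitrary. The empty asm keeps the accumulators in
     * registers and stops the compiler from collapsing the loops into
     * closed-form expressions. */
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define N 200000000ULL

    static uint64_t one_chain(void)
    {
        uint64_t a = 1;
        for (uint64_t i = 0; i < N; i++) {
            a += i;                          /* every add depends on the last */
            __asm__ volatile("" : "+r"(a));
        }
        return a;
    }

    static uint64_t two_chains(void)
    {
        uint64_t a = 1, b = 2;
        for (uint64_t i = 0; i < N; i++) {
            a += i;                          /* chain 1 */
            b += i ^ 5;                      /* chain 2, independent of chain 1 */
            __asm__ volatile("" : "+r"(a), "+r"(b));
        }
        return a + b;
    }

    int main(void)
    {
        clock_t t0 = clock();
        uint64_t r1 = one_chain();
        clock_t t1 = clock();
        uint64_t r2 = two_chains();
        clock_t t2 = clock();
        printf("one chain:  %.2fs (%llu)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC, (unsigned long long)r1);
        printf("two chains: %.2fs (%llu)\n",
               (double)(t2 - t1) / CLOCKS_PER_SEC, (unsigned long long)r2);
        return 0;
    }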
Branches are predicted via a relatively basic single-level scheme: a 512-entry BHT of standard 2-bit taken/not-taken counters, paired with a 512-entry BTB.
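The single-level part matters: with no branch history, a plain 2-bit counter cannot learn even a strictly alternating taken/not-taken pattern, which any history-based predictor would get nearly for free. The sketch below (my own, with arbitrary sizes) compares an always-taken branch against an alternating one; a history-less 2-bit counter should get the alternating case right at best about half the time. Beware that the compiler may if-convert the branch into a branchless sequence (isel on this family), so check the generated code.

    /* Sketch: branch patterns a history-less 2-bit counter handles well
     * (always taken) and poorly (strictly alternating). Sizes are arbitrary. */
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define N (1 << 24)

    static uint64_t run(const unsigned char *pattern)
    {
        uint64_t sum = 0;
        for (int i = 0; i < N; i++) {
            if (pattern[i])      /* the branch under test */
                sum += i;
            else
                sum -= i * 3;
        }
        return sum;
    }

    int main(void)
    {
        static unsigned char always[N], alternating[N];
        for (int i = 0; i < N; i++) {
            always[i]      = 1;        /* always taken */
            alternating[i] = i & 1;    /* taken, not taken, taken, ... */
        }
        clock_t t0 = clock();
        uint64_t r1 = run(always);
        clock_t t1 = clock();
        uint64_t r2 = run(alternating);
        clock_t t2 = clock();
        printf("always taken: %.2fs (%llu)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC, (unsigned long long)r1);
        printf("alternating:  %.2fs (%llu)\n",
               (double)(t2 - t1) / CLOCKS_PER_SEC, (unsigned long long)r2);
        return 0;
    }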
Latency on the e6500, unlike some roughly contemporary PPC designs, is mostly good. Most simple-int instructions manage single-cycle latency and throughput, with the primary exceptions being popcnt instructions and instructions using the overflow register; both have 2-cycle latency and throughput. Integer multiply performance is also decent, with most forms being able to execute a multiply every cycle with four-cycle latency. A particularly bright spot of e6500's latency story is SIMD; vector simple-int instructions consistently manage single-cycle throughput and latency, and permutes can issue every cycle with two-cycle latency. Vector int-multiply and reduction ops all have a nominal latency of four cycles and single-cycle throughput.
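Those multiply numbers are easy to sanity-check with a latency-versus-throughput probe. The sketch below (mine, assuming a 64-bit build; the constant and trip count are arbitrary) runs one dependent multiply chain against four independent ones. If the quoted figures hold, the dependent loop costs roughly four cycles per multiply, while the independent loop does four times as many multiplies at about one per cycle - so the two should finish in roughly the same time.

    /* Sketch: integer multiply latency (one dependent chain) vs. throughput
     * (four independent chains). Constant and trip count are arbitrary. */
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define N 100000000ULL
    #define C 0x9E3779B97F4A7C15ULL

    static uint64_t dependent(void)
    {
        uint64_t x = 3;
        for (uint64_t i = 0; i < N; i++)
            x *= C;                      /* each multiply waits on the last */
        return x;
    }

    static uint64_t independent(void)
    {
        uint64_t a = 1, b = 2, c = 3, d = 4;
        for (uint64_t i = 0; i < N; i++) {
            a *= C;                      /* four chains in flight at once */
            b *= C;
            c *= C;
            d *= C;
        }
        return a ^ b ^ c ^ d;
    }

    int main(void)
    {
        clock_t t0 = clock();
        uint64_t r1 = dependent();
        clock_t t1 = clock();
        uint64_t r2 = independent();
        clock_t t2 = clock();
        printf("1 chain  (N muls):  %.2fs (%016llx)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC, (unsigned long long)r1);
        printf("4 chains (4N muls): %.2fs (%016llx)\n",
               (double)(t2 - t1) / CLOCKS_PER_SEC, (unsigned long long)r2);
        return 0;
    }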
Floating point latency is worse across the board, both for scalar and vector instructions. Most scalar FP instructions can issue every cycle but have a steep 7-cycle latency. All vector FP instructions can repeat every cycle with 6-cycle latency, with the exception of vrefp (vector FP reciprocal estimate), which blocks for two cycles and produces a result after seven.
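That 7-cycle scalar latency is the kind of number that shows up immediately in naive reductions: a single-accumulator sum is latency-bound at one add per seven cycles, and splitting the work across several independent accumulators is the standard way to approach the one-per-cycle issue rate. A sketch of that, where the array length and the unroll-by-eight are arbitrary choices of mine:

    /* Sketch: hiding scalar FP add latency with multiple accumulators.
     * sum_one is bound by the dependent-add latency; sum_eight keeps eight
     * independent adds in flight. Data and sizes are arbitrary. */
    #include <stdio.h>

    #define LEN 4096

    static double sum_one(const double *v)
    {
        double s = 0.0;
        for (int i = 0; i < LEN; i++)
            s += v[i];                   /* every add waits on the previous one */
        return s;
    }

    static double sum_eight(const double *v)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
        for (int i = 0; i < LEN; i += 8) {   /* eight adds in flight per pass */
            s0 += v[i + 0]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
            s4 += v[i + 4]; s5 += v[i + 5]; s6 += v[i + 6]; s7 += v[i + 7];
        }
        return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
    }

    int main(void)
    {
        static double v[LEN];
        for (int i = 0; i < LEN; i++)
            v[i] = i * 0.001;
        printf("%f %f\n", sum_one(v), sum_eight(v));
        return 0;
    }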
e6500 is different. The thinking behind it is sound - combine two pipelines derived from the previous-generation 2-wide e5500 and e500 PowerPC families into a shared block, now with VMX support and without replicating some of the most expensive structures. Most of the questions I have about the end result apply equally well to e5500 - why, for instance, bother with OoO at all, with such tiny reorder resources? Would a 3-wide in-order design not have produced better results in comparable area?
Normal integer spaghetti performance on the e6500 is underwhelming - it is not far off from a Cortex-A53 on SPEC, despite the A53 being in-order. The powerful vector unit and high core counts, though, make the e6500 stand out. In 2011 it was a very respectable network processor core, comparing favorably against almost any competitor except the fast Netlogic XLP MIPS64 family.
On a comparative note, if one squints enough, e6500 almost looks like a combination of Bulldozer and Zen 5 - the thread-private frontends and fully shared L1 caches of Zen 5 combined with Bulldozer's mix of thread-private and shared functional units.
[1] I believe each thread can do an independent 256b fetch from L1I every cycle, but the manual is vague on this point. It's possible that each thread can only fetch every other cycle.