Article 1773 of dec.notes.technology.dechips:
Title: Best Merced info to date. Smarter than 21264-21364?

Comments/Björn

Parallelism is part of P7 picture

By Alexander Wolfe

April 15, 1996, Santa Clara, Calif. -- Intel's upcoming Merced -- the 64-bit 
microprocessor that's being jointly developed with Hewlett-Packard behind tightly closed 
doors -- won't be a very-long-instruction-word (VLIW) design, but will be a heavily modified 
superscalar architecture that exploits instruction-level parallelism and pushes the 
envelope of instruction pre-decoding technology, according to experts close to the 
effort.

Though Merced (P7) isn't expected to see the light of day until 1998 -- when its initial use 
will be in 64-bit Unix servers aimed at high-powered Internet and enterprise 
applications -- a fuller picture of the chip is forming. According to the experts, four 
clear data points have emerged:

•Merced won't be a classic VLIW architecture, but it will decode instructions into a 
series of primitive "micro-operations" and schedule multiple micro-op streams for 
simultaneous execution. Merced will take this technique -- already used in the Pentium 
Pro -- to new levels of complexity. The overall design goal is to take advantage of 
instruction-level parallelism as much as possible.

•The processor will make extensive use of instruction pre-decoding and tagging, to 
attempt to create a correspondence between micro-op streams and the chip's functional 
units. The implementation, while not VLIW, will be something of a spiritual cousin of 
that technology.

•Merced will incorporate on-chip decoders for object-code compatibility. One decoder will 
convert X86 instructions into Merced micro-operations. A second may be added to convert 
HP-PA instructions; however, it's more likely that the PA code will be handled by 
software translation.

•New compiler and operating-system technology will be developed specifically for Merced, 
with the goal of extracting maximum performance from its 64-bit architecture.

Merced -- formerly known as P7 -- has been the subject of intense industry speculation since 
Intel and HP linked up two years ago to work on the next-generation processor. At the 
time, they pledged to apply leading engineers from both companies to the task of 
"leapfrogging existing computing paradigms." Such statements sparked a focus on 
VLIW -- particularly since HP employs several leading VLIW proponents.

However, earlier this year, Dave House, senior vice president and general manager of 
Intel's enterprise server group, dismissed talk that Merced would be a VLIW device. 
Instead, he said at the time, it would be "a new kind of architecture. . . . The stuff 
we've been doing is beyond RISC and beyond VLIW. We took the VLIW work at HP and the 
Intel RISC/CISC work and came up with something new."

Intel officials declined to comment for this story, saying they had nothing to add to 
their previous statements. HP officials also declined to comment.

The impetus for the numerous new twists and turns in Merced is the CPU bandwidth problem 
that's rearing its head in the X86 architecture. An advanced superscalar processor, the 
Pentium Pro is built around a dynamically scheduled microengine, which can juggle 
numerous instructions in various stages of execution. The technique, which blends 
out-of-order and speculative execution, is intended to supply the chip's multiple 
functional units with steady streams of micro-ops. (The micro-ops themselves are created 
by three parallel instruction decoders.)

However, as pipelines get longer and instruction caches deeper, processors become more 
subject to misses and stalls. Consequently, it becomes more critical to keep code density 
high, to enable the chip to proceed past stalls with all due haste and run as close to 
full-bore as possible.

To that end, Merced is expected to build upon the Pentium Pro's decoupled 
microarchitecture, in which an eight-stage pipeline can fetch and decode up to three X86 
instructions in parallel. (Under ideal circumstances, the chip can generate up to four 
micro-operations from each macro instruction, or a maximum of six micro-ops per clock 
cycle; however, in most cases, it generates only two micro-ops at a time.)
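
The decode arithmetic above can be sketched in Python. This is a rough model only, and it assumes (the article does not spell this out) that one of the three decoders accepts instructions of up to four micro-ops while the other two accept single-micro-op instructions, which is where the six-per-clock ceiling would come from. The names `decode_cycles` and `widths` are hypothetical, chosen for illustration.

```python
def decode_cycles(uop_counts, widths=(4, 1, 1)):
    """Return the number of micro-ops produced on each clock.

    uop_counts: micro-ops needed by each X86 instruction, in program order.
    widths:     assumed capacity of each decoder (illustrative, not from Intel docs).
    """
    per_clock = []
    i = 0
    while i < len(uop_counts):
        issued = 0
        for slot_width in widths:
            if i >= len(uop_counts):
                break
            need = uop_counts[i]
            if need > widths[0]:
                # Longer sequences would need a microcode path; not modeled here.
                raise ValueError("instruction too complex for this sketch")
            if need > slot_width:
                break  # decode is in order: wait for the wide decoder next clock
            issued += need
            i += 1
        per_clock.append(issued)
    return per_clock


print(decode_cycles([4, 1, 1]))  # ideal ordering: one clock, six micro-ops
print(decode_cycles([1, 2, 1]))  # a two-micro-op instruction in a narrow slot stalls
```

Under these assumptions the ordering of instructions matters as much as their mix, which is one reason a compiler that knows the decoder layout can help.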

Specifically, Merced is expected to go well beyond Pentium Pro in its application of 
instruction pre-decoding, performing more of it and adding tag-bits to decoded micro-ops 
to indicate which functional unit an instruction is destined for. The broader concept is 
to maximize use of instruction-level parallelism -- that is, to analyze and schedule 
instructions for simultaneous execution as often as possible.
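
As a toy illustration of the tag-bit idea, the sketch below attaches a functional-unit tag to each micro-op at pre-decode time, so that the later dispatch step routes on the tag alone. Everything here is assumed for illustration: the opcode names, the unit names and the `UNIT_FOR_OPCODE` table are hypothetical and say nothing about Merced's real encoding.

```python
from collections import defaultdict

# Hypothetical opcode-to-unit table; a real chip would encode this in a few tag bits.
UNIT_FOR_OPCODE = {
    "add": "ALU",
    "sub": "ALU",
    "load": "MEM",
    "store": "MEM",
    "fmul": "FPU",
}


def predecode(opcodes):
    """Attach a functional-unit tag to every micro-op, once, ahead of dispatch."""
    return [(op, UNIT_FOR_OPCODE[op]) for op in opcodes]


def dispatch(tagged_ops):
    """Route micro-ops to per-unit queues using only the tag, not the opcode."""
    queues = defaultdict(list)
    for op, unit in tagged_ops:
        queues[unit].append(op)
    return dict(queues)


tagged = predecode(["add", "load", "fmul", "sub"])
print(dispatch(tagged))
```

The point of the split is that the opcode lookup happens once, early, leaving the time-critical dispatch stage with a trivial routing decision.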

"Merced will employ many of the ideas of the Pentium Pro, especially the front-end 
decoder, which expands the X86 instruction set into much wider internal 
micro-operations," said one source. "With a variable-length instruction format, it's 
possible to directly specify many operations in a single, very long instruction, while 
letting most of the code use much shorter opcode formats."

The overall objective is twofold. The first goal is to decode as many instructions as 
possible in parallel, so that multiple micro-ops can be generated. The second is to apply 
bits within a micro-op to mark instruction boundaries (something that's done in the Pentium 
Pro). "It should be easy to define an instruction format where the decode hardware [or a 
compiler] generates single or compound instructions, but uses an extra bit to signal the 
border between instruction groups that can safely be executed in parallel," said the 
source. "A CPU with many units would grab a larger group of instructions at once, while a 
low-power version [of Merced] could settle for a few instructions."
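
The boundary-bit scheme the source describes can be sketched as follows. This is a minimal model under stated assumptions, not a description of Merced: `mark_groups` stands in for the compiler or decode hardware and closes a group whenever an instruction reads or rewrites a register written earlier in the same group, and `issue_schedule` stands in for a CPU that issues up to `width` instructions per clock without crossing a marked boundary. Both function names and the register names are hypothetical.

```python
def mark_groups(instrs):
    """instrs: list of (name, dest_reg, source_regs). Returns [(name, stop_bit)].

    A stop bit of True marks the last instruction of a parallel-safe group.
    Only read-after-write and write-after-write conflicts are checked here.
    """
    out = []
    written = set()
    for name, dest, srcs in instrs:
        if out and (set(srcs) & written or dest in written):
            out[-1] = (out[-1][0], True)  # close the group before this instruction
            written = set()
        out.append((name, False))
        written.add(dest)
    if out:
        out[-1] = (out[-1][0], True)
    return out


def issue_schedule(stream, width):
    """Issue up to `width` instructions per clock, never past a stop bit."""
    cycles, current = [], []
    for name, stop in stream:
        current.append(name)
        if stop or len(current) == width:
            cycles.append(current)
            current = []
    if current:
        cycles.append(current)
    return cycles


program = [
    ("i1", "r1", ["r8"]),
    ("i2", "r2", ["r9"]),
    ("i3", "r3", ["r1", "r2"]),  # depends on i1 and i2, so it starts a new group
    ("i4", "r4", ["r8"]),
]
stream = mark_groups(program)
print(issue_schedule(stream, width=4))  # wide CPU: two clocks
print(issue_schedule(stream, width=1))  # narrow CPU: same binary, four clocks
```

The same marked stream runs on both the wide and the narrow machine, which is the property the source is pointing at: the binary records what may run in parallel, not how many units a particular chip has.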

In many ways, Intel's tack appears to be the addition of VLIW-like concepts to an 
aggressive base of superscalar technology. This is evident in the use of horizontal 
microcode and tag-bits and in exploiting instruction-level parallelism (ILP).

However, in eschewing straight VLIW, Intel is apparently aiming to avoid the "Achilles' 
heel" of pure VLIW: There is no forward migration path within a processor family. This 
stems from the fact that object-code is generated by compilers with explicit knowledge 
about the number and type of functional units within a processor. That is, the code is 
much more tightly linked to the hardware than is the case in the superscalar world. And 
it's difficult to get VLIW object code to run equally well on a variety of processors 
within a chip family -- processors that will have widely differing configurations and varying 
numbers of functional units.
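
The migration problem can be made concrete with a small sketch. It assumes a simplified picture of pure VLIW in which each long instruction word is a fixed bundle of operations that must all start in the same clock; the unit names and the `bundle_fits` helper are hypothetical.

```python
from collections import Counter


def bundle_fits(bundle, machine_units):
    """True if the machine has enough units of each kind to run the bundle in one clock."""
    need = Counter(bundle)
    return all(machine_units.get(unit, 0) >= n for unit, n in need.items())


# A bundle scheduled by a compiler for a machine with two ALUs and one memory unit.
bundle = ["ALU", "ALU", "MEM"]

print(bundle_fits(bundle, {"ALU": 2, "MEM": 1}))  # the target machine: fits
print(bundle_fits(bundle, {"ALU": 1, "MEM": 1}))  # a cut-down family member: does not fit
print(bundle_fits(bundle, {"ALU": 4, "MEM": 2}))  # a wider successor: fits, but units sit idle
```

In this model the narrower chip cannot run the bundle as compiled, and the wider chip runs it with half its units unused, so neither direction of the product line is served by the old binary.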

Industry sources concur that the chip won't be VLIW. "It won't be a classic VLIW," said 
one, "because that exposes the horizontal microcode to the end binaries and expects that 
the compilers have extensively rearranged the code to make it work."

(To date, VLIW seems to be making its mark in special-purpose media processors -- such as 
those from Chromatic Research Inc., Philips Semiconductors and MicroUnity -- where forward 
migration isn't an issue and where the multimedia applications themselves boast the kind 
of inherent parallelism that makes VLIW shine.)

Because so many of the concepts being implemented in Merced mirror those already in use 
to some extent in Pentium Pro, a question arises: Is anything all that new here? The 
answer is that the difference between the two architectures will be one of degree. According to 
experts, while Pentium Pro is an impressive device, it is a successor to Pentium and 
builds upon the existing bricks-and-mortar of the X86 architecture.

Merced, in contrast, is being architected as a completely new chip from the ground up. 
Even those techniques it shares with Pentium Pro -- instruction prefetching and decoding, 
multiple micro-ops, register renaming and X86 object-code compatibility -- are being laid 
out from scratch, with an eye toward optimum performance.

In addition, though the VLIW skills that HP brought to the effort aren't being applied in 
toto, Intel is using many ILP concepts that were developed within the VLIW research 
community.

But it is on the software front that many of Merced's most radical changes will appear. 
Intel is expected to release new, advanced compiler technology, which can generate code 
optimized for the 64-bit Merced architecture.

Intel is expected to widely publicize the technology and emphasize the importance of 
running 64-bit code -- rather than existing 32-bit code -- in extracting full performance from 
Merced. That tactic will be aimed at avoiding the confusion surrounding the release of 
the Pentium Pro, which occurred because many end users didn't realize that the processor 
was designed for fast execution of 32-bit applications but didn't provide much of a speed 
boost for older, 16-bit code.

Specifically, the Merced-aware compiler is expected to use "interprocedural compilation" 
to identify regions of code that lend themselves to parallel execution. The idea is to 
look at a large group of instructions, spanning multiple functions, and package them 
together so that they can be translated by the processor into the maximum number of 
simultaneous micro-ops, with a minimum of pipeline misses.

Writing such a compiler is no easy task. For example, the superscalar, out-of-order 
nature of the Pentium Pro makes it difficult to accurately predict the execution time of 
instructions.

Intel's software technology is likely to be made available to the industry in the form of 
a reference compiler, which will be released to developers and Merced beta-testers.

The second leg of the Merced software strategy will be a new 64-bit Unix operating 
system. Called Summit 3D, the OS is being written by HP and SCO (Santa Cruz, Calif.)