Article 1773 of dec.notes.technology.dechips:
Title: Best Merced info to date. Smarter than 21264-21364? Comments/Björn

Parallelism is part of P7 picture
By Alexander Wolfe

April 15, 1996, Santa Clara, Calif. -- Intel's upcoming Merced -- the 64-bit microprocessor that's being jointly developed with Hewlett-Packard behind tightly closed doors -- won't be a very-long-instruction-word (VLIW) design, but will be a heavily modified superscalar architecture that exploits instruction-level parallelism and pushes the envelope of instruction pre-decoding technology, according to experts close to the effort.

Though Merced (P7) isn't expected to see the light of day until 1998 -- when its initial use will be in 64-bit Unix servers aimed at high-powered Internet and enterprise applications -- a fuller picture of the chip is forming. According to the experts, four clear data points have emerged:

• Merced won't be a classic VLIW architecture, but it will decode instructions into a series of primitive "micro-operations" and schedule multiple micro-op streams for simultaneous execution. Merced will take this technique -- already used in the Pentium Pro -- to new levels of complexity. The overall design goal is to take advantage of instruction-level parallelism as much as possible.
• The processor will make extensive use of instruction pre-decoding and tagging, to attempt to create a correspondence between micro-op streams and the chip's functional units. The implementation, while not VLIW, will be something of a spiritual cousin of that technology.
• Merced will incorporate on-chip decoders for object-code compatibility. One decoder will convert X86 instructions into Merced micro-operations. A second may be added to convert HP-PA instructions; however, it's more likely that the PA code will be handled by software translation.
• New compiler and operating-system technology will be developed specifically for Merced, with the goal of extracting maximum performance from its 64-bit architecture.
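
As a rough illustration of the pre-decode-and-tag idea in the list above, the following Python sketch tags each micro-op with a functional-unit class and then issues micro-ops in order, starting a new cycle whenever the required unit is already busy. The unit names, the opcode-to-unit table and the one-micro-op-per-unit-per-cycle rule are all assumptions made up for illustration; the sketch ignores data dependences and says nothing about how Merced itself will work.

```python
from collections import namedtuple

# Hypothetical opcode-to-unit table; not taken from any real chip.
UNIT_FOR_OPCODE = {
    "add": "alu", "sub": "alu",
    "load": "mem", "store": "mem",
    "fmul": "fpu",
    "jmp": "branch",
}

TaggedOp = namedtuple("TaggedOp", ["opcode", "unit"])


def predecode(opcodes):
    """Attach a functional-unit tag (the 'tag bits') to each micro-op."""
    return [TaggedOp(op, UNIT_FOR_OPCODE[op]) for op in opcodes]


def issue_cycles(tagged_ops):
    """Greedy in-order issue: close the current cycle as soon as the
    next micro-op needs a unit that is already busy in that cycle."""
    cycles, current, busy = [], [], set()
    for op in tagged_ops:
        if op.unit in busy:
            cycles.append(current)
            current, busy = [], set()
        current.append(op)
        busy.add(op.unit)
    if current:
        cycles.append(current)
    return cycles


stream = predecode(["load", "add", "fmul", "add", "store", "jmp"])
schedule = issue_cycles(stream)
for n, cycle in enumerate(schedule):
    print(n, [f"{op.opcode}->{op.unit}" for op in cycle])
```

Because the tag is computed once at pre-decode time, the issue loop only has to compare tags against the set of busy units, which is the kind of correspondence between micro-op streams and functional units that the experts describe.
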

Merced -- formerly known as P7 -- has been the subject of intense industry speculation since Intel and HP linked up two years ago to work on the next-generation processor. At the time, they pledged to apply leading engineers from both companies to the task of "leapfrogging existing computing paradigms."

Such statements sparked a focus on VLIW -- particularly since HP employs several leading VLIW proponents. However, earlier this year, Dave House, senior vice president and general manager of Intel's enterprise server group, dismissed talk that Merced would be a VLIW device. Instead, he said at the time, it would be "a new kind of architecture. . . . The stuff we've been doing is beyond RISC and beyond VLIW. We took the VLIW work at HP and the Intel RISC/CISC work and came up with something new."

Intel officials declined to comment for this story, saying they had nothing to add to their previous statements. HP officials also declined to comment.

The impetus for the numerous new twists and turns in Merced is the CPU bandwidth problem that's rearing its head in the X86 architecture. An advanced superscalar processor, the Pentium Pro is built around a dynamically scheduled microengine, which can juggle numerous instructions in various stages of execution. The technique, which blends out-of-order and speculative execution, is intended to supply the chip's multiple functional units with steady streams of micro-ops. (The micro-ops themselves are created by three parallel instruction decoders.)

However, as pipelines get longer and instruction caches deeper, processors become more subject to misses and stalls. Consequently, it becomes more critical to keep code density high, to enable the chip to proceed past stalls with all due haste and run as close to full-bore as possible.

To that end, Merced is expected to build upon the Pentium Pro's decoupled microarchitecture, in which an eight-stage pipeline can fetch and decode up to three X86 instructions in parallel.
(Under ideal circumstances, the chip can generate up to four micro-operations from each macro instruction, or a maximum of six micro-ops per clock cycle; however, in most cases, it generates only two micro-ops at a time.)

Specifically, Merced is expected to go well beyond Pentium Pro in its application of instruction pre-decoding, performing more of it and adding tag-bits to decoded micro-ops to indicate which functional unit an instruction is destined for. The broader concept is to maximize use of instruction-level parallelism -- that is, to analyze and schedule instructions for simultaneous execution as often as possible.

"Merced will employ many of the ideas of the Pentium Pro, especially the front-end decoder, which expands the X86 instruction set into much wider internal micro-operations," said one source. "With a variable-length instruction format, it's possible to directly specify many operations in a single, very long instruction, while letting most of the code use much shorter opcode formats."

The overall objective is twofold. The first goal is to decode as many instructions as possible in parallel, so that multiple micro-ops can be generated. The second is to apply bits within a micro-op to mark instruction boundaries (something that's done in the Pentium Pro).

"It should be easy to define an instruction format where the decode hardware [or a compiler] generates single or compound instructions, but uses an extra bit to signal the border between instruction groups that can safely be executed in parallel," said the source. "A CPU with many units would grab a larger group of instructions at once, while a low-power version [of Merced] could settle for a few instructions."

In many ways, Intel's tack appears to be the addition of VLIW-like concepts to an aggressive base of superscalar technology.
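
The boundary-bit scheme the source describes above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `(opcode, stop_bit)` encoding, the function names and the issue widths are hypothetical, and nothing here reflects an actual Merced instruction format.

```python
def split_groups(instructions):
    """Split a stream of (opcode, stop_bit) pairs into groups.

    A set stop bit marks the last instruction of a group whose members
    are assumed safe to execute in parallel.
    """
    groups, current = [], []
    for opcode, stop in instructions:
        current.append(opcode)
        if stop:
            groups.append(current)
            current = []
    if current:  # trailing instructions without a final stop bit
        groups.append(current)
    return groups


def cycles_needed(groups, width):
    """Cycles for a machine that issues at most `width` instructions per
    cycle and never issues across a group boundary."""
    return sum(-(-len(group) // width) for group in groups)  # ceiling division


# Hypothetical program: the stop bit is set on "mul" and on "jmp".
program = [("load", 0), ("load", 0), ("add", 0), ("mul", 1),
           ("store", 0), ("jmp", 1)]
groups = split_groups(program)
wide = cycles_needed(groups, 4)    # a CPU with many units
narrow = cycles_needed(groups, 2)  # a low-power version
print(groups, wide, narrow)
```

The same encoded stream runs on both the wide and the narrow model; only the number of cycles differs, which is the property the source points to when contrasting a CPU with many units against a low-power version.
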
That influence is evident in the use of horizontal microcode and tag-bits and in the exploitation of instruction-level parallelism (ILP). However, in eschewing straight VLIW, Intel is apparently aiming to avoid the "Achilles' heel" of pure VLIW: There is no forward migration path within a processor family. This stems from the fact that object code is generated by compilers with explicit knowledge about the number and type of functional units within a processor. That is, the code is much more tightly linked to the hardware than is the case in the superscalar world. And it's difficult to get VLIW object code to run equally well on a variety of processors within a chip family -- processors that will have widely differing configurations and varying numbers of functional units.

Industry sources concur that the chip won't be VLIW. "It won't be a classic VLIW," said one, "because that exposes the horizontal microcode to the end binaries and expects that the compilers have extensively rearranged the code to make it work." (To date, VLIW seems to be making its mark in special-purpose media processors -- such as those from Chromatic Research Inc., Philips Semiconductors and MicroUnity -- where forward migration isn't an issue and where the multimedia applications themselves boast the kind of inherent parallelism that makes VLIW shine.)

Because so many of the concepts being implemented in Merced mirror those already in use to some extent in Pentium Pro, a question arises: Is anything all that new here? The answer is that the differences between the two architectures will be ones of degree.

According to experts, while Pentium Pro is an impressive device, it is a successor to Pentium and builds upon the existing bricks-and-mortar of the X86 architecture. Merced, in contrast, is being architected as a completely new chip from the ground up.
Even those techniques it shares with Pentium Pro -- instruction prefetching and decoding, multiple micro-ops, register renaming and X86 object-code compatibility -- are being laid out from scratch, with an eye toward optimum performance. In addition, though the VLIW skills that HP brought to the effort aren't being applied in toto, Intel is using many ILP concepts that were developed within the VLIW research community.

But it is on the software front that many of Merced's most radical changes will appear. Intel is expected to release new, advanced compiler technology, which can generate code optimized for the 64-bit Merced architecture. Intel is expected to widely publicize the technology and emphasize the importance of running 64-bit code -- rather than existing 32-bit code -- in extracting full performance from Merced. That tactic will be aimed at avoiding the confusion surrounding the release of the Pentium Pro, which occurred because many end users didn't realize that the processor was designed for fast execution of 32-bit applications but didn't provide much of a speed boost for older, 16-bit code.

Specifically, the Merced-aware compiler is expected to use "interprocedural compilation" to identify regions of code that lend themselves to parallel execution. The idea is to look at a large group of instructions, spanning multiple functions, and package them together so that they can be translated by the processor into the maximum number of simultaneous micro-ops, with a minimum of pipeline misses.

Writing such a compiler is no easy task. For example, the superscalar, out-of-order nature of the Pentium Pro makes it difficult to accurately predict the execution time of instructions. Intel's software technology is likely to be made available to the industry in the form of a reference compiler, which will be released to developers and Merced beta-testers.

The second leg of the Merced software strategy will be a new 64-bit Unix operating system.
Called Summit 3D, the OS is being written by HP and SCO (Santa Cruz, Calif.)