From: Paul DeMone <PaulDeMone@EasyInternet.net>
Newsgroups: comp.arch
Subject: Re: Better information about EPIC?
Date: Fri, 24 Oct 1997 02:01:10 -0400
Organization: UUNET Canada News Transport
Lines: 168
Message-ID: <345039A6.6830@EasyInternet.net>
References: <6223nv$d5$1@news.ox.ac.uk> <622f60$7vb@senator-bedfellow.MIT.EDU> <snewman-1510971509380001@dnai-207-181-209-42.dialup.dnai.com> <zalmanEI4Cs8.8x7@netcom.com> <3450213C.7728@best.com>
NNTP-Posting-Host: 207.176.244.89
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 2.02 (Win95; I)

Brian Case wrote:
> 
> Zalman Stern wrote:
> 
> > The architecture is pseudo-LIW with three operations per 128-bit
> > instruction word. Intel terminology calls the 128-bit entity a "bundle"
> > and the three things within it "instructions." That works out as 42 and 2/3
> > bits per instruction on average and Jerry Huck allows a 4/3 increase in
> > code size over "RISC." There can be dependencies between the instructions
> > within a bundle so it's only LIW in terms of instruction alignment.
> 
> I was a little surprised when they said that there's a 4/3 increase
> in code size. This means that they get nothing from a richer instruction
> set. I mean, I would expect that this ISA has neat instructions that
> can sometimes subsume two traditional RISC instructions, plus predication
> should occasionally save a branch instruction, plus ...

   Consider:  1) Branch targets must align to a 128-bit bundle, so occasionally
                 a NOP might be needed to pad a bundle.

              2) Speculative loads have two instruction words - the load, and
                 the check.  This is twice as big as a regular, unified load.

              3) Effective compilation for IA-64 might require much more aggressive
                 unrolling of loops and duplication of code to merge basic blocks
                 than has been seen for any existing compiler.

              4) Deliberate obfuscation by Intel.
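
  As a rough check on the 4/3 figure, here is a small Python sketch. It
  assumes a 32-bit fixed-width RISC baseline and the same instruction count
  on both sides; the nop_fraction parameter is a made-up knob for point 1,
  not a measured number.

```python
from fractions import Fraction

BUNDLE_BITS = 128       # bundle size as described at Microprocessor Forum
SLOTS_PER_BUNDLE = 3    # instructions per bundle
RISC_INSN_BITS = 32     # assumed fixed-width RISC baseline


def size_ratio(nop_fraction=Fraction(0)):
    """Bits per useful IA-64 instruction divided by bits per RISC instruction.

    nop_fraction is the hypothetical share of bundle slots holding padding
    NOPs; it is an illustration parameter, not a known figure.
    """
    bits_per_slot = Fraction(BUNDLE_BITS, SLOTS_PER_BUNDLE)
    bits_per_useful_insn = bits_per_slot / (1 - nop_fraction)
    return bits_per_useful_insn / RISC_INSN_BITS


print(size_ratio())                 # 4/3
print(size_ratio(Fraction(1, 10)))  # 40/27
```

  So 4/3 is just what the bundle format costs at equal instruction count.
  Points 1 to 3 above would push the ratio up, and a richer instruction set
  would pull it back down.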

> 
> > "Most instructions" are predicated. Meaning there is likely a 6 or 7 bit
> > field in each instruction which controls whether it gets executed or not.
> > One could imagine combining the template bits with the predication bits so
> > that if one instruction "depends" on another, they share a predicate. (This
> > would save bits but reduces the orthogonality and power of the mechanism.)
> > Seems like it's just a simple predicate specification though.
> 
> If you look at the examples of the IA-64 code, you'll find that the
> compare instructions always set a pair of predicate registers (P0,P1)
> with true and complemented results. Thus, at least the compare
> instructions might need only five bits to specify the destination predicate.
> 
> Also, I looked at one example and noticed a "bug". When I pointed this
> out to Crawford, he, rather smugly and condescendingly I must say,
> pointed out that compare instructions might not work the way I was
> assuming. Seems that a compare instruction with a false predicate
> will still generate a result and write it to the predicate bits.
> (The "bug" was a path that would have left predicate bits uninitialized
> but used. They weren't uninitialized because the predicated compare
> always generates results. Or maybe, the version that was in the
> example always does. I haven't thought through all the ramifications
> of this. Perhaps there aren't any.)

  The IA-64 design seems to rely heavily on research using the PlayDoh
  architecture (from HP labs, not Homer Simpson ;-)

  PlayDoh defined two types of predicates - unconditional, and OR-type.

  The unconditional predicate is used for single condition clauses while
  OR-type predicates are used for multiple condition clauses to allow
  several comparison instructions targeting the same predicate register
  to be executed in the same cycle.  OR-type predicates must be preset
  to zero.
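
  A minimal Python sketch of how I read the two predicate types. The
  function names and the dictionary standing in for the predicate file are
  invented for illustration. The rule that an unconditional compare writes 0
  when its guarding predicate is false is my reading of the PlayDoh
  description (and it would fit the behaviour Brian describes above); it is
  an assumption as far as IA-64 goes.

```python
from itertools import permutations


def cmp_unconditional(pred, dest, guard, cond):
    """Unconditional type: always writes dest; a false guard writes 0."""
    pred[dest] = 1 if (guard and cond) else 0


def cmp_or(pred, dest, guard, cond):
    """OR type: sets dest to 1 when guard and cond both hold, else no write."""
    if guard and cond:
        pred[dest] = 1


def any_true(conds):
    """Evaluate 'c0 or c1 or ...' with OR-type compares into predicate p1."""
    pred = {1: 0}                  # OR-type destination must be preset to zero
    for cond in conds:
        cmp_or(pred, 1, True, cond)
    return pred[1]


# OR-type writes only ever store a 1, so the order of the compares does not
# matter; that is what lets them all issue in the same cycle.
conds = (False, True, False)
assert all(any_true(p) == 1 for p in permutations(conds))

# An unconditional compare under a false guard still defines its target.
pred = {2: 1}
cmp_unconditional(pred, 2, guard=False, cond=True)
print(pred[2])
```

  The last three lines show why a path that skips the "real" compare need
  not leave a predicate undefined under this reading: the squashed compare
  still writes its destination.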

  See "Compiler Technology for Future Microprocessors" by W.W. Hwu et al.
  in the Dec 1995 issue of Proceedings of the IEEE (vol. 83, no. 12).
  (One of the authors of this paper, S.A. Mahlke, was referred to by
  John Crawford on his slide showing the fraction of branches that
  could be eliminated by predication)

  This detailed article probably tells you ten times more about IA-64
  and EPIC than Intel's dog and pony show at Microprocessor Forum.
  The comparative results in this paper show that IA-64 could really have
  a big advantage over classic RISC ISAs if Intel doesn't drop the ball
  or kill the clock rate with x86 compatibility hardware.

> 
> Crawford's attitude when I asked him about the "bug" and when I asked,
> publicly from the audience, about the apparent total dependence on
> static analysis and code sequencing gave me a very deja-vu feeling
> about IA-64. "You ignorant peasant, how dare you insinuate that we
> have anything less than all the answers and a perfect architecture."
> 
> John, if you're listening, sorry to be so blunt, especially if this
> wasn't your intended tone. I just call it as I see it. Your attitude
> reminded me of *my own* back in the 80s when someone would dare
> criticize RISC or the 29K.... I would have expected your response to
> be "Excellent question! You bring up one of the biggest challenges
> we faced. I can't talk much about how we address it, but we have
> what I think to be a complete and elegant solution." Instead, I got
> "Uh, we have ways of dealing with that." OK, maybe the factual
> information content is nearly the same in those two responses, but
> the one I got sure didn't inspire confidence.

  Go easy on John, it's just the daily arrogance pill Intel types have
  to take talking.  Contrast Crawford's and Pollack's self-important
  oleaginousness with HP's Joel Birnbaum; now that's a *class* act.

  The big joke is that the innovation in IA-64 is likely 90+ percent derived
  from HP's efforts; Intel's role is that of the deep-pocketed wafer-heads.
  HP's PA-RISC has always been competitive despite usually being two process
  generations behind everyone else.  Now imagine HP's architectural design
  know-how coupled with Intel's process and manufacturing muscle - very
  scary!!  I wonder if HP's Faustian bargain will come back to haunt them.

> 
> > Load instructions can be broken into two pieces. The first of which does
> > the load and the second of which makes sure the load succeeded. This allows
> > hoisting speculative loads well before where the loaded value is used, even
> > if the load might not happen on some code paths. (I'm not sure why the
> > "check load" instruction approach was taken. For example Tera uses poison
> > bits instead.)
> 
> Because the most useful implementations of poison bits couldn't be
> patented? Yes, I've become cynical.
> 
> Another deja-vu: I proposed the split generate-request/consume-result
> model of loads for the 29K back in 1984, but I'm sure it wasn't my
> invention. It sure made for a more elegant pipelining of instructions.
> 
> I remember thinking "hmmm, strange choice", but then Crawford said
> something to which I thought "Oh, that's why." I'll have to try to
> remember.
> 
> > In the wild speculation department, let's say IA-64 computational
> > instructions are three address and use a 7-bit predicate field - 1 bit for
> > predication on/off and 6-bits for which predicate register (another choice
> > would be 6-bit predicate field with 0 meaning don't check predicate). That
> > gives 28 bits, plus an opcode and perhaps something else (e.g. a four
> > address instruction). An 8-bit template field leaves room for three 40-bit
> > instructions. 12-bits should be ok for an opcode and literal fields, etc.
> 
> Crawford explicitly said "forty-something bits long" re: instruction
> length. As I remember the context, he meant "longer than 40 bits". This
> means 41 bits, because the template bits must be at least three: two
> for intra-bundle boundaries, one for "this bundle is connected to the
> following bundle." 41 * 3 + 3 < 128, 42 * 3 + 3 > 128. Instruction length
> must be 41 bits. Did I miss something?

  I'd say 3*42 bits plus two template bits (3*42 + 2 = 128).

  Call the three instructions in the bundle A, B, and C.

  A possible encoding of the template bits might be:

  00  A is last instruction of current parallel group
  01  B is last instruction of current parallel group
  10  C is last instruction of current parallel group
  11  current parallel group contains A, B, C and continues into next bundle.

  Two bits can't encode every possible issue relationship, but some
  of the missing cases could be avoided by padding bundles with
  NOPs.
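
  To make the guess concrete, here is a Python sketch of the arithmetic and
  of how a decoder might split bundles into parallel groups under the 2-bit
  encoding above. The encoding is only my conjecture, and the names
  (max_insn_bits, TEMPLATE_MEANING, split_groups) are invented for
  illustration.

```python
BUNDLE_BITS = 128
SLOTS = 3


def max_insn_bits(template_bits):
    """Widest instruction that fits three to a bundle beside the template."""
    return (BUNDLE_BITS - template_bits) // SLOTS


# 2 template bits leave room for 42-bit instructions, 3 bits for 41,
# and the 8-bit template Zalman suggested for 40.
print(max_insn_bits(2), max_insn_bits(3), max_insn_bits(8))  # 42 41 40

# Conjectured template: (index just past the last slot of the current
# parallel group, whether the group runs on into the next bundle).
TEMPLATE_MEANING = {
    0b00: (1, False),  # group ends after A
    0b01: (2, False),  # group ends after B
    0b10: (3, False),  # group ends after C
    0b11: (3, True),   # A, B, C all belong to the group, which continues
}


def split_groups(bundles):
    """Turn a list of (template, [A, B, C]) bundles into parallel groups."""
    groups, current = [], []
    for template, slots in bundles:
        last, continues = TEMPLATE_MEANING[template]
        current.extend(slots[:last])
        if not continues:
            groups.append(current)
            # Slots after the boundary open the next group, which can only
            # be closed by a later bundle.
            current = list(slots[last:])
    if current:
        groups.append(current)  # trailing, unterminated group
    return groups


code = [
    (0b11, ["i0", "i1", "i2"]),
    (0b01, ["i3", "i4", "i5"]),
    (0b10, ["i6", "i7", "i8"]),
]
print(split_groups(code))
```

  The example yields two groups, i0 to i4 and i5 to i8. Note that a bundle
  can mark only one boundary, so a group that must end after A and another
  after B in the same bundle needs a NOP-padded extra bundle.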


All opinions strictly my own.  I have no confidential knowledge of IA-64;
all of this is conjecture.


--
Paul W. DeMone                 The 801 experiment SPARCed an ARMs race
Kanata, Ontario                to put more PRECISION and POWER into
demone@mosaid.com              architectures with MIPSed results but
PaulDeMone@EasyInternet.net    ALPHA's well that ends well.