Article 76186 of comp.arch:

Hugh LaMaster writes:
> Getting back to the "5X to 15X", I agree that the hierarchy of
> memory bandwidth can be scaled up at each level to meet the
> goals. But, I'm not sure about latency.

The latency is the big issue IMHO as well. But I think we still have some tricks up our sleeve to combat this. The crux of the thing is that as long as we are executing a single instruction stream, we are always going to be dependent on data latency as an absolute bottleneck. I-stream latency can almost be overcome with aggressive I-stream prefetching, but data latency is a lot less predictable. Therefore we have to overcome the waiting on memory.

Right now there are several ideas for doing this: "run-ahead" caches, SMT, Multiscalar, etc. All of these have inherent trade-offs. Run-ahead caches, for instance, are negative-feedback mechanisms that do speculative branches and prefetches while you are waiting on main memory. The catch is that they only get to run during miss stalls, so the more misses they eliminate, the less opportunity they have to run ahead: the better you get, the worse it works. SMT and Multiscalar are still relatively new and will require new hardware that IMHO hasn't been fully figured out yet. The other problem with them is the programming aspect of something that hasn't been done before.

More and more I am looking with interest at some of the VLIW/pseudo-instruction-set work that IBM is doing. The concept of separating the processor interface from the actual processor has some merit. Digital's FX!32 is interesting in that it doesn't necessarily have to work only on IA32 -> Alpha. It would be interesting to see what could be done, based on that technology, with an alpha(gen - 1) -> alpha(gen) translator/scheduler.

In essence, I like feedback compilers, but they are both slow and only work on a limited set of data. It would be nice if my apps continually optimized themselves for the hardware: find a place where a prefetch will work here, change a branch to a cmov there, etc. A rough sketch of those two rewrites follows below.
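To make that concrete, here is a minimal sketch in C of the two transformations. I'm using GCC's __builtin_prefetch as a stand-in for whatever hint such an optimizer would actually emit, and the function names and the PREFETCH_AHEAD distance are made up for illustration, not anything a real feedback pass is committed to:

    #include <stddef.h>

    #define PREFETCH_AHEAD 8  /* illustrative; tuned to memory latency vs. loop cost */

    /* Before: every iteration can stall on a cache miss for a[i]. */
    long sum_naive(const long *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    /* After: a feedback pass that observed misses here could insert a
       prefetch a few iterations ahead, so the loads overlap the adds. */
    long sum_prefetched(const long *a, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_AHEAD < n)
                __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0 /* read */, 1);
            sum += a[i];
        }
        return sum;
    }

    /* Before: a data-dependent, hard-to-predict branch. */
    long clamp_branchy(long x, long hi)
    {
        if (x > hi)
            x = hi;
        return x;
    }

    /* After: the same select written branch-free, which a compiler for
       a machine with conditional moves (Alpha's CMOVxx, x86's CMOVcc)
       can lower to a cmov instead of a branch. */
    long clamp_cmov(long x, long hi)
    {
        return (x > hi) ? hi : x;
    }

The point being that both rewrites are mechanical once you have the profile data telling you where the misses and mispredicts actually are.

Aaron Spink
not speaking for digital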