Article 76059 of comp.arch:
In article <33937A15.ABD@nas.nasa.gov> Hugh LaMaster writes:
>Ian Ringrose wrote:
>>
>> Preston Briggs wrote in article <5mkq7h$132$1@joe.rice.edu>...
>
>> > Tera provides all sorts of goodies of this genre
>> >
>> >     a pop count
>> >     count leading/trailing zeroes/ones
>> >     bit matrix transpose (treating a 64-bit word as an 8x8 matrix)
>> >     bit matrix multiply (a couple of varieties)
>> >     bitwise select
>> >     bit pack and unpack
>> >
>> > plus a generous allotment of bitwise boolean operations,
>> > and the ability to compute parity on the result.

The population count has a precise mathematical definition when the
argument is nonnegative.  It has proved useful in applications, and it
should be added to common languages such as C and Fortran.  If the
hardware provides a population count instruction, then the compiler can
generate inline code using that instruction; otherwise the compiler can
use one of the many known idioms.  The user should be able to say simply
POP_COUNT(unsigned expression).  After 10 years, when POP_COUNT has
become a standard language feature and appears in some benchmark
programs, we can estimate how important direct hardware support is.

Other operations which have precise mathematical definitions, but where
the best algorithm varies considerably with the machine, are:

    Number of trailing zero bits in a positive (nonzero) integer.
    Number of significant bits required to hold a number (equal to
        word size minus leading zero count; one more than the
        truncated base 2 logarithm if the argument is nonzero).
    Greatest common divisor of two integers (signed or unsigned).
    Modular arithmetic operations (single precision arguments).
    Truncated square root of a nonnegative integer.

These primitives arise in many programs I write (my specialty is number
theory).  For example, when I use the left-to-right binary method of
exponentiation, I need to know the location of the leftmost nonzero bit
in the exponent.
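The left-to-right binary method just mentioned can be sketched in C as
follows.  Here highest_bit() stands in for the leftmost-nonzero-bit
primitive (emulated with a portable shift loop, since standard C offers
nothing better), and pow_mod assumes the modulus is small enough that
products fit in 64 bits.  Both names are mine, not from the article.

```c
#include <stdint.h>

/* Position of the leftmost 1 bit of e (e must be nonzero).
   A stand-in for the missing leading-bit language primitive. */
static int highest_bit(uint64_t e)
{
    int pos = -1;
    while (e != 0) { e >>= 1; pos++; }
    return pos;
}

/* Left-to-right binary exponentiation: base^exp mod m.
   Assumes m is small enough that (m-1)^2 fits in a uint64_t. */
uint64_t pow_mod(uint64_t base, uint64_t exp, uint64_t m)
{
    if (exp == 0) return 1 % m;
    uint64_t result = base % m;
    /* Scan exponent bits from just below the leading 1 down to bit 0. */
    for (int i = highest_bit(exp) - 1; i >= 0; i--) {
        result = (result * result) % m;     /* square for every bit   */
        if ((exp >> i) & 1)
            result = (result * base) % m;   /* multiply on a set bit  */
    }
    return result;
}
```

With a real leading-zero-count instruction, highest_bit() would be a
single operation instead of a loop, which is the point of the request.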
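One of the "many known idioms" for POP_COUNT, for machines with no
hardware instruction, sums adjacent bit fields in parallel and then
gathers the byte sums with a multiply.  A sketch in C; the name
pop_count is illustrative, not a standard function.

```c
#include <stdint.h>

/* Population count of a 64-bit word without hardware support:
   compute 2-bit, then 4-bit, then 8-bit partial sums in parallel,
   and add the eight byte sums with one multiplication. */
unsigned pop_count(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);    /* 2-bit sums */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);        /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;    /* 8-bit sums */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```

With this, the parity of a word w is simply 1 & pop_count(w).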
Other times I need to know the highest power of 2 dividing a nonzero
integer; this is its trailing zero count.  Or I may be scanning the
nonzero bits in a sparse data structure; I need to quickly locate the
leftmost (or rightmost) nonzero bit, then the next such bit, ...

A linear algebra program takes the inner product of two vectors modulo
2; this is simply

    1 & POP_COUNT(v1 & v2)

if the vectors fit in one machine word.  All of these should be included
in language standards and optimized in each implementation, even where
there is no hardware support.

Other operations needing language support are:

    Both halves of an integer product (signed or unsigned).  Our SGI
    machines, for example, support a 64 x 64 -> 128-bit integer
    product, but I don't know how to access it in Fortran or C.

    Floating point arithmetic with explicit rounding mode on each add,
    subtract, multiply, or divide.  For example, a program adding two
    intervals should be able to use constructs resembling:

        interval interval_add(interval *i1, interval *i2)
        {
            interval answer;
            answer.low  = ROUND_DOWN(i1->low  + i2->low);
            answer.high = ROUND_UP  (i1->high + i2->high);
            return answer;
        }

    with the compiler changing the rounding mode as needed.  On the
    Alpha, ROUND_DOWN is supported directly by the hardware, and
    ROUND_UP can be achieved by first generating the negative of the
    desired result (and perhaps testing for +-0).

>1) Because Seymour Cray always put an assortment of such
>instructions in every machine he designed.  Including
>vector versions(!)  Since Seymour only included really
>important instructions on his minimalist machines,
>there must be a really good reason!
>(No kidding!  Did Seymour Cray ever include anything
>over multiple generations that didn't prove useful?)

I found no use for the rounded normalize (ZXi Xj,Bk) in the CDC Cyber
series.
The unrounded normalize (NXi Xj,Bk), on the other hand, was used when
converting integers to floating point, was generated by the FTN
compiler after each single-precision floating point add or subtract,
and could be used to get the leading zero count of an integer in
[0, 2^48 - 1].

The existence of two memory types (SCM and LCM on the 7600, CM and ECS
on other models) was a mistake.  The 18-bit address limitation
prevented one from extending the main memory.  One could put data in
secondary memory, but that data was accessed by different instructions
than data in primary memory.  If a library module did matrix
multiplication, and allowed its input and output matrices to be in
main or secondary memory, then the software needed 2^3 = 8 versions of
the code -- a maintenance headache.  I understand that some Crays had
a small cache which was similarly accessed by different instructions
than regular memory.
-- 
Peter L. Montgomery    pmontgom@cwi.nl    San Rafael, California
A mathematician whose age has doubled since he last drove an automobile.