Article 76059 of comp.arch:
In article <33937A15.ABD@nas.nasa.gov> Hugh LaMaster writes:
>Ian Ringrose wrote:
>>
>> Preston Briggs wrote in article <5mkq7h$132$1@joe.rice.edu>...
>
>> > Tera provides all sorts of goodies of this genre
>> >
>> >     a pop count
>> >     count leading/trailing zeroes/ones
>> >     bit matrix transpose (treating a 64-bit word as an 8x8 matrix)
>> >     bit matrix multiply (a couple of varieties)
>> >     bitwise select
>> >     bit pack and unpack
>> >
>> > plus a generous allotment of bitwise boolean operations,
>> > and the ability to compute parity on the result.

The population count has a precise mathematical definition when the
argument is nonnegative.  It has proved useful in applications, and it
should be added to common languages such as C and Fortran.  If the
hardware provides a population count instruction, then the compiler can
generate inline code using that instruction; otherwise the compiler can
use one of the many known idioms.  The user should be able to say simply
POP_COUNT(unsigned expression).  After 10 years, when POP_COUNT has
become a standard language feature and appears in some benchmark
programs, we can estimate how important direct hardware support is.

Other operations which have precise mathematical definitions, but where
the best algorithm varies considerably with the machine, are:

    Number of trailing zero bits in a positive (nonzero) integer.
    Number of significant bits required to hold a number (equal to
        word size minus leading zero count; one more than the
        truncated base 2 logarithm if the argument is nonzero).
    Greatest common divisor of two integers (signed or unsigned).
    Modular arithmetic operations (single precision arguments).
    Truncated square root of a nonnegative integer.

These primitives arise in many programs I write (my specialty is number
theory).  For example, when I use the left-to-right binary method of
exponentiation, I need to know the location of the leftmost nonzero bit
in the exponent.
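The left-to-right binary method just mentioned can be sketched in C as
follows.  Here highest_bit() stands in for the leftmost-nonzero-bit
primitive (emulated with a portable shift loop, since standard C offers
nothing better), and pow_mod assumes the modulus is small enough that
products fit in 64 bits.  Both names are mine, not from the article.

```c
#include <stdint.h>

/* Position of the leftmost 1 bit of e (e must be nonzero).
   A stand-in for the missing leading-bit language primitive. */
static int highest_bit(uint64_t e)
{
    int pos = -1;
    while (e != 0) { e >>= 1; pos++; }
    return pos;
}

/* Left-to-right binary exponentiation: base^exp mod m.
   Assumes m is small enough that (m-1)^2 fits in a uint64_t. */
uint64_t pow_mod(uint64_t base, uint64_t exp, uint64_t m)
{
    if (exp == 0) return 1 % m;
    uint64_t result = base % m;
    /* Scan exponent bits from just below the leading 1 down to bit 0. */
    for (int i = highest_bit(exp) - 1; i >= 0; i--) {
        result = (result * result) % m;     /* square for every bit   */
        if ((exp >> i) & 1)
            result = (result * base) % m;   /* multiply on a set bit  */
    }
    return result;
}
```

With a real leading-zero-count instruction, highest_bit() would be a
single operation instead of a loop, which is the point of the request.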
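One of the "many known idioms" for POP_COUNT, for machines with no
hardware instruction, sums adjacent bit fields in parallel and then
gathers the byte sums with a multiply.  A sketch in C; the name
pop_count is illustrative, not a standard function.

```c
#include <stdint.h>

/* Population count of a 64-bit word without hardware support:
   compute 2-bit, then 4-bit, then 8-bit partial sums in parallel,
   and add the eight byte sums with one multiplication. */
unsigned pop_count(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);    /* 2-bit sums */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);        /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;    /* 8-bit sums */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```

With this, the parity of a word w is simply 1 & pop_count(w).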
Other times I need to know the highest power of 2 dividing a nonzero
integer; this is its trailing zero count.  Or I may be scanning the
nonzero bits in a sparse data structure; I need to quickly locate the
leftmost (or rightmost) nonzero bit, then the next such bit, ...

A linear algebra program takes the inner product of two vectors modulo
2; this is simply

    1 & POP_COUNT(v1 & v2)

if the vectors fit in one machine word.  All of these should be included
in language standards and optimized in each implementation, even where
there is no hardware support.

Other operations needing language support are:

    Both halves of an integer product (signed or unsigned).  Our SGI
    machines, for example, support a 64 x 64 -> 128-bit integer
    product, but I don't know how to access it in Fortran or C.

    Floating point arithmetic with explicit rounding mode on each add,
    subtract, multiply, or divide.  For example, a program adding two
    intervals should be able to use constructs resembling:

        interval interval_add(interval *i1, interval *i2)
        {
            interval answer;
            answer.low  = ROUND_DOWN(i1->low  + i2->low);
            answer.high = ROUND_UP  (i1->high + i2->high);
            return answer;
        }

    with the compiler changing the rounding mode as needed.  On the
    Alpha, ROUND_DOWN is supported directly by the hardware, and
    ROUND_UP can be achieved by first generating the negative of the
    desired result (and perhaps testing for +-0).

>1) Because Seymour Cray always put an assortment of such
>instructions in every machine he designed.  Including
>vector versions(!)  Since Seymour only included really
>important instructions on his minimalist machines,
>there must be a really good reason!
>(No kidding!  Did Seymour Cray ever include anything
>over multiple generations that didn't prove useful?)

I found no use for the rounded normalize (ZXi Xj,Bk) in the CDC Cyber
series.
The unrounded normalize (NXi Xj,Bk), on the other hand, was used when
converting integers to floating point, was generated by the FTN
compiler after each single-precision floating point add or subtract,
and could be used to get the leading zero count of an integer in
[0, 2^48 - 1].

The existence of two memory types (SCM and LCM on the 7600, CM and ECS
on other models) was a mistake.  The 18-bit address limitation
prevented one from extending the main memory.  One could put data in
secondary memory, but that data was accessed by different instructions
than data in primary memory.  If a library module did matrix
multiplication, and allowed its input and output matrices to be in
main or secondary memory, then the software needed 2^3 = 8 versions of
the code -- a maintenance headache.  I understand that some Crays had
a small cache which was similarly accessed by different instructions
than regular memory.
-- 
Peter L. Montgomery    pmontgom@cwi.nl    San Rafael, California
A mathematician whose age has doubled since he last drove an automobile.