<<< MOVIES::DISK$SYSDATA:[NOTES$LIBRARY]DOLLAR_INFO.NOTE;1 >>>
                      -< Dollar File System Information >-
================================================================================
Note 86.3                Additional Product Requirements                  3 of 8
LEDER1::PETTENGILL "mulp"                            82 lines  26-APR-1994 02:22
                    -< Don't use CRC, use a Fletcher code >-
--------------------------------------------------------------------------------
In reading the notes about backup I was reminded that a major issue with backup
is the cost of computing the CRC.

Well, there is to my knowledge NO advantage to using a CRC a la CCITT or 32 or
any of the other CRC variants.

Instead, a Fletcher code makes much more sense as it is much cheaper to compute
and provides equivalent error detection.  In fact, computing a Fletcher code
on Alpha will cost no more than computing a checksum and it is simple enough
to be combined with other functions such as computing the XORcise redundancy
block.

And just in case you feel that hardware is now so reliable that there is no
need for FCS or redundancy, well, remember the DEQNA.  That was the Ethernet
adapter that corrupted data with no reasonable workaround, so NISCS had to
include an FCS to catch the problems.  Guess what, a similar problem
has popped up in another LAN adapter.  So far it has only been forced to occur
using VMS NISCS doing disk I/O, and the larger the transfer the more likely it
is to occur.  The reason it was found is that CVG does lots and lots and lots
of testing of disk/file I/O and we check almost all the data for validity.
After some months of investigation the cause was identified, and a fix is
being made to the host drivers so that the hardware can ship prior to the
respin of the GA.  We have also found data corruption problems
in many other adapters (CI, DSSI, NI, FDDI, etc) and in a number of controllers
(HSC, KDM70, etc.) and in disk drives.  (For the record, we find far more
software problems than hardware or firmware.)

But we do NO SIGNIFICANT TESTING OF TAPE.  While storage is certainly doing
tape testing, they are not doing complete system testing where they are looking
at end to end data integrity.  What is the true error rate of the complete
system, including the VMS tape class driver, the tape mscp server, the LAN
software, firmware, hardware, etc.?  The reliability of a backup can be no
better than the sum of all those error rates.

---

For those unfamiliar with the Fletcher codes, they are computed like:

	c0=c1=c2=c3=0
	for i=1..n
		b=data[i]
		c0=(c0+b) mod m
		c1=(c1+c0) mod m
		c2=(c2+c1) mod m
		c3=(c3+c2) mod m

Typically c0,c1,c2,c3 are each 8 bits.  If you want a 16 bit FCS then you
compute just c0 and c1.  For increased redundancy and larger blocks,
compute additional values.  The modulus can be something convenient such as
256 or even 255 with little impact on cost.  Alternatively, c0,c1,.. can be
16 bits or any other convenient size.
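
As a minimal sketch of the computation above (in Python, purely for
illustration; the function name fletcher and its parameters nsums and m are
made up here and are not taken from any existing code), with the number of
running sums and the modulus as parameters:

```python
def fletcher(data, nsums=2, m=255):
    """Return the running sums [c0, c1, ...] over the bytes of data."""
    c = [0] * nsums              # all sums start at zero
    for b in data:
        c[0] = (c[0] + b) % m
        for k in range(1, nsums):
            # each higher sum accumulates the one below it
            c[k] = (c[k] + c[k - 1]) % m
    return c

print(fletcher(b"abcde"))            # 16 bit FCS: [c0, c1] -> [240, 200]
print(fletcher(b"abcde", nsums=4))   # 32 bit FCS: [c0, c1, c2, c3]
```

With nsums=2 and m=255 this is the usual 16 bit form; passing m=256 gives the
cheaper variant where the reduction is just truncation to a byte.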

While the mathematical algorithm calls for byte processing, an actual
implementation would normally work on larger units, and in the case of Alpha it
can all be done based on longwords, even with a modulus of 255.  The code of
significance would be something along the lines of
		ldl	get longword
x		addl	c0
x		addl	c1
x		addl	c2
x		addl	c3
		lda	update pointer
		cmp	test for done
		br	loop
and if an xor block were being computed, then only four instructions would be
added to the similar loop above.  (The above would be unrolled and scheduled
for best performance.)
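
To illustrate how the sums and an XOR block can share one pass over the data,
here is a rough Python sketch.  It is only a sketch under stated assumptions:
each buffer is treated as little-endian 32 bit longwords, only c0 and c1 are
kept, and they are kept modulo 2**32 because that is the simplest choice to
show (other moduli work as described earlier).  The names fcs_and_xor, MASK,
blocks and parity are hypothetical.

```python
import struct

MASK = 0xFFFFFFFF  # sums kept modulo 2**32 (illustrative choice only)

def fcs_and_xor(buf, parity):
    """One pass over buf: fold buf into parity in place, return (c0, c1).

    buf and parity must have the same length, a multiple of 4 bytes.
    """
    c0 = c1 = 0
    for off in range(0, len(buf), 4):
        (w,) = struct.unpack_from("<I", buf, off)    # load one longword
        c0 = (c0 + w) & MASK
        c1 = (c1 + c0) & MASK
        (p,) = struct.unpack_from("<I", parity, off)
        struct.pack_into("<I", parity, off, p ^ w)   # the extra xor work
    return c0, c1

blocks = [bytes(range(16)), bytes(range(16, 32)), b"\xff" * 16]
parity = bytearray(16)
sums = [fcs_and_xor(b, parity) for b in blocks]

# Any one block can be rebuilt from the parity and the remaining blocks.
rebuilt = bytearray(parity)
for b in blocks[1:]:
    rebuilt = bytearray(x ^ y for x, y in zip(rebuilt, b))
assert bytes(rebuilt) == blocks[0]
```

The point is only that the longword is already in hand when the sums are
updated, so the parity update needs no second pass over the data.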

One might also imagine it being used to protect segments in data cache at
essentially zero cost.  Since data in a data cache is copied to and from
the user buffer, the data is already loaded into a register and the cost is
the load/store, not the computation, so computing an FCS would be very low
cost.

The error detection capability of a Fletcher code is similar to and sometimes
greater than that of a CRC if the data is actually checked for octet framing
via some other mechanism, which is certainly the case with disks, FDDI, etc.
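
A small experiment along these lines, in Python and offered as a sketch rather
than as evidence for the general claim: it takes one short message, applies
every single-bit error and every swap of two adjacent different bytes, and
counts how many corrupted copies each check fails to notice.  The plain
additive checksum is included here only as a baseline, and the helper names
(additive, fletcher16, missed) are made up for the example.  zlib.crc32 is the
CRC-32 from the Python standard library.

```python
import zlib

def additive(data):
    return sum(data) % 256

def fletcher16(data, m=255):
    c0 = c1 = 0
    for b in data:
        c0 = (c0 + b) % m
        c1 = (c1 + c0) % m
    return (c1 << 8) | c0

def missed(check, msg, corrupted):
    """Count corrupted copies whose check value equals that of msg."""
    good = check(msg)
    return sum(1 for bad in corrupted if check(bad) == good)

msg = b"end to end data integrity"

# Every single-bit error.
flips = []
for i in range(len(msg)):
    for bit in range(8):
        bad = bytearray(msg)
        bad[i] ^= 1 << bit
        flips.append(bytes(bad))

# Every swap of two adjacent, different bytes.
swaps = []
for i in range(len(msg) - 1):
    if msg[i] != msg[i + 1]:
        bad = bytearray(msg)
        bad[i], bad[i + 1] = bad[i + 1], bad[i]
        swaps.append(bytes(bad))

for name, check in [("additive", additive),
                    ("fletcher16", fletcher16),
                    ("crc32", zlib.crc32)]:
    print(name, missed(check, msg, flips), missed(check, msg, swaps))
```

On this message the additive checksum misses every swap, since it ignores
byte order, while the Fletcher code and the CRC miss none of either kind.  One
message and two error patterns prove nothing about error rates in general.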

It was for this reason that OSI Transport uses a Fletcher code instead of CRC.

(I guess to a mathematician Fletcher and CRC are the same since both are
simply polynomials, but the cost of computing CRCs in software isn't quite
so simple.)