From:	CSBVAX::MRGATE!RELAY-INFO-VAX@CRVAX.SRI.COM@SMTP 20-AUG-1988 12:58
To:	ARISIA::EVERHART
Subj:	Emulex tape controllers.  READ THIS if you use them.


Received: From KL.SRI.COM by CRVAX.SRI.COM with TCP; Sat, 20 AUG 88 07:53:24 PDT
Received: from OHSTPY.MPS.OHIO-STATE.EDU ([128.146.37.10]) by KL.SRI.COM with TCP; Sat, 20 Aug 88 07:35:20 PDT
Date: Sat, 20 Aug 88 10:28 EDT
From: "Barron Hulver (216) 775-8290" <PHULVER%OCVAXA@OHSTPY.MPS.OHIO-STATE.EDU>
Subject: Emulex tape controllers.  READ THIS if you use them.
To: info-vax@kl.sri.COM
X-VMS-To: OHSTPY::IN%"info-vax@kl.sri.com",PHULVER

If you use Emulex TC13 tape controllers on a VMS system, READ THIS.

   1.  According to Emulex (8/19/88) the TC13 tape controller does not work
       under VMS 5.0.  They are in the process of writing a device
       driver for this controller.  Call Emulex technical support
       at (714) 662-5600 for more information or call an Emulex
       regional sales office.  Emulex claims the software will be
       ready around the end of September.  (Yes, but will it work?)

   2.  If you use a TC13 with more than 1 tape drive (daisy-chained),
       there is a timing window problem.  If 2 tape jobs are running
       at the same time, it is possible for 1 job to abort the other
       job.

       If anyone is interested in the command files I used to find
       the timing window problem, I will gladly send them to you.


....Barron Hulver                        (216)  775-8290
   e-mail postmaster & technical support analyst
   Houck Computing Center
   Oberlin College,   Oberlin, OH   44074

   phulver @ oberlin                      (bitnet)
   ocvaxa::phulver                        (ccnet)
   phulver % oberlin @ uk.ac.rl           (janet to bitnet)
   (you can also send mail to -> postmaster/postmast/postmstr)

Here is a memo I sent to our staff about some tape problems we
were having.
 
----------------------------------------------------------------------
Memo to Oberlin staff:

This is a followup to Aletha's tape problem.  I have been able to get a
hard (recurring) error.  I am at the point where
   - I can submit a tape job that will cause Aletha's tape job
         to abort every time.
   - I can submit a tape job such that Aletha's tape job will cause
         my job to abort.

This is a very long memo as I want to describe the testing that was
involved, an explanation of the problem, and finally a recommendation.

Initial testing - Aletha's tape job and a slow tape scan
---------------

I initially ran Aletha's job 3 times and it ran fine every time.  I then
traced the operator's console logs and checked the system error logs to
find what was happening on the system when her jobs aborted.  An obvious
coincidence was that David Hersey (the student programmer for the
summer) was testing a general purpose program to manipulate tapes
(TCOPY). This program can copy one tape to another or can scan a tape to
print certain attributes. 

I then ran Aletha's job (SR_COM:SR2J1350.COM) on VAXC drive MSA0: and
did a slow scan of tape 10188 on VAXC drive MSB0:.  Much to my surprise
my tape scan job aborted while Aletha's tape job ran fine. 

I inserted the DCL command SHOW TIME after every single command in
Aletha's job (copied over to my account by now) and did the same for my
tape scanning job.  In 2 out of 3 tests Aletha's job aborted my slow
scan tape job 52 seconds after Aletha's job started doing a DIRECTORY of
the tape she had just written.  In the 3rd test Aletha's job aborted my
slow scan tape job 53 seconds after it started doing a DIRECTORY of the
tape she had just created. 

I started to whittle down Aletha's job.  I deleted all the commands
except the command to do a DIRECTORY of Aletha's tape (10217).  I
resubmitted the new DIRECTORY job and resubmitted my slow scan routine.
The DIRECTORY job aborted my slow scan routine!!  I should say here that
my slow scan routine simply reads every block of data on the tape and
reports the number of blocks in each file on the tape.

Testing to eliminate the tape scan program
------------------------------------------

At this point I replaced my slow scan routine with a standard VMS
utility: DUMP.  I resubmitted the DIRECTORY job and my job to dump all
blocks on the tape.  The DIRECTORY job aborted my dump job. I then
resubmitted the 2 jobs, ran ANALYZE/SYSTEM and performed SHOW DEVICE
MSA0 and SHOW DEVICE MSB0.  I was able to catch glimpses of the I/O
request queue.  I was able to see on MSA0 that the directory command
uses a combination of "read physical block" function calls with "skip to
next tape mark" function calls.  Also, since this was a labeled tape,
the magnetic tape Ancillary Control Process (ACP) displayed the current
position of the tape expressed as a number of blocks from the beginning
of the tape.

I had to do some thinking about how a labeled tape is formatted.
An ANS labeled tape is formatted as follows:

         2-4 HDR labels (file name and attributes)
               tape mark
         'n' data blocks for file
               tape mark
         2-4 EOF labels (file name again and attributes)
               tape mark
                  ...
         2-4 HDR labels (file name and attributes)
               tape mark
         'n' data blocks for file
               tape mark
         2-4 EOF labels (file name again and attributes)
               tape mark
               tape mark

It appears that the DIRECTORY command algorithm reads each block of the
HDR labels and prints the file name and attributes.  The DIRECTORY
command then skips to the next tape mark (skips over the data blocks),
reads the EOF labels, then repeats the process by reading the next set
of HDR labels and printing the next file name.  It continues this until
two consecutive tape marks are encountered on the tape. 
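
As a rough illustration of the algorithm just described, here is a small
Python sketch.  It is not the VMS DIRECTORY code.  The tape is modeled as
a plain list, the layout follows the simplified diagram above with two
labels per group, and the names make_tape and directory are made up for
this example.

```python
TM = "TM"  # stands in for a tape mark

def make_tape(files):
    """Build a model tape from (file_name, data_block_count) pairs."""
    tape = []
    for name, nblocks in files:
        tape += [("HDR1", name), ("HDR2", name), TM]   # header labels
        tape += [("DATA", name)] * nblocks             # data blocks
        tape += [TM, ("EOF1", name), ("EOF2", name), TM]
    tape.append(TM)  # second consecutive tape mark ends the tape
    return tape

def directory(tape):
    """Walk the tape the way the memo describes DIRECTORY doing it.

    Returns (file_names, label_blocks_read, skip_requests).
    """
    names, reads, skips = [], 0, 0
    pos = 0
    while tape[pos] != TM:          # a tape mark here is the double mark
        name = None
        while tape[pos] != TM:      # read each HDR label block
            name = tape[pos][1]
            reads += 1
            pos += 1
        pos += 1                    # step past the mark after the HDR labels
        names.append(name)
        while tape[pos] != TM:      # one "skip to next tape mark" request
            pos += 1                # passes over every data block
        pos += 1
        skips += 1
        while tape[pos] != TM:      # read each EOF label block
            reads += 1
            pos += 1
        pos += 1                    # step past the mark after the EOF labels
    return names, reads, skips
```

The point to notice is that each file costs only a handful of short label
reads plus one skip request, and that single skip request lasts as long
as it takes the drive to pass over all of the file's data blocks.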

More testing - using standard VMS tape utilities
------------

At this point I had a hard error.  I was wondering if perhaps the
software ACP was the culprit.  To test this I replaced Aletha's
DIRECTORY job with a tape job that would simply mount her tape (10217)
as an unlabeled tape, and use the DCL command SET MAGTAPE/SKIP=BLOCKS:N
to skip around on the tape.  A second job mounted tape 10188 as an
unlabeled tape and dumped every block on the tape (read each block of
data on the tape).

The job that dumped data would time out when the other job tried to skip
a file containing 2474 blocks or more. The job that dumped data would
work fine when the other job tried to skip a file containing 2068 blocks
or less.  I did not try to find the exact break in the range 2068-2474. 

I then repeated the above tests but used the VMS command SET
MAGTAPE/SKIP=FILES:1 to skip over the files that were 2474 blocks or
more or 2068 blocks or less.  The results were the same.  The job that
dumped data would time out when the other job tried to skip a file that
contained 2474 blocks or more.  The job that dumped data would work fine
when the other job tried to skip a file containing 2068 blocks or less. 

The tests seemed to indicate that a tape job that was skipping blocks or
files could cause another tape job to abort.  Specifically it would
cause the other tape job to time out. 

Back to Aletha's tape job
-------------------------

I ran Aletha's tape job and at the same time ran a fast tape scan of a
tape with one large file on it (10188).  Aletha's tape job aborted
(timed out). I reran Aletha's tape job and at the same time ran a slow
tape scan of tape 10188.  The slow tape scan job aborted (at the time
Aletha's job was doing a DIRECTORY command). 


More testing - 2048-byte blocks at 1600bpi on both VAXC and VAXB
------------

I created two tapes (10029 and 10030) that have 1 file each of 2048-byte
blocks of the repeating 2-byte pattern "*V".  Both 2400 foot tapes are
full, unlabeled tapes at 1600bpi.  I have in a subdirectory of mine
(APS:[PHULVER.TAPERRORS.ERASEC) a set of command procedures to test
these tapes if anyone else is interested.  The set of tests performs a
slow scan of tape 10029 on drive msa0: and skip blocks on tape 10030 on
drive msb0:. 


Here are the results:                                          

     VAXC   results

$!
$!! set magtape/skip=blocks:5000    tape_drive  ! aborted other tape job
$!! set magtape/skip=blocks:4000    tape_drive  ! aborted other tape job
$!! set magtape/skip=blocks:3000    tape_drive  ! aborted other tape job
$!! set magtape/skip=blocks:2000    tape_drive  ! aborted other tape job
$!! set magtape/skip=blocks:1500    tape_drive  ! aborted other tape job
$!! set magtape/skip=blocks:1000    tape_drive  ! aborted other tape job
$!! set magtape/skip=blocks:700    tape_drive  !  did not abort others
$!! set magtape/skip=blocks:900    tape_drive  !  did not abort others
$!! set magtape/skip=blocks:950    tape_drive  !  aborted other tape job
$!! set magtape/skip=blocks:925    tape_drive  !  aborted other tape job
$!! set magtape/skip=blocks:920    tape_drive  !  aborted other tape job
$!! set magtape/skip=blocks:915    tape_drive  !  did not abort others
$!! set magtape/skip=blocks:910    tape_drive  !  did not abort others
$!! set magtape/skip=blocks:905    tape_drive  !  did not abort others
$!
$!  the critical range is 915-920 blocks
$!
$!! set magtape/skip=blocks:919    tape_drive  !  aborted other tape job
$!! set magtape/skip=blocks:918    tape_drive  !  aborted other tape job
$!! set magtape/skip=blocks:917    tape_drive  !  aborted other tape job
$!! set magtape/skip=blocks:916    tape_drive  !  did not abort others
$!
$!  CONCLUSION for this test:
$!   On VAXC:  skipping 916 blocks will not abort other tape jobs
$!             skipping 917 blocks will cause other jobs to timeout
$!


           VAXB results
$!
$!
$!! set magtape/skip=blocks:950    tape_drive  !  aborted other tape job
$!! set magtape/skip=blocks:900    tape_drive  !  aborted other tape job
$!! set magtape/skip=blocks:700    tape_drive  !  did not abort others
$!! set magtape/skip=blocks:800    tape_drive  !  did not abort others
$!! set magtape/skip=blocks:850    tape_drive  !  did not abort others
$!! set magtape/skip=blocks:860    tape_drive  !  did not abort others
$!! set magtape/skip=blocks:870    tape_drive  !  did not abort others
$!! set magtape/skip=blocks:880    tape_drive  !  did not abort others
$!! set magtape/skip=blocks:890    tape_drive  !  did not abort others
$!! set magtape/skip=blocks:895    tape_drive  !  aborted other tape job
$!
$!!  critical range is 890-895 blocks
$!
$!! set magtape/skip=blocks:895    tape_drive  !  aborted other tape job
$!! set magtape/skip=blocks:894    tape_drive  !  did not abort others
$!
$!  CONCLUSION for testing on VAXB:
$!      skipping 894 blocks did not abort the other tape job
$!      skipping 895 blocks did abort the other tape job
$!


The results show that on VAXC the breakdown occurs when skipping more
than 916 blocks.  On VAXB the breakdown occurs when skipping more
than 894 blocks.  I don't know why this small discrepancy exists;
perhaps because it was a model 891 performing the skipping on VAXB.
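
The narrowing shown in the listings above is essentially a bisection done
by hand.  A minimal Python sketch of the same search follows.  The
function name and the aborts() callback are hypothetical.  In practice
each call to aborts(n) stands for mounting both tapes, running the scan
job against a SET MAGTAPE/SKIP=BLOCKS:n job, and noting whether the scan
job timed out.  The sketch assumes the outcome depends only on the skip
count and flips exactly once between the two starting values.

```python
def first_aborting_skip(aborts, low, high):
    """Return (smallest skip count that aborts the other job, trials run).

    Assumes aborts(low) is False, aborts(high) is True, and that every
    count above the threshold aborts while every count below does not.
    """
    trials = 0
    while high - low > 1:
        mid = (low + high) // 2
        trials += 1
        if aborts(mid):
            high = mid      # threshold is at mid or below
        else:
            low = mid       # threshold is above mid
    return high, trials

if __name__ == "__main__":
    # Stand-in for the real two-job test, using the VAXC result above.
    threshold, trials = first_aborting_skip(lambda n: n >= 917, 700, 1000)
    print(threshold, trials)
```

Since every real trial means a pair of batch jobs and two tape mounts,
halving the range each time keeps the number of runs small.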


Here is our Hardware configuration
----------------------------------
     VAXC
     8600
      TC13 ----------- CIPHER 990 ----------  CIPHER 990
      (MSA:)             (MSA0:)                (MSB0:)
      (MSB:)

     VAXB
     780
      TC13 ----------- CIPHER 990 ---------  CIPHER 891
      (MSA:)            (MSA0:)                (MSB0:)
      (MSB:)


This is what the software thinks our hardware configuration is
--------------------------------------------------------------

     VAXC
     8600
       TS11 ----------  CIPHER 990
       (MSA:)             (MSA0:)

       TS11 ----------  CIPHER 990
       (MSB:)             (MSB0:)

     VAXB
     780
       TS11 ----------  CIPHER 990
       (MSA:)             (MSA0:)

       TS11 ----------  CIPHER 891
       (MSB:)             (MSB0:)


A DEC TS11 tape controller supports only one tape drive.  An Emulex
TC13 tape controller emulates up to 4 DEC TS11 tape controllers.

The manuals
-----------

The Emulex TC13 tape coupler technical manual on page 2-11 has Table 2-3
Formatter to Coupler Signals.  The first formatter to coupler signal
listed is FBY.  "FBY remains TRUE until the command is completed or tape
motion ceases." 

The Cipher 890 manual on page 1-24 has Table I-6 Command Decoding.
One possible command is "File Search Forward."

In other words, the formatter is busy from the time you issue a command
to skip to the next tape mark until that tape mark is found.

Now refer to the Emulex TC13 controller manual page 2-7, Table 2-1.
Pin/Signal Assignments for TC13 Tape Coupler and Tape Transport
Interface.  What this table says is that there is only one FBY signal
line on the cable; there is not one FBY signal for each drive on the
cable.  Therefore if one tape drive says it is busy, a command cannot be
started on the other tape drive.  This is how the hardware handles it. 


Explanation
-----------

Here is why 2 different tape jobs have problems.  Assume there is a job
called WRITE_TAPE that writes several files to a tape.  Assume there is
another job called DIRECTORY that will show a list of files on a tape
and that the tape only has one file but it fills the entire 2400 foot
reel of tape. 

The WRITE_TAPE job starts first, acquires drive MSA0:, and starts
writing blocks of data, one block at a time.  The actual VMS code
will look something like this:

           loop:
               Is controller MSA: free?   yes, it is.
               Start device MSA0: I/O function
                   and begin countdown (device timeout) timer.

                   <device MSA0: interrupts when function is complete>
               go to loop

In this case the "I/O function" is to write a single block of data. 

Now the job DIRECTORY starts and acquires MSB0:.  It will read the HDR
labels and print them, skip to the next tape mark, then read the
EOF labels, and loop printing all the file names on the tape.

           loop:
               Is controller MSB: free?   yes, it is.
               Start device MSB0: I/O function
                   and begin countdown (device timeout) timer.

                   <device MSB0: interrupts when function is complete>
               go to loop

In this case the "I/O function" is either a read of a single block (HDR
or EOF label) or a skip to the next tape mark (skip over data in a
file). 

Here is the sequence of events that creates the error:

       DIRECTORY job
          Is controller MSB: free?   yes, it is always free.
          Start device MSB0: to skip to next tape mark
                 and begin countdown timer.
       WRITE_TAPE job
          Is controller MSA: free?   yes, it is always free.
          Start device MSA0: to write a single block
                 and begin countdown timer.

        (The FBY line is busy servicing the DIRECTORY job's request,
        so the TC13 controller just puts the WRITE_TAPE job's request
        on hold.)

        <normal hardware clock interrupt>
           check all devices on system, are any countdown timers
           equal to zero?  Yes, the countdown timer for device MSA0:
           has reached zero.  Issue an error message saying the
           device has timed out.  This ultimately aborts the job.
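
The sequence above can be played out in a small Python simulation.  This
is only a sketch of the reasoning, not of the VMS driver or the TC13
firmware.  The Request class, the simulate function, and all of the
durations are made up for illustration.  The 20-second timer is the
figure DEC quotes later in this memo, and the much longer timer given to
the skip request is an assumption, made because the skipping job itself
was never seen to time out.

```python
from dataclasses import dataclass

@dataclass
class Request:
    issued: float    # when the driver starts the function and its timer
    drive: str
    duration: float  # how long the formatter stays busy once started
    timeout: float   # driver countdown timer for this function

def simulate(requests):
    """Serve requests over one shared cable, in order of issue time."""
    cable_free_at = 0.0
    outcome = []
    for req in sorted(requests, key=lambda r: r.issued):
        deadline = req.issued + req.timeout
        start = max(req.issued, cable_free_at)   # held while cable is busy
        if start >= deadline:
            outcome.append((req.drive, "timeout"))   # never got the cable
            continue
        cable_free_at = start + req.duration
        status = "ok" if cable_free_at <= deadline else "timeout"
        outcome.append((req.drive, status))
    return outcome

# DIRECTORY job skips a large file; WRITE_TAPE job writes one block.
long_skip = [Request(0.0, "MSB0:", 60.0, 300.0),
             Request(1.0, "MSA0:", 0.05, 20.0)]
# Same pair of jobs, but the file being skipped is small.
short_skip = [Request(0.0, "MSB0:", 10.0, 300.0),
              Request(1.0, "MSA0:", 0.05, 20.0)]

if __name__ == "__main__":
    print(simulate(long_skip))
    print(simulate(short_skip))
```

In the first case the write is held past its timer and fails even though
nothing is wrong with MSA0:.  In the second the write is merely delayed.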


Call to Emulex
--------------

I called Emulex technical support at (714) 662-5600.  I talked with Bob
Johnson. I confirmed that the daisy-chain cabling is dedicated to a
single tape drive during a skip tape mark operation.  That is, an I/O
operation cannot be started or be in progress on the other tape drive
during this time. 

I then asked about skip tape block operation.  Bob Johnson said the TC13
tape controller dedicates the cable for the entire amount of time to
skip all requested blocks.  The TC13 issues a "space forward 1 record"
and maintains a counter for the number of blocks it is supposed to skip.
In other words the tape drive skips one block at a time and the
controller keeps a count of the number of blocks to skip. The controller
will not let an I/O begin on the other drive until the skip of all the
blocks is complete.  "There is only so much code we can put in a PROM."
Actually, I thought about this over the weekend and decided there is a
better reason.  The controller needs to monopolize the cabling when
skipping blocks.  Since the controller is counting, it needs to keep
issuing commands to the tape drive to keep the drive streaming. If the
controller allowed I/O to the other drive between single skip block
operations, then there would be excessive repositioning on the skipping
tape drive. 
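
The trade-off can be put in a toy Python model.  This is my own
illustration of the argument, not the TC13 firmware.  The function name
and the idea of counting "record passes" are assumptions made only to
compare the two policies.

```python
def skip_blocks(count, pending_at, hold_cable):
    """Model one skip of `count` records on the skipping drive.

    pending_at: record indices at which the other drive has a request
                waiting for the cable.
    hold_cable: True  = keep the cable until the whole skip is done;
                False = give up the cable between single-record skips.

    Returns (repositions, longest_wait), with longest_wait measured in
    record passes that the other drive's request sat on hold.
    """
    repositions = 0
    longest_wait = 0
    for record in range(count):
        if record in pending_at:
            if hold_cable:
                # Other drive waits for every record still to be skipped.
                longest_wait = max(longest_wait, count - record)
            else:
                # Other drive is served now; the skipping drive falls
                # out of streaming and has to reposition.
                repositions += 1
    return repositions, longest_wait

if __name__ == "__main__":
    print(skip_blocks(1000, {10, 500}, hold_cable=True))
    print(skip_blocks(1000, {10, 500}, hold_cable=False))
```

Holding the cable costs the other drive a wait that grows with the size
of the skip.  Releasing it costs the skipping drive one reposition for
every request that cuts in.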

After I inquired about the algorithm the tape controller uses to skip
tape marks and skip blocks, I told Bob Johnson about our particular
problem.  He admitted that not many VAX sites use daisy chaining. He
refers to our problem as a "constriction" or a funnel.  He describes us
as having a fan-in, pipe, fan-out configuration.   That is, 2 tape
controllers (MSA:, MSB:), one data path (the cabling between the
tape controller and the tape drives), and 2 tape drives (MSA0:, MSB0:).

Call to DEC
-----------

I called DEC Colorado Springs Customer Support Center.  I asked how long
the timeout is for the TS11 software device driver.  The response was
that for rewinding a tape the timeout is 5 minutes; otherwise the timeout
is 20 seconds.  These values may be overridden on the call to the QIO
system service. 
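
The 20-second figure can be checked against the skip thresholds measured
earlier (917 blocks on VAXC, 895 on VAXB, 2048-byte blocks at 1600bpi).
The Python sketch below works out how much tape such a skip covers and
what tape speed would make it last 20 seconds.  The 0.6 inch gap between
blocks is an assumption (the usual nominal figure for 1600bpi tape), not
something measured here, and the function names are made up.

```python
BLOCK_BYTES = 2048   # block size on the two test tapes
DENSITY_BPI = 1600   # recording density of the test tapes
GAP_INCHES = 0.6     # ASSUMED nominal gap between blocks at 1600bpi
TIMEOUT_S = 20.0     # TS11 driver timeout quoted by DEC

def skip_length_inches(blocks):
    """Tape passed over when skipping `blocks` blocks (data plus gaps)."""
    return blocks * (BLOCK_BYTES / DENSITY_BPI + GAP_INCHES)

def implied_speed_ips(blocks, seconds=TIMEOUT_S):
    """Tape speed at which skipping `blocks` blocks takes `seconds`."""
    return skip_length_inches(blocks) / seconds

if __name__ == "__main__":
    for system, blocks in (("VAXC", 917), ("VAXB", 895)):
        print(f"{system}: {blocks} blocks is about "
              f"{skip_length_inches(blocks):.0f} inches, or "
              f"{implied_speed_ips(blocks):.0f} in/s over {TIMEOUT_S:.0f} s")
```

Under these assumptions both thresholds come out in the mid-80s of inches
per second.  That is the same order as the speed implied by the full-reel
skip reported in the 8/19 test below (2400 feet in nearly 5 minutes, a
little under 100 in/s), which fits the idea that the other job dies when
a single skip holds the cable for about 20 seconds.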


Additional comments
-------------------

In this document I have said the aborted tape job error was due to a
"device timeout" error being returned by the system.  It is also
possible for the tape job to abort with the error "device not in
configuration or not available."  This will happen if the daisy-chain
cabling is busy when a job goes to mount a tape. 

Also, note how daisy-chaining can severely limit tape drive performance.
Only one tape drive can be using the cable at any one time.  How
in the world is the system supposed to keep both drives streaming?

Also, I do not wish to suggest this is the answer to all of our tape
problems.  This memo just explains the problem that occurred with Aletha's
tape job.


Testing Results of 8/19/88
--------------------------

On Friday, 8/19/88, we moved the TC13 tape controller from OCVAXC to
OCVAXB, which gave OCVAXB the hardware configuration shown under
"Testing hardware configuration" below.  (We were able to maintain the
tape drive names as MSA0: and MSB0:.) 

Let me restate the condition that causes a problem.  Two tape jobs are
running at the same time.  One job is reading or writing blocks of data
(a file) to a tape.  The other job is skipping over a large file on a
tape (e.g.  doing a DIRECTORY of the tape).  The job skipping will cause
the other job to abort with a device timeout condition.  The timeout
will occur about 20 seconds after the skipping tape job starts a large
skip. 

I ran one single test and the problem did not occur.  Specifically, I
had a tape job running on MSA0: that would read every block of data on a
full 2400' tape.  I also had a tape job running on MSB0: that would skip
every block of data on a full 2400' tape.  The large skip operation took
nearly 5 minutes to complete.  Both jobs ran successfully. 

This test confirms that an Emulex tape controller can support only one
tape drive.


Testing hardware configuration
------------------------------
     VAXB
     780
       TC13 ----------  CIPHER 990 
       (MSA:)             (MSA0:)

       TC13 ----------  CIPHER 990
       (MSB:)             (MSB0:)

Testing software configuration
------------------------------
     VAXB
     780
       TS11 ----------  CIPHER 990
       (MSA:)             (MSA0:)

       TS11 ----------  CIPHER 990
       (MSB:)             (MSB0:)


Recommendation
--------------

My testing and research show that a TC13 tape controller cannot support
more than 1 tape drive at the same time.  Our only remedy is to have one
tape controller per tape drive.  This means we should purchase 2 more
TC13 tape controllers: one more for VAXB and one more for VAXC. 


Possible future problem
-----------------------

Emulex has received reports from its customers that the TC13 tape
controller does not work under VMS 5.0.  They are going to investigate
these reports over the next 2-3 weeks to determine if they are true. If
they are true Emulex is going to try using a software device driver from
a previous VMS version (such as 4.7).  If a driver from a previous
version does not work then Emulex is faced with writing their own
driver.  I will stay in touch with Emulex. 


I called Emulex on 8/19/88.  They confirmed that the TC13 tape
controller does not work under VMS 5.0.  They are in the process of
writing a device driver.  They claim the software will be ready in
another 4 weeks.  They also said we need to contact the company we
bought the controller from in order to get on the software distribution
list.  Is Lowery still in business?  I called the Emulex regional office
in Cincinnati (513) 762-7882 and asked to talk to a sales rep.  Bob
Shizel was not in so I left a message to have him call me.

Bob Shizel returned my call later in the day.  He also confirmed
that they are writing software to handle the TC13 controller.  He
said it should be ready by the end of September.
 
