From: CSBVAX::MRGATE!RELAY-INFO-VAX@CRVAX.SRI.COM@SMTP 20-AUG-1988 12:58
To: ARISIA::EVERHART
Subj: Emulex tape controllers. READ THIS if you use them.

Received: From KL.SRI.COM by CRVAX.SRI.COM with TCP; Sat, 20 AUG 88 07:53:24 PDT
Received: from OHSTPY.MPS.OHIO-STATE.EDU ([128.146.37.10]) by KL.SRI.COM with TCP; Sat, 20 Aug 88 07:35:20 PDT
Date: Sat, 20 Aug 88 10:28 EDT
From: "Barron Hulver (216) 775-8290"
Subject: Emulex tape controllers. READ THIS if you use them.
To: info-vax@kl.sri.COM
X-VMS-To: OHSTPY::IN%"info-vax@kl.sri.com",PHULVER

If you use Emulex TC13 tape controllers on a VMS system, READ THIS.

1. According to Emulex (8/19/88) the TC13 tape controller does not work
   under VMS 5.0. They are in the process of writing a device driver
   for this controller. Call Emulex technical support at
   (714) 662-5600 for more information or call an Emulex regional
   sales office. Emulex claims the software will be ready around the
   end of September. (Yes, but will it work?)

2. If you use a TC13 with more than 1 tape drive (daisy-chained), there
   is a timing window problem. If 2 tape jobs are running at the same
   time, it is possible for 1 job to abort the other job.

If anyone is interested in the command files I used to find the timing
window problem I will gladly send them to you.

....Barron Hulver (216) 775-8290
    e-mail postmaster & technical support analyst
    Houck Computing Center
    Oberlin College, Oberlin, OH 44074

    phulver @ oberlin                 (bitnet)
    ocvaxa::phulver                   (ccnet)
    phulver % oberlin @ uk.ac.rl      (janet to bitnet)
    (you can also send mail to -> postmaster/postmast/postmstr)

Here is a memo I sent to our staff about some tape problems we were
having.

----------------------------------------------------------------------

Memo to Oberlin staff:

This is a followup to Aletha's tape problem. I have been able to get a
hard (recurring) error. I am to the point where

- I can submit a tape job that will cause Aletha's tape job to abort
  every time.
- I can submit a tape job such that Aletha's tape job will cause my
  job to abort.

This is a very long memo as I want to describe the testing that was
involved, an explanation of the problem, and finally a recommendation.

Initial testing - Aletha's tape job and a slow tape scan
---------------

I initially ran Aletha's job 3 times and it ran fine every time. I then
traced the operator's console logs and checked the system error logs to
find what was happening on the system when her jobs aborted. An obvious
coincidence was that David Hersey (the student programmer for the
summer) was testing a general purpose program to manipulate tapes
(TCOPY). This program can copy one tape to another or can scan a tape
to print certain attributes.

I then ran Aletha's job (SR_COM:SR2J1350.COM) on VAXC drive MSA0: and
did a slow scan of tape 10188 on VAXC drive MSB0:. Much to my surprise
my tape scan job aborted while Aletha's tape job ran fine.

I inserted the DCL command SHOW TIME after every single command in
Aletha's job (copied over to my account by now) and did the same for my
tape scanning job. In 2 out of 3 tests Aletha's job aborted my slow
scan tape job 52 seconds after Aletha's job started doing a DIRECTORY
of the tape she had just written. In the 3rd test Aletha's job aborted
my slow scan tape job 53 seconds after it started doing a DIRECTORY of
the tape she had just created.

I started to whittle down Aletha's job. I deleted all the commands
except the command to do a DIRECTORY of Aletha's tape (10217). I
resubmitted the new DIRECTORY job and resubmitted my slow scan routine.
The DIRECTORY job aborted my slow scan routine!!

I should say here that my slow scan routine simply reads every block of
data on the tape and reports the number of blocks in each file on the
tape.

Testing to eliminate the tape scan program
------------------------------------------

At this point I replaced my slow scan routine with a standard VMS
utility: DUMP.
I resubmitted the DIRECTORY job and my job to dump all blocks on the
tape. The DIRECTORY job aborted my dump job.

I then resubmitted the 2 jobs, ran ANALYZE/SYSTEM and performed SHOW
DEVICE MSA0 and SHOW DEVICE MSB0. I was able to catch glimpses of the
I/O request queue. I was able to see on MSA0 that the directory command
uses a combination of "read physical block" function calls with "skip
to next tape mark" function calls. Also, since this was a labeled tape,
the magnetic tape Ancillary Control Process (ACP) displayed the current
position of the tape expressed as a number of blocks from the beginning
of the tape.

I had to do some thinking about how a labeled tape is formatted. An ANS
labeled tape is formatted as follows:

    2-4 HDR labels (file name and attributes)
    tape mark
    'n' data blocks for file
    tape mark
    2-4 EOF labels (file name again and attributes)
    tape mark
    ...
    2-4 HDR labels (file name and attributes)
    tape mark
    'n' data blocks for file
    tape mark
    2-4 EOF labels (file name again and attributes)
    tape mark
    tape mark

It appears that the DIRECTORY command algorithm reads each block of the
HDR labels and prints the file name and attributes. The DIRECTORY
command then skips to the next tape mark (skips over the data blocks),
reads the EOF labels, then repeats the process by reading the next set
of HDR labels and printing the next file name. It continues this until
two consecutive tape marks are encountered on the tape.

More testing - using standard VMS tape utilities
------------

At this point I had a hard error. I was wondering if perhaps the
software ACP was the culprit. To test this I replaced Aletha's
DIRECTORY job with a tape job that would simply mount her tape (10217)
as an unlabeled tape, and use the DCL command SET
MAGTAPE/SKIP=BLOCKS:N to skip around on the tape. A second job mounted
tape 10188 as an unlabeled tape and dumped every block on the tape
(read each block of data on the tape).
The job that dumped data would time out when the other job tried to
skip a file containing 2474 blocks or more. The job that dumped data
would work fine when the other job tried to skip a file containing 2068
blocks or less. I did not try to find the exact break in the range
2068-2474.

I then repeated the above tests but used the VMS command SET
MAGTAPE/SKIP=FILES:1 to skip over the files that were above 2474 blocks
or less than 2068 blocks. The results were the same. The job that
dumped data would time out when the other job tried to skip a file that
contained 2474 blocks or more. The job that dumped data would work fine
when the other job tried to skip a file containing 2068 blocks or less.

The tests seemed to indicate that a tape job that was skipping blocks
or files could cause another tape job to abort. Specifically it would
cause the other tape job to time out.

Back to Aletha's tape job
-------------------------

I ran Aletha's tape job and at the same time ran a fast tape scan of a
tape with one large file on it (10188). Aletha's tape job aborted
(timed out).

I reran Aletha's tape job and at the same time ran a slow tape scan of
tape 10188. The slow tape scan job aborted (at the time Aletha's job
was doing a DIRECTORY command).

More testing - 2048-byte blocks at 1600bpi on both VAXC and VAXB
------------

I created two tapes (10029 and 10030) that have 1 file each of
2048-byte blocks of the repeating 2-byte pattern "*V". Both 2400 foot
tapes are full, unlabeled tapes at 1600bpi. I have in a subdirectory of
mine (APS:[PHULVER.TAPERRORS.ERASEC) a set of command procedures to
test these tapes if anyone else is interested.

The set of tests performs a slow scan of tape 10029 on drive msa0: and
skips blocks on tape 10030 on drive msb0:. Here are the results:

VAXC results

$!
$!! set magtape/skip=blocks:5000 tape_drive  ! aborted other tape job
$!! set magtape/skip=blocks:4000 tape_drive  ! aborted other tape job
$!! set magtape/skip=blocks:3000 tape_drive  ! aborted other tape job
$!! set magtape/skip=blocks:2000 tape_drive  ! aborted other tape job
$!! set magtape/skip=blocks:1500 tape_drive  ! aborted other tape job
$!! set magtape/skip=blocks:1000 tape_drive  ! aborted other tape job
$!! set magtape/skip=blocks:700 tape_drive   ! did not abort others
$!! set magtape/skip=blocks:900 tape_drive   ! did not abort others
$!! set magtape/skip=blocks:950 tape_drive   ! aborted other tape job
$!! set magtape/skip=blocks:925 tape_drive   ! aborted other tape job
$!! set magtape/skip=blocks:920 tape_drive   ! aborted other tape job
$!! set magtape/skip=blocks:915 tape_drive   ! did not abort others
$!! set magtape/skip=blocks:910 tape_drive   ! did not abort others
$!! set magtape/skip=blocks:905 tape_drive   ! did not abort others
$!
$! the critical range is 915-920 blocks
$!
$!! set magtape/skip=blocks:919 tape_drive   ! aborted other tape job
$!! set magtape/skip=blocks:918 tape_drive   ! aborted other tape job
$!! set magtape/skip=blocks:917 tape_drive   ! aborted other tape job
$!! set magtape/skip=blocks:916 tape_drive   ! did not abort others
$!
$! CONCLUSION for this test:
$! On VAXC: skipping 916 blocks will not abort other tape jobs
$!          skipping 917 blocks will cause other jobs to timeout
$!

VAXB results

$!
$!
$!! set magtape/skip=blocks:950 tape_drive   ! aborted other tape job
$!! set magtape/skip=blocks:900 tape_drive   ! aborted other tape job
$!! set magtape/skip=blocks:700 tape_drive   ! did not abort others
$!! set magtape/skip=blocks:800 tape_drive   ! did not abort others
$!! set magtape/skip=blocks:850 tape_drive   ! did not abort others
$!! set magtape/skip=blocks:860 tape_drive   ! did not abort others
$!! set magtape/skip=blocks:870 tape_drive   ! did not abort others
$!! set magtape/skip=blocks:880 tape_drive   ! did not abort others
$!! set magtape/skip=blocks:890 tape_drive   ! did not abort others
$!! set magtape/skip=blocks:895 tape_drive   ! aborted other tape job
$!
$!! critical range is 890-895 blocks
$!
$!! set magtape/skip=blocks:895 tape_drive   ! aborted other tape job
$!! set magtape/skip=blocks:894 tape_drive   ! did not abort others
$!
$! CONCLUSION for testing on VAXB:
$! skipping 894 blocks did not abort the other tape job
$! skipping 895 blocks did abort the other tape job
$!

The results show that on VAXC the breakdown occurs when skipping more
than 916 blocks. On VAXB the breakdown occurs when skipping more than
894 blocks. I don't know why this small discrepancy exists; perhaps
because it was a model 891 performing the skipping on VAXB.

Here is our Hardware configuration
----------------------------------

VAXC 8600    TC13 ----------- CIPHER 990 ---------- CIPHER 990
             (MSA:)           (MSA0:)               (MSB0:)
             (MSB:)

VAXB 780     TC13 ----------- CIPHER 990 --------- CIPHER 891
             (MSA:)           (MSA0:)              (MSB0:)
             (MSB:)

This is what the software thinks our hardware configuration is
--------------------------------------------------------------

VAXC 8600    TS11 ---------- CIPHER 990
             (MSA:)          (MSA0:)

             TS11 ---------- CIPHER 990
             (MSB:)          (MSB0:)

VAXB 780     TS11 ---------- CIPHER 990
             (MSA:)          (MSA0:)

             TS11 ---------- CIPHER 891
             (MSB:)          (MSB0:)

A DEC TS11 tape controller supports only one tape drive. An Emulex TC13
tape controller emulates up to 4 DEC TS11 tape controllers.

The manuals
-----------

The Emulex TC13 tape coupler technical manual on page 2-11 has Table
2-3, Formatter to Coupler Signals. The first formatter to coupler
signal listed is FBY. "FBY remains TRUE until the command is completed
or tape motion ceases."

The Cipher 890 manual on page 1-24 has Table I-6, Command Decoding. One
possible command is "File Search Forward." In other words, the
formatter is busy from the time you issue a command to skip to the next
tape mark until that tape mark is found.

Now refer to the Emulex TC13 controller manual page 2-7, Table 2-1,
Pin/Signal Assignments for TC13 Tape Coupler and Tape Transport
Interface.
What this table says is that there is only one FBY signal line on the
cable; there is not one FBY signal for each drive on the cable.
Therefore if one tape drive says it is busy, a command cannot be
started on the other tape drive. This is how the hardware handles it.

Explanation
-----------

Here is why 2 different tape jobs have problems. Assume there is a job
called WRITE_TAPE that writes several files to a tape. Assume there is
another job called DIRECTORY that will show a list of files on a tape
and that the tape only has one file but it fills the entire 2400 foot
reel of tape.

The WRITE_TAPE job starts first, acquires drive MSA0:, and starts
writing blocks of data, one block at a time. The actual VMS code will
look something like this:

    loop:  Is controller MSA: free?  yes, it is.
           Start device MSA0: I/O function and begin countdown
           (device timeout) timer.
           go to loop

In this case the "I/O function" is to write a single block of data.

Now the job DIRECTORY starts and acquires MSB0:. It will read the HDR
labels and print them, skip to the next tape mark, then read the EOF
labels, and loop printing all the file names on the tape.

    loop:  Is controller MSB: free?  yes, it is.
           Start device MSB0: I/O function and begin countdown
           (device timeout) timer.
           go to loop

In this case the "I/O function" is either a read of a single block (HDR
or EOF label) or a skip to the next tape mark (skip over data in a
file).

Here is the sequence of events that creates the error:

    DIRECTORY job    Is controller MSB: free?  yes, it is always free.
                     Start device MSB0: to skip to next tape mark and
                     begin countdown timer.

    WRITE_TAPE job   Is controller MSA: free?  yes, it is always free.
                     Start device MSA0: to write a single block and
                     begin countdown timer.
                     (The FBY line is busy servicing the DIRECTORY
                     job's request, so the TC13 controller just puts
                     the WRITE_TAPE job's request on hold.)

                     check all devices on system, are any countdown
                     timers equal to zero?
                     Yes, the countdown timer for device MSA0: has
                     reached zero. Issue an error message saying the
                     device has timed out. This ultimately aborts the
                     job.

Call to Emulex
--------------

I called Emulex technical support at (714) 662-5600. I talked with Bob
Johnson. I confirmed that the daisy-chain cabling is dedicated to a
single tape drive during a skip tape mark operation. That is, an I/O
operation cannot be started or be in progress on the other tape drive
during this time.

I then asked about skip tape block operation. Bob Johnson said the TC13
tape controller dedicates the cable for the entire amount of time to
skip all requested blocks. The TC13 issues a "space forward 1 record"
and maintains a counter for the number of blocks it is supposed to
skip. In other words the tape drive skips one block at a time and the
controller keeps a count of the number of blocks to skip. The
controller will not let an I/O begin on the other drive until the skip
of all the blocks is complete. "There is only so much code we can put
in a PROM."

Actually, I thought about this over the weekend and decided there is a
better reason. The controller needs to monopolize the cabling when
skipping blocks. Since the controller is counting, it needs to keep
issuing commands to the tape drive to keep the drive streaming. If the
controller allowed I/O to the other drive between single skip block
operations, then there would be excessive repositioning on the skipping
tape drive.

After I inquired about the algorithm the tape controller uses to skip
tape marks and skip blocks, I told Bob Johnson about our particular
problem. He admitted that not many VAX sites use daisy chaining. He
refers to our problem as a "constriction" or a funnel. He describes us
as having a fan-in, pipe, fan-out configuration. That is, 2 tape
controllers (MSA:, MSB:), one data path (the cabling between the tape
controller and the tape drives), and 2 tape drives (MSA0:, MSB0:).

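The interaction described in the Explanation and confirmed by Emulex
can be sketched as a small model. This is only an illustration in
Python, not the VMS driver or the TC13 firmware: the class and method
names, the per-block time of 0.02 seconds, and the 20-second timeout
used in the example are all assumptions made for the sketch. It shows a
skip of N blocks holding the shared cable for N single-record spacing
operations while a request for the other drive waits with its timer
already running.

```python
from dataclasses import dataclass


@dataclass
class SharedCable:
    """Hypothetical model of one FBY line shared by both drives."""

    busy_until: float = 0.0  # simulated time at which the cable is released

    def skip_blocks(self, now: float, count: int, per_block_s: float) -> float:
        """Skip `count` blocks as repeated 'space forward 1 record' steps.

        The cable stays reserved until the counter reaches zero.
        Returns the simulated time at which the whole skip completes.
        """
        t = max(now, self.busy_until)  # wait if the cable is already in use
        remaining = count
        while remaining > 0:
            t += per_block_s  # one single-record spacing operation
            remaining -= 1
        self.busy_until = t
        return t

    def start_io(self, now: float, timeout_s: float) -> str:
        """Request for the other drive, issued by the driver at `now`.

        The driver starts its countdown timer immediately, but the
        request is held until the cable is free. The transfer time of
        the I/O itself is ignored in this sketch.
        """
        wait = max(0.0, self.busy_until - now)
        return "timeout" if wait > timeout_s else "ok"


if __name__ == "__main__":
    # Assumed values: 0.02 s per block and a 20 s device timeout.
    for count in (900, 1100):
        cable = SharedCable()
        cable.skip_blocks(now=0.0, count=count, per_block_s=0.02)
        print(count, cable.start_io(now=0.0, timeout_s=20.0))
```

With these assumed numbers the shorter skip releases the cable before
the other drive's timer expires and the longer one does not, which is
the same shape as the threshold seen in the block-skipping tests. The
real per-block time depends on the drive and the tape format.
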
Call to DEC
-----------

I called DEC Colorado Springs Customer Support Center. I asked how long
the timeout is for the TS11 software device driver. The response was
that for rewinding a tape the timeout is 5 minutes; otherwise the
timeout is 20 seconds. These values may be overridden on the call to
the QIO system service.

Additional comments
-------------------

In this document I have said the aborted tape job error was due to a
"device timeout" error being returned by the system. It is also
possible for the tape job to abort with the error "device not in
configuration or not available." This will happen if the daisy-chain
cabling is busy when a job goes to mount a tape.

Also, note how daisy-chaining can severely limit tape drive
performance. Only one tape drive can be using the cable at any one
time. How in the world is the system supposed to keep both drives
streaming?

Also, I do not wish to suggest this is the answer to all of our tape
problems. This memo just explains the problem that occurred with
Aletha's tape job.

Testing Results of 8/19/88
--------------------------

On Friday, 8/19/88 we moved the TC13 tape controller from OCVAXC to
OCVAXB which gave us the following hardware configuration in OCVAXB.
(We were able to maintain the tape drive names as MSA0: and MSB0:.)

Let me restate the condition that causes a problem. Two tape jobs are
running at the same time. One job is reading or writing blocks of data
(a file) to a tape. The other job is skipping over a large file on a
tape (e.g. doing a DIRECTORY of the tape). The job skipping will cause
the other job to abort with a device timeout condition. The timeout
will occur about 20 seconds after the skipping tape job starts a large
skip.

I ran one single test and the problem did not occur. Specifically, I
had a tape job running on MSA0: that would read every block of data on
a full 2400' tape. I also had a tape job running on MSB0: that would
skip every block of data on a full 2400' tape.
The large skip operation took nearly 5 minutes to complete. Both jobs
ran successfully. This test confirms that an Emulex tape controller can
support only one tape drive.

Testing hardware configuration
------------------------------

VAXB 780     TC13 ---------- CIPHER 990
             (MSA:)          (MSA0:)

             TC13 ---------- CIPHER 990
             (MSB:)          (MSB0:)

Testing software configuration
------------------------------

VAXB 780     TS11 ---------- CIPHER 990
             (MSA:)          (MSA0:)

             TS11 ---------- CIPHER 990
             (MSB:)          (MSB0:)

Recommendation
--------------

My testing and research show that a TC13 tape controller cannot support
more than 1 tape drive at the same time. Our only remedy is to have one
tape controller per tape drive. This means we should purchase 2 more
TC13 tape controllers: one more for VAXB and one more for VAXC.

Possible future problem
-----------------------

Emulex has received reports from its customers that the TC13 tape
controller does not work under VMS 5.0. They are going to investigate
these reports over the next 2-3 weeks to determine if they are true. If
they are true Emulex is going to try using a software device driver
from a previous VMS version (such as 4.7). If a driver from a previous
version does not work then Emulex is faced with writing their own
driver. I will stay in touch with Emulex.

I called Emulex on 8/19/88. They confirmed that the TC13 tape
controller does not work under VMS 5.0. They are in the process of
writing a device driver. They claim the software will be ready in
another 4 weeks. They also said we need to contact the company we
bought the controller from in order to get on the software distribution
list. Is Lowery still in business?

I called the Emulex regional office in Cincinnati (513) 762-7882 and
asked to talk to a sales rep. Bob Shizel was not in so I left a message
to have him call me. Bob Shizel returned my call later in the day. He
also confirmed that they are writing software to handle the TC13
controller.
He said it should be ready by the end of September.

    e-mail postmaster & technical support analyst
    Houck Computing Center
    Oberlin College, Oberlin, OH 44074

    phulver @ oberlin                 (bitnet)
    ocvaxa::phulver                   (ccnet)
    phulver % oberlin @ uk.ac.rl      (janet to bitnet)
    (you can also send mail to -> postmaster/postmast/postmstr)