From:   SMTP%"raspuzzi@mrlat.enet.dec.com" 28-OCT-1994 09:36:48.80
To:     EVERHART
CC:
Subj:   Re: Any more recent info/articles on LAT rating algorithm?

X-Newsgroups: comp.os.vms
Subject: Re: Any more recent info/articles on LAT rating algorithm?
Message-Id: <38f07n$gma@mrnews.mro.dec.com>
From: raspuzzi@mrlat.enet.dec.com (Michael D. Raspuzzi)
Date: 23 OCT 94 20:43:44
Organization: Digital Equipment Corporation
Nntp-Posting-Host: mrlat.enet.dec.com
Lines: 316
To: Info-VAX@Mvb.Saic.Com
X-Gateway-Source-Info: USENET

In article <1994Oct20.024416.29188@ultb.isc.rit.edu>,
dsf5454@ultb.isc.rit.edu (D.S. Foster ) writes...
>Howdy...
>   There's a file obtainable via anon FTP at:
>
>ACFCLUSTER.NYU.EDU, [VMS]LAT.EXPLAINATION
>
>That has a post from Nick Smith to comp.os.vms circa 1990 that had interesting
>info on the LAT rating algorithm.  I'm aware that the algorithm has most
>likely changed since then... so I'm curious if anyone had any info or
>pointers where any could be found for that particular subject that was
>done after 1990?
>
>-Dan
>Internet: dsf5454@rit.edu

The rating in use in 1990 (~VMS V5.5 release) is still the same one today.
Below is a lengthy explanation of some rating anomalies that have been seen
from time to time. I hope that the explanations make some sort of sense. It
is 200+ lines.

I should also mention that we are considering supplying a default rating
algorithm that can be easily modified at sites (that is, you can write your
own LAT rating algorithm and plug it into LTDRIVER). This is in the works
for a future release of OpenVMS (both VAX and AXP). At this time, I have no
release information so I don't know when it will be available. The plan
would be to ship the sample rating algorithm (written in C and a little
MACRO) in SYS$EXAMPLES.

--
Mike Raspuzzi (raspuzzi@mrlat.enet.dec.com)
Digital Equipment Corporation

-----------------------LAT Rating Info-----------------------

We get a lot of questions about the LAT rating algorithm.
This text attempts to explain some scenarios in which the LAT rating
"doesn't feel right" or doesn't look right, usually because it is not clear
exactly what the rating algorithm is doing.

First, let's dissect the rating algorithm so that its makeup can be
understood. The LAT rating algorithm is as follows:

!
! The LAT rating is calculated using system load average and number of
! free job slots:
!
!          20 * (IJOBLIM - IJOBCNT)   min(235,IJOBLIM) * 100
! RATING = ------------------------ + ----------------------
!                  IJOBLIM            (100 + GHB$L_LOADAVG)
!
! If IJOBLIM = 0 or IJOBLIM = IJOBCNT then rating is set to 0.
!
! GHB$L_LOADAVG is a quantity that equals 100 if there is an average of
! 1 job waiting in a computable queue (200 if 2 jobs, etc.) and is
! a moving average taken every 5 seconds. Note, only those processes
! whose priority is DEFPRI or higher are included in the load average.
!
! The first term in the above formula is dubbed the "AVAILABILITY" term,
! and represents the fraction of free job slots available. The second
! term is known as the "LOAD" term, and varies according to system load.
!
! The factor "min(235,IJOBLIM)" may be changed by the LATCP command
! "SET NODE/CPU_RATING=nnn". The quantity "nnn" represents the system
! manager's estimate of relative CPU power. Its value can range
! from 1 (for a low power CPU) to 100 (for the highest power CPU which
! offers the given LAT service). This range is scaled up by LATCP
! to produce a factor which ranges from 1 to 235 in order to fit in
! the above formula.
!
! In addition to the above terms, there may be a term which represents
! a penalty of up to 40 rating points if the system is low on free
! memory. This term is:
!
!                (FREEGOAL + 2048 - FREECNT)
! PENALTY = 40 * ---------------------------
!                     (FREEGOAL + 2048)
!
! This term is SUBTRACTED from the rating ONLY IF it is positive. If
! FREECNT is high enough, this term is not used.
!
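
The two main terms above can be sketched in a few lines of Python. This is
only an illustration of the formula as written, not LTDRIVER's code: the
function name lat_rating and its cpu_factor argument are made up for the
example, the memory penalty is left out, and integer (truncating) division
is an assumption chosen because it matches the whole-number ratings quoted
in this text.

```python
def lat_rating(ijoblim, ijobcnt, loadavg, cpu_factor=None):
    """Base LAT rating: AVAILABILITY term plus LOAD term (no memory penalty).

    ijoblim    -- interactive job limit (IJOBLIM)
    ijobcnt    -- interactive jobs currently logged in (IJOBCNT)
    loadavg    -- GHB$L_LOADAVG (100 per job waiting in a computable queue)
    cpu_factor -- hypothetical stand-in for the already scaled (1..235)
                  factor that SET NODE/CPU_RATING would supply
    """
    # The rating is forced to 0 when there are no job slots or none are free.
    if ijoblim == 0 or ijoblim == ijobcnt:
        return 0
    # Without a CPU_RATING override the LOAD factor is min(235, IJOBLIM).
    factor = min(235, ijoblim) if cpu_factor is None else cpu_factor
    # Integer division is an assumption of this sketch.
    availability = 20 * (ijoblim - ijobcnt) // ijoblim
    load = factor * 100 // (100 + loadavg)
    return availability + load


# Idle node, job limit 64, nobody logged in: 20 + 64
print(lat_rating(64, 0, 0))    # 84
# Job limit reached: rating is unconditionally 0
print(lat_rating(64, 64, 0))   # 0
```

Passing cpu_factor=235 shows the ceiling of the whole formula: with no one
logged in and a load average of 0 the result is 20 + 235 = 255.
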
In its simplest form, the LAT rating is simply:

    RATING = AVAILABILITY + SYSTEM LOAD

Since the rating can only have a maximum value of 255, LTDRIVER must take
care that AVAILABILITY + SYSTEM LOAD does not exceed 255. To do this,
LTDRIVER only lets the AVAILABILITY term have a maximum value of 20 and the
SYSTEM LOAD term a maximum value of 235. So, if both were at their maximum
values, the LAT rating would be 20 + 235 or 255.

To start with, let's look at the first term. It is dubbed the
"AVAILABILITY" term.

[MYTH] If the login limit on a system is set to 100 and 50 people are
       logged in, I would expect the LAT rating to be half way to 0.

[FACT] This is not entirely true. Remember, the AVAILABILITY term is only
       worth 20 points of the rating. So, if 50 people out of 100 are
       logged in, then the AVAILABILITY is calculated accordingly:

           AVAILABILITY = 20 * (50 / 100) = 10

Let's assume the SYSTEM LOAD term stays at a constant 100. As people log
in, the rating will drop more slowly than one would expect. In the simplest
case, let's say that the interactive login limit for a node is 20. With a
SYSTEM LOAD of 100, the rating will be (with no users logged in):

    RATING = 20 * (20 / 20) + 100 = 120

Assume the system load stays constant as users begin to log in. Every time
a user logs in, the rating will only drop 1 point. So, with 10 users logged
in, the rating is:

    RATING = 20 * (10 / 20) + 100 = 110

It is a common misconception that the LAT rating drops drastically as
people log in. This is not true if the system load remains the same. The
only exception is when the interactive login limit is reached: at that
point the rating is unconditionally set to 0.

Now, let's look at the load term. It is the most important term since it
makes up 235 / 255 or about 92% of the entire LAT rating. The first thing
to notice is that the system load term can only have a maximum value of
235. It *could* be smaller.
For example, a node that has an interactive job limit of 64 will not use
235 as its upper limit. It uses 64 as the upper limit of the SYSTEM LOAD.
This follows from the MIN(235,IJOBLIM) factor in the SYSTEM LOAD term: if
IJOBLIM is below 235, then the value of IJOBLIM is used.

Keeping this in mind, what is the absolute highest RATING a system with an
interactive job limit (IJOBLIM) of 64 can have? First, you know that 0
users have to be logged in to get the maximum out of the availability term.
Next, you know that the load average has to be at its lowest value (0),
indicating the system is completely idle. Using those values (no one logged
in and a load average of 0), the highest rating a system with an
interactive job limit of 64 can have is:

    RATING = 20 * (64 / 64) + (MIN(235,64) * 100) / (100 + 0)

    RATING =       20       +       (64 * 100) / 100
              AVAILABILITY            SYSTEM LOAD

As you can see, the maximum rating is 84 (20 from the AVAILABILITY and 64
from the SYSTEM LOAD). If a system has an interactive job limit of 64 and
the LAT rating is higher than 84, then something odd is going on (either an
incorrect calculation in LTDRIVER or someone is changing dynamic system
parameters - IJOBLIM - on the fly). Chances are, someone changed IJOBLIM.

[MYTH] A LAT rating of 0 means no one can connect to that node.

[FACT] A LAT rating of 0 simply means that the system is "not very
       available" or overloaded. It has nothing to do with whether or not
       the system can be connected to. A system with a LAT rating of 0 can
       still be connected to but one may not be able to log in.

Let's continue to look at the SYSTEM LOAD term. Note that the load average
is calculated to be 100 if 1 job is computable, 200 if 2 jobs are
computable, etc. Computable here means a process that is either on a
COM/COMO queue or in one of the following states:

    MWAIT, COLPG, FPG, PFW

THIS DOES NOT TAKE INTO ACCOUNT ANY PROCESS THAT IS CURRENTLY RUNNING ON A
CPU. This is a common misconception about the LAT rating.
It is assumed that because the CPU is 100% busy, the LAT rating must be
very low because the system is loaded. THIS MAY NOT BE THE CASE!!!

[MYTH] Monitor or other system performance tools show the CPU to be 100%
       busy and several processes are contending for the CPU. The load
       rating should therefore be very low.

[FACT] While the above is a reasonable general statement, it does NOT apply
       to all situations. Why? Remember that the only processes counted in
       the system load average are those in the COM queues AND THE PRIORITY
       MUST BE AT OR ABOVE THE SYSGEN PARAMETER DEFPRI. For example, a
       system with 20 batch jobs running at priority 3 (with DEFPRI set to
       4) may have the same LAT rating as a system that is totally idle.

       Moreover, when a process is CUR on a CPU, it is not in any of the
       COMputable states that LAT uses for its rating. Therefore, a CUR
       process has ABSOLUTELY NO BEARING ON THE LAT RATING ALGORITHM LOAD
       AVERAGE. When several processes begin contending for the CPU, some
       of them will be in the COMputable state and that is when the system
       load average begins to fluctuate. A single compute-intensive process
       using 100% of the CPU is not causing CPU contention and therefore
       may not affect the LAT rating.

Keeping in mind that CUR processes are not counted as part of the LAT
rating algorithm, let's take a look at how another popular misconception
can lead to confusion.

Let's assume there is a 2 node cluster. The first node (NODE1) is a VAX
6000 model 600 with 6 processors. The second node (NODE2) is a VAX 6000
model 400 with 1 processor. Both systems have 128MB of memory and for now,
assume that the memory penalty is not a factor in the rating calculation.
Next, assume both systems are configured identically - SYSGEN parameters,
common UAF file, print/batch queues and interactive job limits. For the
purposes of simplicity, let's assume both systems have an interactive job
limit of 100. That means that the maximum SYSTEM LOAD either system can
have is 100 (remember MIN(235,100)). Also, the maximum AVAILABILITY is 20
so the absolute highest rating either system can have in this
configuration is 120.

Now, let's use some numbers in a simple load calculation (from this point
on, assume all processes used in SYSTEM LOAD calculations are running at
DEFPRI or higher).

    On NODE1 - 75 users are logged in.
    On NODE2 - 10 users are logged in.

On NODE1, 5 processes are compute intensive and on NODE2 only 2 processes
are compute intensive. What will the ratings look like?

First, let's look at NODE1. Five processes are compute intensive. Because
NODE1 has 6 CPUs in it, chances are, the SYSTEM LOAD will remain at its
highest value (100). But wait, that doesn't make sense you say! However, if
you think carefully, it does. Remember, processes in the CUR state are not
counted in the SYSTEM LOAD. Since there are 5 compute intensive processes,
chances are, each one is CUR on a processor in the VAX 6660 (and there is
even 1 processor to spare). So, the rating might look like this:

    RATING = 20 * (25 / 100) + 100

Even though 5 processes are running all out, it is possible that they are
not affecting the system load average at all. The above shows a RATING of
105.

Now let's look at NODE2. There are 2 processes that are compute intensive
but the system only has 1 processor. So, chances are, the load average for
this system is going to be 100 (because 1 process will be CUR while the
other process is COMputable - waiting in a compute queue). In this node's
case, the rating takes a little more math:

    RATING = 20 * (90 / 100) + ((100 * 100) / (100 + 100))

    RATING = 18 + 50

As you can see, the LAT rating for NODE2 is only 68 yet it only has 2
compute intensive processes and only 10 users logged in. This may seem to
defy logic but it shows how COMputable processes can affect the LAT rating
algorithm.
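
The CUR-versus-COMputable reasoning in the NODE1/NODE2 story can be put
into a small sketch. The helper names are invented for this example, the
"one waiting process per compute-bound job beyond the CPU count" estimate
is a best-case simplification of the "chances are" argument (the real load
average is a moving average and will fluctuate), and the factor of 100 is
just an illustrative value for MIN(235,IJOBLIM).

```python
def estimated_loadavg(compute_bound, cpus):
    """Rough load average: 100 per compute-bound process left without a CPU.

    Processes that are CUR on a processor are not counted, so only the
    overflow beyond the number of CPUs waits in a COMputable state.
    """
    waiting = max(0, compute_bound - cpus)
    return 100 * waiting


def load_term(factor, loadavg):
    """SYSTEM LOAD term: factor * 100 / (100 + load average)."""
    return factor * 100 // (100 + loadavg)


FACTOR = 100  # illustrative value, not taken from a real system

# NODE1: 5 compute-bound processes on 6 CPUs, nobody has to wait.
node1 = load_term(FACTOR, estimated_loadavg(5, 6))
# NODE2: 2 compute-bound processes on 1 CPU, one is always waiting.
node2 = load_term(FACTOR, estimated_loadavg(2, 1))

print(node1, node2)   # 100 50
```

One waiting process is enough to cut the SYSTEM LOAD term in half, while
five busy processes that each find a free CPU leave it untouched.
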
NODE1, by contrast, has plenty of compute horsepower to spare (even though
75% of its user capacity has been reached).

MORAL of this story: BE CAREFUL WITH SMP SYSTEMS!!!! The more CPUs in a
system, the higher the chance that the system has a lighter load average.

This story also demonstrates another common problem. A VAX 6000-600 with 6
CPUs is a pretty powerful machine. Yet setting its interactive login limit
to 100 (the same as the VAX 6000-400 with 1 processor) effectively makes
the machines "equal" as far as the rating is concerned. Make sure IJOBLIM
is set appropriately for each system in a VMScluster.

The story also demonstrates how COMputable processes (i.e. processes
waiting to use CPU resources) will drop the LAT rating quickly. The load
average that LAT uses does not necessarily track the system load reported
by other tools.

Let's take another example where the same 2 node cluster exists, but this
time, both systems are completely identical. System number 1 (NODE1) is a
VAX 7000 model 600 with 1 processor. System number 2 (NODE2) is also a VAX
7000 model 600 with 1 processor. For simplicity, assume the following:

    NODE1 - 90 users logged in
          - 1 process is 50% computable

    NODE2 - 20 users logged in
          - 1 process is compute intensive

On NODE1, the load average will be about 50 (because the 1 process is 50%
computable - if it were always computable, the load average would be 100).
The LAT rating looks something like this:

    RATING = 20 * (10 / 100) + ((100 * 100) / (100 + 50))

    RATING = 2 + 66

The rating for NODE1 is 68.

On NODE2, the load average will be about 100 because 1 process is
computable all the time. LAT rating:

    RATING = 20 * (80 / 100) + ((100 * 100) / (100 + 100))

    RATING = 16 + 50

The LAT rating for NODE2 is 66! Even though there are 70 fewer users logged
into NODE2, the LAT rating is slightly lower because the system load is
higher. The system load is the dominating factor for the LAT rating.

[MYTH] The number of users logged into a system should control the LAT
       rating.

[FACT] As you can see from the above examples, the dominating portion of
       the LAT rating is the load average - not the number of users logged
       in.

The memory penalty is simply a deduction from the LAT rating when a system
is running low on free physical memory for processes to use.

Note, if the LAT rating algorithm is not suitable for a site, then it is
possible to write a program (based on CSC's DYNRAT program) to calculate
the LAT rating and set a static rating based on that calculation. The LAT
rating is not perfect but works well in most instances.
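
None of the examples above exercise the memory penalty, so here is a small
sketch of that term as it is given in the formula at the top. The function
name is made up, the FREEGOAL and FREECNT values are invented purely to
make the arithmetic easy to follow, and integer division is an assumption
of the sketch.

```python
def memory_penalty(freegoal, freecnt):
    """Points to subtract from the LAT rating when free memory is low.

    PENALTY = 40 * (FREEGOAL + 2048 - FREECNT) / (FREEGOAL + 2048)
    The term is only applied when it is positive.
    """
    threshold = freegoal + 2048
    penalty = 40 * (threshold - freecnt) // threshold
    # A zero or negative result means there is enough free memory: no penalty.
    return penalty if penalty > 0 else 0


FREEGOAL = 1024  # made-up value for illustration only

print(memory_penalty(FREEGOAL, 4000))  # 0   plenty of free memory
print(memory_penalty(FREEGOAL, 1536))  # 20  free count at half of FREEGOAL + 2048
print(memory_penalty(FREEGOAL, 0))     # 40  no free memory, full penalty
```

So a node whose free count has fallen to half of FREEGOAL + 2048 would lose
20 points from whatever AVAILABILITY + SYSTEM LOAD works out to.
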