
The immorality of immortality. — Part 1

In recent weeks, I've been dealing with a process "immortality" issue on a large university medical center's OpenVMS cluster. This particular OpenVMS cluster is running a database product known as Caché. Its implementers, to maintain ACID (atomicity, consistency, isolation, durability) access to the database, chose to make OpenVMS processes running Caché immortal by setting the process status bit known as the "NODELET" (PCB$V_NODELET) bit. Their justification for setting this bit is/was to thwart uncontrolled process termination; however, setting this bit has very serious ramifications that make it a highly undesirable practice.

"Immortality - a fate worse than death." -- Edgar A. Shoaff

So, right about now you are pondering, "What is so undesirable about preventing a third-party process in the cluster from deleting another's process?" Nothing really; what is undesirable are the side effects of process deletion with the "NODELET" bit set. Setting this bit is not something that should be done without good reason. If a process is to remain immortal, there has to be a good reason for it. Simply setting the PCB$V_NODELET bit as a way to stave off attempts to issue $ STOP/IDENTIFICATION={pid} is, IMHO, NOT proper justification. There are several OpenVMS mechanisms which can be employed to keep a process from being indiscriminately deleted from the system; I will expound on this subject in the subsequent parts of this article. Very few processes require this draconian method of OpenVMS process immortality.

If you don't think that you have ever seen such a process on an OpenVMS system, you don't need to look far. Issue $ SHOW SYSTEM and the first process on the list should be the SWAPPER process. The SWAPPER process is an integral part of the OpenVMS operating system; without it, OpenVMS cannot function. This is one of the few processes which, one can reasonably argue, requires the "NODELET" bit. Taking a look at this process with SDA, one can see that it has the "NODELET" bit characteristic:

Process index: 0001 Name: SWAPPER Extended PID: 2FC00401
--------------------------------------------------------------------
Process status: 01850011 RES,PSWAPM,NOSUSPEND,PHDRES,NODELET,DISAWS
status2: 00000000

PCB address 810DB3C8 JIB address 00000000
PHD address 810DAE00 Swapfile disk address 00000000
KTB vector address 810DB6B4 HWPCB address FFFFFFFF.810DAE80
Callback vector address 00000000 Termination mailbox 0000
Master internal PID 00000000 Subprocess count 0
Creator extended PID 00000000 Creator internal PID 00000000
Previous CPU Id 00000000 Current CPU Id 00000000
Previous ASNSEQ 0000000000000007 Previous ASN 00000000000000F3
Initial process priority 16 # open files remaining 0/0
Delete pending count 0 Direct I/O count/limit 6/6
UIC [00001,000004] Buffered I/O count/limit 0/0
Abs time of last event 1A392EB8 BUFIO byte count/limit 0/0
# of threads 1 ASTs remaining 0/0
Swapped copy of LEFC0 00000000 Timer entries remaining 0/0
Swapped copy of LEFC1 00000000 Active page table count 0
Global cluster 2 pointer 00000000 Process WS page count 0

For the SWAPPER process, without which OpenVMS would not function properly, establishing this immortality with the "NODELET" bit is soundly justified.
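To make the "Process status" longword in the SDA display above a little less opaque, here is a minimal Python sketch that decodes 01850011 into flag names. The bit positions are an assumption: they are inferred from the SDA listing itself (six bits set, six names shown, taken in ascending bit order), not copied from the PCB definitions, so check them against your own system's $PCBDEF before relying on them.

```python
# Assumed bit positions in PCB$L_STS, inferred from the SDA display above
# (six set bits in 0x01850011 matched, in ascending order, to the six names
# SDA printed). Verify against $PCBDEF on a real system.
PCB_STS_BITS = {
    0: "RES",
    4: "PSWAPM",
    16: "NOSUSPEND",
    18: "PHDRES",
    23: "NODELET",
    24: "DISAWS",
}

NODELET_MASK = 1 << 23  # assumed position of PCB$V_NODELET


def decode_sts(sts):
    """Return the names of the bits set in a process status longword."""
    names = []
    for bit in range(32):
        if sts & (1 << bit):
            # Bits this sketch does not know about are reported by number.
            names.append(PCB_STS_BITS.get(bit, "BIT%d" % bit))
    return names


def is_immortal(sts):
    """True if the (assumed) NODELET bit is set in the status longword."""
    return bool(sts & NODELET_MASK)


swapper_sts = 0x01850011  # value shown by SDA for SWAPPER
print(",".join(decode_sts(swapper_sts)))
print("immortal:", is_immortal(swapper_sts))
```

Fed the SWAPPER's status value, the sketch reproduces the flag list that SDA prints, NODELET included.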

If you look through the output of the $ SHOW SYSTEM command, you can see other processes which warrant the use of the "NODELET" bit. ACPs (ancillary control processes), for example, typically have the "NODELET" bit set on them. These are processes which work in conjunction with device drivers on the system. If an ACP were to disappear, chances are that the system would go to hell in a handbasket rather quickly, as I/Os needing to be handled by the driver would no longer be able to complete. OpenVMS also comes with several "server" processes which, if they suddenly disappeared, would wreak havoc on system operations.

However, no matter how many processes you check, you will never see the "NODELET" bit set on a normal OpenVMS INTERACTIVE process; at least, not one that has had it set by OpenVMS. INTERACTIVE processes are considered to be ephemeral. A user logs in, performs some task or tasks, and then logs out. BATCH jobs, PRINT jobs, NETWORK tasks, and INTERACTIVE logins are all considered transient in contrast to the life of the system. Granting immortality to such processes is like Mr. Cadwallader granting immortality to Walter Bedeker in the Twilight Zone episode "Escape Clause", only there is no escape clause for the OpenVMS processes. Walter Bedeker was able to terminate his eternity of imprisonment, whereas an OpenVMS process granted immortality cannot. Let's have a look.

When a process terminates, OpenVMS goes through a number of steps to run down activated images, free resources, and eventually terminate the process. The crux of the final processing in process termination takes place in the SYS$EXIT code. However, when a process is declared immortal, OpenVMS is presented with a quandary: how should it handle the apparent death of an immortal process? Apparently, this was just as much a quandary for the OpenVMS developers. There are some reasons why one would not want a process that has been declared immortal to disappear from the system, even after it has reached the final phases of process termination.

The solution implemented by OpenVMS engineering to this immortal process quandary was to ensure that the process was, indeed, immortal. When and if process termination gets to its final stage, the process's PCB$L_STS field is checked for the PCB$V_NODELET bit. If this bit is set, indicating an immortal process, OpenVMS forgoes the final steps of process termination and, instead, leaves the process in a state of suspended animation in an infinite loop.

$DELPRC_S ; DELETE SELF
PUSHL R0 ; SAVE ANY ERROR RETURNED
$SETPRI_S PRI=#0 ; MAKE NEXT LOOP HARMLESS
POPL R0 ; RESTORE THE ERROR FROM DELPRC_S
20$: BRB 20$ ; ****** FELL THROUGH DELPRC SOMEHOW

The code above, written in Macro-32, implements this state of suspended animation; the final instruction of the sequence is the loop itself. It amounts to a single-instruction infinite loop on all three OpenVMS target architectures: VAX: 20$: BRB 20$; Alpha: 20$: BR 20$; and Integrity: 20$: br.sptk.many 20$.
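For readers who don't speak Macro-32, the behavior described above can be modeled in a few lines of Python. This is a toy model only, not OpenVMS code: the Process class, the state names, the default priority of 4, and the bit position used for NODELET are all assumptions made for illustration, and the "spinning" state stands in for a loop that, on a real system, never returns.

```python
from dataclasses import dataclass

NODELET = 1 << 23  # assumed position of PCB$V_NODELET in PCB$L_STS


@dataclass
class Process:
    """Hypothetical stand-in for the few PCB fields this sketch cares about."""
    name: str
    sts: int
    priority: int = 4
    state: str = "running"


def final_exit_stage(proc):
    """Toy model of the last step of process termination.

    Mirrors the Macro-32 fragment: try to delete self; if that
    falls through, drop to priority 0 and spin forever.
    """
    if not proc.sts & NODELET:
        # $DELPRC_S succeeds: the process is gone.
        proc.state = "deleted"
        return proc.state
    # $DELPRC_S "fell through somehow": the process may not be deleted.
    proc.priority = 0        # $SETPRI_S PRI=#0, make the loop harmless
    proc.state = "spinning"  # 20$: BRB 20$ (modeled as a state, not a real loop)
    return proc.state


mortal = Process("USER_1", sts=0x00000011)
immortal = Process("CACHE_1", sts=0x00000011 | NODELET)

print(mortal.name, final_exit_stage(mortal))
print(immortal.name, final_exit_stage(immortal), "at priority", immortal.priority)
```

The point of the model is the asymmetry: an ordinary process simply disappears, while a process with the bit set is left behind at priority 0 with nothing to do and no way out.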

Why have I spent this time discussing all of this? It's because Caché insists on making INTERACTIVE processes immortal. Unfortunately, there is quite a well-documented history of this practice backfiring when processes running Caché terminate without running through the Caché exit procedure established to clear this bit. The medical center which has been running Caché recently upgraded their cluster to new HP Integrity Blades. They also installed ProvN's AUDIT terminal auditing product, which is how I became involved with them, since I provide ProvN with maintenance and development of their AUDIT product.


"Immorality is not to do what one has to do when one has to do it." -- Jean Anouilh

The medical center had been running Caché for years on Alphas. They also ran ProvN AUDIT. At HP's suggestion, they abandoned ProvN AUDIT when they upgraded to the Integrity Blades. However, they soon discovered that the competitor's terminal auditing product, which was recommended by HP, wasn't ready for prime-time on Integrity. After numerous system crashes provoked by the competitor's product, this medical center came back into the fold to install and run ProvN AUDIT.

The medical center also began using a web based interface to the system on these new Integrity Blades. Soon after the installation of and upgrade to these Integrity Blades, they began to find INTERACTIVE Caché processes stuck in the infamous immortality quandary loop. The medical center contacted their respective support organizations. Despite well documented cases to the contrary, the support organizations of HP and Caché realized that this medical center's four new HP Integrity Blade systems represented significant monies in sales and licenses. They chose a scapegoat; a sacrificial lamb which would suffice as the target of blame for these looping processes. They, of course, chose to point the finger at ProvN AUDIT.

This finger pointing was not just a passing comment that, perhaps, the ProvN AUDIT software was causing these processes to loop. Oh no, a staunch, accusatory and conspiratorial camarilla formed between these factions to lay blame on an innocent product and vendor. Backed with bogus claims and substantiated with unsupportive evidence, this cabal exhaustively wasted three months of my time on an issue that was never one that I should have been a party to in the first place.

HP loaded a PC (program counter) tracing extension in SDA and logged where the program counter was within the looping process. Anybody with some rudimentary experience with this sort of thing would or should have known where one would find the process's PC, and an HP CSC support person should have known without having to give it a second thought. Yet, I was passed in excess of 800 pages of a Micro$oft W(ie)RD document containing the SDA log of this looping process.

Timestamp CPU PC IPL M Pid Routine
--------------- --- ----------------- --- - -------- --------------------
08:02:30.771943 01 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
08:02:30.761731 00 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
08:02:30.749775 02 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
08:02:30.738775 02 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
08:02:30.727775 02 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
08:02:30.717731 00 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
08:02:30.705943 01 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0

Yes, you've guessed it. It was looping on the 20$: BRB 20$ in the routine EXE$EXIT_INT in the SYS$EXIT handler code (it is the code that is posted in the prior segment above). A brief note accompanying this voluminous, Micro$oft-bloated, email said that this tracing clearly showed that the process was looping in code in ProvN AUDIT! In the back of my mind, I could hear echoes of Wallace Shawn's lisping character Vizzini from the film The Princess Bride exclaiming, "Incontheivable!" as I read this.
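Eight hundred pages of identical samples reduce to one line. As a minimal sketch, the following Python tallies a PC trace by (PID, PC, routine); the seven sample lines are the ones quoted above, and the column layout is assumed to be exactly what that excerpt shows (seven whitespace-separated fields, timestamp first). The function name is my own invention for this illustration.

```python
from collections import Counter

# The seven samples quoted above, header and separator included.
SAMPLE = """\
Timestamp CPU PC IPL M Pid Routine
--------------- --- ----------------- --- - -------- --------------------
08:02:30.771943 01 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
08:02:30.761731 00 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
08:02:30.749775 02 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
08:02:30.738775 02 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
08:02:30.727775 02 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
08:02:30.717731 00 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
08:02:30.705943 01 FFFFFFFF.80D64740 0 K 202319C5 EXE$EXIT_INT_C+003C0
"""


def summarize_pc_trace(trace):
    """Count samples per (pid, pc, routine) in a PC trace listing."""
    counts = Counter()
    for line in trace.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and anything malformed:
        # a sample line has seven fields and starts with hh:mm:ss.ffffff.
        if len(fields) != 7 or ":" not in fields[0]:
            continue
        _timestamp, _cpu, pc, _ipl, _mode, pid, routine = fields
        counts[(pid, pc, routine)] += 1
    return counts


summary = summarize_pc_trace(SAMPLE)
for (pid, pc, routine), n in summary.most_common():
    print(pid, pc, routine, n)
if len(summary) == 1:
    print("every sample is at the same PC: a single-instruction loop")
```

For the excerpt above there is exactly one distinct entry: every sample for PID 202319C5 sits at the same address in EXE$EXIT_INT_C, regardless of which CPU took the sample.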

Caché's support group had also convinced their customer, this medical center, that there were two distinctly different looping scenarios, which they referred to as loose loop and tight loop. In my humble opinion, they're Fruit Loopy! There is only one looping scenario and that is the infinite loop, and it's a very tight infinite loop.

A forced crash dump was provided which showed one of these Caché running processes in a loop. Analysis of that crash dump not only showed that ProvN AUDIT wasn't loaded but that ProvN AUDIT had never been loaded since the system had been booted. The specious claims that ProvN AUDIT was at the root of these looping processes were now beginning to raise suspicions that there was more to this than a simple possibility of culpability. This was an all out blame game.

Even though it was shown that ProvN AUDIT wasn't loaded at the time of the forced crash, with the Caché process looping in the immortality conundrum, there was still insistence that it was ProvN AUDIT causing the problem. Thus began the saga of getting secure VPN access to this medical center's systems to allow me a firsthand look-see at one of these "loopy" processes. Eventually, VPN access was effected. ProvN AUDIT was loaded on one of the medical center's production cluster machines and I awaited notice from them that there was a process stuck in this immortality loop.

One of the things that struck at the core of their specious claims was the fact that not a single one of the processes found looping was a candidate for auditing by ProvN AUDIT. There is simply no correlation between ProvN AUDIT and a process that is not a candidate for auditing. ProvN AUDIT works by injecting itself in between the terminal port and terminal class drivers using a patented process. This affects only the terminal device associated with the process for INTERACTIVE I/O. If the process's terminal has not been intercepted, there is no contribution whatsoever from ProvN AUDIT in the context of the process.

So, what this all boils down to:
  • Caché sets the PCB$V_NODELET bit on INTERACTIVE processes;
  • Caché processes routinely end in the dreaded immortality loop;
  • ProvN AUDIT was not involved, and this was proven twice;
  • HP's CSC support contact was either a dupe or a dope;
  • the medical center is being played for a bunch of rubes.
The problem lies with Caché, who are not taking responsibility for correcting this issue but are, instead, trying to focus the blame elsewhere. Immorality IS not to do what one has to do when one has to do it.

