Use the Dump Cluster Trace (DMPCLUTRC) command to help determine and resolve problems with a cluster.
The Dump Cluster Trace (DMPCLUTRC) command can help you determine whether a cluster job has completed or what the job is currently processing. The command dumps cluster-related trace and debug information to a file. The information is dumped locally on one or more cluster nodes. The command can be used to dump one cluster resource services (CRS) job or all of them. Each CRS job that is dumped has a member in the file; the name of the member is the name of the CRS job. Clustering must be active for the command to produce output, and only nodes that have an active CRS job produce output. The dumped information originates from the user trace, and additional information is taken from cluster objects. The dump level determines how much information is sent to the file. The dump levels are basic, error, informational, and verbose. In most cases, an IBM® service representative will tell you which level to specify based on your needs; however, LEVEL(*ERROR) is sufficient for most troubleshooting scenarios. If you have a question about which level is appropriate for your situation, contact an IBM service representative.
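For example, to dump error-level information for a single CRG job and then display the resulting file member, you might use commands similar to the following (the cluster, CRG, and file names are illustrative only):

DMPCLUTRC CLUSTER(MYCLUSTER) CRG(MYCRG) LEVEL(*ERROR) FILE(MYFILE)
DSPPFM FILE(MYFILE) MBR(MYCRG)

The Display Physical File Member (DSPPFM) command shows the dump for the MYCRG job because the member name matches the name of the dumped job.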
You can analyze the trace results to understand what clustering is doing, such as which cluster job is causing a protocol to wait. The output that comes from the user trace contains a separator line, which is a series of equal signs (=). The number of separator lines in the file depends on how many times DMPCLUTRC was issued; the results of multiple calls of DMPCLUTRC can appear in the same file. The last set of stack dumps has the most current information. In some cases, a CRG job can have two groups; each group has a separate dump section in the file.
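If a command appears to be hung, one useful approach is to issue DMPCLUTRC to the same file again after waiting a short time and then compare the last two sets of stack dumps; if the stacks have not changed, the job is most likely waiting rather than making progress. For example, using the same illustrative names as the scenario that follows:

DMPCLUTRC CLUSTER(MYCLUSTER) CRG(MYCRG) LEVEL(*ERROR) FILE(MYFILE)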
In the following example of Dump Cluster Trace results, there are two nodes (SYSTEM1 and SYSTEM2) in a cluster named MYCLUSTER. The cluster has one CRG, named MYCRG, and both nodes are in its recovery domain. The user issued the Start Cluster Resource Group (STRCRG) CL command and the command is taking a long time to return results. From a different workstation, the user entered DMPCLUTRC CLUSTER(MYCLUSTER) CRG(*ALL) LEVEL(*ERROR) FILE(MYFILE) on a command line.
User Trace Dump for job 073586/QSYS/MYCRG. Size: 300K, Wrapped 0 times.
--- 08/22/2005 16:43:32 ---
(1a) 00000006:658536 Main thread handle 2
(1b) 00000008:748016 Work thread 1 handle 13
(1b) 00000007:754576 Work thread 2 handle 11
--- 08/22/2005 16:46:04 ---
00000008:269608 CSTDAMBR 1115: WaitForMsg 4 1005 CPFBB3C
--- 08/22/2005 16:48:17 ---
00000006:925112 (1c) DMPCLUTRC Node SYSTEM1 Group MYCRG
===================================================
The first section contains the thread numbers and handles for the cluster job. Cluster jobs can have two or more threads. In this example, there is one main thread (1a), which is where all the work arrives, and two work threads (1b). This section also identifies the system the trace came from and the cluster job to which it pertains (1c).
00000006:925168 Stack Dump For Target Thread: Handle 2 (0x00000002)
00000006:925192 Stack:  (2aa) Main Thread Stack MYCRG
00000006:925256 Stack:  Library / Program     Module      Stmt   Procedure
00000006:933432 Stack:  QSYS    / QCSTCRGJOB  CSTCRGJOB     0  : _CXX_PEP__Fv
00000006:933488 Stack:  QSYS    / QCSTCRGJOB  CSTCRGJOB    46  : main
00000006:933536 Stack:  QSYS    / QCSTCRGJOB  CSTCRGJOB    65  : completeStartup__FP8CstDAMbr
00000006:933584 Stack:  QSYS    / QCSTCRGJOB  CSTCRGJOB    26  : mainQueueProcessLoop__FP8CstDAMbr
00000006:933616 Stack:  QSYS    / QCSTCMN     CSTDAMBR     57  : processQueueMsgs__8CstDAMbrFQ2_8CstDAMbr13CstQueueIndex
00000006:933664 Stack:  QSYS    / QCSTCMN     CSTDAMBR     53  : processMsg__8CstDAMbrFP6CstMsg
00000006:933712 Stack:  QSYS    / QCSTCMN     CSTDAMBR     17  : callFnPtr__8CstDAMbrFPQ2_8CstDAMbr19MsgFunctionPtrEntryP6
00000006:933744 Stack:  QSYS    / QCSTCRGJOB  CSTCRGJOB    94  : crgDump__FP6CstMsgP8CstDAMbr
00000006:933792 Stack:  QSYS    / QCSTCMN     CSTACK       95  : CstAckQueryMsg
00000006:933832 Stack:  QSYS    / QP0ZCPA     QP0ZUDBG      3  : Qp0zDumpTargetStack
00000006:933864 Stack:  QSYS    / QP0ZSCPA    QP0ZSDBG     12  : Qp0zSUDumpTargetStack
00000006:934016 Stack:  Exception In Stack Dump Code
00000006:934040 Stack:  thread is likely terminated or no longer running the same code as the captured stack
00000006:934080 (2a) Work Thread Index 1 Group MYCRG Last or current values
(2e) 00000006:934112 Request handle 8E3E1002 EE3218A1 824F0004 AC000456
(2c) 00000006:934136 SPI name QcstStartClusterResourceGroup
00000006:934160 (2g) POF 10, Completed ack round 1 (2i)
00000006:934176 (2o) In waitForJobEnd QDFTJOBD MYCLUSTER 073590
00000006:934216 Node      Ack Status   POF (2bb) Nack Msg Id
00000006:934240 (2n) SYSTEM1 (2cc) Ready
00000006:934272 SYSTEM2   Ack          10 (2k)
00000006:934296 Messages
00000006:934320 Stack Dump For Target Thread: Handle 13 (0x0000000d)
00000006:934344 Stack:  Work Thread 1 Stack MYCRG
00000006:934792 Stack:  Library / Program     Module      Stmt   Procedure
00000006:934840 Stack:  QSYS    / QCSTCRGJOB  CSTCRGJOB     9  : workThreadRoutine__FPv
00000006:934888 Stack:  QSYS    / QCSTCRGJOB  CSTCRGJOB    28  : workQueueProcessLoop__FP8CstDAMbr
00000006:941688 Stack:  QSYS    / QCSTCMN     CSTDAMBR     57  : processQueueMsgs__8CstDAMbrFQ2_8CstDAMbr13CstQueueIndex
00000006:941696 Stack:  QSYS    / QCSTCMN     CSTDAMBR     33  : processMsg__8CstDAMbrFP6CstMsg
00000006:941712 Stack:  QSYS    / QCSTCMN     CSTDAMBR     17  : callFnPtr__8CstDAMbrFPQ2_8CstDAMbr19MsgFunctionPtrEntryP6
00000006:941728 Stack:  QSYS    / QCSTCMN     CSTACK        3  : CstStripOffHeaderMsgPart
00000006:941736 Stack:  QSYS    / QCSTCMN     CSTDAMBR     53  : processMsg__8CstDAMbrFP6CstMsg
00000006:941752 Stack:  QSYS    / QCSTCMN     CSTDAMBR     17  : callFnPtr__8CstDAMbrFPQ2_8CstDAMbr19MsgFunctionPtrEntryP6
00000006:970888 Stack:  QSYS    / QCSTCRGS2   CSTCRGSS     39  : startCrg
00000006:970912 Stack:  QSYS    / QCSTCRGS2   CSTCRGSS    344  : doMessageProcessing__FP6CstMsgP8CstDAMbr
00000006:970928 Stack:  QSYS    / QCSTCRGS2   CSTCRGSS     57  : doExitPgmPhase__FP6CstMsgP8CstDAMbr
00000006:981984 Stack:  QSYS    / QCSTCMN     CSTDAMBR     52  : waitForJobEnd__8CstDAMbrFPA26_ci
00000006:982000 Stack:  QSYS    / QCSTCMN     CSTDAMBR     73  : waitForSpecialMsg__8CstDAMbrFP17CstSpecialMsgListPA8_ciT3
00000006:982016 Stack:  QSYS    / QC2UTIL1    QC2MI3        1  : (2dd) deq
00000006:982136 Stack:  Exception In Stack Dump Code
00000006:982136 Stack:  thread is likely terminated or no longer running the same code as the captured stack
00000006:982160 (2b) Work Thread Index 2 Group MYCRG Last or current values
(2f) 00000006:982176 Request handle D9C3C8C3 E2E3F5F2 0003 0000
(2cc) 00000006:982176 SPI name
00000006:982184 (2h) POF 0, (2d) Completed ack (2j) round 0
00000006:982184 In getNextWorkMsg
00000006:982208 Node      Ack Status   POF   Nack Msg Id
(2l) 00000006:982208 SYSTEM1   Ready
(2l) 00000006:982232 SYSTEM2   Ready
00000006:982248 Messages
00000006:982256 Stack Dump For Target Thread: Handle 11 (0x0000000b)
00000006:982256 Stack:  Work Thread 2 Stack MYCRG
00000006:982344 Stack:  Library / Program     Module      Stmt   Procedure
00000006:982360 Stack:  QSYS    / QCSTCRGJOB  CSTCRGJOB     9  : workThreadRoutine__FPv
00000006:982376 Stack:  QSYS    / QCSTCRGJOB  CSTCRGJOB    28  : workQueueProcessLoop__FP8CstDAMbr
00000006:982392 Stack:  QSYS    / QCSTCMN     CSTDAMBR     51  : processQueueMsgs__8CstDAMbrFQ2_8CstDAMbr13CstQueueIndex
(2m) 00000006:982400 Stack:  QSYS    / QCSTCMN     CSTDAMBR    105  : getNextWorkMsg__8CstDAMbrFv
00000006:982416 Stack:  QSYS    / QC2UTIL1    QC2MI3        1  : deq
00000006:982480 Stack:  Exception In Stack Dump Code
00000006:982480 Stack:  thread is likely terminated or no longer running the same code as the captured stack
The second section contains the call stacks for each thread that is part of the cluster job. For the most part, the main thread shows the DMPCLUTRC request that just completed (2aa). The work threads (2a and 2b) contain the trace information that helps determine what is happening with a cluster job. This section contains details for each work thread, such as the SPI name (2c), the completed acknowledgement (ack) (2d), the request handle for the associated API (2e) or the last completed request handle (2f), the current point of failure (POF) (2g and 2h), the current acknowledgement (ack) round (2i and 2j), and the nodes that have acknowledged (2k and 2l).
The current point of failure (POF) is an internal value that represents where in the code the current protocol is; it does not necessarily indicate that a failure has occurred. An Ack status means the node has successfully completed this part of the protocol and is waiting for all other nodes to acknowledge (Ack) or negatively acknowledge (Nack). A Nack status means the node cannot successfully complete this part of the protocol and is waiting for all other nodes to respond. The message ID for the Nack is given in the next column (2bb); this is the same message that is sent to the originator's results information queue (RIQ). If a node fails during a protocol, its status shows as Fail and, depending on the protocol and the node, may or may not be treated as a Nack. A status of Inactive means the node did not participate in the protocol. A status of Ready means that the node has not responded yet. When a thread is in getNextWorkMsg (2m), the thread is waiting for work to do.
Read the procedure names by starting at the bottom of the call stack and working up. In this example, the stack for work thread 1 contains a deq (2dd) together with waitForSpecialMsg, waitForJobEnd, and doExitPgmPhase. This indicates that the protocol is waiting for an exit program to complete before it can continue processing. From the Ack Status column (2k), you can determine which node the protocol is waiting for; in this example, it is node SYSTEM1 (2n). The qualified job name on the waitForJobEnd line (2o) identifies the job that the system is waiting for. Once you determine the job name, you can work with the job to troubleshoot the cause of the delay. Some possible causes are that the job is still waiting on a job queue, the job is running but takes time to process, or the job is waiting for an object that is locked.
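For example, you could use the Work with Job (WRKJOB) command against the qualified job name taken from the waitForJobEnd line to check whether the job is still on a job queue, display its call stack, or see whether it is waiting for a lock; the job shown here is the one from this example:

WRKJOB JOB(073590/MYCLUSTER/QDFTJOBD)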
In this example, the protocol is waiting for an exit program to complete. An easier way to determine whether a protocol is waiting for an exit program or a vary job to complete is to look at the status lines at the beginning of the work thread information and check whether the thread is in waitForJobEnd (2o). The name of the job being waited for is on the same line, which eliminates the need to read through the stacks.
5722SS1 V5R4M0 060210          AS/400 DUMP          073586/QSYS/MYCRG     08/22/05 16:48:18     PAGE 1
DMPSYSOBJ PARAMETERS
(3a) OBJ- MYCRGAIX   CONTEXT- QTEMP   TYPE- *ALL   SUBTYPE- *ALL
OBJECT TYPE- INDEX   *CRGM
  NAME- MYCRGAIX   TYPE- 0E   SUBTYPE- A5
  LIBRARY- QTEMP   006B8A19B00C9E807000   TYPE- 04   SUBTYPE- C1
  CREATION- 08/22/05 16:43:32   SIZE- 0000007000
  ATTRIBUTES- 0000   ADDRESS- C7FE286F04 000000
  . . .
  POINTERS- NONE
  OIR DATA- NONE
END OF DUMP
* * * * *   E N D   O F   L I S T I N G   * * * * *
The third section is an internal object that contains information about the cluster job. In this example, it is an internal index called MYCRGAIX (3a). The information here is much easier to read than the stack information in the second section.
5722SS1 V5R4M0 060210          AS/400 DUMP          073586/QSYS/USER      08/22/05 16:48:18     PAGE 1
DMPSYSOBJ PARAMETERS
(4a) OBJ- MYCRGTQ   CONTEXT- QTEMP   TYPE- 0A   SUBTYPE- EF
OBJECT TYPE- QUEUE   *QTQ
  NAME- MYCRGTQ   TYPE- 0A   SUBTYPE- EF
  LIBRARY- QTEMP   006B8A19B00C9E807000   TYPE- 04   SUBTYPE- C1
  CREATION- 08/22/05 16:43:32   SIZE- 000002C000
  ATTRIBUTES- 0000   ADDRESS- CC6765CAA2 000000
  QUEUE ATTRIBUTES-
  . . .
  POINTERS-
    000010  SPP  1A EF  MYCRG     QSYS      073586   00002160   0000
    000020  SPP  1A EF  MYCRG     QSYS      073586   00001540   8000
    000030  SPP  1A EF  MYCRG     QSYS      073586   000016E0   0000
    000040  SPP  1A EF  MYCRG     QSYS      073586   00001690   0000
    000050  SPP  1A EF  MYCRG     QSYS      073586   000016A0   0000
    000070  SPP  1A EF  MYCRG     QSYS      073586   00002160   0000
  OIR DATA- NONE
END OF DUMP
* * * * *   E N D   O F   L I S T I N G   * * * * *
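The fourth section is another internal object that contains information about the cluster job. In this example, it is an internal queue called MYCRGTQ (4a).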
5722SS1 V5R4M0 060210          AS/400 DUMP          073586/QSYS/MYCRG     08/22/05 16:48:18     PAGE 1
DMPSYSOBJ PARAMETERS
(5a) OBJ- MYCRG   CONTEXT- QUSRSYS   TYPE- 19   SUBTYPE- 2C
OBJECT TYPE- SPACE   *CRG
  NAME- MYCRG   TYPE- 19   SUBTYPE- 2C
  LIBRARY- QUSRSYS   TYPE- 04   SUBTYPE- 01
  CREATION- 08/17/05 07:16:40   SIZE- 0000002000
  OWNER- MYCLUSTER   TYPE- 08   SUBTYPE- 01
  ATTRIBUTES- 0800   ADDRESS- 1EC687A1F3 000000
  SPACE ATTRIBUTES-
  . . .
END OF DUMP
* * * * *   E N D   O F   L I S T I N G   * * * * *
The fifth section contains information about the CRG object (5a).