Determine if a cluster problem exists

Start here to diagnose your cluster problems.

At times, your cluster may not appear to be operating correctly. When you suspect a problem, use the following steps to determine whether a problem exists and what kind of problem it is.

Determine if clustering is active on your system.
To determine if cluster resource services is active, look for the two jobs, QCSTCTL and QCSTCRGM, in the QSYSWRK subsystem. If these jobs are active, then cluster resource services is active. To find them, you can use the Work Management function in iSeries™ Navigator to view jobs in a subsystem, or use the WRKACTJOB (Work with Active Jobs) command. You can also use the DSPCLUINF (Display Cluster Information) command to view status information for the cluster.
- Additional jobs for cluster resource services may also be active. Cluster resource services job structure provides information about how cluster resource services jobs are formatted.
Look for messages indicating a problem.
- Look for inquiry messages in QSYSOPR that are waiting for a response.
- Look for error messages in QSYSOPR that indicate a cluster problem. Generally, these will be in the CPFBB00 to CPFBBFF range.
- Display the history log (DSPLOG CL command) for messages that indicate a cluster problem. Generally, these will be in the CPFBB00 to CPFBBFF range.
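When you scan a large amount of QSYSOPR or history log text, a small filter can help pick out message IDs in the range mentioned above. The following is a minimal sketch, assuming the log text has already been copied into a string. The function name and the sample text are hypothetical and not part of any IBM tooling.

```python
import re

# Range quoted in the text: CPFBB00 through CPFBBFF.
# The last two characters of the message ID are hexadecimal digits.
_CLUSTER_MSG = re.compile(r"\bCPFBB[0-9A-F]{2}\b")


def find_cluster_message_ids(text):
    """Return cluster-range message IDs found in text.

    IDs are returned in order of first appearance, without duplicates.
    Matching is case-insensitive because the text is upper-cased first.
    """
    seen = []
    for msg_id in _CLUSTER_MSG.findall(text.upper()):
        if msg_id not in seen:
            seen.append(msg_id)
    return seen


if __name__ == "__main__":
    # Made-up sample text for illustration only.
    sample = "CPFBB46 reported on start; CPF1234 unrelated; CPFBB46 reported again"
    print(find_cluster_message_ids(sample))
```

Messages outside the range, such as the hypothetical CPF1234 in the sample, are ignored.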
Look at job logs for the cluster jobs for severe errors.
These jobs are initially set to a logging level of (4 0 *SECLVL) so that you can see the necessary error messages. Ensure that these jobs and the exit program jobs have the logging level set appropriately. If clustering is not active, you can still look for spooled files for the cluster jobs and exit program jobs.
If you suspect some kind of hang condition, look at call stacks of cluster jobs.
Determine whether any program is in a DEQW (dequeue wait) state. If so, check the call stack of each thread to see whether any of them contain getSpecialMsg.
Check for cluster vertical licensed internal code (VLIC) log entries.
These log entries have a 4800 major code.
Use the NETSTAT command to determine if there are any abnormalities in your communications environment.
NETSTAT returns information about the status of TCP/IP network routes, interfaces, TCP connections and UDP ports on your system.
- Use Netstat Option 1 (Work with TCP/IP interface status) to ensure that the IP addresses chosen for clustering show an 'Active' status. Also ensure that the LOOPBACK address (127.0.0.1) is active.
- Use Netstat Option 3 (Work with TCP/IP Connection Status) to display the port numbers (F14). Local port 5550 should be in a 'Listen' state. This port must be opened via the STRTCPSVR *INETD command, as evidenced by the existence of a QTOGINTD (User QTCP) job in the Active Jobs list. If clustering is started on a node, local port 5551 must be open and in a '*UDP' state. If clustering is not started, port 5551 must not be open; otherwise it prevents clustering from starting successfully on that node.
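The port rules above can be summarized as a small decision check. The following is a minimal sketch that assumes you have already read the port states from the Netstat displays and entered them as true/false values. It does not query the system, and the function and parameter names are hypothetical.

```python
def check_cluster_ports(clustering_started, port_5550_listening, port_5551_open):
    """Return findings that contradict the port rules described in the text.

    All three arguments are booleans observed by the operator:
      clustering_started   - whether clustering is started on this node
      port_5550_listening  - whether local port 5550 shows a 'Listen' state
      port_5551_open       - whether local port 5551 is open
    An empty list means the observed states match the rules.
    """
    findings = []
    if not port_5550_listening:
        findings.append(
            "Port 5550 is not in a Listen state; check that the *INETD server is started."
        )
    if clustering_started and not port_5551_open:
        findings.append("Clustering is started but port 5551 is not open.")
    if not clustering_started and port_5551_open:
        findings.append(
            "Clustering is not started but port 5551 is open; "
            "this prevents clustering from starting on this node."
        )
    return findings


if __name__ == "__main__":
    # Example: clustering not started, 5550 listening, 5551 unexpectedly open.
    for finding in check_cluster_ports(False, True, True):
        print(finding)
```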
Use ping. If you try to start a cluster node and it cannot be pinged, you will receive an internal clustering error (CPFBB46).
Use the CLUSTERINFO macro to show cluster resource services' view of nodes in the cluster, nodes in the various cluster resource groups, and cluster IP addresses being currently used.
Discrepancies found here may help pinpoint trouble areas if the cluster is not performing as expected. See Investigate a problem with CLUSTERINFO macro for details on using and interpreting the CLUSTERINFO macro results.