Start here to diagnose your cluster problems.
At times, your cluster may appear not to be operating correctly. When
you suspect a problem, use the following checks to determine whether
a problem exists and what its nature is.
- Determine if clustering is active on your system.
To determine whether cluster resource services is active, look for the two jobs
QCSTCTL and QCSTCRGM in the QSYSWRK subsystem. If both jobs are active, cluster
resource services is active. To check, use the Work Management function in iSeries™ Navigator
to view the jobs in a subsystem, or use the WRKACTJOB (Work with Active Jobs) command.
You can also use the DSPCLUINF (Display Cluster Information) command to
view status information for the cluster.
- Look for messages indicating a problem.
- Look for inquiry messages in QSYSOPR that are waiting for a response.
- Look for error messages in QSYSOPR that indicate a cluster problem. Generally,
these will be in the CPFBB00 to CPFBBFF range.
- Display the history log (DSPLOG CL command) and look for messages
that indicate a cluster problem. Generally, these will be in the CPFBB00 to
CPFBBFF range.
- Look at job logs for the cluster jobs for severe errors.
These jobs are initially set with a logging level of (4 0 *SECLVL) so that you can
see the necessary error messages. Ensure that these jobs and the
exit program jobs have the logging level set appropriately. If clustering
is not active, you can still look for spooled files for the cluster jobs and
exit program jobs.
- If you suspect some kind of hang condition, look at call stacks of
cluster jobs.
Determine whether any program is in a DEQW (dequeue wait) state.
If so, check the call stack of each thread to see whether any
of them has getSpecialMsg in the call stack.
- Check for cluster vertical licensed internal code (VLIC) log entries.
These log entries have a 4800 major code.
- Use the NETSTAT command to determine whether there are any abnormalities in
your communications environment.
NETSTAT returns information about
the status of TCP/IP network routes, interfaces, TCP connections, and UDP ports
on your system.
- Use Netstat Option 1 (Work with TCP/IP interface status) to ensure that
the IP addresses chosen for clustering show an 'Active' status.
Also ensure that the LOOPBACK address (127.0.0.1) is active.
- Use Netstat Option 3 (Work with TCP/IP Connection Status) to display the
port numbers (F14). Local port 5550 should be in a 'Listen' state. This port
must be opened with the STRTCPSVR *INETD command; a QTOGINTD (User QTCP) job
in the Active Jobs list shows that the command has been run. If clustering is started
on a node, local port 5551 must be open and in a '*UDP' state. If clustering
is not started, port 5551 must not be open; an open port 5551 prevents
clustering from starting successfully on that node.
- Use ping. If you try to start a cluster node and it cannot be pinged,
you will receive an internal clustering error (CPFBB46).
- Use the CLUSTERINFO macro to show cluster
resource services' view of the nodes in the cluster, the nodes in the various cluster
resource groups, and the cluster IP addresses currently in use.
Discrepancies found here may help pinpoint trouble areas if the cluster is not performing
as expected. See Investigate a problem with CLUSTERINFO macro for details on using the
CLUSTERINFO macro and interpreting its results.
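
Two of the checks above are mechanical enough to sketch in code: deciding whether a message ID falls in the CPFBB00 to CPFBBFF range, and applying the port 5550 and 5551 rules. The following Python sketch is illustrative only and is not part of the product. The function names and the `port_states` dictionary are hypothetical, and the sketch assumes you have already transcribed the port states by hand from the NETSTAT Option 3 display; it does not query the system.

```python
import re
from typing import Dict, List

# Matches CPFBB followed by two hexadecimal digits (CPFBB00 through CPFBBFF).
_CLUSTER_MSG = re.compile(r"^CPFBB[0-9A-F]{2}$")


def is_cluster_message(msg_id: str) -> bool:
    """Return True if msg_id lies in the CPFBB00 to CPFBBFF range."""
    return bool(_CLUSTER_MSG.match(msg_id.strip().upper()))


def check_cluster_ports(port_states: Dict[int, str],
                        clustering_started: bool) -> List[str]:
    """Apply the port 5550/5551 rules to hand-transcribed port states.

    port_states maps a local port number to the state shown by NETSTAT
    (hypothetical input format); a port that is absent is treated as
    not open.
    """
    findings: List[str] = []

    # Port 5550 should be in a Listen state regardless of cluster status.
    if port_states.get(5550, "").lower() != "listen":
        findings.append("port 5550 is not in a Listen state")

    state_5551 = port_states.get(5551)
    if clustering_started:
        # With clustering started, 5551 must be open and in a *UDP state.
        if state_5551 is None or state_5551.upper() != "*UDP":
            findings.append("clustering is started but port 5551 is not in a *UDP state")
    elif state_5551 is not None:
        # With clustering not started, an open 5551 blocks cluster start.
        findings.append("clustering is not started but port 5551 is open")

    return findings
```

For example, `check_cluster_ports({5550: "Listen", 5551: "*UDP"}, clustering_started=False)` reports a single finding, because port 5551 is open while clustering is not started.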