<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-us" xml:lang="en-us">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="security" content="public" />
<meta name="Robots" content="index,follow" />
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l gen true r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true r (n 0 s 0 v 0 l 0) "http://www.classify.org/safesurf/" l gen true r (SS~~000 1))' />
<meta name="DC.Type" content="concept" />
<meta name="DC.Title" content="Example: Failure" />
<meta name="abstract" content="Usually, a failover results from a node failure, but there are other reasons that can also generate a failover." />
<meta name="description" content="Usually, a failover results from a node failure, but there are other reasons that can also generate a failover." />
<meta name="DC.Relation" scheme="URI" content="rzaigconceptsfailover.htm" />
<meta name="DC.Relation" scheme="URI" content="rzaigtroubleshootpartitionerrors.htm" />
<meta name="copyright" content="(C) Copyright IBM Corporation 1998, 2006" />
<meta name="DC.Rights.Owner" content="(C) Copyright IBM Corporation 1998, 2006" />
<meta name="DC.Format" content="XHTML" />
<meta name="DC.Identifier" content="rzaigtroubleshootexamplefailover" />
<meta name="DC.Language" content="en-us" />
<!-- All rights reserved. Licensed Materials Property of IBM -->
<!-- US Government Users Restricted Rights -->
<!-- Use, duplication or disclosure restricted by -->
<!-- GSA ADP Schedule Contract with IBM Corp. -->
<link rel="stylesheet" type="text/css" href="./ibmdita.css" />
<link rel="stylesheet" type="text/css" href="./ic.css" />
<title>Example: Failure</title>
</head>
<body id="rzaigtroubleshootexamplefailover"><a name="rzaigtroubleshootexamplefailover"><!-- --></a>
<!-- Java sync-link --><script language="Javascript" src="../rzahg/synch.js" type="text/javascript"></script>
<h1 class="topictitle1">Example: Failure</h1>
<div><p>Usually, a failover results from a node failure, but there are
other reasons that can also generate a failover.</p>
<p>A problem can also affect only a single cluster resource group (CRG),
causing a failover for that CRG but not for any other CRG.</p>
<p>The following table lists various types of failures and the general category
that each one falls into:</p>

<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" frame="border" border="1" rules="all"><thead align="left"><tr><th valign="top" id="d0e25">Failure</th>
<th valign="top" id="d0e27">General category</th>
</tr>
</thead>
<tbody><tr><td valign="top" headers="d0e25 ">CEC hardware failure (CPU, for example)</td>
<td valign="top" headers="d0e27 ">2</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Communications adapter, line, or router failure; or ENDTCPIFC affecting
all IP interface addresses defined for the node</td>
<td valign="top" headers="d0e27 ">4</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Power loss to the CEC</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Operating system software machine check</td>
<td valign="top" headers="d0e27 ">2</td>
</tr>
<tr><td valign="top" headers="d0e25 ">ENDTCP(*IMMED or *CNTRLD with a time limit) is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">ENDSBS QSYSWRK(*IMMED or *CNTRLD) is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">ENDSBS(*ALL, *IMMED, or *CNTRLD) is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">ENDSYS (*IMMED or *CNTRLD) is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">PWRDWNSYS(*IMMED or *CNTRLD) is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Initial program load (IPL) button is pushed while cluster resource
services is active on system</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Cancel Job (*IMMED or *CNTRLD with a time limit) of the QCSTCTL job
is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Cancel Job (*IMMED or *CNTRLD with a time limit) of the QCSTCRGM job
is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Cancel Job (*IMMED or *CNTRLD with a time limit) of a cluster resource
group job is issued</td>
<td valign="top" headers="d0e27 ">3</td>
</tr>
<tr><td valign="top" headers="d0e25 ">End Cluster Node API is called</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Remove Cluster Node API is called</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Cluster resource group job has a software error that causes it to end
abnormally</td>
<td valign="top" headers="d0e27 ">3</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Function 8 or function 3 is entered from the control panel to power down the system</td>
<td valign="top" headers="d0e27 ">2</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Function 7 is entered for a delayed power down of a partition</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Application program failure for an application cluster resource group</td>
<td valign="top" headers="d0e27 ">3</td>
</tr>
<tr><td colspan="2" valign="top" headers="d0e25 d0e27 "><p>General category:</p>
<ol><li>All of cluster resource services (CRS) fails on a node and is detected
as a node failure. The node may actually be operational or the node may have
failed (for example, a system failure due to power loss). If all of cluster
resource services fails, then the resources that are managed by CRS will go
through the failover process.</li>
<li>All of CRS fails on a node but it is detected as a cluster partition.
The node may or may not be operational.</li>
<li>A failure occurs on an individual cluster resource group. These conditions
are always detected as a failure.</li>
<li>A failure occurs but the node and cluster resource services are still
operational and it is detected as a cluster partition.</li>
</ol>
 </td>
</tr>
</tbody>
</table>
</div>
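<p>As a rough illustration, the table can be read as a lookup from a failure to its general category, and from the category to how the failure is detected. The following Python sketch models a few rows of the table. The dictionary keys and the detection_for helper are hypothetical names chosen for this example; they are not identifiers used by cluster resource services.</p>

```python
# Hypothetical labels for a few rows of the table above (not real identifiers).
FAILURE_CATEGORY = {
    "cec_hardware_failure": 2,
    "communications_failure": 4,
    "power_loss_to_cec": 1,
    "os_machine_check": 2,
    "pwrdwnsys_issued": 1,
    "crg_job_cancelled": 3,
    "application_program_failure": 3,
}

# How each general category is detected, paraphrased from the category list.
CATEGORY_DETECTION = {
    1: "node failure",
    2: "cluster partition",
    3: "cluster resource group failure",
    4: "cluster partition",
}


def detection_for(failure):
    """Return (category, detection) for one of the failure labels above."""
    category = FAILURE_CATEGORY[failure]
    return category, CATEGORY_DETECTION[category]


print(detection_for("power_loss_to_cec"))
print(detection_for("communications_failure"))
```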
<p>When a failure occurs, the action taken by cluster resource services for
a specific cluster resource group depends on the type of failure and the state
of the cluster resource group. However, in all cases, the exit program is
called. A failover may have to work with a list of failed nodes. When the exit
program is called, it needs to determine whether it must deal with a single
node failure or with a list of failed nodes.</p>
<p>If the cluster resource group is <var class="varname">inactive</var>, the membership
status of the failed node in the cluster resource group's recovery domain
is changed to either <var class="varname">Inactive</var> or <var class="varname">Partition</var> status.
However, the node roles are not changed, and the backup nodes are not reordered.
The backup nodes are reordered in an inactive cluster resource group when
the  <a href="../cl/strcrg.htm"><span class="cmdname">Start
Cluster Resource Group (STRCRG)</span> command</a> or the <a href="../apis/clrgstcrg.htm"><span class="apiname">Start Cluster Resource
Group (QcstStartClusterResourceGroup)</span> API</a> is called. However,
the Start Cluster Resource Group API fails if the primary node is not
active. In that case, you must issue the <a href="../cl/chgcrg.htm"><span class="cmdname">Change
Cluster Resource Group (CHGCRG)</span> command</a> or the <a href="../apis/clrgchgcrg.htm"><span class="apiname">Change Cluster Resource
Group (QcstChangeClusterResourceGroup)</span>  API</a> to designate
an active node as the primary node, then call the Start Cluster Resource Group
API again.</p>
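<p>The recovery sequence for an inactive cluster resource group can be sketched as a small Python model. This is only an illustration under stated assumptions: the recovery domain is modeled as a list of (node, status) pairs with the primary node first, and start_crg and change_primary are hypothetical stand-ins for the Start and Change Cluster Resource Group operations, not the real commands or APIs. The reordering of backup nodes at start time is not modeled.</p>

```python
class StartError(Exception):
    """Raised by the toy start_crg when the primary node is not active."""


def start_crg(recovery_domain):
    """Toy stand-in for starting a CRG; the first entry is the primary node."""
    name, status = recovery_domain[0]
    if status != "active":
        raise StartError(name + " is the primary node but is " + status)
    return "started with primary " + name


def change_primary(recovery_domain, new_primary):
    """Toy stand-in for changing a CRG: move new_primary to the front."""
    chosen = [entry for entry in recovery_domain if entry[0] == new_primary]
    rest = [entry for entry in recovery_domain if entry[0] != new_primary]
    return chosen + rest


# NODE1 failed while the CRG was inactive: its status changed, its role did not.
domain = [("NODE1", "inactive"), ("NODE2", "active"), ("NODE3", "active")]

try:
    result = start_crg(domain)
except StartError:
    # Designate an active node as the primary node, then start again.
    domain = change_primary(domain, "NODE2")
    result = start_crg(domain)

print(result)
```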
<p>If the cluster resource group is <var class="varname">active</var> and the failing
node is <var class="varname">not</var> the primary node, the failover updates the
status of the failed recovery domain member in the cluster resource group's
recovery domain. If the failing node is a backup node, the list of backup
nodes is reordered so that active nodes are at the beginning of the list.</p>
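<p>The reordering of backup nodes can be pictured with a short Python sketch. It assumes that backup nodes are held as (node, status) pairs and that nodes with the same status keep their relative order. That ordering detail and the reorder_backups name are assumptions made for this example.</p>

```python
def reorder_backups(backups):
    """Return the backups with active nodes first."""
    # sorted() is stable, so nodes with the same status keep their order.
    return sorted(backups, key=lambda entry: entry[1] != "active")


backups = [
    ("NODE2", "inactive"),
    ("NODE3", "active"),
    ("NODE4", "partition"),
    ("NODE5", "active"),
]
reordered = reorder_backups(backups)
print(reordered)
```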
<p>If the cluster resource group is <var class="varname">active</var> and the failing
node is the primary node, the following actions are performed based
on which type of failure has occurred:</p>
<dl><dt class="dlterm">Category 1 Failure</dt>
<dd>Failover occurs. The primary node is marked <var class="varname">inactive</var> in
each cluster resource group and made the last backup node. The node that was
the first backup becomes the new primary node. All device cluster resource
groups fail over first. Then, all data cluster resource groups fail over. Finally,
all application cluster resource groups fail over. If the failover for any CRG
detects that none of the backup nodes are active, the status of the CRG is
set to <var class="varname">indoubt</var>.</dd>
<dt class="dlterm">Category 2 Failure</dt>
<dd>Failover occurs but the primary node does not change. All nodes in the
cluster partition that do not have the primary node as a member of the partition
will end the active cluster resource group. The status of the nodes in the
recovery domain of the cluster resource group is set to <var class="varname">partition</var> status
for each node that is in the primary partition. If a node really failed but
is detected only as a partition problem and the failed node was the primary
node, you lose all the data and application services on that node and no automatic
failover is started. You must either declare the node as failed or bring the
node back up and start clustering on that node again. See <a href="rzaigtroubleshootchangepartitionednodes.htm">Change partitioned
nodes to failed</a> for more information.</dd>
<dt class="dlterm">Category 3 Failure</dt>
<dd>If only a single cluster resource group is affected, failover occurs on
an individual basis because cluster resource groups are independent of each
other. Several cluster resource groups can be affected at the same time,
for example, if someone cancels several cluster resource group jobs. However,
this type of failure is handled on a CRG-by-CRG basis, and no coordinated failover
between CRGs is performed. The primary node is marked as <var class="varname">inactive</var> in
each cluster resource group and made the last backup node. The node that was
the first backup node becomes the new primary node. If there is no active
backup node, the status of the cluster resource group is set to <var class="varname">indoubt</var>.</dd>
<dt class="dlterm">Category 4 Failure</dt>
<dd>This category is similar to category 2. However, while all nodes and cluster
resource services on the nodes are still operational, not all nodes can communicate
with each other. The cluster is partitioned, but the primary node or nodes
are still providing services. However, because of the partition, you may experience
various problems. For example, if the primary node is in one partition and
all the backup nodes or replicate nodes are in another partition, you are
no longer replicating data and have no protection should the primary node
fail. In the partition that contains the primary node, the failover process
updates the status of the nodes in the cluster resource group's recovery domain
to <var class="varname">partition</var> for all nodes in the other partition. In the
partition that does not contain the primary node, the status of the nodes
in the cluster resource group's recovery domain for all nodes in the other
partition is set to <var class="varname">partition</var>.</dd>
</dl>
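<p>The role changes described for category 1 and category 3 failures, and the failover order for category 1, can be sketched in Python. This is a simplified model, not the behavior of cluster resource services itself. The recovery domain is a list of (node, status) pairs with the primary node first, and the function names are hypothetical. The sketch also assumes that active backups are ordered ahead of inactive ones, that the CRG stays active after a successful failover, and that roles are left unchanged when no backup is active.</p>

```python
def fail_over_primary(recovery_domain):
    """Return (new_recovery_domain, crg_status) after the primary node fails."""
    primary_name = recovery_domain[0][0]
    failed_primary = (primary_name, "inactive")
    # Assumption: active backups are considered first (stable sort).
    backups = sorted(recovery_domain[1:], key=lambda entry: entry[1] != "active")
    if not backups or backups[0][1] != "active":
        # No active backup node can take over.
        return [failed_primary] + backups, "indoubt"
    # The first backup becomes primary; the old primary becomes the last backup.
    return backups + [failed_primary], "active"


# Category 1 only: device CRGs fail over first, then data, then application.
FAILOVER_ORDER = {"device": 0, "data": 1, "application": 2}


def failover_sequence(crgs):
    """Order (name, type) pairs by the failover order of their CRG type."""
    return sorted(crgs, key=lambda crg: FAILOVER_ORDER[crg[1]])


domain = [("NODE1", "active"), ("NODE2", "active"), ("NODE3", "inactive")]
print(fail_over_primary(domain))
print(failover_sequence([("APP1", "application"), ("DEV1", "device"), ("DATA1", "data")]))
```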
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <img src="./delta.gif" alt="Start of change" /><a href="rzaigconceptsfailover.htm" title="Failover occurs when a server in a cluster automatically switches over to one or more backup servers in the event of a system failure.">Failover</a><img src="./deltaend.gif" alt="End of change" /></div>
</div>
<div class="relconcepts"><strong>Related concepts</strong><br />
<div><a href="rzaigtroubleshootpartitionerrors.htm" title="Certain cluster conditions are easily corrected. If a cluster partition has occurred, you can learn how to recover. This topic also tells you how to avoid a cluster partition and gives you an example of how to merge partitions back together.">Partition errors</a></div>
</div>
</div>
</body>
</html>