<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-us" xml:lang="en-us">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="security" content="public" />
<meta name="Robots" content="index,follow" />
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l gen true r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true r (n 0 s 0 v 0 l 0) "http://www.classify.org/safesurf/" l gen true r (SS~~000 1))' />
<meta name="DC.Type" content="concept" />
<meta name="DC.Title" content="Example: Failure" />
<meta name="abstract" content="Usually, a failover results from a node failure, but there are other reasons that can also generate a failover." />
<meta name="description" content="Usually, a failover results from a node failure, but there are other reasons that can also generate a failover." />
<meta name="DC.Relation" scheme="URI" content="rzaigconceptsfailover.htm" />
<meta name="DC.Relation" scheme="URI" content="rzaigtroubleshootpartitionerrors.htm" />
<meta name="copyright" content="(C) Copyright IBM Corporation 1998, 2006" />
<meta name="DC.Rights.Owner" content="(C) Copyright IBM Corporation 1998, 2006" />
<meta name="DC.Format" content="XHTML" />
<meta name="DC.Identifier" content="rzaigtroubleshootexamplefailover" />
<meta name="DC.Language" content="en-us" />
<!-- All rights reserved. Licensed Materials Property of IBM -->
<!-- US Government Users Restricted Rights -->
<!-- Use, duplication or disclosure restricted by -->
<!-- GSA ADP Schedule Contract with IBM Corp. -->
<link rel="stylesheet" type="text/css" href="./ibmdita.css" />
<link rel="stylesheet" type="text/css" href="./ic.css" />
<title>Example: Failure</title>
</head>
<body id="rzaigtroubleshootexamplefailover"><a name="rzaigtroubleshootexamplefailover"><!-- --></a>
<!-- Java sync-link --><script language="Javascript" src="../rzahg/synch.js" type="text/javascript"></script>
<h1 class="topictitle1">Example: Failure</h1>
<div><p>Usually, a failover results from a node failure, but other
conditions can also cause a failover.</p>
<p>A problem can also affect only a single cluster resource group (CRG).
Such a problem causes a failover for that CRG but not for
any other CRG.</p>
<p>The following table lists various types of failures and the general
category that each one belongs to:</p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" frame="border" border="1" rules="all"><thead align="left"><tr><th valign="top" id="d0e25">Failure</th>
<th valign="top" id="d0e27">General category</th>
</tr>
</thead>
<tbody><tr><td valign="top" headers="d0e25 ">CEC hardware failure (CPU, for example)</td>
<td valign="top" headers="d0e27 ">2</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Communications adapter, line, or router failure; or ENDTCPIFC affecting
all IP interface addresses defined for the node</td>
<td valign="top" headers="d0e27 ">4</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Power loss to the CEC</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Operating system software machine check</td>
<td valign="top" headers="d0e27 ">2</td>
</tr>
<tr><td valign="top" headers="d0e25 ">ENDTCP(*IMMED or *CNTRLD with a time limit) is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">ENDSBS QSYSWRK(*IMMED or *CNTRLD) is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">ENDSBS(*ALL, *IMMED, or *CNTRLD) is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">ENDSYS (*IMMED or *CNTRLD) is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">PWRDWNSYS(*IMMED or *CNTRLD) is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Initial program load (IPL) button is pushed while cluster resource
services is active on system</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Cancel Job (*IMMED or *CNTRLD with a time limit) of the QCSTCTL job
is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Cancel Job (*IMMED or *CNTRLD with a time limit) of the QCSTCRGM job
is issued</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Cancel Job (*IMMED or *CNTRLD with a time limit) of a cluster resource
group job is issued</td>
<td valign="top" headers="d0e27 ">3</td>
</tr>
<tr><td valign="top" headers="d0e25 ">End Cluster Node API is called</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Remove Cluster Node API is called</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Cluster resource group job has a software error that causes it to end
abnormally</td>
<td valign="top" headers="d0e27 ">3</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Function 8 or function 3 is entered from the control panel to power down the system</td>
<td valign="top" headers="d0e27 ">2</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Function 7 is entered for a delayed power down of a partition</td>
<td valign="top" headers="d0e27 ">1</td>
</tr>
<tr><td valign="top" headers="d0e25 ">Application program failure for an application cluster resource group</td>
<td valign="top" headers="d0e27 ">3</td>
</tr>
<tr><td colspan="2" valign="top" headers="d0e25 d0e27 "><p>General category:</p>
<ol><li>All of cluster resource services (CRS) fails on a node and is detected
as a node failure. The node may actually be operational or the node may have
failed (for example, a system failure due to power loss). If all of cluster
resource services fails, then the resources that are managed by CRS will go
through the failover process.</li>
<li>All of CRS fails on a node but it is detected as a cluster partition.
The node may or may not be operational.</li>
<li>A failure occurs on an individual cluster resource group. These conditions
are always detected as a failure.</li>
<li>A failure occurs but the node and cluster resource services are still
operational and it is detected as a cluster partition.</li>
</ol>
</td>
</tr>
</tbody>
</table>
</div>
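<p>The four general categories can be summarized in a small lookup table. The
following Python sketch is illustrative only: the names <code>FailureCategory</code>,
<code>CATEGORIES</code>, and <code>primary_may_move</code> are hypothetical and are not
part of cluster resource services. It encodes, as an assumption drawn from the
category descriptions in this topic, that only failures detected as a cluster partition
(categories 2 and 4) leave the primary node unchanged.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureCategory:
    scope: str        # what has failed
    detected_as: str  # how the rest of the cluster sees the failure


# Hypothetical encoding of the four general categories from the table.
CATEGORIES = {
    1: FailureCategory("all of cluster resource services on a node", "node failure"),
    2: FailureCategory("all of cluster resource services on a node", "cluster partition"),
    3: FailureCategory("an individual cluster resource group", "failure"),
    4: FailureCategory("communication only; node and CRS still operational", "cluster partition"),
}


def primary_may_move(category: int) -> bool:
    """Return True when the category can hand the primary role to a backup.

    Assumption: failures detected as a cluster partition never change
    the primary node automatically.
    """
    return CATEGORIES[category].detected_as != "cluster partition"
```

<p>For example, <code>primary_may_move(4)</code> is false for a communications
adapter failure, because that failure is detected as a cluster partition.</p>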
<p>When a failure occurs, the action taken by cluster resource services for
a specific cluster resource group depends on the type of failure and the state
of the cluster resource group. However, in all cases, the exit program is
called. A failover may have to work with a list of failed nodes. When the exit
program is called, it needs to determine whether it must deal with a single
node failure or with a list of failed nodes.</p>
<p>If the cluster resource group is <var class="varname">inactive</var>, the membership
status of the failed node in the cluster resource group's recovery domain
is changed to either <var class="varname">Inactive</var> or <var class="varname">Partition</var>.
However, the node roles are not changed, and the backup nodes are not reordered.
The backup nodes are reordered in an inactive cluster resource group when
the <a href="../cl/strcrg.htm"><span class="cmdname">Start
Cluster Resource Group (STRCRG)</span> command</a> or the <a href="../apis/clrgstcrg.htm"><span class="apiname">Start Cluster Resource
Group (QcstStartClusterResourceGroup)</span> API</a> is called. However,
the Start Cluster Resource Group API fails if the primary node is not
active. In that case, you must issue the <a href="../cl/chgcrg.htm"><span class="cmdname">Change
Cluster Resource Group (CHGCRG)</span> command</a> or the <a href="../apis/clrgchgcrg.htm"><span class="apiname">Change Cluster Resource
Group (QcstChangeClusterResourceGroup)</span> API</a> to designate
an active node as the primary node, and then call the Start Cluster Resource Group
API again.</p>
<p>If the cluster resource group is <var class="varname">active</var> and the failing
node is <var class="varname">not</var> the primary node, the failover updates the
status of the failed recovery domain member in the cluster resource group's
recovery domain. If the failing node is a backup node, the list of backup
nodes is reordered so that active nodes are at the beginning of the list.</p>
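<p>The reordering of backup nodes can be sketched as a stable partition of the
backup list. This is a minimal Python illustration, not the cluster resource services
implementation: the function name <code>reorder_backups</code> and the
<code>is_active</code> callback are hypothetical, and preserving the relative order
of nodes within the active and inactive groups is an assumption that this topic
does not state.</p>

```python
def reorder_backups(backups, is_active):
    """Return the backup list with active nodes first.

    Assumption: nodes keep their relative order within the active
    group and within the inactive group (a stable partition).
    """
    active = [node for node in backups if is_active(node)]
    inactive = [node for node in backups if not is_active(node)]
    return active + inactive


# Hypothetical recovery domain: NODEB has failed, NODEC and NODED are active.
status = {"NODEB": "inactive", "NODEC": "active", "NODED": "active"}
new_order = reorder_backups(["NODEB", "NODEC", "NODED"],
                            lambda node: status[node] == "active")
```

<p>In this example, <code>NODEC</code> becomes the first backup, followed by
<code>NODED</code>, and the failed <code>NODEB</code> moves to the end of the list.</p>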
<p>If the cluster resource group is <var class="varname">active</var> and the failing
recovery domain member is the primary node, the following actions are performed based
on which type of failure has occurred:</p>
<dl><dt class="dlterm">Category 1 Failure</dt>
<dd>Failover occurs. The primary node is marked <var class="varname">inactive</var> in
each cluster resource group and made the last backup node. The node that was
the first backup becomes the new primary node. All device cluster resource
groups fail over first. Then, all data cluster resource groups fail over. Finally,
all application cluster resource groups fail over. If a failover for any CRG
detects that none of the backup nodes are active, the status of the CRG is
set to <var class="varname">indoubt</var>.</dd>
<dt class="dlterm">Category 2 Failure</dt>
<dd>Failover occurs but the primary node does not change. All nodes in the
cluster partition that do not have the primary node as a member of the partition
will end the active cluster resource group. For each node that is in the primary
partition, the status of the node in the cluster resource group's recovery domain
is set to <var class="varname">partition</var>. If a node really failed but
is detected only as a partition problem and the failed node was the primary
node, you lose all the data and application services on that node and no automatic
failover is started. You must either declare the node as failed or bring the
node back up and start clustering on that node again. See <a href="rzaigtroubleshootchangepartitionednodes.htm">Change partitioned
nodes to failed</a> for more information.</dd>
<dt class="dlterm">Category 3 Failure</dt>
<dd>If only a single cluster resource group is affected, failover occurs on
an individual basis because cluster resource groups are independent of each
other. Several cluster resource groups can be affected at
the same time if someone cancels several cluster resource group jobs. However,
this type of failure is handled on a CRG-by-CRG basis, and no coordinated failover
between CRGs is performed. The primary node is marked as <var class="varname">inactive</var> in
each cluster resource group and made the last backup node. The node that was
the first backup node becomes the new primary node. If there is no active
backup node, the status of the cluster resource group is set to <var class="varname">indoubt</var>.</dd>
<dt class="dlterm">Category 4 Failure</dt>
<dd>This category is similar to category 2. However, while all nodes and cluster
resource services on the nodes are still operational, not all nodes can communicate
with each other. The cluster is partitioned, but the primary node or nodes
are still providing services. However, because of the partition, you may experience
various problems. For example, if the primary node is in one partition and
all the backup nodes or replicate nodes are in another partition, you are
no longer replicating data and have no protection should the primary node
fail. In the partition that contains the primary node, the failover process
updates the status of the nodes in the cluster resource group's recovery domain
to <var class="varname">partition</var> for all nodes in the other partition. In the
partition that does not contain the primary node, the status of the nodes
in the cluster resource group's recovery domain for all nodes in the other
partition is set to <var class="varname">partition</var>.</dd>
</dl>
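<p>The role changes described for category 1 and category 3 failures can be modeled
in a few lines of Python. This is a simplified sketch under stated assumptions, not
the actual cluster resource services logic: the <code>Crg</code> class and both
functions are hypothetical, the new primary is taken to be the first
<em>active</em> backup, and node roles are left untouched when the CRG becomes
<var class="varname">indoubt</var>. The device, data, application ordering applies
to the node-level (category 1) case.</p>

```python
from dataclasses import dataclass, field

# Failover order by CRG type, as described for a category 1 failure.
TYPE_ORDER = {"device": 0, "data": 1, "application": 2}


@dataclass
class Crg:
    name: str
    crg_type: str   # "device", "data", or "application"
    primary: str
    backups: list
    status: str = "active"
    member_status: dict = field(default_factory=dict)


def fail_over_primary(crg, active_nodes):
    """Move the primary role of one CRG to its first active backup."""
    failed = crg.primary
    crg.member_status[failed] = "inactive"
    active = [n for n in crg.backups if n in active_nodes]
    inactive = [n for n in crg.backups if n not in active_nodes]
    if not active:
        # No active backup: the CRG status becomes indoubt.
        # Assumption: node roles are not changed in this case.
        crg.status = "indoubt"
        return
    crg.primary = active[0]
    # The failed primary becomes the last backup node.
    crg.backups = active[1:] + inactive + [failed]


def fail_over_node(crgs, failed_node, active_nodes):
    """Fail over every active CRG whose primary is the failed node.

    Device CRGs are handled first, then data, then application.
    Returns the CRG names in the order they were processed.
    """
    affected = [c for c in crgs
                if c.status == "active" and c.primary == failed_node]
    affected.sort(key=lambda c: TYPE_ORDER[c.crg_type])
    for crg in affected:
        fail_over_primary(crg, active_nodes)
    return [c.name for c in affected]
```

<p>For a category 3 failure, only <code>fail_over_primary</code> would apply, once
for each affected CRG, because no coordinated failover between CRGs is performed.</p>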
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <img src="./delta.gif" alt="Start of change" /><a href="rzaigconceptsfailover.htm" title="Failover occurs when a server in a cluster automatically switches over to one or more backup servers in the event of a system failure.">Failover</a><img src="./deltaend.gif" alt="End of change" /></div>
</div>
<div class="relconcepts"><strong>Related concepts</strong><br />
<div><a href="rzaigtroubleshootpartitionerrors.htm" title="Certain cluster conditions are easily corrected. If a cluster partition has occurred, you can learn how to recover. This topic also tells you how to avoid a cluster partition and gives you an example of how to merge partitions back together.">Partition errors</a></div>
</div>
</div>
</body>
</html>