Clusters Defined
A cluster is a group of independent computers working together as a single
system to ensure that mission-critical applications and resources are as
highly available as possible. The group is managed as a single system, shares a common
namespace, and is specifically designed to tolerate component failures and to
support the addition or removal of components in a way that is transparent to
users. Clustered systems have several advantages: fault tolerance,
high availability, scalability, simplified management, and support for rolling
upgrades, to name a few.
There are two different types of cluster models in the
industry: the shared device model and the shared nothing model.
In the shared device model, applications running
within a cluster can access any hardware resource connected to any node in
the cluster. As a result, access to
the data must be synchronized. In many such implementations, a special component called a Distributed Lock Manager (DLM) is used for
this purpose. A DLM is a service that manages access to cluster hardware resources.
When multiple applications access the same resource, the DLM resolves any
conflicts that might arise. Along with this sophistication comes complexity: a DLM
adds significant overhead to the cluster. Most of this is additional traffic
between nodes; however, a performance hit is also realized because access to
hardware resources is serialized.
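To make the DLM's role concrete, here is a minimal single-process sketch in Python. It is not how any real DLM is implemented; the class `ToyLockManager`, its methods, and the node and resource names are all hypothetical. It only illustrates the two costs mentioned above: every request and release is a message the manager must handle, and a second node has to wait while the first holds the lock.

```python
from collections import defaultdict, deque


class ToyLockManager:
    """Hypothetical lock manager: one holder per resource, others queue."""

    def __init__(self):
        self.holder = {}                    # resource -> node holding the lock
        self.waiters = defaultdict(deque)   # resource -> nodes waiting in order
        self.messages = 0                   # requests and releases handled

    def acquire(self, node, resource):
        """Return True if the lock is granted now, False if the node must wait."""
        self.messages += 1
        if resource not in self.holder:
            self.holder[resource] = node
            return True
        self.waiters[resource].append(node)
        return False

    def release(self, node, resource):
        """Release the lock and pass it to the next waiter, if there is one."""
        self.messages += 1
        if self.holder.get(resource) != node:
            raise ValueError(f"{node} does not hold {resource}")
        if self.waiters[resource]:
            self.holder[resource] = self.waiters[resource].popleft()
        else:
            del self.holder[resource]
        return self.holder.get(resource)


dlm = ToyLockManager()
granted_a = dlm.acquire("node-a", "disk-1")    # granted immediately
granted_b = dlm.acquire("node-b", "disk-1")    # queued: access is serialized
next_holder = dlm.release("node-a", "disk-1")  # lock passes to node-b
```

In a real cluster each of these calls would be network traffic between nodes, which is where most of the overhead comes from.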
By default, Microsoft Cluster Server and the
Windows Cluster Service use the shared nothing model.
Because this model does not use a DLM, it does not have the overhead incurred by
using such a service. In the shared nothing model, only one node
can own and access a given hardware resource at any given time. When a failure occurs,
a surviving node can take ownership of the failed node's resources and make them
available to users.
Although both Microsoft Cluster Server and the
Windows Cluster Service use the shared nothing
model by default, they can also use the shared device model, but only if the
clustered application supplies its own DLM.
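The shared nothing rule (one owner per resource, with ownership moving to a survivor on failure) can be sketched in a few lines of Python. This is an illustrative model only, not the Cluster Service API; the class `SharedNothingCluster`, its methods, and the node and resource names are hypothetical, and the choice of survivor is arbitrary.

```python
class SharedNothingCluster:
    """Hypothetical model: each resource is owned by exactly one node."""

    def __init__(self, nodes):
        self.online = set(nodes)
        self.owner = {}  # resource -> the single node that owns it

    def assign(self, resource, node):
        if node not in self.online:
            raise ValueError(f"{node} is not online")
        self.owner[resource] = node

    def can_access(self, node, resource):
        """Only the current owner may access the resource."""
        return self.owner.get(resource) == node

    def fail(self, failed_node):
        """Mark a node as failed and move its resources to a surviving node."""
        self.online.discard(failed_node)
        if not self.online:
            raise RuntimeError("no surviving node to take ownership")
        survivor = sorted(self.online)[0]  # arbitrary but deterministic choice
        moved = [r for r, n in self.owner.items() if n == failed_node]
        for resource in moved:
            self.owner[resource] = survivor
        return survivor, moved


cluster = SharedNothingCluster(["node-a", "node-b"])
cluster.assign("disk-1", "node-a")
before = cluster.can_access("node-b", "disk-1")   # node-b is not the owner
survivor, moved = cluster.fail("node-a")          # node-b takes over disk-1
after = cluster.can_access("node-b", "disk-1")
```

Because ownership is exclusive, no lock traffic is needed during normal operation; coordination happens only when ownership changes.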
Why Cluster?
Generally speaking, hardware
failure is not the predominant cause of downtime. The leading causes
of downtime are typically related to events that are external to the
system, such as misconfiguration, power outages, security breaches, and so
forth. Clustering cannot help you solve those types of problems.
In addition, a cluster cannot protect you from software incompatibilities,
corrupt databases, viruses, catastrophes or mistakes. Clustering is
best implemented when a substantial proportion of your server downtime is
caused by hardware failure. If your organization’s leading cause of
downtime is the result of failures in administration, software, or
infrastructure, an investment in clustering technology may not reduce your
downtime.
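One way to apply this test is to tally your own outage records by cause and see what share hardware failure accounts for. The Python sketch below assumes a simple log of (cause, minutes) pairs; the entries shown are invented purely for illustration and are not measurements from any real environment.

```python
from collections import Counter

# Hypothetical outage log: (cause, minutes of downtime). Invented for illustration.
outages = [
    ("hardware", 120),
    ("misconfiguration", 300),
    ("power", 90),
    ("hardware", 60),
    ("software", 150),
]


def downtime_share(log):
    """Return each cause's fraction of the total downtime minutes."""
    totals = Counter()
    for cause, minutes in log:
        totals[cause] += minutes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cause: minutes / grand_total for cause, minutes in totals.items()}


shares = downtime_share(outages)
hardware_share = shares.get("hardware", 0.0)
print(f"Hardware failures account for {hardware_share:.0%} of downtime")
```

If the hardware share is small, most of your downtime comes from causes that clustering does not address.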
You first need to assess the reasons for
server downtime in your organization, look at the problems that clustering
solves, and then make a business decision as to whether clustering is an
appropriate solution. The primary focus of clustering is solving
problems that arise from hardware failure, such as a blown CPU, bad
memory, or the loss of an entire server. In addition, clustering allows
you to continue providing resources during planned outages that may cause
downtime for users. A cluster system can allow resources to be manually
moved—or failed over—to one server while the other is brought down to
perform a rolling upgrade, a configuration change, or other maintenance.
A rolling upgrade is the process of
applying a service pack or other hardware or software update to each node
in the cluster while the other node continues providing service. Rolling
upgrades are typically a series of stages:
- Move the groups from the node to be upgraded to another node.
- Take the node to be upgraded offline.
- Install the update on the offline node.
- Bring the upgraded node online.
- Move the groups back to the upgraded node.
Then, repeat this process on each node in
the cluster until the entire cluster is upgraded. Rolling upgrades are
very attractive from a server management standpoint because services are
only unavailable during the time it takes to move resources from one node
to the other.
By design, clusters help increase uptime, which in practice means reducing
both planned and unplanned downtime. When any mission-critical system fails, the
consequences can include lost revenue, interruption of services to
customers, and knowledge workers sitting idle. In
organizations of all sizes, failures incur costs in many areas. Hidden
costs often include damage to your reputation among customers, suppliers,
and end users, and the perception that your organization isn’t able to
satisfy customer needs.
Understanding the limitations of clustering is
just as important as understanding the benefits. While clustering protects
against the failure of a node in the cluster, it does not provide any
protection against other problems, such as network failures, database
corruption, loss of shared storage, or disasters.
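The rolling upgrade stages listed earlier can be sketched as a simple loop. This Python example is a model only: the function `rolling_upgrade`, the `upgrade` callback, and the node and group names are hypothetical, and no real cluster administration API is called.

```python
def rolling_upgrade(nodes, groups, upgrade):
    """Upgrade each node in turn while another node hosts its groups.

    nodes:   list of node names
    groups:  dict mapping group name -> node currently hosting it
    upgrade: callable applied to each node while it is offline
    """
    log = []
    for node in nodes:
        others = [n for n in nodes if n != node]
        if not others:
            raise ValueError("a rolling upgrade needs at least two nodes")
        target = others[0]
        moved = [g for g, host in groups.items() if host == node]
        for group in moved:               # 1. move groups off the node
            groups[group] = target
        log.append(f"offline {node}")     # 2. take the node offline
        upgrade(node)                     # 3. install the update
        log.append(f"online {node}")      # 4. bring the node back online
        for group in moved:               # 5. move the groups back
            groups[group] = node
    return log


upgraded = []
groups = {"sql": "node-a", "files": "node-b"}
log = rolling_upgrade(["node-a", "node-b"], groups, upgraded.append)
```

In this model every group is hosted somewhere at each step, so the only interruption users would see is the time taken by the moves in stages 1 and 5.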
Before implementing a cluster in your
environment, you should evaluate whether this solution really solves
enough of your problems to justify its cost. Clustering adds complexity to
your environment and administration. Therefore, it is important that you
understand and evaluate this technology in relation to your overall goals
and the needs of your network.
Fault Tolerance Defined
Fault tolerance is the ability of a system to continue functioning when
part of it fails (i.e., experiences a fault). The term is used to
describe disk subsystems (e.g., RAID), symmetric multiprocessing (SMP) systems,
redundant power supplies (with separate power sources), uninterruptible power supplies, redundant network
adapters, etc. Fault tolerance is designed to alleviate the problems
caused by component failures, power outages, or other like occurrences.
Disk subsystems that use RAID, which stands for Redundant Array of
Inexpensive Disks (or Redundant Array of Independent Disks, or Redundant Array
of Inexpensive Devices, depending on whom you ask), are considered fault tolerant. RAID refers to the
grouping of individual hard disks in a way that provides continued operation in
the event of a disk failure. There is both hardware RAID (i.e., a RAID
controller is used) and software RAID (i.e., the functionality is provided by an
operating system or application). There are many forms (levels) of RAID:
- RAID-0: Stripe set without parity. Stripe sets work
well with databases due to the usually random I/O nature of database
transactions. In RAID-0, data is divided into blocks and spread (in a
fixed order) across all of the disks in an array. RAID-0 improves read/write
performance by spreading operations across multiple disks, so that
operations can be performed independently and simultaneously. While
RAID-0 provides the highest performance, it does not provide any fault
tolerance. If a drive in a RAID-0 array fails, all of the data within
the stripe set becomes inaccessible.
- RAID-1: Mirroring. Disk mirroring provides a
redundant, identical copy of a disk. Data written to the
primary disk is also written to a mirror disk. RAID-1 provides fault tolerance
and generally improves read performance, but it may also degrade write
performance. Because dual-write operations can degrade system
performance, many mirror set implementations use duplexing, where each
mirror drive has its own disk controller. While the mirror approach provides
good fault tolerance, it is relatively expensive to implement. In
addition, only half of the available disk space can be used for
storage. The other half is needed for mirroring.
- RAID-5: Stripe set with parity. RAID-5 provides
redundancy of all data on the array, allowing a single disk to fail and be
replaced, in most cases, without system downtime. RAID-5 offers lower
performance than RAID-0 or RAID-1 but higher reliability and faster
recovery. RAID-5 uses the equivalent of one disk for storing the
parity strips, but distributes the parity strips across all the drives in
the array. The data and parity information are arranged on the disk array so
that they are always on different disks.
There are other implementations of RAID, such as RAID-0+1 (often grouped with the
closely related RAID-10), RAID-2, RAID-3, etc., but these are less common and are often
proprietary implementations unique to the hardware manufacturers that support them.
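To show why a RAID-5 array survives the loss of one disk, here is a minimal Python sketch of parity for a single stripe. It assumes XOR parity, the usual scheme for RAID-5, and a hypothetical four-disk array holding three data strips plus one parity strip; real controllers also rotate the parity strip across disks, which is not modeled here. The `usable_disks` helper simply restates the capacity figures given in the list above.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data strips, plus a parity strip computed from them.
data_strips = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_strips)

# Simulate losing the disk that held the second strip, then rebuild it
# from the surviving data strips and the parity strip.
survivors = [data_strips[0], data_strips[2], parity]
rebuilt = xor_blocks(survivors)


def usable_disks(level, disks):
    """Disks' worth of usable capacity for the three levels described above."""
    if level == 0:
        return disks          # striping only, no redundancy
    if level == 1:
        return disks / 2      # half the space holds the mirror
    if level == 5:
        return disks - 1      # one disk's worth holds parity
    raise ValueError(f"unsupported RAID level: {level}")
```

Because XOR is its own inverse, combining the parity strip with the surviving data strips yields the missing strip, which is what allows a single failed disk to be replaced and rebuilt.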
High-Availability Defined
By definition, the goal of a highly available system is to provide continuous
use of critical data and applications that keep businesses up and running,
regardless of planned or unplanned interruption. High availability refers to system uptime that approaches 100%.
For example, an availability level of 99.999%, calculated on a round-the-clock
basis, would mean that an organization would experience no more than about five minutes of
unscheduled downtime per year. A level of 99.99% translates to about 52.6 minutes
of downtime. A level of 99.9% translates to about 8.8 hours, and a level of 99%
equals about 3.7 days of downtime per year.
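The downtime figures follow directly from the availability percentage. The short Python sketch below does the arithmetic, assuming a 365-day year measured around the clock; the function name `max_downtime_minutes` is just an illustrative choice.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes(availability_percent):
    """Yearly downtime budget, in minutes, for a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for level in (99.999, 99.99, 99.9, 99.0):
    minutes = max_downtime_minutes(level)
    print(f"{level}%: {minutes:.1f} minutes "
          f"({minutes / 60:.2f} hours, {minutes / 1440:.2f} days) per year")
```

Each additional "nine" cuts the allowable downtime by a factor of ten, which is why the higher levels are so much harder to reach.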
The need for high availability is not limited to 24x7x365 environments.
Many applications must be available during normal business hours or during
critical time periods throughout the day. A system failure during these
critical periods is unacceptable for many organizations.
Alternatives to Microsoft Cluster Server and Windows 2000 Cluster Services
Alternatives include Vinca's Co-StandbyServer, Vinca's Octopus, and Network
Specialists' Double-Take. In addition, there are the shared storage clustering
solutions provided by Digital's Clusters for Windows NT, NCR's LifeKeeper, or
Veritas FirstWatch. Finally, there is the fault tolerance and site disaster
tolerance of Marathon's Endurance 4000.