Introducing Microsoft System Center Operations Manager 2007
Modern organizations of any size all have one thing in
common: their computing environment isn’t simple. Everybody’s world contains
hardware and software of different vintages from different vendors glued
together in different ways. Managing this world—keeping these diverse,
multifaceted systems running—is unavoidably complex.
Despite this complexity, two fundamental management requirements
are clear. The first is the need to monitor the hardware and software in the
environment, keeping track of everything from application availability to disk
utilization. The operations staff that keeps the environment running must have
a clear and complete picture of the current state of their world. The second
requirement is the ability to respond intelligently to the information this
monitoring produces. Whenever possible, this response should avoid incidents by
addressing underlying problems before something fails. In any case, the
operations staff must have effective tools for fixing failed systems and
applications.
The goal of Microsoft’s System Center Operations Manager
2007 is to meet both of these requirements. The successor to Microsoft Operations
Manager (MOM) 2005, the product is focused on managing Windows environments,
including desktops, servers, and the software that runs on them. It can also be
used to manage non-Windows systems and software, devices such as routers and
switches, and more. Released in early 2007, Operations Manager aims at
providing a solution for monitoring and managing a Windows-oriented computing
world.
The Challenges of Monitoring and Management
Think about what’s
required for effective monitoring and management in an enterprise computing
environment. Keeping track of what’s happening on the myriad of machines in an
organization means dealing with diverse software, including desktop and server
operating systems, databases, web servers, and applications, together with all
sorts of hardware, such as processors, disk drives, routers, and much, much
more. All of these components must inform the people managing this world of
their status.
This is bound to
generate lots of information. The operations staff will certainly need some
kind of dedicated interface that organizes this plethora of data into
understandable graphics and numbers. They’d also probably like a Web version of
this interface, an option that would increase their ability to manage this
world remotely. And for some scenarios, such as creating scripts, a command
line interface is the best choice. Yet while a variety of user interfaces are
required for working with management data as it’s generated, the ability to
generate reports on historical data is also essential. How can the people
responsible for maintaining this environment know how they’re doing without
some way to track their history? Real-time interfaces are certainly important,
but so is an intelligent way to examine long-term trends.
Here’s another challenge: No single organization—vendor or
end user—can afford to have people on staff who are expert in managing each
part of a complex IT world. Instead, a management and monitoring tool must
provide a way to apply packaged expertise via software. The product must also
be capable of providing a platform for third parties to create this kind of
package.
And there’s more. The
IT Infrastructure Library (ITIL) and the Microsoft Operations Framework (MOF) both
promote an IT Service Management (ITSM) approach. Rather than focusing solely
on the details of managing individual technologies, ITSM emphasizes the
services that IT provides to the business it’s part of. Given that the business
people who pay for it see IT entirely in terms of these services, this approach
makes perfect sense. Yet doing it successfully requires explicit support for
defining and monitoring distributed applications as a whole. An organization’s
email service is made up of many parts, for example, including email server
software, database software, and the machines this software runs on. Providing
a way to monitor and manage this combination as a single service improves IT’s
ability to offer the performance and reliability that the
business expects.
Effectively
monitoring and managing a modern computing environment requires addressing all
of these problems. The next section gives an overview of how Operations Manager
does this.
Addressing the Challenges: What System Center Operations Manager 2007 Provides
While Operations Manager provides a variety of functions,
three of them stand out as most important: providing a common foundation for
monitoring and managing desktops, servers, and more; taking a customizable,
model-based approach to management; and supporting service monitoring of
complete distributed applications. What follows describes each of these three.
A Common Foundation for Monitoring and Managing Desktops, Servers, and More
Despite the diversity of enterprise computing, a single
product can provide a broad foundation for a significant part of an
organization’s management challenges. Understanding how Operations Manager does
this requires a basic grasp of the product’s architecture. The figure below
shows its major components.
As the diagram shows, the software that makes up
Operations Manager is divided into servers and agents. The
servers, which run on Windows Server 2003 and the forthcoming Windows Server
codename “Longhorn”, divide into two categories:
• The Operations Manager management server. This server relies on an
operational database, and it’s the primary locus for handling real-time
information received from agents. As the diagram shows, it also provides an
access point for the product’s various user interfaces.
• The Operations Manager reporting server. This server relies on a data
warehouse, a database capable of storing large amounts of information received
from agents for long periods. The reporting server can run predefined and
custom reports against this historical data.
Unlike management and reporting servers, the Operations Manager
agent runs on both client and server
machines. This agent runs on Windows 2000, Windows XP, and Windows Vista
clients, as well as Windows 2000 Server, Windows Server 2003, and Windows
Server codename “Longhorn”. To manage non-Windows devices, such as routers and
switches, Operations Manager management servers and agents can connect to them using SNMP
or the newer WS-Management protocol. There’s also an option that allows
retrieving basic management information from Windows systems that aren’t
running agents.
Agents send four primary kinds of information to
management servers:
• Events, indicating that something interesting has occurred on the managed
system. An agent might send an event indicating that a login attempt has
failed, for instance, or that a failed hardware component has been brought back
to life.
• Alerts, indicating that something has happened that requires an operator’s
attention. For example, an agent might send an event for every failed login,
but send an alert if four failed logins occur within three minutes on the same
account.
• Performance data, regularly sent updates on various aspects of the managed
component’s performance.
• Discovery data, information about discovered objects. Rather than requiring
an operator to explicitly identify the objects to be managed, each agent can
discover them itself, a process that’s described later in this paper.
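The difference between an event and an alert can be sketched in a few lines of Python. This is only an illustration of the failed-login example above, not how Operations Manager implements rules: the class name, the tuple format, and the decision to reset the count after an alert are all assumptions.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 180  # "within three minutes"
THRESHOLD = 4         # "four failed logins"


class FailedLoginRule:
    """Hypothetical rule: every failure yields an event, a burst yields an alert."""

    def __init__(self):
        # account name -> timestamps (in seconds) of recent failures
        self._recent = defaultdict(deque)

    def on_failed_login(self, account, timestamp):
        """Return the items the agent would send for this one failure."""
        window = self._recent[account]
        window.append(timestamp)
        # Forget failures older than the three-minute window.
        while window and timestamp - window[0] > WINDOW_SECONDS:
            window.popleft()
        out = [("event", account, timestamp)]
        if len(window) >= THRESHOLD:
            out.append(("alert", account, timestamp))
            window.clear()  # assumption: start counting afresh after an alert
        return out
```

Four failures on one account at 0, 50, 100, and 150 seconds produce four events and, on the last one, an alert. The same four failures spread over five minutes produce only events.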
All of this information is sent to the operational
database and/or the data warehouse, and all of it can be accessed through Operations
Manager’s user interfaces. Operations staff will most often rely on the
Operations Manager console, a Windows application that can display
events, show alerts, graph performance over time, and more. A large subset of
the console’s functions can also be performed through the Operations Manager Web
console, providing browser-based access to this information. And for
creating scripts or for people who just prefer a command line interface, the
product also allows access via the Operations Manager command shell.
This broad foundation is essential for modern monitoring
and management. It’s not enough, though—more is required. How, for instance,
can a single product address the diversity of managed components in a typical
enterprise? How Operations Manager addresses this is described next.
Customizable, Model-Based Management
Any attempt to address the broad problem of monitoring and
management in a single product faces an unavoidable limitation: No one vendor
can have all of the specialized knowledge required to handle the wide range of
software and hardware that its customers use. Instead, what’s needed is a
generalized framework for packaging specialized management knowledge and
behavior, packages that can then be plugged into a common management foundation.
This is exactly what’s provided by Operations Manager’s management
packs (MPs). Each MP packages together the knowledge and behavior required
to manage a particular component, such as an operating system, a database
management system, a server machine, or something else. These MPs are then
installed into Operations Manager, as the figure below shows.
Since creating an MP requires specialized knowledge about
managing the component this MP targets, each one is typically created by the
organization that knows the most about that component. As the figure above
suggests, for example, Microsoft has created MPs for client and server versions
of Windows as well as for Exchange Server, SQL Server, and other Microsoft
products. Other vendors have created MPs for non-Microsoft software and
hardware about which they have specialized knowledge. Hewlett-Packard provides
an MP for its ProLiant server machines, for example, while Dell offers MPs for
its servers.
As the figure shows, each MP can contain several things,
including the following:
• Monitors, letting an agent track the state of various parts of a managed
component.
• Rules, instructing an agent to collect performance and discovery data, send
alerts and events, and more.
• Tasks, defining activities that can be executed by either the agent or the
console.
• Knowledge, providing textual advice to help operators diagnose and fix
problems.
• Views, offering customized user interfaces for monitoring and managing this
component.
• Reports, defining specialized ways to report on information about this
managed component.
When an MP is installed, its various parts wind up in
different places. The monitors and rules, for instance, are downloaded to the
agents on the appropriate machines, while the knowledge and reports remain on
the management and reporting servers. Wherever its contents are used, the goal
is always the same: providing the specialized knowledge and behavior required
to monitor and manage a particular component.
To get a sense of how the various components of an MP
might work together, imagine that an application running on some managed system
notices that it lacks sufficient disk space to function. This application
writes an event into the system’s event log indicating this, then shuts itself
down. The Operations Manager agent on this system continually monitors the
event log, and so it quickly notices this event. If the application’s MP
contains an appropriate rule, the agent will send a specific alert to the
management server when this event occurs. The operator sees the alert in the
Operations Manager console, and he also sees the MP-provided knowledge
associated with this alert. Reading this knowledge, he learns that he should
direct the agent to run a task that deletes the temp directory on the
application’s machine, then restart the application. This entire process, from
detection of the problem to its ultimate resolution, depends on the information
contained in the MP.
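To make the flow in this scenario concrete, here is a toy Python model of it. Real management packs are not Python objects; the `ManagementPack` class, the event identifier, and the task name are all invented for illustration, and the "temp directory" is just a list.

```python
from dataclasses import dataclass, field


@dataclass
class ManagementPack:
    """Toy stand-in for an MP, holding only the parts this scenario uses."""
    rules: dict = field(default_factory=dict)      # event id -> alert name
    knowledge: dict = field(default_factory=dict)  # alert name -> advice text
    tasks: dict = field(default_factory=dict)      # task name -> callable


def agent_handle_event(mp, event_id):
    """Agent side: map a logged event to an alert, or None if no rule matches."""
    return mp.rules.get(event_id)


# Hypothetical MP for the application in the scenario.
temp_files = ["a.tmp", "b.tmp"]  # stands in for the temp directory
mp = ManagementPack(
    rules={"APP_DISK_FULL": "Application stopped: low disk space"},
    knowledge={"Application stopped: low disk space":
               "Run the 'clear temp' task, then restart the application."},
    tasks={"clear temp": temp_files.clear},
)

alert = agent_handle_event(mp, "APP_DISK_FULL")  # the agent notices the event
advice = mp.knowledge[alert]                     # the operator reads the knowledge
mp.tasks["clear temp"]()                         # the operator has the agent run the task
```

Every step, from recognizing the event to knowing which task to run, is driven by data the MP supplies. The agent and console contain no application-specific logic.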
Of course, it would be better to avoid this problem in the
first place. One way to do this is to keep an eye on things such as free disk
space on the machine hosting this application, then inform an operator when a
problem looms. Doing this requires creating a model of what a healthy managed
component looks like, then indicating any deviation from its normal state. In
Operations Manager, this is exactly what monitors are for. Each MP defines a
set of objects that can be managed, then specifies a group of monitors
for those objects. These monitors keep track of the state of each object,
making it easier to avoid problems like application crashes before they occur.
In the language of Operations Manager, the set of monitors for a managed object
makes up a health model for that object. By tying together the health
models for the various objects on a system, an overall health model can be
created that reflects the state of the system as a whole.
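A minimal Python sketch of this idea follows. The three state names, the disk-space thresholds, and the "worst state wins" rollup policy are assumptions chosen for illustration; the paper does not specify how Operations Manager combines monitor states.

```python
from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def free_disk_monitor(free_fraction, warn_below=0.20, critical_below=0.05):
    """Hypothetical monitor: map the fraction of free disk space to a state."""
    if free_fraction < critical_below:
        return Health.CRITICAL
    if free_fraction < warn_below:
        return Health.WARNING
    return Health.HEALTHY


def roll_up(states):
    """Assumed policy: an object is only as healthy as its least healthy monitor."""
    return max(states, default=Health.HEALTHY)
```

With these definitions, `roll_up([free_disk_monitor(0.10), Health.HEALTHY])` yields `Health.WARNING`, so an operator is told about shrinking disk space before the application fails.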
Allowing
each MP to define its own set of managed objects makes sense. Yet the best an
MP’s creators can do is define generic objects; they can’t know exactly what’s
on any given system. For example, the SQL Server MP defines an object
representing a database. When this MP is installed on a real system, that
machine might have one, two, or more actual databases. How are these concrete
instances of the MP’s generic object type found? One approach would be to
require an operator to identify each instance manually, a task that nobody
would enjoy. Instead, Operations Manager allows each MP to include specific discovery
rules (also called just discoveries) that let the agent locate these
instances. The goal is to make finding the things that need to be managed as
straightforward as possible.
Providing this generalized approach to defining management
knowledge and behavior requires a common technology to create these
definitions. Like other products in the System Center family, Operations
Manager relies on an XML-based language called the System Definition Model
(SDM) to do this. All MPs are expressed in SDM, providing a standard
format for their creators to use. Defining MPs with SDM also implies a more
general, less Windows-specific infrastructure for management. Although
Operations Manager remains a Windows-oriented product, it’s significantly less
wedded to the Microsoft world than was its predecessor.
In
fact, SDM is the basis of an in-progress standard known as the Service
Modeling Language (SML). Embraced by Microsoft, BEA, BMC, CA, Cisco, Dell,
HP, IBM, and Sun, SML will provide a vendor-neutral foundation for describing
managed systems. The value of a model-based approach to this problem is clear,
and it’s a fundamental aspect of Operations Manager.
Service Monitoring for Distributed Applications
The goal of virtually every IT department is to provide
services to the organization it’s part of. Monitoring and managing the various
parts of the IT environment is an essential aspect of doing this. Yet business
people don’t really care about the state of individual components. Instead,
their concern is for the services they see. Can they send and receive email?
Are the applications they need right now running effectively? This
service-based concern makes sense, since it reflects what’s most important to
the organization as a whole. Yet each service is likely made up of a number of
underlying components, including both software and hardware. Looking at each of
an application’s components separately isn’t enough.
What’s needed is a way to monitor and manage a distributed
application—the combination of components that underlie a particular service—as
a whole. Operations Manager provides this through service monitoring.
The diagram below illustrates this idea.
Think, for example, about a custom ASP.NET application. As
the figure suggests, this application’s main components might include IIS, the
application itself, SQL Server, a specific database, the network used to
connect these components, and more. From a technology point of view, all of
these are distinct pieces, and without some way to group them together, an
operator would be hard pressed to know that they actually comprise a single
distributed application. From the perspective of a business user, however, all
that matters is the service this entire application provides. If any part of it
is malfunctioning, the entire service is likely to be unavailable. Letting the
operator know that, say, a failed disk drive is actually part of this
business-critical application can help that operator understand the importance
of getting it back on line as quickly as possible. Rather than viewing their
world as discrete components, operations staff can instead have a perspective
that’s closer to what their customers see: a service-based view.
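The grouping described here can be pictured as a simple mapping from services to the components beneath them. The Python sketch below is illustrative only: the service and component names are invented, and treating a service as unavailable when any one component fails is a simplifying assumption that follows the "is likely to be unavailable" reasoning above.

```python
# Hypothetical services and the components each one depends on.
SERVICES = {
    "Order entry (ASP.NET)": ["IIS", "order-app", "SQL Server", "orders-db", "disk-7"],
    "Email": ["mail-server", "SQL Server", "disk-2"],
}


def services_affected_by(component, services=SERVICES):
    """Answer the operator's question: which services does this component belong to?"""
    return sorted(name for name, parts in services.items() if component in parts)


def service_is_available(name, failed, services=SERVICES):
    """Assumed policy: a service is available only if none of its components has failed."""
    return not any(part in failed for part in services[name])
```

Here `services_affected_by("disk-7")` returns `["Order entry (ASP.NET)"]`, which is exactly the context that tells an operator a failed drive matters to a business-critical application.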
Getting a grip on Operations Manager requires
understanding a number of different concepts. None of these ideas are more
fundamental than management servers and agents, and so the place to begin is by
looking more deeply at these bedrock parts of the product.
Understanding Management Servers
Management servers are at the center of Operations
Manager. Agents send them alerts, events, performance data, and more, and
they’re also the access point for the product’s user interfaces. While the
basic architecture is straightforward, as shown earlier, understanding Operations
Manager requires knowing a bit more about management servers. This section
takes a slightly more detailed look at this important part of the product.
Root Management Servers
Every agent communicates with exactly one management
server. While many organizations could potentially meet their needs with a
single management server, it’s common to install two or more management
servers, then divide agents among these servers. In this case, the first server
that’s installed becomes the root management server. This root server
communicates with any other management servers that have been installed, and it
also communicates with its own agents.
The root management server performs several unique
functions. All of Operations Manager’s user interfaces connect only to the root
management server, for example, as shown earlier. Given this central role, it’s
common to cluster a root management server. (All of these connected management
servers rely on a single operational database, so it’s also a good idea to
cluster it.) If a root management server fails, an administrator can promote
another management server, allowing it to become the new root.
Management Groups
A collection of Operations Manager servers and agents is
known as a management group. Each management group contains one root
management server, zero or more other management servers, an operational
database, and zero or more agents. The figure below illustrates how the various
parts of a management group fit together.
Operations Manager can support many agents in a single
management group (the exact number depends on a variety of factors), but
organizations that need more than this can install multiple management groups.
And although it’s neither required nor shown in the diagram, a management group
can also contain a reporting server.
As
mentioned earlier, each agent is assigned to one primary management server. If
its assigned server becomes unavailable, an agent will automatically begin
communicating with another management server in its management group. When its
primary management server reappears, the agent will switch back to it. In both
cases, no administrative intervention is necessary. While an administrator can
explicitly control which management server an agent should communicate with if
its primary server fails, this isn’t required.
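The failover behavior can be sketched as a small selection function. This is a simplified Python illustration, not the agent's actual logic: the function name, the ordered list of alternatives, and the reachability check are assumptions.

```python
def choose_server(primary, others, is_reachable):
    """Pick the server an agent should talk to right now.

    Prefers the primary, so the agent fails back as soon as the primary
    reappears. Otherwise it uses the first reachable alternative, or None.
    """
    if is_reachable(primary):
        return primary
    for server in others:
        if is_reachable(server):
            return server
    return None
```

Because the primary is always checked first, calling the function again after the primary returns gives the switch-back behavior with no extra state.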
Another option is to create tiered management
groups. With this approach, the root management server in a local
management group is associated with the root management server in a connected
management group. Once this is done, it’s possible to monitor and manage the
connected group from the console of the local group. There are some
limitations—the console doesn’t support performing all administrative actions
in the connected group—but this option can make sense in some situations. For
example, creating tiered management groups can be a useful way to connect
groups within any large Operations Manager deployment. It can also make sense
when connecting a management group at headquarters with a subordinate
management group in a branch office, particularly if the branch is accessed via
a lower-speed connection.
Especially in large enterprises, installing more than one
systems management product is common. Making a multi-vendor management
environment work well can require connecting these products together. To allow
this, Operations Manager includes the Operations Manager Connector Framework
(MCF). MCF allows other management products to exchange alerts and other
information with an Operations Manager management server, making it easier to
use multiple tools in a single organization.
Some organizations, such as government agencies, have a
legal requirement to track failed logins, multiple login attempts, and other
security-related aspects of their IT environments. To provide direct support
for this, Operations Manager includes the Audit Collection Service (ACS). This
service relies on its own database maintained by a management server. If ACS is
used, relevant security information is sent to the ACS database rather than to
the standard operational database, making it simpler for organizations to
comply with their legal mandate.
Understanding Agents
Management servers are an important part of Operations
Manager, but they’d be useless without agents. This section takes a closer look
at what agents do and how they do it.
Installing Agents
Before an agent can do anything useful, it must be
installed on a target computer. One option is to install agents manually on
individual machines. Yet especially in a large organization, installing agents
individually on all of the managed systems can be an onerous task. To make
installation easier, Operations Manager provides a wizard that lets an operator
query Active Directory for machines matching some criteria, then install agents
on all of them. And for organizations that use Microsoft Systems Management
Server (SMS) or its successor, System Center Configuration Manager 2007, either
of these products can also be used to install agents.
However it’s installed, a new agent needs to determine
which management server it should communicate with. For installations that use
Active Directory, the wizard allows specifying the management server each agent
should talk to. For manually installed agents, the person performing the installation
can explicitly specify a management server. If no server is specified, a
manually installed agent as well as one installed by a tool such as
Configuration Manager will contact Active
Directory when it first starts running to learn which management server it
should communicate with.
How Agents Gather and Send Information
Once it’s installed, an agent’s behavior is defined
entirely by the management packs that are downloaded to that machine. The
monitors, rules, and other information in each MP tell the agent what objects
it should monitor and determine what information it sends to a management
server. To acquire this information, agents can do several things, including:
• Watching the event log on this machine. An agent reads everything that’s
placed in this log.
• Accessing Windows Management Instrumentation (WMI). WMI is an interface to
the Windows operating system that allows access to a variety of information
about the hardware and software on a machine. This interface is Microsoft’s
implementation of the Web-Based Enterprise Management (WBEM) standard created
by the Distributed Management Task Force (DMTF).
• Running scripts provided by a management pack to collect specific
information.
• Accessing performance counters.
• Executing synthetic transactions. A synthetic transaction accesses an
application as if it were a user of that application, such as by attempting to
log in or requesting a web page. In some situations, such as with applications
that generate little useful management information, synthetic transactions are
the best way to learn about an application’s state. They can also be used to
determine current characteristics, such as whether the login process is taking
an abnormal amount of time.
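The last item, synthetic transactions, lends itself to a short sketch. The Python below is a generic illustration under stated assumptions: the function name, the two-second threshold, and the result dictionary are invented, and `action` stands for whatever user-like step (a login, a page request) is being exercised.

```python
import time


def run_synthetic_transaction(action, slow_after_seconds=2.0, clock=time.perf_counter):
    """Run `action` as a user would, recording whether it worked and how long it took."""
    start = clock()
    try:
        action()
        succeeded = True
    except Exception:
        succeeded = False
    elapsed = clock() - start
    return {
        "succeeded": succeeded,
        "elapsed": elapsed,
        # Only a successful transaction is classed as "slow"; a failure is reported as such.
        "slow": succeeded and elapsed > slow_after_seconds,
    }
```

The `clock` parameter exists only so the timing can be replaced in tests. In normal use the default is enough.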
Based on the rules and monitors in the management packs installed
on its system, an agent sends events, alerts, and performance data to its
associated management server. The management server writes all of this
information to both the operational database and the data warehouse, making it
available for immediate use and for creating reports. Management servers can
also communicate with agents, instructing them to do things such as change a
rule in a management pack, install a new management pack, or run a task.
With MOM 2005, 90% of the traffic between an agent and a
management server was commonly performance data sent by rules. To minimize this
traffic (and the storage requirements it implies), Operations Manager allows
setting the relevant rules in a management pack so that no new performance
information is transmitted unless a value has changed by, say, at least 5%. The
management server can then infer that nothing has changed significantly if it
receives no new information. For example, think about an agent that’s
monitoring free disk space on a server. This number is likely to be the same
over many hours, and so it makes sense to send performance information only
when a significant change occurs.
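This kind of change-based reporting can be sketched as follows. The Python class is an illustration only: its name and interface are invented, and comparing each sample against the last value actually sent (rather than the previous sample) is an assumption, made so that slow drift is still reported eventually.

```python
class DeltaReporter:
    """Send a performance sample only if it differs enough from the last one sent."""

    def __init__(self, min_change=0.05):  # 0.05 corresponds to the 5% example
        self.min_change = min_change
        self._last_sent = None

    def should_send(self, value):
        if self._last_sent is None:
            self._last_sent = value  # always send the first sample
            return True
        baseline = abs(self._last_sent)
        change = abs(value - self._last_sent)
        if baseline == 0:
            significant = change > 0
        else:
            significant = change / baseline >= self.min_change
        if significant:
            self._last_sent = value
        return significant
```

For free disk space readings of 100, 101, 103, 106, 106, and 100 GB, only the first, fourth, and sixth are sent. The management server treats the silence in between as "no significant change."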
It’s worth pointing out that not every shutdown of an
application or machine indicates a problem; scheduled maintenance can require
scheduled downtime. To prevent an agent from sending needless alerts in this
case, an operator can put some or all of the objects on a system into maintenance
mode. If just one database needs fixing, for example, that single object
could be placed into maintenance mode. Similarly, an entire application or an
entire machine can be placed in maintenance mode if necessary. The goal is to
avoid distracting operators with meaningless alerts during scheduled shutdowns.
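The scoping described here, from one database up to a whole machine, can be pictured with a small registry keyed by object path. The Python below is a sketch under assumptions: the slash-separated path format and the class and method names are invented for illustration.

```python
class MaintenanceRegistry:
    """Track which objects are in maintenance mode (hypothetical path format: host/app/object)."""

    def __init__(self):
        self._in_maintenance = set()

    def start(self, object_path):
        self._in_maintenance.add(object_path)

    def stop(self, object_path):
        self._in_maintenance.discard(object_path)

    def suppresses(self, object_path):
        """True if the object itself, or anything that contains it, is in maintenance mode."""
        parts = object_path.split("/")
        return any("/".join(parts[:i]) in self._in_maintenance
                   for i in range(1, len(parts) + 1))
```

Placing `server01/sqlserver/orders-db` in maintenance mode silences alerts for that database alone, while placing `server01` in maintenance mode silences everything on the machine.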
Like any other software, agents execute under some
identity. A simple approach would be to have the entire agent run under a
single identity using a single account. While this works, it’s not an optimal
solution (although it is what’s done in MOM 2005). The problem with this is that
the agent’s identity needs to have the union of all permissions required by the
management packs installed on that system. The result is that agents tend to
have identities with lots of privileges, something that doesn’t make most IT
operations people especially happy. To avoid this problem, Operations Manager
introduces the idea of run-as execution. Rather than assign a single
identity to an agent, an administrator can instead define individual
identities—separate accounts—for different things this agent does. If
necessary, individual parts of a management pack, such as a monitor or a rule,
can be assigned specific identities, then run as that identity. Rather than
assigning agents a single account with many privileges, each function the agent
performs can instead have only the permissions it needs.
Agents communicate with management servers using a
Microsoft-defined protocol. Each agent maintains a queue of information to be
sent, which allows prioritizing traffic—alerts always go to the front of the
queue. If connectivity is lost, the queue stores information until the
management server is reachable again. The communication protocol between agents
and management servers also provides compression to reduce the bandwidth
required, along with Kerberos-based security. Both the management server and
the agent must prove their identities using Kerberos mutual authentication, and
information sent between the two is encrypted using the standard Kerberos
approach.
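The queuing behavior can be sketched with a priority queue. This Python illustration makes several assumptions: the class name, the two priority levels, and first-in-first-out ordering within a level are invented, and compression and Kerberos are left out entirely.

```python
import heapq
import itertools

# Assumed priorities: alerts jump the queue, everything else is equal.
PRIORITY = {"alert": 0, "event": 1, "performance": 1, "discovery": 1}


class OutboundQueue:
    """Hold outgoing items, releasing alerts first and nothing while disconnected."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps arrival order within a priority

    def put(self, kind, payload):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._order), kind, payload))

    def drain(self, connected):
        """Return everything to send now, or an empty list if the server is unreachable."""
        if not connected:
            return []  # items stay queued until connectivity returns
        sent = []
        while self._heap:
            _, _, kind, payload = heapq.heappop(self._heap)
            sent.append((kind, payload))
        return sent
```

If performance data and an event are queued before an alert arrives, the alert is still the first item sent once the management server is reachable.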
For agents that aren’t part of a Windows domain, such as
those running on web servers in a DMZ, security between agents and management servers can
use certificates installed on both sides rather than Kerberos. A management
server can also use certificate-based security to communicate with another
management server. Referred to as a gateway server, this second
management server might be in an untrusted Active Directory forest, for
example. This option is also useful for a service provider that wishes to
manage other organizations by connecting to their management servers across the
Internet.
Working with Many Agents: Managing Clients
Managing desktops is different from managing servers in a
number of ways. One of the most important—and most obvious—is that there are a
lot more desktop machines than there are servers. To allow operators to work
effectively with large numbers of clients, Operations Manager provides a couple
of options.
One approach, called aggregate client monitoring,
exploits the fact that operators typically don’t need to monitor the exact
state of every client machine. Instead, an operator can define client groups,
then keep track of the state of the entire group. She can still run tasks and
do other things to individual machines, but having a single state available for
all of the machines in the group makes monitoring easier. This option also
allows running reports on client groups showing things such as the percentage
of machines that have acceptable performance each month or the number of
systems that experienced downtime when upgraded to Office 2007. In fact, it’s
likely that most organizations will find that the ability to create these broad
reports is the most valuable aspect of aggregate client monitoring.
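The kind of group-level figure such a report contains can be computed very simply. The Python function below is a sketch: its name, its input format (machine name mapped to whether performance was acceptable), and its output fields are assumptions made for illustration.

```python
def group_summary(client_states):
    """Summarize a client group: how many machines, and what share are performing acceptably."""
    total = len(client_states)
    ok = sum(1 for acceptable in client_states.values() if acceptable)
    percent_ok = 100.0 * ok / total if total else 0.0
    return {"machines": total, "percent_acceptable": percent_ok}
```

A group of four desktops with one underperforming machine reports 75.0 percent acceptable, a single number an operator can track month over month without inspecting each client.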
A second option for working with clients effectively is mission-critical
client monitoring. Here, an operator chooses specific desktop machines to
monitor directly. Operations staff in a financial services firm might choose to
include each trader’s desktop machine, for example, while those in a retail
environment might specify all of the critical point-of-sale systems. This
approach lets the most important clients be monitored directly without
requiring that every desktop machine get this level of attention.
Combining the two approaches is also possible. IT
operations staff might use aggregate monitoring to manage most desktops in
groups, for example, while still choosing specific clients for mission-critical
monitoring. And like most things in Operations Manager, how a client is
monitored is determined by the management pack that’s installed on that system.
Working with No Agents: Agentless Management
Not everything that needs to be managed is capable of
running an Operations Manager agent. Because of this, it’s also possible to
manage systems without agents. The absence of an agent limits what can be done,
but this approach can still be useful in some situations.
One option, agentless exception monitoring (AEM),
relies on the Windows Error Reporting client, a standard part of Windows,
rather than an Operations Manager agent. This client notices application and
operating system crashes, then asks the user for permission to send this
information to Microsoft (a prompt with which most people are familiar). With
AEM, this information can be sent to Operations Manager servers, letting
operations staff examine it, create reports based on it, and decide whether it
should be forwarded to Microsoft’s main database. While AEM provides only
limited information about Windows machines, it does offer a way to track some
important aspects of a machine and the applications running on it.
Another option, one that targets non-Windows devices, is
the ability to monitor other systems using SNMP or WS-Management. Using this
approach, Operations Manager can work with routers, switches, printers, and
anything else that supports either of these standard protocols. The diagram
below shows a simple illustration of how this might look.
As the diagram shows, both management servers and agents
are capable of monitoring devices (although agents are the most common choice).
While Operations Manager provides built-in support for SNMP and WS-Management,
this kind of monitoring also requires installing a management pack that knows
how to work with the device being monitored. Management packs that allow this
are available for equipment from Cisco, HP, and other vendors.
It’s fair to say that the focus of Operations Manager is
monitoring and managing Windows desktops and Windows servers. Still, the
ability to work with other kinds of devices can also be important. The
product’s support for SNMP and WS-Management, together with the appropriate
management packs, makes this possible.
Agents generate a large amount of information. To let
operations staff use this information effectively, Operations Manager provides
two options: immediate access via interactive user interfaces and the ability
to run historical reports. This section looks at both.
User Interfaces
As shown earlier, Operations Manager includes three
distinct user interfaces: the console, the Web console, and the command shell.
All three are useful, and understanding Operations Manager requires knowing
something about each one.
The Console
The Operations Manager console is the primary interface
for most users of the product. Information sent by agents, including events,
alerts, and performance data, can be written into the operational database, and
so the console presents a current view of what’s happening in the environment.
Operations staff can also run reports from the console, providing them with a
historical perspective.
Presenting the vast amount of available information in a
coherent way is challenging. Operations Manager addresses this challenge by
displaying monitoring information in a number of different views, each
accessible through the console. The best way to get a sense of what this looks
like is to see some of the most important of these views.
The screen shot below shows the console’s State view. This
example shows the state of SQL Server’s database engine, but State views are
available for many objects in the environment. As this screen shows, this view
provides a quick way to see what parts of the object are healthy (those with
green labels) and which are not (those with red labels). The content of this
view is derived from information sent by the monitors defined in the management
pack for this component.
Having a summary
picture of an object’s state is useful. It’s also important to know when
something has happened to an object that requires attention. The console’s
Alerts view provides this. As the screen shot below illustrates, this view
shows the active alerts in this managed environment. In this example, two
databases are offline, and so two alerts are displayed. Details for one of these
alerts are shown in the lower pane, including knowledge supplied by the
management pack for this component. As described earlier, the goal of this
knowledge is to help the operator resolve whatever is causing this alert.
Both monitors and
rules can send an alert, and either one can also send an event. Operations
Manager provides an Events view to display these events, although that view
isn’t shown here. Performance data, however, is sent solely by rules. To
display this information, the console provides a Performance view, an example
of which is shown below. This example graphs the performance of a server
machine’s processor, but management packs can include rules that send a variety
of other performance data.
Having different
views is nice—in fact, it’s essential—but being able to see several things at
once is also useful. To allow this, the Operations Manager console lets its
users create Dashboard views. Each dashboard shows a customized combination of
other views. For example, the screen shot below shows a dashboard created to
monitor SQL Server, and it includes state information, performance data (this
time showing free disk space), and more. In a typical organization, dashboards
are likely to be a common approach to monitoring the computing environment.
The Web Console
Most often, an operator will use the Operations Manager
console, which typically runs on a machine that’s inside an organization’s
firewall. Yet what happens when the operator is at home or in a hotel room, but
still needs access to Operations Manager? The Web console was created for
situations like these. Using this tool, an operator can perform a (large)
subset of the functions possible with the main Operations Manager console.
Like the main console, the Web console provides a variety
of views, including a State view, an Alerts view, a Performance view, a
Dashboard view, and more. In general, these views look much like their counterparts
in the main console. Here’s the Web console version of the Alerts view, for
instance:
As in the main console Alerts view, this one shows active
alerts, then allows a closer look at an alert’s details. This once again
includes any knowledge supplied by the management pack about how to resolve
this alert. Other Web console views provide similar information with similar
layouts to the corresponding main console views.
The Web console isn’t intended to replace Operations
Manager’s main console. Instead, providing a Web-based option lets some of the
most important management functions be performed from any place with an
Internet connection. The goal is to make life better for the people who keep
distributed environments running.
The Command Shell
Graphical interfaces are usually the right choice for
monitoring an environment. How else could a range of diverse information be
displayed in an intelligible way? Given this reality, it’s fair to say that the
Operations Manager console and the Web console will be the most popular
interfaces to this product. Yet there are cases where a standard graphical
interface isn’t the best option. While it’s great for displaying information, a
point-and-click approach can be inefficient and slow for running commands or creating
scripts. In situations like this, a traditional command line interface can be a
better choice.
To allow this, Operations Manager provides the command
shell. Built using Microsoft PowerShell, it gives users a command line
interface as well as the ability to create programs called cmdlets.
Operations Manager provides a set of built-in cmdlets, such as get-Alert to
access alerts on a particular managed component, get-ManagementPack to learn
about an installed management pack, and others. Its users can also create their
own cmdlets. For example, suppose an operator wishes to disable all rules in
all management packs that target databases. Doing this manually via the console
would be possible, but it would also be painful. Writing a script for this,
perhaps relying on one or more built-in cmdlets, would probably be easier.
Other Options
With three user interface options—the console, the Web
console, and the command shell—it might seem like Operations Manager covers all
of the possible bases. But what if a third party, such as another software firm
or in-house developers at a large organization, wishes to create a custom
interface to the product? To allow this, Operations Manager provides a software
development kit (SDK). This set of programmable interfaces makes available all
of the functionality provided by the console, and so third parties can create
software that does anything the console allows. While this approach probably
won’t be a mainstream choice, it’s an important option to have in some cases.
Interactive user interfaces are certainly
important—they’re essential—but nobody spends all of their time in front of a
screen. Yet problems can arise that require attention even when no one sees an
on-screen alert. To handle cases like this, Operations Manager allows operators
to determine which alerts should send notifications. For example, an operator
might wish to receive a notification for any alert generated by a Windows
server system in her firm’s main office that has remained unresolved for more
than 60 minutes. This notification might be sent as an email, an instant
message, an SMS text message, or something else, and it provides a way to reach
an operator who’s not currently sitting at the console. The goal is to allow
people to learn about problems that need their attention no matter where they
might be.
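The notification example above can be pictured as a filter applied to the set of active alerts. The following is a minimal Python sketch of that idea only; it is not Operations Manager's notification mechanism, and the `Alert` fields and the `needs_notification` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    # Hypothetical fields chosen for this illustration.
    source_type: str      # e.g. "Windows Server"
    location: str         # e.g. "Main Office"
    raised_at: datetime
    resolved: bool = False


def needs_notification(alert, now, max_age=timedelta(minutes=60)):
    """True for an unresolved Windows Server alert in the main office older than max_age."""
    return (
        not alert.resolved
        and alert.source_type == "Windows Server"
        and alert.location == "Main Office"
        and now - alert.raised_at > max_age
    )


now = datetime(2007, 3, 1, 12, 0)
alerts = [
    Alert("Windows Server", "Main Office", now - timedelta(minutes=90)),
    Alert("Windows Server", "Main Office", now - timedelta(minutes=10)),
    Alert("Windows Server", "Branch Office", now - timedelta(minutes=90)),
]
# Only the first alert matches: right source, right office, and older than 60 minutes.
to_notify = [a for a in alerts if needs_notification(a, now)]
```

Each alert in `to_notify` would then be handed to whatever channel the operator chose, such as email or an SMS message.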
Reporting
Interactive access to management information is
fundamental to effective management. Yet seeing trends and understanding
long-term behavior of the managed components requires more than an interactive
interface. The ability to generate reports is also essential.
As shown earlier, reporting in Operations Manager depends
on a reporting server. This server is built on SQL Server Reporting Services,
although it makes some additions to this base technology. Relying on the
Operations Manager data warehouse, a reporting server can be installed on the
same machine as a management server or on its own machine. Management servers
send data directly to the data warehouse—there’s no need to move it manually
from the operational database before running reports.
Operations Manager provides a number of built-in reports.
Among others, these generic reports include:
• Performance reports, which can display the performance of various things
over a specified period of time.
• Alert reports, providing a view into the alert histories of managed
components.
• Event reports, allowing long-term tracking of events sent by a component.
• Availability reports, showing the history of availability for managed
components.
For example, the performance report below shows CPU
utilization on a Windows Server machine over a ten-day period. As in SQL Server
Reporting Services, reports can be created as PDF files, as shown here, or in
other formats.
For all of the built-in reports, IT operations staff can
determine the components that should be included in the report, set the time
span covered (including defining relative dates, e.g., “two days ago”), and
control other options. Once they’re defined, reports can be run on demand or
scheduled to run regularly. A performance report might be set to run at 10 pm
each Sunday night, for example, while availability reports might run daily or
at other intervals.
Operations Manager reports can also be used in other ways.
A report can be interactive, for instance, so someone looking at an
availability report might click on a particular machine in that report, then be
presented with the console’s current State view for this machine. It’s also
possible to see other views, execute tasks, and run other reports directly from
within the current report.
Although reports must be run from the console—the Web
console can’t be used—their output can be accessed directly from the Web. A
report can be sent to a document library in Windows SharePoint Services, for example,
making it easier for IT managers and others who don’t commonly have direct
console access to view them. In fact, some of Operations Manager’s built-in
reports are specifically targeted at IT managers rather than more technically
oriented operations people.
Reporting is a core part of most management technologies,
and Operations Manager is no exception. By providing its own data warehouse,
the product allows an organization to maintain large amounts of historical data
about the computing environment. By providing a range of built-in reports, it
lets operations staff and others access this information in useful ways.
Controlling Access: Role-Based Security
Operations Manager potentially allows access to anything
in the managed environment. Yet letting every user of this tool have full
access to everything isn’t what most organizations want. There must be some way
to control who can access information, run tasks, and do other management work.
The approach Operations Manager uses to do this is called role-based
security.
A role is defined as the combination of a profile
and a scope. A profile defines what operations someone can do, while a
scope specifies the objects on which she’s allowed to perform these operations.
The intersection of the two yields a limited set of objects on which an
operator is allowed to perform only a defined set of operations.
For example, a profile might allow someone to view alerts
and run tasks, but not allow him to change any rules in management packs.
Operations Manager provides a group of built-in profiles, including the
following:
• Administrator: A user with this profile can do anything. All operations are
allowed on any objects (in fact, scopes don’t apply to administrators). This
profile will typically be limited to a small number of people in an
organization, since most operations staff won’t need this level of power.
• Author: As the name suggests, a user with this profile can make changes to
the environment. This includes things such as creating and modifying rules and
monitors in installed management packs. An author can also monitor the
environment, viewing events, alerts, and other information.
• Operator: Users with this profile are expected to focus primarily on
monitoring the environment. Accordingly, they can view events, alerts, and
other information, but they’re not allowed to create new rules, define new
management packs, or make other changes.
Whatever profile a role is based on, a person in that role
can only perform operations on the objects specified by its scope (with the
exception of roles where scopes don’t apply, such as administrator). For
example, someone whose job was solely focused on keeping the email system
running might be given a role with an operator profile and a scope containing
only Exchange Server objects, views, and tasks. A co-worker whose
responsibilities were focused on managing an SAP application might be given a
role with an author profile and a scope containing only SAP-related objects,
views, and tasks. Standard roles are also defined, such as a Report Operator
role that controls the ability to run reports of various kinds. Used
intelligently, roles can give an organization fine-grained control over what
its operations staff is allowed to do.
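The idea that a role is a profile combined with a scope can be expressed as a simple access check. The Python sketch below is an illustration under stated assumptions, not the product's security API: the operation names in `PROFILES`, the `Role` type, and `is_allowed` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical operation names; the real profiles define their own sets.
PROFILES = {
    "Operator": {"view_alerts", "view_events", "run_tasks"},
    "Author": {"view_alerts", "view_events", "run_tasks", "edit_rules", "edit_monitors"},
}


@dataclass(frozen=True)
class Role:
    profile: str
    scope: frozenset  # names of the objects this role may act on


def is_allowed(role, operation, obj):
    """Administrators bypass scope; other roles need both profile and scope to match."""
    if role.profile == "Administrator":
        return True
    return operation in PROFILES[role.profile] and obj in role.scope


exchange_operator = Role("Operator", frozenset({"Exchange Server"}))
sap_author = Role("Author", frozenset({"SAP"}))
admin = Role("Administrator", frozenset())
```

Here `is_allowed(exchange_operator, "edit_rules", "Exchange Server")` is false because the operator profile lacks that operation, while `is_allowed(exchange_operator, "view_alerts", "SAP")` is false because SAP is outside the role's scope.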
Operations Manager provides a general foundation for
monitoring and managing systems and software. This foundation knows nothing
about how to do these things for specific components, however. Providing this
specialized knowledge is the responsibility of management packs.
As mentioned earlier, each MP is described using the
XML-based SDM. The Operations Manager console provides an authoring view that
can be used to create MPs, and Microsoft has announced plans to provide a
standalone MP authoring tool. And since they’re just XML, a determined author
could theoretically create an MP using Notepad, although this isn’t likely to
be a very productive approach. The important thing to understand is that an MP
primarily contains configuration information, not executable software.
Each MP is installed into the operational database, with
different parts of its contents then downloaded and used by different parts of
Operations Manager. It’s worth noting that the MP format used by Operations
Manager isn’t the same as that used by its predecessor, MOM 2005. Because of
this, MPs created for this earlier technology must be converted using a
Microsoft-supplied tool, then re-installed into the operational database.
MPs are the brains of Operations Manager, determining
exactly how monitoring and management happen for each managed component.
Understanding them requires knowing something about what they contain and how
they function. The rest of this section digs a bit deeper into the structure
and contents of this important aspect of Operations Manager.
What Management Packs Do: Some Examples
To get a sense of what management packs do and of how
diverse their functions can be, it’s useful to look briefly at a few examples.
In all of these cases, the MPs provide information about the availability of
this component—is it running?—and basics such as how much space is left on the
disk it relies on. Each one also provides performance information about the
component it targets. Beyond these basics, however, different MPs provide quite
different things.
For example, MPs that target Windows server operating
systems let an operator determine things such as which Windows services are
running, which applications (if any) are crashing repeatedly, and whether IP
address conflicts are occurring. Operations Manager also provides MPs for
Windows client systems that let an operator learn whether this machine can
access the Internet, read from and write to file servers, and perform other
functions.
Just as Operations Manager supports both server and client
operating systems, it also supports MPs for applications running in both
places. Once again, the basics are the same, with each MP able to report
whether an application is running, monitor its performance, and provide other
standard information. Each one also provides application-specific information.
For example, the MP for Exchange Server lets an operator see detailed
information about mailboxes and messages, while the SQL Server MP provides data
about the number of deadlocks that occur, the execution of stored procedures,
and more.
Microsoft also provides an MP for Microsoft Office that
lets an operator see whether Office applications are crashing or hanging,
measure their resource consumption, and determine how responsive they are. He
can also determine whether they’re working normally, such as checking whether
Outlook can send and receive mail. All of these things are vitally important to
users, and so they’re also important to operations staff.
Describing What’s Managed: Objects
Every management pack defines a model, described in SDM,
of the component that it’s managing. This model is expressed as one or more classes,
each representing something that can be monitored and managed. A class also
defines attributes, values that can describe an object of that class.
When an MP’s information is sent down to an agent, the agent relies on specific
discovery rules in the MP to find the actual instances of the classes
this pack defines. To discover these instances, an agent might look for a
specific registry key, query WMI, or perform some other action. However it's
done, the result is a hierarchy of objects, each of some class and with a
specific set of attributes, representing the things this MP targets.
A note on terminology: When talking to management pack
authors, Microsoft uses the terms “class” and “instance”, concepts that are
familiar to developers. In the Operations Manager console, however, the terms
“target” and “object” are used instead. This paper uses “class” rather than
“target”, but the terms “instance” and “object” are used interchangeably
throughout.
When an agent is first deployed, it knows that it’s
running on a Windows computer. Accordingly, it creates an instance of a Windows
computer object, then asks its management server for all rules (including
discovery rules), monitors, and other relevant aspects of the Windows MP. Once
they’re downloaded, the discovery rules can find other objects on this machine,
such as SQL Server or a DNS server. The agent then requests that the rules,
monitors, discoveries, and so on for these classes also be downloaded from the
various MPs in which they’re contained. This process of progressive discovery
continues until all managed objects on the system have been found.
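Progressive discovery as described here behaves like a breadth-first search: each newly found class brings its own discovery rules, which may find further classes. The Python sketch below is only a model of that loop. The `DISCOVERY_RULES` table and the `discover` function are hypothetical, and the registry or WMI checks a real agent performs are replaced by a simple set lookup.

```python
# Hypothetical data: the classes that each class's discovery rules can find.
DISCOVERY_RULES = {
    "Windows Computer": ["SQL Server", "DNS Server"],
    "SQL Server": ["SQL Database"],
    "DNS Server": [],
    "SQL Database": [],
}


def discover(present_on_machine):
    """Breadth-first discovery starting from the Windows computer object."""
    found = ["Windows Computer"]
    pending = ["Windows Computer"]
    while pending:
        current = pending.pop(0)
        # Stand-in for downloading this class's discovery rules and running them.
        for candidate in DISCOVERY_RULES[current]:
            if candidate in present_on_machine and candidate not in found:
                found.append(candidate)
                pending.append(candidate)
    return found


# A machine running SQL Server with a database, but no DNS server.
objects = discover({"SQL Server", "SQL Database"})
```

The loop stops when no pending class has rules that find anything new, which corresponds to the point where all managed objects on the system have been found.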
Using this approach, an agent constructs a hierarchy of
managed objects from multiple MPs. The figure below shows a simple picture of
how this might look.
The MP for the Windows Server operating system defines a
class for the computer it runs on, while the SQL Server MP defines classes representing
a SQL Server database and an instance of SQL Server itself. In this simplified
example, the computer object appears directly above two instances of SQL
Server. One of these instances has two SQL Server database objects below it,
while the other has only a single database object. All of this, along with
populating the attributes associated with each object, is created automatically
by the agent. And to catch any changes that occur, the discovery process is
regularly re-run, including each time the machine reboots or the agent on that
machine is restarted.
One more aspect of the figure above requires explanation:
the green checks in each object. This symbol represents the object’s state, as
illustrated in the console’s State view earlier. Each object’s state provides a
quick summary of its condition. One object’s state can affect another, however,
allowing a more intelligent perspective on what’s really happening. All of this
depends on monitors, and how it works is described next.
Tracking Object State: Monitors
A primary goal of management is keeping software and the
hardware it depends on running well. One way to do this is to wait until
something fails, then fix it. This approach can work, but it’s usually not the
best solution. Just as with our personal health, avoiding problems before they
happen is much better. Rather than just react to potentially catastrophic
failures, we can keep track of the health of the objects we’re managing to
prevent serious problems whenever possible. In other words, we can create and
monitor a health model for a component.
Operations Manager relies on monitors to do this. Each
monitor reflects the state of some aspect of an object, changing as that state
changes. For example, a monitor tracking disk utilization might be in one of
three states: green if the disk is less than 75% full, yellow if it’s between
75% and 90% full, and red if the disk is more than 90% utilized. A monitor
tracking a particular application’s availability might have only two states:
green if the application is running and red if it’s not.
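The two example monitors amount to small functions from a measurement to a state. A minimal Python sketch, using the thresholds given above and hypothetical function names, might look like this (treating exactly 75% and exactly 90% as yellow is an assumption, since the text leaves the boundaries open):

```python
def disk_state(percent_full):
    """Map disk utilization to a monitor state using the thresholds from the text."""
    if percent_full < 75:
        return "green"
    if percent_full <= 90:
        return "yellow"
    return "red"


def availability_state(is_running):
    """Two-state monitor: green when the application is running, red otherwise."""
    return "green" if is_running else "red"
```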
The author of each management pack defines what monitors
it contains, how many states each monitor has, and what aspect of the managed
object this monitor tracks. A monitor can determine what its state should be in
several different ways. It might examine particular performance counters every
90 seconds, for example, or regularly issue a particular WMI query. A monitor
might also watch the event log for events that affect its state. Think, for
example, about a monitor representing whether an application can communicate
with a particular database. The application might write an event to the event
log when this communication fails, causing the monitor to change its state to
red. When communication is restored, the application might write a new event
indicating this, causing the monitor to change its state back to green. This
example illustrates an important fact about monitors (and about application
manageability in general): Applications should be written in certain ways to
make themselves manageable—they should be instrumented—and the creators
of management packs must know how to take advantage of this instrumentation.
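The event-driven monitor in this example can be modeled as a two-state machine that reacts to a failure event and a recovery event. The Python sketch below is illustrative only; the `EventMonitor` class and the event IDs are made up, and a real monitor would be defined in a management pack rather than in code like this.

```python
class EventMonitor:
    """Two-state monitor driven by event IDs (the IDs used here are made up)."""

    def __init__(self, failure_event_id, recovery_event_id):
        self.failure_event_id = failure_event_id
        self.recovery_event_id = recovery_event_id
        self.state = "green"

    def observe(self, event_id):
        """Update state for one event log entry; unrelated events are ignored."""
        if event_id == self.failure_event_id:
            self.state = "red"
        elif event_id == self.recovery_event_id:
            self.state = "green"
        return self.state


monitor = EventMonitor(failure_event_id=1001, recovery_event_id=1002)
# An unrelated event, a failure, another unrelated event, then a recovery.
history = [monitor.observe(e) for e in (7, 1001, 7, 1002)]
```

The sketch only works because the application writes both events, which is the instrumentation point the paragraph makes.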
Whenever a monitor changes its state, this change is sent
to both the operational database and the data warehouse. This information
allows the operator to see the current state of an object or group of objects.
The console State view shown earlier is just a window into the set of monitors
that represent the state of one or more managed objects.
All of the monitors for a particular managed object are
organized into a hierarchy. Every monitor hierarchy has four standard monitors
that live just below its root: performance, security, configuration, and
availability. Each monitor defined by a management pack appears below one of
these four, with the choice made by the management pack’s author. Any change in
a monitor’s state can cause a change in the state of the monitor above it. This
allows problems to bubble up through the hierarchy of monitors and the objects
whose states they represent.
The figure above, which uses the same simple set of
objects shown earlier, illustrates how monitor states can percolate through a
hierarchy of managed objects. In this example Database 1 has a problem: perhaps
a disk drive has failed. The monitor that watches this aspect of the database
notices this and sets its state to red. This state change causes the standard
availability monitor for this object to also set its state to red, which in turn
sets the monitor for the object’s overall state to red.
The monitor for the database’s overall state also appears
in the monitor hierarchy for its parent object, SQL Server 1. This object has
two databases, however, only one of which has failed. Accordingly, this
object’s availability monitor is set to yellow rather than red, a decision that
was made by the author of the SQL Server management pack. This is once again
reflected in the overall state for this object, the state of which is also set
to yellow.
Just as the overall state of the database appeared in the
SQL Server 1 monitor hierarchy, the overall state of this SQL Server instance
appears in the Computer object. (Recall that Computer is defined in a different
MP from the other objects shown here—MP boundaries don’t limit this kind of
monitor interaction.) The Computer object also sets its availability monitor
and its overall state to yellow, indicating that while the computer is
functioning, there is a problem that needs attention.
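The way state percolates upward depends on a rollup policy chosen by each management pack's author. The Python sketch below models the example with two hypothetical policies, `worst_of` and `red_only_if_all_red`. Which policy the real SQL Server and Windows MPs use is an assumption made purely to reproduce the yellow states described above.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def worst_of(states):
    """Roll up by taking the most severe child state."""
    return max(states, key=SEVERITY.get, default="green")


def red_only_if_all_red(states):
    """Roll up to red only when every child is red; any other problem gives yellow."""
    states = list(states)
    if states and all(s == "red" for s in states):
        return "red"
    return "yellow" if any(s != "green" for s in states) else "green"


# Database 1 has failed; Database 2 is healthy.
database_states = {"Database 1": "red", "Database 2": "green"}

# Assumed policy choices standing in for decisions made by MP authors.
sql_server_1 = red_only_if_all_red(database_states.values())
sql_server_2 = red_only_if_all_red(["green"])
computer = worst_of([sql_server_1, sql_server_2])
```

With these policies `sql_server_1` and `computer` both come out yellow, matching the example: something needs attention, but the instance and the machine are still functioning.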
Along with state changes, monitors can also send alerts
and events. Their primary role, however, is to provide
an accurate representation of state. By allowing the state of an object to
depend on the state of objects below it, monitors provide an intelligent way to
model the health of an entire system. Yet doing this requires creating an
effective set of monitors and relationships between those monitors. Put another
way, the people who create each management pack must be able to define an
appropriate health model for the component this pack targets. To do this for
its own products, Microsoft relies on input from the teams that create them,
from its customers, and from its own services group.
Other Elements of Management Packs
Monitors are essential to every management pack. As
mentioned earlier, however, MPs can also contain a number of other things. This
section gives brief descriptions of these, including rules, tasks, knowledge,
views, reports, and synthetic transactions.
Rules
Monitors and the health models they enable are fundamental
to how Operations Manager does its work. There are cases, however, where
monitors aren’t appropriate. Suppose a system needs to collect data regularly
from several performance counters, for instance, then send this information to
the management server and data warehouse. Because they’re designed to model
states, monitors aren’t capable of doing this.
To address this kind of problem, MPs include rules. A
simple way to think about rules is as an if/then statement. For example, an MP
for an application might contain rules such as the following:
• If a message indicating that the application is shutting down appears in the
event log, send an alert.
• If a logon attempt fails, send an event indicating this failure.
• If five minutes have elapsed since the last update was sent and the new value
is more than 2% different from the previous one, send the value of the
machine’s free disk space performance counter.
As these examples show, rules can send alerts, events, or
performance data. Rules can also run scripts, allowing a rule to attempt to
restart a failed application. Even the discovery process described earlier
depends on specialized sets of discovery rules.
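The free disk space rule in the list above shows how an if/then condition can combine a time interval with a change threshold. The following Python sketch is one possible reading of that rule, not how Operations Manager implements it. The `FreeSpaceRule` class is hypothetical, the first sample is assumed to be sent unconditionally, and the previous value is assumed to be non-zero.

```python
class FreeSpaceRule:
    """If/then rule for performance data.

    It remembers the last sample it sent only to evaluate its condition,
    not to model the health of an object.
    """

    def __init__(self, min_interval_s=300, min_change=0.02):
        self.min_interval_s = min_interval_s
        self.min_change = min_change
        self.last_sent_at = None
        self.last_value = None

    def evaluate(self, now_s, value):
        """Return the value to send, or None if the rule's condition is not met."""
        if self.last_value is not None:
            elapsed = now_s - self.last_sent_at
            # Relative change against the previously sent value (assumed non-zero).
            change = abs(value - self.last_value) / self.last_value
            if elapsed < self.min_interval_s or change <= self.min_change:
                return None
        self.last_sent_at, self.last_value = now_s, value
        return value


rule = FreeSpaceRule()
samples = [(0, 100.0), (60, 50.0), (300, 101.0), (600, 110.0)]
sent = [rule.evaluate(t, v) for t, v in samples]
```

In this run the second sample is suppressed because too little time has passed, and the third because the value changed by only 1%.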
The distinction between monitors and rules can seem
subtle, but it’s not: monitors maintain state, rules don’t. Unlike monitors,
rules are just expressions of things an agent should do. If something changes
the state of an object, it should be modeled using a monitor. If it doesn’t
change an object’s state, an MP is likely to use a rule instead.
Tasks
A task is a script or other executable code that runs
either on the management server or on the server, client, or other device
that’s being managed. Tasks can potentially perform any kind of activity,
including restarting a failed application, deleting files, and more, subject to
the limitations of the identity they’re running under. Like other aspects of an
MP, each task is associated with a particular managed object. Running chkdsk
only makes sense on a disk drive, for example, while a task that restarts
Exchange Server is only meaningful on a system that’s running Exchange. If
necessary, an operator can also run the same task simultaneously on multiple
managed systems.
Monitors can have two special kinds of tasks associated
with them: diagnostic tasks that try to discover the cause of a problem, and
recovery tasks that try to fix the problem. These tasks can be run
automatically when the monitor enters an error state, providing an automated
way to solve problems. They can also be run manually, since automated recovery
isn’t always the preferred approach.
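The relationship between a monitor's error state and its diagnostic and recovery tasks can be sketched as a small handler. This Python example is an assumption-laden illustration: `on_state_change`, `check_service`, and `restart_service` are hypothetical names, and running the diagnostic automatically while making recovery optional is just one possible configuration.

```python
def on_state_change(new_state, diagnostic, recovery, auto_recover=True):
    """Run the diagnostic task on error, then the recovery task if automation is enabled."""
    log = []
    if new_state == "red":
        log.append(diagnostic())
        if auto_recover:
            log.append(recovery())
    return log


# Stand-in tasks; real tasks would be scripts or other executable code.
def check_service():
    return "diagnostic: service stopped"


def restart_service():
    return "recovery: service restarted"


automatic = on_state_change("red", check_service, restart_service)
manual = on_state_change("red", check_service, restart_service, auto_recover=False)
healthy = on_state_change("green", check_service, restart_service)
```

In the `manual` case only the diagnostic runs, leaving the decision to run the recovery task with the operator.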
Knowledge
While tasks can help diagnose and fix problems, they
aren’t much good to an operator unless she knows which ones to use in a
particular situation. And like it or not, the skill level and experience of
operations staff aren’t always what their managers would like them to be. By
providing pre-packaged knowledge, a management pack can help less capable staff
find and fix problems more effectively.
As shown in an earlier screenshot, knowledge appears as
human-readable text in the console, and its goal is to help an operator
diagnose and fix problems. Embedded in this text can be links to tasks,
allowing the author of this knowledge to walk an operator through the recovery
process. For example, the operator might first be instructed to run task A,
then based on the result of this task, run either task B or task C. Knowledge
can also contain embedded links to performance views and to reports, giving the
operator direct access to information needed to solve a problem. And as with
every aspect of a management pack, an MP’s knowledge must be created by people
who deeply understand the component this pack addresses. If this isn’t the
case, the information it contains isn’t likely to be of much use to the
operators who depend on it.
Views
The Operations Manager console provides standard views for
State, Alerts, Performance, and more, as shown earlier. Yet a particular MP
might find it useful to include specialized views of its own. Since each pack
defines its own unique set of objects, its creators might also choose to
provide customized views that show only these objects, or only alerts on these
objects, or some other more specialized perspective. MPs can contain custom
views to address cases like these, and the people who create those MPs
frequently take advantage of this ability: custom views are common.
Reports
Just as a management pack can contain views customized for
the objects that MP targets, it can also contain custom reports. For example, a
management pack might include a customized definition of one of Operations
Manager’s built-in reports, specifying the exact objects that the report should
target. The creator of a management pack can also build custom reports from
scratch with the Report Definition Language (RDL) used with SQL Server
Reporting Services. More complex reports can also have stored procedures, use
indexes, and more, allowing reports that access lots of data to offer better
performance.
Modifying an Installed Management Pack
No matter how good the creators of a particular management
pack might be, there’s no way that they’ll set everything perfectly for all of
the environments that MP will be used in. Maybe a particular rule doesn’t make
sense in an organization, or perhaps an unnecessary alert is being sent. If an
MP allows it, operators with the right security permissions are able to change
some or all of what this MP defines to match their requirements. An MP can also
be sealed, however, which means that it can’t be directly modified. All
Microsoft-provided MPs are sealed, for example, as are some provided by third
parties.
Whether or not an MP is sealed, an operator can create overrides.
Rather than permanently changing the MP, an override makes a change without directly
modifying the underlying monitor or rule or other element. This lets the
operator revert to the MP’s original settings if necessary, an option that’s
often useful to have.
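To make the idea concrete, here is a minimal sketch of how an override can be layered over an unmodified rule. It is illustrative only and does not reflect Operations Manager's actual API or storage format: the names Rule, OverrideStore, disk_free_space_low, and the threshold setting are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """A rule as shipped in a management pack (hypothetical model); never mutated."""
    name: str
    enabled: bool = True
    threshold: float = 90.0


@dataclass
class OverrideStore:
    """Keeps operator overrides separate from the rules they affect."""
    overrides: dict = field(default_factory=dict)

    def apply(self, rule_name, **settings):
        # Record the changed settings without touching the rule itself.
        self.overrides.setdefault(rule_name, {}).update(settings)

    def revert(self, rule_name):
        # Dropping the override restores the MP's original behavior.
        self.overrides.pop(rule_name, None)

    def effective(self, rule):
        """Return the rule's settings with any overrides layered on top."""
        merged = {"enabled": rule.enabled, "threshold": rule.threshold}
        merged.update(self.overrides.get(rule.name, {}))
        return merged


disk_rule = Rule(name="disk_free_space_low")  # hypothetical rule name
store = OverrideStore()

store.apply("disk_free_space_low", threshold=95.0)
overridden = store.effective(disk_rule)

store.revert("disk_free_space_low")
restored = store.effective(disk_rule)
```

Because the rule object is never changed, reverting is just a matter of discarding the override, which is the property the text above describes.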
Service Monitoring
In the beginning, systems management focused on managing
servers. Today, this focus has expanded to include clients, applications, and
more. Yet in most cases, the real goal is to manage the services that people
actually use: email, line-of-business applications, and others. All of these
services are provided by combinations of hardware and software, and so managing
them as a whole requires some way to group together the relevant components
into a single manageable entity. This is exactly what Operations Manager’s
service monitoring allows.
Using the Distributed Application Designer, a tool
accessible via the authoring section of the Operations Manager console, an
administrator can define the various components that make up a service. This
designer provides standard templates for defining common application types,
such as messaging and line-of-business applications. These templates can be
customized as needed to reflect the details of a particular service. Once the
definition is complete, the tool generates a management pack for this service,
complete with a monitor-based health model. This MP can then be installed on
the relevant agents just like any other MP.
The screen shot below shows an example of how a
distributed application defined in this way might look. Using the console’s Diagram
view (which can also be applied to other aspects of the managed environment),
it shows the various components that make up this particular service: a web
application, the database it depends on, and more. As with any other health
model, this one shows a hierarchy of objects, each with a state. In this
example, one of the databases is in a red state, a problem that bubbles up to
the overall state for this service.
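The way a problem bubbles up through such a hierarchy can be sketched in a few lines. This is a simplified illustration that assumes a plain worst-state rollup; in the product, the rollup behavior is determined by the monitors in the health model. The component names and the State and Component types are hypothetical.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class State(IntEnum):
    """Health states ordered so that a larger value means worse health."""
    GREEN = 0
    YELLOW = 1
    RED = 2


@dataclass
class Component:
    """One node in a service's health hierarchy (hypothetical model)."""
    name: str
    own_state: State = State.GREEN
    children: list = field(default_factory=list)

    def health(self):
        """Worst state among this component and all of its descendants."""
        return max([self.own_state] + [c.health() for c in self.children])


# A service made up of a web application and the databases it depends on.
service = Component("order_service", children=[
    Component("web_application"),
    Component("databases", children=[
        Component("orders_db", own_state=State.RED),
        Component("catalog_db"),
    ]),
])

overall = service.health()  # the red database determines the service's state
```

Here a single database in a red state is enough to turn the whole service red, while the web application on its own still reports green.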
Monitoring and managing individual components of a service
is clearly useful. Yet adding the ability to manage all of these components as
a unified group with a single health model can provide significantly more
business value. As systems management continues to move toward ITIL-based IT
service management, tools that directly support this service-oriented view
become more important.
Conclusion
The computing environment of most organizations gets more
complex every day. The tools IT operations staffs use to monitor and manage
this environment must keep pace, adding the capabilities people need to do
their jobs. This evolution has been reflected in Microsoft’s offerings. The
product line began as Microsoft Operations Manager 2000, which focused on
managing servers. System Center Operations Manager 2007 now also supports
client management, the ability to view distributed applications as unified
services, state-based management with health models, and more.
It’s a safe bet that information technology will continue
to advance. Even when those changes are improvements—and they usually are—they
still increase the management complexity of the environment. Given this, no one
should expect System Center Operations Manager 2007 to be the last word in
systems management. Yet for organizations with a significant investment in Microsoft
software, this tool can play a useful role in monitoring and managing their
world.