Health

Health monitoring allows IT organizations to track, measure, and alert on various error conditions in the network and host infrastructure. Early detection of such conditions can avert potentially disastrous situations. Errors within a networked environment are common and are part of the regulating mechanism of congestion control. For example, dropped packets are a frequent occurrence when switch and router queues fill. Packet loss is a feedback mechanism used by TCP to control how much traffic to inject into the network and how fast. Because TCP is a reliable transport mechanism, retransmissions occur to ensure delivery of the application messages. Although packet loss is common, high rates of packet loss can signal oversubscription and, consequently, network problems that will result in degraded application performance. Health Monitors are passive in nature: they do not introduce traffic onto the network, and they track the health of each device (host) as well as the network (port).

Host Health Monitor

The Host Health Monitor tracks metrics concerning connections to hosts, along with some other metrics that are helpful in the troubleshooting process. For instance, an increase in the time it takes to set up a connection to a host, over and above the norm, is a good indication of a lack of resources on the host.

Connection Setup Time:

Measures the time, in milliseconds, between the first transport layer connection establishment message and the first data packet exchanged on the connection.

Connection Timeouts:

Measures the percentage of connections between two communication endpoints that were terminated non-gracefully per sample period.

Retransmissions:

Measures the percentage of data-bearing packets within the layer four connection between the client and the server that were retransmitted.

Out-of-Order Packets:

Measures the percentage of transport packets, within established connection lifetimes, that were detected as being out of sequential order.

Client to Server (C/S) Window Size:

Measures the window size of transport level protocols from the perspective of the Client.

Server to Client (S/C) Window Size:

Measures the window size of transport level protocols from the perspective of the Server.
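The metric definitions above can be illustrated with a minimal Python sketch. It is not the product's implementation. The `ConnectionSample` record, its field names, and the `summarize` helper are hypothetical, and the sketch assumes that the retransmission and out-of-order percentages are both taken over the number of data-bearing packets seen in the sample period.

```python
from dataclasses import dataclass


@dataclass
class ConnectionSample:
    """Hypothetical per-connection counters for one sample period."""
    establish_time_ms: float    # time of first connection establishment message
    first_data_time_ms: float   # time of first data packet on the connection
    data_packets: int           # data-bearing packets observed
    retransmitted_packets: int  # data-bearing packets that were retransmitted
    out_of_order_packets: int   # packets seen out of sequential order
    timed_out: bool             # True if the connection ended non-gracefully


def connection_setup_time_ms(sample):
    """Time between the establishment message and the first data packet."""
    return sample.first_data_time_ms - sample.establish_time_ms


def percentage(part, whole):
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def summarize(samples):
    """Aggregate the host health metrics over one sample period."""
    total_data = sum(s.data_packets for s in samples)
    setup_times = [connection_setup_time_ms(s) for s in samples]
    return {
        "avg_setup_ms": sum(setup_times) / len(setup_times) if setup_times else 0.0,
        "timeout_pct": percentage(sum(s.timed_out for s in samples), len(samples)),
        "retransmission_pct": percentage(
            sum(s.retransmitted_packets for s in samples), total_data),
        # Assumption: the out-of-order denominator is the data packet count.
        "out_of_order_pct": percentage(
            sum(s.out_of_order_packets for s in samples), total_data),
    }
```

Given two connections in a sample period, one of which timed out, `summarize` reports a timeout percentage of 50 and averages the two setup times.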

These monitors focus on measurable entities on the network that pertain to host computer issues. For example, one of the best leading indicators of a pending problem or a busy host (usually a server) is the Connection Setup Time metric. Because TCP is a fundamental OS function (in Linux it is part of the kernel operation), it has a higher than normal priority relative to a typical application and should be serviced by the system within milliseconds (a couple of hundred milliseconds is normal under a moderate CPU load). This value will climb or even spike to seconds when the system cannot respond to a connection request. In this case the host is almost always experiencing an overloaded condition: either connection demand is high, or the OS is busy processing other tasks and cannot respond to the connection requests.

Window size is another example of a metric that points to potential host issues. Because the window is a rate control mechanism used by the receiving host to throttle traffic, a relatively low average window size may indicate that the receiving host is overwhelmed by the sending host and unable to keep up with the traffic sent to it. Normally the receiving host will send a response ACK with a low window size, or even 0, to stop the transmitting host from sending any more data until it has had time to catch up. This can severely impact the performance of an application and could be misinterpreted as a network problem because the transfer rate seems slow, when in fact it is a host problem.
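As a rough illustration of how these two indicators might be checked, the sketch below flags a connection setup time that spikes well above its recent average, and counts zero-window advertisements from a receiving host. The function names, the multiplier of 3, and the 1024-byte low-window threshold are assumptions chosen for the example, not values taken from the product.

```python
from statistics import mean


def setup_time_alert(history_ms, latest_ms, factor=3.0):
    """Return True when the latest setup time exceeds factor times the
    mean of the recent history (hypothetical alerting rule)."""
    if not history_ms:
        return False  # no baseline yet, so nothing to compare against
    return latest_ms > factor * mean(history_ms)


def window_pressure(window_sizes, low_threshold=1024):
    """Return (zero_window_count, average_is_low) for a list of window
    sizes, in bytes, advertised by the receiving host."""
    zero_windows = sum(1 for w in window_sizes if w == 0)
    average_is_low = bool(window_sizes) and mean(window_sizes) < low_threshold
    return zero_windows, average_is_low
```

A host whose setup time jumps from roughly 110 ms to 2500 ms would trip `setup_time_alert`, and a receiver advertising windows of 0, 512, 0, and 256 bytes would be reported as having two zero-window events and a low average.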

Port Health Monitor

The Monitoring Port Health Monitor primarily looks for the percentage of error conditions on the physical network that it is monitoring. Metrics tracked include the percentage of Dropped Packets, Fragments, CRC Errors, Oversized Packets, Undersized Packets, and Jabbers. These monitors track the measurable entities on a particular network segment. Because they are mainly focused on the media layer, they are most useful in a Tapped (in-line network tap) deployment. In this deployment configuration, these metrics are useful for trunked links between high-traffic switches and routers. They can also be used to track interface issues.
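The port metrics are all percentages of error conditions, so a simple sketch is to divide each error counter by the total packets seen on the segment. The counter names and the `port_error_percentages` helper below are hypothetical, and the sketch assumes total observed packets is the denominator.

```python
# Hypothetical counter names, one per metric listed above.
ERROR_KINDS = (
    "dropped", "fragments", "crc_errors",
    "oversized", "undersized", "jabbers",
)


def port_error_percentages(counters, total_packets):
    """Return each error kind as a percentage of total packets observed.

    Missing counters are treated as zero; a non-positive total yields 0.0
    for every kind to avoid dividing by zero.
    """
    if total_packets <= 0:
        return {kind: 0.0 for kind in ERROR_KINDS}
    return {
        kind: 100.0 * counters.get(kind, 0) / total_packets
        for kind in ERROR_KINDS
    }
```

For a segment that carried 1000 packets with 5 drops and 1 CRC error, this reports 0.5 percent dropped and 0.1 percent CRC errors.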

 
© 2011 NEXVU APM, LLC. All Rights Reserved.