Network Monitoring
From charlesreid1
Basics
Terminology
Element - the fundamental unit of network monitoring. An element is a single metric that is being monitored. There are usually hundreds or thousands of elements in a given network.
Acquisition - the process of actually obtaining the observational data from the element
Frequency - related to acquisition, what is the frequency at which data arrives? what kind of data is being sent? under what conditions?
Data warehousing - depending on the size of the network and the amount of data, you can end up with a big storage problem on your hands. For purposes of monitoring, you may decide not to store the data at all, you may decide to keep it for a short amount of time, or you may decide to archive it somewhere.
Threshold value - this gets into the "what" part of your monitoring. What, exactly, are you monitoring, and what is the value of the element that will trigger an alert? (What constitutes an emergency?)
Reset value - opposite of threshold, what is the value of the element that will un-trigger an alert and signify the "all clear"?
Threshold response - what is the response when a threshold is reached and an alert is triggered?
Requester - the entity that is requesting the monitoring data, and where it lives (may be on-board the machine, or may be a networked data store)
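The relationship between the threshold value and the reset value can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the class name ThresholdAlert, the 90/70 values, and the CPU readings are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ThresholdAlert:
    """Tracks alert state for one element using a threshold and a reset value."""
    threshold: float      # value at or above which the alert triggers
    reset: float          # value at or below which the alert clears
    active: bool = False  # True while the alert is raised

    def update(self, value: float) -> str:
        """Feed one observation; return 'trigger', 'clear', or 'no-change'."""
        if not self.active and value >= self.threshold:
            self.active = True
            return "trigger"
        if self.active and value <= self.reset:
            self.active = False
            return "clear"
        return "no-change"


# Hypothetical CPU load readings (percent) for a single element.
cpu_alert = ThresholdAlert(threshold=90.0, reset=70.0)
events = [cpu_alert.update(v) for v in (50, 92, 85, 95, 69, 60)]
print(events)
```

Keeping the reset value below the threshold means a reading that hovers around 90 does not repeatedly trigger and clear the same alert.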
List of Monitoring Tools
Cross-platform tools:
- Ping - checks if a target machine is online/up and running, and how long it takes to reach the machine
- SNMP - Simple Network Management Protocol, a protocol for querying devices for data about elements on a network
- ICMP - Internet Control Message Protocol, used by network devices such as routers to send error messages about unreachable hosts (Ping is built on ICMP echo requests and replies)
- Syslog - of course, the system log is a useful place for data about what's happening on a particular machine and can yield data about elements
- Other log files - programs will typically provide a way to log information to a log file, so this is another source of data about various elements on the network
- Scripting - scripting is the most flexible way to collect information, and allows for custom element data to be collected and sent off to the receiver
- Flow - understanding the flow of traffic on a network (where it comes from, where it goes, and what kind of traffic it is) is important to understanding the network
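As a rough sketch of what the Flow entry above means in practice, the snippet below totals traffic by source and by protocol. The record layout (source, destination, protocol, bytes) and the addresses are hypothetical; real flow exports carry more fields and come from a collector rather than a hard-coded list.

```python
from collections import Counter

# Hypothetical flow records: (source, destination, protocol, bytes).
flows = [
    ("10.0.0.5", "10.0.1.20", "tcp/443", 1200),
    ("10.0.0.5", "10.0.1.20", "tcp/443", 800),
    ("10.0.0.7", "10.0.1.53", "udp/53", 90),
    ("10.0.0.5", "10.0.1.53", "udp/53", 110),
]


def bytes_by(records, key_index):
    """Sum bytes grouped by one field of the record (0=src, 1=dst, 2=protocol)."""
    totals = Counter()
    for record in records:
        totals[record[key_index]] += record[3]
    return totals


by_source = bytes_by(flows, 0)      # where traffic comes from
by_protocol = bytes_by(flows, 2)    # what kind of traffic it is
top_talker = by_source.most_common(1)[0][0]
print(top_talker, dict(by_protocol))
```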
Platform-specific tools:
- (cisco) IP SLA - Internet Protocol Service Level Agreements, a feature found onboard Cisco routers that generates test traffic and measures how the network handles it, which helps keep the WAN running smoothly
- (windoze) WMI - Windows Management Instrumentation is a windoze management interface, usually queried from scripts, for collecting information about a target system
- (windoze) PerfMon - performance monitor that gives information about the machine's current state, as well as information about errors
- (windoze) Event log - the event log in Windows is the equivalent of the syslog, recording everything happening onboard the machine
What To Monitor
Let's cover what you actually want to monitor on the network.
Availability, Faults, Performance
Three important things to measure for each element:
Availability - is an element online/responding to requests? or is it offline/not responding?
Faults - is a given element functioning correctly? have any failures been detected? (Can failures be detected?)
Performance - how well is the network performing? (throughput, utilization, response times, error rates)
The most useful tools for determining these metrics are Ping, SNMP, and ICMP, to measure:
- Response time
- Packet loss
- CPU load
- Memory utilization
- Hardware status
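Response time and packet loss from the list above can be derived from a batch of probe results. The sketch below assumes the probes have already been sent (for example by Ping) and that each result is either a round-trip time in milliseconds or None for a probe that got no reply; the function name summarize_probes and the sample values are illustrative only.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize probe results; None marks a probe that got no reply."""
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    return {
        "sent": sent,
        "received": len(replies),
        "loss_pct": loss_pct,
        "avg_rtt_ms": mean(replies) if replies else None,
        # Treat the element as available if at least one probe was answered.
        "available": bool(replies),
    }


summary = summarize_probes([12.0, 14.0, None, 10.0])
print(summary)
```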
If you can't access each of the networks between you and your target, you cannot measure availability/faults/performance on those intermediate networks directly. In this case, use IP SLA to simulate traffic between two networks and measure the performance of the connection. (Particularly useful for things like audio or video, which tend to be more sensitive to network routes/delays).
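One reason audio and video are sensitive to delay is that the variation in delay (jitter) matters as much as the delay itself. The sketch below uses one simple definition of jitter, the mean absolute difference between consecutive delay samples; this is an assumption for illustration and not necessarily the calculation IP SLA performs. The delay values are made up.

```python
def mean_jitter_ms(delays_ms):
    """Mean absolute difference between consecutive delay samples."""
    if len(delays_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
    return sum(diffs) / len(diffs)


# Hypothetical one-way delays (ms) measured from simulated traffic.
jitter = mean_jitter_ms([20.0, 24.0, 21.0, 29.0])
print(jitter)
```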
Address Space Monitoring
With thousands of IP addresses being assigned on a network, it's important to keep track of what IP addresses have already been used and when subnets are full. It is also important to identify when a network component (e.g., DHCP or DNS) is misconfigured.
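A small sketch of tracking how full a subnet is, using Python's standard ipaddress module. The function name subnet_usage, the subnet, and the list of assigned addresses are hypothetical; in practice the assigned list would come from DHCP leases or an address inventory.

```python
import ipaddress


def subnet_usage(cidr, assigned):
    """Return (used, usable, percent_used) for addresses assigned inside cidr."""
    network = ipaddress.ip_network(cidr)
    # hosts() skips the network and broadcast addresses on ordinary IPv4 subnets.
    usable = sum(1 for _ in network.hosts())
    # A set ignores duplicates; addresses outside the subnet are not counted.
    used = len({a for a in assigned if ipaddress.ip_address(a) in network})
    return used, usable, 100.0 * used / usable


used, usable, pct = subnet_usage(
    "192.168.1.0/29",
    ["192.168.1.1", "192.168.1.2", "192.168.1.3", "192.168.2.9"],
)
print(used, usable, pct)
```

An address that falls outside the subnet it was supposedly assigned from (192.168.2.9 here) is the kind of thing that can point to a misconfigured component.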
Different tools are useful for monitoring different types of elements. Figuring out what tools you need will depend on your role monitoring the network. If you are a one-man band, you need something that can do everything and keeps it simple. If you're monitoring something specific (e.g., virtual containers), you can go for specialized/expensive tools that help you monitor that one particular thing deeply.
Things to consider:
- What is the (one-time) purchase cost?
- What is the (ongoing) maintenance cost?
- What is the support cost?
- What is the customization cost?
DART Framework
SolarWinds recommends using a DART framework, which stands for:
- Discovery
- Alerting
- Remediation
- Troubleshooting
DART
Discover
Discover consists of finding out what is happening. What is the health of the network? Where are the problems? Where are the potential/future points of failure?
- Identify all of your assets and find out if they are connected
- Provide a network baseline for network performance
- Gather data needed to compute network performance/efficiency statistics
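A network baseline, as mentioned in the list above, can be as simple as the mean and standard deviation of a metric collected during normal operation. The sketch below is one possible approach under that assumption; the function names, the three-sigma cutoff, and the sample values are illustrative choices rather than anything specified by the DART framework.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Return (mean, sample standard deviation) of historical samples."""
    return mean(samples), stdev(samples)


def deviates(value, baseline, n_sigmas=3.0):
    """True when value lies more than n_sigmas standard deviations from the mean."""
    avg, sd = baseline
    return abs(value - avg) > n_sigmas * sd


# Hypothetical response times (ms) collected during normal operation.
history = [10.0, 12.0, 11.0, 9.0, 13.0]
baseline = build_baseline(history)
print(deviates(12.0, baseline), deviates(30.0, baseline))
```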
Alerting
Alerting is a notification that something has gone wrong or is broken. It is important to correctly calibrate