Fri Jan 13, 2012 11:08 pm
I'd like to share how we do what you're attempting. I create two Notifications (On Call: Ups and On Call: Downs).
On Call: Downs
----------------
Delay: 00:00:00
Repeat Interval: 00:15:00
Repeat Count: 20
On Status: unstable->down and up->down
On Call: Ups
----------------
Delay: 00:00:00
Repeat Interval: 00:00:00
Repeat Count: 0
On Status: down->up
This pages our on-call tech when the device first goes down. If it is still down 15 minutes later, it pages again, and this repeats 20 times (20 × 15 minutes = 5 hours). If the device comes back up within that window, the down pages stop and the on-call tech gets a single page saying the device is up.
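A small Python sketch of the schedule those two notifications produce for one outage. It assumes, without confirmation from the monitoring tool, that the first down page fires after Delay, that each repeat fires one Repeat Interval after the previous page, that Repeat Count counts repeats after the initial page, and that down paging stops as soon as the device recovers. `page_schedule` is a hypothetical helper, not part of any product.

```python
# Hypothetical model of the On Call: Downs / On Call: Ups pages for one outage.
# Assumptions: first page at delay_min, then up to repeat_count repeats spaced
# repeat_interval_min apart, stopping once the device is back up.


def page_schedule(outage_min, delay_min=0, repeat_interval_min=15, repeat_count=20):
    """Return (minute, kind) pairs sent to the on-call tech during one outage.

    outage_min: how many minutes the device stays down before recovering.
    """
    pages = []
    t = delay_min
    for _ in range(repeat_count + 1):  # initial page plus repeat_count repeats
        if t >= outage_min:
            break  # device recovered before this page was due
        pages.append((t, "down"))
        t += repeat_interval_min
    pages.append((outage_min, "up"))  # On Call: Ups fires once on down->up
    return pages


print(page_schedule(40))  # three down pages, then one up page at minute 40
```

With these settings the last possible down page lands at minute 300 (20 repeats × 15 minutes), which is where the "5 hours" figure comes from.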
We also have an On Call: Emergency notification.
I create a probe called ping2.
Name: ping2
Type: ICMP
Agent: default
Packet Size: 32
TTL: 64
Retry Count: 3
Retry Interval: 1000
Then create a notification.
Name: On Call: Emergency
Delay: 00:00:00
Repeat Interval: 00:00:00
Repeat Count: 0
On Status: down->up and unstable->down and up->down
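To make the "On Status" settings concrete, here is a hypothetical Python model of which notifications fire on a given status change. The notification names and transitions are copied from the settings above; the matching logic and the `notifications_for` helper are assumptions for illustration, not the tool's actual code. Note that a service merely going unstable does not match any of them.

```python
# Hypothetical model: each notification lists the (old, new) status
# transitions from its "On Status" setting that trigger it.
ON_STATUS = {
    "On Call: Downs": {("unstable", "down"), ("up", "down")},
    "On Call: Ups": {("down", "up")},
    "On Call: Emergency": {("down", "up"), ("unstable", "down"), ("up", "down")},
}


def notifications_for(old, new, attached):
    """Return the attached notifications triggered by an old -> new change."""
    return sorted(name for name in attached if (old, new) in ON_STATUS[name])


# ping2 slipping to unstable pages nobody; reaching down fires the emergency.
print(notifications_for("up", "unstable", ["On Call: Emergency"]))
print(notifications_for("unstable", "down", ["On Call: Emergency"]))
```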
Then add the ping2 service to each device that should trigger an emergency page if the on-call tech hasn't resolved the problem within an hour:
Device>Services>Plus Sign
Add Probe: ping2
Probe Interval: default
Probe Timeout: default
Probe Down Count: 100
Notifications Tab>Use Notifications: On Call: Emergency
This notifies me, the Network Administrator, if a device has stayed down for over an hour without being ACKed. The hour comes from the Probe Interval (default 30 seconds) plus the Probe Timeout (default 10 seconds): the device is pinged every 30 seconds, and when it's down the probe waits 10 seconds to confirm there is no reply before repeating. That is 40 s × 100 = 4000 s, and 4000 s / 60 s ≈ 66.67 minutes.
One nice thing about this is that if something is down for less than an hour, the original ping service on that device will be down but the ping2 service will only be unstable, which makes the device orange. Once ping2 goes down at the 66-minute mark, the device turns red. This lets everyone see whether I have been paged yet: if it's red, I'm aware of the issue; if it's orange, I haven't been notified.
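The timing and colour logic can be checked with a short Python sketch. It takes the post's model at face value (each failed probe cycle costs Probe Interval + Probe Timeout) and assumes the original ping service goes down right away; the real probe scheduling may differ. `seconds_until_ping2_down` and `device_colour` are hypothetical helpers.

```python
# Hypothetical model of the ping2 "time to emergency" arithmetic.
# Assumes each failed probe cycle lasts probe_interval + probe_timeout seconds.


def seconds_until_ping2_down(probe_interval=30, probe_timeout=10, down_count=100):
    """Seconds of continuous failure before ping2 is marked down."""
    return (probe_interval + probe_timeout) * down_count


def device_colour(minutes_down, ping2_down_after_min):
    """Colour of a device whose ping service is already down (per the post)."""
    if minutes_down < ping2_down_after_min:
        return "orange"  # ping down, ping2 only unstable: admin not paged yet
    return "red"  # ping2 down: On Call: Emergency has fired


secs = seconds_until_ping2_down()
mins = secs / 60
print(secs, round(mins, 2))  # 4000 66.67
print(device_colour(30, mins), device_colour(70, mins))  # orange red
```

Raising or lowering Probe Down Count scales the escalation delay linearly, so a 30-minute escalation would need a down count of about 45 with the default interval and timeout.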
Let me know if this helps or if anyone knows of a more efficient way to do this.