Dude is using ALL network capacity

eivind · Sat Mar 29, 2008 3:01 pm

I have been running Dude in a couple of networks, and both networks have experienced fatal crashes. Dude has periodically been "eating" all the bandwidth from our customers. The periods last for about 5 to 10 minutes. I had to uninstall Dude from my service PC to fix the problem. In both cases I tried out the beta -.07 version. Version 2.2 seems to work. The reason for trying out the beta was the promise of less bandwidth use (ref. version log).
In the first case I had a traffic sensor running (other software), and in the bad periods the network traffic was extremely high. After uninstalling Dude it all went back to normal.
In the second case I had no opportunity to look for the cause. I just had to uninstall it.
Both installations were upgraded from stable v2.2 without changing any program settings.
Dude was in both cases running under XP Pro.

talon63 · Sat Mar 29, 2008 6:52 pm

What are your discover/polling intervals? You may have something set too low, and it is overwhelming your pipe by generating traffic too often.

eivind · Sun Mar 30, 2008 1:59 am

Hi talon63

You may have a point. My knowledge about Dude isn't too good.
That's why I'm very careful about what settings I make in addition to the default ones.
I was polling about every 30 seconds, and the only probe I have made in addition to the existing ones was the RF level of the client antennas. I was polling traffic, RF level, and ping on about 60 wireless MikroTik clients. Auto discovery was disabled.
Parts of the same subnet are used for managing other networks (cable TV, DOCSIS), and they crashed too.
The fatal crash happened a while after installing the beta version, with no other changes in setup.
Dude is an excellent program compared to several other programs I've tested and bought, but is it reliable? I have a version of PRTG running in another network, polling a lot more often than Dude. It works fine, but it costs a lot of money and has fewer configuration possibilities.
Maybe Dude should have a limit on how much bandwidth it is allowed to use?
I won't try it again without some external bandwidth limiting. Several hundred angry customers isn't that funny.
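
For a rough sense of scale, the figures above (about 60 clients, three probes each, one poll every 30 seconds) can be turned into an average probe rate. This is only a sketch: the function name is made up for illustration, the two-minute interval is added here purely for comparison, and nothing in the thread says how many bytes a single probe costs, so no bandwidth figure is derived.

```python
def probes_per_second(devices: int, probes_per_device: int, interval_s: float) -> float:
    """Average number of probe requests sent per second (hypothetical helper)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return devices * probes_per_device / interval_s


# Figures from the post: about 60 clients, 3 probes each (traffic, RF level, ping).
every_30_s = probes_per_second(60, 3, 30)
# Comparison value only; the two-minute interval is an assumption made here.
every_2_min = probes_per_second(60, 3, 120)

print(f"30 s interval:  {every_30_s:.1f} probes/s")
print(f"2 min interval: {every_2_min:.1f} probes/s")
```

Stretching the interval by a factor of four cuts the average probe rate by the same factor, which is the main lever available without removing probes or devices.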

talon63 · Sun Mar 30, 2008 5:40 am

I'll have to install the beta to see what the defaults are, but polling every 30 seconds is a little frequent. I'm using 2.2 on over 200 devices and polling every two minutes. I originally set it up on a workstation just to play with it, and just recently installed it on a server. All I had to do was export the XML from the workstation and import it into the server, and it has been running with no problems.

Let me install the beta version and see if I can come up with something that will help you.

Meanwhile, take a look at this post: http://forum.mikrotik.com/viewtopic.php?f=8&t=22618

cheers

eivind · Sun Mar 30, 2008 4:47 pm

Hi!
I have decided to give Dude a second chance. After all, it was probably my own settings that caused the trouble, and Dude IS an excellent monitoring system. I've measured the server's bandwidth today and it is about 30 Mbps in both directions. Dude is now running at clean default settings, no auto discovery, and I'm going to limit the server's bandwidth to about 2 Mbps before I make any changes to the log settings.
Then Dude itself gets into trouble instead of my customers (and me).

Thank you very much for your interest in helping a novice...

beerfiend · Mon Mar 31, 2008 9:36 pm

I wonder if this is not the same unexplained network-wide death that I discovered. We're running 3.07 beta on Windows Server 2003, administered from an XP box. After running for about two to three months, one day all our wireless devices locked up (30 Cisco 1200s, 2 Cisco 1400s). I couldn't do anything with them. Even the console ports were locked out. Think this might be the same issue?

talon63 · Mon Mar 31, 2008 10:26 pm

It is a possibility. Any tool used to monitor or manage a network uses resources on that same network, so the polling configuration can have an impact on the rest of the network. My general polling is set at a pretty wide interval, with more 'mission critical' devices configured to be polled more often. My timeouts are set according to the device being monitored, as is the countdown before I get a notification. I also use device relationships so that I am alerted when a key device goes down, not for each minor device connected to it.
The trick is to balance the polling, the timeout, and the countdown so that you get the notifications you need before the phones start ringing.
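
One way to reason about that balance is to estimate the worst-case delay between a failure and the alert. The sketch below uses a simplified model that is an assumption, not something taken from the Dude documentation: polls fire on a fixed schedule, the device fails just after a successful poll, every failed poll waits out the full timeout, and the alert is raised once `countdown` consecutive polls have failed. The function name and the sample values are illustrative only.

```python
def worst_case_detection_s(interval_s: float, timeout_s: float, countdown: int) -> float:
    """Upper bound, in seconds, from failure to alert under the assumed model.

    The last of `countdown` failed polls starts at countdown * interval_s
    after the last good poll and is declared failed timeout_s later.
    """
    if countdown < 1:
        raise ValueError("countdown must be at least 1")
    return interval_s * countdown + timeout_s


# Illustrative values only: 60 s interval, 10 s timeout, countdown of 3.
delay = worst_case_detection_s(interval_s=60, timeout_s=10, countdown=3)
print(f"worst-case delay: {delay:.0f} s")
```

Under this model the interval and countdown dominate the delay, so tightening the timeout alone buys little once the interval is long.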

eivind · Sat Apr 05, 2008 6:29 pm

One thing is for sure: Dude is a bit complicated, and there are a lot of traps to fall into. The user manual is missing a lot of stuff, and is far too old.
But then it's free and able to do whatever you want.

The question is whether it would be cheaper to buy a more preconfigured system, but then all the interesting challenges would be gone. I need a challenge every day to feel alive.

talon63 · Sun Apr 06, 2008 1:53 am

It is not that the Dude is complicated; any NMS will be fairly complicated to properly configure and use if you lack knowledge and experience. Before implementing any software that monitors service availability, you should map it all out on paper (or a spreadsheet) so that you can determine what is critical to your organization's operations. Once you have everything ranked by importance, you then need to determine what your monitoring state will be, i.e. the polling intervals. Once this is done, you can set your notifications so that the important things will let you know quickly, and objects of lesser importance will still notify you, but maybe not at 3 AM.

Look at it this way: the most critical components of this imaginary network are the switches that connect to the outside world, the internal VLANs from each switch, mail servers, file and client web servers that are critical to production, and large network printers and plotters. These objects have the most frequent polling rate (say 30 seconds), with a low timeout (10 seconds) and a low countdown (3, never 1, as that will generate too many false positives). This is so I will get the notification as quickly as possible when something goes wrong, and hopefully before the users notice anything.
At the next level, I poll devices about every two minutes, with a timeout of 30 seconds and a countdown of 5. These devices may be important, but they are not critical to everyday tasks, so they can be down for a couple of minutes before I need to be aware of them, and I may not need notification after hours or on weekends.
Finally, I have the non-critical objects that just need to be monitored, like UPSs, that I record information from and monitor to ensure availability when we do need them. These are polled at the longest interval and can afford long timeouts and high countdowns. Since I am monitoring how these objects are used when primary systems fail, and ensuring that they are ready to step in, my notification for these is something that can be in my inbox when I show up at the office.
Now, this was just an example, and finer tuning can be performed, but you need to really understand the nature of the business you are supporting, what it is that keeps people productive, and what the impact is when one of those systems falls to an event.
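
The three levels above could be written down as a small table before touching the NMS. The following is only a sketch: the class, the tier names and the function are made up for illustration, the numbers for the non-critical tier are placeholders (the post gives none), and the business hours of Monday to Friday, 08:00 to 17:00 are an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Tier:
    interval_s: int    # seconds between polls
    timeout_s: int     # seconds to wait for a reply
    countdown: int     # consecutive failures before notifying
    after_hours: bool  # notify outside business hours?


TIERS = {
    "critical": Tier(30, 10, 3, True),
    "important": Tier(120, 30, 5, False),
    # No numbers are given for this tier in the post; these are placeholders.
    "non_critical": Tier(600, 60, 10, False),
}


def notify_now(tier_name: str, when: datetime) -> bool:
    """True if an alert for this tier should go out immediately at `when`."""
    tier = TIERS[tier_name]
    # Assumed business hours: Monday to Friday, 08:00 to 17:00.
    business_hours = when.weekday() < 5 and 8 <= when.hour < 17
    return tier.after_hours or business_hours


# 3 AM on a Sunday: check which tiers would notify right away.
sunday_3am = datetime(2008, 4, 6, 3, 0)
for name in TIERS:
    print(name, notify_now(name, sunday_3am))
```

Alerts that are held back would still need to be queued and delivered later, for example as a morning mail; that part is left out of the sketch.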
As to buying something "more preconfigured", you would still face the same learning curve to have it fully operational. In the case of most commercial products you may even have to learn a proprietary language to customize the tool and you will spend a great deal of money to have something that may or may not do what the Dude does for free.
Of course, that's just my opinion, based on experience with several OTF products and a few years in the IT game.

eivind · Sun Apr 06, 2008 4:09 am

Hello friend!

Some hours ago I installed Dude again, and made the following settings with care:
Installed Dude and did one discovery, no auto discovery. I have set 2-minute polling intervals, a 10-second timeout and 3 retries. I managed to make a nice level measurement in dB relative to -90 dBm. It gave me a readable graph spanning not from 0 to -90 dBm, but about 20 dB. The graph is even separated from the default ping graph when hovering the mouse over an item.
In addition to all that, I'm measuring the bandwidth usage of the server PC and it looks very nice: some hundred kilobytes now and then.
The only missing part now is to set mail/SMS notification on errors and on high bandwidth usage for the server.
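
The level measurement relative to -90 dBm comes down to a simple offset. The sketch below shows only that arithmetic; how the probe itself was built in Dude is not stated, and the function name and the sample readings are hypothetical.

```python
REFERENCE_DBM = -90.0  # reference level mentioned in the post


def level_above_reference_db(signal_dbm: float, reference_dbm: float = REFERENCE_DBM) -> float:
    """Signal level in dB relative to the reference level."""
    return signal_dbm - reference_dbm


# Hypothetical readings, chosen only to illustrate the conversion.
for dbm in (-85.0, -75.0, -70.0):
    rel = level_above_reference_db(dbm)
    print(f"{dbm:.0f} dBm is {rel:.0f} dB above {REFERENCE_DBM:.0f} dBm")
```

Plotting the offset value instead of the raw dBm reading keeps the graph within the range where the readings actually vary.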

My conclusion is that my intense polling was the "ghost", and I believe in Dude again. The next step is to expand its use to other areas.

So thank you very much for your help, and maybe I'll be able to help some others in the future...
