
version 3.4 so far

Posted: Wed May 06, 2009 5:44 am
by adamd292
1) The find bugs are fixed :)
I can find by string; by case sensitive string; and by regex correctly.
I can also "not find" something without crashing.
This is excellent.

2) The SNMP sample bug is still there
reprobe.png
3) False outages
None yet, will check at end of day.

Re: version 3.4 so far

Posted: Wed May 06, 2009 9:55 am
by adamd292
False outages are much the same as before (Dude 3.3 + Set Affinity)
I've had one false 28-second outage this afternoon.

Re: version 3.4 so far

Posted: Wed May 06, 2009 3:47 pm
by adamd292
My posts about the remaining few small issues I have with the Dude should in no way put anyone off from upgrading or switching to the Dude.

The Dude really is a fantastic program and is way better than many other applications.

Re: version 3.4 so far

Posted: Wed May 06, 2009 5:16 pm
by lebowski
I was thinking about the reprobe issue... When you click reprobe, it is probably not creating a new timeslot and is adding the values together instead.

The code for polling has not changed in 3.4 as far as I can tell. Other things seem to correlate slightly with polling becoming unstable, such as ping not staying around 1 ms, but not always, so it is not predictable. Older versions would not even record any 1 ms pings, so this is an improvement.
ping.JPG
I have not tried running it not as a service... Maybe I will try that next.

Re: version 3.4 so far

Posted: Wed May 06, 2009 8:00 pm
by lebowski
I ran it as a user and not as a service... soon after I started having false positives.
probe.JPG
I have changed it back to a service and I am now disabling automatic discovery to see if it is more stable that way...

Re: version 3.4 so far

Posted: Mon May 25, 2009 5:09 am
by adamd292
My outage problem is mostly resolved now.

Some of my devices take a long time to respond to SNMP messages (but not all the time).
This is usually due to the CPU load on the device, although some devices just seem to have a "bad attitude".

On the Bad Attitude devices - I have reduced the number of SNMP values being probed, and they have come good.

The problem with High CPU outage devices isn't something I can fix on the device, and I don't think I can fix it in The Dude. It looks like it's just a case of buying devices that can handle the load properly.

Re: version 3.4 so far

Posted: Tue May 26, 2009 5:53 pm
by lebowski
I have tried everything, and it seems that if I make a change to any map, collection goes screwy a few minutes afterwards, and even if I don't make a change, collection goes screwy every few hours. Here are a couple of views of the trouble. I just have to restart the service every couple of hours. Sadly, I can't use the notifications at all since I have horrible counts of false positives. It seems as though there is something wrong with multi-threading as well, because graphing goes screwy sooner if I don't force affinity to a single core. Then again, I am now pinging about 1600 more devices, since we want to know how many computers are being turned off overnight. Maybe graphing is inconsistent because I am exporting one of my dashboards every 15 minutes?
graph1.PNG
graph2.PNG
Does anyone have any ideas about how to get collection to be consistent?
SD

Re: version 3.4 so far

Posted: Wed May 27, 2009 1:24 pm
by Minollie
Hi guys,

The Dude 3.4 leaves me with a couple of questions.

1:
Occasionally The Dude 3.4 simply panics and marks almost everything as down, sending lots of false positives.
Sadly, I haven't been able to pin down when and under what circumstances this happens, as both vary and there is no logging for whatever reason.

2:
A bug from previous versions that occurs when opening the details page of some devices has come back, or was never fixed at all.
Sometimes some devices more or less hang The Dude when you open the 'General' tab, and after a while it displays 4,2 G of services which do not really exist.

3:
Not all of the graphs made by The Dude are consistent. I understand the spikes to be the result of a missed poll followed by a successful poll, but I don't understand why a device that appears to be SNMP-browsable with The Dude's SNMPWalk shows gaps every now and then.

4:
As a new feature I would like to request the ability to add a simple digital clock on a map.
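
On point 3, the "missed poll, then successful poll" spike can be illustrated with a small sketch. It assumes a grapher that divides each counter delta by the nominal polling interval rather than by the time that actually elapsed; whether The Dude does this is not confirmed anywhere in this thread, and the sample values are made up.

```python
def rates(samples, nominal_interval=None):
    """Turn (timestamp_seconds, counter_value) samples into per-second rates.

    If nominal_interval is given, every delta is divided by that fixed value
    (the assumed "naive" behaviour); otherwise the real elapsed time is used.
    """
    out = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = nominal_interval if nominal_interval else (t1 - t0)
        out.append((c1 - c0) / elapsed)
    return out


# Hypothetical counter growing at a steady 100 units/s, polled every 30 s.
# The poll that should have happened at t=60 was missed.
samples = [(0, 0), (30, 3000), (90, 9000), (120, 12000)]

naive = rates(samples, nominal_interval=30)  # spike after the missed poll
exact = rates(samples)                       # flat, as the traffic really was

print(naive)
print(exact)
```

With these numbers the naive series doubles for one sample while the elapsed-time series stays flat. This only covers spikes; it says nothing about the gaps.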

This is it for now, will be back later.

Regards,
Minollie

Re: version 3.4 so far

Posted: Wed May 27, 2009 5:40 pm
by lebowski
Hey adamd292, well, since you mentioned CPU, I decided to play with the settings again.

I changed polling to 30 with an 8 second timeout and I almost instantly got false positives. I looked at the outages and they had lasted some minutes and 22 or 52 seconds by the time they finally recovered. So it seems that although the packet has gone back and forth in under a second, The Dude is not timestamping the return packet until many seconds after it has arrived.
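
One possible reading of the 22 and 52 second figures, as a rough sketch only: it assumes the "30" is a polling interval in seconds, that "down" is stamped when the first probe times out, and that "up" is stamped when a later scheduled probe succeeds. None of this is confirmed for The Dude, and the function names are made up for illustration.

```python
def false_outage_seconds(missed_polls, interval=30, timeout=8):
    """Outage length under the assumed model.

    The first failed probe is sent at t=0 and marked down at t=timeout.
    The device is marked up again at the next probe that succeeds,
    i.e. at t = missed_polls * interval.
    """
    return missed_polls * interval - timeout


def fmt(seconds):
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} min {secs} s"


durations = [fmt(false_outage_seconds(n)) for n in range(1, 5)]
for d in durations:
    print(d)
```

Under these assumptions every outage ends in 22 or 52 seconds, which matches the pattern reported above. It does not explain why the probes were counted as missed in the first place.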

I set the affinity back to all cores and set the priority for the dude process to high with the Sysinternals tool "Process Explorer".
This seems to have cleaned up graphing. I still get some up/down spikes and missing parts, but not nearly as bad.

I have not restarted the service and over night I only had 5 outages that might have been accurate.

My recent collection weirdness started when I added about 1600 computers to a single map, so right now I am collecting for about 1700 devices.
Thankfully this is only temporary. I am looking forward to seeing how it will graph once I delete all the computers.

Another thing I did was replace the old intel network driver.

Re: version 3.4 so far

Posted: Thu May 28, 2009 8:48 am
by lebowski
I have found an interesting way to quickly and temporarily fix a false positive. When one of your probes shows "down" (false positive), open the settings for the device and click re-probe; you will notice it does not come back immediately, no matter how many times you click re-probe. Now edit the probe and change the error portion of the probe by either adding or deleting a space character. Then save the probe and re-probe, and it will return to normal instantly.

I hope we are getting close to being able to describe this bug exactly. If it were fixed, I think the dude would be much more popular with the windows crowd. I feel that folks get it up and running, then start seeing the large number of false positives and just uninstall it without seeing how completely awesome The Dude is. But I have to say I am tired of trying to understand this bug and get this issue fixed.

Re: version 3.4 so far

Posted: Thu May 28, 2009 11:54 am
by normis
about the clock - if time is available over SNMP, you can display it below the device. If you need the system time - well, you can install some windows widget that's always-on-top :)

Re: version 3.4 so far

Posted: Tue Jun 02, 2009 10:15 pm
by lebowski
Well after tons of messing around I believe 1,700 devices is too many for my setup. I deleted the end computers out of the "all computers map" (only service was ping) and now my graphing is looking very good for everything. I even set ping on this map to 5 minute intervals. There were 1600 devices on this single map. I was getting endless false positives, now I have none.

I have also found that if you force the dude to high priority it graphs much better. With high priority you do not need to set affinity to one cpu.

Note: The cpu was never above 20% in any of my testing. Even when it was forced high with affinity to a single cpu it would hang around 16% for that one cpu.

Re: version 3.4 so far

Posted: Wed Jun 03, 2009 12:14 pm
by Minollie
@ Normis

Hi Normis,

Actually I already use an SNMP probe for the time on one map, but there are some drawbacks to using this as a way to display time.
1- If the device used to get the time fails you don't have the time any longer (how obvious.. ;))
2- I don't think it's desirable to put an SNMP clock under every single device just to avoid that risk; the overview of the map would be a mess in that case
3- Many times the format of the SNMP clock is not easily human-readable.. (the SNMP clock I use is formatted like: YYYY-MM-DD, HH:MM:SS,s where s is 1/10th of a second.. bit too much info.. )
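
To make point 3 concrete, here is a minimal sketch of trimming such a clock string down to something readable. It runs outside The Dude (for example on an exported value), the sample string is invented to match the format described above, and `short_clock` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Matches e.g. "2009-06-03, 12:14:05,3" (date, time, optional tenths of a second).
PATTERN = re.compile(
    r"(\d{4})-(\d{1,2})-(\d{1,2}),\s*(\d{1,2}):(\d{1,2}):(\d{1,2})(?:[.,](\d))?"
)


def short_clock(raw):
    """Return 'YYYY-MM-DD HH:MM' from an SNMP-style clock string."""
    m = PATTERN.match(raw.strip())
    if m is None:
        raise ValueError(f"unrecognised clock string: {raw!r}")
    # Only the first six groups are needed; seconds and tenths are dropped on output.
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    return datetime(year, month, day, hour, minute, second).strftime("%Y-%m-%d %H:%M")


print(short_clock("2009-06-03, 12:14:05,3"))
```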

I think it would be good to have a static item showing the server time, which you could easily add to any map you like or leave out.
Doing it this way also makes sure that users of the clients get the server time (assuming it's correct of course.. ;)), and people who use the web interface will get the same time. Maybe this is worth considering?

Anyway, now I also have a question maybe resulting in a feature request..
Is there a way to add the same note to a whole list of devices without the need to add them one-by-one?
I'm trying to add 'Serialnumber: [oid("1.3.6.1.2.1.47.1.1.1.1.11.1")]' to lots of devices, so the serial appears when you hover over the device, but one-by-one is a bit time-consuming if you ask me, with over 1000 devices..
Do you happen to know an easy way to do this? If not, could/would you consider implementing something that makes this possible?

This is it for now, thanks anyway for The Dude, it's a better tool than most people ever imagine..

Best regards,
Minollie

Re: version 3.4 so far

Posted: Wed Jun 03, 2009 7:55 pm
by lebowski
Minollie try this...

Click on Global Settings (top left), click Map, click Device Appearance, then paste your oid into Tooltip: "Serialnumber: [oid("1.3.6.1.2.1.47.1.1.1.1.11.1")]"

Click on ok,

Hover on a device.

Worked for me and thanks for the oid :)
[edit: Cisco devices have a different oid for serial number]
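
If the tooltip text has to be produced for more than one OID, a tiny helper can build it. This is only a sketch: the base OID is entPhysicalSerialNum from the ENTITY-MIB, the trailing number is the entity index, and the idea that another device might report its serial under a different index is an assumption, not something confirmed in this thread. `serial_tooltip` is a hypothetical name.

```python
# entPhysicalSerialNum column from the ENTITY-MIB; the index is appended below.
ENT_PHYSICAL_SERIAL_NUM = "1.3.6.1.2.1.47.1.1.1.1.11"


def serial_tooltip(index=1, label="Serialnumber"):
    """Build the tooltip text in the [oid("...")] form used in the posts above."""
    return f'{label}: [oid("{ENT_PHYSICAL_SERIAL_NUM}.{index}")]'


print(serial_tooltip())
```

The default call reproduces the exact string pasted into the Tooltip field above.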