The Dude: Large scale setup. Improved performance. No timeouts.

rememberme · Mon Jan 23, 2023 9:18 pm

Hi all!

I wanted to write a post describing my Dude setup so other people in the community can benefit from this nice monitoring tool.

First off, I have been working in the communications industry for as long as I can remember. Over the years I've used dozens of monitoring systems while working for many ISPs and similar companies.
What I can tell for sure is that no single monitoring system fits everyone and every need. Every free and paid system has its own unique benefits and its own drawbacks. That is why I got used to running many systems at the same time to get the most from the network I'm working with.
IMHO, no other monitoring system can graphically represent the network the way The Dude does. That is almost the only reason I keep it.

I'm not gonna start any sort of contest like what's better, a hammer or a screwdriver, SolarWinds or Zabbix, Dude or Intermapper, etc. You can have them all (and maybe you'd better).
In this topic I will focus on how to get the most juice out of the MikroTik Dude.

At this point my largest dude setup consists of:
Devices: 17,000
Probes: 6,000 (ICMP/SNMP/TCP)
Polling: every 10 seconds, 5 failed probes for an outage, 10-second probe timeout
This gives me the "golden 1 minute": an outage alert is triggered within about a minute.

I will (perhaps) describe in detail how and why I arrived at these settings later on, but for now I will just post everything I did with minimal explanations so you can replicate the setup or adjust your own.

Server:

The Dude server runs as an ESXi VM:
ROS: CHR v7.7 licence P1
CPU: 12 cores, 2.66 GHz Intel Xeon X5650
RAM: 12GB
Storage: SSD 40GB

The server resources are overkill. CPU usage averages 1%. The RAM is overkill as well, as the system uses roughly 1.5GB of it. Fast storage is a must, as the database commit happens in "all at once" fashion and you have to be fast to dump the data to disk. The storage space needed is minimal. It depends on the retention settings you want, but my setup uses roughly 300MB for the DB file with generous graphing.

Polling:

The most used probe in my setup is ping. The "ping probe settings" are as follows:
name: ping
packet size: 32 bytes
TTL: 64
retry count: 8
retry interval: 1 second

The point here is to have the device pinged a couple of times during the probe interval of 10 seconds.

It appears that the Dude likes the retry interval in increments of 1 second. That is, if you set the ping probe to ping 4 times every 250ms, everything will time out. The same goes for an interval of, say, 1250ms. I don't know how it's written in the code, but I know for sure that a 1000ms interval works best.

So I poll my devices every 10 seconds and send up to 8 packets, one every 1000ms; any single ping response is enough to validate the probe.

Example of what happens with a device that is down:
1. the probe is initiated
2. 8 ICMP packets are sent, one every second
3. the 8th packet is still unanswered...
4. the probe stays locked for another 2 seconds
5. on the 10th second, the probe is considered down due to the 10-second probe timeout
6. the timer (memory) is freed, so the next probe can start right after
That is, I locked some memory for the probe for 10 seconds and then freed it right away.

If the device is responsive, the probe will be considered "up" with the first packet and the timer/memory will be freed.

The wrong way would be to have:
Probe interval 10 seconds
Ping of 20 packets every 1000ms interval.
This setup means a 20-second probe duration for a device that's down, so the probe is still ongoing when the next one is queued and the two overlap.

The goal here is not to have probes with overlapping timers. That is, every probe has to complete, either "up" or "down", before the next probing round starts.
The Dude queues all probes at the same time instead of spreading them over the probing interval, so timers are crucial here. Otherwise you'll run into false-down craziness.
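
The no-overlap rule can be checked with a few lines of Python before touching the settings. This is only a sketch under one assumption taken from the description above: a probe on a dead device keeps its timer until both the retries and the probe timeout have run out. The function names are made up for illustration and are not part of The Dude.

```python
def worst_case_probe_s(retry_count, retry_interval_s, probe_timeout_s):
    """Longest time a probe on a dead device keeps its timer.

    Assumption (from the post, not from Dude source code): the timer is
    held until both the retries and the probe timeout have elapsed.
    """
    return max(retry_count * retry_interval_s, probe_timeout_s)


def probes_overlap(retry_count, retry_interval_s, probe_timeout_s, probe_interval_s):
    """True if a failing probe is still running when the next round starts."""
    worst = worst_case_probe_s(retry_count, retry_interval_s, probe_timeout_s)
    return worst > probe_interval_s


# Settings from the post: 8 retries, 1 s apart, 10 s timeout, 10 s interval.
print(probes_overlap(8, 1.0, 10, 10))   # no overlap expected
# The "wrong way": 20 retries, 1 s apart, same 10 s interval.
print(probes_overlap(20, 1.0, 10, 10))  # overlap expected
```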

The beauty of the setup is in the server/agent scheme.
Together with the server, I have 5 agents running: 4 agents are used for service polling and 1 for miscellaneous device polling updates. I'll elaborate below:

Dude agents:

Polling: 4 x CCR2116
Device/misc: 1 x CCR2116

I found that a CCR2116 can poll about 4000-5000 services every 10 seconds without falsely timing out on probes. To stay on the safe side, don't go over 4000 probes every 10 seconds per agent. In other words, a CCR2116 agent can do about 400 probes per second, so you can change the poll interval to fit this performance figure.
I have tried CHR, TILE and ARM based routers as agents, but the best for me turned out to be the ARM64 CCR2116 that I had in stock. You can use routers of other architectures, but you'll have to scale down the intensity of probing. The main issue is that polling is not distributed evenly over all cores, so you end up with a mostly idle CPU and the load slammed onto just a couple of cores.
My 6000 probes run just fine on 2 agents, but I have installed 4 of them to be future-proof (a network of the same size is being added right now). That is, I expect the setup to handle roughly 34,000 devices and 12,000 probes every 10 seconds with a 5-probe down alert.
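
The sizing above is simple arithmetic, so here is a rough Python sketch of it. The 400 probes per second figure is the safe number quoted above for a CCR2116; for any other hardware it is an assumption you would have to measure yourself. The function names are hypothetical.

```python
import math

# Safe per-agent rate from the post: 4000 probes every 10 s on a CCR2116.
SAFE_PROBES_PER_SECOND = 400


def agents_needed(total_probes, probe_interval_s, per_agent_rate=SAFE_PROBES_PER_SECOND):
    """Number of polling agents needed to stay at or below the safe rate."""
    required_rate = total_probes / probe_interval_s
    return math.ceil(required_rate / per_agent_rate)


def min_probe_interval_s(total_probes, agents, per_agent_rate=SAFE_PROBES_PER_SECOND):
    """Shortest probe interval a given number of agents can sustain."""
    return total_probes / (agents * per_agent_rate)


print(agents_needed(6000, 10))          # current network
print(agents_needed(12000, 10))         # after the planned expansion
print(min_probe_interval_s(12000, 2))   # interval if only 2 agents were available
```

With these numbers the current 6000 probes fit on 2 agents and the planned 12,000 probes need 3, so 4 agents leave some headroom.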

Besides the polling agents there is what I call a "device/misc" agent. This agent is used just to offload from the server the label updates (I use SNMP OIDs in device labels), MAC mapping and DNS resolving: the miscellaneous stuff. Without this misc agent you'll run into an issue where you try to open a device's settings and your Dude client hangs, eats up 100% of one CPU core and dies.
This issue has been described in a couple of different topics on this forum, so I hope this answer helps.

The polling scheme (how to):
- Distribute services over all the agents:

Services -> Status=all, Type=all, Map=all

Sort by name

Select the first one and press Shift+Page Down, selecting around 200 services at a time. It looks like the selection count is internally limited to 255. Sadly, you can't use Ctrl+A...

Right click on the selection -> Settings (NOT DEVICE SETTING)

Set the agent to the appropriate one (one of the 4 I have)

Distribute all services evenly across all agents

- Set the devices themselves to the misc poller:

Devices -> Status=all, Type=all, Map=all

Select the first one and press Shift+Page Down, selecting around 200 devices at a time. It looks like the selection count is internally limited to 255. Sadly, you can't use Ctrl+A...

Right click on the selection -> Settings (NOT DEVICE SETTING)

Set the agent to the misc agent

It looks like setting the "default" agent in the global settings to the "misc" agent doesn't do the trick, so you have to set it explicitly on a per-device basis.

To get an idea of how many services are on any agent, you can use WinBox on the Dude server:
Dude -> Services: filter by "Agent contains <name of the agent>"

Do the same to check that no devices are left monitored by the server itself:
Dude -> Devices -> Device tab: filter by "Agent contains default" or "Agent contains server" - you should have none as everything must be on the misc agent.

Once this is done, every time you add a new device to monitoring, just select the "misc agent" as the agent for the device and one of the polling agents as the agent for each added probe.
Later on, you can redistribute the load across the pollers with the same technique.
I wish MikroTik would finally implement a Dude CLI so you could write scripts for redistributing the load over agents. For now, it's all by hand...
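
Until then, the split can at least be planned outside The Dude. The Python sketch below does not talk to The Dude in any way; it only takes a list of service names (exported or typed in by hand), sorts it the same way as the Services pane, and tells you which selection of up to 200 rows goes to which agent. The agent and service names are hypothetical placeholders.

```python
def plan_assignment(services, agents, batch_size=200):
    """Split name-sorted services into near-equal shares, one per agent,
    and cut each share into selections of at most batch_size rows."""
    ordered = sorted(services)
    base, extra = divmod(len(ordered), len(agents))
    plan = []
    start = 0
    for index, agent in enumerate(agents):
        # The first `extra` agents take one more service each.
        share = base + (1 if index < extra else 0)
        end = start + share
        for batch_start in range(start, end, batch_size):
            batch_end = min(batch_start + batch_size, end)
            plan.append((agent, ordered[batch_start:batch_end]))
        start = end
    return plan


# Hypothetical example: 6000 services spread over 4 polling agents.
services = [f"svc-{n:05d}" for n in range(6000)]
agents = ["poller-1", "poller-2", "poller-3", "poller-4"]
for agent, batch in plan_assignment(services, agents):
    print(agent, batch[0], "...", batch[-1], f"({len(batch)} rows)")
```

Each printed line corresponds to one manual "select, right click, Settings, set agent" round.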

That being said, the server itself is only responsible for maintaining the database and all the polling is done by the agents.
This setup has been proven to be the most stable and smooth.

PS: I would love MikroTik to move forward with improving The Dude, add 64-bit counters, revamp the polling strategy and so on, but while we wait I think it's worth giving The Dude a chance to live.

I hope I helped someone like me who is trying to keep The Dude while fighting with its caprices every day!

PPS: no, I don't have to reboot the CHR or the agents, nor restart the Dude service. It's fine and stable. I run v7.7 just because I wanted to, but the setup has been stable for many versions, ever since I brought it up to the scheme I described.

Mon Jan 23, 2023 10:14 pm

useful info

thank you for sharing !!!

Mon Jan 23, 2023 10:30 pm

How much chart keep time do you use for:

Raw value:
10 min value:
2 hour value:
1 day value:

When using the Windows client to view a history graph of a service or a device, have you had trouble viewing graphs that span several days?

simogere · Tue Jan 24, 2023 11:33 am

Hi @rememberme, thanks for sharing your experience!
Can you attach some screenshots of your dude settings mentioned above?

rememberme · Tue Jan 24, 2023 8:58 pm

How much chart keep time do you use for:

Raw value:
10 min value:
2 hour value:
1 day value:

When using the Windows client to view a history graph of a service or a device, have you had trouble viewing graphs that span several days?

raw: 3h
10min: 24h
2hour: 7d
1day: 30d

No issues going through graphs for the whole retention period.

rememberme · Tue Jan 24, 2023 9:16 pm

Hi @chechito, thanks for sharing your experience!
Can you attach some screenshots of your dude settings mentioned above?

Here you go:

Tue Jan 24, 2023 11:20 pm

very useful info, thank you for sharing

challado · Thu May 25, 2023 5:39 pm

The most important feature of a monitoring system is the ability to interact with it. The Dude doesn't have ANY way to interact with your data. It doesn't even have a simple way to add devices programmatically, neither through the API nor through the console. Observium, Nagios, PRTG, Zabbix: all of these have a way to interact with their objects and simple ways to do simple tasks, like adding devices programmatically.

simogere · Mon Mar 04, 2024 3:17 am

About polling function, I made some tests:

For example if we have:

Polling:
Probe Interval = 10 seconds
Probe Timeout = 5 seconds
Probe Down Count = 3

Ping probe:
Retry count = 3
Retry interval = 1s

You will receive a message that the service is down at 00:00:23

Poll 1= poll from 00:00:00, failed at 00:00:03 due to ping probe settings
Poll 2= poll from 00:00:10, failed at 00:00:13 due to ping probe settings
Poll 3= poll from 00:00:20, failed at 00:00:23 due to ping probe settings -> Notification

With the default settings (probe interval: 30, timeout: 10, probe down count: 5), you will receive a message that the service is down at 00:02:03

@rememberme, with your settings it seems you will receive a ping service down message at 00:00:48
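
The three figures above follow the same pattern, which can be written down as a small Python sketch. It assumes what the test above suggests: a failed ping probe is noticed after retry count times retry interval, and in all three cases that is shorter than the probe timeout. For the default polling settings it also assumes a ping probe with 3 retries at 1 s, which is what the 00:02:03 figure implies. The function names are made up for illustration.

```python
def seconds_to_down_alert(probe_interval_s, down_count, retry_count, retry_interval_s):
    """Time from the first failing poll to the down notification.

    The first (down_count - 1) polls each cost one full interval; the last
    poll fails after its retries run out (assumed shorter than the timeout).
    """
    failed_probe_s = retry_count * retry_interval_s
    return (down_count - 1) * probe_interval_s + failed_probe_s


def as_clock(seconds):
    """Format a number of seconds as hh:mm:ss."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(as_clock(seconds_to_down_alert(10, 3, 3, 1)))   # test settings
print(as_clock(seconds_to_down_alert(30, 5, 3, 1)))   # default polling settings
print(as_clock(seconds_to_down_alert(10, 5, 8, 1)))   # rememberme's settings
```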
