----
Background:
I have a situation resulting in the following symptoms on an MMIPS hEX S running ROS v6.48.1 (I also tried restoring the backups onto netinstalled versions from 6.45 through the latest 7.1 beta, with similar results):
- /system sup-output fails or freezes
- /export stalls or freezes (sometimes CTRL-c will recover, other times console frozen until reboot)
- /system backup will save and restore .backup files without any reported errors
- Even though manually running /system sup-output fails, an autosupout.rif is recreated on most (all?) system crashes. This was still happening as recently as last night, before I formatted the drive with netinstall.
- The system crashes and reboots either around 3 minutes 20 seconds or 5 minutes and 10 seconds, depending on ... (the installed packages, maybe??)
- LEDs, USB, netinstall, and the Mode and Reset buttons appear to work perfectly.
- My only system access in this situation is through a Woobm-USB console, and whatever files I can read or write to the flash and microSD card.
- I believe this all happened yesterday when a watchdog timer tripped. There may have been a pending routerboard update that ran on that reboot, but I don't think so.
This all happens even if the only package installed is system-n.nn.n-mmips.npk.
There is an autosupout.rif file just sitting there, mocking me, seemingly untouchable.
----
Half-baked theories:
One of my novice theories is that some corruption in the configuration (which restoring the backup file brings back onto a clean netinstall) takes the switch chip offline, crashes the associated software, disables the GPIO pins controlling the switch chip, or crashes some Linux kernel extension. The Linux system boots and RouterOS starts configuration, then the interfaces disappear, and something just can't cope with that reality.
It seems unlikely that the actual plain-text configuration could cause this sort of thing (surely that is screened for syntax and sanitized to prevent buffer-overflow exploits), but perhaps there is some sort of compiled representation of the configuration (i.e. binary data structures stored on disk and cached in memory, or at least an intermediate form such as the one used to store scripts in ROS) that has somehow been corrupted. This is based on the symptoms listed above, plus:
- Port activity lights for connected switch ports work on boot (i.e. flash to show line activity), but all go off late in the boot process
- Device is not visible in winbox
- I can print or export information from parts of the cli tree that seem to have no nexus with a physical interface (e.g. /system routerboard, /ip dhcp vendor-table, /ip dns)
- I cannot even print information from many other places (print count-only also fails in these cases). Sometimes after a CTRL-c the console will recover and the headers for that section will be displayed with no data.
- At least once, I noticed what seemed like a partial default-configuration script in /environment (it looked official, and didn't contain any of my config data)
- There is little to nothing in the logs about these failures (and I have turned on many logging topics, to microSD, memory, and echo)
- The only log entry of note, logged at about 17 seconds of system uptime, is: system,error,critical router rebooted because some critical program crashed
- One time only, this log entry occurred immediately following that one:
system,error,critical router was rebooted without proper shutdown
- If I start with a clean configuration or reload an older backup file, all the problems magically seem to go away.
- If I restore this backup file (or any other created after yesterday's crash) on a working router, all of the above symptoms return*.
* (which I guess means that the autosupout.rif file and the backup file may be ... ?mostly the same thing?, since in this case the backup file is only 10 KiB larger than the autosupout.rif file. That, or maybe the backup file contains more data, plus a compressed version of the autosupout.rif file contents. Or one or both of the files in this case are corrupted or incomplete.)
----
What I'm asking for / Tl;dr:
I would love one or more of these to be a possible solution:
- copy the file to a USB drive or microSD card
- specify an alternate location for autosupout.rif if external storage is mounted (i.e. automatically, or configured through either a file of a specific name existing on the internal flash storage, a ROS $global variable, or a normal system configuration flag settable through winbox and the cli)
- display the file in console (even if it was just in hex, so it could be copied/pasted to a pc where one could reconstitute the original file)
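
To make the hex idea concrete: if the console could print the file as plain hex digits, rebuilding it on a PC would be the easy part. Here is a minimal Python sketch, assuming the captured text contains only hex digits and whitespace (no offset or ASCII columns). The dump format and the function names are my own assumptions for illustration, not anything RouterOS provides today.

```python
import zlib


def hex_to_bytes(pasted: str) -> bytes:
    """Convert pasted hex text (digits and whitespace only) into raw bytes."""
    # Drop spaces, newlines, and carriage returns added by the terminal.
    compact = "".join(pasted.split())
    # bytes.fromhex raises ValueError on non-hex characters or an odd digit
    # count, so a damaged paste fails loudly instead of producing a bad file.
    return bytes.fromhex(compact)


def rebuild_file(pasted: str, out_path: str) -> int:
    """Write the decoded bytes to out_path and return their CRC32."""
    data = hex_to_bytes(pasted)
    with open(out_path, "wb") as f:
        f.write(data)
    # A checksum makes it easy to compare two separate captures of one dump.
    return zlib.crc32(data)
```

Usage would be something like rebuild_file(open("capture.txt").read(), "autosupout.rif"), where capture.txt is a hypothetical terminal log saved from the Woobm session. Capturing the dump twice and comparing the returned checksums would be a cheap way to catch dropped serial characters.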
One final idea I had was to spin up a CHR instance and try restoring the .backup file there, but since CHR is a completely different architecture I figure it's a long shot. (I believe I have seen in several places that .backup files are only designed to be restored to the same model of device on which they were created.)
--> Are any of those possible today, perhaps with some hidden configuration setting or some magic scripting?
--> Did I just miss something super-obvious?
--> Any other ideas?
--> If this is not possible now, can some functionality be added to a future release? (Kermit, anyone? ;)
Thanks for reading! If I find something useful myself, I'll try to come back and post it to help others in the future.
--
P.S. Yes, yes, I know I should have had a better backup solution. Someone will probably tell me anyway. Fair. (I receive autosupouts via e-mail from the watchdog timer, but the last one is over a month old because this device is normally so stable.) Still, six weeks of minor firewall and QoS tweaks made to facilitate pandemic-related home work add up to something more significant than I had predicted... if any of that data is still there, I'd love to see it!