MikroTik

Posted: **Wed Dec 27, 2017 6:49 pm**

Issue:

SIP client cannot re-register in the SIP server after switching ISP (different NAT).

Description:

In our setup we have two ISPs, a SIP client with a private IP, and NAT (a different NAT rule for each ISP) with SIP ALG translation, aka the SIP NAT helper.

When the default route changes from one ISP to the other (manually, or because the ISP link goes down), the Mikrotik applies the wrong NAT rule. Because of this, the SIP REGISTER messages cannot reach the SIP server and the SIP connection drops.

If we clear the NAT table or even reboot the router, everything works again.

Versions affected:

6.38 (mipsbe), 6.38.5 (chr), 6.39.3 (mipsbe), 6.41 (chr)

Note: We have tested some real Mikrotiks (mipsbe) and run some simulations in GNS3 with routerosx86 Mikrotik virtual machines (chr).

How to reproduce:

- Connect a router to two different ISPs (each one giving you a different real IP) and to an internal network;
- Create proper NAT rules for each ISP;
- Create proper default routes (static routes) for each ISP (the first ISP with the smaller distance);
- Set up a SIP client (in the internal network) to register with an external SIP server and let it register;
- Change the distance of the default routes so the second ISP becomes the active route (smaller distance);
- Try to re-register the SIP client with the SIP server and you will see that no SIP message returns and the re-register fails;
- Check the NAT table and run a sniffer on the router and you will see that the router is routing the packet via the second ISP but is still applying the old NAT rule (for the first ISP) instead of the correct NAT rule.

Network setup and detailed how to reproduce:

My production setup is somewhat complicated. However, I ran a much simpler setup using GNS3, so I'm showing this simplified setup here.

A screenshot of my GNS3 setup is shown below:

The isp1 and isp2 nodes simulate the two different ISPs. They connect to the Internet via GNS3 NAT nodes (if you don't know how GNS3 works, just consider that the isp1 and isp2 nodes behave like real ISP routers).

Our router (router) is connected to both ISPs and also to the sip-client node (an Ubuntu 14.04 docker node that simulates a SIP client).

A more detailed diagram is shown below:

The implementation of the router node is:

/ip address
add address=10.10.1.2/24 interface=ether1 network=10.10.1.0
add address=10.10.2.2/24 interface=ether2 network=10.10.2.0
add address=192.168.0.1/24 interface=ether3 network=192.168.0.0
/ip firewall nat
add action=src-nat chain=srcnat out-interface=ether1 to-addresses=10.10.1.2
add action=src-nat chain=srcnat out-interface=ether2 to-addresses=10.10.2.2
/ip route
add distance=1 gateway=10.10.1.1
add distance=2 gateway=10.10.2.1
/system identity
set name=router

To make NAT tests easier, we have also increased the ICMP connection tracking timeout:

/ip firewall connection tracking
set icmp-timeout=1h

And this is the network configuration of the SIP client (/etc/network/interfaces):

auto eth0
iface eth0 inet static
	address 192.168.0.100
	netmask 255.255.255.0
	gateway 192.168.0.1
	up echo nameserver 192.168.0.1 > /etc/resolv.conf

Our SIP client connects to the SIP server using NAT with the help of the SIP ALG translation, aka SIP nat helper:

[admin@router] > /ip firewall service-port print where name=sip
Flags: X - disabled, I - invalid 
 #   NAME                                                                 PORTS
 0   sip                                                                  5060 
                                                                          5061

Now, in the client, we will ping (ICMP) the SIP server and also send a SIP message to our SIP server. After this, we have the following entries in the firewall NAT table:

[admin@router] > /ip firewall connection print detail where protocol=icmp      
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying, 
F - fasttrack, s - srcnat, d - dstnat 
 0  S C  s  protocol=icmp src-address=192.168.0.100 dst-address=199.87.121.233 
            reply-src-address=199.87.121.233 reply-dst-address=10.10.1.2 
            icmp-type=8 icmp-code=0 icmp-id=521 timeout=58m16s orig-packets=4 
            orig-bytes=336 orig-fasttrack-packets=0 orig-fasttrack-bytes=0 
            repl-packets=3 repl-bytes=252 repl-fasttrack-packets=0 
            repl-fasttrack-bytes=0 orig-rate=0bps repl-rate=0bps 

[admin@router] > /ip firewall connection print detail where connection-type=sip
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying, 
F - fasttrack, s - srcnat, d - dstnat 
 0  SAC  s  protocol=udp src-address=192.168.0.100:5060 
            dst-address=199.87.121.233:5060 
            reply-src-address=199.87.121.233:5060 
            reply-dst-address=10.10.1.2:5060 connection-type="sip" 
            timeout=58m21s orig-packets=3 orig-bytes=1 347 
            orig-fasttrack-packets=0 orig-fasttrack-bytes=0 repl-packets=3 
            repl-bytes=927 repl-fasttrack-packets=0 repl-fasttrack-bytes=0 
            orig-rate=0bps repl-rate=0bps

To reproduce the problem, let's change the active default route to the second ISP:

[admin@router] > /ip route print where dst-address=0.0.0.0/0
Flags: X - disabled, A - active, D - dynamic, 
C - connect, S - static, r - rip, b - bgp, o - ospf, m - mme, 
B - blackhole, U - unreachable, P - prohibit 
 #      DST-ADDRESS        PREF-SRC        GATEWAY            DISTANCE
 0 A S  0.0.0.0/0                          10.10.1.1                 1
 1   S  0.0.0.0/0                          10.10.2.1                 2

[admin@router] > /ip route set [find static gateway=10.10.1.1] distance=100    

[admin@router] > /ip route print where dst-address=0.0.0.0/0               
Flags: X - disabled, A - active, D - dynamic, 
C - connect, S - static, r - rip, b - bgp, o - ospf, m - mme, 
B - blackhole, U - unreachable, P - prohibit 
 #      DST-ADDRESS        PREF-SRC        GATEWAY            DISTANCE
 0 A S  0.0.0.0/0                          10.10.2.1                 2
 1   S  0.0.0.0/0                          10.10.1.1               100

The new scenario is shown in the image below:

Now, in the client, we will ping (ICMP) the SIP server and send a SIP message to the SIP server again. We will notice that the ICMP ping works, but no SIP reply comes back.

The reason is that the new ICMP packets add a new NAT entry (via the second ISP), but the SIP traffic still uses the NAT entry created via the first ISP (the router NATs the packet using the IP of ether1 although it sends the packet via ether2).

[admin@router] > /ip firewall connection print detail where protocol=icmp      
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying, 
F - fasttrack, s - srcnat, d - dstnat 
 0  S C  s  protocol=icmp src-address=192.168.0.100 dst-address=199.87.121.233 
            reply-src-address=199.87.121.233 reply-dst-address=10.10.1.2 
            icmp-type=8 icmp-code=0 icmp-id=521 timeout=49m38s orig-packets=4 
            orig-bytes=336 orig-fasttrack-packets=0 orig-fasttrack-bytes=0 
            repl-packets=3 repl-bytes=252 repl-fasttrack-packets=0 
            repl-fasttrack-bytes=0 orig-rate=0bps repl-rate=0bps 

 1  S C  s  protocol=icmp src-address=192.168.0.100 dst-address=199.87.121.233 
            reply-src-address=199.87.121.233 reply-dst-address=10.10.2.2 
            icmp-type=8 icmp-code=0 icmp-id=522 timeout=59m36s orig-packets=2 
            orig-bytes=168 orig-fasttrack-packets=0 orig-fasttrack-bytes=0 
            repl-packets=2 repl-bytes=168 repl-fasttrack-packets=0 
            repl-fasttrack-bytes=0 orig-rate=0bps repl-rate=0bps 

[admin@router] > /ip firewall connection print detail where connection-type=sip
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying, 
F - fasttrack, s - srcnat, d - dstnat 
 0  SAC  s  protocol=udp src-address=192.168.0.100:5060 
            dst-address=199.87.121.233:5060 
            reply-src-address=199.87.121.233:5060 
            reply-dst-address=10.10.1.2:5060 connection-type="sip" 
            timeout=59m38s orig-packets=5 orig-bytes=2 245 
            orig-fasttrack-packets=0 orig-fasttrack-bytes=0 repl-packets=3 
            repl-bytes=927 repl-fasttrack-packets=0 repl-fasttrack-bytes=0 
            orig-rate=0bps repl-rate=0bps 

[admin@router] > /tool sniffer quick port=5060                           
INTERFACE             TIME    NUM DI SRC-MAC           DST-MAC           VLAN  
ether3              12.427      1 <- B2:C2:B4:18:98:07 00:6B:0E:68:7F:02
ether2              12.427      2 -> 00:6B:0E:68:7F:01 00:6B:0E:6B:12:01
-- [Q quit|D dump|C-z pause]

I don't know if it's a bug or if this behavior really makes sense, but I guess that the Mikrotik router (when receiving the new SIP packets) should create a new SIP NAT entry with the new reply-dst-address, just as it does with the ICMP messages (because the packets are now sent through a new interface - ether2 - that has a different NAT rule).

Some questions:

- Is it a bug?
- Has anyone already seen this problem in another setup - like with another SIP helper or with normal UDP NATs?
- What is the expected behavior?
- If it's a bug, how can I inform the Mikrotik support team about it?

Known workaround:

We're now using the following workaround:

- The router checks from time to time (via a script that runs in /system scheduler) if the default gateway has changed;
- If it discovers a change, then it runs the following command:

/ip firewall connection remove [find where connection-type=sip]

After this, all SIP connections started working again.
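The logic of this workaround can be sketched as follows. This is only a model of the "poll, compare, flush" idea in Python, not the RouterOS script we run; `make_watcher`, `get_active_gateway` and `flush_sip_connections` are hypothetical names made up for the illustration.

```python
# Sketch of the polling workaround: remember the last active gateway and
# flush the SIP connections only when it changes. All names are hypothetical.

def make_watcher(get_active_gateway, flush_sip_connections):
    last = get_active_gateway()  # gateway seen when the watcher starts

    def check():
        """Run one scheduler tick; return True if a flush was triggered."""
        nonlocal last
        current = get_active_gateway()
        if current != last:
            flush_sip_connections()
            last = current
            return True
        return False

    return check


# Simulated sequence of active gateways seen on successive polls.
gateways = iter(["10.10.1.1", "10.10.1.1", "10.10.2.1", "10.10.2.1"])
flushes = []

check = make_watcher(lambda: next(gateways), lambda: flushes.append("flush"))
results = [check(), check(), check()]
print(results, flushes)
```

On the router, the first callback would correspond to reading the active default route and the second to the remove command above.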

Disabling SIP ALG is not an option:

Please, there's nothing wrong with using SIP ALG (as long as it is implemented without bugs). Actually, our case is exactly the case SIP ALG was created for.

Moreover, our server requires SIP ALG to call the SIP client when necessary.

Posted: **Thu Dec 28, 2017 2:06 am**

Unfortunately, I think this is a known issue with Mikrotik users. We are a service provider with SIP phones at our clients' locations, and if we put a backup connection at the site, the SIP connections do exactly what you're describing, and our workaround has been the same - to wipe all SIP connections out of the connections table.

I hope Mikrotik fixes this. (Or someone who knows a way to properly set this up chimes in with a real solution)

Posted: **Thu Feb 08, 2018 5:27 am**

Some new info:

An employee from my company noticed that connection-type=sip2 entries may exist (although "sip2" connections are not documented in the MikroTik wiki - only "sip" connections are - https://wiki.mikrotik.com/wiki/Manual:I ... all/Mangle or https://wiki.mikrotik.com/wiki/Manual:I ... n_tracking).

If that's true, it may be necessary to change the workaround to:

/ip firewall connection remove [find where connection-type=sip or connection-type=sip2]

I also sent an email to support@mikrotik.com today and I'm waiting for a response. I will update here when I have some news.

Posted: **Thu Feb 08, 2018 11:16 am**

It is not so much SIP-related except that it is most notable with SIP. NAT is an extension of connection tracking, and the SIP helper as well.

When a connection is established which involves a NAT, the socket quadruple is remembered, which describes the sockets of the endpoints as well as the local sockets used. A packet from an endpoint's socket to the local socket looking in that endpoint's direction is identified as part of the connection and forwarded accordingly after replacing socket information if needed.

In your case, the endpoints are the same (the VoIP provider's equipment and your CPE), but the first successful registration builds a connection record which then reuses the socket quadruple even if the actual output route changes, so the packet leaves through one interface indicating the IP address of another one as its source address.

An ICMP "connection" lasts only from the sending of the request until the response is received, so the very first ICMP request (ping) sent after the route change already leaves the Mikrotik with the correct source IP. A SIP "connection", on the other hand, lasts for the configured lifetime (1 hour by default), thanks to the SIP helper. If the registration fails, the connection is not destroyed; only its lifetime is reduced to 3 minutes, like with an ordinary UDP connection.

Assuming that the IP address of your SIP provider is not relevant to anything else but SIP connections, I would not look at "sip" connections when cleaning up the connection table after detecting a change of the active route, and would use the following instead:

/ip firewall connection remove [find where dst-address~"199.87.121.233"]

Posted: **Thu Feb 08, 2018 4:51 pm**

Just thinking out loud: are these configs using masquerade instead of src NAT? Maybe try using src NAT. It might be caused by the difference in the way they (masquerade vs src NAT) handle connections when the IP changes.

Posted: **Thu Feb 08, 2018 5:04 pm**

Nope. masquerade vs. src-nat only affects from where the new source address of the packet is taken and when, not how the already established connections are handled without the interface going down or changing address.

So with masquerade, each time the interface goes down or its IP address changes, all tracked connections are cleared and newly established connections are src-nated to the new address (there is a video on that as well, explaining why to use masquerade only where really necessary).

But if the interface does not go down or change address, and just the route in question doesn't go through it any more, as is the case which the OP describes, use of masquerade instead of src-nat doesn't change anything.

Posted: **Fri Feb 09, 2018 8:22 pm**

Hi @sindy, thanks for your explanation. You are right: the SIP problem is not a SIP problem, but a UDP NAT problem (a more general problem). It isn't even a bug: it's a UDP NAT limitation (a protocol limitation).

I made some NAT tests and understood better how NAT works in MikroTik. I'm publishing my discoveries here.

@ZeroByte, this can help you.

The general idea

The general idea behind NAT is to divide the translation workflow into two different parts: the NAT table (/ip firewall connection) and the NAT rules (/ip firewall nat).

The idea is shown in the image:

The first packet triggers the creation of the NAT table entry (since there is a NAT rule for it). From the second packet onwards, the already created NAT table entry is used (if the NAT rule is deleted at this point, for example, NAT continues to be applied, since a NAT table entry already exists).

This is what causes the NAT problem in SIP: once a NAT table entry is created, it doesn't matter if the default route changes or which NAT rule would be applied - the already created NAT table entry will ALWAYS be applied (even if it NATs to the wrong IP) until it times out.

In the simplest implementations, each entry in the NAT table will be similar to those in the next image:

There is an additional column (hidden in my image) that is the timeout (or the time of the last processed packet - the timeout can be calculated if we know when the last packet was processed).

If a new packet arrives with the same set of values (int-src-addr, int-src-port, int-dst-addr, int-dst-port and protocol), it matches this NAT table entry and the same translation is applied. If a packet arrives with the set of values (ext-dst-addr, ext-dst-port, ext-src-addr, ext-src-port and protocol), it's a reply to a natted packet, so the translation is applied in reverse.
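To make the lookup order concrete, here is a minimal Python sketch of such a table. It is a toy model, not RouterOS internals: the dictionary layout and the names `nat_table`, `active_rule_source` and `translate` are assumptions made up for the illustration; only the addresses come from the lab above.

```python
# Toy model of a NAT table keyed by the original 5-tuple.
# Names and data layout are illustrative, not RouterOS internals.

nat_table = {}  # (src, sport, dst, dport, proto) -> translated source address


def active_rule_source(out_interface):
    """Stand-in for the NAT rules: src-nat address chosen by output interface."""
    return {"ether1": "10.10.1.2", "ether2": "10.10.2.2"}[out_interface]


def translate(packet, out_interface):
    """Return the source address the packet leaves the router with."""
    key = (packet["src"], packet["sport"], packet["dst"], packet["dport"], packet["proto"])
    if key not in nat_table:
        # First packet of a flow: consult the NAT rules and remember the result.
        nat_table[key] = active_rule_source(out_interface)
    # Later packets reuse the stored entry; the rules are not consulted again.
    return nat_table[key]


sip = {"src": "192.168.0.100", "sport": 5060,
       "dst": "199.87.121.233", "dport": 5060, "proto": "udp"}

before = translate(sip, "ether1")  # default route via the first ISP
after = translate(sip, "ether2")   # route changed, same 5-tuple
print(before, after)
```

Both calls return 10.10.1.2: the second packet leaves through ether2 but still carries the address of ether1, which is exactly what the sniffer showed.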

The MikroTik implementation

MikroTik has a more efficient NAT implementation. Its NAT table looks like this one (it's a simplified schema, but it explains well what is done):

MikroTik (as well as other modern routers) uses special columns to make better NAT decisions.

Note: There is no UDP stream flag. But Mikrotik uses more general flags ('Confirmed' and 'Assured' flags, and maybe others) to check if a UDP packet belongs to a UDP flow. So, I simplified this in the image.

MikroTik also applies different timeout values for each packet type. More info can be found here, in the docs: https://wiki.mikrotik.com/wiki/Manual:I ... operties_2.

Now, I will present my test results about NAT in Mikrotik (ICMP, TCP and UDP). I used the same lab environment as in the topic description.

But first, I increased all the relevant NAT timeouts (it's easier to check the NAT table with longer timeouts):

/ip firewall connection tracking
set icmp-timeout=1h udp-stream-timeout=1h

There is no need to increase the tcp-established-timeout because it is high by default (1d).

ICMP NAT in MikroTik

The ICMP Echo Request packet has the format described here: https://en.wikipedia.org/wiki/Ping_(net ... ho_request.

It has four fields that are important in the context of the tests: Type, Code, Identifier (or ID) and Sequence.

ICMP Echo Requests use Type=8 and Code=0. When we run the 'ping' command in Linux, a new ID number is chosen (different 'ping' executions will use different IDs) and each new ping packet sent increments the sequence number. Each received ICMP Echo Reply will have the same values, except that it uses Type=0 (https://en.wikipedia.org/wiki/Ping_(net ... Echo_reply). Mikrotik uses this fact to un-nat a ping reply.

In an image:

When I ran this test, this was the NAT table entry created in my router:

[admin@router] > /ip firewall connection print detail where protocol=icmp      
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying, 
F - fasttrack, s - srcnat, d - dstnat 
 0  S C  s  protocol=icmp src-address=192.168.0.100 dst-address=199.87.121.233 
            reply-src-address=199.87.121.233 reply-dst-address=10.10.1.2 
            icmp-type=8 icmp-code=0 icmp-id=521 timeout=58m16s orig-packets=4 
            orig-bytes=336 orig-fasttrack-packets=0 orig-fasttrack-bytes=0 
            repl-packets=3 repl-bytes=252 repl-fasttrack-packets=0 
            repl-fasttrack-bytes=0 orig-rate=0bps repl-rate=0bps

My 'ping' program chose the ID 521.

When I changed my default route, my 'ping' program stopped working (it was still using the already created NAT table entry, which changed the source IP to 10.10.1.2, now a wrong value). But when I killed the program and started another 'ping', the new instance chose a different ICMP ID, so the new packets didn't match the old NAT table entry (because of the different ICMP ID) and a new entry was created:

[admin@router] > /ip firewall connection print detail where protocol=icmp      
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying, 
F - fasttrack, s - srcnat, d - dstnat 
 0  S C  s  protocol=icmp src-address=192.168.0.100 dst-address=199.87.121.233 
            reply-src-address=199.87.121.233 reply-dst-address=10.10.1.2 
            icmp-type=8 icmp-code=0 icmp-id=521 timeout=49m38s orig-packets=4 
            orig-bytes=336 orig-fasttrack-packets=0 orig-fasttrack-bytes=0 
            repl-packets=3 repl-bytes=252 repl-fasttrack-packets=0 
            repl-fasttrack-bytes=0 orig-rate=0bps repl-rate=0bps 

 1  S C  s  protocol=icmp src-address=192.168.0.100 dst-address=199.87.121.233 
            reply-src-address=199.87.121.233 reply-dst-address=10.10.2.2 
            icmp-type=8 icmp-code=0 icmp-id=522 timeout=59m36s orig-packets=2 
            orig-bytes=168 orig-fasttrack-packets=0 orig-fasttrack-bytes=0 
            repl-packets=2 repl-bytes=168 repl-fasttrack-packets=0 
            repl-fasttrack-bytes=0 orig-rate=0bps repl-rate=0bps

And abracadabra alakazam, my ping worked!
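The role of the ICMP ID can be sketched in a few lines of Python. This is a simplified model under the assumption that the lookup key contains the fields printed in the entries above (addresses, icmp-type, icmp-code, icmp-id); `icmp_key` and `nat_source` are hypothetical helper names.

```python
# Toy model: the ICMP ID is part of the lookup key, so a new 'ping' run
# (new ID) cannot match the stale entry. Names are illustrative only.

entries = {}  # lookup key -> translated source address


def icmp_key(src, dst, icmp_type, icmp_code, icmp_id):
    return ("icmp", src, dst, icmp_type, icmp_code, icmp_id)


def nat_source(key, current_wan_ip):
    """Reuse the stored translation if the key is known, else create one."""
    return entries.setdefault(key, current_wan_ip)


old_ping = icmp_key("192.168.0.100", "199.87.121.233", 8, 0, 521)
new_ping = icmp_key("192.168.0.100", "199.87.121.233", 8, 0, 522)

first = nat_source(old_ping, "10.10.1.2")    # entry created via the first ISP
stale = nat_source(old_ping, "10.10.2.2")    # route changed, same ID: old entry reused
fresh = nat_source(new_ping, "10.10.2.2")    # restarted ping, new ID: new entry
print(first, stale, fresh)
```

The second call keeps returning 10.10.1.2, while the restarted ping gets 10.10.2.2, matching the two entries in the table above.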

TCP NAT in MikroTik

TCP is a connection-oriented protocol. After doing its famous three-way-handshake (https://en.wikipedia.org/wiki/Transmiss ... ablishment), our connection will be at an established state.

I ran a test and something like this appeared in the NAT table:

[admin@router]> /ip firewall connection print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat
 0  SAC  s  protocol=tcp src-address=192.168.0.100:40000
            dst-address=172.127.30.14:443 reply-src-address=172.127.30.14:443
            reply-dst-address=10.10.1.2:40000 tcp-state=established
            timeout=23h59m51s orig-packets=3 orig-bytes=164
            orig-fasttrack-packets=0 orig-fasttrack-bytes=0 repl-packets=2
            repl-bytes=185 repl-fasttrack-packets=0 repl-fasttrack-bytes=0
            orig-rate=0bps repl-rate=0bps

Note that the tcp-state is 'established' and that the timeout value is close to one day!

If we close the connection in the correct way (https://en.wikipedia.org/wiki/Transmiss ... ermination) by sending a FIN packet, the connection will be closed on both hosts and Mikrotik will change the entry in the NAT table to the time-wait state:

[admin@router]> /ip firewall connection print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat
 0  SAC  s  protocol=tcp src-address=192.168.0.100:40000
            dst-address=172.127.30.14:443 reply-src-address=172.127.30.14:443
            reply-dst-address=10.10.1.2:40000 tcp-state=time-wait timeout=5s
            orig-packets=10 orig-bytes=536 orig-fasttrack-packets=0
            orig-fasttrack-bytes=0 repl-packets=8 repl-bytes=611
            repl-fasttrack-packets=0 repl-fasttrack-bytes=0 orig-rate=0bps
            repl-rate=0bps

This is an intermediate state before removing the NAT entry. It's important to avoid problems with ACK retransmissions after the FIN is sent (https://networkengineering.stackexchang ... 9718/17394).

After a few seconds, the TCP NAT entry will be entirely removed.

[admin@router]> /ip firewall connection print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat

It's interesting to see that due to its connection oriented nature, it's possible to remove a NAT entry without waiting a long time for a timeout expiration.

So, I made the following test: I started a TCP connection and, in the Mikrotik router, I changed the default route. At this point, the communication between the client and the server went down. After some time, the client connection timed out. At that moment, this was the state of the NAT table:

[admin@router] /ip firewall connection> print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat
 0  SAC  s  protocol=tcp src-address=192.168.0.100:40000
            dst-address=172.127.30.14:443 reply-src-address=172.127.30.14:443
            reply-dst-address=10.10.1.2:40000 tcp-state=fin-wait timeout=2s
            orig-packets=12 orig-bytes=646 orig-fasttrack-packets=0
            orig-fasttrack-bytes=0 repl-packets=4 repl-bytes=315
            repl-fasttrack-packets=0 repl-fasttrack-bytes=0 orig-rate=0bps
            repl-rate=0bps

The 'fin-wait' state means that the client sent a FIN to close the connection (once it timed out). However, the server did not respond (as at this time we're applying the wrong NAT table entry - which uses the wrong IP).

The client also keeps its state in 'fin-wait':

root@sip-client:~# netstat -nt
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0     12 192.168.0.100:40000    172.127.30.14:443       FIN_WAIT1

And just after a few seconds (and some more state changes), the NAT table is cleared:

[admin@router]> /ip firewall connection print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat

So, now, it's possible to start a new TCP connection with the server (even using the same src-port). Note: To always use the same source port, I used the command nc -p 40000 172.127.30.14 443.

So, what happens if we try to reconnect before the 'fin-wait' entry has timed out (in other words, what happens if a new SYN is sent)?

I did this test and got a great surprise: Mikrotik did not apply the old NAT table entry (which would NAT to the wrong IP). Because the new SYN packet is not related to the previous connection (even if it has the same src-addr, src-port, dst-addr, dst-port and protocol), the router knows it belongs to a new connection. So, the router passes the new SYN packet through the NAT rules and replaces the old NAT table entry with a new one that uses the correct IP.

Before:

[admin@router] /ip firewall connection> print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat
 0  SAC  s  protocol=tcp src-address=192.168.0.100:40000
            dst-address=172.127.30.14:443 reply-src-address=172.127.30.14:443
            reply-dst-address=10.10.1.2:40000 tcp-state=fin-wait
            timeout=59m56s orig-packets=11 orig-bytes=580 
            orig-fasttrack-packets=0 orig-fasttrack-bytes=0 repl-packets=2 
            repl-bytes=185 repl-fasttrack-packets=0 repl-fasttrack-bytes=0 
            orig-rate=0bps repl-rate=0bps

And after:

[admin@router] /ip firewall connection> print detail where protocol=tcp
Flags: E - expected, S - seen-reply, A - assured, C - confirmed, D - dying,
F - fasttrack, s - srcnat, d - dstnat
 0  SAC  s  protocol=tcp src-address=192.168.0.100:40000
            dst-address=172.127.30.14:443 reply-src-address=172.127.30.14:443
            reply-dst-address=10.10.2.2:40000 tcp-state=established
            timeout=23h59m55s orig-packets=3 orig-bytes=164 
            orig-fasttrack-packets=0 orig-fasttrack-bytes=0 repl-packets=2 
            repl-bytes=185 repl-fasttrack-packets=0 repl-fasttrack-bytes=0 
            orig-rate=0bps repl-rate=0bps

Note that reply-dst-address was updated to the new IP.
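Here is a small Python sketch of the behaviour I observed. It only models what the outputs above show (a SYN arriving on an entry in 'fin-wait' replaces it); treating 'time-wait' the same way, and the names `CLOSING_STATES` and `handle_tcp`, are my assumptions and not documented RouterOS behaviour.

```python
# Toy model: a SYN that hits an entry in a closing state replaces the entry,
# so the NAT rules are evaluated again. States and names are illustrative.

CLOSING_STATES = {"fin-wait", "time-wait"}  # assumption: only fin-wait was tested

table = {}  # 5-tuple -> {"nat_src": ..., "state": ...}


def handle_tcp(key, flags, current_wan_ip):
    """Return the source address used for this packet."""
    entry = table.get(key)
    if entry is not None and "SYN" in flags and entry["state"] in CLOSING_STATES:
        # New connection attempt on a dying entry: drop the old entry.
        del table[key]
        entry = None
    if entry is None:
        entry = {"nat_src": current_wan_ip, "state": "syn-sent"}
        table[key] = entry
    elif "FIN" in flags:
        entry["state"] = "fin-wait"
    return entry["nat_src"]


conn = ("192.168.0.100", 40000, "172.127.30.14", 443, "tcp")

opened = handle_tcp(conn, {"SYN"}, "10.10.1.2")          # via the first ISP
stuck = handle_tcp(conn, {"ACK"}, "10.10.2.2")           # route changed, entry reused
closing = handle_tcp(conn, {"FIN", "ACK"}, "10.10.2.2")  # client gives up
reopened = handle_tcp(conn, {"SYN"}, "10.10.2.2")        # same ports, new SYN
print(opened, stuck, closing, reopened)
```

Only the last call returns 10.10.2.2, which corresponds to the "Before" and "After" outputs above.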

UDP NAT in MikroTik

UDP is not a connection-oriented protocol (unlike TCP). So, MikroTik has no way to know what state the connection is in (or even to know that a UDP packet is trying to establish a new connection). It also has no Type/Code/ID fields (like ICMP) that would allow MikroTik to use a new NAT table entry when one of these fields changes. Mikrotik only has the five columns (src-addr, src-port, dst-addr, dst-port and protocol) to decide which NAT table entry to use.

It also has the stream state to change the UDP timeout when necessary. But once MikroTik classifies a UDP NAT table entry as a stream, it has no way to discover that the "stream connection" stopped working and the client is trying to start a new stream. It's a very limited protocol (lightweight, but limited).

I found more useful info about this in the O'Reilly book "High Performance Browser Networking" written by Ilya Grigorik: https://books.google.com.br/books?id=tf ... &q&f=false.

In my tests, I only discovered three ways to make UDP NATs work again after the default route has changed:

- Wait for the timeout to expire (when using SIP ALG, however, huge timeouts are necessary, and the server and the client frequently send keepalive messages that don't allow the NAT table entry to time out - so this cannot be applied in my case);
- Sending new UDP packets from a different source port;
- Manually removing the NAT table entries.
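The three options above can be compared with a small Python model. It is a sketch only: the 180-second timeout is an illustrative value, and `send_udp`, `flush` and `make_stale` are hypothetical helpers, not RouterOS functions.

```python
# Toy model of a UDP NAT entry whose timer is refreshed by every packet.
# Timeout value and helper names are illustrative assumptions.

UDP_TIMEOUT = 180  # seconds, illustrative

table = {}  # 5-tuple -> {"nat_src": ..., "expires": ...}
FLOW = ("192.168.0.100", 5060, "199.87.121.233", 5060, "udp")


def send_udp(key, now, current_wan_ip):
    entry = table.get(key)
    if entry is None or entry["expires"] <= now:
        entry = {"nat_src": current_wan_ip}  # evaluate the NAT rules again
        table[key] = entry
    entry["expires"] = now + UDP_TIMEOUT     # every packet refreshes the timer
    return entry["nat_src"]


def flush(predicate):
    for key in [k for k in table if predicate(k)]:
        del table[key]


def make_stale():
    """Recreate the problem: entry made via ISP 1, keepalive after the switch."""
    table.clear()
    send_udp(FLOW, 0, "10.10.1.2")
    return send_udp(FLOW, 60, "10.10.2.2")


stale = make_stale()                                      # keepalive keeps old NAT
waited = send_udp(FLOW, 60 + UDP_TIMEOUT, "10.10.2.2")    # 1: silence until expiry

make_stale()
new_port = send_udp(FLOW[:1] + (5062,) + FLOW[2:], 61, "10.10.2.2")  # 2: new source port

make_stale()
flush(lambda k: k[2] == "199.87.121.233")                 # 3: remove entries manually
flushed = send_udp(FLOW, 61, "10.10.2.2")
print(stale, waited, new_port, flushed)
```

In the model, as in my tests, the keepalive at t=60 is what prevents the first option from ever happening on its own.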

Conclusion

There is no bug in the Mikrotik SIP ALG. Actually, it's a limitation of the UDP + NAT schema. However, a workaround is important to circumvent this limitation.

Workaround

1 - Clean the SIP NAT table on default route changes

As I explained before:

- The router checks from time to time (via a script that runs in /system scheduler) if the default gateway has changed;
- If it discovers a change, then it runs the following command:

/ip firewall connection remove [find where connection-type=sip]

After this, all SIP connections will start working again.

Note: As I also explained before, a more complete command (/ip firewall connection remove [find where connection-type=sip or connection-type=sip2]) may be necessary.

2 - Change source port

The second workaround is to use a different UDP port in the client when it realizes the connection with the server is not working anymore. When trying new invites, the SIP client should use a new UDP port to force a new NAT table entry to be created.

Note that this is not the default (SIP clients use port 5060). According to the Wikipedia SIP page (https://en.wikipedia.org/wiki/Session_I ... n_Protocol):

SIP can be carried by several transport layer protocols including the Transmission Control Protocol (TCP), the User Datagram Protocol (UDP), and the Stream Control Transmission Protocol (SCTP).[13][14] SIP clients typically use TCP or UDP on port numbers 5060 or 5061 for SIP traffic to servers and other endpoints. Port 5060 is commonly used for non-encrypted signaling traffic whereas port 5061 is typically used for traffic encrypted with Transport Layer Security (TLS).

Also, changing the source port can break QoS rules or other firewall rules, not be supported by SIP clients, etc. But it may be a good solution.

3 - Change the transport protocol used by SIP

Obviously, I'm not proposing to send the VoIP stream over TCP (it will still use RTP), but only the SIP part.

As an important note, even using TCP only for SIP is strongly discouraged (https://www.onsip.com/blog/sip-via-udp-vs-tcp) and can lead to a lot of problems. Moreover, most SIP clients (and possibly servers) don't allow SIP over TCP traffic.

Posted: **Sat Feb 10, 2018 11:55 am**

Hats off to the author of this post. One of the most detailed and precise firewall/connection tracking related posts that I have seen in this forum.

Thank you for this post, which most likely will explain how NAT works to many users here. Often a situation that is perceived as a "bug" is an actual requirement or simply how things work in networking.

Countless times we have heard - reboot resolves the issue. Reboot simply clears connection tracking table.

Posted: **Wed Feb 21, 2018 7:06 pm**

This was a fantastic deep dive into the quantum mechanics of why this issue occurs. I already understood the behavior of the connection tables / NAT, etc quite well, and why the NAT rules weren't using the new WAN address on route changes, but I did learn the ultimate root cause of why UDP streams are affected while TCP sockets are not. (i.e. the ICMP identifier and TCP identifiers being present but the absence of those hooks in UDP). I just never sat and pondered the mechanics to the point where I could come to that conclusion as you did, so thanks a lot. I think your post is wiki-worthy. Kudos!

Now on to my response to this whole thing:

I can tell you that this is not something that crops up in any other NAT/firewall box I'm aware of, such as Cisco IOS, Cisco ASA, Netgear, PFSense, Sonicwall, etc etc. So the conclusion is that the connection tracking engine's architecture on RouterOS is the root cause of this. From reading your analysis and doing a little bit of thinking, it seems to me that the issue could be resolved by adding two more fields to the connection tracking table: in-interface / out-interface. Obviously the firewall filter and mangle rules are processed for packets flowing through the router even when they're part of active entries in the state tracking table. The issue here is that Mikrotik's acceleration tactic of skipping the NAT table for packets found in the connections list is at fault. So I would think that if the in/out interfaces were part of the table, then the NAT attributes of the packet would no longer match for datagrams following a different path through the router, which would require the router to evaluate the NAT table again, which would result in a new connection entry having the updated NAT results.
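To illustrate what I mean, here is a rough Python sketch of the proposal. It models the idea only, not how RouterOS (or any other vendor) actually works; `WAN_IP`, `table` and `translate` are made-up names, and the addresses are the ones from the OP's lab.

```python
# Sketch of the proposal: make the output interface part of the lookup key,
# so a path change misses the old entry and the NAT rules run again.
# Purely illustrative; not actual RouterOS behaviour.

WAN_IP = {"ether1": "10.10.1.2", "ether2": "10.10.2.2"}

table = {}  # (5-tuple + out-interface) -> translated source address


def translate(five_tuple, out_interface):
    key = five_tuple + (out_interface,)  # proposed extra match field
    return table.setdefault(key, WAN_IP[out_interface])


sip_flow = ("192.168.0.100", 5060, "199.87.121.233", 5060, "udp")

via_isp1 = translate(sip_flow, "ether1")
via_isp2 = translate(sip_flow, "ether2")  # same 5-tuple, different path
print(via_isp1, via_isp2, len(table))
```

The old entry simply ages out on its own, while the datagrams taking the new path get a new entry with the right source address.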

Obviously I'm not an engineer for Mikrotik, and there could be a bazillion reasons why this would either not fix the behavior, or would break other things, or lead to poor performance, but at the end of the day, the engine needs to be smart enough to realize that the table entries no longer apply after topology changes occur, and deal with it automatically. Having scripts that either run on scheduler or on event triggers is exactly what your post calls them: a workaround.

Mikrotik's position of "I canna change the laws of physics" doesn't fly with me. Other vendors don't have this problem. If an undesirable behavior comes from the engine, then the engine needs to be tweaked.

I would also posit that another work-around would be to use a "netmap" NAT entry which matches only SIP packets - as netmap is stateless, requiring the router to evaluate the NAT table every time for those packets, but it would be a solution not requiring scripting to patch up. I'll have to take a stab at that in my lab. However, this would only work* for a single internal host on a static internal IP, so it's not really useful in real-world deployments of desktop SIP phones.

* the reason is that you must also supply an "un-nat" rule when using netmap, so the internal IP must be known in advance for writing the inbound "un-nat" rule.

Posted: **Mon Apr 16, 2018 3:09 pm**

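Hi all, my 50 cents: a working solution.
Linux/Asterisk crontab script:

=========CUT=========
#!/usr/bin/bash
# Run from cron on the Asterisk host. If any registration is stuck in
# "Request Sent", remove the router's tracked UDP connections to port 5060
# so that the next REGISTER creates a fresh NAT entry.

stuck=$(/usr/sbin/asterisk -rx "sip show registry" | grep -c "Request Sent")
if [ "$stuck" -eq 0 ]; then
    echo "SIP OK"
else
    # Single quotes keep $i from being expanded by the local shell.
    /usr/bin/ssh username@192.168.xx.1 ':foreach i in=[/ip firewall connection find dst-address~":5060" protocol~"udp"] do={ /ip firewall connection remove $i } ; quit'
fi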
=========CUT=========

Posted: **Mon Jun 11, 2018 3:13 pm**

Strods, being aware of the explanation, can we expect an improvement to the OS engine in this regard or at least a special function that can be applied to SIP traffic?
I posted my VoIP issue on the forums the other day, and I am super happy that my great friend mozer pointed this thread out to me before I lost more sleep and made any more bizarre rules using dstnat and mangle to try to avoid this issue.

Where do we vote for the post of the decade!!!!

Posted: **Mon Jun 11, 2018 3:17 pm**

Zero Byte,
I am interested in your solution, as I have a single host on a fixed (static) private LAN IP. Do you have more info?
Remember, anybody with a VoIP modem at home is in this scenario; not so small an audience these days!!

Posted: **Mon Jun 11, 2018 3:34 pm**

Users with a single fixed IP on a single WAN line should not be affected by the above, so they do not need a fix.
The problem described above only occurs when there are multiple WAN lines and the router switches between them (e.g. due to some failover mechanism) without the client knowing about it.
There is another problem (not the one shown above and not fixed by the above suggestion) that occurs when the external IP address changes (dynamic address) without the router noticing it.
(i.e. no WAN interface down/up event)
However that is not what you have.

Posted: **Mon Jun 11, 2018 5:47 pm**

Pe1chl

Where did I state I have one WAN IP? I have two WANs and one VoIP modem on a single static private IP behind the router.

You do raise a good question. For my fiber connection, which has a dynamic IP address, how do I know for sure that the connection will still work when the IP changes? I am not at the router to see manually which gateway IP is in use in case it needs to be changed for routing.

Posted: **Tue Jun 12, 2018 2:24 am**

Users with a single fixed IP on a single WAN line should not be affected by the above, so they do not need a fix.
The problem described above only occurs when there are multiple WAN lines and the router switches between them (e.g. due to some failover mechanism) without the client knowing about it.

I have a single fixed IP on a single WAN and I've suffered from this particular bug.

Posted: **Tue Jun 12, 2018 2:34 am**

Users with a single fixed IP on a single WAN line should not be affected by the above, so they do not need a fix.
The problem described above only occurs when there are multiple WAN lines and the router switches between them (e.g. due to some failover mechanism) without the client knowing about it.
I have a single fixed IP on a single WAN and I've suffered from this particular bug.

Please elaborate

Posted: **Tue Jun 12, 2018 4:00 am**

Where did I state I have one WAN IP? I have two WANs and one VoIP modem on a single static private IP behind the router.

Hi anav,

When you have two WANs and a VoIP modem using a single private IP behind the router, it is the same case discussed in this post. In your case, your modem works as the SIP client, and your router does the SIP NAT (using the SIP ALG helper).

You do raise a good question. For my fiber connection, which has a dynamic IP address, how do I know for sure that the connection will still work when the IP changes? I am not at the router to see manually which gateway IP is in use in case it needs to be changed for routing.

However, if you have only one WAN connection, and your router receives a dynamic IP address via DHCP, there are two possibilities:

1) Using src-nat

In this case, you WILL HAVE A PROBLEM when the router's IP changes. The wrong SIP NAT entry will be applied, much as in the case initially discussed in this post, because the SIP NAT entry in the NAT table will still be using the old IP.

So, you will have to apply a complex workaround (similar to the one proposed in the post).

- Check from time to time (via a script that runs in /system scheduler) whether the IP associated with your WAN interface has changed;
- If you discover a change, then clean the SIP entries from the NAT table.

2) Using masquerade

When using masquerade, your router will automatically clear all tracked connections, including the SIP NAT entries in the NAT table, whenever the address changes. This way, new SIP NAT entries (pointing to the new IP) will be created, and your SIP connections will keep working.

As sindy already commented about masquerade:

[...] with masquerade, each time the interface goes down or its IP address changes, all tracked connections are cleared and newly established connections are src-nated to the new address (there is a video on that as well, explaining why to use masquerade only where really necessary).

So, instead of applying a complex workaround, I really recommend that you JUST USE MASQUERADE and everything will work.

Posted: **Tue Jun 12, 2018 4:20 am**

Hi rarlyson,

Thanks for a great thread by the way!!
I watched this presentation with interest, captured some Wireshark-type data, and discovered (or so it seems) that the modem is already SIP NAT aware: even before the ALG was applied at layer 7, it was clear that the modem knows its public IP.
https://mum.mikrotik.com/presentations/ ... 084451.pdf
https://www.youtube.com/watch?v=tM7wyKd ... e=youtu.be

In any case, I was testing different routers and IP routes and switching back and forth between WAN IPs, and I noted that the modem would get stuck: it and the SIP server would, for example, try to maintain connectivity over the failover WAN and not use the primary, which basically shut down operations. This happened in spite of the modem being NAT aware and despite the MikroTik layer 7 ALG.

Thus I think it falls within the scope of the issue you discussed.
By the way I have two srcnat rules and both use action=masquerade so it is no fix (yes sindy can be wrong, touch wood).

What I am most curious about is the post by ZeroByte where he said he could prevent the scenario with a netmap NAT entry, but he never came back to explain his findings from the test he said he was going to conduct. Thus far I see no easy path to a resolution....... Do you know what he was talking about, or where I can pursue this line of reasoning?? To be frank, it works 99% of the time, and I only noticed the issue due to a. changing routers from a Zyxel router to the hEX router now in place and b. of course all my testing. I was kind of happy to know that I was not crazy and that it's a legitimate issue; I had started making way-out-there dstnat rules and mangle rules to no effect.

I should also note that a MikroTik rep, strods, noted your post and the discussion, but we see no changes in place or forthcoming?? So I added it to the beta suggested issues thread...

Posted: **Tue Jun 12, 2018 1:08 pm**

By the way I have two srcnat rules and both use action=masquerade so it is no fix (yes sindy can be wrong, touch wood).

I can be wrong and when I am I have no problem to admit it, but in this particular case, you may be mixing several things together.

The original purpose of SIP registration is that the SIP end device informs the exchange about its current IP address, so that the exchange knows where to send any incoming calls. If the device is on a private address behind a public one, one problem is that the device's own (private) address in the REGISTER message is useless to the exchange. There are three ways to deal with that:

- the device itself uses STUN to determine how the NAT behaves and what the public address is (which may not be a simple task; think about load balancing on WANs with different IP addresses) and puts that public address into the REGISTER message;
- the NAT in the router uses an ALG that modifies the SIP message contents (the device puts its private address into the message and the ALG replaces it with the public IP of the WAN it uses to forward the message);
- the exchange notices that the source socket address of the incoming REGISTER message differs from the one received inside the message and remembers both (so it sends messages to the source socket of the REGISTER and uses the address received inside the REGISTER inside the messages it sends).
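
As a rough illustration of the third method (exchange-side auto-learning), the following Python sketch shows the kind of bookkeeping described above. It is a simplified assumption of how a registrar might store the two addresses, not the logic of any particular exchange; the function name and the sample addresses are hypothetical.

```python
def learn_binding(source_socket, contact_socket):
    """Toy registrar bookkeeping for exchange-side auto-learning.

    source_socket: (ip, port) the REGISTER actually arrived from.
    contact_socket: (ip, port) the device wrote inside the REGISTER.
    """
    return {
        # Packets for the device are sent to where the REGISTER came from.
        "send_to": source_socket,
        # The address the device announced is kept for use inside messages.
        "address_in_messages": contact_socket,
        # The two differ when the device sits behind a NAT without an ALG.
        "behind_nat": source_socket != contact_socket,
    }


binding = learn_binding(("203.0.113.20", 1024), ("192.168.1.50", 5060))
print(binding)
```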

Besides, some exchanges check whether outgoing calls of a particular user account come from the same socket address through which that user account has previously registered, and do not accept call initiation requests coming from any other socket.

So e.g. if the device registers for 20 minutes, and the WAN address changes before the 20 minutes expire, the device does not learn about the change, so it does not know that it has to re-register. Until the registration expires, the device cannot receive incoming calls because the exchange sends them to the old address, and it may even be unable to place calls itself because the exchange sees the call initiation request coming from a different address than the last preceding REGISTER.
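
The size of that gap is simple arithmetic. A minimal Python sketch, assuming the phone re-registers only when its timer runs out and that the next REGISTER succeeds immediately (real phones often refresh somewhat before expiry, which shortens the gap); the function name is made up for illustration:

```python
def unreachable_seconds(register_interval, change_offset):
    """Time the phone stays unreachable after a WAN address change.

    register_interval: seconds between REGISTER refreshes.
    change_offset: seconds after the last REGISTER at which the address changed.
    """
    if not 0 <= change_offset <= register_interval:
        raise ValueError("change_offset must lie within one registration interval")
    return register_interval - change_offset


# 20-minute registration, WAN address changes 2 minutes after the REGISTER:
gap = unreachable_seconds(20 * 60, 2 * 60)
print(gap / 60, "minutes without incoming calls")
```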

So the use of masquerade instead of src-nat only addresses one possible issue, which is that an already established connection on the firewall remembers the src-nat address assigned when the connection was set up - as explained earlier, use of masquerade causes all connections to be removed if the address changes, so the next REGISTER from the device establishes a new connection rather than updating the old one, and thus the current WAN address is assigned to that connection.

But the use of masquerade only affects the behaviour of the firewall itself. If the device has previously determined, using STUN, the public address it gets, there is a chance that it never updates that information and keeps using the old one in its messages (until it is restarted or at least its Ethernet connection goes down and up again). I have no idea how Mikrotik's SIP ALG handles addresses other than the devices' own in the SIP messages from devices on the LAN; maybe it only replaces the device's own address with the WAN's address and leaves any foreign addresses alone. You'd have to use packet sniffing and analyse the packets using Wireshark to find out.

What I'm trying to say is that it never came to my mind to use more than one of the three methods of dealing with the customer-side NAT simultaneously. While ALG and STUN can each coexist peacefully with exchange-side auto-learning, I'm afraid they may be mutually incompatible. As STUN is in principle less reliable than ALG, I would choose ALG out of the two. And because many ALG implementations are buggy, if the exchange supports auto-learning, the best approach is to let it deal with the customer-side NAT itself and not even use the ALG.

But even if your VoIP provider's exchange does support auto-learning, the problem of WAN address change between registrations remains. So a script taking down and up the Ethernet to which the phone is connected when the WAN address changes, or a very short DHCP lease time for the phone, may be necessary to shorten the gap. The best in my understanding would be that you could synchronize the lease times of the local DHCP server with the lease times of the DHCP client on the WAN interface, but nothing like this exists in RouterOS.

Posted: **Tue Jun 12, 2018 1:21 pm**

Hi Sindy, thanks for the explanation!
I too am very curious as to the interplay between the modem configuration which I have no control over and the ALG.
Do you know what zerobyte was alluding to with his netmap comments??

Posted: **Tue Jun 12, 2018 1:32 pm**

You cannot use netmap together with a dynamically changing WAN address as, like src-nat and unlike masquerade, it does not automatically use the address of the out-interface.
So its advantage is limited to multi-WAN scenarios where the WAN addresses are static, and if one of the WANs fails, you need to route the traffic through the other one.
But it again only helps with the behaviour of the firewall itself, not with the fact that there is a gap between the change of used public address (in this case, because the second WAN needs to be used) and the re-registration of the phone.

Posted: **Thu Jun 14, 2018 12:59 am**

Rarylson, with your intimate knowledge of the issue, maybe I can change a setting on the ATA or modem or whatever it is (obi202)........
The only settings on the Mikrotik are basically enable or disable (use or not use the media option), a time setting (not sure what it does, but it is set to 1 hr) and, I suppose, the ports being used.

I found an excellent source for what can be manipulated on the device itself.
http://www.obihai.com/docs/OBiDeviceAdminGuide.pdf
The relevant section I think starts on page 94 where it talks about sip registration.
The most relevant section for options that I may wish to change are on page 100 which shows all the parameters with text explanation on the following pages.

When and if you have the time and patience perhaps there is some gem on that page that will assist in better performance.

Posted: **Thu Jun 14, 2018 1:53 am**

Hi anav,

This is not actually a response for your problem, but it can help you.

First, the most important part of the manual (OBi Device Admin Guide) that you should read is (from pg. 95):

NAT Traversal Considerations

If the device sits behind a NAT (typically the case), it can discover the mapped external address corresponding to its local SIP contact address as seen by the server in one of the following ways:

- From the “received=” and “rport=” parameters of the VIA header of the REGISTER response sent by the server; these two parameters tells the device its mapped IP address and port number respectively. This method is used if periodic registration is enabled on the device

- From the response to a STUN binding request the device sent to a STUN server. This method is used by enabling X_KeepAliveEnable and setting the X_KeepAliveMsgType parameter to “stun”. In that case, the STUN server is taken from the X_KeepAliveServer parameter, if it is specified. Otherwise, the keep-alive messages are sent to the same server where a REGISTER request would be sent to. The latter is the most effective way of using STUN to discover the mapped external contact address

The device always uses the mapped external contact address in all outbound SIP requests instead of its local contact address if one is discovered by either method discovered above.
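
To illustrate the first method quoted above, here is a small Python sketch that reads the `received=` and `rport=` parameters from a Via header line. It is only an illustration with a made-up sample header; it is not the OBi's implementation and not a complete SIP header parser, and it assumes the header name (`Via:`) is included in the input.

```python
def mapped_address(via_header):
    """Return (ip, port) from the received= and rport= Via parameters, or None."""
    # Drop the "Via" header name; the rest holds the sent-by part and parameters.
    _, _, value = via_header.partition(":")
    params = {}
    # The first ";"-separated piece is "SIP/2.0/UDP host:port"; the rest are parameters.
    for part in value.split(";")[1:]:
        name, _, val = part.strip().partition("=")
        params[name.lower()] = val
    if not params.get("received") or not params.get("rport"):
        return None
    return params["received"], int(params["rport"])


# Made-up example of a Via header as it could appear in a REGISTER response.
sample = (
    "Via: SIP/2.0/UDP 192.168.1.50:5060;branch=z9hG4bK776asdhds;"
    "received=203.0.113.20;rport=1024"
)
print(mapped_address(sample))
```

When looking at a capture in Wireshark, these are the two values to compare against the router's current WAN address.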

So, your device doesn't need a router running SIP ALG. I think you can safely disable this feature (the SIP ALG, aka SIP helper, in Mikrotik) if your OBi device is the only SIP device behind your NAT.

Because your OBi device is the one that does the SIP translation, any SIP translation problem should be fixed in your OBi device (not on the router).

You can try this:

1) Disable the SIP ALG in your Mikrotik

2) Check that all of your WAN NAT rules use masquerade

3) Start a sniffer in Mikrotik (filter by UDP ports 5060-5064 where the src or dst address is your SIP server/proxy/registrar IP)

4) Reboot your OBi device

5) Check that your OBi device has registered on your SIP server

6) Wait until your external IP has changed, then wait a few more minutes

7) Check whether your OBi can now register on the SIP server

8) Stop the sniffer in Mikrotik

If your OBi device cannot register on your SIP server now, it may be caching and using the old external IP instead of the new one. You can check this by looking for REGISTER messages in your packet capture using Wireshark.

Posted: **Thu Jun 14, 2018 2:13 am**

Right, and if it's caching the old number, then what do we do?
Also my concern is that it sends the WAN IP in its SIP traffic, and thus the SIP server at the other end expects that IP address as well, so that ends up being a problem too.......

Posted: **Thu Jun 14, 2018 4:57 am**

It's hard for me to suggest a fix for your problem without understanding what is going on behind the scenes. If I suggested some action, it would only be a weak guess.

I think it's better for you to discover what's happening (running the tests I proposed is a good start). You can also post the packet capture here, so someone in this forum can help you. Running controlled tests and analyzing the sniffed packets can help you and us to find a good solution for you.

Because you are using a vendor-specific device (an OBi device), I cannot run a lab to discover what is going on in your case. Unfortunately, you have to find out by yourself (or at least run the tests and post the packet capture here).

PS: Maybe Sindy has some better ideas about how to handle this.

Posted: **Thu Jun 14, 2018 5:32 pm**

Unfortunately Sindy has no more ideas. The key issue is that when the public address changes, the phone becomes inaccessible until it re-registers, but there is no peaceful way to force the phone to re-register before the previous registration is about to expire. Some phones do accept notifications about registration change but a) only a few models have this capability and b) you have no means to generate the notification SIP message locally in RouterOS anyway.

If you can connect the phone directly to the Mikrotik's Ethernet port, you can use the script configured at the dhcp-client attached to the WAN interface to switch the port down and up again each time the address changes (or with each assignment regardless whether it differs from the previous one or not). So first try to do that manually to see whether the phone notices that and re-registers, and if it does, create the script.

The rest of what I wrote before was a summary of my personal practical experience with reliability of various methods of NAT traversal and the consequent preference of which of them to use. This is, however, somewhat orthogonal to the additional problem of changing public address.

Posted: **Thu Jun 14, 2018 8:20 pm**

If you can connect the phone directly to the Mikrotik's Ethernet port, you can use the script configured at the dhcp-client attached to the WAN interface to switch the port down and up again each time the address changes (or with each assignment regardless whether it differs from the previous one or not). So first try to do that manually to see whether the phone notices that and re-registers, and if it does, create the script.

Hi anav,

This solution may work for you! You can test it by unplugging the LAN cable from your OBi device and reconnecting it a few minutes later.

However, I think that before applying it, it's also good to discover what is happening with your device behind the scenes. If you don't, you can lose a lot of time trying workarounds that won't solve your problem.

Posted: **Thu Jun 14, 2018 11:08 pm**

Thanks both, that's how I solve it manually: I unplug power to the modem for about 5 minutes or so, plug it back in, and the modem tends to sync to the actual primary router.
I am not sure if unplugging Ethernet for five minutes would have the same effect?
Yes, testing with Wireshark captures would be best:
a. when it connects from startup (power-up);
b. when I manually change the primary IP, to capture what happens from the modem.

A. would be a capture of the normal process.
B. would be to see how it deals (or doesn't) with a sudden loss of the WAN IP (switching to a secondary WAN IP).

Posted: **Fri Jun 15, 2018 12:40 am**

A brief shutdown of the port (for a second or less) should be enough to make the phone re-register. If taking the port down for 5 minutes would be necessary, it would mean that it would be inaccessible for those 5 minutes, so in such case, configuring it to register every 5 minutes would be a better approach (provided that the VoIP provider wouldn't force a longer time which SIP allows him to do).

But as @rarylson says, do have a look at what actually happens if you take WAN1 down and the phone re-registers via WAN2, it can be that the exchange doesn't accept the re-register from a new address while the registration from the old one is still valid - in such case, even forcing the phone to re-register would not be enough.

And I've forgotten that your issue is not the DHCP on WAN1 changing the address but WAN1 going down and a failover to WAN2, so binding the Ethernet interface shutdown to the address update by the dhcp client on WAN1 would make no sense, you would instead have to monitor WAN1's availability state and shutdown the phone-facing Ethernet interface whenever WAN1's availability state would change.

Posted: **Fri Jun 15, 2018 1:24 am**

There are two real-world instances which I do not regularly detect.

One, if the ISP is actually unavailable (rare).
Two, if the ISP changes my public WAN IP (probably more often, but I am not able to discern when this happens).
My concern in the latter case is the gateway, which I HAD TO enter manually into my routing rules; as long as the ISP doesn't change the gateway (and just changes the WAN IP), I think I am okay.

I stumbled across an inoperative VoIP while setting up the hEX routers and playing with routing.
I could switch ISPs and the VoIP phone would not switch to the new ISP AT ALL. It seemed stuck, though I don't think I waited as long as an hour.......
In this case it must be true that the SIP ALG had no effect, because I'm assuming the SIP ALG would know which one of my WANs is the active primary (in a primary and failover setup).

It all goes back to my ATA modem settings on page 99. What parameter or set of parameters would better handle an ISP changeover?

Posted: **Sat Jun 16, 2018 4:07 pm**

MikroTik RouterOS provides a tool called Netwatch. With Netwatch you can monitor your ISP connections, and if it senses a change Netwatch can issue the following directive, which may force the OBi to re-register:

/ip firewall connection remove [find where connection-type=sip or connection-type=sip2]

I have no idea what sort of performance hit Netwatch causes, but it seems like the logical tool to use for your issue.
I have no experience using Netwatch, so I do not know whether this would actually work for you.

You can test this via Terminal. First print the connection entries for SIP:
/ip firewall connection print where connection-type=sip or connection-type=sip2
after which you can issue the remove directive and see in real time how the modem responds.

Posted: **Sat Jun 16, 2018 5:41 pm**

if it senses a change Netwtch can issue the following directive that forces the Obi to re-register:
/ip firewall connection remove where connection-type=sip or connection-type=sip2

@moserd, how exactly does the above force the OBi (or any other SIP CPE) to re-register? What that script command definitely does is that it removes the existing "connection" in the firewall, so the next (re-)register creates a new connection whose src-nat address will be the one of the active WAN, but unless I miss something fundamental, it does not make the phone re-register immediately. So until the phone itself decides to renew the registration, it will be inaccessible for incoming calls and, depending on the VoIP provider, may also be unable to call out.

Posted: **Sat Jun 16, 2018 10:26 pm**

There is only so much that can be done. When you have an unreliable or non-cooperating ISP you cannot have reliable SIP service.
When you really cannot switch to a more reasonable ISP, at least avoid the address change by setting up some virtual server (e.g. with CHR)
and route your local network to there without NAT, then do the NAT on that server. Your path to the CHR can change to another ISP
and the external address of your connection can change without affecting the SIP state.

Posted: **Sat Jun 16, 2018 11:25 pm**

rarylson's is indeed a fantastic post and description of how UDP NAT works. However, it does not explain all of the failure states with this that I have encountered.

(Besides, as ZeroByte rightly counters, ROS is so far the only routing platform that I have encountered with this issue.)

There is a scenario I have run into where stale connection-tracking entries prevent SIP phones from re-registering, which to this day I do not understand:

1. Customer has a single WAN / internet connection.
2. WAN uses PPPoE.
3. Customer is assigned a STATIC ADDRESS on the PPPoE interface.
4. PPPoE session bounces for whatever reason.
5. After PPPoE session comes back up WITH THE SAME IP ADDRESS, SIP phones cannot reregister until I clear out connection-tracking entries (or reboot).

This is admittedly quite different than the dual-WAN scenario being discussed in this thread, but I see it as being related given how similar the symptoms are.

The running theory I've had is that I must have some NAT rule that is causing traffic from the SIP phone to be NATted out an interface OTHER than the PPPoE one, and the phone happens to send out a SIP keep-alive-esque message while the PPPoE interface happens to be down, which generates a connection-tracking table entry pointed out the wrong interface, which continues to be followed even after the PPPoE interface comes back up until I clear out the conntrack table.

But as far as I can tell, this isn't happening. For one, the only NAT rules that could be matching are masquerade and those specifically match on out-interface=PPPoE, and even if I wasn't matching on out-interface, there is no other entry in the routing table that the SIP traffic could be conceivably following while the PPPoE connection is down.

Also, from what I can recall whenever I have taken a second to look at this, the conntrack entry *looks* fine, from a src- and dst-address perspective. Unfortunately, I have yet to actually be able to reproduce this in a lab setting (I've tried repeatedly to artificially bump the PPPoE connection or block transmissions from occurring between PPPoE client and server, and it recovers fine every time...OF COURSE), and whenever it has occurred "in the wild" with a paying customer, it's usually a situation where they need to get their phones back up pronto and there is never enough time to sit down and dissect the thing before we are forced to just get things back up and running for them.

One oddball thing that I HAVE seen, and which I have also seen others mention in other threads, is that typically when this problem is occurring, we will see the connection-tracking rule that is the culprit have its timeout value count down to 0 from the specified UDP Stream Timeout...AND THEN PROMPTLY STARTS COUNTING UP FROM ZERO. At that point the rule appears to be "stuck" and will never clear out of the table on its own. It has to be purged manually.

Super aggravating.

-- Nathan

Posted: **Sun Jun 17, 2018 12:19 am**

if it senses a change Netwtch can issue the following directive that forces the Obi to re-register:
/ip firewall connection remove where connection-type=sip or connection-type=sip2
@moserd, how exactly does the above force the OBi (or any other SIP CPE) to re-register? What that script command definitely does is that it removes the existing "connection" in the firewall, so the next (re-)register creates a new connection whose src-nat address will be the one of the active WAN, but unless I miss something fundamental, it does not make the phone re-register immediately. So until the phone itself decides to renew the registration, it will be inaccessible for incoming calls and, depending on the VoIP provider, may also be unable to call out.

@sindy, I was under the false impression that the OBi device polls the router every 6 seconds, so I was speculating that once the connection was removed the OBi would be forced to re-establish the connection .... apparently my speculation is not correct, as I just now tested it on my obi202..... the OBi polls the ISP every 6 seconds, not the router.

The OBi does have a configuration setting, Service Providers ITSP A -> SIP -> TimerB, that may be just what Anav is looking for and that may force the OBi to re-register on failover. This seems to work for someone else, based on the following thread: https://www.obitalk.com/forum/index.php?topic=12723.0

@Anav, this is the setting you would need to change if you wanted to experiment.
The default setting for TimerB is 32000. To modify that number you first have to uncheck the parameter, THEN enter the value 10000 [do NOT recheck the box], after which you will need to commit [save].

Posted: **Sun Jun 17, 2018 1:38 am**

There is a scenario I have run into where stale connection-tracking entries prevent SIP phones from re-registering, which to this day I do not understand:

1. Customer has a single WAN / internet connection.
2. WAN uses PPPoE.
3. Customer is assigned a STATIC ADDRESS on the PPPoE interface.
4. PPPoE session bounces for whatever reason.
5. After PPPoE session comes back up WITH THE SAME IP ADDRESS, SIP phones cannot reregister until I clear out connection-tracking entries (or reboot).

This is exactly what I have seen on my router.

One oddball thing that I HAVE seen, and which I have also seen others mention in other threads, is that typically when this problem is occurring, we will see the connection-tracking rule that is the culprit have its timeout value count down to 0 from the specified UDP Stream Timeout...AND THEN PROMPTLY STARTS COUNTING UP FROM ZERO. At that point the rule appears to be "stuck" and will never clear out of the table on its own. It has to be purged manually.

And this.

Posted: **Sun Jun 17, 2018 3:36 am**

Thanks Mozerd, although the thread seems to indicate at the end the issue was solved with simply checking an option box.

Re: Successful implementaion with Obi devices?
« Reply #1 on: May 29, 2017, 01:34:00 pm »
When I was testing there was a 32 second delay before failover occurred, so you might not be waiting long enough. This delay is controlled by TimerB and the default is 32 seconds. I set mine to 10 seconds. Don't set it too low because you will get false failovers. You can set NoRegNoCall to get instant failovers when the trunk isn't registered. TimerB will catch any non-register problems.

Voice Services -> SP1 Service -> X_NoRegNoCall: checked
Service Providers ITSP A -> SIP -> TimerB: 10000

Thanks. After I checked Voice Services -> SP1 Service -> X_NoRegNoCall, it works.

Posted: **Sun Jun 17, 2018 10:11 am**

@mozerd,
I didn't realize it was possible to make the OBi periodically test the availability of the exchange. I know that many CPEs can be configured to refresh the firewall by sending a non-call-establishing SIP request to the exchange frequently, but I haven't met any yet which would analyse the response. And as a public VoIP provider I would be quite unhappy about each CPE sending test SIP requests every couple of seconds as it multiplies the load on the exchange which has to provide a meaningful response, so I prefer to bother the CPEs from the exchange side instead as it takes much less CPU on the exchange, with periodicity which is sufficient to keep open the pinholes of firewalls not aware of SIP (so twice per minute is usually enough even with occasional packet loss). Be aware that in today's world full of DoS attacks, many VoIP providers track the number of requests per time interval coming from a remote socket and ban that socket for a while if a threshold is exceeded.

In high availability setups in enterprise networks the situation is different and checking exchange's availability by test SIP requests is a common approach because the counts of clients are by several orders of magnitude lower than in case of VoIP services for common public.

If the periodical testing of exchange's availability works with your particular VoIP provider, clearing the connection in Mikrotik's firewall allows the test REGISTER (if it is a REGISTER and not some other SIP request like OPTIONS or PING) get through with the current value of WAN IP address, but it remains a question what exactly happens next - it depends on whether OBi checks the received parameter in the responses to these test messages and how it reacts to a change detected, whether the exchange answers registration state queries coming from other source socket than the one of the active registration...

@anav,
do try using /tool sniffer to see what exactly happens if you simulate WAN1 failure (changing the gateway IP should be sufficient to let the failover control kick in) and remove the connection from the table manually using the command @mozerd suggested. If possible, change the SIP password at provider end and in OBi settings before capturing, and then change it back to the one you normally use, so that you could publish the capture without compromising your SIP account.

I haven't had the OBi on my table, but X_NoRegNoCall seems to be the way to "warn" the OBI that it makes no sense to attempt outgoing calls if no successful registration is in place (the IMS behaviour of the exchange as mentioned earlier). Timer B is normally the timeout of giving up on an outgoing call attempt, and the whole context on the link above is a failover between VoIP providers controlled by the OBi, which is only meaningful for outgoing calls; for incoming calls, it depends on the VoIP provider if they offer two service points which can back up each other also for incoming calls to the clients, which would then have to register to both and expect calls to come via any of them.

@NathanA,
your scenario really smells like a RouterOS bug, so support@mikrotik.com. If counting the connection timeout up actually has some special meaning, it should be stated in the documentation, which is not the case.

Posted: **Sun Jun 17, 2018 12:44 pm**

@NathanA,
your scenario really smells like a RouterOS bug, so support@mikrotik.com. If counting the connection timeout up actually has some special meaning, it should be stated in the documentation, which is not the case.

`
Agreed, it smells like a bug. But I'm not the first one to encounter it and it has already been discussed ad nauseam on these forums for years, so it is hard for me to believe that MT support & engineers are not already aware of it. I can also tell you from past experience that reporting it via support@ is a waste of time until I can find a way to reliably reproduce it (which I totally understand -- how are they going to fix something that they themselves cannot verify? -- but is still a roadblock).

I also have a feeling this is a Linux bug at the core, not a ROS bug in particular. It would be nice if MT could implement a fix in current RouterOS, but what if it is already fixed in later Linux kernels? Chances are good that we will just get the response "you will have to wait for RouterOS 7" as so many other people have received when it comes to other core engineering/structural defects. Unfortunately, ROS is not going to be off of Linux 3.3.5 (+ custom patches) until RouterOS 7, and at this point who knows when (or if) that will ever happen...

-- Nathan

Posted: **Sat Sep 22, 2018 8:26 am**

Excuse me,
Could someone teach me
how to write the scheduler script that detects when the default gateway has changed?
I have multiple ISPs: four gateways with distances 1, 2, 3 and 4.

"The router checks from time to time (via a script that runs in /system scheduler) if the default gateway has changed": HOW?

Posted: **Sat Sep 22, 2018 2:12 pm**

To check for a change, you have to remember the previous value in a global variable. To check what is the IP of the gateway of the currently active default route, you can use the following:

local currentGw [ip route get [find active=yes 0.0.0.0/0 in dst-address !routing-mark] gateway]

Sometimes you may want to know the gateway of the currently active route to a particular destination rather than the default one, so you might instead use the following:

local currentGw ([ip route check the.target.ip.address once as-value]->"nexthop")

Be aware that in both cases, you may end up with an empty currentGw and that the second method doesn't take into account a routing-mark eventually assigned by firewall mangle rules.

So the whole script might look as follows:

local currentGw [ip route get [find active=yes 0.0.0.0/0 in dst-address !routing-mark] gateway]
if ([len [/system script environment find name=previousGw]]=0) do={global previousGw $currentGw}
global previousGw
if ($currentGw!=$previousGw) do={
  set previousGw $currentGw
  ...put whatever the script should do when it detects a gateway change here ...
}

The first line stores the gateway of the currently used default route into a local variable currentGw. Local variables "disappear" at the end of the script run (simplified!).
The second line checks whether the global variable exists; if it doesn't because it is the first run of the script ever or after a reboot, it creates and initializes it. You can initialize it to the current value, which means that the script will never execute its "on-change" part during the first run, or you can initialize it to 0.0.0.0 which means the opposite. The choice is yours.
The third line declares the global variable in the script's own scope, because a declaration made inside the do={} block of the second line is not visible outside that block; declaring an existing global without a value keeps its current value.
The first thing the "on-change" part must do is to store the current value so that it doesn't get executed on the next run if no actual change has happened (line 5).
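
For anyone who finds the state handling easier to follow outside RouterOS, here is a minimal Python sketch of the same logic. It is only a model, not something you can run on the router: the class name `GatewayWatcher`, the `initial` parameter and the example addresses are made up for illustration, and `previous` stands in for the global variable previousGw.

```python
class GatewayWatcher:
    """Toy model of the scheduler script's state handling (not RouterOS code)."""

    def __init__(self, initial=None):
        # `previous` plays the role of the global variable previousGw.
        # None means "does not exist yet" (first run ever, or first run after a reboot).
        self.previous = initial

    def run_once(self, current_gw):
        """One scheduler run; returns True if the on-change part would execute."""
        if self.previous is None:
            # Initialise to the current value: the first run never triggers the action.
            self.previous = current_gw
        if current_gw != self.previous:
            # Store the new value first so the next run stays quiet.
            self.previous = current_gw
            return True
        return False


# Initialised to the current value: only the real change on the third run triggers.
watcher = GatewayWatcher()
runs = ["192.0.2.1", "192.0.2.1", "198.51.100.1", "198.51.100.1"]
results = [watcher.run_once(gw) for gw in runs]
print(results)

# Initialised to 0.0.0.0 instead: the very first run already triggers.
eager = GatewayWatcher(initial="0.0.0.0")
first_run = eager.run_once("192.0.2.1")
print(first_run)
```

The two watchers correspond to the two initialization choices described above; which one you want depends on whether the action should also run once after every reboot.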

Posted: **Wed Oct 03, 2018 7:09 am**

I just realized/discovered today that IF you are using a PPPoE client on WAN (as I mentioned I was in my earlier post), then detecting the change (down/up event) and automatically acting on it is super-easy, and you don't have to fire off a scheduled script every 60 seconds to do it, either. In 6.33, MikroTik added scriptable "On Down" and "On Up" events for PPP interfaces. The trick is that they aren't available in the interface config, but in the PPP profile config.

So I am going to start configuring the "default" PPP profile on my routers with the following "On Up" script:
`

:delay 5
/ip firewall connection remove [find protocol=udp and (dst-address~":5060" or dst-address~":5061")]

`
I added the 5-second delay after interface "up" because I read in some other threads that the "up" event can be fired off after successful authentication but before everything else has settled (e.g., IP address assigned, adding default route to table, etc.), so even if it's not strictly necessary, I figure waiting 5 seconds can't hurt. Though this is a personal preference, I also prefer using the dst-address matcher rather than connection-type=sip/sip2 because then this form of the command will work regardless of whether you run with the SIP ALG enabled or disabled. If your VoIP provider uses a port other than 5060/5061 then you will have to change this command as appropriate, but if that were the case you'd have to change the config of the SIP ALG anyway, too.

As I mentioned before, this "bug" is stupid-hard to reproduce -- I have tried every method I know of to artificially interrupt the PPPoE session multiple times, and can never get this bug to trigger -- so I don't yet know what the success rate of this method will be, but I guess I'll soon know...

-- Nathan

EDIT: I realized after posting this that if the "default" PPP profile is shared with something else -- like, say, a PPTP or L2TP server -- then this script might inadvertently fire off even when you don't want it to, like say if an incoming PPP connection request hits your MT's PPP server. You wouldn't want an incoming VPN connection to also wipe out the NAT entry for your SIP registration, so it's probably best to just create a new, separate PPP profile with this particular script, and assign that profile to your pppoe-out interface.

Posted: **Wed Oct 03, 2018 7:45 am**

I would use 'dst-address~":5060$"' form: it doesn't touch ports 50600-50609 and it should be a bit faster
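
A quick way to see the difference between the two patterns is to try them on a few sample values. The sketch below uses Python's `re` module purely as an illustration; RouterOS has its own regex engine, and the sample addresses are invented, but an unanchored search versus a `$`-anchored one behaves the same way for a pattern this simple.

```python
import re

# Hypothetical dst-address values as they might appear in the connection table.
entries = [
    "203.0.113.10:5060",
    "203.0.113.10:50600",
    "203.0.113.10:50609",
    "203.0.113.10:5061",
]

# Unanchored: ":5060" may be followed by more digits, so 50600-50609 match as well.
loose = [e for e in entries if re.search(r":5060", e)]

# Anchored at end of string: only port 5060 itself matches.
anchored = [e for e in entries if re.search(r":5060$", e)]

print(loose)
print(anchored)
```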

Posted: **Wed Oct 03, 2018 12:01 pm**

I would use 'dst-address~":5060$"' form: it doesn't touch ports 50600-50609 and it should be a bit faster

`
Good point. Though it apparently needs to be ":5060\$" because MikroTik CLI will try to parse $ (even when in quotes!) if it isn't escaped.

-- Nathan

Posted: **Wed Oct 03, 2018 12:55 pm**

In Terminal - yep, but when you paste the script via WinBox - nope

Posted: **Wed Oct 03, 2018 1:57 pm**

@sindy, I was under the false impression that the Obi device polls the router every 6 seconds, so I was speculating that once the connection was removed the Obi would be forced to re-establish it. Apparently my speculation is not correct: I just tested it on my Obi202, and the Obi polls the ISP every 6 seconds, not the router.

The Obi does have a configuration setting, Service Providers ITSP A -> SIP -> TimerB, that may be just what Anav is looking for and that may force the Obi to re-register on failover. This seems to work for someone else, based on the following thread: https://www.obitalk.com/forum/index.php ...

@Anav, the following graphic shows the setting you would need to change if you wanted to experiment.
The default setting for TimerB is 32000. To modify that number you first have to uncheck the parameter, THEN enter the value 10000 [do NOT recheck the box], after which you will need to commit [save].

Timer T1 and Timer B are started when an INVITE is sent and get reset when a 100 Trying response is received. If no response has arrived when T1 (500 ms) expires, the INVITE is sent again and the wait doubles to 2*T1; if there is still no response, further INVITEs follow after 4*T1, 8*T1 and so on (exponential back-off) until seven INVITEs have been sent and 32 seconds have passed, at which point Timer B fires and cancels the INVITE request. This is not special to the Obi; it is just that the Obi makes these settings configurable. I always reduce Timer B to 4 seconds and have never had a false operation.
 
The Obi can be setup to use this to direct the same Invite to a different sip server.
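
To put numbers on the back-off, here is a small Python sketch of the schedule, assuming the RFC 3261 defaults for UDP (T1 = 0.5 s, retransmission interval doubling each time, Timer B = 64*T1 = 32 s). The function name and parameters are my own; it does not model what the Obi firmware does internally. With the defaults it yields seven transmissions, which is the count RFC 3261 gives for 64*T1.

```python
def invite_send_times(t1=0.5, timer_b=32.0):
    """Times (seconds since the first INVITE) at which an unanswered INVITE is sent.

    The wait before each retransmission doubles (T1, 2*T1, 4*T1, ...);
    sending stops once Timer B has expired.
    """
    times = []
    now = 0.0
    interval = t1
    while now < timer_b:
        times.append(now)
        now += interval
        interval *= 2
    return times


default_schedule = invite_send_times()            # Timer B = 32 s (default)
ten_seconds = invite_send_times(timer_b=10.0)     # TimerB: 10000 as in the Obi thread
four_seconds = invite_send_times(timer_b=4.0)     # Timer B reduced to 4 s

print(default_schedule)
print(ten_seconds)
print(four_seconds)
```

So a Timer B of 10 s still leaves room for five attempts and 4 s for four, which gives some feel for how much packet loss a shortened timer can ride out before a false failover.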

Posted: **Wed Oct 03, 2018 2:30 pm**

@NathanA,
your scenario really smells like a RouterOS bug, so support@mikrotik.com. If counting the connection timeout up actually has some special meaning, it should be stated in the documentation, which is not the case.
`
Agreed, it smells like a bug. But I'm not the first one to encounter it and it has already been discussed ad nauseam on these forums for years, so it is hard for me to believe that MT support & engineers are not already aware of it. I can also tell you from past experience that reporting it via support@ is a waste of time until I can find a way to reliably reproduce it (which I totally understand -- how are they going to fix something that they themselves cannot verify? -- but is still a roadblock).

I also have a feeling this is a Linux bug at the core, not a ROS bug in particular. It would be nice if MT could implement a fix in current RouterOS, but what if it is already fixed in later Linux kernels? Chances are good that we will just get the response "you will have to wait for RouterOS 7" as so many other people have received when it comes to other core engineering/structural defects. Unfortunately, ROS is not going to be off of Linux 3.3.5 (+ custom patches) until RouterOS 7, and at this point who knows when (or if) that will ever happen...

-- Nathan

viewtopic.php?t=132980

Yes, this is a Linux kernel bug with UDP + PPPoE; it will be fixed in v7, which comes with a newer kernel. I'm surprised your script is working, because I could only truly reset these connections by disabling the interfaces for a couple of seconds.

My updated script as of now (only use this in SIP + PPPoE environments):

/ppp profile
set *0 on-down=":local einwahlether\r\
    \n:if ([/interface get [/interface pppoe-client get [find where disabled=no] interface] type]=\"vlan\") do={:set einwahlether ([/interface vlan get [find where vlan-id=31] interface])} else={:set einwahlether ([/interface pppoe-client get [find w\
    here disabled=no] interface])}\r\
    \n\r\
    \n:local einwahlppoe\r\
    \n:set einwahlppoe  ([/interface pppoe-client get [/interface pppoe-client find disabled=no] name])\r\
    \n\r\
    \n:if ([/interface ethernet find name=\$einwahlether disabled=no] !=\"\") do={\r\
    \n:if ([/interface pppoe-client find name=\$einwahlppoe disabled=no] !=\"\") do={\r\
    \n  /interface pppoe-client disable \$einwahlppoe\r\
    \n  /interface ethernet disable \$einwahlether\r\
    \n  delay 3\r\
    \n  /interface ethernet enable \$einwahlether\r\
    \n  delay 2\r\
    \n  /interface pppoe-client enable \$einwahlppoe\r\
    \n}\r\
    \n}\r\
    \n"


/system scheduler
add interval=5m name=activate-interfaces-every-5-min on-event=\
    "\r\
    \n/interface ethernet enable [find where disabled=yes comment!=\"disabled\"]\r\
    \n/interface pptp-client enable [find where disabled=yes comment!=\"disabled\"]\r\
    \n/interface pppoe-client enable [find where disabled=yes comment!=\"disabled\"]" policy=ftp,reboot,read,write,policy,test,password,sniff,sensitive,romon start-time=startup

To note:
On some connections we have to use a VLAN for our PPPoE connection; in this case the script looks up which ethernet interface carries the VLAN and turns it off for a couple of seconds.
The scheduler script is sometimes a lifesaver: for some reason it can happen that, after the script has run, the ethernet and the PPPoE interfaces are down after a reboot. The scheduler ensures that they get turned back on again. If there are ethernet interfaces which should stay down, set the comment "disabled" on them.

I recommend using a dedicated /ppp profile. As of now I just use different ones for VPNs etc., but for easier management it would probably be best to create a separate profile.

Posted: **Thu Oct 04, 2018 11:09 am**

Chupaka wrote:

In Terminal - yep, but when you paste the script via WinBox - nope

`
Maybe I am misunderstanding you, but what you say does not appear to be true: if I take this script WITHOUT escaping the $, add it to System > Scripts in Winbox, and then highlight it and click the "Run Script" button, it does not work. If I go in and edit the script and escape the $, and click "Run Script" again, it works perfectly.

Exactly in what context are you able to get the quoted regex to parse without escaping the EOL ($) character? I agree that it shouldn't need to be escaped, but I can't find any place where it works without being escaped, either inside or outside of Terminal.

mTwUser wrote:

Im surprised your script is working, because i could only truly reset these connections by disabling the interfaces for a couple of seconds.

`
Interesting. We are both an ISP and a VoIP provider, and have deployed hundreds of MTs to customers. So far we have never had manually removing the connection tracking entries NOT work for us, so I see no reason why the script should not work for us, either. I'm not sure what is different between the way you are configuring ROS and the way we are. I might have suggested that the SIP ALG might have additional bugs that compound the issue, and we turn it off for anyone we set up with our VoIP service, but I see from reading your post in the other thread that you turn it off as well. So I'm as surprised/puzzled as you are.

-- Nathan

Posted: **Thu Oct 04, 2018 11:33 am**

Chupaka wrote:
In Terminal - yep, but when you paste the script via WinBox - nope
`
Maybe I am misunderstanding you, but what you say does not appear to be true: if I take this script WITHOUT escaping the $, add it to System > Scripts in Winbox, and then highlight it and click the "Run Script" button, it does not work. If I go in and edit the script and escape the $, and click "Run Script" again, it works perfectly.

Exactly in what context are you able to get the quoted regex to parse without escaping the EOL ($) character? I agree that it shouldn't need to be escaped, but I can't find any place where it works without being escaped, either inside or outside of Terminal.

mTwUser wrote:
Im surprised your script is working, because i could only truly reset these connections by disabling the interfaces for a couple of seconds.
`
Interesting. We are both an ISP and a VoIP provider, and have deployed hundreds of MTs to customers. So far we have never had manually removing the connection tracking entries NOT work for us, so I see no reason why the script should not work for us, either. I'm not sure what is different between the way you are configuring ROS and the way we are. I might have suggested that the SIP ALG might have additional bugs that compound the issue, and we turn it off for anyone we set up with our VoIP service, but I see from reading your post in the other thread that you turn it off as well. So I'm as surprised/puzzled as you are.

-- Nathan

ISP + VoIP provider here too :'). Though I have to say, the issue doesn't always come up; for us it seems to happen only after a very brief PPPoE disconnection (really quick ones, which have been occurring since our country changed how DSL is handled). This bug is truly in the Linux kernel, as other firewalls faced the same issue back then (Sophos, for example). I think your disconnections are long enough, and that's why the issue doesn't happen, but if you ever encounter it, make sure to use the script.

Posted: **Thu Oct 04, 2018 12:02 pm**

Though i have to say, the issue doesn't always come up, what seems to happen (for us) is if there is only a brief disconnection of PPoE (like really quickly, [...]). I think your disconnections are long enough, thats why the issue doesn't happen, [...]

`
No, ours tend to happen after brief disconnections, too. It's rare, but we have even seen it happen *immediately* after a reboot ('tik boots up, PPPoE connects, SIP still stuck). And, yes, it doesn't always happen to us either...in fact it has been FRUSTRATINGLY difficult to reproduce in a lab. If I *try* to make it happen, it will never happen.

-- Nathan

Posted: **Thu Oct 04, 2018 12:22 pm**

No, ours tend to happen after brief disconnections, too. It's rare, but we have even seen it happen *immediately* after a reboot ('tik boots up, PPPoE connects, SIP still stuck). And, yes, it doesn't always happen to us either...in fact it has been FRUSTRATINGLY difficult to reproduce in a lab. If I *try* to make it happen, it will never happen.

It is always tricky with SIP. On my home router I have VDSL (via external modem and PPPoE in the MikroTik) and a WiFi link via another network as a backup.
After a power failure (happened two times this week, normally not so often) the backup link usually is up quicker than the VDSL and the phone sets up a SIP connection via the backup link, which works.
Then some seconds later the VDSL comes up, the route switches over to that line, and the NAT entry is invalid and has to be removed to get it working again.
It would be nice if NAT entries in general (and for SIP in particular) were more aware of interfaces going down/up, routing changes, etc.
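
To make the stale-entry effect concrete, here is a toy Python model of it. This is not RouterOS or netfilter code: the class `ToyNat`, the interface names and all addresses are invented, and the only assumption it encodes is the one discussed in this thread, namely that the source translation is chosen when the connection entry is created and is then reused for every later packet of the same flow, whatever the routing table says by then.

```python
class ToyNat:
    """Very simplified src-nat with a connection table (illustration only)."""

    def __init__(self, wan_addresses):
        self.wan_addresses = wan_addresses   # interface name -> public address
        self.connections = {}                # flow -> translated source address

    def send(self, flow, active_wan):
        """Return (outgoing interface, source address seen on the wire)."""
        if flow not in self.connections:
            # The NAT rule is evaluated only for the first packet of a flow.
            self.connections[flow] = self.wan_addresses[active_wan]
        return active_wan, self.connections[flow]

    def remove(self, flow):
        """Equivalent of manually deleting the connection-tracking entry."""
        self.connections.pop(flow, None)


nat = ToyNat({"vdsl": "198.51.100.2", "backup": "203.0.113.2"})
sip_flow = ("udp", "192.168.88.10:5060", "192.0.2.50:5060")

after_powerfail = nat.send(sip_flow, active_wan="backup")  # backup link is up first
after_switch = nat.send(sip_flow, active_wan="vdsl")       # route moved, entry did not
nat.remove(sip_flow)
after_cleanup = nat.send(sip_flow, active_wan="vdsl")      # fresh entry, correct address

print(after_powerfail)
print(after_switch)
print(after_cleanup)
```

The middle result is the broken state: the packet leaves via the VDSL line but still carries the backup link's address, so replies never come back until the entry is removed.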

Posted: **Thu Oct 04, 2018 1:07 pm**

Maybe I am misunderstanding you, but what you say does not appear to be true: if I take this script WITHOUT escaping the $, add it to System > Scripts in Winbox, and then highlight it and click the "Run Script" button, it does not work. If I go in and edit the script and escape the $, and click "Run Script" again, it works perfectly.

Yeah, correct, I was confused by /sys script export - there was that slash, sorry

Posted: **Thu Oct 04, 2018 1:10 pm**

It would be nice when NAT entries in general (and for SIP in particular) were more aware of interfaces going down/up, routing to change, etc.

`
Your version of the issue sounds like the "dual WAN" scenario, and the previous explanations for the underlying cause of that variant of the problem make sense to me.

In our case, though, it is single WAN. It's just that the WAN so happens to be PPPoE, and for some reason that makes a difference.

When PPPoE is down, there is no match in the routing table for the SIP traffic to follow, so one would assume the SIP client gets back ICMP "destination net unreachable" by the router when in that state. Also, whenever I have observed it while trying to reproduce the problem, if I have the SIP client try to make a connection while PPPoE is down, no NAT entry is created (as I would expect/hope). If I force the SIP client to (re)attempt registration while PPPoE is down and then let PPPoE come up just after it has tried, the problem still doesn't seem to happen.

And in fact, to try to reproduce the problem, I have gone as far as bringing PPPoE up and then letting the SIP client register, which is a success and a valid NAT entry is created, and after that I artificially cause the PPPoE to disconnect (e.g. put a switch in between MT and DSL modem, and unplug the DSL modem from switch...this way MT ethernet interface state doesn't change). If I watch the NAT table when the PPPoE connection times out and drops, THE NAT TABLE IS CORRECTLY CLEARED OF ALL ENTRIES RELATED TO WAN, INCLUDING THE SIP ONE! And so when I allow PPPoE to reconnect, everything works just fine!

So I have absolutely no idea what is happening differently when the problem occurs when I'm not watching it or trying to reproduce it. There is clearly a timing element involved, though. It's just not clear to me what exactly is deadlocking.

In fact, it doesn't even seem to have anything to do with entries in the routing table coming and going. I have one router that is configured so that the PPPoE client interface has "add-default-route=no", and then I added a static "dst-address=0.0.0.0/0 gateway=pppoe-out1", and the problem still happens!!!

-- Nathan

Posted: **Thu Oct 04, 2018 2:08 pm**

From your description maybe it works fine when the NAT entry has to handle some traffic during the time the connection is down, but it does not work when the interruption is so short that there is no traffic in that time interval?
NAT always remains tricky. Fortunately most of the use I have for MikroTik routers does not involve NAT at all. But this (my home phone via internet) is an exception, as unfortunately my phone has issues with IPv6 so that cannot be used as the easy solution (apart from the fact that MikroTik support for advanced routing in IPv6 is so bad).

Posted: **Sun Jan 06, 2019 6:24 pm**

As far as I know, the SIP server has a limited number of connections on the account. If the limit is exceeded, the SIP server will block the connection. By default each SIP account only has one connection. So when the SIP client's network has two WAN lines, it will create two connections, and that is why it is not re-registered. If you want it to register, you must pin the SIP client to a single WAN, or you must configure the SIP account on the server with a limit of more than 2 connections at the same time.

Posted: **Mon Jan 07, 2019 9:35 pm**

As far as I know, the SIP server has a limited number of connections on the account. If the limit is exceeded, the SIP server will block the connection.

If you encounter such behaviour, I'm afraid that VoIP operator's exchange behaves very unusually.

Typically, the number of concurrently registered CPEs on the same account is limited, but in the sense that if an extra CPE (or the same CPE from a new address) registers, the registrar replaces the oldest pending registration or, if the same CPE registers from a new address and it identifies itself using a UUID, updates the older registration of that UUID with the new address.

So in case of a single CPE per account, the newest registration normally always wins.
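
As a rough illustration of that policy, here is a Python sketch of a registrar that keeps at most N bindings per account. It is a toy model under my own assumptions (the class `ToyRegistrar`, the `max_bindings` parameter and the addresses are hypothetical); real exchanges differ in detail, and this is not taken from any particular product.

```python
from collections import OrderedDict


class ToyRegistrar:
    """Per-account binding table, oldest binding first (illustration only)."""

    def __init__(self, max_bindings=1):
        self.max_bindings = max_bindings
        self.bindings = OrderedDict()   # device key -> contact address

    def register(self, contact, uuid=None):
        """Accept a REGISTER and return the contacts that are bound afterwards."""
        # A device is recognised by its UUID if it sends one, else by its contact.
        key = uuid if uuid is not None else contact
        if key in self.bindings:
            # Known device: update its binding with the new address.
            del self.bindings[key]
        elif len(self.bindings) >= self.max_bindings:
            # Limit reached: the oldest binding is replaced, the newcomer is not refused.
            self.bindings.popitem(last=False)
        self.bindings[key] = contact
        return list(self.bindings.values())


# Single CPE per account, no UUID: registering from the new WAN address wins.
single = ToyRegistrar(max_bindings=1)
single.register("198.51.100.2:5060")
after_failover = single.register("203.0.113.2:5060")

# Two CPEs with UUIDs: the first one moving to a new address updates its own binding.
pair = ToyRegistrar(max_bindings=2)
pair.register("198.51.100.2:5060", uuid="cpe-1")
pair.register("198.51.100.2:5062", uuid="cpe-2")
after_move = pair.register("203.0.113.2:5060", uuid="cpe-1")

print(after_failover)
print(after_move)
```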

Posted: **Tue Jan 08, 2019 12:22 am**

I am not comfortable with scripts and thus resort to unplugging the modem and turning it off and on, and it works again, whenever I frig with the routers, or a power outage occurs, etc.
My cable ethernet always comes up no problem, but the issue is the VLAN fiber connection, which doesn't work with 0.0.0.0/0 when the gateway IP changes. I have to ensure (a) it gets bound and then (b) go to my recursive rules and manually change the gateway IP (then reboot the VoIP modem, LOL).

Posted: **Tue Jan 08, 2019 2:26 am**

I also have a feeling this is a Linux bug at the core, not a ROS bug in particular. It would be nice if MT could implement a fix in current RouterOS, but what if it is already fixed in later Linux kernels?
-- Nathan

Yes, Peplink has the same issue. If the network bounces and there is NAT at both ends of a SIP connection (I have a SIP peer), it will sometimes never recover. This thread makes me feel better: I was not alone.

Posted: **Tue Aug 20, 2019 10:35 am**

Has MikroTik released a patch to fix this since then? I have had this problem for a year now. I kill the NAT connections, but it doesn't always work, and it is starting to become a bigger problem.
I usually use only one WAN with PPPoE. I'm using a PPP profile script to clean the NAT table, but sometimes that doesn't seem to be enough.

Thanks.
