rOS v7.1xxx x86_64 bare metal TX/RX annoying drops
Posted: Thu Jun 20, 2024 4:51 pm
Hi guys
It looks like these threads either go by unnoticed, or perhaps the development team is not testing RouterOS on x86 bare-metal hardware?
We have a couple of L6 licenses running on RouterOS v7.14.xx and the latest stable v7.15x; we upgraded to see whether that would resolve the issue.
We get an annoying number of errors in the interface TX/RX counters: after about 24 hours of running we see over 10,000 RX drops. TX does not seem to be affected.
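To put those numbers in perspective, here is a rough back-of-envelope sketch. The 1 Gbps average and the 1000-byte average packet size are assumptions for illustration (we did not measure packet sizes), and the function name is made up:

```python
def rx_drop_stats(drops, hours, avg_gbps, avg_packet_bytes):
    """Return (drops per second, fraction of received packets dropped)."""
    seconds = hours * 3600
    drops_per_second = drops / seconds
    # Estimated packet rate from average throughput and an ASSUMED packet size.
    packets_per_second = avg_gbps * 1e9 / 8 / avg_packet_bytes
    total_packets = packets_per_second * seconds
    return drops_per_second, drops / total_packets


# 10,000 RX drops over 24 h at an assumed 1 Gbps of 1000-byte packets.
per_second, fraction = rx_drop_stats(
    drops=10_000, hours=24, avg_gbps=1.0, avg_packet_bytes=1000
)
print(f"{per_second:.3f} drops/s, {fraction:.2e} of packets")
```

Under those assumptions that is roughly one drop every 8 to 9 seconds, or on the order of one packet in a million. Small, but it should still be zero on a direct link.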
So we thought to ourselves: it could be the SFP+ modules (GBICs). Let's swap them and try other vendors. So we ran a series of tests.
First: Intel X520-DA2 NICs with Intel SFP+ modules (850nm, 330 m models). Both servers had the same NIC, the same module, and a 3 m multi-mode patch cord (the modules are 850nm multi-mode). We connected both machines and ran an average of 1 to 2 Gbps of traffic. After 24 hours we got 10k RX errors.
After more reading and searching we found that Intel had released updated drivers and firmware, so we upgraded the Intel cards to the latest 2023 firmware/driver update and started the machines again. Same thing: about 10k RX errors and no TX errors after 24 hours of running.
OK, so now we knew it was neither the Intel NICs nor the drivers. Next came a crazy module swap: we put in Huawei 3 km SFP+ modules (1310nm, SR LR model) and changed to a single-mode patch cord. Again we let it run for 24 hours and got the same annoying result: around 7k to 11k errors, on RX only.
About this time we decided to change NICs and test again, this time with Intel X710 quad-port cards in each server. Needless to say, exactly the same issue, on RX errors only. So we thought it could still be the modules and tried a 10G DAC cable between the two servers: same RX errors.
But because we like spending money on L6 licenses and love x86_64 hardware, we decided to keep investing just for the sake of testing, right?
We swapped out the Intel NICs and jumped to Mellanox MCX354 40 Gbps cards with the proper Mellanox 40 Gbps modules. Again we let it run for 24 hours with traffic averaging 1 to 2 Gbps, and the next day the result was around 10k to 11k RX errors. Somehow it is only RX errors, on both sides.
We swapped NICs once more, this time to Mellanox 10G ConnectX-4 cards (model CX2424). Needless to say, exactly the same thing: RX errors only, on both sides.
Then we started looking at other hardware tuning. Both servers were running 32GB of DDR3 memory, so we added more, up to 128GB. The only change we noticed was some increase in throughput in bandwidth tests routing a high volume of traffic from one server to the other: populating more memory on each CPU socket gave better results on the 40 Gbps cards, running bonding across both interfaces on each server. But the RX errors kept getting thrown in like marshmallows.
At this point we knew it was not the hardware. We were pretty confident this had to be a software issue in RouterOS x86 v7.
By now we were really annoyed with the situation: we had tried several brands of modules, and even bought a pair of MikroTik 10G SFP+ modules, and the same errors occurred.
So we decided to run one last test: put switches between the two x86 servers.
On our BGP side, the carrier-grade upstream operator plugged our 10G link into a port on a Huawei S6730 switch, and suddenly all RX errors stopped on our ISP link traffic. We tested with Mellanox 10G cards and with Intel 10G NICs, using Huawei, Intel, and MikroTik modules, giving each about 24 hours so we could be sure. TX/RX errors were gone: ZERO errors.
So we tried the very same thing on our side, between our BGP server and our PPPoE server. We took a CRS317 we had lying around gathering dust, plugged both x86 machines into it, bridging traffic from one port to another in SwOS mode, and let it run for another 24 hours. And again, voilà: all TX/RX errors gone.
So we thought there must be a catch somewhere...
We checked RX/TX flow control: both off on both machines, and also off on the switch.
Queues: under the interface queues we changed everything from only-hardware-queue to multi-queue-ethernet-default (mq-pfifo), and even raised the queue type's queue size to 5000 packets just to see if we could get better results. Exactly the same thing happened: RX errors only, still coming in.
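For a sense of scale on that 5000-packet setting, a quick sketch of how much buffering it represents. This is only my own arithmetic, assuming full-size 1500-byte packets on a 10G link; the function name is just for illustration:

```python
def queue_drain_ms(queue_packets, packet_bytes, link_gbps):
    """Time in milliseconds to drain a full queue at line rate."""
    bits = queue_packets * packet_bytes * 8
    return bits / (link_gbps * 1e9) * 1000


# 5000 packets of an ASSUMED 1500 bytes each on a 10 Gbps link.
print(f"{queue_drain_ms(5000, 1500, 10):.1f} ms")
```

That works out to about 6 ms of buffering at line rate, and the larger queue made no difference to the RX errors.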
After we plugged in the switches, the RX errors were gone again.
And yes, we played around with the MTU as well: we increased it from 1500 to 1600 and the actual-mtu follows automatically, but the L2 MTU always stays at 0. Not sure whether this is normal on x86, perhaps because of the different network cards and buffers; on some it cannot be changed.
BUT now my main question is: why is this happening? And surely it isn't just me, as I have found a whole bunch of topics with exactly the same issue going all the way back to RouterOS v5 and v6.
We decided to go with x86 for the extra processing power (Xeons with 22 and 28 cores and higher clocks, for routing etc.) and to be able to fit more 10G interfaces with more ports than a CCR1036 or CCR2116.
But these RX errors are annoying, and we do get the odd ping timeout, which should not be happening, when we connect the two servers directly to eliminate the switches in the middle of the network.
Below is a picture of the printed results for both scenarios: the ISP link arriving from the Huawei S6730 switch, and NIC2 forwarding from the BGP server directly to the second x86 server over a direct fiber connection, with RX errors.