PortalNET
Member Candidate
Topic Author
Posts: 153
Joined: Sun Apr 02, 2017 7:24 pm

rOS v7.1xxx x86_64 bare metal TX/RX annoying drops

Thu Jun 20, 2024 4:51 pm

Hi guys

it looks like these threads either get passed by and nobody notices them.. or perhaps the development team is not testing RouterOS x86 on bare metal hardware?

we have a couple of L6 licenses running RouterOS v7.14.xx and the latest stable v7.15.x, as we upgraded to see if the issue would get resolved..


we have an annoying amount of RX errors showing on the interfaces.. after about 24h of running we get over 10000 RX drops.. TX does not seem to be affected..
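
For reference, this is where we watch those counters (the interface name is just an example):

# interface-level counters; the rx-error / rx-drop columns are the ones climbing here
/interface print stats where name=ether1
# driver-level breakdown (fcs errors, overflows, pause frames, etc.)
/interface ethernet print stats where name=ether1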

so we thought to ourselves.. well, it could be the SFP+ modules.. let's swap and try other vendors.. so we did a couple of tests..

1 - Intel X520-DA2 NIC with Intel 850nm 330m SFP+ modules.. both servers with the same NIC + module + 3m multi-mode patch cord (because the module is multi-mode 850nm), both machines connected and traffic averaging 1Gbps to 2Gbps.. after 24 hours we get 10k RX errors..

after more reading and searching we found that Intel had updated the firmware and drivers for these cards, so we upgraded the Intel cards to the latest 2023 firmware, started the machines again, and the same thing happened.. the same 10k RX errors, no TX errors, after 24h of running..


ok, now we know it's not the Intel NIC nor the drivers.. let's swap modules again, this time Huawei 3km 1310nm LR-type SFP+ modules instead, and swapped the patch cord to single-mode... again let it run for 24 hours and the same annoying RX errors appeared, around 7k to 11k, RX only...


Ok, about this time we decided to change NICs and test again, this time with Intel X710 quad-port cards in each server, and needless to say, the exact same issue, RX errors only.... so we figured it could still be the modules.. let's try a 10G DAC cable.. plugged in both servers and got the same RX errors..

but because we like spending money on L6 licenses and love x86_64 hardware, we decided to keep investing just for the sake of testing, right?


swapped out the Intel NICs.. and jumped to Mellanox MCX354 40Gbps cards... with the proper Mellanox 40Gbps modules... again let it run for 24 hours with traffic averaging 1Gbps to 2Gbps.. and the next day the result was around 10k to 11k RX errors.. somehow it's only RX errors, on both sides...


swapped NICs again, this time the Mellanox 10G ConnectX-4 models (CX2424)... needless to say, the exact same thing, RX errors only, on both sides...


ok, we started looking at other hardware tuning... both servers were running 32GB of DDR3 memory.. we added more, up to 128GB.. we only noticed some increase in bandwidth in traffic tests routing high volumes from one server to the other.. populating the memory slots on each CPU gave better results on the 40Gbps cards, running bonding on both interfaces on each server... but the RX errors kept rolling in like marshmallows...



Ok, at this point we knew it was not hardware.. we were pretty confident this could be some software issue in RouterOS x86 v7 ....

By this time we were really annoyed with the situation, having tried several brands of modules.. we even bought a pair of MikroTik 10G SFP+ modules and the same errors happened...

so we decided to make one last test... put switches between the two running x86 servers...

on our BGP / carrier-grade operator side.. they plugged our 10G link port into a Huawei S6730 switch and suddenly all RX errors stopped on our ISP link traffic.. tested with Mellanox 10G cards, tested with Intel 10G NICs, with Huawei, Intel, and MikroTik modules... we gave each around 24h of testing so we could be sure... and the TX/RX errors went to ZERO..

so we tried the very same thing on our side, from our BGP server to our PPPoE server... we put in a CRS317 we had lying around gathering dust.. plugged both x86 machines into the switch... bridging in SwOS mode.. traffic from one port to another.. let it run for another 24 hours.. and again, voila, all TX/RX errors gone...

so we thought there must be something to this...


checked RX/TX flow control.. both off on both machines, and off on the switch as well..
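
Just to be explicit about what "off on both sides" means on our end, roughly this (interface names are examples):

# disable pause frames in both directions on the NIC
/interface ethernet set ether1 rx-flow-control=off tx-flow-control=off
# confirm the setting
/interface ethernet print detail where name=ether1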

QUEUES: we went into the interface queues and changed everything from only-hardware-queue to multi-queue-ethernet-default (mq-pfifo), and even raised the queue-type queue-size to 5000 packets just to see if we could get any better results.. but the exact same thing happened.. just RX errors rolling in..
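
For reference, roughly what that experiment looked like on our side (the queue name is just what we called it):

# mq-pfifo queue type with a 5000 packet per-queue limit
/queue type add name=mq-pfifo-5000 kind=mq-pfifo mq-pfifo-limit=5000
# attach it to the interface instead of only-hardware-queue
/queue interface set ether1 queue=mq-pfifo-5000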



after we plugged in the switches.. the RX errors were gone again...


and yes.. we have played around with MTU as well.. increased it from 1500 up to 1600, and the actual-mtu changes automatically.. but the L2 MTU always stays 0.. not sure if this is normal on x86, perhaps because with different network cards and buffers some values cannot be changed..
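
For context, the MTU change itself was nothing exotic, something like this (ether1 as an example):

# raise the L3 MTU; actual-mtu follows, but l2mtu shows 0 on these x86 NICs and does not seem settable
/interface ethernet set ether1 mtu=1600
/interface print where name=ether1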


BUT now my main question is: why is this happening? And surely it isn't just me, as I have found a whole bunch of topics with the exact same issue going all the way back to RouterOS v5 and v6..

we decided to go with x86 because of the extra processing power of 22-core and 28-core Xeons with higher clocks, routing etc... and to be able to fit more 10G interfaces and more ports compared to a CCR1036 or CCR2116....

but these RX errors can be annoying.. and we do get the odd ping timeout... which should not be happening.. when we plug the 2 servers in directly in order to eliminate the switches in the middle of the network.

And below is a picture of the result print of both scenarios: the ISP link arriving from the Huawei S6730 switch... and NIC2 forwarding from the BGP server direct to the second x86 server.. a direct fiber connection showing RX errors.
[attachment: sample.png]
 
julianho
just joined
Posts: 4
Joined: Wed Jan 06, 2021 4:31 am

Re: rOS v7.1xxx x86_64 bare metal TX/RX annoying drops

Mon Jun 24, 2024 9:49 am

I have the same problem: bare metal with MikroTik installed, using 82599 NICs and two ports bonded to a Huawei 6730.
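
For reference, the bond on the MikroTik side is set up roughly like this (shown with 802.3ad/LACP as an example mode; interface names are placeholders):

/interface bonding add name=bond1 slaves=ether1,ether2 mode=802.3ad transmit-hash-policy=layer-2-and-3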

On the MikroTik, one port gets a lot of RX errors, the other port has fewer. No errors on the Huawei 6730 switch.

When I disable the port with many RX errors, all the error packets move to the other port.

In fact, I have two such network setups and they behave the same way; even the ports with the error packets are the same, one with more and one with less.

Fortunately, even though there are RX errors, ICMP packets are not lost.
 
PortalNET
Member Candidate
Topic Author
Posts: 153
Joined: Sun Apr 02, 2017 7:24 pm

Re: rOS v7.1xxx x86_64 bare metal TX/RX annoying drops

Thu Jun 27, 2024 5:38 am

julianho wrote:
I have the same problem: bare metal with MikroTik installed, using 82599 NICs and two ports bonded to a Huawei 6730.

On the MikroTik, one port gets a lot of RX errors, the other port has fewer. No errors on the Huawei 6730 switch.

When I disable the port with many RX errors, all the error packets move to the other port.

In fact, I have two such network setups and they behave the same way; even the ports with the error packets are the same, one with more and one with less.

Fortunately, even though there are RX errors, ICMP packets are not lost.


hi, try changing the queue-type on the interfaces carrying the traffic to the same config as in the picture below, reset the counters, and let it run for a few hours to check if it helps. Don't forget to set the interface-queue on the interface to only-hardware-queue, and if you still get errors with a 2000 queue-size, increase it up to 4096 and check if the errors stop.
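
Since the attached screenshot is not visible here, this is roughly the kind of change being described, as a sketch only (queue name and interface are just examples):

# zero the counters so the next test run starts clean
/interface ethernet reset-counters ether1
# example pfifo queue type with a 2000 packet limit, assigned to the interface queue
/queue type add name=pfifo-2000 kind=pfifo pfifo-limit=2000
/queue interface set ether1 queue=pfifo-2000
# if errors continue, bump the limit up to 4096
/queue type set pfifo-2000 pfifo-limit=4096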
 
julianho
just joined
Posts: 4
Joined: Wed Jan 06, 2021 4:31 am

Re: rOS v7.1xxx x86_64 bare metal TX/RX annoying drops

Thu Jun 27, 2024 9:48 am

Thanks for your suggestion.
I tried increasing the mq-pfifo limit to 4000+, but the RX errors still increase.
Before this, mq-pfifo only solved the problem of TX queue drops.
 
PortalNET
Member Candidate
Topic Author
Posts: 153
Joined: Sun Apr 02, 2017 7:24 pm

Re: rOS v7.1xxx x86_64 bare metal TX/RX annoying drops

Thu Jun 27, 2024 4:55 pm

julianho wrote:
Thanks for your suggestion.
I tried increasing the mq-pfifo limit to 4000+, but the RX errors still increase.
Before this, mq-pfifo only solved the problem of TX queue drops.
hi

in the interface-queue, try selecting the interface with the errors and change it from only-hardware-queue to

multi-queue-ethernet-default

and under queue-types select the same mq-pfifo with a 5000 packet queue-size.. and do another test.
 
jd603
Frequent Visitor
Posts: 54
Joined: Tue Dec 23, 2014 4:41 am

Re: rOS v7.1xxx x86_64 bare metal TX/RX annoying drops

Sat Sep 14, 2024 9:24 pm

This started happening to me with 7.x too. I'm starting to think it is just certain packets being sent that the driver doesn't like; I don't think it necessarily means performance is degraded. It's possible these packets always had issues, but with a new driver or a change in RouterOS we now see them displayed as RX errors.