I'm using an Ixia traffic generator, with RFC 2544 methodology, to test a CHR p-unlimited VM on a host with ample resources and X710 (i40e) NICs. The goal is to build systems that forward 64-byte packets at >1.5 Gbps unidirectional without loss, using fewer than 8 virtual CPUs on the same NUMA node. The same host, with the same VMware interface VIBs, can push well over 5 Gbps through these interfaces with applications built on DPDK, yet I'm only getting ~750 Mbps through the vSwitch in my current configuration. I can get 32 Gbps with btest to localhost. It seems the CHR does not really improve on the x86 version when run in a VMware ESXi VM.
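For context on the packet rates involved, a rough back-of-the-envelope (assuming standard line-rate accounting: a 64-byte frame plus 8 bytes of preamble and 12 bytes of IFG is 84 bytes, i.e. 672 bits, on the wire):
echo $((1500000000 / 672))   # ~2.23 Mpps needed to sustain 1.5 Gbps of 64-byte frames
echo $((750000000 / 672))    # ~1.12 Mpps is roughly what the ~750 Mbps I'm seeing works out to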
The CHR has an accept rule in the forward chain and is simply doing IP forwarding between networks on directly connected interfaces.
I suspect the vmxnet3 driver/module that MikroTik ships, or its configuration, is at fault if this level of performance is expected.
VMware ESXi is showing drops attributed to the rx ring buffer being full:
/> cat /net/portsets/vSwitch4/ports/100663335/vmxnet3/rxqueues/7/stats
stats of a vmxnet3 vNIC rx queue {
LRO pkts rx ok:0
LRO bytes rx ok:0
pkts rx ok:44959982
bytes rx ok:7988029784
unicast pkts rx ok:44959982
unicast bytes rx ok:7988029784
multicast pkts rx ok:0
multicast bytes rx ok:0
broadcast pkts rx ok:0
broadcast bytes rx ok:0
running out of buffers:5385632
pkts receive error:0
1st ring size:1024
2nd ring size:1024
# of times the 1st ring is full:5385632
# of times the 2nd ring is full:0
fail to map a rx buffer:0
request to page in a buffer:0
# of times rx queue is stopped:0
failed when copying into the guest buffer:0
# of pkts dropped due to large hdrs:0
# of pkts dropped due to max number of SG limits:0
pkts rx via data ring ok:0
bytes rx via data ring ok:0
Whether rx burst queuing is enabled:0
current backend burst queue length:0
maximum backend burst queue length so far:0
aggregate number of times packets are requeued:0
aggregate number of times packets are dropped by PktAgingList:0
# of pkts dropped due to large inner (encap) hdrs:0
number of times packets are dropped by burst queue:0
}
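The "running out of buffers" and "# of times the 1st ring is full" counters above point at the guest not draining the RX ring fast enough. Purely for illustration, on a stock Linux guest the vmxnet3 RX ring can be inspected and enlarged with ethtool (eth0 is a placeholder name; as far as I can tell CHR exposes no equivalent knob):
ethtool -g eth0                  # show current and maximum RX/TX ring sizes
ethtool -G eth0 rx 4096          # grow the 1st RX ring (4096 is, I believe, the vmxnet3 maximum)
ethtool -S eth0 | grep -i drop   # watch whether drop counters keep climbing afterwards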
It seems LRO is disabled in the vmxnet3 module parameters? I understand that LRO can sometimes cause a performance issue, but sometimes it can help.
Is it possible to get the vmxnet3 Linux module parameters exposed in the CLI, even if they are abstracted behind an API?
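For reference, on a generic Linux guest this is how I'd read whatever parameters the vmxnet3 module accepts and their current values (just to illustrate what I'd like exposed; obviously CHR gives no shell access):
modinfo -p vmxnet3                               # list the parameters the module accepts, if any
ls /sys/module/vmxnet3/parameters/ 2>/dev/null && \
  grep -H . /sys/module/vmxnet3/parameters/*     # values currently in effect, if any are exposed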
In my testing, vmxnet3 on my host has created 8 RSS/VMDq queues, yet I only see the IRQ counter increment for one of those receive queues in /system resource irq. I do see the transmit interface IRQ count increase on all queues.
Interestingly, the vsish counters above show that each RSS queue has received packets...
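For anyone wanting to reproduce that check, a quick loop from the ESXi shell (reusing the vSwitch and port ID from the output above):
for q in 0 1 2 3 4 5 6 7; do
  echo "rx queue $q:"
  vsish -e get /net/portsets/vSwitch4/ports/100663335/vmxnet3/rxqueues/$q/stats | grep 'pkts rx ok'
done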
Wouldn't a properly scaled VMDq/CPU configuration make the RPS feature redundant?
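For comparison, on a vanilla Linux guest RPS is just a software fan-out of a single RX queue across CPUs via a sysfs bitmask (eth0 and the CPU mask below are placeholders). With multi-queue RSS/VMDq actually delivering interrupts to all vCPUs, this step should be unnecessary:
echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus   # spread rx queue 0 across CPUs 0-7 in software
cat /proc/interrupts | grep -i eth0                  # verify the hardware queues' IRQs are all firing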
I'm considering reproducing this in a vanilla Linux environment for comparison. Does anyone have suggestions for keeping the comparison apples to apples?
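In case it helps, this is roughly the baseline I have in mind for the Linux side, so the only real variable is the forwarding path (interface names ens192/ens224 are placeholders; same VM sizing and the same vSwitch/port groups as the CHR):
sysctl -w net.ipv4.ip_forward=1   # plain kernel IP forwarding, no NAT
iptables -A FORWARD -j ACCEPT     # mirror the CHR's single forward-chain accept rule
ethtool -K ens192 lro off; ethtool -K ens224 lro off           # keep offloads explicit so both tests match
ethtool -G ens192 rx 4096; ethtool -G ens224 rx 4096           # same enlarged RX rings on both sides
ethtool -L ens192 combined 8; ethtool -L ens224 combined 8     # ask for 8 queues to mirror the 8 RSS/VMDq queues (if the driver honours it)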