FCoE vs iSCSI

Adventures in Fibre Channel over Ethernet (FCoE). This is a draft, so expect errors for now.

initiator: device that requires access to a remote block store over FCoE or iSCSI
target: device that provides the block storage to remote initiators

Here we assume the target has MAC address ac:de:48:23:45:67 on int0 and the initiator has 00:00:5e:00:53:00 on eth0.

As a first step, raise the MTU on the interfaces involved to the maximum they support.
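
The MTU that FCoE needs can be worked out from the frame layout. This is a minimal sketch, assuming the usual encapsulation sizes (14-byte FCoE header, 24-byte FC header, up to 2112 bytes of payload, 4-byte CRC, 4-byte FCoE trailer); the function names are made up for illustration.

```shell
# Smallest Ethernet payload (MTU) that fits a full-size FC frame.
# The sizes below are assumptions taken from the usual FCoE frame layout.
fcoe_min_mtu() {
    fcoe_hdr=14; fc_hdr=24; fc_payload=2112; fc_crc=4; fcoe_trailer=4
    echo $((fcoe_hdr + fc_hdr + fc_payload + fc_crc + fcoe_trailer))
}

# Succeeds when the given MTU value is large enough for FCoE.
mtu_ok_for_fcoe() {
    [ "$1" -ge "$(fcoe_min_mtu)" ]
}

fcoe_min_mtu    # prints 2158
```

On a real host this could be used as: mtu_ok_for_fcoe "$(cat /sys/class/net/eth0/mtu)" || echo "eth0 MTU too small for FCoE"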

It does seem legitimate to run FCoE over a brctl bridge, as it is just another set of ethertypes.

Let’s Begin

On the target, install targetcli:

apt-get install targetcli

These may not be needed for vn2vn:

apt-get install lldpad fcoe-utils

On the target, we run targetcli and export an LVM2 logical volume called fcoetest:

# targetcli
/backstores/iblock create name=fcoetest dev=/dev/lvm2_groupname/fcoetest

fcoeadm -c int0

This creates an entry in /sys/class/fc_host/.

Now we can read the port name:

cat /sys/class/fc_host/host*/port_name
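
The port name comes back in the form 0x2000acde48234567, while targetcli wants it colon-separated. A small sketch of the conversion; format_wwn and list_fc_hosts are hypothetical helper names, not part of fcoe-utils.

```shell
# Turn a sysfs-style WWN (0x2000acde48234567) into colon-separated form.
format_wwn() {
    printf '%s\n' "${1#0x}" | sed 's/../&:/g; s/:$//'
}

# Print every FC host with its formatted WWPN (prints nothing if none exist).
list_fc_hosts() {
    for f in /sys/class/fc_host/host*/port_name; do
        [ -r "$f" ] || continue
        printf '%s %s\n' "$(basename "$(dirname "$f")")" "$(format_wwn "$(cat "$f")")"
    done
}

format_wwn 0x2000acde48234567    # prints 20:00:ac:de:48:23:45:67
```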

We can now grant the initiator access to the LUN. Both WWPNs here are the interface MAC addresses with a 20:00 prefix:

/tcm_fc create 20:00:ac:de:48:23:45:67
/tcm_fc/20:00:ac:de:48:23:45:67/luns create /backstores/iblock/fcoetest
/tcm_fc/20:00:ac:de:48:23:45:67/acls create 20:00:00:00:5e:00:53:00
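
In this setup the WWPNs match the MAC addresses with 20:00 in front. A sketch of that mapping, assuming it holds on your hosts too (check port_name under /sys/class/fc_host/ rather than trusting it); mac_to_wwpn is a made-up helper name.

```shell
# Build the WWPN used above from an interface MAC address.
# Assumption: the FCoE stack prefixes the MAC with 20:00, as seen in this setup.
mac_to_wwpn() {
    printf '20:00:%s\n' "$(printf '%s' "$1" | tr 'A-F' 'a-f')"
}

mac_to_wwpn ac:de:48:23:45:67    # target:    prints 20:00:ac:de:48:23:45:67
mac_to_wwpn 00:00:5e:00:53:00    # initiator: prints 20:00:00:00:5e:00:53:00
```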

To redo this, delete the target with /tcm_fc delete 20:00:ac:de:48:23:45:67

Complete a step that installing the fcoe-utils package does not do for us ☺

echo org.open-fcoe.libhbalinux /usr/lib/libhbalinux.so.2.0.2 >> /etc/hba.conf

/> ls
o- / [...]
  o- backstores [...]
  | o- fileio [0 Storage Object]
  | o- iblock [1 Storage Object]
  | | o- fcoetest [/dev/stat/fcoetest activated]
  | o- pscsi [0 Storage Object]
  | o- rd_dr [0 Storage Object]
  | o- rd_mcp [0 Storage Object]
  o- iscsi [0 Target]
  o- loopback [0 Target]
  o- tcm_fc [1 Target]
    o- 20:00:ac:de:48:23:45:67 [enabled]
      o- acls [1 ACL]
      | o- 20:00:00:00:5e:00:53:00 [1 Mapped LUN]
      |   o- mapped_lun0 [lun0 (rw)]
      o- luns [1 LUN]
        o- lun0 [iblock/fcoetest (/dev/stat/fcoetest)]

If all is satisfactory, use /saveconfig to keep the settings

Setup success!

By default, FCoE expects an FCF (Fibre Channel Forwarder), a special kind of FCoE-aware Ethernet switch. We do not have one of those.

We want vn2vn mode, which lets us run FCoE directly between two ordinary computers with Ethernet interfaces. The only requirement is zero frame loss.

Now modprobe fcoe on both the initiator and the target, then write the interface name to create_vn2vn to start FCoE on it, or to destroy to stop it. The step is the same on both ends, although the target also needs the access control rules set up above with targetcli.

You might also run into this problem: Unknown value type 'fc_wwn'. I temporarily edited config.py and utils.py to suppress the warning and treat the type as valid.

# on the initiator: start, or later stop, FCoE on eth0
echo eth0 > /sys/module/libfcoe/parameters/create_vn2vn
echo eth0 > /sys/module/libfcoe/parameters/destroy
# on the target: start, or later stop, FCoE on int0
echo int0 > /sys/module/libfcoe/parameters/create_vn2vn
echo int0 > /sys/module/libfcoe/parameters/destroy
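
Since the same sysfs write is repeated on both machines, it can be wrapped in a helper that fails loudly when the module is not loaded. This is only a sketch: fcoe_ctl and the LIBFCOE_PARAMS override are made up for illustration, and the override exists purely so the function can be tried against a scratch directory.

```shell
# Directory holding the libfcoe module parameters; overridable for dry runs.
LIBFCOE_PARAMS=${LIBFCOE_PARAMS:-/sys/module/libfcoe/parameters}

# fcoe_ctl ACTION IFACE, where ACTION is create_vn2vn or destroy.
fcoe_ctl() {
    case "$1" in
        create_vn2vn|destroy) ;;
        *) echo "unknown action: $1" >&2; return 2 ;;
    esac
    if [ ! -w "$LIBFCOE_PARAMS/$1" ]; then
        echo "$LIBFCOE_PARAMS/$1 not writable (is fcoe loaded? are you root?)" >&2
        return 1
    fi
    # Same write as the plain echo commands above.
    echo "$2" > "$LIBFCOE_PARAMS/$1"
}
```

On the initiator this would be fcoe_ctl create_vn2vn eth0, and on the target fcoe_ctl create_vn2vn int0.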

Set permissions on the block device so it can be accessed from userspace, then read and write bytes with hexedit. The long-term intention is to put a distributed cluster filesystem on it, but the block device should be proven stable first.

setfacl -m u:user:rwX /dev/disk/by-path/pci-0000:04:00.0-fc-0x2000acde48234567-lun-0
hexedit /dev/disk/by-path/pci-0000:04:00.0-fc-0x2000acde48234567-lun-0

But it's slow. Was it because fcoeadm -l says "MaxFrameSize: 2112"? My first thought was to make this bigger, but it turns out this limit is part of the FC design. Time to set up iSCSI to compare performance.

iSCSI Setup for comparison

Again in targetcli, we create a comparable iSCSI target exporting the same LUN as for FCoE:

/iscsi create iqn.2000-01.arpa.ip6.f.e.8.0.0.0.0.0.0.0.0.0.0.0.0.0.a.e.d.e.4.8.f.f.f.e.2.3.4.5.6.7
luns/ create /backstores/iblock/fcoetest
acls/ create <InitiatorName value from /etc/iscsi/initiatorname.iscsi on the initiator>
portals/ create 2001:db8::aede:48ff:fe23:4567
set attribute authentication=0 demo_mode_write_protect=0

For iSCSI, LUN access on the target is granted to the initiator name (IQN) found in /etc/iscsi/initiatorname.iscsi on the initiator.
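
The portal address and the nibbles in the IQN appear to be built from the target's MAC address with the modified EUI-64 rule used for IPv6 interface identifiers: flip the universal/local bit of the first octet and insert ff:fe in the middle. A sketch of that derivation; mac_to_eui64 is a made-up helper name.

```shell
# Derive the IPv6 interface identifier (modified EUI-64) from a MAC address.
mac_to_eui64() {
    # Split the MAC into its six octets.
    old_ifs=$IFS; IFS=:
    set -- $1
    IFS=$old_ifs
    # Flip the universal/local bit (0x02) of the first octet.
    first=$(printf '%02x' $((0x$1 ^ 0x02)))
    # Insert ff:fe between the third and fourth octets.
    printf '%s%s:%sff:fe%s:%s%s\n' "$first" "$2" "$3" "$4" "$5" "$6"
}

mac_to_eui64 ac:de:48:23:45:67    # prints aede:48ff:fe23:4567
```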

TARGETIP=2001:db8::aede:48ff:fe23:4567
WWN=`iscsiadm -m discovery -t sendtargets -p ${TARGETIP} | cut -d" " -f2`
iscsiadm --mode node --targetname "${WWN}" --portal [${TARGETIP}]:3260 --login
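
The cut above keeps only the target name. If the portal is wanted as well, the discovery output can be split on both fields. This sketch assumes each discovery line has the form "portal,tpgt name", which is what the cut relies on too; parse_discovery is a hypothetical helper.

```shell
# Read sendtargets discovery lines on stdin; print "PORTAL IQN" per record.
parse_discovery() {
    while read -r portal_tpgt iqn; do
        # Skip blank or malformed lines that lack a target name.
        [ -n "$iqn" ] || continue
        # Drop the ",tpgt" suffix from the portal field.
        printf '%s %s\n' "${portal_tpgt%,*}" "$iqn"
    done
}
```

It would be fed from iscsiadm -m discovery -t sendtargets -p ${TARGETIP} | parse_discovery.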

Let’s race FCoE and iSCSI!

setfacl -m u:user:rwX /dev/disk/by-path/ip-2001:db8::aede:48ff:fe23:4567:3260-iscsi-iqn.2000-01.arpa.ip6.f.e.8.0.0.0.0.0.0.0.0.0.0.0.0.0.a.e.d.e.4.8.f.f.f.e.2.3.4.5.6.7-lun-0
dd if=/dev/zero of=/dev/disk/by-path/pci-0000:04:00.0-fc-0x2000acde48234567-lun-0
dd if=/dev/zero of=/dev/disk/by-path/ip-2001:db8::aede:48ff:fe23:4567:3260-iscsi-iqn.2000-01.arpa.ip6.f.e.8.0.0.0.0.0.0.0.0.0.0.0.0.0.a.e.d.e.4.8.f.f.f.e.2.3.4.5.6.7-lun-0
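
Without bs and count, dd writes 512-byte blocks until the device is full. For a like-for-like race, both devices can be given the same block size and block count. A sketch only: bench_write is a made-up helper, the defaults are arbitrary, and conv=fsync assumes GNU dd.

```shell
# bench_write TARGET [BS] [COUNT]: write COUNT blocks of BS bytes of zeros
# to TARGET and flush to the device before dd reports its timing.
bench_write() {
    dd if=/dev/zero of="$1" bs="${2:-4096}" count="${3:-16384}" conv=fsync
}
```

Run it once per device path, for example bench_write /dev/disk/by-path/pci-0000:04:00.0-fc-0x2000acde48234567-lun-0, and compare the rates dd prints.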

At this point FCoE was expected to outpace iSCSI, but it went very wrong. iSCSI led by far, and the likely cause is dropped frames, somewhere… Time to find out why.

Let’s capture!

Here is a capture filter for tcpdump or Wireshark that lets us watch iSCSI and FCoE together:

ether proto 0x8906 or tcp port 3260
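
In vn2vn mode it can also help to see the FIP frames (ethertype 0x8914) that the endpoints use to find each other. A sketch that builds the filter string with or without FIP; capture_filter is a made-up helper name.

```shell
# Build a capture filter for FCoE (0x8906), optionally FIP (0x8914), and iSCSI.
capture_filter() {
    filter='ether proto 0x8906'
    [ "${1:-}" = "with-fip" ] && filter="$filter or ether proto 0x8914"
    printf '%s or tcp port 3260\n' "$filter"
}
```

It could be passed straight to a capture, for example tcpdump -i eth0 -w fcoe-vs-iscsi.pcap "$(capture_filter with-fip)", where the output file name is arbitrary.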

Using dropwatch there were signs of trouble already:

/usr/src/linux-source-3.2/net/packet/af_packet.c: packet_rcv_spkt+de, tpacket_rcv+6a5
/usr/src/linux-source-3.2/net/core/dev.c: net_tx_action+0

Stress testing the ethernets

A packet injector is used to create a flood of frames as fast as we can.

There are published adjustments to reduce the chance that the network core drops frames:

# make the RX buffers bigger (sysctl settings)
net.core.optmem_max = 50000
net.core.netdev_max_backlog = 100000
# make the TX queue longer
ifconfig eth0 txqueuelen 10000

Let’s test!

./packETHcli -i eth0 -m 2 -d -1 -n 0 -f cec &

What, we still drop frames?!

RTL8169 Issues

It turns out that frames are dropped by the rtl8169 NIC itself. This does not show in the ifconfig output, because the NIC counts them as "rx_missed" when the driver does not drain the receive ring fast enough.

ethtool -S eth0

rx_missed ?!
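
To see whether a change helps, the counter can be read before and after a flood and the two values compared. A sketch that pulls one counter out of ethtool -S output, assuming the usual "name: value" lines; nic_stat is a made-up helper name.

```shell
# Extract the value of one counter from `ethtool -S` output read on stdin.
nic_stat() {
    awk -v name="$1" -F': *' '
        { gsub(/^[ \t]+/, "", $1) }    # strip the leading indentation
        $1 == name { print $2 }
    '
}
```

For example: before=$(ethtool -S eth0 | nic_stat rx_missed), then run the flood, read the counter again and subtract.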

Maybe the interrupt handler should do more work each time it runs. Is the default weight of 64 not big enough?

It was said that Realtek NICs can hold up to 1024 frames before they drop, so let's empty up to that many per RX interrupt:

sysctl net.core.dev_weight=1024

It turns out that to really improve rtl8169.ko we must hack the source. Users can recompile just the modules that are patched.

The secret is that _rx_interrupt does rather a lot of work by copying whole frames with memcpy. We remove that copy and instead rotate the RX ring pointers as frames are received.

Without the copy it should be faster: we preallocate MTU-sized frame buffers during driver initialisation, then swap pointers in the RX ring as frames are handed from the ring to the Linux networking stack via skb_put etc.

One side effect is that we use more precious kernel memory. Use too much and the system crashes, so we compromise on just enough to go drop-free.
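
A rough feel for the memory cost comes from multiplying the ring size by the buffer size. The numbers here are assumptions for illustration only: 1024 entries as quoted above, and 2500-byte buffers as a guess at an FCoE-sized MTU. Real buffers carry extra per-frame overhead, so treat the result as a lower bound.

```shell
# ring_bytes ENTRIES BUFSIZE: memory pinned by a fully preallocated RX ring.
ring_bytes() {
    echo $(( $1 * $2 ))
}

ring_bytes 1024 2500    # prints 2560000, roughly 2.5 MB per interface
```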

The moment we've been waiting for

With the adjusted rtl8169 driver, FCoE performance approaches that of iSCSI, and other applications such as NFS seem to benefit from fewer drops.

# iscsi
131073+0 records in
131072+0 records out
67108864 bytes (67 MB) copied, 16.4312 s, 4.1 MB/s
# fcoe
dd if=/dev/urandom of=/dev/disk/by-path/pci-0000:04:00.0-fc-0x2000acde48234567-lun-0 bs=$((1024*4))
dd: writing `/dev/disk/by-path/pci-0000:04:00.0-fc-0x2000acde48234567-lun-0': No space left on device
16385+0 records in
16384+0 records out
67108864 bytes (67 MB) copied, 7.79058 s, 8.6 MB/s
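
The figures can be checked from the byte and record counts. Doing so also shows that the two runs used different block sizes (512 bytes for iSCSI, 4096 for FCoE), so they are not strictly comparable. A sketch of the arithmetic; mb_per_s and block_size are made-up helper names.

```shell
# mb_per_s BYTES SECONDS: throughput in decimal MB/s, the unit dd reports.
mb_per_s() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

# block_size BYTES RECORDS: block size implied by dd's "records out" count.
block_size() {
    echo $(( $1 / $2 ))
}

mb_per_s 67108864 16.4312     # iSCSI: prints 4.1
mb_per_s 67108864 7.79058     # FCoE:  prints 8.6
block_size 67108864 131072    # iSCSI: prints 512
block_size 67108864 16384     # FCoE:  prints 4096
```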

GFS2 adventures

Perhaps create a loopback target?

/loopback/ create naa.2000acde48234567
/loopback/naa.2000acde48234567/luns/ create /backstores/iblock/fcoetest