On Tue, Nov 4, 2014 at 6:34 PM, Arman Khalatyan <arm2arm@gmail.com> wrote:

It will be interesting to see your iSCSI setup with DRBD. Did you get a split-brain before the failure?
Did you check if your target went to readonly mode?
Thanks
Arman.



 I used some information from the guide below, even though it covers CentOS 5.7 with LVM on top of DRBD, while my setup runs CentOS 6.5 with DRBD on top of LVM:

http://blogs.mindspew-age.com/2012/04/05/adventures-in-high-availability-ha-iscsi-with-drbd-iscsi-and-pacemaker/

- my drbd resource definition for iSCSI HA:
[root@srvmgmt01 ~]# cat iscsiha.res
resource iscsiha {
 disk {
   disk-flushes no;
   md-flushes no;
   fencing resource-and-stonith;
 }
 device minor 2;
 disk /dev/iscsihavg/iscsihalv;
 syncer {
 rate 30M;
 verify-alg md5;
 }
 handlers {
 fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
 after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
 }
 on srvmgmt01.localdomain.local {
 address 192.168.230.51:7790;
 meta-disk internal;
 }
 on srvmgmt02.localdomain.local {
 address 192.168.230.52:7790;
 meta-disk internal;
 }
}
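
The bring-up steps for the resource are not shown above; a typical sequence for a new DRBD resource like this (assuming the LVM volume /dev/iscsihavg/iscsihalv already exists on both nodes) would be something like:

```shell
# On BOTH nodes: write the internal metadata and activate the resource
drbdadm create-md iscsiha
drbdadm up iscsiha

# On ONE node only: force it Primary to kick off the initial full sync
drbdadm primary --force iscsiha

# Watch sync progress
cat /proc/drbd
```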


- tgtd is set up to start at boot on both nodes;
the iscsi and iscsid services are configured off
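
On CentOS 6 that service setup can be done with chkconfig; a quick sketch (run on both nodes):

```shell
# tgtd must start at boot on both cluster nodes
chkconfig tgtd on
service tgtd start

# the initiator-side services must NOT run on the target nodes
chkconfig iscsi off
chkconfig iscsid off
```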

- Put the iSCSILogicalUnit and iSCSITarget agents under
 /usr/lib/ocf/resource.d/heartbeat/ on both nodes,

downloaded from here, as they are not shipped with plain CentOS:
http://linux-ha.org/doc/man-pages/re-ra-iSCSITarget.html
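
Whatever the download source, the agents just need to be dropped in place with the right names and permissions; a sketch (assuming the two agent files are in the current directory):

```shell
# Copy the downloaded agents into the OCF resource directory on BOTH nodes
# (file names must match the agent names used in the pcs commands below)
install -m 755 iSCSITarget iSCSILogicalUnit /usr/lib/ocf/resource.d/heartbeat/

# Sanity check: pcs should now list the agents
pcs resource agents ocf:heartbeat | grep -i iscsi
```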

- Below are the pcs steps used to create the group:

pcs cluster cib iscsiha_cfg

pcs -f iscsiha_cfg resource create p_drbd_iscsiha ocf:linbit:drbd drbd_resource=iscsiha \
op monitor interval="29s" role="Master" timeout="30" op monitor interval="31s" \
role="Slave" timeout="30" op start interval="0" timeout="240" op stop interval="0" timeout="100"

pcs -f iscsiha_cfg resource master ms_drbd_iscsiha p_drbd_iscsiha \
master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

pcs -f iscsiha_cfg resource create p_iscsi_store1 ocf:heartbeat:iSCSITarget \
params implementation="tgt" iqn="iqn.2014-07.local.localdomain:store1" tid="1" \
allowed_initiators="10.10.1.61 10.10.1.62 10.10.1.63" incoming_username="iscsiuser" incoming_password="iscsipwd" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60" \
op monitor interval="30" timeout="60"

pcs -f iscsiha_cfg resource create p_iscsi_store1_lun1 ocf:heartbeat:iSCSILogicalUnit \
params implementation="tgt" target_iqn="iqn.2014-07.local.localdomain:store1" lun="1" \
path="/dev/drbd/by-res/iscsiha" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60" \
op monitor interval="30" timeout="60"

pcs -f iscsiha_cfg resource create p_ip_iscsi ocf:heartbeat:IPaddr2 \
params ip="10.10.1.71" \
op start interval="0" timeout="20" \
op stop interval="0" timeout="20" \
op monitor interval="30" timeout="20"

pcs -f iscsiha_cfg resource create p_portblock-store1-block ocf:heartbeat:portblock \
params ip="10.10.1.71" portno="3260" protocol="tcp" action="block"

pcs -f iscsiha_cfg resource create p_portblock-store1-unblock ocf:heartbeat:portblock \
params ip="10.10.1.71" portno="3260" protocol="tcp" action="unblock" \
op monitor interval="30s"

pcs -f iscsiha_cfg resource group add g_iscsiha p_portblock-store1-block p_ip_iscsi p_iscsi_store1 \
p_iscsi_store1_lun1 p_portblock-store1-unblock

pcs -f iscsiha_cfg constraint colocation add Started g_iscsiha with Master ms_drbd_iscsiha INFINITY

pcs -f iscsiha_cfg constraint order promote ms_drbd_iscsiha then start g_iscsiha

pcs cluster cib-push iscsiha_cfg


- output of "crm_mon -1"

 Resource Group: g_iscsiha
     p_portblock-store1-block (ocf::heartbeat:portblock): Started srvmgmt01.localdomain.local 
     p_ip_iscsi (ocf::heartbeat:IPaddr2): Started srvmgmt01.localdomain.local 
     p_iscsi_store1 (ocf::heartbeat:iSCSITarget): Started srvmgmt01.localdomain.local 
     p_iscsi_store1_lun1 (ocf::heartbeat:iSCSILogicalUnit): Started srvmgmt01.localdomain.local 
     p_portblock-store1-unblock (ocf::heartbeat:portblock): Started srvmgmt01.localdomain.local 

- output of tgtadm on both nodes while srvmgmt01 is active for the group

[root@srvmgmt01 ~]#  tgtadm --mode target --op show 
Target 1: iqn.2014-07.local.localdomain:store1
    System information:
        Driver: iscsi
        State: ready
    I_T nexus information:
    LUN information:
        LUN: 0
            Type: controller
            SCSI ID: IET     00010000
            SCSI SN: beaf10
            Size: 0 MB, Block size: 1
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            Backing store type: null
            Backing store path: None
            Backing store flags: 
        LUN: 1
            Type: disk
            SCSI ID: p_iscsi_store1_l
            SCSI SN: 66666a41
            Size: 214738 MB, Block size: 512
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            Backing store type: rdwr
            Backing store path: /dev/drbd/by-res/iscsiha
            Backing store flags: 
    Account information:
        iscsiuser
    ACL information:
        10.10.1.61
        10.10.1.62
        10.10.1.63

on the passive node:
[root@srvmgmt02 heartbeat]# tgtadm --mode target --op show 
[root@srvmgmt02 heartbeat]# 

Performance and tuning values (still to be verified) were taken from here:
http://www.dbarticles.com/centos-6-iscsi-tgtd-setup-and-performance-adjustments/

my cluster is a basic one for testing, so it is not critical for my environment...
at the moment there is only a 1 Gbit/s network, with one adapter for the DRBD replica and one for iSCSI traffic.
Tested with some basic I/O benchmarks on a VM stressing the storage domain (SD): I got about 90-95 MB/s on both the DRBD and iSCSI networks. Relocating the iSCSI service while a benchmark was active also seemed to cause no problems for the SD or the VM.
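
Not from the original post, but a quick way to reproduce that kind of sequential-throughput check from inside a VM (the path and size below are placeholders; on a real VM, point the output file at the iSCSI-backed disk):

```shell
# Simple sequential write test; /tmp/ddtest.img is a placeholder path.
# conv=fsync makes dd flush before reporting, so the throughput figure
# reflects data actually written out, not just buffered in page cache.
dd if=/dev/zero of=/tmp/ddtest.img bs=1M count=64 conv=fsync
```

dd prints the achieved throughput on stderr when it finishes; for a fairer read test, drop caches first or use iflag=direct.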

- I also enabled iptables on the cluster nodes so that the initiators (oVirt hosts) can connect to the IP alias dedicated to the iSCSI service:
in /etc/sysconfig/iptables:
# iSCSI
-A INPUT -p tcp -m tcp -d 10.10.1.71 --dport 3260 -j ACCEPT
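
To make that rule effective without restarting the firewall, on CentOS 6 something like this should work (same rule as above, inserted live and then persisted):

```shell
# Insert the rule into the running firewall...
iptables -I INPUT -p tcp -m tcp -d 10.10.1.71 --dport 3260 -j ACCEPT
# ...and persist it back to /etc/sysconfig/iptables
service iptables save
```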

I have to recheck the logs to give the exact scenario of what caused the problem... not being a critical system, it is not so closely monitored at the moment...

comments welcome
Gianluca