Mittwoch, 1. Oktober 2014

sysbench: benchmark your mysql storage

0) yum install sysbench -y 
1) mysql -uroot -e "CREATE SCHEMA sbtest;"
2) sysbench --db-driver=mysql --test=oltp --num-threads=8 --max-requests=10000 --oltp-table-size=2000000 --oltp-test-mode=complex --mysql-host=localhost --mysql-db=sbtest --mysql-table-engine=innodb --mysql-port=3306 --mysql-user=root prepare
3) sysbench --db-driver=mysql --test=oltp --num-threads=8 --max-requests=10000 --oltp-table-size=2000000 --oltp-test-mode=complex --mysql-host=localhost --mysql-db=sbtest --mysql-port=3306 --mysql-user=root  run | tee sysbench.log

4) analyse your sysbench.log

Results on striped 2x3TB SATA-HGST disks /dev/md0:
OLTP test statistics:
    queries performed:
        read:                            140014
        write:                           50005
        other:                           20002
        total:                           210021
    transactions:                        10001  (151.00 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 190019 (2868.92 per sec.)
    other operations:                    20002  (301.99 per sec.)

Test execution summary:
    total time:                          66.2336s
    total number of events:              10001
    total time taken by event execution: 529.6338
    per-request statistics:
         min:                                 15.57ms
         avg:                                 52.96ms
         max:                               1392.39ms
         approx.  95 percentile:             100.09ms

Threads fairness:
    events (avg/stddev):           1250.1250/5.11
    execution time (avg/stddev):   66.2042/0.01
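For step 4, the headline numbers can be pulled out of the log with awk; a minimal sketch (field positions assumed from the 0.4-style output shown above; a two-line sample is embedded so it runs standalone, point LOG at your real sysbench.log instead):

```shell
#!/bin/sh
# Extract transactions/sec and the approx. 95th percentile
# from a sysbench 0.4-style oltp log.
LOG=${LOG:-/tmp/sysbench-sample.log}
cat > "$LOG" <<'EOF'
    transactions:                        10001  (151.00 per sec.)
         approx.  95 percentile:             100.09ms
EOF
tps=$(awk '/transactions:/ {gsub(/[()]/,"",$3); print $3}' "$LOG")
p95=$(awk '/95 percentile/ {print $NF}' "$LOG")
echo "tps=$tps p95=$p95"
```

Useful when comparing several runs: collect the tps/p95 pairs into one file and diff the storage setups side by side.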

Donnerstag, 18. September 2014

cli ansible vs clustershell : running commands on multiple hosts.

Let us start with CentOS 6.5
1) enable epel: yum localinstall -y
2) yum install clustershell -y
3) yum install ansible

Let us run the w command on multiple hosts:
clush -w localhost,anotherhost -B -b w
ansible all -i 'localhost,anotherhost,' -c local -m command -a "w"

Then Analyze output....

Mittwoch, 17. September 2014

OpenSM lid and more...

On centOS 6.5
yum groupinstall "Infiniband Support"
yum install infiniband-diags

The Infiniband diagnostics tools contain nice "weapons" to eliminate network bugs.
One of them is:
 saquery - query InfiniBand subnet administration attributes

 it looks like this:

NodeRecord dump:

Fix infiniband QDR or FDR troubles on fat memory machines.

Fix memory trouble on FAT machines:
The formula to compute the maximum value of pagepool when using RDMA is:
2^log_num_mtt x 2^log_mtts_per_seg x PAGE_SIZE > 2 x pagepool

2^20 x 2^4 x 4K = 64GiB
options mlx4_core log_num_mtt=20 log_mtts_per_seg=4
Check the changes:
more /sys/module/mlx4_core/parameters/log_num_mtt
more /sys/module/mlx4_core/parameters/log_mtts_per_seg
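The 64 GiB figure above can be reproduced in plain shell arithmetic (a 4 KiB page size is assumed; run getconf PAGESIZE to confirm on your host):

```shell
#!/bin/sh
# Sanity-check the MTT sizing: 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE
log_num_mtt=20
log_mtts_per_seg=4
page_size=4096
max_reg_bytes=$(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size ))
# 2^20 * 2^4 * 2^12 = 2^36 bytes = 64 GiB of registerable memory
echo "max registerable memory: $(( max_reg_bytes >> 30 )) GiB"
```

So with pagepool at 32 GiB or less, these module options satisfy the "> 2 x pagepool" rule.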

CentOS 7 after install:

CentOS 7 is managed differently from CentOS 5.x and 6.x:
it uses systemctl to manage services.

systemctl enable sshd
systemctl list-unit-files
systemctl get-default
systemctl set-default multi-user.target
systemctl disable firstboot-graphical.service
systemctl disable bluetooth.service
systemctl enable network.service
systemctl show
systemd-analyze blame

Tune other stuff as you need.

Lustre 2.1.6 server and 2.5.3 client.

Are they compatible?
The answer is yes!
Lustre client upgrading using yum in centos 6.5:

vim /etc/yum.repos.d/lustre.repo
name=CentOS-$releasever - Lustre
name=CentOS-$releasever - Ldiskfs

name=CentOS-$releasever - Lustre

On client:
1) yum update lustre-client -y
2) and finally a one-liner to restart the lustre mount point:

umount /lustre;lustre_rmmod;service lnet stop;service lnet start;mount /lustre

Ovirt 3.4.3-1: recover VM from an "unknown" state.

After an iscsi (iSER) storage failure and local-disk corruption on a host, one of the HA VMs was stuck in "?" = status unknown. Restarting ovirt-engine and the hosts did not help much. The host which owned the VM was gone, but the web portal was still showing unknown status for it, and I was not able to reboot or stop it: all services died with the bad disk on that host. The main problem seems to have been the missing iscsi disk storage, which was hanging in a "locked" state. I found a simple solution in 3 steps:
  1. find the hanging disk ID from the web interface; it looks something like this: 324f9089-0a40-4744-aa33-5c5a108f7f43
  2. on the ovirt-engine server: su - postgres
  3. psql -U postgres engine -c "select fn_db_unlock_disk('324f9089-0a40-4744-aa33-5c5a108f7f43');"
    After these steps, take the hanging host down from the web interface. The HA VM will come up on another healthy node.

Donnerstag, 3. April 2014

Fixing Ovirt cloned vdsm node.

In order to accelerate host deployments I was using home-brewed scripts based on tar and the PXE boot functionality of our nodes.
The process is as follows:
  1. make-image-vdsmnode
  2. boot node with PXE in diskless mode
  3. run format script
  4. run copy script
  5. correct network 
  6. install grub.
  7. reboot into local-disk boot mode via PXE.

Once all this is done, try to add the new host into the cluster.

You will get an error in the ovirt-engine interface:
Duplicate host UUID.

In ovirt-engine.log you may see:

2014-04-03 10:44:28,439 WARN  [org.ovirt.engine.core.bll.AddVdsCommand] (ajp-- [6385bd93] CanDoAction of action AddVds failed. Reasons:VAR__ACTION__ADD,VAR__TYPE__HOST,$server nodeserv1.cls,ACTION_TYPE_FAILED_VDS_WITH_SAME_UUID_EXIST

To fix the issue: remove /etc/vdsm/ from the cloned host:
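The duplicate-UUID cleanup can be sketched as follows, assuming (an assumption on my side, not stated in the post) that this vdsm version keeps the host UUID in /etc/vdsm/vdsm.id; VDSM_ID is overridable purely for illustration:

```shell
#!/bin/sh
# Hypothetical sketch: regenerate the duplicated host UUID on the cloned
# node before re-adding it to the cluster. Path is an assumption.
VDSM_ID=${VDSM_ID:-/etc/vdsm/vdsm.id}
uuid=$(cat /proc/sys/kernel/random/uuid)    # or: uuidgen
echo "$uuid" > "$VDSM_ID"
cat "$VDSM_ID"
```

After that, retry adding the host from the ovirt-engine interface.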


Mittwoch, 26. März 2014

Get rid of T---- files in glusterfs

If you have bad performance on Glusterfs it could be due to files sitting in the wrong location.
If a brick was off-line, new files go to another one.
After bringing the brick back, the files are still in the wrong place; Glusterfs generates a zero-size file with the sticky bit set (the T---- attribute) to mark where the file really is.

Find the bad files on a brick and move them back to the correct place.

Warning: only do this if you have a backup. You may remove or delete important files, so be careful!

To make the file move work correctly, it is better to go through the glusterfs mount point.

Let us see by example: assuming /store/01 and /store/02 are the bricks and
/store/glu is the mount point of the GluFS volume.

  • find /store/01 /store/02 -type f -perm 1000| tee /tmp/bad-Tfiles.log
  • cat /tmp/bad-Tfiles.log| sed -r 's/\/store\/0+[1-4]/\/store\/0\?/g'| xargs -I{} echo ls -l {}| parallel | tee /tmp/results.log
If we have duplicate files with the same name, one the T--- pointer and the other the normal file, we should remove the T--- file and move the good file back through the glusterfs mount point:
unlink /store/01/file.txt;mv /store/02/file.txt /tmp/;mv /tmp/file.txt /store/glu
Now if we do: ls -l /store/0?/file.txt we should see only one file name.
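The same cleanup with placeholder paths, as a sketch (paths and the file name are placeholders; verify which copy is the real file before removing anything, and run it once per duplicate pair):

```shell
#!/bin/sh
# BRICK holds the zero-size T--- pointer, OTHER holds the real file,
# GLU is the glusterfs mount point.
BRICK=${BRICK:-/store/01}; OTHER=${OTHER:-/store/02}; GLU=${GLU:-/store/glu}
f=${f:-file.txt}
unlink "$BRICK/$f"          # drop the sticky-bit pointer file
mv "$OTHER/$f" "/tmp/$f"    # pull the real copy off the brick
mv "/tmp/$f" "$GLU/$f"      # re-inject it through the gluster mount
```

Going through the mount point lets glusterfs place the file on the correct brick again.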

Sonntag, 5. Januar 2014

Lustre 2.4.2 performance tests: ZFS vs ext4 backends.

The fastest MySQL performance I have seen was on RAID10 with 24 x 4TB SAS disks.
Before testing, dd on both backends delivered almost 2.5GB/s:
seq 1 8 | xargs -I{} echo "dd if=/dev/zero of=/data/test.{} bs=4M count=1000 &"| sh

Next I tested the Lustre 2.4.2 MDT:
1)ZFS 0.6.2 CentOS6.5
mds:>thrhi=16 dir_count=16 file_count=200000 mds-survey
Mon Dec 23 14:48:36 CET 2013 /usr/bin/mds-survey from
mdt 1 file  200000 dir   16 thr   16 create 9185.17 [   0.00,15999.01] lookup 421928.47 [421928.47,421928.47] md_getattr 392586.40 [392586.40,392586.40] setxattr 56024.07 [25998.10,29999.37] destroy 10387.71 [   0.00,15999.41]

2) RAID10-ldiskfs
mds:> thrhi=16 dir_count=16 file_count=200000 mds-survey
Mon Dec 23 15:06:46 CET 2013 /usr/bin/mds-survey from
mdt 1 file  200000 dir   16 thr   16 create 126533.11 [126533.11,126533.11] lookup 1129547.84 [1129547.84,1129547.84] md_getattr 776042.03 [776042.03,776042.03] setxattr 115977.53 [115977.53,115977.53] destroy 156916.53 [156916.53,156916.53]

The result: ZFS is slower than ldiskfs. In some operations it is about 10 times slower than ext4.

This may be because ZFS was running on top of a hardware RAID 10. With a plain HBA it might perform better.

Got bad lustre OST? Disable it!!

mgs:>lctl conf_param

check active/inactive OST:

cat /proc/fs/lustre/lov/fsname-MDT0000-mdtlov/target_obd
Simply replace fsname with your file system name.
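The truncated conf_param line above would normally take the OST's active flag; a hedged sketch (fsname and OST0002 are placeholders, and the echo makes it a dry run: remove it to actually apply on the MGS):

```shell
#!/bin/sh
# Deactivate a bad OST from the MGS via the standard
# <fsname>-OST<idx>.osc.active parameter (set back to 1 to re-enable).
fsname=fsname
idx=OST0002
echo lctl conf_param "${fsname}-${idx}.osc.active=0"
```

Afterwards, the target_obd file mentioned above should list that OST as INACTIVE.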

Before taking an OST offline or re-formatting it, you can move the data out.
Assuming we would like to move all data off OST0002, run on a client (feeding each file name to the script):
lfs find /lustre --ost fsname-OST0002 | xargs -P4 -I{} {}
And script: 
cat /opt/admin/ 

#!/bin/bash
# the file to rewrite is passed as the first argument
file="$1"
fn=$(mktemp --dry-run "${file}".lureb-XXXXX)
echo "$file" "$fn"
# copy the file (re-striping it onto the active OSTs), then swap it back into place
cp -a "$file" "$fn" && unlink "$file" && mv "$fn" "$file"

Note: the script does not work properly with file names containing special characters such as spaces.

Donnerstag, 2. Januar 2014

Lustre upgrading from 2.4.1 to 2.4.2

For upgrading it is recommended to unmount the clients, but if you cannot, as in my case, just leave them mounted.

Before upgrade make sure that server is stopped:
  1. umount OST 
  2. umount MDT
  3. umount MGS

Note that the order is important!
On one server I got a kernel panic because the MGS was unmounted before the OSTs went down.

Then remove all Lustre modules from the servers:
 service lustre stop;service lnet stop;lustre_rmmod;service lnet stop

Download latest e2fs from:

NOTE: if you google "lustre e2fsprogs" you will end up with tons of wrong/old links.
The actual one is from!!!

Install e2fsprogs:
yum install e2fsprogs-1.42.7.wc2-7.el6.x86_64.rpm e2fsprogs-libs-1.42.7.wc2-7.el6.x86_64.rpm libcom_err-1.42.7.wc2-7.el6.x86_64.rpm libss-1.42.7.wc2-7.el6.x86_64.rpm

NOTE: once you have installed these e2fsprogs you cannot remove them any more.

The last step is installing all rpms from latest-maintenance-release.

Download  recent lustre from:

NOTE: If you don't use the ZFS back-end, do not install lustre-osd-zfs*.

reboot the servers and mount:
  1. MGS
  2. MDT
  3. OSTs 
Wait until recovery is finished; then df -h on the clients should work.
During the first hour on a heavily loaded cluster you will probably see very high load on the OSTs and the MDT. This is expected, due to parallel commits from the clients.

Shut up the message: "padlock: VIA PadLock Hash Engine not detected."

If you don't have a motherboard with a PadLock device, every service lnet start will leave error messages in dmesg:
alg: No test for crc32 (crc32-table)
alg: No test for adler32 (adler32-zlib)
alg: No test for crc32 (crc32-pclmul)

padlock: VIA PadLock Hash Engine not detected.

or similar.

To shut up the padlock message, simply add this line to /etc/modprobe.d/blacklist.conf:

blacklist padlock

Lustre 2.4.2 client on SL 6.x: recompile for older kernels.

It is a fact that in cluster environments it is not always possible to keep the Linux kernel up to date.
It is of course recommended to update the kernels on the whole cluster from time to time when critical security bugs are fixed.
Recently Lustre 2.4.2 was released against kernel-2.6.32-358.23.2.el6.
If you run a different kernel, it is not a problem to recompile the client.
In order to recompile
1) download src of client:

2) rebuild using following command:

rpmbuild --rebuild --define 'lustre_name lustre-client' --define 'configure_args --disable-server' lustre-client-2.4.2-2.6.32_358.23.2.el6.x86_64.src.rpm

If you run this as root, the new RPMs will be generated in /root/rpmbuild/RPMS/x86_64.

To install:
cd /root/rpmbuild/RPMS/x86_64 && yum install ./*.rpm

The last step is to reload the lnet and lustre modules before remounting the Lustre filesystem:
service lustre stop;service lnet stop;lustre_rmmod;service lnet stop;

Sometimes the osc module still has open references, therefore we need to stop lnet twice.

service lnet start;modprobe lustre;dmesg

dmesg should contain something like:
Lustre: Lustre: Build Version: 2.4.2-RC2--PRISTINE-2.6.32-358.14.1.el6.x86_64

If you would like to check which version of the client is running:
cat /proc/fs/lustre/version

So we did it: a Lustre client upgrade without rebooting. Have fun with the new Lustre client.