Monday, 30 January 2017

Fun with Robinhood 3, Lustre 2.9 and a PCI-e SSD: /dev/nvme0n1

------------------- TEST PCI-e SSD XFS---------------
MySQL data on the NVMe SSD

time rbh-du -d -H  -f /etc/robinhood.d/tmpfs/lnew.conf /lnew/*

real    6m32.658s
user    0m18.175s
sys     0m9.272s

We ran it a second time to make sure the numbers are consistent.

real    6m22.549s
user    0m24.533s
sys     0m10.950s

------------------- TEST PCI-E SSD ext4 -------------

real    5m46.431s
user    0m16.900s
sys     0m9.288s

real    5m45.627s
user    0m15.509s
sys     0m8.985s

------------ TEST ZFS  128k --------------------
MySQL data on a zpool RAIDZ of 8x3TB HGST disks

time rbh-du -d -H  -f /etc/robinhood.d/tmpfs/lnew.conf /lnew/*

real    13m34.484s
user    0m17.150s
sys     0m9.125s
------------------- Same ZFS but with 8K recordsize -------------
real    15m20.003s
user    0m16.427s
sys    0m8.811s

Interestingly, the 8K-recordsize myth does not help performance here.
What about 64K? It performs the same as the 128K recordsize.
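For reference, a recordsize comparison like this does not need a new pool: recordsize is a per-dataset ZFS property. A sketch of the setup, where the dataset names are made up for illustration and not the ones from our pool:

```
# One dataset per recordsize under the existing pool "tank";
# the pool layout itself stays untouched.
zfs create -o recordsize=128K tank/mysql128k
zfs create -o recordsize=64K  tank/mysql64k
zfs create -o recordsize=8K   tank/mysql8k

# Verify what each dataset actually uses:
zfs get recordsize,compression,compressratio tank/mysql128k tank/mysql64k tank/mysql8k
```

Pointing the MySQL datadir at one dataset per run then makes the recordsize the only variable between tests.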

--------------- Lustre 2.9  mds-----------------
Directly on lustrefs

time du -bsh /lnew/*

real    36m12.990s
user    0m24.658s
sys     8m57.458s

Interesting results for InnoDB speed on the different file systems.


Saturday, 28 January 2017

PCI/SSD/NVM performance challenge begins! - Part II

Recently we got a test server with an Intel P3600 1.2TB PCI-e SSD.
The operating system sees it as: INTEL SSDPEDME012T4
[root@c1701 vector]# isdct show -nvmelog smarthealthinfo

- SMART and Health Information CVMD550300L81P2CGN -

Available Spare Normalized percentage of the remaining spare capacity available : 100
Available Spare Threshold Percentage : 10
Available Spare Space has fallen below the threshold : False
Controller Busy Time : 0x19
Critical Warnings : 0
Data Units Read : 0x380EC1
Data Units Written : 0xC0C8D4
Host Read Commands : 0x031AA3A3
Host Write Commands : 0x08B44DE8
Media Errors : 0x0
Number of Error Info Log Entries : 0x0
Percentage Used : 0
Power Cycles : 0x10
Power On Hours : 0x01FD
Media is in a read-only mode : False
Device reliability has degraded : False
Temperature - Celsius : 35
Temperature has exceeded a critical threshold : False
Unsafe Shutdowns : 0x08
Volatile memory backup device has failed : False

Now the question arises: do we need PCI-e NVMe SSDs at all? How can we use one to boost any of our daily-used systems?
The bad thing about PCI-e is serviceability: if the card starts to fail, you need to power off the box to replace it. For time-critical systems that is unacceptable. One could build some kind of HA pair of boxes, but any HA layer adds its own failure probability and certainly drags down your hardware performance.

As a first quick test I checked the Robinhood scans of the Lustre filesystem.
This process runs as a daily cron job on a dedicated client.
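The cron side is nothing special; a sketch of such a daily job follows, where the binary name, path and flags are assumptions that depend on how Robinhood 3 was packaged:

```
# /etc/cron.d/robinhood-scan (sketch; adjust binary and flags to your build)
# Full one-shot scan every night at 02:00 on the dedicated client.
0 2 * * * root /usr/sbin/robinhood --scan --once -f /etc/robinhood.d/tmpfs/lnew.conf
```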

Here are the benchmarks with recent "Lustre client" hardware:

Aim: to keep the lustrefs file tables in MySQL InnoDB.

Robinhood 3 with CentOS 7.3 and Lustre 2.9 client.
Lustre server 2.9 on CentOS 6.8, MDS on 4x4 mirrored Intel SSDs,
3 JBODs, 18 OSTs (450-600MB/s per OST), 3x MR9271-8i per OST, 24 disks per OST, FDR 56Gbit interconnect,
with MPI tests up to >3.5GB/s random IO.
Currently used: 129TB out of 200TB, no striped datasets.
Data distribution histogram: file count / size
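A hedged sketch of how such a files-count-per-size-bucket histogram can be produced; it assumes the Lustre mount answers normal POSIX `find` calls (GNU find's `-printf` is used here because `lfs find` does not print sizes):

```shell
# Bucket file sizes (bytes, one per line) by power of two and count them.
# Output: "<count> <log2 bucket>", e.g. bucket 20 = files in the 1-2MB range.
hist() {
    awk '{ b = 0; s = $1; while (s >= 2) { s /= 2; b++ } print b }' \
        | sort -n | uniq -c
}

# Against the real filesystem (not run here):
#   find /llust -type f -printf '%s\n' | hist
```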

[root@c1701 ~]# time lfs find /llust  -type f | wc -l

real    11m13.261s
user    0m12.464s
sys     2m15.665s

[root@c1701 ~]# time lfs find /llust  -type d | wc -l

real    6m40.706s
user    0m3.500s
sys     0m56.831s

[root@c1701 ~]# time lfs find /llust   | wc -l

real    3m52.773s
user    0m10.867s
sys     1m4.731s

The bottleneck is probably the MDS, but it is fast enough to feed our InnoDB with file stats and locations.

The final Robinhood InnoDB database size is about ~100GB.
Engine: InnoDB from Oracle MySQL, mysql-5.7.15-linux-glibc2.5-x86_64. (Why not MariaDB? There are some deviations with MariaDB; it is a little slower than Oracle MySQL, +/- 5-10%. But for the final test conclusions this is not so relevant.)
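A note on the InnoDB configuration: the knobs below are the ones that matter most for this kind of scan-heavy import load. This is only a sketch with assumed values, not our production my.cnf:

```ini
[mysqld]
# Buffer pool: most of the host RAM, minus headroom for the OS.
innodb_buffer_pool_size = 48G
# Bypass the page cache; InnoDB does its own buffering.
innodb_flush_method     = O_DIRECT
# The default (200) is tuned for spinning disks and starves NVMe.
innodb_io_capacity      = 20000
# Larger redo logs smooth out write bursts during the scan import.
innodb_log_file_size    = 4G
```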


We set up RAIDZ2 with 8x3TB HGST HUS724030ALA640 disks on an LSI 9300-8i in IT mode.

FS setup for the tests:
  1. zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh -f
  2. zfs set compression=lz4 tank
  3. zpool create nvtank /dev/nvme0n1
  4. zfs set compression=lz4 nvtank
  5. mkfs.xfs -Lnvtank  /dev/nvme0n1 -f

Results in mytop queries per second (qps):
  1. 8xRAIDZ2 compression none qps: ~98 burst up to 300
  2. 8xRAIDZ2 compression lz4 qps: ~107 burst up to 480
  3. nvmeZFS compression none qps: 5000 burst up to 12000
  4. nvme XFS qps: ~17000 burst up to 18000
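For reference, a qps figure like mytop's can be reproduced from two samples of the server's `Questions` counter; a hedged sketch, where the `q` helper is hypothetical and needs a running server:

```shell
# qps <count1> <count2> <interval_seconds>: average queries per second
# between two samples of a monotonically increasing counter.
qps() {
    echo $(( ($2 - $1) / $3 ))
}

# Against a live server (not run here):
#   q() { mysql -NBe "SHOW GLOBAL STATUS LIKE 'Questions'" | awk '{print $2}'; }
#   a=$(q); sleep 10; b=$(q); qps "$a" "$b" 10
```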

The first results show that XFS wins against ZFS. The latency of the PCI-e device is impressive, but once you add a software RAID layer like ZFS with compression on top, it becomes slower, though still nowhere near as slow as the 8 spinning disks.

PCI/SSD/NVM performance challenge begins! - Part I (this is not a regular post, it is more or less "thinking loudly"... or blogly?)

Assume we are trying to build a system with maximum bandwidth for MySQL applications.
Our databases are not optimized for normal usage, so we have a lot of full table scans. Some of the tables are really huge: multi-terabyte monster databases.

In sharded mode we see a ~600MB/s load on a single shard. The single node has 8x4TB local SATA HGST enterprise disks in RAID-6 on an LSI 9271 (BBU included) and 65GB of host RAM.

Let us begin with some optimizations on the single node. We have the following options:
1) Boost the RAM? Actually it makes no sense if we don't benefit from the MySQL cache.
2) Speed up the CPU? Currently most of the load is IO-bound, not MySQL computation.
3) Speed up RAID/disk IO? This looks promising, so let us start here.


A single spinning disk can deliver about 100-120MB/s in streaming mode.

PCI-e 3.0 total bandwidth (x16 link): ~32GB/s bidirectional (8.0 GT/s per lane, ~1GB/s per lane per direction).
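As a sanity check on those numbers (PCI-e 3.0 runs 8.0 GT/s per lane with 128b/130b encoding):

```shell
# Usable bandwidth per lane = 8.0 GT/s * 128/130 encoding / 8 bits per byte.
awk 'BEGIN {
    lane = 8.0e9 * 128 / 130 / 8          # bytes/s per lane, per direction
    x16  = lane * 16
    printf "per lane: %.3f GB/s, x16: %.1f GB/s per direction (%.1f GB/s bidirectional)\n",
           lane / 1e9, x16 / 1e9, 2 * x16 / 1e9
}'
```

So the x16 slot is nowhere near the bottleneck for a single NVMe card (which uses x4 anyway).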

A single DB file test on XFS shows only about 563MB/s:

dd if=/store/TestDatabase.MYD bs=4k | /tmp/pv >/dev/null
 205GB 0:06:13 [ 562MB/s]
53674008+1 records in
53674008+1 records out
219848739110 bytes (220 GB) copied, 373.182 s, 589 MB/s
For full table scan queries, ~373s is quite a big performance penalty.
This is probably due to the RAID-6 with 8 disks: it cannot deliver more than (8-2) x 100MB/s ≈ 600MB/s.

Also, we are not benefiting from the controller's CacheVault or the single-disk cache (64MB).
Note that the file is not fragmented:
 filefrag /store/TestDatabase.MYD
/store/TestDatabase.MYD: 1 extent found

Full table scans are evil :)

What about boosting the FS for a speedup?

XFS vs ZFS - maybe an NVMe cache?