Friday, 30 June 2017

NextCloud: What are you doing with my storage?

To audit it, install the auditd package on Debian:
apt install auditd
Start auditing:
auditctl -w /DATA/
Stop auditing:
auditctl -W /DATA/
Monitoring:
tail -f /var/log/audit/audit.log
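A slightly more targeted variant, as a sketch: the `-p` permission filter and `-k` key are standard auditctl options, and ausearch/aureport ship with auditd, but the key name here is my own choice:

```shell
# watch only writes and attribute changes on /DATA/, tagged with a key
auditctl -w /DATA/ -p wa -k datawatch
# later, pull out just those events instead of tailing the raw log
ausearch -k datawatch
# summary report of audited file access
aureport -f --summary
```

This keeps the log smaller than watching all of read/write/execute/attribute, and the key makes the events easy to filter.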

I was just wondering if someone has similar numbers on Nextcloud?

The setup is quite simple :)
1) Debian 9
2) MariaDB with storage on /dev/shm
3) storage is NFS over 10G, with 2 Intel NVMe PCIe drives striped with ZFS

Benchmark: a Windows client trying to upload:
  1. gcc-7 source code    - 1x
  2. kernel 4 source code - 3x
  3. kernel 3 source code - 1x
  4. boost-1.64 source    - 1x
In total: 331K files, about 6GB.
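That works out to an average file size of roughly 19KiB, which explains why this workload hammers the database rather than the network:

```shell
# ~6 GiB spread over ~331K files -> average file size in KiB
awk 'BEGIN{printf "%.1f KiB\n", 6*1024*1024/331000}'
# -> 19.0 KiB
```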
MariaDB shows around 5K queries/second; has anyone got better numbers?



NextCloud check database tables size

If you need to know how much storage your Nextcloud instance consumes, you can run the following query on the database:

SELECT table_schema AS `Database`,
       table_name AS `Table`,
       ROUND(((data_length + index_length) / 1024 / 1024), 2) AS `Size in MB`
FROM information_schema.TABLES
WHERE table_name LIKE 'oc_%'
ORDER BY (data_length + index_length);
Credit goes to Stack Overflow:
https://stackoverflow.com/questions/9620198/how-to-get-the-sizes-of-the-tables-of-a-mysql-database
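If you only want the grand total, the same information_schema data can be summed; here is a sketch via the mysql CLI (user and invocation are placeholders, adapt to your setup):

```shell
# total size of all Nextcloud (oc_*) tables in MB -- credentials are placeholders
mysql -u root -p -N -e "
  SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS total_mb
  FROM information_schema.TABLES
  WHERE table_name LIKE 'oc_%';"
```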

Friday, 23 June 2017

Torque PBS queue: sending mails.

It is important to configure the Torque job queue server with the following parameters:

qmgr -c 'set server mail_from = adm@mail.domain'
qmgr -c 'set server mail_domain = my.pbsserver.host'
qmgr -c 'set server mail_subject_fmt = [NO-REPLY]Cluster Muster: Job %i - %r'


How to fix the ownCloud client: no icons in Finder on macOS


One should enable the Finder extension for the ownCloud client:
Apple menu (top left) -> System Preferences -> Extensions, then enable ownCloud for Finder.


Friday, 2 June 2017

Torque + Maui with mixed nodes, e.g. with a different number of cores per NUMA node


Assume the compute cluster has 2 different processor types: E5-2650v3 and E5-2650v4.
Each node has dual CPUs (because of the 26xx :) ). If we divide by NUMA boards, each "node" gets 10 cores (v3) or 12 cores (v4).

To use the whole cluster as one queue, one needs to reconfigure Maui:

/usr/local/maui/maui.cfg
ENABLEMULTIREQJOBS              TRUE

Then use the job script with the mixed resources:

#!/bin/bash
#PBS -q mix2
#PBS -r n
#PBS -l nodes=2:ppn=12:dbt+2:ppn=10:db
#PBS -l walltime=1:00:00
#PBS -u arm2arm
Where dbt and db are node feature flags in the /var/lib/torque/server_priv/nodes file.

And we can now see our job running on different nodes with different numbers of cores:
lei4-0      db  dbt 15.9  excl 10/10  0.00      6%                9549     arm2arm
lei4-1      db  dbt 16.0  excl 10/10  0.00      2%                9549     arm2arm
lei76-0  b2dis  dbt 63.9  excl 12/12  0.00      2%                9549     arm2arm
lei76-1  b2dis  dbt 64.0  excl 12/12  0.00      2%                9549     arm2arm
lei4  has 2 NUMA boards, each with 10 cores
lei76 has 2 NUMA boards, each with 12 cores
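For reference, the nodes file behind this could look like the following sketch. The hostnames and feature flags are taken from the listing above; the `hostname np=<cores> <features>` syntax is standard Torque, but the exact file content is an assumption:

```
# /var/lib/torque/server_priv/nodes -- hostname np=<cores> <feature flags>
lei4-0   np=10  db    dbt
lei4-1   np=10  db    dbt
lei76-0  np=12  b2dis dbt
lei76-1  np=12  b2dis dbt
```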
 
 

Wednesday, 10 May 2017

I hate DAPL!!! Who else is using DAPL?


DAPL vs OFA fabrics with Intel MPI 2017:

ofa:
[arm2arm@l1 BOX64_2048]$ mpirun -nolocal -r ssh -genv I_MPI_FABRICS ofa     -n 2  -ppn 1  -hostfile ./m2.txt ~/Projects/BENCH/ping_pong.x 200000000
lei38
2
lei37
2
ping-pong 200000000 bytes ...
200000000 bytes: 34789.00 usec/msg
200000000 bytes: 5748.94 MB/sec

dapl:
[arm2arm@l1 BOX64_2048]$ mpirun -nolocal -r ssh -genv I_MPI_FABRICS dapl      -n 2  -ppn 1  -hostfile ./m2.txt ~/Projects/BENCH/ping_pong.x 200000000
lei38
2
lei37
2
ping-pong 200000000 bytes ...
200000000 bytes: 96856.38 usec/msg
200000000 bytes: 2064.91 MB/sec
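As a sanity check, the MB/sec figures above follow directly from the usec/msg numbers: bytes divided by microseconds per message gives decimal MB/s, so OFA really is almost 3x faster here:

```shell
# reproduce the reported MB/sec from bytes and usec/msg (1 byte/usec = 1 MB/s decimal)
awk 'BEGIN{printf "ofa: %.2f MB/s  dapl: %.2f MB/s\n",
           200000000/34789.00, 200000000/96856.38}'
# -> ofa: 5748.94 MB/s  dapl: 2064.91 MB/s
```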

Friday, 3 March 2017

GlusterFS + ZFS + oVirt + RDMA?

Hardware: three identical machines, each with 3 disks, as oVirt hosts:
[root@clei22 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            16
Model:                 9
Model name:            AMD Opteron(tm) Processor 6128
Stepping:              1
CPU MHz:               2000.000
BogoMIPS:              4000.38
Virtualization:        AMD-V
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              5118K
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
NUMA node2 CPU(s):     12-15
NUMA node3 CPU(s):     8-11

[root@clei22 ~]# lsscsi
[2:0:0:0]    disk    ATA      INTEL SSDSC2CW24 400i  /dev/sda
[3:0:0:0]    disk    ATA      HGST HUS724040AL AA70  /dev/sdb
[4:0:0:0]    disk    ATA      WDC WD2002FYPS-0 1G01  /dev/sdc

[root@clei22 ~]# pvs ;vgs;lvs
  PV                                                 VG            Fmt  Attr PSize   PFree
  /dev/mapper/INTEL_SSDSC2CW240A3_CVCV306302RP240CGN vg_cache      lvm2 a--  223.57g     0
  /dev/sdc2                                          centos_clei22 lvm2 a--    1.82t 64.00m
  VG            #PV #LV #SN Attr   VSize   VFree
  centos_clei22   1   3   0 wz--n-   1.82t 64.00m
  vg_cache        1   2   0 wz--n- 223.57g     0
  LV       VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  home     centos_clei22 -wi-ao----   1.74t                                                   
  root     centos_clei22 -wi-ao----  50.00g                                                   
  swap     centos_clei22 -wi-ao----  31.44g                                                   
  lv_cache vg_cache      -wi-ao---- 213.57g                                                   
  lv_slog  vg_cache      -wi-ao----  10.00g   

[root@clei22 ~]# zpool status -v
  pool: zclei22
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Tue Feb 28 14:16:07 2017
config:

    NAME                                    STATE     READ WRITE CKSUM
    zclei22                                 ONLINE       0     0     0
      HGST_HUS724040ALA640_PN2334PBJ4SV6T1  ONLINE       0     0     0
    logs
      lv_slog                               ONLINE       0     0     0
    cache
      lv_cache                              ONLINE       0     0     0

errors: No known data errors

ZFS config:

[root@clei22 ~]# zfs get all zclei22/01
NAME        PROPERTY              VALUE                  SOURCE
zclei22/01  type                  filesystem             -
zclei22/01  creation              Tue Feb 28 14:06 2017  -
zclei22/01  used                  389G                   -
zclei22/01  available             3.13T                  -
zclei22/01  referenced            389G                   -
zclei22/01  compressratio         1.01x                  -
zclei22/01  mounted               yes                    -
zclei22/01  quota                 none                   default
zclei22/01  reservation           none                   default
zclei22/01  recordsize            128K                   local
zclei22/01  mountpoint            /zclei22/01            default
zclei22/01  sharenfs              off                    default
zclei22/01  checksum              on                     default
zclei22/01  compression           off                    local
zclei22/01  atime                 on                     default
zclei22/01  devices               on                     default
zclei22/01  exec                  on                     default
zclei22/01  setuid                on                     default
zclei22/01  readonly              off                    default
zclei22/01  zoned                 off                    default
zclei22/01  snapdir               hidden                 default
zclei22/01  aclinherit            restricted             default
zclei22/01  canmount              on                     default
zclei22/01  xattr                 sa                     local
zclei22/01  copies                1                      default
zclei22/01  version               5                      -
zclei22/01  utf8only              off                    -
zclei22/01  normalization         none                   -
zclei22/01  casesensitivity       sensitive              -
zclei22/01  vscan                 off                    default
zclei22/01  nbmand                off                    default
zclei22/01  sharesmb              off                    default
zclei22/01  refquota              none                   default
zclei22/01  refreservation        none                   default
zclei22/01  primarycache          metadata               local
zclei22/01  secondarycache        metadata               local
zclei22/01  usedbysnapshots       0                      -
zclei22/01  usedbydataset         389G                   -
zclei22/01  usedbychildren        0                      -
zclei22/01  usedbyrefreservation  0                      -
zclei22/01  logbias               latency                default
zclei22/01  dedup                 off                    default
zclei22/01  mlslabel              none                   default
zclei22/01  sync                  disabled               local
zclei22/01  refcompressratio      1.01x                  -
zclei22/01  written               389G                   -
zclei22/01  logicalused           396G                   -
zclei22/01  logicalreferenced     396G                   -
zclei22/01  filesystem_limit      none                   default
zclei22/01  snapshot_limit        none                   default
zclei22/01  filesystem_count      none                   default
zclei22/01  snapshot_count        none                   default
zclei22/01  snapdev               hidden                 default
zclei22/01  acltype               off                    default
zclei22/01  context               none                   default
zclei22/01  fscontext             none                   default
zclei22/01  defcontext            none                   default
zclei22/01  rootcontext           none                   default
zclei22/01  relatime              off                    default
zclei22/01  redundant_metadata    all                    default
zclei22/01  overlay               off                    default


File content:
2x50GB VM disks, thin-provisioned
4x100GB VM disks, preallocated

Gluster Volume config:
[root@clei22 ~]# gluster volume info

Volume Name: GluReplica
Type: Replicate
Volume ID: ee686dfe-203a-4caa-a691-26353460cc48
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp,rdma
Bricks:
Brick1: 10.10.10.44:/zclei22/01/glu
Brick2: 10.10.10.42:/zclei21/01/glu
Brick3: 10.10.10.41:/zclei26/01/glu (arbiter)
Options Reconfigured:
network.ping-timeout: 30
performance.readdir-ahead: on
nfs.disable: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
features.shard: on
cluster.data-self-heal-algorithm: full
storage.owner-uid: 36
storage.owner-gid: 36
server.allow-insecure: on


GLUSTER primitive TESTS COLD:

time find /zclei22/01/glu/ -type d | wc -l
55473

real    5m58.248s
user    0m0.752s
sys    0m7.649s


[root@clei22 ~]# time find /zclei22/01/glu/ -type f | wc -l
215011

real    6m11.178s
user    0m0.873s
sys    0m9.385s

GLUSTER primitive TESTS WARM:
time find /zclei22/01/glu/ -type d | wc -l
55473

real    0m2.719s
user    0m0.400s
sys    0m2.323s

time find /zclei22/01/glu/ -type f | wc -l
215011

real    0m2.828s
user    0m0.478s
sys    0m2.376s

Adding InfiniBand:
yum install ibutils.x86_64 rdma.noarch infiniband-diags.x86_64
yum install -y libmlx4.x86_64

ibstat
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 1
    Firmware version: 2.7.700
    Hardware version: b0
    Node GUID: 0x002590ffff163758
    System image GUID: 0x002590ffff16375b
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 10
        Base lid: 273
        LMC: 0
        SM lid: 3
        Capability mask: 0x02590868
        Port GUID: 0x002590ffff163759
        Link layer: InfiniBand

Not bad for the old SDR switch! :-P
qperf clei22.vib ud_lat ud_bw
ud_lat:
    latency  =  23.6 us
ud_bw:
    send_bw  =  981 MB/sec
    recv_bw  =  980 MB/sec
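That qperf result is close to the theoretical limit: an SDR 4x link signals at 10Gbit/s, and 8b/10b encoding leaves 8Gbit/s, i.e. about 1000MB/s of payload, so 981MB/s is roughly 98% efficiency:

```shell
# SDR 4x: 10 Gbit/s signalling, 8b/10b coding -> ~1000 MB/s usable payload
awk 'BEGIN{printf "%.0f%%\n", 981/1000*100}'
# -> 98%
```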


Let us kill and restore the glusterfs:
1) pull out /dev/sda
2) format it to xfs
3) destroy the xfs again: zpool create ...etc...
4) recover the old glusterfs config:
[root@clei21 ~]# cat heal.sh
vol=GluReplica
brick=/zclei21/01/glu
setfattr -n trusted.glusterfs.volume-id \
  -v 0x$(grep volume-id /var/lib/glusterd/vols/$vol/info | cut -d= -f2 | sed 's/-//g') $brick
gluster volume heal GluReplica full
Terrible performance on ZFS + GlusterFS :(

Receiver Brick:
[root@clei21 ~]# arcstat.py 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c 
13:24:49     0     0      0     0    0     0    0     0    0   4.6G   31G 
13:24:50   154    80     51    80   51     0    0    80   51   4.6G   31G 
13:24:51   179    62     34    62   34     0    0    62   42   4.6G   31G 
13:24:52   148    68     45    68   45     0    0    68   45   4.6G   31G 
13:24:53   140    64     45    64   45     0    0    64   45   4.6G   31G 
13:24:54   124    48     38    48   38     0    0    48   38   4.6G   31G 
13:24:55   157    80     50    80   50     0    0    80   50   4.7G   31G 
13:24:56   202    68     33    68   33     0    0    68   41   4.7G   31G 
13:24:57   127    54     42    54   42     0    0    54   42   4.7G   31G 
13:24:58   126    50     39    50   39     0    0    50   39   4.7G   31G 
13:24:59   116    40     34    40   34     0    0    40   34   4.7G   31G 
 Sender Brick:
[root@clei22 ~]# arcstat.py 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c 
13:28:37     8     2     25     2   25     0    0     2   25   468M   31G 
13:28:38  1.2K   727     62   727   62     0    0   525   54   469M   31G 
13:28:39   815   508     62   508   62     0    0   376   55   469M   31G 
13:28:40   994   624     62   624   62     0    0   450   54   469M   31G 
13:28:41   783   456     58   456   58     0    0   338   50   470M   31G 
13:28:42   916   541     59   541   59     0    0   390   50   470M   31G 
13:28:43   768   437     56   437   57     0    0   313   48   471M   31G 
13:28:44   877   534     60   534   60     0    0   393   53   470M   31G 
13:28:45   957   630     65   630   65     0    0   450   57   470M   31G 
13:28:46   819   479     58   479   58     0    0   357   51   471M   31G 

Activating a volume group

Every time I need this command I forget the syntax!
vgchange -a y vg_cache

Thursday, 2 March 2017

Ansible error after upgrade to 2.2.x: how to fix.

The playbook throws an error:
TASK [create a users emails with hostname] ******************************
fatal: [hostname]: FAILED! => {"failed": true, "msg": "the field 'args' has an invalid value, which appears to include a variable that is undefined. The error was: 'ansible.vars.unsafe_proxy.AnsibleUnsafeText object' has no attribute 'username'\n\nThe error appears to have been in 'deploy_aliases.yml': line 21, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n  - name: create a users emails with host\n    ^ here\n"}

before:
  - name: create a users emails with host
    lineinfile:
       dest: aliases-nmail
       state: present
       regexp: '^{{item.username}}:'
       line: '{{item.username}}:                  {{item.email}}@host'
    with_items: users

It should now be:
  - name: create a users emails with host
    lineinfile:
       dest: aliases-nmail
       state: present
       regexp: '^{{item.username}}:'
       line: '{{item.username}}:                  {{item.email}}@host'
    with_items: "{{users}}"

Tuesday, 28 February 2017

Cleanup disk nicely

During diskless installations I often have trouble with leftover partitions and logical volumes.
Before installing a new OS, it is better to clean up the disk first:
scrub -p dod /dev/sdb
or, for large disks:
scrub -p dod /dev/sdb -s 1G
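If a full scrub is overkill and you only need to get rid of old partition tables and LVM/RAID signatures, wipefs (from util-linux) is a much faster alternative; a sketch, and destructive, so double-check the device name first:

```shell
# list detected signatures first (non-destructive)
wipefs /dev/sdb
# erase all filesystem, RAID and partition-table signatures
wipefs -a /dev/sdb
```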

Sunday, 26 February 2017

Impressive power boost by Moto Z power mod

The Moto Z is one of the best smartphones at the moment. It comes with Daydream support and many "Mods".
With 4K video streaming and several VR360 sessions it is still alive!!! 1 day 20 hours, wow!

Thursday, 23 February 2017

Smartctl info for nvme

nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 26 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 0%
data_units_read                     : 174,995
data_units_written                  : 283,289
host_read_commands                  : 2,327,186
host_write_commands                 : 842,433
controller_busy_time                : 2
power_cycles                        : 2
power_on_hours                      : 2
unsafe_shutdowns                    : 0
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 26 C
Temperature Sensor 2                : 47 C
Temperature Sensor 3                : 0 C
Temperature Sensor 4                : 0 C
Temperature Sensor 5                : 0 C
Temperature Sensor 6                : 0 C
Temperature Sensor 7                : 0 C
Temperature Sensor 8                : 0 C

Wednesday, 22 February 2017

SSD Samsung 850 EVO 256GB performance

An empty, new SSD, tested with CrystalDiskMark 3.0.2.




Thursday, 9 February 2017

Windows 7: PCI-e SSD vs striped WD 10K disks vs striped SSD


Intel PCI-E SSD P3600
-----------------------------------------------------------------------
CrystalDiskMark 3.0.1 x64 (C) 2007-2010 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

           Sequential Read :  1318.342 MB/s
          Sequential Write :  1215.438 MB/s
         Random Read 512KB :  1021.725 MB/s
        Random Write 512KB :  1231.051 MB/s
    Random Read 4KB (QD=1) :    37.135 MB/s [  9066.1 IOPS]
   Random Write 4KB (QD=1) :   297.593 MB/s [ 72654.6 IOPS]
   Random Read 4KB (QD=32) :   679.819 MB/s [165971.4 IOPS]
  Random Write 4KB (QD=32) :   525.391 MB/s [128269.4 IOPS]

  Test : 1000 MB [P: 0.0% (0.1/1117.8 GB)] (x5)
  Date : 2017/02/09 18:19:36
    OS : Windows 7 Ultimate Edition SP1 [6.1 Build 7601] (x64)
 
Striped 2x Crucial SSD disks
 

 -----------------------------------------------------------------------
CrystalDiskMark 3.0.1 x64 (C) 2007-2010 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

           Sequential Read :   706.492 MB/s
          Sequential Write :   374.625 MB/s
         Random Read 512KB :   657.431 MB/s
        Random Write 512KB :   146.501 MB/s
    Random Read 4KB (QD=1) :    25.487 MB/s [  6222.4 IOPS]
   Random Write 4KB (QD=1) :    66.833 MB/s [ 16316.8 IOPS]
   Random Read 4KB (QD=32) :   405.031 MB/s [ 98884.4 IOPS]
  Random Write 4KB (QD=32) :   209.108 MB/s [ 51051.7 IOPS]

  Test : 1000 MB [C: 93.5% (211.7/226.5 GB)] (x5)
  Date : 2017/02/09 18:13:42
    OS : Windows 7 Ultimate Edition SP1 [6.1 Build 7601] (x64)
 

Striped 2x WD VelociRaptor 10K disks
-----------------------------------------------------------------------
CrystalDiskMark 3.0.1 x64 (C) 2007-2010 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

           Sequential Read :   158.779 MB/s
          Sequential Write :   227.951 MB/s
         Random Read 512KB :    66.876 MB/s
        Random Write 512KB :   132.817 MB/s
    Random Read 4KB (QD=1) :     0.886 MB/s [   216.2 IOPS]
   Random Write 4KB (QD=1) :     3.226 MB/s [   787.5 IOPS]
   Random Read 4KB (QD=32) :     3.832 MB/s [   935.6 IOPS]
  Random Write 4KB (QD=32) :     4.467 MB/s [  1090.6 IOPS]

  Test : 1000 MB [E: 45.7% (485.0/1061.8 GB)] (x5)
  Date : 2017/02/09 17:57:49
    OS : Windows 7 Ultimate Edition SP1 [6.1 Build 7601] (x64)
 
 

 


Thursday, 2 February 2017

Lustre 2.9 on an NVMe PCIe SSD: Intel P3600 1.2TB

Ldiskfs on Intel SSD DC P3600 vs 8xSSD RAID10 with MR9271-8i
thrhi=16 dir_count=16 file_count=200000 mds-survey
Fri Feb  3 01:50:05 CET 2017 /usr/bin/mds-survey from c1701.cls
mdt 1 file  200000 dir   16 thr   16 create 329017.18 [ 329017.18, 329017.18] lookup 3167915.35 [ 3167915.35, 3167915.35] md_getattr 968143.25 [ 968143.25, 968143.25] setxattr 370109.92 [ 370109.92, 370109.92] destroy 235829.03 [ 235829.03, 235829.03]
done!

thrhi=16 dir_count=16 file_count=200000 mds-survey
Sun Jun  8 15:20:30 CEST 2014 /usr/bin/mds-survey from newmds01
mdt 1 file  200000 dir   16 thr   16 create 144417.16 [ 144417.16, 144417.16] lookup 2538618.74 [ 2538618.74, 2538618.74] md_getattr 1210573.15 [ 1210573.15, 1210573.15] setxattr 54542.92 [ 54994.01, 56996.07] destroy 89073.61 [ 76997.69, 76997.69]
done!

Interesting: almost 2x faster than 8 SSDs (RAID10) in the create operation. Actually, create on RAID10 is the same as create on 4 SSDs, so one can say 1 NVMe is 2 times faster than 4 SSDs: nice!

The expensive setxattr is almost 7 times faster!!!
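The ratios behind those claims can be checked with a quick awk one-liner over the two mds-survey runs (NVMe rates divided by the 8xSSD RAID10 rates):

```shell
# create-rate and setxattr-rate ratios: NVMe run vs RAID10 run
awk 'BEGIN{printf "%.1f %.1f\n", 329017.18/144417.16, 370109.92/54542.92}'
# -> 2.3 6.8
```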

Formatting the Intel PCIe SSD to a 4K sector size

isdct show -intelssd 

- Intel SSD DC P3600 Series CVMD550300L81P2CGN -

Bootloader : 8B1B0133
DevicePath : /dev/nvme0n1
DeviceStatus : Healthy
Firmware : 8DV101F0
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
Index : 0
ModelNumber : INTEL SSDPEDME012T4
ProductFamily : Intel SSD DC P3600 Series
SerialNumber : CVMD550300L81P2CGN
Change the sector size from 512 (LBAFormat=0) to 4096 (LBAFormat=3):

isdct start -intelssd 0 -nvmeformat LBAFormat=3
Warning: you will lose all data.

Reset the device without rebooting:

echo 1 >  /sys/class/nvme/nvme0/device/reset

Nice, we now have a 4K sector size:

nvme list
Node             SN                   Model                                    Version  Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- -------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     CVMD550300L81P2CGN   INTEL SSDPEDME012T4                      1.0      1           1.20  TB /   1.20  TB      4 KiB +  0 B   8DV101F0

Intel NVMe performance on CentOS 7.3

[root@c1701 ~]# hdparm -tT --direct /dev/nvme0n1

/dev/nvme0n1:
 Timing O_DIRECT cached reads:   2512 MB in  2.00 seconds = 1255.72 MB/sec
 Timing O_DIRECT disk reads: 3604 MB in  3.00 seconds = 1200.83 MB/sec

[root@c1701 ~]# hdparm -tT --direct /dev/nvme0n1

/dev/nvme0n1:
 Timing O_DIRECT cached reads:   2398 MB in  2.00 seconds = 1198.11 MB/sec
 Timing O_DIRECT disk reads: 3720 MB in  3.00 seconds = 1239.86 MB/sec

Lustre 2.9: migrate data off an OST

One of the OSTs is full and we got a new OST,
so let us migrate:

lfs find /archive  -obd arch-OST0001_UUID -type f -size +1M | lfs_migrate -y
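Around the migration it can help to watch per-OST usage; a sketch using standard lfs subcommands (the file path is a hypothetical example):

```shell
# check per-OST capacity before and after the migration
lfs df -h /archive
# after migrating, verify that a given file now lives on a different OST
lfs getstripe /archive/path/to/some/large/file
```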

Monday, 30 January 2017

Fun with RobinHood 3, Lustre 2.9 and a PCIe SSD: /dev/nvme0n1

------------------- TEST PCI-e SSD XFS---------------
MySQL Data on nvram

time rbh-du -d -H  -f /etc/robinhood.d/tmpfs/lnew.conf /lnew/*

real    6m32.658s
user    0m18.175s
sys     0m9.272s

We just ran it a second time to be sure the numbers are OK:

real    6m22.549s
user    0m24.533s
sys     0m10.950s

------------------- TEST PCI-E SSD ext4 -------------

real    5m46.431s
user    0m16.900s
sys     0m9.288s

real    5m45.627s
user    0m15.509s
sys     0m8.985s


------------ TEST ZFS  128k --------------------
MySQL data on zpool RAIDZ0, 8x3TB HGST disks

time rbh-du -d -H  -f /etc/robinhood.d/tmpfs/lnew.conf /lnew/*

real    13m34.484s
user    0m17.150s
sys     0m9.125s
------------------- Same ZFS but with 8K blocksize ............
real    15m20.003s
user    0m16.427s
sys    0m8.811s

Interestingly, the 8K myth does not help performance.
What about 64K? It is the same as the 128K recordsize.

--------------- Lustre 2.9  mds-----------------
Directly on lustrefs

time du -bsh /lnew/*

real    36m12.990s
user    0m24.658s
sys     8m57.458s



Interesting results for InnoDB speed on the different file systems (ext4).


Saturday, 28 January 2017

PCI/SSD/NVM performance challenge begins! - PART II

Recently we got a test server with a PCIe SSD from Intel: a P3600 1.2TB.
The operating system sees it as: INTEL SSDPEDME012T4
[root@c1701 vector]# isdct show -nvmelog smarthealthinfo

- SMART and Health Information CVMD550300L81P2CGN -

Available Spare Normalized percentage of the remaining spare capacity available : 100
Available Spare Threshold Percentage : 10
Available Spare Space has fallen below the threshold : False
Controller Busy Time : 0x19
Critical Warnings : 0
Data Units Read : 0x380EC1
Data Units Written : 0xC0C8D4
Host Read Commands : 0x031AA3A3
Host Write Commands : 0x08B44DE8
Media Errors : 0x0
Number of Error Info Log Entries : 0x0
Percentage Used : 0
Power Cycles : 0x10
Power On Hours : 0x01FD
Media is in a read-only mode : False
Device reliability has degraded : False
Temperature - Celsius : 35
Temperature has exceeded a critical threshold : False
Unsafe Shutdowns : 0x08
Volatile memory backup device has failed : False


Now the question arises: do we need a PCIe NVMe SSD at all? How can we use it to boost any of our daily-used systems?
The bad thing about PCIe is serviceability: if it starts to fail, one needs to power off the box to replace it. For time-critical systems that is unacceptable. One could build some kind of HA pairs of boxes, but any HA layer adds its own failure probability and certainly drags down your hardware performance.

As a first quick test I started checking the Robinhood scans of the Lustre FS.
This process runs as a daily cron job on a dedicated client.

Here are the benchmarks with recent "Lustre client" hardware:

Aim: keep the lustrefs file tables in MySQL InnoDB.

===========================================
Robinhood 3 with CentOS 7.3 and a Lustre 2.9 client.
Lustre server 2.9 on CentOS 6.8; 4x4 mirrored Intel SSDs on the MDS;
3 JBODs, 18 OSTs (450-600MB/s per OST), 3x MR9271-8i per OST, 24 disks per OST; interconnect: FDR 56Gbit;
MPI tests show up to >3.5GB/s random IO.
Currently used: 129TB out of 200TB, no striped datasets.
Data distribution histogram: file count / size

[root@c1701 ~]# time lfs find /llust  -type f | wc -l
23935507

real    11m13.261s
user    0m12.464s
sys     2m15.665s

[root@c1701 ~]# time lfs find /llust  -type d | wc -l
776191

real    6m40.706s
user    0m3.500s
sys     0m56.831s

[root@c1701 ~]# time lfs find /llust   | wc -l
24888641

real    3m52.773s
user    0m10.867s
sys     1m4.731s


The bottleneck is probably the MDS, but it is fast enough to feed our InnoDB with file stats and locations.

===========================================
The final Robinhood InnoDB database size is about ~100GB.
Engine: InnoDB from Oracle MySQL, mysql-5.7.15-linux-glibc2.5-x86_64. (Why not MariaDB? There are some deviations with MariaDB; it is a little slower than Oracle MySQL, +/- 5-10%. For the final conclusions this is not so relevant.)

my.cnf
innodb_buffer_pool_size=40G
tmp_table_size=512M
max_heap_table_size=512M
innodb_buffer_pool_instances=64
innodb_flush_log_at_trx_commit=2

==================================
We set up a RAIDZ2 of 8x3TB HGST HUS724030ALA640 with an LSI-9300-8i in IT mode.

FS setup for the tests:
  1. zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh -f
  2. zfs set compression=lz4 tank
  3. zpool create nvtank /dev/nvme0n1
  4. zfs set compression=lz4 nvtank
  5. mkfs.xfs -Lnvtank  /dev/nvme0n1 -f

Results from mytop, queries per second (qps):
  1. 8xRAIDZ2 compression none qps: ~98 burst up to 300
  2. 8xRAIDZ2 compression lz4 qps: ~107 burst up to 480
  3. nvmeZFS compression none qps: 5000 burst up to 12000
  4. nvme XFS qps: ~17000 burst up to 18000
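The spread between the slowest and fastest setup above is striking; dividing the XFS-on-NVMe rate by the lz4-compressed spinning-disk RAIDZ2 rate:

```shell
# XFS on NVMe vs lz4 RAIDZ2 on spinning disks, sustained qps
awk 'BEGIN{printf "%.0fx\n", 17000/107}'
# -> 159x
```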

The first results show that XFS wins against ZFS. The latency of the PCIe device is impressive, but once you add any software RAID on top, like ZFS with compression, it becomes slow, though not as slow as 8 spinning disks.



PCI/SSD/NVM performance challenge begins! PART-I (This is not a regular post; it is more or less "thinking loudly". Or blogly?)

Assume we are trying to build a system with maximum bandwidth for MySQL applications.
Our databases are not optimized for normal usage, so we have lots of full table scans. Some of the tables are really huge, multi-terabyte monster databases.

In sharded mode we see a ~600MB/s load on a single shard. The single node has 8 local 4TB SATA HGST enterprise disks in RAID6 on an LSI-9271 (BBU included), with 65GB of host RAM.

Let us begin with some optimizations on the single node.
On the single node we have the following options:
1) Boost the RAM? Actually it makes no sense if we don't benefit from the MySQL cache.
2) Speed up the CPU? Currently most of the load is IO-bound, not calculations in MySQL.
3) Speed up RAID/disk IO? Looks good, let's start here.

Theory

A single spinning disk can deliver about 100-120MB/s in streaming mode.

PCIe 3 total bandwidth (x16 link): 32GB/s (8.0GT/s, or ~1000MB/s per lane).

A single DB file test on XFS shows only about 563MB/s:

dd if=/store/TestDatabase.MYD bs=4k | /tmp/pv > /dev/null
 205GB 0:06:13 [ 562MB/s]
53674008+1 records in
53674008+1 records out
219848739110 bytes (220 GB) copied, 373.182 s, 589 MB/s
For full table scan queries, ~373s is quite a big performance penalty.
This is probably due to the RAID-6 with 8 disks: it cannot be faster than (8-2)*100MB/s ~ 600MB/s.
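That RAID-6 estimate is plain arithmetic: 8 disks minus 2 parity leaves 6 data spindles at ~100MB/s each:

```shell
# 8 disks, 2 parity, ~100 MB/s per spindle
echo "$(( (8-2) * 100 )) MB/s"
# -> 600 MB/s
```

The measured 563-589MB/s is right at that ceiling, which supports the theory that the spindles, not the controller, are the limit.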

Also, we are not benefiting from the controller's CacheVault or the single-disk cache (64MB).
Note that the files are not fragmented:
 filefrag /store/TestDatabase.MYD
/store/TestDatabase.MYD: 1 extent found

Full table scans are evil :)

What about boosting the FS for a speedup?

XFS vs ZFS, or maybe an NVMe cache?