mercredi 21 décembre 2011

Fine tuning read_rnd_buffer_size and read_buffer_size

Those variables are easy to tune.

The evidence here points to values between 128K and 256K for in-memory workloads.

The MySQL SQL layer is optimistic and assumes that data will come from the storage engine at the speed of the memory bus, but in practice, for an IO-bound workload with range scans in table order, this is not always the case. For MyISAM, when reading tables from disk, IOs are aligned on the read buffer, so it may be of interest to increase it here.

In myisam.cc, my_default_record_cache_size is set to global_system_variables.read_buff_size.

To show the IO size:
strace -p id_mysqld -e read -e write

If you observe smaller IO request sizes in iostat -xm, you may be suffering from fragmentation, which prevents the FS cache from serving you efficiently. OPTIMIZE TABLE or ALTER TABLE ... ORDER BY is a great help for MyISAM here, as sketched below.
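A minimal sketch of both options, assuming a hypothetical MyISAM table hits whose range scans follow a created column (table and column names are illustrative):

-- Rebuild the table to defragment the data file and rebuild the indexes
OPTIMIZE TABLE hits;
-- Or physically re-order the rows so that range scans in created order
-- become sequential reads on disk
ALTER TABLE hits ORDER BY created;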

In InnoDB the difference comes from the read-ahead (prefetch) that is triggered when there is no fragmentation, which is the case when rows are inserted in increasing primary key order. So conservative read and rnd buffer values will speed up moving data from the buffer pool to the SQL thread, with almost no impact on the disk transfer rate.

Note that read_rnd_buffer_size sizes the per-thread record cache used by rr_sequential and rr_quick, i.e. for full scans and some range scans.

In records.cc:

static int init_rr_cache(THD *thd, READ_RECORD *info)
 info->cache_records= (thd->variables.read_rnd_buff_size /
                        (info->reclength+info->struct_length));
 
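As a rough sketch of the conservative tuning suggested above, per-session values in the 128K-256K range can be set just for a heavy scan (the exact values are illustrative, not a recommendation):

SET SESSION read_buffer_size     = 256*1024;  -- sequential scan buffer (MyISAM record cache)
SET SESSION read_rnd_buffer_size = 256*1024;  -- record cache for rr_sequential / rr_quick reads
-- run the big table scan or ORDER BY query here; other sessions keep the global defaults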

lundi 19 décembre 2011

SUM for the 5 best models in 5 best brands

As we can see in the solution:

Expressions are evaluated in column order in the SELECT clause.

The same holds for the WHERE clause. A good practice is to put the expression that is most likely to return false at the head of the WHERE conditions:



SET @BRAND=0;
SET @ct=0;
SELECT
 *
FROM
 (SELECT
   IF(@BRAND<>id_BRAND,@ct:=0 ,@ct:=@ct ) ,
   @ct:=@ct+1 AS ct,
   @BRAND:=id_BRAND,
   t2.*
  FROM
   (SELECT
     st.ID_BRAND,
     st.ID_MODEL,
     SUM(COMPTE) AS total,
     AVG(totalm)
    FROM
     STAT st JOIN
     (SELECT
       ID_BRAND,
       SUM(COMPTE) AS totalm
      FROM
       STAT
      GROUP BY
       ID_BRAND
      ORDER BY
       totalm DESC
      LIMIT 5
     ) as t USING( ID_BRAND)
    GROUP BY
     st.ID_BRAND,
     st.ID_MODEL
    ORDER BY
     st.ID_BRAND,
     total DESC
   ) AS t2
 ) AS t3
WHERE
 ct <=5

jeudi 15 décembre 2011

Back on disk temporary tables

We have already blogged a lot about on-disk temporary tables, how bad they can be for your workload and why they should be avoided. Each temporary table opens a file descriptor, makes external kernel calls and is, by nature, a well-known slow file system operation.

We can point to this benchmark simulating a working mail server.


Such queries can be monitored via the status counter:

Created_tmp_disk_tables
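For instance, a quick check of the counter together with its in-memory counterpart:

SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';
-- Created_tmp_disk_tables : implicit temporary tables that went to disk
-- Created_tmp_tables      : all implicit temporary tables created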

Evidence of what is happening: watching the many Aria or MyISAM tables created on disk:

/usr/sbin/lsof  | grep "mysql" | grep "#"
mysqld     1855     mysql 1658u      REG                8,1       8192    1505932 /tmp/#sql_73f_1.MAI
mysqld     1855     mysql 1659u      REG                8,1       8192    1505936 /tmp/#sql_73f_1.MAD
/usr/sbin/lsof  | grep "mysql" | grep "#"
mysqld     1855     mysql 1650u      REG                8,1       8192    1505946 /tmp/#sql_73f_1.MAI
mysqld     1855     mysql 1654u      REG                8,1       8192    1505951 /tmp/#sql_73f_1.MAD

Now, in MariaDB and Percona Server, you can find which queries are causing you trouble with the slow query log:

set global log_slow_verbosity=2; 

And you can retrieve this information in the slow log:

# User@Host: root[root] @ localhost []
# Thread_id: 93  Schema: test  QC_hit: No
# Query_time: 0.00157  Lock_time: 0.000034  Rows_sent: 4  Rows_examined: 12
# Full_scan: No  Full_join: No  Tmp_table: Yes  Tmp_table_on_disk: Yes
# Filesort: Yes  Filesort_on_disk: No  Merge_passes: 0
SET timestamp=1323937380;
select * from (select user_id,created,  (select user_email from  login2 b where b.user_id=a.user_id ) from login2 a where a.user_id IN (100005, 100002, 100004, 100003)) as a order by created;

Solutions to fix the issue can be (see the sketch after this list):
  • Add an index to cover the resultset order 
  • Increasing tmp_table_size 
  • Increasing max_heap_table_size 
  • Removing blobs from your queries if you need to sort them in memory 
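A hedged sketch for the two sizing bullets: the in-memory ceiling for an implicit temporary table is the smaller of tmp_table_size and max_heap_table_size, so both are usually raised together (32M here is purely illustrative):

SET GLOBAL tmp_table_size      = 32*1024*1024;
SET GLOBAL max_heap_table_size = 32*1024*1024;
-- the effective limit is MIN(tmp_table_size, max_heap_table_size);
-- new connections pick up the GLOBAL values, use SET SESSION to test right away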
But sometimes this is not enough, so let's take a deeper look:

select * from (
  select 
    user_id,
    created,  
    (select user_email from  login2 b where b.user_id=a.user_id ) as mail  
  from 
    login2 a 
  where 
    a.user_id IN (100005, 100002, 100004, 100003)
) as a order by created;
+---------+---------------------+---------------------+
| user_id | created             | mail                |
+---------+---------------------+---------------------+
|  100002 | 2011-01-11 17:16:45 | stephane@skysql.com |
|  100003 | 2011-01-11 17:16:45 | stephane@skysql.com |
|  100004 | 2011-01-11 17:16:46 | stephane@skysql.com |
|  100005 | 2011-01-11 17:16:46 | stephane@skysql.com |
+---------+---------------------+---------------------+



And here is the table definition:

CREATE TABLE `login2` (
  `user_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `user_name` varchar(50) DEFAULT NULL,
  `user_email` varchar(2000) CHARACTER SET utf8 DEFAULT NULL,
  `created` datetime DEFAULT NULL,
  PRIMARY KEY (`user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=102048 DEFAULT CHARSET=latin1;


What you can note here is :

`user_email` varchar(2000) CHARACTER SET utf8 DEFAULT NULL

The mail column is embedded in the derived query and sorted later in the outer query.

Well, it took Monty only a few seconds to point out to me that this comes from a performance mechanism hardcoded as a constant in the source code:

CONVERT_IF_BIGGER_TO_BLOB

This is set to 512 bytes, the reasoning being that in most cases it is less efficient to start sorting such wide values in memory and convert the temporary table to disk later.

It is used like this:

sql/item.cc:  If max_length > CONVERT_IF_BIGGER_TO_BLOB create a blob @n
sql/item.cc:  if (max_length/collation.collation->mbmaxlen > CONVERT_IF_BIGGER_TO_BLOB)
sql/item.cc:    if (fixed_length && max_length < CONVERT_IF_BIGGER_TO_BLOB)
sql/item_sum.h:    if (max_length/collation.collation->mbmaxlen > CONVERT_IF_BIGGER_TO_BLOB )
sql/sql_select.cc:      if ((*tmp->item)->max_length >= CONVERT_IF_BIGGER_TO_BLOB)
sql/unireg.h:#define CONVERT_IF_BIGGER_TO_BLOB 512   


So the maximum VARCHAR length that does not trigger this case is 512:

Alter table login2  modify column  user_email varchar(512) charset utf8;
# Full_scan: Yes  Full_join: No  Tmp_table: Yes  Tmp_table_on_disk: No

Alter table login2  modify column  user_email varchar(513) charset utf8; 
# Full_scan: Yes  Full_join: No  Tmp_table: Yes  Tmp_table_on_disk: Yes

Now you have a solution to fix this in MariaDB 5.3, thanks to the work done on the optimizer.

Just set:

set global optimizer_switch='derived_merge=on'
# Full_scan: Yes  Full_join: No  Tmp_table: Yes  Tmp_table_on_disk: No

Kudos to Sergei Petrunia. But wait, that is not all:

derived_merge=off
 
mysqlslap --create-schema=test -q testPV2.sql -c2 --number-of-queries=100000 -uroot -ptoto
Benchmark
    Average number of seconds to run all queries: 137.111 seconds
    Minimum number of seconds to run all queries: 137.111 seconds
    Maximum number of seconds to run all queries: 137.111 seconds
    Number of clients running queries: 2
    Average number of queries per client: 50000


derived_merge=on
 
mysqlslap --create-schema=test -q testPV2.sql -c2 --number-of-queries=100000 -uroot -ptoto
Benchmark
    Average number of seconds to run all queries: 14.540 seconds
    Minimum number of seconds to run all queries: 14.540 seconds
    Maximum number of seconds to run all queries: 14.540 seconds
    Number of clients running queries: 2
    Average number of queries per client: 50000
Moreover, there is a bonus here: the optimizer takes fewer CPU cycles compared to vanilla Oracle MySQL (ma = MariaDB, my = MySQL):

+----------------------+----------+----------+-----------+
| c2                   | ma       | my       | diff      |
+----------------------+----------+----------+-----------+
|  cleaning up         | 0.000004 | 0.000004 |  0.000000 |
|  closing tables      | 0.000008 | 0.000028 | -0.000020 |
|  executing           | 0.000034 | 0.000036 | -0.000002 |
|  freeing items       | 0.000030 | 0.000030 |  0.000000 |
|  init                | 0.000118 | 0.000019 |  0.000099 |
|  logging slow query  | 0.000003 | 0.000003 |  0.000000 |
|  Opening tables      | 0.000049 | 0.000154 | -0.000105 |
|  optimizing          | 0.000061 | 0.000042 |  0.000019 |
|  preparing           | 0.000085 | 0.000050 |  0.000035 |
|  query end           | 0.000003 | 0.000005 | -0.000002 |
|  removing tmp table  | 0.000081 | 0.000058 |  0.000023 |
|  Sending data        | 0.000681 | 0.000609 |  0.000072 |
|  Sorting result      | 0.000264 | 0.000031 |  0.000233 |
|  starting            | 0.000165 | 0.000148 |  0.000017 |
|  statistics          | 0.000365 | 0.000794 | -0.000429 |
|  System lock         | 0.000014 | 0.000008 |  0.000006 |
|  Table lock          | 0.000015 | 0.000308 | -0.000293 |
+----------------------+----------+----------+-----------+

vendredi 18 novembre 2011

Alter table engine TokuDB

At the MUG, some questions were raised about TokuDB; I hope this post will answer them in more depth.

InnoDB, like any other general-purpose RDBMS, can be really bad for an online back office with big tables, despite following the usual practices and an OLTP-normalized schema:
  • Having indexes in memory
  • Touching a minimum set of rows for secondary index scan
  • Scanning some range in primary key order      

This happens frequently when:
  • The working dataset grows to 2 to 100 times the size of the hardware memory
  • Big tables are joined, jumping from secondary indexes to the clustered table
  • Concurrency increases on big tables, with range scans starting from a random point in the PRIMARY KEY

It can be highlighted by watching random disk IO climb to 100% disk utilization:


iostat -xm 2
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00  305.00  101.00     4.68     0.28    25.04     1.01    2.48   2.41  97.65
sdb               0.00     0.00  285.50  110.50     4.37     0.34    24.36     1.02    2.59   2.47  98.00
sdb               0.00     0.00  361.50  158.50     5.59     2.73    32.76     1.18    2.27   1.87  97.00

iostat
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdb            1215.50     37040.00       335.00      74080        670



Today we protect the InnoDB buffer pool by constraining most range queries with a minimum date of today minus a 2-month interval, but this is not always possible and it limits features.

Benchmark


IBM 5030 M3, 16 Nehalem cores, 32G RAM, XFS, RAID10 with 4 SAS 15K rpm disks

Schema extraction for the bench:

                          InnoDB    TokuDB
Total tablespace size     249G      155G
Index size                31G       192G
Buffer pool               24G       14G


Every primary key is composed of a timestamp and a BIGINT UUID.

There are multiple big tables, with many rows and a big average record size:

MariaDB> select table_name, table_rows/1000000 from information_schema.tables order by table_rows desc limit 20;
+--------------------+--------------------+
| table_name         |Millon rows         |
+--------------------+--------------------+
| xxx2               |            52.8748 |
| xxx1               |            44.5123 |
| xxx3l              |            31.6692 |
| xxx4               |            30.0430 |
 


MariaDB> select table_name,avg_row_length from information_schema.tables where table_name in('xxx1','xxx2','xxx3l','xxx4');
+-------------+----------------+
| table_name  | avg_row_length |
+-------------+----------------+
| xxx2        |            237 |
| xxx3l       |            610 |
| xxx4        |             87 |
| xxx1        |            248 |

Index and data usage for the slow queries  

Let's run a test of 100 queries and see what is happening inside the storage engine:


MariaDB> show table_statistics;
| Table_name          | Rows_read |
| xxx2                |    750284 |
| xxx1                |   3118583 |


MariaDB> show index_statistics;
|Table_name     | Index_name                  | Rows_read   |
| xxx2          | apnum_idx                   |      64 475 |
| xxx1          | trsitemachdateheure_idx     |   3 023 319 |
| xxx2          | apdatesitemach_idx          |     456 567 |
| xxx1          | PRIMARY                     |      93 381 |
| xxx2          | PRIMARY                     |     228 113 |
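These counters come from the user statistics feature; a minimal sketch of collecting them for the run, assuming MariaDB's userstat variable (Percona Server exposes the same tables, under slightly different option names in some versions):

SET GLOBAL userstat = 1;        -- start collecting table and index statistics
FLUSH TABLE_STATISTICS;         -- reset the counters before the run
FLUSH INDEX_STATISTICS;
-- ... replay the 100 slow queries, then:
SHOW TABLE_STATISTICS;
SHOW INDEX_STATISTICS;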


From here we can now estimate that every 100 heavy queries touch:

  • 500M of row data ((750284*237) + (3118583*87))/1024/1024
  • 47M of index data ((64475+3023319+456567)*12 + (93381+228113)*22)/1024/1024
  • 90% of secondary index key reads in xxx1, a 21GB table on the bench and 100G in production
  • 10% of secondary index key reads in xxx2, a 22GB table on the bench and 100G in production
We estimate 50% of the data in memory on the bench (44G of data vs a 24G buffer pool).
We estimate 10% of the data in memory in production (200G of data vs a 24G buffer pool).

BENCHMARK

We collect 1000 queries from the slow query log, each running between 10s and 30s on InnoDB, and replace the 2-month interval with 9 months.

Key reads/s (handler_read_next range) and run time, InnoDB vs TokuDB:

Full memory workload, 100 slow queries, concurrency 1
  InnoDB : Flat status, 1M < handler_read_next < 1.2M, 12s : 10 tps
  TokuDB : Flat status, 1M < handler_read_next < 1.2M, 9s : 10 tps

Full memory workload, 100 slow queries, concurrency 4
  InnoDB : Flat status, 1M < handler_read_next < 3M, 4s : 40 tps
  TokuDB : Negative scalability, 400K < handler_read_next < 1M, 14s : 7.4 tps

50% memory workload, 1000 slow queries, concurrency 1
  InnoDB : Erratic status, 1K < handler_read_next < 2M, 3730s : 0.26 tps
  TokuDB : Flat status, 200K < handler_read_next < 400K, 854s : 1.2 tps

10% memory workload, 1000 slow queries, concurrency 1, buffer_pool 5G
  InnoDB : still waiting
  TokuDB : Flat status, 150K < handler_read_next < 280K, 1262s : 0.8 tps



Query time in seconds   InnoDB   TokuDB
Q1                        9.05     0.22
Q2                       17.56     0.3
Q3                        5.95     0.29
Q4                       19        0.29
Q5                        0.24     0.01
Q6                        9.7      0.09
Q7                        2.01     0.23



InnoDB stays the reference for in-memory concurrent workloads, which was expected as TokuDB is the challenger, but we can already state that for any IO-bound workload TokuDB will make your day.

There are many ways to fix those bad queries (a denormalisation project, for example), but nothing is simpler than ALTER TABLE ... ENGINE=TokuDB on a slave; and don't forget to rebuild the secondary indexes using the clustering index syntax, as sketched below.
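A minimal sketch of that conversion, assuming the xxx1 table from the bench and illustrative column names for its secondary index; the CLUSTERING keyword is TokuDB-specific and stores the whole row in the secondary index:

ALTER TABLE xxx1 ENGINE=TokuDB;   -- convert the table on the slave
ALTER TABLE xxx1
  DROP INDEX trsitemachdateheure_idx,
  ADD CLUSTERING INDEX trsitemachdateheure_idx (id_site, id_machine, date_heure);
-- range scans on the clustering index no longer jump back to the primary key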



     

lundi 9 mai 2011

Install and forget a story of stability...

Are you rebooting your MySQL server at night?

I'd like to share input sent to me by a leading MySQL provider, mostly deploying e-commerce and telco applications on Linux.

One instance has been up for 1414 days. Not bad...

 
18:13:02 up 355 days,  4:50,  0 users,  load average: 0.00, 0.00, 0.00
18:13:03 up 594 days,  5:04,  0 users,  load average: 1.20, 1.25, 1.26
18:13:03 up 48 days, 22:10,  0 users,  load average: 0.48, 0.38, 0.29
18:13:03 up 936 days,  7:50,  0 users,  load average: 0.00, 0.00, 0.00
18:13:03 up 173 days,  2:39,  0 users,  load average: 1.90, 1.79, 1.79
18:13:04 up 244 days,  3:25,  0 users,  load average: 3.96, 3.66, 3.90
18:13:04 up 643 days, 15:50,  0 users,  load average: 0.09, 0.08, 0.07
18:13:04 up 642 days,  5:15,  0 users,  load average: 0.06, 0.02, 0.00
18:13:05 up 311 days,  1:57,  1 user,  load average: 0.01, 0.03, 0.00
18:13:05 up 329 days,  4:43,  0 users,  load average: 1.07, 0.55, 0.34
18:13:05 up 95 days, 11:34,  0 users,  load average: 2.27, 1.74, 1.64
18:13:06 up 436 days, 23:42,  0 users,  load average: 0.09, 0.10, 0.13
18:13:06 up 188 days,  2:35,  0 users,  load average: 0.10, 0.19, 0.14
18:13:12 up 347 days, 12:57,  0 users,  load average: 0.08, 0.16, 0.22
18:13:12 up 81 days, 11:32,  0 users,  load average: 0.04, 0.19, 0.28
18:13:12 up 81 days, 11:24,  0 users,  load average: 0.33, 0.40, 0.43
18:13:12 up 81 days, 11:13,  0 users,  load average: 0.28, 0.35, 0.36
18:07:39 up 21 days,  6:40,  0 users,  load average: 0.12, 0.14, 0.10
18:13:13 up 643 days, 16:11,  0 users,  load average: 0.36, 0.19, 0.11
18:13:13 up 643 days, 16:10,  0 users,  load average: 0.09, 0.07, 0.02
18:13:13 up 139 days, 35 min,  0 users,  load average: 0.38, 0.35, 0.29
18:13:14 up 179 days, 11:41,  0 users,  load average: 0.12, 0.07, 0.01
18:13:14 up 634 days,  4:40,  0 users,  load average: 0.00, 0.00, 0.00
18:13:14 up 343 days,  3:33,  0 users,  load average: 0.02, 0.05, 0.00
18:13:14 up 441 days,  2:37,  0 users,  load average: 0.13, 0.31, 0.31
18:13:19 up 160 days,  8:19,  0 users,  load average: 0.01, 0.05, 0.01
18:13:20 up 83 days, 22:34,  0 users,  load average: 0.07, 0.10, 0.09
18:13:25 up 602 days,  1:50,  0 users,  load average: 0.01, 0.03, 0.00
18:13:25 up 1414 days,  6:07,  0 users,  load average: 0.00, 0.03, 0.00
18:13:26 up 7 days,  6:32,  0 users,  load average: 0.11, 0.17, 0.20
18:13:26 up 17 days,  2:38,  0 users,  load average: 0.61, 0.49, 0.40
18:13:26 up 17 days,  2:42,  0 users,  load average: 0.10, 0.31, 0.35
18:13:32 up 284 days,  5:03,  0 users,  load average: 1.12, 0.78, 0.67
18:13:32 up 599 days, 40 min,  0 users,  load average: 0.00, 0.00, 0.00
18:13:32 up 320 days,  7:44,  0 users,  load average: 0.06, 0.07, 0.02
18:13:33 up 28 days,  7:00,  3 users,  load average: 0.04, 0.05, 0.01
18:13:33 up 12 days,  8:52,  2 users,  load average: 0.31, 0.07, 0.02
18:13:33 up 68 days,  8:33,  0 users,  load average: 0.00, 0.02, 0.00
18:13:33 up 265 days,  6:36,  0 users,  load average: 0.00, 0.00, 0.00
18:13:34 up 515 days,  3:03,  0 users,  load average: 0.00, 0.00, 0.00
18:13:34 up 97 days, 22:44,  0 users,  load average: 0.06, 0.10, 0.09
18:13:34 up 118 days,  3:38,  1 user,  load average: 0.02, 0.02, 0.00
18:13:35 up 12 days, 17:07,  0 users,  load average: 0.01, 0.07, 0.04
18:13:35 up 18 days,  1:25,  0 users,  load average: 0.39, 0.37, 0.30
18:13:35 up 97 days,  6:02,  0 users,  load average: 0.00, 0.26, 0.31
18:13:36 up 410 days,  7:48,  0 users,  load average: 0.00, 0.00, 0.00
18:13:36 up 538 days, 11:38,  0 users,  load average: 0.01, 0.01, 0.00
18:13:36 up 207 days,  6:56,  0 users,  load average: 0.49, 0.52, 0.54
18:13:37 up 332 days,  8:49,  0 users,  load average: 0.01, 0.02, 0.00
18:13:37 up 263 days,  3:25,  0 users,  load average: 0.01, 0.02, 0.00
18:13:38 up 469 days, 23:05,  0 users,  load average: 0.02, 0.01, 0.00
18:13:38 up 927 days,  2:51,  0 users,  load average: 0.00, 0.01, 0.00
18:13:39 up 350 days, 17:01,  0 users,  load average: 0.01, 0.01, 0.00
18:13:39 up 782 days,  5:27,  0 users,  load average: 0.00, 0.00, 0.00
18:13:40 up 329 days,  3:50,  0 users,  load average: 0.00, 0.00, 0.00
18:13:40 up 192 days,  1:28,  0 users,  load average: 0.44, 0.36, 0.28
18:13:40 up 203 days,  4:04,  0 users,  load average: 0.10, 0.16, 0.11
18:13:41 up 200 days,  3:14,  0 users,  load average: 0.07, 0.06, 0.01
18:13:41 up 200 days,  3:14,  0 users,  load average: 0.13, 0.14, 0.16
18:13:41 up 200 days,  3:14,  0 users,  load average: 0.06, 0.08, 0.08
18:13:42 up 104 days,  6:52,  0 users,  load average: 0.02, 0.01, 0.00
18:13:42 up 752 days,  3:37,  0 users,  load average: 0.00, 0.00, 0.00
18:20:04 up 94 days,  5:56,  0 users,  load average: 0.06, 0.03, 0.01
18:18:43 up 94 days,  5:54,  0 users,  load average: 0.00, 0.00, 0.00
18:20:40 up 11 days,  7:01,  0 users,  load average: 0.04, 0.10, 0.04
18:13:43 up 26 days,  8:50,  0 users,  load average: 0.01, 0.01, 0.00
18:13:43 up 581 days,  8:54,  0 users,  load average: 0.00, 0.00, 0.00
18:13:43 up 430 days,  2:08,  0 users,  load average: 0.01, 0.06, 0.02
18:13:49 up 215 days,  3:56,  0 users,  load average: 0.01, 0.04, 0.10
18:13:49 up 354 days,  4:17,  0 users,  load average: 0.00, 0.00, 0.00
18:13:49 up 523 days,  2:55,  0 users,  load average: 0.00, 0.00, 0.00
18:13:50 up 301 days,  2:12,  0 users,  load average: 0.00, 0.00, 0.00
18:13:50 up 259 days,  9:03,  0 users,  load average: 3.93, 4.07, 4.06
18:13:50 up 47 days,  2:46,  0 users,  load average: 0.00, 0.00, 0.00
18:13:51 up 17 days,  7:42,  0 users,  load average: 0.00, 0.00, 0.00
18:18:10 up 363 days,  4:19,  0 users,  load average: 0.13, 0.26, 0.34
18:13:52 up 145 days,  8:33,  0 users,  load average: 0.23, 0.17, 0.10
18:13:52 up 581 days,  8:34,  0 users,  load average: 0.00, 0.00, 0.00
18:13:52 up 431 days,  1:41,  0 users,  load average: 0.00, 0.01, 0.00
18:13:54 up 396 days,  2:21,  0 users,  load average: 0.00, 0.14, 0.24

Conclusion: it also takes time to prove yourself as a possible next-generation database.
 
So let's review the latest community work to continue this tradition of stability:

  • There has been some effort in that direction at MySQL@Sun-Oracle that goes back into the source code.
  • MariaDB and Drizzle today compile with -Werror, which means that any compiler warning is treated as a build failure.
  • Buildbot is used for all MariaDB pushes and runs on the major OSes, hardware and distributions, including Windows; the test suite runs under valgrind, on debug and optimized builds, to cover all flavours. The number of tests rose from 2236 to 2642, and the number of test lines from 567898 to 656864.
  • Buildbot also stresses the test suite with a high number of retries, for better detection of transient failures.
In that area, competition is raising quality and minimizing risks for users.

mercredi 9 mars 2011

Crossing the hills with Web 2.0

Paybox, the fast-growing payment system, is ready for new write scalability challenges with an up-to-date open source stack: sharding, asynchronous replication, protocol buffers, MariaDB with its new group commit feature, and ready to go on Fusion-io and XFS if required.


HA and write-intensive workloads are really challenging to marry, but it can be a lot of fun!
To follow the Paybox team on the "Rallye des Gazelles 2011": http://timlaline.over-blog.com

lundi 17 janvier 2011

Musing on NoSQL, damned! Can't get rid of InfiniDB

NoSQL has frequently been used for building analytic solutions. The big picture is to use scalable client code with map-reduce to distribute full data scans.

This approach is nothing new to the RDBMS world and can be considered an extension of Kimball ROLAP normalization, just allowing more CPU power on a single query.

NoSQL or sharding takes advantage of:
  • Distributed processing
NoSQL or sharding loses advantages such as:
  • Fast in-memory communication
  • Per-column data type optimization and deserialization cost (NoSQL)
  • C processing when the reduce step is written in a slower language (NoSQL)
There are also non-technical advantages to classic ROLAP normalization, like using the same well-known OLTP tools and bug-free storage engines, all coming with GPL licences that reduce the cost of ownership. In many, many cases it will allow building analytic solutions with acceptable response times.


Our existing tools in MySQL for vertical scalability include:

  • MyISAM, for minimizing query latency
  • MyISAM read-only packed tables, for saving space and IO
  • The deprecated MERGE tables, for partitioning
  • Native partitioning
  • InnoDB transparent zlib compression
  • InnoDB read-ahead and multiple IO threads
  • The Tokutek storage engine, with predictable response time as tables grow

On the same front, MariaDB is closing the gap with the best commercial RDBMSes, introducing new algorithms in the query optimizer (a sketch of toggling some of them follows the list):
  • Table elimination (MariaDB 5.1)
  • Batched Key Access (MariaDB 5.3)
  • Join buffering with outer joins
  • Index Condition Pushdown
  • Subquery optimizations (MariaDB 5.3)
  • NULL-aware materialization
  • Materialization-scan for grouping queries
  • Predicate caching
  • FROM subquery optimizations   
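A hedged sketch of how some of these are toggled in MariaDB 5.3; the exact optimizer_switch flags available depend on the version, so check SELECT @@optimizer_switch first:

SET SESSION optimizer_switch = 'index_condition_pushdown=on,mrr=on,mrr_sort_keys=on';
SET SESSION join_cache_level = 8;   -- allow the hashed / BKA join buffers, including for outer joins
-- then EXPLAIN the query to verify that the plan actually changed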

ROLAP, despite being the most common approach, faces some well-known limitations that OLAP or NoSQL engines try to address:

  • A single query is bound to a single server, and for MySQL to a single core. This is good enough in most cases for marketing or management requirements that put limited load on the analytic solution and request data with a predefined time delay.
  • It puts the technical team in front of the write vs read trade-off: adding more indexes will speed up some queries, but can hardly cover all of them and will slow down writes and DDL.
  • Over time the schema gets more and more tied to specific requirements, bringing aggregates and datamarts and forcing the team to maintain a bunch of asynchronous ETL code.
NoSQL defenders tend to claim this is not acceptable: analytic systems today are moving online and may have to ingest events at the rate of Internet sessions, clicks or network packets. That is true, but is NoSQL the best solution?

On the same track as NoSQL, a toolkit is emerging in MariaDB and Drizzle to overcome those limitations:

  • Column based OLAP storage engine 
  • ROLAP sharding solutions with or without transparent map reduce 


What is best today? Again, there is no rule of thumb, but I can share some findings from my day-to-day job.


OLAP storage engines (InfiniDB, Infobright, LucidDB) provide very good results when processing a lot of records, depending on the underlying hardware and the type of queries. I tend to credit InfiniDB with maximizing resource usage on a single query, and Infobright with providing better response times on limited hardware resources.

Based on the top 10 queries of an existing client running an online DWH of 1 billion records on 8-core hardware, RAID 5 with 4 x 15K rpm SAS disks, 32G RAM:
 



                                                           InnoDB       InfiniDB     InfoBright
Avg number of logical records/s (all types of queries)     1 598 682    12 098 102   4 527 652
Minimum response time / million logical records scanned
(best case, on a 128M-row GROUP BY)                         0.6          0.02         0.06
So far so good, OLAP is the big winner here!
     
Can NoSQL come close to OLAP at a rate of 500 000 gets/s? It would require 128 servers to come close to such response times, and 12 of them to sustain the average records-read rate. Hopefully NoSQL can use bulk data transfers covered by a distributed SQL layer.

On the other side, ROLAP sharding can put more memory caching behind a single query, introducing home-made logical partitioning that follows the business rules, like putting the hottest data on a single node and mixing huge aggregates with small data lookups. This approach already gave me very good results on a 20-billion-row DWH built last year, for a fraction of the price of a commercial DB.
 
Can ROLAP come close to OLAP, and how many servers would I need to achieve the same performance? According to the previous bench it is 1 to 1 if we can make all the cores work together, but for the same latency it is more like 1 to 3 or 4. Can we verify this with micro-benchmarking?

Let's assume linear parallel processing on a ROLAP design with some sharding.

With the Spider storage engine allowing transparent horizontal sharding, we end up splitting the data into per-day shards, giving 2M-row fact table partitions distributed across multiple MariaDB instances. Spider makes it possible to do this and to reattach your data while reusing the same SQL statements, roughly as sketched below.
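A hedged sketch of such a Spider table, with hypothetical node names, credentials and a fact_events table; the connection parameters live in the table-level and partition-level COMMENT strings, and each remote node holds a plain table with the same structure:

CREATE TABLE fact_events (
  event_date DATE NOT NULL,
  id_brand   INT NOT NULL,
  compte     INT NOT NULL
) ENGINE=SPIDER
  COMMENT='wrapper "mysql", table "fact_events", user "spider", password "secret"'
  PARTITION BY RANGE (TO_DAYS(event_date)) (
    PARTITION p20110101 VALUES LESS THAN (TO_DAYS('2011-01-02'))
      COMMENT='host "node1", port "3306"',
    PARTITION p20110102 VALUES LESS THAN (TO_DAYS('2011-01-03'))
      COMMENT='host "node2", port "3306"'
  );
-- queries on fact_events are pushed down to node1/node2 partition by partition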

Let's now present the micro-benchmarks that discouraged me from a brute-force-only approach.


Mounting the partitions on a ramfs, let's catch the time for a single 2-million-row partition scan:



Engine              Scan time (s)
InnoDB              1.3
PBXT                0.8
MyISAM packed       1.32
MyISAM              0.46
Aria (32K blocks)   0.86




With linear scalability we could get 8 cores to scan 32M rows/s on MyISAM.
Given our previous query doing a 128M-row aggregate in 2s on InfiniDB, the result would have taken 4 seconds, i.e. 2 times slower than InfiniDB, and this already ignores network, disk access and any per-shard consolidation time.


What is the magic behind OLAP? Could it be per-column sharding?

Our Q10 query touches 3 fields out of the 12 in the given table. This opens the possibility of dividing the full scan time by 4, at the cost of 3 additional primary key scans and 2 times the CPU to join 2 million rows. This operation could possibly be done transparently with the excellent VP (vertical partitioning) storage engine.

Oops, joining 2M records adds a minimum of 8 seconds per shard for a full scan. This POC was a big U-turn. Conclusion: sharding per column with brute force is not the solution in most cases, given the cost of the algorithms needed to rejoin the multiple pieces.
 

Fetch 2 INT columns, 2M rows, MyISAM      Join cache level   Time to scan all records
FULL SCAN, 8-byte record size             1                  1.07
FULL SCAN, 108-byte record size           1                  1.87
FULL JOIN, nested loop                    1                  17.48
FULL JOIN, BKAH                           8                  15.43
FULL JOIN, BKA                            5                  15.3
FULL JOIN, nested loop, covering index    1                  8.37
  
OLAP engines take clever advantage of per-shard metadata and pre-aggregated information. It means that any ROLAP sharding or NoSQL solution should come to the same conclusion regarding latency, and add hybrid mechanisms, like additional metadata or pre-calculated tables, that limit the cost of brute forcing.

A great idea would be to build a transparent metadata layer, with MySQL Proxy or Gearman, that injects pre-calculated results per shard, executed asynchronously at insert time. I have no doubt that a map-reduce solution for building such asynchronous metadata is the way to go for ROLAP sharding with almost real-time statistics.

For those already convinced, I'd like to point out a not-so-well-known feature of MySQL called WITH ROLLUP that helps when pre-aggregating data, as in the sketch below.
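A minimal sketch on the STAT table used earlier on this blog: WITH ROLLUP adds super-aggregate rows (NULL in the rolled-up columns) that can be stored as the pre-calculated totals per shard:

SELECT
  ID_BRAND,
  ID_MODEL,
  SUM(COMPTE) AS total
FROM STAT
GROUP BY ID_BRAND, ID_MODEL WITH ROLLUP;
-- rows where ID_MODEL IS NULL hold the per-brand totals,
-- the final row where ID_BRAND IS NULL holds the grand total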