Friday, November 18, 2011

Alter table engine TokuDB

At the last MUG meetup some questions were raised about TokuDB; I hope this post answers them in more depth.

InnoDB, like any other general-purpose RDBMS, can perform really badly for an online back-office with big tables, despite following the usual OLTP practices of a normalized schema:
  • Keeping indexes in memory
  • Touching a minimum set of rows on secondary index scans
  • Scanning ranges in primary key order

This happens frequently when:
  • The working dataset grows to 2 to 100 times the size of the hardware memory
  • Joins on big tables jump from secondary indexes to the clustered table
  • Concurrency increases on big tables with range scans starting from a random point in the PRIMARY KEY

This shows up as random disk IO driving the disks to 100% utilization:


iostat -xm 2
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00  305.00  101.00     4.68     0.28    25.04     1.01    2.48   2.41  97.65
sdb               0.00     0.00  285.50  110.50     4.37     0.34    24.36     1.02    2.59   2.47  98.00
sdb               0.00     0.00  361.50  158.50     5.59     2.73    32.76     1.18    2.27   1.87  97.00

iostat
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdb            1215.50     37040.00       335.00      74080        670



Today we protect the InnoDB buffer pool by constraining most range queries with a constant minimum date of today minus a 2-month interval, but this is not always possible and it limits features.
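As a sketch of that protection (table and column names are hypothetical, not from the original schema), it amounts to forcing a constant lower bound on the range predicate:

```sql
-- Hypothetical example: cap the scanned range to protect the buffer pool.
-- Without the lower bound, a cold range scan can evict the hot working set.
SELECT *
FROM   orders
WHERE  created_at >= CURDATE() - INTERVAL 2 MONTH  -- forced constant lower bound
  AND  created_at <  CURDATE()
ORDER  BY created_at;
```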

Benchmark


IBM 5030 M3, 16 Nehalem cores, 32G RAM, XFS on RAID10 over 4 SAS 15K rpm disks

Schema extraction for the bench:

                         InnoDB    TokuDB
Total tablespace size    249G      155G
Index size               31G       192G
Buffer pool              24G       14G
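The sizes above can be pulled from the standard MySQL/MariaDB dictionary tables (this query is generic, not specific to this setup):

```sql
-- Per-engine data and index footprint, in GB
SELECT engine,
       ROUND(SUM(data_length)/1024/1024/1024, 1)  AS data_gb,
       ROUND(SUM(index_length)/1024/1024/1024, 1) AS index_gb
FROM   information_schema.tables
WHERE  table_schema NOT IN ('mysql', 'information_schema')
GROUP  BY engine;
```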


Every primary key is composed of a timestamp and a bigint UUID.
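A minimal sketch of that key layout (the table and column names are hypothetical; the real schema is not shown):

```sql
-- Hypothetical: every PK is (timestamp, bigint UUID), so inserts are
-- roughly time-ordered and range scans follow the clustered order.
CREATE TABLE xxx_example (
  ts      TIMESTAMP NOT NULL,
  uuid    BIGINT UNSIGNED NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (ts, uuid)
) ENGINE=InnoDB;
```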

Multiple big tables, with many rows and a big average record size:

MariaDB> select table_name, table_rows/1000000 as million_rows from information_schema.tables order by table_rows desc limit 20;
+--------------------+--------------+
| table_name         | million_rows |
+--------------------+--------------+
| xxx2               |      52.8748 |
| xxx1               |      44.5123 |
| xxx3l              |      31.6692 |
| xxx4               |      30.0430 |
 


MariaDB> select table_name, avg_row_length from information_schema.tables where table_name in ('xxx1','xxx2','xxx3l','xxx4');
+-------------+----------------+
| table_name  | avg_row_length |
+-------------+----------------+
| xxx2        |            237 |
| xxx3l      |            610 |
| xxx4        |             87 |
| xxx1        |            248 |

Index and data usage for the slow queries  

Let’s run a test of 100 queries and see what is happening inside the storage engine.


MariaDB> show table_statistics;
| Table_name          | Rows_read |
| xxx2                |    750284 |
| xxx1                |   3118583 |


MariaDB> show index_statistics;
| Table_name    | Index_name                  | Rows_read |
| xxx2          | apnum_idx                   |     64475 |
| xxx1          | trsitemachdateheure_idx     |   3023319 |
| xxx2          | apdatesitemach_idx          |    456567 |
| xxx1          | PRIMARY                     |     93381 |
| xxx2          | PRIMARY                     |    228113 |
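Note that SHOW TABLE_STATISTICS and SHOW INDEX_STATISTICS come from the userstat patch shipped in MariaDB (and Percona Server); the counters must be enabled, and can be reset between runs:

```sql
SET GLOBAL userstat = 1;   -- start collecting per-table / per-index counters
FLUSH TABLE_STATISTICS;    -- reset the counters before the 100-query run
FLUSH INDEX_STATISTICS;
```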


From here we can estimate that every 100 heavy queries touch:

  • ~430MB of row data: ((750284*237) + (3118583*87))/1024/1024
  • ~47MB of index data: ((64475+3023319+456567)*12 + (93381+228113)*22)/1024/1024
  • 90% of the secondary index key reads hit xxx1, a 21GB table on the bench and 100G in production
  • 10% of the secondary index key reads hit xxx2, a 22GB table on the bench and 100G in production

We estimate 50% of the data fits in memory on the bench (24G of buffer pool for a 44G working set)
We estimate 10% of the data fits in memory in production (24G of buffer pool for a 200G working set)
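The arithmetic above can be checked directly in the server:

```sql
SELECT ((750284*237) + (3118583*87)) / 1024 / 1024  AS row_data_mb,  -- ~428 MB
       ((64475 + 3023319 + 456567) * 12
        + (93381 + 228113) * 22) / 1024 / 1024      AS index_mb;     -- ~47 MB
```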

Benchmark results

We collected 1000 queries from the slow query log, each running between 10s and 30s on InnoDB, and replaced the 2-month interval with 9 months.

   
                                     InnoDB key reads/s                TokuDB key reads/s

Full memory workload                 Flat status                       Flat status
100 slow queries, concurrency 1      1M < handler_read_nxt < 1.2M      1M < handler_read_nxt < 1.2M
                                     12s : 10 tps                      9s : 10 tps

Full memory workload                 Flat status                       Negative scalability
100 slow queries, concurrency 4      1M < handler_read_nxt < 3M        400K < handler_read_nxt < 1M
                                     4s : 40 tps                       14s : 7.4 tps

50% memory workload                  Erratic status                    Flat status
1000 slow queries, concurrency 1     1K < handler_read_nxt < 2M        200K < handler_read_nxt < 400K
                                     3730s : 0.26 tps                  854s : 1.2 tps

10% memory workload                  still waiting                     Flat status
1000 slow queries, concurrency 1                                       150K < handler_read_nxt < 280K
buffer_pool 5G                                                         1262s : 0.8 tps



   
Query time in seconds    InnoDB    TokuDB
Q1                         9.05      0.22
Q2                        17.56      0.30
Q3                         5.95      0.29
Q4                        19.00      0.29
Q5                         0.24      0.01
Q6                         9.70      0.09
Q7                         2.01      0.23



InnoDB stays the reference for in-memory concurrent workloads, and this was expected since TokuDB is the challenger; but we can already state that for any IO-bound workload TokuDB will make your day.

There are many ways to fix those bad queries (a denormalization project, for instance), but nothing is simpler than an ALTER TABLE ... ENGINE=TokuDB on a slave. And don't forget to also alter the secondary indexes using the clustering index syntax.
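A minimal sketch of that conversion on a slave (CLUSTERING is TokuDB's extension for covering secondary indexes; the column list below is guessed from the index name and only illustrates the syntax):

```sql
-- Convert the table in place on a slave
ALTER TABLE xxx1 ENGINE=TokuDB;

-- Rebuild a secondary index as a clustering index, so range scans on it
-- no longer jump back to the primary key to fetch row data
ALTER TABLE xxx1 DROP KEY trsitemachdateheure_idx,
                 ADD CLUSTERING KEY trsitemachdateheure_idx (site, mach, dateheure);
```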