Monday, January 17, 2011

Musing on NoSQL: damned! I can't get rid of InfiniDB

NoSQL has frequently been used for building analytic solutions. The big picture: scalable client code with map-reduce to distribute full data scans.

This approach is nothing new to the RDBMS world and can be considered an extension of Kimball ROLAP normalization, simply allowing more CPU power on a single query.

NoSQL or sharding gains an advantage from:
  • Distributed processing
NoSQL or sharding loses advantages such as:
  • Fast memory communication
  • Per-column data type optimization and deserialization cost (NoSQL)
  • C processing, when the reduce step runs in a slower language (NoSQL)
There are also non-technical advantages to classic ROLAP normalization, like reusing the same well-known OLTP tools and battle-tested storage engines, all coming with GPL licences that reduce the cost of ownership. In many, many cases it will allow building analytic solutions with acceptable response times.

Our existing MySQL tools for vertical scalability include:

  • MyISAM for minimizing query latency
  • MyISAM read-only packed tables (myisampack) for saving space and I/O
  • Deprecated MERGE tables for partitioning
  • Native partitioning
  • InnoDB transparent zlib compression
  • InnoDB read-ahead and multiple I/O threads
  • Tokutek storage engine, with response time that stays predictable as tables grow
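As an illustration of the native partitioning item above (table and column names are hypothetical), a per-day RANGE partitioning of a fact table can look like this; partitions are pruned when the WHERE clause constrains the partitioning column:

```sql
-- Hypothetical fact table partitioned per day; a query filtering on
-- event_day only scans the matching partitions.
CREATE TABLE fact_clicks (
  event_day DATE NOT NULL,
  user_id   INT NOT NULL,
  hits      INT NOT NULL
) ENGINE=MyISAM
PARTITION BY RANGE (TO_DAYS(event_day)) (
  PARTITION p20110115 VALUES LESS THAN (TO_DAYS('2011-01-16')),
  PARTITION p20110116 VALUES LESS THAN (TO_DAYS('2011-01-17')),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);
```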

On the same front, MariaDB is closing the gap with the best commercial RDBMSs, introducing new algorithms based on query plan optimization:
  • Table elimination (MariaDB 5.1)
  • Batched Key Access (MariaDB 5.3)
  • Join buffering with outer joins
  • Index Condition Pushdown
  • Subquery optimizations (MariaDB 5.3)
  • NULL-aware materialization
  • Materialization-scan for grouping queries
  • Predicate caching
  • FROM subquery optimizations   
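To make the first item concrete, here is a sketch of table elimination on a hypothetical schema: a LEFT JOIN on a unique key cannot change the row count, so when a query selects no column from the joined table, the optimizer can drop the join entirely.

```sql
-- phone.customer_id is assumed UNIQUE, so the LEFT JOIN preserves rows.
CREATE VIEW v_customer AS
SELECT c.id, c.name, p.phone_number
FROM customer c
LEFT JOIN phone p ON p.customer_id = c.id;

-- No phone column is referenced: MariaDB eliminates the join to phone.
SELECT id, name FROM v_customer;
```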

ROLAP, despite being the most common approach, faces some well-known limitations that OLAP or NoSQL try to address.

  • A single query is bound to a single server, and for MySQL, to a single core. This is good enough in most cases for marketing or management requirements that put a limited load on the analytic solution and request data with some predefined time delay.
  • It puts the technical team in front of the write-vs-read challenge: adding more indexes will speed up some queries, but can hardly cover all queries, and will slow down writes and DDL.
  • Over time the schema gets more and more tied to specific requirements, bringing aggregates and datamarts and forcing the team to maintain a bunch of asynchronous ETL code.
NoSQL defenders tend to claim this is not acceptable. Analytic systems today are moving online and could ingest events at the rate of Internet sessions, clicks or network packets. That much is true, but is NoSQL the best solution?

On the same track as NoSQL, a toolkit is emerging in MariaDB and Drizzle to overcome those limitations:

  • Column based OLAP storage engine 
  • ROLAP sharding solutions with or without transparent map reduce 

What is the best today? Again, there is no rule of thumb, but I can share some findings from my day-to-day job.

The OLAP storage engines InfiniDB, Infobright and LucidDB provide very good results on processing a lot of records, depending on the underlying hardware and the type of queries. I tend to give credit to InfiniDB for maximizing resources on a single query, and to Infobright for providing better response times on limited hardware.

Based on the top 10 queries of an existing client running an online DWH of 1 billion records on 8-core hardware, RAID 5 with 4×15,000 rpm SAS disks, 32 GB RAM:

                                                InnoDB      InfiniDB     InfoBright
Avg logical records/s (all query types)         1 598 682   12 098 102   4 527 652
Min response time / million logical records
scanned (best case, 128M-row GROUP BY)          0.6 s       0.02 s       0.06 s
So far so good: OLAP is the big winner here!
Can NoSQL come close to OLAP, at a rate of 500,000 gets/s? It would require 128 servers to come close to such response times, and 12 of them to sustain the average read rate. Fortunately, NoSQL can use bulk data transfer covered with a distributed SQL layer.

On the other side, ROLAP sharding can put more memory caching behind a single query, introducing home-made logical partitioning that can follow the business rules, like putting the hottest data on a single node and mixing huge aggregates with small data lookups. This approach already gave me very good results on a 20-billion-record DWH built last year for a fraction of the price of a commercial DB.
Can ROLAP come close to OLAP, and how many servers would I need to achieve the same performance? According to the previous bench it is 1 to 1 if we can make all the cores work together, but for the same latency it is more like 1 to 3 or 4. Can we verify this with micro-benchmarking?

Let's assume linear parallel processing on a ROLAP design with some sharding.

With the Spider storage engine allowing transparent horizontal sharding, we end up splitting the data into per-day shards, giving 2M-row fact table partitions, distributed across multiple MariaDB instances. Spider makes it possible to do this and to reattach the data while reusing the same SQL statements.
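A sketch of such a Spider table, with per-day partitions living on remote MariaDB instances. Host and table names are hypothetical, and the exact COMMENT parameter syntax (wrapper, table, host, port, credentials) is an assumption from memory; check the Spider documentation before use.

```sql
CREATE TABLE fact_clicks (
  event_day DATE NOT NULL,
  user_id   INT NOT NULL,
  hits      INT NOT NULL
) ENGINE=SPIDER
COMMENT 'wrapper "mysql", table "fact_clicks"'
PARTITION BY RANGE (TO_DAYS(event_day)) (
  -- each partition points at a different backend instance
  PARTITION p20110115 VALUES LESS THAN (TO_DAYS('2011-01-16'))
    COMMENT 'host "node1", port "3306"',
  PARTITION p20110116 VALUES LESS THAN (TO_DAYS('2011-01-17'))
    COMMENT 'host "node2", port "3306"'
);
```

Queries against fact_clicks keep their usual SQL shape; Spider routes them to the backend holding each partition.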

Let's now present some micro benchmarks that discouraged me from the brute-force-only approach.

Mounting the partitions on a ramfs, let's capture the time for a single 2-million-row partition scan:

InnoDB             1.3 s
PBXT               0.8 s
MyISAM Pack        1.32 s
MyISAM             0.46 s
Aria (32K blocks)  0.86 s

With linear scalability we can get 8 cores to scan 32M rows/s on MyISAM (8 × 2M rows / 0.46 s). Given our previous query doing a 128M-row aggregate in 2 s on InfiniDB, the result would have taken 4 seconds, 2 times slower than InfiniDB, and this is before accounting for network, disk access and any per-shard consolidation time.

What is the magic behind OLAP? Could it be per-column sharding?

Our Q10 queries touch 3 fields out of the 12 in the given table. This opens the possibility of dividing full-scan time by 4, at the cost of 3 extra primary key scans and twice the CPU to join 2 million rows. This could possibly be done transparently with the excellent VP storage engine.
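A manual sketch of the idea (tables and columns are hypothetical): keep the 3 hot fields in a narrow table sharing the fact primary key, and join back to the wide table only when other columns are needed.

```sql
-- Narrow projection: 3 hot columns out of 12, same primary key.
CREATE TABLE fact_narrow (
  id BIGINT NOT NULL PRIMARY KEY,
  d1 INT, d2 INT, m1 INT
) ENGINE=MyISAM;

-- Served entirely from the narrow table: roughly 1/4 of the bytes scanned.
SELECT d1, d2, SUM(m1) FROM fact_narrow GROUP BY d1, d2;

-- Touching a cold column pays the 2M-row join back to the wide table.
SELECT f.d1, SUM(w.m7)
FROM fact_narrow f JOIN fact_wide w ON w.id = f.id
GROUP BY f.d1;
```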

Oops: a join on 2M records adds a minimum of 8 seconds per shard to a full scan. This POC was a big U-turn. Conclusion: sharding per column with brute force is not the solution in most cases, given the cost of the algorithms needed to rejoin the multiple pieces.

Fetch 2 INT columns, 2M rows, MyISAM   Join cache level   Time to scan all records
FULL SCAN, 8-byte records              1                  1.07 s
FULL SCAN, 108-byte records            1                  1.87 s
OLAP engines take clever advantage of per-shard metadata and pre-aggregated information. It means any ROLAP sharding or NoSQL solution should come to the same conclusion regarding latency, and add hybrid mechanisms, like additional metadata or pre-calculated tables, that limit the cost of brute force.

A great idea would be to build a transparent metadata layer, with MySQL Proxy or Gearman, that injects pre-calculated results per shard, executed asynchronously at insert time. I have no doubt that a map-reduce solution for building such asynchronous metadata is the way to go for ROLAP sharding with near-real-time statistics.

For those already convinced, I'd like to point out a not-so-well-known feature of MySQL called WITH ROLLUP that helps with aggregating data.
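A minimal sketch on a hypothetical per-shard fact table: WITH ROLLUP appends a super-aggregate row (NULL in the rolled-up column) for each GROUP BY prefix, so shard-level subtotals and the grand total come back in the same result set.

```sql
SELECT event_day, user_id, SUM(hits) AS hits
FROM fact_clicks
GROUP BY event_day, user_id WITH ROLLUP;
-- Extra rows: one per event_day with user_id = NULL (daily total),
-- plus a final row with both columns NULL (grand total).
```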

Monday, January 10, 2011

20% to 50% improvement in the MariaDB 5.3 Handler Interface using prepared statements

Following a previous post on using MySQL as a NoSQL store with the native Handler Interface, some refactoring has now been coded in MariaDB 5.3, as described in the following worklog.
On concurrency, doing less work in HANDLER READ to allow more caching in HANDLER OPEN is credited with a 7% performance gain.

Source code via
bzr branch lp:~maria-captains/maria/5.3-handler

Moreover, HANDLER READ can now be used with prepared statements to save precious CPU cycles on parsing:

handler t1 open;
prepare stmt from 'handler t1 read a=(?)';
set @a=1000;
execute stmt using @a;
deallocate prepare stmt;

In a micro benchmark, prepared statements show 40% better performance. Under concurrency the improvement drops to 13%, as demonstrated with C code using client-side prepared statements, compared against mysqlslap, also in C but without client-side prepared statements.


[Chart: Handler Interface on patched MariaDB 5.3 — series: MyISAM / client-side prepared statement / Perl; PBXT / client-side prepared statement / C client; MyISAM / client-side prepared statement / C client]

The MariaDB team is also investigating possible contention on kernel_lock with a new server parameter that disables switching the socket to non-blocking mode (-T8 --disable-thread-alarm); so far this has had no impact on this bench.

Code used during the run:

#include <iostream>
#include <cstdlib>
#include <cstring>
#include <cstdio>
#include "mysql.h"
using namespace std;

int main(int argc, char *argv[])
{
  cout << "start up" << endl;
  MYSQL *connection = mysql_init(0);
  if (0 == mysql_real_connect(connection, 0, "stephane", "", "test", 0,
                              "/tmp/mysql.sock", 0))
  {
    cerr << "connect failed: " << mysql_error(connection) << endl;
    return -1;
  }

  static const char *sql_statement = "HANDLER login READ `PRIMARY`=(?)";
  MYSQL_STMT *stmt_handle = mysql_stmt_init(connection);
  if (0 == stmt_handle)
  {
    cerr << "stmt init failed: " << mysql_error(connection) << endl;
    return -1;
  }
  int mysql_return_code = mysql_stmt_prepare(stmt_handle, sql_statement,
                                             strlen(sql_statement));
  if (0 != mysql_return_code)
  {
    cerr << "stmt prepare failed: " << mysql_stmt_error(stmt_handle) << endl;
    return -1;
  }

  /* Bind the single input parameter of the HANDLER READ. */
  int int_data;
  MYSQL_BIND bind_var[1];
  memset(bind_var, 0, sizeof(bind_var));
  bind_var[0].buffer_type = MYSQL_TYPE_LONG;
  bind_var[0].buffer = (char *) &int_data;
  bind_var[0].length = 0;
  bind_var[0].is_null = 0;

  mysql_return_code = mysql_stmt_bind_param(stmt_handle, bind_var);
  if (0 != mysql_return_code)
  {
    cerr << "stmt bind failed: " << mysql_stmt_error(stmt_handle) << endl;
    return -1;
  }

  int_data = 1000;
  int user_id;
  char user_name[50] = {'\0'};
  char user_email[255] = {'\0'};
  MYSQL_TIME created;
  MYSQL_BIND result_var[4];
  unsigned long result_var_length[4];
  my_bool is_null[4];

  for (int i = 0; i < 1000000; ++i)
  {
    mysql_return_code = mysql_stmt_execute(stmt_handle);
    if (0 != mysql_return_code)
    {
      cerr << "stmt execute failed: " << mysql_stmt_error(stmt_handle) << endl;
      return -1;
    }
    /* Bind the 4 result columns: id, name, email, created. */
    memset(result_var, 0, sizeof(result_var));
    result_var[0].buffer_type = MYSQL_TYPE_LONG;
    result_var[0].buffer = (char *) &user_id;
    result_var[0].is_null = &is_null[0];
    result_var[0].length = &result_var_length[0];
    result_var[1].buffer_type = MYSQL_TYPE_VAR_STRING;
    result_var[1].buffer = (char *) user_name;
    result_var[1].buffer_length = sizeof(user_name) - 1;
    result_var[1].is_null = &is_null[1];
    result_var[1].length = &result_var_length[1];
    result_var[2].buffer_type = MYSQL_TYPE_VAR_STRING;
    result_var[2].buffer = (char *) user_email;
    result_var[2].buffer_length = sizeof(user_email) - 1;
    result_var[2].is_null = &is_null[2];
    result_var[2].length = &result_var_length[2];
    result_var[3].buffer_type = MYSQL_TYPE_DATETIME;
    result_var[3].buffer = (char *) &created;
    result_var[3].is_null = &is_null[3];
    result_var[3].length = &result_var_length[3];

    mysql_return_code = mysql_stmt_bind_result(stmt_handle, result_var);
    if (0 != mysql_return_code)
    {
      cerr << "stmt bind rslt failed: " << mysql_stmt_error(stmt_handle) << endl;
      return -1;
    }

    mysql_return_code = mysql_stmt_store_result(stmt_handle);
    if (0 != mysql_return_code)
    {
      cerr << "stmt store rslt failed: " << mysql_stmt_error(stmt_handle) << endl;
      return -1;
    }

    mysql_return_code = mysql_stmt_fetch(stmt_handle);
    if (0 != mysql_return_code && MYSQL_NO_DATA != mysql_return_code)
    {
      cerr << "stmt fetch failed: " << mysql_stmt_error(stmt_handle) << endl;
      return -1;
    }
    if (MYSQL_NO_DATA != mysql_return_code)
    {
      // cout << "stmt - got: " << user_name << endl;
    }
  } // end for

  mysql_stmt_close(stmt_handle);
  mysql_close(connection);
  cout << "done" << endl;
  return 0;
}