jeudi 12 novembre 2015

Big Data: InfiniDB vs Spider: What else?

Many of my recent engagements have revolved around strategies for implementing real-time Big Data analytics: comparing the hardware cost of extending a single-table collection with MariaDB and the parallel query capability of the Spider storage engine, versus offloading to columnar MPP storage like InfiniDB or Vertica.

As of today, parallel query is only available in the MariaDB Spider releases supported by Spiral Arms. The most efficient way to use parallel query with Spider is on GROUP BY and COUNT queries that target a single Spider table. In that case the Spider engine executes query push down, aka map reduce.

Spider provides multiple levels of parallel execution for a single partitioned table.

The first level is per backend server:
To tell Spider to scan the different backends concurrently, set spider_sts_bg_mode=1

The second level is per partition:
To tell Spider to scan the different partitions concurrently, set spider_casual_read=1
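
Both switches can be set at runtime, for example like this (a minimal sketch; in production you would more likely put them in the server configuration file):

-- Per the notes above: let Spider scan the different backends in the background
SET GLOBAL spider_sts_bg_mode = 1;
-- Allow partitions of the same Spider table to be read concurrently
SET GLOBAL spider_casual_read = 1;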

Hey, but that's sometimes not enough! Per-partition parallel query can conflict with Spider's connection recycling feature. Behind the scenes, Spider looks for already open connections in a hash table keyed on the partition connection parameters (host, user, password, port or socket) and reuses an existing connection if it finds one. To really enable concurrent scans inside the same backend, or on localhost, you need to create different users to defeat connection recycling, and the server definitions should use a TCP connection string rather than a socket or named pipe.

In a real-life scenario, it's always good to create a different user per partition table and to grant it SELECT only on the single table attached to the Spider partition. In the DDL you then map each partition to its own server definition, one server per partition, each carrying its own user.
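
For example, a minimal per-partition grant could look like this (hypothetical user names, and tighter than the GRANT ALL used in the test case below):

-- One dedicated user per partition, restricted to the single backend view it reads
GRANT SELECT ON test.v1 TO 'spider_p1'@'127.0.0.1' IDENTIFIED BY 'secret';
GRANT SELECT ON test.v2 TO 'spider_p2'@'127.0.0.1' IDENTIFIED BY 'secret';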

Here is a local test case for this scenario, run inside a single server, using multiple cores to produce an aggregate.

CREATE OR REPLACE TABLE `test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `a` varchar(1000) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

insert into test select *,'test' from seq_1_to_4000000; 

create or replace view v1 as select * from test where id>=0000000 and id<1000000;
create or replace view v2 as select * from test where id>=1000000 and id<2000000;
create or replace view v3 as select * from test where id>=2000000 and id<3000000;
create or replace view v4 as select * from test where id>=3000000 and id<4000000;

grant all on *.* to root1@'127.0.0.1' identified by 'mariadb' ;
grant all on *.* to root2@'127.0.0.1' identified by 'mariadb' ;
grant all on *.* to root3@'127.0.0.1' identified by 'mariadb' ;
grant all on *.* to root4@'127.0.0.1' identified by 'mariadb' ;

create or replace server l1 FOREIGN DATA WRAPPER mysql OPTIONS (USER 'root1',PORT 3307,PASSWORD 'mariadb',HOST '127.0.0.1',DATABASE 'test' );
create or replace server l2 FOREIGN DATA WRAPPER mysql OPTIONS (USER 'root2',PORT 3307,PASSWORD 'mariadb',HOST '127.0.0.1',DATABASE 'test' );
create or replace server l3 FOREIGN DATA WRAPPER mysql OPTIONS (USER 'root3',PORT 3307,PASSWORD 'mariadb',HOST '127.0.0.1',DATABASE 'test' );
create or replace server l4 FOREIGN DATA WRAPPER mysql OPTIONS (USER 'root4',PORT 3307,PASSWORD 'mariadb',HOST '127.0.0.1',DATABASE 'test' );

create or replace table test_spd  ( 
  `id` int(11) NOT NULL AUTO_INCREMENT, 
  `a` varchar(1000) DEFAULT NULL,   
PRIMARY KEY (`id`) ) ENGINE=SPIDER PARTITION BY RANGE(id) (
PARTITION P1 VALUES LESS THAN (1000000) COMMENT='table "v1", srv "l1"',
PARTITION P2 VALUES LESS THAN (2000000) COMMENT='table "v2", srv "l2"',
PARTITION P3 VALUES LESS THAN (3000000) COMMENT='table "v3", srv "l3"',
PARTITION P4 VALUES LESS THAN (4000000) COMMENT='table "v4", srv "l4"');

In this scenario, queries on my 4 million record table will use up to 4 cores, that is, the number of views that materialize the partitions of the Spider table.
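
For instance, aggregates like the following (reusing the test_spd table above) should be pushed down to the four partitions and executed concurrently:

-- Each statement maps to one query per partition (v1..v4), run in parallel
SELECT COUNT(*) FROM test_spd;
SELECT a, COUNT(*) FROM test_spd GROUP BY a;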

Now for a real use case, let's ask David Chanial, devops on the Believe Digital platform, who demonstrates this on an 8-node, 64-core cluster.

SVA: How do you initialize the cluster?

DCH: To manage the farm we deploy using the Python Fabric library! We created deployment scripts that take as input the node list, the number of cores and the number of replicas.

SVA: What table definitions and data sizing are you using?

DCH: Let's have a look at the 2.5 billion record Spider table:

MariaDB [spider01_ro]> show table status LIKE 'Sales' \G

           Name: Sales 
         Engine: SPIDER
        Version: 10
     Row_format: Dynamic
           Rows: 2500657155
 Avg_row_length: 91
    Data_length: 228106478935
   Index_length: 40839657478
 Auto_increment: 2618963872
      Collation: utf8_general_ci 
 Create_options: partitioned

             
SVA: What type of partitioning?

DCH: It's a mix of technical partitioning on the auto-increment key and partitioning per business segment. We actually use sub-partitioning, with two levels of Spider tables pointing to TokuDB or InnoDB tables in this case, and the sub-partitions are spread modulo the number of cores in the cluster.

SVA: What performance can you get?

MariaDB [spider01_ro]> 
select count(*) from spider01_ro.Sales t; 
select idGenre, count(*) from spider01_ro.Sales GROUP BY idGenre;
+------------+
| count(*)   |
+------------+
| 2506437338 |
+------------+
1 row in set (8.87 sec)

+---------+-----------+
| idGenre | count(*)  |
+---------+-----------+
|       0 |      8137 |
|       1 |  56044584 |
|       2 |  21179162 |
|       3 |  25446110 |
|       4 |  31829221 |
|     293 |   1386236 |
|     294 |     47109 |
|     295 |     50776 |
|     296 |       988 |

..........
|     297 |     47589 |
|     298 |      9610 |
|     299 |      5215 |
|     300 |       224 |
+---------+-----------+
295 rows in set (16.00 sec)

Indeed, that's about 149M records read per second across the 8 nodes, and inside a single node all the cores are working hard:


| 2848 | tsrc_p15_c02 | 10.93.1.12:51388 | spider01 | Query   |     1 | Queried about 730000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p15_c02` group by `idGenre` |    0.000 |
| 2849 | tsrc_p15_c10 | 10.93.1.12:51396 | spider01 | Query   |     1 | Queried about 720000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p15_c10` group by `idGenre` |    0.000 |
| 2850 | tsrc_p15_c18 | 10.93.1.12:51404 | spider01 | Query   |     1 | Queried about 950000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p15_c18` group by `idGenre` |    0.000 |
| 2851 | tsrc_p15_c26 | 10.93.1.12:51412 | spider01 | Query   |     1 | Queried about 740000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p15_c26` group by `idGenre` |    0.000 |
| 2852 | tsrc_p15_c34 | 10.93.1.12:51420 | spider01 | Query   |     1 | Queried about 1060000 rows                                                  | select count(0),`idGenre` from `spider01`.`tsrc_p15_c34` group by `idGenre` |    0.000 |
| 2853 | tsrc_p15_c42 | 10.93.1.12:51428 | spider01 | Query   |     1 | Queried about 920000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p15_c42` group by `idGenre` |    0.000 |
| 2854 | tsrc_p15_c50 | 10.93.1.12:51436 | spider01 | Query   |     1 | Queried about 530000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p15_c50` group by `idGenre` |    0.000 |
| 2855 | tsrc_p15_c58 | 10.93.1.12:51444 | spider01 | Query   |     1 | Queried about 790000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p15_c58` group by `idGenre` |    0.000 |
| 2856 | tsrc_p16_c02 | 10.93.1.12:51452 | spider01 | Query   |     1 | Queried about 760000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p16_c02` group by `idGenre` |    0.000 |
| 2857 | tsrc_p16_c10 | 10.93.1.12:51460 | spider01 | Query   |     1 | Queried about 660000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p16_c10` group by `idGenre` |    0.000 |
| 2858 | tsrc_p16_c18 | 10.93.1.12:51468 | spider01 | Query   |     1 | Queried about 940000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p16_c18` group by `idGenre` |    0.000 |
| 2859 | tsrc_p16_c26 | 10.93.1.12:51476 | spider01 | Query   |     1 | Queried about 930000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p16_c26` group by `idGenre` |    0.000 |
| 2860 | tsrc_p16_c34 | 10.93.1.12:51484 | spider01 | Query   |     1 | Queried about 910000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p16_c34` group by `idGenre` |    0.000 |
| 2861 | tsrc_p16_c42 | 10.93.1.12:51492 | spider01 | Query   |     1 | Queried about 800000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p16_c42` group by `idGenre` |    0.000 |
| 2862 | tsrc_p16_c50 | 10.93.1.12:51500 | spider01 | Query   |     1 | Queried about 770000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p16_c50` group by `idGenre` |    0.000 |
| 2863 | tsrc_p16_c58 | 10.93.1.12:51508 | spider01 | Query   |     1 | Queried about 740000 rows                                                   | select count(0),`idGenre` from `spider01`.`tsrc_p16_c58` group by `idGenre` |    0.000 |
       
SVA: Thanks David

Take away:

We conclude that, in such a scenario with fewer than 20 columns, the columnar model is only about 10 times more efficient for a given hardware cost, while the RDBMS takes the lead on small indexed range queries.

Parallel and distributed query execution is never an easy task, but we can make it shine on top of a regular, good old, stable OLTP storage engine.

Stay tuned: thanks to the open source spirit of Spiral Arms, a foundation sponsor that is special to us, and to more and more supported customers inside MariaDB, we should get more of this Spider work into the official MariaDB 10.2 branch.

If you feel like helping to finance this effort, ask your MariaDB sales rep for specific Spider support, or go directly to a Spiral Arms sales rep.

lundi 29 juin 2015

Slave Election is welcoming GTID

Slave election is a popular HA architecture; the first MySQL/MariaDB toolkit to manage switchover and failover correctly was MHA, introduced by Yoshinori Matsunobu.

Failover and switchover in asynchronous clusters require caution:

- The CAP theorem needs to be satisfied: to get strong consistency, slave election has to reject transactions that end up on the old master when electing the candidate master.

- Slave election needs to make sure that all events from the old master are applied to the candidate master before switching roles.

- It should be instrumented to find a good candidate master and to make sure it is set up to take the master role.

- It needs topology detection: the master role can't be predefined, as the role moves around the nodes.

- It needs monitoring to escalate a switchover into a failover.

MHA was coded at a time when no cluster-wide unique event ID was possible; each event was tracked with independent coordinates on each node, so the MHA architecture needed an internal way to rematch coordinates across all nodes.

With the introduction of GTID, MHA carries that heritage and now looks unnecessarily complex, being an agent-based solution that requires SSH connections to all nodes.

A lighter MHA was needed for MariaDB when replication uses GTID, and that's what my colleague Guillaume Lefranc has been addressing in a new MariaDB toolkit.

In MariaDB, switching a slave to GTID is as simple as:

#>stop slave;change master to master_use_gtid=current_pos;start slave; 
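
To double check that the slave really switched, a quick look at the GTID positions is enough (a minimal check, assuming MariaDB 10.0 or later):

-- The Using_Gtid field should now report Current_Pos
SHOW SLAVE STATUS\G
SELECT @@gtid_slave_pos, @@gtid_current_pos;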

As a bonus, the code is written in Go and does not require any external dependencies.
We can enjoy a single command-line procedure in interactive mode.

mariadb-repmgr -hosts=9.3.3.55:3306,9.3.3.56:3306,9.3.3.57:3306 -user=admin:xxxxx -rpluser=repl:xxxxxx -pre-failover-script="/root/pre-failover.sh" -post-failover-script="/root/post-failover.sh" -verbose -maxdelay 15    
Don't be afraid: the default is to run in interactive mode, and it does not launch anything yet.


In my post-failover script I usually update an HAProxy configuration stored on a NAS or a SAN and then reload, or shoot in the head, all the proxies.

Note that the newly elected master is passed as the second argument of the script.

I strongly advise against auto-failover based purely on monitoring: get a good replication monitoring tool, analyze all master status alerts and check for false positives before enjoying pre-scripted failover.

Lossless semi-synchronous replication (MDEV-162) and multiple semi-synchronous performance improvements (MDEV-7257) have made it into MariaDB 10.1; they can be used to get much closer to zero data loss in case of failure. Combined with parallel replication, it's now possible to have an HA architecture that is as robust as asynchronous replication can be, and, as long as the replication delay is kept under control, crash safe as well.
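
A rough configuration sketch, assuming the semi-sync plugins shipped with MariaDB 10.1 (variable names as exposed by those plugins; values are illustrative only):

-- On the master: load and enable semi-synchronous replication
INSTALL SONAME 'semisync_master';
SET GLOBAL rpl_semi_sync_master_enabled = ON;
-- On each slave: enable the slave side and parallel replication
INSTALL SONAME 'semisync_slave';
SET GLOBAL rpl_semi_sync_slave_enabled = ON;
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 8;   -- illustrative value
START SLAVE;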

Galera, aka MariaDB Galera Cluster, has a write speed limit bound to the network speed, but it comes with the advantage of always offering crash-safe consistency. Slave election HA is limited by the master's disk speed and does not suffer from a slower network, but it loses consistency on failover when the slaves can't catch up.

Interesting times ahead to see how flash storage adoption favors one architecture or the other.

vendredi 17 avril 2015

Social Networking Using OQGraph

I was given the chance to experiment with a typical social networking query on an existing 60 million edge dataset.

How You're Connected


Such algorithms, and others, are simply hardcoded into the OQGraph engine.

With the upgrade to OQGraph v3 in MariaDB 10, we can work directly on top of the existing table holding the edges, a kind of featured virtual view.



CREATE OR REPLACE TABLE `relations` (
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `id1` int(10) unsigned NOT NULL,
  `id2` int(10) unsigned NOT NULL,
  `relation_type` tinyint(3) unsigned DEFAULT NULL,
  KEY `id1` (`id1`),
  KEY `id2` (`id2`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

oqgraph=# select count(*) from relations;

+----------+
| count(*) |
+----------+
| 59479722 |
+----------+
1 row in set (23.05 sec)

Very nice integration of table discovery, which saved me from referring to the documentation to find out all the column definitions.

CREATE TABLE `oq_graph`
ENGINE=OQGRAPH `data_table`='relations' `origid`='id1' `destid`='id2';

oqgraph=# SELECT * FROM oq_graph WHERE latch='breadth_first' AND origid=175135 AND destid=7;
+---------------+--------+--------+--------+------+--------+
| latch         | origid | destid | weight | seq  | linkid |
+---------------+--------+--------+--------+------+--------+
| breadth_first | 175135 |      7 |   NULL |    0 | 175135 |
| breadth_first | 175135 |      7 |      1 |    1 |      7 |
+---------------+--------+--------+--------+------+--------+
2 rows in set (0.00 sec)


oqgraph=# SELECT * FROM oq_graph WHERE latch='breadth_first' AND origid=175135 AND destid=5615775;
+---------------+--------+---------+--------+------+----------+
| latch         | origid | destid  | weight | seq  | linkid   |
+---------------+--------+---------+--------+------+----------+
| breadth_first | 175135 | 5615775 |   NULL |    0 |   175135 |
| breadth_first | 175135 | 5615775 |      1 |    1 |        7 |
| breadth_first | 175135 | 5615775 |      1 |    2 | 13553091 |
| breadth_first | 175135 | 5615775 |      1 |    3 |  1440976 |
| breadth_first | 175135 | 5615775 |      1 |    4 |  5615775 |
+---------------+--------+---------+--------+------+----------+
5 rows in set (0.44 sec)

What we first highlight is that the underlying table indexes KEY `id1` (`id1`) and KEY `id2` (`id2`) are used by OQGraph to navigate the vertices via a number of key reads and range scans: this 5-level relation took around 2,689 jumps and 77,526 range accesses to the table.

Meaning the breadth of the traversal was around 2,500 vertices, with an average of about 30 edges per vertex (77,526 range accesses / 2,689 jumps ≈ 29).
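
One way to reproduce those counters is to check the handler statistics around the query (a sketch; the exact Handler_read_* breakdown depends on the underlying engine):

FLUSH STATUS;
SELECT * FROM oq_graph WHERE latch='breadth_first' AND origid=175135 AND destid=5615775;
-- Handler_read_key roughly tracks the jumps, Handler_read_next the range accesses
SHOW SESSION STATUS LIKE 'Handler_read%';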

# MyISAM

oqgraph=# SELECT * FROM oq_graph_myisam WHERE latch='breadth_first' AND origid=175135 AND destid=5615775;
+---------------+--------+---------+--------+------+----------+
| latch         | origid | destid  | weight | seq  | linkid   |
+---------------+--------+---------+--------+------+----------+
| breadth_first | 175135 | 5615775 |   NULL |    0 |   175135 |
| breadth_first | 175135 | 5615775 |      1 |    1 |        7 |
| breadth_first | 175135 | 5615775 |      1 |    2 | 13553091 |
| breadth_first | 175135 | 5615775 |      1 |    3 |  1440976 |
| breadth_first | 175135 | 5615775 |      1 |    4 |  5615775 |
+---------------+--------+---------+--------+------+----------+
5 rows in set (0.11 sec)

I need to investigate this speed difference with MyISAM further. Ideas are welcome!

jeudi 16 avril 2015

Howto - Move a table to different schema with no outage

I remember a time when it was debated whether views could be useful for a web-oriented workload.

This post is about one good use case:

The story is that some tables were created in a schema and used by the application over the same connection.

Later on, more schemas were added to separate the data of the various application domains, but they kept using the original table as a kind of cross-domain universal table.

With the addition of many new domains, a new global schema was added to store the freshly created universal tables.

The question was how to move the old table into the correct new schema without interrupting the availability of the service.

We decided to use a view that points to the physical table, change the application to use the view, and later atomically swap the table and the view.


Here is the test case for doing that:


-- Create schemas
CREATE DATABASE schema1;
CREATE DATABASE schema2;

-- Create table in schema 1
CREATE TABLE schema1.t1 (
  id int
);

-- Create views in schema 2
CREATE VIEW schema2.t1 AS SELECT * FROM schema1.t1;
-- Create dummy view on view in schema 1 
CREATE VIEW schema1.t1_new AS SELECT * FROM schema2.t1;

-- At this point the application is switched to access schema2.t1 (currently a view)

-- Switch schema 1 table and schema 2 view
RENAME TABLE schema2.t1 TO schema2.t1_old,
  schema1.t1 TO schema2.t1,
  schema1.t1_new TO schema1.t1;

Is there some other path? Surely triggers + INSERT IGNORE, as done in OAK or pt-online-schema-change, but I also remember a time when it was debated whether triggers could be useful for a web-oriented workload :)
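
For completeness, a very rough sketch of that trigger-based path (hypothetical copy table and trigger names, and far from the full OAK / pt-online-schema-change logic):

-- New physical table in the target schema, to be backfilled in chunks separately
CREATE TABLE schema2.t1_copy LIKE schema1.t1;
-- Keep the copy in sync with ongoing writes while the backfill runs
CREATE TRIGGER schema1.t1_ai AFTER INSERT ON schema1.t1
  FOR EACH ROW INSERT IGNORE INTO schema2.t1_copy VALUES (NEW.id);
-- ...plus matching AFTER UPDATE / AFTER DELETE triggers and a chunked
-- INSERT IGNORE ... SELECT backfill, then an atomic RENAME at the end.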


Thanks to Nicolas @ccmbenchmark for contributing the test case.