vendredi 28 juin 2013

MariaDB subquery cache in a real use case

The big picture here is to track events in a flow with some changing state of an event.

That is done in two way

  • Storing multiple rows in a flow table using an event id and event column representing different states. 
  • Maintain 1 to 1 relation for event specific states in separate tables (flow_event1 to  flow_event5) 

A query generator take various conditions and dynamically build exists subqueries to filter all events that does not match various user conditions.

Using MySQL 5.5 on a 50M rows table the following queries are taking hours.  

SELECT DISTINCT(id) FROM flow main 
WHERE date BETWEEN '2013-06-22 10:11:50' AND '2013-06-23 10:11:50' AND event = 'STATE5' 
AND  

not exists
(SELECT 1
   FROM flow_crm sub
   WHERE main.id = sub.id
    AND sub.date BETWEEN '2013-06-26 10:11:50' AND '2013-06-27 10:11:50'
    LIMIT 1)
AND not exists

(SELECT 1
   FROM flow_event4 sub
   WHERE main.id = sub.id
    AND sub.date BETWEEN '2013-06-26 10:11:50' AND '2013-06-27 10:11:50'
    LIMIT 1)
AND (SELECT COUNT(*)
   FROM flow_event1 sub
    WHERE main.id = sub.id
    AND sub.date BETWEEN '2013-05-28 10:11:50' AND '2013-06-27 10:11:50'
) < 5

AND not exists
(SELECT 1
   FROM flow sub
   WHERE main.id = sub.id
    AND event = 'STATE6'
    AND sub.date BETWEEN '2013-06-22 10:11:50' AND '2013-06-27 10:11:50'
    LIMIT 1)
AND not exists
(SELECT 1
   FROM flow_event1 sub
   WHERE main.id = sub.id
    AND sub.date BETWEEN '2013-06-24 10:11:50' AND '2013-06-27 10:11:50'
    LIMIT 1)

AND not exists
(SELECT 1
   FROM flow_event2 sub
   WHERE main.id = sub.id
    AND sub.date BETWEEN '2013-06-23 10:11:50' AND '2013-06-27 10:11:50'
    LIMIT 1)

AND not exists
(SELECT 1
   FROM flow sub
   WHERE main.id = sub.id
    AND event in ('STATE1', 'STATE2', 'STATE3', 'STATE4')
    AND sub.date BETWEEN '2013-06-23 10:11:50' AND '2013-06-27 10:11:50'
    LIMIT 1)

AND not exists
(SELECT 1
   FROM flow_event3 sub
   WHERE main.id = sub.id
    AND sub.date BETWEEN '2013-06-23 10:11:50' AND '2013-06-27 10:11:50'

    LIMIT 1);


The reason for such slow execution is that every subqueries is run multiple time for every entry of the flow table. But wait we have duplicate id(s) in that table. This redondance events impact that the same id is invoking the same subqueries again and again. 

Can we figure out how much ?

SELECT count(id) FROM flow main 
WHERE date BETWEEN '2013-06-22 10:11:50' AND '2013-06-23 10:11:50' AND event = 'STATE5'
1530545

SELECT count(DISTINCT(id)) FROM flow main 
WHERE date BETWEEN '2013-06-22 10:11:50' AND '2013-06-23 10:11:50' AND event = 'STATE5'
35967

So this give in average around 50 time re execution of the same subquery or 50 different state for a single event.

The win can be huge !      

So the solution is to move to MariaDB and activate sub query cache 

Those features can be control with the optimizer_switch variable that enable and disable optimisations based on the knowledge of your workload .

The result is a major improvement in speed the same hour queries take now 7s !

But wait i'm not a MariaDB fan boy, what can i do about it ?

Solution is simple make the id(s) unique before applying the subqueries in the where clause but this  can still have a drawback of creating many temporary tables in memory as we will have to group before filtering. 

SELECT * FROM ( SELECT DISTINCT(id) FROM flow 
WHERE date BETWEEN '2013-06-22 10:11:50' AND '2013-06-23 10:11:50' AND event = 'STATE5' 
 main WHERE

not exists
(SELECT 1
   FROM flow_crm sub
   WHERE main.id = sub.id
    AND sub.date BETWEEN '2013-06-26 10:11:50' AND '2013-06-27 10:11:50'
    LIMIT 1)
AND not exists
(SELECT 1
   FROM flow_event4 sub
   WHERE main.id = sub.id
    AND sub.date BETWEEN '2013-06-26 10:11:50' AND '2013-06-27 10:11:50'
    LIMIT 1)
AND (SELECT COUNT(*)
   FROM flow_event1 sub
    WHERE main.id = sub.id
    AND sub.date BETWEEN '2013-05-28 10:11:50' AND '2013-06-27 10:11:50'
) < 5
AND not exists
(SELECT 1
   FROM flow sub
   WHERE main.id = sub.id
    AND event = 'STATE6'
    AND sub.date BETWEEN '2013-06-22 10:11:50' AND '2013-06-27 10:11:50'
    LIMIT 1)
AND not exists
(SELECT 1
   FROM flow_event1 sub
   WHERE main.id = sub.id
    AND sub.date BETWEEN '2013-06-24 10:11:50' AND '2013-06-27 10:11:50'
    LIMIT 1)
AND not exists
(SELECT 1
   FROM flow_event2 sub
   WHERE main.id = sub.id
    AND sub.date BETWEEN '2013-06-23 10:11:50' AND '2013-06-27 10:11:50'
    LIMIT 1)
AND not exists
(SELECT 1
   FROM flow sub
   WHERE main.id = sub.id
    AND event in ('STATE1', 'STATE2', 'STATE3', 'STATE4')
    AND sub.date BETWEEN '2013-06-23 10:11:50' AND '2013-06-27 10:11:50'
    LIMIT 1)
AND not exists
(SELECT 1
   FROM flow_event3 sub
   WHERE main.id = sub.id
    AND sub.date BETWEEN '2013-06-23 10:11:50' AND '2013-06-27 10:11:50'

    LIMIT 1);

This query now run in 8s on MySQL 5.5 and before ?

Can we reduce the response time more. A trivial answer is maybe.  

As you can guess some of those subqueries can have better filtering predicate than others so the idea is to move the predicate returning more frequently false at the beginning of the WHERE clause . The Optimizer will stop evaluating where conditions per row as soon one of them turn out to be false and we get a 25% response time decrease. 

Enjoy MariaDB as it save you to understand the arcane of the optimizer and you get good subqueries optimization from 5.3 drop replacement of 5.1 and MariaDB 5.5 drop replacement for MySQL 5.5 and with most of the 5.6 feature already in place.   

MariaDB 10 with independent statistics will help improving more such cases introducing a ref+range join optimization but this need to be coded and we welcome any founding to the MariaDB fondation or by direct service suscription to SkySQL.  

The task on JIRA for this if you feel that ref + range on multiple column index is an issue for you then try to increase the popularity of that task in JIRA .