Multi-Column IN clause - Unexpected MySQL Issue

Peter Zaitsev2008-04-04T18:08:34-04:00

We have an application which stores massive amount of urls. To save on indexes instead of using URL we index CRC32 of the URL which allows to find matching urls quickly. There is a bit of chance there would be some false positives but these are filtered out after reading the data so it works all pretty well.

If we just process urls one by one it works great:

mysql> explain select url from 124pages.124pages where url_crc=484036220 and url="http://www.dell.com/"; +----+-------------+----------+------+---------------+---------+---------+-------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+----------+------+---------------+---------+---------+-------+------+-------------+ | 1 | SIMPLE | 124pages | ref | url_crc | url_crc | 4 | const | 1 | Using where | +----+-------------+----------+------+---------------+---------+---------+-------+------+-------------+ 1 row in set (0.02 sec)

mysql> explain select url from 124pages.124pages where url_crc=484036220 and url="http://www.dell.com/";
+----+-------------+----------+------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table    | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+-------------+----------+------+---------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | 124pages | ref  | url_crc       | url_crc | 4       | const |    1 | Using where |
+----+-------------+----------+------+---------------+---------+---------+-------+------+-------------+

Handling URLs one by one is however not efficient if you’re processing millions of them so we tried to do bulk fetches:

mysql> explain SELECT url FROM 106pages.106pages WHERE (url_crc, url) IN ((2752937066, 'http://members.aye.net/~gharris/blog/'), (3799762538, 'http://www.coxandforkum.com/')); +----+-------------+----------+------+---------------+------+---------+------+-------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+----------+------+---------------+------+---------+------+-------+-------------+ | 1 | SIMPLE | 106pages | ALL | NULL | NULL | NULL | NULL | 14936 | Using where | +----+-------------+----------+------+---------------+------+---------+------+-------+-------------+ 1 row in set (0.00 sec)

mysql> explain SELECT  url FROM 106pages.106pages WHERE (url_crc, url) IN ((2752937066, 'http://members.aye.net/~gharris/blog/'), (3799762538, 'http://www.coxandforkum.com/'));
+----+-------------+----------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table    | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
+----+-------------+----------+------+---------------+------+---------+------+-------+-------------+
|  1 | SIMPLE      | 106pages | ALL  | NULL          | NULL | NULL    | NULL | 14936 | Using where |
+----+-------------+----------+------+---------------+------+---------+------+-------+-------------+

As you can see just using multiple column IN makes MySQL to pick doing full table scan in this case, even though the cardinality on the first column is almost perfect. I did some more testing and it looks like a bug or missing optimizer feature.

I should not be surprised though as multi-column in is not the most used MySQL feature out there.

For given application case we could simply rewrite query using more standard single column IN clause:

mysql> explain SELECT url FROM 106pages.106pages WHERE url_crc IN (2752937066,3799762538) AND url in('http://members.aye.net/~gharris/blog/','http://www.coxandforkum.com/'); +----+-------------+----------+-------+---------------+---------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+----------+-------+---------------+---------+---------+------+------+-------------+ | 1 | SIMPLE | 106pages | range | url_crc | url_crc | 4 | NULL | 2 | Using where | +----+-------------+----------+-------+---------------+---------+---------+------+------+-------------+ 1 row in set (0.01 sec)

mysql> explain SELECT  url FROM 106pages.106pages WHERE url_crc IN (2752937066,3799762538) AND url in('http://members.aye.net/~gharris/blog/','http://www.coxandforkum.com/');                                                                  +----+-------------+----------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table    | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+----------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | 106pages | range | url_crc       | url_crc | 4       | NULL |    2 | Using where |
+----+-------------+----------+-------+---------------+---------+---------+------+------+-------------+

Theoretically speaking this query is not equivalent to the first one – because row having url_crc=2752937066 and url=’http://www.coxandforkum.com/’ would match it, while it should not. It however does not happen in our case as url_crc is functionally dependent on url so both queries are equivalent.

So we’ve got our work around and can forget about the issue and MySQL team gets yet another bug to deal with.
What worries me again is – this is very simple case which seems to to be generally broken which raises a question how good coverage MySQL tests have.

Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Home - Wiki
Copyright © 2011-2024 iteam. Current version is 2.134.0. UTC+08:00, 2024-09-29 02:13
浙ICP备14020137号-1 $Map of visitor$