A little fun with InnoDB multi-versioning

Consider the following commands, executed in the MySQL CLI on a new connection with no special preparation (and pay special attention to the execution time):

mysql> show create table t \G *************************** 1. row *************************** Table: t Create Table: CREATE TABLE `t` ( `a` int(10) unsigned NOT NULL, `b` int(10) unsigned NOT NULL, PRIMARY KEY (`a`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 1 row in set (0.00 sec)

mysql> select * from t; Empty set (5.20 sec)

mysql> select count(*) from t; +----------+ | count(*) | +----------+ | 0 | +----------+ 1 row in set (5.22 sec)

mysql> select * from t where a = 10; Empty set (0.00 sec)

mysql> select * from t where a < 10; Empty set (5.35 sec)

mysql> select * from t where a > 10; Empty set (5.41 sec)

mysql> select * from t where a in (10, 20, 30); Empty set (0.00 sec)

mysql> select * from t where a > 1000000; Empty set (0.00 sec)

mysql> select * from t where a > 500000; Empty set (2.60 sec)

What is happening? Why is it so slow? Why are some things slow and others not?

Looking for the culprit

The SHOW PROCESSLIST doesn’t show anything unusual:

mysql> show processlist; +----+------+-----------------+------+---------+------+-------+------------------+ | Id | User | Host | db | Command | Time | State | Info | +----+------+-----------------+------+---------+------+-------+------------------+ | 6 | root | localhost:34840 | test | Query | 0 | init | show processlist | | 8 | root | localhost:34842 | test | Sleep | 2116 | | NULL | +----+------+-----------------+------+---------+------+-------+------------------+ 2 rows in set (0.00 sec)

If we look at the SHOW ENGINE INNODB STATUS there’s a clue: a transaction that’s been busy being evil:

---TRANSACTION 1481, ACTIVE 2183 sec 3487 lock struct(s), heap size 570920, 2001743 row lock(s), undo log entries 11000000 MySQL thread id 8, OS thread handle 0x7fca27ba2700, query id 5000144 localhost 127.0.0.1 root cleaning up

It also shows up in information_schema.innodb_trx:

mysql> select * from information_schema.innodb_trx \G *************************** 1. row *************************** trx_id: 1481 trx_state: RUNNING trx_started: 2014-04-17 01:38:27 trx_requested_lock_id: NULL trx_wait_started: NULL trx_weight: 11003487 trx_mysql_thread_id: 8 trx_query: NULL trx_operation_state: NULL trx_tables_in_use: 0 trx_tables_locked: 0 trx_lock_structs: 3487 trx_lock_memory_bytes: 570920 trx_rows_locked: 2001743 trx_rows_modified: 11000000 trx_concurrency_tickets: 0 trx_isolation_level: REPEATABLE READ trx_unique_checks: 1 trx_foreign_key_checks: 1 trx_last_foreign_key_error: NULL trx_adaptive_hash_latched: 0 trx_adaptive_hash_timeout: 10000 trx_is_read_only: 0 trx_autocommit_non_locking: 0 1 row in set (0.01 sec)

How was the transaction evil?

I used the following script to INSERT one million rows, and then UPDATE each row ten times, leaving the transaction open and un-committed:

#!/usr/bin/env ruby

require "mysql"

m = Mysql.new("127.0.0.1", "root", "", "test", 13000)

m.query("DROP TABLE IF EXISTS t") m.query("CREATE TABLE t (a INT UNSIGNED NOT NULL, b INT UNSIGNED NOT NULL, PRIMARY KEY (a)) ENGINE=InnoDB")

m.query("START TRANSACTION")

(1..1000000).each do |i| m.query("INSERT INTO t (a, b) VALUES (#{i}, 0)") puts "Inserted #{i} rows..." if i % 10000 == 0 end

(1..10).each do |i| m.query("UPDATE t SET b=#{i}") puts "Updated #{i} times..." end

sleep 1000000

As described in The basics of the InnoDB undo logging and history system, these modifications are made to the database, and there is a single index structure used by both the uncommitted write transaction and my read transaction. Although my read transaction is not able to see any of the rows inserted and subsequently modified by the writing transaction, in order to figure that out, it needs to apply all the undo records to each row encountered (10 each). This, of course, takes time.

Leaky visibility

You can see some of the hidden visibility of the in-flight data “leaking” in the amount of time different operations take. For instance, scanning all rows takes more than 5 seconds, but using the condition a > 1000000 can be immediately evaluated. Suspiciously the condition a > 500000 takes approximately half as much time as scanning all rows (because it has half as much work to do). Of course, all of these queries return an empty set.

Applicability to the real world

While almost no one would intentionally do what my script does, I have actually seen the end result in production systems before: a very large transaction updated a row many times resulting in many slow queries trying to access the table. Some queries would be fast, some slow.

Serious implications for system administrators and DBAs

There are no limits to how much space a user can consume in the form of undo history. I have filed MySQL Bug #72362: “Users may use an unlimited amount of undo space” suggesting that configuration options be added to limit the amount of undo space which a user could cause to be consumed. I wrote the following:

As InnoDB is currently implemented, it is possible for users to use an unlimited amount of space for undo history by either:

Creating a single or many large transactions writing to as few as one row many times, directly accumulating undo history.

Leaving open a transaction (while minimally keeping it alive) with a read view, indirectly accumulating undo history by preventing purge.

(Additionally, the user could combine both of these approaches, leaving open a transaction with a read view while simultaneously generating a large number of small transactions which individually have a small amount of undo history. This may better prevent detection of the culprit.)

Both of these situations allow a regular user, without special privileges, to consume large amounts of disk space in the system tablespace, potentially causing the system tablespace to be expanded to consume all filesystem space and without an easy recourse from the system administrator.

And I’ve suggested that new configuration options be added for the following:

Limit the undo space consumed by a single transaction.
Limit the aggregate undo space consumed by a given user.
Limit the age of the transaction read view for a given user.