[MySQL 源码] MySQL drop table(压缩表)效率与流程分析-阿里云开发者社区

之前发生过一起连续drop压缩表，最后长时间等待信号量crash，线上alert log里的报错是：

OS WAIT ARRAY INFO: reservation count 36647199, signal count 34050225 --Thread 1331538240 has waited at row0purge.c line 680 for 950.00 seconds the semaphore: S-lock on RW-latch at 0xe60b60 '&dict_operation_lock' a writer (thread id 1383319872) has reserved it in mode exclusive number of readers 0, waiters flag 1, lock_word: 0 Last time read locked in file row0purge.c line 680 Last time write locked in file /home/jiyuan/rpmbuild/BUILD/tb-mysql-5.5.18/storage/innobase/row/row0mysql.c line 3213

从alert log来看，drop table应该长时间处于持有dict_sys->dict_operation_lock状态：

row_drop_table_for_mysql { …… /* Serialize data dictionary operations with dictionary mutex: no deadlocks can occur then in these operations */ if (trx->dict_operation_lock_mode != RW_X_LATCH) { /* Prevent foreign key checks etc. while we are dropping the table */ row_mysql_lock_data_dictionary(trx); locked_dictionary = TRUE; } …… }

从函数row_mysql_lock_data_dictionary可以看出，该事务除了加row_mysql_lock_data_dictionary的x锁，还会持有dict_sys->mutex

正常情况下，直到退出函数dict_sys->mutex时，才会调用row_mysql_unlock_data_dictionary来释放dict_sys->mutex和dict_operation_lock

而在purge线程中，row0purge.c line 680

row_purge_parse_undo_rec{

…… row_mysql_freeze_data_dictionary(trx); mutex_enter(&(dict_sys->mutex));

……

}

我开始尝试做重现，把sysbench做一个简单的修改，使其支持compressed table，这是很简单的工作，以下是patch，基于percona版本的sysbench

Index: tests/db/common.lua
===================================================================
--- tests/db/common.lua (revision 2462)
+++ tests/db/common.lua (revision 2467)
@@ -18,9 +18,15 @@
    end 

    i = table_id
+   
+   local comp_str=" "
+   if mysql_use_compress ~= 0 then
+     comp_str=" ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE="..mysql_use_compress
+   end

-   print("Creating table 'sbtest" .. i .. "'...")
-   if (db_driver == "mysql") then
+   print("Creating table 'sbtest" .. i)
+    
+    if (db_driver == "mysql") then
        query = [[
    CREATE TABLE sbtest]] .. i .. [[ (
           id INTEGER UNSIGNED NOT NULL ]] .. ((oltp_auto_inc and "AUTO_INCREMENT") or "") .. [[, 
@@ -28,7 +34,7 @@
           c CHAR(120) DEFAULT '' NOT NULL,
           pad CHAR(60) DEFAULT '' NOT NULL,
           ]] .. index_name .. [[ (id)
-          ) /*! ENGINE = ]] .. mysql_table_engine .. " MAX_ROWS = " .. myisam_max_rows .. " */"
+          ) /*! ENGINE = ]] .. mysql_table_engine ..comp_str.. " MAX_ROWS = " .. myisam_max_rows .. " */"

    elseif (db_driver == "drizzle") then
    query = [[
Index: tests/db/select_random_points.lua
===================================================================
--- tests/db/select_random_points.lua   (revision 2462)
+++ tests/db/select_random_points.lua   (revision 2467)
@@ -13,6 +13,11 @@
    db_connect()

    print("Creating table 'sbtest'...")
+   
+   local comp_str=" "
+   if mysql_use_compress ~= 0 then
+   comp_str=" ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE="..mysql_use_compress
+   end 

    if (db_driver == "mysql") then
       query = [[
@@ -22,7 +27,7 @@
           c CHAR(120) DEFAULT '' NOT NULL,
           pad CHAR(60) DEFAULT '' NOT NULL,
           PRIMARY KEY (id)
-        ) /*! ENGINE = ]] .. mysql_table_engine .. " MAX_ROWS = " .. myisam_max_rows .. " */"
+        ) /*! ENGINE = ]] .. mysql_table_engine ..comp_str.. " MAX_ROWS = " .. myisam_max_rows .. " */"

    elseif (db_driver == "oracle") then
       query = [[
Index: drivers/mysql/drv_mysql.c
===================================================================
--- drivers/mysql/drv_mysql.c   (revision 2462)
+++ drivers/mysql/drv_mysql.c   (revision 2467)
@@ -75,7 +75,8 @@
   {"mysql-ssl", "use SSL connections, if available in the client library", SB_ARG_TYPE_FLAG, "off"},
   {"myisam-max-rows", "max-rows parameter for MyISAM tables", SB_ARG_TYPE_INT, "1000000"},
   {"mysql-debug", "dump all client library calls", SB_ARG_TYPE_FLAG, "off"},
-  
+  {"mysql-use-compress", "whether to create a compress table or not, and this value equal to key_block_size", 
+  SB_ARG_TYPE_INT , "0"},
   {NULL, NULL, SB_ARG_TYPE_NULL, NULL}
 };

加上上面这个patch，我们就可以通过新选项–mysql-user-compress=4来指定一个key_block_size=4的sbtest表

不过多次尝试，都没有完全重现上述purge线程等待dict_operation_lock超时的现象，倒是如下的信号量等待常可以看到

–Thread 140431724488448 has waited at buf0flu.c line 1454 for 303.00 seconds the semaphore:

S-lock on RW-latch at 0x2bce158 ‘&buf_pool->page_hash_latch’

a writer (thread id 140431697413888) has reserved it in mode exclusive

number of readers 0, waiters flag 1, lock_word: 0

Last time read locked in file buf0buf.c line 2482

可以看到，在innodb层整个drop table的过程中，其stack 如下：

#0 0x00000000008a8c2d in buf_buddy_free_low ()

#1 0x0000000000815234 in buf_LRU_block_remove_hashed_page ()

#2 0x0000000000815fb3 in buf_LRU_free_block ()

#3 0x0000000000817143 in buf_LRU_search_and_free_block ()

#4 0x00000000008187bc in buf_LRU_get_free_block ()

#5 0x000000000080bf0e in buf_page_get_gen ()

#6 0x00000000007fdc58 in btr_search_drop_page_hash_when_freed ()

#7 0x000000000084c489 in fseg_free_extent ()

#8 0x000000000084d284 in fseg_free_step ()

#9 0x00000000007ec4a9 in btr_free_but_not_root ()

#10 0x000000000081d50e in dict_drop_index_tree ()

#11 0x00000000007b6035 in row_upd_clust_step ()

#12 0x00000000007b69de in row_upd_step ()

#13 0x0000000000899b88 in que_run_threads ()

#14 0x000000000089a286 in que_eval_sql ()

#15 0x00000000007a5ce9 in row_drop_table_for_mysql ()

#16 0x00000000007918ae in ha_innobase::delete_table(char const*) ()

或者

Thread 3 (Thread 0x7f2531467700 (LWP 24352)):

#0 0x00000000008a8c2d in buf_buddy_free_low ()

#1 0x00000000008150f4 in buf_LRU_block_remove_hashed_page ()

#2 0x0000000000817def in buf_LRU_invalidate_tablespace ()

#3 0x0000000000847cab in fil_delete_tablespace ()

#4 0x00000000007a606d in row_drop_table_for_mysql ()

#5 0x00000000007918ae in ha_innobase::delete_table(char const*) ()

#6 0x0000000000694c0e in ha_delete_table(THD*, handlerton*, char const*, char const*, char const*, bool) ()

#7 0x0000000000602e55 in mysql_rm_table_no_locks(THD*, TABLE_LIST*, bool, bool, bool, bool) ()

#8 0x00000000006034ad in mysql_rm_table(THD*, TABLE_LIST*, char, char) ()

#9 0x00000000005abd29 in mysql_execute_command(THD*) ()

#10 0x00000000005afaa8 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()

#11 0x00000000005b12c7 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()

#12 0x00000000006355ff in do_handle_one_connection(THD*) ()

#13 0x0000000000635700 in handle_one_connection ()

#14 0x0000000000900da6 in pfs_spawn_thread ()

#15 0x0000003888c077e1 in start_thread () from /lib64/libpthread.so.0

#16 0x00000038888e68ed in clone () from /lib64/libc.so.6

这时候show engine innodb status也被hang住

innodb_show_status->srv_printf_innodb_monitor->buf_print_io->buf_stats_get_pool_info

这里需要buf_pool->LRU_list_mutex

线程状态：

Thread 41 (Thread 0x7fb8d04dc700 (LWP 2554)):

#0 0x0000003888c0b44c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1 0x0000000000884811 in os_event_wait_low ()

#2 0x00000000007c42cf in sync_array_wait_event ()

#3 0x00000000007c54a3 in mutex_spin_wait ()

#4 0x0000000000803b0e in pfs_mutex_enter_func ()

#5 0x0000000000807775 in buf_print_io ()

#6 0x00000000007bf89d in srv_printf_innodb_monitor.clone.3 ()

#7 0x00000000007bfef0 in srv_monitor_thread ()

#8 0x0000003888c077e1 in start_thread () from /lib64/libpthread.so.0

#9 0x00000038888e68ed in clone () from /lib64/libc.so.6

从backtrace里，我们还可以看到master线程也被阻塞在获取s锁：

Thread 39 (Thread 0x7fb8cf0da700 (LWP 2556)):

#0 0x0000003888c0b44c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1 0x0000000000884811 in os_event_wait_low ()

#2 0x00000000007c42cf in sync_array_wait_event ()

#3 0x00000000007c4510 in rw_lock_s_lock_spin ()

#4 0x000000000080fcac in buf_flush_page_and_try_neighbors ()

#5 0x0000000000810dbf in buf_flush_batch ()

#6 0x0000000000811540 in buf_flush_list ()

#7 0x00000000007bca2e in srv_master_thread ()

#8 0x0000003888c077e1 in start_thread () from /lib64/libpthread.so.0

#9 0x00000038888e68ed in clone () from /lib64/libc.so.6

master线程在执行buf_flush_try_neighbors 时，需要频繁获取page_hash_latch的s锁

1454 rw_lock_s_lock(&buf_pool->page_hash_latch);

那么，我们有理由相信，如果drop table的过程足够慢，那么整个server必然会被hang住，直至超过900秒后crash掉。

May be related : http://bugs.mysql.com/bug.php?id=59727

我们注意到在MySQL5.5.18之后的版本中，fix了多个跟低效率drop table相关的issue。这里我们也做一些简单的测试。这样的问题并不好重现，大多数时候会很快完成drop，这应该和buffer pool的状态相关，但目前还不清楚如何每次都能重现，只能用脚本来循环

table size: 1,000,000 records

Percona 5.5.18 –nearly 1~2 minutes

Percona 5.5.28 –very quick

我们先来看看Percona 5.5.18中在innodb层drop table的流程吧

入口函数为row_drop_table_for_mysql ：

1.加锁

if (trx->dict_operation_lock_mode != RW_X_LATCH) {

/* Prevent foreign key checks etc. while we are dropping the

table */

row_mysql_lock_data_dictionary(trx);

locked_dictionary = TRUE;

}

普通的drop语句会进入到if中，加dict_operation_lock的x锁并获取互斥锁dict_sys->mutex

同时将trx->dict_operation_lock_mode = RW_X_LATCH

这样做的目的是由于后续涉及到对数据词典和ibdata内的系统表的操作，因此需要串行化。

2.根据表名将该表加载到dict cache中

table = dict_table_get_low_ignore_err(

name, DICT_ERR_IGNORE_INDEX_ROOT | DICT_ERR_IGNORE_CORRUPT);

从dict_sys->table_hash读取(dict_table_check_if_in_cache_low)或者从磁盘读取（dict_load_table）

3.检查外键约束

4.如果table->n_mysql_handles_opened > 0 ，表示有别的线程打开该表，将其加入到background drop list上，然后goto funct_exit。

added = row_add_table_to_background_drop_list(table->name);

MySQL5.5的MDL锁可以保证在drop table时不会有别的线程使用该表。但alter table的过程中如果需要drop table时，可以允许将表暂时放到row_mysql_drop_list上，当没有在该表上的查询时，有后台master线程来完成drop.

后面再单独分析

5.移除表上所有的锁

lock_remove_all_on_table(table, TRUE);

6.调用innodb层的SQL Parser来对数据词典做操作

row_upd_clust_step->….->dict_drop_index_tree //根据SYS_INDEXS中的记录删除对应的索引树

>>读取记录中存储的索引root_page_no、space id、是否是压缩表（zip_size = fil_space_get_zip_size(space)）

>>释放所有除了root page外的所有page

btr_free_but_not_root(space, zip_size, root_page_no);

>>>释放leaf page segment

finished = fseg_free_step(root + PAGE_HEADER + PAGE_BTR_SEG_LEAF,

&mtr);

函数fseg_free_step可以被重复调用来释放叶子page，这样可以避免过大的mtr.

从函数参数可以看出来，root + PAGE_HEADER + PAGE_BTR_SEG_LEAF指向了LEAF SEGMENT（每个索引有两个segment，一个管理叶子节点，一个管理非叶子节点）

–>读取一个extend descriptor

descr = xdes_get_descriptor(space, zip_size, header_page, mtr);

读取该space的0号page(其中记录了space header，在该page的FSP_HEADER_OFFSET处记录)。

然后再根据space header获取xdes_t。这里的header_page为3，也就是root节点。实际上btree的root节点存储了两个SEGMENT的segment header.

inode = fseg_inode_try_get(header, space, zip_size, mtr);//读取该segment对应的segmeng inode

descr = fseg_get_first_extent(inode, space, zip_size, mtr);

获取该segment的第一个extent。extend list是按照FSEG_FULL->FSEG_NOT_FULL->FSEG_FREE来进行排序

//TODO:对文件系统部分进行系统分析

–>如果descr不为NULL

fseg_free_extent //释放一个extend

依次遍历该extend上的page,并从page hash index中删除，调用如下函数进行

btr_search_drop_page_hash_when_freed(

space, zip_size, first_page_in_extent + i);

|–>先获取对应的block（buf_page_get_gen）

|–>再调用btr_search_drop_page_hash_index去移除adaptive hash index中对应该page的记录

性能瓶颈1:发生在buf_page_get_gen，函数的参数mode为BUF_PEEK_IF_IN_POOL，表示如果page不在bp中，则返回NULL。但如果page在bp中，并且是压缩表时，则继续下面的逻辑，会从bp中获取一个block(block = buf_LRU_get_free_block(buf_pool)),这里可能是耗时的地方

然后从inode上删除对应的extent节点

最后调用fsp_free_extent将该extend加到空闲链表上。

–>最后返回false到上层，下次会继续循环调用该函数来做释放

–>释放碎片页及inode等，这些都是很快的操作，不会成为瓶颈

>>>如果上一步调用函数fseg_free_step返回值为false，表明该segment的释放还没完成，goto leaf_loop继续调用fseg_free_step

>>>释放非叶子节点segement（通过PAGE_BTR_SEG_TOP来指定）

finished = fseg_free_step_not_header(

root + PAGE_HEADER + PAGE_BTR_SEG_TOP, &mtr);

流程和fseg_free_step类似，但不释放对应的inode

>>释放root page

btr_free_root(space, zip_size, root_page_no, mtr); //释放btree的根节点

>>将对应的索引在SYS_INDEXS中的记录DICT_SYS_INDEXES_PAGE_NO_FIELD置空

page_rec_write_field(rec, DICT_SYS_INDEXES_PAGE_NO_FIELD,

FIL_NULL, mtr);

7. fil_delete_tablespace //真正删除表空间

>>检查是否有pending io/pending ibuf merge，有的话，则循环等待

>>如果开启了innodb_lazy_drop_table，调用函数buf_LRU_mark_space_was_deleted(id)

>>>扫描bp的LRU，找到对应的page，然后设置bpage->space_was_being_deleted = TRUE

>>>

>>如果没有设置，则调用buf_LRU_invalidate_tablespace

>>>buf_LRU_drop_page_hash_for_tablespace //删除所有跟该tablespace id 相关的page hash index记录

扫描buf_pool->LRU，采用批量清理的方法，首先临时分配一个1024个元素的page_arr，每当读到一个和当前drop table相关的page时，就加入到数组中，如果数组满了，则调用buf_LRU_drop_page_hash_batch删除对应page hash index记录

疑问：这个之前在dict_drop_index_tree 已经做过一次删除page hash index entries，为何这里还要再做？？

>>>buf_LRU_invalidate_tablespace_buf_pool_instance

从backtrace来看，这里也是耗费时间主要的地方，会再次扫描buf_pool->LRU，读取和space id吻合的page

–>如果page被Pin住（buf_page_get_io_fix(bpage) != BUF_IO_NONE或者bpage->buf_fix_count > 0），读取下一个page

–>将page从LRU移除

buf_LRU_block_remove_hashed_page(bpage, TRUE)

|–>buf_LRU_remove_block(bpage); //将bpage从LRU上移除

|–>读取page hash中对应的page(buf_page_hash_get_low)，做必要的检查后，从page_hash中将当前page移除。

|–>对于普通page，只需要进行memset.如果是压缩表，还需要调用buf_buddy_free来释放压缩页内存

性能瓶颈2：从观察到的现象看，这里buf_buddy_free被长时间频繁调用到

另外，我们也可以看到，上面扫描了两次LRU，这也是bug 51325 、 bug 64284、 bug#59727 等bug讨论到的情况，在后续的版本也做了修改

此外，这里实际上重复进行了AHI记录的删除，从上面的分析可以看出，在释放btree的extend的时候，实际上已经做过了一次删除AHI记录的操作。

在5.5.23之后，这些问题基本都解决了，对fil_delete_tablespace做了优化，当drop table时，调用新的接口函数buf_LRU_flush_or_remove_pages->buf_flush_dirty_pages，从flush list上驱逐掉被删除的表page，并且不做写操作。比起之前的两次LRU扫描，仅仅扫描flush list将效率从分钟级别降低到秒级别。

不过在读代码的过程中发现在Percona5.5.28版本中依旧支持的innodb_lazy_drop_table会进入函数buf_LRU_mark_space_was_deleted，这里依旧还有对AHI记录的操作。我会接着尝试将其去除掉看看。

[MySQL 源码] MySQL drop table(压缩表)效率与流程分析

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像