记一次alteruser导致的mysql主从不同步分析
(2022-05-08 15:41:35)
标签:
mysql |
分类: mysql源码相关 |
一线同学在客户现场碰到一次较为诡异的mysql主从中断。
复现过程
at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/auth/sql_user_table.cc:766
1. 因为alter user失败的场景总是有可能隐式地执行一个flush
privileges但不记录binlog,所以对于alter user失败的场景,是潜在存在导致主从不一致的可能。
683 bool log_and_commit_acl_ddl(THD *thd,
bool transactional_tables,
684
std::set *extra_users,
685
Rewrite_params *rewrite_params,
686
bool extra_error,
687
bool write_to_binlog) {
688 bool
result = false;
689
DBUG_TRACE;
690
assert(thd);
691
result = thd->is_error() || extra_error ||
thd->transaction_rollback_request;
766 if
(acl_end_trans_and_close_tables(thd, result)) result = true;
1. setup a replication m1->s1 with RBR 2. Init test user on m1: 1) create user u1@'%' identified with 'mysql_native_password'; 2) grant all on *.* to u1@'%'; 3. Modify the mysql.user table to set a invalid authentication_string for test user on m1: 1) update mysql.user set authentication_string=MD5("password") where user="u1"; 2) flush privileges; 4. Reset authentication_string to empty for test user on m1: 1) update mysql.user set authentication_string="" where user="u1"; 5. Do the alter user twice on m1: 1) alter user u1@'%' identified with mysql_native_password by 'abc123' password expire never; // this one will fail with "ERROR 1396 (HY000): Operation ALTER USER failed for 'u1'@'%'" 2) alter user u1@'%' identified with mysql_native_password by 'abc123' password expire never; // this one will success 6. check slave status on s1, we will see the following error: Last_SQL_Errno: 1396, Operation ALTER USER failed for...
原因分析
第一个alter user语句报错1396,原因是mysql内存中的acl cache因为update mysql.user
set authentication_string=MD5("password")的非法密码信息导致权限信息加载失败,导致acl
cache中没有u1用户信息,所以报错。
而第二个alter user语句执行成功了,这块是因为第一个alter user语句在执行失败的情况下自带了flush
privileges的效果,将最新的update mysql.user set
authentication_string=""加载到了acl cache中,堆栈如下:
(gdb) bt
#0 reload_acl_caches (thd=0x7fff40000da0,
mdl_locked=true) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/auth/sql_auth_cache.cc:3650
#1 0x00000000035eb843 in
acl_end_trans_and_close_tables (thd=0x7fff40000da0,
rollback_transaction=true) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/auth/sql_user_table.cc:653
#2 0x00000000035ebcf2 in
log_and_commit_acl_ddl (thd=0x7fff40000da0,
transactional_tables=true, extra_users=0x7fffa85f0260,
rewrite_params=0x7fffa85f0120, extra_error=false,
write_to_binlog=true)
#3 0x00000000035fd773 in mysql_alter_user
(thd=0x7fff40000da0, list=..., if_exists=false) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/auth/sql_user.cc:3483
#4 0x000000000330c688 in
mysql_execute_command (thd=0x7fff40000da0, first_level=true) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/sql_parse.cc:4676
#5 0x000000000330df96 in
dispatch_sql_command (thd=0x7fff40000da0,
parser_state=0x7fffa85f5b60) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/sql_parse.cc:5174
#6 0x00000000033045e0 in dispatch_command
(thd=0x7fff40000da0, com_data=0x7fffa85f6c40, command=COM_QUERY) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/sql_parse.cc:1938
#7 0x0000000003302867 in do_command
(thd=0x7fff40000da0) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/sql_parse.cc:1352
#8 0x00000000034df9e7 in handle_connection
(arg=0xa785660) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/conn_handler/connection_handler_per_thread.cc:302
#9 0x000000000502d937 in pfs_spawn_thread
(arg=0xaa0b950) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/storage/perfschema/pfs.cc:2947
#10 0x00007ffff7bc16ba in start_thread (arg=0x7fffa85f7700) at
pthread_create.c:333
#11 0x00007ffff5e3b41d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109
flush privileges的堆栈如下:
(gdb) bt
#0 reload_acl_caches (thd=0x7fff40000da0,
mdl_locked=false) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/auth/sql_auth_cache.cc:3650
#1 0x00000000033554b8 in
handle_reload_request (thd=0x7fff40000da0, options=1, tables=0x0,
write_to_binlog=0x7fffa85f523c) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/sql_reload.cc:162
#2 0x000000000330a9ac in
mysql_execute_command (thd=0x7fff40000da0, first_level=true) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/sql_parse.cc:4044
#3 0x000000000330df96 in
dispatch_sql_command (thd=0x7fff40000da0,
parser_state=0x7fffa85f5b60) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/sql_parse.cc:5174
#4 0x00000000033045e0 in dispatch_command
(thd=0x7fff40000da0, com_data=0x7fffa85f6c40, command=COM_QUERY) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/sql_parse.cc:1938
#5 0x0000000003302867 in do_command
(thd=0x7fff40000da0) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/sql_parse.cc:1352
#6 0x00000000034df9e7 in handle_connection
(arg=0xa785660) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/sql/conn_handler/connection_handler_per_thread.cc:302
#7 0x000000000502d937 in pfs_spawn_thread
(arg=0xaa0b950) at
/home/gaoxiaoxin/mysql-repo/mysql-8.0.28/storage/perfschema/pfs.cc:2947
#8 0x00007ffff7bc16ba in start_thread
(arg=0x7fffa85f7700) at pthread_create.c:333
#9 0x00007ffff5e3b41d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109
而reload_acl_caches就是重新加载内存中的acl cache。所以在第一个alter
user之前第2个update user语句之后如果有执行一个flush privileges的话,第一个alter
user语句是不会报错,因为这时候内存中的acl cache信息是合法的了。
但是上述操作在binlog文件中记录的是:
1)
UPDATE `mysql`.`user` ...
2)
FLUSH PRIVILEGES
3)
UPDATE `mysql`.`user` ...
4)
ALTER USER 'u1'@'%' IDENTIFIED WITH 'mysql_native_password' AS
'*6691484EA6B50DDDE1926A220DA01FA9E575C18A' PASSWORD EXPIRE
NEVER
相当于在slave回放的时候,相比master上少执行一次alter user,只执行了第一个报错的alter
user。
所以只需要在slave执行start slave sql_thread,让它重新执行一次alter user的binlog
event,同步就正常了。
修复方案
这个bug报给了官方,官方认为不应该直接去修改mysql.user表,所以不认为是bug。
但也可以从源码角度考虑下如何避免上述bug,基本思想如下:
2. 所以考虑alter user一开始强制做一次flush privileges,这样也增加了alter
user语句行为的确定性。
代码调整非常简单,就一行,具体可以参见提给官方的bug:https://bugs.mysql.com/bug.php?id=106942
5.7 无问题原因分析
上述复现步骤在5.7中不会导致主从中断,原因如下:
在5.7中,acl_end_trans_and_close_tables是否re-flush权限信息取决于thd->transaction_rollback_request和rollback_whole_statement。
2006 result|=
2007
acl_end_trans_and_close_tables(thd,
2008
thd->transaction_rollback_request ||
2009
rollback_whole_statement);
因为update mysql.user set authentication_string=MD5("password")
where user="u1";语句修改的权限信息非法,所以之后的flush
privileges并没有将该用户信息加载到内存中,所以对应的后续alter
user语句实际上找不到这个user,进而报错。
1916
1917 if
(!(acl_user= find_acl_user(user_from->host.str,
1918
user_from->user.str, TRUE)))
1919
{
1920
if (if_exists)
1921
{
1922
String warn_user;
1923
append_user(thd,
&warn_user, user_from, FALSE, FALSE);
1924
push_warning_printf(thd,
Sql_condition::SL_NOTE,
1925
ER_USER_DOES_NOT_EXIST,
1926
ER_THD(thd,
ER_USER_DOES_NOT_EXIST),
1927
warn_user.c_ptr_safe());
1928
try
1929
{
1930
extra_users.insert(user_from);
1931
}
1932
catch (...) {}
1933
}
1934
else
1935
{
1936
result= TRUE;
1937
append_user(thd,
&wrong_users, user_from, wrong_users.length() > 0,
1938
false);
1939
}
1940
1941
continue;
1942
}
5.7中,如果找不到acl_user会直接continue,不会继续后续的replace_user_table操作,也不会设置rollback_whole_statement和thd->transaction_rollback_request。所以不会触发re-flush权限信息。
而8.0中,是否会re-flush权限信息还会额外检查thd->is_error()和extra_error,只要alter
user语句执行报错了,thd->is_error()必然为true,所以也必然会触发re-flush权限信息。
...
转载请注明转自高孝鑫的博客!