A node crashed in a MySQL 5.7 InnoDB cluster and cannot be rejoined to the cluster

Problem description

We have a MySQL InnoDB cluster in one of our environments. One of the nodes in the cluster crashed. Although we were able to bring the crashed node back online, we were unable to rejoin it to the cluster.

Can someone please help us recover/restore the node and rejoin it to the cluster? We tried "dba.rebootClusterFromCompleteOutage()", but it didn't help.

Configuration: MySQL 5.7.24 Community Edition, CentOS 7, standard three-node InnoDB cluster

Cluster status:

MySQL NODE02:3306 ssl JS > var c=dba.getCluster()
MySQL NODE02:3306 ssl JS > c.status()
{
    "clusterName": "QACluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "NODE03:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
        "topology": {
            "NODE02:3306": {
                "address": "NODE02:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "NODE03:3306": {
                "address": "NODE03:3306",
                "mode": "R/W",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "NODE01:3306": {
                "address": "NODE01:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)"
            }
        }
    },
    "groupInformationSourceMember": "mysql://clusterAdmin@NODE03:3306"
}

Errors logged in the MySQL error log:

2019-03-04T23:49:36.970839Z 3624 [Note] Slave SQL thread for channel 'group_replication_recovery' initialized, starting replication in log 'FIRST' at position 0, relay log './NODE01-relay-bin-group_replication_recovery.000001' position: 4
2019-03-04T23:49:36.985336Z 3623 [Note] Slave I/O thread for channel 'group_replication_recovery': connected to master 'mysql_innodb_cluster_r0429584112@NODE02:3306',replication started in log 'FIRST' at position 4
2019-03-04T23:49:36.988164Z 3623 [ERROR] Error reading packet from server for channel 'group_replication_recovery': The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. (server_errno=1236)
2019-03-04T23:49:36.988213Z 3623 [ERROR] Slave I/O for channel 'group_replication_recovery': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236
2019-03-04T23:49:36.988226Z 3623 [Note] Slave I/O thread exiting for channel 'group_replication_recovery', read up to log 'FIRST', position 4
2019-03-04T23:49:36.988286Z 41 [Note] Plugin group_replication reported: 'Terminating existing group replication donor connection and purging the corresponding logs.'
2019-03-04T23:49:36.988358Z 3624 [Note] Error reading relay log event for channel 'group_replication_recovery': slave SQL thread was killed
2019-03-04T23:49:36.988435Z 3624 [Note] Slave SQL thread for channel 'group_replication_recovery' exiting, replication stopped in log 'FIRST' at position 0
2019-03-04T23:49:37.016864Z 41 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='NODE02', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2019-03-04T23:49:37.030769Z 41 [ERROR] Plugin group_replication reported: 'Maximum number of retries when trying to connect to a donor reached. Aborting group replication recovery.'
2019-03-04T23:49:37.030798Z 41 [Note] Plugin group_replication reported: 'Terminating existing group replication donor connection and purging the corresponding logs.'
2019-03-04T23:49:37.051169Z 41 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2019-03-04T23:49:37.069184Z 41 [ERROR] Plugin group_replication reported: 'Fatal error during the Recovery process of Group Replication. The server will leave the group.'
2019-03-04T23:49:37.069304Z 41 [Note] Plugin group_replication reported: 'Going to wait for view modification'
2019-03-04T23:49:40.336938Z 0 [Note] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
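
Error 1236 in this log means the donor has already purged binary logs containing transactions that the joining node still needs. Put differently, recovery can only proceed when the donor's gtid_purged set is a subset of the joiner's gtid_executed set. The following is a minimal sketch of that check in plain JavaScript (runnable under Node.js). The function names and the sample GTID values are hypothetical and chosen only for illustration; the real values would come from `SELECT @@global.gtid_purged` on a donor and `SELECT @@global.gtid_executed` on the joiner.

```javascript
// Parse a GTID set such as "uuid:1-5:7, uuid2:1-3" into a Map of
// uuid -> sorted, merged [start, end] intervals.
function parseGtidSet(text) {
  const result = new Map();
  for (const part of text.split(",")) {
    const trimmed = part.trim();
    if (trimmed === "") continue;
    const [uuid, ...ranges] = trimmed.split(":");
    const intervals = result.get(uuid) || [];
    for (const range of ranges) {
      const [start, end] = range.split("-").map(Number);
      intervals.push([start, end === undefined ? start : end]);
    }
    result.set(uuid, intervals);
  }
  for (const [uuid, intervals] of result) {
    intervals.sort((a, b) => a[0] - b[0]);
    const merged = [];
    for (const [start, end] of intervals) {
      const last = merged[merged.length - 1];
      if (last && start <= last[1] + 1) last[1] = Math.max(last[1], end);
      else merged.push([start, end]);
    }
    result.set(uuid, merged);
  }
  return result;
}

// True when every transaction in `subset` is also present in `superset`.
function isGtidSubset(subset, superset) {
  for (const [uuid, intervals] of subset) {
    const cover = superset.get(uuid) || [];
    for (const [start, end] of intervals) {
      if (!cover.some(([s, e]) => s <= start && end <= e)) return false;
    }
  }
  return true;
}

// Recovery from a donor fails (error 1236) when the donor has purged
// transactions that the joiner has not yet executed.
function recoveryWouldFail(joinerExecuted, donorPurged) {
  return !isGtidSubset(parseGtidSet(donorPurged), parseGtidSet(joinerExecuted));
}

// Hypothetical values, for illustration only.
const uuid = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa";
console.log(recoveryWouldFail(`${uuid}:1-1000`, `${uuid}:1-1500`)); // true
console.log(recoveryWouldFail(`${uuid}:1-1000`, `${uuid}:1-500`));  // false
```

When the check returns true for every available donor, incremental recovery is not possible and the node has to be re-provisioned from a backup.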

Recommended answer

I did the following to restore the failed node from a backup and was able to recover the cluster state.

1) Below is the status of the cluster after one of the nodes (NODE01) had failed.

MySQL NODE02:3306 ssl JS > var c=dba.getCluster()
MySQL NODE02:3306 ssl JS > c.status()
{
    "clusterName": "QACluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "NODE03:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
        "topology": {
            "NODE02:3306": {
                "address": "NODE02:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "NODE03:3306": {
                "address": "NODE03:3306",
                "mode": "R/W",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "NODE01:3306": {
                "address": "NODE01:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)"
            }
        }
    },
    "groupInformationSourceMember": "mysql://clusterAdmin@NODE03:3306"
}
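
With more members, picking the inactive ones out of this output by eye gets tedious. As a rough sketch, the helper below lists every member whose status is not ONLINE. It is plain JavaScript runnable under Node.js; the name `inactiveMembers` is hypothetical, and the sample object is a trimmed copy of the status output above.

```javascript
// Return the addresses of members whose status is not "ONLINE".
// `status` is assumed to have the shape printed by c.status().
function inactiveMembers(status) {
  const topology = status.defaultReplicaSet.topology;
  return Object.keys(topology).filter(
    (address) => topology[address].status !== "ONLINE"
  );
}

// Trimmed copy of the status output shown in step 1.
const sampleStatus = {
  defaultReplicaSet: {
    topology: {
      "NODE02:3306": { mode: "R/O", status: "ONLINE" },
      "NODE03:3306": { mode: "R/W", status: "ONLINE" },
      "NODE01:3306": { mode: "R/O", status: "(MISSING)" },
    },
  },
};

console.log(inactiveMembers(sampleStatus)); // prints [ 'NODE01:3306' ]
```

In a MySQL Shell JS session the same function could presumably be given the result of c.status() directly, although that was not tested here.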

2) Take a mysqldump from the master (primary) node, which is healthy, using the following command.

[root@NODE03 db_backup]# mysqldump --all-databases --add-drop-database --single-transaction --triggers --routines --port=mysql_port --user=root -p > /db_backup/mysql_dump_03062019.sql
Enter password:
Warning: A partial dump from a server that has GTIDs will by default include the GTIDs of all transactions, even those that changed suppressed parts of the database. If you don't want to restore GTIDs, pass --set-gtid-purged=OFF. To make a complete dump, pass --all-databases --triggers --routines --events.
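
As the warning indicates, the dump carries the source server's GTIDs. In MySQL 5.7, mysqldump records them as a `SET @@GLOBAL.GTID_PURGED='...'` statement near the top of the file, and that statement can only be applied while gtid_executed on the target is empty, which is why the RESET MASTER in step 5 matters. Below is a minimal sketch for confirming that a dump contains this statement. The function name and the sample text are hypothetical, and the regular expression assumes the 5.7 dump format.

```javascript
// Extract the GTID set from the SET @@GLOBAL.GTID_PURGED statement in a
// mysqldump file (MySQL 5.7 format assumed). Returns null when absent.
function extractGtidPurged(dumpText) {
  const match = dumpText.match(/SET\s+@@GLOBAL\.GTID_PURGED\s*=\s*'([^']*)'/i);
  // mysqldump breaks long GTID sets across lines, so strip whitespace.
  return match ? match[1].replace(/\s+/g, "") : null;
}

// Hypothetical dump header, for illustration only.
const sampleDump = [
  "SET @@SESSION.SQL_LOG_BIN= 0;",
  "SET @@GLOBAL.GTID_PURGED='aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-100,",
  "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb:1-12';",
].join("\n");

console.log(extractGtidPurged(sampleDump));
console.log(extractGtidPurged("CREATE TABLE t (id INT PRIMARY KEY);")); // null
```

For a large dump, reading only the first part of the file should be enough, since the statement appears before any table data.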

3) Run the following to remove the failed node from the cluster metadata.

MySQL NODE03:3306 ssl JS > var c=dba.getCluster()
MySQL NODE03:3306 ssl JS > c.rescan()
Rescanning the cluster...

Result of the rescanning operation:
{
    "defaultReplicaSet": {
        "name": "default",
        "newlyDiscoveredInstances": [],
        "unavailableInstances": [
            {
                "host": "NODE01:3306",
                "label": "NODE01:3306",
                "member_id": "e2aa897d-1828-11e9-85b3-00505692188c"
            }
        ]
    }
}

The instance 'NODE01:3306' is no longer part of the HA setup. It is either offline or left the HA group.
You can try to add it to the cluster again with the cluster.rejoinInstance('NODE01:3306') command or you can remove it from the cluster configuration.
Would you like to remove it from the cluster metadata? [Y/n]: Y
Removing instance from the cluster metadata...

The instance 'NODE01:3306' was successfully removed from the cluster metadata.

MySQL NODE03:3306 ssl JS > c.status()
{
    "clusterName": "QACluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "NODE03:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures.",
        "topology": {
            "NODE02:3306": {
                "address": "NODE02:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "NODE03:3306": {
                "address": "NODE03:3306",
                "mode": "R/W",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            }
        }
    },
    "groupInformationSourceMember": "mysql://clusterAdmin@NODE03:3306"
}

4) Stop group replication if it is still running on the failed node.

mysql> STOP GROUP_REPLICATION;
Query OK, 0 rows affected (1.01 sec)

5) Reset "gtid_executed" on the failed node.

mysql> show global variables like 'GTID_EXECUTED';
+---------------+--------------------------------------------------------------------------------------------+
| Variable_name | Value                                                                                      |
+---------------+--------------------------------------------------------------------------------------------+
| gtid_executed | 01f27b9c-182a-11e9-a199-00505692188c:1-14134172, e2aa897d-1828-11e9-85b3-00505692188c:1-12 |
+---------------+--------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

mysql> reset master;
Query OK, 0 rows affected (0.02 sec)

mysql> reset slave;
Query OK, 0 rows affected (0.02 sec)

mysql> show global variables like 'GTID_EXECUTED';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| gtid_executed |       |
+---------------+-------+
1 row in set (0.00 sec)

6) Disable "super_read_only" on the failed node so that the dump can be restored.

mysql> SELECT @@global.read_only, @@global.super_read_only;
+--------------------+--------------------------+
| @@global.read_only | @@global.super_read_only |
+--------------------+--------------------------+
|                  1 |                        1 |
+--------------------+--------------------------+
1 row in set (0.00 sec)

mysql> SET GLOBAL super_read_only = 0;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT @@global.read_only, @@global.super_read_only;
+--------------------+--------------------------+
| @@global.read_only | @@global.super_read_only |
+--------------------+--------------------------+
|                  1 |                        0 |
+--------------------+--------------------------+
1 row in set (0.00 sec)

7) Restore the mysqldump taken from the master onto the failed node.

[root@E2LXQA1ALFDB01 db_backup]# mysql -uroot -p < mysql_dump_03062019.sql

8) Once the restore has completed, re-enable "super_read_only" on the failed node.

mysql> SELECT @@global.read_only, @@global.super_read_only;
+--------------------+--------------------------+
| @@global.read_only | @@global.super_read_only |
+--------------------+--------------------------+
|                  1 |                        0 |
+--------------------+--------------------------+
1 row in set (0.00 sec)

mysql> SET GLOBAL super_read_only = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT @@global.read_only, @@global.super_read_only;
+--------------------+--------------------------+
| @@global.read_only | @@global.super_read_only |
+--------------------+--------------------------+
|                  1 |                        1 |
+--------------------+--------------------------+
1 row in set (0.00 sec)

9) Finally, add the failed node back to the InnoDB cluster.

MySQL NODE03:3306 ssl JS > c.addInstance('clusterAdmin@NODE01:3306');
A new instance will be added to the InnoDB cluster. Depending on the amount of data on the cluster this might take from a few seconds to several hours.

Adding instance to the cluster ...

Please provide the password for 'clusterAdmin@NODE01:3306': *******************
Save password for 'clusterAdmin@NODE01:3306'? [Y]es/[N]o/Ne[v]er (default No):
Validating instance at NODE01:3306...

This instance reports its own address as NODE01
WARNING: The following tables do not have a Primary Key or equivalent column:
ephesoft.dlf, report.correction_type, report.field_details_ag, report_archive.correction_type, report_archive.field_details_ag, report_archive.global_data_ag

Group Replication requires tables to use InnoDB and have a PRIMARY KEY or PRIMARY KEY Equivalent (non-null unique key). Tables that do not follow these requirements will be readable but not updateable when used with Group Replication. If your applications make updates (INSERT, UPDATE or DELETE) to these tables, ensure they use the InnoDB storage engine and have a PRIMARY KEY or PRIMARY KEY Equivalent.

Instance configuration is suitable.
WARNING: On instance 'NODE01:3306' membership change cannot be persisted since MySQL version 5.7.24 does not support the SET PERSIST command (MySQL version >= 8.0.11 required). Please use the .configureLocalInstance command locally to persist the changes.
WARNING: On instance 'NODE02:3306' membership change cannot be persisted since MySQL version 5.7.24 does not support the SET PERSIST command (MySQL version >= 8.0.11 required). Please use the .configureLocalInstance command locally to persist the changes.
WARNING: On instance 'NODE03:3306' membership change cannot be persisted since MySQL version 5.7.24 does not support the SET PERSIST command (MySQL version >= 8.0.11 required). Please use the .configureLocalInstance command locally to persist the changes.
The instance 'clusterAdmin@NODE01:3306' was successfully added to the cluster.

MySQL NODE03:3306 ssl JS > c.status()
{
    "clusterName": "QACluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "NODE03:3306",
        "ssl": "REQUIRED",
        "status": "OK",
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
        "topology": {
            "NODE01:3306": {
                "address": "NODE01:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "NODE02:3306": {
                "address": "NODE02:3306",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            },
            "NODE03:3306": {
                "address": "NODE03:3306",
                "mode": "R/W",
                "readReplicas": {},
                "role": "HA",
                "status": "ONLINE"
            }
        }
    },
    "groupInformationSourceMember": "mysql://clusterAdmin@NODE03:3306"
}
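
The change from OK_NO_TOLERANCE to OK follows from Group Replication's majority rule: a group of n online members stays writable only while more than half of them can still reach each other, so it tolerates floor((n - 1) / 2) failures. A small sketch of that arithmetic, with a hypothetical function name:

```javascript
// Number of member failures a group can survive while keeping a majority.
function tolerableFailures(onlineMembers) {
  return Math.max(0, Math.floor((onlineMembers - 1) / 2));
}

console.log(tolerableFailures(2)); // 0, as with NODE01 missing
console.log(tolerableFailures(3)); // 1, as after NODE01 was re-added
```

This is why a two-member cluster reports no fault tolerance even though both remaining members are ONLINE.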
