这篇文章将为大家详细讲解有关MySQL高可用工具Orchestrator如何进行探测机制,文章内容质量较高,因此小编分享给大家做个参考,希望大家阅读完这篇文章后对相关知识有一定的了解。

故障检测

orch使用了一种整体性的方法去探测主库和中间主库是否正常。

一种比较天真的方法,比如,监控工具探测到主库无法连接或者查询,就发出报警。这种方法容易受到网络故障而造成误报。为了降低误报,会通过运行n次每次间隔t时间的方式。在某些情况下,这减少了误报的机会,但是增加了真正故障的响应时间。

orchestrator会利用复制拓扑。orch不仅会监测主库,也会检测从库。比如,要诊断出主库挂了的情况,orch必须满足以下两个条件:

联系不到主库。

可以联系到主库对应的从库,并且这些从库也连不上主库。

orch没有将错误按时间来进行分类,而是按复制拓扑服务器(也就是所谓的multiple observers)本身进行分类。实际上,当所有的从库都连不上主库的时候,说明复制拓扑实际上就被破坏了,有理由需要进行故障转移。

orch的这种整体故障检测方式在生产环境是非常可靠的。

检测机制

orch会每隔InstancePollSeconds(默认5s)时间去被监控的实例上拉取实例状态,并将这些状态信息存入orch的元数据库的orchestrator.database_instance表中,然后orch会每隔InstancePollSeconds秒从元数据库中获取每个instance的状态,展示在web界面上。

拉取实例状态的语句如下:

showvariableslike'maxscale%'showglobalstatuslike'Uptime'select@@global.hostname,ifnull(@@global.report_host,''),@@global.server_id,@@global.version,@@global.version_comment,@@global.read_only,@@global.binlog_format,@@global.log_bin,@@global.log_slave_updatesshowmasterstatusshowglobalstatuslike'rpl_semi_sync_%_status'select@@global.gtid_mode,@@global.server_uuid,@@global.gtid_executed,@@global.gtid_purged,@@global.master_info_repository='TABLE',@@global.binlog_row_imageshowslavestatusselectcount(*)>0andMAX(User_name)!=''frommysql.slave_master_infoshowslavehostsselectsubstring_index(host,':',1)asslave_hostnamefrominformation_schema.processlistwherecommandIN('BinlogDump','BinlogDumpGTID')SELECTSUBSTRING_INDEX(@@hostname,'.',1)

拉取得到实例状态之后,通过下面语句将状态值存入到orch的元数据库中:

注:values后面的值就是上面拉取到的实例状态值。

INSERTINTOdatabase_instance(hostname,port,last_checked,last_attempted_check,last_check_partial_success,uptime,server_id,server_uuid,version,major_version,version_comment,binlog_server,read_only,binlog_format,binlog_row_image,log_bin,log_slave_updates,binary_log_file,binary_log_pos,master_host,master_port,slave_sql_running,slave_io_running,replication_sql_thread_state,replication_io_thread_state,has_replication_filters,supports_oracle_gtid,oracle_gtid,master_uuid,ancestry_uuid,executed_gtid_set,gtid_mode,gtid_purged,gtid_errant,mariadb_gtid,pseudo_gtid,master_log_file,read_master_log_pos,relay_master_log_file,exec_master_log_pos,relay_log_file,relay_log_pos,last_sql_error,last_io_error,seconds_behind_master,slave_lag_seconds,sql_delay,num_slave_hosts,slave_hosts,cluster_name,suggested_cluster_alias,data_center,region,physical_environment,replication_depth,is_co_master,replication_credentials_available,has_replication_credentials,allow_tls,semi_sync_enforced,semi_sync_master_enabled,semi_sync_replica_enabled,instance_alias,last_discovery_latency,last_seen)VALUES('10.10.30.5',3306,NOW(),NOW(),1,322504,1521,'e2685a0f-d8f8-11e9-a2c9-002590e95c3c','5.7.22-log','5.7','MySQLCommunityServer(GPL)',0,1,'ROW','FULL',1,1,'mysql-bin.000016',129186924,'10.10.30.6',3306,1,1,1,1,0,1,1,'6bf30525-d8f8-11e9-808c-0cc47a74fca8','6bf30525-d8f8-11e9-808c-0cc47a74fca8,e2685a0f-d8f8-11e9-a2c9-002590e95c3c','6bf30525-d8f8-11e9-808c-0cc47a74fca8:1-1554568,\ne2685a0f-d8f8-11e9-a2c9-002590e95c3c:1-632541','ON','','',0,0,'mysql-bin.000017',150703414,'mysql-bin.000017',150703414,'mysql-relay-bin.000052',137056344,'','',0,0,0,1,'[{\"Hostname\":\"10.10.30.6\",\"Port\":3306}]','10.10.30.6:3306','qhp-6','','','',1,1,1,1,0,0,0,0,'',8083748,NOW())ONDUPLICATEKEYUPDATEhostname=VALUES(hostname),port=VALUES(port),last_checked=VALUES(last_checked),last_attempted_check=VALUES(last_attempted_check),last_check_partial_success=VALUES(last_check_partial_success),uptime=VALUES(uptime),server_id=VALUES(server_id),server_uuid=VALUES(server_uuid),version=VALUES(version),major_version=VALUES(major_version),version_comment=VALUES(version_comment),binlog_server=VALUES(binlog_server),read_only=VALUES(read_only),binlog_format=VALUES(binlog_format),binlog_row_image=VALUES(binlog_row_image),log_bin=VALUES(log_bin),log_slave_updates=VALUES(log_slave_updates),binary_log_file=VALUES(binary_log_file),binary_log_pos=VALUES(binary_log_pos),master_host=VALUES(master_host),master_port=VALUES(master_port),slave_sql_running=VALUES(slave_sql_running),slave_io_running=VALUES(slave_io_running),replication_sql_thread_state=VALUES(replication_sql_thread_state),replication_io_thread_state=VALUES(replication_io_thread_state),has_replication_filters=VALUES(has_replication_filters),supports_oracle_gtid=VALUES(supports_oracle_gtid),oracle_gtid=VALUES(oracle_gtid),master_uuid=VALUES(master_uuid),ancestry_uuid=VALUES(ancestry_uuid),executed_gtid_set=VALUES(executed_gtid_set),gtid_mode=VALUES(gtid_mode),gtid_purged=VALUES(gtid_purged),gtid_errant=VALUES(gtid_errant),mariadb_gtid=VALUES(mariadb_gtid),pseudo_gtid=VALUES(pseudo_gtid),master_log_file=VALUES(master_log_file),read_master_log_pos=VALUES(read_master_log_pos),relay_master_log_file=VALUES(relay_master_log_file),exec_master_log_pos=VALUES(exec_master_log_pos),relay_log_file=VALUES(relay_log_file),relay_log_pos=VALUES(relay_log_pos),last_sql_error=VALUES(last_sql_error),last_io_error=VALUES(last_io_error),seconds_behind_master=VALUES(seconds_behind_master),slave_lag_seconds=VALUES(slave_lag_seconds),sql_delay=VALUES(sql_delay),num_slave_hosts=VALUES(num_slave_hosts),slave_hosts=VALUES(slave_hosts),cluster_name=VALUES(cluster_name),suggested_cluster_alias=VALUES(suggested_cluster_alias),data_center=VALUES(data_center),region=VALUES(region),physical_environment=VALUES(physical_environment),replication_depth=VALUES(replication_depth),is_co_master=VALUES(is_co_master),replication_credentials_available=VALUES(replication_credentials_available),has_replication_credentials=VALUES(has_replication_credentials),allow_tls=VALUES(allow_tls),semi_sync_enforced=VALUES(semi_sync_enforced),semi_sync_master_enabled=VALUES(semi_sync_master_enabled),semi_sync_replica_enabled=VALUES(semi_sync_replica_enabled),instance_alias=VALUES(instance_alias),last_discovery_latency=VALUES(last_discovery_latency),last_seen=VALUES(last_seen)

然后orch会每隔InstancePollSeconds秒从元数据库中获取每个被监控实例的状态,通过web端展示到页面上。

探测实例失败

如果某个instance挂了,orch每隔InstancePollSeconds时间拉取实例状态失败,无法获取到最新的实例状态,也就无法用上面这条insert将实例状态存入到元数据库中,那么orch会按下面的方式更新元数据库:

//每隔InstancePollSeconds时间更新database_instance表的last_checked和last_check_partial_success字段updatedatabase_instancesetlast_checked=NOW(),last_check_partial_success=0wherehostname='10.10.30.170'andport=3306//每隔InstancePollSeconds+1s时间更新database_instance表的last_attempted_check字段updatedatabase_instancesetlast_attempted_check=NOW()wherehostname='10.10.30.170'andport=3306

这里为什么要引入last_attempted_check,摘两处源码中的注释。

//UpdateInstanceLastAttemptedCheckupdatesthelast_attempted_checktimestampintheorchestratorbackeddatabase//foragiveninstance.//Thisisusedasafailsafemechanismincaseaccesstotheinstancegetshung(ithappens),inwhichcase//theentireReadTopologygetsstuck(andno,connectiontimeoutnordrivertimeoutsdon'thelp.Don'tlookatme,//theworldisaharshplacetolivein).//Andsowemakesuretonotedown*before*weevenattempttoaccesstheinstance;andthisraisesaredflagwhenwe//wishtoaccesstheinstanceagain:iflast_attempted_checkis*newer*thanlast_checked,that'sbadnewsandmeans//wehavea"hanging"issue.funcUpdateInstanceLastAttemptedCheck(instanceKey*InstanceKey)error{writeFunc:=func()error{_,err:=db.ExecOrchestrator(`updatedatabase_instancesetlast_attempted_check=NOW()wherehostname=?andport=?`,instanceKey.Hostname,instanceKey.Port,)returnlog.Errore(err)}returnExecDBWriteFunc(writeFunc)}//ValidSecondsFromSeenToLastAttemptedCheckreturnsthemaximumallowedelapsedtime//betweenlast_attempted_checktolast_checkedbeforeweconsidertheinstanceasinvalid.funcValidSecondsFromSeenToLastAttemptedCheck()uint{returnconfig.Config.InstancePollSeconds+1}

判断实例是否存活

被orch监控的实例是否正常,通过如下方式进行判断:

//instance每隔InstancePollSeconds时间通过如下方式判断某个instance是否正常selectifnull(last_checked<=last_seen,0)asis_last_check_validfromdatabase_instancewherehostname='10.10.30.170'andport=3306orderbyhostname,port;//masterinstance是否存活的判断比较复杂//首先,orch会每秒通过类似于如下方式判断masterinstance是否正常SELECTMIN(master_instance.last_checked<=master_instance.last_seenandmaster_instance.last_attempted_check<=master_instance.last_seen+interval6second)=1ASis_last_check_validfromdatabase_instancemaster_instanceGROUPBYmaster_instance.hostname,master_instance.port;//如果is_last_check_valid为0,然后需要判断masterinstance对应的从库是否连得上,并且这些从库是否也连不上主库//整个sql如下:SELECTmaster_instance.hostname,master_instance.port,master_instance.read_onlyASread_only,MIN(master_instance.data_center)ASdata_center,MIN(master_instance.region)ASregion,MIN(master_instance.physical_environment)ASphysical_environment,MIN(master_instance.master_host)ASmaster_host,MIN(master_instance.master_port)ASmaster_port,MIN(master_instance.cluster_name)AScluster_name,MIN(IFNULL(cluster_alias.alias,master_instance.cluster_name))AScluster_alias,MIN(master_instance.last_checked<=master_instance.last_seenandmaster_instance.last_attempted_check<=master_instance.last_seen+interval6second)=1ASis_last_check_valid,MIN(master_instance.last_check_partial_success)aslast_check_partial_success,MIN(master_instance.master_hostIN('','_')ORmaster_instance.master_port=0ORsubstr(master_instance.master_host,1,2)='//')ASis_master,MIN(master_instance.is_co_master)ASis_co_master,MIN(CONCAT(master_instance.hostname,':',master_instance.port)=master_instance.cluster_name)ASis_cluster_master,MIN(master_instance.gtid_mode)ASgtid_mode,COUNT(replica_instance.server_id)AScount_replicas,IFNULL(SUM(replica_instance.last_checked<=replica_instance.last_seen),0)AScount_valid_slaves,IFNULL(SUM(replica_instance.last_checked<=replica_instance.last_seenANDreplica_instance.slave_io_running!=0ANDreplica_instance.slave_sql_running!=0),0)AScount_valid_replicating_slaves,IFNULL(SUM(replica_instance.last_checked<=replica_instance.last_seenANDreplica_instance.slave_io_running=0ANDreplica_instance.last_io_errorlike'%error%connectingtomaster%'ANDreplica_instance.slave_sql_running=1),0)AScount_replicas_failing_to_connect_to_master,MIN(master_instance.replication_depth)ASreplication_depth,GROUP_CONCAT(concat(replica_instance.Hostname,':',replica_instance.Port))asslave_hosts,MIN(master_instance.slave_sql_running=1ANDmaster_instance.slave_io_running=0ANDmaster_instance.last_io_errorlike'%error%connectingtomaster%')ASis_failing_to_connect_to_master,MIN(master_downtime.downtime_activeisnotnullandifnull(master_downtime.end_timestamp,now())>now())ASis_downtimed,MIN(IFNULL(master_downtime.end_timestamp,''))ASdowntime_end_timestamp,MIN(IFNULL(unix_timestamp()-unix_timestamp(master_downtime.end_timestamp),0))ASdowntime_remaining_seconds,MIN(master_instance.binlog_server)ASis_binlog_server,MIN(master_instance.pseudo_gtid)ASis_pseudo_gtid,MIN(master_instance.supports_oracle_gtid)ASsupports_oracle_gtid,SUM(replica_instance.oracle_gtid)AScount_oracle_gtid_slaves,IFNULL(SUM(replica_instance.last_checked<=replica_instance.last_seenANDreplica_instance.oracle_gtid!=0),0)AScount_valid_oracle_gtid_slaves,SUM(replica_instance.binlog_server)AScount_binlog_server_slaves,IFNULL(SUM(replica_instance.last_checked<=replica_instance.last_seenANDreplica_instance.binlog_server!=0),0)AScount_valid_binlog_server_slaves,MIN(master_instance.mariadb_gtid)ASis_mariadb_gtid,SUM(replica_instance.mariadb_gtid)AScount_mariadb_gtid_slaves,IFNULL(SUM(replica_instance.last_checked<=replica_instance.last_seenANDreplica_instance.mariadb_gtid!=0),0)AScount_valid_mariadb_gtid_slaves,IFNULL(SUM(replica_instance.log_binANDreplica_instance.log_slave_updates),0)AScount_logging_replicas,IFNULL(SUM(replica_instance.log_binANDreplica_instance.log_slave_updatesANDreplica_instance.binlog_format='STATEMENT'),0)AScount_statement_based_loggin_slaves,IFNULL(SUM(replica_instance.log_binANDreplica_instance.log_slave_updatesANDreplica_instance.binlog_format='MIXED'),0)AScount_mixed_based_loggin_slaves,IFNULL(SUM(replica_instance.log_binANDreplica_instance.log_slave_updatesANDreplica_instance.binlog_format='ROW'),0)AScount_row_based_loggin_slaves,IFNULL(SUM(replica_instance.sql_delay>0),0)AScount_delayed_replicas,IFNULL(SUM(replica_instance.slave_lag_seconds>10),0)AScount_lagging_replicas,IFNULL(MIN(replica_instance.gtid_mode),'')ASmin_replica_gtid_mode,IFNULL(MAX(replica_instance.gtid_mode),'')ASmax_replica_gtid_mode,IFNULL(MAX(casewhenreplica_downtime.downtime_activeisnotnullandifnull(replica_downtime.end_timestamp,now())>now()then''elsereplica_instance.gtid_errantend),'')ASmax_replica_gtid_errant,IFNULL(SUM(replica_downtime.downtime_activeisnotnullandifnull(replica_downtime.end_timestamp,now())>now()),0)AScount_downtimed_replicas,COUNT(DISTINCTcasewhenreplica_instance.log_binANDreplica_instance.log_slave_updatesthenreplica_instance.major_versionelseNULLend)AScount_distinct_logging_major_versionsFROMdatabase_instancemaster_instanceLEFTJOINhostname_resolveON(master_instance.hostname=hostname_resolve.hostname)LEFTJOINdatabase_instancereplica_instanceON(COALESCE(hostname_resolve.resolved_hostname,master_instance.hostname)=replica_instance.master_hostANDmaster_instance.port=replica_instance.master_port)LEFTJOINdatabase_instance_maintenanceON(master_instance.hostname=database_instance_maintenance.hostnameANDmaster_instance.port=database_instance_maintenance.portANDdatabase_instance_maintenance.maintenance_active=1)LEFTJOINdatabase_instance_downtimeasmaster_downtimeON(master_instance.hostname=master_downtime.hostnameANDmaster_instance.port=master_downtime.portANDmaster_downtime.downtime_active=1)LEFTJOINdatabase_instance_downtimeasreplica_downtimeON(replica_instance.hostname=replica_downtime.hostnameANDreplica_instance.port=replica_downtime.portANDreplica_downtime.downtime_active=1)LEFTJOINcluster_aliasON(cluster_alias.cluster_name=master_instance.cluster_name)WHEREdatabase_instance_maintenance.database_instance_maintenance_idISNULLAND''IN('',master_instance.cluster_name)GROUPBYmaster_instance.hostname,master_instance.portHAVING(MIN(master_instance.last_checked<=master_instance.last_seenandmaster_instance.last_attempted_check<=master_instance.last_seen+interval6second)=1)=0OR(IFNULL(SUM(replica_instance.last_checked<=replica_instance.last_seenANDreplica_instance.slave_io_running=0ANDreplica_instance.last_io_errorlike'%error%connectingtomaster%'ANDreplica_instance.slave_sql_running=1),0)>0)OR(IFNULL(SUM(replica_instance.last_checked<=replica_instance.last_seen),0)<COUNT(replica_instance.server_id))OR(IFNULL(SUM(replica_instance.last_checked<=replica_instance.last_seenANDreplica_instance.slave_io_running!=0ANDreplica_instance.slave_sql_running!=0),0)<COUNT(replica_instance.server_id))OR(MIN(master_instance.slave_sql_running=1ANDmaster_instance.slave_io_running=0ANDmaster_instance.last_io_errorlike'%error%connectingtomaster%'))OR(COUNT(replica_instance.server_id)>0)ORDERBYis_masterDESC,is_cluster_masterDESC,count_replicasDESC\G

关于MySQL高可用工具Orchestrator如何进行探测机制就分享到这里了,希望以上内容可以对大家有一定的帮助,可以学到更多知识。如果觉得文章不错,可以把它分享出去让更多的人看到。