heartbeat_check in oslo_messaging
Recently I have been testing OpenStack control-plane high availability (three controllers). When one of the control nodes is shut down, nova service-list shows every nova service as down, and the nova-compute log is flooded with errors like this:
2016-11-08 03:46:23.887 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.275 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.276 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.276 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.277 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.277 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.278 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
2016-11-08 03:46:27.278 127895 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 32] Broken pipe
The exception being thrown was traced to oslo_messaging/_drivers/impl_rabbit.py:
def _heartbeat_thread_job(self):
    """Thread that maintains inactive connections"""
    while not self._heartbeat_exit_event.is_set():
        with self._connection_lock.for_heartbeat():

            recoverable_errors = (
                self.connection.recoverable_channel_errors +
                self.connection.recoverable_connection_errors)

            try:
                try:
                    self._heartbeat_check()
                    # NOTE(sileht): We need to drain event to receive
                    # heartbeat from the broker but don't hold the
                    # connection too much times. In amqpdriver a connection
                    # is used exclusivly for read or for write, so we have
                    # to do this for connection used for write drain_events
                    # already do that for other connection
                    try:
                        self.connection.drain_events(timeout=0.001)
                    except socket.timeout:
                        pass
                except recoverable_errors as exc:
                    LOG.info(_LI("A recoverable connection/channel error "
                                 "occurred, trying to reconnect: %s"), exc)
                    self.ensure_connection()
            except Exception:
                LOG.warning(_LW("Unexpected error during heartbeart "
                                "thread processing, retrying..."))
                LOG.debug('Exception', exc_info=True)

        self._heartbeat_exit_event.wait(
            timeout=self._heartbeat_wait_timeout)
    self._heartbeat_exit_event.clear()
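For context, the interval between iterations of this loop, self._heartbeat_wait_timeout, is not the raw heartbeat_timeout_threshold; as I read the Connection constructor in impl_rabbit (paraphrased here, the exact code may vary between releases), it is derived from the threshold and heartbeat_rate so that several checks happen within one timeout window:

    # Paraphrased from Connection.__init__ in impl_rabbit: with the default
    # heartbeat_rate of 2, the heartbeat thread wakes up roughly every
    # heartbeat_timeout_threshold / 4 seconds.
    self._heartbeat_wait_timeout = (
        float(self.heartbeat_timeout_threshold) /
        float(self.heartbeat_rate) / 2.0)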
The heartbeat check exists to detect whether the connection between a component service and the RabbitMQ server is still alive, and the heartbeat_check task in oslo_messaging runs in the background from the moment the service starts. When a control node is shut down, one of the RabbitMQ server nodes goes down with it. The problem is that the thread then keeps spinning in the while loop, repeatedly raising the exceptions caught by recoverable_errors, and it only leaves the loop once self._heartbeat_exit_event.is_set() returns True. Arguably there should be some kind of retry limit or timeout here, so that the thread does not sit in this loop for several minutes before things recover; a hypothetical sketch of that idea follows below.
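To make that suggestion concrete, here is a purely hypothetical sketch (not what oslo.messaging actually does) of a bounded retry inside the heartbeat loop; max_recover_attempts is a made-up name, not a real oslo option, and the snippet reuses the attributes from the method quoted above:

    # Hypothetical sketch only: give up after a bounded number of consecutive
    # reconnect failures instead of looping until the exit event is set.
    max_recover_attempts = 5          # made-up knob, not a real config option
    failures = 0

    while not self._heartbeat_exit_event.is_set():
        with self._connection_lock.for_heartbeat():
            try:
                self._heartbeat_check()
                failures = 0          # reset the counter after a successful check
            except recoverable_errors as exc:
                failures += 1
                LOG.info("Recoverable error (%d/%d), trying to reconnect: %s",
                         failures, max_recover_attempts, exc)
                if failures >= max_recover_attempts:
                    LOG.warning("Heartbeat reconnect kept failing, giving up")
                    break             # stop hammering the dead broker node
                self.ensure_connection()
        self._heartbeat_exit_event.wait(timeout=self._heartbeat_wait_timeout)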
Today I set up the three-controller HA environment in virtual machines and added the following options to nova.conf:
[oslo_messaging_rabbit]
rabbit_max_retries = 2            # maximum number of reconnection attempts
heartbeat_timeout_threshold = 0   # disable the heartbeat check
With this configuration, nova-compute no longer keeps throwing the exceptions caught by recoverable_errors, and nova service-list no longer shows all services as down.
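The heartbeat_timeout_threshold = 0 setting works because the driver only spawns the heartbeat thread when the threshold is positive; roughly, the gating check in impl_rabbit looks like this (paraphrased from the version I read, names and details may differ between releases):

    def _heartbeat_supported_and_enabled(self):
        # heartbeat_timeout_threshold <= 0 means heartbeats are disabled,
        # so the background heartbeat thread is never started
        if self.heartbeat_timeout_threshold <= 0:
            return False
        # otherwise heartbeats are used only if the broker supports them
        return self.connection.supports_heartbeats

Note that disabling the heartbeat only hides this particular symptom; it also means a silently dead TCP connection to RabbitMQ is no longer detected proactively.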
This still needs to be verified on physical machines.