nagios-event-trigger-AutoRecover

Approach:

Use NRPE to execute the necessary commands on the remote hosts.

In order to adapt the scheme from the Nagios docs to work on remote servers as well, three things need to be done:


1. The command that is executed by the event handler script should be changed to use NRPE.

2. On the remote machine, the nagios user (under which the NRPE service is running) should be given some sudo rights so that it is actually allowed to start a service.

3. The NRPE configuration on the remote machine should be changed to include the new command(s) for starting services.
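
Step 2 is not spelled out further in these notes. As a rough sketch, assuming the NRPE daemon runs as user nagios and the restart commands are /etc/init.d/gmond restart and /usr/local/nagios/libexec/restart_mysqld (the ones configured in nrpe.cfg in section 2), the sudoers rules could be generated like this. The helper name sudoers_entry is made up for illustration.

```shell
#!/bin/sh
# Sketch only: prints a sudoers rule letting one user run one exact command
# as root without a password. sudoers_entry is a hypothetical helper name.
sudoers_entry() {
    user="$1"
    shift
    # "$*" joins the remaining arguments (command plus its arguments) with spaces.
    printf '%s ALL=(root) NOPASSWD: %s\n' "$user" "$*"
}

# Rules for the two restart commands assumed above:
sudoers_entry nagios /etc/init.d/gmond restart
sudoers_entry nagios /usr/local/nagios/libexec/restart_mysqld
```

The printed lines would go into a file under /etc/sudoers.d/ (edited with visudo -f). On systems where sudo is configured with requiretty, an extra line "Defaults:nagios !requiretty" is likely needed, since NRPE runs without a terminal.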


1. Nagios management server

(1) vi localhost.cfg

define service{
        use                     generic-service
        host_name               test2.bigdata.com
        service_description     gmond
        check_command           check_nrpe_eventhandler!check_gmond
        notifications_enabled   1
        notification_interval   0
        max_check_attempts      4
        event_handler           restart-service!gmond
        }

define service{
        use                     generic-service
        host_name               test2.bigdata.com
        service_description     mysqld
        check_command           check_nrpe_eventhandler!check_mysqld
        notifications_enabled   1
        notification_interval   0
        max_check_attempts      5
        event_handler           restart-service!mysqld
        }


(2) vi commands.cfg

define command{
        command_name    check_nrpe_eventhandler
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 40 -c $ARG1$
        }

define command{
        command_name    restart-service
        command_line    $USER1$/eventhandlers/event_handler_script.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$ $ARG1$ $SERVICEDESC$
        }


The generic event handler script written below, /usr/local/nagios/libexec/eventhandlers/event_handler_script.sh, is passed the name of the monitored service through $ARG1$.
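
To make the argument order concrete, here is a small sketch that labels the six positional parameters in the order the restart-service command line passes them. The helper name describe_handler_args and the address 192.0.2.10 are placeholders for illustration, not part of the setup.

```shell
#!/bin/sh
# Sketch: label the six positional parameters as restart-service passes them.
# describe_handler_args is a hypothetical helper, not part of Nagios.
describe_handler_args() {
    # $1=$SERVICESTATE$  $2=$SERVICESTATETYPE$  $3=$SERVICEATTEMPT$
    # $4=$HOSTADDRESS$   $5=$ARG1$              $6=$SERVICEDESC$
    printf 'state=%s type=%s attempt=%s host=%s nrpe_command=restart_%s desc=%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example values for the gmond service; the address is a placeholder.
describe_handler_args CRITICAL SOFT 3 192.0.2.10 gmond gmond
```

The fifth argument is what ties the two sides together: restart-service!gmond on the management server becomes the NRPE command restart_gmond on the remote machine.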


2. On the remote machine with NRPE running

(1) vi nrpe.cfg

command[check_gmond]=/usr/local/nagios/libexec/check_gmond
command[restart_gmond]=/usr/bin/sudo /etc/init.d/gmond restart
command[check_mysqld]=/usr/local/nagios/libexec/check_mysqld
command[restart_mysqld]=/usr/bin/sudo /usr/local/nagios/libexec/restart_mysqld

(2) Edit your service management scripts on the remote machine.

/usr/local/nagios/libexec/check_mysqld

/usr/local/nagios/libexec/restart_mysqld
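
The contents of these scripts are not shown here. As a rough sketch of what a check script of this kind could look like, assuming the usual Nagios plugin exit codes (0 for OK, 2 for CRITICAL) and that pgrep is available on the remote machine; check_process is a hypothetical name, and a real check might test more than process existence:

```shell
#!/bin/sh
# Sketch of a process-existence check using Nagios plugin exit codes.
# check_process is a hypothetical helper name.
check_process() {
    name="$1"
    # pgrep -x matches the process name exactly; its output is discarded.
    if pgrep -x "$name" >/dev/null 2>&1; then
        echo "OK - $name is running"
        return 0
    else
        echo "CRITICAL - $name is not running"
        return 2
    fi
}

# In a standalone script such as check_mysqld, the last lines would be:
#   check_process mysqld
#   exit $?
```

NRPE passes the script's first output line and its exit code back to the management server, so the text after "OK" or "CRITICAL" is what shows up in the Nagios status view.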



3. Contents of the generic event handler script /usr/local/nagios/libexec/eventhandlers/event_handler_script.sh:

#!/bin/sh
#
# Event handler script for restarting a service on a remote machine via NRPE
#
# Arguments (as passed by the restart-service command definition):
#   $1 = $SERVICESTATE$    $2 = $SERVICESTATETYPE$    $3 = $SERVICEATTEMPT$
#   $4 = $HOSTADDRESS$     $5 = service name ($ARG1$) $6 = $SERVICEDESC$
#
# Note: This script will only restart the service if the check has been
#       retried 3 times (in a "soft" state) or if the service somehow
#       manages to fall into a "hard" error state.
# update 2015/10/23
# version: 0.2

date=$(date)

# What state is the service in?
case "$1" in
OK)
    # The service just came back up, so don't do anything...
    ;;
WARNING)
    # We don't really care about warning states, since the service is probably still running...
    ;;
UNKNOWN)
    # We don't know what might be causing an unknown error, so don't do anything...
    ;;
CRITICAL)
    # Aha! The service appears to have a problem - perhaps we should restart it...
    # Is this a "soft" or a "hard" state?
    case "$2" in

    # We're in a "soft" state, meaning that Nagios is in the middle of retrying the
    # check before it turns into a "hard" state and contacts get notified...
    SOFT)
        # What check attempt are we on? We don't want to restart the service on the first
        # check, because it may just be a fluke!
        case "$3" in

        # Wait until the check has been tried 3 times before restarting the service.
        # If the check fails on the 4th time (after we restart the service, and with
        # max_check_attempts set to 4), the state type will turn to "hard" and contacts
        # will be notified of the problem.
        # Hopefully this will restart the service successfully, so the 4th check will
        # result in a "soft" recovery. If that happens no one gets notified because we
        # fixed the problem!
        3)
            echo "Restarting service $6 (3rd soft critical state)..."
            # Call NRPE to restart the service on the remote machine
            /usr/local/nagios/libexec/check_nrpe -H "$4" -c "restart_$5"
            echo "$date - restart $6 on server $4 - at retry $3 times - SOFT" >> /tmp/eventhandlers
            ;;
        esac
        ;;

    # The service somehow managed to turn into a hard error without getting fixed.
    # It should have been restarted by the code above, but for some reason it wasn't.
    # Let's give it one last try, shall we?
    # Note: Contacts have already been notified of a problem with the service at this
    # point (unless you disabled notifications for this service)
    HARD)
        echo "Restarting $6 service..."
        echo "$date - restart $6 on server $4 - at retry $3 times - HARD" >> /tmp/eventhandlers
        # Call NRPE to restart the service on the remote machine
        /usr/local/nagios/libexec/check_nrpe -H "$4" -c "restart_$5"
        ;;
    esac
    ;;
esac
exit 0
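
The decision the handler makes can be isolated in a small predicate, which is handy for checking the logic without touching any service. A minimal sketch, with should_restart as a hypothetical helper name that is not part of the script above:

```shell
#!/bin/sh
# Sketch: returns 0 (true) when the event handler above would trigger a restart.
# should_restart is a hypothetical helper name.
should_restart() {
    state="$1"      # $SERVICESTATE$
    type="$2"       # $SERVICESTATETYPE$
    attempt="$3"    # $SERVICEATTEMPT$

    # Only CRITICAL states ever lead to a restart.
    [ "$state" = "CRITICAL" ] || return 1

    case "$type" in
        SOFT) [ "$attempt" = "3" ] ;;   # restart only on the 3rd soft attempt
        HARD) return 0 ;;               # last try once the state is hard
        *)    return 1 ;;
    esac
}

if should_restart CRITICAL SOFT 3; then
    echo "restart would be triggered"
fi
```

Because the soft restart fires only on attempt 3, max_check_attempts has to be at least 4 for the service to get a chance to recover before contacts are notified.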