zz gave me a task: automate the handling of a particular Nagios alert.

I vaguely remembered having tried this feature in an offline test environment before.

I. First, check the official Nagios documentation to confirm this is feasible. Below is the information I collected and found useful.

1)
How commands are defined
Writing Event Handler Commands
Event handler commands will likely be shell or perl scripts, but they can be any type of executable that can run from a
command prompt. At a minimum, the scripts should take the following macros as arguments:
For Services: $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$
For Hosts: $HOSTSTATE$, $HOSTSTATETYPE$, $HOSTATTEMPT$
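This passage lists the macros that a host handler and a service handler should accept as arguments. The commands themselves are defined in objects/commands.cfg.
From the official documentation: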
define command{
command_name restart-httpd
command_line /usr/local/nagios/libexec/eventhandlers/restart-httpd $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
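
To make the argument order concrete, here is a minimal sketch of a handler that only records what Nagios passes in. The function name log_event and the log path /tmp/eventhandler.log are placeholders of mine, not something from the Nagios documentation.

```shell
#!/bin/sh
# Sketch: record the three macros that the command_line above passes in.
# log_event and LOG_FILE are illustrative names, not part of Nagios.
LOG_FILE="${LOG_FILE:-/tmp/eventhandler.log}"

log_event() {
    state="$1"       # $SERVICESTATE$      e.g. CRITICAL
    state_type="$2"  # $SERVICESTATETYPE$  SOFT or HARD
    attempt="$3"     # $SERVICEATTEMPT$    e.g. 3
    line="state=$state type=$state_type attempt=$attempt"
    echo "$line" >> "$LOG_FILE"
    echo "$line"
}

# When installed as a script, forward the script's own arguments:
# log_event "$@"
```

Running a logger like this first is a cheap way to confirm that the handler is being called at all before letting it restart anything.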

2)
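How the host/service configuration file is written (in my production setup, host and service definitions live in one merged file; they are usually kept in separate files)
From the official documentation: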
define service{
host_name somehost
service_description HTTP
max_check_attempts 4
event_handler restart-httpd
...
}


II. Some explanations:

1)
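Macro explanations:
$SERVICESTATE$: the current state of the service (OK, WARNING, UNKNOWN, CRITICAL)
$SERVICESTATETYPE$: the state type of the service, either SOFT or HARD
$SERVICEATTEMPT$: the number of the current check attempt, which counts the retries while the service is in a soft state
The self-recovery script later on has to handle these three arguments.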

2)
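HARD: hard state
SOFT: soft state
When a service check first fails, the service enters a SOFT state. If the check is retried up to the configured maximum number of attempts (max_check_attempts) and still fails, the state changes to HARD.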
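
The soft-to-hard transition can be sketched as a small function. This is a simplified model of my own (state_type_for is an illustrative helper, not a Nagios tool), and it only covers a run of consecutive failing checks.

```shell
#!/bin/sh
# Sketch: which state type a failing check is in, given the attempt number
# and max_check_attempts. state_type_for is an illustrative helper.
state_type_for() {
    attempt="$1"
    max_attempts="$2"
    if [ "$attempt" -lt "$max_attempts" ]; then
        echo SOFT
    else
        echo HARD
    fi
}

# With max_check_attempts 4, attempts 1 to 3 are SOFT and attempt 4 is HARD:
for n in 1 2 3 4; do
    echo "attempt $n: $(state_type_for "$n" 4)"
done
```

This is why a handler that acts on the 3rd soft attempt still leaves one more check before anyone is notified.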

3)
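Configuration parameters related to event handling:
event_handler_timeout=30 maximum time, in seconds, that an event handler may run
enable_event_handlers=1 turns the event handling mechanism on
event_handler names the command to run; it is set in the host or service definition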


III. Steps:
1)
Confirm that the event handler switch is turned on in nagios.cfg
enable_event_handlers=1
0: off
1: on
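
A quick way to check the switch from the command line might look like the sketch below. handlers_enabled is a helper name I made up; it only reports what is explicitly written in the file and does not know about Nagios defaults.

```shell
#!/bin/sh
# Sketch: report whether event handlers are enabled in a Nagios main config.
# handlers_enabled is an illustrative helper; pass the path to nagios.cfg.
handlers_enabled() {
    cfg="$1"
    # Take the last uncommented assignment and strip surrounding whitespace.
    value=$(grep -E '^[[:space:]]*enable_event_handlers[[:space:]]*=' "$cfg" \
            | tail -n 1 | cut -d= -f2 | tr -d '[:space:]')
    [ "$value" = "1" ]
}

# Example:
# handlers_enabled /usr/local/nagios/etc/nagios.cfg && echo "event handlers on"
```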

2)
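Writing the self-recovery script
There are many argument combinations to handle, so a case statement is recommended. Below is the example script from the official documentation; it can of course be written some other way.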
#!/bin/sh
#
# Event handler script for restarting the web server on the local machine
#
# Note: This script will only restart the web server if the service is
# retried 3 times (in a "soft" state) or if the web service somehow
# manages to fall into a "hard" error state.
#
# What state is the HTTP service in?
case "$1" in
OK)
    # The service just came back up, so don't do anything...
    ;;
WARNING)
    # We don't really care about warning states, since the service is probably still running...
    ;;
UNKNOWN)
    # We don't know what might be causing an unknown error, so don't do anything...
    ;;
CRITICAL)
    # Aha! The HTTP service appears to have a problem - perhaps we should restart the server...
    # Is this a "soft" or a "hard" state?
    case "$2" in

    # We're in a "soft" state, meaning that Nagios is in the middle of retrying the
    # check before it turns into a "hard" state and contacts get notified...
    SOFT)
        # What check attempt are we on? We don't want to restart the web server on the first
        # check, because it may just be a fluke!
        case "$3" in

        # Wait until the check has been tried 3 times before restarting the web server.
        # If the check fails on the 4th time (after we restart the web server), the state
        # type will turn to "hard" and contacts will be notified of the problem.
        # Hopefully this will restart the web server successfully, so the 4th check will
        # result in a "soft" recovery. If that happens no one gets notified because we
        # fixed the problem!
        3)
            echo -n "Restarting HTTP service (3rd soft critical state)..."
            # Call the init script to restart the HTTPD server
            /etc/rc.d/init.d/httpd restart
            ;;
        esac
        ;;

    # The HTTP service somehow managed to turn into a hard error without getting fixed.
    # It should have been restarted by the code above, but for some reason it didn't.
    # Let's give it one last try, shall we?
    # Note: Contacts have already been notified of a problem with the service at this
    # point (unless you disabled notifications for this service)
    HARD)
        echo -n "Restarting HTTP service..."
        # Call the init script to restart the HTTPD server
        /etc/rc.d/init.d/httpd restart
        ;;
    esac
    ;;
esac
exit 0
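
Before wiring the script into Nagios, the branch logic can be exercised by hand. The sketch below restates the decision described in the script's header comment as a function that only prints what it would do; should_restart is my own helper name, and nothing is actually restarted.

```shell
#!/bin/sh
# Sketch: the handler's decision without side effects.
# should_restart is an illustrative name; it prints "restart" or "ignore".
should_restart() {
    case "$1:$2:$3" in
        CRITICAL:SOFT:3) echo restart ;;   # third soft failure
        CRITICAL:HARD:*) echo restart ;;   # last try once the state is hard
        *)               echo ignore ;;    # OK, WARNING, UNKNOWN, early retries
    esac
}

should_restart CRITICAL SOFT 1   # prints: ignore
should_restart CRITICAL SOFT 3   # prints: restart
should_restart CRITICAL HARD 4   # prints: restart
should_restart OK HARD 1         # prints: ignore
```

The real script can be tested the same way by calling it directly with three arguments, on a machine where restarting httpd is harmless.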

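3) Once an alert fires and reaches the Nagios server, how does the server react and repair the problem on its own? (objects/commands.cfg)
The objects/commands.cfg file specifies how the remote script gets executed when an alert fires.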
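
These notes do not show the definition that was actually used, so the following is only a guess at one common arrangement: the handler runs on the Nagios server and asks an NRPE agent on the monitored host to do the restart. It assumes check_nrpe is installed under /usr/local/nagios/libexec, that a command named restart_httpd (hypothetical) is defined in the remote nrpe.cfg, and that the command_line passes $HOSTADDRESS$ as a fourth argument. The script name restart-httpd-remote is also made up.

```shell
#!/bin/sh
# Sketch of a server-side handler that triggers a restart on the remote host.
# Assumed command_line in objects/commands.cfg (not from the original notes):
#   /usr/local/nagios/libexec/eventhandlers/restart-httpd-remote \
#       $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$
CHECK_NRPE="${CHECK_NRPE:-/usr/local/nagios/libexec/check_nrpe}"

remote_restart() {
    state="$1"; state_type="$2"; attempt="$3"; host="$4"
    case "$state:$state_type:$attempt" in
        CRITICAL:SOFT:3|CRITICAL:HARD:*)
            # restart_httpd is a hypothetical command defined in the remote nrpe.cfg.
            # With DRY_RUN set, the command is printed instead of executed.
            ${DRY_RUN:+echo} "$CHECK_NRPE" -H "$host" -c restart_httpd
            ;;
    esac
}

# When installed as a script: remote_restart "$@"
```

Setting DRY_RUN=1 makes the function print the check_nrpe call instead of running it, which is handy while the remote side is not ready yet.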


4)
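In the host/service configuration file, enable the event handler for the service
event_handler shell_name (shell_name stands for the command_name defined in objects/commands.cfg)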


/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Run this to verify that the files modified above contain no errors.
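
The check can be wrapped so that a reload only happens when it passes. A minimal sketch, assuming the binary exits non-zero on configuration errors; verify_config is an illustrative wrapper of mine.

```shell
#!/bin/sh
# Sketch: run the pre-flight check and report the result.
# verify_config is an illustrative wrapper; both arguments are paths.
verify_config() {
    nagios_bin="$1"
    main_cfg="$2"
    if "$nagios_bin" -v "$main_cfg" > /dev/null 2>&1; then
        echo "config OK"
    else
        echo "config has errors, not reloading" >&2
        return 1
    fi
}

# Example:
# verify_config /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
```

When the check fails, running the -v command directly shows the actual error messages.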

Killed a service as a simple test, and it came back. OK.

zz's trust was not misplaced: problem solved.

Done, calling it a day.