Integrating Nagios with Ganglia to monitor Hadoop and HBase, with SMS alerting
This post is fairly long, so let's skip the preamble. If you are not yet familiar with ganglia, a quick web search will bring you up to speed. The environment:
hadoop1.updb.com    192.168.0.101
hadoop2.updb.com    192.168.0.102
hadoop3.updb.com    192.168.0.103
hadoop4.updb.com    192.168.0.104
hadoop5.updb.com    192.168.0.105
OS: CentOS 6.5 x86_64, using the distribution's network yum repositories plus the EPEL repository.
Before installing ganglia, make sure hadoop and hbase are already installed and working. The deployment plan:
hadoop1.updb.com    NameNode | HMaster | gmetad | gmond | ganglia-web | nagios
hadoop2.updb.com    DataNode | RegionServer | gmond | nrpe
hadoop3.updb.com    DataNode | RegionServer | gmond | nrpe
hadoop4.updb.com    DataNode | RegionServer | gmond | nrpe
hadoop5.updb.com    DataNode | RegionServer | gmond | nrpe
hadoop1 acts as the master node for both ganglia and nagios. It runs the ganglia server, gmetad; since it also has to monitor itself, it additionally runs the ganglia agent gmond, the ganglia-web application, and the nagios server. hadoop2 through hadoop5 are the monitored nodes, running the ganglia agent gmond and the nagios agent nrpe. Note that nrpe is not strictly required; it is installed here because I also want to monitor mysql and a few other services on hadoop2 through hadoop5.
1. Install ganglia's gmetad, gmond, and ganglia-web on hadoop1
First install the packages ganglia depends on:
[root@hadoop1 ~]# cat ganglia.rpm
apr-devel
apr-util
check-devel
cairo-devel
pango-devel
libxml2-devel
glib2-devel
dbus-devel
freetype-devel
fontconfig-devel
gcc-c++
expat-devel
python-devel
libXrender-devel
zlib
libart_lgpl
libpng
dejavu-lgc-sans-mono-fonts
dejavu-sans-mono-fonts
perl-ExtUtils-CBuilder
perl-ExtUtils-MakeMaker
[root@hadoop1 ~]# yum install -y `cat ganglia.rpm`
Besides the dependencies above, two more packages have to be built from source: confuse-2.7.tar.gz and rrdtool-1.4.8.tar.gz.
## unpack the sources
[root@hadoop1 pub]# tar xf rrdtool-1.4.8.tar.gz -C /opt/soft/
[root@hadoop1 pub]# tar xf confuse-2.7.tar.gz -C /opt/soft/
## build and install rrdtool
[root@hadoop1 rrdtool-1.4.8]# ./configure --prefix=/usr/local/rrdtool
[root@hadoop1 rrdtool-1.4.8]# make && make install
[root@hadoop1 rrdtool-1.4.8]# mkdir /usr/local/rrdtool/lib64
[root@hadoop1 rrdtool-1.4.8]# cp /usr/local/rrdtool/lib/* /usr/local/rrdtool/lib64/ -rf
[root@hadoop1 rrdtool-1.4.8]# cp /usr/local/rrdtool/lib/librrd.so /usr/lib/
[root@hadoop1 rrdtool-1.4.8]# cp /usr/local/rrdtool/lib/librrd.so /usr/lib64/
## build and install confuse
[root@hadoop1 rrdtool-1.4.8]# cd ../confuse-2.7/
[root@hadoop1 confuse-2.7]# ./configure CFLAGS=-fPIC --disable-nls --prefix=/usr/local/confuse
[root@hadoop1 confuse-2.7]# make && make install
[root@hadoop1 confuse-2.7]# mkdir /usr/local/confuse/lib64
[root@hadoop1 confuse-2.7]# cp /usr/local/confuse/lib/* /usr/local/confuse/lib64/ -rf
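A quick optional sanity check, assuming the --prefix values used above, to confirm both libraries landed where the ganglia build will look for them:

## rrdtool should print its version; the confuse shared library should be present
/usr/local/rrdtool/bin/rrdtool --version
ls /usr/local/confuse/lib/libconfuse.*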
With the prerequisites in place, build gmetad and gmond from the ganglia sources:
## unpack the source
[root@hadoop1 pub]# tar xf ganglia-3.6.0.tar.gz -C /opt/soft/
[root@hadoop1 pub]# cd /opt/soft/ganglia-3.6.0/
## build with gmetad support
[root@hadoop1 ganglia-3.6.0]# ./configure --prefix=/usr/local/ganglia --with-librrd=/usr/local/rrdtool --with-libconfuse=/usr/local/confuse --with-gmetad --with-libpcre=no --enable-gexec --enable-status --sysconfdir=/etc/ganglia
[root@hadoop1 ganglia-3.6.0]# make && make install
## install the gmetad service
[root@hadoop1 ganglia-3.6.0]# cp gmetad/gmetad.init /etc/init.d/gmetad
[root@hadoop1 ganglia-3.6.0]# cp /usr/local/ganglia/sbin/gmetad /usr/sbin/
[root@hadoop1 ganglia-3.6.0]# chkconfig --add gmetad
## install the gmond service
[root@hadoop1 ganglia-3.6.0]# cp gmond/gmond.init /etc/init.d/gmond
[root@hadoop1 ganglia-3.6.0]# cp /usr/local/ganglia/sbin/gmond /usr/sbin/
[root@hadoop1 ganglia-3.6.0]# gmond --default_config > /etc/ganglia/gmond.conf
[root@hadoop1 ganglia-3.6.0]# chkconfig --add gmond
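Before moving on, it is worth a minimal check that the binaries were installed and both services registered; nothing here is required by the later steps:

## gmond should report version 3.6.0; both services should appear in the chkconfig list
gmond --version
chkconfig --list | grep -E 'gmetad|gmond'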
At this point gmetad and gmond are installed on hadoop1. Next comes ganglia-web, which first needs php and httpd:
yum install php httpd -y
Edit the httpd configuration file /etc/httpd/conf/httpd.conf, changing only the listening port, to 8080:
Listen 8080
Install ganglia-web:
[root@hadoop1 pub]# tar xf ganglia-web-3.6.2.tar.gz -C /opt/soft/
[root@hadoop1 pub]# cd /opt/soft/
[root@hadoop1 soft]# mv ganglia-web-3.6.2/ /var/www/html/ganglia
[root@hadoop1 soft]# chmod 777 /var/www/html/ganglia -R
[root@hadoop1 soft]# cd /var/www/html/ganglia
[root@hadoop1 ganglia]# useradd www-data
[root@hadoop1 ganglia]# make install
[root@hadoop1 ganglia]# chmod 777 /var/lib/ganglia-web/dwoo/cache/
[root@hadoop1 ganglia]# chmod 777 /var/lib/ganglia-web/dwoo/compiled/
ganglia-web is now installed. Edit conf_default.php to point it at the ganglia-web directory and the rrd data directory; change the following two lines:
# Where gmetad stores the rrd archives.
$conf['gmetad_root'] = "/var/www/html/ganglia";   ## set to the web application's install directory
$conf['rrds'] = "/var/lib/ganglia/rrds";          ## path where the rrd data is stored
Create the rrd data directory and set its ownership:
[root@hadoop1 ganglia]# mkdir /var/lib/ganglia/rrds -p
[root@hadoop1 ganglia]# chown nobody:nobody /var/lib/ganglia/rrds/ -R
That completes all of the ganglia installation work on hadoop1. The next step is to install the gmond agent on hadoop2, hadoop3, hadoop4 and hadoop5.
2. Install gmond on hadoop2, hadoop3, hadoop4 and hadoop5
As before, the dependencies come first; follow the first two steps of the hadoop1 installation.
With the prerequisites in place, build gmond. The procedure is identical on all four nodes; hadoop2 is shown here:
## unpack the source
[root@hadoop2 pub]# tar xf ganglia-3.6.0.tar.gz -C /opt/soft/
[root@hadoop2 pub]# cd /opt/soft/ganglia-3.6.0/
## build gmond; note that compared to the gmetad build, this configure line drops --with-gmetad
[root@hadoop2 ganglia-3.6.0]# ./configure --prefix=/usr/local/ganglia --with-librrd=/usr/local/rrdtool --with-libconfuse=/usr/local/confuse --with-libpcre=no --enable-gexec --enable-status --sysconfdir=/etc/ganglia
[root@hadoop2 ganglia-3.6.0]# make && make install
[root@hadoop2 ganglia-3.6.0]# cp gmond/gmond.init /etc/init.d/gmond
[root@hadoop2 ganglia-3.6.0]# cp /usr/local/ganglia/sbin/gmond /usr/sbin/
[root@hadoop2 ganglia-3.6.0]# gmond --default_config > /etc/ganglia/gmond.conf
[root@hadoop2 ganglia-3.6.0]# chkconfig --add gmond
gmond is now installed on hadoop2; repeat the same steps on hadoop3, hadoop4 and hadoop5.
3. Configure ganglia. There is a server side and an agent side: the server configuration file is gmetad.conf, the agent configuration file is gmond.conf.
First configure gmetad.conf on hadoop1:
[root@hadoop1 ~]# vi /etc/ganglia/gmetad.conf
## define the data source name and the address gmetad polls; every gmond sends its collected
## data to this address, and gmetad writes it into the rrd data directory
data_source "hadoopcluster" 192.168.0.101:8649
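As an aside, a data_source line can list more than one gmond; gmetad polls them in order and fails over if the first is unreachable. This is not used in this setup (and for it to be useful, the fallback gmond would also have to receive the other nodes' metrics), but a hypothetical variant would look like:

## poll hadoop1 first, fall back to hadoop2's gmond if hadoop1 is down
data_source "hadoopcluster" 192.168.0.101:8649 192.168.0.102:8649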
The gmetad.conf configuration is as simple as that. Note that gmetad.conf exists only on hadoop1, so it is configured only there. Next, configure gmond.conf on hadoop1:
[root@hadoop1 ~]# head -n 80 /etc/ganglia/gmond.conf
/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes           ## run as a daemon
  setuid = yes
  user = nobody             ## the user gmond runs as
  debug_level = 0           ## set to 1 to print debug info at startup
  max_udp_msg_len = 1472
  mute = no                 ## if mute, this node will not broadcast any of its own collected data
  deaf = no                 ## if deaf, this node will not receive data broadcast by other nodes
  allow_extra_data = yes
  host_dmax = 86400 /* secs. Expires (removes from web interface) hosts in 1 day */
  host_tmax = 20 /* secs */
  cleanup_threshold = 300 /* secs */
  gexec = no
  # By default gmond will use reverse DNS resolution when displaying your hostname
  # Uncommeting following value will override that value.
  # override_hostname = "mywebserver.domain.com"
  # If you are not using multicast this value should be set to something other than 0.
  # Otherwise if you restart aggregator gmond you will get empty graphs. 60 seconds is reasonable
  send_metadata_interval = 0 /* secs */
}

/* The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance. */
cluster {
  name = "hadoopcluster"    ## the cluster name
  owner = "nobody"          ## the cluster owner
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.
   Gmond used to only support having a single channel */
udp_send_channel {
  # bind_hostname = yes # Highly recommended, soon to be default.
                        # This option tells gmond to use a source address
                        # that resolves to the machine's hostname. Without
                        # this, the metrics may appear to come from any
                        # interface and the DNS names associated with
                        # those IPs will be used to create the RRDs.
  # mcast_join = 239.2.11.71  ## comment this line out for unicast mode
  host = 192.168.0.101        ## unicast mode: the host that receives the data
  port = 8649                 ## listening port
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  # mcast_join = 239.2.11.71  ## comment this line out for unicast mode
  port = 8649
  # bind = 239.2.11.71        ## comment this line out for unicast mode
  retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
  # If you want to gzip XML output
  gzip_output = no
}

/* Channel to receive sFlow datagrams */
#udp_recv_channel {
#  port = 6343
#}

/* Optional sFlow settings */
That's it for gmetad.conf and gmond.conf on hadoop1. Now simply scp the gmond.conf from hadoop1 to the same path on hadoop2 through hadoop5, overwriting the original gmond.conf on each node, for example with the loop below.
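A minimal sketch of that distribution step, assuming passwordless ssh between the nodes (typical on a hadoop cluster):

## push hadoop1's gmond.conf to the four agent nodes
for node in hadoop2 hadoop3 hadoop4 hadoop5; do
    scp /etc/ganglia/gmond.conf ${node}.updb.com:/etc/ganglia/gmond.conf
done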
4. Start the services
Start gmond on hadoop1 through hadoop5:
/etc/init.d/gmond start
Start gmetad and httpd on hadoop1:
/etc/init.d/gmetad start
/etc/init.d/httpd start
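Before checking the web page, you can confirm gmond is answering by dumping the cluster XML from the tcp_accept_channel on port 8649 — the same XML gmetad polls. A quick check, assuming nc is installed:

## every node that is reporting shows up as a <HOST> element
nc 192.168.0.101 8649 | grep '<HOST NAME'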
5. Browse to 192.168.0.101:8080/ganglia and the ganglia web page appears.
At this point, however, ganglia only monitors basic host metrics; it knows nothing about hadoop and hbase yet. For that, the hadoop and hbase metrics configuration files have to be edited. The files on hadoop1 are shown here; the other nodes get their copies from hadoop1. The first file to change is hadoop-metrics2.properties in the hadoop configuration directory:
[root@hadoop1 ~]# cd /opt/hadoop-2.4.1/etc/hadoop/
[root@hadoop1 hadoop]# cat hadoop-metrics2.properties
#
#   Licensed to the Apache Software Foundation (ASF) under one or more
#   contributor license agreements.  See the NOTICE file distributed with
#   this work for additional information regarding copyright ownership.
#   The ASF licenses this file to You under the Apache License, Version 2.0
#   (the "License"); you may not use this file except in compliance with
#   the License.  You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.
#

# syntax: [prefix].[source|sink].[instance].[options]
# See javadoc of package-info.java for org.apache.hadoop.metrics2 for details

#*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
# default sampling period, in seconds
#*.period=10

# The namenode-metrics.out will contain metrics from all context
#namenode.sink.file.filename=namenode-metrics.out
# Specifying a special sampling period for namenode:
#namenode.sink.*.period=8

#datanode.sink.file.filename=datanode-metrics.out

# the following example split metrics of different
# context to different sinks (in this case files)
#jobtracker.sink.file_jvm.context=jvm
#jobtracker.sink.file_jvm.filename=jobtracker-jvm-metrics.out
#jobtracker.sink.file_mapred.context=mapred
#jobtracker.sink.file_mapred.filename=jobtracker-mapred-metrics.out

#tasktracker.sink.file.filename=tasktracker-metrics.out

#maptask.sink.file.filename=maptask-metrics.out

#reducetask.sink.file.filename=reducetask-metrics.out

*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
*.sink.ganglia.supportsparse=true
*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
namenode.sink.ganglia.servers=192.168.0.101:8649
datanode.sink.ganglia.servers=192.168.0.101:8649
resourcemanager.sink.ganglia.servers=192.168.0.101:8649
secondarynamenode.sink.ganglia.servers=192.168.0.101:8649
nodemanager.sink.ganglia.servers=192.168.0.101:8649
Next, edit hadoop-metrics2-hbase.properties in the hbase configuration directory:
[root@hadoop1 hadoop]# cd /opt/hbase-0.98.4-hadoop2/conf/
[root@hadoop1 conf]# cat hadoop-metrics2-hbase.properties
# syntax: [prefix].[source|sink].[instance].[options]
# See javadoc of package-info.java for org.apache.hadoop.metrics2 for details

#*.sink.file*.class=org.apache.hadoop.metrics2.sink.FileSink
# default sampling period
#*.period=10

# Below are some examples of sinks that could be used
# to monitor different hbase daemons.

#hbase.sink.file-all.class=org.apache.hadoop.metrics2.sink.FileSink
#hbase.sink.file-all.filename=all.metrics

#hbase.sink.file0.class=org.apache.hadoop.metrics2.sink.FileSink
#hbase.sink.file0.context=hmaster
#hbase.sink.file0.filename=master.metrics

#hbase.sink.file1.class=org.apache.hadoop.metrics2.sink.FileSink
#hbase.sink.file1.context=thrift-one
#hbase.sink.file1.filename=thrift-one.metrics

#hbase.sink.file2.class=org.apache.hadoop.metrics2.sink.FileSink
#hbase.sink.file2.context=thrift-two
#hbase.sink.file2.filename=thrift-one.metrics

#hbase.sink.file3.class=org.apache.hadoop.metrics2.sink.FileSink
#hbase.sink.file3.context=rest
#hbase.sink.file3.filename=rest.metrics

*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10
hbase.sink.ganglia.period=10
hbase.sink.ganglia.servers=192.168.0.101:8649
scp both files from hadoop1 to the same paths on hadoop2 through hadoop5 (see the sketch below), overwriting the originals, then restart hadoop and hbase. Ganglia can now see the hadoop and hbase metrics, as shown in the screenshot.
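A sketch of that copy step, again assuming passwordless ssh and the install paths used above:

## push both metrics files from hadoop1 to the other nodes
for node in hadoop2 hadoop3 hadoop4 hadoop5; do
    scp /opt/hadoop-2.4.1/etc/hadoop/hadoop-metrics2.properties ${node}.updb.com:/opt/hadoop-2.4.1/etc/hadoop/
    scp /opt/hbase-0.98.4-hadoop2/conf/hadoop-metrics2-hbase.properties ${node}.updb.com:/opt/hbase-0.98.4-hadoop2/conf/
done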
At this point ganglia is fully installed and configured, and is successfully monitoring hadoop and hbase.
Next up is nagios: the nagios server goes on hadoop1, and the nrpe agent goes on hadoop2, hadoop3, hadoop4 and hadoop5.
First install nagios and its plugins on hadoop1:
yum install nagios nagios-plugins nagios-plugins-all nagios-plugins-nrpe -y
Set the login credentials for the nagios web interface:
[root@hadoop1 ~]# cat /etc/httpd/conf.d/nagios.conf
# SAMPLE CONFIG SNIPPETS FOR APACHE WEB SERVER
# Last Modified: 11-26-2005
#
# This file contains examples of entries that need
# to be incorporated into your Apache web server
# configuration file.  Customize the paths, etc. as
# needed to fit your system.

ScriptAlias /nagios/cgi-bin/ "/usr/lib64/nagios/cgi-bin/"

<Directory "/usr/lib64/nagios/cgi-bin/">
#  SSLRequireSSL
   Options ExecCGI
   AllowOverride None
   Order allow,deny
   Allow from all
#  Order deny,allow
#  Deny from all
#  Allow from 127.0.0.1
   AuthName "nagiosadmin"                    ## the user name here must be nagiosadmin
   AuthType Basic
   AuthUserFile /etc/nagios/htpasswd.users   ## path to the password file
   Require valid-user
</Directory>

Alias /nagios "/usr/share/nagios/html"

<Directory "/usr/share/nagios/html">
#  SSLRequireSSL
   Options None
   AllowOverride None
   Order allow,deny
   Allow from all
#  Order deny,allow
#  Deny from all
#  Allow from 127.0.0.1
   AuthName "nagiosadmin"                    ## the user name here must be nagiosadmin
   AuthType Basic
   AuthUserFile /etc/nagios/htpasswd.users   ## path to the password file
   Require valid-user
</Directory>
Save and quit, then generate the password file:
[root@hadoop1 ~]# htpasswd -c /etc/nagios/htpasswd.users nagiosadmin
New password:
Re-type new password:
Adding password for user nagiosadmin
[root@hadoop1 ~]# cat /etc/nagios/htpasswd.users
nagiosadmin:qWrXYKDlycqHM
With the password in place, install the nrpe agent and its plugins on hadoop2, hadoop3, hadoop4 and hadoop5:
yum install nagios-plugins nagios-plugins-nrpe nrpe nagios-plugins-all -y
On every node the plugins live under /usr/lib64/nagios/plugins/:
[root@hadoop2 ~]# ls /usr/lib64/nagios/plugins/
check_breeze    check_game       check_mrtgtraf     check_overcr   check_swap
check_by_ssh    check_hpjd       check_mysql        check_pgsql    check_tcp
check_clamd     check_http       check_mysql_query  check_ping     check_time
check_cluster   check_icmp       check_nagios       check_pop      check_udp
check_dhcp      check_ide_smart  check_nntp         check_procs    check_ups
check_dig       check_imap       check_nntps        check_real     check_users
check_disk      check_ircd       check_nrpe         check_rpc      check_wave
check_disk_smb  check_jabber     check_nt           check_sensors  negate
check_dns       check_ldap       check_ntp          check_simap    urlize
check_dummy     check_ldaps      check_ntp_peer     check_smtp     utils.pm
check_file_age  check_load       check_ntp.pl       check_snmp     utils.sh
check_flexlm    check_log        check_ntp_time     check_spop
check_fping     check_mailq      check_nwstat       check_ssh
check_ftp       check_mrtg       check_oracle       check_ssmtp
To tie nagios and ganglia together, copy the ganglia plugin shipped in the ganglia source tree into the nagios plugin directory on hadoop1:
[root@hadoop1 ~]# cd /opt/soft/ganglia-3.6.0/
[root@hadoop1 ganglia-3.6.0]# ls contrib/check_ganglia.py
contrib/check_ganglia.py
[root@hadoop1 ganglia-3.6.0]# cp contrib/check_ganglia.py /usr/lib64/nagios/plugins/
By default the check_ganglia.py plugin only handles the case where the measured value is above the critical threshold. We also need the opposite case, where the measured value falls below the critical threshold, so a branch is appended at the end of the script (the final else block below):
[root@hadoop1 plugins]# cat check_ganglia.py
#!/usr/bin/env python

import sys
import getopt
import socket
import xml.parsers.expat

class GParser:
  def __init__(self, host, metric):
    self.inhost = 0
    self.inmetric = 0
    self.value = None
    self.host = host
    self.metric = metric
  def parse(self, file):
    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = parser.start_element
    p.EndElementHandler = parser.end_element
    p.ParseFile(file)
    if self.value == None:
      raise Exception('Host/value not found')
    return float(self.value)
  def start_element(self, name, attrs):
    if name == "HOST":
      if attrs["NAME"] == self.host:
        self.inhost = 1
    elif self.inhost == 1 and name == "METRIC" and attrs["NAME"] == self.metric:
      self.value = attrs["VAL"]
  def end_element(self, name):
    if name == "HOST" and self.inhost == 1:
      self.inhost = 0

def usage():
  print """Usage: check_ganglia \
-h|--host= -m|--metric= -w|--warning= \
-c|--critical= [-s|--server=] [-p|--port=]"""
  sys.exit(3)

if __name__ == "__main__":
##############################################################
  ganglia_host = '192.168.0.101'
  ganglia_port = 8649
  host = None
  metric = None
  warning = None
  critical = None

  try:
    options, args = getopt.getopt(sys.argv[1:],
      "h:m:w:c:s:p:",
      ["host=", "metric=", "warning=", "critical=", "server=", "port="],
    )
  except getopt.GetoptError, err:
    print "check_gmond:", str(err)
    usage()
    sys.exit(3)

  for o, a in options:
    if o in ("-h", "--host"):
      host = a
    elif o in ("-m", "--metric"):
      metric = a
    elif o in ("-w", "--warning"):
      warning = float(a)
    elif o in ("-c", "--critical"):
      critical = float(a)
    elif o in ("-p", "--port"):
      ganglia_port = int(a)
    elif o in ("-s", "--server"):
      ganglia_host = a

  if critical == None or warning == None or metric == None or host == None:
    usage()
    sys.exit(3)

  try:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((ganglia_host, ganglia_port))
    parser = GParser(host, metric)
    value = parser.parse(s.makefile("r"))
    s.close()
  except Exception, err:
    print "CHECKGANGLIA UNKNOWN: Error while getting value \"%s\"" % (err)
    sys.exit(3)

  if critical > warning:
    if value >= critical:
      print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
      sys.exit(2)
    elif value >= warning:
      print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
      sys.exit(1)
    else:
      print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
      sys.exit(0)
  else:
    if critical >= value:
      print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
      sys.exit(2)
    elif warning >= value:
      print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
      sys.exit(1)
    else:
      print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
      sys.exit(0)
Configure the nrpe agent on hadoop2 through hadoop5. hadoop2 is shown here; the other nodes can simply take this file via scp and overwrite the copy at the same path:
[root@hadoop2 ~]# cat /etc/nagios/nrpe.cfg
log_facility=daemon
pid_file=/var/run/nrpe/nrpe.pid
## nrpe listening port
server_port=5666
nrpe_user=nrpe
nrpe_group=nrpe
## address of the nagios server host
allowed_hosts=192.168.0.101
dont_blame_nrpe=0
allow_bash_command_substitution=0
debug=0
command_timeout=60
connection_timeout=300
## system load
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
## number of logged-in users
command[check_users]=/usr/lib64/nagios/plugins/check_users -w 5 -c 10
## free space on the root partition
command[check_sda2]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /dev/sda2
## mysql status
command[check_mysql]=/usr/lib64/nagios/plugins/check_mysql -H hadoop2.updb.com -P 3306 -d kora -u kora -p upbjsxt
## host alive
command[check_ping]=/usr/lib64/nagios/plugins/check_ping -H hadoop2.updb.com -w 100.0,20% -c 500.0,60%
## total number of processes
command[check_total_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200
include_dir=/etc/nrpe.d/
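Each command[...] line can be exercised by hand before nagios ever calls it; the plugin prints an OK/WARNING/CRITICAL line and sets the matching exit code. For example, on hadoop2:

## run the load check locally with the same thresholds nrpe will use
/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20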
scp this file to the same path on hadoop3, hadoop4 and hadoop5, overwriting the original; remember to change the host name inside the file to that of each node, e.g. with the loop below.
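One way to do that in a single pass, run from hadoop2 and assuming passwordless ssh; the sed swaps in each node's host name before the copy:

## rewrite the host name for each target node, then push the file
for node in hadoop3 hadoop4 hadoop5; do
    sed "s/hadoop2.updb.com/${node}.updb.com/g" /etc/nagios/nrpe.cfg > /tmp/nrpe.cfg.${node}
    scp /tmp/nrpe.cfg.${node} ${node}.updb.com:/etc/nagios/nrpe.cfg
done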
On hadoop1, configure each host and its checks. The configuration files are laid out as follows:
[root@hadoop1 plugins]# cd /etc/nagios/objects/
## each node has one host file and one service file, e.g. hadoop2 is described by hadoop2.cfg and service2.cfg
[root@hadoop1 objects]# ls
commands.cfg  hadoop3.cfg  localhost.cfg  service3.cfg  templates.cfg
contacts.cfg  hadoop4.cfg  printer.cfg    service4.cfg  timeperiods.cfg
hadoop1.cfg   hadoop5.cfg  service1.cfg   service5.cfg  windows.cfg
hadoop2.cfg   hosts.cfg    service2.cfg   switch.cfg
First declare the check_ganglia and check_nrpe commands in commands.cfg by appending the following:
# 'check_ganglia' command definition
define command{
        command_name    check_ganglia
        command_line    $USER1$/check_ganglia.py -h $HOSTADDRESS$ -m $ARG1$ -w $ARG2$ -c $ARG3$
        }

# 'check_nrpe' command definition
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }
Then extend the templates.cfg template file by appending the following:
define service{
        use                     generic-service
        name                    ganglia-service1    ## referenced in service1.cfg
        hostgroup_name          hadoop1             ## referenced in hadoop1.cfg
        servicegroups           ganglia-metrics1    ## referenced in service1.cfg
        register                0
        }

define service{
        use                     generic-service
        name                    ganglia-service2    ## referenced in service2.cfg
        hostgroup_name          hadoop2             ## referenced in hadoop2.cfg
        servicegroups           ganglia-metrics2    ## referenced in service2.cfg
        register                0
        }

define service{
        use                     generic-service
        name                    ganglia-service3    ## referenced in service3.cfg
        hostgroup_name          hadoop3             ## referenced in hadoop3.cfg
        servicegroups           ganglia-metrics3    ## referenced in service3.cfg
        register                0
        }

define service{
        use                     generic-service
        name                    ganglia-service4    ## referenced in service4.cfg
        hostgroup_name          hadoop4             ## referenced in hadoop4.cfg
        servicegroups           ganglia-metrics4    ## referenced in service4.cfg
        register                0
        }

define service{
        use                     generic-service
        name                    ganglia-service5    ## referenced in service5.cfg
        hostgroup_name          hadoop5             ## referenced in hadoop5.cfg
        servicegroups           ganglia-metrics5    ## referenced in service5.cfg
        register                0
        }
hadoop1's configuration is shown below. Since hadoop1 is the server, it does not need nrpe to monitor itself:
## hadoop1.cfg holds the regular local checks for this machine, while service1.cfg holds the ganglia checks
[root@hadoop1 objects]# cat hadoop1.cfg
define host{
        use                     linux-server
        host_name               hadoop1.updb.com
        alias                   hadoop1.updb.com
        address                 hadoop1.updb.com
        }

define hostgroup{
        hostgroup_name          hadoop1
        alias                   hadoop1
        members                 hadoop1.updb.com
        }

define service{
        use                     local-service
        host_name               hadoop1.updb.com
        service_description     PING
        check_command           check_ping!100,20%!500,60%
        }

define service{
        use                     local-service
        host_name               hadoop1.updb.com
        service_description     root partition
        check_command           check_local_disk!20%!10%!/
#       contact_groups          admins
        }

define service{
        use                     local-service
        host_name               hadoop1.updb.com
        service_description     user count
        check_command           check_local_users!20!50
        }

define service{
        use                     local-service
        host_name               hadoop1.updb.com
        service_description     total processes
        check_command           check_local_procs!250!400!RSZDT
        }

define service{
        use                     local-service
        host_name               hadoop1.updb.com
        service_description     system load
        check_command           check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
        }

## service1.cfg
[root@hadoop1 objects]# cat service1.cfg
define servicegroup{
        servicegroup_name       ganglia-metrics1
        alias                   Ganglia Metrics1
        }

## check_ganglia is the command declared in commands.cfg
define service{
        use                     ganglia-service1
        service_description     HMaster load
        check_command           check_ganglia!master.Server.averageLoad!5!10
        }

define service{
        use                     ganglia-service1
        service_description     free memory
        check_command           check_ganglia!mem_free!200!50
        }

define service{
        use                     ganglia-service1
        service_description     NameNode sync
        check_command           check_ganglia!dfs.namenode.SyncsAvgTime!10!50
        }
hadoop2's configuration follows. Note that every check routed through the check_nrpe plugin must be declared in nrpe.cfg on hadoop2:
## these checks use nrpe on the remote node to collect data and report it back to the nagios server on hadoop1
[root@hadoop1 objects]# cat hadoop2.cfg
define host{
        use                     linux-server
        host_name               hadoop2.updb.com
        alias                   hadoop2.updb.com
        address                 hadoop2.updb.com
        }

define hostgroup{
        hostgroup_name          hadoop2
        alias                   hadoop2
        members                 hadoop2.updb.com
        }

## check_nrpe is the command declared in commands.cfg
define service{
        use                     local-service
        host_name               hadoop2.updb.com
        service_description     Mysql status
        check_command           check_nrpe!check_mysql
        }

define service{
        use                     local-service
        host_name               hadoop2.updb.com
        service_description     PING
        check_command           check_nrpe!check_ping
        }

define service{
        use                     local-service
        host_name               hadoop2.updb.com
        service_description     root partition
        check_command           check_nrpe!check_sda2
        }

define service{
        use                     local-service
        host_name               hadoop2.updb.com
        service_description     user count
        check_command           check_nrpe!check_users
        }

define service{
        use                     local-service
        host_name               hadoop2.updb.com
        service_description     total processes
        check_command           check_nrpe!check_total_procs
        }

define service{
        use                     local-service
        host_name               hadoop2.updb.com
        service_description     system load
        check_command           check_nrpe!check_load
        }

## the ganglia checks, collected through the check_ganglia plugin
[root@hadoop1 objects]# cat service2.cfg
define servicegroup{
        servicegroup_name       ganglia-metrics2
        alias                   Ganglia Metrics2
        }

define service{
        use                     ganglia-service2
        service_description     free memory
        check_command           check_ganglia!mem_free!200!50
        }

define service{
        use                     ganglia-service2
        service_description     RegionServer_Get
        check_command           check_ganglia!regionserver.Server.Get_min!5!15
        }

define service{
        use                     ganglia-service2
        service_description     DataNode_Heartbeat
        check_command           check_ganglia!dfs.datanode.HeartbeatsAvgTime!15!40
        }
hadoop3, hadoop4 and hadoop5 are configured exactly like hadoop2, apart from the host names; the sed loop below is one way to generate them.
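Rather than editing three copies by hand, the files can be derived from hadoop2's — a sketch, assuming the naming scheme above:

## generate hadoopN.cfg and serviceN.cfg from the hadoop2 versions
cd /etc/nagios/objects
for n in 3 4 5; do
    sed "s/hadoop2/hadoop${n}/g" hadoop2.cfg > hadoop${n}.cfg
    sed -e "s/hadoop2/hadoop${n}/g" \
        -e "s/ganglia-service2/ganglia-service${n}/g" \
        -e "s/ganglia-metrics2/ganglia-metrics${n}/g" service2.cfg > service${n}.cfg
done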
Finally, the new files have to be included in the nagios main configuration file:
## only the lines shown below change; everything else stays as-is
[root@hadoop1 objects]# vi ../nagios.cfg
# You can specify individual object config files as shown below:
#cfg_file=/etc/nagios/objects/localhost.cfg
cfg_file=/etc/nagios/objects/commands.cfg
cfg_file=/etc/nagios/objects/contacts.cfg
cfg_file=/etc/nagios/objects/timeperiods.cfg
cfg_file=/etc/nagios/objects/templates.cfg
## pull in the host files
cfg_file=/etc/nagios/objects/hadoop1.cfg
cfg_file=/etc/nagios/objects/hadoop2.cfg
cfg_file=/etc/nagios/objects/hadoop3.cfg
cfg_file=/etc/nagios/objects/hadoop4.cfg
cfg_file=/etc/nagios/objects/hadoop5.cfg
## pull in the service files
cfg_file=/etc/nagios/objects/service1.cfg
cfg_file=/etc/nagios/objects/service2.cfg
cfg_file=/etc/nagios/objects/service3.cfg
cfg_file=/etc/nagios/objects/service4.cfg
cfg_file=/etc/nagios/objects/service5.cfg
Next, verify that the configuration is correct:
[root@hadoop1 objects]# nagios -v ../nagios.cfg

Nagios Core 3.5.1
Copyright (c) 2009-2011 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-30-2013
License: GPL

Website: http://www.nagios.org
Reading configuration data...
   Read main config file okay...
Processing object config file '/etc/nagios/objects/commands.cfg'...
Processing object config file '/etc/nagios/objects/contacts.cfg'...
Processing object config file '/etc/nagios/objects/timeperiods.cfg'...
Processing object config file '/etc/nagios/objects/templates.cfg'...
Processing object config file '/etc/nagios/objects/hadoop1.cfg'...
Processing object config file '/etc/nagios/objects/hadoop2.cfg'...
Processing object config file '/etc/nagios/objects/hadoop3.cfg'...
Processing object config file '/etc/nagios/objects/hadoop4.cfg'...
Processing object config file '/etc/nagios/objects/hadoop5.cfg'...
Processing object config file '/etc/nagios/objects/service1.cfg'...
Processing object config file '/etc/nagios/objects/service2.cfg'...
Processing object config file '/etc/nagios/objects/service3.cfg'...
Processing object config file '/etc/nagios/objects/service4.cfg'...
Processing object config file '/etc/nagios/objects/service5.cfg'...
Processing object config directory '/etc/nagios/conf.d'...
   Read object config files okay...

Running pre-flight check on configuration data...

Checking services...
    Checked 44 services.
Checking hosts...
    Checked 5 hosts.
Checking host groups...
    Checked 5 host groups.
Checking service groups...
    Checked 5 service groups.
Checking contacts...
    Checked 1 contacts.
Checking contact groups...
    Checked 1 contact groups.
Checking service escalations...
    Checked 0 service escalations.
Checking service dependencies...
    Checked 0 service dependencies.
Checking host escalations...
    Checked 0 host escalations.
Checking host dependencies...
    Checked 0 host dependencies.
Checking commands...
    Checked 26 commands.
Checking time periods...
    Checked 5 time periods.
Checking for circular paths between hosts...
Checking for circular host and service dependencies...
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check
No errors, so start the nagios service on hadoop1:
[root@hadoop1 objects]# /etc/init.d/nagios start
Starting nagios: done.
Start the nrpe service on hadoop2 through hadoop5:
[root@hadoop2 ~]# /etc/init.d/nrpe start
Starting nrpe:                                             [  OK  ]
On hadoop1, test that nagios can talk to the nrpe agents (hadoop1 itself returns nothing, since it does not run nrpe):
[root@hadoop1 objects]# cd /usr/lib64/nagios/plugins/
[root@hadoop1 plugins]# ./check_nrpe -H hadoop1.updb.com
[root@hadoop1 plugins]# ./check_nrpe -H hadoop2.updb.com
NRPE v2.15
[root@hadoop1 plugins]# ./check_nrpe -H hadoop3.updb.com
NRPE v2.15
[root@hadoop1 plugins]# ./check_nrpe -H hadoop4.updb.com
NRPE v2.15
[root@hadoop1 plugins]# ./check_nrpe -H hadoop5.updb.com
NRPE v2.15
Communication works. Now verify that the check_ganglia.py plugin behaves correctly:
[root@hadoop1 ~]# cd /usr/lib64/nagios/plugins/
[root@hadoop1 plugins]# ./check_ganglia.py -h hadoop2.updb.com -m mem_free -w 200 -c 50
CHECKGANGLIA OK: mem_free is 72336.00
It works. Now open the nagios web page to confirm that the checks are being collected.
Everything is monitored successfully. If you want more, or more detailed, hadoop and hbase checks, look the metric names up in ganglia; the metric name configured in nagios just has to match the name that appears there. You can pull the full list straight from gmond, as shown below.
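One convenient way to list every metric name gmond currently reports — and therefore every name that check_ganglia.py can be pointed at with -m — is to extract them from the XML on port 8649, assuming nc is installed:

## list the distinct metric names known to the aggregating gmond
nc 192.168.0.101 8649 | grep -o 'METRIC NAME="[^"]*"' | sort -u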
You will find that ganglia exposes a very large number of metrics; you can monitor nearly any aspect of the cluster you care about. The real challenge is understanding what each metric means and what thresholds make sense. The metrics in this walkthrough were picked arbitrarily, with no special significance; I am still working through them myself. At this point nagios is monitoring everything we asked of it, so how do we get alerts as SMS messages on a phone?
Read on. First, configure contacts.cfg and add the mail recipient:
[root@hadoop1 plugins]# cd /etc/nagios/objects/
[root@hadoop1 objects]# cat contacts.cfg
define contact{
        contact_name                    nagiosadmin
        use                             generic-contact
        alias                           Nagios Admin
        ## notification time windows
        service_notification_period     24x7
        host_notification_period        24x7
        ## which state changes trigger notifications
        service_notification_options    w,u,c,r,f,s
        host_notification_options       d,u,r,f,s
        ## notify by email
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        ## the contact's 139.com mailbox
        email                           1820280----@139.com
        }

## the contact group the contact belongs to
define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
        }
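Before trusting the alert chain end to end, it is worth confirming that the server can deliver mail at all — a minimal test, assuming mailx and a working local MTA (e.g. postfix), with a hypothetical address standing in for the one configured above:

## send a test mail; replace the address with the contact email from contacts.cfg
echo "nagios mail test" | mail -s "nagios alert test" your-address@139.com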
Restart the nagios service. Make sure your 139.com mailbox has long-SMS forwarding enabled, so that mail arriving in the mailbox gets forwarded to your phone. Here is a screenshot of the phone after receiving an alert mail:
And that's it. Reading this, you may feel like Sun Wukong's spirit has just returned to his body, momentarily detached from the world around you. Ha! Study ganglia's metrics well and you will never have to worry about your ops work again: spend the workday drinking tea and reading the news, waiting for a text message to tell you exactly where the problem is.