This article explains how to build a proxy pool for a Python crawler. The walkthrough is quite detailed and should be a useful reference; if the topic interests you, read it to the end!

Among the many anti-crawling measures websites use, one is rate-limiting by IP: if an IP sends more than a certain threshold of requests within a time window, that IP is blacklisted and blocked for a period of time.

There are two ways to deal with this:

1. Lower the crawl frequency so the IP never trips the limit. The drawback is obvious: crawling becomes much less efficient.

2. Build an IP proxy pool and rotate through different IPs while crawling (a minimal sketch of the idea follows).
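
As a rough illustration of approach 2 (assuming a hand-maintained proxy list; the addresses below are placeholders, not working proxies), rotation with requests might look like this:

import random
import requests

# Placeholder proxies; the real pool built in this article is fed by crawlers.
PROXIES = [
    'http://10.0.0.1:8080',
    'http://10.0.0.2:8080',
]

def fetch(url):
    proxy = random.choice(PROXIES)  # pick a (possibly) different proxy per request
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=5)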

The plan for building the crawler proxy pool:

1. Crawl proxy IPs from free proxy sites (e.g. 西刺代理, 快代理, 云代理, 无忧代理);

2. Validate each proxy IP (request a known URL through the proxy and check the response);

3. Save the working proxy IPs to a database.

Implementation code:

ipproxy.py

The IPProxy class defines the fields stored for each crawled proxy, along with some basic methods.

# -*- coding: utf-8 -*-
import re
import time
from settings import PROXY_URL_FORMATTER

schema_pattern = re.compile(r'^(http|https)$', re.I)
ip_pattern = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}$', re.I)  # dot must be escaped
port_pattern = re.compile(r'^[0-9]{2,5}$', re.I)


class IPProxy:
    '''
    {
        "schema": "http",             # proxy type
        "ip": "127.0.0.1",            # proxy IP address
        "port": "8050",               # proxy port
        "used_total": 11,             # number of times the proxy has been used
        "success_times": 5,           # number of successful requests through it
        "continuous_failed": 3,       # number of consecutive failed requests
        "created_time": "2018-05-02"  # date the proxy was crawled
    }
    '''

    def __init__(self, schema, ip, port, used_total=0, success_times=0,
                 continuous_failed=0, created_time=None):
        """Initialize the proxy instance"""
        if schema == "" or schema is None:
            schema = "http"
        self.schema = schema.lower()
        self.ip = ip
        self.port = port
        self.used_total = used_total
        self.success_times = success_times
        self.continuous_failed = continuous_failed
        if created_time is None:
            created_time = time.strftime('%Y-%m-%d', time.localtime(time.time()))
        self.created_time = created_time

    def _get_url(self):
        """Return the proxy url"""
        return PROXY_URL_FORMATTER % {'schema': self.schema, 'ip': self.ip, 'port': self.port}

    def _check_format(self):
        """Return True if the proxy fields are well-formed, otherwise return False"""
        if self.schema is not None and self.ip is not None and self.port is not None:
            if schema_pattern.match(self.schema) and ip_pattern.match(self.ip) and port_pattern.match(self.port):
                return True
        return False

    def _is_https(self):
        """Return True if the proxy is https, otherwise return False"""
        return self.schema == 'https'

    def _update(self, successed=False):
        """Update the proxy based on the result of the request's response"""
        self.used_total = self.used_total + 1
        if successed:
            self.continuous_failed = 0
            self.success_times = self.success_times + 1
        else:
            print(self.continuous_failed)
            self.continuous_failed = self.continuous_failed + 1


if __name__ == '__main__':
    proxy = IPProxy('HTTPS', '192.168.2.25', "8080")
    print(proxy._get_url())
    print(proxy._check_format())
    print(proxy._is_https())

settings.py

settings.py collects the configuration needed by the whole project.

# Redis host and port
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

# Format string for the Redis keys under which proxies are saved
PROXIES_REDIS_FORMATTER = 'proxies::{}'
# Redis set of the HTTP and HTTPS proxies that are already in the pool
PROXIES_REDIS_EXISTED = 'proxies::existed'

# Maximum number of consecutive failures before a proxy is dropped
MAX_CONTINUOUS_TIMES = 3

# Format string for proxy urls
PROXY_URL_FORMATTER = '%(schema)s://%(ip)s:%(port)s'

USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

# Verify a crawled proxy before saving it (default: True)
PROXY_CHECK_BEFOREADD = True
# Urls used to verify proxy availability; several per schema are supported
PROXY_CHECK_URLS = {'https': ['https://icanhazip.com'], 'http': ['http://icanhazip.com']}
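
For reference, the two format strings above expand like this (values purely illustrative):

>>> PROXY_URL_FORMATTER % {'schema': 'http', 'ip': '127.0.0.1', 'port': '8050'}
'http://127.0.0.1:8050'
>>> PROXIES_REDIS_FORMATTER.format('https')
'proxies::https'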

proxy_util.py

proxy_util.py defines some utility functions: proxy_to_dict(proxy) converts an IPProxy instance into a dict, proxy_from_dict(d) converts a dict back into an IPProxy instance, request_page() sends a request, and _is_proxy_available() checks whether a proxy IP actually works.

# -*- coding: utf-8 -*-
import random
import logging
import requests

from ipproxy import IPProxy
from settings import USER_AGENT_LIST, PROXY_CHECK_URLS

# Set logger output format
logging.basicConfig(level=logging.INFO,
                    format='[%(asctime)-15s] [%(levelname)8s] [%(name)10s] - %(message)s (%(filename)s:%(lineno)s)',
                    datefmt='%Y-%m-%d %T')
logger = logging.getLogger(__name__)


def proxy_to_dict(proxy):
    d = {
        "schema": proxy.schema,
        "ip": proxy.ip,
        "port": proxy.port,
        "used_total": proxy.used_total,
        "success_times": proxy.success_times,
        "continuous_failed": proxy.continuous_failed,
        "created_time": proxy.created_time
    }
    return d


def proxy_from_dict(d):
    return IPProxy(schema=d['schema'], ip=d['ip'], port=d['port'], used_total=d['used_total'],
                   success_times=d['success_times'], continuous_failed=d['continuous_failed'],
                   created_time=d['created_time'])


# Strip leading and trailing whitespace
def strip(data):
    if data is not None:
        return data.strip()
    return data


base_headers = {
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
}


def request_page(url, options={}, encoding='utf-8'):
    """Send a request, get the response"""
    headers = dict(base_headers, **options)
    if 'User-Agent' not in headers.keys():
        headers['User-Agent'] = random.choice(USER_AGENT_LIST)
    logger.info('Crawling: ' + url)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            logger.info('Crawled successfully: ' + url)
            return response.content.decode(encoding=encoding)
    except requests.RequestException:  # requests raises its own ConnectionError subclass
        logger.error('Failed to crawl ' + url)
    return None


def _is_proxy_available(proxy, options={}):
    """Check whether the proxy is available or not"""
    headers = dict(base_headers, **options)
    if 'User-Agent' not in headers.keys():
        headers['User-Agent'] = random.choice(USER_AGENT_LIST)
    proxies = {proxy.schema: proxy._get_url()}
    check_urls = PROXY_CHECK_URLS[proxy.schema]
    for url in check_urls:
        try:
            response = requests.get(url=url, proxies=proxies, headers=headers, timeout=5)
        except BaseException:
            logger.info("<" + url + "> checked proxy <" + proxy._get_url() + ">: unavailable")
        else:
            if response.status_code == 200:
                logger.info("<" + url + "> checked proxy <" + proxy._get_url() + ">: available")
                return True
            else:
                logger.info("<" + url + "> checked proxy <" + proxy._get_url() + ">: unavailable")
    return False


if __name__ == '__main__':
    headers = dict(base_headers)
    if 'User-Agent' not in headers.keys():
        headers['User-Agent'] = random.choice(USER_AGENT_LIST)
    proxies = {"https": "https://163.125.255.154:9797"}
    response = requests.get("https://www.baidu.com", headers=headers, proxies=proxies, timeout=3)
    print(response.content)
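
The __main__ block above tests a hard-coded proxy by calling requests directly; the equivalent check through the module's own API would be the following (the address comes from that test and has likely stopped working):

from ipproxy import IPProxy
from proxy_util import _is_proxy_available

proxy = IPProxy('https', '163.125.255.154', '9797')
print(_is_proxy_available(proxy))  # True if any check url answers with HTTP 200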

proxy_queue.py

A proxy queue stores the proxy IPs and serves them to consumers; different queue implementations may use different storage and retrieval strategies. Here, BaseQueue is the base class for all proxy queues. It declares the interface every queue must implement: pushing a proxy, popping a proxy, and reporting how many proxies are queued. The example FifoQueue is a first-in-first-out queue backed by a Redis list; to guarantee that each proxy IP enters the queue only once, a Redis set (proxies::existed) is consulted before every push.
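
The dedup check relies on the return value of Redis SADD, which is the number of elements actually added: pushing the same proxy url a second time returns 0, which _is_existed() reads as "already queued". A quick demonstration, assuming a local Redis:

import redis

r = redis.StrictRedis(host='localhost', port=6379)
print(r.sadd('proxies::existed', 'http://1.2.3.4:80'))  # 1 -> newly added
print(r.sadd('proxies::existed', 'http://1.2.3.4:80'))  # 0 -> already present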

# -*- coding: utf-8 -*-
import json

import redis

from ipproxy import IPProxy
from proxy_util import logger, proxy_to_dict, proxy_from_dict, _is_proxy_available
from settings import PROXIES_REDIS_EXISTED, PROXIES_REDIS_FORMATTER, MAX_CONTINUOUS_TIMES, PROXY_CHECK_BEFOREADD

"""Proxy Queue Base Class"""


class BaseQueue(object):

    def __init__(self, server):
        """Initialize the proxy queue instance

        Parameters
        ----------
        server : StrictRedis
            Redis client instance
        """
        self.server = server

    def _serialize_proxy(self, proxy):
        """Serialize a proxy instance"""
        return proxy_to_dict(proxy)

    def _deserialize_proxy(self, serialized_proxy):
        """Deserialize a proxy instance"""
        return proxy_from_dict(json.loads(serialized_proxy))  # parse the JSON written by push()

    def __len__(self, schema='http'):
        """Return the length of the queue"""
        raise NotImplementedError

    def push(self, proxy, need_check):
        """Push a proxy"""
        raise NotImplementedError

    def pop(self, schema='http', timeout=0):
        """Pop a proxy"""
        raise NotImplementedError


class FifoQueue(BaseQueue):
    """First in, first out queue"""

    def __len__(self, schema='http'):
        """Return the length of the queue"""
        return self.server.llen(PROXIES_REDIS_FORMATTER.format(schema))

    def push(self, proxy, need_check=PROXY_CHECK_BEFOREADD):
        """Push a proxy"""
        if need_check and not _is_proxy_available(proxy):
            return
        elif proxy.continuous_failed < MAX_CONTINUOUS_TIMES and not self._is_existed(proxy):
            key = PROXIES_REDIS_FORMATTER.format(proxy.schema)
            self.server.rpush(key, json.dumps(self._serialize_proxy(proxy), ensure_ascii=False))

    def pop(self, schema='http', timeout=0):
        """Pop a proxy"""
        if timeout > 0:
            p = self.server.blpop(PROXIES_REDIS_FORMATTER.format(schema.lower()), timeout)
            if isinstance(p, tuple):
                p = p[1]
        else:
            p = self.server.lpop(PROXIES_REDIS_FORMATTER.format(schema.lower()))
        if p:
            p = self._deserialize_proxy(p)
            self.server.srem(PROXIES_REDIS_EXISTED, p._get_url())
            return p

    def _is_existed(self, proxy):
        added = self.server.sadd(PROXIES_REDIS_EXISTED, proxy._get_url())
        return added == 0


if __name__ == '__main__':
    r = redis.StrictRedis(host='localhost', port=6379)
    queue = FifoQueue(r)
    proxy = IPProxy('http', '218.66.253.144', '80')
    queue.push(proxy)
    proxy = queue.pop(schema='http')
    print(proxy._get_url())

proxy_crawlers.py

ProxyBaseCrawler is the base class for all proxy-site crawlers. It defines a single _start_crawl() method, which each subclass implements to scrape proxy IPs from its target site.

# -*- coding: utf-8 -*-
from lxml import etree

from ipproxy import IPProxy
from proxy_util import strip, request_page, logger


class ProxyBaseCrawler(object):

    def __init__(self, queue=None, website=None, urls=[]):
        self.queue = queue
        self.website = website
        self.urls = urls

    def _start_crawl(self):
        raise NotImplementedError


class KuaiDailiCrawler(ProxyBaseCrawler):  # 快代理

    def _start_crawl(self):
        for url_dict in self.urls:
            logger.info("Start crawling [" + self.website + "] :::> [" + url_dict['type'] + "]")
            has_more = True
            url = None
            while has_more:
                if 'page' in url_dict.keys() and '{}' in url_dict['url']:
                    url = url_dict['url'].format(str(url_dict['page']))
                    url_dict['page'] = url_dict['page'] + 1
                else:
                    url = url_dict['url']
                    has_more = False
                html = etree.HTML(request_page(url))
                tr_list = html.xpath("//table[@class='table table-bordered table-striped']/tbody/tr")
                for tr in tr_list:
                    ip = tr.xpath("./td[@data-title='IP']/text()")[0] if len(tr.xpath("./td[@data-title='IP']/text()")) else None
                    port = tr.xpath("./td[@data-title='PORT']/text()")[0] if len(tr.xpath("./td[@data-title='PORT']/text()")) else None
                    schema = tr.xpath("./td[@data-title='类型']/text()")[0] if len(tr.xpath("./td[@data-title='类型']/text()")) else None
                    proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
                    if proxy._check_format():
                        self.queue.push(proxy)
                if not tr_list:  # stop paging once a page yields no rows
                    has_more = False


class FeiyiDailiCrawler(ProxyBaseCrawler):  # 飞蚁代理

    def _start_crawl(self):
        for url_dict in self.urls:
            logger.info("Start crawling [" + self.website + "] :::> [" + url_dict['type'] + "]")
            has_more = True
            url = None
            while has_more:
                if 'page' in url_dict.keys() and '{}' in url_dict['url']:
                    url = url_dict['url'].format(str(url_dict['page']))
                    url_dict['page'] = url_dict['page'] + 1
                else:
                    url = url_dict['url']
                    has_more = False
                html = etree.HTML(request_page(url))
                tr_list = html.xpath("//div[@id='main-content']//table/tr[position()>1]")
                for tr in tr_list:
                    ip = tr.xpath("./td[1]/text()")[0] if len(tr.xpath("./td[1]/text()")) else None
                    port = tr.xpath("./td[2]/text()")[0] if len(tr.xpath("./td[2]/text()")) else None
                    schema = tr.xpath("./td[4]/text()")[0] if len(tr.xpath("./td[4]/text()")) else None
                    proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
                    if proxy._check_format():
                        self.queue.push(proxy)
                if not tr_list:  # stop paging once a page yields no rows
                    has_more = False


class WuyouDailiCrawler(ProxyBaseCrawler):  # 无忧代理

    def _start_crawl(self):
        for url_dict in self.urls:
            logger.info("Start crawling [" + self.website + "] :::> [" + url_dict['type'] + "]")
            has_more = True
            url = None
            while has_more:
                if 'page' in url_dict.keys() and '{}' in url_dict['url']:
                    url = url_dict['url'].format(str(url_dict['page']))
                    url_dict['page'] = url_dict['page'] + 1
                else:
                    url = url_dict['url']
                    has_more = False
                html = etree.HTML(request_page(url))
                ul_list = html.xpath("//div[@class='wlist'][2]//ul[@class='l2']")
                for ul in ul_list:
                    ip = ul.xpath("./span[1]/li/text()")[0] if len(ul.xpath("./span[1]/li/text()")) else None
                    port = ul.xpath("./span[2]/li/text()")[0] if len(ul.xpath("./span[2]/li/text()")) else None
                    schema = ul.xpath("./span[4]/li/text()")[0] if len(ul.xpath("./span[4]/li/text()")) else None
                    proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
                    if proxy._check_format():
                        self.queue.push(proxy)
                if not ul_list:  # stop paging once a page yields no rows
                    has_more = False


class IPhaiDailiCrawler(ProxyBaseCrawler):  # IP海代理

    def _start_crawl(self):
        for url_dict in self.urls:
            logger.info("Start crawling [" + self.website + "] :::> [" + url_dict['type'] + "]")
            has_more = True
            url = None
            while has_more:
                if 'page' in url_dict.keys() and '{}' in url_dict['url']:
                    url = url_dict['url'].format(str(url_dict['page']))
                    url_dict['page'] = url_dict['page'] + 1
                else:
                    url = url_dict['url']
                    has_more = False
                html = etree.HTML(request_page(url))
                tr_list = html.xpath("//table//tr[position()>1]")
                for tr in tr_list:
                    ip = tr.xpath("./td[1]/text()")[0] if len(tr.xpath("./td[1]/text()")) else None
                    port = tr.xpath("./td[2]/text()")[0] if len(tr.xpath("./td[2]/text()")) else None
                    schema = tr.xpath("./td[4]/text()")[0] if len(tr.xpath("./td[4]/text()")) else None
                    proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
                    if proxy._check_format():
                        self.queue.push(proxy)
                if not tr_list:  # stop paging once a page yields no rows
                    has_more = False


class YunDailiCrawler(ProxyBaseCrawler):  # 云代理

    def _start_crawl(self):
        for url_dict in self.urls:
            logger.info("Start crawling [" + self.website + "] :::> [" + url_dict['type'] + "]")
            has_more = True
            url = None
            while has_more:
                if 'page' in url_dict.keys() and '{}' in url_dict['url']:
                    url = url_dict['url'].format(str(url_dict['page']))
                    url_dict['page'] = url_dict['page'] + 1
                else:
                    url = url_dict['url']
                    has_more = False
                html = etree.HTML(request_page(url, encoding='gbk'))  # this site is gbk-encoded
                tr_list = html.xpath("//table/tbody/tr")
                for tr in tr_list:
                    ip = tr.xpath("./td[1]/text()")[0] if len(tr.xpath("./td[1]/text()")) else None
                    port = tr.xpath("./td[2]/text()")[0] if len(tr.xpath("./td[2]/text()")) else None
                    schema = tr.xpath("./td[4]/text()")[0] if len(tr.xpath("./td[4]/text()")) else None
                    proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
                    if proxy._check_format():
                        self.queue.push(proxy)
                if not tr_list:  # stop paging once a page yields no rows
                    has_more = False


class XiCiDailiCrawler(ProxyBaseCrawler):  # 西刺代理

    def _start_crawl(self):
        for url_dict in self.urls:
            logger.info("Start crawling [" + self.website + "] :::> [" + url_dict['type'] + "]")
            has_more = True
            url = None
            while has_more:
                if 'page' in url_dict.keys() and '{}' in url_dict['url']:
                    url = url_dict['url'].format(str(url_dict['page']))
                    url_dict['page'] = url_dict['page'] + 1
                else:
                    url = url_dict['url']
                    has_more = False
                html = etree.HTML(request_page(url))
                tr_list = html.xpath("//table[@id='ip_list']//tr[@class!='subtitle']")
                for tr in tr_list:
                    ip = tr.xpath("./td[2]/text()")[0] if len(tr.xpath("./td[2]/text()")) else None
                    port = tr.xpath("./td[3]/text()")[0] if len(tr.xpath("./td[3]/text()")) else None
                    schema = tr.xpath("./td[6]/text()")[0] if len(tr.xpath("./td[6]/text()")) else None
                    if schema is not None and schema.lower() in ("http", "https"):
                        proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
                        if proxy._check_format():
                            self.queue.push(proxy)
                if not tr_list:  # stop paging once a page yields no rows
                    has_more = False

run.py

run.py starts the crawler for each proxy site.

# -*- coding: utf-8 -*-
import redis

from proxy_queue import FifoQueue
from settings import REDIS_HOST, REDIS_PORT
from proxy_crawlers import WuyouDailiCrawler, FeiyiDailiCrawler, KuaiDailiCrawler, IPhaiDailiCrawler, YunDailiCrawler, \
    XiCiDailiCrawler

r = redis.StrictRedis(host=REDIS_HOST, port=REDIS_PORT)
fifo_queue = FifoQueue(r)


def run_kuai():
    kuaidailiCrawler = KuaiDailiCrawler(queue=fifo_queue, website='快代理[国内高匿]',
                                        urls=[{'url': 'https://www.kuaidaili.com/free/inha/{}/', 'type': '国内高匿', 'page': 1},
                                              {'url': 'https://www.kuaidaili.com/free/intr/{}/', 'type': '国内普通', 'page': 1}])
    kuaidailiCrawler._start_crawl()


def run_feiyi():
    feiyidailiCrawler = FeiyiDailiCrawler(queue=fifo_queue, website='飞蚁代理',
                                          urls=[{'url': 'http://www.feiyiproxy.com/?page_id=1457', 'type': '首页推荐'}])
    feiyidailiCrawler._start_crawl()


def run_wuyou():
    wuyoudailiCrawler = WuyouDailiCrawler(queue=fifo_queue, website='无忧代理',
                                          urls=[{'url': 'http://www.data5u.com/free/index.html', 'type': '首页推荐'},
                                                {'url': 'http://www.data5u.com/free/gngn/index.shtml', 'type': '国内高匿'},
                                                {'url': 'http://www.data5u.com/free/gnpt/index.shtml', 'type': '国内普通'}])
    wuyoudailiCrawler._start_crawl()


def run_iphai():
    crawler = IPhaiDailiCrawler(queue=fifo_queue, website='IP海代理',
                                urls=[{'url': 'http://www.iphai.com/free/ng', 'type': '国内高匿'},
                                      {'url': 'http://www.iphai.com/free/np', 'type': '国内普通'},
                                      {'url': 'http://www.iphai.com/free/wg', 'type': '国外高匿'},
                                      {'url': 'http://www.iphai.com/free/wp', 'type': '国外普通'}])
    crawler._start_crawl()


def run_yun():
    crawler = YunDailiCrawler(queue=fifo_queue, website='云代理',
                              urls=[{'url': 'http://www.ip3366.net/free/?stype=1&page={}', 'type': '国内高匿', 'page': 1},
                                    {'url': 'http://www.ip3366.net/free/?stype=2&page={}', 'type': '国内普通', 'page': 1},
                                    {'url': 'http://www.ip3366.net/free/?stype=3&page={}', 'type': '国外高匿', 'page': 1},
                                    {'url': 'http://www.ip3366.net/free/?stype=4&page={}', 'type': '国外普通', 'page': 1}])
    crawler._start_crawl()


def run_xici():
    crawler = XiCiDailiCrawler(queue=fifo_queue, website='西刺代理',
                               urls=[{'url': 'https://www.xicidaili.com/', 'type': '首页推荐'},
                                     {'url': 'https://www.xicidaili.com/nn/{}', 'type': '国内高匿', 'page': 1},
                                     {'url': 'https://www.xicidaili.com/nt/{}', 'type': '国内普通', 'page': 1},
                                     {'url': 'https://www.xicidaili.com/wn/{}', 'type': '国外高匿', 'page': 1},
                                     {'url': 'https://www.xicidaili.com/wt/{}', 'type': '国外普通', 'page': 1}])
    crawler._start_crawl()


if __name__ == '__main__':
    run_xici()
    run_iphai()
    run_kuai()
    run_feiyi()
    run_yun()
    run_wuyou()
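
run.py only fills the pool. A consumer (for example, the actual spider) would pop a proxy from the same queue and route its requests through it. A minimal sketch, assuming Redis is running with the settings above and the pool is not empty:

# -*- coding: utf-8 -*-
# Minimal consumer sketch: pop a proxy from the pool and use it for one request.
import redis
import requests

from proxy_queue import FifoQueue
from settings import REDIS_HOST, REDIS_PORT

r = redis.StrictRedis(host=REDIS_HOST, port=REDIS_PORT)
queue = FifoQueue(r)

proxy = queue.pop(schema='http', timeout=10)  # block up to 10s waiting for a proxy
if proxy is not None:
    proxies = {proxy.schema: proxy._get_url()}
    try:
        response = requests.get('http://icanhazip.com', proxies=proxies, timeout=5)
        proxy._update(successed=(response.status_code == 200))
    except requests.RequestException:
        proxy._update(successed=False)
    finally:
        queue.push(proxy, need_check=False)  # return the proxy to the pool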

That is everything on how to build a proxy pool for Python crawlers. Thanks for reading, and I hope it helps!