Implementing an IP Proxy Pool for a Python Crawler


In many cases, whether we are crawling pages with multiple threads or simply trying to get around anti-crawling measures, we need to access the target site through proxy IPs. Below is a basic way to implement this.

As for getting the proxy IP list (a .txt attachment), plenty of websites offer this service. Broadly speaking, reliability is proportional to how much you pay: the free IPs offered inside China are mostly unusable, and for reliable proxies you have to pay; sources outside China are somewhat better, and some of their free IPs are actually fairly dependable.

A quick search online turned up a suitable site. I had originally planned to crawl some proxy IPs by hand, but it turned out a ready-made txt file can be downloaded directly.

After downloading it, let's try crawling the Baidu homepage through each of the proxies.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li
import re, urllib.request

fp = open("c:\\temp\\thebigproxylist-17-12-20.txt", 'r')
lines = fp.readlines()

for ip in lines:
    ip = ip.strip()  # drop the trailing newline read from the file
    try:
        print("Current proxy IP: " + ip)
        proxy = urllib.request.ProxyHandler({"http": ip})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        url = "http://www.baidu.com"
        data = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
        print("Passed")
        print("-----------------------------")
    except Exception as err:
        print(err)
        print("-----------------------------")

fp.close()

The result is as follows:

C:\Python36\python.exe C:/Users/yuan.li/Documents/GitHub/Python/Misc/爬虫/proxy.py
Current proxy IP: 137.74.168.174:80
Passed
-----------------------------
Current proxy IP: 103.28.161.68:8080
Passed
-----------------------------
Current proxy IP: 91.151.106.127:53281
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP: 177.136.252.7:3128
<urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
-----------------------------
Current proxy IP: 47.89.22.200:80
Passed
-----------------------------
Current proxy IP: 118.69.61.57:8888
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP: 192.241.190.167:8080
Passed
-----------------------------
Current proxy IP: 185.124.112.130:80
Passed
-----------------------------
Current proxy IP: 83.65.246.181:3128
Passed
-----------------------------
Current proxy IP: 79.137.42.124:3128
Passed
-----------------------------
Current proxy IP: 95.0.217.32:8080
<urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
-----------------------------
Current proxy IP: 104.131.94.221:8080
Passed
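One caveat about the loop above: urlopen is called without a timeout, so a dead proxy (like the WinError 10060 cases in the output) only fails after the operating system's own connection timeout, which can take quite a while. The variation below adds an explicit timeout; the check_proxy helper is only an illustrative sketch, not part of the original script:

import urllib.request

def check_proxy(ip, timeout=5):
    """Return True if Baidu's homepage is reachable through the given proxy."""
    proxy = urllib.request.ProxyHandler({"http": ip})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    try:
        # The timeout keeps a dead proxy from blocking the loop for minutes
        opener.open("http://www.baidu.com", timeout=timeout).read()
        return True
    except Exception:
        return False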

However, this approach only works with a fairly stable IP source. If the IPs are unstable, the text file can go stale very quickly, so it is better to fetch the latest IP addresses dynamically. Many sites provide an API for querying them in real time.
Using the same site as before, this time we call its API; here we have to disguise ourselves as a browser (by setting a User-Agent header) before the page can be fetched.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li
import re, urllib.request

headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# Install as the global opener so urlopen uses the spoofed User-Agent
urllib.request.install_opener(opener)

data = urllib.request.urlopen("http://www.thebigproxylist.com/members/proxy-api.php?output=all&user=list&pass=8a544b2637e7a45d1536e34680e11adf").read().decode('utf8')
ippool = data.split('\n')

for ip in ippool:
    ip = ip.split(',')[0]
    try:
        print("Current proxy IP: " + ip)
        proxy = urllib.request.ProxyHandler({"http": ip})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        url = "http://www.baidu.com"
        data = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
        print("Passed")
        print("-----------------------------")
    except Exception as err:
        print(err)
        print("-----------------------------")

The result is as follows:

C:\Python36\python.exe C:/Users/yuan.li/Documents/GitHub/Python/Misc/爬虫/proxy.py
Current proxy IP: 213.233.57.134:80
HTTP Error 403: Forbidden
-----------------------------
Current proxy IP: 144.76.81.79:3128
Passed
-----------------------------
Current proxy IP: 45.55.132.29:53281
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP: 180.254.133.124:8080
Passed
-----------------------------
Current proxy IP: 5.196.215.231:3128
HTTP Error 503: Service Unavailable
-----------------------------
Current proxy IP: 177.99.175.195:53281
HTTP Error 503: Service Unavailable
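Since the goal is a proxy pool rather than a one-off check, a natural next step is to keep the proxies that pass the test in a list and pick one at random for each real request. The sketch below assumes such a working_proxies list has been filled by the checking loop; fetch_via_pool is an illustrative helper, not part of the original script:

import random
import urllib.request

working_proxies = []  # append each proxy here whenever the check above prints "Passed"

def fetch_via_pool(url, timeout=5):
    """Fetch url through a randomly chosen proxy from the pool."""
    ip = random.choice(working_proxies)
    proxy = urllib.request.ProxyHandler({"http": ip})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    return opener.open(url, timeout=timeout).read().decode('utf-8', 'ignore')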

Reading the list sequentially in a plain for loop is far too slow, so I rewrote it with multiple threads, which speeds things up considerably.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Yuan Li
import threading
import queue
import re, urllib.request

# Number of worker threads
n_thread = 10
# Create the job queue (the variable shadows the queue module, which is no longer needed after this)
queue = queue.Queue()

class ThreadClass(threading.Thread):
    def __init__(self, queue):
        super(ThreadClass, self).__init__()
        # Assign the thread its working queue
        self.queue = queue

    def run(self):
        while True:
            # Get a job (a proxy host) from the queue
            host = self.queue.get()
            print(self.getName() + ": " + host)
            try:
                # print("Current proxy IP: " + host)
                proxy = urllib.request.ProxyHandler({"http": host})
                opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
                urllib.request.install_opener(opener)
                url = "http://www.baidu.com"
                data = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
                print("Passed")
                print("-----------------------------")
            except Exception as err:
                print(err)
                print("-----------------------------")
            # Signal to the queue that this job is done
            self.queue.task_done()