By default, keep a site's crawler restrictions in mind before you crawl it:
robots.txt
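
As a minimal sketch of how robots.txt rules can be checked before crawling, the standard library's `urllib.robotparser` can parse the file and answer whether a given user agent may fetch a URL. The robots.txt content, the user agent names (`BadCrawler`, `GoodCrawler`) and the `example.com` URLs below are made-up placeholders, inlined so the example runs offline; for a real site you would point the parser at the site's own robots.txt with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only).
SAMPLE_ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# BadCrawler is banned from the whole site.
print(parser.can_fetch("BadCrawler", "http://example.com/index.html"))   # False
# Any other agent falls under the "*" rules: allowed except under /trap.
print(parser.can_fetch("GoodCrawler", "http://example.com/index.html"))  # True
print(parser.can_fetch("GoodCrawler", "http://example.com/trap"))        # False
# Seconds to wait between requests, if the site asks for a delay.
print(parser.crawl_delay("GoodCrawler"))                                 # 5
```

Calling `can_fetch()` before every download, and sleeping for `crawl_delay()` seconds between requests, is one simple way to keep these restrictions in mind by default.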

Before crawling a site, how do you estimate how much there is to crawl?
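
These notes do not record an answer, so the following is only one rough approach, under the assumption that the site publishes a sitemap in the standard sitemaps.org format: counting its `<loc>` entries gives a lower-bound estimate of the number of pages. The sitemap content and the function name `count_sitemap_urls` are hypothetical, and the XML is inlined so the example runs offline.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap (an assumption for illustration only).
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/view/1</loc></url>
  <url><loc>http://example.com/view/2</loc></url>
  <url><loc>http://example.com/view/3</loc></url>
</urlset>
"""


def count_sitemap_urls(sitemap_xml: bytes) -> int:
    """Count the <loc> entries in an already-downloaded sitemap."""
    root = ET.fromstring(sitemap_xml)
    return sum(1 for _ in root.iter(SITEMAP_NS + "loc"))


print(count_sitemap_urls(SAMPLE_SITEMAP))  # 3
```

Sitemaps are often incomplete or out of date, so treat the count as a rough guide for planning the crawl rather than an exact figure.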


Every website is built with its own mix of technologies. How do you find out which ones a site uses?
pip install builtwith

>>> import builtwith

>>> builtwith.parse('http://www.douban.com')

{u'javascript-frameworks': [u'jQuery'], u'tag-managers': [u'Google Tag Manager'], u'analytics': [u'Piwik']}



How do you find out who owns a website?
pip install python-whois

>>> import whois

>>> print(whois.whois('cnblogs.com'))

{
  "updated_date": ["2014-11-12 00:00:00", "2014-11-12 01:07:15"],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited"
  ],
  "name": "du yong",
  "dnssec": "unsigned",
  "city": "Shanghai",
  "expiration_date": ["2021-11-12 00:00:00", "2021-11-11 04:00:00"],
  "zipcode": "201203",
  "domain_name": ["CNBLOGS.COM", "cnblogs.com"],
  "country": "CN",
  "whois_server": "whois.35.com",
  "state": "Shanghai",
  "registrar": "35 Technology Co., Ltd.",
  "referral_url": "http://www.35.com",
  "address": "Room 312, No.22 BOXIA Rd, Pudong New District",
  "name_servers": ["NS3.DNSV4.COM", "NS4.DNSV4.COM", "ns3.dnsv4.com", "ns4.dnsv4.com"],
  "org": "Shanghai Yucheng Information Technology Co. Ltd.",
  "creation_date": ["2003-11-12 00:00:00", "2003-11-11 04:00:00"],
  "emails": ["abuse@35.cn", "dudu.yz@gmail.com"]
}