Scrapy Study Notes 1: A Complete Crawling Example
1. Create the project
scrapy startproject dmoz
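2. Create dmoz_spider.py (in the project's spiders directory, dmoz/spiders/)
# Import paths as used in the original notes; newer Scrapy releases
# expose the base class as scrapy.Spider.
from scrapy.spider import Spider
from scrapy.selector import Selector

from dmoz.items import DmozItem


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below are a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        # Each <li> in the "directory-url" list is one directory entry.
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = DmozItem()
            # extract() and re() both return lists of strings.
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            # Raw string; same pattern as the original '-\s[^\n]*\\r'.
            item['description'] = site.xpath('text()').re(r'-\s[^\n]*\r')
            items.append(item)

        return items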
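The description field is the least obvious part of parse(): .re() applies the pattern to the selected text nodes and returns the matching strings. The sketch below applies the same pattern with the standard re module to a made-up text node, so the match can be inspected without running a crawl. The sample string is an assumption, not real dmoz.org markup, and re.findall only approximates what Scrapy's .re() does.

```python
import re

# Same pattern as in parse(), written as a raw string.
DESCRIPTION_RE = re.compile(r'-\s[^\n]*\r')

# Hypothetical text node following the <a> element of one <li>;
# the real page content may differ.
sample_text = "\r\n\t- An invented book description.\r\n"

matches = DESCRIPTION_RE.findall(sample_text)
print(matches)

# The match keeps the leading "- " and the trailing "\r",
# so a cleanup step may be wanted before storing the value.
cleaned = [m[2:].rstrip("\r") for m in matches]
print(cleaned)
```

The pattern only matches text that ends in a carriage return, so entries served with plain "\n" line endings would yield an empty list.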
3. Modify items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class DmozItem(Item):
    # One Field per value that the spider fills in.
    name = Field()
    description = Field()
    url = Field()
4. Modify pipelines.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class DmozItem(Item):
    name = Field()
    description = Field()
    url = Field()
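The listing given for this step repeats the item definition rather than showing pipeline code. For orientation, here is a minimal sketch of what an item pipeline for these items could look like. The class name DmozCleanupPipeline and its cleanup behavior are assumptions for illustration, not the original author's code. A pipeline is a plain class with a process_item(self, item, spider) method that Scrapy calls once per scraped item.

```python
class DmozCleanupPipeline:
    """Hypothetical pipeline: turn the list values from extract()/re() into strings."""

    fields = ("name", "url", "description")

    def process_item(self, item, spider):
        # Join each list of strings into one string and trim whitespace.
        for field in self.fields:
            values = item.get(field) or []
            item[field] = " ".join(value.strip() for value in values)
        return item  # returning the item passes it on to the next stage


# Quick check with a plain dict standing in for a DmozItem (made-up values).
sample = {
    "name": ["Example Book"],
    "url": ["http://example.com/book"],
    "description": ["- An invented description.\r"],
}
result = DmozCleanupPipeline().process_item(sample, spider=None)
print(result)
```

A pipeline only takes effect once it is listed in the ITEM_PIPELINES setting in settings.py, for example under the (assumed) path "dmoz.pipelines.DmozCleanupPipeline". The crawl command in the next step exports items whether or not any pipeline is enabled.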
5. Run the following from the root of the dmoz project folder
scrapy crawl dmoz -o dmoz.json
This runs the spider and writes the scraped items to dmoz.json.
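Once the crawl has finished, the export can be inspected with the standard json module. The sketch below assumes that dmoz.json holds a JSON array of objects with the name, url and description keys from DmozItem. The sample data is made up so the snippet runs without a crawl, and the helper names as_text and summarize are illustrative.

```python
import json


def as_text(value):
    """Join list values (as produced by extract()) into one string."""
    if isinstance(value, str):
        return value
    return " ".join(value or [])


def summarize(items):
    """Return one 'name -> url' line per exported item."""
    return [f"{as_text(item.get('name'))} -> {as_text(item.get('url'))}" for item in items]


# Made-up stand-in for the contents of dmoz.json.
sample_json = (
    '[{"name": ["Example Book"], '
    '"url": ["http://example.com/book"], '
    '"description": []}]'
)

# With a real export, read the file instead:
#     with open("dmoz.json", encoding="utf-8") as f:
#         items = json.load(f)
items = json.loads(sample_json)

for line in summarize(items):
    print(line)
```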