古诗文网站的网络爬虫编写方式,通过网络爬虫抓去内容
1. 以下就是古诗文网站的爬虫代码,请看:
#encoding:utf-8importrequestsimportreimportjsondefparse_page(url):#1.请求网站headers={"User-Agent":"Mozilla/5.0(WindowsNT6.1;Win64;x64)AppleWebKit/537.36(KHTML,likeGecko)Chrome/67.0.3396.62Safari/537.36"}response=requests.get(url,headers=headers)text=response.text#2.解析网站titles=re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>',text,re.DOTALL)#printjson.dumps(titles,encoding="utf-8",ensure_ascii=False)times=re.findall(r'<p\sclass="source">.*?<a\s.*?>(.*?)</a>',text,re.DOTALL)#printjson.dumps(times,encoding="utf-8",ensure_ascii=False)authors=re.findall(r'<pclass="source">.*?<a.*?<a.*?>(.*?)</a>',text,re.DOTALL)poems_ret=re.findall(r'<divclass="contson"id=.*?>(.*?)</div>',text,re.DOTALL)poems=[]forpoeminpoems_ret:temp=re.sub("<.*?>","",poem)poems.append(temp.strip())#forindex,valueinenumerate(titles):#printtitles[index]#printtimes[index]#printauthors[index]#printpoems[index]#print"*"*50#zip函数自动实现上述组合results=[]forvalueinzip(titles,times,authors,poems):title,time,author,poem=valueresult={"标题":title,"朝代":time,"作者":author,"原文":poem}printresult["标题"]results.append(result)#printresultsdefmain():url_base="https://www.xzslx.net/gushi/"foriinrange(1,11):url=url_base.format(i)print""*20+"优美古诗文"+""*20print"*"*50parse_page(url)print"*"*50if__name__=='__main__':main()
2. 输出来的结果是:
C:\DDD\python22\python.exeC:/PyCharm/dytt_spider/poems.py古诗文**************************************************关山月明月出天山,苍茫云海间。长风几×××,吹度玉门关。汉下白登道,胡窥青海湾。[2]由来征战地,不见有人还。戍客望边邑,思归多苦颜。高楼当此夜,叹息未应闲。**************************************************古诗文**************************************************陇西行四首·其二誓扫匈奴不顾身,五千貂锦丧胡尘。可怜无定河边骨,犹是春闺梦里人!**************************************************古诗文**************************************************嫦娥(嫦娥应悔偷灵药)云母屏风烛影深,长河渐落晓星沉。嫦娥应悔偷灵药,碧海青天夜夜心。**************************************************
Process finished with exit code 0
声明:本站所有文章资源内容,如无特殊说明或标注,均为采集网络资源。如若本站内容侵犯了原著者的合法权益,可联系本站删除。