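Python Crawler: Scraping Novels and Storing Them in a Database

2025-04-09 Technical Tutorial

Scrape novels from a novel website and save them to a database.

Step 1: Fetch the novel content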

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2, re

domain = 'http://www.quanshu.net'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}


def getTypeList(pn=1):  # fetch the list of novels on one category page
    req = urllib2.Request('http://www.quanshu.net/map/%s.html' % pn)  # build the request object
    req.headers = headers  # replace all headers
    # req.add_header()  # alternative: add a single header
    res = urllib2.urlopen(req)  # send the request
    html = res.read().decode('gbk')  # decode the GBK bytes to Unicode
    reg = r'<a href="(/book/.*?)" target="_blank">(.*?)</a>'
    reg = re.compile(reg)  # compile once for faster matching
    return re.findall(reg, html)  # findall returns a list


def getNovelList(url):  # fetch the chapter list of one novel
    req = urllib2.Request(domain + url)
    req.headers = headers
    res = urllib2.urlopen(req)
    html = res.read().decode('gbk')
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'
    reg = re.compile(reg)
    return re.findall(reg, html)


def getNovelContent(url):  # fetch the text of one chapter
    req = urllib2.Request(domain + url)
    req.headers = headers
    res = urllib2.urlopen(req)
    html = res.read().decode('gbk')
    reg = r'style5\(\);</script>(.*?)<script type="text/javascript">style6\(\)'
    return re.findall(reg, html)[0]


if __name__ == '__main__':
    for sort in range(1, 10):
        for url, title in getTypeList(sort):
            for zurl, ztitle in getNovelList(url):
                print u'Scraping----%s' % ztitle
                content = getNovelContent(url.replace('index.html', zurl))
                print content
                break  # only the first chapter, as a quick check
            break  # only the first novel on each category page
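The loops in the script unpack pairs such as `url, title` because `re.findall` returns a list of tuples when the pattern has two capture groups. The following is a minimal offline sketch of that behavior; `sample_html`, its book paths, and the chapter file name `12345.html` are made up for illustration and are not taken from the site.

```python
import re

# Hypothetical fragment shaped like the links the category-page regex targets.
sample_html = (
    '<a href="/book/1/100/index.html" target="_blank">Novel A</a>'
    '<a href="/book/2/200/index.html" target="_blank">Novel B</a>'
)

# Same pattern as in getTypeList: group 1 is the path, group 2 is the title.
pattern = re.compile(r'<a href="(/book/.*?)" target="_blank">(.*?)</a>')

# With two groups, findall returns a list of (path, title) tuples.
links = pattern.findall(sample_html)
for url, title in links:
    print(url, title)

# Chapter links are relative file names, so the script swaps one in
# for index.html to build the chapter URL.
chapter_url = links[0][0].replace('index.html', '12345.html')
print(chapter_url)
```

Testing the patterns against a saved fragment like this avoids hitting the site repeatedly while the regular expressions are being adjusted.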

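Running the script prints the title and text of the first chapter of the first novel on each of category pages 1 to 9.

Step 2: Store the data in a database

1. Design the database

1.1 Create the database: novel

1.2 Design the table: novel

1.3 Design the table: chapter

Then set a foreign key so that each row in chapter references its novel through the novelid column.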
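As a rough sketch of the two tables, the block below builds them in an in-memory SQLite database. The column names `sort`, `novelname`, `novelid`, `chaptername`, and `content` are the ones the script's INSERT statements use; the `id` primary keys, the column types, and the use of SQLite in place of MySQL are assumptions made so the example runs without a server.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute('PRAGMA foreign_keys = ON')

conn.executescript("""
CREATE TABLE novel (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,  -- assumed key column
    sort      INTEGER NOT NULL,
    novelname TEXT    NOT NULL
);
CREATE TABLE chapter (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,  -- assumed key column
    novelid     INTEGER NOT NULL REFERENCES novel(id),
    chaptername TEXT    NOT NULL,
    content     TEXT
);
""")

# Insert a novel, then a chapter that points at it.
novel_id = conn.execute(
    'INSERT INTO novel (sort, novelname) VALUES (?, ?)', (1, 'Novel A')
).lastrowid
conn.execute(
    'INSERT INTO chapter (novelid, chaptername, content) VALUES (?, ?, ?)',
    (novel_id, 'Chapter 1', 'text'),
)

# A chapter pointing at a non-existent novel is rejected by the foreign key.
try:
    conn.execute(
        'INSERT INTO chapter (novelid, chaptername, content) VALUES (?, ?, ?)',
        (999, 'Orphan', 'text'),
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```

The foreign key is what lets the database, rather than the script, guarantee that every chapter belongs to a stored novel.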

2. Write the script

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2, re
import MySQLdb


class Sql(object):
    conn = MySQLdb.connect(host='192.168.19.213', port=3306, user='root',
                           passwd='Admin123', db='novel', charset='utf8')

    def addnovels(self, sort, novelname):
        cur = self.conn.cursor()
        # Pass values as parameters so quotes in the text cannot break the SQL
        cur.execute("insert into novel(sort, novelname) values(%s, %s)",
                    (sort, novelname))
        lastrowid = cur.lastrowid  # id of the novel row just inserted
        cur.close()
        self.conn.commit()
        return lastrowid

    def addchapters(self, novelid, chaptername, content):
        cur = self.conn.cursor()
        cur.execute("insert into chapter(novelid, chaptername, content) values(%s, %s, %s)",
                    (novelid, chaptername, content))
        cur.close()
        self.conn.commit()


domain = 'http://www.quanshu.net'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}


def getTypeList(pn=1):  # fetch the list of novels on one category page
    req = urllib2.Request('http://www.quanshu.net/map/%s.html' % pn)  # build the request object
    req.headers = headers  # replace all headers
    # req.add_header()  # alternative: add a single header
    res = urllib2.urlopen(req)  # send the request
    html = res.read().decode('gbk')  # decode the GBK bytes to Unicode
    reg = r'<a href="(/book/.*?)" target="_blank">(.*?)</a>'
    reg = re.compile(reg)  # compile once for faster matching
    return re.findall(reg, html)  # findall returns a list


def getNovelList(url):  # fetch the chapter list of one novel
    req = urllib2.Request(domain + url)
    req.headers = headers
    res = urllib2.urlopen(req)
    html = res.read().decode('gbk')
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'
    reg = re.compile(reg)
    return re.findall(reg, html)


def getNovelContent(url):  # fetch the text of one chapter
    req = urllib2.Request(domain + url)
    req.headers = headers
    res = urllib2.urlopen(req)
    html = res.read().decode('gbk')
    reg = r'style5\(\);</script>(.*?)<script type="text/javascript">style6\(\)'
    return re.findall(reg, html)[0]


mysql = Sql()

if __name__ == '__main__':
    for sort in range(1, 10):
        for url, title in getTypeList(sort):
            lastrowid = mysql.addnovels(sort, title)  # store the novel, keep its id
            for zurl, ztitle in getNovelList(url):
                print u'Scraping----%s' % ztitle
                content = getNovelContent(url.replace('index.html', zurl))
                print u'Storing----%s' % ztitle
                mysql.addchapters(lastrowid, ztitle, content)
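Chapter text routinely contains quote characters, so it matters that values reach the database as bound parameters rather than being pasted into the SQL string. The sketch below shows the idea with the standard-library `sqlite3` module so it can run without a MySQL server; the simplified `chapter` table and the helper `add_chapter` are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Simplified, hypothetical table: just enough columns to show the insert.
conn.execute(
    'CREATE TABLE chapter (id INTEGER PRIMARY KEY, chaptername TEXT, content TEXT)'
)


def add_chapter(conn, chaptername, content):
    """Insert one chapter using bound parameters and return its row id."""
    cur = conn.cursor()
    # sqlite3 uses ? placeholders; MySQLdb uses %s, with the values
    # passed as a separate tuple in the same way.
    cur.execute(
        'INSERT INTO chapter (chaptername, content) VALUES (?, ?)',
        (chaptername, content),
    )
    conn.commit()
    return cur.lastrowid


# Text with embedded single quotes is stored unchanged.
tricky = u"He said: 'it's done'"
row_id = add_chapter(conn, u'Chapter 1', tricky)
stored = conn.execute(
    'SELECT content FROM chapter WHERE id = ?', (row_id,)
).fetchone()[0]
print(stored == tricky)
```

The driver handles the quoting, so the same call works whatever characters the scraped text contains.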

3. Run the script

4. Check the database

The novels and chapters have been stored successfully.

Error:

_mysql_exceptions.OperationalError: (1364, "Field 'novelid' doesn't have a default value")

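Fix: run the SQL statements below. The first shows the current global sql_mode; the second sets it to NO_ENGINE_SUBSTITUTION only, which turns off strict mode so that MySQL no longer rejects an INSERT that omits a column without a default value. The global setting applies to connections opened after the change.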

SELECT @@GLOBAL.sql_mode;

SET @@GLOBAL.sql_mode="NO_ENGINE_SUBSTITUTION";

Error reference: http://blog.sina.com.cn/s/blog_6d2b3e4901011j9w.html