Python web scraping with the urllib module: POST requests
This program uses 'http://httpbin.org/post' as the example URL.
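Format:
Import urllib.request
Import urllib.parse
URL-encode the form data, then convert it to bytes using UTF-8: data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding = 'utf-8')
Open the target page, passing the data so that a POST request is sent: response = urllib.request.urlopen('URL', data = data)
Read the response body: html = response.read()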
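As a minimal sketch of the encoding step, the following shows what urlencode produces and how attaching data turns the request into a POST. The extra `lang` field and its value are illustrative assumptions, not part of the original example. Nothing is sent over the network here.

```python
import urllib.parse
import urllib.request

# The original form data from this article.
fields = {'word': 'hello'}
encoded = urllib.parse.urlencode(fields)     # a str
data = bytes(encoded, encoding='utf-8')      # the bytes that urlopen expects

print(encoded)      # word=hello
print(data)         # b'word=hello'
print(len(data))    # 10

# Hypothetical extra field, only to show how special characters are escaped.
extra = urllib.parse.urlencode({'word': 'hello world', 'lang': '中文'})
print(extra)        # word=hello+world&lang=%E4%B8%AD%E6%96%87

# Building a Request object does not contact the server; it only shows
# that supplying data switches the HTTP method from GET to POST.
req = urllib.request.Request('http://httpbin.org/post', data=data)
print(req.get_method())   # POST
```

The length of 10 bytes matches the Content-Length header that httpbin echoes back in the output further below.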
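Printing:
1. Without decode
print(html) # prints the bytes repr: a single b'...' literal with line breaks shown as \n escapes, which is hard to read
2. With decode
print(html.decode()) # converts the bytes to str (UTF-8 by default), so the body prints with real line breaks and indentation, like neatly formatted code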
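The difference between the two print calls can be reproduced offline. This sketch uses a short hand-written bytes value standing in for the real response body (an assumption for illustration), so it runs without a network connection.

```python
# Stand-in for what response.read() returns: raw bytes, not str.
html = b'{\n  "form": {\n    "word": "hello"\n  }\n}\n'

# Printing bytes shows the repr: a b'...' literal with \n escapes on one line.
print(html)

# decode() turns the bytes into str (UTF-8 by default), so the newlines
# are rendered as real line breaks.
text = html.decode()
print(text)

print(type(html).__name__, type(text).__name__)   # bytes str
```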
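Checking the result of the request:
a. response.status # the HTTP status code; 200 means the request succeeded
b. response.getcode() # returns the same status code as response.status
Note that for an error status such as 404 (page not found), urlopen raises urllib.error.HTTPError instead of returning a response; the code is then available as the exception's code attribute.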
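One caveat worth sketching: with the default opener, urlopen does not hand back a response object for a 404. It raises urllib.error.HTTPError, and the status code is read from the exception instead. The helper name `post_status` below is hypothetical and only illustrates that pattern.

```python
import urllib.error
import urllib.request


def post_status(url, data):
    """Send a POST request and return the HTTP status code as an int."""
    try:
        with urllib.request.urlopen(url, data=data) as response:
            # Reached for successful responses such as 200.
            return response.status
    except urllib.error.HTTPError as err:
        # Error responses such as 404 arrive here as an exception.
        return err.code
```

Assuming httpbin.org is reachable, post_status('http://httpbin.org/post', b'word=hello') would be expected to return 200.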
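1. The program without decode:
    import urllib.request
    import urllib.parse

    # URL-encode the form data, then convert the resulting str to bytes
    data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
    # Supplying data makes urlopen send a POST request
    response = urllib.request.urlopen('http://httpbin.org/post', data=data)
    html = response.read()
    print(html)  # prints the raw bytes
    print("------------------------------------------------------------------")
    print("------------------------------------------------------------------")
    print(response.status)
    print(response.getcode())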
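Output: the body is printed as a single b'...' bytes literal on one line, followed by the two separator lines and the status code 200 printed twice.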
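2. The program with decode:
    import urllib.request
    import urllib.parse

    # URL-encode the form data, then convert the resulting str to bytes
    data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
    # Supplying data makes urlopen send a POST request
    response = urllib.request.urlopen('http://httpbin.org/post', data=data)
    html = response.read()
    print(html.decode())  # decode the bytes to str before printing
    print("------------------------------------------------------------------")
    print("------------------------------------------------------------------")
    print(response.status)
    print(response.getcode())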
Output:
    {
      "args": {},
      "data": "",
      "files": {},
      "form": {
        "word": "hello"
      },
      "headers": {
        "Accept-Encoding": "identity",
        "Connection": "close",
        "Content-Length": "10",
        "Content-Type": "application/x-www-form-urlencoded",
        "Host": "httpbin.org",
        "User-Agent": "Python-urllib/3.4"
      },
      "json": null,
      "origin": "106.14.17.222",
      "url": "http://httpbin.org/post"
    }
    ------------------------------------------------------------------
    ------------------------------------------------------------------
    200
    200
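Since httpbin answers with JSON, the decoded body can be turned into a Python dict with the standard json module. A minimal sketch, using an abridged copy of the output above as a literal string so that it runs offline:

```python
import json

# Abridged copy of the decoded response body shown in this article.
body = '''{
  "args": {},
  "data": "",
  "form": {"word": "hello"},
  "json": null,
  "url": "http://httpbin.org/post"
}'''

result = json.loads(body)

print(result['form'])          # {'word': 'hello'}
print(result['form']['word'])  # hello
print(result['json'])          # None (JSON null becomes Python None)
```

With a live response, the same call would be json.loads(html.decode()).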
Why convert with bytes?
Because without the conversion, the request fails:
    data = urllib.parse.urlencode({'word': 'hello'})  # str, not converted with bytes
    response = urllib.request.urlopen('http://httpbin.org/post', data=data)
    html = response.read()
Error message:
    Traceback (most recent call last):
      File "/usercode/file.py", line 15, in <module>
        response=urllib.request.urlopen('http://httpbin.org/post',data=data)
      File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.4/urllib/request.py", line 453, in open
        req = meth(req)
      File "/usr/lib/python3.4/urllib/request.py", line 1104, in do_request_
        raise TypeError(msg)
    TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.
This shows that for a POST request, the request data must be encoded as bytes before it is passed to urlopen.
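The failure and its fix can be sketched side by side. The type check happens while urllib prepares the request (the traceback above ends in do_request_), so the error is expected to surface before any connection is attempted; treat that ordering as an observation about the standard library's implementation rather than a documented guarantee.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
encoded = urllib.parse.urlencode({'word': 'hello'})   # str, not yet bytes

try:
    # Passing a str as POST data is rejected with TypeError.
    urllib.request.urlopen(url, data=encoded)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Two equivalent ways to produce the bytes that urlopen accepts.
data_a = bytes(encoded, encoding='utf-8')
data_b = encoded.encode('utf-8')
print(data_a == data_b)   # True
```

Either form works; str.encode is the shorter spelling, while bytes(...) is the one used throughout this article.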
class bytes([source[, encoding[, errors]]])
Return a new "bytes" object, which is an immutable sequence of integers in the range 0 <= x < 256. bytes is an immutable version of bytearray; it has the same non-mutating methods and the same indexing and slicing behavior.
Accordingly, constructor arguments are interpreted as for bytearray().
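The properties described in this excerpt can be checked directly in the interpreter. A small sketch:

```python
data = bytes('word=hello', encoding='utf-8')

# A bytes object is a sequence of integers in the range 0..255.
print(data[0])          # 119, the code for 'w'
print(list(data[:4]))   # [119, 111, 114, 100]

# Slicing returns bytes again.
print(data[:4])         # b'word'

# bytes is immutable: item assignment raises TypeError.
try:
    data[0] = 87
    mutated = True
except TypeError:
    mutated = False
print(mutated)          # False

# bytearray is the mutable counterpart and accepts the same arguments.
buf = bytearray('word=hello', encoding='utf-8')
buf[0] = 87             # 87 is the code for 'W'
print(buf)              # bytearray(b'Word=hello')
```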