This program uses a POST request to 'http://httpbin.org/post' as its example.

Format:

Import urllib.request

Import urllib.parse

URL-encode the data, then convert it to UTF-8 bytes: data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')

Open the target page: response = urllib.request.urlopen('URL', data=data)

Read the page content: html = response.read()
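The encoding step can be checked without touching the network. The sketch below is only an illustration: it builds a urllib.request.Request object instead of calling urlopen, so nothing is sent. It shows that supplying data is what turns the request into a POST.

```python
import urllib.parse
import urllib.request

form = {'word': 'hello'}

# urlencode returns a str such as 'word=hello'
query = urllib.parse.urlencode(form)

# bytes(..., encoding='utf-8') converts that str to the bytes urlopen expects
data = bytes(query, encoding='utf-8')

# Building a Request does not send anything; it only describes the request
request = urllib.request.Request('http://httpbin.org/post', data=data)
method = request.get_method()  # 'POST' because data is not None

print(query, data, method)
```

Passing the same data to urllib.request.urlopen, as in the steps above, sends it as the POST body.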

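Print:

1. Without decode

print(html)  # prints the bytes object on one line as b'...' with newlines shown as \n, which is hard to read

2. With decode

print(html.decode())  # prints the content as text with real line breaks, like well-formatted code, which is easy to read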
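To see the difference without making a request, here is a small sketch that uses a made-up bytes value in place of response.read(). The sample content is an assumption for illustration, not real httpbin output.

```python
import json

# Hypothetical stand-in for what response.read() might return
raw = b'{\n  "form": {\n    "word": "hello"\n  }\n}\n'

# Printing bytes shows the repr: a single line with \n escape sequences
print(raw)

# decode() turns the bytes into str (UTF-8 is the default codec)
text = raw.decode('utf-8')
print(text)

# Once decoded, a JSON body can be parsed like any other string
parsed = json.loads(text)
print(parsed['form']['word'])
```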

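Check the request result:

a. response.status  # 200 means the request succeeded; for a failure such as 404 (page not found), urlopen raises urllib.error.HTTPError and the code is in the exception's code attribute

b. response.getcode()  # returns the same status code as response.status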
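Here is a rough sketch of checking both outcomes. To keep it runnable offline it starts a throwaway local server (DemoHandler and post_status are hypothetical names, not part of urllib) and uses an opener with proxies disabled, which is an assumption made only so the loopback request is not routed through a proxy. opener.open takes the same url and data arguments as urlopen.

```python
import threading
import urllib.error
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local stand-in for a remote site: echoes POSTs to /post."""

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        if self.path == '/post':
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Opener without proxy handling (assumption: only needed for the local demo)
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def post_status(url, fields):
    """POST the fields and return the HTTP status code."""
    data = bytes(urllib.parse.urlencode(fields), encoding='utf-8')
    try:
        with opener.open(url, data=data) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error statuses such as 404 arrive as an exception, not a response
        return err.code


server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_port

ok_status = post_status(base + '/post', {'word': 'hello'})
missing_status = post_status(base + '/missing', {'word': 'hello'})
print(ok_status, missing_status)

server.shutdown()
server.server_close()
```

Against a real site the same try/except works with urllib.request.urlopen in place of opener.open.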



1. The program without decode:

import urllib.request
import urllib.parse

# URL-encode the form fields, then convert the str to UTF-8 bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
# Passing data makes urlopen send a POST request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.read()
print(html)  # raw bytes
print("------------------------------------------------------------------")
print("------------------------------------------------------------------")
print(response.status)
print(response.getcode())


Output:


2. The program with decode:

import urllib.request
import urllib.parse

# URL-encode the form fields, then convert the str to UTF-8 bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
# Passing data makes urlopen send a POST request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.read()
print(html.decode())  # decoded text
print("------------------------------------------------------------------")
print("------------------------------------------------------------------")
print(response.status)
print(response.getcode())


Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.4"
  },
  "json": null,
  "origin": "106.14.17.222",
  "url": "http://httpbin.org/post"
}
------------------------------------------------------------------
------------------------------------------------------------------
200
200


Why convert with bytes?

Because without it, the following code fails:

data = urllib.parse.urlencode({'word': 'hello'})  # bytes() not used
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.read()

Error message:

Traceback (most recent call last):
  File "/usercode/file.py", line 15, in <module>
    response = urllib.request.urlopen('http://httpbin.org/post', data=data)
  File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 453, in open
    req = meth(req)
  File "/usr/lib/python3.4/urllib/request.py", line 1104, in do_request_
    raise TypeError(msg)
TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.

This shows that a POST request needs its body encoded as bytes; a str is rejected.
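The failure and the fix can be put side by side in a short sketch. In the CPython versions I have looked at, urllib checks the type of data before it opens any connection, so the first call below should fail with TypeError without reaching the server. Treat that ordering as an assumption about the implementation rather than a documented guarantee.

```python
import urllib.parse
import urllib.request

query = urllib.parse.urlencode({'word': 'hello'})  # a str, not bytes

error = None
try:
    # Passing a str as POST data is rejected with TypeError
    urllib.request.urlopen('http://httpbin.org/post', data=query, timeout=5)
except TypeError as exc:
    error = exc

print(type(error).__name__)

# Either spelling produces the same bytes and is accepted as POST data
fixed = bytes(query, encoding='utf-8')
also_fixed = query.encode('utf-8')
print(fixed == also_fixed)
```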

class bytes([source[, encoding[, errors]]])

Return a new “bytes” object, which is an immutable sequence of integers in the range 0 <= x < 256. bytes is an immutable version of bytearray – it has the same non-mutating methods and the same indexing and slicing behavior.

Accordingly, constructor arguments are interpreted as for bytearray().
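A few lines make that description concrete. This is just a small illustration of the quoted behavior, using 'hello' as an arbitrary sample string.

```python
b = bytes('hello', encoding='utf-8')

# A bytes object is a sequence of integers in the range 0..255
values = list(b)   # [104, 101, 108, 108, 111]
first = b[0]       # 104, the code for 'h'

# bytes is immutable: item assignment raises TypeError
immutable = False
try:
    b[0] = 72
except TypeError:
    immutable = True

# bytearray is the mutable counterpart with the same indexing behavior
ba = bytearray(b)
ba[0] = 72         # 72 is the code for 'H'
result = bytes(ba)

print(values, first, immutable, result)
```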