Ways to fetch a web page

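Loading a web page always uses one of several request types (HTTP methods), and these methods are the key to how a page gets opened. The most important methods are GET and POST (there are others as well, such as HEAD and DELETE).

Below are the main characteristics of these two methods.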

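POST

Logging in to an account
Searching for content
Uploading images
Uploading files
Sending other data to the server

GET

Opening a web page normally
Sending no data to the server beyond optional query parameters in the URL

Seen this way, GET is enough for many pages. POST is for sending the server a personalized request, for example passing your username and password so that it returns HTML containing your personal information.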

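In terms of active versus passive: POST means "to send" and is the more active of the two, because what you send shapes what the server returns. GET means "to obtain" and is the more passive: in the simple case you send the server no personalized information, so it has nothing on which to base a different HTML response.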

The get method

GET parameters are usually appended to the URL in the form ?parameter1=xxx&parameter2=xxxx: the ? starts the query string and & separates the parameters.
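
To make the ? and & convention concrete, here is a minimal sketch that takes a URL apart and rebuilds its query string. It uses Python's standard urllib.parse module rather than requests, purely for illustration; the URL is the httpbin address used in the examples of this post.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

url = "http://httpbin.org/get?a=2&c=3&w="

# Split the URL into scheme, host, path, query, ...
parts = urlsplit(url)

# keep_blank_values=True keeps parameters with an empty value, such as w=
pairs = parse_qsl(parts.query, keep_blank_values=True)

# Join the pairs back together with & to get the original query string
rebuilt_query = urlencode(pairs)

print(parts.path)
print(pairs)
print(rebuilt_query)
```

Everything after the ? is the query string, and each name=value pair between & characters is one parameter.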

Request Baidu directly with the requests package:

>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä
···

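r.text returns the body decoded into a Unicode string, using the encoding that requests inferred for the response. If that guess is wrong, the text comes out garbled, as above. Set r.encoding = 'utf8' to force the right encoding, then read r.text again: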

>>> r.encoding='utf8'
>>> r.text
'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head>
···

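Note: response.content returns the raw bytes, while response.text returns the decoded Unicode string. In general, content is used for images, video and other binary data, and text for textual data.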
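
The difference between bytes and decoded text can be reproduced without any network access. The sketch below assumes the garbling shown earlier comes from UTF-8 bytes being decoded as ISO-8859-1, which is the fallback requests documents for text responses that declare no charset; the variable names are illustrative only.

```python
title = "百度一下，你就知道"

# The bytes a server would send: this plays the role of r.content
raw = title.encode("utf-8")

# Decoding with the wrong encoding gives the garbled first r.text
garbled = raw.decode("iso-8859-1")

# Decoding with the right encoding matches r.text after r.encoding = 'utf8'
fixed = raw.decode("utf-8")

print(repr(garbled[:3]))
print(fixed)
```

The bytes never change; only the encoding used to interpret them does, which is why setting r.encoding is enough to repair r.text.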

Request http://httpbin.org/get:

>>> r = requests.get('http://httpbin.org/get')
>>> print(r.text)
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "origin": "xxx.xxx.xxx.xxx",
  "url": "http://httpbin.org/get"
}

With parameters in the URL:

>>> r = requests.get('http://httpbin.org/get?a=2&c=3&w=')
>>> print(r.text)
{
  "args": {
    "a": "2",
    "c": "3",
    "w": ""
  },
···

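As you can see, the server received the parameters and echoed them back.

You can also pass the parameters as a dict through the params argument:

>>> parameter = {'a': '23', 'b': '32', 'c': 'string'}
>>> r = requests.get('http://httpbin.org/get',params=parameter)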
>>> print(r.text)
{
  "args": {
    "a": "23",
    "b": "32",
    "c": "string"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "origin": "xxx.xxx.xxx.xxx",
  "url": "http://httpbin.org/get?a=23&b=32&c=string"
}
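
To see what params does to the URL without sending anything, one option is to build a requests.Request and call prepare() on it. This is only a sketch: the "q" parameter and its value are made up for illustration, and the httpbin URL is reused from the example above.

```python
import requests
from urllib.parse import urlsplit, parse_qs

parameter = {"a": "23", "b": "32", "c": "string"}

# prepare() assembles the URL, headers and body locally; no request is sent
prepared = requests.Request("GET", "http://httpbin.org/get", params=parameter).prepare()
print(prepared.url)

# Values containing spaces or non-ASCII characters are encoded for you
# (the "q" parameter is a hypothetical example, not from the original post)
prepared_q = requests.Request(
    "GET", "http://httpbin.org/get", params={"q": "hello world 百度"}
).prepare()
print(prepared_q.url)

# Decoding the query string recovers the original value
decoded = parse_qs(urlsplit(prepared_q.url).query)
print(decoded)
```

Passing a dict is usually less error-prone than concatenating the query string by hand, because the escaping is handled by the library.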

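You can call the json() method to parse a JSON body directly into a Python object (the same result as passing r.text to json.loads). Note that r.json without parentheses is only the method object: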

>>> r.json
<bound method Response.json of <Response [200]>>
>>> r.json()
{'args': {'a': '23', 'b': '32', 'c': 'string'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.18.4'}, 'origin': '115.153.174.11', 'url': 'http://httpbin.org/get?a=23&b=32&c=string'}
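
For comparison, the json.loads route looks roughly like this. The body string is a shortened stand-in for r.text from the request above, written out by hand so the sketch runs offline.

```python
import json

# Shortened stand-in for r.text; a real response carries more fields
body = (
    '{"args": {"a": "23", "b": "32", "c": "string"}, '
    '"url": "http://httpbin.org/get?a=23&b=32&c=string"}'
)

# Parse the JSON text into ordinary Python dicts and strings
data = json.loads(body)

print(type(data).__name__)
print(data["args"]["c"])
```

Either way the result is a plain dict, so individual fields can be read with normal indexing.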

Some sites refuse requests that do not carry browser-like headers. Zhihu is one example, so pass headers:

>>> headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5221.400 QQBrowser/10.0.1125.400'}
>>> r = requests.get("http://www.zhihu.com")
>>> print(r)
<Response [400]>
>>> r = requests.get("http://www.zhihu.com",headers=headers)
>>> r
<Response [200]>
>>>

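r.headers shows the response headers returned by the server (the request headers that were actually sent are available as r.request.headers):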

>>> r.headers
{'Date': 'Tue, 29 Jan 2019 09:05:50 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Content-Security-Policy': "default-src * blob:; img-src * data: blob:; connect-src * wss: blob:; frame-src 'self' *.zhihu.com weixin: *.vzuu.com getpocket.com note.youdao.com safari-extension://com.evernote.safari.clipper-Q79WDW8YH9 zhihujs: captcha.guard.qcloud.com; script-src 'self' blob: *.zhihu.com res.wx.qq.com 'unsafe-eval' unpkg.zhimg.com unicom.zhimg.com captcha.gtimg.com captcha.guard.qcloud.com pagead2.googlesyndication.com i.hao61.net 'nonce-a0691bfd-cf49-40b4-8ad7-37d552f59c50'; style-src 'self' 'unsafe-inline' *.zhihu.com unicom.zhimg.com captcha.gtimg.com", 'X-Frame-Options': 'SAMEORIGIN', 'Strict-Transport-Security': 'max-age=15552000; includeSubDomains', 'Surrogate-Control': 'no-store', 'Cache-Control': 'no-store, no-cache, must-revalidate, proxy-revalidate', 'Pragma': 'no-cache', 'Expires': '0', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Encoding': 'gzip', 'Server': 'ZWS'}
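
A likely reason the bare request is rejected is the default User-Agent, which identifies the client as python-requests. The sketch below inspects the request headers offline by preparing the request through a Session; the shortened User-agent string is an illustrative value, not the full one used above.

```python
import requests

url = "http://www.zhihu.com"
headers = {"User-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}  # shortened, illustrative

# The User-Agent that requests sends when no headers are passed
default_ua = requests.utils.default_user_agent()

# prepare_request() merges the session's default headers with ours; nothing is sent
session = requests.Session()
prepared = session.prepare_request(requests.Request("GET", url, headers=headers))

print(default_ua)
print(prepared.headers["User-Agent"])
print(sorted(prepared.headers))
```

Header names are matched case-insensitively, so 'User-agent' in the dict replaces the default 'User-Agent' rather than adding a second header.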

The post method

The simplest form is again:

>>> r = requests.post("http://httpbin.org/post")
>>> print(r.text)
{
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "0",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "json": null,
  "origin": "xxx.xxx.xxx.xxx",
  "url": "http://httpbin.org/post"
}

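Passing parameters (with params they still travel in the URL query string, which is why httpbin reports them under "args"):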

>>> r = requests.post("http://httpbin.org/post",params=parameter)
>>> print(r.text)
{
  "args": {
    "a": "23",
    "b": "32",
    "c": "string"
  },
···
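
To send the values in the request body instead, which is what a login form or an upload typically does, requests also accepts data (form-encoded) and json arguments. The sketch below compares the three by preparing the requests locally without sending them; it reuses the parameter dict from the examples above.

```python
import json
import requests

url = "http://httpbin.org/post"
parameter = {"a": "23", "b": "32", "c": "string"}

# prepare() builds each request locally; nothing is sent
as_query = requests.Request("POST", url, params=parameter).prepare()
as_form = requests.Request("POST", url, data=parameter).prepare()
as_json = requests.Request("POST", url, json=parameter).prepare()

# params: values go into the URL, the body stays empty
print(as_query.url, as_query.body)

# data: URL unchanged, values form-encoded in the body
print(as_form.url, as_form.body, as_form.headers["Content-Type"])

# json: URL unchanged, values serialized as JSON in the body
print(as_json.url, as_json.headers["Content-Type"], json.loads(as_json.body))
```

When sent to httpbin, form fields are expected to show up under "form" and a JSON body under "json", rather than under "args".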

Headers are passed the same way as with the get method.