The requests module
Simply put, the requests module simulates the process of a browser sending a request.
How a browser sends a request:
1. Specify the URL
2. Send the request
3. Receive the response data
Sending requests with the requests module

response = requests.get() takes these common parameters:
1. url: the address to send the request to
2. params: a dict (or other iterable of key/value pairs) of query parameters; it is encoded and appended to the url above
3. headers: the request headers
4. proxies: for routing the request through a proxy pool

response = requests.post() accepts the same parameters as get(), plus:
1. data: form data carried in the request body
2. json: data in JSON format; when the server expects JSON, use this parameter, because if you send it via data the server cannot read the values
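The way params is merged into the URL can be checked without touching the network: requests lets you build a request object by hand and prepare it, which applies the same encoding that requests.get() would.

```python
import requests

# Build a request without sending it; prepare() applies the same
# URL/params encoding that requests.get() uses internally.
req = requests.Request('GET', 'https://www.baidu.com/s',
                       params={'wd': 'python'})
prepared = req.prepare()
print(prepared.url)  # → https://www.baidu.com/s?wd=python
```

The params dict is percent-encoded and appended to the url after a `?`.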
Methods and attributes of the response object

```python
response.text            # response body decoded as a string
response.content         # response body as raw bytes
response.json()          # parse the body as JSON
response.iter_content()  # iterate over the body in chunks, e.g.
                         # for chunk in response.iter_content(): f.write(chunk)
response.encoding        # encoding used to decode .text
response.headers         # response headers
response.cookies         # cookies; response.cookies.get_dict(), response.cookies.items()
response.url             # final URL of the response
response.history         # list of redirect responses that led here
```
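These attributes can be exercised without a live server by filling in a Response object by hand. Setting the private `_content` field is a demonstration-only trick, not something production code should do; normally requests populates the object for you.

```python
from requests.models import Response

# Build a fake Response offline; _content is private and is set here
# only so the example runs without a network connection.
r = Response()
r.status_code = 200
r.encoding = 'utf-8'
r._content = '{"msg": "你好"}'.encode('utf-8')

print(r.content)  # raw bytes of the body
print(r.text)     # bytes decoded using r.encoding
print(r.json())   # body parsed as JSON → {'msg': '你好'}
```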
The session object

We can use a session object to handle cookies automatically, so we do not need to fetch and attach them by hand:

```python
session = requests.Session()
# then send requests through the session object
session.get(url)
session.post(url, data=data)
```
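The cookie handling can be seen offline: a Session keeps a cookie jar that it sends with every request, and the jar supports the same get_dict()/items() accessors as response.cookies. Normally the server fills the jar via Set-Cookie headers; here we plant a cookie by hand just to inspect it.

```python
import requests

session = requests.Session()
# Plant a cookie manually (a server would normally do this via
# Set-Cookie); later session.get()/post() calls would send it back.
session.cookies.set('token', 'abc123')

print(session.cookies.get_dict())  # → {'token': 'abc123'}
print(session.cookies.items())     # → [('token', 'abc123')]
```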
Anti-scraping measure 1: UA detection
User-Agent: identifies the browser and its version.
By checking the User-Agent in the request headers, the server can judge whether the request comes from a normal browser.
Countermeasure
Put a real User-Agent (captured with your browser's dev tools or a packet-capture tool) into the headers dict:

```python
headers = {
    'User-Agent': ''  # fill in a real browser UA string
}
```
Case 1: scraping Baidu search results

```python
import requests

kw = input('>>>')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
info = {
    'wd': kw
}
response = requests.get('https://www.baidu.com/s', params=info, headers=headers)
page_text = response.text
with open(f'{kw}.html', 'w', encoding='utf-8') as f:
    f.write(page_text)
```
Case 2: Baidu Translate (using requests.post())

```python
import requests
import json

word = input('>>>')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
info = {
    'kw': word
}
response = requests.post('https://fanyi.baidu.com/sug', data=info, headers=headers)
info_json = response.json()
print(info_json)
with open(f'{word}.json', 'w', encoding='utf-8') as f:
    json.dump(info_json, f, ensure_ascii=False)
```
Case 3: scraping Douban movies

```python
import requests
import json

url = 'https://movie.douban.com/j/chart/top_list'
params = {
    'type': '25',
    'interval_id': '100:90',
    'action': '',
    'start': '0',
    'limit': '20',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
}
response = requests.get(url, params=params, headers=headers)
print(response.status_code)
info_json = response.json()
print(info_json)
with open('豆瓣电影动漫.json', 'w', encoding='utf-8') as f:
    json.dump(info_json, f, ensure_ascii=False)
```
Case 4: scraping cosmetics production licence information

```python
import requests
import json
from concurrent.futures import ThreadPoolExecutor

def get_data(url, params, method, is_json=True):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'
    }
    if method == 'get':
        response = requests.get(url, params=params, headers=headers)
    else:
        response = requests.post(url, data=params, headers=headers)
    print(response.status_code)
    if is_json:
        info_json = response.json()
        print(info_json)
        return info_json
    info = response.text
    print(info)
    return info

def get_sure_data(page):
    url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList'
    params = {
        'on': 'true',
        'page': f'{page}',
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
    }
    id_list = []
    json_data = get_data(url, params, 'post')
    for data_dict in json_data.get('list'):
        id_list.append(data_dict['ID'])
    url2 = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById'
    for id in id_list:
        params2 = {
            'id': id
        }
        json_data = get_data(url2, params2, 'post')
        with open('test.json', 'a', encoding='utf-8') as f:
            json.dump(json_data, f, ensure_ascii=False, indent=True)
    print('over')

if __name__ == '__main__':
    t = ThreadPoolExecutor(5)
    for i in range(355):
        # Pass the function and its argument separately; writing
        # t.submit(get_sure_data(i)) would run it in the main thread.
        t.submit(get_sure_data, i)
    t.shutdown()
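One pitfall from case 4 is worth isolating: submit() must receive the function and its arguments separately, as t.submit(fn, arg). Writing t.submit(fn(arg)) calls fn immediately in the main thread and submits only its return value, losing all parallelism. A minimal sketch with a stand-in function (square replaces the network call here):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # stand-in for a task like get_sure_data(page)
    return x * x

with ThreadPoolExecutor(max_workers=5) as pool:
    # submit(fn, arg): the pool calls fn(arg) on a worker thread
    futures = [pool.submit(square, i) for i in range(10)]
    results = [f.result() for f in futures]

print(results)  # → [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Using the executor as a context manager also replaces the explicit shutdown() call: the pool waits for all submitted tasks on exit.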