Using Scrapy Middleware
Published: 2019-05-24


Downloader middleware (MiddleproDownloaderMiddleware)

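  • Position: between the engine and the downloader
  • Role: intercepts every request and response in the whole project
  • Intercepting requests:
    • UA spoofing
    • IP proxies
  • Intercepting responses:
    • Tampering with the response data or the response object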
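The middleware only intercepts traffic once it is enabled in the project's settings.py. A minimal sketch, assuming the project package is called middlepro (a guess based on the class name; the article does not state it). The value 543 is the one Scrapy's project template uses in its commented-out example. Lower numbers sit closer to the engine and higher numbers closer to the downloader, so process_request() runs in increasing order and process_response() in decreasing order.

```python
# settings.py (sketch)
# "middlepro" is an assumed package name; replace it with your own project's.
DOWNLOADER_MIDDLEWARES = {
    "middlepro.middlewares.MiddleproDownloaderMiddleware": 543,
}
```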
[middlewares.py] The MiddleproDownloaderMiddleware class has 3 important methods
import random

from fake_useragent import UserAgent


class MiddleproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    USER_AGENT_LIST = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]

    PROXY_http = [
        '153.180.102.104:80',
        '195.208.131.189:56055'
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '95.189.112.214:35508'
    ]
  • process_request(): intercepts requests

    1. Use a UA pool (not recommended)

      def process_request(self, request, spider):
          """
          Intercept a request.
          :param request: the intercepted request
          :param spider: the spider that issued the request
          :return: None, so the request continues through the chain
          """
          # Called for each request that goes through the downloader
          # middleware.
          # Must either:
          # - return None: continue processing this request
          # - or return a Response object
          # - or return a Request object
          # - or raise IgnoreRequest: process_exception() methods of
          #   installed downloader middleware will be called

          # UA spoofing: pick a random User-Agent from the pool
          request.headers['User-Agent'] = random.choice(self.USER_AGENT_LIST)
          return None
    2. Use the fake-useragent module (recommended)

      Install the module: pip install fake-useragent

      def process_request(self, request, spider):
          """
          Intercept a request.
          :param request: the intercepted request
          :param spider: the spider that issued the request
          :return: None, so the request continues through the chain
          """
          # Called for each request that goes through the downloader
          # middleware.
          # Must either:
          # - return None: continue processing this request
          # - or return a Response object
          # - or return a Request object
          # - or raise IgnoreRequest: process_exception() methods of
          #   installed downloader middleware will be called

          # UA spoofing: let fake-useragent supply a random User-Agent
          request.headers['User-Agent'] = UserAgent().random
          return None
  • process_response(): intercepts all responses

    • Here, taking
  • process_exception(): intercepts requests that raised an exception

    • Proxy IPs

      PROXY_http = [
          '153.180.102.104:80',
          '195.208.131.189:56055'
      ]
      PROXY_https = [
          '120.83.49.90:9000',
          '95.189.112.214:35508'
      ]

      def process_exception(self, request, exception, spider):
          """
          Intercept a request that raised an exception.
          :param request: the request that failed
          :param exception: the exception that was raised
          :param spider: the spider that issued the request
          :return: the corrected request, so that it is sent again
          """
          # Called when a download handler or a process_request()
          # (from other downloader middleware) raises an exception.
          # Must either:
          # - return None: continue processing this exception
          # - return a Response object: stops process_exception() chain
          # - return a Request object: stops process_exception() chain

          # Proxy IP: choose a proxy that matches the URL scheme
          if request.url.split(':')[0] == 'http':
              request.meta['proxy'] = 'http://' + random.choice(self.PROXY_http)
          else:
              request.meta['proxy'] = 'https://' + random.choice(self.PROXY_https)
          # Return the corrected request object so that it is sent again
          return request
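The UA pool and the proxy fallback above can be combined in one class. The sketch below is an illustration under stated assumptions, not the article's code: the class name ProxyRetrySketchMiddleware, the constant MAX_PROXY_RETRIES and the meta key proxy_retries are all hypothetical. It adds two things the original process_exception() does not have. First, a retry cap, so a request whose proxies keep failing is eventually given up on. Second, it sets dont_filter on the returned request, because a request returned from a middleware is rescheduled and would otherwise pass through the scheduler's duplicate filter (Scrapy's built-in RetryMiddleware does the same on its retry copies). A SimpleNamespace stands in for scrapy.Request so the sketch runs without Scrapy installed.

```python
import random
from types import SimpleNamespace

MAX_PROXY_RETRIES = 3  # hypothetical cap, not from the original article


class ProxyRetrySketchMiddleware:
    """Hypothetical middleware: UA rotation plus capped proxy retries."""

    # Shortened pools for illustration; values copied from the lists above.
    USER_AGENT_LIST = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    ]
    PROXY_http = ["153.180.102.104:80", "195.208.131.189:56055"]
    PROXY_https = ["120.83.49.90:9000", "95.189.112.214:35508"]

    def process_request(self, request, spider):
        # UA spoofing: pick a random User-Agent for every outgoing request.
        request.headers["User-Agent"] = random.choice(self.USER_AGENT_LIST)
        return None

    def process_exception(self, request, exception, spider):
        # "proxy_retries" is a made-up meta key used only by this sketch.
        retries = request.meta.get("proxy_retries", 0)
        if retries >= MAX_PROXY_RETRIES:
            # Give up: let other middlewares handle the exception.
            return None
        request.meta["proxy_retries"] = retries + 1

        # Same scheme-based proxy choice as the article's code.
        if request.url.split(":")[0] == "http":
            request.meta["proxy"] = "http://" + random.choice(self.PROXY_http)
        else:
            request.meta["proxy"] = "https://" + random.choice(self.PROXY_https)

        # Keep the rescheduled request from being dropped as a duplicate.
        request.dont_filter = True
        return request


# Stand-in for scrapy.Request so the sketch runs without Scrapy installed.
fake_request = SimpleNamespace(
    url="https://example.com/", headers={}, meta={}, dont_filter=False
)
mw = ProxyRetrySketchMiddleware()
mw.process_request(fake_request, spider=None)
retried = mw.process_exception(fake_request, Exception("timeout"), spider=None)
print(retried.meta["proxy"], retried.meta["proxy_retries"])
```

Mutating the request in place keeps the sketch short. Copying it first, as RetryMiddleware does, is the more conservative choice if other components still hold a reference to the original object.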

Reposted from: http://qkiwi.baihongyu.com/
