Contents
- 11.1. Installing the scrapy development environment
  - 11.1.1. Mac
  - 11.1.2. Ubuntu
  - 11.1.3. Installing scrapy with pip
  - 11.1.4. Testing scrapy
- 11.2. scrapy commands
  - 11.2.1.
  - 11.2.2. Creating a new spider
  - 11.2.3. Listing available spiders
  - 11.2.4. Running a spider
- 11.3. Scrapy Shell
  - 11.3.1. response
    - 11.3.1.1. Current URL
    - 11.3.1.2. status: HTTP status
    - 11.3.1.3. text: body
    - 11.3.1.4. css
      - 11.3.1.4.1. Getting HTML attributes
    - 11.3.1.5. xpath
    - 11.3.1.6. headers
- 11.4. Crawler project
  - 11.4.1. Creating a project
  - 11.4.2. Spider
    - 11.4.2.1. Pagination
    - 11.4.2.2. Saving scraped content to a file
  - 11.4.3. settings.py: crawler configuration file
    - 11.4.3.1. Ignoring robots.txt rules
  - 11.4.4. Item
  - 11.4.5. Pipeline
- 11.5. Downloading images
  - 11.5.1. Configuring settings.py
  - 11.5.2. Modifying pipelines.py
  - 11.5.3. Editing items.py
  - 11.5.4. Spider file
- 11.6. xpath
  - 11.6.1. Logical operators
    - 11.6.1.1. and
    - 11.6.1.2. or
  - 11.6.2. function
    - 11.6.2.1. text()
    - 11.6.2.2. contains()
https://scrapy.org
11.1. Installing the scrapy development environment
11.1.1. Mac
neo@MacBook-Pro ~ % brew install python3
neo@MacBook-Pro ~ % pip3 install scrapy
11.1.2. Ubuntu
Search for the scrapy packages. The scrapy releases covered here support both Python 2.7 and Python 3; we only need the Python 3 version.
neo@netkiller ~ % apt-cache search scrapy | grep python3
python3-scrapy - Python web scraping and crawling framework (Python 3)
python3-scrapy-djangoitem - Scrapy extension to write scraped items using Django models (Python3 version)
python3-w3lib - Collection of web-related functions (Python 3)
On Ubuntu 17.04 the packaged scrapy version is 1.3.0-1. If you need the latest release, 1.4.0, install it with pip instead.
neo@netkiller ~ % apt search python3-scrapy
Sorting... Done
Full Text Search... Done
python3-scrapy/zesty,zesty 1.3.0-1~exp2 all
  Python web scraping and crawling framework (Python 3)

python3-scrapy-djangoitem/zesty,zesty 1.1.1-1 all
  Scrapy extension to write scraped items using Django models (Python3 version)
Install scrapy
neo@netkiller ~ % sudo apt install python3-scrapy
[sudo] password for neo:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  ipython3 libmysqlclient20 libwebpmux2 mysql-common python-pexpect python-ptyprocess python3-attr python3-boto python3-bs4
  python3-cffi-backend python3-click python3-colorama python3-constantly python3-cryptography python3-cssselect python3-decorator
  python3-html5lib python3-idna python3-incremental python3-ipython python3-ipython-genutils python3-libxml2 python3-lxml
  python3-mysqldb python3-openssl python3-pam python3-parsel python3-pexpect python3-pickleshare python3-pil
  python3-prompt-toolkit python3-ptyprocess python3-pyasn1 python3-pyasn1-modules python3-pydispatch python3-pygments
  python3-queuelib python3-serial python3-service-identity python3-setuptools python3-simplegeneric python3-traitlets
  python3-twisted python3-twisted-bin python3-w3lib python3-wcwidth python3-webencodings python3-zope.interface
Suggested packages:
  python-pexpect-doc python-attr-doc python-cryptography-doc python3-cryptography-vectors python3-genshi python3-lxml-dbg
  python-lxml-doc default-mysql-server | virtual-mysql-server python-egenix-mxdatetime python3-mysqldb-dbg python-openssl-doc
  python3-openssl-dbg python3-pam-dbg python-pil-doc python3-pil-dbg doc-base python-pydispatch-doc ttf-bitstream-vera
  python-scrapy-doc python3-wxgtk3.0 | python3-wxgtk python-setuptools-doc python3-tk python3-gtk2 python3-glade2 python3-qt4
  python3-wxgtk2.8 python3-twisted-bin-dbg
The following NEW packages will be installed:
  ipython3 libmysqlclient20 libwebpmux2 mysql-common python-pexpect python-ptyprocess python3-attr python3-boto python3-bs4
  python3-cffi-backend python3-click python3-colorama python3-constantly python3-cryptography python3-cssselect python3-decorator
  python3-html5lib python3-idna python3-incremental python3-ipython python3-ipython-genutils python3-libxml2 python3-lxml
  python3-mysqldb python3-openssl python3-pam python3-parsel python3-pexpect python3-pickleshare python3-pil
  python3-prompt-toolkit python3-ptyprocess python3-pyasn1 python3-pyasn1-modules python3-pydispatch python3-pygments
  python3-queuelib python3-scrapy python3-serial python3-service-identity python3-setuptools python3-simplegeneric
  python3-traitlets python3-twisted python3-twisted-bin python3-w3lib python3-wcwidth python3-webencodings python3-zope.interface
0 upgraded, 49 newly installed, 0 to remove and 0 not upgraded.
Need to get 7,152 kB of archives.
After this operation, 40.8 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Type an uppercase “Y” and press Enter.
11.1.3. Installing scrapy with pip
neo@netkiller ~ % sudo apt install python3-pip
neo@netkiller ~ % pip3 install scrapy
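Whichever route was used (apt or pip), it can help to confirm which scrapy version the Python interpreter actually sees, since the test spider in 11.1.4 calls response.follow, which was introduced in Scrapy 1.4. The following is a minimal sketch using only the standard library (importlib.metadata, Python 3.8 or later). The helper names installed_version, version_tuple and meets_minimum are hypothetical, and the numeric parsing assumes plain release strings such as 1.3.0 or 1.4.0; it would not handle pre-release suffixes.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn a plain release string like '1.4.0' into (1, 4, 0)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(version, minimum):
    """True if `version` is at least `minimum` (both plain release strings)."""
    return version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    found = installed_version("scrapy")
    if found is None:
        print("scrapy is not installed for this interpreter")
    else:
        print("scrapy", found, "meets 1.4.0:", meets_minimum(found, "1.4.0"))
```

Comparing tuples of integers rather than raw strings avoids the trap where "1.10.0" would sort before "1.4.0" as text.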
11.1.4. Testing scrapy
Create a small test program to verify that the scrapy installation works.
$ cat > myspider.py <<EOF
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Yield one item per post title found on the page
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        # Follow the link to the previous posts and parse that page too
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)
EOF
Run the spider
$ scrapy runspider myspider.py
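Beyond watching the log scroll by, the scraped items can be exported to a file so the result is easier to inspect. The sketch below wraps the same runspider command from Python and adds scrapy's -o output option. It assumes the scrapy command is on the PATH; the helper names build_runspider_command and run_spider, and the file name items.json, are made up for illustration.

```python
import shutil
import subprocess


def build_runspider_command(spider_file, output_file=None):
    """Build the argument list for `scrapy runspider`."""
    command = ["scrapy", "runspider", spider_file]
    if output_file is not None:
        # -o asks scrapy to export the scraped items to this file
        command += ["-o", output_file]
    return command


def run_spider(spider_file, output_file=None):
    """Run the spider; return its exit code, or None if scrapy is missing."""
    if shutil.which("scrapy") is None:
        return None
    result = subprocess.run(build_runspider_command(spider_file, output_file))
    return result.returncode


if __name__ == "__main__":
    print(run_spider("myspider.py", "items.json"))
```

Keeping the command construction separate from the execution makes it possible to check the arguments without scrapy installed or network access.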
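Original source: Netkiller 系列 手札 (Netkiller series notes)
Author: 陈景峯
For reprints, please contact the author, and be sure to credit the original source and author and include this notice.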