Running scrapy on the command line with no arguments prints:

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

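The most commonly used commands are described below.

Create a project

scrapy startproject myproject [project_dir]

This creates a Scrapy project under the project_dir directory. If project_dir is not specified, the directory is given the same name as myproject.

Next, change into the new project directory:

cd project_dir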

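Create a spider

scrapy genspider mydomain mydomain.com

The command above creates a spider whose name is mydomain, with allowed_domains set to mydomain.com and start_urls pointing at that domain (for example http://mydomain.com/).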
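
To make the effect of genspider more concrete, here is a minimal sketch of how a spider file could be filled in from a template. The template text is an approximation written for illustration and is not a copy of the template bundled with Scrapy; render_spider and its class-naming rule are hypothetical.

```python
from string import Template

# Approximation of a "basic" spider template; the file shipped with
# Scrapy may differ between versions.
SPIDER_TEMPLATE = Template('''\
import scrapy


class ${classname}(scrapy.Spider):
    name = '${name}'
    allowed_domains = ['${domain}']
    start_urls = ['http://${domain}/']

    def parse(self, response):
        pass
''')


def render_spider(name: str, domain: str) -> str:
    """Return spider source code for the given spider name and domain."""
    # Hypothetical naming rule: capitalize the name and append "Spider".
    classname = name.capitalize() + "Spider"
    return SPIDER_TEMPLATE.substitute(classname=classname, name=name, domain=domain)


print(render_spider("mydomain", "mydomain.com"))
```

The point of the sketch is that both name and start_urls come from the two command-line arguments, so the generated spider always targets the domain passed to genspider.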

Run a spider

scrapy crawl <spider>

Check a spider

scrapy check [-l] <spider>

Example

C:\shu_item\tutorial>scrapy check quotes

----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK
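
"Ran 0 contracts" means that the spider's callbacks declare no contracts. Contracts are lines beginning with @ in a callback's docstring, such as @url, @returns and @scrapes. The sketch below is a simplified, standard-library-only illustration of pulling such lines out of a docstring. It is not how Scrapy implements the check; extract_contracts, the sample parse callback, and the field names are made up for the example.

```python
import inspect


def parse(response):
    """Parse a listing page (sample callback, for illustration only).

    @url http://www.example.com/
    @returns items 1 10
    @scrapes text author
    """


def extract_contracts(callback):
    """Return (name, args) pairs for every '@...' line in the docstring."""
    doc = inspect.getdoc(callback) or ""
    contracts = []
    for line in doc.splitlines():
        line = line.strip()
        if line.startswith("@"):
            # "@returns items 1 10" -> ("returns", ["items", "1", "10"])
            name, _, args = line[1:].partition(" ")
            contracts.append((name, args.split()))
    return contracts


for name, args in extract_contracts(parse):
    print(name, args)
```

A callback whose docstring has no @ lines yields an empty list here, which corresponds to the zero-contract report shown above.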

List spiders

scrapy list

Lists all available spiders in the current project, one spider per line.
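
Because the output is one spider name per line, it is easy to consume from a script. The following is a minimal sketch under that assumption; parse_spider_list and list_spiders are hypothetical helper names, and the sample spider names are invented.

```python
import subprocess


def parse_spider_list(output: str) -> list:
    """Split `scrapy list` output into spider names, skipping blank lines."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def list_spiders(project_dir: str) -> list:
    """Run `scrapy list` inside project_dir (requires Scrapy to be installed)."""
    result = subprocess.run(
        ["scrapy", "list"],
        cwd=project_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_spider_list(result.stdout)


# Offline demonstration with made-up output.
sample_output = "quotes\nmydomain\n"
print(parse_spider_list(sample_output))
```

list_spiders is defined but not called here, since it only works from within a real project directory.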

Open the shell

scrapy shell [url]

Options

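--spider=SPIDER: bypass spider autodetection and force the use of a specific spider
-c code: evaluate the code in the shell, print the result and exit
--no-redirect: do not follow HTTP 3xx redirects (the default is to follow them). This only affects the URL passed as an argument on the command line; once inside the shell, fetch(url) still follows HTTP redirects by default.

Example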

$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]

$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')

# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')

# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')
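
When these invocations are driven from a script, it can help to assemble the argument list programmatically. The sketch below builds the argv for scrapy shell from the options described above; build_shell_command is a hypothetical helper, not part of Scrapy.

```python
import shlex


def build_shell_command(url, code=None, spider=None, no_redirect=False, nolog=False):
    """Assemble the argv list for a `scrapy shell` invocation."""
    argv = ["scrapy", "shell"]
    if nolog:
        argv.append("--nolog")
    if no_redirect:
        argv.append("--no-redirect")
    if spider is not None:
        argv.append("--spider=" + spider)
    argv.append(url)
    if code is not None:
        argv += ["-c", code]
    return argv


argv = build_shell_command(
    "http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F",
    code="(response.status, response.url)",
    no_redirect=True,
    nolog=True,
)
# shlex.join (Python 3.8+) quotes the URL and the code snippet so the
# resulting line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv) avoids shell quoting altogether, which matters for URLs containing characters such as ? that a shell may treat specially.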

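view

scrapy view <url>

Opens the given URL in a browser, rendered as your Scrapy spider would see it. A spider sometimes receives a different page from the one a regular user sees, so this command is useful for checking what the spider actually gets and confirming that it is what you expect.

For example, viewing Taobao this way shows that much of its content is loaded via Ajax; the page source itself contains little more than a template.

scrapy view http://www.taobao.com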
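
The gap between what a browser shows and what a spider receives can be illustrated without any network access. The sketch below extracts the text present in a raw HTML document while ignoring script and style content; the page is invented for the example, and VisibleTextParser is a hypothetical helper built on the standard library.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text present in the raw HTML, ignoring <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


# Made-up page: the product list is an empty container that JavaScript
# would fill in later inside a real browser.
RAW_HTML = """
<html><body>
  <h1>Shop</h1>
  <div id="products"></div>
  <script>loadProducts("/api/products");</script>
</body></html>
"""

parser = VisibleTextParser()
parser.feed(RAW_HTML)
print(parser.chunks)
```

Only the heading text is found, because the products container is empty until JavaScript runs; a downloader that does not execute scripts sees the same thing.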

version

scrapy version [-v]

Prints the Scrapy version. With -v, it also prints Python, Twisted and platform information, which is useful for bug reports.

Example

λ scrapy version -v
Scrapy       : 1.5.0
lxml         : 4.2.1.0
libxml2      : 2.9.5
cssselect    : 1.0.3
parsel       : 1.4.0
w3lib        : 1.19.0
Twisted      : 17.9.0
Python       : 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)]
pyOpenSSL    : 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018)
cryptography : 2.2.2
Platform     : Windows-10-10.0.17134-SP0
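
If this information needs to go into a bug report or a log automatically, the "name : value" layout is simple to parse. The following is a minimal sketch assuming that layout; parse_version_output is a hypothetical helper, and the sample is a subset of the output shown above.

```python
def parse_version_output(text: str) -> dict:
    """Turn `scrapy version -v` output into a {component: version} dict."""
    versions = {}
    for line in text.splitlines():
        # Split on the first colon only; the Python line contains more colons.
        name, sep, value = line.partition(":")
        if sep and name.strip():
            versions[name.strip()] = value.strip()
    return versions


SAMPLE = """\
Scrapy       : 1.5.0
lxml         : 4.2.1.0
Twisted      : 17.9.0
Python       : 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)]
Platform     : Windows-10-10.0.17134-SP0
"""

info = parse_version_output(SAMPLE)
print(info["Scrapy"], info["Twisted"])
```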

Benchmark

scrapy bench

Runs a quick benchmark that measures crawl speed and overall performance on the current machine.

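settings

scrapy settings [options]

When run inside a project, this command prints the project's value for the requested setting; otherwise it prints Scrapy's default value for that setting.

Example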

C:\shu_item\tutorial>scrapy settings --get MONGO_URI
localhost
C:\shu_item\tutorial>scrapy settings --getbool=ROBOTSTXT_OBEY
False
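
The lookup behaviour described above (project values first, defaults as a fallback) can be modelled with a few lines of standard-library Python. This is a simplified model for illustration, not Scrapy's Settings class; the setting values and the get_bool helper are hypothetical.

```python
from collections import ChainMap

# Hypothetical values for illustration only.
DEFAULT_SETTINGS = {"ROBOTSTXT_OBEY": False, "DOWNLOAD_DELAY": 0}
PROJECT_SETTINGS = {"MONGO_URI": "localhost", "DOWNLOAD_DELAY": 2}

# Lookups hit the project settings first and fall back to the defaults.
settings = ChainMap(PROJECT_SETTINGS, DEFAULT_SETTINGS)


def get_bool(name: str, default: bool = False) -> bool:
    """Read a setting as a boolean, accepting 1/0 and 'True'/'False' style values."""
    value = settings.get(name, default)
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("true", "1"):
            return True
        if lowered in ("false", "0"):
            return False
        raise ValueError(f"{name} is not a boolean value: {value!r}")
    return bool(value)


print(settings["MONGO_URI"])        # defined only in the project
print(settings["DOWNLOAD_DELAY"])   # project value overrides the default
print(get_bool("ROBOTSTXT_OBEY"))   # falls back to the default
print(get_bool("NO_SUCH_SETTING"))  # unknown names yield the default
```

One consequence worth noting in this model: a misspelled setting name does not raise an error but quietly returns the default, so a False result is worth double-checking against the spelling of the name.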