
Scrapy

Quick start

Installation

shell
pip3 install scrapy

Create a new project

bash
scrapy startproject houser

Generated files

bash
houser/
├── houser
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

4 directories, 7 files

Generate a spider file

Set the start page (change into the spiders directory first, then generate the spider)

bash
scrapy genspider houser_spider www.cszjw.net

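Run the spider (you need to go back to the project root directory; the name must match the one passed to genspider)

bash
scrapy crawl houser_spider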

Run from a script

python
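from scrapy import cmdline

# execute() exits the process when the crawl finishes,
# so keep only one of the calls below active per run.
cmdline.execute('scrapy crawl code'.split())

# Export the scraped items in CSV format
# cmdline.execute('scrapy crawl code -o code.csv'.split())

# Stop after a limited number of scraped items
# cmdline.execute('scrapy crawl code -s CLOSESPIDER_ITEMCOUNT=10'.split())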
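
The calls above turn a command string into an argument list by splitting on spaces, which breaks as soon as a file name or setting value contains a space. As a minimal sketch, the list can be assembled directly instead. The helper name `build_crawl_argv` and its parameters are hypothetical and not part of Scrapy; only the `-o` and `-s` options come from the commands above.

```python
def build_crawl_argv(spider_name, output=None, settings=None):
    """Return the argv list for `scrapy crawl`, built without splitting a string."""
    argv = ["scrapy", "crawl", spider_name]
    if output:
        # -o FILE: export the scraped items to FILE
        argv += ["-o", output]
    for key, value in (settings or {}).items():
        # -s NAME=VALUE: override a single setting for this run
        argv += ["-s", f"{key}={value}"]
    return argv


argv = build_crawl_argv(
    "code",
    output="code.csv",
    settings={"CLOSESPIDER_ITEMCOUNT": 10},
)
print(argv)

# Inside a Scrapy project, the list could then be handed over like this:
# from scrapy import cmdline
# cmdline.execute(argv)
```

Building the list explicitly keeps each option and its value in separate elements, so no quoting rules are involved.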

Settings

settings.py

python
# Do not fetch or honor robots.txt
ROBOTSTXT_OBEY = False

# Default headers sent with each request
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

# Enable the item pipeline; pipelines with lower numbers run first
ITEM_PIPELINES = {
    'citycode.pipelines.CitycodePipeline': 300,
}
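
The `ITEM_PIPELINES` entry above points at a class in `pipelines.py`. The notes do not show that class, so the following is only a sketch of what such a pipeline might look like. It assumes items are plain dicts, and the field names `name` and `code` are made up for illustration. The `process_item(self, item, spider)` method is the hook Scrapy calls for each scraped item.

```python
class CitycodePipeline:
    """Sketch of an item pipeline that trims whitespace from string fields."""

    def process_item(self, item, spider):
        # Called once per scraped item; the returned item is passed on
        # to the next pipeline listed in ITEM_PIPELINES.
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()
        return item


# Stand-alone check with a plain dict and no running spider.
# The field names below are hypothetical.
cleaned = CitycodePipeline().process_item(
    {"name": "  Changsha ", "code": 430100},
    spider=None,
)
print(cleaned)
```

Because the class has no Scrapy imports, it can be exercised on its own like this before wiring it into a crawl.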

Architecture diagram
