Tech Stack

A detailed guide to using Scrapy, covering installation, project creation, writing spiders, and example code:

1. Install Scrapy

bash

pip install scrapy

2. Create a Scrapy project

bash

scrapy startproject myproject
cd myproject

3. Generate a spider template

Create a spider named quotes (target site: http://quotes.toscrape.com):

bash

scrapy genspider quotes quotes.toscrape.com

4. Write the spider logic

Edit myproject/spiders/quotes.py:

python

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # Extract the text, author, and tags of each quote on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall()
            }

        # Pagination: follow the "next" link if the page has one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
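The href extracted for the next page is usually relative, and response.follow resolves it against the URL of the current page before scheduling the request. As a rough illustration of that resolution step only, the sketch below uses the standard library's urljoin. The href values are assumed examples, not output captured from the site.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (an assumed example).
current_url = "http://quotes.toscrape.com/page/1/"

# Hypothetical href values that a "next" link might contain.
candidate_hrefs = ["/page/2/", "page/2/", "http://quotes.toscrape.com/page/2/"]

for href in candidate_hrefs:
    # Root-relative, path-relative, and absolute hrefs resolve differently.
    print(href, "->", urljoin(current_url, href))

next_url = urljoin(current_url, "/page/2/")
```

Because the resolution happens inside response.follow, the spider can pass the raw href without building an absolute URL by hand.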

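5. Run the spider

bash

# Write output to a JSON file (-o appends if the file already exists)
scrapy crawl quotes -o quotes.json

# Or print the scraped items to the terminal
scrapy crawl quotes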

6. Configure a pipeline (process data)

Edit myproject/pipelines.py:

python

class MyprojectPipeline:
    def process_item(self, item, spider):
        # Custom data processing (e.g. saving to a database)
        return item

Enable the pipeline in settings.py:

python

ITEM_PIPELINES = {
    'myproject.pipelines.MyprojectPipeline': 300,
}
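To make the "save to a database" idea concrete, here is a minimal sketch of a pipeline that writes each quote to SQLite using only the standard library. The class name SQLitePipeline, the quotes.db default path, and the table layout are assumptions chosen for illustration. The open_spider, close_spider, and process_item hooks are the ones Scrapy calls on a pipeline.

```python
import sqlite3


class SQLitePipeline:
    """Store each scraped quote in a SQLite table (illustrative sketch)."""

    def __init__(self, db_path="quotes.db"):
        # db_path is a hypothetical default; ":memory:" is handy for experiments.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: connect and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, tags TEXT)"
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: commit and release the connection.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Called for every item; return it so later pipelines still receive it.
        row = dict(item)
        self.conn.execute(
            "INSERT INTO quotes (text, author, tags) VALUES (?, ?, ?)",
            (row.get("text"), row.get("author"), ",".join(row.get("tags") or [])),
        )
        return item
```

If this class lived in myproject/pipelines.py, it would be registered in ITEM_PIPELINES under the key 'myproject.pipelines.SQLitePipeline', in the same way as MyprojectPipeline.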

7. Configure middleware (e.g. User-Agent)

Edit settings.py (the built-in UserAgentMiddleware reads the USER_AGENT setting):

python

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
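To vary the User-Agent per request instead of fixing one value globally, a small downloader middleware can set the header itself. The sketch below is one possible shape, not a built-in Scrapy class. The name RandomUserAgentMiddleware and the two strings in USER_AGENTS are hypothetical placeholders.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware that picks a User-Agent per request (sketch)."""

    # Hypothetical pool; replace with strings appropriate for your use case.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    def process_request(self, request, spider):
        # Overwrite the header before the request is downloaded.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None tells Scrapy to keep processing this request normally.
        return None
```

Assuming the class is saved in myproject/middlewares.py, it would be enabled by adding 'myproject.middlewares.RandomUserAgentMiddleware' to DOWNLOADER_MIDDLEWARES in settings.py with a priority of your choosing.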

8. Define structured data with Item

Edit items.py:

python

import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

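Use it in the spider:

python

from myproject.items import QuoteItem

# Inside parse(), within the loop over response.css('div.quote'):
item = QuoteItem()
item['text'] = quote.css('span.text::text').get()
item['author'] = quote.css('small.author::text').get()
item['tags'] = quote.css('div.tags a.tag::text').getall()
yield item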
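Scrapy 2.2 and later also accept plain dataclass objects as items, which may be convenient when type hints and default values are wanted. The sketch below mirrors the three fields of QuoteItem. The name QuoteRecord and the sample values are made up for illustration.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class QuoteRecord:
    # Same three fields as QuoteItem, expressed with type hints.
    text: str
    author: str
    tags: list = field(default_factory=list)  # each instance gets its own list


# Placeholder values, not real scraped data.
record = QuoteRecord(text="An example quote.", author="Example Author", tags=["demo"])
print(asdict(record))
```

A spider could yield such an object in place of a dict or a scrapy.Item instance.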

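9. Handle dynamic pages (Splash integration)

Install the scrapy-splash plugin (the Splash rendering service itself runs separately and must be reachable at the SPLASH_URL configured below):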

bash

pip install scrapy-splash

Configure settings.py:

python

SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashMiddleware': 725,
}

Use it in the spider:

python

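# Meta-based requests default to the render.json endpoint; render.html returns the rendered HTML
yield scrapy.Request(url, self.parse, meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 2}}})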

10. Common commands

Function	Command
Create a project	`scrapy startproject myproject`
Generate a spider	`scrapy genspider spider_name domain`
Run a spider	`scrapy crawl spider_name`
Export data (JSON/CSV)	`scrapy crawl spider_name -o data.json`

Example scenario: scraping e-commerce products

python

import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'link': product.css('a::attr(href)').get()
            }
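The '.price::text' selector returns raw text, so a cleaning step is often useful before the value is stored or compared. The helper below is a sketch under the assumption that prices look like "$1,299.00", with a dot as the decimal separator. The name parse_price is hypothetical.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(raw):
    """Convert text such as '$1,299.00' to a Decimal; return None if unparseable."""
    if raw is None:
        return None
    # Keep digits and the decimal point; drop currency symbols, commas, and spaces.
    cleaned = re.sub(r"[^\d.]", "", raw)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Raised for empty or malformed strings such as "" or "1.2.3".
        return None
```

In the spider, the 'price' entry could then be written as parse_price(product.css('.price::text').get()). Sites that use a comma as the decimal separator would need different cleaning rules.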

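Common issues

Blocked by anti-scraping measures:
- Use a random User-Agent (e.g. scrapy-fake-useragent).
- Set a download delay: DOWNLOAD_DELAY = 2.
Data storage:
- Export to a file with feed exports (for example, scrapy crawl spider_name -o data.json), or save to a database (e.g. MongoDB, MySQL) through a pipeline.
Dynamically loaded content:
- Integrate Splash or Selenium to render JavaScript.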
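The download-delay advice can be combined with a few related Scrapy settings. The fragment below is a sketch for settings.py. The setting names are standard Scrapy settings, while the values are illustrative assumptions to tune per site.

```python
# settings.py (fragment): throttling options, values are illustrative
DOWNLOAD_DELAY = 2                  # base delay in seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # vary the delay between 0.5x and 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap parallel requests per domain
AUTOTHROTTLE_ENABLED = True         # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 1        # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10         # upper bound on the delay in seconds
```

Lower concurrency and longer delays slow the crawl but reduce the load placed on the target site.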

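With the steps above, you can quickly build an efficient, structured crawler for data collection, competitor analysis, SEO monitoring, and similar use cases.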
