Scrapy 2.14 documentation

Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Getting help

Having trouble? We'd like to help!

First steps

Scrapy at a glance

Understand what Scrapy is and how it can help you.

Installation guide

Get Scrapy installed on your computer.

Scrapy Tutorial

Write your first Scrapy project.

Examples

Learn more by playing with a pre-made Scrapy project.

Basic concepts

Command line tool

Learn about the command-line tool used to manage your Scrapy project.

Spiders

Write the rules to crawl your websites.

Selectors

Extract the data from web pages using XPath.

Scrapy shell

Test your extraction code in an interactive environment.

Items

Define the data you want to scrape.

Item Loaders

Populate your items with the extracted data.

Item Pipeline

Post-process and store your scraped data.

Feed exports

Output your scraped data using different formats and storages.

Requests and Responses

Understand the classes used to represent HTTP requests and responses.

Link Extractors

Convenient classes to extract links to follow from pages.

Settings

Learn how to configure Scrapy and see all available settings.

Exceptions

See all available exceptions and their meaning.

Built-in services

Logging

Learn how to use Python's built-in logging on Scrapy.

Stats Collection

Collect statistics about your scraping crawler.

Sending e-mail

Send email notifications when certain events occur.

Telnet Console

Inspect a running crawler using a built-in Python console.

Solving specific problems

Frequently Asked Questions

Get answers to most frequently asked questions.

Debugging Spiders

Learn how to debug common problems of your Scrapy spider.

Spiders Contracts

Learn how to use contracts for testing your spiders.

Common Practices

Get familiar with some Scrapy common practices.

Broad Crawls

Tune Scrapy for crawling a lot of domains in parallel.

Using your browser's Developer Tools for scraping

Learn how to scrape with your browser's developer tools.

Selecting dynamically-loaded content

Read webpage data that is loaded dynamically.

Debugging memory leaks

Learn how to find and get rid of memory leaks in your crawler.

Downloading and processing files and images

Download files and/or images associated with your scraped items.

Deploying Spiders

Deploy your Scrapy spiders and run them on a remote server.

AutoThrottle extension

Adjust crawl rate dynamically based on load.

Benchmarking

Check how Scrapy performs on your hardware.

Jobs: pausing and resuming crawls

Learn how to pause and resume crawls for large spiders.

Coroutines

Use the coroutine syntax.

asyncio

Use asyncio and asyncio-powered libraries.

Extending Scrapy

Architecture overview

Understand the Scrapy architecture.

Add-ons

Enable and configure third-party extensions.

Downloader Middleware

Customize how pages get requested and downloaded.

Spider Middleware

Customize the input and output of your spiders.

Extensions

Extend Scrapy with your custom functionality.

Signals

See all available signals and how to work with them.

Scheduler

Understand the scheduler component.

Item Exporters

Quickly export your scraped items to a file (XML, CSV, etc).

Download handlers

Customize how requests are downloaded or add support for new URL schemes.

Components

Learn the common API and some good practices when building custom Scrapy components.

Core API

Use it on extensions and middlewares to extend Scrapy functionality.

All the rest

Release notes

See what has changed in recent Scrapy versions.

Contributing to Scrapy

Learn how to contribute to the Scrapy project.

Versioning and API stability

Understand Scrapy versioning and API stability.