Download handlers

Download handlers are Scrapy components used to download requests and produce responses from them.

Using download handlers

The DOWNLOAD_HANDLERS_BASE and DOWNLOAD_HANDLERS settings tell Scrapy which handler is responsible for a given URL scheme. Their values are merged into a mapping from scheme names to handler classes. When Scrapy initializes it creates instances of all configured download handlers (except for lazy ones) and stores them in a similar mapping. When Scrapy needs to download a request it extracts the scheme from its URL, finds the handler for this scheme, passes the request to it and gets a response from it. If there is no handler for the scheme, the request is not downloaded and a NotSupported exception is raised.
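The lookup described above can be pictured with a small standalone sketch. It does not use Scrapy: NotSupported here is a local stand-in for the exception named above, and EchoHandler and dispatch are hypothetical names introduced only for illustration.

```python
from urllib.parse import urlparse


class NotSupported(Exception):
    """Local stand-in for the exception named above; not Scrapy's own class."""


class EchoHandler:
    """Hypothetical handler that returns a string instead of a real response."""

    def __init__(self, label):
        self.label = label

    def download_request(self, url):
        return f"{self.label} handled {url}"


def dispatch(handlers, url):
    """Find the handler registered for the URL scheme and delegate to it."""
    scheme = urlparse(url).scheme
    handler = handlers.get(scheme)
    if handler is None:
        raise NotSupported(f"Unsupported URL scheme {scheme!r}")
    return handler.download_request(url)


# One instance per scheme, stored in a scheme-to-handler mapping.
handlers = {"http": EchoHandler("http"), "https": EchoHandler("https")}
print(dispatch(handlers, "https://example.com/"))
```

A URL whose scheme has no entry in the mapping, such as ftp://example.com/file in this sketch, raises the stand-in NotSupported instead of being downloaded.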

The DOWNLOAD_HANDLERS_BASE setting contains the default mapping of handlers. You can use the DOWNLOAD_HANDLERS setting to add handlers for additional schemes and to replace or disable default ones:

DOWNLOAD_HANDLERS = {
    # disable support for ftp:// requests
    "ftp": None,
    # replace the default one for http://
    "http": "my.download_handlers.HttpHandler",
    # http:// and https:// are different schemes,
    # even though they may use the same handler
    "https": "my.download_handlers.HttpHandler",
    # support for any custom scheme can be added
    "sftp": "my.download_handlers.SftpHandler",
}
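As a rough illustration of how the two settings combine, the sketch below merges an abbreviated base mapping with the overrides from the example above and drops entries set to None. The merge_handlers function is hypothetical, and the base mapping is an assumed, shortened stand-in for the real DOWNLOAD_HANDLERS_BASE value.

```python
def merge_handlers(base, overrides):
    """Overlay user overrides on the base mapping; None disables a scheme."""
    merged = {**base, **overrides}
    return {scheme: path for scheme, path in merged.items() if path is not None}


# Abbreviated, assumed base mapping; the real default contains more schemes.
base = {
    "data": "scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler",
    "ftp": "scrapy.core.downloader.handlers.ftp.FTPDownloadHandler",
    "http": "scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler",
}
overrides = {
    "ftp": None,
    "http": "my.download_handlers.HttpHandler",
    "sftp": "my.download_handlers.SftpHandler",
}

effective = merge_handlers(base, overrides)
for scheme, path in sorted(effective.items()):
    print(scheme, "->", path)
```

In this sketch the ftp entry disappears, http points to the replacement class, sftp is added, and data keeps its base handler.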

Replacing HTTP(S) download handlers

While Scrapy provides a default handler for the http and https schemes, users may want to use a different handler, provided by Scrapy or by a third-party package. Keep the following considerations in mind when doing so.

First of all, as http and https are separate schemes, they need separate entries in the DOWNLOAD_HANDLERS setting, even though it’s likely that the same handler class will be used for both schemes.
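One way to keep the two entries in sync is to build the mapping from a single handler path. This is only a convenience sketch; my.download_handlers.HttpHandler is the placeholder path used in the example earlier on this page.

```python
# Placeholder import path, as in the earlier example.
HTTP_HANDLER = "my.download_handlers.HttpHandler"

# Register the same handler class path for both schemes.
DOWNLOAD_HANDLERS = dict.fromkeys(("http", "https"), HTTP_HANDLER)
```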

Additionally, some of the Scrapy settings, like DOWNLOAD_MAXSIZE, are honored by the default HTTP(S) handler but not necessarily by alternative ones. The same may apply to other Scrapy features, e.g. the bytes_received and headers_received signals.

Lazy instantiation of download handlers

A download handler can be marked as “lazy” by setting its lazy class attribute to True. Such handlers are only instantiated when they need to download their first request. This may be useful when the instantiation is slow or requires dependencies that are not always available, and the handler is not needed on every spider run. For example, the built-in S3 handler is lazy.
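The difference between eager and lazy handlers can be sketched without Scrapy. HandlerRegistry, EagerHandler and SlowHandler are hypothetical names; only the lazy class attribute comes from the description above, and the real instantiation logic in Scrapy may differ.

```python
class EagerHandler:
    lazy = False


class SlowHandler:
    # Marked lazy: only built when its scheme is first requested.
    lazy = True


class HandlerRegistry:
    """Instantiate non-lazy handlers up front and lazy ones on first use."""

    def __init__(self, handler_classes):
        self._classes = dict(handler_classes)
        self._instances = {}
        for scheme, cls in self._classes.items():
            if not getattr(cls, "lazy", False):
                self._instances[scheme] = cls()

    def is_instantiated(self, scheme):
        return scheme in self._instances

    def get(self, scheme):
        # Create the handler on demand if it was skipped at start-up.
        if scheme not in self._instances:
            self._instances[scheme] = self._classes[scheme]()
        return self._instances[scheme]


registry = HandlerRegistry({"http": EagerHandler, "s3": SlowHandler})
print(registry.is_instantiated("http"), registry.is_instantiated("s3"))
registry.get("s3")
print(registry.is_instantiated("s3"))
```

In this sketch the s3 handler does not exist until the first call to get("s3"), and later calls reuse the same instance.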

Writing your own download handler

A download handler is a component that defines the following API:

class SampleDownloadHandler

lazy: bool

If False, the handler will be instantiated when Scrapy is initialized.

If True, the handler will only be instantiated when the first request handled by it needs to be downloaded.

async download_request(request: Request) → Response: Download the given request and return a response.

async close() → None: Clean up any resources used by the handler.

An optional base class for custom handlers is provided:

class scrapy.core.downloader.handlers.base.BaseDownloadHandler(crawler: Crawler)

Optional base class for download handlers.

lazy: bool = False

classmethod from_crawler(crawler: Crawler) → Self

abstractmethod async download_request(request: Request) → Response

async close() → None

Exceptions raised by download handlers

Added in version 2.15.0.

The built-in download handlers raise Scrapy-specific exceptions instead of implementation-specific ones, so that code that handles these exceptions can be written in a generic way. We recommend that custom download handlers also use these exceptions.

exception scrapy.exceptions.CannotResolveHostError: Indicates that the provided hostname cannot be resolved.

exception scrapy.exceptions.DownloadCancelledError: Indicates that a request download was cancelled.

exception scrapy.exceptions.DownloadConnectionRefusedError: Indicates that a connection was refused by the server.

exception scrapy.exceptions.DownloadFailedError: Indicates that a request download has failed.

exception scrapy.exceptions.DownloadTimeoutError: Indicates that a request download has timed out.

exception scrapy.exceptions.ResponseDataLossError: Indicates that Scrapy couldn’t get a complete response.

exception scrapy.exceptions.UnsupportedURLSchemeError: Indicates that the URL scheme is not supported.
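A custom handler might translate the errors of its underlying client into these generic exceptions along the following lines. The exception classes in this sketch are local stand-ins that only borrow the names listed above (no class hierarchy is implied), to_generic_error is a hypothetical helper, and the choice of which low-level error maps to which exception is an assumption made for illustration.

```python
import socket


# Local stand-ins named after the exceptions listed above.
class DownloadFailedError(Exception):
    pass


class CannotResolveHostError(Exception):
    pass


class DownloadConnectionRefusedError(Exception):
    pass


class DownloadTimeoutError(Exception):
    pass


def to_generic_error(exc):
    """Map an implementation-specific error to a generic download error."""
    if isinstance(exc, socket.gaierror):
        return CannotResolveHostError(str(exc))
    if isinstance(exc, ConnectionRefusedError):
        return DownloadConnectionRefusedError(str(exc))
    if isinstance(exc, TimeoutError):
        return DownloadTimeoutError(str(exc))
    # Anything else is reported as a plain download failure.
    return DownloadFailedError(str(exc))


try:
    try:
        raise ConnectionRefusedError("connection refused by 127.0.0.1:9")
    except OSError as exc:
        # Chain the original error so that its traceback is preserved.
        raise to_generic_error(exc) from exc
except DownloadConnectionRefusedError as generic:
    print(type(generic).__name__, "caused by", type(generic.__cause__).__name__)
```

Code that consumes the handler can then catch the generic exceptions without knowing which networking library raised the original error.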

Built-in download handlers reference

DataURIDownloadHandler

class scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler(crawler: Crawler)

Supported scheme: data.

Lazy: no.

This handler supports RFC 2397 data:content/type;base64, data URIs.
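To show what such a URI carries, here is a minimal standard-library sketch of RFC 2397 parsing. The parse_data_uri function is written for this example and is not the parser the handler uses; it ignores media type parameters other than the trailing ;base64 marker.

```python
import base64
from urllib.parse import unquote_to_bytes


def parse_data_uri(uri):
    """Split a data: URI into its media type and decoded payload bytes."""
    scheme, _, rest = uri.partition(":")
    if scheme.lower() != "data":
        raise ValueError("not a data URI")
    header, comma, payload = rest.partition(",")
    if not comma:
        raise ValueError("data URI is missing the comma separator")
    is_base64 = header.endswith(";base64")
    if is_base64:
        header = header[: -len(";base64")]
    # RFC 2397 default when the media type is omitted.
    media_type = header or "text/plain;charset=US-ASCII"
    raw = unquote_to_bytes(payload)
    data = base64.b64decode(raw) if is_base64 else raw
    return media_type, data


print(parse_data_uri("data:text/plain;base64,SGVsbG8sIHdvcmxkIQ=="))
print(parse_data_uri("data:,A%20brief%20note"))
```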

FileDownloadHandler

class scrapy.core.downloader.handlers.file.FileDownloadHandler(crawler: Crawler)

Supported scheme: file.

Lazy: no.

This handler supports file:///path local file URIs. It doesn’t support remote files.
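The mapping from a file URI to a local path can be sketched with the standard library. read_file_uri is a hypothetical helper, not the handler's implementation; rejecting URIs that name a remote host is a choice made in this sketch to mirror the restriction described above.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def read_file_uri(uri):
    """Return the bytes of the local file that a file: URI points to."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError("not a file URI")
    if parsed.netloc not in ("", "localhost"):
        raise ValueError("remote files are not supported")
    # url2pathname undoes percent-encoding and platform-specific path syntax.
    return Path(url2pathname(parsed.path)).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample page.txt"
    path.write_bytes(b"local content")
    # Path.as_uri() produces a file:/// URI for an absolute path.
    content = read_file_uri(path.as_uri())

print(content)
```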

FTPDownloadHandler

class scrapy.core.downloader.handlers.ftp.FTPDownloadHandler(crawler: Crawler)

Supported scheme: ftp.

Lazy: no.

This handler supports ftp://host/path FTP URIs.

It’s implemented using twisted.protocols.ftp.

Note

This handler is not supported when TWISTED_REACTOR_ENABLED is False.

H2DownloadHandler

class scrapy.core.downloader.handlers.http2.H2DownloadHandler(crawler: Crawler)

Supported scheme: https.

Lazy: yes.

This handler supports https://host/path URLs and uses the HTTP/2 protocol for them.

It’s implemented using twisted.web.client and the h2 library.

For this handler to work you need to install the Twisted[http2] extra dependency.

If you want to use this handler you need to replace the default one for the https scheme:

DOWNLOAD_HANDLERS = {
    "https": "scrapy.core.downloader.handlers.http2.H2DownloadHandler",
}

Warning

This handler is experimental, and not yet recommended for production environments. Future Scrapy versions may introduce related changes without a deprecation period or warning.

Note

Known limitations of the HTTP/2 implementation in this handler include:

No support for proxies.
No support for HTTP/2 Cleartext (h2c), since no major browser supports unencrypted HTTP/2 (see the HTTP/2 FAQ).
No setting to specify a maximum frame size larger than the default value, 16384. Connections to servers that send a larger frame will fail.
No support for server pushes, which are ignored.
No support for the bytes_received and headers_received signals.

Note

This handler is not supported when TWISTED_REACTOR_ENABLED is False.

HTTP11DownloadHandler

class scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler(crawler: Crawler)

Supported schemes: http, https.

Lazy: no.

This handler supports http://host/path and https://host/path URLs and uses the HTTP/1.1 protocol for them.

It’s implemented using twisted.web.client.

Note

This handler is not supported when TWISTED_REACTOR_ENABLED is False.

HttpxDownloadHandler

class scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler(crawler: Crawler)

Supported schemes: http, https.

Lazy: no.

This handler supports http://host/path and https://host/path URLs and uses the HTTP/1.1 protocol for them.

It’s implemented using the httpx library and needs it to be installed.

If you want to use this handler you need to replace the default ones for the http and https schemes:

DOWNLOAD_HANDLERS = {
    "http": "scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler",
    "https": "scrapy.core.downloader.handlers._httpx.HttpxDownloadHandler",
}

Warning

This handler is experimental, and not yet recommended for production environments. Future Scrapy versions may introduce related changes without a deprecation period or warning, or even remove the handler altogether.

Note

As this handler is based on a different HTTP client implementation compared to HTTP11DownloadHandler, it’s expected that its behavior on some websites may be different. Additionally, these are the Scrapy features that are explicitly not supported when using it:

Per-request bind address support (the bindaddress meta key). The global DOWNLOAD_BIND_ADDRESS setting is supported but the port number, if specified, will be ignored.
The DOWNLOADER_CLIENT_TLS_METHOD setting.
Settings specific to the Twisted networking or HTTP implementation, like DNS_RESOLVER.
Using non-asyncio reactors (httpx requires asyncio).
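Regarding the last item, a project that runs a Twisted reactor and overrides the reactor choice would presumably need to point the TWISTED_REACTOR setting at the asyncio-based reactor. The fragment below is a settings sketch under that assumption and may be unnecessary if the asyncio reactor is already the default in your Scrapy version.

```python
# settings.py (fragment)
# Assumption: a Twisted reactor is in use and must be the asyncio-based one.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```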

S3DownloadHandler

class scrapy.core.downloader.handlers.s3.S3DownloadHandler(crawler: Crawler)

Supported scheme: s3.

Lazy: yes.

This handler supports s3://bucket/path S3 URIs.
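The structure of such a URI can be shown with a short standard-library sketch. split_s3_uri is a hypothetical helper and example-bucket is a made-up bucket name; the handler's own parsing may differ.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into its bucket name and object key."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an S3 URI")
    # The authority part is the bucket; the path without its leading
    # slash is the object key.
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_uri("s3://example-bucket/reports/2024/data.csv"))
```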

It’s implemented using the botocore library and needs it to be installed.