Scheduler
The scheduler component receives requests from the engine and stores them in persistent and/or non-persistent data structures. It later retrieves those requests and feeds them back to the engine when the engine asks for the next request to download.
Overriding the default scheduler
You can use your own custom scheduler class by supplying its full
Python path in the SCHEDULER setting.
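For example (the module path below is hypothetical — substitute the import path of your own class):

```python
# settings.py — replace the default scheduler with a custom class.
# "myproject.schedulers.MyScheduler" is a hypothetical path used for
# illustration only.
SCHEDULER = "myproject.schedulers.MyScheduler"
```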
Minimal scheduler interface
- class scrapy.core.scheduler.BaseScheduler[source]
The scheduler component is responsible for storing requests received from the engine, and feeding them back to the engine when it asks for them.
The original sources of said requests are:
- Spider: the start method, requests created for URLs in the start_urls attribute, and request callbacks
- Spider middleware: the process_spider_output and process_spider_exception methods
- Downloader middleware: the process_request, process_response and process_exception methods
The order in which the scheduler returns its stored requests (via the next_request method) plays a great part in determining the order in which those requests are downloaded. See Request order.

The methods defined in this class constitute the minimal interface that the Scrapy engine will interact with.
- close(reason: str) Deferred[None] | None[source]
Called when the spider is closed by the engine. It receives the reason why the crawl finished as argument, and it is useful for running cleanup code.
- Parameters:
reason (str) – a string which describes the reason why the spider was closed
- abstract enqueue_request(request: Request) bool[source]
Process a request received by the engine.
Return True if the request is stored correctly, False otherwise.

If False, the engine will fire a request_dropped signal, and will not make further attempts to schedule the request at a later time. For reference, the default Scrapy scheduler returns False when the request is rejected by the dupefilter.
- classmethod from_crawler(crawler: Crawler) Self[source]
Factory method which receives the current Crawler object as argument.
- abstract has_pending_requests() bool[source]
Return True if the scheduler has enqueued requests, False otherwise.
- abstract next_request() Request | None[source]
Return the next Request to be processed, or None to indicate that there are no requests to be considered ready at the moment.

Returning None implies that no request from the scheduler will be sent to the downloader in the current reactor cycle. The engine will continue calling next_request until has_pending_requests is False.
Default scheduler
- class scrapy.core.scheduler.Scheduler[source]
Default scheduler.
Requests are stored into priority queues (SCHEDULER_PRIORITY_QUEUE) that sort requests by priority.

By default, a single, memory-based priority queue is used for all requests. When using JOBDIR, a disk-based priority queue is also created, and only unserializable requests are stored in the memory-based priority queue. For a given priority value, requests in memory take precedence over requests on disk.

Each priority queue stores requests in separate internal queues, one per priority value. The memory priority queue uses SCHEDULER_MEMORY_QUEUE queues, while the disk priority queue uses SCHEDULER_DISK_QUEUE queues. The internal queues determine request order when requests have the same priority. Start requests are stored in separate internal queues by default, and ordered differently.

Duplicate requests are filtered out with an instance of DUPEFILTER_CLASS.

Request order
With default settings, pending requests are stored in a LIFO queue (except for start requests). As a result, crawling happens in DFO (depth-first) order, which is usually the most convenient crawl order. However, you can enforce BFO (breadth-first) or a custom order (except for the first few requests).
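The effect of the queue discipline can be seen with a toy crawl simulation (illustration only, not Scrapy code): a LIFO queue dives into newly discovered links before their siblings (depth-first), while a FIFO queue drains each depth level before starting the next (breadth-first).

```python
# Toy illustration of LIFO vs FIFO crawl order. Each "page" discovers
# two child links one level deeper; the queue type decides which pending
# page is processed next.
from collections import deque


def crawl_order(lifo: bool, max_depth: int = 2):
    queue = deque(["0"])  # the single start request
    order = []
    while queue:
        page = queue.pop() if lifo else queue.popleft()
        order.append(page)
        if page.count(".") < max_depth:
            # links discovered on this page
            queue.extend([f"{page}.0", f"{page}.1"])
    return order


# LIFO (the default): children are crawled before siblings (DFO).
# FIFO: each depth level is finished before the next one starts (BFO).
```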
Start request order
Start requests are sent in the order they are yielded from start(), and given the same priority, other requests take precedence over start requests.

You can set SCHEDULER_START_MEMORY_QUEUE and SCHEDULER_START_DISK_QUEUE to None to handle start requests the same as other requests when it comes to order and priority.

Crawling in BFO order
If you do want to crawl in BFO order, you can do it by setting the following settings:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

Crawling in a custom order
You can manually set priority on requests to force a specific request order.

Concurrency affects order
While the number of pending requests is below the configured values of CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP, those requests are sent concurrently.

As a result, the first few requests of a crawl may not follow the desired order. Lowering those settings to 1 enforces the desired order, except for the very first request, but it significantly slows down the crawl as a whole.

- __init__(dupefilter: BaseDupeFilter, jobdir: str | None = None, dqclass: type[BaseQueue] | None = None, mqclass: type[BaseQueue] | None = None, logunser: bool = False, stats: StatsCollector | None = None, pqclass: type[ScrapyPriorityQueue] | None = None, crawler: Crawler | None = None)[source]
Initialize the scheduler.
- Parameters:
dupefilter (scrapy.dupefilters.BaseDupeFilter instance or similar: any class that implements the BaseDupeFilter interface) – An object responsible for checking and filtering duplicate requests. The value for the DUPEFILTER_CLASS setting is used by default.

jobdir (str or None) – The path of a directory to be used for persisting the crawl’s state. The value for the JOBDIR setting is used by default. See Jobs: pausing and resuming crawls.

dqclass (class) – A class to be used as persistent request queue. The value for the SCHEDULER_DISK_QUEUE setting is used by default.

mqclass (class) – A class to be used as non-persistent request queue. The value for the SCHEDULER_MEMORY_QUEUE setting is used by default.

logunser (bool) – A boolean that indicates whether or not unserializable requests should be logged. The value for the SCHEDULER_DEBUG setting is used by default.

stats (scrapy.statscollectors.StatsCollector instance or similar: any class that implements the StatsCollector interface) – A stats collector object to record stats about the request scheduling process. The value for the STATS_CLASS setting is used by default.

pqclass (class) – A class to be used as priority queue for requests. The value for the SCHEDULER_PRIORITY_QUEUE setting is used by default.

crawler (scrapy.crawler.Crawler) – The crawler object corresponding to the current crawl.
- close(reason: str) Deferred[None] | None[source]
Dump pending requests to disk if there is a disk queue, and return the result of the dupefilter’s close method.
- enqueue_request(request: Request) bool[source]
Unless the received request is filtered out by the Dupefilter, attempt to push it into the disk queue, falling back to pushing it into the memory queue.
Increment the appropriate stats, such as scheduler/enqueued, scheduler/enqueued/disk and scheduler/enqueued/memory.

Return True if the request was stored successfully, False otherwise.
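The disk-then-memory fallback can be sketched as follows. This is an assumption-level sketch of the behavior described above, not the actual Scrapy source: pickling stands in for disk serialization, and a set of seen URLs stands in for the dupefilter.

```python
# Sketch of the default scheduler's enqueue logic when a disk queue is
# configured (JOBDIR): filter duplicates, try the disk queue first, and
# fall back to the memory queue when the request cannot be serialized.
import pickle


class EnqueueSketch:
    def __init__(self):
        self.disk_queue = []    # would be a pickle-backed queue on disk
        self.memory_queue = []  # plain in-memory queue
        self.seen = set()       # dupefilter stand-in
        self.stats = {"scheduler/enqueued": 0,
                      "scheduler/enqueued/disk": 0,
                      "scheduler/enqueued/memory": 0}

    def enqueue_request(self, request) -> bool:
        if request["url"] in self.seen:
            return False  # rejected as a duplicate
        self.seen.add(request["url"])
        try:
            # Try the disk queue first...
            self.disk_queue.append(pickle.dumps(request))
            self.stats["scheduler/enqueued/disk"] += 1
        except Exception:
            # ...and fall back to memory if the request cannot be
            # serialized (e.g. it holds an unpicklable callback).
            self.memory_queue.append(request)
            self.stats["scheduler/enqueued/memory"] += 1
        self.stats["scheduler/enqueued"] += 1
        return True
```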
- classmethod from_crawler(crawler: Crawler) Self[source]
Factory method which receives the current Crawler object as argument.
- next_request() Request | None[source]
Return a Request object from the memory queue, falling back to the disk queue if the memory queue is empty. Return None if there are no more enqueued requests.

Increment the appropriate stats, such as scheduler/dequeued, scheduler/dequeued/disk and scheduler/dequeued/memory.