SitemapRequestLoader
Hierarchy
- RequestLoader
- SitemapRequestLoader
Index
Methods
__aenter__
Enter the context manager.
Returns SitemapRequestLoader
__aexit__
Exit the context manager.
Parameters
exc_type: type[BaseException] | None
exc_value: BaseException | None
exc_traceback: TracebackType | None
Returns None
__init__
Initialize the sitemap request loader.
Parameters
sitemap_urls: list[str]
URLs of the sitemaps to load.
http_client: HttpClient
The instance of HttpClient to use for fetching sitemaps.
proxy_info: ProxyInfo | None = None (optional, keyword-only)
Optional proxy to use for fetching sitemaps.
include: list[re.Pattern[Any] | Glob] | None = None (optional, keyword-only)
List of glob or regex patterns to include URLs.
exclude: list[re.Pattern[Any] | Glob] | None = None (optional, keyword-only)
List of glob or regex patterns to exclude URLs.
max_buffer_size: int = 200 (optional, keyword-only)
Maximum number of URLs to buffer in memory.
persist_state_key: str | None = None (optional, keyword-only)
A key for persisting the loader's state in the KeyValueStore. When provided, allows resuming from where it left off after interruption. If None, no state persistence occurs.
Returns None
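The include and exclude parameters can be pictured with a small stand-alone filter. This is a minimal sketch, not the loader's own code: the helper name `url_passes_filters` is hypothetical, plain strings stand in for `Glob` objects, and the rule "exclude wins, then at least one include pattern must match" is an assumption about how the two lists combine.

```python
import re
from fnmatch import fnmatchcase


def url_passes_filters(url, include=None, exclude=None):
    """Return True if `url` should be kept (hypothetical helper, not library code)."""

    def matches(pattern):
        # Compiled regexes are searched; plain strings are treated as globs.
        if isinstance(pattern, re.Pattern):
            return pattern.search(url) is not None
        return fnmatchcase(url, pattern)

    # Assumption: an exclude match always rejects the URL.
    if exclude and any(matches(p) for p in exclude):
        return False
    # Assumption: when include patterns are given, at least one must match.
    if include:
        return any(matches(p) for p in include)
    return True
```

For example, `url_passes_filters('https://example.com/docs/intro', include=[re.compile(r'/docs/')])` keeps the URL, while the same call with a `/blog/` URL rejects it.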
abort_loading
Abort the sitemap loading process.
Returns None
close
Close the request loader.
Returns None
fetch_next_request
Fetch the next request to process.
Returns Request | None
get_handled_count
Return the number of URLs that have been handled.
Returns int
get_total_count
Return the total number of URLs found so far.
Returns int
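The two counters can be combined into a simple progress report. The sketch below assumes both methods are awaitable; `CountingLoader` and `format_progress` are hypothetical names used only so the example runs without the library.

```python
import asyncio


class CountingLoader:
    """Hypothetical stand-in exposing the two counters documented above."""

    def __init__(self, handled, total):
        self._handled = handled
        self._total = total

    async def get_handled_count(self):
        return self._handled

    async def get_total_count(self):
        return self._total


async def format_progress(loader):
    handled = await loader.get_handled_count()
    total = await loader.get_total_count()
    # The total covers only URLs found so far, so it may still grow
    # while sitemaps are being loaded.
    return f'{handled}/{total} URLs handled (so far)'
```

Because the total reflects only the URLs found so far, a ratio computed from these values is a lower bound on remaining work rather than a final percentage.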
is_empty
Check if there are no more URLs to process.
Returns bool
is_finished
Check if all URLs have been processed.
Returns bool
mark_request_as_handled
Mark a request as successfully handled.
Parameters
request: Request
Returns ProcessedRequest | None
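A typical consumer fetches a request, processes it, and marks it as handled until the loader reports that it is finished. The sketch below shows that loop against a hypothetical in-memory stand-in (`ListLoader`, `FakeRequest`), and assumes the methods are awaitable and that `fetch_next_request` may return None while more URLs are still on the way.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeRequest:
    url: str


class ListLoader:
    """Hypothetical in-memory stand-in for a request loader."""

    def __init__(self, urls):
        self._pending = [FakeRequest(u) for u in urls]
        self._in_progress = set()

    async def fetch_next_request(self):
        if not self._pending:
            return None
        request = self._pending.pop(0)
        self._in_progress.add(request)
        return request

    async def mark_request_as_handled(self, request):
        self._in_progress.discard(request)

    async def is_finished(self):
        return not self._pending and not self._in_progress


async def drain(loader):
    processed = []
    while not await loader.is_finished():
        request = await loader.fetch_next_request()
        if request is None:
            # Nothing buffered right now; yield control and try again.
            await asyncio.sleep(0)
            continue
        processed.append(request.url)  # real processing would happen here
        await loader.mark_request_as_handled(request)
    return processed
```

Checking `is_finished` rather than stopping at the first None keeps the loop correct when the buffer is momentarily empty but requests are still in progress.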
start
Start the sitemap loading process.
Returns None
to_tandem
Combine the loader with a request manager to support adding and reclaiming requests.
Parameters
request_manager: RequestManager | None = None (optional)
Request manager to combine the loader with. If None is given, the default request queue is used.
Returns RequestManagerTandem
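The idea behind a tandem is that a read-only source is paired with a writable queue, so requests can be added or reclaimed even though the loader itself only produces URLs. The sketch below is a conceptual illustration only: `TandemSketch` and its synchronous methods are hypothetical, and the real `RequestManagerTandem` may order and transfer requests differently.

```python
from collections import deque


class TandemSketch:
    """Hypothetical stand-in, not the library's RequestManagerTandem."""

    def __init__(self, loader_urls):
        self._loader = deque(loader_urls)  # read-only source, e.g. sitemap URLs
        self._queue = deque()  # writable queue for added or reclaimed requests

    def add_request(self, url):
        self._queue.append(url)

    def reclaim_request(self, url):
        # Put a failed request back at the front so it is retried first.
        self._queue.appendleft(url)

    def fetch_next_request(self):
        # Assumption: the writable queue is served before the read-only source.
        if self._queue:
            return self._queue.popleft()
        if self._loader:
            return self._loader.popleft()
        return None
```

In this sketch a reclaimed URL is retried before any new URL from the read-only source, which is one plausible policy among several.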
A request loader that reads URLs from sitemap(s).
The loader fetches and parses sitemaps in the background, allowing crawling to start before all URLs are loaded. It supports filtering URLs using glob and regex patterns.
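Background loading with a bounded buffer can be illustrated with a producer task and an `asyncio.Queue`. This is a minimal sketch under stated assumptions, not the loader's implementation: `load_urls` and `crawl` are hypothetical names, the URL list stands in for parsed sitemap entries, and None is used as an end-of-input sentinel.

```python
import asyncio


async def load_urls(urls, buffer):
    """Producer: stands in for fetching and parsing sitemaps."""
    for url in urls:
        # put() waits while the buffer is full, which caps memory use.
        await buffer.put(url)
    await buffer.put(None)  # sentinel: no more URLs


async def crawl(urls, max_buffer_size=2):
    buffer = asyncio.Queue(maxsize=max_buffer_size)
    producer = asyncio.create_task(load_urls(urls, buffer))
    processed = []
    # The consumer starts immediately, before the producer has finished.
    while (url := await buffer.get()) is not None:
        processed.append(url)
    await producer
    return processed
```

The `maxsize` argument plays the role that `max_buffer_size` plays for the loader: the producer pauses whenever the consumer falls behind by that many URLs.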
The loader supports state persistence, allowing it to resume from where it left off after interruption when a persist_state_key is provided during initialization.
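Key-based state persistence can be sketched as saving a snapshot under a key and restoring it on the next run. Everything here is an assumption made for illustration: a plain dict stands in for the KeyValueStore, the helpers `save_state` and `restore_state` are hypothetical, and the snapshot layout (pending and handled URLs) is not the loader's actual state format.

```python
import json


def save_state(store, key, pending, handled):
    """Persist a snapshot under `key`; do nothing when no key is given."""
    if key is None:
        return
    store[key] = json.dumps({'pending': pending, 'handled': handled})


def restore_state(store, key):
    """Return the saved snapshot, or None if there is nothing to resume from."""
    if key is None or key not in store:
        return None
    return json.loads(store[key])
```

A run that finds a snapshot would continue with the pending URLs instead of re-reading the sitemaps from the start; a run with a key of None behaves as if no state ever existed.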