
SitemapRequestLoader

A request loader that reads URLs from sitemap(s).

The loader fetches and parses sitemaps in the background, allowing crawling to start before all URLs are loaded. It supports filtering URLs using glob and regex patterns.
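The background-loading idea can be pictured as a producer and a consumer sharing a bounded buffer. The following is a minimal sketch using only the standard library; it is not the loader's actual implementation, and the names load_urls and crawl are hypothetical.

```python
import asyncio


async def load_urls(urls, buffer: asyncio.Queue) -> None:
    """Producer: stands in for background sitemap fetching and parsing."""
    for url in urls:
        # put() waits while the buffer is full, so memory use stays bounded.
        await buffer.put(url)
    await buffer.put(None)  # Sentinel: no more URLs will arrive.


async def crawl(urls, max_buffer_size: int = 2) -> list:
    buffer: asyncio.Queue = asyncio.Queue(maxsize=max_buffer_size)
    producer = asyncio.create_task(load_urls(urls, buffer))
    seen = []
    # The consumer starts reading before the producer has finished loading.
    while (url := await buffer.get()) is not None:
        seen.append(url)
    await producer
    return seen


urls = [f"https://example.com/page/{i}" for i in range(5)]
result = asyncio.run(crawl(urls))
```

Because the queue holds at most max_buffer_size items, the producer pauses whenever the consumer falls behind, which is presumably the role the max_buffer_size parameter plays in the real loader.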

The loader supports state persistence: when a persist_state_key is provided during initialization, it can resume from where it left off after an interruption.
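To illustrate what resuming under a key means, here is a small sketch in which a plain dict stands in for the key-value store. The class ResumableUrlFeed, its next_url method, and the stored "position" field are all hypothetical; the real loader's persisted state is not described in this page.

```python
import json
from typing import Optional


class ResumableUrlFeed:
    """Hypothetical stand-in that remembers its position under a key."""

    def __init__(self, urls, store: dict, persist_state_key: Optional[str] = None):
        self._urls = list(urls)
        self._store = store
        self._key = persist_state_key
        self._position = 0
        # Restore the saved position if a previous run left one behind.
        if self._key is not None and self._key in store:
            self._position = json.loads(store[self._key])["position"]

    def next_url(self) -> Optional[str]:
        if self._position >= len(self._urls):
            return None
        url = self._urls[self._position]
        self._position += 1
        self._persist()
        return url

    def _persist(self) -> None:
        # Without a key, nothing is written and every run starts from scratch.
        if self._key is not None:
            self._store[self._key] = json.dumps({"position": self._position})


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
store = {}

first_run = ResumableUrlFeed(urls, store, persist_state_key="sitemap-state")
first_run.next_url()
first_run.next_url()

# Simulated interruption: a new instance with the same key picks up where the old one stopped.
second_run = ResumableUrlFeed(urls, store, persist_state_key="sitemap-state")
resumed = second_run.next_url()

# A feed created without a key ignores the stored state.
fresh = ResumableUrlFeed(urls, store).next_url()
```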

Methods

__aenter__

__aexit__

  • async __aexit__(exc_type, exc_value, exc_traceback): None
  • Exit the context manager.


    Parameters

    • exc_type: type[BaseException] | None
    • exc_value: BaseException | None
    • exc_traceback: TracebackType | None

    Returns None

__init__

  • __init__(sitemap_urls, http_client, *, proxy_info, include, exclude, max_buffer_size, persist_state_key): None
  • Initialize the sitemap request loader.


    Parameters

    • sitemap_urls: list[str]

      List of sitemap URLs to load.

    • http_client: HttpClient

      The HttpClient instance to use for fetching sitemaps.

    • optional, keyword-only proxy_info: ProxyInfo | None = None

      Optional proxy to use for fetching sitemaps.

    • optional, keyword-only include: list[re.Pattern[Any] | Glob] | None = None

      List of glob or regex patterns that URLs must match to be included.

    • optional, keyword-only exclude: list[re.Pattern[Any] | Glob] | None = None

      List of glob or regex patterns; URLs that match are excluded.

    • optional, keyword-only max_buffer_size: int = 200

      Maximum number of URLs to buffer in memory.

    • optional, keyword-only persist_state_key: str | None = None

      A key for persisting the loader's state in the KeyValueStore. When provided, the loader can resume from where it left off after an interruption. If None, no state persistence occurs.

    Returns None
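As a rough illustration of how include and exclude patterns could interact, the sketch below filters URLs with standard-library tools only. It is not the library's code: plain strings matched with fnmatch stand in for Glob objects, the helper names matches and is_allowed are hypothetical, and the rule that exclude takes precedence over include is an assumption.

```python
import re
from fnmatch import fnmatchcase


def matches(url: str, pattern) -> bool:
    """Match a URL against a compiled regex or a glob string."""
    if isinstance(pattern, re.Pattern):
        return pattern.search(url) is not None
    # Note: fnmatch's '*' also matches '/', unlike some glob dialects.
    return fnmatchcase(url, pattern)


def is_allowed(url: str, include=None, exclude=None) -> bool:
    # Assumption: an exclude match always wins.
    if exclude and any(matches(url, p) for p in exclude):
        return False
    # Assumption: with no include patterns, every non-excluded URL passes.
    if include:
        return any(matches(url, p) for p in include)
    return True


candidates = [
    "https://example.com/blog/a",
    "https://example.com/blog/draft-b",
    "https://example.com/shop/c",
]
kept = [
    url
    for url in candidates
    if is_allowed(
        url,
        include=["https://example.com/blog/*"],
        exclude=[re.compile(r"/draft-")],
    )
]
```

Here only the first URL survives: the second matches the exclude regex, and the third matches no include pattern.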

abort_loading

  • async abort_loading(): None
  • Abort the sitemap loading process.


    Returns None

close

  • async close(): None
  • Close the request loader.


    Returns None

fetch_next_request

  • async fetch_next_request(): Request | None

get_handled_count

  • async get_handled_count(): int

get_total_count

  • async get_total_count(): int

is_empty

  • async is_empty(): bool

is_finished

  • async is_finished(): bool
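The fetching and status methods are typically used together in a consume loop. The sketch below shows one plausible shape of that loop against a hypothetical in-memory stand-in, InMemoryLoader, which only mimics the method names listed on this page. The single-argument form of mark_request_as_handled, and the reading of is_empty as "nothing left to fetch" versus is_finished as "everything fetched has also been handled", are assumptions.

```python
import asyncio


class InMemoryLoader:
    """Hypothetical stand-in exposing the method names documented here."""

    def __init__(self, urls):
        self._pending = list(urls)
        self._in_progress = set()
        self._handled = 0

    async def fetch_next_request(self):
        if not self._pending:
            return None
        request = self._pending.pop(0)
        self._in_progress.add(request)
        return request

    async def mark_request_as_handled(self, request):  # signature is an assumption
        self._in_progress.discard(request)
        self._handled += 1

    async def is_empty(self) -> bool:
        return not self._pending

    async def is_finished(self) -> bool:
        return not self._pending and not self._in_progress

    async def get_handled_count(self) -> int:
        return self._handled


async def drain(loader) -> list:
    processed = []
    while not await loader.is_finished():
        request = await loader.fetch_next_request()
        if request is None:
            # Nothing available yet; yield to the event loop and retry.
            await asyncio.sleep(0)
            continue
        processed.append(request)  # Real code would process the request here.
        await loader.mark_request_as_handled(request)
    return processed


async def main():
    loader = InMemoryLoader(["https://example.com/a", "https://example.com/b"])
    processed = await drain(loader)
    return processed, await loader.get_handled_count(), await loader.is_empty()


processed, handled_count, empty = asyncio.run(main())
```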

mark_request_as_handled

start

  • async start(): None
  • Start the sitemap loading process.


    Returns None

to_tandem

  • Combine the loader with a request manager to support adding and reclaiming requests.


    Parameters

    • optional request_manager: RequestManager | None = None

      Request manager to combine the loader with. If None is given, the default request queue is used.

    Returns RequestManagerTandem