arun_many(...) Reference

Note: This function is very similar to arun() but focused on concurrent or batch crawling. If you're unfamiliar with arun() usage, please read that doc first, then review this for differences.

Function Signature

async def arun_many(
    urls: Union[List[str], List[Any]],
    config: Optional[Union[CrawlerRunConfig, List[CrawlerRunConfig]]] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
    """
    Crawl multiple URLs concurrently or in batches.

    :param urls: A list of URLs (or tasks) to crawl.
    :param config: (Optional) Either:
        - A single `CrawlerRunConfig` applying to all URLs
        - A list of `CrawlerRunConfig` objects with url_matcher patterns
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
    :return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
    """

Differences from arun()

1. Multiple URLs:

  • Instead of crawling a single URL, you pass a list of them (strings or tasks).

  • The function returns either a list of CrawlResult objects or an async generator if streaming is enabled.

2. Concurrency & Dispatchers:

  • The dispatcher param allows advanced concurrency control.

  • If omitted, a default dispatcher (like MemoryAdaptiveDispatcher) is used internally.

  • Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see Multi-URL Crawling).

3. Streaming Support:

  • Enable streaming by setting stream=True in your CrawlerRunConfig.

  • When streaming, use async for to process results as they become available.

  • Ideal for processing large numbers of URLs without waiting for all to complete.

4. Parallel Execution:

  • arun_many() can run multiple requests concurrently under the hood.

  • Each CrawlResult might also include a dispatch_result with concurrency details (like memory usage, start/end times).

Basic Example (Batch Mode)

# Minimal usage: The default dispatcher will be used
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=False)  # Default behavior
)

for res in results:
    if res.success:
        print(res.url, "crawled OK!")
    else:
        print("Failed:", res.url, "-", res.error_message)

Streaming Example

config = CrawlerRunConfig(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

# Process results as they complete
async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=config
):
    if result.success:
        print(f"Just completed: {result.url}")
        # Process each result immediately
        process_result(result)

With a Custom Dispatcher

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=dispatcher
)
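The snippet above assumes my_run_config and the dispatcher class are already in scope. A hedged sketch of that setup; the crawl4ai.async_dispatcher import path mirrors the dispatcher docs and may differ between versions:

from crawl4ai import CrawlerRunConfig, CacheMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher  # import path may vary by version

# The run config applied to every URL in the batch
my_run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=False)

# Throttles new tasks when system memory usage passes the threshold,
# and never runs more than max_session_permit sessions at once
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)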

URL-Specific Configurations

Instead of using one config for all URLs, provide a list of configs with url_matcher patterns:

from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# PDF files - specialized extraction
pdf_config = CrawlerRunConfig(
    url_matcher="*.pdf",
    scraping_strategy=PDFContentScrapingStrategy()
)

# Blog/article pages - content filtering
blog_config = CrawlerRunConfig(
    url_matcher=["*/blog/*", "*/article/*", "*python.org*"],
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48)
    )
)

# Dynamic pages - JavaScript execution
github_config = CrawlerRunConfig(
    url_matcher=lambda url: 'github.com' in url,
    js_code="window.scrollTo(0, 500);"
)

# API endpoints - JSON extraction
api_config = CrawlerRunConfig(
    url_matcher=lambda url: 'api' in url or url.endswith('.json'),
    # Custom settings for JSON extraction
)

# Default fallback config
default_config = CrawlerRunConfig()  # No url_matcher means it never matches except as fallback

# Pass the list of configs - first match wins!
results = await crawler.arun_many(
    urls=[
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",  # → pdf_config
        "https://blog.python.org/",  # → blog_config
        "https://github.com/microsoft/playwright",  # → github_config
        "https://httpbin.org/json",  # → api_config
        "https://example.com/"  # → default_config
    ],
    config=[pdf_config, blog_config, github_config, api_config, default_config]
)

URL Matching Features:

  • String patterns: "*.pdf", "*/blog/*", "*python.org*"

  • Function matchers: lambda url: 'api' in url

  • Mixed patterns: combine strings and functions with MatchMode.OR or MatchMode.AND

  • First match wins: configs are evaluated in order
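To combine string patterns and function matchers, something like the following sketch should work; MatchMode is imported in the example above, while the match_mode parameter name is taken from the URL matching docs and should be treated as an assumption:

from crawl4ai import CrawlerRunConfig, MatchMode

# Require BOTH conditions: the URL is on python.org AND looks like a blog page
python_blog_config = CrawlerRunConfig(
    url_matcher=["*python.org*", lambda url: "/blog" in url],
    match_mode=MatchMode.AND  # MatchMode.OR would accept either condition instead
)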

Key Points:

  • Each URL is processed by the same or separate sessions, depending on the dispatcher's strategy.

  • dispatch_result in each CrawlResult (if using concurrency) can hold memory and timing info.

  • If you need to handle authentication or session IDs, pass them in each individual task or within your run config.

  • Important: Always include a default config (without url_matcher) as the last item if you want to handle all URLs. Otherwise, unmatched URLs will fail.

Return Value

Either a list of CrawlResult objects, or an async generator if streaming is enabled. You can iterate to check result.success or read each item's extracted_content, markdown, or dispatch_result.
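A defensive sketch of reading dispatch_result after a concurrent run, reusing urls, my_run_config, and dispatcher from the earlier examples. The attribute names checked here (memory_usage, start_time, end_time) are assumptions based on the concurrency details mentioned above, so the code falls back gracefully if they differ in your version:

results = await crawler.arun_many(urls=urls, config=my_run_config, dispatcher=dispatcher)

for res in results:
    if not res.success:
        print("Failed:", res.url, "-", res.error_message)
        continue
    dr = res.dispatch_result  # populated when a dispatcher recorded stats; may be None
    if dr:
        # Hypothetical field names: verify against your installed DispatchResult
        print(
            res.url,
            "memory:", getattr(dr, "memory_usage", "n/a"),
            "start:", getattr(dr, "start_time", "n/a"),
            "end:", getattr(dr, "end_time", "n/a"),
        )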


Dispatcher Reference

  • MemoryAdaptiveDispatcher: Dynamically manages concurrency based on system memory usage.

  • SemaphoreDispatcher: Fixed concurrency limit, simpler but less adaptive.

For advanced usage or custom settings, see Multi-URL Crawling with Dispatchers.
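For comparison, a SemaphoreDispatcher sketch; the import path and the max_session_permit argument are assumptions carried over from the memory-adaptive example and the dispatcher docs, so double-check them against your installed version:

from crawl4ai.async_dispatcher import SemaphoreDispatcher  # import path and argument may vary

# A fixed ceiling of 5 concurrent sessions, with no memory-based adaptation
semaphore_dispatcher = SemaphoreDispatcher(max_session_permit=5)

results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=semaphore_dispatcher
)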


Common Pitfalls

1. Large Lists: If you pass thousands of URLs, be mindful of memory or rate-limits. A dispatcher can help.

2. Session Reuse: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.

3. Error Handling: Each CrawlResult might fail for different reasons, so always check result.success or the error_message before proceeding.
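A common batch-mode pattern is to separate successes from failures and give the failed URLs one retry pass. A minimal sketch, where run_config stands in for whatever non-streaming config you are already using:

results = await crawler.arun_many(urls=urls, config=run_config)

succeeded = [r for r in results if r.success]
failed_urls = [r.url for r in results if not r.success]

if failed_urls:
    # A single retry pass for everything that failed the first time
    retry_results = await crawler.arun_many(urls=failed_urls, config=run_config)
    succeeded.extend(r for r in retry_results if r.success)

print(f"{len(succeeded)} succeeded; {len(failed_urls)} needed a retry")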


Conclusion

Use arun_many() when you want to crawl multiple URLs simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a dispatcher. Each result is a standard CrawlResult, possibly augmented with concurrency stats (dispatch_result) for deeper inspection. For more details on concurrency logic and dispatchers, see the Advanced Multi-URL Crawling docs.

