# arun_many(...) Reference
> **Note**: This function is very similar to `arun()` but focused on concurrent or batch crawling. If you're unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
## Function Signature
```python
async def arun_many(
    urls: Union[List[str], List[Any]],
    config: Optional[Union[CrawlerRunConfig, List[CrawlerRunConfig]]] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
    """
    Crawl multiple URLs concurrently or in batches.

    :param urls: A list of URLs (or tasks) to crawl.
    :param config: (Optional) Either:
        - A single `CrawlerRunConfig` applying to all URLs
        - A list of `CrawlerRunConfig` objects with url_matcher patterns
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
    :return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
    """
```
## Differences from arun()
1. **Multiple URLs**:
   - Instead of crawling a single URL, you pass a list of them (strings or tasks).
   - The function returns either a list of `CrawlResult` objects or an async generator if streaming is enabled.

2. **Concurrency & Dispatchers**:
   - The `dispatcher` param allows advanced concurrency control.
   - If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
   - Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see Multi-URL Crawling).

3. **Streaming Support**:
   - Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
   - When streaming, use `async for` to process results as they become available.
   - Ideal for processing large numbers of URLs without waiting for all to complete.

4. **Parallel Execution**:
   - `arun_many()` can run multiple requests concurrently under the hood.
   - Each `CrawlResult` might also include a `dispatch_result` with concurrency details (like memory usage, start/end times).
## Basic Example (Batch Mode)
```python
# Minimal usage: the default dispatcher will be used
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=False)  # Default behavior
)

for res in results:
    if res.success:
        print(res.url, "crawled OK!")
    else:
        print("Failed:", res.url, "-", res.error_message)
```
## Streaming Example
```python
config = CrawlerRunConfig(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

# Process results as they complete
async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=config
):
    if result.success:
        print(f"Just completed: {result.url}")
        # Process each result immediately
        process_result(result)
```
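Note that the call is still awaited: in streaming mode, awaiting `arun_many()` yields an async generator rather than a list, which is why the example writes `async for result in await crawler.arun_many(...)`.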
## With a Custom Dispatcher
```python
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)

results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=dispatcher
)
```
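If you also want rate limiting, the multi-URL crawling docs pair the dispatcher with a `RateLimiter`. A sketch is below; the import paths and `RateLimiter` parameters follow those docs but may differ between versions, so treat them as assumptions to verify:

```python
from crawl4ai import RateLimiter
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10,
    rate_limiter=RateLimiter(   # assumed signature; verify for your version
        base_delay=(1.0, 3.0),  # random per-request delay range, in seconds
        max_delay=60.0,         # cap for exponential backoff
        max_retries=3,          # give up on a URL after this many rate-limit hits
    ),
)
```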
## URL-Specific Configurations
Instead of using one config for all URLs, provide a list of configs with `url_matcher` patterns:
```python
from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# PDF files - specialized extraction
pdf_config = CrawlerRunConfig(
    url_matcher="*.pdf",
    scraping_strategy=PDFContentScrapingStrategy()
)

# Blog/article pages - content filtering
blog_config = CrawlerRunConfig(
    url_matcher=["*/blog/*", "*/article/*", "*python.org*"],
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48)
    )
)

# Dynamic pages - JavaScript execution
github_config = CrawlerRunConfig(
    url_matcher=lambda url: 'github.com' in url,
    js_code="window.scrollTo(0, 500);"
)

# API endpoints - JSON extraction
api_config = CrawlerRunConfig(
    url_matcher=lambda url: 'api' in url or url.endswith('.json'),
    # Custom settings for JSON extraction
)

# Default fallback config
default_config = CrawlerRunConfig()  # No url_matcher means it only applies as a fallback

# Pass the list of configs - first match wins!
results = await crawler.arun_many(
    urls=[
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",  # → pdf_config
        "https://blog.python.org/",                 # → blog_config
        "https://github.com/microsoft/playwright",  # → github_config
        "https://httpbin.org/json",                 # → api_config
        "https://example.com/"                      # → default_config
    ],
    config=[pdf_config, blog_config, github_config, api_config, default_config]
)
```
**URL Matching Features:**

- **String patterns**: `"*.pdf"`, `"*/blog/*"`, `"*python.org*"`
- **Function matchers**: `lambda url: 'api' in url`
- **Mixed patterns**: Combine strings and functions with `MatchMode.OR` or `MatchMode.AND` (see the sketch below)
- **First match wins**: Configs are evaluated in order
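As a sketch of mixed patterns (this assumes `CrawlerRunConfig` accepts a `match_mode` parameter alongside `url_matcher`, matching the `MatchMode` import used earlier; verify against your version):

```python
from crawl4ai import CrawlerRunConfig, MatchMode

# Match pages that live under /docs/ AND are served over HTTPS:
# with MatchMode.AND, every matcher in the list must pass.
docs_config = CrawlerRunConfig(
    url_matcher=[
        "*/docs/*",                              # string pattern
        lambda url: url.startswith("https://"),  # function matcher
    ],
    match_mode=MatchMode.AND,  # assumed parameter; OR means any matcher suffices
)
```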
**Key Points:**

- Each URL is processed by the same or separate sessions, depending on the dispatcher's strategy.
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
- If you need to handle authentication or session IDs, pass them in each individual task or within your run config.
- **Important**: Always include a default config (without `url_matcher`) as the last item if you want to handle all URLs. Otherwise, unmatched URLs will fail.
## Return Value

Either a list of `CrawlResult` objects, or an async generator if streaming is enabled. You can iterate to check `result.success` or read each item's `extracted_content`, `markdown`, or `dispatch_result`.
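For instance, a rough sketch of reading the concurrency stats; the `dispatch_result` field names used here (`memory_usage`, `start_time`, `end_time`) follow the dispatcher docs and should be treated as assumptions:

```python
results = await crawler.arun_many(urls=my_urls, config=my_run_config)

for res in results:
    dr = getattr(res, "dispatch_result", None)  # present when a dispatcher ran the task
    if res.success and dr:
        # Assumed fields: memory used by the task (MB) and start/end timestamps.
        print(f"{res.url}: {dr.memory_usage:.1f} MB, took {dr.end_time - dr.start_time}")
```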
## Dispatcher Reference

- `MemoryAdaptiveDispatcher`: Dynamically manages concurrency based on system memory usage.
- `SemaphoreDispatcher`: Fixed concurrency limit; simpler but less adaptive (a minimal usage sketch follows).
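A minimal sketch of the fixed-limit variant. The `max_session_permit` parameter name mirrors the `MemoryAdaptiveDispatcher` example above and the import path follows the dispatcher docs; both are assumptions to verify against your version:

```python
from crawl4ai.async_dispatcher import SemaphoreDispatcher

# At most 5 crawls in flight at any time, regardless of available memory.
dispatcher = SemaphoreDispatcher(max_session_permit=5)

results = await crawler.arun_many(
    urls=urls,
    config=my_run_config,
    dispatcher=dispatcher,
)
```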
For advanced usage or custom settings, see Multi-URL Crawling with Dispatchers.
## Common Pitfalls

1. **Large Lists**: If you pass thousands of URLs, be mindful of memory or rate limits. A dispatcher can help.
2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
3. **Error Handling**: Each `CrawlResult` might fail for different reasons, so always check `result.success` or the `error_message` before proceeding. One pattern for retrying failures is sketched below.
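Building on point 3, one illustrative pattern is to collect failed URLs and re-run them in a second pass. This retry loop is not a library feature, just a sketch:

```python
# Illustrative retry loop; max_attempts is a hypothetical budget, not a library option.
pending = list(urls)
max_attempts = 2

for attempt in range(max_attempts):
    results = await crawler.arun_many(urls=pending, config=my_run_config)
    # Keep only the URLs that failed this round.
    pending = [res.url for res in results if not res.success]
    if not pending:
        break

if pending:
    print("Still failing after retries:", pending)
```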
## Conclusion

Use `arun_many()` when you want to crawl multiple URLs simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate limiting), provide a dispatcher. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the Advanced Multi-URL Crawling docs.