# AsyncWebCrawler
The `AsyncWebCrawler` is the core class for asynchronous web crawling in Crawl4AI. You typically create it once, optionally customize it with a `BrowserConfig` (e.g., headless, user agent), then run multiple `arun()` calls with different `CrawlerRunConfig` objects.
**Recommended usage**:

1. Create a `BrowserConfig` for global browser settings.
2. Instantiate `AsyncWebCrawler(config=browser_config)`.
3. Use the crawler in an async context manager (`async with`) or manage start/close manually.
4. Call `arun(url, config=crawler_run_config)` for each page you want.
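Putting those four steps together, a minimal sketch looks like this (the URL is a placeholder):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)   # 1. global browser settings
    run_cfg = CrawlerRunConfig()                 # per-crawl settings (defaults here)
    # 3. the context manager handles start/close
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # 4. one arun() call per page
        result = await crawler.arun("https://example.com", config=run_cfg)
        print(result.success)

asyncio.run(main())
```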
## 1. Constructor Overview
```python
class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,           # deprecated
        always_by_pass_cache: Optional[bool] = None, # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy:
                (Advanced) Provide a custom crawler strategy if needed.
            config:
                A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache:
                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory:
                Folder for storing caches/logs (if relevant).
            thread_safe:
                If True, attempts some concurrency safeguards. Usually False.
            **kwargs:
                Additional legacy or debugging parameters.
        """
```
### Typical Initialization
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)

crawler = AsyncWebCrawler(config=browser_cfg)
```
**Notes**:

- Legacy parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set caching in `CrawlerRunConfig`.
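For example, caching is typically controlled per crawl via `cache_mode` rather than the deprecated constructor flag. A minimal sketch (the helper name is illustrative):

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def fetch_fresh(url: str):
    # Per-crawl cache control replaces the deprecated always_bypass_cache flag
    run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        return await crawler.arun(url, config=run_cfg)
```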
## 2. Lifecycle: Start/Close or Context Manager
### 2.1 Context Manager (Recommended)
```python
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")
    # The crawler automatically starts/closes resources
```
When the `async with` block ends, the crawler cleans up (closes the browser, etc.).
### 2.2 Manual Start & Close
```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")

await crawler.close()
```
Use this style if you have a long-running application or need full control of the crawler's lifecycle.
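In a long-running service you will usually want `close()` to run even if a crawl fails, for example with `try`/`finally` (a sketch; the URLs are placeholders):

```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
try:
    for url in ["https://example.com", "https://another.com"]:
        result = await crawler.arun(url)
        # ... process result ...
finally:
    await crawler.close()  # release the browser even if a crawl raised
```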
## 3. Primary Method: `arun()`
```python
async def arun(
    self,
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
) -> CrawlResult:
    ...
```
### 3.1 New Approach
You pass a `CrawlerRunConfig` object that sets up everything about a crawl: content filtering, caching, session reuse, JS code, screenshots, etc.
```python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
    print("Crawled HTML length:", len(result.cleaned_html))
    if result.screenshot:
        print("Screenshot base64 length:", len(result.screenshot))
```
### 3.2 Legacy Parameters Still Accepted
For backward compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a `CrawlerRunConfig`.
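A before/after sketch of that migration (the selector and threshold values are illustrative):

```python
# Legacy style: options passed directly to arun() (still accepted, but discouraged)
result = await crawler.arun(
    "https://example.com",
    css_selector="main.article",
    word_count_threshold=10,
)

# Preferred style: collect the same options in a CrawlerRunConfig
run_cfg = CrawlerRunConfig(css_selector="main.article", word_count_threshold=10)
result = await crawler.arun("https://example.com", config=run_cfg)
```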
## 4. Batch Processing: `arun_many()`
```python
async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
) -> List[CrawlResult]:
    """
    Process multiple URLs with intelligent rate limiting and resource monitoring.
    """
```
### 4.1 Resource-Aware Crawling
The `arun_many()` method now uses an intelligent dispatcher that:

- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently
### 4.2 Example Usage
Check the Multi-URL Crawling page for a detailed example of how to use `arun_many()`.
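A minimal sketch (the URLs are placeholders, and `browser_cfg`/`run_cfg` are configured as in the earlier examples):

```python
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, config=run_cfg)
    for result in results:
        if result.success:
            print(result.url, "->", len(result.cleaned_html), "chars of cleaned HTML")
        else:
            print(result.url, "failed:", result.error_message)
```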
### 4.3 Key Features
1. **Rate Limiting**
- Automatic delay between requests
- Exponential backoff on rate limit detection
- Domain-specific rate limiting
- Configurable retry strategy
2. **Resource Monitoring**
- Memory usage tracking
- Adaptive concurrency based on system load
- Automatic pausing when resources are constrained
3. **Progress Monitoring**
- Detailed or aggregated progress display
- Real-time status updates
- Memory usage statistics
4. **Error Handling**
- Graceful handling of rate limits
- Automatic retries with backoff
- Detailed error reporting
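If you need to tune this behavior, newer Crawl4AI releases expose dispatcher and rate-limiter classes that can be handed to `arun_many()`. The sketch below assumes your installed version provides `MemoryAdaptiveDispatcher` and `RateLimiter` and that `arun_many()` accepts a `dispatcher` argument; check the Multi-URL Crawling docs for the exact import paths and signatures.

```python
# Assumption: these classes exist in your installed version; import paths may differ.
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=80.0,    # hold off new crawls above this memory usage
    max_session_permit=10,            # cap on concurrent crawls
    rate_limiter=RateLimiter(
        base_delay=(1.0, 3.0),        # randomized per-domain delay between requests
        max_retries=3,                # retries with backoff on rate-limit responses
    ),
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, config=run_cfg, dispatcher=dispatcher)
```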
---
## 5. `CrawlResult` Output
Each `arun()` returns a **`CrawlResult`** containing:
- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2`: Deprecated; just use `markdown` instead.
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.
For details, see [CrawlResult doc](./crawl-result.md).
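A short sketch of reading those fields after a crawl (reusing `crawler` and `run_cfg` from the earlier examples; the `links` dict is assumed to group internal/external links as in the standard output):

```python
result = await crawler.arun("https://example.com", config=run_cfg)

if result.success:
    print("Final URL:", result.url)
    print("Cleaned HTML length:", len(result.cleaned_html))
    print("Internal links found:", len(result.links.get("internal", [])))
    if result.screenshot:
        print("Screenshot captured (base64 string)")
else:
    print("Crawl failed:", result.error_message)
```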
---
## 6. Quick Example
Below is an example hooking it all together:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy
import json

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {
                "name": "title",
                "selector": "h2",
                "type": "text"
            },
            {
                "name": "url",
                "selector": "a",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }

    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )

        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
```
**Explanation**:

- We define a `BrowserConfig` with Firefox, no headless, and `verbose=True`.
- We define a `CrawlerRunConfig` that bypasses cache, uses a CSS extraction schema, has a `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
## 7. Best Practices & Migration Notes
1. Use `BrowserConfig` for global settings about the browser's environment.
2. Use `CrawlerRunConfig` for per-crawl logic (caching, content filtering, extraction strategies, wait conditions).
3. Avoid legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:

```python
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
result = await crawler.arun(url="...", config=run_cfg)
```

4. **Context manager** usage is simplest unless you want a persistent crawler across many calls.
## 8. Summary
`AsyncWebCrawler` is your entry point to asynchronous crawling:

- The constructor accepts a `BrowserConfig` (or defaults).
- `arun(url, config=CrawlerRunConfig)` is the main method for single-page crawls.
- `arun_many(urls, config=CrawlerRunConfig)` handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly.
**Migration**:

- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
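A before/after sketch of that migration (parameter values are illustrative):

```python
# Before: browser and content options mixed into the constructor / arun() (legacy)
crawler = AsyncWebCrawler(browser_type="chromium")
result = await crawler.arun("https://example.com", css_selector="main.article")

# After: browser settings in BrowserConfig, per-crawl logic in CrawlerRunConfig
browser_cfg = BrowserConfig(browser_type="chromium")
run_cfg = CrawlerRunConfig(css_selector="main.article")

crawler = AsyncWebCrawler(config=browser_cfg)
result = await crawler.arun("https://example.com", config=run_cfg)
```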
This modular approach ensures your code is clean, scalable, and easy to maintain. For any advanced or rarely used parameters, see the BrowserConfig docs.