# AsyncWebCrawler
The `AsyncWebCrawler` is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless mode, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
**Recommended usage**:

1. **Create** a `BrowserConfig` for global browser settings.
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
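A minimal end-to-end sketch of these four steps (the URL is a placeholder, and the run config uses defaults):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)  # 1. global browser settings
    run_cfg = CrawlerRunConfig()                # per-crawl settings (defaults here)

    # 2 + 3. Instantiate and use as an async context manager
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # 4. One arun() call per page
        result = await crawler.arun("https://example.com", config=run_cfg)
        print(result.markdown[:300])

asyncio.run(main())
```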
## 1. Constructor Overview
```python
class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,           # deprecated
        always_by_pass_cache: Optional[bool] = None, # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy:
                (Advanced) Provide a custom crawler strategy if needed.
            config:
                A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache:
                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory:
                Folder for storing caches/logs (if relevant).
            thread_safe:
                If True, attempts some concurrency safeguards. Usually False.
            **kwargs:
                Additional legacy or debugging parameters.
        """
```
### Typical Initialization
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)

crawler = AsyncWebCrawler(config=browser_cfg)
```
**Notes**:

- Legacy parameters like `always_bypass_cache` remain for backward compatibility, but prefer setting caching via `CrawlerRunConfig.cache_mode`.
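For example, caching is now expressed per crawl rather than on the crawler itself (a small sketch; `CacheMode.ENABLED` is one of the modes crawl4ai provides):

```python
from crawl4ai import CrawlerRunConfig, CacheMode

# Control caching on the run config, not via constructor flags
run_cfg = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
```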
## 2. Lifecycle: Start/Close or Context Manager
### 2.1 Context Manager (Recommended)
```python
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")
    # The crawler automatically starts/closes resources
```
When the `async with` block ends, the crawler cleans up (closes the browser, etc.).
### 2.2 Manual Start & Close
```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")

await crawler.close()
```
Use this style if you have a long-running application or need full control over the crawler's lifecycle.
## 3. Primary Method: `arun()`
```python
async def arun(
    self,
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
) -> CrawlResult:
    ...
```
### 3.1 New Approach
You pass a `CrawlerRunConfig` object that sets up everything about the crawl: content filtering, caching, session reuse, JS code, screenshots, and more.
```python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
    print("Crawled HTML length:", len(result.cleaned_html))
    if result.screenshot:
        print("Screenshot base64 length:", len(result.screenshot))
```
### 3.2 Legacy Parameters Still Accepted
For backward compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them to a **`CrawlerRunConfig`**.
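For instance, these two calls are meant to be equivalent (a sketch; assumes an active crawler from one of the lifecycle patterns above):

```python
# Legacy style: options passed directly to arun() (still accepted)
result = await crawler.arun("https://example.com", css_selector="main.article")

# Preferred style: the same options expressed via CrawlerRunConfig
run_cfg = CrawlerRunConfig(css_selector="main.article")
result = await crawler.arun("https://example.com", config=run_cfg)
```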
## 4. Batch Processing: `arun_many()`
```python
async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
) -> List[CrawlResult]:
    """
    Process multiple URLs with intelligent rate limiting and resource monitoring.
    """
```
### 4.1 Resource-Aware Crawling
The `arun_many()` method now uses an intelligent dispatcher that:

- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently
### 4.2 Example Usage
See the Multi-URL Crawling page for detailed examples of using `arun_many()`; a minimal sketch follows below.
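A basic batch call, based on the signature above (placeholder URLs; each `CrawlResult` reports its own success or error):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
    results = await crawler.arun_many(
        urls,
        config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
    )
    for result in results:
        if result.success:
            print(result.url, "->", len(result.cleaned_html), "chars")
        else:
            print(result.url, "failed:", result.error_message)
```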
### 4.3 Key Features
1. **Rate Limiting**
- Automatic delay between requests
- Exponential backoff on rate limit detection
- Domain-specific rate limiting
- Configurable retry strategy
2. **Resource Monitoring**
- Memory usage tracking
- Adaptive concurrency based on system load
- Automatic pausing when resources are constrained
3. **Progress Monitoring**
- Detailed or aggregated progress display
- Real-time status updates
- Memory usage statistics
4. **Error Handling**
- Graceful handling of rate limits
- Automatic retries with backoff
- Detailed error reporting
---
## 5. `CrawlResult` Output
Each `arun()` returns a **`CrawlResult`** containing:
- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2`: Deprecated; use the regular `markdown` field instead.
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.
For details, see [CrawlResult doc](./crawl-result.md).
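A short sketch of inspecting these fields (this assumes the dict-style layout of `media` and `links` described in the CrawlResult doc):

```python
result = await crawler.arun("https://example.com", config=run_cfg)

if result.success:
    print("Final URL:", result.url)
    print("Markdown preview:", result.markdown[:200])
    print("Internal links:", len(result.links.get("internal", [])))
    print("Images found:", len(result.media.get("images", [])))
else:
    print("Crawl failed:", result.error_message)
```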
---
## 6. Quick Example
Below is an example hooking it all together:
```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {
                "name": "title",
                "selector": "h2",
                "type": "text"
            },
            {
                "name": "url",
                "selector": "a",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }

    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )

        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
```
**Explanation**:

- We define a `BrowserConfig` with Firefox, non-headless mode, and `verbose=True`.
- We define a `CrawlerRunConfig` that bypasses the cache, uses a CSS extraction schema, has `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
## 7. Best Practices & Migration Notes
1. Use `BrowserConfig` for global settings about the browser environment.
2. Use `CrawlerRunConfig` for per-crawl logic (caching, content filtering, extraction strategies, wait conditions).
3. Avoid legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
```python
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
result = await crawler.arun(url="...", config=run_cfg)
```
4. The context manager is the simplest option unless you want a persistent crawler across multiple calls; a sketch of the persistent pattern follows.
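If you do keep a crawler alive, you can combine manual `start()`/`close()` from section 2.2 with session reuse (a sketch; `session_id` is the `CrawlerRunConfig` parameter for reusing the same browser session across calls, and `browser_cfg` is assumed from earlier):

```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

# Reuse one browser session across several crawls
session_cfg = CrawlerRunConfig(session_id="my_session")
page1 = await crawler.arun("https://example.com/step1", config=session_cfg)
page2 = await crawler.arun("https://example.com/step2", config=session_cfg)

await crawler.close()
```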
## 8. Summary
**AsyncWebCrawler** is the entry point to asynchronous crawling:

- **Constructor** accepts `BrowserConfig` (or defaults).
- `arun()` is the main method for single-page crawls.
- `arun_many()` handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly.
**Migration**:

- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
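A before/after sketch of that migration (the legacy call shape is illustrative):

```python
# Before: everything on the constructor (legacy)
crawler = AsyncWebCrawler(browser_type="chromium", css_selector=".main-content")

# After: browser settings vs. per-crawl logic, cleanly split
browser_cfg = BrowserConfig(browser_type="chromium")
run_cfg = CrawlerRunConfig(css_selector=".main-content")

crawler = AsyncWebCrawler(config=browser_cfg)
result = await crawler.arun("https://example.com", config=run_cfg)
```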
This modular approach ensures your code is clean, scalable, and easy to maintain. For any advanced or rarely used parameters, see the BrowserConfig docs.