# AsyncWebCrawler

The `AsyncWebCrawler` is the core class for asynchronous web crawling in Crawl4AI. You typically create it once, optionally customize it with a `BrowserConfig` (e.g., headless, user agent), then run multiple `arun()` calls with different `CrawlerRunConfig` objects.

**Recommended usage:**

1. Create a `BrowserConfig` for global browser settings.
2. Instantiate `AsyncWebCrawler(config=browser_config)`.
3. Use the crawler in an async context manager (`async with`) or manage start/close manually.
4. Call `arun(url, config=crawler_run_config)` for each page you want (see the sketch below).
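A minimal sketch tying these four steps together; the URL and config values are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # 1. Global browser settings
    browser_cfg = BrowserConfig(headless=True)
    # Per-crawl settings (placeholder values)
    run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    # 2 & 3. Instantiate the crawler and use it as an async context manager
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # 4. One arun() call per page you want
        result = await crawler.arun("https://example.com", config=run_cfg)
        print(result.success, len(result.cleaned_html))

asyncio.run(main())
```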


## 1. Constructor Overview

```python
class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,           # deprecated
        always_by_pass_cache: Optional[bool] = None, # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy:
                (Advanced) Provide a custom crawler strategy if needed.
            config:
                A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache:
                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory:
                Folder for storing caches/logs (if relevant).
            thread_safe:
                If True, attempts some concurrency safeguards. Usually False.
            **kwargs:
                Additional legacy or debugging parameters.
        """
        ...
```

### Typical Initialization

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)

crawler = AsyncWebCrawler(config=browser_cfg)
```

**Notes:**

- Legacy parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set caching in `CrawlerRunConfig`.


## 2. Lifecycle: Start/Close or Context Manager

### 2.1 Context Manager (Recommended)

```python
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")
    # The crawler automatically starts/closes resources
```

When the `async with` block ends, the crawler cleans up (closes the browser, etc.).

### 2.2 Manual Start & Close

```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")

await crawler.close()
```

Use this style if you have a long-running application or need full control of the crawler's lifecycle.


## 3. Primary Method: `arun()`

```python
async def arun(
    self,
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
) -> CrawlResult:
    ...
```

### 3.1 New Approach

You pass a `CrawlerRunConfig` object that sets up everything about the crawl: content filtering, caching, session reuse, JS code, screenshots, etc.

```python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
    print("Crawled HTML length:", len(result.cleaned_html))
    if result.screenshot:
        print("Screenshot base64 length:", len(result.screenshot))
```

### 3.2 Legacy Parameters Still Accepted

For backward compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a `CrawlerRunConfig`.
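A brief sketch of the two styles, assuming an already-started `crawler` as in section 2 (the selector and threshold values are placeholders):

```python
from crawl4ai import CrawlerRunConfig

# Legacy style: direct keyword arguments on arun() (still accepted, but discouraged)
result = await crawler.arun(
    "https://example.com",
    css_selector="main.article",     # placeholder selector
    word_count_threshold=10,
)

# Preferred style: move the same options into a CrawlerRunConfig
run_cfg = CrawlerRunConfig(css_selector="main.article", word_count_threshold=10)
result = await crawler.arun("https://example.com", config=run_cfg)
```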


## 4. Batch Processing: `arun_many()`

```python
async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
) -> List[CrawlResult]:
    """
    Process multiple URLs with intelligent rate limiting and resource monitoring.
    """
```

### 4.1 Resource-Aware Crawling

The `arun_many()` method now uses an intelligent dispatcher that:

- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently

### 4.2 Example Usage

See the Multi-URL Crawling page for a detailed example of how to use `arun_many()`.
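As a minimal sketch (the URLs are placeholders), relying only on the `arun_many()` signature shown above:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ]
    run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        # One config applied to every URL in the batch
        results = await crawler.arun_many(urls, config=run_cfg)
        for res in results:
            print(res.url, "->", "ok" if res.success else res.error_message)

asyncio.run(main())
```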

### 4.3 Key Features

1. **Rate Limiting**

   - Automatic delay between requests
   - Exponential backoff on rate limit detection
   - Domain-specific rate limiting
   - Configurable retry strategy

2. **Resource Monitoring**

   - Memory usage tracking
   - Adaptive concurrency based on system load
   - Automatic pausing when resources are constrained

3. **Progress Monitoring**

   - Detailed or aggregated progress display
   - Real-time status updates
   - Memory usage statistics

4. **Error Handling**

   - Graceful handling of rate limits
   - Automatic retries with backoff
   - Detailed error reporting
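
The defaults handle all of this automatically. If you want to tune these behaviors, recent Crawl4AI releases accept a custom dispatcher in `arun_many()`; the sketch below is an assumption to verify against your installed version and the Multi-URL Crawling docs (the class names `MemoryAdaptiveDispatcher` and `RateLimiter`, their parameters, and the `dispatcher=` argument are not documented elsewhere on this page):

```python
# Assumed API: MemoryAdaptiveDispatcher / RateLimiter as documented for recent
# Crawl4AI releases; verify names and parameters against your installed version.
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher
from crawl4ai import RateLimiter

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,   # pause new crawls above this memory usage
    max_session_permit=10,           # cap on concurrent crawls
    rate_limiter=RateLimiter(
        base_delay=(1.0, 3.0),       # random delay range between requests
        max_delay=60.0,              # ceiling for exponential backoff
        max_retries=3,               # retries when rate limiting is detected
    ),
)

results = await crawler.arun_many(urls, config=run_cfg, dispatcher=dispatcher)
```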

---

## 5. `CrawlResult` Output

Each `arun()` returns a **`CrawlResult`** containing:

- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2`: Deprecated; use `markdown` instead.
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.

For details, see [CrawlResult doc](./crawl-result.md).
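A small sketch of reading a few of these fields (it reuses the `crawler` and `run_cfg` from the earlier examples):

```python
result = await crawler.arun("https://example.com", config=run_cfg)

if result.success:
    print("Final URL:", result.url)
    print("Raw HTML length:", len(result.html))
    print("Cleaned HTML length:", len(result.cleaned_html))
    if result.screenshot:
        print("Screenshot captured, base64 length:", len(result.screenshot))
else:
    print("Crawl failed:", result.error_message)
```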

---

## 6. Quick Example

Below is an example hooking it all together:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy
import json

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {
                "name": "title", 
                "selector": "h2", 
                "type": "text"
            },
            {
                "name": "url", 
                "selector": "a", 
                "type": "attribute", 
                "attribute": "href"
            }
        ]
    }

    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )

        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
```

**Explanation:**

- We define a `BrowserConfig` with Firefox, non-headless, and `verbose=True`.
- We define a `CrawlerRunConfig` that bypasses the cache, uses a CSS extraction schema, has a `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.


## 7. Best Practices & Migration Notes

1. Use `BrowserConfig` for global settings about the browser's environment.
2. Use `CrawlerRunConfig` for per-crawl logic (caching, content filtering, extraction strategies, wait conditions).
3. Avoid passing legacy parameters like `css_selector` or `word_count_threshold` directly to `arun()`. Instead:

```python
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
result = await crawler.arun(url="...", config=run_cfg)
```

4. Context manager usage is simplest unless you want a persistent crawler across many calls.


## 8. Summary

`AsyncWebCrawler` is your entry point to asynchronous crawling:

- The constructor accepts a `BrowserConfig` (or defaults).
- `arun(url, config=CrawlerRunConfig)` is the main method for single-page crawls.
- `arun_many(urls, config=CrawlerRunConfig)` handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly.

**Migration:**

- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.

This modular approach ensures your code is clean, scalable, and easy to maintain. For any advanced or rarely used parameters, see the BrowserConfig docs.

