Browser, Crawler & LLM Configuration (Quick Overview)

Crawl4AI's flexibility stems from three key classes:

- BrowserConfig – Dictates how the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
- CrawlerRunConfig – Dictates how each crawl operates (e.g., caching, extraction, timeouts, JavaScript code to run).
- LLMConfig – Dictates how LLM providers are configured (model, API token, base URL, temperature, etc.).

In most examples, you create one BrowserConfig for the entire crawler session, then pass a fresh or reused CrawlerRunConfig whenever you call arun(). This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see Configuration Parameters.

1. BrowserConfig Essentials

class BrowserConfig:
    def __init__(
        self,
        browser_type="chromium",
        headless=True,
        proxy_config=None,
        viewport_width=1080,
        viewport_height=600,
        verbose=True,
        use_persistent_context=False,
        user_data_dir=None,
        cookies=None,
        headers=None,
        user_agent=None,
        text_mode=False,
        light_mode=False,
        extra_args=None,
        enable_stealth=False,
        # ... other advanced parameters omitted here
    ):
        ...

Key Fields to Note

browser_type:
- Options: "chromium", "firefox", or "webkit".
- Defaults to "chromium".
- If you need a different engine, specify it here.

headless:
- True: Runs the browser in headless mode (invisible browser).
- False: Runs the browser in visible mode, which helps with debugging.

proxy_config:
- A dictionary with fields like:

{
    "server": "http://proxy.example.com:8080",
    "username": "...",
    "password": "..."
}

- Leave as None if a proxy is not required.

viewport_width & viewport_height:
- The initial window size.
- Some sites behave differently with smaller or bigger viewports.

verbose:
- If True, prints extra logs.
- Handy for debugging.

use_persistent_context:
- If True, uses a persistent browser profile, storing cookies/local storage across runs.
- Typically also set user_data_dir to point to a folder.

cookies & headers:
- If you want to start with specific cookies or add universal HTTP headers, set them here.
- E.g. cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}].

user_agent:
- Custom User-Agent string. If None, a default is used.
- You can also set user_agent_mode="random" for randomization (if you want to fight bot detection).

text_mode & light_mode:
- text_mode=True disables images, possibly speeding up text-only crawls.
- light_mode=True turns off certain background features for performance.

extra_args:
- Additional flags for the underlying browser.
- E.g. ["--disable-extensions"].

enable_stealth:
- If True, enables stealth mode using playwright-stealth.
- Modifies browser fingerprints to avoid basic bot detection.
- Default is False. Recommended for sites with bot protection.
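
Proxy settings are often handed around as a single URL, while proxy_config expects separate server, username, and password fields. The following is a minimal sketch of one way to split such a URL using only the standard library. The helper name proxy_config_from_url and the sample credentials are assumptions made for illustration; they are not part of Crawl4AI.

```python
from urllib.parse import unquote, urlsplit


def proxy_config_from_url(proxy_url: str) -> dict:
    """Split a proxy URL into server/username/password fields.

    Hypothetical helper for illustration only; not a Crawl4AI function.
    """
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not a usable proxy URL: {proxy_url!r}")

    # Rebuild the server address without any embedded credentials
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"

    config = {"server": server}
    # Credentials are optional; include them only when present
    if parts.username:
        config["username"] = unquote(parts.username)
    if parts.password:
        config["password"] = unquote(parts.password)
    return config


# Sample URL with made-up credentials
proxy_config = proxy_config_from_url("http://alice:s3cret@proxy.example.com:8080")
print(proxy_config)
```

The resulting dictionary has the same shape as the proxy_config example above, so it could presumably be passed as BrowserConfig(proxy_config=proxy_config).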

Helper Methods

BrowserConfig and CrawlerRunConfig both provide a clone() method to create modified copies:

from crawl4ai import BrowserConfig

# Create a base browser config
base_browser = BrowserConfig(
    browser_type="chromium",
    headless=True,
    text_mode=True
)

# Create a visible browser config for debugging
debug_browser = base_browser.clone(
    headless=False,
    verbose=True
)

Minimal Example:

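import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_conf = BrowserConfig(
    browser_type="firefox",
    headless=False,
    text_mode=True
)

async def main():
    # `async with` is only valid inside a coroutine, so the crawl lives in main()
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])

if __name__ == "__main__":
    asyncio.run(main())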

2. CrawlerRunConfig Essentials

class CrawlerRunConfig:
    def __init__(
        self,
        word_count_threshold=200,
        extraction_strategy=None,
        markdown_generator=None,
        cache_mode=None,
        js_code=None,
        wait_for=None,
        screenshot=False,
        pdf=False,
        capture_mhtml=False,
        # Location and Identity Parameters
        locale=None,            # e.g. "en-US", "fr-FR"
        timezone_id=None,       # e.g. "America/New_York"
        geolocation=None,       # GeolocationConfig object
        # Resource Management
        enable_rate_limiting=False,
        rate_limit_config=None,
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=20,
        display_mode=None,
        verbose=True,
        stream=False,  # Enable streaming for arun_many()
        # ... other advanced parameters omitted
    ):
        ...

Key Fields to Note

word_count_threshold:
- The minimum word count before a block is considered.
- If your site has lots of short paragraphs or items, you can lower it.

extraction_strategy:
- Where you plug in JSON-based extraction (CSS, LLM, etc.).
- If None, no structured extraction is done (only raw/cleaned HTML + markdown).

markdown_generator:
- E.g., DefaultMarkdownGenerator(...), controlling how HTML→Markdown conversion is done.
- If None, a default approach is used.

cache_mode:
- Controls caching behavior (ENABLED, BYPASS, DISABLED, etc.).
- If None, a default caching behavior is used; specify CacheMode.ENABLED explicitly if you rely on caching.

js_code:
- A string or list of JS strings to execute.
- Great for "Load More" buttons or user interactions.

wait_for:
- A CSS or JS expression to wait for before extracting content.
- Common usage: wait_for="css:.main-loaded" or wait_for="js:() => window.loaded === true".

screenshot, pdf, & capture_mhtml:
- If True, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- The results go to result.screenshot (base64), result.pdf (bytes), or result.mhtml (string).

Location Parameters:
- locale: Browser's locale (e.g., "en-US", "fr-FR") for language preferences.
- timezone_id: Browser's timezone (e.g., "America/New_York", "Europe/Paris").
- geolocation: GPS coordinates via GeolocationConfig(latitude=48.8566, longitude=2.3522).
- See Identity Based Crawling.

verbose:
- Logs additional runtime details.
- Overlaps with the browser's verbosity if also set to True in BrowserConfig.

enable_rate_limiting:
- If True, enables rate limiting for batch processing.
- Requires rate_limit_config to be set.

memory_threshold_percent:
- The memory threshold (as a percentage) to monitor.
- If exceeded, the crawler will pause or slow down.

check_interval:
- The interval (in seconds) to check system resources.
- Affects how often memory and CPU usage are monitored.

max_session_permit:
- The maximum number of concurrent crawl sessions.
- Helps prevent overwhelming the system.

url_matcher & match_mode:
- Enable URL-specific configurations when used with arun_many().
- Set url_matcher to patterns (glob, function, or list) to match specific URLs.
- Use match_mode (OR/AND) to control how multiple patterns combine.
- See URL-Specific Configurations for examples.

display_mode:
- The display mode for progress information (DETAILED, BRIEF, etc.).
- Affects how much information is printed during the crawl.
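
To make the OR/AND combination of url_matcher patterns concrete, here is a small standard-library sketch of how glob patterns and predicate functions could be combined. It illustrates the idea only: the function url_matches and its "or"/"and" strings are assumptions for this example, and Crawl4AI's own matching logic may differ in its details.

```python
from fnmatch import fnmatchcase
from typing import Callable, Sequence, Union

# A matcher is either a glob pattern or a predicate taking the URL
Matcher = Union[str, Callable[[str], bool]]


def url_matches(url: str, matchers: Union[Matcher, Sequence[Matcher]], mode: str = "or") -> bool:
    """Check a URL against one or more matchers, combined with OR or AND.

    Illustrative only; not Crawl4AI's implementation.
    """
    # Accept a single matcher as well as a list of them
    if isinstance(matchers, str) or callable(matchers):
        matchers = [matchers]

    # Functions are called with the URL; strings are treated as glob patterns
    results = [m(url) if callable(m) else fnmatchcase(url, m) for m in matchers]
    return all(results) if mode == "and" else any(results)


url = "https://example.com/docs/guide.pdf"
print(url_matches(url, ["*.pdf", "*/blog/*"], mode="or"))
print(url_matches(url, ["*.pdf", "*/blog/*"], mode="and"))
```

With OR, a single matching pattern is enough; with AND, every pattern and predicate has to accept the URL.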

Helper Methods

The clone() method is particularly useful for creating variations of your crawler configuration:

from crawl4ai import CrawlerRunConfig, CacheMode

# Create a base configuration
base_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=200,
    wait_until="networkidle"
)

# Create variations for different use cases
stream_config = base_config.clone(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

debug_config = base_config.clone(
    page_timeout=120000,  # Longer timeout for debugging
    verbose=True
)

The clone() method:
- Creates a new instance with all the same settings
- Updates only the specified parameters
- Leaves the original configuration unchanged
- Is well suited to creating variations without repeating every parameter
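
The copy-with-overrides behavior described for clone() can be pictured with a plain dataclass. This is a sketch of the pattern, not Crawl4AI's code: RunSettings is a hypothetical stand-in class, and its string-valued cache_mode is a simplification of the real CacheMode enum.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunSettings:
    """Hypothetical stand-in for a config object; not a Crawl4AI class."""
    cache_mode: str = "enabled"
    word_count_threshold: int = 200
    stream: bool = False
    verbose: bool = True

    def clone(self, **overrides) -> "RunSettings":
        # replace() builds a new instance and copies every field not overridden
        return replace(self, **overrides)


base = RunSettings()
streaming = base.clone(stream=True, cache_mode="bypass")

print(base)       # the original keeps its settings
print(streaming)  # only stream and cache_mode differ
```

Because the clone is a separate object, a shared base configuration can be kept in one place and specialized per call without side effects.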

3. LLMConfig Essentials

Key Fields to Note

provider:
- Which LLM provider to use.
- Possible values are "ollama/llama3", "groq/llama3-70b-8192", "groq/llama3-8b-8192", "openai/gpt-4o-mini", "openai/gpt-4o", "openai/o1-mini", "openai/o1-preview", "openai/o3-mini", "openai/o3-mini-high", "anthropic/claude-3-haiku-20240307", "anthropic/claude-3-opus-20240229", "anthropic/claude-3-sonnet-20240229", "anthropic/claude-3-5-sonnet-20240620", "gemini/gemini-pro", "gemini/gemini-1.5-pro", "gemini/gemini-2.0-flash", "gemini/gemini-2.0-flash-exp", "gemini/gemini-2.0-flash-lite-preview-02-05", "deepseek/deepseek-chat".
- Default: "openai/gpt-4o-mini".

api_token:
- Optional. When not provided explicitly, api_token is read from an environment variable based on the provider. For example, if a Gemini model is passed as provider, "GEMINI_API_KEY" is read from the environment.
- Can be the API token of the LLM provider itself, e.g. api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv".
- Can reference an environment variable with the prefix "env:", e.g. api_token = "env:GROQ_API_KEY".

base_url:
- Set this if your provider has a custom endpoint.

import os
from crawl4ai import LLMConfig

llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
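
The three ways of supplying api_token (a literal value, an "env:" reference, or nothing at all) can be sketched as a small resolver. This is an illustration of the described behavior under stated assumptions, not the library's implementation: resolve_api_token, its fallback_env_var parameter, and the demo token value are all hypothetical.

```python
import os
from typing import Optional


def resolve_api_token(api_token: Optional[str], fallback_env_var: str) -> Optional[str]:
    """Resolve a token given literally, as "env:NAME", or not at all.

    Hypothetical helper for illustration only; not a Crawl4AI function.
    """
    if api_token is None:
        # Nothing passed explicitly: fall back to the provider's usual variable
        return os.environ.get(fallback_env_var)
    if api_token.startswith("env:"):
        # Strip the prefix (and stray spaces), then look the name up
        return os.environ.get(api_token[len("env:"):].strip())
    # Anything else is treated as the token itself
    return api_token


os.environ["GROQ_API_KEY"] = "demo-token"  # demo value only, not a real key
print(resolve_api_token("env:GROQ_API_KEY", "GROQ_API_KEY"))
```

Referencing a variable with "env:" keeps real keys out of source files, which is usually preferable to pasting a token into the configuration.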

4. Putting It All Together

In a typical scenario, you define one BrowserConfig for your crawler session, then create one or more CrawlerRunConfig and LLMConfig objects depending on each call's needs:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig, LLMContentFilter, DefaultMarkdownGenerator
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # 1) Browser config: headless, bigger viewport, no proxy
    browser_conf = BrowserConfig(
        headless=True,
        viewport_width=1280,
        viewport_height=720
    )

    # 2) Example extraction strategy
    schema = {
        "name": "Articles",
        "baseSelector": "div.article",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    extraction = JsonCssExtractionStrategy(schema)

    # 3) Example LLM content filtering

    gemini_config = LLMConfig(
        provider="gemini/gemini-1.5-pro", 
        api_token = "env:GEMINI_API_TOKEN"
    )

    # Initialize LLM filter with specific instruction
    filter = LLMContentFilter(
        llm_config=gemini_config,  # or your preferred provider
        instruction="""
        Focus on extracting the core educational content.
        Include:
        - Key concepts and explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        Format the output as clean markdown with proper code blocks and headers.
        """,
        chunk_token_threshold=500,  # Adjust based on your needs
        verbose=True
    )

    md_generator = DefaultMarkdownGenerator(
        content_filter=filter,
        options={"ignore_links": True}
    )

    # 4) Crawler run config: skip cache, use extraction
    run_conf = CrawlerRunConfig(
        markdown_generator=md_generator,
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        # 5) Execute the crawl
        result = await crawler.arun(url="https://example.com/news", config=run_conf)

        if result.success:
            print("Extracted content:", result.extracted_content)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
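
Once the crawl succeeds, the extracted content still has to be decoded before use. The sketch below assumes that result.extracted_content is a JSON string holding a list of objects with the title and link fields from the schema above; the sample payload is made up for the demonstration, and parse_articles is a hypothetical helper.

```python
import json

# Made-up sample standing in for result.extracted_content
sample_extracted_content = json.dumps([
    {"title": "First headline", "link": "/news/1"},
    {"title": "Second headline", "link": "/news/2"},
    {"title": "", "link": "/news/3"},  # incomplete row, dropped below
])


def parse_articles(extracted_content):
    """Decode the JSON payload and keep rows that have both schema fields."""
    if not extracted_content:
        # Nothing was extracted (None or empty string)
        return []
    rows = json.loads(extracted_content)
    return [row for row in rows if row.get("title") and row.get("link")]


for article in parse_articles(sample_extracted_content):
    print(f'{article["title"]} -> {article["link"]}')
```

In the script above, the same helper could be called as parse_articles(result.extracted_content) inside the result.success branch.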

5. Next Steps

For a detailed list of available parameters (including advanced ones), see:

- BrowserConfig, CrawlerRunConfig & LLMConfig Reference

You can explore topics like:

- Custom Hooks & Auth (inject JavaScript or handle login forms).
- Session Management (reuse pages, preserve state across multiple calls).
- Magic Mode or Identity-based Crawling (fight bot detection by simulating user behavior).
- Advanced Caching (fine-tune read/write cache modes).

6. Conclusion

BrowserConfig, CrawlerRunConfig and LLMConfig give you straightforward ways to define:

- Which browser to launch, how it should run, and any proxy or user agent needs.
- How each crawl should behave: caching, timeouts, JavaScript code, extraction strategies, etc.
- Which LLM provider to use, along with its API token, temperature, and base URL for custom endpoints.

Use them together for clear, maintainable code, and when you need more specialized behavior, check out the advanced parameters in the reference docs. Happy crawling!