Browser, Crawler & LLM Configuration (Quick Overview)

Crawl4AI's flexibility stems from three key classes:
- `BrowserConfig` – Dictates how the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
- `CrawlerRunConfig` – Dictates how each crawl operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).
- `LLMConfig` – Dictates how LLM providers are configured (model, API token, base URL, temperature, etc.).
In most examples, you create one `BrowserConfig` for the entire crawler session, then pass a fresh or re-used `CrawlerRunConfig` whenever you call `arun()`. This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the Configuration Parameters reference.
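For instance, a minimal sketch of that pattern (the URL is a placeholder):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # One BrowserConfig for the whole session...
    browser_conf = BrowserConfig(headless=True)
    # ...and a fresh CrawlerRunConfig per call
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_conf)
        print(result.markdown[:300])

asyncio.run(main())
```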
1. BrowserConfig Essentials
```python
class BrowserConfig:
    def __init__(
        self,
        browser_type="chromium",
        headless=True,
        proxy_config=None,
        viewport_width=1080,
        viewport_height=600,
        verbose=True,
        use_persistent_context=False,
        user_data_dir=None,
        cookies=None,
        headers=None,
        user_agent=None,
        text_mode=False,
        light_mode=False,
        extra_args=None,
        enable_stealth=False,
        # ... other advanced parameters omitted here
    ):
        ...
```
Key Fields to Note

- `browser_type`
  - Options: `"chromium"`, `"firefox"`, or `"webkit"`.
  - Defaults to `"chromium"`.
  - If you need a different engine, specify it here.
- `headless`
  - `True`: Runs the browser in headless mode (invisible browser).
  - `False`: Runs the browser in visible mode, which helps with debugging.
- `proxy_config`
  - A dictionary with fields like `{"server": "http://proxy.example.com:8080", "username": "...", "password": "..."}`.
  - Leave as `None` if a proxy is not required.
- `viewport_width` & `viewport_height`
  - The initial window size.
  - Some sites behave differently with smaller or bigger viewports.
- `verbose`
  - If `True`, prints extra logs.
  - Handy for debugging.
- `use_persistent_context`
  - If `True`, uses a persistent browser profile, storing cookies/local storage across runs.
  - Typically also set `user_data_dir` to point to a folder.
- `cookies` & `headers`
  - If you want to start with specific cookies or add universal HTTP headers, set them here.
  - E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
- `user_agent`
  - Custom User-Agent string. If `None`, a default is used.
  - You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).
- `text_mode` & `light_mode`
  - `text_mode=True` disables images, possibly speeding up text-only crawls.
  - `light_mode=True` turns off certain background features for performance.
- `extra_args`
  - Additional flags for the underlying browser, e.g. `["--disable-extensions"]`.
- `enable_stealth`
  - If `True`, enables stealth mode using playwright-stealth, which modifies browser fingerprints to avoid basic bot detection.
  - Default is `False`. Recommended for sites with bot protection.
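For a concrete picture of how these fields combine, here is a minimal sketch of a proxied, persistent-profile browser; the proxy address, credentials, and profile folder are placeholders:

```python
from crawl4ai import BrowserConfig

browser_conf = BrowserConfig(
    headless=True,
    proxy_config={
        "server": "http://proxy.example.com:8080",  # placeholder proxy
        "username": "proxy_user",                   # placeholder credentials
        "password": "proxy_pass",
    },
    use_persistent_context=True,           # keep cookies/local storage across runs
    user_data_dir="./my_browser_profile",  # hypothetical profile folder
    user_agent_mode="random",              # randomize the UA against bot detection
    extra_args=["--disable-extensions"],
)
```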
Helper Methods

Both configuration classes provide a `clone()` method to create modified copies:
```python
# Create a base browser config
base_browser = BrowserConfig(
    browser_type="chromium",
    headless=True,
    text_mode=True
)

# Create a visible browser config for debugging
debug_browser = base_browser.clone(
    headless=False,
    verbose=True
)
```
Minimal Example:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    browser_conf = BrowserConfig(
        browser_type="firefox",
        headless=False,
        text_mode=True
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])

asyncio.run(main())
```
2. CrawlerRunConfig Essentials
```python
class CrawlerRunConfig:
    def __init__(
        self,
        word_count_threshold=200,
        extraction_strategy=None,
        markdown_generator=None,
        cache_mode=None,
        js_code=None,
        wait_for=None,
        screenshot=False,
        pdf=False,
        capture_mhtml=False,
        # Location and Identity Parameters
        locale=None,            # e.g. "en-US", "fr-FR"
        timezone_id=None,       # e.g. "America/New_York"
        geolocation=None,       # GeolocationConfig object
        # Resource Management
        enable_rate_limiting=False,
        rate_limit_config=None,
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=20,
        display_mode=None,
        verbose=True,
        stream=False,           # Enable streaming for arun_many()
        # ... other advanced parameters omitted
    ):
        ...
```
Key Fields to Note

- `word_count_threshold`:
  - The minimum word count before a block is considered.
  - If your site has lots of short paragraphs or items, you can lower it.
- `extraction_strategy`:
  - Where you plug in JSON-based extraction (CSS, LLM, etc.).
  - If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).
- `markdown_generator`:
  - E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.
  - If `None`, a default approach is used.
- `cache_mode`:
  - Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
  - If `None`, defaults to some level of caching, or you can specify `CacheMode.ENABLED`.
- `js_code`:
  - A string or list of JS strings to execute.
  - Great for "Load More" buttons or user interactions.
- `wait_for`:
  - A CSS or JS expression to wait for before extracting content.
  - Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
- `screenshot`, `pdf`, & `capture_mhtml`:
  - If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
  - The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
- Location Parameters:
  - `locale`: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences.
  - `timezone_id`: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`).
  - `geolocation`: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)`.
- `verbose`:
  - Logs additional runtime details.
  - Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
- `enable_rate_limiting`:
  - If `True`, enables rate limiting for batch processing.
  - Requires `rate_limit_config` to be set.
- `memory_threshold_percent`:
  - The memory threshold (as a percentage) to monitor.
  - If exceeded, the crawler will pause or slow down.
- `check_interval`:
  - The interval (in seconds) to check system resources.
  - Affects how often memory and CPU usage are monitored.
- `max_session_permit`:
  - The maximum number of concurrent crawl sessions.
  - Helps prevent overwhelming the system.
- `url_matcher` & `match_mode`:
  - Enable URL-specific configurations when used with `arun_many()`.
  - Set `url_matcher` to patterns (glob, function, or list) to match specific URLs.
  - Use `match_mode` (OR/AND) to control how multiple patterns combine.
  - See URL-Specific Configurations for examples.
- `display_mode`:
  - The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
  - Affects how much information is printed during the crawl.
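Before moving on to the helper methods, here is a minimal sketch combining several of these fields; the selector and JS snippet are placeholders, not tied to any real page:

```python
from crawl4ai import CrawlerRunConfig, CacheMode

run_conf = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,   # always fetch fresh content
    js_code="window.scrollTo(0, document.body.scrollHeight);",  # e.g. trigger lazy loading
    wait_for="css:.main-loaded",   # hypothetical selector to wait for
    screenshot=True,               # base64 PNG lands in result.screenshot
    locale="en-US",                # language preference for the page
    timezone_id="America/New_York",
)
```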
Helper Methods

The `clone()` method is particularly useful for creating variations of your crawler configuration:
```python
# Create a base configuration
base_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=200,
    wait_until="networkidle"
)

# Create variations for different use cases
stream_config = base_config.clone(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

debug_config = base_config.clone(
    page_timeout=120000,  # Longer timeout for debugging
    verbose=True
)
```
The `clone()` method:

- Creates a new instance with all the same settings
- Updates only the specified parameters
- Leaves the original configuration unchanged
- Perfect for creating variations without repeating all parameters
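As a usage note, a cloned config like `stream_config` above is typically passed to `arun_many()`. A minimal sketch, assuming the streaming behavior of `arun_many()` (with `stream=True`, results arrive as an async iterator):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_stream(urls, config):
    # With stream=True on the config, results are yielded as each crawl finishes
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(urls, config=config):
            print(result.url, "OK" if result.success else result.error_message)

# e.g. asyncio.run(crawl_stream(["https://example.com"], stream_config))
```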
3. LLMConfig Essentials

Key fields to note
- `provider`:
  - Which LLM provider to use.
  - Possible values include `"ollama/llama3"`, `"groq/llama3-70b-8192"`, `"groq/llama3-8b-8192"`, `"openai/gpt-4o-mini"`, `"openai/gpt-4o"`, `"openai/o1-mini"`, `"openai/o1-preview"`, `"openai/o3-mini"`, `"openai/o3-mini-high"`, `"anthropic/claude-3-haiku-20240307"`, `"anthropic/claude-3-opus-20240229"`, `"anthropic/claude-3-sonnet-20240229"`, `"anthropic/claude-3-5-sonnet-20240620"`, `"gemini/gemini-pro"`, `"gemini/gemini-1.5-pro"`, `"gemini/gemini-2.0-flash"`, `"gemini/gemini-2.0-flash-exp"`, `"gemini/gemini-2.0-flash-lite-preview-02-05"`, `"deepseek/deepseek-chat"` (default: `"openai/gpt-4o-mini"`).
- `api_token`:
  - Optional. When not provided explicitly, it is read from environment variables based on the provider. For example, if a Gemini model is passed as the provider, `"GEMINI_API_KEY"` is read from the environment.
  - It can be the LLM provider's API token itself, e.g. `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"`.
  - Or the name of an environment variable, using the `"env:"` prefix, e.g. `api_token = "env:GROQ_API_KEY"`.
- `base_url`:
  - Custom endpoint URL, if your provider has one.
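A short sketch of the common ways to supply the token; the environment variable names follow the conventions above, and the local endpoint URL is hypothetical:

```python
from crawl4ai import LLMConfig

# Token resolved implicitly: for a Gemini provider, GEMINI_API_KEY is read
# from the environment.
gemini = LLMConfig(provider="gemini/gemini-1.5-pro")

# Token resolved from an explicitly named environment variable via "env:".
groq = LLMConfig(
    provider="groq/llama3-70b-8192",
    api_token="env:GROQ_API_KEY",
)

# base_url points the client at a custom endpoint (hypothetical URL).
custom = LLMConfig(
    provider="openai/gpt-4o-mini",
    api_token="env:OPENAI_API_KEY",
    base_url="http://localhost:8000/v1",
)
```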
4. Putting It All Together

In a typical scenario, you define one `BrowserConfig` for your crawler session, then create one or more `CrawlerRunConfig` & `LLMConfig` depending on each call's needs:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig, LLMContentFilter, DefaultMarkdownGenerator
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # 1) Browser config: headless, bigger viewport, no proxy
    browser_conf = BrowserConfig(
        headless=True,
        viewport_width=1280,
        viewport_height=720
    )

    # 2) Example extraction strategy
    schema = {
        "name": "Articles",
        "baseSelector": "div.article",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    extraction = JsonCssExtractionStrategy(schema)

    # 3) Example LLM content filtering
    gemini_config = LLMConfig(
        provider="gemini/gemini-1.5-pro",
        api_token="env:GEMINI_API_TOKEN"
    )

    # Initialize LLM filter with a specific instruction
    llm_filter = LLMContentFilter(
        llm_config=gemini_config,  # or your preferred provider
        instruction="""
        Focus on extracting the core educational content.
        Include:
        - Key concepts and explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        Format the output as clean markdown with proper code blocks and headers.
        """,
        chunk_token_threshold=500,  # Adjust based on your needs
        verbose=True
    )

    md_generator = DefaultMarkdownGenerator(
        content_filter=llm_filter,
        options={"ignore_links": True}
    )

    # 4) Crawler run config: skip cache, use extraction
    run_conf = CrawlerRunConfig(
        markdown_generator=md_generator,
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        # 5) Execute the crawl
        result = await crawler.arun(url="https://example.com/news", config=run_conf)
        if result.success:
            print("Extracted content:", result.extracted_content)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
5. Next Steps

For a detailed list of available parameters (including advanced ones), see the Configuration Parameters reference.
You can explore topics like:

- Custom Hooks & Auth (Inject JavaScript or handle login forms).
- Session Management (Re-use pages, preserve state across multiple calls).
- Magic Mode or Identity-based Crawling (Fight bot detection by simulating user behavior).
- Advanced Caching (Fine-tune read/write cache modes).
6. Conclusion

`BrowserConfig`, `CrawlerRunConfig`, and `LLMConfig` give you straightforward ways to define:

- Which browser to launch, how it should run, and any proxy or user agent needs.
- How each crawl should behave: caching, timeouts, JavaScript code, extraction strategies, etc.
- Which LLM provider to use, along with its API token, temperature, and base URL for custom endpoints.

Use them together for clear, maintainable code, and when you need more specialized behavior, check out the advanced parameters in the reference docs. Happy crawling!