# 1. BrowserConfig – Controlling the Browser

`BrowserConfig` focuses on how the browser is launched and behaves. This includes headless mode, proxies, user agents, and other environment tweaks.
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    viewport_width=1280,
    viewport_height=720,
    proxy="http://user:pass@proxy:8080",
    user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
)
```
## 1.1 Parameter Highlights
| Parameter | Type / Default | What It Does |
|---|---|---|
| `browser_type` | `"chromium"`, `"firefox"`, `"webkit"` (default: `"chromium"`) | Which browser engine to use. `"chromium"` is typical for many sites, `"firefox"` or `"webkit"` for specialized tests. |
| `headless` | `bool` (default: `True`) | Headless means no visible UI. `False` is handy for debugging. |
| `viewport_width` | `int` (default: `1080`) | Initial page width (in px). Useful for testing responsive layouts. |
| `viewport_height` | `int` (default: `600`) | Initial page height (in px). |
| `proxy` | `str` (default: `None`) | Single-proxy URL if you want all traffic to go through it, e.g. `"http://user:pass@proxy:8080"`. |
| `proxy_config` | `dict` (default: `None`) | For advanced or multi-proxy needs, specify details like `{"server": "...", "username": "...", ...}`. |
| `use_persistent_context` | `bool` (default: `False`) | If `True`, uses a persistent browser context (keeps cookies and sessions across runs). Also sets `use_managed_browser=True`. |
| `user_data_dir` | `str` or `None` (default: `None`) | Directory to store user data (profiles, cookies). Must be set if you want permanent sessions. |
| `ignore_https_errors` | `bool` (default: `True`) | If `True`, continues despite invalid certificates (common in dev/staging). |
| `java_script_enabled` | `bool` (default: `True`) | Disable if you want no JS overhead, or if only static content is needed. |
| `cookies` | `list` (default: `[]`) | Pre-set cookies, each a dict like `{"name": "session", "value": "...", "url": "..."}`. |
| `headers` | `dict` (default: `{}`) | Extra HTTP headers for every request, e.g. `{"Accept-Language": "en-US"}`. |
| `user_agent` | `str` (default: Chrome-based UA) | Your custom or random user agent. `user_agent_mode="random"` can shuffle it. |
| `light_mode` | `bool` (default: `False`) | Disables some background features for performance gains. |
| `text_mode` | `bool` (default: `False`) | If `True`, tries to disable images and other heavy content for speed. |
| `use_managed_browser` | `bool` (default: `False`) | For advanced "managed" interactions (debugging, CDP usage). Typically set automatically if persistent context is on. |
| `extra_args` | `list` (default: `[]`) | Additional flags for the underlying browser process, e.g. `["--disable-extensions"]`. |
**Tips:**

- Set `headless=False` to visually debug how pages load or how interactions proceed.
- If you need authentication storage or repeated sessions, consider `use_persistent_context=True` and specify `user_data_dir`.
- For large pages, you might need a bigger `viewport_width` and `viewport_height` to handle dynamic content.
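For instance, a minimal sketch of a persistent-session setup using only the parameters above (the profile directory path is just an illustration):

```python
from crawl4ai import BrowserConfig

persistent_cfg = BrowserConfig(
    headless=False,                # show the browser window while debugging
    use_persistent_context=True,   # keep cookies/sessions across runs
    user_data_dir="./my_profile",  # illustrative path where profile data is stored
)
```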
# 2. CrawlerRunConfig – Controlling Each Crawl

While `BrowserConfig` sets up the environment, `CrawlerRunConfig` details how each crawl operation should behave: caching, content filtering, link or domain blocking, timeouts, JavaScript code, etc.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

run_cfg = CrawlerRunConfig(
    wait_for="css:.main-content",
    word_count_threshold=15,
    excluded_tags=["nav", "footer"],
    exclude_external_links=True,
    stream=True,  # Enable streaming for arun_many()
)
```
## 2.1 Parameter Highlights

We group them by category.
### A) Content Processing
| Parameter | Type / Default | What It Does |
|---|---|---|
| `word_count_threshold` | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. |
| `extraction_strategy` | `ExtractionStrategy` (default: `None`) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
| `markdown_generator` | `MarkdownGenerationStrategy` (default: `None`) | If you want specialized markdown output (citations, filtering, chunking, etc.). Can be customized with options such as the `content_source` parameter to select the HTML input source (`'cleaned_html'`, `'raw_html'`, or `'fit_html'`). |
| `css_selector` | `str` (default: `None`) | Retains only the part of the page matching this selector. Affects the entire extraction process. |
| `target_elements` | `List[str]` (default: `None`) | List of CSS selectors for elements to focus on for markdown generation and data extraction, while still processing the entire page for links, media, etc. Provides more flexibility than `css_selector`. |
| `excluded_tags` | `list` (default: `None`) | Removes entire tags (e.g. `["script", "style"]`). |
| `excluded_selector` | `str` (default: `None`) | Like `css_selector` but for exclusion. E.g. `"#ads, .tracker"`. |
| `only_text` | `bool` (default: `False`) | If `True`, tries to extract text-only content. |
| `prettiify` | `bool` (default: `False`) | If `True`, beautifies the final HTML (slower, purely cosmetic). |
| `keep_data_attributes` | `bool` (default: `False`) | If `True`, preserves `data-*` attributes in the cleaned HTML. |
| `remove_forms` | `bool` (default: `False`) | If `True`, removes all `<form>` elements. |
### B) Caching & Session
| Parameter | Type / Default | What It Does |
|---|---|---|
| `cache_mode` | `CacheMode` or `None` (default: `None`) | Controls how caching is handled (`ENABLED`, `BYPASS`, `DISABLED`, etc.). If `None`, typically defaults to `ENABLED`. |
| `session_id` | `str` or `None` (default: `None`) | Assign a unique ID to reuse a single browser session across multiple `arun()` calls. |
| `bypass_cache` | `bool` (default: `False`) | If `True`, acts like `CacheMode.BYPASS`. |
| `disable_cache` | `bool` (default: `False`) | If `True`, acts like `CacheMode.DISABLED`. |
| `no_cache_read` | `bool` (default: `False`) | If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads). |
| `no_cache_write` | `bool` (default: `False`) | If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes). |
Use these for controlling whether you read or write from a local content cache. Handy for large batch crawls or repeated site visits.
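For example, a minimal sketch that enables the cache on a first pass and then re-crawls the same session while bypassing it:

```python
from crawl4ai import CrawlerRunConfig, CacheMode

# First run: read/write the local cache as usual
cached_cfg = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

# Later run: skip the cache but reuse the same browser session
fresh_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    session_id="my_session",  # same ID across arun() calls reuses one tab
)
```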
### C) Page Navigation & Timing
| Parameter | Type / Default | What It Does |
|---|---|---|
| `wait_until` | `str` (default: `"domcontentloaded"`) | Condition for navigation to "complete". Often `"networkidle"` or `"domcontentloaded"`. |
| `page_timeout` | `int` (default: `60000` ms) | Timeout for page navigation or JS steps. Increase for slow sites. |
| `wait_for` | `str` or `None` (default: `None`) | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction. |
| `wait_for_images` | `bool` (default: `False`) | Wait for images to load before finishing. Slows things down if you only want text. |
| `delay_before_return_html` | `float` (default: `0.1`) | Additional pause (seconds) before the final HTML is captured. Good for last-second updates. |
| `check_robots_txt` | `bool` (default: `False`) | Whether to check and respect robots.txt rules before crawling. If `True`, caches robots.txt for efficiency. |
| `mean_delay` and `max_range` | `float` (default: `0.1`, `0.3`) | If you call `arun_many()`, these define random delay intervals between crawls, helping avoid detection or rate limits. |
| `semaphore_count` | `int` (default: `5`) | Max concurrency for `arun_many()`. Increase if you have resources for parallel crawls. |
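As a rough sketch of tuning a batch crawl for slow sites (the specific values are illustrative, not recommendations):

```python
from crawl4ai import CrawlerRunConfig

batch_cfg = CrawlerRunConfig(
    wait_until="networkidle",  # consider navigation done once network settles
    page_timeout=120_000,      # 120 s for slow pages (value is in ms)
    mean_delay=0.5,            # random delays between arun_many() crawls
    max_range=1.0,             #   spread of that random delay
    semaphore_count=3,         # crawl at most 3 pages in parallel
)
```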
### D) Page Interaction
| Parameter | Type / Default | What It Does |
|---|---|---|
| `js_code` | `str` or `list[str]` (default: `None`) | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`. |
| `js_only` | `bool` (default: `False`) | If `True`, indicates we're reusing an existing session and only applying JS. No full reload. |
| `ignore_body_visibility` | `bool` (default: `True`) | Skip checking if `<body>` is visible. Usually best to keep `True`. |
| `scan_full_page` | `bool` (default: `False`) | If `True`, auto-scroll the page to load dynamic content (infinite scroll). |
| `scroll_delay` | `float` (default: `0.2`) | Delay between scroll steps if `scan_full_page=True`. |
| `process_iframes` | `bool` (default: `False`) | Inlines iframe content for single-page extraction. |
| `remove_overlay_elements` | `bool` (default: `False`) | Removes potential modals/popups blocking the main content. |
| `simulate_user` | `bool` (default: `False`) | Simulate user interactions (mouse movements) to avoid bot detection. |
| `override_navigator` | `bool` (default: `False`) | Override `navigator` properties in JS for stealth. |
| `magic` | `bool` (default: `False`) | Automatic handling of popups/consent banners. Experimental. |
| `adjust_viewport_to_content` | `bool` (default: `False`) | Resizes the viewport to match the page content height. |
If your page is a single-page app with repeated JS updates, set `js_only=True` in subsequent calls, plus a `session_id` for reusing the same tab.
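A minimal sketch of that pattern (the URL and selectors are placeholders):

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_spa():
    async with AsyncWebCrawler() as crawler:
        # First call: full page load in a named session
        first = await crawler.arun(
            url="https://example.com/app",
            config=CrawlerRunConfig(session_id="spa_session"),
        )
        # Later call: same tab, only run JS and re-extract (no reload)
        more = await crawler.arun(
            url="https://example.com/app",
            config=CrawlerRunConfig(
                session_id="spa_session",
                js_only=True,
                js_code="document.querySelector('.load-more')?.click();",
                wait_for="css:.new-items",
            ),
        )
        print(first.success, more.success)
```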
### E) Media Handling
| Parameter | Type / Default | What It Does |
|---|---|---|
| `screenshot` | `bool` (default: `False`) | Capture a screenshot (base64) in `result.screenshot`. |
| `screenshot_wait_for` | `float` or `None` (default: `None`) | Extra wait time before the screenshot. |
| `screenshot_height_threshold` | `int` (default: ~20000) | If the page is taller than this, alternate screenshot strategies are used. |
| `pdf` | `bool` (default: `False`) | If `True`, returns a PDF in `result.pdf`. |
| `capture_mhtml` | `bool` (default: `False`) | If `True`, captures an MHTML snapshot of the page in `result.mhtml`. MHTML includes all page resources (CSS, images, etc.) in a single file. |
| `image_description_min_word_threshold` | `int` (default: ~50) | Minimum words for an image's alt text or description to be considered valid. |
| `image_score_threshold` | `int` (default: ~3) | Filters out low-scoring images. The crawler scores images by relevance (size, context, etc.). |
| `exclude_external_images` | `bool` (default: `False`) | Exclude images from other domains. |
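For instance, a small sketch that requests several artifacts in one run:

```python
from crawl4ai import CrawlerRunConfig

media_cfg = CrawlerRunConfig(
    screenshot=True,              # base64 screenshot in result.screenshot
    pdf=True,                     # PDF data in result.pdf
    capture_mhtml=True,           # single-file snapshot in result.mhtml
    exclude_external_images=True, # drop images hosted on other domains
)
```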
### F) Link/Domain Handling
| Parameter | Type / Default | What It Does |
|---|---|---|
| `exclude_social_media_domains` | `list` (default includes Facebook, Twitter, etc.) | A default list that can be extended. Any link to these domains is removed from the final output. |
| `exclude_external_links` | `bool` (default: `False`) | Removes all links pointing outside the current domain. |
| `exclude_social_media_links` | `bool` (default: `False`) | Strips links specifically to social sites (like Facebook or Twitter). |
| `exclude_domains` | `list` (default: `[]`) | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`). |
Use these for link-level content filtering (often to keep crawls "internal" or to remove spammy domains).
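A brief sketch (the domain names are placeholders):

```python
from crawl4ai import CrawlerRunConfig

link_cfg = CrawlerRunConfig(
    exclude_external_links=True,                 # keep the crawl internal
    exclude_social_media_links=True,             # drop links to social sites
    exclude_domains=["ads.com", "trackers.io"],  # plus a custom blocklist
)
```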
### G) Debug & Logging
| Parameter | Type / Default | What It Does |
|---|---|---|
| `verbose` | `bool` (default: `True`) | Prints logs detailing each step of crawling, interactions, or errors. |
| `log_console` | `bool` (default: `False`) | Logs the page's JavaScript console output if you want deeper JS debugging. |
### H) Virtual Scroll Configuration
| Parameter | Type / Default | What It Does |
|---|---|---|
| `virtual_scroll_config` | `VirtualScrollConfig` or `dict` (default: `None`) | Configuration for handling virtualized scrolling on sites like Twitter/Instagram where content is replaced rather than appended. |
When sites use virtual scrolling (content replaced as you scroll), use `VirtualScrollConfig`:
```python
from crawl4ai import CrawlerRunConfig, VirtualScrollConfig

virtual_config = VirtualScrollConfig(
    container_selector="#timeline",  # CSS selector for the scrollable container
    scroll_count=30,                 # Number of times to scroll
    scroll_by="container_height",    # How much to scroll: "container_height", "page_height", or pixels (e.g. 500)
    wait_after_scroll=0.5            # Seconds to wait after each scroll for content to load
)

config = CrawlerRunConfig(
    virtual_scroll_config=virtual_config
)
```
**VirtualScrollConfig Parameters:**
| Parameter | Type / Default | What It Does |
|---|---|---|
| `container_selector` | `str` (required) | CSS selector for the scrollable container (e.g., `"#feed"`, `".timeline"`). |
| `scroll_count` | `int` (default: `10`) | Maximum number of scrolls to perform. |
| `scroll_by` | `str` or `int` (default: `"container_height"`) | Scroll amount: `"container_height"`, `"page_height"`, or pixels (e.g., `500`). |
| `wait_after_scroll` | `float` (default: `0.5`) | Time in seconds to wait after each scroll for new content to load. |
**When to use Virtual Scroll vs. `scan_full_page`:**

- Use `virtual_scroll_config` when content is **replaced** during scroll (Twitter, Instagram).
- Use `scan_full_page` when content is **appended** during scroll (traditional infinite scroll).
See the Virtual Scroll documentation for detailed examples.
### I) URL Matching Configuration
| Parameter | Type / Default | What It Does |
|---|---|---|
| `url_matcher` | `UrlMatcher` (default: `None`) | Pattern(s) to match URLs against. Can be a string (glob), a function, or a list of mixed types. `None` means match ALL URLs. |
| `match_mode` | `MatchMode` (default: `MatchMode.OR`) | How to combine multiple matchers in a list: `MatchMode.OR` (any match) or `MatchMode.AND` (all must match). |
The `url_matcher` parameter enables URL-specific configurations when used with `arun_many()`:
```python
from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Simple string pattern (glob-style)
pdf_config = CrawlerRunConfig(
    url_matcher="*.pdf",
    scraping_strategy=PDFContentScrapingStrategy()
)

# Multiple patterns with OR logic (default)
blog_config = CrawlerRunConfig(
    url_matcher=["*/blog/*", "*/article/*", "*/news/*"],
    match_mode=MatchMode.OR  # Any pattern matches
)

# Function matcher
api_config = CrawlerRunConfig(
    url_matcher=lambda url: 'api' in url or url.endswith('.json'),
    # Other settings like extraction_strategy
)

# Mixed: string + function with AND logic
complex_config = CrawlerRunConfig(
    url_matcher=[
        lambda url: url.startswith('https://'),  # Must be HTTPS
        "*.org/*",                               # Must be a .org domain
        lambda url: 'docs' in url                # Must contain 'docs'
    ],
    match_mode=MatchMode.AND  # ALL conditions must match
)

# Combined patterns and functions with AND logic
secure_docs = CrawlerRunConfig(
    url_matcher=["https://*", lambda url: '.doc' in url],
    match_mode=MatchMode.AND  # Must be HTTPS AND contain .doc
)

# Default config - matches ALL URLs
default_config = CrawlerRunConfig()  # No url_matcher = matches everything
```
**UrlMatcher Types:**

- `None` (default): When `url_matcher` is `None` or not set, the config matches ALL URLs.
- String patterns: Glob-style patterns like `"*.pdf"`, `"*/api/*"`, `"https://*.example.com/*"`.
- Functions: `lambda url: bool` – custom logic for complex matching.
- Lists: Mix strings and functions, combined with `MatchMode.OR` or `MatchMode.AND`.
**Important Behavior:**

- When passing a list of configs to `arun_many()`, URLs are matched against each config's `url_matcher` in order. First match wins!
- If no config matches a URL and there's no default config (one without `url_matcher`), the URL will fail with "No matching configuration found".
- Always include a default config as the last item if you want to handle all URLs.
---

## 2.2 Helper Methods

Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies:
```python
from crawl4ai import CrawlerRunConfig, CacheMode

# Create a base configuration
base_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=200
)

# Create variations using clone()
stream_config = base_config.clone(stream=True)

no_cache_config = base_config.clone(
    cache_mode=CacheMode.BYPASS,
    stream=True
)
```
The `clone()` method is particularly useful when you need slightly different configurations for different use cases, without modifying the original config.
## 2.3 Example Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Configure the browser
    browser_cfg = BrowserConfig(
        headless=False,
        viewport_width=1280,
        viewport_height=720,
        proxy="http://user:pass@myproxy:8080",
        text_mode=True
    )

    # Configure the run
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        session_id="my_session",
        css_selector="main.article",
        excluded_tags=["script", "style"],
        exclude_external_links=True,
        wait_for="css:.article-loaded",
        screenshot=True,
        stream=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/news",
            config=run_cfg
        )
        if result.success:
            print("Final cleaned_html length:", len(result.cleaned_html))
            if result.screenshot:
                print("Screenshot captured (base64, length):", len(result.screenshot))
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
## 2.4 Compliance & Ethics
| Parameter | Type / Default | What It Does |
|---|---|---|
| `check_robots_txt` | `bool` (default: `False`) | When `True`, checks and respects robots.txt rules before crawling. Uses efficient caching with a SQLite backend. |
| `user_agent` | `str` (default: `None`) | User agent string to identify your crawler. Used for robots.txt checking when enabled. |
```python
run_config = CrawlerRunConfig(
    check_robots_txt=True,   # Enable robots.txt compliance
    user_agent="MyBot/1.0"   # Identify your crawler
)
```
# 3. LLMConfig – Setting up LLM Providers

`LLMConfig` is used to pass LLM provider configuration to strategies and functions that rely on LLMs for extraction, filtering, schema generation, etc. Currently it can be used in the following:
- `LLMExtractionStrategy`
- `LLMContentFilter`
- `JsonCssExtractionStrategy.generate_schema`
- `JsonXPathExtractionStrategy.generate_schema`
## 3.1 Parameters
| Parameter | Type / Default | What It Does |
|---|---|---|
| `provider` | `"ollama/llama3"`, `"groq/llama3-70b-8192"`, `"groq/llama3-8b-8192"`, `"openai/gpt-4o-mini"`, `"openai/gpt-4o"`, `"openai/o1-mini"`, `"openai/o1-preview"`, `"openai/o3-mini"`, `"openai/o3-mini-high"`, `"anthropic/claude-3-haiku-20240307"`, `"anthropic/claude-3-opus-20240229"`, `"anthropic/claude-3-sonnet-20240229"`, `"anthropic/claude-3-5-sonnet-20240620"`, `"gemini/gemini-pro"`, `"gemini/gemini-1.5-pro"`, `"gemini/gemini-2.0-flash"`, `"gemini/gemini-2.0-flash-exp"`, `"gemini/gemini-2.0-flash-lite-preview-02-05"`, `"deepseek/deepseek-chat"` (default: `"openai/gpt-4o-mini"`) | Which LLM provider to use. |
| `api_token` | Optional. 1. When not provided explicitly, `api_token` is read from environment variables based on the provider; e.g. if a Gemini model is passed as the provider, `"GEMINI_API_KEY"` is read from the environment. 2. The LLM provider's API token, e.g. `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"`. 3. An environment variable reference using the `"env:"` prefix, e.g. `api_token = "env: GROQ_API_KEY"` | API token to use for the given provider. |
| `base_url` | Optional. Custom API endpoint | If your provider has a custom endpoint. |
## 3.2 Example Usage
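A minimal sketch of wiring `LLMConfig` into an extraction strategy; the `llm_config` keyword, the `instruction` argument, and the environment-variable name are assumptions about typical Crawl4AI usage rather than details taken from this section:

```python
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

llm_cfg = LLMConfig(
    provider="openai/gpt-4o-mini",
    api_token="env:OPENAI_API_KEY",  # assumed env-var name, read via the "env:" prefix
)

# Assumed keyword arguments for illustration only
extraction = LLMExtractionStrategy(
    llm_config=llm_cfg,
    instruction="Extract the article title and a one-sentence summary.",
)
```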
# 4. Putting It All Together
- Use `BrowserConfig` for global browser settings: engine, headless, proxy, user agent.
- Use `CrawlerRunConfig` for each crawl's context: how to filter content, handle caching, wait for dynamic elements, or run JS.
- Pass both configs to `AsyncWebCrawler` (the `BrowserConfig`) and then to `arun()` (the `CrawlerRunConfig`).
- Use `LLMConfig` for LLM provider configuration that can be reused across all extraction, filtering, and schema generation tasks. It can be used in `LLMExtractionStrategy`, `LLMContentFilter`, `JsonCssExtractionStrategy.generate_schema` & `JsonXPathExtractionStrategy.generate_schema`.