CrawlResult Reference
The CrawlResult class encapsulates everything returned after a single crawl operation. It provides the raw or processed content, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).
Location: crawl4ai/crawler/models.py (for reference)
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    fit_html: Optional[str] = None  # Preprocessed HTML optimized for extraction
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    mhtml: Optional[str] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    ...
Below is a field-by-field explanation and possible usage patterns.
1. Basic Crawl Info
1.1 url (str)
What: The final crawled URL (after any redirects).
Usage:
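print("Crawled URL:", result.url)  # final URL, after any redirects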
1.2 success (bool)
What: True if the crawl pipeline ended without major errors; False otherwise.
Usage:
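# Gate downstream processing on success:
if result.success:
    process(result)  # process() is a hypothetical downstream handler
else:
    print("Crawl failed:", result.error_message)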
1.3 status_code (Optional[int])
What: The page's HTTP status code (e.g., 200, 404).
Usage:
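# May be None if the crawl failed before an HTTP response (see section 9):
if result.status_code == 404:
    print("Page not found:", result.url)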
1.4 error_message (Optional[str])
What: If success=False, a textual description of the failure.
Usage:
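if not result.success:
    print(f"Error crawling {result.url}: {result.error_message}")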
1.5 session_id (Optional[str])
What: The ID used for reusing a browser context across multiple calls.
Usage:
# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)
1.6 response_headers (Optional[dict])
What: Final HTTP response headers.
Usage:
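# Header names are typically lowercased, though exact casing may vary:
if result.response_headers:
    print("Content-Type:", result.response_headers.get("content-type"))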
1.7 ssl_certificate (Optional[SSLCertificate])
What: If fetch_ssl_certificate=True in your CrawlerRunConfig, result.ssl_certificate contains an SSLCertificate object describing the site's certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like issuer, subject, valid_from, valid_until, etc.
Usage:
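A minimal sketch using the properties listed above:
if result.ssl_certificate:
    cert = result.ssl_certificate
    print("Issuer:", cert.issuer)
    print("Subject:", cert.subject)
    print("Valid:", cert.valid_from, "to", cert.valid_until)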
2. Raw / Cleaned Content
2.1 html (str)
What: The original unmodified HTML from the final page load.
Usage:
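# Archive the raw page for later re-parsing:
print("Original HTML size:", len(result.html))
with open("page.html", "w", encoding="utf-8") as f:
    f.write(result.html)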
2.2 cleaned_html (Optional[str])
What: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your CrawlerRunConfig.
Usage:
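if result.cleaned_html:
    print("Cleaned HTML preview:", result.cleaned_html[:300])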
3. Markdown Fields
3.1 The Markdown Generation Approach
Crawl4AI can convert HTML→Markdown, optionally including:
- Raw markdown
- Links as citations (with a references section)
- Fit markdown if a content filter is used (like Pruning or BM25)
MarkdownGenerationResult includes:
- raw_markdown (str): The full HTML→Markdown conversion.
- markdown_with_citations (str): Same markdown, but with link references as academic-style citations.
- references_markdown (str): The reference list or footnotes at the end.
- fit_markdown (Optional[str]): If content filtering (Pruning/BM25) was applied, the filtered "fit" text.
- fit_html (Optional[str]): The HTML that led to fit_markdown.
Usage:
if result.markdown:
    md_res = result.markdown
    print("Raw MD:", md_res.raw_markdown[:300])
    print("Citations MD:", md_res.markdown_with_citations[:300])
    print("References:", md_res.references_markdown)
    if md_res.fit_markdown:
        print("Pruned text:", md_res.fit_markdown[:300])
3.2 markdown (Optional[Union[str, MarkdownGenerationResult]])
What: Holds the MarkdownGenerationResult.
Usage:
print(result.markdown.raw_markdown[:200])
print(result.markdown.fit_markdown)
print(result.markdown.fit_html)
Note: fit_markdown/fit_html exist in result.markdown only if you used a filter (like PruningContentFilter or BM25ContentFilter) within a MarkdownGenerationStrategy.
4. Media & Links
4.1 media (Dict[str, List[Dict]])
What: Contains info about discovered images, videos, or audio. Typical keys: "images", "videos", "audios".
Common fields in each item:
- src (str): Media URL
- alt or title (str): Descriptive text
- score (float): Relevance score if the crawler's heuristic found it "important"
- desc or description (Optional[str]): Additional context extracted from surrounding text
Usage:
images = result.media.get("images", [])
for img in images:
    if img.get("score", 0) > 5:
        print("High-value image:", img["src"])
4.2 links (Dict[str, List[Dict]])
What: Holds internal and external link data. Usually two keys: "internal" and "external".
Common fields:
- href (str): The link target
- text (str): Link text
- title (str): Title attribute
- context (str): Surrounding text snippet
- domain (str): If external, the domain
Usage:
for link in result.links["internal"]:
    print(f"Internal link to {link['href']} with text {link['text']}")
5. Additional Fields
5.1 extracted_content (Optional[str])
What: If you used an extraction_strategy (CSS, LLM, etc.), the structured output (JSON).
Usage:
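import json

if result.extracted_content:
    # Shape depends on your extraction schema; often a list of dicts:
    data = json.loads(result.extracted_content)
    print("Extracted records:", len(data))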
5.2 downloaded_files (Optional[List[str]])
What: If accept_downloads=True in your BrowserConfig + downloads_path, lists local file paths for downloaded items.
Usage:
if result.downloaded_files:
    for file_path in result.downloaded_files:
        print("Downloaded:", file_path)
5.3 screenshot (Optional[str])
What: Base64-encoded screenshot if screenshot=True in CrawlerRunConfig.
Usage:
import base64

if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
5.4 pdf (Optional[bytes])
What: Raw PDF bytes if pdf=True in CrawlerRunConfig.
Usage:
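if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)  # raw bytes, written straight to disk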
5.5 mhtml (Optional[str])
What: MHTML snapshot of the page if capture_mhtml=True in CrawlerRunConfig. The MHTML (MIME HTML) format preserves the entire web page with all its resources (CSS, images, scripts, etc.) in a single file.
Usage:
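if result.mhtml:
    with open("page.mhtml", "w", encoding="utf-8") as f:
        f.write(result.mhtml)  # single-file snapshot, openable in most browsers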
5.6 metadata (Optional[dict])
What: Page-level metadata if discovered (title, description, OG data, etc.).
Usage:
if result.metadata:
    print("Title:", result.metadata.get("title"))
    print("Author:", result.metadata.get("author"))
6. dispatch_result (optional)
A DispatchResult object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via arun_many() with custom dispatchers). It contains:
- task_id: A unique identifier for the parallel task.
- memory_usage (float): The memory (in MB) used at the time of completion.
- peak_memory (float): The peak memory usage (in MB) recorded during the task's execution.
- start_time / end_time (datetime): Time range for this crawling task.
- error_message (str): Any dispatcher- or concurrency-related error encountered.
# Example usage:
for result in results:
    if result.success and result.dispatch_result:
        dr = result.dispatch_result
        print(f"URL: {result.url}, Task ID: {dr.task_id}")
        print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
        print(f"Duration: {dr.end_time - dr.start_time}")
Note: This field is typically populated when using arun_many(...) alongside a dispatcher (e.g., MemoryAdaptiveDispatcher or SemaphoreDispatcher). If no concurrency or dispatcher is used, dispatch_result may remain None.
7. Network Requests & Console Messages
When you enable network and console message capturing in CrawlerRunConfig using capture_network_requests=True and capture_console_messages=True, the CrawlResult will include the fields below.
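For example, both flags are set on the run config (a minimal sketch):
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    capture_network_requests=True,
    capture_console_messages=True,
)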
7.1 network_requests (Optional[List[Dict[str, Any]]])
What: A list of dictionaries containing information about all network requests, responses, and failures captured during the crawl.
Structure:
- Each item has an event_type field that can be "request", "response", or "request_failed".
- Request events include url, method, headers, post_data, resource_type, and is_navigation_request.
- Response events include url, status, status_text, headers, and request_timing.
- Failed request events include url, method, resource_type, and failure_text.
- All events include a timestamp field.
Usage:
if result.network_requests:
    # Count different types of events
    requests = [r for r in result.network_requests if r.get("event_type") == "request"]
    responses = [r for r in result.network_requests if r.get("event_type") == "response"]
    failures = [r for r in result.network_requests if r.get("event_type") == "request_failed"]
    print(f"Captured {len(requests)} requests, {len(responses)} responses, and {len(failures)} failures")

    # Analyze API calls
    api_calls = [r for r in requests if "api" in r.get("url", "")]

    # Identify failed resources
    for failure in failures:
        print(f"Failed to load: {failure.get('url')} - {failure.get('failure_text')}")
7.2 console_messages (Optional[List[Dict[str, Any]]])
What: A list of dictionaries containing all browser console messages captured during the crawl.
Structure:
- Each item has a type field indicating the message type (e.g., "log", "error", "warning", etc.).
- The text field contains the actual message text.
- Some messages include location information (URL, line, column).
- All messages include a timestamp field.
Usage:
if result.console_messages:
    # Count messages by type
    message_types = {}
    for msg in result.console_messages:
        msg_type = msg.get("type", "unknown")
        message_types[msg_type] = message_types.get(msg_type, 0) + 1
    print(f"Message type counts: {message_types}")

    # Display errors (which are usually most important)
    for msg in result.console_messages:
        if msg.get("type") == "error":
            print(f"Error: {msg.get('text')}")
These fields provide deep visibility into the page's network activity and browser console, which is invaluable for debugging, security analysis, and understanding complex web applications.
For more details on network and console capturing, see the Network & Console Capture documentation.
8. Example: Accessing Everything
async def handle_result(result: CrawlResult):
    if not result.success:
        print("Crawl error:", result.error_message)
        return

    # Basic info
    print("Crawled URL:", result.url)
    print("Status code:", result.status_code)

    # HTML
    print("Original HTML size:", len(result.html))
    print("Cleaned HTML size:", len(result.cleaned_html or ""))

    # Markdown output
    if result.markdown:
        print("Raw Markdown:", result.markdown.raw_markdown[:300])
        print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
        if result.markdown.fit_markdown:
            print("Fit Markdown:", result.markdown.fit_markdown[:200])

    # Media & Links
    if "images" in result.media:
        print("Image count:", len(result.media["images"]))
    if "internal" in result.links:
        print("Internal link count:", len(result.links["internal"]))

    # Extraction strategy result
    if result.extracted_content:
        print("Structured data:", result.extracted_content)

    # Screenshot/PDF/MHTML
    if result.screenshot:
        print("Screenshot length:", len(result.screenshot))
    if result.pdf:
        print("PDF bytes length:", len(result.pdf))
    if result.mhtml:
        print("MHTML length:", len(result.mhtml))

    # Network and console capturing
    if result.network_requests:
        print(f"Network requests captured: {len(result.network_requests)}")
        # Analyze request types
        req_types = {}
        for req in result.network_requests:
            if "resource_type" in req:
                req_types[req["resource_type"]] = req_types.get(req["resource_type"], 0) + 1
        print(f"Resource types: {req_types}")

    if result.console_messages:
        print(f"Console messages captured: {len(result.console_messages)}")
        # Count by message type
        msg_types = {}
        for msg in result.console_messages:
            msg_types[msg.get("type", "unknown")] = msg_types.get(msg.get("type", "unknown"), 0) + 1
        print(f"Message types: {msg_types}")
9. Key Points & Future
1. Deprecated legacy properties of CrawlResult
- markdown_v2 - Deprecated in v0.5. Just use markdown; it holds the MarkdownGenerationResult now.
- fit_markdown and fit_html - Deprecated in v0.5. They can now be accessed via the MarkdownGenerationResult in result.markdown, e.g., result.markdown.fit_markdown and result.markdown.fit_html.
2. Fit Content
- fit_markdown and fit_html appear in MarkdownGenerationResult only if you used a content filter (like PruningContentFilter or BM25ContentFilter) inside your MarkdownGenerationStrategy or set them directly (see the sketch below).
- If no filter is used, they remain None.
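A minimal sketch of wiring in a filter (the import paths below are assumptions; adjust to your installed version):
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(content_filter=PruningContentFilter())
config = CrawlerRunConfig(markdown_generator=md_generator)
# Crawls run with this config populate result.markdown.fit_markdown / fit_html.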
3. References & Citations
- If you enable link citations in your DefaultMarkdownGenerator (options={"citations": True}), you'll see markdown_with_citations plus a references_markdown block (see below). This helps large language models or academic-like referencing.
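For instance (a sketch reusing DefaultMarkdownGenerator from the previous example):
md_generator = DefaultMarkdownGenerator(options={"citations": True})
# result.markdown.markdown_with_citations and .references_markdown are then filled.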
4. Links & Media
- links["internal"] and links["external"] group discovered anchors by domain.
- media["images"] / ["videos"] / ["audios"] store extracted media elements with optional scoring or context.
5. Error Cases
- If success=False, check error_message (e.g., timeouts, invalid URLs).
- status_code might be None if we failed before an HTTP response.
Use CrawlResult to glean all final outputs and feed them into your data pipelines, AI models, or archives. With the synergy of a properly configured BrowserConfig and CrawlerRunConfig, the crawler can produce robust, structured results here in CrawlResult.