CrawlResult Reference

The CrawlResult class encapsulates everything returned after a single crawl operation. It provides the raw or processed content, details on links and media, plus optional outputs (such as screenshots, PDFs, or extracted JSON).

Location: crawl4ai/crawler/models.py (for reference)

class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    fit_html: Optional[str] = None  # Preprocessed HTML optimized for extraction
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    mhtml: Optional[str] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    ...

Below is a field-by-field explanation with possible usage patterns.

1. Basic Crawl Info

1.1 `url` (str)

What: The final crawled URL (after any redirects).
Usage:

print(result.url)  # e.g., "https://example.com/"

1.2 `success` (bool)

What: True if the crawl pipeline ended without major errors; False otherwise.
Usage:

if not result.success:
    print(f"Crawl failed: {result.error_message}")

1.3 `status_code` (Optional[int])

What: The page's HTTP status code (e.g., 200, 404).
Usage:

if result.status_code == 404:
    print("Page not found!")

1.4 `error_message` (Optional[str])

What: If success=False, a textual description of the failure.
Usage:

if not result.success:
    print("Error:", result.error_message)

1.5 `session_id` (Optional[str])

What: The ID used for reusing a browser context across multiple calls.
Usage:

# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)

1.6 `response_headers` (Optional[dict])

What: The final HTTP response headers.
Usage:

if result.response_headers:
    print("Server:", result.response_headers.get("Server", "Unknown"))
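
HTTP header names are case-insensitive, and the casing of the keys in response_headers may vary by server or browser. The following is a minimal sketch of a tolerant lookup, assuming response_headers is a plain dict of strings; get_header is a hypothetical helper, not part of Crawl4AI:

```python
from typing import Optional


def get_header(headers: Optional[dict], name: str, default: Optional[str] = None) -> Optional[str]:
    """Look up an HTTP header by name, ignoring case (hypothetical helper)."""
    if not headers:
        return default
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return default


# Hypothetical header dicts standing in for result.response_headers
print(get_header({"server": "nginx", "content-type": "text/html"}, "Server"))
print(get_header(None, "Server", "Unknown"))
```

In real code you would pass result.response_headers as the first argument.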

1.7 `ssl_certificate` (Optional[SSLCertificate])

What: If fetch_ssl_certificate=True in your CrawlerRunConfig, result.ssl_certificate contains an SSLCertificate object describing the site's certificate. You can export the certificate in multiple formats (PEM/DER/JSON) or access its properties such as issuer, subject, valid_from, and valid_until.
Usage:

if result.ssl_certificate:
    print("Issuer:", result.ssl_certificate.issuer)

2. Raw / Cleaned Content

2.1 `html` (str)

What: The original, unmodified HTML from the final page load.
Usage:

# Possibly large
print(len(result.html))

2.2 `cleaned_html` (Optional[str])

What: A sanitized version of the HTML: scripts, styles, and excluded tags are removed based on your CrawlerRunConfig.
Usage:

print((result.cleaned_html or "")[:500])  # Show a snippet; cleaned_html may be None

3. Markdown Fields

3.1 The Markdown Generation Approach

Crawl4AI can convert HTML→Markdown, optionally including:

- Raw markdown
- Links as citations (with a references section)
- Fit markdown if a content filter is used (like Pruning or BM25)

MarkdownGenerationResult includes:

- raw_markdown (str): The full HTML→Markdown conversion.
- markdown_with_citations (str): The same markdown, but with link references rendered as academic-style citations.
- references_markdown (str): The reference list or footnotes at the end.
- fit_markdown (Optional[str]): If content filtering (Pruning/BM25) was applied, the filtered "fit" text.
- fit_html (Optional[str]): The HTML that led to fit_markdown.

Usage:

if result.markdown:
    md_res = result.markdown
    print("Raw MD:", md_res.raw_markdown[:300])
    print("Citations MD:", md_res.markdown_with_citations[:300])
    print("References:", md_res.references_markdown)
    if md_res.fit_markdown:
        print("Pruned text:", md_res.fit_markdown[:300])

3.2 `markdown` (Optional[Union[str, MarkdownGenerationResult]])

What: Holds the MarkdownGenerationResult.
Usage:

print(result.markdown.raw_markdown[:200])
print(result.markdown.fit_markdown)
print(result.markdown.fit_html)

Important: "Fit" content (fit_markdown / fit_html) exists in result.markdown only if you used a filter (like PruningContentFilter or BM25ContentFilter) within a MarkdownGenerationStrategy.
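
Because the field is typed as a union of str and MarkdownGenerationResult, and fit_markdown is only set when a filter ran, downstream code may want a single fallback path. A minimal sketch, using SimpleNamespace objects as stand-ins for MarkdownGenerationResult; best_markdown is a hypothetical helper, not a Crawl4AI function:

```python
from types import SimpleNamespace
from typing import Any, Optional


def best_markdown(markdown: Any) -> Optional[str]:
    """Return fit_markdown when present, else raw_markdown, else the plain string."""
    if markdown is None:
        return None
    if isinstance(markdown, str):
        return markdown
    fit = getattr(markdown, "fit_markdown", None)
    if fit:
        return fit
    return getattr(markdown, "raw_markdown", None)


# Stand-ins for result.markdown (hypothetical values)
no_filter = SimpleNamespace(raw_markdown="# Title\nBody", fit_markdown=None)
with_filter = SimpleNamespace(raw_markdown="# Title\nBody\nFooter", fit_markdown="# Title\nBody")

print(best_markdown(no_filter))
print(best_markdown(with_filter))
```

The helper prefers the filtered text because it is usually the shorter, more relevant input for a model; swap the order if you need the full conversion.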

4. Media & Links

4.1 `media` (Dict[str, List[Dict]])

What: Contains info about discovered images, videos, or audio. Typical keys: "images", "videos", "audios".
Common fields in each item:

- src (str): Media URL
- alt or title (str): Descriptive text
- score (float): Relevance score if the crawler's heuristic found it "important"
- desc or description (Optional[str]): Additional context extracted from surrounding text

Usage:

images = result.media.get("images", [])
for img in images:
    if img.get("score", 0) > 5:
        print("High-value image:", img["src"])

4.2 `links` (Dict[str, List[Dict]])

What: Holds internal and external link data. Usually two keys: "internal" and "external".
Common fields:

- href (str): The link target
- text (str): Link text
- title (str): Title attribute
- context (str): Surrounding text snippet
- domain (str): If external, the domain

Usage:

for link in result.links.get("internal", []):
    print(f"Internal link to {link['href']} with text {link['text']}")
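
The domain field makes it easy to see which external sites a page points to. A minimal sketch that counts external links per domain, assuming the dict layout described above and falling back to parsing href when domain is missing; the sample data and the external_domains helper are hypothetical:

```python
from collections import Counter
from urllib.parse import urlparse


def external_domains(links: dict) -> Counter:
    """Count external links per domain, using 'domain' or the host parsed from 'href'."""
    counts = Counter()
    for link in links.get("external", []):
        domain = link.get("domain") or urlparse(link.get("href", "")).netloc
        if domain:
            counts[domain] += 1
    return counts


# Hypothetical data shaped like result.links
sample_links = {
    "internal": [{"href": "https://example.com/about", "text": "About"}],
    "external": [
        {"href": "https://docs.python.org/3/", "text": "Python docs", "domain": "docs.python.org"},
        {"href": "https://docs.python.org/3/library/", "text": "Library"},
        {"href": "https://github.com/", "text": "GitHub", "domain": "github.com"},
    ],
}

print(external_domains(sample_links).most_common())
```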

5. Additional Fields

5.1 `extracted_content` (Optional[str])

What: If you used an extraction_strategy (CSS, LLM, etc.), the structured output as a JSON string.
Usage:

import json

if result.extracted_content:
    data = json.loads(result.extracted_content)
    print(data)
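
Since extracted_content is a string, it may be empty or, with LLM-based strategies in particular, not valid JSON. A minimal defensive sketch; parse_extracted is a hypothetical helper and the sample strings are made up:

```python
import json
from typing import Any, Optional


def parse_extracted(extracted_content: Optional[str]) -> Any:
    """Decode extracted_content, returning None if it is empty or not valid JSON."""
    if not extracted_content:
        return None
    try:
        return json.loads(extracted_content)
    except json.JSONDecodeError:
        return None


# Hypothetical payloads standing in for result.extracted_content
print(parse_extracted('[{"title": "Hello", "price": 9.5}]'))
print(parse_extracted("not json"))
```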

5.2 `downloaded_files` (Optional[List[str]])

What: If accept_downloads=True (together with downloads_path) is set in your BrowserConfig, lists the local file paths of downloaded items.
Usage:

if result.downloaded_files:
    for file_path in result.downloaded_files:
        print("Downloaded:", file_path)

5.3 `screenshot` (Optional[str])

What: Base64-encoded screenshot if screenshot=True in CrawlerRunConfig.
Usage:

import base64
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))

5.4 `pdf` (Optional[bytes])

What: Raw PDF bytes if pdf=True in CrawlerRunConfig.
Usage:

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)

5.5 `mhtml` (Optional[str])

What: MHTML snapshot of the page if capture_mhtml=True in CrawlerRunConfig. The MHTML (MIME HTML) format preserves the entire web page with all its resources (CSS, images, scripts, etc.) in a single file.
Usage:

if result.mhtml:
    with open("page.mhtml", "w", encoding="utf-8") as f:
        f.write(result.mhtml)

5.6 `metadata` (Optional[dict])

What: Page-level metadata if discovered (title, description, OG data, etc.).
Usage:

if result.metadata:
    print("Title:", result.metadata.get("title"))
    print("Author:", result.metadata.get("author"))

6. `dispatch_result` (optional)

A DispatchResult object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via arun_many() with custom dispatchers). It contains:

- task_id: A unique identifier for the parallel task.
- memory_usage (float): The memory (in MB) used at the time of completion.
- peak_memory (float): The peak memory usage (in MB) recorded during the task's execution.
- start_time / end_time (datetime): Time range for this crawling task.
- error_message (str): Any dispatcher- or concurrency-related error encountered.

# Example usage:
for result in results:
    if result.success and result.dispatch_result:
        dr = result.dispatch_result
        print(f"URL: {result.url}, Task ID: {dr.task_id}")
        print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
        print(f"Duration: {dr.end_time - dr.start_time}")

Note: This field is typically populated when using arun_many(...) alongside a dispatcher (e.g., MemoryAdaptiveDispatcher or SemaphoreDispatcher). If no concurrency or dispatcher is used, dispatch_result may remain None.
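
When many URLs are crawled in parallel, the per-task numbers are often more useful in aggregate. A minimal sketch that summarizes a batch, assuming the fields listed above with start_time and end_time as datetime objects; FakeDispatch is a hypothetical stand-in for DispatchResult so the example runs without a crawl:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FakeDispatch:
    """Hypothetical stand-in mirroring the DispatchResult fields described above."""
    task_id: str
    memory_usage: float
    peak_memory: float
    start_time: datetime
    end_time: datetime


def summarize(dispatches: list) -> dict:
    """Aggregate task count, highest peak memory (MB), and summed task duration (s)."""
    if not dispatches:
        return {"tasks": 0, "max_peak_mb": 0.0, "total_seconds": 0.0}
    return {
        "tasks": len(dispatches),
        "max_peak_mb": max(d.peak_memory for d in dispatches),
        "total_seconds": sum((d.end_time - d.start_time).total_seconds() for d in dispatches),
    }


t0 = datetime(2024, 1, 1, 12, 0, 0)
batch = [
    FakeDispatch("task-1", 120.0, 150.0, t0, t0 + timedelta(seconds=2)),
    FakeDispatch("task-2", 180.0, 210.5, t0, t0 + timedelta(seconds=3, milliseconds=500)),
]
print(summarize(batch))
```

With real results you would build the list from the non-None dispatch_result values, e.g. [r.dispatch_result for r in results if r.dispatch_result].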

7. Network Requests & Console Messages

When you enable network and console message capturing in CrawlerRunConfig using capture_network_requests=True and capture_console_messages=True, the CrawlResult will include these fields:

7.1 `network_requests` (Optional[List[Dict[str, Any]]])

What: A list of dictionaries containing information about all network requests, responses, and failures captured during the crawl.
Structure:

- Each item has an event_type field that can be "request", "response", or "request_failed".
- Request events include url, method, headers, post_data, resource_type, and is_navigation_request.
- Response events include url, status, status_text, headers, and request_timing.
- Failed request events include url, method, resource_type, and failure_text.
- All events include a timestamp field.

Usage:

if result.network_requests:
    # Count different types of events
    requests = [r for r in result.network_requests if r.get("event_type") == "request"]
    responses = [r for r in result.network_requests if r.get("event_type") == "response"]
    failures = [r for r in result.network_requests if r.get("event_type") == "request_failed"]

    print(f"Captured {len(requests)} requests, {len(responses)} responses, and {len(failures)} failures")

    # Analyze API calls
    api_calls = [r for r in requests if "api" in r.get("url", "")]

    # Identify failed resources
    for failure in failures:
        print(f"Failed to load: {failure.get('url')} - {failure.get('failure_text')}")
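
Response events also carry a status, which helps spot sub-resources that loaded with an HTTP error even though the page itself succeeded. A minimal sketch, assuming the event layout described above; the helper names and sample events are hypothetical:

```python
from collections import Counter


def status_summary(events) -> Counter:
    """Count captured response events by status class ('2xx', '4xx', ...)."""
    counts = Counter()
    for event in events or []:
        status = event.get("status")
        if event.get("event_type") == "response" and isinstance(status, int):
            counts[f"{status // 100}xx"] += 1
    return counts


def error_urls(events, threshold: int = 400) -> list:
    """Return URLs of response events whose status is at or above the threshold."""
    return [
        event.get("url")
        for event in events or []
        if event.get("event_type") == "response"
        and isinstance(event.get("status"), int)
        and event["status"] >= threshold
    ]


# Hypothetical events shaped like result.network_requests
sample_events = [
    {"event_type": "request", "url": "https://example.com/", "method": "GET"},
    {"event_type": "response", "url": "https://example.com/", "status": 200},
    {"event_type": "response", "url": "https://example.com/api/items", "status": 500},
    {"event_type": "response", "url": "https://example.com/missing.css", "status": 404},
    {"event_type": "request_failed", "url": "https://example.com/ad.js", "failure_text": "blocked"},
]

print(dict(status_summary(sample_events)))
print(error_urls(sample_events))
```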

7.2 `console_messages` (Optional[List[Dict[str, Any]]])

What: A list of dictionaries containing all browser console messages captured during the crawl.
Structure:

- Each item has a type field indicating the message type (e.g., "log", "error", "warning").
- The text field contains the actual message text.
- Some messages include location information (URL, line, column).
- All messages include a timestamp field.

Usage:

if result.console_messages:
    # Count messages by type
    message_types = {}
    for msg in result.console_messages:
        msg_type = msg.get("type", "unknown")
        message_types[msg_type] = message_types.get(msg_type, 0) + 1

    print(f"Message type counts: {message_types}")

    # Display errors (which are usually most important)
    for msg in result.console_messages:
        if msg.get("type") == "error":
            print(f"Error: {msg.get('text')}")

These fields provide deep visibility into the page's network activity and browser console, which is useful for debugging, security analysis, and understanding complex web applications.

For more details on network and console capturing, see the Network & Console Capture documentation.

8. Example: Accessing Everything

async def handle_result(result: CrawlResult):
    if not result.success:
        print("Crawl error:", result.error_message)
        return

    # Basic info
    print("Crawled URL:", result.url)
    print("Status code:", result.status_code)

    # HTML
    print("Original HTML size:", len(result.html))
    print("Cleaned HTML size:", len(result.cleaned_html or ""))

    # Markdown output
    if result.markdown:
        print("Raw Markdown:", result.markdown.raw_markdown[:300])
        print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
        if result.markdown.fit_markdown:
            print("Fit Markdown:", result.markdown.fit_markdown[:200])

    # Media & Links
    if "images" in result.media:
        print("Image count:", len(result.media["images"]))
    if "internal" in result.links:
        print("Internal link count:", len(result.links["internal"]))

    # Extraction strategy result
    if result.extracted_content:
        print("Structured data:", result.extracted_content)

    # Screenshot/PDF/MHTML
    if result.screenshot:
        print("Screenshot length:", len(result.screenshot))
    if result.pdf:
        print("PDF bytes length:", len(result.pdf))
    if result.mhtml:
        print("MHTML length:", len(result.mhtml))

    # Network and console capturing
    if result.network_requests:
        print(f"Network requests captured: {len(result.network_requests)}")
        # Analyze request types
        req_types = {}
        for req in result.network_requests:
            if "resource_type" in req:
                req_types[req["resource_type"]] = req_types.get(req["resource_type"], 0) + 1
        print(f"Resource types: {req_types}")

    if result.console_messages:
        print(f"Console messages captured: {len(result.console_messages)}")
        # Count by message type
        msg_types = {}
        for msg in result.console_messages:
            msg_types[msg.get("type", "unknown")] = msg_types.get(msg.get("type", "unknown"), 0) + 1
        print(f"Message types: {msg_types}")

9. Key Points & Future

1. Deprecated legacy properties of CrawlResult
- markdown_v2: Deprecated in v0.5. Use markdown instead; it now holds the MarkdownGenerationResult.
- fit_markdown and fit_html: Deprecated in v0.5. Access them through the MarkdownGenerationResult in result.markdown, e.g. result.markdown.fit_markdown and result.markdown.fit_html.

2. Fit Content
- fit_markdown and fit_html appear in MarkdownGenerationResult only if you used a content filter (like PruningContentFilter or BM25ContentFilter) inside your MarkdownGenerationStrategy, or set them directly.
- If no filter is used, they remain None.

3. References & Citations
- If you enable link citations in your DefaultMarkdownGenerator (options={"citations": True}), you'll see markdown_with_citations plus a references_markdown block. This is useful when feeding large language models or when you need academic-style referencing.

4. Links & Media
- links["internal"] and links["external"] group discovered anchors by domain.
- media["images"] / ["videos"] / ["audios"] store extracted media elements with optional scoring or context.

5. Error Cases
- If success=False, check error_message (e.g., timeouts, invalid URLs).
- status_code might be None if the crawl failed before an HTTP response was received.
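
The two error-case rules above can be combined into one small branch. A minimal sketch, using SimpleNamespace objects as stand-ins for CrawlResult; describe_outcome is a hypothetical helper and the sample values are made up:

```python
from types import SimpleNamespace


def describe_outcome(result) -> str:
    """Summarize a crawl outcome from success, status_code, and error_message."""
    if result.success:
        return f"ok (HTTP {result.status_code})"
    if result.status_code is None:
        # No HTTP response was received (e.g., a timeout or an invalid URL)
        return f"failed before any HTTP response: {result.error_message}"
    return f"failed with HTTP {result.status_code}: {result.error_message}"


# Stand-ins for CrawlResult objects (hypothetical values)
ok = SimpleNamespace(success=True, status_code=200, error_message=None)
timeout = SimpleNamespace(success=False, status_code=None, error_message="Timeout")
blocked = SimpleNamespace(success=False, status_code=403, error_message="Forbidden")

for r in (ok, timeout, blocked):
    print(describe_outcome(r))
```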

Use CrawlResult to collect all final outputs and feed them into your data pipelines, AI models, or archives. With a properly configured BrowserConfig and CrawlerRunConfig, the crawler can produce robust, structured results in CrawlResult.
