CrawlResult Reference

The CrawlResult class encapsulates everything returned after a single crawl operation. It provides the raw or processed content, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).

Location: crawl4ai/crawler/models.py (for reference)

class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    fit_html: Optional[str] = None  # Preprocessed HTML optimized for extraction
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    mhtml: Optional[str] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    ...

Below is a field-by-field explanation and possible usage patterns.
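For context, a CrawlResult is what each AsyncWebCrawler.arun() call returns. A minimal sketch (the URL is illustrative):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # Each arun() call yields a CrawlResult
        result = await crawler.arun("https://example.com", config=CrawlerRunConfig())
        print(result.success, result.status_code)

asyncio.run(main())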


1. Basic Crawl Info

1.1 url (str)

What: The final crawled URL (after any redirects).
Usage:

print(result.url)  # e.g., "https://example.com/"

1.2 success (bool)

What: True if the crawl pipeline ended without major errors; False otherwise.
Usage:

if not result.success:
    print(f"Crawl failed: {result.error_message}")

1.3 status_code (Optional[int])

What: The page's HTTP status code (e.g., 200, 404).
Usage:

if result.status_code == 404:
    print("Page not found!")

1.4 error_message (Optional[str])

What: If success=False, a textual description of the failure.
Usage:

if not result.success:
    print("Error:", result.error_message)

1.5 session_id (Optional[str])

What: The ID used for reusing a browser context across multiple calls.
Usage:

# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)

1.6 response_headers (Optional[dict])

What: Final HTTP response headers.
Usage:

if result.response_headers:
    print("Server:", result.response_headers.get("Server", "Unknown"))

1.7 ssl_certificate (Optional[SSLCertificate])

What: If fetch_ssl_certificate=True in your CrawlerRunConfig, result.ssl_certificate contains an SSLCertificate object describing the site's certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like issuer, subject, valid_from, valid_until, etc.
Usage:

if result.ssl_certificate:
    print("Issuer:", result.ssl_certificate.issuer)


2. Raw / Cleaned Content

2.1 html (str)

What: The original, unmodified HTML from the final page load.
Usage:

# Possibly large
print(len(result.html))

2.2 cleaned_html (Optional[str])

What: A sanitized HTML version: scripts, styles, or excluded tags are removed based on your CrawlerRunConfig.
Usage:

print(result.cleaned_html[:500])  # Show a snippet
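How much gets stripped depends on the run configuration. A sketch of one possible setup (excluded_tags and word_count_threshold are CrawlerRunConfig options; the values are illustrative):

from crawl4ai import CrawlerRunConfig

# Illustrative: drop nav/footer/script noise before cleaning
config = CrawlerRunConfig(
    excluded_tags=["nav", "footer", "script", "style"],
    word_count_threshold=10,  # skip very short text blocks
)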


3. Markdown Fields

3.1 The Markdown Generation Approach

Crawl4AI can convert HTML→Markdown, optionally including:

  • Raw markdown

  • Links as citations (with a references section)

  • Fit markdown if a content filter is used (like Pruning or BM25)

MarkdownGenerationResult includes:
- raw_markdown (str): The full HTML→Markdown conversion.
- markdown_with_citations (str): Same markdown, but with link references as academic-style citations.
- references_markdown (str): The reference list or footnotes at the end.
- fit_markdown (Optional[str]): If content filtering (Pruning/BM25) was applied, the filtered "fit" text.
- fit_html (Optional[str]): The HTML that led to fit_markdown.

Usage:

if result.markdown:
    md_res = result.markdown
    print("Raw MD:", md_res.raw_markdown[:300])
    print("Citations MD:", md_res.markdown_with_citations[:300])
    print("References:", md_res.references_markdown)
    if md_res.fit_markdown:
        print("Pruned text:", md_res.fit_markdown[:300])

3.2 markdown (Optional[Union[str, MarkdownGenerationResult]])

What: Holds the MarkdownGenerationResult.
Usage:

print(result.markdown.raw_markdown[:200])
print(result.markdown.fit_markdown)
print(result.markdown.fit_html)

Important: "Fit" content (in fit_markdown/fit_html) exists in result.markdown only if you used a filter (like PruningContentFilter or BM25ContentFilter) within a MarkdownGenerationStrategy.
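A sketch of enabling fit content by passing a content filter into the markdown generator (import paths and the threshold value are illustrative assumptions):

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def crawl_with_fit_markdown(url: str):
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)  # illustrative threshold
    )
    config = CrawlerRunConfig(markdown_generator=md_generator)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)
        # fit_markdown is populated only because a content filter was supplied
        return result.markdown.fit_markdown if result.markdown else None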


4. Media & Links

4.1 media (Dict[str, List[Dict]])

What: Contains info about discovered images, videos, or audio. Typical keys: "images", "videos", "audios".
Common Fields in each item:

  • src (str): Media URL

  • alt or title (str): Descriptive text

  • score (float): Relevance score if the crawler's heuristic found it "important"

  • desc or description (Optional[str]): Additional context extracted from surrounding text

Usage:

images = result.media.get("images", [])
for img in images:
    if img.get("score", 0) > 5:
        print("High-value image:", img["src"])

4.2 links (Dict[str, List[Dict]])

What: Holds internal and external link data. Usually two keys: "internal" and "external".
Common Fields:

  • href (str): The link target

  • text (str): Link text

  • title (str): Title attribute

  • context (str): Surrounding text snippet

  • domain (str): If external, the domain

Usage:

for link in result.links["internal"]:
    print(f"Internal link to {link['href']} with text {link['text']}")


5. Additional Fields

5.1 extracted_content (Optional[str])

What: If you used an extraction_strategy (CSS, LLM, etc.), the structured output (JSON).
Usage:

if result.extracted_content:
    data = json.loads(result.extracted_content)
    print(data)
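How this field gets populated depends on the strategy passed in CrawlerRunConfig. A sketch using a CSS-based schema (the selectors and field names are illustrative):

from crawl4ai import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Illustrative schema: pull titles and links from a listing page
schema = {
    "name": "articles",
    "baseSelector": "article",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}
config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
# After crawler.arun(url, config=config), result.extracted_content holds the JSON string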

5.2 downloaded_files (Optional[List[str]])

What: If accept_downloads=True in your BrowserConfig plus a downloads_path, lists local file paths for downloaded items.
Usage:

if result.downloaded_files:
    for file_path in result.downloaded_files:
        print("Downloaded:", file_path)

5.3 screenshot (Optional[str])

What: Base64-encoded screenshot if screenshot=True in CrawlerRunConfig.
Usage:

import base64
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))

5.4 pdf (Optional[bytes])

What: Raw PDF bytes if pdf=True in CrawlerRunConfig.
Usage:

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)

5.5 mhtml (Optional[str])

What: MHTML snapshot of the page if capture_mhtml=True in CrawlerRunConfig. The MHTML (MIME HTML) format preserves the entire web page with all its resources (CSS, images, scripts, etc.) in a single file.
Usage:

if result.mhtml:
    with open("page.mhtml", "w", encoding="utf-8") as f:
        f.write(result.mhtml)

5.6 metadata (Optional[dict])

What: Page-level metadata if discovered (title, description, OG data, etc.).
Usage:

if result.metadata:
    print("Title:", result.metadata.get("title"))
    print("Author:", result.metadata.get("author"))


6. dispatch_result (optional)

A DispatchResult object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via arun_many() with custom dispatchers). It contains:

  • task_id: A unique identifier for the parallel task.

  • memory_usage (float): The memory (in MB) used at the time of completion.

  • peak_memory (float): The peak memory usage (in MB) recorded during the task's execution.

  • start_time / end_time (datetime): Time range for this crawling task.

  • error_message (str): Any dispatcher- or concurrency-related error encountered.

# Example usage:
for result in results:
    if result.success and result.dispatch_result:
        dr = result.dispatch_result
        print(f"URL: {result.url}, Task ID: {dr.task_id}")
        print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
        print(f"Duration: {dr.end_time - dr.start_time}")

Note: This field is typically populated when using arun_many(...) alongside a dispatcher (e.g., MemoryAdaptiveDispatcher or SemaphoreDispatcher). If no concurrency or dispatcher is used, dispatch_result may remain None.
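A sketch of a parallel run that yields dispatch_result on each item (the import path, dispatcher keyword, and memory_threshold_percent value are assumptions; treat them as illustrative):

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

async def crawl_many(urls: list[str]):
    dispatcher = MemoryAdaptiveDispatcher(memory_threshold_percent=70.0)  # illustrative limit
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=CrawlerRunConfig(), dispatcher=dispatcher)
        for result in results:
            if result.dispatch_result:
                print(result.url, result.dispatch_result.peak_memory)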


7. Network Requests & Console Messages

When you enable network and console message capturing in CrawlerRunConfig with capture_network_requests=True and capture_console_messages=True, the CrawlResult will include these fields:
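For example, a minimal sketch enabling both flags named above:

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_with_capture(url: str):
    config = CrawlerRunConfig(
        capture_network_requests=True,
        capture_console_messages=True,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)
        return result.network_requests, result.console_messages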

7.1 network_requests (Optional[List[Dict[str, Any]]])

What: A list of dictionaries containing information about all network requests, responses, and failures captured during the crawl.
Structure:
- Each item has an event_type field that can be "request", "response", or "request_failed".
- Request events include url, method, headers, post_data, resource_type, and is_navigation_request.
- Response events include url, status, status_text, headers, and request_timing.
- Failed request events include url, method, resource_type, and failure_text.
- All events include a timestamp field.

Usage:

if result.network_requests:
    # Count different types of events
    requests = [r for r in result.network_requests if r.get("event_type") == "request"]
    responses = [r for r in result.network_requests if r.get("event_type") == "response"]
    failures = [r for r in result.network_requests if r.get("event_type") == "request_failed"]

    print(f"Captured {len(requests)} requests, {len(responses)} responses, and {len(failures)} failures")

    # Analyze API calls
    api_calls = [r for r in requests if "api" in r.get("url", "")]

    # Identify failed resources
    for failure in failures:
        print(f"Failed to load: {failure.get('url')} - {failure.get('failure_text')}")

7.2 console_messages (Optional[List[Dict[str, Any]]])

What: A list of dictionaries containing all browser console messages captured during the crawl.
Structure:
- Each item has a type field indicating the message type (e.g., "log", "error", "warning", etc.).
- The text field contains the actual message text.
- Some messages include location information (URL, line, column).
- All messages include a timestamp field.

Usage:

if result.console_messages:
    # Count messages by type
    message_types = {}
    for msg in result.console_messages:
        msg_type = msg.get("type", "unknown")
        message_types[msg_type] = message_types.get(msg_type, 0) + 1

    print(f"Message type counts: {message_types}")

    # Display errors (which are usually most important)
    for msg in result.console_messages:
        if msg.get("type") == "error":
            print(f"Error: {msg.get('text')}")

These fields provide deep visibility into the page's network activity and browser console, which is invaluable for debugging, security analysis, and understanding complex web applications.

For more details on network and console capturing, see the Network & Console Capture documentation.


8. Example: Accessing Everything

async def handle_result(result: CrawlResult):
    if not result.success:
        print("Crawl error:", result.error_message)
        return

    # Basic info
    print("Crawled URL:", result.url)
    print("Status code:", result.status_code)

    # HTML
    print("Original HTML size:", len(result.html))
    print("Cleaned HTML size:", len(result.cleaned_html or ""))

    # Markdown output
    if result.markdown:
        print("Raw Markdown:", result.markdown.raw_markdown[:300])
        print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
        if result.markdown.fit_markdown:
            print("Fit Markdown:", result.markdown.fit_markdown[:200])

    # Media & Links
    if "images" in result.media:
        print("Image count:", len(result.media["images"]))
    if "internal" in result.links:
        print("Internal link count:", len(result.links["internal"]))

    # Extraction strategy result
    if result.extracted_content:
        print("Structured data:", result.extracted_content)

    # Screenshot/PDF/MHTML
    if result.screenshot:
        print("Screenshot length:", len(result.screenshot))
    if result.pdf:
        print("PDF bytes length:", len(result.pdf))
    if result.mhtml:
        print("MHTML length:", len(result.mhtml))

    # Network and console capturing
    if result.network_requests:
        print(f"Network requests captured: {len(result.network_requests)}")
        # Analyze request types
        req_types = {}
        for req in result.network_requests:
            if "resource_type" in req:
                req_types[req["resource_type"]] = req_types.get(req["resource_type"], 0) + 1
        print(f"Resource types: {req_types}")

    if result.console_messages:
        print(f"Console messages captured: {len(result.console_messages)}")
        # Count by message type
        msg_types = {}
        for msg in result.console_messages:
            msg_types[msg.get("type", "unknown")] = msg_types.get(msg.get("type", "unknown"), 0) + 1
        print(f"Message types: {msg_types}")

9. Key Points & Future

1. Deprecated legacy properties of CrawlResult
- markdown_v2 - Deprecated in v0.5. Just use markdown; it holds the MarkdownGenerationResult now.
- fit_markdown and fit_html - Deprecated in v0.5. They can now be accessed via the MarkdownGenerationResult in result.markdown, e.g., result.markdown.fit_markdown and result.markdown.fit_html.

2. Fit Content
- fit_markdown and fit_html appear in MarkdownGenerationResult only if you used a content filter (like PruningContentFilter or BM25ContentFilter) inside your MarkdownGenerationStrategy or set them directly.
- If no filter is used, they remain None.

3. References & Citations
- If you enable link citations in your DefaultMarkdownGenerator (options={"citations": True}), you'll see markdown_with_citations plus a references_markdown block. This helps large language models or academic-like referencing.

4. Links & Media
- links["internal"] and links["external"] group discovered anchors by domain.
- media["images"] / ["videos"] / ["audios"] store extracted media elements with optional scoring or context.

5. Error Cases
- If success=False, check error_message (e.g., timeouts, invalid URLs).
- status_code might be None if we failed before an HTTP response.

Use CrawlResult to glean all final outputs and feed them into your data pipelines, AI models, or archives. With the synergy of a properly configured BrowserConfig and CrawlerRunConfig, the crawler can produce robust, structured results in CrawlResult.

