Crawl Result and Output

When you call arun() on a page, Crawl4AI returns a CrawlResult object containing everything you might need: raw HTML, a cleaned version, optional screenshots or PDFs, structured extraction results, and more. This document explains those fields and how they map to different output types.


1. The CrawlResult Model

Below is the core schema. Each field captures a different aspect of the crawl's result:

class MarkdownGenerationResult(BaseModel):
    raw_markdown: str
    markdown_with_citations: str
    references_markdown: str
    fit_markdown: Optional[str] = None
    fit_html: Optional[str] = None

class CrawlResult(BaseModel):
    url: str
    html: str
    fit_html: Optional[str] = None
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    js_execution_result: Optional[Dict[str, Any]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    mhtml: Optional[str] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    redirected_url: Optional[str] = None
    network_requests: Optional[List[Dict[str, Any]]] = None
    console_messages: Optional[List[Dict[str, Any]]] = None
    tables: List[Dict] = Field(default_factory=list)

    class Config:
        arbitrary_types_allowed = True

Table: Key Fields in CrawlResult

| Field (Name & Type) | Description |
|---------------------|-------------|
| url (str) | The final or actual URL crawled (in case of redirects). |
| html (str) | Original, unmodified page HTML. Good for debugging or custom processing. |
| fit_html (Optional[str]) | Preprocessed HTML optimized for extraction and content filtering. |
| success (bool) | True if the crawl completed without major errors, else False. |
| cleaned_html (Optional[str]) | Sanitized HTML with scripts/styles removed; can exclude tags if configured via excluded_tags etc. |
| media (Dict[str, List[Dict]]) | Extracted media info (images, audio, etc.), each with attributes like src, alt, score, etc. |
| links (Dict[str, List[Dict]]) | Extracted link data, split by internal and external. Each link usually has href, text, etc. |
| downloaded_files (Optional[List[str]]) | If accept_downloads=True in BrowserConfig, this lists the filepaths of saved downloads. |
| js_execution_result (Optional[Dict[str, Any]]) | Results from JavaScript execution during crawling. |
| screenshot (Optional[str]) | Screenshot of the page (base64-encoded) if screenshot=True. |
| pdf (Optional[bytes]) | PDF of the page if pdf=True. |
| mhtml (Optional[str]) | MHTML snapshot of the page if capture_mhtml=True. Contains the full page with all resources. |
| markdown (Optional[str or MarkdownGenerationResult]) | Holds a MarkdownGenerationResult. The generator can provide raw markdown, citations, references, and optionally fit_markdown. |
| extracted_content (Optional[str]) | The output of a structured extraction (CSS/LLM-based) stored as a JSON string or other text. |
| metadata (Optional[dict]) | Additional info about the crawl or extracted data. |
| error_message (Optional[str]) | If success=False, contains a short description of what went wrong. |
| session_id (Optional[str]) | The ID of the session used for multi-page or persistent crawling. |
| response_headers (Optional[dict]) | HTTP response headers, if captured. |
| status_code (Optional[int]) | HTTP status code (e.g., 200 for OK). |
| ssl_certificate (Optional[SSLCertificate]) | SSL certificate info if fetch_ssl_certificate=True. |
| dispatch_result (Optional[DispatchResult]) | Additional concurrency and resource usage information when crawling URLs in parallel. |
| redirected_url (Optional[str]) | The URL after any redirects (different from url, which is the final URL). |
| network_requests (Optional[List[Dict[str, Any]]]) | Network requests, responses, and failures captured during the crawl if capture_network_requests=True. |
| console_messages (Optional[List[Dict[str, Any]]]) | Browser console messages captured during the crawl if capture_console_messages=True. |
| tables (List[Dict]) | Table data extracted from HTML tables, with structure [{headers, rows, caption, summary}]. |
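
Most of the optional fields above are only populated when the matching flag is set in CrawlerRunConfig. As a minimal sketch (the URL is a placeholder), here is how a few of the flags named in the table map to the fields they fill:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        screenshot=True,                # populates result.screenshot
        capture_network_requests=True,  # populates result.network_requests
        capture_console_messages=True,  # populates result.console_messages
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        if result.success:
            print("Status:", result.status_code)
            print("Network events:", len(result.network_requests or []))
            print("Console messages:", len(result.console_messages or []))

if __name__ == "__main__":
    asyncio.run(main())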

2. HTML Variants

html: Raw HTML

Crawl4AI preserves the exact HTML as result.html. Useful for:

  • Debugging page issues or checking the original content.

  • Performing your own specialized parse if needed.
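
For example, a minimal sketch (the URL is a placeholder) that archives the untouched HTML for later inspection:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        if result.success:
            # Write the browser's unmodified HTML to disk
            with open("page_raw.html", "w", encoding="utf-8") as f:
                f.write(result.html)

if __name__ == "__main__":
    asyncio.run(main())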

cleaned_html: Sanitized

If you specify any cleanup or exclusion parameters in CrawlerRunConfig (like excluded_tags, remove_forms, etc.), you'll see the result here:

config = CrawlerRunConfig(
    excluded_tags=["form", "header", "footer"],
    keep_data_attributes=False
)
result = await crawler.arun("https://example.com", config=config)
print(result.cleaned_html)  # Freed of forms, header, footer, data-* attributes

3. Markdown Generation

3.1 markdown

  • markdown: The current location for detailed markdown output, returning a MarkdownGenerationResult object.

  • markdown_v2: Deprecated since v0.5.

MarkdownGenerationResult Fields:

| Field | Description |
|-------|-------------|
| raw_markdown | The basic HTML→Markdown conversion. |
| markdown_with_citations | Markdown including inline citations that reference links at the end. |
| references_markdown | The references/citations themselves (if citations=True). |
| fit_markdown | The filtered/"fit" markdown if a content filter was used. |
| fit_html | The filtered HTML that generated fit_markdown. |

3.2 Basic Example with a Markdown Generator

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"citations": True, "body_width": 80}  # e.g. pass html2text style options
    )
)
result = await crawler.arun(url="https://example.com", config=config)

md_res = result.markdown  # a MarkdownGenerationResult object
print(md_res.raw_markdown[:500])
print(md_res.markdown_with_citations)
print(md_res.references_markdown)

Note: If you use a filter like PruningContentFilter, you'll get fit_markdown and fit_html as well.
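
For instance, a minimal sketch that attaches PruningContentFilter to the generator (the threshold value here is illustrative, not a recommendation):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

async def main():
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.48)  # illustrative threshold
        )
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        if result.markdown.fit_markdown:
            print(result.markdown.fit_markdown[:300])  # filtered "fit" markdown

if __name__ == "__main__":
    asyncio.run(main())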


4. Structured Extraction: extracted_content

If you run a JSON-based extraction strategy (CSS, XPath, LLM, etc.), the structured data is not stored in markdown; it's placed in result.extracted_content as a JSON string (or sometimes plain text).

Example: CSS Extraction with raw:// HTML

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    raw_html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="raw://" + raw_html,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())

Here:

  • url="raw://..." passes the HTML content directly, no network requests.

  • The CSS extraction strategy populates result.extracted_content with the JSON array [{"title": "...", "link": "..."}].


5. More Fields: Links, Media, Tables and More

5.1 links

A dictionary, typically with "internal" and "external" lists. Each entry might have href, text, title, etc. This is automatically captured if you haven't disabled link extraction.

print(result.links["internal"][:3])  # Show first 3 internal links

5.2 media

Similarly, a dictionary with "images", "audio", "video", etc. Each item could include src, alt, score, and more, if your crawler is set to gather them.

images = result.media.get("images", [])
for img in images:
    print("Image URL:", img["src"], "Alt:", img.get("alt"))

5.3 tables

The tables field contains structured data extracted from HTML tables found on the crawled page. Tables are analyzed based on various criteria to determine if they are actual data tables (as opposed to layout tables), including:

  • Presence of thead and tbody sections

  • Use of th elements for headers

  • Column consistency

  • Text density

  • And other factors

Tables that score above the threshold (default: 7) are extracted and stored in result.tables.

Accessing table data:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.w3schools.com/html/html_tables.asp",
            config=CrawlerRunConfig(
                table_score_threshold=7  # Minimum score for table detection
            )
        )

        if result.success and result.tables:
            print(f"Found {len(result.tables)} tables")

            for i, table in enumerate(result.tables):
                print(f"\nTable {i+1}:")
                print(f"Caption: {table.get('caption', 'No caption')}")
                print(f"Headers: {table['headers']}")
                print(f"Rows: {len(table['rows'])}")

                # Print first few rows as example
                for j, row in enumerate(table['rows'][:3]):
                    print(f"  Row {j+1}: {row}")

if __name__ == "__main__":
    asyncio.run(main())

Configuring Table Extraction:

You can adjust the sensitivity of the table detection algorithm with:

config = CrawlerRunConfig(
    table_score_threshold=5  # Lower value = more tables detected (default: 7)
)

Each extracted table contains:

  • headers: Column header names

  • rows: List of rows, each containing cell values

  • caption: Table caption text (if available)

  • summary: Table summary attribute (if specified)
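
Since each table is a plain dict of headers and rows, it converts cleanly into other tools; a small sketch, assuming pandas is installed and result.tables is non-empty:

import pandas as pd

# Build a DataFrame from the first extracted table
table = result.tables[0]
df = pd.DataFrame(table["rows"], columns=table["headers"])
print(df.head())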

Table Extraction Tips

  • Not all HTML tables are extracted - only those detected as "data tables" vs. layout tables.

  • Tables with inconsistent cell counts, nested tables, or those used purely for layout may be skipped.

  • If you're missing tables, try adjusting the table_score_threshold to a lower value (default is 7).

The table detection algorithm scores tables based on features like consistent columns, presence of headers, text density, and more. Tables scoring above the threshold are considered data tables worth extracting.

5.4 screenshot, pdf, and mhtml

If you set screenshot=True, pdf=True, or capture_mhtml=True in CrawlerRunConfig, then:

  • result.screenshot contains a base64-encoded PNG string.

  • result.pdf contains raw PDF bytes (you can write them to a file).

  • result.mhtml contains the MHTML snapshot of the page as a string (you can write it to a .mhtml file).

# Save the PDF
with open("page.pdf", "wb") as f:
    f.write(result.pdf)

# Save the MHTML
if result.mhtml:
    with open("page.mhtml", "w", encoding="utf-8") as f:
        f.write(result.mhtml)
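
The screenshot is a base64 string rather than raw bytes, so decode it before writing; a minimal sketch:

import base64

# Save the screenshot (decode the base64-encoded PNG first)
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))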

The MHTML (MIME HTML) format is particularly useful as it captures the entire web page, including all of its resources (CSS, images, scripts, etc.), in a single file, making it perfect for archiving or offline viewing.

5.5 ssl_certificate

If fetch_ssl_certificate=True, result.ssl_certificate holds details about the site's SSL certificate, such as issuer, validity dates, etc.
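
A short sketch of reading the certificate (the attribute names issuer and valid_until are assumptions about the SSLCertificate helper; adjust to the actual class if they differ):

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(fetch_ssl_certificate=True)
result = await crawler.arun(url="https://example.com", config=config)  # inside an async context
if result.ssl_certificate:
    # issuer / valid_until are assumed attribute names on SSLCertificate
    print("Issuer:", result.ssl_certificate.issuer)
    print("Valid until:", result.ssl_certificate.valid_until)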


6. Accessing These Fields

After you run:

result = await crawler.arun(url="https://example.com", config=some_config)

Check any field:

if result.success:
    print(result.status_code, result.response_headers)
    print("Links found:", len(result.links.get("internal", [])))
    if result.markdown:
        print("Markdown snippet:", result.markdown.raw_markdown[:200])
    if result.extracted_content:
        print("Structured JSON:", result.extracted_content)
else:
    print("Error:", result.error_message)

Deprecation: Since v0.5, result.markdown_v2, result.fit_html, and result.fit_markdown are deprecated. Use result.markdown instead! It holds a MarkdownGenerationResult, which includes fit_html and fit_markdown as its properties.


7. Next Steps

  • Markdown Generation: Dive deeper into how to configure DefaultMarkdownGenerator and various filters.

  • Content Filtering: Learn how to use BM25ContentFilter and PruningContentFilter.

  • Session & Hooks: If you want to manipulate the page or preserve state across multiple arun() calls, see the hooking or session docs.

  • LLM Extraction: For complex or unstructured content requiring AI-driven parsing, check the LLM-based strategies doc.

Enjoy exploring all that CrawlResult offers: whether you need raw HTML, sanitized output, markdown, or fully structured data, Crawl4AI has you covered!

