Extracting JSON (LLM)

In some cases, you need to extract complex or unstructured information from a webpage that a simple CSS/XPath schema cannot easily parse. Or you want AI-driven insights, classification, or summarization. For these scenarios, Crawl4AI provides an LLM-based extraction strategy that:

  1. Works with any large language model supported by LiteLLM (Ollama, OpenAI, Claude, and more).

  2. Automatically splits content into chunks (if desired) to handle token limits, then combines results.

  3. Lets you define a schema (like a Pydantic model) or a simpler “block” extraction approach.

Important: LLM-based extraction can be slower and costlier than schema-based approaches. If your page data is highly structured, consider using JsonCssExtractionStrategy or JsonXPathExtractionStrategy first. But if you need AI to interpret or reorganize content, read on!


1. Why Use an LLM?

  • Complex Reasoning: If the site’s data is unstructured, scattered, or full of natural language context.

  • Semantic Extraction: Summaries, knowledge graphs, or relational data that require comprehension.

  • Flexible: You can pass instructions to the model to do more advanced transformations or classification.


2. Provider-Agnostic via LiteLLM

You can use LLMConfig to quickly configure multiple variations of LLMs and experiment with them to find the optimal one for your use case. You can read more about LLMConfig here.

llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))

Crawl4AI uses a “provider string” (e.g., "openai/gpt-4o", "ollama/llama2.0", "aws/titan") to identify your LLM. Any model that LiteLLM supports is fair game. You just provide:

  • provider: The <provider>/<model_name> identifier (e.g., "openai/gpt-4", "ollama/llama2", "huggingface/google-flan", etc.).

  • api_token: If needed (for OpenAI, HuggingFace, etc.); local models or Ollama might not require it.

  • base_url (optional): If your provider has a custom endpoint.
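
For instance, a minimal sketch of two such configurations (the Ollama base_url below is the common local default, an assumption rather than something this page prescribes):

import os
from crawl4ai import LLMConfig

# Cloud provider: an api_token is required
openai_cfg = LLMConfig(
    provider="openai/gpt-4o-mini",
    api_token=os.getenv("OPENAI_API_KEY")
)

# Local Ollama model: no token needed; base_url only if your
# endpoint differs from the default (assumed here)
ollama_cfg = LLMConfig(
    provider="ollama/llama2",
    base_url="http://localhost:11434"
)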

This means you aren’t locked into a single LLM vendor. Switch or experiment easily.


3. How LLM Extraction Works

3.1 Flow

1. Chunking (optional): The HTML or markdown is split into smaller segments if it’s very long (based on chunk_token_threshold, overlap, etc.).
2. Prompt Construction: For each chunk, the library forms a prompt that includes your instruction (and possibly schema or examples).
3. LLM Inference: Each chunk is sent to the model in parallel or sequentially (depending on your concurrency).
4. Combining: The results from each chunk are merged and parsed into JSON.

3.2 extraction_type

  • "schema":该模型尝试返回符合基于 Pydantic 模式的 JSON。

    ¥"schema": The model tries to return JSON conforming to your Pydantic-based schema.

  • "block":该模型返回库收集的自由格式文本或较小的 JSON 结构。

    ¥"block": The model returns freeform text, or smaller JSON structures, which the library collects.

For structured data, "schema" is recommended. You provide schema=YourPydanticModel.model_json_schema().
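
To make the contrast concrete, here is a hedged sketch of the same page extracted both ways (the Article model and instructions are illustrative, not from this page):

import os
from pydantic import BaseModel
from crawl4ai import LLMConfig, LLMExtractionStrategy

class Article(BaseModel):
    title: str
    summary: str

llm_config = LLMConfig(provider="openai/gpt-4o-mini",
                       api_token=os.getenv("OPENAI_API_KEY"))

# "schema": output must conform to the Pydantic-derived JSON schema
schema_strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    extraction_type="schema",
    schema=Article.model_json_schema(),
    instruction="Extract the article title and a one-sentence summary."
)

# "block": freeform text or loose JSON blocks, no fixed schema
block_strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    extraction_type="block",
    instruction="Summarize each major section of the page."
)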


4. Key Parameters

Below is an overview of important LLM extraction parameters. All are typically set inside LLMExtractionStrategy(...). You then put that strategy in your CrawlerRunConfig(..., extraction_strategy=...).

1. llm_config (LLMConfig): e.g., LLMConfig(provider="openai/gpt-4") or LLMConfig(provider="ollama/llama2").
2. schema (dict): A JSON schema describing the fields you want. Usually generated by YourModel.model_json_schema().
3. extraction_type (str): "schema" or "block".
4. instruction (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
5. chunk_token_threshold (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
6. overlap_rate (float): Overlap ratio between adjacent chunks. E.g., 0.1 means 10% of each chunk is repeated to preserve context continuity.
7. apply_chunking (bool): Set True to chunk automatically. If you want a single pass, set False.
8. input_format (str): Determines which crawler result is passed to the LLM. Options include:
- "markdown": The raw markdown (default).
- "fit_markdown": The filtered “fit” markdown if you used a content filter.
- "html": The cleaned or raw HTML.
9. extra_args (dict): Additional LLM parameters like temperature, max_tokens, top_p, etc.
10. show_usage(): A method you can call to print out usage info (token usage per chunk, total cost if known).

Example:

extraction_strategy = LLMExtractionStrategy(
    llm_config = LLMConfig(provider="openai/gpt-4", api_token="YOUR_OPENAI_KEY"),
    schema=MyModel.model_json_schema(),
    extraction_type="schema",
    instruction="Extract a list of items from the text with 'name' and 'price' fields.",
    chunk_token_threshold=1200,
    overlap_rate=0.1,
    apply_chunking=True,
    input_format="html",
    extra_args={"temperature": 0.1, "max_tokens": 1000},
    verbose=True
)

5. Putting It in CrawlerRunConfig

Important: In Crawl4AI, all strategy definitions should go inside the CrawlerRunConfig, not directly as a param in arun(). Here’s a full example:

import os
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy

class Product(BaseModel):
    name: str
    price: str

async def main():
    # 1. Define the LLM extraction strategy
    llm_strategy = LLMExtractionStrategy(
        llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv('OPENAI_API_KEY')),
        schema=Product.model_json_schema(),  # JSON schema dict (Pydantic v2)
        extraction_type="schema",
        instruction="Extract all product objects with 'name' and 'price' from the content.",
        chunk_token_threshold=1000,
        overlap_rate=0.0,
        apply_chunking=True,
        input_format="markdown",   # or "html", "fit_markdown"
        extra_args={"temperature": 0.0, "max_tokens": 800}
    )

    # 2. Build the crawler config
    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS
    )

    # 3. Create a browser config if needed
    browser_cfg = BrowserConfig(headless=True)

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # 4. Let's say we want to crawl a single page
        result = await crawler.arun(
            url="https://example.com/products",
            config=crawl_config
        )

        if result.success:
            # 5. The extracted content is presumably JSON
            data = json.loads(result.extracted_content)
            print("Extracted items:", data)

            # 6. Show usage stats
            llm_strategy.show_usage()  # prints token usage
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

6. Chunking Details

6.1 chunk_token_threshold

If your page is large, you might exceed your LLM’s context window. chunk_token_threshold sets the approximate max tokens per chunk. The library calculates the word→token ratio using word_token_rate (often ~0.75 by default). If chunking is enabled (apply_chunking=True), the text is split into segments.

6.2 overlap_rate

To keep context continuous across chunks, we can overlap them. E.g., overlap_rate=0.1 means each subsequent chunk includes 10% of the previous chunk’s text. This is helpful if your needed info might straddle chunk boundaries.
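
A quick back-of-the-envelope sketch with the ~0.75 word→token rate mentioned above (approximate by design; the library’s own splitting may differ slightly):

chunk_token_threshold = 1200
overlap_rate = 0.1
word_token_rate = 0.75  # approx. words per token

words_per_chunk = int(chunk_token_threshold * word_token_rate)  # ~900 words
overlap_words = int(words_per_chunk * overlap_rate)             # ~90 words repeated
new_words_per_chunk = words_per_chunk - overlap_words           # ~810 fresh words

print(words_per_chunk, overlap_words, new_words_per_chunk)  # 900 90 810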

6.3 Performance & Parallelism

By chunking, you can potentially process multiple chunks in parallel (depending on your concurrency settings and the LLM provider). This reduces total time if the site is huge or has many sections.


7. Input Format

By default, LLMExtractionStrategy uses input_format="markdown", meaning the crawler’s final markdown is fed to the LLM. You can change to:

  • html: The cleaned HTML or raw HTML (depending on your crawler config) goes into the LLM.

  • fit_markdown: If you used, for instance, PruningContentFilter, the “fit” version of the markdown is used. This can drastically reduce tokens if you trust the filter.

  • markdown: Standard markdown output from the crawler’s markdown_generator.

This setting is crucial: if the LLM instructions rely on HTML tags, pick "html". If you prefer a text-based approach, pick "markdown".

LLMExtractionStrategy(
    # ...
    input_format="html",  # Instead of "markdown" or "fit_markdown"
)
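
Note that fit_markdown only exists if your crawl actually runs a content filter. A minimal sketch, assuming the DefaultMarkdownGenerator and PruningContentFilter classes from Crawl4AI’s markdown tooling (the import names and threshold value here are assumptions):

import os
from crawl4ai import (CrawlerRunConfig, LLMConfig, LLMExtractionStrategy,
                      DefaultMarkdownGenerator, PruningContentFilter)

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini",
                         api_token=os.getenv("OPENAI_API_KEY")),
    extraction_type="block",
    instruction="Summarize the main content.",
    input_format="fit_markdown",  # LLM sees the filtered markdown
)

config = CrawlerRunConfig(
    # The content filter is what produces fit_markdown in the first place
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)
    ),
    extraction_strategy=llm_strategy,
)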

8. Token Usage & Show Usage

To keep track of tokens and cost, each chunk is processed with an LLM call. We record usage in:

  • usages (list): token usage per chunk or call.

  • total_usage: sum of all chunk calls.

  • show_usage(): prints a usage report (if the provider returns usage data).

llm_strategy = LLMExtractionStrategy(...)
# ...
llm_strategy.show_usage()
# e.g. “Total usage: 1241 tokens across 2 chunk calls”

If your model provider doesn’t return usage info, these fields might be partial or empty.
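
A small sketch of inspecting these fields alongside show_usage() (the entries are printed whole here, since their exact attributes can vary by provider and version):

# After a crawl that ran llm_strategy:
llm_strategy.show_usage()  # human-readable usage report

# Raw fields behind the report
for i, usage in enumerate(llm_strategy.usages):
    print(f"chunk {i}: {usage}")   # per-chunk usage record
print("total:", llm_strategy.total_usage)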


9. Example: Building a Knowledge Graph

Below is a snippet combining LLMExtractionStrategy with a Pydantic schema for a knowledge graph. Notice how we pass an instruction telling the model what to parse.

import os
import json
import asyncio
from typing import List
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy

class Entity(BaseModel):
    name: str
    description: str

class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]

async def main():
    # LLM extraction strategy
    llm_strat = LLMExtractionStrategy(
        llm_config = LLMConfig(provider="openai/gpt-4", api_token=os.getenv('OPENAI_API_KEY')),
        schema=KnowledgeGraph.model_json_schema(),
        extraction_type="schema",
        instruction="Extract entities and relationships from the content. Return valid JSON.",
        chunk_token_threshold=1400,
        apply_chunking=True,
        input_format="html",
        extra_args={"temperature": 0.1, "max_tokens": 1500}
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strat,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        # Example page
        url = "https://www.nbcnews.com/business"
        result = await crawler.arun(url=url, config=crawl_config)

        print("--- LLM RAW RESPONSE ---")
        print(result.extracted_content)
        print("--- END LLM RAW RESPONSE ---")

        if result.success:
            with open("kb_result.json", "w", encoding="utf-8") as f:
                f.write(result.extracted_content)
            llm_strat.show_usage()
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

Key Observations:

  • extraction_type="schema" ensures we get JSON fitting our KnowledgeGraph.

  • input_format="html" means we feed HTML to the model.

  • instruction guides the model to output a structured knowledge graph.


10. Best Practices & Caveats

1. Cost & Latency: LLM calls can be slow or expensive. Consider chunking or smaller coverage if you only need partial data.
2. Model Token Limits: If your page + instruction exceed the context window, chunking is essential.
3. Instruction Engineering: Well-crafted instructions can drastically improve output reliability.
4. Schema Strictness: "schema" extraction tries to parse the model output as JSON. If the model returns invalid JSON, partial extraction might happen, or you might get an error.
5. Parallel vs. Serial: The library can process multiple chunks in parallel, but you must watch out for rate limits on certain providers.
6. Check Output: Sometimes, an LLM might omit fields or produce extraneous text. You may want to post-validate with Pydantic or do additional cleanup, as in the sketch below.
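
For point 6, a minimal post-validation sketch, assuming the schema-mode output parses to a list of Product-shaped dicts as in the section 5 example:

import json
from typing import List
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    name: str
    price: str

def validate_items(extracted_content: str) -> List[Product]:
    # Guard against outright malformed JSON from the model
    try:
        raw_items = json.loads(extracted_content)
    except json.JSONDecodeError as e:
        print("Model returned invalid JSON:", e)
        return []

    valid: List[Product] = []
    for item in raw_items:
        try:
            valid.append(Product.model_validate(item))  # Pydantic v2
        except ValidationError as e:
            print("Skipping malformed item:", e)
    return valid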


11. Conclusion

LLM-based extraction in Crawl4AI is provider-agnostic, letting you choose from hundreds of models via LiteLLM. It’s perfect for semantically complex tasks or generating advanced structures like knowledge graphs. However, it’s slower and potentially costlier than schema-based approaches. Keep these tips in mind:

  • Put your LLM strategy in CrawlerRunConfig.

  • Use input_format to pick which form (markdown, HTML, fit_markdown) the LLM sees.

  • Tweak chunk_token_threshold, overlap_rate, and apply_chunking to handle large content efficiently.

  • Monitor token usage with show_usage().

If your site’s data is consistent or repetitive, consider JsonCssExtractionStrategy first for speed and simplicity. But if you need an AI-driven approach, LLMExtractionStrategy offers a flexible, multi-provider solution for extracting structured JSON from any website.

Next Steps:

1. Experiment with Different Providers
- Try switching the provider (e.g., "ollama/llama2", "openai/gpt-4o", etc.) to see differences in speed, accuracy, or cost.
- Pass different extra_args like temperature, top_p, and max_tokens to fine-tune your results.

2. Performance Tuning
- If pages are large, tweak chunk_token_threshold, overlap_rate, or apply_chunking to optimize throughput.
- Check the usage logs with show_usage() to keep an eye on token consumption and identify potential bottlenecks.

3. Validate Outputs
- If using extraction_type="schema", parse the LLM’s JSON with a Pydantic model for a final validation step.
- Log or handle any parse errors gracefully, especially if the model occasionally returns malformed JSON.

4. Explore Hooks & Automation
- Integrate LLM extraction with hooks for complex pre/post-processing.
- Use a multi-step pipeline: crawl, filter, LLM-extract, then store or index results for further analysis.

Last Updated: 2025-01-01


That’s it for Extracting JSON (LLM). Now you can harness AI to parse, classify, or reorganize data on the web. Happy crawling!

