Extracting JSON (No LLM)
One of Crawl4AI's most powerful features is extracting structured JSON from websites without relying on large language models. Crawl4AI offers several strategies for LLM-free extraction:
- Schema-based extraction with CSS or XPath selectors via JsonCssExtractionStrategy and JsonXPathExtractionStrategy
- Regular expression extraction with RegexExtractionStrategy for fast pattern matching
These approaches let you extract data instantly—even from complex or nested HTML structures—without the cost, latency, or environmental impact of an LLM.
Why avoid LLM for basic extractions?
- Faster & Cheaper: No API calls or GPU overhead.
- Lower Carbon Footprint: LLM inference can be energy-intensive. Pattern-based extraction is practically carbon-free.
- Precise & Repeatable: CSS/XPath selectors and regex patterns do exactly what you specify. LLM outputs can vary or hallucinate.
- Scales Readily: For thousands of pages, pattern-based extraction runs quickly and in parallel.
Below, we'll explore how to craft these schemas and use them with JsonCssExtractionStrategy (or JsonXPathExtractionStrategy if you prefer XPath). We'll also highlight advanced features like nested fields and base element attributes.
1. Intro to Schema-Based Extraction
A schema defines:
- A base selector that identifies each "container" element on the page (e.g., a product row, a blog post card).
- Fields describing which CSS/XPath selectors to use for each piece of data you want to capture (text, attribute, HTML block, etc.).
- Nested or list types for repeated or hierarchical structures.
For example, if you have a list of products, each one might have a name, price, reviews, and "related products." This approach is faster and more reliable than an LLM for consistent, structured pages.
2. Simple Example: Crypto Prices
Let's begin with a simple schema-based extraction using the JsonCssExtractionStrategy. Below is a snippet that extracts cryptocurrency prices from a site (similar to the legacy Coinbase example). Notice we don't call any LLM:
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_crypto_prices():
    # 1. Define a simple extraction schema
    schema = {
        "name": "Crypto Prices",
        "baseSelector": "div.crypto-row",    # Repeated elements
        "fields": [
            {
                "name": "coin_name",
                "selector": "h2.coin-name",
                "type": "text"
            },
            {
                "name": "price",
                "selector": "span.coin-price",
                "type": "text"
            }
        ]
    }

    # 2. Create the extraction strategy
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    # 3. Set up your crawler config (if needed)
    config = CrawlerRunConfig(
        # e.g., pass js_code or wait_for if the page is dynamic
        # wait_for="css:.crypto-row:nth-child(20)"
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy,
    )

    async with AsyncWebCrawler(verbose=True) as crawler:
        # 4. Run the crawl and extraction
        result = await crawler.arun(
            url="https://example.com/crypto-prices",
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # 5. Parse the extracted JSON
        data = json.loads(result.extracted_content)
        print(f"Extracted {len(data)} coin entries")
        print(json.dumps(data[0], indent=2) if data else "No data found")

asyncio.run(extract_crypto_prices())
Highlights:
- baseSelector: Tells us where each "item" (crypto row) is.
- fields: Two fields (coin_name, price) using simple CSS selectors.
- Each field defines a type (e.g., text, attribute, html, regex, etc.).
No LLM is needed, and the performance is near-instant for hundreds or thousands of items.
XPath Example with raw:// HTML
Below is a short example demonstrating XPath extraction plus the raw:// scheme. We'll pass a dummy HTML directly (no network request) and define the extraction strategy in CrawlerRunConfig.
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import JsonXPathExtractionStrategy

async def extract_crypto_prices_xpath():
    # 1. Minimal dummy HTML with some repeating rows
    dummy_html = """
    <html>
      <body>
        <div class='crypto-row'>
          <h2 class='coin-name'>Bitcoin</h2>
          <span class='coin-price'>$28,000</span>
        </div>
        <div class='crypto-row'>
          <h2 class='coin-name'>Ethereum</h2>
          <span class='coin-price'>$1,800</span>
        </div>
      </body>
    </html>
    """

    # 2. Define the JSON schema (XPath version)
    schema = {
        "name": "Crypto Prices via XPath",
        "baseSelector": "//div[@class='crypto-row']",
        "fields": [
            {
                "name": "coin_name",
                "selector": ".//h2[@class='coin-name']",
                "type": "text"
            },
            {
                "name": "price",
                "selector": ".//span[@class='coin-price']",
                "type": "text"
            }
        ]
    }

    # 3. Place the strategy in the CrawlerRunConfig
    config = CrawlerRunConfig(
        extraction_strategy=JsonXPathExtractionStrategy(schema, verbose=True)
    )

    # 4. Use raw:// scheme to pass dummy_html directly
    raw_url = f"raw://{dummy_html}"

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=raw_url,
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        data = json.loads(result.extracted_content)
        print(f"Extracted {len(data)} coin rows")
        if data:
            print("First item:", data[0])

asyncio.run(extract_crypto_prices_xpath())
Key Points:
- JsonXPathExtractionStrategy is used instead of JsonCssExtractionStrategy.
- baseSelector and each field's "selector" use XPath instead of CSS.
- raw:// lets us pass dummy_html with no real network request—handy for local testing.
- Everything (including the extraction strategy) is in CrawlerRunConfig.
That's how you keep the config self-contained, illustrate XPath usage, and demonstrate the raw scheme for direct HTML input—all while avoiding the old approach of passing extraction_strategy directly to arun().
3. Advanced Schema & Nested Structures
Real sites often have nested or repeated data—like categories containing products, which themselves have a list of reviews or features. For that, we can define nested or list (and even nested_list) fields.
Sample E-Commerce HTML
We have a sample e-commerce HTML file on GitHub (example):
https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html
schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    # (1) We can define optional baseFields if we want to extract attributes
    # from the category container
    "baseFields": [
        {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
    ],
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-name",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",    # repeated sub-objects
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text"
                },
                {
                    "name": "price",
                    "selector": "p.product-price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",    # single sub-object
                    "fields": [
                        {
                            "name": "brand",
                            "selector": "span.brand",
                            "type": "text"
                        },
                        {
                            "name": "model",
                            "selector": "span.model",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.product-features li",
                    "type": "list",
                    "fields": [
                        {"name": "feature", "type": "text"}
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {
                            "name": "reviewer",
                            "selector": "span.reviewer",
                            "type": "text"
                        },
                        {
                            "name": "rating",
                            "selector": "span.rating",
                            "type": "text"
                        },
                        {
                            "name": "comment",
                            "selector": "p.review-text",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "related_products",
                    "selector": "ul.related-products li",
                    "type": "list",
                    "fields": [
                        {
                            "name": "name",
                            "selector": "span.related-name",
                            "type": "text"
                        },
                        {
                            "name": "price",
                            "selector": "span.related-price",
                            "type": "text"
                        }
                    ]
                }
            ]
        }
    ]
}
Key Takeaways:
- Nested vs. List:
  - type: "nested" means a single sub-object (like details).
  - type: "list" means multiple items that are simple dictionaries or single text fields.
  - type: "nested_list" means repeated complex objects (like products or reviews).
- Base Fields: We can extract attributes from the container element via "baseFields". For instance, "data_cat_id" might be data-cat-id="elect123".
- Transforms: We can also define a transform if we want to lower/upper case, strip whitespace, or even run a custom function (see the sketch after this list).
Running the Extraction
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import JsonCssExtractionStrategy

ecommerce_schema = {
    # ... the advanced schema from above ...
}

async def extract_ecommerce_data():
    strategy = JsonCssExtractionStrategy(ecommerce_schema, verbose=True)
    # Keep the strategy inside the config, per the guidance above
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # Parse the JSON output
        data = json.loads(result.extracted_content)
        print(json.dumps(data, indent=2) if data else "No data found.")

asyncio.run(extract_ecommerce_data())
If all goes well, you get a structured JSON array with each "category," containing an array of products. Each product includes details, features, reviews, etc. All of that without an LLM.
4. RegexExtractionStrategy - Fast Pattern-Based Extraction
Crawl4AI now offers a powerful new zero-LLM extraction strategy: RegexExtractionStrategy. This strategy provides lightning-fast extraction of common data types like emails, phone numbers, URLs, dates, and more using pre-compiled regular expressions.
Key Features
- Zero LLM Dependency: Extracts data without any AI model calls
- Blazing Fast: Uses pre-compiled regex patterns for maximum performance
- Built-in Patterns: Includes ready-to-use patterns for common data types
- Custom Patterns: Add your own regex patterns for domain-specific extraction
- LLM-Assisted Pattern Generation: Optionally use an LLM once to generate optimized patterns, then reuse them without further LLM calls
Simple Example: Extracting Common Entities
The easiest way to start is by using the built-in pattern catalog:
import json
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    RegexExtractionStrategy
)

async def extract_with_regex():
    # Create a strategy using built-in patterns for URLs and currencies
    strategy = RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Url | RegexExtractionStrategy.Currency
    )
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data[:5]:    # Show first 5 matches
                print(f"{item['label']}: {item['value']}")
            print(f"Total matches: {len(data)}")

asyncio.run(extract_with_regex())
Available Built-in Patterns
RegexExtractionStrategy provides these common patterns as IntFlag attributes for easy combining:
# Use individual patterns
strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email)

# Combine multiple patterns
strategy = RegexExtractionStrategy(
    pattern=(
        RegexExtractionStrategy.Email |
        RegexExtractionStrategy.PhoneUS |
        RegexExtractionStrategy.Url
    )
)

# Use all available patterns
strategy = RegexExtractionStrategy(pattern=RegexExtractionStrategy.All)
Available patterns include:
- Email - Email addresses
- PhoneIntl - International phone numbers
- PhoneUS - US-format phone numbers
- Url - HTTP/HTTPS URLs
- IPv4 - IPv4 addresses
- IPv6 - IPv6 addresses
- Uuid - UUIDs
- Currency - Currency values (USD, EUR, etc.)
- Percentage - Percentage values
- Number - Numeric values
- DateIso - ISO format dates
- DateUS - US format dates
- Time24h - 24-hour format times
- PostalUS - US postal codes
- PostalUK - UK postal codes
- HexColor - HTML hex color codes
- TwitterHandle - Twitter handles
- Hashtag - Hashtags
- MacAddr - MAC addresses
- Iban - International bank account numbers
- CreditCard - Credit card numbers
Custom Pattern Example
For more targeted extraction, you can provide custom patterns:
import json
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    RegexExtractionStrategy
)

async def extract_prices():
    # Define a custom pattern for US Dollar prices
    price_pattern = {"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"}

    # Create strategy with custom pattern
    strategy = RegexExtractionStrategy(custom=price_pattern)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data:
                print(f"Found price: {item['value']}")

asyncio.run(extract_prices())
LLM-Assisted Pattern Generation
For complex or site-specific patterns, you can use an LLM once to generate an optimized pattern, then save and reuse it without further LLM calls:
import json
import asyncio
from pathlib import Path
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    RegexExtractionStrategy,
    LLMConfig
)

async def extract_with_generated_pattern():
    cache_dir = Path("./pattern_cache")
    cache_dir.mkdir(exist_ok=True)
    pattern_file = cache_dir / "price_pattern.json"

    # 1. Generate or load pattern
    if pattern_file.exists():
        pattern = json.load(pattern_file.open())
        print(f"Using cached pattern: {pattern}")
    else:
        print("Generating pattern via LLM...")

        # Configure LLM
        llm_config = LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token="env:OPENAI_API_KEY",
        )

        # Get sample HTML for context
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun("https://example.com/products")
            html = result.fit_html

        # Generate pattern (one-time LLM usage)
        pattern = RegexExtractionStrategy.generate_pattern(
            label="price",
            html=html,
            query="Product prices in USD format",
            llm_config=llm_config,
        )

        # Cache pattern for future use
        json.dump(pattern, pattern_file.open("w"), indent=2)

    # 2. Use pattern for extraction (no LLM calls)
    strategy = RegexExtractionStrategy(custom=pattern)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data[:10]:
                print(f"Extracted: {item['value']}")
            print(f"Total matches: {len(data)}")

asyncio.run(extract_with_generated_pattern())
This pattern allows you to:
1. Use an LLM once to generate a highly optimized regex for your specific site
2. Save the pattern to disk for reuse
3. Extract data using only regex (no further LLM calls) in production
Extraction Results Format
The RegexExtractionStrategy returns results in a consistent format:
[
    {
        "url": "https://example.com",
        "label": "email",
        "value": "contact@example.com",
        "span": [145, 163]
    },
    {
        "url": "https://example.com",
        "label": "url",
        "value": "https://support.example.com",
        "span": [210, 235]
    }
]
Each match includes the following (a quick consumption sketch follows the list):
- url: The source URL
- label: The pattern name that matched (e.g., "email", "phone_us")
- value: The extracted text
- span: The start and end positions in the source content
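Downstream code can filter or group these matches by label. A small sketch of that post-processing step (the sample data is inlined so the snippet runs standalone; in practice it would come from result.extracted_content):

from collections import defaultdict

# Sample matches as parsed from result.extracted_content
data = [
    {"url": "https://example.com", "label": "email",
     "value": "contact@example.com", "span": [145, 163]},
    {"url": "https://example.com", "label": "url",
     "value": "https://support.example.com", "span": [210, 235]},
]

by_label = defaultdict(list)
for item in data:
    by_label[item["label"]].append(item["value"])

print(by_label["email"])   # ['contact@example.com']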
5. Why "No LLM" Is Often Better
- Zero Hallucination: Pattern-based extraction doesn't guess text. It either finds it or not.
- Guaranteed Structure: The same schema or regex yields consistent JSON across many pages, so your downstream pipeline can rely on stable keys.
- Speed: LLM-based extraction can be 10–1000x slower for large-scale crawling.
- Scalable: Adding or updating a field is a matter of adjusting the schema or regex, not re-tuning a model.
When might you consider an LLM? Possibly if the site is extremely unstructured or you want AI summarization. But always try a schema or regex approach first for repeated or consistent data patterns.
6. Base Element Attributes & Additional Fields
It's easy to extract attributes (like href, src, or data-xxx) from your base or nested elements using:
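For example, an attribute field might look like this (a minimal sketch in the same schema format used throughout this page; the "default" value is what's used when the attribute is missing):

{
    "name": "href",
    "type": "attribute",
    "attribute": "href",
    "default": None   # used when the attribute is absent
}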
You can define them in baseFields (extracted from the main container element) or in each field's sub-lists. This is especially helpful if you need an item's link or ID stored in the parent <div>.
7. Putting It All Together: Larger Example
Consider a blog site. We have a schema that extracts the URL from each post card (via baseFields with an "attribute": "href"), plus the title, date, summary, and author:
schema = {
    "name": "Blog Posts",
    "baseSelector": "a.blog-post-card",
    "baseFields": [
        {"name": "post_url", "type": "attribute", "attribute": "href"}
    ],
    "fields": [
        {"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
        {"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
        {"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
        {"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
    ]
}
Then run with JsonCssExtractionStrategy(schema) to get an array of blog post objects, each with "post_url", "title", "date", "summary", "author". A minimal runner is sketched below.
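A minimal sketch of that run, following the same pattern as the earlier examples. It assumes the `schema` dict defined above, and the blog URL is a placeholder:

import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy

async def extract_blog_posts():
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/blog", config=config)
        if result.success:
            posts = json.loads(result.extracted_content)
            print(f"Extracted {len(posts)} posts")

asyncio.run(extract_blog_posts())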
8. Tips & Best Practices
- Inspect the DOM in Chrome DevTools or Firefox's Inspector to find stable selectors.
- Start Simple: Verify you can extract a single field. Then add complexity like nested objects or lists.
- Test your schema on partial HTML or a test page before a big crawl.
- Combine with JS Execution if the site loads content dynamically. You can pass js_code or wait_for in CrawlerRunConfig (see the sketch after this list).
- Look at Logs when verbose=True: if your selectors are off or your schema is malformed, it'll often show warnings.
- Use baseFields if you need attributes from the container element (e.g., href, data-id), especially for the "parent" item.
- Performance: For large pages, make sure your selectors are as narrow as possible.
- Consider Using Regex First: For simple data types like emails, URLs, and dates, RegexExtractionStrategy is often the fastest approach.
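A minimal sketch of combining dynamic loading with extraction. The JS snippet and wait condition are placeholders for whatever your target page needs, and `schema` is assumed to be a schema dict like the ones above:

from crawl4ai import CrawlerRunConfig, JsonCssExtractionStrategy

config = CrawlerRunConfig(
    # Placeholder JS: scroll to trigger lazy-loaded content
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    # Wait until at least 20 rows are present before extracting
    wait_for="css:.crypto-row:nth-child(20)",
    extraction_strategy=JsonCssExtractionStrategy(schema),
)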
9. Schema Generation Utility
While manually crafting schemas is powerful and precise, Crawl4AI now offers a convenient utility to automatically generate extraction schemas using LLM. This is particularly useful when:
- You're dealing with a new website structure and want a quick starting point
- You need to extract complex nested data structures
- You want to avoid the learning curve of CSS/XPath selector syntax
Using the Schema Generator
The schema generator is available as a static method on both JsonCssExtractionStrategy and JsonXPathExtractionStrategy. You can choose between OpenAI's GPT-4 or the open-source Ollama for schema generation:
from crawl4ai import JsonCssExtractionStrategy, JsonXPathExtractionStrategy
from crawl4ai import LLMConfig

# Sample HTML with product information
html = """
<div class="product-card">
    <h2 class="title">Gaming Laptop</h2>
    <div class="price">$999.99</div>
    <div class="specs">
        <ul>
            <li>16GB RAM</li>
            <li>1TB SSD</li>
        </ul>
    </div>
</div>
"""

# Option 1: Using OpenAI (requires API token)
css_schema = JsonCssExtractionStrategy.generate_schema(
    html,
    schema_type="css",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-openai-token")
)

# Option 2: Using Ollama (open source, no token needed)
xpath_schema = JsonXPathExtractionStrategy.generate_schema(
    html,
    schema_type="xpath",
    llm_config=LLMConfig(provider="ollama/llama3.3", api_token=None)  # Not needed for Ollama
)

# Use the generated schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(css_schema)
LLM Provider Options
- OpenAI GPT-4 (openai/gpt-4o)
  - Default provider
  - Requires an API token
  - Generally provides more accurate schemas
  - Set via environment variable: OPENAI_API_KEY
- Ollama (ollama/llama3.3)
  - Open source alternative
  - No API token required
  - Self-hosted option
  - Good for development and testing
Benefits of Schema Generation
- One-Time Cost: While schema generation uses LLM, it's a one-time cost. The generated schema can be reused for unlimited extractions without further LLM calls.
- Smart Pattern Recognition: The LLM analyzes the HTML structure and identifies common patterns, often producing more robust selectors than manual attempts.
- Automatic Nesting: Complex nested structures are automatically detected and properly represented in the schema.
- Learning Tool: The generated schemas serve as excellent examples for learning how to write your own schemas.
Best Practices
- Review Generated Schemas: While the generator is smart, always review and test the generated schema before using it in production.
- Provide Representative HTML: The better your sample HTML represents the overall structure, the more accurate the generated schema will be.
- Consider Both CSS and XPath: Try both schema types and choose the one that works best for your specific case.
- Cache Generated Schemas: Since generation uses LLM, save successful schemas for reuse.
- API Token Security: Never hardcode API tokens. Use environment variables or secure configuration management.
- Choose Provider Wisely:
  - Use OpenAI for production-quality schemas
  - Use Ollama for development, testing, or when you need a self-hosted solution
10. Conclusion
With Crawl4AI's LLM-free extraction strategies - JsonCssExtractionStrategy, JsonXPathExtractionStrategy, and now RegexExtractionStrategy - you can build powerful pipelines that:
- Scrape any consistent site for structured data.
- Support nested objects, repeating lists, or pattern-based extraction.
- Scale to thousands of pages quickly and reliably.
Choosing the Right Strategy:
- Use RegexExtractionStrategy for fast extraction of common data types like emails, phones, URLs, dates, etc.
- Use JsonCssExtractionStrategy or JsonXPathExtractionStrategy for structured data with clear HTML patterns
- If you need both: first extract structured data with JSON strategies, then use regex on specific fields (see the sketch after this list)
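A rough sketch of that two-step combination: run a JSON strategy first, then apply a plain re pattern to one of its fields. The product data below is inlined (with illustrative field names) so the snippet runs standalone; in practice it would come from result.extracted_content:

import re

# Output from a JSON-schema extraction (illustrative)
products = [
    {"name": "Gaming Laptop", "price": "Now only $999.99!"},
    {"name": "Mouse", "price": "$19.99"},
]

# Normalize the free-text price field with a USD regex
price_re = re.compile(r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?")
for product in products:
    match = price_re.search(product["price"])
    product["price_usd"] = match.group(0) if match else None

print(products)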
Remember: For repeated, structured data, you don't need to pay for or wait on an LLM. Well-crafted schemas and regex patterns get you the data faster, cleaner, and cheaper—the real power of Crawl4AI.
Last Updated: 2025-05-02
That's it for Extracting JSON (No LLM)! You've seen how schema-based approaches (either CSS or XPath) and regex patterns can handle everything from simple lists to deeply nested product catalogs—instantly, with minimal overhead. Enjoy building robust scrapers that produce consistent, structured JSON for your data pipelines!