Extraction & Chunking Strategies API

This documentation covers the API reference for extraction and chunking strategies in Crawl4AI.
Extraction Strategies
All extraction strategies inherit from the base ExtractionStrategy class and implement two key methods:
- extract(url: str, html: str) -> List[Dict[str, Any]]
- run(url: str, sections: List[str]) -> List[Dict[str, Any]]
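For a concrete sense of this contract, here is a minimal sketch of a custom strategy. The subclass name and its logic are hypothetical; only the two methods come from the interface above, and the import path is an assumption based on the library's module layout.

import re
from typing import Any, Dict, List

from crawl4ai.extraction_strategy import ExtractionStrategy

class TitleExtractionStrategy(ExtractionStrategy):
    """Hypothetical strategy: pull <title> text out of raw HTML."""

    def extract(self, url: str, html: str, *args, **kwargs) -> List[Dict[str, Any]]:
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        return [{"url": url, "title": match.group(1).strip()}] if match else []

    def run(self, url: str, sections: List[str], *args, **kwargs) -> List[Dict[str, Any]]:
        # run() applies extract() to each pre-chunked section and flattens the results
        results: List[Dict[str, Any]] = []
        for section in sections:
            results.extend(self.extract(url, section))
        return results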
LLMExtractionStrategy

Used for extracting structured data using Language Models.
LLMExtractionStrategy(
    # Required Parameters
    provider: str = DEFAULT_PROVIDER,   # LLM provider (e.g., "ollama/llama2")
    api_token: Optional[str] = None,    # API token

    # Extraction Configuration
    instruction: str = None,            # Custom extraction instruction
    schema: Dict = None,                # Pydantic model schema for structured data
    extraction_type: str = "block",     # "block" or "schema"

    # Chunking Parameters
    chunk_token_threshold: int = 4000,  # Maximum tokens per chunk
    overlap_rate: float = 0.1,          # Overlap between chunks
    word_token_rate: float = 0.75,      # Word-to-token conversion rate
    apply_chunking: bool = True,        # Enable/disable chunking

    # API Configuration
    base_url: str = None,               # Base URL for API
    extra_args: Dict = {},              # Additional provider arguments
    verbose: bool = False               # Enable verbose logging
)
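Note that the usage examples below pass provider details through llm_config=LLMConfig(...) rather than the top-level provider/api_token parameters. As a quick illustration of block mode, which needs only an instruction and no schema, here is a minimal sketch; the provider string and environment-variable name are assumptions.

import os
from crawl4ai import LLMConfig, LLMExtractionStrategy

# Block mode: free-form extraction guided by the instruction alone
block_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",          # assumed provider string
        api_token=os.getenv("OPENAI_API_KEY"),  # token read from the environment
    ),
    extraction_type="block",
    instruction="Summarize each news item as a standalone block",
    chunk_token_threshold=2000,  # smaller chunks for long pages
    overlap_rate=0.1,
)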
RegexExtractionStrategy

Used for fast pattern-based extraction of common entities using regular expressions.
RegexExtractionStrategy(
    # Pattern Configuration
    pattern: IntFlag = RegexExtractionStrategy.Nothing,  # Bit flags of built-in patterns to use
    custom: Optional[Dict[str, str]] = None,             # Custom pattern dictionary {label: regex}

    # Input Format
    input_format: str = "fit_html",  # "html", "markdown", "text" or "fit_html"
)
# Built-in Patterns as Bit Flags
RegexExtractionStrategy.Email # Email addresses
RegexExtractionStrategy.PhoneIntl # International phone numbers
RegexExtractionStrategy.PhoneUS # US-format phone numbers
RegexExtractionStrategy.Url # HTTP/HTTPS URLs
RegexExtractionStrategy.IPv4 # IPv4 addresses
RegexExtractionStrategy.IPv6 # IPv6 addresses
RegexExtractionStrategy.Uuid # UUIDs
RegexExtractionStrategy.Currency # Currency values (USD, EUR, etc.)
RegexExtractionStrategy.Percentage # Percentage values
RegexExtractionStrategy.Number # Numeric values
RegexExtractionStrategy.DateIso # ISO format dates
RegexExtractionStrategy.DateUS # US format dates
RegexExtractionStrategy.Time24h # 24-hour format times
RegexExtractionStrategy.PostalUS # US postal codes
RegexExtractionStrategy.PostalUK # UK postal codes
RegexExtractionStrategy.HexColor # HTML hex color codes
RegexExtractionStrategy.TwitterHandle # Twitter handles
RegexExtractionStrategy.Hashtag # Hashtags
RegexExtractionStrategy.MacAddr # MAC addresses
RegexExtractionStrategy.Iban # International bank account numbers
RegexExtractionStrategy.CreditCard # Credit card numbers
RegexExtractionStrategy.All # All available patterns
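Since these are IntFlag values, they compose with standard bitwise operators; a short sketch:

# Combine selected patterns with bitwise OR
contact_info = RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS

# Everything except noisy numeric matches (bitwise AND NOT)
all_but_numbers = RegexExtractionStrategy.All & ~RegexExtractionStrategy.Number

strategy = RegexExtractionStrategy(pattern=contact_info)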
CosineStrategy

Used for content similarity-based extraction and clustering.
CosineStrategy(
    # Content Filtering
    semantic_filter: str = None,     # Topic/keyword filter
    word_count_threshold: int = 10,  # Minimum words per cluster
    sim_threshold: float = 0.3,      # Similarity threshold

    # Clustering Parameters
    max_dist: float = 0.2,           # Maximum cluster distance
    linkage_method: str = 'ward',    # Clustering method
    top_k: int = 3,                  # Top clusters to return

    # Model Configuration
    model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',  # Embedding model
    verbose: bool = False            # Enable verbose logging
)
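There is no dedicated CosineStrategy walkthrough in the usage examples below, so here is a minimal sketch. The URL and filter text are placeholders, and the import path follows the other examples in this document (adjust if your version exposes it under crawl4ai.extraction_strategy).

from crawl4ai import AsyncWebCrawler, CosineStrategy

strategy = CosineStrategy(
    semantic_filter="machine learning pricing",  # keep clusters near this topic
    word_count_threshold=15,                     # drop very short clusters
    top_k=3,
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/blog",
        extraction_strategy=strategy,
    )
    print(result.extracted_content)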
JsonCssExtractionStrategy

Used for CSS selector-based structured data extraction.
JsonCssExtractionStrategy(
    schema: Dict[str, Any],  # Extraction schema
    verbose: bool = False    # Enable verbose logging
)

# Schema Structure
schema = {
    "name": str,          # Schema name
    "baseSelector": str,  # Base CSS selector
    "fields": [           # List of fields to extract
        {
            "name": str,       # Field name
            "selector": str,   # CSS selector
            "type": str,       # Field type: "text", "attribute", "html", "regex"
            "attribute": str,  # For type="attribute"
            "pattern": str,    # For type="regex"
            "transform": str,  # Optional: "lowercase", "uppercase", "strip"
            "default": Any     # Default value if extraction fails
        }
    ]
}
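For instance, a field of type "regex" applies its pattern to the text selected by selector, falling back to default when there is no match. The field below is an invented illustration, not part of the official schema docs:

{
    "name": "sku",                      # hypothetical field
    "selector": ".product-meta",
    "type": "regex",
    "pattern": r"SKU:\s*([A-Z0-9-]+)",  # regex applied to the selected text
    "default": None
}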
Chunking Strategies

All chunking strategies inherit from ChunkingStrategy and implement the chunk(text: str) -> list method.
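A minimal custom chunker, assuming the base class lives in crawl4ai.chunking_strategy as the imports in the usage examples below suggest; the class itself is hypothetical:

from crawl4ai.chunking_strategy import ChunkingStrategy

class ParagraphChunking(ChunkingStrategy):
    """Hypothetical chunker that splits on blank lines."""

    def chunk(self, text: str) -> list:
        return [p.strip() for p in text.split("\n\n") if p.strip()]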
RegexChunking

Splits text based on regex patterns.
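The constructor signature is not shown here; based on the other chunkers in this section, a hedged usage sketch (the patterns parameter and its default are assumptions, so check your version's signature):

from crawl4ai.chunking_strategy import RegexChunking

# Split on blank lines (pattern list is an assumption)
chunker = RegexChunking(patterns=[r"\n\n"])
chunks = chunker.chunk("First paragraph.\n\nSecond paragraph.")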
SlidingWindowChunking

Creates overlapping chunks with a sliding window approach.
SlidingWindowChunking(
    window_size: int = 100,  # Window size in words
    step: int = 50           # Step size between windows
)
OverlappingWindowChunking

Creates chunks with specified overlap.
OverlappingWindowChunking(
    window_size: int = 1000,  # Chunk size in words
    overlap: int = 100        # Overlap size in words
)
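Both window chunkers are used the same way: call chunk() on plain text and get back a list of word-window strings. A quick sketch with placeholder input:

from crawl4ai.chunking_strategy import OverlappingWindowChunking

chunker = OverlappingWindowChunking(window_size=1000, overlap=100)
text = "word " * 2500  # placeholder document
chunks = chunker.chunk(text)
print(len(chunks), "chunks; consecutive chunks share 100 words")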
Usage Examples

LLM Extraction
import json

from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler, LLMConfig, LLMExtractionStrategy

# Define schema
class Article(BaseModel):
    title: str
    content: str
    author: str

# Create strategy
strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="ollama/llama2"),
    schema=Article.model_json_schema(),  # Article.schema() on Pydantic v1
    instruction="Extract article details"
)

# Use with crawler
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/article",
        extraction_strategy=strategy
    )

# Access extracted data
data = json.loads(result.extracted_content)
Regex Extraction
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, RegexExtractionStrategy

# Method 1: Use built-in patterns
strategy = RegexExtractionStrategy(
    pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.Url
)

# Method 2: Use custom patterns
price_pattern = {"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"}
strategy = RegexExtractionStrategy(custom=price_pattern)

# Method 3: Generate pattern with LLM assistance (one-time)
async with AsyncWebCrawler() as crawler:
    # Get sample HTML first
    sample_result = await crawler.arun("https://example.com/products")
    html = sample_result.fit_html

    # Generate regex pattern once
    pattern = RegexExtractionStrategy.generate_pattern(
        label="price",
        html=html,
        query="Product prices in USD format",
        llm_config=LLMConfig(provider="openai/gpt-4o-mini")
    )

    # Save pattern for reuse
    with open("price_pattern.json", "w") as f:
        json.dump(pattern, f)

    # Use pattern for extraction (no LLM calls)
    strategy = RegexExtractionStrategy(custom=pattern)
    result = await crawler.arun(
        url="https://example.com/products",
        config=CrawlerRunConfig(extraction_strategy=strategy)
    )

    # Process results
    data = json.loads(result.extracted_content)
    for item in data:
        print(f"{item['label']}: {item['value']}")
CSS Extraction
from crawl4ai import AsyncWebCrawler, JsonCssExtractionStrategy

# Define schema
schema = {
    "name": "Product List",
    "baseSelector": ".product-card",
    "fields": [
        {
            "name": "title",
            "selector": "h2.title",
            "type": "text"
        },
        {
            "name": "price",
            "selector": ".price",
            "type": "text",
            "transform": "strip"
        },
        {
            "name": "image",
            "selector": "img",
            "type": "attribute",
            "attribute": "src"
        }
    ]
}

# Create and use strategy
strategy = JsonCssExtractionStrategy(schema)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/products",
        extraction_strategy=strategy
    )
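The strategy writes its output to result.extracted_content as a JSON string, so parsing mirrors the LLM example; a short follow-up sketch (result is the CrawlResult bound above, and the keys match the schema fields):

import json

products = json.loads(result.extracted_content)
for product in products:
    print(product["title"], product["price"], product["image"])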
Content Chunking
from crawl4ai import AsyncWebCrawler, LLMConfig, LLMExtractionStrategy
from crawl4ai.chunking_strategy import OverlappingWindowChunking

# Create chunking strategy
chunker = OverlappingWindowChunking(
    window_size=500,  # 500 words per chunk
    overlap=50        # 50 words overlap
)

# Use with extraction strategy
strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="ollama/llama2"),
    chunking_strategy=chunker
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/long-article",
        extraction_strategy=strategy
    )
Best Practices

- Choose the Right Strategy
  - Use RegexExtractionStrategy for common data types like emails, phones, URLs, and dates
  - Use JsonCssExtractionStrategy for well-structured HTML with consistent patterns
  - Use LLMExtractionStrategy for complex, unstructured content requiring reasoning
  - Use CosineStrategy for content similarity and clustering

- Strategy Selection Guide
  - Is the target data a common type (email/phone/date/URL)? → RegexExtractionStrategy
  - Does the page have consistent HTML structure? → JsonCssExtractionStrategy or JsonXPathExtractionStrategy
  - Is the data semantically complex or unstructured? → LLMExtractionStrategy
  - Need to find content similar to a specific topic? → CosineStrategy

- Optimize Chunking

  # For long documents
  strategy = LLMExtractionStrategy(
      chunk_token_threshold=2000,  # Smaller chunks
      overlap_rate=0.1             # 10% overlap
  )

- Combine Strategies for Best Performance

  # First pass: Extract structure with CSS
  css_strategy = JsonCssExtractionStrategy(product_schema)
  css_result = await crawler.arun(url, config=CrawlerRunConfig(extraction_strategy=css_strategy))
  product_data = json.loads(css_result.extracted_content)

  # Second pass: Extract specific fields with regex
  descriptions = [product["description"] for product in product_data]
  regex_strategy = RegexExtractionStrategy(
      pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS,
      custom={"dimension": r"\d+x\d+x\d+ (?:cm|in)"}
  )

  # Process descriptions with regex
  for text in descriptions:
      matches = regex_strategy.extract("", text)  # Direct extraction

- Handle Errors

  try:
      result = await crawler.arun(
          url="https://example.com",
          extraction_strategy=strategy
      )
      if result.success:
          content = json.loads(result.extracted_content)
  except Exception as e:
      print(f"Extraction failed: {e}")

- Monitor Performance

  strategy = CosineStrategy(
      verbose=True,             # Enable logging
      word_count_threshold=20,  # Filter short content
      top_k=5                   # Limit results
  )

- Cache Generated Patterns

  # For RegexExtractionStrategy pattern generation
  import json
  from pathlib import Path

  cache_dir = Path("./pattern_cache")
  cache_dir.mkdir(exist_ok=True)
  pattern_file = cache_dir / "product_pattern.json"

  if pattern_file.exists():
      with open(pattern_file) as f:
          pattern = json.load(f)
  else:
      # Generate once with LLM
      pattern = RegexExtractionStrategy.generate_pattern(...)
      with open(pattern_file, "w") as f:
          json.dump(pattern, f)