Extraction & Chunking Strategies API

This document covers the API reference for extraction and chunking strategies in Crawl4AI.

Extraction Strategies

All extraction strategies inherit from the base ExtractionStrategy class and implement two key methods:

- extract(url: str, html: str) -> List[Dict[str, Any]]
- run(url: str, sections: List[str]) -> List[Dict[str, Any]]
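
As a minimal sketch of that contract (assuming the base class is importable from crawl4ai.extraction_strategy, matching the module layout of the chunking imports used later in this document; the class and its regex are purely illustrative):

import re
from typing import Any, Dict, List

from crawl4ai.extraction_strategy import ExtractionStrategy

class TitleExtractionStrategy(ExtractionStrategy):
    """Hypothetical example: pull the <title> text out of each HTML fragment."""

    def extract(self, url: str, html: str, *args, **kwargs) -> List[Dict[str, Any]]:
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
        return [{"url": url, "title": match.group(1).strip()}] if match else []

    def run(self, url: str, sections: List[str], *args, **kwargs) -> List[Dict[str, Any]]:
        # Aggregate extract() across the pre-chunked sections
        results: List[Dict[str, Any]] = []
        for section in sections:
            results.extend(self.extract(url, section))
        return results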

LLMExtractionStrategy

Used for extracting structured data with language models.

LLMExtractionStrategy(
    # Required Parameters
    provider: str = DEFAULT_PROVIDER,     # LLM provider (e.g., "ollama/llama2")
    api_token: Optional[str] = None,      # API token

    # Extraction Configuration
    instruction: str = None,              # Custom extraction instruction
    schema: Dict = None,                  # Pydantic model schema for structured data
    extraction_type: str = "block",       # "block" or "schema"

    # Chunking Parameters
    chunk_token_threshold: int = 4000,    # Maximum tokens per chunk
    overlap_rate: float = 0.1,           # Overlap between chunks
    word_token_rate: float = 0.75,       # Word to token conversion rate
    apply_chunking: bool = True,         # Enable/disable chunking

    # API Configuration
    base_url: str = None,                # Base URL for API
    extra_args: Dict = {},               # Additional provider arguments
    verbose: bool = False                # Enable verbose logging
)
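
The chunking parameters combine as roughly words_per_chunk ≈ chunk_token_threshold × word_token_rate, assuming the common ~0.75 words-per-token heuristic (the exact interpretation of word_token_rate is an assumption here). With the defaults:

# Back-of-the-envelope chunk sizing with the default parameters
chunk_token_threshold = 4000
word_token_rate = 0.75
overlap_rate = 0.1

words_per_chunk = int(chunk_token_threshold * word_token_rate)  # ~3000 words
overlap_words = int(words_per_chunk * overlap_rate)             # ~300 words shared between chunks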

RegexExtractionStrategy

Used for fast pattern-based extraction of common entities with regular expressions.

RegexExtractionStrategy(
    # Pattern Configuration
    pattern: IntFlag = RegexExtractionStrategy.Nothing,  # Bit flags of built-in patterns to use
    custom: Optional[Dict[str, str]] = None,           # Custom pattern dictionary {label: regex}

    # Input Format
    input_format: str = "fit_html",                    # "html", "markdown", "text" or "fit_html"
)

# Built-in Patterns as Bit Flags
RegexExtractionStrategy.Email           # Email addresses
RegexExtractionStrategy.PhoneIntl       # International phone numbers 
RegexExtractionStrategy.PhoneUS         # US-format phone numbers
RegexExtractionStrategy.Url             # HTTP/HTTPS URLs
RegexExtractionStrategy.IPv4            # IPv4 addresses
RegexExtractionStrategy.IPv6            # IPv6 addresses
RegexExtractionStrategy.Uuid            # UUIDs
RegexExtractionStrategy.Currency        # Currency values (USD, EUR, etc)
RegexExtractionStrategy.Percentage      # Percentage values
RegexExtractionStrategy.Number          # Numeric values
RegexExtractionStrategy.DateIso         # ISO format dates
RegexExtractionStrategy.DateUS          # US format dates
RegexExtractionStrategy.Time24h         # 24-hour format times
RegexExtractionStrategy.PostalUS        # US postal codes
RegexExtractionStrategy.PostalUK        # UK postal codes
RegexExtractionStrategy.HexColor        # HTML hex color codes
RegexExtractionStrategy.TwitterHandle   # Twitter handles
RegexExtractionStrategy.Hashtag         # Hashtags
RegexExtractionStrategy.MacAddr         # MAC addresses
RegexExtractionStrategy.Iban            # International bank account numbers
RegexExtractionStrategy.CreditCard      # Credit card numbers
RegexExtractionStrategy.All             # All available patterns

CosineStrategy

Used for content similarity-based extraction and clustering.

CosineStrategy(
    # Content Filtering
    semantic_filter: str = None,        # Topic/keyword filter
    word_count_threshold: int = 10,     # Minimum words per cluster
    sim_threshold: float = 0.3,         # Similarity threshold

    # Clustering Parameters
    max_dist: float = 0.2,             # Maximum cluster distance
    linkage_method: str = 'ward',       # Clustering method
    top_k: int = 3,                    # Top clusters to return

    # Model Configuration
    model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',  # Embedding model

    verbose: bool = False              # Enable verbose logging
)
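
For instance, to keep only clusters related to a given topic (a sketch in the same style as the usage examples below; it assumes CosineStrategy is exported from the top-level package like the other strategies, and that crawler is an open AsyncWebCrawler):

from crawl4ai import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="machine learning",  # Keep clusters matching this topic
    word_count_threshold=10,             # Drop very short clusters
    top_k=3                              # Return the three best clusters
)

result = await crawler.arun(
    url="https://example.com/blog",
    extraction_strategy=strategy
)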

JsonCssExtractionStrategy

Used for CSS selector-based structured data extraction.

JsonCssExtractionStrategy(
    schema: Dict[str, Any],    # Extraction schema
    verbose: bool = False      # Enable verbose logging
)

# Schema Structure
schema = {
    "name": str,              # Schema name
    "baseSelector": str,      # Base CSS selector
    "fields": [               # List of fields to extract
        {
            "name": str,      # Field name
            "selector": str,  # CSS selector
            "type": str,     # Field type: "text", "attribute", "html", "regex"
            "attribute": str, # For type="attribute"
            "pattern": str,  # For type="regex"
            "transform": str, # Optional: "lowercase", "uppercase", "strip"
            "default": Any    # Default value if extraction fails
        }
    ]
}
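
For example, a field of type "regex" applies pattern to the text matched by selector, with default as the fallback when nothing matches (illustrative field values):

{
    "name": "sku",
    "selector": ".product-meta",
    "type": "regex",
    "pattern": r"SKU:\s*([A-Z0-9-]+)",  # Capture the SKU token after the label
    "default": "unknown"                # Returned if selector or pattern fails
}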

Chunking Strategies

All chunking strategies inherit from ChunkingStrategy and implement the chunk(text: str) -> list method.
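
A custom strategy therefore only needs a chunk() method (a minimal sketch, assuming the base class lives in crawl4ai.chunking_strategy alongside the built-in chunkers; the sentence splitter itself is illustrative):

import re

from crawl4ai.chunking_strategy import ChunkingStrategy

class SentenceChunking(ChunkingStrategy):
    """Hypothetical example: split text into sentence-like chunks."""

    def chunk(self, text: str) -> list:
        # Split after sentence-ending punctuation followed by whitespace
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]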

RegexChunking

Splits text based on regex patterns.

RegexChunking(
    patterns: List[str] = None  # Regex patterns for splitting
                               # Default: [r'\n\n']
)
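
For example, splitting on blank lines and horizontal-rule markers (document_text stands for any input string):

chunker = RegexChunking(patterns=[r"\n\n", r"\n---\n"])
chunks = chunker.chunk(document_text)  # One chunk per paragraph/section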

SlidingWindowChunking

Creates overlapping chunks using a sliding-window approach.

SlidingWindowChunking(
    window_size: int = 100,    # Window size in words
    step: int = 50             # Step size between windows
)
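
With the defaults, consecutive windows share 50 words, so each word lands in at most two chunks (long_text stands for any input string):

chunker = SlidingWindowChunking(window_size=100, step=50)
chunks = chunker.chunk(long_text)  # 100-word chunks, 50-word overlap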

OverlappingWindowChunking

Creates chunks with a specified overlap.

OverlappingWindowChunking(
    window_size: int = 1000,   # Chunk size in words
    overlap: int = 100         # Overlap size in words
)

Usage Examples

LLM Extraction

import json

from pydantic import BaseModel
from crawl4ai import LLMExtractionStrategy, LLMConfig

# Define schema
class Article(BaseModel):
    title: str
    content: str
    author: str

# Create strategy
strategy = LLMExtractionStrategy(
    llm_config = LLMConfig(provider="ollama/llama2"),
    schema=Article.schema(),
    instruction="Extract article details"
)

# Use with crawler
result = await crawler.arun(
    url="https://example.com/article",
    extraction_strategy=strategy
)

# Access extracted data
data = json.loads(result.extracted_content)

Regex Extraction

import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, RegexExtractionStrategy

# Method 1: Use built-in patterns
strategy = RegexExtractionStrategy(
    pattern = RegexExtractionStrategy.Email | RegexExtractionStrategy.Url
)

# Method 2: Use custom patterns
price_pattern = {"usd_price": r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"}
strategy = RegexExtractionStrategy(custom=price_pattern)

# Method 3: Generate pattern with LLM assistance (one-time)
from crawl4ai import LLMConfig

async with AsyncWebCrawler() as crawler:
    # Get sample HTML first
    sample_result = await crawler.arun("https://example.com/products")
    html = sample_result.fit_html

    # Generate regex pattern once
    pattern = RegexExtractionStrategy.generate_pattern(
        label="price",
        html=html,
        query="Product prices in USD format",
        llm_config=LLMConfig(provider="openai/gpt-4o-mini")
    )

    # Save pattern for reuse (json is already imported above)
    with open("price_pattern.json", "w") as f:
        json.dump(pattern, f)

    # Use pattern for extraction (no LLM calls)
    strategy = RegexExtractionStrategy(custom=pattern)
    result = await crawler.arun(
        url="https://example.com/products",
        config=CrawlerRunConfig(extraction_strategy=strategy)
    )

    # Process results
    data = json.loads(result.extracted_content)
    for item in data:
        print(f"{item['label']}: {item['value']}")

CSS Extraction

from crawl4ai import JsonCssExtractionStrategy

# Define schema
schema = {
    "name": "Product List",
    "baseSelector": ".product-card",
    "fields": [
        {
            "name": "title",
            "selector": "h2.title",
            "type": "text"
        },
        {
            "name": "price",
            "selector": ".price",
            "type": "text",
            "transform": "strip"
        },
        {
            "name": "image",
            "selector": "img",
            "type": "attribute",
            "attribute": "src"
        }
    ]
}

# Create and use strategy
strategy = JsonCssExtractionStrategy(schema)
result = await crawler.arun(
    url="https://example.com/products",
    extraction_strategy=strategy
)

Content Chunking

from crawl4ai.chunking_strategy import OverlappingWindowChunking
from crawl4ai import LLMConfig, LLMExtractionStrategy

# Create chunking strategy
chunker = OverlappingWindowChunking(
    window_size=500,  # 500 words per chunk
    overlap=50        # 50 words overlap
)

# Use with extraction strategy
strategy = LLMExtractionStrategy(
    llm_config = LLMConfig(provider="ollama/llama2"),
    chunking_strategy=chunker
)

result = await crawler.arun(
    url="https://example.com/long-article",
    extraction_strategy=strategy
)

Best Practices

  1. Choose the right strategy
    - Use RegexExtractionStrategy for common data types such as emails, phones, URLs, and dates
    - Use JsonCssExtractionStrategy for well-structured HTML with consistent patterns
    - Use LLMExtractionStrategy for complex, unstructured content that requires reasoning
    - Use CosineStrategy for content similarity and clustering
  2. Strategy selection guide
    Is the target data a common type (email/phone/date/URL)? 
    → RegexExtractionStrategy
    
    Does the page have consistent HTML structure?
    → JsonCssExtractionStrategy or JsonXPathExtractionStrategy
    
    Is the data semantically complex or unstructured?
    → LLMExtractionStrategy
    
    Need to find content similar to a specific topic?
    → CosineStrategy
    
  3. Optimize chunking
    # For long documents
    strategy = LLMExtractionStrategy(
        chunk_token_threshold=2000,  # Smaller chunks
        overlap_rate=0.1           # 10% overlap
    )
    
  4. Combine strategies for best results
    # First pass: Extract structure with CSS
    css_strategy = JsonCssExtractionStrategy(product_schema)
    css_result = await crawler.arun(url, config=CrawlerRunConfig(extraction_strategy=css_strategy))
    product_data = json.loads(css_result.extracted_content)
    
    # Second pass: Extract specific fields with regex
    descriptions = [product["description"] for product in product_data]
    regex_strategy = RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS,
        custom={"dimension": r"\d+x\d+x\d+ (?:cm|in)"}
    )
    
    # Process descriptions with regex
    for text in descriptions:
        matches = regex_strategy.extract("", text)  # Direct extraction
    
  5. Handle errors
    try:
        result = await crawler.arun(
            url="https://example.com",
            extraction_strategy=strategy
        )
        if result.success:
            content = json.loads(result.extracted_content)
    except Exception as e:
        print(f"Extraction failed: {e}")
    
  6. Monitor performance
    strategy = CosineStrategy(
        verbose=True,  # Enable logging
        word_count_threshold=20,  # Filter short content
        top_k=5  # Limit results
    )
    
  7. Cache generated patterns
    # For RegexExtractionStrategy pattern generation
    import json
    from pathlib import Path
    
    cache_dir = Path("./pattern_cache")
    cache_dir.mkdir(exist_ok=True)
    pattern_file = cache_dir / "product_pattern.json"
    
    if pattern_file.exists():
        with open(pattern_file) as f:
            pattern = json.load(f)
    else:
        # Generate once with LLM
        pattern = RegexExtractionStrategy.generate_pattern(...)
        with open(pattern_file, "w") as f:
            json.dump(pattern, f)
    
