Cosine Strategy

The Cosine Strategy in Crawl4AI uses similarity-based clustering to identify and extract relevant content sections from web pages. It is particularly useful when you need to find content by semantic similarity rather than by structural patterns.

How It Works

The Cosine Strategy:

1. Breaks down page content into meaningful chunks
2. Converts text into vector representations
3. Calculates similarity between chunks
4. Clusters similar content together
5. Ranks and filters content based on relevance
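The five steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not Crawl4AI's implementation: word-count vectors stand in for sentence embeddings, a greedy pass stands in for hierarchical clustering, and the helper names (`chunk_text`, `vectorize`, `cluster_chunks`, `rank_clusters`) are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str) -> list[str]:
    """Step 1: split on blank lines and drop empty chunks."""
    return [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]


def vectorize(chunk: str) -> Counter:
    """Step 2: a word-count vector (a stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", chunk.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Step 3: cosine of the angle between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_chunks(chunks: list[str], sim_threshold: float) -> list[list[str]]:
    """Step 4: greedy clustering; a chunk joins the first cluster whose
    first member is at least `sim_threshold` similar, else starts a new one."""
    clusters: list[list[str]] = []
    for chunk in chunks:
        vec = vectorize(chunk)
        for cluster in clusters:
            if cosine_similarity(vec, vectorize(cluster[0])) >= sim_threshold:
                cluster.append(chunk)
                break
        else:
            clusters.append([chunk])
    return clusters


def rank_clusters(clusters: list[list[str]], query: str, top_k: int) -> list[list[str]]:
    """Step 5: score each cluster against the query and keep the best `top_k`."""
    q = vectorize(query)
    scored = [(cosine_similarity(vectorize(" ".join(c)), q), c) for c in clusters]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


# Hypothetical sample page: two review-like chunks and one unrelated chunk
page = (
    "Great phone, battery lasts long.\n\n"
    "The battery on this phone is great.\n\n"
    "Shipping policy: returns within 30 days."
)
chunks = chunk_text(page)
clusters = cluster_chunks(chunks, sim_threshold=0.3)
best = rank_clusters(clusters, query="phone battery review", top_k=1)
```

With this sample, the two battery-related chunks end up in one cluster and the shipping text in another, and the query selects the battery cluster.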

Basic Usage

import asyncio

from crawl4ai import AsyncWebCrawler, CosineStrategy

strategy = CosineStrategy(
    semantic_filter="product reviews",    # Target content type
    word_count_threshold=10,             # Minimum words per cluster
    sim_threshold=0.3                    # Similarity threshold
)

async def main():
    # `async with` must run inside a coroutine
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/reviews",
            extraction_strategy=strategy
        )

        content = result.extracted_content
        print(content)

asyncio.run(main())

Configuration Options

Core Parameters

CosineStrategy(
    # Content Filtering
    semantic_filter: str = None,       # Keywords/topic for content filtering
    word_count_threshold: int = 10,    # Minimum words per cluster
    sim_threshold: float = 0.3,        # Similarity threshold (0.0 to 1.0)

    # Clustering Parameters
    max_dist: float = 0.2,            # Maximum distance for clustering
    linkage_method: str = 'ward',      # Clustering linkage method
    top_k: int = 3,                   # Number of top categories to extract

    # Model Configuration
    model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',  # Embedding model

    verbose: bool = False             # Enable logging
)

Parameter Details

1. semantic_filter
   - Sets the target topic or content type
   - Use keywords relevant to your desired content
   - Example: "technical specifications", "user reviews", "pricing information"

2. sim_threshold
   - Controls how similar content must be to be grouped together
   - Higher values (e.g., 0.8) mean stricter matching
   - Lower values (e.g., 0.3) allow more variation

# Strict matching
strategy = CosineStrategy(sim_threshold=0.8)

# Loose matching
strategy = CosineStrategy(sim_threshold=0.3)

3. word_count_threshold
   - Filters out short content blocks
   - Helps eliminate noise and irrelevant content

# Only consider substantial paragraphs
strategy = CosineStrategy(word_count_threshold=50)
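As a rough illustration of what a word-count filter does, the sketch below drops blocks shorter than the threshold. The function name `filter_short_blocks`, the whitespace tokenization, and the `>=` comparison are assumptions made for the example; the library's own tokenization and boundary handling may differ.

```python
def filter_short_blocks(blocks: list[str], word_count_threshold: int = 10) -> list[str]:
    """Keep blocks with at least `word_count_threshold` whitespace-separated words."""
    return [b for b in blocks if len(b.split()) >= word_count_threshold]


# Hypothetical blocks: a navigation bar and a review sentence
blocks = [
    "Home | About | Contact",
    "This review covers battery life, screen quality, and overall value after two weeks of daily use.",
]
kept = filter_short_blocks(blocks, word_count_threshold=10)
```

Here the navigation text (5 tokens) is discarded and the 16-word review sentence is kept.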

4. top_k
   - Number of top content clusters to return
   - Higher values return more diverse content

# Get top 5 most relevant content clusters
strategy = CosineStrategy(top_k=5)
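Conceptually, `top_k` keeps only the best-scoring clusters. A minimal sketch of that selection, assuming each cluster already has a relevance score (the labels and scores below are made up, and `top_k_clusters` is a hypothetical helper):

```python
import heapq


def top_k_clusters(scores: dict[str, float], top_k: int = 3) -> list[str]:
    """Return the `top_k` cluster labels with the highest scores, best first."""
    return heapq.nlargest(top_k, scores, key=scores.get)


# Hypothetical relevance scores per cluster label
scores = {"reviews": 0.82, "pricing": 0.41, "footer": 0.05, "specs": 0.67}
selected = top_k_clusters(scores, top_k=2)
```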

Use Cases

1. Article Content Extraction

strategy = CosineStrategy(
    semantic_filter="main article content",
    word_count_threshold=100,  # Longer blocks for articles
    top_k=1                   # Usually want single main content
)

# Run inside an open AsyncWebCrawler context (see Basic Usage)
result = await crawler.arun(
    url="https://example.com/blog/post",
    extraction_strategy=strategy
)

2. Product Review Analysis

strategy = CosineStrategy(
    semantic_filter="customer reviews and ratings",
    word_count_threshold=20,   # Reviews can be shorter
    top_k=10,                 # Get multiple reviews
    sim_threshold=0.4         # Allow variety in review content
)

3. Technical Documentation

strategy = CosineStrategy(
    semantic_filter="technical specifications documentation",
    word_count_threshold=30,
    sim_threshold=0.6,        # Stricter matching for technical content
    max_dist=0.3             # Allow related technical sections
)

Advanced Features

Custom Clustering

strategy = CosineStrategy(
    linkage_method='complete',  # Alternative clustering method
    max_dist=0.4,              # Larger clusters
    model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'  # Multilingual support
)

Content Filtering Pipeline

import json

from crawl4ai import AsyncWebCrawler, CosineStrategy

strategy = CosineStrategy(
    semantic_filter="pricing plans features",
    word_count_threshold=15,
    sim_threshold=0.5,
    top_k=3
)

async def extract_pricing_features(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=strategy
        )

        if result.success:
            content = json.loads(result.extracted_content)
            return {
                'pricing_features': content,
                'clusters': len(content),
                # 'score' may be absent from an item; .get() avoids a KeyError
                'similarity_scores': [item.get('score') for item in content]
            }

        return None  # Crawl failed; inspect result.error_message
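To inspect what the pipeline returns without running a crawl, the extracted JSON can be summarized offline. The sketch below is an assumption-laden example: it presumes `extracted_content` is a JSON list of objects with `index`, `tags`, and `content` keys, which may not match every Crawl4AI version, and both `summarize_clusters` and the sample payload are hypothetical.

```python
import json


def summarize_clusters(extracted_content: str, preview_chars: int = 60) -> list[dict]:
    """Build a compact per-cluster summary from the extracted JSON string."""
    summary = []
    for item in json.loads(extracted_content):
        text = item.get("content", "")
        summary.append({
            "index": item.get("index"),
            "tags": item.get("tags", []),
            "words": len(text.split()),
            "preview": text[:preview_chars],
        })
    return summary


# Hypothetical payload shaped like the assumed output
sample = json.dumps([
    {"index": 0, "tags": ["pricing"], "content": "Pro plan costs $20 per month."},
    {"index": 1, "tags": [], "content": "Free plan includes 3 projects."},
])
summary = summarize_clusters(sample)
```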

Best Practices

1. Adjust Thresholds Iteratively
   - Start with default values
   - Adjust based on results
   - Monitor clustering quality

2. Choose Appropriate Word Count Thresholds
   - Higher for articles (100+)
   - Lower for reviews/comments (20+)
   - Medium for product descriptions (50+)
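One way to keep these starting points in one place is a small lookup table. The preset names and the `word_count_for` helper are hypothetical conveniences, not part of Crawl4AI; the numbers are the rough guidelines listed above and the fallback of 10 is the documented default.

```python
# Starting points taken from the guidelines above; tune per site
WORD_COUNT_PRESETS = {
    "article": 100,
    "product_description": 50,
    "review": 20,
}


def word_count_for(content_type: str, default: int = 10) -> int:
    """Look up a starting word_count_threshold, falling back to the default."""
    return WORD_COUNT_PRESETS.get(content_type, default)


threshold = word_count_for("review")
```

The result can then be passed as `word_count_threshold` when constructing the strategy.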

3. Optimize Performance

strategy = CosineStrategy(
    word_count_threshold=10,  # Filter early
    top_k=5,                 # Limit results
    verbose=True             # Monitor performance
)

4. Handle Different Content Types

# For mixed content pages
strategy = CosineStrategy(
    semantic_filter="product features",
    sim_threshold=0.4,      # More flexible matching
    max_dist=0.3,          # Larger clusters
    top_k=3                # Multiple relevant sections
)

Error Handling

import json

# Assumes `crawler` and `strategy` are set up as in Basic Usage
try:
    result = await crawler.arun(
        url="https://example.com",
        extraction_strategy=strategy
    )

    if result.success:
        content = json.loads(result.extracted_content)
        if not content:
            print("No relevant content found")
    else:
        print(f"Extraction failed: {result.error_message}")

except Exception as e:
    print(f"Error during extraction: {e}")
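Parsing can fail separately from the crawl itself, for example if `extracted_content` is empty or not valid JSON. A defensive parser along these lines may help; `parse_extracted_content` is a hypothetical helper, and treating any non-list payload as "no content" is a design assumption made for this sketch.

```python
import json
from typing import Any, Optional


def parse_extracted_content(raw: Optional[str]) -> list[Any]:
    """Return parsed clusters, or [] if the payload is missing, malformed, or not a list."""
    if not raw:
        return []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) else []


parsed = parse_extracted_content('[{"content": "x"}]')
```

Returning an empty list keeps the caller's "no relevant content found" branch as the single fallback path.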

The Cosine Strategy is particularly effective when:

- Content structure is inconsistent
- You need semantic understanding
- You want to find similar content blocks
- Structure-based extraction (CSS/XPath) isn't reliable

It works well with other strategies and can be used as a pre-processing step for LLM-based extraction.
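As a sketch of the pre-processing idea, the extracted clusters can be joined into a size-limited context string before being handed to an LLM. `build_llm_context` is a hypothetical helper, the `content` key is an assumed field of each cluster, and the character budget is an arbitrary stand-in for a real token limit.

```python
def build_llm_context(clusters: list[dict], max_chars: int = 2000) -> str:
    """Join cluster texts in order, stopping before the character budget is exceeded."""
    parts: list[str] = []
    used = 0
    for item in clusters:
        text = item.get("content", "").strip()
        if not text:
            continue  # skip empty clusters
        cost = len(text) + (2 if parts else 0)  # 2 accounts for the "\n\n" separator
        if used + cost > max_chars:
            break
        parts.append(text)
        used += cost
    return "\n\n".join(parts)


# Hypothetical clusters, already ordered by relevance
clusters = [{"content": "aaaa"}, {"content": ""}, {"content": "bbbb"}, {"content": "cccc"}]
context = build_llm_context(clusters, max_chars=10)
```

Because the clusters arrive ranked, truncating from the end drops the least relevant text first.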

