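Cosine Strategy
The Cosine Strategy in Crawl4AI uses similarity-based clustering to identify and extract relevant content sections from web pages. It is particularly useful when you need to find and extract content based on semantic similarity rather than structural patterns.
How It Works
The Cosine Strategy:
1. Breaks the page content into meaningful chunks
2. Converts the text into vector representations
3. Calculates the similarity between chunks
4. Clusters similar content together
5. Ranks and filters the content by relevance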
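The five steps above can be sketched in plain Python. This is a simplified illustration, not Crawl4AI's implementation: it assumes bag-of-words counts in place of a real embedding model and a greedy grouping rule in place of the library's clustering, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def split_into_chunks(text):
    """Step 1: split on blank lines into candidate chunks."""
    return [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]


def vectorize(chunk):
    """Step 2: bag-of-words counts (a stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", chunk.lower()))


def cosine(a, b):
    """Step 3: cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, sim_threshold):
    """Step 4: a chunk joins the first cluster whose seed chunk is similar enough."""
    clusters = []  # each cluster is a list of chunk indices; index 0 is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= sim_threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def rank(clusters, vectors, query, top_k):
    """Step 5: order clusters by their best similarity to the query, keep top_k."""
    q = vectorize(query)
    scored = [(max(cosine(q, vectors[i]) for i in members), members) for members in clusters]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def extract(text, query, sim_threshold=0.3, top_k=3):
    """Run all five steps and return the matching chunks grouped by cluster."""
    chunks = split_into_chunks(text)
    vectors = [vectorize(c) for c in chunks]
    ranked = rank(cluster(vectors, sim_threshold), vectors, query, top_k)
    return [[chunks[i] for i in members] for score, members in ranked if score > 0]


PAGE = (
    "Great battery life and the battery charges fast.\n\n"
    "The battery life is great.\n\n"
    "Shipping policy and returns information."
)
result = extract(PAGE, "battery life")
```

With this sample page, the two battery-related paragraphs end up in one cluster and the shipping paragraph is dropped because it shares no terms with the query.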
Basic Usage
import asyncio
from crawl4ai import AsyncWebCrawler, CosineStrategy

strategy = CosineStrategy(
    semantic_filter="product reviews",  # Target content type
    word_count_threshold=10,            # Minimum words per cluster
    sim_threshold=0.3                   # Similarity threshold
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/reviews",
            extraction_strategy=strategy
        )
        content = result.extracted_content
        print(content)

asyncio.run(main())
Configuration Options
Core Parameters
CosineStrategy(
    # Content Filtering
    semantic_filter: str = None,        # Keywords/topic for content filtering
    word_count_threshold: int = 10,     # Minimum words per cluster
    sim_threshold: float = 0.3,         # Similarity threshold (0.0 to 1.0)
    # Clustering Parameters
    max_dist: float = 0.2,              # Maximum distance for clustering
    linkage_method: str = 'ward',       # Clustering linkage method
    top_k: int = 3,                     # Number of top categories to extract
    # Model Configuration
    model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',  # Embedding model
    verbose: bool = False               # Enable logging
)
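Parameter Details
1. semantic_filter
   - Sets the target topic or content type
   - Use keywords related to the content you want
   - Examples: "technical specifications", "user reviews", "pricing information"
2. sim_threshold
   - Controls how similar content must be to be grouped together
   - Higher values (e.g., 0.8) mean stricter matching
   - Lower values (e.g., 0.3) allow more variation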
# Strict matching
strategy = CosineStrategy(sim_threshold=0.8)
# Loose matching
strategy = CosineStrategy(sim_threshold=0.3)
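To make the effect of the threshold concrete, here is a minimal sketch that applies both values to the same set of vectors. The three-dimensional embeddings are made up for illustration (a real model produces hundreds of dimensions), and `similar_pairs` is a hypothetical helper, not part of Crawl4AI.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical embeddings for four page blocks.
embeddings = {
    "review_a": [0.9, 0.1, 0.0],
    "review_b": [0.8, 0.3, 0.1],
    "review_c": [0.4, 0.6, 0.5],
    "footer": [0.0, 0.1, 0.9],
}


def similar_pairs(vectors, sim_threshold):
    """Return every pair of blocks whose similarity meets the threshold."""
    names = sorted(vectors)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if cosine_similarity(vectors[a], vectors[b]) >= sim_threshold
    ]


strict = similar_pairs(embeddings, 0.8)  # only near-identical blocks pair up
loose = similar_pairs(embeddings, 0.3)   # loosely related blocks pair up too
```

At 0.8 only `review_a` and `review_b` are paired. At 0.3 the looser `review_c` joins the reviews, but it also pairs with `footer`, which shows how a low threshold can let noise in.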
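3. word_count_threshold
   - Filters out short content blocks
   - Helps eliminate noise and irrelevant content
4. top_k
   - Number of top content clusters to return
   - Higher values return more diverse content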
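The interaction between `word_count_threshold` and `top_k` can be sketched as a two-stage filter. This is an assumption about the general idea, not the library's code; the scores and the `filter_and_limit` helper are hypothetical.

```python
def filter_and_limit(clusters, word_count_threshold=10, top_k=3):
    """Drop clusters with too few words, then keep the top_k highest-scoring ones.

    `clusters` is a list of (score, text) tuples; the scores are illustrative.
    """
    long_enough = [
        (score, text)
        for score, text in clusters
        if len(text.split()) >= word_count_threshold
    ]
    long_enough.sort(key=lambda pair: pair[0], reverse=True)
    return long_enough[:top_k]


candidates = [
    (0.91, "The battery easily lasts two full days of heavy use"),
    (0.88, "Buy now"),
    (0.75, "Screen brightness is good outdoors but the speakers sound thin and quiet"),
    (0.40, "Copyright notice and legal terms for all pages on this site"),
]
kept = filter_and_limit(candidates, word_count_threshold=5, top_k=2)
```

Here the two-word "Buy now" block is removed despite its high score, and only the two best remaining clusters are returned.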
Use Cases
1. Article Content Extraction
strategy = CosineStrategy(
    semantic_filter="main article content",
    word_count_threshold=100,  # Longer blocks for articles
    top_k=1                    # Usually want single main content
)
# Run inside an `async with AsyncWebCrawler() as crawler:` block
result = await crawler.arun(
    url="https://example.com/blog/post",
    extraction_strategy=strategy
)
2. Product Review Analysis
strategy = CosineStrategy(
    semantic_filter="customer reviews and ratings",
    word_count_threshold=20,  # Reviews can be shorter
    top_k=10,                 # Get multiple reviews
    sim_threshold=0.4         # Allow variety in review content
)
3. Technical Documentation
strategy = CosineStrategy(
    semantic_filter="technical specifications documentation",
    word_count_threshold=30,
    sim_threshold=0.6,  # Stricter matching for technical content
    max_dist=0.3        # Allow related technical sections
)
Advanced Features
Custom Clustering
strategy = CosineStrategy(
    linkage_method='complete',  # Alternative clustering method
    max_dist=0.4,               # Larger clusters
    model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'  # Multilingual support
)
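To show what a linkage method and a distance cutoff do in general, here is a small pure-Python sketch of agglomerative clustering on one-dimensional points. It is only an illustration of the concept: it implements `single` and `complete` linkage (not `ward`), the points stand in for embedding distances, and the `agglomerate` function is hypothetical rather than anything Crawl4AI exposes.

```python
def agglomerate(points, max_dist, linkage_method="complete"):
    """Merge the two nearest clusters until the closest pair exceeds max_dist."""
    # single: distance between the closest members; complete: between the farthest.
    pick = {"single": min, "complete": max}[linkage_method]
    clusters = [[p] for p in points]

    def distance(c1, c2):
        return pick(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > 1:
        d, i, j = min(
            (distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_dist:
            break  # nothing left that is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


points = [0.0, 0.1, 0.25, 0.9, 1.0]
tight = agglomerate(points, max_dist=0.2, linkage_method="complete")
wide = agglomerate(points, max_dist=0.4, linkage_method="complete")
chained = agglomerate(points, max_dist=0.2, linkage_method="single")
```

With `complete` linkage and `max_dist=0.2` the point 0.25 stays on its own, raising `max_dist` to 0.4 absorbs it into the first cluster, and `single` linkage absorbs it even at 0.2 because only the nearest members are compared.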
Content Filtering Pipeline
import json
from crawl4ai import AsyncWebCrawler, CosineStrategy

strategy = CosineStrategy(
    semantic_filter="pricing plans features",
    word_count_threshold=15,
    sim_threshold=0.5,
    top_k=3
)

async def extract_pricing_features(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=strategy
        )
        if result.success:
            content = json.loads(result.extracted_content)
            return {
                'pricing_features': content,
                'clusters': len(content),
                # .get() avoids a KeyError if an item carries no 'score' field
                'similarity_scores': [item.get('score') for item in content]
            }
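The parsing step of the pipeline can be tried without a network call. The sketch below assumes `extracted_content` is a JSON string holding a list of objects; the sample payload and its field names (`index`, `content`, `score`) are hypothetical and may not match what your Crawl4AI version returns, so inspect a real result before relying on them.

```python
import json


def summarize_clusters(extracted_content):
    """Parse a JSON list of clusters and summarize it, tolerating missing keys."""
    clusters = json.loads(extracted_content) if extracted_content else []
    return {
        "clusters": len(clusters),
        "contents": [c.get("content", "") for c in clusters],
        "scores": [c.get("score") for c in clusters],  # None when absent
    }


# Hypothetical payload standing in for result.extracted_content.
sample = json.dumps([
    {"index": 0, "content": "Pro plan: $20 per month", "score": 0.82},
    {"index": 1, "content": "Free plan includes 3 projects"},
])
summary = summarize_clusters(sample)
```

Using `.get()` keeps the summary usable even when some items lack a field, and an empty or missing `extracted_content` yields zero clusters instead of an exception.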
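Best Practices
1. Adjust thresholds iteratively
   - Start with the default values
   - Adjust based on the results
   - Monitor clustering quality
2. Choose an appropriate word count threshold
   - Higher for articles (100+)
   - Lower for reviews/comments (20+)
   - Medium for product descriptions (50+)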
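One way to keep these starting points in a single place is a small preset table. This is only a sketch: the preset keys and the `strategy_kwargs` helper are hypothetical, and the numbers are the rough guidance above rather than tuned values.

```python
# Word-count starting points from the guidance above; the keys are hypothetical labels.
WORD_COUNT_PRESETS = {
    "article": 100,
    "review": 20,
    "product_description": 50,
}


def strategy_kwargs(content_type, semantic_filter, **overrides):
    """Build keyword arguments for CosineStrategy, starting from a preset."""
    kwargs = {
        "semantic_filter": semantic_filter,
        "word_count_threshold": WORD_COUNT_PRESETS[content_type],
    }
    kwargs.update(overrides)  # explicit values win while you tune iteratively
    return kwargs


review_kwargs = strategy_kwargs(
    "review", "customer reviews and ratings", sim_threshold=0.4
)
# Pass the result to the strategy as: CosineStrategy(**review_kwargs)
```

Starting from a preset and overriding one parameter at a time makes it easier to see which change affected clustering quality.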
3. Optimize performance
strategy = CosineStrategy(
    word_count_threshold=10,  # Filter early
    top_k=5,                  # Limit results
    verbose=True              # Monitor performance
)
4. Handle different content types
# For mixed content pages
strategy = CosineStrategy(
    semantic_filter="product features",
    sim_threshold=0.4,  # More flexible matching
    max_dist=0.3,       # Larger clusters
    top_k=3             # Multiple relevant sections
)
Error Handling
import json

# Assumes `crawler` and `strategy` are set up as in Basic Usage,
# and that this runs inside an async function.
try:
    result = await crawler.arun(
        url="https://example.com",
        extraction_strategy=strategy
    )
    if result.success:
        content = json.loads(result.extracted_content)
        if not content:
            print("No relevant content found")
    else:
        print(f"Extraction failed: {result.error_message}")
except Exception as e:
    print(f"Error during extraction: {e}")
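The Cosine Strategy is particularly effective when:
- Content structure is inconsistent
- You need semantic understanding
- You want to find similar content blocks
- Structure-based extraction (CSS/XPath) is unreliable
It works well alongside other strategies and can be used as a pre-processing step for LLM-based extraction.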
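As a rough sketch of the pre-processing idea, the clustered text can be packed into a size-limited context before it is handed to an LLM. The `build_llm_context` helper, the section labels, and the character budget are all assumptions made for illustration; no LLM call is made here.

```python
def build_llm_context(clusters, max_chars=2000):
    """Join cluster texts into one labeled context string within a character budget."""
    parts, used = [], 0
    for i, text in enumerate(clusters, start=1):
        block = f"[Section {i}]\n{text.strip()}"
        if used + len(block) > max_chars:
            break  # stop before exceeding the budget
        parts.append(block)
        used += len(block) + 2  # account for the blank-line separator
    return "\n\n".join(parts)


# Placeholder cluster texts, ordered from most to least relevant.
clusters = ["a" * 50, "b" * 50, "c" * 50]
context = build_llm_context(clusters, max_chars=130)
```

Because the clusters arrive ranked by relevance, truncating from the end drops the least relevant sections first and keeps the prompt small.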