Advanced Adaptive Strategies
Overview
While the default adaptive crawling configuration works well for most use cases, understanding the underlying strategies and scoring mechanisms lets you fine-tune the crawler for specific domains and requirements.
The Three-Layer Scoring System
1. Coverage Score
Coverage measures how comprehensively your knowledge base covers the query terms and related concepts.
Mathematical Foundation
Coverage(K, Q) = Σ(t ∈ Q) score(t, K) / |Q|
where score(t, K) = doc_coverage(t) × (1 + freq_boost(t))
Components
- Document coverage: the percentage of documents containing the term
- Frequency boost: a logarithmic bonus for term frequency
- Query decomposition: intelligent handling of multi-word queries
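Read as code, the formula amounts to something like the sketch below. This is a minimal illustration that assumes documents are represented as lists of terms; the helper function is not part of the library's API.

import math

def coverage_score(documents: list[list[str]], query_terms: list[str]) -> float:
    """Illustrative coverage: average of doc_coverage(t) * (1 + freq_boost(t))
    over the query terms, following the formula above."""
    n_docs = len(documents)
    if n_docs == 0 or not query_terms:
        return 0.0
    total = 0.0
    for term in query_terms:
        docs_with_term = sum(1 for doc in documents if term in doc)
        doc_coverage = docs_with_term / n_docs            # fraction of docs containing the term
        term_freq = sum(doc.count(term) for doc in documents)
        freq_boost = math.log1p(term_freq)                # logarithmic frequency bonus
        total += doc_coverage * (1 + freq_boost)
    return total / len(query_terms)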
Tuning Coverage
# For technical documentation with specific terminology
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Require high coverage
    top_k_links=5               # Cast wider net
)

# For general topics with synonyms
config = AdaptiveConfig(
    confidence_threshold=0.6,   # Lower threshold
    top_k_links=2               # More focused
)
2. Consistency Score
Consistency evaluates whether the information gathered across pages is coherent and free of contradictions.
How It Works
- Extract key statements from each document
- Compare statements across documents
- Measure agreement versus contradiction
- Return a normalized score (0-1)
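A rough sketch of that pipeline (not the library's implementation) computes consistency as the fraction of cross-document statement pairs that agree; `statements_agree` below is a stand-in for whatever agreement/contradiction check is actually used.

from itertools import combinations

def consistency_score(doc_statements: list[list[str]], statements_agree) -> float:
    """Fraction of cross-document statement pairs judged as agreeing (0-1).

    doc_statements: key statements extracted from each document.
    statements_agree: callable(a, b) -> bool, a placeholder comparison.
    """
    agreements, comparisons = 0, 0
    for doc_a, doc_b in combinations(doc_statements, 2):
        for stmt_a in doc_a:
            for stmt_b in doc_b:
                comparisons += 1
                if statements_agree(stmt_a, stmt_b):
                    agreements += 1
    if comparisons == 0:
        return 1.0  # nothing to compare, nothing to contradict
    return agreements / comparisons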
Practical Implications
- High consistency (>0.8): information is reliable and coherent
- Medium consistency (0.5-0.8): some variation, but broadly consistent
- Low consistency (<0.5): contradictory information; more sources are needed
3. Saturation Score
Saturation detects when new pages stop contributing new information.
Detection Algorithm
# Tracks new unique terms per page
new_terms_page_1 = 50
new_terms_page_2 = 30 # 60% of first
new_terms_page_3 = 15 # 50% of second
new_terms_page_4 = 5 # 33% of third
# Saturation detected: rapidly diminishing returns
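One minimal way to turn that pattern into a stopping rule, purely as an illustration, is to compare the latest page's share of new terms against a minimum-gain threshold:

def is_saturated(new_terms_history: list[int], min_gain_threshold: float = 0.1) -> bool:
    """Illustrative saturation check: stop when the latest page adds fewer
    new terms than min_gain_threshold of the total discovered so far."""
    total_terms = sum(new_terms_history)
    if total_terms == 0 or len(new_terms_history) < 2:
        return False
    latest_gain = new_terms_history[-1] / total_terms
    return latest_gain < min_gain_threshold

# With the numbers above: 5 / (50 + 30 + 15 + 5) = 0.05 -> saturated at threshold 0.1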
Configuration
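Saturation behaviour is configured mainly through the gain threshold; a minimal sketch, assuming the min_gain_threshold parameter shown in the domain-specific configurations below:

# Stop earlier when new pages add little new vocabulary
config = AdaptiveConfig(
    min_gain_threshold=0.1  # higher = stop sooner, lower = keep digging
)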
Link Ranking Algorithm
Expected Information Gain
Each uncrawled link is scored according to the following criteria:
1. Relevance Score
Uses the BM25 algorithm on the link preview text:
Factors:
- Term frequency in the preview
- Inverse document frequency
- Preview length normalization
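For reference, a compact BM25 scorer over preview texts could look like the sketch below (standard BM25 with the usual k1 and b parameters; this is a generic illustration, not the crawler's internal code):

import math
from collections import Counter

def bm25_scores(previews: list[str], query: str, k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each link preview against the query with standard BM25."""
    docs = [p.lower().split() for p in previews]
    n = len(docs)
    avg_len = (sum(len(d) for d in docs) / n) if n else 1.0
    avg_len = avg_len or 1.0  # guard against empty previews
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs if term in d)            # document frequency
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # inverse document frequency
            freq = tf[term]                                   # term frequency in the preview
            norm = k1 * (1 - b + b * len(doc) / avg_len)      # preview length normalization
            score += idf * freq * (k1 + 1) / (freq + norm)
        scores.append(score)
    return scores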
2. Novelty Estimation
Measures how different a link is from the content already crawled.
This prevents the crawler from fetching duplicate or highly similar pages.
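A simple way to approximate novelty, as a sketch only, is a Jaccard-style dissimilarity between a link preview's terms and the vocabulary already collected:

def novelty_score(preview: str, crawled_terms: set[str]) -> float:
    """1.0 = entirely new vocabulary, 0.0 = nothing new."""
    preview_terms = set(preview.lower().split())
    if not preview_terms:
        return 0.0
    overlap = len(preview_terms & crawled_terms)
    return 1.0 - overlap / len(preview_terms)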
3. Authority Calculation
Analyzes URL structure and domain:
Factors:
- Domain reputation
- URL depth (fewer slashes = higher authority)
- Clean URL structure
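These factors can be approximated with simple URL heuristics, for example (illustrative only; the reputation table is a hypothetical input):

from urllib.parse import urlparse

def authority_score(url: str, domain_reputation: dict[str, float]) -> float:
    """Heuristic authority: reputation weight, shallower paths score higher,
    and messy URLs (query strings, fragments) are penalised slightly."""
    parsed = urlparse(url)
    reputation = domain_reputation.get(parsed.netloc, 0.5)        # hypothetical lookup table
    depth = max(len([p for p in parsed.path.split("/") if p]), 1)
    depth_score = 1.0 / depth                                     # fewer slashes = higher authority
    clean_penalty = 0.8 if (parsed.query or parsed.fragment) else 1.0
    return reputation * depth_score * clean_penalty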
Custom Link Scoring
class CustomLinkScorer:
    def score(self, link: Link, query: str, state: CrawlState) -> float:
        # Prioritize specific URL patterns
        if "/api/reference/" in link.href:
            return 2.0  # Double the score

        # Deprioritize certain sections
        if "/archive/" in link.href:
            return 0.1  # Reduce score by 90%

        # Default scoring
        return 1.0

# Use with adaptive crawler
adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    link_scorer=CustomLinkScorer()
)
Domain-Specific Configurations
Technical Documentation
tech_doc_config = AdaptiveConfig(
    confidence_threshold=0.85,
    max_pages=30,
    top_k_links=3,
    min_gain_threshold=0.05  # Keep crawling for small gains
)
Rationale:
- High threshold ensures comprehensive coverage
- Lower gain threshold captures edge cases
- Moderate link following for depth
News & Articles
news_config = AdaptiveConfig(
    confidence_threshold=0.6,
    max_pages=10,
    top_k_links=5,
    min_gain_threshold=0.15  # Stop quickly on repetition
)
Rationale:
- Lower threshold (articles often repeat information)
- Higher gain threshold (avoids redundant coverage)
- More links per page (explores different perspectives)
E-commerce
ecommerce_config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=2,
    min_gain_threshold=0.1
)
Rationale:
- Balanced threshold for product variations
- Focused link following (avoids endless product pages)
- Standard gain threshold
Research & Academic
research_config = AdaptiveConfig(
    confidence_threshold=0.9,
    max_pages=50,
    top_k_links=4,
    min_gain_threshold=0.02  # Very low - capture citations
)
Rationale:
- Very high threshold for completeness
- Many pages for deep research
- Very low gain threshold to capture references
Performance Optimization
Memory Management
# For large crawls, use streaming
config = AdaptiveConfig(
    max_pages=100,
    save_state=True,
    state_path="large_crawl.json"
)

# Periodically clean state
if len(state.knowledge_base) > 1000:
    # Keep only most relevant
    state.knowledge_base = get_top_relevant(state.knowledge_base, 500)
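The get_top_relevant helper above is not defined by the snippet; one possible implementation, assuming each knowledge-base entry carries a numeric relevance score (the attribute name is an assumption and may differ from the actual data model):

def get_top_relevant(knowledge_base, keep: int):
    """Hypothetical helper: keep only the `keep` highest-scoring entries.
    Adjust the sort key to match the real knowledge-base structure."""
    ranked = sorted(
        knowledge_base,
        key=lambda doc: getattr(doc, "score", 0.0),  # assumed score attribute
        reverse=True,
    )
    return ranked[:keep]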
Parallel Processing
import asyncio

# Use multiple start points
start_urls = [
    "https://docs.example.com/intro",
    "https://docs.example.com/api",
    "https://docs.example.com/guides"
]

# Crawl in parallel
tasks = [
    adaptive.digest(url, query)
    for url in start_urls
]
results = await asyncio.gather(*tasks)
Caching Strategy
# Enable caching for repeated crawls
async with AsyncWebCrawler(
    config=BrowserConfig(
        cache_mode=CacheMode.ENABLED
    )
) as crawler:
    adaptive = AdaptiveCrawler(crawler, config)
Debugging & Analysis
Enable Verbose Logging
import logging
logging.basicConfig(level=logging.DEBUG)
adaptive = AdaptiveCrawler(crawler, config, verbose=True)
Analyze Crawl Patterns
# After crawling
state = await adaptive.digest(start_url, query)

# Analyze link selection
print("Link selection order:")
for i, url in enumerate(state.crawl_order):
    print(f"{i+1}. {url}")

# Analyze term discovery
print("\nTerm discovery rate:")
for i, new_terms in enumerate(state.new_terms_history):
    print(f"Page {i+1}: {new_terms} new terms")

# Analyze score progression
print("\nScore progression:")
print(f"Coverage: {state.metrics['coverage_history']}")
print(f"Saturation: {state.metrics['saturation_history']}")
Export for Analysis
# Export detailed metrics
import json

metrics = {
    "query": query,
    "total_pages": len(state.crawled_urls),
    "confidence": adaptive.confidence,
    "coverage_stats": adaptive.coverage_stats,
    "crawl_order": state.crawl_order,
    "term_frequencies": dict(state.term_frequencies),
    "new_terms_history": state.new_terms_history
}

with open("crawl_analysis.json", "w") as f:
    json.dump(metrics, f, indent=2)
Custom Strategies
Implementing a Custom Strategy
from crawl4ai.adaptive_crawler import BaseStrategy

class DomainSpecificStrategy(BaseStrategy):
    def calculate_coverage(self, state: CrawlState) -> float:
        # Custom coverage calculation
        # e.g., weight certain terms more heavily
        pass

    def calculate_consistency(self, state: CrawlState) -> float:
        # Custom consistency logic
        # e.g., domain-specific validation
        pass

    def rank_links(self, links: List[Link], state: CrawlState) -> List[Link]:
        # Custom link ranking
        # e.g., prioritize specific URL patterns
        pass

# Use custom strategy
adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    strategy=DomainSpecificStrategy()
)
Combining Strategies
class HybridStrategy(BaseStrategy):
    def __init__(self):
        self.strategies = [
            TechnicalDocStrategy(),
            SemanticSimilarityStrategy(),
            URLPatternStrategy()
        ]

    def calculate_confidence(self, state: CrawlState) -> float:
        # Weighted combination of strategies
        scores = [s.calculate_confidence(state) for s in self.strategies]
        weights = [0.5, 0.3, 0.2]
        return sum(s * w for s, w in zip(scores, weights))
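The combined strategy plugs in exactly like the single custom strategy above:

adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    strategy=HybridStrategy()
)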
Best Practices
1. Start Conservative
Begin with the default settings and adjust based on the results:
# Start with defaults
result = await adaptive.digest(url, query)

# Analyze and adjust
if adaptive.confidence < 0.7:
    config.max_pages += 10
    config.confidence_threshold -= 0.1
2. Monitor Resource Usage
import psutil

# Check memory before large crawls
memory_percent = psutil.virtual_memory().percent
if memory_percent > 80:
    config.max_pages = min(config.max_pages, 20)
3. Use Domain Knowledge
# For API documentation
if "api" in start_url:
    config.top_k_links = 2  # APIs have clear structure

# For blogs
if "blog" in start_url:
    config.min_gain_threshold = 0.2  # Avoid similar posts
4. Validate Results
# Always validate the knowledge base
relevant_content = adaptive.get_relevant_content(top_k=10)

# Check coverage
query_terms = set(query.lower().split())
covered_terms = set()

for doc in relevant_content:
    content_lower = doc['content'].lower()
    for term in query_terms:
        if term in content_lower:
            covered_terms.add(term)

coverage_ratio = len(covered_terms) / len(query_terms)
print(f"Query term coverage: {coverage_ratio:.0%}")