Adaptive Web Crawling
Introduction

Traditional web crawlers follow predetermined patterns, crawling pages blindly without knowing when they've gathered enough information. Adaptive Crawling changes this paradigm by introducing intelligence into the crawling process.

Think of it like research: when you're looking for information, you don't read every book in the library. You stop when you've found sufficient information to answer your question. That's exactly what Adaptive Crawling does for web scraping.
Key Concepts

The Problem It Solves

When crawling websites for specific information, you face two challenges:
1. Under-crawling: Stopping too early and missing crucial information
2. Over-crawling: Wasting resources by crawling irrelevant pages

Adaptive Crawling solves both by using a three-layer scoring system that determines when you have "enough" information.
How It Works

The AdaptiveCrawler uses three metrics to measure information sufficiency:

- Coverage: How well your collected pages cover the query terms
- Consistency: Whether the information is coherent across pages
- Saturation: Detecting when new pages aren't adding new information

When these metrics indicate sufficient information has been gathered, crawling stops automatically.
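To make the signals concrete, here is a purely illustrative sketch of two of them on plain-text documents. This is not crawl4ai's internal scoring code; it only shows the intuition of coverage as query-term presence and saturation as vocabulary overlap with previously collected pages.

# Conceptual sketch only -- not crawl4ai's actual implementation.
def coverage(pages: list[str], query: str) -> float:
    """Fraction of query terms that appear somewhere in the collected pages."""
    terms = set(query.lower().split())
    text = " ".join(pages).lower()
    return sum(term in text for term in terms) / max(len(terms), 1)

def saturation(pages: list[str]) -> float:
    """How little new vocabulary the latest page added (1.0 = nothing new)."""
    if len(pages) < 2:
        return 0.0
    seen = set(" ".join(pages[:-1]).lower().split())
    new = set(pages[-1].lower().split())
    return len(new & seen) / max(len(new), 1)

pages = [
    "async context managers define __aenter__ and __aexit__",
    "contextlib provides asynccontextmanager",
]
print(coverage(pages, "async context managers"), saturation(pages))

Consistency would be measured analogously, for example as pairwise overlap between pages, and the real crawler combines all three signals into a single confidence score.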
Quick Start

Basic Usage
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Create an adaptive crawler (config is optional)
        adaptive = AdaptiveCrawler(crawler)

        # Start crawling with a query
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View statistics
        adaptive.print_stats()

        # Get the most relevant content
        relevant_pages = adaptive.get_relevant_content(top_k=5)
        for page in relevant_pages:
            print(f"- {page['url']} (score: {page['score']:.2f})")

asyncio.run(main())
Configuration Options
from crawl4ai import AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.8,    # Stop when 80% confident (default: 0.7)
    max_pages=30,                # Maximum pages to crawl (default: 20)
    top_k_links=5,               # Links to follow per page (default: 3)
    min_gain_threshold=0.05      # Minimum expected gain to continue (default: 0.1)
)

adaptive = AdaptiveCrawler(crawler, config)
Crawling Strategies

Adaptive Crawling supports two distinct strategies for determining information sufficiency:
Statistical Strategy (Default)

The statistical strategy uses pure information theory and term-based analysis:

- Fast and efficient - No API calls or model loading
- Term-based coverage - Analyzes query term presence and distribution
- No external dependencies - Works offline
- Best for: Well-defined queries with specific terminology
# Default configuration uses the statistical strategy
config = AdaptiveConfig(
    strategy="statistical",    # This is the default
    confidence_threshold=0.8
)
Embedding Strategy

The embedding strategy uses semantic embeddings for deeper understanding:

- Semantic understanding - Captures meaning beyond exact term matches
- Query expansion - Automatically generates query variations
- Gap-driven selection - Identifies semantic gaps in knowledge
- Validation-based stopping - Uses held-out queries to validate coverage
- Best for: Complex queries, ambiguous topics, conceptual understanding
# Configure the embedding strategy
config = AdaptiveConfig(
    strategy="embedding",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Default
    n_query_variations=10,                   # Generate 10 query variations
    embedding_min_confidence_threshold=0.1   # Stop if completely irrelevant
)

# With a custom embedding provider (e.g., OpenAI)
config = AdaptiveConfig(
    strategy="embedding",
    embedding_llm_config={
        'provider': 'openai/text-embedding-3-small',
        'api_token': 'your-api-key'
    }
)
Strategy Comparison

| Feature | Statistical | Embedding |
|---|---|---|
| Speed | Very fast | Moderate (API calls) |
| Cost | Free | Depends on provider |
| Accuracy | Good for exact terms | Excellent for concepts |
| Dependencies | None | Embedding model/API |
| Query Understanding | Literal | Semantic |
| Best Use Case | Technical docs, specific terms | Research, broad topics |
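You can encode this trade-off in a small helper of your own. The choose_config function below is purely illustrative (it is not part of crawl4ai) and simply maps the comparison above onto the documented AdaptiveConfig options.

from crawl4ai import AdaptiveConfig

def choose_config(query: str, offline: bool = False) -> AdaptiveConfig:
    """Illustrative helper: pick a strategy based on the comparison above."""
    # Short, terminology-heavy queries work well with the statistical strategy;
    # broad or conceptual queries benefit from semantic embeddings.
    if offline or len(query.split()) <= 3:
        return AdaptiveConfig(strategy="statistical", confidence_threshold=0.7)
    return AdaptiveConfig(
        strategy="embedding",
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
        confidence_threshold=0.7,
    )

config = choose_config("how transformers handle long-range dependencies")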
Embedding Strategy Configuration

The embedding strategy offers fine-tuned control through several parameters:
config = AdaptiveConfig(
    strategy="embedding",

    # Model configuration
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_llm_config=None,  # Use for API-based embeddings

    # Query expansion
    n_query_variations=10,      # Number of query variations to generate

    # Coverage parameters
    embedding_coverage_radius=0.2,  # Distance threshold for coverage
    embedding_k_exp=3.0,            # Exponential decay factor (higher = stricter)

    # Stopping criteria
    embedding_min_relative_improvement=0.1,  # Min improvement to continue
    embedding_validation_min_score=0.3,      # Min validation score
    embedding_min_confidence_threshold=0.1,  # Below this = irrelevant

    # Link selection
    embedding_overlap_threshold=0.85,  # Similarity for deduplication

    # Display confidence mapping
    embedding_quality_min_confidence=0.7,   # Min displayed confidence
    embedding_quality_max_confidence=0.95   # Max displayed confidence
)
Handling Irrelevant Queries

The embedding strategy can detect when a query is completely unrelated to the content:
# This will stop quickly with low confidence
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="how to cook pasta"  # Irrelevant to Python docs
)

# Check whether the query was irrelevant
if result.metrics.get('is_irrelevant', False):
    print("Query is unrelated to the content!")
When to Use Adaptive Crawling

Perfect For:

- Research Tasks: Finding comprehensive information about a topic
- Question Answering: Gathering sufficient context to answer specific queries
- Knowledge Base Building: Creating focused datasets for AI/ML applications
- Competitive Intelligence: Collecting complete information about specific products/features

Not Recommended For:

- Full Site Archiving: When you need every page regardless of content
- Structured Data Extraction: When targeting specific, known page patterns
- Real-time Monitoring: When you need continuous updates
Understanding the Output

Confidence Score

The confidence score (0-1) indicates how sufficient the gathered information is (see the sketch after this list for one way to act on it):

- 0.0-0.3: Insufficient information, needs more crawling
- 0.3-0.6: Partial information, may answer basic queries
- 0.6-0.7: Good coverage, can answer most queries
- 0.7-1.0: Excellent coverage, comprehensive information
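A minimal sketch of acting on these bands after a crawl. It assumes the final score is exposed as an adaptive.confidence attribute (check the API reference for your crawl4ai version; the value may also be surfaced via result.metrics); the thresholds simply mirror the list above.

# Assumes `adaptive.confidence` holds the final 0-1 score (verify against your version).
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="async context managers"
)

score = adaptive.confidence
if score < 0.3:
    print("Insufficient coverage -- consider raising max_pages or a different start URL.")
elif score < 0.6:
    print("Partial coverage -- good enough for basic questions only.")
elif score < 0.7:
    print("Good coverage -- most questions about the query should be answerable.")
else:
    print("Excellent coverage -- the knowledge base is comprehensive for this query.")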
Statistics Display

adaptive.print_stats(detailed=False)  # Summary table
adaptive.print_stats(detailed=True)   # Detailed metrics

The summary shows:
- Pages crawled vs. confidence achieved
- Coverage, consistency, and saturation scores
- Crawling efficiency metrics
Persistence and Resumption

Saving Progress

config = AdaptiveConfig(
    save_state=True,
    state_path="my_crawl_state.json"
)

# The crawl will auto-save progress
result = await adaptive.digest(start_url, query)
Resuming a Crawl

# Resume from the saved state
result = await adaptive.digest(
    start_url,
    query,
    resume_from="my_crawl_state.json"
)
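Putting the two together, a common pattern is to auto-save during a long crawl and resume in a later session. This sketch uses only the documented save_state, state_path, and resume_from options; the two-run structure is just an illustration.

from crawl4ai import AdaptiveConfig, AdaptiveCrawler

config = AdaptiveConfig(save_state=True, state_path="my_crawl_state.json")

# Run 1: may be interrupted (Ctrl+C, network failure, ...); progress is persisted to state_path.
adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(start_url, query)

# Run 2 (later session): continue from the saved state instead of starting over.
adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(start_url, query, resume_from="my_crawl_state.json")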
Exporting the Knowledge Base

# Export collected pages to JSONL
adaptive.export_knowledge_base("knowledge_base.jsonl")

# Import into another session
new_adaptive = AdaptiveCrawler(crawler)
new_adaptive.import_knowledge_base("knowledge_base.jsonl")
Best Practices

1. Query Formulation

- Use specific, descriptive queries
- Include key terms you expect to find
- Avoid overly broad queries
2. Threshold Tuning

- Start with the default (0.7) for general use
- Lower to 0.5-0.6 for exploratory crawling
- Raise to 0.8+ for exhaustive coverage (example configurations follow this list)
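For instance, the same crawler can be tuned per task by swapping configs. This sketch uses only the documented confidence_threshold and max_pages options.

from crawl4ai import AdaptiveConfig

# Exploratory pass: accept a rougher picture, stop earlier.
exploratory = AdaptiveConfig(confidence_threshold=0.55)

# General-purpose default.
general = AdaptiveConfig(confidence_threshold=0.7)

# Exhaustive pass: keep crawling until coverage is very high (may hit max_pages first).
exhaustive = AdaptiveConfig(confidence_threshold=0.85, max_pages=50)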
3. Performance Optimization

- Use appropriate max_pages limits (see the sketch after this list)
- Adjust top_k_links based on site structure
- Enable caching for repeat crawls
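A sketch of a configuration tuned for a large, link-dense site, using only documented AdaptiveConfig options; how caching is enabled depends on your underlying AsyncWebCrawler setup and is not shown here.

from crawl4ai import AdaptiveConfig

# Cap the crawl and follow fewer links per page so each hop is more selective.
large_site_config = AdaptiveConfig(
    max_pages=50,       # hard safety limit
    top_k_links=2,      # be pickier on sprawling navigation pages
    confidence_threshold=0.7,
)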
4. Link Selection

The crawler prioritizes links based on the following signals (a conceptual scoring sketch follows this list):

- Relevance to the query
- Expected information gain
- URL structure and depth
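The exact ranking is internal to crawl4ai, but conceptually it resembles the toy scorer below: query-term overlap in the anchor text and URL, minus a small depth penalty. This is purely illustrative; the real implementation and weights differ.

from urllib.parse import urlparse

def score_link(url: str, anchor_text: str, query: str) -> float:
    """Toy link scorer: term overlap in the anchor/URL, minus a depth penalty."""
    terms = set(query.lower().split())
    haystack = (anchor_text + " " + url).lower()
    relevance = sum(t in haystack for t in terms) / max(len(terms), 1)
    depth = urlparse(url).path.count("/")
    return relevance - 0.05 * depth  # shallower, more relevant links win

links = [
    ("https://docs.python.org/3/reference/datamodel.html", "Data model"),
    ("https://docs.python.org/3/library/contextlib.html", "contextlib: async context managers"),
]
best = max(links, key=lambda l: score_link(l[0], l[1], "async context managers"))
print(best[0])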
Examples

Research Assistant

# Gather information about a programming concept
result = await adaptive.digest(
    start_url="https://realpython.com",
    query="python decorators implementation patterns"
)

# Get the most relevant excerpts
for doc in adaptive.get_relevant_content(top_k=3):
    print(f"\nFrom: {doc['url']}")
    print(f"Relevance: {doc['score']:.2%}")
    print(doc['content'][:500] + "...")
Knowledge Base Builder

# Build a focused knowledge base about machine learning
queries = [
    "supervised learning algorithms",
    "neural network architectures",
    "model evaluation metrics"
]

for query in queries:
    await adaptive.digest(
        start_url="https://scikit-learn.org/stable/",
        query=query
    )

# Export the combined knowledge base
adaptive.export_knowledge_base("ml_knowledge.jsonl")
API Documentation Crawler

# Intelligently crawl API documentation
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Higher threshold for completeness
    max_pages=30
)

adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)
Next Steps

- Learn about Advanced Adaptive Strategies
- Explore the AdaptiveCrawler API Reference
- See more Examples
FAQ

Q: How is this different from traditional crawling?
A: Traditional crawling follows fixed patterns (BFS/DFS). Adaptive crawling makes intelligent decisions about which links to follow and when to stop based on information gain.

Q: Can I use this with JavaScript-heavy sites?
A: Yes! AdaptiveCrawler inherits all capabilities from AsyncWebCrawler, including JavaScript execution.

Q: How does it handle large websites?
A: The algorithm naturally limits crawling to relevant sections. Use max_pages as a safety limit.

Q: Can I customize the scoring algorithms?
A: Advanced users can implement custom strategies. See Adaptive Strategies.