Deep Crawling
One of Crawl4AI's most powerful features is its configurable deep crawling, which can explore a website beyond a single page. With fine-grained control over crawl depth, domain boundaries, and content filtering, Crawl4AI gives you the tools to extract exactly the content you need.
In this tutorial, you'll learn:
- How to set up a basic deep crawl with the BFS strategy
- The difference between streamed and non-streamed output
- How to implement filters and scorers to target specific content
- How to create advanced filter chains for complex crawls
- How to use BestFirstCrawling for intelligent, prioritized exploration
Prerequisites
- You've completed or read the AsyncWebCrawler basics and know how to run a simple crawl.
- You know how to configure CrawlerRunConfig.
1. Quick Example
Here's a minimal snippet that runs a basic deep crawl using BFSDeepCrawlStrategy:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
async def main():
    # Configure a 2-level deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

        print(f"Crawled {len(results)} pages in total")

        # Access individual results
        for result in results[:3]:  # Show first 3 results
            print(f"URL: {result.url}")
            print(f"Depth: {result.metadata.get('depth', 0)}")

if __name__ == "__main__":
    asyncio.run(main())
What's happening? BFSDeepCrawlStrategy(max_depth=2, include_external=False) tells Crawl4AI to:
- Crawl the starting page (depth 0) plus 2 more levels of links
- Stay within the same domain (don't follow external links)
Each result carries metadata such as its crawl depth, and once all crawling is complete the results are returned as a list (see the short sketch below).
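For example, because every result exposes its depth through result.metadata, you can summarize the returned list by level. A minimal sketch, reusing the results list from the quick example above:

from collections import Counter

# Count how many pages were crawled at each depth
depth_counts = Counter(r.metadata.get("depth", 0) for r in results)
for depth, count in sorted(depth_counts.items()):
    print(f"Depth {depth}: {count} pages")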
2. Understanding Deep Crawl Strategy Options
2.1 BFSDeepCrawlStrategy (Breadth-First Search)
BFSDeepCrawlStrategy takes a breadth-first approach, exploring all links at one depth before moving deeper:
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
# Basic configuration
strategy = BFSDeepCrawlStrategy(
    max_depth=2,              # Crawl initial page + 2 levels deep
    include_external=False,   # Stay within the same domain
    max_pages=50,             # Maximum number of pages to crawl (optional)
    score_threshold=0.3,      # Minimum score for URLs to be crawled (optional)
)
Key parameters (see the combined sketch after this list):
- max_depth: number of levels to crawl beyond the starting page
- include_external: whether to follow links to other domains
- max_pages: maximum number of pages to crawl (default: infinite)
- score_threshold: minimum score for URLs to be crawled (default: -inf)
- filter_chain: a FilterChain instance for URL filtering
- url_scorer: a scorer instance for evaluating URLs
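As a rough, hedged sketch of how these parameters fit together (FilterChain and KeywordRelevanceScorer are introduced in sections 4 and 5 below), a BFS strategy might combine limits, filtering, and scoring like this:

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Sketch only: depth/page limits plus URL filtering and scoring in one strategy
strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    include_external=False,
    max_pages=50,
    filter_chain=FilterChain([URLPatternFilter(patterns=["*docs*"])]),
    url_scorer=KeywordRelevanceScorer(keywords=["crawl", "async"]),
    score_threshold=0.3,   # URLs scoring below this are skipped
)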
2.2 DFSDeepCrawlStrategy (Depth-First Search)
DFSDeepCrawlStrategy takes a depth-first approach, exploring each branch as deeply as possible before backtracking.
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy
# Basic configuration
strategy = DFSDeepCrawlStrategy(
    max_depth=2,              # Crawl initial page + 2 levels deep
    include_external=False,   # Stay within the same domain
    max_pages=30,             # Maximum number of pages to crawl (optional)
    score_threshold=0.5,      # Minimum score for URLs to be crawled (optional)
)
Key parameters:
- max_depth: number of levels to crawl beyond the starting page
- include_external: whether to follow links to other domains
- max_pages: maximum number of pages to crawl (default: infinite)
- score_threshold: minimum score for URLs to be crawled (default: -inf)
- filter_chain: a FilterChain instance for URL filtering
- url_scorer: a scorer instance for evaluating URLs
2.3 BestFirstCrawlingStrategy (⭐️ Recommended Deep Crawling Strategy)
For smarter crawling, use BestFirstCrawlingStrategy with a scorer to prioritize the most relevant pages:
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
# Create a scorer
scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7
)

# Configure the strategy
strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    include_external=False,
    url_scorer=scorer,
    max_pages=25,             # Maximum number of pages to crawl (optional)
)
This crawling approach:
- Evaluates each discovered URL against your scoring criteria
- Visits higher-scoring pages first
- Helps focus crawl resources on the most relevant content
- Can cap the total number of pages crawled with max_pages
- Doesn't need a score_threshold, because it naturally prioritizes by score
3. Streaming vs. Non-Streaming Results
Crawl4AI can return results in two modes:
3.1 Non-Streaming Mode (Default)
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=False  # Default behavior
)

async with AsyncWebCrawler() as crawler:
    # Wait for ALL results to be collected before returning
    results = await crawler.arun("https://example.com", config=config)

    for result in results:
        process_result(result)
When to use non-streaming mode:
- You need the complete dataset before processing (as sketched below)
- You're performing batch operations on all results together
- Crawl time isn't a critical factor
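For instance, a batch operation over the full result set might save every page's Markdown to disk once the crawl finishes. A minimal sketch, reusing the results list from the snippet above and assuming a hypothetical output directory; result.markdown holds the extracted Markdown (a string or markdown object, depending on version):

from pathlib import Path

out_dir = Path("crawl_output")  # hypothetical output directory
out_dir.mkdir(exist_ok=True)

# Batch-process the complete result set after the crawl has finished
for i, result in enumerate(results):
    markdown = result.markdown or ""
    (out_dir / f"page_{i}.md").write_text(str(markdown), encoding="utf-8")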
3.2 Streaming Mode
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=True  # Enable streaming
)

async with AsyncWebCrawler() as crawler:
    # Returns an async iterator
    async for result in await crawler.arun("https://example.com", config=config):
        # Process each result as it becomes available
        process_result(result)
Benefits of streaming mode:
- Process results immediately as they're discovered
- Start working with early results while the crawl continues (as sketched below)
- Better suited to real-time applications or progressive display
- Reduces memory pressure when crawling many pages
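One way to take advantage of early results and lower memory pressure is to persist each page as soon as it arrives, rather than collecting everything first. A minimal sketch; crawl_log.txt is just an illustrative file name:

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=True
)

async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun("https://example.com", config=config):
        # Persist each page immediately instead of holding everything in memory
        with open("crawl_log.txt", "a", encoding="utf-8") as f:
            f.write(f"{result.metadata.get('depth', 0)}\t{result.url}\n")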
4. Filtering Content with Filter Chains
Filters help you narrow down which pages get crawled. Combine multiple filters with a FilterChain for precise targeting.
4.1 Basic URL Pattern Filter
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter
# Only follow URLs containing "blog" or "docs"
url_filter = URLPatternFilter(patterns=["*blog*", "*docs*"])
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([url_filter])
    )
)
4.2 Combining Multiple Filters
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter
)

# Create a chain of filters
filter_chain = FilterChain([
    # Only follow URLs with specific patterns
    URLPatternFilter(patterns=["*guide*", "*tutorial*"]),

    # Only crawl specific domains
    DomainFilter(
        allowed_domains=["docs.example.com"],
        blocked_domains=["old.docs.example.com"]
    ),

    # Only include specific content types
    ContentTypeFilter(allowed_types=["text/html"])
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=filter_chain
    )
)
4.3 Available Filter Types
Crawl4AI includes several specialized filters:
- URLPatternFilter: matches URL patterns using wildcard syntax
- DomainFilter: controls which domains to include or exclude
- ContentTypeFilter: filters by HTTP content type
- ContentRelevanceFilter: uses similarity to a text query
- SEOFilter: evaluates SEO elements (meta tags, headers, etc.)
5. Using Scorers for Prioritized Crawling
Scorers assign priority values to discovered URLs, helping the crawler focus on the most relevant content first.
5.1 KeywordRelevanceScorer
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
# Create a keyword relevance scorer
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7  # Importance of this scorer (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        url_scorer=keyword_scorer
    ),
    stream=True  # Recommended with BestFirstCrawling
)

# Results will come in order of relevance score
async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun("https://example.com", config=config):
        score = result.metadata.get("score", 0)
        print(f"Score: {score:.2f} | {result.url}")
How scorers work:
- Evaluate each discovered URL before it's crawled
- Calculate relevance based on various signals
- Help the crawler make informed decisions about traversal order
A rough illustration of the idea follows.
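The sketch below is not the library's implementation; it's a standalone illustration of how a keyword-based URL score might be computed, with the function name and scoring formula chosen purely for illustration:

def illustrative_keyword_score(url: str, keywords: list[str]) -> float:
    """Toy example: fraction of keywords that appear in the URL (0.0 to 1.0)."""
    url_lower = url.lower()
    hits = sum(1 for kw in keywords if kw.lower() in url_lower)
    return hits / len(keywords) if keywords else 0.0

print(illustrative_keyword_score(
    "https://example.com/docs/async-configuration",
    ["crawl", "example", "async", "configuration"],
))  # 0.75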
6. Advanced Filtering Techniques
6.1 SEO Filter for Quality Assessment
The SEOFilter helps you identify pages with strong SEO characteristics:
from crawl4ai.deep_crawling.filters import FilterChain, SEOFilter
# Create an SEO filter that looks for specific keywords in page metadata
seo_filter = SEOFilter(
    threshold=0.5,  # Minimum score (0.0 to 1.0)
    keywords=["tutorial", "guide", "documentation"]
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([seo_filter])
    )
)
6.2 Content Relevance Filter
The ContentRelevanceFilter analyzes the actual content of pages:
from crawl4ai.deep_crawling.filters import FilterChain, ContentRelevanceFilter
# Create a content relevance filter
relevance_filter = ContentRelevanceFilter(
query="Web crawling and data extraction with Python",
threshold=0.7 # Minimum similarity score (0.0 to 1.0)
)
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=1,
filter_chain=FilterChain([relevance_filter])
)
)
This filter:
- Measures semantic similarity between the query and page content
- Is a BM25-based relevance filter that works on the page's head content
7. Building a Complete Advanced Crawler
This example combines several of the techniques above into a sophisticated crawl:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
async def run_advanced_crawler():
    # Create a sophisticated filter chain
    filter_chain = FilterChain([
        # Domain boundaries
        DomainFilter(
            allowed_domains=["docs.example.com"],
            blocked_domains=["old.docs.example.com"]
        ),
        # URL patterns to include
        URLPatternFilter(patterns=["*guide*", "*tutorial*", "*blog*"]),
        # Content type filtering
        ContentTypeFilter(allowed_types=["text/html"])
    ])

    # Create a relevance scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["crawl", "example", "async", "configuration"],
        weight=0.7
    )

    # Set up the configuration
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
            url_scorer=keyword_scorer
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True
    )

    # Execute the crawl
    results = []
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.example.com", config=config):
            results.append(result)
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")

    # Analyze the results
    print(f"Crawled {len(results)} high-value pages")
    print(f"Average score: {sum(r.metadata.get('score', 0) for r in results) / len(results):.2f}")

    # Group by depth
    depth_counts = {}
    for result in results:
        depth = result.metadata.get("depth", 0)
        depth_counts[depth] = depth_counts.get(depth, 0) + 1

    print("Pages crawled by depth:")
    for depth, count in sorted(depth_counts.items()):
        print(f"  Depth {depth}: {count} pages")

if __name__ == "__main__":
    asyncio.run(run_advanced_crawler())
8. Limiting and Controlling Crawl Size
8.1 Using max_pages
You can cap the total number of pages crawled with the max_pages parameter:
# Limit to exactly 20 pages regardless of depth
strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    max_pages=20
)
This feature is useful for:
- Controlling API costs
- Keeping execution time predictable
- Focusing on the most important content
- Testing a crawl configuration before running it in full (as sketched below)
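For example, a quick test run of a configuration could use a small max_pages cap and simply report what came back before committing to a full crawl. A minimal sketch, reusing the imports and APIs shown above:

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=3, max_pages=20),
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun("https://example.com", config=config)
    # The crawl stops once the page budget is spent, even if more links remain
    print(f"Test run returned {len(results)} pages (cap was 20)")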
8.2 Using score_threshold
With the BFS and DFS strategies, you can set a minimum score threshold so that only high-quality pages are crawled:
# Only follow links with scores above 0.4
strategy = DFSDeepCrawlStrategy(
    max_depth=2,
    url_scorer=KeywordRelevanceScorer(keywords=["api", "guide", "reference"]),
    score_threshold=0.4  # Skip URLs with scores below this value
)
Note that BestFirstCrawlingStrategy doesn't need a score_threshold, since it already processes pages in highest-score-first order.
9. Common Pitfalls & Tips
1. Set realistic limits. Be cautious with max_depth values above 3, which can grow the crawl exponentially. Use max_pages to set hard limits.
2. Don't neglect the scoring component. BestFirstCrawling works best with well-tuned scorers. Experiment with keyword weights to get the prioritization right.
3. Be a good web citizen. Respect robots.txt. (This is disabled by default.)
4. Handle page errors gracefully. Not every page will be accessible. Check result.status when processing results (see the error-handling sketch after this list).
5. Balance breadth vs. depth. Choose your strategy wisely: BFS for comprehensive coverage, DFS for deep exploration, BestFirst for focused, relevance-based crawling.
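As a minimal error-handling sketch: depending on your Crawl4AI version, the result object exposes success and status_code fields (assumed here) that you can check before using a page:

config = CrawlerRunConfig(deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1))

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun("https://example.com", config=config)

    for result in results:
        # Skip pages that failed to load; success/status_code are assumed fields
        if not getattr(result, "success", True):
            print(f"Failed ({getattr(result, 'status_code', 'n/a')}): {result.url}")
            continue
        process_result(result)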
10. Summary & Next Steps
In this deep crawling tutorial for Crawl4AI, you learned how to:
- Configure BFSDeepCrawlStrategy, DFSDeepCrawlStrategy, and BestFirstCrawlingStrategy
- Process results in streaming or non-streaming mode
- Apply filters to target specific content
- Use scorers to prioritize the most relevant pages
- Limit crawls with the max_pages and score_threshold parameters
- Combine multiple techniques into a complete advanced crawler
With these tools you can efficiently extract structured data from websites at scale, focusing precisely on the content your use case needs.