Deep Crawling
One of Crawl4AI's most powerful features is its ability to perform configurable deep crawling that can explore websites beyond a single page. With fine-tuned control over crawl depth, domain boundaries, and content filtering, Crawl4AI gives you the tools to extract precisely the content you need.
In this tutorial, you'll learn:
- How to set up a Basic Deep Crawler with BFS strategy
- Understanding the difference between streamed and non-streamed output
- Implementing filters and scorers to target specific content
- Creating advanced filtering chains for sophisticated crawls
- Using BestFirstCrawling for intelligent exploration prioritization
Prerequisites
- You've completed or read AsyncWebCrawler Basics to understand how to run a simple crawl.
- You know how to configure CrawlerRunConfig.
1. Quick Example
Here's a minimal code snippet that implements a basic deep crawl using the BFSDeepCrawlStrategy:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
async def main():
# Configure a 2-level deep crawl
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
include_external=False
),
scraping_strategy=LXMLWebScrapingStrategy(),
verbose=True
)
async with AsyncWebCrawler() as crawler:
results = await crawler.arun("https://example.com", config=config)
print(f"Crawled {len(results)} pages in total")
# Access individual results
for result in results[:3]: # Show first 3 results
print(f"URL: {result.url}")
print(f"Depth: {result.metadata.get('depth', 0)}")
if __name__ == "__main__":
asyncio.run(main())
What's happening?
- BFSDeepCrawlStrategy(max_depth=2, include_external=False) instructs Crawl4AI to:
  - Crawl the starting page (depth 0) plus 2 more levels
  - Stay within the same domain (don't follow external links)
- Each result contains metadata like the crawl depth
- Results are returned as a list after all crawling is complete
2. Understanding Deep Crawling Strategy Options
2.1 BFSDeepCrawlStrategy (Breadth-First Search)
The BFSDeepCrawlStrategy uses a breadth-first approach, exploring all links at one depth before moving deeper:
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
# Basic configuration
strategy = BFSDeepCrawlStrategy(
max_depth=2, # Crawl initial page + 2 levels deep
include_external=False, # Stay within the same domain
max_pages=50, # Maximum number of pages to crawl (optional)
score_threshold=0.3, # Minimum score for URLs to be crawled (optional)
)
Key parameters:
- max_depth: Number of levels to crawl beyond the starting page
- include_external: Whether to follow links to other domains
- max_pages: Maximum number of pages to crawl (default: infinite)
- score_threshold: Minimum score for URLs to be crawled (default: -inf)
- filter_chain: FilterChain instance for URL filtering
- url_scorer: Scorer instance for evaluating URLs
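As an illustrative sketch of how these parameters can be combined (the patterns, keywords, and numeric values below are placeholders, not recommendations):
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Hedged sketch: restrict BFS to documentation-style URLs, score them by
# keyword relevance, and cap the crawl. All values here are placeholders.
strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    include_external=False,
    max_pages=100,
    score_threshold=0.2,
    filter_chain=FilterChain([URLPatternFilter(patterns=["*docs*", "*guide*"])]),
    url_scorer=KeywordRelevanceScorer(keywords=["tutorial", "reference"], weight=0.5),
)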
2.2 DFSDeepCrawlStrategy (Depth-First Search)
The DFSDeepCrawlStrategy uses a depth-first approach, exploring as far down a branch as possible before backtracking:
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy
# Basic configuration
strategy = DFSDeepCrawlStrategy(
max_depth=2, # Crawl initial page + 2 levels deep
include_external=False, # Stay within the same domain
max_pages=30, # Maximum number of pages to crawl (optional)
score_threshold=0.5, # Minimum score for URLs to be crawled (optional)
)
Key parameters:
- max_depth: Number of levels to crawl beyond the starting page
- include_external: Whether to follow links to other domains
- max_pages: Maximum number of pages to crawl (default: infinite)
- score_threshold: Minimum score for URLs to be crawled (default: -inf)
- filter_chain: FilterChain instance for URL filtering
- url_scorer: Scorer instance for evaluating URLs
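A minimal end-to-end sketch, assuming the same CrawlerRunConfig / arun flow shown in section 1 (example.com is a placeholder start URL):
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

# Sketch: plug the DFS strategy into CrawlerRunConfig exactly as the BFS
# example in section 1 does. example.com is a placeholder start URL.
async def main():
    config = CrawlerRunConfig(
        deep_crawl_strategy=DFSDeepCrawlStrategy(max_depth=2, max_pages=30),
        verbose=True,
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)
        for result in results:
            print(f"Depth: {result.metadata.get('depth', 0)} | {result.url}")

if __name__ == "__main__":
    asyncio.run(main())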
2.3 BestFirstCrawlingStrategy (⭐️ - Recommended Deep Crawl Strategy)
For more intelligent crawling, use BestFirstCrawlingStrategy with scorers to prioritize the most relevant pages:
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
# Create a scorer
scorer = KeywordRelevanceScorer(
keywords=["crawl", "example", "async", "configuration"],
weight=0.7
)
# Configure the strategy
strategy = BestFirstCrawlingStrategy(
max_depth=2,
include_external=False,
url_scorer=scorer,
max_pages=25, # Maximum number of pages to crawl (optional)
)
This crawling approach:
- Evaluates each discovered URL based on scorer criteria
- Visits higher-scoring pages first
- Helps focus crawl resources on the most relevant content
- Can limit total pages crawled with max_pages
- Does not need score_threshold as it naturally prioritizes by score
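A minimal running sketch of this strategy in non-streaming mode; the score metadata key is the one used in section 5.1, and the URL and keywords are placeholders:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Sketch: run a best-first crawl and print the first few results, whose
# "score" metadata (see section 5.1) reflects the scorer's ranking.
async def run_best_first():
    strategy = BestFirstCrawlingStrategy(
        max_depth=2,
        url_scorer=KeywordRelevanceScorer(keywords=["crawl", "async"], weight=0.7),
        max_pages=25,
    )
    config = CrawlerRunConfig(deep_crawl_strategy=strategy, verbose=True)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)
        for result in results[:5]:
            print(f"{result.metadata.get('score', 0):.2f}  {result.url}")

if __name__ == "__main__":
    asyncio.run(run_best_first())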
3. Streaming vs. Non-Streaming Results
Crawl4AI can return results in two modes:
3.1 Non-Streaming Mode (Default)
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
stream=False # Default behavior
)
async with AsyncWebCrawler() as crawler:
# Wait for ALL results to be collected before returning
results = await crawler.arun("https://example.com", config=config)
for result in results:
process_result(result)
When to use non-streaming mode:
- You need the complete dataset before processing
- You're performing batch operations on all results together
- Crawl time isn't a critical factor
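For instance, a batch step that only makes sense once the full result list exists, sketched here over the results list returned by arun above:
from collections import Counter

# Sketch of a batch operation that needs the full result list first:
# count crawled pages per depth, using the "depth" metadata field.
def summarize_by_depth(results):
    depth_counts = Counter(r.metadata.get("depth", 0) for r in results)
    for depth in sorted(depth_counts):
        print(f"Depth {depth}: {depth_counts[depth]} pages")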
3.2 Streaming Mode
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
stream=True # Enable streaming
)
async with AsyncWebCrawler() as crawler:
# Returns an async iterator
async for result in await crawler.arun("https://example.com", config=config):
# Process each result as it becomes available
process_result(result)
Benefits of streaming mode:
- Process results immediately as they're discovered
- Start working with early results while crawling continues
- Better for real-time applications or progressive display
- Reduces memory pressure when handling many pages
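For example, a sketch of progressive handling that keeps memory flat by writing a lightweight record for each page as it arrives (the file name and record fields are arbitrary choices):
import json

# Sketch: handle each page as soon as it streams in and persist only a
# small record, so full results don't pile up in memory.
async def stream_and_record(crawler, config, path="crawl_log.jsonl"):
    with open(path, "a", encoding="utf-8") as log:
        async for result in await crawler.arun("https://example.com", config=config):
            record = {"url": result.url, "depth": result.metadata.get("depth", 0)}
            log.write(json.dumps(record) + "\n")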
4. Filtering Content with Filter Chains
Filters help you narrow down which pages to crawl. Combine multiple filters using FilterChain for powerful targeting.
4.1 Basic URL Pattern Filter
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter
# Only follow URLs containing "blog" or "docs"
url_filter = URLPatternFilter(patterns=["*blog*", "*docs*"])
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=1,
filter_chain=FilterChain([url_filter])
)
)
4.2 Combining Multiple Filters
from crawl4ai.deep_crawling.filters import (
FilterChain,
URLPatternFilter,
DomainFilter,
ContentTypeFilter
)
# Create a chain of filters
filter_chain = FilterChain([
# Only follow URLs with specific patterns
URLPatternFilter(patterns=["*guide*", "*tutorial*"]),
# Only crawl specific domains
DomainFilter(
allowed_domains=["docs.example.com"],
blocked_domains=["old.docs.example.com"]
),
# Only include specific content types
ContentTypeFilter(allowed_types=["text/html"])
])
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
filter_chain=filter_chain
)
)
4.3 Available Filter Types
Crawl4AI includes several specialized filters:
- URLPatternFilter: Matches URL patterns using wildcard syntax
- DomainFilter: Controls which domains to include or exclude
- ContentTypeFilter: Filters based on HTTP Content-Type
- ContentRelevanceFilter: Uses similarity to a text query
- SEOFilter: Evaluates SEO elements (meta tags, headers, etc.)
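A couple of quick construction sketches for the pattern- and type-based filters; the patterns and MIME types are illustrative assumptions about your target site:
from crawl4ai.deep_crawling.filters import URLPatternFilter, ContentTypeFilter

# Hypothetical wildcard patterns; adjust them to your own site structure.
docs_filter = URLPatternFilter(patterns=["*docs*", "*/api/*", "*reference*"])

# allowed_types takes a list, so several MIME types can presumably be
# allowed at once (only text/html is shown elsewhere in this tutorial).
html_or_text = ContentTypeFilter(allowed_types=["text/html", "text/plain"])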
5. Using Scorers for Prioritized Crawling
Scorers assign priority values to discovered URLs, helping the crawler focus on the most relevant content first.
5.1 KeywordRelevanceScorer
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
# Create a keyword relevance scorer
keyword_scorer = KeywordRelevanceScorer(
keywords=["crawl", "example", "async", "configuration"],
weight=0.7 # Importance of this scorer (0.0 to 1.0)
)
config = CrawlerRunConfig(
deep_crawl_strategy=BestFirstCrawlingStrategy(
max_depth=2,
url_scorer=keyword_scorer
),
stream=True # Recommended with BestFirstCrawling
)
# Results will come in order of relevance score
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun("https://example.com", config=config):
score = result.metadata.get("score", 0)
print(f"Score: {score:.2f} | {result.url}")
How scorers work:
- Evaluate each discovered URL before crawling
- Calculate relevance based on various signals
- Help the crawler make intelligent choices about traversal order
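To see a scorer's effect, you can collect the streamed results and summarize their scores; a small sketch using the score metadata field shown above:
# Sketch: gather streamed results and summarize the score distribution,
# using the "score" metadata field shown in the example above.
async def collect_scores(crawler, config):
    scored = []
    async for result in await crawler.arun("https://example.com", config=config):
        scored.append((result.metadata.get("score", 0), result.url))
    if scored:
        avg = sum(score for score, _ in scored) / len(scored)
        print(f"Average score: {avg:.2f}")
        for score, url in sorted(scored, reverse=True)[:5]:
            print(f"{score:.2f}  {url}")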
6. Advanced Filtering Techniques
6.1 SEO Filter for Quality Assessment
The SEOFilter helps you identify pages with strong SEO characteristics:
from crawl4ai.deep_crawling.filters import FilterChain, SEOFilter
# Create an SEO filter that looks for specific keywords in page metadata
seo_filter = SEOFilter(
threshold=0.5, # Minimum score (0.0 to 1.0)
keywords=["tutorial", "guide", "documentation"]
)
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=1,
filter_chain=FilterChain([seo_filter])
)
)
6.2 Content Relevance Filter
The ContentRelevanceFilter analyzes the actual content of pages:
from crawl4ai.deep_crawling.filters import FilterChain, ContentRelevanceFilter
# Create a content relevance filter
relevance_filter = ContentRelevanceFilter(
query="Web crawling and data extraction with Python",
threshold=0.7 # Minimum similarity score (0.0 to 1.0)
)
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=1,
filter_chain=FilterChain([relevance_filter])
)
)
This filter:
- Measures semantic similarity between the query and page content
- Is a BM25-based relevance filter that uses head section content
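These content-based filters can also be combined in a single chain; a hedged sketch with arbitrary threshold values:
from crawl4ai import CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    SEOFilter,
    ContentRelevanceFilter,
)

# Sketch: require pages to pass both the SEO check and the BM25 relevance
# check before they are crawled further. Thresholds are illustrative only.
quality_chain = FilterChain([
    SEOFilter(threshold=0.4, keywords=["guide", "tutorial"]),
    ContentRelevanceFilter(
        query="web crawling with Python",
        threshold=0.6,
    ),
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=quality_chain,
    )
)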
7. Building a Complete Advanced Crawler
This example combines multiple techniques for a sophisticated crawl:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
FilterChain,
DomainFilter,
URLPatternFilter,
ContentTypeFilter
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
async def run_advanced_crawler():
# Create a sophisticated filter chain
filter_chain = FilterChain([
# Domain boundaries
DomainFilter(
allowed_domains=["docs.example.com"],
blocked_domains=["old.docs.example.com"]
),
# URL patterns to include
URLPatternFilter(patterns=["*guide*", "*tutorial*", "*blog*"]),
# Content type filtering
ContentTypeFilter(allowed_types=["text/html"])
])
# Create a relevance scorer
keyword_scorer = KeywordRelevanceScorer(
keywords=["crawl", "example", "async", "configuration"],
weight=0.7
)
# Set up the configuration
config = CrawlerRunConfig(
deep_crawl_strategy=BestFirstCrawlingStrategy(
max_depth=2,
include_external=False,
filter_chain=filter_chain,
url_scorer=keyword_scorer
),
scraping_strategy=LXMLWebScrapingStrategy(),
stream=True,
verbose=True
)
# Execute the crawl
results = []
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun("https://docs.example.com", config=config):
results.append(result)
score = result.metadata.get("score", 0)
depth = result.metadata.get("depth", 0)
print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")
# Analyze the results
print(f"Crawled {len(results)} high-value pages")
print(f"Average score: {sum(r.metadata.get('score', 0) for r in results) / len(results):.2f}")
# Group by depth
depth_counts = {}
for result in results:
depth = result.metadata.get("depth", 0)
depth_counts[depth] = depth_counts.get(depth, 0) + 1
print("Pages crawled by depth:")
for depth, count in sorted(depth_counts.items()):
print(f" Depth {depth}: {count} pages")
if __name__ == "__main__":
asyncio.run(run_advanced_crawler())
8. Limiting and Controlling Crawl Size
8.1 Using max_pages
You can limit the total number of pages crawled with the max_pages parameter:
# Limit to exactly 20 pages regardless of depth
strategy = BFSDeepCrawlStrategy(
max_depth=3,
max_pages=20
)
This feature is useful for:
- Controlling API costs
- Setting predictable execution times
- Focusing on the most important content
- Testing crawl configurations before full execution
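For example, a sketch of a quick configuration dry run that relies on the hard cap (example.com is a placeholder start URL):
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Sketch: dry-run a configuration with a hard page cap before committing
# to a full crawl.
async def test_config():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=3, max_pages=20),
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)
    print(f"Crawled {len(results)} pages (cap was 20)")

if __name__ == "__main__":
    asyncio.run(test_config())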
8.2 Using score_threshold
For BFS and DFS strategies, you can set a minimum score threshold to only crawl high-quality pages:
# Only follow links with scores above 0.4
strategy = DFSDeepCrawlStrategy(
max_depth=2,
url_scorer=KeywordRelevanceScorer(keywords=["api", "guide", "reference"]),
score_threshold=0.4 # Skip URLs with scores below this value
)
Note that for BestFirstCrawlingStrategy, score_threshold is not needed since pages are already processed in order of highest score first.
9. Common Pitfalls & Tips
1. Set realistic limits. Be cautious with max_depth values > 3, which can exponentially increase crawl size. Use max_pages to set hard limits.
2. Don't neglect the scoring component. BestFirstCrawling works best with well-tuned scorers. Experiment with keyword weights for optimal prioritization.
3. Be a good web citizen. Respect robots.txt (checking is disabled by default).
4. Handle page errors gracefully. Not all pages will be accessible. Check result.status when processing results; see the sketch after this list.
5. Balance breadth vs. depth. Choose your strategy wisely - BFS for comprehensive coverage, DFS for deep exploration, BestFirst for focused relevance-based crawling.
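A sketch of tip 4 in practice; the success and error_message attributes are assumed CrawlResult fields, so adjust to whatever status field your version exposes:
# Sketch for tip 4: skip pages that failed instead of letting one bad URL
# break post-processing. `success` / `error_message` are assumed fields;
# check your crawl4ai version if they differ.
def process_results(results):
    ok, failed = [], []
    for result in results:
        if getattr(result, "success", True):
            ok.append(result)
        else:
            failed.append((result.url, getattr(result, "error_message", "unknown error")))
    print(f"{len(ok)} pages succeeded, {len(failed)} failed")
    for url, err in failed[:5]:
        print(f"  FAILED {url}: {err}")
    return ok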
10. Summary & Next Steps
In this Deep Crawling with Crawl4AI tutorial, you learned to:
- Configure BFSDeepCrawlStrategy, DFSDeepCrawlStrategy, and BestFirstCrawlingStrategy
- Process results in streaming or non-streaming mode
- Apply filters to target specific content
- Use scorers to prioritize the most relevant pages
- Limit crawls with max_pages and score_threshold parameters
- Build a complete advanced crawler with combined techniques
With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.