Deep Crawling

One of Crawl4AI's most powerful features is its ability to perform configurable deep crawling that can explore websites beyond a single page. With fine-tuned control over crawl depth, domain boundaries, and content filtering, Crawl4AI gives you the tools to extract precisely the content you need.

In this tutorial, you'll learn:

  1. How to set up a Basic Deep Crawler with BFS strategy
  2. Understanding the difference between streamed and non-streamed output
  3. Implementing filters and scorers to target specific content
  4. Creating advanced filtering chains for sophisticated crawls
  5. Using BestFirstCrawling for intelligent exploration prioritization

Prerequisites
- You’ve completed or read AsyncWebCrawler Basics to understand how to run a simple crawl.
- You know how to configure CrawlerRunConfig.


1. Quick Example

Here's a minimal code snippet that implements a basic deep crawl using the BFSDeepCrawlStrategy:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

async def main():
    # Configure a 2-level deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2, 
            include_external=False
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

        print(f"Crawled {len(results)} pages in total")

        # Access individual results
        for result in results[:3]:  # Show first 3 results
            print(f"URL: {result.url}")
            print(f"Depth: {result.metadata.get('depth', 0)}")

if __name__ == "__main__":
    asyncio.run(main())

What's happening?

- BFSDeepCrawlStrategy(max_depth=2, include_external=False) instructs Crawl4AI to:
  - Crawl the starting page (depth 0) plus 2 more levels
  - Stay within the same domain (don't follow external links)
- Each result contains metadata like the crawl depth
- Results are returned as a list after all crawling is complete


2. Understanding Deep Crawling Strategy Options

2.1 BFSDeepCrawlStrategy (Breadth-First Search)

The BFSDeepCrawlStrategy uses a breadth-first approach, exploring all links at one depth before moving deeper:

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Basic configuration
strategy = BFSDeepCrawlStrategy(
    max_depth=2,               # Crawl initial page + 2 levels deep
    include_external=False,    # Stay within the same domain
    max_pages=50,              # Maximum number of pages to crawl (optional)
    score_threshold=0.3,       # Minimum score for URLs to be crawled (optional)
)

Key parameters:

- max_depth: Number of levels to crawl beyond the starting page
- include_external: Whether to follow links to other domains
- max_pages: Maximum number of pages to crawl (default: infinite)
- score_threshold: Minimum score for URLs to be crawled (default: -inf)
- filter_chain: FilterChain instance for URL filtering
- url_scorer: Scorer instance for evaluating URLs
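
The filter_chain and url_scorer parameters take the same FilterChain and scorer objects covered in sections 4 and 5. A minimal sketch combining them with BFS (the URL patterns and keywords here are only illustrative):

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    include_external=False,
    max_pages=50,
    # Only follow documentation-style URLs
    filter_chain=FilterChain([URLPatternFilter(patterns=["*docs*", "*guide*"])]),
    # Score discovered links; combined with score_threshold, low-scoring links are skipped
    url_scorer=KeywordRelevanceScorer(keywords=["crawl", "async"]),
    score_threshold=0.3,
)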

2.2 DFSDeepCrawlStrategy (Depth-First Search)

The DFSDeepCrawlStrategy uses a depth-first approach, exploring as far down a branch as possible before backtracking.

from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

# Basic configuration
strategy = DFSDeepCrawlStrategy(
    max_depth=2,               # Crawl initial page + 2 levels deep
    include_external=False,    # Stay within the same domain
    max_pages=30,              # Maximum number of pages to crawl (optional)
    score_threshold=0.5,       # Minimum score for URLs to be crawled (optional)
)

Key parameters:

- max_depth: Number of levels to crawl beyond the starting page
- include_external: Whether to follow links to other domains
- max_pages: Maximum number of pages to crawl (default: infinite)
- score_threshold: Minimum score for URLs to be crawled (default: -inf)
- filter_chain: FilterChain instance for URL filtering
- url_scorer: Scorer instance for evaluating URLs

2.3 BestFirstCrawlingStrategy (⭐️ - Recommended Deep crawl strategy)

For more intelligent crawling, use BestFirstCrawlingStrategy with scorers to prioritize the most relevant pages:

from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Create a scorer
scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7
)

# Configure the strategy
strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    include_external=False,
    url_scorer=scorer,
    max_pages=25,              # Maximum number of pages to crawl (optional)
)

This crawling approach:

- Evaluates each discovered URL based on scorer criteria
- Visits higher-scoring pages first
- Helps focus crawl resources on the most relevant content
- Can limit total pages crawled with max_pages
- Does not need score_threshold as it naturally prioritizes by score


3. Streaming vs. Non-Streaming Results

Crawl4AI can return results in two modes:

3.1 Non-Streaming Mode (Default)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=False  # Default behavior
)

async with AsyncWebCrawler() as crawler:
    # Wait for ALL results to be collected before returning
    results = await crawler.arun("https://example.com", config=config)

    for result in results:
        process_result(result)

When to use non-streaming mode:

- You need the complete dataset before processing
- You're performing batch operations on all results together
- Crawl time isn't a critical factor
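
The snippets in this section call a process_result helper that is not part of Crawl4AI; it stands in for whatever per-page work you need. A minimal sketch, assuming you only want to log the URL and the size of the extracted Markdown:

def process_result(result):
    # Placeholder for your own logic: store, index, or inspect the page
    markdown = str(result.markdown or "")
    print(f"Processed {result.url} ({len(markdown)} chars of markdown)")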

3.2 Streaming Mode

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=True  # Enable streaming
)

async with AsyncWebCrawler() as crawler:
    # Returns an async iterator
    async for result in await crawler.arun("https://example.com", config=config):
        # Process each result as it becomes available
        process_result(result)

Benefits of streaming mode:

- Process results immediately as they're discovered
- Start working with early results while crawling continues
- Better for real-time applications or progressive display
- Reduces memory pressure when handling many pages
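
One way to take advantage of the lower memory footprint, for example, is to persist each page as soon as it arrives instead of accumulating everything in a list. A minimal sketch (the output file name and the fields kept per page are arbitrary choices):

import json

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=True
)

async with AsyncWebCrawler() as crawler:
    with open("crawl_output.jsonl", "w") as f:
        async for result in await crawler.arun("https://example.com", config=config):
            # Write one JSON line per page as soon as it is available
            f.write(json.dumps({
                "url": result.url,
                "depth": result.metadata.get("depth", 0),
                "success": result.success,
            }) + "\n")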


4. Filtering Content with Filter Chains

Filters help you narrow down which pages to crawl. Combine multiple filters using FilterChain for powerful targeting.

4.1 Basic URL Pattern Filter

from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

# Only follow URLs containing "blog" or "docs"
url_filter = URLPatternFilter(patterns=["*blog*", "*docs*"])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([url_filter])
    )
)

4.2 Combining Multiple Filters

from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter
)

# Create a chain of filters
filter_chain = FilterChain([
    # Only follow URLs with specific patterns
    URLPatternFilter(patterns=["*guide*", "*tutorial*"]),

    # Only crawl specific domains
    DomainFilter(
        allowed_domains=["docs.example.com"],
        blocked_domains=["old.docs.example.com"]
    ),

    # Only include specific content types
    ContentTypeFilter(allowed_types=["text/html"])
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=filter_chain
    )
)

4.3 Available Filter Types

Crawl4AI includes several specialized filters:

  • URLPatternFilter: Matches URL patterns using wildcard syntax
  • DomainFilter: Controls which domains to include or exclude
  • ContentTypeFilter: Filters based on HTTP Content-Type
  • ContentRelevanceFilter: Uses similarity to a text query
  • SEOFilter: Evaluates SEO elements (meta tags, headers, etc.)


5. Using Scorers for Prioritized Crawling

Scorers assign priority values to discovered URLs, helping the crawler focus on the most relevant content first.

5.1 KeywordRelevanceScorer

from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy

# Create a keyword relevance scorer
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7  # Importance of this scorer (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        url_scorer=keyword_scorer
    ),
    stream=True  # Recommended with BestFirstCrawling
)

# Results will come in order of relevance score
async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun("https://example.com", config=config):
        score = result.metadata.get("score", 0)
        print(f"Score: {score:.2f} | {result.url}")

How scorers work:

- Evaluate each discovered URL before crawling
- Calculate relevance based on various signals
- Help the crawler make intelligent choices about traversal order


6. Advanced Filtering Techniques

6.1 SEO Filter for Quality Assessment

The SEOFilter helps you identify pages with strong SEO characteristics:

from crawl4ai.deep_crawling.filters import FilterChain, SEOFilter

# Create an SEO filter that looks for specific keywords in page metadata
seo_filter = SEOFilter(
    threshold=0.5,  # Minimum score (0.0 to 1.0)
    keywords=["tutorial", "guide", "documentation"]
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([seo_filter])
    )
)

6.2 Content Relevance Filter

The ContentRelevanceFilter analyzes the actual content of pages:

from crawl4ai.deep_crawling.filters import FilterChain, ContentRelevanceFilter

# Create a content relevance filter
relevance_filter = ContentRelevanceFilter(
    query="Web crawling and data extraction with Python",
    threshold=0.7  # Minimum similarity score (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([relevance_filter])
    )
)

This filter:

- Measures semantic similarity between query and page content
- It's a BM25-based relevance filter using head section content


7. Building a Complete Advanced Crawler

This example combines multiple techniques for a sophisticated crawl:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

async def run_advanced_crawler():
    # Create a sophisticated filter chain
    filter_chain = FilterChain([
        # Domain boundaries
        DomainFilter(
            allowed_domains=["docs.example.com"],
            blocked_domains=["old.docs.example.com"]
        ),

        # URL patterns to include
        URLPatternFilter(patterns=["*guide*", "*tutorial*", "*blog*"]),

        # Content type filtering
        ContentTypeFilter(allowed_types=["text/html"])
    ])

    # Create a relevance scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["crawl", "example", "async", "configuration"],
        weight=0.7
    )

    # Set up the configuration
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
            url_scorer=keyword_scorer
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True
    )

    # Execute the crawl
    results = []
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.example.com", config=config):
            results.append(result)
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")

    # Analyze the results
    print(f"Crawled {len(results)} high-value pages")
    if results:
        print(f"Average score: {sum(r.metadata.get('score', 0) for r in results) / len(results):.2f}")

    # Group by depth
    depth_counts = {}
    for result in results:
        depth = result.metadata.get("depth", 0)
        depth_counts[depth] = depth_counts.get(depth, 0) + 1

    print("Pages crawled by depth:")
    for depth, count in sorted(depth_counts.items()):
        print(f"  Depth {depth}: {count} pages")

if __name__ == "__main__":
    asyncio.run(run_advanced_crawler())

8. Limiting and Controlling Crawl Size

8.1 Using max_pages

You can limit the total number of pages crawled with the max_pages parameter:

# Limit to exactly 20 pages regardless of depth
strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    max_pages=20
)

This feature is useful for:

- Controlling API costs
- Setting predictable execution times
- Focusing on the most important content
- Testing crawl configurations before full execution
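
For the last point, a dry run can be as simple as keeping your strategy and filters but capping the crawl at a handful of pages before committing to the full configuration. A minimal sketch (the small page budget is only for testing):

# Dry run: same strategy, but capped at 5 pages for a quick sanity check
test_config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=3,
        max_pages=5
    ),
    verbose=True
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun("https://example.com", config=test_config)
    for result in results:
        print(f"[test] depth={result.metadata.get('depth', 0)} | {result.url}")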

8.2 Using score_threshold

For BFS and DFS strategies, you can set a minimum score threshold to only crawl high-quality pages:

# Only follow links with scores above 0.4
strategy = DFSDeepCrawlStrategy(
    max_depth=2,
    url_scorer=KeywordRelevanceScorer(keywords=["api", "guide", "reference"]),
    score_threshold=0.4  # Skip URLs with scores below this value
)

Note that for BestFirstCrawlingStrategy, score_threshold is not needed since pages are already processed in order of highest score first.

9. Common Pitfalls & Tips

1. Set realistic limits. Be cautious with max_depth values > 3, which can exponentially increase crawl size. Use max_pages to set hard limits.

2. Don't neglect the scoring component. BestFirstCrawling works best with well-tuned scorers. Experiment with keyword weights for optimal prioritization.

3. Be a good web citizen. Respect robots.txt (checking it is disabled by default).

4. Handle page errors gracefully. Not all pages will be accessible. Check result.status when processing results (see the sketch after this list).

5. Balance breadth vs. depth. Choose your strategy wisely - BFS for comprehensive coverage, DFS for deep exploration, BestFirst for focused relevance-based crawling.
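
A minimal sketch of the error check from tip 4; the success, status_code, and error_message fields used here are standard CrawlResult attributes rather than anything specific to deep crawling:

for result in results:
    if not result.success:
        # status_code and error_message explain why a page could not be crawled
        print(f"Failed ({result.status_code}): {result.url} - {result.error_message}")
        continue
    # ...process successful pages here...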


10. Summary & Next Steps

In this Deep Crawling with Crawl4AI tutorial, you learned to:

  • Configure BFSDeepCrawlStrategy, DFSDeepCrawlStrategy, and BestFirstCrawlingStrategy
  • Process results in streaming or non-streaming mode
  • Apply filters to target specific content
  • Use scorers to prioritize the most relevant pages
  • Limit crawls with max_pages and score_threshold parameters
  • Build a complete advanced crawler with combined techniques

With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.

