Deep Crawling

One of Crawl4AI's most powerful features is its configurable deep crawling, which can explore websites beyond a single page. With fine-grained control over crawl depth, domain boundaries, and content filtering, Crawl4AI gives you the tools to extract precisely the content you need.

In this tutorial, you'll learn:

  1. How to set up a basic deep crawler with a BFS strategy
  2. Understanding the difference between streamed and non-streamed output
  3. Implementing filters and scorers to target specific content
  4. Creating advanced filtering chains for sophisticated crawls
  5. Using BestFirstCrawling for intelligent exploration prioritization
Prerequisites

  • You've completed or read AsyncWebCrawler Basics and know how to run a simple crawl.
  • You know how to configure CrawlerRunConfig.

1. Quick Example

Below is a minimal snippet that runs a basic deep crawl using BFSDeepCrawlStrategy:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

async def main():
    # Configure a 2-level deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2, 
            include_external=False
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

        print(f"Crawled {len(results)} pages in total")

        # Access individual results
        for result in results[:3]:  # Show first 3 results
            print(f"URL: {result.url}")
            print(f"Depth: {result.metadata.get('depth', 0)}")

if __name__ == "__main__":
    asyncio.run(main())

What's happening? BFSDeepCrawlStrategy(max_depth=2, include_external=False) instructs Crawl4AI to:

  • Crawl the starting page (depth 0) plus 2 more levels
  • Stay within the same domain (don't follow external links)

Each result carries metadata such as its crawl depth, and the results are returned as a list once all crawling is complete.


2. Understanding Deep Crawling Strategy Options

2.1 BFSDeepCrawlStrategy (Breadth-First Search)

BFSDeepCrawlStrategy uses a breadth-first approach, exploring all links at one depth before moving deeper:

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Basic configuration
strategy = BFSDeepCrawlStrategy(
    max_depth=2,               # Crawl initial page + 2 levels deep
    include_external=False,    # Stay within the same domain
    max_pages=50,              # Maximum number of pages to crawl (optional)
    score_threshold=0.3,       # Minimum score for URLs to be crawled (optional)
)

Key parameters:

  • max_depth: Number of levels to crawl beyond the starting page
  • include_external: Whether to follow links to other domains
  • max_pages: Maximum number of pages to crawl (default: infinite)
  • score_threshold: Minimum score for URLs to be crawled (default: -inf)
  • filter_chain: FilterChain instance for URL filtering
  • url_scorer: Scorer instance for evaluating URLs
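
The basic snippets in this section don't exercise filter_chain or url_scorer. Here is a minimal sketch of how they plug into the same strategy, using only the FilterChain, URLPatternFilter, and KeywordRelevanceScorer classes covered later in this tutorial (the patterns, keywords, and thresholds are placeholder values):

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# BFS crawl that only follows documentation-style URLs and skips
# links whose keyword-relevance score falls below the threshold
strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    include_external=False,
    max_pages=50,
    filter_chain=FilterChain([URLPatternFilter(patterns=["*docs*", "*guide*"])]),
    url_scorer=KeywordRelevanceScorer(keywords=["crawl", "async"], weight=0.7),
    score_threshold=0.3,
)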

2.2 DFSDeepCrawlStrategy (Depth-First Search)

DFSDeepCrawlStrategy uses a depth-first approach, diving as deep as possible into each branch before backtracking.

from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

# Basic configuration
strategy = DFSDeepCrawlStrategy(
    max_depth=2,               # Crawl initial page + 2 levels deep
    include_external=False,    # Stay within the same domain
    max_pages=30,              # Maximum number of pages to crawl (optional)
    score_threshold=0.5,       # Minimum score for URLs to be crawled (optional)
)

Key parameters:

  • max_depth: Number of levels to crawl beyond the starting page
  • include_external: Whether to follow links to other domains
  • max_pages: Maximum number of pages to crawl (default: infinite)
  • score_threshold: Minimum score for URLs to be crawled (default: -inf)
  • filter_chain: FilterChain instance for URL filtering
  • url_scorer: Scorer instance for evaluating URLs

2.3 BestFirstCrawlingStrategy

For more intelligent crawling, use BestFirstCrawlingStrategy with a scorer to prioritize the most relevant pages:

from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Create a scorer
scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7
)

# Configure the strategy
strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    include_external=False,
    url_scorer=scorer,
    max_pages=25,              # Maximum number of pages to crawl (optional)
)

This crawling approach:

  • Evaluates each discovered URL based on the scorer's criteria
  • Visits higher-scoring pages first
  • Helps focus crawl resources on the most relevant content
  • Can limit the total number of pages crawled with max_pages
  • Doesn't need score_threshold, since it naturally prioritizes by score


3. Streaming vs. Non-Streaming Results

Crawl4AI can return results in two modes:

3.1 Non-Streaming Mode (Default)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=False  # Default behavior
)

async with AsyncWebCrawler() as crawler:
    # Wait for ALL results to be collected before returning
    results = await crawler.arun("https://example.com", config=config)

    for result in results:
        process_result(result)

When to use non-streaming mode:

  • You need the complete dataset before processing
  • You're performing batch operations on all results together (see the sketch below)
  • Crawl time isn't a critical factor
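
As a sketch of the "batch operations" case: once the full list is returned, you can reorganize every result in one pass, something a streaming loop could only do after the crawl finishes anyway. The example below sorts the completed crawl into a depth-ordered outline and relies only on result.url and the depth metadata shown earlier:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def outline_after_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        stream=False,  # wait for the complete result list
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

    # Batch step: order the whole crawl as a depth-then-URL outline
    outline = sorted(results, key=lambda r: (r.metadata.get("depth", 0), r.url))
    for r in outline:
        indent = "  " * r.metadata.get("depth", 0)
        print(f"{indent}{r.url}")

if __name__ == "__main__":
    asyncio.run(outline_after_crawl())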

3.2 Streaming Mode

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=True  # Enable streaming
)

async with AsyncWebCrawler() as crawler:
    # Returns an async iterator
    async for result in await crawler.arun("https://example.com", config=config):
        # Process each result as it becomes available
        process_result(result)

Benefits of streaming mode:

  • Process results immediately as they're discovered
  • Start working with early results while crawling continues (as sketched below)
  • Better suited to real-time applications or progressive display
  • Reduced memory pressure when handling many pages
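
A sketch of putting early results to use: each page is appended to a JSON Lines file the moment it arrives, so even an interrupted crawl leaves usable output on disk. The file name and record fields are arbitrary choices for this example:

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def stream_to_jsonl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        stream=True,  # results arrive as an async iterator
    )
    async with AsyncWebCrawler() as crawler:
        with open("crawl_results.jsonl", "a", encoding="utf-8") as f:
            async for result in await crawler.arun("https://example.com", config=config):
                # Persist each page as soon as it has been crawled
                record = {"url": result.url, "depth": result.metadata.get("depth", 0)}
                f.write(json.dumps(record) + "\n")
                print(f"saved {result.url}")

if __name__ == "__main__":
    asyncio.run(stream_to_jsonl())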


4. Filtering Content with Filter Chains

Filters help you narrow down which pages to crawl. Combine multiple filters with a FilterChain for precise targeting.

4.1 Basic URL Pattern Filter

from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

# Only follow URLs containing "blog" or "docs"
url_filter = URLPatternFilter(patterns=["*blog*", "*docs*"])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([url_filter])
    )
)

4.2 Combining Multiple Filters

from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter
)

# Create a chain of filters
filter_chain = FilterChain([
    # Only follow URLs with specific patterns
    URLPatternFilter(patterns=["*guide*", "*tutorial*"]),

    # Only crawl specific domains
    DomainFilter(
        allowed_domains=["docs.example.com"],
        blocked_domains=["old.docs.example.com"]
    ),

    # Only include specific content types
    ContentTypeFilter(allowed_types=["text/html"])
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=filter_chain
    )
)

4.3 Available Filter Types

Crawl4AI includes several specialized filters:

  • URLPatternFilter: Matches URL patterns using wildcard syntax
  • DomainFilter: Controls which domains to include or exclude
  • ContentTypeFilter: Filters based on HTTP content types
  • ContentRelevanceFilter: Uses similarity to a text query
  • SEOFilter: Evaluates SEO elements (meta tags, headers, etc.)

5. Using Scorers for Prioritized Crawling

Scorers assign priority values to discovered URLs, helping the crawler focus on the most relevant content first.

5.1 KeywordRelevanceScorer

from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy

# Create a keyword relevance scorer
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7  # Importance of this scorer (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        url_scorer=keyword_scorer
    ),
    stream=True  # Recommended with BestFirstCrawling
)

# Results will come in order of relevance score
async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun("https://example.com", config=config):
        score = result.metadata.get("score", 0)
        print(f"Score: {score:.2f} | {result.url}")

How scorers work:

  • Evaluate each discovered URL before it's crawled
  • Calculate relevance based on various signals
  • Help the crawler make informed choices about traversal order


6. Advanced Filtering Techniques

6.1 SEO Filter for Quality Assessment

The SEOFilter helps you identify pages with strong SEO characteristics:

from crawl4ai.deep_crawling.filters import FilterChain, SEOFilter

# Create an SEO filter that looks for specific keywords in page metadata
seo_filter = SEOFilter(
    threshold=0.5,  # Minimum score (0.0 to 1.0)
    keywords=["tutorial", "guide", "documentation"]
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([seo_filter])
    )
)

6.2 Content Relevance Filter

The ContentRelevanceFilter analyzes the actual content of pages:

from crawl4ai.deep_crawling.filters import FilterChain, ContentRelevanceFilter

# Create a content relevance filter
relevance_filter = ContentRelevanceFilter(
    query="Web crawling and data extraction with Python",
    threshold=0.7  # Minimum similarity score (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([relevance_filter])
    )
)

This filter:

  • Measures semantic similarity between the query and page content
  • Is a BM25-based relevance filter that works on the page's head section content
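
The two relevance-oriented filters can also be combined in a single chain. A sketch using only the classes shown above, so that a URL must clear both the BM25 relevance check and the SEO check before it is crawled (the query, keywords, and thresholds are illustrative):

from crawl4ai import CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    ContentRelevanceFilter,
    SEOFilter
)

# A URL is only crawled if it passes every filter in the chain
combined_chain = FilterChain([
    ContentRelevanceFilter(
        query="Web crawling and data extraction with Python",
        threshold=0.7
    ),
    SEOFilter(threshold=0.5, keywords=["tutorial", "guide"])
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=combined_chain
    )
)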


7. Building a Complete Advanced Crawler

This example combines several techniques into one sophisticated crawl:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

async def run_advanced_crawler():
    # Create a sophisticated filter chain
    filter_chain = FilterChain([
        # Domain boundaries
        DomainFilter(
            allowed_domains=["docs.example.com"],
            blocked_domains=["old.docs.example.com"]
        ),

        # URL patterns to include
        URLPatternFilter(patterns=["*guide*", "*tutorial*", "*blog*"]),

        # Content type filtering
        ContentTypeFilter(allowed_types=["text/html"])
    ])

    # Create a relevance scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["crawl", "example", "async", "configuration"],
        weight=0.7
    )

    # Set up the configuration
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
            url_scorer=keyword_scorer
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True
    )

    # Execute the crawl
    results = []
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.example.com", config=config):
            results.append(result)
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")

    # Analyze the results
    print(f"Crawled {len(results)} high-value pages")
    print(f"Average score: {sum(r.metadata.get('score', 0) for r in results) / len(results):.2f}")

    # Group by depth
    depth_counts = {}
    for result in results:
        depth = result.metadata.get("depth", 0)
        depth_counts[depth] = depth_counts.get(depth, 0) + 1

    print("Pages crawled by depth:")
    for depth, count in sorted(depth_counts.items()):
        print(f"  Depth {depth}: {count} pages")

if __name__ == "__main__":
    asyncio.run(run_advanced_crawler())

8. Limiting and Controlling Crawl Size

8.1 Using max_pages

You can limit the total number of pages crawled with the max_pages parameter:

# Limit to exactly 20 pages regardless of depth
strategy = BFSDeepCrawlStrategy(
    max_depth=3,
    max_pages=20
)

This feature is useful for:

  • Controlling API costs
  • Setting predictable execution times
  • Focusing on the most important content
  • Testing crawl configurations before a full run (see the sketch below)
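
A sketch of the "test before a full run" idea: build the same strategy twice, first with a tiny max_pages cap to sanity-check the configuration, then at full size once the output looks right (both limits here are arbitrary example values):

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

def make_strategy(page_limit: int) -> BFSDeepCrawlStrategy:
    # Identical crawl settings; only the hard page cap differs
    return BFSDeepCrawlStrategy(
        max_depth=3,
        include_external=False,
        max_pages=page_limit
    )

test_strategy = make_strategy(5)     # quick dry run to verify the setup
full_strategy = make_strategy(200)   # full crawl once the results look correct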

8.2 Using score_threshold

For the BFS and DFS strategies, you can set a minimum score threshold so that only high-quality pages are crawled:

# Only follow links with scores above 0.4
strategy = DFSDeepCrawlStrategy(
    max_depth=2,
    url_scorer=KeywordRelevanceScorer(keywords=["api", "guide", "reference"]),
    score_threshold=0.4  # Skip URLs with scores below this value
)

Note that BestFirstCrawlingStrategy doesn't need score_threshold, since it already processes pages in order of highest score first.

9. Common Pitfalls & Tips

1. Set realistic limits. Be cautious with max_depth values greater than 3, which can grow the crawl exponentially. Use max_pages to set a hard cap.

2. Don't neglect the scoring step. BestFirstCrawling works best with a well-tuned scorer. Experiment with keyword weights to get the prioritization right.

3. Be a good web citizen. Respect robots.txt. (This is disabled by default.)

4. Handle page errors gracefully. Not every page will be accessible. Check result.status when processing results (see the sketch after this list).

5. Balance breadth and depth. Choose your strategy wisely: BFS for comprehensive coverage, DFS for deep exploration, and BestFirst for focused, relevance-driven crawling.
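
For pitfall #4, here is a minimal sketch of defensive result handling, assuming the standard CrawlResult fields success, status_code, and error_message:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def crawl_with_error_handling():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        stream=True
    )
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://example.com", config=config):
            if not result.success:
                # Failed pages still come back as results; log them and move on
                print(f"FAILED ({result.status_code}): {result.url} - {result.error_message}")
                continue
            print(f"OK ({result.status_code}): {result.url}")

if __name__ == "__main__":
    asyncio.run(crawl_with_error_handling())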


10. Summary & Next Steps

In this tutorial on deep crawling with Crawl4AI, you learned how to:

  • Configure BFSDeepCrawlStrategy, DFSDeepCrawlStrategy, and BestFirstCrawlingStrategy
  • Process results in streaming or non-streaming mode
  • Apply filters to target specific content
  • Use scorers to prioritize the most relevant pages
  • Limit crawls with the max_pages and score_threshold parameters
  • Combine multiple techniques into a complete advanced crawler

With these tools you can efficiently extract structured data from websites at scale, focusing precisely on the content your use case requires.

