digest()

The digest() method is the primary interface for adaptive web crawling. It intelligently crawls websites starting from a given URL, guided by a query, and automatically determines when sufficient information has been gathered.

Method Signature

async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> CrawlState

Parameters

start_url

  • Type: str

  • Required: Yes

  • Description: The starting URL for the crawl. This should be a valid HTTP/HTTPS URL that serves as the entry point for information gathering.

query

  • Type: str

  • Required: Yes

  • Description: The search query that guides the crawling process. This should contain key terms related to the information you're seeking. The crawler uses this to evaluate relevance and determine which links to follow.

resume_from

  • Type: Optional[Union[str, Path]]

  • Default: None

  • Description: Path to a previously saved crawl state file. When provided, the crawler resumes from the saved state instead of starting fresh.

Return Value

Returns a CrawlState object containing:

  • crawled_urls (Set[str]): All URLs that have been crawled

  • knowledge_base (List[CrawlResult]): Collection of crawled pages with content

  • pending_links (List[Link]): Links discovered but not yet crawled

  • metrics (Dict[str, float]): Performance and quality metrics

  • query (str): The original query

  • Additional statistical information for scoring

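The fields above can be read directly from the returned state. A minimal sketch, reusing the adaptive instance from the examples further down; the url attribute on each CrawlResult is an assumption made here for illustration:

state = await adaptive.digest(
    start_url="https://docs.example.com",
    query="api reference"
)

print(f"Pages crawled: {len(state.crawled_urls)}")
print(f"Pending links: {len(state.pending_links)}")
print(f"Metrics: {state.metrics}")

# Each knowledge_base entry is a CrawlResult; .url is assumed for illustration
for result in state.knowledge_base:
    print(result.url)
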
How It Works

The digest() method implements an intelligent crawling algorithm:

  1. Initial Crawl: Starts from the provided URL

  2. Link Analysis: Evaluates all discovered links for relevance

  3. Scoring: Uses three metrics to assess information sufficiency:

     • Coverage: How well the query terms are covered

     • Consistency: Information coherence across pages

     • Saturation: Diminishing returns detection

  4. Adaptive Selection: Chooses the most promising links to follow

  5. Stopping Decision: Automatically stops when confidence threshold is reached

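The listing below is a deliberately simplified, self-contained sketch of this loop, not the library's implementation: fetch_page is a hypothetical stand-in for the real crawler, and "confidence" is reduced to plain query-term coverage instead of the coverage/consistency/saturation scoring described above.

# Illustrative sketch only -- NOT the crawl4ai implementation.
import re
from typing import Awaitable, Callable

async def adaptive_digest_sketch(
    fetch_page: Callable[[str], Awaitable[tuple[str, list[str]]]],  # url -> (page text, links)
    start_url: str,
    query: str,
    confidence_threshold: float = 0.8,
    max_pages: int = 20,
    top_k_links: int = 3,
) -> set[str]:
    terms = set(query.lower().split())
    crawled: set[str] = set()
    covered: set[str] = set()
    frontier = [start_url]

    while frontier and len(crawled) < max_pages:
        url = frontier.pop(0)
        if url in crawled:
            continue
        text, links = await fetch_page(url)                     # 1. crawl the page
        crawled.add(url)
        covered |= terms & set(re.findall(r"\w+", text.lower()))  # 3. (crude) coverage update

        confidence = len(covered) / max(len(terms), 1)           # stand-in confidence score
        if confidence >= confidence_threshold:                   # 5. stopping decision
            break

        # 2 + 4. rank discovered links by query-term overlap, follow the best few
        def link_score(link: str) -> int:
            return len(terms & set(re.findall(r"\w+", link.lower())))
        frontier.extend(sorted(links, key=link_score, reverse=True)[:top_k_links])

    return crawled
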
Examples

Basic Usage

async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)

    state = await adaptive.digest(
        start_url="https://docs.python.org/3/",
        query="async await context managers"
    )

    print(f"Crawled {len(state.crawled_urls)} pages")
    print(f"Confidence: {adaptive.confidence:.0%}")

With Configuration

config = AdaptiveConfig(
    confidence_threshold=0.9,  # Require high confidence
    max_pages=30,             # Allow more pages
    top_k_links=3             # Follow top 3 links per page
)

adaptive = AdaptiveCrawler(crawler, config=config)

state = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)

Resuming a Previous Crawl

# First crawl - may be interrupted
state1 = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms"
)

# Save state (if not auto-saved)
state1.save("ml_crawl_state.json")

# Later, resume from saved state
state2 = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms",
    resume_from="ml_crawl_state.json"
)

With Progress Monitoring

state = await adaptive.digest(
    start_url="https://docs.example.com",
    query="api reference"
)

# Monitor progress
print(f"Pages crawled: {len(state.crawled_urls)}")
print(f"New terms discovered: {state.new_terms_history}")
print(f"Final confidence: {adaptive.confidence:.2%}")

# View detailed statistics
adaptive.print_stats(detailed=True)

Query Best Practices

  1. Be Specific: Use descriptive terms that appear in target content

     # Good
     query = "python async context managers implementation"

     # Too broad
     query = "python programming"

  2. Include Key Terms: Add technical terms you expect to find

     query = "oauth2 jwt refresh tokens authorization"

  3. Multiple Concepts: Combine related concepts for comprehensive coverage

     query = "rest api pagination sorting filtering"

Performance Considerations

  • Initial URL: Choose a page with good navigation (e.g., a documentation index)

  • Query Length: 3-8 terms typically work best

  • Link Density: Sites with clear navigation crawl more efficiently

  • Caching: Enable caching for repeated crawls of the same domain

Error Handling

try:
    state = await adaptive.digest(
        start_url="https://example.com",
        query="search terms"
    )
except Exception as e:
    print(f"Crawl failed: {e}")
    # State is auto-saved if save_state=True in config

Stopping Conditions

The crawl stops when any of these conditions is met:

  1. Confidence Threshold: Reached the configured confidence level

  2. Page Limit: Crawled the maximum number of pages

  3. Diminishing Returns: Expected information gain below threshold

  4. No Relevant Links: No promising links remain to follow

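A hedged sketch of inspecting which condition ended a crawl. It assumes the values passed to AdaptiveConfig remain readable as attributes (confidence_threshold, max_pages), which is an assumption for illustration rather than documented behavior:

config = AdaptiveConfig(confidence_threshold=0.8, max_pages=30)
adaptive = AdaptiveCrawler(crawler, config=config)

state = await adaptive.digest(
    start_url="https://docs.example.com",
    query="api reference"
)

# Illustrative only: reading thresholds back from config is an assumption
if adaptive.confidence >= config.confidence_threshold:
    print("Stopped: confidence threshold reached")
elif len(state.crawled_urls) >= config.max_pages:
    print("Stopped: page limit reached")
elif not state.pending_links:
    print("Stopped: no relevant links left to follow")
else:
    print("Stopped: diminishing information gain")
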
See Also

