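🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update

January 28, 2025 • 10 min read

Today I'm releasing Crawl4AI v0.7.0, the Adaptive Intelligence Update. This release fundamentally improves how Crawl4AI handles the complexity of the modern web through adaptive learning, intelligent content discovery, and advanced extraction capabilities.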

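🎯 What's New at a Glance

Adaptive Crawling: your crawler now learns and adapts to website patterns
Virtual Scroll Support: complete content extraction from infinite-scroll pages
Link Preview with Intelligent Scoring: smart link analysis and prioritization
Async URL Seeder: discover thousands of URLs in seconds with intelligent filtering
Performance Optimizations: significant speed and memory improvements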

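🧠 Adaptive Crawling: Intelligence Through Pattern Learning

The Problem: Websites change. Class names shift. IDs disappear. Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.

My Solution: I implemented an adaptive learning system that observes patterns, builds confidence scores, and adjusts extraction strategies on the fly. It's like having a junior developer who gets better at the job with every page they scrape.

Technical Deep Dive

The Adaptive Crawler maintains a persistent state for each domain, tracking:
- Pattern success rates
- Selector stability over time
- Content structure variations
- Extraction confidence scores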

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
import asyncio

async def main():

    # Configure adaptive crawler
    config = AdaptiveConfig(
        strategy="statistical",  # or "embedding" for semantic understanding
        max_pages=10,
        confidence_threshold=0.7,  # Stop at 70% confidence
        top_k_links=3,  # Follow top 3 links per page
        min_gain_threshold=0.05  # Need 5% information gain to continue
    )

    async with AsyncWebCrawler(verbose=False) as crawler:
        adaptive = AdaptiveCrawler(crawler, config)

        print("Starting adaptive crawl about Python decorators...")
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/glossary.html",
            query="python decorators functions wrapping"
        )

        print(f"\n✅ Crawling Complete!")
        print(f"• Confidence Level: {adaptive.confidence:.0%}")
        print(f"• Pages Crawled: {len(result.crawled_urls)}")
        print(f"• Knowledge Base: {len(adaptive.state.knowledge_base)} documents")

        # Get most relevant content
        relevant = adaptive.get_relevant_content(top_k=3)
        print(f"\nMost Relevant Pages:")
        for i, page in enumerate(relevant, 1):
            print(f"{i}. {page['url']} (relevance: {page['score']:.2%})")

asyncio.run(main())

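Expected Real-World Impact:
- News aggregation: maintain 95%+ extraction accuracy even as news sites update their templates
- E-commerce monitoring: track product changes across hundreds of stores without constant maintenance
- Research data collection: build robust academic datasets that survive website redesigns
- Reduced maintenance: cut selector update time by 80% for frequently changing sites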
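
The thresholds in the configuration above (confidence_threshold, min_gain_threshold, max_pages) suggest a simple stopping rule. The snippet below is a minimal sketch of how such thresholds might interact. It is not Crawl4AI's internal logic: StopRule and should_stop are hypothetical names, and treating "gain" as the change in confidence between two consecutive pages is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StopRule:
    """Hypothetical stand-in for the thresholds passed to AdaptiveConfig."""
    confidence_threshold: float = 0.7   # stop once confidence reaches 70%
    min_gain_threshold: float = 0.05    # stop if the last page added less than 5%
    max_pages: int = 10                 # hard cap on pages crawled


def should_stop(rule: StopRule, confidences: list[float]) -> bool:
    """Decide whether to stop, given the confidence recorded after each page."""
    if not confidences:
        return False
    if len(confidences) >= rule.max_pages:
        return True
    if confidences[-1] >= rule.confidence_threshold:
        return True
    if len(confidences) >= 2:
        # Assumption: gain is the confidence delta between the last two pages
        gain = confidences[-1] - confidences[-2]
        return gain < rule.min_gain_threshold
    return False


rule = StopRule()
print(should_stop(rule, [0.20, 0.45]))        # still learning, keep going
print(should_stop(rule, [0.20, 0.45, 0.72]))  # confidence threshold reached
print(should_stop(rule, [0.20, 0.45, 0.47]))  # diminishing returns
```

Keeping the three conditions separate makes it easy to see which one ended a crawl: the page budget, sufficient confidence, or a page that taught the crawler almost nothing new.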

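🌊 Virtual Scroll: Complete Content Capture

The Problem: Modern web apps only render what's visible. Scroll down, new content appears, old content vanishes. Traditional crawlers capture the first viewport and miss 90% of the content. It's like reading only the first page of every book.

My Solution: I built Virtual Scroll support that mimics human browsing behavior, capturing content as it loads and preserving it before the browser's garbage collector strikes.

Implementation Details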

import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# For social media feeds (Twitter/X style)
twitter_config = VirtualScrollConfig(
    container_selector="[data-testid='primaryColumn']",
    scroll_count=20,                    # Number of scrolls
    scroll_by="container_height",       # Smart scrolling by container size
    wait_after_scroll=1.0               # Let content load
)

# For e-commerce product grids (Instagram-style layouts)
grid_config = VirtualScrollConfig(
    container_selector="main .product-grid",
    scroll_count=30,
    scroll_by=800,                      # Fixed pixel scrolling
    wait_after_scroll=1.5               # Images need time
)

# For news feeds with lazy loading
news_config = VirtualScrollConfig(
    container_selector=".article-feed",
    scroll_count=50,
    scroll_by="page_height",            # Viewport-based scrolling
    wait_after_scroll=0.5               # Wait for content to load
)

# JsonCssExtractionStrategy expects a schema with a base selector and a list of fields
tweet_schema = {
    "name": "tweets",
    "baseSelector": "[data-testid='tweet']",
    "fields": [
        {"name": "text", "selector": "[data-testid='tweetText']", "type": "text"},
        {"name": "likes", "selector": "[data-testid='like']", "type": "text"},
    ],
}

# Use it in your crawl
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://twitter.com/trending",
            config=CrawlerRunConfig(
                virtual_scroll_config=twitter_config,
                # Combine with other features
                extraction_strategy=JsonCssExtractionStrategy(tweet_schema),
            )
        )

        # extracted_content is a JSON string with one object per matched tweet
        tweets = json.loads(result.extracted_content or "[]")
        print(f"Captured {len(tweets)} tweets")

asyncio.run(main())

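Key Features:
- DOM recycling awareness: detects and handles the recycling of virtualized DOM elements
- Smart scroll physics: three modes (container height, page height, or fixed pixels)
- Content preservation: captures content before it is destroyed
- Intelligent stopping: stops when no new content appears
- Memory efficient: streams content instead of holding everything in memory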
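
To make "content preservation" and "intelligent stopping" concrete, here is a small, browser-free simulation. It is only a sketch of the general capture-and-merge idea, not Crawl4AI's implementation: visible_window and capture_all are hypothetical helpers, and the virtualized list is modelled as a plain Python list of which only a slice is "in the DOM" at any time.

```python
def visible_window(items: list[str], offset: int, size: int) -> list[str]:
    """Simulate a virtualized list: only `size` items exist in the DOM at once."""
    return items[offset:offset + size]


def capture_all(items: list[str], window: int, step: int, max_scrolls: int) -> list[str]:
    """Scroll in fixed steps, merge each window, and stop when nothing new appears."""
    seen: set[str] = set()
    captured: list[str] = []
    for scroll in range(max_scrolls):
        found_new = False
        for item in visible_window(items, scroll * step, window):
            if item not in seen:       # content preservation: keep it before it is recycled
                seen.add(item)
                captured.append(item)  # first-seen order is retained
                found_new = True
        if not found_new:              # intelligent stopping: this scroll revealed nothing new
            break
    return captured


feed = [f"post-{i}" for i in range(25)]
first_view = visible_window(feed, 0, 10)
everything = capture_all(feed, window=10, step=5, max_scrolls=20)
print(len(first_view), len(everything))
```

Scrolling by less than the window size makes consecutive windows overlap, which is why the merge step needs de-duplication; the overlap is what guards against skipping items between scrolls.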

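Expected Real-World Impact:
- Social media analysis: capture entire Twitter threads with hundreds of replies, not just the top 10
- E-commerce scraping: extract 500+ products from infinite-scroll catalogs, versus 20-50 with traditional methods
- News aggregation: get every article from modern news sites, not just the above-the-fold content
- Research applications: extract complete data from academic databases that use virtual pagination

🔗 Link Preview: Intelligent Link Analysis and Scoring

The Problem: You crawl a page and get 200 links. Which ones matter? Which ones lead to the content you actually want? Traditional crawlers force you to follow everything or build complex filters.

My Solution: I implemented a three-layer scoring system that analyzes links the way a human would, considering their position, context, and relevance to your goals.

Intelligent Link Analysis and Scoring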

import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode, AsyncWebCrawler
from crawl4ai.adaptive_crawler import LinkPreviewConfig

async def main():
    # Configure intelligent link analysis
    link_config = LinkPreviewConfig(
        include_internal=True,
        include_external=False,
        max_links=10,
        concurrency=5,
        query="python tutorial",  # For contextual scoring
        score_threshold=0.3,
        verbose=True
    )
    # Use in your crawl
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://www.geeksforgeeks.org/",
            config=CrawlerRunConfig(
                link_preview_config=link_config,
                score_links=True,  # Enable intrinsic scoring
                cache_mode=CacheMode.BYPASS
            )
        )

        # Access scored and sorted links
        if result.success and result.links:
            for link in result.links.get("internal", []):
                text = link.get('text', 'No text')[:40]
                # Missing (None) scores are shown as zero
                intrinsic = link.get('intrinsic_score') or 0
                contextual = link.get('contextual_score') or 0
                total = link.get('total_score') or 0
                print(text, f"{intrinsic:.1f}/10", f"{contextual:.2f}/1", f"{total:.3f}")

asyncio.run(main())

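Scoring Components:

Intrinsic score: based on link quality indicators
  - Position on the page (navigation, content, footer)
  - Link attributes (rel, title, class names)
  - Anchor text quality and length
  - URL structure and depth
Contextual score: relevance to your query, computed with the BM25 algorithm
  - Keyword matching in link text and title
  - Meta description analysis
  - Content preview scoring
Total score: a combined score that determines the final ranking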
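
For readers unfamiliar with BM25, the sketch below scores a few link texts against a query with a textbook Okapi BM25 and then blends the result with an intrinsic score. Everything here is illustrative: bm25_scores is a hypothetical helper, the sample links and their intrinsic values are made up, and the 50/50 weighting of the total is an assumption, since the post does not say how Crawl4AI combines the two scores.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up links; intrinsic scores are on a 0-10 scale
links = [
    {"text": "Python tutorial for beginners", "intrinsic": 8.0},
    {"text": "Privacy policy", "intrinsic": 3.0},
    {"text": "Advanced Python decorators", "intrinsic": 7.0},
]
raw = bm25_scores("python tutorial", [link["text"] for link in links])
top = max(raw) or 1.0  # avoid dividing by zero when nothing matches
for link, score in zip(links, raw):
    link["contextual"] = score / top  # normalise to the 0-1 range
    # Assumed weighting: equal parts intrinsic (rescaled to 0-1) and contextual
    link["total"] = 0.5 * link["intrinsic"] / 10 + 0.5 * link["contextual"]

ranked = sorted(links, key=lambda link: link["total"], reverse=True)
for link in ranked:
    print(f"{link['total']:.3f}  {link['text']}")
```

The link that matches both query terms ranks first, while the privacy policy, which matches neither term and has a low intrinsic score, falls to the bottom.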

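Expected Real-World Impact:
- Research efficiency: find relevant papers 10x faster by following only high-scoring links
- Competitive analysis: automatically identify the important pages on competitor sites
- Content discovery: build topic-focused crawlers that stay on track
- SEO audits: identify and prioritize high-value internal linking opportunities

🎣 Async URL Seeder: Automated URL Discovery at Scale

The Problem: You want to crawl an entire domain but only have the homepage. Or worse, you want specific content types across thousands of pages. Finding URLs by hand? That's a job for machines, not humans.

My Solution: I built Async URL Seeder, a turbocharged URL discovery engine that combines multiple sources with intelligent filtering and relevance scoring.

Technical Architecture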

import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def main():
    async with AsyncUrlSeeder() as seeder:
        # Discover Python tutorial URLs
        config = SeedingConfig(
            source="sitemap",  # Use sitemap
            pattern="*python*",  # URL pattern filter
            extract_head=True,  # Get metadata
            query="python tutorial",  # For relevance scoring
            scoring_method="bm25",
            score_threshold=0.2,
            max_urls=10
        )

        print("Discovering Python async tutorial URLs...")
        urls = await seeder.urls("https://www.geeksforgeeks.org/", config)

        print(f"\n✅ Found {len(urls)} relevant URLs:")
        for i, url_info in enumerate(urls[:5], 1):
            print(f"\n{i}. {url_info['url']}")
            if url_info.get('relevance_score'):
                print(f"   Relevance: {url_info['relevance_score']:.3f}")
            if url_info.get('head_data', {}).get('title'):
                print(f"   Title: {url_info['head_data']['title'][:60]}...")

asyncio.run(main())

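Discovery Methods:
- Sitemap mining: parses robots.txt and all linked sitemaps
- Common Crawl: queries the Common Crawl index for historical URLs
- Intelligent crawling: follows links with smart depth control
- Pattern analysis: learns URL structures and generates variations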
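
Sitemap mining is the easiest of these methods to picture. The following sketch shows the basic idea using only the standard library and inline sample data: read the Sitemap lines from a robots.txt body, parse a sitemap's loc entries, and filter them with a glob pattern like the pattern="*python*" option shown earlier. The helper names and the example.com data are hypothetical, and this is not how AsyncUrlSeeder is implemented internally.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect the URLs declared on `Sitemap:` lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")  # split at the first colon only
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def urls_from_sitemap(sitemap_xml: str, pattern: str = "*") -> list[str]:
    """Return the <loc> entries of a sitemap that match a glob pattern."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", SITEMAP_NS) if el.text]
    return [url for url in locs if fnmatch.fnmatchcase(url, pattern)]


# Inline sample data so the sketch runs without network access
ROBOTS = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/python-basics</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/tutorials/python-async</loc></url>
</urlset>"""

print(sitemaps_from_robots(ROBOTS))
print(urls_from_sitemap(SITEMAP, "*python*"))
```

A real seeder would also need to fetch these files, follow sitemap index files, and handle compressed sitemaps, all of which are left out here.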

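Expected Real-World Impact:
- Migration projects: discover 10,000+ URLs from a legacy site in under 60 seconds
- Market research: automatically map entire competitor ecosystems
- Academic research: build comprehensive datasets without collecting URLs by hand
- SEO audits: find every indexable page, with content scoring
- Content archival: make sure nothing is left behind during a site migration

⚡ Performance Optimizations

This release delivers significant performance gains through optimized resource handling, better concurrency management, and a reduced memory footprint.

What We Optimized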

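# Optimized crawling with v0.7.0 improvements
# (fragment: runs inside an async function, with `crawler` an open AsyncWebCrawler
# and `urls` a list of URLs to fetch)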
results = []
for url in urls:
    result = await crawler.arun(
        url,
        config=CrawlerRunConfig(
            # Performance optimizations
            wait_until="domcontentloaded",  # Faster than networkidle
            cache_mode=CacheMode.ENABLED    # Enable caching
        )
    )
    results.append(result)

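Performance Gains:
- Startup time: 70% faster browser initialization
- Page loading: 40% reduction with smart resource blocking
- Extraction: 3x faster with compiled CSS selectors
- Memory usage: 60% reduction with streaming processing
- Concurrent crawls: handles 5x more parallel requests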
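
The loop shown earlier fetches URLs one at a time. To benefit from parallel requests, a common pattern is to run the calls concurrently while capping how many are in flight. The sketch below illustrates that pattern with plain asyncio; fetch is a hypothetical stand-in for a crawl call such as crawler.arun(url), and the limit of 5 is an arbitrary example value, not a recommendation from this release.

```python
import asyncio


async def fetch(url: str) -> str:
    """Hypothetical stand-in for a crawl call such as crawler.arun(url)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"content of {url}"


async def crawl_all(urls: list[str], max_concurrent: int = 5) -> list[str]:
    """Fetch every URL concurrently, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with semaphore:  # wait here if the limit is already reached
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(url) for url in urls))


pages = asyncio.run(crawl_all([f"https://example.com/page/{i}" for i in range(20)]))
print(len(pages))
```

The semaphore keeps memory and connection usage predictable: raising max_concurrent trades resource usage for throughput.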

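🔧 Important Changes

Breaking Changes

Renamed to link_preview (better reflects the functionality)
Minimum Python version is now 3.9
CrawlerConfig split into CrawlerRunConfig and BrowserConfig

Migration Guide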

# Old (v0.6.x)
from crawl4ai import CrawlerConfig
config = CrawlerConfig(timeout=30000)

# New (v0.7.0)
from crawl4ai import CrawlerRunConfig, BrowserConfig, CacheMode
browser_config = BrowserConfig(timeout=30000)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

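🤖 Coming Soon: Intelligent Web Automation

I'm currently working on bringing advanced automation capabilities to Crawl4AI. These include:

Crawl Agents: autonomous crawlers that understand your goals and adapt their strategies
Auto JS Generation: automatic JavaScript code generation for complex interactions
Smart Form Handling: intelligent form detection and filling
Context-Aware Actions: crawlers that understand page context and make decisions

These features are under active development and will change how we approach web automation. Stay tuned!

🚀 Get Started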

pip install crawl4ai==0.7.0

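Check out the updated documentation.

Questions? I'm always happy to hear from you:
- GitHub: github.com/unclecode/crawl4ai
- Discord: discord.gg/crawl4ai
- Twitter: @unclecode

Happy crawling! 🕷️

P.S. If you're using Crawl4AI in production, I'd love to hear about it. Your use cases inspire the next features.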