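Crawl4AI v0.5.0 Release Notes

Release Theme: Power, Flexibility, and Scalability

Crawl4AI v0.5.0 is a major release focused on significantly improving the library's power, flexibility, and scalability. Key improvements include a new deep crawling system, a memory-adaptive dispatcher for handling large-scale crawls, multiple crawling strategies (including a fast HTTP-only crawler), Docker deployment options, and a powerful command-line interface (CLI). This release also includes numerous bug fixes, performance optimizations, and documentation updates.

Important: This release contains several breaking changes. Please read the "Breaking Change" notes and the summary at the end carefully, and update your code accordingly.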

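Key Features

1. Deep Crawling

Crawl4AI now supports deep crawling, allowing you to explore websites beyond the initial URLs. This is controlled by the deep_crawl_strategy parameter in CrawlerRunConfig. Several strategies are available:

BFSDeepCrawlStrategy (Breadth-First Search): explores the website level by level. (Default)
Depth-First Search: explores each branch as deeply as possible before backtracking.
BestFirstCrawlingStrategy: uses a scoring function to prioritize which URLs to crawl next.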

import time
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import DomainFilter, ContentTypeFilter, FilterChain, URLPatternFilter, KeywordRelevanceScorer, BestFirstCrawlingStrategy
import asyncio

# Create a filter chain to filter urls based on patterns, domains and content type
filter_chain = FilterChain(
    [
        DomainFilter(
            allowed_domains=["docs.crawl4ai.com"],
            blocked_domains=["old.docs.crawl4ai.com"],
        ),
        URLPatternFilter(patterns=["*core*", "*advanced*"],),
        ContentTypeFilter(allowed_types=["text/html"]),
    ]
)

# Create a keyword scorer that prioritises the pages with certain keywords first
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"], weight=0.7
)

# Set up the configuration
deep_crawl_config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        include_external=False,
        filter_chain=filter_chain,
        url_scorer=keyword_scorer,
    ),
    scraping_strategy=LXMLWebScrapingStrategy(),
    stream=True,
    verbose=True,
)

async def main():
    async with AsyncWebCrawler() as crawler:
        start_time = time.perf_counter()
        results = []
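        # The start URL must fall inside allowed_domains above, otherwise the DomainFilter rejects every discovered link
        async for result in await crawler.arun(url="https://docs.crawl4ai.com", config=deep_crawl_config):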
            print(f"Crawled: {result.url} (Depth: {result.metadata['depth']}), score: {result.metadata['score']:.2f}")
            results.append(result)
        duration = time.perf_counter() - start_time
        print(f"\n✅ Crawled {len(results)} high-value pages in {duration:.2f} seconds")

asyncio.run(main())

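Breaking Change: The max_depth parameter is now part of CrawlerRunConfig and controls the depth of the crawl, not the number of concurrent crawls. The arun() and arun_many() methods are now decorated to handle deep crawling strategies. Import paths for deep crawling strategies have changed. For more details, see the Deep Crawling documentation.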
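To make the difference between the three traversal orders concrete, here is a minimal, self-contained sketch in plain Python. It does not use Crawl4AI's strategy classes and is not how the library implements them. LINKS, the three *_order functions, and keyword_score are hypothetical names invented for this illustration, and the small link graph stands in for a real site.

```python
import heapq
from collections import deque

# Hypothetical in-memory link graph standing in for a website.
LINKS = {
    "/": ["/blog", "/core"],
    "/blog": ["/blog/news"],
    "/core": ["/core/async", "/core/config"],
    "/blog/news": [],
    "/core/async": [],
    "/core/config": [],
}

def bfs_order(start, max_depth):
    """Breadth-first: visit pages level by level, up to max_depth links from start."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for nxt in LINKS[url]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

def dfs_order(start, max_depth):
    """Depth-first: follow each branch as far as allowed before backtracking."""
    seen, order = set(), []

    def visit(url, depth):
        seen.add(url)
        order.append(url)
        if depth == max_depth:
            return
        for nxt in LINKS[url]:
            if nxt not in seen:
                visit(nxt, depth + 1)

    visit(start, 0)
    return order

def best_first_order(start, max_depth, score):
    """Best-first: always visit the highest-scoring discovered URL next."""
    seen, order, counter = {start}, [], 0
    # heapq is a min-heap, so scores are negated; counter breaks ties by discovery order.
    heap = [(-score(start), counter, start, 0)]
    while heap:
        _, _, url, depth = heapq.heappop(heap)
        order.append(url)
        if depth == max_depth:
            continue
        for nxt in LINKS[url]:
            if nxt not in seen:
                seen.add(nxt)
                counter += 1
                heapq.heappush(heap, (-score(nxt), counter, nxt, depth + 1))
    return order

def keyword_score(url):
    """Toy scorer: count how many keywords appear in the URL."""
    return sum(keyword in url for keyword in ("core", "async"))

if __name__ == "__main__":
    print("BFS:       ", bfs_order("/", 2))
    print("DFS:       ", dfs_order("/", 2))
    print("Best-first:", best_first_order("/", 2, keyword_score))
```

On this toy graph, breadth-first visits /blog and /core before any of their children, depth-first finishes the /blog branch before touching /core, and best-first goes to /core first because its URL matches a keyword.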

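2. Memory-Adaptive Dispatcher

The new MemoryAdaptiveDispatcher dynamically adjusts concurrency based on available system memory and includes built-in rate limiting. This prevents out-of-memory errors and avoids overwhelming target websites.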

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MemoryAdaptiveDispatcher
import asyncio

# Configure the dispatcher (optional, defaults are used if not provided)
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=80.0,  # Pause if memory usage exceeds 80%
    check_interval=0.5,  # Check memory every 0.5 seconds
)

async def batch_mode():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://crawl4ai-docs.iloveaiwork.com", "https://github.com/unclecode/crawl4ai"],
            config=CrawlerRunConfig(stream=False),  # Batch mode
            dispatcher=dispatcher,
        )
        for result in results:
            print(f"Crawled: {result.url} with status code: {result.status_code}")

async def stream_mode():
    async with AsyncWebCrawler() as crawler:
        # OR, for streaming:
        async for result in await crawler.arun_many(
            urls=["https://crawl4ai-docs.iloveaiwork.com", "https://github.com/unclecode/crawl4ai"],
            config=CrawlerRunConfig(stream=True),
            dispatcher=dispatcher,
        ):
            print(f"Crawled: {result.url} with status code: {result.status_code}")

print("Dispatcher in batch mode:")
asyncio.run(batch_mode())
print("-" * 50)
print("Dispatcher in stream mode:")
asyncio.run(stream_mode())

Breaking Change: AsyncWebCrawler.arun_many() now uses MemoryAdaptiveDispatcher by default. Existing code that relied on unbounded concurrency may need adjustment.
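For intuition about what "adjusting concurrency based on memory" can mean, the following is a rough sketch of a memory gate built only on asyncio. It is not the MemoryAdaptiveDispatcher implementation. run_memory_gated, read_memory_percent, and max_concurrent are hypothetical names, and the memory reading is injected as a plain callable so the example runs without extra dependencies (in practice something like psutil.virtual_memory().percent could supply it). The memory_threshold_percent and check_interval names simply mirror the dispatcher options shown above.

```python
import asyncio

async def run_memory_gated(jobs, read_memory_percent, memory_threshold_percent=80.0,
                           check_interval=0.01, max_concurrent=2):
    """Run zero-argument coroutine functions, delaying each start while memory is high."""
    semaphore = asyncio.Semaphore(max_concurrent)  # hard cap on parallel jobs

    async def run_one(job):
        async with semaphore:
            # Poll the injected memory reading until it drops below the threshold.
            while read_memory_percent() >= memory_threshold_percent:
                await asyncio.sleep(check_interval)
            return await job()

    # gather preserves the order of the input jobs in its result list.
    return await asyncio.gather(*(run_one(job) for job in jobs))

if __name__ == "__main__":
    readings = [95.0, 90.0]  # simulated high-memory readings, then normal

    def fake_memory():
        return readings.pop(0) if readings else 40.0

    async def fetch(n):
        return n * 2

    jobs = [lambda n=n: fetch(n) for n in range(3)]
    print(asyncio.run(run_memory_gated(jobs, fake_memory)))
```

The design point is that the gate sits in front of each job start rather than interrupting running jobs, so code that assumed every URL starts immediately sees starts staggered under memory pressure.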

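3. Multiple Crawling Strategies (Playwright and HTTP)

Crawl4AI now offers two crawling strategies:

Playwright-based strategy (default): uses Playwright for browser-based crawling, with support for JavaScript rendering and complex interactions.
AsyncHTTPCrawlerStrategy: a lightweight, fast, and memory-efficient HTTP-only crawler. Well suited to simple scraping tasks that do not need browser rendering.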

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
import asyncio

# Use the HTTP crawler strategy
http_crawler_config = HTTPCrawlerConfig(
    method="GET",
    headers={"User-Agent": "MyCustomBot/1.0"},
    follow_redirects=True,
    verify_ssl=True,
)

async def main():
    # Pass the HTTP config to the strategy, then hand the strategy to the crawler
    strategy = AsyncHTTPCrawlerStrategy(browser_config=http_crawler_config)
    async with AsyncWebCrawler(crawler_strategy=strategy) as crawler:
        result = await crawler.arun("https://example.com")
        print(f"Status code: {result.status_code}")
        print(f"Content length: {len(result.html)}")

asyncio.run(main())

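4. Docker Deployment

Crawl4AI can now be easily deployed as a Docker container, providing a consistent and isolated environment. The Docker image includes a FastAPI server with both streaming and non-streaming endpoints.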

# Build the image (from the project root)
docker build -t crawl4ai .

# Run the container
docker run -d -p 8000:8000 --name crawl4ai crawl4ai

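API Endpoints:

A POST endpoint for non-streaming crawls.
A POST endpoint for streaming crawls (NDJSON).
A GET endpoint for health checks.
A GET endpoint that returns the configuration schema.
A GET endpoint that returns the Markdown content of a URL.
A GET endpoint that returns LLM-extracted content.
A POST endpoint that issues a JWT token.

Breaking Changes:

Docker deployment now requires a .llm.env file for API keys.
Docker deployment now requires Redis and a new config.yml structure.
Server startup now uses supervisord instead of direct process management.
The Docker server now requires authentication (JWT tokens) by default.

For detailed instructions, see the Docker deployment documentation.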

5. Command-Line Interface (CLI)

The new CLI (crwl) provides convenient access to Crawl4AI's functionality from the terminal.

# Basic crawl
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# Use a configuration file
crwl https://example.com -B browser.yml -C crawler.yml

# Use LLM-based extraction
crwl https://example.com -e extract.yml -s schema.json

# Ask a question about the crawled content
crwl https://example.com -q "What is the main topic?"

# See usage examples
crwl --example

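For more details, see the CLI documentation.

6. LXML Scraping Mode

Added LXMLWebScrapingStrategy for faster HTML parsing using the lxml library. This can significantly improve scraping performance, especially for large or complex pages. To enable it, set scraping_strategy=LXMLWebScrapingStrategy() in your CrawlerRunConfig.

Breaking Change: The ScrapingMode enum has been replaced by the strategy pattern. Use WebScrapingStrategy (default) or LXMLWebScrapingStrategy.

7. Proxy Rotation

Added the ProxyRotationStrategy abstract base class with a concrete RoundRobinProxyStrategy implementation.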

import asyncio
import re

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    ProxyConfig,
    RoundRobinProxyStrategy,
)

async def main():
    # Load proxies from the PROXIES environment variable and create the rotation strategy
    # e.g. export PROXIES="ip1:port1:username1:password1,ip2:port2:username2:password2"
    proxies = ProxyConfig.from_env()
    if not proxies:
        print("No proxies found in environment. Set PROXIES env variable!")
        return

    proxy_strategy = RoundRobinProxyStrategy(proxies)

    # Create configs
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_rotation_strategy=proxy_strategy
    )

    print("\n📈 Initializing crawler with proxy rotation...")
    # A single crawler instance is enough; the original opened a second, nested one
    async with AsyncWebCrawler(config=browser_config) as crawler:
        urls = ["https://httpbin.org/ip"] * (len(proxies) * 2)  # Test each proxy twice

        print("\n🚀 Starting batch crawl with proxy rotation...")
        results = await crawler.arun_many(urls=urls, config=run_config)
        for result in results:
            if result.success:
                # Pull the first IPv4-looking string out of the response body
                ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
                current_proxy = run_config.proxy_config if run_config.proxy_config else None

                if current_proxy and ip_match:
                    print(f"URL {result.url}")
                    print(f"Proxy {current_proxy.server} -> Response IP: {ip_match.group(0)}")
                    verified = ip_match.group(0) == current_proxy.ip
                    if verified:
                        print(f"✅ Proxy working! IP matches: {current_proxy.ip}")
                    else:
                        print("❌ Proxy failed or IP mismatch!")
                print("---")

asyncio.run(main())
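The round-robin idea itself is simple enough to sketch without the library. The snippet below is an illustration only, not the RoundRobinProxyStrategy source: RoundRobinRotation, next_proxy, and parse_proxy are hypothetical names, the "ip:port:username:password" entry format follows the PROXIES comment in the example above, and the http:// scheme on the server field is an assumption.

```python
from itertools import cycle

def parse_proxy(entry):
    """Split one 'ip:port:username:password' entry into a dict."""
    ip, port, username, password = entry.split(":")
    return {
        "server": f"http://{ip}:{port}",  # scheme is an assumption for this sketch
        "username": username,
        "password": password,
        "ip": ip,
    }

class RoundRobinRotation:
    """Hypothetical stand-in: hands out proxies in a fixed, repeating order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(list(proxies))

    def next_proxy(self):
        return next(self._pool)

if __name__ == "__main__":
    # Addresses from the 203.0.113.0/24 documentation range, not real proxies.
    raw = "203.0.113.1:8080:alice:s3cret,203.0.113.2:8080:bob:hunter2"
    rotation = RoundRobinRotation([parse_proxy(e) for e in raw.split(",")])
    for _ in range(4):
        print(rotation.next_proxy()["server"])
```

Each request takes the next proxy in the list and wraps around at the end, so with two proxies and four URLs each proxy is used twice, which is what the example above relies on when it tests every proxy twice.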

Other Changes and Improvements

Added: LLMContentFilter for intelligent Markdown generation. This new filter uses an LLM to produce more focused and relevant Markdown output.

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai import LLMConfig
import asyncio

llm_config = LLMConfig(provider="gemini/gemini-1.5-pro", api_token="env:GEMINI_API_KEY")

markdown_generator = DefaultMarkdownGenerator(
    content_filter=LLMContentFilter(llm_config=llm_config, instruction="Extract key concepts and summaries")
)

config = CrawlerRunConfig(markdown_generator=markdown_generator)
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://crawl4ai-docs.iloveaiwork.com", config=config)
        print(result.markdown.fit_markdown)

asyncio.run(main())

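Added: URL redirection tracking. The crawler now automatically follows HTTP redirects (301, 302, 307, 308) and records the final URL in the redirected_url field of the CrawlResult object. No code changes are required to enable this; it is automatic.
Added: LLM-powered schema generation utility. A generate_schema method has been added to JsonCssExtractionStrategy and JsonXPathExtractionStrategy. This greatly simplifies creating extraction schemas.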

from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig

llm_config = LLMConfig(provider="gemini/gemini-1.5-pro", api_token="env:GEMINI_API_KEY")

schema = JsonCssExtractionStrategy.generate_schema(
    html="<div class='product'><h2>Product Name</h2><span class='price'>$99</span></div>",
    llm_config=llm_config,
    query="Extract product name and price"
)
print(schema)

Expected output (may vary slightly depending on the LLM):

{
  "name": "ProductExtractor",
  "baseSelector": "div.product",
  "fields": [
      {"name": "name", "selector": "h2", "type": "text"},
      {"name": "price", "selector": ".price", "type": "text"}
    ]
 }

Added: robots.txt compliance support. The crawler can now respect robots.txt rules. Enable this by setting check_robots_txt=True in CrawlerRunConfig.

config = CrawlerRunConfig(check_robots_txt=True)
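With check_robots_txt=True the crawler performs this check for you. For intuition about what a robots.txt check involves, the sketch below uses Python's standard urllib.robotparser on an inline rule set. It is not Crawl4AI's implementation; ROBOTS_TXT, build_parser, and is_allowed are hypothetical names for this illustration, and the rules are made up.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real crawler would fetch /robots.txt from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def build_parser(robots_text):
    """Parse robots.txt content that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def is_allowed(parser, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for url in ("https://example.com/docs/", "https://example.com/private/data"):
        verdict = "allowed" if is_allowed(parser, "MyCustomBot/1.0", url) else "blocked"
        print(f"{url}: {verdict}")
```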

Added: PDF processing capabilities. Crawl4AI can now extract text, images, and metadata from PDF files (both local and remote). This uses the new PDFCrawlerStrategy and PDFContentScrapingStrategy.

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy
import asyncio

async def main():
    async with AsyncWebCrawler(crawler_strategy=PDFCrawlerStrategy()) as crawler:
        result = await crawler.arun(
            "https://arxiv.org/pdf/2310.06825.pdf",
            config=CrawlerRunConfig(
                scraping_strategy=PDFContentScrapingStrategy()
            )
        )
        print(result.markdown)  # Access extracted text
        print(result.metadata)  # Access PDF metadata (title, author, etc.)

asyncio.run(main())

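Added: Support for frozenset serialization. Configuration serialization has been improved, especially for sets of allowed and blocked domains. No code changes are required.
Added: New LLMConfig parameter. This new object can be passed to extraction, filtering, and schema generation tasks. It simplifies passing provider strings, API tokens, and base URLs to every part of the library that needs LLM configuration. It also enables reuse and makes it quick to experiment with different LLM configurations.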

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, LLMExtractionStrategy

# Example of using LLMConfig with LLMExtractionStrategy
llm_config = LLMConfig(provider="openai/gpt-4o", api_token="YOUR_API_KEY")
strategy = LLMExtractionStrategy(llm_config=llm_config, schema=...)

# Example usage within a crawler ("async with" must run inside a coroutine)
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(extraction_strategy=strategy)
        )
        print(result.extracted_content)

asyncio.run(main())

Breaking Change: Removed old parameters like provider, api_token, base_url, and api_base from LLMExtractionStrategy and LLMContentFilter. Users should migrate to using the LLMConfig object.

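Changed: Improved browser context management and added shared data support. Browser context management is now more efficient, reducing resource usage. A new shared_data dictionary is available in BrowserContext for passing data between different stages of the crawling process. Breaking Change: The BrowserContext API has changed, and the old get_context method is deprecated.
Changed: Renamed final_url to redirected_url in CrawledURL. This improves consistency and clarity. Update any code that references the old field name.
Changed: Improved type hints and removed unused files. This is an internal improvement; no code changes are required.
Changed: Reorganized deep crawling functionality into a dedicated module. (Breaking Change: import paths for DeepCrawlStrategy and related classes have changed.) This improves code organization. Update imports to use the new crawl4ai.deep_crawling module.
Changed: Improved HTML handling and cleaned up the codebase. (Breaking Change: the ssl_certificate.json file has been removed.) This removes an unused file. If you relied on this file for custom certificate validation, you will need to implement an alternative approach.
Changed: Enhanced serialization and configuration handling. (Breaking Change: FastFilterChain has been replaced with FilterChain.) This change simplifies configuration and improves serialization.
Added: Modified the Apache 2.0 license to include a required attribution clause. See the LICENSE file for details. All users must now clearly attribute the Crawl4AI project when using, distributing, or creating derivative works.
Fixed: Prevented memory leaks by ensuring that Playwright pages are closed properly. No code changes are required.
Fixed: Made model fields optional with default values. (Breaking Change: code that relies on all fields being present may need adjustment.) Fields in data models (such as CrawledURL) are now optional, with default values (usually None). Update code to handle potential None values.
Fixed: Adjusted the memory threshold and fixed dispatcher initialization. This is an internal bug fix; no code changes are required.
Fixed: Ensured a proper exit after running the doctor command. No code changes are required.
Fixed: JsonCss selector and crawler improvements.
Fixed: Long-page screenshots not working (#403).
Docs: Updated documentation URLs to the new domain.
Docs: Added a SERP API project example.
Docs: Added clarifying comments on CSS selector behavior.
Docs: Added a Code of Conduct for the project (#410).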

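Breaking Changes Summary

Dispatcher: MemoryAdaptiveDispatcher is now the default for arun_many(), which changes concurrency behavior. The return type of arun_many() depends on the stream parameter.
Deep Crawling: max_depth is now part of CrawlerRunConfig and controls crawl depth. Import paths for deep crawling strategies have changed.
Browser Context: The BrowserContext API has been updated.
Models: Many fields in data models are now optional, with default values.
Scraping Mode: The ScrapingMode enum has been replaced by the strategy pattern (WebScrapingStrategy, LXMLWebScrapingStrategy).
Content Filter: The content_filter parameter has been removed from CrawlerRunConfig. Use extraction strategies or a Markdown generator with filters instead.
Removed: The synchronous WebCrawler, the old CLI, and docs management functionality.
Docker: Significant changes to Docker deployment, including new requirements and configuration.
File Removed: The ssl_certificate.json file has been removed, which may affect existing certificate validation.
Renamed: final_url to redirected_url, for consistency.
Configuration: FastFilterChain has been replaced with FilterChain.
Deep Crawling: DeepCrawlStrategy.arun now returns Union[CrawlResultT, List[CrawlResultT], AsyncGenerator[CrawlResultT, None]].
Proxy: Removed synchronous WebCrawler support and the related rate-limiting configuration.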

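Migration Guide

Update Imports: Adjust imports for DeepCrawlStrategy, BreadthFirstSearchStrategy, and related classes to match the new deep_crawling module structure.
CrawlerRunConfig: Move max_depth to CrawlerRunConfig. If you use content_filter, migrate to an extraction strategy or a Markdown generator with filters.
Dispatcher: Adapt code to the new MemoryAdaptiveDispatcher behavior and return types.
BrowserContext: Update code that uses the BrowserContext API.
Models: Handle potential None values for optional fields in data models.
Scraping: Replace the ScrapingMode enum with WebScrapingStrategy or LXMLWebScrapingStrategy.
Docker: Review the updated Docker documentation and adjust your deployment accordingly.
CLI: Migrate to the new crwl command and update any scripts that use the old CLI.
Proxy: Synchronous WebCrawler support and the related rate-limiting configuration have been removed.
Configuration: Replace FastFilterChain with FilterChain.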
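Two of the items above (the final_url to redirected_url rename and the newly optional model fields) can be absorbed with small defensive helpers while a codebase is being migrated. The sketch below is one possible approach, not an official shim: redirected_url_of and status_label are hypothetical helper names, and SimpleNamespace objects stand in for real result objects so the snippet runs without the library.

```python
from types import SimpleNamespace

def redirected_url_of(result):
    """Return the post-redirect URL, tolerating the pre-0.5.0 field name."""
    url = getattr(result, "redirected_url", None)
    if url is None:
        url = getattr(result, "final_url", None)  # old field name
    # Fall back to the requested URL if neither field is populated.
    return url if url is not None else getattr(result, "url", None)

def status_label(result):
    """Format status_code, which may now be None."""
    code = getattr(result, "status_code", None)
    return "unknown" if code is None else str(code)

if __name__ == "__main__":
    # Stand-ins for result objects; real code would receive these from the crawler.
    new_style = SimpleNamespace(url="http://example.com", redirected_url="https://example.com/", status_code=200)
    old_style = SimpleNamespace(url="http://example.com", final_url="https://example.com/", status_code=None)
    for result in (new_style, old_style):
        print(redirected_url_of(result), status_label(result))
```

Once all call sites read redirected_url directly, the final_url fallback can be deleted.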