Crawl4AI Blog

Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical insights, and updates about the project. Whether you're looking for the latest improvements or want to dive deep into web crawling techniques, this is the place.

Featured Articles

When to Stop Crawling: The Art of Knowing "Enough"

January 29, 2025

Traditional crawlers are like tourists with unlimited time—they'll visit every street, every alley, every dead end. But what if your crawler could think like a researcher with a deadline? Discover how Adaptive Crawling revolutionizes web scraping by knowing when to stop. Learn about the three-layer intelligence system that evaluates coverage, consistency, and saturation to build focused knowledge bases instead of endless page collections.

Read the full article →
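The "knowing when to stop" idea can be sketched as a simple saturation check: track how many previously unseen terms each new page contributes, and stop once that rate falls below a threshold. This is a conceptual illustration only, not Crawl4AI's actual algorithm; the function name and threshold are made up for the example.

```python
# Conceptual sketch of a saturation-based stopping rule, loosely inspired by
# the coverage/saturation idea in the article. NOT Crawl4AI's real
# implementation -- names and thresholds here are illustrative only.

def should_stop(pages: list[str], min_new_ratio: float = 0.1) -> bool:
    """Stop when the most recent page adds too few previously unseen terms."""
    if len(pages) < 2:
        return False
    seen: set[str] = set()
    for page in pages[:-1]:
        seen.update(page.lower().split())
    last_terms = set(pages[-1].lower().split())
    if not last_terms:
        return True  # an empty page adds nothing new
    new_terms = last_terms - seen
    return len(new_terms) / len(last_terms) < min_new_ratio

pages = [
    "crawl4ai extracts markdown from web pages",
    "crawl4ai supports deep crawling strategies",
    "crawl4ai extracts markdown from web pages",  # pure repetition: saturated
]
print(should_stop(pages[:2]))  # → False (plenty of new terms, keep crawling)
print(should_stop(pages))      # → True (last page adds nothing new, stop)
```

A real system would also weigh coverage of the user's query and consistency across pages, as the article describes; this sketch only captures the saturation layer.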

The LLM Context Protocol: Why Your AI Assistant Needs Memory, Reasoning, and Examples

January 24, 2025

Ever wondered why your AI coding assistant struggles with your library despite comprehensive documentation? This article introduces the three-dimensional context protocol that transforms how AI understands code. Learn why memory, reasoning, and examples together create wisdom—not just information.

Read the full article →

Latest Release

Crawl4AI v0.7.3 – The Multi-Config Intelligence Update

August 6, 2025

Crawl4AI v0.7.3 brings smarter URL-specific configurations, flexible Docker deployments, and critical stability improvements. Configure different crawling strategies for different URL patterns in a single batch—perfect for mixed content sites with docs, blogs, and APIs.

Key highlights:

  • Multi-URL Configurations: Different strategies for different URL patterns in one crawl

  • Flexible Docker LLM Providers: Configure providers via environment variables

  • Bug Fixes: Critical stability improvements for production deployments

  • Documentation Updates: Clearer examples and improved API documentation

Read full release notes →
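The multi-config idea boils down to dispatching each URL to the first configuration whose pattern matches it. The sketch below illustrates that dispatch with glob patterns and plain dicts; these are hypothetical stand-ins, not the library's real CrawlerRunConfig API, so check the release notes for the actual parameter names.

```python
# Illustration of per-URL-pattern configuration dispatch, the idea behind
# v0.7.3's multi-URL configs. Patterns and config dicts are hypothetical
# stand-ins for the real Crawl4AI API.
from fnmatch import fnmatch

CONFIGS = [
    ("*/docs/*", {"strategy": "markdown", "wait_for_js": False}),
    ("*/blog/*", {"strategy": "article", "wait_for_js": True}),
    ("*/api/*", {"strategy": "json", "wait_for_js": False}),
]
DEFAULT = {"strategy": "markdown", "wait_for_js": True}

def config_for(url: str) -> dict:
    """Return the first config whose pattern matches the URL, else a default."""
    for pattern, cfg in CONFIGS:
        if fnmatch(url, pattern):
            return cfg
    return DEFAULT

print(config_for("https://example.com/docs/intro"))  # docs config, no JS wait
print(config_for("https://example.com/pricing"))     # falls through to DEFAULT
```

First-match-wins ordering matters here: put the most specific patterns first, and keep a catch-all default so unmatched URLs still get crawled.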


Previous Releases

Crawl4AI v0.7.0 – The Adaptive Intelligence Update

January 28, 2025

Introduced groundbreaking intelligence features including Adaptive Crawling, Virtual Scroll support, intelligent Link Preview, and the Async URL Seeder for massive URL discovery.

Read release notes →

Crawl4AI v0.6.0 – World-Aware Crawling, Pre-Warmed Browsers, and the MCP API

December 23, 2024

Crawl4AI v0.6.0 brought major architectural upgrades including world-aware crawling (set geolocation, locale, and timezone), real-time traffic capture, and a memory-efficient crawler pool with pre-warmed pages.

The Docker server now exposes a full-featured MCP socket + SSE interface, supports streaming, and comes with a new Playground UI. Plus, table extraction is now native, and the new stress-test framework supports crawling 1,000+ URLs.

Other key changes:

  • Native support for result.media["tables"] to export DataFrames

  • Full network + console logs and MHTML snapshot per crawl

  • Browser pooling and pre-warming for faster cold starts

  • New streaming endpoints via MCP API and Playground

  • Robots.txt support, proxy rotation, and improved session handling

  • Deprecated old markdown names, legacy modules cleaned up

  • Massive repo cleanup: ~36K insertions, ~5K deletions across 121 files

Read full release notes →
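Consuming the extracted tables is a matter of zipping each row against the headers. The exact schema of result.media["tables"] may differ between versions; the sketch below assumes each table is a dict with "headers" and "rows" keys (an assumption, so verify against the docs for your version).

```python
# Turning an extracted table into records. The "headers"/"rows" schema below
# is an assumed shape for result.media["tables"] entries -- check the docs
# for the version you run.
table = {
    "headers": ["name", "stars"],
    "rows": [["crawl4ai", "40k"], ["playwright", "65k"]],
}

def table_to_records(table: dict) -> list[dict]:
    """Zip each row against the headers; pandas.DataFrame(records) works from here."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]

print(table_to_records(table))
# → [{'name': 'crawl4ai', 'stars': '40k'}, {'name': 'playwright', 'stars': '65k'}]
```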


Crawl4AI v0.5.0: Deep Crawling, Scalability, and a New CLI!

My dear friends and crawlers, here it is: the release of Crawl4AI v0.5.0! This release brings a wealth of new features, performance improvements, and a more streamlined developer experience. Here's a breakdown of what's new:

Major New Features:

  • Deep Crawling: Explore entire websites with configurable strategies (BFS, DFS, Best-First). Define custom filters and URL scoring for targeted crawls.

  • Memory-Adaptive Dispatcher: Handle large-scale crawls with ease! Our new dispatcher dynamically adjusts concurrency based on available memory and includes built-in rate limiting.

  • Multiple Crawler Strategies: Choose between the full-featured Playwright browser-based crawler or a new, much faster HTTP-only crawler for simpler tasks.

  • Docker Deployment: Deploy Crawl4AI as a scalable, self-contained service with built-in API endpoints and optional JWT authentication.

  • Command-Line Interface (CLI): Interact with Crawl4AI directly from your terminal. Crawl, configure, and extract data with simple commands.

  • LLM Configuration (LLMConfig): A new, unified way to configure LLM providers (OpenAI, Anthropic, Ollama, etc.) for extraction, filtering, and schema generation. Simplifies API key management and switching between models.
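The BFS deep-crawl strategy above can be pictured as a breadth-first frontier bounded by a depth limit and a URL filter. The sketch below is a minimal standalone illustration of that traversal; the link graph is a plain dict standing in for real page fetches, not Crawl4AI's actual strategy class.

```python
# Minimal BFS crawl frontier with max_depth and a URL filter -- a sketch of
# what a breadth-first deep-crawl strategy does. The dict of links stands in
# for real page fetches; this is not Crawl4AI's implementation.
from collections import deque

def bfs_crawl(start: str, links: dict[str, list[str]], max_depth: int,
              url_filter=lambda url: True) -> list[str]:
    """Visit pages breadth-first up to max_depth, skipping filtered-out URLs."""
    visited, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not expand this page's links
        for nxt in links.get(url, []):
            if nxt not in visited and url_filter(nxt):
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order

links = {"/": ["/docs", "/blog", "/login"], "/docs": ["/docs/api"]}
print(bfs_crawl("/", links, max_depth=2, url_filter=lambda u: u != "/login"))
# → ['/', '/docs', '/blog', '/docs/api']
```

DFS swaps the queue for a stack; Best-First swaps it for a priority queue keyed by a URL score, which is where custom scoring plugs in.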

Minor Updates & Improvements:

  • LXML Scraping Mode: Faster HTML parsing with LXMLWebScrapingStrategy.

  • Proxy Rotation: Added ProxyRotationStrategy with a RoundRobinProxyStrategy implementation.

  • PDF Processing: Extract text, images, and metadata from PDF files.

  • URL Redirection Tracking: Automatically follows and records redirects.

  • Robots.txt Compliance: Optionally respect website crawling rules.

  • LLM-Powered Schema Generation: Automatically create extraction schemas using an LLM.

  • LLMContentFilter: Generate high-quality, focused markdown using an LLM.

  • Improved Error Handling & Stability: Numerous bug fixes and performance enhancements.

  • Enhanced Documentation: Updated guides and examples.

Breaking Changes & Migration:

This release includes several breaking changes to improve the library's structure and consistency. Here's what you need to know:

  • arun_many() Behavior: Now uses the MemoryAdaptiveDispatcher by default. The return type depends on the stream parameter in CrawlerRunConfig. Adjust code that relied on unbounded concurrency.

  • max_depth Location: Moved to CrawlerRunConfig and now controls crawl depth.

  • Deep Crawling Imports: Import DeepCrawlStrategy and related classes from crawl4ai.deep_crawling.

  • BrowserContext API: Updated; the old get_context method is deprecated.

  • Optional Model Fields: Many data model fields are now optional. Handle potential None values.

  • ScrapingMode Enum: Replaced with a strategy pattern (WebScrapingStrategy, LXMLWebScrapingStrategy).

  • content_filter Parameter: Removed from CrawlerRunConfig. Use extraction strategies or markdown generators with filters.

  • Removed Functionality: The synchronous WebCrawler, the old CLI, and docs management tools have been removed.

  • Docker: Significant changes to deployment. See the Docker documentation.

  • ssl_certificate.json: This file has been removed.

  • Config: FastFilterChain has been replaced with FilterChain.

  • Deep-Crawl: DeepCrawlStrategy.arun now returns Union[CrawlResultT, List[CrawlResultT], AsyncGenerator[CrawlResultT, None]].

  • Proxy: Removed synchronous WebCrawler support and related rate-limiting configurations.

  • LLM Parameters: Use the new LLMConfig object instead of passing provider, api_token, base_url, and api_base directly to LLMExtractionStrategy and LLMContentFilter.

In short: Update imports, adjust arun_many() usage, check for optional fields, and review the Docker deployment guide.

License Change

Crawl4AI v0.5.0 updates the license to Apache 2.0 with a required attribution clause. This means you are free to use, modify, and distribute Crawl4AI (even commercially), but you must clearly attribute the project in any public use or distribution. See the updated LICENSE file for the full legal text and specific requirements.

Get Started:

I'm very excited to see what you build with Crawl4AI v0.5.0!


0.4.2 - Configurable Crawlers, Session Management, and Smarter Screenshots

December 12, 2024

The 0.4.2 update brings massive improvements to configuration, making crawlers and browsers easier to manage with dedicated objects. You can now import/export local storage for seamless session management. Plus, long-page screenshots are faster and cleaner, and full-page PDF exports are now possible. Check out all the new features to make your crawling experience even smoother.

Read full release notes →


0.4.1 - Smarter Crawling with Lazy-Load Handling, Text-Only Mode, and More

December 8, 2024

This release brings major improvements to handling lazy-loaded images, a blazing-fast Text-Only Mode, full-page scanning for infinite scrolls, dynamic viewport adjustments, and session reuse for efficient crawling. If you're looking to improve speed, reliability, or handle dynamic content with ease, this update has you covered.

Read full release notes →


0.4.0 - Major Content Filtering Update

December 1, 2024

Introduced significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.

Read full release notes →

Project History

Curious about how Crawl4AI has evolved? Check out our complete changelog for a detailed history of all versions and updates.

Stay Updated

  • Star us on GitHub

  • Follow @unclecode on Twitter

  • Join our community discussions on GitHub

