🚀🤖 Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper

Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, Crawl4AI gives developers speed, precision, and easy deployment.

Note: If you're looking for the old documentation, you can access it here.

🎯 New: Adaptive Web Crawling

Crawl4AI now features intelligent adaptive crawling that knows when to stop. Using information foraging algorithms, it determines when enough information has been gathered to answer your query.

Learn more about Adaptive Crawling →
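
The idea of "knowing when to stop" can be illustrated with a toy stopping rule. The sketch below is not Crawl4AI's algorithm or API: the function names, the term-coverage score, and the 0.8 threshold are all assumptions made for illustration. It only shows the general shape of a crawl loop that halts once the collected text covers enough of the query.

```python
from collections.abc import Iterable


def coverage(query_terms: set[str], seen_terms: set[str]) -> float:
    """Fraction of query terms found in the text collected so far."""
    if not query_terms:
        return 1.0
    return len(query_terms & seen_terms) / len(query_terms)


def crawl_until_sufficient(
    pages: Iterable[tuple[str, str]],
    query: str,
    threshold: float = 0.8,  # hypothetical "enough information" cutoff
    max_pages: int = 10,     # hard cap so the loop always terminates
) -> tuple[list[str], float]:
    """Visit (url, text) pairs in order and stop once coverage is high enough."""
    query_terms = set(query.lower().split())
    seen_terms: set[str] = set()
    visited: list[str] = []
    for url, text in pages:
        if len(visited) >= max_pages:
            break
        visited.append(url)
        seen_terms |= set(text.lower().split())
        if coverage(query_terms, seen_terms) >= threshold:
            break  # enough information gathered; skip the remaining pages
    return visited, coverage(query_terms, seen_terms)


# Hypothetical in-memory "site": no network access is involved.
pages = [
    ("https://example.com/a", "Async functions use the await keyword"),
    ("https://example.com/b", "Context managers define enter and exit"),
    ("https://example.com/c", "Unrelated release notes"),
]
visited, score = crawl_until_sufficient(pages, "async context managers")
print(visited, score)
```

Here the loop stops after the second page because all three query terms have been seen, so the third page is never visited. A real adaptive crawler would use a richer relevance signal than word overlap.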

Quick Start

Here's a quick example to show you how easy it is to use Crawl4AI with its asynchronous capabilities:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://crawl4ai.com")

        # Print the extracted content
        print(result.markdown)

# Run the async main function
asyncio.run(main())

Video Tutorial

What Does Crawl4AI Do?

Crawl4AI is a feature-rich crawler and scraper that aims to:

1. Generate Clean Markdown: Perfect for RAG pipelines or direct ingestion into LLMs.
2. Structured Extraction: Parse repeated patterns with CSS, XPath, or LLM-based extraction.
3. Advanced Browser Control: Hooks, proxies, stealth modes, and session reuse for fine-grained control.
4. High Performance: Parallel crawling, chunk-based extraction, real-time use cases.
5. Open Source: No forced API keys and no paywalls; everyone can access their data.
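
As a rough illustration of the parallel crawling mentioned in item 4, the sketch below runs several fetches concurrently with a cap on how many are in flight at once. It uses only `asyncio`; `fake_fetch`, `crawl_all`, and `max_concurrency` are hypothetical names, and `fake_fetch` is a stand-in so the example runs without a browser or network. In a real crawl, that call would be replaced by something like the `crawler.arun(url=...)` call from the Quick Start.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real crawl; returns placeholder Markdown."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"# Page at {url}"


async def crawl_all(urls: list[str], max_concurrency: int = 2) -> dict[str, str]:
    """Fetch all URLs concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> tuple[str, str]:
        async with semaphore:  # wait for a free slot before fetching
            return url, await fake_fetch(url)

    # gather() returns results in the same order as its inputs
    pairs = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(pairs)


urls = [f"https://example.com/{i}" for i in range(1, 5)]
results = asyncio.run(crawl_all(urls))
for url, markdown in results.items():
    print(url, "->", markdown)
```

The semaphore keeps the crawl polite and bounds memory use. Raising `max_concurrency` trades that for throughput.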

Core Philosophies:
- Democratize Data: Free to use, transparent, and highly configurable.
- LLM Friendly: Minimally processed, well-structured text, images, and metadata, so AI models can easily consume it.

Documentation Structure

To help you get started, we've organized our docs into clear sections:

- Setup & Installation: Basic instructions to install Crawl4AI via pip or Docker.
- Quick Start: A hands-on introduction showing how to do your first crawl, generate Markdown, and do a simple extraction.
- Core: Deeper guides on single-page crawling, advanced browser/crawler parameters, content filtering, and caching.
- Advanced: Explore link & media handling, lazy loading, hooking & authentication, proxies, session management, and more.
- Extraction: Detailed references for no-LLM (CSS, XPath) vs. LLM-based strategies, chunking, and clustering approaches.
- API Reference: Find the technical specifics of each class and method, including AsyncWebCrawler, arun(), and CrawlResult.

Throughout these sections, you'll find code samples you can copy and paste into your environment. If something is missing or unclear, open an issue or a PR.

How You Can Support

- Star & Fork: If you find Crawl4AI helpful, star the repo on GitHub or fork it to add your own features.
- File Issues: Encounter a bug or missing feature? Let us know by filing an issue, so we can improve.
- Pull Requests: Whether it's a small fix, a big feature, or better docs, contributions are always welcome.
- Join Discord: Come chat about web scraping, crawling tips, or AI workflows with the community.
- Spread the Word: Mention Crawl4AI in your blog posts, talks, or on social media.

Our mission: to empower everyone (students, researchers, entrepreneurs, data scientists) to access, parse, and shape the world's data with speed, cost-efficiency, and creative freedom.

Quick Links

- GitHub Repo
- Installation Guide
- Quick Start
- API Reference
- Changelog

Thank you for joining me on this journey. Let's keep building an open, democratic approach to data extraction and AI together.

Happy Crawling!
Unclecode, Founder & Maintainer of Crawl4AI