🚀🤖 Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper

Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

Note: If you're looking for the old documentation, you can access it here.

🎯 New: Adaptive Web Crawling

Crawl4AI now features intelligent adaptive crawling that knows when to stop! Using advanced information foraging algorithms, it determines when sufficient information has been gathered to answer your query.

Learn more about Adaptive Crawling →

Quick Start

Here's a quick example to show you how easy it is to use Crawl4AI with its asynchronous capabilities:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://crawl4ai.com")

        # Print the extracted content
        print(result.markdown)

# Run the async main function
asyncio.run(main())
```

Video Tutorial


What Does Crawl4AI Do?

Crawl4AI is a feature-rich crawler and scraper that aims to:

1. Generate Clean Markdown: Perfect for RAG pipelines or direct ingestion into LLMs.
2. Structured Extraction: Parse repeated patterns with CSS, XPath, or LLM-based extraction.
3. Advanced Browser Control: Hooks, proxies, stealth modes, session re-use—fine-grained control.
4. High Performance: Parallel crawling, chunk-based extraction, real-time use cases.
5. Open Source: No forced API keys, no paywalls—everyone can access their data.

Core Philosophies:
- Democratize Data: Free to use, transparent, and highly configurable.
- LLM Friendly: Minimally processed, well-structured text, images, and metadata, so AI models can easily consume it.


Documentation Structure

To help you get started, we've organized our docs into clear sections:

  • Setup & Installation
    Basic instructions to install Crawl4AI via pip or Docker.

  • Quick Start
    A hands-on introduction showing how to do your first crawl, generate Markdown, and do a simple extraction.


  • Core
    Deeper guides on single-page crawling, advanced browser/crawler parameters, content filtering, and caching.

  • Advanced
    Explore link & media handling, lazy loading, hooking & authentication, proxies, session management, and more.

  • Extraction
    Detailed references for no-LLM (CSS, XPath) vs. LLM-based strategies, chunking, and clustering approaches.

  • API Reference
    Find the technical specifics of each class and method, including AsyncWebCrawler, arun(), and CrawlResult.

Throughout these sections, you'll find code samples you can copy-paste into your environment. If something is missing or unclear, raise an issue or PR.


How You Can Support

  • Star & Fork: If you find Crawl4AI helpful, star the repo on GitHub or fork it to add your own features.

  • File Issues: Encounter a bug or missing feature? Let us know by filing an issue, so we can improve.

  • Pull Requests: Whether it's a small fix, a big feature, or better docs—contributions are always welcome.

  • Join Discord: Come chat about web scraping, crawling tips, or AI workflows with the community.

  • Spread the Word: Mention Crawl4AI in your blog posts, talks, or on social media.

Our mission: to empower everyone—students, researchers, entrepreneurs, data scientists—to access, parse, and shape the world's data with speed, cost-efficiency, and creative freedom.



Thank you for joining me on this journey. Let's keep building an open, democratic approach to data extraction and AI together.

Happy Crawling!
Unclecode, Founder & Maintainer of Crawl4AI

