🚀🤖 Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
Note: If you're looking for the old documentation, you can access it here.
🎯 New: Adaptive Web Crawling
Crawl4AI now features intelligent adaptive crawling that knows when to stop! Using advanced information foraging algorithms, it determines when sufficient information has been gathered to answer your query.
Learn more about Adaptive Crawling →
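For a quick sense of how this looks in code, here is a minimal sketch based on the AdaptiveCrawler interface; treat the exact names (AdaptiveCrawler, digest, print_stats) and the example URL/query as assumptions to verify against the Adaptive Crawling docs for your installed version:

import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Wrap a regular crawler with the adaptive strategy
        adaptive = AdaptiveCrawler(crawler)
        # Crawl outward from the start URL until the gathered pages
        # are judged sufficient to answer the query, then stop
        result = await adaptive.digest(
            start_url="https://docs.crawl4ai.com",  # illustrative start page
            query="how to extract structured data",  # illustrative query
        )
        # Print confidence/coverage statistics for the adaptive crawl
        adaptive.print_stats()

asyncio.run(main())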
Quick Start
Here's a quick example to show you how easy it is to use Crawl4AI with its asynchronous capabilities:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://crawl4ai.com")
        # Print the extracted content
        print(result.markdown)

# Run the async main function
asyncio.run(main())
Video Tutorial
What Does Crawl4AI Do?
Crawl4AI is a feature-rich crawler and scraper that aims to:
1. Generate Clean Markdown: Perfect for RAG pipelines or direct ingestion into LLMs.
2. Structured Extraction: Parse repeated patterns with CSS, XPath, or LLM-based extraction (see the sketch after this list).
3. Advanced Browser Control: Hooks, proxies, stealth modes, session re-use—fine-grained control.
4. High Performance: Parallel crawling, chunk-based extraction, real-time use cases.
5. Open Source: No forced API keys, no paywalls—everyone can access their data.
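To make point 2 concrete, here is a hedged sketch of CSS-based structured extraction using JsonCssExtractionStrategy; the URL and selectors are made up for illustration, and the config parameter names should be checked against the Extraction docs:

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Hypothetical schema -- replace the selectors with ones matching your target page
schema = {
    "name": "articles",
    "baseSelector": "article.post",  # the repeated element to iterate over
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def main():
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/blog", config=config)
        # extracted_content is a JSON string shaped by the schema
        print(json.loads(result.extracted_content))

asyncio.run(main())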
Core Philosophies:
- Democratize Data: Free to use, transparent, and highly configurable.
- LLM Friendly: Minimally processed, well-structured text, images, and metadata, so AI models can easily consume it.
Documentation Structure
To help you get started, we've organized our docs into clear sections:
- Setup & Installation
  Basic instructions to install Crawl4AI via pip or Docker.
- Quick Start
  A hands-on introduction showing how to do your first crawl, generate Markdown, and do a simple extraction.
- Core
  Deeper guides on single-page crawling, advanced browser/crawler parameters, content filtering, and caching.
- Advanced
  Explore link & media handling, lazy loading, hooking & authentication, proxies, session management, and more (see the sketch after this list).
- Extraction
  Detailed references for no-LLM (CSS, XPath) vs. LLM-based strategies, chunking, and clustering approaches.
- API Reference
  Find the technical specifics of each class and method, including AsyncWebCrawler, arun(), and CrawlResult.
Throughout these sections, you'll find code samples you can copy-paste into your environment. If something is missing or unclear, raise an issue or PR.
How You Can Support
- Star & Fork: If you find Crawl4AI helpful, star the repo on GitHub or fork it to add your own features.
- File Issues: Encounter a bug or missing feature? Let us know by filing an issue, so we can improve.
- Pull Requests: Whether it's a small fix, a big feature, or better docs—contributions are always welcome.
- Join Discord: Come chat about web scraping, crawling tips, or AI workflows with the community.
- Spread the Word: Mention Crawl4AI in your blog posts, talks, or on social media.
Our mission: to empower everyone—students, researchers, entrepreneurs, data scientists—to access, parse, and shape the world's data with speed, cost-efficiency, and creative freedom.
Quick Links
- GitHub Repo
- Installation Guide
- Quick Start
- API Reference
- Changelog
Thank you for joining me on this journey. Let's keep building an open, democratic approach to data extraction and AI together.

Happy Crawling!
— Unclecode, Founder & Maintainer of Crawl4AI