Getting Started with Crawl4AI
Welcome to Crawl4AI, an open-source LLM-friendly Web Crawler & Scraper. In this tutorial, you'll:
- Run your first crawl using minimal configuration.
- Generate Markdown output (and learn how it's influenced by content filters).
- Experiment with a simple CSS-based extraction strategy.
- See a glimpse of LLM-based extraction (including open-source and closed-source model options).
- Crawl a dynamic page that loads content via JavaScript.
1. Introduction
Crawl4AI provides:
- An asynchronous crawler, AsyncWebCrawler.
- Configurable browser and run settings via BrowserConfig and CrawlerRunConfig.
- Automatic HTML-to-Markdown conversion via DefaultMarkdownGenerator (supports optional filters).
- Multiple extraction strategies (LLM-based or "traditional" CSS/XPath-based).
By the end of this guide, you'll have performed a basic crawl, generated Markdown, tried out two extraction strategies, and crawled a dynamic page that uses "Load More" buttons or JavaScript updates.
2. Your First Crawl
Here's a minimal Python script that creates an AsyncWebCrawler, fetches a webpage, and prints the first 300 characters of its Markdown output:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())
What's happening?
- AsyncWebCrawler launches a headless browser (Chromium by default).
- It fetches https://example.com.
- Crawl4AI automatically converts the HTML into Markdown.

You now have a simple, working crawl!
3. Basic Configuration (Light Introduction)
Crawl4AI's crawler can be heavily customized using two main classes:
1. BrowserConfig: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
2. CrawlerRunConfig: Controls how each crawl runs (caching, extraction, timeouts, hooks, etc.).
Below is an example with minimal usage:
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
IMPORTANT: By default, the cache mode is set to CacheMode.ENABLED, so to get fresh content you need to set it to CacheMode.BYPASS.
We'll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
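To give a feel for the range of options, here is a slightly fuller sketch that combines settings used later in this guide (java_script_enabled, word_count_threshold, page_timeout) with a user_agent override; treat any parameter not listed in the configuration reference as an assumption to verify:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(
        headless=True,               # run without a visible browser window
        java_script_enabled=True,    # let pages execute JavaScript
        user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)",  # custom UA string
    )
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,  # always fetch fresh content
        word_count_threshold=10,      # ignore very short text blocks
        page_timeout=60000,           # per-page timeout in milliseconds
    )
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun("https://example.com", config=run_conf)
        print(result.markdown[:200])

if __name__ == "__main__":
    asyncio.run(main())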
4. Generating Markdown Output
By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a markdown generator or content filter.
- result.markdown: The direct HTML-to-Markdown conversion.
- result.markdown.fit_markdown: The same content after applying any configured content filter (e.g., PruningContentFilter).
Example: Using a Filter with DefaultMarkdownGenerator
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
Note: If you do not specify a content filter or markdown generator, you'll typically see only the raw Markdown. PruningContentFilter may add around 50 ms of processing time. We'll dive deeper into these strategies in a dedicated Markdown Generation tutorial.
5. Simple Data Extraction (CSS-based)
Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example.
New! Crawl4AI now provides a powerful utility to automatically generate extraction schemas using an LLM. This is a one-time cost that gives you a reusable schema for fast, LLM-free extractions:
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig

# Generate a schema (one-time cost)
html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"

# Using OpenAI (requires API token)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-openai-token")  # Required for OpenAI
)

# Or using Ollama (open source, no token needed)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config=LLMConfig(provider="ollama/llama3.3", api_token=None)  # Not needed for Ollama
)

# Use the schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(schema)
For a complete guide on schema generation and advanced usage, see No-LLM Extraction Strategies.
Here's a basic extraction example:
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    raw_html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="raw://" + raw_html,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
Why is this helpful?
- Great for repetitive page structures (e.g., item listings, articles).
- No AI usage or costs.
- The crawler returns a JSON string you can parse or store.

Tip: You can pass raw HTML to the crawler instead of a URL. To do so, prefix the HTML with raw://.
6. Simple Data Extraction (LLM-based)
For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports open-source or closed-source providers:
- Open-source models (e.g., ollama/llama3.3, no token needed)
- OpenAI models (e.g., openai/gpt-4, requires api_token)
- Or any provider supported by the underlying library

Below is an example using an open-source style (no token) and a closed-source style:
import os
import json
import asyncio
from typing import Dict
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output token for the OpenAI model."
    )

async def extract_structured_data_using_llm(
    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    # Skip if a token is required but missing (local Ollama providers need no token)
    if api_token is None and not provider.startswith("ollama"):
        print(f"API token is required for {provider}. Skipping this example.")
        return

    browser_config = BrowserConfig(headless=True)

    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider=provider, api_token=api_token),
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content.""",
            extra_args=extra_args,
        ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/", config=crawler_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(
        extract_structured_data_using_llm(
            provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
        )
    )
What's happening?
- We define a Pydantic schema (OpenAIModelFee) describing the fields we want.
- The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
- Depending on the provider and api_token, you can use local models or a remote API.
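To run the same extraction with a local open-source model instead, point the helper at an Ollama provider string. This is only a usage sketch; it assumes an Ollama server is running locally with the llama3.3 model already pulled:

# Assumes a local Ollama server with the llama3.3 model available (no API token needed)
asyncio.run(
    extract_structured_data_using_llm(provider="ollama/llama3.3", api_token=None)
)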
7. Adaptive Crawling (New!)
Crawl4AI now includes intelligent adaptive crawling that automatically determines when sufficient information has been gathered. Here's a quick example:
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def adaptive_example():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)

        # Start adaptive crawling
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View results
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")
        print(f"Achieved {adaptive.confidence:.0%} confidence")

if __name__ == "__main__":
    asyncio.run(adaptive_example())
What's special about adaptive crawling?
- Automatic stopping: Stops when sufficient information is gathered.
- Intelligent link selection: Follows only relevant links.
- Confidence scoring: Know how complete your information is.
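If you want to tune when the crawl stops, a configuration object can be passed to AdaptiveCrawler. The sketch below is illustrative only: the AdaptiveConfig class and its parameter names are assumptions here, so confirm them against the Adaptive Crawling guide before relying on them.

import asyncio
# AdaptiveConfig and its parameters are assumed; check the Adaptive Crawling docs
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

async def tuned_adaptive_example():
    config = AdaptiveConfig(
        confidence_threshold=0.8,  # assumed: stop once 80% confidence is reached
        max_pages=20,              # assumed: hard cap on pages crawled
    )
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )
        adaptive.print_stats()

if __name__ == "__main__":
    asyncio.run(tuned_adaptive_example())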
Learn more about Adaptive Crawling →
8. Multi-URL Concurrency (Preview)
If you need to crawl multiple URLs in parallel, you can use arun_many(). By default, Crawl4AI employs a MemoryAdaptiveDispatcher, automatically adjusting concurrency based on system resources. Here's a quick glimpse:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def quick_parallel_example():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]

    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True  # Enable streaming mode
    )

    async with AsyncWebCrawler() as crawler:
        # Stream results as they complete
        async for result in await crawler.arun_many(urls, config=run_conf):
            if result.success:
                print(f"[OK] {result.url}, length: {len(result.markdown.raw_markdown)}")
            else:
                print(f"[ERROR] {result.url} => {result.error_message}")

        # Or get all results at once (default behavior)
        run_conf = run_conf.clone(stream=False)
        results = await crawler.arun_many(urls, config=run_conf)
        for res in results:
            if res.success:
                print(f"[OK] {res.url}, length: {len(res.markdown.raw_markdown)}")
            else:
                print(f"[ERROR] {res.url} => {res.error_message}")

if __name__ == "__main__":
    asyncio.run(quick_parallel_example())
The example above shows two ways to handle multiple URLs:
1. Streaming mode (stream=True): Process results as they become available using async for.
2. Batch mode (stream=False): Wait for all results to complete.
For more advanced concurrency (e.g., a semaphore-based approach, adaptive memory-usage throttling, or customized rate limiting), see Advanced Multi-URL Crawling.
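As a small preview of what that looks like, the sketch below passes an explicit dispatcher to arun_many(). The import path and parameter names (memory_threshold_percent, max_session_permit) are assumptions drawn from the dispatcher documentation and may differ in your version:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher  # import path is an assumption

async def dispatcher_example():
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,  # assumed: throttle when system memory use is high
        max_session_permit=10,          # assumed: cap on concurrent browser sessions
    )
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    urls = ["https://example.com/page1", "https://example.com/page2"]

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=run_conf, dispatcher=dispatcher)
        for res in results:
            print(res.url, "OK" if res.success else res.error_message)

if __name__ == "__main__":
    asyncio.run(dispatcher_example())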
9. Dynamic Content Example
Some sites require multiple "page clicks" or dynamic JavaScript updates before the content you want is in the DOM. The example below uses BrowserConfig and CrawlerRunConfig to inject JavaScript that clicks through every tab on a course page, waits for each tab's content to render, and then extracts the results with a CSS-based schema:
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")

    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src",
            },
        ],
    }

    browser_config = BrowserConfig(headless=True, java_script_enabled=True)

    # Click every tab so its content is present in the DOM before extraction
    js_click_tabs = """
    (async () => {
        const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
        for(let tab of tabs) {
            tab.scrollIntoView();
            tab.click();
            await new Promise(r => setTimeout(r, 500));
        }
    })();
    """

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        js_code=[js_click_tabs],
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology", config=crawler_config
        )

        companies = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(companies)} companies")
        print(json.dumps(companies[0], indent=2))

async def main():
    await extract_structured_data_using_css_extractor()

if __name__ == "__main__":
    asyncio.run(main())
Key Points:

- BrowserConfig(headless=True, java_script_enabled=True): The crawl runs headless with JavaScript enabled so the injected tab clicks can execute.
- js_code: The snippet scrolls to and clicks each tab, pausing briefly so the newly revealed content is rendered before extraction.
- CrawlerRunConfig(...): Bundles the JavaScript to run, the extraction strategy, and cache_mode=CacheMode.BYPASS so the page is fetched fresh.
- JsonCssExtractionStrategy(schema): Turns the fully revealed page into structured JSON without any LLM calls.
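For true multi-step flows, such as clicking a "Next Page" button and scraping each page of results in the same tab, you can keep a session alive across calls with session_id, run later steps with js_only=True so the existing page is reused instead of re-navigated, and clean up with kill_session() when you are done. The sketch below is a rough illustration of that pattern (the selectors and the wait_for expression are hypothetical), not a drop-in recipe:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def paginate_example():
    session_id = "my_paging_session"  # any stable string identifies the tab to reuse

    async with AsyncWebCrawler() as crawler:
        # Step 1: navigate to the listing page and keep the session open
        first = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, session_id=session_id)
        result = await crawler.arun("https://example.com/listing", config=first)
        print("Page 1 length:", len(result.markdown.raw_markdown))

        # Step 2: click "Next" in the same tab without re-navigating
        next_page = CrawlerRunConfig(
            session_id=session_id,
            js_only=True,  # run JS in the existing page instead of reloading the URL
            js_code="document.querySelector('a.next')?.click();",  # hypothetical selector
            wait_for="css:div.results",  # hypothetical wait condition
        )
        result = await crawler.arun("https://example.com/listing", config=next_page)
        print("Page 2 length:", len(result.markdown.raw_markdown))

        # Clean up the named session (method location may vary by version)
        await crawler.crawler_strategy.kill_session(session_id)

if __name__ == "__main__":
    asyncio.run(paginate_example())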
10. Next Steps
Congratulations! You have:
- Performed a basic crawl and printed Markdown.
- Used content filters with a markdown generator.
- Extracted JSON via CSS or LLM strategies.
- Handled dynamic pages with JavaScript triggers.

If you're ready for more, check out:
- Installation: A deeper dive into advanced installs, Docker usage (experimental), or optional dependencies.
- Hooks & Auth: Learn how to run custom JavaScript or handle logins with cookies, local storage, etc.
- Deployment: Explore ephemeral testing in Docker or plan for the upcoming stable Docker release.
- Browser Management: Delve into user simulation, stealth modes, and concurrency best practices.
Crawl4AI is a powerful, flexible tool. Enjoy building out your scrapers, data pipelines, or AI-driven extraction flows. Happy crawling!