Getting Started with Crawl4AI

Welcome to Crawl4AI, an open-source LLM-friendly Web Crawler & Scraper. In this tutorial, you’ll:

  1. Run your first crawl using minimal configuration.

  2. Generate Markdown output (and learn how it’s influenced by content filters).

  3. Experiment with a simple CSS-based extraction strategy.

  4. See a glimpse of LLM-based extraction (including open-source and closed-source model options).

  5. Crawl a dynamic page that loads content via JavaScript.


1. Introduction

Crawl4AI provides:

  • An asynchronous crawler, AsyncWebCrawler.

  • Configurable browser and run settings via BrowserConfig and CrawlerRunConfig.

  • Automatic HTML-to-Markdown conversion via DefaultMarkdownGenerator (supports optional filters).

  • Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).

By the end of this guide, you’ll have performed a basic crawl, generated Markdown, tried out two extraction strategies, and crawled a dynamic page that uses “Load More” buttons or JavaScript updates.


2. Your First Crawl

Here’s a minimal Python script that creates an AsyncWebCrawler, fetches a webpage, and prints the first 300 characters of its Markdown output:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())

What’s happening?

  • AsyncWebCrawler launches a headless browser (Chromium by default).
  • It fetches https://example.com.
  • Crawl4AI automatically converts the HTML into Markdown.

You now have a simple, working crawl!


3. Basic Configuration (Light Introduction)

Crawl4AI’s crawler can be heavily customized using two main classes:

1. BrowserConfig: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
2. CrawlerRunConfig: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).

Below is an example with minimal usage:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

IMPORTANT: By default, the cache mode is set to CacheMode.ENABLED, so to get fresh content you need to set it to CacheMode.BYPASS.
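
For instance, a trivial sketch of keeping the default caching behavior explicit (and switching to BYPASS only when you need fresh content):

from crawl4ai import CrawlerRunConfig, CacheMode

# Explicitly keep the default caching behavior (reuses previously fetched pages);
# switch to CacheMode.BYPASS whenever you need fresh content.
run_conf = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)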

We’ll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.


4. Generating Markdown Output

By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a markdown generator or content filter.

  • result.markdown:
    The direct HTML-to-Markdown conversion.

  • result.markdown.fit_markdown:
    The same content after applying any configured content filter (e.g., PruningContentFilter).

Example: Using a Filter with DefaultMarkdownGenerator

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())

Note: If you do not specify a content filter or markdown generator, you’ll typically see only the raw Markdown. PruningContentFilter may add around 50ms of processing time. We’ll dive deeper into these strategies in a dedicated Markdown Generation tutorial.


5. Simple Data Extraction (CSS-based)

Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example.

New! Crawl4AI now provides a powerful utility to automatically generate extraction schemas using an LLM. This is a one-time cost that gives you a reusable schema for fast, LLM-free extractions:

from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig

# Generate a schema (one-time cost)
html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"

# Using OpenAI (requires API token)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config = LLMConfig(provider="openai/gpt-4o",api_token="your-openai-token")  # Required for OpenAI
)

# Or using Ollama (open source, no token needed)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config = LLMConfig(provider="ollama/llama3.3", api_token=None)  # Not needed for Ollama
)

# Use the schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(schema)

For a complete guide on schema generation and advanced usage, see No-LLM Extraction Strategies.

Here's a basic extraction example:

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    raw_html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="raw://" + raw_html,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())

Why is this helpful?

  • Great for repetitive page structures (e.g., item listings, articles).
  • No AI usage or costs.
  • The crawler returns a JSON string you can parse or store.

Tip: You can pass raw HTML to the crawler instead of a URL. To do so, prefix the HTML with raw://.


6. Simple Data Extraction (LLM-based)

For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports open-source or closed-source providers:

  • Open-source models (e.g., ollama/llama3.3, no token needed)

  • OpenAI models (e.g., openai/gpt-4, requires api_token)

  • Or any provider supported by the underlying library

Below is an example using the open-source style (no token) and the closed-source style:

import os
import json
import asyncio
from typing import Dict
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output token for the OpenAI model."
    )

async def extract_structured_data_using_llm(
    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    if api_token is None and not provider.startswith("ollama"):
        print(f"API token is required for {provider}. Skipping this example.")
        return

    browser_config = BrowserConfig(headless=True)

    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            llm_config = LLMConfig(provider=provider,api_token=api_token),
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
            Do not miss any models in the entire content.""",
            extra_args=extra_args,
        ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/", config=crawler_config
        )
        print(result.extracted_content)

if __name__ == "__main__":

    asyncio.run(
        extract_structured_data_using_llm(
            provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
        )
    )

What’s happening?

  • We define a Pydantic schema (OpenAIModelFee) describing the fields we want.
  • The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
  • Depending on the provider and api_token, you can use local models or a remote API.
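
If you prefer the open-source route, you can call the same function with a local Ollama provider and no API token. A minimal sketch, assuming Ollama is installed locally and the llama3.3 model has already been pulled:

# Hedged sketch: assumes a local Ollama server is running and
# `ollama pull llama3.3` has been done beforehand.
asyncio.run(
    extract_structured_data_using_llm(provider="ollama/llama3.3", api_token=None)
)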


7. Adaptive Crawling (New!)

Crawl4AI now includes intelligent adaptive crawling that automatically determines when sufficient information has been gathered. Here's a quick example:

import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def adaptive_example():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)

        # Start adaptive crawling
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View results
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")
        print(f"Achieved {adaptive.confidence:.0%} confidence")

if __name__ == "__main__":
    asyncio.run(adaptive_example())

What's special about adaptive crawling?

  • Automatic stopping: stops when sufficient information is gathered
  • Intelligent link selection: follows only relevant links
  • Confidence scoring: know how complete your information is
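
If you want to control when crawling stops, a configuration object can be passed to AdaptiveCrawler. The sketch below assumes an AdaptiveConfig class with confidence_threshold and max_pages parameters as described in the adaptive crawling docs; treat the exact names and import location as assumptions, not a definitive API reference:

import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig  # AdaptiveConfig location assumed

async def tuned_adaptive_example():
    # Assumed parameters: stop at ~80% confidence or after 20 pages, whichever comes first.
    config = AdaptiveConfig(confidence_threshold=0.8, max_pages=20)

    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config=config)
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )
        adaptive.print_stats()
        print(f"Stopped after {len(result.crawled_urls)} pages")

if __name__ == "__main__":
    asyncio.run(tuned_adaptive_example())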

Learn more about Adaptive Crawling →


8. Multi-URL Concurrency (Preview)

If you need to crawl multiple URLs in parallel, you can use arun_many(). By default, Crawl4AI employs a MemoryAdaptiveDispatcher, automatically adjusting concurrency based on system resources. Here’s a quick glimpse:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def quick_parallel_example():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]

    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True  # Enable streaming mode
    )

    async with AsyncWebCrawler() as crawler:
        # Stream results as they complete
        async for result in await crawler.arun_many(urls, config=run_conf):
            if result.success:
                print(f"[OK] {result.url}, length: {len(result.markdown.raw_markdown)}")
            else:
                print(f"[ERROR] {result.url} => {result.error_message}")

        # Or get all results at once (default behavior)
        run_conf = run_conf.clone(stream=False)
        results = await crawler.arun_many(urls, config=run_conf)
        for res in results:
            if res.success:
                print(f"[OK] {res.url}, length: {len(res.markdown.raw_markdown)}")
            else:
                print(f"[ERROR] {res.url} => {res.error_message}")

if __name__ == "__main__":
    asyncio.run(quick_parallel_example())

The example above shows two ways to handle multiple URLs:

1. Streaming mode (stream=True): process results as they become available using async for.
2. Batch mode (stream=False): wait for all results to complete.

For more advanced concurrency (e.g., a semaphore-based approach, adaptive memory usage throttling, or customized rate limiting), see Advanced Multi-URL Crawling. A rough sketch of passing an explicit dispatcher is shown below.
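
As a hedged sketch of what an explicit dispatcher might look like: the import path and parameter names below (memory_threshold_percent, max_session_permit, and the dispatcher argument to arun_many()) are assumptions based on the advanced crawling docs, not verified here.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
# Assumed import path for the dispatcher class mentioned above.
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

async def dispatched_crawl():
    urls = ["https://example.com/page1", "https://example.com/page2"]
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    # Assumed parameters: back off when system memory usage passes 70%,
    # and never run more than 5 concurrent browser sessions.
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,
        max_session_permit=5,
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=run_conf, dispatcher=dispatcher)
        for res in results:
            print(res.url, "OK" if res.success else res.error_message)

if __name__ == "__main__":
    asyncio.run(dispatched_crawl())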


9. Dynamic Content Example

Some sites require multiple “page clicks” or dynamic JavaScript updates before the content you want is present. The example below uses BrowserConfig and CrawlerRunConfig to click through a set of course tabs, wait for each tab’s content to load, and then extract it with a CSS schema:

import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src",
            },
        ],
    }

    browser_config = BrowserConfig(headless=True, java_script_enabled=True)

    js_click_tabs = """
    (async () => {
        const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
        for(let tab of tabs) {
            tab.scrollIntoView();
            tab.click();
            await new Promise(r => setTimeout(r, 500));
        }
    })();
    """

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        js_code=[js_click_tabs],
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology", config=crawler_config
        )

        companies = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(companies)} companies")
        print(json.dumps(companies[0], indent=2))

async def main():
    await extract_structured_data_using_css_extractor()

if __name__ == "__main__":
    asyncio.run(main())

Key Points:

  • BrowserConfig(headless=True, java_script_enabled=True): the page is rendered in a headless browser with JavaScript enabled so the tab clicks can run.

  • CrawlerRunConfig(...): we specify the extraction strategy and the js_code that clicks each tab and briefly waits for its content to appear.

  • JsonCssExtractionStrategy(schema) turns the rendered HTML into structured JSON with no LLM calls.

  • For multi-step flows that reuse the same page across calls (e.g., clicking a “Next Page” button and waiting for new items to load), CrawlerRunConfig also supports session_id, js_only, and wait_for, and the session can be cleaned up with kill_session(); see the sketch after this list.
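
Below is a minimal sketch of that session-based pattern, assuming a hypothetical listing page and selectors (button.next-page, div.item) that you would replace with real ones. session_id, js_only, js_code, and wait_for are CrawlerRunConfig options, and the session is released via the crawler strategy’s kill_session():

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def paginate_with_session():
    session_id = "pagination_demo"  # reuse the same browser page across arun() calls

    async with AsyncWebCrawler() as crawler:
        # First call: navigate to the (hypothetical) listing page normally.
        first = await crawler.arun(
            url="https://example.com/items",
            config=CrawlerRunConfig(session_id=session_id, cache_mode=CacheMode.BYPASS),
        )
        print("Page 1 length:", len(first.markdown.raw_markdown))

        # Second call: stay on the same page, click "Next", wait for new items to render.
        next_conf = CrawlerRunConfig(
            session_id=session_id,
            js_only=True,  # run JS in the existing session instead of re-navigating
            js_code="document.querySelector('button.next-page')?.click();",  # hypothetical selector
            wait_for="css:div.item",  # hypothetical selector for the newly loaded items
            cache_mode=CacheMode.BYPASS,
        )
        second = await crawler.arun(url="https://example.com/items", config=next_conf)
        print("Page 2 length:", len(second.markdown.raw_markdown))

        # Clean up the page and browser session when done.
        await crawler.crawler_strategy.kill_session(session_id)

if __name__ == "__main__":
    asyncio.run(paginate_with_session())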


10. Next Steps

Congratulations! You have:

  1. Performed a basic crawl and printed Markdown.

  2. Used content filters with a markdown generator.

  3. Extracted JSON via CSS or LLM strategies.

  4. Handled dynamic pages with JavaScript triggers.

If you’re ready for more, check out:

  • Installation: A deeper dive into advanced installs, Docker usage (experimental), or optional dependencies.

  • Hooks & Auth: Learn how to run custom JavaScript or handle logins with cookies, local storage, etc.

  • Deployment: Explore ephemeral testing in Docker or plan for the upcoming stable Docker release.

  • Browser Management: Delve into user simulation, stealth modes, and concurrency best practices.

Crawl4AI is a powerful, flexible tool. Enjoy building out your scrapers, data pipelines, or AI-driven extraction flows. Happy crawling!

