Crawl4AI Docker Guide 🐳

Table of Contents

Prerequisites

Before we dive in, make sure you have:

  • Docker installed and running (version 20.10.0 or higher), including docker compose (usually bundled with Docker Desktop)
  • git for cloning the repository
  • At least 4GB of RAM available for the container (more recommended for heavy use)
  • Python 3.10+ (if using the Python SDK)
  • Node.js 16+ (if using the Node.js examples)

💡 Pro tip: Run docker info to check your Docker installation and available resources.
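
If you want a quick check before continuing, the standard Docker CLI commands below print the daemon status and confirm the Compose plugin is installed:

# Show Docker daemon status, version, and available resources
docker info

# Confirm the Compose plugin is available
docker compose version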

Installation

We offer several ways to get the Crawl4AI server running. The quickest way is to use our pre-built Docker Hub images.

Option 1: Using Pre-built Docker Hub Images (Recommended)

Pull and run images directly from Docker Hub without building locally.

1. Pull the Image

Our latest release is 0.7.3. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.

💡 Note: The latest tag points to the stable 0.7.3 version.

# Pull the latest version
docker pull unclecode/crawl4ai:0.7.3

# Or pull using the latest tag
docker pull unclecode/crawl4ai:latest

2. Setup Environment (API Keys)

If you plan to use LLMs, create a .llm.env file in your working directory:

# Create a .llm.env file with your API keys
cat > .llm.env << EOL
# OpenAI
OPENAI_API_KEY=sk-your-key

# Anthropic
ANTHROPIC_API_KEY=your-anthropic-key

# Other providers as needed
# DEEPSEEK_API_KEY=your-deepseek-key
# GROQ_API_KEY=your-groq-key
# TOGETHER_API_KEY=your-together-key
# MISTRAL_API_KEY=your-mistral-key
# GEMINI_API_TOKEN=your-gemini-token
EOL

🔑 Note: Keep your API keys secure! Never commit .llm.env to version control.

3. Run the Container

  • Basic run:

    docker run -d \
      -p 11235:11235 \
      --name crawl4ai \
      --shm-size=1g \
      unclecode/crawl4ai:latest

  • With LLM support:

    # Make sure .llm.env is in the current directory
    docker run -d \
      -p 11235:11235 \
      --name crawl4ai \
      --env-file .llm.env \
      --shm-size=1g \
      unclecode/crawl4ai:latest

The server will be available at http://localhost:11235. Visit /playground to access the interactive testing interface.

4. Stopping the Container

docker stop crawl4ai && docker rm crawl4ai

Docker Hub Versioning Explained

  • Image Name: unclecode/crawl4ai

  • Tag Format: LIBRARY_VERSION[-SUFFIX] (e.g., 0.7.3)

    • LIBRARY_VERSION: The semantic version of the core crawl4ai Python library
    • SUFFIX: Optional tag for release candidates and revisions (e.g., r1)

  • latest Tag: Points to the most recent stable version

  • Multi-Architecture Support: All images support both linux/amd64 and linux/arm64 architectures through a single tag

Option 2: Using Docker Compose

Docker Compose simplifies building and running the service, especially for local development and testing.

1. Clone Repository

git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

2. Environment Setup (API Keys)

If you plan to use LLMs, copy the example environment file and add your API keys. This file should be in the project root directory.

# Make sure you are in the 'crawl4ai' root directory
cp deploy/docker/.llm.env.example .llm.env

# Now edit .llm.env and add your API keys

Flexible LLM Provider Configuration:

The Docker setup now supports flexible LLM provider configuration through three methods:

  1. Environment Variable (Highest Priority): Set LLM_PROVIDER to override the default

    export LLM_PROVIDER="anthropic/claude-3-opus"
    # Or in your .llm.env file:
    # LLM_PROVIDER=anthropic/claude-3-opus

  2. API Request Parameter: Specify the provider per request

    {
      "url": "https://example.com",
      "f": "llm",
      "provider": "groq/mixtral-8x7b"
    }

  3. Config File Default: Falls back to config.yml (default: openai/gpt-4o-mini)

The system automatically selects the appropriate API key based on the configured api_key_env in the config file.
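
For reference, this is the relevant llm block from the default config.yml (shown in full in the Server Configuration section below):

llm:
  provider: "openai/gpt-4o-mini"  # Can be overridden by the LLM_PROVIDER env var
  api_key_env: "OPENAI_API_KEY"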

3. Build and Run with Compose

The docker-compose.yml file in the project root provides a simplified approach that automatically handles architecture detection using buildx.

  • Run Pre-built Image from Docker Hub:

    # Pulls and runs the latest release from Docker Hub
    # Automatically selects the correct architecture
    IMAGE=unclecode/crawl4ai:latest docker compose up -d

  • Build and Run Locally:

    # Builds the image locally using the Dockerfile and runs it
    # Automatically uses the correct architecture for your machine
    docker compose up --build -d

  • Customize the Build:

    # Build with all features (includes torch and transformers)
    INSTALL_TYPE=all docker compose up --build -d

    # Build with GPU support (for AMD64 platforms)
    ENABLE_GPU=true docker compose up --build -d

The server will be available at http://localhost:11235.

4. Stopping the Service

# Stop the service
docker compose down

Option 3: Manual Local Build & Run

Use this approach if you prefer not to use Docker Compose and want direct control over the build and run process.

1. Clone Repository & Setup Environment

Follow steps 1 and 2 from the Docker Compose section above (clone the repo, cd crawl4ai, create .llm.env in the project root).

2. Build the Image (Multi-Arch)

Use docker buildx to build the image. Crawl4AI now uses buildx to handle multi-architecture builds automatically.

# Make sure you are in the 'crawl4ai' root directory
# Build for the current architecture and load it into Docker
docker buildx build -t crawl4ai-local:latest --load .

# Or build for multiple architectures (useful for publishing)
# Note: multi-platform builds cannot be loaded locally with --load; push them to a registry instead
docker buildx build --platform linux/amd64,linux/arm64 -t yourname/crawl4ai:latest --push .

# Build with additional options
docker buildx build \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  -t crawl4ai-local:latest --load .

3. Run the Container

  • Basic run (no LLM support):

    docker run -d \
      -p 11235:11235 \
      --name crawl4ai-standalone \
      --shm-size=1g \
      crawl4ai-local:latest

  • With LLM support:

    # Make sure .llm.env is in the current directory (project root)
    docker run -d \
      -p 11235:11235 \
      --name crawl4ai-standalone \
      --env-file .llm.env \
      --shm-size=1g \
      crawl4ai-local:latest

The server will be available at http://localhost:11235.

4. Stopping the Manual Container

docker stop crawl4ai-standalone && docker rm crawl4ai-standalone

MCP (Model Context Protocol) Support

The Crawl4AI server includes support for the Model Context Protocol (MCP), allowing you to connect the server's capabilities directly to MCP-compatible clients like Claude Code.

What is MCP?

MCP is an open protocol that standardizes how applications provide context to LLMs. It allows AI models to access external tools, data sources, and services through a standardized interface.

Connecting via MCP

The Crawl4AI server exposes two MCP endpoints:

  • Server-Sent Events (SSE): http://localhost:11235/mcp/sse

  • WebSocket: ws://localhost:11235/mcp/ws
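
As a quick sanity check (not part of the official tooling), you can confirm the SSE endpoint responds before wiring up a client; curl streams events until the timeout expires:

# Stream SSE events for up to 5 seconds, then exit
curl -N --max-time 5 http://localhost:11235/mcp/sse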

Using with Claude Code

You can add Crawl4AI as an MCP tool provider in Claude Code with a simple command:

# Add the Crawl4AI server as an MCP provider
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse

# List all MCP providers to verify it was added
claude mcp list

Once connected, Claude Code can directly use Crawl4AI's capabilities like screenshot capture, PDF generation, and HTML processing without having to make separate API calls.

Available MCP Tools

When connected via MCP, the following tools are available:

  • md - Generate markdown from web content

  • html - Extract preprocessed HTML

  • screenshot - Capture webpage screenshots

  • pdf - Generate PDF documents

  • execute_js - Run JavaScript on web pages

  • crawl - Perform multi-URL crawling

  • ask - Query the Crawl4AI library context

Testing MCP Connections

You can test the MCP WebSocket connection using the test file included in the repository:

# From the repository root
python tests/mcp/test_mcp_socket.py

MCP Schemas

Access the MCP tool schemas at http://localhost:11235/mcp/schema for detailed information on each tool's parameters and capabilities.
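
For example, from the command line:

curl http://localhost:11235/mcp/schema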


Additional API Endpoints

In addition to the core /crawl and /crawl/stream endpoints, the server provides several specialized endpoints:

HTML Extraction Endpoint

POST /html

Crawls the URL and returns preprocessed HTML optimized for schema extraction.

{
  "url": "https://example.com"
}
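
For example, a minimal request with curl against a local server on the default port (the JSON body matches the payload above):

curl -X POST http://localhost:11235/html \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'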

Screenshot Endpoint

POST /screenshot

Captures a full-page PNG screenshot of the specified URL.

{
  "url": "https://example.com",
  "screenshot_wait_for": 2,
  "output_path": "/path/to/save/screenshot.png"
}

  • screenshot_wait_for: Optional delay in seconds before capture (default: 2)

  • output_path: Optional path to save the screenshot (recommended)

PDF Export Endpoint

POST /pdf

Generates a PDF document of the specified URL.

{
  "url": "https://example.com",
  "output_path": "/path/to/save/document.pdf"
}

  • output_path: Optional path to save the PDF (recommended)

JavaScript Execution Endpoint

POST /execute_js

Executes JavaScript snippets on the specified URL and returns the full crawl result.

{
  "url": "https://example.com",
  "scripts": [
    "return document.title",
    "return Array.from(document.querySelectorAll('a')).map(a => a.href)"
  ]
}

  • scripts: List of JavaScript snippets to execute sequentially
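
Sent with curl, the same request might look like this (single quotes keep the inner double quotes intact in the shell):

curl -X POST http://localhost:11235/execute_js \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "scripts": ["return document.title"]}'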


Dockerfile Parameters

You can customize the image build process using build arguments (--build-arg). These are typically used via docker buildx build or within the docker-compose.yml file.

# Example: Build with 'all' features using buildx
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --build-arg INSTALL_TYPE=all \
  -t yourname/crawl4ai-all:latest \
  --load \
  . # Build from root context

Build Arguments Explained

| Argument      | Description                              | Default          | Options                          |
|---------------|------------------------------------------|------------------|----------------------------------|
| INSTALL_TYPE  | Feature set                              | default          | default, all, torch, transformer |
| ENABLE_GPU    | GPU support (CUDA for AMD64)             | false            | true, false                      |
| APP_HOME      | Install path inside container (advanced) | /app             | any valid path                   |
| USE_LOCAL     | Install library from local source        | true             | true, false                      |
| GITHUB_REPO   | Git repo to clone if USE_LOCAL=false     | (see Dockerfile) | any git URL                      |
| GITHUB_BRANCH | Git branch to clone if USE_LOCAL=false   | main             | any branch name                  |

(Note: PYTHON_VERSION is fixed by the FROM instruction in the Dockerfile)

建立最佳实践

¥Build Best Practices

  1. 选择正确的安装类型default:基本安装,最小图像尺寸。适用于大多数标准网页抓取和 Markdown 生成。all :全部功能包括torchtransformers用于高级提取策略(例如,余弦策略、某些 LLM 滤波器)。图像明显更大。请确保您需要这些额外功能。

    ¥Choose the Right Install Type

    • default: Basic installation, smallest image size. Suitable for most standard web scraping and markdown generation.
    • all: Full features including torch and transformers for advanced extraction strategies (e.g., CosineStrategy, certain LLM filters). Significantly larger image. Ensure you need these extras.

  2. 平台考虑因素使用buildx用于构建多架构镜像,特别是推送到镜像仓库。使用docker compose配置文件(local-amd64local-arm64 ) 以便轻松进行特定于平台的本地构建。

    ¥Platform Considerations

    • Use buildx for building multi-architecture images, especially for pushing to registries.
    • Use docker compose profiles (local-amd64, local-arm64) for easy platform-specific local builds.

  3. 性能优化该图像自动包含特定于平台的优化(用于 AMD64 的 OpenMP、用于 ARM64 的 OpenBLAS)。

    ¥Performance Optimization

    • The image automatically includes platform-specific optimizations (OpenMP for AMD64, OpenBLAS for ARM64).
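
For example, a platform-specific local build using one of the compose profiles mentioned above might look like this (profile names as defined in the repository's docker-compose.yml):

# Build and run the AMD64-specific local profile
docker compose --profile local-amd64 up --build -d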


Using the API

Communicate with the running Docker server via its REST API (defaulting to http://localhost:11235). You can use the Python SDK or make direct HTTP requests.

Playground Interface

A built-in web playground is available at http://localhost:11235/playground for testing and generating API requests. The playground allows you to:

  1. Configure CrawlerRunConfig and BrowserConfig using the main library's Python syntax

  2. Test crawling operations directly from the interface

  3. Generate the corresponding JSON for REST API requests based on your configuration

This is the easiest way to translate Python configuration to JSON requests when building integrations.

Python SDK

Install the SDK: pip install crawl4ai

import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode # Assuming you have crawl4ai installed

async def main():
    # Point to the correct server port
    async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client:
        # If JWT is enabled on the server, authenticate first:
        # await client.authenticate("user@example.com") # See Server Configuration section

        # Example Non-streaming crawl
        print("--- Running Non-Streaming Crawl ---")
        results = await client.crawl(
            ["https://httpbin.org/html"],
            browser_config=BrowserConfig(headless=True), # Use library classes for config aid
            crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        if results: # client.crawl returns None on failure
          print(f"Non-streaming results success: {results.success}")
          if results.success:
              for result in results: # Iterate through the CrawlResultContainer
                  print(f"URL: {result.url}, Success: {result.success}")
        else:
            print("Non-streaming crawl failed.")


        # Example Streaming crawl
        print("\n--- Running Streaming Crawl ---")
        stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
        try:
            async for result in await client.crawl( # client.crawl returns an async generator for streaming
                ["https://httpbin.org/html", "https://httpbin.org/links/5/0"],
                browser_config=BrowserConfig(headless=True),
                crawler_config=stream_config
            ):
                print(f"Streamed result: URL: {result.url}, Success: {result.success}")
        except Exception as e:
            print(f"Streaming crawl failed: {e}")


        # Example Get schema
        print("\n--- Getting Schema ---")
        schema = await client.get_schema()
        print(f"Schema received: {bool(schema)}") # Print whether schema was received

if __name__ == "__main__":
    asyncio.run(main())

(SDK parameters like timeout, verify_ssl etc. remain the same)

Second Approach: Direct API Calls

Crucially, when sending configurations directly via JSON, they must follow the {"type": "ClassName", "params": {...}} structure for any non-primitive value (like config objects or strategies). Dictionaries must be wrapped as {"type": "dict", "value": {...}}.
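
For instance, a minimal browser_config payload that combines both rules, modeled on the streaming example later in this guide:

{
  "browser_config": {
    "type": "BrowserConfig",
    "params": {
      "headless": true,
      "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}
    }
  }
}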

(Keep the detailed explanation of Configuration Structure, Basic Pattern, Simple vs Complex, Strategy Pattern, Complex Nested Example, Quick Grammar Overview, Important Rules, Pro Tip)

More Examples (Ensure Schema example uses type/value wrapper)

Advanced Crawler Configuration (Keep example, ensure cache_mode uses a valid enum value like "bypass")

Extraction Strategy

{
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "JsonCssExtractionStrategy",
                "params": {
                    "schema": {
                        "type": "dict",
                        "value": {
                           "baseSelector": "article.post",
                           "fields": [
                               {"name": "title", "selector": "h1", "type": "text"},
                               {"name": "content", "selector": ".content", "type": "html"}
                           ]
                         }
                    }
                }
            }
        }
    }
}

LLM Extraction Strategy (Keep example, ensure schema uses type/value wrapper) (Keep Deep Crawler Example)
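
A hedged sketch of such a payload is shown below; the instruction and schema parameter names mirror the library's LLMExtractionStrategy and may differ between versions, and the schema content is purely illustrative:

{
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "LLMExtractionStrategy",
                "params": {
                    "instruction": "Extract the product name and price",
                    "schema": {
                        "type": "dict",
                        "value": {
                            "title": "Product",
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "price": {"type": "string"}
                            }
                        }
                    }
                }
            }
        }
    }
}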

REST API Examples

Update URLs to use port 11235.

Simple Crawl

import requests

# Configuration objects converted to the required JSON structure
browser_config_payload = {
    "type": "BrowserConfig",
    "params": {"headless": True}
}
crawler_config_payload = {
    "type": "CrawlerRunConfig",
    "params": {"stream": False, "cache_mode": "bypass"} # Use string value of enum
}

crawl_payload = {
    "urls": ["https://httpbin.org/html"],
    "browser_config": browser_config_payload,
    "crawler_config": crawler_config_payload
}
response = requests.post(
    "http://localhost:11235/crawl", # Updated port
    # headers={"Authorization": f"Bearer {token}"},  # If JWT is enabled
    json=crawl_payload
)
print(f"Status Code: {response.status_code}")
if response.ok:
    print(response.json())
else:
    print(f"Error: {response.text}")

Streaming Results

import json
import httpx # Use httpx for async streaming example

async def test_stream_crawl(token: str = None): # Made token optional
    """Test the /crawl/stream endpoint with multiple URLs."""
    url = "http://localhost:11235/crawl/stream" # Updated port
    payload = {
        "urls": [
            "https://httpbin.org/html",
            "https://httpbin.org/links/5/0",
        ],
        "browser_config": {
            "type": "BrowserConfig",
            "params": {"headless": True, "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}} # Viewport needs type:dict
        },
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {"stream": True, "cache_mode": "bypass"}
        }
    }

    headers = {}
    # if token:
    #    headers = {"Authorization": f"Bearer {token}"} # If JWT is enabled

    try:
        async with httpx.AsyncClient() as client:
            async with client.stream("POST", url, json=payload, headers=headers, timeout=120.0) as response:
                print(f"Status: {response.status_code} (Expected: 200)")
                response.raise_for_status() # Raise exception for bad status codes

                # Read streaming response line-by-line (NDJSON)
                async for line in response.aiter_lines():
                    if line:
                        try:
                            data = json.loads(line)
                            # Check for completion marker
                            if data.get("status") == "completed":
                                print("Stream completed.")
                                break
                            print(f"Streamed Result: {json.dumps(data, indent=2)}")
                        except json.JSONDecodeError:
                            print(f"Warning: Could not decode JSON line: {line}")

    except httpx.HTTPStatusError as e:
         print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}")
    except Exception as e:
        print(f"Error in streaming crawl test: {str(e)}")

# To run this example:
# import asyncio
# asyncio.run(test_stream_crawl())

Metrics & Monitoring

Keep an eye on your crawler with these endpoints:

  • /health - Quick health check

  • /metrics - Detailed Prometheus metrics

  • /schema - Full API schema

Example health check:

curl http://localhost:11235/health
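
The other endpoints can be queried the same way:

curl http://localhost:11235/metrics
curl http://localhost:11235/schema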


(Deployment Scenarios and Complete Examples sections remain the same, maybe update links if examples moved)


Server Configuration

The server's behavior can be customized through the config.yml file.

Understanding config.yml

The configuration file is loaded from /app/config.yml inside the container. By default, the file from deploy/docker/config.yml in the repository is copied there during the build.

Here's a detailed breakdown of the configuration options (using defaults from deploy/docker/config.yml):

# Application Configuration
app:
  title: "Crawl4AI API"
  version: "1.0.0" # Consider setting this to match library version, e.g., "0.5.1"
  host: "0.0.0.0"
  port: 8020 # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf).
  reload: False # Default set to False - suitable for production
  timeout_keep_alive: 300

# Default LLM Configuration
llm:
  provider: "openai/gpt-4o-mini"  # Can be overridden by LLM_PROVIDER env var
  api_key_env: "OPENAI_API_KEY"
  # api_key: sk-...  # If you pass the API key directly then api_key_env will be ignored

# Redis Configuration (Used by internal Redis server managed by supervisord)
redis:
  host: "localhost"
  port: 6379
  db: 0
  password: ""
  # ... other redis options ...

# Rate Limiting Configuration
rate_limiting:
  enabled: True
  default_limit: "1000/minute"
  trusted_proxies: []
  storage_uri: "memory://"  # Use "redis://localhost:6379" if you need persistent/shared limits

# Security Configuration
security:
  enabled: false # Master toggle for security features
  jwt_enabled: false # Enable JWT authentication (requires security.enabled=true)
  https_redirect: false # Force HTTPS (requires security.enabled=true)
  trusted_hosts: ["*"] # Allowed hosts (use specific domains in production)
  headers: # Security headers (applied if security.enabled=true)
    x_content_type_options: "nosniff"
    x_frame_options: "DENY"
    content_security_policy: "default-src 'self'"
    strict_transport_security: "max-age=63072000; includeSubDomains"

# Crawler Configuration
crawler:
  memory_threshold_percent: 95.0
  rate_limiter:
    base_delay: [1.0, 2.0] # Min/max delay between requests in seconds for dispatcher
  timeouts:
    stream_init: 30.0  # Timeout for stream initialization
    batch_process: 300.0 # Timeout for non-streaming /crawl processing

# Logging Configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# Observability Configuration
observability:
  prometheus:
    enabled: True
    endpoint: "/metrics"
  health_check:
    endpoint: "/health"

(JWT Authentication section remains the same, just note the default port is now 11235 for requests)

(Configuration Tips and Best Practices remain the same)

Customizing Your Configuration

You can override the default config.yml.

Method 1: Modify Before Build

  1. Edit the deploy/docker/config.yml file in your local repository clone.

  2. Build the image using docker buildx or docker compose --profile local-... up --build. The modified file will be copied into the image.

Method 2: Runtime Mount (Recommended for Custom Deploys)

  1. Create your custom configuration file, e.g., my-custom-config.yml, locally. Ensure it contains all necessary sections.

  2. Mount it when running the container:

    • Using docker run:

      # Assumes my-custom-config.yml is in the current directory
      docker run -d -p 11235:11235 \
        --name crawl4ai-custom-config \
        --env-file .llm.env \
        --shm-size=1g \
        -v $(pwd)/my-custom-config.yml:/app/config.yml \
        unclecode/crawl4ai:latest # Or your specific tag

    • Using docker-compose.yml: Add a volumes section to the service definition:

      services:
        crawl4ai-hub-amd64: # Or your chosen service
          image: unclecode/crawl4ai:latest
          profiles: ["hub-amd64"]
          <<: *base-config
          volumes:
            # Mount local custom config over the default one in the container
            - ./my-custom-config.yml:/app/config.yml
            # Keep the shared memory volume from base-config
            - /dev/shm:/dev/shm

      (Note: Ensure my-custom-config.yml is in the same directory as docker-compose.yml)

💡 When mounting, your custom file completely replaces the default one. Ensure it's a valid and complete configuration.

Configuration Recommendations

  1. Security First 🔒

    • Always enable security in production
    • Use specific trusted_hosts instead of wildcards
    • Set up proper rate limiting to protect your server
    • Consider your environment before enabling HTTPS redirect

  2. Resource Management 💻

    • Adjust memory_threshold_percent based on available RAM
    • Set timeouts according to your content size and network conditions
    • Use Redis for rate limiting in multi-container setups (see the sketch after this list)

  3. Monitoring 📊

    • Enable Prometheus if you need metrics
    • Set DEBUG logging in development, INFO in production
    • Regular health check monitoring is crucial

  4. Performance Tuning

    • Start with conservative rate limiter delays
    • Increase the batch_process timeout for large content
    • Adjust the stream_init timeout based on initial response times
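
Putting several of these recommendations together, a production-leaning config.yml might include overrides like the sketch below; the keys come from the default configuration shown earlier, while the specific values (host names, limits) are illustrative only:

security:
  enabled: true
  trusted_hosts: ["api.example.com"]  # specific hosts instead of "*"

rate_limiting:
  enabled: true
  default_limit: "500/minute"
  storage_uri: "redis://localhost:6379"  # shared/persistent limits across workers

logging:
  level: "INFO"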

Getting Help

We're here to help you succeed with Crawl4AI! Here's how to get support:

Summary

In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:

  • Building and running the Docker container
  • Configuring the environment
  • Using the interactive playground for testing
  • Making API requests with proper typing
  • Using the Python SDK
  • Leveraging specialized endpoints for screenshots, PDFs, and JavaScript execution
  • Connecting via the Model Context Protocol (MCP)
  • Monitoring your deployment

The new playground interface at http://localhost:11235/playground makes it much easier to test configurations and generate the corresponding JSON for API requests.

For AI application developers, the MCP integration allows tools like Claude Code to directly access Crawl4AI's capabilities without complex API handling.

Remember, the examples in the examples folder are your friends - they show real-world usage patterns that you can adapt for your needs.

Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀

Happy crawling! 🕷️

