Crawl4AI Docker Guide 🐳

Table of Contents

Prerequisites
Installation
Option 1: Using Pre-built Docker Hub Images (Recommended)
Option 2: Using Docker Compose
Option 3: Manual Local Build & Run
Dockerfile Parameters
Using the API
Playground Interface
Python SDK
Understanding Request Schema
REST API Examples
Additional API Endpoints
HTML Extraction Endpoint
Screenshot Endpoint
PDF Export Endpoint
JavaScript Execution Endpoint
Library Context Endpoint
MCP (Model Context Protocol) Support
What is MCP?
Connecting via MCP
Using with Claude Code
Available MCP Tools
Testing MCP Connections
MCP Schemas
Metrics & Monitoring
Deployment Scenarios
Complete Examples
Server Configuration
Understanding config.yml
JWT Authentication
Configuration Tips and Best Practices
Customizing Your Configuration
Configuration Recommendations
Getting Help
Summary

Prerequisites

Before we dive in, make sure you have:

- Docker installed and running (version 20.10.0 or higher), including docker compose (usually bundled with Docker Desktop).
- git for cloning the repository.
- At least 4GB of RAM available for the container (more recommended for heavy use).
- Python 3.10+ (if using the Python SDK).
- Node.js 16+ (if using the Node.js examples).

💡 Pro tip: Run docker info to check your Docker installation and available resources.
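
The version requirement above can also be checked from a script. The following is a minimal sketch and not part of Crawl4AI: the helper names are hypothetical, and it assumes that `docker version --format '{{.Server.Version}}'` prints a dotted version string such as 24.0.7.

```python
import re
import subprocess

MIN_DOCKER = (20, 10, 0)  # minimum version stated in the prerequisites


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the leading numeric components, e.g. '24.0.7-ce' -> (24, 0, 7)."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version_text: str, minimum: tuple[int, ...] = MIN_DOCKER) -> bool:
    """Compare the parsed version against the minimum, component by component."""
    return parse_version(version_text) >= minimum


def installed_docker_version() -> str:
    """Ask the local Docker daemon for its version (Docker must be running)."""
    result = subprocess.run(
        ["docker", "version", "--format", "{{.Server.Version}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Example (requires Docker): meets_minimum(installed_docker_version())
```

Keeping the comparison separate from the subprocess call makes the version logic easy to test without Docker installed.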

Installation

We offer several ways to get the Crawl4AI server running. The quickest way is to use our pre-built Docker Hub images.

Option 1: Using Pre-built Docker Hub Images (Recommended)

Pull and run images directly from Docker Hub without building locally.

1. Pull the Image

Our latest release is 0.7.3. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.

💡 Note: The latest tag points to the stable 0.7.3 version.

# Pull the latest version
docker pull unclecode/crawl4ai:0.7.3

# Or pull using the latest tag
docker pull unclecode/crawl4ai:latest

2. Setup Environment (API Keys)

If you plan to use LLMs, create a .llm.env file in your working directory:

# Create a .llm.env file with your API keys
cat > .llm.env << EOL
# OpenAI
OPENAI_API_KEY=sk-your-key

# Anthropic
ANTHROPIC_API_KEY=your-anthropic-key

# Other providers as needed
# DEEPSEEK_API_KEY=your-deepseek-key
# GROQ_API_KEY=your-groq-key
# TOGETHER_API_KEY=your-together-key
# MISTRAL_API_KEY=your-mistral-key
# GEMINI_API_TOKEN=your-gemini-token
EOL

🔑 Note: Keep your API keys secure! Never commit .llm.env to version control.
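
Before starting the container, it can help to confirm which keys in .llm.env actually carry a value. The sketch below is an illustrative helper, not part of Crawl4AI. It assumes plain KEY=VALUE lines like the ones shown above and does not handle quoting or the other rules Docker applies to --env-file.

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def configured_keys(text: str) -> list[str]:
    """Return the names of variables that have a non-empty value."""
    return sorted(key for key, value in parse_env_file(text).items() if value)

# Example: configured_keys(open(".llm.env", encoding="utf-8").read())
```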

3. Run the Container

Basic run:

docker run -d \
  -p 11235:11235 \
  --name crawl4ai \
  --shm-size=1g \
  unclecode/crawl4ai:latest

With LLM support:

# Make sure .llm.env is in the current directory
docker run -d \
  -p 11235:11235 \
  --name crawl4ai \
  --env-file .llm.env \
  --shm-size=1g \
  unclecode/crawl4ai:latest

The server will be available at http://localhost:11235. Visit /playground to access the interactive testing interface.
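
The container may need a few seconds before it accepts requests. One way to wait for it is to poll the /health endpoint (listed under Metrics & Monitoring). This is a minimal sketch: the function names are hypothetical, and it assumes that an HTTP 200 from /health means the server is ready.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:11235/health"


def is_healthy(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(check: Callable[[], bool] = is_healthy,
                     attempts: int = 30, delay: float = 1.0) -> bool:
    """Call check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False

# Example (requires a running container): wait_until_ready()
```

Passing the check as a callable keeps the retry loop independent of the network, so it can be exercised with a stub.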

4. Stopping the Container

docker stop crawl4ai && docker rm crawl4ai

Docker Hub Versioning Explained

- Image Name: unclecode/crawl4ai
- Tag Format: LIBRARY_VERSION[-SUFFIX] (e.g., 0.7.3)
  - LIBRARY_VERSION: The semantic version of the core crawl4ai Python library
  - SUFFIX: Optional tag for release candidates and revisions (e.g., r1)
- latest Tag: Points to the most recent stable version
- Multi-Architecture Support: All images support both linux/amd64 and linux/arm64 architectures through a single tag
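
The tag format above can be split mechanically. The sketch below is an illustration only. It assumes LIBRARY_VERSION is three dot-separated integers and that the suffix is alphanumeric; neither detail is spelled out in this guide, and the names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: MAJOR.MINOR.PATCH, optionally followed by "-SUFFIX"
TAG_PATTERN = re.compile(r"^(?P<version>\d+\.\d+\.\d+)(?:-(?P<suffix>[A-Za-z0-9.]+))?$")


class ImageTag(NamedTuple):
    library_version: str
    suffix: Optional[str]


def parse_tag(tag: str) -> Optional[ImageTag]:
    """Split 'LIBRARY_VERSION[-SUFFIX]'; return None for tags such as 'latest'."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        return None
    return ImageTag(match.group("version"), match.group("suffix"))
```

For example, `parse_tag("0.7.3-r1")` separates the library version from the revision suffix, while `parse_tag("latest")` yields None because it is a moving alias rather than a versioned tag.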

Option 2: Using Docker Compose

Docker Compose simplifies building and running the service, especially for local development and testing.

1. Clone Repository

git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

2. Environment Setup (API Keys)

If you plan to use LLMs, copy the example environment file and add your API keys. This file should be in the project root directory.

# Make sure you are in the 'crawl4ai' root directory
cp deploy/docker/.llm.env.example .llm.env

# Now edit .llm.env and add your API keys

Flexible LLM Provider Configuration:

The Docker setup now supports flexible LLM provider configuration through three methods:

Environment Variable (Highest Priority): Set LLM_PROVIDER to override the default

export LLM_PROVIDER="anthropic/claude-3-opus"
# Or in your .llm.env file:
# LLM_PROVIDER=anthropic/claude-3-opus

API Request Parameter: Specify provider per request

{
  "url": "https://example.com",
  "f": "llm",
  "provider": "groq/mixtral-8x7b"
}

Config File Default: Falls back to config.yml (default: openai/gpt-4o-mini)

The system automatically selects the appropriate API key based on the configured api_key_env in the config file.

3. Build and Run with Compose

The docker-compose.yml file in the project root provides a simplified approach that automatically handles architecture detection using buildx.

Run Pre-built Image from Docker Hub:

# Pulls and runs the image tagged latest from Docker Hub
# Automatically selects the correct architecture
IMAGE=unclecode/crawl4ai:latest docker compose up -d

Build and Run Locally:

# Builds the image locally using Dockerfile and runs it
# Automatically uses the correct architecture for your machine
docker compose up --build -d

Customize the Build:

# Build with all features (includes torch and transformers)
INSTALL_TYPE=all docker compose up --build -d

# Build with GPU support (for AMD64 platforms)
ENABLE_GPU=true docker compose up --build -d

The server will be available at http://localhost:11235.

4. Stopping the Service

# Stop the service
docker compose down

Option 3: Manual Local Build & Run

Use this option if you prefer not to use Docker Compose and want direct control over the build and run process.

1. Clone Repository & Setup Environment

Follow steps 1 and 2 from the Docker Compose section above (clone repo, cd crawl4ai, create .llm.env in the root).

2. Build the Image (Multi-Arch)

Use docker buildx to build the image. Crawl4AI now uses buildx to handle multi-architecture builds automatically.

# Make sure you are in the 'crawl4ai' root directory
# Build for the current architecture and load it into Docker
docker buildx build -t crawl4ai-local:latest --load .

# Or build for multiple architectures (useful for publishing)
# Note: loading a multi-platform result with --load may require Docker's containerd image store
docker buildx build --platform linux/amd64,linux/arm64 -t crawl4ai-local:latest --load .

# Build with additional options
docker buildx build \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  -t crawl4ai-local:latest --load .

3. Run the Container

Basic run (no LLM support):

docker run -d \
  -p 11235:11235 \
  --name crawl4ai-standalone \
  --shm-size=1g \
  crawl4ai-local:latest

With LLM support:

# Make sure .llm.env is in the current directory (project root)
docker run -d \
  -p 11235:11235 \
  --name crawl4ai-standalone \
  --env-file .llm.env \
  --shm-size=1g \
  crawl4ai-local:latest

The server will be available at http://localhost:11235.

4. Stopping the Manual Container

docker stop crawl4ai-standalone && docker rm crawl4ai-standalone

MCP (Model Context Protocol) Support

The Crawl4AI server includes support for the Model Context Protocol (MCP), allowing you to connect the server's capabilities directly to MCP-compatible clients like Claude Code.

What is MCP?

MCP is an open protocol that standardizes how applications provide context to LLMs. It allows AI models to access external tools, data sources, and services through a standardized interface.

Connecting via MCP

The Crawl4AI server exposes two MCP endpoints:

- Server-Sent Events (SSE): http://localhost:11235/mcp/sse
- WebSocket: ws://localhost:11235/mcp/ws

Using with Claude Code

You can add Crawl4AI as an MCP tool provider in Claude Code with a single command:

# Add the Crawl4AI server as an MCP provider
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse

# List all MCP providers to verify it was added
claude mcp list

Once connected, Claude Code can directly use Crawl4AI's capabilities like screenshot capture, PDF generation, and HTML processing without having to make separate API calls.

Available MCP Tools

When connected via MCP, the following tools are available:

- md - Generate markdown from web content
- html - Extract preprocessed HTML
- screenshot - Capture webpage screenshots
- pdf - Generate PDF documents
- execute_js - Run JavaScript on web pages
- crawl - Perform multi-URL crawling
- ask - Query the Crawl4AI library context

Testing MCP Connections

You can test the MCP WebSocket connection using the test file included in the repository:

# From the repository root
python tests/mcp/test_mcp_socket.py

MCP Schemas

Access the MCP tool schemas at http://localhost:11235/mcp/schema for detailed information on each tool's parameters and capabilities.

Additional API Endpoints

In addition to the core /crawl and /crawl/stream endpoints, the server provides several specialized endpoints:

HTML Extraction Endpoint

POST /html

Crawls the URL and returns preprocessed HTML optimized for schema extraction.

{
  "url": "https://example.com"
}

Screenshot Endpoint

POST /screenshot

Captures a full-page PNG screenshot of the specified URL.

{
  "url": "https://example.com",
  "screenshot_wait_for": 2,
  "output_path": "/path/to/save/screenshot.png"
}

- screenshot_wait_for: Optional delay in seconds before capture (default: 2)
- output_path: Optional path to save the screenshot (recommended)

PDF Export Endpoint

POST /pdf

Generates a PDF document of the specified URL.

{
  "url": "https://example.com",
  "output_path": "/path/to/save/document.pdf"
}

- output_path: Optional path to save the PDF (recommended)

JavaScript Execution Endpoint

POST /execute_js

Executes JavaScript snippets on the specified URL and returns the full crawl result.

{
  "url": "https://example.com",
  "scripts": [
    "return document.title",
    "return Array.from(document.querySelectorAll('a')).map(a => a.href)"
  ]
}

- scripts: List of JavaScript snippets to execute sequentially
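
The specialized endpoints above all accept a JSON body over POST, so one small helper can serve them all. The following sketch uses only the standard library. The helper names are hypothetical, and it assumes the server replies with a JSON body, which this guide does not state explicitly.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11235"


def build_post(endpoint: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST request for one of the endpoints."""
    return urllib.request.Request(
        url=f"{base_url}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the body as JSON (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_post("/execute_js", {
    "url": "https://example.com",
    "scripts": ["return document.title"],
})
# result = send(request)  # uncomment once the container is running
```

The same `build_post` call works for /html, /screenshot, and /pdf by swapping the endpoint and using the payloads shown in each section.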

Dockerfile Parameters

You can customize the image build process using build arguments (--build-arg). These are typically used via docker buildx build or within the docker-compose.yml file.

# Example: Build with 'all' features using buildx
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --build-arg INSTALL_TYPE=all \
  -t yourname/crawl4ai-all:latest \
  --load \
  . # Build from root context

Build Arguments Explained

| Argument | Description | Default | Options |
|---|---|---|---|
| INSTALL_TYPE | Feature set | default | default, all, torch, transformer |
| ENABLE_GPU | GPU support (CUDA for AMD64) | false | true, false |
| APP_HOME | Install path inside container (advanced) | /app | any valid path |
| USE_LOCAL | Install library from local source | true | true, false |
| GITHUB_REPO | Git repo to clone if USE_LOCAL=false | (see Dockerfile) | any git URL |
| GITHUB_BRANCH | Git branch to clone if USE_LOCAL=false | main | any branch name |

(Note: PYTHON_VERSION is fixed by the FROM instruction in the Dockerfile)

Build Best Practices

Choose the Right Install Type
- default: Basic installation, smallest image size. Suitable for most standard web scraping and markdown generation.
- all: Full features including torch and transformers for advanced extraction strategies (e.g., CosineStrategy, certain LLM filters). Significantly larger image. Ensure you need these extras.

Platform Considerations
- Use buildx for building multi-architecture images, especially for pushing to registries.
- Use docker compose profiles (local-amd64, local-arm64) for easy platform-specific local builds.

Performance Optimization
- The image automatically includes platform-specific optimizations (OpenMP for AMD64, OpenBLAS for ARM64).

Using the API

Communicate with the running Docker server via its REST API (by default at http://localhost:11235). You can use the Python SDK or make direct HTTP requests.

Playground Interface

A built-in web playground is available at http://localhost:11235/playground for testing and generating API requests. The playground allows you to:

- Configure CrawlerRunConfig and BrowserConfig using the main library's Python syntax
- Test crawling operations directly from the interface
- Generate corresponding JSON for REST API requests based on your configuration

This is the easiest way to translate Python configuration to JSON requests when building integrations.

Python SDK

Install the SDK: pip install crawl4ai

import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode # Assuming you have crawl4ai installed

async def main():
    # Point to the correct server port
    async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client:
        # If JWT is enabled on the server, authenticate first:
        # await client.authenticate("user@example.com") # See Server Configuration section

        # Example Non-streaming crawl
        print("--- Running Non-Streaming Crawl ---")
        results = await client.crawl(
            ["https://httpbin.org/html"],
            browser_config=BrowserConfig(headless=True), # Use library classes for config aid
            crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        if results: # client.crawl returns None on failure
            print(f"Non-streaming results success: {results.success}")
            if results.success:
                for result in results: # Iterate through the CrawlResultContainer
                    print(f"URL: {result.url}, Success: {result.success}")
        else:
            print("Non-streaming crawl failed.")


        # Example Streaming crawl
        print("\n--- Running Streaming Crawl ---")
        stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
        try:
            async for result in await client.crawl( # client.crawl returns an async generator for streaming
                ["https://httpbin.org/html", "https://httpbin.org/links/5/0"],
                browser_config=BrowserConfig(headless=True),
                crawler_config=stream_config
            ):
                print(f"Streamed result: URL: {result.url}, Success: {result.success}")
        except Exception as e:
            print(f"Streaming crawl failed: {e}")


        # Example Get schema
        print("\n--- Getting Schema ---")
        schema = await client.get_schema()
        print(f"Schema received: {bool(schema)}") # Print whether schema was received

if __name__ == "__main__":
    asyncio.run(main())

(SDK parameters like timeout, verify_ssl etc. remain the same)

Second Approach: Direct API Calls

Crucially, when sending configurations directly via JSON, they must follow the {"type": "ClassName", "params": {...}} structure for any non-primitive value (like config objects or strategies). Dictionaries must be wrapped as {"type": "dict", "value": {...}}.
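
The wrapping rule can be applied with a small helper instead of by hand. This is a minimal sketch with hypothetical function names. It wraps only top-level parameters and plain dictionaries, so lists of nested objects or strategies would need additional handling.

```python
from typing import Any


def _is_wrapped(value: dict) -> bool:
    """True for values already in {"type", "params"} or {"type", "value"} form."""
    return "type" in value and ("params" in value or "value" in value)


def wrap_value(value: Any) -> Any:
    """Wrap plain dictionaries as {"type": "dict", "value": {...}}; leave the rest untouched."""
    if isinstance(value, dict) and not _is_wrapped(value):
        return {"type": "dict", "value": value}
    return value


def wrap_config(class_name: str, params: dict[str, Any]) -> dict[str, Any]:
    """Wrap a config object as {"type": "ClassName", "params": {...}}."""
    return {"type": class_name, "params": {key: wrap_value(val) for key, val in params.items()}}


browser_config = wrap_config(
    "BrowserConfig",
    {"headless": True, "viewport": {"width": 1200, "height": 800}},
)
```

Here `browser_config` comes out in the same shape as the browser_config payload used in the streaming example in this guide, with the viewport dictionary wrapped as type dict.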

(Keep the detailed explanation of Configuration Structure, Basic Pattern, Simple vs Complex, Strategy Pattern, Complex Nested Example, Quick Grammar Overview, Important Rules, Pro Tip)

REST API Examples

Update URLs to use port 11235.

Simple Crawl

import requests

# Configuration objects converted to the required JSON structure
browser_config_payload = {
    "type": "BrowserConfig",
    "params": {"headless": True}
}
crawler_config_payload = {
    "type": "CrawlerRunConfig",
    "params": {"stream": False, "cache_mode": "bypass"} # Use string value of enum
}

crawl_payload = {
    "urls": ["https://httpbin.org/html"],
    "browser_config": browser_config_payload,
    "crawler_config": crawler_config_payload
}
response = requests.post(
    "http://localhost:11235/crawl", # Updated port
    # headers={"Authorization": f"Bearer {token}"},  # If JWT is enabled
    json=crawl_payload
)
print(f"Status Code: {response.status_code}")
if response.ok:
    print(response.json())
else:
    print(f"Error: {response.text}")

Streaming Results

import json
import httpx # Use httpx for async streaming example

async def test_stream_crawl(token: str = None): # Made token optional
    """Test the /crawl/stream endpoint with multiple URLs."""
    url = "http://localhost:11235/crawl/stream" # Updated port
    payload = {
        "urls": [
            "https://httpbin.org/html",
            "https://httpbin.org/links/5/0",
        ],
        "browser_config": {
            "type": "BrowserConfig",
            "params": {"headless": True, "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}} # Viewport needs type:dict
        },
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {"stream": True, "cache_mode": "bypass"}
        }
    }

    headers = {}
    # if token:
    #    headers = {"Authorization": f"Bearer {token}"} # If JWT is enabled

    try:
        async with httpx.AsyncClient() as client:
            async with client.stream("POST", url, json=payload, headers=headers, timeout=120.0) as response:
                print(f"Status: {response.status_code} (Expected: 200)")
                response.raise_for_status() # Raise exception for bad status codes

                # Read streaming response line-by-line (NDJSON)
                async for line in response.aiter_lines():
                    if line:
                        try:
                            data = json.loads(line)
                            # Check for completion marker
                            if data.get("status") == "completed":
                                print("Stream completed.")
                                break
                            print(f"Streamed Result: {json.dumps(data, indent=2)}")
                        except json.JSONDecodeError:
                            print(f"Warning: Could not decode JSON line: {line}")

    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}")
    except Exception as e:
        print(f"Error in streaming crawl test: {str(e)}")

# To run this example:
# import asyncio
# asyncio.run(test_stream_crawl())

Metrics & Monitoring

Keep an eye on your crawler with these endpoints:

- /health - Quick health check
- /metrics - Detailed Prometheus metrics
- /schema - Full API schema

Example health check:

curl http://localhost:11235/health
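
The /metrics output can be read from a script as well. The sketch below assumes the endpoint returns the Prometheus text exposition format and that sample lines carry no trailing timestamp. The parser is deliberately simplified, and the metric names in the sample are made up for illustration.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse 'name value' sample lines; skip '# HELP' and '# TYPE' comment lines."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; labels stay part of the name.
        name, _, value = line.rpartition(" ")
        if not name:
            continue  # no space on the line: not a 'name value' sample
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # skip lines whose last field is not numeric
    return samples


# Hypothetical sample text, not actual Crawl4AI metric names
sample = (
    "# HELP demo_requests_total Hypothetical counter\n"
    "# TYPE demo_requests_total counter\n"
    "demo_requests_total 42\n"
    'demo_latency_seconds{quantile="0.5"} 0.25\n'
)
metrics = parse_metrics(sample)
```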

(Deployment Scenarios and Complete Examples sections remain the same, maybe update links if examples moved)

Server Configuration

The server's behavior can be customized through the config.yml file.

Understanding config.yml

The configuration file is loaded from /app/config.yml inside the container. By default, the file from deploy/docker/config.yml in the repository is copied there during the build.

Here's a detailed breakdown of the configuration options (using defaults from deploy/docker/config.yml):

# Application Configuration
app:
  title: "Crawl4AI API"
  version: "1.0.0" # Consider setting this to match library version, e.g., "0.5.1"
  host: "0.0.0.0"
  port: 8020 # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf).
  reload: False # Default set to False - suitable for production
  timeout_keep_alive: 300

# Default LLM Configuration
llm:
  provider: "openai/gpt-4o-mini"  # Can be overridden by LLM_PROVIDER env var
  api_key_env: "OPENAI_API_KEY"
  # api_key: sk-...  # If you pass the API key directly then api_key_env will be ignored

# Redis Configuration (Used by internal Redis server managed by supervisord)
redis:
  host: "localhost"
  port: 6379
  db: 0
  password: ""
  # ... other redis options ...

# Rate Limiting Configuration
rate_limiting:
  enabled: True
  default_limit: "1000/minute"
  trusted_proxies: []
  storage_uri: "memory://"  # Use "redis://localhost:6379" if you need persistent/shared limits

# Security Configuration
security:
  enabled: false # Master toggle for security features
  jwt_enabled: false # Enable JWT authentication (requires security.enabled=true)
  https_redirect: false # Force HTTPS (requires security.enabled=true)
  trusted_hosts: ["*"] # Allowed hosts (use specific domains in production)
  headers: # Security headers (applied if security.enabled=true)
    x_content_type_options: "nosniff"
    x_frame_options: "DENY"
    content_security_policy: "default-src 'self'"
    strict_transport_security: "max-age=63072000; includeSubDomains"

# Crawler Configuration
crawler:
  memory_threshold_percent: 95.0
  rate_limiter:
    base_delay: [1.0, 2.0] # Min/max delay between requests in seconds for dispatcher
  timeouts:
    stream_init: 30.0  # Timeout for stream initialization
    batch_process: 300.0 # Timeout for non-streaming /crawl processing

# Logging Configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# Observability Configuration
observability:
  prometheus:
    enabled: True
    endpoint: "/metrics"
  health_check:
    endpoint: "/health"
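
The comments in the security section describe dependencies between options: jwt_enabled and https_redirect only take effect when security.enabled is true. A small check can catch mismatches before deployment. This is a sketch with a hypothetical function name. It operates on an already-loaded dictionary, for example one produced by a YAML parser such as PyYAML's yaml.safe_load, which is an assumption and not something this guide requires.

```python
def security_warnings(config: dict) -> list[str]:
    """Report security options that contradict the notes in config.yml."""
    security = config.get("security", {})
    enabled = bool(security.get("enabled", False))
    warnings: list[str] = []
    # Both options are documented as requiring security.enabled=true
    for option in ("jwt_enabled", "https_redirect"):
        if security.get(option, False) and not enabled:
            warnings.append(f"security.{option} is set but security.enabled is false")
    # The default file suggests specific domains instead of "*" in production
    if enabled and "*" in security.get("trusted_hosts", []):
        warnings.append("security.trusted_hosts contains '*'; list specific domains in production")
    return warnings
```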

(JWT Authentication section remains the same, just note the default port is now 11235 for requests)

(Configuration Tips and Best Practices remain the same)

Customizing Your Configuration

You can override the default config.yml.

Method 1: Modify Before Build

1. Edit the deploy/docker/config.yml file in your local repository clone.
2. Build the image using docker buildx or docker compose --profile local-... up --build. The modified file will be copied into the image.

Method 2: Runtime Mount (Recommended for Custom Deploys)

1. Create your custom configuration file, e.g., my-custom-config.yml locally. Ensure it contains all necessary sections.
2. Mount it when running the container:

Using docker run:

# Assumes my-custom-config.yml is in the current directory
docker run -d -p 11235:11235 \
  --name crawl4ai-custom-config \
  --env-file .llm.env \
  --shm-size=1g \
  -v "$(pwd)/my-custom-config.yml:/app/config.yml" \
  unclecode/crawl4ai:latest # Or your specific tag

Using docker-compose.yml: Add a volumes section to the service definition:

services:
  crawl4ai-hub-amd64: # Or your chosen service
    image: unclecode/crawl4ai:latest
    profiles: ["hub-amd64"]
    <<: *base-config
    volumes:
      # Mount local custom config over the default one in the container
      - ./my-custom-config.yml:/app/config.yml
      # Keep the shared memory volume from base-config
      - /dev/shm:/dev/shm

(Note: Ensure my-custom-config.yml is in the same directory as docker-compose.yml)

💡 When mounting, your custom file completely replaces the default one. Ensure it's a valid and complete configuration.
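
Because a mounted file replaces the default entirely, it is worth checking that no top-level section was left out. The sketch below mirrors the sections of the default config.yml shown earlier in this guide. Whether the server strictly requires every one of them is an assumption, and the names are hypothetical.

```python
# Top-level sections that appear in the default deploy/docker/config.yml
REQUIRED_SECTIONS = (
    "app", "llm", "redis", "rate_limiting",
    "security", "crawler", "logging", "observability",
)


def missing_sections(config: dict, required: tuple[str, ...] = REQUIRED_SECTIONS) -> list[str]:
    """Return the expected top-level sections that the custom config lacks."""
    return [name for name in required if name not in config]
```

An empty list means the custom file has the same top-level layout as the default. It does not validate the values inside each section.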

Configuration Recommendations

Security First 🔒
- Always enable security in production
- Use specific trusted_hosts instead of wildcards
- Set up proper rate limiting to protect your server
- Consider your environment before enabling HTTPS redirect

Resource Management 💻
- Adjust memory_threshold_percent based on available RAM
- Set timeouts according to your content size and network conditions
- Use Redis for rate limiting in multi-container setups

Monitoring 📊
- Enable Prometheus if you need metrics
- Set DEBUG logging in development, INFO in production
- Regular health check monitoring is crucial

Performance Tuning ⚡
- Start with conservative rate limiter delays
- Increase batch_process timeout for large content
- Adjust stream_init timeout based on initial response times

Getting Help

We're here to help you succeed with Crawl4AI! Here's how to get support:

- 📖 Check our full documentation
- 🐛 Found a bug? Open an issue
- 💬 Join our Discord community
- ⭐ Star us on GitHub to show support!

Summary

In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:

- Building and running the Docker container
- Configuring the environment
- Using the interactive playground for testing
- Making API requests with proper typing
- Using the Python SDK
- Leveraging specialized endpoints for screenshots, PDFs, and JavaScript execution
- Connecting via the Model Context Protocol (MCP)
- Monitoring your deployment

The new playground interface at http://localhost:11235/playground makes it much easier to test configurations and generate the corresponding JSON for API requests.

For AI application developers, the MCP integration allows tools like Claude Code to directly access Crawl4AI's capabilities without complex API handling.

Remember, the examples in the examples folder are your friends: they show real-world usage patterns that you can adapt for your needs.

Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀

Happy crawling! 🕷️
