Crawl4AI Docker Guide 🐳

Table of Contents

Prerequisites
Before we dive in, make sure you have:

- Docker installed and running (version 20.10.0 or higher), including docker compose (usually bundled with Docker Desktop).
- git for cloning the repository.
- At least 4GB of RAM available for the container (more recommended for heavy use).
- Python 3.10+ (if using the Python SDK).
- Node.js 16+ (if using the Node.js examples).

💡 Pro tip: Run docker info to check your Docker installation and available resources.
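For example:

```bash
# Check the Docker installation and available resources
docker info

# Confirm the Compose plugin is present
docker compose version
```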
Installation

We offer several ways to get the Crawl4AI server running. The quickest way is to use our pre-built Docker Hub images.

Option 1: Using Pre-built Docker Hub Images (Recommended)

Pull and run images directly from Docker Hub without building locally.

1. Pull the Image

Our latest release is 0.7.3. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.

💡 Note: The latest tag points to the stable 0.7.3 version.
```bash
# Pull the latest version
docker pull unclecode/crawl4ai:0.7.3

# Or pull using the latest tag
docker pull unclecode/crawl4ai:latest
```
2. Setup Environment (API Keys)

If you plan to use LLMs, create a .llm.env file in your working directory:
```bash
# Create a .llm.env file with your API keys
cat > .llm.env << EOL
# OpenAI
OPENAI_API_KEY=sk-your-key

# Anthropic
ANTHROPIC_API_KEY=your-anthropic-key

# Other providers as needed
# DEEPSEEK_API_KEY=your-deepseek-key
# GROQ_API_KEY=your-groq-key
# TOGETHER_API_KEY=your-together-key
# MISTRAL_API_KEY=your-mistral-key
# GEMINI_API_TOKEN=your-gemini-token
EOL
```
🔑 Note: Keep your API keys secure! Never commit .llm.env to version control.
3. Run the Container
- Basic run:

```bash
docker run -d \
  -p 11235:11235 \
  --name crawl4ai \
  --shm-size=1g \
  unclecode/crawl4ai:latest
```

- With LLM support:

```bash
# Make sure .llm.env is in the current directory
docker run -d \
  -p 11235:11235 \
  --name crawl4ai \
  --env-file .llm.env \
  --shm-size=1g \
  unclecode/crawl4ai:latest
```
The server will be available at http://localhost:11235. Visit /playground to access the interactive testing interface.
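To confirm the container started successfully before opening the playground:

```bash
# The container should appear with status "Up"
docker ps --filter name=crawl4ai
```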
4. Stopping the Container
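The command block for this step isn't reproduced in this extract; the standard Docker commands for the container started above are:

```bash
docker stop crawl4ai
docker rm crawl4ai
```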
Docker Hub Versioning Explained

- Image Name: unclecode/crawl4ai
- Tag Format: LIBRARY_VERSION[-SUFFIX] (e.g., 0.7.3)
  - LIBRARY_VERSION: The semantic version of the core crawl4ai Python library
  - SUFFIX: Optional tag for release candidates and revisions (e.g., r1)
- latest Tag: Points to the most recent stable version
- Multi-Architecture Support: All images support both linux/amd64 and linux/arm64 architectures through a single tag
Option 2: Using Docker Compose

Docker Compose simplifies building and running the service, especially for local development and testing.
1. Clone Repository
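The clone commands aren't reproduced in this extract; assuming the repository is github.com/unclecode/crawl4ai (matching the image name used above), the step looks like:

```bash
# Repository URL assumed from the Docker Hub image name
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
```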
2. Environment Setup (API Keys)

If you plan to use LLMs, copy the example environment file and add your API keys. This file should be in the project root directory.
```bash
# Make sure you are in the 'crawl4ai' root directory
cp deploy/docker/.llm.env.example .llm.env

# Now edit .llm.env and add your API keys
```
Flexible LLM Provider Configuration:

The Docker setup now supports flexible LLM provider configuration through three methods:

- Environment Variable (Highest Priority): Set LLM_PROVIDER to override the default

```bash
export LLM_PROVIDER="anthropic/claude-3-opus"
# Or in your .llm.env file:
# LLM_PROVIDER=anthropic/claude-3-opus
```

- API Request Parameter: Specify the provider per request

```json
{
  "url": "https://example.com",
  "f": "llm",
  "provider": "groq/mixtral-8x7b"
}
```

- Config File Default: Falls back to config.yml (default: openai/gpt-4o-mini)

The system automatically selects the appropriate API key based on the api_key_env configured in the config file.
3. Build and Run with Compose

The docker-compose.yml file in the project root provides a simplified approach that automatically handles architecture detection using buildx.
- Run Pre-built Image from Docker Hub:

```bash
# Pulls and runs the image from Docker Hub
# Automatically selects the correct architecture
IMAGE=unclecode/crawl4ai:latest docker compose up -d
```

- Build and Run Locally:

```bash
# Builds the image locally using the Dockerfile and runs it
# Automatically uses the correct architecture for your machine
docker compose up --build -d
```

- Customize the Build:

```bash
# Build with all features (includes torch and transformers)
INSTALL_TYPE=all docker compose up --build -d

# Build with GPU support (for AMD64 platforms)
ENABLE_GPU=true docker compose up --build -d
```

The server will be available at http://localhost:11235.
4. Stopping the Service
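The command isn't shown in this extract; with Compose, the service is normally stopped (and its containers removed) with:

```bash
docker compose down
```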
Option 3: Manual Local Build & Run

If you prefer not to use Docker Compose, you can build and run the image manually for direct control over the process.

1. Clone Repository & Setup Environment

Follow steps 1 and 2 from the Docker Compose section above (clone the repo, cd crawl4ai, create .llm.env in the root).
2. Build the Image (Multi-Arch)

Use docker buildx to build the image. Crawl4AI now uses buildx to handle multi-architecture builds automatically.
```bash
# Make sure you are in the 'crawl4ai' root directory

# Build for the current architecture and load it into Docker
docker buildx build -t crawl4ai-local:latest --load .

# Or build for multiple architectures (useful for publishing)
docker buildx build --platform linux/amd64,linux/arm64 -t crawl4ai-local:latest --load .

# Build with additional options
docker buildx build \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  -t crawl4ai-local:latest --load .
```
3. Run the Container

- Basic run (no LLM support):

```bash
docker run -d \
  -p 11235:11235 \
  --name crawl4ai-standalone \
  --shm-size=1g \
  crawl4ai-local:latest
```

- With LLM support:

```bash
# Make sure .llm.env is in the current directory (project root)
docker run -d \
  -p 11235:11235 \
  --name crawl4ai-standalone \
  --env-file .llm.env \
  --shm-size=1g \
  crawl4ai-local:latest
```

The server will be available at http://localhost:11235.
4. Stopping the Manual Container
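As with the Docker Hub container, the usual commands for the manually started container are:

```bash
docker stop crawl4ai-standalone
docker rm crawl4ai-standalone
```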
MCP (Model Context Protocol) Support

The Crawl4AI server includes support for the Model Context Protocol (MCP), allowing you to connect the server's capabilities directly to MCP-compatible clients like Claude Code.

What is MCP?

MCP is an open protocol that standardizes how applications provide context to LLMs. It allows AI models to access external tools, data sources, and services through a standardized interface.
Connecting via MCP

The Crawl4AI server exposes two MCP endpoints:

- Server-Sent Events (SSE): http://localhost:11235/mcp/sse
- WebSocket: ws://localhost:11235/mcp/ws
Using with Claude Code

You can add Crawl4AI as an MCP tool provider in Claude Code with a simple command:
```bash
# Add the Crawl4AI server as an MCP provider
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse

# List all MCP providers to verify it was added
claude mcp list
```
Once connected, Claude Code can directly use Crawl4AI's capabilities like screenshot capture, PDF generation, and HTML processing without having to make separate API calls.
Available MCP Tools

When connected via MCP, the following tools are available:

- md - Generate markdown from web content
- html - Extract preprocessed HTML
- screenshot - Capture webpage screenshots
- pdf - Generate PDF documents
- execute_js - Run JavaScript on web pages
- crawl - Perform multi-URL crawling
- ask - Query the Crawl4AI library context
Testing MCP Connections

You can test the MCP WebSocket connection using the test file included in the repository:
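The test file's path isn't reproduced in this extract. As a rough substitute, you can probe the endpoints directly (generic checks, not the repository's test script):

```bash
# Generic connectivity checks, not the repository's test file
# Stream the SSE endpoint
curl -N http://localhost:11235/mcp/sse

# Or open the WebSocket endpoint interactively (requires Node.js)
npx wscat -c ws://localhost:11235/mcp/ws
```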
MCP Schemas

Access the MCP tool schemas at http://localhost:11235/mcp/schema for detailed information on each tool's parameters and capabilities.
Additional API Endpoints

In addition to the core /crawl and /crawl/stream endpoints, the server provides several specialized endpoints:
HTML Extraction Endpoint

Crawls the URL and returns preprocessed HTML optimized for schema extraction.
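The request snippet isn't included in this extract; assuming the endpoint is exposed at /html (mirroring the MCP tool name) and accepts a JSON payload with the target URL, a request could look like:

```bash
# Hypothetical call; the /html path is assumed from the MCP tool name
curl -X POST http://localhost:11235/html \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```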
Screenshot Endpoint

Captures a full-page PNG screenshot of the specified URL.

```json
{
  "url": "https://example.com",
  "screenshot_wait_for": 2,
  "output_path": "/path/to/save/screenshot.png"
}
```
- screenshot_wait_for: Optional delay in seconds before capture (default: 2)
- output_path: Optional path to save the screenshot (recommended)
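The exact call isn't shown above; assuming a /screenshot path mirroring the MCP tool name, the JSON body would be posted like this:

```bash
# Hypothetical call; the /screenshot path is assumed from the MCP tool name
curl -X POST http://localhost:11235/screenshot \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "screenshot_wait_for": 2, "output_path": "/path/to/save/screenshot.png"}'
```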
PDF Export Endpoint

Generates a PDF document of the specified URL.

- output_path: Optional path to save the PDF (recommended)
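Again assuming the path mirrors the MCP tool name (/pdf), a minimal request could be:

```bash
# Hypothetical call; the /pdf path is assumed from the MCP tool name
curl -X POST http://localhost:11235/pdf \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "output_path": "/path/to/save/page.pdf"}'
```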
JavaScript Execution Endpoint

Executes JavaScript snippets on the specified URL and returns the full crawl result.

```json
{
  "url": "https://example.com",
  "scripts": [
    "return document.title",
    "return Array.from(document.querySelectorAll('a')).map(a => a.href)"
  ]
}
```
- scripts: List of JavaScript snippets to execute sequentially
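Assuming the endpoint is exposed at /execute_js (matching the MCP tool name), the body above could be sent as:

```bash
# Hypothetical call; the /execute_js path is assumed from the MCP tool name
curl -X POST http://localhost:11235/execute_js \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "scripts": ["return document.title"]}'
```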
Dockerfile Parameters

You can customize the image build process using build arguments (--build-arg). These are typically used via docker buildx build or within the docker-compose.yml file.
```bash
# Example: Build with 'all' features using buildx
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --build-arg INSTALL_TYPE=all \
  -t yourname/crawl4ai-all:latest \
  --load \
  .  # Build from root context
```
Build Arguments Explained

| Argument | Description | Default | Options |
|---|---|---|---|
| INSTALL_TYPE | Feature set | default | default, all, torch, transformer |
| ENABLE_GPU | GPU support (CUDA for AMD64) | false | true, false |
| APP_HOME | Install path inside container (advanced) | /app | any valid path |
| USE_LOCAL | Install library from local source | true | true, false |
| GITHUB_REPO | Git repo to clone if USE_LOCAL=false | (see Dockerfile) | any git URL |
| GITHUB_BRANCH | Git branch to clone if USE_LOCAL=false | main | any branch name |

(Note: PYTHON_VERSION is fixed by the FROM instruction in the Dockerfile.)
Build Best Practices

- Choose the Right Install Type
  - default: Basic installation, smallest image size. Suitable for most standard web scraping and markdown generation.
  - all: Full features including torch and transformers for advanced extraction strategies (e.g., CosineStrategy, certain LLM filters). Significantly larger image. Ensure you need these extras.
- Platform Considerations
  - Use buildx for building multi-architecture images, especially for pushing to registries.
  - Use docker compose profiles (local-amd64, local-arm64) for easy platform-specific local builds.
- Performance Optimization
  - The image automatically includes platform-specific optimizations (OpenMP for AMD64, OpenBLAS for ARM64).
Using the API

Communicate with the running Docker server via its REST API (defaulting to http://localhost:11235). You can use the Python SDK or make direct HTTP requests.
Playground Interface

A built-in web playground is available at http://localhost:11235/playground for testing and generating API requests. The playground allows you to:

- Configure CrawlerRunConfig and BrowserConfig using the main library's Python syntax
- Test crawling operations directly from the interface
- Generate the corresponding JSON for REST API requests based on your configuration

This is the easiest way to translate a Python configuration into a JSON request when building integrations.
Python SDK

Install the SDK: pip install crawl4ai
```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode  # Assuming you have crawl4ai installed

async def main():
    # Point to the correct server port
    async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client:
        # If JWT is enabled on the server, authenticate first:
        # await client.authenticate("user@example.com")  # See Server Configuration section

        # Example Non-streaming crawl
        print("--- Running Non-Streaming Crawl ---")
        results = await client.crawl(
            ["https://httpbin.org/html"],
            browser_config=BrowserConfig(headless=True),  # Use library classes for config aid
            crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        if results:  # client.crawl returns None on failure
            print(f"Non-streaming results success: {results.success}")
            if results.success:
                for result in results:  # Iterate through the CrawlResultContainer
                    print(f"URL: {result.url}, Success: {result.success}")
        else:
            print("Non-streaming crawl failed.")

        # Example Streaming crawl
        print("\n--- Running Streaming Crawl ---")
        stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
        try:
            async for result in await client.crawl(  # client.crawl returns an async generator for streaming
                ["https://httpbin.org/html", "https://httpbin.org/links/5/0"],
                browser_config=BrowserConfig(headless=True),
                crawler_config=stream_config
            ):
                print(f"Streamed result: URL: {result.url}, Success: {result.success}")
        except Exception as e:
            print(f"Streaming crawl failed: {e}")

        # Example Get schema
        print("\n--- Getting Schema ---")
        schema = await client.get_schema()
        print(f"Schema received: {bool(schema)}")  # Print whether schema was received

if __name__ == "__main__":
    asyncio.run(main())
```
(SDK parameters like timeout, verify_ssl, etc. remain the same.)
Second Approach: Direct API Calls

Crucially, when sending configurations directly via JSON, they must follow the {"type": "ClassName", "params": {...}} structure for any non-primitive value (like config objects or strategies). Dictionaries must be wrapped as {"type": "dict", "value": {...}}.
(Keep the detailed explanation of Configuration Structure, Basic Pattern, Simple vs Complex, Strategy Pattern, Complex Nested Example, Quick Grammar Overview, Important Rules, Pro Tip.)

More Examples (ensure the Schema example uses the type/value wrapper)

Advanced Crawler Configuration (keep the example; ensure cache_mode uses a valid enum value like "bypass")
Extraction Strategy
```json
{
  "crawler_config": {
    "type": "CrawlerRunConfig",
    "params": {
      "extraction_strategy": {
        "type": "JsonCssExtractionStrategy",
        "params": {
          "schema": {
            "type": "dict",
            "value": {
              "baseSelector": "article.post",
              "fields": [
                {"name": "title", "selector": "h1", "type": "text"},
                {"name": "content", "selector": ".content", "type": "html"}
              ]
            }
          }
        }
      }
    }
  }
}
```
LLM Extraction Strategy (keep the example; ensure the schema uses the type/value wrapper) (keep the Deep Crawler example)
REST API Examples

Update URLs to use port 11235.
Simple Crawl
```python
import requests

# Configuration objects converted to the required JSON structure
browser_config_payload = {
    "type": "BrowserConfig",
    "params": {"headless": True}
}
crawler_config_payload = {
    "type": "CrawlerRunConfig",
    "params": {"stream": False, "cache_mode": "bypass"}  # Use string value of enum
}

crawl_payload = {
    "urls": ["https://httpbin.org/html"],
    "browser_config": browser_config_payload,
    "crawler_config": crawler_config_payload
}

response = requests.post(
    "http://localhost:11235/crawl",  # Updated port
    # headers={"Authorization": f"Bearer {token}"},  # If JWT is enabled
    json=crawl_payload
)
print(f"Status Code: {response.status_code}")
if response.ok:
    print(response.json())
else:
    print(f"Error: {response.text}")
```
Streaming Results
```python
import json
import httpx  # Use httpx for async streaming example

async def test_stream_crawl(token: str = None):  # Made token optional
    """Test the /crawl/stream endpoint with multiple URLs."""
    url = "http://localhost:11235/crawl/stream"  # Updated port
    payload = {
        "urls": [
            "https://httpbin.org/html",
            "https://httpbin.org/links/5/0",
        ],
        "browser_config": {
            "type": "BrowserConfig",
            "params": {"headless": True, "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}}  # Viewport needs type:dict
        },
        "crawler_config": {
            "type": "CrawlerRunConfig",
            "params": {"stream": True, "cache_mode": "bypass"}
        }
    }

    headers = {}
    # if token:
    #     headers = {"Authorization": f"Bearer {token}"}  # If JWT is enabled

    try:
        async with httpx.AsyncClient() as client:
            async with client.stream("POST", url, json=payload, headers=headers, timeout=120.0) as response:
                print(f"Status: {response.status_code} (Expected: 200)")
                response.raise_for_status()  # Raise exception for bad status codes

                # Read streaming response line-by-line (NDJSON)
                async for line in response.aiter_lines():
                    if line:
                        try:
                            data = json.loads(line)
                            # Check for completion marker
                            if data.get("status") == "completed":
                                print("Stream completed.")
                                break
                            print(f"Streamed Result: {json.dumps(data, indent=2)}")
                        except json.JSONDecodeError:
                            print(f"Warning: Could not decode JSON line: {line}")
    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred: {e.response.status_code} - {e.response.text}")
    except Exception as e:
        print(f"Error in streaming crawl test: {str(e)}")

# To run this example:
# import asyncio
# asyncio.run(test_stream_crawl())
```
Metrics & Monitoring

Keep an eye on your crawler with these endpoints:

- /health - Quick health check
- /metrics - Detailed Prometheus metrics
- /schema - Full API schema
Example health check:
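The snippet itself isn't included in this extract; a simple check is:

```bash
curl http://localhost:11235/health
```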
(Deployment Scenarios and Complete Examples sections remain the same; links may need updating if examples moved.)
Server Configuration

The server's behavior can be customized through the config.yml file.
Understanding config.yml

The configuration file is loaded from /app/config.yml inside the container. By default, the file from deploy/docker/config.yml in the repository is copied there during the build.

Here's a detailed breakdown of the configuration options (using defaults from deploy/docker/config.yml):
```yaml
# Application Configuration
app:
  title: "Crawl4AI API"
  version: "1.0.0" # Consider setting this to match library version, e.g., "0.5.1"
  host: "0.0.0.0"
  port: 8020 # NOTE: This port is used ONLY when running server.py directly. Gunicorn overrides this (see supervisord.conf).
  reload: False # Default set to False - suitable for production
  timeout_keep_alive: 300

# Default LLM Configuration
llm:
  provider: "openai/gpt-4o-mini" # Can be overridden by LLM_PROVIDER env var
  api_key_env: "OPENAI_API_KEY"
  # api_key: sk-...  # If you pass the API key directly then api_key_env will be ignored

# Redis Configuration (Used by internal Redis server managed by supervisord)
redis:
  host: "localhost"
  port: 6379
  db: 0
  password: ""
  # ... other redis options ...

# Rate Limiting Configuration
rate_limiting:
  enabled: True
  default_limit: "1000/minute"
  trusted_proxies: []
  storage_uri: "memory://" # Use "redis://localhost:6379" if you need persistent/shared limits

# Security Configuration
security:
  enabled: false # Master toggle for security features
  jwt_enabled: false # Enable JWT authentication (requires security.enabled=true)
  https_redirect: false # Force HTTPS (requires security.enabled=true)
  trusted_hosts: ["*"] # Allowed hosts (use specific domains in production)
  headers: # Security headers (applied if security.enabled=true)
    x_content_type_options: "nosniff"
    x_frame_options: "DENY"
    content_security_policy: "default-src 'self'"
    strict_transport_security: "max-age=63072000; includeSubDomains"

# Crawler Configuration
crawler:
  memory_threshold_percent: 95.0
  rate_limiter:
    base_delay: [1.0, 2.0] # Min/max delay between requests in seconds for dispatcher
  timeouts:
    stream_init: 30.0 # Timeout for stream initialization
    batch_process: 300.0 # Timeout for non-streaming /crawl processing

# Logging Configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# Observability Configuration
observability:
  prometheus:
    enabled: True
    endpoint: "/metrics"
  health_check:
    endpoint: "/health"
```
(The JWT Authentication section remains the same; just note the default port for requests is now 11235.)

(Configuration Tips and Best Practices remain the same.)
Customizing Your Configuration

You can override the default config.yml.
Method 1: Modify Before Build

- Edit the deploy/docker/config.yml file in your local repository clone.
- Build the image using docker buildx or docker compose --profile local-... up --build. The modified file will be copied into the image.
Method 2: Runtime Mount (Recommended for Custom Deploys)

- Create your custom configuration file, e.g., my-custom-config.yml locally. Ensure it contains all necessary sections.
- Mount it when running the container:
  - Using docker run:

```bash
# Assumes my-custom-config.yml is in the current directory
docker run -d -p 11235:11235 \
  --name crawl4ai-custom-config \
  --env-file .llm.env \
  --shm-size=1g \
  -v $(pwd)/my-custom-config.yml:/app/config.yml \
  unclecode/crawl4ai:latest # Or your specific tag
```

  - Using docker-compose.yml: Add a volumes section to the service definition:

```yaml
services:
  crawl4ai-hub-amd64: # Or your chosen service
    image: unclecode/crawl4ai:latest
    profiles: ["hub-amd64"]
    <<: *base-config
    volumes:
      # Mount local custom config over the default one in the container
      - ./my-custom-config.yml:/app/config.yml
      # Keep the shared memory volume from base-config
      - /dev/shm:/dev/shm
```

(Note: Ensure my-custom-config.yml is in the same directory as docker-compose.yml.)

💡 When mounting, your custom file completely replaces the default one. Ensure it's a valid and complete configuration.
Configuration Recommendations

- Security First 🔒
  - Always enable security in production
  - Use specific trusted_hosts instead of wildcards
  - Set up proper rate limiting to protect your server
  - Consider your environment before enabling HTTPS redirect
- Resource Management 💻
  - Adjust memory_threshold_percent based on available RAM
  - Set timeouts according to your content size and network conditions
  - Use Redis for rate limiting in multi-container setups
- Monitoring 📊
  - Enable Prometheus if you need metrics
  - Set DEBUG logging in development, INFO in production
  - Regular health check monitoring is crucial
- Performance Tuning ⚡
  - Start with conservative rate limiter delays
  - Increase the batch_process timeout for large content
  - Adjust the stream_init timeout based on initial response times
Getting Help

We're here to help you succeed with Crawl4AI! Here's how to get support:

- 📖 Check our full documentation
- 🐛 Found a bug? Open an issue
- 💬 Join our Discord community
- ⭐ Star us on GitHub to show support!
Summary

In this guide, we've covered everything you need to get started with Crawl4AI's Docker deployment:

- Building and running the Docker container
- Configuring the environment
- Using the interactive playground for testing
- Making API requests with proper typing
- Using the Python SDK
- Leveraging specialized endpoints for screenshots, PDFs, and JavaScript execution
- Connecting via the Model Context Protocol (MCP)
- Monitoring your deployment
The new playground interface at http://localhost:11235/playground makes it much easier to test configurations and generate the corresponding JSON for API requests.

For AI application developers, the MCP integration allows tools like Claude Code to directly access Crawl4AI's capabilities without complex API handling.

Remember, the examples in the examples folder are your friends - they show real-world usage patterns that you can adapt for your needs.

Keep exploring, and don't hesitate to reach out if you need help! We're building something amazing together. 🚀

Happy crawling! 🕷️