Crawl4AI CLI Guide

- Installation
- Basic Usage
- Configuration
  - Browser Configuration
  - Crawler Configuration
  - Extraction Configuration
  - Content Filtering
- Advanced Features
  - LLM Q&A
  - Structured Data Extraction
  - Content Filtering
- Output Formats
- Examples
- Configuration Reference
- Best Practices & Tips

Installation

The Crawl4AI CLI is installed automatically when you install the library.

Basic Usage

The Crawl4AI CLI (crwl) provides a simple interface to the Crawl4AI library:

# Basic crawling
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# Verbose JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache

# See usage examples
crwl --example

Quick Example of Advanced Usage

If you clone the repository and run the following command, you will receive the page content in JSON format, extracted according to a JSON-CSS schema:

crwl "https://www.infoq.com/ai-ml-data-eng/" -e docs/examples/cli/extract_css.yml -s docs/examples/cli/css_schema.json -o json;

Configuration

Browser Configuration

Browser settings can be configured via a YAML file or command-line parameters:

# browser.yml
headless: true
viewport_width: 1280
user_agent_mode: "random"
verbose: true
ignore_https_errors: true

# Using config file
crwl https://example.com -B browser.yml

# Using direct parameters
crwl https://example.com -b "headless=true,viewport_width=1280,user_agent_mode=random"
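
The -b and -c options take a comma-separated list of key=value pairs that mirror the keys in the YAML files. As a rough illustration of that mapping, here is a minimal Python sketch. The function name `parse_kv_params` and its type-coercion rules are assumptions made for this example; they are not the CLI's actual parser, which may handle quoting and types differently.

```python
def parse_kv_params(params: str) -> dict:
    """Turn 'a=1,b=true' into {'a': 1, 'b': True} (illustrative only)."""
    result = {}
    for pair in params.split(","):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, sep, raw = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        result[key.strip()] = _coerce(raw.strip())
    return result


def _coerce(raw: str):
    """Guess a Python type for a raw string value (assumed rules)."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw  # anything else stays a string


config = parse_kv_params("headless=true,viewport_width=1280,user_agent_mode=random")
print(config)
# → {'headless': True, 'viewport_width': 1280, 'user_agent_mode': 'random'}
```

Note that this naive split on commas would break on values that themselves contain a comma, which is one reason a YAML file is the safer choice for anything beyond simple overrides.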

Crawler Configuration

Control crawling behavior:

# crawler.yml
cache_mode: "bypass"
wait_until: "networkidle"
page_timeout: 30000
delay_before_return_html: 0.5
word_count_threshold: 100
scan_full_page: true
scroll_delay: 0.3
process_iframes: false
remove_overlay_elements: true
magic: true
verbose: true

# Using config file
crwl https://example.com -C crawler.yml

# Using direct parameters
crwl https://example.com -c "css_selector=#main,delay_before_return_html=2,scan_full_page=true"

Extraction Configuration

Two types of extraction are supported:

CSS/XPath-based extraction:

# extract_css.yml
type: "json-css"
params:
  verbose: true

// css_schema.json
{
  "name": "ArticleExtractor",
  "baseSelector": ".article",
  "fields": [
    {
      "name": "title",
      "selector": "h1.title",
      "type": "text"
    },
    {
      "name": "link",
      "selector": "a.read-more",
      "type": "attribute",
      "attribute": "href"
    }
  ]
}
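
Before running an extraction, it can help to sanity-check the schema file. The sketch below is a hypothetical helper (`check_css_schema` is not part of Crawl4AI), and the rules it enforces are inferred only from the example schema above, so treat them as assumptions rather than the library's validation logic.

```python
import json


def check_css_schema(schema: dict) -> list:
    """Return a list of problems found in a JSON-CSS schema (empty if none)."""
    problems = []
    for key in ("name", "baseSelector", "fields"):
        if key not in schema:
            problems.append(f"missing top-level key: {key}")
    for i, field in enumerate(schema.get("fields", [])):
        for key in ("name", "selector", "type"):
            if key not in field:
                problems.append(f"fields[{i}] missing key: {key}")
        # In the example, fields of type "attribute" also name the attribute to read.
        if field.get("type") == "attribute" and "attribute" not in field:
            problems.append(f"fields[{i}] has type 'attribute' but no 'attribute' key")
    return problems


schema = json.loads("""
{
  "name": "ArticleExtractor",
  "baseSelector": ".article",
  "fields": [
    {"name": "title", "selector": "h1.title", "type": "text"},
    {"name": "link", "selector": "a.read-more", "type": "attribute", "attribute": "href"}
  ]
}
""")
print(check_css_schema(schema))  # → []
```

An empty list means the schema has the same shape as the example; it says nothing about whether the selectors actually match the target page.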

LLM-based extraction:

# extract_llm.yml
type: "llm"
provider: "openai/gpt-4"
instruction: "Extract all articles with their titles and links"
api_token: "your-token"
params:
  temperature: 0.3
  max_tokens: 1000

// llm_schema.json
{
  "title": "Article",
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "The title of the article"
    },
    "link": {
      "type": "string",
      "description": "URL to the full article"
    }
  }
}
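
The LLM schema describes the shape each extracted item is expected to have. Since model output can drift from that shape, a quick check on the results may be useful. The following is a minimal sketch, assuming the output is a list of JSON objects; `matches_schema` is a hypothetical helper that covers only the small subset of JSON Schema used above, and the sample items are made up for illustration.

```python
import json

# Partial map from JSON Schema type names to Python types (an assumption of this sketch).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def matches_schema(item, schema):
    """Shallow check of one extracted item against the schema's declared types."""
    if not isinstance(item, JSON_TYPES[schema["type"]]):
        return False
    for name, spec in schema.get("properties", {}).items():
        # Properties are treated as optional; only present values are type-checked.
        if name in item and not isinstance(item[name], JSON_TYPES[spec["type"]]):
            return False
    return True


schema = json.loads("""
{
  "title": "Article",
  "type": "object",
  "properties": {
    "title": {"type": "string", "description": "The title of the article"},
    "link": {"type": "string", "description": "URL to the full article"}
  }
}
""")

# Made-up sample items, standing in for whatever the extraction returns.
good = {"title": "Example article", "link": "https://example.com/article"}
bad = {"title": 42, "link": "https://example.com/article"}
print(matches_schema(good, schema), matches_schema(bad, schema))  # → True False
```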

Advanced Features

LLM Q&A

Ask questions about crawled content:

# Simple question
crwl https://example.com -q "What is the main topic discussed?"

# View content then ask questions
crwl https://example.com -o markdown  # See content first
crwl https://example.com -q "Summarize the key points"
crwl https://example.com -q "What are the conclusions?"

# Combined with advanced crawling
crwl https://example.com \
    -B browser.yml \
    -c "css_selector=article,scan_full_page=true" \
    -q "What are the pros and cons mentioned?"

First-time setup:
- Prompts for an LLM provider and API token
- Saves the configuration in ~/.crawl4ai/global.yml
- Supports various providers (openai/gpt-4, anthropic/claude-3-sonnet, etc.)
- For ollama, you do not need to provide an API token
- See LiteLLM Providers for the full list

Structured Data Extraction

Extract structured data using CSS selectors:

crwl https://example.com \
    -e extract_css.yml \
    -s css_schema.json \
    -o json

Or use LLM-based extraction:

crwl https://example.com \
    -e extract_llm.yml \
    -s llm_schema.json \
    -o json

Content Filtering

Filter content for relevance:

# filter_bm25.yml
type: "bm25"
query: "target content"
threshold: 1.0

# filter_pruning.yml
type: "pruning"
query: "focus topic"
threshold: 0.48

crwl https://example.com -f filter_bm25.yml -o markdown-fit
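
To make the `query` and `threshold` settings of the BM25 filter more concrete: BM25 scores each chunk of text against the query terms, and chunks scoring below the threshold are dropped. The sketch below is a simplified, pure-Python Okapi BM25 written for illustration. The tokenizer, the `k1` and `b` defaults, and the function names are assumptions of this example; it is not Crawl4AI's implementation, so its scores will not match what the CLI computes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(chunks, query, k1=1.5, b=0.75):
    """Score every chunk against the query with a basic Okapi BM25 formula."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: the number of chunks in which each term appears.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_chunks(chunks, query, threshold):
    """Keep only the chunks whose BM25 score reaches the threshold."""
    return [c for c, s in zip(chunks, bm25_scores(chunks, query)) if s >= threshold]


chunks = [
    "Pricing plans and billing options for the crawler service",
    "Cookie banner: accept all cookies to continue",
    "The crawler extracts target content from each page",
]
print(filter_chunks(chunks, "target content", threshold=1.0))
# → ['The crawler extracts target content from each page']
```

Raising the threshold keeps fewer, more query-focused chunks; lowering it lets more loosely related text through.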

Output Formats

- all - Full crawl result including metadata
- json - Extracted structured data (when using extraction)
- markdown / md - Raw markdown output
- markdown-fit / md-fit - Filtered markdown for better readability

Complete Examples

Basic Extraction:

crwl https://example.com \
    -B browser.yml \
    -C crawler.yml \
    -o json
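
The same invocation can be driven from a script. Here is a minimal Python sketch that assembles the command from the flags shown above and runs it with the standard `subprocess` module. The helper names `build_crwl_command` and `run_crwl` are hypothetical, and the sketch assumes that crwl is on the PATH and that the config files exist; what exactly the CLI writes to stdout is not specified here, so inspect it before parsing.

```python
import subprocess


def build_crwl_command(url, browser_config=None, crawler_config=None, output=None):
    """Assemble an argv list for crwl using only the -B, -C and -o flags."""
    cmd = ["crwl", url]
    if browser_config:
        cmd += ["-B", browser_config]
    if crawler_config:
        cmd += ["-C", crawler_config]
    if output:
        cmd += ["-o", output]
    return cmd


def run_crwl(cmd):
    """Run the command and return its stdout; raises if crwl exits with an error."""
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout


cmd = build_crwl_command("https://example.com", "browser.yml", "crawler.yml", "json")
print(" ".join(cmd))
# → crwl https://example.com -B browser.yml -C crawler.yml -o json
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with URLs that contain characters such as `&` or `?`.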

Structured Data Extraction:

crwl https://example.com \
    -e extract_css.yml \
    -s css_schema.json \
    -o json \
    -v

LLM Extraction with Filtering:

crwl https://example.com \
    -B browser.yml \
    -e extract_llm.yml \
    -s llm_schema.json \
    -f filter_bm25.yml \
    -o json

Interactive Q&A:

# First crawl and view
crwl https://example.com -o markdown

# Then ask questions
crwl https://example.com -q "What are the main points?"
crwl https://example.com -q "Summarize the conclusions"

Best Practices & Tips

Configuration Management:
- Keep common configurations in YAML files
- Use CLI parameters for quick overrides
- Store sensitive data (API tokens) in ~/.crawl4ai/global.yml

Performance Optimization:
- Use --bypass-cache for fresh content
- Enable scan_full_page for infinite-scroll pages
- Adjust delay_before_return_html for dynamic content

Content Extraction:
- Use CSS extraction for structured content
- Use LLM extraction for unstructured content
- Combine with filters for focused results

Q&A Workflow:
- View content first with -o markdown
- Ask specific questions
- Use broader context with appropriate selectors

Recap

The Crawl4AI CLI provides:
- Flexible configuration via files and parameters
- Multiple extraction strategies (CSS, XPath, LLM)
- Content filtering and optimization
- Interactive Q&A capabilities
- Various output formats