Prefix-Based Input Handling in Crawl4AI

This guide walks you through using the Crawl4AI library to crawl web pages, local HTML files, and raw HTML strings. We demonstrate each of these features using a Wikipedia page as the example.

Crawling a Web URL

To crawl a live web page, provide a URL that starts with http:// or https:// and pass a CrawlerRunConfig object:

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def crawl_web():
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/apple", 
            config=config
        )
        if result.success:
            print("Markdown Content:")
            print(result.markdown)
        else:
            print(f"Failed to crawl: {result.error_message}")

asyncio.run(crawl_web())
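
The examples in this guide use CacheMode.BYPASS so that every run fetches a fresh copy. For repeat crawls of the same page, you can let Crawl4AI serve cached results instead. A minimal sketch, assuming CacheMode.ENABLED is the standard read/write caching member alongside BYPASS:

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def crawl_with_cache():
    # CacheMode.ENABLED reads from and writes to the local cache, so a
    # second arun() for the same URL can be served without a network fetch.
    # (Assumption: ENABLED is the read/write member of CacheMode.)
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
    async with AsyncWebCrawler() as crawler:
        first = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", config=config)
        second = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", config=config)
        if first.success and second.success:
            print(len(first.markdown), len(second.markdown))

asyncio.run(crawl_with_cache())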

Crawling a Local HTML File

To crawl a local HTML file, prepend file:// to the file path:

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def crawl_local_file():
    local_file_path = "/path/to/apple.html"  # Replace with your file path
    file_url = f"file://{local_file_path}"
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
        if result.success:
            print("Markdown Content from Local File:")
            print(result.markdown)
        else:
            print(f"Failed to crawl local file: {result.error_message}")

asyncio.run(crawl_local_file())
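
Building the file:// URL by string formatting works for absolute POSIX paths; for a portable alternative, the standard library's Path.as_uri() produces a properly encoded file:// URL (it requires an absolute path, hence the resolve() call). A minimal sketch:

import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def crawl_local_file_portable():
    # as_uri() needs an absolute path; resolve() also normalizes it.
    # It percent-encodes special characters and handles Windows drive letters.
    file_url = Path("apple.html").resolve().as_uri()  # e.g. file:///.../apple.html
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
        if result.success:
            print(result.markdown)

asyncio.run(crawl_local_file_portable())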

Crawling Raw HTML Content

To crawl raw HTML content, prepend raw: to the HTML string:

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def crawl_raw_html():
    raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
    raw_html_url = f"raw:{raw_html}"
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=raw_html_url, config=config)
        if result.success:
            print("Markdown Content from Raw HTML:")
            print(result.markdown)
        else:
            print(f"Failed to crawl raw HTML: {result.error_message}")

asyncio.run(crawl_raw_html())
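
Because the HTML never touches disk or the network, the raw: prefix is useful for content you already hold in memory, such as an API response or a generated template. A minimal sketch (the HTML snippet below is invented for illustration):

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def crawl_generated_html():
    # Hypothetical in-memory HTML, e.g. assembled from a template or
    # returned by an API -- no file or network fetch involved.
    items = ["Fuji", "Gala", "Honeycrisp"]
    html = (
        "<html><body><h1>Apples</h1><ul>"
        + "".join(f"<li>{name}</li>" for name in items)
        + "</ul></body></html>"
    )
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=f"raw:{html}", config=config)
        if result.success:
            print(result.markdown)  # markdown rendering of the heading and list

asyncio.run(crawl_generated_html())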

Complete Example

Below is a comprehensive script that:

  1. Crawls the Wikipedia page for "Apple".
  2. Saves the HTML content to a local file (apple.html).
  3. Crawls the local HTML file and verifies that the markdown length matches the original crawl.
  4. Crawls the raw HTML content read from the saved file and verifies consistency.

import os
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    wikipedia_url = "https://en.wikipedia.org/wiki/apple"
    script_dir = Path(__file__).parent
    html_file_path = script_dir / "apple.html"

    async with AsyncWebCrawler() as crawler:
        # Step 1: Crawl the Web URL
        print("\n=== Step 1: Crawling the Wikipedia URL ===")
        web_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        result = await crawler.arun(url=wikipedia_url, config=web_config)

        if not result.success:
            print(f"Failed to crawl {wikipedia_url}: {result.error_message}")
            return

        with open(html_file_path, 'w', encoding='utf-8') as f:
            f.write(result.html)
        web_crawl_length = len(result.markdown)
        print(f"Length of markdown from web crawl: {web_crawl_length}\n")

        # Step 2: Crawl from the Local HTML File
        print("=== Step 2: Crawling from the Local HTML File ===")
        file_url = f"file://{html_file_path.resolve()}"
        file_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        local_result = await crawler.arun(url=file_url, config=file_config)

        if not local_result.success:
            print(f"Failed to crawl local file {file_url}: {local_result.error_message}")
            return

        local_crawl_length = len(local_result.markdown)
        assert web_crawl_length == local_crawl_length, "Markdown length mismatch"
        print("✅ Markdown length matches between web and local file crawl.\n")

        # Step 3: Crawl Using Raw HTML Content
        print("=== Step 3: Crawling Using Raw HTML Content ===")
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()
        raw_html_url = f"raw:{raw_html_content}"
        raw_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        raw_result = await crawler.arun(url=raw_html_url, config=raw_config)

        if not raw_result.success:
            print(f"Failed to crawl raw HTML content: {raw_result.error_message}")
            return

        raw_crawl_length = len(raw_result.markdown)
        assert web_crawl_length == raw_crawl_length, "Markdown length mismatch"
        print("✅ Markdown length matches between web and raw HTML crawl.\n")

        print("All tests passed successfully!")
    if html_file_path.exists():
        os.remove(html_file_path)

if __name__ == "__main__":
    asyncio.run(main())

Conclusion

With the unified url parameter and prefix-based handling in Crawl4AI, you can seamlessly work with web URLs, local HTML files, and raw HTML content. Use CrawlerRunConfig for flexible, consistent configuration across all of these scenarios.
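
To illustrate, here is a minimal sketch that pushes all three input types through the same code path; only the prefix of each url string differs (the file path and HTML snippet are placeholders):

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def crawl_all():
    urls = [
        "https://en.wikipedia.org/wiki/apple",           # live web page
        "file:///path/to/apple.html",                    # local file (placeholder path)
        "raw:<html><body><h1>Hello</h1></body></html>",  # in-memory HTML
    ]
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            # arun() dispatches on the prefix; the calling code is identical.
            result = await crawler.arun(url=url, config=config)
            status = "ok" if result.success else f"failed: {result.error_message}"
            print(f"{url[:40]} -> {status}")

asyncio.run(crawl_all())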

