Overview of Some Important Advanced Features

(Proxies, PDF, Screenshots, SSL, Headers & Storage State)

Crawl4AI offers a range of advanced features that go well beyond simple crawling. This tutorial covers:

1. Proxy usage
2. Capturing PDFs & screenshots
3. Handling SSL certificates
4. Custom headers
5. Session persistence & local storage
6. Robots.txt compliance

Prerequisites
  • You have a basic understanding of the AsyncWebCrawler basics
  • You know how to run or configure a Python environment with Playwright installed

1. Proxy Usage

If you need to route crawl traffic through a proxy (for IP rotation, geo-testing, or privacy), Crawl4AI supports it via BrowserConfig.proxy_config:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(
        proxy_config={
            "server": "http://proxy.example.com:8080",
            "username": "myuser",
            "password": "mypass",
        },
        headless=True
    )
    crawler_cfg = CrawlerRunConfig(
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://www.whatismyip.com/",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Page fetched via proxy.")
            print("Page HTML snippet:", result.html[:200])
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

Key Points
  • proxy_config expects a dict with a server URL and optional authentication credentials.
  • Many commercial proxies provide an HTTP/HTTPS endpoint that you place in server.
  • If your proxy doesn't need authentication, omit username/password.
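
For example, here is a minimal sketch of an unauthenticated proxy setup (the endpoint URL below is a placeholder, not a real proxy):

from crawl4ai import BrowserConfig

# Hypothetical open proxy endpoint; no username/password required
browser_cfg = BrowserConfig(
    proxy_config={"server": "http://proxy.example.com:8080"},
    headless=True
)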


2. Capturing PDFs & Screenshots

Sometimes you need a visual record of a page or a PDF "printout." Crawl4AI can do both in one pass:

import os, asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        pdf=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
            config=run_config
        )
        if result.success:
            print(f"Screenshot data present: {result.screenshot is not None}")
            print(f"PDF data present: {result.pdf is not None}")

            if result.screenshot:
                print(f"[OK] Screenshot captured, size: {len(result.screenshot)} bytes")
                with open("wikipedia_screenshot.png", "wb") as f:
                    f.write(b64decode(result.screenshot))
            else:
                print("[WARN] Screenshot data is None.")

            if result.pdf:
                print(f"[OK] PDF captured, size: {len(result.pdf)} bytes")
                with open("wikipedia_page.pdf", "wb") as f:
                    f.write(result.pdf)
            else:
                print("[WARN] PDF data is None.")

        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

Why PDF + Screenshot?
  • Large or complex pages can be slow or error-prone with "traditional" full-page screenshots.
  • Exporting a PDF is more reliable for very long pages. If you request both, Crawl4AI automatically converts the first page of the PDF into an image.

Relevant Parameters
  • pdf=True: Exports the current page as a PDF (base64-encoded in result.pdf).
  • screenshot=True: Creates a screenshot (base64-encoded in result.screenshot).
  • scan_full_page or advanced hooks can further refine how the crawler captures content.
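
For instance, a minimal sketch that scrolls the full page before capturing, so lazy-loaded content is included (scan_full_page and scroll_delay are assumed to be supported by your installed version of CrawlerRunConfig):

from crawl4ai import CrawlerRunConfig

# Assumed parameters; verify they exist in your installed version
run_config = CrawlerRunConfig(
    screenshot=True,
    pdf=True,
    scan_full_page=True,   # scroll through the entire page before capturing
    scroll_delay=0.5       # pause (in seconds) between scroll steps
)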


3. Handling SSL Certificates

If you need to verify or export a site's SSL certificate (for compliance, debugging, or data analysis), Crawl4AI can fetch it during the crawl:

import asyncio, os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    tmp_dir = os.path.join(os.getcwd(), "tmp")
    os.makedirs(tmp_dir, exist_ok=True)

    config = CrawlerRunConfig(
        fetch_ssl_certificate=True,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

        if result.success and result.ssl_certificate:
            cert = result.ssl_certificate
            print("\nCertificate Information:")
            print(f"Issuer (CN): {cert.issuer.get('CN', '')}")
            print(f"Valid until: {cert.valid_until}")
            print(f"Fingerprint: {cert.fingerprint}")

            # Export in multiple formats:
            cert.to_json(os.path.join(tmp_dir, "certificate.json"))
            cert.to_pem(os.path.join(tmp_dir, "certificate.pem"))
            cert.to_der(os.path.join(tmp_dir, "certificate.der"))

            print("\nCertificate exported to JSON/PEM/DER in 'tmp' folder.")
        else:
            print("[ERROR] No certificate or crawl failed.")

if __name__ == "__main__":
    asyncio.run(main())

Key Points
  • fetch_ssl_certificate=True triggers certificate retrieval.
  • result.ssl_certificate includes methods (to_json, to_pem, to_der) for saving in various formats (handy for server config, Java keystores, etc.).
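
As a quick follow-up, you could read the exported JSON back to inspect the certificate offline (standard library only; assumes the certificate.json produced by the example above):

import json, os

# Load the certificate.json written by the crawl above
with open(os.path.join("tmp", "certificate.json")) as f:
    cert_data = json.load(f)
print("Certificate fields available:", list(cert_data.keys()))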


4. Custom Headers

Sometimes you need to set custom headers (for example, language preferences, authentication tokens, or a specialized user-agent string). You can do this in several ways:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Option 1: Set headers at the crawler strategy level
    crawler1 = AsyncWebCrawler(
        # The underlying strategy can accept headers in its constructor
        crawler_strategy=None  # We'll override below for clarity
    )
    crawler1.crawler_strategy.update_user_agent("MyCustomUA/1.0")
    crawler1.crawler_strategy.set_custom_headers({
        "Accept-Language": "fr-FR,fr;q=0.9"
    })
    result1 = await crawler1.arun("https://www.example.com")
    print("Example 1 result success:", result1.success)

    # Option 2: Pass headers directly to `arun()`
    crawler2 = AsyncWebCrawler()
    result2 = await crawler2.arun(
        url="https://www.example.com",
        headers={"Accept-Language": "es-ES,es;q=0.9"}
    )
    print("Example 2 result success:", result2.success)

if __name__ == "__main__":
    asyncio.run(main())

Notes
  • Some sites may react differently to certain headers (e.g., Accept-Language).
  • If you need advanced user-agent randomization or client hints, see Identity-Based Crawling (Anti-Bot) or use UserAgentGenerator.
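
As a rough sketch, you can also set the user agent and default headers when constructing the browser config; the parameter names below (user_agent, headers) are assumed to match your installed BrowserConfig version:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Assumed BrowserConfig parameters; verify against your installed version
    browser_cfg = BrowserConfig(
        user_agent="MyCustomUA/1.0",
        headers={"Accept-Language": "fr-FR,fr;q=0.9"},
        headless=True
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://www.example.com")
        print("Success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())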


5. Session Persistence & Local Storage

Crawl4AI can preserve cookies and localStorage so you can pick up where you left off - ideal for logging into sites or skipping repeated authentication flows.

5.1 storage_state

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    storage_dict = {
        "cookies": [
            {
                "name": "session",
                "value": "abcd1234",
                "domain": "example.com",
                "path": "/",
                "expires": 1699999999.0,
                "httpOnly": False,
                "secure": False,
                "sameSite": "None"
            }
        ],
        "origins": [
            {
                "origin": "https://example.com",
                "localStorage": [
                    {"name": "token", "value": "my_auth_token"}
                ]
            }
        ]
    }

    # Provide the storage state as a dictionary to start "already logged in"
    async with AsyncWebCrawler(
        headless=True,
        storage_state=storage_dict
    ) as crawler:
        result = await crawler.arun("https://example.com/protected")
        if result.success:
            print("Protected page content length:", len(result.html))
        else:
            print("Failed to crawl protected page")

if __name__ == "__main__":
    asyncio.run(main())

5.2 Exporting & Reusing State

You can sign in once, export the browser context, and reuse it later - no need to re-enter credentials.

  • Export the browser's storage state (cookies, localStorage, etc.) to a file.
  • Provide storage_state="my_storage.json" on subsequent runs to skip the login step (see the sketch below).
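
A minimal sketch of the reuse step, assuming a my_storage.json file was already exported from a logged-in session:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Reuse a previously exported storage state file to start "already logged in"
    async with AsyncWebCrawler(
        headless=True,
        storage_state="my_storage.json"
    ) as crawler:
        result = await crawler.arun("https://example.com/protected")
        print("Success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())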

See also: the detailed Session Management tutorial, and Explanations → Browser Context & Managed Browsers, for more advanced scenarios (e.g., multi-step logins or capturing content after interactive pages).


6. Robots.txt Compliance

Crawl4AI supports respecting robots.txt rules, with efficient caching:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Enable robots.txt checking in config
    config = CrawlerRunConfig(
        check_robots_txt=True  # Will check and respect robots.txt rules
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=config
        )

        if not result.success and result.status_code == 403:
            print("Access denied by robots.txt")

if __name__ == "__main__":
    asyncio.run(main())

Key Points
  • Robots.txt files are cached locally for efficiency
  • The cache is stored at ~/.crawl4ai/robots/robots_cache.db
  • The cache has a default TTL of 7 days
  • If robots.txt cannot be fetched, crawling is allowed
  • A 403 status code is returned if the URL is disallowed
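
If you need to force a fresh robots.txt fetch, one simple option is to delete the local cache file mentioned above (a housekeeping sketch, assuming the default cache location):

import os

# Remove the local robots.txt cache so the next crawl re-fetches the rules
cache_path = os.path.expanduser("~/.crawl4ai/robots/robots_cache.db")
if os.path.exists(cache_path):
    os.remove(cache_path)
    print("Robots cache cleared:", cache_path)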


Putting It All Together

The snippet below combines several "advanced" features (proxy, PDF, screenshot, SSL, custom headers, and session reuse) into a single run. In practice, you'll tailor each setting to your project's needs.

import os, asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # 1. Browser config with proxy + headless
    browser_cfg = BrowserConfig(
        proxy_config={
            "server": "http://proxy.example.com:8080",
            "username": "myuser",
            "password": "mypass",
        },
        headless=True,
    )

    # 2. Crawler config with PDF, screenshot, SSL, custom headers, and ignoring caches
    crawler_cfg = CrawlerRunConfig(
        pdf=True,
        screenshot=True,
        fetch_ssl_certificate=True,
        cache_mode=CacheMode.BYPASS,
        headers={"Accept-Language": "en-US,en;q=0.8"},
        storage_state="my_storage.json",  # Reuse session from a previous sign-in
        verbose=True,
    )

    # 3. Crawl
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://secure.example.com/protected",
            config=crawler_cfg
        )

        if result.success:
            print("[OK] Crawled the secure page. Links found:", len(result.links.get("internal", [])))

            # Save PDF & screenshot
            if result.pdf:
                with open("result.pdf", "wb") as f:
                    f.write(b64decode(result.pdf))
            if result.screenshot:
                with open("result.png", "wb") as f:
                    f.write(b64decode(result.screenshot))

            # Check SSL cert
            if result.ssl_certificate:
                print("SSL Issuer CN:", result.ssl_certificate.issuer.get("CN", ""))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

Conclusion & Next Steps

You have now explored several advanced features:

  • Proxy usage
  • PDF & screenshot capture for large or critical pages
  • SSL certificate retrieval & exporting
  • Custom headers for language or specialized requests
  • Session persistence via storage state
  • Robots.txt compliance

With these tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs - streamlining your entire data collection pipeline.

Last Updated: 2025-01-01

