Overview of Some Important Advanced Features
(Proxies, PDF, Screenshots, SSL, Headers & Storage State)
Crawl4AI offers several power-user features that go well beyond simple crawling. This tutorial covers:
1. Proxy Usage
2. Capturing PDFs & Screenshots
3. Handling SSL Certificates
4. Custom Headers
5. Session Persistence & Local Storage
6. Robots.txt Compliance
Prerequisites
- You have a basic understanding of AsyncWebCrawler.
- You know how to run or configure your Python environment with Playwright installed.
1. Proxy Usage
If you need to route crawl traffic through a proxy (for IP rotation, geo-testing, or privacy), Crawl4AI supports it via BrowserConfig.proxy_config.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(
        proxy_config={
            "server": "http://proxy.example.com:8080",
            "username": "myuser",
            "password": "mypass",
        },
        headless=True
    )
    crawler_cfg = CrawlerRunConfig(
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://www.whatismyip.com/",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Page fetched via proxy.")
            print("Page HTML snippet:", result.html[:200])
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
Key Points
- proxy_config expects a dict with server and optional auth credentials.
- Many commercial proxies provide an HTTP/HTTPS endpoint that you plug into server.
- If your proxy doesn't need authentication, omit username/password (see the sketch below).
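For an unauthenticated proxy, the same pattern works with only the server field. This is a minimal sketch; the proxy URL is a placeholder:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Unauthenticated proxy: only "server" is required; no username/password keys.
    browser_cfg = BrowserConfig(
        proxy_config={"server": "http://proxy.example.com:8080"},
        headless=True,
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://www.whatismyip.com/")
        print("Success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())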
2. Capturing PDFs & Screenshots
Sometimes you need a visual record of a page or a PDF "printout". Crawl4AI can do both in one pass:
import os, asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        pdf=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
            config=run_config
        )
        if result.success:
            print(f"Screenshot data present: {result.screenshot is not None}")
            print(f"PDF data present: {result.pdf is not None}")

            if result.screenshot:
                print(f"[OK] Screenshot captured, size: {len(result.screenshot)} bytes")
                with open("wikipedia_screenshot.png", "wb") as f:
                    f.write(b64decode(result.screenshot))
            else:
                print("[WARN] Screenshot data is None.")

            if result.pdf:
                print(f"[OK] PDF captured, size: {len(result.pdf)} bytes")
                with open("wikipedia_page.pdf", "wb") as f:
                    f.write(result.pdf)
            else:
                print("[WARN] PDF data is None.")
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
Why PDF + Screenshot?
- "Traditional" full-page screenshots can be slow or error-prone on large or complex pages.
- Exporting a PDF is more reliable for very long pages. When you request both a PDF and a screenshot, Crawl4AI automatically converts the first page of the PDF into an image for the screenshot.
Relevant Parameters
- pdf=True: exports the current page as a PDF (base64-encoded in result.pdf).
- screenshot=True: captures a screenshot (base64-encoded in result.screenshot).
- scan_full_page or advanced hooks can further refine how the crawler captures content (see the sketch below).
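For long pages where the screenshot should include content below the fold, here is a minimal sketch using scan_full_page; the scroll_delay value is an assumption, and option names may vary across Crawl4AI versions:

import asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def capture_full_page():
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        scan_full_page=True,   # scroll through the whole page before capturing
        scroll_delay=0.2,      # assumed option: pause (seconds) between scroll steps
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        if result.success and result.screenshot:
            with open("full_page.png", "wb") as f:
                f.write(b64decode(result.screenshot))

if __name__ == "__main__":
    asyncio.run(capture_full_page())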
3. Handling SSL Certificates
If you need to verify or export a site's SSL certificate (for compliance, debugging, or data analysis), Crawl4AI can fetch it during the crawl:
import asyncio, os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    tmp_dir = os.path.join(os.getcwd(), "tmp")
    os.makedirs(tmp_dir, exist_ok=True)

    config = CrawlerRunConfig(
        fetch_ssl_certificate=True,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

        if result.success and result.ssl_certificate:
            cert = result.ssl_certificate
            print("\nCertificate Information:")
            print(f"Issuer (CN): {cert.issuer.get('CN', '')}")
            print(f"Valid until: {cert.valid_until}")
            print(f"Fingerprint: {cert.fingerprint}")

            # Export in multiple formats:
            cert.to_json(os.path.join(tmp_dir, "certificate.json"))
            cert.to_pem(os.path.join(tmp_dir, "certificate.pem"))
            cert.to_der(os.path.join(tmp_dir, "certificate.der"))
            print("\nCertificate exported to JSON/PEM/DER in 'tmp' folder.")
        else:
            print("[ERROR] No certificate or crawl failed.")

if __name__ == "__main__":
    asyncio.run(main())
Key Points
- fetch_ssl_certificate=True triggers certificate retrieval.
- result.ssl_certificate includes methods (to_json, to_pem, to_der) for saving the certificate in various formats (handy for server configs, Java keystores, etc.); a small post-processing sketch follows below.
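The exported JSON is plain text, so it is easy to post-process. A minimal sketch, assuming the crawl above wrote tmp/certificate.json and that its fields mirror the attributes printed earlier:

import json
from pathlib import Path

# Load the exported certificate and list its top-level fields.
cert_path = Path("tmp") / "certificate.json"
with cert_path.open(encoding="utf-8") as f:
    cert_data = json.load(f)

print("Exported fields:", sorted(cert_data.keys()))
# Individual keys (e.g. issuer, fingerprint) depend on the export format of your version.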
4. Custom Headers
Sometimes you need to set custom headers (e.g., language preferences, auth tokens, or a specialized user-agent string). You can do this in a couple of ways:
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Option 1: Set headers at the crawler strategy level
    crawler1 = AsyncWebCrawler(
        # The underlying strategy can accept headers in its constructor
        crawler_strategy=None  # We'll override below for clarity
    )
    crawler1.crawler_strategy.update_user_agent("MyCustomUA/1.0")
    crawler1.crawler_strategy.set_custom_headers({
        "Accept-Language": "fr-FR,fr;q=0.9"
    })
    result1 = await crawler1.arun("https://www.example.com")
    print("Example 1 result success:", result1.success)

    # Option 2: Pass headers directly to `arun()`
    crawler2 = AsyncWebCrawler()
    result2 = await crawler2.arun(
        url="https://www.example.com",
        headers={"Accept-Language": "es-ES,es;q=0.9"}
    )
    print("Example 2 result success:", result2.success)

if __name__ == "__main__":
    asyncio.run(main())
Notes
- Some sites may react differently to certain headers (e.g., Accept-Language).
- If you need advanced user-agent randomization or client hints, see Identity-Based Crawling (Anti-Bot) or use UserAgentGenerator. A session-wide alternative is sketched below.
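If you prefer to fix the identity for the whole browser session rather than per request, BrowserConfig also accepts a user_agent. The headers parameter shown here is an assumption, so check your installed version's BrowserConfig signature before relying on it:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    browser_cfg = BrowserConfig(
        user_agent="MyCustomUA/1.0",                    # session-wide user agent
        headers={"Accept-Language": "fr-FR,fr;q=0.9"},  # assumed parameter; verify before use
        headless=True,
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://www.example.com")
        print("Success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())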
5. Session Persistence & Local Storage
Crawl4AI can preserve cookies and localStorage so you can pick up where you left off, which is ideal for logging into sites or skipping repeated auth flows.
5.1 storage_state
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    storage_dict = {
        "cookies": [
            {
                "name": "session",
                "value": "abcd1234",
                "domain": "example.com",
                "path": "/",
                "expires": 1699999999.0,
                "httpOnly": False,
                "secure": False,
                "sameSite": "None"
            }
        ],
        "origins": [
            {
                "origin": "https://example.com",
                "localStorage": [
                    {"name": "token", "value": "my_auth_token"}
                ]
            }
        ]
    }

    # Provide the storage state as a dictionary to start "already logged in"
    async with AsyncWebCrawler(
        headless=True,
        storage_state=storage_dict
    ) as crawler:
        result = await crawler.arun("https://example.com/protected")
        if result.success:
            print("Protected page content length:", len(result.html))
        else:
            print("Failed to crawl protected page")

if __name__ == "__main__":
    asyncio.run(main())
5.2 Exporting & Reusing State
You can sign in once, export the browser context, and reuse it later, without re-entering credentials.
- Export the storage state (cookies, localStorage, etc.) to a file (a sketch using Playwright follows below).
- Provide storage_state="my_storage.json" on subsequent runs to skip the login step.
See the detailed Session Management tutorial, or Explanations → Browser Context & Managed Browser, for more advanced scenarios (e.g., multi-step logins, or capturing state after interactive pages).
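One way to produce my_storage.json is to log in once with plain Playwright and export the context state. This is only a sketch, with a placeholder login URL and a manual-login pause, using Playwright's storage_state API:

import asyncio
from playwright.async_api import async_playwright

async def export_login_state():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        # Placeholder login flow: sign in manually (or script the form fill here).
        await page.goto("https://example.com/login")
        await page.wait_for_timeout(60_000)  # give yourself a minute to log in

        # Export cookies + localStorage for reuse via storage_state="my_storage.json".
        await context.storage_state(path="my_storage.json")
        await browser.close()

if __name__ == "__main__":
    asyncio.run(export_login_state())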
6. Robots.txt Compliance
Crawl4AI supports respecting robots.txt rules, with efficient caching:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Enable robots.txt checking in config
    config = CrawlerRunConfig(
        check_robots_txt=True  # Will check and respect robots.txt rules
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=config
        )

        if not result.success and result.status_code == 403:
            print("Access denied by robots.txt")

if __name__ == "__main__":
    asyncio.run(main())
Key Points
- robots.txt files are cached locally for efficiency.
- The cache is stored at ~/.crawl4ai/robots/robots_cache.db.
- The default cache TTL is 7 days (see the refresh sketch below).
- If robots.txt cannot be fetched, crawling is allowed.
- A 403 status code is returned if the URL is disallowed.
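If you need robots.txt rules to be re-fetched before the 7-day TTL expires, one simple approach is to delete the cache file at the path noted above. A sketch (adjust the path if your installation differs):

from pathlib import Path

# Removing the cached database forces the next crawl to re-fetch robots.txt rules.
cache_db = Path.home() / ".crawl4ai" / "robots" / "robots_cache.db"
if cache_db.exists():
    cache_db.unlink()
    print(f"Removed {cache_db}")
else:
    print("No robots.txt cache found.")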
Putting It All Together
Here's a snippet that combines several "advanced" features (proxy, PDF, screenshot, SSL, custom headers, and session reuse) into a single run. In practice, you'd tailor each setting to your project's needs.
import os, asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # 1. Browser config with proxy + headless
    browser_cfg = BrowserConfig(
        proxy_config={
            "server": "http://proxy.example.com:8080",
            "username": "myuser",
            "password": "mypass",
        },
        headless=True,
    )

    # 2. Crawler config with PDF, screenshot, SSL, custom headers, and ignoring caches
    crawler_cfg = CrawlerRunConfig(
        pdf=True,
        screenshot=True,
        fetch_ssl_certificate=True,
        cache_mode=CacheMode.BYPASS,
        headers={"Accept-Language": "en-US,en;q=0.8"},
        storage_state="my_storage.json",  # Reuse session from a previous sign-in
        verbose=True,
    )

    # 3. Crawl
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://secure.example.com/protected",
            config=crawler_cfg
        )

        if result.success:
            print("[OK] Crawled the secure page. Links found:", len(result.links.get("internal", [])))

            # Save PDF & screenshot
            if result.pdf:
                with open("result.pdf", "wb") as f:
                    f.write(b64decode(result.pdf))
            if result.screenshot:
                with open("result.png", "wb") as f:
                    f.write(b64decode(result.screenshot))

            # Check SSL cert
            if result.ssl_certificate:
                print("SSL Issuer CN:", result.ssl_certificate.issuer.get("CN", ""))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
Conclusion & Next Steps
You have now explored several advanced features:
- Proxy usage
- PDF & screenshot capture for large or critical pages
- SSL certificate retrieval & export
- Custom headers for language or specialized requests
- Session persistence via storage state
- Robots.txt compliance
With these tools you can build robust crawling workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs, streamlining your entire data-collection pipeline.
Last updated: 2025-01-01