URL Seeding: A Smarter Way to Crawl at Scale
Why URL Seeding?
Web crawling comes in different flavors, each with its own strengths. Let's look at when to use URL seeding and when to use deep crawling.
Deep Crawling: Real-Time Discovery
Deep crawling is the right choice when you need:
- Fresh, real-time data - discover pages the moment they are created
- Dynamic exploration - follow links based on their content
- Selective extraction - stop as soon as you find what you need
# Deep crawling example: Explore a website dynamically
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
async def deep_crawl_example():
# Configure a 2-level deep crawl
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2, # Crawl 2 levels deep
include_external=False, # Stay within domain
max_pages=50 # Limit for efficiency
),
verbose=True
)
async with AsyncWebCrawler() as crawler:
# Start crawling and follow links dynamically
results = await crawler.arun("https://example.com", config=config)
print(f"Discovered and crawled {len(results)} pages")
for result in results[:3]:
print(f"Found: {result.url} at depth {result.metadata.get('depth', 0)}")
asyncio.run(deep_crawl_example())
URL Seeding: Bulk Discovery
URL seeding shines when you need:
- Comprehensive coverage - get thousands of URLs in seconds
- Bulk processing - filter before crawling
- Resource efficiency - know exactly what you will crawl
# URL seeding example: Analyze all documentation
from crawl4ai import AsyncUrlSeeder, SeedingConfig
seeder = AsyncUrlSeeder()
config = SeedingConfig(
source="sitemap",
extract_head=True,
pattern="*/docs/*"
)
# Get ALL documentation URLs instantly
urls = await seeder.urls("example.com", config)
# 1000+ URLs discovered in seconds!
The Trade-offs
Aspect | Deep Crawling | URL Seeding |
---|---|---|
Coverage | Discovers pages dynamically | Gets most existing URLs instantly |
Freshness | Finds brand-new pages | May miss very recent pages |
Speed | Slow, page by page | Blazing-fast bulk discovery |
Resource usage | Higher - crawls to discover | Lower - discover first, then crawl |
Control | Can stop mid-crawl | Pre-filter before crawling |
When to Use Each
Choose deep crawling when:
- You need the absolute latest content
- You are searching for specific information
- The site structure is unknown or dynamic
- You want to stop as soon as you find what you need
Choose URL seeding when:
- You need to analyze a large portion of a site
- You want to filter URLs before crawling
- You are doing comparative analysis
- You need to optimize resource usage
The magic happens when you understand both approaches and pick the right tool for the job. Sometimes you will even combine them, as sketched below: use URL seeding for bulk discovery, then deep crawl specific sections for the freshest updates.
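As a rough sketch of that combination (the domain, the */docs/* pattern, and the depth/page limits are illustrative placeholders, not values from this guide):
import asyncio
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def seed_then_deep_crawl():
    # Bulk discovery first: pull candidate section URLs from the sitemap
    async with AsyncUrlSeeder() as seeder:
        seed_config = SeedingConfig(source="sitemap", pattern="*/docs/*", max_urls=20)
        seeds = await seeder.urls("example.com", seed_config)

    # Then deep crawl a few seeded sections for the freshest linked pages
    crawl_config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=1,             # Follow links one level out from each seed
            include_external=False,  # Stay within the domain
            max_pages=10             # Keep each mini-crawl small
        )
    )
    async with AsyncWebCrawler() as crawler:
        for seed in seeds[:3]:  # Limit the sketch to a few sections
            results = await crawler.arun(seed["url"], config=crawl_config)
            print(f"{seed['url']}: crawled {len(results)} pages")

asyncio.run(seed_then_deep_crawl())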
Your First URL Seeding Adventure
Let's see the magic in action. We'll discover blog posts about Python, filter them down to tutorials, and crawl only those pages.
import asyncio
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig
async def smart_blog_crawler():
# Step 1: Create our URL discoverer
seeder = AsyncUrlSeeder()
# Step 2: Configure discovery - let's find all blog posts
config = SeedingConfig(
source="sitemap", # Use the website's sitemap
pattern="*/blog/*.html", # Only blog posts
extract_head=True, # Get page metadata
max_urls=100 # Limit for this example
)
# Step 3: Discover URLs from the Python blog
print("🔍 Discovering blog posts...")
urls = await seeder.urls("realpython.com", config)
print(f"✅ Found {len(urls)} blog posts")
# Step 4: Filter for Python tutorials (using metadata!)
tutorials = [
url for url in urls
if url["status"] == "valid" and
any(keyword in str(url["head_data"]).lower()
for keyword in ["tutorial", "guide", "how to"])
]
print(f"📚 Filtered to {len(tutorials)} tutorials")
# Step 5: Show what we found
print("\n🎯 Found these tutorials:")
for tutorial in tutorials[:5]: # First 5
title = tutorial["head_data"].get("title", "No title")
print(f" - {title}")
print(f" {tutorial['url']}")
# Step 6: Now crawl ONLY these relevant pages
print("\n🚀 Crawling tutorials...")
async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            only_text=True,
            word_count_threshold=300,  # Only substantial articles
            stream=True  # Stream results so the `async for` loop below works
        )
# Extract URLs and crawl them
tutorial_urls = [t["url"] for t in tutorials[:10]]
results = await crawler.arun_many(tutorial_urls, config=config)
successful = 0
async for result in results:
if result.success:
successful += 1
print(f"✓ Crawled: {result.url[:60]}...")
print(f"\n✨ Successfully crawled {successful} tutorials!")
# Run it!
asyncio.run(smart_blog_crawler())
What Just Happened?
- We discovered all the blog URLs from the sitemap
- We filtered them using metadata (without crawling!)
- We crawled only the relevant tutorials
- We saved enormous amounts of time and bandwidth
That's the power of URL seeding - you see everything before you crawl anything.
Understanding the URL Seeder
Now that you've seen the magic, let's understand how it works.
Basic Usage
Creating a URL seeder is simple:
from crawl4ai import AsyncUrlSeeder, SeedingConfig
# Method 1: Manual cleanup
seeder = AsyncUrlSeeder()
try:
config = SeedingConfig(source="sitemap")
urls = await seeder.urls("example.com", config)
finally:
await seeder.close()
# Method 2: Context manager (recommended)
async with AsyncUrlSeeder() as seeder:
config = SeedingConfig(source="sitemap")
urls = await seeder.urls("example.com", config)
# Automatically cleaned up on exit
The seeder can discover URLs from two powerful sources:
1. Sitemaps (Fastest)
# Discover from sitemap
config = SeedingConfig(source="sitemap")
urls = await seeder.urls("example.com", config)
A sitemap is an XML file that websites publish specifically to list their URLs. It's like getting the menu at a restaurant - everything is listed up front.
Sitemap index support: for large sites like TechCrunch that use a sitemap index (a sitemap of sitemaps), the seeder automatically detects it and processes all sub-sitemaps in parallel:
<!-- Example sitemap index -->
<sitemapindex>
<sitemap>
<loc>https://techcrunch.com/sitemap-1.xml</loc>
</sitemap>
<sitemap>
<loc>https://techcrunch.com/sitemap-2.xml</loc>
</sitemap>
<!-- ... more sitemaps ... -->
</sitemapindex>
The seeder handles this transparently - you automatically get the URLs from every sub-sitemap!
2. Common Crawl (Most Comprehensive)
# Discover from Common Crawl
config = SeedingConfig(source="cc")
urls = await seeder.urls("example.com", config)
Common Crawl is a massive public dataset that regularly crawls the entire web. It's like having access to a pre-built index of the internet.
3. Both Sources (Maximum Coverage)
# Use both sources
config = SeedingConfig(source="sitemap+cc")
urls = await seeder.urls("example.com", config)
Configuration Magic: SeedingConfig
The SeedingConfig object is your control panel. Here's everything you can configure (a combined example follows the table):
Parameter | Type | Default | Description |
---|---|---|---|
source | str | "sitemap+cc" | URL source: "cc" (Common Crawl), "sitemap", or "sitemap+cc" |
pattern | str | "*" | URL pattern filter (e.g. "*/blog/*", "*.html") |
extract_head | bool | False | Extract metadata from the page <head> |
live_check | bool | False | Verify that URLs are accessible |
max_urls | int | -1 | Maximum URLs to return (-1 = unlimited) |
concurrency | int | 10 | Parallel workers for fetching |
hits_per_sec | int | 5 | Request rate limit |
force | bool | False | Bypass the cache and fetch fresh data |
verbose | bool | False | Show detailed progress |
query | str | None | Search query for BM25 scoring |
scoring_method | str | None | Scoring method (currently "bm25") |
score_threshold | float | None | Minimum score for a URL to be included |
filter_nonsense_urls | bool | True | Filter out utility URLs (robots.txt, etc.) |
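As an illustrative combination of several parameters from the table (the domain and the query string are placeholders):
config = SeedingConfig(
    source="sitemap+cc",       # Pull URLs from both the sitemap and Common Crawl
    pattern="*/blog/*",        # Keep only blog-style URLs
    extract_head=True,         # Fetch <head> metadata so scoring has text to work with
    query="python tutorial",   # Score URLs against this query...
    scoring_method="bm25",     # ...using BM25
    score_threshold=0.3,       # Drop weakly relevant URLs
    max_urls=200,              # Cap the result size
    concurrency=20,            # Parallel workers
    hits_per_sec=10,           # Be polite to the server
    verbose=True
)
urls = await seeder.urls("example.com", config)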
Pattern Matching Examples
# Match all blog posts
config = SeedingConfig(pattern="*/blog/*")
# Match only HTML files
config = SeedingConfig(pattern="*.html")
# Match product pages
config = SeedingConfig(pattern="*/product/*")
# Match everything except admin pages
config = SeedingConfig(pattern="*")
# Then filter: urls = [u for u in urls if "/admin/" not in u["url"]]
URL Validation: Live Checking
Sometimes you need to know whether a URL is actually reachable. That's where live checking comes in:
config = SeedingConfig(
source="sitemap",
live_check=True, # Verify each URL is accessible
concurrency=20 # Check 20 URLs in parallel
)
urls = await seeder.urls("example.com", config)
# Now you can filter by status
live_urls = [u for u in urls if u["status"] == "valid"]
dead_urls = [u for u in urls if u["status"] == "not_valid"]
print(f"Live URLs: {len(live_urls)}")
print(f"Dead URLs: {len(dead_urls)}")
When to use live checking:
- Before large-scale crawl operations
- When working with old sitemaps
- When data freshness is critical
When to skip it:
- For quick exploration
- When you trust the source
- When speed matters more than accuracy
The Power of Metadata: Head Extraction
This is where URL seeding gets really powerful. Instead of crawling entire pages, you extract just the metadata:
config = SeedingConfig(
extract_head=True # Extract metadata from <head> section
)
urls = await seeder.urls("example.com", config)
# Now each URL has rich metadata
for url in urls[:3]:
print(f"\nURL: {url['url']}")
print(f"Title: {url['head_data'].get('title')}")
meta = url['head_data'].get('meta', {})
print(f"Description: {meta.get('description')}")
print(f"Keywords: {meta.get('keywords')}")
# Even Open Graph data!
print(f"OG Image: {meta.get('og:image')}")
What Can We Extract?
Head extraction gives you a treasure trove of information:
# Example of extracted head_data
{
"title": "10 Python Tips for Beginners",
"charset": "utf-8",
"lang": "en",
"meta": {
"description": "Learn essential Python tips...",
"keywords": "python, programming, tutorial",
"author": "Jane Developer",
"viewport": "width=device-width, initial-scale=1",
# Open Graph tags
"og:title": "10 Python Tips for Beginners",
"og:description": "Essential Python tips for new programmers",
"og:image": "https://example.com/python-tips.jpg",
"og:type": "article",
# Twitter Card tags
"twitter:card": "summary_large_image",
"twitter:title": "10 Python Tips",
# Dublin Core metadata
"dc.creator": "Jane Developer",
"dc.date": "2024-01-15"
},
"link": {
"canonical": [{"href": "https://example.com/blog/python-tips"}],
"alternate": [{"href": "/feed.xml", "type": "application/rss+xml"}]
},
"jsonld": [
{
"@type": "Article",
"headline": "10 Python Tips for Beginners",
"datePublished": "2024-01-15",
"author": {"@type": "Person", "name": "Jane Developer"}
}
]
}
This metadata is a goldmine for filtering! You can find exactly what you need without crawling a single page.
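For example, here is a quick sketch of metadata-based filtering (the og:type, author, and datePublished fields follow the example head_data structure above; whether a given site actually provides them will vary):
# Keep only article-type pages written by a particular author
articles = [
    u for u in urls
    if u["head_data"].get("meta", {}).get("og:type") == "article"
    and "jane" in u["head_data"].get("meta", {}).get("author", "").lower()
]

# Pull the publication date out of JSON-LD for date-based filtering
def published_date(u):
    for block in u["head_data"].get("jsonld", []):
        if isinstance(block, dict) and "datePublished" in block:
            return block["datePublished"]
    return None

dated_articles = [u for u in articles if published_date(u)]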
Smart URL-Based Filtering (Without Head Extraction)
When extract_head=False but you still provide a query, the seeder falls back to smart URL-based scoring:
# Fast filtering based on URL structure alone
config = SeedingConfig(
source="sitemap",
extract_head=False, # Don't fetch page metadata
query="python tutorial async",
scoring_method="bm25",
score_threshold=0.3
)
urls = await seeder.urls("example.com", config)
# URLs are scored based on:
# 1. Domain parts matching (e.g., 'python' in python.example.com)
# 2. Path segments (e.g., '/tutorials/python-async/')
# 3. Query parameters (e.g., '?topic=python')
# 4. Fuzzy matching using character n-grams
# Example URL scoring:
# https://example.com/tutorials/python/async-guide.html - High score
# https://example.com/blog/javascript-tips.html - Low score
This approach is much faster than head extraction while still giving you intelligent filtering!
Understanding the Results
Each URL in the results has this structure:
{
"url": "https://example.com/blog/python-tips.html",
"status": "valid", # "valid", "not_valid", or "unknown"
"head_data": { # Only if extract_head=True
"title": "Page Title",
"meta": {...},
"link": {...},
"jsonld": [...]
},
"relevance_score": 0.85 # Only if using BM25 scoring
}
Let's look at a real example:
config = SeedingConfig(
source="sitemap",
extract_head=True,
live_check=True
)
urls = await seeder.urls("blog.example.com", config)
# Analyze the results
for url in urls[:5]:
print(f"\n{'='*60}")
print(f"URL: {url['url']}")
print(f"Status: {url['status']}")
if url['head_data']:
data = url['head_data']
print(f"Title: {data.get('title', 'No title')}")
# Check content type
meta = data.get('meta', {})
content_type = meta.get('og:type', 'unknown')
print(f"Content Type: {content_type}")
# Publication date
pub_date = None
for jsonld in data.get('jsonld', []):
if isinstance(jsonld, dict):
pub_date = jsonld.get('datePublished')
if pub_date:
break
if pub_date:
print(f"Published: {pub_date}")
# Word count (if available)
word_count = meta.get('word_count')
if word_count:
print(f"Word Count: {word_count}")
Smart Filtering with BM25 Scoring
Now for the really cool part - intelligent filtering based on relevance!
A Primer on Relevance Scoring
BM25 is a ranking algorithm that measures how relevant a document is to a search query. With URL seeding, we can score URLs against a query using their metadata, before crawling anything.
Think of it this way:
- Traditional way: read every book in the library to find the ones about Python
- Smart way: check the titles and descriptions, score them, and read only the most relevant ones
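For reference, this is the textbook Okapi BM25 formula rather than anything specific to this library's implementation. Here $f(q_i, D)$ is how often query term $q_i$ appears in document $D$ (for URL seeding, $D$ is the URL's metadata text), $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length, and $k_1$, $b$ are tuning constants (commonly $k_1 \approx 1.2\text{-}2.0$, $b = 0.75$):

$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\cdot\frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$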
Query-Based Discovery
Here's how to use BM25 scoring:
config = SeedingConfig(
source="sitemap",
extract_head=True, # Required for scoring
query="python async tutorial", # What we're looking for
scoring_method="bm25", # Use BM25 algorithm
score_threshold=0.3 # Minimum relevance score
)
urls = await seeder.urls("realpython.com", config)
# Results are automatically sorted by relevance!
for url in urls[:5]:
print(f"Score: {url['relevance_score']:.2f} - {url['url']}")
print(f" Title: {url['head_data']['title']}")
Real-World Examples
Finding Documentation Pages
# Find API documentation
config = SeedingConfig(
source="sitemap",
extract_head=True,
query="API reference documentation endpoints",
scoring_method="bm25",
score_threshold=0.5,
max_urls=20
)
urls = await seeder.urls("docs.example.com", config)
# The highest scoring URLs will be API docs!
Discovering Product Pages
# Find specific products
config = SeedingConfig(
source="sitemap+cc", # Use both sources
extract_head=True,
query="wireless headphones noise canceling",
scoring_method="bm25",
score_threshold=0.4,
pattern="*/product/*" # Combine with pattern matching
)
urls = await seeder.urls("shop.example.com", config)
# Filter further by price (from metadata)
affordable = [
u for u in urls
if float(u['head_data'].get('meta', {}).get('product:price', '0')) < 200
]
Filtering News Articles
# Find recent news about AI
config = SeedingConfig(
source="sitemap",
extract_head=True,
query="artificial intelligence machine learning breakthrough",
scoring_method="bm25",
score_threshold=0.35
)
urls = await seeder.urls("technews.com", config)
# Filter by date
from datetime import datetime, timedelta, timezone
recent = []
cutoff = datetime.now(timezone.utc) - timedelta(days=7)  # timezone-aware, so it compares safely with the parsed dates
for url in urls:
# Check JSON-LD for publication date
for jsonld in url['head_data'].get('jsonld', []):
if 'datePublished' in jsonld:
pub_date = datetime.fromisoformat(jsonld['datePublished'].replace('Z', '+00:00'))
if pub_date > cutoff:
recent.append(url)
break
Complex Query Patterns
# Multi-concept queries
queries = [
"python async await concurrency tutorial",
"data science pandas numpy visualization",
"web scraping beautifulsoup selenium automation",
"machine learning tensorflow keras deep learning"
]
all_tutorials = []
for query in queries:
config = SeedingConfig(
source="sitemap",
extract_head=True,
query=query,
scoring_method="bm25",
score_threshold=0.4,
max_urls=10 # Top 10 per topic
)
urls = await seeder.urls("learning-platform.com", config)
all_tutorials.extend(urls)
# Remove duplicates while preserving order
seen = set()
unique_tutorials = []
for url in all_tutorials:
if url['url'] not in seen:
seen.add(url['url'])
unique_tutorials.append(url)
print(f"Found {len(unique_tutorials)} unique tutorials across all topics")
Scaling Up: Multiple Domains
URL seeding really shines when you need to discover URLs across several websites.
The many_urls Method
# Discover URLs from multiple domains in parallel
domains = ["site1.com", "site2.com", "site3.com"]
config = SeedingConfig(
source="sitemap",
extract_head=True,
query="python tutorial",
scoring_method="bm25",
score_threshold=0.3
)
# Returns a dictionary: {domain: [urls]}
results = await seeder.many_urls(domains, config)
# Process results
for domain, urls in results.items():
print(f"\n{domain}: Found {len(urls)} relevant URLs")
if urls:
top = urls[0] # Highest scoring
print(f" Top result: {top['url']}")
print(f" Score: {top['relevance_score']:.2f}")
Cross-Domain Examples
Competitor Analysis
# Analyze content strategies across competitors
competitors = [
"competitor1.com",
"competitor2.com",
"competitor3.com"
]
config = SeedingConfig(
source="sitemap",
extract_head=True,
pattern="*/blog/*",
max_urls=100
)
results = await seeder.many_urls(competitors, config)
# Analyze content types
for domain, urls in results.items():
content_types = {}
for url in urls:
# Extract content type from metadata
og_type = url['head_data'].get('meta', {}).get('og:type', 'unknown')
content_types[og_type] = content_types.get(og_type, 0) + 1
print(f"\n{domain} content distribution:")
for ctype, count in sorted(content_types.items(), key=lambda x: x[1], reverse=True):
print(f" {ctype}: {count}")
Industry Research
# Research Python tutorials across educational sites
educational_sites = [
"realpython.com",
"pythontutorial.net",
"learnpython.org",
"python.org"
]
config = SeedingConfig(
source="sitemap",
extract_head=True,
query="beginner python tutorial basics",
scoring_method="bm25",
score_threshold=0.3,
max_urls=20 # Per site
)
results = await seeder.many_urls(educational_sites, config)
# Find the best beginner tutorials
all_tutorials = []
for domain, urls in results.items():
for url in urls:
url['domain'] = domain # Add domain info
all_tutorials.append(url)
# Sort by relevance across all domains
all_tutorials.sort(key=lambda x: x['relevance_score'], reverse=True)
print("Top 10 Python tutorials for beginners across all sites:")
for i, tutorial in enumerate(all_tutorials[:10], 1):
print(f"{i}. [{tutorial['relevance_score']:.2f}] {tutorial['head_data']['title']}")
print(f" {tutorial['url']}")
print(f" From: {tutorial['domain']}")
Multi-Site Monitoring
# Monitor news about your company across multiple sources
news_sites = [
"techcrunch.com",
"theverge.com",
"wired.com",
"arstechnica.com"
]
company_name = "YourCompany"
config = SeedingConfig(
source="cc", # Common Crawl for recent content
extract_head=True,
query=f"{company_name} announcement news",
scoring_method="bm25",
score_threshold=0.5, # High threshold for relevance
max_urls=10
)
results = await seeder.many_urls(news_sites, config)
# Collect all mentions
mentions = []
for domain, urls in results.items():
mentions.extend(urls)
if mentions:
print(f"Found {len(mentions)} mentions of {company_name}:")
for mention in mentions:
print(f"\n- {mention['head_data']['title']}")
print(f" {mention['url']}")
print(f" Score: {mention['relevance_score']:.2f}")
else:
print(f"No recent mentions of {company_name} found")
Advanced Integration Patterns
Let's bring it all together with a complete real-world example.
Building a Research Assistant
Here's a complete example that combines intelligent discovery, scoring, filtering, and crawling:
import asyncio
from datetime import datetime
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig
class ResearchAssistant:
def __init__(self):
self.seeder = None
async def __aenter__(self):
self.seeder = AsyncUrlSeeder()
await self.seeder.__aenter__()
return self
async def __aexit__(self, exc_type, exc_val, exc_tb):
if self.seeder:
await self.seeder.__aexit__(exc_type, exc_val, exc_tb)
async def research_topic(self, topic, domains, max_articles=20):
"""Research a topic across multiple domains."""
print(f"🔬 Researching '{topic}' across {len(domains)} domains...")
# Step 1: Discover relevant URLs
config = SeedingConfig(
source="sitemap+cc", # Maximum coverage
extract_head=True, # Get metadata
query=topic, # Research topic
scoring_method="bm25", # Smart scoring
score_threshold=0.4, # Quality threshold
max_urls=10, # Per domain
concurrency=20, # Fast discovery
verbose=True
)
# Discover across all domains
discoveries = await self.seeder.many_urls(domains, config)
# Step 2: Collect and rank all articles
all_articles = []
for domain, urls in discoveries.items():
for url in urls:
url['domain'] = domain
all_articles.append(url)
# Sort by relevance
all_articles.sort(key=lambda x: x['relevance_score'], reverse=True)
# Take top articles
top_articles = all_articles[:max_articles]
print(f"\n📊 Found {len(all_articles)} relevant articles")
print(f"📌 Selected top {len(top_articles)} for deep analysis")
# Step 3: Show what we're about to crawl
print("\n🎯 Articles to analyze:")
for i, article in enumerate(top_articles[:5], 1):
print(f"\n{i}. {article['head_data']['title']}")
print(f" Score: {article['relevance_score']:.2f}")
print(f" Source: {article['domain']}")
print(f" URL: {article['url'][:60]}...")
# Step 4: Crawl the selected articles
print(f"\n🚀 Deep crawling {len(top_articles)} articles...")
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
only_text=True,
word_count_threshold=200, # Substantial content only
stream=True
)
# Extract URLs and crawl all articles
article_urls = [article['url'] for article in top_articles]
results = []
crawl_results = await crawler.arun_many(article_urls, config=config)
async for result in crawl_results:
if result.success:
results.append({
'url': result.url,
'title': result.metadata.get('title', 'No title'),
'content': result.markdown.raw_markdown,
'domain': next(a['domain'] for a in top_articles if a['url'] == result.url),
'score': next(a['relevance_score'] for a in top_articles if a['url'] == result.url)
})
print(f"✓ Crawled: {result.url[:60]}...")
# Step 5: Analyze and summarize
print(f"\n📝 Analysis complete! Crawled {len(results)} articles")
return self.create_research_summary(topic, results)
def create_research_summary(self, topic, articles):
"""Create a research summary from crawled articles."""
summary = {
'topic': topic,
'timestamp': datetime.now().isoformat(),
'total_articles': len(articles),
'sources': {}
}
# Group by domain
for article in articles:
domain = article['domain']
if domain not in summary['sources']:
summary['sources'][domain] = []
summary['sources'][domain].append({
'title': article['title'],
'url': article['url'],
'score': article['score'],
'excerpt': article['content'][:500] + '...' if len(article['content']) > 500 else article['content']
})
return summary
# Use the research assistant
async def main():
async with ResearchAssistant() as assistant:
# Research Python async programming across multiple sources
topic = "python asyncio best practices performance optimization"
domains = [
"realpython.com",
"python.org",
"stackoverflow.com",
"medium.com"
]
summary = await assistant.research_topic(topic, domains, max_articles=15)
# Display results
print("\n" + "="*60)
print("RESEARCH SUMMARY")
print("="*60)
print(f"Topic: {summary['topic']}")
print(f"Date: {summary['timestamp']}")
print(f"Total Articles Analyzed: {summary['total_articles']}")
print("\nKey Findings by Source:")
for domain, articles in summary['sources'].items():
print(f"\n📚 {domain} ({len(articles)} articles)")
for article in articles[:2]: # Top 2 per domain
print(f"\n Title: {article['title']}")
print(f" Relevance: {article['score']:.2f}")
print(f" Preview: {article['excerpt'][:200]}...")
asyncio.run(main())
Performance Optimization Tips
- Use caching wisely
# First run - populate the cache
config = SeedingConfig(source="sitemap", extract_head=True, force=True)
urls = await seeder.urls("example.com", config)

# Subsequent runs - served from the cache (much faster)
config = SeedingConfig(source="sitemap", extract_head=True, force=False)
urls = await seeder.urls("example.com", config)
- Tune concurrency to the workload
# For many small requests (like HEAD checks)
config = SeedingConfig(concurrency=50, hits_per_sec=20)

# For fewer, heavier requests (like full head extraction)
config = SeedingConfig(concurrency=10, hits_per_sec=5)
- Stream large result sets
# When crawling many URLs, process results as they arrive
async with AsyncWebCrawler() as crawler:
    # `urls` is a list of URL strings; the config needs stream=True for async iteration
    crawl_results = await crawler.arun_many(urls, config=config)
    async for result in crawl_results:
        process_immediately(result)  # Handle each result right away - don't wait for the full batch
- Memory protection for large domains
The seeder uses bounded queues, so it stays memory-safe even on domains with millions of URLs:
# Safe for domains with 1M+ URLs
config = SeedingConfig(
source="cc+sitemap",
concurrency=50, # Queue size adapts to concurrency
max_urls=100000 # Process in batches if needed
)
# The seeder automatically manages memory by:
# - Using bounded queues (prevents RAM spikes)
# - Applying backpressure when queue is full
# - Processing URLs as they're discovered
Best Practices and Tips
Cache Management
The seeder automatically caches results to speed up repeated operations:
- Common Crawl cache: ~/.crawl4ai/seeder_cache/[index]_[domain]_[hash].jsonl
- Sitemap cache: ~/.crawl4ai/seeder_cache/sitemap_[domain]_[hash].jsonl
- HEAD data cache: ~/.cache/url_seeder/head/[hash].json
Caches expire after 7 days by default. Use force=True to refresh.
Pattern Matching Strategy
# Be specific when possible
good_pattern = "*/blog/2024/*.html" # Specific
bad_pattern = "*" # Too broad
# Combine patterns with metadata filtering
config = SeedingConfig(
pattern="*/articles/*",
extract_head=True
)
urls = await seeder.urls("news.com", config)
# Further filter by publish date, author, category, etc.
recent = [u for u in urls if is_recent(u['head_data'])]
Rate Limiting Considerations
# Be respectful of servers
config = SeedingConfig(
hits_per_sec=10, # Max 10 requests per second
concurrency=20 # But use 20 workers
)
# For your own servers
config = SeedingConfig(
hits_per_sec=None, # No limit
concurrency=100 # Go fast
)
Quick Reference
Common Patterns
# Blog post discovery
config = SeedingConfig(
source="sitemap",
pattern="*/blog/*",
extract_head=True,
query="your topic",
scoring_method="bm25"
)
# E-commerce product discovery
config = SeedingConfig(
source="sitemap+cc",
pattern="*/product/*",
extract_head=True,
live_check=True
)
# Documentation search
config = SeedingConfig(
source="sitemap",
pattern="*/docs/*",
extract_head=True,
query="API reference",
scoring_method="bm25",
score_threshold=0.5
)
# News monitoring
config = SeedingConfig(
source="cc",
extract_head=True,
query="company name",
scoring_method="bm25",
max_urls=50
)
Troubleshooting Guide
Issue | Solution |
---|---|
No URLs found | Try source="cc+sitemap", check the domain spelling |
Slow discovery | Increase concurrency, raise or remove the hits_per_sec limit |
Missing metadata | Make sure extract_head=True |
Low relevance scores | Refine your query, lower score_threshold |
Rate-limit errors | Reduce hits_per_sec and concurrency |
Memory issues on large sites | Cap results with max_urls, reduce concurrency |
Connections not closed | Use the context manager or call await seeder.close() |
Performance Benchmarks
Typical performance on a standard connection:
- Sitemap discovery: 100-1,000 URLs/second
- Common Crawl discovery: 50-500 URLs/second
- HEAD checks: 10-50 URLs/second
- Head extraction: 5-20 URLs/second
- BM25 scoring: 10,000+ URLs/second
Conclusion
URL seeding turns web crawling from blind exploration into a precise, surgical operation. By discovering and analyzing URLs before you crawl, you can:
- Save hours of crawling time
- Cut bandwidth usage by 90% or more
- Find exactly what you need
- Scale easily across multiple domains
Whether you're building a research tool, monitoring competitors, or creating a content aggregator, URL seeding lets you crawl smarter, not harder.
Smart URL Filtering
The seeder automatically filters out nonsense URLs that aren't useful for content crawling:
# Enabled by default
config = SeedingConfig(
source="sitemap",
filter_nonsense_urls=True # Default: True
)
# URLs that get filtered:
# - robots.txt, sitemap.xml, ads.txt
# - API endpoints (/api/, /v1/, .json)
# - Media files (.jpg, .mp4, .pdf)
# - Archives (.zip, .tar.gz)
# - Source code (.js, .css)
# - Admin/login pages
# - And many more...
Disabling the filter (not recommended):
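A minimal sketch using the filter_nonsense_urls flag from the SeedingConfig table above:
config = SeedingConfig(
    source="sitemap",
    filter_nonsense_urls=False  # Keep every URL, including robots.txt, media files, API endpoints, etc.
)
urls = await seeder.urls("example.com", config)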
Key Features Summary
- Parallel sitemap index processing: sitemap indexes are detected and their sub-sitemaps fetched in parallel automatically
- Memory protection: bounded queues prevent RAM spikes on large domains (1M+ URLs)
- Context manager support: automatic cleanup via async with statements
- URL-based scoring: smart filtering even without head extraction
- Smart URL filtering: utility/nonsense URLs are excluded automatically
- Dual caching: separate caches for URL lists and metadata
Now go forth and seed smartly! 🌱🚀