URL Seeding: The Smart Way to Crawl at Scale

Why URL Seeding?

There is more than one way to crawl the web, and each approach has its own strengths. Let's look at when to use URL seeding and when to use deep crawling.

Deep Crawling: Real-Time Discovery

Deep crawling is the right choice when you need:

  • Fresh, real-time data - discover pages as they are created
  • Dynamic exploration - follow links based on their content
  • Selective extraction - stop as soon as you find what you need

# Deep crawling example: Explore a website dynamically
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl_example():
    # Configure a 2-level deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,           # Crawl 2 levels deep
            include_external=False, # Stay within domain
            max_pages=50           # Limit for efficiency
        ),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        # Start crawling and follow links dynamically
        results = await crawler.arun("https://example.com", config=config)

        print(f"Discovered and crawled {len(results)} pages")
        for result in results[:3]:
            print(f"Found: {result.url} at depth {result.metadata.get('depth', 0)}")

asyncio.run(deep_crawl_example())

URL Seeding: Bulk Discovery

URL seeding shines when you need:

  • Comprehensive coverage - get thousands of URLs in seconds
  • Batch processing - filter before you crawl
  • Resource efficiency - know exactly what you will crawl before you start

# URL seeding example: Analyze all documentation
from crawl4ai import AsyncUrlSeeder, SeedingConfig

seeder = AsyncUrlSeeder()
config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    pattern="*/docs/*"
)

# Get ALL documentation URLs instantly
urls = await seeder.urls("example.com", config)
# 1000+ URLs discovered in seconds!

The Trade-Offs

| Aspect | Deep Crawling | URL Seeding |
|---|---|---|
| Coverage | Discovers pages dynamically | Gets most existing URLs instantly |
| Freshness | Finds brand-new pages | May miss very recent pages |
| Speed | Slower, page by page | Lightning-fast bulk discovery |
| Resource usage | Higher - crawls in order to discover | Lower - discovers first, then crawls |
| Control | Can stop mid-crawl | Pre-filter before crawling |

When to Use Each

Choose deep crawling when:

  • You need the absolute latest content
  • You are hunting for specific information
  • The site structure is unknown or dynamic
  • You want to stop as soon as you find what you are looking for

Choose URL seeding when:

  • You need to analyze a large portion of a website
  • You want to filter URLs before crawling
  • You are doing comparative analysis
  • You need to optimize resource usage

The magic happens when you understand both approaches and pick the right tool for the job. Sometimes you will even combine them: use URL seeding for bulk discovery, then deep crawl specific sections to pick up the latest updates, as sketched below.
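For instance, here is a minimal sketch of that combined pattern, assuming a placeholder domain and a */docs/* section; it only uses the AsyncUrlSeeder, AsyncWebCrawler, and BFSDeepCrawlStrategy APIs shown elsewhere in this guide:

import asyncio
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def seed_then_deep_crawl():
    # Step 1: Bulk discovery - grab the documentation section from the sitemap
    async with AsyncUrlSeeder() as seeder:
        seed_config = SeedingConfig(source="sitemap", pattern="*/docs/*", max_urls=20)
        seeds = await seeder.urls("example.com", seed_config)

    # Step 2: Deep crawl a few seeded sections one level deep to catch fresh pages
    crawl_config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1, include_external=False, max_pages=10)
    )
    async with AsyncWebCrawler() as crawler:
        for seed in seeds[:3]:  # Keep the example small
            results = await crawler.arun(seed["url"], config=crawl_config)
            print(f"{seed['url']}: crawled {len(results)} pages")

asyncio.run(seed_then_deep_crawl())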

Your First URL Seeding Adventure

Let's see the magic in action. We'll discover blog posts about Python, filter for tutorials, and crawl only those pages.

import asyncio
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig

async def smart_blog_crawler():
    # Step 1: Create our URL discoverer
    seeder = AsyncUrlSeeder()

    # Step 2: Configure discovery - let's find all blog posts
    config = SeedingConfig(
        source="sitemap",           # Use the website's sitemap
        pattern="*/blog/*.html",    # Only blog posts
        extract_head=True,          # Get page metadata
        max_urls=100               # Limit for this example
    )

    # Step 3: Discover URLs from the Python blog
    print("🔍 Discovering blog posts...")
    urls = await seeder.urls("realpython.com", config)
    print(f"✅ Found {len(urls)} blog posts")

    # Step 4: Filter for Python tutorials (using metadata!)
    tutorials = [
        url for url in urls 
        if url["status"] == "valid" and 
        any(keyword in str(url["head_data"]).lower() 
            for keyword in ["tutorial", "guide", "how to"])
    ]
    print(f"📚 Filtered to {len(tutorials)} tutorials")

    # Step 5: Show what we found
    print("\n🎯 Found these tutorials:")
    for tutorial in tutorials[:5]:  # First 5
        title = tutorial["head_data"].get("title", "No title")
        print(f"  - {title}")
        print(f"    {tutorial['url']}")

    # Step 6: Now crawl ONLY these relevant pages
    print("\n🚀 Crawling tutorials...")
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            only_text=True,
            word_count_threshold=300,  # Only substantial articles
            stream=True                # Stream results so we can use "async for" below
        )

        # Extract URLs and crawl them
        tutorial_urls = [t["url"] for t in tutorials[:10]]
        results = await crawler.arun_many(tutorial_urls, config=config)

        successful = 0
        async for result in results:
            if result.success:
                successful += 1
                print(f"✓ Crawled: {result.url[:60]}...")

        print(f"\n✨ Successfully crawled {successful} tutorials!")

# Run it!
asyncio.run(smart_blog_crawler())

What Just Happened?

  1. We discovered all the blog URLs from the sitemap
  2. We filtered them using metadata (no crawling required!)
  3. We crawled only the relevant tutorials
  4. We saved a huge amount of time and bandwidth

That is the power of URL seeding: you see everything before you crawl anything.

Understanding the URL Seeder

Now that you've seen the magic, let's look at how it works.

Basic Usage

Creating a URL seeder is straightforward:

from crawl4ai import AsyncUrlSeeder, SeedingConfig

# Method 1: Manual cleanup
seeder = AsyncUrlSeeder()
try:
    config = SeedingConfig(source="sitemap")
    urls = await seeder.urls("example.com", config)
finally:
    await seeder.close()

# Method 2: Context manager (recommended)
async with AsyncUrlSeeder() as seeder:
    config = SeedingConfig(source="sitemap")
    urls = await seeder.urls("example.com", config)
    # Automatically cleaned up on exit

The seeder can discover URLs from two powerful sources:

1. Sitemaps (Fastest)

# Discover from sitemap
config = SeedingConfig(source="sitemap")
urls = await seeder.urls("example.com", config)

A sitemap is an XML file that websites create specifically to list their URLs. It's like picking up the menu at a restaurant: everything is laid out up front.
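To make that concrete, here is a minimal sketch of reading a sitemap by hand (this is not the seeder's internal code, just an illustration of the file it consumes; the domain is a placeholder):

import urllib.request
import xml.etree.ElementTree as ET

# Fetch a sitemap and list the URLs it declares
with urllib.request.urlopen("https://example.com/sitemap.xml") as resp:
    tree = ET.fromstring(resp.read())

# <loc> elements live under the sitemaps.org namespace
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in tree.findall(".//sm:loc", ns)]
print(f"The sitemap lists {len(urls)} URLs")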

Sitemap index support: For large sites like TechCrunch that use a sitemap index (a sitemap of sitemaps), the seeder automatically detects it and processes all child sitemaps in parallel:

<!-- Example sitemap index -->
<sitemapindex>
  <sitemap>
    <loc>https://techcrunch.com/sitemap-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://techcrunch.com/sitemap-2.xml</loc>
  </sitemap>
  <!-- ... more sitemaps ... -->
</sitemapindex>

The seeder handles this transparently: you automatically get all the URLs from every child sitemap.

2. Common Crawl (Most Comprehensive)

# Discover from Common Crawl
config = SeedingConfig(source="cc")
urls = await seeder.urls("example.com", config)

Common Crawl is a massive public dataset that regularly crawls the entire web. Using it is like having access to a pre-built index of the internet.
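If you are curious what that index looks like, you can query the public Common Crawl index API directly. This is a rough sketch and separate from how the seeder fetches its data; the collection name below is a placeholder, so pick a current one from index.commoncrawl.org:

import json
import urllib.parse
import urllib.request

# Query the Common Crawl CDX index for captures of a domain.
# "CC-MAIN-2024-33" is a placeholder collection name - check index.commoncrawl.org for current ones.
endpoint = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"
params = urllib.parse.urlencode({"url": "example.com/*", "output": "json", "limit": "10"})

with urllib.request.urlopen(f"{endpoint}?{params}") as resp:
    records = [json.loads(line) for line in resp.read().decode().splitlines()]

for record in records:
    print(record["url"], record.get("status"))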

3. Both Sources (Maximum Coverage)

# Use both sources
config = SeedingConfig(source="sitemap+cc")
urls = await seeder.urls("example.com", config)

Configuration Magic: SeedingConfig

The SeedingConfig object is your control panel. Here is everything you can configure (a combined example follows the table):

| Parameter | Type | Default | Description |
|---|---|---|---|
| source | str | "sitemap+cc" | URL source: "cc" (Common Crawl), "sitemap", or "sitemap+cc" |
| pattern | str | "*" | URL pattern filter (e.g. "*/blog/*", "*.html") |
| extract_head | bool | False | Extract metadata from the page <head> |
| live_check | bool | False | Verify that URLs are reachable |
| max_urls | int | -1 | Maximum number of URLs to return (-1 = no limit) |
| concurrency | int | 10 | Parallel workers for fetching |
| hits_per_sec | int | 5 | Request rate limit |
| force | bool | False | Bypass the cache and fetch fresh data |
| verbose | bool | False | Show detailed progress |
| query | str | None | Search query for BM25 scoring |
| scoring_method | str | None | Scoring method (currently "bm25") |
| score_threshold | float | None | Minimum score for a URL to be included |
| filter_nonsense_urls | bool | True | Filter out utility URLs (robots.txt, etc.) |
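Putting several of these options together, a fully spelled-out configuration might look like the sketch below (the domain and query are placeholders; every parameter shown comes from the table above):

import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def main():
    config = SeedingConfig(
        source="sitemap+cc",      # Use both the sitemap and Common Crawl
        pattern="*/blog/*",       # Keep only blog URLs
        extract_head=True,        # Pull <head> metadata for filtering
        live_check=False,         # Skip reachability checks for speed
        max_urls=500,             # Cap the result size
        concurrency=20,           # 20 parallel workers
        hits_per_sec=10,          # Stay polite: at most 10 requests/second
        query="python tutorial",  # Score URLs against this query...
        scoring_method="bm25",    # ...using BM25
        score_threshold=0.3,      # Drop anything scoring below 0.3
        verbose=True              # Show progress
    )

    async with AsyncUrlSeeder() as seeder:
        urls = await seeder.urls("example.com", config)
        print(f"{len(urls)} URLs survived filtering and scoring")

asyncio.run(main())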

Pattern Matching Examples

# Match all blog posts
config = SeedingConfig(pattern="*/blog/*")

# Match only HTML files
config = SeedingConfig(pattern="*.html")

# Match product pages
config = SeedingConfig(pattern="*/product/*")

# Match everything except admin pages
config = SeedingConfig(pattern="*")
# Then filter: urls = [u for u in urls if "/admin/" not in u["url"]]

URL Validation: Live Checking

Sometimes you need to know whether a URL is actually reachable. That's where live checking comes in:

config = SeedingConfig(
    source="sitemap",
    live_check=True,  # Verify each URL is accessible
    concurrency=20    # Check 20 URLs in parallel
)

urls = await seeder.urls("example.com", config)

# Now you can filter by status
live_urls = [u for u in urls if u["status"] == "valid"]
dead_urls = [u for u in urls if u["status"] == "not_valid"]

print(f"Live URLs: {len(live_urls)}")
print(f"Dead URLs: {len(dead_urls)}")

When to use live checking:

  • Before large-scale crawl operations
  • When working with older sitemaps
  • When data freshness is critical

When to skip it:

  • Quick exploration
  • When you trust the source
  • When speed matters more than accuracy

The Power of Metadata: Head Extraction

This is where URL seeding gets really powerful. Instead of crawling entire pages, you extract just the metadata:

config = SeedingConfig(
    extract_head=True  # Extract metadata from <head> section
)

urls = await seeder.urls("example.com", config)

# Now each URL has rich metadata
for url in urls[:3]:
    print(f"\nURL: {url['url']}")
    print(f"Title: {url['head_data'].get('title')}")

    meta = url['head_data'].get('meta', {})
    print(f"Description: {meta.get('description')}")
    print(f"Keywords: {meta.get('keywords')}")

    # Even Open Graph data!
    print(f"OG Image: {meta.get('og:image')}")

What Can We Extract?

Head extraction gives you a wealth of valuable information:

# Example of extracted head_data
{
    "title": "10 Python Tips for Beginners",
    "charset": "utf-8",
    "lang": "en",
    "meta": {
        "description": "Learn essential Python tips...",
        "keywords": "python, programming, tutorial",
        "author": "Jane Developer",
        "viewport": "width=device-width, initial-scale=1",

        # Open Graph tags
        "og:title": "10 Python Tips for Beginners",
        "og:description": "Essential Python tips for new programmers",
        "og:image": "https://example.com/python-tips.jpg",
        "og:type": "article",

        # Twitter Card tags
        "twitter:card": "summary_large_image",
        "twitter:title": "10 Python Tips",

        # Dublin Core metadata
        "dc.creator": "Jane Developer",
        "dc.date": "2024-01-15"
    },
    "link": {
        "canonical": [{"href": "https://example.com/blog/python-tips"}],
        "alternate": [{"href": "/feed.xml", "type": "application/rss+xml"}]
    },
    "jsonld": [
        {
            "@type": "Article",
            "headline": "10 Python Tips for Beginners",
            "datePublished": "2024-01-15",
            "author": {"@type": "Person", "name": "Jane Developer"}
        }
    ]
}

This metadata is a gold mine for filtering: you can find what you need without crawling a single page.
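As a small illustration, continuing from the urls list above, you can select pages purely on this metadata (the field names follow the head_data structure shown here; the keyword is arbitrary):

# Filter discovered URLs using only the extracted metadata - nothing is crawled yet
articles = []
for url in urls:
    head = url.get("head_data") or {}
    meta = head.get("meta", {})

    is_article = meta.get("og:type") == "article"
    text = ((head.get("title") or "") + " " + (meta.get("description") or "")).lower()

    if is_article and "python" in text:
        articles.append(url)

print(f"Selected {len(articles)} Python articles without crawling a single page")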

Smart URL-Based Filtering (Without Head Extraction)

When extract_head=False but you still provide a query, the seeder falls back to smart URL-based scoring:

# Fast filtering based on URL structure alone
config = SeedingConfig(
    source="sitemap",
    extract_head=False,  # Don't fetch page metadata
    query="python tutorial async",
    scoring_method="bm25",
    score_threshold=0.3
)

urls = await seeder.urls("example.com", config)

# URLs are scored based on:
# 1. Domain parts matching (e.g., 'python' in python.example.com)
# 2. Path segments (e.g., '/tutorials/python-async/')
# 3. Query parameters (e.g., '?topic=python')
# 4. Fuzzy matching using character n-grams

# Example URL scoring:
# https://example.com/tutorials/python/async-guide.html - High score
# https://example.com/blog/javascript-tips.html - Low score

This approach is much faster than head extraction while still giving you intelligent filtering.

Understanding the Results

Each URL in the results has the following structure:

{
    "url": "https://example.com/blog/python-tips.html",
    "status": "valid",        # "valid", "not_valid", or "unknown"
    "head_data": {            # Only if extract_head=True
        "title": "Page Title",
        "meta": {...},
        "link": {...},
        "jsonld": [...]
    },
    "relevance_score": 0.85   # Only if using BM25 scoring
}

Let's look at a real example:

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    live_check=True
)

urls = await seeder.urls("blog.example.com", config)

# Analyze the results
for url in urls[:5]:
    print(f"\n{'='*60}")
    print(f"URL: {url['url']}")
    print(f"Status: {url['status']}")

    if url['head_data']:
        data = url['head_data']
        print(f"Title: {data.get('title', 'No title')}")

        # Check content type
        meta = data.get('meta', {})
        content_type = meta.get('og:type', 'unknown')
        print(f"Content Type: {content_type}")

        # Publication date
        pub_date = None
        for jsonld in data.get('jsonld', []):
            if isinstance(jsonld, dict):
                pub_date = jsonld.get('datePublished')
                if pub_date:
                    break

        if pub_date:
            print(f"Published: {pub_date}")

        # Word count (if available)
        word_count = meta.get('word_count')
        if word_count:
            print(f"Word Count: {word_count}")

Smart Filtering with BM25 Scoring

Now for the really cool part: intelligent, relevance-based filtering.

An Introduction to Relevance Scoring

BM25 is a ranking algorithm that measures how relevant a document is to a search query. With URL seeding, we can score URLs against a query using their metadata before crawling them (a small standalone sketch of the idea follows the analogy below).

Think of it this way:

  • The traditional way: read every book in the library to find the ones about Python
  • The smart way: check the titles and descriptions, score them, and read only the most relevant ones
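If you want to see the scoring idea on its own, here is a tiny standalone sketch using the rank_bm25 package (purely illustrative of how BM25 ranks metadata against a query; it is not how the seeder implements scoring internally):

from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Pretend these are titles/descriptions pulled from <head> metadata
documents = [
    "python asyncio tutorial for beginners",
    "javascript promises and async await explained",
    "advanced python async patterns and performance",
]
bm25 = BM25Okapi([doc.split() for doc in documents])

# Score every document against the query, then rank highest first
scores = bm25.get_scores("python async tutorial".split())
for doc, score in sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.2f}  {doc}")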

Query-Based Discovery

Here is how to use BM25 scoring:

config = SeedingConfig(
    source="sitemap",
    extract_head=True,           # Required for scoring
    query="python async tutorial",  # What we're looking for
    scoring_method="bm25",       # Use BM25 algorithm
    score_threshold=0.3          # Minimum relevance score
)

urls = await seeder.urls("realpython.com", config)

# Results are automatically sorted by relevance!
for url in urls[:5]:
    print(f"Score: {url['relevance_score']:.2f} - {url['url']}")
    print(f"  Title: {url['head_data']['title']}")

Real-World Examples

Finding Documentation Pages

# Find API documentation
config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="API reference documentation endpoints",
    scoring_method="bm25",
    score_threshold=0.5,
    max_urls=20
)

urls = await seeder.urls("docs.example.com", config)

# The highest scoring URLs will be API docs!

Discovering Product Pages

# Find specific products
config = SeedingConfig(
    source="sitemap+cc",  # Use both sources
    extract_head=True,
    query="wireless headphones noise canceling",
    scoring_method="bm25",
    score_threshold=0.4,
    pattern="*/product/*"  # Combine with pattern matching
)

urls = await seeder.urls("shop.example.com", config)

# Filter further by price (from metadata)
affordable = [
    u for u in urls 
    if float(u['head_data'].get('meta', {}).get('product:price', '0')) < 200
]

Filtering News Articles

# Find recent news about AI
config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="artificial intelligence machine learning breakthrough",
    scoring_method="bm25",
    score_threshold=0.35
)

urls = await seeder.urls("technews.com", config)

# Filter by date
from datetime import datetime, timedelta

recent = []
cutoff = datetime.now() - timedelta(days=7)

for url in urls:
    # Check JSON-LD for publication date
    for jsonld in url['head_data'].get('jsonld', []):
        if 'datePublished' in jsonld:
            pub_date = datetime.fromisoformat(jsonld['datePublished'].replace('Z', '+00:00'))
            if pub_date > cutoff:
                recent.append(url)
                break

Complex Query Patterns

# Multi-concept queries
queries = [
    "python async await concurrency tutorial",
    "data science pandas numpy visualization",
    "web scraping beautifulsoup selenium automation",
    "machine learning tensorflow keras deep learning"
]

all_tutorials = []

for query in queries:
    config = SeedingConfig(
        source="sitemap",
        extract_head=True,
        query=query,
        scoring_method="bm25",
        score_threshold=0.4,
        max_urls=10  # Top 10 per topic
    )

    urls = await seeder.urls("learning-platform.com", config)
    all_tutorials.extend(urls)

# Remove duplicates while preserving order
seen = set()
unique_tutorials = []
for url in all_tutorials:
    if url['url'] not in seen:
        seen.add(url['url'])
        unique_tutorials.append(url)

print(f"Found {len(unique_tutorials)} unique tutorials across all topics")

Scaling Up: Multiple Domains

URL seeding really shines when you need to discover URLs across multiple websites.

The many_urls Method

# Discover URLs from multiple domains in parallel
domains = ["site1.com", "site2.com", "site3.com"]

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="python tutorial",
    scoring_method="bm25",
    score_threshold=0.3
)

# Returns a dictionary: {domain: [urls]}
results = await seeder.many_urls(domains, config)

# Process results
for domain, urls in results.items():
    print(f"\n{domain}: Found {len(urls)} relevant URLs")
    if urls:
        top = urls[0]  # Highest scoring
        print(f"  Top result: {top['url']}")
        print(f"  Score: {top['relevance_score']:.2f}")

Cross-Domain Examples

Competitor Analysis

# Analyze content strategies across competitors
competitors = [
    "competitor1.com",
    "competitor2.com", 
    "competitor3.com"
]

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    pattern="*/blog/*",
    max_urls=100
)

results = await seeder.many_urls(competitors, config)

# Analyze content types
for domain, urls in results.items():
    content_types = {}

    for url in urls:
        # Extract content type from metadata
        og_type = url['head_data'].get('meta', {}).get('og:type', 'unknown')
        content_types[og_type] = content_types.get(og_type, 0) + 1

    print(f"\n{domain} content distribution:")
    for ctype, count in sorted(content_types.items(), key=lambda x: x[1], reverse=True):
        print(f"  {ctype}: {count}")

Industry Research

# Research Python tutorials across educational sites
educational_sites = [
    "realpython.com",
    "pythontutorial.net",
    "learnpython.org",
    "python.org"
]

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="beginner python tutorial basics",
    scoring_method="bm25",
    score_threshold=0.3,
    max_urls=20  # Per site
)

results = await seeder.many_urls(educational_sites, config)

# Find the best beginner tutorials
all_tutorials = []
for domain, urls in results.items():
    for url in urls:
        url['domain'] = domain  # Add domain info
        all_tutorials.append(url)

# Sort by relevance across all domains
all_tutorials.sort(key=lambda x: x['relevance_score'], reverse=True)

print("Top 10 Python tutorials for beginners across all sites:")
for i, tutorial in enumerate(all_tutorials[:10], 1):
    print(f"{i}. [{tutorial['relevance_score']:.2f}] {tutorial['head_data']['title']}")
    print(f"   {tutorial['url']}")
    print(f"   From: {tutorial['domain']}")

Multi-Site Monitoring

# Monitor news about your company across multiple sources
news_sites = [
    "techcrunch.com",
    "theverge.com",
    "wired.com",
    "arstechnica.com"
]

company_name = "YourCompany"

config = SeedingConfig(
    source="cc",  # Common Crawl for recent content
    extract_head=True,
    query=f"{company_name} announcement news",
    scoring_method="bm25",
    score_threshold=0.5,  # High threshold for relevance
    max_urls=10
)

results = await seeder.many_urls(news_sites, config)

# Collect all mentions
mentions = []
for domain, urls in results.items():
    mentions.extend(urls)

if mentions:
    print(f"Found {len(mentions)} mentions of {company_name}:")
    for mention in mentions:
        print(f"\n- {mention['head_data']['title']}")
        print(f"  {mention['url']}")
        print(f"  Score: {mention['relevance_score']:.2f}")
else:
    print(f"No recent mentions of {company_name} found")

Advanced Integration Patterns

Let's bring it all together with a real-world example.

Building a Research Assistant

Here is a complete example of intelligent discovery, scoring, filtering, and crawling:

import asyncio
from datetime import datetime
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig

class ResearchAssistant:
    def __init__(self):
        self.seeder = None

    async def __aenter__(self):
        self.seeder = AsyncUrlSeeder()
        await self.seeder.__aenter__()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.seeder:
            await self.seeder.__aexit__(exc_type, exc_val, exc_tb)

    async def research_topic(self, topic, domains, max_articles=20):
        """Research a topic across multiple domains."""

        print(f"🔬 Researching '{topic}' across {len(domains)} domains...")

        # Step 1: Discover relevant URLs
        config = SeedingConfig(
            source="sitemap+cc",     # Maximum coverage
            extract_head=True,       # Get metadata
            query=topic,             # Research topic
            scoring_method="bm25",   # Smart scoring
            score_threshold=0.4,     # Quality threshold
            max_urls=10,             # Per domain
            concurrency=20,          # Fast discovery
            verbose=True
        )

        # Discover across all domains
        discoveries = await self.seeder.many_urls(domains, config)

        # Step 2: Collect and rank all articles
        all_articles = []
        for domain, urls in discoveries.items():
            for url in urls:
                url['domain'] = domain
                all_articles.append(url)

        # Sort by relevance
        all_articles.sort(key=lambda x: x['relevance_score'], reverse=True)

        # Take top articles
        top_articles = all_articles[:max_articles]

        print(f"\n📊 Found {len(all_articles)} relevant articles")
        print(f"📌 Selected top {len(top_articles)} for deep analysis")

        # Step 3: Show what we're about to crawl
        print("\n🎯 Articles to analyze:")
        for i, article in enumerate(top_articles[:5], 1):
            print(f"\n{i}. {article['head_data']['title']}")
            print(f"   Score: {article['relevance_score']:.2f}")
            print(f"   Source: {article['domain']}")
            print(f"   URL: {article['url'][:60]}...")

        # Step 4: Crawl the selected articles
        print(f"\n🚀 Deep crawling {len(top_articles)} articles...")

        async with AsyncWebCrawler() as crawler:
            config = CrawlerRunConfig(
                only_text=True,
                word_count_threshold=200,  # Substantial content only
                stream=True
            )

            # Extract URLs and crawl all articles
            article_urls = [article['url'] for article in top_articles]
            results = []
            crawl_results = await crawler.arun_many(article_urls, config=config)
            async for result in crawl_results:
                if result.success:
                    results.append({
                        'url': result.url,
                        'title': result.metadata.get('title', 'No title'),
                        'content': result.markdown.raw_markdown,
                        'domain': next(a['domain'] for a in top_articles if a['url'] == result.url),
                        'score': next(a['relevance_score'] for a in top_articles if a['url'] == result.url)
                    })
                    print(f"✓ Crawled: {result.url[:60]}...")

        # Step 5: Analyze and summarize
        print(f"\n📝 Analysis complete! Crawled {len(results)} articles")

        return self.create_research_summary(topic, results)

    def create_research_summary(self, topic, articles):
        """Create a research summary from crawled articles."""

        summary = {
            'topic': topic,
            'timestamp': datetime.now().isoformat(),
            'total_articles': len(articles),
            'sources': {}
        }

        # Group by domain
        for article in articles:
            domain = article['domain']
            if domain not in summary['sources']:
                summary['sources'][domain] = []

            summary['sources'][domain].append({
                'title': article['title'],
                'url': article['url'],
                'score': article['score'],
                'excerpt': article['content'][:500] + '...' if len(article['content']) > 500 else article['content']
            })

        return summary

# Use the research assistant
async def main():
    async with ResearchAssistant() as assistant:
        # Research Python async programming across multiple sources
        topic = "python asyncio best practices performance optimization"
        domains = [
            "realpython.com",
            "python.org",
            "stackoverflow.com",
            "medium.com"
        ]

        summary = await assistant.research_topic(topic, domains, max_articles=15)

    # Display results
    print("\n" + "="*60)
    print("RESEARCH SUMMARY")
    print("="*60)
    print(f"Topic: {summary['topic']}")
    print(f"Date: {summary['timestamp']}")
    print(f"Total Articles Analyzed: {summary['total_articles']}")

    print("\nKey Findings by Source:")
    for domain, articles in summary['sources'].items():
        print(f"\n📚 {domain} ({len(articles)} articles)")
        for article in articles[:2]:  # Top 2 per domain
            print(f"\n  Title: {article['title']}")
            print(f"  Relevance: {article['score']:.2f}")
            print(f"  Preview: {article['excerpt'][:200]}...")

asyncio.run(main())

Performance Optimization Tips

  1. Use caching wisely
    # First run - populate cache
    config = SeedingConfig(source="sitemap", extract_head=True, force=True)
    urls = await seeder.urls("example.com", config)
    
    # Subsequent runs - use cache (much faster)
    config = SeedingConfig(source="sitemap", extract_head=True, force=False)
    urls = await seeder.urls("example.com", config)
    
  2. Tune concurrency
    # For many small requests (like HEAD checks)
    config = SeedingConfig(concurrency=50, hits_per_sec=20)
    
    # For fewer large requests (like full head extraction)
    config = SeedingConfig(concurrency=10, hits_per_sec=5)
    
  3. Stream large result sets
    # When crawling many URLs
    async with AsyncWebCrawler() as crawler:
        # Assuming urls is a list of URL strings and config was created with stream=True
        crawl_results = await crawler.arun_many(urls, config=config)
    
        # Process as they arrive
        async for result in crawl_results:
            process_immediately(result)  # Don't wait for all
    
  4. Memory protection for large domains

The seeder uses bounded queues to avoid memory problems when working with domains that contain millions of URLs:

# Safe for domains with 1M+ URLs
config = SeedingConfig(
    source="cc+sitemap",
    concurrency=50,  # Queue size adapts to concurrency
    max_urls=100000  # Process in batches if needed
)

# The seeder automatically manages memory by:
# - Using bounded queues (prevents RAM spikes)
# - Applying backpressure when queue is full
# - Processing URLs as they're discovered

Best Practices and Tips

Cache Management

The seeder automatically caches results to speed up repeated operations:

  • Common Crawl cache: ~/.crawl4ai/seeder_cache/[index]_[domain]_[hash].jsonl
  • Sitemap cache: ~/.crawl4ai/seeder_cache/sitemap_[domain]_[hash].jsonl
  • HEAD data cache: ~/.cache/url_seeder/head/[hash].json

Cached entries expire after 7 days by default. Use force=True to refresh.
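If you want to see how much the cache has grown, a quick sketch like the following works (the paths are the defaults listed above; adjust them if your installation differs):

from pathlib import Path

# Default cache locations used by the seeder (see the list above)
cache_dirs = [
    Path.home() / ".crawl4ai" / "seeder_cache",
    Path.home() / ".cache" / "url_seeder" / "head",
]

for cache_dir in cache_dirs:
    if cache_dir.exists():
        files = [f for f in cache_dir.glob("*") if f.is_file()]
        total_mb = sum(f.stat().st_size for f in files) / 1_000_000
        print(f"{cache_dir}: {len(files)} files, {total_mb:.1f} MB")
    else:
        print(f"{cache_dir}: not created yet")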

Pattern Matching Strategies

# Be specific when possible
good_pattern = "*/blog/2024/*.html"  # Specific
bad_pattern = "*"                     # Too broad

# Combine patterns with metadata filtering
config = SeedingConfig(
    pattern="*/articles/*",
    extract_head=True
)
urls = await seeder.urls("news.com", config)

# Further filter by publish date, author, category, etc.
recent = [u for u in urls if is_recent(u['head_data'])]

Rate Limiting Considerations

# Be respectful of servers
config = SeedingConfig(
    hits_per_sec=10,      # Max 10 requests per second
    concurrency=20        # But use 20 workers
)

# For your own servers
config = SeedingConfig(
    hits_per_sec=None,    # No limit
    concurrency=100       # Go fast
)

Quick Reference

Common Patterns

# Blog post discovery
config = SeedingConfig(
    source="sitemap",
    pattern="*/blog/*",
    extract_head=True,
    query="your topic",
    scoring_method="bm25"
)

# E-commerce product discovery
config = SeedingConfig(
    source="sitemap+cc",
    pattern="*/product/*",
    extract_head=True,
    live_check=True
)

# Documentation search
config = SeedingConfig(
    source="sitemap",
    pattern="*/docs/*",
    extract_head=True,
    query="API reference",
    scoring_method="bm25",
    score_threshold=0.5
)

# News monitoring
config = SeedingConfig(
    source="cc",
    extract_head=True,
    query="company name",
    scoring_method="bm25",
    max_urls=50
)

Troubleshooting Guide

| Problem | Solution |
|---|---|
| No URLs found | Try source="cc+sitemap"; check the domain spelling |
| Slow discovery | Lower concurrency, add a hits_per_sec limit |
| Missing metadata | Make sure extract_head=True |
| Low relevance scores | Refine the query, lower score_threshold |
| Rate-limit errors | Lower hits_per_sec and concurrency |
| Memory issues on large sites | Limit results with max_urls, lower concurrency |
| Connections not being closed | Use the context manager or call await seeder.close() |

Performance Benchmarks

Typical performance on a standard connection:

  • Sitemap discovery: 100-1,000 URLs/second
  • Common Crawl discovery: 50-500 URLs/second
  • HEAD checks: 10-50 URLs/second
  • Head extraction: 5-20 URLs/second
  • BM25 scoring: 10,000+ URLs/second

Conclusion

URL seeding turns web crawling from blind exploration into surgical precision. By discovering and analyzing URLs before you crawl, you can:

  • Save hours of crawling time
  • Cut bandwidth usage by 90% or more
  • Find exactly what you actually need
  • Scale effortlessly across multiple domains

Whether you are building research tools, monitoring competitors, or creating content aggregators, URL seeding lets you crawl smarter, not harder.

Smart URL Filtering

The seeder automatically filters out nonsense URLs that are useless for content crawling:

# Enabled by default
config = SeedingConfig(
    source="sitemap",
    filter_nonsense_urls=True  # Default: True
)

# URLs that get filtered:
# - robots.txt, sitemap.xml, ads.txt
# - API endpoints (/api/, /v1/, .json)
# - Media files (.jpg, .mp4, .pdf)
# - Archives (.zip, .tar.gz)
# - Source code (.js, .css)
# - Admin/login pages
# - And many more...

To disable the filtering (not recommended):

config = SeedingConfig(
    source="sitemap",
    filter_nonsense_urls=False  # Include ALL URLs
)

Key Features Summary

  1. Parallel sitemap index processing: sitemap indexes are detected and their child sitemaps processed in parallel automatically
  2. Memory protection: bounded queues prevent RAM issues on large domains (1M+ URLs)
  3. Context manager support: automatic cleanup via async with statements
  4. URL-based scoring: smart filtering even without head extraction
  5. Smart URL filtering: utility/nonsense URLs are excluded automatically
  6. Dual caching: separate caches for URL lists and metadata

Now go forth and seed smartly! 🌱🚀

