URL Seeding: The Smart Way to Crawl at Scale

Why URL Seeding?

Web crawling comes in different flavors, each with its own strengths. Let's understand when to use URL seeding versus deep crawling.

Deep Crawling: Real-Time Discovery

Deep crawling is perfect when you need:

  • Fresh, real-time data - discovering pages as they're created
  • Dynamic exploration - following links based on content
  • Selective extraction - stopping when you find what you need

# Deep crawling example: Explore a website dynamically
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl_example():
    # Configure a 2-level deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,           # Crawl 2 levels deep
            include_external=False, # Stay within domain
            max_pages=50           # Limit for efficiency
        ),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        # Start crawling and follow links dynamically
        results = await crawler.arun("https://example.com", config=config)

        print(f"Discovered and crawled {len(results)} pages")
        for result in results[:3]:
            print(f"Found: {result.url} at depth {result.metadata.get('depth', 0)}")

asyncio.run(deep_crawl_example())

URL Seeding: Bulk Discovery

URL seeding shines when you want:

  • Comprehensive coverage - get thousands of URLs in seconds
  • Bulk processing - filter before crawling
  • Resource efficiency - know exactly what you'll crawl

# URL seeding example: Analyze all documentation
from crawl4ai import AsyncUrlSeeder, SeedingConfig

seeder = AsyncUrlSeeder()
config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    pattern="*/docs/*"
)

# Get ALL documentation URLs instantly
urls = await seeder.urls("example.com", config)
# 1000+ URLs discovered in seconds!

The Trade-offs

| Aspect | Deep Crawling | URL Seeding |
|--------|---------------|-------------|
| Coverage | Discovers pages dynamically | Gets most existing URLs instantly |
| Freshness | Finds brand new pages | May miss very recent pages |
| Speed | Slower, page by page | Extremely fast bulk discovery |
| Resource Usage | Higher - crawls to discover | Lower - discovers then crawls |
| Control | Can stop mid-process | Pre-filters before crawling |

When to Use Each

Choose Deep Crawling when:

  • You need the absolute latest content
  • You're searching for specific information
  • The site structure is unknown or dynamic
  • You want to stop as soon as you find what you need

Choose URL Seeding when:

  • You need to analyze large portions of a site
  • You want to filter URLs before crawling
  • You're doing comparative analysis
  • You need to optimize resource usage

The magic happens when you understand both approaches and choose the right tool for your task. Sometimes you might even combine them - use URL seeding for bulk discovery, then deep crawl specific sections for the latest updates, as sketched below.
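
Here's a minimal sketch of that hybrid approach, reusing the seeder and crawler APIs shown throughout this guide (the domain and the /blog/ section are placeholders):

import asyncio
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def hybrid_discovery():
    # Step 1: URL seeding - grab the already-known URLs in bulk
    async with AsyncUrlSeeder() as seeder:
        seed_config = SeedingConfig(source="sitemap", pattern="*/docs/*", max_urls=200)
        known_urls = await seeder.urls("example.com", seed_config)
    print(f"Seeded {len(known_urls)} known URLs")

    # Step 2: Deep crawl one fast-moving section to catch brand-new pages
    deep_config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=1,             # Just one hop from the section root
            include_external=False,  # Stay within the domain
            max_pages=20             # Keep the fresh pass small
        )
    )
    async with AsyncWebCrawler() as crawler:
        fresh = await crawler.arun("https://example.com/blog/", config=deep_config)
        print(f"Deep crawl surfaced {len(fresh)} pages from the blog section")

asyncio.run(hybrid_discovery())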

Your First URL Seeding Adventure

Let's see the magic in action. We'll discover blog posts about Python, filter for tutorials, and crawl only those pages.

import asyncio
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig

async def smart_blog_crawler():
    # Step 1: Create our URL discoverer
    seeder = AsyncUrlSeeder()

    # Step 2: Configure discovery - let's find all blog posts
    config = SeedingConfig(
        source="sitemap",           # Use the website's sitemap
        pattern="*/blog/*.html",    # Only blog posts
        extract_head=True,          # Get page metadata
        max_urls=100               # Limit for this example
    )

    # Step 3: Discover URLs from the Python blog
    print("🔍 Discovering blog posts...")
    urls = await seeder.urls("realpython.com", config)
    print(f"✅ Found {len(urls)} blog posts")

    # Step 4: Filter for Python tutorials (using metadata!)
    tutorials = [
        url for url in urls 
        if url["status"] == "valid" and 
        any(keyword in str(url["head_data"]).lower() 
            for keyword in ["tutorial", "guide", "how to"])
    ]
    print(f"📚 Filtered to {len(tutorials)} tutorials")

    # Step 5: Show what we found
    print("\n🎯 Found these tutorials:")
    for tutorial in tutorials[:5]:  # First 5
        title = tutorial["head_data"].get("title", "No title")
        print(f"  - {title}")
        print(f"    {tutorial['url']}")

    # Step 6: Now crawl ONLY these relevant pages
    print("\n🚀 Crawling tutorials...")
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            only_text=True,
            word_count_threshold=300,  # Only substantial articles
            stream=True                # Stream results so we can iterate with async for
        )

        # Extract URLs and crawl them
        tutorial_urls = [t["url"] for t in tutorials[:10]]
        results = await crawler.arun_many(tutorial_urls, config=config)

        successful = 0
        async for result in results:
            if result.success:
                successful += 1
                print(f"✓ Crawled: {result.url[:60]}...")

        print(f"\n✨ Successfully crawled {successful} tutorials!")

# Run it!
asyncio.run(smart_blog_crawler())

What just happened?

  1. We discovered all blog URLs from the sitemap

  2. We filtered using metadata (no crawling needed!)

  3. We crawled only the relevant tutorials

  4. We saved tons of time and bandwidth

This is the power of URL seeding - you see everything before you crawl anything.

Understanding the URL Seeder

Now that you've seen the magic, let's understand how it works.

Basic Usage

Creating a URL seeder is simple:

from crawl4ai import AsyncUrlSeeder, SeedingConfig

# Method 1: Manual cleanup
seeder = AsyncUrlSeeder()
try:
    config = SeedingConfig(source="sitemap")
    urls = await seeder.urls("example.com", config)
finally:
    await seeder.close()

# Method 2: Context manager (recommended)
async with AsyncUrlSeeder() as seeder:
    config = SeedingConfig(source="sitemap")
    urls = await seeder.urls("example.com", config)
    # Automatically cleaned up on exit

The seeder can discover URLs from two powerful sources:

1. Sitemaps (Fastest)

# Discover from sitemap
config = SeedingConfig(source="sitemap")
urls = await seeder.urls("example.com", config)

Sitemaps are XML files that websites create specifically to list all their URLs. It's like getting a menu at a restaurant - everything is listed upfront.

Sitemap Index Support: For large websites like TechCrunch that use sitemap indexes (a sitemap of sitemaps), the seeder automatically detects and processes all sub-sitemaps in parallel:

<!-- Example sitemap index -->
<sitemapindex>
  <sitemap>
    <loc>https://techcrunch.com/sitemap-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://techcrunch.com/sitemap-2.xml</loc>
  </sitemap>
  <!-- ... more sitemaps ... -->
</sitemapindex>

The seeder handles this transparently - you'll get all URLs from all sub-sitemaps automatically!

2. Common Crawl (Most Comprehensive)

# Discover from Common Crawl
config = SeedingConfig(source="cc")
urls = await seeder.urls("example.com", config)

Common Crawl is a massive public dataset that regularly crawls the entire web. It's like having access to a pre-built index of the internet.

3. Both Sources (Maximum Coverage)

# Use both sources
config = SeedingConfig(source="sitemap+cc")
urls = await seeder.urls("example.com", config)

Configuration Magic: SeedingConfig

The SeedingConfig object is your control panel. Here's everything you can configure:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| source | str | "sitemap+cc" | URL source: "cc" (Common Crawl), "sitemap", or "sitemap+cc" |
| pattern | str | "*" | URL pattern filter (e.g., "*/blog/*", "*.html") |
| extract_head | bool | False | Extract metadata from page <head> |
| live_check | bool | False | Verify URLs are accessible |
| max_urls | int | -1 | Maximum URLs to return (-1 = unlimited) |
| concurrency | int | 10 | Parallel workers for fetching |
| hits_per_sec | int | 5 | Rate limit for requests |
| force | bool | False | Bypass cache, fetch fresh data |
| verbose | bool | False | Show detailed progress |
| query | str | None | Search query for BM25 scoring |
| scoring_method | str | None | Scoring method (currently "bm25") |
| score_threshold | float | None | Minimum score to include URL |
| filter_nonsense_urls | bool | True | Filter out utility URLs (robots.txt, etc.) |
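
To see how these options combine, here's an illustrative config (the values are just examples) that discovers documentation URLs from both sources, pulls their metadata, verifies they respond, and stays polite with rate limiting:

config = SeedingConfig(
    source="sitemap+cc",   # Combine both discovery sources
    pattern="*/docs/*",    # Keep only documentation URLs
    extract_head=True,     # Fetch <head> metadata for filtering
    live_check=True,       # Verify each URL is reachable
    max_urls=500,          # Cap the result set
    concurrency=20,        # 20 parallel workers
    hits_per_sec=10,       # Polite rate limit
    verbose=True           # Show progress
)

urls = await seeder.urls("example.com", config)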

Pattern Matching Examples

# Match all blog posts
config = SeedingConfig(pattern="*/blog/*")

# Match only HTML files
config = SeedingConfig(pattern="*.html")

# Match product pages
config = SeedingConfig(pattern="*/product/*")

# Match everything except admin pages
config = SeedingConfig(pattern="*")
# Then filter: urls = [u for u in urls if "/admin/" not in u["url"]]

URL Validation: Live Checking

Sometimes you need to know if URLs are actually accessible. That's where live checking comes in:

config = SeedingConfig(
    source="sitemap",
    live_check=True,  # Verify each URL is accessible
    concurrency=20    # Check 20 URLs in parallel
)

urls = await seeder.urls("example.com", config)

# Now you can filter by status
live_urls = [u for u in urls if u["status"] == "valid"]
dead_urls = [u for u in urls if u["status"] == "not_valid"]

print(f"Live URLs: {len(live_urls)}")
print(f"Dead URLs: {len(dead_urls)}")

When to use live checking:

  • Before a large crawling operation
  • When working with older sitemaps
  • When data freshness is critical

When to skip it:

  • Quick explorations
  • When you trust the source
  • When speed is more important than accuracy

The Power of Metadata: Head Extraction

This is where URL seeding gets really powerful. Instead of crawling entire pages, you can extract just the metadata:

config = SeedingConfig(
    extract_head=True  # Extract metadata from <head> section
)

urls = await seeder.urls("example.com", config)

# Now each URL has rich metadata
for url in urls[:3]:
    print(f"\nURL: {url['url']}")
    print(f"Title: {url['head_data'].get('title')}")

    meta = url['head_data'].get('meta', {})
    print(f"Description: {meta.get('description')}")
    print(f"Keywords: {meta.get('keywords')}")

    # Even Open Graph data!
    print(f"OG Image: {meta.get('og:image')}")

What Can We Extract?

The head extraction gives you a treasure trove of information:

# Example of extracted head_data
{
    "title": "10 Python Tips for Beginners",
    "charset": "utf-8",
    "lang": "en",
    "meta": {
        "description": "Learn essential Python tips...",
        "keywords": "python, programming, tutorial",
        "author": "Jane Developer",
        "viewport": "width=device-width, initial-scale=1",

        # Open Graph tags
        "og:title": "10 Python Tips for Beginners",
        "og:description": "Essential Python tips for new programmers",
        "og:image": "https://example.com/python-tips.jpg",
        "og:type": "article",

        # Twitter Card tags
        "twitter:card": "summary_large_image",
        "twitter:title": "10 Python Tips",

        # Dublin Core metadata
        "dc.creator": "Jane Developer",
        "dc.date": "2024-01-15"
    },
    "link": {
        "canonical": [{"href": "https://example.com/blog/python-tips"}],
        "alternate": [{"href": "/feed.xml", "type": "application/rss+xml"}]
    },
    "jsonld": [
        {
            "@type": "Article",
            "headline": "10 Python Tips for Beginners",
            "datePublished": "2024-01-15",
            "author": {"@type": "Person", "name": "Jane Developer"}
        }
    ]
}

This metadata is gold for filtering! You can find exactly what you need without crawling a single page.
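
For example, with the head_data structure shown above you could keep only article pages by a particular author - without ever fetching a page body (the author value is taken from the sample data, so adjust it to your site):

# Filter purely on metadata - zero crawling
articles = [
    u for u in urls
    if u["head_data"].get("meta", {}).get("og:type") == "article"
    and u["head_data"].get("meta", {}).get("author") == "Jane Developer"
]
print(f"Kept {len(articles)} article pages without crawling a single one")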

Smart URL-Based Filtering (No Head Extraction)

When extract_head=False but you still provide a query, the seeder uses intelligent URL-based scoring:

# Fast filtering based on URL structure alone
config = SeedingConfig(
    source="sitemap",
    extract_head=False,  # Don't fetch page metadata
    query="python tutorial async",
    scoring_method="bm25",
    score_threshold=0.3
)

urls = await seeder.urls("example.com", config)

# URLs are scored based on:
# 1. Domain parts matching (e.g., 'python' in python.example.com)
# 2. Path segments (e.g., '/tutorials/python-async/')
# 3. Query parameters (e.g., '?topic=python')
# 4. Fuzzy matching using character n-grams

# Example URL scoring:
# https://example.com/tutorials/python/async-guide.html - High score
# https://example.com/blog/javascript-tips.html - Low score

This approach is much faster than head extraction while still providing intelligent filtering!

Understanding Results

Each URL in the results has this structure:

{
    "url": "https://example.com/blog/python-tips.html",
    "status": "valid",        # "valid", "not_valid", or "unknown"
    "head_data": {            # Only if extract_head=True
        "title": "Page Title",
        "meta": {...},
        "link": {...},
        "jsonld": [...]
    },
    "relevance_score": 0.85   # Only if using BM25 scoring
}

Let's see a real example:

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    live_check=True
)

urls = await seeder.urls("blog.example.com", config)

# Analyze the results
for url in urls[:5]:
    print(f"\n{'='*60}")
    print(f"URL: {url['url']}")
    print(f"Status: {url['status']}")

    if url['head_data']:
        data = url['head_data']
        print(f"Title: {data.get('title', 'No title')}")

        # Check content type
        meta = data.get('meta', {})
        content_type = meta.get('og:type', 'unknown')
        print(f"Content Type: {content_type}")

        # Publication date
        pub_date = None
        for jsonld in data.get('jsonld', []):
            if isinstance(jsonld, dict):
                pub_date = jsonld.get('datePublished')
                if pub_date:
                    break

        if pub_date:
            print(f"Published: {pub_date}")

        # Word count (if available)
        word_count = meta.get('word_count')
        if word_count:
            print(f"Word Count: {word_count}")

Smart Filtering with BM25 Scoring

Now for the really cool part - intelligent filtering based on relevance!

Introduction to Relevance Scoring

BM25 is a ranking algorithm that scores how relevant a document is to a search query. With URL seeding, we can score URLs based on their metadata before crawling them.

Think of it like this:

  • Traditional way: Read every book in the library to find ones about Python
  • Smart way: Check the titles and descriptions, score them, read only the most relevant

Query-Based Discovery

Here's how to use BM25 scoring:

config = SeedingConfig(
    source="sitemap",
    extract_head=True,           # Required for scoring
    query="python async tutorial",  # What we're looking for
    scoring_method="bm25",       # Use BM25 algorithm
    score_threshold=0.3          # Minimum relevance score
)

urls = await seeder.urls("realpython.com", config)

# Results are automatically sorted by relevance!
for url in urls[:5]:
    print(f"Score: {url['relevance_score']:.2f} - {url['url']}")
    print(f"  Title: {url['head_data']['title']}")

Real Examples

Finding Documentation Pages

# Find API documentation
config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="API reference documentation endpoints",
    scoring_method="bm25",
    score_threshold=0.5,
    max_urls=20
)

urls = await seeder.urls("docs.example.com", config)

# The highest scoring URLs will be API docs!

Discovering Product Pages

# Find specific products
config = SeedingConfig(
    source="sitemap+cc",  # Use both sources
    extract_head=True,
    query="wireless headphones noise canceling",
    scoring_method="bm25",
    score_threshold=0.4,
    pattern="*/product/*"  # Combine with pattern matching
)

urls = await seeder.urls("shop.example.com", config)

# Filter further by price (from metadata)
affordable = [
    u for u in urls 
    if float(u['head_data'].get('meta', {}).get('product:price', '0')) < 200
]

Filtering News Articles

# Find recent news about AI
config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="artificial intelligence machine learning breakthrough",
    scoring_method="bm25",
    score_threshold=0.35
)

urls = await seeder.urls("technews.com", config)

# Filter by date
from datetime import datetime, timedelta

recent = []
cutoff = datetime.now() - timedelta(days=7)

for url in urls:
    # Check JSON-LD for publication date
    for jsonld in url['head_data'].get('jsonld', []):
        if 'datePublished' in jsonld:
            pub_date = datetime.fromisoformat(jsonld['datePublished'].replace('Z', '+00:00'))
            if pub_date > cutoff:
                recent.append(url)
                break

Complex Query Patterns

# Multi-concept queries
queries = [
    "python async await concurrency tutorial",
    "data science pandas numpy visualization",
    "web scraping beautifulsoup selenium automation",
    "machine learning tensorflow keras deep learning"
]

all_tutorials = []

for query in queries:
    config = SeedingConfig(
        source="sitemap",
        extract_head=True,
        query=query,
        scoring_method="bm25",
        score_threshold=0.4,
        max_urls=10  # Top 10 per topic
    )

    urls = await seeder.urls("learning-platform.com", config)
    all_tutorials.extend(urls)

# Remove duplicates while preserving order
seen = set()
unique_tutorials = []
for url in all_tutorials:
    if url['url'] not in seen:
        seen.add(url['url'])
        unique_tutorials.append(url)

print(f"Found {len(unique_tutorials)} unique tutorials across all topics")

Scaling Up: Multiple Domains

When you need to discover URLs across multiple websites, URL seeding really shines.

The many_urls Method

# Discover URLs from multiple domains in parallel
domains = ["site1.com", "site2.com", "site3.com"]

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="python tutorial",
    scoring_method="bm25",
    score_threshold=0.3
)

# Returns a dictionary: {domain: [urls]}
results = await seeder.many_urls(domains, config)

# Process results
for domain, urls in results.items():
    print(f"\n{domain}: Found {len(urls)} relevant URLs")
    if urls:
        top = urls[0]  # Highest scoring
        print(f"  Top result: {top['url']}")
        print(f"  Score: {top['relevance_score']:.2f}")

Cross-Domain Examples

Competitor Analysis

# Analyze content strategies across competitors
competitors = [
    "competitor1.com",
    "competitor2.com", 
    "competitor3.com"
]

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    pattern="*/blog/*",
    max_urls=100
)

results = await seeder.many_urls(competitors, config)

# Analyze content types
for domain, urls in results.items():
    content_types = {}

    for url in urls:
        # Extract content type from metadata
        og_type = url['head_data'].get('meta', {}).get('og:type', 'unknown')
        content_types[og_type] = content_types.get(og_type, 0) + 1

    print(f"\n{domain} content distribution:")
    for ctype, count in sorted(content_types.items(), key=lambda x: x[1], reverse=True):
        print(f"  {ctype}: {count}")

Industry Research

# Research Python tutorials across educational sites
educational_sites = [
    "realpython.com",
    "pythontutorial.net",
    "learnpython.org",
    "python.org"
]

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="beginner python tutorial basics",
    scoring_method="bm25",
    score_threshold=0.3,
    max_urls=20  # Per site
)

results = await seeder.many_urls(educational_sites, config)

# Find the best beginner tutorials
all_tutorials = []
for domain, urls in results.items():
    for url in urls:
        url['domain'] = domain  # Add domain info
        all_tutorials.append(url)

# Sort by relevance across all domains
all_tutorials.sort(key=lambda x: x['relevance_score'], reverse=True)

print("Top 10 Python tutorials for beginners across all sites:")
for i, tutorial in enumerate(all_tutorials[:10], 1):
    print(f"{i}. [{tutorial['relevance_score']:.2f}] {tutorial['head_data']['title']}")
    print(f"   {tutorial['url']}")
    print(f"   From: {tutorial['domain']}")

Multi-Site Monitoring

# Monitor news about your company across multiple sources
news_sites = [
    "techcrunch.com",
    "theverge.com",
    "wired.com",
    "arstechnica.com"
]

company_name = "YourCompany"

config = SeedingConfig(
    source="cc",  # Common Crawl for recent content
    extract_head=True,
    query=f"{company_name} announcement news",
    scoring_method="bm25",
    score_threshold=0.5,  # High threshold for relevance
    max_urls=10
)

results = await seeder.many_urls(news_sites, config)

# Collect all mentions
mentions = []
for domain, urls in results.items():
    mentions.extend(urls)

if mentions:
    print(f"Found {len(mentions)} mentions of {company_name}:")
    for mention in mentions:
        print(f"\n- {mention['head_data']['title']}")
        print(f"  {mention['url']}")
        print(f"  Score: {mention['relevance_score']:.2f}")
else:
    print(f"No recent mentions of {company_name} found")

Advanced Integration Patterns

Let's put everything together in a real-world example.

Building a Research Assistant

Here's a complete example that discovers, scores, filters, and crawls intelligently:

import asyncio
from datetime import datetime
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig

class ResearchAssistant:
    def __init__(self):
        self.seeder = None

    async def __aenter__(self):
        self.seeder = AsyncUrlSeeder()
        await self.seeder.__aenter__()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.seeder:
            await self.seeder.__aexit__(exc_type, exc_val, exc_tb)

    async def research_topic(self, topic, domains, max_articles=20):
        """Research a topic across multiple domains."""

        print(f"🔬 Researching '{topic}' across {len(domains)} domains...")

        # Step 1: Discover relevant URLs
        config = SeedingConfig(
            source="sitemap+cc",     # Maximum coverage
            extract_head=True,       # Get metadata
            query=topic,             # Research topic
            scoring_method="bm25",   # Smart scoring
            score_threshold=0.4,     # Quality threshold
            max_urls=10,             # Per domain
            concurrency=20,          # Fast discovery
            verbose=True
        )

        # Discover across all domains
        discoveries = await self.seeder.many_urls(domains, config)

        # Step 2: Collect and rank all articles
        all_articles = []
        for domain, urls in discoveries.items():
            for url in urls:
                url['domain'] = domain
                all_articles.append(url)

        # Sort by relevance
        all_articles.sort(key=lambda x: x['relevance_score'], reverse=True)

        # Take top articles
        top_articles = all_articles[:max_articles]

        print(f"\n📊 Found {len(all_articles)} relevant articles")
        print(f"📌 Selected top {len(top_articles)} for deep analysis")

        # Step 3: Show what we're about to crawl
        print("\n🎯 Articles to analyze:")
        for i, article in enumerate(top_articles[:5], 1):
            print(f"\n{i}. {article['head_data']['title']}")
            print(f"   Score: {article['relevance_score']:.2f}")
            print(f"   Source: {article['domain']}")
            print(f"   URL: {article['url'][:60]}...")

        # Step 4: Crawl the selected articles
        print(f"\n🚀 Deep crawling {len(top_articles)} articles...")

        async with AsyncWebCrawler() as crawler:
            config = CrawlerRunConfig(
                only_text=True,
                word_count_threshold=200,  # Substantial content only
                stream=True
            )

            # Extract URLs and crawl all articles
            article_urls = [article['url'] for article in top_articles]
            results = []
            crawl_results = await crawler.arun_many(article_urls, config=config)
            async for result in crawl_results:
                if result.success:
                    results.append({
                        'url': result.url,
                        'title': result.metadata.get('title', 'No title'),
                        'content': result.markdown.raw_markdown,
                        'domain': next(a['domain'] for a in top_articles if a['url'] == result.url),
                        'score': next(a['relevance_score'] for a in top_articles if a['url'] == result.url)
                    })
                    print(f"✓ Crawled: {result.url[:60]}...")

        # Step 5: Analyze and summarize
        print(f"\n📝 Analysis complete! Crawled {len(results)} articles")

        return self.create_research_summary(topic, results)

    def create_research_summary(self, topic, articles):
        """Create a research summary from crawled articles."""

        summary = {
            'topic': topic,
            'timestamp': datetime.now().isoformat(),
            'total_articles': len(articles),
            'sources': {}
        }

        # Group by domain
        for article in articles:
            domain = article['domain']
            if domain not in summary['sources']:
                summary['sources'][domain] = []

            summary['sources'][domain].append({
                'title': article['title'],
                'url': article['url'],
                'score': article['score'],
                'excerpt': article['content'][:500] + '...' if len(article['content']) > 500 else article['content']
            })

        return summary

# Use the research assistant
async def main():
    async with ResearchAssistant() as assistant:
        # Research Python async programming across multiple sources
        topic = "python asyncio best practices performance optimization"
        domains = [
            "realpython.com",
            "python.org",
            "stackoverflow.com",
            "medium.com"
        ]

        summary = await assistant.research_topic(topic, domains, max_articles=15)

    # Display results
    print("\n" + "="*60)
    print("RESEARCH SUMMARY")
    print("="*60)
    print(f"Topic: {summary['topic']}")
    print(f"Date: {summary['timestamp']}")
    print(f"Total Articles Analyzed: {summary['total_articles']}")

    print("\nKey Findings by Source:")
    for domain, articles in summary['sources'].items():
        print(f"\n📚 {domain} ({len(articles)} articles)")
        for article in articles[:2]:  # Top 2 per domain
            print(f"\n  Title: {article['title']}")
            print(f"  Relevance: {article['score']:.2f}")
            print(f"  Preview: {article['excerpt'][:200]}...")

asyncio.run(main())

Performance Optimization Tips

  1. Use caching wisely

    # First run - populate cache
    config = SeedingConfig(source="sitemap", extract_head=True, force=True)
    urls = await seeder.urls("example.com", config)

    # Subsequent runs - use cache (much faster)
    config = SeedingConfig(source="sitemap", extract_head=True, force=False)
    urls = await seeder.urls("example.com", config)

  2. Optimize concurrency

    # For many small requests (like HEAD checks)
    config = SeedingConfig(concurrency=50, hits_per_sec=20)

    # For fewer large requests (like full head extraction)
    config = SeedingConfig(concurrency=10, hits_per_sec=5)

  3. Stream large result sets

    # When crawling many URLs
    async with AsyncWebCrawler() as crawler:
        # Assuming urls is a list of URL strings and config has stream=True
        crawl_results = await crawler.arun_many(urls, config=config)

        # Process as they arrive
        async for result in crawl_results:
            process_immediately(result)  # Don't wait for all

  4. Memory protection for large domains

The seeder uses bounded queues to prevent memory issues when processing domains with millions of URLs:

# Safe for domains with 1M+ URLs
config = SeedingConfig(
    source="cc+sitemap",
    concurrency=50,  # Queue size adapts to concurrency
    max_urls=100000  # Process in batches if needed
)

# The seeder automatically manages memory by:
# - Using bounded queues (prevents RAM spikes)
# - Applying backpressure when queue is full
# - Processing URLs as they're discovered

Best Practices & Tips

Cache Management

The seeder automatically caches results to speed up repeated operations:

  • Common Crawl cache: ~/.crawl4ai/seeder_cache/[index]_[domain]_[hash].jsonl

  • Sitemap cache: ~/.crawl4ai/seeder_cache/sitemap_[domain]_[hash].jsonl

  • HEAD data cache: ~/.cache/url_seeder/head/[hash].json

Cache expires after 7 days by default. Use force=True to refresh.
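
If you want to see how much disk the caches are using, or wipe them manually instead of waiting for expiry, a quick sketch against the paths listed above works (adjust the paths if your installation stores them elsewhere):

from pathlib import Path

# Cache locations as documented above
cache_dirs = [
    Path.home() / ".crawl4ai" / "seeder_cache",
    Path.home() / ".cache" / "url_seeder" / "head",
]

for cache_dir in cache_dirs:
    if cache_dir.exists():
        files = [f for f in cache_dir.glob("*") if f.is_file()]
        size_mb = sum(f.stat().st_size for f in files) / 1_000_000
        print(f"{cache_dir}: {len(files)} files, {size_mb:.1f} MB")
        # Uncomment to wipe the cache and force fresh discovery:
        # for f in files: f.unlink()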

Pattern Matching Strategies

# Be specific when possible
good_pattern = "*/blog/2024/*.html"  # Specific
bad_pattern = "*"                     # Too broad

# Combine patterns with metadata filtering
config = SeedingConfig(
    pattern="*/articles/*",
    extract_head=True
)
urls = await seeder.urls("news.com", config)

# Further filter by publish date, author, category, etc.
recent = [u for u in urls if is_recent(u['head_data'])]  # is_recent: your own helper that checks dates in head_data

Rate Limiting Considerations

# Be respectful of servers
config = SeedingConfig(
    hits_per_sec=10,      # Max 10 requests per second
    concurrency=20        # But use 20 workers
)

# For your own servers
config = SeedingConfig(
    hits_per_sec=None,    # No limit
    concurrency=100       # Go fast
)

Quick Reference

Common Patterns

# Blog post discovery
config = SeedingConfig(
    source="sitemap",
    pattern="*/blog/*",
    extract_head=True,
    query="your topic",
    scoring_method="bm25"
)

# E-commerce product discovery
config = SeedingConfig(
    source="sitemap+cc",
    pattern="*/product/*",
    extract_head=True,
    live_check=True
)

# Documentation search
config = SeedingConfig(
    source="sitemap",
    pattern="*/docs/*",
    extract_head=True,
    query="API reference",
    scoring_method="bm25",
    score_threshold=0.5
)

# News monitoring
config = SeedingConfig(
    source="cc",
    extract_head=True,
    query="company name",
    scoring_method="bm25",
    max_urls=50
)

Troubleshooting Guide

| Issue | Solution |
|-------|----------|
| No URLs found | Try source="cc+sitemap", check domain spelling |
| Slow discovery | Reduce concurrency, add hits_per_sec limit |
| Missing metadata | Ensure extract_head=True |
| Low relevance scores | Refine query, lower score_threshold |
| Rate limit errors | Reduce hits_per_sec and concurrency |
| Memory issues with large sites | Use max_urls to limit results, reduce concurrency |
| Connection not closed | Use context manager or call await seeder.close() |

Performance Benchmarks

Typical performance on a standard connection:

  • Sitemap discovery: 100-1,000 URLs/second

  • Common Crawl discovery: 50-500 URLs/second

  • HEAD checking: 10-50 URLs/second

  • Head extraction: 5-20 URLs/second

  • BM25 scoring: 10,000+ URLs/second

Conclusion

URL seeding transforms web crawling from a blind expedition into a surgical strike. By discovering and analyzing URLs before crawling, you can:

  • Save hours of crawling time

  • Reduce bandwidth usage by 90%+

  • Find exactly what you need

  • Scale across multiple domains effortlessly

Whether you're building a research tool, monitoring competitors, or creating a content aggregator, URL seeding gives you the intelligence to crawl smarter, not harder.

Smart URL Filtering

The seeder automatically filters out nonsense URLs that aren't useful for content crawling:

# Enabled by default
config = SeedingConfig(
    source="sitemap",
    filter_nonsense_urls=True  # Default: True
)

# URLs that get filtered:
# - robots.txt, sitemap.xml, ads.txt
# - API endpoints (/api/, /v1/, .json)
# - Media files (.jpg, .mp4, .pdf)
# - Archives (.zip, .tar.gz)
# - Source code (.js, .css)
# - Admin/login pages
# - And many more...

To disable filtering (not recommended):

config = SeedingConfig(
    source="sitemap",
    filter_nonsense_urls=False  # Include ALL URLs
)

Key Features Summary

  1. Parallel Sitemap Index Processing: Automatically detects and processes sitemap indexes in parallel

  2. Memory Protection: Bounded queues prevent RAM issues with large domains (1M+ URLs)

  3. Context Manager Support: Automatic cleanup with the async with statement

  4. URL-Based Scoring: Smart filtering even without head extraction

  5. Smart URL Filtering: Automatically excludes utility/nonsense URLs

  6. Dual Caching: Separate caches for URL lists and metadata

Now go forth and seed intelligently! 🌱🚀

