Link & Media

In this tutorial, you will learn how to:

  1. Extract links (internal, external) from crawled pages
  2. Filter or exclude specific domains (e.g., social media or custom domains)
  3. Access and manage media data (especially images) in crawl results
  4. Configure your crawler to exclude or prioritize certain images


Prerequisites

  • You have completed or are familiar with the AsyncWebCrawler Basics tutorial.
  • You can run Crawl4AI in your environment (Playwright, Python, etc.).

Below is a revised version of the Link Extraction and Media Extraction sections, including example data structures that show how link and media items are stored in CrawlResult. Feel free to adjust any field names or descriptions to match your actual output.


1. Link Extraction

When you call arun() or arun_many() on a URL, Crawl4AI automatically extracts links and stores them in the links field of CrawlResult. By default, the crawler distinguishes internal links (same domain) from external links (different domains).

Basic example:

from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://www.example.com")
    if result.success:
        internal_links = result.links.get("internal", [])
        external_links = result.links.get("external", [])
        print(f"Found {len(internal_links)} internal links.")
        print(f"Found {len(internal_links)} external links.")
        print(f"Found {len(result.media)} media items.")

        # Each link is typically a dictionary with fields like:
        # { "href": "...", "text": "...", "title": "...", "base_domain": "..." }
        if internal_links:
            print("Sample Internal Link:", internal_links[0])
    else:
        print("Crawl failed:", result.error_message)

Structure example:

result.links = {
  "internal": [
    {
      "href": "https://kidocode.com/",
      "text": "",
      "title": "",
      "base_domain": "kidocode.com"
    },
    {
      "href": "https://kidocode.com/degrees/technology",
      "text": "Technology Degree",
      "title": "KidoCode Tech Program",
      "base_domain": "kidocode.com"
    },
    # ...
  ],
  "external": [
    # possibly other links leading to third-party sites
  ]
}
  • href: The original hyperlink URL.
  • text: The link text (if any) inside the <a> tag.
  • title: The title attribute of the link (if present).
  • base_domain: The domain extracted from href. Useful for filtering or grouping by domain.
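
For instance, here is a minimal sketch (continuing from the result object in the basic example above, and assuming the link dictionaries carry the fields just listed) that groups every discovered link by its base_domain:

from collections import defaultdict

# Group all discovered links by the domain they point to
links_by_domain = defaultdict(list)
for link in result.links.get("internal", []) + result.links.get("external", []):
    links_by_domain[link.get("base_domain", "unknown")].append(link.get("href"))

# Print the domains with the most links first
for domain, hrefs in sorted(links_by_domain.items(), key=lambda kv: -len(kv[1])):
    print(f"{domain}: {len(hrefs)} links")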

2. Link Head Extraction

Ever wanted to not just extract links, but also fetch the actual content (title, description, metadata) behind those links, and score them by relevance? That's exactly what Link Head Extraction does: it fetches the <head> section from every discovered link and scores the links using multiple algorithms.

When you crawl a page, you get hundreds of links. But which ones are actually valuable? Link Head Extraction solves this by:

  1. Fetching head content (title, description, meta tags) from each link
  2. Scoring links intrinsically based on URL quality, text relevance, and context
  3. Scoring links contextually with the BM25 algorithm when you provide a search query
  4. Combining the scores intelligently to give you a final relevance ranking

2.2 Complete Working Example

Here is a complete example you can copy, paste, and run right away:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import LinkPreviewConfig

async def extract_link_heads_example():
    """
    Complete example showing link head extraction with scoring.
    This will crawl a documentation site and extract head content from internal links.
    """

    # Configure link head extraction
    config = CrawlerRunConfig(
        # Enable link head extraction with detailed configuration
        link_preview_config=LinkPreviewConfig(
            include_internal=True,           # Extract from internal links
            include_external=False,          # Skip external links for this example
            max_links=10,                   # Limit to 10 links for demo
            concurrency=5,                  # Process 5 links simultaneously
            timeout=10,                     # 10 second timeout per link
            query="API documentation guide", # Query for contextual scoring
            score_threshold=0.3,            # Only include links scoring above 0.3
            verbose=True                    # Show detailed progress
        ),
        # Enable intrinsic scoring (URL quality, text relevance)
        score_links=True,
        # Keep output clean
        only_text=True,
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        # Crawl a documentation site (great for testing)
        result = await crawler.arun("https://docs.python.org/3/", config=config)

        if result.success:
            print(f"✅ Successfully crawled: {result.url}")
            print(f"📄 Page title: {result.metadata.get('title', 'No title')}")

            # Access links (now enhanced with head data and scores)
            internal_links = result.links.get("internal", [])
            external_links = result.links.get("external", [])

            print(f"\n🔗 Found {len(internal_links)} internal links")
            print(f"🌍 Found {len(external_links)} external links")

            # Count links with head data
            links_with_head = [link for link in internal_links 
                             if link.get("head_data") is not None]
            print(f"🧠 Links with head data extracted: {len(links_with_head)}")

            # Show the top 3 scoring links
            print(f"\n🏆 Top 3 Links with Full Scoring:")
            for i, link in enumerate(links_with_head[:3]):
                print(f"\n{i+1}. {link['href']}")
                print(f"   Link Text: '{link.get('text', 'No text')[:50]}...'")

                # Show all three score types
                intrinsic = link.get('intrinsic_score')
                contextual = link.get('contextual_score') 
                total = link.get('total_score')

                if intrinsic is not None:
                    print(f"   📊 Intrinsic Score: {intrinsic:.2f}/10.0 (URL quality & context)")
                if contextual is not None:
                    print(f"   🎯 Contextual Score: {contextual:.3f} (BM25 relevance to query)")
                if total is not None:
                    print(f"   ⭐ Total Score: {total:.3f} (combined final score)")

                # Show extracted head data
                head_data = link.get("head_data", {})
                if head_data:
                    title = head_data.get("title", "No title")
                    description = head_data.get("meta", {}).get("description", "No description")

                    print(f"   📰 Title: {title[:60]}...")
                    if description:
                        print(f"   📝 Description: {description[:80]}...")

                    # Show extraction status
                    status = link.get("head_extraction_status", "unknown")
                    print(f"   ✅ Extraction Status: {status}")
        else:
            print(f"❌ Crawl failed: {result.error_message}")

# Run the example
if __name__ == "__main__":
    asyncio.run(extract_link_heads_example())

Expected output:

✅ Successfully crawled: https://docs.python.org/3/
📄 Page title: 3.13.5 Documentation
🔗 Found 53 internal links
🌍 Found 1 external links
🧠 Links with head data extracted: 10

🏆 Top 3 Links with Full Scoring:

1. https://docs.python.org/3.15/
   Link Text: 'Python 3.15 (in development)...'
   📊 Intrinsic Score: 4.17/10.0 (URL quality & context)
   🎯 Contextual Score: 1.000 (BM25 relevance to query)
   ⭐ Total Score: 5.917 (combined final score)
   📰 Title: 3.15.0a0 Documentation...
   📝 Description: The official Python documentation...
   ✅ Extraction Status: valid

2.3 Configuration Deep Dive

The LinkPreviewConfig class supports the following options:

from crawl4ai import LinkPreviewConfig

link_preview_config = LinkPreviewConfig(
    # BASIC SETTINGS
    verbose=True,                    # Show detailed logs (recommended for learning)

    # LINK FILTERING
    include_internal=True,           # Include same-domain links
    include_external=True,           # Include different-domain links
    max_links=50,                   # Maximum links to process (prevents overload)

    # PATTERN FILTERING
    include_patterns=[               # Only process links matching these patterns
        "*/docs/*", 
        "*/api/*", 
        "*/reference/*"
    ],
    exclude_patterns=[               # Skip links matching these patterns
        "*/login*",
        "*/admin*"
    ],

    # PERFORMANCE SETTINGS
    concurrency=10,                  # How many links to process simultaneously
    timeout=5,                      # Seconds to wait per link

    # RELEVANCE SCORING
    query="machine learning API",    # Query for BM25 contextual scoring
    score_threshold=0.3,            # Only include links above this score
)

2.4 Understanding the Three Score Types

Each extracted link receives three different scores:

1. Intrinsic Score (0-10): URL and content quality

Based on URL structure, link text quality, and page context:

# High intrinsic score indicators:
# ✅ Clean URL structure (docs.python.org/api/reference)
# ✅ Meaningful link text ("API Reference Guide")
# ✅ Relevant to page context
# ✅ Not buried deep in navigation

# Low intrinsic score indicators:
# ❌ Random URLs (site.com/x7f9g2h)
# ❌ No link text or generic text ("Click here")
# ❌ Unrelated to page content
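
As a rough illustration (assuming score_links=True so that each link dict carries the intrinsic_score field, as in the complete example above), you can surface the highest-quality links like this:

# Sort internal links by intrinsic score, highest first
internal_links = result.links.get("internal", [])
by_quality = sorted(internal_links,
                    key=lambda link: link.get("intrinsic_score") or 0,
                    reverse=True)

for link in by_quality[:5]:
    score = link.get("intrinsic_score") or 0
    print(f"{score:.2f}  {link['href']}")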

2. Contextual Score (0-1): BM25 relevance to your query

Only calculated when you provide a query. Uses the BM25 algorithm over the extracted head content:

# Example: query = "machine learning tutorial"
# High contextual score: Link to "Complete Machine Learning Guide"
# Low contextual score: Link to "Privacy Policy"

3. Total Score: smart combination

Intelligently combines the intrinsic and contextual scores, with sensible fallbacks:

# When both scores available: (intrinsic * 0.3) + (contextual * 0.7)
# When only intrinsic: uses intrinsic score
# When only contextual: uses contextual score
# When neither: not calculated
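
As a minimal sketch of that fallback logic (an illustration only; the library's exact weighting and score normalization may differ):

def combine_scores(intrinsic=None, contextual=None):
    # Both scores available: weighted blend as documented above
    if intrinsic is not None and contextual is not None:
        return intrinsic * 0.3 + contextual * 0.7
    # Only one score available: fall back to it
    if intrinsic is not None:
        return intrinsic
    if contextual is not None:
        return contextual
    # Neither available: the total score is simply not calculated
    return None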

2.5 Practical Use Cases

Use Case 1: Research Assistant

Find the most relevant documentation pages:

async def research_assistant():
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            include_external=True,
            include_patterns=["*/docs/*", "*/tutorial/*", "*/guide/*"],
            query="machine learning neural networks",
            max_links=20,
            score_threshold=0.5,  # Only high-relevance links
            verbose=True
        ),
        score_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://scikit-learn.org/", config=config)

        if result.success:
            # Get high-scoring links
            good_links = [link for link in result.links.get("internal", [])
                         if link.get("total_score", 0) > 0.7]

            print(f"🎯 Found {len(good_links)} highly relevant links:")
            for link in good_links[:5]:
                print(f"⭐ {link['total_score']:.3f} - {link['href']}")
                print(f"   {link.get('head_data', {}).get('title', 'No title')}")

Use Case 2: Content Discovery

Find all API endpoints and references:

async def api_discovery():
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            include_patterns=["*/api/*", "*/reference/*"],
            exclude_patterns=["*/deprecated/*"],
            max_links=100,
            concurrency=15,
            verbose=False  # Clean output
        ),
        score_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.example-api.com/", config=config)

        if result.success:
            api_links = result.links.get("internal", [])

            # Group by endpoint type
            endpoints = {}
            for link in api_links:
                if link.get("head_data"):
                    title = link["head_data"].get("title", "")
                    if "GET" in title:
                        endpoints.setdefault("GET", []).append(link)
                    elif "POST" in title:
                        endpoints.setdefault("POST", []).append(link)

            for method, links in endpoints.items():
                print(f"\n{method} Endpoints ({len(links)}):")
                for link in links[:3]:
                    print(f"  • {link['href']}")

Use Case 3: Quality Analysis

Analyze site structure and content quality:

async def quality_analysis():
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            max_links=200,
            concurrency=20,
        ),
        score_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://your-website.com/", config=config)

        if result.success:
            links = result.links.get("internal", [])

            # Analyze intrinsic scores
            scores = [link.get('intrinsic_score', 0) for link in links]
            avg_score = sum(scores) / len(scores) if scores else 0

            print(f"📊 Link Quality Analysis:")
            print(f"   Average intrinsic score: {avg_score:.2f}/10.0")
            print(f"   High quality links (>7.0): {len([s for s in scores if s > 7.0])}")
            print(f"   Low quality links (<3.0): {len([s for s in scores if s < 3.0])}")

            # Find problematic links
            bad_links = [link for link in links 
                        if link.get('intrinsic_score', 0) < 2.0]

            if bad_links:
                print(f"\n⚠️  Links needing attention:")
                for link in bad_links[:5]:
                    print(f"   {link['href']} (score: {link.get('intrinsic_score', 0):.1f})")

2.6 Performance Tips

  1. Start small: begin with max_links: 10 to get a feel for the feature
  2. Use patterns: narrow the crawl with include_patterns to focus on relevant sections
  3. Tune concurrency: higher concurrency is faster but uses more resources
  4. Set timeouts: use timeout: 5 so that slow sites don't hang the crawl
  5. Use score thresholds: filter out low-quality links with score_threshold
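
Putting these tips together, a starting configuration might look like this (illustrative values only):

from crawl4ai import CrawlerRunConfig, LinkPreviewConfig

tuned_config = CrawlerRunConfig(
    link_preview_config=LinkPreviewConfig(
        max_links=10,                       # start small while learning the feature
        include_patterns=["*/docs/*"],      # focus on relevant sections
        concurrency=10,                     # faster, at the cost of more resources
        timeout=5,                          # don't let slow sites hang the crawl
        score_threshold=0.3,                # drop low-quality links early
    ),
    score_links=True,
)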

2.7 Troubleshooting

No head data extracted?

# Check your configuration:
config = CrawlerRunConfig(
    link_preview_config=LinkPreviewConfig(
        verbose=True   # ← Enable to see what's happening
    )
)

Scores showing as None?

# Make sure scoring is enabled:
config = CrawlerRunConfig(
    score_links=True,  # ← Enable intrinsic scoring
    link_preview_config=LinkPreviewConfig(
        query="your search terms"  # ← For contextual scoring
    )
)

Taking too long?

# Optimize performance:
link_preview_config = LinkPreviewConfig(
    max_links=20,      # ← Reduce number
    concurrency=10,    # ← Increase parallelism
    timeout=3,         # ← Shorter timeout
    include_patterns=["*/important/*"]  # ← Focus on key areas
)


3. Domain Filtering

Some websites contain hundreds of third-party or affiliate links. You can filter out certain domains at crawl time by configuring the crawler. The most relevant parameters in CrawlerRunConfig are:

  • exclude_external_links: If True, discard any link pointing outside the root domain.
  • exclude_social_media_domains: Provide a list of social media platforms (e.g., ["facebook.com", "twitter.com"]) to exclude from your crawl.
  • exclude_social_media_links: If True, automatically skip known social platforms.
  • exclude_domains: Provide a custom list of domains to exclude (e.g., ["spammyads.com", "tracker.net"]).

3.1 Example: Excluding External & Social Media Links

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    crawler_cfg = CrawlerRunConfig(
        exclude_external_links=True,          # No links outside primary domain
        exclude_social_media_links=True       # Skip recognized social media domains
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://www.example.com",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Crawled:", result.url)
            print("Internal links count:", len(result.links.get("internal", [])))
            print("External links count:", len(result.links.get("external", [])))  
            # Likely zero external links in this scenario
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

3.2 Example: Excluding Specific Domains

If you want to allow external links but explicitly exclude a specific domain (e.g., suspiciousads.com), do this:

crawler_cfg = CrawlerRunConfig(
    exclude_domains=["suspiciousads.com"]
)

This approach is handy when you still want external links but need to block certain sites you consider spammy.
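
As a quick sanity check after the crawl (a sketch reusing the crawler_cfg defined above and the base_domain field from section 1), you can confirm that no links from the excluded domain slipped through:

import asyncio
from crawl4ai import AsyncWebCrawler

async def check_exclusion():
    async with AsyncWebCrawler() as crawler:
        # crawler_cfg is the config defined just above (with exclude_domains)
        result = await crawler.arun("https://www.example.com", config=crawler_cfg)
        if result.success:
            external = result.links.get("external", [])
            leftovers = [l for l in external if l.get("base_domain") == "suspiciousads.com"]
            print(f"External links kept: {len(external)}; from excluded domain: {len(leftovers)}")

asyncio.run(check_exclusion())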


4. Media Extraction

4.1 Accessing result.media

By default, Crawl4AI collects the image, audio, and video URLs plus the data tables it finds on the page. These are stored in result.media, a dictionary keyed by media type (e.g., images, videos, audio, tables).

Basic example:

if result.success:
    # Get images
    images_info = result.media.get("images", [])
    print(f"Found {len(images_info)} images in total.")
    for i, img in enumerate(images_info[:3]):  # Inspect just the first 3
        print(f"[Image {i}] URL: {img['src']}")
        print(f"           Alt text: {img.get('alt', '')}")
        print(f"           Score: {img.get('score')}")
        print(f"           Description: {img.get('desc', '')}\n")

    # Get tables
    tables = result.media.get("tables", [])
    print(f"Found {len(tables)} data tables in total.")
    for i, table in enumerate(tables):
        print(f"[Table {i}] Caption: {table.get('caption', 'No caption')}")
        print(f"           Columns: {len(table.get('headers', []))}")
        print(f"           Rows: {len(table.get('rows', []))}")

Structure example:

result.media = {
  "images": [
    {
      "src": "https://cdn.prod.website-files.com/.../Group%2089.svg",
      "alt": "coding school for kids",
      "desc": "Trial Class Degrees degrees All Degrees AI Degree Technology ...",
      "score": 3,
      "type": "image",
      "group_id": 0,
      "format": None,
      "width": None,
      "height": None
    },
    # ...
  ],
  "videos": [
    # Similar structure but with video-specific fields
  ],
  "audio": [
    # Similar structure but with audio-specific fields
  ],
  "tables": [
    {
      "headers": ["Name", "Age", "Location"],
      "rows": [
        ["John Doe", "34", "New York"],
        ["Jane Smith", "28", "San Francisco"],
        ["Alex Johnson", "42", "Chicago"]
      ],
      "caption": "Employee Directory",
      "summary": "Directory of company employees"
    },
    # More tables if present
  ]
}

Depending on your Crawl4AI version or scraping strategy, these dictionaries can include fields like:

  • src: The media URL (e.g., the image source)
  • alt: The alt text for images (if present)
  • desc: A snippet of nearby text or a short description (optional)
  • score: A heuristic relevance score if you're using content-scoring features
  • width, height: If the crawler detects dimensions for the image/video
  • type: Usually "image", "video", or "audio"
  • group_id: If the crawler groups related media items, it may assign an ID

With these details, you can easily filter out or focus on certain images (for instance, ignore images with very low scores or from a different domain), or gather metadata for analysis.
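
For example, here is a small sketch (assuming the src and score fields shown above, and an arbitrary score cutoff) that keeps only same-host images with a reasonable score:

from urllib.parse import urlparse

images = result.media.get("images", [])
page_host = urlparse(result.url).netloc

kept = []
for img in images:
    img_host = urlparse(img.get("src", "")).netloc
    same_host = img_host in ("", page_host)          # "" means a relative URL
    if same_host and (img.get("score") or 0) >= 2:   # naive same-host + score filter
        kept.append(img)

print(f"Kept {len(kept)} of {len(images)} images")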

4.2 Excluding External Images

If you're dealing with very heavy pages or want to skip third-party images (for example, ads), you can turn on:

crawler_cfg = CrawlerRunConfig(
    exclude_external_images=True
)

This setting attempts to discard images coming from outside the primary domain, keeping only those from the site you're crawling.

4.3 Excluding All Images

If you want to remove all images from the page entirely to maximize performance and minimize memory usage, use:

crawler_cfg = CrawlerRunConfig(
    exclude_all_images=True
)

This setting removes all images very early in the processing pipeline, which significantly improves memory efficiency and processing speed. It is especially useful when:

  • You don't need image data in your results
  • You're crawling image-heavy pages that cause memory issues
  • You only care about text content
  • You need to maximize crawl speed

4.4 Working with Tables

Crawl4AI can detect and extract structured data from HTML tables. Tables are analyzed against a variety of criteria to determine whether they are genuine data tables (as opposed to layout tables), including:

  • Presence of thead and tbody sections
  • Use of th elements for headers
  • Column consistency
  • Text density
  • And other factors

Tables that score above the threshold (default: 7) are extracted and stored in result.media.tables.

Accessing table data:

if result.success:
    tables = result.media.get("tables", [])
    print(f"Found {len(tables)} data tables on the page")

    if tables:
        # Access the first table
        first_table = tables[0]
        print(f"Table caption: {first_table.get('caption', 'No caption')}")
        print(f"Headers: {first_table.get('headers', [])}")

        # Print the first 3 rows
        for i, row in enumerate(first_table.get('rows', [])[:3]):
            print(f"Row {i+1}: {row}")

Configuring table extraction:

You can adjust the sensitivity of the table detection algorithm with:

crawler_cfg = CrawlerRunConfig(
    table_score_threshold=5  # Lower value = more tables detected (default: 7)
)

Each extracted table contains:

  • headers: Column header names
  • rows: A list of rows, each containing cell values
  • caption: The table caption text (if available)
  • summary: The table's summary attribute (if specified)
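
If you happen to use pandas (not required by Crawl4AI), a table in this shape converts straightforwardly into a DataFrame (assuming the header count matches the row width):

import pandas as pd

tables = result.media.get("tables", [])
if tables:
    first = tables[0]
    # Build a DataFrame from the extracted headers and rows
    df = pd.DataFrame(first.get("rows", []), columns=first.get("headers") or None)
    print(first.get("caption", "No caption"))
    print(df.head())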

4.5 Additional Media Config

  • screenshot: Set to True if you want a full-page screenshot stored (as base64) in result.screenshot.
  • pdf: Set to True if you want a PDF version of the page in result.pdf.
  • capture_mhtml: Set to True if you want an MHTML snapshot of the page in result.mhtml. This format preserves the entire web page with all of its resources (CSS, images, scripts) in a single file, making it perfect for archiving or offline viewing.
  • wait_for_images: If True, attempts to wait until images are fully loaded before final extraction.
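
Similar in spirit to the MHTML example below, here is a sketch for capturing a screenshot and a PDF in one run (assuming result.screenshot holds a base64-encoded image and result.pdf holds raw PDF bytes):

import asyncio
import base64
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def capture_artifacts():
    cfg = CrawlerRunConfig(screenshot=True, pdf=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=cfg)
        if result.success:
            if result.screenshot:
                # Screenshot is assumed to be a base64-encoded image
                with open("example.png", "wb") as f:
                    f.write(base64.b64decode(result.screenshot))
            if result.pdf:
                # PDF is assumed to be raw bytes
                with open("example.pdf", "wb") as f:
                    f.write(result.pdf)

asyncio.run(capture_artifacts())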

Example: Capturing a Page as MHTML

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    crawler_cfg = CrawlerRunConfig(
        capture_mhtml=True  # Enable MHTML capture
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=crawler_cfg)

        if result.success and result.mhtml:
            # Save the MHTML snapshot to a file
            with open("example.mhtml", "w", encoding="utf-8") as f:
                f.write(result.mhtml)
            print("MHTML snapshot saved to example.mhtml")
        else:
            print("Failed to capture MHTML:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

The MHTML format is particularly useful because:

  • It captures the complete page state, including all resources
  • It can be opened in most modern browsers for offline viewing
  • It preserves the page exactly as it appeared during the crawl
  • It's a single file, making it easy to store and transfer


Below is a combined example demonstrating how to filter out external links, skip certain domains, and exclude external images:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Suppose we want to keep only internal links, remove certain domains, 
    # and discard external images from the final crawl data.
    crawler_cfg = CrawlerRunConfig(
        exclude_external_links=True,
        exclude_domains=["spammyads.com"],
        exclude_social_media_links=True,   # skip Twitter, Facebook, etc.
        exclude_external_images=True,      # keep only images from main domain
        wait_for_images=True,             # ensure images are loaded
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://www.example.com", config=crawler_cfg)

        if result.success:
            print("[OK] Crawled:", result.url)

            # 1. Links
            in_links = result.links.get("internal", [])
            ext_links = result.links.get("external", [])
            print("Internal link count:", len(in_links))
            print("External link count:", len(ext_links))  # should be zero with exclude_external_links=True

            # 2. Images
            images = result.media.get("images", [])
            print("Images found:", len(images))

            # Let's see a snippet of these images
            for i, img in enumerate(images[:3]):
                print(f"  - {img['src']} (alt={img.get('alt','')}, score={img.get('score','N/A')})")
        else:
            print("[ERROR] Failed to crawl. Reason:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

5. Common Pitfalls & Tips

1. Conflicting flags:

  • exclude_external_links=True while also specifying exclude_social_media_links=True is usually fine, but note that the first setting already discards all external links, so the second is somewhat redundant.
  • exclude_external_images=True but you want to keep some external images? Partial, per-domain image settings are not currently supported, so you may need a custom approach or hook logic.

2. Relevance scores:

  • If your Crawl4AI version or scraping strategy includes img["score"], it's typically a heuristic based on size, position, or content analysis. Evaluate it carefully if you rely on it.

3. Performance:

  • Excluding certain domains or external images can speed up your crawl, especially on large, media-heavy pages.
  • If you want a "complete" link graph, don't exclude them; you can post-filter in your own code instead.

4. Social media lists:

  • exclude_social_media_links=True typically references an internal list of known social domains such as Facebook, Twitter, LinkedIn, etc. If you need to add to or remove from that list, look for library settings or a local config file (depending on your version).
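
One option, assuming the exclude_social_media_domains parameter described in section 3, is to pass your own domain list explicitly (whether it extends or replaces the built-in list may vary by version, so verify against your installation):

from crawl4ai import CrawlerRunConfig

crawler_cfg = CrawlerRunConfig(
    exclude_social_media_links=True,
    # Explicitly list the social platforms you want excluded
    exclude_social_media_domains=["facebook.com", "twitter.com", "tiktok.com"],
)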


That's it for link and media analysis! You're now equipped to filter out unwanted sites and focus on the images and videos that matter for your project.

Table Extraction Tips

  • Not all HTML tables are extracted; only those detected as "data tables" rather than layout tables.
  • Tables with inconsistent cell counts, nested tables, or tables used purely for layout may be skipped.
  • If tables are missing, try lowering table_score_threshold (the default is 7).

The table detection algorithm scores tables based on features such as column consistency, the presence of headers, and text density. Tables scoring above the threshold are treated as data tables worth extracting.

