Link & Media

In this tutorial, you'll learn how to:

  1. Extract links (internal, external) from crawled pages

  2. Filter or exclude specific domains (e.g., social media or custom domains)

  3. Access and manage media data (especially images) in the crawl result

  4. Configure your crawler to exclude or prioritize certain images

Prerequisites
- You have completed or are familiar with the AsyncWebCrawler Basics tutorial.
- You can run Crawl4AI in your environment (Playwright, Python, etc.).


The Link Extraction and Media Extraction sections below include example data structures showing how links and media items are stored in CrawlResult. Adjust any field names or descriptions to match your actual output.


1. Link Extraction

1.1 result.links

When you call arun() or arun_many() on a URL, Crawl4AI automatically extracts links and stores them in the links field of CrawlResult. By default, the crawler tries to distinguish internal links (same domain) from external links (different domains).

Basic Example:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://www.example.com")
        if result.success:
            internal_links = result.links.get("internal", [])
            external_links = result.links.get("external", [])
            print(f"Found {len(internal_links)} internal links.")
            print(f"Found {len(external_links)} external links.")
            print(f"Found {len(result.media)} media items.")

            # Each link is typically a dictionary with fields like:
            # { "href": "...", "text": "...", "title": "...", "base_domain": "..." }
            if internal_links:
                print("Sample Internal Link:", internal_links[0])
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

Structure Example:

result.links = {
  "internal": [
    {
      "href": "https://kidocode.com/",
      "text": "",
      "title": "",
      "base_domain": "kidocode.com"
    },
    {
      "href": "https://kidocode.com/degrees/technology",
      "text": "Technology Degree",
      "title": "KidoCode Tech Program",
      "base_domain": "kidocode.com"
    },
    # ...
  ],
  "external": [
    # possibly other links leading to third-party sites
  ]
}
  • href: The raw hyperlink URL.

  • text: The link text (if any) within the <a> tag.

  • title: The title attribute of the link (if present).

  • base_domain: The domain extracted from href. Helpful for filtering or grouping by domain, as in the sketch below.
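
For example, here is a minimal sketch that groups extracted links by their base_domain using only the fields shown above; the helper function is illustrative, not part of the library:

from collections import defaultdict

def group_links_by_domain(result):
    # Bucket every extracted link (internal and external) by its base_domain field
    grouped = defaultdict(list)
    for bucket in ("internal", "external"):
        for link in result.links.get(bucket, []):
            grouped[link.get("base_domain", "unknown")].append(link["href"])
    return grouped

# Usage after a successful crawl:
# for domain, urls in group_links_by_domain(result).items():
#     print(domain, len(urls))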


2. Advanced Link Head Extraction & Scoring

Ever wanted to not just extract links, but also get the actual content (title, description, metadata) from those linked pages? And score them for relevance? This is exactly what Link Head Extraction does - it fetches the <head> section from each discovered link and scores them using multiple algorithms.

2.1 Why Link Head Extraction?

When you crawl a page, you get hundreds of links. But which ones are actually valuable? Link Head Extraction solves this by:

  1. Fetching head content from each link (title, description, meta tags)

  2. Scoring links intrinsically based on URL quality, text relevance, and context

  3. Scoring links contextually using the BM25 algorithm when you provide a search query

  4. Combining scores intelligently to give you a final relevance ranking

2.2 Complete Working Example

Here's a full example you can copy, paste, and run immediately:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import LinkPreviewConfig

async def extract_link_heads_example():
    """
    Complete example showing link head extraction with scoring.
    This will crawl a documentation site and extract head content from internal links.
    """

    # Configure link head extraction
    config = CrawlerRunConfig(
        # Enable link head extraction with detailed configuration
        link_preview_config=LinkPreviewConfig(
            include_internal=True,           # Extract from internal links
            include_external=False,          # Skip external links for this example
            max_links=10,                   # Limit to 10 links for demo
            concurrency=5,                  # Process 5 links simultaneously
            timeout=10,                     # 10 second timeout per link
            query="API documentation guide", # Query for contextual scoring
            score_threshold=0.3,            # Only include links scoring above 0.3
            verbose=True                    # Show detailed progress
        ),
        # Enable intrinsic scoring (URL quality, text relevance)
        score_links=True,
        # Keep output clean
        only_text=True,
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        # Crawl a documentation site (great for testing)
        result = await crawler.arun("https://docs.python.org/3/", config=config)

        if result.success:
            print(f"✅ Successfully crawled: {result.url}")
            print(f"📄 Page title: {result.metadata.get('title', 'No title')}")

            # Access links (now enhanced with head data and scores)
            internal_links = result.links.get("internal", [])
            external_links = result.links.get("external", [])

            print(f"\n🔗 Found {len(internal_links)} internal links")
            print(f"🌍 Found {len(external_links)} external links")

            # Count links with head data
            links_with_head = [link for link in internal_links 
                             if link.get("head_data") is not None]
            print(f"🧠 Links with head data extracted: {len(links_with_head)}")

            # Show the top 3 scoring links
            print(f"\n🏆 Top 3 Links with Full Scoring:")
            for i, link in enumerate(links_with_head[:3]):
                print(f"\n{i+1}. {link['href']}")
                print(f"   Link Text: '{link.get('text', 'No text')[:50]}...'")

                # Show all three score types
                intrinsic = link.get('intrinsic_score')
                contextual = link.get('contextual_score') 
                total = link.get('total_score')

                if intrinsic is not None:
                    print(f"   📊 Intrinsic Score: {intrinsic:.2f}/10.0 (URL quality & context)")
                if contextual is not None:
                    print(f"   🎯 Contextual Score: {contextual:.3f} (BM25 relevance to query)")
                if total is not None:
                    print(f"   ⭐ Total Score: {total:.3f} (combined final score)")

                # Show extracted head data
                head_data = link.get("head_data", {})
                if head_data:
                    title = head_data.get("title", "No title")
                    description = head_data.get("meta", {}).get("description", "No description")

                    print(f"   📰 Title: {title[:60]}...")
                    if description:
                        print(f"   📝 Description: {description[:80]}...")

                    # Show extraction status
                    status = link.get("head_extraction_status", "unknown")
                    print(f"   ✅ Extraction Status: {status}")
        else:
            print(f"❌ Crawl failed: {result.error_message}")

# Run the example
if __name__ == "__main__":
    asyncio.run(extract_link_heads_example())

Expected Output:

✅ Successfully crawled: https://docs.python.org/3/
📄 Page title: 3.13.5 Documentation
🔗 Found 53 internal links
🌍 Found 1 external links
🧠 Links with head data extracted: 10

🏆 Top 3 Links with Full Scoring:

1. https://docs.python.org/3.15/
   Link Text: 'Python 3.15 (in development)...'
   📊 Intrinsic Score: 4.17/10.0 (URL quality & context)
   🎯 Contextual Score: 1.000 (BM25 relevance to query)
   ⭐ Total Score: 5.917 (combined final score)
   📰 Title: 3.15.0a0 Documentation...
   📝 Description: The official Python documentation...
   ✅ Extraction Status: valid

2.3 Configuration Deep Dive

The LinkPreviewConfig class supports these options:

from crawl4ai import LinkPreviewConfig

link_preview_config = LinkPreviewConfig(
    # BASIC SETTINGS
    verbose=True,                    # Show detailed logs (recommended for learning)

    # LINK FILTERING
    include_internal=True,           # Include same-domain links
    include_external=True,           # Include different-domain links
    max_links=50,                   # Maximum links to process (prevents overload)

    # PATTERN FILTERING
    include_patterns=[               # Only process links matching these patterns
        "*/docs/*", 
        "*/api/*", 
        "*/reference/*"
    ],
    exclude_patterns=[               # Skip links matching these patterns
        "*/login*",
        "*/admin*"
    ],

    # PERFORMANCE SETTINGS
    concurrency=10,                  # How many links to process simultaneously
    timeout=5,                      # Seconds to wait per link

    # RELEVANCE SCORING
    query="machine learning API",    # Query for BM25 contextual scoring
    score_threshold=0.3,            # Only include links above this score
)

2.4 Understanding the Three Score Types

Each extracted link gets three different scores:

1. Intrinsic Score (0-10) - URL and Content Quality

Based on URL structure, link text quality, and page context:

# High intrinsic score indicators:
# ✅ Clean URL structure (docs.python.org/api/reference)
# ✅ Meaningful link text ("API Reference Guide")
# ✅ Relevant to page context
# ✅ Not buried deep in navigation

# Low intrinsic score indicators:
# ❌ Random URLs (site.com/x7f9g2h)
# ❌ No link text or generic text ("Click here")
# ❌ Unrelated to page content

2. Contextual Score (0-1) - BM25 Relevance to Query

Only available when you provide a query. Uses the BM25 algorithm against head content:

# Example: query = "machine learning tutorial"
# High contextual score: Link to "Complete Machine Learning Guide"
# Low contextual score: Link to "Privacy Policy"

3. Total Score - Smart Combination

Intelligently combines intrinsic and contextual scores with fallbacks:

# When both scores available: (intrinsic * 0.3) + (contextual * 0.7)
# When only intrinsic: uses intrinsic score
# When only contextual: uses contextual score
# When neither: not calculated
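
If you want to reproduce this combination in your own post-processing (for example, to re-rank links), here is a minimal sketch based on the weights in the comment above; the library's internal normalization and fallback details may differ by version:

def combine_scores(intrinsic=None, contextual=None, w_intrinsic=0.3, w_contextual=0.7):
    # Both scores present: weighted combination (weights taken from the comment above)
    if intrinsic is not None and contextual is not None:
        return intrinsic * w_intrinsic + contextual * w_contextual
    # Only one score available: fall back to it
    if intrinsic is not None:
        return intrinsic
    if contextual is not None:
        return contextual
    # Neither available: no total score
    return None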

2.5 Practical Use Cases

Use Case 1: Research Assistant

Find the most relevant documentation pages:

async def research_assistant():
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            include_external=True,
            include_patterns=["*/docs/*", "*/tutorial/*", "*/guide/*"],
            query="machine learning neural networks",
            max_links=20,
            score_threshold=0.5,  # Only high-relevance links
            verbose=True
        ),
        score_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://scikit-learn.org/", config=config)

        if result.success:
            # Get high-scoring links
            good_links = [link for link in result.links.get("internal", [])
                         if link.get("total_score", 0) > 0.7]

            print(f"🎯 Found {len(good_links)} highly relevant links:")
            for link in good_links[:5]:
                print(f"⭐ {link['total_score']:.3f} - {link['href']}")
                print(f"   {link.get('head_data', {}).get('title', 'No title')}")

Use Case 2: Content Discovery

Find all API endpoints and references:

async def api_discovery():
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            include_patterns=["*/api/*", "*/reference/*"],
            exclude_patterns=["*/deprecated/*"],
            max_links=100,
            concurrency=15,
            verbose=False  # Clean output
        ),
        score_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.example-api.com/", config=config)

        if result.success:
            api_links = result.links.get("internal", [])

            # Group by endpoint type
            endpoints = {}
            for link in api_links:
                if link.get("head_data"):
                    title = link["head_data"].get("title", "")
                    if "GET" in title:
                        endpoints.setdefault("GET", []).append(link)
                    elif "POST" in title:
                        endpoints.setdefault("POST", []).append(link)

            for method, links in endpoints.items():
                print(f"\n{method} Endpoints ({len(links)}):")
                for link in links[:3]:
                    print(f"  • {link['href']}")

Use Case 3: Link Quality Analysis

Analyze website structure and content quality:

async def quality_analysis():
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            max_links=200,
            concurrency=20,
        ),
        score_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://your-website.com/", config=config)

        if result.success:
            links = result.links.get("internal", [])

            # Analyze intrinsic scores
            scores = [link.get('intrinsic_score', 0) for link in links]
            avg_score = sum(scores) / len(scores) if scores else 0

            print(f"📊 Link Quality Analysis:")
            print(f"   Average intrinsic score: {avg_score:.2f}/10.0")
            print(f"   High quality links (>7.0): {len([s for s in scores if s > 7.0])}")
            print(f"   Low quality links (<3.0): {len([s for s in scores if s < 3.0])}")

            # Find problematic links
            bad_links = [link for link in links 
                        if link.get('intrinsic_score', 0) < 2.0]

            if bad_links:
                print(f"\n⚠️  Links needing attention:")
                for link in bad_links[:5]:
                    print(f"   {link['href']} (score: {link.get('intrinsic_score', 0):.1f})")

2.6 Performance Tips

  1. Start Small: Begin with max_links: 10 to understand the feature

  2. Use Patterns: Filter with include_patterns to focus on relevant sections

  3. Adjust Concurrency: Higher concurrency = faster but more resource usage

  4. Set Timeouts: Use timeout: 5 to prevent hanging on slow sites

  5. Use Score Thresholds: Filter out low-quality links with score_threshold

2.7 Troubleshooting

No head data extracted?

# Check your configuration:
config = CrawlerRunConfig(
    link_preview_config=LinkPreviewConfig(
        verbose=True   # ← Enable to see what's happening
    )
)

Scores showing as None?

# Make sure scoring is enabled:
config = CrawlerRunConfig(
    score_links=True,  # ← Enable intrinsic scoring
    link_preview_config=LinkPreviewConfig(
        query="your search terms"  # ← For contextual scoring
    )
)

Process taking too long?

# Optimize performance:
link_preview_config = LinkPreviewConfig(
    max_links=20,      # ← Reduce number
    concurrency=10,    # ← Increase parallelism
    timeout=3,         # ← Shorter timeout
    include_patterns=["*/important/*"]  # ← Focus on key areas
)


3. Domain Filtering

Some websites contain hundreds of third-party or affiliate links. You can filter out certain domains at crawl time by configuring the crawler. The most relevant parameters in CrawlerRunConfig are:

  • exclude_external_links: If True, discard any link pointing outside the root domain.

  • exclude_social_media_domains: Provide a list of social media platforms (e.g., ["facebook.com", "twitter.com"]) to exclude from your crawl.

  • exclude_social_media_links: If True, automatically skip known social platforms.

  • exclude_domains: Provide a list of custom domains you want to exclude (e.g., ["spammyads.com", "tracker.net"]).

3.1 Example: Excluding External & Social Media Links

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    crawler_cfg = CrawlerRunConfig(
        exclude_external_links=True,          # No links outside primary domain
        exclude_social_media_links=True       # Skip recognized social media domains
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://www.example.com",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Crawled:", result.url)
            print("Internal links count:", len(result.links.get("internal", [])))
            print("External links count:", len(result.links.get("external", [])))  
            # Likely zero external links in this scenario
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

3.2 Example: Excluding Specific Domains

If you want to let external links in, but specifically exclude a domain (e.g., suspiciousads.com), do this:

crawler_cfg = CrawlerRunConfig(
    exclude_domains=["suspiciousads.com"]
)

This approach is handy when you still want external links but need to block certain sites you consider spammy.


4. Media Extraction

4.1 Accessing result.media

By default, Crawl4AI collects image, audio, and video URLs it finds on the page. These are stored in result.media, a dictionary keyed by media type (e.g., images, videos, audio). Note: Tables have been moved from result.media["tables"] to the new result.tables format for better organization and direct access.
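
As a quick illustration of reading that relocated table data, here is a minimal sketch; the per-table field names ("headers", "rows") are assumptions, so inspect your actual output to confirm:

# Hypothetical illustration of iterating result.tables after a successful crawl.
# "headers" and "rows" are assumed field names - check your version's output.
for i, table in enumerate(result.tables or []):
    headers = table.get("headers", [])
    rows = table.get("rows", [])
    print(f"Table {i}: {len(rows)} rows, columns: {headers}")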

Basic Example:

if result.success:
    # Get images
    images_info = result.media.get("images", [])
    print(f"Found {len(images_info)} images in total.")
    for i, img in enumerate(images_info[:3]):  # Inspect just the first 3
        print(f"[Image {i}] URL: {img['src']}")
        print(f"           Alt text: {img.get('alt', '')}")
        print(f"           Score: {img.get('score')}")
        print(f"           Description: {img.get('desc', '')}\n")

Structure Example:

result.media = {
  "images": [
    {
      "src": "https://cdn.prod.website-files.com/.../Group%2089.svg",
      "alt": "coding school for kids",
      "desc": "Trial Class Degrees degrees All Degrees AI Degree Technology ...",
      "score": 3,
      "type": "image",
      "group_id": 0,
      "format": None,
      "width": None,
      "height": None
    },
    # ...
  ],
  "videos": [
    # Similar structure but with video-specific fields
  ],
  "audio": [
    # Similar structure but with audio-specific fields
  ],
}

Depending on your Crawl4AI version or scraping strategy, these dictionaries can include fields like:

  • src: The media URL (e.g., image source)

  • alt: The alt text for images (if present)

  • desc: A snippet of nearby text or a short description (optional)

  • score: A heuristic relevance score if you're using content-scoring features

  • width, height: If the crawler detects dimensions for the image/video

  • type: Usually "image", "video", or "audio"

  • group_id: If you're grouping related media items, the crawler might assign an ID

With these details, you can easily filter out or focus on certain images (for instance, ignoring images with very low scores or from a different domain), or gather metadata for analytics.
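
For instance, a minimal sketch of such filtering; the score threshold and the domain check below are illustrative assumptions, not library defaults:

from urllib.parse import urlparse

def filter_images(result, min_score=2):
    # Keep images that score at least `min_score` and are hosted on the crawled domain.
    # Relative URLs (empty netloc) are treated as same-site.
    site_domain = urlparse(result.url).netloc
    kept = []
    for img in result.media.get("images", []):
        score = img.get("score") or 0
        img_domain = urlparse(img.get("src", "")).netloc
        same_site = (not img_domain) or img_domain.endswith(site_domain)
        if score >= min_score and same_site:
            kept.append(img)
    return kept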

4.2 Excluding External Images

If you're dealing with heavy pages or want to skip third-party images (advertisements, for example), you can turn on:

crawler_cfg = CrawlerRunConfig(
    exclude_external_images=True
)

This setting attempts to discard images from outside the primary domain, keeping only those from the site you're crawling.

Excluding All Images

If you want to completely remove all images from the page to maximize performance and reduce memory usage, use:

crawler_cfg = CrawlerRunConfig(
    exclude_all_images=True
)

This setting removes all images very early in the processing pipeline, which significantly improves memory efficiency and processing speed. This is particularly useful when:
- You don't need image data in your results
- You're crawling image-heavy pages that cause memory issues
- You want to focus only on text content
- You need to maximize crawling speed

4.3 Additional Media Config

  • screenshot: Set to True if you want a full-page screenshot stored as base64 in result.screenshot.

  • pdf: Set to True if you want a PDF version of the page in result.pdf (see the sketch after this list).

  • capture_mhtml: Set to True if you want an MHTML snapshot of the page in result.mhtml. This format preserves the entire web page with all its resources (CSS, images, scripts) in a single file, making it perfect for archiving or offline viewing.

  • wait_for_images: If True, attempts to wait until images are fully loaded before final extraction.
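
For example, here is a minimal sketch that captures and saves both a screenshot and a PDF; it assumes result.screenshot is a base64-encoded string (as noted above) and that result.pdf holds raw PDF bytes, which may vary by version:

import asyncio
import base64
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    cfg = CrawlerRunConfig(screenshot=True, pdf=True)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=cfg)

        if result.success:
            if result.screenshot:
                # Assumed: result.screenshot is a base64-encoded image string
                with open("example.png", "wb") as f:
                    f.write(base64.b64decode(result.screenshot))
            if result.pdf:
                # Assumed: result.pdf contains raw PDF bytes
                with open("example.pdf", "wb") as f:
                    f.write(result.pdf)

if __name__ == "__main__":
    asyncio.run(main())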

Example: Capturing Page as MHTML

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    crawler_cfg = CrawlerRunConfig(
        capture_mhtml=True  # Enable MHTML capture
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=crawler_cfg)

        if result.success and result.mhtml:
            # Save the MHTML snapshot to a file
            with open("example.mhtml", "w", encoding="utf-8") as f:
                f.write(result.mhtml)
            print("MHTML snapshot saved to example.mhtml")
        else:
            print("Failed to capture MHTML:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

The MHTML format is particularly useful because:
- It captures the complete page state, including all resources
- It can be opened in most modern browsers for offline viewing
- It preserves the page exactly as it appeared during crawling
- It's a single file, making it easy to store and transfer


5. Putting It All Together: Link & Media Filtering

Here's a combined example demonstrating how to filter out external links, skip certain domains, and exclude external images:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Suppose we want to keep only internal links, remove certain domains, 
    # and discard external images from the final crawl data.
    crawler_cfg = CrawlerRunConfig(
        exclude_external_links=True,
        exclude_domains=["spammyads.com"],
        exclude_social_media_links=True,   # skip Twitter, Facebook, etc.
        exclude_external_images=True,      # keep only images from main domain
        wait_for_images=True,             # ensure images are loaded
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://www.example.com", config=crawler_cfg)

        if result.success:
            print("[OK] Crawled:", result.url)

            # 1. Links
            in_links = result.links.get("internal", [])
            ext_links = result.links.get("external", [])
            print("Internal link count:", len(in_links))
            print("External link count:", len(ext_links))  # should be zero with exclude_external_links=True

            # 2. Images
            images = result.media.get("images", [])
            print("Images found:", len(images))

            # Let's see a snippet of these images
            for i, img in enumerate(images[:3]):
                print(f"  - {img['src']} (alt={img.get('alt','')}, score={img.get('score','N/A')})")
        else:
            print("[ERROR] Failed to crawl. Reason:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())

6. Common Pitfalls & Tips

1. Conflicting Flags:
- exclude_external_links=True combined with exclude_social_media_links=True is typically fine, but understand that the first setting already discards all external links, so the second becomes somewhat redundant.
- exclude_external_images=True but want to keep some external images? There is currently no partial, domain-based setting for images, so you might need a custom approach or hook logic.

2. Relevancy Scores:
- If your version of Crawl4AI or your scraping strategy includes an img["score"], it's typically a heuristic based on size, position, or content analysis. Evaluate it carefully if you rely on it.

3. Performance:
- Excluding certain domains or external images can speed up your crawl, especially for large, media-heavy pages.
- If you want a "full" link map, do not exclude them. Instead, post-filter in your own code (see the sketch below).
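
For instance, a minimal post-filtering sketch; the blocked-domain set is an arbitrary example, not a library default:

BLOCKED_DOMAINS = {"spammyads.com", "tracker.net"}  # example domains only

def post_filter_links(result):
    # Leave the full link map in `result` untouched; return a filtered view for downstream use
    external = result.links.get("external", [])
    return [link for link in external if link.get("base_domain") not in BLOCKED_DOMAINS]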

4. Social Media Lists:
- exclude_social_media_links=True typically references an internal list of known social domains like Facebook, Twitter, LinkedIn, etc. If you need to add to or remove from that list, look for library settings or a local config file (depending on your version).


That's it for Link & Media Analysis! You're now equipped to filter out unwanted sites and zero in on the images and videos that matter for your project.

