Content Selection

Crawl4AI provides multiple ways to select, filter, and refine crawled content. Whether you need to target a specific CSS region, exclude entire tags, filter out external links, or strip certain domains and images, CrawlerRunConfig offers a wide range of parameters.

Below, we show how to configure these parameters and combine them for precise control.


1. CSS-Based Selection

There are two ways to select content from a page: using css_selector or the more flexible target_elements.

1.1 Using css_selector

A straightforward way to limit crawl results to a specific region of the page is css_selector in CrawlerRunConfig:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # e.g., first 30 items from Hacker News
        css_selector=".athing:nth-child(-n+30)"  
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com/newest", 
            config=config
        )
        print("Partial HTML length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())

Result: Only elements matching that selector remain in result.cleaned_html.

1.2 Using target_elements

The target_elements parameter offers greater flexibility, letting you target multiple elements for content extraction while preserving the full page context for other features:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # Target article body and sidebar, but not other content
        target_elements=["article.main-content", "aside.sidebar"]
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog-post", 
            config=config
        )
        print("Markdown focused on target elements")
        print("Links from entire page still available:", len(result.links.get("internal", [])))

if __name__ == "__main__":
    asyncio.run(main())

Key difference: with target_elements, markdown generation and structured data extraction focus on those elements, while other page elements (such as links, images, and tables) are still extracted from the entire page. This gives you fine-grained control over what appears in the markdown output while keeping the full page context for link analysis and media collection.
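
You can see this split directly inside the arun() block of the example above (a minimal sketch, assuming the result fields used elsewhere in these docs: result.markdown, result.media, and result.links):

        # Markdown is scoped to the target elements...
        print(str(result.markdown)[:300])
        # ...but media and links are still collected from the entire page.
        print("Images found:", len(result.media.get("images", [])))
        print("External links:", len(result.links.get("external", [])))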


2. Content Filtering & Exclusions

2.1 Basic Overview

config = CrawlerRunConfig(
    # Content thresholds
    word_count_threshold=10,        # Minimum words per block

    # Tag exclusions
    excluded_tags=['form', 'header', 'footer', 'nav'],

    # Link filtering
    exclude_external_links=True,    
    exclude_social_media_links=True,
    # Block entire domains
    exclude_domains=["adtrackers.com", "spammynews.org"],    
    exclude_social_media_domains=["facebook.com", "twitter.com"],

    # Media filtering
    exclude_external_images=True
)

Explanation:

  • word_count_threshold: Ignores text blocks under X words. Helps skip trivial blocks such as short navigation snippets or disclaimers.
  • excluded_tags: Removes entire tags (<form>, <header>, <footer>, etc.).
  • Link filtering:
  • exclude_external_links: Removes external links and may strip them from result.links.
  • exclude_social_media_links: Removes links pointing to known social media domains.
  • exclude_domains: A custom list of domains to block when discovered in links.
  • exclude_social_media_domains: A curated list of social media sites (override or add to it).
  • Media filtering:
  • exclude_external_images: Discards images not hosted on the same domain as the main page (or its subdomains).

By default, if you set exclude_social_media_links=True, the following social media domains are excluded:

[
    'facebook.com',
    'twitter.com',
    'x.com',
    'linkedin.com',
    'instagram.com',
    'pinterest.com',
    'tiktok.com',
    'snapchat.com',
    'reddit.com',
]
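
If the defaults don't fit your use case, you can supply your own list via exclude_social_media_domains, which (as noted above) can override or add to the defaults. A minimal sketch, where mastodon.social is just an illustrative entry:

config = CrawlerRunConfig(
    exclude_social_media_links=True,
    # Custom list of social media domains to block
    exclude_social_media_domains=["facebook.com", "x.com", "mastodon.social"]
)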

2.2 Example Usage

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    config = CrawlerRunConfig(
        css_selector="main.content", 
        word_count_threshold=10,
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,
        exclude_social_media_links=True,
        exclude_domains=["ads.com", "spammytrackers.net"],
        exclude_external_images=True,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com", config=config)
        print("Cleaned HTML length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())

Note: If these parameters remove too much content, reduce or disable them accordingly.


3. Handling Iframes

Some sites embed content inside <iframe> tags. If you want it inlined:

config = CrawlerRunConfig(
    # Merge iframe content into the final output
    process_iframes=True,    
    remove_overlay_elements=True
)

Usage:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        process_iframes=True,
        remove_overlay_elements=True
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.org/iframe-demo", 
            config=config
        )
        print("Iframe-merged length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())


4. Structured Extraction Examples

You can combine content selection with more advanced extraction strategies. For instance, a CSS-based or LLM-based extraction strategy can run on the filtered HTML.

4.1 Pattern-Based with JsonCssExtractionStrategy

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # Minimal schema for repeated items
    schema = {
        "name": "News Items",
        "baseSelector": "tr.athing",
        "fields": [
            {"name": "title", "selector": "span.titleline a", "type": "text"},
            {
                "name": "link", 
                "selector": "span.titleline a", 
                "type": "attribute", 
                "attribute": "href"
            }
        ]
    }

    config = CrawlerRunConfig(
        # Content filtering
        excluded_tags=["form", "header"],
        exclude_domains=["adsite.com"],

        # CSS selection or entire page
        css_selector="table.itemlist",

        # No caching for demonstration
        cache_mode=CacheMode.BYPASS,

        # Extraction strategy
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com/newest", 
            config=config
        )
        data = json.loads(result.extracted_content)
        print("Sample extracted item:", data[:1])  # Show first item

if __name__ == "__main__":
    asyncio.run(main())

4.2 LLM-Based Extraction

import asyncio
import json
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai import LLMExtractionStrategy

class ArticleData(BaseModel):
    headline: str
    summary: str

async def main():
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4", api_token="sk-YOUR_API_KEY"),
        schema=ArticleData.schema(),
        extraction_type="schema",
        instruction="Extract 'headline' and a short 'summary' from the content."
    )

    config = CrawlerRunConfig(
        exclude_external_links=True,
        word_count_threshold=20,
        extraction_strategy=llm_strategy
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com", config=config)
        article = json.loads(result.extracted_content)
        print(article)

if __name__ == "__main__":
    asyncio.run(main())

Here, the crawler:

  • Filters out external links (exclude_external_links=True).
  • Ignores very short text blocks (word_count_threshold=20).
  • Passes the final HTML to your LLM strategy for AI-driven parsing.

5. Comprehensive Example

Below is a short function that unifies CSS selection, exclusion logic, and schema-based extraction, demonstrating how to fine-tune the final data:

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_main_articles(url: str):
    schema = {
        "name": "ArticleBlock",
        "baseSelector": "div.article-block",
        "fields": [
            {"name": "headline", "selector": "h2", "type": "text"},
            {"name": "summary", "selector": ".summary", "type": "text"},
            {
                "name": "metadata",
                "type": "nested",
                "fields": [
                    {"name": "author", "selector": ".author", "type": "text"},
                    {"name": "date", "selector": ".date", "type": "text"}
                ]
            }
        ]
    }

    config = CrawlerRunConfig(
        # Keep only #main-content
        css_selector="#main-content",

        # Filtering
        word_count_threshold=10,
        excluded_tags=["nav", "footer"],  
        exclude_external_links=True,
        exclude_domains=["somebadsite.com"],
        exclude_external_images=True,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        if not result.success:
            print(f"Error: {result.error_message}")
            return None
        return json.loads(result.extracted_content)

async def main():
    articles = await extract_main_articles("https://news.ycombinator.com/newest")
    if articles:
        print("Extracted Articles:", articles[:2])  # Show first 2

if __name__ == "__main__":
    asyncio.run(main())

Why this works:

  • CSS scoping with #main-content.
  • Multiple exclude_* parameters to strip unwanted domains, external images, and so on.
  • JsonCssExtractionStrategy to parse repeated article blocks.


6. Scraping Modes

Crawl4AI provides two distinct scraping strategies for HTML content processing: WebScrapingStrategy (BeautifulSoup-based, the default) and LXMLWebScrapingStrategy (LXML-based). The LXML strategy offers significantly better performance, especially for large HTML documents.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LXMLWebScrapingStrategy

async def main():
    config = CrawlerRunConfig(
        scraping_strategy=LXMLWebScrapingStrategy()  # Faster alternative to default BeautifulSoup
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=config
        )
        print("LXML-scraped length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())

You can also create your own custom scraping strategy by inheriting from ContentScrapingStrategy. The strategy must return a ScrapingResult object with the following structure:

import asyncio

from crawl4ai import ContentScrapingStrategy, ScrapingResult, MediaItem, Media, Link, Links

class CustomScrapingStrategy(ContentScrapingStrategy):
    def scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
        # Implement your custom scraping logic here
        return ScrapingResult(
            cleaned_html="<html>...</html>",  # Cleaned HTML content
            success=True,                     # Whether scraping was successful
            media=Media(
                images=[                      # List of images found
                    MediaItem(
                        src="https://example.com/image.jpg",
                        alt="Image description",
                        desc="Surrounding text",
                        score=1,
                        type="image",
                        group_id=1,
                        format="jpg",
                        width=800
                    )
                ],
                videos=[],                    # List of videos (same structure as images)
                audios=[]                     # List of audio files (same structure as images)
            ),
            links=Links(
                internal=[                    # List of internal links
                    Link(
                        href="https://example.com/page",
                        text="Link text",
                        title="Link title",
                        base_domain="example.com"
                    )
                ],
                external=[]                   # List of external links (same structure)
            ),
            metadata={                        # Additional metadata
                "title": "Page Title",
                "description": "Page description"
            }
        )

    async def ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
        # For simple cases, you can use the sync version
        return await asyncio.to_thread(self.scrap, url, html, **kwargs)
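
To plug the custom strategy in, pass an instance via scraping_strategy, just as with the built-in strategies. A minimal usage sketch based on the CustomScrapingStrategy class above:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        scraping_strategy=CustomScrapingStrategy()  # the class defined above
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print("Custom-scraped length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())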

Performance Considerations

The LXML strategy can be 10-20x faster than the BeautifulSoup strategy, especially when processing large HTML documents. However, be aware that:

  1. The LXML strategy is currently experimental.
  2. In some edge cases, parsing results may differ slightly from BeautifulSoup's.
  3. If you find any inconsistencies between LXML and BeautifulSoup results, please open an issue with a reproducible example.

Choose the LXML strategy when:

  • Processing large HTML documents (>100KB recommended)
  • Performance is critical
  • Working with well-formed HTML

Stick with the BeautifulSoup strategy (the default) when:

  • Maximum compatibility is needed
  • Handling potentially malformed HTML
  • Exact parsing behavior is critical
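
To see how the two strategies compare on your own pages, you can time them side by side. This is a rough sketch (it assumes WebScrapingStrategy is importable from crawl4ai like LXMLWebScrapingStrategy, and a single arun() call includes network time, so run it against a large page and treat the numbers as indicative only):

import asyncio
import time
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import WebScrapingStrategy, LXMLWebScrapingStrategy

async def time_strategy(url: str, strategy) -> float:
    config = CrawlerRunConfig(
        scraping_strategy=strategy,
        cache_mode=CacheMode.BYPASS  # avoid cache hits skewing the comparison
    )
    async with AsyncWebCrawler() as crawler:
        start = time.perf_counter()
        await crawler.arun(url=url, config=config)
        return time.perf_counter() - start

async def main():
    url = "https://news.ycombinator.com"  # substitute a large page for a clearer signal
    bs_time = await time_strategy(url, WebScrapingStrategy())
    lxml_time = await time_strategy(url, LXMLWebScrapingStrategy())
    print(f"BeautifulSoup: {bs_time:.2f}s, LXML: {lxml_time:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())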


7. Combining CSS Selection Methods

You can combine css_selector and target_elements in powerful ways to gain fine-grained control over the output:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    # Target specific content but preserve page context
    config = CrawlerRunConfig(
        # Focus markdown on main content and sidebar
        target_elements=["#main-content", ".sidebar"],

        # Global filters applied to entire page
        excluded_tags=["nav", "footer", "header"],
        exclude_external_links=True,

        # Use basic content thresholds
        word_count_threshold=15,

        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/article",
            config=config
        )

        print(f"Content focuses on specific elements, but all links still analyzed")
        print(f"Internal links: {len(result.links.get('internal', []))}")
        print(f"External links: {len(result.links.get('external', []))}")

if __name__ == "__main__":
    asyncio.run(main())

This approach gives you the best of both worlds:

  • Markdown generation and content extraction focus on the elements you care about
  • Links, images, and other page data still give you the full context of the page
  • Content filtering still applies globally

8. Conclusion

By mixing target_elements or css_selector scoping, content filtering parameters, and advanced extraction strategies, you can select precisely which data to keep. The key CrawlerRunConfig parameters for content selection include:

  1. target_elements – An array of CSS selectors for focusing markdown generation and data extraction while preserving full-page context for links and media.
  2. css_selector – Basic scoping of all extraction to a single element or region.
  3. word_count_threshold – Skip short text blocks.
  4. excluded_tags – Remove entire HTML tags.
  5. exclude_external_links, exclude_social_media_links, exclude_domains – Filter out unwanted links or domains.
  6. exclude_external_images – Remove images from external sources.
  7. process_iframes – Merge iframe content if needed.

Combine these with structured extraction (CSS-based, LLM-based, or otherwise) to build powerful crawls that yield exactly the content you want, from raw or cleaned HTML up to sophisticated JSON structures. For more details, see the Configuration Reference. Enjoy curating your data!

