Content Selection

Crawl4AI provides multiple ways to select, filter, and refine the content of a crawl. Whether you need to target a specific CSS region, exclude entire tags, filter out external links, or strip certain domains and images, `CrawlerRunConfig` offers a wide range of parameters.

Below, we show how to configure these parameters and combine them for precise control.
1. CSS-Based Selection

There are two ways to select content from a page: using `css_selector` or the more flexible `target_elements`.

1.1 Using `css_selector`

A straightforward way to limit your crawl results to a certain region of the page is `css_selector` in `CrawlerRunConfig`:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # e.g., first 30 items from Hacker News
        css_selector=".athing:nth-child(-n+30)"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com/newest",
            config=config
        )
        print("Partial HTML length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```
Result: Only elements matching that selector remain in `result.cleaned_html`.
1.2 Using `target_elements`

The `target_elements` parameter offers greater flexibility, letting you target multiple elements for content extraction while preserving the whole-page context for other features:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # Target article body and sidebar, but not other content
        target_elements=["article.main-content", "aside.sidebar"]
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog-post",
            config=config
        )
        print("Markdown focused on target elements")
        print("Links from entire page still available:", len(result.links.get("internal", [])))

if __name__ == "__main__":
    asyncio.run(main())
```
Key difference: with `target_elements`, markdown generation and structured data extraction focus on those elements, while other page elements (such as links, images, and tables) are still extracted from the entire page. This gives you fine-grained control over what appears in your markdown content while preserving the full page context for link analysis and media collection.
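To see the difference in practice, here is a minimal sketch (the selector and URL are illustrative, not from the examples above) that crawls the same page twice, once scoped with `css_selector` and once with `target_elements`, then compares how many internal links each result retains:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def compare_scoping(url: str):
    # Hypothetical selector -- adjust it to the page you are crawling.
    scoped = CrawlerRunConfig(css_selector="article.main-content")
    targeted = CrawlerRunConfig(target_elements=["article.main-content"])

    async with AsyncWebCrawler() as crawler:
        scoped_result = await crawler.arun(url=url, config=scoped)
        targeted_result = await crawler.arun(url=url, config=targeted)

    # css_selector trims the HTML itself, so links outside the region are dropped;
    # target_elements focuses the markdown but keeps full-page links.
    print("Internal links with css_selector:  ", len(scoped_result.links.get("internal", [])))
    print("Internal links with target_elements:", len(targeted_result.links.get("internal", [])))

asyncio.run(compare_scoping("https://example.com/blog-post"))
```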
2. Content Filtering & Exclusions

2.1 Basic Overview
```python
config = CrawlerRunConfig(
    # Content thresholds
    word_count_threshold=10,  # Minimum words per block

    # Tag exclusions
    excluded_tags=['form', 'header', 'footer', 'nav'],

    # Link filtering
    exclude_external_links=True,
    exclude_social_media_links=True,

    # Block entire domains
    exclude_domains=["adtrackers.com", "spammynews.org"],
    exclude_social_media_domains=["facebook.com", "twitter.com"],

    # Media filtering
    exclude_external_images=True
)
```
Explanation:
- `word_count_threshold`: Ignores text blocks under X words. Helps skip trivial blocks like short nav items or disclaimers.
- `excluded_tags`: Removes entire tags (`<form>`, `<header>`, `<footer>`, etc.).
- Link filtering:
  - `exclude_external_links`: Strips out external links and may remove them from `result.links`.
  - `exclude_social_media_links`: Removes links pointing to known social media domains.
  - `exclude_domains`: A custom list of domains to block if discovered in links.
  - `exclude_social_media_domains`: A curated list of social media sites (override or add to it).
- Media filtering:
  - `exclude_external_images`: Discards images not hosted on the same domain as the main page (or its subdomains).
By default, if you set `exclude_social_media_links=True`, the following social media domains are excluded:
```python
[
    'facebook.com',
    'twitter.com',
    'x.com',
    'linkedin.com',
    'instagram.com',
    'pinterest.com',
    'tiktok.com',
    'snapchat.com',
    'reddit.com',
]
```
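If you need to block additional sites, the overview above notes that `exclude_social_media_domains` lets you override or add to this list. A minimal sketch (the extra domain is illustrative; check the configuration reference for whether the value replaces or extends the defaults in your version):

```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    exclude_social_media_links=True,
    # Illustrative list: two built-in entries plus one extra domain.
    exclude_social_media_domains=["facebook.com", "twitter.com", "mastodon.social"],
)
```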
2.2 Example Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    config = CrawlerRunConfig(
        css_selector="main.content",
        word_count_threshold=10,
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,
        exclude_social_media_links=True,
        exclude_domains=["ads.com", "spammytrackers.net"],
        exclude_external_images=True,
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com", config=config)
        print("Cleaned HTML length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```
Note: If these parameters strip out too much content, reduce or disable them accordingly.
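For example, a looser variant of the config above might keep more content by lowering the threshold and trimming the exclusion lists (the values here are illustrative):

```python
relaxed_config = CrawlerRunConfig(
    css_selector="main.content",
    word_count_threshold=5,        # keep shorter text blocks
    excluded_tags=["nav"],         # only drop navigation
    exclude_external_links=False,  # keep external links
)
```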
3. Handling Iframes

Some sites embed content inside `<iframe>` tags. If you want that content inlined:
```python
config = CrawlerRunConfig(
    # Merge iframe content into the final output
    process_iframes=True,
    remove_overlay_elements=True
)
```
Usage:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        process_iframes=True,
        remove_overlay_elements=True
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.org/iframe-demo",
            config=config
        )
        print("Iframe-merged length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```
4. Structured Extraction Examples

You can combine content selection with more advanced extraction strategies. For instance, a CSS-based or LLM-based extraction strategy can run on the filtered HTML.

4.1 Pattern-Based with `JsonCssExtractionStrategy`
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # Minimal schema for repeated items
    schema = {
        "name": "News Items",
        "baseSelector": "tr.athing",
        "fields": [
            {"name": "title", "selector": "span.titleline a", "type": "text"},
            {
                "name": "link",
                "selector": "span.titleline a",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }

    config = CrawlerRunConfig(
        # Content filtering
        excluded_tags=["form", "header"],
        exclude_domains=["adsite.com"],

        # CSS selection or entire page
        css_selector="table.itemlist",

        # No caching for demonstration
        cache_mode=CacheMode.BYPASS,

        # Extraction strategy
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com/newest",
            config=config
        )
        data = json.loads(result.extracted_content)
        print("Sample extracted item:", data[:1])  # Show first item

if __name__ == "__main__":
    asyncio.run(main())
```
4.2 LLM-Based Extraction
```python
import asyncio
import json
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai import LLMExtractionStrategy

class ArticleData(BaseModel):
    headline: str
    summary: str

async def main():
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4", api_token="sk-YOUR_API_KEY"),
        schema=ArticleData.schema(),
        extraction_type="schema",
        instruction="Extract 'headline' and a short 'summary' from the content."
    )

    config = CrawlerRunConfig(
        exclude_external_links=True,
        word_count_threshold=20,
        extraction_strategy=llm_strategy
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com", config=config)
        article = json.loads(result.extracted_content)
        print(article)

if __name__ == "__main__":
    asyncio.run(main())
```
Here, the crawler:
- Filters out external links (`exclude_external_links=True`).
- Skips very short text blocks (`word_count_threshold=20`).
- Passes the final HTML to your LLM strategy for AI-driven parsing.
5. Comprehensive Example

Below is a short function that unifies CSS selection, exclusion logic, and schema-based extraction, demonstrating how you can fine-tune your final data:
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_main_articles(url: str):
    schema = {
        "name": "ArticleBlock",
        "baseSelector": "div.article-block",
        "fields": [
            {"name": "headline", "selector": "h2", "type": "text"},
            {"name": "summary", "selector": ".summary", "type": "text"},
            {
                "name": "metadata",
                "type": "nested",
                "fields": [
                    {"name": "author", "selector": ".author", "type": "text"},
                    {"name": "date", "selector": ".date", "type": "text"}
                ]
            }
        ]
    }

    config = CrawlerRunConfig(
        # Keep only #main-content
        css_selector="#main-content",

        # Filtering
        word_count_threshold=10,
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,
        exclude_domains=["somebadsite.com"],
        exclude_external_images=True,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        if not result.success:
            print(f"Error: {result.error_message}")
            return None
        return json.loads(result.extracted_content)

async def main():
    articles = await extract_main_articles("https://news.ycombinator.com/newest")
    if articles:
        print("Extracted Articles:", articles[:2])  # Show first 2

if __name__ == "__main__":
    asyncio.run(main())
```
Why this works:
- CSS scoping with `#main-content`.
- Multiple `exclude_*` parameters to remove domains, external images, etc.
- `JsonCssExtractionStrategy` to parse repeated article blocks.
6. Scraping Modes

Crawl4AI provides two distinct scraping strategies for HTML content processing: `WebScrapingStrategy` (BeautifulSoup-based, the default) and `LXMLWebScrapingStrategy` (LXML-based). The LXML strategy offers significantly better performance, especially for large HTML documents.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LXMLWebScrapingStrategy

async def main():
    config = CrawlerRunConfig(
        scraping_strategy=LXMLWebScrapingStrategy()  # Faster alternative to default BeautifulSoup
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=config
        )
```
You can also create your own custom scraping strategy by inheriting from `ContentScrapingStrategy`. The strategy must return a `ScrapingResult` object with the following structure:
```python
import asyncio

from crawl4ai import ContentScrapingStrategy, ScrapingResult, MediaItem, Media, Link, Links

class CustomScrapingStrategy(ContentScrapingStrategy):
    def scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
        # Implement your custom scraping logic here
        return ScrapingResult(
            cleaned_html="<html>...</html>",  # Cleaned HTML content
            success=True,                     # Whether scraping was successful
            media=Media(
                images=[                      # List of images found
                    MediaItem(
                        src="https://example.com/image.jpg",
                        alt="Image description",
                        desc="Surrounding text",
                        score=1,
                        type="image",
                        group_id=1,
                        format="jpg",
                        width=800
                    )
                ],
                videos=[],                    # List of videos (same structure as images)
                audios=[]                     # List of audio files (same structure as images)
            ),
            links=Links(
                internal=[                    # List of internal links
                    Link(
                        href="https://example.com/page",
                        text="Link text",
                        title="Link title",
                        base_domain="example.com"
                    )
                ],
                external=[]                   # List of external links (same structure)
            ),
            metadata={                        # Additional metadata
                "title": "Page Title",
                "description": "Page description"
            }
        )

    async def ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
        # For simple cases, you can delegate to the sync version
        return await asyncio.to_thread(self.scrap, url, html, **kwargs)
```
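Once defined, the custom strategy plugs into the crawler the same way as the built-in ones:

```python
config = CrawlerRunConfig(
    scraping_strategy=CustomScrapingStrategy()
)
```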
Performance Considerations

The LXML strategy can be 10-20x faster than the BeautifulSoup strategy, especially when processing large HTML documents. However, be aware that:
- The LXML strategy is currently experimental
- In some edge cases, parsing results may differ slightly from BeautifulSoup
- If you find any inconsistencies between LXML and BeautifulSoup results, please raise an issue with a reproducible example
Choose the LXML strategy when:
- Processing large HTML documents (recommended above 100KB)
- Performance is critical
- Working with well-formed HTML

Stick with the BeautifulSoup strategy (the default) when:
- Maximum compatibility is needed
- Handling potentially malformed HTML
- Exact parsing behavior is critical
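To check the speedup on your own pages, here is a rough timing sketch (the URL is a placeholder, and a single `arun` call includes network time, so treat the numbers as indicative only):

```python
import asyncio
import time
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LXMLWebScrapingStrategy

async def time_strategy(config: CrawlerRunConfig, label: str):
    async with AsyncWebCrawler() as crawler:
        start = time.perf_counter()
        await crawler.arun(url="https://example.com/large-page", config=config)
        print(f"{label}: {time.perf_counter() - start:.2f}s")

async def main():
    # Bypass the cache so both runs actually parse the HTML.
    await time_strategy(CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
                        "BeautifulSoup (default)")
    await time_strategy(CrawlerRunConfig(cache_mode=CacheMode.BYPASS,
                                         scraping_strategy=LXMLWebScrapingStrategy()),
                        "LXML")

asyncio.run(main())
```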
7. Combining CSS Selection Methods

You can combine `css_selector` and `target_elements` in powerful ways for fine-grained control over the output:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    # Target specific content but preserve page context
    config = CrawlerRunConfig(
        # Focus markdown on main content and sidebar
        target_elements=["#main-content", ".sidebar"],

        # Global filters applied to entire page
        excluded_tags=["nav", "footer", "header"],
        exclude_external_links=True,

        # Use basic content thresholds
        word_count_threshold=15,

        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/article",
            config=config
        )
        print("Content focuses on specific elements, but all links still analyzed")
        print(f"Internal links: {len(result.links.get('internal', []))}")
        print(f"External links: {len(result.links.get('external', []))}")

if __name__ == "__main__":
    asyncio.run(main())
```
This approach gives you the best of both worlds:
- Markdown generation and content extraction focus on the elements you care about
- Links, images, and other page data still give you the full context of the page
- Content filtering still applies globally
8. Conclusion

By mixing `target_elements` or `css_selector` scoping, content filtering parameters, and advanced extraction strategies, you can precisely choose which data to keep. Key `CrawlerRunConfig` parameters for content selection include:

- `target_elements` – An array of CSS selectors to focus markdown generation and data extraction while preserving full-page context for links and media.
- `css_selector` – Basic scoping of all extraction to an element or region.
- `word_count_threshold` – Skip short blocks.
- `excluded_tags` – Remove entire HTML tags.
- `exclude_external_links`, `exclude_social_media_links`, `exclude_domains` – Filter out unwanted links or domains.
- `exclude_external_images` – Remove images from external sources.
- `process_iframes` – Merge iframe content if needed.

Combine these with structured extraction (CSS-based, LLM-based, or others) to build powerful crawls that yield exactly the content you want, from raw or cleaned HTML up to sophisticated JSON structures. For more details, see the Configuration Reference. Enjoy curating your data!