# Content Selection
Crawl4AI provides multiple ways to select, filter, and refine the content from your crawls. Whether you need to target a specific CSS region, exclude entire tags, filter out external links, or remove certain domains and images, `CrawlerRunConfig` offers a wide range of parameters.

Below, we show how to configure these parameters and combine them for precise control.
## 1. CSS-Based Selection
There are two ways to select content from a page: using `css_selector` or the more flexible `target_elements`.
### 1.1 Using `css_selector`
A straightforward way to limit your crawl results to a certain region of the page is `css_selector` in `CrawlerRunConfig`:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # e.g., first 30 items from Hacker News
        css_selector=".athing:nth-child(-n+30)"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com/newest",
            config=config
        )
        print("Partial HTML length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```
**Result**: Only elements matching that selector remain in `result.cleaned_html`.
### 1.2 Using `target_elements`
The `target_elements` parameter provides more flexibility by allowing you to target multiple elements for content extraction while preserving the entire page context for other features:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # Target article body and sidebar, but not other content
        target_elements=["article.main-content", "aside.sidebar"]
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog-post",
            config=config
        )
        print("Markdown focused on target elements")
        print("Links from entire page still available:", len(result.links.get("internal", [])))

if __name__ == "__main__":
    asyncio.run(main())
```
**Key difference**: With `target_elements`, markdown generation and structured data extraction focus on those elements, but other page elements (like links, images, and tables) are still extracted from the entire page. This gives you fine-grained control over what appears in your markdown content while preserving full page context for link analysis and media collection.
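To make the contrast concrete, here is a minimal sketch (the URL and selectors are hypothetical) that crawls the same page twice: once scoped with `css_selector`, once with `target_elements`. With the former, everything is restricted to the selected region; with the latter, markdown is focused on the targeted elements while links are still gathered page-wide, as described above.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def compare_selection_modes():
    url = "https://example.com/blog-post"  # hypothetical URL for illustration

    # Scopes all extraction (HTML, markdown, links) to the matched region
    scoped = CrawlerRunConfig(css_selector="article.main-content")

    # Focuses markdown/extraction but keeps whole-page context for links and media
    targeted = CrawlerRunConfig(target_elements=["article.main-content"])

    async with AsyncWebCrawler() as crawler:
        scoped_result = await crawler.arun(url=url, config=scoped)
        targeted_result = await crawler.arun(url=url, config=targeted)

    # Expect fewer links here, since only the selected region is processed
    print("css_selector internal links:", len(scoped_result.links.get("internal", [])))
    # Links from the entire page remain available here
    print("target_elements internal links:", len(targeted_result.links.get("internal", [])))

if __name__ == "__main__":
    asyncio.run(compare_selection_modes())
```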
## 2. Content Filtering & Exclusions
### 2.1 Basic Overview
```python
config = CrawlerRunConfig(
    # Content thresholds
    word_count_threshold=10,   # Minimum words per block

    # Tag exclusions
    excluded_tags=['form', 'header', 'footer', 'nav'],

    # Link filtering
    exclude_external_links=True,
    exclude_social_media_links=True,

    # Block entire domains
    exclude_domains=["adtrackers.com", "spammynews.org"],
    exclude_social_media_domains=["facebook.com", "twitter.com"],

    # Media filtering
    exclude_external_images=True
)
```
**Explanation**:

- `word_count_threshold`: Ignores text blocks under X words. Helps skip trivial blocks like short nav or disclaimers.
- `excluded_tags`: Removes entire tags (`<form>`, `<header>`, `<footer>`, etc.).
- **Link filtering**:
  - `exclude_external_links`: Strips out external links and may remove them from `result.links`.
  - `exclude_social_media_links`: Removes links pointing to known social media domains.
  - `exclude_domains`: A custom list of domains to block if discovered in links.
  - `exclude_social_media_domains`: A curated list (override or add to it) for social media sites.
- **Media filtering**:
  - `exclude_external_images`: Discards images not hosted on the same domain as the main page (or its subdomains).
By default, if you set `exclude_social_media_links=True`, the following social media domains are excluded:
```python
[
    'facebook.com',
    'twitter.com',
    'x.com',
    'linkedin.com',
    'instagram.com',
    'pinterest.com',
    'tiktok.com',
    'snapchat.com',
    'reddit.com',
]
```
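Since `exclude_social_media_domains` is a curated list you can override or extend, a minimal sketch of supplying your own list might look like the following (the added domain is purely illustrative):

```python
from crawl4ai import CrawlerRunConfig

# Sketch: pass a custom list to replace or extend the built-in social media domains
config = CrawlerRunConfig(
    exclude_social_media_links=True,
    exclude_social_media_domains=[
        "facebook.com",
        "twitter.com",
        "mastodon.social",  # hypothetical addition for illustration
    ],
)
```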
### 2.2 Example Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    config = CrawlerRunConfig(
        css_selector="main.content",
        word_count_threshold=10,
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,
        exclude_social_media_links=True,
        exclude_domains=["ads.com", "spammytrackers.net"],
        exclude_external_images=True,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com", config=config)
        print("Cleaned HTML length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```
**Note**: If these parameters remove too much, reduce or disable them accordingly.
## 3. Handling Iframes
Some sites embed content in `<iframe>` tags. If you want that content inlined:
```python
config = CrawlerRunConfig(
    # Merge iframe content into the final output
    process_iframes=True,
    remove_overlay_elements=True
)
```
**Usage**:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        process_iframes=True,
        remove_overlay_elements=True
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.org/iframe-demo",
            config=config
        )
        print("Iframe-merged length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```
## 4. Structured Extraction Examples
You can combine content selection with a more advanced extraction strategy. For instance, a CSS-based or LLM-based extraction strategy can run on the filtered HTML.
### 4.1 Pattern-Based with `JsonCssExtractionStrategy`
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # Minimal schema for repeated items
    schema = {
        "name": "News Items",
        "baseSelector": "tr.athing",
        "fields": [
            {"name": "title", "selector": "span.titleline a", "type": "text"},
            {
                "name": "link",
                "selector": "span.titleline a",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }

    config = CrawlerRunConfig(
        # Content filtering
        excluded_tags=["form", "header"],
        exclude_domains=["adsite.com"],

        # CSS selection or entire page
        css_selector="table.itemlist",

        # No caching for demonstration
        cache_mode=CacheMode.BYPASS,

        # Extraction strategy
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com/newest",
            config=config
        )
        data = json.loads(result.extracted_content)
        print("Sample extracted item:", data[:1])  # Show first item

if __name__ == "__main__":
    asyncio.run(main())
```
### 4.2 LLM-Based Extraction
```python
import asyncio
import json
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai import LLMExtractionStrategy

class ArticleData(BaseModel):
    headline: str
    summary: str

async def main():
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4", api_token="sk-YOUR_API_KEY"),
        schema=ArticleData.schema(),
        extraction_type="schema",
        instruction="Extract 'headline' and a short 'summary' from the content."
    )

    config = CrawlerRunConfig(
        exclude_external_links=True,
        word_count_threshold=20,
        extraction_strategy=llm_strategy
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://news.ycombinator.com", config=config)
        article = json.loads(result.extracted_content)
        print(article)

if __name__ == "__main__":
    asyncio.run(main())
```
Here, the crawler:

- Filters out external links (`exclude_external_links=True`).
- Ignores very short text blocks (`word_count_threshold=20`).
- Passes the final HTML to your LLM strategy for an AI-driven parse.
## 5. Comprehensive Example
Below is a short function that unifies CSS selection, exclusion logic, and a pattern-based extraction, demonstrating how you can fine-tune your final data:
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_main_articles(url: str):
    schema = {
        "name": "ArticleBlock",
        "baseSelector": "div.article-block",
        "fields": [
            {"name": "headline", "selector": "h2", "type": "text"},
            {"name": "summary", "selector": ".summary", "type": "text"},
            {
                "name": "metadata",
                "type": "nested",
                "fields": [
                    {"name": "author", "selector": ".author", "type": "text"},
                    {"name": "date", "selector": ".date", "type": "text"}
                ]
            }
        ]
    }

    config = CrawlerRunConfig(
        # Keep only #main-content
        css_selector="#main-content",

        # Filtering
        word_count_threshold=10,
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,
        exclude_domains=["somebadsite.com"],
        exclude_external_images=True,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        if not result.success:
            print(f"Error: {result.error_message}")
            return None
        return json.loads(result.extracted_content)

async def main():
    articles = await extract_main_articles("https://news.ycombinator.com/newest")
    if articles:
        print("Extracted Articles:", articles[:2])  # Show first 2

if __name__ == "__main__":
    asyncio.run(main())
```
**Why This Works**:

- CSS scoping with `#main-content`.
- Multiple `exclude_` parameters to remove domains, external images, etc.
- A `JsonCssExtractionStrategy` to parse repeated article blocks.
## 6. Scraping Modes
Crawl4AI uses `LXMLWebScrapingStrategy` (LXML-based) as the default scraping strategy for HTML content processing. This strategy offers excellent performance, especially for large HTML documents.
**Note**: For backward compatibility, `WebScrapingStrategy` is still available as an alias for `LXMLWebScrapingStrategy`.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LXMLWebScrapingStrategy

async def main():
    # Default configuration already uses LXMLWebScrapingStrategy
    config = CrawlerRunConfig()

    # Or explicitly specify it if desired
    config_explicit = CrawlerRunConfig(
        scraping_strategy=LXMLWebScrapingStrategy()
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=config
        )
```
You can also create your own custom scraping strategy by inheriting from `ContentScrapingStrategy`. The strategy must return a `ScrapingResult` object with the following structure:
```python
import asyncio
from crawl4ai import ContentScrapingStrategy, ScrapingResult, MediaItem, Media, Link, Links

class CustomScrapingStrategy(ContentScrapingStrategy):
    def scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
        # Implement your custom scraping logic here
        return ScrapingResult(
            cleaned_html="<html>...</html>",  # Cleaned HTML content
            success=True,                     # Whether scraping was successful
            media=Media(
                images=[                      # List of images found
                    MediaItem(
                        src="https://example.com/image.jpg",
                        alt="Image description",
                        desc="Surrounding text",
                        score=1,
                        type="image",
                        group_id=1,
                        format="jpg",
                        width=800
                    )
                ],
                videos=[],                    # List of videos (same structure as images)
                audios=[]                     # List of audio files (same structure as images)
            ),
            links=Links(
                internal=[                    # List of internal links
                    Link(
                        href="https://example.com/page",
                        text="Link text",
                        title="Link title",
                        base_domain="example.com"
                    )
                ],
                external=[]                   # List of external links (same structure)
            ),
            metadata={                        # Additional metadata
                "title": "Page Title",
                "description": "Page description"
            }
        )

    async def ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
        # For simple cases, you can delegate to the sync version
        return await asyncio.to_thread(self.scrap, url, html, **kwargs)
```
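Once defined, a custom strategy is plugged in the same way as the built-in one, via the `scraping_strategy` parameter shown earlier. A minimal sketch, assuming the `CustomScrapingStrategy` class above:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Use the custom strategy just like LXMLWebScrapingStrategy
    config = CrawlerRunConfig(scraping_strategy=CustomScrapingStrategy())

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print("Custom-scraped HTML length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```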
### Performance Considerations
The LXML strategy provides excellent performance, particularly when processing large HTML documents, offering up to 10-20x faster processing compared to BeautifulSoup-based approaches.
Benefits of the LXML strategy:

- Fast processing of large HTML documents (especially >100KB)
- Efficient memory usage
- Good handling of well-formed HTML
- Robust table detection and extraction
### Backward Compatibility
For users upgrading from earlier versions:

- `WebScrapingStrategy` is now an alias for `LXMLWebScrapingStrategy`
- Existing code using `WebScrapingStrategy` will continue to work without modification
- No changes are required to your existing code
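As a quick illustration of that compatibility note, a legacy configuration keeps working unchanged. This is a sketch that assumes the alias remains importable from the top-level package, as the note above implies:

```python
from crawl4ai import CrawlerRunConfig, WebScrapingStrategy

# Legacy name still resolves to the LXML-based strategy, per the alias noted above
config = CrawlerRunConfig(scraping_strategy=WebScrapingStrategy())
```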
## 7. Combining CSS Selection Methods
You can combine `css_selector` and `target_elements` in powerful ways to achieve fine-grained control over your output:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    # Target specific content but preserve page context
    config = CrawlerRunConfig(
        # Focus markdown on main content and sidebar
        target_elements=["#main-content", ".sidebar"],

        # Global filters applied to entire page
        excluded_tags=["nav", "footer", "header"],
        exclude_external_links=True,

        # Use basic content thresholds
        word_count_threshold=15,

        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/article",
            config=config
        )

        print("Content focuses on specific elements, but all links still analyzed")
        print(f"Internal links: {len(result.links.get('internal', []))}")
        print(f"External links: {len(result.links.get('external', []))}")

if __name__ == "__main__":
    asyncio.run(main())
```
This approach gives you the best of both worlds:

- Markdown generation and content extraction focus on the elements you care about
- Links, images, and other page data still give you the full context of the page
- Content filtering still applies globally
## 8. Conclusion
By mixing `target_elements` or `css_selector` scoping, content filtering parameters, and advanced extraction strategies, you can precisely choose which data to keep. Key parameters in `CrawlerRunConfig` for content selection include:
- `target_elements` – Array of CSS selectors to focus markdown generation and data extraction, while preserving full page context for links and media.
- `css_selector` – Basic scoping to an element or region for all extraction processes.
- `word_count_threshold` – Skip short blocks.
- `excluded_tags` – Remove entire HTML tags.
- `exclude_external_links`, `exclude_social_media_links`, `exclude_domains` – Filter out unwanted links or domains.
- `exclude_external_images` – Remove images from external sources.
- `process_iframes` – Merge iframe content if needed.
Combine these with structured extraction (CSS, LLM-based, or others) to build powerful crawls that yield exactly the content you want, from raw or cleaned HTML up to sophisticated JSON structures. For more detail, see the Configuration Reference. Enjoy curating your data to the max!