Fit Markdown with Pruning & BM25
Fit Markdown is a filtered version of a page’s markdown that keeps only the most relevant content. By default, Crawl4AI converts the entire HTML into a broad raw_markdown. With fit markdown, a content filter algorithm (e.g., Pruning or BM25) removes or ranks low-value sections, such as repetitive sidebars, shallow text blocks, or irrelevant content, leaving a concise textual “core.”
1. How “Fit Markdown” Works
1.1 The content_filter
In CrawlerRunConfig, you can specify a content_filter (through its markdown generator) to shape how content is pruned or ranked before final markdown generation. The filter’s logic is applied before or during the HTML→Markdown process, producing:
- result.markdown.raw_markdown (unfiltered)
- result.markdown.fit_markdown (filtered or “fit” version)
- result.markdown.fit_html (the corresponding HTML snippet that produced fit_markdown)
1.2 Common Filters
1. PruningContentFilter – Scores each node by text density, link density, and tag importance, discarding those below a threshold.
2. BM25ContentFilter – Focuses on textual relevance using BM25 ranking, especially useful if you have a specific user query (e.g., “machine learning” or “food nutrition”).
2. PruningContentFilter
Pruning discards less relevant nodes based on text density, link density, and tag importance. It is a heuristic approach: sections that appear too “thin” or too “spammy” are pruned.
2.1 Usage Example
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Step 1: Create a pruning filter
    prune_filter = PruningContentFilter(
        # Lower → more content retained, higher → more content pruned
        threshold=0.45,
        # "fixed" or "dynamic"
        threshold_type="dynamic",
        # Ignore nodes with <5 words
        min_word_threshold=5
    )

    # Step 2: Insert it into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)

    # Step 3: Pass it to CrawlerRunConfig
    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )
        if result.success:
            # 'fit_markdown' is your pruned content, focusing on "denser" text
            print("Raw Markdown length:", len(result.markdown.raw_markdown))
            print("Fit Markdown length:", len(result.markdown.fit_markdown))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
2.2 Key Parameters
- min_word_threshold (int): If a block has fewer words than this, it’s pruned.
- threshold_type (str):
  - "fixed" → each node must exceed threshold (0–1).
  - "dynamic" → node scoring adjusts according to tag type, text/link density, etc.
- threshold (float, default ~0.48): The base or “anchor” cutoff.
Algorithmic Factors:
- Text density – Encourages blocks that have a higher ratio of text to overall content.
- Link density – Penalizes sections that are mostly links.
- Tag importance – e.g., an <article> or <p> might be more important than a <div>.
- Structural context – If a node is deeply nested or in a suspected sidebar, it might be deprioritized.
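The factors above can be folded into a single per-node score. The following is a minimal sketch of that idea and not Crawl4AI’s actual formula: the `Node` dataclass, the `TAG_WEIGHTS` table, the 0.5/0.3/0.2 weighting, and the `score_node` and `prune` helpers are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical tag weights; the real library may weight tags differently.
TAG_WEIGHTS = {"article": 1.0, "p": 0.8, "div": 0.5, "nav": 0.1}


@dataclass
class Node:
    tag: str            # HTML tag name
    text_len: int       # characters of visible text in the node
    link_text_len: int  # characters of that text that sit inside links
    html_len: int       # characters of the node's full HTML


def score_node(node: Node) -> float:
    """Combine text density, link density, and tag importance into [0, 1]."""
    text_density = node.text_len / node.html_len if node.html_len else 0.0
    link_density = node.link_text_len / node.text_len if node.text_len else 1.0
    tag_weight = TAG_WEIGHTS.get(node.tag, 0.5)
    # Illustrative weighting: reward dense text, penalize link-heavy nodes.
    return 0.5 * text_density + 0.3 * (1.0 - link_density) + 0.2 * tag_weight


def prune(nodes: list[Node], threshold: float = 0.48) -> list[Node]:
    """Keep nodes whose score exceeds the threshold (a 'fixed' cutoff)."""
    return [n for n in nodes if score_node(n) > threshold]


article = Node(tag="article", text_len=800, link_text_len=40, html_len=1000)
sidebar = Node(tag="nav", text_len=120, link_text_len=110, html_len=900)
kept = prune([article, sidebar])
print([n.tag for n in kept])
```

In this toy setup the text-heavy article node scores well above the cutoff, while the link-heavy navigation node falls far below it and is dropped.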
3. BM25ContentFilter
BM25 is a classical text ranking algorithm often used in search engines. If you have a user query or rely on page metadata to derive a query, BM25 can identify which text chunks best match that query.
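To make the ranking idea concrete, here is a small standalone sketch of the standard Okapi BM25 formula applied to a handful of text chunks. It is not the code Crawl4AI runs: the `bm25_scores` helper, the whitespace tokenization, and the `k1`/`b` values are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, chunks: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each chunk against the query with the Okapi BM25 formula."""
    docs = [chunk.lower().split() for chunk in chunks]
    avg_len = sum(len(d) for d in docs) / len(docs)
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((len(docs) - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1 and is normalized by chunk length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


chunks = [
    "Tips for startup fundraising and pitching investors",
    "Subscribe to our newsletter for weekly updates",
    "How one startup closed a seed round",
]
scores = bm25_scores("startup fundraising tips", chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
print(chunks[best])
```

The chunk that contains all three query terms ranks first, the chunk with only one term ranks lower, and the newsletter chunk scores zero because it shares no terms with the query.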
3.1 Usage Example
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # 1) A BM25 filter with a user query
    bm25_filter = BM25ContentFilter(
        user_query="startup fundraising tips",
        # Adjust for stricter or looser results
        bm25_threshold=1.2
    )

    # 2) Insert into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)

    # 3) Pass to crawler config
    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )
        if result.success:
            print("Fit Markdown (BM25 query-based):")
            print(result.markdown.fit_markdown)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
3.2 Parameters
- user_query (str, optional): E.g. "machine learning". If blank, the filter tries to glean a query from page metadata.
- bm25_threshold (float, default 1.0):
  - Higher → fewer chunks but more relevant.
  - Lower → more inclusive.
In more advanced scenarios, you might see parameters like language, case_sensitive, or priority_tags to refine how text is tokenized or weighted.
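The effect of raising or lowering the threshold can be seen with a few made-up scores. This sketch assumes a simple "keep if score is at least the threshold" rule; the `scored_chunks` data and the `keep_relevant` helper are hypothetical, and the library may compare or normalize scores differently.

```python
# Hypothetical (chunk, score) pairs; real scores come from the filter itself.
scored_chunks = [
    ("Seed round checklist", 2.4),
    ("Pitch deck advice", 1.3),
    ("Founder interview", 0.9),
    ("Cookie notice", 0.1),
]


def keep_relevant(scored, threshold):
    """Return the chunks whose score meets the threshold."""
    return [text for text, score in scored if score >= threshold]


for threshold in (0.5, 1.0, 2.0):
    print(threshold, keep_relevant(scored_chunks, threshold))
```

A stricter threshold leaves only the strongest matches, which suits narrow queries; a looser one keeps borderline chunks at the cost of more noise.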
4. Accessing the “Fit” Output
After the crawl, your “fit” content is found in result.markdown.fit_markdown.
If the content filter is BM25, fit_markdown is limited to the segments that scored as relevant to the query. If it’s Pruning, the text is typically well-cleaned but not necessarily matched to a query.
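When downstream code should always receive some text, a small fallback helper can be handy. The sketch below assumes that fit_markdown may be empty or missing (for example when a strict filter removes everything); `best_markdown` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for a real crawl result.

```python
from types import SimpleNamespace


def best_markdown(result) -> str:
    """Return fit_markdown when it has content, otherwise raw_markdown."""
    fit = getattr(result.markdown, "fit_markdown", None)
    return fit if fit and fit.strip() else result.markdown.raw_markdown


# Stand-in objects that mimic only the two attributes used above.
filtered = SimpleNamespace(
    markdown=SimpleNamespace(raw_markdown="# Page\nnav\nbody", fit_markdown="body")
)
emptied = SimpleNamespace(
    markdown=SimpleNamespace(raw_markdown="# Page\nnav\nbody", fit_markdown="")
)

print(best_markdown(filtered))
print(best_markdown(emptied))
```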
5. Code Patterns Recap
5.1 Pruning
prune_filter = PruningContentFilter(
    threshold=0.5,
    threshold_type="fixed",
    min_word_threshold=10
)
md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
5.2 BM25
bm25_filter = BM25ContentFilter(
    user_query="health benefits fruit",
    bm25_threshold=1.2
)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
6. Combining with “word_count_threshold” & Exclusions
Remember that you can also set crawler-level options alongside the content filter:
config = CrawlerRunConfig(
    word_count_threshold=10,
    excluded_tags=["nav", "footer", "header"],
    exclude_external_links=True,
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)
    )
)
Thus, multi-level filtering occurs:
1. The crawler’s excluded_tags are removed from the HTML first.
2. The content filter (Pruning, BM25, or custom) prunes or ranks the remaining text blocks.
3. The final “fit” content is generated in result.markdown.fit_markdown.
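The layering can be mimicked with the standard library alone. This is a rough sketch of the two stages (tag exclusion, then block filtering) and not how Crawl4AI implements them: `BlockCollector`, `fit_blocks`, and the word-count rule used as a stand-in content filter are all assumptions.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect text blocks, skipping anything inside excluded tags."""

    def __init__(self, excluded_tags):
        super().__init__()
        self.excluded = set(excluded_tags)
        self.skip_depth = 0  # > 0 while inside an excluded tag
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.excluded and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.blocks.append(text)


def fit_blocks(html, excluded_tags, min_words):
    # Stage 1: drop excluded tags while parsing.
    collector = BlockCollector(excluded_tags)
    collector.feed(html)
    collector.close()
    # Stage 2: a stand-in content filter that keeps blocks with enough words.
    return [b for b in collector.blocks if len(b.split()) >= min_words]


html = (
    "<nav>Home About Contact and many more navigation links here</nav>"
    "<p>Fit markdown keeps the dense, relevant part of a page.</p>"
    "<p>Read more</p>"
    "<footer>Copyright notice that is long enough to pass a word filter</footer>"
)
kept_blocks = fit_blocks(html, ["nav", "footer", "header"], min_words=5)
print(kept_blocks)
```

Note that the footer text would have passed the word-count rule on its own; it is removed only because the tag exclusion runs first, which is why the two layers complement each other.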
7. Custom Filters
If you need a different approach (like a specialized ML model or site-specific heuristics), you can create a new class inheriting from RelevantContentFilter and implement filter_content(html). Then inject it into your markdown generator:
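from bs4 import BeautifulSoup
from crawl4ai.content_filter_strategy import RelevantContentFilter

class MyCustomFilter(RelevantContentFilter):
    def filter_content(self, html, min_word_threshold=None):
        # Placeholder rule (an assumption): keep <p> blocks with enough words.
        min_words = min_word_threshold or 5
        soup = BeautifulSoup(html, "html.parser")
        paragraphs = soup.find_all("p")
        return [str(p) for p in paragraphs if len(p.get_text().split()) >= min_words]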
Steps:
1. Subclass RelevantContentFilter.
2. Implement filter_content(...).
3. Use it in your DefaultMarkdownGenerator(content_filter=MyCustomFilter(...)).
8. Final Thoughts
Fit Markdown is especially useful for:
- Summaries: Quickly get the important text from a cluttered page.
- Search: Combine with BM25 to produce content relevant to a query.
- AI Pipelines: Filter out boilerplate so LLM-based extraction or summarization runs on denser text.
Key Points:
- PruningContentFilter: Great if you just want the “meatiest” text without a user query.
- BM25ContentFilter: Well suited to query-based extraction or searching.
- Combine with excluded_tags, exclude_external_links, and word_count_threshold to refine your final “fit” text.
- Fit markdown ends up in result.markdown.fit_markdown.
With these tools, you can zero in on the text that truly matters, ignoring spammy or boilerplate content, and produce a concise, relevant “fit markdown” for your AI or data pipelines. Happy pruning and searching!
Last Updated: 2025-01-01