Markdown Generation Basics

One of Crawl4AI's core features is generating clean, structured markdown from web pages. Originally built to solve the problem of extracting only the "actual" content and discarding boilerplate or noise, Crawl4AI's markdown system remains one of its biggest draws for AI workflows.
In this tutorial, you'll learn:

- How to configure the DefaultMarkdownGenerator
- How content filters (BM25 or Pruning) help you refine markdown and discard junk
- The difference between raw markdown (result.markdown) and filtered markdown (fit_markdown)
Prerequisites

- You've completed or read AsyncWebCrawler Basics to understand how to run a simple crawl.
- You know how to configure CrawlerRunConfig.
1. Quick Example

Here's a minimal code snippet that uses the DefaultMarkdownGenerator with no additional filtering:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator()
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=config)
if result.success:
print("Raw Markdown Output:\n")
print(result.markdown) # The unfiltered markdown from the page
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
What's happening?

- CrawlerRunConfig(markdown_generator=DefaultMarkdownGenerator()) instructs Crawl4AI to convert the final HTML into markdown at the end of each crawl.
- The resulting markdown is accessible via result.markdown.
2. How Markdown Generation Works

2.1 HTML-to-Text Conversion (Forked & Modified)

Under the hood, DefaultMarkdownGenerator uses a specialized HTML-to-text approach that:
- Preserves headings, code blocks, bullet points, etc.
- Removes extraneous tags (scripts, styles) that don't add meaningful content.
- Can optionally generate references for links or skip them altogether.

A set of options (passed as a dict) allows you to customize precisely how HTML converts to markdown. These map to standard html2text-like configuration plus Crawl4AI's own enhancements (e.g., ignoring internal links, preserving certain tags verbatim, or adjusting line widths).
2.2 Link Citations & References

By default, the generator can convert <a href="..."> elements into [text][1] citations, then place the actual links at the bottom of the document. This is handy for research workflows that demand references in a structured manner.
2.3 Optional Content Filters

Before or after the HTML-to-Markdown step, you can apply a content filter (like BM25 or Pruning) to reduce noise and produce fit_markdown, a heavily pruned version focusing on the page's main text. We'll cover these filters shortly.
3. Configuring the Default Markdown Generator

You can tweak the output by passing an options dict to DefaultMarkdownGenerator. For example:
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
# Example: ignore all links, don't escape HTML, and wrap text at 80 characters
md_generator = DefaultMarkdownGenerator(
options={
"ignore_links": True,
"escape_html": False,
"body_width": 80
}
)
config = CrawlerRunConfig(
markdown_generator=md_generator
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com/docs", config=config)
if result.success:
print("Markdown:\n", result.markdown[:500]) # Just a snippet
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
import asyncio
asyncio.run(main())
Some commonly used options:

- ignore_links (bool): Whether to remove all hyperlinks in the final markdown.
- ignore_images (bool): Remove all ![image]() references.
- escape_html (bool): Turn HTML entities into text (default is often True).
- body_width (int): Wrap text at N characters. 0 or None means no wrapping.
- skip_internal_links (bool): If True, omit #localAnchors or internal links referencing the same page.
- include_sup_sub (bool): Attempt to handle <sup>/<sub> in a more readable way.
4. Selecting the HTML Source for Markdown Generation

The content_source parameter allows you to control which HTML content is used as input for markdown generation. This gives you flexibility in how the HTML is processed before conversion to markdown.
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
# Option 1: Use the raw HTML directly from the webpage (before any processing)
raw_md_generator = DefaultMarkdownGenerator(
content_source="raw_html",
options={"ignore_links": True}
)
# Option 2: Use the cleaned HTML (after scraping strategy processing - default)
cleaned_md_generator = DefaultMarkdownGenerator(
content_source="cleaned_html", # This is the default
options={"ignore_links": True}
)
# Option 3: Use preprocessed HTML optimized for schema extraction
fit_md_generator = DefaultMarkdownGenerator(
content_source="fit_html",
options={"ignore_links": True}
)
# Use one of the generators in your crawler config
config = CrawlerRunConfig(
markdown_generator=raw_md_generator # Try each of the generators
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=config)
if result.success:
print("Markdown:\n", result.markdown.raw_markdown[:500])
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
import asyncio
asyncio.run(main())
HTML Source Options

- "cleaned_html" (default): Uses the HTML after it has been processed by the scraping strategy. This HTML is typically cleaner and more focused on content, with some boilerplate removed.
- "raw_html": Uses the original HTML directly from the webpage, before any cleaning or processing. This preserves more of the original content, but may include navigation bars, ads, footers, and other elements that are not relevant to the main content.
- "fit_html": Uses HTML preprocessed for schema extraction. This HTML is optimized for structured data extraction and may have certain elements simplified or removed.
When to Use Each Option

- Use "cleaned_html" (default) for most cases where you want a balance of content preservation and noise removal.
- Use "raw_html" when you need to preserve all original content, or when the cleaning process is removing content you actually want to keep.
- Use "fit_html" when working with structured data or when you need HTML that's optimized for schema extraction.
5. Content Filters

Content filters selectively remove or rank sections of text before turning them into Markdown. This is especially helpful if your page has ads, nav bars, or other clutter you don't want.
5.1 BM25ContentFilter

If you have a search query, BM25 is a good choice:
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai import CrawlerRunConfig
bm25_filter = BM25ContentFilter(
user_query="machine learning",
bm25_threshold=1.2,
language="english"
)
md_generator = DefaultMarkdownGenerator(
content_filter=bm25_filter,
options={"ignore_links": True}
)
config = CrawlerRunConfig(markdown_generator=md_generator)
- user_query: The term you want to focus on. BM25 tries to keep only content blocks relevant to that query.
- bm25_threshold: Raise it to keep fewer blocks; lower it to keep more.
- use_stemming (default True): Whether to apply stemming to the query and content.
- language (str): Language for stemming (default: 'english').

No query provided? BM25 tries to glean a context from page metadata, or you can simply treat it as a scorched-earth approach that discards text with a low generic score. Realistically, you want to supply a query for best results.
5.2 PruningContentFilter

If you don't have a specific query, or if you just want a robust "junk remover," use PruningContentFilter. It analyzes text density, link density, HTML structure, and known patterns (like "nav" or "footer") to systematically prune extraneous or repetitive sections.
from crawl4ai.content_filter_strategy import PruningContentFilter
prune_filter = PruningContentFilter(
threshold=0.5,
threshold_type="fixed", # or "dynamic"
min_word_threshold=50
)
- threshold: Score boundary. Blocks below this score get removed.
- threshold_type: "fixed" is a straight comparison (score >= threshold keeps the block); "dynamic" lets the filter adjust the threshold in a data-driven manner.
- min_word_threshold: Discard blocks under N words as likely too short or unhelpful.
When to Use PruningContentFilter

- You want a broad cleanup without a user query.
- The page has lots of repeated sidebars, footers, or disclaimers that hamper text extraction.
5.3 LLMContentFilter

For intelligent content filtering and high-quality markdown generation, you can use the LLMContentFilter. This filter leverages LLMs to generate relevant markdown while preserving the original content's meaning and structure:
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter
async def main():
# Initialize LLM filter with specific instruction
filter = LLMContentFilter(
llm_config = LLMConfig(provider="openai/gpt-4o",api_token="your-api-token"), #or use environment variable
instruction="""
Focus on extracting the core educational content.
Include:
- Key concepts and explanations
- Important code examples
- Essential technical details
Exclude:
- Navigation elements
- Sidebars
- Footer content
Format the output as clean markdown with proper code blocks and headers.
""",
chunk_token_threshold=4096, # Adjust based on your needs
verbose=True
)
md_generator = DefaultMarkdownGenerator(
content_filter=filter,
options={"ignore_links": True}
)
config = CrawlerRunConfig(
markdown_generator=md_generator,
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=config)
print(result.markdown.fit_markdown) # Filtered markdown content
Key Features:

- Intelligent Filtering: Uses LLMs to understand and extract relevant content while maintaining context
- Customizable Instructions: Tailor the filtering process with specific instructions
- Chunk Processing: Handles large documents by processing them in chunks (controlled by chunk_token_threshold)
- Parallel Processing: For better performance, use a smaller chunk_token_threshold (e.g., 2048 or 4096) to enable parallel processing of content chunks
Two Common Use Cases:

1. Exact Content Preservation:

filter = LLMContentFilter(
    instruction="""
    Extract the main educational content while preserving its original wording and substance completely.
    1. Maintain the exact language and terminology
    2. Keep all technical explanations and examples intact
    3. Preserve the original flow and structure
    4. Remove only clearly irrelevant elements like navigation menus and ads
    """,
    chunk_token_threshold=4096
)

2. Focused Content Extraction:

filter = LLMContentFilter(
    instruction="""
    Focus on extracting specific types of content:
    - Technical documentation
    - Code examples
    - API references
    Reformat the content into clear, well-structured markdown
    """,
    chunk_token_threshold=4096
)

Performance Tip: Set a smaller chunk_token_threshold (e.g., 2048 or 4096) to enable parallel processing of content chunks. The default value is infinity, which processes the entire content as a single chunk.
6. Using Fit Markdown

When a content filter is active, the library produces two forms of markdown inside result.markdown:

1. raw_markdown: The full unfiltered markdown.
2. fit_markdown: A "fit" version where the filter has removed or trimmed noisy segments.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
async def main():
config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.6),
options={"ignore_links": True}
)
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://news.example.com/tech", config=config)
if result.success:
print("Raw markdown:\n", result.markdown)
# If a filter is used, we also have .fit_markdown:
md_object = result.markdown # or your equivalent
print("Filtered markdown:\n", md_object.fit_markdown)
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
7. The MarkdownGenerationResult Object

If your library stores detailed markdown output in an object like MarkdownGenerationResult, you'll see fields such as:
- raw_markdown: The direct HTML-to-markdown transformation (no filtering).
- markdown_with_citations: A version that moves links to reference-style footnotes.
- references_markdown: A separate string or section containing the gathered references.
- fit_markdown: The filtered markdown if you used a content filter.
- fit_html: The corresponding HTML snippet used to generate fit_markdown (helpful for debugging or advanced usage).
Example:
md_obj = result.markdown # your library’s naming may vary
print("RAW:\n", md_obj.raw_markdown)
print("CITED:\n", md_obj.markdown_with_citations)
print("REFERENCES:\n", md_obj.references_markdown)
print("FIT:\n", md_obj.fit_markdown)
Why Does This Matter?

- You can supply raw_markdown to an LLM if you want the entire text.
- Or feed fit_markdown into a vector database to reduce token usage.
- references_markdown can help you keep track of link provenance.
8. Combining Filters (BM25 + Pruning) in Two Passes

You might want to prune out noisy boilerplate first (with PruningContentFilter), and then rank what's left against a user query (with BM25ContentFilter). You don't have to crawl the page twice. Instead:

1. First pass: Apply PruningContentFilter directly to the raw HTML from result.html (the crawler's downloaded HTML).
2. Second pass: Take the pruned HTML (or text) from step 1 and feed it into BM25ContentFilter, focusing on a user query.
Two-Pass Example
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from bs4 import BeautifulSoup
async def main():
# 1. Crawl with minimal or no markdown generator, just get raw HTML
config = CrawlerRunConfig(
# If you only want raw HTML, you can skip passing a markdown_generator
# or provide one but focus on .html in this example
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com/tech-article", config=config)
if not result.success or not result.html:
print("Crawl failed or no HTML content.")
return
raw_html = result.html
# 2. First pass: PruningContentFilter on raw HTML
pruning_filter = PruningContentFilter(threshold=0.5, min_word_threshold=50)
# filter_content returns a list of "text chunks" or cleaned HTML sections
pruned_chunks = pruning_filter.filter_content(raw_html)
# This list is basically pruned content blocks, presumably in HTML or text form
# For demonstration, let's combine these chunks back into a single HTML-like string
# or you could do further processing. It's up to your pipeline design.
pruned_html = "\n".join(pruned_chunks)
# 3. Second pass: BM25ContentFilter with a user query
bm25_filter = BM25ContentFilter(
user_query="machine learning",
bm25_threshold=1.2,
language="english"
)
# returns a list of text chunks
bm25_chunks = bm25_filter.filter_content(pruned_html)
if not bm25_chunks:
print("Nothing matched the BM25 query after pruning.")
return
# 4. Combine or display final results
final_text = "\n---\n".join(bm25_chunks)
print("==== PRUNED OUTPUT (first pass) ====")
print(pruned_html[:500], "... (truncated)") # preview
print("\n==== BM25 OUTPUT (second pass) ====")
print(final_text[:500], "... (truncated)")
if __name__ == "__main__":
asyncio.run(main())
What's Happening?

1. Raw HTML: We crawl once and store the raw HTML in result.html.
2. PruningContentFilter: Takes HTML plus optional parameters. It extracts blocks of text or partial HTML, removing headings/sections deemed "noise." It returns a list of text chunks.
3. Combine or Transform: We join these pruned chunks back into a single HTML-like string. (Alternatively, you could store them in a list for further logic, whatever suits your pipeline.)
4. BM25ContentFilter: We feed the pruned string into BM25ContentFilter with a user query. This second pass further narrows the content to chunks relevant to "machine learning."

No Re-Crawling: We used raw_html from the first pass, so there's no need to run arun() again and no second network request.
Tips & Variations

- Plain Text vs. HTML: If your pruned output is mostly text, BM25 can still handle it; just keep in mind it expects a valid string input. If you supply partial HTML (like "<p>some text</p>"), it will parse it as HTML.
- Chaining in a Single Pipeline: If your code supports it, you can chain multiple filters automatically. Otherwise, manual two-pass filtering (as shown) is straightforward.
- Adjust Thresholds: If you see too much or too little text after step one, tweak threshold=0.5 or min_word_threshold=50. Similarly, bm25_threshold=1.2 can be raised or lowered for more or fewer chunks in step two.
One-Pass Combination?

If your codebase or pipeline design allows applying multiple filters in one pass, you could do so. But often it's simpler, and more transparent, to run them sequentially and analyze each step's result.

Bottom Line: By manually chaining your filtering logic in two passes, you get powerful incremental control over the final content. First remove "global" clutter with Pruning, then refine further with BM25-based query relevance, without incurring a second network crawl.
9. Common Pitfalls & Tips

1. No Markdown Output?
- Make sure the crawler actually retrieved HTML. If the site is heavily JS-based, you may need to enable dynamic rendering or wait for elements to load.
- Check if your content filter is too aggressive. Lower thresholds or disable the filter to see if content reappears.
2. Performance Considerations
- Very large pages with multiple filters can be slower. Consider cache_mode to avoid re-downloading.
- If your final use case is LLM ingestion, consider summarizing further or chunking big texts.
3. Take Advantage of fit_markdown
- Great for RAG pipelines, semantic search, or any scenario where extraneous boilerplate is unwanted.
- Still verify the textual quality; some sites have crucial data in footers or sidebars.
4. Adjusting html2text Options
- If you see lots of raw HTML slipping into the text, turn on escape_html.
- If code blocks look messy, experiment with mark_code or handle_code_in_pre.
10. Summary & Next Steps

In this Markdown Generation Basics tutorial, you learned to:

- Configure the DefaultMarkdownGenerator with HTML-to-text options.
- Select different HTML sources using the content_source parameter.
- Use BM25ContentFilter for query-specific extraction or PruningContentFilter for general noise removal.
- Distinguish between raw and filtered markdown (fit_markdown).
- Leverage the MarkdownGenerationResult object to handle different forms of output (citations, references, etc.).

Now you can produce high-quality Markdown from any website, focusing on exactly the content you need: an essential step for powering AI models, summarization pipelines, or knowledge-base queries.
Last Updated: 2025-01-01