Page Interaction
Crawl4AI provides powerful features for interacting with dynamic webpages, handling JavaScript execution, waiting for conditions, and managing multi-step flows. By combining `js_code`, `wait_for`, and certain `CrawlerRunConfig` parameters, you can:
- Click "Load More" buttons
- Fill forms and submit them
- Wait for elements or data to appear
- Reuse sessions across multiple steps
Below is a quick overview of how to do it.
1. JavaScript Execution
Basic Execution
`js_code` in `CrawlerRunConfig` accepts either a single JS string or a list of JS snippets.

Example: We'll scroll to the bottom of the page, then optionally click a "Load More" button.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Single JS command
    config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);"
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",  # Example site
            config=config
        )
        print("Crawled length:", len(result.cleaned_html))

    # Multiple commands
    js_commands = [
        "window.scrollTo(0, document.body.scrollHeight);",
        # 'More' link on Hacker News
        "document.querySelector('a.morelink')?.click();",
    ]
    config = CrawlerRunConfig(js_code=js_commands)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",  # Another pass
            config=config
        )
        print("After scroll+click, length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```
Relevant `CrawlerRunConfig` params:

- `js_code`: A string or list of strings with JavaScript to run after the page loads.
- `js_only`: If set to `True` on subsequent calls, indicates we're continuing an existing session without a new full navigation.
- `session_id`: If you want to keep the same page across multiple calls, specify an ID.
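A minimal sketch of that two-call pattern, using an arbitrary session ID (section 3.1 below walks through a complete run):

```python
# Sketch: two configs sharing one browser tab via session_id.
# The first call navigates normally; the second only runs JS in the
# already-open page (js_only=True), with no fresh navigation.
from crawl4ai import CrawlerRunConfig

first_call = CrawlerRunConfig(session_id="demo_session")
follow_up = CrawlerRunConfig(
    session_id="demo_session",  # same tab as the first call
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    js_only=True,               # continue the session, don't re-navigate
)
```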
2. Wait Conditions
2.1 CSS-Based Waiting
Sometimes, you just want to wait for a specific element to appear. For example:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # Wait for at least 30 items on Hacker News
        wait_for="css:.athing:nth-child(30)"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )
        print("We have at least 30 items loaded!")
        # Rough check
        print("Total items in HTML:", result.cleaned_html.count("athing"))

if __name__ == "__main__":
    asyncio.run(main())
```
Key param:

- `wait_for="css:..."`: Tells the crawler to wait until that CSS selector is present.
2.2 JavaScript-Based Waiting
For more complex conditions (e.g., waiting for content length to exceed a threshold), prefix the expression with `js:`:
```python
wait_condition = """() => {
    const items = document.querySelectorAll('.athing');
    return items.length > 50;  // Wait for at least 51 items
}"""

config = CrawlerRunConfig(wait_for=f"js:{wait_condition}")
```
Behind the Scenes: Crawl4AI keeps polling the JS function until it returns `true` or a timeout occurs.
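Putting it together, here is a minimal end-to-end sketch; the threshold is illustrative (the Hacker News front page loads 30 items, so waiting for more than 10 passes quickly):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Poll until the page has more than 10 story rows
    wait_condition = """() => {
        const items = document.querySelectorAll('.athing');
        return items.length > 10;
    }"""
    config = CrawlerRunConfig(wait_for=f"js:{wait_condition}")

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )
        print("Items in HTML:", result.cleaned_html.count("athing"))

if __name__ == "__main__":
    asyncio.run(main())
```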
3. Handling Dynamic Content
Many modern sites require multiple steps: scrolling, clicking "Load More," or updating via JavaScript. Below are typical patterns.
3.1 Load More Example (Hacker News "More" Link)
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Step 1: Load initial Hacker News page
    config = CrawlerRunConfig(
        wait_for="css:.athing:nth-child(30)",  # Wait for 30 items
        session_id="hn_session"  # Open a persistent session so step 2 can continue it
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )
        print("Initial items loaded.")

        # Step 2: Let's scroll and click the "More" link
        load_more_js = [
            "window.scrollTo(0, document.body.scrollHeight);",
            # The "More" link at page bottom
            "document.querySelector('a.morelink')?.click();"
        ]

        next_page_conf = CrawlerRunConfig(
            js_code=load_more_js,
            wait_for="""js:() => {
                return document.querySelectorAll('.athing').length > 30;
            }""",
            # Mark that we do not re-navigate, but run JS in the same session:
            js_only=True,
            session_id="hn_session"
        )

        # Re-use the same crawler session
        result2 = await crawler.arun(
            url="https://news.ycombinator.com",  # same URL but continuing session
            config=next_page_conf
        )
        total_items = result2.cleaned_html.count("athing")
        print("Items after load-more:", total_items)

if __name__ == "__main__":
    asyncio.run(main())
```
Key params:

- `session_id="hn_session"`: Keep the same page across multiple calls to `arun()`.
- `js_only=True`: We're not performing a full reload, just applying JS in the existing page.
- `wait_for` with `js:`: Wait for the item count to grow beyond 30.
3.2 Form Interaction
If the site has a search or login form, you can fill fields and submit them with `js_code`. For instance, if GitHub had a local search form:
```python
js_form_interaction = """
document.querySelector('#your-search').value = 'TypeScript commits';
document.querySelector('form').submit();
"""

config = CrawlerRunConfig(
    js_code=js_form_interaction,
    wait_for="css:.commit"
)

result = await crawler.arun(url="https://github.com/search", config=config)
```
In reality: Replace the IDs or classes with the real site's form selectors.
4. Timing Control

1. `page_timeout` (ms): Overall page load or script execution time limit.
2. `delay_before_return_html` (seconds): Wait an extra moment before capturing the final HTML.
3. `mean_delay` & `max_range`: If you call `arun_many()` with multiple URLs, these add a random pause between each request.
Example:
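A short sketch combining these knobs; the values are illustrative and the URLs are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        page_timeout=60000,            # allow up to 60 s for load + script execution
        delay_before_return_html=2.5,  # settle for 2.5 s before capturing final HTML
        mean_delay=1.0,                # arun_many(): ~1 s average pause between URLs...
        max_range=0.5                  # ...plus up to 0.5 s of random jitter
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://example.com/page1", "https://example.com/page2"],
            config=config
        )
        for result in results:
            print(result.url, "->", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```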
5. Multi-Step Interaction Example
Below is a simplified script that does multiple "Load More" clicks on GitHub's TypeScript commits page. It re-uses the same session to accumulate new commits each time. The code includes the relevant `CrawlerRunConfig` parameters you'd rely on.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def multi_page_commits():
    browser_cfg = BrowserConfig(
        headless=False,  # Visible for demonstration
        verbose=True
    )
    session_id = "github_ts_commits"

    base_wait = """js:() => {
        const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
        return commits.length > 0;
    }"""

    # Step 1: Load initial commits
    config1 = CrawlerRunConfig(
        wait_for=base_wait,
        session_id=session_id,
        cache_mode=CacheMode.BYPASS,
        # Not using js_only yet since it's our first load
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://github.com/microsoft/TypeScript/commits/main",
            config=config1
        )
        print("Initial commits loaded. Count:", result.cleaned_html.count("commit"))

        # Step 2: For subsequent pages, we run JS to click 'Next Page' if it exists
        js_next_page = """
        const selector = 'a[data-testid="pagination-next-button"]';
        const button = document.querySelector(selector);
        if (button) button.click();
        """

        # Wait until new commits appear
        wait_for_more = """js:() => {
            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
            if (!window.firstCommit && commits.length > 0) {
                window.firstCommit = commits[0].textContent;
                return false;
            }
            // If the top commit changes, we have new commits
            const topNow = commits[0]?.textContent.trim();
            return topNow && topNow !== window.firstCommit;
        }"""

        for page in range(2):  # let's do 2 more "Next" pages
            config_next = CrawlerRunConfig(
                session_id=session_id,
                js_code=js_next_page,
                wait_for=wait_for_more,
                js_only=True,  # We're continuing from the open tab
                cache_mode=CacheMode.BYPASS
            )
            result2 = await crawler.arun(
                url="https://github.com/microsoft/TypeScript/commits/main",
                config=config_next
            )
            print(f"Page {page+2} commits count:", result2.cleaned_html.count("commit"))

        # Optionally kill session
        await crawler.crawler_strategy.kill_session(session_id)

async def main():
    await multi_page_commits()

if __name__ == "__main__":
    asyncio.run(main())
```
Key Points:

- `session_id`: Keep the same page open.
- `js_code` + `wait_for` + `js_only=True`: We do partial refreshes, waiting for new commits to appear.
- `cache_mode=CacheMode.BYPASS` ensures we always see fresh data each step.
6. Combine Interaction with Extraction
Once dynamic content is loaded, you can attach an `extraction_strategy` (like `JsonCssExtractionStrategy` or `LLMExtractionStrategy`). For example:
```python
from crawl4ai import JsonCssExtractionStrategy

schema = {
    "name": "Commits",
    "baseSelector": "li.Box-sc-g0xbh4-0",
    "fields": [
        {"name": "title", "selector": "h4.markdown-title", "type": "text"}
    ]
}

config = CrawlerRunConfig(
    session_id="ts_commits_session",
    js_code=js_next_page,
    wait_for=wait_for_more,
    extraction_strategy=JsonCssExtractionStrategy(schema)
)
```
When done, check `result.extracted_content` for the JSON.
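Since the strategy emits a JSON string, you can parse it with the standard library:

```python
import json

if result.extracted_content:
    commits = json.loads(result.extracted_content)
    print(f"Extracted {len(commits)} commit entries")
```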
7. Relevant CrawlerRunConfig Parameters

Below are the key interaction-related parameters in `CrawlerRunConfig`. For a full list, see Configuration Parameters.
- `js_code`: JavaScript to run after the initial load.
- `js_only`: If `True`, no new page navigation; only JS runs in the existing session.
- `wait_for`: CSS (`"css:..."`) or JS (`"js:..."`) expression to wait for.
- `session_id`: Reuse the same page across calls.
- `cache_mode`: Whether to read/write from the cache or bypass it.
- `remove_overlay_elements`: Remove certain popups automatically.
- `simulate_user`, `override_navigator`, `magic`: Anti-bot or "human-like" interactions.
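As a sketch, several of these can be combined in one config; the `my_session` ID, the `.results` selector, and the values below are illustrative:

```python
from crawl4ai import CrawlerRunConfig, CacheMode

config = CrawlerRunConfig(
    session_id="my_session",       # keep one tab alive across calls
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    js_only=True,                  # continue the session, no re-navigation
    wait_for="css:.results",       # hypothetical selector for loaded content
    cache_mode=CacheMode.BYPASS,   # always fetch fresh data
    remove_overlay_elements=True,  # auto-dismiss popups / modal overlays
    simulate_user=True,            # mimic human-like interaction patterns
    magic=True                     # enable bundled anti-bot handling
)
```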
8. Conclusion
Crawl4AI's page interaction features let you:
1. Execute JavaScript for scrolling, clicks, or form filling.
2. Wait for CSS or custom JS conditions before capturing data.
3. Handle multi-step flows (like "Load More") with partial reloads or persistent sessions.
4. Combine with structured extraction for dynamic sites.
With these tools, you can scrape modern, interactive webpages confidently. For advanced hooking, user simulation, or in-depth config, check the API reference or related advanced docs. Happy scripting!
9. Virtual Scrolling
For sites that use virtual scrolling (where content is replaced rather than appended as you scroll, like Twitter or Instagram), Crawl4AI provides a dedicated `VirtualScrollConfig`:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig

async def crawl_twitter_timeline():
    # Configure virtual scroll for Twitter-like feeds
    virtual_config = VirtualScrollConfig(
        container_selector="[data-testid='primaryColumn']",  # Twitter's main column
        scroll_count=30,               # Scroll 30 times
        scroll_by="container_height",  # Scroll by container height each time
        wait_after_scroll=1.0          # Wait 1 second after each scroll
    )

    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://twitter.com/search?q=AI",
            config=config
        )
        # result.html now contains ALL tweets from the virtual scroll
```
Virtual Scroll vs JavaScript Scrolling

| Feature | Virtual Scroll | JS Code Scrolling |
|---|---|---|
| Use Case | Content replaced during scroll | Content appended or simple scroll |
| Configuration | `VirtualScrollConfig` object | `js_code` with scroll commands |
| Automatic Merging | Yes - merges all unique content | No - captures final state only |
| Best For | Twitter, Instagram, virtual tables | Traditional pages, "load more" buttons |
For detailed examples and configuration options, see the Virtual Scroll documentation.