How to scrape text from a JavaScript-generated page using Python?

Question

I'm looking for a way, on Linux, to write a script that scrapes the text from a page generated by JavaScript (specifically Etherpad, e.g. http://www.board.net ). Ideally I'd like to use an existing tool, but I haven't found a suitable one: lynx doesn't support JavaScript, and Selenium runs in a browser. Suggestions welcome.

If there's nothing suitable (which would be surprising for such a simple need), maybe I can write something myself in Python. What useful Python classes exist for something like this?

Answer 1

One option is to stick with Selenium, but use a headless PhantomJS.

See also:

Headless Selenium Testing with Python and PhantomJS

Example (using the Firefox webdriver):

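from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://board.net/p/ThisIsBob%27sBoard/timeslider'
driver = webdriver.Firefox()  # opens a Firefox window
driver.get(url)

# find_element(By.ID, ...) replaces find_element_by_id, which Selenium 4 removed
element = driver.find_element(By.ID, 'padcontent')
print(element.text)
driver.quit()  # close the browser when done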

prints:

Here is some text I'd like to scrape
 I wonder how to go about it?
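
The example above opens a visible Firefox window, while the question asks for something that does not. Below is a minimal sketch of a headless variant. It uses Firefox's own headless mode rather than PhantomJS, because newer Selenium releases no longer ship a PhantomJS driver. It assumes Selenium 4 or later with Firefox and geckodriver installed. The function names timeslider_url and scrape_pad_text are made up for illustration, and the 10-second timeout is an arbitrary choice. Waiting for the element to exist does not guarantee that the pad text has finished loading, so a slow page may need a stricter wait condition.

```python
from urllib.parse import quote


def timeslider_url(pad_name, host="http://board.net"):
    """Build the timeslider URL for a pad, percent-encoding the pad name."""
    return "{}/p/{}/timeslider".format(host, quote(pad_name, safe=""))


def scrape_pad_text(url, element_id="padcontent", timeout=10):
    """Load url in headless Firefox and return the text of one element.

    Hypothetical helper: the name and default arguments are assumptions.
    """
    # Imported here so the URL helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without opening a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Wait until the page's JavaScript has added the element to the DOM.
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, element_id))
        )
        return element.text
    finally:
        driver.quit()  # always shut the browser down, even on errors
```

Calling print(scrape_pad_text(timeslider_url("ThisIsBob'sBoard"))) should then fetch the same pad as the example above, without a window appearing.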

Answer 1: accepted, 2014-04-17 15:19:42
