Python Crawling Pastebin (JavaScript rendered webpages)
I am having trouble crawling JavaScript-rendered pages.

I am using the python-qt4 module, following this tutorial: https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/

In the tutorial, everything works perfectly with the example page: http://pycoders.com/archive

But I am trying this out on Pastebin, with this URL:

http://pastebin.com/search?q=ssh

What I want is to extract all the result links so I can follow them, and also to be able to walk through the result pages (I have not decided what I am going to use yet, maybe Scrapy, but I want to look at other options first).

The problem is that I am not able to extract the links. This is my code:
import sys

from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

# Take this class for granted. Just use the result of rendering.
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://pastebin.com/search?q=ssh'
r = Render(url)
result = r.frame.toHtml()
formatted_result = str(result.toAscii())
tree = html.fromstring(formatted_result)
archive_links = tree.xpath('//a[@class="gs-title"]/@data-ctoring')
for i in archive_links:
    print i
The result is that I don't get anything.
Ideally, you should look into using the Pastebin API - there is a Python wrapper for it.

An alternative approach would involve browser automation via selenium. Working code that prints the search result links:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://pastebin.com/search?q=ssh")

# wait for the search results to be loaded
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".gsc-result-info")))

# get all search result links
for link in driver.find_elements_by_css_selector(".gsc-results .gsc-result a.gs-title"):
    print(link.get_attribute("href"))
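As a side note, the attribute name in the question's XPath looks like a typo: the search results on that page come from Google Custom Search, and in my experience the result links carry a `data-ctorig` attribute (the original, un-redirected URL), not `data-ctoring`. Here is a minimal sketch of the extraction against a hand-written snippet of the result markup (the snippet and the attribute name are assumptions about what the rendered page contains, so verify them against the actual HTML you get back):

```python
from lxml import html

# Hand-written snippet mimicking Google Custom Search result markup
# (assumption: real results use data-ctorig for the original URL).
snippet = """
<div class="gsc-results">
  <div class="gsc-result">
    <a class="gs-title" href="http://www.google.com/url?q=redirect1"
       data-ctorig="http://pastebin.com/abc123">Paste one</a>
  </div>
  <div class="gsc-result">
    <a class="gs-title" href="http://www.google.com/url?q=redirect2"
       data-ctorig="http://pastebin.com/def456">Paste two</a>
  </div>
</div>
"""

tree = html.fromstring(snippet)
# Same XPath shape as in the question, with the corrected attribute name.
links = tree.xpath('//a[@class="gs-title"]/@data-ctorig')
for link in links:
    print(link)
```

If the rendered HTML from QtWebKit does contain those attributes, swapping the attribute name in the original script should be the only change needed.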