
Parsing HTML from a JavaScript-rendered URL with Python

I would like to extract the market information from the following URL and all of its subsequent pages:

https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1

I have successfully parsed the data I want from the first page using some code from the following URL:

https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages

I have also been able to parse out the URL for the next page, to feed into a loop in order to grab data from the next page. The problem is that it crashes before the next page loads, for a reason I don't fully understand.

I have a hunch that the class I have borrowed from 'impythonist' may be causing the problem. I don't know enough object-oriented programming to work out the problem. Here is my code, much of which is borrowed from the URL above:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html
import re
from bs4 import BeautifulSoup

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  



base_url='https://uk.reuters.com'
complete_next_page='https://uk.reuters.com/investing/markets/index/.FTSE?sortBy=&sortDir=&pn=1'

# LOOP TO RENDER PAGES AND GRAB DATA
while complete_next_page != '':
    print('NEXT PAGE: ', complete_next_page, '\n')
    r = Render(complete_next_page)  # USE THE CLASS TO RENDER JAVASCRIPT FROM PAGE
    result = r.frame.toHtml()       # ERROR IS THROWN HERE ON 2nd PAGE

    # PARSE THE HTML
    soup = BeautifulSoup(result, 'lxml')
    row_data = soup.find('div', attrs={'class': 'column1 gridPanel grid8'})
    print(len(row_data))

    # PARSE ALL ROW DATA
    stripe_rows = row_data.findAll('tr', attrs={'class': 'stripe'})
    non_stripe_rows = row_data.findAll('tr', attrs={'class': ''})
    print(len(stripe_rows))
    print(len(non_stripe_rows))

    # PARSE SPECIFIC ROW DATA FROM INDEX COMPONENTS
    # non_stripe_rows: from 4 to 18 (inclusive) contain data
    # stripe_rows: from 2 to 16 (inclusive) contain data
    i = 2
    while i < len(stripe_rows):
        print('CURRENT LINE IS: ', str(i))
        print(stripe_rows[i])
        print('###############################################')
        print(non_stripe_rows[i + 2])
        print('\n')
        i += 1

    # GETS LINK TO NEXT PAGE
    next_page = str(soup.find('div', attrs={'class': 'pageNavigation'}).find('li', attrs={'class': 'next'}).find('a')['href'])
    complete_next_page = base_url + next_page

I have annotated the bits of code that I have written and understand, but I don't really know enough about what's going on in the 'Render' class to diagnose the error. Unless it's something else?

Here is the error:

result = r.frame.toHtml()
AttributeError: 'Render' object has no attribute 'frame'

I don't need to keep the information in the class once I have parsed it out, so I was thinking perhaps it could be cleared or reset somehow and then updated to hold the new URL information for pages 2 to n, but I have no idea how to do this.

Alternatively, if anyone knows another way to grab this specific data from this page and the following ones, that would be equally helpful.

Many thanks in advance.

How about using Selenium and PhantomJS instead of PyQt?
You can easily get Selenium by executing "pip install selenium". If you use a Mac you can get PhantomJS by executing "brew install phantomjs". If your PC is Windows, use choco instead of brew; on Ubuntu, use apt-get.

from selenium import webdriver
from bs4 import BeautifulSoup

base_url = "https://uk.reuters.com"
first_page = "/business/markets/index/.FTSE?sortBy=&sortDir=&pn=1"

browser = webdriver.PhantomJS()

# PARSE THE HTML
browser.get(base_url + first_page)
soup = BeautifulSoup(browser.page_source, "lxml")
row_data = soup.find('div', attrs={'class':'column1 gridPanel grid8'})

# PARSE ALL ROW DATA
stripe_rows = row_data.findAll('tr', attrs={'class':'stripe'})
non_stripe_rows = row_data.findAll('tr', attrs={'class':''})
print(len(stripe_rows), len(non_stripe_rows))

# GO TO THE NEXT PAGE
next_button = soup.find("li", attrs={"class":"next"})
while next_button:
  next_page = next_button.find("a")["href"]
  browser.get(base_url + next_page)
  soup = BeautifulSoup(browser.page_source, "lxml")
  row_data = soup.find('div', attrs={'class':'column1 gridPanel grid8'})
  stripe_rows = row_data.findAll('tr', attrs={'class':'stripe'})
  non_stripe_rows = row_data.findAll('tr', attrs={'class':''})
  print(len(stripe_rows), len(non_stripe_rows))
  next_button = soup.find("li", attrs={"class":"next"})

# DONT FORGET THIS!!
browser.quit()

I know the code above is not efficient (too slow, I feel), but I think it will bring you the results you desire. In addition, if the web page you want to scrape does not use JavaScript, then even PhantomJS and Selenium are unnecessary: you can use the requests module. However, since I wanted to show the contrast with PyQt, I used PhantomJS and Selenium in this answer.
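The BeautifulSoup extraction is the same whichever way the HTML is fetched (browser.page_source above, or requests.get(url).text for a static page). As a self-contained illustration, here is that row-splitting logic run against a small made-up fragment that reuses the class names from the question (the tickers and values are invented):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the Reuters table markup;
# the rows and values are invented for illustration.
sample_html = """
<div class="column1 gridPanel grid8">
  <table>
    <tr class="stripe"><td>AAA.L</td><td>101.5</td></tr>
    <tr class=""><td>BBB.L</td><td>202.0</td></tr>
    <tr class="stripe"><td>CCC.L</td><td>303.5</td></tr>
    <tr class=""><td>DDD.L</td><td>404.0</td></tr>
  </table>
</div>
"""

# html.parser avoids the extra lxml dependency; "lxml" works too if installed
soup = BeautifulSoup(sample_html, "html.parser")
row_data = soup.find("div", attrs={"class": "column1 gridPanel grid8"})
stripe_rows = row_data.find_all("tr", attrs={"class": "stripe"})

# Pull the cell text out of each striped row
cells = [[td.get_text() for td in row.find_all("td")] for row in stripe_rows]
print(cells)  # [['AAA.L', '101.5'], ['CCC.L', '303.5']]
```

Swapping sample_html for the real page source (however it is obtained) is the only change needed to run this against the live site.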
