
Data extraction by Python from a dynamic javascript page

I have to extract the data from the table on the following website:

http://www.mcxindia.com/SitePages/indexhistory.aspx

When I click on GO, a table is appended to the page dynamically. I want to export those data from the page to a csv file (which I know how to handle), but the source code does not contain any data points.

I have tried looking into the javascript code: when I inspect the elements after the table is generated, I can see the data points, but they are not in the page source. I am using mechanize in Python.

I think it is because the page is getting loaded dynamically. What should I do/use?

mechanize doesn't/can't evaluate javascript. The easiest way I've seen to evaluate javascript is by using Selenium, which will open a browser on your computer and communicate with Python.

I answered a similar question here.

I agree with Matthew Wesly's comment. We can get the dynamic page using Selenium, or add-ons like iMacros. They capture the dynamic page's response based on our recording, and also have JS scripting capability.

I think, though, that for easy extraction we can go with normal content-fetch logic using the urllib2 and urllib packages.

First get the page's 'viewstate' parameter, i.e. get all the hidden element information from the home page and pass the form information along just like the JS script does.

Also pass the Content-Type key value exactly. Here the response is of the form "text/plain; charset=utf-8".
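A minimal sketch of building such a POST request with an explicit Content-Type header, using modern Python's urllib.request (the original answers used urllib2; the viewstate value is a placeholder, and the field names are taken from the form used later in this thread):

```python
import urllib.parse
import urllib.request

# Form fields. The hidden ASP.NET fields (__VIEWSTATE etc.) would be
# scraped from the page's source first; "..." is just a placeholder.
form = {
    "__VIEWSTATE": "...",
    "mTbFromDate": "08/01/2013",
    "mTbToDate": "08/08/2013",
}
data = urllib.parse.urlencode(form).encode("ascii")

# Build the POST request with an explicit Content-Type header,
# mirroring what the page's own JS submission sends.
req = urllib.request.Request(
    "http://www.mcxindia.com/SitePages/indexhistory.aspx",
    data=data,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
# urllib.request.urlopen(req) would perform the actual POST.
```

Passing `data=` is what makes urllib issue a POST rather than a GET.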

To avoid using javascript-aware transports you need to:

  1. Install a web debugger into your browser.
  2. Go to that page. Press F12 to open the debugger. Reload the page.
  3. Analyze the contents of the 'network' tab. Usually ajax pages download data as html fragments or as json. Just look into the response tab of each request made after pressing 'GO' and you will find familiar data.
  4. Now you can create a simple urllib/urllib2 downloader for that url.
  5. Parse that data and convert it to csv.
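The hidden-element step mentioned above can be sketched with the standard library's html.parser (the parser class and the page fragment here are illustrative; in practice you would feed it the real page source):

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and "name" in a:
            self.fields[a["name"]] = a.get("value", "")

# Illustrative ASP.NET-style page fragment.
page = '''
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="mTbFromDate" />
</form>
'''

parser = HiddenFieldParser()
parser.feed(page)
print(parser.fields)  # hidden fields to echo back in the POST
```

The collected dictionary is then merged with the visible form fields and url-encoded into the POST body.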

http://www.mcxindia.com/SitePages/indexhistory.aspx sends a POST request with the search parameters on each 'GO' and receives an html fragment that you need to parse and convert into csv.

So if you simulate that POST, you don't need a new browser window at all.

This worked!!! 这工作!!!

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt, which would otherwise block mechanize
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

url = 'http://www.mcxindia.com/SitePages/indexhistory.aspx'
br.open(url)

# Select the ASP.NET form and make its read-only fields writable.
br.select_form(nr=0)
br.set_all_readonly(False)
br.form['mTbFromDate'] = '08/01/2013'
br.form['mTbToDate'] = '08/08/2013'

# Clicking 'GO' is just submitting the form via the mBtnGo button.
response = br.submit(name='mBtnGo').read()
print response
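Once you have the returned html, the table rows can be flattened into csv. A minimal sketch using modern Python's standard library (the parser class and the sample fragment are illustrative, not the site's real markup):

```python
import csv
import io
from html.parser import HTMLParser

class TableToRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, one list per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# Illustrative fragment of the kind of table the POST returns.
fragment = ("<table><tr><th>Date</th><th>Value</th></tr>"
            "<tr><td>08/01/2013</td><td>3000.55</td></tr></table>")

p = TableToRows()
p.feed(fragment)

out = io.StringIO()
csv.writer(out).writerows(p.rows)
print(out.getvalue())
```

Writing to a real file instead of `io.StringIO` gives you the csv on disk.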

The best thing I personally do when dealing with dynamic web pages is to use PyQt's webkit to mimic a browser, pass the URL to that browser, and finally get the HTML after all the Javascript has been rendered.

Example Code- 示例代码-

import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage
import bs4 as bs


class Client(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # in PyQt4 this is exec_(), since exec is a keyword

    def on_page_load(self):
        # Quit the event loop once the page (and its Javascript) has loaded.
        self.app.quit()


url = ''  # your URL
client_response = Client(url)
source = client_response.mainFrame().toHtml()
soup = bs.BeautifulSoup(source, "lxml")
# BeautifulSoup stuff

