
Python scraper for JavaScript?

Can anyone point me to a good Python screen-scraping library that can handle JavaScript (ideally one with good documentation and tutorials)? I'd like to know what options are out there, but above all which is the easiest to learn and gets results fastest. Does anyone have experience with this? I've heard a bit about SpiderMonkey, but maybe there are better ones out there?

Specifically, I use BeautifulSoup and Mechanize to get to this point, but I need a way to open the JavaScript popup, submit data, and download/parse the results shown in the popup.

<a href="javascript:openFindItem(12510109)" onclick="s_objectID=&quot;javascript:openFindItem(12510109)_1&quot;;return this.s_oc?this.s_oc(e):true">Find Item</a>

I'd like to implement this with Google App Engine and Django. Thanks!

What I usually do in these cases is automate an actual browser and grab the processed HTML from there.

Edit:

Here's an example of automating Internet Explorer to navigate to a URL and grab the title and location after the page loads.

from win32com.client import Dispatch

import time
from ctypes import byref, windll
from ctypes.wintypes import MSG  # correctly sized on 32- and 64-bit Windows
import win32con

READYSTATE_COMPLETE = 4

def wait_until_ready(ie):
    """Pump pending window messages until IE reports the page as loaded."""
    msg = MSG()

    while True:
        # Dispatch queued messages so the browser's COM events get processed.
        while windll.user32.PeekMessageW(byref(msg), None, 0, 0, win32con.PM_REMOVE) != 0:
            windll.user32.TranslateMessage(byref(msg))
            windll.user32.DispatchMessageW(byref(msg))

        if ie.ReadyState == READYSTATE_COMPLETE:
            break

        # Brief pause so the loop does not spin at full CPU while waiting.
        time.sleep(0.05)


ie = Dispatch("InternetExplorer.Application")

ie.Visible = True

ie.Navigate("http://google.com/")

wait_until_ready(ie)

print("title:", ie.Document.Title)
print("location:", ie.Document.location)
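
As a follow-up sketch (not part of the original answer): once the browser has rendered the page, the resulting HTML can be handed to a parser to find links like the `openFindItem` one in the question. The example below uses only the standard library's `html.parser`. The names `parse_js_call`, `JsLinkParser`, and `find_js_calls` are hypothetical helpers introduced for illustration, and the regex assumes a single function call with simple comma-separated arguments.

```python
import re
from html.parser import HTMLParser

# Matches hrefs such as "javascript:openFindItem(12510109)".
JS_CALL = re.compile(r"^javascript:\s*(\w+)\((.*)\)\s*;?\s*$", re.IGNORECASE)


def parse_js_call(href):
    """Return (function_name, [args]) for a javascript: href, or None."""
    match = JS_CALL.match(href.strip())
    if match is None:
        return None
    name, raw_args = match.groups()
    # Naive split: assumes arguments do not themselves contain commas.
    args = [a.strip().strip("'\"") for a in raw_args.split(",") if a.strip()]
    return name, args


class JsLinkParser(HTMLParser):
    """Collects (function_name, args) for every <a href="javascript:...">."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        call = parse_js_call(href)
        if call is not None:
            self.calls.append(call)


def find_js_calls(html):
    """Parse an HTML string and return all javascript: link calls found."""
    parser = JsLinkParser()
    parser.feed(html)
    return parser.calls
```

For the link shown in the question, `find_js_calls` returns `[('openFindItem', ['12510109'])]`. This only recovers the function name and its argument; working out what the popup actually requests would still require reading the page's own `openFindItem` definition, which this sketch does not attempt.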

I use the Python bindings to WebKit to render basic JavaScript, and Chickenfoot for more advanced interactions. See this webkit example for more info.

You can also use a "programmatic web browser" named Spynner. I found this to be the best solution, and it is relatively easy to use.


 