
Web-scraping a password protected website using Ghost.py

I'm trying to get the HTML content of a password-protected site using Ghost.py.

The web server I have to access serves the following HTML (trimmed to the important parts):

URL: http://192.168.1.60/PAGE.htm

<html>
<head>
<script language="JavaScript">
    function DoHash()
    {
      var psw = document.getElementById('psw_id');
      var hpsw = document.getElementById('hpsw_id');
      var nonce = hpsw.value;
      hpsw.value = MD5(nonce.concat(psw.value));
      psw.value = '';
      return true;
    }
    </script>
</head>
<body>
<form action="PAGE.HTM" name="" method="post" onsubmit="DoHash();">
Access code <input id="psw_id" type="password" maxlength="15" size="20" name="q" value="">
<br>
<input type="submit" value="" name="q" class="w_bok">
<br>
<input id="hpsw_id" type="hidden" name="pA" value="180864D635AD2347">
</form>
</body>
</html>

The value of "#hpsw_id" changes every time you load the page.
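Because the value is regenerated on each load, any scripted login has to read it from a freshly fetched copy of the page before submitting. A minimal sketch of that step using only the standard library follows; `HiddenFieldParser`, `extract_nonce` and the `sample` string are hypothetical names introduced for illustration, and the sample markup is copied from the form above.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the value of the <input> whose id matches field_id."""

    def __init__(self, field_id):
        super().__init__()
        self.field_id = field_id
        self.value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("id") == self.field_id:
            self.value = attrs.get("value")


def extract_nonce(html, field_id="hpsw_id"):
    """Return the hidden field's value, or None if the field is absent."""
    parser = HiddenFieldParser(field_id)
    parser.feed(html)
    return parser.value


# Sample markup taken from the login form shown above
sample = '<form><input id="hpsw_id" type="hidden" name="pA" value="180864D635AD2347"></form>'
print(extract_nonce(sample))  # 180864D635AD2347
```

Returning None when the field is missing makes it easy to detect that the server sent something other than the login form.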

In a normal browser, once you type the correct password and press Enter or click the "submit" button, you land on the same page, but now with the real contents.

URL: http://192.168.1.60/PAGE.htm

<html>
<head>
<!-- javascript is gone -->
</head>
<body>
Welcome to PAGE.htm content
</body>
</html>

First I tried mechanize, but that failed because I need JavaScript. So now I'm trying to solve it using Ghost.py.

My code so far:

import ghost
g = ghost.Ghost()
with g.start(wait_timeout=20) as session:
    page, extra_resources = session.open("http://192.168.1.60/PAGE.htm")
    if page.http_status == 200:
        print("Good!")
        session.evaluate("document.getElementById('psw_id').value='MySecretPassword';")
        session.evaluate("document.getElementsByClassName('w_bok')[0].click();", expect_loading=True)
        print session.content

This code does not load the contents correctly. In the console I get:

Traceback (most recent call last):
  File "", line 8, in
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 181, in wrapper
    timeout=kwargs.pop('timeout', None))
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 1196, in wait_for_page_loaded
    'Unable to load requested page', timeout)
  File "/usr/local/lib/python2.7/dist-packages/ghost/ghost.py", line 1174, in wait_for
    raise TimeoutError(timeout_message)
ghost.ghost.TimeoutError: Unable to load requested page

Two questions:

1) How can I successfully log in to the password-protected site and get the real content of PAGE.htm?

2) Is this the best way to go, or am I missing something that would make this work more efficiently?

I'm using Ubuntu MATE.

This is not the answer I was looking for, just a workaround to make it work (in case someone else has a similar issue in the future).

To skip the JavaScript part (which was preventing me from using Python's requests), I decided to compute the expected hash in Python (rather than in the browser) and send it as the normal web form would.

The JavaScript basically concatenates the hidden hpsw_id value and the password, and computes an MD5 hash of the result.
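The hashing step on its own can be sketched as a small function. This is only an illustration: `login_hash` is a hypothetical helper name, and it assumes both the hidden value and the password are plain ASCII, since how the page's MD5() routine treats other characters is not shown.

```python
from hashlib import md5


def login_hash(nonce, password):
    """Mirror DoHash(): MD5 of the hidden value followed by the password.

    Assumes ASCII input; the page's MD5() may treat other characters differently.
    """
    return md5((nonce + password).encode("ascii")).hexdigest()


# The hidden value must come from a fresh page load; this one is from the sample form
print(login_hash("180864D635AD2347", "MySecretPassword"))
```

The order matters: the hidden value comes first and the password second, matching nonce.concat(psw.value) in the page's script.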

The Python now looks like this:

import requests
from hashlib import md5
from re import search

url = "http://192.168.1.60/PAGE.htm"
with requests.Session() as s:
    # Fetch the login page and read the per-load value of the hidden "pA" field
    r = s.get(url)
    hpsw_id = search(r'name="pA" value="([A-Z0-9]*)"', r.text)
    hpsw_id = hpsw_id.group(1)
    # Hash the hidden value concatenated with the password, as DoHash() does
    m = md5()
    m.update((hpsw_id + 'MySecretPassword').encode('ascii'))
    pA = m.hexdigest()
    # Post the same fields the browser form sends (q twice, both empty, plus pA)
    r = s.post(url, data=[('q', ''), ('q', ''), ('pA', pA)])
    print(r.content)

Note: q, q and pA are the fields the form sends (q=&q=&pA=f08b97e5e3f472fdde4280a9aa408aaa) when I log in normally using a web browser.
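To see why the fields are passed as a list of pairs rather than a dict, here is a small standard-library sketch of the form encoding. A dict cannot hold the key q twice, while a list of tuples keeps both entries in order. The names `fields` and `body` are illustrative only, and the hash value is the one quoted in the note; requests should produce an equivalent body for the same list, though that is an assumption worth checking against the traffic your browser sends.

```python
from urllib.parse import urlencode

# Same field order the browser sends: password field (emptied by DoHash),
# submit button (empty value), then the hashed hidden field
fields = [("q", ""), ("q", ""), ("pA", "f08b97e5e3f472fdde4280a9aa408aaa")]

body = urlencode(fields)
print(body)  # q=&q=&pA=f08b97e5e3f472fdde4280a9aa408aaa
```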

However, if someone knows the answer to my original question, I would very much appreciate it if you posted it here.
