
Jsoup login to scrape game data

The question is: can I use Jsoup to post login data for a form that is controlled by JavaScript? Here's the information so far.

Login URL for the site:

http://www.cybernations.net/login.asp

(They do have a no-bots policy, but I emailed the admin and have permission to auto-login for downloading game data files.)

URL where the files are stored:

http://www.cybernations.net/stats_downloads.asp

The line of code where I use Jsoup to parse the HTML of the login page and list its scripts:

Elements scriptTags = doc.getElementsByTag("script");

The output of looping through the list of Elements:

    <!--
function FrontPage_Form1_Validator(theForm)
{

  if (theForm.Username.value == "")
  {
    alert("Please enter a value for the \"Username\" field.");
    theForm.Username.focus();
    return (false);
  }

  if (theForm.Username.value.length > 40)
  {
    alert("Please enter at most 40 characters in the \"Username\" field.");
    theForm.Username.focus();
    return (false);
  }

  if (theForm.Validate_Password.value == "")
  {
    alert("Please enter a value for the \"Password\" field.");
    theForm.Validate_Password.focus();
    return (false);
  }

  if (theForm.Validate_Password.value.length < 1)
  {
    alert("Please enter at least 1 characters in the \"Password\" field.");
    theForm.Validate_Password.focus();
    return (false);
  }

  if (theForm.Validate_Password.value.length > 50)
  {
    alert("Please enter at most 50 characters in the \"Password\" field.");
    theForm.Validate_Password.focus();
    return (false);
  }
  return (true);
}
//-->

EDIT 1: I edited the connection code. The current login code looks like this, and it still returns the login page.

Connection.Response loginForm = Jsoup.connect(loginURL)
        .method(Connection.Method.GET)
        .execute();

Document document = Jsoup.connect(loginURL)
        .data("Login", "Login")
        .data("Username", user)
        .data("Validate_Password", pass)
        .cookies(loginForm.cookies())
        .post();

I feel like I'm missing something really simple here. Should I tell the connection returned by connect() to follow redirects?

EDIT 2: Thanks for all your help. I think I'm going to switch to Apache's HTTP client, as it will (hopefully) give me greater control over the connection. Thank you all!

That function you posted is only there to validate the input, and you can ignore it, since the server probably rejects usernames and passwords that don't meet those criteria anyway.

If you want to send the login information the way the webpage does, you just need to POST to "/login.asp". Look at the form in their HTML:

<form action="/login.asp" method="POST" name="FrontPage_Form1" .....

You'll have to handle the login yourself. You may need to read the cookies from the response headers, remember them somewhere, and then send them back with each subsequent request you make to the server (exactly as a web browser does). Have a look at this for more information about that.
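To make the "remember them and send them back" idea concrete, here is a minimal sketch using only the JDK. The class name SimpleCookieStore and its methods are made up for illustration; it keeps only the name=value part of each Set-Cookie header and deliberately ignores attributes such as path, domain, and expiry, which a real browser would honor.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: remembers cookies from responses and replays them. */
final class SimpleCookieStore {
    private final Map<String, String> cookies = new LinkedHashMap<>();

    /** Records the name=value pair from each Set-Cookie header value. */
    void remember(List<String> setCookieValues) {
        for (String headerValue : setCookieValues) {
            // Attributes such as "path=/" or "HttpOnly" follow the first ';'.
            String pair = headerValue.split(";", 2)[0].trim();
            int eq = pair.indexOf('=');
            if (eq > 0) {
                // A later cookie with the same name overwrites the earlier value.
                cookies.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
    }

    /** Builds the value of the Cookie request header, e.g. "a=1; b=2". */
    String cookieHeader() {
        StringJoiner joiner = new StringJoiner("; ");
        cookies.forEach((name, value) -> joiner.add(name + "=" + value));
        return joiner.toString();
    }
}
```

You would call remember(...) with the Set-Cookie values of every response and put cookieHeader() into the Cookie header of every later request. With Jsoup you get roughly the same effect by passing the map from Connection.Response.cookies() to cookies(...) on the next connection.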

Also, you may need to consider how to handle captchas. Their site seems to force you to pass a captcha after you visit the page twice, which would block your program from logging in.

Edit:

You can look at this answer for further information on how to automate the login. To answer your question about saving the cookies: it doesn't really matter where you save them, as long as you can access them when making additional requests to the server. The answer I just linked has code to access the cookies returned from the server when you log in (modified to use your variables):

Connection.Response res = Jsoup.connect("http://www.cybernations.net/login.asp")
    .data("Username", "myUsername", "Validate_Password", "myPassword")
    .method(Method.POST)
    .execute();

Document doc = res.parse();
String sessionId = res.cookie("ASPSESSIONIDAAACSTQB");

That same answer shows you how to use jsoup to send the cookie in subsequent requests:

Document doc2 = Jsoup.connect("http://www.cybernations.net/stats_downloads.asp")
    .cookie("ASPSESSIONIDAAACSTQB", sessionId)
    .get();

Now, exactly which cookies you need to save is something you'll have to figure out. Try using the developer tools in Google Chrome. Log into the site and look at the names of the cookies the site uses to store your session (there are a few). Then try to emulate this with the code above.

I should mention that I have not tested this code against this site. That will take time and patience, but that's part of the job.

The form HTML element is the most important part. You must check the form's method and the names of its parameters.

<form action="/login.asp" method="POST" name="FrontPage_Form1" onsubmit="return FrontPage_Form1_Validator(this)" language="JavaScript" >
...
   <input value="" name="Username" id="Username" type="text" class="displayFieldIE" size="30" maxlength="40">
...
   <input value="" name="Validate_Password" id="Validate_Password" type="password" class="displayFieldIE" size="30" maxlength="50">
...
</form>

So you must POST the data to login.asp with the parameters Username and Validate_Password.
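For illustration, this is roughly what the body of that POST looks like when built by hand with the JDK. LoginFormBody is a hypothetical helper name, and UTF-8 is an assumption on my part; I have not checked which charset the site expects. Any other inputs hidden behind the "..." in the form above would have to be added to the map as well.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: builds an application/x-www-form-urlencoded body. */
final class LoginFormBody {

    /** Percent-encodes every name and value and joins the pairs with '&'. */
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Uses the two parameter names taken from the form's input elements. */
    static String forLogin(String user, String pass) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Username", user);
        fields.put("Validate_Password", pass);
        return encode(fields);
    }
}
```

Jsoup does this encoding for you when you call data(...), so the sketch is mainly useful for seeing what goes over the wire or for use with another HTTP client.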

The JavaScript you posted is only there to validate user input. There's no need to deal with it.

I don't see any problem with your approach. Maybe the site is checking where the request comes from. Try setting the referrer like this:

String loginURL = "http://www.cybernations.net/login.asp";
Connection.Response loginForm = Jsoup.connect(loginURL)
        .method(Connection.Method.GET).execute();

Document document = Jsoup.connect(loginURL)
            .data("Login", "Login")
            .data("Username", user)
            .data("Validate_Password", pass)
            .header("Host", "www.cybernations.net")
            .header("Origin", "http://www.cybernations.net")
            .referrer(loginURL)
            .cookies(loginForm.cookies())
            .post();

After the first failed attempt, the site shows a captcha, so be sure to pass correct credentials. ;)
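Since a failed attempt is costly, it may be worth rejecting obviously invalid input before sending anything. A small sketch that mirrors the limits in the FrontPage_Form1_Validator script from the question (non-empty fields, at most 40 and 50 characters); the class and method names are made up, and it cannot tell you whether the credentials are actually correct.

```java
import java.util.Optional;

/** Hypothetical pre-check that mirrors the page's JavaScript validator. */
final class LoginInputCheck {

    /** Returns a description of the first problem, or empty if the input passes. */
    static Optional<String> firstProblem(String username, String password) {
        if (username.isEmpty()) {
            return Optional.of("Username is empty");
        }
        if (username.length() > 40) {
            return Optional.of("Username is longer than 40 characters");
        }
        if (password.isEmpty()) {
            return Optional.of("Password is empty");
        }
        if (password.length() > 50) {
            return Optional.of("Password is longer than 50 characters");
        }
        return Optional.empty();
    }
}
```

Only send the login request when firstProblem(user, pass) comes back empty.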

If that doesn't work, try connecting via the Apache HTTP client and passing the response to jsoup for parsing.
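As a rough idea of that "separate HTTP client, then jsoup" split, here is a sketch that uses the JDK's built-in java.net.http client (Java 11+) rather than Apache's, simply because it needs no extra dependency. LoginRequests and its methods are hypothetical names, the body is assumed to be already form-encoded, and none of this has been tested against the site.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;

/** Hypothetical helper: prepares a cookie-aware client and the login request. */
final class LoginRequests {
    static final String LOGIN_URL = "http://www.cybernations.net/login.asp";

    /** A client whose CookieManager stores and resends cookies automatically. */
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    /** Builds (but does not send) the login POST for an already-encoded form body. */
    static HttpRequest loginPost(String encodedBody) {
        return HttpRequest.newBuilder(URI.create(LOGIN_URL))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodedBody))
                .build();
    }
}
```

Sending it would be client.send(request, HttpResponse.BodyHandlers.ofString()), and the returned body string could then go to Jsoup.parse(html, baseUri). Because the same client instance keeps the cookies, a later GET for stats_downloads.asp through that client should carry the session along.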
