
Using JSoup to login before scraping a website

I'm trying to use Jsoup to log in to a website and then scrape its contents. I've already read a few similar questions and figured out that I need to do a POST and pass all the input parameters. I've done just that, but it still doesn't work for some reason. Any help is greatly appreciated.

HTML form:

<form method="post" action="https://login.yahoo.com/config/login?" name="login_form" id="login_form" onsubmit="return hash2(this)">
<input type="hidden" name="sessionId" id="sessionId" value="IyLZEAs5n9RP">
<input type="hidden" name=".tries" value="1">
<input type="hidden" name=".src" value="spt">
<input type="hidden" name=".md5" value="">
<input type="hidden" name=".hash" value="">
<input type="hidden" name=".js" value="">
<input type="hidden" name=".last" value="">
<input type="hidden" name="promo" value="">
<input type="hidden" name=".intl" value="us">
<input type="hidden" name=".lang" value="en-US">
<input type="hidden" name=".bypass" value="">
<input type="hidden" name=".partner" value="">
<input type="hidden" name=".u" value="516l00ha52emg">
<input type="hidden" name=".v" value="0">
<input type="hidden" name=".challenge" value="HupRQ9x1ptIRHP1H8P9eYBbAlofE4YsoSQ--">
<input type="hidden" name=".yplus" value="">
<input type="hidden" name=".emailCode" value="">
<input type="hidden" name="pkg" value="">
<input type="hidden" name="stepid" value="">
<input type="hidden" name=".ev" value="">
<input type="hidden" name="hasMsgr" value="0">
<input type="hidden" name=".chkP" value="Y">
<input type="hidden" name=".done" value="http://hockey.fantasysports.yahoo.com/hockey/27381/startingrosters?.scrumb=0">
<input type="hidden" name=".pd" value="spt_ver=0&c=&ivt=&sg=">
<input type="hidden" name=".ws" id=".ws" value="0">
<input type="hidden" name=".cp" id=".cp" value="0">     
<input type="hidden" name="nr" value="0">
    <input type="hidden" name="pad" id="pad" value="6">
<input type="hidden" name="aad" id="aad" value="6">

<div id='inputs'>
                <input name='login' id='username' type="text" maxlength='96' tabindex='1' aria-required="true" placeholder="Yahoo ID" autocorrect="off" value="">
                <input name='passwd' id='passwd' type='password' maxlength='64' tabindex='2' aria-required="true" placeholder="Password" autocorrect="off" value="">
                <div id="captchaDiv"></div>
        </div>

My code:

Connection.Response loginForm = Jsoup.connect(url).method(Connection.Method.GET)
                                .execute();

Document doc = Jsoup.connect(url).data("sessionId", "IyLZEAs5n9RP").data(".tries", "1").data(".src", ".spt").data(".md5", "").data(".hash", "").data(".js", "").data(".last", "").data("promo", "").data(".intl", "us").data(".lang", "en-US").data(".bypass", "").data(".partner", "").data(".u", "515l00ha52emg").data(".v", "0").data(".challenge", "HupRQ9x1ptIRHP1H8P9eYBbAlofE4YsoSQ--").data(".yplus", "").data(".emailCode", "").data("pkg", "").data("stepid", "").data(".ev", "").data("hasMsgr", "0").data(".chkP", "Y").data(".done", "http://hockey.fantasysports.yahoo.com/hockey/27381/startingrosters?.scrumb=0").data(".pd", "spt_ver=0&c=&ivt=&sg=").data(".ws", "0").data(".cp", "0").data("nr", "0").data("pad", "6").data("aad", "6")
        .data("login", "MYEMAIL").data("passwd", "MYPASSWORD")
        .cookies(loginForm.cookies()).post();

System.out.println(doc.title());

When I run this, it still prints the title of the login page. Sorry for the bad one-line formatting, but there are a lot of parameters and their values aren't pertinent to the question. I put the final few parameters on new lines.

Looking at the parameters, I can see that values like "sessionId" will differ from session to session, which could be a problem. Would that mean that I'd have to first save the value of that particular sessionId and pass that value to my Jsoup.connect?

Hardcoding the form parameters is not enough! You must read the parameter values from the first page and then pass them in the next POST, along with the user/password and the cookies.
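
As a rough sketch of that approach (not a tested Yahoo login), the code below fetches the login page, reads every named input from the form with id "login_form", overwrites "login" and "passwd", and posts the result with the cookies from the first response. The class name LoginSketch and the helpers readFormFields and login are made up for illustration. Note that Jsoup does not run JavaScript, so anything the form's onsubmit handler (hash2) computes is not reproduced here.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class LoginSketch {

    // Hypothetical helper: collects name/value pairs from every named
    // <input> inside the form with the given id.
    static Map<String, String> readFormFields(Document page, String formId) {
        Map<String, String> fields = new LinkedHashMap<>();
        Element form = page.getElementById(formId);
        if (form == null) {
            return fields; // form not found; the caller decides what to do
        }
        for (Element input : form.select("input[name]")) {
            fields.put(input.attr("name"), input.val());
        }
        return fields;
    }

    // Hypothetical helper: performs the GET, then the POST.
    static Document login(String loginPageUrl, String user, String password)
            throws IOException {
        // Step 1: GET the login page to obtain fresh cookies and hidden values.
        Connection.Response loginPage = Jsoup.connect(loginPageUrl)
                .method(Connection.Method.GET)
                .execute();
        Document page = loginPage.parse();

        // Step 2: use the values from this session instead of hardcoded ones.
        Map<String, String> fields = readFormFields(page, "login_form");
        fields.put("login", user);
        fields.put("passwd", password);

        // Step 3: POST to the form's action URL, sending the same cookies.
        Element form = page.getElementById("login_form");
        String action = (form == null) ? "" : form.absUrl("action");
        if (action.isEmpty()) {
            action = loginPageUrl; // assumption: fall back to the page URL
        }
        return Jsoup.connect(action)
                .data(fields)
                .cookies(loginPage.cookies())
                .post();
    }
}
```

Calling LoginSketch.login(url, "MYEMAIL", "MYPASSWORD") would then return the page that the server sends back after the POST; whether that is the logged-in page still depends on the site accepting the request.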

Hope it helps.
