Using Jsoup to log in before scraping a website
I'm trying to use Jsoup to log in to a website and then scrape the contents. I've already read a few similar questions and figured that I need to do a POST and pass all the input parameters. I've done just that, but it still doesn't work for some reason. Any help is greatly appreciated.
HTML form:
<form method="post" action="https://login.yahoo.com/config/login?" name="login_form" id="login_form" onsubmit="return hash2(this)">
<input type="hidden" name="sessionId" id="sessionId" value="IyLZEAs5n9RP">
<input type="hidden" name=".tries" value="1">
<input type="hidden" name=".src" value="spt">
<input type="hidden" name=".md5" value="">
<input type="hidden" name=".hash" value="">
<input type="hidden" name=".js" value="">
<input type="hidden" name=".last" value="">
<input type="hidden" name="promo" value="">
<input type="hidden" name=".intl" value="us">
<input type="hidden" name=".lang" value="en-US">
<input type="hidden" name=".bypass" value="">
<input type="hidden" name=".partner" value="">
<input type="hidden" name=".u" value="516l00ha52emg">
<input type="hidden" name=".v" value="0">
<input type="hidden" name=".challenge" value="HupRQ9x1ptIRHP1H8P9eYBbAlofE4YsoSQ--">
<input type="hidden" name=".yplus" value="">
<input type="hidden" name=".emailCode" value="">
<input type="hidden" name="pkg" value="">
<input type="hidden" name="stepid" value="">
<input type="hidden" name=".ev" value="">
<input type="hidden" name="hasMsgr" value="0">
<input type="hidden" name=".chkP" value="Y">
<input type="hidden" name=".done" value="http://hockey.fantasysports.yahoo.com/hockey/27381/startingrosters?.scrumb=0">
<input type="hidden" name=".pd" value="spt_ver=0&c=&ivt=&sg=">
<input type="hidden" name=".ws" id=".ws" value="0">
<input type="hidden" name=".cp" id=".cp" value="0">
<input type="hidden" name="nr" value="0">
<input type="hidden" name="pad" id="pad" value="6">
<input type="hidden" name="aad" id="aad" value="6">
<div id='inputs'>
<input name='login' id='username' type="text" maxlength='96' tabindex='1' aria-required="true" placeholder="Yahoo ID" autocorrect="off" value="">
<input name='passwd' id='passwd' type='password' maxlength='64' tabindex='2' aria-required="true" placeholder="Password" autocorrect="off" value="">
<div id="captchaDiv"></div>
</div>
My code:
Connection.Response loginForm = Jsoup.connect(url).method(Connection.Method.GET)
.execute();
Document doc = Jsoup.connect(url).data("sessionId", "IyLZEAs5n9RP").data(".tries", "1").data(".src", ".spt").data(".md5", "").data(".hash", "").data(".js", "").data(".last", "").data("promo", "").data(".intl", "us").data(".lang", "en-US").data(".bypass", "").data(".partner", "").data(".u", "515l00ha52emg").data(".v", "0").data(".challenge", "HupRQ9x1ptIRHP1H8P9eYBbAlofE4YsoSQ--").data(".yplus", "").data(".emailCode", "").data("pkg", "").data("stepid", "").data(".ev", "").data("hasMsgr", "0").data(".chkP", "Y").data(".done", "http://hockey.fantasysports.yahoo.com/hockey/27381/startingrosters?.scrumb=0").data(".pd", "spt_ver=0&c=&ivt=&sg=").data(".ws", "0").data(".cp", "0").data("nr", "0").data("pad", "6").data("aad", "6")
.data("login", "MYEMAIL").data("passwd", "MYPASSWORD")
.cookies(loginForm.cookies()).post();
System.out.println(doc.title());
When I run this, it still prints the title of the login page. Sorry for the bad one-line formatting, but there are a lot of parameters and their values aren't pertinent to the question. I put the final few parameters on new lines.
Looking at the parameters, I can see that values like "sessionId" will differ from session to session, and I can see that being a problem. Would that mean that I'd have to first save the value of that particular sessionId and pass that value to my Jsoup.connect?
Hardcoding the form parameters is not enough! Values such as sessionId and .challenge are generated for each session, so you must read the parameter values from the first page (the response to the GET) and then pass them in the following POST, along with the user name, password, and cookies.
Hope it helps.
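To make the advice above concrete, here is a minimal sketch of that flow. It assumes the login page still exposes a form with the id login_form and the login and passwd fields shown in the question. The class name LoginSketch and the helper readFormFields are made up for illustration and are not part of Jsoup. Note also that the form declares onsubmit="return hash2(this)"; Jsoup does not execute JavaScript, so if that script fills in fields such as .hash or .md5, this sketch may still not be enough for this particular site.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LoginSketch {

    // Hypothetical helper: collects the name/value pair of every named
    // <input> inside the form with the given id, in document order.
    static Map<String, String> readFormFields(Document page, String formId) {
        Element form = page.getElementById(formId);
        if (form == null) {
            throw new IllegalArgumentException("No form with id " + formId);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (Element input : form.select("input[name]")) {
            fields.put(input.attr("name"), input.val());
        }
        return fields;
    }

    static Document login(String loginPageUrl, String user, String password)
            throws IOException {
        // 1. GET the login page to obtain fresh cookies and hidden values.
        Connection.Response loginPage = Jsoup.connect(loginPageUrl)
                .method(Connection.Method.GET)
                .execute();
        Document loginDoc = loginPage.parse();

        // 2. Read the current form values instead of hardcoding them,
        //    then overwrite the two credential fields.
        Map<String, String> fields = readFormFields(loginDoc, "login_form");
        fields.put("login", user);
        fields.put("passwd", password);

        // 3. POST to the form's action URL with the same cookies.
        String action = loginDoc.getElementById("login_form").absUrl("action");
        return Jsoup.connect(action)
                .data(fields)
                .cookies(loginPage.cookies())
                .post();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java LoginSketch <loginPageUrl> <user> <password>
        Document doc = login(args[0], args[1], args[2]);
        System.out.println(doc.title());
    }
}
```

Reading the fields from the parsed page also avoids transcription slips when copying long values such as .u or .src by hand.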