
Using Jsoup to sign in and crawl data

I want to use Jsoup to crawl a page that is only available when I am signed in. I guess this means I need to sign in on one page and send the cookies to another page.
I read some earlier posts here and wrote the following code:

public static void main(String[] args) throws IOException {
    // Jsoup.connect needs an absolute URL, including the scheme.
    Connection.Response res = Jsoup.connect("https://login.yahoo.com")
        .data("login", "myusername", "passwd", "mypassword")
        .method(Method.POST)
        .execute();

    Document doc = res.parse();
    String sessionId = res.cookie("SESSIONID");

    Document doc2 = Jsoup.connect("http://health.groups.yahoo.com/group/asthma/messages")
        .cookie("SESSIONID", sessionId)
        .get();

    Elements eles = doc2.getElementsByClass("message");

    String content = eles.first().text();

    System.out.println(content);
}

My question is: how can I know which cookie name (i.e. "SESSIONID") to use here when sending my login info? I used the .cookies() method to get all the cookies from the login page:

B
DK
YM
T
PH
Y
F

I tried them one by one, but none worked. I could get a sessionId value from some of them, but I could not get any nodes from the second page, which means I did not sign in successfully. Could anyone give me some suggestions? Many thanks!

I've also struggled with logging in to websites with Jsoup.

What I came up with was a hybrid of Selenium WebDriver and Jsoup.

WebDriver can remote-control a browser; it is typically used for testing.

For my application, it was not desirable to have the browser visible and cluttering the screen, so I used the headless driver, HtmlUnitDriver, instead. You can instantiate it with this line of code:

HtmlUnitDriver driver = new HtmlUnitDriver(true); // true enables JavaScript support (using Rhino, I believe)

Now, to log in to a website, I use:

String baseUrl = "http://www.thesite.com";

driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);

driver.get(baseUrl);

driver.findElement(By.id("TextBoxUser")).clear();
driver.findElement(By.id("TextBoxUser")).sendKeys("username");
driver.findElement(By.id("TextBoxPass")).clear();
driver.findElement(By.id("TextBoxPass")).sendKeys("password");
driver.findElement(By.id("Button1")).click();

Get the page content:

String htmlContent = driver.getPageSource();

Start using Jsoup:

Document document = Jsoup.parse(htmlContent);

This has worked great for me.

Steffn Otto Jensen

Have you tried something like this?

Connection.Response res = Jsoup.connect("https://login.yahoo.com/config/login?")
    .data("login", "myusername", "passwd", "mypassword")
    .method(Method.POST)
    .execute();

Map<String, String> cookies = res.cookies();

Connection connection = Jsoup.connect("http://health.groups.yahoo.com/group/asthma/messages");

// Forward every cookie from the login response to the next request.
for (Map.Entry<String, String> cookie : cookies.entrySet()) {
    connection.cookie(cookie.getKey(), cookie.getValue());
}

Document doc = connection.get();
// Select the elements you need, for example:
// Element e = doc.select(".ygrp-grdescr").first();
// System.out.println(e.text()); // Prints: This list will be for asthmatics, and anyone whose life is affected by it. Discussions include causes, problems, and treatment

I hope this works for your problem.
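
On the question of which cookie name to use: the names are simply whatever the server sends in its Set-Cookie response headers, so forwarding all of them, as the loop above does, avoids having to guess. As a rough sketch of that idea using only the JDK rather than Jsoup, the hypothetical helper below extracts the name/value pairs and joins them into a single Cookie header. The class name, method names, and header values are all made up for illustration.

```java
import java.net.HttpCookie;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Jsoup: shows where cookie names come from.
public class CookieForwarding {

    // Collects name/value pairs from raw Set-Cookie header values,
    // ignoring attributes such as path and domain.
    public static Map<String, String> parseSetCookie(List<String> setCookieHeaders) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String header : setCookieHeaders) {
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                cookies.put(cookie.getName(), cookie.getValue());
            }
        }
        return cookies;
    }

    // Joins all cookies into the single Cookie request header a client sends back.
    public static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Made-up header values; a real login response carries its own.
        List<String> headers = Arrays.asList(
                "B=abc123; path=/; domain=.yahoo.com",
                "Y=xyz789; path=/; domain=.yahoo.com");
        Map<String, String> cookies = parseSetCookie(headers);
        System.out.println(cookies.keySet());
        System.out.println(toCookieHeader(cookies));
    }
}
```

With Jsoup you do not need this bookkeeping, since res.cookies() already returns the name/value map; the sketch only makes visible what is sent back on the second request.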


 