
Jsoup scraping text from children of div

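I am trying to use JSoup to extract the reviews of the product at the Moto X link, but it throws a NullPointerException. I would also like to get the full text that is shown after clicking a review's "Read more" link.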

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class JSoupEx
{
    public static void main(String[] args) throws IOException
    {
      Document doc = Jsoup.connect("https://www.flipkart.com/moto-x-play-with-turbo-charger-white-16-gb/product-reviews/itmefzwvdejejvth?pid=MOBEFM5HAFRNSJJA").get();
      Element ele = doc.select("div[class=qwjRop] > div").first();
      System.out.println(ele.text());
    }
}

Is there any way to do this?

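JSoup only parses HTML; it does not run JavaScript. The content you are looking for is added to the page by JavaScript, which Jsoup knows nothing about. As a result the selector matches nothing, first() returns null, and calling text() on it throws the NullPointerException.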

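You would normally need something like Selenium to get that content. For the specific site you are parsing, however, a quick look at its network activity shows that everything you are after is fetched from the backend through an API call. You can probably use that call directly, without parsing any HTML with Jsoup, and get at the content more easily.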

As gherkin suggested, the Network tab of the browser's developer tools shows a request that receives the reviews (in JSON format) as its response:

https://www.flipkart.com/api/3/product/reviews?productId=MOBEFM5HAFRNSJJA&count=15&ratings=ALL&reviewerType=ALL&sortOrder=MOST_HELPFUL&start=0
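The query string of the URL above suggests offset-based paging. The following is a minimal sketch of building the URL for a given page, assuming that `count` is the page size and `start` is the index of the first review; both meanings are inferred from the URL, not documented. The class `ReviewUrls` and the method `pageUrl` are hypothetical names used only for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; parameter meanings are inferred from the observed URL.
final class ReviewUrls {
    static final String BASE = "https://www.flipkart.com/api/3/product/reviews";

    /** Builds the URL for a zero-based page, assuming `start` is an item offset. */
    static String pageUrl(String productId, int count, int page) {
        if (count <= 0 || page < 0) {
            throw new IllegalArgumentException("count must be positive and page non-negative");
        }
        return BASE
                + "?productId=" + URLEncoder.encode(productId, StandardCharsets.UTF_8)
                + "&count=" + count
                + "&ratings=ALL&reviewerType=ALL&sortOrder=MOST_HELPFUL"
                + "&start=" + (page * count);
    }

    public static void main(String[] args) {
        // Print the URLs for the first two pages of 15 reviews each.
        for (int page = 0; page < 2; page++) {
            System.out.println(pageUrl("MOBEFM5HAFRNSJJA", 15, page));
        }
    }
}
```

With a page size of 15, page 0 maps to `start=0` and page 1 to `start=15`.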

Using a JSON parser such as JSON.simple, we can extract information like the review's author, helpfulness count, and text.

Example code

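// Required imports:
//   org.jsoup.Jsoup, org.jsoup.Connection.Response,
//   org.json.simple.JSONArray, org.json.simple.JSONObject, org.json.simple.parser.JSONParser
String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36";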
String reviewApiCall = "https://www.flipkart.com/api/3/product/reviews?productId=MOBEFM5HAFRNSJJA&count=15&ratings=ALL&reviewerType=ALL&sortOrder=MOST_HELPFUL&start=";
String xUserAgent = userAgent + " FKUA/website/41/website/Desktop";
String referer = "https://www.flipkart.com/moto-x-play-with-turbo-charger-white-16-gb/product-reviews/itmefzwvdejejvth?pid=MOBEFM5HAFRNSJJA";
String host = "www.flipkart.com";
int numberOfPages = 2; // first two pages of results will be fetched

try {
    // loop for multiple review pages
    for (int i = 0; i < numberOfPages; i++) {
        // query reviews
        Response response = Jsoup.connect(reviewApiCall+(i*15)).userAgent(userAgent).referrer(referer).timeout(5000)
                .header("x-user-agent", xUserAgent).header("host", host).ignoreContentType(true).execute();

        System.out.println("Response in JSON format:\n\t" + response.body() + "\n");

        // parse json response
        JSONObject jsonObject = (JSONObject) new JSONParser().parse(response.body());
        jsonObject = (JSONObject) jsonObject.get("RESPONSE");
        JSONArray jsonArray = (JSONArray) jsonObject.get("data");

        for (Object object : jsonArray) {
            jsonObject = (JSONObject) object;
            jsonObject = (JSONObject) jsonObject.get("value");
            System.out.println("Author: " + jsonObject.get("author") + "\thelpful: "
                    + jsonObject.get("helpfulCount") + "\n\t"
                    + jsonObject.get("text").toString().replace("\n", "\n\t") + "\n");
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
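Since the endpoint returns JSON, not HTML, the fetch itself does not have to go through Jsoup. Below is a rough sketch of the same request using the JDK's `java.net.http` client (Java 11 or later). The class `ReviewFetcher` and its methods are hypothetical names, and whether the API accepts a request with only these headers is an untested assumption.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch only: builds and optionally sends the review request with the JDK client.
final class ReviewFetcher {
    static final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36";

    /** Builds a GET request with the user-agent headers from the example code. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .header("User-Agent", USER_AGENT)
                .header("x-user-agent", USER_AGENT + " FKUA/website/41/website/Desktop")
                .GET()
                .build();
    }

    /** Sends the request and returns the raw response body (expected to be JSON). */
    static String fetchBody(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // Network call: only performed when a URL is passed on the command line.
            System.out.println(fetchBody(args[0]));
        } else {
            System.out.println("usage: java ReviewFetcher <review-api-url>");
        }
    }
}
```

The Host header is not set explicitly because the JDK client derives it from the URL and by default rejects attempts to override it. The Referer header is left out to keep the sketch portable across JDK versions, so the endpoint may or may not accept the request as is.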

Output

Response in JSON format:
    {"CACHE_INVALIDATION_TTL":"132568825671","REQUEST":null,"REQUEST-ID": [...] }

Author: Flipkart Customer   helpful: 140
    A great phone at an affordable price with
    -an outstanding camera
    -great battery life
    -an excellent display
    -premium looks
     the flipkart delivery was also fast and perfect.

Author: Vaibhav Yadav   helpful: 518
    I m writing this review after using 2 months..
    First of all ..I must say this is one of the best product ..camera quality is best in natural lights or daytime..but in low light and in the night..camera quality is not so good but it's ok..
    It has good battery backup ..last one day on 3g usage ..while using 4g ..it lasts for about 10-12 hour..
    Turbo charges is good..although ..my charger is not working..
    Only problem in this phone is ..while charging..this phone heats a lot..this may b becoz of turbo charger..if u r using other charger than it does not heat..

Author: KAPIL CHOPRA    helpful: 9
[...]

Note: the output is truncated ([...])
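The indentation of multi-line reviews in the output comes from replacing every newline in the review text with a newline followed by a tab. A small sketch isolating that step, with `ReviewFormat` and `format` as hypothetical names:

```java
// Hypothetical helper mirroring the println call in the example code.
final class ReviewFormat {
    /** Formats one review; continuation lines of the text are indented with a tab. */
    static String format(String author, long helpfulCount, String text) {
        return "Author: " + author + "\thelpful: " + helpfulCount + "\n\t"
                + text.replace("\n", "\n\t") + "\n";
    }

    public static void main(String[] args) {
        System.out.print(format("Flipkart Customer", 140,
                "A great phone at an affordable price with\n-an outstanding camera"));
    }
}
```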

