
Extracting reviews from Amazon using Java jsoup library

Document doc = Jsoup.connect("https://www.amazon.com/gp/product/B01MXLQ5TM/").get();
String title = doc.title();
System.out.println("TITLE " + title);

// getElementById returns null when the page has no element with this id
Element reviews = doc.getElementById("reviewsMedley");
if (reviews != null) {
    System.out.println(" " + reviews.text());
}

Hey, I am working on data extraction using jsoup and extracting reviews from Amazon. This is my code; it gives me the reviews from the first page only. How can I change it to get the reviews from all pages?

Here is my simple implementation of an Amazon review crawler.

package com.mycompany.amazon.crawler;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AmazonCrawler {

    private static final Logger LOG = LogManager.getLogger(AmazonCrawler.class);

    public static void main(String[] args) throws IOException {

        List<Review> reviews = new ArrayList<>();
        int pageNumber = 1;

        while (true) {

            /*
            The page number appears twice in the review URL: once at the end of
            the ref segment and once in the pageNumber query parameter.
             */
            String url = "https://www.amazon.com/Dell-Inspiron-Touchscreen-Performance-Bluetooth/product-reviews/B01MXLQ5TM/ref=cm_cr_getr_d_paging_btm_" + pageNumber + "?reviewerType=all_reviews&pageNumber=" + pageNumber;

            LOG.info("Crawling URL: {}", url);

            Document doc = Jsoup.connect(url).get();
            Elements reviewElements = doc.select(".review");
            // select() never returns null; an empty result means we are past the last page
            if (reviewElements.isEmpty()) {
                break;
            }

            for (Element reviewElement : reviewElements) {

                Element titleElement = reviewElement.select(".review-title").first();
                if (titleElement == null) {
                    LOG.error("Title element is null");
                    continue;
                }
                String title = titleElement.text();

                Element textElement = reviewElement.select(".review-text").first();
                if (textElement == null) {
                    LOG.error("Text element is null");
                    continue;
                }
                String text = textElement.text();

                reviews.add(new Review(title, text));
            }

            pageNumber++;
        }

        LOG.info("Number of reviews: {}", reviews.size());

        for (Review review : reviews) {
            System.out.println(review.getTitle());
            System.out.println(review.getText());
            System.out.println("\n");
        }
    }

    static class Review {

        private final String title;
        private final String text;

        public Review(String title, String text) {
            this.title = title;
            this.text = text;
        }

        public String getTitle() {
            return title;
        }

        public String getText() {
            return text;
        }

    }

}
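
The stop condition in the loop above (keep requesting pages until one comes back with no reviews) can be separated from the jsoup calls. The following is only a sketch of that idea, not part of the crawler itself: `PagedCollector`, `collectAll`, `pageUrl`, and the `maxPages` cap are hypothetical names introduced here for illustration, and the cap is an added assumption so the loop cannot run forever if a site keeps returning non-empty pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper for illustration; it is not part of jsoup.
final class PagedCollector {

    private PagedCollector() {
    }

    /**
     * Calls fetchPage with 1, 2, 3, ... and concatenates the results.
     * Stops at the first null or empty page, or after maxPages pages.
     */
    static <T> List<T> collectAll(IntFunction<List<T>> fetchPage, int maxPages) {
        List<T> all = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            List<T> items = fetchPage.apply(page);
            if (items == null || items.isEmpty()) {
                break; // no more items: same stop condition as the crawler loop
            }
            all.addAll(items);
        }
        return all;
    }

    /** Appends the page number to baseUrl and to the pageNumber query parameter. */
    static String pageUrl(String baseUrl, int pageNumber) {
        return baseUrl + pageNumber + "?reviewerType=all_reviews&pageNumber=" + pageNumber;
    }
}
```

In the crawler, `fetchPage` would wrap the `Jsoup.connect(url).get()` and `doc.select(".review")` calls for one page. Since `IntFunction` cannot throw a checked `IOException`, that wrapper would have to catch it, for example by rethrowing it as `java.io.UncheckedIOException`.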

I know this is tagged jsoup, but wouldn't it be more reliable to simply use Amazon's API to retrieve this data?

http://docs.aws.amazon.com/AWSECommerceService/latest/DG/EX_RetrievingCustomerReviews.html
