Connecting to product page URLs with Jsoup

I have a website from which I need to parse data: the results of a keyword search. However, not all fields are visible in the product previews. It seems that some fields (product color, description, old prices) can only be scraped from each product page. The URL of a product page looks like this: https://www.aboutyou.de/p/new-look/basecap-in-satin-optik-3649077. I do not know how to construct it generically, so that I would not have to look up each product by hand. I can find the brand and name of the product, but I do not know how to build the URL from them: set all letters to lowercase and put dashes between the words? I can get the brand name and product name in this form: NEW LOOK Basecap in Satin-Optik.

So how can I determine the URL for each product?

Here is the code I have so far:

String url = "https://www.aboutyou.de/frauen/accessoires/huete-und-muetzen/caps";
Document doc = Jsoup.connect(url).get();

System.out.println("Title: " + doc.title());

String mainPath = "section.layout_11glwo1-o_O-stretchLayout_1jug6qr > " +
        "div.content_1jug6qr > " +
        "div.container > " +
        "div.mainContent_10ejhcu > " +
        "div.productStream_6k751k > " +
        "div > " +
        "div.wrapper_8yay2a > " +
        "div.col-sm-6.col-md-4 > " +
        "div.wrapper_1eu800j > " +
        "div > " +
        "div.categoryTileWrapper_e296pg";

String searchPath = mainPath + " > a.anchor_wgmchy > " +
        "div.details_197iil9 > " +
        "div.meta_1ihynio";
String linksPath = mainPath + " > a.anchor_wgmchy";
String brandPath = mainPath + " > a.anchor_wgmchy > " +
        "div.details_197iil9 > " +
        "div.meta_1ihynio > " +
        "div.description_ya0ltb > " +
        "strong.brand_ke66rm";

Elements result = doc.body().select("main#app");
for(Element element : result) {
    Elements products = element.select(searchPath);
    Elements links = element.select(linksPath);

    Elements brands = element.select(brandPath);
    for(Element product : products){
      System.out.println(product.text());
    }

    for(Element link : links){
        String linkHref = link.attr("href");
        // The product id is the part after the last dash in the href.
        String[] linksText = linkHref.split("-");
        String id = linksText[linksText.length-1];
        System.out.println("id: " + id);
        System.out.print("link attr:" + linkHref + ", ");
    }
    System.out.print("\nbrands" + brands.text());
}

Maybe there are libraries for that? I would be grateful for any advice!

Most of the required details can be read from the divs that look like this:

<div class="details_..." ...>

Grabbing the text for these divs would give you something like:

-10%9,90€ -10 % EXTRA8,90€ NEW LOOK Basecap in Satin-Optik 8,01€
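
The flattened text above mixes discount labels, old prices, brand, name, and final price. If only that text is available, the prices could be pulled out with a regular expression. A minimal sketch, assuming every price has the form digits, decimal comma, two digits, euro sign; the names PriceText and extractPrices are made up for illustration, and amounts with thousands separators are not handled:

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PriceText {
    // Assumed price format: digits, decimal comma, two digits, euro sign (\u20AC).
    private static final Pattern PRICE = Pattern.compile("(\\d+),(\\d{2})\\s*\\u20AC");

    /** Returns every price found in the text, in order of appearance. */
    static List<BigDecimal> extractPrices(String text) {
        List<BigDecimal> prices = new ArrayList<>();
        Matcher m = PRICE.matcher(text);
        while (m.find()) {
            // Swap the decimal comma for a dot so BigDecimal can parse it.
            prices.add(new BigDecimal(m.group(1) + "." + m.group(2)));
        }
        return prices;
    }

    public static void main(String[] args) {
        // Sample line from the listing page; \u20AC is the euro sign.
        String text = "-10%9,90\u20AC -10 % EXTRA8,90\u20AC NEW LOOK Basecap in Satin-Optik 8,01\u20AC";
        System.out.println(extractPrices(text));
    }
}
```

For the sample line this yields 9.90, 8.90 and 8.01, with the final price last. Selecting the individual price elements with Jsoup is more robust whenever the markup is available.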

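You do not need to build the product URL yourself: the anchor inside each product tile already carries it in its href attribute. Example code that separates out some of these details and makes a sub-request to each product page for the color details: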

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ProductScraper {
    public static void main(String[] args) {
        String url = "https://www.aboutyou.de/frauen/accessoires/huete-und-muetzen/caps";
        String userAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36";

        try {
            Document doc = Jsoup.connect(url).userAgent(userAgent).get();
            // Class names end in generated suffixes, so match on the stable prefix only.
            Elements elements = doc.select("div[class^='categoryTileWrapper_']");

            for (Element element : elements) {
                String brand = element.select("strong[class^='brand_']").first().text();
                String name = element.select("p[class^='name_']").first().text();
                System.out.println(brand + " - " + name);

                // The tile's anchor already holds the product URL; absUrl resolves it against the page URL.
                String href = element.select("a[class^='anchor_']").first().absUrl("href");
                // Sub-request: the color details are only available on the product page.
                Document subDoc = Jsoup.connect(href).userAgent(userAgent).get();
                String color = subDoc.select("div[class^='attributeWrapper_']").first().text();
                System.out.println("\t" + href);
                System.out.println("\t" + color);

                String finalPrice = element.select("div[class^='finalPrice_']").first().text();

                // Old prices are listed in a <ul>, which is absent for products without a discount.
                if (element.select("ul").size() > 0) {
                    for (Element listItem : element.select("ul").first().select("li")) {
                        System.out.println("\tprice was: " + listItem.select("span[class^='price_']").first().text());
                    }
                }
                System.out.println("\tfinal price: " + finalPrice);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:

NEW LOOK - Basecap in Satin-Optik
    https://www.aboutyou.de/p/new-look/basecap-in-satin-optik-3649077
    Textil Unifarben
    price was: 9,90€
    price was: 8,90€
    final price: 8,01€
WOOD WOOD - Weiche 'Baseball cap'
    https://www.aboutyou.de/p/wood-wood/weiche-baseball-cap-3687779
    Logoprint
    price was: 39,90€
    price was: 29,90€
    final price: 20,93€
[... truncated]
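
On the original question of building the URL from brand and name: the two URLs in the output suggest the path is a lowercased, dash-joined form of the brand and name followed by a numeric id. The id cannot be derived from the visible text, which is why reading the href is the safer route. A rough sketch of the inferred pattern, based only on those two examples; ProductUrls, slug, productId and buildUrl are hypothetical names, and non-ASCII letters such as umlauts are not handled:

```java
import java.util.Locale;

final class ProductUrls {
    /** Hypothetical helper: lower-cases, drops quotes, joins words with single dashes. */
    static String slug(String text) {
        return text.toLowerCase(Locale.ROOT)
                .replaceAll("['\"]", "")          // drop quote characters
                .replaceAll("[^a-z0-9]+", "-")    // any other run of characters becomes one dash
                .replaceAll("^-+|-+$", "");       // trim leading and trailing dashes
    }

    /** Returns the numeric id after the last dash of the href, or null if there is none. */
    static String productId(String href) {
        int dash = href.lastIndexOf('-');
        if (dash < 0 || dash == href.length() - 1) {
            return null;
        }
        String tail = href.substring(dash + 1);
        return tail.chars().allMatch(Character::isDigit) ? tail : null;
    }

    /** Assumed URL pattern, inferred from two sample URLs only. */
    static String buildUrl(String brand, String name, String id) {
        return "https://www.aboutyou.de/p/" + slug(brand) + "/" + slug(name) + "-" + id;
    }

    public static void main(String[] args) {
        String href = "https://www.aboutyou.de/p/new-look/basecap-in-satin-optik-3649077";
        String id = productId(href);
        System.out.println(id);
        System.out.println(buildUrl("NEW LOOK", "Basecap in Satin-Optik", id));
    }
}
```

Since the id still has to come from the listing page, the slug logic adds little in practice; it mainly shows that the href already contains everything needed.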
