
Connect to product page URLs with Jsoup

I have a website from which I need to parse data: I need the results of a keyword search. However, not all fields are visible in the product previews. It seems that some fields (product color, description, old prices) can only be scraped from each product page. The URL of a product page looks like this: https://www.aboutyou.de/p/new-look/basecap-in-satin-optik-3649077. I do not know how to build it in a generic way, so that I would not have to go through each product by hand. I can find out the name and brand of the product, but I do not know how to build the URL: set all letters to lowercase and put dashes between the words? I can get the brand name and product name in this form: NEW LOOK Basecap in Satin-Optik.

So how can I determine the URL for each product?

Here is the code I have so far:

String url = "https://www.aboutyou.de/frauen/accessoires/huete-und-muetzen/caps";
Document doc = Jsoup.connect(url).get();

System.out.println("Title: " + doc.title());

String mainPath = "section.layout_11glwo1-o_O-stretchLayout_1jug6qr > " +
        "div.content_1jug6qr > " +
        "div.container > " +
        "div.mainContent_10ejhcu > " +
        "div.productStream_6k751k > " +
        "div > " +
        "div.wrapper_8yay2a > " +
        "div.col-sm-6.col-md-4 > " +
        "div.wrapper_1eu800j > " +
        "div > " +
        "div.categoryTileWrapper_e296pg";

String searchPath = mainPath + " > a.anchor_wgmchy > " +
        "div.details_197iil9 > " +
        "div.meta_1ihynio";
String linksPath = mainPath + " > a.anchor_wgmchy";
String brandPath = mainPath + " > a.anchor_wgmchy > " +
        "div.details_197iil9 > " +
        "div.meta_1ihynio > " +
        "div.description_ya0ltb > " +
        "strong.brand_ke66rm";

Elements result = doc.body().select("main#app");
for(Element element : result) {
    Elements products = element.select(searchPath);
    Elements links = element.select(linksPath);

    Elements brands = element.select(brandPath);
    for(Element product : products){
        System.out.println(product.text());
    }

    for(Element link : links){
        String linkHref = link.attr("href");
        // The product id is the last dash-separated token of the href.
        String[] linksText = linkHref.split("-");
        String id = linksText[linksText.length-1];
        System.out.println("id: " + id);
        System.out.print("link attr:" + linkHref + ", ");
    }
    System.out.print("\nbrands" + brands.text());
}

Maybe there are some libraries for that? I would be grateful for any advice!

Most of the required details can be grabbed from the divs that look like this:

<div class="details_..." ...>

Grabbing the text of these divs gives you something like:

-10%9,90€ -10 % EXTRA8,90€ NEW LOOK Basecap in Satin-Optik 8,01€
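If only this combined text is at hand, the individual prices can still be pulled out of it with a regular expression. The following is a minimal sketch using only the Java standard library; the class name `PriceTextParser` and the method `extractPrices` are made up for illustration, and the pattern assumes amounts are written as digits, a comma, two decimals and a euro sign (thousands separators are not handled).

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceTextParser {

    // Assumed price format: "<digits>,<two decimals>€", e.g. "9,90€".
    // \u20AC is the euro sign, escaped so the source compiles with any file encoding.
    private static final Pattern PRICE = Pattern.compile("(\\d+),(\\d{2})\\s*\u20AC");

    /** Returns every euro amount found in the text, in order of appearance. */
    public static List<BigDecimal> extractPrices(String text) {
        List<BigDecimal> prices = new ArrayList<>();
        Matcher m = PRICE.matcher(text);
        while (m.find()) {
            // Swap the decimal comma for a dot so BigDecimal can parse the amount.
            prices.add(new BigDecimal(m.group(1) + "." + m.group(2)));
        }
        return prices;
    }

    public static void main(String[] args) {
        String text = "-10%9,90\u20AC -10 % EXTRA8,90\u20AC NEW LOOK Basecap in Satin-Optik 8,01\u20AC";
        System.out.println(extractPrices(text)); // prints [9.90, 8.90, 8.01]
    }
}
```

The percentages are skipped because they are not followed by a comma and two decimals. Splitting the text this way is fragile, though, so selecting the individual child elements is usually the more robust route.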

Example code that separates some of the details and makes a sub-request to the product page for the color details:

String url = "https://www.aboutyou.de/frauen/accessoires/huete-und-muetzen/caps";
String userAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36";

try {
    Document doc = Jsoup.connect(url).userAgent(userAgent).get();
    // The class names appear to carry a generated suffix, so match on the prefix only.
    Elements elements = doc.select("div[class^='categoryTileWrapper_']");

    for (Element element : elements) {

        String brand = element.select("strong[class^='brand_']").first().text();
        String name = element.select("p[class^='name_']").first().text();
        System.out.println(brand + " - " + name);

        // The product link is already in the listing; absUrl() resolves it to an absolute URL.
        String href = element.select("a[class^='anchor_']").first().absUrl("href");
        // Sub-request to the product page for the attribute text (the color details).
        Document subDoc = Jsoup.connect(href).userAgent(userAgent).get();
        String color = subDoc.select("div[class^='attributeWrapper_']").first().text();
        System.out.println("\t" + href);
        System.out.println("\t" + color);

        String finalPrice = element.select("div[class^='finalPrice_']").first().text();

        // Earlier prices, if any, are listed in a <ul>.
        if (!element.select("ul").isEmpty()) {
            for (Element listItem : element.select("ul").first().select("li")) {
                System.out.println("\tprice was: " + listItem.select("span[class^='price_']").first().text());
            }
        }
        System.out.println("\tfinal price: " + finalPrice);
    }
} catch (IOException e) {
    e.printStackTrace();
}

Output:

NEW LOOK - Basecap in Satin-Optik
    https://www.aboutyou.de/p/new-look/basecap-in-satin-optik-3649077
    Textil Unifarben
    price was: 9,90€
    price was: 8,90€
    final price: 8,01€
WOOD WOOD - Weiche 'Baseball cap'
    https://www.aboutyou.de/p/wood-wood/weiche-baseball-cap-3687779
    Logoprint
    price was: 39,90€
    price was: 29,90€
    final price: 20,93€
[... truncated]
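On the original question of building the URL from brand and name: the two links in the output suggest the path has the shape /p/<brand>/<name>-<id>, where brand and name are lowercased and dash-separated. The trailing numeric id cannot be derived from the visible text, which is why reading the href from the listing is the safer route. The sketch below only illustrates that observation; the class `ProductUrlParts`, its methods, and the slug rule itself are assumptions inferred from those two example URLs, not a documented scheme of the site, and characters such as umlauts are not handled.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProductUrlParts {

    // Assumed URL shape (from the example links): /p/<brand-slug>/<name-slug>-<numeric id>
    private static final Pattern PRODUCT_PATH = Pattern.compile("/p/([^/]+)/(.+)-(\\d+)$");

    /** Returns {brandSlug, nameSlug, id}, or null if the URL does not have the assumed shape. */
    public static String[] parse(String url) {
        Matcher m = PRODUCT_PATH.matcher(url);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    /** Hypothetical slug rule: lowercase, drop quotes, turn runs of other characters into one dash. */
    public static String slugify(String text) {
        return text.toLowerCase(Locale.ROOT)
                .replaceAll("['\"]", "")
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("^-+|-+$", "");
    }

    public static void main(String[] args) {
        String[] parts = parse("https://www.aboutyou.de/p/new-look/basecap-in-satin-optik-3649077");
        System.out.println("brand: " + parts[0] + ", name: " + parts[1] + ", id: " + parts[2]);

        // The slugs can be reproduced from the visible text, but the id cannot.
        System.out.println(slugify("NEW LOOK") + "/" + slugify("Basecap in Satin-Optik"));
    }
}
```

Compared with splitting the href on every dash, anchoring the pattern at the end of the path keeps the name slug intact while still isolating the id.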

