Jsoup HTML parsing: items under span

I'm trying to parse items using Jsoup in Java. When I open the source code of the website I'm trying to read, I see this:

    <ul class="myptab typ3">
        <li><span class="active"><a href="#;" id="ONE_TO_ONE">1+1</a></span></li>
        <li><span><a href="#;" id="TWO_TO_ONE">2+1</a></span></li>
    </ul>


    <h5 class="invisible" >1+1 itemlist</h5>
    <div class="tblwrap mt50">
        <ul class="prod_list">
        </ul>
        <div class="paging">
        </div>
    </div>

    <h5 class="invisible">2+1 itemlist</h5>
    <div class="tblwrap mt50">
        <ul class="prod_list">

        </ul>
        <div class="paging">
        </div>
    </div>

In this source code, both the 1+1 section and the 2+1 section are listed, but their ul.prod_list elements are empty. When I use the browser's Inspect tool on the 1+1 section, however, the items are there:

    <h5 class="invisible">1+1 itemlist</h5>
    <div class="tblwrap mt50">
      <ul class="prod_list">
        <li>
          <div class="prod_box">
            <p class="img"></p>
            <p class="title">mangomilk_pet_300ML</p>
          </div>
        </li>
        <li>...</li>
        <li>...</li>
        <li>...</li>
      </ul>
    </div>

So the items only show up in the inspector. I'd like to select p.title and p.img from these items, which are missing from the page source.

It looks like the content is built dynamically by JavaScript. In this case Jsoup alone is not enough: it parses the HTML the server returns and does not execute scripts. You can try jBrowserDriver, which is able to retrieve the completed DOM (including the part rendered by JavaScript).

Example code:

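    import com.machinepublishers.jbrowserdriver.JBrowserDriver;
    import com.machinepublishers.jbrowserdriver.Settings;
    import com.machinepublishers.jbrowserdriver.Timezone;
    import com.machinepublishers.jbrowserdriver.UserAgent;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    import java.util.List;
    import java.util.stream.Collectors;

    // Represents a single result item
    public class ProductBox {
        private final String image;
        private final String title;

        public ProductBox(String image, String title) {
            this.image = image;
            this.title = title;
        }

        public String getImage() { return image; }
        public String getTitle() { return title; }
    }


    // Method responsible for parsing a page (place it in a class of your own)
    public void processPage(String url) {
        JBrowserDriver driver = new JBrowserDriver(Settings
                .builder()
                .timezone(Timezone.AMERICA_NEWYORK)
                .userAgent(UserAgent.CHROME)
                .build());

        // Load the page in the headless browser so its scripts run,
        // then take the rendered DOM as a string.
        String pageSource;
        try {
            driver.get(url);
            pageSource = driver.getPageSource();
        } finally {
            // Always release the browser, even if loading fails
            driver.quit();
        }

        // From here on it is plain Jsoup: parse the rendered HTML and select the items
        Document doc = Jsoup.parse(pageSource);
        Elements prodBoxes = doc.select("ul.prod_list div.prod_box");
        List<ProductBox> products = prodBoxes.stream()
                .map(e -> new ProductBox(e.select("p.img").text(), e.select("p.title").text()))
                .collect(Collectors.toList());

        products.forEach(e -> System.out.printf("%s - %s\n", e.getImage(), e.getTitle()));
    }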
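
To see why the same selector comes back empty on the raw source yet matches once the DOM is rendered, here is a minimal sketch that runs Jsoup over two hard-coded strings. The strings are trimmed copies of the markup from the question, and the class name SelectorCheck and the helper titles are made up for this illustration; they are not part of Jsoup or jBrowserDriver.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only for this illustration.
public class SelectorCheck {
    // Markup as served: the list stays empty until a script fills it in.
    static final String RAW_SOURCE =
            "<div class=\"tblwrap mt50\"><ul class=\"prod_list\"></ul></div>";

    // Markup as shown by the inspector after the script has run.
    static final String RENDERED_DOM =
            "<div class=\"tblwrap mt50\"><ul class=\"prod_list\">"
            + "<li><div class=\"prod_box\">"
            + "<p class=\"img\"></p>"
            + "<p class=\"title\">mangomilk_pet_300ML</p>"
            + "</div></li></ul></div>";

    // Returns the text of p.title for every product box found in the HTML.
    static List<String> titles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element box : doc.select("ul.prod_list div.prod_box")) {
            result.add(box.select("p.title").text());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(titles(RAW_SOURCE));   // prints []
        System.out.println(titles(RENDERED_DOM)); // prints [mangomilk_pet_300ML]
    }
}
```

The selector is identical in both calls; only the input differs. That is why the fix belongs in how the HTML is fetched (a browser that runs the scripts) rather than in the Jsoup query.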
