
JSoup: Extract title from img class

I am building a web scraper with JSoup. I am trying to extract the title from the img element in the HTML below.

<div id="insideScroll" class="grid slider desktop-view">
    <ul class="ng-scope" ng-if="2 === selectedCategoryId">
      <li class="" data-list-item="">
          <span>
              <a class="grid-col--subnav ng-isolate-scope" data-internal-referrer-link="hub nav" data-link-name="hub nav daughter" data-click-id="hub nav 2" href="/recipes/111/appetizers-and-snacks/beans-and-peas/?internalSource=hub nav&referringId=76&referringContentType=recipe hub&linkName=hub nav daughter&clickId=hub nav 2" target="_self">
                  <img class="" alt="Bean and Pea Appetizers" title="Bean and Pea Appetizers" src="http://images.media-allrecipes.com/userphotos/140x140/00/60/91/609167.jpg">
                  <span class="category-title">Bean and Pea Appetizers</span>
              </a>
          </span>
      </li>
    </ul>
</div>

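Here is the function I have, but it does not seem to work. I get a NullPointerException when I run it, and judging from the stack trace I assume it is caused by the img class having no name. I could also extract the title from the span class, but I am having trouble getting the text from that as well. Thanks for your help.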

@Override
public ArrayList<String> parseDocForTitles(Document doc) {
    ArrayList<String> titles = new ArrayList<>();
    String title;

    Element insideScroll = doc.getElementById("insideScroll");
    Elements img = insideScroll.select("img.\"\"");

    for(Element ttle : img){
        title = ttle.attr("title");
        out.println(title); //just for testing
        titles.add(title);
    }

    return titles;
}

Here is the stack trace I receive:

[-]ERROR: See Stack Trace
java.lang.NullPointerException
    at Scraper.Appetizers.parseDocForTitles(Appetizers.java:35)
    at Scraper.Driver.main(Driver.java:25)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

This works for me:

Document document;
try {
    // Fetch and parse the HTML from the given URL (yourURL is a placeholder).
    document = Jsoup.connect(yourURL).get();
    // Select img elements whose src ends in a common image extension.
    Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
    // Iterate over the images and print each title attribute.
    for (Element image : images) {
        System.out.println("Image Title: " + image.attr("title"));
    }
} catch (IOException e) {
    e.printStackTrace();
}
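The same selector can be tried without a network call by parsing a string. The sketch below assumes jsoup is on the classpath; the class name ImageTitleDemo, the constant SAMPLE_HTML, and the second img (a made-up tracking pixel with no image extension) are hypothetical and only there to show what the regex filter keeps and what it skips.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ImageTitleDemo {
    // Trimmed from the question's HTML, plus one hypothetical img whose src
    // has no image extension.
    static final String SAMPLE_HTML =
          "<div id=\"insideScroll\">"
        + "<img class=\"\" alt=\"Bean and Pea Appetizers\" title=\"Bean and Pea Appetizers\" "
        + "src=\"http://images.media-allrecipes.com/userphotos/140x140/00/60/91/609167.jpg\">"
        + "<img title=\"Not a photo\" src=\"/tracking/pixel\">"
        + "</div>";

    // Returns the title attribute of every img whose src matches the
    // case-insensitive extension pattern.
    static List<String> imageTitles(String html) {
        Document document = Jsoup.parse(html);
        List<String> titles = new ArrayList<>();
        for (Element image : document.select("img[src~=(?i)\\.(png|jpe?g|gif)]")) {
            titles.add(image.attr("title"));
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(imageTitles(SAMPLE_HTML));
    }
}
```

With this sample only the first img is selected, because the src of the second one contains no .png, .jpg, .jpeg, or .gif.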

You just need to select the img elements correctly.

Change this:

Elements img = insideScroll.select("img.\"\"");

to this:

Elements img = insideScroll.select("img");

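It should work. The class attribute on the img is empty, so there is no class name to select on; a plain img selector matches the img elements inside insideScroll.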
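A minimal sketch of the corrected method, run against a trimmed copy of the question's markup. It assumes jsoup is on the classpath; the class name TitleExtractor, the constant SAMPLE_HTML, and the method names are hypothetical. The null check is an addition that the answer does not mention: getElementById returns null when no element has that id, which is one possible source of a NullPointerException if the fetched page does not actually contain insideScroll. The second method covers the span alternative raised in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TitleExtractor {
    // Trimmed copy of the question's HTML (assumption: same structure as the real page).
    static final String SAMPLE_HTML =
          "<div id=\"insideScroll\" class=\"grid slider desktop-view\"><ul><li><span><a href=\"#\">"
        + "<img class=\"\" alt=\"Bean and Pea Appetizers\" title=\"Bean and Pea Appetizers\" "
        + "src=\"http://images.media-allrecipes.com/userphotos/140x140/00/60/91/609167.jpg\">"
        + "<span class=\"category-title\">Bean and Pea Appetizers</span>"
        + "</a></span></li></ul></div>";

    // Reads the title attribute of each img inside #insideScroll.
    static List<String> titlesFromImages(Document doc) {
        List<String> titles = new ArrayList<>();
        Element insideScroll = doc.getElementById("insideScroll");
        if (insideScroll == null) {
            return titles; // container not present in this document
        }
        // img[title] skips images that carry no title attribute at all.
        for (Element img : insideScroll.select("img[title]")) {
            titles.add(img.attr("title"));
        }
        return titles;
    }

    // Alternative: reads the visible text of each span.category-title.
    static List<String> titlesFromSpans(Document doc) {
        List<String> titles = new ArrayList<>();
        for (Element span : doc.select("#insideScroll span.category-title")) {
            titles.add(span.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        System.out.println(titlesFromImages(doc));
        System.out.println(titlesFromSpans(doc));
    }
}
```

Both methods return an empty list instead of throwing when the container is absent, which makes it easier to tell a selector problem from a page that was fetched without the expected markup.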

