Using jsoup to scrape data from a webpage in between specific tags

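I'm currently working on a program that collects the 5 most recent fanfiction stories added to my fandom on Ao3 (Archive of Our Own). These stories are then added to an ArrayList I have set up, which holds the fanfiction submitted over the past week. At the end of each week, I plan to dump the contents of the ArrayList to a text file so that I can paste it into a Reddit post for my subreddit. To prevent duplicates, I want to compare the newly parsed stories against the ones already in the ArrayList.

(Additional info: the bot will check the webpage every 30 minutes.)

The part I'm trying to understand is the actual parsing of the webpage, and getting the content from in between the HTML tags.

I've looked up CSS selectors, but I'm still very confused, since almost every example uses a website that seems easy to scrape (e.g. IMDb).

From some basic research, the stories all appear to sit inside the body I'm looking at, within the tags of a single ordered list.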

<ol class="work index group">
    <li class="work blurb group" id="work_10504812" role="article">...</li>
    <li class="work blurb group" id="work_9656693" role="article">...</li>
    <li class="work blurb group" id="work_11814486" role="article">...</li>
    //Goes on for ~20 more stories
    <li class="work blurb group" id="work_11687247" role="article">...</li>
</ol>

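So, to be clear, each list item is a single story inside the ordered list. Everything inside one list item looks like the following. (Ordered list tag added for context.)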

<ol class="work index group">
    <li class="work blurb group" id="work_10504812" role="article">
  <!--title, author, fandom-->
  <div class="header module">
    <h4 class="heading">
      <a href="/works/10504812">Pocket Healer</a>
      by

      <!-- do not cache -->
      <a rel="author" href="/users/OverNoot/pseuds/OverNoot">OverNoot</a> 
    </h4>
    <h5 class="fandoms heading">
      <span class="landmark">Fandoms:</span>
      <a class="tag" href="/tags/Overwatch%20(Video%20Game)/works">Overwatch (Video Game)</a>
      &nbsp;
    </h5>
    <!--required tags-->
    <ul class="required-tags">
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="rating-general-audience rating" title="General Audiences"><span class="text">General Audiences</span></span></a></li>
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="warning-no warnings" title="No Archive Warnings Apply"><span class="text">No Archive Warnings Apply</span></span></a></li>
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="category-femslash category" title="F/F"><span class="text">F/F</span></span></a></li>
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="complete-no iswip" title="Work in Progress"><span class="text">Work in Progress</span></span></a></li>
</ul>
    <p class="datetime">17 Aug 2017</p>
  </div>
  <!--warnings again, cast, freeform tags-->
  <h6 class="landmark heading">Tags</h6>
  <ul class="tags commas">
    <li class="warnings"><strong><a class="tag" href="/tags/No%20Archive%20Warnings%20Apply/works">No Archive Warnings Apply</a></strong></li><li class="relationships"><a class="tag" href="/tags/Fareeha%20%22Pharah%22%20Amari*s*Angela%20%22Mercy%22%20Ziegler/works">Fareeha "Pharah" Amari/Angela "Mercy" Ziegler</a></li><li class="characters"><a class="tag" href="/tags/Fareeha%20%22Pharah%22%20Amari/works">Fareeha "Pharah" Amari</a></li> <li class="characters"><a class="tag" href="/tags/Angela%20%22Mercy%22%20Ziegler/works">Angela "Mercy" Ziegler</a></li> <li class="characters"><a class="tag" href="/tags/Winston%20(Overwatch)/works">Winston (Overwatch)</a></li> <li class="characters"><a class="tag" href="/tags/Lena%20%22Tracer%22%20Oxton/works">Lena "Tracer" Oxton</a></li><li class="freeforms"><a class="tag" href="/tags/Tiny%20Pharah%20and%20Tiny%20Mercy/works">Tiny Pharah and Tiny Mercy</a></li> <li class="freeforms"><a class="tag" href="/tags/Fluff/works">Fluff</a></li> <li class="freeforms last"><a class="tag" href="/tags/Cute/works">Cute</a></li>
  </ul>
  <!--summary-->
    <h6 class="landmark heading">Summary</h6>
    <blockquote class="userstuff summary">
      <p>Angela and Fareeha wake up to find tiny alternate versions of themselves have appeared and are now imprinted on them. How will these tiny Pharahs and Mercies impact their work at Overwatch and more importantly how will it impact the feelings they have for each other.</p>
    </blockquote>
  <!--stats-->

  <dl class="stats">
      <dt class="language">Language:</dt>
      <dd class="language">English</dd>
    <dt class="words">Words:</dt>
    <dd class="words">35,143</dd>
    <dt class="chapters">Chapters:</dt>
    <dd class="chapters">10/11</dd>
    <dt class="comments">Comments:</dt>
    <dd class="comments"><a href="/works/10504812?show_comments=true&amp;view_full_work=true#comments">168</a></dd>
    <dt class="kudos">Kudos:</dt>
    <dd class="kudos"><a href="/works/10504812?view_full_work=true#comments">438</a></dd>
    <dt class="bookmarks">Bookmarks:</dt>
    <dd class="bookmarks"><a href="/works/10504812/bookmarks">35</a></dd>
    <dt class="hits">Hits:</dt>
    <dd class="hits">5890</dd>
  </dl>
</li>
</ol>

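Basically, I want to extract the title, author, URL, summary, and rating.

So far I've worked out where each item I want is located, but I don't know how to extract it.

Title: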

<a href="/works/10504812">Pocket Healer</a>

Author:

<a rel="author" href="/users/OverNoot/pseuds/OverNoot">OverNoot</a>

URL:

<li class="work blurb group" id="work_10504812" role="article">
<!--(http://archiveofourown.com/works/<the number after 'work_'>)-->

Summary:

<blockquote class="userstuff summary">
    <p> (SUMMARY GOES HERE) </p>
</blockquote>

Rating:

<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="rating-general-audience rating" title="General Audiences"><span class="text">General Audiences</span></span></a></li>

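Another question: is it possible to iterate over the contents of the ordered list, as with a for loop?

The code I currently have for opening the webpage is as follows.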

    while (true) {
        try {

            String url = "http://archiveofourown.org/tags/Fareeha%20%22Pharah%22%20Amari*s*Angela%20%22Mercy%22%20Ziegler/works";
            Document doc = Jsoup.connect(url).get();

            //Returns element of webpage
            doc.select("<Narrow down to ordered list>");

            //Run for loop to run through first 5 items of 
            Thread.sleep(THIRTY_MINUTES);

        }
        catch (Exception ex) {
            ex.printStackTrace();
        }

    }

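You can use the Document.select(String cssSelector) method, which returns an Elements collection that you can iterate over. For example, ol.work > li returns every li element that is a direct child of an ol.work element. You can use this to loop over all the stories.

Consider the following piece of code: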

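// Imports needed at the top of the file:
// import org.jsoup.nodes.Element;
// import org.jsoup.select.Elements;

// Each direct <li> child of the ol.work list is one story.
Elements works = doc.select("ol.work > li");

for (Element li : works) {
    // The first link in the h4 heading is the title; the rel="author" link is the author.
    String title = li.select("h4.heading a").first().text();
    String author = li.select("h4.heading a[rel=author]").text();
    // The li id looks like "work_10504812"; dropping the prefix leaves the numeric ID.
    String id = li.attr("id").replace("work_", "");
    String url = "http://archiveofourown.com/works/" + id;
    String summary = li.select("blockquote.summary").text();
    String rating = li.select("span.rating").text();

    System.out.println("Title: " + title);
    System.out.println("Author: " + author);
    System.out.println("ID: " + id);
    System.out.println("URL: " + url);
    System.out.println("Summary: " + summary);
    System.out.println("Rating: " + rating);
}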

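In this example, we loop over all the li elements and extract the desired content from each one. As you can see, every extraction calls select on the current li element, so the search is limited to that one story. The Element.text() method returns the element's content as plain text, with any tags inside it stripped.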
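The question also asks how to avoid adding the same story twice when the page is polled every 30 minutes. One possible approach, sketched below using only the standard library, is to key each story on the work ID extracted in the loop. The class name WeeklyStories and its methods are hypothetical names made up for this illustration, and "Placeholder title" is dummy data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tracker: remembers each story by its work ID so repeat polls add nothing twice. */
final class WeeklyStories {
    // LinkedHashMap keeps insertion order, so titles come back in the order they were first seen.
    private final Map<String, String> titlesById = new LinkedHashMap<>();

    /** Stores the story unless its ID is already known; returns true only when it was new. */
    public boolean addIfNew(String id, String title) {
        return titlesById.putIfAbsent(id, title) == null;
    }

    /** Returns a copy of the titles collected so far, oldest first. */
    public List<String> titles() {
        return new ArrayList<>(titlesById.values());
    }

    public static void main(String[] args) {
        WeeklyStories week = new WeeklyStories();
        // First poll: both IDs are new.
        System.out.println(week.addIfNew("10504812", "Pocket Healer"));    // true
        System.out.println(week.addIfNew("9656693", "Placeholder title")); // true
        // Next poll: the same work ID shows up again and is skipped.
        System.out.println(week.addIfNew("10504812", "Pocket Healer"));    // false
        System.out.println(week.titles()); // [Pocket Healer, Placeholder title]
    }
}
```

Inside the for loop, one would call addIfNew(id, title) instead of printing, and a simple counter could stop the loop after the first five list items if only the newest five are wanted.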

Running this code against the HTML from your question produces the following output:

Title: Pocket Healer
Author: OverNoot
ID: 10504812
URL: http://archiveofourown.com/works/10504812
Summary: Angela and Fareeha wake up to find tiny alternate versions of themselves have appeared and are now imprinted on them. How will these tiny Pharahs and Mercies impact their work at Overwatch and more importantly how will it impact the feelings they have for each other.
Rating: General Audiences
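For the weekly dump to a text file that the question describes, a minimal sketch with java.nio.file.Files might look like this. The class name WeeklyDump, the temporary file, and the Markdown bullet layout are all assumptions made for illustration; the question does not say how the Reddit post should be formatted.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that turns extracted fields into lines and writes them to a file. */
final class WeeklyDump {
    /** Formats one story as a Markdown bullet (assumed layout, adjust to taste). */
    public static String toLine(String title, String author, String url, String rating) {
        return "* [" + title + "](" + url + ") by " + author + " (" + rating + ")";
    }

    /** Writes one line per story, replacing whatever the file held before. */
    public static void dump(List<String> lines, Path file) throws IOException {
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> week = new ArrayList<>();
        week.add(toLine("Pocket Healer", "OverNoot",
                "http://archiveofourown.com/works/10504812", "General Audiences"));

        // A temp file keeps the sketch self-contained; a real bot would use a fixed path.
        Path out = Files.createTempFile("weekly-fics", ".txt");
        dump(week, out);
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    }
}
```

After a successful dump the in-memory list would be cleared so that the next week starts empty.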

Hope this helps.
